# Section 4: Feature Extraction

The extracted attributes cannot always be used directly in a machine learning model. Of the attributes presented in the previous section, for example, only the numeric ones (whose values are real or integer) can be fed to a model as they are. Lists of words, such as the libraries and system calls used by a program, must go through one more step called feature extraction, which transforms these attributes into something the classifier can read. This step usually produces the same number of features for every sample, since one program may use more libraries than another. In this section we introduce how to extract features from textual attributes (using TF-IDF, Word2Vec and Encoders) and how to normalize them.

## Methods

We present two textual feature extraction methods that are widely used in the literature: TF-IDF and Word2Vec. Given a text representing an attribute, each method outputs a list of numbers that represents this text in a different way. Note that the lists of words extracted in the previous section are first converted into texts in which the words are separated by spaces.
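As a minimal sketch of that conversion, assuming the extracted attribute is a Python list of DLL names (the sample names below are made up for illustration), the words can be joined into a single space-separated text:

```python
# Hypothetical list of imported DLLs extracted from one program
imported_dlls = ["kernel32.dll", "user32.dll", "advapi32.dll"]

# Join the words into one text, separating them with a single space
text = " ".join(imported_dlls)

print(text)
```

Splitting the text on whitespace recovers the original list, so no information is lost as long as the words themselves contain no spaces.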

### TF-IDF

Given a vocabulary built from a set of documents, i.e., all the words that appear in these documents, each document ![i](imgs/tfidf1.png) is represented by a vector ![vec](imgs/tfidf2.png), where ![wij](imgs/tfidf3.png) is the TF-IDF (Term Frequency - Inverse Document Frequency) of the j-th word of the vocabulary. TF-IDF is a statistical measure used to evaluate how important a word is to a document in relation to a collection of documents [Manning et al. 2008a]. This measure is obtained by multiplying two terms:

* **Term Frequency (TF):** measures how often a word ![t](imgs/tfidf4.png) occurs in a document, through the equation ![tf](imgs/tfidf5.png).

* **Inverse Document Frequency (IDF):** measures how important a word ![t](imgs/tfidf4.png) is across the collection, through the equation ![idf](imgs/tfidf6.png).

![eq1](imgs/tfidf1.png)
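The two terms can be combined by hand as in the sketch below. It assumes one common formulation: TF is the count of a word in a document divided by the number of words in that document, and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word. The equations in the figures above, and scikit-learn's `TfidfVectorizer`, may use a smoothed or normalized variant, so exact values can differ. The function names and toy documents are illustrative only.

```python
import math
from collections import Counter

def term_frequency(word, document):
    """Fraction of the words in `document` that are `word`."""
    words = document.split()
    return Counter(words)[word] / len(words)

def inverse_document_frequency(word, documents):
    """log(N / df): words found in fewer documents score higher."""
    containing = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / containing)

def tfidf_vector(document, documents, vocabulary):
    """One TF-IDF weight per vocabulary word, in vocabulary order."""
    return [
        term_frequency(word, document) * inverse_document_frequency(word, documents)
        for word in vocabulary
    ]

# Toy collection: each document is a space-separated list of DLL names
documents = [
    "kernel32.dll user32.dll",
    "kernel32.dll advapi32.dll",
    "kernel32.dll user32.dll ws2_32.dll",
]
# Vocabulary: every distinct word, sorted for a stable feature order
vocabulary = sorted({word for doc in documents for word in doc.split()})

vectors = [tfidf_vector(doc, documents, vocabulary) for doc in documents]
for doc, vec in zip(documents, vectors):
    print(doc, [round(w, 3) for w in vec])
```

Under this formulation a word that appears in every document, such as `kernel32.dll` here, gets weight zero, because it does not help to tell the documents apart.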

In [1]:
import pandas as pd
# dataset location
data_path = "./datasets/brazilian-malware.csv"
# read CSV dataset
data = pd.read_csv(data_path)

Convert DLLs to features:

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
# get texts
texts = data["ImportedDlls"].values
# get mid of data
mid = int((len(texts) + 1)/2)
# split data into train and test
train_texts = texts[:mid]
test_texts = texts[mid:]
# initialize vectorizer
vectorizer = TfidfVectorizer(max_features=200)
# train vectorizer
vectorizer.fit(train_texts)
# transform train texts to features
train_features = vectorizer.transform(train_texts)
test_features = vectorizer.transform(test_texts)
# print features shape
print(train_features.shape, test_features.shape)

(25091, 200) (25090, 200)


### Word2Vec

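Word2Vec learns a dense vector for each word from the contexts in which it appears. Since each text contains a different number of words, the class below represents a text by the average of the vectors of its words, which yields a fixed-size feature vector.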

In [27]:
import numpy as np
from gensim.models import Word2Vec
# vectorizer class: average the word2vec vectors of the words in each text
class MeanEmbeddingVectorizer(object):
    def __init__(self, size, min_count=1):
        self.size = size
        self.min_count = min_count

    def fit(self, X):
        # Word2Vec expects lists of tokens, so split each text on whitespace
        sentences = [text.split() for text in X]
        # gensim >= 4.0 names; older versions use `size` and `wv.index2word`
        w2v = Word2Vec(sentences, vector_size=self.size, min_count=self.min_count)
        self.word2vec = dict(zip(w2v.wv.index_to_key, w2v.wv.vectors))
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(list(self.word2vec.values())[0])
        return self

    def transform(self, X):
        # average the vectors of the known words of each text
        return np.array([
            np.mean([self.word2vec[w] for w in text.split() if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for text in X
        ])
# get texts
texts = data["ImportedDlls"].values
# get mid of data
mid = int((len(texts) + 1)/2)
# split data into train and test
train_texts = texts[:mid]
test_texts = texts[mid:]
# initialize vectorizer
m = MeanEmbeddingVectorizer(size=200)
# train vectorizer
m.fit(train_texts)
# transform train and test texts to w2v mean
train_features = m.transform(train_texts)
test_features = m.transform(test_texts)
# print features shape
print(train_features.shape, test_features.shape)

(25091, 200) (25090, 200)
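The averaging step performed by `MeanEmbeddingVectorizer.transform` can be illustrated without training a model. The sketch below assumes a tiny, hand-written embedding table (the vectors are made up, not learned by Word2Vec) and shows how a text of any length is reduced to one fixed-size vector, with unknown words skipped and empty texts mapped to zeros.

```python
# Hypothetical 3-dimensional embeddings; real ones come from Word2Vec
embeddings = {
    "kernel32.dll": [1.0, 0.0, 2.0],
    "user32.dll": [3.0, 2.0, 0.0],
}
dim = 3

def mean_embedding(text, embeddings, dim):
    """Average the vectors of the known words in a space-separated text."""
    known = [embeddings[w] for w in text.split() if w in embeddings]
    if not known:
        # No known word: fall back to a zero vector of the same size
        return [0.0] * dim
    # zip(*known) iterates over the vectors one dimension at a time
    return [sum(column) / len(known) for column in zip(*known)]

full = mean_embedding("kernel32.dll user32.dll unknown.dll", embeddings, dim)
empty = mean_embedding("", embeddings, dim)
print(full)
print(empty)
```

Averaging discards word order and word counts, which is usually acceptable for attributes such as imported libraries, where the text is a bag of names.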


## Normalization

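After extraction, features may lie on very different scales. Normalization rescales them to a common scale. In the examples below the scaler is fit on the training set only and then applied to both the training and the test sets.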

### Min-Max

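Min-max normalization rescales each feature to a fixed range, here [0, 1], using the minimum and maximum values observed in the training set.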

In [32]:
from sklearn.preprocessing import MinMaxScaler
# initialize minmax
scaler = MinMaxScaler(feature_range=(0, 1))
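# train minmax (assumes the sparse TF-IDF features; Word2Vec features are
# already dense NumPy arrays and do not need .toarray())
scaler.fit(train_features.toarray())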
# transform features
train_features_norm = scaler.transform(train_features.toarray())
test_features_norm = scaler.transform(test_features.toarray())
print(train_features_norm.shape, test_features_norm.shape)

(25091, 200) (25090, 200)
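To make the computation explicit, here is a pure-Python sketch of min-max scaling to [0, 1], assuming the usual formula (x - min) / (max - min) applied per feature column. The helper names and the toy data are illustrative and are not part of scikit-learn.

```python
def fit_min_max(rows):
    """Record the minimum and maximum of each feature column."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Rescale each value with (x - min) / (max - min)."""
    return [
        [
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant column
            for x, lo, hi in zip(row, mins, maxs)
        ]
        for row in rows
    ]

# Toy data: two features per sample
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
test = [[2.0, 50.0]]

# Statistics come from the training set only
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training set, a test value outside the training range (50.0 in the second feature here) maps outside [0, 1].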


### Standardization

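Standardization rescales each feature to zero mean and unit variance, using the mean and standard deviation computed on the training set.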

In [None]:
from sklearn.preprocessing import StandardScaler
# initialize standard scaler
scaler = StandardScaler()
# train standard scaler
scaler.fit(train_features.toarray())
# transform features
train_features_norm = scaler.transform(train_features.toarray())
test_features_norm = scaler.transform(test_features.toarray())
print(train_features_norm.shape, test_features_norm.shape)
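The same transformation can be written by hand. The sketch below assumes the usual formula (x - mean) / std per feature column, with the population standard deviation, and uses illustrative helper names and toy data.

```python
from statistics import fmean, pstdev

def fit_standard(rows):
    """Record the mean and population standard deviation of each column."""
    columns = list(zip(*rows))
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]

def apply_standard(rows, means, stds):
    """Rescale each value with (x - mean) / std."""
    return [
        [
            (x - m) / s if s > 0 else 0.0  # constant column
            for x, m, s in zip(row, means, stds)
        ]
        for row in rows
    ]

# Toy data: two features per sample
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test = [[3.0, 40.0]]

# Statistics come from the training set only
means, stds = fit_standard(train)
train_std = apply_standard(train, means, stds)
test_std = apply_standard(test, means, stds)
print(train_std)
print(test_std)
```

Unlike min-max scaling, the result is not bounded to a fixed range: each training column ends up centered at zero, and values far from the training mean produce large positive or negative scores.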

## References

