# Feature extraction

## Extracting features from categorical variables

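It may seem intuitive to represent the values of a categorical variable with a single integer feature, but doing so encodes artificial information: integers imply an order, and there is no natural order of cities. One-hot encoding instead gives each value its own binary feature: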

In [2]:
from sklearn.feature_extraction import DictVectorizer

onehot_encoder = DictVectorizer()
X = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'}
]

print(onehot_encoder.fit_transform(X).toarray())

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
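
To make the ordering problem concrete, the sketch below compares distances between cities under an integer encoding and under the one-hot encoding. The integer codes are an arbitrary assignment made up for illustration; the one-hot rows are copied from the `DictVectorizer` output above.

```python
import math

# Hypothetical integer codes: the assignment is arbitrary, which is the problem.
integer_codes = {'Chapel Hill': 0, 'New York': 1, 'San Francisco': 2}

# One-hot rows, in the column order of the DictVectorizer output above.
one_hot = {
    'Chapel Hill': [1.0, 0.0, 0.0],
    'New York': [0.0, 1.0, 0.0],
    'San Francisco': [0.0, 0.0, 1.0],
}

def integer_distance(a, b):
    return abs(integer_codes[a] - integer_codes[b])

def one_hot_distance(a, b):
    return math.dist(one_hot[a], one_hot[b])

# Integer codes make Chapel Hill look twice as far from San Francisco as from New York.
print(integer_distance('Chapel Hill', 'New York'),
      integer_distance('Chapel Hill', 'San Francisco'))

# One-hot encoding keeps every pair of distinct cities equally far apart.
print(one_hot_distance('Chapel Hill', 'New York'),
      one_hot_distance('Chapel Hill', 'San Francisco'))
```

A distance-based or linear model would treat the integer gaps as meaningful, whereas the one-hot representation carries no such ordering.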


## Standardizing features

Standardization rescales each feature (column) so that it has zero mean and unit variance:

In [6]:
from sklearn import preprocessing
import numpy as np

X = np.array([
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.]
])

print(preprocessing.scale(X))

[[ 0.         -0.70710678 -1.38873015  0.52489066  0.59299945 -1.35873244]
 [ 0.         -0.70710678  0.46291005  0.87481777  0.81537425  1.01904933]
 [ 0.          1.41421356  0.9258201  -1.39970842 -1.4083737   0.33968311]]
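
As a check on what `preprocessing.scale` computes, the sketch below standardizes the third column of `X` by hand. It assumes the population standard deviation (dividing by n), which is what reproduces the values printed above.

```python
import statistics

column = [5.0, 13.0, 15.0]  # third column of X above

mean = statistics.fmean(column)
std = statistics.pstdev(column)  # population standard deviation (divides by n)

# z = (x - mean) / std for every value in the column
standardized = [(x - mean) / std for x in column]
print(standardized)
```

The result should match the third column of the scaled matrix, and its mean is zero up to floating-point error.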


To mitigate the effect of large outliers, use `RobustScaler`, which by default centers each feature on its median and scales it by its interquartile range.
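
The following sketch illustrates the idea with the standard library only, assuming the default behavior of `RobustScaler` (median centering and scaling by the 25th to 75th percentile range). The data is made up, with one deliberately large outlier.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up feature with one large outlier

# Standardization: the outlier inflates both the mean and the standard deviation,
# so the four ordinary values end up squeezed into a narrow band.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standardized = [(x - mean) / std for x in values]

# Robust scaling: center on the median and divide by the interquartile range,
# both of which are barely affected by the outlier.
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
robust = [(x - median) / (q3 - q1) for x in values]

print([round(z, 2) for z in standardized])
print(robust)
```

With robust scaling the ordinary values stay well spread out, while the outlier remains clearly identifiable as an extreme value.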

## Extracting features from text

### The bag-of-words model

The bag-of-words model is motivated by the intuition that documents containing similar words often have similar meanings. It represents each document by the words it contains and ignores their order. A collection of documents is called a **corpus**.

In [11]:
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwitch'
]

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]


In [13]:
print(vectorizer.vocabulary_)

{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwitch': 7}
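
To show where the matrix and the vocabulary come from, here is a standard-library sketch of the same steps: lowercase, tokenize, build a sorted vocabulary, count. The regular expression reflects my understanding of `CountVectorizer`'s default token pattern (tokens of two or more word characters), which would explain why `I` and `a` do not appear in the vocabulary.

```python
import re
from collections import Counter

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwitch',  # spelling kept as in the corpus above
]

# Assumption: tokens are runs of two or more word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def tokenize(document):
    return TOKEN_PATTERN.findall(document.lower())

tokenized = [tokenize(document) for document in corpus]

# The vocabulary maps each distinct token to a column index, in sorted order.
distinct_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(distinct_tokens)}

def to_counts(tokens):
    counts = Counter(tokens)
    # Dict iteration follows insertion order, i.e. the column order.
    return [counts[token] for token in vocabulary]

matrix = [to_counts(tokens) for tokens in tokenized]
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the dimensionality of the feature space equals the size of the vocabulary.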


In [15]:
from sklearn.metrics.pairwise import euclidean_distances
X = vectorizer.fit_transform(corpus)  # keep the sparse matrix: recent scikit-learn releases reject the np.matrix that todense() returns
print('Distance between 1st and 2nd documents: ', euclidean_distances(X[0], X[1]))
print('Distance between 1st and 3rd documents: ', euclidean_distances(X[0], X[2]))
print('Distance between 2nd and 3rd documents: ', euclidean_distances(X[1], X[2]))

Distance between 1st and 2nd documents:  [[2.44948974]]
Distance between 1st and 3rd documents:  [[2.64575131]]
Distance between 2nd and 3rd documents:  [[2.64575131]]
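
The first two documents are closer to each other than either is to the third. Because these vectors contain only zeros and ones, the Euclidean distance is simply the square root of the number of positions in which two documents differ, which the sketch below verifies with rows copied from the `CountVectorizer` output.

```python
import math

# Rows copied from the CountVectorizer output above.
doc1 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
doc2 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
doc3 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

def differing_positions(a, b):
    # Number of vocabulary words present in exactly one of the two documents.
    return sum(x != y for x, y in zip(a, b))

print(differing_positions(doc1, doc2), math.dist(doc1, doc2))
print(differing_positions(doc1, doc3), math.dist(doc1, doc3))
print(differing_positions(doc2, doc3), math.dist(doc2, doc3))
```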


### Stop word filtering

A basic strategy for reducing the dimensionality of the feature space is to convert all of the text to lowercase. A second strategy is to remove stop words, that is, words that are common to most of the documents:

In [16]:
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwitch': 6}
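
Stop word filtering amounts to dropping tokens that appear in a fixed list before the vocabulary is built. The sketch below uses a tiny hypothetical stop list that covers only this corpus; scikit-learn's built-in English list is far longer.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwitch',  # spelling kept as in the corpus above
]

# Hypothetical stop list, just large enough for this corpus.
STOP_WORDS = {'in', 'the', 'i', 'a'}

def tokenize(document):
    # Assumption: tokens are runs of two or more word characters.
    tokens = re.findall(r'\b\w\w+\b', document.lower())
    return [token for token in tokens if token not in STOP_WORDS]

distinct_tokens = sorted({token for document in corpus for token in tokenize(document)})
vocabulary = {token: index for index, token in enumerate(distinct_tokens)}
print(vocabulary)
```

Removing `in` and `the` shrinks the vocabulary from ten features to eight without discarding the words that distinguish the documents.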


### Stemming and lemmatization

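Stemming and lemmatization are two strategies for condensing inflected and derived forms of a word into a single feature. Stemming strips affixes with heuristic rules and may produce strings that are not words; lemmatization maps each word to its dictionary form (its lemma). The following corpus shows why this matters: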

In [17]:
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 0 0 1]
 [0 1 1 0]]
{'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}


The documents have similar meaning, but their feature vectors have no elements in common!

In [27]:
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/demas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/demas/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/demas/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [24]:
wordnet_tag = ['n', 'v']
stemmer = PorterStemmer()
print('Stemmed: ', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])

Stemmed:  [['He', 'ate', 'the', 'sandwich'], ['everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]


In [25]:
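def lemmatize(token, tag):
    # Lemmatize only nouns and verbs; the first letter of the POS tag, lowercased, is passed as the WordNet part of speech.
    if tag[0].lower() in ['n', 'v']:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token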

In [28]:
lemmatizer = WordNetLemmatizer()
tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print('Lemmatized: ', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])

Lemmatized:  [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]
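
The effect on the feature vectors can be sketched without NLTK. The lookup table below is a hypothetical, hand-written stand-in for WordNet that covers only the four tokens left in this corpus after stop word removal.

```python
# Hypothetical lemma table; a real lemmatizer consults a full lexicon.
LEMMAS = {'ate': 'eat', 'eaten': 'eat', 'sandwiches': 'sandwich'}

# Tokens of the two documents after stop word removal (see the vocabulary above).
documents = [
    ['ate', 'sandwiches'],
    ['sandwich', 'eaten'],
]

def normalize(tokens):
    # Tokens missing from the table are kept unchanged.
    return {LEMMAS.get(token, token) for token in tokens}

shared_before = set(documents[0]) & set(documents[1])
shared_after = normalize(documents[0]) & normalize(documents[1])

print(shared_before)
print(shared_after)
```

Before normalization the documents share no features; after it they share both `eat` and `sandwich`, so their vectors become identical.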


### Extending bag-of-words with tf-idf weights

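The binary feature vectors above do not encode grammar, word order, or word frequencies. Tf-idf weights add frequency information by combining a term's frequency in a document with its inverse document frequency in the corpus: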

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The dog ate a sandwitch and I ate a sandwitch',
    'The wizard transfigured a sandwitch'
]

vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
{'dog': 1, 'ate': 0, 'sandwitch': 2, 'wizard': 4, 'transfigured': 3}
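
The weights can be reproduced by hand. The sketch below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is my understanding of the `TfidfVectorizer` defaults; the token lists are the two documents after stop word removal.

```python
import math
from collections import Counter

# The two documents after lowercasing and stop word removal.
documents = [
    ['dog', 'ate', 'sandwitch', 'ate', 'sandwitch'],
    ['wizard', 'transfigured', 'sandwitch'],
]

vocabulary = sorted({token for tokens in documents for token in tokens})
n_documents = len(documents)

# Document frequency: in how many documents each term occurs.
df = {term: sum(term in tokens for tokens in documents) for term in vocabulary}

# Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_documents) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]  # L2-normalized row

rows = [tfidf_row(tokens) for tokens in documents]
for row in rows:
    print([round(w, 8) for w in row])
```

`sandwitch` occurs in both documents, so its idf is the minimum value of 1 and it is weighted below `ate` in the first document even though both occur twice there.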


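### Space-efficient feature vectorizing with the hashing trick

The hashing trick applies a hash function to each token to obtain its column index directly, so no vocabulary dictionary has to be built or kept in memory: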

In [33]:
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['the', 'ate', 'bacon', 'cat']
vectorizer = HashingVectorizer(n_features=6)
print(vectorizer.fit_transform(corpus).todense())

[[-1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0.  0.]]
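
A minimal sketch of the idea follows. MD5 is used here only because it is deterministic and in the standard library; it is not the hash function scikit-learn uses, so the columns and signs will not match the output above. The sign handling mirrors the purpose of `HashingVectorizer`'s signed hashing, where colliding tokens tend to cancel out instead of accumulating.

```python
import hashlib

N_FEATURES = 6

def hashed_vector(tokens, n_features=N_FEATURES):
    vector = [0.0] * n_features
    for token in tokens:
        # Stand-in hash (an assumption for illustration, not scikit-learn's).
        digest = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
        index = digest % n_features                 # column for this token
        sign = 1.0 if (digest >> 64) & 1 else -1.0  # one hash bit picks the sign
        vector[index] += sign
    return vector

for token in ['the', 'ate', 'bacon', 'cat']:
    print(token, hashed_vector([token]))
```

The trade-off is that the mapping cannot be inverted, and with only six columns distinct tokens can collide in the same feature.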


## Word embeddings

While the bag-of-words model uses a scalar to represent each token, word embeddings use a vector. Words that are semantically similar are represented by vectors that are near each other. Concretely, word embeddings are parametrized functions that take a token from some language as input and output a vector. This function is essentially a lookup table that is parametrized by a matrix of embeddings.
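
The lookup-table view can be sketched in a few lines. The vocabulary and the three-dimensional vectors below are invented for illustration; real embeddings are learned from data and have many more dimensions.

```python
import math

# Hypothetical vocabulary and embedding matrix (one row per token).
vocabulary = {'sandwich': 0, 'burger': 1, 'wizard': 2}
embedding_matrix = [
    [0.9, 0.1, 0.0],  # sandwich
    [0.8, 0.2, 0.1],  # burger
    [0.0, 0.1, 0.9],  # wizard
]

def embed(token):
    # The embedding function is a row lookup in the matrix.
    return embedding_matrix[vocabulary[token]]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine_similarity(embed('sandwich'), embed('burger')))
print(cosine_similarity(embed('sandwich'), embed('wizard')))
```

With these made-up values the two food words have a much higher cosine similarity than `sandwich` and `wizard`, which is the kind of geometry a trained embedding is meant to capture.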

    The second component is a binary classifier that predicts whether the five vectors represent a valid se