# Extracting text features exercise -- Solution

The word "Python" can refer to both a snake and a programming language.

In this exercise, we use the text vectorizers provided in `sklearn` to transform sentences into numerical vectors, and check if sentences using "Python" with the same meaning have similar vector representations.

1. Use `CountVectorizer` to transform the sentences given below into numerical vectors, and print the result. Set only `lowercase=True` as an input argument to `CountVectorizer`. (The output of the vectorizer is a sparse matrix; use its `toarray()` method to print the result as a regular NumPy array.)
2. Print the list of extracted features (see the `get_feature_names` method of `CountVectorizer`).
3. Compute the pairwise distance between the vector representations of the sentences using the `cosine_distances` function, defined in `sklearn.metrics.pairwise` [1]. Which sentences are closer in the feature space?
4. Repeat steps 1-3, removing stopwords (using the `stop_words` optional argument of `CountVectorizer`).
5. Repeat steps 1-3 using a `TfidfVectorizer`.
6. Finally, use `CountVectorizer` with your own processing method, including tokenization, normalization, and stemming (pass your method through the `analyzer` argument of `CountVectorizer`, as discussed in the class slides).

[1] `cosine_distances` is a distance measure that is independent of the length of the vectors; see http://bit.ly/1DcbLZx for a detailed description.
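
To make the footnote concrete, here is a minimal sketch of cosine distance written with only the standard library. The helper name `cosine_distance` and the three toy vectors are illustrative assumptions, not part of the exercise; the solution itself keeps using `sklearn.metrics.pairwise.cosine_distances`.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy count vectors over a three-word vocabulary.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 3]    # no word in common with short_doc

d_scaled = cosine_distance(short_doc, long_doc)
d_disjoint = cosine_distance(short_doc, other_doc)
print(d_scaled, d_disjoint)
```

Scaling a vector leaves its distance to the original at (numerically) zero, while two documents that share no words sit at distance 1, the largest value possible for non-negative count vectors.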

In [13]:
import numpy as np
np.set_printoptions(precision=4, suppress=True)

import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

documents = [
   'Pythons are non-venomous but dangerous snakes.',
   'Python is a great programming language!',
   'A Python is a constricting snake.',
   'Python and Matlab are popular languages in science.',
]

In [14]:
# 1. Create a count vectorizer.
vectorizer = CountVectorizer(lowercase=True)

In [15]:
# Train the vectorizer and transform the documents in a single step.
features = vectorizer.fit_transform(documents)
print(features.toarray())

[[0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1]
 [0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0]
 [0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0]
 [1 1 0 0 0 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0]]


In [16]:
# 2. Show the extracted features.
print(vectorizer.get_feature_names())

[u'and', u'are', u'but', u'constricting', u'dangerous', u'great', u'in', u'is', u'language', u'languages', u'matlab', u'non', u'popular', u'programming', u'python', u'pythons', u'science', u'snake', u'snakes', u'venomous']
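
The feature list contains no one-letter words such as "a", and "non-venomous" appears as two separate features. This matches the default `token_pattern` of `CountVectorizer`, `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below imitates that step with `re`; the `tokenize` helper is an illustrative stand-in, not the library's actual code path.

```python
import re

documents = [
    'Pythons are non-venomous but dangerous snakes.',
    'Python is a great programming language!',
    'A Python is a constricting snake.',
    'Python and Matlab are popular languages in science.',
]

# Default token pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted(set(token for doc in documents for token in tokenize(doc)))
print(tokenize(documents[2]))
print(len(vocabulary))
```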


In [17]:
# 3. Compute the matrix of pairwise distances.
print(cosine_distances(features))

[[ 0.      1.      1.      0.8664]
 [ 1.      0.      0.5528  0.8419]
 [ 1.      0.5528  0.      0.8232]
 [ 0.8664  0.8419  0.8232  0.    ]]
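
To answer the question in step 3, one can look for the smallest off-diagonal entry of the matrix. The sketch below does this on the values copied from the output above; the variable names are illustrative only.

```python
documents = [
    'Pythons are non-venomous but dangerous snakes.',
    'Python is a great programming language!',
    'A Python is a constricting snake.',
    'Python and Matlab are popular languages in science.',
]

# Distance matrix copied from the printed output of step 3.
distances = [
    [0.0,    1.0,    1.0,    0.8664],
    [1.0,    0.0,    0.5528, 0.8419],
    [1.0,    0.5528, 0.0,    0.8232],
    [0.8664, 0.8419, 0.8232, 0.0],
]

# Collect each unordered pair once (upper triangle, diagonal excluded).
n = len(documents)
pairs = [(distances[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
closest_distance, i, j = min(pairs)

print(closest_distance)
print(documents[i])
print(documents[j])
```

The closest pair is the second and third sentence, one about the language and one about the snake. They share only the tokens "python" and "is", so their closeness reflects overlap in common words rather than meaning, which is what the stopword removal in step 4 addresses.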


In [18]:
# 4. Re-run the analysis after removing stopwords.
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
features = vectorizer.fit_transform(documents).toarray()
print(cosine_distances(features))

[[ 0.      1.      1.      1.    ]
 [ 1.      0.      0.7113  0.7764]
 [ 1.      0.7113 -0.      0.7418]
 [ 1.      0.7764  0.7418  0.    ]]


In [19]:
# 5. Re-run the analysis with a TF-IDF vectorizer.
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
features = vectorizer.fit_transform(documents).toarray()
print(cosine_distances(features))

[[-0.      1.      1.      1.    ]
 [ 1.     -0.      0.8578  0.8949]
 [ 1.      0.8578 -0.      0.8749]
 [ 1.      0.8949  0.8749 -0.    ]]


In [21]:
list(zip(vectorizer.idf_, vectorizer.get_feature_names()))

[(1.9162907318741551, u'constricting'),
 (1.9162907318741551, u'dangerous'),
 (1.9162907318741551, u'great'),
 (1.9162907318741551, u'language'),
 (1.9162907318741551, u'languages'),
 (1.9162907318741551, u'matlab'),
 (1.9162907318741551, u'non'),
 (1.9162907318741551, u'popular'),
 (1.9162907318741551, u'programming'),
 (1.2231435513142097, u'python'),
 (1.9162907318741551, u'pythons'),
 (1.9162907318741551, u'science'),
 (1.9162907318741551, u'snake'),
 (1.9162907318741551, u'snakes'),
 (1.9162907318741551, u'venomous')]
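
These values are consistent with the smoothed inverse document frequency that `TfidfVectorizer` applies by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below checks two of the values; the document frequencies were counted by hand from the four sentences, which is an assumption of this example.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_documents) / (1.0 + document_frequency)) + 1.0

n_documents = 4

# "python" occurs in sentences 2, 3 and 4 (sentence 1 has "pythons").
idf_python = smoothed_idf(n_documents, 3)
# Every other remaining term occurs in exactly one sentence.
idf_rare = smoothed_idf(n_documents, 1)

print(round(idf_python, 4), round(idf_rare, 4))
```

Because "python" appears in three of the four sentences it receives the lowest weight, so the one word the sentences share contributes less to their similarity and the off-diagonal distances grow compared with the plain counts.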

In [26]:
# 6. Re-run the analysis with a custom analysis method, including stemming.
from string import punctuation
from nltk.corpus import stopwords

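# Build the stopword set and the stemmer once, not once per token or document.
english_stopwords = set(stopwords.words('english'))
stemmer = nltk.PorterStemmer()

def stemming_analyzer(raw):
    # Tokenize, drop pure-punctuation tokens, and lowercase the rest.
    tokens = nltk.wordpunct_tokenize(raw)
    tokens = [w.strip(punctuation).lower() for w in tokens if w not in punctuation]
    # Remove stopwords, then reduce each remaining token to its stem.
    tokens = [w for w in tokens if w not in english_stopwords]
    return [stemmer.stem(w) for w in tokens]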

vectorizer = CountVectorizer(analyzer=stemming_analyzer)
features = vectorizer.fit_transform(documents).toarray()
print(cosine_distances(features))

[[ 0.      0.7764  0.4836  0.8   ]
 [ 0.7764  0.      0.7113  0.5528]
 [ 0.4836  0.7113 -0.      0.7418]
 [ 0.8     0.5528  0.7418  0.    ]]


In [23]:
vectorizer.get_feature_names()

[u'constrict',
 u'danger',
 u'great',
 u'languag',
 u'matlab',
 u'non',
 u'popular',
 u'program',
 u'python',
 u'scienc',
 u'snake',
 u'venom']
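
With the stemmed vocabulary, "pythons" and "python" (and likewise "snakes" and "snake") collapse onto the same feature, so the two snake sentences become the closest pair, followed by the two language sentences. The sketch below recomputes those two entries of the last distance matrix. The token lists were derived by hand from the feature names above rather than by running NLTK, so treat them as an assumption.

```python
import math
from collections import Counter

def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between the bag-of-words counts of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # A Counter returns 0 for tokens it has not seen.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hand-derived stemmed tokens for the four sentences (assumption).
stemmed = [
    ['python', 'non', 'venom', 'danger', 'snake'],
    ['python', 'great', 'program', 'languag'],
    ['python', 'constrict', 'snake'],
    ['python', 'matlab', 'popular', 'languag', 'scienc'],
]

snake_pair = cosine_distance(stemmed[0], stemmed[2])
language_pair = cosine_distance(stemmed[1], stemmed[3])
print(round(snake_pair, 4), round(language_pair, 4))
```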