# Natural Language Processing Lab

In this lab we will further explore scikit-learn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by scikit-learn.

In [138]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline




In [4]:
from sklearn.datasets import fetch_20newsgroups

In [5]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Let's inspect them.

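1. What data type is `data_train`?
> sklearn.datasets.base.Bunch
- Is it like a list, a dictionary, or something else?
> A dictionary: `Bunch` is a `dict` subclass whose keys can also be read as attributes (e.g. `data_train.data`).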
- How many data points does it contain?
- Inspect the first data point, what does it look like?
> A blurb of text

In [8]:
print 'There are', len(data_train.data), 'data points in it.'

There are 2034 data points in it.
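
The answers above can be checked without downloading anything. The sketch below builds a tiny stand-in `Bunch` by hand; the two documents and labels are made up for illustration, and the import path `sklearn.utils.Bunch` is an assumption that holds for recent scikit-learn releases. It uses Python 3 syntax, unlike the Python 2 cells in this notebook.

```python
from sklearn.utils import Bunch

# Hypothetical two-document stand-in for data_train (not the real dataset)
toy = Bunch(data=["first post about graphics", "second post about space"],
            target=[1, 2])

# Bunch subclasses dict, so it supports both key access and attribute access
is_dict = isinstance(toy, dict)
same_object = toy["data"] is toy.data
n_points = len(toy.data)

print(type(toy).__name__, is_dict, same_object, n_points)
print(sorted(toy.keys()))
```

On the real object, `data_train.keys()` lists the available fields in the same way, and `data_train.data[0]` is the first document.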


## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data
- how big is the feature dictionary
- repeat eliminating english stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- what are the 20 words that are most common in the whole corpus?
- what are the 20 most common words in each of the 4 classes?
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
- try the following 3 modifications:
    - restrict the max_features
    - change max_df and min_df
    - use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
- for each of the above print a confusion matrix and investigate what gets mixed
> Answer: not surprisingly, if we reduce the feature space we lose accuracy
- print out the number of features for each model
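
Before running the real vectorizer, it may help to see what a bag-of-words model computes. The following is a minimal pure-Python sketch, not scikit-learn's implementation: it lowercases the text, keeps runs of two or more word characters (the same idea as `CountVectorizer`'s default token pattern, which is why `don't` shows up as `don` in the vocabularies in this lab), and counts tokens per document. The two documents are made up for illustration.

```python
import re
from collections import Counter

# Lowercase, then keep runs of two or more word characters
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Made-up documents, for illustration only
docs = ["I don't know the orbit", "The orbit of the moon"]

# Feature dictionary: each distinct token gets a column index, in sorted order
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

# One row of counts per document, one column per vocabulary entry
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts.get(tok, 0) for tok in all_tokens])

print(vocabulary)
print(rows)
```

The "size of the feature dictionary" asked about above is simply `len(vocabulary)`, which is also the number of columns in the count matrix.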

In [12]:
# Initialize a standard CountVectorizer and fit the training data
v = CountVectorizer()
X_train_counts = v.fit_transform(data_train.data)


In [13]:
# how big is the feature dictionary
X_train_counts.shape

(2034, 26879)

In [14]:
# repeat eliminating english stop words
v = CountVectorizer(stop_words='english')
X_train_counts = v.fit_transform(data_train.data)

In [17]:
# is the dictionary smaller?
print X_train_counts.shape
print 'It is barely smaller.'

(2034, 26576)
It is barely smaller.
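
Only 303 of the 26879 columns disappear (26879 vs. 26576), which is expected: a stop list is a short fixed list of very frequent words, while most of the vocabulary consists of rare terms. The sketch below shows the same effect on two made-up sentences, using Python 3 syntax.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, for illustration only
docs = ["The shuttle is on the launch pad", "This image is in the gif format"]

full = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words='english').fit(docs)

# Terms that were in the full vocabulary but dropped by the stop list
removed = sorted(set(full.vocabulary_) - set(no_stop.vocabulary_))

print(len(full.vocabulary_), len(no_stop.vocabulary_))
print(removed)
```

Stop-word removal shrinks the vocabulary by at most the length of the stop list, so on a real corpus its effect on the feature count is small even though the removed words account for a large share of the raw token counts.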


In [18]:
# transform the training data using the trained vectorizer
# (v was already fitted above, so transform() alone is enough)
X_vec = v.transform(data_train.data)


In [39]:
# what are the 20 words that are most common in the whole corpus?
_20 = CountVectorizer(stop_words='english', max_features = 20)
_20.fit_transform(data_train.data)
_20.vocabulary_

{u'data': 0,
 u'does': 1,
 u'don': 2,
 u'edu': 3,
 u'god': 4,
 u'good': 5,
 u'graphics': 6,
 u'image': 7,
 u'jesus': 8,
 u'just': 9,
 u'know': 10,
 u'like': 11,
 u'nasa': 12,
 u'people': 13,
 u'say': 14,
 u'space': 15,
 u'think': 16,
 u'time': 17,
 u'use': 18,
 u'way': 19}
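
Note that `vocabulary_` maps each kept term to its column index, which is why the listing above is alphabetical and shows no counts. To see the actual frequencies, one option is to sum the columns of the count matrix. The sketch below does this on a made-up three-document corpus (Python 3 syntax); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
docs = ["space nasa space orbit", "nasa launch space", "image gif image"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Sum each column of the sparse matrix to get corpus-wide term counts
totals = np.asarray(X.sum(axis=0)).ravel()

# Order the terms by column index so they line up with `totals`
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Keep the three most frequent terms
top = sorted(zip(terms, totals.tolist()), key=lambda pair: -pair[1])[:3]
print(top)
```

The same pattern applied to `data_train.data` with `[:20]` would give the 20 most common words together with their counts.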

In [55]:
X = pd.DataFrame(data_train.data, columns =['data'])
y = pd.DataFrame(data_train.target, columns =['class'])
y.head()
stuff = pd.concat([X, y], axis = 1)
stuff.head()

| | data | class |
|---|---|---|
| 0 | `Hi,\n\nI've noticed that if you only save a mo...` | 1 |
| 1 | `\n\nSeems to be, barring evidence to the contr...` | 3 |
| 2 | `\n >In article <1993Apr19.020359.26996@sq.sq.c...` | 2 |
| 3 | `I have a request for those who would like to s...` | 0 |
| 4 | `AW&ST had a brief blurb on a Manned Lunar Exp...` | 2 |


In [98]:
# what are the 20 most common words in each of the 4 classes?
listy = []
for i in range(4):
    _20 = CountVectorizer(stop_words='english', max_features = 20)
    # fit on the documents of class i only
    _20.fit_transform(stuff.loc[stuff['class'] == i, 'data'])
    print i, [x for x in _20.vocabulary_]
    listy.append([x for x in _20.vocabulary_])
    

0 [u'people', u'time', u'argument', u'say', u'religion', u'atheists', u'don', u'jesus', u'does', u'way', u'true', u'atheism', u'said', u'just', u'think', u'bible', u'like', u'god', u'believe', u'know']
1 [u'format', u'data', u'image', u'gif', u'ftp', u'graphics', u'does', u'software', u'available', u'use', u'pub', u'like', u'images', u'file', u'edu', u'jpeg', u'color', u'program', u'files', u'know']
2 [u'people', u'time', u'data', u'just', u'year', u'space', u'launch', u'orbit', u'new', u'don', u'lunar', u'shuttle', u'like', u'earth', u'satellite', u'moon', u'program', u'mission', u'nasa', u'use']
3 [u'life', u'people', u'time', u'say', u'jesus', u'does', u'way', u'think', u'don', u'said', u'just', u'did', u'bible', u'good', u'believe', u'point', u'like', u'god', u'christian', u'know']


In [102]:
listy2 = []
for i in listy:
    for j in i:
        listy2.append(j)


In [97]:
# this = 0
# _20 = CountVectorizer(stop_words='english', max_features = 20)
# for i in range(4):
#     if this == 0:
#         _80thing = pd.DataFrame(_20.fit_transform(stuff.ix[stuff['class']== i]['data']).todense())
#         this =1
#     elif this == 1:
#         print this
#         _80thing = pd.concat([_80thing, pd.DataFrame(_20.fit_transform(stuff.ix[stuff['class']== i]['data']).todense())], axis=1)
#         this = 2
#     elif this == 2:
#         print this
#         this = 3
#         _80thing = pd.concat([_80thing, pd.DataFrame(_20.fit_transform(stuff.ix[stuff['class']== i]['data']).todense())], axis=1)
#     else:
#         print this
#         _80thing = pd.concat([_80thing, pd.DataFrame(_20.fit_transform(stuff.ix[stuff['class']== i]['data']).todense())], axis=1)
# _80thing.info()

In [112]:
'''
evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
'''
def log_score(v):
    X_vec = v.fit_transform(data_train.data)
    test_vec = v.transform(data_test.data)

    lr = LogisticRegression()
    lr.fit(X_vec, data_train.target)
    y_pred = lr.predict(test_vec)
    print 'score:', lr.score(test_vec, data_test.target)
    print  'con mat:','\n', confusion_matrix(data_test.target, y_pred)
    print 'number of words:', len(v.vocabulary_)
    
v = CountVectorizer(stop_words='english')
log_score(v)

score: 0.745011086475
con mat: 
[[187  16  46  70]
 [ 13 345  28   3]
 [ 22  23 333  16]
 [ 67  14  27 143]]
number of words: 26576
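
To "investigate what gets mixed", the confusion matrix can be read numerically. In scikit-learn's convention rows are true classes and columns are predicted classes. The sketch below copies the matrix printed above and derives the accuracy, the per-class recall, and the most frequent confusion. It uses Python 3 syntax (true division).

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[187, 16, 46, 70],
               [13, 345, 28, 3],
               [22, 23, 333, 16],
               [67, 14, 27, 143]])

# Accuracy is the diagonal divided by the total number of test documents
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the row total
recall = np.diag(cm) / cm.sum(axis=1)

# The largest off-diagonal entry is the most frequent confusion
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_cls, pred_cls = (int(k) for k in
                      np.unravel_index(off_diagonal.argmax(), off_diagonal.shape))

print(round(accuracy, 4))
print(recall.round(3))
print(true_cls, pred_cls)
```

The accuracy recovered this way matches the printed score, and the largest confusion is between classes 0 and 3. Judging by the per-class word lists earlier, these are the two religion-related groups, which share most of their frequent vocabulary.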


In [128]:
'''
try the following 3 modifications:
restrict the max_features
change max_df and min_df
use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
for each of the above print a confusion matrix and investigate what gets mixed
Answer: not surprisingly, if we reduce the feature space we lose accuracy
print out the number of features for each model
'''

for i in [10, 100, 1000, 10000]:
    print 'max_features =', i
    v = CountVectorizer(stop_words='english', max_features = i)
    log_score(v)
    print '\n'
    
for i in [.1, .2, .3, .4, .5, .6, .7, .8, .9]:
    print 'max_df =', i
    v = CountVectorizer(stop_words='english', max_df = i)
    log_score(v)
    print '\n'
    
for i in [.1, .2]:
    print 'min_df =', i
    v = CountVectorizer(stop_words='english', min_df = i)
    log_score(v)
    print '\n'

print 'With the 20 most common words from each class (duplicates removed).'
v = CountVectorizer(stop_words='english', vocabulary=list(set(listy2)))
log_score(v)

max_features = 10
score: 0.427198817443
con mat: 
[[108 160  37  14]
 [ 23 310  55   1]
 [ 31 214 148   1]
 [ 89 129  21  12]]
number of words: 10


max_features = 100
score: 0.619364375462
con mat: 
[[161  28  56  74]
 [ 19 302  59   9]
 [ 38  34 298  24]
 [107  19  48  77]]
number of words: 100


max_features = 1000
score: 0.69696969697
con mat: 
[[172  15  59  73]
 [ 16 331  36   6]
 [ 28  29 312  25]
 [ 76  16  31 128]]
number of words: 1000


max_features = 10000
score: 0.742054693274
con mat: 
[[186  15  46  72]
 [ 14 344  28   3]
 [ 21  27 331  15]
 [ 67  14  27 143]]
number of words: 10000


max_df = 0.1
score: 0.746489283075
con mat: 
[[186  16  52  65]
 [ 11 341  34   3]
 [ 21  27 335  11]
 [ 59  14  30 148]]
number of words: 26558


max_df = 0.2
score: 0.747967479675
con mat: 
[[184  15  48  72]
 [  9 345  32   3]
 [ 19  25 336  14]
 [ 64  16  24 147]]
number of words: 26572


max_df = 0.3
score: 0.745011086475
con mat: 
[[187  16  46  70]
 [ 13 345  28   3]
 [ 22  23 333  1
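
In the runs shown above, `max_df = 0.1` removes only 18 terms (26558 vs. 26576 words) and `max_df = 0.2` only 4, so the score barely moves. Both parameters are thresholds on document frequency: a float is read as a proportion of documents and an integer as an absolute count. The following sketch illustrates this on a made-up four-document corpus (Python 3 syntax).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "space" is in 4/4 documents, "nasa" in 2/4, the rest in 1/4
docs = ["space nasa orbit", "space image", "space gif", "space launch nasa"]

all_terms = set(CountVectorizer().fit(docs).vocabulary_)

# max_df=0.5: drop terms that occur in more than 50% of the documents
capped = set(CountVectorizer(max_df=0.5).fit(docs).vocabulary_)

# min_df=2: keep only terms that occur in at least 2 documents
floored = set(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(sorted(all_terms - capped))
print(sorted(floored))
```

`max_df` trims the few very common terms, while `min_df` trims the long tail of rare terms, which is where most of the vocabulary sits.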

## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- can you improve on your best score above?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
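
The idea behind a hashing vectorizer can be sketched in a few lines: instead of building a vocabulary, each token is hashed straight to a column index. The example below is a simplified illustration and not scikit-learn's implementation. In particular, `hashlib.md5` is swapped in as a deterministic stand-in for the hash function, and whitespace splitting stands in for the real tokenizer; `bucket` and `hash_vectorize` are hypothetical helper names.

```python
import hashlib

N_FEATURES = 8  # tiny on purpose; the HashingVectorizer run below reports 1048576

def bucket(token, n_features=N_FEATURES):
    """Map a token to a column index without storing any vocabulary."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(text, n_features=N_FEATURES):
    """Count tokens into a fixed-width row; colliding tokens share a column."""
    row = [0] * n_features
    for token in text.lower().split():
        row[bucket(token, n_features)] += 1
    return row

row = hash_vectorize("nasa launch nasa orbit")
print(row)
```

Because nothing is stored, the number of features is fixed up front and there is no `vocabulary_` to inspect, which is why `log_score_2` below prints `v.n_features` rather than `len(v.vocabulary_)`. The price is that different tokens can collide in the same column and columns cannot be mapped back to words.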

In [120]:
def log_score_2(v):
    X_vec = v.fit_transform(data_train.data)
    test_vec = v.transform(data_test.data)

    lr = LogisticRegression()
    lr.fit(X_vec, data_train.target)
    y_pred = lr.predict(test_vec)
    print 'score:', lr.score(test_vec, data_test.target)
    print  'con mat:','\n', confusion_matrix(data_test.target, y_pred)
    print 'number of features:', v.n_features

In [127]:
v = HashingVectorizer(stop_words='english')
log_score_2(v)


score: 0.736881005174
con mat: 
[[197  15  65  42]
 [  9 347  32   1]
 [ 21  23 350   0]
 [ 86  18  44 103]]
number of features: 1048576
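
The next cell switches to `TfidfVectorizer`, which reweights raw counts so that terms occurring in many documents count for less. The sketch below works through the weighting by hand in pure Python, assuming scikit-learn's documented defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The pre-tokenized documents are made up for illustration.

```python
import math
from collections import Counter

# Made-up, already tokenized documents
docs = [["space", "nasa", "space"], ["space", "image"], ["image", "gif"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(docs[0])
print(weights)
```

Here `nasa` has a higher IDF than `space` because it appears in fewer documents, although `space` still gets the larger final weight in the first document because it occurs twice there.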


In [130]:
v = TfidfVectorizer(stop_words='english')
log_score(v)

for i in [10, 100, 1000, 10000]:
    print 'max_features =', i
    v = TfidfVectorizer(stop_words='english', max_features = i)
    log_score(v)
    print '\n'
    
for i in [.1, .2, .3, .4, .5, .6, .7, .8, .9]:
    print 'max_df =', i
    v = TfidfVectorizer(stop_words='english', max_df = i)
    log_score(v)
    print '\n'
    
for i in [.1, .2]:
    print 'min_df =', i
    v = TfidfVectorizer(stop_words='english', min_df = i)
    log_score(v)
    print '\n'

print 'With the 20 most common words from each class (duplicates removed).'
v = TfidfVectorizer(stop_words='english', vocabulary=list(set(listy2)))
log_score(v)

print 'Reducing max_df to 0.2 increased the score.'

score: 0.747967479675
con mat: 
[[198  15  65  41]
 [  8 351  29   1]
 [ 17  21 356   0]
 [ 82  16  46 107]]
number of words: 26576
max_features = 10
score: 0.430155210643
con mat: 
[[150 141  24   4]
 [ 49 292  48   0]
 [ 54 201 139   0]
 [112 123  15   1]]
number of words: 10


max_features = 100
score: 0.631189948263
con mat: 
[[168  26  64  61]
 [ 21 306  59   3]
 [ 35  38 304  17]
 [109  21  45  76]]
number of words: 100


max_features = 1000
score: 0.726533628973
con mat: 
[[185  16  63  55]
 [  7 343  37   2]
 [ 23  26 337   8]
 [ 79  15  39 118]]
number of words: 1000


max_features = 10000
score: 0.748706577975
con mat: 
[[198  17  60  44]
 [  8 350  30   1]
 [ 17  22 355   0]
 [ 83  18  40 110]]
number of words: 10000


max_df = 0.1
score: 0.739837398374
con mat: 
[[191  11  76  41]
 [  7 345  36   1]
 [ 12  23 359   0]
 [ 76  13  56 106]]
number of words: 26558


max_df = 0.2
score: 0.749445676275
con mat: 
[[194  14  66  45]
 [  7 351  30   1]
 [ 13  22 359   0]
 [ 79  16  

## 4. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
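
A minimal sketch of that pattern: fit the vectorizer once, transform the test set with the fitted vectorizer, and reuse the two matrices for every model while collecting scores for a ranked summary. The miniature corpus, its labels, and the choice of only two models are made up to keep the example small (Python 3 syntax).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up miniature corpus: label 0 = space, label 1 = graphics
train_docs = ["nasa launch orbit", "shuttle orbit moon", "lunar mission nasa",
              "gif image format", "jpeg image color", "graphics file format"]
train_y = [0, 0, 0, 1, 1, 1]
test_docs = ["moon orbit mission", "color image file"]
test_y = [0, 1]

# Vectorize exactly once: fit on train, transform (never refit) on test
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Reuse the same matrices for every model
models = {
    "logreg": LogisticRegression(),
    "tree": DecisionTreeClassifier(random_state=0),
}
scores = {name: model.fit(X_train, train_y).score(X_test, test_y)
          for name, model in models.items()}

# Print the models from best to worst test accuracy
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, score)
```

Keeping the scores in a dictionary makes it easy to rank the models afterwards instead of reading them off a long printout.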

In [143]:
models = [KNeighborsClassifier(), LogisticRegression(), DecisionTreeClassifier(), SVC(), RandomForestClassifier(), ExtraTreesClassifier()]

# Vectorize once, then reuse the same matrices for every model
v = TfidfVectorizer(stop_words='english', max_df=0.2)
X_vec = v.fit_transform(data_train.data)
test_vec = v.transform(data_test.data)

def mod_score(model):
    print model
    model.fit(X_vec, data_train.target)
    y_pred = model.predict(test_vec)
    print 'score:', model.score(test_vec, data_test.target)
    print 'con mat:', '\n', confusion_matrix(data_test.target, y_pred)
    print 'number of words:', len(v.vocabulary_)

for model in models:
    mod_score(model)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
score: 0.252032520325
con mat: 
[[122 108  34  55]
 [158 101  62  68]
 [118 135  72  69]
 [ 86  80  39  46]]
number of words: 26572
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
score: 0.749445676275
con mat: 
[[194  14  66  45]
 [  7 351  30   1]
 [ 13  22 359   0]
 [ 79  16  46 110]]
number of words: 26572
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='be

## Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare across all the suggested classifiers and to display the final plot.

## Bonus: NLTK

NLTK is a vast library. Can you find some interesting bits to share with classmates?
Start here: http://www.nltk.org/