# Natural Language Processing Lab

In this lab we will further explore scikit-learn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by scikit-learn.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline

In [6]:
from sklearn.datasets import fetch_20newsgroups

In [7]:
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Let's inspect them.

1. What data type is `data_train`?
> sklearn.datasets.base.Bunch
- Is it like a list, like a dictionary, or something else?
> It behaves like a dictionary whose keys can also be read as attributes (e.g. `data_train.data`)
- How many data points does it contain?
> 2034
- Inspect the first data point. What does it look like?
> A blurb of text
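
To make the answers above concrete, here is a minimal sketch of how a `Bunch` behaves. It builds a small hand-made `Bunch` instead of downloading the dataset; the two texts and labels are made up for illustration, and the sketch assumes `sklearn.utils.Bunch` is importable in your scikit-learn version.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for data_train: same field names, made-up contents.
toy = Bunch(
    data=["first post about graphics", "second post about space"],
    target=[0, 1],
    target_names=["comp.graphics", "sci.space"],
)

# A Bunch is a dict subclass whose keys are also readable as attributes.
print(isinstance(toy, dict))
print(sorted(toy.keys()))
print(toy.data is toy["data"])
print(len(toy.data))
```

The text itself lives in the `data` list, which is why the cell below reports `list` when it inspects `data_train.data`.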

In [10]:
print "Type: ", type(data_train.data)
print '# of Data points: ', len(data_train.data)
data_train.data[0]


Type:  <type 'list'>
# of Data points:  2034


u"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit it on the training data
- How big is the feature dictionary?
- Repeat, eliminating English stop words
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer
- What are the 20 most common words in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - You will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
- Try the following 3 modifications:
    - restrict `max_features`
    - change `max_df` and `min_df`
    - use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
- For each of the above, print a confusion matrix and investigate what gets mixed up
> Answer: not surprisingly, if we shrink the feature space aggressively we lose accuracy (0.745 with all 26576 features versus 0.427 with `max_features=10`)
- Print out the number of features for each model
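
The warning about not re-fitting the vectorizer on the test set can be illustrated with a small sketch. The two mini corpora below are made up; the point is only that `transform` reuses the vocabulary learned by `fit_transform`, so both matrices have the same columns and words unseen during training are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpora, for illustration only.
train_docs = ["the shuttle reached orbit", "the image file is a gif"]
test_docs = ["a rocket reached orbit"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses it, no re-fitting

print(X_train.shape)
print(X_test.shape)
# "rocket" never appeared in the training data, so it has no column.
print("rocket" in vectorizer.vocabulary_)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.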

In [17]:
# Initialize a standard CountVectorizer and fit the training data
v = CountVectorizer()
X_train_vect = v.fit_transform(data_train.data)

In [19]:
# how big is the feature dictionary
X_train_vect.shape

(2034, 26879)

In [22]:
# repeat eliminating english stop words
# transform the training data using the trained vectorizer
v = CountVectorizer(stop_words='english')
X_train_vect = v.fit_transform(data_train.data)

# is the dictionary smaller?
print "Yes"
X_train_vect.shape

Yes


(2034, 26576)
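
The dictionary shrinks by only 303 entries (26879 to 26576) because the stop-word list is a short, fixed set of very frequent English words. A minimal sketch on two made-up sentences shows the effect; it assumes the default tokenizer, which also ignores single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, for illustration only.
docs = ["the cat and the dog", "the dog was barking"]

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(no_stop.vocabulary_))
```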

In [24]:
# what are the 20 words that are most common in the whole corpus?
top_20 = CountVectorizer(stop_words='english', max_features = 20)
top_20.fit_transform(data_train.data)
top_20.vocabulary_

{u'data': 0,
 u'does': 1,
 u'don': 2,
 u'edu': 3,
 u'god': 4,
 u'good': 5,
 u'graphics': 6,
 u'image': 7,
 u'jesus': 8,
 u'just': 9,
 u'know': 10,
 u'like': 11,
 u'nasa': 12,
 u'people': 13,
 u'say': 14,
 u'space': 15,
 u'think': 16,
 u'time': 17,
 u'use': 18,
 u'way': 19}
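
Note that the values in `vocabulary_` are column indices, not word counts, so the dictionary above tells us which 20 words were kept but not how often each occurs. One possible way to rank words by their total count is to sum the columns of the count matrix. The sketch below does this on a made-up three-document corpus; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only.
docs = ["space nasa space", "nasa launch", "space orbit"]

cv = CountVectorizer()
X = cv.fit_transform(docs)

# Total occurrences of each term across all documents.
totals = np.asarray(X.sum(axis=0)).ravel()

# Invert the word -> column mapping so columns can be named.
index_to_word = {idx: word for word, idx in cv.vocabulary_.items()}

top_n = 2
top = [(index_to_word[i], int(totals[i])) for i in np.argsort(-totals)[:top_n]]
print(top)
```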

In [25]:
# what are the 20 most common words in each of the 4 classes?

X = pd.DataFrame(data_train.data, columns = ['data'])
y = pd.DataFrame(data_train.target, columns = ['class'])
df_common = pd.concat([X, y], axis = 1)
df_common.head()

| | data | class |
|---|---|---|
| 0 | "Hi,\n\nI've noticed that if you only save a mo..." | 1 |
| 1 | "\n\nSeems to be, barring evidence to the contr..." | 3 |
| 2 | \n >In article <1993Apr19.020359.26996@sq.sq.c... | 2 |
| 3 | I have a request for those who would like to s... | 0 |
| 4 | AW&ST had a brief blurb on a Manned Lunar Exp... | 2 |


In [27]:
listy = []
for a in range(4):
    top_20 = CountVectorizer(stop_words='english', max_features = 20)
    # .loc replaces the deprecated .ix indexer: select the texts belonging to class a
    top_20.fit_transform(df_common.loc[df_common['class'] == a, 'data'])
    
    listy.append([x for x in top_20.vocabulary_])
    print a
    print top_20.vocabulary_

0
{u'people': 12, u'time': 17, u'argument': 0, u'say': 15, u'religion': 13, u'atheists': 2, u'don': 6, u'jesus': 8, u'does': 5, u'way': 19, u'true': 18, u'atheism': 1, u'said': 14, u'just': 9, u'think': 16, u'bible': 4, u'like': 11, u'god': 7, u'believe': 3, u'know': 10}
1
{u'format': 7, u'data': 2, u'image': 11, u'gif': 9, u'ftp': 8, u'graphics': 10, u'does': 3, u'software': 18, u'available': 0, u'use': 19, u'pub': 17, u'like': 15, u'images': 12, u'file': 5, u'edu': 4, u'jpeg': 13, u'color': 1, u'program': 16, u'files': 6, u'know': 14}
2
{u'people': 12, u'time': 17, u'data': 0, u'just': 3, u'year': 19, u'space': 16, u'launch': 4, u'orbit': 11, u'new': 10, u'don': 1, u'lunar': 6, u'shuttle': 15, u'like': 5, u'earth': 2, u'satellite': 14, u'moon': 8, u'program': 13, u'mission': 7, u'nasa': 9, u'use': 18}
3
{u'life': 11, u'people': 13, u'time': 18, u'say': 16, u'jesus': 8, u'does': 4, u'way': 19, u'think': 17, u'don': 5, u'said': 15, u'just': 9, u'did': 3, u'bible': 1, u'good': 7, u'beli

In [34]:
# evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
# you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it

def log_reg_score(vectorizer):
    X_train = vectorizer.fit_transform(data_train.data)
    y_train = data_train.target
    X_test = vectorizer.transform(data_test.data)
    y_test = data_test.target

    model_log_reg = LogisticRegression()
    model_log_reg.fit(X_train, y_train)
    y_predictions = model_log_reg.predict(X_test)
    confusion_matrix_ = confusion_matrix(y_test, y_predictions)
    
    print 'Score:', model_log_reg.score(X_test, y_test)
    print pd.DataFrame(confusion_matrix_, columns = ['pred_0','pred_1','pred_2','pred_3'], index = ['0','1','2','3'])    
    print 'Word Count:', len(vectorizer.vocabulary_)
    
v = CountVectorizer(stop_words='english')
log_reg_score(v)

Score: 0.745011086475
   pred_0  pred_1  pred_2  pred_3
0     187      16      46      70
1      13     345      28       3
2      22      23     333      16
3      67      14      27     143
Word Count: 26576
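
To investigate what gets mixed up, it can help to turn the confusion matrix into per-class recall and to locate the largest off-diagonal entry. The sketch below reuses the matrix printed above (rows are true classes, columns are predictions). It assumes the labels follow the alphabetical order of the category names, so 0 would be `alt.atheism` and 3 `talk.religion.misc`, which is consistent with the per-class word lists found earlier.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = np.array([
    [187, 16, 46, 70],
    [13, 345, 28, 3],
    [22, 23, 333, 16],
    [67, 14, 27, 143],
])

# Overall accuracy is the share of documents on the diagonal.
accuracy = np.trace(cm) / float(cm.sum())

# Recall per class: correct predictions divided by the true class size.
recall = cm.diagonal() / cm.sum(axis=1).astype(float)

# Zero the diagonal to find the single most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print("accuracy: %.3f" % accuracy)
print("per-class recall: %s" % np.round(recall, 3))
print("most frequent confusion: true %d predicted as %d" % (true_cls, pred_cls))
```

Under that label assumption, the two religion-related groups are the ones most often confused with each other, while the graphics and space groups are recovered far more reliably.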


In [38]:
# try the following 3 modifications:

# restrict max_features

for i in [10, 100, 1000, 10000]:
    print 'max_features =', i
    v = CountVectorizer(stop_words='english', max_features = i)
    log_reg_score(v)
    print '\n'

    
for i in [.1, .2]:
    print 'min_df =', i
    v = CountVectorizer(stop_words='english', min_df = i)
    log_reg_score(v)
    print '\n'

max_features = 10
Score: 0.427198817443
   pred_0  pred_1  pred_2  pred_3
0     108     160      37      14
1      23     310      55       1
2      31     214     148       1
3      89     129      21      12
Word Count: 10


max_features = 100
Score: 0.619364375462
   pred_0  pred_1  pred_2  pred_3
0     161      28      56      74
1      19     302      59       9
2      38      34     298      24
3     107      19      48      77
Word Count: 100


max_features = 1000
Score: 0.69696969697
   pred_0  pred_1  pred_2  pred_3
0     172      15      59      73
1      16     331      36       6
2      28      29     312      25
3      76      16      31     128
Word Count: 1000


max_features = 10000
Score: 0.742054693274
   pred_0  pred_1  pred_2  pred_3
0     186      15      46      72
1      14     344      28       3
2      21      27     331      15
3      67      14      27     143
Word Count: 10000


min_df = 0.1
Score: 0.441241685144
   pred_0  pred_1  pred_2  pred_3
0     116   

In [40]:
# change max_df and min_df
for i in [.1, .2, .3, .4, .5]:
    print 'max_df =', i
    v = CountVectorizer(stop_words='english', max_df = i)
    log_reg_score(v)
    print '\n'

for i in [.1, .2, .3, .4, .5]:
    print 'min_df =', i
    v = CountVectorizer(stop_words='english', min_df = i)
    log_reg_score(v)
    print '\n'

max_df = 0.1
Score: 0.746489283075
   pred_0  pred_1  pred_2  pred_3
0     186      16      52      65
1      11     341      34       3
2      21      27     335      11
3      59      14      30     148
Word Count: 26558


max_df = 0.2
Score: 0.747967479675
   pred_0  pred_1  pred_2  pred_3
0     184      15      48      72
1       9     345      32       3
2      19      25     336      14
3      64      16      24     147
Word Count: 26572


max_df = 0.3
Score: 0.745011086475
   pred_0  pred_1  pred_2  pred_3
0     187      16      46      70
1      13     345      28       3
2      22      23     333      16
3      67      14      27     143
Word Count: 26576


max_df = 0.4
Score: 0.745011086475
   pred_0  pred_1  pred_2  pred_3
0     187      16      46      70
1      13     345      28       3
2      22      23     333      16
3      67      14      27     143
Word Count: 26576


max_df = 0.5
Score: 0.745011086475
   pred_0  pred_1  pred_2  pred_3
0     187      16      46      

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
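
The `ValueError` above comes from the `min_df` loop: a float `min_df` is read as a proportion of documents, so a large enough value prunes every term. The sketch below reproduces the behaviour on a made-up corpus in which each word occurs in exactly one document; `vocabulary_size` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus in which every word occurs in exactly one document.
docs = ["apple banana", "cherry grape", "lemon mango"]


def vocabulary_size(min_df):
    """Return the vocabulary size, or None if pruning removes every term."""
    try:
        return len(CountVectorizer(min_df=min_df).fit(docs).vocabulary_)
    except ValueError:
        return None


# Integer: keep terms that appear in at least 1 document.
print(vocabulary_size(1))
# Float: keep terms that appear in at least 90% of the documents.
print(vocabulary_size(0.9))
```

Wrapping the fit in a `try`/`except` like this would let a parameter sweep continue past the values that leave no features.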

In [41]:
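# use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
# NOTE: max_features=80 keeps the 80 most frequent terms of the whole corpus, which only
# approximates that task; the per-group lists collected in `listy` are not used here
v = CountVectorizer(stop_words='english', max_features=80)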
log_reg_score(v)

Score: 0.60458240946
   pred_0  pred_1  pred_2  pred_3
0     152      28      68      71
1      16     301      62      10
2      43      36     292      23
3     102      21      55      73
Word Count: 80
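
The cell above relies on `max_features=80`, which selects the most frequent words overall. A fixed vocabulary built from the per-group lists could instead be passed through the `vocabulary` parameter. The sketch below shows the idea on made-up documents with two classes and the top 2 words per class; all names and data are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents and labels, for illustration only.
docs = ["god god bible faith", "god bible", "image image file color", "image file"]
labels = [0, 0, 1, 1]
top_k = 2  # the lab uses 20

# Collect the top_k most frequent words of each class into one set.
combined = set()
for cls in sorted(set(labels)):
    class_docs = [d for d, y in zip(docs, labels) if y == cls]
    per_class = CountVectorizer(max_features=top_k).fit(class_docs)
    combined.update(per_class.vocabulary_)

# A sorted list gives a reproducible column order.
fixed_vocabulary = sorted(combined)
fixed = CountVectorizer(vocabulary=fixed_vocabulary)
X = fixed.fit_transform(["god image unknown"])

print(fixed_vocabulary)
print(X.toarray().tolist())
```

On the real data the combined vocabulary would likely contain fewer than 80 words, because some words (for example `people`) appear in the top 20 of several groups.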


## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- Does the score improve with respect to the count vectorizer?
    - Can you change any of the default parameters to improve it?
- Print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- Can you improve on your best score above?
    - Can you change any of the default parameters to improve it?
- Print out the number of features for this model

In [52]:
def log_reg_score_2(vectorizer):
    X_train = vectorizer.fit_transform(data_train.data)
    y_train = data_train.target
    X_test = vectorizer.transform(data_test.data)
    y_test = data_test.target

    model_log_reg = LogisticRegression()
    model_log_reg.fit(X_train, y_train)
    y_predictions = model_log_reg.predict(X_test)
    confusion_matrix_ = confusion_matrix(y_test, y_predictions)
    
    print 'Score:', model_log_reg.score(X_test, y_test)
    print pd.DataFrame(confusion_matrix_, columns = ['pred_0','pred_1','pred_2','pred_3'], index = ['0','1','2','3'])    
    # n_features exists only on HashingVectorizer, which has no learned vocabulary_
    print 'Number of features:', vectorizer.n_features
    
v = CountVectorizer(stop_words='english')
log_reg_score(v)

Score: 0.745011086475
   pred_0  pred_1  pred_2  pred_3
0     187      16      46      70
1      13     345      28       3
2      22      23     333      16
3      67      14      27     143
Word Count: 26576


In [54]:
# Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
# Print number of features
v = HashingVectorizer(stop_words='english')
log_reg_score_2(v)

Score: 0.736881005174
   pred_0  pred_1  pred_2  pred_3
0     197      15      65      42
1       9     347      32       1
2      21      23     350       0
3      86      18      44     103
Number of features: 1048576
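
The 1048576 features reported above are simply the default `n_features` of `2 ** 20`: the hashing vectorizer maps each token to a column with a hash function instead of learning a dictionary. A minimal sketch on made-up sentences, with a smaller `n_features` chosen arbitrarily for comparison:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up sentences, for illustration only.
docs = ["the shuttle reached orbit", "the image file is a gif"]

default_hv = HashingVectorizer()
small_hv = HashingVectorizer(n_features=2 ** 10)

# transform() needs no fit(): the token -> column mapping is a hash function.
X_default = default_hv.transform(docs)
X_small = small_hv.transform(docs)

print(X_default.shape)
print(X_small.shape)
# Nothing is learned, so there is no vocabulary_ attribute to inspect.
print(hasattr(small_hv, "vocabulary_"))
```

The trade-off is that columns can no longer be mapped back to words, and with a small `n_features` different words may collide in the same column.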


In [55]:
print "Does the score improve with respect to the count vectorizer? NO"

Does the score improve with respect to the count vectorizer? NO


In [57]:
# Initialize a TF-IDF Vectorizer and repeat the analysis above
v = TfidfVectorizer(stop_words='english')
log_reg_score(v)


Score: 0.747967479675
   pred_0  pred_1  pred_2  pred_3
0     198      15      65      41
1       8     351      29       1
2      17      21     356       0
3      82      16      46     107
Word Count: 26576
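
TF-IDF keeps the same 26576-word vocabulary but reweights the counts, which here gives only a marginal gain over raw counts (0.748 versus 0.745). As a sketch of what the reweighting does, the code below recomputes the weights by hand on a made-up corpus and compares them with `TfidfVectorizer`. It assumes the default settings (smoothed idf and L2 row normalisation) as described in the scikit-learn documentation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, for illustration only.
docs = ["space nasa space", "nasa launch", "god bible god"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Weight the raw counts, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that occur in many documents get a smaller idf, so they contribute less to each row than rare, more distinctive terms.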


## 4. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
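
A minimal sketch of that pattern: the vectorizer is fitted once, and every model is then trained and scored on the same matrices. The tiny corpus, labels and model settings below are made up to keep the example fast, so the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test data, for illustration only.
train_docs = [
    "nasa launch orbit", "shuttle orbit moon",
    "lunar mission launch", "satellite orbit earth",
    "image gif file", "jpeg image format",
    "graphics file format", "color image software",
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["moon mission orbit", "gif file format"]
test_y = [0, 1]

# Vectorize once, outside the model loop.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logreg": LogisticRegression(),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Every model sees exactly the same features.
scores = {}
for name, model in models.items():
    model.fit(X_train, train_y)
    scores[name] = model.score(X_test, test_y)

for name in sorted(scores):
    print("%s: %.2f" % (name, scores[name]))
```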

In [61]:
listy_models = [KNeighborsClassifier(), LogisticRegression(), DecisionTreeClassifier(), SVC(), RandomForestClassifier(), ExtraTreesClassifier()]

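def model_score(model, v):
    # NOTE: this re-fits the vectorizer on every call; since `v` is the same for all models,
    # X_train and X_test could be computed once outside this function to save time
    X_train = v.fit_transform(data_train.data)
    X_test = v.transform(data_test.data)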
    y_train = data_train.target
    y_test = data_test.target
    model.fit(X_train, y_train)
    y_predictions = model.predict(X_test)
    
    print 'score:', model.score(X_test, y_test)
    print  'con mat:','\n', confusion_matrix(y_test, y_predictions)
    print 'number of words:', len(v.vocabulary_)

In [63]:
v = TfidfVectorizer(stop_words='english', max_df=0.2)
for i in listy_models:
    print i
    model_score(i, v)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
score: 0.252032520325
con mat: 
[[122 108  34  55]
 [158 101  62  68]
 [118 135  72  69]
 [ 86  80  39  46]]
number of words: 26572
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
score: 0.749445676275
con mat: 
[[194  14  66  45]
 [  7 351  30   1]
 [ 13  22 359   0]
 [ 79  16  46 110]]
number of words: 26572
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='be

## Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare all the suggested classifiers and to display the final plot.

## Bonus: NLTK

NLTK is a vast library. Can you find some interesting bits to share with classmates?
Start here: http://www.nltk.org/