<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [1]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1) Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the function documentation for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Additionally, remove the headers, footers, and quotes using the function's `remove` keyword argument.

In [3]:
newsgroups_train = fetch_20newsgroups(subset='train')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [101]:
categories = ['alt.atheism', 'talk.religion.misc','comp.graphics','sci.space']
newsgroups_train2 = fetch_20newsgroups(subset='train',categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test2 = fetch_20newsgroups(subset='test',categories=categories, remove=('headers', 'footers', 'quotes'))

In [102]:
print(newsgroups_train2.target.shape)
print(newsgroups_train2.data[:1])

(2034L,)
[u"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"]


### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: we were able to call "train" and "test" in subset).

Let's inspect them.

1) What data type is `data_train` (here, `newsgroups_train2.data`)?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [103]:
print(type(newsgroups_train2.data))
print(len(newsgroups_train2.data))
print(newsgroups_train2.data[0])

<type 'list'>
2034
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


In [104]:
print(type(newsgroups_test2.target))
print(len(newsgroups_test2.target))
print(newsgroups_test2.target[0])

<type 'numpy.ndarray'>
1353
2


In [105]:
pd.DataFrame(newsgroups_test2.target)[0].unique()

array([2, 1, 0, 3], dtype=int64)
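
The integers above are positions in the data set's `target_names` list. As far as we can tell, `fetch_20newsgroups` returns `target_names` in alphabetical order, whatever order was passed in `categories`, so it is worth checking `newsgroups_test2.target_names` before labelling any output. A minimal sketch of that mapping, using a sorted copy of the category list as a stand-in (an assumption, so no download is needed):

```python
# Stand-in for newsgroups_test2.target_names (assumption: scikit-learn
# returns the selected categories in alphabetical order).
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
target_names = sorted(categories)

# Map each integer label to its newsgroup name.
label_to_name = dict(enumerate(target_names))
for label, name in sorted(label_to_name.items()):
    print("%d -> %s" % (label, name))
```

With the real data, `newsgroups_test2.target_names[label]` gives the same lookup directly.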

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the `test_set`, too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
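
The steps above can be sketched on a toy corpus before running them on the newsgroups data. The documents and variable names below are hypothetical; the point is only to show the vocabulary shrinking when stop words are removed, and the test text being passed through `transform()` on the already-fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for the newsgroups posts.
toy_train = [
    "the shuttle reached orbit",
    "the image was rendered on the graphics card",
    "the orbit of the moon",
]
toy_test = ["a graphics card in orbit"]

# Fit one vectorizer with and one without English stop-word removal.
cv_all = CountVectorizer().fit(toy_train)
cv_stop = CountVectorizer(stop_words='english').fit(toy_train)
print("all terms: %d" % len(cv_all.vocabulary_))
print("without stop words: %d" % len(cv_stop.vocabulary_))

# Reuse the fitted vectorizer on the test text: transform(), not fit_transform().
X_toy_test = cv_stop.transform(toy_test)
print(X_toy_test.shape)
```

Because the test matrix is built from the training vocabulary, it has the same number of columns as the training matrix, which is what the classifier expects.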

In [108]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# max_features: keep only the top `max_features` terms, ordered by term
#   frequency across the corpus.
# max_df: ignore terms whose document frequency is strictly higher than the
#   threshold. When no stop-word list is used, a value in [0.7, 1.0) can
#   filter corpus-specific stop words automatically.
# min_df: ignore terms whose document frequency is strictly lower than the
#   threshold (also called the cut-off in the literature).
# For max_df and min_df, a float is a proportion of documents and an integer
# is an absolute count.
cvec = CountVectorizer(stop_words='english')
# cvec = CountVectorizer(max_df=0.999, min_df=0.001)
cvec.fit(newsgroups_train2.data)

# Vocabulary sizes observed: 26576 with stop_words='english', 26879 without
# stop-word removal, and 12442 with max_df=0.999, min_df=0.001.
print(len(cvec.vocabulary_))

26576
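
To make the frequency thresholds concrete, here is a small sketch on a hypothetical four-document corpus (the documents are made up for illustration). It prunes rare and common terms with `min_df` and `max_df`, then keeps a single term with `max_features`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "space" appears in 3 documents, "graphics" and "card"
# in 2 each, and every other term in 1.
toy_docs = [
    "space shuttle launch",
    "space station orbit",
    "space graphics card",
    "graphics card driver",
]

# min_df=2 keeps terms found in at least 2 documents;
# max_df=0.5 drops terms found in more than 50% of the documents.
pruned = CountVectorizer(min_df=2, max_df=0.5).fit(toy_docs)
print(sorted(pruned.vocabulary_))

# max_features=1 keeps only the most frequent term across the corpus.
top1 = CountVectorizer(max_features=1).fit(toy_docs)
print(sorted(top1.vocabulary_))
```

The same arguments apply unchanged to the full newsgroups vectorizer; only the thresholds need tuning.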


In [160]:
# Note: recent scikit-learn releases rename get_feature_names() to
# get_feature_names_out().
X_train = pd.DataFrame(cvec.transform(newsgroups_train2.data).todense(),
                       columns=cvec.get_feature_names())
X_train.head()
# X_train.transpose().sort_values(0, ascending=False).head(10)

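| | 00 | 000 | 0000 | 00000 | 000000 | 000005102000 | 000062david42 | 0001 | 000100255pixel | 00041032 | ... | zurich | zurvanism | zus | zvi | zwaartepunten | zwak | zwakke | zware | zwarte | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |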


In [110]:
# Vectorize the test set with the already-fitted vectorizer (transform only, no refit).
X_test = pd.DataFrame(cvec.transform(newsgroups_test2.data).todense(),
                      columns=cvec.get_feature_names())

In [120]:
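from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# The prompt suggests logistic regression; this solution fits a multinomial
# naive Bayes classifier on the count features instead.
clf = MultinomialNB(alpha=0.1)
clf.fit(X_train, newsgroups_train2.target)
pred = clf.predict(X_test)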

In [121]:
pd.DataFrame(newsgroups_test2.target)[0].value_counts()

2    394
1    389
0    319
3    251
Name: 0, dtype: int64

In [122]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# target_names is in label order (0-3), which is alphabetical rather than the
# order of the `categories` list, so build the row labels from it.
row_labels = ['%s_%d' % (name, i)
              for i, name in enumerate(newsgroups_test2.target_names)]
conmat = np.array(confusion_matrix(newsgroups_test2.target, pred))
confusion = pd.DataFrame(conmat, index=row_labels,
                         columns=['predicted_0', 'predicted_1', 'predicted_2', 'predicted_3'])
print(confusion)
print(classification_report(newsgroups_test2.target, pred))
metrics.f1_score(newsgroups_test2.target, pred, average='macro')  # 0.76 with stop words removed

                      predicted_0  predicted_1  predicted_2  predicted_3
alt.atheism_0                 227            4           28           60
comp.graphics_1                11          351           24            3
sci.space_2                    19           21          343           11
talk.religion.misc_3           82            7           21          141
             precision    recall  f1-score   support

          0       0.67      0.71      0.69       319
          1       0.92      0.90      0.91       389
          2       0.82      0.87      0.85       394
          3       0.66      0.56      0.61       251

avg / total       0.78      0.78      0.78      1353



0.7628399561432191
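
As a sanity check, the macro-averaged F1 score can be recomputed by hand from the confusion matrix printed above. This is a plain-Python sketch, not how scikit-learn computes it internally: per-class precision is the diagonal entry divided by its column sum, recall is the diagonal entry divided by its row sum, and the macro average is the unweighted mean of the per-class F1 values.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
conf = [
    [227, 4, 28, 60],
    [11, 351, 24, 3],
    [19, 21, 343, 11],
    [82, 7, 21, 141],
]

f1_scores = []
for k in range(len(conf)):
    true_pos = conf[k][k]
    predicted_k = sum(row[k] for row in conf)  # column sum: predicted as class k
    actual_k = sum(conf[k])                    # row sum: support of class k
    precision = true_pos / float(predicted_k)
    recall = true_pos / float(actual_k)
    f1_scores.append(2 * precision * recall / (precision + recall))

# Macro average: every class counts equally, regardless of its support.
macro_f1 = sum(f1_scores) / len(f1_scores)
print("macro F1: %.4f" % macro_f1)
```

The result agrees with the `f1_score(..., average='macro')` value above to four decimal places.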

In [164]:
ath = fetch_20newsgroups(subset='test', categories=['alt.atheism'], remove=('headers', 'footers', 'quotes'))

In [165]:
len(ath.data)

319

## Grid search through a pipeline to find the best parameters

In [149]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# from sklearn.feature_extraction.text import TfidfTransformer
# from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


# #############################################################################
# Load some categories from the training set
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 4 newsgroups dataset for categories:")
print(categories)

train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
print("%d documents" % len(train.filenames))
print("%d categories" % len(train.target_names))
print()

# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    #('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__min_df': (0.1, 0.01),
    'vect__max_features': (None, 1000, 20000, 30000),
    # 'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),  # unigrams, up to bigrams, up to trigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.05, 0.1, 0.15),
    # 'clf__penalty' and 'clf__n_iter' from the original SGDClassifier example
    # do not apply to MultinomialNB.
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
#     t0 = time()
    grid_search.fit(train.data, train.target)
#     print("done in %0.3fs" % (time() - t0))
#     print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Automatically created module for IPython interactive environment
Loading 4 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
2034 documents
4 categories

Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__alpha': (0.05, 0.1, 0.15),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 1000, 20000, 30000),
 'vect__min_df': (0.1, 0.01)}
Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   26.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed:  2.7min finished


Best score: 0.760
Best parameters set:
	clf__alpha: 0.15
	vect__max_df: 0.5
	vect__max_features: None
	vect__min_df: 0.01
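
The "72 candidates, totalling 216 fits" line in the log follows directly from the size of the grid. A quick sketch of the arithmetic, enumerating the same four parameter lists with `itertools.product` (the fold count of 3 is taken from the log; more recent scikit-learn releases default to 5 folds, so the total would differ there):

```python
from itertools import product

# Same grid as the `parameters` dictionary above.
grid = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__min_df': (0.1, 0.01),
    'vect__max_features': (None, 1000, 20000, 30000),
    'clf__alpha': (0.05, 0.1, 0.15),
}

# One candidate per combination of values: 3 * 2 * 4 * 3.
names = sorted(grid)
candidates = [dict(zip(names, values))
              for values in product(*(grid[n] for n in names))]

n_folds = 3  # assumption: the fold count reported in the log above
print("%d candidates, %d fits" % (len(candidates), len(candidates) * n_folds))
```

Each extra parameter list multiplies the candidate count, which is why the grid is kept small here.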


In [157]:
# Manual equivalent without a pipeline, kept for reference:
# cvec = CountVectorizer(stop_words='english', max_df=0.5)
# cvec.fit(train.data)
# X_train = cvec.transform(train.data)
# clf = MultinomialNB(alpha=0.15)
# clf.fit(X_train, train.target)
from sklearn.linear_model import LogisticRegression

# `train` is the training subset loaded in the grid-search cell above.
news_class = Pipeline([
    ('vect', CountVectorizer(stop_words='english', max_df=0.9)),
    # ('vect', TfidfVectorizer(stop_words='english', max_df=0.9)),
    ('clf', MultinomialNB(alpha=0.15)),
    # ('logit', LogisticRegression()),
])
news_class_fit = news_class.fit(train.data, train.target)

In [158]:
test = fetch_20newsgroups(subset='test',categories=categories, remove=('headers', 'footers', 'quotes'))
news_class_pred = news_class_fit.predict(test.data)

In [159]:
# Build row labels from target_names, which is in label order (alphabetical),
# not the order of the `categories` list.
row_labels = ['%s_%d' % (name, i) for i, name in enumerate(test.target_names)]
conmat = np.array(confusion_matrix(test.target, news_class_pred))
confusion = pd.DataFrame(conmat, index=row_labels,
                         columns=['predicted_0', 'predicted_1', 'predicted_2', 'predicted_3'])
print(confusion)
print(classification_report(test.target, news_class_pred))
metrics.f1_score(test.target, news_class_pred, average='macro')

                      predicted_0  predicted_1  predicted_2  predicted_3
alt.atheism_0                 187           16           46           70
comp.graphics_1                13          345           28            3
sci.space_2                    22           23          333           16
talk.religion.misc_3           67           14           27          143
             precision    recall  f1-score   support

          0       0.65      0.59      0.62       319
          1       0.87      0.89      0.88       389
          2       0.77      0.85      0.80       394
          3       0.62      0.57      0.59       251

avg / total       0.74      0.75      0.74      1353



0.72208976281306

### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.
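
The cells below only cover TF-IDF, so here is a hedged sketch of the hashing half of the exercise on a hypothetical toy corpus. It assumes a scikit-learn version in which `HashingVectorizer` accepts `alternate_sign`; setting it to `False` keeps the hashed values non-negative, which `MultinomialNB` requires. The documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels (0 = space, 1 = graphics).
toy_docs = [
    "the shuttle reached orbit",
    "the probe left orbit",
    "the image was rendered on the graphics card",
    "a faster graphics card",
]
toy_labels = [0, 0, 1, 1]

# The hashing vectorizer is stateless: there is no vocabulary to fit, and the
# number of columns is fixed in advance by n_features.
hasher = HashingVectorizer(stop_words='english', alternate_sign=False)
X_hash = hasher.transform(toy_docs)
print("hashing features: %d" % X_hash.shape[1])

# The TF-IDF vectorizer learns a vocabulary, so its width depends on the corpus.
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(toy_docs)
print("tf-idf features: %d" % X_tfidf.shape[1])

# Either matrix can be passed to the same classifier.
nb = MultinomialNB(alpha=0.15).fit(X_hash, toy_labels)
print(nb.predict(hasher.transform(["graphics card benchmark"])))
```

On the newsgroups data, the same pattern applies by swapping the `'vect'` step of the pipeline; the hashed feature count stays fixed while the TF-IDF count matches the learned vocabulary.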

In [171]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

In [175]:
#pipeline
pclass = Pipeline([('vect', TfidfVectorizer(stop_words='english',max_df=0.9)),
                       ('clf', MultinomialNB(alpha=0.15))])
#                        ('logit', LogisticRegression())])

In [176]:
# Fit on the training subset (`train`, loaded in the grid-search cell) and predict on the test subset.
class_fit = pclass.fit(train.data, train.target)
class_pred = class_fit.predict(test.data)

In [179]:
# Confusion matrix and F1 score. Row labels come from target_names, which is
# in label order (alphabetical), not the order of the `categories` list.
row_labels = ['%s_%d' % (name, i) for i, name in enumerate(test.target_names)]
conmat = np.array(confusion_matrix(test.target, class_pred))
confusion = pd.DataFrame(conmat, index=row_labels,
                         columns=['predicted_0', 'predicted_1', 'predicted_2', 'predicted_3'])
print(confusion)
print(classification_report(test.target, class_pred))
metrics.f1_score(test.target, class_pred, average='macro')

                      predicted_0  predicted_1  predicted_2  predicted_3
alt.atheism_0                 228           10           40           41
comp.graphics_1                 7          358           23            1
sci.space_2                    19           18          355            2
talk.religion.misc_3           89           11           24          127
             precision    recall  f1-score   support

          0       0.66      0.71      0.69       319
          1       0.90      0.92      0.91       389
          2       0.80      0.90      0.85       394
          3       0.74      0.51      0.60       251

avg / total       0.79      0.79      0.78      1353



0.7627353148351768