# Natural Language Processing Lab

In this lab we will further explore scikit-learn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which scikit-learn provides.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


## 1. Data inspection

We have downloaded four newsgroup categories and removed headers, footers and quotes.

Because this is a scikit-learn dataset, it comes with pre-split train and test sets (note that we could pass 'train' and 'test' as the `subset` argument).

Let's inspect them.

1. What data type is `data_train`?
> sklearn.utils.Bunch
- Is it like a list? Or like a dictionary? Or something else?
> It behaves like a dictionary.
- How many data points does it contain?
> 2034
- Inspect the first data point. What does it look like?
> A blurb of text
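
The dictionary-like behaviour can be sketched with a small stand-in class. `MiniBunch` below is a hypothetical illustration, not scikit-learn's actual `Bunch` implementation; it only assumes that a Bunch is a dict whose keys can also be read as attributes. Note that iterating over such an object yields its keys rather than the documents, which matters whenever the whole object (instead of its `data` entry) is handed to a vectorizer.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for a dict with attribute-style access."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


toy = MiniBunch(data=["first post", "second post"], target=[0, 1])

# Key access and attribute access return the same object.
same_object = toy["data"] is toy.data

# Iterating over the container yields the keys, not the documents.
iterated = list(toy)

print(same_object)
print(iterated)
print(len(toy.data))
```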

In [4]:
type(data_train)

sklearn.utils.Bunch

In [6]:
data_train.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']

In [7]:
len(data_train['data'])

2034

In [8]:
len(data_train['target'])

2034

In [9]:
data_train['data'][0]

u"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit it on the training data
- How big is the feature dictionary?
- Repeat, this time eliminating English stop words
- Is the dictionary smaller?
- Transform the training data using the fitted vectorizer
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer
    - You will have to transform the test set too. Be careful to use the fitted vectorizer, without re-fitting it
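
To make the fit/transform distinction concrete, here is a from-scratch sketch of a bag-of-words model in plain Python. It is not how scikit-learn implements `CountVectorizer`; the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are made up for illustration. The only detail taken from the library is the default token pattern shown in the vectorizer's parameters, which keeps words of two or more characters.

```python
import re
from collections import Counter

# Same token pattern as CountVectorizer's default: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Map every distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_docs = ["The rocket reached orbit", "Render the orbit in 3D graphics"]
test_docs = ["A new rocket and a new telescope"]

vocabulary = fit_vocabulary(train_docs)      # learned from the training set only
X_train = transform(train_docs, vocabulary)
X_test = transform(test_docs, vocabulary)    # reuses the vocabulary, no re-fitting

print(list(vocabulary))
print(X_train)
print(X_test)
```

Because the test document is transformed with the training vocabulary, "new" and "telescope" are simply dropped and only "rocket" is counted. Re-fitting on the test set would instead produce columns that no longer line up with the ones the classifier was trained on.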

BONUS:
- try a couple modifications:
    - restrict the max_features
    - change max_df and min_df
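
The effect of `min_df` and `max_df` can be sketched as a filter on document frequency. The sketch below is plain Python, not scikit-learn's code, and the function names and toy documents are hypothetical. It assumes `min_df` is given as an absolute document count and `max_df` as a fraction of the corpus (in scikit-learn both parameters accept either form).

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count in how many documents each token appears (not how often)."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    return sorted(tok for tok, n in df.items() if min_df <= n <= max_df * n_docs)


docs = [
    ["the", "rocket", "reached", "orbit"],
    ["the", "orbit", "is", "stable"],
    ["the", "graphics", "card", "is", "new"],
    ["the", "new", "rocket"],
]

kept_all = prune_vocabulary(docs)
kept_pruned = prune_vocabulary(docs, min_df=2, max_df=0.75)

print(len(kept_all))
print(kept_pruned)
```

With these toy documents, "the" appears in every document and is removed by `max_df=0.75`, while words that occur in a single document are removed by `min_df=2`.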

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
# remove English stop words (accuracy is measured with tokenize_test once it is defined further down)
vect = CountVectorizer(stop_words='english')

In [25]:
# show vectorizer options
vect

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
# learn the 'vocabulary' of the training data
# (pass the list of documents in data_train.data; iterating over the Bunch itself only yields its six keys)
vect = CountVectorizer()
vect.fit(data_train.data)
# on scikit-learn >= 1.0, use vect.get_feature_names_out() instead
len(vect.get_feature_names())

In [12]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(data_train.data)
simple_train_dtm

In [13]:
# print the sparse matrix entries for the first document
print(simple_train_dtm[0])

In [26]:
# rows are documents, columns are terms (aka "tokens" or "features")
simple_train_dtm.shape

In [27]:
# include 1-grams and 2-grams (separate vectorizer so the unigram one above stays fitted)
vect_ngram = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect_ngram.fit_transform(data_train.data)
X_train_dtm.shape

In [14]:
# convert the first five rows of the sparse matrix to a dense matrix
simple_train_dtm[:5].toarray()

In [15]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm[:5].toarray(), columns=vect.get_feature_names())

In [21]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(data_test.data)
X_test_dtm

In [22]:
# print the sparse matrix entries for the first test document
print(X_test_dtm[0])

In [23]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(X_test_dtm[:5].toarray(), columns=vect.get_feature_names())



In [29]:
# define a function that accepts a vectorizer and calculates the accuracy
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# class labels that go with data_train.data and data_test.data
y_train, y_test = data_train.target, data_test.target

def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(data_train.data)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(data_test.data)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [31]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
vect.get_params()

## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- print out the number of features for this model
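
The idea behind a hashing vectorizer can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: `hashed_counts` is a hypothetical helper, `zlib.crc32` stands in for the MurmurHash3 function the library uses, and the tiny `n_features=16` is chosen only to keep the rows readable (scikit-learn's default is 2**20 columns).

```python
import zlib


def hashed_counts(tokens, n_features=16):
    """Map each token to a column via a hash; no vocabulary is stored."""
    row = [0] * n_features
    for tok in tokens:
        # crc32 is a stand-in here; scikit-learn uses MurmurHash3.
        column = zlib.crc32(tok.encode("utf-8")) % n_features
        row[column] += 1
    return row


train_row = hashed_counts(["rocket", "orbit", "rocket"])
unseen_row = hashed_counts(["telescope"])  # works without any fitting step

print(len(train_row), sum(train_row))
print(len(unseen_row), sum(unseen_row))
```

The number of features is fixed in advance by `n_features` rather than learned from the data, so it does not change when the corpus changes. The price is that different tokens can collide in the same column and that the mapping cannot be inverted to recover feature names.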

BONUS
- Change the parameters of either (or both!) models to improve your score
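
For reference, here is a from-scratch sketch of TF-IDF weighting in plain Python. It assumes the smoothed formula documented for scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row; the function name `tfidf_rows` and the toy documents are hypothetical, and the library's actual implementation works on sparse matrices rather than lists.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = [
    ["the", "rocket", "reached", "orbit"],
    ["the", "orbit", "is", "stable"],
    ["the", "graphics", "card"],
]
vocabulary, rows = tfidf_rows(docs)
first = dict(zip(vocabulary, rows[0]))

# Each word occurs once in the first document, so differences come from idf alone.
for word in ["the", "orbit", "rocket"]:
    print(word, round(first[word], 3))
```

In the first document "the", "orbit" and "rocket" all occur once, yet "the" (present in every document) receives the smallest weight and "rocket" (present in one document) the largest.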

In [32]:
# use the first six documents so the dense tables below stay small
sample_docs = data_train.data[:6]
vect.fit_transform(sample_docs).toarray()

In [33]:
# on scikit-learn >= 1.0, use vect.get_feature_names_out() instead
vect.get_feature_names()

In [35]:
# Term Frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(sample_docs).toarray(), columns=vect.get_feature_names())
tf

In [36]:
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(sample_docs).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, -1), columns=vect.get_feature_names())

In [37]:
tf.sum()

In [38]:
# Term Frequency-Inverse Document Frequency (simple version)
tf/df

In [40]:
# TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect = TfidfVectorizer()
tfidf = pd.DataFrame(vect.fit_transform(sample_docs).toarray(), columns=vect.get_feature_names())
tfidf

In [41]:
tfidf.mean()

In [42]:
# keep only the single most frequent term
vect = TfidfVectorizer(max_features=1)
vect.fit_transform(sample_docs).toarray()
vect.get_feature_names()

In [43]:
# create a document-term matrix using TF-IDF on the full training set
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(data_train.data)
features = vect.get_feature_names()
dtm.shape

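In [44]:
# use logistic regression on the TF-IDF features of the text
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

y_train, y_test = data_train.target, data_test.target
X_train_dtm = vect.fit_transform(data_train.data)
X_test_dtm = vect.transform(data_test.data)  # reuse the fitted vocabulary
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))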