# Natural Language Processing Lab

In this lab we will further explore scikit-learn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is a scikit-learn dataset, it comes pre-split into train and test sets (note that we were able to pass `'train'` and `'test'` as the `subset` argument).

Let's inspect them.

1. What data type is `data_train`?
> `sklearn.datasets.base.Bunch`
- Is it like a list, like a dictionary, or something else?
> It behaves like a dictionary.
- How many data points does it contain?
> 2034
- Inspect the first data point. What does it look like?
> A blurb of text
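
To make the "like a dictionary" answer concrete, here is a minimal sketch of a dictionary whose keys can also be read as attributes, which is how a `Bunch` behaves. `MiniBunch` and the toy values are hypothetical stand-ins written for illustration; they are not scikit-learn code and not the real dataset.

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys are also readable as attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


# Toy values, not the real newsgroup posts.
toy = MiniBunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["alt.atheism", "sci.space"],
)

# Dictionary-style and attribute-style access return the same object.
same_object = toy["data"] is toy.data
n_points = len(toy["data"])
first_point = toy.data[0]
first_label = toy.target_names[toy.target[0]]

print(sorted(toy.keys()), n_points, first_point, first_label)
```

This is why both `data_train['data']` and `data_train.data` can be used to reach the list of posts.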

In [158]:
type(data_train)

sklearn.datasets.base.Bunch

In [487]:
data_train.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']

In [160]:
len(data_train['data'])

2034

In [183]:
len(data_test['data'])

1353

In [161]:
len(data_train['target'])

2034

In [527]:
type(data_train['data'])

list

## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data
- how big is the feature dictionary?
- repeat, eliminating English stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it

BONUS:
- try a couple modifications:
    - restrict the max_features
    - change max_df and min_df
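
Before running the real thing, it may help to see what "fit" and "transform" mean for a bag-of-words model. The sketch below is a simplified pure-Python illustration, not scikit-learn's implementation: `fit_vocabulary`, `transform` and the toy sentences are hypothetical names and data. It uses the same token pattern as `CountVectorizer`'s default (words of two or more characters, lowercased).

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Learn a token -> column index mapping from the training documents only."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


toy_train = ["The shuttle reached orbit", "Render the image, then render it again"]
toy_test = ["The telescope reached orbit"]

vocab = fit_vocabulary(toy_train)      # fit on the training set only
X_train = transform(toy_train, vocab)
X_test = transform(toy_test, vocab)    # reuse the fitted vocabulary, no re-fitting

print(len(vocab), X_train[1][vocab["render"]], X_test[0])
```

Note that "telescope" never appears in the training sentences, so it has no column and is silently dropped from the test row. Re-fitting on the test set would instead produce a different set of columns that the trained model could not use.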

In [4]:
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
%matplotlib inline

In [None]:
# Removed all of the trial and error cells.  There were a lot!  Restarted the notebook to run the finals

In [5]:
# Vectorizer for Training set (fit vocabulary)
vect = CountVectorizer()
vect.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
# Transform and wrap in a DataFrame with one named column per token.
# todense() materializes the full matrix: fine at this size, memory-hungry on larger corpora.
ng_train = pd.DataFrame(vect.transform(data_train['data']).todense(), columns=vect.get_feature_names())

In [7]:
ng_train.shape

(2034, 26879)

In [8]:
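# No label column exists yet: this works only because "target" is itself a token in the vocabulary.
ng_train.target.shape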

(2034L,)

In [9]:
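# Adding target as a column to use for "y" in the regression model.
# The token column "target" already exists, so it is overwritten and the shape below does not change.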
ng_train['target'] = data_train['target']

In [10]:
ng_train.shape

(2034, 26879)

In [11]:
# Re-run with English stop words removed
vect = CountVectorizer(stop_words='english')
vect.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [12]:
ng_train = pd.DataFrame(vect.transform(data_train['data']).todense(), columns=vect.get_feature_names())

In [13]:
ng_train.shape # 303 fewer features / tokens

(2034, 26576)

In [14]:
ng_train['target'] = data_train['target']

In [16]:
ng_train.head(1)

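|   | 00 | 000 | 0000 | 00000 | 000000 | 000005102000 | 000062david42 | 0001 | 000100255pixel | 00041032 | ... | zurich | zurvanism | zus | zvi | zwaartepunten | zwak | zwakke | zware | zwarte | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |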


In [17]:
# Transform the test set with the vectorizer fitted on the training set (no re-fitting), then add to a df with column names
ng_test = pd.DataFrame(vect.transform(data_test['data']).todense(), columns=vect.get_feature_names())

In [18]:
ng_test.shape

(1353, 26576)

In [19]:
# Adding target as a column to use for "y" in the regression model
ng_test['target'] = data_test['target']

In [24]:
ng_test.head(1)

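|   | 00 | 000 | 0000 | 00000 | 000000 | 000005102000 | 000062david42 | 0001 | 000100255pixel | 00041032 | ... | zurich | zurvanism | zus | zvi | zwaartepunten | zwak | zwakke | zware | zwarte | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |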


In [None]:
# Start of Logistic Regression Model

In [21]:
from sklearn.linear_model import LogisticRegression

In [23]:
# Set "y" values from added target column
y_train = ng_train.target
y_test = ng_test.target

In [28]:
from sklearn import metrics

In [27]:
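# ng_train / ng_test still hold the label column 'target'; drop it from the
# features, otherwise the model is handed the answer (label leakage).
logreg = LogisticRegression()
logreg.fit(ng_train.drop('target', axis=1), y_train)
y_pred_class = logreg.predict(ng_test.drop('target', axis=1))
# The score recorded below came from a run that kept 'target' in the features, so it includes that leak.
print(metrics.accuracy_score(y_test, y_pred_class))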

0.954175905395


In [None]:
# Additional Modifications

In [50]:
vect2 = CountVectorizer(stop_words='english', max_features=100)
vect2.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=100, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [51]:
ng_train2 = pd.DataFrame(vect2.transform(data_train['data']).todense(), columns=vect2.get_feature_names())
ng_train2.shape

(2034, 100)

In [52]:
# An integer max_df is an absolute document count: keep only tokens found in at most 2 documents
vect3 = CountVectorizer(max_df=2)
vect3.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=2, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [53]:
ng_train3 = pd.DataFrame(vect3.transform(data_train['data']).todense(), columns=vect3.get_feature_names())
ng_train3.shape

(2034, 18234)

In [39]:
# A float min_df is a proportion: keep only tokens found in at least 50% of documents
vect4 = CountVectorizer(min_df=0.5)
vect4.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.5,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [54]:
ng_train4 = pd.DataFrame(vect4.transform(data_train['data']).todense(), columns=vect4.get_feature_names())
ng_train4.shape

(2034, 9)

In [55]:
vect5 = CountVectorizer(min_df=0.7)
vect5.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.7,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [56]:
ng_train5 = pd.DataFrame(vect5.transform(data_train['data']).todense(), columns=vect5.get_feature_names())
ng_train5.shape

(2034, 3)

In [57]:
vect6 = CountVectorizer(min_df=0.2)
vect6.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.2,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [58]:
ng_train6 = pd.DataFrame(vect6.transform(data_train['data']).todense(), columns=vect6.get_feature_names())
ng_train6.shape

(2034, 52)
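
The shapes above follow from how `min_df` and `max_df` are interpreted: an integer is an absolute number of documents, a float is a proportion of the corpus. The sketch below mimics that rule in pure Python on a toy corpus. `document_frequency`, `to_doc_count` and `filter_vocabulary` are hypothetical helper names, and this is a simplified illustration rather than scikit-learn's code.

```python
import re


def document_frequency(docs):
    """Count in how many documents each token appears (not how many times)."""
    df = {}
    for doc in docs:
        for tok in set(re.findall(r"\b\w\w+\b", doc.lower())):
            df[tok] = df.get(tok, 0) + 1
    return df


def to_doc_count(threshold, n_docs):
    """An int is an absolute document count; a float is a proportion of the corpus."""
    return threshold if isinstance(threshold, int) else threshold * n_docs


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    low = to_doc_count(min_df, n_docs)
    high = to_doc_count(max_df, n_docs)
    df = document_frequency(docs)
    return sorted(tok for tok, n in df.items() if low <= n <= high)


toy_docs = [
    "space probe launch",
    "space image render",
    "space probe image",
    "space faith debate",
]

rare_terms = filter_vocabulary(toy_docs, max_df=2)      # at most 2 documents
common_terms = filter_vocabulary(toy_docs, min_df=0.5)  # at least 50% of documents

print(rare_terms)
print(common_terms)
```

So `max_df=2` keeps only the rare tokens (most of the vocabulary survives), while `min_df=0.5` keeps only tokens present in half the posts or more, which is why so few features remain.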

## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- print out the number of features for this model

BONUS
- Change the parameters of either (or both!) models to improve your score
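
As a reminder of what TF-IDF does to the raw counts, here is a small sketch of the weighting for a single document. It assumes the settings that appear in the `TfidfVectorizer` repr in this section (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), for which scikit-learn documents the inverse document frequency as `ln((1 + n) / (1 + df)) + 1`. The helper names and the toy numbers are hypothetical.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(counts, doc_freqs, n_docs):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weights = [tf * smooth_idf(n_docs, df) for tf, df in zip(counts, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


# Toy document containing "space", "probe" and "launch" once each,
# in a corpus of 4 documents where they appear in 4, 2 and 1 documents.
counts = [1, 1, 1]
doc_freqs = [4, 2, 1]
row = tfidf_row(counts, doc_freqs, n_docs=4)

print([round(w, 3) for w in row])
```

All three tokens occur once, yet the rarest one ("launch") gets the largest weight and the token found in every document ("space") gets the smallest. The number of features is unchanged compared with the count vectorizer; only the values differ.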

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [80]:
tf_vect = TfidfVectorizer(stop_words='english')
tf_vect.fit(data_train['data'])

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [81]:
tf_train = pd.DataFrame(tf_vect.transform(data_train['data']).todense(), columns=tf_vect.get_feature_names())
tf_train.shape

(2034, 26576)

In [82]:
tf_train['target'] = data_train['target']

In [89]:
tf_test = pd.DataFrame(tf_vect.transform(data_test['data']).todense(), columns=tf_vect.get_feature_names())
tf_test.shape

(1353, 26576)

In [84]:
tf_test['target'] = data_test['target']

In [85]:
y_tf_train = tf_train.target
y_tf_test = tf_test.target

In [86]:
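# tf_train / tf_test still hold the label column 'target'; drop it from the
# features, otherwise the model is handed the answer (label leakage).
logreg = LogisticRegression()
logreg.fit(tf_train.drop('target', axis=1), y_tf_train)
y_pred_class = logreg.predict(tf_test.drop('target', axis=1))
# The score recorded below came from a run that kept 'target' in the features, so it includes that leak.
print(metrics.accuracy_score(y_tf_test, y_pred_class))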

0.994826311899


In [71]:
from sklearn.feature_extraction.text import HashingVectorizer

In [95]:
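# n_features is set to the CountVectorizer vocabulary size so the models are comparable.
# HashingVectorizer is stateless, so fit() learns nothing; it is kept for symmetry.
hv_vect = HashingVectorizer(stop_words='english', n_features=26576)
hv_vect.fit(data_train['data'])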

HashingVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
         dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
         lowercase=True, n_features=26576, ngram_range=(1, 1),
         non_negative=False, norm=u'l2', preprocessor=None,
         stop_words='english', strip_accents=None,
         token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None)
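
Unlike the other two vectorizers, a hashing vectorizer stores no vocabulary: each token is hashed straight to a column index. The sketch below illustrates the idea in pure Python. It is an assumption-laden simplification: it uses MD5 from the standard library as a stand-in hash and returns raw counts, whereas scikit-learn uses a signed MurmurHash3 and (with the settings shown above) L2-normalizes each row. `bucket` and `hash_vectorize` are hypothetical names.

```python
import hashlib
import re


def bucket(token, n_features):
    """Map a token to a column index using a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_features


def hash_vectorize(doc, n_features):
    """Count tokens per hashed column; no vocabulary is learned or stored."""
    row = [0] * n_features
    for tok in re.findall(r"\b\w\w+\b", doc.lower()):
        row[bucket(tok, n_features)] += 1
    return row


row_a = hash_vectorize("space probe launch", n_features=16)
row_b = hash_vectorize("probe launch space", n_features=16)

print(row_a)
```

Two consequences follow. There are no column names to pass to the DataFrame, because the mapping from column back to token is not kept. And with a small `n_features`, different tokens can land in the same column (a collision), which is the price paid for not storing a vocabulary.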

In [96]:
hv_train = pd.DataFrame(hv_vect.transform(data_train['data']).todense())
hv_train.shape

(2034, 26576)

In [97]:
hv_train['target'] = data_train['target']

In [102]:
hv_test = pd.DataFrame(hv_vect.transform(data_test['data']).todense())
hv_test.shape

(1353, 26576)

In [103]:
hv_test['target'] = data_test['target']

In [104]:
y_hv_train = hv_train.target
y_hv_test = hv_test.target

In [105]:
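# hv_train / hv_test still hold the label column 'target'; drop it from the
# features, otherwise the model is handed the answer (label leakage).
logreg = LogisticRegression()
logreg.fit(hv_train.drop('target', axis=1), y_hv_train)
y_pred_class = logreg.predict(hv_test.drop('target', axis=1))
# The score recorded below came from a run that kept 'target' in the features, so it includes that leak.
print(metrics.accuracy_score(y_hv_test, y_pred_class))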

0.990391722099
