<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

In [1]:
# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the function documentation for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [None]:
# Newsgroup categories we want to download
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we selected them by passing `subset='train'` and `subset='test'`).

Let's inspect them.

1. What data type is `data_train`?
- Is it like a list, like a dictionary, or something else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?
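
As a rough sketch of what to expect, and assuming `fetch_20newsgroups` returns a `sklearn.utils.Bunch` (a dictionary subclass that also exposes its keys as attributes), the toy object below mimics that interface without downloading anything. The documents and labels in it are invented for illustration.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical stand-in for data_train; the texts and labels are made up.
toy = Bunch(
    data=["first post about orbits", "second post about shaders"],
    target=np.array([1, 0]),
    target_names=["comp.graphics", "sci.space"],
)

print(isinstance(toy, dict))             # a Bunch is a dict subclass
print(list(toy.keys()))                  # keys can be listed like any dict
print(toy["data"][0])                    # dictionary-style access
print(toy.target_names[toy.target[0]])   # attribute-style access also works
```

The same checks (`type`, `keys`, `len`, indexing) apply to the real `data_train` object.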

In [None]:
type(data_train)

In [None]:
list(data_train.keys())

In [None]:
# Making sure our data and target arrays are of equal length
len(data_train['data'])

In [None]:
len(data_train['target'])

In [None]:
# Let's check out what our data actually looks like.
data_train['data'][0]

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data
- how big is the feature dictionary?
- repeat, this time eliminating English stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the already-fitted vectorizer, without re-fitting it

**BONUS:**
- try a couple modifications:
    - restrict the max_features
    - change max_df and min_df
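
The fit-once, transform-twice pattern described above can be sketched on a toy corpus. This is a minimal illustration, not the lab solution: the sentences are invented, and the vocabulary size is read from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents, used only to illustrate the workflow.
toy_train = ["the rocket reached orbit",
             "the image was rendered in the shader"]
toy_test = ["a rocket image"]

# Fit on the training text only; the vocabulary is learned here.
vec_all = CountVectorizer().fit(toy_train)
vec_stop = CountVectorizer(stop_words='english').fit(toy_train)

n_all = len(vec_all.vocabulary_)     # vocabulary size with every token kept
n_stop = len(vec_stop.vocabulary_)   # smaller once stop words are dropped
print(n_all, n_stop)

# Transform the test text with the already-fitted vectorizer (no re-fit),
# so train and test share the same columns.
toy_X_test = vec_stop.transform(toy_test)
print(toy_X_test.shape)
```

Words in the test text that were never seen during fitting are simply ignored, which is why the test matrix keeps the training vocabulary's width.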

In [None]:
# What does the target variable look like
data_train['target']

In [None]:
# NLP Using a count vectorizer.  
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Setting the vectorizer just like we would set a model
cvec = CountVectorizer()

# Fitting the vectorizer on our training data
cvec.fit(data_train['data'])

In [None]:
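# Let's check the size of the learned vocabulary (the number of features).
# Note: get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is the replacement (scikit-learn >= 1.0).
len(cvec.get_feature_names_out())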

In [None]:
# Let's use the stop_words argument to remove words like "and", "the", "a"
cvec = CountVectorizer(stop_words='english')

# Fit our vectorizer using our train data
cvec.fit(data_train['data'])

# and check the size of the vocabulary afterwards
# (get_feature_names_out() replaces the removed get_feature_names())
len(cvec.get_feature_names_out())

In [None]:
# Transforming our training data using our fitted cvec
# and converting the sparse result to a dense DataFrame.
X_train = pd.DataFrame(cvec.transform(data_train['data']).toarray(),
                       columns=cvec.get_feature_names_out())

In [None]:
# We still have the same number of rows, but the vectorization has converted every word,
# or what is believed to be a word, from our training data into a feature. Like dummy-coded
# variables for words (except counts rather than just occurrences).

In [None]:
X_train.shape

In [None]:
# Which words appear the most?
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

In [None]:
names = data_train['target_names']
names

In [None]:
# What are we trying to predict
y_train = data_train['target']

In [None]:
# Let's look through the most common words in each category
common_words = []
for i in range(4):
    word_count = X_train[y_train==i].sum(axis=0)
    print(names[i], "most common words")
    cw = word_count.sort_values(ascending = False).head(20)
    print(cw)
    common_words.extend(cw.index)
    print()

In [None]:
# Converting our vectorized test data to a DataFrame,
# using the cvec which we fit earlier (transform only, no re-fitting)
X_test = pd.DataFrame(cvec.transform(data_test['data']).toarray(),
                      columns=cvec.get_feature_names_out())

In [None]:
# Getting our Y test information
y_test = data_test['target']

In [None]:
# Import and fit our logistic regression, then score it on the test set
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- print out the number of features for this model

**BONUS:**
- Change the parameters of either (or both!) models to improve your score
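
To see why a hashing vectorizer has a fixed number of features and stores no vocabulary, here is a minimal sketch of the hashing trick in plain Python. It is an illustration only: `hashed_counts` is a hypothetical helper, and it uses MD5 as a stand-in hash rather than whatever hash function `HashingVectorizer` uses internally.

```python
import hashlib

def hashed_counts(text, n_features=16):
    """Map whitespace tokens to a fixed-length count vector via a hash."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The column index depends only on the token, not on a fitted vocabulary.
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector

seen = hashed_counts("rocket orbit rocket")
unseen = hashed_counts("never seen words still fit")
print(len(seen), sum(seen))
print(len(unseen), sum(unseen))
```

Because nothing is learned during fitting, any new text maps into the same columns. The trade-off is that different words can collide in one column, and column indices cannot be mapped back to words.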

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

In [None]:
# A pipeline chains preprocessing steps and a final estimator into one object.
# Calling model.fit() fits the vectorizer and then the classifier;
# calling model.predict() reuses the fitted vectorizer to transform new text
# before passing it to the fitted classifier.
# Note: non_negative=True was removed from HashingVectorizer in newer
# scikit-learn releases; alternate_sign=False keeps the hashed counts non-negative.
model = make_pipeline(HashingVectorizer(stop_words='english',
                                        alternate_sign=False,
                                        n_features=2**16),
                      LogisticRegression(),
                      )
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print("Number of features:", 2**16)
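
To make the pipeline's behaviour concrete, the sketch below compares a pipeline with the equivalent manual steps on a tiny invented corpus. The documents, labels, and variable names are assumptions for illustration, and a `CountVectorizer` is used in place of the hashing vectorizer to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: label 0 is "space", label 1 is "graphics".
docs_train = ["rocket orbit launch", "shader pixel render",
              "orbit moon rocket", "pixel image render"]
labels_train = [0, 1, 0, 1]
docs_test = ["moon launch", "image shader"]

# Pipeline: fit() runs fit_transform on the vectorizer, then fit on the model.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(docs_train, labels_train)
pipe_pred = pipe.predict(docs_test)

# The same thing written out by hand.
vec = CountVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(docs_train), labels_train)
manual_pred = clf.predict(vec.transform(docs_test))  # transform only, no re-fit

print(list(pipe_pred) == list(manual_pred))
print([name for name, _ in pipe.steps])  # step names generated by make_pipeline
```

Bundling the steps this way makes it hard to accidentally re-fit the vectorizer on test data.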

In [None]:
model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.5,
                                      max_features=1000),
                      LogisticRegression(),
                      )
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("Number of features:", len(model.steps[0][1].get_feature_names_out()))
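
For intuition about what the TF-IDF weights mean, here is a small hand-rolled sketch in plain Python. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which matches the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). The tokenized documents and the helper names `idf` and `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Invented, pre-tokenized toy documents.
docs = [["space", "rocket", "orbit"],
        ["space", "image"],
        ["image", "pixel", "image"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Raw term counts times idf, scaled to unit L2 length."""
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf(docs[0])
print(first)
```

Terms that appear in fewer documents ("rocket") receive a larger weight than terms spread across many ("space"). With `sublinear_tf=True`, as in the cell above, the raw count `c` would be replaced by `1 + ln(c)`.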