
# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [47]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [48]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [49]:
# Choosing the categories and downloading the matching train/test subsets

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]  # Fill in whatever categories you want to use!!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

In [8]:
# A:
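
To build intuition for the question above, here is a minimal sketch of seeded shuffling. It uses Python's standard `random` module on a hypothetical toy data set (`docs`, `labels`, and the helper `shuffled` are made up for illustration), so it shows the general idea rather than scikit-learn's internal implementation.

```python
import random

# Hypothetical toy documents and labels, grouped by label like an unshuffled data set
docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
labels = [0, 0, 1, 1, 2]

def shuffled(docs, labels, seed):
    """Return docs and labels reordered by the same seeded permutation."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)  # a private generator, seeded for reproducibility
    return [docs[i] for i in order], [labels[i] for i in order]

first = shuffled(docs, labels, seed=42)
second = shuffled(docs, labels, seed=42)

# Same seed, same order on every run
print(first == second)
# Each document is still paired with its own label after shuffling
print(sorted(zip(*first)) == sorted(zip(docs, labels)))
```

Both checks print `True`: the seed makes the order repeatable, and shuffling documents and labels with one permutation keeps them aligned.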

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: that is why we were able to pass `subset='train'` and `subset='test'` above).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [50]:
type(data_train)

sklearn.utils.Bunch

In [51]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
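
A `Bunch` behaves like a dictionary whose keys can also be read as attributes. The sketch below builds a small one by hand to show this; the contents of `toy` are hypothetical and only stand in for the real newsgroups data.

```python
from sklearn.utils import Bunch

# Hypothetical toy contents, standing in for the downloaded data set
toy = Bunch(data=["first post", "second post"], target=[0, 1])

print(isinstance(toy, dict))    # a Bunch is a dict subclass
print(toy.data is toy["data"])  # attribute access and key access return the same object
print(list(toy.keys()))         # keys appear in the order they were given
```

This is why both `data_train['data']` and `data_train.data` work in the cells that follow.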

In [52]:
# Making sure our data and target have the same length
len(data_train['data'])

2034

In [53]:
len(data_train['target'])

2034

In [55]:
# Let's check out what our data actually looks like.
data_train['data'][0]

"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the test set, too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
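
Before running this on the full data, it can help to see the same steps on a corpus small enough to count by hand. The sketch below is only an illustration: `train_docs` and `test_docs` are hypothetical sentences, and the vocabulary sizes apply to this toy corpus, not to the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
train_docs = [
    "the rocket launch was delayed",
    "the image format is jpeg",
    "the launch of the rocket",
]
test_docs = ["a new rocket image"]

# Default settings: every token of two or more characters becomes a feature
plain = CountVectorizer().fit(train_docs)
# Dropping English stop words shrinks the vocabulary
no_stop = CountVectorizer(stop_words="english").fit(train_docs)
# min_df=2 additionally keeps only terms that appear in at least two documents
frequent = CountVectorizer(stop_words="english", min_df=2).fit(train_docs)

print(len(plain.vocabulary_), len(no_stop.vocabulary_), len(frequent.vocabulary_))

# Transform the test documents with the already-fitted vectorizer (no refit):
# words that were not seen during fitting, such as "new", are simply ignored.
X_new = no_stop.transform(test_docs)
print(X_new.shape, X_new.sum())
```

The test matrix has the same number of columns as the training vocabulary, which is what lets a model trained on one be scored on the other.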

In [15]:

# What does the target variable look like?
data_train['target']

array([1, 3, 2, ..., 1, 0, 1])

In [16]:
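# Baseline: class proportions in the training set. Always predicting the most
# common class (2) would be right about 29% of the time.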
target = pd.Series(data_train['target'])
target.value_counts()/len(target)

2    0.291544
1    0.287119
0    0.235988
3    0.185349
dtype: float64

In [17]:

# NLP using a count vectorizer.
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
# Setting up the vectorizer just like we would set up a model
cvec = CountVectorizer()

# Fitting the vectorizer on our training data
cvec.fit(data_train['data'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [19]:
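# Let's check the size of the fitted vocabulary (one feature per token).
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.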
len(cvec.get_feature_names())

26879

In [20]:
# Let's use the stop_words argument to remove words like "and", "the", and "a"
cvec = CountVectorizer(stop_words='english')

# Fit our vectorizer using our train data
cvec.fit(data_train['data'])

# and check out the length of the vectorized data after
len(cvec.get_feature_names())

26576

In [58]:
# Transforming our x_train data using our fit cvec.
# And converting the result to a DataFrame.
X_train = pd.DataFrame(cvec.transform(data_train['data']).todense(),
                       columns=cvec.get_feature_names())

X_train.head()

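|   | 00 | 000 | 0000 | 00000 | 000000 | 000005102000 | 000062david42 | 0001 | 000100255pixel | 00041032 | ... | zurich | zurvanism | zus | zvi | zwaartepunten | zwak | zwakke | zware | zwarte | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |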


In [22]:
# We still have the same number of rows, but the vectorizer has turned every word
# (or what it believes to be a word) from our training data into a feature. It's like
# dummy-coded variables for words, except with counts rather than just occurrences.

In [23]:
X_train.shape

(2034, 26576)

In [60]:
# Which words appear the most?
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

space       1061
people       793
god          745
don          730
like         682
just         675
does         600
know         592
think        584
time         546
image        534
edu          501
use          468
good         449
data         444
nasa         419
graphics     414
jesus        411
say          409
way          387
dtype: int64

In [61]:
names = data_train['target_names']
names

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [62]:
# What are we trying to predict
y_train = data_train['target']

In [63]:
# Let's look at the most common words in each category
for i, name in enumerate(names):
    # Sum the counts over the rows (documents) that belong to category i
    word_count = X_train[y_train == i].sum(axis=0)
    print(name, "most common words")
    # Sort the totals in descending order and keep the top 20
    print(word_count.sort_values(ascending=False).head(20))
    print()

alt.atheism most common words
god         405
people      330
don         262
think       215
just        209
does        207
atheism     199
say         174
believe     163
like        162
atheists    162
religion    156
jesus       155
know        154
argument    148
time        135
said        131
true        131
bible       121
way         120
dtype: int64

comp.graphics most common words
image        484
graphics     410
edu          297
jpeg         267
file         265
use          225
data         219
files        217
images       212
software     212
program      199
ftp          189
available    185
format       178
color        174
like         167
know         165
pub          161
gif          160
does         157
dtype: int64

sci.space most common words
space        989
nasa         374
launch       267
earth        222
like         222
data         216
orbit        201
time         197
shuttle      192
just         189
satellite    187
lunar        182
moon         168
...

In [31]:
# Converting our vectorized test data to a DataFrame
# Using the CVEC which we fit earlier
X_test = pd.DataFrame(cvec.transform(data_test['data']).todense(),
                      columns=cvec.get_feature_names())

In [33]:

# Getting our Y test information
y_test = data_test['target']

In [64]:

# Import and fit our logistic regression, then score it on train and test
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))

0.976892822026
0.745011086475


In [65]:
from sklearn.metrics import classification_report
print(classification_report(y_test, lr.predict(X_test)))

             precision    recall  f1-score   support

          0       0.65      0.59      0.62       319
          1       0.87      0.89      0.88       389
          2       0.77      0.85      0.80       394
          3       0.62      0.57      0.59       251

avg / total       0.74      0.75      0.74      1353



### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.
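
To see why a hashing vectorizer has no vocabulary to inspect, here is a minimal pure-Python sketch of the hashing trick. It is a simplification under stated assumptions: scikit-learn's `HashingVectorizer` documents a signed MurmurHash3 and 2**20 columns by default, whereas this sketch swaps in MD5 and a hypothetical `n_buckets=8` so the result is small enough to print. The function name `hashed_counts` is made up for illustration.

```python
import hashlib
import re

def hashed_counts(text, n_buckets=8):
    """Map each token to a column by hashing it, then count tokens per column."""
    counts = [0] * n_buckets
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_buckets  # no stored vocabulary, just arithmetic
        counts[column] += 1
    return counts

vec = hashed_counts("The shuttle reached orbit and the shuttle landed")
print(len(vec))  # always n_buckets, regardless of how many distinct words appear
print(sum(vec))  # total number of tokens counted
```

Because the number of columns is fixed up front, different words can collide in the same column, and the mapping cannot be reversed to recover feature names.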

In [36]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

In [44]:
# A pipeline chains the vectorizer and the classifier into a single estimator.
# Calling model.fit() runs the vectorizer on the text and fits the classifier on the result.
# Calling model.predict() applies the same vectorizer step before predicting,
# so we can pass raw text straight to the pipeline.
model = make_pipeline(HashingVectorizer(stop_words='english'),
                      LogisticRegression())
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.736881005174
             precision    recall  f1-score   support

          0       0.63      0.62      0.62       319
          1       0.86      0.89      0.88       389
          2       0.71      0.89      0.79       394
          3       0.71      0.41      0.52       251

avg / total       0.73      0.74      0.73      1353
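
As a small check on what the pipeline is doing, the sketch below fits one on a hypothetical four-document corpus (`texts` and `labels` are made up) and confirms that predicting through the pipeline matches transforming and predicting by hand with the pipeline's own fitted steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two classes
texts = [
    "rocket launch orbit",
    "shuttle orbit moon",
    "jpeg image format",
    "image color graphics",
]
labels = [0, 0, 1, 1]

toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(texts, labels)

# make_pipeline names each step after its lowercased class name
print(list(toy_model.named_steps))

new_docs = ["a new orbit", "a color image"]
via_pipeline = toy_model.predict(new_docs)

# The same prediction done by hand: transform with the fitted vectorizer, then predict
vec = toy_model.named_steps["tfidfvectorizer"]
clf = toy_model.named_steps["logisticregression"]
by_hand = clf.predict(vec.transform(new_docs))

print((via_pipeline == by_hand).all())
```

The pipeline mainly saves bookkeeping: there is no separate `X_test` to build, and no risk of accidentally refitting the vectorizer on the test set.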



In [66]:
model = make_pipeline(TfidfVectorizer(stop_words='english'),
                      LogisticRegression())
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.747967479675
             precision    recall  f1-score   support

          0       0.65      0.62      0.63       319
          1       0.87      0.90      0.89       389
          2       0.72      0.90      0.80       394
          3       0.72      0.43      0.53       251

avg / total       0.75      0.75      0.74      1353
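
For intuition about what TF-IDF weighting does, here is a pure-Python sketch on a hypothetical pre-tokenized corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the helper names `idf` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents
docs = [
    ["space", "launch", "space"],
    ["space", "image"],
    ["image", "jpeg"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Term count times IDF, scaled so each document vector has unit length."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

print(idf("jpeg") > idf("space"))  # "jpeg" is in one document, "space" in two
print(tfidf(docs[0]))
```

Compared with raw counts, this down-weights words that appear in many documents, which is one reason the TF-IDF and count-based scores above can differ.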



### 5. [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text.  This function should

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.

In [67]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

In [68]:
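import string

def remove_punctuation(text):
    # Lowercase the text, then drop every punctuation character
    lower = text.lower()
    exclude = set(string.punctuation)
    # Join on the empty string so the remaining characters keep their original spacing
    return "".join(ch for ch in lower if ch not in exclude)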
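
One possible shape for the full preprocessing function is sketched below. It is not a reference solution: to stay self-contained it uses a tiny hypothetical stop word set (`STOPWORDS`) and a crude suffix stripper (`naive_stem`) as stand-ins for a real stop word list and a real stemmer or lemmatizer.

```python
import string

# Tiny illustrative stop word set (an assumption); real lists are far longer
STOPWORDS = {"a", "an", "and", "are", "is", "it", "of", "the", "to", "was", "were"}

def naive_stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, stem=naive_stem, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, drop stop words, and stem the remaining words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    kept = [stem(token) for token in tokens if token not in stopwords]
    return " ".join(kept)

print(preprocess("The rockets were launched, and the images are amazing!"))
# rocket launch imag amaz
```

Because `stem` and `stopwords` are parameters, a real stemmer such as the `stem` method of NLTK's `PorterStemmer` and a fuller stop word list could be passed in without changing the function. The cleaned strings can then go through any of the vectorizers used earlier, for example `[preprocess(doc) for doc in data_train['data']]`.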