<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [38]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
%matplotlib inline

In [2]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1) Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

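The reference solution uses the four categories listed here; the code cell below picks a different set, which is fine as long as you use your choice consistently: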
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
# Choosing which newsgroup categories to load

categories = ['rec.sport.baseball', 'sci.space', 'sci.electronics', 'soc.religion.christian']  # Fill in whatever categories you want to use!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

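`shuffle` randomizes the order of the documents when the subset is loaded. That matters because data stored in a fixed order (for example, grouped by category) can bias anything that processes the samples sequentially or assumes they are independent.

Setting `random_state` fixes the seed of that shuffle, so the documents come back in the same order on every run and the results are repeatable. The train/test split itself is predefined by the data set and does not depend on `random_state`.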
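
To see what a fixed seed buys you, here is a minimal sketch on a made-up list of documents. It uses `sklearn.utils.shuffle` as a stand-in for the shuffling that `fetch_20newsgroups` performs internally; the toy documents and labels are assumptions for illustration only.

```python
from sklearn.utils import shuffle

# Hypothetical toy data (not the newsgroups posts).
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
labels = [0, 1, 0, 1, 0]

# The same seed gives the same order on every call.
docs_1, labels_1 = shuffle(docs, labels, random_state=42)
docs_2, labels_2 = shuffle(docs, labels, random_state=42)
print(list(docs_1) == list(docs_2), list(labels_1) == list(labels_2))

# Documents and labels are permuted together, so every pair stays aligned.
original_pairs = set(zip(docs, labels))
print(set(zip(docs_1, labels_1)) == original_pairs)
```

Without a fixed `random_state`, two runs of the same notebook could see the documents in different orders, which makes results harder to compare.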

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: we were able to call "train" and "test" in subset).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [4]:
type(data_train)

sklearn.utils.Bunch

In [46]:
data_train.data

["Y'all lighten up on Harry, Skip'll be like that in a couple of years!!>\n\nHarry's a great personality.  He's the reason I like Cubs broadcasts.\n(It's certainly not the quality of the team).\n\nChop Chop\n\nMichael Mule'\n\n",
 "Once again, the Rockies bullpen fell apart.  Andy Ashby pitched six (somewhat\nshaky) innings giving up just one run.  Then game the dreaded relief.  Three\npicthers combined to give up 3 runs (one each I believe) in the 7th inning\nand blew the save opportunity.  (Final was 4-2 vs Expos).\n\nDespite their problems in the pen, I think the Rockies are a team that wont\nbe taken lightly.  Going into today's game, the had the league's leading\nhitter and RBI man (Galarraga), two of the leaders in stolen bases (Young\nand Cole) and increasingly strong starting pitching.\n-- \n-------------------------------------------------------------------------------\nDavid Rex Wood -- davewood@cs.colorado.edu -- University of Colorado at Boulder",
 "\nWow.  ESPN can repeat 

In [49]:
data_train['target']

array([0, 0, 0, ..., 1, 2, 1], dtype=int64)

In [6]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

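`data_train` is a `sklearn.utils.Bunch`: a dictionary-like object whose keys (`data`, `target`, `target_names`, and so on) can also be read as attributes, so `data_train.data` and `data_train['data']` return the same list.

Each data point is the raw text of a single newsgroup post, and the matching entry in `target` is the integer label of its category.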
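
The inspection questions above can be answered with a few one-liners. The sketch below builds a small `Bunch` by hand (the three posts and their labels are invented) so it runs without downloading anything; the same expressions should work on `data_train`.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical stand-in for data_train, built by hand.
toy = Bunch(
    data=["first post", "second post", "third post"],
    target=np.array([0, 1, 0]),
    target_names=["rec.sport.baseball", "sci.space"],
)

print(type(toy).__name__)               # class name of the container
print(isinstance(toy, dict))            # Bunch is a dict subclass
print(toy["data"] is toy.data)          # key and attribute access agree
print(len(toy.data))                    # number of documents
print(toy.data[0])                      # first document
print(toy.target_names[toy.target[0]])  # category name of the first document
```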

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the `test_set`, too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
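
Before running these steps on the newsgroups data, it can help to try them on a toy corpus where the vocabulary is small enough to read. The sentences below are invented for illustration; the sketch shows the effect of `stop_words`, of `min_df`, and of transforming test text with an already fitted vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the newsgroups posts).
train_docs = [
    "the pitcher threw the ball to the catcher",
    "the rocket reached orbit after the launch",
    "the team won the game in the ninth inning",
    "the satellite is in orbit around the moon",
]
test_docs = ["a new rocket launch and a baseball game"]

plain = CountVectorizer().fit(train_docs)
no_stop = CountVectorizer(stop_words="english").fit(train_docs)
print(len(plain.vocabulary_), len(no_stop.vocabulary_))

# min_df=2 keeps only the terms that occur in at least two documents.
frequent_only = CountVectorizer(min_df=2).fit(train_docs)
print(sorted(frequent_only.vocabulary_))

# Transform only: reuse the vocabulary learned from the training documents.
X_train = no_stop.transform(train_docs)
X_test = no_stop.transform(test_docs)
print(X_train.shape, X_test.shape)

# Words never seen during fitting (here "baseball") are silently dropped.
print("baseball" in no_stop.vocabulary_)
```

Because the test matrix is produced by `transform` alone, it has exactly the same columns as the training matrix, which is what a fitted classifier expects.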

In [39]:
count_vec_big = CountVectorizer(max_features = 10000)

In [96]:
count_vec_big.fit(data_train['data'], data_train['target'])
features_big = count_vec_big.transform(data_train['data'])
features_big

<2380x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 188416 stored elements in Compressed Sparse Row format>

In [89]:
count_vec = CountVectorizer(stop_words = 'english', max_features = 5500)

In [97]:
count_vec.fit(data_train['data'], data_train['target'])
train_features = count_vec.transform(data_train['data'])
train_features

<2380x5500 sparse matrix of type '<class 'numpy.int64'>'
	with 115267 stored elements in Compressed Sparse Row format>

In [98]:
log_reg = LogisticRegression()
log_reg.fit(train_features, data_train['target'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

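In [103]:
# Transform the test set with the vectorizer that was fitted on the training data.
# Do not refit here: refitting on the test set builds a different vocabulary, so
# the columns no longer match the features the model was trained on.
test_features = count_vec.transform(data_test['data'])
test_features.shape

(1582, 5500)

In [104]:
# For reference: when the vectorizer is refit on the test set before this step,
# the misaligned columns drag the score down to 0.3672566371681416.
log_reg.score(test_features, data_test['target'])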
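
A `Pipeline` keeps the vectorizer and the classifier together, so the test text is only ever transformed and never used for fitting. The following is a sketch on invented sentences and labels, using scikit-learn's `make_pipeline`; `max_iter=1000` is an arbitrary choice to avoid convergence warnings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = baseball, 1 = space.
train_docs = [
    "the pitcher threw a fastball in the ninth inning",
    "the rocket reached orbit after launch",
    "the team won the game with a home run",
    "the satellite is in orbit around the moon",
]
train_y = [0, 1, 0, 1]
test_docs = ["another rocket launch into orbit", "a great game for the pitcher"]
test_y = [1, 0]

model = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# fit() fits the vectorizer and the classifier on the training text only.
model.fit(train_docs, train_y)

# score() transforms the test text with the fitted vocabulary, then predicts.
accuracy = model.score(test_docs, test_y)
print(accuracy)
```

With the lab's data, the same pattern would be `model.fit(data_train['data'], data_train['target'])` followed by `model.score(data_test['data'], data_test['target'])`.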

### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.
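
The "number of features" question has a different answer for each vectorizer. In the sketch below (toy sentences, invented for illustration), the hashing vectorizer has no learned vocabulary and always produces `n_features` columns, while the TF-IDF vectorizer produces one column per term in its fitted vocabulary.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "the rocket reached orbit after launch",
    "the pitcher threw the ball to the catcher",
    "the satellite is in orbit around the moon",
]

# Hashing: the column count is fixed up front and does not depend on the text.
hasher = HashingVectorizer(stop_words="english")
X_hash = hasher.fit_transform(docs)
print(X_hash.shape[1], hasher.n_features)

# TF-IDF: the column count equals the size of the vocabulary learned in fit().
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.shape[1], len(tfidf.vocabulary_))
```

Since the hashing vectorizer stores no vocabulary, there is no direct way to map a column back to the word that produced it.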

In [106]:
from sklearn.feature_extraction.text import HashingVectorizer

hvec = HashingVectorizer(stop_words='english', norm=None, alternate_sign=False)
hvec.fit(data_train['data'], data_train['target'])

HashingVectorizer(alternate_sign=False, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm=None, preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None)

In [107]:
hvec.transform(data_train['data'])

<2380x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 150676 stored elements in Compressed Sparse Row format>
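
To finish the comparison, each vectorizer can be run through the same fit, transform, and score steps. The helper below is a hypothetical `evaluate` function (not part of scikit-learn), demonstrated on invented sentences; it fits the vectorizer on the training text only and returns test accuracy.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression


def evaluate(vectorizer, train_docs, train_y, test_docs, test_y):
    """Fit on the training text only, then return accuracy on the test text."""
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)  # no refitting on test data
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_y)
    return clf.score(X_test, test_y)


# Hypothetical toy data: 0 = baseball, 1 = space.
train_docs = [
    "the pitcher threw a fastball in the ninth inning",
    "the rocket reached orbit after launch",
    "the team won the game with a home run",
    "the satellite is in orbit around the moon",
]
train_y = [0, 1, 0, 1]
test_docs = ["another rocket launch into orbit", "a great game for the pitcher"]
test_y = [1, 0]

vectorizers = {
    "count": CountVectorizer(stop_words="english"),
    "hashing": HashingVectorizer(stop_words="english", norm=None, alternate_sign=False),
    "tfidf": TfidfVectorizer(stop_words="english"),
}

scores = {
    name: evaluate(vec, train_docs, train_y, test_docs, test_y)
    for name, vec in vectorizers.items()
}
for name, score in scores.items():
    print(name, score)
```

On the lab's data, the call would be `evaluate(vec, data_train['data'], data_train['target'], data_test['data'], data_test['target'])`. Scores on a toy corpus this small say nothing about how the vectorizers rank on the real posts.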

### 5) [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text. This function should:

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.
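
One possible starting point is sketched below using only the standard library. The stopword set and the suffix-stripping `crude_stem` helper are deliberately tiny, hypothetical stand-ins; for real use you would likely swap in NLTK's stopword list and one of its stemmers or lemmatizers. Stopwords are removed before stemming here so that the stopword lookup sees the original word forms.

```python
import string

# Assumption: a tiny illustrative stopword set, not a complete list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to", "was", "were"}

# Assumption: a handful of suffixes, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def crude_stem(word):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, and stem each word."""
    text = text.lower().translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(crude_stem(tok) for tok in tokens)


print(preprocess("The Rockies were pitching in the 7th inning!"))
# rocki pitch 7th inn
```

To use it on the lab's data, apply it to every document, for example `[preprocess(doc) for doc in data_train['data']]`, and do the same for the test set before vectorizing.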