<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroup dataset, which is provided by sklearn.

In [122]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [123]:
# Getting the Sklearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the [function documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [124]:
# Extracting Information from the Data's Dictionary format 
# Categories of emails we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note we were able to call 'train' and 'test' in subset).

Let's inspect them.

1. What data type is `data_train`?
- What does `data_train` contain? 
- How many data points does `data_train` contain?
- How many data points of each category does `data_train` contain?
- Inspect the first data point, what does it look like?

In [125]:
# A:
type(data_train)

sklearn.utils.Bunch

In [126]:
data_train.target_names

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [127]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Repeat eliminating English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- What are the 20 words that are most common in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    - You will have to transform the test_set, too. Be careful to use the trained vectorizer, without re-fitting it.
    - Create a confusion matrix.

**BONUS:**
- Try a couple of modifications:
    - restrict max_features
    - change max_df and min_df
    - for each of the above print a confusion matrix and investigate what gets mixed

In [128]:
# A:
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer()
cvec.fit(data_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [129]:
len(cvec.get_feature_names())

26879

In [130]:
cvec = CountVectorizer(stop_words='english', token_pattern=)
cvec.fit(data_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [131]:
len(cvec.get_feature_names())

26576

In [132]:
import pandas as pd
df = pd.DataFrame(cvec.transform(data_train.data).toarray(),
                  columns=cvec.get_feature_names())

In [133]:
type(cvec.transform(data_train.data))

scipy.sparse.csr.csr_matrix

In [134]:
df.T.sort_values(0,ascending=False).T

Unnamed: 0,file,3ds,prj,orientation,does,save,texture,information,format,able,...,earths,earthquake,earthly,earthings,earthinfo,earthers,earth,ears,earnshaw,zyxel
0,6,3,3,2,2,2,2,2,2,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [137]:
df.sum().sort_values(ascending=False)[:20]

space       1061
people       793
god          745
don          730
like         682
just         675
does         600
know         592
think        584
time         546
image        534
edu          501
use          468
good         449
data         444
nasa         419
graphics     414
jesus        411
say          409
way          387
dtype: int64

In [138]:
Top20_0 = pd.DataFrame(dict(Target0 = df[data_train.target==0].sum().sort_values(ascending=False)[:20]))
Top20_1 = pd.DataFrame(dict(Target1 = df[data_train.target==1].sum().sort_values(ascending=False)[:20]))
Top20_2 = pd.DataFrame(dict(Target2 = df[data_train.target==2].sum().sort_values(ascending=False)[:20]))
Top20_3 = pd.DataFrame(dict(Target3 = df[data_train.target==3].sum().sort_values(ascending=False)[:20]))

Top20_1

Unnamed: 0,Target1
image,484
graphics,410
edu,297
jpeg,267
file,265
use,225
data,219
files,217
images,212
software,212


In [141]:
# Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    # You will have to transform the test_set, too. Be careful to use the trained vectorizer, without re-fitting it.
    # Create a confusion matrix.

from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()

logistic.fit(cvec.transform(data_train.data),data_train.target)
print('Train score:',logistic.score(cvec.transform(data_train.data),data_train.target))
print('Test score:',logistic.score(cvec.transform(data_test.data),data_test.target))

Train score: 0.9783677482792527
Test score: 0.7354028085735402


In [143]:
from sklearn.metrics import confusion_matrix

predictions_train = logistic.predict(cvec.transform(data_train.data))
print(confusion_matrix(data_train.target, predictions_train, labels=[0,1,2,3]))

[[468   0  12   0]
 [  0 568  16   0]
 [  0   0 593   0]
 [  0   0  16 361]]


In [144]:
predictions_test = logistic.predict(cvec.transform(data_test.data))
print(confusion_matrix(data_test.target, predictions, labels=[0,1,2,3]))

[[176  15  57  71]
 [  7 336  41   5]
 [ 15  21 351   7]
 [ 61  12  40 138]]


In [146]:
from sklearn.metrics import classification_report
print(classification_report(data_train.target, predictions_train))
print(classification_report(data_test.target, predictions_test))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99       480
           1       1.00      0.97      0.99       584
           2       0.93      1.00      0.96       593
           3       1.00      0.96      0.98       377

    accuracy                           0.98      2034
   macro avg       0.98      0.98      0.98      2034
weighted avg       0.98      0.98      0.98      2034

              precision    recall  f1-score   support

           0       0.64      0.57      0.60       319
           1       0.86      0.88      0.87       389
           2       0.76      0.83      0.80       394
           3       0.59      0.57      0.58       251

    accuracy                           0.74      1353
   macro avg       0.71      0.71      0.71      1353
weighted avg       0.73      0.74      0.73      1353



### 4. TF-IDF

Let's see if TF-IDF improves the accuracy.

- Initialize a TF-IDF Vectorizer and repeat the analysis above.
- Does the score improve with respect to the count vectorizer? 
- Print out the number of features for this model.

**BONUS:**
- Change the parameters of either (or both!) models to improve your score.

In [173]:
# A:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english', norm='l2')
tvec.fit(data_train.data)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [174]:
len(tvec.get_feature_names())

26576

In [169]:
from sklearn.linear_model import LogisticRegressionCV

logistic_tvec = LogisticRegression(max_iter=10000, n_jobs=2)
logistic_tvec.fit(tvec.transform(data_train.data),data_train.target)

print('Train score:',logistic_tvec.score(tvec.transform(data_train.data),data_train.target))
print('Test score:',logistic_tvec.score(tvec.transform(data_test.data),data_test.target))

Train score: 0.967551622418879
Test score: 0.7479674796747967


In [170]:
predictions_train = logistic_tvec.predict(tvec.transform(data_train.data))
print(confusion_matrix(data_train.target, predictions_train, labels=[0,1,2,3]))

[[465   0  13   2]
 [  0 568  16   0]
 [  0   6 587   0]
 [  9   1  19 348]]


In [171]:
predictions_test = logistic_tvec.predict(tvec.transform(data_test.data))
print(confusion_matrix(data_test.target, predictions_test, labels=[0,1,2,3]))

[[198  11  64  46]
 [  9 348  31   1]
 [ 20  22 352   0]
 [ 85  12  40 114]]


In [172]:
print(classification_report(data_train.target, predictions_train))
print(classification_report(data_test.target, predictions_test))

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       480
           1       0.99      0.97      0.98       584
           2       0.92      0.99      0.96       593
           3       0.99      0.92      0.96       377

    accuracy                           0.97      2034
   macro avg       0.97      0.96      0.97      2034
weighted avg       0.97      0.97      0.97      2034

              precision    recall  f1-score   support

           0       0.63      0.62      0.63       319
           1       0.89      0.89      0.89       389
           2       0.72      0.89      0.80       394
           3       0.71      0.45      0.55       251

    accuracy                           0.75      1353
   macro avg       0.74      0.72      0.72      1353
weighted avg       0.75      0.75      0.74      1353



In [179]:
from sklearn.model_selection import GridSearchCV

logistic_tvec = LogisticRegressionCV(max_iter=10000, n_jobs=2, cv=5)
logistic_tvec.fit(tvec.transform(data_train.data),data_train.target)

logistic_tvec_params = {'penalty': ['l1','l2'],
                        'solver': ['liblinear'],
                        'Cs':'np.logspace[]'}

gs = GridSearchCV(logistic_tvec, logistic_tvec_params, cv=5, n_jobs=2, verbose=2)

gs.fit(tvec.transform(data_train.data),data_train.target)
print(gs.best_params_)
print(gs.best_score_)
best_est = gs.best_estimator_
print(best_est.C_)
gs.score(tvec.transform(data_test.data),data_test.target)

ValueError: Parameter values for parameter (Cs) need to be a sequence(but not a string) or np.ndarray.

### 5. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.

### Bonus: Other classifiers

Adapt the code from [this example](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) to compare across all the classifiers suggested and to display the final plot

### Bonus: 

- #### Fit a model to the 20newsgroups dataset with all classes

- #### Choose texts, for example from newspaper articles, and check what is the class label predicted for them. Does the predicted label meet your expectations?