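**Note**: Run `export PYTHONIOENCODING=utf8` in your terminal shell before starting this notebook.

Alternative (Python 2 only; `sys.setdefaultencoding` does not exist in Python 3, where strings are already Unicode):
```python
import sys
reload(sys)  # restores setdefaultencoding, which Python 2 removes at startup
sys.setdefaultencoding('utf8')
```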

# Introduction to Natural Language Processing

## Data:

## The 20 Newsgroups dataset

* [Official Website](http://qwone.com/~jason/20Newsgroups/)
* The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
* The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

## The 20 Newsgroups dataset

* In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn.
* In order to get faster execution times, we will work on a partial dataset with only 4 categories out of the 20 available in the dataset.

In [1]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

## The 20 Newsgroups dataset

The returned dataset is a scikit-learn “bunch”:

* a simple holder object whose fields can be accessed either as Python dict keys or as object attributes, for convenience
* for instance, `target_names` holds the list of the requested category names:

In [2]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
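
As a minimal sketch of the two access styles, the snippet below builds a tiny `Bunch` by hand instead of downloading the dataset. The variable `toy` and its contents are made up for illustration; only `sklearn.utils.Bunch` itself comes from scikit-learn.

```python
from sklearn.utils import Bunch

# Hypothetical miniature stand-in for the object returned by fetch_20newsgroups
toy = Bunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["alt.atheism", "sci.med"],
)

# The same field is reachable as a dict key or as an attribute
print(toy["target_names"])
print(toy.target_names)

# Because a Bunch is a dict, the usual dict methods work as well
print(sorted(toy.keys()))
```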

## Agenda

* Tokenization
* Stop-word Removal
* Stemming
* Word Cloud
* TF-IDF

## Some Intuition



## Constructing the Datasets

It would be desirable to split the dataset into the following parts:

* X_train
* y_train
* X_test
* y_test

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(twenty_train.data, twenty_train.target, train_size = 0.8)
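
To see what the split returns, here is a small sketch on made-up data. The lists `docs` and `labels` are hypothetical, and `random_state=42` is an addition (the cell above does not set it) so that the split is repeatable across runs.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 documents with alternating labels
docs = ["doc %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

# train_size=0.8 keeps 80% of the samples for training, the rest for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, train_size=0.8, random_state=42
)

print(len(X_tr))  # 8
print(len(X_te))  # 2

# Documents and labels are shuffled together, so they stay aligned
print(list(zip(X_te, y_te)))
```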

## Constructing the Datasets - X_train

In [None]:
for i in range(5):
    print(X_train[i])

In [None]:
len(X_train)

## Constructing the Datasets - X_test

In [None]:
X_test[0]

## Converting list to Pandas Series

In [6]:
import pandas as pd

X_train = pd.Series(X_train)
X_test = pd.Series(X_test)

X_train[0]

u"From: paj@uk.co.gec-mrc (Paul Johnson)\nSubject: Re: sore throat\nReply-To: paj@uk.co.gec-mrc (Paul Johnson)\nOrganization: GEC-Marconi Research Centre, Great Baddow, UK\nLines: 29\n\nIn article <47835@sdcc12.ucsd.edu> wsun@jeeves.ucsd.edu (Fiberman) writes:\n>I have had a sore throat for almost a week.  When I look into\n>the mirror with the aid of a flash light, I see white plaques in\n>the very back of my throat (on the sides).  I went to a health\n>center to have a throat culture taken.  They said that I do not\n>have strep throat.  Could a viral infection cause white plaques\n>on the sides of my throat?\n\nFirst, I am not a doctor.  I know about this because I have been\nthrough it.\n\nIt sounds like tonsilitis (lit. swollen tonsils).  Feel under your jaw\nhinge for a swelling on each side.  If you find them, its tonsilitis.\nI've had this a couple of times in the past.  The doctor prescribed a\nweeks course of penicillin and that cleared it up.\n\nIn my case it was associated w

## Constructing the Datasets - y_train

In [None]:
y_train

## Constructing the Datasets - y_test

In [None]:
y_test

## Tokenization

Tokenization breaks unstructured text into chunks of information that can be counted as discrete elements.

The counts of token occurrences in a document can be used directly as a vector representing that document. This immediately turns an unstructured string (a text document) into a structured, numerical representation suitable for machine learning.

* Tokenization segments a document into its atomic elements (tokens)
* Typically, our tokens are the words
    - Characters can be the more appropriate tokens for some tasks, for example language detection
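
The idea of turning token counts into a vector can be sketched with the standard library alone. The sentence `doc` is a made-up example, and the regular expression is just one simple way to pick out words.

```python
import re
from collections import Counter

doc = "the cat sat on the mat"  # hypothetical one-line document

# Word-level tokens: runs of word characters
word_tokens = re.findall(r"\w+", doc)
word_counts = Counter(word_tokens)

# Fix an ordering of the vocabulary, then read the counts off as a vector
vocabulary = sorted(word_counts)
vector = [word_counts[w] for w in vocabulary]
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 1, 1, 1, 2]

# Character-level tokens, the kind of unit that suits language detection
char_counts = Counter(doc.replace(" ", ""))
print(char_counts["t"])  # 5
```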

## Tokenization

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

* The pattern `\w+` matches a run of word characters and stops at the first non-word character, such as a space or a punctuation mark
* This causes problems for words like *don't*, which is read as two tokens: *don* and *t*
* A better tokenizer is `TreebankWordTokenizer`, which breaks words like *don't* into *do* and *n't*
* NLTK provides a number of pre-built tokenizers (for example, those in `nltk.tokenize.simple`)
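
The difference between the two tokenizers is easy to check on a short string. This is a small sketch; the sentence is made up, and the variable names `regexp_tok` and `treebank_tok` are only for illustration.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer

sentence = "I don't know"  # hypothetical example sentence

regexp_tok = RegexpTokenizer(r'\w+')
treebank_tok = TreebankWordTokenizer()

# \w+ drops the apostrophe and splits the contraction around it
print(regexp_tok.tokenize(sentence))    # ['I', 'don', 't', 'know']

# The Treebank tokenizer keeps the negation together as "n't"
print(treebank_tok.tokenize(sentence))  # ['I', 'do', "n't", 'know']
```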

## Tokenization

In [7]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

In [8]:
X_train = X_train.apply(lambda row: row.lower())
X_train = X_train.apply(lambda row: tokenizer.tokenize(row))
X_train.head(20)

0     [from, :, paj, @, uk.co.gec-mrc, (, paul, john...
1     [from, :, ed, @, cwis.unomaha.edu, (, ed, stas...
2     [from, :, sloan, @, cis.uab.edu, (, kenneth, s...
3     [from, :, jayne, @, mmalt.guild.org, (, jayne,...
4     [from, :, hans, @, cs.kuleuven.ac.be, (, hans,...
5     [from, :, geb, @, cs.pitt.edu, (, gordon, bank...
6     [from, :, caralv, @, caralv.auto-trol.com, (, ...
7     [from, :, donald, mackie, <, donald_mackie, @,...
8     [from, :, jaeger, @, buphy.bu.edu, (, gregg, j...
9     [from, :, draper, @, gnd1.wtp.gtefsd.com, (, p...
10    [from, :, ricky, @, watson.ibm.com, (, rick, t...
11    [from, :, bio1, @, navi.up.ac.za, (, fourie, j...
12    [from, :, ata, @, hfsi.hfsi.com, (, john, ata,...
13    [from, :, joshuaf, @, yang.earlham.edu, subjec...
14    [from, :, dpw, @, sei.cmu.edu, (, david, wood,...
15    [from, :, kaminski, @, netcom.com, (, peter, k...
16    [from, :, g9134255, @, wampyr.cc.uow.edu.au, (...
17    [from, :, geb, @, cs.pitt.edu, (, gordon, 

## Stop-word Removal

Let's look at the stopwords from the *stop_words* package, a [relatively conservative list](https://github.com/Alir3z4/stop-words/blob/master/english.txt).

In [None]:
from stop_words import get_stop_words

# create English stop words list
en_stop = get_stop_words('en')
print(en_stop)

## Impact of stop-word removal

In [None]:
X_train = X_train.apply(lambda row: [i for i in row if i not in en_stop])
X_train[0]
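
The filtering step itself is a plain list comprehension. The sketch below uses a tiny hand-written stop list (`toy_stop`, not the list from the *stop_words* package) and stores it in a `set`, which is an optional tweak: membership tests on a set are faster than on a list when there are many tokens to check.

```python
# Hypothetical miniature stop list; the real one comes from get_stop_words('en')
toy_stop = {"i", "have", "a", "the", "of", "in"}

tokens = ["i", "have", "a", "sore", "throat", "in", "the", "morning"]

# Keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in toy_stop]
print(filtered)  # ['sore', 'throat', 'morning']
```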

## Stemming

In [None]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
print(p_stemmer)

## The Porter Stemmer

* `p_stemmer.stem` expects each token to be a string (`str`)
* It returns the stemmed form of the string it is given
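
A quick sketch of what the stemmer does to a family of related words. The word list is made up for illustration; the point is that different surface forms collapse to one stem.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical example words sharing one root
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(w) for w in words]

print(stems)  # every entry is 'connect'
```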

In [None]:
# X_train = X_train.apply(lambda row: [p_stemmer.stem(i) for i in row])
# X_train.head()

In [None]:
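text = []
for doc_tokens in X_train:
    # stem every token of the current document
    stemmed = [p_stemmer.stem(token) for token in doc_tokens]
    # extend in place instead of rebuilding the growing list each time
    text.extend(stemmed)
    print(stemmed)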

## WordCloud

WordClouds are a quick way to check the result of our preprocessing steps and debug them.

In [None]:
textall = " ".join(text)
textall

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

wordcloud = WordCloud().generate(textall)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
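
A word cloud sizes each word by how often it occurs, so the same sanity check can be done in text form with a frequency count. This is a minimal sketch on a made-up token list (`toy_tokens`), not on the notebook's `text`.

```python
from collections import Counter

# Hypothetical stemmed tokens standing in for the full `text` list
toy_tokens = ["throat", "sore", "throat", "doctor", "sore", "throat"]

# The most frequent tokens are the ones a word cloud would draw largest
top = Counter(toy_tokens).most_common(2)
print(top)  # [('throat', 3), ('sore', 2)]
```

If header tokens or punctuation dominate this list, the preprocessing needs another pass.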

## WordCloud with lower max font size

In [None]:
# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(textall)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

## WordCloud with additional stopwords

In [None]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# mask image; loaded here but not passed to WordCloud below
haptik_mask = np.array(Image.open("./images/haptik.png"))

# extend wordcloud's built-in stop-word set with an extra word
stopwords = set(STOPWORDS)
stopwords.add("flight")

wc = WordCloud(max_words=2000, stopwords=stopwords)
# generate word cloud
wc = wc.generate(textall)

# show (create the figure first so the image is drawn on it)
plt.figure()
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

## Text Preprocessing using `sklearn`

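`sklearn`'s `feature_extraction` module provides a convenient API, **CountVectorizer**, to "convert raw text into a matrix of token counts". It also handles most of the text processing steps we covered (lowercasing, tokenization, stop-word removal); stemming is not built in and would require a custom tokenizer.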

In [None]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [None]:
vect

In [None]:
# use TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
vect.set_params(tokenizer=tokenizer.tokenize)

In [None]:
# remove English stop words
vect.set_params(stop_words='english')

In [None]:
# include 1-grams and 2-grams
vect.set_params(ngram_range=(1, 2))

In [None]:
# ignore terms that appear in more than 50% of the documents
vect.set_params(max_df=0.5)

In [None]:
# only keep terms that appear in at least 2 documents
vect.set_params(min_df=2)
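
The effect of `ngram_range`, `max_df` and `min_df` is easier to see on a corpus small enough to count by hand. The four documents below are made up, and `toy_vect` is a separate vectorizer (default tokenizer, no stop-word list) so that it does not interfere with `vect`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus
toy_docs = [
    "red apple pie",
    "red apple tart",
    "green apple pie",
    "green pear tart",
]

toy_vect = CountVectorizer(ngram_range=(1, 2), max_df=0.5, min_df=2)
toy_vect.fit(toy_docs)

# 'apple' is in 3 of 4 documents (more than 50%), so max_df drops it.
# 'pear' and bigrams such as 'apple tart' occur in 1 document, so min_df drops them.
print(sorted(toy_vect.vocabulary_))
# ['apple pie', 'green', 'pie', 'red', 'red apple', 'tart']
```

Only terms that occur in exactly two documents survive here, which shows how the two thresholds bracket the vocabulary from both sides.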

**Note:** `vect` expects raw text, one string per document, whereas `X_train` now holds lists of tokens. Hence, we rebuild `X_train` from the original data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(twenty_train.data, twenty_train.target, train_size = 0.8)

X_train

In [None]:
X_train = pd.Series(X_train)
X_test = pd.Series(X_test)

X_train

In [None]:
# learn the 'vocabulary' of the training data
vect.fit(X_train)

# examine the fitted vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
vect.get_feature_names()

Next, we transform the training data into a 'document-term matrix':

In [None]:
simple_train_dtm = vect.transform(X_train)
simple_train_dtm

In [None]:
# transform the test data with the vocabulary learned from the training data
test_dtm = vect.transform(X_test)

Next, we examine the vocabulary and document-term matrix together:

In [None]:
# densify only the first 4 rows; .toarray() on the full sparse matrix can use a lot of memory
pd.DataFrame(simple_train_dtm[:4].toarray(), columns=vect.get_feature_names())

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
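
The last point, that word order is ignored, can be illustrated with two made-up sentences that use the same words in a different order. This is a small sketch with a default `CountVectorizer` (unigrams only); the names `bow` and `dtm` are for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pair of documents with opposite meanings
docs = ["dog bites man", "man bites dog"]

bow = CountVectorizer()
dtm = bow.fit_transform(docs).toarray()

print(sorted(bow.vocabulary_))  # ['bites', 'dog', 'man']
print(dtm)                      # both rows are [1 1 1]

# Identical rows: the representation cannot tell the two sentences apart
print((dtm[0] == dtm[1]).all())  # True
```

Adding 2-grams, as done with `ngram_range=(1, 2)` above, recovers some of that local ordering.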

In [None]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(simple_train_dtm, y_train)

In [None]:
# make class predictions for test_dtm
y_pred_class = nb.predict(test_dtm)
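
The fit/transform/predict sequence can be traced end to end on a handful of made-up documents. Everything here (`toy_train`, `toy_labels`, `toy_test`, and the label meanings) is hypothetical; the important detail is that the test text is passed through `transform` only, so it is encoded with the vocabulary learned from the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: label 1 = sales pitch, label 0 = scheduling
toy_train = [
    "cheap pills buy now",
    "buy cheap meds",
    "meeting at noon",
    "lunch meeting tomorrow",
]
toy_labels = [1, 1, 0, 0]
toy_test = ["cheap meds now", "noon meeting"]

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)  # learn vocabulary and count
toy_test_dtm = toy_vect.transform(toy_test)        # reuse the learned vocabulary

toy_nb = MultinomialNB()
toy_nb.fit(toy_train_dtm, toy_labels)

toy_pred = toy_nb.predict(toy_test_dtm)
print(toy_pred)  # [1 0]
```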

In [None]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
metrics.confusion_matrix(y_test, y_pred_class)
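
To read these two metrics, it helps to compute them on labels small enough to check by hand. The label lists below are made up; in the confusion matrix, rows are the true classes and columns are the predicted classes.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

acc = metrics.accuracy_score(y_true, y_hat)
cm = metrics.confusion_matrix(y_true, y_hat)

print(acc)  # 4 correct out of 6
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# The diagonal holds the correct predictions, so accuracy = trace / total
print(cm.trace() / cm.sum())
```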