## Text Normalization

1. Basic processing
2. Stemming
3. Lemmatization
4. Non-standard words mapping
5. Stopwords

Reference: [The Tao of Text Normalization](https://medium.com/parrot-prediction/the-tao-of-text-normalization-2e7aecd1861)

Other methods:
* [Categorizing and Tagging Words](http://www.nltk.org/book/ch05.html)
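
As a rough illustration of steps 1 and 4 in the list above, here is a minimal sketch using only the standard library. The `NON_STANDARD` lookup table and both function names are hypothetical, made up for this example; stemming, lemmatization and stop-word removal are not covered here.

```python
import re

# Hypothetical lookup table for non-standard words (an assumption for this
# sketch); a real project would need a much larger, domain-specific mapping.
NON_STANDARD = {"u": "you", "wif": "with"}


def basic_processing(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def map_non_standard(tokens, mapping=NON_STANDARD):
    """Replace tokens found in the mapping; leave all other tokens unchanged."""
    return [mapping.get(token, token) for token in tokens]


tokens = basic_processing("Joking  WIF u").split(" ")
print(map_non_standard(tokens))  # ['joking', 'with', 'you']
```

Running the mapping after lowercasing keeps the lookup table small, since it only needs lowercase keys.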

## Convert a Corpus to a Vector Format

The notes below follow the Wikipedia page on the [bag-of-words model](http://en.wikipedia.org/wiki/Bag-of-words_model).

### 1. Bag-of-Words
The following models a text document using bag-of-words. Here are two simple text documents:

    (1) John likes to watch movies. Mary likes movies too.

    (2) John also likes to watch football games.

Based on these two text documents, a list of words is constructed for each document:

    "John","likes","to","watch","movies","Mary","likes","movies","too"

    "John","also","likes","to","watch","football","games"

Representing each bag-of-words as a dictionary:

    BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}

    BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}

Each key is the word, and each value is the number of occurrences of that word in the given text document.

The order of the elements does not matter, so `{"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1}` is also BoW1. This matches a strict JSON object representation, in which keys are unordered.

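Take the keys of BoW1 and BoW2 together as a shared vocabulary, BoW3, in order of first appearance ("John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"). Writing each document's count for every vocabulary word gives one vector per document:

    (1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

    (2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]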
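
The construction above can be sketched with `collections.Counter`. The `tokenize` helper and its letters-only regular expression are assumptions made for this example, not part of the original notes.

```python
import re
from collections import Counter

documents = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]


def tokenize(text):
    """Keep runs of letters only, preserving the original case."""
    return re.findall(r"[A-Za-z]+", text)


# One bag (word -> count) per document.
bags = [Counter(tokenize(doc)) for doc in documents]

# Shared vocabulary: every distinct word, in order of first appearance.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# A Counter returns 0 for missing keys, which fills in the zeros.
vectors = [[bag[word] for word in vocabulary] for bag in bags]
print(vectors[0])  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
print(vectors[1])  # [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```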

The bag-of-words representation does not reveal that the verb "likes" always follows a person's name in this text. As an alternative, the **n-gram model can store this word-order information**.

### 2. N-gram
Bigrams (N = 2) of document (1), built within each sentence:

    [
        "John likes",
        "likes to",
        "to watch",
        "watch movies",
        "Mary likes",
        "likes movies",
        "movies too",
    ]

Bag-of-words can be seen as a special case of the N-gram model with N = 1.
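
A minimal sketch of how such N-grams could be generated. The `ngrams` function is a hypothetical helper written for this example, and the sentences are pre-tokenized by hand so that no N-gram crosses a sentence boundary.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Document (1), split by hand into its two sentences.
sentences = [
    ["John", "likes", "to", "watch", "movies"],
    ["Mary", "likes", "movies", "too"],
]

bigrams = [gram for sentence in sentences for gram in ngrams(sentence, 2)]
print(bigrams)

# With n = 1 the result is just the list of single words.
unigrams = [gram for sentence in sentences for gram in ngrams(sentence, 1)]
print(unigrams)
```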


## Data Input and Data Analysis

    import pandas as pd

    # The file is tab-separated and has no header row, so the column names are given explicitly.
    messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t', names=["label", "message"])
    messages.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


    messages.describe()

|        | label | message |
|--------|-------|---------|
| count  | 5572 | 5572 |
| unique | 2 | 5169 |
| top    | ham | Sorry, I'll call later |
| freq   | 4825 | 30 |


    messages.groupby('label').describe()

Statistics of the `message` column, per label:

| label | count | unique | top | freq |
|-------|-------|--------|-----|------|
| ham   | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam  | 747 | 653 | Please call our customer service representativ... | 4 |


## Text Pre-processing



### Normalization
What we will do:
1. Convert a corpus to a vector format - **Bag-of-Words**
2. Remove punctuation - **Python `string` module**
3. Remove stopwords - **NLTK `nltk.corpus` library**
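
Steps 2 and 3 could look roughly like the sketch below. The function name `text_process` and the tiny `STOPWORDS` set are assumptions made for this example; the set stands in for `nltk.corpus.stopwords.words('english')`, which needs the NLTK stop-word corpus to be downloaded first.

```python
import string

# Tiny stand-in stop-word list (an assumption for this sketch); the plan
# above uses nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {"i", "a", "the", "to", "in", "he", "so", "then"}


def text_process(message):
    """Strip punctuation, then drop stop words (compared case-insensitively)."""
    no_punctuation = message.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.split() if word.lower() not in STOPWORDS]


print(text_process("Free entry in a comp, to win the cup!"))
# ['Free', 'entry', 'comp', 'win', 'cup']
```

Returning a list of tokens rather than a string makes the function usable as a custom analyzer for a vectorizer later on.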

### Vectorization

1. Get a matrix of token counts - **CountVectorizer**

2. Count how many times a word occurs in each message (known as **term frequency**) - **TF-IDF**

3. Weigh the counts so that tokens appearing in many messages get lower weight (**inverse document frequency**) - **TF-IDF**

4. Normalize the vectors to unit length, to abstract from the original text length (**L2 norm**)

CountVectorizer converts a collection of text documents to a matrix of token counts:

<table border="1">
<tr>
<th></th> <th>Message 1</th> <th>Message 2</th> <th>...</th> <th>Message N</th> 
</tr>
<tr>
<td><b>Word 1 Count</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word 2 Count</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word N Count</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>
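
The vectorization steps listed above (token counts, term frequency, inverse document frequency, L2 normalization) can be sketched without any library. The toy messages are made up for this example, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is one common variant chosen as an assumption here; other definitions give different weights. Note that this sketch stores one row per message, which is the transpose of the table above.

```python
import math
from collections import Counter

# Toy tokenized messages, made up for this sketch.
messages = [
    ["free", "entry", "win", "free"],
    ["call", "later"],
    ["win", "call"],
]

vocabulary = sorted({word for message in messages for word in message})

# Token counts (term frequency): one row per message, one column per word.
counts = [[Counter(message)[word] for word in vocabulary] for message in messages]

# Inverse document frequency: words found in fewer messages get a larger weight.
n_docs = len(messages)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def l2_normalize(row):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else row


# Weight the counts by IDF, then normalize each message vector.
tfidf = [l2_normalize([tf * weight for tf, weight in zip(row, idf)]) for row in counts]

for word, weight in zip(vocabulary, idf):
    print(f"{word:>6}: idf = {weight:.3f}")
```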


## Train - Naive Bayes Classifier Algorithm

## Create a Pipeline

## Evaluation
1. predict + classification_report
2. train / test split
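
The training, pipeline and evaluation steps outlined above might be combined as in the following sketch, assuming scikit-learn is installed. The eight toy messages are made up for illustration and are not taken from the SMS Spam Collection, so the printed scores say nothing about real performance; the step names `"bow"`, `"tfidf"` and `"classifier"` are arbitrary labels.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data, only to make the sketch runnable.
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward today", "urgent win cash prize",
    "are we meeting later today", "sorry I will call later",
    "see you at lunch", "can you call me tonight",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Hold out a quarter of the messages for evaluation, keeping both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipeline = Pipeline([
    ("bow", CountVectorizer()),        # strings -> token counts
    ("tfidf", TfidfTransformer()),     # counts -> weighted TF-IDF scores
    ("classifier", MultinomialNB()),   # Naive Bayes on the TF-IDF vectors
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))
```

Wrapping the vectorizer and classifier in one `Pipeline` means the vocabulary and IDF weights are learned from the training split only, which avoids leaking information from the test messages.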

## More References for NLP
* [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
* [Bag-of-Words / N-gram](http://en.wikipedia.org/wiki/Bag-of-words_model)
* [Stemming](https://en.wikipedia.org/wiki/Stemming)