## Strategies
### Word focused
If your focus is the vocabulary, the next step is to remove punctuation and stopwords.

The following code builds a list of the words used in the `Body` column. This list will not help much with the classification task itself, but comparing vocabularies can be very revealing, given the right subsets (spam vs. regular emails, etc.).

Once we have a list of words, we can start the vocabulary analysis by coding a counter or using `FreqDist` from the `nltk` library for example.
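
As a rough sketch of the "coding a counter" option, the snippet below counts words in two tiny, made-up subsets and compares them. The sample texts, the `TOY_STOP_WORDS` set and the `count_words` helper are all hypothetical and only serve the illustration; the regex tokenizer is a crude stand-in for `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only;
# the notebook itself uses NLTK's English list plus punctuation.
TOY_STOP_WORDS = {"the", "a", "is", "to", "you", "your", "for", "at", "and"}

def count_words(texts, stop_words=TOY_STOP_WORDS):
    """Count lower-cased tokens across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        # crude tokenizer: runs of word characters, so punctuation is dropped
        tokens = re.findall(r"\w+", text.lower())
        counts.update(tok for tok in tokens if tok not in stop_words)
    return counts

# made-up examples standing in for two subsets of the dataset
spam = ["Win a free prize! Claim your free prize now.", "Free entry to win cash!"]
regular = ["Meeting moved to 3pm.", "Lunch at noon? The meeting is cancelled."]

spam_counts = count_words(spam)
regular_counts = count_words(regular)

print(spam_counts.most_common(3))
print(regular_counts.most_common(3))
# words seen in the spam subset but never in the regular one
print(sorted(set(spam_counts) - set(regular_counts)))
```

Keeping one `Counter` per subset makes the comparison a matter of simple set and dictionary operations.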

In [None]:
import pandas as pd
df = pd.read_csv('data.csv', index_col='Id')
df

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# punctuation marks are treated as stopwords too
stop_words = list(string.punctuation)
stop_words += stopwords.words('english')

In [None]:
words = []
for text in df['Body']:
    tokens = word_tokenize(text)
    # converts to lower case
    tokens = [tok.lower() for tok in tokens]
    # removes the stopwords and accumulates the words of every document
    words += [word for word in tokens if word not in stop_words]

In [1]:
def toklowstop(text):
    tokens = word_tokenize(text)
    # converts to lower case
    tokens = [tok.lower() for tok in tokens]
    # removes the stopwords
    words = [word for word in tokens if word not in stop_words]
    return words

In [None]:
df['Body'].apply(toklowstop)

In [None]:
from nltk import FreqDist
import matplotlib.pyplot as plt
FreqDist(words).plot(50)
plt.show()

## Wordcloud
Wordclouds are to text data what pie charts are to numerical data: at best confusing and at worst useless. But they can generate a nice picture for your article header :)

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(width=3000,
                      height=2000,
                      stopwords=stop_words)

wordcloud.generate(" ".join(words))

plt.imshow(wordcloud)
plt.axis('off')
plt.show()

## Stemming
Stemming removes the endings of words, shrinking them to their stem, their common denominator. For example, take the following list:
- programmer
- programmation
- programmed
- programming
- program
- programme

The stem of these words is **program**. A stemmed document is harder to read (for a human at least), but stemmed documents are easier to compare with one another. This "normalization" helps make a model more robust, as ambiguity is reduced.

It is worth noting:
- there are many stemming algorithms, available for various languages. The [nltk.stem API module](https://www.nltk.org/api/nltk.stem.html) lists the stemming classes available in NLTK.
- whilst stemming simplifies a document, it also creates "new" noise: the stem for "flies" is "fli", for example.
- whilst stemming simplifies a document, it also induces a loss of information. For example, in the "program" list above, "programme" is the British spelling and "program" the American spelling. If the origin of the review is not important for your analysis, great: stemming has made things simpler! However, if localisation is key to your problem, you might miss some nuances by stemming your documents.
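
To make the trade-offs above concrete, here is a deliberately naive suffix-stripping sketch. It is not the Snowball algorithm, and its output may differ from what NLTK's stemmers return; the `toy_stem` function, its suffix list and its `min_stem` threshold are assumptions chosen only to show how stripping endings merges word families while also producing non-words.

```python
def toy_stem(word, suffixes=("ation", "ing", "ed", "er", "es", "e", "s"), min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    # try longer suffixes first so "ation" wins over "s" or "e"
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            stem = word[:-len(suffix)]
            # collapse a doubled final consonant: "programm" -> "program"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

family = ["programmer", "programmation", "programmed",
          "programming", "program", "programme"]
stems = {word: toy_stem(word) for word in family}

print(stems)               # every word of the family maps to the same stem
print(toy_stem("flies"))   # a non-word: the "new" noise mentioned above
```

Note that `toy_stem("programme")` and `toy_stem("program")` return the same string, which is exactly the loss of the British/American distinction described in the last bullet.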

In [2]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [4]:
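def stem(document):
    """
    Stems every whitespace-separated word of a document
    using the Snowball stemmer defined above.
    """
    # split() without arguments also handles tabs, newlines and repeated spaces
    stemmed = [stemmer.stem(word) for word in document.split()]
    # join avoids the trailing space left by string concatenation
    return ' '.join(stemmed)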

## Lemmatisation