<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Load-the-data" data-toc-modified-id="1.-Load-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>1. Load the data</a></span></li><li><span><a href="#2.-Filtering-out-the-noise" data-toc-modified-id="2.-Filtering-out-the-noise-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>2. Filtering out the noise</a></span></li><li><span><a href="#3.-Even-better-filtering" data-toc-modified-id="3.-Even-better-filtering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>3. Even better filtering</a></span></li><li><span><a href="#4.-Term-frequency-times-inverse-document-frequency" data-toc-modified-id="4.-Term-frequency-times-inverse-document-frequency-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>4. Term frequency times inverse document frequency</a></span></li><li><span><a href="#5.-Utility-function" data-toc-modified-id="5.-Utility-function-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>5. Utility function</a></span></li></ul></div>

This notebook is part of the [Machine Learning class](https://github.com/erachelson/MLclass) by [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en).

License: CC-BY-SA-NC.

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Text data pre-processing</div>

In this exercise, we load a database of email messages and pre-format them so that we can design automated classification methods or use off-the-shelf classifiers.

"What is there to pre-process?" you might ask. Well, actually, text data comes in a very noisy form that we, humans, have become accustomed to and filter out effortlessly to grasp the core meaning of the text. It has a lot of formatting (fonts, colors, typography...), punctuation, abbreviations, common words, grammatical rules, etc. that we might wish to discard before even starting the data analysis.

Here are some pre-processing steps that can be performed on text:
1. loading the data, removing attachments, merging title and body;
2. tokenizing - splitting the text into atomic "words";
3. removal of stop-words - very common words;
4. removal of non-words - punctuation, numbers, gibberish;
5. lemmatization - merge together "find", "finds", "finder".

The final goal is to be able to represent a document as a mathematical object, e.g. a vector, that our machine learning black boxes can process.
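
As a rough illustration of steps 2 to 4 and of the final vector representation, here is a minimal pure-Python sketch. The tiny stop-word set, the two toy documents and the function names `preprocess` and `bag_of_words` are assumptions made for this example only; the rest of the notebook relies on scikit-learn and NLTK instead.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"the", "of", "a", "is", "to", "in", "and", "this"}

def preprocess(text):
    """Tokenize, lowercase, drop stop-words and non-alphabetic tokens."""
    tokens = re.findall(r"\w+", text.lower())             # tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [t for t in tokens if t.isalpha()]             # non-word removal

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["What language is this ?", "The language of the label, 100 % accurate"]
token_lists = [preprocess(d) for d in docs]
vocabulary = sorted(set(t for tokens in token_lists for t in tokens))
vectors = [bag_of_words(tokens, vocabulary) for tokens in token_lists]
print(vocabulary)
print(vectors)
```

Each document becomes a list of counts with one entry per vocabulary word, which is the kind of object a classifier can consume.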

# 1. Load the data

Let's first load the emails.

In [6]:
import os

data_switch = 1
email_path = []
email_label = []
if data_switch == 0:
    # flat layout: all emails in a single folder
    folders = ['../data/ling-spam/train-mails/']
else:
    # Ling-Spam "bare" layout: one sub-folder per part
    train_dir = '../data/lingspam_public/bare/'
    folders = [os.path.join(train_dir, d) for d in os.listdir(train_dir)]
for folder in folders:
    files = os.listdir(folder)  # list once so that paths and labels stay aligned
    email_path += [os.path.join(folder, f) for f in files]
    # spam files are named 'spm...' (assumed to hold for both layouts)
    email_label += [f.startswith('spm') for f in files]
print("number of emails", len(email_path))
email_nb = 8  # index of the email to display
print("email file:", email_path[email_nb])
print("email is a spam:", email_label[email_nb])
with open(email_path[email_nb]) as fh:
    print(fh.read())

number of emails 2893
email file: ../data/lingspam_public/bare/part1/3-425msg1.txt
email is a spam: False
Subject: what language is this ?

the toronto police have contacted our department for help in identifying the language of the label on a ball of wool in the purse of an elderly woman accused of shoplifting . she does not speak english and the police wish to obtain an interpreter for her . the following was dictate to me over the telephone ( so may not be 100 % accurate ) : ata lucru de myna din bumbac cardat . . . . please send replies directly to me . there is some urgency in this , as the woman is being held until they can question her . ron smyth smyth @ lake . scar . utoronto . ca



# 2. Filtering out the noise

One nice thing about scikit-learn is that it has lots of preprocessing utilities, such as [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which converts a collection of text documents to a matrix of token counts.

- To remove stop-words, we set: `stop_words='english'`
- To convert all words to lowercase: `lowercase=True`
- The default tokenizer in scikit-learn treats punctuation as a separator and only keeps tokens of at least 2 alphanumeric characters.
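
The default `token_pattern` of `CountVectorizer` is the regular expression `(?u)\b\w\w+\b`. The sketch below applies that same pattern with Python's `re` module only, to show which tokens survive; the helper name `default_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokenize(text):
    """Lowercase the text and extract tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = default_tokenize("I am a linguist, e-mail me at 9 am!")
print(tokens)  # ['am', 'linguist', 'mail', 'me', 'at', 'am']
```

Single characters such as "I", "a", "e" and "9" are dropped, and punctuation only acts as a separator. Since digits match `\w`, a number with two or more digits would be kept.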

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
countvect = CountVectorizer(input='filename', stop_words='english', lowercase=True)
word_count = countvect.fit_transform(email_path)

In [12]:
print("Number of documents:", len(email_path))
words = countvect.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print("Number of words:", len(words))
print("Document - words matrix:", word_count.shape)
print("First words:", words[40000:40100])

Number of documents: 2893
Number of words: 60618
Document - words matrix: (2893, 60618)
First words: ['notes', 'noteworthy', 'nothanku', 'nothofer', 'noti', 'notice', 'noticeably', 'noticed', 'notices', 'noticias', 'noticing', 'notification', 'notifications', 'notificazione', 'notified', 'notifiee', 'notify', 'notifying', 'noting', 'notion', 'notional', 'notionally', 'notionnelle', 'notions', 'notker', 'notorious', 'notoriously', 'notpossible', 'notre', 'notrequire', 'notting', 'nottingham', 'notturno', 'notwendig', 'notwithstanding', 'noufal', 'noun', 'nouniness', 'nouns', 'nous', 'nouvelle', 'nov', 'nova', 'noveau', 'noveck', 'novel', 'novelist', 'novelists', 'novell', 'novell1', 'novels', 'novelty', 'november', 'november1998', 'novembre', 'novenera', 'novetats', 'novi', 'novice', 'novices', 'novick', 'novicklr', 'noviembre', 'novmember', 'novokuznetsk', 'novum', 'novus', 'nowaday', 'nowadays', 'nowak', 'nowens', 'nowflake', 'nowo', 'nowotka', 'noxious', 'noyau', 'noyer', 'nozue', 'n

# 3. Even better filtering

That is already reasonable, but this pre-processing does not perform lemmatization, the list of stop-words could be better, and we might want to remove non-English words (misspelled words, tokens containing numbers, etc.).

A slightly better preprocessing uses the [Natural Language Toolkit](https://www.nltk.org/). The one below:
- tokenizes;
- removes punctuation;
- removes stop-words;
- removes non-English and misspelled words (optional);
- removes 1-character words;
- removes non-alphabetical words (numbers and codes essentially);
- lemmatizes the remaining words.

In [16]:
# Requires the NLTK corpora 'stopwords', 'words' and 'wordnet' (see nltk.download).
from nltk import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import words

class LemmaTokenizer:
    """Callable tokenizer: filters tokens, then lemmatizes the remaining ones."""
    def __init__(self, remove_non_words=True):
        self.wnl = WordNetLemmatizer()
        self.stopwords = set(stopwords.words('english'))
        self.words = set(words.words())
        self.remove_non_words = remove_non_words
    def __call__(self, doc):
        # tokenize words and punctuation
        word_list = wordpunct_tokenize(doc)
        # remove stop-words
        word_list = [word for word in word_list if word not in self.stopwords]
        # remove tokens that are not in the NLTK English word list
        if self.remove_non_words:
            word_list = [word for word in word_list if word in self.words]
        # remove 1-character words
        word_list = [word for word in word_list if len(word) > 1]
        # remove non-alphabetic tokens (punctuation, numbers, codes)
        word_list = [word for word in word_list if word.isalpha()]
        return [self.wnl.lemmatize(t) for t in word_list]

# CountVectorizer lowercases each document before calling the tokenizer
countvect = CountVectorizer(input='filename', tokenizer=LemmaTokenizer(remove_non_words=True))
word_count = countvect.fit_transform(email_path)
# inverse mapping: column index -> word
feat2word = {v: k for k, v in countvect.vocabulary_.items()}

In [17]:
print("Number of documents:", len(email_path))
words = countvect.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print("Number of words:", len(words))
print("Document - words matrix:", word_count.shape)
print("First words:", words[0:100])

Number of documents: 2893
Number of words: 14279
Document - words matrix: (2893, 14279)
First words: ['aa', 'aal', 'aba', 'aback', 'abacus', 'abandon', 'abandoned', 'abandonment', 'abbas', 'abbreviation', 'abdomen', 'abduction', 'abed', 'aberrant', 'aberration', 'abide', 'abiding', 'abigail', 'ability', 'ablative', 'ablaut', 'able', 'abler', 'aboard', 'abolition', 'abord', 'aboriginal', 'aborigine', 'abound', 'abox', 'abreast', 'abridged', 'abroad', 'abrogate', 'abrook', 'abruptly', 'abscissa', 'absence', 'absent', 'absolute', 'absolutely', 'absoluteness', 'absolutist', 'absolutive', 'absolutization', 'absorbed', 'absorption', 'abstract', 'abstraction', 'abstractly', 'abstractness', 'absurd', 'absurdity', 'abu', 'abundance', 'abundant', 'abuse', 'abusive', 'abyss', 'academe', 'academic', 'academically', 'academician', 'academy', 'accelerate', 'accelerated', 'accelerative', 'accent', 'accentuate', 'accentuation', 'accept', 'acceptability', 'acceptable', 'acceptance', 'acceptation', 'acc

# 4. Term frequency times inverse document frequency

After this first preprocessing, each document is summarized by a vector whose size is the number of words in the extracted dictionary. For example, the first email in the list has become:

In [18]:
mail_number = 0
with open(email_path[mail_number]) as fh:
    text = fh.read()
print("Original email:")
print(text)
# map each non-zero column of the count matrix back to its word
nonzero_cols = word_count[mail_number, :].nonzero()[1]
emailBagOfWords = {feat2word[i]: word_count[mail_number, i] for i in nonzero_cols}
print("Bag of words representation (", len(emailBagOfWords), " words in dict):", sep='')
print(emailBagOfWords)
print("\nVector representation (", nonzero_cols.shape[0], " non-zero elements):", sep='')
print(word_count[mail_number, :])

Original email:
Subject: re : 5 . 1196 corpus analysis of - body / - one

it seems to me altogether possible , even likely , that there is the following interaction , namely , that the particular lexical form < everybody > has come to acquire , for many speakers certainly ( me , for instance , a quasi native speaker , and i share ellen prince 's intuition concerning her sentence # 1 ) a distinctly collective sense , whereas < every - one > maintains , again for these speakers at least , a distributive sense , at least as a living option . thus 1 ' . everybody came , bringing their respective wives . seems especially good , though , oddly enough maybe , this instantiates the special claim that , for this particular lexical - body item , the formal / informal register distinction has at least contextually ? ) collapsed . now , it seems plausible , but no more than that in the present state of knowledge ( or of ignorance ) that there is a more underlying interaction , namely , that there 

Counting words is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called `tf` for Term Frequencies.
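
As a small worked example of this normalization, the sketch below turns raw counts into term frequencies for two toy documents that have the same content but different lengths. The function name `term_frequencies` and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def term_frequencies(tokens):
    """Divide each word count by the total number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

short_doc = ["spam", "offer"]
long_doc = ["spam", "offer"] * 10   # same content, ten times longer

tf_short = term_frequencies(short_doc)
tf_long = term_frequencies(long_doc)
print(tf_short)
print(tf_long)
```

The raw counts differ by a factor of ten, yet both documents end up with the same term frequencies.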

Another refinement on top of `tf` is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called `tf–idf` for “Term Frequency times Inverse Document Frequency” and, again, scikit-learn does the job for us with the [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) class.
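
With its default settings, `TfidfTransformer` multiplies the raw counts by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. It then rescales each row to unit Euclidean norm (rather than dividing by the document length). The pure-Python sketch below follows this recipe on a toy count matrix; it is an illustration under these assumptions, not a replacement for the library, and the names `tfidf_rows` and `counts` are made up.

```python
import math

def tfidf_rows(counts):
    """Smoothed tf-idf with L2-normalized rows, for a dense list of lists."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        rows.append([x / norm if norm > 0 else 0.0 for x in weighted])
    return rows

# Toy counts: term 0 occurs in every document, term 1 only in the first one.
counts = [[2, 1],
          [3, 0],
          [1, 0]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(x, 3) for x in row])
```

In the first row, the rare term 1 receives a larger share of the weight than its raw count alone would give, while the last two rows, which only contain the common term, become identical after normalization.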

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer().fit_transform(word_count)
tfidf.shape

(2893, 14279)

Now every email in the corpus has a vector representation that filters out irrelevant tokens and retains the significant information.

In [26]:
print("email 0:")
print(tfidf[0,:])

email 0:
  (0, 14201)	0.03794448933062206
  (0, 14077)	0.11995441175512732
  (0, 14058)	0.043946205166242404
  (0, 14027)	0.11149263446882586
  (0, 13758)	0.03977564480489188
  (0, 13460)	0.18525919282977144
  (0, 13301)	0.057300017725114234
  (0, 13172)	0.03701306596891324
  (0, 12800)	0.08381173695376727
  (0, 12775)	0.042588071692877726
  (0, 12756)	0.03644487951808927
  (0, 12504)	0.047093650349899534
  (0, 12502)	0.03136664566252002
  (0, 12425)	0.07579766304265152
  (0, 12324)	0.056168664307386096
  (0, 12263)	0.09262959641488572
  (0, 12224)	0.09262959641488572
  (0, 12153)	0.011190869172231764
  (0, 11947)	0.030862053793468572
  (0, 11784)	0.03165177877275147
  (0, 11779)	0.04081740755061132
  (0, 11707)	0.042026852024637885
  (0, 11328)	0.046088079926399794
  (0, 11314)	0.058828416669491145
  (0, 11276)	0.11434408783521394
  :	:
  (0, 2339)	0.033281005008342
  (0, 2315)	0.0616019187558951
  (0, 2302)	0.3371784221408105
  (0, 2131)	0.049783771439385546
  (0, 2081)	0.07861008993

# 5. Utility function

Let's put all this loading process into a separate file so that we can reuse it in other experiments.

In [21]:
import load_spam
spam_data = load_spam.spam_data_loader()
spam_data.load_data()
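
The source of `load_spam` is not reproduced in this notebook. As a purely hypothetical sketch of the file-listing part of such a module, the function below (the name `list_labeled_emails` is an assumption, not the module's actual API) walks a directory laid out like `lingspam_public/bare/` and labels as spam every file whose name starts with `spm`, as in Section 1.

```python
import os

def list_labeled_emails(root_dir):
    """Return (paths, labels) for all files in the sub-folders of root_dir.

    A file is labeled True (spam) when its name starts with 'spm'.
    """
    paths, labels = [], []
    for folder_name in sorted(os.listdir(root_dir)):
        folder = os.path.join(root_dir, folder_name)
        if not os.path.isdir(folder):
            continue  # skip stray files such as a readme
        for file_name in sorted(os.listdir(folder)):
            paths.append(os.path.join(folder, file_name))
            labels.append(file_name.startswith("spm"))
    return paths, labels
```

Unlike the cell in Section 1, this sketch sorts the directory listings so that the order of the emails is reproducible across machines; the indices it produces may therefore differ from the ones printed in this notebook.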

In [27]:
spam_data.print_email(100)

email file: ../data/lingspam_public/bare/part1/spmsga129.txt
email is a spam: True
Subject: lists and software worldwide

order form : all addresses are fresh and cleaned against international remove lists for the best results with the minimum irritation to those who do not wish to recieve unsolicited mail . all discs come with details of web sites for usefull mailing programs and other related products available on the net . many new mailer programs bypass your isp and send mail direct to the recipient so you dont need an expensive " bulk-friendly " isp . disc supplied come with a free mailing program , its not the best but will get you started if you dont have one . prices are quoted in uk pound sterling / us dollars and are fully inclusive of postage and packing 1 , 000 , 000 email addresses @ 15 / $ 35 [ ] 2 , 000 , 000 email addresses @ 29 / $ 59 [ ] 3 , 000 , 000 email addresses @ 42 / $ 80 [ ] 4 , 000 , 000 email addresses @ 54 / $ 102 [ ] 5 , 000 , 000 email addresses @ 65 / $ 