# Introduction to NLP - Week 1
## Cleaning and preprocessing textual data

#### In this week's session we will cover some fundamentals to get started with natural language processing. This includes cleaning and preprocessing textual data in order to make further analysis easier and more efficient.

###### We will cover:
- Removal of unwanted characters
- Normalisation
- Tokenisation into n-grams
- Stopword removal
- Lemmatising/Stemming

#### In order to give you as many different examples and showcases as possible, I will not apply all of this to one particular dataset but rather to individual textual examples, since text cleaning is task specific and there is no "one-size-fits-all" solution. In the practical you will apply suitable preprocessing techniques to a whole dataset of your choice.


In [62]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.stem import WordNetLemmatizer, PorterStemmer
from collections import Counter
import re

In [65]:
foxnews = pd.read_csv("foxnews.csv")

In [66]:
foxnews.head()

Unnamed: 0,title,text,label,succ,meta,user,mentions,prev
0,German lawmakers approve 'no means no' rape la...,Merkel would never say NO,1,,German lawmakers passed a bill Thursday that w...,barryswallows,,
1,German lawmakers approve 'no means no' rape la...,"Expect more and more women to be asking .. ""wh...",1,,German lawmakers passed a bill Thursday that w...,PostApocalypticHero,,
2,German lawmakers approve 'no means no' rape la...,Groping people in public wasn't already illega...,0,,German lawmakers passed a bill Thursday that w...,californiamojo,,
3,German lawmakers approve 'no means no' rape la...,"Merkel, possible the only person in charge who...",1,,German lawmakers passed a bill Thursday that w...,MikeSte,,
4,German lawmakers approve 'no means no' rape la...,"They know very well, no means NO! They need to...",1,,German lawmakers passed a bill Thursday that w...,scientist,,


In [67]:
foxnews.groupby('title').text.nunique()

title
Black Lives Matter must rescind anti-Israel declaration                                         64
Confederate flag debate: Protecting hatred preserves freedom                                   156
First lady Obama takes swipe at Trump                                                          161
Fury as feminist blames toddler alligator death on white 'entitlement'                         165
German lawmakers approve 'no means no' rape law after Cologne assaults                         121
ICE program failing to rid US of dangerous illegal immigrants, analysts say                    162
Navy names ship after gay rights advocate Harvey Milk                                          192
States moving to restore work requirements for food stamp recipients                           141
Supreme Court has spoken on affirmative action. Now, colleges should boost income diversity    114
Texas, 12 states to ask judge to stall Obama transgender bathroom order                        244
Name

In [68]:
# keep the comments on one article (the Texas transgender bathroom order story) in a separate dataframe
texas = foxnews[foxnews['title'] == "Texas, 12 states to ask judge to stall Obama transgender bathroom order"]

### 1. Removing unwanted characters - which characters to remove?
This is a primary step in the process of text cleaning. If we scrape some text from HTML/XML sources, we’ll need to get rid of all the tags, HTML entities, punctuation, non-alphabetic characters, and any other kind of characters which might not be a part of the language. The general methods of such cleaning involve regular expressions, which can be used to filter out most of the unwanted text.

However, sometimes, depending on the type of data, we want to retain certain types of punctuation. Consider for example human generated tweets which you want to classify as very angry, angry, neutral, happy, and very happy. Simple sentiment analysis might find it hard to differentiate between a happy, and a very happy sentiment, because the only difference between a happy and a very happy tweet might be punctuation.

Example:

*This is amazing*    <pre>  vs     </pre>  *THIS IS AMAZING!!!!!*

---

Or what about this one:



*I don't know :) <3* <pre>  vs     </pre>  *I don't know :(((*
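If the task depends on such signals, the cleaning step can whitelist them instead of stripping everything that is not a letter. The following is a minimal sketch, not a complete emoticon handler: the `EMOTICON` pattern and the function name `keep_letters_and_emotion` are assumptions made for illustration and only cover a handful of common forms.

```python
import re

# Hypothetical pattern covering a few common emoticons, e.g. :) :( :((( ;D <3
EMOTICON = r"(?:[:;=]-?[()DPp]+|<3)"
# Tokens we keep: emoticons, words (with an optional apostrophe part), runs of ! or ?
TOKEN = re.compile(EMOTICON + r"|[A-Za-z]+(?:'[A-Za-z]+)?|[!?]+")

def keep_letters_and_emotion(raw_text):
    """Return the tokens worth keeping, joined by single spaces."""
    return " ".join(TOKEN.findall(raw_text))

print(keep_letters_and_emotion("THIS IS AMAZING!!!!!"))
print(keep_letters_and_emotion("I don't know :) <3"))
print(keep_letters_and_emotion("I don't know :((("))
```

Runs of exclamation marks survive as one token, so a later feature such as "number of `!`" remains possible.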


In [27]:
#regular expression keeping only letters 

def keep_letters_only(raw_text):
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)
    return letters_only_text



In [43]:
sample = "Now we have reverse discrimination, BIG TIME because of this PERCEIVED PROBLEM that our DEAR LEADER intended to remedy by EXECUTIVE FIAT! Just another reason why there should not be a federal department of education that doles out OUR OWN TAX MONEY and makes the states JUMP THROUGH HOOPS TO GET IT! Don't you think ENOUGH IS ENOUGH!"
keep_letters_only(sample) 

'Now we have reverse discrimination  BIG TIME because of this PERCEIVED PROBLEM that our DEAR LEADER intended to remedy by EXECUTIVE FIAT  Just another reason why there should not be a federal department of education that doles out OUR OWN TAX MONEY and makes the states JUMP THROUGH HOOPS TO GET IT  Don t you think ENOUGH IS ENOUGH '
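For text scraped from HTML, tags and entities are usually handled before a letters-only filter like the one above. The sketch below uses only the standard library; `strip_html` is a hypothetical helper name, and the regular expression is a crude tag remover rather than a real HTML parser.

```python
import html
import re

def strip_html(raw_html):
    """Drop tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)   # crude tag removal, not a parser
    decoded = html.unescape(no_tags)               # &amp; -> &, &quot; -> "
    return re.sub(r"\s+", " ", decoded).strip()

snippet = "<p>Tom &amp; Jerry said:<br/>&quot;Enough is enough!&quot;</p>"
print(strip_html(snippet))
```

Entities are decoded after the tags are removed, so an encoded `&lt;` in the text is not mistaken for the start of a tag.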

### 2. Normalisation
Recall our SMS sample:

**** **** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !

I'd definitely deem this as spam. But clearly there's a lot going on here: phone numbers, emails, website URLs, money amounts, and gratuitous whitespace and punctuation. Some terms are randomly capitalized, others are in all-caps. Since these terms might show up in any one of the training examples in countless forms, we need a way to ensure each training example is on equal footing via a preprocessing step called **normalisation**.

Instead of removing the following terms, for each training example, let's replace them with a specific string.

- Replace email addresses with `emailaddr`
- Replace URLs with `httpaddr`
- Replace money symbols with `moneysymb`
- Replace phone numbers with `phonenumbr`
- Replace numbers with `numbr`
- Remove all other punctuation


In [70]:
def normalisation_sms(raw_text):
    cleaned = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', raw_text)
    cleaned = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr',
                     cleaned)
    cleaned = re.sub(r'£|\$|\€', 'moneysymb ', cleaned) #add whitespace
    cleaned = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', cleaned)
    cleaned = re.sub(r'\d+(\.\d+)?', 'numbr', cleaned)
    letters_only_text = re.sub("[^a-zA-Z]", " ", cleaned)
    return letters_only_text

In [71]:
normalisation_sms(sample)

'Now we have reverse discrimination  BIG TIME because of this PERCEIVED PROBLEM that our DEAR LEADER intended to remedy by EXECUTIVE FIAT  Just another reason why there should not be a federal department of education that doles out OUR OWN TAX MONEY and makes the states JUMP THROUGH HOOPS TO GET IT  Don t you think ENOUGH IS ENOUGH '
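The news comment contains no e-mail addresses, URLs, money amounts or phone numbers, so only the final letters-only step changes it. To see the placeholders appear, the same substitutions can be run on the SMS sample. The sketch below repeats the function so that it runs on its own; the variable `sms` is the sample from above joined into one string.

```python
import re

def normalisation_sms(raw_text):
    # Same substitutions, in the same order, as the function defined above
    cleaned = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', raw_text)
    cleaned = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', cleaned)
    cleaned = re.sub(r'£|\$|€', 'moneysymb ', cleaned)
    cleaned = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', cleaned)
    cleaned = re.sub(r'\d+(\.\d+)?', 'numbr', cleaned)
    return re.sub("[^a-zA-Z]", " ", cleaned)

sms = ("**** **** CONGRATlations **** You won 2 tIckETs to Hamilton in "
       "NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL "
       "555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !")

normalised = normalisation_sms(sms)
# Collapse the runs of spaces left behind by the removed characters
print(" ".join(normalised.split()))
```

The order of the substitutions matters: the phone number has to be replaced before the generic number rule, otherwise it would be split into three `numbr` tokens.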

### 3. Tokenisation
Tokenisation is just the process of splitting a sentence into words.

Text: *A bad day in London is still better than a bad day anywhere else*

Tokens: `['a', 'bad', 'day', 'in', 'london', 'is', 'still', 'better', 'than', 'a', 'bad', 'day', 'anywhere', 'else']`

This example not only separates the individual entities, but also gets rid of the capitalism involved (no pun intended). Whether to lowercase is, again, dependent on the data and the task at hand. If we want to differentiate between sentiments, something written in uppercase might mean something different from the same thing written in lowercase.

In [72]:
# Note: if you have already removed all punctuation you can just use Python's built-in .split() method
keep_letters_only(sample).lower().split()

['now',
 'we',
 'have',
 'reverse',
 'discrimination',
 'big',
 'time',
 'because',
 'of',
 'this',
 'perceived',
 'problem',
 'that',
 'our',
 'dear',
 'leader',
 'intended',
 'to',
 'remedy',
 'by',
 'executive',
 'fiat',
 'just',
 'another',
 'reason',
 'why',
 'there',
 'should',
 'not',
 'be',
 'a',
 'federal',
 'department',
 'of',
 'education',
 'that',
 'doles',
 'out',
 'our',
 'own',
 'tax',
 'money',
 'and',
 'makes',
 'the',
 'states',
 'jump',
 'through',
 'hoops',
 'to',
 'get',
 'it',
 'don',
 't',
 'you',
 'think',
 'enough',
 'is',
 'enough']

In [73]:
#However, if you still have punctuation, look what happens
sample.lower().split()

['now',
 'we',
 'have',
 'reverse',
 'discrimination,',
 'big',
 'time',
 'because',
 'of',
 'this',
 'perceived',
 'problem',
 'that',
 'our',
 'dear',
 'leader',
 'intended',
 'to',
 'remedy',
 'by',
 'executive',
 'fiat!',
 'just',
 'another',
 'reason',
 'why',
 'there',
 'should',
 'not',
 'be',
 'a',
 'federal',
 'department',
 'of',
 'education',
 'that',
 'doles',
 'out',
 'our',
 'own',
 'tax',
 'money',
 'and',
 'makes',
 'the',
 'states',
 'jump',
 'through',
 'hoops',
 'to',
 'get',
 'it!',
 "don't",
 'you',
 'think',
 'enough',
 'is',
 'enough!']

Tokens such as 'discrimination,' and 'fiat!' still carry their punctuation. In the same way, 'London' and 'London,' are not the same thing!
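A quick count makes the problem concrete. In this small sketch (the sentence is made up for illustration), whitespace splitting produces two separate vocabulary entries for what is really one word:

```python
from collections import Counter

text = "I love London, and London loves me."
counts = Counter(text.lower().split())

# 'london' and 'london,' are counted as two different tokens
print(counts["london"], counts["london,"])
```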

In [74]:
# so it's better to use nltk's tokenize
from nltk import word_tokenize, bigrams, trigrams

nltk_tokens = word_tokenize(sample.lower())
nltk_tokens

['now',
 'we',
 'have',
 'reverse',
 'discrimination',
 ',',
 'big',
 'time',
 'because',
 'of',
 'this',
 'perceived',
 'problem',
 'that',
 'our',
 'dear',
 'leader',
 'intended',
 'to',
 'remedy',
 'by',
 'executive',
 'fiat',
 '!',
 'just',
 'another',
 'reason',
 'why',
 'there',
 'should',
 'not',
 'be',
 'a',
 'federal',
 'department',
 'of',
 'education',
 'that',
 'doles',
 'out',
 'our',
 'own',
 'tax',
 'money',
 'and',
 'makes',
 'the',
 'states',
 'jump',
 'through',
 'hoops',
 'to',
 'get',
 'it',
 '!',
 'do',
 "n't",
 'you',
 'think',
 'enough',
 'is',
 'enough',
 '!']
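`word_tokenize` needs NLTK's tokeniser data to be installed. Where that is not an option, a regular expression gives a rough approximation. This is only a sketch: `simple_tokenize` is a hypothetical helper, and unlike NLTK it keeps contractions such as "don't" as one token instead of splitting them into "do" and "n't".

```python
import re

def simple_tokenize(text):
    """Split into word tokens (keeping apostrophes) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\w\s]", text)

print(simple_tokenize("fiat! don't you think enough is enough!"))
```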

In [76]:
print(list(bigrams(nltk_tokens)))
print("*****---------------------------------------------------------*****")
print(list(trigrams(nltk_tokens)))

[('now', 'we'), ('we', 'have'), ('have', 'reverse'), ('reverse', 'discrimination'), ('discrimination', ','), (',', 'big'), ('big', 'time'), ('time', 'because'), ('because', 'of'), ('of', 'this'), ('this', 'perceived'), ('perceived', 'problem'), ('problem', 'that'), ('that', 'our'), ('our', 'dear'), ('dear', 'leader'), ('leader', 'intended'), ('intended', 'to'), ('to', 'remedy'), ('remedy', 'by'), ('by', 'executive'), ('executive', 'fiat'), ('fiat', '!'), ('!', 'just'), ('just', 'another'), ('another', 'reason'), ('reason', 'why'), ('why', 'there'), ('there', 'should'), ('should', 'not'), ('not', 'be'), ('be', 'a'), ('a', 'federal'), ('federal', 'department'), ('department', 'of'), ('of', 'education'), ('education', 'that'), ('that', 'doles'), ('doles', 'out'), ('out', 'our'), ('our', 'own'), ('own', 'tax'), ('tax', 'money'), ('money', 'and'), ('and', 'makes'), ('makes', 'the'), ('the', 'states'), ('states', 'jump'), ('jump', 'through'), ('through', 'hoops'), ('hoops', 'to'), ('to', 'ge
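Bigrams and trigrams are special cases of n-grams: every run of n consecutive tokens. As a sketch of the general idea in plain Python (the function name `ngrams` here is a local helper written for illustration), n shifted copies of the token list can be zipped together:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["a", "bad", "day", "in", "london"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

A list of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.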

### 4. Stopword Removal
Stopwords are words that are used very frequently, such as “of, are, the, it, is”. In applications like document search engines and document classification, where keywords are more important than general terms, removing stopwords can be a good idea. However, in applications such as song lyrics search or searching for specific quotes, stopwords can be important.

“To be, or not to be” - Stopwords in such phrases actually play an important role, and hence, should not be dropped.

Another example is negation. "not" is contained in many stopword lists, but deleting "not" out of a negative review can make a positive out of it.
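One way to guard against this is to take the negation terms out of the stopword list before filtering. The sketch below uses a tiny made-up list (`toy_stop_words`) and a made-up review purely for illustration; a real list would be much longer.

```python
# Hypothetical mini stopword list; real ones (NLTK, SMART) are much longer
toy_stop_words = {"the", "was", "not", "it", "is", "a", "no", "this"}
negations = {"not", "no", "never"}

# Remove the negation terms from the stopword list before filtering
stop_words_keep_neg = toy_stop_words - negations

review = "this film was not good".split()
plain = [w for w in review if w not in toy_stop_words]
with_negation = [w for w in review if w not in stop_words_keep_neg]

print(plain)          # the negation is lost
print(with_negation)  # the negation survives
```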

There are two common approaches to removing stopwords, and both are fairly straightforward. One way is to count all the word occurrences, set a threshold on the count, and get rid of all terms occurring more often than that threshold. The other way is to have a predetermined list of stopwords, which can be removed from the list of tokens/tokenised sentences. I personally believe the second one is better, as determining thresholds can be quite difficult and you can use tf-idf (more on that next lesson) to weigh the importance of words.
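The count-based approach can be sketched with `collections.Counter`. The three documents and the threshold value are assumptions chosen for illustration; they also show why the threshold is hard to pick, since lowering it from 3 to 2 would remove the content word "cat" as well.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat met on the road",
]
tokens = [w for doc in docs for w in doc.split()]
counts = Counter(tokens)

threshold = 3  # assumed cut-off; in practice it has to be tuned per corpus
frequent = {w for w, c in counts.items() if c > threshold}
filtered = [w for w in tokens if w not in frequent]

print(frequent)
print(filtered)
```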

NLTK comes with many corpora, including a stopword list. This list contains around 200 terms. For my research, however, I use one that contains over 600 terms: http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop (I removed the apostrophes, as I remove punctuation before I remove stopwords)

In [77]:
stop_words_file = 'SmartStoplist.txt'

stop_words = []

with open(stop_words_file, "r") as f:
    for line in f:
        stop_words.extend(line.split())   

In [78]:
#NLTK alternative
from nltk.corpus import stopwords


stop_words_nltk = set(stopwords.words('english'))

filtered_sentence_nltk = [w for w in nltk_tokens if w not in stop_words_nltk]
filtered_sentence_smart = [w for w in nltk_tokens if w not in stop_words]

### 5. Lemmatising and Stemming
Lemmatisation and stemming both refer to the process of reducing a word to its root. The difference is that a stem might not be an actual word, whereas a lemma is. Both are handy if you want to avoid treating different forms of the same word as different words, e.g. *love, loved, loving*.

**Lemmatising:** considered, considers, consider → “consider”

**Stemming:** considered, considering, consider → “consid”

I personally have never noticed a significant difference between lemmatising and stemming when training classifiers. However, I suggest you try it out yourself. NLTK comes with many different built-in lemmatisers and stemmers, so just plug and play.

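A note of caution: `WordNetLemmatizer.lemmatize` takes a part-of-speech tag through its `pos` argument. The default is "n" (noun), so words of other classes, such as the verb "considers" below, come back unchanged unless you pass the matching tag (for example `pos="v"`).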
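In practice the tag usually comes from a POS tagger. NLTK's `pos_tag` returns Penn Treebank tags such as `VBZ`, while the lemmatiser expects one of the WordNet letters `n`, `v`, `a` or `r`. The helper below is a hypothetical sketch of that mapping (the name `penn_to_wordnet` is made up here), falling back to noun like the lemmatiser's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

print(penn_to_wordnet("VBZ"))
```

With this in place, a call such as `lemmatizer.lemmatize("considers", pos=penn_to_wordnet("VBZ"))` should return "consider" rather than leaving the word unchanged.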

In [79]:
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "considers"
word_2 = "apple"

stemmed_word =  stemmer.stem(word)
lemmatised_word = lemmatizer.lemmatize(word)

stemmed_word_2 =  stemmer.stem(word_2)
lemmatised_word_2 = lemmatizer.lemmatize(word_2)

print(stemmed_word)
print(lemmatised_word)

consid
considers


### Putting it all together - in one preprocess() function
Now that we have covered everything we need to know, we can combine everything into one function and apply it to the whole dataset. Let's keep it simple and write one for the news articles.

In [80]:
def preprocess_news(raw_text):
    
    #keeping only letters
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and tokenise
    tokens = word_tokenize(letters_only_text.lower())
    

    cleaned_words = []
    stemmer = PorterStemmer()
    
    # remove stopwords
    for word in tokens:
        if word not in stop_words:
            cleaned_words.append(word)
    
    # stem the words (a lemmatiser could be swapped in here instead)
    stemmed_words = []
    for word in cleaned_words:
        word = stemmer.stem(word)
        stemmed_words.append(word)
    
    # converting list back to string
    return " ".join(stemmed_words)

In [81]:
foxnews['prep_text'] = foxnews['text'].apply(preprocess_news)

In [82]:
foxnews

Unnamed: 0,title,text,label,succ,meta,user,mentions,prev,prep_text
0,German lawmakers approve 'no means no' rape la...,Merkel would never say NO,1,,German lawmakers passed a bill Thursday that w...,barryswallows,,,merkel
1,German lawmakers approve 'no means no' rape la...,"Expect more and more women to be asking .. ""wh...",1,,German lawmakers passed a bill Thursday that w...,PostApocalypticHero,,,expect women men longer interest touch pull pant
2,German lawmakers approve 'no means no' rape la...,Groping people in public wasn't already illega...,0,,German lawmakers passed a bill Thursday that w...,californiamojo,,,grope peopl public wasn illeg deutschland
3,German lawmakers approve 'no means no' rape la...,"Merkel, possible the only person in charge who...",1,,German lawmakers passed a bill Thursday that w...,MikeSte,,,merkel person charg wors obama hardest merkel ...
4,German lawmakers approve 'no means no' rape la...,"They know very well, no means NO! They need to...",1,,German lawmakers passed a bill Thursday that w...,scientist,,,mean pass law make legal castrat anim
...,...,...,...,...,...,...,...,...,...
1523,Fury as feminist blames toddler alligator deat...,"No, 10000 Loose nuts off their meds!",1,,A white Chicago-based Social Justice Warrior w...,toorotten,,,loos nut med
1524,Fury as feminist blames toddler alligator deat...,Just another ugly leftist.,1,"True. Most leftists ,esp female leftists have ...",A white Chicago-based Social Justice Warrior w...,randwolf13,,,ugli leftist
1525,Fury as feminist blames toddler alligator deat...,"True. Most leftists ,esp female leftists have ...",1,,A white Chicago-based Social Justice Warrior w...,creatingfalserealitiesleadstoinsanity,@randwolf13,Just another ugly leftist.,true leftist esp femal leftist violent hatr pr...
1526,Fury as feminist blames toddler alligator deat...,"First, lets get this straight: a white, gay ma...",1,,A white Chicago-based Social Justice Warrior w...,KhaosDominus,,,straight white gay male pretend feminist brien...
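Because cleaning is task specific, it can help to make each step optional instead of hard-coding it. The following is a standard-library sketch of that idea, not a replacement for `preprocess_news`: the function `preprocess`, its parameters and the tiny stopword list are assumptions made for illustration, and `stem` can be any callable such as `PorterStemmer().stem`.

```python
import re

def preprocess(raw_text, stop_words=frozenset(), stem=None, lowercase=True):
    """Clean raw_text; every step after tokenising can be switched off."""
    text = raw_text.lower() if lowercase else raw_text
    tokens = re.findall(r"[A-Za-z]+", text)           # keep letters only, then split
    tokens = [t for t in tokens if t not in stop_words]
    if stem is not None:                               # e.g. PorterStemmer().stem
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)

toy_stops = {"they", "to", "very", "no"}               # hypothetical mini list
print(preprocess("They know very well, no means NO!", stop_words=toy_stops))
```

Leaving `stop_words` empty or setting `lowercase=False` makes it easy to compare how each step affects a downstream model.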
