# Introduction to NLP - Week 1
## Cleaning and preprocessing textual data

#### In this week's session we will cover some fundamentals to get started with natural language processing. This includes cleaning and preprocessing textual data in order to make further analysis easier and more efficient.

###### We will cover:
- Removal of unwanted characters
- Normalisation
- Tokenisation into n-grams
- Stopword removal
- Lemmatising/Stemming

#### To give you as many different examples and showcases as possible, I will not apply all of this to one particular dataset but rather to individual textual examples, since text cleaning is task-specific and there is no "one-size-fits-all" solution. In the practical you will apply suitable preprocessing techniques to a whole dataset of your choice.


In [86]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.stem import WordNetLemmatizer, PorterStemmer
from collections import Counter
import re

In [87]:
dummy_data = pd.read_csv("dummy_data.csv")

In [88]:
dummy_data

Unnamed: 0,type,text,tag
0,sms,GENT! We are trying to contact you. Last weeke...,spam
1,sms,"Wa, ur openin sentence very formal... Anyway, ...",ham
2,sms,"As I entered my cabin my PA said, '' Happy B'd...",ham
3,sms,You are a winner U have been specially selecte...,spam
4,sms,Goodo! Yes we must speak friday - egg-potato r...,ham
5,sms,Hmm...my uncle just informed me that he's payi...,ham
6,sms,PRIVATE! Your 2004 Account Statement for 07742...,spam
7,sms,URGENT! Your Mobile No. was awarded σú2000 Bon...,spam
8,sms,here is my new address -apples&pairs&all that ...,ham
9,sms,Todays Voda numbers ending 7548 are selected t...,spam


In [89]:
# saving the three types of text data in 3 separate dataframes
sms_df = dummy_data[dummy_data['type'] == "sms"]
review_df = dummy_data[dummy_data['type'] == "review"]
news_df = dummy_data[dummy_data['type'] == "news_article"]

In [90]:
# an example of each for testing out stuff before applying it to the whole column

sms_sample = """***** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !"""
review_sample = """ THIS FOOD AND STAFF WAS AMAZING!!!!! ABSOLUTELY LOVE THAT PLACE <3<3<3"""
news_sample = """worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness.  cynthia cooper  worldcom s ex-head of internal accounting  alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (┬ú5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy.  prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom  ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper  who now runs her own consulting business  told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a  green light  to the procedures and practices used by worldcom. mr ebber s lawyers have said he was unaware of the fraud  arguing that auditors did not alert him to any problems.  ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance chief  giving only  brief  answers himself. the prosecution s star witness  former worldcom financial chief scott sullivan  has said that mr ebbers ordered accounting adjustments at the firm  telling him to  hit our books . however  ms cooper said mr sullivan had not mentioned  anything uncomfortable  about worldcom s accounting during a 2001 audit committee meeting. mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing. worldcom emerged from bankruptcy protection in 2004  and is now known as mci. last week  mci agreed to a buyout by verizon communications in a deal valued at $6.75bn."""

### 1. Removing unwanted characters - which characters to remove?
This is a primary step in the process of text cleaning. If we scrape some text from HTML/XML sources, we’ll need to get rid of all the tags, HTML entities, punctuation, non-alphabetic characters, and any other kind of characters which might not be part of the language. Such cleaning is usually done with regular expressions, which can filter out most of the unwanted text.
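For the HTML case, a minimal sketch might look like the following. It assumes the markup is simple enough for a naive regular expression; the function name `strip_html` and the sample string are made up for illustration, and for real-world pages a proper HTML parser is usually the safer choice.

```python
import html
import re

def strip_html(raw_html):
    # naive tag removal: replace anything that looks like <...> with a space
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)
    # turn HTML entities such as &amp; back into plain characters
    unescaped = html.unescape(no_tags)
    # collapse the whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", unescaped).strip()

html_example = "<p>Fish &amp; chips</p><br/>"  # hypothetical input
stripped = strip_html(html_example)
print(stripped)
```

The tags are removed before the entities are unescaped so that an escaped `&lt;` in the text is not mistaken for the start of a tag.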

However, depending on the type of data, we sometimes want to retain certain types of punctuation. Consider, for example, human-generated tweets which you want to classify as very angry, angry, neutral, happy, or very happy. Simple sentiment analysis might find it hard to differentiate between a happy and a very happy sentiment, because the only difference between a happy and a very happy tweet might be punctuation.

Example:

*This is amazing*    <pre>  vs     </pre>  *THIS IS AMAZING!!!!!*

---

Or what about this one:



*I don't know :) <3* <pre>  vs     </pre>  *I don't know :(((*


In [91]:
#regular expression keeping only letters 

def keep_letters_only(raw_text):
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)
    return letters_only_text



In [92]:
keep_letters_only(sms_sample) 

'      CONGRATlations      You won   tIckETs to Hamilton in  NYC http   www hamiltonbroadway com J NaIOl event   wORtH over           CALL               or send message to  hamilton freetix com to get ticket     '

#### You can see that this is not ideal, as it leaves us with a lot of random fragments like "www" and "com". We will get back to that later.

In [93]:
keep_letters_only(review_sample) 

' THIS FOOD AND STAFF WAS AMAZING      ABSOLUTELY LOVE THAT PLACE       '

#### We don't lose any meaning, but as mentioned previously, keeping the exclamation marks might be useful if we want to distinguish between positive and *VERY* positive reviews
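If the exclamation marks matter for the task, one possible variation is to add them to the set of characters that are kept. This is only a sketch: the function name `keep_letters_and_exclamations` and the shortened review string are assumptions made for illustration.

```python
import re

def keep_letters_and_exclamations(raw_text):
    # replace everything except ASCII letters and "!" with a space
    return re.sub(r"[^a-zA-Z!]", " ", raw_text)

short_review = "THIS FOOD WAS AMAZING!!!!! <3"  # shortened, made-up review
kept = keep_letters_and_exclamations(short_review)
print(kept)

# the number of exclamation marks could later serve as a simple intensity feature
exclamation_count = kept.count("!")
print(exclamation_count)
```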

In [94]:
keep_letters_only(news_sample)



#### For news articles this works perfectly fine: we do not lose any relevant information, since in this case we want to classify by genre (sports, business, tech, etc.).

### 2. Normalisation
Recall our sms sample:

***** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !

I'd definitely classify this as spam. But clearly there's a lot going on here: phone numbers, email addresses, website URLs, money amounts, and gratuitous whitespace and punctuation. Some terms are randomly capitalised, others are in all caps. Since these terms might show up in any one of the training examples in countless forms, we need a way to put each training example on an equal footing via a preprocessing step called **normalisation**.

Instead of removing the following terms, let's replace them with a specific placeholder string in each training example:

- Replace email addresses with `emailaddr`
- Replace URLs with `httpaddr`
- Replace currency symbols with `moneysymb`
- Replace phone numbers with `phonenumbr`
- Replace the remaining numbers with `numbr`
- Remove all other punctuation


In [95]:
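def normalisation_sms(raw_text):
    # email addresses -> 'emailaddr' (run first, before the URL pattern can match the domain)
    cleaned = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', raw_text)
    # URLs (with or without a leading http/https) -> 'httpaddr'
    cleaned = re.sub(r'(https?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr',
                     cleaned)
    # currency symbols -> 'moneysymb' (the trailing space keeps it apart from the amount)
    cleaned = re.sub(r'£|\$|€', 'moneysymb ', cleaned)
    # phone numbers -> 'phonenumbr' (run before the generic number rule)
    cleaned = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', cleaned)
    # any remaining numbers, including decimals -> 'numbr'
    cleaned = re.sub(r'\d+(\.\d+)?', 'numbr', cleaned)
    # finally, replace everything that is not a letter with a space
    letters_only_text = re.sub("[^a-zA-Z]", " ", cleaned)
    return letters_only_text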

In [96]:
normalisation_sms(sms_sample)

'      CONGRATlations      You won numbr tIckETs to Hamilton in  NYC httpaddr   wORtH over moneysymb numbr   CALL  phonenumbr or send message to  emailaddr to get ticket     '
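The result still contains the random capitalisation and the gratuitous whitespace mentioned earlier. A possible follow-up step, sketched here on a copy of the output string so that the snippet runs on its own, is to collapse the whitespace and lowercase everything. Whether lowercasing is appropriate depends on the task, so treat this as an optional extra rather than part of the function above.

```python
import re

# output of the normalisation step, copied here so the snippet is self-contained
normalised = ("      CONGRATlations      You won numbr tIckETs to Hamilton in  NYC "
              "httpaddr   wORtH over moneysymb numbr   CALL  phonenumbr or send "
              "message to  emailaddr to get ticket     ")

# collapse runs of whitespace, trim both ends, and lowercase the text
tidy = re.sub(r"\s+", " ", normalised).strip().lower()
print(tidy)
```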

### 3. Tokenisation
Tokenisation is just the process of splitting a sentence into words.

Text: *A bad day in London is still better than a bad day anywhere else*

Tokens: `['a', 'bad', 'day', 'in', 'london', 'is', 'still', 'better', 'than', 'a', 'bad', 'day', 'anywhere', 'else']`

This example not only splits the sentence into individual tokens, but also gets rid of the capitalism involved (no pun intended). Whether to lowercase the text again depends on the data and the task at hand. If we want to differentiate between sentiments, something written in uppercase might mean something different from the same thing written in lowercase.

In [97]:
# Note: if you have already removed all punctuation, you can just use Python's built-in .split() method
"A bad day in London is still better than a bad day anywhere else".lower().split()

['a',
 'bad',
 'day',
 'in',
 'london',
 'is',
 'still',
 'better',
 'than',
 'a',
 'bad',
 'day',
 'anywhere',
 'else']

In [98]:
#However, if you still have punctuation, look what happens
"A bad day in London, is still better than a bad day anywhere else!".lower().split()

['a',
 'bad',
 'day',
 'in',
 'london,',
 'is',
 'still',
 'better',
 'than',
 'a',
 'bad',
 'day',
 'anywhere',
 'else!']

'london' and 'london,' are not the same token!

In [99]:
# so it's better to use nltk's tokenize
from nltk import word_tokenize, bigrams, trigrams

nltk_tokens = word_tokenize("A bad day in London, is still better than a bad day anywhere else!".lower())
nltk_tokens

['a',
 'bad',
 'day',
 'in',
 'london',
 ',',
 'is',
 'still',
 'better',
 'than',
 'a',
 'bad',
 'day',
 'anywhere',
 'else',
 '!']
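Note that the tokeniser keeps punctuation marks as separate tokens. If they are not needed, one simple option is to filter them out afterwards. The sketch below hard-codes the token list from the output above so that it runs without NLTK; the variable names are made up for illustration.

```python
# tokens as returned by the tokeniser above, including punctuation tokens
tokens_with_punct = ['a', 'bad', 'day', 'in', 'london', ',', 'is', 'still',
                     'better', 'than', 'a', 'bad', 'day', 'anywhere', 'else', '!']

# str.isalpha() is True only for tokens made up entirely of letters
words_only = [token for token in tokens_with_punct if token.isalpha()]
print(words_only)
```

Be aware that `isalpha()` also drops tokens containing digits or hyphens, which may or may not be what you want.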

In [100]:
print(list(bigrams(nltk_tokens)))
print(list(trigrams(nltk_tokens)))

[('a', 'bad'), ('bad', 'day'), ('day', 'in'), ('in', 'london'), ('london', ','), (',', 'is'), ('is', 'still'), ('still', 'better'), ('better', 'than'), ('than', 'a'), ('a', 'bad'), ('bad', 'day'), ('day', 'anywhere'), ('anywhere', 'else'), ('else', '!')]
[('a', 'bad', 'day'), ('bad', 'day', 'in'), ('day', 'in', 'london'), ('in', 'london', ','), ('london', ',', 'is'), (',', 'is', 'still'), ('is', 'still', 'better'), ('still', 'better', 'than'), ('better', 'than', 'a'), ('than', 'a', 'bad'), ('a', 'bad', 'day'), ('bad', 'day', 'anywhere'), ('day', 'anywhere', 'else'), ('anywhere', 'else', '!')]
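Bigrams and trigrams are just the cases n = 2 and n = 3 of a sliding window over the token list. As a rough sketch of the idea in plain Python (the helper name `ngrams_from_tokens` is made up here; NLTK ships its own general helper in `nltk.util.ngrams`):

```python
def ngrams_from_tokens(tokens, n):
    # zip n copies of the token list, each shifted by one more position,
    # so every tuple holds n consecutive tokens
    return list(zip(*(tokens[i:] for i in range(n))))

example_tokens = ['a', 'bad', 'day', 'in', 'london']
two_grams = ngrams_from_tokens(example_tokens, 2)
four_grams = ngrams_from_tokens(example_tokens, 4)
print(two_grams)
print(four_grams)
```

If `n` is larger than the number of tokens, the result is simply an empty list.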


### 4. Stopword Removal
Stopwords are words that are used very frequently but carry little meaning on their own. Words like “of, are, the, it, is” are some examples of stopwords. In applications like document search engines and document classification, where keywords are more important than general terms, removing stopwords can be a good idea. However, in an application such as song lyrics search, or searching for specific quotes, stopwords can be important.

“To be, or not to be” - Stopwords in such phrases actually play an important role, and hence should not be dropped.

Another example is negation. "not" is contained in many stopword lists, but deleting "not" out of a negative review can make a positive out of it.
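One way to guard against this, sketched below with a tiny made-up stopword set, is to take the negation words out of the stopword list before filtering. The word lists and variable names are assumptions for illustration only.

```python
# tiny, hypothetical stopword set; a real list would be much longer
example_stop_words = {"the", "is", "not", "this", "a"}
negation_words = {"not", "no", "never"}

# set difference: keep every stopword except the negations
stop_words_keep_negation = example_stop_words - negation_words

review_tokens = "this film is not good".split()

without_negation = [w for w in review_tokens if w not in example_stop_words]
with_negation = [w for w in review_tokens if w not in stop_words_keep_negation]

print(without_negation)  # the review now reads as positive
print(with_negation)     # the negation survives
```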

There are two common approaches to removing stopwords, and both are fairly straightforward. One is to count all word occurrences, set a threshold on the count, and drop every word that occurs more often than that threshold. The other is to use a predetermined list of stopwords and remove those from the list of tokens. I personally believe the second one is better, as choosing a good threshold can be quite difficult, and you can use tf-idf (more on that next lesson) to weigh the importance of words.
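To make the first, frequency-based approach concrete, here is a minimal sketch using `collections.Counter` on a single sentence. The threshold value is an arbitrary assumption; in practice the counts would come from a whole corpus.

```python
from collections import Counter

sentence_tokens = ("a bad day in london is still better than "
                   "a bad day anywhere else").split()

# count how often each token occurs
token_counts = Counter(sentence_tokens)

# hypothetical threshold: treat every word occurring more than once as a stopword
threshold = 1
frequent_words = {word for word, count in token_counts.items() if count > threshold}

filtered_tokens = [w for w in sentence_tokens if w not in frequent_words]
print(frequent_words)
print(filtered_tokens)
```

On this tiny example the threshold removes "bad" and "day" but keeps "in" and "is", which illustrates why picking a sensible threshold can be tricky.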

NLTK comes with many corpora, including a stopword list. This list contains around 200 terms. For my research, however, I use one that contains over 600 terms: http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop (I removed the apostrophes, as I remove punctuation before I remove stopwords).

In [101]:
stop_words_file = 'SmartStoplist.txt'

# a set gives fast membership tests when filtering tokens
stop_words = set()

with open(stop_words_file, "r") as f:
    for line in f:
        stop_words.update(line.split())


# let's remove stopwords in the news article
news_sample_clean = []
for word in news_sample.split():
    if word not in stop_words:
        news_sample_clean.append(word)
        
        

In [106]:
#NLTK alternative
from nltk.corpus import stopwords


stop_words_nltk = set(stopwords.words('english'))

filtered_sentence_nltk = [w for w in nltk_tokens if w not in stop_words_nltk]
filetered_sentence_smart = [w for w in nltk_tokens if w not in stop_words]

In [69]:
news_sample_clean

['worldcom',
 'ex-boss',
 'launches',
 'defence',
 'lawyers',
 'defending',
 'worldcom',
 'chief',
 'bernie',
 'ebbers',
 'battery',
 'fraud',
 'charges',
 'called',
 'company',
 'whistleblower',
 'witness.',
 'cynthia',
 'cooper',
 'worldcom',
 'ex-head',
 'internal',
 'accounting',
 'alerted',
 'directors',
 'irregular',
 'accounting',
 'practices',
 'telecoms',
 'giant',
 '2002.',
 'led',
 'collapse',
 'firm',
 'discovery',
 '$11bn',
 '(┬ú5.7bn)',
 'accounting',
 'fraud.',
 'mr',
 'ebbers',
 'pleaded',
 'guilty',
 'charges',
 'fraud',
 'conspiracy.',
 'prosecution',
 'lawyers',
 'argued',
 'mr',
 'ebbers',
 'orchestrated',
 'series',
 'accounting',
 'tricks',
 'worldcom',
 'ordering',
 'employees',
 'hide',
 'expenses',
 'inflate',
 'revenues',
 'meet',
 'wall',
 'street',
 'earnings',
 'estimates.',
 'ms',
 'cooper',
 'runs',
 'consulting',
 'business',
 'told',
 'jury',
 'york',
 'wednesday',
 'external',
 'auditors',
 'arthur',
 'andersen',
 'approved',
 'worldcom',
 'accounting'

In [107]:
filtered_sentence_nltk # using nltk stopwordlist

['bad',
 'day',
 'london',
 ',',
 'still',
 'better',
 'bad',
 'day',
 'anywhere',
 'else',
 '!']

In [109]:
filetered_sentence_smart # using the SMART stopword list

['bad', 'day', 'london', ',', 'bad', 'day', '!']

### 5. Lemmatising and Stemming
Lemmatisation and stemming both reduce a word to a base form. The difference is that a stem might not be an actual word, whereas a lemma always is. This is handy if you want to avoid treating different forms of the same word as different words, e.g. *love, loved, loving*.

**Lemmatising:** considered, considers, consider → “consider”

**Stemming:** considered, considering, consider → “consid”

I personally have never noticed a significant difference between lemmatising and stemming when training classifiers. However, I suggest you try it out yourself. NLTK comes with many different built-in lemmatisers and stemmers, so just plug and play.

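A note of caution: `WordNetLemmatizer.lemmatize()` takes a POS tag as an optional second argument. It defaults to "noun", so words with another part of speech, such as the verb *considers* below, are left unchanged unless the right tag is passed.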

In [76]:
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "considers"
word_2 = "apple"

stemmed_word =  stemmer.stem(word)
lemmatised_word = lemmatizer.lemmatize(word)

stemmed_word_2 =  stemmer.stem(word_2)
lemmatised_word_2 = lemmatizer.lemmatize(word_2)

print(stemmed_word)
print(lemmatised_word)

consid
considers


In [77]:
print(stemmed_word_2)
print(lemmatised_word_2)

appl
apple


### Putting it all together - in one preprocess() function
Now that we have covered everything we need to know, we can combine everything into one function and apply it to the whole dataset. Let's keep it simple and write one for the news articles.

In [78]:
def preprocess_news(raw_text):
    
    #keeping only letters
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and tokenise
    tokens = word_tokenize(letters_only_text.lower())
    

    cleaned_words = []
    stemmer = PorterStemmer()
    
    # remove stopwords
    for word in tokens:
        if word not in stop_words:
            cleaned_words.append(word)
    
    # stem the words (a lemmatiser could be swapped in here instead)
    stemmed_words = []
    for word in cleaned_words:
        word = stemmer.stem(word)
        stemmed_words.append(word)
    
    # converting list back to string
    return " ".join(stemmed_words)

In [79]:
news_sample



In [80]:
preprocess_news(news_sample)

'worldcom boss launch defenc lawyer defend worldcom chief berni ebber batteri fraud charg call compani whistleblow wit cynthia cooper worldcom head intern account alert director irregular account practic telecom giant warn led collaps firm discoveri bn bn account fraud mr ebber plead guilti charg fraud conspiraci prosecut lawyer argu mr ebber orchestr seri account trick worldcom order employe hide expens inflat revenu meet wall street earn estim ms cooper run consult busi told juri york wednesday extern auditor arthur andersen approv worldcom account earli andersen green light procedur practic worldcom mr ebber lawyer unawar fraud argu auditor alert problem ms cooper sharehold meet mr ebber pass technic question compani financ chief give answer prosecut star wit worldcom financi chief scott sullivan mr ebber order account adjust firm tell hit book ms cooper mr sullivan mention uncomfort worldcom account audit committe meet mr ebber face jail sentenc year convict charg face worldcom eme

Great, it works. Let's apply it to the "whole data".

In [83]:
# news_df was created by filtering dummy_data, so work on an explicit copy
# to avoid pandas' SettingWithCopyWarning when adding the new column
news_df = news_df.copy()
news_df['prep_text'] = news_df['text'].apply(preprocess_news)


In [84]:
news_df

Unnamed: 0,type,text,tag,prep_text
20,news_article,worldcom ex-boss launches defence lawyers defe...,business,worldcom boss launch defenc lawyer defend worl...
21,news_article,german business confidence slides german busin...,business,german busi confid slide german busi confid fe...
22,news_article,bbc poll indicates economic gloom citizens in ...,business,bbc poll econom gloom citizen major nation sur...
23,news_article,lifestyle governs mobile choice faster bett...,tech,lifestyl govern mobil choic faster funkier har...
24,news_article,enron bosses in $168m payout eighteen former e...,business,enron boss payout eighteen enron director agre...
25,news_article,howard truanted to play snooker conservative...,politics,howard truant play snooker conserv leader mich...
26,news_article,wales silent on grand slam talk rhys williams ...,sport,wale silent grand slam talk rhi william wale t...
27,news_article,french honour for director parker british film...,entertainment,french honour director parker british film dir...
28,news_article,car giant hit by mercedes slump a slump in pro...,business,car giant hit merced slump slump profit luxuri...
29,news_article,fockers fuel festive film chart comedy meet th...,entertainment,focker fuel festiv film chart comedi meet fock...
