# Need for Pre-processing:

- Choosing the right kind of preprocessing for the text minimizes inconsistent results from NLP applications.
- Preprocessing that suits one task may not suit another, so the choice is task dependent.
- Let’s say you are trying to discover commonly used words in a news dataset. If your pre-processing step removes stop words just because some other task did so, you are probably going to miss some of the common words, as you have ALREADY eliminated them. So really, it’s not a one-size-fits-all approach.
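The effect described in the last point can be sketched in a few lines. This is a minimal illustration, not part of the notebook's pipeline: the sentence, the tiny `STOP_WORDS` set, and the `top_words` helper are all made up for the example.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists such as NLTK's are much longer).
STOP_WORDS = {"the", "a", "of", "in", "is", "to", "and"}

def top_words(text, n=3, drop_stop_words=False):
    """Return the n most frequent lowercase tokens in text."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens).most_common(n)

news = "the bank raised the rate and the market fell as the rate rose"

print(top_words(news))                        # 'the' dominates the raw counts
print(top_words(news, drop_stop_words=True))  # 'the' no longer appears at all
```

If the task is "find the most common words", the second result silently hides the true answer, which is why the preprocessing steps have to be chosen per task.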



# Dataset:

- The dataset contains what corporations actually talk about on social media. Each statement is classified as information (objective statements about the company or its activities), dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.).
- Our interest is in the text column of the dataset, which is where we apply pre-processing.

# Types of text preprocessing techniques

- There are different ways to preprocess your text. Here are some of the approaches that you should know about, along with the importance of each.

In [1]:
# Import necessary libraries.
import re, string, unicodedata
import pandas as pd
import nltk                                   # Natural Language Toolkit
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

!pip install contractions
import contractions


from bs4 import BeautifulSoup                 # Beautiful soup is a parsing library that can use different parsers.
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet    # Stopwords, and wordnet corpus
from nltk.stem import LancasterStemmer, WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
# Load dataset.
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/crowdflower-corporate-messaging/data/corporate_messaging_dfe.csv')

In [3]:
# Check first 5 rows of data.
dataset.head()

Unnamed: 0,unit_id,golden,unit_state,trusted_judgments,last_judgment_at,category,category_confidence,category_gold,id,screenname,text
0,662822308,False,finalized,3,2015-02-18T04:31:00,Information,1.0,,436528000000000000,Barclays,Barclays CEO stresses the importance of regula...
1,662822309,False,finalized,3,2015-02-18T13:55:00,Information,1.0,,386013000000000000,Barclays,Barclays announces result of Rights Issue http...
2,662822310,False,finalized,3,2015-02-18T08:43:00,Information,1.0,,379580000000000000,Barclays,Barclays publishes its prospectus for its �5.8...
3,662822311,False,finalized,3,2015-02-18T09:13:00,Information,1.0,,367530000000000000,Barclays,Barclays Group Finance Director Chris Lucas is...
4,662822312,False,finalized,3,2015-02-18T06:48:00,Information,1.0,,360385000000000000,Barclays,Barclays announces that Irene McDermott Brown ...


In [0]:
# Here we are going to deal with text data, so we drop the columns we do not need and keep the rest in a new dataframe: data
data = dataset.drop(['golden', 'unit_state', 'trusted_judgments', 'last_judgment_at', 'category', 'category_confidence', 'category_gold', 'screenname'], axis=1)

In [5]:
# Check first 5 rows of dataframe.
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regula...
1,662822309,386013000000000000,Barclays announces result of Rights Issue http...
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8...
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is...
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown ...


In [6]:
# First row of data.
pd.set_option('display.max_colwidth', None) # Show the full text of each cell without truncation.
data.loc[[0]]

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference http://t.co/Ge9Lp7hpyG


In [7]:
# Removal of the http link using Regular Expression.
for i, row in data.iterrows():
    clean_text = re.sub(r"http\S+", "", data.at[i, 'text'])
    data.at[i,'text'] = clean_text
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference
1,662822309,386013000000000000,Barclays announces result of Rights Issue
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8bn Rights Issue:
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director
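The same link removal can also be written without an explicit loop. The sketch below uses pandas' vectorized string methods on a small stand-in frame; the two rows and the example URL are made up, and the extra `.str.strip()` (which trims the space left behind by a removed link) is an addition that the loop above does not perform.

```python
import pandas as pd

# Small stand-in frame (made-up rows) with the same 'text' column name as the notebook.
df = pd.DataFrame({"text": [
    "Barclays announces result of Rights Issue http://t.co/abc123",
    "No link in this one",
]})

# Vectorized equivalent of the row-by-row loop: drop URLs, then trim leftover whitespace.
df["text"] = df["text"].str.replace(r"http\S+", "", regex=True).str.strip()

print(df["text"].tolist())
```

On a large frame this form is usually faster and shorter than `iterrows`, though both give the same cleaned text apart from the trailing whitespace.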


# Cleaning of the text

In [8]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

# Apply the function to every row of the text column of the dataframe.
# Caution: in the output below, "ill health" (row 3) is wrongly expanded to "I will health".
for i, row in data.iterrows():
    text = data.at[i, 'text']
    clean_text = replace_contractions(text)
    data.at[i,'text'] = clean_text
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference
1,662822309,386013000000000000,Barclays announces result of Rights Issue
2,662822310,379580000000000000,Barclays publishes its prospectus for its �5.8bn Rights Issue:
3,662822311,367530000000000000,Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to I will health
4,662822312,360385000000000000,Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director
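The output above shows a side effect of automatic expansion: in row 3, "ill health" has become "I will health". One possible way to avoid this, sketched here under stated assumptions, is to expand only forms that actually contain an apostrophe. The `CONTRACTIONS` mapping and `expand_contractions` helper are hypothetical and hand-written for this example; they are not part of the `contractions` library.

```python
import re

# Small hand-written mapping (hypothetical; extend as needed). Keys are lowercase.
CONTRACTIONS = {
    "i'll": "I will",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
    "we're": "we are",
}

# Match only whole words that contain an apostrophe, so plain "ill" is never touched.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand the contractions listed in CONTRACTIONS; leave all other text unchanged."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("due to ill health"))
print(expand_contractions("I'll check, but it's late and we don't know"))
```

The trade-off is coverage: this sketch handles only the listed forms and only the straight apostrophe, so it misses far more contractions than a full library does.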


In [9]:
# Tokenize the text of every row in the dataframe.
for i, row in data.iterrows():
    text = data.at[i, 'text']
    words = nltk.word_tokenize(text)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,"[Barclays, CEO, stresses, the, importance, of, regulatory, and, cultural, reform, in, financial, services, at, Brussels, conference]"
1,662822309,386013000000000000,"[Barclays, announces, result, of, Rights, Issue]"
2,662822310,379580000000000000,"[Barclays, publishes, its, prospectus, for, its, �5.8bn, Rights, Issue, :]"
3,662822311,367530000000000000,"[Barclays, Group, Finance, Director, Chris, Lucas, is, to, step, down, at, the, end, of, the, week, due, to, I, will, health]"
4,662822312,360385000000000000,"[Barclays, announces, that, Irene, McDermott, Brown, has, been, appointed, as, Group, Human, Resources, Director]"


In [10]:
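# Save the English stop words in a list named stopwords.
# Note: this rebinds the name of the imported stopwords corpus, so re-running this cell raises an AttributeError.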
stopwords = stopwords.words('english')
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [0]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)        # Append processed words to new list.
    return new_words
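To see what the NFKD step does to a single token, here is a small standalone sketch. The `ascii_fold` name and the sample words are chosen for illustration only.

```python
import unicodedata

def ascii_fold(word):
    """Decompose accented characters (NFKD), then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("café"))    # the accent is stripped, the base letter survives
print(ascii_fold("£5.8bn"))  # the currency symbol has no ASCII equivalent and is dropped
```

Accented letters are reduced to their base letter, while symbols without an ASCII decomposition disappear entirely, so information such as a currency can be lost at this step.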

In [0]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = word.lower()           # Converting to lowercase
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

# Lowercasing

- Lowercasing ALL your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, helps most when your dataset is not very large, and significantly improves the consistency of the expected output.

- An example where lowercasing is very useful is search. Imagine you are looking for documents containing “usa”, but no results show up because “usa” was indexed as “USA”.

- An example where lowercasing may result in inaccuracy is in predicting the programming language of a source code file. The word System in Java is quite different from system in Python. Lowercasing the two makes them identical, causing the classifier to lose important predictive features. While lowercasing is generally helpful, it may not be applicable for all tasks.
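The search example can be sketched with a toy inverted index. The two documents and the `build_index` helper are made up for illustration.

```python
from collections import defaultdict

def build_index(documents, lowercase=True):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.split():
            key = token.lower() if lowercase else token
            index[key].add(doc_id)
    return index

docs = ["Trade talks between USA and Canada", "The usa team won"]

case_sensitive = build_index(docs, lowercase=False)
case_folded = build_index(docs, lowercase=True)

print(sorted(case_sensitive.get("usa", set())))  # only the document that spells it "usa"
print(sorted(case_folded.get("usa", set())))     # both documents
```

With the case-sensitive index, a query for "usa" misses the first document; lowercasing at indexing time (and at query time) makes both spellings match.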

In [0]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)    # Append processed words to new list.
    return new_words
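The regular expression used above removes every character that is neither a word character nor whitespace, which also affects characters inside a token. The standalone sketch below, with made-up sample tokens and a hypothetical `strip_punctuation` helper, shows the consequence.

```python
import re

def strip_punctuation(tokens):
    """Delete every character that is neither a word character nor whitespace."""
    cleaned = (re.sub(r"[^\w\s]", "", token) for token in tokens)
    return [token for token in cleaned if token]  # drop tokens that became empty

print(strip_punctuation(["5.8bn", ":", "Rights", "re-elected"]))
```

Pure punctuation tokens vanish, but decimal points and hyphens vanish too, so "5.8bn" turns into "58bn". If numeric values matter for the task, this step may need a narrower pattern.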

# Stopword Removal:
- Stop words are a set of commonly used words in a language.
- Examples of stop words in English are “a”, “the”, “is”, and “are”. The intuition behind removing stop words is that, by dropping low-information words from the text, we can focus on the important words instead.

In [0]:
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        if word not in stopwords:
            new_words.append(word)        # Append processed words to new list.
    return new_words

# Stemming:

- Stemming is the process of reducing inflected words (e.g. running, runs) to their root form (e.g. run). The “root” in this case may not be a real root word, but just a canonical form of the original word.
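To make the "may not be a real root word" point concrete, here is a toy suffix-stripping stemmer. It is deliberately naive and is NOT the Lancaster algorithm used in the notebook; the suffix list and the `toy_stem` helper are hypothetical.

```python
# Toy rule-based stemmer: strip the first matching suffix. This is NOT the Lancaster
# algorithm; the suffix list is a hypothetical, minimal rule set.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip one known suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["running", "runs", "announces", "appointed"]:
    print(w, "->", toy_stem(w))
```

"runs" becomes "run", but "running" becomes "runn" and "announces" becomes "announc": the stems are consistent keys for related forms without necessarily being dictionary words.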

In [0]:
def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []                            # Create empty list to store pre-processed words.
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)                # Append processed words to new list.
    return stems

# Lemmatization:

- On the surface, lemmatization is very similar to stemming: the goal is to remove inflections and map a word to its root form.
- The difference is that lemmatization relies on a vocabulary and the word's part of speech rather than on suffix-stripping rules.
- It doesn’t just chop things off; it maps each word to its actual dictionary form (the lemma). For example, the word “better” would map to “good” when treated as an adjective.
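The contrast with stemming can be sketched as a dictionary lookup keyed by word and part of speech. The `LEXICON` below is a hypothetical mini-lexicon written for this example; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults WordNet instead.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech). A real lemmatizer
# looks words up in a full dictionary rather than a hand-written table.
LEXICON = {
    ("better", "adj"): "good",
    ("announces", "verb"): "announce",
    ("appointed", "verb"): "appoint",
    ("services", "noun"): "service",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form of word for the given part of speech, if known."""
    return LEXICON.get((word, pos), word)

print(toy_lemmatize("better", "adj"))      # found by lookup, not by chopping a suffix
print(toy_lemmatize("appointed", "verb"))
print(toy_lemmatize("appointed", "noun"))  # unknown (word, pos) pairs come back unchanged
```

The part-of-speech argument matters: the same surface word can have a different lemma, or none, depending on how it is used, which is why a function that lemmatizes everything as a verb leaves nouns and adjectives largely untouched.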

In [0]:
def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []                           # Create empty list to store pre-processed words.
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)              # Append processed words to new list.
    return lemmas

### Now it's time to execute the above functions:

### So we define a new function, normalize, which chains the cleaning steps together.

In [0]:
def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

In [18]:
# Apply the normalize function to every row of the data.
for i, row in data.iterrows():
    words = data.at[i, 'text']
    words = normalize(words)
    data.at[i,'text'] = words
data.head()

Unnamed: 0,unit_id,id,text
0,662822308,436528000000000000,"[barclays, ceo, stresses, importance, regulatory, cultural, reform, financial, services, brussels, conference]"
1,662822309,386013000000000000,"[barclays, announces, result, rights, issue]"
2,662822310,379580000000000000,"[barclays, publishes, prospectus, 58bn, rights, issue]"
3,662822311,367530000000000000,"[barclays, group, finance, director, chris, lucas, step, end, week, due, health]"
4,662822312,360385000000000000,"[barclays, announces, irene, mcdermott, brown, appointed, group, human, resources, director]"


In [0]:
def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

In [20]:
data['lemma'] = ''
data['stem'] = ''

for i, row in data.iterrows():
    words = data.at[i, 'text']
    stems, lemmas = stem_and_lemmatize(words)
    data.at[i,'stem'] = stems
    data.at[i, 'lemma'] = lemmas
data.head()

Unnamed: 0,unit_id,id,text,lemma,stem
0,662822308,436528000000000000,"[barclays, ceo, stresses, importance, regulatory, cultural, reform, financial, services, brussels, conference]","[barclays, ceo, stress, importance, regulatory, cultural, reform, financial, service, brussels, conference]","[barclay, ceo, stresses, import, reg, cult, reform, fin, serv, brussel, conf]"
1,662822309,386013000000000000,"[barclays, announces, result, rights, issue]","[barclays, announce, result, right, issue]","[barclay, annount, result, right, issu]"
2,662822310,379580000000000000,"[barclays, publishes, prospectus, 58bn, rights, issue]","[barclays, publish, prospectus, 58bn, right, issue]","[barclay, publ, prospect, 58bn, right, issu]"
3,662822311,367530000000000000,"[barclays, group, finance, director, chris, lucas, step, end, week, due, health]","[barclays, group, finance, director, chris, lucas, step, end, week, due, health]","[barclay, group, fin, direct, chris, luca, step, end, week, due, heal]"
4,662822312,360385000000000000,"[barclays, announces, irene, mcdermott, brown, appointed, group, human, resources, director]","[barclays, announce, irene, mcdermott, brown, appoint, group, human, resources, director]","[barclay, annount, ir, mcdermott, brown, appoint, group, hum, resourc, direct]"


- As we can see, the text column contains the tokenized words, the lemma column contains the lemmatized words, and the stem column contains the stemmed words.
- We can use whichever of these techniques suits the needs of the project.

# So, the tasks are:

- ### Noise removal (special characters, HTML tags, accented characters, punctuation)
- ### Lowercasing (can be task dependent in some cases)
- ### Stop-word removal
- ### Stemming / lemmatization

- ### Now that the text cleaning is done, our text data is ready to be converted into a format that the machine can understand (i.e. numbers).
- ### We will cover this in the next lectures on Vectorization, after which we can perform tasks such as:
  - ### Sentiment Analysis
  - ### Text Classification