# 2. Text Preprocessing
The objective of this notebook is to process the article data and prepare it for modeling.

## Introduction

Typically, any NLP-based problem can be solved by a methodical workflow that has a sequence of steps. The major steps are depicted in the following figure.

[1]: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

<div>
<img src="./img/NLP-workflow.png" width="650"/>
</div>

We usually start with a corpus of text documents and follow standard processes of text wrangling and pre-processing, parsing and basic exploratory data analysis. Based on the initial insights, we usually represent the text using relevant feature engineering techniques. Depending on the problem at hand, we either focus on building predictive supervised models or unsupervised models, which usually focus more on pattern mining and grouping. Finally, we evaluate the model and the overall success criteria with relevant stakeholders or customers, and deploy the final model for future usage.

Text obtained through web scraping is typically very noisy: it contains spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter-case information, filler words such as “um” and “uh”, and other texting and speech disfluencies. Such text is common in contact centers, chat rooms, optical character recognition (OCR) output, short message service (SMS) messages, etc. The pre-processing steps for most NLP problems generally include:

1. **Noise Cleaning**: Noise is a common issue in unstructured text. Noisy text is found in informal settings such as online chat, text messages, e-mails, message boards, newsgroups, blogs, wikis and web pages. Text produced by automatic speech recognition of spontaneous speech, or by optical character recognition of printed or handwritten documents, also contains processing noise. Noise removal usually consists of removing HTML tags, extra white space, punctuation, etc.

2. **Spell Checking**: In this step, misspelled words are corrected, accented characters are replaced with their closest English (ASCII) equivalents, and strings with no meaning are filtered out.

3. **Contraction Mapping**: Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe. In English, the dropped letters are often vowels (for example, “do not” becomes “don’t”). Expanding contractions back to their full forms contributes to text standardization.

4. **Stemming/Lemmatization**: Both techniques aim to reduce a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reaching that base form correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing this properly, using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

5. **‘Stop Words’ Identification**: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words to take up space in our database or to consume valuable processing time. They can be removed easily by keeping a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships stop word lists for more than a dozen languages.

6. **Case Conversion**: Converting all text to lowercase makes later steps more consistent, both during preprocessing and in downstream stages such as parsing, because tokens that differ only in case are treated as the same word.


These steps are implemented in the next section and will be used to build a text normalization pipeline.
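As a rough illustration of how these steps chain together, the following sketch applies simplified, standard-library-only stand-ins for several of them to a single string. The function `toy_normalize` and the tiny `TOY_CONTRACTIONS` and `TOY_STOPWORDS` tables are assumptions made up for this example; they are not the functions used in the rest of the notebook.

```python
import html
import re
import unicodedata

# Hypothetical miniature lookup tables, for illustration only
TOY_CONTRACTIONS = {"don't": "do not", "it's": "it is"}
TOY_STOPWORDS = {"the", "a", "an", "in", "is", "it", "do"}

def toy_normalize(text):
    # 1. Noise cleaning: drop HTML tags and decode HTML entities
    text = html.unescape(re.sub(r'<[^>]+>', ' ', text))
    # 2. Replace accented characters with their ASCII base letters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # 3. Case conversion
    text = text.lower()
    # 4. Contraction mapping
    for short, full in TOY_CONTRACTIONS.items():
        text = text.replace(short, full)
    # 5. Replace everything except letters and whitespace with a space
    text = re.sub(r'[^a-z\s]', ' ', text)
    # 6. Stop word removal (negations such as "not" are not in the list, so they stay)
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS]
    return ' '.join(tokens)

print(toy_normalize("<p>It's a caf&eacute; in Montréal; don't miss it!</p>"))
# cafe montreal not miss
```

The order of the steps matters: in this sketch the contractions are expanded before punctuation is stripped, otherwise the apostrophes needed for the lookup would already be gone.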

In [None]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs 
import unicodedata
import re
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from textblob import TextBlob
import pycountry
from sklearn.feature_extraction.text import TfidfVectorizer
from contractions import contractions_dict

In [None]:
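# Parameters
# Stop words (negations are kept because they carry meaning)
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

# Tokenizer used by remove_stopwords
tokenizer = ToktokTokenizer()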

# Lemmatization
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1500000

In [None]:
# Load data
df = pd.read_csv('../data/interim/covid_articles_preprocessed.csv')

In [None]:
df.head()

### Noise Cleaning

In [None]:
# Remove HTML tags
def strip_html_tags(text):
    soup = bs(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [None]:
# Remove accented text
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
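To make the mechanism behind this kind of accent removal more concrete, here is a minimal standalone sketch. NFKD normalization splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with `errors='ignore'` then discards the marks. The helper name `to_ascii` and the sample strings are assumptions chosen for this example.

```python
import unicodedata

def to_ascii(text):
    # NFKD decomposes e.g. 'é' into 'e' followed by a combining acute accent
    decomposed = unicodedata.normalize('NFKD', text)
    # Non-ASCII code points (the combining marks) are dropped when encoding
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(len(unicodedata.normalize('NFKD', 'é')))  # 2: base letter + combining mark
print(to_ascii('Sómě Áccěntěd těxt'))           # Some Accented text
print(to_ascii('Søren'))                        # Sren
```

The last line shows a caveat of this approach: characters with no decomposition, such as `ø`, are removed entirely rather than replaced by a similar-looking ASCII letter.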

In [None]:
# Remove Special Characters
def remove_special_characters(text, remove_digits=False):
    # Use A-Z (not A-z): the range A-z would also keep [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

### Contraction Mapping

In [None]:
# Expand contractions using the mapping shipped with the `contractions` package
def expand_contractions(text, contraction_mapping=contractions_dict):
    # Lowercase the keys for case-insensitive lookup; try longer contractions first
    lookup = {key.lower(): value for key, value in contraction_mapping.items()}
    keys = sorted(lookup, key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Fall back to the matched text if no expansion is found
        expanded_contraction = lookup.get(match.lower(), match)
        # Keep the original first character so capitalisation is preserved
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    # Drop any apostrophes that remain (e.g. possessives)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
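Two details of regex-based contraction expansion are easy to get wrong: the order of the alternatives (a short contraction can shadow a longer one that starts the same way) and the handling of capitalised matches. The standalone sketch below shows one possible way to deal with both. The mapping `TOY_MAP` and the function `expand` are hypothetical and exist only for this example.

```python
import re

# Hypothetical miniature mapping with lowercase keys
TOY_MAP = {"can't": "cannot", "can't've": "cannot have", "i'm": "i am"}

def expand(text, mapping=TOY_MAP):
    # Longest keys first, so "can't've" is tried before its prefix "can't"
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys), flags=re.IGNORECASE)

    def replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Reuse the first character of the original so capitalisation survives
        return found[0] + expanded[1:]

    return pattern.sub(replace, text)

print(expand("I'm sure they can't've left"))
# I am sure they cannot have left
```

Without the length-based sort, the alternative `can't` could match first and leave a stray `'ve` behind.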

### Spelling Correction

In [None]:
# Correct misspelled words
def spell_correction(text):
    tb = TextBlob(text)
    fixed_text = tb.correct()
    return str(fixed_text)

In [None]:
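# Remove meaningless strings
# Build the English vocabulary once rather than on every call
en_words = set(nltk.corpus.words.words())

def remove_incomprehensible_words(text):
    # Keep dictionary words and non-alphabetic tokens (numbers, punctuation)
    text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in en_words or not w.isalpha())
    return text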
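The effect of a vocabulary filter like this can be seen on a small example. The sketch below uses a tiny hypothetical vocabulary (`TOY_VOCAB`) in place of the NLTK word list and a regular expression in place of the NLTK tokenizer, so it only approximates the behaviour of the function above.

```python
import re

# Hypothetical tiny vocabulary standing in for nltk.corpus.words
TOY_VOCAB = {"the", "vaccine", "was", "approved"}

def keep_known_words(text, vocab=TOY_VOCAB):
    # Split into runs of word characters and runs of punctuation
    tokens = re.findall(r'\w+|[^\w\s]+', text)
    # Keep vocabulary words and any token that is not purely alphabetic
    kept = [t for t in tokens if t.lower() in vocab or not t.isalpha()]
    return ' '.join(kept)

print(keep_known_words("The covid vaccine was approved xqzt 2020 !"))
# The vaccine was approved 2020 !
```

Note that the filter removes not only gibberish such as `xqzt` but also any legitimate term missing from the vocabulary. In this toy vocabulary that includes `covid`, and recent or domain-specific terms may be affected in the same way when a general-purpose word list is used.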

### Lemmatization

In [None]:
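def lemmatize_text(text, disable=('ner', 'parser')):
    # Skip pipeline components that lemmatization does not need
    doc = nlp(text, disable=list(disable))
    # spaCy v2 returns '-PRON-' as the lemma of pronouns; keep the original token in that case
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
    return text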
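The difference between stemming and lemmatization described in the introduction can be illustrated with a toy example. The suffix list and lemma table below are made up for this sketch; they are neither a real stemming algorithm (such as Porter's) nor the way spaCy computes lemmas.

```python
# Hypothetical toy rules and lemma table, for illustration only
SUFFIXES = ("ing", "ies", "ed", "es", "s")
TOY_LEMMAS = {"studies": "study", "was": "be", "better": "good", "running": "run"}

def crude_stem(word):
    # Chop the first matching suffix, with no check that the result is a real word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    # Look the word up in a vocabulary; fall back to the word itself
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "running", "was", "better"]:
    print(w, crude_stem(w), toy_lemma(w))
# studies stud study
# running runn run
# was was be
# better better good
```

The stemmer produces non-words such as `stud` and `runn` and cannot relate irregular forms, whereas the dictionary-based lookup returns valid base forms at the cost of needing a vocabulary.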

### Remove Stopwords

In [None]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
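The reason for keeping "no" and "not" out of the stop word list becomes clear on a short sentence. The sketch below uses a small hypothetical stop word list and plain whitespace tokenization instead of the NLTK list and tokenizer used in this notebook.

```python
# Hypothetical short stop word list, including the negations
full_stopwords = ["the", "is", "this", "a", "no", "not"]
# Same list with the negations removed, mirroring the idea used in this notebook
kept_negations = [w for w in full_stopwords if w not in ("no", "not")]

def drop_stopwords(text, stopwords):
    # Compare in lowercase so capitalised stop words are removed as well
    return ' '.join(t for t in text.split() if t.lower() not in stopwords)

sentence = "This is not a good idea"
print(drop_stopwords(sentence, full_stopwords))  # good idea
print(drop_stopwords(sentence, kept_negations))  # not good idea
```

With the negation filtered out, the sentence appears to say the opposite of what was written, which would be misleading for any model that depends on sentiment or polarity.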

### Case conversion

In [None]:
def convert_text(text, case='lower'):
    if case == 'lower':
        text = text.lower()
    elif case == 'upper':
        text = text.upper()
    elif case == 'title':
        text = text.title()
    # any other value of `case` leaves the text unchanged
    return text


### Corpus Normalization Pipeline

In [None]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True,
                     correct_spelling=False, remove_incomprehensible_word=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # collapse runs of newlines into a single space
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{._(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
        # correct spelling
        if correct_spelling:
            doc = spell_correction(doc)
        # remove incomprehensible words
        if remove_incomprehensible_word:
            doc = remove_incomprehensible_words(doc)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

## Normalize COVID News Articles
Now that the normalization pipeline is defined, I normalize the COVID articles and save them for future use.

In [None]:
text_df = df
text_df.info()

In [None]:
normal_corpus = normalize_corpus(text_df['content'].values)
print(normal_corpus[0])

In [None]:
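# Reuse the index of text_df so that rows stay aligned when concatenating column-wise
normal_corpus_df = pd.concat((text_df[['title', 'topic_area']], pd.DataFrame(normal_corpus, columns=['content'], index=text_df.index)), axis=1)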
normal_corpus_df.head()

In [None]:
normal_corpus_df.to_csv('../data/interim/covid_articles_normalized.csv', index=False)