## Preprocessing

#### Tokenization

Tokenization is the process of splitting a string of input characters into sections (tokens) and possibly classifying them. The resulting tokens are then passed on to some other form of processing, so tokenization can be considered a sub-task of parsing input.
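
As a rough illustration of the idea, the sketch below splits a sentence into word and punctuation tokens with a single regular expression. The function name `simple_tokenize` and the pattern are assumptions made for this example only; NLTK's `word_tokenize`, which the class in this section relies on, applies considerably more elaborate rules.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a token is either a run of letters, digits and
    # apostrophes, or a single non-space punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

tokens = simple_tokenize("Halo 2 sells five million copies.")
print(tokens)  # ['Halo', '2', 'sells', 'five', 'million', 'copies', '.']
```

Keeping punctuation as separate tokens leaves the decision to drop or keep it to a later step.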

#### Stop words

Stop words are the most common words in a language, such as "the", "is" and "a". These words carry little meaning on their own and are usually removed from texts before further processing.
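
A minimal sketch of stop word removal follows. The tiny `STOP_WORDS` set and the function name `remove_stop_words` are made up for illustration; a real list, such as the English one shipped with NLTK, is much longer.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "to", "in"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # Keep only the words whose lower-cased form is not a stop word.
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

print(remove_stop_words("The economy is the biggest in a decade"))
# economy biggest decade
```

Using a set keeps each membership check at roughly constant cost, which matters once the list grows to a few hundred words.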

#### Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form.
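
To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm used later in this section, only a sketch under simplifying assumptions: the function name `toy_stem`, the three suffixes and the minimum stem length of three characters are all chosen for the example.

```python
def toy_stem(word):
    # Hypothetical rule set: leave words ending in "ss" alone, otherwise
    # strip the first matching suffix if at least 3 characters remain.
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["celebrating", "sales", "agreed", "miss"]])
# ['celebrat', 'sale', 'agre', 'miss']
```

As the output shows, a stem does not have to be a dictionary word; it only has to be the same for related word forms.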

#### Lemmatization

Lemmatization (or lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.
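
Unlike a stem, a lemma is always a dictionary word, so lemmatization needs a lexicon rather than suffix rules. The sketch below fakes that lexicon with a small hand-written table; `LEMMAS` and `toy_lemmatize` are hypothetical names, and a real lemmatizer such as NLTK's `WordNetLemmatizer` draws on a far larger vocabulary and can take the part of speech into account.

```python
# Hypothetical miniature lexicon mapping inflected forms to their lemma.
LEMMAS = {
    "is": "be",
    "are": "be",
    "was": "be",
    "better": "good",
    "mice": "mouse",
    "countries": "country",
}

def toy_lemmatize(word, lemmas=LEMMAS):
    # Unknown words are returned unchanged.
    return lemmas.get(word, word)

print([toy_lemmatize(w) for w in ["mice", "are", "better", "xbox"]])
# ['mouse', 'be', 'good', 'xbox']
```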

In [61]:
import re
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

class preprocessing:
    # clean_stem_lemmatize_column: full clean up for a column
    def clean_stem_lemmatize_column(self, dataframe, column, more_stop_words = [], drop_unnecessary_columns = False):
        self.clean_column(dataframe, column, more_stop_words)
        self.stem_column(dataframe, f"{column}_clean")
        self.lemmatize_column(dataframe, f"{column}_clean_stems")
        if drop_unnecessary_columns is True:
            dataframe = dataframe.drop([f"{column}_clean", f"{column}_clean_stems"], axis=1)
        return dataframe

    # tokenize_sentence: tokenizes a given sentence
    def tokenize_sentence(self, sentence):
        words = word_tokenize(sentence)
        return words

    # tokenize_column: tokenizes a column of a dataframe and saves it as column_tokens
    def tokenize_column(self, dataframe, column):
        tokens = list()
        # tokenize each element of the column
        for index, row in dataframe.iterrows():
            sentence_tokens = self.tokenize_sentence(dataframe.loc[index,column])
            tokens.append(sentence_tokens)
        # save tokens as a new column
        dataframe[f"{column}_tokens"] = pd.Series(tokens, index=dataframe.index)

    # clean_sentence: cleans a given sentence
    def clean_sentence(self, sentence):
        sentence = re.sub(r"[^a-zA-Z\s]", "", sentence) # remove everything except letters and white space (this includes "_")
        sentence = re.sub(r"\s+", " ", sentence)        # change any white space to one space
        sentence = sentence.strip()                    # remove start and end white spaces
        sentence = sentence.lower()                    # convert sentence into lower case
        return sentence
    
    # remove_stop_words_from_sentence: removes stop words using fast set lookups
    def remove_stop_words_from_sentence(self, sentence, more_stop_words = None):
        # combine NLTK's English stop words with any extra words supplied by the caller
        stop_words = set(stopwords.words("english")) | set(more_stop_words or [])
        sentence = " ".join([word for word in sentence.split() if word not in stop_words])
        return sentence

    # clean_column: cleans a dataframe column from symbols, removes stop words and saves it as column_clean
    def clean_column(self, dataframe, column, more_stop_words = []):
        # clean and remove each element of the column
        for index, row in dataframe.iterrows():
            dataframe.loc[index,f"{column}_clean"] = self.clean_sentence(dataframe.loc[index,column])
            dataframe.loc[index,f"{column}_clean"] = self.remove_stop_words_from_sentence(dataframe.loc[index,f"{column}_clean"], more_stop_words)

    # stem_sentence: stems a given sentence
    def stem_sentence(self, sentence):
        porter = PorterStemmer()
        words = word_tokenize(sentence)
        stems_sentence = list()
        for word in words:
            stems_sentence.append(porter.stem(word))
        return " ".join(stems_sentence)

    # stem_column: stems a column of a dataframe and saves it as column_stems
    def stem_column(self, dataframe, column):
        for index, row in dataframe.iterrows():
            dataframe.loc[index,f"{column}_stems"] = self.stem_sentence(dataframe.loc[index,column])

    # lemmatize_sentence: lemmatizes a given sentence
    def lemmatize_sentence(self, sentence):
        lemmatizer = WordNetLemmatizer()
        words = word_tokenize(sentence)
        lemmas_sentence = list()
        for word in words:
            lemmas_sentence.append(lemmatizer.lemmatize(word))
        return " ".join(lemmas_sentence)

    # lemmatize_column: lemmatizes a column of a dataframe and saves it as column_lemmas
    def lemmatize_column(self, dataframe, column):
        for index, row in dataframe.iterrows():
            dataframe.loc[index,f"{column}_lemmas"] = self.lemmatize_sentence(dataframe.loc[index,column])
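
One detail worth checking when cleaning text with a regular-expression character class is the spelling of the letter range. In ASCII, the range `A-z` also covers the six characters that sit between `Z` and `a`, so a class written that way lets brackets, backslashes, carets, underscores and backticks through, while `A-Z` does not. The comparison below is an illustrative sketch; the names `loose` and `strict` and the sample string are made up for the example.

```python
import re

# "A-z" spans the ASCII codes from "A" to "z", including [ \ ] ^ _ and `
loose = re.compile(r"[^a-zA-z\s]")
# "A-Z" covers the upper-case letters only
strict = re.compile(r"[^a-zA-Z\s]")

sample = "snake_case [x]^2"
print(loose.sub("", sample))   # snake_case [x]^
print(strict.sub("", sample))  # snakecase x
```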

#### Examples

In [62]:
test_set = pd.read_csv("./data_sets/test_set.tsv", delimiter="\t", names=["id", "title", "content"], header=0)
more_stop_words = ["say","said","want","thing","may","see","make","look","likely","well","told","uses","used","use","bn","mr","year","people","new"]
pre = preprocessing()
test_set = pre.clean_stem_lemmatize_column(test_set,"content",more_stop_words)
display(test_set)

| | id | title | content | content_clean | content_clean_stems | content_clean_stems_lemmas |
|---|---|---|---|---|---|---|
| 0 | 385 | Tate & Lyle boss bags top award\n | \n Tate & Lyle's chief executive has been nam... | tate lyles chief executive named european busi... | tate lyle chief execut name european businessm... | tate lyle chief execut name european businessm... |
| 1 | 1984 | Halo 2 sells five million copies\n | \n Microsoft is celebrating bumper sales of i... | microsoft celebrating bumper sales xbox scifi ... | microsoft celebr bumper sale xbox scifi shoote... | microsoft celebr bumper sale xbox scifi shoote... |
| 2 | 986 | MSPs hear renewed climate warning\n | \n Climate change could be completely out of ... | climate change could completely control within... | climat chang could complet control within seve... | climat chang could complet control within seve... |
| 3 | 1387 | Pavey focuses on indoor success\n | \n Jo Pavey will miss January's View From Gre... | jo pavey miss januarys view great edinburgh in... | jo pavey miss januari view great edinburgh int... | jo pavey miss januari view great edinburgh int... |
| 4 | 1295 | Tories reject rethink on axed MP\n | \n Sacked MP Howard Flight's local Conservati... | sacked mp howard flights local conservative as... | sack mp howard flight local conserv associ ins... | sack mp howard flight local conserv associ ins... |
| ... | ... | ... | ... | ... | ... | ... |
| 440 | 439 | German economy rebounds\n | \n Germany's economy, the biggest among the 1... | germanys economy biggest among countries shari... | germani economi biggest among countri share eu... | germani economi biggest among countri share eu... |
| 441 | 300 | J&J agrees $25bn Guidant deal\n | \n Pharmaceutical giant Johnson & Johnson has... | pharmaceutical giant johnson johnson agreed bu... | pharmaceut giant johnson johnson agre buy medi... | pharmaceut giant johnson johnson agre buy medi... |
| 442 | 1286 | Child access law shake-up planned\n | \n Parents who refuse to allow former partner... | parents refuse allow former partners contact c... | parent refus allow former partner contact chil... | parent refus allow former partner contact chil... |
| 443 | 1506 | Real in talks over Gravesen move\n | \n Real Madrid are closing in on a £2m deal ... | real madrid closing deal evertons thomas grave... | real madrid close deal everton thoma gravesen ... | real madrid close deal everton thoma gravesen ... |


In [63]:
test_set = pd.read_csv("./data_sets/test_set.tsv", delimiter="\t", names=["id", "title", "content"], header=0)
more_stop_words = ["say","said","want","thing","may","see","make","look","likely","well","told","uses","used","use","bn","mr","year","people","new"]
pre = preprocessing()
test_set = pre.clean_stem_lemmatize_column(test_set,"content",more_stop_words,True)
display(test_set)

| | id | title | content | content_clean_stems_lemmas |
|---|---|---|---|---|
| 0 | 385 | Tate & Lyle boss bags top award\n | \n Tate & Lyle's chief executive has been nam... | tate lyle chief execut name european businessm... |
| 1 | 1984 | Halo 2 sells five million copies\n | \n Microsoft is celebrating bumper sales of i... | microsoft celebr bumper sale xbox scifi shoote... |
| 2 | 986 | MSPs hear renewed climate warning\n | \n Climate change could be completely out of ... | climat chang could complet control within seve... |
| 3 | 1387 | Pavey focuses on indoor success\n | \n Jo Pavey will miss January's View From Gre... | jo pavey miss januari view great edinburgh int... |
| 4 | 1295 | Tories reject rethink on axed MP\n | \n Sacked MP Howard Flight's local Conservati... | sack mp howard flight local conserv associ ins... |
| ... | ... | ... | ... | ... |
| 440 | 439 | German economy rebounds\n | \n Germany's economy, the biggest among the 1... | germani economi biggest among countri share eu... |
| 441 | 300 | J&J agrees $25bn Guidant deal\n | \n Pharmaceutical giant Johnson & Johnson has... | pharmaceut giant johnson johnson agre buy medi... |
| 442 | 1286 | Child access law shake-up planned\n | \n Parents who refuse to allow former partner... | parent refus allow former partner contact chil... |
| 443 | 1506 | Real in talks over Gravesen move\n | \n Real Madrid are closing in on a £2m deal ... | real madrid close deal everton thoma gravesen ... |
