## Introduction

This notebook collects the text preprocessing techniques commonly used in natural language processing (NLP) in one place, both for better understanding and as a single point of reference. I have tried to make it as comprehensive as possible, and it will be updated from time to time to add methods or to replace existing ones when more efficient approaches become available. A few of the methods included here, such as removing Twitter mentions, are specific to tweet datasets, while the others can be used for any NLP dataset.

### Why do text preprocessing
Machine learning models operate on numerical values and cannot consume raw text directly, so textual information has to be represented numerically. Text generated from natural languages is inherently unstructured and noisy. Before representing it, we clean it by removing the parts that add little information, such as punctuation and stopwords, which makes the data more consistent. Consistent data can then be represented efficiently for use in machine learning models.

### Methods used for text preprocessing
Many methods are used for preprocessing textual data, and which ones to use depends on the data as well as the application. For example, human communication such as emails or tweets contains many contractions, like can't for cannot or they've for they have. Such contractions are unlikely to appear in a formal document such as a scientific journal article, so expanding them is not a necessary step there.
The following methods are included in this notebook.
1. Lower Casing
2. Removing HTML
3. Expand Contractions
4. Removing URLs
5. Removing Email IDs
6. Removing Twitter Mentions
7. Handling Emojis
8. Handling Accented Words
9. Removing Unicode Characters
10. Abbreviation/Acronym Disambiguation
11. Removing Punctuation
12. Removing Digits or Words with Digits
13. Removing Stopwords
14. Removing Extra Spaces
15. Stemming or Lemmatization
16. Spelling Correction
17. Correcting Compound Words
18. British-American English Conversion

A brief explanation of each method is included in the subsections, and the sequence of these preprocessing steps is discussed at the end.


### Why keep hashtags in tweet data
Hashtags are a way to include keywords or key phrases in a tweet for better discoverability on Twitter. They also provide general context about the content of the tweet and the topic it relates to. For example, the disaster tweets dataset contains hashtags like #wildfire, #flooding and #caraccident, which name a disaster, and that is exactly what we are trying to classify. A tweet containing such a hashtag is more likely to be related to the particular disaster witnessed by the person tweeting it. Hence keeping the hashtags retains more information (sometimes more than the rest of the tweet itself).

In [1]:
import re
from nltk.corpus import stopwords
import string
import pandas as pd

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

## Dataset

In [2]:
tweets_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
tweets_df.head() 

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


According to the competition description, the keywords are important for classifying disaster tweets, so a combined tweet column is created by joining keyword and text. Missing keywords are first replaced by an empty string.

In [3]:
tweets_df["keyword"] = tweets_df["keyword"].fillna("")
tweets_df["tweet"] = tweets_df["keyword"] + " " + tweets_df["text"]
tweets_df.sample(5, random_state=42)

Unnamed: 0,id,keyword,location,text,target,tweet
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1,destruction So you have a new weapon that can ...
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0,deluge The f$&amp;@ing things I do for #GISHWH...
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1,police DT @georgegalloway: RT @Galloway4Mayor:...
132,191,aftershock,,Aftershock back to school kick off was great. ...,0,aftershock Aftershock back to school kick off ...
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0,trauma in response to trauma Children of Addic...


## Lower Case
Lower casing is generally the first preprocessing step. Because each distinct string is treated as a separate token, a word of $n$ letters can appear in up to $2^n$ differently cased forms, such as dataset, Dataset, DaTaSeT and so on. We want to represent the text as efficiently as possible, and casing generally does not provide additional information, so we make the representation case-insensitive. Abbreviations are an exception and are handled separately.
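
To make the exponential number of casings concrete, here is a minimal sketch that enumerates every upper/lower-case spelling of a short word. The helper name `casing_variants` is made up for this illustration and is not used elsewhere in the notebook.

```python
from itertools import product


def casing_variants(word):
    """Return every upper/lower-case spelling of a purely alphabetic word."""
    # For each letter there are two choices: its lower or its upper form.
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


variants = casing_variants("data")
print(len(variants))                  # 2 ** 4 = 16 distinct tokens
print({v.lower() for v in variants})  # lower-casing collapses them into one
```

A seven-letter word such as dataset already has 128 such variants, all of which map to a single token after lower casing.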

In [4]:
tweets_df["tweet_lower"] = tweets_df["tweet"].str.lower()
tweets_df["tweet_lower"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&amp;@ing things i do for #gishwh...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_lower, dtype: object

## Remove HTML
HTML stands for HyperText Markup Language, which is used to structure the content of web pages. HTML entities such as `&gt;` and `&lt;` often creep into textual data, and the text might also contain HTML tags such as `<p>`, `<a>` or `<div>`. These need to be removed, as they are nothing but noise and can negatively affect the performance of the model.

In [5]:
from bs4 import BeautifulSoup
text = r"&gt;&gt; $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book http://t.co/f6ntUc734Z esquireattire"
soup = BeautifulSoup(text, "html.parser")
soup.get_text()

'>> $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book http://t.co/f6ntUc734Z esquireattire'

In [6]:
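def remove_html(text):
    # Name the parser explicitly so the result does not depend on which
    # optional parsers happen to be installed
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    return text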

In [7]:
tweets_df["tweet_noHTML"] = tweets_df["tweet_lower"].apply(remove_html)
tweets_df["tweet_noHTML"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noHTML, dtype: object
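
When BeautifulSoup is not available, a rough standard-library approximation of this step could look like the sketch below. It is not equivalent to a real HTML parser: the tag pattern is deliberately naive, and the name `remove_html_stdlib` is an assumption made for this example.

```python
import html
import re

# Naive assumption: a tag is anything between "<" and the next ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def remove_html_stdlib(text):
    """Strip simple tags, then decode entities such as &amp; and &gt;."""
    # Tags are removed first so that a decoded "&lt;" is never mistaken for one.
    text = TAG_PATTERN.sub(" ", text)
    return html.unescape(text)


print(remove_html_stdlib("<p>f$&amp;@ing &gt;&gt; $15</p>"))
```

For malformed or nested markup, a proper parser remains the safer choice.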

## Expand Contractions

Informal communication contains many contractions, such as can't (cannot) and they've (they have), and even modern slang contractions such as sux (sucks). If left unexpanded, many of them are treated as stopwords and removed. The Python package `contractions` has a collection of most such contractions and conveniently expands them as a preprocessing step.

In [8]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (101 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24

In [9]:
import contractions

tweets_df["tweet_noContractions"] = tweets_df["tweet_noHTML"].apply(contractions.fix)
tweets_df["tweet_noContractions"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noContractions, dtype: object
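
The idea behind contraction expansion is essentially a dictionary lookup. The following is a minimal sketch of that idea using only the standard library. It is not how the `contractions` package is implemented, and the tiny `CONTRACTION_MAP` is a hand-written sample assumed for illustration.

```python
import re

# Tiny hand-written sample; a real mapping would be far larger.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "they've": "they have",
    "i'm": "i am",
}

# One alternation of all keys, longest first, matched on word boundaries.
_KEYS = sorted(CONTRACTION_MAP, key=len, reverse=True)
CONTRACTION_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, _KEYS)) + r")\b")


def expand_contractions(text):
    """Expand the known contractions in already lower-cased text."""
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(1)], text)


print(expand_contractions("they've said we can't go"))
# they have said we cannot go
```

A plain lookup cannot resolve ambiguous forms such as it's (it is or it has), which is one reason to prefer a maintained package for real data.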

## Remove URLs
URL stands for Uniform Resource Locator, which is used to locate resources on the web. URLs generally do not provide additional information for the NLP task and are hard to handle otherwise, so they are removed. They can either be removed completely or replaced by a common word such as 'website' or 'url' to keep the information that a URL was present in the text.

In [10]:
def remove_urls(text):
    pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)(/\w*)?')
    text = re.sub(pattern, "", text)
    return text

In [11]:
text = "#stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents https://t.co/k4zoMOF319 https://t.co/S2kXVM0cBA Car Accident"
remove_urls(text)

'#stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents   Car Accident'
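
The replacement variant mentioned above can be sketched as follows. The pattern here is a looser alternative to the one in `remove_urls`: it assumes a URL is `http://`, `https://` or `www.` followed by any run of non-space characters, so it also covers longer paths and query strings. The function name `replace_urls` and the default placeholder are choices made for this example.

```python
import re

# Assumption: a URL starts with "http(s)://" or "www." and runs until the
# next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def replace_urls(text, placeholder="url"):
    """Replace every URL with a placeholder token (pass "" to drop it)."""
    return URL_PATTERN.sub(placeholder, text)


print(replace_urls("teen accidents https://t.co/k4zoMOF319 car accident"))
# teen accidents url car accident
```

Because the match runs to the next space, punctuation glued to the end of a URL (for example a closing full stop) is swallowed as well, which may or may not matter for a given task.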

In [12]:
tweets_df["tweet_noURLs"] = tweets_df["tweet_noContractions"].apply(remove_urls)
tweets_df["tweet_noURLs"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noURLs, dtype: object

## Remove Email IDs
Email IDs are ubiquitous in online text. As they do not provide any additional information (unless you are specifically extracting emails from the text for a specific use case), we remove them. As with URLs, an email ID can be removed completely or replaced with a common word such as "email".

In [13]:
def remove_emails(text):
    pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
    text = re.sub(pattern, "", text)
    return text

In [14]:
text = "please send your feedback to myemail@gmail.com "
remove_emails(text)

'please send your feedback to  '

In [15]:
tweets_df["tweet_noEmail"] = tweets_df["tweet_noURLs"].apply(remove_emails)
tweets_df["tweet_noEmail"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noEmail, dtype: object

## Remove Twitter Mentions
The text contains mentions written with @, which generally appear on Twitter and in online forums. These mentions have to be removed before removing punctuation, otherwise they are hard to find without the @ attached to them. They are generally names of people and do not provide additional information helpful for the NLP task. (News channel handles in the disaster tweets dataset can be an exception, and they can be handled separately if required.)

In [16]:
def remove_mentions(text):
    pattern = re.compile(r"@\w+")
    text = re.sub(pattern, "", text)
    return text

In [17]:
tweets_df["tweet_noMention"] = tweets_df["tweet_noEmail"].apply(remove_mentions)
tweets_df["tweet_noMention"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$& things i do for #gishwhes just ...
5448    police dt : rt : ûïthe col police can catch a...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noMention, dtype: object
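
In the sample output above, the censored word f$&@ing lost its ending because `@ing` looks like a mention. One possible refinement, sketched below under the assumption that a genuine mention starts the text or follows whitespace, is to add a lookbehind to the pattern. The name `remove_mentions_strict` is hypothetical.

```python
import re

# Assumption: a real mention is at the start of the text or preceded by
# whitespace, so "@" in the middle of a token is left alone.
MENTION_PATTERN = re.compile(r"(?<!\S)@\w+")


def remove_mentions_strict(text):
    """Remove @mentions without touching "@" inside other tokens."""
    return MENTION_PATTERN.sub("", text)


print(remove_mentions_strict("the f$&@ing things rt @galloway4mayor: hi"))
# the f$&@ing things rt : hi
```

The trade-off is that mentions glued to punctuation, such as `.@user` or `(@user)`, are no longer removed by this stricter pattern.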

Hashtags can be removed in a similar way, but in this competition they carry key information and, as mentioned before, are not removed.

## Handling Emojis

Generally emojis are removed, but in the case of disaster tweets they can carry some information and hence need to be handled properly.

I propose to convert the emojis to the six basic emotions (happiness, sadness, anger, disgust, fear and surprise) plus a neutral state. Each emotion class can contain multiple emojis; for example, happiness can contain 😀 😃 😄 😁 😆 😅 😂 🤣. Another possible approach is to simply replace each emoji by its name, such as replacing 😀 by "smiley face".

This step needs to be done before removing Unicode characters in the next step, because emojis are represented in Unicode.

**I think I will wait for some discussion in the comments regarding this before implementing any approach for this, because there can be ambiguity in the use of emojis as well**
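
Purely as an illustration of the first idea, and not as a settled approach, the mapping could be sketched with a plain dictionary. The `EMOJI_TO_EMOTION` table below is a small hand-picked assumption; which emoji belongs to which emotion is a judgment call, and a usable table would need many more entries.

```python
# Hypothetical, hand-picked mapping from emoji to a basic emotion label.
EMOJI_TO_EMOTION = {
    "😀": "happiness", "😃": "happiness", "😄": "happiness", "😂": "happiness",
    "😢": "sadness", "😭": "sadness",
    "😠": "anger", "😡": "anger",
    "🤢": "disgust",
    "😨": "fear", "😱": "fear",
    "😮": "surprise", "😲": "surprise",
    "😐": "neutral",
}


def replace_emojis(text):
    """Replace each known emoji with its emotion label as a separate word."""
    for emoji_char, emotion in EMOJI_TO_EMOTION.items():
        text = text.replace(emoji_char, f" {emotion} ")
    # Normalise the spaces introduced by the padding above.
    return " ".join(text.split())


print(replace_emojis("fire near my house😱"))
# fire near my house fear
```

Emojis that are not in the table are left untouched, so they would still be dropped by the later Unicode removal step.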

## Handling Accented Words
In English, accent marks are largely confined to proper names and borrowed words of foreign origin, such as résumé and tête-à-tête, but they occur frequently in several other European languages, including Spanish, French, Italian, German and Portuguese. Accented characters have to be handled before removing Unicode characters, otherwise they are simply dropped. This step may not be helpful in the disaster tweets classification task, because the text is very noisy and contains many stray Unicode characters that get converted into ASCII characters, creating gibberish.

In [18]:
from unidecode import unidecode
text = "words of foreign origin, such as résumé and tête-à-tête"
unidecode(text)

'words of foreign origin, such as resume and tete-a-tete'

In [19]:
def handle_accents(text):
    text = unidecode(text)
    return text

In [20]:
tweets_df["tweet_handleAccents"] = tweets_df["tweet_noMention"].apply(handle_accents)
tweets_df["tweet_handleAccents"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$& things i do for #gishwhes just ...
5448    police dt : rt : uithe col police can catch a ...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_handleAccents, dtype: object
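
If adding the `unidecode` dependency is not desirable, accents can also be stripped with the standard library. The sketch below decomposes each character and drops the combining marks; the function name `strip_accents` is chosen for this example.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé and tête-à-tête"))
# resume and tete-a-tete
```

Unlike `unidecode`, this does not transliterate characters that have no decomposition (for example ß or ø), so such characters are left in place and would be removed by the Unicode removal step.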

## Remove Unicode Characters
> Unicode, formally The Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines as of the current version (15.0) 149,186 characters covering 161 modern and historic scripts, as well as symbols, 3664 emoji (including in colors), and non-visual control and formatting codes. - [Wikipedia](https://en.wikipedia.org/wiki/Unicode)

In contrast, the ASCII character set has only 128 characters, which include upper- and lower-case English letters, digits, a few punctuation marks and control characters. In many cases we need to represent the text using only ASCII characters and hence need to remove the other Unicode characters. (This might not be true while working with languages other than English.)

In [21]:
def remove_unicode_chars(text):
    text = text.encode("ascii", "ignore").decode()
    return text

As mentioned before, this step removes the accented characters entirely. For example:

In [22]:
text = "words of foreign origin, such as résumé and tête-à-tête"
remove_unicode_chars(text)

'words of foreign origin, such as rsum and tte--tte'

In [23]:
tweets_df["tweet_noUnicode"] = tweets_df["tweet_noMention"].apply(remove_unicode_chars)
tweets_df["tweet_noUnicode"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$& things i do for #gishwhes just ...
5448    police dt : rt : the col police can catch a pi...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noUnicode, dtype: object

## Abbreviation/Acronym Disambiguation
A large number of abbreviations and acronyms are used in the text. They can carry meaningful information for the classification task and might get removed or distorted during other preprocessing steps, so they need to be expanded early in the preprocessing. @gunesevitan has given many of these abbreviations in his [notebook](https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert). I am also trying to come up with an approach to find all such abbreviations in the Disaster Tweets dataset in a notebook about [Abbreviation Disambiguation](https://www.kaggle.com/code/rohitgarud/abbreviation-disambiguation-for-disaster-tweets?rvi=1).

In [24]:
# Acronyms and abbreviations (despite its name, this function expands them)
def remove_abbreviations(text):
    # The text is expected to be lower-cased already
    text = re.sub(r"mh370", "missing malaysia airlines flight", text)
    text = re.sub(r"okwx", "oklahoma city weather", text)
    text = re.sub(r"arwx", "arkansas weather", text)
    text = re.sub(r"gawx", "georgia weather", text)
    text = re.sub(r"scwx", "south carolina weather", text)
    text = re.sub(r"cawx", "california weather", text)
    text = re.sub(r"tnwx", "tennessee weather", text)
    text = re.sub(r"azwx", "arizona weather", text)
    text = re.sub(r"alwx", "alabama weather", text)
    text = re.sub(r"wordpressdotcom", "wordpress", text)
    text = re.sub(r"usnwsgov", "united states national weather service", text)
    text = re.sub(r"suruc", "sanliurfa", text)
    return text

There are many more abbreviations in the dataset and a more thorough checking is required to find all the abbreviations/acronyms.
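
As the list grows, one `re.sub` call per abbreviation becomes hard to maintain. A possible alternative, sketched here with a few of the entries above, is a single dictionary combined with one compiled pattern. The word boundaries are an addition of this sketch (so that an abbreviation inside a longer word is not expanded), and the names `ABBREVIATIONS` and `expand_abbreviations` are hypothetical.

```python
import re

# A few entries taken from the function above; extend as needed.
ABBREVIATIONS = {
    "mh370": "missing malaysia airlines flight",
    "okwx": "oklahoma city weather",
    "cawx": "california weather",
}

# One alternation of all keys, matched only as whole words.
ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
)


def expand_abbreviations(text):
    """Expand known abbreviations in lower-cased text in a single pass."""
    return ABBREVIATION_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


print(expand_abbreviations("#okwx storm, okwxtra stays"))
# #oklahoma city weather storm, okwxtra stays
```

The hashtag is still expanded because `#` is not a word character, while the made-up token okwxtra is left alone.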

## Remove Punctuation
Punctuation defines the structure of the text, such as full stops terminating sentences, and can be used for sentence tokenization. However, in some NLP tasks punctuation does not provide relevant information and is removed. There are a number of ways of removing it; here Python's built-in regular expression library `re` is used.

In [25]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:
def remove_punctuations(text):
    text = re.sub('[%s]' % re.escape(string.punctuation), " ",text)
    return text

Another approach is to keep only alphanumeric characters by replacing everything that matches the regex pattern `[^a-zA-Z0-9]` with a space.
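
A third option that avoids regular expressions altogether is `str.translate` with a translation table. The sketch below maps every character in `string.punctuation` to a space; the name `remove_punctuations_translate` is chosen here to keep it apart from the function above.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_punctuations_translate(text):
    """Replace each ASCII punctuation character with a single space."""
    return text.translate(PUNCT_TO_SPACE)


cleaned = remove_punctuations_translate("rt: #wildfire near la ronge, sask.")
print(cleaned.split())
# ['rt', 'wildfire', 'near', 'la', 'ronge', 'sask']
```

Like the regex version, this leaves runs of spaces behind, which are cleaned up by the extra-space removal step later in the notebook.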

In [27]:
tweets_df["tweet_noPuncts"] = tweets_df["tweet_noUnicode"].apply(remove_punctuations)
tweets_df["tweet_noPuncts"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f   things i do for  gishwhes just ...
5448    police dt   rt   the col police can catch a pi...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noPuncts, dtype: object

## Remove Digits or Words Containing Digits
This might not be appropriate in many cases. For example "MH370" mentioned in the tweets corresponds to Malaysia Airlines Flight 370 which went missing. In this case, keeping this number in the text might be useful in the disaster tweet classification.

In [28]:
def remove_digits(text):
    pattern = re.compile(r"\w*\d+\w*")  # raw string avoids invalid-escape warnings
    text = re.sub(pattern, "",text)
    return text

In [29]:
text = " m194 0104 utc5km s of volcano hawaii"
remove_digits(text)

'    s of volcano hawaii'

In [30]:
tweets_df["tweet_noDigits"] = tweets_df["tweet_noPuncts"].apply(remove_digits)
tweets_df["tweet_noDigits"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f   things i do for  gishwhes just ...
5448    police dt   rt   the col police can catch a pi...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noDigits, dtype: object

## Remove Stopwords
Stopword removal is one of the fundamental preprocessing operations in many NLP tasks. Stopwords are very frequent words such as 'a', 'and', 'the', 'is' and 'can', which are removed so that only information-rich words remain in the text.

In [31]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'his', "haven't", 'in', 'will', 'what', 'down', 'after', 'whom', 'hadn', 'most', 'doing', 'with', 'ourselves', 'under', "didn't", 'not', 'doesn', "that'll", 'each', "aren't", 'o', 're', 'into', 'themselves', 'ma', 'these', 'myself', 'yours', 'we', 'yourselves', 'very', "doesn't", 'does', 'has', 'herself', 'no', 'mightn', 'm', 'up', "mightn't", 'for', 'about', 'have', "you're", 'to', 'an', 'd', 'having', 'this', "hasn't", 'being', 'were', 'above', 'now', 'wouldn', 'do', 'her', 'mustn', 'because', 'here', 'didn', 'once', 'before', 'your', 'wasn', 'more', 'is', 'off', 'further', 'aren', 'who', "hadn't", 'haven', 'both', "won't", 'that', "wouldn't", "isn't", 'all', 'should', 'it', 'can', "wasn't", 've', 'was', "mustn't", 'hers', 'ain', 'same', 'such', 'if', 'of', 'so', "needn't", "she's", 'are', 'on', 'than', 'the', 'don', "should've", 'been', 'their', 'any', 'too', 'its', 'my', 'between', 'why', 'ours', 'which', 'or', 'isn', 'won', 'out', 'how', 'i', 'where', 'only', 'as', 'll', 'until',

The NLTK library supports multiple languages, and the stopwords for any of them can be obtained by replacing 'english' with the name of the language in the code above. The supported languages are:

In [32]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [33]:
def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])

In [34]:
tweets_df["tweet_noStopwords"] = tweets_df["tweet_noDigits"].apply(remove_stopwords)
tweets_df["tweet_noStopwords"].sample(5, random_state=42)

2644    destruction new weapon cause un imaginable des...
2227    deluge f things gishwhes got soaked deluge goi...
5448    police dt rt col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma children addicts develo...
Name: tweet_noStopwords, dtype: object

I sometimes remove stopwords before removing punctuation, as many stopwords contain an apostrophe. However, most of these stopwords are already expanded by the contraction expansion step above.

## Removing Extra Spaces
Several preprocessing steps introduce additional spaces at the start, at the end or in between words, and these need to be removed. In the case above, removing stopwords splits the text on whitespace and rejoins it, which already removes extra spaces. However, we can still run the following code to be sure.

In [35]:
def remove_extra_spaces(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

In [36]:
tweets_df["tweet_noExtraspace"] = tweets_df["tweet_noStopwords"].apply(remove_extra_spaces)
tweets_df["tweet_noExtraspace"].sample(5, random_state=42)

2644    destruction new weapon cause un imaginable des...
2227    deluge f things gishwhes got soaked deluge goi...
5448    police dt rt col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma children addicts develo...
Name: tweet_noExtraspace, dtype: object

## Stemming or Lemmatization
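Stemming strips suffixes with simple rules to reduce a word to its stem, while lemmatization maps a word to its dictionary form (lemma) using a vocabulary and, ideally, the word's part of speech. Stemming is fast but often produces truncated strings that are not real words, whereas lemmatization returns meaningful words. Hence, I generally prefer lemmatization over stemming. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, which is why plural nouns are reduced in the output below while verb forms such as soaked are left as they are.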
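
The difference between the two can be illustrated with a deliberately crude toy. Neither function below is a real algorithm: `toy_stem` only strips a handful of suffixes and `toy_lemmatize` looks words up in a made-up three-entry lexicon that stands in for a resource such as WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping; the result need not be a real word."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-lexicon mapping inflected forms to their lemma.
TOY_LEMMAS = {"studies": "study", "children": "child", "caused": "cause"}


def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ("studies", "children", "caused"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer turns studies into stud and caused into caus and cannot handle the irregular plural children at all, while the lookup returns study, cause and child.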

In [37]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = [lemmatizer.lemmatize(word) for word in text.split()]
    text = ' '.join(words)
    return text

In [38]:
tweets_df["tweet_lemmatised"] = tweets_df["tweet_noExtraspace"].apply(lemmatize_text)
tweets_df["tweet_lemmatised"].sample(5, random_state=42)

2644    destruction new weapon cause un imaginable des...
2227    deluge f thing gishwhes got soaked deluge goin...
5448    police dt rt col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_lemmatised, dtype: object

## Spelling Correction
Spelling correction can help reduce the variations of a word and avoid misrepresentation of the information. It can help in the tweet classification task considered here because tweets are particularly susceptible to misspellings, whether deliberate or not. There are a few options, such as the spell checker from TextBlob and symspellpy (a Python port of SymSpell). In my experience TextBlob is prohibitively slow on a dataset of this size, while symspellpy is very fast and accurate. symspellpy is also language agnostic if a proper dictionary is used, hence it is used here.

In [39]:
!pip install symspellpy

Collecting symspellpy
  Downloading symspellpy-6.7.7-py3-none-any.whl (2.6 MB)
Collecting editdistpy>=0.1.3
  Downloading editdistpy-0.1.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (125 kB)
Installing collected packages: editdistpy, symspellpy
Successfully installed editdistpy-0.1.3 symspellpy-6.7.7

In [40]:
import pkg_resources
from symspellpy import SymSpell, Verbosity

symspellpy can return multiple suggestions for each word. We select the first suggestion, which is the best-ranked candidate.

In [41]:
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True

In [42]:
def correct_spelling_symspell(text):
    words = [
        sym_spell.lookup(
            word, 
            Verbosity.CLOSEST, 
            max_edit_distance=2,
            include_unknown=True
            )[0].term 
        for word in text.split()] 
    text = " ".join(words)
    return text

With `include_unknown=True`, a word that has no dictionary entry within `max_edit_distance` is returned as the only suggestion, so the word is kept unchanged.
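
To give a feel for what an edit distance of 2 means, here is a plain Levenshtein distance in pure Python. It is only an illustration: the distance measure and lookup strategy inside symspellpy may differ and are far faster than this quadratic loop.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("erthquake", "earthquake"))  # 1, within the limit of 2
print(levenshtein("flod", "flooding"))         # 4, outside the limit of 2
```

With a limit of 2, a misspelling such as erthquake can be matched to earthquake, while a heavily abbreviated token that is four edits away from any dictionary word is left as it is.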

In [43]:
tweets_df["tweet_spellcheck"] = tweets_df["tweet_lemmatised"].apply(correct_spelling_symspell)
tweets_df["tweet_spellcheck"].sample(5, random_state=42)

2644    destruction new weapon cause in imaginable des...
2227    deluge of thing gishwhes got soaked deluge goi...
5448    police it it col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_spellcheck, dtype: object

The correction is not perfect and introduces new stopwords (in the output above, un became in and dt rt became it it), but it can help in many cases. More investigation against the competition results is required.

The [symspellpy library](https://symspellpy.readthedocs.io/en/latest/examples/dictionary.html) is described as "language independent (agnostic)" and can be used with any language. The English dictionary shipped with the library is used in the above example, but such a dictionary can be created for any language from a large enough corpus in plain text format using the `create_dictionary` function. You can read [1000x Faster Spelling Correction algorithm](https://wolfgarbe.medium.com/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f) and the [documentation of symspellpy library](https://symspellpy.readthedocs.io/en/latest/) for more details.

In [44]:
# from symspellpy import SymSpell

# sym_spell = SymSpell()
# corpus_path = <path/to/plain/text/file>
# sym_spell.create_dictionary(corpus_path)

# print(sym_spell.words)

## Correcting Compound Words
Sometimes multiple words are concatenated without spaces, producing tokens that are not in the dictionary and that misrepresent the information. In the tweets dataset such compound words mostly come from hashtags. As mentioned before, hashtags contain useful information, hence the compound words need to be segmented into meaningful words. Some hashtags are proper names, which the segmentation can mangle, as NASAHurricane shows in the example below.

In [45]:
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

True

In [46]:
def correct_spelling_symspell_compound(text):
    words = [
        sym_spell.lookup_compound(
            word, 
            max_edit_distance=2
            )[0].term 
        for word in text.split()] 
    text = " ".join(words)
    return text

In [47]:
text = "IranDeal PantherAttack TrapMusic StrategicPatience socialnews NASAHurricane onlinecommunities humanconsumption"
correct_spelling_symspell_compound(text)

'iran deal panther attack trap music strategic patience social news as hurricane online communities human consumption'
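
The core idea of segmenting a concatenated token can be sketched with a small dynamic program over a vocabulary. This is not the algorithm used by `lookup_compound`, which also corrects spelling and uses word frequencies. The `VOCAB` set is a made-up mini-vocabulary and `segment` is a hypothetical helper.

```python
# Hypothetical mini-vocabulary; a real system would use a frequency dictionary.
VOCAB = {"iran", "deal", "trap", "music", "social", "news", "panther", "attack"}


def segment(word, vocab=VOCAB):
    """Split `word` into the fewest vocabulary words, or return it unchanged."""
    # best[i] holds the shortest list of vocabulary words covering word[:i].
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return " ".join(best[-1]) if best[-1] else word


print(segment("irandeal"))   # iran deal
print(segment("gishwhes"))   # gishwhes (no split found, kept as is)
```

Tokens that cannot be fully covered by the vocabulary are returned unchanged, which is a conservative choice for proper names such as #GISHWHES.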

In [48]:
tweets_df["tweet_spellcheck_compound"] = tweets_df["tweet_spellcheck"].apply(correct_spelling_symspell_compound)
tweets_df["tweet_spellcheck_compound"].sample(5, random_state=42)

2644    destruction new weapon cause in imaginable des...
2227    deluge of thing gish hes got soaked deluge goi...
5448    police it it col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_spellcheck_compound, dtype: object

## British-American English Conversion
Many words are spelled differently in British and American English, such as (colour: color) and (standardise: standardize). Depending on the text data, both spellings can be present, and one might want to convert all British spellings to American ones or vice versa.

In [49]:
import requests

url ="https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/american_spellings.json"
american_to_british_dict = requests.get(url).json()

url ="https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/british_spellings.json"
british_to_american_dict = requests.get(url).json()

In [50]:
# Based on https://stackoverflow.com/questions/42329766/python-nlp-british-english-vs-american-english
def britishize(text):
    # Words without an entry in the dictionary are kept unchanged
    words = [american_to_british_dict.get(word, word) for word in text.split()]
    return " ".join(words)


def americanize(text):
    # Words without an entry in the dictionary are kept unchanged
    words = [british_to_american_dict.get(word, word) for word in text.split()]
    return " ".join(words)

In [51]:
text = "Discount analyse standardised colour"
americanize(text)

'Discount analyze standardized color'

In [52]:
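# Note: the quote characters inside this string stay attached to the first and
# last tokens after split(), so "color'" finds no dictionary entry and is kept
text = "'Discount analyze standardized color'"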
britishize(text)

"'Discount analyse standardised color'"

In [53]:
tweets_df["tweet_american"] = tweets_df["tweet_spellcheck_compound"].apply(americanize)
tweets_df["tweet_american"].sample(5, random_state=42)

2644    destruction new weapon cause in imaginable des...
2227    deluge of thing gish hes got soaked deluge goi...
5448    police it it col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_american, dtype: object

In [54]:
tweets_df["tweet_british"] = tweets_df["tweet_spellcheck_compound"].apply(britishize)
tweets_df["tweet_british"].sample(5, random_state=42)

2644    destruction new weapon cause in imaginable des...
2227    deluge of thing gish hes got soaked deluge goi...
5448    police it it col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_british, dtype: object

## Final Stopword Removal
The spell checking and compound word segmentation steps introduce a few new stopwords into the data, so one final stopword removal step is required. Additional words can be included in the stopwords list depending on the specific application.

In [55]:
tweets_df["tweet_final"] = tweets_df["tweet_spellcheck_compound"].apply(remove_stopwords)
tweets_df["tweet_final"].sample(5, random_state=42)

2644    destruction new weapon cause imaginable destru...
2227    deluge thing gish hes got soaked deluge going ...
5448    police col police catch pickpocket liverpool s...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_final, dtype: object
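
Extending the stopword list is a simple set union. In the sketch below, `BASE_STOP_WORDS` is a tiny stand-in for the NLTK list so that the example runs without downloading any corpus, and the tweet-specific additions in `EXTRA_STOP_WORDS` are assumptions based on leftover tokens such as dt and rt seen in earlier outputs.

```python
# Small stand-in for NLTK's English list so the sketch runs without downloads.
BASE_STOP_WORDS = {"a", "an", "the", "is", "it", "of", "in"}

# Hypothetical tweet-specific additions (retweet markers, HTML leftovers).
EXTRA_STOP_WORDS = {"rt", "dt", "amp"}

CUSTOM_STOP_WORDS = BASE_STOP_WORDS | EXTRA_STOP_WORDS


def remove_custom_stopwords(text):
    """Drop every whitespace-separated token found in the custom list."""
    return " ".join(w for w in text.split() if w not in CUSTOM_STOP_WORDS)


print(remove_custom_stopwords("police dt rt col police catch"))
# police col police catch
```

With the real NLTK list, the same union would be written as `stop_words | EXTRA_STOP_WORDS`.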

In [56]:
tweets_df.to_csv("disaster_tweets_cleaned.csv")

Do check the CSV file generated after these steps in external software such as MS Excel or Google Sheets to better understand the effect of each preprocessing step on the input text.

## Sequence of Preprocessing Steps
The proper sequence of these operations needs to be determined to achieve efficient data preprocessing.

(will be added soon)
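
Until the sequence is worked out, the following sketch shows one way to make the order explicit and easy to change: the steps are kept in a list and applied one after another. The step functions are simplified stand-ins written for this example, not the notebook's functions, and the order shown simply follows the order of the sections above rather than a tuned one.

```python
import re
from functools import reduce


def to_lower(text):
    return text.lower()


def drop_urls(text):
    return re.sub(r"https?://\S+", "", text)


def drop_mentions(text):
    return re.sub(r"@\w+", "", text)


def drop_punctuation(text):
    return re.sub(r"[^\w\s]", " ", text)


def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


# Order matters: URLs and mentions are recognised by ':', '/', '.' and '@',
# so they have to be removed before punctuation is stripped.
PIPELINE = [to_lower, drop_urls, drop_mentions, drop_punctuation, squeeze_spaces]


def preprocess(text, steps=PIPELINE):
    """Apply each step to the output of the previous one."""
    return reduce(lambda acc, step: step(acc), steps, text)


print(preprocess("RT @user: Forest fire near La Ronge! https://t.co/abc #wildfire"))
# rt forest fire near la ronge wildfire
```

Reordering the list, for example moving `drop_punctuation` to the front, is enough to see how URL and mention removal break when the order is wrong.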

## Classification using Clean Dataset

I will soon add a few examples of training multiple simple models and comparing the results.