## What is Text Preprocessing? Why do we need it?

Text in its original form may contain non-alphanumeric characters, URLs, stop words, words with mixed case, emojis, abbreviations, and so on.
These make it difficult for a Natural Language Processing (NLP) algorithm to analyse the text. Text preprocessing cleans the text data so that it is ready to feed to the model, which keeps the model from being confused by characters and tokens it cannot interpret.

## Importing Dataset

Link to dataset: https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing/data

In [None]:
import pandas as pd
dataset = pd.read_csv('sample.csv')
print(dataset.columns)
print(dataset.head())

Index(['tweet_id', 'author_id', 'inbound', 'created_at', 'text',
       'response_tweet_id', 'in_response_to_tweet_id'],
      dtype='object')
   tweet_id     author_id  ...  response_tweet_id in_response_to_tweet_id
0    119237        105834  ...             119236                     NaN
1    119238  ChaseSupport  ...                NaN                119239.0
2    119239        105835  ...             119238                     NaN
3    119240  VirginTrains  ...             119241                119242.0
4    119241        105836  ...             119243                119240.0

[5 rows x 7 columns]


In [None]:
# Drop every column except 'text', since only the text is preprocessed
dataset = dataset.drop(['tweet_id', 'author_id', 'inbound', 'created_at', 'response_tweet_id', 'in_response_to_tweet_id'], axis=1)
dataset

Unnamed: 0,text
0,@AppleSupport causing the reply to be disregar...
1,@105835 Your business means a lot to us. Pleas...
2,@76328 I really hope you all change but I'm su...
3,@105836 LiveChat is online at the moment - htt...
4,@VirginTrains see attached error message. I've...
...,...
88,@105860 I wish Amazon had an option of where I...
89,They reschedule my shit for tomorrow https://t...
90,"@105861 Hey Sara, sorry to hear of the issues ..."
91,@Tesco bit of both - finding the layout cumber...


## Removal of URL

In [None]:
print(dataset['text'][4])

@VirginTrains see attached error message. I've tried leaving a voicemail several times in the past week https://t.co/NxVZjlYx1k


In [None]:
# Replace URLs (http or https) with "" across the whole column.
# A vectorized call avoids the hard-coded row count and chained assignment.
dataset['text'] = dataset['text'].str.replace(r'https?://\S+', '', regex=True)

In [None]:
print(dataset['text'][4])

@VirginTrains see attached error message. I've tried leaving a voicemail several times in the past week 
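
The same idea can be tried on plain strings, without pandas. This is a minimal sketch, assuming that links without a scheme (starting with `www.`) should also be removed and that leftover whitespace should be tidied; the helper name `strip_urls` and the second sample string are hypothetical and not part of the dataset.

```python
import re

# Assumption: a URL starts with http://, https://, or www. and runs
# until the next whitespace character.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Remove URL-like tokens, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub('', text)
    return ' '.join(without_urls.split())

samples = [
    "see attached error message https://t.co/NxVZjlYx1k",
    "docs at http://example.com/page and www.example.org today",
]
for sample in samples:
    print(strip_urls(sample))
```

Collapsing whitespace here is a design choice: it removes the trailing space that is otherwise left where the link used to be.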


## Removal of Non-Alphanumeric characters

In [None]:
# Replace non-alphanumeric characters with spaces across the whole column
dataset['text'] = dataset['text'].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)

In [None]:
dataset

Unnamed: 0,text
0,AppleSupport causing the reply to be disregar...
1,105835 Your business means a lot to us Pleas...
2,76328 I really hope you all change but I m su...
3,105836 LiveChat is online at the moment or...
4,VirginTrains see attached error message I ve...
...,...
88,105860 I wish Amazon had an option of where I...
89,They reschedule my shit for tomorrow
90,105861 Hey Sara sorry to hear of the issues ...
91,Tesco bit of both finding the layout cumber...
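
Replacing each non-alphanumeric character with a space splits contractions (for example `I've` becomes `I ve`) and leaves runs of spaces, as the output above shows. A minimal sketch of both effects on a single string follows; the helper names and the sample sentence are hypothetical, and collapsing the spaces is an optional extra step that the notebook itself does not perform.

```python
import re

def replace_non_alnum(text):
    """Replace every character outside a-z, A-Z, 0-9 with a single space."""
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)

def collapse_spaces(text):
    """Optional extra step: squeeze runs of whitespace and trim the ends."""
    return ' '.join(text.split())

raw = "I've tried - really!"
spaced = replace_non_alnum(raw)
print(repr(spaced))                   # runs of spaces remain
print(repr(collapse_spaces(spaced)))  # single spaces only
```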


## Converting Text to Lower Case

In [None]:
# Lower-case the whole column in one vectorized call
dataset['text'] = dataset['text'].str.lower()

In [None]:
dataset

Unnamed: 0,text
0,applesupport causing the reply to be disregar...
1,105835 your business means a lot to us pleas...
2,76328 i really hope you all change but i m su...
3,105836 livechat is online at the moment or...
4,virgintrains see attached error message i ve...
...,...
88,105860 i wish amazon had an option of where i...
89,they reschedule my shit for tomorrow
90,105861 hey sara sorry to hear of the issues ...
91,tesco bit of both finding the layout cumber...


## Lemmatization

In [None]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('wordnet')                     # WordNet data used by the lemmatizer
nltk.download('punkt')                       # tokenizer model

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:

lemmatizer = WordNetLemmatizer()

# POS stands for Part of Speech.
# Map each NLTK (Penn Treebank) tag to the WordNet POS constant the lemmatizer
# expects: adjective, verb, noun, or adverb. Any other tag maps to None.
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    # Tokenize the sentence and find the POS tag for each token
    pos_tagged = nltk.pos_tag(nltk.word_tokenize(str(sentence)))
    lemmatized = []
    for word, nltk_tag in pos_tagged:
        tag = pos_tagger(nltk_tag)
        # Tokens without a WordNet POS are kept unchanged
        lemmatized.append(word if tag is None else lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized)

# Apply to every row instead of looping over a hard-coded row count
dataset['text'] = dataset['text'].apply(lemmatize_sentence)


In [None]:
dataset

Unnamed: 0,text
0,applesupport cause the reply to be disregard a...
1,105835 your business mean a lot to us please d...
2,76328 i really hope you all change but i m sur...
3,105836 livechat be online at the moment or con...
4,virgintrains see attach error message i ve try...
...,...
88,105860 i wish amazon have an option of where i...
89,they reschedule my shit for tomorrow
90,105861 hey sara sorry to hear of the issue you...
91,tesco bit of both find the layout cumbersome a...
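
The tag mapping can be checked in isolation, without downloading any NLTK data. This is a minimal sketch, assuming the WordNet POS constants are written out as the plain strings they equal in NLTK (`'a'`, `'v'`, `'n'`, `'r'`); the name `to_wordnet_pos` and the hand-written `(word, tag)` pairs are hypothetical stand-ins for real `nltk.pos_tag` output.

```python
# WordNet POS constants as plain strings: in NLTK, wordnet.ADJ == 'a',
# wordnet.VERB == 'v', wordnet.NOUN == 'n', wordnet.ADV == 'r'.
TAG_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1])

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [('causing', 'VBG'), ('the', 'DT'), ('issues', 'NNS'), ('really', 'RB')]
for word, tag in tagged:
    print(word, tag, to_wordnet_pos(tag))
```

Only the first letter of the tag matters, so `VB`, `VBD`, and `VBG` all map to the verb category, while tags such as `DT` (determiner) map to `None` and the word is left as it is.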


## Removal of Stopwords

In [None]:
import nltk # helps download stop words
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Build the stopword set once instead of rebuilding it for every word
all_stopwords = set(stopwords.words('english'))

def blank_stopwords(text):
    # Replace each stopword with "" (this leaves extra spaces where words were removed)
    return ' '.join("" if word in all_stopwords else word for word in text.split())

dataset['text'] = dataset['text'].apply(blank_stopwords)

In [None]:
dataset

Unnamed: 0,text
0,applesupport cause reply disregard tapped...
1,105835 business mean lot us please dm name...
2,76328 really hope change sure win
3,105836 livechat online moment contact 0333...
4,virgintrains see attach error message try le...
...,...
88,105860 wish amazon option get ship ...
89,reschedule shit tomorrow
90,105861 hey sara sorry hear issue ask ...
91,tesco bit find layout cumbersome remove ...
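
The extra spaces in the output above come from replacing stopwords with empty strings before joining. A minimal sketch of an alternative that drops the tokens instead is shown below; `SAMPLE_STOPWORDS` is a hypothetical hand-picked subset used only for illustration (the notebook uses NLTK's full English list), and `drop_stopwords` is an assumed helper name.

```python
# Hypothetical, hand-picked subset; NLTK's English stopword list is much larger.
SAMPLE_STOPWORDS = frozenset({'i', 'you', 'all', 'the', 'to', 'be', 'but', 'my', 'for'})

def drop_stopwords(text, stop_set=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in stop_set, joined by single spaces."""
    return ' '.join(word for word in text.split() if word not in stop_set)

print(drop_stopwords("76328 i really hope you all change"))
```

Filtering rather than blanking keeps single spaces between the remaining words, at the cost of differing slightly from the output shown in this notebook.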


## Conclusion

The original data contained many URLs, special characters, stop words, words with mixed case, and different forms (adverbs, verbs, nouns, adjectives) of the same root word.
To eliminate this noise, URLs, non-alphanumeric characters, and stop words were removed, the text was converted to lower case, and the words were lemmatized.
The final dataset is now almost free of noise!
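
As a recap, most of the steps can be chained into one function for a single string. This is a minimal sketch using only the standard library, so lemmatization (which needs NLTK data) is left out; `preprocess` and `SAMPLE_STOPWORDS` are hypothetical names, and the stopword set is a small hand-picked subset rather than NLTK's full list.

```python
import re

# Hypothetical subset of stopwords, for illustration only.
SAMPLE_STOPWORDS = frozenset({'i', 've', 'a', 'in', 'the'})

def preprocess(text, stop_set=SAMPLE_STOPWORDS):
    """Apply URL removal, character cleanup, lower-casing, and stopword removal."""
    text = re.sub(r'https?://\S+', '', text)      # 1. remove URLs
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)     # 2. non-alphanumerics -> spaces
    text = text.lower()                           # 3. lower case
    tokens = [w for w in text.split() if w not in stop_set]  # 4. drop stopwords
    return ' '.join(tokens)

tweet = ("@VirginTrains see attached error message. I've tried leaving a "
         "voicemail several times in the past week https://t.co/NxVZjlYx1k")
print(preprocess(tweet))
```

The order matters: URLs are removed first, because stripping punctuation beforehand would break a link into separate word-like fragments that the URL pattern no longer matches.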