Text Standardization, Spelling Correction, and Tokenization
--

Standardizing Text
--
This example shows how to standardize text. First, let’s look at what text standardization is and why it is needed.

Much text data comes from customer reviews, blogs, or tweets, where people often use short words and abbreviations in place of the full forms. Expanding these to a single standard form helps downstream processing resolve the meaning of the text.

Problem
--
You want to standardize text.

Solution
--
We can write our own custom dictionary that maps short words and
abbreviations to their full forms.

In [1]:
# Create a custom lookup dictionary
# This dictionary will be for text standardization based on your data.

lookup_dict = {'nlp':'natural language processing', 'ur':'your', "wbu" : "what about you"}

import re

# Create a custom function for text standardization

def text_std(input_text):
    new_words = []
    for word in input_text.split():
        # Strip punctuation only to build the lookup key, so "wbu," still matches "wbu"
        key = re.sub(r'[^\w\s]', '', word).lower()
        # Expand known abbreviations; keep every other word unchanged
        new_words.append(lookup_dict.get(key, word))
    return " ".join(new_words)

In [2]:
# Run the text_std function

text_std("I like nlp it's ur choice")

"I like natural language processing it's your choice"

In [3]:
text_std("I like nlp it's ur choice, wbu")

"I like natural language processing it's your choice, what about you"
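
The dictionary lookup can also be expressed as a single regular-expression substitution. The following is a minimal sketch, not part of the original recipe: the names `abbreviations`, `pattern`, and `standardize` are hypothetical, and the word-boundary anchors are one possible way to leave punctuation and longer words such as "hurry" untouched.

```python
import re

# Hypothetical lookup table; extend it with the abbreviations found in your own data.
abbreviations = {
    "nlp": "natural language processing",
    "ur": "your",
    "wbu": "what about you",
}

# One alternation of all keys, anchored on word boundaries so that
# "ur" matches as a whole word but not inside "hurry".
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in abbreviations) + r")\b",
    flags=re.IGNORECASE,
)

def standardize(text):
    """Replace each whole-word abbreviation and leave everything else untouched."""
    return pattern.sub(lambda match: abbreviations[match.group(0).lower()], text)

print(standardize("I like NLP, it's ur choice. Don't hurry, wbu?"))
# I like natural language processing, it's your choice. Don't hurry, what about you?
```

Because the substitution works on the original string rather than on split words, the punctuation around each abbreviation is preserved.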

Correcting Spelling
--
This example shows how to correct spelling. First, let’s look at why spelling correction matters.
Much text data comes from customer reviews, blogs, or tweets, where people often use short words and make typos. Correcting them reduces the number of variants of a word that carry the same meaning. For example, “proccessing” and “processing” are treated as different words even though they are used in the same sense.

Note that abbreviations should be handled before this step; otherwise
the corrector will sometimes get them wrong. For example, “ur” (which means
“your”) would be corrected to “or.”

Problem
--
You want to do spelling correction.

Solution
--
The simplest way to do this is by using the TextBlob library.

In [4]:
# Let’s create a list of strings and assign it to a variable.
text=['Introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity', 'R is good langauage','I like this book','I want more books like this','angrezi medium realize']

#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                                    tweet
0                     Introduction to NLP
1   It is likely to be useful, to people 
2  Machine learning is the new electrcity
3                     R is good langauage
4                        I like this book
5             I want more books like this
6                  angrezi medium realize


In [5]:
# Install the textblob library
# !pip install textblob

# Import TextBlob and apply its correct() method to each row

from textblob import TextBlob
df['tweet'].apply(lambda x: str(TextBlob(x).correct()))



0                        Introduction to NLP
1      It is likely to be useful, to people 
2    Machine learning is the new electricity
3                         R is good language
4                           I like this book
5                I want more books like this
6                     angrezi medium realize
Name: tweet, dtype: object

In the output, “electrcity” and “langauage” are corrected to “electricity” and
“language.” The last row, “angrezi medium realize,” comes back unchanged.
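
To give a feel for how this kind of correction can work, here is a toy sketch in the style of Peter Norvig's well-known spelling corrector, which TextBlob's documentation cites as the basis for `correct()`. It is not TextBlob's actual code: the word counts below are made up for illustration, and only candidates one edit away are considered. It also illustrates the earlier note on abbreviations, since “ur” is one edit away from “or” but two edits away from “your.”

```python
from collections import Counter
import string

# Toy word frequencies, invented for illustration; a real corrector
# learns these counts from a large text corpus.
WORD_COUNTS = Counter({
    "electricity": 50, "language": 120, "or": 900, "our": 400, "your": 700,
})

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print(correct("electrcity"))  # electricity
print(correct("langauage"))   # language
print(correct("ur"))          # or
print(correct("angrezi"))     # angrezi (no known word within one edit)
```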

In [6]:
# You can also use the autocorrect library, as shown below

# Install autocorrect
# !pip install autocorrect

# autocorrect.spell is deprecated; create a Speller instance instead
from autocorrect import Speller
spell = Speller(lang='en')
print(spell('mussage'))
print(spell('sirvice'))

message
service


Tokenizing Text
--
In this example, we will look at ways to tokenize text.

Tokenization means splitting text into minimal meaningful units. A sentence tokenizer splits text into sentences, and a word tokenizer splits it into words. Here we use word tokenizers; word tokenization is a standard step in text preprocessing for most kinds of analysis.

Many libraries perform tokenization, including NLTK, spaCy, and TextBlob. Here are a few ways to do it.
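
Before turning to the libraries, the difference between sentence and word tokenization can be sketched with the standard library alone. This is a naive illustration under simple assumptions: the sample `text` is made up, and splitting after `.`, `!`, or `?` will break on abbreviations such as "Dr.", which is one reason to prefer a library tokenizer in practice.

```python
import re

# Made-up sample text with three sentences.
text = "Machine learning is the new electricity. R is a good language! I like this book."

# Naive sentence tokenizer: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word tokenizer: runs of word characters within each sentence
tokens = [re.findall(r"\w+", sentence) for sentence in sentences]

print(sentences)
# ['Machine learning is the new electricity.', 'R is a good language!', 'I like this book.']
print(tokens[1])
# ['R', 'is', 'a', 'good', 'language']
```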

Problem
--
You want to do tokenization.

Solution
--
The simplest way to do this is by using the TextBlob library.

In [2]:
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage', 'I like this book','I want more books like this']

#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people 
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this


In [3]:
# Tokenization Using textblob
from textblob import TextBlob
TextBlob(df['tweet'][3]).words

WordList(['There', 'would', 'be', 'less', 'hype', 'around', 'AI', 'and', 'more', 'action', 'going', 'forward'])

In [None]:
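# Tokenization using NLTK

import nltk
# word_tokenize needs the Punkt tokenizer data; download it once if it is missing
# ('punkt', or 'punkt_tab' on recent NLTK releases)
# nltk.download('punkt')

# Create data
mystring = df['tweet'][3]
nltk.word_tokenize(mystring)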

In [None]:
# Tokenization using Python's built-in str.split (splits on whitespace only)
mystring.split()
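
One practical difference between these approaches is how punctuation is handled. The sketch below uses only the standard library and a made-up `sentence`; the two regular expressions are simple approximations chosen for illustration, not the rules that TextBlob or NLTK actually apply.

```python
import re

# Made-up sample containing punctuation.
sentence = "python is the best tool! It is likely to be useful, to people."

# Whitespace split keeps punctuation glued to the neighboring word.
by_split = sentence.split()

# \w+ keeps runs of word characters and drops punctuation entirely.
words_only = re.findall(r"\w+", sentence)

# \w+|[^\w\s] additionally keeps each punctuation mark as its own token.
with_punct = re.findall(r"\w+|[^\w\s]", sentence)

print(by_split[4], words_only[4], with_punct[4:6])
# tool! tool ['tool', '!']
```

If punctuation carries meaning for your task (for example, "!" in sentiment analysis), keeping it as a separate token may be preferable to dropping it.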