
# Data Science with Python: Text Preprocessing in NLP

Natural Language Processing (NLP) is a subfield of data science concerned with the interactions between computers and human (natural) languages. Companies today hold large amounts of data, much of it text, and this text carries information that can be analyzed and used for many purposes. To understand a text and its intent, we need to consider the topic, the order of the words, the tone, and many other factors. Data like this, which is messy and hard to manipulate, is called unstructured data.
Unstructured data needs to be preprocessed before it can be used in machine learning models, which makes preprocessing essentially the first step in an NLP project. Some of the preprocessing steps are:
- Lowercasing
- Removing punctuation
- Removing words that contain numbers
- Tokenization
- Spelling correction
- Removing stop words
- Lemmatization
- Stemming
- Removing words with only one letter
- Regex
- Joining the remaining words


In [None]:
import pandas as pd

In [None]:
# Reading the dataset
df = pd.read_csv("/Womens Clothing E-Commerce Reviews.csv")
# Dropping unnecessary columns
df = df.drop(['Title', 'Positive Feedback Count', 'Sr. No.'], axis=1)
# Dropping rows containing missing values
df.dropna(inplace = True)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses


Let’s look at each one of these in detail:


#### 1.	Lowercasing of text:

Lowercasing is one of the most common preprocessing steps and gives the data a level of uniformity. In some situations, though, it can lose information: for example, a word written entirely in uppercase may signal intense emotion.


In [None]:
# Lowercase every review with the vectorized string method
df['Review'] = df['Review Text'].str.lower()
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,absolutely wonderful - silky and comfortable
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,love this dress! it's sooo pretty. i happene...
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,i had such high hopes for this dress and reall...
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,"i love, love, love this jumpsuit. it's fun, fl..."
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,this shirt is very flattering to all due to th...
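If all-caps words matter for the task (for example, in sentiment analysis), one option is to record that signal before lowercasing. The sketch below shows one possible way to do this; the helper name `count_shouted_words` and the sample review are made up for illustration and are not part of the original notebook.

```python
def count_shouted_words(text):
    """Count words written entirely in uppercase, ignoring one-letter words like 'I'."""
    # Strip common punctuation from the ends of each word before checking its case
    words = [w.strip(".,!?;:") for w in text.split()]
    return sum(1 for w in words if len(w) > 1 and w.isupper())

sample = "I LOVE this dress, it is SO pretty!"  # hypothetical review
shouted = count_shouted_words(sample)           # keep this count as an extra feature
lowered = sample.lower()                        # then lowercase as usual

print(shouted)  # 2
print(lowered)  # i love this dress, it is so pretty!
```

On the DataFrame, the same idea could be applied with `df['Review Text'].apply(count_shouted_words)` before the lowercasing step.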


#### 2.	Removing punctuation:

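This step removes punctuation from the data. Note that the function below uses `str.strip`, which removes punctuation only from the start and end of each word, so an apostrophe inside a word such as "it's" is kept.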


In [None]:
import string

In [None]:
def remove_punctuation(text):
    text = " ".join([word.strip(string.punctuation) for word in text.split(" ")])
    return text
df['Review'] = df['Review'].apply(remove_punctuation)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,absolutely wonderful silky and comfortable
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,love this dress it's sooo pretty i happened ...
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,i had such high hopes for this dress and reall...
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,i love love love this jumpsuit it's fun flirty...
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,this shirt is very flattering to all due to th...
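Because `str.strip` only touches the start and end of each word, punctuation inside a word survives (for example the apostrophe in "it's"). If every punctuation character should be removed, one alternative is `str.translate` with a table built by `str.maketrans`. This is a minimal sketch, not the notebook's method; the name `remove_all_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_all_punctuation(text):
    # Delete punctuation anywhere in the text, then collapse leftover whitespace
    return " ".join(text.translate(PUNCT_TABLE).split())

print(remove_all_punctuation("love this dress! it's sooo pretty."))
# love this dress its sooo pretty
```

Whether "it's" should become "its" or stay intact depends on the later steps; NLTK's stopword list, for instance, contains "it's" with the apostrophe.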


#### 3.	Remove words that contain numbers:

Words that contain digits are, more often than not, typos. This step removes them.


In [None]:
def remove_numbers_in_word(text):
  # Split into words (not characters) and keep only the words that contain no digits
  text = " ".join([word for word in text.split() if not any(c.isdigit() for c in word)])
  return text
df['Review'] = df['Review'].apply(remove_numbers_in_word)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,absolutely wonderful silky and comfortable
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,love this dress it's sooo pretty i happened ...
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,i had such high hopes for this dress and reall...
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,i love love love this jumpsuit it's fun flirty...
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,this shirt is very flattering to all due to th...


#### 4.	Tokenization:

In this step we cut the text into pieces called tokens. Punctuation can also be removed while tokenizing. Tokenization is easy for a language like English, where words are separated by spaces, but this does not hold for every language. It can also be more complex for biomedical text, which contains many words with hyphens, parentheses, and so on.


In [None]:
def tokenization(text):
  text = text.split()
  return text
df['Review'] = df['Review'].apply(tokenization)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, and, comfortable]"
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,"[love, this, dress, it's, sooo, pretty, i, hap..."
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,"[i, had, such, high, hopes, for, this, dress, ..."
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,"[i, love, love, love, this, jumpsuit, it's, fu..."
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,"[this, shirt, is, very, flattering, to, all, d..."
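For text with hyphens and parentheses, whitespace splitting leaves brackets attached to words. A regular-expression tokenizer is one way to handle this. The sketch below is only an illustration: the pattern and the sample sentence are assumptions, and real biomedical text usually needs a more careful tokenizer.

```python
import re

# A token is a run of letters/digits, optionally joined by internal hyphens or apostrophes
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

sample = "the anti-inflammatory drug (ibuprofen) wasn't effective"  # hypothetical text

print(sample.split())
# ['the', 'anti-inflammatory', 'drug', '(ibuprofen)', "wasn't", 'effective']
print(regex_tokenize(sample))
# ['the', 'anti-inflammatory', 'drug', 'ibuprofen', "wasn't", 'effective']
```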


#### 5.	Spelling correction:

Although NLTK can be used here too, I prefer the pyspellchecker library for spell checking in Python. The dataset used here does not need a spell check, but you can find the pyspellchecker project page [here](https://pypi.org/project/pyspellchecker/). I would recommend this step only if the dataset contains a considerable number of spelling mistakes, because it increases the runtime significantly: each word is checked one by one.
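To illustrate the idea behind dictionary-based spelling correction without depending on a particular library, here is a small sketch that uses `difflib.get_close_matches` from the Python standard library. It is not pyspellchecker and is far less capable; the vocabulary, the `correct_word` helper, and the `cutoff` value are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real project would use a full word list
VOCABULARY = ["dress", "pretty", "comfortable", "fabric", "flattering"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    # Known words are returned unchanged
    if word in vocabulary:
        return word
    # Otherwise take the closest vocabulary entry, if any is similar enough
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

tokens = ["comfertable", "dres", "xyz"]
print([correct_word(t) for t in tokens])
# ['comfortable', 'dress', 'xyz']
```

Every unknown word is compared against the whole vocabulary, which shows why this step slows down quickly on large datasets.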


#### 6.	Remove stopwords:

Stopwords are very common words that add little or no value to the text analysis, so removing them cleans the data significantly.
NLTK already ships a list of English stopwords, such as: i, me, the, a, an, in, are, some. In some cases, though, the ready-made NLTK list is not suitable because the analysis needs some of those words; in that situation we can build our own list of stopwords and use it instead.


In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Build the stopword set once; set lookups are faster than scanning a list for every token
stop = set(stopwords.words('english'))

def remove_stopwords(text):
  text = [x for x in text if x not in stop]
  return text
df['Review'] = df['Review'].apply(remove_stopwords)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, comfortable]"
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,"[love, dress, sooo, pretty, happened, find, st..."
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,"[high, hopes, dress, really, wanted, work, ini..."
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,"[love, love, love, jumpsuit, fun, flirty, fabu..."
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,"[shirt, flattering, due, adjustable, front, ti..."
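A custom stopword list can be built by adjusting a ready-made one. The sketch below keeps negations, which are in NLTK's English list but often matter for sentiment, and adds a domain word. To stay runnable without NLTK it uses a small made-up `BASE_STOPWORDS` set; in the notebook this could be `set(stopwords.words('english'))` instead. The `KEEP` and `EXTRA` choices are assumptions for illustration.

```python
# Small stand-in for a ready-made stopword list (hypothetical subset)
BASE_STOPWORDS = {"i", "this", "is", "not", "a", "the", "and", "no"}

# Words to keep because they change the meaning of a review
KEEP = {"not", "no"}
# Domain words assumed to carry little information in clothing reviews
EXTRA = {"dress"}

CUSTOM_STOPWORDS = (BASE_STOPWORDS - KEEP) | EXTRA

def remove_custom_stopwords(tokens):
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]

print(remove_custom_stopwords(["this", "dress", "is", "not", "flattering"]))
# ['not', 'flattering']
```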


#### 7.	Lemmatization:

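Lemmatization is the process of reducing a word to its dictionary form (lemma), which helps standardize the text. It relies on detailed dictionaries in which the algorithm can look up words and link them to their corresponding lemmas.
Example: The words ‘running’, ‘ran’, ‘runs’ all reduce to ‘run’ when they are lemmatized as verbs. Note that NLTK's `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed (for example `pos='v'`), which is why the output below turns ‘hopes’ into ‘hope’ but leaves ‘wanted’ unchanged.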


In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
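# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
  # With no pos argument, each word is lemmatized as a noun
  text = [lemmatizer.lemmatize(word) for word in text]
  return text
df['Review'] = df['Review'].apply(lemmatize)
df.head()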

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, comfortable]"
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,"[love, dress, sooo, pretty, happened, find, st..."
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,"[high, hope, dress, really, wanted, work, init..."
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,"[love, love, love, jumpsuit, fun, flirty, fabu..."
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,"[shirt, flattering, due, adjustable, front, ti..."
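To make the dictionary lookup behind lemmatization concrete, here is a toy sketch. The hand-written `LEMMA_TABLE` is hypothetical and is not how WordNet stores its data; real lemmatizers rely on large lexical databases and usually on the part of speech as well.

```python
# Toy lemma dictionary (hypothetical); real lemmatizers use large lexical databases
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "hopes": "hope",
    "better": "good",
}

def toy_lemmatize(tokens):
    # Look each token up; words missing from the table are left unchanged
    return [LEMMA_TABLE.get(t, t) for t in tokens]

print(toy_lemmatize(["she", "ran", "and", "runs", "daily"]))
# ['she', 'run', 'and', 'run', 'daily']
```

Unlike a stemmer, a lookup like this can map irregular forms such as ‘ran’ or ‘better’ to a real word, but only for entries the dictionary contains.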


#### 8.	Stemming:

This is another standardization step, in which words are ‘stemmed’ to a base form by stripping suffixes. The disadvantage of stemming is that the result may not be a real word and may lose its meaning.
Example: ‘programming’ and ‘programs’ are reduced to the base word ‘program’, while ‘crazy’ is stemmed to ‘crazi’.
For this reason, lemmatization is often preferred over stemming.


In [None]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

In [None]:
def stemming(text):
  # Stem each word (not each character) and join the stems with spaces
  stem_text = " ".join([porter_stemmer.stem(word) for word in text.split()])
  return stem_text
# Stems are stored in a separate column for demonstration only
df['Stemming'] = df['Review Text'].apply(stemming)
df.head()


In [None]:
df = df.drop('Stemming', axis=1)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, comfortable]"
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,"[love, dress, sooo, pretty, happened, find, st..."
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,"[high, hope, dress, really, wanted, work, init..."
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,"[love, love, love, jumpsuit, fun, flirty, fabu..."
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,"[shirt, flattering, due, adjustable, front, ti..."


#### 9.	Remove words with only one letter:

Words with only one letter rarely add value to the text. Some datasets do need these words, so this step can be skipped when appropriate.


In [None]:
def remove_one_letter_words(text):
  text = [t for t in text if len(t) > 1]
  return text
df['Review'] = df['Review'].apply(remove_one_letter_words)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,"[absolutely, wonderful, silky, comfortable]"
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,"[love, dress, sooo, pretty, happened, find, st..."
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,"[high, hope, dress, really, wanted, work, init..."
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,"[love, love, love, jumpsuit, fun, flirty, fabu..."
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,"[shirt, flattering, due, adjustable, front, ti..."


#### 10.	Regex:

Regex, short for regular expression, is used to find text that follows a certain pattern. For example, to find all the email addresses in the data, we would write a regular expression that matches character sequences of the form text@text.text. This is very useful in data cleaning when a certain kind of data needs to be removed from the dataset.
We do not need regex for this dataset, but the official documentation for Python's `re` module is [here](https://docs.python.org/3/library/re.html).
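As an illustration of the email example, the sketch below finds and removes addresses of the form text@text.text with Python's `re` module. The pattern is deliberately simplified and is not a full email validator; the sample text and addresses are made up.

```python
import re

# Simplified pattern for text@text.text (an assumption, not a full email validator)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

sample = "contact me at jane.doe@example.com or support@shop.example.org for returns"

# Find every address, then remove them and tidy up the leftover whitespace
emails = EMAIL_PATTERN.findall(sample)
cleaned = " ".join(EMAIL_PATTERN.sub(" ", sample).split())

print(emails)   # ['jane.doe@example.com', 'support@shop.example.org']
print(cleaned)  # contact me at or for returns
```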

#### 11.	Joining all the words:

This step joins each list of words back into a single string. The result is the final, clean data that can be used directly in machine learning models and analysis!


In [None]:
def join_text(text):
  text = " ".join(text)
  return text
df['Review'] = df['Review'].apply(join_text)
df.head()

Unnamed: 0,Clothing ID,Age,Review Text,Rating,Recommended IND,Division Name,Department Name,Class Name,Review
0,767,33,Absolutely wonderful - silky and comfortable,4,1,Initmates,Intimate,Intimates,absolutely wonderful silky comfortable
1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,General,Dresses,Dresses,love dress sooo pretty happened find store i'm...
2,1077,60,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses,high hope dress really wanted work initially o...
3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,General Petite,Bottoms,Pants,love love love jumpsuit fun flirty fabulous ev...
4,847,47,This shirt is very flattering to all due to th...,5,1,General,Tops,Blouses,shirt flattering due adjustable front tie perf...


Beyond these preprocessing steps, there are many more, such as rare word removal, frequent word removal, and rephrasing text. Choose the steps based on the dataset you are working with and what you want to do with the data.
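Rare and frequent word removal can be sketched with `collections.Counter`. The tokenized reviews and both thresholds below are assumptions chosen for illustration; suitable values depend on the dataset.

```python
from collections import Counter

# Hypothetical tokenized reviews
reviews = [
    ["love", "dress", "soft", "fabric"],
    ["love", "fabric", "runs", "small"],
    ["love", "dress", "color"],
]

# Count how often each token appears across all reviews
counts = Counter(token for review in reviews for token in review)

MIN_COUNT = 2  # words seen fewer times than this are treated as rare (assumed threshold)
TOP_N = 1      # number of most frequent words to drop (assumed threshold)

frequent = {word for word, _ in counts.most_common(TOP_N)}
rare = {word for word, count in counts.items() if count < MIN_COUNT}
to_remove = frequent | rare

filtered = [[t for t in review if t not in to_remove] for review in reviews]
print(filtered)
# [['dress', 'fabric'], ['fabric'], ['dress']]
```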
I hope this walked you through the concepts of text preprocessing in NLP. 
