# Data Cleaning
Raw text gathered from different sources is rarely clean. It usually contains noise, such as inconsistent casing, stray symbols, and misspellings, that can skew results, so cleaning is typically the first step in a text pipeline. A few standard preprocessing techniques go a long way, and cleaner input can also help keep a model from fitting to noise.
In this article, we will work through text processing and exploratory data analysis, starting with the cleaning steps below: lowercasing, punctuation removal, stop word removal, and spelling correction.
I convert the raw text into a pandas DataFrame and apply each cleaning technique in turn.


In [2]:
import pandas as pd

# Sample sentences with mixed casing, stray symbols, and typos
text = [
    'This is the NLP TASKS ARTICLE written by ABhishek Jaiswal** ',
    'IN this article I”ll be explaining various DATA-CLEANING techniques',
    'So stay tuned for FURther More &&',
    'Nah I don"t think he goes to usf, he lives around',
]
df = pd.DataFrame({'text': text})

In [3]:
df

|   | text |
|---|------|
| 0 | This is the NLP TASKS ARTICLE written by ABhis... |
| 1 | IN this article I”ll be explaining various DAT... |
| 2 | So stay tuned for FURther More && |
| 3 | Nah I don"t think he goes to usf, he lives around |


In [4]:
# Lowercase every word (splitting and re-joining also collapses extra whitespace)

df['lower'] = df['text'].apply(lambda x: " ".join(word.lower() for word in x.split()))

In [5]:
df

|   | text | lower |
|---|------|-------|
| 0 | This is the NLP TASKS ARTICLE written by ABhis... | this is the nlp tasks article written by abhis... |
| 1 | IN this article I”ll be explaining various DAT... | in this article i”ll be explaining various dat... |
| 2 | So stay tuned for FURther More && | so stay tuned for further more && |
| 3 | Nah I don"t think he goes to usf, he lives around | nah i don"t think he goes to usf, he lives around |


In [6]:
# Punctuation removal
# Removing punctuation such as * & % # @ ( ) is a common step, because for many tasks punctuation adds little useful information

In [7]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [8]:
# Keep only the characters that are not ASCII punctuation
df['text'].apply(lambda x: "".join(ch for ch in x if ch not in string.punctuation))

0    This is the NLP TASKS ARTICLE written by ABhis...
1    IN this article I”ll be explaining various DAT...
2                      So stay tuned for FURther More 
3      Nah I dont think he goes to usf he lives around
Name: text, dtype: object
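
Note that the curly quote in `I”ll` survives in row 1, because `string.punctuation` only covers ASCII characters. One possible way to handle this, sketched below, is to build a translation table once with `str.maketrans` and reuse it. The names `EXTRA_PUNCT` and `strip_punct` are illustrative and not part of the original notebook, and the list of extra marks is an assumption you would adapt to your own data.

```python
import string

# Assumption: these typographic marks are the extra characters to drop
# (curly double quotes, curly single quotes, ellipsis); extend as needed.
EXTRA_PUNCT = "\u201c\u201d\u2018\u2019\u2026"

# With three arguments, maketrans maps every character in the third to None.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation + EXTRA_PUNCT)


def strip_punct(text: str) -> str:
    """Remove ASCII punctuation and the extra typographic marks above."""
    return text.translate(_PUNCT_TABLE)


print(strip_punct("IN this article I\u201dll be explaining various DATA-CLEANING techniques"))
# prints: IN this article Ill be explaining various DATACLEANING techniques
```

With the data frame above, this could be applied as `df['text'].apply(strip_punct)`.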

In [9]:
# We can also use a regex for the same task: keep only ASCII letters and spaces
# (note that this also drops digits and non-ASCII characters such as ”)
import re
df['text'].apply(lambda x: ''.join(re.findall(r'[a-zA-Z ]', x)))

0    This is the NLP TASKS ARTICLE written by ABhis...
1    IN this article Ill be explaining various DATA...
2                      So stay tuned for FURther More 
3      Nah I dont think he goes to usf he lives around
Name: text, dtype: object
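
A pattern that keeps only ASCII letters and spaces also discards digits and non-English letters, which may not be what you want. As a hedged alternative, the sketch below removes only characters that are neither word characters nor whitespace, then collapses the gaps left behind. The function name `strip_symbols` is hypothetical.

```python
import re

# [^\w\s] matches anything that is not a letter, digit, underscore, or whitespace.
_SYMBOLS = re.compile(r"[^\w\s]")
_SPACES = re.compile(r"\s+")


def strip_symbols(text: str) -> str:
    """Drop symbol characters but keep letters (any script), digits, and underscores."""
    no_symbols = _SYMBOLS.sub("", text)
    # Collapse runs of whitespace and trim both ends.
    return _SPACES.sub(" ", no_symbols).strip()


print(strip_symbols("So stay tuned for FURther More &&"))
# prints: So stay tuned for FURther More
```

Compiling the patterns once avoids recompiling them for every row when the function is used with `apply`.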

# Stop Word Removal

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\BiswajitRajguruMohap\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords

# Use a set so each membership test is fast
all_stopwords = set(stopwords.words('english'))
df['lower'].apply(lambda x: " ".join(word for word in x.split() if word not in all_stopwords))

0         nlp tasks article written abhishek jaiswal**
1    article i”ll explaining various data-cleaning ...
2                                        stay tuned &&
3               nah don"t think goes usf, lives around
Name: lower, dtype: object
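
In the output above, tokens with attached punctuation such as `usf,` are compared as-is, so a stop word followed by a comma would slip through. NLTK's English list also contains negations such as "not", which some tasks (sentiment analysis, for example) need to keep. The sketch below shows one way to address both, under stated assumptions: `DEMO_STOPWORDS` is a tiny stand-in list so the example runs without NLTK, and `remove_stopwords` with its `keep` parameter is a hypothetical helper.

```python
import string

# Assumption: a tiny stand-in list so the sketch runs without NLTK.
# In the notebook you would pass set(stopwords.words('english')) instead.
DEMO_STOPWORDS = {"i", "he", "to", "the", "is", "this", "not", "so", "for"}


def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words, except those listed in keep."""
    kept_tokens = []
    for token in text.split():
        # Compare without leading/trailing punctuation, so "he," matches "he".
        bare = token.strip(string.punctuation)
        if bare in stop_words and bare not in keep:
            continue
        kept_tokens.append(token)
    return " ".join(kept_tokens)


sentence = "this is not the end, so he left"
print(remove_stopwords(sentence, DEMO_STOPWORDS))                # prints: end, left
print(remove_stopwords(sentence, DEMO_STOPWORDS, keep={"not"}))  # prints: not end, left
```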

In [12]:
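# Spelling correction
# Requires: pip install textblob
# Caution: the corrector can also change valid tokens (below, "nlp" becomes "nap" and "usf" becomes "us")
from textblob import TextBlob
df['lower'].apply(lambda x: str(TextBlob(x).correct()))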

0    this is the nap tasks article written by abhis...
1    in this article i”ll be explaining various dat...
2                    so stay tuned for further more &&
3      ah i don"t think he goes to us, he lives around
Name: lower, dtype: object
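
In the output above, the corrector turned `nlp` into `nap`, `nah` into `ah`, and `usf` into `us`, so automatic correction can damage acronyms and slang. One hedged way to limit this is to correct token by token and skip a list of protected terms. In the sketch below, `correct_word` is any callable that corrects a single word. `TOY_FIXES` is a made-up stand-in so the example runs without TextBlob, and `PROTECTED` is an assumed list you would maintain yourself.

```python
import string

# Assumption: terms that must never be "corrected"; maintain this per project.
PROTECTED = {"nlp", "usf", "nah"}

# Made-up stand-in for a real spelling corrector, for demonstration only.
TOY_FIXES = {"nlp": "nap", "usf": "us", "artcle": "article"}


def toy_correct_word(word: str) -> str:
    """Return the toy correction for a word, or the word unchanged."""
    return TOY_FIXES.get(word, word)


def correct_text(text, correct_word, protected=frozenset()):
    """Correct each whitespace-separated token unless its core is protected."""
    corrected = []
    for token in text.split():
        core = token.strip(string.punctuation)
        if not core or core.lower() in protected:
            # Leave protected terms and symbol-only tokens untouched.
            corrected.append(token)
        else:
            # Correct only the core so surrounding punctuation is preserved.
            corrected.append(token.replace(core, correct_word(core), 1))
    return " ".join(corrected)


print(correct_text("this nlp artcle mentions usf, again", toy_correct_word, PROTECTED))
# prints: this nlp article mentions usf, again
```

With TextBlob installed, the toy corrector could be swapped for `lambda w: str(TextBlob(w).correct())`.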