# Cleaning Text Data Using pandas and NLTK

#### Objective
Learn how to clean a dataset that contains a free-text field.


#### Procedure
1. Understand the problem statement and data
2. Load the packages and data
3. Clean the text data
4. Summary
5. What to do next?
 

#### Understand Problem Statement and Data
The first step is to understand the problem we want to solve and to look at each column in the dataset.

#### Problem Statement
Using this dataset, we are going to learn how we can clean text data.



#### The Dataset Information:

Plot summary descriptions scraped from Wikipedia

Content
The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

Release Year - Year in which the movie was released
Title - Movie title
Origin/Ethnicity - Origin of movie (e.g. American, Bollywood, Tamil)
Director - Director(s)
Cast - Main actors and actresses
Genre - Movie Genre(s)
Wiki Page - URL of the Wikipedia page from which the plot description was scraped
Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)



# Importing libraries & Loading the dataset

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
warnings.filterwarnings("ignore")

import nltk

In [25]:
%%time
df=pd.read_csv('wiki_movie_plots_deduped.csv')

Wall time: 898 ms


## Looking into the data, especially the Plot column, which we are going to clean

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34886 entries, 0 to 34885
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Release Year      34886 non-null  int64 
 1   Title             34886 non-null  object
 2   Origin/Ethnicity  34886 non-null  object
 3   Director          34886 non-null  object
 4   Cast              33464 non-null  object
 5   Genre             34886 non-null  object
 6   Wiki Page         34886 non-null  object
 7   Plot              34886 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB


In [27]:
df['Plot']

0        A bartender is working at a saloon, serving dr...
1        The moon, painted with a smiling face hangs ov...
2        The film, just over a minute long, is composed...
3        Lasting just 61 seconds and consisting of two ...
4        The earliest known adaptation of the classic f...
                               ...                        
34881    The film begins in 1919, just after World War ...
34882    Two musicians, Salih and Gürkan, described the...
34883    Zafer, a sailor living with his mother Döndü i...
34884    The film centres around a young woman named Am...
34885    The writer Orhan Şahin returns to İstanbul aft...
Name: Plot, Length: 34886, dtype: object

# Cleaning the text field

### Replacing '\r' and '\n' with spaces using key-value pairs

In [28]:
replace_dict={
    "\r": " ",
    "\n":" ",
}

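# Start from the original column and apply each replacement to the running
# result; reading from df['Plot'] on every pass would keep only the last one.
df['Plot_1'] = df['Plot']
for key, value in replace_dict.items():
    df['Plot_1'] = df['Plot_1'].str.replace(key, value, regex=False)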
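
As an alternative to looping over a dictionary, both characters can be handled in one pass with a regular expression. The sketch below is only an illustration on a single string: the helper name `normalize_newlines` and the sample text are assumptions, not part of the notebook. Unlike the loop above, it collapses a run such as `\r\n` into one space instead of two.

```python
import re

def normalize_newlines(text):
    # Replace each run of carriage returns / line feeds with a single space.
    return re.sub(r"[\r\n]+", " ", text)

sample = "First line.\r\nSecond line.\nThird line."
cleaned = normalize_newlines(sample)
print(cleaned)  # First line. Second line. Third line.
```

On the DataFrame, the same pattern could be applied with `df['Plot'].str.replace(r'[\r\n]+', ' ', regex=True)`.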

### Removing numbers from the text field

In [29]:
%%time
df['Plot_1']=df['Plot_1'].apply(lambda x: re.sub(r'[0-9]', ' ',x))

Wall time: 498 ms
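
To see what this substitution does to a single value, here is a small sketch on one sample string. The helper name `remove_digits` is an assumption for illustration. Each digit becomes a space, so a number leaves a gap of several spaces behind; tokenization later splits on whitespace, so these gaps should be harmless.

```python
import re

# Compile the pattern once so it can be reused for every row.
DIGITS = re.compile(r"[0-9]")

def remove_digits(text):
    # Replace every digit with a single space.
    return DIGITS.sub(" ", text)

sample = "Lasting just 61 seconds"
result = remove_digits(sample)
print(repr(result))  # 'Lasting just    seconds'
```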


### List of punctuation characters
We use string.punctuation to get the list of punctuation characters and remove each of them from every row of the text data.

In [30]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### Removing the punctuation

In [31]:
%%time
def remove_punctuation(text):
    # Iterate character by character: drop punctuation and lowercase the rest.
    return ''.join([char.lower() for char in text if char not in string.punctuation])

df['plot_without_punct']=df['Plot_1'].apply(remove_punctuation)

Wall time: 11.5 s
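
The character-by-character comprehension works, but it is the slowest step so far. A possible alternative is `str.translate`, which deletes all punctuation characters in a single pass. The sketch below is a minimal illustration; the names `PUNCT_TABLE` and `remove_punctuation_fast` are assumptions, and no timing is claimed for this dataset.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    # Lowercase first, then drop punctuation in a single pass.
    return text.lower().translate(PUNCT_TABLE)

sample = "The moon, painted with a smiling face, hangs over a park!"
result = remove_punctuation_fast(sample)
print(result)  # the moon painted with a smiling face hangs over a park
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes are kept by both approaches.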


### Tokenizing the words
Tokenization is the process of breaking down a phrase, sentence, paragraph, or even an entire text document into smaller components such as individual words or phrases. Each of these smaller units is called a token. Words, numerals, and punctuation marks can all serve as tokens.

#### What is the need for tokenization in NLP?
I'd like you to consider the English language in this situation. Pick any sentence that comes to you and keep it in mind as you read this section. This will make it much easier for you to grasp the significance of tokenization.

Before we can process natural language, we must first identify the words that make up a string of characters. Tokenization is therefore one of the most fundamental steps in processing text data, because much of a text's meaning can be recovered by examining the words it contains.

In [32]:
%%time
def tokenization(text):
    return nltk.tokenize.word_tokenize(text)

df['plot_tokenized']=df['plot_without_punct'].apply(lambda x: tokenization(x))

Wall time: 36.7 s
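
To make the idea concrete without any downloaded resources, here is a dependency-free sketch of word-level tokenization. It is not NLTK's algorithm: `simple_tokenize` is a hypothetical stand-in that simply extracts runs of word characters, which is roughly what is left to do once the text has been lowercased and stripped of punctuation.

```python
import re

def simple_tokenize(text):
    # \w+ matches runs of letters, digits, and underscores (Unicode-aware in Python 3).
    return re.findall(r"\w+", text)

sample = "two musicians salih and gürkan described the adventure"
tokens = simple_tokenize(sample)
print(tokens)
# ['two', 'musicians', 'salih', 'and', 'gürkan', 'described', 'the', 'adventure']
```

NLTK's `word_tokenize` handles more cases, such as contractions, but needs tokenizer data to be downloaded first, for example with `nltk.download('punkt')`; the exact resource name depends on the NLTK version.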


### Removing the Stopwords
A stop word is any term on a stop list that is filtered out before or after processing natural-language data. There is no standardised list of stop words used by all natural language processing tools, nor are there any agreed-upon methods for detecting stop words, and not all programs even use one.
We use the NLTK library to remove stop words from the text field.

In [33]:
%%time 
# A set gives constant-time membership checks; a list is scanned for every token.
stopwords = set(nltk.corpus.stopwords.words('english'))
def stopwards(words):
    return [word for word in words if word not in stopwords]

df['plot_tokenized_removed_sw']=df['plot_tokenized'].apply(stopwards)

Wall time: 18.4 s
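
The filter is a plain membership test, so it is case-sensitive. The sketch below uses a tiny hypothetical stop list (`STOP_WORDS` is an assumption for illustration, not NLTK's list) to show why lowercasing the text in the punctuation step matters before this step runs.

```python
# Hypothetical mini stop list for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "is", "at", "the"}

def remove_stopwords(tokens):
    # Keep only the tokens that are not in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

lowercased = ["the", "bartender", "is", "working", "at", "a", "saloon"]
mixed_case = ["The", "bartender", "is", "working", "at", "a", "saloon"]

print(remove_stopwords(lowercased))  # ['bartender', 'working', 'saloon']
print(remove_stopwords(mixed_case))  # ['The', 'bartender', 'working', 'saloon']
```

The capitalised "The" survives in the second call because only the lowercase form is in the stop list.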


### Lemmatization
Lemmatization uses a vocabulary and morphological analysis of words to remove only inflectional endings and return the base or dictionary form of a word, known as the lemma.

What is the difference between lemmatization and stemming?

To get to the stem of a word, stemming simply slices off a portion of the end of the word. Different methods are used to decide how many letters to remove, but the algorithms do not understand the meaning of the word in the language it belongs to. Lemmatization algorithms, on the other hand, have this knowledge. In effect, they consult a dictionary to work out what a word means before reducing it to its root word, or lemma.
As a result, a lemmatization algorithm would recognise that the lemma of "better" is "good".
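
The contrast can be sketched with two toy functions. Everything here is a hypothetical illustration: `crude_stem` strips a few hard-coded suffixes and `toy_lemmatize` looks words up in a three-entry dictionary, whereas real stemmers and lemmatizers use far richer rules and a full lexical database such as WordNet.

```python
# Hypothetical toy data: a real lemmatizer consults a full lexical database.
LEMMA_LOOKUP = {"better": "good", "mice": "mouse", "drinks": "drink"}
SUFFIXES = ("ing", "es", "s")

def crude_stem(word):
    # Strip the first matching suffix without any knowledge of the word's meaning.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up in the dictionary; fall back to the word itself.
    return LEMMA_LOOKUP.get(word, word)

for word in ["drinks", "better", "mice"]:
    print(word, crude_stem(word), toy_lemmatize(word))
# drinks drink drink
# better better good
# mice mice mouse
```

Suffix stripping handles the regular plural but cannot relate "better" to "good" or "mice" to "mouse"; the dictionary lookup can.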

In [34]:
%%time 
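lemma= nltk.WordNetLemmatizer()
def lemmatization(words):
    # lemmatize() treats each token as a noun by default (pos='n'),
    # so many verb forms (for example 'working') are left unchanged.
    return ' '.join([lemma.lemmatize(word) for word in words])

df['plot_lemma']=df['plot_tokenized_removed_sw'].apply(lemmatization)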

Wall time: 26.4 s


In [35]:
df

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,Plot_1,plot_without_punct,plot_tokenized,plot_tokenized_removed_sw,plot_lemma
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...","A bartender is working at a saloon, serving dr...",a bartender is working at a saloon serving dri...,"[a, bartender, is, working, at, a, saloon, ser...","[bartender, working, saloon, serving, drinks, ...",bartender working saloon serving drink custome...
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...","The moon, painted with a smiling face hangs ov...",the moon painted with a smiling face hangs ove...,"[the, moon, painted, with, a, smiling, face, h...","[moon, painted, smiling, face, hangs, park, ni...",moon painted smiling face hang park night youn...
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...","The film, just over a minute long, is composed...",the film just over a minute long is composed o...,"[the, film, just, over, a, minute, long, is, c...","[film, minute, long, composed, two, shots, fir...",film minute long composed two shot first girl ...
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,Lasting just seconds and consisting of two ...,lasting just seconds and consisting of two ...,"[lasting, just, seconds, and, consisting, of, ...","[lasting, seconds, consisting, two, shots, fir...",lasting second consisting two shot first shot ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,The earliest known adaptation of the classic f...,the earliest known adaptation of the classic f...,"[the, earliest, known, adaptation, of, the, cl...","[earliest, known, adaptation, classic, fairyta...",earliest known adaptation classic fairytale fi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ...","The film begins in , just after World War ...",the film begins in just after world war i...,"[the, film, begins, in, just, after, world, wa...","[film, begins, world, war, ended, centres, aro...",film begin world war ended centre around joshu...
34882,2017,Çalgı Çengi İkimiz,Turkish,Selçuk Aydemir,"Ahmet Kural, Murat Cemcir",comedy,https://en.wikipedia.org/wiki/%C3%87alg%C4%B1_...,"Two musicians, Salih and Gürkan, described the...","Two musicians, Salih and Gürkan, described the...",two musicians salih and gürkan described the a...,"[two, musicians, salih, and, gürkan, described...","[two, musicians, salih, gürkan, described, adv...",two musician salih gürkan described adventure ...
34883,2017,Olanlar Oldu,Turkish,Hakan Algül,"Ata Demirer, Tuvana Türkay, Ülkü Duru",comedy,https://en.wikipedia.org/wiki/Olanlar_Oldu,"Zafer, a sailor living with his mother Döndü i...","Zafer, a sailor living with his mother Döndü i...",zafer a sailor living with his mother döndü in...,"[zafer, a, sailor, living, with, his, mother, ...","[zafer, sailor, living, mother, döndü, coastal...",zafer sailor living mother döndü coastal villa...
34884,2017,Non-Transferable,Turkish,Brendan Bradley,"YouTubers Shanna Malcolm, Shira Lazar, Sara Fl...",romantic comedy,https://en.wikipedia.org/wiki/Non-Transferable...,The film centres around a young woman named Am...,The film centres around a young woman named Am...,the film centres around a young woman named am...,"[the, film, centres, around, a, young, woman, ...","[film, centres, around, young, woman, named, a...",film centre around young woman named amy tyler...


### Selecting the desired columns
We keep only the original columns, including Plot, and the final cleaned plot column, plot_lemma; the intermediate columns are dropped.

In [36]:
df.columns

Index(['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast',
       'Genre', 'Wiki Page', 'Plot', 'Plot_1', 'plot_without_punct',
       'plot_tokenized', 'plot_tokenized_removed_sw', 'plot_lemma'],
      dtype='object')

In [38]:
df=df[['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast',
       'Genre', 'Wiki Page', 'Plot', 'plot_lemma']]

Renaming plot_lemma to Plot_cleanned so that its purpose is clearer

In [39]:
df.rename(columns={'plot_lemma': 'Plot_cleanned'},inplace=True)

# Saving the cleaned DataFrame in .csv format

In [40]:
df.to_csv('wiki_movie_plots_deduped_cleaned.csv',index=False)

# Summary

You have learned how to clean text data using pandas and the NLTK library. You have also seen how list comprehensions keep the code compact. We covered tokenization, stop-word removal, and lemmatization, as well as how to remove punctuation with string.punctuation and numbers with regular expressions.

# What to do Next?

Content-Based Movie Recommender:
Recommend movies with plots similar to those that a user has rated highly.

Movie Plot Generator:
Generate a movie plot description based on seed input, such as director and genre

Information Retrieval:
Return a movie title based on an input plot description

Text Classification:
Predict movie genre based on plot description