# Text PreProcessing:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df= pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [None]:
df.head()

## 1. Convert to Lower Case:

The first essential step is converting all text to lower case. String comparison in Python is case-sensitive, so "One" and "one" would otherwise be treated as two different words.

We will first perform this on a single row and then apply it to the whole dataset.

In [None]:
review= df['review'][2]

In [None]:
#converting to lower case 
review.lower()

In [None]:
#converting whole dataset to lower case
df['review']= df['review'].str.lower()

In [None]:
df['review'].head()
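
To make the motivation concrete, here is a minimal sketch that counts one word before and after lowercasing. The sample sentence is made up for illustration and is not a row from the dataset.

```python
from collections import Counter

# Made-up sample text; not taken from the IMDB dataset.
sample = "One of the best movies. one I would watch again. ONE more time!"

# Without lowercasing, "One", "one" and "ONE" are three different tokens.
raw_counts = Counter(sample.split())

# After lowercasing they collapse into a single token.
lower_counts = Counter(sample.lower().split())

print(raw_counts["one"])    # counts the exact string "one" only
print(lower_counts["one"])  # counts all case variants together
```

The second count is larger because every case variant now maps to the same token.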

## 2. Remove Unimportant Things:

After converting the text to lower case, remove everything that carries no meaning for the analysis, such as HTML tags (common when the data was collected by scraping).

The dataset we are using contains such tags, so we have to remove them. For this we will use Python's 're' (regular expression) module.

In [None]:
import re
#function for removing the html tags
def remove_html_tags(data):
    #non-greedy pattern: matches the shortest run of characters between < and >
    pattern= re.compile(r'<.*?>')
    #replace every match with an empty string
    return pattern.sub('', data)

In [None]:
#now we will use .apply() to apply the above function on the whole column of our dataframe
df['review']= df['review'].apply(remove_html_tags)

In [None]:
df['review'].head()
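
One side effect is worth checking: when a tag sits directly between two sentences, replacing it with an empty string glues the neighbouring words together. The sketch below shows one possible variant that substitutes a space instead. The name `remove_html_tags_spaced` and the sample string are assumptions made for illustration, not part of the notebook.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags_spaced(data):
    # Replace each tag with a space so neighbouring words stay separated,
    # then collapse any run of whitespace into a single space.
    no_tags = TAG_PATTERN.sub(' ', data)
    return re.sub(r'\s+', ' ', no_tags).strip()

# Made-up sample with line-break tags between two sentences.
sample = "a great film.<br /><br />the ending was weak."

glued = TAG_PATTERN.sub('', sample)
spaced = remove_html_tags_spaced(sample)

print(glued)   # tags removed, but "film." and "the" are glued together
print(spaced)  # tags removed, words kept apart
```

Whether the extra space matters depends on the tokenizer used later; with a plain `split()` it decides whether "film.the" becomes one token or two.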

## 3. Remove URLs:

In [None]:
def remove_urls(data):
    #match anything that starts with http://, https:// or www., which covers most urls
    pattern= re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', data)

In [None]:
df['review']= df['review'].apply(remove_urls)

In [None]:
df['review'].head()
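
A quick way to see what the pattern does and does not catch is to run it on a few hand-written strings. The sample texts and the `example.com` addresses below are made up for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Made-up sample strings; example.com is a placeholder domain.
samples = [
    "full review at https://example.com/reviews/42 if you care",
    "see www.example.com for the trailer",
    "no links in this one",
]

# The url is removed, but the spaces on both sides of it remain.
cleaned = [URL_PATTERN.sub('', text) for text in samples]

for text in cleaned:
    print(repr(text))
```

Because `\S+` consumes everything up to the next whitespace, punctuation that directly follows a url is removed together with it.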

## 4. Remove Punctuations:

Python's string module defines a set of punctuation characters (string.punctuation) that we can strip during text preprocessing.

In [None]:
import string
print('Punctuations according to Python:', string.punctuation)

Storing all the punctuation characters in a single variable:

In [None]:
exclude= string.punctuation
exclude

In [None]:
def remove_punctuations(data):
    #str.maketrans('', '', exclude) builds a table that deletes every character found in exclude
    return data.translate(str.maketrans('', '', exclude))

In [None]:
df['review']= df['review'].apply(remove_punctuations)

In [None]:
df['review'].head()
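
The translation table deletes characters without inserting anything in their place, so contractions, scores and hyphenated words get merged. A minimal sketch on a made-up sentence:

```python
import string

# Translation table that deletes every character in string.punctuation.
table = str.maketrans('', '', string.punctuation)

# Made-up sample; not a row from the dataset.
sample = "i don't think it's a 10/10, but well-made!"

stripped = sample.translate(table)
print(stripped)
```

Note that "don't" becomes "dont", which no longer matches a stop word list that stores the contraction with its apostrophe.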

## 5. Removing Stop Words:

Stop words are words that help form a sentence but contribute little to its contextual meaning.

Examples: and, for, it, I, me, my, we, our, ourselves, etc.

We will use the 'nltk' library to remove these stop words.

In [None]:
from nltk.corpus import stopwords

Download the stop word list first (needed once, before calling stopwords.words):

        -import nltk
        -nltk.download('stopwords')

In [None]:
#to get the english stop words
english_stop_words= stopwords.words('english')
english_stop_words

In [None]:
#a set makes each membership test fast instead of scanning the whole list
stop_word_set= set(english_stop_words)

def remove_stop_words(data):
    #keep only the words that are not stop words and join them back into a sentence
    #(appending '' for every stop word would leave double spaces in the output)
    return " ".join(word for word in data.split() if word not in stop_word_set)

In [None]:
df['review']= df['review'].apply(remove_stop_words)

In [None]:
df['review'].head()
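
NLTK's English list also contains negations such as "not", and removing them can flip the apparent sentiment of a review. The sketch below illustrates this with a tiny hand-written stop word set, used only so the example runs without any download. `demo_stop_words` and `remove_stop_words_demo` are hypothetical names.

```python
# Tiny hand-written stop word set (an assumption for this example);
# NLTK's real English list is much longer.
demo_stop_words = {"the", "was", "a", "and", "it", "i", "not"}

def remove_stop_words_demo(text, stop_words):
    # Keep only the words that are not in the stop word set.
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "the movie was not good and i regret it"

without_negation = remove_stop_words_demo(sample, demo_stop_words)
print(without_negation)

# Keeping negations: take "not" out of the stop word set first.
keep_negation = demo_stop_words - {"not"}
with_negation = remove_stop_words_demo(sample, keep_negation)
print(with_negation)
```

For sentiment tasks it may be worth subtracting negation words from the stop word set in the same way before filtering.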

## 6. Handling the Emojis:

We can use the 'emoji' package in Python, which replaces each emoji with its English description.

In [None]:
import emoji

In [None]:
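#demojize works on a single string, so apply it row by row and store the result
df['review']= df['review'].apply(emoji.demojize)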

In [None]:
df['review'].head()
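
To show the idea without installing anything, here is a rough standard-library approximation that swaps each symbol for its Unicode name. This is not how the emoji package is implemented, `demojize_sketch` is a hypothetical name, and the sketch does not handle emoji built from several code points (skin tones, joined sequences).

```python
import unicodedata

def demojize_sketch(text):
    # Replace every character in Unicode category "So" (Symbol, other),
    # which covers most single-code-point emoji, with its lower-cased
    # Unicode name wrapped in colons.
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown_symbol")
            out.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            out.append(ch)
    return "".join(out)

# \U0001F600 is the "grinning face" emoji.
result = demojize_sketch("loved it \U0001F600")
print(result)
```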

## 7. Tokenization:

Tokenization means splitting the text into smaller units (tokens), which can be words or sentences.

It is a prerequisite for feature engineering, i.e. converting the words into numbers.

#### 1. Simple Tokenization:

If you simply want word or sentence tokenization, you can use the split function to get the tokenized output.

In [None]:
def word_tokenization(data):
    #split on any run of whitespace
    return data.split()
df['simple_words_tokens']= df['review'].apply(word_tokenization)
df['review'].head()

In [None]:
df['simple_words_tokens'].head()

Similarly, for sentence tokenization you can split on '.' or on a new line.
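
A minimal sketch of both naive approaches on a made-up text that still has its punctuation. The regular expression used for sentences is one possible choice, not the only one.

```python
import re

# Made-up sample that still contains punctuation.
sample = "the plot was thin. the acting saved it! would i watch it again?"

# Word tokens: split() keeps punctuation attached to the neighbouring word.
words = sample.split()
print(words[:4])

# Sentence tokens: split after ., ! or ? when followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', sample)
print(sentences)
```

This rule breaks on abbreviations such as "mr. smith", which is one reason to prefer a trained tokenizer for sentences.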

#### 2. Using NLTK Library:

In [None]:
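#both tokenizers need the Punkt models: run nltk.download('punkt') once ('punkt_tab' on recent NLTK releases)
from nltk.tokenize import word_tokenize, sent_tokenize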

In [None]:
df['nltk_word_tokenize']= df['review'].apply(lambda x: word_tokenize(x))

In [None]:
df['nltk_word_tokenize'].head()

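Now let's use the sentence tokenizer. Because the punctuation was already removed in step 4, most reviews will come back as a single sentence here: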

In [None]:
df['nltk_sentence_tokenize']= df['review'].apply(sent_tokenize)

In [None]:
df['nltk_sentence_tokenize'].head()

#### 3. Spacy:

Do this first:

    -pip install spacy
    -python -m spacy download en_core_web_sm

We will load the small English pipeline from spaCy:

In [None]:
import spacy
nlp= spacy.load('en_core_web_sm')

In [None]:
#first convert the text into a spaCy Doc object
doc= nlp('I am a Student!')

In [None]:
for token in doc:
    print(token)

## 8. Stemming:

Stemming reduces a word to its root form.

        -Walking, Walked, Walks -----> Walk

**Stemming is a big part of information retrieval systems. Take Google as an example: if we search for fishing or fisher, the results will also include pages about fish.**

There are different stemming algorithms; for English we will use the Porter stemmer:

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
#create the Porter stemmer object
ps= PorterStemmer()
def stem_words(data):
    #split the text, stem each word, and join the stems back into a sentence
    return " ".join([ps.stem(word) for word in data.split()])

In [None]:
#applying function 
df['porter_stemming']= df['review'].apply(lambda x: stem_words(x))

Now compare the original and the stemmed text:

In [None]:
df['review'].head()

In [None]:
df['porter_stemming'].head()
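
To illustrate how rule-based suffix stripping behaves, here is a toy stemmer with three rules. It is far cruder than the Porter algorithm and is not a substitute for it; `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    # Toy rule set for illustration only; the Porter algorithm has many
    # more rules and conditions.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["walking", "walked", "walks"]]
print(stems)

# Blind suffix stripping can also produce strings that are not words.
print(toy_stem("this"))
```

The rules know nothing about vocabulary, so the same mechanism that maps all three forms to "walk" also turns "this" into a non-word.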

**The main problem with stemming is that it does not care whether the reduced root form exists in English. For example, little becomes littl and movie becomes movi.**

**This issue is solved by lemmatization, in which the root form is always a valid English word.**

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lem= WordNetLemmatizer()

def lemmit(data):
    #using nltk word tokenizer
    words = word_tokenize(data)
    #lemmatize() treats every word as a noun unless a pos argument is given
    return " ".join([wordnet_lem.lemmatize(word) for word in words])

In [None]:
df['lemmitized']= df['review'].apply(lemmit)

In [None]:
df.columns

In [None]:
df[['review', 'porter_stemming', 'lemmitized']]
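
Unlike a stemmer, a lemmatizer looks words up in a dictionary, and the result can depend on the part of speech. The sketch below mimics that with a hypothetical four-entry table; `LEMMA_TABLE` and `toy_lemmatize` are made-up names, and a real lemmatizer such as NLTK's `WordNetLemmatizer` relies on the full WordNet data instead.

```python
# Hypothetical mini lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("movies", "n"): "movie",
    ("better", "a"): "good",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos="n"):
    # Return the dictionary form if known, otherwise the word unchanged.
    # The default of "n" (noun) mirrors the default of NLTK's lemmatize().
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("movies"))         # known noun
print(toy_lemmatize("saw", pos="v"))   # verb reading
print(toy_lemmatize("saw"))            # noun reading by default
print(toy_lemmatize("walking"))        # not in the table, returned as-is
```

This is why passing a part of speech, for example `wordnet_lem.lemmatize(word, pos='v')` for verbs, can change the output compared with the noun default used in the cell above.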