# <center><b><u>Text Preprocessing and Various Text Featurization Techniques</u></b> <br><br><u><b> Amazon Fine Food Reviews</b></u></center>
<b><div align = 'right'><i> By: <font color = 'darkred'>Aarat Satsangi</font></i></div></b>
<div align = 'right'>Referenced from: <a href = 'https://github.com/kushagra414/Amazon-Fine-Food-Reviews-Analysis/blob/master/1.%20Amazon%20Fine%20Food%20Review%20TSNE/Amazon%20Fine%20Food%20Reviews%20Analysis_TSNE.ipynb'>Kushagra Shekhawat</a></div>

## <b>Information About the Dataset</b>

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.

## <b>Loading the Data</b>

The dataset is available in two forms
1. .csv file
2. SQLite Database

To load the data, we use the SQLite database, as it is easier to query and visualise the data efficiently.
<br> 

Since we only want the overall sentiment of each review (positive or negative), we purposely ignore all reviews with a Score equal to 3. If the score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".

###  <b>Importing All Required Packages </b>

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem import SnowballStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

###  <b>Reading the Data</b>

In [2]:
con = sqlite3.connect("D:/Machine Learning/17.1 - Dataset overview Amazon Fine Food reviews(EDA)/DATA SET/database.sqlite")

Next, we load only the reviews whose score is above or below 3, so that each review can be categorized as positive or negative.

i.e. positive > 3 > negative 

In [3]:
data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE score != 3
""",con)

data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
525809,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,5,1299628800,Will not do without,Great for sesame chicken..this is a good if no...
525810,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,2,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...
525811,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,5,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o..."
525812,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,5,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...


Now we have to convert the rating score into sentiment 

i.e. 

*positive = 1 (>3)* ,
*negative = 0 (<3)*

#### <b><font color = 'darkblue'> Converting score into positive (=1) or negative (=0)</font></b>

In [4]:
def pos_neg (x):
    if x<3:
        return 0
    return 1

Now, apply the function and convert the score!

In [5]:
filtered_data = data.Score.map(pos_neg)
data.Score = filtered_data

data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
525809,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,1,1299628800,Will not do without,Great for sesame chicken..this is a good if no...
525810,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,0,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...
525811,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,1,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o..."
525812,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,1,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...
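
The same labelling can also be written without a Python-level function. The sketch below is only an illustration on a hypothetical five-row Series (the name `scores` is not part of the notebook), and it assumes, as above, that neutral scores of 3 have already been filtered out:

```python
import pandas as pd

# Hypothetical ratings; 3 is absent because neutral reviews were filtered out earlier.
scores = pd.Series([5, 1, 4, 2, 5])

# A boolean comparison followed by a cast gives 1 for scores above 3 and 0 otherwise.
labels = (scores > 3).astype(int)
print(labels.tolist())
```

On the full table this would correspond to something like `data.Score = (data.Score > 3).astype(int)`, which avoids calling `pos_neg` once per row.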


Now let us count the number of users in the whole dataset.

In [6]:
unique = pd.read_sql_query("""
SELECT Count(*) as reviews , Count(DISTINCT UserId) as Users
FROM Reviews""",con)

In [7]:
unique

Unnamed: 0,reviews,Users
0,568454,256059


In [8]:
#TO EXTRACT VALUE FROM A CELL IN DATAFRAME
print("TOTAL NUMBER OF USERS :: ", (unique.iloc[0,1]))
print("TOTAL NUMBER OF REVIEWS :: " , (unique.iloc[0,0]))
print("AVG REVIEW PER PERSON :: " , (unique.iloc[0,0]) / (unique.iloc[0,1]))

TOTAL NUMBER OF USERS ::  256059
TOTAL NUMBER OF REVIEWS ::  568454
AVG REVIEW PER PERSON ::  2.2200117941568154


## <b><center><font color = "red"> [1].</font> <u>Data Cleaning: Deduplication</u></center></b>

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. Following is an example:

In [9]:
pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE UserId = "AR5J8UI46CURR"
""",con)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen, the USER ID, HELPFULNESS NUMERATOR, HELPFULNESS DENOMINATOR, SCORE, TIME, SUMMARY and TEXT are the same for all these reviews.

As they all have the same timestamp, we can conclude that these rows are duplicates.

After analysing the PRODUCT ID(s), we found that they all belong to the same product from the same brand, differing only in flavour.

Hence, in order to reduce redundancy, we should eliminate the rows that share these values.

==============================================================================

In [10]:
# Removing redundancy on sorted data will be faster
sorted_data = data.sort_values("ProductId", axis = 0 , ascending = True , \
                              kind = 'quicksort' , na_position = 'last') #na_position is the position of NaN data
sorted_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,0006641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,0006641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,0006641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...
...,...,...,...,...,...,...,...,...,...,...
176791,191721,B009UOFTUI,AJVB004EB0MVK,D. Christofferson,0,0,0,1345852800,weak coffee not good for a premium product and...,"This coffee supposedly is premium, it tastes w..."
1362,1478,B009UOFU20,AJVB004EB0MVK,D. Christofferson,0,0,0,1345852800,weak coffee not good for a premium product and...,"This coffee supposedly is premium, it tastes w..."
303285,328482,B009UUS05I,ARL20DSHGVM1Y,Jamie,0,0,1,1331856000,Perfect,The basket was the perfect sympathy gift when ...
5259,5703,B009WSNWC4,AMP7K1O84DH1T,ESTY,0,0,1,1351209600,DELICIOUS,Purchased this product at a local store in NY ...


In [11]:
cleaner_data = sorted_data.drop_duplicates(subset = ["UserId" , "ProfileName" , "Time" , "Text"] , keep = "first")
print (cleaner_data.shape)

(364173, 10)
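
To make the effect of `drop_duplicates` concrete, here is a small sketch on a hypothetical three-row table (the frame `toy` and its shortened review texts are made up for illustration). Two rows are the same review posted under different product ids, so only one of them should survive:

```python
import pandas as pd

# Hypothetical miniature of the Reviews table: the first two rows are the same
# review posted under two product ids, the third is a different review.
toy = pd.DataFrame({
    "ProductId":   ["B000HDOPZG", "B000HDL1RQ", "B001E4KFG0"],
    "UserId":      ["AR5J8UI46CURR", "AR5J8UI46CURR", "A3SGXH7AUHU8GW"],
    "ProfileName": ["Geetha Krishnan", "Geetha Krishnan", "delmartian"],
    "Time":        [1199577600, 1199577600, 1303862400],
    "Text":        ["DELICIOUS WAFERS.", "DELICIOUS WAFERS.", "Good dog food."],
})

# Rows are duplicates when they agree on all four key columns;
# sorting first decides which copy counts as the "first" one.
key = ["UserId", "ProfileName", "Time", "Text"]
deduped = toy.sort_values("ProductId").drop_duplicates(subset=key, keep="first")
print(deduped["ProductId"].tolist())
```

Because the rows are sorted by `ProductId` before deduplication, the copy with the smallest product id is the one that is kept.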


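There is another type of inconsistency in the data:
**Helpfulness Numerator > Helpfulness Denominator**

This is not possible, because the numerator counts the users who found the review helpful, and they are a subset of the users counted in the denominator. Hence, we drop such reviews too.

Below are the reviews in which this happens: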

In [12]:
pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE HelpfulnessNumerator > HelpfulnessDenominator""",con)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
1,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


Only two such entries are found in the whole data set.

Now we clean these out, if they exist, from our sample dataset (cleaner_data)

In [13]:
cleaner_data = cleaner_data[cleaner_data.HelpfulnessNumerator <= cleaner_data.HelpfulnessDenominator]
print (cleaner_data.shape)

(364171, 10)


In [14]:
print("Percentage of Data remained :: {:.2f}%".format(cleaner_data.shape[0] / data.shape[0] * 100))

Percentage of Data remained :: 69.26%


In [15]:
print (cleaner_data.shape)
print (cleaner_data.Score.value_counts())

(364171, 10)
1    307061
0     57110
Name: Score, dtype: int64


**Check for NaN entries ::**

In [16]:
print (cleaner_data.isna().any())

Id                        False
ProductId                 False
UserId                    False
ProfileName               False
HelpfulnessNumerator      False
HelpfulnessDenominator    False
Score                     False
Time                      False
Summary                   False
Text                      False
dtype: bool


There are no NaN values. Hence we move on to the next part.

The way people write about a product changes over time. For example, a product might not get good reviews at the start, but as time progresses the makers change it and it starts getting more positive reviews.<br>
So, we should sort the data by time.

In [17]:
cleaner_data = cleaner_data.sort_values(by="Time")
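
The `Time` values appear to be Unix timestamps in seconds (an assumption based on their magnitude and the dataset's 1999 to 2012 timespan), so sorting the raw integers already orders the reviews chronologically. A minimal sketch for turning one of them into a readable date, using only the standard library:

```python
from datetime import datetime, timezone

# Time value of the first review shown in the tables above.
unix_seconds = 1303862400

# Interpret the integer as seconds elapsed since 1970-01-01 in UTC.
review_date = datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()
print(review_date.isoformat())
```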

## <b><center><font color = "red"> [2].</font> <u>Text Preprocessing</u></center></b>

### <b><font color = "darkorange"> [i].</font> Understanding Text Preprocessing</b>


Now that deduplication is finished, our data requires some preprocessing before we go on with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (on the grounds that there are no 2-letter adjectives)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After this, we collect the words used to describe positive and negative reviews.
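
Steps 3 and 4 of the list are simple per-word checks. The sketch below shows one possible way to write them; the helper name `keep_word` and the sample sentence are hypothetical and are not taken from the notebook:

```python
import re

def keep_word(word):
    """Return True for words made only of English letters and longer than 2 characters."""
    return len(word) > 2 and re.fullmatch(r"[A-Za-z]+", word) is not None

# Hypothetical sample containing short words, a product id and a bare number.
sample = "it is a B00004RBDY tea so tasty 5 stars"
kept = [w for w in sample.split() if keep_word(w)]
print(kept)
```

A character class is used instead of `str.isalpha()` because `isalpha()` also accepts letters from other alphabets.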

#### <b><font color = "green"> (1). </font> REMOVING HTML TAGS</b>

Some people insert links into their reviews, and text copied from a website may contain HTML tags.<br>
As these are of no use in sentiment analysis, we should get rid of them.<br>
For this we can either use the regex module and substitute accordingly, or use BeautifulSoup from bs4 to strip the tags in just one line. Below, a regex removes the links and BeautifulSoup removes the tags.

Let us take an example :: 

In [18]:
x = cleaner_data.Text.values[57]
print ("Inititally :: \n\n" , x , "\n\n" , '='*70)
print ("After Removing Tags and links:: \n" )
x =  re.sub(r"http\S+" , "" , x)
x = BeautifulSoup(x , 'lxml').get_text()
print (x , "\n" , "="*70)
# The r prefix marks a raw string, so backslashes reach the regex engine unchanged.
# \S matches any non-whitespace character.
# + means one or more repetitions, i.e. http\S+ ----> "http" followed by a run of non-whitespace characters.

Inititally :: 

 For me, when the days get colder nothing is as rewarding as a simple cup of hot tea. And for it's claimed immunity benefits, a basic green tea is a common pick for maintaining a healthy natural balance during the flu season. From previous experiences in tasting the Tazo brand, both of the bottled and boxed products, they have proven to be unsurpassed for quality and flavor. Once I've tried their teas they immediately became my drink of choice. <p>The Zen Green Tea Blend is a wonderful one that has only a few ingredients with no artificial anything. And thankfully, doesn't boast the addition of fortified vitamins in some senseless amount. It truly is an enlightening blend of green tea, spearmint, lemongrass and lemon verbena. Thus making it versatile refreshment for anytime of the day, whether it's right after meals or between meals, or just before bedtime. Generally light and mild tasting, but that will depend upon how long you steep it and if you add a sweetener of so

As you can see all *TAGS* and *LINKS* have been removed
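
The order of the two steps matters when a link sits inside a tag attribute. The sketch below illustrates this on a made-up review string; `strip_tags` is a deliberately crude regex stand-in for `BeautifulSoup(...).get_text()`, so the exact text that gets lost differs from what BeautifulSoup would lose, but the effect is of the same kind:

```python
import re

def strip_tags(text):
    # Crude tag remover, used here only as a stand-in for an HTML parser.
    return re.sub(r"<[^>]+>", " ", text)

def strip_urls(text):
    # Same pattern as in the cell above: "http" plus a run of non-whitespace.
    return re.sub(r"http\S+", "", text)

def squeeze(text):
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

# Hypothetical review: one link inside an <a> tag, one link in plain text.
raw = ('<a href="http://example.com/item">Golf cake set</a><br /><br />'
       'Great product, see http://example.com/review today')

tags_first = squeeze(strip_urls(strip_tags(raw)))
urls_first = squeeze(strip_tags(strip_urls(raw)))
print(tags_first)
print(urls_first)
```

When the link is removed first, the match runs through the closing `">` of the tag and into the first visible word, so the markup is left broken and part of the review text is lost. Removing the tags first keeps the visible text intact.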

#### <b><font color = "green"> (2). </font> PERFORMING DECONTRACTION (FUNCTION)</b>

We need to expand contracted forms such as I'm, won't, couldn't or wouldn't into their full forms.

In [19]:
import contractions

def decontract (phrase):
    # Expand contracted forms in the phrase using the contractions package
    return contractions.fix(phrase)
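
If the `contractions` package is not available, a handful of regular expressions covers the most common cases. This is a rough sketch, not a replacement for the library: the function name `decontract_regex` is hypothetical, the rules are case-sensitive, and ambiguous endings such as `'s` and `'d` are deliberately left alone.

```python
import re

def decontract_regex(phrase):
    # Irregular forms first, so the generic "n't" rule does not mangle them.
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    # Generic suffix rules.
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

expanded = decontract_regex("I'm sure it won't work and they couldn't fix it")
print(expanded)
```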

In [20]:
print ("Original Data ::\n\n")
print (x , "\n")
print ("="*70)
print ("Decontracted Data ::\n\n")
print (decontract(x) )

Original Data ::


For me, when the days get colder nothing is as rewarding as a simple cup of hot tea. And for it's claimed immunity benefits, a basic green tea is a common pick for maintaining a healthy natural balance during the flu season. From previous experiences in tasting the Tazo brand, both of the bottled and boxed products, they have proven to be unsurpassed for quality and flavor. Once I've tried their teas they immediately became my drink of choice. The Zen Green Tea Blend is a wonderful one that has only a few ingredients with no artificial anything. And thankfully, doesn't boast the addition of fortified vitamins in some senseless amount. It truly is an enlightening blend of green tea, spearmint, lemongrass and lemon verbena. Thus making it versatile refreshment for anytime of the day, whether it's right after meals or between meals, or just before bedtime. Generally light and mild tasting, but that will depend upon how long you steep it and if you add a sweetener of som

#### <b> <font color = "green"> (3). </font> REMOVING ALL  SPECIAL CHARACTERS AND PUNCTUATION</b>

Here is a regex to match a string of characters that are not letters or numbers:

In [21]:
x = re.sub('[^A-Za-z0-9]+' , " " , x)
print (x)

For me when the days get colder nothing is as rewarding as a simple cup of hot tea And for it s claimed immunity benefits a basic green tea is a common pick for maintaining a healthy natural balance during the flu season From previous experiences in tasting the Tazo brand both of the bottled and boxed products they have proven to be unsurpassed for quality and flavor Once I ve tried their teas they immediately became my drink of choice The Zen Green Tea Blend is a wonderful one that has only a few ingredients with no artificial anything And thankfully doesn t boast the addition of fortified vitamins in some senseless amount It truly is an enlightening blend of green tea spearmint lemongrass and lemon verbena Thus making it versatile refreshment for anytime of the day whether it s right after meals or between meals or just before bedtime Generally light and mild tasting but that will depend upon how long you steep it and if you add a sweetener of some form Interesting too are the amusin

#### <b><font color = "green"> (4). </font> REMOVING ALPHANUMERIC WORDS</b>

Words that contain digits as well as letters should be removed, e.g. 'B00004RBDY'.

In [22]:
x = "blah blah blah B004RBDY blah blah blahhh"
print ( re.sub(r"\S*\d\S*" , "" , x ))

blah blah blah  blah blah blahhh


*As you can see alphanumeric words have been removed*

1. `\S*`  --> any non-whitespace character, occurring 0 or more times
2. `\d`  ----> exactly one digit
3. `\S*\d\S*`  ---> any run of non-whitespace characters that contains at least one digit, e.g. saj3434jkdsd
4. This removes bare numbers as well as words that contain digits.
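
As the output above shows, the substitution leaves a double space where the word used to be. A minimal sketch that applies the same pattern and then tidies the whitespace (the function name and the sample string are hypothetical):

```python
import re

def drop_tokens_with_digits(text):
    # Remove every whitespace-delimited token that contains at least one digit...
    without_digits = re.sub(r"\S*\d\S*", "", text)
    # ...then collapse the extra spaces left behind by the removed tokens.
    return " ".join(without_digits.split())

cleaned = drop_tokens_with_digits("blah blah B004RBDY blah 5 stars 2nd")
print(cleaned)
```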


#### <b><font color = "green"> (5). </font> CONVERTING INTO LOWER CASE</b>

**FUNCTION**

In [23]:
def lower_case(phrase):
    return ' '.join(word.lower() for word in phrase.split())

Example :: 

In [24]:
print ("Initially :: \n" , cleaner_data.Text.values[22])
print ("\nAfter LowerCase :: \n" , lower_case(cleaner_data.Text.values[22]))

Initially :: 
 Michael Keaton brings no distinguishing characteristics to the ghoul 'Beetlejuice', he merely acts bizarre, as does the script. It is often stunning cinematography but when the movie itself comes into focus, it's like finding one of Beetlejuice's snacks in your popcorn.

After LowerCase :: 
 michael keaton brings no distinguishing characteristics to the ghoul 'beetlejuice', he merely acts bizarre, as does the script. it is often stunning cinematography but when the movie itself comes into focus, it's like finding one of beetlejuice's snacks in your popcorn.


#### <b><font color = "green"> (6). </font> REMOVING STOP WORDS</b>

From the set of stop words we are
1. Removing - "no" , "nor" , "not"
2. Adding - br (as in previous steps it can be seen that `<br />` becomes br)

In [25]:
stop = set(stopwords.words('english'))
stop.add("br")
stop.remove("no")
stop.remove("nor")
stop.remove("not")
print (len(stop) , len(stopwords.words('english')))

177 179


In [26]:
print (stop)

{'mustn', 'o', "isn't", 'own', 'each', 'as', 'ma', 'your', "aren't", "she's", 'mightn', 'once', 'between', 'and', "you'll", 'up', 'ours', 'when', 'our', 'below', 'what', 'during', 'most', 'further', "won't", 'than', 'about', 'who', 'such', 'before', "shan't", 'those', 'very', 'now', "hadn't", 'an', 'the', 'only', 'my', 'doing', 'having', 'd', 'which', 'himself', 'or', 'off', 'do', 'y', 'does', 'have', 'won', "don't", 'been', 'was', 'hadn', "couldn't", 'at', 'these', 'shan', 'you', 'its', 'too', "doesn't", "mustn't", 'couldn', 'herself', 'out', 'then', 'can', 'had', 'yourself', 'just', 'they', 'br', 'their', "mightn't", "hasn't", 'shouldn', 'both', 'don', "you're", 'will', 't', 'has', 'whom', 'above', 'her', 'doesn', 'under', 'how', 'some', 'them', 'there', 'because', 'should', 'while', 'theirs', 'with', 'weren', 'for', 'is', 'if', 'that', 'i', 'hasn', "haven't", 'he', 'isn', 'to', "that'll", 'by', 'against', 'other', "wasn't", 'so', 'needn', 'themselves', 'wouldn', 've', 'down', 'are',

**Function to remove stopwords**

In [27]:
def rem_stopwords(phrase):
    return ' '.join(word for word in phrase.split() if word not in stop)

In [28]:
print (cleaner_data.Text.values[23])
print ("\n" , "="*70 , "\n")
print (rem_stopwords(re.sub("[^A-Za-z0-9]+" , ' ' , cleaner_data.Text.values[23])))

I am continually amazed at the shoddy treatment that some movies get in their DVD release.  This DVD is simply a disgrace, especially considering what a great movie this is.  I give the movie itself 5 stars; it's a wonderful example of Tim Burton's energy and style.<p>This DVD has no extras worth mentioning.  No deleted scenes, no featurettes, not even a lousy commentary track!  To make it even worse, the film has been CUT DOWN from the theatrical release!  I have never seen a DVD release before where you get LESS than was originally presented in theaters.<p>My advice is to save your money until somebody figures out that when a movie is released on DVD, it needs to live up to the capabilities of the medium, and should always provide more material than was originally released, not less.


I continually amazed shoddy treatment movies get DVD release This DVD simply disgrace especially considering great movie I give movie 5 stars wonderful example Tim Burton energy style p This DVD no ext
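
Note that "I" and "This" survive in the output above: the stop-word set is all lowercase, while the text has not been lowercased yet. The sketch below reproduces the effect with a hypothetical four-word stop set (`toy_stop` and `remove_stop` are illustration names only):

```python
# Hypothetical miniature stop-word set; like the set printed above, every entry is lowercase.
toy_stop = {"i", "this", "is", "the"}

def remove_stop(text, stop_set):
    # Case-sensitive membership test, as in rem_stopwords.
    return " ".join(w for w in text.split() if w not in stop_set)

text = "This DVD is no good"

removed_then_lowered = remove_stop(text, toy_stop).lower()
lowered_then_removed = remove_stop(text.lower(), toy_stop)
print(removed_then_lowered)
print(lowered_then_removed)
```

Lowercasing before removing stop words is therefore the order that removes capitalised stop words as well.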

#### <b><font color = "green"> (7). </font> STEMMING WORDS</b>

We will use SnowballStemmer (for English, the "Porter2" algorithm), as it was observed to work better than PorterStemmer.

In [29]:
stemmer = SnowballStemmer('english')
print (stemmer.stem("manly"))

man


### <b><font color = "darkorange"> [ii].</font> Applying Text Preprocessing(All Together)</b>

In [30]:
null_review =[]
stemmed_preprocessed_reviews = []
unstemmed_preprocessed_reviews = [] #As word2vec works better with unstemmed text

for count , review in tqdm(enumerate(cleaner_data.Text.values)):
    #(1). REMOVING LINKS (URLs) :: 
    review  = re.sub(r"http\S+" , '' , review)
    #(2). REMOVING HTML TAGS
    review = BeautifulSoup(review , 'lxml').get_text()
    
    # Many reviews become empty after the two steps above, e.g. review numbers 24, 121, 456 and 601.
    # Review number 24 starts with:
    #   '<a href="http://www.amazon.com/gp/product/B0000VMBDI">WILTON 13 PC GOLF SET 1306-7274</a><br /><br />I am very happ
    # Removing the link with re.sub(r'http\S+','',review) deletes the whole string
    #   http://www.amazon.com/gp/product/B0000VMBDI">WILTON
    # including the ">" that closes the <a> tag, which leaves malformed markup.
    # BeautifulSoup(...).get_text() then returns an empty string.
    # That is why such reviews are recorded in null_review and skipped below.

    if review == '' or review == ' ':
        null_review.append(count)
        continue
    
    #(3). DECONTRACTION
    review = decontract(review)
    
    #(4). REMOVING SPECIAL CHARACTERS AND PUNCTUATION
    review = re.sub("[^A-Za-z0-9]+" , ' ' , review)
    
    #(5). REMOVING ALPHANUMERIC WORDS
    review = re.sub(r'\S*\d\S*' , '' , review)
    
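    #(6). REMOVING STOP WORDS AND LOWER CASING
    # Stop words are removed before lower-casing, so capitalised stop words such as "I" or "The"
    # are not matched by the lowercase stop-word set and remain in the text.
    review = lower_case(rem_stopwords(review))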
    
    unstemmed_preprocessed_reviews.append(review)
    
    #(7). STEMMING
    stemmed_preprocessed_reviews.append(' '.join(list(map(stemmer.stem , review.split()))))


364171it [05:39, 1072.03it/s]


In [31]:
print("TOTAL STEMMED PROCESSED REVIEWS :: " , len(stemmed_preprocessed_reviews))
print("TOTAL UNSTEMMED PROCESSED REVIEWS :: " , len(unstemmed_preprocessed_reviews))
print("TOTAL EMPTIED REVIEWS :: " , len(null_review))
print("SOME OF THE EMPTIED REVIEWS :: \n\t" , null_review[:10])

TOTAL STEMMED PROCESSED REVIEWS ::  363223
TOTAL UNSTEMMED PROCESSED REVIEWS ::  363223
TOTAL EMPTIED REVIEWS ::  948
SOME OF THE EMPTIED REVIEWS :: 
	 [10283, 10328, 10846, 10859, 11280, 11379, 11623, 11647, 11891, 11943]


In [32]:
print ("Example:\nStemmed Review - " , stemmed_preprocessed_reviews[1])
print ("\nUnstemmed Review - " , unstemmed_preprocessed_reviews[1])

Example:
Stemmed Review -  i rememb see show air televis year ago i child my sister later bought lp i day i thirti someth i use seri book song i student teach preschool turn whole school i purchas cd along book children the tradit live

Unstemmed Review -  i remember seeing show aired television years ago i child my sister later bought lp i day i thirty something i used series books songs i student teaching preschoolers turned whole school i purchasing cd along books children the tradition lives


#### <b><font color = "darkblue"> Saving the Preprocessed Reviews in a New Database</font></b>

In [33]:
ind = list(range(cleaner_data.shape[0]))
cleaner_data["index"] = ind
cleaner_data.set_index("index" , inplace = True)
final = cleaner_data.drop(null_review)
final["index"] = list(range(final.shape[0]))
final.set_index('index' , inplace = True)

final["CleanedText"] = stemmed_preprocessed_reviews
final["Unstemmed_CleanedText"] = unstemmed_preprocessed_reviews
#final


In [34]:
conn = sqlite3.connect('final.sqlite')
c = conn.cursor()
conn.text_factory = str
final.to_sql('Reviews' , conn  , schema = None , if_exists = 'replace')

#### <b><font color = 'darkblue'> Loading the Preprocessed Reviews from the Saved Database</font></b>

In [35]:
# os.getcwd()
conn = sqlite3.connect('final.sqlite')
final = pd.read_sql_query("""
SELECT *
FROM Reviews""",conn)
stemmed_preprocessed_reviews = final.CleanedText.values
unstemmed_preprocessed_reviews = final.Unstemmed_CleanedText.values
print (type(stemmed_preprocessed_reviews) , len(stemmed_preprocessed_reviews))
print (type(unstemmed_preprocessed_reviews) , len(unstemmed_preprocessed_reviews))

<class 'numpy.ndarray'> 363223
<class 'numpy.ndarray'> 363223


## <b><center><font color = "red"> [3]. </font> <u>Using Different Ways to Featurize Text</u></center></b>

We have to convert all the text into vectors (numbers), as it is easier for a computer to work with numbers than with text.<br>
There are many ways available to build features that represent text data as numbers. For example:

1. <b>Bag of Words</b>
    - Simple Bag of Words (uni-bi-n-gram)
    - TF-IDF Weighted Bag of Words (uni-bi-n-gram)<br>

2. <b>Word2Vec</b>
    - Simple Word2Vec
    - TF-IDF Weighted Word2Vec (uni-bi-n-gram)


Let's take only a few reviews to understand text featurization.

In [36]:
stemmed_sample = stemmed_preprocessed_reviews[:30000]
unstemmed_sample = unstemmed_preprocessed_reviews[:30000]

### <b><font color = "darkorange"> [i].</font> BAG OF WORDS!</b>

####  <b><font color = "darkgreen">(1).</font> Simple Bag of Words</b>

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

cv = CountVectorizer(min_df = 10 , max_features = 5000)

bow = cv.fit_transform(stemmed_sample)
bow = Normalizer().fit_transform(bow)

print ("Some Features :: \n" , cv.get_feature_names_out()[100:108])
print ("="*70)
print ("TYPE OF BOW :: " , type(bow))
print ("SHAPE OF SPARSE MATRIX :: " , bow.shape)

Some Features :: 
 ['alert' 'alfalfa' 'alfredo' 'alik' 'alittl' 'aliv' 'alkali' 'all']
TYPE OF BOW ::  <class 'scipy.sparse.csr.csr_matrix'>
SHAPE OF SPARSE MATRIX ::  (30000, 5000)
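
To see what the count matrix and the normalisation step contain, here is a pure-Python sketch on a hypothetical three-review corpus. It splits on whitespace only, whereas `CountVectorizer` by default also lowercases the text and ignores single-character tokens, so treat it as an illustration of the idea rather than a re-implementation:

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one row of word counts per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

def l2_normalize(row):
    """Scale a row to unit Euclidean length, leaving all-zero rows untouched."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# Hypothetical stemmed reviews.
vocab, counts = bag_of_words(["tasti tea", "tea not tasti", "bad tea tea"])
print(vocab)
print(counts)
print(l2_normalize(counts[2]))
```

Each column corresponds to one vocabulary word and each row to one review. After normalisation every non-empty row has length 1, so long and short reviews become comparable.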


#### <b><font color = "darkgreen">(2).</font>TF-IDF Weighted Bag Of Words</b>

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

tfidf_vect = TfidfVectorizer(min_df = 10 , max_features = 5000)

tfidf = tfidf_vect.fit_transform(stemmed_sample)
tfidf = Normalizer().fit_transform(tfidf)

print ("TYPE OF TFIDF BOW :: " , type(tfidf))
print ("SHAPE OF SPARSE MATRIX :: " , tfidf.shape)

TYPE OF TFIDF BOW ::  <class 'scipy.sparse.csr.csr_matrix'>
SHAPE OF SPARSE MATRIX ::  (30000, 5000)
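
The weights can be reproduced by hand. Assuming scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`), the inverse document frequency of a term t is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. The sketch below applies this to a hypothetical three-review corpus with whitespace tokenisation:

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows, on whitespace tokens."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        term_freq = Counter(doc)
        raw = [term_freq[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

vocab, idf, rows = tfidf_rows(["tasti tea", "tea not tasti", "bad tea"])
print(vocab)
print({w: round(v, 3) for w, v in idf.items()})
```

A word that occurs in every review ("tea") gets the minimum idf of 1, while rarer words get larger weights. Since the rows already have unit length under the default `norm='l2'`, the additional `Normalizer` step in the cell above leaves the matrix unchanged.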


##### <b><font color = "blue"> Function to get top N features</font></b>

Function to get the top 'N' TF-IDF features of a row (doc/review) and return them together with their TF-IDF values

In [39]:
def top_tfidf_feats (row , features , top_n):
    topn_ids = np.argsort(row)[::-1][:top_n]
    
    # [::-1] --> reversing the sorted array into descending form
    # [:top_n] --> selecting the top n indices
    
    top_feats = [(features[i] , row[i]) for i in topn_ids]
    
    # Create a list of (feature, tfidf) tuples using the indices in the array topn_ids
    
    df = pd.DataFrame(top_feats)
    
    # Convert into dataframe
    
    df.columns = ['feature' , 'tfidf']
    return df

In [40]:
top_tf_idf = top_tfidf_feats (tfidf[0,:].toarray()[0] , tfidf_vect.get_feature_names_out() , 5)

#.toarray()[0] --> converts the sparse 1 x n row into a dense one-dimensional vector of length n.

top_tf_idf



Unnamed: 0,feature,tfidf
0,book,0.504049
1,son,0.278646
2,loud,0.234095
3,silli,0.218336
4,sing,0.216021


#### <b><font color = "darkgreen">(3).</font> Simple Bag of Words and TF-IDF Weighted Bag of Words using Bi-Grams/N-Grams</b>

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

cv = CountVectorizer( ngram_range = (1,2) , max_features = 5000)
tfidf_vect = TfidfVectorizer(ngram_range = (1,2) , max_features = 5000)

bow = cv.fit_transform(stemmed_sample)
tfidf = tfidf_vect.fit_transform(stemmed_sample)

bow = Normalizer().fit_transform(bow)
tfidf = Normalizer().fit_transform(tfidf)

print ("Some Features of BOW bi-gram :: \n" , cv.get_feature_names_out()[100:108])
print ("Some Features of TFIDF weighted BOW bi-gram :: \n" , tfidf_vect.get_feature_names_out()[100:108])
print ("="*70)

Some Features of BOW bi-gram :: 
 ['also add' 'also avail' 'also contain' 'also enjoy' 'also good'
 'also great' 'also help' 'also like']
Some Features of TFIDF weighted BOW bi-gram :: 
 ['also add' 'also avail' 'also contain' 'also enjoy' 'also good'
 'also great' 'also help' 'also like']


##### <b><font color = "blue"> Comparing Uni-Gram and Bi-Gram</font></b>

In [42]:
cv11 = CountVectorizer(ngram_range = (1,1))
cv22 = CountVectorizer(ngram_range = (2,2))
cv12 = CountVectorizer(ngram_range = (1,2))

print("Number of features using Uni-Gram only - " , cv11.fit_transform(stemmed_sample).shape[1])
print("Number of features using Bi-Gram only - " , cv22.fit_transform(stemmed_sample).shape[1])
print("Number of features using Uni+bi-Gram -" , cv12.fit_transform(stemmed_sample).shape[1])

Number of features using Uni-Gram only -  23249
Number of features using Bi-Gram only -  487314
Number of features using Uni+bi-Gram - 510563


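The counts show that adding bi-grams increases the number of features drastically: from 23,249 with uni-grams alone to 510,563 with uni-grams and bi-grams combined. We should therefore cap the vocabulary size with the `max_features` parameter, even when using uni-grams only.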
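As a minimal sketch of what `max_features` does, the toy corpus below (three made-up sentences, not taken from the reviews) is vectorized with and without the cap. With the cap, `CountVectorizer` keeps only the terms with the highest total count across the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "good tea good price",
    "bad tea bad taste",
    "good taste great price",
]

# Uni-grams + bi-grams without any cap on the vocabulary
full = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# Same setting, but keep only the 5 most frequent terms
capped = CountVectorizer(ngram_range=(1, 2), max_features=5).fit(docs)

n_full = len(full.vocabulary_)
n_capped = len(capped.vocabulary_)

print("Features without cap :", n_full)
print("Features with cap    :", n_capped)
print("Kept terms           :", sorted(capped.vocabulary_))
```

In this toy corpus every bi-gram occurs only once, so the five retained terms are all uni-grams. On real data, frequent bi-grams compete with uni-grams for the available slots.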

### <b><font color = "darkorange"> [ii].</font>WORD2VEC!</b>

Word2Vec takes the semantic meaning of words into account, whereas none of the techniques above did. It learns relationships between words automatically from raw text.

<b><font color = "red">Keep in mind that pretrained w2v models do not support stemmed words: their vocabularies are built from unstemmed text, so a stem such as "silli" is unlikely to be found.</font></b>

####  <b><font color = "darkgray"> To Download and Load The Pretrained Models</font> </b>

In [43]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors 
import pickle
from gensim.test.utils import datapath
from gensim.scripts.glove2word2vec import glove2word2vec

In [44]:
import gensim.downloader
list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [45]:
# glove_vectors = gensim.downloader.load('glove-twitter-100')

In [46]:
glove_file = datapath('C:/Users/aarat/gensim-data/glove-twitter-100/glove-twitter-100.txt')
model_twt_100 = KeyedVectors.load_word2vec_format(glove_file)

In [47]:
print("MOST SIMILAR WORDS TO MAN :: \n" , model_twt_100.most_similar("man")[0:2])
print("\nSIMILARITY BETWEEN MAN AND WOMAN :: " , model_twt_100.similarity("man" , "woman"))

MOST SIMILAR WORDS TO MAN :: 
 [('boy', 0.7652448415756226), ('dude', 0.7523702383041382)]

SIMILARITY BETWEEN MAN AND WOMAN ::  0.6703952
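The score returned by `similarity` is the cosine similarity between the two word vectors. The sketch below computes the same quantity with NumPy. The three-dimensional vectors are made up for illustration and are not taken from the GloVe model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors (real word vectors have 100+ dimensions)
vec_a = np.array([1.0, 2.0, 0.0])
vec_b = np.array([2.0, 4.0, 0.0])   # same direction as vec_a
vec_c = np.array([0.0, 0.0, 3.0])   # orthogonal to vec_a

sim_same_direction = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)

print(sim_same_direction, sim_orthogonal)
```

Because only the angle matters, two vectors pointing in the same direction score 1 regardless of their lengths, and unrelated (orthogonal) vectors score 0.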


In [48]:
del model_twt_100

####  <b><font color = "darkblue"> Train Your Own w2v Model</font></b>

In [60]:
corpus = []

for review in tqdm(unstemmed_preprocessed_reviews):
    corpus.append(review.split())


w2v_model = Word2Vec(corpus , min_count = 5 , vector_size = 150 , workers = 8)

100%|██████████████████████████████████████████████████████████████████████| 363223/363223 [00:03<00:00, 108232.78it/s]


##### <b><u> <font color = "darkorange">Save The Model</font></u></b>

In [61]:
w2v_model.wv.save_word2vec_format('w2v_model.bin', binary=True)

##### <b><u> <font color = "darkorange">Load The Model</font></u></b>

In [62]:
w2v_model = KeyedVectors.load_word2vec_format("w2v_model.bin" , binary = True)

##### <b><u> <font color = "darkorange"> Performance of trained w2v Model</font></u></b>

In [63]:
vocabulary = w2v_model.key_to_index

In [65]:
print(w2v_model.most_similar('good')[:5])
print('\n')
print("Similarity between 'good' and 'tasty' = " , w2v_model.similarity('good','tasty'))
print('\n')
print(w2v_model.most_similar('smell')[:5])

[('great', 0.7840190529823303), ('decent', 0.747307300567627), ('awesome', 0.6455490589141846), ('bad', 0.6423206329345703), ('fantastic', 0.6403504610061646)]


Similarity between 'good' and 'tasty' =  0.5088861


[('scent', 0.7904716730117798), ('smells', 0.7542111873626709), ('odor', 0.7500920295715332), ('smelled', 0.7453361749649048), ('smelling', 0.736371636390686)]


In [66]:
print(len(w2v_model.key_to_index)) # Total number of words in our vocabulary
print(type(w2v_model.key_to_index))
print("Some words :",list(w2v_model.key_to_index)[1000:1010])

33588
<class 'dict'>
Some words : ['everyday', 'standard', 'shampoo', 'known', 'peach', 'various', 'brewing', 'was', 'expiration', 'jelly']


####  <b><font color = "darkgreen">(1).</font> Converting All Reviews To Vectors Using Avg-W2V</b>

In [69]:
avg_w2v_review_vectors = []

for review in tqdm(unstemmed_preprocessed_reviews):  # 363,223 reviews
    vec = np.zeros(150) # as we created 150 dimensional w2v
    count = 0
    for word in review.split():
        if word in vocabulary:
            vec += w2v_model[word]
            count += 1
    if count > 0:
        vec /= count
    
    avg_w2v_review_vectors.append(vec)
    
avg_w2v_review_vectors = np.array(avg_w2v_review_vectors)
avg_w2v_review_vectors = Normalizer().fit_transform(avg_w2v_review_vectors)

print("DIMENSIONS OF REVIEW VECTORS ::" , avg_w2v_review_vectors.shape)

100%|████████████████████████████████████████████████████████████████████████| 363223/363223 [01:06<00:00, 5500.52it/s]


DIMENSIONS OF REVIEW VECTORS :: (363223, 150)
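To make the averaging step easier to follow, here is a small self-contained sketch of the same idea on toy data. The function name `average_word_vectors` and the two-dimensional `toy_embeddings` dictionary are hypothetical stand-ins for the trained model. Words missing from the embedding table are skipped, as in the loop above.

```python
import numpy as np

def average_word_vectors(tokens, embeddings, dim):
    """Mean of the vectors of all in-vocabulary tokens; zeros if none are known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional embeddings, invented for illustration
toy_embeddings = {
    "good": np.array([1.0, 0.0]),
    "tea":  np.array([0.0, 1.0]),
}

# "unknownword" is out of vocabulary and is ignored
review_vec = average_word_vectors("good tea unknownword".split(), toy_embeddings, dim=2)

# A review with no known words falls back to the zero vector
empty_vec = average_word_vectors(["zzz"], toy_embeddings, dim=2)

print(review_vec, empty_vec)
```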


####  <b><font color = "darkgreen">(2).</font> Converting All Reviews To Vectors Using TF-IDF Weighted Avg-W2V</b>

In [71]:
tfidf_vect = TfidfVectorizer(max_features = 5000)

tfidf = tfidf_vect.fit_transform(unstemmed_preprocessed_reviews)
# tfidf = Normalizer().fit_transform(tfidf)
tfidf_feats = list(tfidf_vect.get_feature_names_out())  # a list, because .index() is used on it below



In [73]:
tfidf_w2v_review_vectors = []

row = 0
for review in tqdm(unstemmed_preprocessed_reviews):
    review_vec = np.zeros(150)
    weighted_sum = 0
    
    for word in review.split():
        if word in vocabulary and word in tfidf_feats:
            
            vec = w2v_model[word]
            tfidf_val = tfidf[row , tfidf_feats.index(word)]
            
            review_vec += (vec * tfidf_val)
            weighted_sum += tfidf_val
            
    if weighted_sum > 0:
        review_vec /= weighted_sum
        
    tfidf_w2v_review_vectors.append(review_vec)
    row += 1

tfidf_w2v_review_vectors = np.array(tfidf_w2v_review_vectors)
tfidf_w2v_review_vectors = Normalizer().fit_transform(tfidf_w2v_review_vectors)

print("DIMENSIONS OF REVIEW VECTORS ::" , tfidf_w2v_review_vectors.shape)

100%|█████████████████████████████████████████████████████████████████████████| 363223/363223 [31:57<00:00, 189.39it/s]


DIMENSIONS OF REVIEW VECTORS :: (363223, 150)
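The loop above takes roughly half an hour, most likely because `word in tfidf_feats` and `tfidf_feats.index(word)` both scan a 5,000-element list for every word. A possible speed-up, sketched below on toy data, is to look up the column through a dictionary. In the notebook that dictionary could be the fitted vectorizer's `vocabulary_` attribute, which maps each term to its column index. The function name, the toy embeddings, and the toy TF-IDF matrix are assumptions made for illustration.

```python
import numpy as np

def tfidf_weighted_vector(tokens, row, tfidf_matrix, word_to_col, embeddings, dim):
    """TF-IDF weighted average of word vectors for one review (one matrix row)."""
    vec = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        col = word_to_col.get(tok)            # O(1) dictionary lookup
        if col is None or tok not in embeddings:
            continue                          # skip words unknown to either model
        weight = tfidf_matrix[row, col]
        vec += weight * embeddings[tok]
        weight_sum += weight
    return vec / weight_sum if weight_sum > 0 else vec

# Hypothetical toy data standing in for w2v_model, tfidf and tfidf_vect.vocabulary_
toy_embeddings = {"good": np.array([1.0, 0.0]), "tea": np.array([0.0, 1.0])}
word_to_col = {"good": 0, "tea": 1}
toy_tfidf = np.array([[0.75, 0.25]])          # one review, two terms

weighted_vec = tfidf_weighted_vector(
    ["good", "tea", "oov"], 0, toy_tfidf, word_to_col, toy_embeddings, dim=2
)
print(weighted_vec)
```

The arithmetic is unchanged from the loop above; only the lookup strategy differs, so the resulting vectors should match while the per-word cost no longer grows with the vocabulary size.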


<h3><b><center>End</center></b></h3>