DATASET : https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

EDA : https://nycdatascience.com/blog/student-works/exploratory-data-visualization-of-amazon-fine-food-reviews/

- This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. - Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

**Contents**
- Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite
- database.sqlite: Contains the table 'Reviews'

Data includes:

- Number of reviews: 525,814
- Number of users: 256,059
- Number of products: 74,258
- Timespan: Oct 1999 - Oct 2012
- Number of Attributes/Columns in data: 10

**Attribute Information:**

- Id - Row Id
- ProductId - unique identifier for the product
- UserId - unqiue identifier for the user
- ProfileName - Profile name of the user
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review

**Objective:**
- Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

In [1]:
import warnings
warnings.filterwarnings("ignore")

import re

import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)
import matplotlib.pyplot as plt
import seaborn as sns

import nltk                                         #Natural language processing tool-kit
from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF
from gensim.models import Word2Vec                                   #For Word2Vec
from gensim.models import KeyedVectors

import os
import sqlite3
from tqdm import tqdm
from bs4 import BeautifulSoup

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

## 1. Reading Data

In [2]:
# using the SQLite Table to read data.

con = sqlite3.connect('database.sqlite')
tables = pd.read_sql_query("SELECT NAME AS 'Table_Name' FROM sqlite_master WHERE type='table'",con)
tables = tables["Table_Name"].values.tolist()
print(tables)


['Reviews']


In [3]:
data = pd.read_sql_query(""" SELECT * FROM Reviews""", con)
type(data)

pandas.core.frame.DataFrame

In [4]:
data.shape

(568454, 10)

In [5]:
# using the csv file to read data.
df=pd.read_csv("Reviews.csv")
df.head(2)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."


In [6]:
df.shape

(568454, 10)

In [7]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [8]:
df['ProductId'].nunique()

74258

In [9]:
df['UserId'].nunique()

256059

In [10]:
df['Time'].min(),df['Time'].max()

(939340800, 1351209600)

In [11]:
import datetime
your_unix_timestamp = 939340800
formatted_date = datetime.datetime.fromtimestamp(your_unix_timestamp).strftime('%Y-%m-%d %H:%M:%S')
print(f"Minimum Unix timestamp {your_unix_timestamp} corresponds to: {formatted_date}")

Minimum Unix timestamp 939340800 corresponds to: 1999-10-08 05:30:00


In [12]:
import datetime
your_unix_timestamp = 1351209600
formatted_date = datetime.datetime.fromtimestamp(your_unix_timestamp).strftime('%Y-%m-%d %H:%M:%S')
print(f"Maximum Unix timestamp {your_unix_timestamp} corresponds to: {formatted_date}")

Maximum Unix timestamp 1351209600 corresponds to: 2012-10-26 05:30:00


In [13]:
df['Score'].value_counts()

5    363122
4     80655
1     52268
3     42640
2     29769
Name: Score, dtype: int64

# 2. Data Cleaning

**Objective**
- To predict whether a review is Positive or Negative based on the Text.

**Observations**
- if we see the Score column, it has values 1,2,3,4,5 . Considering 1, 2 as Negative reviews and 4, 5 as Positive reviews. For Score = 3, we will consider it as Neutral review and lets delete the rows that are neutral, so that we can predict either Positive or Negative

- HelfulnessNumerator says about number of people found that review usefull and HelpfulnessDenominator is about usefull review count + not so usefull count. So, from this we can see that HelfulnessNumerator is always less than or equal to HelpfulnesDenominator.

In [14]:
df.shape

(568454, 10)

In [15]:
df[df['HelpfulnessNumerator']>df['HelpfulnessDenominator']]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
44736,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,"It was almost a 'love at first bite' - the perfectly roasted almond with a nice thin layer of pure flavorful cocoa on the top.<br /><br />You can smell the cocoa as soon as you open the canister - making you want to take a bite.<br /><br />You may or may not like the taste of this cocoa roasted almonds depending on your likingness for cocoa. We are so much used to the taste of chocolate (which is actually cocoa + many other ingredients like milk ...) - that you might have never really tasted really cocoa.<br /><br />Tasting this item it like tasting and enjoying flavorful pure raw cocoa with crunchy almonds in the center. Get yourself a box and see for yourself what real cocoa + almonds is !<br /><br />Where this product loses a star is in its packaging - the external sleeve is kind of comes in one piece, so if you try to remove the lid, the external sleeve kind of tends to come off fully - so careful when you are removing the external sleeve for the canister."
64421,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate ordering this for him. He says they are great. I have tried them myself and they are delicious. Just open and pop them in the microwave. It is very easy. The best thing about ordering from Amazon grocery is that they deliver to your door. If you have a loved one that lives far away and may have limited transportation this is the answer. Just order what you want them to have and Amazon takes care of the rest.


In [16]:
df=df[df['Score']!=3]
df.shape

(525814, 10)

In [17]:
df['Score'].value_counts()

5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64

In [18]:
# 1 for positive reviews and 0 for negative reviews
df['Score']=df['Score'].apply(lambda x:1 if x>3 else 0)

In [19]:
df['Score'].value_counts()

1    443777
0     82037
Name: Score, dtype: int64

In [20]:
df.shape

(525814, 10)

In [21]:
df[df['HelpfulnessNumerator']>df['HelpfulnessDenominator']]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
44736,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,1,1212883200,Pure cocoa taste with crunchy almonds inside,"It was almost a 'love at first bite' - the perfectly roasted almond with a nice thin layer of pure flavorful cocoa on the top.<br /><br />You can smell the cocoa as soon as you open the canister - making you want to take a bite.<br /><br />You may or may not like the taste of this cocoa roasted almonds depending on your likingness for cocoa. We are so much used to the taste of chocolate (which is actually cocoa + many other ingredients like milk ...) - that you might have never really tasted really cocoa.<br /><br />Tasting this item it like tasting and enjoying flavorful pure raw cocoa with crunchy almonds in the center. Get yourself a box and see for yourself what real cocoa + almonds is !<br /><br />Where this product loses a star is in its packaging - the external sleeve is kind of comes in one piece, so if you try to remove the lid, the external sleeve kind of tends to come off fully - so careful when you are removing the external sleeve for the canister."
64421,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,1,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate ordering this for him. He says they are great. I have tried them myself and they are delicious. Just open and pop them in the microwave. It is very easy. The best thing about ordering from Amazon grocery is that they deliver to your door. If you have a loved one that lives far away and may have limited transportation this is the answer. Just order what you want them to have and Amazon takes care of the rest.


In [22]:
df1=df[df['HelpfulnessNumerator']<=df['HelpfulnessDenominator']]
df1.shape

(525812, 10)

In [23]:
# Find duplicates based on '"UserId","ProfileName","Time","Text" columns
duplicates = df1[df1.duplicated(subset=["UserId","ProfileName","Time","Text"],keep=False)].sort_values(["UserId","ProfileName","Time","Text"])
duplicates


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
83317,83318,B005ZBZLT4,#oc-R115TNMSPFT9I7,Breyton,2,3,0,1331510400,"""Green"" K-cup packaging sacrifices flavor","Overall its just OK when considering the price of other K-cups. The SF Coffee K-cups do not look like other K-cups in that they do not have the plastic base which contains and seals around the coffee filter. This exposes the coffee to air and faster oxidation. When drinking the coffee it lacks a fresh flavor and tastes a bit stale. The K-cups come in a sealed plastic bag, but once its opened, the k-cups are exposed to air and they quickly go stale.<br /><br />The coffee flavor is a bit weak as well. The lack of the plastic housing around the coffee filter, in my opinion, causes the water to flow thru the coffee too quickly and contributes to a weak flavor.<br /><br />I don't think I would buy these again."
180871,180872,B007Y59HVM,#oc-R115TNMSPFT9I7,Breyton,2,3,0,1331510400,"""Green"" K-cup packaging sacrifices flavor","Overall its just OK when considering the price of other K-cups. The SF Coffee K-cups do not look like other K-cups in that they do not have the plastic base which contains and seals around the coffee filter. This exposes the coffee to air and faster oxidation. When drinking the coffee it lacks a fresh flavor and tastes a bit stale. The K-cups come in a sealed plastic bag, but once its opened, the k-cups are exposed to air and they quickly go stale.<br /><br />The coffee flavor is a bit weak as well. The lack of the plastic housing around the coffee filter, in my opinion, causes the water to flow thru the coffee too quickly and contributes to a weak flavor.<br /><br />I don't think I would buy these again."
290947,290948,B005HG9ESG,#oc-R11D9D7SHXIJB9,"Louis E. Emory ""hoppy""",0,0,1,1342396800,Muscle spasms,"My wife has recurring extreme muscle spasms, usually late at night or early morning. We started to use this water for her about 6 months ago. It has been the best purchase we have ever made [I am not kidding]. I haven't noticed anyone mentioning this benefit, but I can assure you it works!! We ran out a couple of days ago and had to wait and do without for about 3 days. She had awful muscle spasms in both legs last night. Back to where she was before we starting using this water. We were really thankful that we received 3 cases this morning. I have vowed ""as god is my witness I will never run out of Essentia water again"" {from the movie Gone with the wind, ha ha]. But for real, we are now ordering 4 case per month so that we will never run out again."
455533,455534,B005HG9ERW,#oc-R11D9D7SHXIJB9,"Louis E. Emory ""hoppy""",0,0,1,1342396800,Muscle spasms,"My wife has recurring extreme muscle spasms, usually late at night or early morning. We started to use this water for her about 6 months ago. It has been the best purchase we have ever made [I am not kidding]. I haven't noticed anyone mentioning this benefit, but I can assure you it works!! We ran out a couple of days ago and had to wait and do without for about 3 days. She had awful muscle spasms in both legs last night. Back to where she was before we starting using this water. We were really thankful that we received 3 cases this morning. I have vowed ""as god is my witness I will never run out of Essentia water again"" {from the movie Gone with the wind, ha ha]. But for real, we are now ordering 4 case per month so that we will never run out again."
496893,496894,B005HG9ET0,#oc-R11D9D7SHXIJB9,"Louis E. Emory ""hoppy""",0,0,1,1342396800,Muscle spasms,"My wife has recurring extreme muscle spasms, usually late at night or early morning. We started to use this water for her about 6 months ago. It has been the best purchase we have ever made [I am not kidding]. I haven't noticed anyone mentioning this benefit, but I can assure you it works!! We ran out a couple of days ago and had to wait and do without for about 3 days. She had awful muscle spasms in both legs last night. Back to where she was before we starting using this water. We were really thankful that we received 3 cases this morning. I have vowed ""as god is my witness I will never run out of Essentia water again"" {from the movie Gone with the wind, ha ha]. But for real, we are now ordering 4 case per month so that we will never run out again."
...,...,...,...,...,...,...,...,...,...,...
231423,231424,B003FDC2I2,AZZU1VEO8KUXH,"Mia P ""Mia P""",1,1,1,1317513600,NOT like the others,"I bought this for my 13 year old daughter who needs extra calories. All of the other drinks (for kids and adults) seemed to have an odd taste and/or after-taste. I took a sip and thought it had a good flavor. The consistency was nice and smooth, which I like. My daughter drank the entire container and told me she liked it. We'll be purchasing this product again with the automatic reorder program since it costs less per container to order this way."
294984,294985,B005V9UG18,AZZU1VEO8KUXH,"Mia P ""Mia P""",1,1,1,1317513600,NOT like the others,"I bought this for my 13 year old daughter who needs extra calories. All of the other drinks (for kids and adults) seemed to have an odd taste and/or after-taste. I took a sip and thought it had a good flavor. The consistency was nice and smooth, which I like. My daughter drank the entire container and told me she liked it. We'll be purchasing this product again with the automatic reorder program since it costs less per container to order this way."
404100,404101,B003FDG4K4,AZZU1VEO8KUXH,"Mia P ""Mia P""",1,1,1,1317513600,NOT like the others,"I bought this for my 13 year old daughter who needs extra calories. All of the other drinks (for kids and adults) seemed to have an odd taste and/or after-taste. I took a sip and thought it had a good flavor. The consistency was nice and smooth, which I like. My daughter drank the entire container and told me she liked it. We'll be purchasing this product again with the automatic reorder program since it costs less per container to order this way."
361526,361527,B0029XITW2,AZZU4D6TZ2L6J,"Sherry King ""llamasmama""",2,2,1,1247875200,cheese,My father thought this was the best cheese ever. The cheese got the cheese just in time for Father's day. Thanks



 - Reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.

In [24]:
 #Ashis Kumar Sahu

In [25]:
#Sorting data according to ProductId in ascending order
sorted_data=df1.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(364171, 10)

In [26]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(df1['Id'].size*1.0)*100

69.25878450853156

In [27]:
final.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

- After removing duplicate entries, we have around 69% data

In [28]:
import re

def apply_mask_summary(filtered_data, regex_string):
    # Use na=False to handle NaN values
    mask = filtered_data['Summary'].str.lower().str.contains(regex_string, na=False)
    return filtered_data[mask]

# Apply the function with the regex pattern to check for 'book'
final1 = apply_mask_summary(final, r"\bbook\b")

# View the filtered result
final1.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
150525,150526,6641040,A3E9QZFE9KXH8J,R. Mitchell,11,18,0,1129507200,awesome book poor size,This is one of the best children's books ever written but it is a mini version of the book and was not portrayed as one. It is over priced for the product. I sent an email regarding my bewilderment to Amazon and got no response.
150523,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,"this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses: i love all the new words this book introduces and the silliness of it all. this is a classic book i am willing to bet my son will STILL be able to recite from memory when he is in college"
150520,150521,6641040,A3RMCRB2NDTDYP,Carol Carruthers,0,0,1,1243468800,This book is great!,"My 7 year old daughter brought this book home from the school library. It was a little easy for her reading skills, but she loved it anyways. My 4 year old daughter started reading it and now we can't get her to return it to school. The book is small but this is better for little hands. This is a great book for any age! The pictures are cute and go well with the writing. The rhyming makes it easy to get to know the months. I recommend this book to every parent!"
150518,150519,6641040,A12HY5OZ2QNK4N,Elizabeth H. Roessner,0,0,1,1256774400,It's a great book!,"I've always loved chicken soup and rice. My late great-grandmother, Ethel, always made me homemade chicken, chicken soup and rice. This book takes me back to the days my mother, my father, my sister, and I went to Ethel's house. My late great-grandfather, Isadore, would cook the chicken because Ethel was blind. So, it reminds me of the time we were all together as a family. It brings back happy memories of all the love we shared over bowls of hot soup."
150517,150518,6641040,AK1L4EJBA23JF,L. M. Kraus,0,0,1,1288224000,love this book,"Great book, perfect condition arrived in a short amount of time, long before the expected delivery date"


In [29]:
final1.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [30]:
final1.shape

(52, 10)

In [31]:
final.shape

(364171, 10)

In [32]:
import re

def apply_mask_summary(filtered_data, regex_string):
    # Use na=False to handle NaN values
    mask = filtered_data['Summary'].str.lower().str.contains(regex_string, na=False)
    # Return the DataFrame with rows that do NOT match the mask
    return filtered_data[~mask]

# Apply the function to drop rows with 'book'
final2 = apply_mask_summary(final, r"\bbook\b")

# View the filtered result
final2.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
150510,150511,6641040,A1C9K534BCI9GO,Laura Purdie Salas,0,0,1,1344211200,Charming and childlike,"A charming, rhyming book that describes the circumstances under which you eat (or don't) chicken soup with rice, month-by-month. This sounds like the kind of thing kids would make up while they're out of recess and sing over and over until they drive the teachers crazy. It's cute and catchy and sounds really childlike but is skillfully written."
150524,150525,6641040,A2QID6VCFTY51R,Rick,1,2,1,1025481600,"In December it will be, my snowman's anniversary...","My daughter loves all the ""Really Rosie"" books. She was introduced to the Really Rosie CD performed by Carole King (also available on Amazon!) on her 1st Birthday and now, a year later, she knows all the songs. As far as the books go, we own: One Was Johnny, Alligators All Around, & Chicken Soup w/Rice. These books are well written with clever art work by Maurice Sendak. Plus, they are really cheap!! Highly recommended :)"
150522,150523,6641040,A2P4F2UO0UMP8C,"Elizabeth A. Curry ""Lovely Librarian""",0,0,1,1096675200,MMMM chicken soup....,"Summary: A young boy describes the usefulness of chicken soup with rice for each month of the year.<br /><br />Evaluation: With Sendak's creative repetitious and rhythmic words, children will enjoy and learn to read the story of a boy who loves chicken soup with rice! Through Sendak's catchy story, children will also learn the months of the year, as well as what seasons go with what month! They learn to identify ice-skating and snowmen in the winter; strong wind in March; birds and flowers in the spring; swimming and hot temperatures in the summer; and finally different holidays throughout the year. Such as Halloween in October, and Christmas in December.<br /><br />Sendak's simple three colored crayon-like drawings are a perfect addition to his educational and entertaining story.<br /><br />A great activity that you can do with this book is to have children draw their own illustrations for each month of the year. Afterwards you can bind the pages together so the children can create their own book."
150521,150522,6641040,A1S3C5OFU508P3,Charles Ashbacher,0,0,1,1219536000,Children will find it entertaining and a generator of giggles,"This book contains a collection of twelve short statements, all of which end with the phrase ""Chicken Soup with Rice."" Each one is based on a month of the year and they have some elements of nonsense verse. For that reason, children will find them entertaining and they will generate a few giggles."
150519,150520,6641040,ADBFSA9KTQANE,"James L. Hammock ""Pucks Buddy""",0,0,1,1256688000,Great Gift,This book was purchased as a birthday gift for a 4 year old boy. He squealed with delight and hugged it when told it was his to keep and he did not have to return it to the library.


In [33]:
import re

def apply_mask_summary(filtered_data, regex_string):
    # Use na=False to handle NaN values
    mask = filtered_data['Text'].str.lower().str.contains(regex_string, na=False)
    # Return the DataFrame with rows that do NOT match the mask
    return filtered_data[mask]

# Apply the function to drop rows with 'book'
final2 = apply_mask_summary(final, r"\bbook\b")

# View the filtered result
final2.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
150510,150511,6641040,A1C9K534BCI9GO,Laura Purdie Salas,0,0,1,1344211200,Charming and childlike,"A charming, rhyming book that describes the circumstances under which you eat (or don't) chicken soup with rice, month-by-month. This sounds like the kind of thing kids would make up while they're out of recess and sing over and over until they drive the teachers crazy. It's cute and catchy and sounds really childlike but is skillfully written."
150525,150526,6641040,A3E9QZFE9KXH8J,R. Mitchell,11,18,0,1129507200,awesome book poor size,This is one of the best children's books ever written but it is a mini version of the book and was not portrayed as one. It is over priced for the product. I sent an email regarding my bewilderment to Amazon and got no response.
150523,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,"this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses: i love all the new words this book introduces and the silliness of it all. this is a classic book i am willing to bet my son will STILL be able to recite from memory when he is in college"
150522,150523,6641040,A2P4F2UO0UMP8C,"Elizabeth A. Curry ""Lovely Librarian""",0,0,1,1096675200,MMMM chicken soup....,"Summary: A young boy describes the usefulness of chicken soup with rice for each month of the year.<br /><br />Evaluation: With Sendak's creative repetitious and rhythmic words, children will enjoy and learn to read the story of a boy who loves chicken soup with rice! Through Sendak's catchy story, children will also learn the months of the year, as well as what seasons go with what month! They learn to identify ice-skating and snowmen in the winter; strong wind in March; birds and flowers in the spring; swimming and hot temperatures in the summer; and finally different holidays throughout the year. Such as Halloween in October, and Christmas in December.<br /><br />Sendak's simple three colored crayon-like drawings are a perfect addition to his educational and entertaining story.<br /><br />A great activity that you can do with this book is to have children draw their own illustrations for each month of the year. Afterwards you can bind the pages together so the children can create their own book."
150521,150522,6641040,A1S3C5OFU508P3,Charles Ashbacher,0,0,1,1219536000,Children will find it entertaining and a generator of giggles,"This book contains a collection of twelve short statements, all of which end with the phrase ""Chicken Soup with Rice."" Each one is based on a month of the year and they have some elements of nonsense verse. For that reason, children will find them entertaining and they will generate a few giggles."


In [34]:
final2.shape

(1419, 10)

## Drop all the rows where the word "book" or "books" appears in the 'Summary' or 'Text' column

In [35]:
import re

def drop_rows_with_book(filtered_data, regex_string):
    # Create a mask for 'Summary' column containing 'book'
    mask_summary = filtered_data['Summary'].str.lower().str.contains(regex_string, na=False)
    
    # Create a mask for 'Text' column containing 'book'
    mask_text = filtered_data['Text'].str.lower().str.contains(regex_string, na=False)
    
    # Return DataFrame where either 'Text' or 'Summary' doesn't contain 'book'
    return filtered_data[~(mask_summary | mask_text)]

# Apply the function to drop rows where either 'Text' or 'Summary' contains 'book'
final3 = drop_rows_with_book(final, r"\bbooks?\b")
# View the filtered result
final3.head()

# \b represents a word boundary, ensuring we match full words.
# s? ensures the regex matches "book" with an optional "s" (for "books").


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
150506,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their months of the year! We will learn all of the poems throughout the school year. they like the handmotions which I invent for each poem.
150504,150505,0006641040,A2PTSM496CF40Z,"Jason A. Teeple ""Nobody made a greater mistak...",1,1,1,1210809600,A classic,"Get the movie or sound track and sing along with Carol King. This is great stuff, my whole extended family knows these songs by heart. Quality kids storytelling and music."
150503,150504,0006641040,AQEYF1AXARWJZ,"Les Sinclair ""book maven""",1,1,1,1212278400,Chicken Soup with Rice,"A very entertaining rhyming story--cleaver and catchy.The illustrations are imaginative and fit right in. However, the paperback is somewhat small and flimsy. I'd opt for a bigger edition."
515425,515426,141278509X,AB1A5EGHHVA9M,CHelmic,1,1,1,1332547200,The best drink mix,"This product by Archer Farms is the best drink mix ever. Just mix a flavored packet with your 16 oz. water bottle. Contains the all natural sweetner Stevia, real fruit flavoring and no food coloring. Just colored with fruit or vegetable colors. Pure and natural and tastes great. There are eight packets in a box and only contains 10 calories per packet. Thank you Archer Farms!"
24750,24751,2734888454,A1C298ITT645B6,Hugh G. Pritchard,0,0,1,1195948800,Dog Lover Delites,Our dogs just love them. I saw them in a pet store and a tag was attached regarding them being made in China and it satisfied me that they were safe.


In [36]:
final3.shape

(362388, 10)

In [37]:
#Checking to see how much % of data still remains
(final3['Id'].size*1.0)/(df1['Id'].size*1.0)*100

68.91968992719832

# [3].  Text Preprocessing.

Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

1. Begin by removing the html tags
2. Remove any punctuations or limited set of special characters like , or . or # etc.
3. Check if the word is made up of english letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)
5. Convert the word to lowercase
6. Remove Stopwords
7. Finally Snowball Stemming the word (it was obsereved to be better than Porter Stemming)<br>

After which we collect the words used to describe positive and negative reviews

In [49]:
# https://github.com/stopwords-iso/stopwords-en/blob/master/stopwords-en.txt
# we are removing the words from the stop words list: 'no', 'nor', 'not',"shouldn", "shouldn't", "shouldnt",  "wasn't","wasnt", "weren", 
#   "weren't",  "werent", "won", "won't", "wont","wouldn", "wouldn't", "wouldnt","aren", "aren't", "arent", "can", "can't", "cannot", "cant", 
# "couldn", "couldn't", "couldnt",    

# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

# List of words
words = [
    "'ll", "'tis", "'twas", "'ve", "10", "39", "a", "a's", "able", "ableabout", "about", "above", "abroad", 
    "abst", "accordance", "according", "accordingly", "across", "act", "actually", "ad", "added", "adj", 
    "adopted", "ae", "af", "affected", "affecting", "affects", "after", "afterwards", "ag", "again", 
    "against", "ago", "ah", "ahead", "ai", "ain't", "aint", "al", "all", "allow", "allows", "almost", 
    "alone", "along", "alongside", "already", "also", "although", "always", "am", "amid", "amidst", 
    "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", 
    "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "ao", "apart", 
    "apparently", "appear", "appreciate", "appropriate", "approximately", "aq", "ar", "are", "area", 
    "areas", "arise", "around", "arpa", "as", "aside", "ask", "asked", 
    "asking", "asks", "associated", "at", "au", "auth", "available", "aw", "away", "awfully", "az", 
    "b", "ba", "back", "backed", "backing", "backs", "backward", "backwards", "bb", "bd", "be", 
    "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "began", 
    "begin", "beginning", "beginnings", "begins", "behind", "being", "beings", "believe", "below", 
    "beside", "besides", "best", "better", "between", "beyond", "bf", "bg", "bh", "bi", "big", 
    "bill", "billion", "biol", "bj", "bm", "bn", "bo", "both", "bottom", "br", "brief", "briefly", 
    "bs", "bt", "but", "buy", "bv", "bw", "by", "bz", "c", "c'mon", "c's", "ca", "call", "came", 
    "caption", "case", "cases", "cause", "causes", "cc", "cd", 
    "certain", "certainly", "cf", "cg", "ch", "changes", "ci", "ck", "cl", "clear", "clearly", 
    "click", "cm", "cmon", "cn", "co", "co.", "com", "come", "comes", "computer", "con", "concerning", 
    "consequently", "consider", "considering", "contain", "containing", "contains", "copy", 
    "corresponding", "could", "could've", "course", "cr", "cry", 
    "cs", "cu", "currently", "cv", "cx", "cy", "cz", "d", "dare", "daren't", "darent", "date", 
    "de", "dear", "definitely", "describe", "described", "despite", "detail", "did", "didn", 
    "didn't", "didnt", "differ", "different", "differently", "directly", "dj", "dk", "dm", "do", 
    "does", "doesn", "doesn't", "doesnt", "doing", "don", "don't", "done", "dont", "doubtful", 
    "down", "downed", "downing", "downs", "downwards", "due", "during", "dz", "e", "each", "early", 
    "ec", "ed", "edu", "ee", "effect", "eg", "eh", "eight", "eighty", "either", "eleven", "else", 
    "elsewhere", "empty", "end", "ended", "ending", "ends", "enough", "entirely", "er", "es", 
    "especially", "et", "et-al", "etc", "even", "evenly", "ever", "evermore", "every", "everybody", 
    "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "f", "face", 
    "faces", "fact", "facts", "fairly", "far", "farther", "felt", "few", "fewer", "ff", "fi", 
    "fifteen", "fifth", "fifty", "fify", "fill", "find", "finds", "fire", "first", "five", "fix", 
    "fj", "fk", "fm", "fo", "followed", "following", "follows", "for", "forever", "former", 
    "formerly", "forth", "forty", "forward", "found", "four", "fr", "free", "from", "front", 
    "full", "fully", "further", "furthered", "furthering", "furthermore", "furthers", "fx", "g", 
    "ga", "gave", "gb", "gd", "ge", "general", "generally", "get", "gets", "getting", "gf", "gg", 
    "gh", "gi", "give", "given", "gives", "giving", "gl", "gm", "gmt", "gn", "go", "goes", 
    "going", "gone", "good", "goods", "got", "gotten", "gov", "gp", "gq", "gr", "great", 
    "greater", "greatest", "greetings", "group", "grouped", "grouping", "groups", "gs", "gt", 
    "gu", "gw", "gy", "h", "had", "hadn't", "hadnt", "half", "happens", "hardly", "has", "hasn", 
    "hasn't", "hasnt", "have", "haven", "haven't", "havent", "having", "he", "he'd", "he'll", 
    "he's", "hed", "hell", "hello", "help", "hence", "her", "here", "here's", "hereafter", 
    "hereby", "herein", "heres", "hereupon", "hers", "herself", "herse”", "hes", "hi", "hid", 
    "high", "higher", "highest", "him", "himself", "himse”", "his", "hither", "hk", "hm", "hn", 
    "home", "homepage", "hopefully", "how", "how'd", "how'll", "how's", "howbeit", "however", 
    "hr", "ht", "htm", "html", "http", "hu", "hundred", "i", "i'd", "i'll", "i'm", "i've", 
    "i.e.", "id", "ie", "if", "ignored", "ii", "il", "ill", "im", "immediate", "immediately", 
    "importance", "important", "in", "inasmuch", "inc", "inc.", "indeed", "index", "indicate", 
    "indicated", "indicates", "information", "inner", "inside", "insofar", "instead", "int", 
    "interest", "interested", "interesting", "interests", "into", "invention", "inward", "io", 
    "iq", "ir", "is", "isn", "isn't", "isnt", "it", "it'd", "it'll", "it's", "itd", "itll", 
    "its", "itself", "itse”", "ive", "j", "je", "jm", "jo", "join", "jp", "just", "k", "ke", 
    "keep", "keeps", "kept", "keys", "kg", "kh", "ki", "kind", "km", "kn", "knew", "know", 
    "known", "knows", "kp", "kr", "kw", "ky", "kz", "l", "la", "large", "largely", "last", 
    "latter", "latterly", "latest", "lazily", "lb", "le", "least", "less", "lest", "let", 
    "let's", "letting", "li", "like", "liked", "likewise", "lil", "ll", "lm", "ln", "lo", 
    "long", "longer", "longest", "look", "looking", "looks", "lp", "lt", "lu", "ltd", "ly", 
    "m", "ma", "made", "mainly", "make", "makes", "making", "many", "may", "maybe", "me", 
    "mean", "meaning", "means", "meant", "meantime", "meanwhile", "more", "moreover", "most", 
    "mostly", "much", "must", "mustn't", "mustnt", "my", "myself", "n", "na", "name", "named", 
    "namely", "narrative", "narratives", "narrator", "narrators", "ne", "near", "nearly", 
    "necessarily", "necessary", "need", "needn", "needn't", "neednt", "needs", "neither", 
    "never", "nevermore", "new", "newer", "newest", "next", "nf", "ng", "ni", "nine", 
    "ninety", "nobody", "noone", "nothing", "novel", "now", "nowadays", 
    "nowhere", "nu", "number", "numbers", "o", "obviously", "of", "off", "often", "oh", 
    "ok", "okay", "old", "older", "oldest", "on", "once", "one", "ones", "only", "onto", 
    "open", "opened", "opening", "opens", "or", "other", "others", "otherwise", "our", 
    "ours", "ourselves", "out", "over", "overall", "own", "p", "page", "pages", "pair", 
    "pairs", "part", "particular", "particularly", "parts", "pass", "passed", "passing", 
    "pause", "pay", "pb", "pc", "pd", "pe", "people", "per", "perhaps", "pg", "ph", "pm", 
    "point", "pointed", "points", "pm", "policy", "poorly", "possible", "possibly", "praise", 
    "present", "presented", "presenting", "presents", "pretty", "pro", "probably", "problem", 
    "problems", "proceed", "proceeding", "proceeds", "q", "qa", "qe", "query", "question", 
    "questions", "quick", "quickly", "qk", "ql", "qm", "qn", "qo", "qs", "qt", "qu", "quite", 
    "qv", "r", "r's", "ra", "rather", "rd", "re", "referred", "regarding", "regardless", 
    "related", "relating", "relation", "relationship", "relationships", "relatively", "rely", 
    "remain", "remains", "render", "repeatedly", "represents", "result", "results", "return", 
    "returns", "rf", "rg", "rh", "ri", "right", "rl", "rm", "rn", "ro", "round", "rs", 
    "rt", "ru", "rv", "rw", "s", "s's", "sa", "same", "saw", "say", "saying", "says", 
    "sc", "second", "secondly", "see", "seeing", "seem", "seemed", "seeming", "seems", 
    "seen", "self", "selves", "sense", "sent", "separate", "seriously", "seven", "seventy", 
    "several", "shall", "shan", "shan't", "shant", "she", "she'd", "she'll", "she's", "shes", 
    "short", "should", "should've", "show", "showed", 
    "showing", "shown", "shows", "si", "side", "significantly", "similar", "similarly", 
    "since", "six", "sixty", "sk", "sl", "slightly", "so", "some", "somebody", "someone", 
    "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", 
    "specifically", "speech", "spoke", "spoken", "ss", "st", "stating", "still", "stop", 
    "stopped", "stopping", "strongly", "such", "suddenly", "suit", "suitable", "sure", 
    "t", "t's", "take", "taken", "takes", "talk", "talked", "talking", "tall", "tc", 
    "tell", "tells", "than", "thank", "thanks", "that", "that'll", "the", "their", "theirs", 
    "them", "themselves", "then", "there", "there's", "thereafter", "thereby", "therein", 
    "thereof", "thereon", "theres", "these", "they", "they'd", "they'll", "they're", "they've", 
    "thick", "thin", "third", "this", "thorough", "thoroughly", "those", "thou", "though", 
    "thousand", "three", "through", "throughout", "thru", "thus", "ti", "to", "together", 
    "too", "toward", "towards", "tr", "tra", "try", "trying", "twelve", "twenty", "two", 
    "u", "ua", "ub", "uk", "un", "under", "underneath", "understood", "until", "up", "upon", 
    "upwards", "us", "use", "used", "useful", "uses", "using", "ut", "v", "va", "value", 
    "various", "vc", "ve", "very", "via", "vs", "w", "wa", "want", "wants", "was", 
    "way", "we", "we'd", "we'll", "we're", "we've", "wed", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", 
    "where's", "whereabouts", "wherein", "whereupon", "whether", "which", "while", "whilst", 
    "who", "who'd", "whoever", "whole", "whom", "who's", "whose", "why", "will", "willing", 
    "with", "within", "without",  "would", "would've", "x", "y", "ye", "yes", "yet", "you", "you'd", "you'll", "you're", 
    "you've", "youre", "your", "yours", "yourself", "yourselves", "z"
]


stopwords= set(words)
len(stopwords)

1027

In [39]:
#https://gist.github.com/sebleier/554280
import requests
stopwords_list = requests.get("https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt").content
stopwords1 = set(stopwords_list.decode().splitlines()) 
stopwords1

{'edu',
 'use',
 'fifteen',
 'effect',
 'each',
 "t's",
 'whos',
 'less',
 'because',
 'fill',
 'hr',
 'poorly',
 'possibly',
 'r2',
 'their',
 'sufficiently',
 'ow',
 'describe',
 "they've",
 'try',
 "aren't",
 'iz',
 'system',
 'x2',
 'gr',
 "i'm",
 'pm',
 'specifying',
 'thoroughly',
 'dr',
 'taken',
 'best',
 'so',
 'won',
 'si',
 'six',
 'himself',
 'a3',
 "didn't",
 'hu',
 'may',
 'thoughh',
 'alone',
 'nl',
 'everyone',
 "couldn't",
 'lest',
 'many',
 'se',
 'shall',
 'who',
 'if',
 'wont',
 'up',
 'ix',
 'causes',
 'theyd',
 'selves',
 'some',
 'don',
 'em',
 'while',
 'eighty',
 'once',
 'doing',
 'i6',
 "it'll",
 'ij',
 'cd',
 'fify',
 'hers',
 'saw',
 'ain',
 'ca',
 'hadn',
 'n2',
 'rq',
 'going',
 'indicate',
 'until',
 'gotten',
 'au',
 'ar',
 'wed',
 'out',
 'happens',
 'largely',
 'contains',
 "there've",
 'aren',
 'perhaps',
 'k',
 'kept',
 'seemed',
 'http',
 'as',
 'instead',
 'pc',
 'cs',
 'whither',
 'accordingly',
 'again',
 'shes',
 'cj',
 'ei',
 'tn',
 'com',
 'l

In [40]:
len(stopwords1)

1158

In [41]:
from nltk.corpus import stopwords
import re

# Download stopwords
nltk.download('stopwords')
stop_words3 = set(stopwords.words('english'))
len(stop_words3)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sahua\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


179

In [42]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub("won't", "will not", phrase)
    phrase = re.sub("can\'t", "can not", phrase)

    # general
    phrase = re.sub("n\'t", " not", phrase)
    phrase = re.sub("\'re", " are", phrase)
    phrase = re.sub("\'s", " is", phrase)
    phrase = re.sub("\'d", " would", phrase)
    phrase = re.sub("\'ll", " will", phrase)
    phrase = re.sub("\'t", " not", phrase)
    phrase = re.sub("\'ve", " have", phrase)
    phrase = re.sub("\'m", " am", phrase)
    return phrase

In [43]:
# printing some random reviews
sent_0 = final3['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final3['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final3['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final3['Text'].values[4900]
print(sent_4900)
print("="*50)

This is a fun way for children to learn their months of the year!  We will learn all of the poems throughout the school year.  they like the handmotions which I invent for each poem.
I didn't think to question or research this wine preserver until after I received it.  The principle SEEMS valid, but once I received it I questioned how effectively it actually evacuates air from the bottle.  Then my research revealed questions about the stopper's ability to maintain what vacuum there is.  So, I did a simple test.  I pumped all the air out of a bottle that the unit was capable of and let it sit for a few hours.  After only maybe 6 hours, the stopper was so easy to pull out (without releasing pressure first) there is no way there was much of a vacuum at all left in the bottle.  At least it was cheap, but this unit is a complete scam. An incredibly successful and profitable one at that, it seems.
I have a four month old st Bernard puppy and have always fed him and my 6 year old German shep 

In [44]:
sent_0 = re.sub("http\S+", "", sent_0)
sent_1000 = re.sub("http\S+", "", sent_1000)
sent_1500 = re.sub("http\S+", "", sent_1500)
sent_4900 = re.sub("http\S+", "", sent_4900)
print(sent_0)
print("="*50)
print(sent_1000)
print("="*50)
print(sent_1500)
print("="*50)
print(sent_4900)
print("="*50)


This is a fun way for children to learn their months of the year!  We will learn all of the poems throughout the school year.  they like the handmotions which I invent for each poem.
I didn't think to question or research this wine preserver until after I received it.  The principle SEEMS valid, but once I received it I questioned how effectively it actually evacuates air from the bottle.  Then my research revealed questions about the stopper's ability to maintain what vacuum there is.  So, I did a simple test.  I pumped all the air out of a bottle that the unit was capable of and let it sit for a few hours.  After only maybe 6 hours, the stopper was so easy to pull out (without releasing pressure first) there is no way there was much of a vacuum at all left in the bottle.  At least it was cheap, but this unit is a complete scam. An incredibly successful and profitable one at that, it seems.
I have a four month old st Bernard puppy and have always fed him and my 6 year old German shep 

In [45]:
soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)

This is a fun way for children to learn their months of the year!  We will learn all of the poems throughout the school year.  they like the handmotions which I invent for each poem.
I didn't think to question or research this wine preserver until after I received it.  The principle SEEMS valid, but once I received it I questioned how effectively it actually evacuates air from the bottle.  Then my research revealed questions about the stopper's ability to maintain what vacuum there is.  So, I did a simple test.  I pumped all the air out of a bottle that the unit was capable of and let it sit for a few hours.  After only maybe 6 hours, the stopper was so easy to pull out (without releasing pressure first) there is no way there was much of a vacuum at all left in the bottle.  At least it was cheap, but this unit is a complete scam. An incredibly successful and profitable one at that, it seems.
I have a four month old st Bernard puppy and have always fed him and my 6 year old German shep 

In [46]:
sent_1000 = decontracted(sent_1000)
print(sent_1000)
print("="*50)

I did not think to question or research this wine preserver until after I received it.  The principle SEEMS valid, but once I received it I questioned how effectively it actually evacuates air from the bottle.  Then my research revealed questions about the stopper is ability to maintain what vacuum there is.  So, I did a simple test.  I pumped all the air out of a bottle that the unit was capable of and let it sit for a few hours.  After only maybe 6 hours, the stopper was so easy to pull out (without releasing pressure first) there is no way there was much of a vacuum at all left in the bottle.  At least it was cheap, but this unit is a complete scam. An incredibly successful and profitable one at that, it seems.


In [47]:
sent_1000 = re.sub("\S*\d\S*", "", sent_1000).strip()
print(sent_1000)

I did not think to question or research this wine preserver until after I received it.  The principle SEEMS valid, but once I received it I questioned how effectively it actually evacuates air from the bottle.  Then my research revealed questions about the stopper is ability to maintain what vacuum there is.  So, I did a simple test.  I pumped all the air out of a bottle that the unit was capable of and let it sit for a few hours.  After only maybe  hours, the stopper was so easy to pull out (without releasing pressure first) there is no way there was much of a vacuum at all left in the bottle.  At least it was cheap, but this unit is a complete scam. An incredibly successful and profitable one at that, it seems.


In [50]:
# Combining all the above logics 
#from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentence in tqdm(final['Text'].values):
    ## remove urls
    sentence = re.sub("http\S+", "", sentence)
    #remove-all-tags-from-an-element0
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    sentence = decontracted(sentence)
    ##remove words with numbers
    sentence = re.sub("\S*\d\S*", "", sentence).strip()
    ##remove spacial character
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)
    # https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentence.strip())

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [02:13<00:00, 2720.23it/s]


In [None]:
preprocessed_reviews

# Bag Of Words

In [51]:
vectorizer = CountVectorizer() #in scikit-learn
final_counts = vectorizer.fit_transform(preprocessed_reviews)
print("some feature names ", vectorizer.get_feature_names_out()[:10])
print('='*50)

print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

some feature names  ['aa' 'aaa' 'aaaa' 'aaaaa' 'aaaaaa' 'aaaaaaaaaaa' 'aaaaaaaaaaaa'
 'aaaaaaaaaaaaa' 'aaaaaaaaaaaaaa' 'aaaaaaaaaaaaaaa']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (364171, 115997)
the number of unique words  115997


In [52]:
final_counts[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [53]:
len(final_counts[0].toarray())

1

In [55]:
len(vectorizer.get_feature_names_out())

115997

# Bi-Grams and n-Grams.

In [56]:
vectorizer = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
#ngram_range=(1,2) ,here minimum is 1 andmaximum is 2 . Soit will include both unigrams and bigrams
final_unibigram_counts = vectorizer.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_unibigram_counts))
print("the shape of out text BOW vectorizer ",final_unibigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_unibigram_counts.get_shape()[1])


the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (364171, 5000)
the number of unique words including both unigrams and bigrams  5000


In [57]:
final_unibigram_counts[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [58]:
len(final_unibigram_counts[0].toarray()[0])

5000

# TF-IDF

In [59]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)

print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names_out()[0:10])
print('='*50)

print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])

some sample features(unique words in the corpus) ['aa' 'aaa' 'aaaaa' 'aaah' 'aafco' 'ab' 'aback' 'abandon' 'abandoned'
 'abbey']
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer  (364171, 148970)
the number of unique words including both unigrams and bigrams  148970


In [60]:
final_tf_idf[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

In [61]:
len(final_tf_idf[0].toarray()[0])

148970

# Word2Vec

In [62]:
w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)


In [64]:
# Get the list of words from the model's vocabulary
w2v_words = set(w2v_model.key_to_index)  # Vocabulary from the Word2Vec model
w2v_words

{'HR_Chally',
 'VIDEO_INTERVIEW',
 'Jennifer_Hedger',
 'Daryle_Ward_homered',
 'Pinter_Betrayal',
 'Cesar_Nicolas',
 'Broxtowe',
 'Ultriva_flagship_product',
 'LITL',
 'Luda',
 'hijackers_Nawaf_al_Hazmi',
 'Senator_Tariq_Azim',
 'boxer_Antonio_Tarver',
 'cardiopulmonary_resuscitation_CPR',
 'Aden_Mansoura_district',
 'Mikaal',
 'homemade_cakes_pies',
 'Silverado_Resort',
 'gamekeeping',
 'Buffalo_defenseman_Teppo',
 'Ameliach',
 'iCalendar_files',
 'Green2V',
 'Griefer',
 'Winkel',
 'Ms._Buglisi',
 'trod',
 'prosecutor_Len_Doust',
 'devilishly_tricky',
 'THE_WRONG_WAY',
 'Tony_Gapes',
 'ESTHERVILLE_Iowa',
 'law_Kristie_McDevitt',
 'Facebook_FOXNews.com_IFILM',
 'ried',
 'Maureen_Garrity',
 'eschewal',
 'Butrous',
 "Nk'Mip_Desert_Cultural",
 'mountaintop_removal',
 'Samsung_Memoir',
 'Alphera_Financial',
 'Carmax_Explorations_Ltd.',
 'Topol_M',
 'on-base/slugging_percentage',
 'Provenge_prostate_cancer',
 'woodland_grassland',
 'Bogen',
 'orcs_elves',
 'By_ROB_BURGESS',
 'Carol_Vaness',

In [65]:
len(w2v_words)

3000000

In [66]:
preprocessed_reviews

['charming rhyming book describes circumstances eat not chicken soup rice month month sounds thing kids recess sing drive teachers crazy cute catchy sounds really childlike skillfully written',
 'children books written mini version book not portrayed priced product email bewilderment amazon no response',
 'daughter loves really rosie books introduced really rosie performed carole king amazon birthday year later songs books johnny alligators chicken soup rice books written clever art work maurice sendak plus really cheap highly recommended',
 'witty little book son laugh loud recite car driving can sing refrain learned whales india drooping roses love words book introduces silliness classic book bet son recite memory college',
 'summary young boy describes usefulness chicken soup rice month year evaluation sendak creative repetitious rhythmic words children enjoy learn read story boy loves chicken soup rice sendak catchy story children learn months year seasons month learn identify ice 

In [67]:
list_of_sentence=[]
for sentence in preprocessed_reviews:
    list_of_sentence.append(sentence.split())

In [68]:
list_of_sentence

[['charming',
  'rhyming',
  'book',
  'describes',
  'circumstances',
  'eat',
  'not',
  'chicken',
  'soup',
  'rice',
  'month',
  'month',
  'sounds',
  'thing',
  'kids',
  'recess',
  'sing',
  'drive',
  'teachers',
  'crazy',
  'cute',
  'catchy',
  'sounds',
  'really',
  'childlike',
  'skillfully',
  'written'],
 ['children',
  'books',
  'written',
  'mini',
  'version',
  'book',
  'not',
  'portrayed',
  'priced',
  'product',
  'email',
  'bewilderment',
  'amazon',
  'no',
  'response'],
 ['daughter',
  'loves',
  'really',
  'rosie',
  'books',
  'introduced',
  'really',
  'rosie',
  'performed',
  'carole',
  'king',
  'amazon',
  'birthday',
  'year',
  'later',
  'songs',
  'books',
  'johnny',
  'alligators',
  'chicken',
  'soup',
  'rice',
  'books',
  'written',
  'clever',
  'art',
  'work',
  'maurice',
  'sendak',
  'plus',
  'really',
  'cheap',
  'highly',
  'recommended'],
 ['witty',
  'little',
  'book',
  'son',
  'laugh',
  'loud',
  'recite',
  'car'

# Avg W2v

In [69]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentence): # for each review/sentence
    sent_vec = np.zeros(300) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

100%|████████████████████████████████████████████████████████████████████████| 364171/364171 [01:08<00:00, 5281.31it/s]

364171
300





In [70]:
sent_vectors[0]

array([ 7.96893084e-02,  2.56438079e-02,  2.85395870e-04,  1.36422616e-01,
       -9.55550582e-02, -1.37736003e-02,  1.04122586e-01, -3.27750312e-02,
        3.39005082e-02,  7.25527163e-02, -1.00233290e-02, -8.55577257e-02,
       -6.41999421e-04,  1.47139938e-02, -1.26455802e-01,  1.38093171e-01,
        1.73701534e-02,  1.07467077e-01, -5.99681713e-02, -1.05086715e-01,
        3.98717810e-02,  1.07930501e-01,  9.42235876e-02, -2.02207212e-03,
       -2.32250072e-02, -1.03054470e-01,  8.32112630e-03,  2.58269133e-03,
       -1.49544610e-02,  1.61664044e-02, -1.20302554e-01,  1.63514879e-02,
        1.63945092e-02,  9.93211534e-02,  2.82219781e-02, -2.18415437e-02,
        8.07902018e-02, -2.49023438e-02, -3.93902814e-03,  7.12890625e-02,
        4.11648220e-02, -4.54053526e-02,  1.50856301e-01,  1.92645038e-02,
       -5.17371849e-02, -4.58006682e-02, -3.87713114e-02,  1.28569426e-02,
       -1.41895435e-02,  3.77250954e-02, -1.19976468e-01,  1.10982259e-02,
       -2.20786201e-02, -

# TFIDF weighted W2v

In [71]:
model = TfidfVectorizer()
model.fit(preprocessed_reviews)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))

In [72]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names_out() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentence): # for each review/sentence 
    sent_vec = np.zeros(300) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1

100%|███████████████████████████████████████████████████████████████████████| 364171/364171 [26:23:06<00:00,  3.83it/s]


In [73]:
tfidf_feat

array(['aa', 'aaa', 'aaaa', ..., 'zzzzzzzzzz', 'zzzzzzzzzzz',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz'], dtype=object)

In [74]:
print(len(tfidf_sent_vectors))
print(len(tfidf_sent_vectors[0]))

364171
300


In [75]:
tfidf_sent_vectors[0]

array([ 0.11396884,  0.02561451,  0.00118194,  0.11142329, -0.11046916,
       -0.02964562,  0.09483108, -0.00411715,  0.03612865,  0.06746001,
       -0.02436762, -0.06097512, -0.00464399,  0.02780824, -0.09815542,
        0.13406959,  0.02215951,  0.13281379, -0.05849127, -0.10085529,
        0.03393495,  0.138102  ,  0.07756602, -0.0020497 , -0.05065447,
       -0.11441903,  0.01680255, -0.0018775 , -0.01707765,  0.03244524,
       -0.12550038,  0.00081869,  0.00254604,  0.14487524,  0.02772679,
       -0.04647382,  0.07058791, -0.01431421, -0.02334432,  0.04632905,
        0.05522238, -0.04180608,  0.14647944,  0.02595438, -0.05132676,
       -0.03715276, -0.03607776,  0.03470499, -0.01387578,  0.03060926,
       -0.13325814, -0.01317776, -0.02843846, -0.03616672,  0.11080826,
        0.07819761, -0.04609804, -0.17937356,  0.00967246, -0.10830716,
        0.02380647,  0.05935133, -0.08510134, -0.07307952,  0.07473544,
       -0.00726191, -0.08047924,  0.03282513, -0.0240684 ,  0.07