# Amazon fine food reviews analysis
Data source : https://www.kaggle.com/snap/amazon-fine-food-reviews

## Context
This dataset consists of reviews of fine foods from Amazon. The data span a period
of more than 10 years, including all ~500,000 reviews up to October 2012.
Reviews include product and user information, ratings, and a plain text review.
It also includes reviews from all other Amazon categories.

Data includes:

- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

Attribute information:

In [1]:
import pandas as pd
data = pd.read_csv('Reviews.csv')
print("Total number of attributes : ", len(data.columns))
for i in data.columns:
    print(str(i))

data.shape

Total number of attributes :  10
Id
ProductId
UserId
ProfileName
HelpfulnessNumerator
HelpfulnessDenominator
Score
Time
Summary
Text


(568454, 10)

HelpfulnessNumerator - Number of users who found the review helpful.

HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not.

## Objective :
For a given review determine whether it is positive or negative.

(Assumption: ratings of 4 and 5 stars are treated as positive, and ratings of 1 and 2 stars as negative.)

For testing purposes we will use only the 1, 2, 4 and 5 star ratings; 3-star reviews are discarded.

***

## Step 1 : Reading Data

We will use SQLite to read the data and perform most of the operations with SQL commands.

In [2]:
import sqlite3

# Read the reviews table from the SQLite database

# First create a connection between the SQLite database and Python
con = sqlite3.connect('././database.sqlite')

# Discard reviews with a 3-star rating

filtered_data = pd.read_sql_query("""SELECT * FROM REVIEWS WHERE SCORE!=3""",con)

# Map 1- and 2-star ratings to 'negative' and 4- and 5-star ratings to 'positive'

def mapping(n):
    if n>3:
        return 'positive'
    else :
        return 'negative'

filtered_data['Score'] = filtered_data['Score'].map(mapping)

print(filtered_data['Score'].value_counts())
filtered_data.head()


positive    443777
negative     82037
Name: Score, dtype: int64


| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | positive | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | negative | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | positive | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | negative | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | positive | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


## Step 2 : Cleaning of data :

As part of cleaning we will remove duplicate entries from the dataset.
We can also remove inconsistent rows, such as those where HelpfulnessNumerator is greater than
HelpfulnessDenominator.

In [3]:
# Sort by ProductId, then drop rows whose UserId, ProfileName, HelpfulnessNumerator,
# HelpfulnessDenominator, Score, Time, Summary and Text are all equal, keeping the first one.
filtered_data = filtered_data.sort_values(by='ProductId', ascending=True)

# Row count before de-duplication
print(filtered_data.shape)

# drop_duplicates returns a new DataFrame, so the result has to be assigned back
filtered_data = filtered_data.drop_duplicates(subset=filtered_data.columns[2:], keep='first')

(525814, 10)
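
The cleaning step also mentions rows where HelpfulnessNumerator is greater than HelpfulnessDenominator, but no cell filters them out. A minimal sketch of such a filter is shown here on a small made-up DataFrame; the `toy` rows and the `valid` name are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up, not from Reviews.csv.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "HelpfulnessNumerator": [1, 3, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# Keep only rows where the helpful votes do not exceed the total votes.
valid = toy[toy["HelpfulnessNumerator"] <= toy["HelpfulnessDenominator"]]
print(valid["Id"].tolist())
```

On the real data the same boolean mask could be applied to `filtered_data` in place of `toy`.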


***
# Converting text into vector form

For data visualization we need to convert text into vector form so that we can apply
linear algebra techniques to measure the similarity or dissimilarity between two documents.

Popular and simple feature extraction methods for text data include:
- Bag of words (BOW).
- TF-IDF
- Word2Vec

## Text pre-processing :
Before applying any text-to-vector conversion algorithm we need to remove
unimportant and noisy data.
Some important text pre-processing techniques are:
- Removing stopwords.
- Converting every letter to lowercase.
- Stemming
- Lemmatization
- Semantic meaning of words

### Removing Stopwords :
The English language has many stopwords such as 'this', 'that', 'is', 'not'
etc. These words are placed in sentences to make them grammatically correct. They don't have much
significant impact on our data.
So before applying any algorithm we remove these stopwords from the documents.
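
As a rough sketch of this step combined with lowercasing, the function below filters words against a stopword set. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical and deliberately tiny; a real project would normally take a much larger list from an NLP library.

```python
# A tiny illustrative stopword set (an assumption made for this example).
STOPWORDS = {"this", "that", "is", "a", "the", "and", "it"}

def remove_stopwords(sentence):
    """Lowercase the sentence, split it on whitespace, and drop stopwords."""
    words = sentence.lower().split()
    return [w for w in words if w not in STOPWORDS]

print(remove_stopwords("This is a tasty and healthy snack"))
# prints ['tasty', 'healthy', 'snack']
```

The toy set leaves out 'not' on purpose: for sentiment classification, negation words can carry signal, so whether to drop them is a design choice worth testing.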

### Stemming :
Stemming is a process in which similar-looking words with similar meaning are reduced to
a common root (stem).
For example 'tasty', 'taste', 'tasteful' etc. can all be reduced to a stem like 'taste' (the stem a real stemmer outputs may be a truncated, non-dictionary form).
There are many stemming algorithms available, but the two most important ones are:
- Porter Stemmer (the older one)
- Snowball Stemmer (the newer one)
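
To make the idea concrete, here is a naive suffix-stripping sketch. It is not the Porter or Snowball algorithm, whose rule sets are far more elaborate; the `SUFFIXES` list and the `naive_stem` helper are hypothetical and only show how related words can collapse to one stem.

```python
# Hypothetical suffix list for illustration; NOT the Porter or Snowball rules.
SUFFIXES = ("ful", "ing", "ed", "ly", "es", "s", "y", "e")

def naive_stem(word, min_len=3):
    """Strip the first matching suffix repeatedly while the stem keeps min_len letters."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

print([naive_stem(w) for w in ["tasty", "taste", "tasteful"]])
# prints ['tast', 'tast', 'tast']
```

All three words end up with the same stem, which is what matters for the later vectorization steps, even though 'tast' is not a dictionary word.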

### Lemmatization :
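Lemmatization reduces each word to its dictionary form (lemma), for example 'eating' to 'eat' or 'better' to 'good'.
Unlike stemming, the result is always a valid word. (Breaking a sentence into words is a separate step, called tokenization.)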

### Semantic Analysis :
Here, words that look different but have a similar meaning are replaced with the same word.
For example, 'tasty' and 'delicious' have the same meaning, so both can be replaced with either 'tasty' or 'delicious'.

Note -
In bag of words and TF-IDF we don't consider Semantic Analysis.


## Bag of words(BOW) :
In the bag of words technique we create a d-dimensional vector for each sentence,
where d is the total number of distinct words in the corpus.
Each dimension of the vector holds the number of times the corresponding word occurs in that sentence.
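
A minimal pure-Python sketch of this construction follows. The three `reviews` are made-up sentences, and `vocabulary`, `d` and `bow_vector` are names chosen for the example.

```python
from collections import Counter

# Made-up reviews for illustration.
reviews = [
    "this pasta is tasty",
    "this pasta is not tasty",
    "pasta pasta everywhere",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
# d is the number of distinct words, i.e. the dimension of each vector.
vocabulary = sorted({word for r in reviews for word in r.split()})
d = len(vocabulary)

def bow_vector(review):
    """Return the d-dimensional count vector of one review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(r) for r in reviews]
print(vocabulary)
for v in vectors:
    print(v)
```

The first two vectors differ only in the 'not' dimension, which shows how similar sentences land close together in this representation.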

## TF-IDF :
TF - Term frequency
$TF(w_j,r_i)$ is the number of times $w_j$ occurs in $r_i$ divided by the total number
of words in $r_i$.

$0 \le TF(w_j,r_i) \le 1$

IDF - Inverse Document frequency

Let's say $D_c = \{ r_1, r_2, r_3, \ldots, r_N \}$,
where $D_c$ is known as the data corpus and $r_1, r_2, \ldots, r_N$ are the sentences.

$IDF(w_j,D_c) = \log(N/n_j)$ where

$IDF(w_j,D_c)$ is the IDF of word $w_j$ in the corpus $D_c$

$N$ is the number of documents in $D_c$

$n_j$ is the number of documents which contain $w_j$

So the vector form of text can be written as  :

$TFIDF(w_j,r_i) = TF(w_j,r_i)*IDF(w_j,D_c)$
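
The formulas above can be sketched directly in pure Python. The `corpus` is made up, the function names are chosen for the example, and `idf` assumes the word occurs in at least one document (otherwise the division would fail).

```python
import math

# Made-up corpus for illustration; each review is one document r_i.
corpus = [
    "this pasta is tasty",
    "this pasta is not tasty",
    "pasta pasta everywhere",
]
docs = [r.split() for r in corpus]
N = len(docs)  # number of documents in the corpus

def tf(word, doc):
    """Occurrences of word in doc divided by the number of words in doc."""
    return doc.count(word) / len(doc)

def idf(word):
    """log(N / number of documents containing word); word must occur somewhere."""
    n_docs_with_word = sum(1 for doc in docs if word in doc)
    return math.log(N / n_docs_with_word)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

print(tfidf("pasta", docs[2]))       # 'pasta' is in every review, so IDF = log(1) = 0
print(tfidf("everywhere", docs[2]))  # (1/3) * log(3/1)
```

A word that appears in every review gets a weight of zero, while a rare word is weighted up. Library implementations often use smoothed variants of the IDF term, so their numbers may differ from this plain formula.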

## Word2Vec :
Word2vec is a group of related models that produce distributed representations
of words in a corpus C. Word2Vec (W2V) is an algorithm that accepts a text corpus
as input and outputs a vector representation for each word.

Its math is fairly complex, so we will skip it for now and come back to it with the deep learning models.
What does it do?
- It takes a word as input and returns a "vector representation" of that word.
- If two words $w_1$ and $w_2$ are more similar to each other than to $w_3$, it returns vectors $v_1, v_2, v_3$ such that the distance between $v_1$ and $v_2$ is smaller than their distances to $v_3$.

But we need to convert a whole sentence into a vector, not just a single word.
To get sentence vectors we will use two well-known methods:
- Avg Word2Vec.
- TFIDF Word2Vec.

In avg word2vec we compute the word2vec vector of each word in the sentence and then take the average of those vectors.
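
A small sketch of the averaging step is given below. The `word_vectors` dictionary is hand-written for illustration and is not the output of a trained Word2Vec model; `avg_word2vec` is a hypothetical helper name.

```python
# Hypothetical 3-dimensional word vectors, hand-written for illustration.
# A real Word2Vec model would learn these from a corpus.
word_vectors = {
    "tasty":     [0.9, 0.1, 0.0],
    "delicious": [0.8, 0.2, 0.0],
    "pasta":     [0.1, 0.9, 0.3],
}

def avg_word2vec(sentence):
    """Average the vectors of the words that have one; return None if none do."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(avg_word2vec("tasty pasta"))
```

Words without a vector are skipped here; that is one possible policy, chosen to keep the sketch short.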

In TFIDF word2vec we take a weighted mean of the word vectors,
where the weight of each word is its TF-IDF value.
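
The weighted mean can be sketched as the sum of (weight × vector) divided by the sum of weights. Both dictionaries below are made-up numbers for illustration. As a simplification, the sketch uses one TF-IDF weight per word, whereas in practice the weight depends on the word and the sentence it occurs in.

```python
# Hypothetical word vectors and TF-IDF weights, hand-written for illustration.
word_vectors = {"tasty": [0.9, 0.1], "pasta": [0.1, 0.9]}
tfidf_weights = {"tasty": 0.75, "pasta": 0.25}

def tfidf_word2vec(sentence):
    """TF-IDF weighted mean of the word vectors; None if no word is usable."""
    words = [w for w in sentence.split()
             if w in word_vectors and w in tfidf_weights]
    total = sum(tfidf_weights[w] for w in words)
    if total == 0:
        return None
    dim = len(word_vectors[words[0]])
    return [
        sum(tfidf_weights[w] * word_vectors[w][i] for w in words) / total
        for i in range(dim)
    ]

print(tfidf_word2vec("tasty pasta"))
```

Because 'tasty' carries the larger weight, the sentence vector is pulled toward the 'tasty' vector instead of sitting halfway between the two words.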


