# Predicting sentiments from product reviews
The dataset used in this notebook is available here: [amazon_baby.csv](https://d396qusza40orc.cloudfront.net/phoenixassets/amazon_baby.csv)

In [1]:
import pandas as pd
import numpy as np

Import the packages needed for text preprocessing (cleaning the reviews):
- Import re (regular expressions, for removing punctuation)
- Import nltk and download its stopwords corpus (to remove stop words such as "and", "the", etc.)

In [11]:
import re
import nltk
#nltk.download("stopwords")  # run once to fetch the stopwords corpus
from nltk.corpus import stopwords

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
products = pd.read_csv("./data/amazon_baby.csv")

## A bit of data exploration

In [5]:
products.head(5)

,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [6]:
products.count()

name      183213
review    182702
rating    183531
dtype: int64

The counts differ between columns: there are 183,531 ratings but only 182,702 reviews, so 829 reviews are missing (NaN). Replace the NaN reviews with " " to avoid errors when processing the text later on.

In [19]:
products["review"] = products["review"].fillna(" ")
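
As a quick check of this step, here is a minimal sketch on a toy frame (the data is made up for illustration): `isna().sum()` counts the missing reviews, and assigning the result of `fillna` back to the column replaces them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for the real products table
toy = pd.DataFrame({
    "review": ["great wipes", np.nan, "too small"],
    "rating": [5, 3, 2],
})

n_missing = toy["review"].isna().sum()      # number of NaN reviews
toy["review"] = toy["review"].fillna(" ")   # assign back rather than using inplace=True

print(n_missing, toy["review"].tolist())
```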

## Create bag of words for reviews

### Clean the reviews first
- Remove punctuation and any other non-letter characters (using a regular expression)
- Convert the review to lowercase and split the text into words
- Optionally remove stop words (this is not done in the course)
- Finally, put all the cleaned reviews in a list (reviews)

In [20]:
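def clean_review(raw_review, stopswrd):
    # Replace every non-letter character (punctuation, digits) with a space
    review = re.sub("[^a-zA-Z]", " ", raw_review)
    # Lowercase, then split on whitespace into a list of words
    review = review.lower().split()
    # Optional: drop stop words (left disabled, as in the course)
    #review = [w for w in review if w not in stopswrd]
    return " ".join(review)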

In [21]:
stopswrd = set(stopwords.words("english"))
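
For reference, here is a sketch of the same cleaning function with stop word removal switched on. The name `clean_review_no_stopwords` and the tiny stop word set are assumptions for illustration; in this notebook the set would come from `stopwords.words("english")`.

```python
import re

def clean_review_no_stopwords(raw_review, stop_words):
    # Keep letters only, lowercase, split into words
    words = re.sub("[^a-zA-Z]", " ", raw_review).lower().split()
    # Drop stop words; using a set keeps each membership test cheap
    return " ".join(w for w in words if w not in stop_words)

# Tiny hand-written stop word set (hypothetical, not the NLTK list)
toy_stop_words = {"and", "the", "it", "was"}

cleaned = clean_review_no_stopwords("It came early, and the baby LOVES it!", toy_stop_words)
print(cleaned)  # came early baby loves
```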

In [23]:
reviews = []
num_reviews = products.shape[0]

In [24]:
for i in range(0,num_reviews):
    reviews.append(clean_review(products["review"][i],stopswrd))

### Then use sklearn to create the bag of words
Each review is turned into a vector of length max_features that holds, for each word of the vocabulary, the number of times it occurs in that review.

In [26]:
vectorizer = CountVectorizer(analyzer = "word",tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)

In [27]:
words_count = vectorizer.fit_transform(reviews)

In [33]:
words_count = words_count.toarray()

In [39]:
words_count.shape

(183531, 5000)
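
To make this shape concrete, here is a simplified pure-Python sketch of what a count-based bag of words does: keep the `max_features` most frequent words of the corpus as columns, then count them per review. It is an illustration only, not scikit-learn's implementation (for instance, `CountVectorizer`'s default tokenizer also ignores one-letter words); the names `toy_reviews` and `bag_of_words` are hypothetical.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    # Count every word over the whole corpus
    totals = Counter(w for d in docs for w in d.split())
    # Keep the most frequent words, then sort them to fix the column order
    vocab = sorted(w for w, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary word
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

toy_reviews = ["soft and warm", "soft soft warm", "warm blanket soft"]
vocab, rows = bag_of_words(toy_reviews, max_features=2)
print(vocab)  # ['soft', 'warm']
print(rows)   # [[1, 1], [2, 1], [1, 1]]
```

The result has one row per review and `max_features` columns, which is the (183531, 5000) shape seen here for the full dataset.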

In [48]:
dist = np.sum(words_count, axis=0)

In [50]:
dist.shape[0]

5000
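
The column totals in `dist` become readable once each total is paired with its word. A minimal sketch on toy data, assuming a hypothetical word list `toy_words` whose order matches the columns of the count matrix:

```python
import numpy as np

toy_words = ["blanket", "soft", "warm"]   # column order (hypothetical)
toy_counts = np.array([[0, 1, 1],
                       [1, 2, 1],
                       [1, 1, 1]])

totals = toy_counts.sum(axis=0)           # total occurrences of each word
order = np.argsort(totals)[::-1]          # column indices, most frequent first
ranked = [(toy_words[i], int(totals[i])) for i in order]
print(ranked)  # [('soft', 4), ('warm', 3), ('blanket', 2)]
```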

In [54]:
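# Caution: vocabulary_ maps word -> column index and its dict order is not the column order,
# so the position taken from range() need not be the word's column in words_count.
for tag,count in zip(vectorizer.vocabulary_, range(0,dist.shape[0])):
    print(tag, words_count[0,count])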

strapped 0
powder 0
manipulate 0
recently 1
select 0
bounce 0
harness 0
fluid 0
playmat 0
hopes 0
impressed 0
one 0
states 0
make 0
stories 0
consultant 0
usual 0
wouldn 0
soft 0
cats 0
constructed 0
breaking 0
ones 0
pinching 0
flaps 0
by 0
past 0
rinse 0
installed 0
benefits 0
annoyance 0
closest 0
worried 0
warms 0
usage 0
pram 0
missing 0
idea 0
fresh 0
homemade 0
moisture 0
texas 0
boost 0
saw 0
successfully 0
fork 0
laws 0
lets 0
likes 0
safe 0
prefold 0
blocking 0
danger 0
alot 0
mechanical 0
securely 0
brushes 0
substantial 0
moby 0
constantly 0
stream 0
been 0
recommended 0
roughly 0
spending 0
dry 0
loses 0
throws 0
reflection 0
durable 0
tether 0
serves 0
satin 0
dr 0
lots 0
however 0
sees 0
extreme 0
bored 0
carry 0
remaining 0
click 0
squares 0
relieved 0
managed 0
thankful 0
holders 0
crisp 0
warranty 0
sophie 0
petite 0
order 0
city 0
leather 0
board 0
regret 0
decorated 0
convenient 0
cause 0
awake 0
trial 0
edge 0
buzzing 0
say 0
hard 0
compartments 0
kicking 0
excitin
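
Since `vocabulary_` is a dictionary from word to column index, a reliable way to print a review's counts is to look up each word's column through that index rather than through its position in the dictionary. A sketch with a toy dictionary standing in for `vectorizer.vocabulary_` (the values are made up):

```python
import numpy as np

# Toy stand-in for vectorizer.vocabulary_: word -> column index,
# deliberately stored in a different order than the columns
toy_vocabulary = {"warm": 2, "blanket": 0, "soft": 1}
toy_row = np.array([0, 3, 1])   # counts for one review, by column

# Look up each word's count through its column index
pairs = {word: int(toy_row[idx]) for word, idx in toy_vocabulary.items()}
print(pairs)  # {'warm': 1, 'blanket': 0, 'soft': 3}
```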

In [99]:
a = vectorizer.vocabulary_.keys()

In [111]:
vec = pd.DataFrame([[key,value] for key,value in vectorizer.vocabulary_.items()],columns=["word","iindex"])

In [113]:
vec.sort_values("iindex",ascending=True,axis=0,inplace=True)

In [114]:
vec

,word,iindex
212,aa,0
4665,ability,1
1382,able,2
1704,about,3
388,above,4
4378,absolute,5
1712,absolutely,6
943,absorb,7
1085,absorbant,8
558,absorbency,9


In [115]:
vec.set_index("iindex",inplace=True)

In [116]:
vec

iindex,word
0,aa
1,ability
2,able
3,about
4,above
5,absolute
6,absolutely
7,absorb
8,absorbant
9,absorbency
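
The same word-by-index table can be built more directly from the dictionary. A sketch, again with a toy dictionary in place of `vectorizer.vocabulary_` (hypothetical values):

```python
import pandas as pd

toy_vocabulary = {"warm": 2, "blanket": 0, "soft": 1}  # hypothetical

# Series indexed by word, holding the column index; sort by that index,
# then swap so the column index becomes the index and the word the value
by_index = pd.Series(toy_vocabulary).sort_values()
word_of = pd.Series(by_index.index, index=by_index.values, name="word")
print(word_of.tolist())  # ['blanket', 'soft', 'warm']
```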


In [133]:
words_count[0].reshape(-1,1)

array([[0],
       [0],
       [0],
       ..., 
       [0],
       [0],
       [0]], dtype=int64)

In [126]:
a = vec.copy()  # copy, so that adding a column below does not also modify vec

In [144]:
a.insert(1, "count", words_count[0])  # counts for the first review, in column (iindex) order

In [145]:
a

iindex,word,count
0,aa,0
1,ability,0
2,able,0
3,about,1
4,above,0
5,absolute,0
6,absolutely,0
7,absorb,0
8,absorbant,0
9,absorbency,0


In [146]:
b = a[a["count"]>0]

In [148]:
b.shape

(53, 2)
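
This filter keeps the words that occur at least once in the first review. The same selection can be made directly on the count row with NumPy; a sketch on toy data (all names hypothetical):

```python
import numpy as np

toy_words = np.array(["blanket", "soft", "warm", "wipes"])  # column order
toy_row = np.array([0, 3, 1, 0])                            # one review's counts

present = np.flatnonzero(toy_row)       # columns with a non-zero count
found = dict(zip(toy_words[present].tolist(), toy_row[present].tolist()))
print(found)  # {'soft': 3, 'warm': 1}
```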