# Predicting sentiments from product reviews
The dataset used in this notebook is available here: [amazon_baby.csv](https://d396qusza40orc.cloudfront.net/phoenixassets/amazon_baby.csv).

In [1]:
import pandas as pd
import numpy as np

Import the packages needed for text preprocessing (cleaning the text):
- `re` (regular expressions, for removing punctuation)
- `nltk`, with its stopwords corpus (for removing stop words such as "and", "the", etc.)

In [14]:
import re
import nltk
# nltk.download("stopwords")  # run once to fetch the stopwords corpus
from nltk.corpus import stopwords

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
products = pd.read_csv("./data/amazon_baby.csv")

## A bit of data exploration

In [5]:
products.head(5)

| | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


In [6]:
products.count()

name      183213
review    182702
rating    183531
dtype: int64

In [34]:
products.shape[0]

183531
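The counts above are lower than the row count: 183531 - 182702 = 829 reviews and 183531 - 183213 = 318 names are missing. A minimal sketch of how those gaps could be computed directly, using a made-up toy frame (`toy` and its rows are illustrative assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `products`; these rows are invented for illustration.
toy = pd.DataFrame({
    "name": ["Wipes", None, "Quilt"],
    "review": ["Great", np.nan, np.nan],
    "rating": [5, 3, 4],
})

# count() skips missing values, so the difference to the row count
# gives the number of missing entries per column.
missing_by_diff = len(toy) - toy.count()

# isna().sum() reports the same numbers directly.
missing_direct = toy.isna().sum()

print(missing_direct)
```

On the real data, the same two expressions applied to `products` should reproduce the gaps computed by hand above.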

# NaN values in `review` must be replaced by " " before cleaning
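One possible way to do this replacement is `fillna` on the `review` column. The sketch below uses a small made-up frame (`toy` is an assumption for illustration); on the notebook's data the same call would be applied to `products["review"]`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `products`; the reviews are invented for illustration.
toy = pd.DataFrame({"review": ["Love it!", np.nan, "Too small."]})

# Replace missing reviews with a single space so that every entry is a string.
toy["review"] = toy["review"].fillna(" ")

print(toy["review"].tolist())
```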

# Test NLP ...

### Function to create a bag of words for each review
This function:
- Removes punctuation (using a regular expression that keeps only letters, so digits are removed as well)
- Converts the review to lowercase and splits the text into words
- Optionally removes stop words (commented out here, since this is not done in the course)

In [35]:
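def clean_review(raw_review, stopswrd):
    # Replace every non-letter character (punctuation, digits) with a space
    review = re.sub("[^a-zA-Z]", " ", raw_review)
    # Lowercase, then split on whitespace into a list of words
    review = review.lower().split()
    # Optional stop-word removal (not done in the course)
    # review = [w for w in review if w not in stopswrd]
    return " ".join(review)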
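Instead of commenting the stop-word line in and out, the removal could be controlled by an argument. This is only a sketch: `clean_review_optional` and `demo_stop_words` are hypothetical names, and the tiny stop-word set is hand-written for illustration (the notebook uses NLTK's English list).

```python
import re


def clean_review_optional(raw_review, stop_words=None):
    """Hypothetical variant of clean_review with optional stop-word removal."""
    # Keep letters only, then lowercase and split into words.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_review)
    words = letters_only.lower().split()
    # Filter only when a stop-word collection is supplied.
    if stop_words is not None:
        words = [w for w in words if w not in stop_words]
    return " ".join(words)


# Hand-written stop words for illustration only.
demo_stop_words = {"and", "the", "it", "was"}

kept = clean_review_optional("It came early, and the bag was great!")
filtered = clean_review_optional("It came early, and the bag was great!", demo_stop_words)

print(kept)
print(filtered)
```

Passing `None` keeps every word, which matches the behaviour used in the course.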

In [36]:
stopswrd = set(stopwords.words("english"))

In [58]:
# `reviews` must already exist (it is initialised in cell In [54])
for i in range(products.shape[0]):
    print("i:", i)
    cleaned = clean_review(products["review"][i], stopswrd)
    print("text:", cleaned)
    reviews.append(cleaned)

i: 0
text: these flannel wipes are ok but in my opinion not worth keeping i also ordered someimse vimse cloth wipes ocean blue countwhich are larger had a nicer softer texture and just seemed higher quality i use cloth wipes for hands and faces and have been usingthirsties pack fab wipes boyfor about months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles
i: 1
text: it came early and was not disappointed i love planet wise bags and now my wipe holder it keps my osocozy wipes moist and does not leak highly recommend it
i: 2
text: very soft and comfortable and warmer than it looks fit the full size bed perfectly would recommend to anyone looking for this type of quilt
i: 3
text: this is a product well worth the purchase i have not found anything else like this and it is a positive ingenious approach to losing the binky what i love most about this product is how much ownership my daughter has in gett

TypeError: expected string or bytes-like object

In [45]:
reviews.append(clean_review(products["review"][2],stopswrd))

In [54]:
reviews = []
num_reviews = products.shape[0]

In [59]:
for i in range(35,40):
    reviews.append(clean_review(products["review"][i],stopswrd))

TypeError: expected string or bytes-like object

In [60]:
products["review"][38]

nan
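This explains the `TypeError`: a missing review is stored as a float `nan`, and `re.sub` only accepts strings or bytes. The sketch below reproduces the failure and shows one possible guard; `letters_only_or_blank` is a hypothetical helper, not part of the notebook.

```python
import re

missing = float("nan")  # a missing review cell holds a float NaN

# re.sub rejects non-string input, which is the error seen in the loop.
try:
    re.sub("[^a-zA-Z]", " ", missing)
    raised = False
except TypeError:
    raised = True


def letters_only_or_blank(raw_review):
    """Hypothetical guard: any non-string input (such as NaN) becomes " "."""
    if not isinstance(raw_review, str):
        return " "
    return re.sub("[^a-zA-Z]", " ", raw_review)


print(raised)
print(repr(letters_only_or_blank(missing)))
```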

In [8]:
rev_nopunc = re.sub("[^a-zA-Z]",
                      " ",
                      products["review"][0])

In [9]:
rev_nopunc

'These flannel wipes are OK  but in my opinion not worth keeping   I also ordered someImse Vimse Cloth Wipes Ocean Blue    countwhich are larger  had a nicer  softer texture and just seemed higher quality   I use cloth wipes for hands and faces and have been usingThirsties   Pack Fab Wipes  Boyfor about   months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles '
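The output above has gaps where digits used to be (for example, "about   months"), because the pattern `[^a-zA-Z]` replaces everything that is not a letter, not only punctuation. A small sketch on a made-up sentence (the sample text is an assumption for illustration):

```python
import re

sample = "Pack of 6, 100% cotton!"

# Every character outside a-z and A-Z becomes a space, digits included.
letters_only = re.sub("[^a-zA-Z]", " ", sample)

# split() collapses the resulting runs of spaces.
words = letters_only.lower().split()

print(words)
```

If numbers matter for the analysis, a pattern such as `[^a-zA-Z0-9]` would keep them; that is a design choice the course does not make.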

In [17]:
rev_nopunc = rev_nopunc.lower().split()

In [23]:
stopswrd = set(stopwords.words("english"))

In [24]:
bow = [w for w in rev_nopunc if w not in stopswrd]

In [25]:
print(bow)

['flannel', 'wipes', 'ok', 'opinion', 'worth', 'keeping', 'also', 'ordered', 'someimse', 'vimse', 'cloth', 'wipes', 'ocean', 'blue', 'countwhich', 'larger', 'nicer', 'softer', 'texture', 'seemed', 'higher', 'quality', 'use', 'cloth', 'wipes', 'hands', 'faces', 'usingthirsties', 'pack', 'fab', 'wipes', 'boyfor', 'months', 'need', 'replace', 'starting', 'get', 'rough', 'stink', 'issues', 'stripping', 'longer', 'handles']


In [22]:
" ".join(bow)

'flannel wipes ok opinion worth keeping also ordered someimse vimse cloth wipes ocean blue countwhich larger nicer softer texture seemed higher quality use cloth wipes hands faces usingthirsties pack fab wipes boyfor months need replace starting get rough stink issues stripping longer handles'
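To turn a cleaned string like this into actual word counts, one simple option is `collections.Counter` from the standard library. Note that this is a stand-in for illustration, not the `CountVectorizer` imported earlier, and the short `cleaned` string is a made-up excerpt.

```python
from collections import Counter

# Short made-up excerpt of a cleaned review.
cleaned = "cloth wipes hands faces pack fab wipes"

# Count how often each word occurs: a bag of words for one review.
word_counts = Counter(cleaned.split())

print(word_counts["wipes"], word_counts["cloth"])
```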

In [29]:
clean_review(products["review"][0],stopswrd)

'these flannel wipes are ok but in my opinion not worth keeping i also ordered someimse vimse cloth wipes ocean blue countwhich are larger had a nicer softer texture and just seemed higher quality i use cloth wipes for hands and faces and have been usingthirsties pack fab wipes boyfor about months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles'