# Predicting sentiments from product reviews
The dataset used in this notebook can be found here: [amazon_baby.csv](https://d396qusza40orc.cloudfront.net/phoenixassets/amazon_baby.csv).

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

Import the packages needed for text preprocessing (cleaning):
- re (regular expressions, for removing punctuation)
- nltk, with its stopwords corpus downloaded (to remove stop words such as "and", "the", etc.)

In [2]:
import re, collections
import nltk
#nltk.download("stopwords")  # uncomment and run once to fetch the stopwords corpus
from nltk.corpus import stopwords

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
products = pd.read_csv("./data/amazon_baby.csv")

## A bit of data exploration

In [5]:
products.head(5)

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [6]:
products.count()

name      183213
review    182702
rating    183531
dtype: int64

The counts show that some reviews are **missing (NaN)**: there are 182702 non-null reviews against 183531 ratings, i.e. 829 missing reviews. Replace NaN reviews with " " to avoid errors when processing the text later on.

In [7]:
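# Assign the result back: an in-place fillna on a selected column may not update products in recent pandas versions
products["review"] = products["review"].fillna(" ")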
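As a minimal sketch of this step, the number of missing reviews can be checked before and after filling. The `toy` DataFrame below is a hypothetical stand-in for **products**, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the products DataFrame
toy = pd.DataFrame({
    "name": ["wipes", "pouch", "quilt"],
    "review": ["soft and nice", np.nan, "warm"],
    "rating": [3, 5, 5],
})

# Number of missing reviews before filling
missing_before = int(toy["review"].isna().sum())

# Replace NaN with a single space and assign the column back
toy["review"] = toy["review"].fillna(" ")
missing_after = int(toy["review"].isna().sum())

print(missing_before, missing_after)  # 1 0
```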

## Create bag of words for reviews

### Clean the reviews first
- Remove punctuation (using a regular expression; the pattern used here also drops digits)
- Convert the review to lowercase and split the text into words
- Optionally remove stop words (this is not done in the course)
- Finally, put all cleaned reviews in a list (reviews)

In [8]:
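def clean_review(raw_review, stopswrd):
    # Replace every non-letter character (punctuation, digits) with a space
    review = re.sub("[^a-zA-Z]", " ", raw_review)
    # Lowercase the text and split it into words
    review = review.lower().split()
    # Optional: drop stop words (not done in the course)
    #review = [w for w in review if w not in stopswrd]
    return " ".join(review)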
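To make the cleaning steps concrete, here is a small sketch that applies the same regular expression, lowercasing and splitting to one made-up review. The sample text and the tiny `toy_stopwords` set are assumptions for illustration only; the notebook itself uses the nltk English stop word list.

```python
import re

# Made-up review text (not taken from the dataset)
raw = "These flannel wipes are OK, but it's 2 small!"

# Step 1: replace everything that is not a letter with a space
letters_only = re.sub("[^a-zA-Z]", " ", raw)

# Step 2: lowercase and split into words
words = letters_only.lower().split()
cleaned = " ".join(words)
print(cleaned)  # these flannel wipes are ok but it s small

# Optional step: drop stop words, here with a hypothetical mini list
toy_stopwords = {"are", "but", "it", "s"}
without_stops = " ".join(w for w in words if w not in toy_stopwords)
print(without_stops)  # these flannel wipes ok small
```

Note that the apostrophe in "it's" is replaced by a space, so the word is split into "it" and "s".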

In [9]:
stopswrd = set(stopwords.words("english"))

In [10]:
reviews = []
num_reviews = products.shape[0]

In [11]:
for raw_review in products["review"]:
    reviews.append(clean_review(raw_review, stopswrd))

### The bag of words
Create a DataFrame containing the bag of words (word counts) for each review, then add it as a column to the **products** DataFrame.

In [12]:
%%time
bagofwords = pd.DataFrame((str(dict(collections.Counter(re.findall(r"\w+", txt)))) for txt in reviews),columns=["word_count"])

Wall time: 10.6 s
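The cell above stores each bag of words as the string form of a dictionary (`str(dict(...))`). If the counts need to be looked up later, one possible variant is to keep real dictionaries instead. The sketch below shows this on two made-up reviews; `toy_reviews` and `toy_bag` are hypothetical names.

```python
import collections
import re

import pandas as pd

# Two made-up, already cleaned reviews
toy_reviews = ["love this wipe love it", "soft and warm and soft"]

# One dictionary of word -> count per review
word_counts = [
    dict(collections.Counter(re.findall(r"\w+", txt)))
    for txt in toy_reviews
]

# Keep the dictionaries themselves (not their string form) in the column
toy_bag = pd.DataFrame({"word_count": word_counts})

print(toy_bag["word_count"][0]["love"])  # 2
```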


In [13]:
products.insert(len(products.columns),"word_count",bagofwords)

In [14]:
products.head(5)

Unnamed: 0,name,review,rating,word_count
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,"{'boyfor': 1, 'texture': 1, 'that': 1, 'been':..."
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,"{'love': 1, 'keps': 1, 'my': 2, 'wipe': 1, 'ba..."
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,"{'type': 1, 'warmer': 1, 'full': 1, 'looking':..."
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,"{'love': 2, 'found': 1, 'what': 1, 'ingenious'..."
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,"{'love': 1, 'help': 1, 'buy': 1, 'way': 1, 'we..."


In [17]:
data = products[["name","review"]]

In [22]:
data.groupby("name").count().sort_values("review",ascending=False)

name,review
Vulli Sophie the Giraffe Teether,785
"Simple Wishes Hands-Free Breastpump Bra, Pink, XS-L",562
Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision,561
Baby Einstein Take Along Tunes,547
"Cloud b Twilight Constellation Night Light, Turtle",520
"Fisher-Price Booster Seat, Blue/Green/Gray",489
Fisher-Price Rainforest Jumperoo,450
"Graco Nautilus 3-in-1 Car Seat, Matrix",419
Leachco Snoogle Total Body Pillow,388
"Regalo Easy Step Walk Thru Gate, White",374


# --------

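### Use sklearn to create the bag of words
Each review is turned into a vector of length max_features that holds, for each word of the vocabulary, how many times it occurs in that review. With max_features set, only the most frequent words across all reviews are kept in the vocabulary.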

In [12]:
vectorizer = CountVectorizer(analyzer = "word",tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)

In [13]:
words_count = vectorizer.fit_transform(reviews)

words_count is a sparse matrix. Converting it to a dense NumPy array is more convenient to work with, at the cost of much more memory.

In [14]:
words_count = words_count.toarray()
words_count.shape

(183531, 5000)
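A dense array of shape (183531, 5000) holds about 918 million entries, which is roughly 7 GB if the counts are stored as 64-bit integers. As a hedged alternative, many operations can be done directly on the sparse matrix. The sketch below uses a made-up three-review corpus and a hypothetical `toy_vectorizer` with max_features = 2.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "love" and "soft" are the two most frequent words
toy_reviews = [
    "love this wipe love it",
    "soft and warm and soft",
    "love soft wipe",
]

# Keep only the 2 most frequent words across the corpus
toy_vectorizer = CountVectorizer(max_features=2)
toy_counts = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix

# Total occurrences of each vocabulary word, without densifying the matrix
totals = np.asarray(toy_counts.sum(axis=0)).ravel()

print(toy_counts.shape)            # (3, 2)
print(toy_vectorizer.vocabulary_)  # maps each kept word to its column index
print(totals.tolist())             # [3, 3]
```

Calling `toarray()` remains fine for small data such as this toy corpus.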

### Transform sklearn bag of words
Our bag of words is created (**words\_count**, together with the vocabulary learned by the CountVectorizer instance, i.e. **vectorizer.vocabulary_**). To reproduce the course notebook, let's transform it into an additional column holding the bag of words of each review as a dictionary.

In [70]:
#def to_dict(review):
#    dic = pd.DataFrame(vectorizer.get_feature_names(), columns = ["vocabulary"])
#    dic.insert(1,"count",review.reshape(-1,1))
#    dic = dic[dic["count"]>0]
#    i = iter(dic.to_dict(orient="split")["data"])
#    return{k:v for k,v in i}

In [71]:
#%%time
#bagofwords = pd.DataFrame(words_count).apply(to_dict, axis=1)
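The cells above are commented out. As a runnable sketch of the same idea, the dictionaries can be built from the sparse matrix row by row, using only the **vocabulary_** mapping. The corpus and the helper `row_to_dict` are hypothetical and not part of the course notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already cleaned reviews
toy_reviews = ["love this wipe love it", "soft and warm and soft"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)  # sparse CSR matrix

# Invert the learned vocabulary: column index -> word
index_to_word = {idx: word for word, idx in toy_vectorizer.vocabulary_.items()}


def row_to_dict(counts, i):
    """Return {word: count} for row i, reading only its non-zero entries."""
    start, end = counts.indptr[i], counts.indptr[i + 1]
    cols = counts.indices[start:end]
    vals = counts.data[start:end]
    return {index_to_word[j]: int(v) for j, v in zip(cols, vals)}


toy_dicts = [row_to_dict(toy_counts, i) for i in range(toy_counts.shape[0])]
print(toy_dicts[1]["soft"])  # 2
```

Reading only the non-zero entries of each row avoids building a full-length vocabulary table for every review.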