# Review Analysis

## Import Required Packages

In [1]:
import pandas as pd
import numpy as np

## Read Review Data

In [3]:
review = pd.read_excel('Product Reviews.xlsx')
review

Unnamed: 0,Category,product_handle,state,rating,title,author,location,body,reply,created_at,replied_at
0,Anti-Aging,resist-foaming-cleanser,published,5,Good cleanser,Jeannie,,"A lotion based cleanser, slight foam. Face fee...",,2016-12-02 12:08:24 UTC,
1,Anti-Aging,resist-vitamin-c-spot-treatment,published,1,No effect,Yan,,"Unfortunately, it did not work on my pigmented...",,2016-12-03 13:26:10 UTC,
2,Sensitive Skin - Calm Series,calm-redness-relief-moisturizer-normal-to-dry,published,2,Not Suitable for My Sensitive Skin,Linda,,I had wanted to transit from the Hydralight Mo...,,2016-12-08 07:21:26 UTC,
3,Sensitive Skin - Calm Series,calm-redness-relief-1-bha-lotion,published,1,Does No Difference as compared to Skin Perfect...,Linda,,I was previously using the Skin Perfecting 1% ...,,2016-12-08 07:28:12 UTC,
4,Very Oily Skin,skin-balancing-pore-reducing-toner,published,3,Stings my face after prolonged usage,Linda,,"This was the first toner I had purchased, and ...",,2016-12-08 07:32:38 UTC,
5,Acne Clear Series,clear-regular-strength-anti-redness-exfoliatin...,published,1,Makes skin reactive,Linda,,BHA 1% is just right while BHA 2% is too stron...,,2016-12-08 07:35:18 UTC,
6,Anti-Aging,resist-moisture-renewal-oil-booster,published,2,Lightweight but clogs my pores,Linda,,I like how this feels lightweight and has no f...,,2016-12-08 07:37:30 UTC,
7,Anti-Aging,resist-retinol-skin-smoothing-body-treatment,published,3,Does not like it,Linda,,I am still trying to finish it after a year.\n...,,2016-12-08 07:41:19 UTC,
8,Anti-Aging,resist-hyaluronic-acid-booster,published,3,Would not re-purchase,Linda,,I had finished the whole bottle and does not f...,,2016-12-08 07:43:14 UTC,
9,Very Oily Skin,skin-balancing-invisible-finish-moisture-gel,published,5,Personally my HG moisturiser,Ana,,I have complicated sensitive/oily/clog-prone s...,,2016-12-11 06:23:48 UTC,
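The preview shows two things that may be worth tidying before analysis: an `Unnamed: 0` column that looks like a leftover index from an earlier export, and `created_at` values stored as strings ending in `UTC`. The following is a minimal sketch on a toy frame built from the first two preview rows, not the notebook's own cleanup step; the name `sample` is hypothetical.

```python
import pandas as pd

# Toy rows copied from the preview above; the real frame comes from read_excel.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Category": ["Anti-Aging", "Anti-Aging"],
    "rating": [5, 1],
    "created_at": ["2016-12-02 12:08:24 UTC", "2016-12-03 13:26:10 UTC"],
})

# Drop the column that only repeats the row index.
sample = sample.drop(columns=["Unnamed: 0"])

# Parse the timestamp strings into timezone-aware datetimes.
# The literal "UTC" suffix is matched by the format string.
sample["created_at"] = pd.to_datetime(
    sample["created_at"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)

print(sample.dtypes)
```

With real datetimes in place, reviews could then be grouped by month or year if a time-based view is needed.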


## Review Rating Analysis

In [6]:
group_by_category = review['rating'].groupby(review['Category'])
for name, group in group_by_category:
    print(name)
    print(group.mean())
    print(group.median())

Acne Clear Series
4.15151515152
4.0
Anti-Aging
4.36809815951
5.0
Bath
4.66666666667
5.0
Clinical Series
4.57142857143
5.0
Combi Skin
4.125
4.0
Dry Skin
4.11111111111
4.0
Earth Sourced
4.25
5.0
Men Series
4.72727272727
5.0
Sensitive Skin - Calm Series
3.5
3.0
Very Dry Skin
4.35714285714
5.0
Very Oily Skin
4.59722222222
5.0
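The same per-category statistics can also be collected into one table, which makes it easier to sort categories and to see how many reviews sit behind each mean. This is a sketch on made-up toy data (the frame `toy` and its values are assumptions, not the review file). Note that `groupby` leaves out rows whose category is missing by default.

```python
import pandas as pd

# Made-up ratings for illustration only; one row has no category.
toy = pd.DataFrame({
    "Category": ["Bath", "Bath", "Dry Skin", "Dry Skin", None],
    "rating": [5, 4, 3, 5, 1],
})

# One row per category with mean, median and number of reviews.
summary = (
    toy.groupby("Category")["rating"]
       .agg(["mean", "median", "count"])
       .sort_values("mean", ascending=False)
)

print(summary)
```

The `count` column matters when reading the means: a category with only a handful of reviews can swing a lot with a single low rating.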


## Review Topic Analysis

In [12]:
review['Category'].unique()

array([u'Anti-Aging', u'Sensitive Skin - Calm Series', u'Very Oily Skin',
       u'Acne Clear Series', u'Combi Skin', u'Dry Skin', u'Men Series',
       nan, u'Bath', u'Very Dry Skin', u'Clinical Series', u'Earth Sourced'], dtype=object)
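The `nan` entry in this array means some reviews have no category. A quick way to see how many, and to get a clean list to loop over, is sketched below on a toy series (the values are assumptions for illustration).

```python
import pandas as pd

# Toy category column with two missing entries.
toy = pd.Series(["Anti-Aging", None, "Bath", "Anti-Aging", None], name="Category")

# Number of reviews without a category.
n_missing = int(toy.isnull().sum())

# Category labels with the missing value removed.
categories = sorted(toy.dropna().unique())

print(n_missing)
print(categories)
```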

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re
import string
# Tokenize and lemmatize a review body
def tokenize(text):

    # Use a lemmatizer rather than a stemmer
    # stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Replace every non-letter character with a space
    text = re.sub(r"[^a-zA-Z]", ' ', text)

    # Tokenization
    tokens = word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]

    # Lemmatization
    stems = stem_tokens(tokens, lemmatizer)
    return stems

# Lemmatize each token and keep lemmas longer than two characters
def stem_tokens(t, s):
    stemmed = []
    for item in t:
        # stemmed.append(s.stem(item))
        lemma = s.lemmatize(item)
        if len(lemma) > 2:
            stemmed.append(lemma)
    return stemmed

# Print the n_top_words highest-weighted words of each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

# Fit a separate 3-topic LDA model on the reviews of each category
for cat in review['Category'].unique():
    if not pd.isnull(cat):
        print(cat)
        lda = LatentDirichletAllocation(n_components=3, max_iter=500,
                                        learning_method='online',
                                        learning_offset=50.,
                                        random_state=0)

        tf_vectorizer = CountVectorizer(min_df=2, tokenizer=tokenize,
                                        stop_words='english')

        tf = tf_vectorizer.fit_transform(review[review['Category'] == cat]['body'])

        lda.fit(tf)
        # On scikit-learn 1.0 and later, use get_feature_names_out() instead
        tf_feature_names = tf_vectorizer.get_feature_names()
        print_top_words(lda, tf_feature_names, 20)
        print("\n")

Anti-Aging
Topic #0: love texture product day great colour blemish makeup wish nice hide make making tint hard excellent moisturizing really little scar
Topic #1: skin oily use feel product really bha pore eye moisturiser using sunscreen doe sensitive love tried used acne combination like
Topic #2: skin use product try like really little just moisture day oily based good smell spot applied face cleanser size work


Sensitive Skin - Calm Series
Topic #0: skin using doe like sensitive product gentle moisturizer lotion kept time calming good exfoliant month feel pore better toner clogged
Topic #1: skin dry oily face good product work breakout used got redness little moisturiser combination like tried cause time leak opened
Topic #2: good moisturising bay use work sensitive wonder matte tried skin did absorbed definitely overall try using version breakout exfoliant time


Very Oily Skin
Topic #0: skin bha product sticky cleanser moisturizer feeling time control non breakout acne morning dr
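Some tokens in the topic lists of this notebook are fragments rather than words. The regex in `tokenize` turns apostrophes and hyphens into spaces, so contractions are split and the short pieces are then dropped by the length filter, which is presumably where tokens like `doesn` come from. The token `doe` is probably `does` after the lemmatizer treats it as a plural noun. The sketch below reproduces only the regex and length-filter steps; `rough_tokens` is a hypothetical helper, `str.split()` is a stand-in for `word_tokenize`, and lemmatization is skipped so that it runs without NLTK data.

```python
import re

def rough_tokens(text, min_len=3):
    """Approximate the notebook's cleanup without NLTK (hypothetical helper)."""
    # CountVectorizer lowercases by default before calling a custom tokenizer.
    text = text.lower()
    # Same regex as tokenize(): every non-letter becomes a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # str.split() stands in for word_tokenize; lemmatization is skipped.
    return [tok for tok in text.split() if len(tok) >= min_len]

print(rough_tokens("It doesn't sting; I'd re-purchase it."))
# ['doesn', 'sting', 'purchase']
```

Because punctuation is already removed by the regex, the later `string.punctuation` filter in `tokenize` has nothing left to remove. If fragments such as `doesn` are unwanted, one option is to add them to a custom stop-word list passed to `CountVectorizer`.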

In [21]:
# Fit a separate 2-topic LDA model on the reviews of each rating value
for rating in review['rating'].unique():
    if not pd.isnull(rating):
        print(rating)
        lda = LatentDirichletAllocation(n_components=2, max_iter=500,
                                        learning_method='online',
                                        random_state=0)

        tf_vectorizer = CountVectorizer(min_df=2, tokenizer=tokenize,
                                        stop_words='english')

        tf = tf_vectorizer.fit_transform(review[review['rating'] == rating]['body'])

        lda.fit(tf)
        # On scikit-learn 1.0 and later, use get_feature_names_out() instead
        tf_feature_names = tf_vectorizer.get_feature_names()
        print_top_words(lda, tf_feature_names, 20)
        print("\n")

5
Topic #0: skin use love day makeup dry feel sunscreen product great doesn bottle clean like moisturizer little sticky long light oily
Topic #1: skin product using pore use acne oily used moisturiser combination really month feel good work time day sensitive breakout face


1
Topic #0: based work try oily skin type pimple just unfortunately use good better tried bha product bump right break caused using
Topic #1: skin bha break product using caused right bump use tried pimple better good oily work unfortunately just try type based


2
Topic #0: skin like oily little feel didnt use work heavy prefer pore broke packaging oil doesn unfortunately product good suitable doe
Topic #1: skin product doe sensitive month like good unfortunately pore oil toner suit feel consistency lightweight really love suitable doesn packaging


3
Topic #0: skin cleanser feel oily bottle sensitive work better didn gentle doe try did pretty strong exfoliator tried dry bit combination
Topic #1: product time bha 