## SENTIMENT ANALYSIS

To understand how people feel about something, we can run sentiment analysis on text that records their opinions, such as customer reviews.

You will need to [install the textblob library](https://anaconda.org/conda-forge/textblob).

In [37]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from textblob import TextBlob

from string import punctuation

import pandas as pd
import numpy as np

In [38]:
#load the data from the WomensClothingReviews.csv file
filepath = "datasets/WomensClothingReviews.csv" #forward slash avoids the invalid "\W" escape and works on every OS
df = pd.read_csv(filepath, encoding = "latin-1") #this file is encoded differently

df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [39]:
#count the missing (null) values in each column
df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [40]:
#drop the rows where 'Review Text' is missing (null)
dfData = df.dropna(subset=['Review Text'])

In [41]:
#count the nulls again to confirm that 'Review Text' has none left
dfData.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      2966
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                13
Department Name              13
Class Name                   13
dtype: int64

In [43]:
#list of english stopwords (filler words)
eng_stopwords = stopwords.words('english')
#eng_stopwords

#### Make a sentiment value column in a dataframe


In [44]:
#create a function to clean up each review,
#then analyze it and return its sentiment polarity
def reviewSentiment(review):

    #make text lowercase
    review = review.lower()

    #tokenize the review
    tknz_review = word_tokenize(review)

    #remove punctuation and filler words (stopwords) in one pass;
    #building a new list avoids skipping tokens, which happens when
    #items are removed from a list while looping over it
    clean_tokens = [token for token in tknz_review
                    if token not in punctuation and token not in eng_stopwords]

    #put sentence back together with remaining clean words
    clean_review = ' '.join(clean_tokens)

    #turn into textblob and return its sentiment polarity (-1.0 to 1.0)
    return TextBlob(clean_review).sentiment.polarity
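
One detail of the cleanup step is worth a closer look: calling `remove()` on a list while looping over that same list makes the loop skip the item that follows each removal, so back-to-back punctuation tokens can survive. The sketch below uses made-up tokens (not taken from the dataset) to compare that pattern with building a new filtered list. The names `tokens`, `looped`, and `filtered` are illustrative only.

```python
from string import punctuation

# hypothetical tokens, as a tokenizer might return them
tokens = ['wow', '!', '!', 'great', '.', ')']

# pattern 1: remove items from the list being looped over
looped = list(tokens)
for token in looped:
    if token in punctuation:
        looped.remove(token)

# pattern 2: build a new list that keeps only non-punctuation tokens
filtered = [token for token in tokens if token not in punctuation]

print(looped)    # some punctuation is left behind
print(filtered)  # only the words remain
```

Because TextBlob can weigh marks such as `!` when scoring, leftover punctuation may nudge the polarity of a review.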

In [45]:
#create a new column to hold sentiment value from function
dfData['Review Sentiment'] = dfData['Review Text'].apply(reviewSentiment)
    
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
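
The warning above appears because `dfData` was produced by filtering `df`, and pandas cannot tell whether the new column is also meant to reach the original frame. One common way to avoid it, sketched here on a tiny made-up frame, is to take an explicit copy with `.copy()` before adding columns. The frame contents and the `Review Length` column are hypothetical stand-ins for the real data.

```python
import pandas as pd

# hypothetical miniature of the reviews table
df = pd.DataFrame({
    "Review Text": ["Love it", None, "Too small"],
    "Rating": [5, 3, 2],
})

# drop rows without review text, then copy so the result is independent of df
df_data = df.dropna(subset=["Review Text"]).copy()

# adding a column to the copy does not touch df and raises no warning
df_data["Review Length"] = df_data["Review Text"].str.len()
print(df_data)
```

The same idea would apply to the notebook by appending `.copy()` to the `dropna` call that creates `dfData`.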
  


In [29]:
#verify sentiment values in new column
dfData.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Review Sentiment
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,0.633333
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,0.31875
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,0.0823
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,0.5
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,0.5


In [47]:
#create a function to assign a polarity category to the sentiment
def sentimentCategory(sent_num):
    if sent_num >= 0.2:
        return "positive"
    elif sent_num <= -0.2:
        return "negative"
    else:
        return "neutral"
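
To see how the ±0.2 cut-offs behave, including exactly at the boundaries, here is a small self-contained sketch. It redefines the function so the block runs on its own and applies it to a few sample scores. The scores are illustrative, apart from 0.633333 and 0.0823, which appear in the table above.

```python
def sentimentCategory(sent_num):
    # same thresholds as the function above
    if sent_num >= 0.2:
        return "positive"
    elif sent_num <= -0.2:
        return "negative"
    else:
        return "neutral"

# sample polarity scores, including both boundary values
sample_scores = [0.633333, 0.2, 0.0823, 0.0, -0.2, -0.5]
categories = [sentimentCategory(score) for score in sample_scores]

for score, category in zip(sample_scores, categories):
    print(score, category)
```

Both boundary values are inclusive, so a score of exactly 0.2 counts as positive and exactly -0.2 counts as negative.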

In [48]:
#create a new column to hold sentiment category
dfData['Sentiment Category'] = dfData['Review Sentiment'].apply(sentimentCategory)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [49]:
#write the dataframe to a csv file into the working directory --- to verify in excel
dfData.to_csv('MyReviews.csv')

In [50]:
dfData.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Review Sentiment,Sentiment Category
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,0.633333,positive
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,0.31875,positive
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,0.0823,neutral
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,0.5,positive
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,0.5,positive


In [51]:
#compare frequency of positive, negative, and neutral reviews
dfData['Sentiment Category'].value_counts()

positive    14087
neutral      8401
negative      153
Name: Sentiment Category, dtype: int64
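
Raw counts are easier to compare as percentages. The sketch below copies the three counts from the output above and converts them to shares of the total. In pandas, `value_counts(normalize=True)` should give the same proportions directly. The variable names here are illustrative.

```python
# counts copied from the value_counts() output above
counts = {"positive": 14087, "neutral": 8401, "negative": 153}

total = sum(counts.values())

# share of each category as a percentage of all reviews
shares = {label: 100 * n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```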

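Overall, most reviews score as positive (14,087 of 22,641, about 62%) and very few score as negative (153, under 1%), so shoppers appear to feel good about the clothing, with only a few complaints. The remaining 8,401 reviews (about 37%) fall in the neutral band, meaning their polarity lies between -0.2 and 0.2. The 20 most common reviewer ages, listed below, range from 29 to 53.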

In [57]:
#count how many reviews come from each age, then list the 20 most common ages
fd_nw = FreqDist(dfData['Age'])
fd_nw.most_common(20)

[(39, 1226),
 (35, 851),
 (36, 801),
 (34, 766),
 (38, 751),
 (37, 727),
 (41, 717),
 (33, 699),
 (46, 691),
 (42, 625),
 (48, 605),
 (44, 596),
 (32, 594),
 (40, 584),
 (43, 555),
 (31, 549),
 (47, 545),
 (53, 536),
 (45, 511),
 (29, 503)]
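
`FreqDist` is built on Python's `collections.Counter`, so the same kind of tally can be sketched with the standard library alone. The ages below are made up for illustration and are not the dataset's values.

```python
from collections import Counter

# hypothetical ages, standing in for dfData['Age']
ages = [39, 35, 39, 36, 35, 39, 41]

age_counts = Counter(ages)

# list of (age, count) pairs, largest count first
top_two = age_counts.most_common(2)
print(top_two)
```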