# Sentiment Analysis

## Learning Objectives:
1. How to prepare data for machine learning, i.e., feature selection
1. How to learn a machine learning classifier
1. How to apply a machine learning classifier
1. How to evaluate a machine learning classifier

### Process:
1. load dataset
1. analyze dataset
1. create feature vector
1. vectorize data
1. learn machine learning classifier
1. evaluate classifier
1. apply machine learning classifier


# Setup

## Install Dependencies

In [1]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install xlrd
!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install openpyxl
!{sys.executable} -m pip install scipy


You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


## Import Dependencies

In [2]:
################################
# Required
################################
import os, sys
import pandas as pd

# Feature Creation
from nltk.tokenize import RegexpTokenizer
#tokenizer = RegexpTokenizer(r'\w+[-]?\w+')

# Machine Learning Algorithms
from sklearn.naive_bayes import GaussianNB
# More at http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

# Evaluation
from sklearn.model_selection import cross_val_score

################################
# Optional
################################
import re #RegEx

# Improve feature creation
## Natural Language Processing module
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Improve feature selection
import sklearn
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Machine Learning
import scipy
from scipy.sparse import dok_matrix

In [3]:
# Checking versions
print('Version check:\n--------------')
print('Pandas v',pd.__version__)
print('NLTK v',nltk.__version__)
print('SKLearn v',sklearn.__version__)

Version check:
--------------
Pandas v 0.22.0
NLTK v 3.2.4
SKLearn v 0.19.1


### How to download NLTK packages
Replace *URL*, *USERNAME*, and *PASSWORD* if you need to configure a proxy.

Then uncomment the lines and run them.

*Please note that a separate window will pop up. Select the appropriate package there.*

In [4]:
# nltk.set_proxy('http://gate-zrh-os.swissre.com:8080', ('<USERNAME>', '<PASSWORD>'))
# nltk.download()

## Utilities

In [5]:
# Save a DataFrame as an Excel file for review.
# Note: relies on the global outDirectory, which is defined in the "Load Dataset" cell.
def writeResults(dfResults, sFilename, sPrefix='', sPostfix=''):
    fnOut = sPrefix + sFilename + sPostfix

    filepath = os.path.join(outDirectory, fnOut)
    dfResults.to_excel(filepath)
    print('Results have been written to', filepath)

# Load Dataset

In [6]:
# Filepath to dataset
fpDataset = './data/customerfeedback.xlsx'

#Load Excel file into a DataFrame
dfExcelWorkbook = pd.read_excel(fpDataset, sheet_name=None)
sheets = list(dfExcelWorkbook.keys())
dfData = dfExcelWorkbook[sheets[0]]

# Prepare directory to output results
outDirectory = './result/'
if not os.path.exists(outDirectory):
    os.makedirs(outDirectory)

In [7]:
# Check dataset 
dfData.head(10)

| | FEEDBACK | RATING |
|---|---|---|
| 0 | never got clean glasses in Warsaw either. | 0 |
| 1 | The bed in the Radisson Bleu was not comfortab... | 1 |
| 2 | Michael was an excellent tour director. He wen... | 1 |
| 3 | Krakow Hotel was below my expectations because... | 0 |
| 4 | All the city tour guides have been excellent a... | 1 |
| 5 | The bed in the Radisson Bleu was not comfortab... | 0 |
| 6 | The Prague hotel should provide in-room intern... | 0 |
| 7 | Michael (Tour Director) was brilliant! Thomas ... | 1 |
| 8 | The entire voyage was very well done by Viking... | 0 |
| 9 | Michael was excellent. The Prague hotel should... | 1 |


## Stats and Info
Basic statistics about the dataset.

In [8]:
print('Number of attributes:', dfData.shape[1])
print('Name of attributes:', dfData.columns)
print('Number of rows:', dfData.shape[0])
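# Assuming ratings are 0 or 1, the mean is the share of positive feedbacks
# (not the ratio of positives to negatives)
print('Positives/Negatives:', dfData['RATING'].mean())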


Number of attributes: 2
Name of attributes: Index(['FEEDBACK', 'RATING'], dtype='object')
Number of rows: 28448
Positives/Negatives: 0.5643982002249719


# Feature Vector

## Build Feature Vector
Machine learning requires that the features (input) correlate with the target class (output). We therefore need to define which features to use. Here we use words as features, because the words in a feedback text correlate with its sentiment.

### Result
* **dfFeatures**: DataFrame containing a list of features (words). For each feature it contains the frequency (support) and the average sentiment.

In [9]:
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# MODIFY THIS METHOD TO WIN
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HINT: e.g., TF/IDF
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

# counts the tokens in a list of tokens.
def countTokens(tokens):
    results = {}

    for token in tokens:
        if token not in results:
            results[token] = 1
        else:
            results[token] = results[token] + 1
    return results
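
The hint above mentions TF/IDF as an alternative to raw counts. The following is only a minimal sketch of one common TF-IDF variant (term frequency = count / document length, inverse document frequency = log(N / document frequency)); the function name `tf_idf` and the toy documents are assumptions for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {token: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency);
    other TF-IDF variants (e.g. with smoothing) exist.
    """
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights

# toy data, not taken from the dataset
docs = [
    ['the', 'hotel', 'was', 'excellent'],
    ['the', 'bed', 'was', 'not', 'comfortable'],
    ['the', 'tour', 'guide', 'was', 'excellent'],
]
weights = tf_idf(docs)
print(weights[0])
```

Tokens that occur in every document (here 'the' and 'was') get weight 0, which is one reason TF-IDF can downplay uninformative words.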

In [10]:
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# MODIFY THIS METHOD TO WIN
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HINT: stopwords, lemmatization, stemming, named entities, lowercase, word combinations (e.g., 'not good'), adjectives, etc.
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

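# extract features (tokens) from a text
def extractTokens(strText):
    # features = tokenizer.tokenize(strText)
    # Split on every single whitespace character; consecutive whitespace
    # therefore produces empty-string tokens.
    result = re.split(r'\s', strText)
    return result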
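
As a rough illustration of some of the hints (lowercasing, stopword removal, and word combinations such as 'not good'), a tokenizer could look like the sketch below. The names `extract_tokens_cleaned`, `STOPWORDS`, and `NEGATIONS` are hypothetical, and the tiny stopword set is an assumption; `nltk.corpus.stopwords` provides a fuller list once the NLTK data has been downloaded.

```python
import re

# Small illustrative stopword list (assumption, not from the notebook)
STOPWORDS = {'the', 'a', 'an', 'and', 'to', 'of', 'in', 'was', 'were', 'i'}
NEGATIONS = {'not', 'no', 'never'}

def extract_tokens_cleaned(text):
    # lowercase and keep only word-like tokens (letters, digits, apostrophes)
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    negate = False
    for word in words:
        if word in NEGATIONS:
            negate = True      # remember the negation for the next kept word
            continue
        if word in STOPWORDS:
            continue
        tokens.append('not_' + word if negate else word)
        negate = False
    return tokens

print(extract_tokens_cleaned('The bed was NOT comfortable. Breakfast was excellent!'))
# ['bed', 'not_comfortable', 'breakfast', 'excellent']
```

Attaching the negation to the next kept word is a crude heuristic; whether it helps should be checked with the evaluation further down.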

In [11]:
colFeedback = 'FEEDBACK'
colRating = 'RATING'

# Count, for each feature, in how many positive and negative feedbacks it occurs
features = {}
for index, row in dfData.iterrows(): #dfData[1:1000].iterrows():#
    # get feedback directly from the current row
    feedback = row[colFeedback]
    rating = row[colRating]
    
    # analyze feedback
    tokens = extractTokens(str(feedback))
    featurecount = countTokens(tokens)
    
    # add to feature list
    for feature in featurecount.keys():
        if feature not in features:
            features[feature] = {'positives': 0, 'negatives': 0}
        if rating == 0:
            features[feature]['negatives'] += 1
        else:
            features[feature]['positives'] += 1

# create and beautify
dfFeatures = pd.DataFrame.from_dict(features, orient='index')
dfFeatures = dfFeatures.reset_index()
dfFeatures = dfFeatures.rename({'index':'feature'}, axis=1)

### Analyze Features

In [12]:
# Count the number of feedbacks in which each feature occurs
dfFeatures['support'] = dfFeatures['positives'] + dfFeatures['negatives']

# Compute the sentiment value of each feature (share of positive feedbacks)
dfFeatures['sentiment'] = dfFeatures['positives'] / dfFeatures['support']
fnFeaturesAll = 'allfeatures.xlsx'
writeResults(dfFeatures, fnFeaturesAll)

dfFeatures.sort_values(by='support', ascending=False).head(20)

Results have been written to ./result/allfeatures.xlsx


| | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 45865 | the | 5730 | 5251 | 10981 | 0.52181 |
| 48319 | was | 5547 | 4053 | 9600 | 0.577812 |
| 23663 | and | 5551 | 3541 | 9092 | 0.610537 |
| 46351 | to | 3753 | 5029 | 8782 | 0.427351 |
| 38601 | of | 3174 | 3305 | 6479 | 0.48989 |
| 22700 | a | 3210 | 3102 | 6312 | 0.508555 |
| 34812 | in | 2518 | 3412 | 5930 | 0.424621 |
| 0 | | 3063 | 2611 | 5674 | 0.539831 |
| 11702 | I | 2612 | 2342 | 4954 | 0.527251 |
| 48604 | were | 3089 | 1798 | 4887 | 0.632085 |


In [13]:
print("Number of Features", dfFeatures.shape[0])

Number of Features 49414


# (i) DISCUSSION: How to clean up the features?

## Load Feature Vector

In [14]:
dfExcelWorkbook = pd.read_excel(outDirectory + fnFeaturesAll, sheet_name=None)
sheets = list(dfExcelWorkbook.keys())
dfFeatures = dfExcelWorkbook[sheets[0]]

In [15]:
print('Feature Stats:')
print('Number of Features:', dfFeatures.shape[0])
dfFeatures.head(20)

Feature Stats:
Number of Features: 49414


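| | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 0 | | 3063 | 2611 | 5674 | 0.539831 |
| 1 | ! | 63 | 19 | 82 | 0.768293 |
| 2 | !! | 14 | 3 | 17 | 0.823529 |
| 3 | !!! | 12 | 4 | 16 | 0.75 |
| 4 | !!!! | 1 | 0 | 1 | 1.0 |
| 5 | !!!!! | 1 | 0 | 1 | 1.0 |
| 6 | !) | 0 | 1 | 1 | 0.0 |
| 7 | !Clare | 0 | 1 | 1 | 0.0 |
| 8 | !Merci | 1 | 0 | 1 | 1.0 |
| 9 | " | 31 | 48 | 79 | 0.392405 |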


## Feature Selection

In [16]:
dfFeatureVector = dfFeatures.sort_values(by='support', ascending=False)
dfFeatureVector.head(10)

| | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 45865 | the | 5730 | 5251 | 10981 | 0.52181 |
| 48319 | was | 5547 | 4053 | 9600 | 0.577812 |
| 23663 | and | 5551 | 3541 | 9092 | 0.610537 |
| 46351 | to | 3753 | 5029 | 8782 | 0.427351 |
| 38601 | of | 3174 | 3305 | 6479 | 0.48989 |
| 22700 | a | 3210 | 3102 | 6312 | 0.508555 |
| 34812 | in | 2518 | 3412 | 5930 | 0.424621 |
| 0 | | 3063 | 2611 | 5674 | 0.539831 |
| 11702 | I | 2612 | 2342 | 4954 | 0.527251 |
| 48604 | were | 3089 | 1798 | 4887 | 0.632085 |


## Top-10 Negatives

In [17]:
dfFeatureVector.sort_values(by='sentiment', ascending=True).head(10)

| | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 36808 | lunch-NO | 0 | 1 | 1 | 0.0 |
| 32 | "American" | 0 | 1 | 1 | 0.0 |
| 31 | "American | 0 | 1 | 1 | 0.0 |
| 49303 | you'd | 0 | 1 | 1 | 0.0 |
| 49304 | you'll | 0 | 1 | 1 | 0.0 |
| 49306 | you've | 0 | 1 | 1 | 0.0 |
| 28 | "AVERAGE" | 0 | 1 | 1 | 0.0 |
| 49311 | you.This | 0 | 1 | 1 | 0.0 |
| 27 | "AUTODEFESS" | 0 | 1 | 1 | 0.0 |
| 49351 | zippered | 0 | 1 | 1 | 0.0 |


## Top-10 Positives

In [18]:
dfFeatureVector.sort_values(by='sentiment', ascending=False).head(10)

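| | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 7123 | Choc | 1 | 0 | 1 | 1.0 |
| 31202 | excurison | 2 | 0 | 2 | 1.0 |
| 3712 | ANA. | 1 | 0 | 1 | 1.0 |
| 46948 | tries | 2 | 0 | 2 | 1.0 |
| 3714 | ANABELA! | 1 | 0 | 1 | 1.0 |
| 41200 | qualityof | 1 | 0 | 1 | 1.0 |
| 41198 | quality.I | 1 | 0 | 1 | 1.0 |
| 3715 | ANABELA) | 1 | 0 | 1 | 1.0 |
| 38614 | ofered | 2 | 0 | 2 | 1.0 |
| 3716 | ANABELLA | 1 | 0 | 1 | 1.0 |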


In [19]:
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# MODIFY THIS METHOD TO WIN
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HINT: e.g., remove rare features, remove irrelevant features
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

def selectFeatures(dfFeatures):
    print("Number of features:",dfFeatures.shape[0])
    result = dfFeatures
    # result = dfFeatures[dfFeatures.support > 10]
    
    print("Number of selected features:", result.shape[0])
    return result
    
dfSelectedFeatures = selectFeatures(dfFeatureVector)

Number of features: 49414
Number of selected features: 49414
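
The hint for `selectFeatures` suggests removing rare and irrelevant features. One possible reading, sketched below, is to keep only features with a minimum support whose sentiment is clearly away from the neutral value 0.5. The function name `select_features_sketch` and both thresholds are assumptions chosen for illustration; the toy rows reuse numbers from the tables above.

```python
import pandas as pd

def select_features_sketch(df, min_support=10, min_polarity=0.05):
    # keep features that occur often enough ...
    frequent = df['support'] >= min_support
    # ... and whose sentiment is clearly away from the neutral value 0.5
    polar = (df['sentiment'] - 0.5).abs() >= min_polarity
    return df[frequent & polar]

# toy rows; the numbers are copied from the feature tables above
toy = pd.DataFrame({
    'feature':   ['the', 'and', 'to', 'zippered'],
    'positives': [5730, 5551, 3753, 0],
    'negatives': [5251, 3541, 5029, 1],
    'support':   [10981, 9092, 8782, 1],
    'sentiment': [0.52181, 0.610537, 0.427351, 0.0],
})
print(select_features_sketch(toy)['feature'].tolist())
# ['and', 'to']
```

Here 'the' is dropped as too neutral and 'zippered' as too rare; suitable thresholds for the real data would have to be found by experiment.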


# Prepare Training Set

## Create Instances

An instance is a vector representation of a text, built from the given feature vector: it has one entry per feature, set to 1 if the feature occurs in the text and 0 otherwise. Instances are used to train a machine learning model or to classify a text.

In [20]:
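def createInstance(strText, dfFeatures=dfFeatureVector):
    result = []
    
    # Mark each feature with 1 if it occurs in the text, else 0.
    # Note: this is a substring test, so a short feature such as 'a' matches almost any text.
    for feature in dfFeatures['feature']:
        if (str(feature) in strText):
            result.append(1)
        else:
            result.append(0)
    
    return result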
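
Because `createInstance` tests whether a feature is a substring of the text, short features can fire on unrelated words. The sketch below contrasts that with a comparison of whole tokens; `create_instance_by_token` and the toy vocabulary are hypothetical names and data used only for illustration.

```python
def create_instance_by_token(text, feature_list, tokenize=str.split):
    # compare whole tokens instead of testing substrings
    tokens = set(tokenize(text))
    return [1 if feature in tokens else 0 for feature in feature_list]

toy_features = ['a', 'in', 'dinner', 'was']   # toy vocabulary, not the full feature vector
text = 'Our dinner was amazing'

# substring test, as in createInstance: 'a' and 'in' match inside other words
substring_instance = [1 if f in text else 0 for f in toy_features]
token_instance = create_instance_by_token(text, toy_features)

print(substring_instance)  # [1, 1, 1, 1]
print(token_instance)      # [0, 0, 1, 1]
```

If a token-based instance is used, the same tokenizer should be applied when building the feature list, otherwise features and tokens will not line up.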

In [21]:
trainSet = []
trainSetLabels = []
for index, row in dfData[0:100].iterrows():
    instance = createInstance(str(row['FEEDBACK']))
    trainSet.append(instance)
    trainSetLabels.append(row['RATING'])

In [22]:
# Check result
trainSet[0:1]

[[1,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,


## Feature Selection

In [23]:
#X2TrainInstances = SelectKBest(chi2, k=10).fit_transform(trainSet, trainSetLabels)
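
The commented-out line above would apply a chi-squared feature selection. The following sketch shows how `SelectKBest` with `chi2` behaves on a tiny, made-up binary matrix; the data and `k=2` are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# toy binary instances (rows = texts, columns = features) and labels
X = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
])
y = np.array([1, 1, 0, 0])

# keep the two features with the highest chi-squared score
selector = SelectKBest(chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # column indices of the kept features
print(X_reduced.shape)
```

In this toy example columns 1 and 2 separate the two classes perfectly, while columns 0 and 3 are equally frequent in both classes, so columns 1 and 2 are the ones kept. The fitted selector would also have to be applied (via `transform`) to any text that is classified later.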

# Evaluation

In [37]:
m = dok_matrix(trainSet)

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
myDecisionTree = DecisionTreeClassifier(max_depth=5)
classifier = myDecisionTree.fit(trainSet,trainSetLabels)

# Random Forest
# from sklearn.ensemble import RandomForestClassifier
# classifier = RandomForestClassifier(n_estimators=20).fit(trainSet,trainSetLabels)

# Naive Bayes (NB)
# classifier = GaussianNB().fit(trainSet,trainSetLabels)

# ADA Boost
#from sklearn.ensemble import AdaBoostClassifier
#classifier = AdaBoostClassifier(n_estimators=200).fit(m,trainSetLabels)

# Support Vector Machines (SVM)
# from sklearn.svm import SVC
# classifier = SVC().fit(m,trainSetLabels)

# Neural Network (NN)
#from sklearn.neural_network import MLPClassifier
#classifier = MLPClassifier(alpha=1).fit(trainSet,trainSetLabels)


## Quick Analysis

In [38]:
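# Accuracy on the training data itself; this is optimistic, so compare it with the cross-validation scores
classifier.score(trainSet,trainSetLabels)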

0.96

In [42]:
score = cross_val_score(myDecisionTree, trainSet, trainSetLabels, cv=10)
print('Score per fold:', score)
print('Avg. Score:', score.mean())

Precision_score = cross_val_score(myDecisionTree, trainSet, trainSetLabels, cv=10, scoring='precision')
print('Precision:', Precision_score.mean())
Recall_score = cross_val_score(myDecisionTree, trainSet, trainSetLabels, cv=10, scoring='recall')
print('Recall:', Recall_score.mean())
F1_score = cross_val_score(myDecisionTree, trainSet, trainSetLabels, cv=10, scoring='f1')
print('F1:', F1_score.mean())



Score per fold: [0.72727273 0.90909091 0.72727273 0.90909091 0.8        0.4
 0.77777778 0.44444444 0.77777778 0.77777778]
Avg. Score: 0.7250505050505051
Precision: 0.8196428571428571
Recall: 0.7785714285714286
F1: 0.8082950382950382
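
The cell above runs `cross_val_score` once per metric, which refits the model for each call. As a sketch of an alternative, scikit-learn's `cross_validate` accepts several scoring names and evaluates them on the same folds. The toy data, `cv=5`, and `random_state=0` below are assumptions made so the example is self-contained; they are not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# toy data: 40 instances with 5 binary features; the label equals the first feature
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(40, 5))
X[:, 0] = np.arange(40) % 2      # make the first feature perfectly balanced
y = X[:, 0].copy()

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
metrics = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(tree, X, y, cv=5, scoring=metrics)

# one array of per-fold scores per metric, stored under 'test_<metric>'
for metric in metrics:
    print(metric, scores['test_' + metric].mean())
```

Fixing `random_state` also makes the decision tree reproducible between runs, which helps when comparing feature-engineering variants.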


# Sentiment Analysis

In [43]:
classifier = myDecisionTree.fit(trainSet,trainSetLabels)

# Naive Bayes (NB)
#classifier = GaussianNB().fit(trainSet,trainSetLabels)

# Support Vector Machines (SVM)
# from sklearn.svm import SVC
# svm = SVC()
# classifier = svm.fit(m,trainSetLabels)

# ADA Boost
# from sklearn.ensemble import AdaBoostClassifier
# adaBoost = AdaBoostClassifier(n_estimators=200)
# classifier = adaBoost.fit(m,trainSetLabels)

# Random Forest
# from sklearn.ensemble import RandomForestClassifier
# randomForest = RandomForestClassifier(n_estimators=20)
# classifier = randomForest.fit(trainSet,trainSetLabels)

## Analyze Errors

In [44]:
for index, row in dfData[0:100].iterrows():
    feedback = str(row['FEEDBACK'])
    instance = [createInstance(feedback)]
    # predict() returns one prediction per instance; take the single value
    predictedSentiment = classifier.predict(instance)[0]
    
    if (predictedSentiment != row['RATING']):
        sentiment = 'BAD'
        if predictedSentiment == 1:
            sentiment = 'GOOD'
        print("AI says", sentiment, ":\t",feedback)

AI says GOOD :	 The bed in the Radisson Bleu was not comfortable. However, the breakfast was excellent with good food and excellent service. WE had a wonderful time and will travel with Viking again
AI says GOOD :	 The Prague hotel should provide in-room internet. The krakow hotel was well situated but was "tired" in appearance & furnishings. Breakfast at all hotels was outstanding.
AI says GOOD :	 Would like to have spent longer at Lamego
AI says GOOD :	 Would have been nice to have had a box lunch since tour left @ 7:15 am & did not return until 2:30
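
All four misclassifications printed above are cases where the classifier predicted GOOD. To see such a bias at a glance, the errors can be tallied by type. The helper `summarize_errors` and the toy labels below are hypothetical and only sketch the idea.

```python
from collections import Counter

def summarize_errors(true_labels, predicted_labels):
    # count the four outcomes of a binary classifier (1 = positive, 0 = negative)
    names = {(1, 1): 'true positive', (0, 0): 'true negative',
             (0, 1): 'false positive', (1, 0): 'false negative'}
    return Counter(names[(t, p)] for t, p in zip(true_labels, predicted_labels))

# toy labels, not taken from the dataset
true_labels      = [1, 0, 0, 1, 0, 1]
predicted_labels = [1, 1, 0, 1, 1, 0]
summary = summarize_errors(true_labels, predicted_labels)
print(summary)
```

A high share of false positives, as in the printed errors, would suggest looking at features that signal negative sentiment, such as negations.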
