# Sentiment Analysis

## Learning Objectives:
1. How to prepare data for machine learning, i.e., feature selection
1. How to learn a machine learning classifier
1. How to apply a machine learning classifier
1. How to evaluate a machine learning classifier

### Process:
1. load dataset
1. analyze dataset
1. create feature vector
1. vectorize data
1. learn machine learning classifier
1. evaluate classifier
1. apply machine learning classifier


# Setup

## Install Dependencies

In [1]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install xlrd
!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install openpyxl
!{sys.executable} -m pip install scipy


You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


## Import Dependencies

In [2]:
################################
# Required
################################
import os, sys
import pandas as pd

# Feature Creation
from nltk.tokenize import RegexpTokenizer
#tokenizer = RegexpTokenizer(r'\w+[-]?\w+')

# Machine Learning Algorithms
from sklearn.naive_bayes import GaussianNB
# More at http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

# Evaluation
from sklearn.model_selection import cross_val_score

################################
# Optional
################################
import re #RegEx

# Improve feature creation
## Natural Language Processing module
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Improve feature selection
import sklearn
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Machine Learning
import scipy
from scipy.sparse import dok_matrix

In [3]:
# Checking versions
print('Version check:\n--------------')
print('Pandas v',pd.__version__)
print('NLTK v',nltk.__version__)
print('SKLearn v',sklearn.__version__)

Version check:
--------------
Pandas v 0.22.0
NLTK v 3.2.4
SKLearn v 0.19.1


### How to download NLTK packages
Replace *URL*, *USERNAME*, and *PASSWORD* if you need to configure a proxy.

Then uncomment the lines and run them.

*Please note that a separate window will pop up. Select the appropriate package there.*

In [4]:
# nltk.set_proxy('http://gate-zrh-os.swissre.com:8080', ('<USERNAME>', '<PASSWORD>'))
# nltk.download()

## Utilities

In [5]:
# Saving DataFrame as EXCEL for review
def writeResults(dfResults, sFilename, sPrefix='', sPostfix=''):
    fnOut = sFilename
    if sPrefix:
        fnOut = sPrefix + fnOut
    if sPostfix:
        fnOut = fnOut + sPostfix
        
    filepath = outDirectory + fnOut
    dfResults.to_excel(filepath)
    print('Results have been written to ', filepath)

# Load Dataset

In [6]:
# Filepath to dataset
fpDataset = './data/customer-feedback_full_cleaned_1000.xlsx'

#Load Excel file into a DataFrame
dfExcelWorkbook = pd.read_excel(fpDataset, sheet_name=None)
sheets = list(dfExcelWorkbook.keys())
dfData = dfExcelWorkbook[sheets[0]]

# Prepare directory to output results
outDirectory = './result/'
if not os.path.exists(outDirectory):
    os.makedirs(outDirectory)

In [7]:
# Check dataset 
dfData.head(10)

|  | FEEDBACK | RATING |
|---|---|---|
| 0 | never got clean glasses in Warsaw either. | 0 |
| 1 | The bed in the Radisson Bleu was not comfortab... | 1 |
| 2 | Michael was an excellent tour director. He wen... | 1 |
| 3 | Krakow Hotel was below my expectations because... | 0 |
| 4 | All the city tour guides have been excellent a... | 1 |
| 5 | The bed in the Radisson Bleu was not comfortab... | 0 |
| 6 | The Prague hotel should provide in-room intern... | 0 |
| 7 | Michael (Tour Director) was brilliant! Thomas ... | 1 |
| 8 | The entire voyage was very well done by Viking... | 0 |
| 9 | Michael was excellent. The Prague hotel should... | 1 |


## Stats and Infos
Basic statistics about the dataset.

In [8]:
print('Number of attributes:', dfData.shape[1])
print('Name of attributes:', dfData.columns)
print('Number of rows:', dfData.shape[0])
# The mean of the 0/1 RATING column is the share of positive feedbacks
print('Share of positives:', dfData['RATING'].mean())


Number of attributes: 2
Name of attributes: Index(['FEEDBACK', 'RATING'], dtype='object')
Number of rows: 1000
Share of positives: 0.539
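
The mean of a 0/1 rating column is the share of positive feedbacks, not a positives-to-negatives ratio. The following minimal sketch shows the difference. The `ratings` list is a made-up stand-in for `dfData['RATING']`, not data from the workbook.

```python
from collections import Counter

# Hypothetical ratings standing in for dfData['RATING'] (1 = positive, 0 = negative)
ratings = [1, 0, 1, 1, 0]

counts = Counter(ratings)
share_positive = counts[1] / len(ratings)   # identical to the mean of a 0/1 column
ratio_pos_neg = counts[1] / counts[0]       # positives per negative

print('Share of positives:', share_positive)
print('Positives per negative:', ratio_pos_neg)
```

A share close to 0.5, as in this dataset, suggests that the two classes are roughly balanced.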


# Feature Vector

## Build Feature Vector
Machine learning requires that the features (input) correlate with the target class (output). We therefore need to define the features that we want to use for machine learning. We use words as features because the words in a feedback correlate with its sentiment.

### Result
* **dfFeatures**: DataFrame containing a list of features (words). For each feature, it contains the frequency and the average sentiment.

In [9]:
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# MODIFY THIS METHOD TO WIN
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HINT: e.g., TF/IDF
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

# counts the tokens in a list of tokens.
def countTokens(tokens):
    results = {}

    for token in tokens:
        if token not in results:
            results[token] = 1
        else:
            results[token] = results[token] + 1
    return results
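
The hint above mentions TF/IDF. As a rough sketch of the idea (not the notebook's method), the raw counts can be replaced by weights that shrink for tokens occurring in many feedbacks. The function name `tfidf`, the sample `docs`, and the unsmoothed weighting `tf * log(N / df)` are assumptions made for illustration. `collections.Counter` does the same counting as `countTokens`.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {token: weight} dict per tokenized document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # document frequency: number of documents that contain each token
    df = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized feedbacks
docs = [
    ['the', 'hotel', 'was', 'excellent'],
    ['the', 'bed', 'was', 'awful'],
    ['the', 'guide', 'was', 'excellent'],
]
weights = tfidf(docs)
print(weights[1])
```

Tokens that appear in every document, such as `the`, receive a weight of zero, while rarer tokens such as `awful` are weighted up. scikit-learn's `TfidfVectorizer` offers a ready-made variant with a smoothed idf.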

In [10]:
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# MODIFY THIS METHOD TO WIN
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HINT: stopwords, lemmatization, stemming, named entities, lowercasing, word combinations (e.g., 'not good'), adjectives, etc.
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

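# extract features from a text
def extractTokens(strText):
    # features = tokenizer.tokenize(strText)
    # Split on single whitespace characters (raw string avoids an invalid escape);
    # consecutive whitespace therefore yields empty tokens.
    result = re.split(r'\s', strText)
    return result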
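
Some of the hints above (lowercasing, stopwords, word combinations such as 'not good') can be sketched as follows. This is one possible variant, not the notebook's solution: the function name `extract_tokens_clean`, the tiny `STOPWORDS` and `NEGATIONS` sets, and the `not_` prefix are all assumptions chosen for illustration.

```python
import re

# Small illustrative stopword list; nltk.corpus.stopwords offers a fuller one
STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'was'}
NEGATIONS = {'not', 'no', 'never'}

def extract_tokens_clean(text):
    """Lowercase, keep word-like tokens, drop stopwords, join negations."""
    # \w+(?:[-']\w+)* keeps words with inner hyphens or apostrophes together
    words = re.findall(r"\w+(?:[-']\w+)*", text.lower())
    tokens = []
    negate = False
    for word in words:
        if word in NEGATIONS:
            negate = True          # remember the negation for the next kept word
            continue
        if word in STOPWORDS:
            continue
        tokens.append('not_' + word if negate else word)
        negate = False
    return tokens

sample = 'The bed was NOT comfortable, and the in-room Wi-Fi never worked.'
print(extract_tokens_clean(sample))
```

Because punctuation is no longer attached to the words, variants such as `amazing,` and `amazing.` collapse into a single feature, and the empty-string feature disappears.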

In [11]:
featureColumns = ['feature', 'positives', 'negatives']
dfFeatures = pd.DataFrame(columns=featureColumns)
colFeedback = 'FEEDBACK'
colRating = 'RATING'

features = {}
for index, row in dfData.iterrows():
    # get feedback and rating directly from the current row
    feedback = row[colFeedback]
    rating = row[colRating]
    
    # analyze feedback
    tokens = extractTokens(str(feedback))
    featurecount = countTokens(tokens)
    
    # add to feature list (each feedback counts once per distinct feature)
    for feature in featurecount.keys():
        if feature not in features:
            features[feature] = {'positives': 0, 'negatives': 0}
        if rating == 0:
            features[feature]['negatives'] += 1
        else:
            features[feature]['positives'] += 1

# create and beautify
dfFeatures = pd.DataFrame.from_dict(features, orient='index')
dfFeatures = dfFeatures.reset_index()
dfFeatures = dfFeatures.rename({'index':'feature'}, axis=1)

### Analyze Features

In [12]:
# Count the number of feedbacks in which each feature occurs
dfFeatures['support'] = dfFeatures['positives'] + dfFeatures['negatives']

# Compute the sentiment value of each feature (share of positive feedbacks)
dfFeatures['sentiment'] = dfFeatures['positives'] / dfFeatures['support']
fnFeaturesAll = 'allfeatures.xlsx'
writeResults(dfFeatures, fnFeaturesAll)

dfFeatures.sort_values(by='support', ascending=False).head(20)

Results have been written to  ./result/allfeatures.xlsx


|  | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 4187 | the | 203 | 205 | 408 | 0.497549 |
| 4271 | to | 135 | 233 | 368 | 0.366848 |
| 4523 | was | 202 | 165 | 367 | 0.550409 |
| 1234 | and | 200 | 134 | 334 | 0.598802 |
| 3275 | of | 132 | 126 | 258 | 0.511628 |
| 1093 | a | 123 | 134 | 257 | 0.478599 |
| 2728 | in | 98 | 144 | 242 | 0.404959 |
| 0 |  | 113 | 99 | 212 | 0.533019 |
| 4449 | very | 130 | 53 | 183 | 0.710383 |
| 91 | - | 87 | 83 | 170 | 0.511765 |


In [13]:
print("Number of Features", dfFeatures.shape[0])

Number of Features 4665


# (i) DISCUSSION: How to clean up the features?

## Load Feature Vector

In [14]:
dfExcelWorkbook = pd.read_excel(outDirectory + fnFeaturesAll, sheet_name=None)
sheets = list(dfExcelWorkbook.keys())
dfFeatures = dfExcelWorkbook[sheets[0]]

In [15]:
print('Feature Stats:')
print('Number of Features:', dfFeatures.shape[0])
dfFeatures.head(20)

Feature Stats:
Number of Features: 4665


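|  | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 0 |  | 113 | 99 | 212 | 0.533019 |
| 1 | ! | 2 | 0 | 2 | 1.0 |
| 2 | !! | 1 | 0 | 1 | 1.0 |
| 3 | " | 2 | 2 | 4 | 0.5 |
| 4 | "Above | 0 | 1 | 1 | 0.0 |
| 5 | "Erdogan" | 0 | 1 | 1 | 0.0 |
| 6 | "Excellent" | 1 | 0 | 1 | 1.0 |
| 7 | "Manning". | 0 | 1 | 1 | 0.0 |
| 8 | "My | 0 | 1 | 1 | 0.0 |
| 9 | "Radisson | 0 | 1 | 1 | 0.0 |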


## Feature Selection

In [16]:
dfFeatureVector = dfFeatures.sort_values(by='support', ascending=False)
dfFeatureVector.head(10)

|  | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 4187 | the | 203 | 205 | 408 | 0.497549 |
| 4271 | to | 135 | 233 | 368 | 0.366848 |
| 4523 | was | 202 | 165 | 367 | 0.550409 |
| 1234 | and | 200 | 134 | 334 | 0.598802 |
| 3275 | of | 132 | 126 | 258 | 0.511628 |
| 1093 | a | 123 | 134 | 257 | 0.478599 |
| 2728 | in | 98 | 144 | 242 | 0.404959 |
| 0 |  | 113 | 99 | 212 | 0.533019 |
| 4449 | very | 130 | 53 | 183 | 0.710383 |
| 91 | - | 87 | 83 | 170 | 0.511765 |


## Top-10 Negatives

In [17]:
dfFeatureVector.sort_values(by='sentiment', ascending=True).head(10)

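|  | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 3482 | plate | 0 | 1 | 1 | 0.0 |
| 1353 | back) | 0 | 1 | 1 | 0.0 |
| 3050 | magnitude | 0 | 1 | 1 | 0.0 |
| 3154 | modified | 0 | 1 | 1 | 0.0 |
| 1350 | awkward | 0 | 1 | 1 | 0.0 |
| 3042 | lunch, | 0 | 1 | 1 | 0.0 |
| 3959 | slightly | 0 | 1 | 1 | 0.0 |
| 1348 | awful, | 0 | 1 | 1 | 0.0 |
| 1345 | away.etc.... | 0 | 1 | 1 | 0.0 |
| 3115 | men | 0 | 1 | 1 | 0.0 |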


## Top-10 Positives

In [18]:
dfFeatureVector.sort_values(by='sentiment', ascending=False).head(10)

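|  | feature | positives | negatives | support | sentiment |
|---|---|---|---|---|---|
| 4664 | “It | 1 | 0 | 1 | 1.0 |
| 1335 | available, | 1 | 0 | 1 | 1.0 |
| 1217 | altogether | 1 | 0 | 1 | 1.0 |
| 1220 | always, | 1 | 0 | 1 | 1.0 |
| 1224 | amazing, | 1 | 0 | 1 | 1.0 |
| 2001 | dishes | 2 | 0 | 2 | 1.0 |
| 1225 | amazing. | 1 | 0 | 1 | 1.0 |
| 1325 | attitude. | 1 | 0 | 1 | 1.0 |
| 317 | Budapest- | 2 | 0 | 2 | 1.0 |
| 4430 | vacation. | 2 | 0 | 2 | 1.0 |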
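
Both top-10 lists are dominated by features with a support of 1 or 2, whose sentiment of exactly 0.0 or 1.0 says little. One possible remedy, not part of the notebook, is additive smoothing, which pulls the sentiment of rare features toward a neutral prior. The function name `smoothed_sentiment` and the parameters `prior` and `strength` are assumptions for illustration.

```python
def smoothed_sentiment(positives, negatives, prior=0.5, strength=2):
    """Shrink the positive share toward `prior`; `strength` acts like pseudo-counts."""
    return (positives + prior * strength) / (positives + negatives + strength)

# Counts taken from the feature tables of this notebook
rare = smoothed_sentiment(1, 0)         # 'altogether': 1 positive, 0 negatives
frequent = smoothed_sentiment(130, 53)  # 'very': 130 positives, 53 negatives

print('altogether:', rare)
print('very:', frequent)
```

With these settings, the frequent feature `very` ranks above the one-off feature `altogether`, although its unsmoothed sentiment is lower.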


In [19]:
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# MODIFY THIS METHOD TO WIN
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HINT: e.g., remove rare features, remove irrelevant features
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

def selectFeatures(dfFeatures):
    print("Number of features:",dfFeatures.shape[0])
    result = dfFeatures
    # result = dfFeatures[dfFeatures.support > 10]
    
    print("Number of selected features:", result.shape[0])
    return result
    
dfSelectedFeatures = selectFeatures(dfFeatureVector)

Number of features: 4665
Number of selected features: 4665
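
The hint in `selectFeatures` suggests removing rare and irrelevant features. A minimal plain-Python sketch of that idea follows. It assumes records in the shape returned by `dfFeatures.to_dict('records')`. The function name `select_features` and the thresholds `min_support` and `min_polarity` are illustrative assumptions, not tuned values.

```python
def select_features(records, min_support=3, min_polarity=0.2):
    """Keep features that are frequent enough and clearly non-neutral."""
    selected = []
    for rec in records:
        support = rec['positives'] + rec['negatives']
        if support < min_support:
            continue                       # too rare to trust
        sentiment = rec['positives'] / support
        if abs(sentiment - 0.5) < min_polarity:
            continue                       # close to neutral, carries little signal
        selected.append(rec['feature'])
    return selected

# Counts taken from the feature tables of this notebook
records = [
    {'feature': 'the', 'positives': 203, 'negatives': 205},
    {'feature': 'very', 'positives': 130, 'negatives': 53},
    {'feature': 'to', 'positives': 135, 'negatives': 233},
    {'feature': 'plate', 'positives': 0, 'negatives': 1},
]
selected = select_features(records)
print(selected)
```

In pandas, the same filter can be written as a boolean mask on the `support` and `sentiment` columns, in the style of the commented-out line in `selectFeatures`.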
