### 4.2 Exercise: Sentiment Analysis
In this exercise, you will do a sentiment analysis of text comments.
1. Load the data file DailyComments.csv from the Week 4 Data Files into a data frame.
2. Identify a scheme to categorize each comment as positive or negative. You can devise your own scheme or find a commonly used scheme to perform this sentiment analysis. However you decide to do this, make sure to explain the scheme you decide to use.
3. Implement your sentiment analysis with code and display the results. Note: DailyComments.csv is a purposely small file, so you will be able to clearly see why the results are what they are.
4. For up to 5% extra credit, find another set of comments, e.g., some tweets, and perform the same sentiment analysis.


In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

In [2]:
# Import the file
day_comm = pd.read_csv('DailyComments.csv') 

In [3]:
# Check the number of rows and columns
day_comm.shape

(7, 2)

In [4]:
# See how the data looks
day_comm.head(3)

   Day of Week                                        comments
0       Monday                             Hello, how are you?
1      Tuesday                            Today is a good day!
2    Wednesday  It's my birthday so it's a really special day!


In [5]:
# Carve out the column I'm interested in (the comments)
day_corpus = day_comm['comments']
print(day_corpus)

0                               Hello, how are you?
1                              Today is a good day!
2    It's my birthday so it's a really special day!
3         Today is neither a good day or a bad day!
4                             I'm having a bad day.
5         There' s nothing special happening today.
6                        Today is a SUPER good day!
Name: comments, dtype: object


In [6]:
# Build a bag-of-words count matrix from the comments
vectorizer = CountVectorizer()
day_corpus_vector = vectorizer.fit_transform(day_corpus)

print('vectorized words')
print('')

# Retrieve the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out().tolist())
print('')

print('Identify Feature Words - Matrix View')

vectorized words

['are', 'bad', 'birthday', 'day', 'good', 'happening', 'having', 'hello', 'how', 'is', 'it', 'my', 'neither', 'nothing', 'or', 'really', 'so', 'special', 'super', 'there', 'today', 'you']

Identify Feature Words - Matrix View


In [7]:
# Show the word counts as a dense array (rows = comments, columns = the words listed above)
print(day_corpus_vector.toarray())

[[1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 1 1 0 0 0 0 0 0 2 1 0 0 0 1 1 1 0 0 0 0]
 [0 1 0 2 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0]
 [0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0]
 [0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0]]


In [8]:
# Create dataframe for the original comments column
day_corpus_df = pd.DataFrame({'text' : day_corpus})

# Check the shape
day_corpus_df.shape

(7, 1)

In [9]:
# Look at the data
day_corpus_df.head(3)

                                             text
0                             Hello, how are you?
1                            Today is a good day!
2  It's my birthday so it's a really special day!


In [11]:
# Score each comment: +1 per 'good' or 'special', -1 per 'bad'
# (str.count is case-sensitive and also matches inside longer words)
day_corpus_df['positive1'] = day_corpus_df.text.str.count('good')
day_corpus_df['positive2'] = day_corpus_df.text.str.count('special')
day_corpus_df['negative'] = day_corpus_df.text.str.count('bad')
day_corpus_df['totalScore'] = day_corpus_df.positive1 + day_corpus_df.positive2 - day_corpus_df.negative
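
The counts above match plain substrings, so a word such as "Good" (capitalized) would be missed and "goodbye" would count as "good". The following sketch shows one possible variation that is not part of the original analysis: it uses whole-word, case-insensitive matching through the `flags` argument of `Series.str.count`. The helper `count_words` and the two word lists are hypothetical names, and the comments are retyped from the output above so the block runs on its own:

```python
import re
import pandas as pd

# The seven comments, retyped from the printed output earlier in this notebook.
comments = pd.Series([
    "Hello, how are you?",
    "Today is a good day!",
    "It's my birthday so it's a really special day!",
    "Today is neither a good day or a bad day!",
    "I'm having a bad day.",
    "There' s nothing special happening today.",
    "Today is a SUPER good day!",
])

# Hypothetical word lists holding the same three words the cell above counts.
positive_words = ["good", "special"]
negative_words = ["bad"]

def count_words(series, words):
    """Count whole-word, case-insensitive matches of any word in `words`."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return series.str.count(pattern, flags=re.IGNORECASE)

scores = pd.DataFrame({"text": comments})
scores["positive"] = count_words(comments, positive_words)
scores["negative"] = count_words(comments, negative_words)
scores["totalScore"] = scores["positive"] - scores["negative"]

print(scores[["positive", "negative", "totalScore"]])
```

On these seven comments the per-comment scores are the same as with the substring counts, because every keyword appears in lowercase and as a separate word. The two approaches would differ only on text with other capitalization or with words that embed a keyword.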

In [16]:
# Look at the totals for each set of scores
print('Positive1 Score Total: ', sum(day_corpus_df['positive1']))
print("")
print('Positive2 Score Total: ', sum(day_corpus_df['positive2']))
print("")
print('Negative Score Total: ', sum(day_corpus_df['negative']))
print("")
print('Total Score: ', sum(day_corpus_df['totalScore']))


Positive1 Score Total:  3

Positive2 Score Total:  2

Negative Score Total:  2

Total Score:  3


In [15]:
print(day_corpus_df)

                                             text  positive1  positive2  \
0                             Hello, how are you?          0          0   
1                            Today is a good day!          1          0   
2  It's my birthday so it's a really special day!          0          1   
3       Today is neither a good day or a bad day!          1          0   
4                           I'm having a bad day.          0          0   
5       There' s nothing special happening today.          0          1   
6                      Today is a SUPER good day!          1          0   

   negative  totalScore  
0         0           0  
1         0           1  
2         0           1  
3         1           0  
4         1          -1  
5         0           1  
6         0           1  


In [12]:
day_z = sum(day_corpus_df['totalScore'])
print('Overall Score:  ',day_z)

Overall Score:   3


#### Score Results:
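The scheme counts keywords: each occurrence of "good" or "special" adds 1 to a comment's score, and each occurrence of "bad" subtracts 1. The overall score is 3, which I consider generally positive: the positive total was 5 (3 for "good" plus 2 for "special") and the negative total was 2, leaving a net score of 3.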
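
The totals summarize the whole file, while step 2 of the exercise asks for a category per comment. A minimal sketch of turning each `totalScore` into a label follows. The "neutral" label for a score of 0 is an assumption added here (the exercise names only positive and negative), and the function name `label` is illustrative. The texts and scores are copied from the table printed above:

```python
from collections import Counter

# Comments and their totalScore values, copied from the printed table above.
comments = [
    "Hello, how are you?",
    "Today is a good day!",
    "It's my birthday so it's a really special day!",
    "Today is neither a good day or a bad day!",
    "I'm having a bad day.",
    "There' s nothing special happening today.",
    "Today is a SUPER good day!",
]
total_scores = [0, 1, 1, 0, -1, 1, 1]

def label(score):
    """Map a numeric score to a category; 0 is treated as neutral (assumption)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [label(s) for s in total_scores]

for text, score, lab in zip(comments, total_scores, labels):
    print(f"{score:>2}  {lab:<8}  {text}")

# Tally how many comments fall into each category.
print(Counter(labels))
```

Under this mapping, "There' s nothing special happening today." is labeled positive because it contains "special", which shows that simple keyword counting does not account for negation.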