## The Office - Sentiment Analysis

Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon and simple rule-based model for sentiment analysis.

VADER can assess the sentiment of any given text without prior training, unlike Machine Learning models, which must be trained first.

The result generated by VADER is a dictionary with four keys, neg, neu, pos and compound:
- neg, neu, and pos are the proportions of the text rated negative, neutral, and positive respectively. They sum to 1, up to floating-point rounding.
- compound is the sum of the valence scores of the words in the text, adjusted by VADER's rules and then normalized to lie between -1 (most extreme negative sentiment) and +1 (most extreme positive sentiment). Unlike the three proportions, it summarizes the overall degree of sentiment in a single value.
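
To make the normalization of the compound score concrete, here is a minimal sketch. It assumes the formula used by the reference VADER implementation, x / sqrt(x² + alpha) with alpha = 15. The function name `normalize` and the sample sums are illustrative only and are not taken from this notebook.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1).

    Assumption: alpha = 15 mirrors the default of the reference VADER code.
    """
    return score / math.sqrt(score * score + alpha)


# Hypothetical raw valence sums, chosen only to show the shape of the mapping.
for raw_sum in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"{raw_sum:+.1f} -> {normalize(raw_sum):+.4f}")
```

The mapping is symmetric around zero and flattens out for large sums, which is why very emotional lines approach but never reach -1 or +1.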

_Setup and Import_

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [2]:
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\FC\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
sent_analyzer = SentimentIntensityAnalyzer()

SentimentIntensityAnalyzer.polarity_scores() returns the polarity of a text as a dictionary.

In [4]:
df = pd.read_csv('../data/the-office_lines.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Character,Line,Season,Episode_Number
0,0,Michael,All right Jim. Your quarterlies look very goo...,1,1
1,1,Jim,"Oh, I told you. I couldn’t close it. So…",1,1
2,2,Michael,So you’ve come to the master for guidance? Is...,1,1
3,3,Jim,"Actually, you called me in here, but yeah.",1,1
4,4,Michael,"All right. Well, let me show you how it’s don...",1,1


In [5]:
df = df.drop(['Unnamed: 0'], axis=1)

In [6]:
df.head()

Unnamed: 0,Character,Line,Season,Episode_Number
0,Michael,All right Jim. Your quarterlies look very goo...,1,1
1,Jim,"Oh, I told you. I couldn’t close it. So…",1,1
2,Michael,So you’ve come to the master for guidance? Is...,1,1
3,Jim,"Actually, you called me in here, but yeah.",1,1
4,Michael,"All right. Well, let me show you how it’s don...",1,1


_Sentiment Analysis_

Add VADER scores and sentiment labels to df.

In [7]:
# score every line; polarity_scores returns a dict with neg, neu, pos and compound
df['Scores'] = df['Line'].apply(sent_analyzer.polarity_scores)
df.head()

Unnamed: 0,Character,Line,Season,Episode_Number,Scores
0,Michael,All right Jim. Your quarterlies look very goo...,1,1,"{'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'comp..."
1,Jim,"Oh, I told you. I couldn’t close it. So…",1,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
2,Michael,So you’ve come to the master for guidance? Is...,1,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
3,Jim,"Actually, you called me in here, but yeah.",1,1,"{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'comp..."
4,Michael,"All right. Well, let me show you how it’s don...",1,1,"{'neg': 0.0, 'neu': 0.811, 'pos': 0.189, 'comp..."


In [8]:
df['Compound'] = df['Scores'].apply(lambda score_dict: score_dict['compound'])
df.head()

Unnamed: 0,Character,Line,Season,Episode_Number,Scores,Compound
0,Michael,All right Jim. Your quarterlies look very goo...,1,1,"{'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'comp...",0.4927
1,Jim,"Oh, I told you. I couldn’t close it. So…",1,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0
2,Michael,So you’ve come to the master for guidance? Is...,1,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0
3,Jim,"Actually, you called me in here, but yeah.",1,1,"{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'comp...",0.4215
4,Michael,"All right. Well, let me show you how it’s don...",1,1,"{'neg': 0.0, 'neu': 0.811, 'pos': 0.189, 'comp...",0.2732


In [9]:
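# label each line by the sign of its compound score
# note: a compound of exactly 0 (no sentiment detected) is labeled 'pos'
df['Compound-Score'] = df['Compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')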
df.head()

Unnamed: 0,Character,Line,Season,Episode_Number,Scores,Compound,Compound-Score
0,Michael,All right Jim. Your quarterlies look very goo...,1,1,"{'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'comp...",0.4927,pos
1,Jim,"Oh, I told you. I couldn’t close it. So…",1,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,pos
2,Michael,So you’ve come to the master for guidance? Is...,1,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,pos
3,Jim,"Actually, you called me in here, but yeah.",1,1,"{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'comp...",0.4215,pos
4,Michael,"All right. Well, let me show you how it’s don...",1,1,"{'neg': 0.0, 'neu': 0.811, 'pos': 0.189, 'comp...",0.2732,pos
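
The labeling step above splits lines into two classes at 0, so lines with a compound of exactly 0.0 count as positive. As an alternative, the sketch below uses the three-way split commonly suggested for VADER, with thresholds of ±0.05. The function name `label_compound` and the threshold parameter are assumptions for illustration; the rest of this notebook does not use them.

```python
def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Return 'pos', 'neg' or 'neu' for a VADER compound score.

    Assumption: the +/-0.05 cut-offs follow the convention commonly
    recommended for VADER, not a choice made in this notebook.
    """
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Compound values of the first five lines shown in the table above.
sample_compounds = [0.4927, 0.0, 0.0, 0.4215, 0.2732]
labels = [label_compound(c) for c in sample_compounds]
print(labels)
```

With this variant the two lines scored 0.0 would be labeled 'neu' instead of 'pos', which would lower the positive counts computed later.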


In [10]:
df.dtypes

Character          object
Line               object
Season              int64
Episode_Number      int64
Scores             object
Compound          float64
Compound-Score     object
dtype: object

In [11]:
# convert Compound-Score from object to category dtype
df['Compound-Score'] = df['Compound-Score'].astype('category')

In [12]:
df.dtypes

Character           object
Line                object
Season               int64
Episode_Number       int64
Scores              object
Compound           float64
Compound-Score    category
dtype: object

In [13]:
df['Compound-Score'].cat.categories

Index(['neg', 'pos'], dtype='object')

_Save to CSV_

In [14]:
df.to_csv('../data/the-office_sentiment.csv')
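
By default, to_csv also writes the index as an unnamed first column, which is read back as 'Unnamed: 0'. The sketch below shows the effect of index=False on a small made-up frame. It writes to an in-memory buffer rather than a file, and the names `sample`, `cols_with_index` and `cols_no_index` are illustrative only.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for the real data.
sample = pd.DataFrame({"Character": ["Michael", "Jim"], "Compound": [0.4927, 0.0]})

# Default behaviour: the index is written as an extra, unnamed column.
buf_with_index = io.StringIO()
sample.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = list(pd.read_csv(buf_with_index).columns)

# index=False: only the real columns are written.
buf_no_index = io.StringIO()
sample.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_with_index)
print(cols_no_index)
```

Passing index=False when saving avoids having to drop the extra column after the file is loaded again.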

_Data Visualization_

Create the DataFrames used for visualization in Power BI.

In [15]:
# create a df with only the top 10 characters
top10 = ["Michael", "Dwight", "Jim", "Andy", "Pam", "Angela", "Kevin", "Erin", "Ryan", "Oscar"]
df_top10 = df[df['Character'].isin(top10)]

In [16]:
# count positive and negative lines per character
df_gr = df_top10.groupby(['Character', 'Compound-Score'], as_index=False).count()

In [17]:
df_gr = df_gr[['Character', 'Compound-Score', 'Line']]

In [18]:
df_gr.rename(columns = {'Line':'Count'}, inplace = True)

In [19]:
df_gr

Unnamed: 0,Character,Compound-Score,Count
0,Andy,neg,707
1,Andy,pos,3226
2,Angela,neg,348
3,Angela,pos,1329
4,Dwight,neg,1487
5,Dwight,pos,5906
6,Erin,neg,231
7,Erin,pos,1221
8,Jim,neg,1009
9,Jim,pos,5657


In [20]:
df_gr.to_csv('../data/the-office_sent_count.csv')

In [21]:
# percent of pos
df_gr2 = df_gr

In [22]:
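# sum only Count; recent pandas versions raise an error when asked to sum the categorical column
df_gr2 = df_gr2.groupby('Character', as_index=False)['Count'].sum()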
df_gr2.head()

Unnamed: 0,Character,Count
0,Andy,3933
1,Angela,1677
2,Dwight,7393
3,Erin,1452
4,Jim,6666


In [23]:
df_gr3 = df_gr

In [24]:
df_gr3 = df_gr3[df_gr3['Compound-Score'] == "pos"]
# rename returns a new DataFrame, which avoids the SettingWithCopyWarning raised by inplace=True on a slice
df_gr3 = df_gr3.rename(columns={'Count': 'Count_Pos'})
df_gr3.head()

Unnamed: 0,Character,Compound-Score,Count_Pos
1,Andy,pos,3226
3,Angela,pos,1329
5,Dwight,pos,5906
7,Erin,pos,1221
9,Jim,pos,5657


In [25]:
# align df_gr3 with df_gr2 by position; both are sorted by Character
df_gr3.index = df_gr2.index

In [26]:
df_gr2['Count_Pos'] = df_gr3['Count_Pos']
df_gr2.head()

Unnamed: 0,Character,Count,Count_Pos
0,Andy,3933,3226
1,Angela,1677,1329
2,Dwight,7393,5906
3,Erin,1452,1221
4,Jim,6666,5657


In [27]:
df_gr2['Perc_PosNeg'] = df_gr2['Count_Pos'] / df_gr2['Count']
df_gr2['Perc_PosNeg'] = round(df_gr2['Perc_PosNeg'], 3)
df_gr2

Unnamed: 0,Character,Count,Count_Pos,Perc_PosNeg
0,Andy,3933,3226,0.82
1,Angela,1677,1329,0.792
2,Dwight,7393,5906,0.799
3,Erin,1452,1221,0.841
4,Jim,6666,5657,0.849
5,Kevin,1678,1406,0.838
6,Michael,11806,9568,0.81
7,Oscar,1464,1219,0.833
8,Pam,5264,4464,0.848
9,Ryan,1324,1131,0.854
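
The share of positive lines can also be derived without copying and re-indexing intermediate frames. The sketch below pivots the per-label counts into one row per character. It uses the Andy and Angela counts shown earlier; the names `counts`, `wide` and `Perc_Pos` are assumptions for illustration.

```python
import pandas as pd

# Counts for two characters, copied from the grouped table shown earlier.
counts = pd.DataFrame({
    "Character": ["Andy", "Andy", "Angela", "Angela"],
    "Compound-Score": ["neg", "pos", "neg", "pos"],
    "Count": [707, 3226, 348, 1329],
})

# One row per character, one column per label.
wide = counts.pivot(index="Character", columns="Compound-Score", values="Count")

# Share of positive lines, rounded to three decimals as in the cell above.
wide["Perc_Pos"] = (wide["pos"] / (wide["neg"] + wide["pos"])).round(3)
print(wide)
```

Because the pivot keeps the counts aligned by character name, it does not depend on the two frames being in the same row order.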


In [29]:
df_gr2[['Character', 'Perc_PosNeg']]

Unnamed: 0,Character,Perc_PosNeg
0,Andy,0.82
1,Angela,0.792
2,Dwight,0.799
3,Erin,0.841
4,Jim,0.849
5,Kevin,0.838
6,Michael,0.81
7,Oscar,0.833
8,Pam,0.848
9,Ryan,0.854


In [28]:
df_gr2.to_csv('../data/the-office_sent_count2.csv')