# Week 11: Sentiment Analysis, Part 2: Plotting Curves with Rolling Averages

We will talk more about sentiment and uncertainty, and model evaluation.


Then our focus switches to plotting "sentiment curves." We will:
* Plot the raw values for sentiment in *The Sign of the Four*
* Calculate rolling averages, plot them, and investigate the portions of the text at which the minimum and maximum points occur

# Getting back to Sentiment Comparison

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

vader_analyzer = SentimentIntensityAnalyzer()


# Getting started with Plotting Sentiment

First, let's import our libraries, set our Pandas display options, and reload the data produced during last week's lecture (handily stored in a CSV that lives in the same folder as this notebook).

## Rolling Averages in Pandas

Pandas is basically built to do things like calculate rolling averages. It makes it really easy, which is awesome. All we need to do is specify:
- What part of the DataFrame we're interested in (here, the `polarity` column)
- That we want to do a rolling... something or other (the `.rolling()` method), what we want our rolling window size to be (here it's 10, indicated with the `window=10` argument), and that we want these rolling averages to be "centred" around the point where we record them (the `center=True` argument; note that Pandas uses the American spelling).
- That the "something or other" we want are rolling **averages (or means)** (indicated with the `.mean()` method)
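As a minimal sketch of those three pieces, here they are applied to a handful of made-up numbers (the toy `polarity` values and the `toy_df` name are hypothetical, not taken from the novel). A window of 3 keeps the arithmetic easy to check by hand:

```python
import pandas as pd

# Hypothetical toy scores; the real notebook uses the `polarity` column of the CSV.
toy_df = pd.DataFrame({"polarity": [0.0, 0.3, 0.6, -0.3, 0.0]})

# 1) pick the column, 2) ask for a centred rolling window of 3 rows, 3) take the mean.
toy_df["rolling_3_polarity"] = (
    toy_df["polarity"].rolling(window=3, center=True).mean()
)

print(toy_df)
```

The first and last rows come out as `NaN`, because there is no full 3-row window around them; each row in between holds the mean of itself and its two neighbours.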

In [None]:
# This line ensures the full text of each column (e.g. a long sentence) is displayed rather than truncated.
pd.set_option('display.max_colwidth', 0)

In [None]:
sot4_sentence_sentiment_df = pd.read_csv("sot4_all_sentiment.csv")
sot4_sentence_sentiment_df[:15]

Now let's run the command we ended last lecture with, plotting the "raw" sentiment polarity values for every sentence in *The Sign of the Four*. Can we learn much from this?

In [None]:
sot4_sentence_sentiment_df[['polarity']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df[['vader polarity']].plot(figsize=(20,8))
plt.show()

Just for fun, let's have a look at the raw subjectivity scores, too...

In [None]:
sot4_sentence_sentiment_df[['subjectivity']].plot(figsize=(20,8))
plt.show()

### Now let's discuss one way to try to find a signal in this noisy data -- rolling averages

Pandas gives us a good way to calculate summary statistics over a window that moves along our data.

For example, if we had a dataset of all your weekly grades (transformed to the same scale), we could find out how you were doing over every week... two weeks... month... of your studies by calculating a **rolling mean** with different windows.

We don't always have to use time: any consistent index of progress works, and sentence order in the book works as well!
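To make the grades analogy concrete, here is a small sketch with invented numbers (the `grades` values and column names are assumptions for illustration only). It uses the default trailing window, so each value averages the current week and the weeks before it:

```python
import pandas as pd

# Hypothetical weekly grades on a 0-100 scale, indexed by week number.
grades = pd.Series([60, 80, 70, 90, 50, 70, 90, 100], index=range(1, 9), name="grade")

# A short window reacts quickly to each week; a longer one shows the broader trend.
summary = pd.DataFrame({
    "grade": grades,
    "rolling_2_weeks": grades.rolling(window=2).mean(),
    "rolling_4_weeks": grades.rolling(window=4).mean(),
})

print(summary)
```

The 4-week column only starts at week 4, since that is the first week with four grades behind it.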

In [None]:
sot4_sentence_sentiment_df['rolling_10_polarity'] = sot4_sentence_sentiment_df['polarity'].rolling(window=10, center=True).mean()


In [None]:
sot4_sentence_sentiment_df[:10][["sentence","polarity","rolling_10_polarity"]]

Why does the first value appear at row 5?
Because we need a full window of 10 rows centered around a row, and row 5 is the first row that has one (rows 0 through 9):

`[row0, row1, row2, row3, row4, (row5), row6, row7, row8, row9]`
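If you would like to check this placement yourself, the sketch below does so on stand-in data (20 random scores, not the novel's real values; the variable names are made up for this example). It asks Pandas for the first and last rows that received a value and compares the first one with a plain mean of rows 0 to 9:

```python
import numpy as np
import pandas as pd

# Stand-in data: 20 made-up polarity scores (not the novel's real values).
rng = np.random.default_rng(0)
scores = pd.Series(rng.uniform(-1, 1, size=20))

rolled = scores.rolling(window=10, center=True).mean()

# First and last rows that have a full 10-row window around them.
first_row = rolled.first_valid_index()
last_row = rolled.last_valid_index()
print(first_row, last_row)

# The value at the first valid row should equal the plain mean of rows 0 through 9.
matches = np.isclose(rolled.loc[first_row], scores.iloc[0:10].mean())
print(matches)
```

Because 10 is an even number, the window cannot be perfectly symmetric: it takes five rows before the recorded row and four after it, which is why fewer rows are left empty at the end of the series than at the start.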


Now all we need to do is to plot our new column!

Does that look any better?

In [None]:
sot4_sentence_sentiment_df[['rolling_10_polarity']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df[['polarity','rolling_10_polarity']].plot(figsize=(20,8))
plt.show()

Let's try some other window sizes: e.g. 25. (You can try others, too!)
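If you want to try several window sizes at once, one option is a short loop. This is only a sketch on a stand-in DataFrame (`toy_df` and its values are hypothetical); the column names follow the `rolling_<window>_polarity` pattern used in this notebook:

```python
import pandas as pd

# Hypothetical stand-in for the sentence-level DataFrame used in this notebook.
toy_df = pd.DataFrame({"polarity": [0.1 * (i % 7 - 3) for i in range(120)]})

# One new column per window size, named the same way as in the notebook cells.
for window in (10, 25, 50):
    column_name = f"rolling_{window}_polarity"
    toy_df[column_name] = (
        toy_df["polarity"].rolling(window=window, center=True).mean()
    )

# Larger windows leave more empty (NaN) rows at the two ends of the text.
nan_counts = toy_df.isna().sum()
print(nan_counts)
```

A window of size `w` leaves `w - 1` rows without a value, so the smoother the curve, the more of the beginning and end of the book you lose.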

In [None]:
sot4_sentence_sentiment_df['rolling_25_polarity'] = sot4_sentence_sentiment_df['polarity'].rolling(window=25, center=True).mean()

In [None]:
sot4_sentence_sentiment_df[['rolling_25_polarity']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df[['rolling_10_polarity','rolling_25_polarity']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df['rolling_50_polarity'] = sot4_sentence_sentiment_df['polarity'].rolling(window=50, center=True).mean()

In [None]:
sot4_sentence_sentiment_df[['rolling_50_polarity']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df[['rolling_10_polarity','rolling_50_polarity']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df['rolling_10_vader'] = sot4_sentence_sentiment_df['vader polarity'].rolling(window=10, center=True).mean()
sot4_sentence_sentiment_df['rolling_25_vader'] = sot4_sentence_sentiment_df['vader polarity'].rolling(window=25, center=True).mean()
sot4_sentence_sentiment_df['rolling_50_vader'] = sot4_sentence_sentiment_df['vader polarity'].rolling(window=50, center=True).mean()

In [None]:
# Let's compare vader and textblob with rolling polarities

sot4_sentence_sentiment_df[['rolling_10_polarity','rolling_10_vader']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df[['rolling_25_polarity','rolling_25_vader']].plot(figsize=(20,8))
plt.show()

In [None]:
sot4_sentence_sentiment_df[['rolling_50_polarity','rolling_50_vader']].plot(figsize=(20,8))
plt.show()

## Finding the Maximum and Minimum Points

The line of Pandas code below is an absolute whopper... but I think we're all ready for it at this point!

In [None]:
sot4_sentence_sentiment_df[sot4_sentence_sentiment_df['rolling_50_vader']==sot4_sentence_sentiment_df['rolling_50_vader'].min()]

Step-by-step explanation

In [None]:
sot4_sentence_sentiment_df['rolling_50_vader'].min()

In [None]:
min_polarity_50 = sot4_sentence_sentiment_df['rolling_50_vader'].min()

In [None]:
sot4_sentence_sentiment_df['rolling_50_vader'] == min_polarity_50

In [None]:
has_min_polarity_50 = sot4_sentence_sentiment_df['rolling_50_vader'] == min_polarity_50

In [None]:
sot4_sentence_sentiment_df[has_min_polarity_50]

How about the maximum for window 25?
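As an aside, Pandas also offers `idxmax()` and `idxmin()`, which return the index label of the largest or smallest value directly. The sketch below uses a tiny made-up DataFrame (the sentences and scores are placeholders) rather than the real data:

```python
import pandas as pd

# Made-up rolling scores; NaN marks rows without a full window.
toy_df = pd.DataFrame({
    "sentence": ["a", "b", "c", "d", "e"],
    "rolling_25_vader": [float("nan"), 0.2, 0.7, -0.4, float("nan")],
})

# idxmax()/idxmin() return the index label of the extreme value, skipping NaN.
max_row = toy_df["rolling_25_vader"].idxmax()
min_row = toy_df["rolling_25_vader"].idxmin()

print(toy_df.loc[[max_row, min_row]])
```

One difference from the boolean-mask approach: if several rows tie for the extreme value, `idxmax()` and `idxmin()` return only the first of them, whereas the mask returns every matching row.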

In [None]:
sot4_sentence_sentiment_df['rolling_25_vader'].max()

In [None]:
sot4_sentence_sentiment_df[sot4_sentence_sentiment_df['rolling_25_vader']==sot4_sentence_sentiment_df['rolling_25_vader'].max()]

## Investigating the Extremes

Let's have a look at the 25-sentence window that the `rolling_25_vader` score indicates is the most positive, and see if it seems like it's on to something...

(Yeah, that's right, I'm just taking the min and max points, then manually adding and subtracting 12 or 13 to make a 25-ish-sentence window!)
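If you would rather not do that arithmetic by hand, it can be wrapped in a small helper. The function below is a hypothetical sketch, not part of Pandas: it assumes the default 0, 1, 2, ... row index and mirrors how Pandas places a centred window of a fixed number of rows. The toy data and the row number 40 are invented for the demonstration:

```python
import pandas as pd

def centred_window(df, centre_row, window):
    """Return the rows averaged by a centred rolling window at `centre_row`.

    Hypothetical helper: assumes a default 0..n-1 integer index.
    """
    start = centre_row - window // 2
    stop = start + window
    # Clip to the DataFrame so windows near the edges do not wrap around.
    return df.iloc[max(start, 0):min(stop, len(df))]

# Toy data standing in for the real sentence-level scores.
toy_df = pd.DataFrame({"vader polarity": [i / 100 for i in range(100)]})
toy_df["rolling_25_vader"] = (
    toy_df["vader polarity"].rolling(window=25, center=True).mean()
)

centre_row = 40  # made-up row number, standing in for the one found above
window_rows = centred_window(toy_df, centre_row, 25)
print(window_rows.index.min(), window_rows.index.max(), len(window_rows))

# The mean of the selected rows should match the rolling value at the centre row.
print(window_rows["vader polarity"].mean(), toy_df.loc[centre_row, "rolling_25_vader"])
```

With a centre row of 1539 and a window of 25, this helper would select rows 1527 to 1551, which is close to, but not exactly, the "25-ish" range picked by hand.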

In [None]:
sot4_sentence_sentiment_df[sot4_sentence_sentiment_df['rolling_25_vader']==sot4_sentence_sentiment_df['rolling_25_vader'].max()]

In [None]:
print(1539 - 13)
print(1539 + 13)

In [None]:
sot4_sentence_sentiment_df[1525:1551].head()  # remove .head() to see all of it; maybe convert the texts to a list

In [None]:
sot4_sentence_sentiment_df[1525:1551]["vader polarity"].hist()
plt.show()

What about the most negative?

In [None]:
sot4_sentence_sentiment_df[sot4_sentence_sentiment_df['rolling_25_vader']==sot4_sentence_sentiment_df['rolling_25_vader'].min()]

In [None]:
print(484 - 13)
print(484 + 13)

In [None]:
sot4_sentence_sentiment_df[471:497]["vader polarity"].hist()
plt.show()

Let's plot the two histograms side by side.

In [None]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

sot4_sentence_sentiment_df[1525:1551]["vader polarity"].hist(ax=ax1)
sot4_sentence_sentiment_df[471:497]["vader polarity"].hist(ax=ax2)

plt.tight_layout()
plt.show()