# Natural language processing part 5:
# Sentiment analysis

## Lecture objectives
* Learn how to implement and interpret sentiment analysis

In its simplest form, sentiment analysis works by applying a corpus of words and phrases that indicate sentiment. [SentiWordNet](https://github.com/aesuli/SentiWordNet) is a commonly used corpus; you can [look at the list here](https://raw.githubusercontent.com/aesuli/SentiWordNet/master/data/SentiWordNet_3.0.0.txt). For example, "worst," "terrible," and "apprehensive" have negative scores, while "feel_like_a_million_dollars" has a positive score. Some words have both positive and negative scores. Some algorithms consider the part of speech in which the word occurs (e.g., is it an adjective or a noun?).
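
To make the idea concrete, here is a minimal sketch of word-list scoring. The tiny `TOY_LEXICON` and its scores are made up for illustration (they are not taken from SentiWordNet), and the helper name `lexicon_polarity` is hypothetical. Real analyzers are more sophisticated, but the core lookup-and-average idea is similar.

```python
# Toy lexicon: hypothetical scores for illustration, NOT taken from SentiWordNet.
TOY_LEXICON = {
    "worst": -1.0,
    "terrible": -0.9,
    "late": -0.3,
    "love": 0.8,
    "great": 0.7,
}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none match)."""
    # Lowercase each word and strip surrounding punctuation before the lookup.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(lexicon_polarity("I love riding public transit"))
print(lexicon_polarity("The worst, most terrible commute."))
```

Words that are not in the lexicon are simply ignored, which is one reason why a general-purpose word list can miss the sentiment in specialist text.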

Let's start by loading the tweets that we saved in the previous video lecture.

In [None]:
import pickle
with open('data/tweets/miami.pickle', 'rb') as f:
    miami = pickle.load(f)
with open('data/tweets/chicago.pickle', 'rb') as f:
    chicago = pickle.load(f)
with open('data/tweets/toronto.pickle', 'rb') as f:
    toronto = pickle.load(f)
miami[0]

Let's turn to sentiment analysis. `textblob` uses corpora (a corpus is basically a body of text) from the `nltk` library. We already downloaded a couple of corpora, such as stop words, but we need two more.

In [None]:
import nltk
nltk.download('sentiwordnet')
nltk.download('wordnet')
from textblob import TextBlob

Let's try some examples. Note from the [documentation](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis): 

* The sentiment property returns a named tuple of the form `Sentiment(polarity, subjectivity)`. 
* The polarity score is a float within the range [-1.0, 1.0].
* The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

In [None]:
sentence = TextBlob('I love riding public transit')
print(sentence.sentiment)

In [None]:
sentence = TextBlob('My bus was late AGAIN today.')
print(sentence.sentiment)

We can access the polarity score directly. (Let's ignore subjectivity.)

In [None]:
sentence.sentiment.polarity
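
If we want a category rather than a number, one option is to bucket the polarity score. This is a minimal sketch: the helper name `label_polarity` and the 0.1 cutoff are arbitrary assumptions for illustration, not part of `textblob`.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The 0.1 cutoff is an arbitrary, hypothetical choice for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


for score in (0.5, -0.3, 0.0):
    print(score, label_polarity(score))
```

In the notebook, we could call it as `label_polarity(sentence.sentiment.polarity)`.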

Now let's come back to our tweets. We can compute the sentiment (polarity) score for each tweet. 

Let's use a list comprehension to loop over each tweet.

In [None]:
# an explicit loop that builds the list of polarity scores
miami_sentiment = []
for tweet in miami:
    miami_sentiment.append(TextBlob(tweet).sentiment.polarity)

# the equivalent list comprehension gives the same result in one line
miami_sentiment = [TextBlob(tweet).sentiment.polarity for tweet in miami]
chicago_sentiment = [TextBlob(tweet).sentiment.polarity for tweet in chicago]
toronto_sentiment = [TextBlob(tweet).sentiment.polarity for tweet in toronto]

We get lists of sentiments. Let's look at the first few.

In [None]:
for i in range(5):
    print('Sentiment: {:.2f}. Tweet: {}'.format(miami_sentiment[i], miami[i]))
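
Before plotting, a quick numeric summary can also be helpful. The sketch below uses a made-up list of scores so that it runs on its own; the helper name `summarize` is hypothetical. In the notebook, we could pass in `miami_sentiment` instead.

```python
from statistics import mean


def summarize(scores):
    """Return the mean polarity and the share of negative, neutral, and positive scores."""
    n = len(scores)
    return {
        "mean": mean(scores),
        "share_negative": sum(s < 0 for s in scores) / n,
        "share_neutral": sum(s == 0 for s in scores) / n,
        "share_positive": sum(s > 0 for s in scores) / n,
    }


# Made-up polarity scores, standing in for a list such as miami_sentiment.
example_scores = [0.5, 0.0, -0.25, 0.0, 0.75]
print(summarize(example_scores))
```

Note that the function assumes a non-empty list; an empty one would raise an error.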

Now let's visualize it. A histogram seems appropriate here.

Note that the seaborn `histplot` can take a list or any other sequence, as well as a `DataFrame`. 

In [None]:
import seaborn as sns
help(sns.histplot)

In [None]:
sns.histplot(miami_sentiment)

Let's compare the different transit agencies.

We've already used the `plt.subplots()` function to create a set of axes. However, its real power comes in creating figures with multiple plots.

In [None]:
import matplotlib.pyplot as plt

# this creates a 1x3 grid of plots, and returns the figure and an array of axes objects
fig, axes = plt.subplots(1, 3)

Because `axes` is an array that we can index like a list, we can now access each axis as `axes[0]`, `axes[1]`, etc.

We can even loop over each axis, just as we would with a list. That's useful for plotting lots of subgroups.

Here, we'll use a loop to clean up each plot. We'll use the `zip` function, which loops over equal-length lists and pairs up their elements. For example, the first iteration of the loop will put the first element of `axes` in `ax`, and the first element of `cities` in `city`.
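
To see the pairing on its own, here is a small sketch with plain lists. The `panels` strings are placeholders standing in for the three axes objects.

```python
cities = ["miami", "chicago", "toronto"]
# Placeholder strings standing in for the three matplotlib axes objects.
panels = ["left plot", "middle plot", "right plot"]

# zip pairs the first elements together, then the second elements, and so on.
pairs = list(zip(panels, cities))
for panel, city in pairs:
    print(panel, "->", city)
```

If the lists have different lengths, `zip` stops at the end of the shortest one.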

We'll make the axes have the same extent for each plot, and set the y-axis label on only the left-hand axis.

`fig.tight_layout()` is a useful command to clean up the spacing.

In [None]:
fig, axes = plt.subplots(1, 3)

sns.histplot(miami_sentiment, ax=axes[0])
sns.histplot(chicago_sentiment, ax=axes[1])
sns.histplot(toronto_sentiment, ax=axes[2])

cities = ['miami', 'chicago', 'toronto']
for ax, city in zip(axes, cities):
    ax.set_title(city)
    ax.set_ylim(0,50)
    ax.set_xlim(-1,1)
    ax.set_ylabel('')
axes[0].set_ylabel('Number of tweets')
axes[1].set_xlabel('Sentiment')
fig.tight_layout()

There is lots more we can do here. For example:
* We relied on a single search term. We might expand this (perhaps just "transit" rather than "public transit," and add "bus" and "train" as well). We might also consider adding the name of a specific agency such as CTA. Note that generic terms are harder to work with. For example, "Lyft" is easy to use as a search term, but "Uber" might return a lot of unrelated results.
* We could tokenize (split) each tweet into sentences. Otherwise, for longer tweets, the more polarized (opinionated) sentences might be watered down by the other sentences.
* We could use a different sentiment analyzer ([`TextBlob` has a couple of pre-trained options](https://textblob.readthedocs.io/en/dev/advanced_usage.html)) or train our own sentiment analyzer using the `nltk` functionality.
* Note that sentiment analyzers are often trained on movie reviews and similarly "opinionated" corpora, and so more specialist applications need custom training. In some of my [own work](https://conbio.onlinelibrary.wiley.com/doi/10.1111/csp2.624), we found that the writing was too technical or dry in style for sentiment analysis to work.
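
As a starting point for the sentence-splitting idea, here is a minimal sketch. It uses a naive regular-expression splitter and a stand-in scorer, `toy_score`, so that it runs without any downloads. Both helper names are hypothetical. In the notebook, we could pass `lambda s: TextBlob(s).sentiment.polarity` as the scorer instead.

```python
import re


def sentence_polarities(text, score):
    """Split text into rough sentences and score each one separately.

    `score` is any function that maps a string to a polarity score.
    """
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(s, score(s)) for s in sentences]


def toy_score(sentence):
    """Hypothetical stand-in scorer so the sketch runs without TextBlob."""
    return -1.0 if "late" in sentence.lower() else 0.0


tweet = "Took the train downtown. It was late AGAIN! Lunch was fine."
for sentence, polarity in sentence_polarities(tweet, toy_score):
    print(polarity, sentence)
```

`TextBlob` objects also have a `sentences` property that handles the splitting more carefully than this regular expression does.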

But we'll leave those for you to explore on your own.


<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
    <li>Sentiment analysis can identify positive and negative sentiments towards a topic. The pre-trained models might not work well for your data, but you can train your own.</li>
</ul>
</div>