# Part 3: Create a Social Media Monitoring Service
In this third part, we’re going to talk about social media monitoring and, in
particular, about Sentiment Analysis, one of the most popular and widely
used sub-branches of Natural Language Processing and Understanding.

Sentiment Analysis means identifying the underlying affective state of a text. In
practice, we usually talk about polarity: is the text transmitting good or bad vibes?
The main objective of sentiment analysis is very similar to that of text classification.
In fact, sentiment analysis is text classification with predefined
categories such as:

- Positive / Negative
- Positive / Neutral / Negative
- Very-Positive / Positive / Neutral / Negative / Very-Negative
- Objective / Subjective
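
To make the polarity schemes concrete, here is a minimal sketch that buckets one polarity score into the first three schemes. It assumes a score between -1 and 1 coming from some upstream model; the function name `to_label` and the cut-off values are my own illustrative choices, not a standard.

```python
def to_label(score, scheme="binary"):
    """Map a polarity score in [-1, 1] to a label (thresholds are assumptions)."""
    if scheme == "binary":
        return "Positive" if score >= 0 else "Negative"
    if scheme == "ternary":
        if score > 0.2:
            return "Positive"
        if score < -0.2:
            return "Negative"
        return "Neutral"
    if scheme == "five":
        if score > 0.6:
            return "Very-Positive"
        if score > 0.2:
            return "Positive"
        if score >= -0.2:
            return "Neutral"
        if score >= -0.6:
            return "Negative"
        return "Very-Negative"
    raise ValueError(f"unknown scheme: {scheme}")


# The same two scores, labelled at three levels of granularity
for scheme in ("binary", "ternary", "five"):
    print(scheme, to_label(0.7, scheme), to_label(-0.1, scheme))
```

The finer the scheme, the more the result depends on where the cut-offs are placed, which is one reason the coarser schemes are usually easier to get right.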

Sentiment Analysis is used in various applications, the most common being:

1. Customer Feedback Analysis (identify the polarity of user reviews)
2. Brand Monitoring (quantify a brand’s online reputation)
3. Political Discourse Analysis (identify political bias)
4. Subjectivity/Objectivity Analysis (automatically flag news articles that are
not objective)

Because it has so many practical applications, there are many more resources
available for Sentiment Analysis than for general text classification, and we are
going to explore a good deal of them in this chapter.

## Basics of Sentiment Analysis

As I mentioned before, Sentiment Analysis represents the task of establishing if a
certain text is positive or negative. While this task may sound pretty straightforward,
once you get into the details of it, you might encounter some difficulties.

### Be Aware of Negations

Consider the following two sentences:
1. I like hiking very much.
2. I don’t like hiking very much.


To a human being, these two sentences are diametrically opposed in sentiment,
even if the only difference between them is the word “don’t”. A single negation
might seem negligible when working with full news articles but it might have an
important impact on the overall sentiment assessment. How would we handle this
situation in a classic ML approach? Things can get even more complicated:

- “I don’t dislike hiking.” This is a double negation. Would the sentiment be
positive or negative?
- “I don’t like punching in the face.” Basically, I dislike a bad thing. Does that make
the sentiment positive?
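
One common heuristic for the simple case is to mark every word that follows a negation, up to the next punctuation mark, so that a bag-of-words model sees “like” and “like_NEG” as different features. The sketch below is a hand-rolled version under my own assumptions (the `NEGATIONS` list and the tokenizing regex are deliberately tiny); NLTK ships a similar helper in `nltk.sentiment.util.mark_negation`.

```python
import re

# A deliberately small negation list, assumed for this example
NEGATIONS = {"not", "no", "never", "don't", "doesn't", "didn't", "can't", "won't", "isn't"}
CLAUSE_END = {".", ",", ";", "!", "?"}


def mark_negation(text):
    """Append _NEG to every word between a negation and the next punctuation mark."""
    # Normalize typographic apostrophes so that "don’t" matches "don't"
    text = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+|[.,;!?]", text)
    marked, negated = [], False
    for token in tokens:
        if token in CLAUSE_END:
            negated = False  # punctuation ends the negation scope
            marked.append(token)
        elif token in NEGATIONS:
            negated = True
            marked.append(token)
        else:
            marked.append(token + "_NEG" if negated else token)
    return marked


print(mark_negation("I like hiking very much."))
print(mark_negation("I don't like hiking very much."))
```

This only separates the two hiking sentences. It does nothing for the double negation (“dislike” simply becomes “dislike_NEG”) or for disliking a bad thing; those cases still have to be learned from data or handled by something smarter.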


### Machine Learning doesn’t get Humour
Let’s say you’re disappointed about a product you bought and when asked in a
survey about it, you leave the following comment:

> Yes, sure … I’m going to buy ten more. Can’t wait! Going to the store right
now! NOT!

Well, you might have felt the irony, but computer algorithms and machine learning
models have difficulties in understanding humour and figurative speech.
Here’s a simpler example:

> The experience I had at the Swan Resort Hotel was … interesting

You can sense that “interesting” was used in this case with a negative connotation
(it suggests that the person was unpleasantly surprised) rather than in its classic
positive sense. The suspension points are an indicator of this subtlety. So, would it
be fair to say that an ML model would interpret the comment the same way if it had
this rule implemented (suspension points = slightly negative sense / unpleasantly
surprised)? Would this rule be accurate? Are suspension points generally used
in this way? Well, no. What’s more, people express humour and figurative
speech in many different ways.
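
To see how brittle such a rule is, here is a small sketch. The function `ellipsis_rule` is hypothetical, and the second and third comments are made up for illustration; only the hotel comment comes from the text above.

```python
def ellipsis_rule(text):
    """Hypothetical rule: suspension points signal a slightly negative tone."""
    has_ellipsis = "..." in text or "…" in text
    return "slightly negative" if has_ellipsis else "no signal"


comments = [
    "The experience I had at the Swan Resort Hotel was … interesting",
    "Wow... that was the best concert of my life!",
    "Can't wait for the next season...",
]
for comment in comments:
    print(ellipsis_rule(comment), "|", comment)
```

The rule fires on all three comments, although only the first one is plausibly negative. A hand-written rule like this captures one usage of the suspension points and mislabels the others.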


### Multiple and Mixed Sentiments

What happens if a text expresses multiple feelings? What happens if the sentiments
expressed are for different objects (Multiple Sentiments) and what happens if they
are for the same object (Mixed Sentiments)? What’s the overall feeling of the text?
These are important questions to have in mind, especially when you consider the
practical side of Sentiment Analysis. Most of the reviews, comments, articles, etc.,
have more than one sentence and more than one sentiment expressed towards an
object or multiple objects.

Consider the following examples:

> “The phone’s design is the best I’ve seen so far, but the battery can definitely
> use some improvements”
>
> “I know the burger is unhealthy as hell, but damn it feels good eating it!”


Which one expresses multiple sentiments and which one expresses mixed sentiments?
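
One way to make such texts more tractable is to score smaller units instead of the whole text. The sketch below is a rough illustration under heavy assumptions: `TOY_LEXICON` is a four-word lexicon I invented for these two examples, and clauses are found by naively splitting on “but”. A real system would use a proper lexicon or model and a parser.

```python
import re

# Toy polarity lexicon, invented for this example only
TOY_LEXICON = {"best": 1, "good": 1, "improvements": -1, "unhealthy": -1}


def clause_scores(text):
    """Split on 'but' and return (clause, score) pairs using the toy lexicon."""
    clauses = [c.strip() for c in re.split(r"\bbut\b", text.lower()) if c.strip()]
    scored = []
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        scored.append((clause, sum(TOY_LEXICON.get(w, 0) for w in words)))
    return scored


examples = [
    "The phone's design is the best I've seen so far, "
    "but the battery can definitely use some improvements",
    "I know the burger is unhealthy as hell, but damn it feels good eating it!",
]
for text in examples:
    for clause, score in clause_scores(text):
        print(f"{score:+d}  {clause}")
```

Both texts come out with one positive and one negative clause, so a single averaged score would hide the disagreement in each of them. Working out which object each clause is about needs an extra step, such as aspect extraction, that this sketch does not attempt.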

### Non-Verbal Communication

Another area where computers still fall short is non-verbal communication. The
text we’re reading doesn’t contain any non-verbal cues. Drawing on experience, we
usually attribute a certain tone to a text. Have you ever gotten into a fight with
somebody over chat or email because you misinterpreted each other’s tone?

Here are the dimensions of communication that a sentiment analysis system could
take into account:
1. The Content (what is being transmitted, basically the text)
2. The Tone (the voice inflexions, laughter, etc.)
3. The Body Language (nodding, smiling, facial cues, gestures)
4. Cultural Context (different cultures have different gestures)

Can you imagine how complex a system would become if it were to take all these
factors into account?

## Twitter Sentiment Data
Twitter is one of the most open platforms and that makes it a perfect fit for
analyzing text from different perspectives. Twitter has been used for predicting
election results, reputation management, brand monitoring, customer support,
various bots, etc.

### Twitter Corpora
Twitter has attracted academic attention for a while now, and various
tweet corpora have been created. Here are some of them:

- nltk.corpus.twitter_samples - Sentiment annotated tweets
- Twitter Airline Reviews
- First GOP Debate Twitter Sentiment
- Sanders Analytics Twitter Sentiment Corpus - 5513 hand-classified tweets
- OSU Twitter NLP Tools - Contains POS, Chunk and NER annotated tweets
- Tweebank - Twitter CoNLL-like annotated data

### Other Sentiment Analysis Corpora
Besides the ones listed so far, there is a multitude of other corpora related to
sentiment analysis. Here are some of the best-known:

- 50,000 sentiment-annotated IMDB movie reviews
- Amazon Fine Food Reviews
- Multi-Domain Sentiment Dataset
- UMICH SI650 - Sentiment Classification on Kaggle
- SentiWordNet: Sentiment Polarities for WordNet
- Miscellaneous Opinion annotated datasets


### Building a Tweets Dataset

Let’s start by gathering the Twitter data and putting it all together. We are going to
use three of the six resources listed in the Twitter Corpora section above:
NLTK Twitter Samples, Twitter Airline Reviews and First GOP Debate
Twitter Sentiment.

**Indexing NLTK Twitter Samples**

```python
import pandas as pd
import matplotlib.pyplot as plt

# Initialize a dataframe for storing tweets
df = pd.DataFrame(columns=['tweet', 'source', 'sentiment'])

####################################
#
# NLTK Twitter Samples
#
####################################
# Requires the corpus: run nltk.download('twitter_samples') once
from nltk.corpus import twitter_samples

# Add the positive tweets
for tweet in twitter_samples.strings('positive_tweets.json'):
    df.loc[len(df)] = [tweet, 'nltk.corpus.twitter_samples', 'positive']

# Add the negative tweets
for tweet in twitter_samples.strings('negative_tweets.json'):
    df.loc[len(df)] = [tweet, 'nltk.corpus.twitter_samples', 'negative']
```
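
Appending one row at a time with `df.loc[len(df)]` works, but it gets slow as the dataframe grows. As an alternative sketch, you can collect the rows first and build the dataframe in one go. The two short lists below are stand-ins I made up so the example runs without the NLTK corpus, and `sample_df` is named differently so it doesn’t overwrite `df`.

```python
import pandas as pd

# Stand-in lists; in the chapter these come from twitter_samples.strings(...)
positive_tweets = ["What a great day :)", "Loving this!"]
negative_tweets = ["Worst service ever :("]

source = "nltk.corpus.twitter_samples"
rows = (
    [(tweet, source, "positive") for tweet in positive_tweets]
    + [(tweet, source, "negative") for tweet in negative_tweets]
)

# Build the dataframe from all the rows at once
sample_df = pd.DataFrame(rows, columns=["tweet", "source", "sentiment"])

# Quick sanity check on the class balance
print(sample_df["sentiment"].value_counts())
```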

For the next step, you need to download the Twitter Airline Reviews corpus from
Kaggle: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

**Indexing Twitter Airline Reviews**

```python
####################################
#
# Twitter Airline Reviews
#
####################################
airline_tweets = pd.read_csv('data/Tweets.csv')
# Select only the columns of interest
airline_df = airline_tweets[['text', 'airline_sentiment']]
# Rename the columns to fit the header
airline_df = airline_df.rename(columns={'text': 'tweet', 'airline_sentiment': 'sentiment'})
# Add a constant column as the source
airline_df['source'] = 'https://www.kaggle.com/crowdflower/twitter-airline-sentiment'
```