# Homework 1: A Simple Sentiment Analysis of Tweets

The objective of this homework is to assess your current skills and your ability to solve an unknown problem. To successfully complete this homework, you may use any resources available to you.

You need to accomplish the following tasks:
1. Choose a Twitter API.
2. Configure access to the Twitter API.
3. Identify a trending topic on Twitter.
4. Get at least 500 tweets on your trending topic.
5. Find lists of stopwords, positive words, and negative words.
6. Calculate the ratio of positive to negative words in your sample.
    
Answer the following questions: 
* __What is the ratio of positive to negative words on your trending topic?__ 
* __What is your interpretation of the ratio?__
* __What is the managerial insight that you could offer based on your results?__

If you use tutorials/code snippets that you find on the internet to complete this task, make sure that you reference them. Also make sure that the Jupyter notebook is free of mistakes, well-documented, and professionally formatted before you submit it.

This homework is due on **Tuesday, 16 2018**.

**CAUTION:** This notebook is purely for instructional purposes and is neither intended nor suitable for use in production environments.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Choose a package to access the Twitter API

I use the [python-twitter](https://github.com/bear/python-twitter) package to access the Twitter API. There are quite a [few packages to access Twitter](https://developer.twitter.com/en/docs/developer-utilities/twitter-libraries). I chose python-twitter because it seems to be reasonably well maintained.

First let's download the package. The documentation for python-twitter is [here](https://python-twitter.readthedocs.io/en/latest/).

In [None]:
!pip install python-twitter

## Configure access to the Twitter API

Step-by-step how-tos for accessing the Twitter API are available online. You will need a Twitter account to create a Twitter app. For this app, you will need four important keys:
* consumer_key
* consumer_secret
* access_token_key
* access_token_secret

These keys have to be kept **secret** and **must not** be shared on GitHub. Thus, we put them in a config file `config.cfg` that we read during initialization of the API. **Add the config file to your .gitignore file**, so that you do not accidentally upload it.

The config file has the following structure:

`[twitter]
consumer_key=AAAAAAAA
consumer_secret=BBBBBBBB
access_token_key=CCCCCCC
access_token_secret=DDDDDDD`
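
The following is a minimal sketch of how `ConfigParser` handles a file with this structure. To stay self-contained, it parses the placeholder values from a string with `read_string` instead of reading `config.cfg` from disk; the names `sample_cfg`, `parser`, and `credentials` are chosen only for illustration.

```python
from configparser import ConfigParser

# Placeholder values only; real keys belong in config.cfg, never in the notebook.
sample_cfg = """
[twitter]
consumer_key=AAAAAAAA
consumer_secret=BBBBBBBB
access_token_key=CCCCCCC
access_token_secret=DDDDDDD
"""

parser = ConfigParser()
parser.read_string(sample_cfg)

# A single value is looked up by section name and option name.
consumer_key = parser.get("twitter", "consumer_key")

# All options of a section can be collected into a plain dictionary.
credentials = dict(parser["twitter"])
```

Collecting the section into a dictionary is one possible design; it makes it easy to check that all four keys are present before the API is initialized.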

In a next step, let's test the API.

For now, we need two packages:
* The `configparser` package allows us to read a config file.
* The `twitter` package allows us to access the Twitter API.

In [2]:
from configparser import ConfigParser
import twitter

We read the `config.cfg` file.

In [3]:
config = ConfigParser()
config.read('config.cfg')

['config.cfg']

We set up access to the API.

In [4]:
api = twitter.Api(consumer_key=config.get('twitter','consumer_key'), 
                  consumer_secret=config.get('twitter', 'consumer_secret'), 
                  access_token_key=config.get('twitter', 'access_token_key'),
                  access_token_secret=config.get('twitter', 'access_token_secret'))

## Identify a trending topic on Twitter.

We define our topic.

In [5]:
topic = 'trump'

## Get at least 500 tweets on your trending topic (Part I)

We search for the topic. The result is a list of `twitter.models.Status` objects. The search functionality is limited to [100 tweets](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets). We come back to that later.

In [6]:
results = api.GetSearch(
    raw_query="q="+topic+"&result_type=recent&count=100")
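
Plain string concatenation works for a single-word topic. For topics that contain spaces or characters such as `#`, the query string should be URL-encoded. A minimal sketch using the standard library follows; `build_raw_query` is a hypothetical helper and not part of python-twitter.

```python
from urllib.parse import urlencode

def build_raw_query(topic, count=100, result_type="recent"):
    """Return a URL-encoded query string for a search request."""
    # urlencode escapes special characters and joins the pairs with '&'.
    return urlencode({"q": topic, "result_type": result_type, "count": count})

# A topic with a space is encoded so that it survives as one parameter.
query = build_raw_query("machine learning")
```

Under this assumption, the resulting string could be passed as the `raw_query` argument in place of the concatenated string.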

Each `twitter.models.Status` object has a `text` attribute that contains the actual tweet text. We first inspect the results and then collect the texts with a list comprehension, the idiomatic Python way of "looping" through a list.

In [7]:
results

[Status(ID=951679222006902784, ScreenName=jrockarolla, Created=Fri Jan 12 04:56:22 +0000 2018, Text='Donald Trump is the U.S. president. Of course U.S. stands for “ugly shithole.”'),
 Status(ID=951679221914619907, ScreenName=clark7950, Created=Fri Jan 12 04:56:22 +0000 2018, Text='RT @_Makada_: Liberals are freaking out because Trump supposedly called Haiti and some African nations "shithole countries" in an Oval Offi…'),
 Status(ID=951679221864239104, ScreenName=jsocial123, Created=Fri Jan 12 04:56:22 +0000 2018, Text='RT @realDonaldTrump: “His is turning out to be an enormously consequential presidency. So much so that, despite my own frustration over his…'),
 Status(ID=951679221721653248, ScreenName=Kitdupree, Created=Fri Jan 12 04:56:22 +0000 2018, Text='RT @alozrasT11: Oops!! She’s NOT licensed to practice PSYCHIATRY!! \nWhat an idiot!!! 🙄🤔\n\n#LiberlismIsAMentalDisorder\n\nLiberal Shrink ‘Diagn…'),
 Status(ID=951679221692358656, ScreenName=LamarcusDavis2, Created=Fri Jan 12 04:56

In [8]:
texts = [result.text for result in results]

## Find lists of stopwords, positive words, and negative words.

Text analysis is a common problem in data science. Luckily, very smart and capable people have developed powerful packages that already do most of the work. For text analysis, a great package is [NLTK](http://www.nltk.org). Let's install it.

In [235]:
!pip install nltk



Now, we are able to import NLTK. The task is to count the positive and negative words and exclude stopwords. NLTK already provides a list of stopwords. We download the list and import it. Then, we tokenize the tweets, which means that we separate each word. (We need the `punkt` list from NLTK for this.)

In [236]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /home/dv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[nltk_data] Downloading package punkt to /home/dv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

We focus on the English language.

In [237]:
sw = set(stopwords.words('english'))

We apply the function `word_tokenize` to each tweet text.

In [238]:
token_text = [word_tokenize(str(text)) for text in texts]

Next, we remove all the stopwords that the set `sw` contains. Remember, we have a tokenized list now, which is essentially a list of lists. For each tweet, we retain all the words that are not stopwords.

In [239]:
relevant_words = []
for t in token_text:
    rw = [w for w in t if str(w).lower() not in sw]
    relevant_words.append(rw)
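
To illustrate the filtering step without any downloads, here is a small sketch on made-up data. The stopword set and the regular-expression tokenizer are deliberately simplified stand-ins (assumptions for this example), not the NLTK stopword list or `word_tokenize`.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
toy_stopwords = {"the", "is", "a", "of", "and"}

def simple_tokenize(text):
    """Split text into word tokens with a basic regular expression."""
    return re.findall(r"[A-Za-z']+", text)

def remove_stopwords(tokens, stopword_set):
    """Keep the tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]

sample_tweets = ["The weather is great", "A terrible end of the day"]
filtered = [remove_stopwords(simple_tokenize(t), toy_stopwords)
            for t in sample_tweets]
```

The result is again a list of lists, one inner list per tweet, which mirrors the structure of `relevant_words`.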

NLTK includes the VADER Sentiment Analyzer, which is particularly interesting for social media. Read more about VADER here:

>Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

We import VADER, download the lexicon, and create an Analyzer object.

In [240]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /home/dv/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

Our goal was to count the positive and negative words. Thus, we flatten the list of tweets into a single list of words. We also create a function that gets the compound score from VADER and map it over the list of words.

In [302]:
flatten_relevant_words = [word for text in relevant_words for word in text]

def get_compound_score(word):
    if word is not None:
        return sid.polarity_scores(word)['compound']
    else: 
        return 0

scores = list( map(get_compound_score, flatten_relevant_words))
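
The flatten-then-score pattern can be sketched on toy data as follows. The lexicon `toy_lexicon` and its values are invented for illustration and are not VADER scores; `toy_score` is a hypothetical stand-in for `get_compound_score`.

```python
from itertools import chain

# Made-up word scores for illustration only (not taken from VADER).
toy_lexicon = {"great": 0.6, "terrible": -0.5}

def toy_score(word):
    """Return the toy score of a word, or 0.0 if the word is unknown."""
    return toy_lexicon.get(word.lower(), 0.0)

tokenized = [["weather", "great"], ["terrible", "end", "day"]]

# chain.from_iterable turns the list of lists into one flat sequence.
flat_words = list(chain.from_iterable(tokenized))
toy_scores = [toy_score(word) for word in flat_words]
```

`chain.from_iterable` is an alternative to the nested list comprehension; both produce the same flat list.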

## Calculate the ratio of positive to negative words in your sample.

Next, we simply count the positive and negative words. We can also count the neutral words. (Caution: we have a very broad definition of what counts as positive or negative.)

In [242]:
pos = sum(1 for score in scores if score > 0)
neg = sum(1 for score in scores if score < 0)
neu = sum(1 for score in scores if score == 0)

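Next, we calculate by how many percent the positive words outnumber the negative words, that is, `(pos - neg) / neg * 100`. The plain ratio of positive to negative words is this value divided by 100, plus 1.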

In [243]:
((pos-neg)/neg)*100

22.727272727272727
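
The relationship between the plain ratio and the percentage computed above can be made explicit in a small helper. This is a sketch under the assumption that a score above zero counts as positive and a score below zero as negative; `sentiment_ratios` is a hypothetical name, and the input scores are made up.

```python
def sentiment_ratios(scores):
    """Return (pos/neg ratio, percentage excess of positive over negative)."""
    pos = sum(1 for s in scores if s > 0)
    neg = sum(1 for s in scores if s < 0)
    if neg == 0:
        # Without negative items the ratio is undefined.
        return None, None
    ratio = pos / neg
    pct_excess = (pos - neg) / neg * 100
    return ratio, pct_excess

# Three positive, two negative, and one neutral score.
ratio, pct = sentiment_ratios([0.6, 0.4, 0.2, -0.5, -0.1, 0.0])
```

Guarding against `neg == 0` avoids a `ZeroDivisionError` on samples that contain no negative words.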

We can apply the same procedure to the full tweets instead of to single words.

In [283]:
full_tweet_scores = list( map(get_compound_score, texts))
pos = sum(1 for score in full_tweet_scores if score > 0)
neg = sum(1 for score in full_tweet_scores if score < 0)
neu = sum(1 for score in full_tweet_scores if score == 0)
((pos-neg)/neg)*100

21.21212121212121

## Get at least 500 tweets on your trending topic (Part II)

Currently, our sample is limited by the search query, which only allows 100 tweets. Thus, we will collect tweets on our topic using the stream functionality of Twitter.

We set up a StreamFilter object and let it track the topic. Each tweet from this stream is appended to the list `streamed_tweets`. Once that list is 500 tweets long, we stop. The API documentation is a bit thin, but [this blog post](https://iseverythingstilltheworst.com/blog/2016/05/28/capturing_twitter_streams/) helped a lot.

In [306]:
stream = api.GetStreamFilter(track=[topic])
streamed_tweets = []

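for tweet in stream:
    # Convert the raw JSON dictionary to a Status object and keep its text.
    streamed_tweets.append(twitter.Status.NewFromJsonDict(tweet).text)
    if len(streamed_tweets) == 500:
        # Close the stream and leave the loop once we have enough tweets.
        stream.close()
        break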
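
The idea of taking a fixed number of items from an endless stream can be sketched without network access. Here `fake_stream` is a made-up generator standing in for the Twitter stream, and `collect_texts` is a hypothetical helper; the sketch assumes that stream items are dictionaries and that some of them may lack a `text` field.

```python
from itertools import islice

def fake_stream():
    """Stand-in for a tweet stream: yields dictionaries indefinitely."""
    n = 0
    while True:
        yield {"text": f"tweet number {n}"}
        n += 1

def collect_texts(stream, limit):
    """Collect the text of at most `limit` items, skipping items without text."""
    with_text = (item["text"] for item in stream if "text" in item)
    # islice stops pulling from the stream once `limit` texts are collected.
    return list(islice(with_text, limit))

collected = collect_texts(fake_stream(), 500)
```

Using `islice` keeps the stopping condition out of the loop body, which is one possible alternative to counting the list length by hand.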

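We apply the same scoring function to the streamed tweets. This time, only tweets with a compound score above 0.5 count as positive and only tweets with a score below -0.5 count as negative.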

In [309]:
stream_scores = list( map(get_compound_score, streamed_tweets))
pos = sum(1 for score in stream_scores if score > 0.5)
neg = sum(1 for score in stream_scores if score < -0.5)
neu = sum(1 for score in stream_scores if score == 0)
((pos-neg)/neg)*100

8.771929824561402

1.087719298245614

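This gives us the current balance of positive to negative tweets: positive tweets outnumber negative ones by about 8.8 percent, which corresponds to a ratio of about 1.09.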
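
Because the result depends on where the cut-offs are placed, it can help to make the threshold an explicit parameter. The sketch below is one possible way to do this; `classify` is a hypothetical helper, the scores are made up, and, as a design choice of this example, every score between the two cut-offs is treated as neutral.

```python
from collections import Counter

def classify(score, threshold=0.5):
    """Label a compound score as positive, negative, or neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    # Everything between the two cut-offs counts as neutral here.
    return "neutral"

sample_scores = [0.8, 0.3, 0.0, -0.2, -0.7]
counts = Counter(classify(s) for s in sample_scores)
```

Rerunning the count with a different `threshold` shows how sensitive the ratio is to this choice.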

## Clean up (Optional for Docker Use)

We have installed a couple of packages that we want to include in the Docker container. We create a list of installed packages that we can use to update the requirements for the Docker container.

In [314]:
! pip freeze > requirements.txt