# Module 1 Lab: Data Wrangling and Simple Twitter Sentiment Analysis

Twitter represents a fundamentally different instrument to make social measurements. Millions of people voluntarily express opinions across any topic imaginable—this data source is incredibly valuable for both research and business.

For example, researchers have shown that the "mood" of communication on Twitter reflects [biological rhythms](http://www.nytimes.com/2011/09/30/science/30twitter.html) and can even be used to [predict the stock market](http://arxiv.org/pdf/1010.3003&embedded=true). A student here at UW used geocoded tweets to [plot a map of locations where "thunder" was mentioned in the context of a storm](http://cliffmass.blogspot.com/2012/07/thunderstorm-fest.html) system in Summer 2012.

Researchers from Northeastern University and Harvard University studying the characteristics and dynamics of Twitter have an [excellent resource](http://www.ccs.neu.edu/home/amislove/twittermood/) for learning more about how Twitter can be used to analyze moods on a national scale.


In this assignment, you will:

* access the Twitter Application Programming Interface (API) using Python,
* estimate the public's perception (the sentiment) of a particular term or phrase, and
* analyze the relationship between location and mood based on a sample of Twitter data.




Some points to keep in mind:

* This assignment is open-ended in several ways. You will need to make some decisions about how best to solve the problem and implement them carefully.
* It is acceptable to discuss your solution on the forum, but do not share code.
* Each student must submit their own solution to the problem.
* Your code will be run in a protected environment, so you should only use the Python standard libraries unless you are specifically instructed otherwise. Your code should also not rely on any external libraries or web services.


**All assignment materials are available in Canvas from 1.16 Lab: Twitter Sentiment Analysis**

Files provided:
* `Lab_1.16_Data_Wrangling_and_Twitter_Sentiment_Analysis.ipynb`: This jupyter notebook.
* `AFINN-111.txt`: Sentiment scores for a large number of words and phrases
* `AFINN-README.txt`: Documentation for the AFINN-111.txt


Assignment due date: Check Canvas! 

<div class="alert alert-block alert-info">
<b>Tip: The Twitter Application Programming Interface:</b> Twitter provides a very rich REST API for querying the system, accessing data, and controlling your account. Twitter provides <a href="https://dev.twitter.com/docs">extensive documentation about their API</a>.
</div>

<div class="alert alert-block alert-info">
<b>Tip: Learning Python:</b> If you are new to Python, you may find it valuable to work through the <a href="http://www.codecademy.com/tracks/python">Codeacademy Python tutorials</a>. Focus on tutorials 1–9, plus tutorial 12 on File IO. In addition, many students have recommended <a href="https://developers.google.com/edu/python/">Google's Python class</a> and the materials at <a href="https://www.datacamp.com/tracks/python-fundamentals">DataCamp</a>.
</div>

<div class="alert alert-block alert-warning">
<b>Warning: Unicode Strings</b> 
    
Strings in the Twitter data prefixed with the letter "u" are Unicode strings. For example:

`u"This is a string"`

Unicode is a standard for representing a much larger variety of characters beyond the Roman alphabet (Greek and Russian alphabet, mathematical symbols, logograms from nonphonetic writing systems such as kanji, etc.)

In most circumstances, you will be able to use a Unicode object just like a string.

If you encounter an error involving printing Unicode, you can use the encode method to properly print the international characters, like this:

            unicode_string = u"aaaàçççñññ"

            encoded_string = unicode_string.encode('utf-8')

            print encoded_string`

Once again: If you are new to Python, many students have recommended <a href="https://developers.google.com/edu/python/">Google's Python class</a>.

</div>

<div class="alert alert-block alert-warning">
<b>Warning: API requires mobile phone</b> For this assignment, we hope to have you work with the live Twitter stream using the development API. However, Twitter requires you to attach a mobile phone number to your account in order to use the developer API. If you do not have a mobile phone, or if your carrier is not supported, you may find <a href="http://meberhard.me/workaround-twitter-application-write-access-mobile-number-accepted-twitter-website/">this workaround</a> useful. However, if you cannot get access to Twitter data due to problems with their login verification protocol, you can instead use the sample dataset on Canvas called three_minute_tweets.json.
</div>

## 1. Get Twitter Data



To access the live stream, you will need to install the `oauth2` library so you can properly authenticate.

The command

        $ pip install oauth2

should work for most environments.

If you are on Windows, before you can run the command above, you may need to install pip. You can watch a video on installing pip on Windows.

The steps below will help you set up your Twitter account to be able to access the live 1\% stream.

Create a Twitter account if you do not already have one.

1. Go to https://dev.twitter.com/apps and log in with your Twitter credentials.
1. Click Apply to get a developer account
1. Enter your information and verify your email
1. Enter the app name (or go back to the developer page and click "Create App"
1. You should be shown your keys, or you can go to "Keys and Tokens" tab along the top.
1. Copy all the keys to a safe place.
1. Set the variables corresponding to the API key, API secret, and Bearer token. (Actually all you need is the bearer token)

🎒<font color='red'>(2 points)</font>
Enter your keys in the next cell

In [35]:
api_key = "<Enter api key>"
api_secret = "<Enter api secret>"
bearer_token = "AAAAAAAAAAAAAAAAAAAAAEP8awEAAAAAPoFSsd4OKDdT5Ai0XrRtGGMa9wc%3DeT4HeKBEIT4fIaZAhvEivIRfkSugHJIY4t8f8nSwpSfSWIRDFT"#"<Enter Bearer Token>"

The next cell includes some helper functions for accessing the Twitter API.  You do not need to edit the code here.

In [36]:
import requests
import os
import json

def create_url():
    base = "https://api.twitter.com/2/tweets/sample/stream"
    fields = "tweet.fields=created_at,entities,source,possibly_sensitive,lang"
    expansions = "expansions=author_id,geo.place_id"
    userfields = "user.fields=created_at,location,username"
    return base +"?" + "&".join([fields,expansions,userfields])

def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """
    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2SampledStreamPython"
    return r


def connect_to_endpoint(url):
    response = requests.request("GET", url, auth=bearer_oauth, stream=True)
    print(response.status_code)
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            yield json.dumps(json_response, indent=4, sort_keys=True)
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )

def fetchsamples():
    return connect_to_endpoint(create_url())


Now we'll check to see if everything works by requesting 10 tweets and printing them.

In [40]:
# test the function: retrieve 10 tweets
streamer = fetchsamples()
for _ in range(10):
  print(next(streamer))

200
{
    "data": {
        "author_id": "755460264439390208",
        "created_at": "2022-04-02T21:52:44.000Z",
        "entities": {},
        "geo": {},
        "id": "1510374700484071435",
        "lang": "en",
        "possibly_sensitive": false,
        "source": "Twitter for Android",
        "text": "Damn"
    },
    "includes": {
        "users": [
            {
                "created_at": "2016-07-19T17:52:17.000Z",
                "id": "755460264439390208",
                "location": "United States",
                "name": "\ud835\ude49\ud835\ude64\ud835\ude56\ud835\ude5d \ud83d\udc51 (\ud83d\udc2741-18-10)",
                "username": "CrosbyWRLD"
            }
        ]
    }
}
{
    "data": {
        "author_id": "1289641802203455488",
        "created_at": "2022-04-02T21:52:44.000Z",
        "entities": {
            "annotations": [
                {
                    "end": 33,
                    "normalized_text": "matt",
                    "probability": 0.

In [41]:
streamer = fetchsamples()
# now collect 1,000 tweets
tweets = [next(streamer) for _ in range(1000)]

# The expression above is called a list comprehension -- they are very useful!   
# It creates a list by iterating over another list.
# https://www.w3schools.com/python/python_lists_comprehension.asp

200


## 2. Extract the tweets

The data structure returned by the API is complex, but the tweet text is in there somewhere.

The data returned by the API for Q1 is represented as JSON, which stands for JavaScript Object Notation. It is a simple format for representing nested structures of data --- lists of lists of dictionaries of lists of.... you get the idea.

Each line of the response represents a [streaming message](https://dev.twitter.com/docs/streaming-apis/messages). Most messages, but not all, will be a [tweet object](https://dev.twitter.com/docs/platform-objects/tweets). 

It is straightforward to convert a JSON string into a Python data structure using a library called json.  Then, to parse the tweet objects, you will apply the function `json.loads` to each element in your list.  

The `json.loads` function will parse the line of json data and return a Python data structure; in this case, it returns a dictionary. If needed, take a moment to read [the documentation for Python dictionaries](http://docs.python.org/2/library/stdtypes.html#typesmapping).

You can read the [Twitter documentation](https://dev.twitter.com/docs/platform-objects/tweets) to understand what information each tweet contains and how to access it, but it is not too difficult to deduce the structure by direct inspection.

🎒<font color='red'>(3 points)</font>
Use the json library to parse each line of the API response 

<div class="alert alert-block alert-info">
<b>Tip:</b> Use a list comprehension!</div>


In [48]:

json.loads("""{\n  "data":{\n     "foo":1,"bar":2}}""")

{'data': {'foo': 1, 'bar': 2}}

In [43]:
import json

# Iterate over the tweets you collected earlier and call json.loads on each tweet.

for x in tweets:
    print(json.loads(x))

{'data': {'author_id': '1388713662320312323', 'created_at': '2022-04-02T21:53:04.000Z', 'entities': {'hashtags': [{'end': 89, 'start': 85, 'tag': '최현석'}, {'end': 102, 'start': 90, 'tag': 'CHOIHYUNSUK'}], 'mentions': [{'end': 19, 'id': '1240627844868268033', 'start': 3, 'username': 'treasuremembers'}], 'urls': [{'display_url': 'pic.twitter.com/e77bhvQknG', 'end': 126, 'expanded_url': 'https://twitter.com/treasuremembers/status/1510270585347063809/photo/1', 'start': 103, 'url': 'https://t.co/e77bhvQknG'}, {'display_url': 'pic.twitter.com/e77bhvQknG', 'end': 126, 'expanded_url': 'https://twitter.com/treasuremembers/status/1510270585347063809/photo/1', 'start': 103, 'url': 'https://t.co/e77bhvQknG'}]}, 'geo': {}, 'id': '1510374784399396865', 'lang': 'ko', 'possibly_sensitive': False, 'source': 'Twitter for Android', 'text': 'RT @treasuremembers: 연습 끝났다하면 바로 미친듯이 치고 올라오는 머리😅 왤케 뜰까요 ㅋㅋㅋㅋㅋㅋㅋ암튼 오늘도 연습끝!!!!!!💜🔥💜🔥\n#최현석 #CHOIHYUNSUK https://t.co/e77bhvQknG'}, 'includes': {'users': [{'created_at'

## 3. Estimate the sentiment of each Tweet

Now that you have parsed the tweet objects, you can compute with the data.  

For this part, you will estimate the sentiment of each tweet based on the sum of the sentiment scores of the individual terms in the tweet. 

You are provided a file `AFINN-111.txt`

This file contains a list of precomputed sentiment scores for a large number of words and phrases.  See the file AFINN-README.txt for more information.

Each line in the file contains a word or phrase followed by a sentiment score. 

Each word or phrase that is present in a tweet but not present in AFINN-111.txt should be given a sentiment score of 0. 

You will write code to add up the sentiment scores for each term in each tweet in your sample.

You code will roughly look like this:

for each tweet:
  initialize tweet score
  for each word in the tweet:
      look up score for that word in AFINN-111.txt
      if found, add the score to the running total
  print score for each tweet

To look up the score for a word, you may find it useful to use the data in the AFINN-111.txt file to build a dictionary mapping each term to its score.  Note that the AFINN-111.txt file format is tab-delimited, meaning that the term and the score are separated by a tab character. A tab character can be identified as "\t".  So you will split each line on the tab to separate the key (the term) and the value (the score). 

Your code should print the total sentiment for each tweet, one numeric sentiment score per line. The first score should correspond to the first tweet, the second score should correspond to the second tweet, and so on. If you sort the scores, they will not match up. If you sort the tweets, they will not match up. If you put the tweets into a dictionary, the order will not be preserved.  

You must provide a score for **every** tweet in the list, even if that score is zero. You can assume the sample file will only include English tweets and no other types of streaming messages.  We will test your code using a custom list of tweets.



🎒<font color='red'>(5 points)</font>
Write code in the next cell to print the estimated sentiment score for each tweet 

<div class="alert alert-block alert-info">
<b>Tip:</b>This is real-world data, and it can be messy! Refer to the <a href="https://dev.twitter.com/docs/platform-objects/tweets">Twitter documentation</a> to understand more about the data structure you are working with. Do not get discouraged, and ask for help on the forums if you get stuck!
</div>

In [26]:
# for each tweet:
#  initialize tweet score
#  for each word in the tweet:
#      look up score for that word in AFINN-111.txt
#      if found, add the score to the running total
#  print score for each tweet

## 3. Build an inverted index

In this part, you will associate each term with a list of tweets that contain it.

You'll create a dictionary where each key is a term and each value is a list of tweets that contain that term.

Your output can print this dictionary.

🎒<font color='red'>(3 points)</font>
Write code in the next cell to construct a dictionary mapping each term in your dataset to a list of tweets that contain it.  Each tweet should just be a string, not the whole object. 


In [27]:
# let d be a dictionary
# for each tweet t
#   for each term w in t
#     insert w in d if needed
#     append t to list of tweets for w

## 4. Compute term frequency 

In this part, you will write code to compute the term frequency histogram of the tweets you downloaded from Problem 1.

The frequency of a term `t` is just the number of occurrences of `t` across all tweets.

Each line of output should contain a term, followed by a space, followed by the frequency of that term in the entire dataset. 

There should be one line per *unique* term in the entire file. Even if 25 tweets contain the term "lol", the term "lol" should only appear once in your output (and the frequency will be at least 25!). Each line should be in the format 

        term, frequency


For example, if you have the pair (bar, 0.1245) in Python, it should appear in the output as:

        bar 0.1245
      
If you wish, you may consider a term to be a multiword phrase but this is not required. You may compute the frequencies of individual tokens only.

Depending on how you parsed the data, you may end up computing frequencies for hashtags, links, stop words, phrases, etc. If you choose to filter out these non-words, that is okay, too.



🎒<font color='red'>(2 points)</font>
Write code in the next cell to count the number of tweets for each term.  Use the inverted index you created.


In [28]:
# for each w in any tweet, return the number of tweets it appears in

## 5. Derive the sentiment of unseen terms

In this part, you will estimate the sentiment for terms that **do not** appear in the file `AFINN-111.txt`.

Here is how you might think about the problem: We know we can use the sentiment-carrying words in `AFINN-111.txt` to deduce the overall sentiment of a tweet. Once you deduce the sentiment of a tweet, you can work backward to deduce the sentiment of the non-sentiment-carrying words --- those do not appear in AFINN-111.txt. For example, if the word `soccer` always appears in proximity with positive words like `great` and `fun`, then we can deduce that the term soccer itself carries a positive sentiment.
 
 
Do not feel obligated to use it, but the following paper may be helpful for developing a sentiment metric. Look at the Opinion Estimation subsection of the Text Analysis section in particular.

O'Connor, B., Balasubramanyan, R., Routedge, B. R., &amp; Smith, N. A. (2010, May). <a href="http://www.cs.cmu.edu/~nasmith/papers/oconnor+balasubramanyan+routledge+smith.icwsm10.pdf">*From tweets to polls: Linking text sentiment to public opinion time series*</a><span>. Proceedings of the International AAAI Conference on Weblogs and Social Media.
    
Your code should print results to the screen. Each line of output should contain a term, followed by a space, followed by the sentiment. That is, each line should be in the format `term`, `sentiment`
    
For example, if you have the pair ("foo", 103.256) in Python, it should appear in the output as:
    
        foo 103.256

The order of your output does not matter.

🎒<font color='red'>(5 points)</font>
Write code in the next cell to print an estimated sentiment score for each word in each tweet. 

<div class="alert alert-block alert-info">
<b>Tip:</b>How we will grade Part 5: We will run your script on a file that contains strongly positive and strongly negative tweets and verify that the nonsentiment-carrying terms in the strongly positive tweets are assigned a higher score than the nonsentiment-carrying terms in negative tweets. Your scores need not (and likely will not) exactly match any specific solution.
</div>

In [29]:
# one approach:
# Step 1: Compute sentiment for every tweet (you already did this)
# Step 2: Associate every term with a list of tweet se that contain it (you already did this)
# Step 3: For each term, compute the average sentiment of all tweets that contain it

## 6: Sentiment by State

Write code that computes the average sentiment for all tweets from a given state.  The learning objectives are 1) to consider how we use previous results in downstream tasks, such that assumptions made earlier may effect the interpretation of results later, 2) to read and understand API documentation to figure out how to extract the data you need, and 3) to gain experience working with imperfect data, adapt, and document the design decisions you make.   As a data scientist, you will not be handed the clean, perfect datasets you find in machine learning tutorials!  Your great ideas often won't work because the data is too broken; you'll need to adapt and do the best you can, and above all, make sure you explain what you did.  

There are different ways you might assign a location to a tweet, and none are perfect. Here are three:
* Use the coordinates field (a part of the place object, if it exists) to geocode the tweet. This method gives the most reliable location information, but unfortunately, this field is not always available and you must figure out some way of translating the coordinates into a state.
* Use the other metadata in the place field. Much of this information is hand-entered by the Twitter user and may not always be present or reliable, and may not typically contain a state name.
* Use the user field to determine the Twitter user's home city and state. This location does not necessarily correspond to the location where the tweet was posted, but it is reasonable to use it as a proxy.

You are free to develop your own strategy for determining the state that each tweet originates from.  You may find it useful to use this: [Python dictionary of state abbreviations](http://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/)

You can ignore any tweets for which you cannot assign a location in the United States.

In this file, each line is a Tweet object, as [described in the Twitter documentation](https://dev.twitter.com/docs/platform-objects/tweets).  Not every tweet will even have a text field!  Again, real data is dirty! Be prepared to debug, and feel free to throw out tweets that your code cannot handle to get something working. 

For example, you might choose to ignore all non-English tweets.  That's ok, but document it.  No silent assumptions!  You never know how people may use your result, so you need to account for any an all biases you may introduce.

Your script should print the two-letter state abbreviation of the state with the average tweet sentiment. Your list should be sorted from highest sentiment to lowest sentiment.

Note that you may need **a lot** of tweets in order to get enough tweets with location data. Let the live stream run for a while if you wish.


🎒<font color='red'>(5 points)</font>
Write code in the next cell that prints the state abbreviation along with the highest average sentiment.  Explain limitations and biases in your estimate.

In [30]:
# create a dictionary to hold State -> (sum_of_sentiment, count)
# for each tweet t
#   determine the state s associated with t.  If you can't, it's ok to go on to the next tweet
#   add sentiment(t) to the total for s, and also increment the count so you can compute the average.
# for each state, divide sum of sentiment by the count to compute the average
# sort this list by average score.

## 7: Top 10 hashtags with the highest sentiment scores
In this part, you will write code that will compute the top 10 hashtags ordered by sentiment score.

The learning objectives are 1) to reuse previous code, approaches, and data structures, 2) navigate the Twitter data structures to find and extract the information you need (hashtags in this case), and 3) consider the limitations of this kind of analysis.

Your code should print each hastag and the associated score, one hashtag per line. 
There should be one line per unique hashtag in the entire file. Each line should be in the format 

        hashtag average_sentiment

For example, if you have the pair (bar, 30) in Python, it should appear in the output as:

        bar 30


🎒<font color='red'>(5 points)</font>
Write code in the next cell that prints the state abbreviation along with the highest average sentiment.  Explain limitations and biases in your estimate.

In [31]:
# create a dictionary to hold hashtag -> (sum_of_sentiment, count)
# for each tweet t
#   extract the list of hashtags associated with the tweet.
#   for each hashtag h
#      put it in the dictionary if it's not there already
#      add sentiment(t) to the total for h, and also increment the count so you can compute the average.
# for each hashtag, divide sum of sentiment by the count to compute the average
# sort this list by average score.