In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

# Hw 5: Sentiment Analysis on Twitter Data 🐦

Name:

Student ID:

Collaborators:

## Instructions

For this homework, work through **Lab 5 (Sentiment Analysis on Movie Reviews)** first. Most of the things we ask you to do in this homework are explained in the lab. In general, you should feel free to import any package that we have previously used in class. Ensure that all plots have the necessary components that a plot should have (e.g. axes labels, a title, and a legend if it is applicable).

Frequently **save** your notebook!

### Collaborators and Sources
Furthermore, in addition to recording your **collaborators** on this homework, please also remember to **cite/indicate all external sources** used when finishing this assignment. 
> This includes peers, TAs, and links to online sources. 

Note that these citations will be taken into account during the grading and regrading process.

In [None]:
# collaborators and sources:
# Albert Einstein and Marie Curie
# https://developers.google.com/edu/python/strings

# your code here
answer = 'my answer'

### Submission instructions
* Submit this Python notebook, with your answers in the code cells, as your homework submission.
* **Do not change the number of cells!** Your submission notebook should have exactly one code cell per problem. 
* Do **not** remove the `# your code here` line; add your solution after that line. 

### Some imports and configurations

In [None]:
import twitter

import sys
import re, string
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

## 1. API Usage

An API, or "application programming interface," is a set of tools and routines used to build software applications. For example, Twitter provides an API that lets other programs and projects access different aspects of Twitter. Depending on the API, one can post to Twitter or search for tweets (which is what we will do in this homework). 

### Public and Secret Keys

The Twitter API, like many others, uses public and secret keys to make sure that only validated users can access its commands and functions. Supplying these keys to your program authenticates you and lets the API know who is using which commands. For the most part, these APIs (or the companies behind them) make it fairly easy to obtain your own API keys. 
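
One common way to keep keys out of a shared notebook is to read them from environment variables instead of typing them into a cell. The sketch below assumes four hypothetical variable names (they are not defined by Twitter or by this course) and uses placeholder values so that it runs without real credentials.

```python
import os

# Hypothetical variable names; use whatever names you export in your own shell.
KEY_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN_KEY",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return a dict of credentials, raising if any variable is missing or empty."""
    missing = [name for name in KEY_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in KEY_NAMES}


# Demonstration with placeholder values instead of real keys.
demo_env = {name: "placeholder-" + str(i) for i, name in enumerate(KEY_NAMES)}
credentials = load_credentials(demo_env)
print(sorted(credentials))
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.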

### Creating and Accessing API Keys

To create your own API Keys for Twitter Data, the steps are quite simple! While this does require a Twitter account, there is no extra information or payment necessary to access this data.

First, go to https://developer.twitter.com and sign in to your account. If you do not have a Twitter account (or don't want to link your personal account), you can use a separate email for this.

Next, click on Apps and fill out the forms to the best of your ability (not all information is necessary and where applicable, note that this is for a university course/academic use). This will require a phone number, which Twitter uses as a safety check (they don't hand out API keys to just everyone!).

Once approved, click on Keys and Tokens, and you should find your API Keys and access tokens!

**PLEASE READ:** If you are not approved in a timely manner, you may use the key and access token from someone else's (approved) account. Please do not wait until the last day to do this!!! Give yourself ample time to finish the assignment. Alternatively, you can contact us via Piazza and we can provide a zipped `sampleTwitterData` folder for you to work with. This is only to be used in an emergency case. In this **pre-scraped** data, there are 10 CSV files containing the tweets for 10 keywords, and each CSV has over 500 tweets, but these tweets are **not** processed! You will still need to implement and run an appropriate version of `preProcess()` (under problem 2) on that data.

<!-- BEGIN QUESTION -->

### Problem 1.1

**Do this!** Create a variable `api` and assign it the result of the `twitter.Api` function with the appropriate arguments. Ignore any arguments that are not related to your consumer information or access tokens. Enable `sleep_on_rate_limit`.  

> **Hint**: Use IPython's `?` tool to explore the documentation and find out which arguments are needed. **Note:** Make sure to remove the `?` call before you submit your assignment; it will break your grader and **we will not be able to give you credit**.
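
As an alternative to `?`, the standard library can list a function's parameters using plain Python syntax. The sketch below is only an illustration: it inspects `json.dumps` as a stand-in, and the same calls can be pointed at whichever function you want to look up.

```python
import inspect
import json

# json.dumps stands in for any function whose arguments you want to look up.
signature = inspect.signature(json.dumps)
parameter_names = list(signature.parameters)
print(parameter_names)

# The first line of the docstring is usually a one-sentence summary.
summary = inspect.getdoc(json.dumps).splitlines()[0]
print(summary)
```

The built-in `help()` function prints the same information in a longer, formatted form.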

In [None]:
...

<!-- END QUESTION -->

## 2. Get and Preprocess the Twitter Data

Let's start by adding in your Twitter API keys, which can be found on the Twitter developer website. This will give us access for searching tweets, but mostly **limiting us to the past week** and occasionally restricting the number of tweets we can pull at a time.

Your next few tasks are to implement the following three functions that will pull and clean the information from Twitter. The last task will be conducting a sentiment analysis on the data that has been pulled! As a fair warning, Twitter data can be pretty messy; your results may not be nearly as clean as the movie dataset.

<!-- BEGIN QUESTION -->

### Problem 2.1 

**Do this!** Complete the following function that will take an arbitrary search term and return a list of tweets.

**PLEASE BE CAREFUL ABOUT INFINITE LOOPS DURING THIS ASSIGNMENT! IF YOU MAKE TOO MANY REQUESTS TO THE API, YOU CAN GET RATE LIMITED, WHICH CAN BAN YOU FROM USING IT FOR A PERIOD OF TIME**

> **Hint**: Make sure it works before you continue. Look at the returned list, its length, and _some_ of its entries. **Best Practice**: Do not print out the entire list in the version of the notebook that you submit/deploy/share. 

In [None]:
def getResults(searchTerm, untilID):
    
    # This function will return a list of tweets.
    # We will use api.GetSearch() in order to pull the Twitter data
    #
    # There are several parameters that we will need to include:
    #
    #     term: string, the term that is being searched
    #     since: string in format 'YYYY-MM-DD' which will serve as the earliest date
    #     until: string in format 'YYYY-MM-DD' which will serve as the latest date
    #     count: int, the number of tweets to return, max of 100
    #     result_type: string, type of sorting for tweets. Typically 'recent'
    #     max_id: int, another check to limit the tweets returned. Typically sys.maxsize
    #     lang: string, indication of the language being used. We recommend leaving lang as 'en'.
    #
    # You are free to change these however you wish.
    #
    # CAUTION: The 'since' variable must be at most a week prior to the current date!
    #          If you would like to search further, you must apply through Twitter
    
    
    ...
    
    return results

<!-- END QUESTION -->



In [None]:
my_search_term = "data science" # Replace this with your chosen search term! 

In [None]:
results = getResults(my_search_term, sys.maxsize) 

### Problem 2.2

Now, we will need to process the tweets. 

**Do this!** Complete the following function that processes _one_ tweet. The function takes in a Twitter object from python-twitter representing one tweet and returns the processed result! 
* Make sure not to consider _retweets_ (`result.retweeted_status`) or _media posts_ (`result.media`).
    * If a `result` is one of those, return `None`. 
* Often there are links at the end of the tweet. Remove those by keeping only the text before `"https://"`.
* Remove whitespaces and newlines (`\n`).
* Deal with punctuation within the tweets: Remove most (if not all) punctuation.
* Convert everything to all lowercase. 
* Split the result into a list of words. 

Return this list of words as `processedResult`.

> **Hint**: You may use _regular expressions_ for this problem. Regular expressions are extremely useful when performing string parsing or string searching. `re` is the Python package for this. 
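
The text-cleaning steps listed above can be sketched on a plain string. This is a minimal illustration, not the required implementation: `clean_text` is a hypothetical helper, it works on a string rather than a tweet object, and it leaves out the retweet and media checks entirely.

```python
import re
import string


def clean_text(text):
    """Cut the trailing link, drop punctuation, lowercase, and split into words."""
    # Keep only the part before the first link, if there is one.
    text = text.split("https://")[0]
    # Replace every punctuation character with a space.
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    # str.split() with no arguments also discards newlines and repeated spaces.
    return text.lower().split()


words = clean_text("Data Science is FUN!\nSee more: https://example.com/abc")
print(words)
```

Replacing punctuation with a space (rather than deleting it) keeps words separated when a tweet omits the space after a comma or period, at the cost of splitting contractions such as "don't" into two tokens.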

*Note: this question has hidden tests, or is graded on style of code and not just answer alone.*

In [None]:
def preProcess(result):
    
    ...
        
    return processedResult

In [None]:
grader.check("q2.2")

In [None]:
# To test this you might need to rerun this a couple of times 
# until you hit a tweet that is not a media post or retweet.
preProcess(results[np.random.randint(0, len(results))])  # index stays valid even with fewer than 100 results

### Problem 2.3

Now, let's put it all together. Since we will likely end up with fewer than 100 tweets once we preprocess all of them, we have to put the `getResults` call into a `while` loop. 

**Do this!** Complete the following function that will take in a search term and query the Twitter API to find the most recent tweets using that search term!

* Create a `while` loop that runs as long as the length of `processedResults` is under `100`. In the while loop:
    * Call the `getResults` function, passing in the `searchTerm` as well as `untilID`.
    * Create a loop going through each `result` in the list of returned results from `getResults`. In that loop:
        * Run `preProcess` on each `result`: if the returned value is not `None`, append the returned list of words and the `result.id` to their respective lists.
        * Make sure to break the inner loop once you have 100 processed results.
* **WARNING**: The Twitter API implements **rate limiting**, and it is pretty strict. After multiple runs, the API calls that scrape the Twitter data will slow down, and you will have to wait 15-30 minutes before being able to run them again! So please think carefully about your implementation below, and especially avoid implementing an infinite loop ☠️!
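
The paging pattern described above can be tried out safely without touching the real API. The sketch below is an illustration under stated assumptions: `fake_get_results` and its 250 made-up tweets stand in for `getResults`, the "skip odd ids" rule stands in for `preProcess` returning `None`, and the `max_requests` cap is an extra safeguard that the assignment does not ask for.

```python
import sys

# Stand-in for the API: 250 fake tweets with ids 1..250 (purely illustrative).
FAKE_TWEETS = [{"id": i, "text": "tweet %d" % i} for i in range(1, 251)]


def fake_get_results(max_id, count=100):
    """Return up to `count` fake tweets with id <= max_id, newest first."""
    matching = [t for t in FAKE_TWEETS if t["id"] <= max_id]
    return sorted(matching, key=lambda t: t["id"], reverse=True)[:count]


def collect(target=100, max_requests=5):
    """Page backwards through the results until `target` items are kept."""
    until_id = sys.maxsize
    ids, kept = [], []
    requests_made = 0
    # The request cap guarantees termination even if too few tweets qualify.
    while len(kept) < target and requests_made < max_requests:
        batch = fake_get_results(until_id)
        requests_made += 1
        if not batch:  # nothing older is left, so stop instead of looping
            break
        for tweet in batch:
            if tweet["id"] % 2 == 0:  # pretend odd ids are retweets to skip
                kept.append(tweet["text"])
                ids.append(tweet["id"])
            if len(kept) == target:
                break
        if ids:  # min() of an empty list would raise ValueError
            until_id = min(ids) - 1  # next request starts below the oldest kept id
    return kept, requests_made


kept, requests_made = collect()
print(len(kept), requests_made)
```

Testing the loop against a stub like this first means that a mistake in the stopping condition costs nothing, whereas the same mistake against the live API can use up your rate limit.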

In [None]:
def searchTerm(searchTerm):
    
    # These two variables are used to keep track of calls made to the API
    untilID = sys.maxsize
    ids = []
    processedResults = []
    
    # # Use this as a template!
    #
    # while ...:
    #      **Your code here**
    #
    #      untilID = min(ids) - 1        # Be sure to include this!
    # 
    # return processedResults
    
    ...
    
    return processedResults

Select your search term to test the functions.

In [None]:
data = searchTerm(my_search_term) # feel free to go change this above!

In [None]:
grader.check("q2c")

> **Hint**: Make sure this works before you continue. Look at the returned data, its type, length, and _some_ of its entries. **Best Practice**: Do not print out the entire data in the version of the notebook that you submit/deploy/share. 

<!-- BEGIN QUESTION -->

## 3. Analyzing Twitter Data

Great! We now have all the data stored in our `data` variable. We can cycle through this data set and perform the same rule-based sentiment analysis that we saw previously in the lab.

### Problem 3.1

**Write up!** Let's create a hypothesis about the content of our tweets.
* What would you guess the fraction of tweets with positive and negative emotions will be for your data/search term?
* Write this up (_before_ you perform the sentiment analysis) in the form of a **hypothesis (Q1)**.


We will investigate how accurate this hypothesis was at the end of this section.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Problem 3.2

**Do this!** Complete the following function that runs a rule-based sentiment analysis on _one_ given entry.
* Set the score to zero, then loop through each word in the entry
    * At each word, add one to the score if it is in `positive_words`, 
    * subtract one if it is in `negative_words`
    * or do nothing if it is in neither!
* Return `1` if the score is not negative and `-1` otherwise. 


 > **Hint**: We will declare `positive_words` and `negative_words` as `global` variables, so don't worry about passing those in as arguments. 
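
The scoring rule above can be checked by hand on a few tiny inputs. The sketch below is only an illustration: `score_entry` is a hypothetical helper that takes the word lists as arguments (unlike `analyzeSentiment`, which relies on the global lists), and the three-word lexicons are made up.

```python
# Tiny illustrative lexicons; the assignment loads much larger lists from files.
demo_positive = {"good", "great", "love"}
demo_negative = {"bad", "awful", "hate"}


def score_entry(entry, positive, negative):
    """Return 1 if the word-count score is >= 0, otherwise -1."""
    score = 0
    for word in entry:
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    return 1 if score >= 0 else -1


print(score_entry(["i", "love", "this"], demo_positive, demo_negative))
print(score_entry(["awful", "bad", "good"], demo_positive, demo_negative))
print(score_entry(["just", "words"], demo_positive, demo_negative))
```

Note that an entry with no sentiment words scores zero and is therefore labeled `1`, which follows from the "not negative" rule and can tilt the results toward positive.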

In [None]:
def analyzeSentiment(entry):
    
    ...

Now, we can run all our tweets through this function and collect their sentiment. 

In [None]:
sentiments = []

global negative_words
global positive_words

with open('utility/data/negative-words.txt') as f:
    negative_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]

with open('utility/data/positive-words.txt') as f:
    positive_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    
for entry in data:
    sentiments.append(analyzeSentiment(entry))
sentiment_labels = np.array(sentiments)    
    
sentiment_labels

In [None]:
grader.check("q3b")

<!-- BEGIN QUESTION -->

## 4. Visualizing the Results

The final step is creating a few simple charts to look at the overall sentiment of the current Twitter search.

### Problem 4.1

**Do this!** Create a `bar` chart that visualizes the frequency of the positive and negative tweets in your dataset. Use appropriate axis labels, and include the search term (remember that you stored that in a variable earlier on) in your figure title. 
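
Before plotting, the labels need to be turned into one count per category. The sketch below shows one possible way to do that with the standard library; `demo_labels` is made-up data, not real results, and the variable names are only illustrative.

```python
from collections import Counter

demo_labels = [1, -1, 1, 1, -1, 1]  # made-up labels, not real results

counts = Counter(demo_labels)
categories = ["positive", "negative"]
heights = [counts[1], counts[-1]]  # a missing label counts as 0
fractions = [height / len(demo_labels) for height in heights]

print(dict(zip(categories, heights)))
```

Lists shaped like `categories` and `heights` can then be passed to `plt.bar`, and the fractions are handy for comparing against a hypothesis stated as a proportion.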

In [None]:
# Let's run a configuration to make prettier plots
plt.rcdefaults()

...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 4.2

**Write up!** Let's compare these results with our hypothesis **(Q1)**.
* How accurate was your hypothesis?
* What do you think could have caused your guess to be very accurate or inaccurate? 

We will see another way of looking at this data to find more explanations in the next part.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Another Visualization: Wordclouds

For a slightly more colorful view of the overall data, we can use a wordcloud module!

In [None]:
def get_all_words(data2plot):
    
    overallWords = ' '

    for entry in data2plot:
        for word in entry:
            overallWords += word + ' '

    return overallWords

wordcloud = WordCloud(width=600, height=430, max_words=50).generate(get_all_words(data))

plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Now, let's look at wordclouds based on the sentiment! To do this, we split up the data into one list of lists of all positive tweets and one for all negative tweets.

In [None]:
positive_data = [i for indx,i in enumerate(data) if sentiments[indx] == 1]

negative_data = [i for indx,i in enumerate(data) if sentiments[indx] == -1]

<!-- BEGIN QUESTION -->

### Problem 4.3

**Do this!** Create one wordcloud for the positive tweets and one for the negative tweets that _intuitively_ visualize your results. 
* Use a different `colormap` for each wordcloud (check out the available colormaps [here](https://matplotlib.org/stable/tutorials/colors/colormaps.html)) **and/or** 
* play with the `background_color` (check out the available colors [here](https://matplotlib.org/stable/gallery/color/named_colors.html)).
* Add appropriate titles to your subplots. 

> **Hint**: Follow the example above and use `get_all_words()`.

> **[🐍 Python Feature 🐍]**: We can create figures with **multiple plots** using `plt.subplot`. The first number indicates the number of rows, the second the number of columns, and the third the plot you want to fill next. For example, to plot into the lower right corner of a figure with four plots in a 2x2 grid, you would use `plt.subplot(224)`.
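
To make the three-digit shorthand concrete, the sketch below decodes a code such as `224` into its parts and into a zero-based (row, column) grid position. `decode_subplot` is a hypothetical helper written for illustration; it is not part of Matplotlib and only mirrors the numbering convention (counting starts at 1 and runs left to right, then top to bottom).

```python
def decode_subplot(code):
    """Split a three-digit subplot code into rows, columns, index, and grid cell."""
    rows, cols, index = (int(digit) for digit in str(code))
    # Subplots are numbered from 1, left to right, then top to bottom.
    row, col = divmod(index - 1, cols)
    return rows, cols, index, (row, col)


print(decode_subplot(224))
print(decode_subplot(121))
```

So `224` is the cell in the second row and second column of a 2x2 grid, and `121` is the left cell of a side-by-side pair.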

*Note: this question is graded on style/design choices of your visualization and not just on correctness alone.*

In [None]:
plt.figure(figsize=(30,15))
plt.subplot(121)


...


plt.subplot(122)

...

plt.show()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## 5. Summarize your Findings

### Problem 5.1

**Write up!** The visualizations reveal a lot of information about your Twitter data. Describe your basic findings by answering the following questions:

- What was the overall sentiment? 
- What were some of the most/least frequent words that were used (larger = more common)? 
- Why do you think this is? 
- Do you believe this would be different during different weeks?
- How do the words used and their frequency differ for positive versus negative sentiments?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 5.2

**Write up!** Elaborate on one specific thing or insight from your analysis that you find particularly interesting. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)