# Vibe Check
## Using Twitter’s API and Sentiment Analysis to Understand What’s the What on the Internet

Today's Agenda:
1. Who We Are
2. What are APIs?
3. Using the Twitter API
4. Basic Data Operations and Data Cleaning
5. Sentiment Analysis With Python using NLTK

<!--- TODO: Slides with: Intros - who we are, what does FN do overview, session goals/ what is an API + QR code with link to public google colab + session end slides - career guidance? (One of our learning objectives was "What jobs or internships can you search for to use the skills covered in this workshop?"

Slides instructions: https://medium.com/@mjspeck/presenting-code-using-jupyter-notebook-slides-a8a3c3b59d67
 
-->


# Who We Are

# FiscalNote


# Us

## Annabelle Gary

## Karnika Arora

# APIs
### What are APIs?

An API, or "Application Programming Interface", is one of the most common ways to access data programmatically. An API's documentation tells its users what data is available and how to “ask” the API for it.

If you've ever seen tweets embedded on a webpage, those were pulled in via an API!

Most modern APIs return data in JSON format - the API we will be working with today does as well. JSON is a data format, just like an Excel file is a format in which we store data - but JSON is more flexible and more lightweight, which makes it a great option for exchanging data over the internet.
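
To make the JSON idea concrete, here is a minimal sketch using Python's built-in `json` module. The payload is made up for illustration (the field names only mimic a tweet), so treat it as an assumption rather than real API output.

```python
import json

# Illustrative payload only: the field names mimic a tweet, the values are invented
raw_text = '{"id": "1", "text": "hello world", "lang": "en"}'

tweet = json.loads(raw_text)   # JSON text -> Python dictionary
print(tweet["text"])           # look up a field by its key

round_trip = json.dumps(tweet, indent=2)   # Python dictionary -> pretty-printed JSON text
print(round_trip)
```

Once the JSON is a Python dictionary, we can work with it like any other dictionary.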


Today, we're going to access API data using Python. First, we'll set up some libraries and then our API authentication information.

We have our "bearer_token" stored in a file. It works like a secret password that belongs only to us, so Twitter knows exactly who is asking it for data. We're going to read it in after our imports, and then set up our API request URL and the headers for the request.


In [21]:
import requests # We use this library to make HTTP requests
import json # To parse through the JSON data we get back from the API
import urllib 
import ipywidgets as widgets
from IPython.display import display
from IPython.display import clear_output

with open("../utils/bearer_token.txt", "r") as token_file:
    bearer_token = token_file.read().strip() # strip any trailing newline so the header stays valid
    
headers = {
    "Authorization": f"Bearer {bearer_token}"
}

search_url = "https://api.twitter.com/2/tweets/search/recent"

You'll notice that we also set up what we need to make the actual request to the Twitter API: the headers and the request URL.
1. **Headers**: This information tells Twitter exactly who we are:
    - Authorization: **`bearer_token`**: the secret password that lets Twitter know we are allowed to access this data
    
<!--- Let's delete this: it is not required for the request:
    - **`User-Agent`**: a name for what project we're working on.
    - This information is important for Twitter to track so they can keep track of who is using their API and make sure that nobody is abusing the API. Pretty much every API will require you to identify yourself in some way before you can get data back. -->

2. The URL we're going to request data from. In this case: `https://api.twitter.com/2/tweets/search/recent` - we figured this out by looking at Twitter's API documentation. Most APIs have extensive documentation that will help you decide what request URL to use. Take a look at Twitter's API documentation here:

[Twitter API Documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction)
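
When we send search terms along with a request, they end up attached to the request URL as encoded parameters. The sketch below shows that step by hand with the standard library's `urlencode`; the `requests` library handles a similar encoding for us when we pass `params`. The names `base_url` and `example_params` are illustrative, not part of the workshop code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://api.twitter.com/2/tweets/search/recent"

# Illustrative parameters: a search query plus one optional field
example_params = {
    "query": '#twitter "elon musk" -is:retweet',
    "tweet.fields": "created_at",
}

# Special characters such as '#', '"' and spaces are escaped so they are safe in a URL
full_url = f"{base_url}?{urlencode(example_params)}"
print(full_url)

# Decoding the query string gives back the original values
decoded = parse_qs(urlsplit(full_url).query)
print(decoded["query"][0])
```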

 <!---
    - **`api.twitter.com`**: tells Twitter we're trying to hit the API, as opposed to the main feed/user interface.
    - **`2`**: shows that we're hitting Version 2.0 of the API. If we put `1` instead, we would hit the 1st version, which would both require slightly different request syntax, and would return data formatted differently.
    - **`tweets`**: indicates which data type we want to request. We could also input `users`, `spaces`, or `lists` to get different datatypes back.
    - **`search`**: says we want to search over tweets. We could also put `counts` to get the number of tweets, or we could look up tweets directly by their IDs. `search` allows us to give Twitter a query - a set of terms we want to include or exclude - and we'll get back tweets that match our query terms.
    - **`recent`**: Twitter allows you to search either over only Tweets from the last week, or `all` Tweets, depending on your level of access. We'll stick to `recent`, because we're interested in what's happening on Twitter right now. 
    
 #karnika: I think this can be shorter. I propose completely nixing the bullets I commented out.
-->



## Building a Query

See: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query

<!--- Audience Participation here - ask for hashtags/ keyword search ideas - maybe pull up twitter trends on a screen? Live edit notebook to change search keywords-->

- Search queries are used all throughout the internet - Google, Twitter - pretty much any app you've used allows you to search. If you've ever used the Google advanced search feature, you'll notice that we use some similar syntax here to construct our search queries.

- Our initial search will try to find tweets about Elon Musk - and then we'll take input from all of you and search for something you want to see!


In [None]:
query_string = '#twitter ' # tweets with the #twitter hashtag
query_string += '"elon musk" ' # tweets that have the exact phrase "elon musk" somewhere in their text
query_string += '-is:retweet ' # eliminate retweets
print(query_string)

> **Optional Fields**
>
>- `tweet.fields` lets us request specific fields for each tweet, such as `created_at`
>
>**Query String**
>
>- `-is:retweet` *excludes* any retweets (the leading `-` negates the `is:retweet` operator)

See all the different operators types you can add to your search here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#operators
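
As a rough sketch of how these pieces fit together, the helper below assembles a query string from hashtags and exact phrases. `build_query` and its parameters are hypothetical names made up for this example; they are not part of Twitter's API or of this notebook.

```python
def build_query(hashtags=(), phrases=(), exclude_retweets=True):
    """Join search terms with spaces, which the search treats as AND."""
    parts = [f"#{tag.lstrip('#')}" for tag in hashtags]   # accept 'twitter' or '#twitter'
    parts += [f'"{phrase}"' for phrase in phrases]        # quotes request an exact phrase
    if exclude_retweets:
        parts.append("-is:retweet")                       # leading '-' negates the operator
    return " ".join(parts)

example_query = build_query(hashtags=["twitter"], phrases=["elon musk"])
print(example_query)
```

The result has the same shape as the `query_string` we built by hand above.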

### Now, Let's turn our query string into a widget that we can easily edit

In [None]:
query_widget = widgets.Textarea(
    value= query_string,
    placeholder='Enter a search string',
    description='String:',
    disabled=False
)
display(query_widget)

# Making the API Request

## GET vs POST requests

When using APIs, there are multiple ways you can engage with them. The API Documentation will tell you what you're able to do, but one important thing to know about is what _type_ of requests you can make.

`GET` requests are exactly what they sound like - you usually use them to _GET_ data back from the API. 
`POST` requests are a little more complicated, but generally they are used to _create_ data via the API. Any Twitter Bot you see is going to be using POST requests to create Tweets. See: https://twitter.com/MagicRealismBot
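
To see the difference without sending anything over the network, this sketch builds two request objects with the standard library and asks each one which method it would use. The `example.com` URLs are placeholders, not a real API.

```python
from urllib.request import Request

# Placeholder URLs: these objects are only constructed, never sent
get_request = Request("https://example.com/items?query=cats")
post_request = Request("https://example.com/items", data=b'{"text": "hello"}')

# With no body the request is a GET; attaching data makes it a POST
print(get_request.get_method())
print(post_request.get_method())
```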

Here, we'll make a `GET` request to get our response back from Twitter.


In [None]:
query_params = {'query': query_widget.value,
                'tweet.fields': 'created_at,id,lang,source,text', # what data we want to return
                'expansions': 'author_id'     # will include the profile ID of the author
               }

response = requests.get(url=search_url,params=query_params,
                       headers=headers)
print(response.status_code)

# HTTP Status Codes

You've probably noticed that website URLs start with 'http' or 'https'. HTTP stands for 'Hypertext Transfer Protocol', and this protocol decides how _servers_ and _browsers_ communicate. We don't have to get too deep into this, but it can be useful to understand what certain HTTP status codes mean:
 - 200: OK - this is a 'successful' response from the server
 - 400: Bad Request - the server is telling you that you made a bad request
 - 401: Unauthorized - the server is telling you that your authentication information is expired or incorrect
 - 500: Internal Server Error - something went wrong on the server while it handled your request
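
Python's standard library already knows the official names of these codes. The helper below, `describe_status`, is a hypothetical function written for this example; it looks up the name and adds which family the code belongs to.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase   # standard name, e.g. "Unauthorized"
    except ValueError:
        phrase = "Unknown"                 # not a code the standard library knows
    if 200 <= code < 300:
        family = "success"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    return f"{code} {phrase} ({family})"

for code in (200, 400, 401, 500):
    print(describe_status(code))
```

As a rule of thumb, codes in the 400s mean something was wrong with our request, and codes in the 500s mean something went wrong on the server.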
    

In [None]:
data = response.json()["data"]
print(json.dumps(data, indent=2))
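
One thing to watch for: indexing with `["data"]` raises a `KeyError` if that key is missing. The sketch below assumes that a search with no matches returns a body without a `data` key; both payloads are simplified, made-up examples, and `extract_tweets` is a hypothetical helper.

```python
# Simplified, made-up payloads; real responses contain more fields
with_results = {"data": [{"id": "1", "text": "hi"}], "meta": {"result_count": 1}}
no_results = {"meta": {"result_count": 0}}

def extract_tweets(payload: dict) -> list:
    """Return the list of tweets, or an empty list when there is no 'data' key."""
    return payload.get("data", [])

print(len(extract_tweets(with_results)))
print(len(extract_tweets(no_results)))
```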

# Visualizing Our Data
We'll use Plotly and pandas to understand our Twitter data a little better.

First, let's get _counts_ for our query and see what our data looks like.

In [None]:
count_request = "https://api.twitter.com/2/tweets/counts/recent"
count_query = {
    "query": query_params['query'],
    "granularity": "day"
}
tweet_counts = requests.get(count_request, params=count_query, headers=headers)
print(tweet_counts.status_code)
print(json.dumps(tweet_counts.json(), indent=2))

In [None]:
#%pip install plotly
#%pip install pandas
# create a requirements.txt file @Annabelle
import plotly.express as px
import plotly.graph_objs as go
import pandas as pd

tweet_count_df = pd.DataFrame(tweet_counts.json()["data"]) 
tweet_count_df.head(5)


# When did dates get so complicated?

#alwaysbegoogling

This date format is pretty complex, and it includes a lot of information. This date is in the ISO 8601 representation. Before we simplify it a bit, let's see what each component means:
 - %Y - 2022 - The full four-digit year
 - %m - 11 - Two-digit month (01 for January)
 - %d - 29 - Two-digit day of the month
 - T - this separates the date from the time
 - %H - hour on a 24-hour clock
 - %M - minute
 - %S - second
 - %f - fractions of a second (the API sends three digits, i.e. milliseconds; Python's `%f` accepts up to six)
 - Z - marks the time as UTC (a timezone offset of zero)
> This time format can also be described as `%Y-%m-%dT%H:%M:%S.%fZ`
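
Here is a small, self-contained sketch of that format string in action. The `sample` value is illustrative, written in the same shape as the `start` column above.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Illustrative timestamp in the same shape as the API's 'start' values
sample = "2022-11-29T00:00:00.000Z"

parsed = datetime.strptime(sample, TIMESTAMP_FORMAT)   # text -> datetime object
print(parsed.year, parsed.month, parsed.day)

simple_day = parsed.date()   # drop the time and keep only the calendar day
print(simple_day.isoformat())
```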

In [None]:
from datetime import datetime as dt

# Create a new column with just the simplified day
tweet_count_df['day'] = tweet_count_df['start'].apply(
    lambda start: dt.strptime(start, "%Y-%m-%dT%H:%M:%S.%fZ").date())
    
tweet_count_df.head(5)

## Visualizing Tweet Counts

In [None]:
fig = go.Figure(data=go.Scatter(x=tweet_count_df['day'].astype(dtype=str), 
                        y=tweet_count_df['tweet_count'],
                        marker_color='mediumvioletred', text="tweet_count"))
fig.update_layout({"title": 'Recent tweets',
                   "xaxis": {"title":"Days"},
                   "yaxis": {"title":"Total tweets"},
                   "showlegend": False})
fig.show()

In [None]:
# This is a notes/ skip slide
# just creates a function with the code from the first section so we can rerun it for the sentiment analysis portion
# Create query widget before calling this function by running display(query_widget)
# First obj returned is the count chart (display using fig.show()), second is the tweets obj

def get_tweets():
    query_params = {'query': query_widget.value,
                'tweet.fields': 'created_at,id,lang,source,text', # what data we want to return
                'expansions': 'author_id'     # will include the profile ID of the author
               }
    response = requests.get(url=search_url,params=query_params,
                       headers=headers)
    count_request = "https://api.twitter.com/2/tweets/counts/recent"
    count_query = {
    "query": query_params['query'],
    "granularity": "day"
    }
    tweet_counts = requests.get(count_request, params=count_query, headers=headers)
    # Rebuild the counts table from this request so the chart reflects the current query
    tweet_count_df = pd.DataFrame(tweet_counts.json()["data"])
    tweet_count_df['day'] = tweet_count_df['start'].apply(
        lambda start: dt.strptime(start, "%Y-%m-%dT%H:%M:%S.%fZ").date())
    fig = go.Figure(data=go.Scatter(x=tweet_count_df['day'].astype(dtype=str), 
                        y=tweet_count_df['tweet_count'],
                        marker_color='mediumvioletred', text="tweet_count"))
    fig.update_layout({"title": 'Recent tweets',
                   "xaxis": {"title":"Days"},
                   "yaxis": {"title":"Total tweets"},
                   "showlegend": False})
    return fig, response.json()['data']
    

# Getting Started with Sentiment Analysis

<!--- TODO: Add more here - what is NLP? Short explanation of word tokenization
Ref: https://realpython.com/python-nltk-sentiment-analysis/#using-nltks-pre-trained-sentiment-analyzer
-->

In [None]:
#%pip install nltk
import nltk

#nltk.download(["names", "stopwords", "averaged_perceptron_tagger", "vader_lexicon","punkt"])


In [None]:
words = []
json_response = response.json()
for item in json_response.get('data', []):
    words.extend(nltk.word_tokenize(item.get('text')))
# Use a set for fast lookups; both word lists are lowercase
unwanted = set(nltk.corpus.stopwords.words("english"))
unwanted.update(w.lower() for w in nltk.corpus.names.words())
# Compare in lowercase so capitalized stopwords like "The" are removed too
words_clean = [w for w in words if w.isalpha() and w.lower() not in unwanted]

In [None]:
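fd = nltk.FreqDist(words_clean)
print(fd.most_common(10))
fd.tabulate(5) # tabulate() prints the table itself and returns None
# Note: "https" is not in NLTK's English stopword list, so URL fragments can show up in these counts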
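
To show the idea behind tokenizing, removing stopwords and counting without any downloads, here is a simplified stand-in that uses only the standard library. It swaps in a regular expression for `nltk.word_tokenize`, a tiny hand-written stopword set for NLTK's lists, and `collections.Counter` for `FreqDist` (which is itself built on `Counter`). The sample tweets are invented.

```python
import re
from collections import Counter

# Invented example tweets
sample_tweets = [
    "The rocket launch is today",
    "A rocket and a car",
    "Rocket news: https://example.com",
]

# Tiny hand-written stopword list, far smaller than NLTK's
stop_words = {"the", "is", "a", "and", "https"}

tokens = []
for text in sample_tweets:
    # Lowercase first, then keep runs of letters as crude "words"
    tokens.extend(re.findall(r"[a-z]+", text.lower()))

clean_tokens = [t for t in tokens if t not in stop_words]
counts = Counter(clean_tokens)
print(counts.most_common(3))
```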

# What is NLP, Machine Learning etc


In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

tweets = [t['text'].replace("://", "//") for t in json_response.get('data')]

def is_positive(tweet: str) -> tuple:
    """Return (is_positive, compound_score); is_positive is True when the compound score is above 0."""
    compound = sia.polarity_scores(tweet)["compound"]
    return compound > 0, compound

compound_sentiment = {'tweet': [], 'pos': [], 'comp_score': []}

for t in tweets:
    compound_sentiment['tweet'].append(t)
    is_pos, score = is_positive(t)
    compound_sentiment['pos'].append(is_pos)
    compound_sentiment['comp_score'].append(score)

compound_sentiment_df = pd.DataFrame(compound_sentiment)
compound_sentiment_df.head(5)
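
The compound score runs from -1 (most negative) to 1 (most positive), and scores very close to 0 are often better read as neutral than as positive or negative. The sketch below assumes a cut-off of 0.05 on either side of zero, a commonly used convention for VADER scores; `label_sentiment` and its `threshold` parameter are hypothetical names for this example.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Turn a compound score into a coarse label (threshold is an assumed convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up scores to show each branch
for score in (0.62, -0.4, 0.01):
    print(score, label_sentiment(score))
```

This could be applied to the `comp_score` column if we wanted a third "neutral" category alongside the `pos` flag.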



In [None]:
mean_sentiment = compound_sentiment_df.mean(numeric_only=True)['comp_score']

tw_options = [(compound_sentiment_df.loc[i, 'tweet'], compound_sentiment_df.loc[i, 'comp_score']) for i, r in compound_sentiment_df.iterrows()]
tw_options.append(('Average Score', mean_sentiment))
tw_select = widgets.Dropdown(options=tw_options,
                             value = mean_sentiment,
                             description='Tweet:')
caption = widgets.Label(value='initial pos')


In [None]:
from IPython.display import clear_output

def create_gauge():
    fig_gc = go.Figure(go.Indicator(
        mode= "gauge+number",
        value = tw_select.value,
        gauge = {'axis': {'range': [-1, 1]},
                 'bar': {'color':'darkslategray'},
                 'steps': [{'range': [-1, 0], 'color': 'lightcoral'}, {'range': [0, 1], 'color': 'lightgreen'}]},
        domain = {'x': [0,1], 'y': [0,1]},
        title = "Average sentiment"))
    # make space for explanation / annotation
    fig_gc.update_layout(margin=dict(l=20, r=20, t=20, b=60),paper_bgcolor="white")

    # add annotation
    fig_gc.add_annotation(dict(font=dict(color='darkslategray',size=15),
                               x=0,
                               y=-0.12,
                               showarrow=False,
                               text=tw_select.label,
                               textangle=0,
                               xanchor='left',
                               xref="paper",
                               yref="paper"))
    return fig_gc

gauge_chart = create_gauge()

def handle_change(change):
    gauge_chart.update_traces(go.Indicator(
        mode= "gauge+number",
        value = tw_select.value,
        gauge = {'axis': {'range': [-1, 1]},
                 'bar': {'color':'darkslategray'},
                 'steps': [{'range': [-1, 0], 'color': 'lightcoral'}, {'range': [0, 1], 'color': 'lightgreen'}]},
        domain = {'x': [0,1], 'y': [0,1]},
        title = "Average sentiment"))

    gauge_chart.update_annotations(dict(font=dict(color='darkslategray',size=15),
                               x=0,
                               y=-0.12,
                               showarrow=False,
                               text=tw_select.label,
                               textangle=0,
                               xanchor='left',
                               xref="paper",
                               yref="paper"))
    clear_output()
    display(tw_select)
    gauge_chart.show()

tw_select.observe(handle_change, names='value') # only redraw when the selected value changes
display(tw_select)
gauge_chart.show()

# Take Two!

What do you want to analyze?

In [None]:
display(query_widget)

In [None]:
count_chart, json_response = get_tweets()
count_chart.show()

In [None]:
# @Annabelle - create a function here to re-run the nltk stuff? recreate tw_select and then:
display(tw_select)
gauge_chart.show()