# Vibe Check
### Using Twitter’s API and Sentiment Analysis to Understand What’s the What on the Internet

Today's session will cover:
1.   Setting up Access to the Twitter API and Getting API Access Keys
2.   Getting Data from Twitter using the Twitter API
3.   Basic Data Operations and Data Cleaning
4.   Sentiment Analysis With Python using NLTK

<!--- TODO: Slides with: Intros - who we are, what does FN do overview, session goals/ what is an API + QR code with link to public google colab + session end slides - career guidance? (One of our learning objectives was "What jobs or internships can you search for to use the skills covered in this workshop?" -->


# Data Retrieval

## APIs
### What are APIs?

An API (Application Programming Interface) is the most common way to access data programmatically. An API's documentation tells clients what data is available and how to “ask” the API for it.

If you've ever seen tweets embedded on a webpage, those were pulled in via an API!

First, we're going to set up some libraries, and our API authentication information:

Our `bearer_token` works like a secret password that belongs only to us, so Twitter knows exactly who is asking it for data. It is stored in a file, so we're going to read it in:


In [1]:
with open("../utils/bearer_token.txt", "r") as token_file:
    bearer_token = token_file.read().strip()  # drop any trailing newline so the header stays valid

Next, we'll set up what we need to make the actual request to the Twitter API:
1. The information telling Twitter exactly who we are:
    - **`bearer_token`**: the secret password that authenticates us.
    - **`User-Agent`**: a name for the project we're working on.
    - Twitter tracks this information to see who is using its API and to make sure that nobody is abusing it. Pretty much every API will require you to identify yourself in some way before you can get data back.


2. The URL we're going to request. In this case: `https://api.twitter.com/2/tweets/search/recent`
    - **`api.twitter.com`**: tells Twitter we're trying to hit the API, as opposed to the main feed/user interface.
    - **`2`**: shows that we're hitting version 2 of the API. If we put `1.1` instead, we would hit the older version 1.1, which requires slightly different request syntax and returns data formatted differently.
    - **`tweets`**: indicates which data type we want to request. We could also input `users`, `spaces`, or `lists` to get different datatypes back.
    - **`search`**: says we want to search over tweets. We could also put `counts` to get the number of tweets, or we could look up tweets directly by their IDs. `search` allows us to give Twitter a query - a set of terms we want to include or exclude - and we'll get back tweets that match our query terms.
    - **`recent`**: Twitter lets you search either only Tweets from the last week (`recent`) or `all` Tweets, depending on your level of access. We'll stick to `recent`, because we're interested in what's happening on Twitter right now.
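
To make the URL anatomy above concrete, here is a small sketch that splits the endpoint into the pieces just described, using Python's standard `urllib.parse`. The variable names (`endpoint`, `segments`, and so on) are ours, chosen only for illustration.

```python
from urllib.parse import urlsplit

endpoint = "https://api.twitter.com/2/tweets/search/recent"

parts = urlsplit(endpoint)
host = parts.netloc                          # the API host
segments = parts.path.strip("/").split("/")  # the path, one piece per "/"

# Unpack the path pieces in the order described in the list above
version, data_type, action, scope = segments

print("host:     ", host)
print("version:  ", version)
print("data type:", data_type)
print("action:   ", action)
print("scope:    ", scope)
```

Swapping any one piece (for example `recent` for `all`) points the request at a different endpoint, which is why the documentation lists them separately.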


In [2]:
import requests
import json

headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent": "stem-for-her-demo"
}

search_url = "https://api.twitter.com/2/tweets/search/recent?"

## Building a Query

See: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query

<!--- Audience Participation here - ask for hashtags/ keyword search ideas - maybe pull up twitter trends on a screen? Live edit notebook to change search keywords-->

Some ideas: Harry Styles, Elon Musk, the Taylor Swift tour, or anything else that is trending.

### Optional Fields
`tweet.fields` lets us ask for specific fields on each tweet. Here we add `created_at`.

### Query String

`-is:retweet` *excludes* any retweets.



In [27]:
query_string = '#harrystyles ' # tweets with the #HarryStyles hashtag
query_string += '"watermelon sugar" ' # tweets that have "watermelon sugar" somewhere in their text
query_string += '-is:retweet ' # exclude retweets
print(query_string)

#harrystyles "watermelon sugar" -is:retweet 


See all the different operator types you can add to your search here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#operators
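
Because a query is just terms and operators separated by spaces, it can help to assemble it from a list. The sketch below is one possible way to do that. `build_query` is a hypothetical helper of ours, not part of any Twitter library, and `lang:en` is used as an example of an additional operator from the documentation linked above.

```python
def build_query(hashtags=(), phrases=(), operators=()):
    """Join hashtags, exact phrases, and raw operators into one query string."""
    terms = [f"#{tag.lstrip('#')}" for tag in hashtags]
    terms += [f'"{phrase}"' for phrase in phrases]  # quotes ask for an exact phrase
    terms += list(operators)                         # e.g. "-is:retweet", "lang:en"
    return " ".join(terms)                           # a space between terms means AND


example_query = build_query(
    hashtags=["harrystyles"],
    phrases=["watermelon sugar"],
    operators=["-is:retweet", "lang:en"],
)
print(example_query)
```

Keeping the pieces in lists makes it easy to swap in a different hashtag or add another operator during the live demo.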

## GET vs POST requests

When using APIs, there are multiple ways you can engage with them. The API Documentation will tell you what you're able to do, but one important thing to know about is what _type_ of requests you can make.

`GET` requests are exactly what they sound like - you use them to _GET_ data back from the API. 
`POST` requests are a little more complicated, but generally they are used to _create_ data via the API. Any Twitter Bot you see is going to be using POST requests to create Tweets. See: https://twitter.com/MagicRealismBot

One difference that matters for us is how you pass in the data or search terms.
For `GET` requests, you pass in your query via the URL.
- When you Google something, like "what's the top news for today", Google _encodes_ your query in the URL: https://www.google.com/search?q=what%27s+the+top+news+for+today
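
The difference in where the data travels can be sketched without sending anything over the network. The example below uses Python's standard `urllib.request.Request` purely to build and inspect the two kinds of request. The tweet text is made up, and `https://api.twitter.com/2/tweets` is assumed here as the tweet-creation endpoint for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# GET: the search terms travel in the URL itself, after the "?"
get_request = Request(
    "https://api.twitter.com/2/tweets/search/recent?" + urlencode({"query": "#harrystyles"}),
    method="GET",
)

# POST: the data travels in the request body, not in the URL.
# The payload is a made-up example for illustration.
post_request = Request(
    "https://api.twitter.com/2/tweets",
    data=json.dumps({"text": "Hello from the workshop!"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Nothing is sent here; we only look at what each request would contain
for req in (get_request, post_request):
    print(req.get_method(), req.full_url, "| body:", req.data)
```

The `GET` request has no body at all, while the `POST` request keeps its URL short and carries the new tweet as JSON.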

(BTW, every time you Google something, you're just using the Google Search API, except they're putting a user interface on top of it for you)
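
The encoded part of the Google URL above can be reproduced with Python's standard library. This is a minimal sketch: `quote_plus` writes spaces as `+` (the style seen in the Google URL), while `quote` writes them as `%20`. Both are valid ways to encode a space in a query string.

```python
from urllib.parse import quote, quote_plus, unquote_plus

search_text = "what's the top news for today"

plus_style = quote_plus(search_text)   # spaces become "+"
percent_style = quote(search_text)     # spaces become "%20"

print(plus_style)      # what%27s+the+top+news+for+today
print(percent_style)   # what%27s%20the%20top%20news%20for%20today

# Decoding gives back the original text
print(unquote_plus(plus_style))
```

In both styles the apostrophe becomes `%27`, because it is not one of the characters that can appear unescaped.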

So we need to URL-encode our query too - that just means translating it into something that the API can read. Luckily, there are tools that will do this for us.

In [33]:
import urllib.parse  # imported before first use; provides quote()

query_params = {'query': urllib.parse.quote(query_string),  # URL-encode the search terms
                'tweet.fields': 'created_at,id,lang,source,text', # what data we want to return
                'expansions': 'author_id'     # will include the profile ID of the author
               }

In [34]:
print("original query string: ", query_string)

# Join the parameters into one "key=value&key=value" string for the URL
query_string = "&".join([f"{key}={value}" for key, value in query_params.items()])
print("encoded query string: ", query_string)

original query string:  #harrystyles "watermelon sugar" -is:retweet 
encoded query string:  query=%23harrystyles%20%22watermelon%20sugar%22%20-is%3Aretweet%20&tweet.fields=created_at,id,lang,source,text&expansions=author_id


In [38]:
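# The URL assembled from the pieces above
full_url = search_url + query_string

# A hand-written version of the same search that also sets max_results and asks for
# a few more tweet fields. This is the URL we actually send.
url =  "https://api.twitter.com/2/tweets/search/recent?query=%23harrystyles%20%22watermelon%20sugar%22%20-is%3Aretweet&max_results=10&tweet.fields=author_id,created_at,id,lang,referenced_tweets,source,text&expansions=author_id"


response = requests.get(url=url,
                       headers=headers)
print(response.status_code)  # 200 means the request succeeded
print(json.dumps(response.json(), indent=2))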


200
{
  "data": [
    {
      "text": "'Watermelon Sugar' has 1,998,457,525, just missing 1.542.475 to reach 2 BILLION \n\n#HarryStyles \n\nI'm Voting for Harry Styles for Artist Of The Year at #AMAs",
      "id": "1590958033974489090",
      "source": "Twitter for Android",
      "created_at": "2022-11-11T06:42:08.000Z",
      "lang": "en",
      "edit_history_tweet_ids": [
        "1590958033974489090"
      ],
      "author_id": "1565807431057276929"
    },
    {
      "text": "'Watermelon Sugar' by @Harry_Styles has surpassed 2 BILLION streams on @Spotify.\n\n#HarryStyles",
      "id": "1590747844205752320",
      "source": "Twitter for Android",
      "created_at": "2022-11-10T16:46:55.000Z",
      "lang": "en",
      "edit_history_tweet_ids": [
        "1590747844205752320"
      ],
      "author_id": "1434619486930354180"
    },
    {
      "text": "VENDO INGRESSOS SHOW HARRYSTYLES \ud83c\udfab\n\ud83d\udccdlocal: S\u00e3o Paulo\n\ud83d\uddd306/12 \n\ud83c\udf86 pista, Watermelo

# Getting Started with NLTK

<!--- TODO: Add more here - what is NLP? Short explanation of word tokenization
Ref: https://realpython.com/python-nltk-sentiment-analysis/#using-nltks-pre-trained-sentiment-analyzer
-->

In [None]:
%pip install nltk
import nltk

nltk.download(["names", "stopwords", "averaged_perceptron_tagger", "vader_lexicon","punkt"])


In [None]:
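json_response = response.json()  # parsed body of the search response from earlier

words = []
for item in json_response.get('data', []):
    words.extend(nltk.word_tokenize(item.get('text', '')))

# Build one lowercase set of words to drop: English stopwords plus first names
unwanted = set(nltk.corpus.stopwords.words("english"))
unwanted.update(w.lower() for w in nltk.corpus.names.words())

# Keep alphabetic tokens only, comparing in lowercase so "The" is filtered like "the"
words_clean = [w for w in words if w.isalpha() and w.lower() not in unwanted]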

In [None]:
fd = nltk.FreqDist(words_clean)
print(fd.most_common(5))
fd.tabulate(5)  # tabulate() prints the table itself and returns None

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

tweets = [t['text'].replace("://", "//") for t in json_response.get('data')]

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

for t in tweets[:20]:
    print(">", is_positive(t), t)
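
`is_positive` above lumps neutral and negative tweets together as `False`. One possible refinement is a three-way label based on the same compound score. The sketch below is an assumption on our part rather than part of the workshop code: `label_sentiment` is a hypothetical helper, the cutoff of 0.05 is a commonly used convention for VADER compound scores that you may want to tune, and the sample scores are made up.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score (between -1 and 1) to a three-way label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, just to show the three outcomes
for score in (0.6249, 0.0, -0.4404):
    print(score, "->", label_sentiment(score))
```

In the notebook, this could be applied to each tweet with `label_sentiment(sia.polarity_scores(t)["compound"])`.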