## Using APIs in Projects


When getting data from APIs, I strongly suggest following a three-step workflow:

1. Write a program that gets data from an API and saves all of it (if possible) to a file. Get anything you could possibly want: this is your raw data.
2. Write a second program (usually a second file) that loads the raw data saved in step 1, extracts the data that will be useful for analysis, and saves it in a flat file (typically a CSV).
3. Write a third program that loads the CSV file and does the analysis.
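
The three steps above can be sketched as three small functions that communicate only through files. This is a minimal illustration, not the code used later in this lesson: `fetch_records` is a hypothetical stand-in for a real API call, and the file names and the `likes` field are made up for the example.

```python
import csv
import json
import os
import tempfile


def fetch_records():
    """Hypothetical stand-in for an API call; returns raw records."""
    return [
        {"id": 1, "text": "hello world", "likes": 3, "extra": {"lang": "en"}},
        {"id": 2, "text": "hi", "likes": 0, "extra": {"lang": "fr"}},
    ]


def retrieve(raw_path):
    """Step 1: save everything the API returned, untouched."""
    with open(raw_path, "w", encoding="utf-8") as f:
        json.dump(fetch_records(), f)


def clean(raw_path, csv_path):
    """Step 2: load the raw file, keep the useful fields, write a flat CSV."""
    with open(raw_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "likes", "text_length"])
        for record in records:
            writer.writerow([record["id"], record["likes"], len(record["text"])])


def analyze(csv_path):
    """Step 3: load only the CSV and compute the mean number of likes."""
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))
    return sum(int(row["likes"]) for row in rows) / len(rows)


# Run the three steps in a temporary directory
workdir = tempfile.mkdtemp()
raw_file = os.path.join(workdir, "raw.json")
csv_file = os.path.join(workdir, "cleaned.csv")
retrieve(raw_file)
clean(raw_file, csv_file)
mean_likes = analyze(csv_file)
```

Because each step only reads the file written by the previous one, any step can be re-run on its own without calling the API again.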

This approach has a few important benefits.

The first and most important is that it is often difficult to get the same raw data again. If you are using Twitter, the Search API only returns tweets from roughly the last week. If you are doing analysis a month down the road and decide that you really wish you had saved metadata about the number of retweets, it is too late. By saving the raw data, you can change your measures or analysis strategy and still have access to the data.

The second is that this gives you a nice pipeline with intermediate files. Instead of loading the entire raw data file in the code that does the analysis, you only have to load the CSV, which is often much smaller and easier to work with.

This brief lesson will show an example of this workflow, using `tweepy`.

Note that I'm going to put everything in one file for convenience, but my typical workflow is to put these in separate files and then run each file separately.

## Program 1 - Data Retrieval

The goal of our project is to produce a histogram of the number of retweets for recent tweets about President Trump. The first program gets tweets about President Trump.

In [4]:
import tweepy
import json
from twitter_authentication import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

In [5]:
# Make a list to store the results
results = []
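# Note: in tweepy 4.x, api.search was renamed to api.search_tweets
for tweet in tweepy.Cursor(api.search,
                           q='Trump -filter:retweets', # only get the original tweets
                           tweet_mode='extended', # return the full, untruncated text
                           count=100).items(5000): # count is tweets per request (the standard search API caps it at 100); raise 5000 as high as you like, if you have time :)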
    results.append(tweet._json)
    print(tweet.user.screen_name + "\t" + str(tweet.created_at) + "\t" + tweet.full_text)

ficer was attacked in the throat multiple times with a taser...
Yes.. blood thirsty scumbags..
They set up a noose for a hanging.
Yes .. blood thirsty.
ImprisonDJT	2021-05-28 22:41:14	@LLCWalk @gmangeegee @seanhannity Proof there was a coordinated attack on capitol by traitor Trump and his accomplices.https://t.co/cM2QmfZN66
veropat7	2021-05-28 22:41:13	Jan. 6th was a peaceful protest, infiltrated by Antifa&amp; BLM, supported by Pelosi to charge the Republicans &amp; Trump. Look at the video before, Trump NEVER said to violently protest! https://t.co/HPuPIg4k4Y
dd_chip	2021-05-28 22:41:13	@Mytwocents_801 @KarenMHJ @CortesSteve You really want it both ways. "No one was trying to overturn the election on Jan 6th due to them believing that the election was stolen from them due to massive fraud because Trump said so". Also "there was massive fraud cuz Trump said so".
UnluckyLeif	2021-05-28 22:41:12	@EpochTimes @AlanDersh @netflix That’s Republican’s favourite pedophile lawyer, Mr Alan Der

In [6]:
# Then, write the results to a file
with open('raw_trump_tweets.json', 'w') as f:
    json.dump(results, f)
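
Writing the file only after the loop finishes means that an interruption partway through a long retrieval loses everything collected so far. One alternative (not what this notebook does) is to append each record to a JSON Lines file as it arrives. The sketch below assumes made-up records in place of `tweet._json`; the helpers `append_jsonl` and `read_jsonl` and the file name are hypothetical.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one record to the file as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Load every record back into a list, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records standing in for tweet._json
sample = [
    {"id": 1, "full_text": "first"},
    {"id": 2, "full_text": "second\nline"},
]
jsonl_path = os.path.join(tempfile.mkdtemp(), "raw_tweets.jsonl")
for record in sample:
    append_jsonl(jsonl_path, record)

loaded = read_jsonl(jsonl_path)
```

Newlines inside the tweet text are escaped by `json.dumps`, so each record still occupies exactly one line of the file.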

## Program 2 - Data Cleaning

This program loads the saved raw data, grabs what we want, and converts it into a CSV.

I decided to save the timestamp, text, and retweet and favorite counts.

This is also where you would typically do more complicated measure creation. Here I show how to create a measure of tweet length, `tweet_length`.

In [7]:
with open('raw_trump_tweets.json', 'r') as f:
    tweets = json.load(f) # a list of raw tweet dictionaries, as saved by Program 1

In [None]:
import csv

# newline='' lets the csv module handle line endings itself
with open('cleaned_data.csv', 'w',
          encoding='UTF-8',
          newline='') as f:
    writer = csv.writer(f)
    # Header row
    writer.writerow(['created_at',
                     'tweet_text',
                     'retweets',
                     'favorites',
                     'tweet_length'
                    ])
    # One row per tweet
    for tweet in tweets:
        writer.writerow([tweet['created_at'],
                         tweet['full_text'],
                         tweet['retweet_count'],
                         tweet['favorite_count'],
                         len(tweet['full_text']) # the new tweet_length measure
                        ])
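
Measure creation of this kind can also be pulled into a small function, which makes it easy to add measures later and re-run only this program against the saved raw file. The sketch below is illustrative: `extract_row` is a hypothetical helper, the example tweet is made up, and the `entities` / `hashtags` layout is an assumption about the raw tweet JSON that you should check against your own file.

```python
def extract_row(tweet):
    """Turn one raw tweet dictionary into a flat list for the CSV."""
    text = tweet["full_text"]
    # Assumed layout: tweet["entities"]["hashtags"] is a list of hashtag
    # dictionaries. Fall back to an empty list if either key is missing.
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return [
        tweet["created_at"],
        text,
        tweet["retweet_count"],
        tweet["favorite_count"],
        len(text),       # tweet_length
        len(hashtags),   # a second, hypothetical measure: hashtag count
    ]


# Made-up tweet with only the fields used above
example_tweet = {
    "created_at": "Fri May 28 22:41:14 +0000 2021",
    "full_text": "Example text #python",
    "retweet_count": 2,
    "favorite_count": 5,
    "entities": {"hashtags": [{"text": "python"}]},
}
row = extract_row(example_tweet)
```

Inside the loop above, each row could then be written with `extract_row(tweet)`, provided the header row gets a matching extra column.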

## Program 3 - Data Analysis

Here we use pandas to load the data and analyze it. This could include statistical tests. Here, I'm just visualizing the distribution of retweets and the relationship between retweets and length.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv('./cleaned_data.csv')

In [3]:
df

Unnamed: 0,created_at,tweet_text,retweets,favorites,tweet_length


In [None]:
# Just make sure it looks OK.
df.sort_values('retweets')

In [None]:
# distplot is deprecated in recent versions of seaborn; histplot is its replacement
sns.histplot(df.retweets)

As expected, it's super skewed, with most tweets never getting retweeted while a few get tons of retweets.

Let's see if it changes if we get rid of the tweets that never got retweeted (maybe we have a principled reason to believe they are different from other tweets).

In [None]:
# distplot is deprecated in recent versions of seaborn; histplot is its replacement
sns.histplot(df.loc[df.retweets > 0, 'retweets'])

As I thought, this is a somewhat "scale-free" distribution, meaning wherever you zoom in, you see the same pattern. Try changing the `0` up above to any (small) number.
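
One rough way to probe the "scale-free" pattern beyond eyeballing histograms is the empirical complementary CDF: the share of tweets with at least k retweets. For a power-law-like distribution, these points fall along an approximately straight line on log-log axes. The sketch below uses a hypothetical helper, `ccdf`, and made-up counts; it is a quick diagnostic, not a formal test.

```python
from collections import Counter


def ccdf(values):
    """Return (k, share of values >= k) for each distinct k, in ascending order."""
    counts = Counter(values)
    total = len(values)
    remaining = total  # number of values >= the current k
    points = []
    for k in sorted(counts):
        points.append((k, remaining / total))
        remaining -= counts[k]
    return points


retweet_counts = [1, 1, 1, 1, 2, 2, 3, 8]  # made-up data
points = ccdf(retweet_counts)
```

With the data from this lesson, the input could be something like `df.loc[df.retweets > 0, 'retweets'].tolist()`, with the resulting points plotted on log-scaled axes.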

For fun, let's also look at the relationship between retweets and tweet length.

In [None]:
import numpy as np

In [None]:
sns.jointplot(y='retweets', x='tweet_length', data=df);

In [None]:
# Because retweets are so skewed, let's log them.
# We add 1 first so that tweets with zero retweets don't produce log(0).
p = sns.jointplot(y=np.log(df.retweets + 1), x='tweet_length', data=df)
p.set_axis_labels('Tweet Length', 'Retweets (log)');
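
The `+ 1` in the log transform matters because the logarithm of zero is undefined. The small sketch below shows this with the standard library's `math` module and made-up counts; NumPy's `np.log1p(x)` computes the same `log(1 + x)` for whole arrays.

```python
import math

counts = [0, 9, 99]  # made-up retweet counts

# log(1 + x) is defined for x = 0 and maps it to 0
logged = [math.log1p(c) for c in counts]

# Without the +1, a zero count cannot be logged at all
try:
    math.log(0)
    zero_is_loggable = True
except ValueError:
    zero_is_loggable = False
```

Here `logged` starts at `0.0` for the zero-retweet case, and the later values are the natural logs of 10 and 100.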