Data Acquisition
---

This notebook shows the functions used to collect and save data using the Twitter API v2. The [original notebook](./data_acquisition/2_data_acquisition.ipynb) that gathered the data is in the data acquisition folder. File paths have been altered for this notebook. To run it, the reader will need to provide their own Twitter Developer bearer token.

Data are saved in the [raw_data](./raw_data/) directory.

### Import libraries

In [9]:
import pandas as pd
import numpy as np

# searching and interpreting
import requests
from IPython.display import clear_output
from time import sleep
import json

---
## Define functions for search

### format_twitter_search_parameters

The search request requires a search query as well as a list of parameters naming the fields to return in the response. There are many choices for these parameters, so anyone planning to perform a search should [familiarize themselves with this Twitter API v2 page](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction).

This function accepts a series of parameters and generates a string of the appropriate format for the search.

In [7]:
# NOTE: default_tweet_fields and default_user_fields (lists of field names) must already be defined
# when this cell runs; their definitions are not shown in this notebook.
def format_twitter_search_parameters(tweet_fields = default_tweet_fields,
                                     user_fields = default_user_fields,
                                     max_results = 10,
                                     next_token = '',
                                     end_time = '',
                                     until_id = 0,
                                     since_id = 0
                                    ):
    # optionally search up to an end time
    if end_time != '':
        end_time_bit = '&end_time=' + end_time
    else:
        end_time_bit = end_time
    # optionally continue a previous search using its next_token value (from 'meta')
    if next_token != '':
        next_token_bit = '&next_token=' + next_token
    else:
        next_token_bit = next_token
    # optionally search only before a certain tweet id
    if until_id == 0:
        until_id_bit = ''
    else:
        until_id_bit = '&until_id=' + str(until_id)
    # optionally search only since a certain tweet id
    if since_id == 0:
        since_id_bit = ''
    else:
        since_id_bit = '&since_id=' + str(since_id)
    # assemble the string
    parameter_string = ('tweet.fields=' + ','.join(tweet_fields)
                        + '&user.fields=' + ','.join(user_fields)
                        + '&max_results=' + str(max_results)
                        + until_id_bit + since_id_bit + next_token_bit + end_time_bit)
    return parameter_string
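
As a rough illustration of the kind of string this function produces, the sketch below rebuilds a similar parameter string with `urllib.parse.urlencode`. It is not the function above: `build_parameter_string` and the two short field lists are hypothetical stand-ins chosen for the example, since the notebook's `default_tweet_fields` and `default_user_fields` are not shown here.

```python
from urllib.parse import urlencode

# Hypothetical field lists, for illustration only.
example_tweet_fields = ['id', 'text', 'created_at']
example_user_fields = ['id', 'username']

def build_parameter_string(tweet_fields, user_fields, max_results=10, **optional):
    """Assemble a query-parameter string, skipping optional values that are unset."""
    params = {
        'tweet.fields': ','.join(tweet_fields),
        'user.fields': ','.join(user_fields),
        'max_results': max_results,
    }
    # Keep only optional parameters that were actually supplied (not '' or 0).
    params.update({name: value for name, value in optional.items() if value})
    # safe=',:' leaves commas and colons readable instead of percent-encoding them.
    return urlencode(params, safe=',:')

example_string = build_parameter_string(example_tweet_fields, example_user_fields,
                                        max_results=100, since_id=1521924000000000000)
print(example_string)
```

Building the string from a dictionary means unset options simply never appear, which avoids the chain of `if`/`else` branches at the cost of a slightly less explicit function signature.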



### replace_next_token

A single Twitter search request returns at most 100 tweets. When more results are available, the response includes a token that can be sent with the following request to read the next page. This function rewrites a Twitter search parameter string to include the token for the following page.

In [1]:
# takes a twitter search parameter string, removes any next_token already on it, and adds the new one.
# anything that follows the old token in the string is dropped, so the token must be the last parameter.
# returns the new string
def replace_next_token(twitter_search_parameters, next_token):
    next_token_label = '&next_token='
    split_parameters = twitter_search_parameters.split(next_token_label)
    new_parameter_string = split_parameters[0] + next_token_label + str(next_token)
    return new_parameter_string
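
The function above assumes that the token is the last parameter in the string. As a hedged alternative, the sketch below swaps in a different technique: it parses the string with `urllib.parse`, drops any existing token, and re-encodes the remaining parameters, so parameters that follow the token are preserved. The name `set_next_token` and the sample values are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode

def set_next_token(parameter_string, next_token):
    """Return parameter_string with next_token set, keeping every other parameter."""
    # Parse into (name, value) pairs and drop any token that is already present.
    pairs = [(name, value) for name, value in parse_qsl(parameter_string)
             if name != 'next_token']
    pairs.append(('next_token', str(next_token)))
    # safe=',:' keeps comma-separated field lists and timestamps readable.
    return urlencode(pairs, safe=',:')

params = 'tweet.fields=id,text&max_results=100&next_token=abc&end_time=2022-05-05T00:00:00Z'
updated = set_next_token(params, 'xyz')
print(updated)
```

In this version the new token always ends up as the last parameter, and `end_time` survives the replacement.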

### search_twitter

This function sends a query to the Twitter API v2 recent search endpoint and returns the parsed JSON response. The query is interpreted much as it would be in the search bar on the main Twitter page.

In [2]:
bearer_token_file_path = '../APIKEYS/twitter_bearer_token.txt'
with open(bearer_token_file_path, 'r') as bearer_token_file:
    # strip surrounding whitespace (such as a trailing newline) so it does not end up in the header
    my_bearer_token = bearer_token_file.read().strip()

# it's bad practice to place your bearer token directly into the script, so it is read from a file above
BEARER_TOKEN = my_bearer_token
# define search twitter function
# note: tweet_fields is the full, pre-formatted parameter string, not just the tweet fields
def search_twitter(query, tweet_fields, bearer_token = BEARER_TOKEN):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}

    url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}".format(
        query, tweet_fields
    )
    response = requests.request("GET", url, headers=headers)

    # show the HTTP status of each request as the search runs
    print(response.status_code)

    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
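
The query is placed into the URL as typed, so a query containing spaces or emoji relies on being percent-encoded before it is sent. The sketch below makes that encoding explicit using only the standard library. `build_search_url` and `SEARCH_ENDPOINT` are hypothetical names introduced for the example, and no request is actually sent.

```python
from urllib.parse import quote

SEARCH_ENDPOINT = 'https://api.twitter.com/2/tweets/search/recent'

def build_search_url(query, parameter_string):
    """Percent-encode the query and attach the already-formatted parameter string."""
    # safe='' encodes every reserved character, including spaces and non-ASCII text.
    return f"{SEARCH_ENDPOINT}?query={quote(query, safe='')}&{parameter_string}"

url = build_search_url('🟩 Wordle', 'max_results=10')
print(url)
```

Encoding the query explicitly keeps characters such as `&` or `#` inside a query from being mistaken for URL syntax.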

### print_progress_bar

For visualizing a search. Used by `repeat_twitter_search`.

In [4]:
def print_progress_bar(current, total, bar_length = 100):
    clear_output()
    # compute the filled part once so the bar is always exactly bar_length characters wide
    filled = int(np.ceil(bar_length*current/total))
    print('[' + '='*filled + '.'*(bar_length - filled) + ']')
    print(f'{100*current/total:.1f}%'+ ' '*10 + f'{current}/{total}')

### repeat_twitter_search

This function repeatedly searches Twitter, reading one page of results per request. It returns the collected tweets, the collected users, and the next token for reading further pages of results.

In [5]:
# takes in a query and search parameters, formats them and performs a series of searches. returns the
# aggregated tweets and users dataframes and a token for the next page if you desire to continue searching
def repeat_twitter_search(search_query, tweet_fields, user_fields, search_size = 100, until_id = 0, since_id = 0,
                          end_time = '', num_searches = 10, first_next_token = ''):
    # do a "priming search", preparing the dataframes and filling them with first values.
    # the expansions bit is necessary for grabbing the associated user data
    search_parameters = format_twitter_search_parameters(tweet_fields = tweet_fields, user_fields = user_fields,
                                                         max_results = search_size, until_id = until_id,
                                                         since_id = since_id, end_time = end_time) + '&expansions=author_id'
    # this can start with a next_token already defined.
    if first_next_token != '':
        search_parameters = search_parameters + '&next_token=' + first_next_token
    search_result = search_twitter(search_query, search_parameters)
    tweets_df = pd.DataFrame(search_result['data'])
    tweets_df.set_index('id', inplace = True)
    users_df = pd.DataFrame(search_result['includes']['users'])
    # index users by id too, so that they line up with the users from later pages
    users_df.set_index('id', inplace = True)
    # store the next token for reading the second page of results. the last page of results
    # has no next_token, so fall back to an empty string.
    next_token = search_result['meta'].get('next_token', '')
    # loop through the following searches, concatenating their results to the dataframes.
    for search_number in range(num_searches-1):
        # stop early if there are no further pages
        if next_token == '':
            break
        print_progress_bar(search_number, num_searches-1)
        # edit the parameters to get the next page. replace_next_token appends the token if none is present.
        search_parameters = replace_next_token(search_parameters, next_token)
        search_result = search_twitter(search_query, search_parameters)
        new_tweets_df = pd.DataFrame(search_result['data'])
        new_tweets_df.set_index('id', inplace = True)
        tweets_df = pd.concat([tweets_df, new_tweets_df])
        new_users_df = pd.DataFrame(search_result['includes']['users'])
        new_users_df.set_index('id', inplace = True)
        users_df = pd.concat([users_df, new_users_df])
        next_token = search_result['meta'].get('next_token', '')
    return tweets_df, users_df, next_token
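
The pagination pattern used here can be seen in isolation in the minimal sketch below, which follows `next_token` values across canned responses instead of calling the API. `FAKE_PAGES`, `fake_search`, and `collect_pages` are hypothetical names, and the page contents are made up for the example. It assumes that the final page carries no `next_token` in its `meta`.

```python
# Three fake pages standing in for API responses; the last one has no next_token.
FAKE_PAGES = {
    '': {'data': [{'id': '1'}, {'id': '2'}], 'meta': {'next_token': 'page2'}},
    'page2': {'data': [{'id': '3'}], 'meta': {'next_token': 'page3'}},
    'page3': {'data': [{'id': '4'}], 'meta': {}},
}

def fake_search(next_token=''):
    """Hypothetical stand-in for search_twitter that serves canned pages."""
    return FAKE_PAGES[next_token]

def collect_pages(max_pages, first_next_token=''):
    """Follow next_token links for up to max_pages requests."""
    collected = []
    next_token = first_next_token
    for _ in range(max_pages):
        result = fake_search(next_token)
        collected.extend(result['data'])
        # A missing token means there is nothing further to read.
        next_token = result['meta'].get('next_token', '')
        if next_token == '':
            break
    return collected, next_token

tweets, token = collect_pages(max_pages=10)
print(len(tweets), repr(token))
```

Returning the token alongside the results lets a later call resume where this one stopped, by passing it as `first_next_token`.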

## Gather Data

These cells perform successive searches of Twitter for a set search query over many hours. The gathered data are saved in a series of CSV files in `./raw_data/`.

In [10]:
#define path to save data
raw_data_path = './raw_data/'

#define the first suffix for the saved csvs. All data in this repo has a number <200, so this will not interfere.
first_csv_number = 300

In [51]:
%%time
# gather and save data for a few hours.

#set the search query
the_search_query = '🟩 Wordle'

# set this to search since the tweet with this id
this_since_id = 1521924000000000000

# set this to search until the tweet with this id
this_until_id = 1521994876677672966

#set length and rate of searches
num_hours = 6
searches_per_hour = 4

#perform an initial search without a subsequent page token.
word_search_tweets, word_search_users, word_search_token = repeat_twitter_search(
    search_query = the_search_query,
    tweet_fields = default_tweet_fields,
    user_fields = default_user_fields,
    search_size = 100,
    until_id = this_until_id,
    since_id = this_since_id,
    num_searches = 20,
    first_next_token = '')
word_search_tweets.to_csv(raw_data_path + f'tweets_{first_csv_number}.csv')
word_search_users.to_csv(raw_data_path + f'users_{first_csv_number}.csv')

#perform successive searches
for number_search_repeats in range(1, num_hours*searches_per_hour):
    print(f'Waiting for search repeat {number_search_repeats}/{num_hours*searches_per_hour}')
    sleep((60/searches_per_hour)*60) # (60min/4 = 15min )*(secs/min)
    word_search_tweets, word_search_users, word_search_token = repeat_twitter_search(
        search_query = the_search_query,
        tweet_fields = default_tweet_fields,
        user_fields = default_user_fields,
        search_size = 100,
        until_id = this_until_id,
        since_id = this_since_id,
        num_searches = 20,
        first_next_token = word_search_token)
    word_search_tweets.to_csv(raw_data_path + f'tweets_{number_search_repeats + first_csv_number}.csv')
    word_search_users.to_csv(raw_data_path + f'users_{number_search_repeats + first_csv_number}.csv')

90.0%          18/20
200
Wall time: 5h 43min 29s
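
The schedule configured in the cell above can be worked through as plain arithmetic. The sketch below only restates the settings from that cell (`num_hours`, `searches_per_hour`, `num_searches=20`, `search_size=100`); the names `pages_per_run` and `tweets_per_page` are introduced here for readability. The tweet count is an upper bound that assumes every page comes back full.

```python
num_hours = 6
searches_per_hour = 4
pages_per_run = 20      # num_searches in each repeat_twitter_search call
tweets_per_page = 100   # search_size

# One initial run plus the runs inside the loop.
total_runs = num_hours * searches_per_hour
# The loop sleeps before every run except the first.
seconds_between_runs = (60 / searches_per_hour) * 60
total_sleep_hours = (total_runs - 1) * seconds_between_runs / 3600
# Upper bound, reached only if every page is full.
max_tweets = total_runs * pages_per_run * tweets_per_page

print(total_runs, seconds_between_runs, total_sleep_hours, max_tweets)
```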


## Cleanup

The next step is to clean this data in preparation for an exploratory data analysis. This cleanup is performed in the following notebook, [Data Cleanup](./2_data_cleanup.ipynb).