# Twitter V2 Full Archive Search

This document shows how to use Tweepy to conduct a full archive search using v2 of the Twitter API.

## Prep work

In order to use this code, you will need to have a developer account on Twitter, with access to the Academic Research product track. Information about who is eligible and how to apply is [here](https://developer.twitter.com/en/products/twitter-api/academic-research).

Once you have an account, you will need to create a new app at https://developer.twitter.com/en/portal/dashboard and generate a "bearer token" from the app. Copy the bearer token to your clipboard and paste it into a new file in the same directory as this file, called `twitter_authentication.py`. The entire contents of the file should look like this:

```python
bearer_token = "YOUR BEARER TOKEN HERE"
```

Note that you should **never** share this token with anyone else. If, for example, you are saving your work in a Git repository, make sure that you add the `twitter_authentication.py` file to your `.gitignore`.

If anyone gets this token, they will be able to make API requests on behalf of your app, and you will need to revoke the token (from the same interface where you created it).
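If you would rather not keep the token in a file at all, one common alternative is to read it from an environment variable. The following is only a minimal sketch: the function name `load_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions made for this example, not something Tweepy or Twitter requires.

```python
import os


def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Return the bearer token stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

The rest of this document assumes the `twitter_authentication.py` approach described above.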

If you've created the file successfully, then the following two blocks of code should work.

In [1]:
import tweepy
from twitter_authentication import bearer_token
import time
import pandas as pd

In [2]:
client = tweepy.Client(bearer_token, wait_on_rate_limit=True)

## The Search API

Full documentation for searching tweets is at https://docs.tweepy.org/en/latest/client.html#search-tweets. There are a lot of different options, but here is a simple version that gets all of the "COVID hoax" tweets from January 20, 2021.

By default the only information returned is the tweet ID and the text. Often, we will want information about authors, too. To get information about the author, you need to add the `user_fields` parameter with the fields you want as well as the `expansions = 'author_id'` parameter. 

To get more information about the tweet, you need the `tweet_fields` parameter. The options are shown at https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all

You also likely want to build a somewhat advanced query - instructions are at https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. For this query, I get English language tweets that are not retweets.
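A query is just a string of keywords and operators separated by spaces, so it can be assembled from parts. Here is a small sketch of that idea; `build_query` is a hypothetical helper written for this example, and it only knows about the two operators used in this document (`-is:retweet` and `lang:`).

```python
def build_query(keywords, exclude_retweets=True, lang=None):
    """Join keywords and operators into one search query string."""
    parts = list(keywords)
    if exclude_retweets:
        parts.append("-is:retweet")  # drop retweets
    if lang:
        parts.append(f"lang:{lang}")  # keep only one language
    return " ".join(parts)


query = build_query(["COVID", "hoax"], lang="en")
print(query)  # COVID hoax -is:retweet lang:en
```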


In [3]:
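hoax_tweets = []
# Paginator requests one page at a time (up to 500 tweets each) until the results run out
for response in tweepy.Paginator(client.search_all_tweets,
                                 query = 'COVID hoax -is:retweet lang:en',
                                 user_fields = ['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions = 'author_id',
                                 start_time = '2021-01-20T00:00:00Z',
                                 end_time = '2021-01-21T00:00:00Z',
                                 max_results = 500):
    # Pause between pages: full archive search allows one request per second
    time.sleep(1)
    # Keep the whole response (tweets plus user data), not just the tweets
    hoax_tweets.append(response)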

Note that the loop above follows the best practice of saving the raw response exactly as returned. If this were a real project, I would write all of the raw responses out to a file. For long-running queries (e.g., if you need to get hundreds of thousands of tweets), you will often want to build in some error handling and a way to resume data collection. For example, you might write all of the results to a file and, after an interruption, open the file, retrieve the last tweet, and use the ID of that tweet to tell the script where to resume.
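The resume idea can be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `append_tweets` and `last_saved_id` are made up for this example, and each tweet is assumed to be a plain dictionary with an `id` key (the shape of the JSON the API returns).

```python
import json
import os


def append_tweets(path, tweets):
    """Append one JSON object per line so partial runs are never lost."""
    with open(path, "a", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def last_saved_id(path):
    """Return the ID on the last line of the file, or None if nothing is saved."""
    if not os.path.exists(path):
        return None
    last = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                last = json.loads(line)["id"]
    return last
```

Because search results arrive newest first, the last saved ID belongs to the oldest tweet collected so far. Passing it as the `until_id` argument of `client.search_all_tweets` should continue the search with older tweets, but check the API documentation for how `until_id` interacts with `start_time` and `end_time` before relying on this.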

Another complication is that the object that is returned is a bit confusing: it is nested, with the tweet data in `.data` and the user data in `.includes['users']`.

In [4]:
hoax_tweets[0].data[0]

<Tweet id=1352042670059737089 text=@larrycharlesism ALSO, the mitigation measures undertaken for COVID have absolutely crushed the flu in the Northern Hemisphere. So it's not equal. In a situation where flu has been reduced to a wisp, COVID has still killed 400,000. That's my argument to them of 'the hoax'.
https://t.co/dwMlQn75jf>

In [5]:
hoax_tweets[0].includes['users'][2]

<User id=862336513 name=Red Ace 🧦 username=CallOfDove>

Note that both of these are objects. The data that we asked for in `user_fields` and `tweet_fields` above are attributes of the objects. For example, here's the user's description:

In [6]:
hoax_tweets[0].includes['users'][2].description

'Libertarian+Conservative=Me, Catholic Christian, 🇺🇸 🇲🇽 🇵🇷 , In that order Also also also typo man'

We will often want to reorganize these into a flat file, which means connecting a tweet to the user data of the user who wrote it. I show an example of how to do that here:

In [7]:
result = []
user_dict = {}
# Loop through each response object
for response in hoax_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes.get('users', []):
        user_dict[user.id] = {'username': user.username, 
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                             }
    # A page with no results has data set to None, so fall back to an empty list
    for tweet in response.data or []:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id, 
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count']
                      })

# Change this list of dictionaries into a dataframe
df = pd.DataFrame(result)

In [8]:
df

,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1173999122791055360,LardFDorkness2,1131,13548,1st LardFDorkness account banned for gettin' s...,,"@larrycharlesism ALSO, the mitigation measures...",2021-01-20 23:57:46+00:00,0,0,1,0
1,733606999,james_thomas127,10,4897,,,@covidhoax2020 @KevinVanAusdal LOLOL your name...,2021-01-20 23:55:40+00:00,0,0,0,0
2,862336513,CallOfDove,245,13286,"Libertarian+Conservative=Me, Catholic Christia...",,@SimonMichaelPa2 @cristhian_0707 @guypbenson I...,2021-01-20 23:54:40+00:00,0,0,0,0
3,971840730137178112,WilliamMcK25thP,2985,10862,Proud AMERICAN! USMC Brat #WalkedAway on Nov 2...,Southern California,"@XHNews They forgot to thank the Covid Hoax, t...",2021-01-20 23:54:21+00:00,0,0,0,0
4,798360156559982592,Since_U_Asked,1092,26133,I like ez hikes following streams\nthen cozy n...,Obama saved us at darkest hour,"400.000 soldiers buried in Arlington\n400,000 ...",2021-01-20 23:53:59+00:00,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1011,1326533089405833216,AlwaysVoteTruth,24,7439,MPA praying for our democracy back in the USA,,@RandPaul You meant to say the Trump regime th...,2021-01-20 00:02:58+00:00,0,0,0,0
1012,266256718,jaze_ca,4148,94447,#mandatoryvaccinenationwide\n#ResignDougFord\n...,,"@DeanLuk @weeser1 @MSNBC Hey Trumpers, if Covi...",2021-01-20 00:02:00+00:00,0,0,1,0
1013,59075703,panji90,359,13275,The A stands for APAAN TUH? (Kuis-Dang-Dut~). ...,"Cempakaputihindah, Indonesia",A very good resource on Covid-19. It explained...,2021-01-20 00:01:54+00:00,1,0,0,0
1014,24393759,soma77,66,23554,"The presenter, John Kuykendall lived as a monk...","Sparks, Nevada","Remembering Covid victims, Biden spends emotio...",2021-01-20 00:01:39+00:00,0,0,0,0
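To keep the flattened rows on disk, the `result` list of dictionaries built above can be written out as a CSV file. The sketch below uses only the standard library; `write_flat_file` is a hypothetical helper name, and it assumes every dictionary has the same keys. With pandas, `df.to_csv(path, index=False)` does the same job in one line.

```python
import csv


def write_flat_file(rows, path):
    """Write a list of same-keyed dictionaries to a CSV file with a header row."""
    if not rows:
        return 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Note that everything comes back as text when the file is read again, so IDs and counts need to be converted if you want numbers.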


## `requests`-based version

If you want to do things without Tweepy, here is some boilerplate code that should work. As you can see, it's much more complicated. Be grateful for the Tweepy developers!! :)

In [None]:
import requests
import json
import twitter_authentication as config
import time

# Save your bearer token in a file called twitter_authentication.py in this directory
# Should look like this:
# bearer_token = 'YOUR_BEARER_TOKEN_HERE'

bearer_token = config.bearer_token
query = '(#COVID) OR (#COVID-19)'
out_file = 'raw_tweets.txt'

search_url = "https://api.twitter.com/2/tweets/search/all"

# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
query_params = {'query': query,
                'start_time': '2010-01-01T12:00:00Z',
                'tweet.fields': 'author_id,public_metrics',
                'user.fields': 'username',
                'expansions': 'author_id',
                'max_results': 500
               }


def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers, params, next_token = None):
    # Add the pagination token (if any) so the API returns the next page
    if next_token:
        params['next_token'] = next_token
    # Use the url argument rather than the global search_url
    response = requests.get(url, headers=headers, params=params)
    # Pause after every request to stay under the rate limit
    time.sleep(3.1)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def get_tweets(num_tweets, output_fh):
    next_token = None
    tweets_stored = 0
    headers = create_headers(bearer_token)
    while tweets_stored < num_tweets:
        json_response = connect_to_endpoint(search_url, headers, query_params, next_token)
        if json_response['meta']['result_count'] == 0:
            break
        # Map each author ID to a username so it can be attached to the tweet
        author_dict = {x['id']: x['username'] for x in json_response['includes']['users']}
        for tweet in json_response['data']:
            try:
                tweet['username'] = author_dict[tweet['author_id']]
            except KeyError:
                print(f"No data for {tweet['author_id']}")
            # Write one JSON object per line
            output_fh.write(json.dumps(tweet) + '\n')
            tweets_stored += 1
        # Stop when there are no more pages
        next_token = json_response['meta'].get('next_token')
        if next_token is None:
            break



def main():
    with open(out_file, 'w') as f:
        get_tweets(500, f)



main()
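The code above raises an exception on any non-200 response, which ends the whole run. For long collections it may be worth retrying a failed request a few times first. Here is a minimal sketch of that idea, written so it does not depend on the network: `call_with_retry` is a hypothetical helper, and the attempt count and delays are arbitrary choices. It retries on any exception, whereas a real project would probably retry only on rate-limit or server errors.

```python
import time


def call_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    fetch is any zero-argument callable that raises on failure, such as
    lambda: connect_to_endpoint(search_url, headers, query_params).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Inside `get_tweets`, the call to `connect_to_endpoint` could be wrapped this way so that a single failed request does not discard the progress made so far.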

In [None]:
tweets = []
with open(out_file, 'r') as f:
    # Each line of the file holds one tweet as a JSON object
    for row in f:
        tweets.append(json.loads(row))

In [None]:
tweets[0]