# Accessing the Twitter API with Python and Tweepy

## Technical requirements

To run this tutorial, you will need two things:

1. The ability to run Jupyter notebooks. This can either be done through the Binder link found on the readme (and if you're viewing this document inside the Binder web app right now, you're already good to go!), or by [installing Jupyter on your own computer](https://jupyter.org/install) and opening this file there.
2. An approved Twitter developer account, which will let you create the API keys and bearer token that are necessary to use the API.

## Package installation and authentication

In data science, we often find ourselves repeating the same common tasks. Rewriting the code for these tasks every time would be tiring and inconvenient. For example, when using the vanilla Twitter API, a request can grow from just a few lines to dozens of lines that involve importing many libraries, handling authentication, selecting endpoints, and managing less readable output (see for yourself: https://github.com/twitterdev/Twitter-API-v2-sample-code). Imagine having to do all of that every time we wanted to gather some tweets!

This is where *packages* come in. Packages provide pre-written functions for common tasks.

We will be using the Tweepy package to access the Twitter API in Python. Additionally, we will be using the Pandas package to do some basic formatting of the search results. The following code cell will install Tweepy and Pandas via pip.

In [1]:
!pip install tweepy pandas

Collecting tweepy
  Downloading tweepy-4.11.0-py3-none-any.whl (96 kB)
Collecting oauthlib<4,>=3.2.0
  Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
Collecting requests<3,>=2.27.0
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
Installing collected packages: requests, oauthlib, tweepy
  Attempting uninstall: requests
    Found existing installation: requests 2.26.0
    Uninstalling requests-2.26.0:
      Successfully uninstalled requests-2.26.0
  Attempting uninstall: oauthlib
    Found existing installation: oauthlib 3.1.1
    Uninstalling oauthlib-3.1.1:
      Successfully uninstalled oauthlib-3.1.1
Successfully installed oauthlib-3.2.2 requests-2.28.1 tweepy-4.11.0


Now that the package is installed, we can *import* it, allowing us to use it in our code:

In [2]:
import tweepy

The first step in accessing the Twitter API is *authentication*. This is the same kind of process as logging in on a website: it lets the API know who you are, so that it can check whether you have permission to access the data. In Tweepy, authentication is handled by a `tweepy.Client` object. To authenticate, all you need to do is give `tweepy.Client` your *bearer token*, a unique credential tied to your Twitter developer account (see the [official Twitter documentation](https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens) for more details):

In [3]:
client = tweepy.Client(
    bearer_token="Paste your bearer token inside these quotes"
)

After running the cell above, the `client` variable now remembers our authentication info, and we can use it to talk to the Twitter API.
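
Pasting a token directly into a notebook makes it easy to share the secret by accident. Here is a minimal sketch of one alternative, assuming you have stored the token in an environment variable. The variable name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are illustrative choices for this sketch, not part of Tweepy:

```python
import os


def load_bearer_token(env_var="TWITTER_BEARER_TOKEN", environ=None):
    """Return the bearer token stored in an environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name you chose when saving your token.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# Example with a stand-in mapping instead of the real environment:
print(load_bearer_token(environ={"TWITTER_BEARER_TOKEN": "abc123"}))
```

The returned string could then be passed along as `tweepy.Client(bearer_token=load_bearer_token())`.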

## Collecting Tweets

The most common reason to use the Twitter API for social science research is searching for tweets. In Tweepy, this is done using the `search_all_tweets` method. Here is a list of all the options you have for `search_all_tweets`:

```
Client.search_all_tweets(
    query, 
    end_time, 
    expansions, 
    max_results, 
    media_fields,
    next_token, 
    place_fields, 
    poll_fields, 
    since_id, 
    start_time, 
    tweet_fields, 
    until_id, 
    user_fields
)
```

As you can see, we can specify our search via a query, set a start or end date, choose the number of tweets we want, and more! Let's try an example:

In [4]:
tweets = client.search_all_tweets(query="Saint Patrick's Day", # keyword search (in this case, we want the text "Saint Patrick's Day")
                                  start_time="2021-03-16T00:00:00Z", # start and end dates we want to search over
                                  end_time="2021-03-18T00:00:00Z",
                                  max_results=100) # how many results do you want?

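If your bearer token does not belong to a developer App that is attached to a Project, the call fails with an error like this one:

Forbidden: 403 Forbidden
When authenticating requests to the Twitter API v2 endpoints, you must use keys and tokens from a Twitter developer App that is attached to a Project. You can create a project via the developer portal.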

What we have just done is ask the Twitter API for up to 100 tweets mentioning "Saint Patrick's Day" posted between March 16-18, 2021. We then saved the results to a variable called `tweets`. But what does this variable actually contain? Let's try printing it out:

In [5]:
print(tweets)

Response(data=[<Tweet id=1372336936195592200 text=Happy Saint Patrick’s day everyone!! @8Egan @AkronCoachZ @ZipsFB https://t.co/HeoYinJEOa>, <Tweet id=1372336933293207555 text=Happy Saint Patrick's Day!!! Let's Get Rowdy!! https://t.co/eDgN8nhUCz>, <Tweet id=1372336926527733764 text=RT @ChipsAhoy: One is not like the others... 👀 Happy Saint Patrick's Day! https://t.co/TFHloXnsws>, <Tweet id=1372336915068948484 text=Happy Saint Patrick’s Day #StPatricksDay https://t.co/gKdwBQjt73>, <Tweet id=1372336914909573123 text=Saint Patrick's Day Show - Who will win the lucky mouse pad? Extreme Anime Radio Podcast - 3/17/2021 https://t.co/oycCp98nP7
#twitch #twitchstreamer #twitchaffiliate #twitchgiveaway @twitch>, <Tweet id=1372336885775863809 text=I think you might have a problem if you take  Saint Patrick's day off just to drink all day. 🤷‍♂️>, <Tweet id=1372336864439410693 text=RT @LunarClient: Happy #StPatricksDay!

Our new limited edition Spring and Saint Patrick's Day cosmetics are now on s

This is not exactly human-readable! But if we look carefully, there are a few useful things we may notice. The most important is that inside the Response we got from the API, there is a list of tweets in an attribute called `data`. In each tweet, we can see that there is an ID and the text of the tweet. This seems like the kind of data that would be best represented in a tabular format if we want to explore it in a human-readable way. We can accomplish this using the Pandas package. Pandas provides tabular data formatting to Python in the form of R-style "Data Frames" (more here: https://pandas.pydata.org/).

In [6]:
import pandas # import the Pandas package so we can use its functions
tweets_df = pandas.DataFrame(tweets.data) # put the data into a tabular format and save it as tweets_df

In [8]:
print(tweets_df.head(5)) # more readable, right?

                    id                                               text
0  1372336936195592200  Happy Saint Patrick’s day everyone!! @8Egan @A...
1  1372336933293207555  Happy Saint Patrick's Day!!! Let's Get Rowdy!!...
2  1372336926527733764  RT @ChipsAhoy: One is not like the others... 👀...
3  1372336915068948484  Happy Saint Patrick’s Day #StPatricksDay https...
4  1372336914909573123  Saint Patrick's Day Show - Who will win the lu...


Much better, right?
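
The `start_time` and `end_time` values in the search above are UTC timestamps written as `YYYY-MM-DDTHH:MM:SSZ`. Typing them by hand is error-prone, so here is a small sketch that builds them from Python `datetime` objects. The helper name `to_api_time` is made up for this example, and treating naive datetimes as UTC is an assumption of the sketch:

```python
from datetime import datetime, timezone


def to_api_time(moment):
    """Format a datetime as a UTC timestamp like 2021-03-16T00:00:00Z."""
    if moment.tzinfo is None:
        # Assumption for this sketch: naive datetimes are already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_api_time(datetime(2021, 3, 16)))  # 2021-03-16T00:00:00Z
```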

Oftentimes we may want more information than just the ID and the text. *Metadata*, the other miscellaneous descriptive properties of a tweet, can often be useful for research as well. We can specify which properties we want by using the `tweet_fields` parameter of the `search_all_tweets` method. A full list of available properties is available at the following documentation page: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet.

As an example, let's expand our last search to give us information about who wrote each tweet, when each tweet was posted, and what conversation each tweet belongs to, in addition to the ID and text:

In [17]:
tweets = client.search_all_tweets(query="Saint Patrick's Day", 
                                  start_time="2021-03-16T00:00:00Z", 
                                  end_time="2021-03-18T00:00:00Z",
                                  tweet_fields=["id", "text", "author_id", "conversation_id", "created_at"], # specify all desired properties in a list (id and text are always returned, so listing them is optional)
                                  max_results=100) 

In [18]:
print(pandas.DataFrame(tweets.data)) # just like last time, use Pandas to format the output

              author_id      conversation_id                created_at  \
0   1102273224245764101  1372336936195592200 2021-03-17 23:59:57+00:00   
1    946879478633873408  1372336933293207555 2021-03-17 23:59:56+00:00   
2   1191482696847020033  1372336926527733764 2021-03-17 23:59:54+00:00   
3    710323992360517632  1372336915068948484 2021-03-17 23:59:51+00:00   
4              67043745  1372336914909573123 2021-03-17 23:59:51+00:00   
..                  ...                  ...                       ...   
90             14364967  1372336311504371712 2021-03-17 23:57:28+00:00   
91            236760689  1372336308807421954 2021-03-17 23:57:27+00:00   
92             14364967  1372336306899062786 2021-03-17 23:57:26+00:00   
93  1352141744343822340  1372336304806105092 2021-03-17 23:57:26+00:00   
94             30032578  1372268967516123137 2021-03-17 23:57:26+00:00   

                     id                                               text  
0   1372336936195592200  Happy Sai
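
Each call returns at most `max_results` tweets. To go beyond that, the `next_token` option listed earlier lets a follow-up call continue where the previous one stopped. The sketch below shows the general loop under stated assumptions: `fetch_page` is a hypothetical callable that takes a token and returns an object with `.data` and `.meta`, and the stand-in pages are invented so the example runs without calling the API:

```python
from types import SimpleNamespace


def collect_pages(fetch_page, max_pages=5):
    """Follow next_token from page to page and gather every item.

    fetch_page is a hypothetical callable: it takes a token (or None for the
    first page) and returns an object with .data and .meta attributes.
    """
    items = []
    token = None
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.data or [])
        token = (response.meta or {}).get("next_token")
        if token is None:
            break  # no further pages
    return items


# Stand-in pages so the sketch runs without calling the API.
fake_pages = {
    None: SimpleNamespace(data=["tweet 1", "tweet 2"], meta={"next_token": "p2"}),
    "p2": SimpleNamespace(data=["tweet 3"], meta={}),
}
print(collect_pages(lambda token: fake_pages[token]))
# ['tweet 1', 'tweet 2', 'tweet 3']
```

With a real client, `fetch_page` could be something like `lambda token: client.search_all_tweets(query="Saint Patrick's Day", max_results=100, next_token=token)`. Tweepy also ships a `tweepy.Paginator` helper intended for the same job, which may be the more convenient choice in practice.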

## Count Tweets

What if you don't want to spend your tweet cap collecting tweets just to find out whether the sample is what you wanted? Or what if you want an estimate of how many tweets a particular query would return?

This is where tweet counts come in.

Tweet counts don't count toward your tweet cap, but they can reveal really important information.

In [9]:
count = client.get_all_tweets_count(
    query="#tokyo", # let's try a hashtag analysis
    start_time="2021-07-01T00:00:00Z",
    end_time="2021-08-01T00:00:00Z",
    granularity="day" # tell the API to count number of tweets per day (can also do hour or minute).
)

In [11]:
print(pandas.DataFrame(count.data)) # just like last time, use Pandas to format the output

                         end                     start  tweet_count
0   2021-07-02T00:00:00.000Z  2021-07-01T00:00:00.000Z         6831
1   2021-07-03T00:00:00.000Z  2021-07-02T00:00:00.000Z         4547
2   2021-07-04T00:00:00.000Z  2021-07-03T00:00:00.000Z         7400
3   2021-07-05T00:00:00.000Z  2021-07-04T00:00:00.000Z         6417
4   2021-07-06T00:00:00.000Z  2021-07-05T00:00:00.000Z         5690
5   2021-07-07T00:00:00.000Z  2021-07-06T00:00:00.000Z         5424
6   2021-07-08T00:00:00.000Z  2021-07-07T00:00:00.000Z         7612
7   2021-07-09T00:00:00.000Z  2021-07-08T00:00:00.000Z         7121
8   2021-07-10T00:00:00.000Z  2021-07-09T00:00:00.000Z         6445
9   2021-07-11T00:00:00.000Z  2021-07-10T00:00:00.000Z         9012
10  2021-07-12T00:00:00.000Z  2021-07-11T00:00:00.000Z         9561
11  2021-07-13T00:00:00.000Z  2021-07-12T00:00:00.000Z         4689
12  2021-07-14T00:00:00.000Z  2021-07-13T00:00:00.000Z         6317
13  2021-07-15T00:00:00.000Z  2021-07-14T00:00:0

These results show the number of tweets by the level of granularity you selected. From this information we can now narrow our search to specific days or use this to create a visualization of tweets for a particular event.

From these results, we can see how something as simple as a tweet count can already reveal interesting phenomena. In this case, the number of tweets per day mentioning the hashtag \#tokyo seems to be consistently higher than normal starting on July 24th. Why do you think this is?
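
Before spending any of your tweet cap, it can help to total the counts and find the busiest period. Here is a minimal sketch, assuming each entry is a dictionary with the `start`, `end`, and `tweet_count` keys that appear as columns in the table above. The three rows are copied from that output, and the helper name `summarize_counts` is made up for this example:

```python
# First three rows of the output above, written as plain dictionaries.
daily_counts = [
    {"start": "2021-07-01T00:00:00.000Z", "end": "2021-07-02T00:00:00.000Z", "tweet_count": 6831},
    {"start": "2021-07-02T00:00:00.000Z", "end": "2021-07-03T00:00:00.000Z", "tweet_count": 4547},
    {"start": "2021-07-03T00:00:00.000Z", "end": "2021-07-04T00:00:00.000Z", "tweet_count": 7400},
]


def summarize_counts(rows):
    """Return (total tweets, start date of the busiest period, its count)."""
    total = sum(row["tweet_count"] for row in rows)
    busiest = max(rows, key=lambda row: row["tweet_count"])
    return total, busiest["start"][:10], busiest["tweet_count"]


print(summarize_counts(daily_counts))  # (18778, '2021-07-03', 7400)
```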

## User Tweets

Sometimes we may be interested in finding tweets by a specific user. We have to be careful here, however! As humans, we typically identify users by their username, also known as a Twitter handle (e.g., @InfoTisayentist for Aspen). The Twitter API instead identifies users by a numerical ID. Thankfully, Tweepy gives us tools to convert between the two:

In [13]:
user_info = client.get_user(username="InfoTisayentist") # look up information for the user InfoTisayentist
print(user_info) # note the "id" variable!

Response(data=<User id=872555593260838912 name=Aspen username=InfoTisayentist>, includes={}, errors=[], meta={})


Once we have found a user's numerical ID, we can use this to look up their tweets.

In [19]:
profile_tweets = client.get_users_tweets(
    id=872555593260838912, # InfoTisayentist's ID that we retrieved above
    start_time="2021-01-01T00:00:00Z",
    end_time="2021-01-05T00:00:00Z",
    max_results=100
)
print(pandas.DataFrame(profile_tweets.data).head(5))

                    id                                               text
0  1345070201180622848  RT @hood_naturalist: HERE WE GO!!!!!\n\nWhat w...
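
Note that the lookup above passes the handle without the leading `@`. If you copy handles from profile pages or spreadsheets, a tiny helper like the following can tidy them first. This is only a sketch, and `normalize_handle` is a hypothetical name, not a Tweepy function:

```python
def normalize_handle(handle):
    """Strip surrounding whitespace and a leading @ from a Twitter handle."""
    return handle.strip().lstrip("@")


print(normalize_handle(" @InfoTisayentist "))  # InfoTisayentist
```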


## Conversation Threads

Here we can see how the results of one query can feed into other searches. You could use a tweet or user search to identify a specific conversation that you want to dig into. The key to this is the `query` parameter in `search_all_tweets`. Previously, we have only provided text queries. But we can also query by other fields, as we will demonstrate here with `conversation_id`.

Feel free to swap the conversation ID with something more interesting!

In [30]:
thread_tweets = client.search_all_tweets(
    # Replace with ID of your choice to get replies (this ID is from the St Pat original set)
    query="conversation_id:1372268967516123137",
    start_time="2021-03-01T00:00:00Z",
    end_time="2021-03-30T00:00:00Z",
    max_results=100
)
print(pandas.DataFrame(thread_tweets.data).head(5))

                    id                                               text
0  1372336303346388995  @akent47 @CkFlakes On Saint Patrick’s Day. You...
1  1372307552084533249  @PRIMExSUSPECT @CkFlakes I’ll buy you a drink....
2  1372291391242440708             @akent47 @CkFlakes Buy me a beer Alan.
3  1372277883935191043  @CkFlakes wanna buy me a draaaank? ooooooOOOOO...
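
The replies above arrive newest first. To read a thread in the order it was written, one option is to sort by tweet ID, which relies on the fact that Twitter's IDs grow over time. Here is a minimal sketch using the four rows shown above as plain dictionaries (the helper name `oldest_first` is made up for this example):

```python
# The four replies from the output above, as plain dictionaries.
replies = [
    {"id": 1372336303346388995, "text": "@akent47 @CkFlakes On Saint Patrick’s Day. You..."},
    {"id": 1372307552084533249, "text": "@PRIMExSUSPECT @CkFlakes I’ll buy you a drink...."},
    {"id": 1372291391242440708, "text": "@akent47 @CkFlakes Buy me a beer Alan."},
    {"id": 1372277883935191043, "text": "@CkFlakes wanna buy me a draaaank? ooooooOOOOO..."},
]


def oldest_first(rows):
    """Sort tweets chronologically, assuming larger IDs mean later tweets."""
    return sorted(rows, key=lambda row: row["id"])


for reply in oldest_first(replies):
    print(reply["id"], reply["text"])
```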


## Geo-tagged Tweets

Text queries and field-based queries are not mutually exclusive; a single query can contain both plain text to search for and conditions on individual fields. Let's demonstrate using a common use case: location filtering.

Location is really important information. Remember when Facebook added the "checking in" feature during natural disasters? Well, we can search for Tweets that specifically have location information embedded in those tweets.

Specifying `has:geo` in a query restricts the search to tweets that carry this location information.

In [31]:
rams_geo = client.search_all_tweets(
    # "rams" is a super generic search term...but what if we limit to geo-tagged tweets around the time the LA Rams were playing?
    query="rams has:geo", 
    start_time="2022-01-15T00:00:00Z",
    end_time="2022-01-20T00:00:00Z",
    max_results=100
)
print(pandas.DataFrame(rams_geo.data).head(5))

                    id                                               text
0  1483949733311827973  .@DelmarvaSports \n@47ABCSports \n@DelmarvaSco...
1  1483948293830832129  @Holdem_junkie Ive had pupusas as my game day ...
2  1483947221787754504  @gimmethewooby @SportsSturm In the 49ers heyda...
3  1483942805282168836  @CourageTiger I personally didn't care if the ...
4  1483938356065153027  Bengals x Titans: Bengals\n49ers x Packers: Pa...


Location information can be a `geo.place_id` or coordinates. If it is the former, one more step is needed to convert the place code to a location. The following page explains how to do that: https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode
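
Another route, sketched here under assumptions, is to ask the search itself for place details through the `expansions` and `place_fields` options listed earlier, and then match each tweet's place code against the returned places. The dictionaries below imitate the shape of the raw API v2 JSON as we understand it; every ID and name in them is invented for illustration, and `attach_place_names` is a hypothetical helper:

```python
def attach_place_names(tweets, places):
    """Add a place_name key to each tweet by looking up its geo.place_id."""
    names = {place["id"]: place["full_name"] for place in places}
    labelled = []
    for tweet in tweets:
        place_id = (tweet.get("geo") or {}).get("place_id")
        labelled.append({**tweet, "place_name": names.get(place_id)})
    return labelled


# Invented sample data, shaped like tweets plus their expanded places.
sample_tweets = [
    {"id": 1, "text": "Go Rams!", "geo": {"place_id": "place_a"}},
    {"id": 2, "text": "No location here"},
]
sample_places = [{"id": "place_a", "full_name": "Los Angeles, CA"}]

for row in attach_place_names(sample_tweets, sample_places):
    print(row["id"], row["place_name"])
```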

## Advanced Parameters

So far, we've augmented our naive text searches with two advanced query operators: `conversation_id` and `has:geo`. But this barely scratches the surface of the ways we can customize our queries! There is too much to cover in a single workshop, but we encourage you to read the official documentation to learn more about all the operators you can use: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. For now, we will leave you with an example of a specific query we might use in practice:

In [32]:
tweets_advanced = client.search_all_tweets(
    # an advanced query combining several parameters
    query='rams has:geo place:"los angeles" lang:en -is:retweet -is:verified',
    start_time="2022-01-15T00:00:00Z",
    end_time="2022-01-20T00:00:00Z",
    max_results=100
)
print(pandas.DataFrame(tweets_advanced.data).head(5))

                    id                                               text
0  1483935951705444352  Attendance was very low for Rams games at Anah...
1  1483899947535638532  @DamonBruce @ArashMarkazi This guy got a blue ...
2  1483841939669544960  @laurafh16 @seattlerams_nfl That's all they ca...
3  1483642755674951686  About to rewatch yesterday’s game against the ...
4  1483639044344934401  @Larryk520 I was talking about the Rams, but t...
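
Long query strings like the one above are easy to mistype. One way to keep them readable is to assemble them from parts. The sketch below rebuilds the same query; `build_query` and its parameter names are hypothetical and exist only for this example:

```python
def build_query(text, require=(), exclude=()):
    """Join search text, required operators, and negated operators."""
    parts = [text, *require, *(f"-{op}" for op in exclude)]
    return " ".join(part for part in parts if part)


query = build_query(
    "rams",
    require=["has:geo", 'place:"los angeles"', "lang:en"],
    exclude=["is:retweet", "is:verified"],
)
print(query)
# rams has:geo place:"los angeles" lang:en -is:retweet -is:verified
```

The resulting string can be passed as the `query` argument of `search_all_tweets`, exactly like the hand-written version.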


## Final Word

Congratulations! You have overcome perhaps the most daunting hurdle in using APIs: actually using one. It is important to remember that your queries will not always be perfect or capture exactly what you want. That's okay. There is a slew of resources on data management and cleaning that can help you get to the final step, which will look more like a dataset worthy of analysis.

Well done and I wish you the best with your computational social science endeavors!

## Resources
Official Twitter API V2 developer documentation: https://developer.twitter.com/en/docs/twitter-api

Twitter API V2 Sample Code: https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research 