# Note
This Jupyter notebook is for learning purposes.

First, install the necessary Python packages from requirements.txt:

`pip install -r requirements.txt`

In this document I use the searchtweets Python wrapper. I recommend this option; however, you can also make HTTP requests to the Twitter API directly.

[searchtweets documentation](https://github.com/twitterdev/search-tweets-python/tree/v2)

[twitter api HTTP requests sample code](https://github.com/twitterdev/Twitter-API-v2-sample-code)

For full details, refer to the [Twitter API v2 docs](https://developer.twitter.com/en/docs/twitter-api).
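For readers who prefer the direct HTTP route, here is a minimal sketch that builds (but does not send) a request to the v2 full-archive search endpoint using only the standard library. The helper name `build_search_request` and the placeholder token are illustrative assumptions and are not part of searchtweets; the parameter names are the ones the API expects in the query string.

```python
import urllib.parse
import urllib.request

# Full-archive search endpoint of the Twitter API v2.
SEARCH_ALL_URL = "https://api.twitter.com/2/tweets/search/all"


def build_search_request(bearer_token, params):
    """Build a GET request for the search endpoint without sending it.

    `params` uses the API's own parameter names (e.g. "tweet.fields").
    """
    url = SEARCH_ALL_URL + "?" + urllib.parse.urlencode(params)
    headers = {"Authorization": f"Bearer {bearer_token}"}
    return urllib.request.Request(url, headers=headers)


# "YOUR_BEARER_TOKEN" is a placeholder, not a real credential.
request = build_search_request(
    "YOUR_BEARER_TOKEN",
    {
        "query": "from:aei",
        "start_time": "2020-01-01T00:00:00Z",
        "end_time": "2021-01-30T00:00:00Z",
        "max_results": 10,
        "tweet.fields": "id,created_at,text,public_metrics",
    },
)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Sending the request requires a valid bearer token, and every call counts toward the rate and monthly limits, which is one reason the wrapper is the more convenient option.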

In [1]:
import requests
import os
import json
from searchtweets import ResultStream, gen_request_parameters, load_credentials, collect_results

In [2]:
full_archive_seach_args = load_credentials(filename="./twitter_keys.yaml",
                 yaml_key="search_tweets_fullarchive_dev",
                 env_overwrite=False)

# Full Archive Search

It's a good idea to set `MAX_TWEETS` per call as a constant. The max is 500, but if you're testing it's better to keep this limit smaller. Our limit is 10 million tweets a month.

In [3]:
MAX_TWEETS = 10 # 500 is the max per call

See this [link](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet) for documentation on the tweet fields

The tweet object has many fields. List the ones you want to retrieve as a comma-delimited string, for example
`tweet_fields='id,created_at,text,public_metrics'`

See this [link](https://developer.twitter.com/en/docs/twitter-api/expansions) for documentation on expansions

Expansions are also given as a comma-delimited list.
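If you keep the field names in Python lists, a small helper can produce the comma-delimited strings. This is only a convenience sketch; `comma_list` is a hypothetical name, not a searchtweets function.

```python
def comma_list(*names):
    """Join field or expansion names into a comma-delimited string."""
    # Drop accidental duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(names))
    return ",".join(unique)


tweet_fields = comma_list("id", "created_at", "text", "public_metrics")
expansions = comma_list("author_id")
print(tweet_fields)
print(expansions)
```

The resulting strings can be passed as the `tweet_fields` and `expansions` arguments of `gen_request_parameters`.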

In [4]:
rule = gen_request_parameters("from:aei",
                              granularity=None,
                              start_time="2020-01-01",  # UTC 2020-01-01 00:00
                              end_time="2021-01-30",  # UTC 2021-01-30 00:00
                              tweet_fields='id,author_id,created_at,text,public_metrics,referenced_tweets,entities',
                              expansions="author_id",
                              results_per_call=MAX_TWEETS)
print(rule)

{"query": "from:aei", "start_time": "2020-01-01T00:00:00Z", "end_time": "2021-01-30T00:00:00Z", "max_results": 10, "tweet.fields": "id,author_id,created_at,text,public_metrics,referenced_tweets,entities", "expansions": "author_id"}
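Judging by the printed output, the rule is a JSON string, so it can be parsed back into a dict to check what will be sent. The sketch below copies the string printed above into a variable (`rule_text`, a name chosen for illustration) rather than calling searchtweets.

```python
import json

# The rule printed above, copied verbatim for illustration.
rule_text = (
    '{"query": "from:aei", "start_time": "2020-01-01T00:00:00Z", '
    '"end_time": "2021-01-30T00:00:00Z", "max_results": 10, '
    '"tweet.fields": "id,author_id,created_at,text,public_metrics,referenced_tweets,entities", '
    '"expansions": "author_id"}'
)

params = json.loads(rule_text)
requested_fields = params["tweet.fields"].split(",")

print(params["max_results"])
print(requested_fields)
```

Note that the date-only `start_time` and `end_time` arguments were expanded to full UTC timestamps, and `results_per_call` became `max_results`.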


In [5]:
tweets = collect_results(rule, max_tweets=MAX_TWEETS, result_stream_args=full_archive_seach_args)

In [6]:
print(type(tweets))
print(len(tweets))
print(tweets[0].keys())


<class 'list'>
1
dict_keys(['data', 'includes', 'meta'])


`collect_results` returns a list of dicts. Each dict represents a page of results. Since we set both `results_per_call` and `max_tweets` to `MAX_TWEETS`, there will be only one element in that list. See the `Crawling all tweets made by a user` section below for more information.

Each page dict has `data`, `meta`, and `includes` fields.

`data` holds the tweet objects, so most of what you need is there.

In [7]:
print(len(tweets[0]['data'])) # max len for this is same as MAX_TWEETS
print(tweets[0]['data'][0])

10
{'author_id': '30864583', 'id': '1355254021024526337', 'referenced_tweets': [{'type': 'retweeted', 'id': '1355239914858835969'}], 'text': 'RT @AEIecon: There is an opportunity for bipartisan cooperation on expanding coverage and controlling costs — but only if the parties set a…', 'public_metrics': {'retweet_count': 2, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}, 'created_at': '2021-01-29T20:38:32.000Z', 'entities': {'mentions': [{'start': 3, 'end': 11, 'username': 'AEIecon', 'id': '809552311'}]}}
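As a rough illustration of working with one tweet object, the sketch below pulls a few values out of the nested fields. The sample dict is the tweet printed above with its text shortened, and `summarize_tweet` is a hypothetical helper. It uses `.get` because a field may be missing if it was not requested or does not apply to the tweet.

```python
from datetime import datetime, timezone


def summarize_tweet(tweet):
    """Extract a few commonly needed values from a tweet data object."""
    ref_types = [ref["type"] for ref in tweet.get("referenced_tweets", [])]
    mentions = tweet.get("entities", {}).get("mentions", [])
    metrics = tweet.get("public_metrics", {})
    # created_at looks like '2021-01-29T20:38:32.000Z' (UTC).
    created = datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "id": tweet["id"],
        "created_at": created.replace(tzinfo=timezone.utc),
        "is_retweet": "retweeted" in ref_types,
        "mentions": [m["username"] for m in mentions],
        "retweet_count": metrics.get("retweet_count", 0),
        "like_count": metrics.get("like_count", 0),
    }


sample_tweet = {
    "author_id": "30864583",
    "id": "1355254021024526337",
    "referenced_tweets": [{"type": "retweeted", "id": "1355239914858835969"}],
    "text": "RT @AEIecon: There is an opportunity for bipartisan cooperation",  # shortened
    "public_metrics": {"retweet_count": 2, "reply_count": 0, "like_count": 0, "quote_count": 0},
    "created_at": "2021-01-29T20:38:32.000Z",
    "entities": {"mentions": [{"start": 3, "end": 11, "username": "AEIecon", "id": "809552311"}]},
}

summary = summarize_tweet(sample_tweet)
print(summary)
```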


`meta` holds the newest and oldest tweet IDs on the page, the result count, and a `next_token` when more pages exist, so it is useful for constructing the next query.

In [8]:
print(tweets[0]['meta'])

{'newest_id': '1355254021024526337', 'oldest_id': '1354856778647941121', 'result_count': 10, 'next_token': 'b26v89c19zqg8o3foskt5kx7wqa14ffp6hpnqd8sxqc8t'}


`includes` has all the expansion data

In [9]:
print(tweets[0]['includes'])

{'users': [{'id': '30864583', 'name': 'AEI', 'username': 'AEI'}]}


To construct a URL for a tweet, all you need is the tweet ID and the author's username.

The tweet ID is the `id` field found in `data`, and the username is the `username` field found in `includes`.

`https://twitter.com/[username]/status/[tweet id]`

So for the above example it would be

https://twitter.com/AEI/status/1355254021024526337
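The URL recipe above can be sketched as a small function that matches each tweet's `author_id` to a user in `includes`. The name `tweet_urls` is hypothetical, and the sample page is a trimmed copy of the output shown earlier. The function assumes the `author_id` expansion was requested; tweets whose author is not in `includes` are skipped.

```python
def tweet_urls(page):
    """Build a URL for every tweet on a results page."""
    # Map author_id -> username from the expanded user objects.
    users = page.get("includes", {}).get("users", [])
    usernames = {user["id"]: user["username"] for user in users}

    urls = []
    for tweet in page.get("data", []):
        username = usernames.get(tweet.get("author_id"))
        if username is None:
            continue  # author was not expanded, so no URL can be built
        urls.append(f"https://twitter.com/{username}/status/{tweet['id']}")
    return urls


sample_page = {
    "data": [{"author_id": "30864583", "id": "1355254021024526337"}],
    "includes": {"users": [{"id": "30864583", "name": "AEI", "username": "AEI"}]},
    "meta": {"result_count": 1},
}

print(tweet_urls(sample_page))
```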

# Crawling all tweets made by a user

When a query has more than 500 matching tweets, the results are split across multiple pages. To get all the results, we have to fetch every page.

See the [Pagination documentation](https://developer.twitter.com/en/docs/twitter-api/pagination)

Essentially, we use the `next_token` field from `meta` to request the next page of results. We continue this process until no `next_token` is returned (i.e. there are no more result pages).
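To make the loop concrete, here is a minimal sketch of manual pagination. It takes the page-fetching step as a function argument, so it can be demonstrated without network access. `fetch_all_pages`, `fetch_page`, and the fake pages are all illustrative assumptions; in real use `fetch_page` would send the search request with the given `next_token`.

```python
def fetch_all_pages(fetch_page):
    """Call fetch_page(next_token) until a page has no next_token."""
    pages = []
    next_token = None  # the first request carries no token
    while True:
        page = fetch_page(next_token)
        pages.append(page)
        next_token = page["meta"].get("next_token")
        if not next_token:
            break  # last page reached
    return pages


# Stand-in for real API calls: three fake pages keyed by the token
# that requests them.
FAKE_PAGES = {
    None: {"data": [{"id": "3"}], "meta": {"result_count": 1, "next_token": "t1"}},
    "t1": {"data": [{"id": "2"}], "meta": {"result_count": 1, "next_token": "t2"}},
    "t2": {"data": [{"id": "1"}], "meta": {"result_count": 1}},
}

pages = fetch_all_pages(lambda token: FAKE_PAGES[token])
print(len(pages))
```

A real loop would also need to handle rate limiting between calls.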

**However, with searchtweets,** pagination is handled automatically by `collect_results`. All you have to do is set `max_tweets` to a large number. Our monthly limit is 10 million, so experiment to find a suitable max.

It returns a list of result pages. Each element in that list represents one call. Each call returns up to `results_per_call` tweets (the max is 500), which you set in `gen_request_parameters`.

See Fast Way in the [searchtweets documentation](https://github.com/twitterdev/search-tweets-python/tree/v2#fast-way)

Below is an example of how to get all tweets by AEI in 2021.

In [10]:
MAX_TWEETS_PER_CALL = 500 # set this to the max tweets we can get per call (500)
ruleFullCrawl = gen_request_parameters("from:aei",
                                       granularity=None,
                                       start_time="2021-01-01",  # UTC 2021-01-01 00:00
                                       end_time="2022-01-01",  # UTC 2022-01-01 00:00
                                       tweet_fields='id,author_id,created_at,text,public_metrics,referenced_tweets,entities',
                                       results_per_call=MAX_TWEETS_PER_CALL)
print(ruleFullCrawl)

{"query": "from:aei", "start_time": "2021-01-01T00:00:00Z", "end_time": "2022-01-01T00:00:00Z", "max_results": 500, "tweet.fields": "id,author_id,created_at,text,public_metrics,referenced_tweets,entities"}


In [11]:
MAX_TWEETS_PER_SEARCH=100000

This will return a list in which each element (page) contains up to `MAX_TWEETS_PER_CALL` tweets. The total number of tweets will not exceed `MAX_TWEETS_PER_SEARCH`.

In [12]:
tweets = collect_results(ruleFullCrawl, max_tweets=MAX_TWEETS_PER_SEARCH, result_stream_args=full_archive_seach_args)

 HTTP Error code: 429: {"title":"Too Many Requests","detail":"Too Many Requests","type":"about:blank","status":429} | Too Many Requests
 Request payload: {'query': 'from:aei', 'start_time': '2021-01-01T00:00:00Z', 'end_time': '2022-01-01T00:00:00Z', 'max_results': 500, 'tweet.fields': 'id,author_id,created_at,text,public_metrics,referenced_tweets,entities'}
Rate limit hit... Will retry...
Will retry in 4 seconds...


In [13]:
print(len(tweets))

8


All result pages before the last page will have `MAX_TWEETS_PER_CALL` tweets.

The last page will have 1-500 tweets. Therefore, we can calculate the total number of tweets obtained as follows:

In [14]:
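# full pages before the last one, plus however many tweets the last page holds
total_num_tweets = (len(tweets) - 1) * MAX_TWEETS_PER_CALL + tweets[-1]['meta']['result_count']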
print(total_num_tweets)

3595
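An alternative that does not rely on every earlier page being full is to sum `result_count` over all pages. The same loop can flatten the pages into a single list of tweets. This is a sketch on made-up pages, and `count_tweets` and `flatten_pages` are hypothetical helper names.

```python
def count_tweets(pages):
    """Total tweets across pages, using each page's own result_count."""
    return sum(page["meta"]["result_count"] for page in pages)


def flatten_pages(pages):
    """Concatenate the data lists of all pages into one list of tweets."""
    return [tweet for page in pages for tweet in page.get("data", [])]


# Two tiny fake pages standing in for the output of collect_results.
example_pages = [
    {"data": [{"id": "3"}, {"id": "2"}], "meta": {"result_count": 2, "next_token": "abc"}},
    {"data": [{"id": "1"}], "meta": {"result_count": 1}},
]

all_tweets = flatten_pages(example_pages)
print(count_tweets(example_pages))
print(len(all_tweets))
```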


# What the results look like

The result you get from `collect_results` will look something like this

In [15]:
'''
it returns an array of results page objects

collect_results_output = [results_page1,results_page2 , ... ]

a results page has three fields: data, meta, includes
if you don't request any expansions you will not see includes

data is a list of tweet data objects
meta is a dict with fields: newest_id, oldest_id, result_count, next_token
includes is a dict whose contents depend on the expansions you request


results_page1 = {
                    data:[data_for_tweet1, data_for_tweet2, ... ],
                    meta:{
                        'newest_id': '1355254021024526337',
                        'oldest_id': '1354856778647941121',
                        'result_count': 10,
                        'next_token': 'b26v89c19zqg8o3foskt5kx7wqa14ffp6hpnqd8sxqc8t'
                    },
                    includes:{
                        'users': [{'id': '30864583', 'name': 'AEI', 'username': 'AEI'}]
                    }
                }

tweet data object depends on what tweet_fields you specify

data_for_tweet1 = {
                    'author_id': '30864583', 
                    'id': '1355254021024526337', 
                    'referenced_tweets': [{'type': 'retweeted', 'id': '1355239914858835969'}], 
                    'text': 'RT @AEIecon: There is an opportunity for bipartisan cooperation on expanding coverage and controlling costs — but only if the parties set a…', 
                    'public_metrics': {'retweet_count': 2, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}, 
                    'created_at': '2021-01-29T20:38:32.000Z', 
                    'entities': {'mentions': [{'start': 3, 'end': 11, 'username': 'AEIecon', 'id': '809552311'}]}
                  }
'''
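The same layout can be written down as type hints, which some readers may find easier to scan. This is a hedged sketch: the class names `Meta` and `ResultsPage` and the helper `describe_page` are made up for illustration, and `total=False` marks keys that may be absent, such as `includes` without expansions.

```python
from typing import Dict, List, TypedDict


class Meta(TypedDict, total=False):
    newest_id: str
    oldest_id: str
    result_count: int
    next_token: str  # only present when another page exists


class ResultsPage(TypedDict, total=False):
    data: List[Dict[str, object]]  # keys depend on the requested tweet_fields
    meta: Meta
    includes: Dict[str, List[Dict[str, str]]]  # only present with expansions


def describe_page(page: ResultsPage) -> dict:
    """Summarize one results page without assuming optional keys exist."""
    return {
        "tweets": len(page.get("data", [])),
        "has_next_page": "next_token" in page.get("meta", {}),
        "expansions": sorted(page.get("includes", {})),
    }


page: ResultsPage = {
    "data": [{"id": "1355254021024526337", "author_id": "30864583"}],
    "meta": {"result_count": 1, "next_token": "b26v89c19zqg8o3foskt5kx7wqa14ffp6hpnqd8sxqc8t"},
    "includes": {"users": [{"id": "30864583", "name": "AEI", "username": "AEI"}]},
}

description = describe_page(page)
print(description)
```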