# The full-archive endpoint

Most of this comes from this tutorial:
https://dev.to/twitterdev/getting-historical-tweets-using-the-full-archive-search-endpoint-1agp

In [1]:
import os

# Set up the bearer token as an env variable
# os.environ['BEARER_TOKEN'] = '#Bearer Token here'

# To get the token use 
#    Python: os.environ['BEARER_TOKEN']
#    Bash: $BEARER_TOKEN
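
A minimal sketch of turning that environment variable into a request header in Python. The helper name `auth_header` is hypothetical and not part of any library; it only assumes the token is stored in `BEARER_TOKEN` as described above. A missing or empty token is one common cause of a 401 response, so the sketch fails early in that case.

```python
import os

def auth_header(env_var="BEARER_TOKEN"):
    """Build the Authorization header from an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early instead of sending a request without credentials.
        raise RuntimeError(f"{env_var} is not set; export it before calling the API")
    return {"Authorization": f"Bearer {token}"}
```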

## Building queries

A full-archive search of a single account, e.g. @TwitterDev, uses the *from:* operator (here *from:twitterdev*) in the *query* parameter:

In [82]:
# A basic query using curl
!curl --request GET 'https://api.twitter.com/2/tweets/search/all?query=from:twitterdev' --header "Authorization: Bearer $BEARER_TOKEN"


{"title":"Unauthorized","type":"about:blank","status":401,"detail":"Unauthorized"}

Queries are passed as URL parameters of the GET request sent by curl. There is a wide range of parameters to choose from, starting with the **search terms**, e.g. *query=(covid OR coronavirus)* below, and the **number of results** to return per response, e.g. *max_results=60* below. Spaces in the query are percent-encoded as *%20*.

In [36]:
# Searching for all tweets from any account.
# Adding search terms and the number of max results. 
!curl --request GET 'https://api.twitter.com/2/tweets/search/all?query=(covid%20OR%20coronavirus)&max_results=60' --header "Authorization: Bearer $BEARER_TOKEN"


{"data":[{"id":"1401907340945809412","text":"The latest The Occupational safety and health Daily! https://t.co/KOv0Ibc13A Thanks to @pathikrit2sen #covid19 #coronavirus"},{"id":"1401907340798988291","text":"RT @WHCOS: In just four months, progress in fighting COVID and getting the economy moving again is making a huge difference."},{"id":"1401907340710866954","text":"RT @DrTomFrieden: Decades of research went into mRNA technology, and clinical trials included more than 100,000 people. Covid vaccines were…"},{"id":"1401907340249493504","text":"@srivatsayb for fucker and motherfucker think that all things  is done  by modi such as covid 19 call modi strain"},{"id":"1401907340161519619","text":"RT @ForestRightsAct: \"Nearly 87% of Adivasis are forest dependent. MFP collection, transport &amp; sales are affected by..lockdown. State procu…"},{"id":"1401907339590918144","text":"RT @YopiGarisKeras: Akibat sibuk ngurusin Capres 2024 yg masih jauh, Gubernur tiktok malah blunder, Jateng enggak d
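
The same URL can be assembled in Python rather than encoded by hand. This is a sketch using only the standard library; `build_search_url` is a hypothetical helper name. Note that `urlencode` also percent-encodes the parentheses, which is equivalent to the literal form used in the curl call.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://api.twitter.com/2/tweets/search/all"

def build_search_url(query, **params):
    """Return the search URL with the query and extra parameters percent-encoded."""
    pairs = {"query": query, **params}
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return f"{BASE_URL}?{urlencode(pairs, quote_via=quote)}"

url = build_search_url("(covid OR coronavirus)", max_results=60)
print(url)
```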

The **search period** is specified with *start_time* and *end_time* in ISO-8601 format and works similarly:

In [81]:
!curl --request GET 'https://api.twitter.com/2/tweets/search/all?query=(covid%20OR%20coronavirus)&start_time=2020-10-01T00:00:00.00Z&end_time=2020-10-26T00:00:00.00Z&max_results=60' --header "Authorization: Bearer $BEARER_TOKEN"

{"title":"Unauthorized","type":"about:blank","status":401,"detail":"Unauthorized"}
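
A short sketch of producing such ISO-8601 timestamps from Python `datetime` objects. The helper name `to_api_time` is hypothetical, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone

def to_api_time(dt):
    """Format a datetime as an ISO-8601 UTC string, e.g. 2020-10-01T00:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

start_time = to_api_time(datetime(2020, 10, 1))
end_time = to_api_time(datetime(2020, 10, 26))
print(start_time, end_time)
```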

Additional response fields are requested with the *tweet.fields* and *user.fields* parameters, e.g. *created_at, author_id, referenced_tweets*. By default only two fields are returned: *id*, a unique identifier for each Tweet, and *text*, the actual text of the Tweet.

In [46]:
!curl --request GET 'https://api.twitter.com/2/tweets/search/all?query=(covid%20OR%20coronavirus)&start_time=2020-10-01T00:00:00.00Z&end_time=2020-10-26T00:00:00.00Z&max_results=60&tweet.fields=created_at,lang,conversation_id&user.fields=created_at,entities' --header "Authorization: Bearer $BEARER_TOKEN"
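
The field lists are plain comma-separated strings, so they can be built from Python lists. This is a sketch with a hypothetical helper `field_params`. The `expansions` entry is an assumption that is not in the curl call above: as far as I understand the v2 API, user objects are only returned when an expansion such as *author_id* is requested, so *user.fields* on its own may have no visible effect.

```python
def field_params(tweet_fields=(), user_fields=(), expansions=()):
    """Return the comma-joined field parameters, skipping any that are empty."""
    groups = {
        "tweet.fields": tweet_fields,
        "user.fields": user_fields,
        "expansions": expansions,
    }
    return {name: ",".join(values) for name, values in groups.items() if values}

params = field_params(
    tweet_fields=["created_at", "lang", "conversation_id"],
    user_fields=["created_at", "entities"],
    expansions=["author_id"],  # assumption: needed for user objects to be returned
)
print(params)
```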



# Using the *searchtweets-v2* client

This serves as a wrapper for the Twitter API v2 search endpoints (/search/recent and /search/all), providing a command-line utility and a Python library.

GitHub: https://github.com/twitterdev/search-tweets-python/tree/v2

PyPI: https://pypi.org/project/searchtweets-v2/

More Docs: https://twitterdev.github.io/search-tweets-python/searchtweets.html

In [1]:
from searchtweets import ResultStream, gen_request_parameters, load_credentials, collect_results, convert_utc_time
import pandas as pd
import numpy as np

In [2]:
# For credentials we have 2 options.

# Option 1: set them up as environment variables:
# os.environ['SEARCHTWEETS_ENDPOINT'] = 'https://api.twitter.com/2/tweets/search/all'
# os.environ['SEARCHTWEETS_BEARER_TOKEN'] = 'Bearer Token here'
# os.environ['SEARCHTWEETS_CONSUMER_KEY'] = 'API key here'
# os.environ['SEARCHTWEETS_CONSUMER_SECRET'] = 'API secret key here'

# Option 2: set up the .yaml file.
# With env_overwrite = True, any environment variables that are set overwrite the values found in the .yaml file.
credentials = load_credentials('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Twitter/twitter_keys.yaml', env_overwrite=True);

# The ';' at the end of the line keeps the result from being printed out.
# Make sure the .yaml is in .gitignore.
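
Before relying on the environment-variable option, it can help to check which variables are actually set. This is a sketch only: the helper `missing_credentials` is hypothetical, and treating the endpoint and bearer token as the required pair is an assumption, not something the library documents in this form.

```python
import os

# Assumption: these two are the minimum needed for bearer-token access.
REQUIRED = ("SEARCHTWEETS_ENDPOINT", "SEARCHTWEETS_BEARER_TOKEN")

def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED if not environ.get(name)]

example_env = {"SEARCHTWEETS_ENDPOINT": "https://api.twitter.com/2/tweets/search/all"}
print(missing_credentials(example_env))
```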

In [3]:
# A simple query
query = gen_request_parameters('jobs', results_per_call = 100)
print(query)

{"query": "jobs", "max_results": 100}


In [4]:
# The same search with the term quoted and a larger page size
query_x = gen_request_parameters('"jobs"', results_per_call = 500)
print(query_x)

# 1. collect_results function
tweets_x = collect_results(query_x,
                           max_tweets = 500,
                           result_stream_args = credentials)


There are two ways of collecting tweets. I am not yet sure whether the difference between the two matters for our purposes.

The first one, *collect_results*, is a "quick method to collect smaller amounts of Tweets to memory that requires less thought and knowledge". The documentation also reads: "For interactive environments and other cases where you don’t care about collecting your data in a single load or don’t need to operate on the stream of Tweets directly, I recommend using this convenience function." I am not sure what "collecting your data in a single load" means.

The second one is the *ResultStream* object, which is configured with the credentials (the documentation calls them *search_args*) and takes the query and other configuration parameters, including a hard stop on the number of pages to limit API call usage.

**The two methods return slightly different sets of tweets.**

For now I will use *collect_results*. Whether it suits the current application should become clearer once I start using it properly.
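
One possible reading of the distinction is the usual difference between building a full list in memory and iterating lazily over a stream. The sketch below illustrates only that general idea with hypothetical stand-ins (`fake_pages`, `stream_tweets`); it makes no network calls and is not a description of how the library implements either method.

```python
from itertools import islice

requests_made = []  # records which hypothetical pages were "requested"

def fake_pages(n_pages=3, per_page=2):
    """Hypothetical stand-in for paged API responses; no network involved."""
    for page in range(n_pages):
        requests_made.append(page)
        yield [{"id": str(page * per_page + i)} for i in range(per_page)]

def stream_tweets(pages):
    """Yield tweets one at a time, fetching the next page only when needed."""
    for page in pages:
        yield from page

# Collect everything into memory: all three pages are requested.
all_tweets = list(stream_tweets(fake_pages()))
pages_for_collect = len(requests_made)

# Stream and stop early: only the pages needed for three tweets are requested.
requests_made.clear()
first_three = list(islice(stream_tweets(fake_pages()), 3))
pages_for_stream = len(requests_made)
```

In this toy setup, stopping the stream early saves one page request, which is the kind of control over API usage that a streaming interface allows.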

In [1]:
# Collecting tweets

# # 1. collect_results function
# tweets1 = collect_results(query,
#                           max_tweets = 200,
#                           result_stream_args = credentials)


# 2. Using the ResultStream
# Run the import cell first; otherwise this cell fails with the NameError shown below.
rs = ResultStream(request_parameters=query,
                  max_pages=1,
                  max_tweets=200,
                  output_format="a",
                  **credentials)

print(rs)

tweets2 = list(rs.stream())


# # Print first 10 tweets for both

# print("collect_results function: \n")
# for i in tweets1[0:10]:
#     print(i['text'],'\n')
    
# print("\n\n\nResultStream: \n ")    
# for i in tweets2[0:10]:
#     print(i['text'],'\n')



NameError: name 'ResultStream' is not defined

In [33]:
tweets2

[{'id': '1402310338364780547',
  'text': 'RT @POTUS: I’m working hard to find common ground with Republicans when it comes to the American Jobs Plan, but I refuse to raise taxes on…'},
 {'id': '1402310336926060547',
  'text': 'RT @SenSchumer: Women with the same jobs, the same degrees, and same work experience are making less money than their male colleagues.\n\nFor…'},
 {'id': '1402310336141791232',
  'text': 'RT @zelorodolffo: se colocar todos os trabalhos que o Rodolffo fez nesse pós bbb em uma lista, tem gente por aí que fica em choque, viu? nã…'},
 {'id': '1402310335797866508',
  'text': 'RT @CharlyMatt: Due anni di esperienza MINIMA per uno stage da segretaria. Prima o poi questa stortura andrà risolta. https://t.co/92mLk7eo…'},
 {'id': '1402310335609069578',
  'text': 'RT @SenSchumer: Women with the same jobs, the same degrees, and same work experience are making less money than their male colleagues.\n\nFor…'},
 {'id': '1402310334753427457',
  'text': 'RT @PharmiWebJobs: CRA / 

In [32]:
tweets2[-1]

{'newest_id': '1402310241992183808',
 'oldest_id': '1402310139630211083',
 'result_count': 100,
 'next_token': 'b26v89c19zqg8o3fpdg7rbcqdq8stpgmibslekg3kxail'}
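
The *next_token* in this metadata is what links one page of results to the next. A minimal sketch of following it, assuming the raw API response nests the metadata under a *meta* key next to *data*. The `paginate` helper and the in-memory `FAKE_API` are hypothetical and stand in for real requests.

```python
def paginate(fetch_page, max_pages=None):
    """Yield tweets page by page, following meta.next_token until it runs out."""
    token = None
    pages = 0
    while max_pages is None or pages < max_pages:
        response = fetch_page(token)
        yield from response.get("data", [])
        pages += 1
        token = response.get("meta", {}).get("next_token")
        if token is None:
            break  # last page reached

# Hypothetical in-memory "API": maps a next_token to the page it returns.
FAKE_API = {
    None: {"data": [{"id": "3"}, {"id": "2"}],
           "meta": {"result_count": 2, "next_token": "t1"}},
    "t1": {"data": [{"id": "1"}], "meta": {"result_count": 1}},
}

all_ids = [t["id"] for t in paginate(FAKE_API.get)]
first_page_ids = [t["id"] for t in paginate(FAKE_API.get, max_pages=1)]
```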

In [56]:
# Compare overlap between the two
allIDs1 = [tweets1[i]['id'] for i in range(len(tweets1)-3)]
allIDs2 = [tweets2[i]['id'] for i in range(len(tweets2)-3)]

In [27]:
matchingIDs = list(set(allIDs1) & set(allIDs2))
len(matchingIDs)

198
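
An alternative to slicing off a fixed number of trailing entries is to keep only the items that carry an *id* key, which skips metadata dictionaries wherever they appear. This is a sketch on made-up data; `tweet_ids` is a hypothetical helper and the sample results are illustrative, not real API output.

```python
def tweet_ids(results):
    """Return the set of Tweet ids, ignoring entries without an 'id' key."""
    return {item["id"] for item in results if "id" in item}

# Illustrative data: two result lists, each ending with a metadata dict.
results_a = [{"id": "1", "text": "a"}, {"id": "2", "text": "b"},
             {"newest_id": "1", "oldest_id": "2", "result_count": 2}]
results_b = [{"id": "2", "text": "b"}, {"id": "3", "text": "c"},
             {"newest_id": "2", "oldest_id": "3", "result_count": 2}]

shared = tweet_ids(results_a) & tweet_ids(results_b)
only_in_a = tweet_ids(results_a) - tweet_ids(results_b)
```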

## Building queries using gen_request_parameters

In [7]:
# Dates can be given as 'YYYY-MM-DD'; the printed query shows them converted to ISO-8601.
# (convert_utc_time is imported above in case the conversion is needed separately.)
from_time = '2020-10-23'
to_time = '2020-10-26'

# Search term "jobs"; exclude retweets; limit to the US; English only
query2 = gen_request_parameters('"jobs" -is:retweet place_country:US lang:en',
                                start_time = from_time,  # from
                                end_time = to_time,      # to
                                tweet_fields = "id,created_at,text,public_metrics",
                                results_per_call = 100)
print(query2)

{"query": "\"jobs\" -is:retweet place_country:US lang:en", "max_results": 100, "start_time": "2020-10-23T00:00:00Z", "end_time": "2020-10-26T00:00:00Z", "tweet.fields": "id,created_at,text,public_metrics"}


In [25]:
tweets3 = collect_results(query2,
                         max_tweets = 200,
                         result_stream_args = credentials)

 HTTP Error code: 429: {"title":"Too Many Requests","type":"about:blank","status":429,"detail":"Too Many Requests"} | Too Many Requests
 Request payload: {'query': 'jobs -is:retweet place_country:US lang:en', 'max_results': 100, 'start_time': '2020-10-23T00:00:00Z', 'end_time': '2020-10-26T00:00:00Z', 'tweet.fields': 'id,created_at,text,public_metrics', 'next_token': 'b26v89c19zqg8o3fosbs51vjs8vu5o6cmz5zd8jowssql'}
Rate limit hit... Will retry...
Will retry in 4 seconds...


203
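
The log above shows the client retrying after a 429 response. The general pattern can be sketched as follows. This is not the library's own retry logic: `call_with_retry` is a hypothetical helper, the 4-second starting delay merely echoes the log, and doubling the delay on each attempt is an assumption.

```python
import time

def call_with_retry(request, max_retries=3, base_delay=4.0, sleep=time.sleep):
    """Call request() until it returns a status other than 429 or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = request()
        if status != 429:
            break
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 4 s, 8 s, 16 s, ...
    return status, body

# Hypothetical request that is rate limited twice before succeeding.
responses = iter([(429, "Too Many Requests"), (429, "Too Many Requests"), (200, "ok")])
delays = []  # record the waits instead of actually sleeping
result = call_with_retry(lambda: next(responses), sleep=delays.append)
```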

## Saving results to a dataframe

In [84]:
# Saving to dataframe
tweets3_df = pd.json_normalize(tweets3[:-1])
tweets3_df

| | created_at | text | id | public_metrics.retweet_count | public_metrics.reply_count | public_metrics.like_count | public_metrics.quote_count |
|---|---|---|---|---|---|---|---|
| 0 | 2020-10-25T23:59:23.000Z | @Gregt041 @SeanHouse90 @WPTV Let me get this s... | 1320515328955293703 | 0 | 2 | 0 | 0 |
| 1 | 2020-10-25T23:58:50.000Z | @patrickbetdavid China/Jobs | 1320515190572666882 | 0 | 0 | 0 | 0 |
| 2 | 2020-10-25T23:57:10.000Z | How I did my last fore jobs these 6 months wit... | 1320514772975132672 | 0 | 0 | 0 | 0 |
| 3 | 2020-10-25T23:55:22.000Z | We’re looking for a talented Restaurant Manage... | 1320514318778052608 | 1 | 0 | 0 | 0 |
| 4 | 2020-10-25T23:53:28.000Z | Was @GovWhitmer in on purchase of @HennigesCon... | 1320513842644979712 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 2020-10-25T18:05:36.000Z | @SpeakerPelosi Need to get with the people nee... | 1320426297580015617 | 0 | 0 | 0 | 0 |
| 196 | 2020-10-25T18:03:13.000Z | @melissarobbins_ Yes, I’ve started using it du... | 1320425696913743872 | 0 | 0 | 1 | 0 |
| 197 | 2020-10-25T18:02:59.000Z | this play calling by jets OC just shows how ba... | 1320425638130552836 | 0 | 0 | 2 | 0 |
| 198 | 2020-10-25T18:02:24.000Z | @PurpleGimp There must be a reason for that. ... | 1320425491468222464 | 0 | 0 | 0 | 0 |


In [86]:
max(tweets3_df['public_metrics.retweet_count'])

33
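
The dotted column names such as *public_metrics.retweet_count* come from flattening the nested *public_metrics* dictionary of each Tweet. A rough standard-library sketch of that flattening step is shown here; the `flatten` helper is hypothetical and only approximates what *pd.json_normalize* does for nested dictionaries. The sample record reuses the values of row 1 in the table above.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))  # recurse into nested dicts
        else:
            flat[name] = value
    return flat

tweet = {
    "created_at": "2020-10-25T23:58:50.000Z",
    "text": "@patrickbetdavid China/Jobs",
    "id": "1320515190572666882",
    "public_metrics": {"retweet_count": 0, "reply_count": 0,
                       "like_count": 0, "quote_count": 0},
}
row = flatten(tweet)
print(sorted(row))
```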