# Twitter V2 Full Archive Search

This document shows how to use Tweepy to conduct a full archive search using v2 of the Twitter API.

## Prep work

In order to use this code, you will need to have a developer account on Twitter, with access to the Academic Research product track. Information about who is eligible and how to apply is [here] OK (https://developer.twitter.com/en/products/twitter-api/academic-research).

Once you have an account, you will need to create a new app at https://developer.twitter.com/en/portal/dashboard and generate a "bearer token" from the app. OK Copy the bearer token to your clipboard and paste it into a new file in the same directory as this file, called `twitter_authentication.py`. The entire contents of the file should look like this:

```python
bearer_token = "YOUR BEARER TOKEN HERE"
```

Note that you should **never** share this token with anyone else. If, for example, you are saving your work in a Git repository, make sure that you add the `twitter_authentication.py` file to your `.gitignore`.

If anyone gets this token, they will have access to your Twitter account and you will need to revoke the token (from the same interface where you created it).

If you've created the file successfully, then the following two blocks of code should work.

In [70]:
folder_path = '/Users/abdeslamguessous/Documents/GitHub/SemesterProject'

In [1]:
#pip install tweepy

In [2]:
#pip install twitter

In [3]:
#import twitter_authentication

In [11]:
import tweepy
#from twitter_authentication import bearer_token
import time
import pandas as pd

In [12]:
bearer_token 

'AAAAAAAAAAAAAAAAAAAAAKcJhgEAAAAAFXJGrhYrtjFERgklJw1Hw07WYys%3DNfILFWz6S6aCVIITtoUjogHfPFbl70d7B1DQlpjNbB2JQ7tTtk'

### Debugging concerns

In [19]:
bearer_token= 'AAAAAAAAAAAAAAAAAAAAAHH9iAEAAAAAkMyMGd3LrN6sWA00oOgRc767oz8%3Dq1kBjW6Rq4Ksy6klsYkycODzv1g2sGdk01d2h2JtcXkk8LdNtn'

In [16]:
# Option 1 : https://www.jcchouinard.com/how-to-use-twitter-api-with-python/ with Abdeslam credentials

import tweepy

api_key = "7PwXsb3XZn4mJvX5ZiEYidVmF"
api_secrets = "ySKzd5sdRrywo4LsgwWLmTcBInCXFV2nRTIsFBtPLFn3l1oDlO"
access_token = "1574386671528939521-CCMs5eiKhrMOj9LLSrXbLWpLZHndKP"
access_secret = "W10k6NJzUAssxfJ2Pg2rpiegur4H91iWxZh83ZHx9LHLr"
#access_secret = "..."
 
# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key,api_secrets)
auth.set_access_token(access_token,access_secret)
 
api = tweepy.API(auth)
 
try:
    api.verify_credentials()
    print('Successful Authentication')
except:
    print('Failed authentication')


Successful Authentication


In [9]:
# Option 2 : https://www.jcchouinard.com/how-to-use-twitter-api-with-python/ with Aya credentials
import tweepy

api_key = "Aa6EBqX1cbmqOnTPhG8A8TtMR"
api_secrets = "4RmMu7l9TkKlT7hHCrGqDIU1LaDenCpE6XPi0EwJgI2r8F7asz"
access_token = "AAAAAAAAAAAAAAAAAAAAAKcJhgEAAAAAbA54DO%2FjWW1%2Bdw%2BCBkYVkx8GgJo%3DHySsz3FJd3zucxhld6HHCr4HhSOTqKWMwaaMjMDCbNujKiDAmL"
access_token = "1574384793797754881-qjVZrICjvjwp6dGnWVkLQIht0xPPw2"
access_secret = "sySzUWLoDF2UrlPVmpX75Tm7UZOtzZcP6bb74dvN9KfnF"
#access_secret = "..."
 
# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key,api_secrets)
auth.set_access_token(access_token,access_secret)
 
api = tweepy.API(auth)
 
try:
    api.verify_credentials()
    print('Successful Authentication')
except:
    print('Failed authentication')


Failed authentication


In [35]:
client = tweepy.Client(bearer_token, wait_on_rate_limit=True) ### IMPORTANT

In [8]:
import tweepy

client = tweepy.Client(bearer_token)
for response in tweepy.Paginator(client.get_users_followers, 2244994945,
                                    max_results=1000, limit=5):
    print(response.meta)

for tweet in tweepy.Paginator(client.search_recent_tweets, "Tweepy",
                                max_results=100).flatten(limit=250):
    print(tweet.id)

{'result_count': 1000, 'next_token': 'IFM2A6PF0OS1GZZZ'}
{'result_count': 1000, 'next_token': 'G0BBL45IPCPHGZZZ', 'previous_token': '42QU3CN0V73UEZZZ'}
{'result_count': 1000, 'next_token': 'FE833383E4N1GZZZ', 'previous_token': 'NE1SEFJJ6J6EEZZZ'}
{'result_count': 1000, 'next_token': 'ISJUCAFB5GKHGZZZ', 'previous_token': 'LRT3UQPDIJ8UEZZZ'}
{'result_count': 1000, 'next_token': 'QR69EIP6JGJ1GZZZ', 'previous_token': 'P2C63UDSQFBEEZZZ'}
1581730467413311488
1581729964088778752
1581729590849900544
1581729460763885570
1581728957459623936
1581728584422785024
1581728543934816257
1581728454122164224
1581727950835437568
1581727447531151361
1581727156207775744
1581726944236023808
1581726440902385664
1581725937569124353
1581725434214903809
1581724930999128064
1581723962316582912
1581723458995572736
1581722955712954368
1581722452367118336
1581721949020991489
1581721445729652737
1581721005365792770
1581720942434082817
1581720439096717312
1581719935763755008
1581719432468238338
1581718929164357632
158

## The Search API

Full documentation for searching tweets is at https://docs.tweepy.org/en/latest/client.html#search-tweets. There are a lot of different options, but here is a simple version that gets all of the "COVID hoax" tweets from January 10, 2021. 

By default the only information returned is the tweet ID and the text. Often, we will want information about authors, too. To get information about the author, you need to add the `user_fields` parameter with the fields you want as well as the `expansions = 'author_id'` parameter. 

To get more information about the tweet, you need the `tweet_fields` parameter. The options are shown at https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all

You also likely want to build a somewhat advanced query - instructions are at https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. For this query, I get English language tweets that are not retweets.


In [17]:
import tweepy

client = tweepy.Client(bearer_token)
for response in tweepy.Paginator(client.get_users_followers, 2244994945, query='BTC',
                                    max_results=1000, limit=5):
    print(response.meta)

for tweet in tweepy.Paginator(client.search_recent_tweets, "Tweepy",
                                max_results=100).flatten(limit=250):
    print(tweet.id)

Unexpected parameter: query


BadRequest: 400 Bad Request
The query parameter [query] is not one of [id,max_results,pagination_token,expansions,tweet.fields,user.fields]

In [101]:
## Version 1
btc_tweets = []
tmp_val = 0
for response in tweepy.Paginator(client.search_all_tweets, 
                                 query = 'btc -is:retweet lang:en',
                                 user_fields = ['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions = 'author_id',
                                 start_time = '2021-01-20T00:00:00Z',
                                 end_time = '2021-01-21T00:00:00Z',
                              max_results=90):
    print(tmp_val)
    tmp_val+=1
    time.sleep(1)
    btc_tweets.append(response)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

Note that I followed the best practice above of saving the raw response returned. If this were a real project, I would write out all of the raw responses into a file. For long-running queries (e.g., if you need to get hundreds of thousands of tweets), you will often want to build in some error handling and a way to resume data collection. For example, you might write all of the results to a file and then open the file, retrieve the last tweet, and use the ID of that tweet to tell the script where to start to retrieve new tweets.

The other problem is that the object that is returned is a bit confusing - it is nested, with the tweet data in `.data` and the user data in `.includes['users']`.

In [102]:
len(btc_tweets)

395

In [103]:
btc_tweets[0].data[0]

<Tweet id=1352043220990046212 text='@HatinHate BTC is always right on point! 🤣🤣🤣'>

In [104]:
btc_tweets[0].includes['users'][2]

<User id=1284608768924237825 name=1nfoCompuCrypt0 username=InfoCompuCrypto>

Note that both of these are objects. The data that we asked for in `user_fields` and `tweet_fields` above are attributes of the objects. For example, here's the user's description:

In [105]:
btc_tweets[1].includes['users'][3].description

''

We will often want to reorganize these into a flat file, which means connecting a tweet to the user data of the user who wrote it. I show an example of how to do that here:

In [106]:
result = []
user_dict = {}
# Loop through each response object
for response in btc_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username, 
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                             }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id, 
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count']
                      })

# Change this list of dictionaries into a dataframe
df = pd.DataFrame(result)

In [108]:
df

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,30246537,zymminy,5146,9910,DogMom to Queen Fae♥️Animal Lover-Rescuer-Advo...,Social Distancing in USA 😷,@HatinHate BTC is always right on point! 🤣🤣🤣,2021-01-20 23:59:58+00:00,0,1,1,0
1,455093591,_MiCrypto_,14,419,Small gems on my radar $TRADERWALLET,,@MMCrypto It all depends on btc. And at this m...,2021-01-20 23:59:46+00:00,0,0,4,0
2,1284608768924237825,InfoCompuCrypto,39,110,,,@BunkFreamon @JaEsf @RyanSAdams I think the re...,2021-01-20 23:59:42+00:00,0,1,2,0
3,1184221857874087936,Alsheikh_eth,2629,25272,"The Sheikh of #Crypto; #ENS, $LINK, $LPL, $TRI...",Village by the Red Sea,"@BizFuego Fren, but when the blow off top happ...",2021-01-20 23:59:35+00:00,0,1,1,0
4,1037769801526063104,CryptoRadi,477,1417,"(🦋,🦋)",Underwater,$VET 4H 🎯\n\n#Gann #Altcoins #BTC $ETH $BTC ht...,2021-01-20 23:59:30+00:00,0,1,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...
28841,1176621624948404224,bti_trading,12571,67659,Pinescript wizard | A community of 5K @Trading...,www.best-trading-indicator.com,Entry Signal Time: 19/1 18:20\nBINANCE:BCHUSDT...,2021-01-20 00:00:00+00:00,0,0,0,0
28842,1107640745761091586,Cryptonomia1,1783,18726,⚡️LIVE Real-time Alerts | #crypto #news #tradi...,Global,$ETH/BTC 📈 Bearish RSI Divergence | Interval:...,2021-01-20 00:00:00+00:00,0,0,0,0
28843,80722677,TimBeiko,96190,13224,helping @ethereum win,🇨🇦,@nicksdjohnson @hasufl @mohsen_ghajar Big diff...,2021-01-20 00:00:00+00:00,0,1,4,0
28844,1117701695688060929,BinanceRekts,1159,12094,We are watching bitmex and bybit for anything ...,,🔉 ⚠️ $BTC 4 Hours Update ⚠️ 🔉\nPrice @ $35881 ...,2021-01-20 00:00:00+00:00,0,0,0,0


In [109]:
df.columns

Index(['author_id', 'username', 'author_followers', 'author_tweets',
       'author_description', 'author_location', 'text', 'created_at',
       'retweets', 'replies', 'likes', 'quote_count'],
      dtype='object')

In [110]:
df.to_csv(folder_path+'Visualize2.csv')

## `requests`-based version

If you want to do things without tweepy, here is some boilerplate code that should work. As you can see, it's much more complicated. Be grateful for the tweepy developers!! :)

In [100]:
import requests
import os
import json
import twitter_authentication as config
import time

# Save your bearer token in a file called twitter_authentication.py in this directory
# Should look like this:
# bearer_token = 'YOUR_BEARER_TOKEN_HERE'
bearer_token= bearer_token

query = '(#BTC) OR (#btc)'
out_file = 'raw_tweets.txt'

search_url = "https://api.twitter.com/2/tweets/search/all"

# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
query_params = {'query': query,
                'start_time': '2010-01-01T12:00:00Z',
                'tweet.fields': 'author_id,public_metrics',
                 'user.fields': 'username',
                'expansions': 'author_id',
                'max_results': 500
               }


def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers, params, next_token = None):
    if next_token:
        params['next_token'] = next_token
    response = requests.request("GET", search_url, headers=headers, params=params)
    time.sleep(3.1)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def get_tweets(num_tweets, output_fh):
    next_token = None
    tweets_stored = 0
    while tweets_stored < num_tweets:
        headers = create_headers(bearer_token)
        json_response = connect_to_endpoint(search_url, headers, query_params, next_token)
        if json_response['meta']['result_count'] == 0:
            break
        author_dict = {x['id']: x['username'] for x in json_response['includes']['users']}
        for tweet in json_response['data']:
            try:
                tweet['username'] = author_dict[tweet['author_id']]
            except KeyError:
                print(f"No data for {tweet['author_id']}")
            output_fh.write(json.dumps(tweet) + '\n')
            tweets_stored += 1
        try:
            next_token = json_response['meta']['next_token']
        except KeyError:
            break
    return None



def main():
    with open(out_file, 'w') as f:
        get_tweets(500, f)



main()

403


Exception: (403, '{"client_id":"25561511","detail":"When authenticating requests to the Twitter API v2 endpoints, you must use keys and tokens from a Twitter developer App that is attached to a Project. You can create a project via the developer portal.","registration_url":"https://developer.twitter.com/en/docs/projects/overview","title":"Client Forbidden","required_enrollment":"Standard Basic","reason":"client-not-enrolled","type":"https://api.twitter.com/2/problems/client-forbidden"}')

In [None]:
tweets = []
with open(out_file, 'r') as f:
    for row in f.readlines():
        tweet = json.loads(row)
        tweets.append(tweet)

In [None]:
tweets[0]