## Building a dataset
The goal of this notebook is to scrape tweets from Twitter in order to build a suitable dataset for classifying SDG-related tweets.

In [None]:
# IMPORTS
import tweepy
import os
import requests
import json
import pandas as pd
import csv
import time
import dateutil.parser  # import the submodule explicitly; dateutil.parser.parse is used below

In [None]:
# set the bearer token as an environment variable (the token used here is an academic token)
os.environ['TOKEN_ACADEMIC'] = 'INSERT YOUR ACADEMIC TOKEN'

In [None]:
# functions

# retrieve the token from the environment variable to be used by the app
def auth():
    return os.getenv('TOKEN_ACADEMIC')

# create headers
# sets the headers for a future request to the Twitter API
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

# create the URL and query parameters for the full-archive search
def create_url_search_archive(keyword, start_date, end_date, max_results=10):
    
    # set the api endpoint to archive (all)
    search_url = "https://api.twitter.com/2/tweets/search/all"
    
    #change params based on the endpoint you are using
    query_params = {'query': keyword,
                    'start_time': start_date,
                    'end_time': end_date,
                    'max_results': max_results,
                    'tweet.fields': 'id,text,author_id,created_at,lang,public_metrics',
                    'next_token': {}}
    return (search_url, query_params)

#connect to the endpoint
def connect_to_endpoint(url, headers, params, next_token = None, silent=False):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    if not silent:
        print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

### Build the query

For more information on how to build queries for the Twitter API, refer to the Twitter API documentation.\
NOTE: In later archive searches we used the Twitter no-code API tool for academic users.

Query log:\
'#SDGs lang:en -is:retweet'
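
The logged query combines a hashtag with two filters. As a minimal sketch, a string like this could be assembled from its parts; `build_query` and its parameters are hypothetical names and not part of this notebook, and the sketch uses only the `lang:`, `-is:retweet`, and `OR` operators.

```python
def build_query(terms, lang="en", exclude_retweets=True):
    """Join search terms and filters into one query string (hypothetical helper)."""
    if not terms:
        raise ValueError("at least one search term is required")
    # Several terms are OR-ed and grouped so the filters apply to all of them.
    core = terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"
    parts = [core]
    if lang:
        parts.append(f"lang:{lang}")
    if exclude_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)


print(build_query(["#SDGs"]))  # prints: #SDGs lang:en -is:retweet
```

Keeping the parts separate makes it easier to log exactly which variant produced which dataset file.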


In [None]:
# function calls
#Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword_eng = '#SDGs lang:en -is:retweet'
start_time = "2015-09-01T00:00:00.000Z" # UN Sustainable Development Summit (Sept 2015); the SDGs are set for 2030
end_time = "2018-08-10T12:18:23.000Z" # ADAPT to build different datasets

# predefined time is last week
max_results = 500 # tweet count exploration shows that only about 1.5M matching tweets exist (500 is the maximum per request)
next_token = None

#url_eng = create_url_search_recent(keyword_eng, max_results)#start_time,end_time,
url_eng, query_params_eng = create_url_search_archive(keyword_eng, start_time, end_time, max_results)

# Set the storage parameters:
filename='tweetsSDG_en5.csv' # ADAPT TO BUILD DATASET
field_heathers=['twid','authid','created_at','lang','like_count','quote_count','reply_count','retweet_count','text']

### Call twitter API
Use the built query to retrieve data from tweets and store it in a csv file.
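
The retrieval step boils down to following `next_token` from page to page and writing one CSV row per tweet. The sketch below separates those two concerns. All function names are hypothetical, the pages are made-up stand-ins for API responses, and `created_at` is stored as the raw string instead of being parsed.

```python
import csv
import io


def iter_pages(fetch_page):
    """Yield response pages until no next_token is returned."""
    next_token = None
    while True:
        page = fetch_page(next_token)
        yield page
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break


def tweet_to_row(tweet):
    """Flatten one tweet object into the column order used by the notebook."""
    metrics = tweet["public_metrics"]
    return [tweet["id"], tweet["author_id"], tweet["created_at"], tweet["lang"],
            metrics["like_count"], metrics["quote_count"],
            metrics["reply_count"], metrics["retweet_count"], tweet["text"]]


def write_tweets(fetch_page, handle, header):
    """Write a header plus one row per tweet; return the number of tweets."""
    writer = csv.writer(handle)
    writer.writerow(header)
    count = 0
    for page in iter_pages(fetch_page):
        # A page without results carries no 'data' key.
        for tweet in page.get("data", []):
            writer.writerow(tweet_to_row(tweet))
            count += 1
    return count


# Demo with made-up pages standing in for real API responses.
def fake_tweet(tweet_id):
    return {"id": tweet_id, "author_id": "42",
            "created_at": "2018-01-01T00:00:00.000Z",
            "lang": "en", "text": "example",
            "public_metrics": {"like_count": 0, "quote_count": 0,
                               "reply_count": 0, "retweet_count": 0}}


fake_pages = {
    None: {"data": [fake_tweet("1"), fake_tweet("2")], "meta": {"next_token": "abc"}},
    "abc": {"data": [fake_tweet("3")], "meta": {"result_count": 1}},
}
columns = ["twid", "authid", "created_at", "lang", "like_count",
           "quote_count", "reply_count", "retweet_count", "text"]
buffer = io.StringIO()
stored = write_tweets(fake_pages.get, buffer, columns)
print(stored)  # prints: 3
```

With the notebook's helpers, `fetch_page` could be a small lambda around `connect_to_endpoint`, and `handle` could come from a `with open(...)` block so the file is closed even if a request raises. The pause between requests is left out of the sketch.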

In [None]:
# Create and open file to store tweet rows + write the header row
csvFile = open(filename, "w", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(field_heathers)

counter = 0

while True:
    json_response = connect_to_endpoint(url_eng, headers, query_params_eng, next_token, silent=True)
    time.sleep(2) # To avoid spamming the API

    #Loop through each tweet
    for tweet in json_response.get('data', []):  # 'data' is absent when a page has no results

        fields=[tweet['id'],
                tweet['author_id'],
                dateutil.parser.parse(tweet['created_at']),
                tweet['lang'],
                tweet['public_metrics']['like_count'],
                tweet['public_metrics']['quote_count'],
                tweet['public_metrics']['reply_count'],
                tweet['public_metrics']['retweet_count'],
                tweet['text']]

        # Append the result to the CSV file
        csvWriter.writerow(fields)
        counter += 1
    print("Total # of Tweets added: ", counter)
    
    if 'next_token' in json_response['meta']:
        next_token = json_response['meta']['next_token']
    else:
        print('Total tweets stored: ', counter)
        break
csvFile.close()

Total # of Tweets added:  500
Total # of Tweets added:  992
Total # of Tweets added:  1489
Total # of Tweets added:  1989
Total # of Tweets added:  2482
Total # of Tweets added:  2980
Total # of Tweets added:  3476
Total # of Tweets added:  3972
Total # of Tweets added:  4467
Total # of Tweets added:  4962
Total # of Tweets added:  5456
Total # of Tweets added:  5954
Total # of Tweets added:  6451
Total # of Tweets added:  6950
Total # of Tweets added:  7446
Total # of Tweets added:  7944
Total # of Tweets added:  8442
Total # of Tweets added:  8941
Total # of Tweets added:  9439
Total # of Tweets added:  9938
Total # of Tweets added:  10430
Total # of Tweets added:  10927
Total # of Tweets added:  11420
Total # of Tweets added:  11916
Total # of Tweets added:  12414
Total # of Tweets added:  12913
Total # of Tweets added:  13409
Total # of Tweets added:  13908
Total # of Tweets added:  14406
Total # of Tweets added:  14903
Total # of Tweets added:  15403
Total # of Tweets added:  1590

Total # of Tweets added:  127764
Total # of Tweets added:  128264
Total # of Tweets added:  128762
Total # of Tweets added:  129259
Total # of Tweets added:  129757
Total # of Tweets added:  130254
Total # of Tweets added:  130750
Total # of Tweets added:  131250
Total # of Tweets added:  131747
Total # of Tweets added:  132244
Total # of Tweets added:  132744
Total # of Tweets added:  133244
Total # of Tweets added:  133740
Total # of Tweets added:  134234
Total # of Tweets added:  134732
Total # of Tweets added:  135232
Total # of Tweets added:  135726
Total # of Tweets added:  136226
Total # of Tweets added:  136722
Total # of Tweets added:  137221
Total # of Tweets added:  137721
Total # of Tweets added:  138218
Total # of Tweets added:  138712
Total # of Tweets added:  139211
Total # of Tweets added:  139708
Total # of Tweets added:  140207
Total # of Tweets added:  140704
Total # of Tweets added:  141200
Total # of Tweets added:  141696
Total # of Tweets added:  142195
...

Total # of Tweets added:  251703
Total # of Tweets added:  252201
Total # of Tweets added:  252699
Total # of Tweets added:  253195
Total # of Tweets added:  253695
Total # of Tweets added:  254192
Total # of Tweets added:  254691
Total # of Tweets added:  255190
Total # of Tweets added:  255687
Total # of Tweets added:  256183
Total # of Tweets added:  256679
Total # of Tweets added:  257175
Total # of Tweets added:  257674
Total # of Tweets added:  258173
Total # of Tweets added:  258671
Total # of Tweets added:  259169
Total # of Tweets added:  259669
Total # of Tweets added:  260168
Total # of Tweets added:  260666
Total # of Tweets added:  261165
Total # of Tweets added:  261663
Total # of Tweets added:  262161
Total # of Tweets added:  262658
Total # of Tweets added:  263157
Total # of Tweets added:  263657
Total # of Tweets added:  264155
Total # of Tweets added:  264652
Total # of Tweets added:  265143
Total # of Tweets added:  265639
Total # of Tweets added:  266139
...

Total # of Tweets added:  375464
Total # of Tweets added:  375959
Total # of Tweets added:  376459
Total # of Tweets added:  376959
Total # of Tweets added:  377458
Total # of Tweets added:  377955
Total # of Tweets added:  378454
Total # of Tweets added:  378947
Total # of Tweets added:  379446
Total # of Tweets added:  379944
Total # of Tweets added:  380436
Total # of Tweets added:  380933
Total # of Tweets added:  381433
Total # of Tweets added:  381930
Total # of Tweets added:  382426
Total # of Tweets added:  382924
Total # of Tweets added:  383423
Total # of Tweets added:  383921
Total # of Tweets added:  384421
Total # of Tweets added:  384921
Total # of Tweets added:  385421
Total # of Tweets added:  385919
Total # of Tweets added:  386418
Total # of Tweets added:  386913
Total # of Tweets added:  387410
Total # of Tweets added:  387900
Total # of Tweets added:  388399
Total # of Tweets added:  388896
Total # of Tweets added:  389394
Total # of Tweets added:  389893
...

Total # of Tweets added:  499240
Total # of Tweets added:  499730
Total # of Tweets added:  500224
Total # of Tweets added:  500720
Total # of Tweets added:  501212
Total # of Tweets added:  501704
Total # of Tweets added:  502200
Total # of Tweets added:  502698
Total # of Tweets added:  503197
Total # of Tweets added:  503695
Total # of Tweets added:  504193
Total # of Tweets added:  504693
Total # of Tweets added:  505190
Total # of Tweets added:  505689
Total # of Tweets added:  506188
Total # of Tweets added:  506687
Total # of Tweets added:  507186
Total # of Tweets added:  507685
Total # of Tweets added:  508185
Total # of Tweets added:  508684
Total # of Tweets added:  509184
Total # of Tweets added:  509684
Total # of Tweets added:  510183
Total # of Tweets added:  510683
Total # of Tweets added:  511183
Total # of Tweets added:  511683
Total # of Tweets added:  512180
Total # of Tweets added:  512680
Total # of Tweets added:  513180
Total # of Tweets added:  513679
...

Total # of Tweets added:  622658
Total # of Tweets added:  623153
Total # of Tweets added:  623649
Total # of Tweets added:  624147
Total # of Tweets added:  624643
Total # of Tweets added:  625140
Total # of Tweets added:  625629
Total # of Tweets added:  626128
Total # of Tweets added:  626625
Total # of Tweets added:  627123
Total # of Tweets added:  627621
Total # of Tweets added:  628115
Total # of Tweets added:  628613
Total # of Tweets added:  629113
Total # of Tweets added:  629613
Total # of Tweets added:  630111
Total # of Tweets added:  630610
Total # of Tweets added:  631109
Total # of Tweets added:  631606
Total # of Tweets added:  632106
Total # of Tweets added:  632604
Total # of Tweets added:  633103
Total # of Tweets added:  633601
Total # of Tweets added:  634101
Total # of Tweets added:  634599
Total # of Tweets added:  635098
Total # of Tweets added:  635596
Total # of Tweets added:  636093
Total # of Tweets added:  636590
Total # of Tweets added:  637088
...

In [None]:
csvFile.close()

* 1: Total # of Tweets added:  187365 \
Exception: (429, '{"title":"Too Many Requests","detail":"Too Many Requests","type":"about:blank","status":429}')

* 2: (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f023b491430>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
* 3: interrupted
* 4: same as 2 (45271 tweets)
* 5: interrupted
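
Runs 1, 2 and 4 above stopped on errors that are often transient (rate limiting and a failed name resolution). One possible mitigation, sketched here under stated assumptions, is to retry such calls with exponential backoff. `call_with_backoff` and `is_retryable` are hypothetical helpers; the sketch assumes errors are raised as `Exception(status_code, text)`, as `connect_to_endpoint` does, and that connection failures derive from `OSError`, as the exceptions in `requests` do.

```python
import time


def is_retryable(exc):
    """Return True for errors that are usually transient."""
    # connect_to_endpoint raises Exception(status_code, text) on non-200 replies.
    if exc.args and exc.args[0] == 429:
        return True
    # Connection failures raised by requests derive from OSError.
    return isinstance(exc, OSError)


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry transient failures, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            # Give up on permanent errors or once the attempts are used up.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The delays here are illustrative. For a 429 response, waiting until the rate-limit window resets would be the more precise choice.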

In [None]:
df = pd.read_csv('tweetsSDG_en.csv')
display(df.head())

Unnamed: 0,twid,authid,created_at,lang,like_count,quote_count,reply_count,retweet_count,text
0,1498444897636036608,2875796095,2022-02-28 23:47:58+00:00,en,8,0,1,14,@HakomTimeSeries @thenextweb @Analytics_699 @j...
1,1498442516676624387,981318439154282497,2022-02-28 23:38:30+00:00,en,0,0,0,0,#ModelShave Waist length red hair lady becomes...
2,1498442347905892358,986793528394104832,2022-02-28 23:37:50+00:00,en,3,0,0,1,#PNG #DigitalTransformation - NEW course to l...
3,1498442248953946112,1443986304178475012,2022-02-28 23:37:26+00:00,en,2,0,0,0,It also calls for the full participation of wo...
4,1498442101729775618,1443986304178475012,2022-02-28 23:36:51+00:00,en,2,0,1,0,SDG 5 aims to achieve gender equality by endin...


In [None]:
display(df.tail())
df.info()

Unnamed: 0,twid,authid,created_at,lang,like_count,quote_count,reply_count,retweet_count,text
187335,1371793809038848005,1154381411614416906,2021-03-16 12:01:45+00:00,en,1,0,0,0,Help me plant a forest. Sign up for my #TechWi...
187336,1371793676721139713,1141826418822848512,2021-03-16 12:01:13+00:00,en,1,0,0,0,Great news 🙌 We are in partnership with\nImpac...
187337,1371793494608650244,202313343,2021-03-16 12:00:30+00:00,en,7,0,0,1,The only way to meet the Paris Agreement goals...
187338,1371793371405066250,90561618,2021-03-16 12:00:01+00:00,en,0,0,0,0,Let’s Talk EduClowns! Did you know they've per...
187339,1371793203062398983,747357300638158848,2021-03-16 11:59:20+00:00,en,0,0,0,0,@SustInsti @GlobeScan It would be interesting ...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187340 entries, 0 to 187339
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   twid           187340 non-null  int64 
 1   authid         187340 non-null  int64 
 2   created_at     187340 non-null  object
 3   lang           187340 non-null  object
 4   like_count     187340 non-null  int64 
 5   quote_count    187340 non-null  int64 
 6   reply_count    187340 non-null  int64 
 7   retweet_count  187340 non-null  int64 
 8   text           187340 non-null  object
dtypes: int64(6), object(3)
memory usage: 12.9+ MB


In [None]:
# test to retrieve data
print(json_response['data'][0]['created_at']) #tweets are printed from newest to oldest, so [0] is the most recent
print(json_response['data'][-1]['created_at'])
#print(json_response['data'][0])

#test to retrieve metadata
print(json_response['meta']['result_count'])
#print(json_response['meta'])


2022-02-18T13:18:51.000Z
2022-02-18T13:17:28.000Z
15
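
Because results arrive newest first, the oldest timestamp already stored marks where an interrupted run stopped. The sketch below shows one way to turn that timestamp into the next `end_time`. `next_end_time` is a hypothetical helper, and the input strings are assumed to look like the `created_at` values in the CSV preview above.

```python
from datetime import datetime, timezone


def next_end_time(created_at_values):
    """Return the oldest timestamp, formatted like the notebook's end_time."""
    oldest = min(datetime.fromisoformat(value) for value in created_at_values)
    return oldest.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")


stored_timestamps = ["2021-03-16 12:01:45+00:00", "2021-03-16 11:59:20+00:00"]
print(next_end_time(stored_timestamps))  # prints: 2021-03-16T11:59:20.000Z
```

Tweets posted in the same second as the boundary may be missed or fetched twice depending on how the endpoint treats `end_time`, so dropping duplicate `twid` values after merging the files is a sensible safeguard.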
