<a href="https://colab.research.google.com/github/denniesbor/TwitterPython/blob/RawNotebooks/Twitter_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook collects tweets through the Twitter API and exports them as data frames for downstream NLP tasks such as sentiment analysis, emotion analysis, and topic modelling.

The Twitter API offers three access levels: Essential, Elevated, and Academic Research. We use Elevated access, which allows retrieving up to 2 million tweets per month. However, some restrictions apply when writing queries, for example when acquiring tweets from a specific region.

This tutorial downloads tweets on Kenyan trends from the past week.

The four Twitter API endpoints and Tweepy helpers used are listed below:

1. Standard search API: To query tweets using key phrases.
2. Get trends/places API: To query trending hashtags.
3. Client & Paginator API: To page through more tweets with API v2.
4. Cursor API: Complements the search API to retrieve more tweets.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd /content/drive/MyDrive/Summer2022

/content/drive/MyDrive/Summer2022


In [None]:
# update tweepy

!pip install tweepy --upgrade

Collecting tweepy
  Downloading tweepy-4.9.0-py3-none-any.whl (77 kB)
Collecting requests<3,>=2.27.0
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Installing collected packages: requests, tweepy
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: tweepy
    Found existing installation: tweepy 3.10.0
    Uninstalling tweepy-3.10.0:
      Successfully uninstalled tweepy-3.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is 

## 1. Import Libraries

In [None]:
# Import libraries
import tweepy
import pandas as pd
import numpy as np
import datetime
from datetime import date

# save keys in config.py
import config

In [None]:
# !touch config.py

## 2. Twitter Authentication

Set up Twitter authentication. Credentials are stored separately in a config file.
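
One way to keep the credentials out of the notebook is to have `config.py` read them from environment variables. The block below is a minimal sketch of such a file, not the author's actual `config.py`: the attribute names match what the notebook reads from `config`, while the `read_secret` helper and the `TWITTER_*` environment variable names are assumptions made for illustration.

```python
# config.py (hypothetical layout): read secrets from environment variables
# instead of hard-coding them in the notebook or committing them to Git.
import os


def read_secret(name: str) -> str:
    """Return the environment variable `name`, or an empty string if unset."""
    return os.environ.get(name, "")


# Attribute names match those the notebook reads from `config`;
# the TWITTER_* environment variable names are assumptions.
api_key = read_secret("TWITTER_API_KEY")
api_key_secret = read_secret("TWITTER_API_KEY_SECRET")
access_token = read_secret("TWITTER_ACCESS_TOKEN")
access_token_secret = read_secret("TWITTER_ACCESS_TOKEN_SECRET")
bearer_token = read_secret("TWITTER_BEARER_TOKEN")
```

An empty string signals a missing credential, so a quick check such as `assert config.api_key` before authenticating would surface a misconfigured environment early.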

In [None]:
# Read credentials from config file

api_key = config.api_key
api_key_secret = config.api_key_secret

access_token = config.access_token
access_token_secret = config.access_token_secret


### 2.1 Authenticate

In [None]:
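## Authenticate

# Pass all four credentials when constructing the handler
# (set_access_token is deprecated in recent Tweepy releases)
auth = tweepy.OAuth1UserHandler(
    api_key, api_key_secret, access_token, access_token_secret
)

# wait_on_rate_limit makes Tweepy sleep until the rate limit resets
api = tweepy.API(auth, wait_on_rate_limit=True)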

## 3. Harvest Tweets

### 3.1 Standard search API

Returns a collection of relevant Tweets matching a specified query.

In [None]:
%%time

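# query
# Note: `-is:retweet` and `-is:reply` are API v2 operators. The v1.1 standard
# search uses `filter:` operators such as `-filter:retweets` instead, which is
# likely why retweets still appear in the output below.
query = '(mariupol OR ukraine) lang:en -is:retweet -is:reply'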

# Tweets to be returned by API
tweet_count = 100

# List containers for API output
tweets = []
time_stamps = []
screen_names = []
topic_country = []

# Query API

for tweet in api.search_tweets(q=query,count=tweet_count,lang='en',result_type="recent"):
    tweets.append(tweet.text)
    time_stamps.append(tweet.created_at)
    screen_names.append(tweet.user.screen_name)



CPU times: user 40.1 ms, sys: 3.8 ms, total: 43.9 ms
Wall time: 345 ms


In [None]:
# Create df from API output
df_std = pd.DataFrame(list(zip(screen_names,tweets,time_stamps)),
               columns =['screen_name','tweets','time_stamp'])
df_std

Unnamed: 0,screen_name,tweets,time_stamp
0,crna_ruka_,RT @madrascat: @POTUS I do not stand with Ukra...,2022-05-20 11:48:50+00:00
1,SeanThorne1,RT @OlgaNYC1211: Russia is playing long game c...,2022-05-20 11:48:50+00:00
2,Berna_BM,RT @RepMTG: $40 BILLION to Ukraine is an Ameri...,2022-05-20 11:48:50+00:00
3,SermonetaFutura,RT @mannocchia: New Evidence Shows How Russian...,2022-05-20 11:48:50+00:00
4,Dallas4Bernie,"RT @socialiststeve6: Biden will send $53,000,0...",2022-05-20 11:48:49+00:00
...,...,...,...
95,TerraOrBust,RT @commonslibrary: The EU imports over 80% of...,2022-05-20 11:48:37+00:00
96,treswatson,RT @benjaminwittes: The New York Times editori...,2022-05-20 11:48:37+00:00
97,mahiru1024,RT @JuliaDavisNews: Meanwhile on Russian state...,2022-05-20 11:48:37+00:00
98,murf1966,"RT @socialiststeve6: Biden will send $53,000,0...",2022-05-20 11:48:36+00:00
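
Many entries in the output above start with `RT @`. The following is a minimal sketch of dropping such rows after the data frame has been built, assuming retweets can be recognised by that text prefix (a heuristic, not an API guarantee). The `sample` rows and the `drop_retweets` helper are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Made-up rows in the same shape as df_std (screen_name, tweets, time_stamp).
sample = pd.DataFrame(
    {
        "screen_name": ["user_a", "user_b", "user_c"],
        "tweets": [
            "RT @someone: shared text",
            "An original tweet",
            "RT @other: more shared text",
        ],
        "time_stamp": pd.to_datetime(
            ["2022-05-20 11:48:50", "2022-05-20 11:48:49", "2022-05-20 11:48:37"],
            utc=True,
        ),
    }
)


def drop_retweets(df: pd.DataFrame, text_col: str = "tweets") -> pd.DataFrame:
    """Return the rows whose text does not start with the 'RT @' prefix."""
    is_retweet = df[text_col].str.startswith("RT @")
    return df.loc[~is_retweet].reset_index(drop=True)


originals = drop_retweets(sample)
print(originals)
```

Filtering on the client side keeps the request unchanged, at the cost of spending part of the 100-tweet page on rows that are later discarded.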


### 3.2 Get trends near a location API

#### 3.2.1 Query Trends using place_id

The place id is added to the query and will filter tweets from the region corresponding to the id.

In [None]:
# get kenyan place id

places = api.search_geo(query="KE", granularity="country")
place_id = places[0].id

In [None]:
%%time
# extract tweets in this place id (the cell magic must be the first line of the cell)

# query
query = f'(mariupol OR ukraine) lang:en -is:retweet -is:reply place:{place_id}'

# Tweets to be returned by API
tweet_count = 100

# List containers for API output
tweets = []
time_stamps = []
screen_names = []
topic_country = []

# Query API

for tweet in api.search_tweets(q=query,count=tweet_count,lang='en',result_type="recent"):
    tweets.append(tweet.text)
    time_stamps.append(tweet.created_at)
    screen_names.append(tweet.user.screen_name)


df_place = pd.DataFrame(list(zip(screen_names,tweets,time_stamps)),
               columns =['screen_name','tweets','time_stamp'])
df_place

CPU times: user 21.3 ms, sys: 427 µs, total: 21.7 ms
Wall time: 286 ms


In [None]:
df_place.head(5)

Unnamed: 0,screen_name,tweets,time_stamp
0,EdwinJumba,💸Dollar about to hit the 💶 Euro mark and heade...,2022-05-20 09:35:08+00:00
1,WilliamMfalme,I am examining the response from both sides of...,2022-05-20 07:31:36+00:00
2,BejaMuti,@BBCWorld Ukraine should join NATO now. They h...,2022-05-19 21:49:50+00:00
3,WilmaTarus,I hate retrogressiveness. Loss of life of that...,2022-05-19 18:04:14+00:00
4,wwega2,@TonyMurega Let's keep funding Ukraine to Keep...,2022-05-19 13:50:56+00:00


#### 3.2.1.1 Extract More Tweets

Paginate the searches with Tweepy's Cursor to extract more tweets.

In [None]:
query = f'place:{place_id} lang:en -is:retweet -is:reply'

def tweets_dataframe(api: tweepy.API, query: str, n: int = 7) -> pd.DataFrame:

    '''This function will extract tweets matching the query for the past
    n days, and returns them as a single Pandas dataframe.

    Parameters
    ----------
    api: tweepy.API
      instance of the api version 1 endpoint

    query: str
      api search parameter

    n: int, optional
      number of days to query tweets. Default is 7 days

    Returns
    -------
    df: pandas.DataFrame
      the tweets from all daily requests, concatenated

    '''
    df_list = []
    today = datetime.date.today()               # Acquire today's date

    # Move the `until` date forward one day per request, ending at today
    for days_back in range(n, 0, -1):
        date_until = str(today - datetime.timedelta(days=days_back - 1))

        # acquire up to 2000 tweets created before `date_until`
        tweets = tweepy.Cursor(api.search_tweets,
                               q=query,
                               count=100,
                               until=date_until).items(2000)

        # Obtain the tweets information and pass it into a data frame
        tweet_info = [[tweet.id_str, tweet.created_at, tweet.user.location, tweet.text] for tweet in tweets]
        df = pd.DataFrame(data=tweet_info, columns=['tweet_id_str', 'date_time', 'location', 'tweet_text'])

        # append the created dataframe into a list
        df_list.append(df)

    return pd.concat(df_list)

df_place = tweets_dataframe(api,query=query,n=7)

# save the df
df_place.to_csv('kenya_all_data.csv',index=False)

In [None]:
df_place.drop_duplicates(inplace=True)
df_place.shape

(14000, 4)
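
Because each request returns tweets created before its `until` date, consecutive requests can in principle return the same tweet twice. A whole-row `drop_duplicates` would miss such a repeat if any other column differs between the two copies. The sketch below, using made-up rows with the same column names as `df_place`, shows de-duplicating on the tweet id alone; whether this is needed for the data collected here is an assumption, not something the output above demonstrates.

```python
import pandas as pd

# Made-up rows; tweet "2" appears twice with a different `location` value.
sample = pd.DataFrame(
    {
        "tweet_id_str": ["1", "2", "2", "3"],
        "date_time": pd.to_datetime(
            ["2022-05-18", "2022-05-19", "2022-05-19", "2022-05-20"], utc=True
        ),
        "location": ["Nairobi", "Mombasa", "Mombasa, Kenya", "Kisumu"],
        "tweet_text": ["first", "second", "second", "third"],
    }
)

# Whole-row comparison keeps both copies of tweet "2".
by_row = sample.drop_duplicates()

# Comparing on the id column keeps only the first copy of each tweet.
by_id = sample.drop_duplicates(subset="tweet_id_str", keep="first").reset_index(drop=True)

print(len(by_row), len(by_id))
```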

#### 3.2.2 Query Trends using WOEID

Returns the top 50 trending topics for a specific id if trending information is available. Note: The id parameter for this endpoint is the "where on earth identifier" or WOEID.

In [None]:
%%time
# Access trending tweets near my location (Kenya)

# WOEID for Kenya (Where On Earth IDentifier)
woeid = 23424863

# fetching the trends
trends = api.get_place_trends(id = woeid)

# Topic placeholder
trending_topics = []

# Query and list trends
for value in trends:
    for trend in value['trends']:
        trending_topics.append(trend['name'])

CPU times: user 14.9 ms, sys: 945 µs, total: 15.8 ms
Wall time: 152 ms
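
The nested loop above walks a list of blocks, each holding a `trends` list of dictionaries. The sketch below reproduces that traversal on a hand-written payload so the structure can be inspected without calling the API. The payload values and the `trend_names` helper with its `min_volume` filter are illustrative assumptions; the notebook itself keeps every trend name.

```python
# Hand-written stand-in for the value returned by api.get_place_trends.
sample_trends = [
    {
        "trends": [
            {"name": "#ExampleTag", "tweet_volume": 12000},
            {"name": "Example Topic", "tweet_volume": None},
        ],
        "locations": [{"name": "Kenya", "woeid": 23424863}],
    }
]


def trend_names(payload, min_volume=0):
    """Collect trend names, skipping trends below `min_volume` tweets."""
    names = []
    for block in payload:
        for trend in block.get("trends", []):
            # tweet_volume can be None when no volume is reported
            volume = trend.get("tweet_volume") or 0
            if volume >= min_volume:
                names.append(trend["name"])
    return names


all_names = trend_names(sample_trends)
popular_names = trend_names(sample_trends, min_volume=1000)
print(all_names, popular_names)
```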


#### 3.2.3 Query Tweets for Trends

Query tweets for the above trending hashtags.

In [None]:
%%time
# Return the most recent tweets for each trend

# tweet count for each hashtag
tweet_count = 100

# List containers for API output
trending_tweets = []
trending_time_stamps = []
trending_screen_names = []
trending_topic = []

# Query tweets from trends
q = ' lang:en -is:retweet -is:reply'
for topic in trending_topics:
    
    for tweet in api.search_tweets(q=topic+q,count=tweet_count,lang='en'):
        trending_tweets.append(tweet.text)
        trending_time_stamps.append(tweet.created_at)
        trending_screen_names.append(tweet.user.screen_name)
        trending_topic.append(topic)
    



Rate limit reached. Sleeping for: 93


CPU times: user 2.04 s, sys: 127 ms, total: 2.17 s
Wall time: 1min 53s


In [None]:
# Create df from API output
trends_df = pd.DataFrame(list(zip(trending_screen_names,trending_topic,trending_tweets,trending_time_stamps)),
               columns =['screen_name','hashtag','tweet','time_stamp'])
trends_df

Unnamed: 0,screen_name,hashtag,tweet,time_stamp
0,victor__haya,#UshuruWaUchungu,The 2022 financial bill will lead to high cos...,2022-05-20 12:02:01+00:00
1,Robert_Musyoka7,#UshuruWaUchungu,RT @MwariwaKuria: The reason why we should all...,2022-05-20 12:01:51+00:00
2,PrinceRaymondke,#UshuruWaUchungu,RT @MwariwaKuria: increased cost of beverages ...,2022-05-20 12:01:46+00:00
3,githu_bobo,#UshuruWaUchungu,RT @githu_bobo: Something to worry about ni pr...,2022-05-20 12:01:40+00:00
4,Jaymoh_8k,#UshuruWaUchungu,Maisha ni hard na uenda tukalipa #UshuruWaUchu...,2022-05-20 12:01:34+00:00
...,...,...,...,...
4633,kupaleon,Alonso,RT @NizaarKinsella: A positive note: Chelsea c...,2022-05-20 11:57:27+00:00
4634,nick01281,Alonso,RT @altaeeameer11: Today marks the retirement ...,2022-05-20 11:57:26+00:00
4635,kupaleon,Alonso,RT @NizaarKinsella: Marcos Alonso is having so...,2022-05-20 11:57:20+00:00
4636,Alonso_Cabral,Alonso,"RT @F1: Just a few days old, and already a fan...",2022-05-20 11:57:17+00:00


In [None]:
# Hashtag distribution
trends_df.hashtag.value_counts()

#UshuruWaUchungu     100
Richarlison          100
Leeds                100
Lampard              100
Waiyaki Way          100
Rihanna              100
Nzioka Waita         100
Rangers              100
LGBTQ                100
Agenda               100
Europe               100
Ramsey               100
betsafe finalists    100
Embu                 100
Goodison Park        100
Crystal Palace       100
Dr Phillip Munyao    100
sifuna               100
Diclofenac           100
Pulisic              100
Nyandarua            100
Nyeri County         100
East Mpaka London    100
Iraq                 100
Vieira               100
Mbappe               100
#BabaNaMama          100
#LGServiceCentre     100
#TwivaPays           100
Everton              100
Larry                100
Celtics              100
Rudi                 100
Alonso               100
Gachagua             100
sakaja               100
Madrid               100
Karatina              97
Polycarp Igathe       97
Mirema                95


In [None]:
# save the scraped data

# Set export file names
today = date.today()
trends_df_name = 'Kenya_Trend_Tweets {}.csv'.format(today)

# Export dataframes

trends_df.to_csv(trends_df_name,index= False)

# Twitter API v2

In [None]:
# import config
import requests
import datetime

In [None]:
# instance of api_v2 endpoint


client = tweepy.Client(bearer_token=config.bearer_token,
                       consumer_key=api_key,
                       consumer_secret=api_key_secret, 
                       access_token=access_token,
                       access_token_secret=access_token_secret,
                       wait_on_rate_limit=True
                       )

In [None]:
query = f'(kenya kwanza OR uda OR ruto OR rigathi\
 OR gachagua OR raila OR azimio OR \
 azimio one kenya OR karua OR martha karua)\
  -#WomenWhoCode lang:en -is:retweet -is:reply'

# the allowed query length is 512 characters with elevated access
print(len(query))

# search_recent_tweets returns a tweepy.Response; arguments left out keep their defaults
response: tweepy.Response = client.search_recent_tweets(query=query,
                                       expansions=['author_id'],
                                       max_results=100,
                                       tweet_fields=['created_at','lang','geo'],
                                       user_fields=['pinned_tweet_id','id'],
                                       user_auth=False)

163
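
Building the query from a list of terms makes it easier to check its length before sending it. The sketch below is one possible helper, not part of the notebook: `build_query` and `MAX_QUERY_LENGTH` are hypothetical names, and the helper quotes multi-word terms so they match as exact phrases, which differs from the unquoted terms used in the query above.

```python
MAX_QUERY_LENGTH = 512  # limit stated in the notebook for elevated access


def build_query(terms, filters=("lang:en", "-is:retweet", "-is:reply")):
    """Join terms with OR, append the filters, and enforce the length limit."""
    # Quote multi-word terms so they are matched as exact phrases.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    query = "(" + " OR ".join(quoted) + ") " + " ".join(filters)
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"query is {len(query)} characters, limit is {MAX_QUERY_LENGTH}"
        )
    return query


example_query = build_query(["kenya kwanza", "uda", "ruto"])
print(example_query)
print(len(example_query))
```

Raising an error locally avoids spending a request on a query the API would reject.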


In [None]:
#  extract the data and export to a pandas dataframe

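# Map each user id to the user object returned by the author_id expansion
users: dict = {u['id']: u for u in response.includes['users']}

tweets_data: list = []

# response.data is None when the search returns no tweets
for resp in response.data or []:
  # .get avoids a KeyError when a tweet's author is missing from the expansion
  user = users.get(resp.author_id)
  if user:
    user_payload = {'user':user.name, 'date':resp.created_at,'tweet_id':resp.id, 'text':resp.text}
    tweets_data.append(user_payload)

df = pd.DataFrame(tweets_data, columns=['user','date','tweet_id','text'])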

In [None]:
df.head(5)

Unnamed: 0,user,date,tweet_id,text
0,tv030kenya,2022-05-20 12:07:00+00:00,1527621913706909696,"BARINGO SOUTH, BARINGO NORTH and Mogotio have ..."
1,Ravine News,2022-05-20 12:06:32+00:00,1527621797335769092,"BARINGO SOUTH, BARINGO NORTH and Mogotio have ..."
2,Breaking Kenya News/ www.breakingkenyanews.com,2022-05-20 12:06:24+00:00,1527621761596006405,#Breakingkenyanews Not just for the numbers...
3,Fredrick Were,2022-05-20 12:06:22+00:00,1527621754105081856,We want a day when Azimio will have heavy rall...
4,Baringo News,2022-05-20 12:05:55+00:00,1527621642075136001,"BARINGO SOUTH, BARINGO NORTH and Mogotio have ..."
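
The lookup from `author_id` to a user can be tried out without network access. The sketch below uses `SimpleNamespace` objects as stand-ins for the tweet and user objects in a response; the sample values and the `attach_authors` helper are assumptions for illustration. Unlike the cell above, it keeps tweets whose author is missing and records `None` for the name.

```python
from types import SimpleNamespace

# Stand-ins for the objects found in response.includes['users'] and response.data.
sample_users = [
    SimpleNamespace(id=1, name="Example User One"),
    SimpleNamespace(id=2, name="Example User Two"),
]
sample_tweets = [
    SimpleNamespace(id=101, author_id=1, text="first tweet"),
    SimpleNamespace(id=102, author_id=3, text="tweet with unknown author"),
]


def attach_authors(tweets, users):
    """Return one dict per tweet with the author's name, or None if unknown."""
    users_by_id = {user.id: user for user in users}
    rows = []
    for tweet in tweets:
        author = users_by_id.get(tweet.author_id)
        rows.append(
            {
                "tweet_id": tweet.id,
                "text": tweet.text,
                "user": author.name if author else None,
            }
        )
    return rows


rows = attach_authors(sample_tweets, sample_users)
print(rows)
```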


## API Generator
 Harvest more tweets

In [None]:

def get_data_v2(query: str, days: int = 6, max_limit: int = 1000) -> pd.DataFrame:
  """This function scrapes data using the Twitter API v2 recent search endpoint.
  The elevated access restricts data acquisition to 7 days.

  Parameters
  ----------
  query : str
    Search query passed to the endpoint

  days : int, optional
    Number of days before today to start from. Default is 6, which
    together with today covers 7 days

  max_limit: int, optional
    Max number of tweets to retrieve per day. Default is 1000

  Returns
  -------
  df: pandas dataframe.

  """
  data: list = []
  today = datetime.date.today()               # Acquire today's date
  time_format = "%Y-%m-%dT%H:%M:%SZ"

  for n in reversed(range(days+1)):

    # One-day window that starts at midnight n days ago
    date_since = (today - datetime.timedelta(days=n)).strftime(time_format)
    date_until = (today - datetime.timedelta(days=n-1)).strftime(time_format)

    # Leave today's window open-ended
    if n == 0:
      date_until = None

    # Page through the results and stop after max_limit tweets
    tweets = tweepy.Paginator(client.search_recent_tweets,
                              query=query,
                              tweet_fields=['created_at','lang','geo'],
                              expansions=['author_id'],
                              max_results=100,
                              start_time=date_since,
                              end_time=date_until).flatten(limit=max_limit)

    data += [{"date": d.created_at,'author_id':d.author_id, 'tweet_id':d.id, "text": d.text} for d in tweets]

  df = pd.DataFrame(data, columns=['date','author_id','tweet_id','text'])

  return df

In [None]:
df = get_data_v2(query=query,max_limit=2000)

# save the csv

df.to_csv('kenya_politico.csv')
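
The daily `start_time` and `end_time` strings used by `get_data_v2` can be previewed without calling the API. The sketch below generates the same kind of windows from an explicit UTC timestamp; `daily_windows` is a hypothetical helper, and anchoring the windows to UTC midnight is an assumption (the function above uses the local date from `datetime.date.today()`).

```python
from datetime import datetime, timedelta, timezone


def daily_windows(days, now=None):
    """Return (start_time, end_time) strings for each day, oldest first.

    The last window is open-ended (end_time is None) so it covers today.
    """
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    time_format = "%Y-%m-%dT%H:%M:%SZ"
    windows = []
    for n in range(days, -1, -1):
        start = midnight - timedelta(days=n)
        end = None if n == 0 else start + timedelta(days=1)
        windows.append(
            (start.strftime(time_format), end.strftime(time_format) if end else None)
        )
    return windows


example_windows = daily_windows(
    2, now=datetime(2022, 5, 20, 12, 0, tzinfo=timezone.utc)
)
for start_time, end_time in example_windows:
    print(start_time, end_time)
```

Passing `now` explicitly makes the windows reproducible, which helps when checking that consecutive windows neither overlap nor leave gaps.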