<a href="https://colab.research.google.com/github/anqizhang1/AIOE/blob/main/Twitter_API_Collecting_Social_Media_Data_PUBLIC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro and Resources

---
#### *Copyright*
The content in this notebook was developed by Jeremy Walker. All sample code and notes are provided under a Creative Commons ShareAlike license.

Official Copyright Rules / Restrictions / Privileges: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/

---
#### References & Resources
##### *Twitter*
* Twitter API - https://developer.twitter.com/en
* Tweepy - http://docs.tweepy.org/en/latest/index.html

##### *Library Research Data Services (RDS)*
* RDS Homepage - https://research.library.gsu.edu/dataservices
* RDS Workshops - https://research.library.gsu.edu/dataservices/rds-workshops

##### *General GSU Resources*
* O'Reilly For Higher Education - http://go.oreilly.com/georgia-state-university
* LinkedIn Learning - https://technology.gsu.edu/technology-services/it-services/training-and-learning-resources/linkedin-learning/


# Part 0 - Getting Everything Set Up

In [1]:
# Before getting started with any code, you need to visit the Twitter API
# (https://developer.twitter.com/en) website and...
# 1. apply for a developer account
# 2. then create a "project"
# 3. then copy+paste the various keys, secrets, and tokens into the following cell.

# Replace each placeholder with your own value.  Treat these like passwords
# and avoid sharing a notebook that contains them.
my_consumer_key = "YOUR_CONSUMER_KEY"
my_consumer_secret = "YOUR_CONSUMER_SECRET"

my_access_token = "YOUR_ACCESS_TOKEN"
my_access_secret = "YOUR_ACCESS_TOKEN_SECRET"

my_bearer_token = "YOUR_BEARER_TOKEN"
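Keys typed directly into a shared notebook are easy to leak. Here is a minimal sketch of one alternative, assuming the credentials were saved beforehand as environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are hypothetical; neither Tweepy nor Twitter requires them.

```python
import os

# Hypothetical environment-variable names; any names would work.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_SECRET",
    "bearer_token": "TWITTER_BEARER_TOKEN",
}

def load_credentials(env=None):
    """Return a dict of credentials read from env, failing loudly if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {param: env[name] for param, name in CREDENTIAL_VARS.items()}

# Example with a stand-in mapping instead of the real environment
example_env = {name: "placeholder" for name in CREDENTIAL_VARS.values()}
credentials = load_credentials(example_env)
print(sorted(credentials))
```

The dictionary keys were chosen to match the parameter names of `tweepy.Client(...)`, so the result could be passed along as `tweepy.Client(**load_credentials())`.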

In [2]:
# To get started, we need to ensure that the absolute latest version of Tweepy is 
# fully installed and overwrites any preexisting installations.

!pip install git+https://github.com/tweepy/tweepy.git --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/tweepy/tweepy.git
  Cloning https://github.com/tweepy/tweepy.git to /tmp/pip-req-build-aqotx4kh
  Running command git clone --filter=blob:none --quiet https://github.com/tweepy/tweepy.git /tmp/pip-req-build-aqotx4kh
  Resolved https://github.com/tweepy/tweepy.git to commit 0cd96b1918e5e920eb9f8fe4ba303ab5ec899c65
  Preparing metadata (setup.py) ... done
Collecting requests<3,>=2.27.0
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
     62.8/62.8 KB 2.4 MB/s eta 0:00:00
Building wheels for collected packages: tweepy
  Building wheel for tweepy (setup.py) ... done
  Created wheel for tweepy: filename=tweepy-4.12.1-py3-none-any.whl size=102442 sha256=eb519cf073f426c39ebd441b202636b2c0af41214ad07a02eb1d47cb06fe1b6d
  Stored in directory: /tmp/pip-ep

In [3]:
# Import the tweepy package
import tweepy

# Import a variety of other packages that may be useful for working with data.
import pandas as pd
import json
import time

# Part 1 - Twitter API credentials, authentication, and starting the "client" class

In order for your code to communicate with the Twitter API (and with most other APIs), you first need to "authenticate" your credentials with the service.  This is conceptually similar to submitting a username and password to the Twitter API before being allowed to access the data.

Advanced users will want to read the [official documentation](https://developer.twitter.com/en/docs/authentication/overview) to gain a deeper understanding of the various authentication methods (e.g. OAuth 1.0a and OAuth 2.0).

However, Tweepy makes it exceptionally easy to authenticate with the Twitter API v2 using the "Client" class/function without needing to think too much: https://docs.tweepy.org/en/latest/client.html

In [6]:
# Using the tweepy.Client(...) function, you can establish a connection to the 
# Twitter API.  The example below shows a robust way of creating a "client" object
#  by passing your credentials to appropriate parameters.

# Additionally, the "wait_on_rate_limit" parameter is set to True at this stage.
# This will be explained more later, but this helps to overcome a lot of errors
#  associated with API usage limitations.

client = tweepy.Client(
    wait_on_rate_limit = True,
    consumer_key = my_consumer_key,
    consumer_secret = my_consumer_secret,
    access_token = my_access_token,
    access_token_secret = my_access_secret,
    bearer_token = my_bearer_token,
)
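As a rough illustration of what waiting on a rate limit involves: when the API refuses a request because a limit was reached, the response carries an `x-rate-limit-reset` header holding the Unix time at which the limit resets, and a client can pause until then and retry. The sketch below only computes the length of that pause. The `seconds_to_wait` helper and its one-second padding are assumptions for illustration, not Tweepy's actual code.

```python
import time

def seconds_to_wait(headers, now=None, padding=1):
    """Seconds to pause before retrying, based on the rate-limit reset time."""
    now = time.time() if now is None else now
    reset_time = int(headers["x-rate-limit-reset"])  # Unix timestamp (seconds)
    return max(0, reset_time - int(now) + padding)

# Stand-in header values; a real response would supply these.
print(seconds_to_wait({"x-rate-limit-reset": "1000"}, now=940))   # 61
print(seconds_to_wait({"x-rate-limit-reset": "1000"}, now=2000))  # 0
```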

In [7]:
# Test to make sure the client object exists

client

<tweepy.client.Client at 0x7fd076652c40>

In [8]:
# Do a quick test to make sure the client is working.  You should see a 
# "Response(..." object of some form displayed in the output if successful 
# (don't worry about the details, we will cover everything together)

client.get_user(username ="standupmaths")

Response(data=<User id=92614042 name=Matt Parker username=standupmaths>, includes={}, errors=[], meta={})

# Part _____ - The General Pattern

Generally speaking, the entirety of this workshop can be reduced to repeatedly cycling through the same three steps:

*   Reading about a function in the Tweepy package's documentation
*   Reading about a function in the official Twitter API documentation
*   Attempting to run the function in the Python notebook

For example, if you want to download the account information for a specific user, you may look into the following:

*   Tweepy: https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_user
*   Twitter API: https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users-by-username-username
*   Code Attempt: client.get_user(username ="standupmaths")

Where you start, and the order in which you learn about different aspects of the Twitter API, is fluid and arbitrary.  You should expect this to be a dynamic process of discovery, practice, and iteration.



# Part 2 - Getting Started with Tweepy & User Lookup

*   Tweepy: https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_user
*   Twitter API: https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users-by-username-username

In [9]:
# Individual user accounts are a good place to start when learning how to 
# retrieve and work with data from the Twitter API.  The following example uses 
# our client object and the get_user(...) function to look up a specific 
# Twitter user's account by username.  The user_single object will be used to 
# store the information returned to us.

# We'll start by looking at the Twitter account of math communicator Matt Parker: https://twitter.com/standupmaths

user_single = client.get_user(
    username ="standupmaths",
)

In [10]:
# Whereas the Twitter API, in its raw form, usually just returns a JSON object,
# the Tweepy package bundles data returned from the Twitter API in specific ways
# that can be a bit strange at first, but are often quite helpful for data 
# management and processing.

# The "Response" below contains "data", "includes", "errors", and "meta" components.
# Each of these is useful in different situations, but this workshop will focus
# almost entirely on the "data" component.  In this case, it is also the only 
# component that actually contains any information :)

user_single

Response(data=<User id=92614042 name=Matt Parker username=standupmaths>, includes={}, errors=[], meta={})
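The four components can be read by name or unpacked in order, because Tweepy's `Response` behaves like a named tuple. The sketch below uses a stand-in `Response` built with `collections.namedtuple`, and a plain dictionary in place of the real `User` object, so it runs without contacting the API. The field values are copied from the output above.

```python
from collections import namedtuple

# Stand-in with the same four fields as the Response shown above
Response = namedtuple("Response", ("data", "includes", "errors", "meta"))

response = Response(
    data={"id": "92614042", "name": "Matt Parker", "username": "standupmaths"},
    includes={},
    errors=[],
    meta={},
)

# Access a component by name...
print(response.data["username"])

# ...or unpack all four components at once
data, includes, errors, meta = response

# A request may come back without usable data (for example, when a username
# does not exist), so it is worth checking before using the result.
if data is None or errors:
    print("Nothing usable returned:", errors)
else:
    print("Got user", data["name"])
```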

In [11]:
# The data can be accessed as a component of the user_single object

user_single.data

<User id=92614042 name=Matt Parker username=standupmaths>

In [16]:
# Within user_single.data, you can access a variety of individual attributes
user_single.data.name


'Matt Parker'

In [17]:
user_single.data.keys()

KeysView(<User id=92614042 name=Matt Parker username=standupmaths>)

In [18]:
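# Note: created_at prints None here because that field was not requested.
# Requesting additional fields is covered a few cells below.
print( user_single.data.id )
print( user_single.data.name )
print( user_single.data.username )
print( user_single.data.created_at )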

92614042
Matt Parker
standupmaths
None


In [19]:
# Unintuitively, there is a "data" attribute within the "data" component 
# within the user_single object.  What's important is that this will give
# us ALL of the data delivered by the Twitter API.

# This information is structured as a dictionary and can be treated as a JSON 
# object when it is helpful.

user_single.data.data   # json format dictionary

{'id': '92614042', 'name': 'Matt Parker', 'username': 'standupmaths'}

In [20]:
# This information can also be accessed as key-value pairs using [...].

# These approaches to isolating individual data may be useful if you need 
# granular control over individual values.

# user_single.data.data["id"]
# user_single.data.data["name"]
user_single.data.data["username"]

'standupmaths'

In [21]:
# However, what I assume is more useful to most researchers is to use Pandas
# to convert the data to a Pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe

pd.json_normalize( user_single.data.data, sep="_" )

Unnamed: 0,id,name,username
0,92614042,Matt Parker,standupmaths


In [22]:
# "Fields" refer to the additional data you want to request from the Twitter 
# API.  There are very precise options available for "user fields", "tweet 
# fields", "media fields" and other "....fields" options.  Using these tools 
# robustly requires close reading of the documentation.

# Twitter API (get_user) - https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users-by-username-username
# Twitter API (Fields) - https://developer.twitter.com/en/docs/twitter-api/fields
# Tweepy - https://docs.tweepy.org/en/latest/client.html#user-fields

# Based on information drawn from the documentation, the example below adds
# the user_fields=... parameter and a list[...] of desired fields.

user_single = client.get_user(
    username ="standupmaths",
    user_fields = ["created_at", "description","public_metrics","verified",],
)

In [23]:
# Inspect the updated data and note the inclusion of additional data.

user_single.data.data

{'created_at': '2009-11-25T21:19:33.000Z',
 'public_metrics': {'followers_count': 137743,
  'following_count': 613,
  'tweet_count': 27023,
  'listed_count': 1420},
 'description': '#1 best-selling author, also maths clown.\nVideos: https://t.co/sCfGSvBKVm\nAlso over here: https://t.co/3kB1s6dNdX',
 'id': '92614042',
 'name': 'Matt Parker',
 'username': 'standupmaths',
 'verified': False}

In [26]:
# Convert data to Pandas DataFrame format.

pd.json_normalize( user_single.data.data, sep="_" )

Unnamed: 0,created_at,description,id,name,username,verified,public_metrics_followers_count,public_metrics_following_count,public_metrics_tweet_count,public_metrics_listed_count
0,2009-11-25T21:19:33.000Z,"#1 best-selling author, also maths clown.\nVid...",92614042,Matt Parker,standupmaths,False,137743,613,27023,1420
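To see where column names such as `public_metrics_followers_count` come from, here is a small pure-Python sketch of the flattening idea: nested dictionaries are walked recursively and their keys are joined with the separator. The `flatten` helper is hypothetical and only mimics the behaviour for nested dictionaries; it is not how pandas implements `json_normalize`. The sample record is abbreviated from the output above.

```python
def flatten(record, sep="_", prefix=""):
    """Flatten nested dictionaries into a single-level dictionary."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse, carrying the parent key forward as a prefix
            flat.update(flatten(value, sep=sep, prefix=name + sep))
        else:
            flat[name] = value
    return flat

user = {
    "id": "92614042",
    "username": "standupmaths",
    "verified": False,
    "public_metrics": {"followers_count": 137743, "following_count": 613},
}

for column, value in flatten(user).items():
    print(column, "=", value)
```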


In [27]:
# Many functions in the Twitter API / Tweepy allow for submitting multiple 
# requests at once.  In this case, we can request account details on multiple 
# known Twitter accounts.

# Tweepy - https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_users
# Twitter API  - https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users-by-username-username


users_group = client.get_users(
    usernames =["standupmaths","FryRsquared","3blue1brown","MouldS",],
    user_fields = ["created_at", "description","public_metrics","verified",],
)

In [28]:
# Inspect the object.  The object is still a "Response(...", but contains 
# multiple users within the data component.

users_group

Response(data=[<User id=92614042 name=Matt Parker username=standupmaths>, <User id=273375532 name=Hannah Fry username=FryRsquared>, <User id=2877269376 name=Grant Sanderson username=3blue1brown>, <User id=17791581 name=Steve Mould username=MouldS>], includes={}, errors=[], meta={})

In [31]:
# The .data attribute returns a list of user objects
users_group.data

[<User id=92614042 name=Matt Parker username=standupmaths>,
 <User id=273375532 name=Hannah Fry username=FryRsquared>,
 <User id=2877269376 name=Grant Sanderson username=3blue1brown>,
 <User id=17791581 name=Steve Mould username=MouldS>]

In [32]:
# Each item can be accessed in the same way you index and select items from other lists.

users_group.data[1]

<User id=273375532 name=Hannah Fry username=FryRsquared>

In [33]:
# When an individual user is selected, you can access all of the different data
# fields in the same way as before.

users_group.data[0].data

{'username': 'standupmaths',
 'public_metrics': {'followers_count': 137743,
  'following_count': 613,
  'tweet_count': 27023,
  'listed_count': 1420},
 'description': '#1 best-selling author, also maths clown.\nVideos: https://t.co/sCfGSvBKVm\nAlso over here: https://t.co/3kB1s6dNdX',
 'id': '92614042',
 'created_at': '2009-11-25T21:19:33.000Z',
 'verified': False,
 'name': 'Matt Parker'}

In [34]:
# Convert an individual user to a Pandas DataFrame

pd.json_normalize( users_group.data[0].data, sep="_" )

Unnamed: 0,username,description,id,created_at,verified,name,public_metrics_followers_count,public_metrics_following_count,public_metrics_tweet_count,public_metrics_listed_count
0,standupmaths,"#1 best-selling author, also maths clown.\nVid...",92614042,2009-11-25T21:19:33.000Z,False,Matt Parker,137743,613,27023,1420


In [35]:
# In order to make use of bulk data gathered from the Twitter API, you will need
# to be comfortable using loops and iterations.  For those unfamiliar with Python 
# coding, this workshop will primarily use for-loops: https://www.tutorialspoint.com/python/python_for_loop.htm

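# Create an empty list to collect one small dataframe per user
frames = []

# Iterate through users_group and store each user's data in the list
for i in users_group.data:
  temp_data = pd.json_normalize( i.data ,  sep = "_")
  frames.append( temp_data )

# Combine everything into a single dataframe.  pd.concat is used because
# DataFrame.append was removed in pandas 2.0.
data_users = pd.concat( frames , ignore_index=True )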

In [36]:
# Inspect the dataframe

data_users

Unnamed: 0,username,description,id,created_at,verified,name,public_metrics_followers_count,public_metrics_following_count,public_metrics_tweet_count,public_metrics_listed_count
0,standupmaths,"#1 best-selling author, also maths clown.\nVid...",92614042,2009-11-25T21:19:33.000Z,False,Matt Parker,137743,613,27023,1420
1,FryRsquared,All math and no trousers.,273375532,2011-03-28T11:14:20.000Z,True,Hannah Fry,208097,2121,11406,1513
2,3blue1brown,Pi creature caretaker.\nContact/faq: https://t...,2877269376,2014-10-25T20:56:48.000Z,False,Grant Sanderson,310802,346,3530,2142
3,MouldS,YouTube: https://t.co/Qgmx0dSDz0\nYouTube Shor...,17791581,2008-12-01T22:37:20.000Z,True,Steve Mould,30705,324,4377,332
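`pd.json_normalize` also accepts a list of dictionaries, which produces the same kind of table without an explicit loop. The sketch below uses two hand-copied, abbreviated records in place of `users_group.data`; with live data the list would be something like `[u.data for u in users_group.data]`. It also converts `created_at` from text into real timestamps, which makes sorting and filtering by date easier.

```python
import pandas as pd

# Abbreviated stand-ins for the dictionaries held in each user's .data
records = [
    {"username": "standupmaths", "created_at": "2009-11-25T21:19:33.000Z",
     "public_metrics": {"followers_count": 137743, "tweet_count": 27023}},
    {"username": "FryRsquared", "created_at": "2011-03-28T11:14:20.000Z",
     "public_metrics": {"followers_count": 208097, "tweet_count": 11406}},
]

# One call flattens every record into one row each
users_df = pd.json_normalize(records, sep="_")

# Parse the ISO 8601 strings into timezone-aware timestamps
users_df["created_at"] = pd.to_datetime(users_df["created_at"], utc=True)

# Sort by follower count, largest first
users_df = users_df.sort_values("public_metrics_followers_count", ascending=False)
print(users_df)
```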


# Part 2 - Practice

Using the outlines below, fill in all of ??? in the following blocks of code.  This is simply meant to ensure you understand how to modify and tweak code to produce outputs.

In [37]:
# Re-create the client using your credentials

client = tweepy.Client(
    consumer_key = ???,
    consumer_secret = ???,
    access_token = ???,
    access_token_secret = ???,
    bearer_token = ???,
)

SyntaxError: ignored

In [None]:
# Find a Twitter username you are interested in and put it in the username parameter.

# If you can not think of or find any Twitter accounts quickly, maybe you can 
# try using "NPR".

user_single_practice = client.get_user(
    username = "???",
    user_fields = ["created_at", "description","public_metrics","verified",],
)

In [None]:
# Using the user_single_practice object, display the data within the data attribute.

user_single_practice.???.???

In [None]:
# Define multiple Twitter usernames AND user_fields= parameters.  You can use the
# parameters above or those found in the Twitter documentation.

user_group_practice = client.get_users(
    usernames =["???","???","???",],
    user_fields = ["???", "???","???",],
)

# Display the list of users
user_group_practice.data

In [None]:
# Finally, run the code below to generate and inspect a dataframe based
# on the user_group_practice object created above (nothing to fix here, just run 
# the code)

frames = []

for i in user_group_practice.data:
  temp_data = pd.json_normalize( i.data ,  sep = "_")
  frames.append( temp_data )

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
data_users = pd.concat( frames , ignore_index=True )

data_users

# Part 3 - Followers, Friends, and Timelines

In addition to looking up individual users' information, the Twitter API also enables us to harvest lists of friends, followers, and the timeline (tweets, retweets, etc...) of selected Twitter users.

Following
* Tweepy - https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_users_following
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/users/follows/api-reference/get-users-id-following

Followers
* Tweepy - https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_users_followers
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/users/follows/api-reference/get-users-id-followers

Timelines
* Tweepy - https://docs.tweepy.org/en/latest/client.html#tweepy.Client.get_users_tweets
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets


In [38]:
# Let's look at Nikema Williams (GA-05 House of Representatives)
# https://twitter.com/NikemaWilliams

# First, we'll use get_user(...) to get Williams's account details
williams = client.get_user(
    username = "NikemaWilliams",
    user_fields = ["created_at", "description","public_metrics","verified",],
)

In [39]:
# In order to use subsequent functions, we need to switch from relying on the 
# Twitter username to instead using the Twitter id

williams.data.id

36424664

In [40]:
# Using the get_users_following(...) function, you can retrieve the Twitter user 
# account details for up to 1000 Twitter users who Williams's Twitter account is 
# following.

williams_following = client.get_users_following(
    id = williams.data.id ,
    user_fields = ["created_at", "description","public_metrics","verified",],
    max_results = 1000)

In [41]:
# Inspect the resulting list of data

williams_following.data

[<User id=3021477680 name=Cricket Celebration Bowl username=CelebrationBowl>,
 <User id=1556535074996338688 name=Balvir Pokharel username=BalvirPokharel>,
 <User id=2575037659 name=Kathleen Baker username=Kathlee54377081>,
 <User id=14438930 name=Vicki Kuglin Garver username=Kugy55>,
 <User id=215501778 name=David Slack username=slack2thefuture>,
 <User id=15952856 name=Ari Berman username=AriBerman>,
 <User id=28752248 name=ALLTHINGSNICCI username=niccigilbert>,
 <User id=1421845469576245257 name=AMVETS Post 911 username=theatlantaamve1>,
 <User id=1392831444616101888 name=Deputy Secretary Don Graves username=DepSecGraves>,
 <User id=46468469 name=Stacy Xander username=uzanybug>,
 <User id=32673307 name=Donna Lowry username=donnalowrynews>,
 <User id=287703379 name=karla fc holloway username=ProfHolloway>,
 <User id=760639303 name=Ethan Embry username=EmbryEthan>,
 <User id=1511377628145532928 name=Capitol Fox username=thecapitolfox>,
 <User id=1683518701 name=Frank Ski Show username=

In [42]:
# Using the same approach as before, construct a Pandas DataFrame using the
# collected data.

frames = []

for i in williams_following.data:
  temp_data = pd.json_normalize( i.data ,  sep = "_")
  frames.append( temp_data )

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
williams_following_df = pd.concat( frames , ignore_index=True )

In [43]:
# Inspect the dataframe

williams_following_df

Unnamed: 0,description,verified,created_at,id,name,username,public_metrics_followers_count,public_metrics_following_count,public_metrics_tweet_count,public_metrics_listed_count
0,An HBCU postseason college football game featu...,True,2015-02-15T16:59:42.000Z,3021477680,Cricket Celebration Bowl,CelebrationBowl,8780,446,4723,96
1,"I'm chefs ,silence is the most powerful scream!",False,2022-08-08T06:59:07.000Z,1556535074996338688,Balvir Pokharel,BalvirPokharel,6,222,26,1
2,,False,2014-06-18T15:12:37.000Z,2575037659,Kathleen Baker,Kathlee54377081,262,1294,17782,1
3,Don’t let anything or anyone steal your JOY😇 ...,False,2008-04-18T23:57:26.000Z,14438930,Vicki Kuglin Garver,Kugy55,221,695,686,0
4,TV Writer. Mostly elsewhere now. Turn the ligh...,True,2010-11-14T03:30:33.000Z,215501778,David Slack,slack2thefuture,31203,4366,68164,450
...,...,...,...,...,...,...,...,...,...,...
674,,False,2009-02-13T14:10:42.000Z,20772282,Ayinde Martin,Ayindem,49,28,143,3
675,Business owner. Political reformer. Former Rep...,False,2007-07-23T17:10:36.000Z,7663002,Kyle Bailey,kylebailey415,399,277,217,16
676,,False,2008-11-22T14:21:52.000Z,17557624,audraanne4180,audraanne4180,30,212,25,0
677,,False,2009-04-15T17:57:01.000Z,31468688,ATN,AnnieAngstrom,5,0,2,0


In [44]:
# Conversely, you can collect the user accounts of Twitter users who are 
# following Williams's account.

williams_followers = client.get_users_followers(
    id = williams.data.id ,
    user_fields = ["created_at", "description","public_metrics","verified",],
    max_results = 1000)

In [45]:
# Inspect the discrepancy between the total followers (followers_count) and the
# number of followers collected from the Twitter API.  In Part 5, we will learn
# how to go beyond the "max_results = 1000" limit!

print(williams.data.public_metrics)
print()
print("Number of followers collected from Twitter API: ",len(williams_followers.data))

{'followers_count': 48415, 'following_count': 679, 'tweet_count': 4559, 'listed_count': 473}

Number of followers collected from Twitter API:  1000
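The size of the gap is easy to quantify. Since each request returns at most 1000 accounts, collecting every follower means making repeated requests, one per "page" of results. Below is a back-of-the-envelope sketch using the counts printed above; the `pages_needed` helper is hypothetical.

```python
import math

def pages_needed(total_items, page_size):
    """Number of requests required to collect total_items at page_size per request."""
    return math.ceil(total_items / page_size)

followers_count = 48415   # from williams.data.public_metrics above
collected = 1000          # from len(williams_followers.data) above

print("Share collected so far:", round(collected / followers_count * 100, 1), "%")
print("Requests needed for everyone:", pages_needed(followers_count, 1000))
```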


In [46]:
# Construct a Pandas DataFrame using the collected data.

frames = []

for i in williams_followers.data:
  temp_data = pd.json_normalize( i.data ,  sep = "_")
  frames.append( temp_data )

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
williams_followers_df = pd.concat( frames , ignore_index=True )

In [47]:
# Inspect the dataframe

williams_followers_df

Unnamed: 0,verified,created_at,id,username,name,description,public_metrics_followers_count,public_metrics_following_count,public_metrics_tweet_count,public_metrics_listed_count
0,False,2014-01-20T16:12:14.000Z,2301645024,AustinRobertG,Austin Robert G,#Entrepreneur,227,767,517,2
1,False,2022-10-14T11:47:43.000Z,1580887661824610305,Cynthia07254613,Cynthia Richardson,,379,2605,58,0
2,False,2023-01-09T12:26:10.000Z,1612425467918749697,ga12dems,Georgia's 12th CD Democratic Committee,,1,213,1,0
3,False,2023-01-13T21:21:07.000Z,1614009354222555139,Sampsondodo,Tara Akosua,,1,115,6,0
4,False,2022-09-05T20:15:50.000Z,1566882688052649989,Kimberly87874,Kimberly Sinclair,,9,74,3,0
...,...,...,...,...,...,...,...,...,...,...
995,False,2009-10-11T08:03:07.000Z,81545830,SeanyFootball,Sean Mardis 🇺🇦,The Official Twitter of Sean Mardis. | Ex-Pro ...,221767,199138,23952,171
996,False,2022-12-06T21:12:07.000Z,1600236535395233792,GenJoyHart,Gen Hart ❤️,"Dog Lover, Entrepreneur, Straightforward, Fun ...",59,740,520,0
997,False,2012-01-08T02:24:13.000Z,458002544,juliew80,Janice 'Fer sure reeeeally',"Lead Guitar, Dr. Teeth and The Electric Mayhem",33,367,3465,0
998,False,2018-01-02T01:12:21.000Z,947998967991762944,OF4Democracy,OF4Democracy 🇺🇸🇺🇦,"Father, Husband, PhD (chemistry), Patriot, For...",4896,5193,21322,0


In [48]:
# Lastly, you can collect the Twitter timelines of specified users using the 
# get_users_tweets(...) function.  In the example below, note the use of the 
# tweet_fields=... parameter.  Since we are collecting Tweet objects now, there
# are different sets of fields and additional data that can be requested from the
# Twitter API.

# Twitter API (get_users_tweets) - https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets
# Twitter API (Fields) - https://developer.twitter.com/en/docs/twitter-api/fields
# Tweepy - https://docs.tweepy.org/en/latest/client.html#tweet-fields


williams_tweets = client.get_users_tweets(
    id = williams.data.id ,
    tweet_fields = ["id","created_at", "text","public_metrics",
                    "in_reply_to_user_id","reply_settings", "source",
                    "referenced_tweets"],
    max_results = 100)

In [49]:
# Inspect the resulting list of data

williams_tweets.data

[<Tweet id=1613403347884572675 text='@omigoditslucasx Born at The Medical Center…what about you?'>,
 <Tweet id=1612308513610498054 text='@bluestein @UCSDHealth Whew….you’re in San Diego. I was about to pull together a frequent flyer offering!'>,
 <Tweet id=1611712940616417284 text='Officially Sworn in for a 2nd Term. Ready to get to work for the people of the #FightingFifth! https://t.co/OpOtNMKp2F'>,
 <Tweet id=1611593575082385408 text='Fancy Nancy said, “He can’t count. I could have told him didn’t have the votes.” #GOAT𓃵 https://t.co/w5yKYozxcT'>,
 <Tweet id=1611592636975947776 text='@charmdiddy @GOP @GOPLeader Sick and tired of being sick and tired….'>,
 <Tweet id=1611239733119565824 text='@mintzess @HouseDemocrats @lucymcbath @AOC Checked on this for you! She didn’t know the shade but I did snap a close up! https://t.co/jEa7vm8j1j'>,
 <Tweet id=1611238252068864001 text='@botanicalBastrd With the Pokémon Jibbitz! We allow him to express his personal style. (within reason)'>,
 <Twee

Analyzing tweets

In [50]:
# Isolate the first tweet from the data

tweet_0 = williams_tweets.data[0]

In [51]:
# Like all of the other data objects, there is a .data attribute that can show
# all of the information contained in the object.

tweet_0.data

{'in_reply_to_user_id': '1412664438356914178',
 'referenced_tweets': [{'type': 'replied_to', 'id': '1613321794890465280'}],
 'reply_settings': 'everyone',
 'created_at': '2023-01-12T05:11:48.000Z',
 'text': '@omigoditslucasx Born at The Medical Center…what about you?',
 'id': '1613403347884572675',
 'edit_history_tweet_ids': ['1613403347884572675'],
 'public_metrics': {'retweet_count': 1,
  'reply_count': 2,
  'like_count': 1,
  'quote_count': 0,
  'impression_count': 546}}
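The `referenced_tweets` field is what distinguishes replies, retweets, and quote tweets from original posts. Here is a small sketch of classifying a tweet dictionary by that field. The `tweet_kind` helper is hypothetical, and the first sample is abbreviated from the output above.

```python
def tweet_kind(tweet):
    """Return 'replied_to', 'retweeted', 'quoted', or 'original' for a tweet dict."""
    references = tweet.get("referenced_tweets") or []
    if not references:
        return "original"
    # A tweet can reference more than one tweet (e.g. a reply that also quotes);
    # this sketch simply reports the first reference type.
    return references[0]["type"]

reply = {
    "id": "1613403347884572675",
    "referenced_tweets": [{"type": "replied_to", "id": "1613321794890465280"}],
}
standalone = {"id": "1611712940616417284"}

print(tweet_kind(reply))
print(tweet_kind(standalone))
```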

In [52]:
# Loop through the data and simply print(...) the information on screen.
# This is just a simple way to preview the timeline of tweets.

for i in williams_tweets.data:
  print("----------------------------")
  print(i.created_at , "\t" , i.text)

----------------------------
2023-01-12 05:11:48+00:00 	 @omigoditslucasx Born at The Medical Center…what about you?
----------------------------
2023-01-09 04:41:19+00:00 	 @bluestein @UCSDHealth Whew….you’re in San Diego. I was about to pull together a frequent flyer offering!
----------------------------
2023-01-07 13:14:43+00:00 	 Officially Sworn in for a 2nd Term. Ready to get to work for the people of the #FightingFifth! https://t.co/OpOtNMKp2F
----------------------------
2023-01-07 05:20:24+00:00 	 Fancy Nancy said, “He can’t count. I could have told him didn’t have the votes.” #GOAT𓃵 https://t.co/w5yKYozxcT
----------------------------
2023-01-07 05:16:41+00:00 	 @charmdiddy @GOP @GOPLeader Sick and tired of being sick and tired….
----------------------------
2023-01-06 05:54:22+00:00 	 @mintzess @HouseDemocrats @lucymcbath @AOC Checked on this for you! She didn’t know the shade but I did snap a close up! https://t.co/jEa7vm8j1j
----------------------------
2023-01-06 05:48:2

In [None]:
# Construct a Pandas DataFrame from the data

frames = []

for i in williams_tweets.data:
  temp_df = pd.json_normalize( i.data , sep="_")
  frames.append( temp_df )

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
williams_tweets_df = pd.concat( frames , ignore_index=True )

In [None]:
# Inspect the dataframe

williams_tweets_df

# Part 3 - Practice

Using the outlines below, fill in all of ??? in the following blocks of code.  This is simply meant to ensure you understand how to modify and tweak code to produce outputs.

In [None]:
# Create an object to represent the user-account for a Twitter account of your choice.

??? = client.get_user(
    username = "???",
    user_fields = ["created_at", "description","public_metrics","verified",],
)

In [None]:
# Display the user's id value

print( ???.data.id )

In [None]:
# Use the .get_users_tweets(...) function to collect the latest 100 items from 
# the user's timeline using the user's id.  You can also explicitly define some 
# tweet_fields (or delete that parameter entirely)

practice_tweets = client.???(
    id = ???.???.??? ,
    tweet_fields = [ "???" , "???", "???" , "???" ,] ,
    max_results = 100)

In [None]:
# Construct and inspect a dataframe of collected tweets (nothing to fix here)
frames = []

for i in practice_tweets.data:
  temp_df = pd.json_normalize( i.data , sep="_")
  frames.append( temp_df )

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
practice_tweets_df = pd.concat( frames , ignore_index=True )

# Inspect data
practice_tweets_df

# Part 4 - Search & API Parameters

Separate from looking up known accounts and connected information, one of the most powerful facets of the Twitter API is the ability to conduct robust and complicated searches for tweets from all across the platform.

It is important to note that in Twitter API v2, "Academic" developer accounts have free access to API endpoints that allow for searching through all Tweets from the very beginning of Twitter AND more robust search options overall.  This workshop strictly focuses on the non-Academic Twitter API options, but academic researchers are strongly encouraged to apply for an ["Academic" developer account](https://developer.twitter.com/en/solutions/academic-research) with Twitter (it's worth it!).

Search (Standard / Core)
* Tweepy - https://docs.tweepy.org/en/latest/client.html#tweepy.Client.search_recent_tweets
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent

Search (Academic / Full Archive Search)
* Tweepy - https://docs.tweepy.org/en/latest/client.html#tweepy.Client.search_all_tweets
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all


In [None]:
# To get started, you can use the search_recent_tweets(...) function and the 
# only parameter that must be specified is the query=... field.  In the example
# below, max_results is also specified to ensure that we explicitly limit the 
# volume of requests sent to the Twitter API (more on this in Part 5).

search = client.search_recent_tweets(
    query = "atlanta",
    max_results = 10
)

In [None]:
# Inspect the data, note all of the objects in the list represent tweets
# from sometime in the past 7 days.

search.data[0].data

In [None]:
# The query=... parameter can take very precise search requirements when used
# with the appropriate operators.

# Twitter API - https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query

# Below are some sample queries and explanations:

# "atlanta"                              search for "atlanta"
# "(atlanta OR atl)"                     search for "atlanta" or "atl"
# "atlanta -is:retweet"                  search for "atlanta", but exclude retweets
# "atlanta lang:es -is:retweet"          search for "atlanta", language is Spanish, retweets excluded
# "@GeorgiaStateU -from:GeorgiaStateU"   search for mentions of "@GeorgiaStateU", but exclude tweets from @GeorgiaStateU

# Try a search with a more advanced query!

search = client.search_recent_tweets(
    query = "@GeorgiaStateU -from:GeorgiaStateU",
    max_results = 10
)

In [None]:
# Inspect / preview the tweet data
search.data

In [None]:
# "Standard" API users can include query strings that are up to 512 characters long
# "Academic" API users can include query strings that are up to 1024 characters long

search = client.search_recent_tweets(
    query = "nintendo has:media lang:en is:verified -is:retweet -is:reply -is:quote",
    max_results = 10
)

# Inspect / preview the tweet data
search.data
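Longer queries are easier to manage when assembled from parts. The sketch below builds a query string from keywords and operators like those listed earlier, and checks it against the 512-character limit mentioned above. The `build_query` helper and its parameters are hypothetical conveniences, not part of Tweepy.

```python
def build_query(keywords, operators=(), max_length=512):
    """Join keywords with OR, append operators, and enforce a length limit."""
    if not keywords:
        raise ValueError("At least one keyword is required")
    # Multiple keywords are grouped in parentheses so OR applies only to them
    core = keywords[0] if len(keywords) == 1 else "(" + " OR ".join(keywords) + ")"
    query = " ".join([core, *operators])
    if len(query) > max_length:
        raise ValueError(f"Query is {len(query)} characters; limit is {max_length}")
    return query

query = build_query(["atlanta", "atl"], operators=["lang:es", "-is:retweet"])
print(query)  # (atlanta OR atl) lang:es -is:retweet
```

The resulting string can be passed as the `query=` parameter of `client.search_recent_tweets(...)`.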

In [None]:
# Just like with prior examples, the scope of data collected from the API is
# tied to the expansions and various ..._fields parameters.  The example below 
# shows many of these options being incorporated into a search.

# Field Options - Twitter API - https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent

search = client.search_recent_tweets(
    # Query
    query = "#latteart has:images lang:en",    
    
    # Define a few required Expansions fields
    expansions = ["attachments.media_keys" , "author_id"],
    
    # Specify tweet fields of interest
    tweet_fields = ["attachments","created_at", "entities", "geo", "id","public_metrics","possibly_sensitive", "source", "text",],
    
    # Specify media fields of interest
    media_fields = ["url","height","width","public_metrics","media_key"],
    
    # Specify user fields of interest
    user_fields = ["id", "name", "public_metrics", "username", "verified"] ,

    # Define the maximum number of tweets to request
    max_results = 15
)

In [None]:
# All of the resulting tweets should contain images as well as many additional data fields.

search.data

In [None]:
# Within the search results, we can drill down to the raw data for individual tweets.

# search.data[1]
search.data[1].data

In [None]:
# Since we used a variety of fields in our original query, the .includes attribute
# contains a lot of attached data and associated fields.  The example below shows
# both "media" and "user" groupings of additional data not visible in the 
# search.data.... components.

search.includes

In [None]:
# Different subsets of .includes can be accessed using [...]

search.includes["media"]

In [None]:
# Isolate a specific item in the list

search.includes["media"][1]

In [None]:
# Inspect the data associated with the included media-field requests

search.includes["media"][1].data
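
The link between a tweet and its attached media is the media key: each tweet lists its keys under attachments, and each media item carries a matching media_key. The sketch below shows that lookup in plain Python using made-up dictionaries shaped like the .data output above. The sample values and the `media_for` helper are assumptions for illustration only.

```python
# Hypothetical sample data, shaped like tweet.data and media.data dictionaries
tweets = [
    {"id": "1", "text": "first pour", "attachments": {"media_keys": ["3_111"]}},
    {"id": "2", "text": "no picture here"},
    {"id": "3", "text": "rosetta", "attachments": {"media_keys": ["3_222", "3_333"]}},
]
media = [
    {"media_key": "3_111", "type": "photo", "url": "https://example.com/a.jpg"},
    {"media_key": "3_222", "type": "photo", "url": "https://example.com/b.jpg"},
    {"media_key": "3_333", "type": "photo", "url": "https://example.com/c.jpg"},
]

# Build a lookup table so each media item can be found by its key
media_by_key = {m["media_key"]: m for m in media}


def media_for(tweet):
    """Return the media items attached to one tweet (possibly an empty list)."""
    keys = tweet.get("attachments", {}).get("media_keys", [])
    return [media_by_key[k] for k in keys if k in media_by_key]


for t in tweets:
    print(t["id"], [m["url"] for m in media_for(t)])
```

Note that a tweet can have zero, one, or several media items, so any table that pairs each tweet with a single media key keeps only part of this information.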

In [None]:
# Using for-loops and Pandas DataFrame, you can reorganize and merge the data
# into a (possibly) more intuitive table form.

######
# Create search_df, containing the Tweet data
# (DataFrame.append was removed in pandas 2.0, so the rows are collected in a
# list and combined with pd.concat instead.)
search_frames = []

for i in search.data:
  search_frames.append(pd.json_normalize( i.data ,  sep = "_"))

search_df = pd.concat(search_frames, ignore_index=True)

# Keep only the first media key for each tweet
search_df["attachments_media_keys"] = search_df["attachments_media_keys"].str[0]

######
# Create search_media_df, containing the media data
media_frames = []

for i in search.includes["media"]:
  media_frames.append(pd.json_normalize( i.data ,  sep = "_"))

search_media_df = pd.concat(media_frames, ignore_index=True)

######
# Merge the search_df and search_media_df dataframes together.
combined_df = search_df.merge(search_media_df, left_on="attachments_media_keys", right_on="media_key")

In [None]:
# Inspect the core dataframe of tweets
search_df.head( 3 )

In [None]:
# Inspect the media dataframe
search_media_df.head( 3 )

In [None]:
# Inspect the combined data
combined_df.head( 3 )

# Part 4 - Practice

Using the outlines below, fill in every ??? in the following blocks of code.  This is simply meant to ensure you understand how to modify and tweak code to produce outputs.

In [None]:
# Run your own search, specifying the query= and tweet_fields= parameters
# according to your interest.

search_practice = client.search_recent_tweets(
    ??? = "????????????????",    
    ??? = ["???","???", "???", "???","public_metrics",],
    max_results = 20
)

In [None]:
# Create search_df, containing the Tweet data and display the data below. (nothing to fix here)

# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
practice_frames = []

for i in search_practice.data:
  practice_frames.append(pd.json_normalize( i.data ,  sep = "_"))

search_practice_df = pd.concat(practice_frames, ignore_index=True)

search_practice_df

# Part 5 - Pages, pagination, and rate limits

When requesting large chunks of data from the Twitter API, Twitter will segment results into "pages" and only provide you with the first "page" of results.  You can then subsequently request page#2, page#3 and so forth to iteratively collect the remaining results.  This is done so that Twitter is better able to manage API requests and prevent abuse.

* Tweepy - https://docs.tweepy.org/en/latest/pagination.html#api-v2
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/pagination

WARNING: Be aware that if you re-run the cells below many times in a short period of time, you will quickly exceed the Twitter API Rate Limits.
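
Since each page costs one API request, a little arithmetic before collecting data shows how many requests a job will use. The sketch below is an illustration only: `pages_needed` and `fits_in_window` are hypothetical helpers, and the `requests_per_window` value is a made-up placeholder that you would replace with the limit listed for your endpoint and access level on the Twitter API website.

```python
import math


def pages_needed(total_results, max_results):
    """Number of API requests ("pages") needed to collect total_results."""
    return math.ceil(total_results / max_results)


def fits_in_window(total_results, max_results, requests_per_window):
    """True if the whole collection fits inside one rate-limit window."""
    return pages_needed(total_results, max_results) <= requests_per_window


print(pages_needed(1000, 100))   # 1,000 tweets at 100 per page
print(pages_needed(75, 10))      # the last page is only partly used
print(fits_in_window(5000, 100, requests_per_window=30))  # 30 is a made-up limit
```

If a job does not fit in one window, it has to be spread out over time, either by waiting manually or by letting the client pause when the limit is reached.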

In [None]:
# First, start with a basic search
search_results = client.search_recent_tweets(
    query = "pokemon",
    max_results=10
)

# Inspect the data
search_results.data

#### tweepy.Paginator (by pages)

In [None]:
# To start using tweepy.Paginator(...) there are three core requirements to include
# as parameters.

# 1) method = the client.tweepy_function_here ; this specifies the iterable API function you want to use.
# 2) limit = ### ; this specifies the number of "pages" of API results to collect 
# 3) param1 = .... ; all tweepy functions require at least one parameter (e.g. "query" for searching)
# and you will need to include all required parameters appropriate to the function specified in the method parameter.

# EXAMPLE
# tweepy.Paginator(
#     method = the_core_tweepy_function_you_want,
#     limit = the_number_of_pages,
#     param1 = ..., 
#     param2 = ..., 
#     param3 = ...,  
#     )

# The following example will search for recent tweets with the keyword "Pokemon".
# It will collect 3 "pages", each of which will have 10 results; 30 tweets in total.
tweepy.Paginator(
    method = client.search_recent_tweets,  # required
    limit=3,                               # required
    query="pokemon",                       # required
    max_results=10                         # optional
    )

In [None]:
# Same example as above, but with the results stored in the my_search_paginator object.

my_search_paginator = tweepy.Paginator(
    method = client.search_recent_tweets,
    limit=3,
    query="pokemon", 
    max_results=10
    )

In [None]:
# Even as an object, the Paginator will not actually collect data until you
# start to loop through it.
my_search_paginator
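
The "lazy" behavior described above is the same behavior a Python generator has: nothing runs until you loop over it. The sketch below imitates the idea with a fake search function so that it runs without any API access. Everything here (`FAKE_PAGES`, `fake_search`, `paginate`, the `next_token` field name) is a simplified stand-in for illustration and is not Tweepy's actual implementation.

```python
# Three fake "pages" of results, linked together by a next_token value
FAKE_PAGES = {
    None: {"data": ["tweet 1", "tweet 2"], "next_token": "a"},
    "a": {"data": ["tweet 3", "tweet 4"], "next_token": "b"},
    "b": {"data": ["tweet 5"], "next_token": None},
}

request_log = []  # records every "request" that is actually made


def fake_search(next_token=None):
    """Stand-in for an API call: returns one page and logs the request."""
    request_log.append(next_token)
    return FAKE_PAGES[next_token]


def paginate(method, limit):
    """Yield up to `limit` pages, following next_token from page to page."""
    token = None
    for _ in range(limit):
        page = method(next_token=token)
        yield page
        token = page["next_token"]
        if token is None:  # no more pages available
            break


pager = paginate(fake_search, limit=2)
print("requests before looping:", len(request_log))

for page in pager:
    print(page["data"])

print("requests after looping:", len(request_log))
```

Creating `pager` makes no requests at all; each request happens only when the for-loop asks for the next page.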

In [None]:
# Using a for-loop, loop through the paginator and print() the results

for page in my_search_paginator:
  print(page)
  print("-----------------------------------------------")
  print("-----------------------------------------------")
  print("-----------------------------------------------")

In [None]:
# Same example as above
my_search_paginator = tweepy.Paginator(
    method = client.search_recent_tweets,
    limit=3,
    query="pokemon", 
    max_results=10
    )

# Iterate through each page of results.
# Within each page of results, iterate through each tweet object and print the 
# tweet's text.  Tweets are segmented with *** and pages are segmented by ---

for page in my_search_paginator:  
  for tweet in page.data:
    print(tweet.text)
    print("**************************")

  print("-----------------------------------------------")
  print("-----------------------------------------------")
  print("-----------------------------------------------")

In [None]:
# This example expands both limit= and max_results= to collect 1,000 tweets
# in total.  Additional tweet_fields are also specified to collect additional
# data about each individual tweet.

my_search_paginator = tweepy.Paginator(
    method = client.search_recent_tweets,
    limit=10,
    query="pokemon",
    max_results=100,
    tweet_fields = ["attachments","created_at", "entities", "geo", "id","public_metrics","possibly_sensitive", "source", "text",],
    )

# Create an empty list to hold one small DataFrame per tweet
paged_frames = []


# Iterate through each page of results
for n,page in enumerate(my_search_paginator):  

  # Print the current page number
  print("PAGE NUMBER: ",n)
  
  # Iterate through each tweet in the page.data, convert to a dataframe, and 
  # then add it to the paged_frames list above.
  for tweet in page.data:
    paged_frames.append(pd.json_normalize(tweet.data, sep="_"))

# Combine everything into a single DataFrame.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
paged_results_df = pd.concat(paged_frames, ignore_index=True)

In [None]:
# Inspect final results!

paged_results_df

In [None]:
# Once the data is collected, you can start analyzing data!
paged_results_df.describe()

#### tweepy.Paginator (by results)

In [None]:
# The prior examples relied on collecting a fixed number of pages with the limit=
# parameter inside of the Paginator(...) function.

# The alternative below adds .flatten(...) to the end of Paginator(...) and moves
# the limit= parameter to the .flatten(...) method.

# In this case, the "limit" now refers to the total number of individual results.
# In the example below, where limit=75, the Paginator will continue to run until
# 75 tweets have been collected.

my_search_paginator = tweepy.Paginator(
    method = client.search_recent_tweets,
    query="pokemon",
    max_results=10,
    ).flatten(limit = 75)

In [None]:
# Same example as above
my_search_paginator = tweepy.Paginator(
    method = client.search_recent_tweets,
    query="pokemon",
    max_results=10,
    ).flatten(limit = 20)

# This version of Paginator will yield every item or result individually, not as
# a set of pages like in the prior examples
for item in my_search_paginator:
  print(item.data)
  print()

In [None]:
# Same example as above, but with unevenly matched values between max_results
# and limit parameters.

my_search_paginator = tweepy.Paginator(
    method = client.search_recent_tweets,
    query="pokemon",
    max_results=10,
    ).flatten(limit = 23)

# This for-loop will help us keep count of results using enumerate(...)
for n,tweet in enumerate(my_search_paginator):
  print(n,"\t",tweet.data)
  print()
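
When max_results and limit do not divide evenly, as in the example above, the last page is requested in full but only part of it is used. The sketch below imitates this with fake pages so it runs without API access. `fake_pages` and `flatten` are simplified stand-ins written for illustration, not Tweepy's own code.

```python
requests_made = 0  # counts how many fake "pages" were requested


def fake_pages(page_size):
    """Stand-in for a paginator: yields endless pages of fake tweet numbers."""
    global requests_made
    start = 0
    while True:
        requests_made += 1
        yield list(range(start, start + page_size))
        start += page_size


def flatten(pages, limit):
    """Yield individual results from successive pages until limit is reached."""
    count = 0
    for page in pages:
        for item in page:
            yield item
            count += 1
            if count == limit:
                return


results = list(flatten(fake_pages(page_size=10), limit=23))
print(len(results), "results collected using", requests_made, "requests")
```

Here 23 results at 10 per page require three requests, and the remaining items on the third page are simply never yielded.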

# Part 5 - Practice

In the following blocks of code, you are encouraged to change and tweak the code to generate sets of Twitter search results.  Feel free to experiment with different parameters and search operators found on the Twitter API website.

In [None]:
# Let's look at Nikema Williams (GA-05 House of Representatives)
# https://twitter.com/NikemaWilliams

# First, we'll use get_user(...) to get Williams's account details
williams = client.get_user(
    username = "NikemaWilliams",
    user_fields = ["created_at", "description","public_metrics","verified",],
)

print("@NikemaWilliams Twitter ID: ", williams.data.id)
print("@NikemaWilliams Number of Followers: ", williams.data.public_metrics["followers_count"])

In [None]:
# Configure the Paginator function in order to collect the user-account information
# for 10,000 followers of the NikemaWilliams Twitter account.

# The following API page contains the request limits and parameter options 
# relevant to the limit= and max_results= parameters.
# https://developer.twitter.com/en/docs/twitter-api/users/follows/api-reference/get-users-id-followers

# method = needs to be set to the client.get_users_followers function.

# limit = you SHOULD set this to a number smaller than the maximum number of 
# requests in the "Authentication and rate limits" section near the top of the page

# max_results = set this to the largest allowable value as defined by the 
# Twitter API.  This info is in the "Query Parameters" section of the page.

# id = set this equal to the williams.data.id OR manually type the number as a string "#####"

my_practice_paginator = tweepy.Paginator(
    method = client.get_users_followers,
    limit = ???,
    id = williams.data.id,
    max_results = ???,
    user_fields = ["created_at", "description","public_metrics","verified",],
    )

In [None]:
# Create an empty list to hold one small DataFrame per user-account
practice_frames = []

# Iterate through each page of results
for n,page in enumerate(my_practice_paginator):  

  # Print the current page number
  print("PAGE NUMBER: ",n)
  
  # Iterate through each user-account in the page.data, convert to a dataframe, and 
  # then add it to the practice_frames list above.
  for user in page.data:
    practice_frames.append(pd.json_normalize(user.data, sep="_"))

# Combine everything into a single DataFrame.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
paged_practice_followers = pd.concat(practice_frames, ignore_index=True)

In [None]:
paged_practice_followers

# Part 6 - Streaming Tweets (AKA the "fire-hose")

Twitter API streams are notably more complicated than everything to this point in the workshop.  These endpoints for the Twitter API allow you to capture Twitter data in realtime.  The examples below simply show the data being printed on screen or stored in local list-objects.  But you could hypothetically write Python commands that store live-streamed tweets into an SQL database OR save tweets as local files (e.g. as a "pickle") OR whatever else you'd like to do with data.

This portion of the workshop code can be extremely challenging, so please do not feel bad if it takes you a while to wrap your head around everything here.

* Twitter API - [Streaming](https://developer.twitter.com/en/docs/tutorials/stream-tweets-in-real-time)
* Tweepy - [Streaming Guide](https://docs.tweepy.org/en/latest/streaming_how_to.html)
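
As one illustration of the "store tweets in an SQL database" idea mentioned above, the sketch below saves tweets into an SQLite database using only Python's built-in sqlite3 module. The table layout, the `save_tweet` helper, and the sample tweets are all assumptions made for this example. A real project would likely store more fields.

```python
import sqlite3

# ":memory:" keeps the database in RAM; use a file path to keep the data on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")


def save_tweet(conn, tweet_id, text):
    """Insert one tweet; skip it if the same ID was already stored."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)",
        (tweet_id, text),
    )
    conn.commit()


# Made-up sample tweets, including one duplicate
save_tweet(conn, 1, "hello atlanta")
save_tweet(conn, 2, "traffic on I-85 again")
save_tweet(conn, 1, "hello atlanta")

stored_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print("tweets stored:", stored_count)
```

One caveat if you adapt this for live streams: by default, an sqlite3 connection can only be used from the thread that created it, so code running in a background thread would need to open its own connection.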

In [None]:
import tweepy
import time

In [None]:
# Create a local list-object that we will use to store results as they come in.
# Long-term, this is a VERY BAD STRATEGY for data management and processing.  
# But for this workshop it is fine for demonstration purposes.
statuses_container = []

# Define a custom class to represent our use of the tweepy.Stream tools.
# Within the class, we will add custom code to supplement existing on_status(...)
# functionality that already comes with tweepy.  This is where we will tell Python
# what to do with incoming data and where to store it.

class my_streamer_class(tweepy.Stream):
    def on_status(self, status):
      print(status.id,"\t",status.text)  # print the data on-screen
      statuses_container.append(status)  # store the data in statuses_container

# Create a stream object by calling on the class defined above.
stream = my_streamer_class(
    consumer_key = my_consumer_key,
    consumer_secret = my_consumer_secret,
    access_token = my_access_token,
    access_token_secret = my_access_secret,
)

### stream.sample(...)
* Tweepy - https://docs.tweepy.org/en/latest/stream.html#tweepy.Stream.sample
* Twitter API (v1) - https://developer.twitter.com/en/docs/twitter-api/v1/tweets/sample-realtime/overview

In [None]:
# Create/Re-create an empty list for storing results
statuses_container = []

# Start the sampling stream.  threaded=True is necessary so that we can tell the
# stream to stop after a period of time.  Otherwise it will run forever.
stream.sample(threaded=True, languages=["en"])

# Tell Python to wait for (X) seconds while the stream is running
time.sleep(5)

# Disconnect the stream
stream.disconnect()
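
The start / sleep / disconnect pattern above works because the stream runs in a background thread while the main program waits. The sketch below reproduces that pattern with a fake stream built from Python's threading module, so it runs without any API access. `FakeStream` is a hypothetical class for illustration and does not reflect how Tweepy is implemented internally.

```python
import threading
import time


class FakeStream:
    """Produces fake tweets in a background thread until disconnected."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = None
        self.received = []

    def _run(self):
        n = 0
        while not self._stop.is_set():
            self.received.append(f"fake tweet {n}")
            n += 1
            time.sleep(0.01)  # pretend to wait for the next tweet

    def start(self):
        # Run the collection loop in the background so the main program continues
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def disconnect(self):
        # Signal the loop to stop, then wait for the thread to finish
        self._stop.set()
        self._thread.join()


fake = FakeStream()
fake.start()         # returns immediately; collection continues in the background
time.sleep(0.2)      # the main program waits while data arrives
fake.disconnect()

print(len(fake.received), "fake tweets collected")
```

Without the background thread, the call that starts the stream would never return, and the lines that sleep and disconnect would never be reached.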

In [None]:
# The number of statuses captured from stream
len(statuses_container)

In [None]:
# Inspect the first few statuses
statuses_container[0:5]

In [None]:
# Inspect the first item
statuses_container[0]

In [None]:
# Convert to Pandas DataFrame format
pd.json_normalize(statuses_container[0]._json)

### stream.filter(...)

* Tweepy - https://docs.tweepy.org/en/latest/stream.html#tweepy.Stream.filter
* Twitter API - https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/api-reference/post-statuses-filter

#### stream for "atlanta"

In [None]:
# Create/Re-create an empty list for storing results
statuses_container = []

# Start the filtered stream.  Searching for tweets containing the keywords "atlanta" or "atl".
stream.filter(track=["atlanta","atl"] , threaded=True, languages=["en"])

# Tell Python to wait for (X) seconds while the stream is running
time.sleep(15)

# Disconnect the stream
stream.disconnect()

In [None]:
# Inspect list of tweet objects
statuses_container

#### stream for "nyc" OR "new york city"

In [None]:
# Create/Re-create an empty list for storing results
statuses_container = []

# Start the filtered stream.  Searching for tweets containing "nyc" or "new york city".
stream.filter(track=["nyc","new york city"] , threaded=True, languages=["en"])

# Tell Python to wait for (X) seconds while the stream is running
time.sleep(15)

# Disconnect the stream
stream.disconnect()

In [None]:
# Inspect list of tweet objects
statuses_container

#### stream of accounts followed by GSU

In [None]:
# Collect GSU's Twitter Account
gastate = client.get_user(
    username = "GeorgiaStateU",
)

# Get a list of accounts that GSU follows on Twitter
gastate_follows = client.get_users_following(
    id = gastate.data.id,
    max_results = 1000
)

# Create list of IDs
gastate_follows_ids = [x.id for x in gastate_follows.data]

In [None]:
# Preview list of gastate_follows_ids
gastate_follows_ids[:10]

In [None]:
# Create/Re-create an empty list for storing results
statuses_container = []

# Start the filtered stream.  Searching for tweets from accounts followed by GeorgiaStateU
stream.filter(follow=gastate_follows_ids , threaded=True)

# Tell Python to wait for (X) seconds while the stream is running
time.sleep(60)

# Disconnect the stream
stream.disconnect()

In [None]:
# Inspect list of tweet objects
statuses_container

# Part 6 - Practice

Much of what could be "practiced" in this section may be very difficult if this is your first exposure to working with APIs.  For this section, I suggest just experimenting with the stream.filter(...) parameters in the second block.  Test different keywords and combinations of languages and see what people are saying right now (on Twitter)!

In [None]:
import tweepy
import time

class my_streamer_class(tweepy.Stream):
    def on_status(self, status):
      print(status.id,"\t",status.text)  # print the data on-screen
      statuses_container.append(status)  # store the data in statuses_container

# Create a stream object by calling on the class defined above.
stream = my_streamer_class(
    consumer_key = my_consumer_key,
    consumer_secret = my_consumer_secret,
    access_token = my_access_token,
    access_token_secret = my_access_secret,
)

In [None]:
# languages options can be found on the search-operators page near the bottom
# Twitter API - https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query

# Create/Re-create an empty list for storing results
statuses_container = []

# Start the filtered stream.  Searching for tweets containing a keyword of your choice in a language of your choice (e.g. "en" for English).
stream.filter(languages=["???"], track=["???","???"], threaded=True)

# Tell Python to wait for (X) seconds while the stream is running
time.sleep(????)

# Disconnect the stream
stream.disconnect()

In [None]:
# Inspect list of tweet objects
statuses_container

# Examples in the "wild"!


The following blocks of code show two heavily truncated examples of how you could potentially use data collected from the Twitter API.  These examples are purely illustrative and are not analytically rigorous.  Naturally, you will need to consider your own research and analytic needs when collecting and making use of data harvested from Twitter.

In [None]:
# IMPORT NECESSARY PACKAGES AND DEFINE API KEYS AND CLIENT

# GENERAL PACKAGES
# !pip install git+https://github.com/tweepy/tweepy.git --upgrade
import tweepy
import pandas as pd
import numpy as np


# DEFINING API KEYS AND INITIATING CLIENT
# my_consumer_key = "???????????????????"
# my_consumer_secret = "?????????????????????"

# my_access_token = "?????????????????"
# my_access_secret = "????????????????????????????"

# my_bearer_token = "??????????????????????????????????????????????????????????????"


# client = tweepy.Client(
#     wait_on_rate_limit = True,
#     consumer_key = my_consumer_key,
#     consumer_secret = my_consumer_secret,
#     access_token = my_access_token,
#     access_token_secret = my_access_secret,
#     bearer_token = my_bearer_token,
# )

## Wordcloud

The following code demonstrates how to collect tweets from two specific Twitter timelines.  Then, using the WordCloud package, simple visualizations are created that highlight the relative frequency of different terms from each respective Twitter user.  The code is purely intended to illustrate a single, simple example of what you might do with data collected via the Twitter API, and it may not apply directly to your research interests.

If you are interested in creating more advanced WordClouds, please refer to the documentation: https://amueller.github.io/word_cloud/index.html

In [None]:
# WORD CLOUD PACKAGES
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

In [None]:
# Collect and organize the Twitter timeline from Stacey 
# Abrams' official Twitter account.

abrams_account = client.get_user(username="staceyabrams")

abrams_timeline = tweepy.Paginator(
        method = client.get_users_tweets,
        id = abrams_account.data.id,
        max_results = 100,
        user_fields = ["public_metrics"]
        ).flatten(limit=1000)

abrams_tweets = []

for k in abrams_timeline:
  abrams_tweets.append(k.data["text"])

In [None]:
# Collect and organize the Twitter timeline from Marjorie Taylor 
# Greene's official Twitter account.

greene_account = client.get_user(username="mtgreenee")

greene_timeline = tweepy.Paginator(
        method = client.get_users_tweets,
        id = greene_account.data.id,
        max_results = 100,
        user_fields = ["public_metrics"]
        ).flatten(limit=1000)

greene_tweets = []

for k in greene_timeline:
  greene_tweets.append(k.data["text"])

In [None]:
# Create a WordCloud using Abrams' Twitter timeline.  

# Define stopwords; words to exclude from the visualization
stopwords = set(STOPWORDS) | set(["rt","https","http","staceyabrams"])
comment_words = ''

# iterate through the list of tweets
for val in abrams_tweets:
      
    # typecast each val to string
    val = str(val)
  
    # split the value
    tokens = val.split()
      
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
      
    comment_words += " ".join(tokens)+" "

# Create WordCloud object
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10,
                regexp = r"\w{3,}"
                ).generate(comment_words)
  
# plot the WordCloud image                       
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.title("WordCloud of @{} 1,000 most recent Tweets".format(abrams_account.data.username),fontdict={"fontsize":20})

plt.show()
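
A word cloud is essentially a picture of word frequencies, so it can help to look at the underlying counts directly. The sketch below counts words with Python's built-in tools, reusing the same lowercase conversion, stopword idea, and "three or more characters" pattern as the cell above. The sample tweets and the `word_counts` helper are made up for illustration, and the WordCloud package applies its own additional processing, so its results will not match these counts exactly.

```python
import re
from collections import Counter

# Made-up sample tweets for illustration
sample_tweets = [
    "RT Vote early, vote often https://example.com",
    "Georgia voters: check your registration!",
    "Early voting starts Monday in Georgia",
]

# Words to exclude from the counts
stopwords = {"rt", "https", "http", "your", "the"}


def word_counts(texts, stopwords):
    """Count lowercase words of 3+ characters, skipping stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"\w{3,}", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


counts = word_counts(sample_tweets, stopwords)
print(counts.most_common(5))
```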

In [None]:
# Create a WordCloud using Greene's Twitter timeline.  

# Define stopwords; words to exclude from the visualization
stopwords = set(STOPWORDS) | set(["rt","https","http","mtgreenee","repmtg"])
comment_words = ''

# iterate through the list of tweets
for val in greene_tweets:

    # typecast each val to string
    val = str(val)
    
    # split the value
    tokens = val.split()
    
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
      
    comment_words += " ".join(tokens)+" "

# Create WordCloud object    
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10,
                regexp=r"\w{3,}").generate(comment_words)
                
# plot the WordCloud image                       
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.title("WordCloud of @{} 1,000 most recent Tweets".format(greene_account.data.username),fontdict={"fontsize":20})

plt.show()

## Topic Modelling

The following code demonstrates how to collect a batch of 5,000 Tweets, conduct basic topic modelling using the LDA algorithm, and then visualize the "topics" using the pyLDAvis package.  This section is included purely for illustration purposes and the underlying methods may be extremely dense and challenging to understand.  For additional information about these specific tools and methods, please refer to the following sources:

* Latent Dirichlet allocation (LDA, topic modelling) - https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
* Gensim LDA Model - https://radimrehurek.com/gensim/models/ldamodel.html
* pyLDAvis - https://pyldavis.readthedocs.io/en/latest/


In [None]:
# TOPIC MODELLING PACKAGES
# ! pip install git+https://github.com/bmabey/pyLDAvis/ --upgrade
import pyLDAvis
import pyLDAvis.gensim_models
import gensim

In [None]:
# Search for, collect, and organize 5,000 tweets
tweet_search = tweepy.Paginator(
        method = client.search_recent_tweets,
        query="pokemon lang:en -is:retweet", # keyword search
        max_results=100, # number of results per page
        ).flatten(limit=5000) # max number of results to collect

tweets_text = []

for k in tweet_search:
  tweets_text.append(k.data["text"])

In [None]:
# Check to make sure 5,000 tweets have been collected
len(tweets_text)

In [None]:
# Preview the first 5 tweets
tweets_text[:5]

In [None]:
# Process all tweets, tokenize, filter, and create an LDA model using the Gensim package.

tweets_tokens = [t.lower().replace("'","").split() for t in tweets_text]
tweets_tokens = [[token for token in tweet if len(token) > 3] for tweet in tweets_tokens]

dictionary = gensim.corpora.Dictionary(tweets_tokens)
dictionary.filter_extremes(no_below=2, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in tweets_tokens]

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                   id2word=dictionary,
                   num_topics=50, 
                   random_state=0,
                   chunksize=100,
                   per_word_topics=True,)
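
The preprocessing steps above (tokenizing, filtering rare and very common words, and converting each tweet to a "bag of words") can be hard to picture. The sketch below walks through the same ideas in plain Python on four made-up sentences. It is a simplified illustration, assuming that no_below means "appears in at least this many documents" and no_above means "appears in at most this fraction of documents". The ID numbers it assigns will not match the ones Gensim would produce.

```python
from collections import Counter

# Made-up sample documents for illustration
docs = [
    "Pikachu is the best starter pokemon",
    "Shiny pokemon hunting tonight",
    "Trading shiny Pikachu cards",
    "New pokemon cards released",
]

# Same tokenizing rule as the cell above: lowercase, drop apostrophes,
# and keep only tokens longer than 3 characters
tokens = [
    [w for w in d.lower().replace("'", "").split() if len(w) > 3]
    for d in docs
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(w for doc in tokens for w in set(doc))

# Keep tokens that are neither too rare nor too common
no_below, no_above = 2, 0.5
kept = sorted(
    w for w, df in doc_freq.items()
    if df >= no_below and df / len(tokens) <= no_above
)
vocab = {w: i for i, w in enumerate(kept)}


def to_bow(doc):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(w for w in doc if w in vocab)
    return sorted((vocab[w], c) for w, c in counts.items())


print(kept)
print([to_bow(doc) for doc in tokens])
```

In this tiny example "pokemon" is dropped because it appears in three of the four documents, which is the same reason a search keyword tends to disappear from the topics when every collected tweet contains it.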

In [None]:
# Create a dashboard visualization of the LDA model using the pyLDAvis package

vis = pyLDAvis.gensim_models.prepare(
        topic_model = lda_model,
        corpus = corpus,
        dictionary = dictionary
)

pyLDAvis.enable_notebook()
pyLDAvis.display(vis)