### Twitter API's bearer_token is required

In [265]:
# Provide your bearer_token directly by uncommenting the assignment below,
# or keep it in a separate Python file (here twitter_authentication.py) and import it.
# bearer_token = ''
from twitter_authentication import bearer_token
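
Another way to keep the token out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `TWITTER_BEARER_TOKEN`; that name and the `load_bearer_token` helper are hypothetical, chosen for illustration, and are not something tweepy requires.

```python
import os

def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Return the token stored in env_var, or raise if it is unset or empty.

    The variable name is an assumption made for this example.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your bearer token")
    return token
```

With the variable set, `bearer_token = load_bearer_token()` could replace the import above.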

### Connect to API

In [264]:
import os
import tweepy as tw
import pandas as pd
from tqdm import tqdm, notebook

In [257]:
client = tw.Client(bearer_token, wait_on_rate_limit=True)

### Get Tweets

By default the only information returned is the tweet ID and the text. Often, we will want information about authors, too. To get information about the author, you need to add the user_fields parameter with the fields you want as well as the expansions = 'author_id' parameter.

To get more information about the tweet, you need the tweet_fields parameter. The options are shown at https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all

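You will also likely want to build a somewhat advanced query; instructions are at https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. The query below matches only the hashtag #vaccine, so retweets and non-English tweets are included in the results. To restrict the search to English-language tweets that are not retweets, append the operators lang:en and -is:retweet to the query.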
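
To illustrate the query syntax, the helper below composes a search string from a term plus the optional `lang:` and `-is:retweet` operators covered in the query guide linked above. `build_query` is a hypothetical helper written for this example, not part of tweepy.

```python
def build_query(term, lang=None, exclude_retweets=False):
    """Join a search term with optional Twitter search operators.

    Operators separated by spaces are combined with a logical AND.
    """
    parts = [term]
    if lang:
        parts.append(f"lang:{lang}")
    if exclude_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)

query = build_query("#vaccine", lang="en", exclude_retweets=True)
print(query)  # #vaccine lang:en -is:retweet
```

The resulting string can be passed as the `query` argument of `client.search_recent_tweets`.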

In [258]:
# get recent matching tweets according to the query
tweets = client.search_recent_tweets(query='#vaccine', 
                                     max_results=100,
                                     user_fields = ['username', 'public_metrics', 'description', 'location'],
                                     tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                     expansions = 'author_id',
                                     start_time = '2021-11-20T00:00:00Z',
                                     end_time = '2021-11-21T00:00:00Z')
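
Each call returns at most `max_results` tweets, so collecting more means combining several responses. The sketch below shows one way to merge pages while tolerating empty ones. `collect_tweets` and the stand-in pages are hypothetical and exist only so the example runs without network access; the assumption is that each page exposes `.data` and `.includes` like the response above.

```python
from types import SimpleNamespace

def collect_tweets(pages):
    """Gather tweets and users from an iterable of Response-like pages.

    Each page is expected to expose .data (a list, or None when empty)
    and .includes (a dict that may lack the 'users' key).
    """
    all_tweets, users_by_id = [], {}
    for page in pages:
        all_tweets.extend(page.data or [])
        for user in (page.includes or {}).get("users", []):
            users_by_id[user.id] = user  # later pages overwrite duplicates
    return all_tweets, users_by_id

# Stand-in pages with made-up values (not real API output).
fake_pages = [
    SimpleNamespace(data=[SimpleNamespace(id=1, author_id=10)],
                    includes={"users": [SimpleNamespace(id=10, username="a")]}),
    SimpleNamespace(data=None, includes={}),
]
all_tweets, users_by_id = collect_tweets(fake_pages)
print(len(all_tweets), sorted(users_by_id))  # 1 [10]
```

With a real client, `pages` could plausibly come from `tw.Paginator(client.search_recent_tweets, ...)` called with the same keyword arguments as above, which yields one response per request; check the tweepy documentation for the version you have installed.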

Breakdown of the data structure (as of Nov. 24, 2021; tweepy is updated often, so this may change in the future):
1. Use dir(object) to list the attributes and methods available on an unfamiliar object.
2. tweets.data[i]['text'] returns the text of the i-th tweet.

In [259]:
print(type(tweets), [item for item in dir(tweets) if not item.startswith('_')])
print(type(tweets.data), type(tweets.data[0]),[item for item in dir(tweets.data[0]) if not item.startswith('_')])
print(type(tweets.data[0]['text']))

<class 'tweepy.client.Response'> ['count', 'data', 'errors', 'includes', 'index', 'meta']
<class 'list'> <class 'tweepy.tweet.Tweet'> ['attachments', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'data', 'entities', 'geo', 'get', 'id', 'in_reply_to_user_id', 'items', 'keys', 'lang', 'non_public_metrics', 'organic_metrics', 'possibly_sensitive', 'promoted_metrics', 'public_metrics', 'referenced_tweets', 'reply_settings', 'source', 'text', 'values', 'withheld']
<class 'str'>


### Convert Tweepy data structure to Pandas DataFrame
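1. The response contains two separate data structures, one for tweets and one for users. They are linked by the user ID: the author_id of each tweet matches the id of a user.
2. We build a dictionary that maps each user ID to that user's position in the users list, use it to look up each tweet's author, and collect the combined fields into a Pandas DataFrame.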

In [260]:
result = []

# first, map each user id to its index in the users list so we can look up a tweet's author
users = tweets.includes['users']
user_dict = {}
for i, user in enumerate(users):
    user_dict[user.id] = i
    
# then we collect all tweets info
for tweet in tweets.data:
    # for each tweet, find the author
    user_index = user_dict[tweet.author_id]
    user = users[user_index]
    tweet_dict = { 'author_id': tweet.author_id, 
                   'username': user.username,
                   'author_followers': user.public_metrics['followers_count'],
                   'author_tweets': user.public_metrics['tweet_count'],
                   'author_description': user.description,
                   'author_location': user.location,
                   'text': tweet.text,
                   'created_at': tweet.created_at,
                   'retweets': tweet.public_metrics['retweet_count'],
                   'replies': tweet.public_metrics['reply_count'],
                   'likes': tweet.public_metrics['like_count'],
                   'quote_count': tweet.public_metrics['quote_count']}
    # append current tweet's dictionary to the result list
    result.append(tweet_dict)
    
df_tweet = pd.DataFrame(result)
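
As an alternative to the lookup loop above, the same join can be expressed with `DataFrame.merge`. This is a minimal sketch on made-up stand-in records (the names `tweet_rows`, `user_rows`, and the values are invented for illustration), assuming tweets and users have already been flattened into lists of dictionaries that share an `author_id` column.

```python
import pandas as pd

# Stand-in records shaped like the fields used above (made-up values).
tweet_rows = [
    {"author_id": 10, "text": "first tweet", "likes": 2},
    {"author_id": 11, "text": "second tweet", "likes": 0},
    {"author_id": 10, "text": "third tweet", "likes": 5},
]
user_rows = [
    {"author_id": 10, "username": "alice", "author_followers": 445},
    {"author_id": 11, "username": "bob", "author_followers": 37},
]

df_tweets_only = pd.DataFrame(tweet_rows)
df_users = pd.DataFrame(user_rows)

# A left join keeps every tweet and attaches its author's columns;
# validate="many_to_one" raises if a user id appears twice in df_users.
df_joined = df_tweets_only.merge(
    df_users, on="author_id", how="left", validate="many_to_one"
)
print(df_joined[["username", "text"]])
```

The explicit loop makes each field mapping visible, while the merge is shorter and checks the one-user-per-id assumption for you.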

In [261]:
df_tweet.head()

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,likes,quote_count
0,1444850387698262017,American_Alley2,9968,38582,"PRO-TRUMP, #2A Love God & support those in Uniform and stand for the Flag Married NO LIST No DM! #BACKTHEBLUE #FightForTrump #FightLikeAFlynn",Ohio,"RT @kernaghanscott5: This chart from WHO says all we need to know.👇\n\nMore vaccines, more deaths.\n\nIFB ALL Patriots! @kernaghanscott5 🇺🇸\n\n#W…",2021-11-20 23:59:20+00:00,220,0,0,0
1,1242841869777883139,WakeUpFrance31,301,86650,,,"RT @Cheikh_Redac: Révolte anti-vax en #Guadeloupe : le Président enfonce #Macron : “j’ai prévenu Paris, mais j’ai reçu du mépris” https://…",2021-11-20 23:59:17+00:00,214,0,0,0
2,3331732167,jessicakardos,445,2648,"Voice Actor and busy mom of 2. You may know me as Sue Ellen on Arthur, Jen Larkin on What's with Andy? and Cameron on the Belle and Sebastian reboot!",Montreal,#COVID19 sucks. Day 4 post symptoms onset. Thank goodness for the #vaccine https://t.co/qYOzHRrio1,2021-11-20 23:59:08+00:00,0,2,2,0
3,1378116838954041353,TheLid14,37,4698,"Love God, my family, and my country.",,"RT @EpochTimes: An Olathe, #Kansas, mother sued @Walmart for allegedly giving a #Vaccine against #COVID19 to her 15-year-old daughter witho…",2021-11-20 23:59:07+00:00,63,0,0,0
4,825540792882065408,anotherAKGorman,261,126693,Artist. Feminist. Warren Democrat.,,RT @tmprowell: More than 2/3 of fully #vaccinated 🇺🇸 adults are 6+ mos since their last dose of #COVID19 #vaccine. On the left is why that…,2021-11-20 23:59:03+00:00,6,0,0,0


### Pipeline output: Write the resulting DataFrame to a CSV file

In [252]:
df_tweet.to_csv('vaccine_tweet.csv') 
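
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below uses a made-up two-row frame and in-memory buffers instead of a file to show the effect of passing `index=False`.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"username": ["alice", "bob"], "likes": [2, 0]})  # made-up data

with_index = io.StringIO()
df_demo.to_csv(with_index)                  # default: the index is written too
with_index.seek(0)

without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)  # skip the index column
without_index.seek(0)

print(list(pd.read_csv(with_index).columns))     # ['Unnamed: 0', 'username', 'likes']
print(list(pd.read_csv(without_index).columns))  # ['username', 'likes']
```

If the index carries no information, `df_tweet.to_csv('vaccine_tweet.csv', index=False)` avoids the extra column.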