# Collecting Twitter Data with Tweepy

This notebook contains examples for using web-based APIs (Application Programming Interfaces) to download data from social media platforms.

This notebook focuses specifically on _Twitter_.

For most services, we need to register with the platform in order to use its API.
Instructions for Twitter's registration process are outlined in the sections below.

We will use APIs for three reasons: they *can* be much faster than manually copying and pasting data from the website; they provide uniform methods for accessing resources (searching by keyword, place, or date); and using them should conform to the platform's terms of service, which is important for partnerships and publications.
Note, however, that each platform places strict limits on access: e.g., requests per hour, search history depth, and the maximum number of items returned per request.
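
To get a feel for how these limits shape a collection job, the sketch below estimates how many requests a download needs and the minimum time spent waiting on rate-limit windows. The function names and the numbers are hypothetical and chosen only for illustration; check the platform's documentation for the actual limits of each endpoint.

```python
import math

def requests_needed(total_items, items_per_request):
    """Number of API calls needed to fetch total_items."""
    return math.ceil(total_items / items_per_request)

def minimum_minutes(total_items, items_per_request, requests_per_window, window_minutes):
    """Lower bound on waiting time, assuming each window's quota is used immediately."""
    calls = requests_needed(total_items, items_per_request)
    # The first window's quota is available right away, so we only
    # wait once for every full window that has been used up before the last call.
    full_waits = max(calls - 1, 0) // requests_per_window
    return full_waits * window_minutes

# Hypothetical limits: 200 items per request, 900 requests per 15-minute window
print(requests_needed(3200, 200))                 # 16
print(minimum_minutes(1_000_000, 200, 900, 15))   # 75
```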

In [None]:
# First we need to make sure we have tweepy installed...
!pip install tweepy

In [None]:
%matplotlib inline

import time
import json

<hr>

## Twitter API

Twitter's API is useful and flexible but takes several steps to configure. 
To get access to the API, you first need to have a Twitter account and have a mobile phone number (or any number that can receive text messages) attached to that account.
Then, we'll use Twitter's developer portal to create an "app" that will then give us the keys and tokens (essentially IDs and passwords) we will need to connect to the API.

So, in summary, the general steps are:

0. Have a Twitter account,
1. Configure your Twitter account with your mobile number,
2. Create an app on Twitter's developer site, and
3. Generate consumer and access keys and secrets, or a _bearer token_.

We will then plug these strings into the code below.

In [None]:
# For our first piece of code, we need to import the package 
# that connects to Twitter. Tweepy is a popular and fully featured
# implementation.

import tweepy

### Creating Twitter Credentials

The lecture materials provide more in-depth instructions for creating a Twitter account and setting it up for use with the following code.

You can also visit [this Medium post](https://towardsdatascience.com/ultimate-beginners-guide-to-collecting-text-for-natural-language-processing-nlp-with-python-256d113e6184) for a good overview of several data collection approaches or [this Twitter-specific Medium post](https://towardsdatascience.com/how-to-access-twitters-api-using-tweepy-5a13a206683b) for a walkthrough that is slightly outdated.


In [None]:
# Use the strings from your Twitter app webpage to populate these
# variables. Be sure to put the strings BETWEEN the quotation marks
# so that each one is a valid Python string.

api_key = "xxx"
api_secret = "xxx"
bearer_token = "xxx"
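
Keys typed directly into a notebook are easy to leak when the notebook is shared. One common alternative, sketched below, is to read them from environment variables. The helper `load_credentials` and the variable names `TWITTER_API_KEY`, `TWITTER_API_SECRET`, and `TWITTER_BEARER_TOKEN` are hypothetical; use whatever names fit your setup.

```python
import os

def load_credentials(env=os.environ):
    """Read API credentials from a mapping of environment variables.

    The variable names below are assumptions made for this example.
    """
    names = {
        "api_key": "TWITTER_API_KEY",
        "api_secret": "TWITTER_API_SECRET",
        "bearer_token": "TWITTER_BEARER_TOKEN",
    }
    # Fail loudly if anything is missing rather than connecting with empty strings
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    "TWITTER_API_KEY": "xxx",
    "TWITTER_API_SECRET": "xxx",
    "TWITTER_BEARER_TOKEN": "xxx",
}
creds = load_credentials(demo_env)
print(sorted(creds))  # ['api_key', 'api_secret', 'bearer_token']
```

Calling `load_credentials()` with no argument reads from the real environment, and the returned values can be assigned to `api_key`, `api_secret`, and `bearer_token`.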


### Connecting to Twitter

Once we have the authentication details set, we can connect to Twitter using Tweepy's application-only authentication handler (`AppAuthHandler`), as below.

In [None]:
# Now we use the configured authentication information to connect
# to Twitter's API
auth = tweepy.AppAuthHandler(api_key, api_secret)

api = tweepy.API(auth)

print("Connected to Twitter!")

### Testing our Connection

Now that we are connected to Twitter, let's do a brief check that we can read tweets by pulling the first few tweets from the New York Times' timeline and printing them.

In [None]:
target = "nytimes"
total_tweets = 10

# Call user_timeline() to get the first few tweets from the given user
# and iterate through them. We pass the account name as screen_name,
# which is accepted by both Tweepy 3.x and 4.x.
for tweet_obj in api.user_timeline(screen_name=target, count=total_tweets, tweet_mode="extended"):
    tweet = tweet_obj._json
    print(tweet["id"], tweet["created_at"], tweet["user"]["screen_name"], tweet["full_text"])
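
Printing is fine for a quick check, but collected tweets usually need to be saved. A minimal sketch, assuming each tweet is a plain dictionary like the `_json` attribute used above, is to write one JSON object per line. The helper names and the sample records are made up for illustration, and an in-memory buffer stands in for a file.

```python
import io
import json

def write_json_lines(tweets, handle):
    """Write each tweet dictionary as one JSON object per line; return the count."""
    count = 0
    for tweet in tweets:
        handle.write(json.dumps(tweet) + "\n")
        count += 1
    return count

def read_json_lines(handle):
    """Parse a JSON-lines stream back into a list of dictionaries."""
    return [json.loads(line) for line in handle if line.strip()]

# Made-up records that mimic the fields printed above
sample = [
    {"id": 1, "user": {"screen_name": "example_user"}, "full_text": "first example"},
    {"id": 2, "user": {"screen_name": "example_user"}, "full_text": "second example"},
]

buffer = io.StringIO()  # replace with open("tweets.json", "w") to write a real file
print(write_json_lines(sample, buffer))  # 2

buffer.seek(0)
print(read_json_lines(buffer)[1]["full_text"])  # second example
```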

### Dealing with Pages

As mentioned, Twitter serves results in pages. 
To get all results, we can use Tweepy's Cursor implementation, which handles this iteration through pages for us in the background.
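
To see what "iterating through pages in the background" means, here is a simplified sketch of the idea. It is not Tweepy's implementation: `fetch_page` is a hypothetical stand-in for a single API request that returns a few items plus a pointer to the next page.

```python
def fetch_page(cursor, page_size=3):
    """Stand-in for one API request: returns (items, next_cursor)."""
    data = list(range(1, 9))  # pretend the service holds eight items
    items = data[cursor:cursor + page_size]
    has_more = cursor + page_size < len(data)
    next_cursor = cursor + page_size if has_more else None
    return items, next_cursor

def iterate_items(limit=None):
    """Yield items one at a time, requesting a new page only when needed."""
    cursor, yielded = 0, 0
    while cursor is not None:
        items, cursor = fetch_page(cursor)
        for item in items:
            if limit is not None and yielded >= limit:
                return
            yield item
            yielded += 1

print(list(iterate_items(limit=5)))  # [1, 2, 3, 4, 5]
print(list(iterate_items()))         # [1, 2, 3, 4, 5, 6, 7, 8]
```

The caller sees a single stream of items and never has to track page boundaries, which is the convenience a cursor object provides.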

In [None]:
target = "nytimes"
total_tweets = 10

# Wrap user_timeline in a Cursor object by passing the method itself as the
# first argument, followed by the arguments user_timeline would normally take.
for tweet_obj in tweepy.Cursor(api.user_timeline, screen_name=target, tweet_mode="extended").items(total_tweets):
    tweet = tweet_obj._json
    print(tweet["id"], tweet["created_at"], tweet["user"]["screen_name"], tweet["full_text"])

In [None]:
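# Handler for waiting if we exhaust a rate limit.
# Note: Tweepy 4.x raises tweepy.TooManyRequests here; Tweepy 3.x
# raised tweepy.RateLimitError instead.
def limit_handled(cursor):
    while True:
        try:
            yield next(cursor)
        except tweepy.TooManyRequests:
            # Determine how long we need to wait for the friends/list limit to reset
            s = api.rate_limit_status()
            dif = s["resources"]["friends"]["/friends/list"]["reset"] - int(time.time())

            # If we have a wait time, wait for it
            if dif > 0:
                print("Sleeping for %d seconds..." % dif)
                time.sleep(dif)
        except StopIteration:
            # The cursor has no more items
            return
        except tweepy.TweepyException as err:
            # Any other API error (e.g., a protected account) ends this iteration
            print("Stopping early:", err)
            return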
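
The same wait-and-retry pattern can be tried without a network connection. In the sketch below, `FakeRateLimit`, `wait_and_retry`, and `fetch_next` are hypothetical stand-ins for the library's exception, the handler above, and the cursor, and the sleep function is injectable so the example finishes instantly.

```python
import time

class FakeRateLimit(Exception):
    """Stand-in for the library's rate-limit exception."""

def wait_and_retry(fetch_next, seconds_until_reset, sleep=time.sleep):
    """Yield results from fetch_next(), pausing whenever it signals a rate limit."""
    while True:
        try:
            yield fetch_next()
        except FakeRateLimit:
            wait = seconds_until_reset()
            if wait > 0:
                sleep(wait)
        except StopIteration:
            # The source is exhausted
            return

# Simulated source: one item, then a rate-limit error, then another item
events = iter(["a", FakeRateLimit(), "b"])

def fetch_next():
    event = next(events)  # raises StopIteration when nothing is left
    if isinstance(event, Exception):
        raise event
    return event

pauses = []  # record requested pauses instead of actually sleeping
print(list(wait_and_retry(fetch_next, lambda: 2, sleep=pauses.append)))  # ['a', 'b']
print(pauses)  # [2]
```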

In [None]:
target = "codybuntain"
total_friends = 5

# Get the first few accounts I follow ("friends") and the first few that each
# of them follows. Note: Tweepy 4.x names this method get_friends(); in
# Tweepy 3.x it was friends().
for friend in limit_handled(tweepy.Cursor(api.get_friends, screen_name=target).items(total_friends)):
    print(target, "->", friend.screen_name)

    for friend_of_friend in limit_handled(tweepy.Cursor(api.get_friends, screen_name=friend.screen_name).items(total_friends)):
        print("\t->", friend_of_friend.screen_name)