In the last section, we figured out how to interact with the Twitter API. Now we need to pull more data. 1000 tweets, which at 100 tweets per request means 10 requests, should be enough to play with.

There are a couple of pieces that go into acquiring clean data:

1. We need to figure out a way to make multiple requests, but without pulling duplicate tweets.

2. We need to parse the request output and put it into a form that can be analyzed, i.e. a pandas dataframe.

Tackling 1:

From https://dev.twitter.com/rest/public/timelines:

"To use max_id correctly, an application’s first request to a timeline endpoint should only specify a count. When processing this and subsequent responses, keep track of the lowest ID received. This ID should be passed as the value of the max_id parameter for the next request, which will only return Tweets with IDs lower than or equal to the value of the max_id parameter. Note that the max_id parameter is inclusive."
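
To see the cursor logic on its own before touching the real API, here is a minimal sketch that pages through a made-up in-memory timeline. `FAKE_TIMELINE`, `fake_search` and `collect_ids` are hypothetical names invented for this illustration and are not part of the Twitter API or the `twitter` package; only the `max_id` behaviour (inclusive, newest tweets first) is taken from the quote above.

```python
# Hypothetical in-memory "timeline" of tweet IDs, newest first (1050 down to 1001).
FAKE_TIMELINE = list(range(1050, 1000, -1))


def fake_search(count, max_id=None):
    """Stand-in for a timeline endpoint: newest IDs that are <= max_id."""
    eligible = [t for t in FAKE_TIMELINE if max_id is None or t <= max_id]
    return eligible[:count]


def collect_ids(n_requests, count):
    """Page backwards through the timeline without fetching duplicates."""
    collected = []
    max_id = None  # the first request specifies only a count
    for _ in range(n_requests):
        page = fake_search(count, max_id=max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # max_id is inclusive, so step just below the lowest ID seen so far
        max_id = min(collected) - 1
    return collected


ids = collect_ids(n_requests=3, count=20)
print(len(ids), len(set(ids)))
```

Dropping the `- 1` would make every request after the first start with the oldest tweet from the previous page, which is exactly the duplicate we are trying to avoid.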

In [10]:
import twitter
import pandas as pd
from twitter_helpers.authentication import authenticate_twitter

In [27]:
def scrape_tweets(SEARCHTERM, n):
    """
    Input: a search term and a number of tweets to grab
    Output: a pandas dataframe of the tweet text and other parameters
    """
    twitter_api = authenticate_twitter()
    data_types = ['id', 'text', 'retweet_count']

    # one list per field; these become the dataframe columns
    tweets_dict = {data: [] for data in data_types}
    max_id = None  # per the Twitter docs, the first request only specifies a count

    for i in range(n // 100):
        print('Getting tweets', i*100, 'to', (i+1)*100)
        params = {'q': SEARCHTERM, 'count': 100}
        if max_id is not None:
            params['max_id'] = max_id
        search = twitter_api.search.tweets(**params)
        statuses = search['statuses']
        if not statuses:
            break  # no older tweets left for this search term

        # a request can return fewer than 100 tweets, so iterate over what came back
        for status in statuses:
            for data in data_types:
                tweets_dict[data].append(status[data])

        # max_id is inclusive, so subtract 1 to skip the oldest tweet we already have
        max_id = min(tweets_dict['id']) - 1

    # convert to a pandas dataframe and return
    return pd.DataFrame(tweets_dict)

In [28]:
tweets = scrape_tweets(SEARCHTERM="Super Bowl", n=1000)
tweets.head()

Getting tweets 0 to 100
Getting tweets 100 to 200
Getting tweets 200 to 300
Getting tweets 300 to 400
Getting tweets 400 to 500
Getting tweets 500 to 600
Getting tweets 600 to 700
Getting tweets 700 to 800
Getting tweets 800 to 900
Getting tweets 900 to 1000


Unnamed: 0,id,retweet_count,text
0,838630593973862400,0,#porn clip during the super bowl adult mom mov...
1,838630564022349824,10,RT @slayjoannex: New pictures of Lady Gaga reh...
2,838630550831149059,14,RT @ladygaga_JWT: Can you believe it's already...
3,838630541821886464,0,"@Yo_reez no doubt, I remember his transformati..."
4,838630376553664512,1,RT @MarquezRene2: Pot so big call it Super Bowl


In [29]:
tweets.head(3)

Unnamed: 0,id,retweet_count,text
0,838630593973862400,0,#porn clip during the super bowl adult mom mov...
1,838630564022349824,10,RT @slayjoannex: New pictures of Lady Gaga reh...
2,838630550831149059,14,RT @ladygaga_JWT: Can you believe it's already...


Now that we have our data, we need to validate that it was scraped correctly.

In [31]:
tweets['id'].value_counts().value_counts()

1    1000
Name: id, dtype: int64

Awesome. Every ID appears exactly once, so we have 1000 unique tweets.
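
The same check can be written more directly. This is a minimal sketch on a small hand-made dataframe (the IDs and text are invented, and one ID is repeated on purpose) rather than on the scraped data; `Series.is_unique`, `Series.duplicated` and `DataFrame.drop_duplicates` are standard pandas methods.

```python
import pandas as pd

# Invented IDs for illustration; the last row repeats the first on purpose.
sample = pd.DataFrame({'id': [105, 104, 103, 105],
                       'text': ['a', 'b', 'c', 'a']})

# True only if no ID occurs more than once
all_unique = sample['id'].is_unique

# number of rows whose ID was already seen in an earlier row
n_duplicates = int(sample['id'].duplicated().sum())

# keep the first occurrence of each ID
deduped = sample.drop_duplicates(subset='id')

print(all_unique, n_duplicates, len(deduped))
```

On the scraped data, `tweets['id'].is_unique` should come back `True` if the `max_id` bookkeeping worked.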