In the last section, we figured out how to interact with the Twitter API. Now we need to pull more data. 1000 tweets, or 10 requests, should be enough to play with.

There are a couple pieces that go into acquiring clean data :

1. We need to figure out a way to make multiple requests, but without pulling duplicate tweets.

2. We need to parse the request output into a form and then put it into an form that can be analyzed - i.e. a pandas dataframe.

Tackling 1:

From https://dev.twitter.com/rest/public/timelines:

"To use max_id correctly, an application’s first request to a timeline endpoint should only specify a count. When processing this and subsequent responses, keep track of the lowest ID received. This ID should be passed as the value of the max_id parameter for the next request, which will only return Tweets with IDs lower than or equal to the value of the max_id parameter. Note that the max_id parameter is inclusive."

In [3]:
import twitter
import pandas as pd
import python.twitter_authentication as twit_auth
twitter_api = twit_auth.authenticate_twitter()

In [12]:
% time
SEARCHTERM = "Super Bowl"
n = 500
min_index = 99999999999999999999

data_types = ['id', 'text', 'retweet_count']

tweets_dict = {}
tweets_dict['id'] = []
tweets_dict['text'] = []
tweets_dict['retweet_count'] = []

# initial search without max_id parameter
search = twitter_api.search.tweets(q=SEARCHTERM, count=100)
results = list(search.values())

for data in data_types:
    for i in range(100):
        tweets_dict[data].append(results[0][i][data])
        if data == 'id' and results[0][i][data] < min_index:
            min_index = results[0][i][data]

# now repeat the request to get rest of results,
# setting max_id to the lowest id - 1 (to avoid duplicate tweets)
for i in range(n // 100 - 1):
    print('Getting tweets', (i+1)*100, 'to', (i+2)*100)
    search = twitter_api.search.tweets(q=SEARCHTERM, 
                                       count=100, 
                                       max_id=str(min_index))
    results = list(search.values())

    for data in data_types:
        for i in range(100):
            tweets_dict[data].append(results[0][i][data])
            if data == 'id' and results[0][i][data] < min_index:
                min_index = results[0][i][data]

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 4.29 µs
Getting tweets 100 to 200
Getting tweets 200 to 300
Getting tweets 300 to 400
Getting tweets 400 to 500


In [10]:
tweets = pd.DataFrame(tweets_dict)

Now that we have our data, we need to validate that it was scraped correctly.

In [11]:
tweets['id'].value_counts().value_counts()

1    492
2      4
Name: id, dtype: int64

Awesome. We can see that we have 1000 unique tweets.