In [1]:
%matplotlib inline

import time
import gzip
import json

<hr>
<img src="http://www.cs.umd.edu/~cbuntain/inst728e/TwitterLogo.png" width="20%">

## Twitter API

Twitter's API is the most useful and flexible option, but it takes several steps to configure. 
To get access to the API, you first need to have a Twitter account and have a mobile phone number (or any number that can receive text messages) attached to that account.
Then, we'll use Twitter's developer portal to create an "app" that gives us the keys and tokens (essentially IDs and passwords) we need to connect to the API.

So, in summary, the general steps are:

0. Have a Twitter account,
1. Configure your Twitter account with your mobile number,
2. Create an app on Twitter's developer site, and
3. Generate consumer and access keys and secrets.

We will then plug these four strings into the code below.

In [4]:
# For our first piece of code, we need to import the package 
# that connects to Twitter. Tweepy is a popular and fully featured
# implementation.

import tweepy

In [5]:
# Use the strings from your Twitter app webpage to populate these four 
# variables. Be sure to put the strings BETWEEN the quotation marks
# so that each one is a valid Python string.

consumer_key = ""
consumer_secret = ""
access_token = ""
access_secret = ""
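
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the four strings are stored in environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical; neither Tweepy nor Twitter defines them.

```python
import os

# Hypothetical environment variable names; choose whatever names you like.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(environ=os.environ):
    """Return the four credential strings, or raise if any is missing or empty."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

With the variables set, `creds = load_credentials()` followed by `consumer_key = creds["consumer_key"]` (and likewise for the other three) would populate the same four variables used in the rest of this notebook.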

### Connecting to Twitter

Once we have the authentication details set, we can connect to Twitter using the Tweepy OAuth handler, as below.

In [6]:
# Now we use the configured authentication information to connect
# to Twitter's API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

print("Connected to Twitter!")

Connected to Twitter!


### Testing our Connection

Now that we are connected to Twitter, let's do a brief check that we can read tweets by pulling the first few tweets from our own timeline (that is, the timeline of the account associated with your Twitter app) and printing them.

In [7]:
# Get tweets from our timeline
public_tweets = api.home_timeline()

# print the first five authors and tweet texts
for tweet in public_tweets[:5]:
    print(tweet.author.screen_name, tweet.author.name, "said:", tweet.text)

EmmaLBriant Dr Emma L Briant said: RT @JolyonMaugham: Under pressure, the Govt has finally published the text of its (failed) application for permission to appeal to the Supr…
techreview MIT Technology Review said: Technology is ever-changing. Let our newsletters be a constant. Sign up for free today! https://t.co/T16PbXzckJ
emrek Emre Kıcıman said: Looking forward #EuroCSS 2018 at Cologne just around the corner.  My talk is titled "Where Does Data Bias Come from… https://t.co/k0sr0QjtOo
j363j j363j said: #ImpeachTrump #KushnerResign #DeposeMohammedBinSalman
#KhashoggiMurder https://t.co/Zn1VS3NSYx
EmmaLBriant Dr Emma L Briant said: RT @WSJ: "I felt that my face was burning, and my baby fainted. I ran for my life and that of my children," said Cindy Milla, a 23-year-old…


### Dealing with Pages

As mentioned, Twitter serves results in pages. 
To get all results, we can use Tweepy's Cursor implementation, which handles this iteration through pages for us in the background.
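
To make the idea of paging concrete, here is a minimal sketch of what a cursor does conceptually: request one page at a time and hand back items one by one until the source runs dry or a limit is reached. This is not Tweepy's actual implementation. `iterate_items`, `fake_fetch_page`, and the fake data are hypothetical stand-ins, and real endpoints may page by ID or cursor token rather than by page number.

```python
def iterate_items(fetch_page, limit=None):
    """Yield items one at a time from a page-based source.

    fetch_page(page_number) is a hypothetical callable that returns a list
    of items, or an empty list when there are no more pages.
    """
    yielded = 0
    page_number = 1
    while limit is None or yielded < limit:
        page = fetch_page(page_number)
        if not page:
            return  # no more pages
        for item in page:
            if limit is not None and yielded >= limit:
                return
            yield item
            yielded += 1
        page_number += 1

# Stand-in data source: 7 fake "tweets" served 3 per page.
FAKE_TWEETS = ["tweet-%d" % i for i in range(7)]

def fake_fetch_page(page_number, page_size=3):
    start = (page_number - 1) * page_size
    return FAKE_TWEETS[start:start + page_size]

first_five = list(iterate_items(fake_fetch_page, limit=5))
everything = list(iterate_items(fake_fetch_page))
```

The caller never sees page boundaries, which is the convenience `tweepy.Cursor(...).items(n)` provides in the cells that follow.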

In [8]:
me = api.me()
me.screen_name

'codybuntain'

In [10]:
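# Handler for waiting if we exhaust a rate limit.
# get_resource picks the endpoint's entry out of the "resources" section of
# the rate-limit status; pass the one that matches the endpoint you call.
def limit_handled(cursor, get_resource = lambda x: x['statuses']['/statuses/lookup']):
    while True:
        try:
            yield cursor.next()
        except StopIteration:
            # The cursor is exhausted. Return explicitly: since Python 3.7,
            # a StopIteration escaping a generator becomes a RuntimeError.
            return
        except tweepy.RateLimitError:
            # Determine how long we need to wait...
            s = api.rate_limit_status()
            dif = get_resource(s["resources"])['reset'] - int(time.time())
            
            # If we have a wait time, wait for it
            if ( dif > 0 ):
                print("Sleeping for %d seconds..." % dif)
                time.sleep(dif)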
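
The wait-time arithmetic in the handler above can be checked without touching the network. The sketch below assumes only the nesting the handler itself relies on (`resources`, then the endpoint family, then the endpoint path, then `reset` as a Unix timestamp). `seconds_until_reset` is a hypothetical helper and the numbers in `sample_status` are made up.

```python
import time

def seconds_until_reset(status, family, endpoint, now=None):
    """Return how many seconds remain until the endpoint's rate limit resets."""
    if now is None:
        now = int(time.time())
    reset_at = status["resources"][family][endpoint]["reset"]
    # Never return a negative wait if the reset time has already passed.
    return max(0, reset_at - now)

# Hand-built stand-in for the dictionary returned by api.rate_limit_status().
# Only the keys used here are included; the values are invented.
sample_status = {
    "resources": {
        "statuses": {
            "/statuses/lookup": {"limit": 900, "remaining": 0, "reset": 1000300},
        }
    }
}

wait = seconds_until_reset(sample_status, "statuses", "/statuses/lookup", now=1000000)
print("Would sleep for %d seconds" % wait)
```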

In [21]:
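def get_ids(filename):
    """Yield tweet IDs from a gzipped file with one JSON object per line."""
    with gzip.open(filename, "rb") as in_file:
        for line_bytes in in_file:
            line = line_bytes.decode("utf8").strip()
            
            # Skip blank lines
            if ( len(line) == 0 ):
                continue
                
            try:
                tweet = json.loads(line)

                yield int(tweet["ID"])
            except Exception:
                # Show the offending line before re-raising the error
                print(line)
                raise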
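
`get_ids` expects a gzip-compressed file with one JSON object per line, each carrying the tweet ID under an `"ID"` key. A minimal sketch that writes such a file and reads it back, so the expected input format is visible. The file name and ID values are made up, and `read_ids` is a simplified stand-in for the function above, not a replacement for it.

```python
import gzip
import json
import os
import tempfile

def read_ids(filename):
    """Yield integer tweet IDs from a gzipped JSON-lines file, skipping blank lines."""
    with gzip.open(filename, "rt", encoding="utf8") as in_file:
        for line in in_file:
            line = line.strip()
            if line:
                yield int(json.loads(line)["ID"])

# Write a tiny sample file with made-up IDs. One ID is stored as a string
# and one as a number to show that int() normalizes both.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_ids.json.gz")
with gzip.open(sample_path, "wt", encoding="utf8") as out_file:
    out_file.write(json.dumps({"ID": "101"}) + "\n")
    out_file.write("\n")  # blank lines are ignored
    out_file.write(json.dumps({"ID": 102}) + "\n")

sample_ids = list(read_ids(sample_path))
```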

In [24]:
def rehydrator(id_slice):
    """Look up one batch of tweet IDs, waiting out any rate limit we hit."""
    rehydrated = None
    while True:
        try:
            rehydrated = api.statuses_lookup(id_slice)
            break
        except tweepy.RateLimitError:
            # Determine how long we need to wait...
            s = api.rate_limit_status()
            dif = s["resources"]['statuses']['/statuses/lookup']['reset'] - int(time.time())

            # If we have a wait time, wait for it
            if ( dif > 0 ):
                print("Sleeping for %d seconds..." % dif)
                time.sleep(dif)
    return rehydrated

In [None]:
id_set = set()
id_slice = []
slice_size = 100

In [None]:


with open("output.json", "w") as out_file:
    for tid in get_ids("tweets.json.gz"):
        if ( tid not in id_set and tid not in id_slice ):
            id_slice.append(tid)

        if ( len(id_slice) >= slice_size ):
            # rehydrate
            rehydrated = rehydrator(id_slice)

            # add to list of pulled tweets
            for tweet in rehydrated:
                id_set.add(tweet.id)
                out_file.write("%s\n" % json.dumps(tweet._json))

            # Clear id list
            id_slice = []
            
    # Pull any remaining IDs
    if ( len(id_slice) > 0 ):
        # rehydrate
        rehydrated = rehydrator(id_slice)

        # add to list of pulled tweets
        for tweet in rehydrated:
            id_set.add(tweet.id)
            out_file.write("%s\n" % json.dumps(tweet._json))



Sleeping for 112 seconds...
Sleeping for 387 seconds...
Sleeping for 900 seconds...
Sleeping for 406 seconds...
Sleeping for 394 seconds...
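
The batching logic in the loop above (collect IDs until a slice of 100 is full, flush it, then flush whatever remains at the end) can be isolated into a small generator. This is a sketch, and `batched_unique` is a hypothetical helper rather than part of the original notebook. Note one difference: it marks an ID as seen when it is read, whereas the loop above records an ID in `id_set` only after the API returns the tweet.

```python
def batched_unique(ids, batch_size=100):
    """Yield lists of at most batch_size IDs, dropping IDs already seen."""
    seen = set()
    batch = []
    for tid in ids:
        if tid in seen:
            continue
        seen.add(tid)
        batch.append(tid)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, partially filled batch

# Tiny example with a batch size of 2 so the grouping is easy to follow.
batches = list(batched_unique([1, 2, 2, 3, 4, 1, 5], batch_size=2))
```

Each yielded list could then be passed to `rehydrator`, which removes the need to repeat the write-out code for the leftover IDs.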


In [9]:
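total_tweets = 3200
tweet_json = []
# count is the page size; this endpoint returns at most 200 tweets per request,
# and the Cursor keeps requesting pages until total_tweets items are collected.
for tweet in limit_handled(tweepy.Cursor(api.user_timeline, id="cphfashsummit", count=200).items(total_tweets), get_resource = lambda x: x['statuses']['/statuses/user_timeline']):
    tweet_json.append(json.dumps(tweet._json))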



In [11]:
with open("cphfashsummit_direct_tweets.json", "w") as out_file:
    for tweet_str in tweet_json:
        out_file.write("%s\n" % tweet_str.strip())

In [13]:
# Tweet IDs retrieved directly through the API
ret_tweet_ids = set()
with open("cphfashsummit_direct_tweets.json", "r") as in_file:
    for line in in_file:
        tweet = json.loads(line)
        ret_tweet_ids.add(tweet["id"])
        
# Tweet IDs found in the scraped data set
scrape_tweet_ids = set()
with open("cphfashsummit_tweets.json/part-00000", "r") as in_file:
    for line in in_file:
        tweet = json.loads(line)
        scrape_tweet_ids.add(tweet["id"])
        
# Compare the two sets and show one tweet ID found only via the API
print(len(ret_tweet_ids), len(scrape_tweet_ids))
print(len(ret_tweet_ids.difference(scrape_tweet_ids)))
print(list(ret_tweet_ids.difference(scrape_tweet_ids))[0])

1476 903
573
1063521323647934464
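
The cell above checks the difference in one direction only. The counts still tell us something about the other direction: 1476 − 573 = 903, which equals the size of the second set, so every ID in the scrape also appears in the API pull. A small sketch with made-up IDs shows how to check both directions and the overlap explicitly.

```python
api_ids = {11, 12, 13, 14}   # made-up IDs standing in for the API pull
scraped_ids = {13, 14, 15}   # made-up IDs standing in for the scrape

only_in_api = api_ids - scraped_ids      # same as api_ids.difference(scraped_ids)
only_in_scrape = scraped_ids - api_ids   # the direction the cell above omits
in_both = api_ids & scraped_ids          # IDs present in both sets

print(len(only_in_api), len(only_in_scrape), len(in_both))
```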


In [None]:
import networkx as nx

g = nx.Graph()

target = "codybuntain"
total_friends = 20

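# Get the first few friends of mine and first few of each of them
#  and add their links to the graph. Both loops call api.friends, so both
#  check the rate limit for /friends/list.
for friend in limit_handled(tweepy.Cursor(api.friends, id=target).items(total_friends), get_resource = lambda x: x['friends']['/friends/list']):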
    g.add_node(friend.screen_name)
    g.add_edge(target, friend.screen_name)
    print("Processing:", friend.screen_name)
    
    for friend_of_friend in limit_handled(tweepy.Cursor(api.friends, id=friend.screen_name).items(total_friends), get_resource = lambda x: x['friends']['/friends/list']):
        g.add_node(friend_of_friend.screen_name)
        g.add_edge(friend.screen_name, friend_of_friend.screen_name)
        print("\t->", friend_of_friend.screen_name)
        
subs = [x[0] for x in g.degree() if x[1] > 0]
nx.draw(nx.subgraph(g, subs))

nx.write_graphml(g, "twitter_codybuntain.graphml")

In [41]:
len(id_set)

763461

In [42]:
missing_ids = []
with open("rehydrated.json", "r") as in_file:
    for line in in_file:
        tweet = json.loads(line)
        tweet_id = tweet["id"]
        
        if ( tweet_id not in id_set ):
            missing_ids.append(tweet_id)

In [44]:
len(missing_ids)

0

In [11]:
total_likes = 3200
tweet_json = []
# count is the page size; this endpoint returns at most 200 likes per request
for tweet in limit_handled(tweepy.Cursor(api.favorites, id="codybuntain", count=200).items(total_likes), get_resource = lambda x: x['favorites']['/favorites/list']):
    tweet_json.append(json.dumps(tweet._json))



In [12]:
len(tweet_json)

2979

In [14]:
tweet_json[-1]

'{"created_at": "Thu Jun 27 21:52:32 +0000 2013", "id": 350371093418221568, "id_str": "350371093418221568", "text": "@proteius ugh gross!", "truncated": false, "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\\"http://twitter.com/download/android\\" rel=\\"nofollow\\">Twitter for Android</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": 363200844, "in_reply_to_user_id_str": "363200844", "in_reply_to_screen_name": "codybuntain", "user": {"id": 14522676, "id_str": "14522676", "name": "s", "screen_name": "sanapants", "location": "bay area", "description": "pretty awesome", "url": "http://t.co/WxIMc9YkcR", "entities": {"url": {"urls": [{"url": "http://t.co/WxIMc9YkcR", "expanded_url": "http://www.last.fm/user/sanana", "display_url": "last.fm/user/sanana", "indices": [0, 22]}]}, "description": {"urls": []}}, "protected": true, "followers_count": 74, "friends_count": 142, "listed_count": 1, "created

In [15]:
total_likes = 3200
tweet_json = []
for tweet in limit_handled(tweepy.Cursor(api.favorites, id="j_a_tucker", count=200).items(total_likes), get_resource = lambda x: x['favorites']['/favorites/list']):
    tweet_json.append(json.dumps(tweet._json))



In [16]:
len(tweet_json)

1754