Unnecessary indirection sucks

This repository contains a few simple Python scripts that collect a sample of the Twitter firehose and then analyze the URL-shortening/HTTP-redirect chains for the links found in those tweets. The resulting graph is a histogram of redirects per URL found in tweets.

Requirements

These scripts use the excellent TweetStream and Requests libraries:

pip install tweetstream requests

If you want a graph rather than raw output, you need matplotlib as well:
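
pip install matplotlib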

Running

The analysis is a three-step process:

# Create a JSON-formatted credentials.json file with a single
# object containing your Twitter 'username' and 'password' to gain
# access to Twitter's sample stream. Then collect data.
# This will sync to disk every 100 tweets until you kill it, leaving
# a (potentially very large) tweets.log file.
python collect.py
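
For reference, credentials.json is just a single JSON object with your login:

{"username": "your_twitter_handle", "password": "your_password"}

If you're curious, the core of the collection loop with TweetStream looks roughly like this (a sketch, not the exact collect.py code; the stream class name has changed between tweetstream versions):

import json
import tweetstream

with open("credentials.json") as f:
    creds = json.load(f)

# SampleStream yields decoded tweets from Twitter's sample firehose.
with tweetstream.SampleStream(creds["username"], creds["password"]) as stream:
    with open("tweets.log", "a") as log:
        for i, tweet in enumerate(stream, 1):
            log.write(json.dumps(tweet) + "\n")
            if i % 100 == 0:
                log.flush()  # the real script syncs every 100 tweets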

# Next, look for URLs and unwrap them, giving you a JSON file
# containing a list of accessed URLs and status codes for each
# link found. This will try to follow all the URLs it finds in the
# tweets. It uses a bunch of threads. Beware of high network IO.
python unwrap.py
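
The unwrapping itself is straightforward with Requests, which follows redirects by default and records the intermediate hops in response.history. A minimal sketch of the idea (unwrap.py's actual logic may differ):

import requests

def unwrap(url):
    # Requests follows redirects automatically; resp.history holds the
    # intermediate responses and resp is the final one.
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return []  # dead link
    return [(r.url, r.status_code) for r in resp.history + [resp]]

print(unwrap("http://t.co/example"))  # hypothetical short link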

# Finally, analyze it and generate a few graphs. If you want to
# generate a different graph, add it somewhere here or just use
# the redirects.log file generated by the unwrap.py script.
python graph.py
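
If you want to plot something yourself, the histogram boils down to counting hops per chain. A sketch, assuming redirects.log holds a JSON list of redirect chains like the ones unwrap() above returns (the real file's layout may differ):

import json
import matplotlib.pyplot as plt

with open("redirects.log") as f:
    chains = json.load(f)

# A chain of n responses means n - 1 redirects.
counts = [len(chain) - 1 for chain in chains if chain]
plt.hist(counts, bins=range(max(counts) + 2))
plt.xlabel("redirects per URL")
plt.ylabel("URLs")
plt.show()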

Details

URL detection is a pretty good heuristic, where "pretty good" means I logged a bunch of tweets, found the URLs manually, and made sure the regex matched them. Some URLs are still missed, and some matches pick up extra trailing characters (e.g. because an extra ')', ':', '"', or random Unicode character is appended). I don't have an exact measurement, but both cases are small compared to the whole dataset, so they shouldn't drastically affect the results.
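
The actual regex lives in the scripts; a crude stand-in for this kind of heuristic, with exactly the trailing-punctuation failure mode described above, would be:

import re

# Hypothetical approximation, not the exact regex used here: grab
# anything URL-shaped up to the next whitespace.
URL_RE = re.compile(r'https?://\S+')

print(URL_RE.findall('nice read (http://t.co/abc123) :)'))
# ['http://t.co/abc123)'] -- note the captured trailing ')'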

I also needed multithreading to get results in a reasonable amount of time. Many of the links died very quickly, and a large fraction of those were t.co links. I suspect these were spam links that Twitter quickly disabled, but I haven't verified that. If you don't run many concurrent requests, you'll process tweeted links orders of magnitude more slowly than you consume them from Twitter's sampled firehose.
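
A thread pool is the simplest way to get that concurrency. A sketch with concurrent.futures (the script's actual threading scheme may differ):

from concurrent.futures import ThreadPoolExecutor

import requests

def final_status(url):
    try:
        return url, requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return url, None  # dead link

urls = ["http://t.co/example1", "http://t.co/example2"]  # placeholders
with ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(pool.map(final_status, urls))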