This repository contains a few simple Python scripts that collect a sample of tweets from the Twitter firehose and then analyze the URL shortening and HTTP redirection behind the links found in those tweets. The resulting graph is a histogram of the number of redirects per URL found in tweets.


These scripts use the excellent TweetStream and Requests libraries:

pip install tweetstream requests

If you want a graph rather than reading raw output, you need matplotlib as well.


The analysis is a three-step process:

# Create a JSON-formatted credentials.json file with a single
# object containing your Twitter 'username' and 'password' to gain
# access to Twitter's sample stream. Then collect data.
# This will sync every 100 tweets until you kill it, leaving a
# (potentially very large) tweets.log file.

# Next, look for URLs and unwrap them, giving you a JSON file
# containing a list of accessed URLs and status codes for each
# link found. This will try to follow all the URLs it finds in the
# tweets. It uses a bunch of threads. Beware of high network IO.

# Finally, analyze it and generate a few graphs. If you want to
# generate a different graph, add it somewhere here or just use
# the redirects.log file generated by the scripts.
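
As a rough illustration of the last step, the sketch below tallies how many links needed 0, 1, 2, ... redirects. The record layout is an assumption (one list of `[url, status_code]` hops per tweeted link); the real redirects.log format may differ, and `redirect_histogram` is a hypothetical helper, not part of these scripts.

```python
from collections import Counter


def redirect_histogram(chains):
    """Map redirect count -> number of links with that many redirects.

    ASSUMPTION: each chain is the list of [url, status_code] hops recorded
    for one tweeted link, so a chain of n hops involved n - 1 redirects.
    Empty chains (nothing recorded) are skipped.
    """
    return Counter(len(chain) - 1 for chain in chains if chain)


if __name__ == "__main__":
    # Made-up sample data in the assumed layout, not real results.
    sample = [
        [["http://example.com/a", 200]],
        [["http://sho.rt/x", 301], ["http://example.com/b", 200]],
        [["http://sho.rt/y", 301], ["http://example.com/c", 200]],
    ]
    for redirects, count in sorted(redirect_histogram(sample).items()):
        print(f"{redirects} redirect(s): {'#' * count} ({count})")
```

The resulting counts could be handed to matplotlib's `bar` function if you want a plot rather than text output.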


URL detection uses a regex heuristic that is pretty good, where "pretty good" means I logged a bunch of tweets, found URLs manually, and made sure the regex matched them. Some URLs are still missed, and some are matched with extra trailing characters (e.g. because an extra ')', ':', '"', or random Unicode character is appended). I don't have an exact measurement, but these cases are few compared to the whole dataset, so they shouldn't drastically affect the results.
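
To make the failure mode concrete, here is a minimal sketch of this kind of heuristic. The pattern and the `find_urls` helper are assumptions for illustration, not the regex these scripts actually use; it shows one way to trim the trailing characters described above.

```python
import re

# ASSUMPTION: a deliberately simple pattern, not the one used by the scripts.
URL_RE = re.compile(r"https?://\S+")

# Characters that often end up glued to the end of a URL in tweet text:
# ASCII punctuation plus the ellipsis and curly closing quotes.
TRAILING_JUNK = ".,;:!?)]}'\"\u2026\u201d\u2019"


def find_urls(text):
    """Return URLs found in text with trailing punctuation stripped."""
    return [match.rstrip(TRAILING_JUNK) for match in URL_RE.findall(text)]


if __name__ == "__main__":
    tweet = 'look (http://bit.ly/abc): "http://t.co/xyz" nice'
    print(find_urls(tweet))
```

Stripping is itself a heuristic: a URL that legitimately ends in ')' would be truncated, so some error remains either way.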

I also needed multithreading to get results in a reasonable amount of time. Many of the links were dead very quickly, and a large fraction of those were links. I believe these may have been spam links that Twitter quickly disabled, but I haven't verified that. If you don't run many concurrent requests, you'll process tweeted links orders of magnitude more slowly than you receive them from Twitter's sampled firehose.
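
The concurrency idea can be sketched as follows, assuming Python 3 and a thread pool from the standard library. The helper names (`follow_redirects`, `unwrap_all`), the worker count, and the timeout are all assumptions rather than what these scripts do; `response.history` is the Requests attribute that holds the intermediate redirect responses.

```python
from concurrent.futures import ThreadPoolExecutor


def follow_redirects(url, timeout=10):
    """Return the [url, status_code] hops for one link, using Requests."""
    import requests  # imported here so unwrap_all() is usable without it

    try:
        # stream=True avoids downloading the body; only the hops matter.
        with requests.get(url, allow_redirects=True, stream=True,
                          timeout=timeout) as resp:
            return [[r.url, r.status_code] for r in resp.history + [resp]]
    except requests.RequestException as exc:
        # Dead links are common; record the failure instead of crashing.
        return [[url, type(exc).__name__]]


def unwrap_all(urls, resolve=follow_redirects, max_workers=50):
    """Resolve many URLs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(resolve, urls))
```

Because each request spends most of its time waiting on the network, threads are enough here; the `resolve` parameter exists only so the pool can be exercised with a stub instead of real HTTP traffic.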


Unnecessary indirection sucks





