This repository contains a few simple Python scripts that collect a sample of the Twitter firehose and then analyze the URL shortening and HTTP redirection behind the links found in those tweets. The resulting graph is a histogram of the number of redirects per URL found in tweets.
These scripts use the excellent TweetStream and Requests libraries:
pip install tweetstream requests
If you want a graph rather than reading raw output, you need matplotlib as well.
The analysis is a three-step process:
    # Create a JSON-formatted credentials.json file with a single
    # object containing your Twitter 'username' and 'password' to gain
    # access to Twitter's sample stream. Then collect data.
    # This will sync every 100 tweets until you kill it, leaving a
    # (potentially very large) tweets.log file.
    python collect.py

    # Next, look for URLs and unwrap them, giving you a JSON file
    # containing a list of accessed URLs and status codes for each
    # link found. This will try to follow all the URLs it finds in the
    # tweets. It uses a bunch of threads. Beware of high network IO.
    python unwrap.py

    # Finally, analyze it and generate a few graphs. If you want to
    # generate a different graph, add it somewhere here or just use
    # the redirects.log file generated by the unwrap.py script.
    python graph.py
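If you want to build your own graph, the core of the last step is just counting redirects per URL. Here is a minimal sketch of that counting, not the code in graph.py. It assumes each unwrapped link is available as a list of (url, status_code) hops; that shape, the function name `redirect_histogram`, and the sample data are all hypothetical, so adapt them to whatever unwrap.py actually writes.

```python
from collections import Counter


def redirect_histogram(chains):
    """Map number of redirects -> number of links with that many redirects.

    `chains` is an assumed format: one list of (url, status_code) hops per link.
    """
    # A chain of N hops means N - 1 redirects; the last hop is the final response.
    return Counter(max(len(chain) - 1, 0) for chain in chains)


if __name__ == "__main__":
    # Made-up example chains, only here to show the shape of the input.
    sample = [
        [("http://t.co/a", 301), ("http://bit.ly/b", 301), ("http://example.com/", 200)],
        [("http://example.org/", 200)],
        [("http://t.co/c", 301), ("http://example.net/", 200)],
    ]
    # Print a crude text histogram: one '#' per link in each bucket.
    for redirects, count in sorted(redirect_histogram(sample).items()):
        print(redirects, "#" * count)
```

The resulting counter can be handed to matplotlib (for example as the heights of a bar chart) if you would rather have a picture than text.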
URL detection uses a regex heuristic that works pretty well, where "pretty well" means I logged a bunch of tweets, found the URLs manually, and made sure the regex matched them. Some URLs are still missed, and some are matched with extra trailing characters (e.g. an appended ')', ':', '"', or random Unicode character). I don't have an exact measurement, but these cases are few compared to the whole dataset, so they shouldn't drastically affect the results.
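To illustrate the kind of heuristic described above, here is a rough sketch of regex-based URL extraction with trailing-punctuation cleanup. This is not the regex the scripts use; the pattern, the `TRAILING_JUNK` set, and the `extract_urls` name are assumptions made for the example, and it only strips ASCII punctuation, not stray Unicode characters.

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the next
# whitespace, angle bracket, or quote character.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Characters that are usually sentence punctuation rather than part of the URL.
TRAILING_JUNK = ".,;:!?)]}"


def extract_urls(text):
    """Return the URLs found in `text`, with trailing punctuation stripped."""
    urls = []
    for match in URL_RE.findall(text):
        cleaned = match.rstrip(TRAILING_JUNK)
        if cleaned:
            urls.append(cleaned)
    return urls


if __name__ == "__main__":
    tweet = 'Look at this (http://t.co/abc123): "http://bit.ly/x".'
    print(extract_urls(tweet))
```

Stripping a trailing ')' is a trade-off: it cleans up links wrapped in parentheses but will also truncate the rare URL that legitimately ends with one.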
Multithreading was also necessary to get results in a reasonable amount of time. Many of the links went dead very quickly, and a large fraction of those were t.co links. I believe these may have been spam links that Twitter quickly disabled, but I haven't verified that. If you don't run many concurrent requests, you'll process tweeted links orders of magnitude slower than you consume them via the sampled firehose from Twitter.
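For a sense of what concurrent unwrapping can look like, here is a minimal sketch using a thread pool together with Requests. It is not the implementation in unwrap.py; the function names, the use of HEAD requests, the worker count, and the timeout are all assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chain(url, timeout=10):
    """Follow redirects for one URL; return a list of (url, status) hops."""
    # Imported here so the pool logic below can be tried with a stub fetcher
    # on a machine that does not have Requests installed.
    import requests

    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException as exc:
        # Dead links are common; record the failure type instead of raising.
        return [(url, type(exc).__name__)]
    # response.history holds the intermediate redirect responses, in order.
    return [(r.url, r.status_code) for r in response.history + [response]]


def unwrap_all(urls, fetch=fetch_chain, workers=50):
    """Resolve many URLs concurrently; return a dict of {url: chain}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its chain.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Calling `unwrap_all(["http://t.co/abc123"])` would hit the network; passing a stub as `fetch` lets you check the plumbing offline. Since the work is almost entirely waiting on network IO, threads are enough here, and the worker count mostly determines how many requests are in flight at once.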