This repository contains an experimental (alpha stage!) pipeline for importing and processing data available at Measurement Lab. It is called "Neubot pipeline" because it was developed by Simone Basso in the context of the Neubot project.
How to clone neubot-pipeline
To clone the neubot-pipeline repository properly, make sure that you specify the
--recursive command line flag, as in:
git clone --recursive https://github.com/bassosimone/neubot-pipeline
This flag tells git to clone not only the neubot-pipeline repository, but also the submodules contained therein.
Alternatively, once you have cloned the repository, you can check out all the submodules using:
git submodule update --init
How to use neubot-pipeline
First, you need to install the dependencies. Specifically, you need to install the following Python modules:
GeoIP pymongo pytz pyyaml
You also need MongoDB, which is used to store data.
On Ubuntu (or on any other Debian-based distribution), you could install all of this with the following command:
sudo apt-get install mongodb python-geoip python-pymongo \
    python-tz python-yaml
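Before going further, it may help to confirm that the Python modules listed above can actually be imported. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the import names (`GeoIP`, `pymongo`, `pytz`, `yaml`) are assumed to be the ones provided by the packages above.

```python
import importlib

# Assumption: these are the import names provided by the packages
# listed above (python-geoip, python-pymongo, python-tz, python-yaml).
REQUIRED_MODULES = ("GeoIP", "pymongo", "pytz", "yaml")


def find_missing(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            # The module is not installed (or fails to load).
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("missing modules: " + ", ".join(missing))
    else:
        print("all dependencies found")
```

This only checks that the modules import; it does not check that the MongoDB server itself is installed or running.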
Download GeoLite databases
In addition, you need to download the free GeoLite City and ASN databases from MaxMind.
Specifically, you need to create, below the data directory, per-year and per-month folders containing the uncompressed GeoLite databases.
To this end, you can use the script
./bin/fetch_geoip that downloads
a mirror of the GeoLite data maintained at
The script also accepts parameters to allow you to download only the GeoLite files that you need, e.g.:
./bin/fetch_geoip 2014/11 2014/12
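If you need a longer range of months, typing every argument by hand gets tedious. The sketch below builds the argument list for a range, assuming that the script takes arguments of the form YYYY/MM as in the example above. The helper `month_args` is hypothetical and not part of this repository.

```python
def month_args(first, last):
    """Return 'YYYY/MM' strings for every month from first to last, inclusive.

    `first` and `last` are (year, month) tuples. An empty list is
    returned when `first` is later than `last`.
    """
    year, month = first
    result = []
    while (year, month) <= last:
        result.append("%04d/%02d" % (year, month))
        month += 1
        if month > 12:
            # Roll over to January of the next year.
            year, month = year + 1, 1
    return result


# Prints: 2014/11 2014/12 2015/01
print(" ".join(month_args((2014, 11), (2015, 1))))
```

You can then paste the printed string after ./bin/fetch_geoip on the command line.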
Note that the GeoLite databases downloaded using this procedure are not always the best GeoLite databases for all months. In fact, many of these databases were downloaded recently using the Internet Archive's Wayback Machine. So there are months for which the GeoLite database used is more than six months old, which has an impact on accuracy.
It is on our (well, on my...) TODO list to improve the way in which geolocation is implemented.
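To make the age problem concrete, here is a small sketch that, given the months for which a database is available, picks the newest one that is not later than the month being processed and reports how old it is. This is only an illustration of the issue described above, not how the pipeline currently selects databases. The helper names and the sample months are made up.

```python
def months_between(older, newer):
    """Number of months from `older` to `newer`, both (year, month)."""
    return (newer[0] - older[0]) * 12 + (newer[1] - older[1])


def pick_database(available, target):
    """Pick the newest database month that is not later than `target`.

    Returns (month, age_in_months), or None when every available
    database is newer than `target`.
    """
    candidates = [m for m in available if m <= target]
    if not candidates:
        return None
    best = max(candidates)
    return best, months_between(best, target)


# Made-up example: databases exist for these three months only.
available = [(2013, 6), (2014, 1), (2014, 11)]
best, age = pick_database(available, (2014, 9))
if age > 6:
    print("database %04d/%02d is %d months old" % (best[0], best[1], age))
```

With the sample data, measurements from 2014/09 would be geolocated using the 2014/01 database, which is eight months old.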
Pull and process data
You are now ready to pull data from Measurement Lab. We currently support data from these three tools:
For Neubot, we only support loading data from its
BitTorrent tests. For Glasnost, we only support version 2 of
its logfiles (which has been in use since 2010-04-20).
To pull and import Neubot's data of January, 2015 into MongoDB, run:
Afterwards, you can log into MongoDB and check whether everything was correctly imported:
$ mongo
> use ooni
> db.reports.find()
... # skipped output here
Export data as CSV
To export data as CSV, use the
bin/export_csv script. It can export
data of Neubot and Glasnost as follows:
The Glasnost logfiles parser is a modified version of the parser developed by Hadi Asghari for Glasnost.