A collection of public data sets
Latest commit 4bd9140 Feb 25, 2017 @curran Add Andy Kirk's Link Archive
HRMcMaster Add timeline JSON created by @micahstubbs using TimeLineCurator Feb 22, 2017
Rdatasets Add R datasets Oct 25, 2015
airbnb Add Airbnb data set Aug 4, 2015
all Add data from data soup meetup Aug 30, 2015
amitkaps Add weed data Jan 31, 2016
appliedPredictiveModeling Add applied predictive modeling data sets Aug 5, 2015
bokeh Add Bokeh examples Oct 25, 2015
calc Add calc s70 data Oct 30, 2015
cdc Added full table data module with unindented causes for the entire ca… Feb 19, 2014
correlatesofwar Add correlates of war data Oct 30, 2015
d3Examples Rename letterFrequency.csv to letterFrequency.tsv Jul 14, 2016
data.gov.in Added data sets from data.gov.in Apr 21, 2014
data.gov/ElectricUtilityRates Add data.gov data Aug 5, 2015
dataSoup Add data from data soup meetup Aug 30, 2015
datalibExamples Add datalib examples Aug 3, 2015
dbpedia/cities Fixed bug where population was not showing up in CSV file for Geoname… Apr 30, 2015
dcjs Add data sets from DC.js Aug 4, 2015
dspl Added countries list from Google Apr 1, 2015
faostat/africaUndernourishment Added Africa undernourishment data set Dec 1, 2014
fbi Add stub for FBI crime dataset Dec 8, 2015
gapminder Update README Aug 31, 2015
geonames Fixed bug where population was not showing up in CSV file for Geoname… Apr 30, 2015
indiaGovOpenData Add India Open Data May 31, 2016
integrated Added population vs. gdp data set Apr 29, 2015
ipo Add note on IPO calue column Aug 6, 2015
jsLibraries Added js lib data set Apr 2, 2015
mattermark Removed funky characters in CSV Aug 18, 2015
medicalStoreChallenge Add data from medical store challenge Aug 4, 2015
migrants Add data from data soup meetup Aug 30, 2015
motherjones Add mother jones shooting data Oct 30, 2015
nsf/bachelorsDegrees Added data about bachelors degrees fron NSF Feb 13, 2014
nyt/gun_sales Update README.md Dec 10, 2015
oecd Add house price data Aug 6, 2015
olpc Fixed image Apr 27, 2015
osm/housenumbers Add house numbers in Montreal Aug 4, 2015
pew/religion Add religionByCountryBanned7.csv Jan 28, 2017
playfair/englandTrade Add William Playfair trade data Sep 6, 2015
plotlyExamples Add fertility-rates-in-south data set Aug 3, 2015
sankey-datasets Add sankey-datasets Feb 22, 2017
senseYourCity Added unit test framework, added test for iris dataset parsing using … Aug 1, 2015
slavevoyages Add slave voyages data Oct 30, 2015
stackOverflow Add ref to StackOverflow developer survey Jan 15, 2016
statCounter Remove .DS_Store Mac turd Aug 27, 2015
superstoreSales Added superstore sales example data Sep 2, 2014
syntagmatic Add data sets from syntagmatic Aug 5, 2015
trumpSpeech Rename trumpSpeech.txt to trumpSpeech.md Jul 22, 2016
tuskegeeInstitute Add links to references Nov 24, 2015
tweets Merge branch 'gh-pages' of github.com:curran/data into gh-pages Dec 1, 2015
uci_ml Add cleaned avian flu data Oct 28, 2015
un Update to 2015 data Jul 2, 2016
undp Add undp data sets Aug 7, 2015
unhcr Add India Open Data May 31, 2016
usda/avian_influenza Add cleaned avian flu data Oct 28, 2015
usgs/centennial Added filtered versions for earthquake data Apr 27, 2015
util Added earthquake data Apr 27, 2015
uwdata_voyager Add Voyager data sets Aug 5, 2015
vegaExamples Prettified vegaExamples/flights-*.json Jul 16, 2016
w3schools Added stub for scraping w3schools browser market share data Apr 4, 2014
wallStreetJournal/terrorAttacks Add WSJ data set Nov 14, 2015
wikibon Add Big Data Vendor data from Wikibon Aug 12, 2015
worldBank/refugees Update sm.pop.refg_Indicator_en_csv_v2.csv Nov 20, 2015
worldFactbook Added world factbook data Aug 14, 2013
.gitignore Added unit test framework, added test for iris dataset parsing using … Aug 1, 2015
Interest Group Spending 2000-2016.csv Create Interest Group Spending 2000-2016.csv Jan 12, 2016
LICENSE Add MIT Licence for #2 Nov 13, 2015
README.md Add Andy Kirk's Link Archive Feb 25, 2017
package.json Add religionByCountryBanned7.csv Jan 28, 2017
test.js Added unit test framework, added test for iris dataset parsing using … Aug 1, 2015

A collection of public data sets for testing out visualization methods. These data sets are at various stages of preparation: some are just raw data, some are CSV files, and some are exposed as AMD modules. This collection is messy, but with some digging you may find hidden gems.

Targets for import:

Here's a listing of data sets with more detail. Columns will be marked in terms of their type for visualization, including:

  • Q = Quantitative, continuously varying numeric columns

  • T = Temporal, a timestamp

  • O = Ordered, distinct categories with a natural order (e.g. Low, Medium, High)

  • N = Nominal, distinct categories with no natural order (e.g. Ethnicity)

  • G = Geospatial identifiers (e.g. Country, City)

UCI Machine Learning Repository - Adult (3.8 MB)

This data set demonstrates a mix of quantitative, ordinal, and nominal columns. To analyze this data set using visualization, it would be useful to aggregate the data on the fly before visualization.

  • age: Q
  • workclass: N
  • education: O
  • education-num: Q
  • marital-status: N
  • occupation: N
  • relationship: N
  • race: N
  • sex: N
  • capital-gain: Q
  • capital-loss: Q
  • hours-per-week: Q
  • native-country: N
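
As a rough illustration of what "aggregating on the fly" could look like, the sketch below groups rows by one nominal or ordinal column and summarizes one quantitative column. The sample rows and the `aggregate` helper are hypothetical; only the column names come from the listing above.

```python
from collections import defaultdict

# Hypothetical sample rows using the column names listed above.
# The values are made up for illustration, not taken from the data set.
rows = [
    {"education": "Bachelors", "sex": "Male", "hours-per-week": 40},
    {"education": "Bachelors", "sex": "Female", "hours-per-week": 50},
    {"education": "HS-grad", "sex": "Male", "hours-per-week": 30},
]


def aggregate(rows, group_by, measure):
    """Group rows by one N or O column and return
    {group: (row count, mean of the Q measure)}."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [count, sum]
    for row in rows:
        entry = totals[row[group_by]]
        entry[0] += 1
        entry[1] += float(row[measure])
    return {group: (count, total / count)
            for group, (count, total) in totals.items()}


summary = aggregate(rows, "education", "hours-per-week")
```

A visualization would then draw one mark per group (for example, one bar per education level) instead of one mark per row.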

Data Canvas Sense Your City (237MB or Real-time API)

This data set contains measures collected by DIY sensor kits across several major cities ["San Francisco", "Bangalore", "Boston", "Geneva", "Rio de Janeiro", "Shanghai", "Singapore"]. There is a visualization competition for this data set, with submissions due March 20.

  • city: G
  • timestamp: T
  • temperature: Q
  • light: Q
  • airquality: Q
  • sound: Q
  • humidity: Q
  • dust: Q

Medical Store Geospatial Challenge (< 100KB)

This data set is small, but comes with a set of real-world questions about the data. This is also a competition, with submissions due April 25.

  • Referrers - Each row corresponds to information on a particular client referral source.
      • referrer_code: N
      • visit_count: Q
      • city -- referrer city
      • postal_code_referrer: G
      • (latitude, longitude): G

  • Clients - Each row corresponds to a client visit to the store.
      • client_id: N
      • referrer_code: N
      • city -- referrer city
      • postal_code_referrer: G
      • (latitude, longitude): G
      • initial_visit_date: T
      • product_count: Q

UCI Machine Learning Repository - Individual household electric power consumption (20 MB)

This data set would be a great candidate to show multi-scale temporal aggregation.

  • timestamp: T
  • global_active_power: Q
  • global_reactive_power: Q
  • voltage: Q
  • global_intensity: Q
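
One way to picture multi-scale temporal aggregation is to average the same readings into hourly, daily, and monthly buckets. This is a minimal sketch using only the standard library; the `SCALES` table, the `rollup` helper, and the readings are all made up for illustration and stand in for (timestamp, global_active_power) pairs.

```python
from collections import defaultdict
from datetime import datetime

# Truncation formats for each time scale (names are illustrative).
SCALES = {"hour": "%Y-%m-%d %H:00", "day": "%Y-%m-%d", "month": "%Y-%m"}


def rollup(readings, scale):
    """Average (timestamp, value) readings within each bucket of the scale."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[timestamp.strftime(SCALES[scale])].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# Made-up readings, not taken from the data set.
readings = [
    (datetime(2007, 1, 1, 10, 0), 1.0),
    (datetime(2007, 1, 1, 10, 30), 3.0),
    (datetime(2007, 1, 1, 11, 0), 5.0),
    (datetime(2007, 1, 2, 9, 0), 7.0),
]

hourly = rollup(readings, "hour")
daily = rollup(readings, "day")
monthly = rollup(readings, "month")
```

Each scale is computed from the raw readings here. A system handling the full 20 MB file interactively would more likely precompute the coarser levels from the finer ones.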

BrightKite User Check-ins (57.2 MB)

This data set would be a useful example for multi-scale aggregation in both space and time. This has been used as the motivating example for several Big Data visualization systems based on data cubes (imMens: Real‐time Visual Querying of Big Data, Nanocubes for real-time exploration of spatiotemporal datasets).

  • user-id: N
  • timestamp: T
  • (latitude, longitude): G
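
The idea of aggregating in both space and time can be sketched by mapping each check-in to a (time bucket, grid cell) key and counting per key at several resolutions. This is not how imMens or Nanocubes are implemented; it is a naive illustration, and the helper names and sample check-ins are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime


def bin_checkin(timestamp, lat, lon, cell_degrees, time_format):
    """Map one check-in to a (time bucket, grid cell) key."""
    cell = (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))
    return (timestamp.strftime(time_format), cell)


def count_checkins(checkins, cell_degrees, time_format):
    """Count check-ins per (time bucket, grid cell) at one resolution."""
    return Counter(
        bin_checkin(timestamp, lat, lon, cell_degrees, time_format)
        for timestamp, lat, lon in checkins
    )


# Made-up (timestamp, latitude, longitude) check-ins for illustration.
checkins = [
    (datetime(2009, 5, 1, 8), 37.77, -122.42),
    (datetime(2009, 5, 1, 9), 37.85, -122.27),
    (datetime(2009, 5, 2, 8), 40.71, -74.05),
]

coarse = count_checkins(checkins, 10.0, "%Y-%m")   # 10-degree cells, monthly
fine = count_checkins(checkins, 0.1, "%Y-%m-%d")   # 0.1-degree cells, daily
```

Zooming in a map or timeline would then switch between the coarse and fine counts without redrawing one mark per raw check-in.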

ACLED (Armed Conflict Location and Event Data Project) (35MB)

This data set contains entries for each violent event in Africa from 1997 to 2014. This data set would be a good candidate for visualization with a linked timeline and choropleth map, where selections in the timeline can drive the filtering of data shown on the map.

  • timestamp: T
  • (latitude, longitude): G
  • country: G
  • number of fatalities: Q
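
The linked-view interaction described above boils down to filtering events by the selected time range and re-aggregating per country, which gives the values a choropleth map would color by. A minimal sketch follows; the events, the placeholder country names, and the `fatalities_by_country` helper are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

# Made-up events for illustration: (date, country, number of fatalities).
events = [
    (date(1998, 6, 1), "Country A", 4),
    (date(2001, 2, 10), "Country A", 3),
    (date(2003, 9, 5), "Country B", 7),
    (date(2004, 1, 20), "Country A", 2),
    (date(2010, 11, 2), "Country B", 1),
]


def fatalities_by_country(events, start, end):
    """Sum fatalities per country for events with start <= date < end."""
    totals = defaultdict(int)
    for day, country, fatalities in events:
        if start <= day < end:
            totals[country] += fatalities
    return dict(totals)


# A timeline brush covering 2000 through 2004 drives the map values.
selected = fatalities_by_country(events, date(2000, 1, 1), date(2005, 1, 1))
# With no brush, the map shows the full time range.
everything = fatalities_by_country(events, date.min, date.max)
```

Each time the timeline selection changes, the map would be recolored from the new per-country totals.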

Safecast (3.2GB)

Grassroots sensor data about nuclear radiation in Japan.

Statistical Computing Statistical Graphics Data Expo - Airline on-time performance (12GB)

A great data set for scalability testing. This is the data set used in the Crossfilter Demo.

The GDELT Data Set (~100GB)

This would be a great data set for more extreme scalability testing. There is an Open Source project for loading this data set into Spark on AWS.

The Indian Census has lots of public data.

Best Buy has a developer portal for querying their data via a Web API.