# Reference Data Repository

This notebook contains a few examples that demonstrate the main functionality of the **Reference Data Repository** package `refdata`.

## Local Data Store

The local data store is responsible for maintaining information about downloaded datasets and providing access to the downloaded data. By default, all downloaded files are stored in a local folder under `$HOME/.refdata`. This behavior can be changed either by setting the environment variable *REFDATA_BASEDIR* to point to a different directory on the file system or by passing the directory via the `basedir` parameter when creating an instance of `LocalStore`.
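
The directory choice described above can be sketched in plain Python. This is only an illustration, not the package's implementation: the helper name `resolve_basedir` is hypothetical, and the assumption that an explicit `basedir` argument takes precedence over *REFDATA_BASEDIR* is ours, not something stated in this notebook.

```python
import os
from pathlib import Path


def resolve_basedir(basedir=None, env=None):
    """Hypothetical helper: decide where downloaded files would be stored.

    Assumed order: explicit argument, then the REFDATA_BASEDIR
    environment variable, then the default folder $HOME/.refdata.
    """
    # Allow a custom mapping so the lookup can be tried without
    # modifying the real process environment.
    env = os.environ if env is None else env
    if basedir is not None:
        return Path(basedir)
    if env.get('REFDATA_BASEDIR'):
        return Path(env['REFDATA_BASEDIR'])
    return Path.home() / '.refdata'


# No argument and no environment variable: fall back to the default.
print(resolve_basedir(env={}))
```

With the actual package, the equivalent would be to pass the resolved path as `LocalStore(basedir=...)`, using the `basedir` parameter mentioned above.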

The local data store is associated with a (remote) data repository index file that lists the datasets that are available for download. By default, the [index file in this repository is used](https://github.com/VIDA-NYU/reference-data-repository/blob/master/data/index.json). You can change this behavior by setting the environment variable *REFDATA_URL* to the URL of a different index file.

In [1]:
# Create an instance of the local data store with default settings.

from refdata.store import LocalStore

refstore = LocalStore()

In [2]:
# Print the identifier, name, and description for all
# datasets that are listed in the associated repository
# index.

for dataset in refstore.repository().find():
    print('{} (id={})'.format(dataset.name, dataset.identifier))
    desc = dataset.description if dataset.description is not None else 'no description available'
    print('{}\n'.format(desc))

Cities in the U.S. (id=encyclopaedia_britannica:us_cities)
Names of cities in the U.S. from the Encyclopaedia Britannica.

REST Countries (id=restcountries.eu)
Information about countries in the world available from the restcountries.eu project.

C1 Street Suffix Abbreviations (id=usps:street_abbrev)
Mapping of common street type abbreviations to a standard format.

C2 Secondary Unit Designators (id=usps:secondary_unit_designators)
no description available



### Manage Downloaded Datasets

The local data store provides basic functionality to download datasets, list all downloaded datasets, access their metadata, and remove a dataset from the local file system.

In [3]:
# Download the restcountries dataset

refstore.download('restcountries.eu')

('a023b7d5233a4d35a15a11b2ec8b9cfa',
 {'id': 'restcountries.eu',
  'name': 'REST Countries',
  'description': 'Information about countries in the world available from the restcountries.eu project.',
  'url': 'https://raw.githubusercontent.com/VIDA-NYU/openclean-reference-data/master/data/restcountries.eu.json',
  'checksum': '5893ebfad649533ac82a0b030a24efdd519f95a8b030a5ac9c7df37e85aad005',
  'webpage': 'https://restcountries.eu/',
  'schema': [{'id': 'name',
    'name': 'Name',
    'description': 'Country name',
    'dtype': 'text'},
   {'id': 'alpha2Code',
    'name': 'Country Code (2-letters)',
    'description': 'ISO 3166-1 2-letter country code',
    'dtype': 'text'},
   {'id': 'alpha3Code',
    'name': 'Country Code (3-letters)',
    'description': 'ISO 3166-1 3-letter country code',
    'dtype': 'text'},
   {'id': 'capital',
    'name': 'Capital',
    'description': 'Capital city',
    'dtype': 'text'},
   {'id': 'region',
    'name': 'Region',
    'description': 'World region'

In [4]:
# List identifiers and names for datasets that have
# been downloaded to the local store.

print('Downloaded datasets:\n')
for dataset in refstore.list():
    print('> {} (id={})'.format(dataset.name, dataset.identifier))

Downloaded datasets:

> REST Countries (id=restcountries.eu)
> Cities in the U.S. (id=encyclopaedia_britannica:us_cities)


In [5]:
# List identifiers and names for columns (attributes)
# in the restcountries dataset.

print('Columns:\n')
for col in refstore.open('restcountries.eu').columns:
    print('  {} (id={})'.format(col.name, col.identifier))

Columns:

  Name (id=name)
  Country Code (2-letters) (id=alpha2Code)
  Country Code (3-letters) (id=alpha3Code)
  Capital (id=capital)
  Region (id=region)
  Sub-Region (id=subregion)


In [6]:
# The full dataset metadata is also available as a
# dictionary.

import json

print(json.dumps(refstore.open('restcountries.eu').to_dict(), indent=4))

{
    "id": "restcountries.eu",
    "name": "REST Countries",
    "description": "Information about countries in the world available from the restcountries.eu project.",
    "url": "https://raw.githubusercontent.com/VIDA-NYU/openclean-reference-data/master/data/restcountries.eu.json",
    "checksum": "5893ebfad649533ac82a0b030a24efdd519f95a8b030a5ac9c7df37e85aad005",
    "webpage": "https://restcountries.eu/",
    "schema": [
        {
            "id": "name",
            "name": "Name",
            "description": "Country name",
            "dtype": "text"
        },
        {
            "id": "alpha2Code",
            "name": "Country Code (2-letters)",
            "description": "ISO 3166-1 2-letter country code",
            "dtype": "text"
        },
        {
            "id": "alpha3Code",
            "name": "Country Code (3-letters)",
            "description": "ISO 3166-1 3-letter country code",
            "dtype": "text"
        },
        {
            "id": "capital",
 

In [7]:
# Remove a downloaded dataset from the local file system.

refstore.remove('restcountries.eu')

print(refstore.list())

[<refdata.base.DatasetDescriptor object at 0x7f89f7d12040>]


### Access Reference Data

Data from downloaded datasets can be accessed in three different ways:

- Set of distinct values
- Lookup table generated from dataset columns
- Pandas data frame

#### Set of Distinct Values

Get the set of distinct values for one or more columns of a dataset. If multiple columns are specified (as a list), the resulting set contains tuples of distinct value combinations.
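
To make the single-column versus multi-column behavior concrete, here is a minimal plain-Python sketch of the idea. It does not use `refdata`; the function `distinct` below is a stand-in written for this illustration, and the toy rows (including the `city` column name) are assumptions rather than the layout of the actual dataset.

```python
# Toy rows standing in for a downloaded dataset (illustrative only).
rows = [
    {'city': 'Springfield', 'state': 'Illinois'},
    {'city': 'Springfield', 'state': 'Missouri'},
    {'city': 'Chicago', 'state': 'Illinois'},
]


def distinct(rows, columns):
    """Stand-in for the behavior described above (not the refdata code)."""
    if isinstance(columns, str):
        # Single column: a set of scalar values.
        return {row[columns] for row in rows}
    # List of columns: a set of tuples, one per distinct combination.
    return {tuple(row[c] for c in columns) for row in rows}


states = distinct(rows, 'state')
pairs = distinct(rows, ['city', 'state'])

print(sorted(states))
print(sorted(pairs))
```

Note that 'Springfield' appears in two tuples because the combination with the state differs, while 'Illinois' appears only once in the single-column result.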

In [8]:
# Get the set of distinct U.S. state names from the
# Encyclopaedia Britannica dataset with U.S. city
# names.

# Instead of downloading and then opening the dataset,
# we can open it directly and set the auto_download flag,
# which will download the dataset if it is not in the local
# store.

dataset = refstore.open('encyclopaedia_britannica:us_cities', auto_download=True)
# Alternative shortcut:
# refstore.distinct(key='encyclopaedia_britannica:us_cities', columns='state')

dataset.distinct('state')

{'Alabama',
 'Alaska',
 'Arizona',
 'Arkansas',
 'California',
 'Colorado',
 'Connecticut',
 'Delaware',
 'Florida',
 'Georgia',
 'Hawaii',
 'Idaho',
 'Illinois',
 'Indiana',
 'Iowa',
 'Kansas',
 'Kentucky',
 'Louisiana',
 'Maine',
 'Maryland',
 'Massachusetts',
 'Michigan',
 'Minnesota',
 'Mississippi',
 'Missouri',
 'Montana',
 'Nebraska',
 'Nevada',
 'New Hampshire',
 'New Jersey',
 'New Mexico',
 'New York',
 'North Carolina',
 'North Dakota',
 'Ohio',
 'Oklahoma',
 'Oregon',
 'Pennsylvania',
 'Rhode Island',
 'South Carolina',
 'South Dakota',
 'Tennessee',
 'Texas',
 'Utah',
 'Vermont',
 'Virginia',
 'Washington',
 'West Virginia',
 'Wisconsin',
 'Wyoming'}

#### Lookup Tables

It is possible to directly generate a lookup table that maps values from one column (or multiple columns) to the values in one or more other columns. Lookup tables are represented as dictionaries.
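
The following plain-Python sketch shows what such a dictionary could look like. It is not the package's implementation: `build_mapping` is a hypothetical helper, the rows are a small hand-written sample, and representing multi-column keys or values as tuples is an assumption made for this illustration.

```python
# Small hand-written sample with the column names used in this notebook.
rows = [
    {'name': 'Afghanistan', 'alpha3Code': 'AFG', 'capital': 'Kabul'},
    {'name': 'Albania', 'alpha3Code': 'ALB', 'capital': 'Tirana'},
    {'name': 'Algeria', 'alpha3Code': 'DZA', 'capital': 'Algiers'},
]


def build_mapping(rows, lhs, rhs):
    """Hypothetical helper: map values of column(s) lhs to column(s) rhs."""

    def pick(row, cols):
        # A single column name yields a scalar; a list of names yields
        # a tuple (assumed representation for multiple columns).
        if isinstance(cols, str):
            return row[cols]
        return tuple(row[c] for c in cols)

    # If a key occurs more than once, the last row wins in this sketch.
    return {pick(row, lhs): pick(row, rhs) for row in rows}


capitals = build_mapping(rows, lhs='alpha3Code', rhs='capital')
by_name_and_code = build_mapping(rows, lhs=['name', 'alpha3Code'], rhs='capital')

print(capitals['DZA'])
print(by_name_and_code[('Albania', 'ALB')])
```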

In [9]:
# Get a lookup table (dictionary) that maps the
# ISO 3166-1 3-letter country code to the country's
# capital city.

dataset = refstore.open('restcountries.eu', auto_download=True)
# Alternative shortcut:
# refstore.mapping(key='restcountries.eu', lhs='alpha3Code', rhs='capital')

mapping = dataset.mapping(lhs='alpha3Code', rhs='capital')

mapping['AUS']

'Canberra'

#### Data Frame

The full dataset (or a subset of the columns) can also be loaded as a pandas data frame.

In [10]:
# Get data frame with country name, 3-letter country code,
# and capital city.

dataset = refstore.open('restcountries.eu', auto_download=True)
# Alternative shortcut:
# refstore.load('restcountries.eu', ['name', 'alpha3Code', 'capital'])

df = dataset.data_frame(['name', 'alpha3Code', 'capital'])

df.head()

| | name | alpha3Code | capital |
|---|---|---|---|
| 0 | Afghanistan | AFG | Kabul |
| 1 | Åland Islands | ALA | Mariehamn |
| 2 | Albania | ALB | Tirana |
| 3 | Algeria | DZA | Algiers |
| 4 | American Samoa | ASM | Pago Pago |
