Set up the notebook

In [1]:
import os
import csv
import json

notebook_path = os.path.abspath("generate_metadata.ipynb")

Set the path to the CSV file you're working with, relative to the folder that contains this Python notebook:

In [29]:
infilename = 'datasets/housing_data.csv'  # replace with "{your file here}.csv"
infilepath = os.path.join(os.path.dirname(notebook_path), infilename)

Fill out the fields for the data card, except "fields" and "column"; we'll get to those later. A template is [here](https://hackmd.io/62se7jj-Qoycs__e6NjS2w).

In [30]:
data = {
    #ignore this
    'fields': [],
    #here we include the basic card information
    'card': {
        # a short description of the dataset
        'description': "A widely used, small dataset on housing in the Boston, Massachusetts area.",
        # a link to the original source
        'source': 'University of Toronto (https://www.cs.toronto.edu/~delve/data/boston/desc.html)', 
        #date last updated (if possible)
        'last_updated': '1978', 
        'context': 
        {
            # who it was created by
            'created_by': 'US Census Service',
            # has it been cleaned/prepared for use
            'preparation': 'Yes', 
            # does it contain potential identifying/confidential information
            'confidentiality': 'No', 
            # does it contain information that can identify a subgroup of people (age, race, gender)
            'subgroup_identifiers': 'No', 
            # what are potential uses (e.g. what are some successful combinations of features)?
            'potential_uses': "Predict median home value based on Black population and student-teacher ratio.",
            # what should it not be used for?
            'potential_misuses': "This is an old dataset, so it's unlikely to reflect contemporary home values. This is also a dataset that reflects racial and economic biases: is it right or fair that home values tended to be higher in areas where the population was more wealthy, or less Black?"
        }
    }
}


Now that we've done our basic setup, let's get to the columns. Run this code, which should display the available columns in the CSV:

In [31]:
with open(infilepath, 'r', newline='') as infile:  # newline='' is what the csv module recommends
    reader = csv.reader(infile, delimiter=",")
    csv_list = list(map(tuple, reader))
    columns = csv_list[0]
    print(columns)

('Per Capita Crime Rate', 'Proportion of Residental Land Zoned for >25Kft', 'Proportion of non-retail business acres', 'Bounds Charles River?', 'Nitric Oxide Concentration (PPM)', 'Rooms per dwelling', '% of buildings pre-1940', 'Distance to employment centers', 'Accessibility to highways', 'Property tax', 'Student-teacher ratio', 'Weighted black population', 'Lower status population', 'Median home value')
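
If only the column names are needed, the header can be read without loading every row. The following is a minimal sketch under that assumption; `read_header` is a hypothetical helper, not part of the notebook, and the in-memory sample stands in for the real file.

```python
import csv
import io

def read_header(infile):
    """Return the first row of a CSV file object as a tuple of column names."""
    reader = csv.reader(infile, delimiter=",")
    # next() pulls only the header row; the default () covers an empty file
    return tuple(next(reader, ()))

# Stand-in for open(infilepath, newline=''): a tiny in-memory CSV
sample = io.StringIO("Rooms per dwelling,Median home value\n6.5,24.0\n")
print(read_header(sample))
```

With a real file, the same helper could be called inside `with open(infilepath, newline='') as infile:`.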


First, if you have a recommended investigation, enter the relevant features here: 

In [32]:
#set features
data['recommended_features'] = ['Student-teacher ratio', 'Lower status population']
#set label
data['recommended_label'] = 'Median home value'
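
Because the recommended features and label are typed by hand, it may be worth checking that each one matches a column name exactly. This is a small sketch of such a check; `missing_columns` and the example values are hypothetical and only illustrate the idea.

```python
def missing_columns(names, columns):
    """Return the names that do not appear in columns, preserving order."""
    available = set(columns)
    return [name for name in names if name not in available]

# Illustrative values only; in the notebook, pass data['recommended_features']
# plus [data['recommended_label']] and the `columns` tuple printed above.
example_columns = ('Student-teacher ratio', 'Lower status population', 'Median home value')
example_features = ['Student-teacher ratio', 'Unemployment rate']
print(missing_columns(example_features, example_columns))
```

An empty list means every recommended name was found among the columns.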

Assign each column to a list of either continuous or categorical data, e.g.

```
continuous = ['temperature', 'score']
categorical = ['state', 'color']
```

In [34]:
continuous = ['Per Capita Crime Rate', 'Proportion of Residental Land Zoned for >25Kft', 'Proportion of non-retail business acres', 'Nitric Oxide Concentration (PPM)', 'Rooms per dwelling', '% of buildings pre-1940', 'Distance to employment centers', 'Accessibility to highways', 'Property tax', 'Student-teacher ratio', 'Weighted black population', 'Lower status population', 'Median home value']
categorical = ['Bounds Charles River?']
# reset fields in case you're making changes
data['fields'] = []
for i in columns:
    if i in continuous:
        data['fields'].append({'type': 'continuous', 'id': i})
    elif i in categorical:
        data['fields'].append({'type': 'categorical', 'id': i})
    else:
        # every column must appear in exactly one of the two lists
        raise ValueError("You forgot to set a type for %r" % i)

print('columns: ', columns)

    

columns:  ('Per Capita Crime Rate', 'Proportion of Residental Land Zoned for >25Kft', 'Proportion of non-retail business acres', 'Bounds Charles River?', 'Nitric Oxide Concentration (PPM)', 'Rooms per dwelling', '% of buildings pre-1940', 'Distance to employment centers', 'Accessibility to highways', 'Property tax', 'Student-teacher ratio', 'Weighted black population', 'Lower status population', 'Median home value')
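
For wider datasets, a rough first guess at the two lists can come from counting distinct values per column. The sketch below is only a heuristic and an assumption on our part: `suggest_types` and the `max_categories` threshold are hypothetical, the rows are made up, and the result should still be reviewed by hand.

```python
def suggest_types(header, rows, max_categories=5):
    """Guess 'categorical' for columns with few distinct values, else 'continuous'."""
    suggestions = {}
    for idx, name in enumerate(header):
        # collect the distinct raw string values seen in this column
        distinct = {row[idx] for row in rows}
        suggestions[name] = 'categorical' if len(distinct) <= max_categories else 'continuous'
    return suggestions

# Made-up rows for illustration; in the notebook, pass csv_list[0] and csv_list[1:].
header = ('Bounds Charles River?', 'Rooms per dwelling')
rows = [('0', '6.1'), ('1', '5.9'), ('0', '7.2'), ('NA', '6.4'),
        ('0', '6.8'), ('1', '5.5'), ('0', '6.0')]
print(suggest_types(header, rows))
```

A numeric column with few distinct values (a small integer index, for example) would be flagged as categorical by this rule, so treat the output as a starting point only.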


We've set our columns for the interface. Now let's add the descriptions as a list: for example, if our columns are `['Year', 'Temperature']`, our list might be `['The year the measurement was taken', 'The temperature in Celsius']`. List the descriptions in the same order as the columns printed above.

In [35]:
desc = [
    "Per capita crime rate by town.",
    "Proportion of residential land zoned for lots over 25,000 square feet.",
    "Proportion of non-retail business acres per town.",
    "1 if the land is next to the Charles River, 0 otherwise. Some towns have an 'NA' where the answer is uncertain.",
    "Nitric oxide concentration in parts per 10 million.",
    "Average number of rooms per dwelling.",
    "Proportion of owner-occupied units built prior to 1940.",
    "Weighted distance to 5 major Boston employment centers.",
    "A weighted measure of accessibility to major highways.",
    "Property tax rate per $10,000.",
    "The average student to teacher ratio by town.",
    "A weighted measure of the Black population per town.",
    "The percent of the population designated as 'lower status'.",
    "Median home value in $1000s.",
]
# the lists must line up one-to-one, so reject extra descriptions as well as missing ones
if len(desc) != len(data['fields']):
    raise ValueError("You need exactly one description for each column!")
for field, description in zip(data['fields'], desc):
    field['description'] = description


Take a look and make sure everything is right, and if you're confident, we can write our data to a JSON file.

In [36]:
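print(infilename)
# swap only the final extension, so dots elsewhere in the path are left alone
outfilepath = os.path.splitext(infilepath)[0] + '.json'
with open(outfilepath, 'w') as outfile:
    json.dump(data, outfile)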

datasets/housing_data.csv
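
To confirm the file was written correctly, one option is to load it back and compare it with the dictionary in memory. This is a minimal sketch; `roundtrip_ok` is a hypothetical helper, and the example dictionary and temporary directory stand in for the real `data` and output path.

```python
import json
import os
import tempfile

def roundtrip_ok(data, path):
    """Write data to path as JSON, read it back, and compare the two."""
    with open(path, 'w') as outfile:
        json.dump(data, outfile)
    with open(path) as infile:
        return json.load(infile) == data

# Illustrative card; a temporary directory keeps the example from touching real files
example = {'fields': [{'type': 'continuous', 'id': 'Median home value'}],
           'card': {'last_updated': '1978'}}
with tempfile.TemporaryDirectory() as tmpdir:
    print(roundtrip_ok(example, os.path.join(tmpdir, 'example.json')))
```

The comparison holds for dictionaries built from lists, strings, and numbers. Tuples would come back as lists, so a card containing tuples would not compare equal.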
