# JSON - Lab

## Introduction

In this lab, you'll practice navigating JSON data structures.

## Objectives

You will be able to:

* Use Python to load and parse JSON documents

## Your Task: Find the Total Payments for Each Candidate

We will be using the same dataset, `nyc_2001_campaign_finance.json`, as in the previous lesson. Recall that the description is:

> A listing of public funds payments for candidates for City office during the 2001 election cycle

For added context, the City of New York provides matching funds for eligible contributions made to candidates, using various ratios depending on the contribution amount ([more details here](https://en.wikipedia.org/wiki/New_York_City_Campaign_Finance_Board#The_Campaign_Finance_Program)). These values are therefore not the complete amounts raised by the candidates; they are only the amounts matched by the city. For that reason, we expect some of the values to be identical for different candidates.

Recall also that the dataset is separated into `meta`, which contains metadata, and `data`, which contains the actual campaign finance records. You will need to use the information in `meta` to understand how to interpret the information in `data`.
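To make the `meta`/`data` split concrete, here is a minimal sketch using a tiny, made-up dictionary with the same general shape. The name `toy` and all of its values are illustrative assumptions, not contents of the real file:

```python
# Toy stand-in for the real file; the names and amounts are made up.
toy = {
    "meta": {"view": {"columns": [{"name": "CANDNAME"}, {"name": "TOTALPAY"}]}},
    "data": [
        ["Doe, Jane", "100.00"],
        ["Smith, John", "250.50"],
    ],
}

# The metadata describes what each position in a record means
toy_column_names = [col["name"] for col in toy["meta"]["view"]["columns"]]
print(toy_column_names)

# Each record is a plain list, so values are looked up by position
first_record = toy["data"][0]
print(first_record[toy_column_names.index("TOTALPAY")])
```

The records carry no labels of their own, which is why the column names in `meta` are needed to interpret them.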

Your goal is to create a list of tuples, where the first value in each tuple is the name of a candidate in the 2001 election, and the second value is the total payments they received. The structure should look like this:

```python
[
    ("John Smith", 62184.00),
    ("Jane Doe", 133146.00),
    ...
]
```

The list should contain 284 tuples, since there were 284 candidates.

## Open the Dataset

Import the `json` module, open the `nyc_2001_campaign_finance.json` file using the built-in Python `open` function, and load all of the data from the file into a Python object using `json.load`.

Assign the result of `json.load` to the variable name `data`.

In [18]:
# Your code here
import json

with open('nyc_2001_campaign_finance.json') as f:
    data = json.load(f)

Recall the overall structure of this dataset:

In [19]:
# Run this cell without changes

print(f"The overall data type is {type(data)}")
print(f"The keys are {list(data.keys())}")
print()
print("The value associated with the 'meta' key has metadata, including all of these attributes:")
print(list(data['meta']['view'].keys()))
print()
print(f"The value associated with the 'data' key is a list of {len(data['data'])} records")

The overall data type is <class 'dict'>
The keys are ['meta', 'data']

The value associated with the 'meta' key has metadata, including all of these attributes:
['id', 'name', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags']

The value associated with the 'data' key is a list of 285 records


## Find the Column Names

We know that each record in the data list looks something like this:

In [20]:
# Run this cell without changes
data['data'][1]

[2,
 '9D257416-581A-4C42-85CC-B6EAD9DED97F',
 2,
 1315925633,
 '392904',
 1315925633,
 '392904',
 '{\n}',
 '2001',
 'B4',
 'Aboulafia, Sandy',
 '5',
 None,
 '44',
 'P',
 '45410.00',
 '0',
 '0',
 '45410.00']

We could probably guess which of those values is the candidate name, but it's unclear which value is the total payments received. To get that information, we need to look at the metadata.

Investigate the value of `data['meta']['view']['columns']`. It currently contains significantly more information than we need. Extract just the values associated with the `name` keys, so we have a list of the column names.

The result should look something like this:

```python
[
    "sid",
    "id",
    "position",
    ...
]
```

Name this variable `column_names`.

In [21]:
# Your code here (create more cells as needed)
column_data = data['meta']['view']['columns']
type(column_data)

list

In [22]:
column_data[:3]

[{'id': -1,
  'name': 'sid',
  'dataTypeName': 'meta_data',
  'fieldName': ':sid',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'id',
  'dataTypeName': 'meta_data',
  'fieldName': ':id',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'position',
  'dataTypeName': 'meta_data',
  'fieldName': ':position',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']}]

In [8]:
column_names = [info['name'] for info in column_data]
column_names

['sid',
 'id',
 'position',
 'created_at',
 'created_meta',
 'updated_at',
 'updated_meta',
 'meta',
 'ELECTION',
 'CANDID',
 'CANDNAME',
 'OFFICECD',
 'OFFICEBORO',
 'OFFICEDIST',
 'CANCLASS',
 'PRIMARYPAY',
 'GENERALPAY',
 'RUNOFFPAY',
 'TOTALPAY']

In [23]:
# Run this cell without changes

# There should be 19 names
assert len(column_names) == 19
# CANDNAME and TOTALPAY should be in there
assert "CANDNAME" in column_names and "TOTALPAY" in column_names

OK, now we know what each of the columns represents.

The columns we are looking for are called `CANDNAME` and `TOTALPAY`. Now that we have this list, we should be able to figure out which of the values in each record lines up with those column names.
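One way to see how names and values line up is to pair them with `zip`. This is a sketch on a shortened, hypothetical column list and record (only three of the 19 positions are shown, under the assumed names `toy_column_names` and `toy_record`), not the lab's required approach:

```python
# Shortened, hypothetical stand-ins for column_names and a single record
toy_column_names = ["sid", "CANDNAME", "TOTALPAY"]
toy_record = [2, "Aboulafia, Sandy", "45410.00"]

# Pair each column name with the value at the same position
labeled_record = dict(zip(toy_column_names, toy_record))

print(labeled_record["CANDNAME"])
print(labeled_record["TOTALPAY"])
```

Building a dictionary per record is convenient for inspection. Looking up the two indexes once with `list.index` avoids creating a dictionary for every record.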

## Loop Over the Records to Find the Names and Payments

The data records are contained in `data['data']`. Recall that the first record (index `0`) acts more like a header and should be skipped.

Loop over the records in `data['data']` and extract each candidate's name and the total payment they received from the city. Make sure you convert the total payment to a float, then make a tuple representing that candidate. Append the tuple to an overall list of results called `candidate_total_payments`.

In [24]:
# Your code here (create more cells as needed)
name_index = column_names.index("CANDNAME")
total_payments_index = column_names.index("TOTALPAY")

print("The candidate name is at index", name_index)
print("The total payment amount is at index", total_payments_index)

The candidate name is at index 10
The total payment amount is at index 18


In [29]:
candidate_total_payments = []

for record in data['data'][1:]:
    name = record[name_index]
    total_payments = float(record[total_payments_index])
    candidate_total_payments.append((name, total_payments))
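The same skip-convert-collect pattern can also be written as a list comprehension. The sketch below runs on made-up records with assumed index positions (`name_idx`, `pay_idx`), purely to illustrate the pattern:

```python
# Made-up records; the first row stands in for the header-like record
toy_records = [
    ["header-like row", "0"],
    ["Doe, Jane", "100.00"],
    ["Smith, John", "250.50"],
]
name_idx, pay_idx = 0, 1  # assumed positions for this toy data only

# Skip the first record, convert the payment string to a float,
# and build one (name, total) tuple per remaining record
toy_totals = [(rec[name_idx], float(rec[pay_idx])) for rec in toy_records[1:]]
print(toy_totals)
```

Both forms produce the same list. The explicit loop is easier to extend if extra checks are needed for a record.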

In [25]:
# Run this cell without changes

# There should be 284 records
assert len(candidate_total_payments) == 284

# Each record should contain a tuple
assert type(candidate_total_payments[0]) == tuple

# That tuple should contain a string and a number
assert len(candidate_total_payments[0]) == 2
assert type(candidate_total_payments[0][0]) == str
assert type(candidate_total_payments[0][1]) == float


Now that we have this result, we can answer questions like: *which candidates received the largest total payments from the city*?

In [None]:
# Run this cell without changes

# Print the top 10 candidates by total payments
sorted(candidate_total_payments, key=lambda x: x[1], reverse=True)[:10]
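To see what the `key` and `reverse` arguments do, here is the same sorting pattern on a small, made-up list (`toy_totals` and its values are illustrative only):

```python
# Made-up (name, total) tuples for illustration
toy_totals = [("Doe, Jane", 100.0), ("Smith, John", 250.5), ("Roe, Rich", 175.25)]

# key picks the second element of each tuple (the payment);
# reverse=True sorts from largest to smallest; [:2] keeps the top two
top_two = sorted(toy_totals, key=lambda pair: pair[1], reverse=True)[:2]
print(top_two)
```

`sorted` returns a new list and leaves the original unchanged, so the input list can still be used afterwards.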

Since you found all of the column names, it is also possible to display all of the data in a nice tabular format using pandas. That code would look like this:

In [None]:
# Run this cell without changes

import pandas as pd

pd.DataFrame(data=data['data'][1:], columns=column_names)

## Summary

Congratulations! You explored another JSON data structure of the kind used on the web, and you practiced data munging and exploration along the way.