# Data wrangling

### What we'll cover

* Loading data
* Transforming it
* Storing it

### How we'll do it

We'll be working with a dataset, **TODO**. We'll use it to answer some real-life questions, with examples before each step. The goal is to have you write code you can use as a reference later on, so we'll try to explain what's happening.

## Loading data

### Loading from CSVs

The `agate` library (formerly `journalism`), by ex-Tribune developer Chris Groskopf, who also wrote `csvkit`, was built with journalists handling CSVs in mind ([here's the documentation](https://agate.readthedocs.org/)). The fundamental unit in `agate` is the table, and there's a one-step method of creating a table from a CSV file:

    data = agate.Table.from_csv('data.csv')

#### Your turn

Load the data located in the **TODO** file into an `agate` table and check out how the example works. Think about how you'd access a row's data. How would you answer a question like "sum up all the values of a column in this spreadsheet?"

    import agate

    table = # Your loading code goes here

    for row in table.rows:
        for column in table.column_names:
            print('%s: %s' % (column, row[column]))

It's worth mentioning that there's another, more general-purpose way to load data from a CSV: `csv.DictReader` from the standard library. When you read a CSV file with `DictReader`, you get an iterator that yields one dictionary per row, with the header row (assuming there is one) supplying the dictionary's keys.

    from csv import DictReader

    with open('data.csv') as fh:
        reader = DictReader(fh)
        for row in reader:
            # row is now a dict:
            # row[column_name] == column_value
            print(row)
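
To make the shape of those dictionaries concrete, here is a minimal sketch that reads a tiny in-memory CSV and totals one column. The column names `name` and `amount`, the sample values, and the `sum_column` helper are all made up for illustration, and the code is written for Python 3. Note that `DictReader` hands back every value as a string, so numbers have to be converted before you do arithmetic on them.

```python
import io
from csv import DictReader

# A made-up, in-memory stand-in for a CSV file on disk.
SAMPLE_CSV = "name,amount\nalpha,10\nbeta,5\ngamma,7\n"


def sum_column(fh, column_name):
    """Add up the values of one column, converting each from text."""
    total = 0.0
    for row in DictReader(fh):
        # Every value arrives as a string, e.g. row['amount'] == '10'.
        total += float(row[column_name])
    return total


total = sum_column(io.StringIO(SAMPLE_CSV), 'amount')
print(total)
```

With a real file you would pass the handle from `open('data.csv')` instead of the `io.StringIO` object.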

### Loading from an API

### Loading from JSON

The `json` module includes two main functions: `loads` to parse JSON data, and `dumps` to create JSON from Python objects. If you haven't worked with JSON before, it's a very convenient way of passing around data objects as plain text.

`loads` takes a single string of JSON-formatted data as an argument and returns a Python object.

    import json

    data = json.loads('{"foo":"bar"}')
    print(data['foo'])

This displays `bar`.
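
`dumps` goes in the other direction. Here is a small round-trip sketch; the `record` dictionary is invented for illustration, and the code is written for Python 3.

```python
import json

# A made-up Python object to serialize.
record = {"foo": "bar", "count": 3, "items": ["a", "b"]}

# dumps turns the object into a JSON-formatted string...
text = json.dumps(record, sort_keys=True)
print(text)

# ...and loads turns that string back into an equivalent Python object.
restored = json.loads(text)
print(restored["items"][0])
```

For JSON that lives in a file rather than a string, the module also provides `json.load`, which reads from an open file object.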

#### Your turn

Load the data located in the **TODO** file into a data object.

    import json

    # Your loading code goes here

## Transforming data

### Summing up a column of data

### Filtering rows of data

### Sorting rows of data

### String cleaning

### Geocoding addresses

We'll be using the excellent `geopy` library ([docs](https://geopy.readthedocs.org/en/1.10.0/); run `pip install geopy` if you don't have it on your machine), which provides a common, simple interface to a variety of geocoding services. Not all geocoding services work equally well, and most limit how many requests you can make in a short amount of time. So if you're going to geocode a large number of addresses, you'll need to figure out which service is best for you.

To create an instance of a geocoder using a particular service, first import the appropriate class:

    from geopy.geocoders import Nominatim

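Then create the instance (newer `geopy` releases require a `user_agent` argument identifying your application here, for example `Nominatim(user_agent="my-app")`):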

    geocoder = Nominatim()

Once we have the geocoder instance created, using it is as simple as passing a string containing the address we're interested in:

    location = geocoder.geocode("1701 California Street, Denver, CO")

And from there:

    print('%s %s' % (location.latitude, location.longitude))

This prints `39.7472023 -104.9904179` (the exact digits may differ as the service's underlying data is updated).
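
Because of the request limits mentioned earlier, a loop over many addresses usually needs a pause between requests. The following is a rough sketch, not part of `geopy`: `geocode_all` is a hypothetical helper, and `fake_geocode` with its made-up coordinates is a stand-in so the example runs without contacting any service. With a real geocoder you would pass `geocoder.geocode` instead and pick a delay that fits the service's usage policy.

```python
import time


def geocode_all(addresses, geocode, delay=1.0):
    """Call geocode(address) for each address, pausing between requests.

    geocode is any callable that takes an address string, for example
    the geocode method of a geopy geocoder instance.
    """
    results = {}
    for i, address in enumerate(addresses):
        if i > 0:
            time.sleep(delay)  # wait before every request after the first
        results[address] = geocode(address)
    return results


# Stand-in lookup with made-up coordinates, so no service is contacted.
FAKE_COORDINATES = {"address one": (1.0, 2.0), "address two": (3.0, 4.0)}


def fake_geocode(address):
    return FAKE_COORDINATES.get(address)  # None when nothing is found


found = geocode_all(["address one", "address two", "nowhere"],
                    fake_geocode, delay=0)
print(found)
```

`geopy`'s own `geocode` method also returns `None` when it finds no match, so check for that before reading `.latitude` or `.longitude`.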

#### Your turn

Create an instance of the Google geocoder, and use it to find the latitude and longitude of Molly Brown's house at 1340 Pennsylvania St in Denver. **TODO: Should we make this more relevant to the story data?** (Heads up: most geocoding services limit heavy usage by IP address, so this classroom might get temporarily blocked and the examples may not work.)

    from geopy.geocoders import  # Your geocoder class goes here

### Comparing dates and date strings 

The standard `datetime` module and the excellent [strftime.org cheat sheet](http://www.strftime.org) (seriously, bookmark it) let Python translate between a really delicious variety of date and time formats.
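
As a minimal sketch of what that looks like in practice: `strptime` parses a string into a `datetime` using format codes, and `strftime` writes one back out. Once both strings are parsed, they can be compared directly, which isn't reliable on the raw strings when their formats differ. The date strings and formats below are invented for illustration, and the code is written for Python 3.

```python
from datetime import datetime

# Two made-up date strings in different formats.
first = datetime.strptime("03/15/2016", "%m/%d/%Y")
second = datetime.strptime("2016-02-01", "%Y-%m-%d")

# datetime objects compare chronologically, whatever format they came from.
print(second < first)

# Subtracting one from the other gives a timedelta.
print((first - second).days)

# strftime renders a datetime in whatever format you need.
print(first.strftime("%Y-%m-%d"))
```

Each `%` code in the format string stands for one piece of the date (`%m` for month, `%d` for day, `%Y` for four-digit year), which is where the cheat sheet comes in handy.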

## Storing data

### Saving data as a CSV

### Saving data as JSON

### Saving data to S3