# JSON and CSVs

## On boilerplate code

We are finally done with learning new syntax and structures in Python.  Now we can start using all the patterns we've been learning in the context of real data.

You'll be learning some new packages and have your first real interaction with boilerplate code.

Even though you've learned a lot these past few weeks, some tools still have structures and requirements that won't really make sense at first interaction.  Some tools are just designed to work in a certain way, and you are provided with documentation and example code that show you how to use them.  This is what we call boilerplate code.

Think of how a syllabus is required to have statements about inclusion and diversity.  Not all professors are experts in the appropriate language for these statements, so we look to the relevant departments for guidance.  They often provide expertly written language for these documents to help ensure standards are met for all students and to avoid misinterpretation of the policies.

These are often things that we read through and they make sense, but we often think "there's no way I could have written that on my own!" And that's kind of the point!  That's boilerplate code and that's boilerplate language: when things are so complex and nuanced, the template is just given to you.  As your expertise in that area develops, you are better able to understand the inner workings and the design choices.

With a lot of boilerplate code, you'll need to adapt a small part of it to fit your needs and leave the rest of it alone.  A big part of the challenge is that you'll need to be able to look at the boilerplate code and make an educated guess about what parts of the code you need to change and what parts you should leave alone.  The trick is to find the input and the output of that chunk of code.  Where is the data going into it and where is it coming out?

I'm going to be giving you some boilerplate code for reading in and writing data in these two formats, but there are also some boilerplate patterns that are simply the standard ways we accomplish these things.

## What are json and csv?

### From a data structure perspective

These are less two different data types and more like two different data structures.  You can have many kinds of data within them, but they are two different ways of organizing data.

Much like having two different ways of organizing a kitchen or closet, each will get the job done so long as you know where things are, but you may find that:

* one structure makes more sense to you and is your preferred way of organizing the information
* the types of data and questions that you have are more easily processed in one structure than another
* you can generally accomplish the same things in each structure, but the different organization models will be better equipped to store certain data relationships than the other.

### From a format perspective

Each of these types of files is stored as plain text.  This means that you should be able to open it up and view it in any plain text editor, but you may not be able to really make sense of the structure just by looking at it.  That's because this format wasn't necessarily designed to organize data for human eyes.  The data is formatted in plain text to make it easy for a computer to process.  It is text designed for a program to parse.

Think of a CSV file that you might open up in Excel.  You can see the content in the raw file, but you may really need the rendering of the Excel workbook to be able to cleanly scan the data.

JSON files take this even further.  In practice they are hard to scan by eye and need programmatic processing to get anything done with them.  What JSON sacrifices in readability, it gains in the depth of detail of the data content and relationships that it can represent.

### From a technical perspective

Each of these files is plain text, so we process them in Python much the same way we are used to with other files.  We read them in as a string or open them up as an infile as normal, and then pass that information off to another part of the program that knows what to do with that kind of data structure.  That code then does the job of translating the data into a structure that we know how to work with in Python, and then we can get to business with that resulting structure.

This process is often called parsing the data.  While we could write our own code to parse the content into a structure we can work with, we usually leave it up to Python to deal with for us because we have better things to do with our time.
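
To see why hand-written parsing gets tedious, here is a small sketch comparing a naive split on commas with the standard library's `csv` module.  The sample line is made up for illustration.

``` python
import csv

# A made-up line of CSV text where one value contains a comma inside quotes.
line = '"Smith, Jane",42,Chicago'

# Naive approach: split on every comma, which breaks the quoted name apart.
naive = line.split(",")

# Standard library approach: csv.reader accepts any iterable of lines,
# so a list holding our one line works here.
parsed = list(csv.reader([line]))[0]

print(naive)   # ['"Smith', ' Jane"', '42', 'Chicago']
print(parsed)  # ['Smith, Jane', '42', 'Chicago']
```

The naive version gets the wrong number of values as soon as the data contains a quoted comma, which is the kind of detail a parser written by someone else has already handled.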

#### CSV structures in python

CSV documents are rectangular structures that are often best represented as 2D lists.  This means that we create an outer list that holds many inner lists.  Each inner list corresponds to a specific row in the data, and the elements within the list are the values for each cell going across that row.  Lists are a great way to store this data because the position of the data within the file determines the meaning of it.  Lists are built around positions as their primary way to access and identify items.  However, the thing that isn't well represented within a 2D list is the idea of the column.  You can easily grab an individual row from a 2D list, because that's just one of the inner lists.  We can use normal indexing syntax to grab an entire row of data (`list[pos_num]`).

However, columns in CSVs aren't all contained in a single data structure.  The values that correspond to a column are spread across all the inner lists.  Thus, to get the values from an entire column we must first access every inner list and then grab the value at that column's position number.  There are special tools (like pandas) that were created to store this kind of 2D array of data in a way that lets us access things like columns more naturally.  Intuitively you'd think that accessing a row and a column are about the same, but in the reality of the code, accessing data from a column looks completely different and much more complex.

There are tools you can use to read a CSV into a Python `dict` structure, although we will not be using that here.
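
As a quick sketch of that structure, here is a tiny table written out as a 2D list.  The name `table` and the values in it are invented for illustration.

``` python
# Hypothetical data, invented for illustration: a small table as a 2D list.
table = [
    ["name", "species", "age"],   # row 0 (the header row)
    ["Luna", "cat", "3"],         # row 1
    ["Rex", "dog", "5"],          # row 2
]

first_data_row = table[1]   # one inner list: the whole row
age_of_luna = table[1][2]   # row 1, then position 2 within that row

print(first_data_row)  # ['Luna', 'cat', '3']
print(age_of_luna)     # 3
```

Note that the values are strings, which is how they arrive when they are read out of a plain text file.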
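
Here is a minimal sketch of grabbing a column from a 2D list, using a made-up table (the names `table` and `species_position` are just for this example).  Compare how much more work this is than grabbing a single row.

``` python
# Hypothetical data, invented for illustration.
table = [
    ["name", "species", "age"],
    ["Luna", "cat", "3"],
    ["Rex", "dog", "5"],
]

species_position = 1   # "species" is the second value in every row

# Visit every inner list (skipping the header row) and pull out
# the value sitting at the column's position.
species_column = []
for row in table[1:]:
    species_column.append(row[species_position])

print(species_column)  # ['cat', 'dog']
```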
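
For the curious, one such tool is `csv.DictReader` from the standard library.  This is only a brief sketch with made-up data, not a pattern we'll rely on in this lesson.

``` python
import csv
import io

# Hypothetical CSV text, invented for illustration.
raw_text = "name,species,age\nLuna,cat,3\nRex,dog,5\n"

# io.StringIO lets us treat a string like an open file.
# DictReader uses the first row as the keys for every later row.
reader = csv.DictReader(io.StringIO(raw_text))
records = list(reader)

print(records[0]["species"])  # cat
print(records[1]["name"])     # Rex
```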

#### JSON structures in python

json has a completely different structure from a rectangular array of data like a CSV.  In fact, json has no specific shape that we can name, other than a tree structure.  Whereas CSVs are very position first and everything is called via that position value, json has a very name or label oriented structure.

Operationally, you've been dealing with this kind of structure already with dictionaries in Python.  In fact, when you parse a json file into Python, it is usually saved as either one big dictionary or a list of many dictionaries.  There are no rows of data, but instead clusters of trees or one large tree that you can access.  Even if those dictionaries are stored within a list, that doesn't mean that the position values for those dictionaries have an actual meaning to the data.  You may be able to infer the meaning based on your understanding of how the data file was created, but that won't always hold true.
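
As a small illustration, here is what a made-up piece of json text looks like once it has been parsed.  The keys and values are invented for this example.

``` python
import json

# Hypothetical json text, invented for illustration.
raw_text = '''
{
  "title": "Luna",
  "tags": ["moon", "mythology"],
  "details": {"language": "en", "length": 1200}
}
'''

data = json.loads(raw_text)

print(type(data))                   # <class 'dict'>
print(data["details"]["language"])  # en
print(data["tags"][0])              # moon
```

Everything is reached by name first (`"details"`, then `"language"`), and positions only show up when a list happens to be nested inside.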

## How to parse these in python

You're going to use boilerplate code for each of these to read them in.  There are modules in the Python standard library for parsing json and csv data.  We'll be using both of those.

Much of the credit for the patterns used in this lesson goes to the Python Cookbook, 3rd edition.

# The data we'll be using

Take a moment to look through the `lunaresults.json` file in this directory.  It contains the wikipedia api results of 

# Parsing json with the `json` library

You'll be reading in the json file like a normal text file with `.read()`, so it is all one string.  Then you pass that string into a special function within the library called `json.loads()`.  This function parses a string and returns the relevant Python data structure.

Here's the boilerplate code:

``` python
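import json

# Read the whole file in as one string
infile = open('yourjsonfile.json', 'r')
text = infile.read()
infile.close()

# Parse the string into Python data structures (dicts, lists, etc.)
data = json.loads(text)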
```
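
Once the data is parsed, a good first step is to poke at it and see what you got back.  The sketch below writes a tiny stand-in file first so that it can run anywhere.  The file name and its contents are invented for illustration and are not the actual structure of `lunaresults.json`.

``` python
import json
import os
import tempfile

# Create a small stand-in json file (made-up content) so this runs anywhere.
sample_text = '{"query": "luna", "results": [{"title": "Luna"}, {"title": "Moon"}]}'
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.json")
outfile = open(path, "w")
outfile.write(sample_text)
outfile.close()

# The same read-then-parse boilerplate pattern.
infile = open(path, "r")
text = infile.read()
infile.close()
data = json.loads(text)

# Explore the result: what type is it, and what names does it hold?
print(type(data))            # <class 'dict'>
print(list(data.keys()))     # ['query', 'results']
print(len(data["results"]))  # 2
```

The `json` module also has `json.load()` (no "s"), which takes the open file itself instead of a string, if you would rather skip the `.read()` step.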

