_HDS5210 Programming for Data Scientists_

# MidTerm Exam

You will have from 10/7/2016 through 10/14/2016 to work on this mid-term.  You're welcome to use your text, all previous class materials and notes, the Internet, and the Python documentation to help you.  Please try to be clear in writing your code.

## General Description

The mid-term is one extended problem that needs to be solved in a specific way, but the code you use will be broken down into a set of modules that could be reused in similar but not identical problems.  The description will lay out the modules that you need to create and what each module should include.  The final code that will be executed is provided for you, so that you can test your code and make sure it's producing the expected outputs.

This problem involves combining and reshaping data from two different data sources:
* The CMS file on pre-existing condition plans from Assignment 7 - https://data.cms.gov/Health/Monthly-Pre-Existing-Condition-Insurance-Plan-Enro/dpuq-z7nj
* The State Census estimates from 2010 to 2015 - https://www.census.gov/popest/data/national/totals/2015/files/NST-EST2015-alldata.csv

Both of these files have already been downloaded and are available in the `/midterm/` folder on the Jupyter server.

## Part 1 - CSV to lists

Something we'll want to do often is take a CSV file and turn it into a list of lists, which looks like a rectangular spreadsheet.  For the first part of the midterm, create a module called `csvlist` with a class in it called `CsvList`.

The `CsvList` class will need the following methods:
* A constructor (aka `__init__`) that takes a single parameter, which is the file name
* A way to retrieve a list of the column names that were read from the first line of the file
* A way to retrieve the actual data rows themselves, which should be available as a list of lists

In parsing data from the file, our CsvList class can assume that the file will always have a header.  The code below should be able to work if your module and class are working.

In [None]:
import jupyterimporter
from csvlist import CsvList

f = CsvList('/midterm/preexisting.csv')
f.header  # Returns a list of the column names
f.data    # Returns the list of lists data
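As a rough illustration of the header/data split described above (not the required class), the standard `csv` module can turn CSV text into a list of lists. The in-memory sample below is hypothetical and stands in for a real file; your `CsvList` would open the file name it is given instead.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a real file on disk.
sample = io.StringIO(
    "Last,First,2015Q1,2015Q2\n"
    "Boal,Paul,10,9\n"
    "Westhus,Eric,9,10\n"
)

rows = list(csv.reader(sample))  # every line becomes a list of strings
header = rows[0]                 # the first line holds the column names
data = rows[1:]                  # the remaining lines are the data rows

print(header)
print(data)
```

Note that `csv.reader` returns every value as a string, so any numeric columns would still need converting if you want to do arithmetic on them.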

## Part 2 - "Pivoting" data

You'll notice that the CMS data file includes several columns, one for each different timeframe.  It's not atypical to receive data files where time is measured horizontally across the row, but you may need to pivot that so that you have one row per time entry.  As a simplified example, below is a sample of input data that has a separate column for each quarter in 2015:
```
Last,First,2015Q1,2015Q2,2015Q3,2015Q4
Boal,Paul,10,9,10,8
Westhus,Eric,9,10,10,9
```

When we say we want to _pivot_ that data, what we mean is that we want one row for each data value.  To be specific, we say that we're going to pivot columns 2 through the end (assuming the first column is 0).  The output of doing this looks like this:
```
Last,First,Time,Value
Boal,Paul,2015Q1,10
Boal,Paul,2015Q2,9
Boal,Paul,2015Q3,10
Boal,Paul,2015Q4,8
Westhus,Eric,2015Q1,9
Westhus,Eric,2015Q2,10
Westhus,Eric,2015Q3,10
Westhus,Eric,2015Q4,9
```

So, create a new module called `pivot` with one function in it called `pivot_columns()`.  This function should take a list of lists as well as a list of column numbers that should be pivoted as shown in the example below.  You can assume the file is a CSV. The return value should be a list of lists.




In [None]:
import jupyterimporter
from pivot import pivot_columns

out = pivot_columns(preexisting,list(range(2,5)))
print(out)
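One possible shape for the pivot logic, sketched on the quarterly sample from above. This is only a sketch under stated assumptions: the name `pivot_sketch` is hypothetical, the header is assumed to be the first row of the input list, and the new columns are named `Time` and `Value` as in the example output.

```python
def pivot_sketch(rows, columns):
    """Pivot the given column numbers of a header-first list of lists."""
    header, data = rows[0], rows[1:]
    # Columns that are not pivoted are repeated on every output row.
    keep = [i for i in range(len(header)) if i not in columns]
    out = [[header[i] for i in keep] + ["Time", "Value"]]
    for row in data:
        fixed = [row[i] for i in keep]
        for c in columns:
            # One output row per pivoted column: its name, then its value.
            out.append(fixed + [header[c], row[c]])
    return out


sample = [
    ["Last", "First", "2015Q1", "2015Q2", "2015Q3", "2015Q4"],
    ["Boal", "Paul", "10", "9", "10", "8"],
    ["Westhus", "Eric", "9", "10", "10", "9"],
]

# range(2, 6) covers columns 2 through 5, because range excludes its end value.
result = pivot_sketch(sample, list(range(2, 6)))
for line in result:
    print(",".join(line))
```

Each input row produces one output row per pivoted column, so two people with four quarters each give eight data rows plus the header.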

## Part 3 - "Joining" data together

The next task is to create another module called `joiner` that will match up data from one list with data in another list, based on some matching criteria and on which fields should be retained from each list.  For this example, the comparisons you do should not care about case sensitivity.  Your function signature should be:
```
joinlists(list1, list2, joinfields1, joinfields2, outputfields1, outputfields2)
```

* list1 = This is the main list that we want to match up data from list2 against.
* list2 = This is the list where we're going to get some extra information to attach to list1
* joinfields1 = Which fields from list1 should be used in the join.
* joinfields2 = Which fields from list2 should be used to try to match the corresponding value in list1.
* outputfields1 = The fields from list1 to include in the output
* outputfields2 = The fields from list2 to include in the output

Below is an example of what this should do.


list1:
```
Last,First,Time,Value
Boal,Paul,2015Q1,10
Boal,Paul,2015Q2,9
```

list2:
```
Time,Census
2015q1,932
2015q2,943
```

Sample code:
```
joinlists(list1, list2, [2], [0], [0,1,2,3], [1])
```

Should output:
```
Last,First,Time,Value,Census
Boal,Paul,2015Q1,10,932
Boal,Paul,2015Q2,9,943
```

Note: the lists are displayed as CSV-style text just for my convenience.

For those of you familiar with SQL joins, this is effectively a style of inner join.
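A minimal sketch of a case-insensitive inner join on the two sample lists, offered as one possible approach rather than the expected answer. The name `join_sketch` is hypothetical. The sketch uses zero-based column numbers throughout, so the output fields here are `[0, 1, 2, 3]` and `[1]`. It also assumes each list carries its header as the first row.

```python
def join_sketch(list1, list2, joinfields1, joinfields2, outputfields1, outputfields2):
    """Inner-join two lists of lists, comparing join fields without regard to case."""

    def key(row, fields):
        # Lower-case the join values so '2015Q1' matches '2015q1'.
        return tuple(str(row[i]).lower() for i in fields)

    # Index list2 by its join key so each list1 row is a single lookup.
    lookup = {}
    for row in list2:
        lookup.setdefault(key(row, joinfields2), []).append(row)

    out = []
    for row in list1:
        # Rows with no match in list2 are dropped (inner join).
        for match in lookup.get(key(row, joinfields1), []):
            out.append([row[i] for i in outputfields1] +
                       [match[i] for i in outputfields2])
    return out


list1 = [
    ["Last", "First", "Time", "Value"],
    ["Boal", "Paul", "2015Q1", "10"],
    ["Boal", "Paul", "2015Q2", "9"],
]
list2 = [
    ["Time", "Census"],
    ["2015q1", "932"],
    ["2015q2", "943"],
]

joined = join_sketch(list1, list2, [2], [0], [0, 1, 2, 3], [1])
for line in joined:
    print(",".join(line))
```

The header rows line up here only because both join columns happen to be named `Time`; if the column names differed, the header would need to be handled separately.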

## Part 4 - Final Code to Test With

Something roughly like the code below should run successfully for your modules.  You can add any code that you may need to clean up or further parse the data.

In [None]:
import jupyterimporter
from csvlist import CsvList
from pivot import pivot_columns
from joiner import joinlists

preexisting = CsvList('/midterm/preexisting.csv')
census = CsvList('/census.csv')

pre = pivot_columns(preexisting, list(range(2,38)))
cen = pivot_columns(census, list(range(5,12)))

# DO SOME PROCESSING TO CLEAN UP DATA AS NEEDED

out = joinlists(pre, cen, [1,4], [3,5], [1,2,3,4,5], [5])