# Compiling Massachusetts Insurance Data #

This notebook shows how to compile a subset of a larger data set. Specifically, it extracts the Massachusetts insurance data from a county-level health data set for the United States.

### Overview ###

The main steps include:

1. Setting up
2. Filtering
3. Indexing
4. Defining
5. Exporting

## Setting Up ##

First, pandas and NumPy need to be imported with their conventional abbreviations, `pd` and `np`. These packages provide tools and functions beyond those built into Python. <b>Pandas</b> is useful for compiling data because it stores data in rows and columns, which makes the data easier to read and select from. <b>NumPy</b> provides additional math functions. Both can be imported with the following code.

In [29]:
import pandas as pd
import numpy as np

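The data will be stored in a DataFrame, which needs a name. We will call it <b>df</b>.

Now, the data set needs to be loaded. The County Health Dataset being used can be downloaded from Canvas. To read the file into the notebook, we can use `df = pd.read_csv()` with the file name in quotation marks. The file must be in the notebook's working directory; otherwise, the full path to the file must be given.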

In [30]:
df=pd.read_csv("CountyHealthData_2014-2015 (1).csv")
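After loading, it can help to confirm that the file was read as expected. The sketch below is only an illustration: `sample_csv` and `sample_df` are hypothetical names, and the text stands in for the real file with a few columns and values copied from the Massachusetts rows in this notebook, so that it runs without the Canvas download. With the real data, the same checks can be applied to `df`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; only a few columns are included.
sample_csv = io.StringIO(
    "State,County,Year,Uninsured adults\n"
    "MA,Barnstable County,1/1/2014,0.07\n"
    "MA,Barnstable County,1/1/2015,0.057\n"
    "MA,Berkshire County,1/1/2014,0.057\n"
)

# pd.read_csv accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)             # (number of rows, number of columns)
print(sample_df.columns.tolist())  # column names read from the header line
print(sample_df.head())            # first few rows
```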

## Filtering ##

Filtering can be used to determine which row numbers hold the desired data. By filtering the State column for "MA", the row numbers that correspond to Massachusetts data can be determined.

We can use bracket notation `df["column"]` to do this. This takes two steps:

1. `df["State"]=="MA"` compares every value in the State column with `"MA"` and returns `True` or `False` for each row.

2. `df[...]` selects the rows where this comparison is `True`.

In [31]:
df[df["State"] == "MA"]

| | State | Region | Division | County | FIPS | GEOID | SMS Region | Year | Premature death | Poor or fair health | ... | Drug poisoning deaths | Uninsured adults | Uninsured children | Health care costs | Could not see doctor due to cost | Other primary care providers | Median household income | Children eligible for free lunch | Homicide rate | Inadequate social support |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2313 | MA | Northeast | New England | Barnstable County | 25001 | 25001 | Region 12 | 1/1/2014 | 5861.0 | 0.083 | ... | 14.61 | 0.07 | 0.023 | 9120.0 | 0.068 | 66.0 | 58616 | 0.198 | 1.04 | 0.154 |
| 2314 | MA | Northeast | New England | Barnstable County | 25001 | 25001 | Region 12 | 1/1/2015 | 5632.0 | 0.083 | ... | 14.22 | 0.057 | 0.018 | 8870.0 | 0.068 | 71.0 | 60685 | 0.224 | 1.2 | |
| 2315 | MA | Northeast | New England | Berkshire County | 25003 | 25003 | Region 12 | 1/1/2014 | 5915.0 | 0.121 | ... | 10.83 | 0.057 | 0.02 | 9022.0 | 0.094 | 78.0 | 46660 | 0.326 | 1.62 | 0.196 |
| 2316 | MA | Northeast | New England | Berkshire County | 25003 | 25003 | Region 12 | 1/1/2015 | 5773.0 | 0.121 | ... | 11.65 | 0.053 | 0.015 | 8980.0 | 0.094 | 83.0 | 51431 | 0.318 | 1.9 | |
| 2317 | MA | Northeast | New England | Bristol County | 25005 | 25005 | Region 12 | 1/1/2014 | 6180.0 | 0.152 | ... | 18.05 | 0.064 | 0.019 | 10094.0 | 0.082 | 54.0 | 53670 | 0.323 | 2.8 | 0.217 |
| 2318 | MA | Northeast | New England | Bristol County | 25005 | 25005 | Region 12 | 1/1/2015 | 6030.0 | 0.152 | ... | 18.53 | 0.064 | 0.014 | 9523.0 | 0.082 | 60.0 | 53626 | 0.335 | 2.6 | |
| 2319 | MA | Northeast | New England | Dukes County | 25007 | 25007 | Region 12 | 1/1/2014 | 4328.0 | | ... | 10.7 | 0.085 | 0.035 | 9169.0 | 0.057 | 47.0 | 56784 | 0.124 | | 0.197 |
| 2320 | MA | Northeast | New England | Dukes County | 25007 | 25007 | Region 12 | 1/1/2015 | 4152.0 | | ... | 8.73 | 0.07 | 0.022 | 8822.0 | 0.057 | 52.0 | 58276 | 0.168 | | |
| 2321 | MA | Northeast | New England | Essex County | 25009 | 25009 | Region 12 | 1/1/2014 | 5062.0 | 0.13 | ... | 13.4 | 0.067 | 0.021 | 10230.0 | 0.076 | 63.0 | 67408 | 0.316 | 2.12 | 0.184 |
| 2322 | MA | Northeast | New England | Essex County | 25009 | 25009 | Region 12 | 1/1/2015 | 5036.0 | 0.13 | ... | 13.16 | 0.061 | 0.015 | 9564.0 | 0.076 | 69.0 | 67235 | 0.336 | 2.2 | |
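The two filtering steps can be seen separately in a small sketch. The DataFrame `toy` is a hypothetical miniature data set, not the County Health file; its index labels only mimic row numbers from a larger file. The names `mask`, `ma_rows`, `first_label`, and `last_label` are likewise chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature data set; the index labels mimic row numbers in a larger file.
toy = pd.DataFrame(
    {
        "State": ["ME", "MA", "MA", "MI"],
        "County": ["County A", "County B", "County C", "County D"],
    },
    index=[2312, 2313, 2314, 2315],
)

# Step 1: the comparison produces one True/False value per row.
mask = toy["State"] == "MA"

# Step 2: indexing with the mask keeps only the rows marked True.
ma_rows = toy[mask]

# The index of the result holds the row labels of the matching rows,
# so the first and last labels can be read off instead of found by eye.
first_label = ma_rows.index.min()
last_label = ma_rows.index.max()

print(mask.tolist())
print(first_label, last_label)
```

Reading the first and last labels from the filtered index is one way to find the row range without scrolling through the output, provided the matching rows are contiguous.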


## Indexing ##

Using the row numbers that hold the Massachusetts data, the desired columns can be selected for those rows.

This can be done with the `.loc` indexer.

Inside the outer brackets, the row range comes first, before the comma. Here it runs from 2313 to 2340, and a `.loc` slice includes both endpoints. After the comma, the names of the desired columns go in a list inside the inner brackets.

In [32]:
df.loc[2313:2340,["State","County", "Uninsured adults", "Uninsured children","Health care costs"]]

| | State | County | Uninsured adults | Uninsured children | Health care costs |
|---|---|---|---|---|---|
| 2313 | MA | Barnstable County | 0.07 | 0.023 | 9120.0 |
| 2314 | MA | Barnstable County | 0.057 | 0.018 | 8870.0 |
| 2315 | MA | Berkshire County | 0.057 | 0.02 | 9022.0 |
| 2316 | MA | Berkshire County | 0.053 | 0.015 | 8980.0 |
| 2317 | MA | Bristol County | 0.064 | 0.019 | 10094.0 |
| 2318 | MA | Bristol County | 0.064 | 0.014 | 9523.0 |
| 2319 | MA | Dukes County | 0.085 | 0.035 | 9169.0 |
| 2320 | MA | Dukes County | 0.07 | 0.022 | 8822.0 |
| 2321 | MA | Essex County | 0.067 | 0.021 | 10230.0 |
| 2322 | MA | Essex County | 0.061 | 0.015 | 9564.0 |
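As an alternative to typing the row numbers by hand, `.loc` also accepts a True/False mask in the row position. The sketch below is not the method used in this notebook. It uses a hypothetical miniature DataFrame `toy` with made-up values, and compares a label slice with a mask-based selection to show that both return the same rows when the matching rows are contiguous.

```python
import pandas as pd

# Hypothetical miniature data set with made-up values.
toy = pd.DataFrame(
    {
        "State": ["ME", "MA", "MA", "MI"],
        "County": ["County A", "County B", "County C", "County D"],
        "Uninsured adults": [0.10, 0.07, 0.06, 0.20],
        "Homicide rate": [1.0, 1.5, 2.0, 2.5],
    },
    index=[2312, 2313, 2314, 2315],
)

columns = ["State", "County", "Uninsured adults"]

# Label-based slice: with .loc, BOTH endpoints (2313 and 2314) are included.
by_labels = toy.loc[2313:2314, columns]

# Boolean mask: selects the same rows without hard-coding row numbers.
by_mask = toy.loc[toy["State"] == "MA", columns]

print(by_labels.equals(by_mask))
```

The mask version keeps working if rows are added to or removed from the file, whereas hard-coded row numbers would need to be checked again.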


## Defining ##

Next, the subset needs to be given a name. This will allow us to export the data as a csv file in the next step.

The subset will be named `MA_insurance_subset`. The name is assigned to the selection code with an equals sign.

In [33]:
MA_insurance_subset = df.loc[2313:2340,["State","County", "Uninsured adults", "Uninsured children","Health care costs"]]

## Exporting ##

Finally, the subset needs to be exported. This can be done using the `.to_csv()` method, with the name of the new file in quotation marks.

Because the subset was named `MA_insurance_subset`, the following code can be used.

In [34]:
MA_insurance_subset.to_csv("MA_insurance_subset.csv")
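By default, `.to_csv()` also writes the row labels as an unnamed first column. The sketch below illustrates this with a hypothetical two-row DataFrame called `subset`. It writes to text in memory instead of a file, which `.to_csv()` does when no file name is given. Passing `index=False` is an optional choice that leaves the row labels out.

```python
import io

import pandas as pd

# Hypothetical two-row subset; the values are copied from the output shown earlier.
subset = pd.DataFrame(
    {
        "State": ["MA", "MA"],
        "County": ["Barnstable County", "Berkshire County"],
        "Uninsured adults": [0.07, 0.057],
    },
    index=[2313, 2315],
)

# With no file name, to_csv returns the CSV text instead of writing a file.
with_index = subset.to_csv()                # row labels become an unnamed first column
without_index = subset.to_csv(index=False)  # row labels are left out

print(with_index.splitlines()[0])     # header line, starting with an empty field
print(without_index.splitlines()[0])  # header line with the column names only

# Reading the text back, using the first column as the row labels again.
round_trip = pd.read_csv(io.StringIO(with_index), index_col=0)
print(round_trip.index.tolist())
```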

#### Now the subset is saved as a csv file in the notebook's working directory and the process is <b>complete</b>! ####