### Tutorial: Creating a Public Health Data Subset

##### Outline
1. Overview
2. Setting Up
3. Examining for Main Parameters
4. Filtering
5. Finishing and Saving

This notebook walks you through a step-by-step tutorial for creating a data subset and saving it as a .csv file.

- Example guides appear throughout this tutorial for extra help. Refer to them as needed.

**The data set** (a .csv file) used here is [CountyHealthData_2014-2015.csv](https://uncch.instructure.com/courses/11001/files/1951171/download?download_frd=1), which can be downloaded from the link.

Python commands and functions will be provided in this tutorial. If you are new to notebooks, run a code cell by pressing the play button or Shift+Enter.

### Setting Up

1) **Download** the dataset file linked above and ensure it is saved in the same folder as this notebook on your computer.
2) **Import Packages**

Packages are sets of additional functions and tools that aren't included in base Python. They will help us perform specialized functions in this tutorial.
   
   1) We will import the `pandas` and `numpy` packages to help us navigate our data.
   2) Use the `import <package> as <alias>` statement to import each package under a shorter name.
    
As in the example shown below, we will import these packages under the abbreviated names `pd` and `np` for convenience.

In [40]:
import pandas as pd
import numpy as np

Next, we will create our **dataframe object** by reading the file with the pandas package.

Use the function `pd.read_csv()` to read your .csv file, in this case **"CountyHealthData_2014-2015.csv"**, and create your dataframe as shown below.

We will assign the result of the function to a dataframe object with the short name `df`.

In [41]:
df = pd.read_csv("CountyHealthData_2014-2015.csv")
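
As a quick check that a file was read as expected, it can help to look at the first few rows. The sketch below is only an illustration: it reads from an in-memory string instead of the real file, and the name `demo_df` and the three-row excerpt (taken from the sample output later in this tutorial) are stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# A three-row stand-in for the real file, so the example runs anywhere.
# The rows are copied from the sample output shown later in this tutorial.
csv_text = """State,County,Health care costs
IN,Union County,9603.0
ND,Eddy County,9410.0
VA,Richmond city,8565.0
"""

# pd.read_csv accepts a file path or any file-like object.
demo_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (5 by default) for a quick visual check.
print(demo_df.head())
```

With the real data, the same check would be `df.head()`.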

### Examining for Main Parameters

After setting up our dataframe, we can quickly inspect some of its attributes.

- The base expression takes the format 

`df.<attribute>` with `df` representing our dataframe object.

- For example, we can examine 

 *the size, shape, number of rows or columns, specific regions of our data, and even generate random samples.*

- We plug these attribute names into `df.<attribute>` to look them up.

Let's try a couple of them. We will find the **shape, size, and all the columns** in this tutorial. 

#### Example Guided Practice
- To find the shape of our dataframe, given as (rows, columns), attach `shape` to the base expression:

`df.shape`

The process can be repeated with the other attributes listed above.

In [36]:
df.shape

(6109, 64)

In [37]:
df.size

390976
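
The two results are related: `size` is the number of rows times the number of columns, so here 6109 × 64 = 390976. A minimal sketch of that relationship, using a small made-to-order dataframe (`demo_df` is a hypothetical name, with rows borrowed from the sample output later in this tutorial):

```python
import pandas as pd

# Small stand-in dataframe with 3 rows and 2 columns.
demo_df = pd.DataFrame({
    "State": ["IN", "ND", "VA"],
    "County": ["Union County", "Eddy County", "Richmond city"],
})

# shape is a (rows, columns) tuple; size is the total number of cells.
n_rows, n_cols = demo_df.shape
print(n_rows, n_cols, demo_df.size)  # 3 2 6
```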

In [38]:
df.columns

Index(['State', 'Region', 'Division', 'County', 'FIPS', 'GEOID', 'SMS Region',
       'Year', 'Premature death', 'Poor or fair health',
       'Poor physical health days', 'Poor mental health days',
       'Low birthweight', 'Adult smoking', 'Adult obesity',
       'Food environment index', 'Physical inactivity',
       'Access to exercise opportunities', 'Excessive drinking',
       'Alcohol-impaired driving deaths', 'Sexually transmitted infections',
       'Teen births', 'Uninsured', 'Primary care physicians', 'Dentists',
       'Mental health providers', 'Preventable hospital stays',
       'Diabetic screening', 'Mammography screening', 'High school graduation',
       'Some college', 'Unemployment', 'Children in poverty',
       'Income inequality', 'Children in single-parent households',
       'Social associations', 'Violent crime', 'Injury deaths',
       'Air pollution - particulate matter', 'Drinking water violations',
       'Severe housing problems', 'Driving alone to work'

### Filtering 

Exploring these parameters gives us a *quick overview of key points* in our data. 

From these commands, we can now begin to **filter our data** to create a focused subset to analyze.

- We will select an area of analysis with the `.loc` indexer.

- `.loc` selects rows and columns by their **labels** (such as column names), whereas `.iloc` selects them by integer position.
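
To make the difference between the two indexers concrete, here is a minimal sketch on a small stand-in dataframe. The name `demo_df` is hypothetical; the rows and index labels are borrowed from the sample output later in this tutorial.

```python
import pandas as pd

# Stand-in dataframe whose row labels are not 0, 1, 2.
demo_df = pd.DataFrame(
    {
        "State": ["IN", "ND", "VA"],
        "County": ["Union County", "Eddy County", "Richmond city"],
        "Median household income": [44796, 42908, 37933],
    },
    index=[1722, 3467, 5647],
)

# .loc uses labels: the row labeled 3467 and the column named "County".
by_label = demo_df.loc[3467, "County"]

# .iloc uses positions: the second row and the second column.
by_position = demo_df.iloc[1, 1]

print(by_label, "|", by_position)  # Eddy County | Eddy County

# Selecting whole columns works the same way with either indexer.
cols_by_label = demo_df.loc[:, ["State", "County"]]
cols_by_position = demo_df.iloc[:, [0, 1]]
```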

In this tutorial, we will create a data subset that focuses on the columns:

    "State", "County", "Health care costs", "Median household income", and "Poor or fair health"

Column names are case-sensitive, so they must be typed exactly as they appear in the data. By selecting only these columns, we can focus on these factors and compare them to each other.

#### Example Guided Practice
- Attach the `.loc` indexer to our dataframe object `df`

`df.loc`

- Put the column names listed above inside the brackets of `.loc`, following the pattern below. The `:` before the comma selects all rows.

`df.loc[:, ["column1", "column2"]]`
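
Because `.loc` raises a `KeyError` when a requested column label does not exist, it can be useful to check the names first. The following is a small sketch of one way to do that, not a required step; `demo_df`, `wanted`, and `missing` are hypothetical names.

```python
import pandas as pd

# One-row stand-in dataframe with two of the tutorial's columns.
demo_df = pd.DataFrame({"State": ["IN"], "County": ["Union County"]})

# "county" is deliberately lowercase to show that labels are case-sensitive.
wanted = ["State", "county"]

# Collect every requested name that is not an actual column label.
missing = [name for name in wanted if name not in demo_df.columns]
print(missing)  # ['county']
```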

In this tutorial, we will create a small sample of *10* rows. 
- Attach the sample method to the *end* of the expression you created as

`.sample(n=10)`

The `n` stands for the sample size and can be customized. Because the rows are drawn at random, your output will differ from run to run.

In [49]:
df.loc[:,["State","County","Health care costs","Median household income","Poor or fair health"]].sample(n=10)

| | State | County | Health care costs | Median household income | Poor or fair health |
|---|---|---|---|---|---|
| 1722 | IN | Union County | 9603.0 | 44796 | 0.154 |
| 3467 | ND | Eddy County | 9410.0 | 42908 | 0.128 |
| 5647 | VA | Richmond city | 8565.0 | 37933 | 0.162 |
| 1917 | KS | Sheridan County | 10250.0 | 50601 | |
| 3518 | ND | Sioux County | 7815.0 | 34491 | 0.268 |
| 5220 | TX | Milam County | 11865.0 | 40120 | 0.327 |
| 2706 | MN | Pope County | 8479.0 | 55404 | 0.089 |
| 5093 | TX | Hamilton County | 10419.0 | 37386 | |
| 5409 | UT | Morgan County | 7711.0 | 75348 | 0.052 |
| 2148 | KY | Rockcastle County | 9954.0 | 33131 | 0.276 |
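
If you want the same rows every time you rerun the cell, `sample` accepts a `random_state` seed. This is an optional extra and not part of the original steps; the sketch below uses a small stand-in dataframe (`demo_df` is a hypothetical name) and an arbitrary seed of 0.

```python
import pandas as pd

# Five-row stand-in dataframe, with rows borrowed from the output above.
demo_df = pd.DataFrame({
    "State": ["IN", "ND", "VA", "KS", "TX"],
    "County": ["Union County", "Eddy County", "Richmond city",
               "Sheridan County", "Milam County"],
})

# With the same seed, two calls draw exactly the same rows.
first_draw = demo_df.sample(n=2, random_state=0)
second_draw = demo_df.sample(n=2, random_state=0)

print(first_draw.equals(second_draw))  # True
```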


### Finishing and Saving
To save the subset we created, we will first assign it to a new object for convenience. 
- Name the subset `Sample_subset` by assigning the result of the previous expression to the `Sample_subset` object.

In [50]:
Sample_subset = df.loc[:,["State","County","Health care costs","Median household income","Poor or fair health"]].sample(n=10).copy()

**Save** the finished data subset as a .csv file for future use with the method

`.to_csv("file.csv", index=False)`

- Apply the method to the `Sample_subset` object created in the last step.

- Replace the `file` portion with the name you want for the .csv file, which in this case is `Sample_subset`. Setting `index=False` leaves the row index out of the saved file.

In [52]:
Sample_subset.to_csv("Sample_subset.csv", index=False)
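
To confirm that a file was written correctly, you can read it back in. The sketch below shows the round trip on a small stand-in dataframe and writes to a temporary folder so that it does not touch your files; `demo_subset`, `reloaded`, and the file name `demo_subset.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Two-row stand-in subset that keeps its original row labels.
demo_subset = pd.DataFrame(
    {"State": ["IN", "ND"], "County": ["Union County", "Eddy County"]},
    index=[1722, 3467],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_subset.csv")
    # index=False leaves the row labels (1722, 3467) out of the file.
    demo_subset.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Only the data columns come back, with a fresh 0-based index.
print(list(reloaded.columns))  # ['State', 'County']
print(list(reloaded.index))    # [0, 1]
```

With the real subset, the equivalent check would be `pd.read_csv("Sample_subset.csv")`.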

### Congratulations! 
*You created a subset!*
A .csv file named `Sample_subset.csv` should appear in the same folder as this notebook. The subset is a separate .csv file and can be used for future analysis.