#**Procedural Overview**
##*Purpose:*
- This notebook outlines how to build a smaller subset of the 2014-2015 County Public Health dataset: the average obesity rate for each state, calculated across all of that state's observations.
- The subset can be used to inform people about the distribution of obesity across the United States, and to support further research into causes and future solutions.

##*Scope:*
- We will create a subset containing the average obesity rate for each state, using simple grouping, averaging, and sorting.
- All coding will be done using Python3 in a Google CoLab Notebook.

##*Major Steps:*
1. Download the original County Public Health dataset from 2014-2015
2. Import necessary packages
3. Load the original dataset into a CoLab Notebook
4. Create the Average Obesity Rates per State subset
5. Export the subset

#**Step 1: Download the Original Dataset**
First, create a folder on your computer to store the original dataset and the subset that will be created.

The original dataset can be downloaded as a `.csv` [here](https://github.com/gabbyparsons/State_Obesity_Rate/blob/main/Data/CountyHealthData_2014-2015.csv). **Make sure you save this dataset to your Google Drive.**

Then, sign in to your Google account, open Google CoLab, and create a new notebook.

To give CoLab ongoing access to your data, mount your Google Drive in this notebook. Type the following into a cell:


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


**Note:** CoLab may ask you to allow access to your Drive and to sign in again.

#**Step 2: Import Necessary Packages**
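Python3 on its own has a limited set of built-in functions, so we need to import a few packages that provide additional ones. Both packages used here are already installed in CoLab, so they only need to be imported.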

The first of these is **pandas**; this package allows us to store and manipulate data as dataframes. Like a Microsoft Excel spreadsheet, a dataframe has rows and columns. We can import this package with the following code (Note: `as pd` is **not** necessary; it simply gives us a shorter name to type):


In [None]:
import pandas as pd

The other package necessary to import is **numpy**, which aids in efficient mathematical calculations. We can import this package with the code below (Note: similar to pandas, `as np` is not necessary):


In [None]:
import numpy as np

#**Step 3: Load the Original Dataset into a CoLab Notebook**
We want to give our dataframe a name, so we don't have to write out the code for the `.csv` every time we use it. To do this, we will use the `pd.read_csv()` function. This function reads a `.csv` file and converts it into a dataframe, which we then assign to a name. In this example, we will name it `rawdf`, but it can be named whatever you please.


In [None]:
rawdf=pd.read_csv('gdrive/My Drive/English 105 Unit 3/CountyHealthData_2014-2015.csv')

**Note:** The exact wording in the code may look different due to a different location and/or name of the `.csv` file (pay attention to your folder name as well).
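
If `pd.read_csv()` fails with a `FileNotFoundError`, the path is usually the cause. The sketch below shows one possible way to check a path before loading. The helper name `load_county_data` and the two-row sample file are made up for illustration so that the cell runs anywhere; they are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_county_data(csv_path):
    """Hypothetical helper: read a CSV into a dataframe, failing early with a clear message."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"No file at {csv_path}; check the folder and file name.")
    return pd.read_csv(csv_path)


# Stand-in for the real dataset: a two-row sample written to a temporary folder.
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "sample.csv"
sample_path.write_text("State,Adult obesity\nAK,0.30\nAL,0.35\n")

sample_df = load_county_data(sample_path)
print(sample_df.shape)  # (rows, columns) of the sample
```

In CoLab, you would pass your own Drive path instead of `sample_path`. Because the drive is mounted at `/content/gdrive`, the full path for the example above would begin with `/content/gdrive/My Drive/`.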

##*Exploring our Dataframe*
Now that we have the original dataframe defined in our notebook, we can explore it or look at small portions of it. For example, let's use the `.head()` function, which shows the first 5 rows of the dataset:


In [None]:
rawdf.head()

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
0,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2014,,0.122,...,,0.374,0.25,3791.0,0.185,216.0,69192,0.127,,0.287
1,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2015,,0.122,...,,0.314,0.176,4837.0,0.185,254.0,74088,0.133,,
2,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2014,6827.0,0.125,...,15.37,0.218,0.096,6588.0,0.119,135.0,71094,0.319,6.29,0.16
3,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2015,6856.0,0.125,...,17.08,0.227,0.123,6582.0,0.119,148.0,76362,0.334,5.6,
4,AK,West,Pacific,Bethel Census Area,2050,2050,Insuff Data,1/1/2014,13345.0,0.211,...,,0.394,0.124,5860.0,0.2,169.0,41722,0.668,12.77,0.477
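
Beyond `.head()`, a few other quick checks can help when exploring a dataframe. The following is a minimal sketch on a made-up miniature table (`toy_df` and its values are illustrative and not taken from the dataset); the same calls should work on `rawdf`.

```python
import pandas as pd

# Made-up miniature stand-in for rawdf (values are illustrative only).
toy_df = pd.DataFrame({
    "State": ["AK", "AK", "AL"],
    "County": ["County A", "County B", "County C"],
    "Adult obesity": [0.30, None, 0.35],
})

print(toy_df.shape)             # (number of rows, number of columns)
print(toy_df.columns.tolist())  # column names as a plain list
print(toy_df.isna().sum())      # number of missing values in each column
```

Checking for missing values is worthwhile here because the output above shows `NaN`-style gaps in several columns of the original dataset.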


We can also count how often each value of a categorical variable appears using the `.value_counts()` function. This function returns a table listing each distinct value of that variable and the number of times it appears in the column, ordered from most to least frequent.


In [None]:
rawdf.State.value_counts()

State,count
TX,469
GA,318
VA,266
KY,240
MO,229
IL,204
NC,200
KS,199
IA,198
TN,190
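
To see how `.value_counts()` behaves on a small scale, here is a short sketch using a made-up list of state codes (`toy_states` is illustrative, not data from the file). It also shows the optional `normalize=True` argument, which reports each value's share of the rows instead of a raw count.

```python
import pandas as pd

# Made-up column of state codes, one entry per (imaginary) observation.
toy_states = pd.Series(["TX", "GA", "TX", "VA", "TX", "GA"], name="State")

counts = toy_states.value_counts()                # raw counts, most frequent first
shares = toy_states.value_counts(normalize=True)  # fraction of rows for each state

print(counts)
print(shares)
```

Note that `rawdf.State` and `rawdf['State']` refer to the same column. The bracket form is required for column names that contain spaces, such as `'Adult obesity'`.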


#**Step 4: Create the Average Obesity Rate per State Subset**
The original dataset contains a lot of valuable information, but its size can make it difficult to navigate. The subset we will make contains the average obesity rate for each state. The first step chains three functions: `.groupby('State')` collects the rows for each state, `.mean()` averages the `'Adult obesity'` values within each group, and `.reset_index()` makes `'State'` a regular column and **not** an index. The result has one (average) obesity rate for each state.


In [None]:
state_avg=rawdf.groupby('State')['Adult obesity'].mean().reset_index()

**Note:** In this example the subset was named `state_avg`. This name is not required; it is simply a practical choice.
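
The grouping step can be tried on a tiny made-up table first. In the sketch below, `toy_df` and its values are illustrative only. It also shows that `.mean()` skips missing values by default, so a state's average is taken over the rows that actually have an obesity value.

```python
import pandas as pd

# Made-up observations: two for AK, three for AL (one of them missing).
toy_df = pd.DataFrame({
    "State": ["AK", "AK", "AL", "AL", "AL"],
    "Adult obesity": [0.30, 0.32, 0.34, None, 0.36],
})

# Same chain as above: group by state, average, then make 'State' a regular column.
toy_avg = toy_df.groupby("State")["Adult obesity"].mean().reset_index()
print(toy_avg)
```

Here the AL average is computed from the two available values only, since the missing entry is ignored.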

To see what the beginning of this subset looks like, we can use the `.head()` function:


In [None]:
state_avg.head()

Unnamed: 0,State,Adult obesity
0,AK,0.303391
1,AL,0.351836
2,AR,0.338787
3,AZ,0.2788
4,CA,0.240649


To make the subset easier to analyze, we can use the `.sort_values()` function to list our average obesity rates in descending order.


In [None]:
state_avg=state_avg.sort_values(by='Adult obesity', ascending=False)

We can again use the `.head()` function to see part of the updated subset, which now shows the five states with the highest average obesity rates:


In [None]:
state_avg.head()

Unnamed: 0,State,Adult obesity
25,MS,0.366828
18,LA,0.358187
1,AL,0.351836
40,SC,0.34788
49,WV,0.339591
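
Notice that the numbers in the first column (25, 18, 1, ...) are the old row labels, carried along by the sort. If a clean 0, 1, 2, ... ranking is preferred, one option is to renumber the rows after sorting. The sketch below uses a made-up table (`toy_avg` and its values are illustrative) and also shows `.nlargest()` as an alternative way to pull out the top rows.

```python
import pandas as pd

# Made-up state averages (illustrative values only).
toy_avg = pd.DataFrame({
    "State": ["AK", "AL", "AZ", "CA"],
    "Adult obesity": [0.30, 0.35, 0.28, 0.24],
})

# Sort in descending order, then renumber the rows 0, 1, 2, ...
ranked = toy_avg.sort_values(by="Adult obesity", ascending=False).reset_index(drop=True)

# Alternative: take the two highest rows directly.
top_two = toy_avg.nlargest(2, "Adult obesity")

print(ranked)
print(top_two)
```

Renumbering is optional here, because the export step can leave the row labels out of the file altogether.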


#**Step 5: Export the Subset**
Now that we have made our targeted dataset, we can export it as a `.csv` file. We will use the `.to_csv()` function, with our desired filename inside the parentheses. The file will be saved into our working directory (specifically the "content" folder), from which we can download it to our computer.

By default, `.to_csv()` also writes the dataframe's index (the first column of numbers, which pandas created when we used `.reset_index()` in Step 4) as an extra column in the file. To keep these numbers out of the `.csv`, we will add the `index=False` argument to our code:



In [None]:
state_avg.to_csv("state_avg.csv", index=False)
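
To see what `index=False` changes, the sketch below writes a made-up two-row table both ways and reads each file back. The table, the temporary folder, and the file names are illustrative assumptions, not part of the original workflow.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sorted subset; the index keeps old row labels, as after sort_values().
toy_avg = pd.DataFrame(
    {"State": ["AL", "AK"], "Adult obesity": [0.35, 0.30]},
    index=[1, 0],
)

out_dir = Path(tempfile.mkdtemp())
with_index = out_dir / "with_index.csv"
without_index = out_dir / "without_index.csv"

toy_avg.to_csv(with_index)                  # default: index written as an extra column
toy_avg.to_csv(without_index, index=False)  # index left out

# Read both files back and compare their column names.
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)     # includes an extra, unnamed first column
print(cols_without)  # only the two real columns
```

When the index is written without a name, pandas labels that column `Unnamed: 0` on reading the file back, which is a common sign that a `.csv` was saved without `index=False`. In CoLab, one way to download the exported file is `from google.colab import files` followed by `files.download("state_avg.csv")`; the file browser in the left sidebar should work as well.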

All done! Hope you enjoyed!