#Overview
- In this tutorial, you will use Python to create and merge two subsets of the public health data in CountyHealthData_2014-2015 (accessible through [this link](https://drive.google.com/file/d/134lz04JTLVIbwfsBmuJOEWcRZFOPvUUo/view?usp=sharing))
- You will create one subset of relevant data for North Carolina and one for South Carolina, then merge the two into a final subset
- The finished product should be a data subset that shows the percentage of people with limited access to healthy foods and the adult obesity rate in North and South Carolina

#Getting Started
1. Create a Google Drive folder in "My Drive" to keep all the files involved in this process.
2. Download the County Health Data from [this link](https://drive.google.com/file/d/134lz04JTLVIbwfsBmuJOEWcRZFOPvUUo/view?usp=sharing) and drag the file into the folder you created.
3. Mount your Google Drive in Google Colab so you can access your data file.





In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


4. Import the Pandas and NumPy packages. Pandas will help with creating a data frame from our data.
> Be sure to include `as pd` after the Pandas import statement and `as np` after the NumPy import statement. These short aliases make it easier to call functions from each package later.

In [None]:
import pandas as pd

In [None]:
import numpy as np


5. Use the Pandas function `pd.read_csv()` to read the data file from your Google Drive so you can work with it
> This will follow the format `pd.read_csv('gdrive/My Drive/FOLDER NAME/CountyHealthData_2014-2015.csv')`

 > The specific folder name (here represented by the placeholder `FOLDER NAME`) depends on the name of the folder you placed the downloaded County Health Data in

 > You can assign the data frame any name you like by placing the name in front of an `=`. For simplicity's sake, this data frame will be named `df`.



In [None]:
df = pd.read_csv('gdrive/My Drive/ENGL 105 Unit 3/CountyHealthData_2014-2015.csv')

In [None]:
df

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
0,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2014,,0.122,...,,0.374,0.250,3791.0,0.185,216.0,69192,0.127,,0.287
1,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2015,,0.122,...,,0.314,0.176,4837.0,0.185,254.0,74088,0.133,,
2,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2014,6827.0,0.125,...,15.37,0.218,0.096,6588.0,0.119,135.0,71094,0.319,6.29,0.160
3,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2015,6856.0,0.125,...,17.08,0.227,0.123,6582.0,0.119,148.0,76362,0.334,5.60,
4,AK,West,Pacific,Bethel Census Area,2050,2050,Insuff Data,1/1/2014,13345.0,0.211,...,,0.394,0.124,5860.0,0.200,169.0,41722,0.668,12.77,0.477
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6104,WY,West,Mountain,Uinta County,56041,56041,Insuff Data,1/1/2015,7436.0,0.135,...,18.66,0.192,0.090,7600.0,0.123,47.0,60953,0.273,,
6105,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2014,6580.0,0.106,...,,0.225,0.086,8202.0,0.099,47.0,49533,0.328,,0.133
6106,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2015,7572.0,0.106,...,,0.226,0.101,7940.0,0.099,47.0,50740,0.309,,
6107,WY,West,Mountain,Weston County,56045,56045,Insuff Data,1/1/2014,5633.0,0.162,...,,0.201,0.084,6906.0,0.130,28.0,53665,0.232,,0.171


Here is your data set! If you have trouble reading the data file from your Google Drive, check that the folder name and file name in the path are typed exactly as they appear in your Drive, including spaces and capitalization.
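
If the path still fails, it can help to see which part of it is wrong. The sketch below is one possible way to do that: `describe_path` is a hypothetical helper written for this tutorial (it is not part of Pandas or Colab), and `FOLDER NAME` is the same placeholder used above.

```python
from pathlib import Path


def describe_path(path_str):
    """Report which parts of a path exist, to help spot a mistyped name."""
    path = Path(path_str)
    report = []
    # Walk from the top-level folder down to the file itself.
    for part in list(path.parents)[::-1] + [path]:
        report.append((str(part), part.exists()))
    return report


# FOLDER NAME is a placeholder; replace it with your own folder name.
csv_path = "gdrive/My Drive/FOLDER NAME/CountyHealthData_2014-2015.csv"
for location, found in describe_path(csv_path):
    print(("found    " if found else "MISSING  ") + location)
```

The first line marked `MISSING` is the part of the path to double-check.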

#Creating the First Subset

Now that you have successfully read the data file, we can begin by isolating the North Carolina data for "Adult obesity" and "Limited access to healthy foods".

1. Create a filtering command that isolates every instance of `df["State"] == "NC"`
> The `==` compares each value in the "State" column to "NC" and returns `True` for every row where they match

 > The inner statement will be `df["State"] == "NC"`, which is `True` for each row that contains data from North Carolina

  > The outer statement will be a general reference to the data frame `df[]`, which keeps only the rows marked `True`


In [None]:
df[df["State"] == "NC"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176


2. Assign this new subset a unique name (such as `nc_subset`)

In [None]:
nc_subset = df[df["State"] == "NC"].copy()

> Adding `.copy()` makes the subset an independent copy, so later changes to it do not affect the original `df`

3. Display the subset to make sure it is working properly

In [None]:
nc_subset

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176
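
To see what the inner and outer statements do separately, the sketch below runs the same filter on a tiny made-up data frame. The `toy` values are invented for illustration and are not taken from the County Health Data.

```python
import pandas as pd

# Invented example rows; not taken from the County Health Data file.
toy = pd.DataFrame({
    "State": ["NC", "SC", "NC", "VA"],
    "Adult obesity": [0.30, 0.35, 0.28, 0.27],
})

# Inner statement: compares every value in "State" to "NC" and
# produces one True/False value per row.
mask = toy["State"] == "NC"
print(mask.tolist())   # [True, False, True, False]

# Outer statement: keeps only the rows where the mask is True.
toy_nc = toy[mask].copy()

# Because of .copy(), editing the subset leaves the original untouched.
toy_nc["Adult obesity"] = 0.0
print(toy["Adult obesity"].tolist())   # [0.3, 0.35, 0.28, 0.27]
```

Note that the kept rows hold on to their original row labels (0 and 2 here), which is why the North Carolina subset above starts at row 3243.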


We can narrow this subset further so it only shows the columns we are interested in ("Adult obesity" and "Limited access to healthy foods").
1. Using our new subset `nc_subset`, select multiple columns by placing their names in a bracketed list
> We want to keep the "Adult obesity", "Limited access to healthy foods", and "State" columns so we can eventually compare them to the data from South Carolina

 > Each column name is written in quotes as a string, and the names are placed together in brackets to form a list
  
  > This list goes in the inner brackets. The outer brackets belong to the data frame being searched, which is the `nc_subset` we just created.

2. We will save this under the name `nc_subset` again so the data frame is updated to only include the desired columns

In [None]:
nc_subset = nc_subset[["State","Adult obesity", "Limited access to healthy foods"]]

In [None]:
nc_subset

Unnamed: 0,State,Adult obesity,Limited access to healthy foods
3243,NC,0.341,0.113
3244,NC,0.332,0.113
3245,NC,0.272,0.023
3246,NC,0.283,0.023
3247,NC,0.247,0.014
...,...,...,...
3438,NC,0.373,0.028
3439,NC,0.297,0.004
3440,NC,0.301,0.004
3441,NC,0.287,0.000
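
The difference between single and double brackets is easy to miss, so here is a small sketch on invented data (the `toy` frame and its county names are made up for illustration).

```python
import pandas as pd

# Invented example rows; not taken from the County Health Data file.
toy = pd.DataFrame({
    "State": ["NC", "NC"],
    "County": ["A County", "B County"],
    "Adult obesity": [0.30, 0.28],
})

# Single brackets with one name return a Series (one column of values).
one_column = toy["Adult obesity"]
print(type(one_column).__name__)   # Series

# Double brackets pass a list of names and return a DataFrame
# containing only those columns, in the order listed.
columns_to_keep = ["State", "Adult obesity"]
smaller = toy[columns_to_keep]
print(type(smaller).__name__)      # DataFrame
print(list(smaller.columns))       # ['State', 'Adult obesity']
```

Storing the list in a variable such as `columns_to_keep` is optional, but it makes it easy to reuse the same columns for the South Carolina subset.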


#Creating the Second Subset

We will now go back to the original `df` data set and repeat the same process to create a second subset. This will contain the relevant information for South Carolina, and we can name it `sc_subset`.

1. Create a filtering command that isolates every instance of `df["State"] == "SC"`
> The inner statement should be `df["State"] == "SC"`, which will return `True` for each row that contains data from South Carolina.

 > The outer statement will be a general reference to the data frame `df[]`.


In [None]:
df[df["State"] == "SC"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
4515,SC,South,South Atlantic,Abbeville County,45001,45001,Region 20,1/1/2014,7632.0,0.199,...,6.09,0.238,0.089,8787.0,0.178,32.0,35456,0.514,6.09,0.288
4516,SC,South,South Atlantic,Abbeville County,45001,45001,Region 20,1/1/2015,9183.0,0.199,...,6.16,0.231,0.078,8468.0,0.178,32.0,36187,0.601,5.90,
4517,SC,South,South Atlantic,Aiken County,45003,45003,Region 24,1/1/2014,8191.0,0.150,...,14.14,0.209,0.080,8511.0,0.137,28.0,45699,0.494,7.03,0.213
4518,SC,South,South Atlantic,Aiken County,45003,45003,Region 24,1/1/2015,8225.0,0.150,...,16.09,0.232,0.082,8802.0,0.137,32.0,43876,0.498,7.80,
4519,SC,South,South Atlantic,Allendale County,45005,45005,Insuff Data,1/1/2014,10984.0,0.249,...,,0.235,0.064,9602.0,0.204,60.0,25633,0.901,14.55,0.395
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4602,SC,South,South Atlantic,Union County,45087,45087,Region 20,1/1/2015,11113.0,0.231,...,12.84,0.223,0.067,10688.0,0.220,25.0,34042,0.580,8.50,
4603,SC,South,South Atlantic,Williamsburg County,45089,45089,Region 20,1/1/2014,11517.0,0.239,...,5.70,0.258,0.086,9873.0,0.221,15.0,28121,0.852,13.02,0.317
4604,SC,South,South Atlantic,Williamsburg County,45089,45089,Region 20,1/1/2015,11106.0,0.239,...,5.36,0.248,0.082,10040.0,0.221,18.0,29391,0.864,14.80,
4605,SC,South,South Atlantic,York County,45091,45091,Region 20,1/1/2014,6905.0,0.126,...,10.53,0.209,0.080,9153.0,0.162,28.0,51427,0.353,5.33,0.188


2. Assign this new subset a unique name (such as `sc_subset`).

In [None]:
sc_subset = df[df["State"] == "SC"].copy()

> Using `.copy()` makes the subset an independent copy, so later changes to it do not affect the original `df`

3. Display the subset to make sure it is working properly

In [None]:
sc_subset

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
4515,SC,South,South Atlantic,Abbeville County,45001,45001,Region 20,1/1/2014,7632.0,0.199,...,6.09,0.238,0.089,8787.0,0.178,32.0,35456,0.514,6.09,0.288
4516,SC,South,South Atlantic,Abbeville County,45001,45001,Region 20,1/1/2015,9183.0,0.199,...,6.16,0.231,0.078,8468.0,0.178,32.0,36187,0.601,5.90,
4517,SC,South,South Atlantic,Aiken County,45003,45003,Region 24,1/1/2014,8191.0,0.150,...,14.14,0.209,0.080,8511.0,0.137,28.0,45699,0.494,7.03,0.213
4518,SC,South,South Atlantic,Aiken County,45003,45003,Region 24,1/1/2015,8225.0,0.150,...,16.09,0.232,0.082,8802.0,0.137,32.0,43876,0.498,7.80,
4519,SC,South,South Atlantic,Allendale County,45005,45005,Insuff Data,1/1/2014,10984.0,0.249,...,,0.235,0.064,9602.0,0.204,60.0,25633,0.901,14.55,0.395
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4602,SC,South,South Atlantic,Union County,45087,45087,Region 20,1/1/2015,11113.0,0.231,...,12.84,0.223,0.067,10688.0,0.220,25.0,34042,0.580,8.50,
4603,SC,South,South Atlantic,Williamsburg County,45089,45089,Region 20,1/1/2014,11517.0,0.239,...,5.70,0.258,0.086,9873.0,0.221,15.0,28121,0.852,13.02,0.317
4604,SC,South,South Atlantic,Williamsburg County,45089,45089,Region 20,1/1/2015,11106.0,0.239,...,5.36,0.248,0.082,10040.0,0.221,18.0,29391,0.864,14.80,
4605,SC,South,South Atlantic,York County,45091,45091,Region 20,1/1/2014,6905.0,0.126,...,10.53,0.209,0.080,9153.0,0.162,28.0,51427,0.353,5.33,0.188


We will again keep only the "State", "Adult obesity", and "Limited access to healthy foods" columns so this subset can be compared with the North Carolina data. This is done in the same manner as above.

> We will be using `sc_subset` as the data frame searched instead of `nc_subset`. However, all of the columns kept remain the same.

 > We will save this new subset under the name `sc_subset` again so the data frame is updated to only include the desired columns

In [None]:
sc_subset = sc_subset[["State", "Adult obesity", "Limited access to healthy foods"]]

Display the subset to see if it is working as intended.

In [None]:
sc_subset

Unnamed: 0,State,Adult obesity,Limited access to healthy foods
4515,SC,0.348,0.145
4516,SC,0.346,0.145
4517,SC,0.319,0.060
4518,SC,0.312,0.060
4519,SC,0.367,0.010
...,...,...,...
4602,SC,0.359,0.037
4603,SC,0.426,0.051
4604,SC,0.430,0.051
4605,SC,0.285,0.060
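
Since the North Carolina and South Carolina steps are identical apart from the state code, they could also be wrapped in a small function. The following is only a sketch of that idea: `state_subset` and `COLUMNS` are hypothetical names chosen for this example, and the `toy` rows are invented.

```python
import pandas as pd

# The three columns kept in this tutorial.
COLUMNS = ["State", "Adult obesity", "Limited access to healthy foods"]


def state_subset(data, state_code, columns=COLUMNS):
    """Return a copy of the rows for one state, keeping only `columns`."""
    rows = data[data["State"] == state_code]
    return rows[columns].copy()


# Invented example rows; not taken from the County Health Data file.
toy = pd.DataFrame({
    "State": ["NC", "SC", "NC"],
    "County": ["A County", "B County", "C County"],
    "Adult obesity": [0.30, 0.35, 0.28],
    "Limited access to healthy foods": [0.10, 0.06, 0.02],
})

toy_nc = state_subset(toy, "NC")
toy_sc = state_subset(toy, "SC")
print(toy_nc.shape)   # (2, 3)
print(toy_sc.shape)   # (1, 3)
```

With the real data, the equivalent calls would be `state_subset(df, "NC")` and `state_subset(df, "SC")`.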


#Merging the Data Sets

1. Use the `pd.concat()` function to combine the two subsets of data into one data frame.

> Make sure the subset names are placed in brackets so they are passed to the function as a single list.

 > Place `nc_subset` before `sc_subset` so the data table is organized alphabetically, with the North Carolina data grouped together before the South Carolina data

In [None]:
carolinas_subset = pd.concat([nc_subset, sc_subset]).copy()

> `pd.concat()` already returns a new data frame, so `.copy()` is optional here. Including it keeps this step consistent with the earlier subsets

2. Display the new subset to make sure it is working as intended

In [None]:
carolinas_subset

Unnamed: 0,State,Adult obesity,Limited access to healthy foods
3243,NC,0.341,0.113
3244,NC,0.332,0.113
3245,NC,0.272,0.023
3246,NC,0.283,0.023
3247,NC,0.247,0.014
...,...,...,...
4602,SC,0.359,0.037
4603,SC,0.426,0.051
4604,SC,0.430,0.051
4605,SC,0.285,0.060
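
A quick way to confirm that nothing was lost in the merge is to compare row counts before and after. The sketch below does this on two invented mini-subsets (the `toy_nc` and `toy_sc` values and their row labels are made up).

```python
import pandas as pd

# Invented mini-subsets with row labels that mimic leftover labels
# from a larger data frame.
toy_nc = pd.DataFrame(
    {"State": ["NC", "NC"], "Adult obesity": [0.30, 0.28]}, index=[10, 11]
)
toy_sc = pd.DataFrame(
    {"State": ["SC"], "Adult obesity": [0.35]}, index=[20]
)

combined = pd.concat([toy_nc, toy_sc])

# The merged frame should have as many rows as the two parts together.
print(len(combined) == len(toy_nc) + len(toy_sc))   # True

# Count the rows per state to confirm both states are present.
print(combined["State"].value_counts().to_dict())   # {'NC': 2, 'SC': 1}
```

The same two checks can be run on `carolinas_subset`, `nc_subset`, and `sc_subset`.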


This subset can also be cleaned up by renumbering the rows so the index reflects the number of rows in this data set.
1. To clean the data, reset the row index values by adding `ignore_index=True` to the `pd.concat()` call you used previously
> The `ignore_index=True` argument should go outside of the brackets that contain `nc_subset` and `sc_subset`, but inside the parentheses.

 >`ignore_index=True` discards the existing index values and numbers the rows starting from zero.

  > Reassign the name `carolinas_subset` to the polished subset.

In [None]:
carolinas_subset = pd.concat([nc_subset, sc_subset], ignore_index=True).copy()

> Make sure to use `.copy()` to create a new subset

2. Display the subset to make sure it works as intended

In [None]:
carolinas_subset

Unnamed: 0,State,Adult obesity,Limited access to healthy foods
0,NC,0.341,0.113
1,NC,0.332,0.113
2,NC,0.272,0.023
3,NC,0.283,0.023
4,NC,0.247,0.014
...,...,...,...
287,SC,0.359,0.037
288,SC,0.426,0.051
289,SC,0.430,0.051
290,SC,0.285,0.060
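
The effect of `ignore_index=True` is easiest to see side by side. This sketch uses invented mini-subsets again; the `reset_index(drop=True)` line shows an alternative that is not used in this tutorial but gives the same row numbering.

```python
import pandas as pd

# Invented mini-subsets with leftover row labels.
toy_nc = pd.DataFrame(
    {"State": ["NC", "NC"], "Adult obesity": [0.30, 0.28]}, index=[10, 11]
)
toy_sc = pd.DataFrame(
    {"State": ["SC"], "Adult obesity": [0.35]}, index=[20]
)

# Without ignore_index, the original row labels are carried over.
kept = pd.concat([toy_nc, toy_sc])
print(list(kept.index))         # [10, 11, 20]

# With ignore_index=True, the rows are numbered from zero.
renumbered = pd.concat([toy_nc, toy_sc], ignore_index=True)
print(list(renumbered.index))   # [0, 1, 2]

# Alternative: renumber an already-merged frame after the fact.
also_renumbered = kept.reset_index(drop=True)
print(list(also_renumbered.index) == list(renumbered.index))   # True
```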


#Exporting Your New Subset

1. To export this new subset, we will use the `.to_csv()` method.
> We need to write the data frame being exported (`carolinas_subset`) before the method name.

 > We can put the name we wish to give the file in quotes within the parentheses.

  > We will also specify `index=False` so the exported file does not include the numbered row index that Pandas would otherwise add.

2. The following command will save your data as a .csv file.

In [None]:
carolinas_subset.to_csv("Carolinas_Subset 2014-2015.csv", index=False)

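Because the file name above does not include a folder path, the file is written to Colab's current working directory. You can view it by opening the "Files" panel in the left sidebar of Google Colab. To save it directly in the folder you created for this project, put the folder path in front of the file name (for example, `gdrive/My Drive/FOLDER NAME/Carolinas_Subset 2014-2015.csv`).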
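
To see what `index=False` changes, the sketch below writes a small invented data frame both ways and reads each file back. The temporary folder and file names are assumptions made for this example only.

```python
import os
import tempfile

import pandas as pd

# Invented example rows; not taken from the County Health Data file.
toy = pd.DataFrame({"State": ["NC", "SC"], "Adult obesity": [0.30, 0.35]})

with tempfile.TemporaryDirectory() as folder:
    with_index = os.path.join(folder, "with_index.csv")
    without_index = os.path.join(folder, "without_index.csv")

    toy.to_csv(with_index)                   # default: row labels are written
    toy.to_csv(without_index, index=False)   # row labels are left out

    # Read both files back and compare their column names.
    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)      # ['Unnamed: 0', 'State', 'Adult obesity']
print(cols_without)   # ['State', 'Adult obesity']
```

A file saved with its index gains an extra unnamed column when it is read back in. The `Unnamed: 0` column in the original County Health Data may have come about this way, though that is only a guess.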

Congratulations! You have successfully created a subset of data for the Carolinas to be analyzed!