## Conceptual Overview

This notebook is a guide to creating a smaller subset of the County Public Health Data for 2014–2015.  
The subset covers counties in North Carolina and how their health care costs vary across the two years.  
By following the instructions below, you will create this subset and export it as a `.csv` file.


- **Purpose:** To extract and analyze healthcare cost data for counties in North Carolina from 2014–2015.
- **Scope:** Focuses specifically on public health data and healthcare cost variations across counties in North Carolina.
- **Major Steps:**
  1. Load and inspect the full dataset
  2. Filter the dataset to focus only on North Carolina counties
  3. Select relevant columns (`County`, `Year`, `Health care costs`)
  4. Export the cleaned subset as a `.csv` file
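
As a rough preview, the filter and select steps can be sketched on a tiny in-memory table. The name `toy` and its handful of rows are stand-ins for illustration only (the values are copied from rows shown later in this notebook); the real workflow reads the full CSV file instead.

```python
import pandas as pd

# Hypothetical miniature stand-in for the full dataset
toy = pd.DataFrame({
    "State": ["NC", "NC", "WY"],
    "County": ["Alamance County", "Alamance County", "Weston County"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014"],
    "Health care costs": [8640.0, 9050.0, 6906.0],
    "Uninsured adults": [0.259, 0.249, 0.201],
})

# Step 2: keep only the North Carolina rows
toy_nc = toy[toy["State"] == "NC"]

# Step 3: keep only the columns of interest, as an independent copy
toy_subset = toy_nc[["County", "Year", "Health care costs"]].copy()

print(toy_subset)
```

The same two operations are applied to the full DataFrame in Steps 2 and 3.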


In [1]:
import pandas as pd



#### **Pandas** is a Python library used for working with structured data like tables or spreadsheets.
* This line imports the **Pandas** library and gives it the alias `pd`.

* Using the `pd` alias keeps calls to Pandas functions short and readable (e.g., `pd.read_csv()`).




## **Step 1: Download, Upload, and Import the Dataset**

Download the original dataset [`CountyHealthData_2014-2015.csv`](https://your-download-link-here.com) to your computer.

Upload the file to your Google Drive. The code below expects it in a folder named `Data` at the top level of My Drive.

Now you can mount your drive.
* Mounting Google Drive in a Colab notebook connects your Google Drive storage to the Colab environment.

This allows you to:
- Access files stored in your Google Drive directly from your notebook
- Load datasets (like `.csv` files) without manually uploading each time
- Save outputs (such as new datasets or results) back into your Drive




In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
df = pd.read_csv('/content/gdrive/My Drive/Data/CountyHealthData_2014-2015.csv')

## Loading the Dataset

The line above reads the `CountyHealthData_2014-2015.csv` file from your Google Drive and loads it into a Pandas DataFrame named `df`.

The DataFrame `df` will store the full dataset, allowing you to view, filter, and manipulate the data in your notebook.
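
Before filtering, it can help to inspect what was loaded. The sketch below is only an illustration: it reads a hypothetical two-row CSV string (`csv_text`) through `io.StringIO` instead of the real file, so it runs without Google Drive. The same `shape`, `columns`, and `head()` calls work on `df`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV text standing in for the real file
csv_text = (
    "State,County,Year,Health care costs\n"
    "AK,Anchorage Borough,1/1/2014,6588.0\n"
    "AK,Anchorage Borough,1/1/2015,6582.0\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column names as a plain list
print(sample.head())            # first rows of the table (up to five)
```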


In [4]:
df

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
0,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2014,,0.122,...,,0.374,0.250,3791.0,0.185,216.0,69192,0.127,,0.287
1,AK,West,Pacific,Aleutians West Census Area,2016,2016,Insuff Data,1/1/2015,,0.122,...,,0.314,0.176,4837.0,0.185,254.0,74088,0.133,,
2,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2014,6827.0,0.125,...,15.37,0.218,0.096,6588.0,0.119,135.0,71094,0.319,6.29,0.160
3,AK,West,Pacific,Anchorage Borough,2020,2020,Region 22,1/1/2015,6856.0,0.125,...,17.08,0.227,0.123,6582.0,0.119,148.0,76362,0.334,5.60,
4,AK,West,Pacific,Bethel Census Area,2050,2050,Insuff Data,1/1/2014,13345.0,0.211,...,,0.394,0.124,5860.0,0.200,169.0,41722,0.668,12.77,0.477
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6104,WY,West,Mountain,Uinta County,56041,56041,Insuff Data,1/1/2015,7436.0,0.135,...,18.66,0.192,0.090,7600.0,0.123,47.0,60953,0.273,,
6105,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2014,6580.0,0.106,...,,0.225,0.086,8202.0,0.099,47.0,49533,0.328,,0.133
6106,WY,West,Mountain,Washakie County,56043,56043,Insuff Data,1/1/2015,7572.0,0.106,...,,0.226,0.101,7940.0,0.099,47.0,50740,0.309,,
6107,WY,West,Mountain,Weston County,56045,56045,Insuff Data,1/1/2014,5633.0,0.162,...,,0.201,0.084,6906.0,0.130,28.0,53665,0.232,,0.171


`df` is now a Pandas DataFrame that stores the data loaded from the `CountyHealthData_2014-2015.csv` file.

* From now on, you can access the raw data simply by referring to `df`.


## **Step 2: Filter the Dataset for North Carolina Counties**

Filter the dataset to include only rows where the state is North Carolina (`"NC"`).


###### Note:
- The code below displays the filtered data but does not save it to a new variable yet.
- In the next step, we will assign this data to a new DataFrame.
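
The filter in this step relies on a boolean mask: comparing a column to a value produces one `True`/`False` entry per row, and indexing the DataFrame with that mask keeps only the `True` rows. A minimal sketch on a hypothetical four-row table (`states` is an illustrative name, not part of the dataset):

```python
import pandas as pd

# Hypothetical table with a mix of states
states = pd.DataFrame({
    "State": ["AK", "NC", "NC", "WY"],
    "County": ["Anchorage Borough", "Alamance County",
               "Yadkin County", "Weston County"],
})

# The comparison yields one boolean per row
mask = states["State"] == "NC"
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True
print(states[mask])
```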

---



In [11]:
df[df["State"] == "NC"]

Unnamed: 0,State,Region,Division,County,FIPS,GEOID,SMS Region,Year,Premature death,Poor or fair health,...,Drug poisoning deaths,Uninsured adults,Uninsured children,Health care costs,Could not see doctor due to cost,Other primary care providers,Median household income,Children eligible for free lunch,Homicide rate,Inadequate social support
3243,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2014,7123.0,0.192,...,10.48,0.259,0.073,8640.0,0.167,46.0,41394,0.444,4.94,0.202
3244,NC,South,South Atlantic,Alamance County,37001,37001,Region 20,1/1/2015,7291.0,0.192,...,12.38,0.249,0.088,9050.0,0.167,56.0,43001,0.455,4.60,
3245,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2014,7974.0,0.178,...,22.74,0.240,0.077,9316.0,0.205,30.0,39655,0.417,6.27,0.273
3246,NC,South,South Atlantic,Alexander County,37003,37003,Region 20,1/1/2015,8079.0,0.178,...,24.04,0.239,0.076,9242.0,0.205,32.0,46064,0.449,7.20,
3247,NC,South,South Atlantic,Alleghany County,37005,37005,Insuff Data,1/1/2014,8817.0,0.234,...,18.18,0.320,0.131,9585.0,0.210,55.0,34046,0.523,,0.215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3438,NC,South,South Atlantic,Wilson County,37195,37195,Region 20,1/1/2015,8028.0,0.159,...,7.31,0.262,0.079,9450.0,0.107,77.0,40772,0.556,9.60,
3439,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2014,7893.0,0.207,...,18.45,0.252,0.097,10084.0,0.158,32.0,40012,0.422,3.76,0.241
3440,NC,South,South Atlantic,Yadkin County,37197,37197,Region 20,1/1/2015,7258.0,0.207,...,20.21,0.242,0.094,10998.0,0.158,32.0,40998,0.455,,
3441,NC,South,South Atlantic,Yancey County,37199,37199,Region 15,1/1/2014,6872.0,0.193,...,20.79,0.268,0.110,7707.0,0.158,79.0,36019,0.477,,0.176


#### Now you'll create a new DataFrame called `nc_subset` that contains only the counties in North Carolina.

Use the following code:

In [13]:
nc_subset = df[df["State"] == "NC"].copy()


* This code filters the original DataFrame `df` to keep only rows where the `State` column is `"NC"`,
then uses `.copy()` to create a separate copy of the filtered data.

* Using `.copy()` is important because it makes `nc_subset` an independent DataFrame, so changes made to `nc_subset` do not affect the original `df`.
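
The effect of `.copy()` can be checked on a small example. This is a sketch with hypothetical names (`original`, `subset`): after overwriting a column in the copy, the source DataFrame still holds its original values.

```python
import pandas as pd

# Hypothetical two-row source table
original = pd.DataFrame({
    "State": ["NC", "WY"],
    "Health care costs": [8640.0, 6906.0],
})

# Filter and take an explicit, independent copy
subset = original[original["State"] == "NC"].copy()

# Modify the copy only
subset["Health care costs"] = 0.0

print(subset["Health care costs"].tolist())    # values in the copy
print(original["Health care costs"].tolist())  # source values are unchanged
```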

## **Step 3: Select Relevant Columns**

After creating the subset for North Carolina counties, select only the columns you need from the dataset.

* The code below keeps only the `County`, `Year`, and `Health care costs` columns, leaving out the columns you do not need. As in Step 2, the result is displayed but not saved until it is assigned to a variable.
* It makes the dataset cleaner and easier to work with for analysis or exporting.
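
When selecting columns, the double square brackets matter: passing a list of column names returns a DataFrame, while a single name in single brackets returns a Series. A small sketch on a hypothetical table (`toy` is an illustrative name):

```python
import pandas as pd

# Hypothetical table with one extra column
toy = pd.DataFrame({
    "County": ["Alamance County", "Yadkin County"],
    "Year": ["1/1/2014", "1/1/2014"],
    "Health care costs": [8640.0, 10084.0],
    "Uninsured adults": [0.259, 0.252],
})

# A list of names returns a DataFrame with just those columns
as_frame = toy[["County", "Year", "Health care costs"]]

# A single name returns a one-dimensional Series
as_series = toy["County"]

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```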

In [16]:
nc_subset[['County', 'Year', 'Health care costs']]

Unnamed: 0,County,Year,Health care costs
3243,Alamance County,1/1/2014,8640.0
3244,Alamance County,1/1/2015,9050.0
3245,Alexander County,1/1/2014,9316.0
3246,Alexander County,1/1/2015,9242.0
3247,Alleghany County,1/1/2014,9585.0
...,...,...,...
3438,Wilson County,1/1/2015,9450.0
3439,Yadkin County,1/1/2014,10084.0
3440,Yadkin County,1/1/2015,10998.0
3441,Yancey County,1/1/2014,7707.0


### Now we'll use the `print` function to:
- display the selected columns in your notebook so you can see what the data looks like
- confirm that you correctly filtered for North Carolina
- confirm that the right columns (`County`, `Year`, `Health care costs`) are showing

In [17]:
print(nc_subset[['County', 'Year', 'Health care costs']])


                County      Year  Health care costs
3243   Alamance County  1/1/2014             8640.0
3244   Alamance County  1/1/2015             9050.0
3245  Alexander County  1/1/2014             9316.0
3246  Alexander County  1/1/2015             9242.0
3247  Alleghany County  1/1/2014             9585.0
...                ...       ...                ...
3438     Wilson County  1/1/2015             9450.0
3439     Yadkin County  1/1/2014            10084.0
3440     Yadkin County  1/1/2015            10998.0
3441     Yancey County  1/1/2014             7707.0
3442     Yancey County  1/1/2015             7870.0

[200 rows x 3 columns]
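
Reading the printed table is one way to check the result; the same checks can also be written as code. The sketch below is an optional illustration on a hypothetical three-row table (`toy_nc`), not part of the original workflow. It confirms that every row is from North Carolina, that the selected columns are the expected ones, and counts missing cost values.

```python
import pandas as pd

# Hypothetical stand-in for the filtered North Carolina data
toy_nc = pd.DataFrame({
    "State": ["NC", "NC", "NC"],
    "County": ["Alamance County", "Alamance County", "Yadkin County"],
    "Year": ["1/1/2014", "1/1/2015", "1/1/2014"],
    "Health care costs": [8640.0, 9050.0, 10084.0],
})
expected_columns = ["County", "Year", "Health care costs"]
selected = toy_nc[expected_columns]

# True only if every row has State equal to "NC"
only_nc = (toy_nc["State"] == "NC").all()

# True only if the selection has exactly the expected columns, in order
columns_ok = list(selected.columns) == expected_columns

# Number of rows with a missing health care cost
missing_costs = selected["Health care costs"].isna().sum()

print(only_nc, columns_ok, missing_costs)
```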


## **Step 4: Export the Subset to a CSV File**

After filtering and selecting the data you need, export the `nc_subset` DataFrame to a new CSV file.

- First, use the following code to rebuild `nc_subset` so that it holds only the North Carolina rows and the three selected columns, stored as an independent copy:


In [19]:
nc_subset = df[df["State"] == "NC"][['County', 'Year', 'Health care costs']].copy()


- Now use the following code to write the cleaned subset to a `.csv` file in your Google Drive:

In [23]:
nc_subset.to_csv('/content/gdrive/My Drive/Data/nc_subset.csv', index=False)


This saves your cleaned North Carolina subset to a file named `nc_subset.csv`.
Using `index=False` keeps the CSV clean by leaving out the row index numbers.

- The new file can be downloaded, shared, or used for future analysis without needing to repeat the filtering and selection steps.
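
To see what `index=False` does, the sketch below writes a small hypothetical subset to a temporary folder and reads it back. The temporary path and the file name `nc_subset_demo.csv` are assumptions made so the example runs anywhere; in Colab you would use your Drive path instead.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row subset that keeps the original row labels
toy_subset = pd.DataFrame(
    {
        "County": ["Alamance County", "Alamance County"],
        "Year": ["1/1/2014", "1/1/2015"],
        "Health care costs": [8640.0, 9050.0],
    },
    index=[3243, 3244],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "nc_subset_demo.csv")

    # index=False leaves the row labels (3243, 3244) out of the file
    toy_subset.to_csv(out_path, index=False)

    # Read the header line directly to see which columns were written
    with open(out_path) as f:
        header = f.readline().strip()

    # Reloading gives the same data with a fresh 0-based index
    reloaded = pd.read_csv(out_path)

print(header)
print(reloaded)
```

Because the index was not written, the reloaded table has no extra unnamed index column.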