# Adult Obesity in the United States
This notebook will guide you through the steps taken to create a Python notebook and filter data from a publicly available data set.

The original data set used for this project records health data, such as the numbers of premature deaths, teen births, primary care physicians, violent crimes, and injury deaths, for all states, regions, and counties from 2014 to 2015. The data subset filtered from the original data will show only the percentage of adult obesity by state.

The purpose of creating this data subset is to help researchers compare the percentage of adult obesity in all 50 states in 2014.

## Overview
This notebook contains three main sections: 
1. Getting Started
2. Filtering Data
3. Exporting Data  

Each of these sections provides step-by-step documentation of the computational methods used to compile the subset for the purpose described above and to share it in your repository.

## Acknowledgments
The original data set is the County Health Data set, and the data subset was created using JupyterLab from Anaconda.
You can download the original data set and Anaconda using the links below.

Data set: https://github.com/chivu04/engl105datarepository/blob/main/Data/CountyHealthData_2014-2015.csv

Anaconda: https://unc-libraries-data.github.io/Python/Setup.html#Anaconda-Installation

## 1. Getting Started
Create a folder on your computer to serve as your working directory. Store all your files there, including the original data set and the Python notebook.

### System Requirements
Download Anaconda, and on the home page, launch Jupyter Lab.

### Creating a Notebook
Open the folder in JupyterLab and click the + button to create a notebook in your directory. This notebook, an `.ipynb` file, is where you will write your code to filter the data.

### Pandas Packages
#### Packages
Packages are folders containing modules that provide additional tools and functions not present in base Python. Python includes a number of packages. The Anaconda distribution comes with the "Pandas" package already installed. 

Because the package is already installed, you only need to load it into your Python session with the `import` statement.

#### Pandas
Pandas is a Python package that provides fast and accessible data structures to make working with data easy. Like spreadsheets in Microsoft Excel, Pandas allows us to store our data in dataframe objects with rows, columns, and headers. Pandas also provides a wide range of useful tools for working with data once it has been stored and structured.

### Importing Pandas Package
Begin by importing the pandas package using the following command:

```python
import pandas as pd
```

Load pandas with the usual `import pandas` and an extra `as pd` statement. This allows you to call functions from `pandas` with `pd.` instead of `pandas.` for convenience, though `as pd` is not necessary to load the package.
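As a quick check, the short sketch below shows that `pd` is only another name for the same `pandas` module. The variable name `same_module` is introduced here for illustration, and the version number printed will depend on your installation.

```python
import pandas
import pandas as pd  # same package, loaded under a shorter alias

# Both names point to the same module object, so the alias is purely a convenience.
same_module = pd is pandas
print(same_module)

# The installed pandas version (this varies from one environment to another).
print(pd.__version__)
```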

### Creating a Dataframe
By now, you should have downloaded the original data set, which is the `.csv` file "CountyHealthData_2014-2015.csv", and saved it to your folder, the same working directory as the `.ipynb` notebook file you will use.

(The link to this data set can be found in the Acknowledgments section above or in the folder titled "Data" in the GitHub repository.)

To create your dataframe object, you will define the object `df` by using the `pd.read_csv()` function and inserting the file name into the parentheses.

`pd.read_csv` reads the tabular data from a Comma Separated Values (csv) file into a dataframe object that you will define as `df`.

```python
df = pd.read_csv("CountyHealthData_2014-2015.csv")
```
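If you want to see what `pd.read_csv()` produces before loading the full file, the sketch below reads a tiny in-memory stand-in instead. The three rows are copied from the output shown later in this notebook, and the names `sample_csv` and `sample_df` are assumptions made for this example; they are not part of the original project.

```python
import io
import pandas as pd

# A small stand-in for the real file: a header row plus three data rows.
sample_csv = io.StringIO(
    "State,FIPS,Adult obesity\n"
    "AK,2016,0.300\n"
    "AK,2020,0.257\n"
    "WY,56041,0.293\n"
)

# read_csv accepts a file name or a file-like object such as StringIO.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the header row
print(sample_df.head())         # the first rows of the dataframe
```

Running the same three `print` calls on `df` is a simple way to confirm that the real file loaded as expected.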

## 2. Filtering Data 
After importing the original data set to your notebook, you will start filtering data using the following steps.

The purpose of this section is to use Python to narrow the County Health Data to the columns needed to compare the percentage of adults with obesity in all 50 states.

### Filtering and Indexing
Your dataframe can be thought of as a collection of rows and columns where each row represents an observation and each column holds a specific type of information about each observation.

For example, in the original CountyHealthData set, the 9th row contains all health data for Dillingham Census Area in Alaska as of the beginning of 2015, and the 9th column contains only the number of premature deaths, but for all states, divisions, regions, and counties in 2014.

In Pandas, the columns are stored as what is called 'Series' objects, and the dataframes can be thought of as named collections of series.
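A minimal sketch of this relationship, using a small dataframe built by hand (the name `sample_df` and its three rows are stand-ins taken from the output shown in this notebook, not the full data set):

```python
import pandas as pd

# A dataframe is built from named columns; each column becomes a Series.
sample_df = pd.DataFrame({
    "State": ["AK", "AK", "WY"],
    "FIPS": [2016, 2020, 56041],
    "Adult obesity": [0.300, 0.257, 0.293],
})

state_column = sample_df["State"]

print(type(sample_df).__name__)     # DataFrame
print(type(state_column).__name__)  # Series
print(state_column.name)            # State
```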

To extract information from a single column, you will use bracket notation: `df["Region"]`, which is the most robust way to refer to a series. While dot notation `df.Region` works and is simpler, it is not always available: it fails when a column name contains spaces or other characters that are not valid in a Python name (such as "Adult obesity"), or when the name matches an existing dataframe attribute or method.
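The difference between the two notations can be sketched with a small hand-built dataframe. The name `sample_df` and the "Region" values are illustrative assumptions, not values taken from the original file.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "State": ["AK", "WY"],
    "Region": ["West", "West"],        # illustrative values
    "Adult obesity": [0.300, 0.293],
})

# For a simple column name, dot and bracket notation return the same Series.
same_result = sample_df.Region.equals(sample_df["Region"])
print(same_result)

# "Adult obesity" contains a space, so writing sample_df.Adult obesity
# would be a syntax error. Bracket notation handles it without trouble.
obesity = sample_df["Adult obesity"]
print(obesity)
```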

As the purpose of creating this data subset is to find data for adult obesity in all 50 states, you will extract information from columns "State", "FIPS", and "Adult obesity". 

(FIPS codes uniquely identify geographic areas.)

You will do so with the following syntax:

```python
df[["State", "FIPS", "Adult obesity"]]
```

| | State | FIPS | Adult obesity |
|---|---|---|---|
| 0 | AK | 2016 | 0.300 |
| 1 | AK | 2016 | 0.329 |
| 2 | AK | 2020 | 0.257 |
| 3 | AK | 2020 | 0.268 |
| 4 | AK | 2050 | 0.315 |
| ... | ... | ... | ... |
| 6104 | WY | 56041 | 0.293 |
| 6105 | WY | 56043 | 0.241 |
| 6106 | WY | 56043 | 0.242 |
| 6107 | WY | 56045 | 0.313 |


Note: make sure to wrap the list of column names in a second pair of square brackets `[ ]`.

It should NOT look like this: 
    
`df["State", "FIPS", "Adult obesity"]`
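The reason is that the inner brackets create a Python list of column names, while a single pair of brackets with several names is read as one (nonexistent) column label. The sketch below shows both cases on a small hand-built dataframe; `sample_df`, `subset`, and `raised_key_error` are names introduced for this example.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "State": ["AK", "WY"],
    "FIPS": [2016, 56041],
    "Adult obesity": [0.300, 0.293],
})

# Double brackets: a list of names selects several columns as a dataframe.
subset = sample_df[["State", "Adult obesity"]]
print(type(subset).__name__)
print(list(subset.columns))

# Single brackets with several names: pandas looks for ONE column whose
# label is the tuple ("State", "Adult obesity") and raises a KeyError.
try:
    sample_df["State", "Adult obesity"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(raised_key_error)
```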

## 3. Exporting Data
After you finish filtering the data to create a useful subset for further analysis, export it as a `.csv` file to share with others in your GitHub repository.

To do this, give your final data set a name and assign the filtered columns of `df` to it. Add `.copy()` so that the new dataframe is an independent copy of the selected columns rather than something tied to the original dataframe. This final data subset is named `finaldataset2`.

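```python
finaldataset2 = df[["State", "FIPS", "Adult obesity"]].copy()
```

Display the new dataframe to check its contents:

```python
finaldataset2
```

| | State | FIPS | Adult obesity |
|---|---|---|---|
| 0 | AK | 2016 | 0.300 |
| 1 | AK | 2016 | 0.329 |
| 2 | AK | 2020 | 0.257 |
| 3 | AK | 2020 | 0.268 |
| 4 | AK | 2050 | 0.315 |
| ... | ... | ... | ... |
| 6104 | WY | 56041 | 0.293 |
| 6105 | WY | 56043 | 0.241 |
| 6106 | WY | 56043 | 0.242 |
| 6107 | WY | 56045 | 0.313 |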
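To illustrate what `.copy()` provides, the sketch below changes a value in a copied subset and confirms that the source dataframe is untouched. The names `original` and `subset` and the replacement value `0.0` are arbitrary choices for this demonstration. Depending on the pandas version, modifying a subset created without `.copy()` may also produce a `SettingWithCopyWarning`.

```python
import pandas as pd

original = pd.DataFrame({
    "State": ["AK", "WY"],
    "FIPS": [2016, 56041],
    "Adult obesity": [0.300, 0.293],
})

# .copy() gives the subset its own data, independent of `original`.
subset = original[["State", "Adult obesity"]].copy()

# Change a value in the copy only (0.0 is an arbitrary demonstration value).
subset.loc[0, "Adult obesity"] = 0.0

print(original.loc[0, "Adult obesity"])  # still 0.3
print(subset.loc[0, "Adult obesity"])    # now 0.0
```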

Then, use the `.to_csv()` method, placing the dataframe name in front of the method and the file name inside the parentheses.

So for example, for your filtered subset you would run: `finaldataset2.to_csv("FinalDataSet2.csv")`

This exports the data as a `.csv` file in your working directory.

By default, this `.csv` will include a column of index numbers, the row labels that pandas created when you read the original file into your notebook using `pd.read_csv()`. To leave them out, add `index=False` to your statement:

```python
finaldataset2.to_csv("FinalDataSet2.csv", index=False)
```
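The effect of `index=False` can be checked by writing a small dataframe both ways and reading the files back. This sketch writes to a temporary folder so that nothing is left in your working directory; the file names `with_index.csv` and `without_index.csv` and the dataframe `sample_df` are assumptions made for this example.

```python
import os
import tempfile
import pandas as pd

sample_df = pd.DataFrame({
    "State": ["AK", "WY"],
    "FIPS": [2016, 56041],
    "Adult obesity": [0.300, 0.293],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    sample_df.to_csv(with_index_path)                   # default: index is written
    sample_df.to_csv(without_index_path, index=False)   # index is left out

    # Read both files back and compare their column names.
    columns_with_index = list(pd.read_csv(with_index_path).columns)
    columns_without_index = list(pd.read_csv(without_index_path).columns)

print(columns_with_index)     # has an extra "Unnamed: 0" column
print(columns_without_index)  # only the three data columns
```

The extra "Unnamed: 0" column in the first result is the saved index, which is why the export above uses `index=False`.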

The new data subset will appear in your folder and is ready to be used.