# Creating a Subset from a Dataset: Walkability Score in Chapel Hill-Durham
## Overview of Process
The instructions below walk you through creating a subset of a larger public dataset using Python.
* Anyone who is interested in urban planning, transportation, or environmental equity, or who lives in the Chapel Hill-Durham area, may find this dataset useful.
* Although the code is in Python, the steps are straightforward and require no previous coding experience.
### How to Find the Data
* Visit: https://catalog.data.gov/dataset/walkability-index7 to download the CSV
* To better understand the data, check out: https://www.epa.gov/smartgrowth/smart-location-mapping and https://www.epa.gov/sites/default/files/2021-06/documents/national_walkability_index_methodology_and_user_guide_june2021.pdf

### Roadmap
1. Find the main dataset, download it as a CSV, and upload it to Google Drive for use in Google Colab
2. Import the needed packages, such as pandas
3. Filter the dataset for the needed information and generate a subset
4. Export the subset as a new CSV
5. Import the new CSV to make graphs of the data

#### Setting Up the Dataset
* In Google Drive, create a folder to store the original CSV and name it: engl 105 unit 3. Then upload the CSV, named EPAWalkabilityScore.csv, to that folder.
* Open Google Colab, hover over the thin line on the main screen, and click "+ Code".
* In the new cell, copy in the two lines "from google.colab import drive" and "drive.mount('/content/gdrive')". After entering code in a cell, press the "play" icon to the left of the cell to run it.




In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


* Once the drive is mounted, the notebook can read files stored in your Google Drive, including the CSV
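
Before reading the file, it can be worth confirming that the notebook actually sees it. The following is a minimal sketch, assuming the folder and file names used in this guide and the mount point shown above; `csv_is_available` is a hypothetical helper name, not part of Colab or pandas.

```python
from pathlib import Path

# Assumed location: the folder and file names from the setup steps,
# under the mount point passed to drive.mount().
CSV_PATH = Path('/content/gdrive/My Drive/engl 105 unit 3/EPAWalkabilityScore.csv')


def csv_is_available(path):
    """Return True if `path` points to an existing file (hypothetical helper)."""
    return Path(path).is_file()


if csv_is_available(CSV_PATH):
    print('Found the CSV at', CSV_PATH)
else:
    print('CSV not found; check the folder name, the file name, and that Drive is mounted.')
```

If the second message appears, the most likely causes are a misspelled folder or file name, or a mount step that has not been run yet.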

#### Packages
* You will need to import numpy and pandas. Create a new cell and type in the two lines "import numpy as np" and "import pandas as pd"

In [2]:
import numpy as np
import pandas as pd

* This loads the packages needed to read and work with the data

#### Final Preparation
* Next, create a new code cell and type "df=pd.read_csv('gdrive/My Drive/engl 105 unit 3/EPAWalkabilityScore.csv')"

In [3]:
df=pd.read_csv('gdrive/My Drive/engl 105 unit 3/EPAWalkabilityScore.csv')

* This reads the CSV into a pandas DataFrame named "df", which you can then manipulate
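
Once a file is loaded, a quick preview helps confirm that it was read correctly. The sketch below uses a tiny made-up CSV held in memory instead of the real file; only the three column names are taken from this guide, and the rows (including "Other Metro, NC") are hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; the values are illustrative only.
sample_csv = io.StringIO(
    'CBSA_Name,NatWalkInd,R_PCTLOWWAGE\n'
    '"Durham-Chapel Hill, NC",6.5,0.20\n'
    '"Other Metro, NC",9.0,0.18\n'
)
sample_df = pd.read_csv(sample_csv)

print(sample_df.head())             # first rows (up to five)
print(sample_df.columns.tolist())   # column names
print(sample_df.dtypes)             # data type of each column
```

The same three calls work on "df" after the real CSV has been read.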

#### Optional:
* To learn how big your dataset is, type in "df.shape", which returns the number of rows and columns in your dataset

In [4]:
df.shape

(220740, 117)

* Next, type "df.size" to find the total number of elements in your dataset

In [5]:
df.size

25826580

* To double-check that the total number of elements matches the shape of the dataset, type in "df.size == 220740 * 117"

In [6]:
df.size == 220740 * 117

True
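
The relationship between shape and size can also be checked without typing the numbers by hand. This is a small sketch on a made-up table (the name `toy` and its values are hypothetical), showing that size is always rows times columns.

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = toy.shape            # shape is a (rows, columns) tuple
print(n_rows, n_cols)                 # 3 2
print(toy.size)                       # 6
print(toy.size == n_rows * n_cols)    # True
```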

#### Creating a Subset
* First, identify the columns you need and the question you want to answer. In this subset we look at how Chapel Hill and Durham rank on the National Walkability Index, and try to determine whether there is a correlation between low levels of walkability and communities in poverty
* Next, we will create a Boolean mask that selects only the rows about Chapel Hill-Durham. Since the original dataset is not organized by location, this is far easier than picking out every row individually.
* Type in "mask = df['CBSA_Name'].str.contains('Durham', case=False, na=False)". The mask is True for rows whose CBSA_Name contains "Durham" and False otherwise; case=False ignores capitalization, and na=False treats missing names as False

In [18]:
mask = df['CBSA_Name'].str.contains('Durham', case=False, na=False)
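
To see what a Boolean mask looks like, here is a minimal sketch on a made-up column. The rows are hypothetical; only the column name and the `str.contains` call mirror the cell above.

```python
import pandas as pd

# Hypothetical names, including one missing value and one in capitals.
toy = pd.DataFrame({
    'CBSA_Name': ['Durham-Chapel Hill, NC', 'Other Metro, NC', None, 'DURHAM test'],
})

toy_mask = toy['CBSA_Name'].str.contains('Durham', case=False, na=False)

print(toy_mask.tolist())   # [True, False, False, True]
print(toy_mask.sum())      # 2, the number of matching rows
```

Summing the mask counts the True values, which is a quick way to check how many rows will end up in the subset.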

* The mask now marks which rows of the original dataset belong to our desired location. Next, we apply it and keep only the columns we want to compare, so that we can make inferences
* Type in "new_df = df.loc[mask, ['CBSA_Name', 'NatWalkInd', 'R_PCTLOWWAGE']]". This keeps only the rows where the mask is True and only the three listed columns, narrowing the data further.

In [23]:
new_df = df.loc[mask, ['CBSA_Name', 'NatWalkInd', 'R_PCTLOWWAGE']]
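
The row-and-column selection can be tried on a small made-up table first. In this sketch the values and the `Extra` column are hypothetical; it only illustrates that `.loc` takes a row selector (the mask) and a list of column names.

```python
import pandas as pd

# Hypothetical data with one column we do not want to keep.
toy = pd.DataFrame({
    'CBSA_Name': ['Durham-Chapel Hill, NC', 'Other Metro, NC', 'Durham-Chapel Hill, NC'],
    'NatWalkInd': [6.0, 9.5, 12.0],
    'R_PCTLOWWAGE': [0.20, 0.15, 0.25],
    'Extra': ['x', 'y', 'z'],
})

toy_mask = toy['CBSA_Name'].str.contains('Durham', case=False, na=False)

# Rows where the mask is True, and only the three listed columns.
toy_subset = toy.loc[toy_mask, ['CBSA_Name', 'NatWalkInd', 'R_PCTLOWWAGE']]

print(toy_subset.shape)            # (2, 3)
print(toy_subset.index.tolist())   # [0, 2]
```

The subset keeps the original row labels (0 and 2 here), which is why the export step uses index=False.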

* Now we are ready to export the subset as a CSV. Type in "new_df.to_csv('gdrive/My Drive/engl 105 unit 3/feeder1.csv', index=False)". This saves the data to your Google Drive; index=False leaves the row labels out of the file.
* To access it, click the folder icon in the top left corner. Then click My Drive, then engl 105 unit 3.
* Find the subset "feeder1.csv", click the three dots to the right of it, and click "Download"

In [24]:
new_df.to_csv('gdrive/My Drive/engl 105 unit 3/feeder1.csv', index=False)
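
The effect of index=False can be seen by writing a small table and reading it back. This is a sketch under stated assumptions: the data is made up, and it writes to a temporary folder rather than Google Drive so that it can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical subset whose row labels (0 and 2) are left over from filtering.
toy_subset = pd.DataFrame(
    {'NatWalkInd': [6.0, 12.0], 'R_PCTLOWWAGE': [0.20, 0.25]},
    index=[0, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_subset.csv')
    toy_subset.to_csv(out_path, index=False)   # row labels are not written
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())   # ['NatWalkInd', 'R_PCTLOWWAGE']
print(reloaded.index.tolist())     # [0, 1]
```

Without index=False, the old row labels would be saved as an extra unnamed column in the CSV.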

#### Interpreting Subset
* To see how this subset looks, load it into Google Colab and graph the data. Create a new code cell and enter "ndf=pd.read_csv('gdrive/My Drive/engl 105 unit 3/feeder1.csv')"
* This reads your subset into Colab as a DataFrame named "ndf"

In [27]:
ndf=pd.read_csv('gdrive/My Drive/engl 105 unit 3/feeder1.csv')

* To see what the subset looks like, simply type in 'ndf' to view the new data frame
* To create a visual representation of your data, click on the graph button to the right of the data frame

In [28]:
ndf

Unnamed: 0,CBSA_Name,NatWalkInd,R_PCTLOWWAGE
0,"Durham-Chapel Hill, NC",6.166667,0.196615
1,"Durham-Chapel Hill, NC",4.833333,0.144513
2,"Durham-Chapel Hill, NC",5.000000,0.192964
3,"Durham-Chapel Hill, NC",7.833333,0.145032
4,"Durham-Chapel Hill, NC",8.166667,0.155990
...,...,...,...
316,"Durham-Chapel Hill, NC",15.333333,0.211538
317,"Durham-Chapel Hill, NC",11.166667,0.228805
318,"Durham-Chapel Hill, NC",19.000000,0.202929
319,"Durham-Chapel Hill, NC",12.666667,0.195191
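
Since the goal of this subset is to look for a relationship between walkability and low wages, one possible next step is a correlation coefficient. The sketch below is only an illustration: it uses the first five rows shown in the output above rather than the full subset, so its result should not be read as a finding. In the notebook itself, the equivalent call would be ndf['NatWalkInd'].corr(ndf['R_PCTLOWWAGE']).

```python
import pandas as pd

# First five rows of the subset, copied from the output above.
# Five rows are far too few to draw conclusions; this only shows the call.
sample = pd.DataFrame({
    'NatWalkInd': [6.166667, 4.833333, 5.000000, 7.833333, 8.166667],
    'R_PCTLOWWAGE': [0.196615, 0.144513, 0.192964, 0.145032, 0.155990],
})

# Pearson correlation (the pandas default) between the two columns.
r = sample['NatWalkInd'].corr(sample['R_PCTLOWWAGE'])
print(round(r, 3))
```

A value near 0 suggests little linear relationship, while values near -1 or 1 suggest a strong one.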
