# Overview
-   These instructions give you the context and tools to create, in Python, two subsets of a dataset on jail deaths in North Carolina, one for each of two counties.
-   Once both subsets exist, you can later merge them and analyze the variables pulled from the larger dataset.
-   All of the data comes from counties in North Carolina.

In this tutorial, you will:
1. Create a folder for your data.
2. Find the original csv file and import it into your CoLab notebook.
3. Create two dataframes from the raw dataset.
4. Export those two dataframes as csv files.



## Getting Started
----
1.  First, create a folder in your file explorer to hold all of your work.

-   The name of this folder is not important, as long as you can recognize it. Keeping all of your files in one place ensures that nothing gets lost.
2. Next, create a Google CoLab Notebook to begin creating your subsets. This is where all the coding will happen and this is where you will do the work of filtering and merging subsets.
3. After creating your CoLab notebook, you are ready to get the data. Use [this link](https://www.reuters.com/investigates/special-report/usa-jails-graphic/) to find the dataset called "North Carolina" and download the csv files provided.

-   Do not download the pdf version, as the csv is what you will need to create your subsets in Google CoLab.  
-   Once you complete the download, you should have a folder called "North Carolina" which includes a Microsoft Excel worksheet, and two csv files titled "NorthCarolina_deaths.csv" and "NorthCarolina_jails.csv".
-   **You need to use the file titled "NorthCarolina_deaths.csv" in order to create this subset.** This is the csv file that has all the information needed.

4.  Next, you will need to upload the North Carolina deaths csv file into the Google CoLab notebook. To do this, you should click the folder icon in the toolbar on the left of the Google CoLab notebook.
-   Click the folder above "sample data" and then click the folder that says "content". Next, click the three dots that appear next to "content" when you hover your mouse over it, and hit "upload".
-   Once you have clicked upload, it will prompt you to add your csv file. This is where you will add the **"NorthCarolina_deaths.csv"** file.
5. Next, import the Pandas and NumPy packages as seen below:

In [3]:
import pandas as pd
import numpy as np

This step will allow you to use pandas to code with your csv file and create your subsets.

6. Read the North Carolina deaths csv file into a dataframe as seen below:
-   The file is read with the `read_csv()` function.

In [19]:
dataframe = pd.read_csv("NorthCarolina_deaths.csv")

Now you should be ready to begin with your first subset!
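Before filtering, it can help to confirm that the file loaded as expected. The sketch below is a minimal illustration only: it reads a small made-up csv from a string instead of the real file, so the rows and the names `sample_csv` and `sample_dataframe` are assumptions for demonstration. Only the column names "county" and "cause_detail" come from this tutorial.

```python
import io
import pandas as pd

# Hypothetical stand-in for "NorthCarolina_deaths.csv"; the rows are made up.
sample_csv = io.StringIO(
    "county,cause_detail\n"
    "Alamance,natural\n"
    "Wake,suicide\n"
    "Wake,natural\n"
)

sample_dataframe = pd.read_csv(sample_csv)

# Inspect the first rows, the column names, and the (rows, columns) shape.
print(sample_dataframe.head())
print(list(sample_dataframe.columns))
print(sample_dataframe.shape)
```

With the real data, the same checks would be `dataframe.head()`, `dataframe.columns`, and `dataframe.shape`.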


## Creating your first subset

After you have uploaded the csv and imported Pandas and Numpy, you are ready to begin filtering your data.

1. You will need to isolate two variables from the large csv file: the county and the cause of death, for inmates from Wake County only.
2. To do this, use `.loc`, which selects data by row labels and column labels. Write `.loc` after the name of your dataframe, followed by square brackets that hold the rows and then the columns:

        dataframe.loc[rows, columns]
3. Inside the brackets, first enter the row numbers that correspond to Wake County, so that the only data shown is for this county. These row numbers are 101-119.
4. After the comma, enter a second pair of brackets listing the columns for this subset. The columns we want are "county" and "cause_detail": the first confirms the county (already narrowed to Wake by the row numbers), and the second gives the cause of death for each inmate.
5. Finally, input the full expression as seen below:

In [11]:
dataframe.loc[101:119, ["county", "cause_detail"]]

     county                                       cause_detail
 101   Wake                             ischemic heart disease
 102   Wake                                            suicide
 103   Wake                                            suicide
 104   Wake                                           natural 
 105   Wake                                    hanging w sheet
 106   Wake                                           natural 
 107   Wake                                      brain injury 
 108   Wake                                           natural 
 109   Wake                                           natural 
 110   Wake             climbed rails and jumped off mezzanine
 111   Wake  complications of withdrawal due to chronic sub...
 112   Wake                                            suicide
 113   Wake                                           natural 
 114   Wake                                                NaN
 115   Wake                                            

The expression uses the row numbers for Wake County to separate that county from the larger csv, which lists deaths for every county in North Carolina. The list in the second pair of brackets then limits the result to the county and cause-of-death columns needed for this subset.

**After you run this cell, a table should appear showing the two variables you pulled from the large csv. You now have your first subset!**
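One detail worth knowing about `.loc` is that a row range is read as labels and includes both endpoints, unlike ordinary Python slicing. The sketch below illustrates this on a small made-up dataframe; the name `demo`, its rows, and the extra "year" column are assumptions for demonstration and are not taken from the real dataset.

```python
import pandas as pd

# Made-up data for illustration; the default index runs 0, 1, 2, 3, 4.
demo = pd.DataFrame({
    "county": ["A", "B", "B", "B", "C"],
    "cause_detail": ["natural", "suicide", "natural", "hanging", "natural"],
    "year": [2010, 2011, 2012, 2013, 2014],
})

# .loc slices by label and includes BOTH endpoints: rows 1, 2, and 3.
by_label = demo.loc[1:3, ["county", "cause_detail"]]

# .iloc slices by position and excludes the end: rows 1 and 2 only.
by_position = demo.iloc[1:3]

print(by_label)
print(len(by_label), len(by_position))
```

This is why the last row number you type in the `.loc` range is part of the subset.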

## Creating your second subset
The next step is to repeat the steps above for a second subset, with a few changes to select a different county.

1. To begin, you will want to locate the row numbers for Mecklenburg. In the North Carolina Deaths csv file, you will find the entries for Mecklenburg County at rows 70-87.


2. Next, input an expression similar to the one you used for the Wake County subset, but with different row numbers, as seen below:

In [13]:
dataframe.loc[70:87, ["county", "cause_detail"]]

Unnamed: 0,county,cause_detail
70,Mecklenburg,pulmonary thromboemboli
71,Mecklenburg,heart attack
72,Mecklenburg,gastrointestinal hemorrhage
73,Mecklenburg,ischemic heart disease
74,Mecklenburg,"massive pulmonary embolus, cerebral tumor, foc..."
75,Mecklenburg,"morbid obesity, diabetes"
76,Mecklenburg,"cardiac arrythmia, hypertensive heart disease"
77,Mecklenburg,anoxic encehalopathy with an underlying cause ...
78,Mecklenburg,hanging
79,Mecklenburg,hanging


Once you have input this function, you should see a table with your new dataframe, specific to Mecklenburg county!
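If you would rather not look up row numbers by hand, one possible alternative is to let pandas find the rows by comparing the "county" column to a county name. The sketch below shows the idea on a small made-up dataframe; the name `demo` and its rows are assumptions for demonstration, and it assumes the county names in the real file are spelled exactly as written, with no extra spaces.

```python
import pandas as pd

# Made-up data for illustration only.
demo = pd.DataFrame({
    "county": ["Durham", "Mecklenburg", "Mecklenburg", "Wake", "Wake", "Wake"],
    "cause_detail": ["natural", "hanging", "heart attack",
                     "suicide", "natural", "suicide"],
})

# True for every row whose county is Mecklenburg, False otherwise.
is_mecklenburg = demo["county"] == "Mecklenburg"

# The row labels where the comparison is True.
print(list(demo[is_mecklenburg].index))

# .loc also accepts the True/False column in place of a row range.
mecklenburg_rows = demo.loc[is_mecklenburg, ["county", "cause_detail"]]
print(mecklenburg_rows)
```

The printed row labels can be used to double-check a hand-typed range such as `70:87`.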

---

## Exporting your dataframes
After creating your two new dataframes, you will need to export them into csv files. By doing so, you can download them and they can be used by others.

1. The first step to exporting your new dataframes is to assign each selection to its own variable, so that Python knows which data you want to export into a csv file. Give the two subsets different names; if both used the same name, the second assignment would overwrite the first.
Follow the code as seen below:

In [None]:
wake_subset = dataframe.loc[101:119,["county","cause_detail"]]

In [17]:
mecklenburg_subset = dataframe.loc[70:87,["county","cause_detail"]]

2. Next, turn each subset into a csv file with the `to_csv()` method.
To export the two dataframes above into csv files, do the following:

In [None]:
wake_subset.to_csv("Wake_subset.csv", index=False)

In [18]:
mecklenburg_subset.to_csv("Mecklenburg_subset.csv", index=False)

Adding the `index=False` argument leaves the row numbers from the raw dataset (101-119 and 70-87) out of the csv files. This ensures that you export only the two columns you need.
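To see what `index=False` changes, the sketch below writes the same small dataframe to csv text twice, once with the row numbers and once without. The name `demo` and its rows are made up for demonstration; calling `to_csv()` without a file name returns the csv as text instead of saving a file, which makes the difference easy to print.

```python
import io
import pandas as pd

# Made-up rows that keep their original row numbers (101 and 102) as the index.
demo = pd.DataFrame(
    {"county": ["Wake", "Wake"], "cause_detail": ["suicide", "natural"]},
    index=[101, 102],
)

# With no file name, to_csv() returns the csv content as a string.
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

print(with_index)     # each line starts with the row number
print(without_index)  # only the two columns are written

# Reading the index-free text back gives exactly the two expected columns.
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))
```

Without `index=False`, anyone who reads the file back would see an extra unnamed column holding the old row numbers.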



Once you have run both cells in your notebook, the two csv files should appear in the CoLab file panel, where you can download them. They are now ready to be used by others!

**CONGRATS!!!**
