In [None]:
#Run this cell
from datascience import *
import pandas as pd 
from pandas import read_stata
import numpy as np
import datetime

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

# Lab 1 - Working With A Matrix

This lab involves moving data from a website into our Jupyter Notebook and then running some simple analyses of the data. All of the code needed to complete the lab is written in Python.

Please take some time to explore the lab: experiment with the different functions and see what they do (suggestions for getting started are below). We will discuss the lab in class on the 31st, and I will send out a written exercise for you to work on the following week.

## I. Getting the Data 

In this lab, we are going to analyze a data set of people who died in custody.

Navigate to the Texas Justice Initiative web page at the following link: http://texasjusticeinitiative.org/ 

Select the “Download Data Set” link underneath the word “Overview”.  

If you have Excel on your computer, the file will probably open in Excel by default. If you do not have Excel, save the file somewhere on your computer where you can find it.

Once the file has been downloaded, return to the Jupyter Notebook home page. Click on the “File” tab and select “Upload”. The file should be named "custodial_deaths.csv".

Typically, we would be able to import a data set immediately. This particular data set, however, needs some cleaning before it can be imported.

<font color="Blue"> Item 1: Clean and import the data set.

In [None]:
#Clean and import data
custodial_deaths = Table().read_table('***INSERT FILE NAME***', ***INSERT CLEANING CODE***)
custodial_deaths

Check that your answers are in the correct format. 

In [None]:
_ = ok.grade('q1')

## II. Exploring the Functions

We will now explore different functions for manipulating the data set until it has the form we want.

You have probably noticed that Google Sheets does not fix your “header row” in place. The header row is different from the other rows: it needs to stay fixed in place because it contains the labels for the data in the sheet.

Luckily, Jupyter Notebook automatically freezes the top row in place for you!

<font color="Blue"> Item 2: Sort the table by the column labeled "First Name" in alphabetical order (A -> Z).

In [None]:
#Alphabetical order
sort_data = custodial_deaths***INSERT SORT CODE***('***COLUMN NAME***')
sort_data

<font color="Blue"> Item 3: Sort the table by the column labeled "First Name" in reverse alphabetical order (Z -> A).
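
As a point of comparison, here is a minimal sketch of ascending and descending sorts in plain Python. It uses the built-in `sorted` rather than the lab's `Table` methods, and the names in `first_names` are made up for illustration.

```python
# Hypothetical first names, used only for illustration.
first_names = ["Maria", "Albert", "Zoe", "Daniel"]

# sorted() returns a new list in ascending (A -> Z) order by default.
a_to_z = sorted(first_names)

# reverse=True flips the result to descending (Z -> A) order.
z_to_a = sorted(first_names, reverse=True)

print(a_to_z)  # ['Albert', 'Daniel', 'Maria', 'Zoe']
print(z_to_a)  # ['Zoe', 'Maria', 'Daniel', 'Albert']
```

The table method you need for this item follows the same idea: one call sorts, and an extra argument reverses the order.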

In [None]:
#Reverse alphabetical order
reverse_sort_data = custodial_deaths***INSERT SORT CODE***('***COLUMN NAME***', ***INSERT REVERSE ORDER CODE***)
reverse_sort_data

<font color="Blue"> Item 4: Filter the table to display only the Female data. You can do this by choosing the "Sex" column and keeping each row whose value is "Female".

In [None]:
#Filter the table to display Female
female_data = custodial_deaths***INSERT FILTER CODE***('***COLUMN NAME***', "***FILTER WORD***")
female_data

<font color="Blue"> Item 5: Apply multiple filters: select only the Hispanic Female data.
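
The idea of stacking filters can be sketched in plain Python, where each filter keeps only the rows that pass a test and the second filter runs on the output of the first. The records below are made up, and the column name "Race" is an assumption for illustration; check the actual column labels in your table.

```python
# Hypothetical records; "Race" is an assumed column name, not taken from the data set.
records = [
    {"Sex": "Female", "Race": "Hispanic"},
    {"Sex": "Male", "Race": "Hispanic"},
    {"Sex": "Female", "Race": "White"},
    {"Sex": "Female", "Race": "Hispanic"},
]

# First filter: keep rows whose "Sex" value is "Female".
females = [row for row in records if row["Sex"] == "Female"]

# Second filter: applied to the already-filtered rows, not the full list.
hispanic_females = [row for row in females if row["Race"] == "Hispanic"]

print(len(females), len(hispanic_females))  # 3 2
```

This mirrors the cell below, which starts from `female_data` rather than from the full table.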

In [None]:
#Filter table to display Hispanic Females
hispanic_female_data = female_data***INSERT FILTER CODE***('***COLUMN NAME***', "***FILTER WORD***")
hispanic_female_data

<font color="Blue"> Item 6: Split the Death Date column into four separate columns: Time, Month, Day, Year

You may have noticed that some of the data contains a value labeled as "nan". Although we could drop these entries, doing so would take out important data. For that reason, we will replace the nan values with an arbitrary placeholder value.
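
Once every entry is a date string, splitting it into parts can be sketched as follows. This is only an illustration: the date in `raw` is made up, and the explicit `%H:%M:%S` format stands in for the `%X` used in the cells below so that the example does not depend on locale settings.

```python
import datetime

# A made-up date string in the same layout as the placeholder date used in this lab.
raw = "2015-03-09 14:30:00"

# Parse the string into a datetime object.
parsed = datetime.datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Pull out the four pieces that Item 6 asks for.
parts = {
    "Time": parsed.time(),
    "Month": parsed.month,
    "Day": parsed.day,
    "Year": parsed.year,
}
print(parts["Time"], parts["Month"], parts["Day"], parts["Year"])  # 14:30:00 3 9 2015

# The placeholder date parses the same way, so placeholder rows are easy to spot by year.
placeholder = datetime.datetime.strptime("1000-01-01 00:00:00", "%Y-%m-%d %H:%M:%S")
print(placeholder.year)  # 1000
```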

In [None]:
#Replace nan values with an arbitrary placeholder date
temp_df = sort_data.to_df()
temp_df['***COLUMN NAME***'] = temp_df['***COLUMN NAME***'].replace('***VALUE BEING REPLACED***', '1000-01-01 00:00:00')
sort_data = Table().from_df(temp_df)
sort_data

In [None]:
# Making a function that takes a string date and returns a datetime object
apply_datetime = lambda x: datetime.datetime***INSERT STRIP TIME CODE***(x, "%Y-%m-%d %X")

# Apply that function
datetime_objects = ***INSERT CODE TO APPLY FUNCTION***(apply_datetime, '***COLUMN NAME***')
datetime_objects

In [None]:
# Adding the objects as a new column
organized_data = sort_data.with_column('***COLUMN NAME***', datetime_objects)
organized_data

In [None]:
#Apply functions to make new columns
organized_data['***COLUMN NAME***'] = ***INSERT FUNCTION***(lambda x: x.time(), datetime_objects)
organized_data['***COLUMN NAME***'] = ***INSERT FUNCTION***(lambda x: x.month, datetime_objects)
organized_data['***COLUMN NAME***'] = ***INSERT FUNCTION***(lambda x: x.day, datetime_objects)
organized_data['***COLUMN NAME***'] = ***INSERT FUNCTION***(lambda x: x.year, datetime_objects)
organized_data

<font color="Blue"> Item 7: Make a pivot table with "Manner Death" as the rows and "Charges Status" as the columns.

The pivot method cross-classifies a table by two of its columns and returns the number of rows for each combination of values.

The pivot method takes as its first argument the name of the column that contains the values to be used as column labels. The second argument is the name of the column that contains the values to be used as row labels. Each unique value in this second column appears in a separate row as the first entry. An optional third argument names the source of the values to aggregate; here we leave it out, so each cell holds a count of the matching rows.
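
To make the counting concrete, here is a small plain-Python sketch of what a two-column pivot computes. It is not the `Table` method itself, and the manner and status values are made up for illustration.

```python
from collections import Counter

# Hypothetical (manner of death, charge status) pairs, one per person.
rows = [
    ("Suicide", "Convicted"),
    ("Natural causes", "Convicted"),
    ("Suicide", "Filed"),
    ("Suicide", "Convicted"),
]

# Count how many times each combination occurs.
counts = Counter(rows)

# Unique values of each column become the row labels and the column labels.
row_labels = sorted({manner for manner, _ in rows})
column_labels = sorted({status for _, status in rows})

# Build the grid; a Counter returns 0 for combinations that never occur.
pivot = {
    manner: {status: counts[(manner, status)] for status in column_labels}
    for manner in row_labels
}

for manner in row_labels:
    print(manner, pivot[manner])
```

Each printed line corresponds to one row of the pivot table, with one count per column label.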

In [None]:
#Pivot table
pivot_data = ***INSERT PIVOT CODE***('***FIRST ARGUMENT COLUMN NAME***', '***SECOND ARGUMENT COLUMN NAME***')
pivot_data

<font color="Blue"> Item 8: Calculate the minimum age, maximum age, median age, sum of the ages, and average age. 

Python can be a little picky. In class I was able to get a count based on a name. However, these calculations need a column that is made up of numbers, not words. Try converting the "Age" column data from a string to a number format.
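
The conversion step can be sketched in plain Python with the standard `statistics` module. The ages below are made up. Note that this sketch skips the "nan" entries instead of replacing them, which is a different choice from the placeholder approach in the next cell; a placeholder such as 0 would lower the minimum and the mean.

```python
import statistics

# Hypothetical ages stored as text, including one missing entry.
raw_ages = ["34", "17", "nan", "52", "41"]

# Convert text to numbers, skipping the missing entry.
ages = [float(age) for age in raw_ages if age != "nan"]

print(min(ages))                # 17.0
print(max(ages))                # 52.0
print(statistics.median(ages))  # 37.5
print(sum(ages))                # 144.0
print(statistics.mean(ages))    # 36.0
```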

In [None]:
#Replace string nan values with an arbitrary numeric value
age_reformat = sort_data.to_df()
age_reformat['***COLUMN NAME***'] = age_reformat['***COLUMN NAME***'].replace('***VALUE BEING REPLACED***', ***ARBITRARY NUMERIC VALUE***)
new_age_data = Table().from_df(age_reformat)
new_age_data

In [None]:
#Calculate the min, max, median, and sum of the ages
age_stats = new_age_data.select('***COLUMN NAME***')***INSERT MULTI FUNCTION STATS CODE***
age_stats

In [None]:
#Calculate the mean age
age_mean = ***INSERT MEAN FUNCTION CODE***(new_age_data['***COLUMN NAME***'])
age_mean

<font color="Blue"> Item 9: How many people under the age of 18 are in the dataset?

In [None]:
#Make a new variable for the people under age 18
under_eighteen = organized_data.where('***COLUMN NAME***', ***INSERT VALUE***)
under_eighteen

In [None]:
#Count the number of people under age 18
under_eighteen_count = organized_data***INSERT FILTER CODE***('***COLUMN NAME***', ***INSERT VALUE***)***INSERT ROW COUNT CODE***
under_eighteen_count

<font color="Blue"> Item 10: What is the most common reason listed for death?

In [None]:
#Apply pivot table method
common_death = ***INSERT PIVOT CODE***('***FIRST ARGUMENT COLUMN NAME***', '***SECOND ARGUMENT COLUMN NAME***')
common_death

In [None]:
#Sum the pivot table data
common_death***INSERT DROP CODE***('***DROPPED COLUMN NAME***')***INSERT SUM CODE***

In [None]:
#Output the most common reason listed for death
common_death***INSERT SELECT CODE***('***INSERT MOST COMMON DEATH COLUMN***')***INSERT SUM CODE***

<font color="Blue"> Item 11: What is the most common reason listed for death for people under 18 years of age?

In [None]:
#Apply pivot table method
common_death_eighteen = ***INSERT PIVOT CODE***('***FIRST ARGUMENT COLUMN NAME***', '***SECOND ARGUMENT COLUMN NAME***')
common_death_eighteen

In [None]:
#Sum the pivot table data
common_death_eighteen***INSERT DROP CODE***('***DROPPED COLUMN NAME***').sum()

In [None]:
#Output the most common reason listed for death for people under 18
common_death_eighteen***INSERT SELECT CODE***('***INSERT MOST COMMON REASON COLUMN***').sum()

<font color="Blue"> Item 12: What’s the average age of the people in the database?

In [None]:
#Calculate the average age
age_mean = ***CALCULATE MEAN CODE***(new_age_data['***COLUMN NAME***'])
age_mean

<font color="Blue"> Item 13: What is the most common charge status for people in the database?

In [None]:
#Apply pivot table
common_charge = ***INSERT PIVOT CODE***('***FIRST ARGUMENT COLUMN NAME***', '***SECOND ARGUMENT COLUMN NAME***')
common_charge

In [None]:
#Sum the pivot table data
common_charge***INSERT DROP CODE***('***DROPPED COLUMN NAME***').sum()

In [None]:
#Output the most common charge status listed for people
common_charge***INSERT SELECT CODE***('***INSERT MOST COMMON CHARGE COLUMN***').sum()

<font color="Blue"> Item 14: What is the most common reason listed for death for people whose charge status is “not filed at time”?

In [None]:
#Apply pivot table
not_filed_pivot = ***INSERT PIVOT CODE***('***FIRST ARGUMENT COLUMN NAME***', '***SECOND ARGUMENT COLUMN NAME***')
not_filed_pivot***INSERT ROW CODE***[***ROW NUMBER***][***COLUMN NUMBER***]

<font color="Blue"> Item 15: What does any of this mean? How are these cases counted and classified? Who reports this information to whom and why? Review the Texas Justice Initiative web page, especially the “about” link and the “about the data” link under that. Read the article from The Atlantic linked in the syllabus.

Written Response: 

Congratulations, you finished Lab 1. Now you are a step closer to answering pressing questions that would normally take people hours to figure out!