# [LEGALST-123] Lab 04: Probability Distributions, Bootstrap, and Confidence Intervals

In [13]:
from datascience import *
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Introduction
In this lab, we aim to prepare students for prediction exercises in PSET 1 and PSET 2 by focusing on several key aspects of exploratory data analysis (EDA) and data manipulation using the Nashville police stops dataset. The objectives of this lab are as follows:

**Data Cleaning**:
(Continuing from Lab 3): We will continue the data cleaning process by addressing issues not covered in Lab 3. This includes handling missing values through imputation or dropping as appropriate. During the data cleaning process, we will refer to simple plot exercises, such as scatter plots, box plots, and histograms, to make data-driven decisions.

**Summary Statistics**:
We will compute and display summary statistics for relevant columns, particularly 'age' and 'year.' This includes calculating the mean and median for these columns and explaining their significance in the context of the dataset.

**Visualizations**:
We will create visualizations, including histograms for 'column1' and 'column2' columns, scatter plots to visualize relationships between specific variables, and box plots to display data distributions. Interpretations of these visualizations will be provided.

**Aggregating Data**:
We will introduce data aggregation using Python libraries like pandas. Techniques such as grouping data using the groupby function in pandas will be explored, along with examples of aggregating data to gain insights. We will also explain the use of pivot tables in pandas for data aggregation.

**Time Series Analysis**:
We will introduce time series data analysis using a specific example from the dataset. We will analyze and visualize police stop trends over time, such as monthly or yearly trends, using time-specific data to demonstrate aggregation techniques. Line plots will be created to visualize time series data.

## Data Cleaning (Continuing from Lab 3)

Load the Nashville police stops dataset.


Continue data cleaning by addressing issues not covered in Lab 3.


Handle missing values by either imputation or dropping, as appropriate.


Refer to simple plot exercises (scatter, box, histogram) during the data cleaning process.


Describe the dataset and its columns.


Provide explanations for data cleaning decisions, emphasizing their impact on visualization and analysis.
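The choice between dropping and imputing can be sketched on a small example. The frame below is a hypothetical stand-in for `stops` with made-up values; only the column names (`subject_age`, `lat`, `lng`) come from the dataset. It assumes that rows without coordinates are unusable for location-based plots, and that a missing age can reasonably be filled with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for `stops`; the values are made up.
toy = pd.DataFrame({
    "subject_age": [22.0, 35.0, np.nan, 41.0, 58.0],
    "lat": [36.14, np.nan, 36.25, 36.16, np.nan],
    "lng": [-86.85, np.nan, -86.71, -86.61, np.nan],
})

# Count missing values per column before deciding how to handle them.
missing_counts = toy.isna().sum()
print(missing_counts)

# Option A: drop rows that lack coordinates.
with_coords = toy.dropna(subset=["lat", "lng"])

# Option B: fill a missing age with the column median,
# which is less sensitive to extreme values than the mean.
imputed = toy.copy()
imputed["subject_age"] = imputed["subject_age"].fillna(imputed["subject_age"].median())
```

Dropping shrinks the sample, while imputing keeps every row but narrows the spread of the imputed column, so either choice changes how later histograms and box plots look.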

In [16]:
# load the data
user = "suminpark" #insert your user name
path = "/Users/" + user +  "/Documents/GitHub/Modules/Legalst-123/labs/data"
stops = pd.read_csv(path + "/stops_sample.csv", index_col = 0)
stops.head()

Unnamed: 0,index,raw_row_number,date,time,location,lat,lng,precinct,reporting_area,zone,...,raw_traffic_citation_issued,raw_misd_state_citation_issued,raw_suspect_ethnicity,raw_driver_searched,raw_passenger_searched,raw_search_consent,raw_search_arrest,raw_search_warrant,raw_search_inventory,raw_search_plain_view
0,1840907,93347,2010-04-18,13140.0,"BURGESS AVE & WHITE BRIDGE PIKE, NASHVILLE, TN...",36.145004,-86.85797,1.0,5103.0,113.0,...,False,,N,False,False,False,False,False,False,False
1,492044,2001428,2015-01-19,19920.0,"DUE WEST AVE W & S GRAYCROFT AVE, MADISON, TN,...",36.249187,-86.734459,7.0,1797.0,723.0,...,False,False,N,False,False,False,False,False,False,False
2,431170,1996331,2015-01-15,1020.0,"S GALLATIN PIKE & MADISON BLVD, MADISON, TN, 3...",36.254979,-86.715246,7.0,1623.0,711.0,...,False,False,N,False,False,False,False,False,False,False
3,2066423,1319451,2013-05-17,62760.0,"CHARLOTTE PIKE & W HILLWOOD DR, NASHVILLE, TN,...",36.139093,-86.880533,1.0,5009.0,123.0,...,False,False,N,False,False,False,False,False,False,False
4,2899480,201349,2010-09-01,28140.0,"BELL RD & DODSON CHAPEL RD, HERMITAGE, TN, 37076",36.16331,-86.613147,5.0,9501.0,521.0,...,False,,N,False,False,False,False,False,False,False


## Summary Statistics

Let's reload the CSV file into a `pandas.DataFrame` object and start exploring the data.

In [18]:
stops = pd.read_csv(path + "/stops_sample.csv", index_col = 0) #edit this later for the actual lab. 
stops.head()

Unnamed: 0,index,raw_row_number,date,time,location,lat,lng,precinct,reporting_area,zone,...,raw_traffic_citation_issued,raw_misd_state_citation_issued,raw_suspect_ethnicity,raw_driver_searched,raw_passenger_searched,raw_search_consent,raw_search_arrest,raw_search_warrant,raw_search_inventory,raw_search_plain_view
0,1840907,93347,2010-04-18,13140.0,"BURGESS AVE & WHITE BRIDGE PIKE, NASHVILLE, TN...",36.145004,-86.85797,1.0,5103.0,113.0,...,False,,N,False,False,False,False,False,False,False
1,492044,2001428,2015-01-19,19920.0,"DUE WEST AVE W & S GRAYCROFT AVE, MADISON, TN,...",36.249187,-86.734459,7.0,1797.0,723.0,...,False,False,N,False,False,False,False,False,False,False
2,431170,1996331,2015-01-15,1020.0,"S GALLATIN PIKE & MADISON BLVD, MADISON, TN, 3...",36.254979,-86.715246,7.0,1623.0,711.0,...,False,False,N,False,False,False,False,False,False,False
3,2066423,1319451,2013-05-17,62760.0,"CHARLOTTE PIKE & W HILLWOOD DR, NASHVILLE, TN,...",36.139093,-86.880533,1.0,5009.0,123.0,...,False,False,N,False,False,False,False,False,False,False
4,2899480,201349,2010-09-01,28140.0,"BELL RD & DODSON CHAPEL RD, HERMITAGE, TN, 37076",36.16331,-86.613147,5.0,9501.0,521.0,...,False,,N,False,False,False,False,False,False,False


We see that the fields include variables such as the latitude and longitude, the subject's race and age, and the date and time of the stop.

Let's also check some basic information about this DataFrame using the `DataFrame.info` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)) and `DataFrame.describe` methods ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)).

In [19]:
stops.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 43 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   index                           1000 non-null   int64  
 1   raw_row_number                  1000 non-null   object 
 2   date                            1000 non-null   object 
 3   time                            996 non-null    float64
 4   location                        1000 non-null   object 
 5   lat                             940 non-null    float64
 6   lng                             940 non-null    float64
 7   precinct                        887 non-null    float64
 8   reporting_area                  903 non-null    float64
 9   zone                            887 non-null    float64
 10  subject_age                     999 non-null    float64
 11  subject_race                    1000 non-null   object 
 12  subject_sex                     998

In [20]:
stops.describe()

Unnamed: 0,index,time,lat,lng,precinct,reporting_area,zone,subject_age
count,1000.0,996.0,940.0,940.0,887.0,903.0,887.0,999.0
mean,1491263.0,47195.963855,36.146446,-86.762884,4.401353,7770.545958,460.828636,36.811812
std,882368.3,24555.357937,0.115117,0.376956,2.24843,12490.193085,225.863415,13.748406
min,1425.0,60.0,33.522888,-97.407823,1.0,889.0,111.0,16.0
25%,721635.2,30660.0,36.1097,-86.789033,2.0,3020.0,227.0,26.0
50%,1456574.0,48810.0,36.154908,-86.751799,4.0,5501.0,425.0,34.0
75%,2282808.0,67755.0,36.190809,-86.70374,6.0,8815.0,621.0,46.0
max,3091709.0,86280.0,36.373107,-84.751067,8.0,95020.0,835.0,82.0


In [21]:
stop = stops[["subject_age", "subject_race"]]
stop.groupby(['subject_race']).mean()

Unnamed: 0_level_0,subject_age
subject_race,Unnamed: 1_level_1
asian/pacific islander,35.45
black,35.506427
hispanic,30.963636
other,32.5
unknown,37.083333
white,38.512573


In [22]:
stop.groupby(['subject_race']).median()

Unnamed: 0_level_0,subject_age
subject_race,Unnamed: 1_level_1
asian/pacific islander,33.0
black,32.0
hispanic,31.0
other,33.0
unknown,37.5
white,35.0


Notice that `info` and `describe` above reveal type information for the columns, as well as some basic statistics about the numerical columns in the DataFrame. To explore the mean or median of a specific variable within subgroups, we can use the `groupby` function.

The two `groupby` cells above use it to find the mean and median age for each race.

Compute and display summary statistics for relevant columns, such as 'age' and 'year.'

Calculate the mean and median for these columns.

Explain the significance of mean and median in the context of the dataset.
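A minimal sketch of this computation follows. It assumes that 'age' refers to the `subject_age` column and that 'year' has to be derived from the `date` column, since no year column appears in the output above. The dates are the five shown by `stops.head()`; the ages are made up for illustration.

```python
import pandas as pd

# Dates taken from the head() output above; ages are hypothetical.
toy = pd.DataFrame({
    "date": ["2010-04-18", "2015-01-19", "2015-01-15", "2013-05-17", "2010-09-01"],
    "subject_age": [19.0, 27.0, 34.0, 45.0, 80.0],
})

# Derive a numeric year from the date strings.
toy["year"] = pd.to_datetime(toy["date"]).dt.year

# One row per statistic, one column per variable.
summary = toy[["subject_age", "year"]].agg(["mean", "median"])
print(summary)
```

When the mean sits above the median, as it does for `subject_age` in the `describe` output (about 36.8 versus 34.0), the distribution has a longer right tail: a few older subjects pull the mean up while the median stays near the typical stop.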

## Exploratory Data Analysis

### 1. Visualizations

#### **1.1 Histograms:** Create histograms for 'column1' and 'column2' columns.

Interpret the distributions of these variables.

In [30]:
# HISTOGRAM CODE
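One possible shape for the histogram cell is sketched below. It assumes `subject_age` is one of the columns of interest and uses randomly generated ages as a stand-in for the real data; the five-year bin width is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plots

# Hypothetical ages drawn uniformly between 16 and 82; not the real data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"subject_age": rng.integers(16, 83, size=200)})

fig, ax = plots.subplots()
# Five-year-wide bins covering ages 15 through 85.
counts, bin_edges, _ = ax.hist(
    toy["subject_age"], bins=np.arange(15, 90, 5), edgecolor="white"
)
ax.set_xlabel("Subject age (years)")
ax.set_ylabel("Number of stops")
ax.set_title("Distribution of subject age")
```

With the real `stops` frame, replacing `toy` with `stops` (after handling missing ages) would give the distribution to interpret.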

#### **1.2 Scatter Plots:** Generate scatter plots to visualize relationships between specific variables.

Discuss any insights gained from scatter plots.

In [29]:
# SCATTER CODE
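A scatter plot cell might look like the sketch below. It assumes that `time` records seconds since midnight (the values in `describe` range from 60 to 86280, which fits a 24-hour day) and pairs it with `subject_age`. The times come from the `head()` output; the ages are made up.

```python
import pandas as pd
import matplotlib.pyplot as plots

toy = pd.DataFrame({
    "time": [13140.0, 19920.0, 1020.0, 62760.0, 28140.0],  # from head() above
    "subject_age": [24.0, 31.0, 19.0, 45.0, 52.0],         # hypothetical values
})

# Assumption: `time` is seconds since midnight, so dividing by 3600 gives hours.
toy["hour"] = toy["time"] / 3600

fig, ax = plots.subplots()
ax.scatter(toy["hour"], toy["subject_age"], alpha=0.6)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Subject age (years)")
ax.set_title("Subject age by time of stop")
```

A cloud with no visible slope suggests the two variables are unrelated; a band or cluster would be worth discussing.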

#### **1.3 Box Plots:** Create box plots for relevant columns.

Explain the concept of box plots and their use in displaying data distributions. Interpret the box plots and identify outliers if present. What is the shape of the plot?

In [35]:
# BOX CODE
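The sketch below draws a box plot and flags outliers with the 1.5 × IQR rule, which is also the default whisker rule in matplotlib's `boxplot`. The ages are made up, including one deliberately extreme value.

```python
import pandas as pd
import matplotlib.pyplot as plots

# Hypothetical ages; 95 is included on purpose as an extreme value.
ages = pd.Series([18, 22, 25, 26, 30, 34, 38, 46, 52, 95], name="subject_age")

# The box spans the first to third quartile; the line inside is the median.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers.tolist())

fig, ax = plots.subplots()
ax.boxplot(ages.to_numpy())
ax.set_ylabel("Subject age (years)")
ax.set_title("Box plot of subject age")
```

A median line that sits low in the box, with a longer upper whisker, indicates a right-skewed distribution.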

### 2. Aggregating Data: 

Introduction to aggregating data using Python libraries like pandas.

Explore techniques such as grouping data using the groupby function in pandas.

Provide examples of aggregating data to gain insights.

Explain the use of pivot tables in pandas for data aggregation.

In [33]:
# HERE
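As a rough illustration of both techniques, the sketch below aggregates age by race with `groupby` and then cross-tabulates race against sex with `pivot_table`. The column names come from the dataset; the rows, including the `"male"`/`"female"` labels, are hypothetical.

```python
import pandas as pd

# Hypothetical rows using column names from the stops data.
toy = pd.DataFrame({
    "subject_race": ["white", "black", "white", "hispanic", "black", "white"],
    "subject_sex": ["male", "female", "female", "male", "male", "male"],
    "subject_age": [40.0, 30.0, 36.0, 28.0, 34.0, 50.0],
})

# groupby: several statistics of one column, one row per group.
by_race = toy.groupby("subject_race")["subject_age"].agg(["count", "mean", "median"])
print(by_race)

# pivot_table: one grouping variable on the rows, another on the columns.
pivot = toy.pivot_table(
    index="subject_race", columns="subject_sex",
    values="subject_age", aggfunc="mean",
)
print(pivot)
```

Combinations with no rows (here, hispanic and female) appear as `NaN` in the pivot table, which is a useful reminder to check group sizes before comparing means.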

### 3. Time Series Analysis: 
Introduce time series data analysis using a specific example from the dataset.

Analyze and visualize police stop trends over time (e.g., monthly or yearly).

Use time-specific data to demonstrate aggregation techniques.

Create line plots to visualize time series data.

In [32]:
# HERE
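One way to count stops over time is sketched below, using the five dates from the `head()` output as a tiny example. It assumes the `date` strings parse cleanly with `pd.to_datetime`; the yearly counts come from a `groupby` on the year, and the monthly counts from `resample`.

```python
import pandas as pd
import matplotlib.pyplot as plots

# The five dates shown in the head() output above.
toy = pd.DataFrame({
    "date": ["2010-04-18", "2015-01-19", "2015-01-15", "2013-05-17", "2010-09-01"],
})
toy["date"] = pd.to_datetime(toy["date"])

# Yearly trend: group on the year component and count rows.
stops_per_year = toy.groupby(toy["date"].dt.year).size()

# Monthly trend: one count per stop, summed within each calendar month.
# Months with no stops appear as 0 rather than being skipped.
monthly = pd.Series(1, index=toy["date"]).sort_index().resample("MS").sum()

fig, ax = plots.subplots()
ax.plot(stops_per_year.index, stops_per_year.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Number of stops")
ax.set_title("Stops per year")
```

On the full dataset the monthly series is usually the more informative line plot, since it can reveal seasonal patterns that yearly totals hide.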

## Conclusion 

Summarize key findings from the data analysis. Discuss insights or patterns observed during the analysis.

Reflect on the importance of data cleaning, summary statistics, visualization, and data aggregation in exploratory data analysis.



In [31]:
# HERE

## References

Include references or data sources used in the lab, such as the Stanford traffic stop data library and relevant documents related to Nashville policing practices for traffic stops.