## Lab 11: Data Cleaning and Exploratory Data Analysis for Classification Problems

This lab will go through a little more exploration of the Nashville traffic stop data. _This time we will use a sample of cleaned data._ <span style="color:orange">**We will also apply the mapping tools from the preceding labs to some of the data, e.g., for stops with searches by individual officers.**</span>

Even though it is an unusual outcome, in this lab we are going to consider searches as our outcome of interest. We will focus on learning the data, cleaning the data, and exploring it. Subsequent labs will move on to predictive models.

Documentation for the Nashville police traffic stop data is at https://github.com/stanford-policylab/opp/blob/master/data_readme.md#nashville-tn and is drawn from the Stanford traffic stop data library.

There are some materials online that can help us put Nashville policing practices for traffic stops in context. Remember that Nashville is a "blue" city in a "red" state, and that whenever we see dramatic changes over time we want to see whether there was an important policy change.<p>
* https://policylab.stanford.edu/media/nashville-traffic-stops.pdf <p>
* https://filetransfer.nashville.gov/portals/0/sitecontent/CommunityOversight/docs/2019-1023/TrafficStopResearchProposal.pdf<p>
* https://www.policingproject.org/nashville<p>
* https://www.tennessean.com/story/news/crime/2019/04/18/nashville-traffic-stops-police-study-statistics-driving-while-black/3273143002/<p>
* https://www.tennessean.com/story/news/crime/2018/11/20/nashville-traffic-stop-report-policing-project-bias-jocques-clemmons/2066165002/<p>
* https://www.davisvanguard.org/2021/04/traffic-stop-assessment-by-nyu-law-in-nashville-may-be-applicable-to-other-u-s-cities/<p>

### From the readme file on the Nashville dataset (2010-01-01 to 2019-03-24)
#### Data notes:
*    Data is deduplicated on raw columns stop_date_time, stop_location_street, officer_employee_number, race, sex, and age_of_suspect, reducing the number of records by ~0.3%
*    There are 30 (of ~2.6M records) cases where search_conducted is ambiguous after the merge and are left as NA, since it's unclear whether they are true or false, since being NA after the above merge indicates that there were two distinct values for raw column searchoccur
*    reason_for_stop and violation are both translations of the original stop_type column; this column is sometimes the pretextual reason for the stop and does not always represent what the individual was ultimately cited for
*    contraband_drugs is raw column drugs_seized, contraband_weapons is weapons_seized, and contraband_found is evidenceseized
*    citation_issued is derived from traffic_citation_issued and misd_state_citation_issued, which are passed through as raw_*; misd_state_citation_issued is sometimes NA, so for the purposes of defining citation_issued, we consider NA to be false
*    warning_issued is derived from verbal_warning_issued and written_warning_issued, which are passed through as raw_*; written_warning_issued is sometimes NA, so for the purposes of defining warning_issued, we consider NA to be false
*    search_basis is based on the raw columns search_plain_view, search_consent, search_incident_to_arrest, search_warrant, and search_inventory, which are all passed on with the raw_* prefix
*    subject_race is derived from raw columns suspect_ethnicity and suspect_race, which are passed through with the raw_* prefix
*    search_person is derived from search_driver and search_passenger, which are passed through with the raw_* prefix
*    When contraband_found is NA, we fill it with false when a search occurred, under the assumption that the officer simply didn't record the absence of contraband


In [1]:
# dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
from IPython.display import display, HTML # makes the output in Jupyter notebook pretty
!pip install plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
# !pip install folium
import json
import os
import folium # The Folium JavaScript map library
import folium.plugins
from folium.plugins import HeatMap
from folium.plugins import HeatMapWithTime
from folium.plugins import FastMarkerCluster



In [2]:
# load the data
path = "https://github.com/ds-modules/data/raw/main/nashville_cleaned_sample.csv"
cleaned_stops = pd.read_csv(path, index_col=0)
cleaned_stops.head()

| raw_row_number | date | time | location | lat | lng | precinct | reporting_area | zone | subject_age | officer_id_hash | ... | violation_moving traffic violation | violation_parking violation | violation_registration | violation_safety violation | violation_seatbelt violation | violation_vehicle equipment violation | subject_race | subject_sex | violation | stardate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2469050 | 2016-05-01 | 10:52:00 | 924 CURREY RD, NASHVILLE, TN, 37217 | 36.10838 | -86.708519 | 3.0 | 8831.0 | 315.0 | 52.0 | 37810c10d6 | ... | 1 | 0 | 0 | 0 | 0 | 0 | white | female | moving traffic violation | 2016-05-01 10:52:00 |
| 1210355 | 2013-02-01 | 17:53:00 | 10TH AVE S & MONTROSE AVE, NASHVILLE, TN, 37204 | 36.123393 | -86.786429 | 8.0 | 6949.0 | 823.0 | 29.0 | cf72c2298a | ... | 1 | 0 | 0 | 0 | 0 | 0 | white | male | moving traffic violation | 2013-02-01 17:53:00 |
| 412047 | 2011-04-01 | 11:35:00 | 1900 HOBSON PIKE, ANTIOCH, TN, 37013 | 36.046237 | -86.598773 | 3.0 | 8927.0 | 335.0 | 18.0 | d2c8e20a5a | ... | 1 | 0 | 0 | 0 | 0 | 0 | unknown | female | moving traffic violation | 2011-04-01 11:35:00 |
| 2326635 | 2015-12-01 | 10:04:00 | ERIN LN & HWY 70 S, NASHVILLE, TN, 37221 | 36.078407 | -86.908912 | 1.0 | 4901.0 | 121.0 | 65.0 | 0a70309850 | ... | 1 | 0 | 0 | 0 | 0 | 0 | white | female | moving traffic violation | 2015-12-01 10:04:00 |
| 1125061 | 2012-12-01 | 12:32:00 | BRILEY PKWY N & MURFREESBORO PIKE, NASHVILLE, ... | 36.122074 | -86.702419 | 5.0 | 8998.0 | 531.0 | 20.0 | 83fbfcfd39 | ... | 1 | 0 | 0 | 0 | 0 | 0 | unknown | female | moving traffic violation | 2012-12-01 12:32:00 |


### 1. Dataset basics: some facts, some exploration
First we will just get a handle on the sample of cleaned Nashville traffic stop data by finding its shape, what columns are in it, what the proportions of missing data are for each column, and what the searches look like geographically.
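As a rough illustration of these first-look checks, here is a minimal sketch on a tiny made-up dataframe. The name `toy_stops` and its values are hypothetical stand-ins, not the Nashville sample.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real sample
toy_stops = pd.DataFrame({
    "precinct": [3.0, np.nan, 8.0, 1.0],
    "subject_age": [52.0, 29.0, 18.0, 65.0],
    "search_conducted": [False, True, False, False],
})

n_rows, n_cols = toy_stops.shape    # (rows, columns)
column_types = toy_stops.dtypes     # data type of each column

# isna() marks missing cells; the mean of a True/False column is a proportion
missing_share = toy_stops.isna().mean()

print(n_rows, n_cols)
print(column_types)
print(missing_share)
```

The same three calls should work unchanged on a larger dataframe, since none of them depends on the column names.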

In [3]:
# find the shape of the dataframe, what the columns are, what the data type in each column is, etc.
# YOUR CODE HERE (feel free to use multiple code cells)

In [8]:
# show proportions of missing data
pd.options.display.float_format = '{:.8f}'.format # use 8 decimal places and not scientific notation for float
# cleaned_stops ...

### 2. Data Cleaning, Dealing with Missing Values, Creating Dummy Variables
We covered data cleaning and missing values in Labs 3-6, so this is just a brief review. The sample of cleaned data that we are using in this lab already has dichotomous (dummy) variables that indicate subject race, subject sex, and the type of violation that resulted in the stop, but if you look closely, the dataset also retains the original variables. I think it is always a good idea not to throw away information if possible. In this section we will get rid of the existing dummy variables, create new ones, and decide what to do with missing values as we ready the data for predictive models (which we will create in Labs 12-14).
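One possible way to pick out dummy columns is by their name prefix. The sketch below uses a hypothetical toy frame (`toy`, `toy_dummy_cols`, and `toy_no_dummies` are illustrative names) and assumes the dummies are named as the original column, an underscore, then the category.

```python
import pandas as pd

# Hypothetical toy frame: two original categoricals plus dummies derived from them
toy = pd.DataFrame({
    "subject_sex": ["female", "male"],
    "subject_sex_female": [1, 0],
    "subject_sex_male": [0, 1],
    "violation": ["registration", "parking violation"],
    "violation_registration": [1, 0],
    "violation_parking violation": [0, 1],
})

# str.startswith accepts a tuple of prefixes; the trailing underscore keeps
# the original columns ("subject_sex", "violation") out of the match
prefixes = ("subject_sex_", "violation_")
toy_dummy_cols = [col for col in toy.columns if col.startswith(prefixes)]

# drop returns a new dataframe without the listed columns
toy_no_dummies = toy.drop(columns=toy_dummy_cols)
print(toy_no_dummies.columns.tolist())
```

Listing the column names by hand is just as valid; the prefix match only saves typing when there are many categories.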

In [9]:
# get rid of the existing dummy variables constructed from
# 'violation', 'subject_race', and 'subject_sex' but keep working with the cleaned_stops dataframe
# confirm that the resulting dataframe is what we expect

dummy_cols = ...
cleaned_stops.drop...
cleaned_stops.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, 2469050 to 2325584
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   date                        2000 non-null   object 
 1   time                        2000 non-null   object 
 2   location                    2000 non-null   object 
 3   lat                         2000 non-null   float64
 4   lng                         2000 non-null   float64
 5   precinct                    1806 non-null   float64
 6   reporting_area              1849 non-null   float64
 7   zone                        1806 non-null   float64
 8   subject_age                 2000 non-null   float64
 9   officer_id_hash             2000 non-null   object 
 10  type                        2000 non-null   object 
 11  arrest_made                 2000 non-null   bool   
 12  citation_issued             2000 non-null   bool   
 14  outcome                     2

In [10]:
# what's that last feature?
print("Captain's log supplemental. The first entry in 'stardate' is ", cleaned_stops.stardate.iloc[0], "and is of type ", type(cleaned_stops.stardate.iloc[0]))

Captain's log supplemental. The first entry in 'stardate' is  2016-05-01 10:52:00 and is of type  <class 'str'>


Now we are back to something like the original configuration of the data. The features "year", "month", and "stardate" were created from the "date" and "time" features; "year" and "month" are integer representations of the year and month of the stop. Note that you can use dot notation for the columns of a dataframe, although it is not as clear as using brackets to index a column.
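How these derived features were actually built is not shown in the lab, so the following is only one plausible reconstruction, using the first two date and time values from the sample above. The frame `toy` and the variable `parsed` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2016-05-01", "2013-02-01"],
    "time": ["10:52:00", "17:53:00"],
})

# Assumption: 'stardate' is simply the date and time strings joined by a space
toy["stardate"] = toy["date"] + " " + toy["time"]

# Parse the strings to datetimes, then pull out integer year and month
parsed = pd.to_datetime(toy["stardate"])
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month

# Dot notation and bracket notation return the same column
same_column = toy.stardate.equals(toy["stardate"])
print(toy)
```

Bracket notation is the safer habit because it also works for column names with spaces, such as `'violation_parking violation'`.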

**Question:** Do we need to do anything with the large number of missing values for the "contraband_" and "search_basis" features? How about "notes"?

_Your answer here_<p>

In [11]:
# do the operations necessary to deal with missing data, if they are needed, and add a note of explanation
# in a markdown cell


_your explanation here_


    

### Creating dummy variables from categorical variables
We just took the dummy variables out of the dataset, but it is important to be able to create them when you have categorical variables that you wish to use in a model. We are not going to do prediction exercises in this lab, but we will in the next several labs, in the unit on computational text analysis, in the homework, and in the project. Luckily, Pandas makes this part easy.
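For reference, a minimal sketch of `pd.get_dummies` on a hypothetical three-row frame (`toy`, `toy_categorical`, and `toy_dummies` are illustrative names, and the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "subject_age": [52.0, 29.0, 18.0],
    "subject_sex": ["female", "male", "female"],
    "violation": ["registration", "parking violation", "registration"],
})

toy_categorical = ["subject_sex", "violation"]

# Each listed column is replaced by one indicator column per category;
# dtype=int gives 0/1 rather than True/False
toy_dummies = pd.get_dummies(toy, columns=toy_categorical, dtype=int)
print(toy_dummies.columns.tolist())
```

Columns not listed in `columns=` (here `subject_age`) pass through untouched, and the new names follow the `original_category` pattern.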

In [None]:
# create dichotomous variables to represent categorical variables in the dataset, that is,
# 'subject_race', 'subject_sex', and 'violation', and then check the resulting dataframe

categorical_cols = ...
cleaned_stops_dummies = pd.get_dummies...
cleaned_stops_dummies...

**Take a close look at the new dataframe** we just made that has dummy variables for each category in the categorical variables. It's missing the original categorical variables. What if we want to keep them in the dataframe for some purpose (like ease of making boxplots)?

Let's add the categorical variables themselves back to the dataframe.
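A small sketch of adding a categorical column back with `.join`, using a hypothetical frame indexed by three row numbers from the sample above. The names `toy`, `toy_save_cols`, and `toy_rejoined` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_sex": ["female", "male", "female"],
     "arrest_made": [False, True, False]},
    index=[2469050, 1210355, 412047],
)
toy_dummies = pd.get_dummies(toy, columns=["subject_sex"], dtype=int)

# join aligns on the index, so each row gets its own original category back
toy_save_cols = ["subject_sex"]
toy_rejoined = toy_dummies.join(toy[toy_save_cols])
print(toy_rejoined.dtypes)
```

Because both frames share the same index, no missing values are introduced and the integer and boolean columns keep their types.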

In [None]:
# add the categorical variables back to the dataframe and check to make sure we got what we expect
# note: if you add back the columns using the function pd.concat, Pandas can turn the integers and booleans
# into floats (for example, if rows fail to align and NaNs are introduced), which will complicate things later, so use the .join method instead


save_cols = ...
cleaned_stops = ...

### 3. Some EDA Using Maps
Mapping is an excellent way to explore data that has location features, and in the Nashville traffic stop data we have latitude and longitude, as well as date and time. We can get an idea of the spatial distribution of traffic stops, and even some idea of how individual police officers make traffic stops. Refer to the mapping labs, and feel free to think of other things you would like to map.

In [None]:
# create a basemap of Nashville using your choice of tiles
# note that "Stamen Toner" works better with the markers we use below
# but that "Stamen Toner" will no longer render unless it is on Datahub
# so try for something that will still appear in the saved file, like 'OpenStreetMap'
# the Folium documentation is at https://python-visualization.github.io/folium/latest/user_guide/raster_layers/tiles.html#Other-tilesets


nashville_coords = (36.174465, -86.767960)

nashville_map = folium.Map(location=nashville_coords,
                           zoom_start=12,
                           tiles=...,
                           attr=...
                          )
nashville_map

In [18]:
# create a dataframe filtering for stops that resulted in searches and count how many searches happened in each year

searches = ...
searches['year'].value_counts()

In [20]:
# make a heatmap of the searches in the dataset
# to do this we first need to create an array of latitudes, then of longitudes
# then pass those to folium in the format it expects
lats = searches['lat'].values
longs = ...
search_locs = np.vstack...
search_locs[:5]
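The filtering and array-building steps in the last two cells can be sketched on toy data as follows. The coordinates come from the sample rows shown earlier; the `search_conducted` values and the names `toy`, `toy_searches`, and `toy_locs` are made up for illustration. This sketch uses `np.column_stack` rather than `np.vstack`; with `vstack` a transpose would be needed to get one row per point, which is the layout a Folium `HeatMap` takes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "lat": [36.10838, 36.123393, 36.046237, 36.078407],
    "lng": [-86.708519, -86.786429, -86.598773, -86.908912],
    "year": [2016, 2013, 2011, 2016],
    "search_conducted": [True, False, True, True],   # made-up values
})

# Boolean indexing keeps only the rows where a search happened
toy_searches = toy[toy["search_conducted"]]
searches_per_year = toy_searches["year"].value_counts()

# column_stack pairs the two 1-D arrays into rows of [lat, lng]
toy_locs = np.column_stack((toy_searches["lat"].values,
                            toy_searches["lng"].values))
print(searches_per_year)
print(toy_locs)
```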

In [None]:
# now make a heatmap of locations where searches happened

heatmap = ...
heatmap.add_to ...
nashville_map

**How informative is the heatmap of Nashville traffic stops that resulted in searches? Can you make any generalizations about searches from what you see?**

_Your answer here:_<p>

#### Changing Unit of Analysis to Individual Nashville PD Officers
Instead of using Nashville as our unit of analysis, we can examine searches by individual officers to see if we can better understand any patterns in the data. Our unit of observation remains individual stops. Here we will first make a table of the top 10 officers in terms of searches over the time period in the dataset and aggregate each column in the data in a way that makes sense. After that, we will try mapping the stops resulting in searches by particular officers. _This would look more interesting if we used all the data in the dataset rather than just our sample, because searches are an unusual outcome._

First, make a table of searches by officer, for the top 10 officers in descending order by number of searches, aggregating each feature in a meaningful way. For example, you would want to aggregate `'subject_age'` as a mean, rather than a sum, but for other features you may want to report both a sum and a mean (which for dichotomous variables is a proportion, remember). You should only use features that are summable (i.e., dichotomous features are fine, categorical ones are not).
* Hint 1: the sample data is not so voluminous so you can make a new dataframe with the columns you select
* Hint 2: when you aggregate by officer, you can pass [.agg a dictionary](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html#pandas.core.groupby.DataFrameGroupBy.agg) telling it what to do with each column
* Hint 3: you can sort the table by `'search_conducted'` and [the aggregation function index label 'sum'](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) with `by=('search_conducted','sum')`, since after the aggregation each column label is a tuple of (column name, function name)

The last bit of code in the cell renders the table and displays it nicely in the notebook. Note that the means and sums cover only the stops by each officer that were caught in our sample of 2000.
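Before tackling the full table, the pattern in Hints 2 and 3 can be tried on a toy frame. The officer IDs and values below are made up, and `toy_by_officer` and `toy_ranked` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "officer_id_hash": ["aaa", "aaa", "bbb", "bbb", "bbb", "ccc"],  # made-up IDs
    "subject_age": [20.0, 30.0, 40.0, 50.0, 60.0, 25.0],
    "search_conducted": [True, True, False, True, False, False],
})

toy_by_officer = toy.groupby("officer_id_hash").agg({
    "subject_age": "mean",                 # a single function
    "search_conducted": ["sum", "mean"],   # a count and a proportion
})

# The columns are now a MultiIndex, so the sort key is a (column, function) tuple
toy_ranked = toy_by_officer.sort_values(by=("search_conducted", "sum"),
                                        ascending=False)
print(toy_ranked.head(2))
```

Summing a True/False column counts the True values, and its mean is the share of that officer's stops that involved a search.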

In [23]:
# let's aggregate searches by individual officers to find the individual officers with the most searches over the period

# SOLUTION
# dichotomous features, each reported as both a count (sum) and a proportion (mean)
sum_mean_cols = ['arrest_made', 'citation_issued', 'warning_issued', 'search_conducted',
                 'subject_race_asian/pacific islander', 'subject_race_black',
                 'subject_race_hispanic', 'subject_race_other', 'subject_race_unknown',
                 'subject_race_white', 'subject_sex_female', 'subject_sex_male',
                 'violation_child restraint', 'violation_investigative stop',
                 'violation_moving traffic violation', 'violation_parking violation',
                 'violation_registration', 'violation_safety violation',
                 'violation_seatbelt violation', 'violation_vehicle equipment violation']

# new dataframe with just the officer ID, age, year, month, and the dichotomous features
agg_stops = pd.DataFrame(data=cleaned_stops,
                         columns=['officer_id_hash', 'subject_age', 'year', 'month'] + sum_mean_cols)

# subject_age only makes sense as a mean; every dichotomous feature gets a sum and a mean
agg_spec = {'subject_age': 'mean', **{col: ['sum', 'mean'] for col in sum_mean_cols}}
officer_stops = agg_stops.groupby(by=['officer_id_hash']).agg(agg_spec)

# sort on the (column, function) tuple and keep the top 10 officers
table = officer_stops.sort_values(by=('search_conducted', 'sum'), ascending=False)[:10]
display(HTML(table.to_html()))

| officer_id_hash | subject_age mean | arrest_made sum | arrest_made mean | citation_issued sum | citation_issued mean | warning_issued sum | warning_issued mean | search_conducted sum | search_conducted mean | subject_race_asian/pacific islander sum | subject_race_asian/pacific islander mean | subject_race_black sum | subject_race_black mean | subject_race_hispanic sum | subject_race_hispanic mean | subject_race_other sum | subject_race_other mean | subject_race_unknown sum | subject_race_unknown mean | subject_race_white sum | subject_race_white mean | subject_sex_female sum | subject_sex_female mean | subject_sex_male sum | subject_sex_male mean | violation_child restraint sum | violation_child restraint mean | violation_investigative stop sum | violation_investigative stop mean | violation_moving traffic violation sum | violation_moving traffic violation mean | violation_parking violation sum | violation_parking violation mean | violation_registration sum | violation_registration mean | violation_safety violation sum | violation_safety violation mean | violation_seatbelt violation sum | violation_seatbelt violation mean | violation_vehicle equipment violation sum | violation_vehicle equipment violation mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 82f2e31c2b | 28.0 | 1 | 0.11111111 | 2 | 0.22222222 | 8 | 0.88888889 | 8 | 0.88888889 | 0 | 0.0 | 0 | 0.0 | 1 | 0.11111111 | 1 | 0.11111111 | 0 | 0.0 | 7 | 0.77777778 | 3 | 0.33333333 | 6 | 0.66666667 | 0 | 0.0 | 0 | 0.0 | 6 | 0.66666667 | 0 | 0.0 | 2 | 0.22222222 | 0 | 0.0 | 0 | 0.0 | 1 | 0.11111111 |
| 0418967821 | 31.33333333 | 0 | 0.0 | 1 | 0.33333333 | 2 | 0.66666667 | 2 | 0.66666667 | 0 | 0.0 | 1 | 0.33333333 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 0.66666667 | 2 | 0.66666667 | 1 | 0.33333333 | 0 | 0.0 | 0 | 0.0 | 1 | 0.33333333 | 0 | 0.0 | 0 | 0.0 | 1 | 0.33333333 | 0 | 0.0 | 1 | 0.33333333 |
| 2b7a7efb70 | 29.75 | 0 | 0.0 | 1 | 0.25 | 4 | 1.0 | 2 | 0.5 | 0 | 0.0 | 3 | 0.75 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.25 | 2 | 0.5 | 2 | 0.5 | 0 | 0.0 | 0 | 0.0 | 1 | 0.25 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.25 | 2 | 0.5 |
| ccda12115e | 38.28571429 | 1 | 0.14285714 | 2 | 0.28571429 | 6 | 0.85714286 | 2 | 0.28571429 | 0 | 0.0 | 1 | 0.14285714 | 1 | 0.14285714 | 0 | 0.0 | 0 | 0.0 | 5 | 0.71428571 | 3 | 0.42857143 | 4 | 0.57142857 | 0 | 0.0 | 1 | 0.14285714 | 4 | 0.57142857 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 0.28571429 | 0 | 0.0 |
| 1d2836b079 | 38.5 | 0 | 0.0 | 0 | 0.0 | 2 | 1.0 | 2 | 1.0 | 0 | 0.0 | 1 | 0.5 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 | 2 | 1.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 |
| 6761423fb7 | 36.33333333 | 2 | 0.11111111 | 13 | 0.72222222 | 4 | 0.22222222 | 2 | 0.11111111 | 0 | 0.0 | 6 | 0.33333333 | 0 | 0.0 | 0 | 0.0 | 2 | 0.11111111 | 10 | 0.55555556 | 6 | 0.33333333 | 12 | 0.66666667 | 0 | 0.0 | 4 | 0.22222222 | 14 | 0.77777778 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 393639dfae | 33.0 | 1 | 0.5 | 0 | 0.0 | 1 | 0.5 | 2 | 1.0 | 0 | 0.0 | 1 | 0.5 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 0 | 0.0 | 1 | 0.5 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 |
| 00443cb79c | 29.25 | 0 | 0.0 | 1 | 0.25 | 3 | 0.75 | 1 | 0.25 | 0 | 0.0 | 3 | 0.75 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.25 | 3 | 0.75 | 1 | 0.25 | 0 | 0.0 | 0 | 0.0 | 2 | 0.5 | 1 | 0.25 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.25 |
| cb792661d8 | 41.0 | 0 | 0.0 | 0 | 0.0 | 2 | 1.0 | 1 | 0.5 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 2 | 1.0 | 0 | 0.0 | 2 | 1.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 | 0 | 0.0 | 0 | 0.0 | 1 | 0.5 |
| 1b55d714e1 | 28.375 | 0 | 0.0 | 1 | 0.125 | 7 | 0.875 | 1 | 0.125 | 0 | 0.0 | 7 | 0.875 | 1 | 0.125 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 1 | 0.125 | 7 | 0.875 | 0 | 0.0 | 0 | 0.0 | 4 | 0.5 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 4 | 0.5 |


In [24]:
# we can extract into a list the top 10 officers in terms of searches and 
# put them into their own dataframe to use in mapping

top_ten = ...

# hint: use the slicing method .loc to get just the rows where the Nashville PD officer .isin the top_ten

top_ten_map_df = ...
top_ten_map_df.shape  

(23, 47)
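The 23 rows in this output match the total of the `search_conducted` sums for the ten officers in the table (8+2+2+2+2+2+2+1+1+1), which suggests the filter keeps only stops by those officers that resulted in a search. Under that assumption, the same logic on a made-up frame might look like this (`toy`, `toy_counts`, `toy_top`, and `toy_top_df` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    "officer_id_hash": ["aaa", "bbb", "aaa", "ccc", "bbb", "aaa"],  # made-up IDs
    "search_conducted": [True, True, True, False, False, False],
})
toy_counts = (toy.groupby("officer_id_hash")["search_conducted"]
                 .sum()
                 .sort_values(ascending=False))

# The index of the ranked table holds the officer IDs; keep the top two here
toy_top = toy_counts.index[:2].tolist()

# Keep stops made by those officers that also resulted in a search
keep = toy["officer_id_hash"].isin(toy_top) & toy["search_conducted"]
toy_top_df = toy.loc[keep]
print(toy_top, toy_top_df.shape)
```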

In [None]:
# walk through the df to assign a color to each officer (i.e. row)
# this is probably way inefficient; if you know a more efficient way, by all means use it

color_list=[]
for row in top_ten_map_df.itertuples():
    for badge in top_ten:
        if badge==row.officer_id_hash: 
            color_list.append(top_ten.index(badge))
                  
top_ten_map_df['color']=color_list  

# take the index and (lat,lon) pairs from df and map them
colors=['red', 'blue', 'green', 'purple', 'orange', 'white', 'pink',
         'lightgreen', 'gray', 'black', 'lightgray']

for row in top_ten_map_df.itertuples(index=False):
    folium.Circle(
            location=[row.lat, row.lng], 
            popup=(row.year,row.violation,row.officer_id_hash,row.subject_race),
            radius=10,           
            color=colors[row.color],  # the color column is an integer that indexes colors list in this cell  
            fill=True,
            fill_color=colors[row.color]            
        ).add_to(nashville_map)
nashville_map
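As one possible alternative to the nested loop, the color index can be assigned in a single vectorized step by mapping each officer ID to its position in the list. This is a sketch on made-up IDs; `toy_top`, `toy_map_df`, and `position` are hypothetical names.

```python
import pandas as pd

toy_top = ["aaa", "bbb", "ccc"]   # made-up officer IDs, in rank order
toy_map_df = pd.DataFrame({"officer_id_hash": ["bbb", "aaa", "ccc", "aaa"]})

# Build an ID -> position lookup once, then map the whole column in one step
position = {badge: i for i, badge in enumerate(toy_top)}
toy_map_df = toy_map_df.assign(color=toy_map_df["officer_id_hash"].map(position))
print(toy_map_df)
```

Using `assign` returns a new dataframe, which also sidesteps the warning Pandas can raise when a column is added to a filtered slice of another dataframe.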

**Question:** How interesting is the map of searches by the top 10 NPD searching officers? About how many points would there be on the map if we had used the entire dataset of about 2.8M traffic stops?

_Your answer here_ <p>

### 4. Feature Importance and Feature Selection
Before we try to train any models to classify the Nashville traffic stops along some dimension (e.g., whether or not the stop resulted in a search), we need to think a bit about how the features in the dataset relate to the outcome of interest. There are several steps in this process.
* impute missing values for rows where we have the information to replace a NaN value (e.g. `'contraband_found'`)
* remove features that are non-numeric so that we can create a matrix of bivariate correlations
* examine the bivariate correlations among all the features to see if there are some good candidate predictors for our outcome and check that they are not collinear
* make sure that none of the likely predictors are logically posterior to the outcome we are trying to predict

The last point is often talked about in Machine Learning as "data leakage." The basic idea is that you cannot use a predictor that requires you to already know the value of the outcome. There are a few of these in the Nashville police stop dataset if we are trying to train a model that predicts when a stop results in a search.
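The second and third steps above can be sketched on a toy frame: keep only numeric and boolean columns, then look at how each remaining feature correlates with the outcome. All values below are made up, and `toy`, `toy_numeric`, and `toy_corr` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A ST", "B AVE", "C RD", "D LN", "E PIKE", "F CT"],   # non-numeric
    "subject_age": [19.0, 22.0, 45.0, 51.0, 33.0, 60.0],
    "violation_investigative stop": [1, 1, 0, 0, 1, 1],
    "search_conducted": [True, True, False, False, True, False],
})

# Keep numeric and boolean columns only; strings cannot enter a correlation matrix
toy_numeric = toy.select_dtypes(include=["number", "bool"]).astype(float)

# Correlation of every remaining feature with the outcome
toy_corr = toy_numeric.corr()["search_conducted"].drop("search_conducted")
print(toy_corr)
```

A strong correlation is only a starting point: a feature still has to be checked for whether it could be known before the outcome occurs.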

#### Taking care of missing values and categorical variables
Remember above where we created a dataframe with dummy variables to encode the categorical variables `violation`, `subject_race`, and `subject_sex`? That means we have taken care of the categorical variables. We still need to do something with the NaN values for `contraband_found` since we might be interested in that. Some features, like `search_basis`, we may just want to leave as is for now.
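A minimal sketch of one way to recode the missing values, following the readme's own convention of treating a missing `contraband_found` as false. The toy values and the frame name `toy` are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "search_conducted": [False, True, True, False],
    "contraband_found": [np.nan, True, False, np.nan],   # NaN where no search happened
})

# Treat "no search" as "no contraband found", then cast so the column can be summed
toy["contraband_found"] = toy["contraband_found"].fillna(False).astype(bool)

print(toy["contraband_found"].isna().sum(), toy["contraband_found"].sum())
```

Whether False is the right replacement is a modeling choice: it removes the NaNs but also erases the distinction between "searched, nothing found" and "never searched".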

In [None]:
# recode the NaN values for 'contraband_found' into something that can be summed and confirm the result


In [None]:
cleaned_stops_dummies.columns

In [None]:
# remove the non-numeric features from the 'cleaned_stops_dummies' df 
non_summable_cols = ...
correlation_df = ...
correlation_df.info()

In [None]:
# create a correlation matrix and a heatmap visualization for it 
# using seaborn's heatmap function
# the following code comes from the Seaborn examples page
# https://seaborn.pydata.org/examples/many_pairwise_correlations.html


# Compute the correlation matrix
corr = correlation_df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, 
            square=True, linewidths=.5, cbar_kws={"shrink": .5})


#### Notes on feature selection using the correlation matrix and heatmap
Remember that originally `'contraband_found'` was coded as a NaN unless there had been a search; when there was a search it was coded as either True or False. We made a choice not to exclude `'contraband_found'` values with NaNs in order not to eliminate most of the sample.

The problem is that now the overwhelming majority of observations for `'contraband_found'` are False, since of course no contraband _could_ be found if there was no search.
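One way to inspect how the recoded feature lines up with the outcome is a cross-tabulation. The sketch below uses made-up values in which NaN has already been recoded to False; `toy`, `leak_table`, and `found_implies_search` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "search_conducted": [False, False, False, True, True, False],
    "contraband_found": [False, False, False, True, False, False],
})

# Rows: was there a search? Columns: was contraband found?
leak_table = pd.crosstab(toy["search_conducted"], toy["contraband_found"])
print(leak_table)

# Check whether every stop with contraband found is also a stop with a search
found_implies_search = toy.loc[toy["contraband_found"], "search_conducted"].all()
print(found_implies_search)
```

An empty cell in such a table is worth thinking about when answering the questions that follow.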

**Questions**

1. Does this mean we can use `'contraband_found'` as a predictor for `'search_conducted'`? How about other possible predictors like `'search_person'`, `'search_vehicle'`, and `'frisk_performed'`? Why or why not?
2. Can you find predictors from the correlation matrix that look as though they would work? Which ones?

**Answer:** <p>
    _your answer here_

In [None]:
# reduce variables to reasonable predictors based on heatmap, no posthoc predictors
# quick and dirty heatmap with annot=True
# note at end about reference category and collinearity problem 
# if *all* the dummy variables are used as predictors at once
reasonable_features = ...
reasonable_df = correlation_df[reasonable_features]
reasonable_matrix = reasonable_df.corr()
plt.figure(figsize=(4,4))
g = sns.heatmap(reasonable_matrix, annot=True)
reasonable_matrix
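The note about a reference category can be illustrated with a small sketch. A full set of dummies for one categorical variable always sums to 1 across each row, so any one of them is determined by the others. The frame and names below (`toy`, `all_dummies`, `with_reference`) are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"subject_sex": ["female", "male", "female", "male"]})

all_dummies = pd.get_dummies(toy, columns=["subject_sex"], dtype=int)

# Each row has exactly one 1 across the full set of dummies
row_totals = all_dummies.sum(axis=1)

# drop_first=True omits the first category, which becomes the reference
with_reference = pd.get_dummies(toy, columns=["subject_sex"], dtype=int,
                                drop_first=True)
print(row_totals.tolist())
print(with_reference.columns.tolist())
```

Leaving one category out by hand when choosing predictors has the same effect as `drop_first=True`, and lets you pick which category serves as the reference.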

**One last question:**

How well do you think we will be able to predict whether or not a driver was searched using the features available to us in the Nashville traffic stop data? What other data do you think would help us make better predictions?    

_Your answer here:_<p>
    

In [None]:
cleaned_stops.search_basis.value_counts(dropna=False)