# Lab 11: Data Cleaning & EDA for Classification

In this section of Lab 11, we will prepare the Nashville police stops dataset (from Lab 3!) for classification tasks. Our primary goals are to clean the dataset by handling null values and to address potential data leakage.



In [None]:
# Just run this cell
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error

We'll use the Nashville police stops dataset again, as we did in Lab 3.

In [None]:
# Loading the dataset.
police_stops = pd.read_csv("https://github.com/ds-modules/data/raw/main/nashville.csv").sample(n=1000, random_state = 42).reset_index()
police_stops.head(5)

## Data Preparation

Recalling the data operations from Lab 3 and Lab 6, data cleaning plays a crucial role in the quality of our analysis. Here are some concise yet effective strategies:



1.   Address Missing Data
    * Identify and manage missing data to avoid skewed analyses.
    * Strategies include removing missing values or statistically imputing them, depending on their impact and prevalence.

2.   Recode Categorical Variables
    * Transform each category of a categorical variable into its own binary (0 or 1) column, known as one-hot encoding, so that models can work with it.
    * More on this later!

3.   Standardize Scale
    * Align all variables to a consistent scale for better comparability and interpretation of results.
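As a rough illustration of the three strategies above, here is a minimal sketch on a tiny made-up DataFrame. The column names (`age`, `vehicle_type`) and the choice of median imputation are assumptions for the example, not part of the Nashville dataset or a recommendation for it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these columns are NOT from the Nashville dataset.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "vehicle_type": ["car", "truck", None, "car"],
})

# 1. Missing data: count nulls per column, then impute.
print(toy.isna().sum())
toy["age"] = toy["age"].fillna(toy["age"].median())          # numeric: fill with the median
toy["vehicle_type"] = toy["vehicle_type"].fillna("unknown")  # categorical: explicit label

# 2. Recode categorical variables as 0/1 indicator columns (one-hot encoding).
toy_encoded = pd.get_dummies(toy, columns=["vehicle_type"], dtype=int)

# 3. Standardize scale: convert age to z-scores (mean 0, standard deviation 1).
age = toy_encoded["age"]
toy_encoded["age_z"] = (age - age.mean()) / age.std()

print(toy_encoded)
```

Whether to drop or impute, and which fill value to use, depends on the column and on how much data is missing, so treat the choices here as placeholders.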










✅ **Question 1:** Clean the dataset so that it doesn't have null values.


Hint: There's a pandas function that does this for you.

In [None]:
# Replace null values with zero
...

police_stops.head(3)

### One-Hot Encoding

Notice that a few of the variables that are numeric/integer actually encode categorical data. In order to analyze the dataset, we can convert these to their true categorical values.

In [None]:
police_stops.info()

In [None]:
# Just run this cell.
police_encoded = police_stops.replace({'' : {0 : 'no', 1 : 'yes'}})
police_encoded.head()

### Dealing with Data Leakage

In the context of criminal justice datasets like the Nashville police stops, it's crucial to identify and eliminate data leakage to ensure our models are predictive, not just reflective of post-event outcomes.



*   What is Data Leakage?
  *   Data leakage occurs when information from outside the training dataset, particularly features that would not be available at the time of prediction, inadvertently influences the model.
  *   This can lead to overly optimistic performance estimates and poor generalization to new data.


*  Leakage in Criminal Justice Datasets
  *    In datasets related to criminal justice, leakage can manifest through variables that contain information about events or outcomes that occur after the police stop (e.g., the result of a legal case).
  *   Using such data in predictive models can result in biased and unreliable predictions.
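To make the idea of leakage concrete, here is a small sketch on synthetic data. Everything in it is invented for illustration: the column names (`searched`, `hour`, `contraband_found`) and the way the values are generated are assumptions, not properties of the Nashville dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical synthetic stops; all names and values are made up.
searched = rng.integers(0, 2, size=n)   # target: was a search conducted?
hour = rng.integers(0, 24, size=n)      # known at the time of the stop
# Contraband can only be found if a search happened, so this column
# is recorded AFTER the outcome we are trying to predict.
contraband_found = searched * rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    "searched": searched,
    "hour": hour,
    "contraband_found": contraband_found,
})

# Correlation of each feature with the target.
corr_with_target = toy.corr()["searched"].drop("searched")
print(corr_with_target)

# Whenever the leaky column is 1, the target is 1 by construction.
leak_rate = toy.loc[toy["contraband_found"] == 1, "searched"].mean()
print(leak_rate)
```

The post-event column looks like an excellent predictor, but only because it already encodes the answer. A model that relies on it could not be used at the moment of the stop.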








Data leakage is clearly an issue we need to deal with! Here are some strategies to mitigate it.



1.   Identify Potential Leaky Variables
  * Carefully review each variable to determine if it contains post-event information or outcomes that would not be known at the time of the police stop.
2.   Remove Leaky Variables
  * Once identified, these variables should be removed from the dataset to prevent their influence on the model.
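The two steps above might look like the following sketch. The toy frame and its column names are hypothetical, and which columns count as leaky is a judgment call that depends on when each value is recorded.

```python
import pandas as pd

# Hypothetical toy frame; column names are for illustration only.
toy = pd.DataFrame({
    "hour": [9, 22, 14],
    "searched": [0, 1, 1],
    "contraband_found": [0, 1, 0],
    "citation_issued": [1, 0, 0],
})

target = "searched"

# Step 1: list the columns judged to be recorded during or after the outcome.
leaky_vars = ["contraband_found", "citation_issued", "not_in_this_frame"]

# Step 2: remove them (and the target) from the feature set.
# errors="ignore" skips any listed name that is not actually a column.
features = toy.drop(columns=leaky_vars + [target], errors="ignore")
labels = toy[target]

print(features.columns.tolist())
```

Keeping the leaky names in an explicit list makes the decision easy to review and revise later.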



In [None]:
# Inspect the features of the dataset (i.e. all columns in dataset)
police_stops.columns.values

Notice the following about the columns above:


*   Outcome Variables
    * These reflect events or decisions made during or after a police stop and would not be known beforehand.
*   Search-Related Variables
    * These are also tied to the outcome of a search, which is a post-event activity.

Let's classify the columns into outcome variables and search-related variables.



In [None]:
outcome_vars = [...]
search_related_vars = [...]

## Feature Selection

Now we're finally ready to do some analysis!

In this section, we will focus on identifying which features in the Nashville police stops dataset are most relevant for predicting the outcome `search_conducted`.
Feature selection is crucial in machine learning as it helps in reducing the dimensionality of the data, improving model performance, and enhancing interpretability.

Our goal is to use the correlation coefficient to measure the relationship between each feature and the `search_conducted` variable. This method helps in pinpointing features that have a stronger linear relationship with the target variable.

To further our analysis, we'll generate a correlation heatmap, a visual tool used in feature selection to show how strongly each feature is correlated with the target variable (`search_conducted` in our case).

Don't worry if you don't fully understand what we're doing here. Once you create the heatmap, we'll walk you through the analysis.

Hint: Pandas has a built-in function that computes pairwise correlation of columns.
To learn more, visit here: [correlation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)

To learn about seaborn's heatmap, visit here: [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)

You might find this helpful for one-hot encoding categorical variables: [get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)
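Here is a minimal sketch of how these pieces fit together on a made-up DataFrame. The columns `target`, `score`, and `group` are hypothetical and stand in for whatever columns you are working with.

```python
import pandas as pd

# Hypothetical toy data, not the Nashville dataset.
toy = pd.DataFrame({
    "target": [1, 0, 1, 0, 1, 0],
    "score": [5, 1, 4, 2, 6, 1],
    "group": ["a", "b", "a", "b", "a", "a"],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["group"], dtype=int)

# Pairwise (Pearson) correlation between all numeric columns.
corr_matrix = encoded.corr()

# Keep only the correlations with the target, drop the target's
# correlation with itself, and sort from most positive to most negative.
target_corr = corr_matrix["target"].drop("target").sort_values(ascending=False)
print(target_corr)
```

To visualize the result, a one-column frame such as `target_corr.to_frame()` can be passed to `sns.heatmap`, for example with `annot=True` to print the coefficients in each cell.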

In [None]:
# Drop columns we won't encode (IDs, dates, free text, etc.), then one-hot encode the remaining categorical variables
police_stops_without_categorical = police_stops.drop(['date', 'location', 'raw_row_number', 'officer_id_hash', 'violation', 'outcome', 'search_basis', 'notes', 'type'], axis=1)
police_stops_encoded = ...

# Calculate the correlation matrix
correlation_matrix = ...

# Isolate the correlation coefficients with the target variable 'search_conducted' and sort the values
search_conducted_correlation = ...

# Visualize the correlation coefficients using a heatmap
...
...
...
...
...

### Understanding the heatmap

Now that we've generated a correlation heatmap, let's interpret the results.


1.   Understanding the Correlation Heatmap
  *   The heatmap displays the correlation coefficients between the features and the target variable `search_conducted`.
  *   Correlation coefficients range from -1 to 1. Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate little or no linear relationship.
  *   A positive correlation means that as one feature increases, the target variable tends to increase. A negative correlation means that as one feature increases, the target variable tends to decrease.

2.   Interpreting the Results
  *   Features with higher absolute correlation values are typically considered more important for prediction.
  *   For example, features like `search_person`, `raw_search_consent`, and `raw_passenger_searched` have high positive correlation coefficients, suggesting they're good predictors of `search_conducted`.
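One caveat when reading coefficients near 0: the correlation coefficient only measures linear association. The toy example below (made-up numbers, unrelated to the police stops data) shows a variable that is completely determined by another yet has a correlation of essentially zero with it.

```python
import numpy as np

# x is symmetric around zero; y is fully determined by x, but not linearly.
x = np.arange(-5, 6)
y = x ** 2

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-9)
```

So a small coefficient means "no linear relationship", which is weaker than "no relationship at all".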

Remember not to rely solely on correlation for feature selection, as correlation does not imply causation. You should consider the context and meaning of each feature, and be aware of potential ethical and legal implications, especially in sensitive areas such as criminal justice.




✅ Which features are most highly correlated with `search_conducted`? Why might these features have strong correlations?

In [None]:
solution = ""
print(solution)

✅ Discuss the potential implications of using these features in a predictive model within the criminal justice system. How might this affect fairness and accuracy?

In [None]:
solution = ""
print(solution)