In [None]:
# Import packages

import pandas as pd
import numpy as np
import altair as alt
alt.data_transformers.disable_max_rows()

: 

In [None]:
# Load traffic stops data and create a dataframe called stops
# and check the columns and their types
stops = pd.read_csv('Officer_Traffic_Stops.csv')
stops.head()

: 

In [None]:
# view the data by just typing the dataframe name
stops

: 

In [None]:
# check the data using df.info()
stops.info()

: 

In the output above, notice which columns are typed `object` and which are `int64`. These may not be the types we want, so keep them in mind for later.
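As a small illustration of checking and changing column types, here is a sketch on a toy frame. The rows and the choice of `category` as the target type are assumptions for illustration, not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real traffic stops file
toy = pd.DataFrame({
    "Driver_Age": [23, 41, 35],
    "Was_a_Search_Conducted": ["No", "Yes", "No"],
})

# Inspect the inferred type of every column
print(toy.dtypes)

# Convert a text column with few distinct values to a categorical type
toy["Was_a_Search_Conducted"] = toy["Was_a_Search_Conducted"].astype("category")
print(toy.dtypes)
```

The same `astype` call should work on `stops` once you have decided which columns need a different type.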

Let's consider our target variable: `Was_a_Search_Conducted`.

**Plot** a bar chart that counts the number of records by `Was_a_Search_Conducted`.

In [None]:
## Bar chart
bar = alt.Chart(stops).mark_bar(size=30).encode(
    x=alt.X('Was_a_Search_Conducted', axis=alt.Axis(title='Was a Search Conducted')),
    y=alt.Y('count()', axis=alt.Axis(title='Count')),
    color=alt.Color('Was_a_Search_Conducted', scale=alt.Scale(domain=['Yes', 'No'], range=['steelblue', 'lightgray'])),
    )

bar

: 

Next, let's consider the age range of the driver. 

**Plot** a histogram of `Driver_Age`. Determine an appropriate number of bins.

In [None]:
## Histogram
histogram = alt.Chart(stops).mark_bar(tooltip=True).encode(
    alt.X("Driver_Age:Q", bin=alt.Bin(maxbins= 50)),
    y='count()',
)

histogram

: 

Once you go above (around) 40-50 bins, you'll notice some points stick out. 

What is happening?

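Driver ages are recorded as whole years. When the bin width is not a whole number of years, some bins cover more distinct ages than their neighbors, so those bins stick out.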
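The binning effect can be reproduced without a chart. The sketch below uses made-up, perfectly uniform ages (an assumption for illustration) and compares one-year bins with 1.5-year bins using `np.histogram`.

```python
import numpy as np

# Toy data (an assumption for illustration): every age from 20 to 59 appears 10 times
ages = np.repeat(np.arange(20, 60), 10)

# Bins one year wide: each bin holds exactly one age
counts_whole, _ = np.histogram(ages, bins=np.arange(20, 61, 1))

# Bins 1.5 years wide: bins alternate between holding two ages and one age
counts_fractional, _ = np.histogram(ages, bins=np.arange(20, 62, 1.5))

print(counts_whole[:4])
print(counts_fractional[:4])
```

Even though the underlying data is flat, the fractional-width bins produce a sawtooth pattern, which is the same kind of spike seen in the histogram.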

**Plot** a box plot with `Was_a_Search_Conducted` on the x-axis and `Driver_Age` on the y-axis.

In [None]:
## Box plot

box_plot = alt.Chart(stops).mark_boxplot().encode(
    x='Was_a_Search_Conducted',
    y='Driver_Age',
    color=alt.Color('Was_a_Search_Conducted', scale=alt.Scale(domain=['Yes', 'No'], range=['steelblue', 'lightgray'])),

)

box_plot

: 

**Plot** a violin plot where the fill is the response variable `Was_a_Search_Conducted`. See https://altair-viz.github.io/gallery/violin_plot.html

In [None]:
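# Violin plot

# Estimate the age density separately for each search outcome
violin = alt.Chart(stops).transform_density(
    'Driver_Age',
    as_=['Driver_Age', 'density'],
    groupby=['Was_a_Search_Conducted']
).mark_area(
    orient='horizontal'
).encode(
    y=alt.Y('Driver_Age:Q'),
    color='Was_a_Search_Conducted:N',
    # Mirror each density around the center line to get the violin shape
    x=alt.X(
        'density:Q',
        stack='center'
    ),
    # One violin per search outcome, as in the gallery example
    column='Was_a_Search_Conducted:N'
).properties(
    width=300,
    height=300
)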

violin


: 

From the plots above, do you think the age of the driver is a significant factor in whether a search was conducted? Why or why not?

No, it seems that driver age isn't a major factor because search frequency is fairly consistent across all age groups. 
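One way to check this answer numerically is to compute the share of stops that led to a search within each age band. The sketch below runs on made-up rows, and the band edges are an arbitrary assumption, so treat it as a pattern to adapt rather than a result.

```python
import pandas as pd

# Hypothetical toy rows; the ages and outcomes are made up for illustration
toy = pd.DataFrame({
    "Driver_Age": [18, 25, 29, 33, 40, 44, 50, 58, 65, 72],
    "Was_a_Search_Conducted": ["Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Flag searched stops, then bucket ages into bands (band edges are an arbitrary choice)
toy["searched"] = toy["Was_a_Search_Conducted"].eq("Yes")
toy["age_band"] = pd.cut(toy["Driver_Age"], bins=[15, 30, 45, 60, 100])

# Share of stops in each band that led to a search
search_rate = toy.groupby("age_band", observed=True)["searched"].mean()
print(search_rate)
```

If the rates on the real data differ noticeably between bands, that would be a reason to revisit the conclusion drawn from the plots.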

## Date of stop

Let's plot the number of stops by time. 

Recall from part one that the `Month_of_Stop` variable is stored as a string, not a date. The datetimes are simply when the data was collected, not when the stop occurred. Therefore, we'll need to convert the `Month_of_Stop` variable from a string to a datetime format.

Uncomment the next 4 code blocks. They are given to you, but please study them and make sure you know what they do.

In [None]:
stops['Month_of_Stop'] = stops['Month_of_Stop'].astype('datetime64[ns]')

: 

In [None]:
stops['Month_of_Stop'] = pd.to_datetime(stops['Month_of_Stop'], format='%y%m%d')

: 

In [None]:
stops.info()

: 

In [None]:
stops

: 
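The string-to-date conversion above can be tried on a small series first. This is a sketch only: the `"YYYY/MM"` layout of the raw strings is an assumption, so check the actual values in the file before reusing the format string.

```python
import pandas as pd

# Assumed raw format "YYYY/MM"; check the actual strings in the file before reusing this
raw = pd.Series(["2016/01", "2016/02", "2016/03"])

# An explicit format makes the parse strict: strings that do not match raise an error
parsed = pd.to_datetime(raw, format="%Y/%m")
print(parsed)
```

Each month is stored as the first day of that month, which is what lets Altair treat the column as a temporal axis.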

**Plot** a line chart with the number of traffic stops for each month (hint: think about the aggregations we did in class.).

In [None]:
## Create a dataframe named stops_count for this graph that has a column named Month_of_Stop
## Also create a column named count representing the number of stops per month  https://sparkbyexamples.com/pandas/pandas-groupby-count-examples/
stops_count = stops.groupby('Month_of_Stop').size().reset_index(name='count')
stops_count


: 

In [None]:
## Line chart

line = alt.Chart(stops_count).mark_line().encode(
    x='Month_of_Stop',
    y='count'
)

line


: 

What is the trend (i.e., long term rate of change) of the number of traffic stops in Charlotte? 

The number of monthly stops appears consistent over time. There are some decreases and increases, but the count averages around 2800 per month.
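To back up a statement about trend with numbers, one option is a rolling mean plus the slope of a fitted line. The counts below are hypothetical, not the real Charlotte figures, and the 3-month window is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly counts, not the real Charlotte figures
toy_counts = pd.DataFrame({
    "Month_of_Stop": pd.date_range("2016-01-01", periods=5, freq="MS"),
    "count": [2700, 2900, 2800, 2750, 2850],
})

# A 3-month rolling mean smooths out month-to-month noise
toy_counts["rolling_mean"] = toy_counts["count"].rolling(window=3).mean()

# Slope of a least-squares line through the counts: average change in stops per month
slope, intercept = np.polyfit(np.arange(len(toy_counts)), toy_counts["count"], deg=1)
print(toy_counts)
print(slope)
```

A slope that is small relative to the average monthly count supports calling the series roughly flat.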

**Plot** the same plot but add in `.facet()` by the `Reason_for_Stop` variable.

In [None]:
## create the dataframe that groups by month and reason for stop 
stops_reason = stops.groupby(['Month_of_Stop', 'Reason_for_Stop']).size().reset_index(name='count')
stops_reason


: 

In [None]:
## Facet chart https://stackoverflow.com/questions/64770801/python-altair-facet-line-plot-with-multiple-variables

lines = alt.Chart(stops_reason).mark_line().encode(
    x='Month_of_Stop',
    y='count',
    color='Reason_for_Stop',
    tooltip=['Month_of_Stop', 'count', 'Reason_for_Stop']
).facet(
    column='Reason_for_Stop:N'
)

lines

: 

What is a problem with this plot? 

The y-axis is shared across all reasons for stops, so the reasons with few stops are flattened against the bottom of their panels and their changes are hard to see.

To address this problem, you will need to figure out how to adjust the scale. Check Altair's documentation to see whether there is a function that could help you.

https://altair-viz.github.io/user_guide/scale_resolve.html

What method allows you to modify the scales of a faceted chart?

`resolve_scale`

**Plot** the same plot but with a free y-axis scale. (This may take some research but very findable)

In [None]:
# Updated Facet Chart
updated_lines = alt.Chart(stops_reason).mark_line().encode(
    x='Month_of_Stop',
    y='count',
    color='Reason_for_Stop',
    tooltip=['Month_of_Stop', 'count', 'Reason_for_Stop']

).facet(
        column='Reason_for_Stop:N'
    ).resolve_scale(
    y='independent'
)

updated_lines

: 

Which type of police stop has had the most volatility (i.e., big swings in number of stops)? 

Driving While Impaired
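Volatility can also be quantified rather than judged by eye. The sketch below uses the coefficient of variation (standard deviation divided by mean), which is one possible measure chosen here as an assumption, and the counts are made up.

```python
import pandas as pd

# Hypothetical monthly counts for two stop reasons
toy_reason = pd.DataFrame({
    "Reason_for_Stop": ["Speeding"] * 3 + ["Driving While Impaired"] * 3,
    "count": [100, 110, 90, 2, 10, 3],
})

# Standard deviation divided by mean: swing size relative to the typical level
stats = toy_reason.groupby("Reason_for_Stop")["count"].agg(["mean", "std"])
volatility = stats["std"] / stats["mean"]
print(volatility.sort_values(ascending=False))
```

Applied to `stops_reason`, the reason with the largest value would be the most volatile by this measure.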

What is one problem with allowing the y-axis to be free?

Small multiples tend to be less effective when the panels are on different scales or magnitudes, because the heights can no longer be compared directly across panels.
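One possible workaround, offered here as an assumption rather than as part of the assignment, is to rescale each series by its own mean so that all panels can share a single axis. A minimal sketch on made-up counts:

```python
import pandas as pd

# Hypothetical monthly counts for two stop reasons
toy_reason = pd.DataFrame({
    "Reason_for_Stop": ["Speeding"] * 3 + ["Driving While Impaired"] * 3,
    "count": [100, 110, 90, 2, 10, 3],
})

# Divide each count by its own group's mean so 1.0 means "a typical month"
group_mean = toy_reason.groupby("Reason_for_Stop")["count"].transform("mean")
toy_reason["relative_count"] = toy_reason["count"] / group_mean
print(toy_reason)
```

Plotting `relative_count` on a shared y-axis keeps the panels comparable, at the cost of hiding the absolute number of stops.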

Let's instead consider CMPD traffic stops by CMPD division. These are more evenly spread across divisions than across types of stop.

**Plot** a line chart of stops by `Month_of_Stop`, grouped by division instead of reason.

In [None]:
# Facet plot for division stops
stops_division = stops.groupby(['Month_of_Stop', 'CMPD_Division']).size().reset_index(name='count')

division_lines = alt.Chart(stops_division).mark_line().encode(
    x='Month_of_Stop',
    y='count',
    color='CMPD_Division',
    tooltip=['Month_of_Stop', 'CMPD_Division']

).facet(
        column='CMPD_Division:N'
    ).resolve_scale(
    y='independent'
)

division_lines

: 

What are three observations you can make about the number of police stops by division? (hint: just write about what's in the data.)

1. The number of stops across all divisions dramatically increased after October 1, 2016.

2. After April 1, 2017, stops started to decrease across all divisions.

3. Most divisions have around 200 stops per month.