<a href="https://colab.research.google.com/github/catawba-data-mining/CIS-3902-Data-Mining/blob/main/Chapter11_Homework_4_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Catawba College Data Mining Class Data 100: Chapter 11 Case Study: Berkeley Policing

INSTRUCTIONS: To open this file in Google COLAB, click on the COLAB link (blue). Follow all instructions in the Program once in Colab. You do not have to turn in the code. After studying the code and the output with your group, return to Blackboard and complete the rest of the activity.

In this notebook, we will clean a dataset, and then use it for some basic exploration. 

We must begin by installing some necessary packages. 

STEP 1: Place your cursor (click) in the code cells and click on the triangle to the left of the code to execute (click RUN ANYWAY on first code block if you get an authorization error). Some code blocks WILL NOT display any output. Some code blocks generate many messages! You can clear these by clicking on the x where the messages are displayed.

In [None]:
# STEP 1: Install datascience first, because it is not included in our programming environment by default.
# More information (optional reading): https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/
#!pip install datascience
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install datascience
!{sys.executable} -m pip install sodapy
!{sys.executable} -m pip install seaborn
#
# After this cell runs, you can click on the x (the person icon changes to an x when you hover over it) to clear the messages.

In [None]:
# Many of these import statements are repeated across projects.
from datascience import *
import numpy as np
import pandas as pd
import seaborn as sns
from sodapy import Socrata
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

We begin by obtaining our copy of the Calls dataset. We are going to use the Socrata API to download the dataset. 

In [None]:
# This uses the Socrata API to download a copy of the dataset. 

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofberkeley.info", None)

# Results returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("k2nh-s5h5", limit=2000)
# The limit restricts us to 2000 records,
# but we could change this number to a larger value.


# Convert to pandas DataFrame
results = pd.DataFrame.from_records(results)

results

In [None]:
# We must remove columns that we don't wish to use. 
# In this case, there are some geographic variables we can discard immediately. 
calls = results[['caseno', 'offense', 'eventdt', 'eventtm', 'cvlegend', 'cvdow', 'indbdate', 'block_location', 'blkaddr', 'city', 'state']]
calls

Now that we have obtained a copy of the dataset, and stored it in a dataframe, it is time to examine the dataset for potential problems. 

We will begin by looking for rows that contain at least one missing value.

In [None]:
# True if row contains at least one null value
null_rows = calls.isnull().any(axis=1)
calls[null_rows]
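
To see which columns are responsible for the missing values, a per-column count can complement the row filter above. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are hypothetical and are not taken from the Berkeley data. On the real data, the same calls would be applied to `calls`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `calls`
toy = pd.DataFrame({
    'caseno': ['1', '2', '3'],
    'blkaddr': ['100 MAIN ST', np.nan, '200 OAK AVE'],
    'offense': ['THEFT', 'BURGLARY', None],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Names of the columns that contain at least one missing value
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

Counting by column makes it easier to judge whether a gap affects a variable you actually plan to use.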

A small number of calls don't have values listed for block address (blkaddr). We can make assumptions about what these values might be, but we must remember that these are assumptions. 

Another point worth noting is that the event date (eventdt) lists every time as midnight, while the actual time is stored in event time (eventtm). We can write a function that manipulates the strings to make a new column that merges the two.

In [None]:
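def combine_event_datetimes(calls):
    # Keep only the YYYY-MM-DD part of eventdt, then append the time string.
    # pandas infers the datetime format on its own; the older
    # infer_datetime_format argument is deprecated in pandas 2.x.
    combined = pd.to_datetime(
        calls['eventdt'].str[:10] + ' ' + calls['eventtm']
    )
    # assign returns a new DataFrame and leaves the input unchanged
    return calls.assign(eventdttm=combined)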

# To peek at the result without mutating the calls DF:
calls.pipe(combine_event_datetimes).head(2)

In [None]:
# Note that the calls dataframe is unaltered. 
calls
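
The string manipulation used to combine the date and time can be checked on a tiny example. This sketch assumes the date strings look like `2020-08-12T00:00:00.000` and the times look like `14:30`. Those sample values are made up for illustration, and the actual formats should be confirmed against the downloaded data.

```python
import pandas as pd

# Hypothetical sample rows; the real formats may differ
toy = pd.DataFrame({
    'eventdt': ['2020-08-12T00:00:00.000', '2020-08-13T00:00:00.000'],
    'eventtm': ['14:30', '09:05'],
})

# Keep only the YYYY-MM-DD part of the date, then append the time
combined_strings = toy['eventdt'].str[:10] + ' ' + toy['eventtm']

# An explicit format makes the parse strict, so malformed rows raise an error
combined = pd.to_datetime(combined_strings, format='%Y-%m-%d %H:%M')

# assign returns a new DataFrame; `toy` itself is left unchanged
toy_with_dttm = toy.assign(eventdttm=combined)
print(toy_with_dttm)
print(toy.columns.tolist())
```

Passing an explicit `format` is optional, but it surfaces unexpected time strings immediately instead of letting them be guessed at.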

It is also useful to check which columns were entered by hand. One way of doing that is to look at the unique values in a column. Data entered by humans requires special consideration, because it may contain problems such as spelling errors or inconsistent capitalization.

We are going to look more closely at two columns: the offense type (offense) and the event description (cvlegend).

In [None]:
calls['offense'].unique()

In [None]:
calls['cvlegend'].unique()
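
If the unique values reveal variants of the same label, counting them before and after a light cleanup shows how much they matter. This is a sketch on made-up labels; the `offense` Series below is hypothetical, and the Berkeley data may not contain such variants at all.

```python
import pandas as pd

# Hypothetical offense labels with inconsistent case and stray spaces
offense = pd.Series(['THEFT', 'Theft ', 'BURGLARY', 'theft', 'BURGLARY'])

# Raw counts treat each spelling variant as a separate category
print(offense.value_counts())

# Normalize whitespace and case before counting
cleaned = offense.str.strip().str.upper()
counts = cleaned.value_counts()
print(counts)
```

A drop in the number of categories after cleaning is a sign that the column was typed by hand.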

Lastly, we can make a small dataframe to allow cvdow to be matched to a specific day of the week. 

In [None]:
day_of_week = pd.DataFrame([['0', 'Sunday'], ['1', 'Monday'], ['2', 'Tuesday'], ['3', 'Wednesday'], ['4', 'Thursday'], 
                            ['5', 'Friday'], ['6', 'Saturday']], 
                           columns=['cvdow', 'day'])
day_of_week

In [None]:
def match_weekday(calls):
    return calls.merge(day_of_week, on='cvdow')
calls.pipe(match_weekday).head(2)
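
By default, `merge` performs an inner join, so any row whose cvdow value has no match in the lookup table is silently dropped. The sketch below uses made-up rows (the names `toy` and `day_lookup` and the code '9' are hypothetical) to show how a left join with `indicator=True` can reveal such rows.

```python
import pandas as pd

# Hypothetical lookup table and calls; '9' has no matching day
day_lookup = pd.DataFrame({'cvdow': ['0', '1'], 'day': ['Sunday', 'Monday']})
toy = pd.DataFrame({'caseno': ['a', 'b', 'c'], 'cvdow': ['0', '1', '9']})

# Default inner join: the unmatched row disappears
inner = toy.merge(day_lookup, on='cvdow')
print(len(inner))

# Left join keeps every call; _merge marks rows without a match
left = toy.merge(day_lookup, on='cvdow', how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']
print(unmatched)
```

Comparing the row count before and after the merge is a quick way to confirm that nothing was lost.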

In [None]:
# We also drop columns we do not need. 
def drop_unneeded_cols(calls):
    return calls.drop(columns=['cvdow', 'indbdate', 'block_location', 'city',
                               'state', 'eventdt', 'eventtm'])

In [None]:
#Lastly we pipe the dataset through the functions to get our final version
calls_final = (calls.pipe(combine_event_datetimes)
               .pipe(match_weekday)
               .pipe(drop_unneeded_cols))
calls_final
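
Chaining with `pipe` is equivalent to nesting the function calls, but it reads in the order the steps are applied. The following sketch illustrates this with two hypothetical helper functions (`add_total` and `drop_inputs`) on made-up data.

```python
import pandas as pd

def add_total(df):
    # Returns a new DataFrame with an extra column
    return df.assign(total=df['a'] + df['b'])

def drop_inputs(df):
    # Returns a new DataFrame without the input columns
    return df.drop(columns=['a', 'b'])

toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# Reads top to bottom: first add_total, then drop_inputs
piped = toy.pipe(add_total).pipe(drop_inputs)

# Same result, but reads inside out
nested = drop_inputs(add_total(toy))

print(piped)
print(piped.equals(nested))
```

Because each step returns a new DataFrame, the original frame is left untouched and individual steps can be tested on their own.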

To conclude, you are going to do a bit of exploratory analysis and visualization on this dataset. 

In [None]:
# It's your turn now. 

First, using what you learned in Chapter 10, can you make a bar chart comparing the number of cases on different days of the week?

In [None]:
# Place the code for the bar chart here:
# Sets the order for the days
day_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
# Resizes the plot so that it will be large enough to read easily
plots.figure(figsize=(16, 10))
# Add the plot here. Use a seaborn count plot, and set the x value to day, the order to day_order and the data to calls_final. 

In [None]:
# Let's use value counts to see what the most frequent offenses are:

In [None]:
# Lastly, make a function similar to the one used in homework 3 to determine the frequency of crimes per day. 

In [None]:
#Then, using that function, make dataframes for each day of the week. 
# Sunday Frame

In [None]:
# Monday Frame

In [None]:
# Tuesday Frame

In [None]:
# Wednesday Frame

In [None]:
# Thursday Frame

In [None]:
# Friday Frame

In [None]:
# Saturday Frame

In [None]:
# Finally, make a seaborn count plot that answers this question:
# What is the most frequent crime on the day of the week with the most crime?

When you finish, download your .ipynb file and upload it to Blackboard.