# Coding Discussion 03

# Instructions

## Task

Please read in the Chicago Summer 2018 Crimes Dataset located in the repository folder.

Using the data wrangling methods covered in class this week, create a new data frame where:

- the **_unit of observation_** is the crime type (i.e. `primary_type`),
- the **_column variables_** correspond to the **_day of the month_**, and
- **_each cell_** is populated by the **_proportion of times that crime type was committed over all days of the month_**
    + For example, assume a month has just two days, with 2 thefts committed on the first day and 1 on the second. The _proportion_ of thefts committed on the first day would then be 0.67, and 0.33 on the second day.
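
The arithmetic in the two-day example can be checked directly. This is a minimal sketch, assuming a hypothetical `thefts` series that is not taken from the dataset; note that 2/3 rounds to 0.67 at two decimal places.

```python
import pandas as pd

# Hypothetical counts: 2 thefts on day 1, 1 theft on day 2
thefts = pd.Series([2, 1], index=pd.Index([1, 2], name="day"))

# Share of the month's thefts that fell on each day, rounded to 2 decimals
proportions = (thefts / thefts.sum()).round(2)
print(proportions.tolist())  # [0.67, 0.33]
```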

Make sure that:

- all missing values are filled with zeros (a zero here means that no crime of that type was committed on that day);
- the data is rounded to the second decimal place; and
- the data frame is printed at the end of the notebook.


## Submit

Please submit your answer as a Jupyter Notebook in the `Submissions/` folder. Title the notebook with your lastname_firstname_netid (e.g. `doe_john_jd568.ipynb`). If you write any functions, be sure to include a docstring for each one that states what the function does and describes all the arguments it takes. As usual, please submit your answer to the class repository by the Sunday 11:59pm deadline.
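
As an illustration of the docstring requirement, here is a sketch of what a documented helper might look like. The function name `row_proportions`, its parameters, and the toy table are hypothetical and chosen only for this example; the NumPy docstring layout is one common convention, not a course requirement stated here.

```python
import pandas as pd


def row_proportions(counts: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    """Convert a table of counts into row-wise proportions.

    Parameters
    ----------
    counts : pd.DataFrame
        Table of non-negative counts, one row per category.
    decimals : int, default 2
        Number of decimal places to round the result to.

    Returns
    -------
    pd.DataFrame
        Same shape as `counts`; each row sums to roughly 1 (up to rounding).
    """
    # Divide every row by its own total, then round
    return counts.div(counts.sum(axis=1), axis=0).round(decimals)


# Usage on a tiny hypothetical table (columns are days, rows are crime types)
toy = pd.DataFrame({1: [2, 1], 2: [1, 1]}, index=["THEFT", "ARSON"])
result = row_proportions(toy)
```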


## Things to keep in mind

To answer this question, we'll want to think carefully about assigning an index, aggregating data by groups, and reshaping data. Everything you need is in the lecture notes.
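
To make the three ideas concrete, the sketch below walks through them on a hypothetical four-row table (the `toy` data is invented for illustration). It is one possible route, not necessarily the one presented in the lecture notes.

```python
import pandas as pd

# Hypothetical records, one row per reported crime
toy = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "THEFT", "ARSON"],
    "day": [1, 1, 2, 2],
})

# Aggregate: count rows per (crime type, day). The result is a Series
# indexed by a two-level MultiIndex built from the grouping keys.
counts = toy.groupby(["primary_type", "day"]).size()

# Reshape: move the inner index level ("day") into the columns.
# Combinations that never occur (ARSON on day 1) become NaN.
wide = counts.unstack()

print(list(counts.index.names))
print(wide.shape)
```

Inspecting `counts` and `wide` separately makes it easier to see where the missing values come from before they are filled.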

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in only the columns needed for the analysis (the column names come from an earlier look at the dataset)
df = pd.read_csv('data/chicago_summer_2018_crime_data.csv',
                usecols = ["day", "primary_type", "month"])

In [3]:
# Get some basic information about the data
# Note that the day values are already stored as integers
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73373 entries, 0 to 73372
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   month         73373 non-null  int64 
 1   day           73373 non-null  int64 
 2   primary_type  73373 non-null  object
dtypes: int64(2), object(1)
memory usage: 1.7+ MB


In [4]:
# look at the first few rows
df.head()

|   | month | day | primary_type |
|---|---|---|---|
| 0 | 8 | 4 | THEFT |
| 1 | 7 | 26 | THEFT |
| 2 | 6 | 24 | DECEPTIVE PRACTICE |
| 3 | 6 | 13 | ASSAULT |
| 4 | 6 | 14 | CRIMINAL DAMAGE |


In [5]:
# take a look at all the kinds of crime in the dataset
df["primary_type"].unique()

array(['THEFT', 'DECEPTIVE PRACTICE', 'ASSAULT', 'CRIMINAL DAMAGE',
       'CRIM SEXUAL ASSAULT', 'OFFENSE INVOLVING CHILDREN', 'BATTERY',
       'HOMICIDE', 'ROBBERY', 'NARCOTICS', 'MOTOR VEHICLE THEFT',
       'OTHER OFFENSE', 'BURGLARY', 'SEX OFFENSE', 'KIDNAPPING',
       'LIQUOR LAW VIOLATION', 'ARSON', 'CRIMINAL TRESPASS', 'OBSCENITY',
       'INTERFERENCE WITH PUBLIC OFFICER', 'NON-CRIMINAL', 'STALKING',
       'PROSTITUTION', 'WEAPONS VIOLATION', 'PUBLIC PEACE VIOLATION',
       'HUMAN TRAFFICKING', 'GAMBLING',
       'CONCEALED CARRY LICENSE VIOLATION', 'INTIMIDATION',
       'PUBLIC INDECENCY', 'NON-CRIMINAL (SUBJECT SPECIFIED)'],
      dtype=object)

In [6]:
# First, count how many crimes of each type fall on each day of the month (pooled across months)
# Then reshape the counts so that each day becomes a column, and fill type/day combinations that never occur with zeros
df = df.groupby(["primary_type", "day"]).size().unstack().fillna(0)
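
As a side note on the cell above, `unstack` accepts a `fill_value` argument, so the zeros can be supplied during the reshape instead of afterwards. The sketch below compares the two routes on a hypothetical three-row table; the `toy` data is invented, and neither variant is claimed to be the intended solution.

```python
import pandas as pd

# Hypothetical records: ARSON never occurs on day 1
toy = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "ARSON"],
    "day": [1, 2, 2],
})
counts = toy.groupby(["primary_type", "day"]).size()

# Route used in the cell above: reshape first, then replace NaN with 0
filled_after = counts.unstack().fillna(0)

# Alternative: supply the fill value during the reshape itself
filled_during = counts.unstack(fill_value=0)

# Both routes give the same numbers
same = bool((filled_after == filled_during).all().all())
```

Either form satisfies the requirement that missing values become zeros; the choice is largely a matter of taste.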

In [10]:
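# Preview the first few rows. Note: the execution count (10) shows that this cell was run after the normalization cell below, so the output shows proportions rather than raw counts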
df.head()

| primary_type / day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARSON | 0.04 | 0.03 | 0.03 | 0.02 | 0.04 | 0.05 | 0.04 | 0.04 | 0.02 | 0.02 | ... | 0.04 | 0.01 | 0.05 | 0.01 | 0.02 | 0.01 | 0.03 | 0.05 | 0.03 | 0.03 |
| ASSAULT | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 |
| BATTERY | 0.04 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 |
| BURGLARY | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 |
| CONCEALED CARRY LICENSE VIOLATION | 0.05 | 0.02 | 0.05 | 0.05 | 0.02 | 0.05 | 0.05 | 0.0 | 0.02 | 0.05 | ... | 0.02 | 0.0 | 0.05 | 0.07 | 0.07 | 0.02 | 0.02 | 0.0 | 0.02 | 0.05 |


In [8]:
# Divide each cell by the sum of its row (the total number of crimes of that type)
# This gives the proportion of that crime type committed on each day of the month
# Finally, round the result to 2 decimal places
df = round(df.div(df.sum(axis=1), axis=0), 2)
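
The row-wise division can be cross-checked against `pd.crosstab`, whose `normalize="index"` option divides each row by its own total. The sketch below uses a hypothetical four-row table and compares the step-by-step route with the one-call route; it is offered as a sanity check under those assumptions, not as the expected answer.

```python
import numpy as np
import pandas as pd

# Hypothetical records, one row per reported crime
toy = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "THEFT", "ARSON"],
    "day": [1, 1, 2, 2],
})

# Step-by-step route, mirroring the cells above
counts = toy.groupby(["primary_type", "day"]).size().unstack().fillna(0)
manual = counts.div(counts.sum(axis=1), axis=0).round(2)

# One-call route: crosstab counts the pairs, normalize="index"
# divides each row by its row total
direct = pd.crosstab(toy["primary_type"], toy["day"], normalize="index").round(2)

# The two tables should agree, and rounded rows should still sum to roughly 1
tables_match = np.allclose(manual.to_numpy(), direct.to_numpy())
row_sums = manual.sum(axis=1)
```

Because each cell is rounded independently, row sums in the real table may differ slightly from exactly 1.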

In [9]:
# This is the final dataframe
df

| primary_type / day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARSON | 0.04 | 0.03 | 0.03 | 0.02 | 0.04 | 0.05 | 0.04 | 0.04 | 0.02 | 0.02 | ... | 0.04 | 0.01 | 0.05 | 0.01 | 0.02 | 0.01 | 0.03 | 0.05 | 0.03 | 0.03 |
| ASSAULT | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 |
| BATTERY | 0.04 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 |
| BURGLARY | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 |
| CONCEALED CARRY LICENSE VIOLATION | 0.05 | 0.02 | 0.05 | 0.05 | 0.02 | 0.05 | 0.05 | 0.0 | 0.02 | 0.05 | ... | 0.02 | 0.0 | 0.05 | 0.07 | 0.07 | 0.02 | 0.02 | 0.0 | 0.02 | 0.05 |
| CRIM SEXUAL ASSAULT | 0.06 | 0.02 | 0.04 | 0.05 | 0.04 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.02 | 0.03 | 0.05 | 0.03 | 0.03 | 0.03 | 0.03 | 0.01 |
| CRIMINAL DAMAGE | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 |
| CRIMINAL TRESPASS | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | ... | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.02 |
| DECEPTIVE PRACTICE | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 |
| GAMBLING | 0.07 | 0.03 | 0.02 | 0.01 | 0.03 | 0.02 | 0.03 | 0.03 | 0.05 | 0.04 | ... | 0.02 | 0.02 | 0.01 | 0.04 | 0.03 | 0.01 | 0.02 | 0.03 | 0.03 | 0.03 |
