# Coding Discussion 3
Yousuf Abdelfatah

## Task
Please read in the Chicago Summer 2018 Crimes Dataset located in the repository folder.

Using the data wrangling methods covered in class this week, create a new data frame where:

- the **_unit of observation_** is the crime type (i.e. `primary_type`),
- the **_column variables_** correspond with the **_day of the month_**, and
- **_each cell_** is populated by the **_proportion of times that crime type was committed over all days of the month_**
    + For example, assume there were just two days in a month, with 2 thefts committed on the first day and 1 on the second. The _proportion_ of thefts committed on the first day would then be .67, and .33 on the second day.

Make sure that:

- all missing values are filled with zeros. Zeros in this case means no crimes were committed that day;
- the data is rounded to the second decimal place; and
- the data frame is printed at the end of the notebook.

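As a quick check on the arithmetic in the two-day example, here is a minimal sketch on a made-up three-row frame. The `toy` data is an assumption for illustration, not rows from the Chicago file. It uses `value_counts(normalize=True)`, which divides each count by the total, so 2 of 3 thefts rounds to 0.67 and 1 of 3 rounds to 0.33.

```python
import pandas as pd

# Hypothetical toy data: 2 thefts on day 1 and 1 theft on day 2
toy = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "THEFT"],
    "day": [1, 1, 2],
})

# Share of thefts falling on each day, rounded to two decimals
theft_days = toy.loc[toy["primary_type"] == "THEFT", "day"]
proportions = theft_days.value_counts(normalize=True).round(2)
print(proportions.sort_index())
```
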
## Import packages

In [150]:
import pandas as pd
import numpy as np

## Load and Explore the Data

In [151]:
# read in the data as a pandas dataframe
dat = pd.read_csv("chicago_summer_2018_crime_data.csv")

# look at dataframe
dat.head(5)

|  | month | day | year | day_of_week | description | location_description | block | primary_type | district | ward | arrest | domestic | latitude | longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8 | 4 | 2018 | Saturday | FROM BUILDING | APARTMENT | 039XX W WASHINGTON BLVD | THEFT | 11 | 28.0 | False | False | NaN | NaN |
| 1 | 7 | 26 | 2018 | Thursday | POCKET-PICKING | RESTAURANT | 005XX W MADISON ST | THEFT | 1 | 42.0 | False | False | NaN | NaN |
| 2 | 6 | 24 | 2018 | Sunday | BOGUS CHECK | GROCERY FOOD STORE | 004XX E 34TH ST | DECEPTIVE PRACTICE | 2 | 4.0 | False | False | NaN | NaN |
| 3 | 6 | 13 | 2018 | Wednesday | SIMPLE | RESIDENCE | 098XX S EXCHANGE AVE | ASSAULT | 4 | 10.0 | False | True | NaN | NaN |
| 4 | 6 | 14 | 2018 | Thursday | TO VEHICLE | STREET | 001XX S WALLER AVE | CRIMINAL DAMAGE | 15 | 29.0 | False | False | NaN | NaN |


In [152]:
# see what the columns are
dat.columns

Index(['month', 'day', 'year', 'day_of_week', 'description',
       'location_description', 'block', 'primary_type', 'district', 'ward',
       'arrest', 'domestic', 'latitude', 'longitude'],
      dtype='object')

In [153]:
# get data info
dat.info()

# get data shape
dat.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73373 entries, 0 to 73372
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   month                 73373 non-null  int64  
 1   day                   73373 non-null  int64  
 2   year                  73373 non-null  int64  
 3   day_of_week           73373 non-null  object 
 4   description           73373 non-null  object 
 5   location_description  73214 non-null  object 
 6   block                 73373 non-null  object 
 7   primary_type          73373 non-null  object 
 8   district              73373 non-null  int64  
 9   ward                  73372 non-null  float64
 10  arrest                73373 non-null  bool   
 11  domestic              73373 non-null  bool   
 12  latitude              72973 non-null  float64
 13  longitude             72973 non-null  float64
dtypes: bool(2), float64(3), int64(4), object(5)
memory usage

(73373, 14)

## Manipulate The Data

In [154]:
# We want the unit of observation to be the crime type, and we are counting each
# type of crime per day, so set those two variables as the index. This lets us
# use groupby to build a new variable from them and attach it to the data frame.
dat = dat.set_index(["primary_type", "day"])

# See what the data looks like
dat.head(5) 

| primary_type | day | month | year | day_of_week | description | location_description | block | district | ward | arrest | domestic | latitude | longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| THEFT | 4 | 8 | 2018 | Saturday | FROM BUILDING | APARTMENT | 039XX W WASHINGTON BLVD | 11 | 28.0 | False | False | NaN | NaN |
| THEFT | 26 | 7 | 2018 | Thursday | POCKET-PICKING | RESTAURANT | 005XX W MADISON ST | 1 | 42.0 | False | False | NaN | NaN |
| DECEPTIVE PRACTICE | 24 | 6 | 2018 | Sunday | BOGUS CHECK | GROCERY FOOD STORE | 004XX E 34TH ST | 2 | 4.0 | False | False | NaN | NaN |
| ASSAULT | 13 | 6 | 2018 | Wednesday | SIMPLE | RESIDENCE | 098XX S EXCHANGE AVE | 4 | 10.0 | False | True | NaN | NaN |
| CRIMINAL DAMAGE | 14 | 6 | 2018 | Thursday | TO VEHICLE | STREET | 001XX S WALLER AVE | 15 | 29.0 | False | False | NaN | NaN |


In [155]:
# Create a variable measuring the number of times each crime is committed per day using the groupby method coupled with .size() 
dat["perday"] = dat.groupby(['primary_type','day']).size()

# make sure this was added
dat.head(3)

| primary_type | day | month | year | day_of_week | description | location_description | block | district | ward | arrest | domestic | latitude | longitude | perday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| THEFT | 4 | 8 | 2018 | Saturday | FROM BUILDING | APARTMENT | 039XX W WASHINGTON BLVD | 11 | 28.0 | False | False | NaN | NaN | 576 |
| THEFT | 26 | 7 | 2018 | Thursday | POCKET-PICKING | RESTAURANT | 005XX W MADISON ST | 1 | 42.0 | False | False | NaN | NaN | 565 |
| DECEPTIVE PRACTICE | 24 | 6 | 2018 | Sunday | BOGUS CHECK | GROCERY FOOD STORE | 004XX E 34TH ST | 2 | 4.0 | False | False | NaN | NaN | 154 |

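The `perday` assignment above works because pandas aligns the grouped sizes with the frame's `(primary_type, day)` index, so every row in a group receives that group's count. The sketch below reproduces this on a made-up frame (`flat`, `indexed`, and their values are assumptions for illustration) and compares it with `transform("size")`, an alternative that attaches the count without setting the index first.

```python
import pandas as pd

# Hypothetical toy rows standing in for the crime data
flat = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "ASSAULT"],
    "day": [1, 1, 2],
    "block": ["a", "b", "c"],
})

# Approach used in the notebook: index by the grouping keys, then assign the
# group sizes. The assignment aligns on the (primary_type, day) index labels.
indexed = flat.set_index(["primary_type", "day"])
indexed["perday"] = indexed.groupby(["primary_type", "day"]).size()

# Alternative: transform("size") broadcasts each group's count back to its rows
flat["perday"] = flat.groupby(["primary_type", "day"])["block"].transform("size")

print(indexed["perday"].tolist())
print(flat["perday"].tolist())
```

Both routes should give the same per-row counts; the `transform` version keeps the original row index, which avoids the later `reset_index` step.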

In [156]:
# Reset the index before pivoting so that primary_type and day become ordinary
# columns again. reset_index returns a new data frame rather than modifying dat
# in place, so assign the result back.
dat = dat.reset_index()

# Now pivot using pivot_table: the columns correspond with each day, the values
# are the "perday" variable we created, and the unit of observation is the type
# of crime. Set missing values to zero.
dat = dat.pivot_table(columns="day", index="primary_type", values="perday").fillna(0)

# look at the new data
dat.head(5)

| primary_type / day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARSON | 4.0 | 3.0 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 5.0 | 2.0 | 2.0 | ... | 5.0 | 1.0 | 6.0 | 1.0 | 2.0 | 1.0 | 3.0 | 6.0 | 3.0 | 3.0 |
| ASSAULT | 207.0 | 188.0 | 172.0 | 202.0 | 209.0 | 197.0 | 172.0 | 195.0 | 161.0 | 154.0 | ... | 168.0 | 182.0 | 200.0 | 167.0 | 174.0 | 187.0 | 177.0 | 194.0 | 161.0 | 133.0 |
| BATTERY | 511.0 | 495.0 | 489.0 | 576.0 | 488.0 | 400.0 | 455.0 | 474.0 | 432.0 | 438.0 | ... | 423.0 | 439.0 | 476.0 | 485.0 | 460.0 | 393.0 | 450.0 | 432.0 | 442.0 | 274.0 |
| BURGLARY | 126.0 | 109.0 | 118.0 | 117.0 | 101.0 | 126.0 | 99.0 | 96.0 | 107.0 | 110.0 | ... | 135.0 | 102.0 | 119.0 | 104.0 | 104.0 | 140.0 | 113.0 | 107.0 | 108.0 | 79.0 |
| CONCEALED CARRY LICENSE VIOLATION | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.0 | 1.0 | 2.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 |

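`pivot_table` aggregates with the mean by default. That still yields counts here because `perday` holds the same value for every row in a `(primary_type, day)` group, so the mean of the group is the count itself. As a cross-check, the sketch below builds the same count table with `pd.crosstab` on a made-up frame (the `toy` data is an assumption for illustration) and compares the two results.

```python
import pandas as pd

# Hypothetical toy rows: no arson on day 1, so that cell must come out as 0
toy = pd.DataFrame({
    "primary_type": ["THEFT", "THEFT", "THEFT", "ARSON"],
    "day": [1, 1, 2, 2],
})

# crosstab counts rows per (primary_type, day) pair; absent pairs become 0
counts = pd.crosstab(toy["primary_type"], toy["day"])

# Same table via a per-group count column followed by pivot_table (default mean)
toy["perday"] = toy.groupby(["primary_type", "day"])["day"].transform("size")
pivoted = toy.pivot_table(index="primary_type", columns="day", values="perday").fillna(0)

print(counts)
print(pivoted)
```

`crosstab` returns integer counts and needs no `fillna`, while the `pivot_table` route produces floats because the missing cell starts out as `NaN`.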

In [157]:
# Add a column to the dataframe representing the total number of each type of crime
dat["totals"]=dat.sum(axis=1) 
dat.head(5)

| primary_type / day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | totals |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARSON | 4.0 | 3.0 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 5.0 | 2.0 | 2.0 | ... | 1.0 | 6.0 | 1.0 | 2.0 | 1.0 | 3.0 | 6.0 | 3.0 | 3.0 | 112.0 |
| ASSAULT | 207.0 | 188.0 | 172.0 | 202.0 | 209.0 | 197.0 | 172.0 | 195.0 | 161.0 | 154.0 | ... | 182.0 | 200.0 | 167.0 | 174.0 | 187.0 | 177.0 | 194.0 | 161.0 | 133.0 | 5635.0 |
| BATTERY | 511.0 | 495.0 | 489.0 | 576.0 | 488.0 | 400.0 | 455.0 | 474.0 | 432.0 | 438.0 | ... | 439.0 | 476.0 | 485.0 | 460.0 | 393.0 | 450.0 | 432.0 | 442.0 | 274.0 | 14111.0 |
| BURGLARY | 126.0 | 109.0 | 118.0 | 117.0 | 101.0 | 126.0 | 99.0 | 96.0 | 107.0 | 110.0 | ... | 102.0 | 119.0 | 104.0 | 104.0 | 140.0 | 113.0 | 107.0 | 108.0 | 79.0 | 3390.0 |
| CONCEALED CARRY LICENSE VIOLATION | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.0 | 1.0 | 2.0 | ... | 0.0 | 2.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 44.0 |


In [158]:
# Divide each row by the value in its totals column to get the proportion of times each crime was committed on that day
dat = dat.div(dat.totals, axis = 0).round(2)

# Drop the totals column
dat = dat.drop('totals', axis = 1)

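The division step hinges on `axis=0`, which tells `div` to match the divisor's labels against the row index, so each row is divided by its own total. A minimal sketch on a made-up count table follows (the `counts` values are assumptions for illustration). It keeps the row totals in a separate Series, a possible variant that avoids adding and then dropping a `totals` column.

```python
import pandas as pd

# Hypothetical counts: rows are crime types, columns are days of the month
counts = pd.DataFrame({1: [2, 1], 2: [1, 3]}, index=["THEFT", "ARSON"])

# Row totals kept as a separate Series instead of a column in the frame
row_totals = counts.sum(axis=1)

# axis=0 aligns row_totals with the row index: each row is divided by its total
props = counts.div(row_totals, axis=0).round(2)
print(props)
```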
In [159]:
# Print the final data frame
dat

| primary_type / day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARSON | 0.04 | 0.03 | 0.03 | 0.02 | 0.04 | 0.05 | 0.04 | 0.04 | 0.02 | 0.02 | ... | 0.04 | 0.01 | 0.05 | 0.01 | 0.02 | 0.01 | 0.03 | 0.05 | 0.03 | 0.03 |
| ASSAULT | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 |
| BATTERY | 0.04 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 |
| BURGLARY | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 |
| CONCEALED CARRY LICENSE VIOLATION | 0.05 | 0.02 | 0.05 | 0.05 | 0.02 | 0.05 | 0.05 | 0.0 | 0.02 | 0.05 | ... | 0.02 | 0.0 | 0.05 | 0.07 | 0.07 | 0.02 | 0.02 | 0.0 | 0.02 | 0.05 |
| CRIM SEXUAL ASSAULT | 0.06 | 0.02 | 0.04 | 0.05 | 0.04 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.02 | 0.03 | 0.05 | 0.03 | 0.03 | 0.03 | 0.03 | 0.01 |
| CRIMINAL DAMAGE | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 |
| CRIMINAL TRESPASS | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | ... | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.02 |
| DECEPTIVE PRACTICE | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | ... | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 |
| GAMBLING | 0.07 | 0.03 | 0.02 | 0.01 | 0.03 | 0.02 | 0.03 | 0.03 | 0.05 | 0.04 | ... | 0.02 | 0.02 | 0.01 | 0.04 | 0.03 | 0.01 | 0.02 | 0.03 | 0.03 | 0.03 |

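One way to sanity-check a table like the one above is to confirm that each row sums to roughly 1. Rounding every cell to two decimals can shift each value by up to 0.005, so the check needs a tolerance that grows with the number of columns. The helper below is a hypothetical sketch (the name `rows_sum_to_one` and its tolerance rule are assumptions, not part of the assignment).

```python
import numpy as np
import pandas as pd

def rows_sum_to_one(props, n_decimals=2):
    """Return True if every row sums to 1 within the worst-case rounding error."""
    # Each rounded cell can be off by at most half a unit in the last decimal
    tol = props.shape[1] * 0.5 * 10 ** (-n_decimals)
    return bool(np.all(np.abs(props.sum(axis=1) - 1.0) <= tol))

# Hypothetical rounded proportions for two crime types over two days
good = pd.DataFrame({1: [0.67, 0.25], 2: [0.33, 0.75]}, index=["THEFT", "ARSON"])
bad = pd.DataFrame({1: [0.50], 2: [0.20]}, index=["THEFT"])

print(rows_sum_to_one(good))
print(rows_sum_to_one(bad))
```

With 31 day columns the worst-case tolerance is 0.155, so the check only catches gross errors such as dividing by the wrong total.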