# Applied Data Science - Capstone Project Notebook

This notebook is primarily used to document progress on the Capstone Project (part of Coursera's class on Applied Data Science).

## Week 4 Activities
- Document a description of the problem and a discussion of the background
- Document a description of the data and how it will be used to solve the problem

## Background and Problem description
### Background
In module 3 of the *Capstone project* we explored New York City and the city of Toronto and clustered their neighborhoods using venue data obtained from **Foursquare**.

The week 4 assignment asks us to come up with our own problem that can be solved with location data, and suggests comparing the neighborhoods of Toronto to the neighborhoods of New York City. While implementing this suggestion would be an easy and straightforward task, I've decided not to follow it.

To make the assignment more interesting, it would be nice to dig deeper into the data available from *Foursquare*. Unfortunately, the default "sandbox" account on *Foursquare* is very limited, and expanding it requires sharing payment information (which I'm NOT comfortable with).

I've decided to take an alternative path and experiment with the location data publicly available from the city of Chicago [Data Portal](https://data.cityofchicago.org/). The portal offers a variety of data about the city of Chicago and its neighborhoods, and many of the datasets are updated daily. The data may be viewed on the portal's website, accessed via APIs, or downloaded for offline analysis.

For my project I've chosen the Public Safety [dataset](https://data.cityofchicago.org/d/ijzp-q8t2). This dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that occurred in the City of Chicago from 2001 to the present, minus the most recent seven days. The data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. To protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. The complete dataset (2001 to the present) is very large: around 7.16 million rows and 22 columns. Parsing and analyzing the entire dataset is a very resource-consuming (but exciting) task, so for the Coursera Capstone Project I've selected a smaller subset covering the year 2020 (from January 1st to the date of download).
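The 2020 subset could also be requested through the portal's API rather than downloaded by hand. The following is a minimal sketch, not the method used in this notebook. It assumes that the dataset is served from the Socrata (SODA) endpoint `https://data.cityofchicago.org/resource/ijzp-q8t2.json` and that the API field holding the year is called `year`. The helper name `build_crime_query` and the page size are hypothetical. The code only builds the request URL and does not contact the portal.

```python
from urllib.parse import urlencode

# Assumed SODA endpoint for the crimes dataset (ijzp-q8t2).
BASE_URL = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"


def build_crime_query(year, limit=1000, offset=0):
    """Build a URL that requests one page of incidents for a single year."""
    params = {
        "$where": f"year = {int(year)}",  # assumed API field name: `year`
        "$order": ":id",                  # fixed ordering so that paging is stable
        "$limit": int(limit),
        "$offset": int(offset),
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_crime_query(2020, limit=5)
print(url)
```

If the assumptions hold, the resulting URL could be passed to `pd.read_json(url)`, and larger extracts could be assembled by increasing `offset` page by page.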

Our exploration of the city of Toronto relied on individual neighborhoods identified by their postal codes. A similar approach could be taken for the city of Chicago (e.g. analyzing the Chicago datasets by ZIP code). However, the city of Chicago has an alternative subdivision based on its political [wards](https://www.chicago.gov/city/en/about/wards.html). There are 50 wards in the city of Chicago; their geographical boundaries and the locations of the aldermen's offices are available from the [Data Portal](https://data.cityofchicago.org/).

### Problem description
Analyze whether different locations in the city of Chicago have similar or different crime profiles (e.g. in order to select a safer location for a residence or a new business).

### Data description
As mentioned above, the primary data source for this project will be the Chicago [Data Portal](https://data.cityofchicago.org/), specifically:
- daily updated dataset of [reported incidents](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2)
- Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) [codes](http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e)
- Ward [offices](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Ward-Offices/htai-wnw4)
- Ward [geographical boundaries](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Wards-2015-/sp34-6z76) 

### How the data will be used for the project (preliminary)
The project assignment suggests hypothetical problems such as: "In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it?" or "Similarly, if a contractor is trying to start their own business, where would you recommend that they setup their office?". We will try to answer these questions on the assumption that a prospective business owner does not want the business to be damaged or vandalized, or its customers and employees to be endangered. By exploring and analyzing various areas of Chicago (identified by ward) in terms of their crime levels and crime profiles, we should be able to answer these questions.
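As a rough illustration of what "crime level" could mean for a business owner, the sketch below ranks wards by the number of incidents in a few selected categories. The column names `Ward` and `Primary Type` are those of the crime dataset. The choice of categories in `BUSINESS_RELEVANT` and the helper name `rank_wards` are assumptions made for this example only, and the sample rows are made up.

```python
import pandas as pd

# Categories assumed (for illustration only) to matter most to a business owner.
BUSINESS_RELEVANT = ("BURGLARY", "CRIMINAL DAMAGE", "ROBBERY", "THEFT")


def rank_wards(crimes, types=BUSINESS_RELEVANT):
    """Count the selected incident types per ward, lowest count first.

    Wards with no incidents of the selected types do not appear in the result.
    """
    selected = crimes[crimes["Primary Type"].isin(types)]
    counts = selected.groupby("Ward").size().rename("Incidents")
    return counts.sort_values(kind="stable").reset_index()


# Tiny made-up sample, only to show the shape of the result.
sample = pd.DataFrame({
    "Ward": [1, 1, 1, 2, 2, 3],
    "Primary Type": ["THEFT", "BURGLARY", "NARCOTICS", "THEFT", "ROBBERY", "THEFT"],
})
ranking = rank_wards(sample)
print(ranking)
```

Raw counts are not adjusted for population or for the number of businesses in a ward, so a ranking like this is only a starting point for the comparison.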

In [1]:
import re
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library
from folium.plugins import FastMarkerCluster, MarkerCluster

## Data exploration

Let's take a look at the available data:


In [2]:
wards_df = pd.read_csv("data/Ward_offices.csv")
wards_df.head()

| | WARD | ALDERMAN | ADDRESS | CITY | STATE | ZIPCODE | WARD PHONE | WARD FAX | EMAIL | WEBSITE | LOCATION | CITY HALL ADDRESS | CITY HALL CITY | CITY HALL STATE | CITY HALL ZIPCODE | CITY HALL PHONE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42 | Reilly, Brendan | | | | | | | | | | 121 North LaSalle Street, Room 200,Office 6 | Chicago | IL | 60602 | (312) 744-3062 |
| 1 | 33 | Rodriguez Sanchez, Rossana | 3001 West Irving Park Road | Chicago | IL | 60618.0 | (773) 840-7880 | | Info@33rdward.org | | (41.95392, -87.703301) | 121 North LaSalle Street, Room 200, Office 20 | Chicago | IL | 60602 | (312) 744-3373 |
| 2 | 17 | Moore, David H. | 1344 West 79th Street | Chicago | IL | 60636.0 | (773) 783-3672 | (773) 783-3878 | Alderman@17ward.com | http://www.David.Moore@cityofchicago.org | (41.75044, -87.657221) | 121 North LaSalle Street, Room 300, Office 37 | Chicago | IL | 60602 | (312) 744-3435 |
| 3 | 44 | Tunney, Thomas | 3223 North Sheffield Avenue, Suite A | Chicago | IL | 60657.0 | (773) 525-6034 | (773) 525-5058 | Ward44@cityofchicago.org | http://44thward.org/ | (41.940497, -87.654108) | 121 North LaSalle Street, Room 304 | Chicago | IL | 60602 | (312) 744-3073 / 3133 |
| 4 | 37 | Mitts, Emma | 5344 West North Avenue | Chicago | IL | 60651.0 | (773) 379-0960 | (773) 773-0966 | Ward37@cityofchicago.org | https://www.cityofchicago.org/city/en/about/wa... | (41.909514, -87.759726) | 121 North LaSalle Street, Room 300, Office 45 | Chicago | IL | 60602 | (312) 744-3180 / 1589 |
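The `LOCATION` column stores both coordinates in a single string such as `(41.95392, -87.703301)`, and it is empty for some wards. One possible way to split it into numeric columns is sketched below. The helper name `split_location` and the output column names `lat` and `lon` are hypothetical.

```python
import pandas as pd

# Pattern for strings such as "(41.95392, -87.703301)".
COORD_PATTERN = (
    r"\(\s*(?P<lat>[-+]?\d+(?:\.\d+)?)\s*,"
    r"\s*(?P<lon>[-+]?\d+(?:\.\d+)?)\s*\)"
)


def split_location(location):
    """Return a DataFrame with float `lat`/`lon` columns; missing values stay NaN."""
    return location.str.extract(COORD_PATTERN).astype(float)


# Sample values in the same format as the LOCATION column, one of them missing.
sample = pd.Series(["(41.95392, -87.703301)", None, "(41.75044, -87.657221)"])
coords = split_location(sample)
print(coords)
```

Keeping missing locations as `NaN` makes it easy to decide later whether to skip those wards or to substitute a default position.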


In [3]:
# Chicago geographical coordinates
chi_lat = 41.8781
chi_lon = -87.6298
# Chicago wards boundaries
geo_wards = r'data/Boundaries - Wards (2015-).geojson'

wards_df['key'] = wards_df['WARD'].astype(str)
wards_df['value'] = wards_df['WARD']

# regexp used to extract the coordinate values (raw string avoids an invalid-escape warning)
p = re.compile(r"[-+]?[0-9]*\.?[0-9]+")
map_chicago = folium.Map(location = [chi_lat, chi_lon], zoom_start = 10)
ward_markers = MarkerCluster(name = "Alderman Offices")

folium.Choropleth(
    geo_wards,
    data=wards_df,
    columns=['key','value'],
    key_on='feature.properties.ward',
    line_color='red',
    fill_color='YlOrRd',
    #line_weight=3,
    name='Chicago Wards',
    legend_name='Chicago Wards',
    highlight=True
).add_to(map_chicago)

for ward, loc, ald in zip(wards_df['WARD'], wards_df['LOCATION'], wards_df['ALDERMAN']):
    label = 'Ward: {}, Alderman: {}'.format(ward, ald)
    # a missing LOCATION is read as NaN; substitute the city's coordinates
    if pd.isna(loc):
        lat = chi_lat
        lon = chi_lon
    else:
        # extract the coordinates and convert them from strings to floats
        lat, lon = map(float, p.findall(loc))
    # place a marker
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='red',
        fill=False,
        tooltip='Click me!',
        parse_html=False
        ).add_to(ward_markers)

ward_markers.add_to(map_chicago)

folium.LayerControl().add_to(map_chicago)

map_chicago
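The choropleth above is shaded by the ward number itself, which carries no information. A more informative shading would use the number of reported incidents per ward. The sketch below prepares such a table in the same `key`/`value` layout. It assumes a crime table with a `Ward` column, as in the crime dataset used in this notebook. The helper name `ward_counts` is hypothetical and the sample rows are made up.

```python
import pandas as pd


def ward_counts(crimes, n_wards=50):
    """Incidents per ward as a (key, value) table; wards without incidents get 0."""
    counts = crimes["Ward"].value_counts()
    # Make sure every ward from 1 to n_wards is present, even with no incidents.
    counts = counts.reindex(range(1, n_wards + 1), fill_value=0)
    return pd.DataFrame({
        "key": counts.index.astype(str),  # string keys, like wards_df['key']
        "value": counts.to_numpy(),
    })


# Made-up sample with three wards, one of which has no incidents.
sample = pd.DataFrame({"Ward": [1, 1, 3]})
table = ward_counts(sample, n_wards=3)
print(table)
```

Passing such a table as `data=` with `columns=['key','value']` to `folium.Choropleth`, in place of `wards_df`, should shade each ward by its incident count; the `legend_name` would need to be changed accordingly.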

In [4]:
# read the crime database
crime_df = pd.read_csv("data/Crimes_-_2020.csv")
# drop the rows with missing data
crime_df.dropna(axis = 0, inplace = True)
crime_df.reset_index(drop = True, inplace = True)
# convert ward number into an int (it gets imported as a float)
crime_df['Ward'] = crime_df['Ward'].astype(int)
# drop the columns that are not needed for the analysis
crime_df.drop(columns = ['Case Number','Block','Description','Location Description','Domestic','Beat','Community Area','X Coordinate','Y Coordinate','Year','Updated On','Location'], inplace=True)
crime_df.dtypes

ID                int64
Date             object
IUCR             object
Primary Type     object
Arrest             bool
District          int64
Ward              int32
FBI Code         object
Latitude        float64
Longitude       float64
dtype: object
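The `Date` column is read as `object` (plain strings), so time-based analysis would first require a conversion. A minimal sketch, assuming that every timestamp follows the `MM/DD/YYYY hh:mm:ss AM/PM` layout seen in the data; the derived column names `Hour` and `Month` are hypothetical.

```python
import pandas as pd

# Layout of the timestamps, e.g. "07/20/2020 02:00:00 PM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

sample = pd.DataFrame({"Date": ["07/20/2020 02:00:00 PM", "07/20/2020 10:28:00 AM"]})
sample["Date"] = pd.to_datetime(sample["Date"], format=DATE_FORMAT)

# Derived columns that allow comparisons by time of day and by month.
sample["Hour"] = sample["Date"].dt.hour
sample["Month"] = sample["Date"].dt.month
print(sample)
```

Giving an explicit `format` avoids ambiguity between day-first and month-first dates and is usually faster than letting pandas infer the layout.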

In [5]:
crime_df.head()

| | ID | Date | IUCR | Primary Type | Arrest | District | Ward | FBI Code | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12112582 | 07/20/2020 02:00:00 PM | 820 | THEFT | False | 10 | 22 | 06 | 41.839973 | -87.721994 |
| 1 | 12112202 | 07/20/2020 06:00:00 PM | 460 | BATTERY | False | 2 | 4 | 08B | 41.802115 | -87.587751 |
| 2 | 12111557 | 07/20/2020 08:00:00 AM | 1130 | DECEPTIVE PRACTICE | False | 5 | 34 | 11 | 41.660909 | -87.638945 |
| 3 | 12111227 | 07/20/2020 10:28:00 AM | 460 | BATTERY | False | 20 | 48 | 08B | 41.978707 | -87.657882 |
| 4 | 12111906 | 07/20/2020 09:27:00 PM | 2825 | OTHER OFFENSE | False | 14 | 1 | 26 | 41.909184 | -87.689647 |


In [6]:
crime_df.describe()

| | ID | District | Ward | Latitude | Longitude |
|---|---|---|---|---|---|
| count | 111544.0 | 111544.0 | 111544.0 | 111544.0 | 111544.0 |
| mean | 11981110.0 | 11.105904 | 22.778643 | 41.839514 | -87.670282 |
| std | 740229.0 | 6.910796 | 13.659633 | 0.087005 | 0.058961 |
| min | 24889.0 | 1.0 | 1.0 | 41.64459 | -87.934567 |
| 25% | 11982990.0 | 6.0 | 10.0 | 41.765235 | -87.71375 |
| 50% | 12026480.0 | 10.0 | 23.0 | 41.853677 | -87.664449 |
| 75% | 12069780.0 | 16.0 | 34.0 | 41.903056 | -87.62755 |
| max | 12114970.0 | 31.0 | 50.0 | 42.022586 | -87.524618 |


In [7]:
crime_df['Primary Type'].value_counts()

BATTERY                              23345
THEFT                                22330
CRIMINAL DAMAGE                      13649
ASSAULT                               9881
OTHER OFFENSE                         6808
DECEPTIVE PRACTICE                    6484
BURGLARY                              4906
MOTOR VEHICLE THEFT                   4781
NARCOTICS                             4100
WEAPONS VIOLATION                     3984
ROBBERY                               3907
CRIMINAL TRESPASS                     2466
OFFENSE INVOLVING CHILDREN            1076
PUBLIC PEACE VIOLATION                 887
CRIMINAL SEXUAL ASSAULT                540
SEX OFFENSE                            504
HOMICIDE                               426
INTERFERENCE WITH PUBLIC OFFICER       425
ARSON                                  300
PROSTITUTION                           179
CRIM SEXUAL ASSAULT                    118
STALKING                                95
INTIMIDATION                            81
CONCEALED C
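The counts above list both `CRIMINAL SEXUAL ASSAULT` and `CRIM SEXUAL ASSAULT`, which look like two spellings of one category. The sketch below merges them and converts the counts into shares of all incidents. Treating the two labels as equivalent is an assumption that should be checked against the IUCR code table. The names `LABEL_FIXES` and `type_shares` are hypothetical and the sample values are made up.

```python
import pandas as pd

# Assumed mapping: treat the abbreviated label as the same category.
LABEL_FIXES = {"CRIM SEXUAL ASSAULT": "CRIMINAL SEXUAL ASSAULT"}


def type_shares(primary_type):
    """Merge duplicate labels, then return each type's share of all incidents."""
    cleaned = primary_type.replace(LABEL_FIXES)
    return cleaned.value_counts(normalize=True)


# Made-up sample containing both spellings.
sample = pd.Series([
    "THEFT", "THEFT", "CRIM SEXUAL ASSAULT", "CRIMINAL SEXUAL ASSAULT",
])
shares = type_shares(sample)
print(shares)
```

Shares are easier to compare across wards than raw counts, because wards differ in the total number of reported incidents.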

In [8]:
crime_df.groupby(['Ward','Primary Type']).size().reset_index(name='Count')

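| | Ward | Primary Type | Count |
|---|---|---|---|
| 0 | 1 | ARSON | 5 |
| 1 | 1 | ASSAULT | 126 |
| 2 | 1 | BATTERY | 268 |
| 3 | 1 | BURGLARY | 109 |
| 4 | 1 | CONCEALED CARRY LICENSE VIOLATION | 1 |
| ... | ... | ... | ... |
| 1215 | 50 | ROBBERY | 33 |
| 1216 | 50 | SEX OFFENSE | 8 |
| 1217 | 50 | STALKING | 1 |
| 1218 | 50 | THEFT | 291 |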
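A long table of (ward, crime type, count) rows like the one above can be turned into one "crime profile" per ward and then clustered with the `KMeans` class imported at the top of the notebook. The sketch below is one possible way to do this, not necessarily the approach the project will take. The helper name `cluster_wards`, the choice of two clusters, and the sample counts are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_wards(counts, n_clusters=2, seed=0):
    """Cluster wards by the share of each crime type among their incidents."""
    # One row per ward, one column per crime type; absent pairs become 0.
    wide = counts.pivot_table(index="Ward", columns="Primary Type",
                              values="Count", fill_value=0, aggfunc="sum")
    # Row-normalise so that wards are compared by profile, not by volume.
    profile = wide.div(wide.sum(axis=1), axis=0)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(profile.to_numpy())
    return pd.Series(labels, index=profile.index, name="Cluster")


# Made-up counts in the same long format as the grouped table.
sample = pd.DataFrame({
    "Ward": [1, 1, 2, 2, 3, 3, 4, 4],
    "Primary Type": ["THEFT", "BATTERY"] * 4,
    "Count": [90, 10, 80, 20, 10, 90, 20, 80],
})
labels = cluster_wards(sample)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which wards end up together. The number of clusters would have to be chosen by inspecting the real data.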
