# Applied Data Science Capstone

This notebook develops the project for the Applied Data Science Capstone course, part of the Data Science Professional Certificate by IBM and Coursera.

# Introduction: Business Problem

One strategy for fighting crime is to open a police station where neighbors can report the crimes they have suffered and from which criminals can be prosecuted more efficiently. Police officers can also leave the station to patrol the neighborhood. The decision about where to place the new police station must be based on crime rates across the city, because the neighborhood with the highest crime rate is the priority.

# Data

The datasets used are:

+ **Police Department Incidents for 2016 in San Francisco:** this data was obtained from the San Francisco public data portal and includes incident number, category, description, day of week, date, time, police department district, resolution, address, latitude, longitude and police department id. The dataset can be downloaded from this link: 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Police_Department_Incidents_-_Previous_Year__2016_.csv'

+ **Neighborhoods in San Francisco:** the boundaries of the different neighborhoods in San Francisco are in a GeoJSON file, which can be downloaded from this link: 'https://cocl.us/sanfran_geojson'

The Police Department Incidents dataset and the Neighborhoods in San Francisco GeoJSON file can be combined to build a map that shows where incidents happened and which neighborhood each location belongs to. The map makes it possible to identify visually the neighborhood with the highest crime rate, and a deeper analysis can then pinpoint the area with the greatest criminal activity.

# Incidents in San Francisco for 2016

Import libraries to be used in the capstone project.

In [134]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
from folium import plugins

Load the police department incidents data with the pandas read_csv() function.

In [135]:
df = pd.read_csv('Police_Department_Incidents_-_Previous_Year__2016_.csv')
df.head()

| | IncidntNum | Category | Descript | DayOfWeek | Date | Time | PdDistrict | Resolution | Address | X | Y | Location | PdId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120058272 | WEAPON LAWS | POSS OF PROHIBITED WEAPON | Friday | 01/29/2016 12:00:00 AM | 11:00 | SOUTHERN | ARREST, BOOKED | 800 Block of BRYANT ST | -122.403405 | 37.775421 | (37.775420706711, -122.403404791479) | 12005827212120 |
| 1 | 120058272 | WEAPON LAWS | FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE | Friday | 01/29/2016 12:00:00 AM | 11:00 | SOUTHERN | ARREST, BOOKED | 800 Block of BRYANT ST | -122.403405 | 37.775421 | (37.775420706711, -122.403404791479) | 12005827212168 |
| 2 | 141059263 | WARRANTS | WARRANT ARREST | Monday | 04/25/2016 12:00:00 AM | 14:59 | BAYVIEW | ARREST, BOOKED | KEITH ST / SHAFTER AV | -122.388856 | 37.729981 | (37.7299809672996, -122.388856204292) | 14105926363010 |
| 3 | 160013662 | NON-CRIMINAL | LOST PROPERTY | Tuesday | 01/05/2016 12:00:00 AM | 23:50 | TENDERLOIN | NONE | JONES ST / OFARRELL ST | -122.412971 | 37.785788 | (37.7857883766888, -122.412970537591) | 16001366271000 |
| 4 | 160002740 | NON-CRIMINAL | LOST PROPERTY | Friday | 01/01/2016 12:00:00 AM | 00:30 | MISSION | NONE | 16TH ST / MISSION ST | -122.419672 | 37.76505 | (37.7650501214668, -122.419671780296) | 16000274071000 |


Drop the unnecessary columns.

In [136]:
df.drop(['IncidntNum', 'Descript', 'DayOfWeek', 'Date', 'Time', 'Resolution', 'Location', 'PdId'], axis = 1, inplace = True)
df.head()

| | Category | PdDistrict | Address | X | Y |
|---|---|---|---|---|---|
| 0 | WEAPON LAWS | SOUTHERN | 800 Block of BRYANT ST | -122.403405 | 37.775421 |
| 1 | WEAPON LAWS | SOUTHERN | 800 Block of BRYANT ST | -122.403405 | 37.775421 |
| 2 | WARRANTS | BAYVIEW | KEITH ST / SHAFTER AV | -122.388856 | 37.729981 |
| 3 | NON-CRIMINAL | TENDERLOIN | JONES ST / OFARRELL ST | -122.412971 | 37.785788 |
| 4 | NON-CRIMINAL | MISSION | 16TH ST / MISSION ST | -122.419672 | 37.76505 |
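
As an alternative to loading every column and then dropping most of them, `read_csv()` accepts a `usecols` argument that keeps only the listed columns at load time. The sketch below is not part of the original notebook: it reads two rows copied from the preview above out of an in-memory string (the names `sample_csv`, `keep` and `df_small` are made up for illustration), so it runs without the CSV file.

```python
import io

import pandas as pd

# Header plus two rows copied from the preview above, held in memory so the
# sketch does not depend on the downloaded CSV file.
sample_csv = io.StringIO(
    'IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,'
    'Address,X,Y,Location,PdId\n'
    '120058272,WEAPON LAWS,POSS OF PROHIBITED WEAPON,Friday,'
    '01/29/2016 12:00:00 AM,11:00,SOUTHERN,"ARREST, BOOKED",'
    '800 Block of BRYANT ST,-122.403405,37.775421,'
    '"(37.775420706711, -122.403404791479)",12005827212120\n'
    '160013662,NON-CRIMINAL,LOST PROPERTY,Tuesday,'
    '01/05/2016 12:00:00 AM,23:50,TENDERLOIN,NONE,'
    'JONES ST / OFARRELL ST,-122.412971,37.785788,'
    '"(37.7857883766888, -122.412970537591)",16001366271000\n'
)

# Columns the rest of the notebook actually uses.
keep = ['Category', 'PdDistrict', 'Address', 'X', 'Y']

# usecols filters columns while parsing, so the dropped columns never
# occupy memory.
df_small = pd.read_csv(sample_csv, usecols=keep)
print(df_small)
```

With the real file, the same call would take the CSV path in place of `sample_csv`.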


Use the geopy library to get the latitude and longitude of San Francisco.

Take the first 1000 incidents in the dataset to mark their locations on a map of San Francisco.

In [137]:
# get the first 1000 incidents in the df dataframe
limit = 1000
df1000 = df.iloc[0:limit, :]
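
The first 1000 rows follow the file's own order, so they may not be spread evenly over the year or the city. If a more representative set of markers is wanted, one option is `DataFrame.sample` with a fixed `random_state`. This is a sketch on a small made-up dataframe (`toy` is a hypothetical stand-in for `df`), not the notebook's method.

```python
import pandas as pd

# Hypothetical stand-in for df: 20 rows cycling through four districts.
toy = pd.DataFrame({
    'PdDistrict': ['SOUTHERN', 'MISSION', 'BAYVIEW', 'PARK'] * 5,
    'row': range(20),
})

limit = 8

# What the notebook does: take the first `limit` rows in file order.
first_rows = toy.iloc[0:limit, :]

# Alternative: draw `limit` rows at random; random_state makes the draw
# reproducible between runs.
random_rows = toy.sample(n=limit, random_state=0)

print(first_rows['row'].tolist())
print(sorted(random_rows['row'].tolist()))
```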

In [138]:
address = 'San Francisco'

geolocator = Nominatim(user_agent="sf_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Francisco are 37.7792808, -122.4192363.
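
geopy's `geocode()` returns `None` when it finds no match, in which case `location.latitude` raises an `AttributeError`. The sketch below shows one way to guard against that by falling back to known coordinates. The helper `coordinates_or_default` and the stand-in geocoders `found` and `not_found` are hypothetical names introduced here so that the example runs offline.

```python
from types import SimpleNamespace

# Coordinates printed by the notebook above, used only as a fallback value.
SF_FALLBACK = (37.7792808, -122.4192363)


def coordinates_or_default(geocode, address, default=SF_FALLBACK):
    """Return (latitude, longitude) for address, or default if nothing is found.

    `geocode` is any callable that behaves like geolocator.geocode: it returns
    an object with .latitude and .longitude attributes, or None.
    """
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Stand-in geocoders (assumptions for the demo, no network access needed).
def found(address):
    return SimpleNamespace(latitude=37.7792808, longitude=-122.4192363)


def not_found(address):
    return None


print(coordinates_or_default(found, 'San Francisco'))
print(coordinates_or_default(not_found, 'no such place'))
```

In the notebook the call would be `coordinates_or_default(geolocator.geocode, address)`.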


From the San Francisco dataset, build a pandas dataframe with the total number of incidents in each police district, which is used here as the neighborhood.

In [139]:
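# count incidents per district; rename_axis() and reset_index(name=...) give the
# same column names whichever pandas version names the value_counts() output
cc = df['PdDistrict'].value_counts().rename_axis('Neighborhood').reset_index(name='Count')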
cc

| | Neighborhood | Count |
|---|---|---|
| 0 | SOUTHERN | 28445 |
| 1 | NORTHERN | 20100 |
| 2 | MISSION | 19503 |
| 3 | CENTRAL | 17666 |
| 4 | BAYVIEW | 14303 |
| 5 | INGLESIDE | 11594 |
| 6 | TARAVAL | 11325 |
| 7 | TENDERLOIN | 9942 |
| 8 | RICHMOND | 8922 |
| 9 | PARK | 8699 |

Group the first 1000 incidents into marker clusters. A MarkerCluster object is instantiated and each incident in the dataframe is added to it, so that nearby incidents collapse into a single marker showing how many incidents it contains. In addition, create a choropleth map from the GeoJSON file, which marks the boundaries of the different neighborhoods in San Francisco, and the incident counts computed from the full CSV file loaded previously, to visualize the crime level of each district on a color scale.

In [141]:
# Create a map of San Francisco
sanfran_map = folium.Map(location = [latitude, longitude], zoom_start = 12)
SF_geo = r'san-francisco.geojson'
# instantiate a mark cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(sanfran_map)

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label in zip(df1000.Y, df1000.X, df1000.Category):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)
    
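# Map.choropleth matches the folium 0.5.0 install line above; newer folium releases
# provide folium.Choropleth(...).add_to(sanfran_map) instead
sanfran_map.choropleth(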
    geo_data=SF_geo,
    data=cc,
    columns=['Neighborhood', 'Count'],
    key_on='feature.properties.DISTRICT',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Crime Rate in San Francisco'
)

# display map
sanfran_map
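
The choropleth joins the `Neighborhood` column to the GeoJSON property named in `key_on`, so a difference in spelling or case would leave a district without a matching value. A quick set comparison can catch this before the map is drawn. The sketch below uses a hypothetical two-feature GeoJSON excerpt and a deliberately mis-cased key. With the real data, `geo` would be loaded from 'san-francisco.geojson' and `data_keys` taken from `cc['Neighborhood']`.

```python
import json

# Hypothetical excerpt with the same property name used in key_on above;
# geometry is omitted because only the keys matter for this check.
geojson_text = '''
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"DISTRICT": "SOUTHERN"}, "geometry": null},
  {"type": "Feature", "properties": {"DISTRICT": "MISSION"}, "geometry": null}
]}
'''
geo = json.loads(geojson_text)

# Keys available on the GeoJSON side of the join.
geo_keys = {feature['properties']['DISTRICT'] for feature in geo['features']}

# Keys on the data side; 'Central' is mis-cased on purpose for the demo.
data_keys = {'SOUTHERN', 'MISSION', 'Central'}

# Names present on one side of the join but not the other.
missing_in_geojson = sorted(data_keys - geo_keys)
missing_in_data = sorted(geo_keys - data_keys)

print('In data but not in GeoJSON:', missing_in_geojson)
print('In GeoJSON but not in data:', missing_in_data)
```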



Select the 2016 San Francisco incidents that happened in the Southern district and print the shape of the new dataframe.

In [142]:
sou = df.loc[df['PdDistrict'] == 'SOUTHERN']
sou.shape

(28445, 5)

In [143]:
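# count incidents per address in the Southern district; rename_axis() and
# reset_index(name=...) give the same column names whichever pandas version is used
ad = sou['Address'].value_counts().rename_axis('Address').reset_index(name='Count')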
ad.head()

| | Address | Count |
|---|---|---|
| 0 | 800 Block of BRYANT ST | 3561 |
| 1 | 800 Block of MARKET ST | 1405 |
| 2 | 900 Block of MARKET ST | 547 |
| 3 | 0 Block of 6TH ST | 347 |
| 4 | 800 Block of MISSION ST | 345 |


Add latitude and longitude columns to the dataframe of incident counts per address.

In [144]:
ad = pd.merge(ad, df[['Address', 'X', 'Y']].drop_duplicates(subset='Address', keep='first'), on='Address', how = 'inner')
ad.head()

| | Address | Count | X | Y |
|---|---|---|---|---|
| 0 | 800 Block of BRYANT ST | 3561 | -122.403405 | 37.775421 |
| 1 | 800 Block of MARKET ST | 1405 | -122.407634 | 37.784189 |
| 2 | 900 Block of MARKET ST | 547 | -122.408595 | 37.783707 |
| 3 | 0 Block of 6TH ST | 347 | -122.40942 | 37.781615 |
| 4 | 800 Block of MISSION ST | 345 | -122.405395 | 37.78351 |
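
The merge above relies on each address appearing exactly once on both sides. `pd.merge` can check that assumption through its `validate` argument, which raises a `MergeError` if a key is duplicated. The sketch below repeats the count-then-merge steps on a three-row toy dataframe (`incidents`, `counts`, `coords` and `merged` are hypothetical names, and the toy counts are not the real ones).

```python
import pandas as pd

# Toy incidents: coordinates copied from the table above, row counts made up.
incidents = pd.DataFrame({
    'Address': ['800 Block of BRYANT ST', '800 Block of BRYANT ST',
                '800 Block of MARKET ST'],
    'X': [-122.403405, -122.403405, -122.407634],
    'Y': [37.775421, 37.775421, 37.784189],
})

# Incidents per address, with explicit column names.
counts = (incidents['Address'].value_counts()
          .rename_axis('Address')
          .reset_index(name='Count'))

# One coordinate pair per address: the first one seen, as in the notebook.
coords = incidents[['Address', 'X', 'Y']].drop_duplicates(subset='Address',
                                                          keep='first')

# validate='one_to_one' fails loudly if either side repeats an address.
merged = pd.merge(counts, coords, on='Address', how='inner',
                  validate='one_to_one')
print(merged)
```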
