# Introduction

## The Problem

A security technology company wants to offer a new surveillance device to businesses in Toronto. As part of its market study, the company wants to focus on commercial premises located in neighborhoods that have experienced break-and-enter incidents. The study needs data from the last year to identify the neighborhoods where this kind of crime is most frequent.

### Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [36]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


# Download and Explore Dataset

In order to segment the neighborhoods and explore them, we need a dataset that contains the break-and-enter incidents reported in Toronto, the neighborhood where each incident took place, and the latitude and longitude coordinates of each incident.

Luckily, this dataset is freely available on the web. Feel free to look for it on your own, but here is the link: https://data.torontopolice.on.ca/pages/break-and-enter

#### Transform the data into a *pandas* dataframe

In [37]:
toronto_df = pd.read_csv('/Users/fabiancamargo/Downloads/Break_and_Enter_2014_to_2019.csv')

In [38]:
toronto_df.shape

(43302, 27)

Quickly examine the resulting dataframe.

In [39]:
toronto_df.head()

Unnamed: 0,Index_,event_unique_id,occurrencedate,reporteddate,premisetype,ucr_code,ucr_ext,offence,reportedyear,reportedmonth,reportedday,reporteddayofyear,reporteddayofweek,reportedhour,occurrenceyear,occurrencemonth,occurrenceday,occurrencedayofyear,occurrencedayofweek,occurrencehour,MCI,Division,Hood_ID,Neighbourhood,Long,Lat,ObjectId
0,714,GO-20141857431,1397050200000,1397050740000,House,2120,200,B&E,2014,April,9,99,Wednesday,13,2014.0,April,9.0,99.0,Wednesday,13,Break and Enter,D22,11,Eringate-Centennial-West Deane (11),-79.582176,43.661335,1
1,715,GO-20141859201,1397055600000,1397067540000,House,2120,200,B&E,2014,April,9,99,Wednesday,18,2014.0,April,9.0,99.0,Wednesday,15,Break and Enter,D33,47,Don Valley Village (47),-79.362968,43.773071,2
2,716,GO-20141866077,1397134800000,1397157060000,House,2120,220,B&E W'Intent,2014,April,10,100,Thursday,19,2014.0,April,10.0,100.0,Thursday,13,Break and Enter,D55,69,Blake-Jones (69),-79.332382,43.681484,3
3,719,GO-20141915866,1397816160000,1397844000000,Commercial,2120,200,B&E,2014,April,18,108,Friday,18,2014.0,April,18.0,108.0,Friday,10,Break and Enter,D53,98,Rosedale-Moore Park (98),-79.386787,43.670227,4
4,724,GO-20141965079,1398545460000,1398545460000,House,2120,200,B&E,2014,April,26,116,Saturday,20,2014.0,April,26.0,116.0,Saturday,20,Break and Enter,D53,41,Bridle Path-Sunnybrook-York Mills (41),-79.38118,43.725376,5


The toronto_df dataframe contains only 'break and enter' crimes from 2014 to 2019.
The dataframe has the columns that we need to solve this problem:
- Neighbourhood column
- Long and Lat columns for geographic coordinates
- reportedyear column for the year
- premisetype column for the type of property

The dataset has 43302 rows. We only need the rows with reportedyear = 2019 and premisetype = Commercial.
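Because the filter compares against exact strings, it can help to confirm that the columns listed above are present and to check how the premisetype values are spelled before filtering. The following is a minimal sketch on a small hand-built frame whose values are copied from the preview above; `sample_df` and `required_columns` are illustrative names that are not part of the original notebook.

```python
import pandas as pd

# Illustrative sample: values copied from the first five rows shown above.
sample_df = pd.DataFrame({
    "premisetype": ["House", "House", "House", "Commercial", "House"],
    "reportedyear": [2014, 2014, 2014, 2014, 2014],
    "Neighbourhood": [
        "Eringate-Centennial-West Deane (11)",
        "Don Valley Village (47)",
        "Blake-Jones (69)",
        "Rosedale-Moore Park (98)",
        "Bridle Path-Sunnybrook-York Mills (41)",
    ],
    "Long": [-79.582176, -79.362968, -79.332382, -79.386787, -79.38118],
    "Lat": [43.661335, 43.773071, 43.681484, 43.670227, 43.725376],
})

# Columns the analysis relies on (assumed list, taken from the bullets above).
required_columns = ["Neighbourhood", "Long", "Lat", "reportedyear", "premisetype"]

# Report any required column that is absent from the frame.
missing = [col for col in required_columns if col not in sample_df.columns]
print("Missing columns:", missing)

# List the distinct category values so the filter string can be matched exactly.
premise_types = sorted(sample_df["premisetype"].unique())
print("Premise types:", premise_types)
```

On the full dataset the same two checks would be run against `toronto_df` instead of the sample frame.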

In [40]:
toronto_df = toronto_df[(toronto_df['premisetype']=='Commercial') & (toronto_df['reportedyear']==2019)] 
toronto_df.head()

Unnamed: 0,Index_,event_unique_id,occurrencedate,reporteddate,premisetype,ucr_code,ucr_ext,offence,reportedyear,reportedmonth,reportedday,reporteddayofyear,reporteddayofweek,reportedhour,occurrenceyear,occurrencemonth,occurrenceday,occurrencedayofyear,occurrencedayofweek,occurrencehour,MCI,Division,Hood_ID,Neighbourhood,Long,Lat,ObjectId
27000,132303,GO-2019937686,1558585380000,1558585380000,Commercial,2120,200,B&E,2019,May,23,143,Thursday,4,2019.0,May,23.0,143.0,Thursday,4,Break and Enter,D41,126,Dorset Park (126),-79.281075,43.765804,27001
27002,132307,GO-2019855582,1557514800000,1557572280000,Commercial,2120,200,B&E,2019,May,11,131,Saturday,10,2019.0,May,10.0,130.0,Friday,19,Break and Enter,D43,136,West Hill (136),-79.186867,43.770615,27003
27009,132320,GO-2019334265,1550775600000,1550832660000,Commercial,2120,200,B&E,2019,February,22,53,Friday,10,2019.0,February,21.0,52.0,Thursday,19,Break and Enter,D55,65,Greenwood-Coxwell (65),-79.32399,43.671844,27010
27010,132321,GO-2019463207,1552510800000,1552555020000,Commercial,2120,200,B&E,2019,March,14,73,Thursday,9,2019.0,March,13.0,72.0,Wednesday,21,Break and Enter,D51,72,Regent Park (72),-79.364464,43.655617,27011
27031,131558,GO-20199017508,1559737200000,1559745060000,Commercial,2120,200,B&E,2019,June,5,156,Wednesday,14,2019.0,June,5.0,156.0,Wednesday,12,Break and Enter,D52,76,Bay Street Corridor (76),-79.383423,43.66214,27032


In [41]:
toronto_df.shape

(3263, 27)
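With the data reduced to commercial premises in 2019, one possible next step toward finding the neighborhoods where this crime is high is to count incidents per neighborhood. The sketch below is an assumption about how that could be done, not the notebook's own method. It uses a small hand-built frame with the five neighborhoods shown in the filtered preview above, so every count is 1 here; `sample_2019` and `incident_counts` are illustrative names.

```python
import pandas as pd

# Illustrative sample: neighbourhoods copied from the filtered preview above.
sample_2019 = pd.DataFrame({
    "Neighbourhood": [
        "Dorset Park (126)",
        "West Hill (136)",
        "Greenwood-Coxwell (65)",
        "Regent Park (72)",
        "Bay Street Corridor (76)",
    ],
    "premisetype": ["Commercial"] * 5,
})

# Count rows per neighbourhood and sort from most to fewest incidents.
incident_counts = (
    sample_2019.groupby("Neighbourhood")
    .size()
    .sort_values(ascending=False)
    .rename("incidents")
    .reset_index()
)

print(incident_counts)
```

Applied to the filtered `toronto_df`, the same chain would yield one row per neighborhood, with the per-neighborhood counts summing to the number of rows in the frame.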