# The Battle of the Neighborhoods

## Table of contents
1. [Background: Business Problem](#business_problem)
1. [Crime Dataset Justification and Exploration](#data)
1. [Analysis](#analysis)
1. [Data Modeling](#modeling)
1. [Results and Discussion](#results)
1. [Conclusion](#conclusion)

## 1. Background: Business Problem <a name='business_problem'> 

Suppose someone plans to move to Vancouver and is deciding which neighborhood to live in. They might want to know which neighborhoods perform better in terms of crime rate. More interesting still is which neighborhoods have a low rate of a specific crime, such as **break and enter** or **vehicle accidents**. It all depends on why one is moving there.

All other factors held constant, an investor who wants to build business infrastructure might be interested in neighborhoods with a low **commercial break and enter** rate, while a couple raising a family might be more keen on neighborhoods with a low **residential break and enter** rate. This information is useful to a large group of people, whatever their reasons for moving to Vancouver or, given the data, to any other location in the world.

We will use the **2019 Vancouver crime dataset** to cluster neighborhoods by the rates of specific crimes, showing which are better suited to living in or to doing business in. Further, we will transform the Foursquare neighborhood data into a dataframe that can be merged with the clusters to see whether there is a high-level relationship between the number of crimes and the number of specific venues.

### Essential Imports

Let's get our environment ready and make the necessary imports before we get started

In [2]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a json file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


## 2. Crime Dataset Justification and Exploration<a name='data'>

For this exercise, I have obtained two datasets:
1. <a href='https://data.vancouver.ca/datacatalogue/crime-data-details.htm?'>Vancouver crime report data</a>; the raw data is the CSV version available <a href='https://data.vancouver.ca/datacatalogue/crime-data-details.htm?'>here</a>
1. *Foursquare location data*, to see how Foursquare venues fit into our crime clusters

We will be comparing crimes with the venues dataset, and we do not know when each venue was built, so we cannot relate a venue to a crime from an earlier year. We do know that the venues exist now (2019), so we will consider the 2019 crime dataset only. The data is stored on <a href='https://raw.githubusercontent.com/dumikaiya/Coursera_Capstone/master/vancouver_crime_data_2019.csv'>GitHub</a>.

#### Data Retrieval

Let's dive right into it by loading the data for analysis

In [3]:
url = 'https://raw.githubusercontent.com/dumikaiya/Coursera_Capstone/master/vancouver_crime_data_2019.csv'
url

'https://raw.githubusercontent.com/dumikaiya/Coursera_Capstone/master/vancouver_crime_data_2019.csv'

## 3. Analysis <a name='analysis'>

We will start by reading our data into a dataframe

In [4]:
vancouver_crime_df = pd.read_csv(url)
vancouver_crime_df.head()

Unnamed: 0,TYPE,YEAR,MONTH,DAY,HOUR,MINUTE,HUNDRED_BLOCK,NEIGHBOURHOOD,X,Y
0,Theft from Vehicle,2019,4,2,16.0,0.0,5XX CARRALL ST,Central Business District,492408.03,5458534.62
1,Theft from Vehicle,2019,2,20,9.0,47.0,5XX CARRALL ST,Central Business District,492403.16,5458628.36
2,Other Theft,2019,3,6,15.0,37.0,7XX GRANVILLE ST,Central Business District,491396.06,5458846.22
3,Other Theft,2019,2,22,19.0,15.0,11XX ROBSON ST,West End,490910.24,5459118.24
4,Offence Against a Person,2019,4,3,,,OFFSET TO PROTECT PRIVACY,,0.0,0.0


Let's look at the shape of our dataframe

In [5]:
vancouver_crime_df.shape

(20831, 10)

#### Data Wrangling

It looks like some records have no value for the neighborhood. Let's start by sorting that problem out. Rows with missing values make up less than 9% of the whole dataset (1,822 of 20,831 rows), so it is reasonable to drop those rows; a smaller dataset is fine for this exercise.

In [6]:
vancouver_crime_df.dropna(inplace=True)
vancouver_crime_df.reset_index(drop=True).head()

Unnamed: 0,TYPE,YEAR,MONTH,DAY,HOUR,MINUTE,HUNDRED_BLOCK,NEIGHBOURHOOD,X,Y
0,Theft from Vehicle,2019,4,2,16.0,0.0,5XX CARRALL ST,Central Business District,492408.03,5458534.62
1,Theft from Vehicle,2019,2,20,9.0,47.0,5XX CARRALL ST,Central Business District,492403.16,5458628.36
2,Other Theft,2019,3,6,15.0,37.0,7XX GRANVILLE ST,Central Business District,491396.06,5458846.22
3,Other Theft,2019,2,22,19.0,15.0,11XX ROBSON ST,West End,490910.24,5459118.24
4,Theft from Vehicle,2019,2,25,20.0,30.0,10XX BARCLAY ST,West End,490848.76,5458857.79
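
The `dropna` call above removes a row when any column is missing, including **HOUR** and **MINUTE**. If the aim is only to discard records without a neighborhood, a narrower call is possible. The following is a minimal sketch on a toy dataframe; the rows are made up for illustration and `toy_df` is a hypothetical name.

```python
import pandas as pd

# Toy frame for illustration only; it mimics a few columns of the crime data
toy_df = pd.DataFrame({
    'TYPE': ['Theft from Vehicle', 'Other Theft', 'Offence Against a Person', 'Mischief'],
    'HOUR': [16.0, None, None, 9.0],
    'NEIGHBOURHOOD': ['West End', 'Kitsilano', None, 'Sunset'],
})

# Share of rows that have no neighborhood
missing_share = toy_df['NEIGHBOURHOOD'].isna().mean()

# Drop only the rows whose NEIGHBOURHOOD is missing, then renumber the index
by_subset = toy_df.dropna(subset=['NEIGHBOURHOOD']).reset_index(drop=True)

# Drop rows with a missing value in any column (what the cell above does)
by_any = toy_df.dropna().reset_index(drop=True)

print(missing_share, len(by_subset), len(by_any))
```

Note that `reset_index` returns a new dataframe, so the result has to be assigned if the renumbered index is meant to persist.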


We are mainly interested in the **NEIGHBOURHOOD** and **TYPE** columns, so let's drop most of the others. We will also rename *TYPE* to *CRIME_TYPE* to make the data easier to read. We keep the **X** and **Y** columns because we will need them to work out the latitude and longitude of the **Central Business District**, a name that the geopy library will not recognize as an address.

In [7]:
vancouver_crime_filtered_df = vancouver_crime_df[['TYPE', 'NEIGHBOURHOOD', 'X', 'Y']].rename(columns={'TYPE':'CRIME_TYPE'})
vancouver_crime_filtered_df.head()

Unnamed: 0,CRIME_TYPE,NEIGHBOURHOOD,X,Y
0,Theft from Vehicle,Central Business District,492408.03,5458534.62
1,Theft from Vehicle,Central Business District,492403.16,5458628.36
2,Other Theft,Central Business District,491396.06,5458846.22
3,Other Theft,West End,490910.24,5459118.24
5,Theft from Vehicle,West End,490848.76,5458857.79


Let's list and count the unique neighborhoods we have...

In [8]:
# define the dataframe columns
column_names = ['crime_type', 'neighborhood'] 

# instantiate the dataframe
vancouver_crimes_df = pd.DataFrame(columns=column_names)

In [9]:
unique_neighborhoods = vancouver_crime_filtered_df['NEIGHBOURHOOD'].unique()
print('There are {} neighborhoods in our dataset:'.format(len(unique_neighborhoods)))
for neigh in unique_neighborhoods:
    print(neigh)

There are 24 neighborhoods in our dataset:
Central Business District
West End
Riley Park
Kerrisdale
Marpole
Kensington-Cedar Cottage
Stanley Park
Mount Pleasant
Renfrew-Collingwood
Dunbar-Southlands
Strathcona
Kitsilano
Grandview-Woodland
Hastings-Sunrise
Sunset
Victoria-Fraserview
West Point Grey
Oakridge
Fairview
Killarney
South Cambie
Shaughnessy
Arbutus Ridge
Musqueam


#### The Longitude-Latitude Neighborhood Coordinates

Even though our crime data identifies the **Central Business District** as a neighborhood, geopy won't recognize it as one. What we can do instead is take the mean of the **X** and **Y** coordinates of the crimes recorded in the Central Business District, convert that point to latitude and longitude, and assign the result to the neighborhood "Central Business District".

So first of all let's isolate the Central Business District

In [10]:
is_cbd_geo = vancouver_crime_filtered_df['NEIGHBOURHOOD'] == 'Central Business District'
cbd_df = vancouver_crime_filtered_df[is_cbd_geo]
cbd_df.head()

Unnamed: 0,CRIME_TYPE,NEIGHBOURHOOD,X,Y
0,Theft from Vehicle,Central Business District,492408.03,5458534.62
1,Theft from Vehicle,Central Business District,492403.16,5458628.36
2,Other Theft,Central Business District,491396.06,5458846.22
6,Mischief,Central Business District,492452.86,5458751.7
7,Mischief,Central Business District,492452.86,5458751.7


Get the mean XY coordinates of the Central Business District...

In [11]:
mean_cbd_X = cbd_df['X'].mean()
mean_cbd_Y = cbd_df['Y'].mean()
print(mean_cbd_X, mean_cbd_Y)

491657.10144641506 5458752.136818845


... and define a function that converts our XY coordinates to latitude and longitude system

In [12]:
# installing the library for conversion
!pip install pyproj

# import the library
import pyproj

# function to convert from xy to latlong
def xy_to_latlong(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=10, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]



Get the latitude-longitude coordinates of the **Central Business District** by calling the conversion function we defined, passing the mean X and Y values as arguments

In [13]:
cbd_lon, cbd_lat = xy_to_latlong(mean_cbd_X, mean_cbd_Y)
cbd_lon = float('%.7f' % cbd_lon) # The standard coordinates are formatted to 7 decimal places
cbd_lat = float('%.7f' % cbd_lat)
print('The approximate Central Business District corresponding longitude and latitude are {}, {} respectively.'.format(cbd_lon, cbd_lat))

The approximate Central Business District corresponding longitude and latitude are -123.1147114, 49.2814662 respectively.


**Longitude and Latitude coordinates of the rest of the neighborhoods using the geopy library**

We first of all remove the Central Business District from our numpy array

In [14]:
new_unique_neighborhoods = np.delete(unique_neighborhoods, 0)
new_unique_neighborhoods

array(['West End', 'Riley Park', 'Kerrisdale', 'Marpole',
       'Kensington-Cedar Cottage', 'Stanley Park', 'Mount Pleasant',
       'Renfrew-Collingwood', 'Dunbar-Southlands', 'Strathcona',
       'Kitsilano', 'Grandview-Woodland', 'Hastings-Sunrise', 'Sunset',
       'Victoria-Fraserview', 'West Point Grey', 'Oakridge', 'Fairview',
       'Killarney', 'South Cambie', 'Shaughnessy', 'Arbutus Ridge',
       'Musqueam'], dtype=object)

...then we use the geopy library to get the coordinates of the neighborhoods storing them in a dictionary

In [15]:
# Also we will initiate our dictionary with the coordinates of the Central Business District
coordinates = {'Central Business District':[cbd_lon, cbd_lat]}
for neigh in new_unique_neighborhoods:
    
    address = 'Vancouver, {}'.format(neigh)

    geolocator = Nominatim(user_agent="vancouver_agent")
    location = geolocator.geocode(address)
    coordinates[neigh] = [location.longitude, location.latitude]

coordinates

{'Central Business District': [-123.1147114, 49.2814662],
 'West End': [-123.1317949, 49.2841308],
 'Riley Park': [-123.1029664, 49.2474381],
 'Kerrisdale': [-123.1553893, 49.2346728],
 'Marpole': [-123.1361495, 49.2092233],
 'Kensington-Cedar Cottage': [-123.0842067, 49.2476321],
 'Stanley Park': [-123.141540528377, 49.3019112],
 'Mount Pleasant': [-122.2320343, 45.5728966],
 'Renfrew-Collingwood': [-123.0576794, 49.2420242],
 'Dunbar-Southlands': [-123.1850439, 49.2534601],
 'Strathcona': [-125.702556961241, 49.59294945],
 'Kitsilano': [-123.155267, 49.2694099],
 'Grandview-Woodland': [-123.0679417, 49.2705588],
 'Hastings-Sunrise': [-123.0439199, 49.2775935],
 'Sunset': [-123.0902386, 49.2195929],
 'Victoria-Fraserview': [-123.0732871, 49.2184156],
 'West Point Grey': [-123.1950217, 49.2640192],
 'Oakridge': [-123.1311342, 49.2308288],
 'Fairview': [-123.1268352, 49.2641128],
 'Killarney': [-123.0462504, 49.2242738],
 'South Cambie': [-123.120915, 49.2466847],
 'Shaughnessy': [-123.
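
Two of the geocoded results above look suspicious: Mount Pleasant resolves to a latitude near 45.57 and Strathcona to a longitude near -125.70, both far from the other neighborhoods. A bounding-box check can flag such results before they are used. The sketch below is one way to do that; the box limits are rough assumptions chosen for illustration, not an official city boundary, and `outside_bbox` is a hypothetical helper.

```python
# Rough bounding box around the City of Vancouver. These limits are an
# assumption for illustration, not an official boundary.
VANCOUVER_BBOX = {'min_lon': -123.30, 'max_lon': -123.00,
                  'min_lat': 49.19, 'max_lat': 49.33}

def outside_bbox(coords, bbox=VANCOUVER_BBOX):
    """Return names whose [longitude, latitude] pair falls outside bbox."""
    flagged = []
    for name, (lon, lat) in coords.items():
        in_lon = bbox['min_lon'] <= lon <= bbox['max_lon']
        in_lat = bbox['min_lat'] <= lat <= bbox['max_lat']
        if not (in_lon and in_lat):
            flagged.append(name)
    return flagged

# A few entries copied from the dictionary printed above
sample = {
    'West End': [-123.1317949, 49.2841308],
    'Mount Pleasant': [-122.2320343, 45.5728966],
    'Strathcona': [-125.702556961241, 49.59294945],
    'Kitsilano': [-123.155267, 49.2694099],
}
print(outside_bbox(sample))
```

For flagged names, a more specific query string (for example one that adds the province and country) may help the geocoder pick the Vancouver neighborhood, though that is not guaranteed.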

**Let's close the business of neighborhoods coordinates by creating a dataframe from our just created dictionary**

In [16]:
# build the dataframe directly from the dictionary;
# each value is stored as [longitude, latitude]
columns = ['neighborhood', 'latitude', 'longitude']
vancouver_data = pd.DataFrame(
    [(name, lat, lon) for name, (lon, lat) in coordinates.items()],
    columns=columns)
vancouver_data.head()

Unnamed: 0,neighborhood,latitude,longitude
0,Central Business District,49.281466,-123.114711
1,West End,49.284131,-123.131795
2,Riley Park,49.247438,-123.102966
3,Kerrisdale,49.234673,-123.155389
4,Marpole,49.209223,-123.13615


#### Crime Data Analysis

We'll start by creating a dataframe of each crime type and the neighborhood in which it occurred

In [17]:
# select and rename the two columns in one step instead of appending row by row
vancouver_crimes_df = (vancouver_crime_filtered_df[['CRIME_TYPE', 'NEIGHBOURHOOD']]
                       .rename(columns={'CRIME_TYPE': 'crime_type', 'NEIGHBOURHOOD': 'neighborhood'})
                       .reset_index(drop=True))
vancouver_crimes_df.head()

Unnamed: 0,crime_type,neighborhood
0,Theft from Vehicle,Central Business District
1,Theft from Vehicle,Central Business District
2,Other Theft,Central Business District
3,Other Theft,West End
4,Theft from Vehicle,West End


How many crimes in 2019 happened in which neighborhood?

In [18]:
vancouver_crimes_df.groupby('neighborhood').count()

Unnamed: 0_level_0,crime_type
neighborhood,Unnamed: 1_level_1
Arbutus Ridge,147
Central Business District,6243
Dunbar-Southlands,168
Fairview,1085
Grandview-Woodland,967
Hastings-Sunrise,723
Kensington-Cedar Cottage,800
Kerrisdale,209
Killarney,277
Kitsilano,777


**Since this is categorical data, we will do a one-hot encoding of the crime data to make analysis possible**

In [20]:
# one hot encoding
vancouver_crimes_onehot = pd.get_dummies(vancouver_crimes_df[['crime_type']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
vancouver_crimes_onehot['neighborhood'] = vancouver_crimes_df['neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [vancouver_crimes_onehot.columns[-1]] + list(vancouver_crimes_onehot.columns[:-1])
vancouver_crimes_onehot = vancouver_crimes_onehot[fixed_columns]

vancouver_crimes_onehot.head()

Unnamed: 0,neighborhood,Break and Enter Commercial,Break and Enter Residential/Other,Mischief,Other Theft,Theft from Vehicle,Theft of Bicycle,Theft of Vehicle,Vehicle Collision or Pedestrian Struck (with Fatality),Vehicle Collision or Pedestrian Struck (with Injury)
0,Central Business District,0,0,0,0,1,0,0,0,0
1,Central Business District,0,0,0,0,1,0,0,0,0
2,Central Business District,0,0,0,1,0,0,0,0,0
3,West End,0,0,0,1,0,0,0,0,0
4,West End,0,0,0,0,1,0,0,0,0


What is the shape of our one-hot encoded data?

In [21]:
vancouver_crimes_onehot.shape

(19009, 10)

**Let's group rows by neighborhood and take the mean of the frequency of occurrence of each crime category**

In [22]:
vancouver_crimes_grouped = vancouver_crimes_onehot.groupby('neighborhood').mean().reset_index()
vancouver_crimes_grouped.head()

Unnamed: 0,neighborhood,Break and Enter Commercial,Break and Enter Residential/Other,Mischief,Other Theft,Theft from Vehicle,Theft of Bicycle,Theft of Vehicle,Vehicle Collision or Pedestrian Struck (with Fatality),Vehicle Collision or Pedestrian Struck (with Injury)
0,Arbutus Ridge,0.027211,0.183673,0.14966,0.108844,0.401361,0.040816,0.034014,0.0,0.054422
1,Central Business District,0.049656,0.011853,0.153292,0.201185,0.517059,0.039244,0.012975,0.00016,0.014576
2,Dunbar-Southlands,0.011905,0.166667,0.202381,0.077381,0.416667,0.029762,0.02381,0.0,0.071429
3,Fairview,0.081106,0.045161,0.122581,0.17788,0.437788,0.094931,0.017512,0.0,0.023041
4,Grandview-Woodland,0.055843,0.084798,0.202689,0.134436,0.377456,0.055843,0.057911,0.001034,0.02999
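
Because every one-hot row contains exactly one 1, the mean of a crime column within a neighborhood is simply that crime's share of the neighborhood's incidents, and each row of the grouped table sums to 1. The sketch below reproduces the same numbers with plain counting; the records are toy data made up for illustration.

```python
from collections import Counter, defaultdict

# Toy (neighborhood, crime_type) records, made up for illustration
records = [
    ('West End', 'Other Theft'),
    ('West End', 'Theft from Vehicle'),
    ('West End', 'Theft from Vehicle'),
    ('Sunset', 'Mischief'),
]

# Count each crime type per neighborhood
counts = defaultdict(Counter)
for neighborhood, crime_type in records:
    counts[neighborhood][crime_type] += 1

# Dividing each count by the neighborhood total gives the same number as
# the mean of that crime type's 0/1 one-hot column within the neighborhood
shares = {
    neighborhood: {crime: n / sum(c.values()) for crime, n in c.items()}
    for neighborhood, c in counts.items()
}
print(shares['West End'])
```

One consequence is that the grouped table describes the mix of crimes, not their volume: the Central Business District with 6243 incidents and Arbutus Ridge with 147 are represented on the same 0 to 1 scale.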


**What does our new dataframe look like?**

In [23]:
vancouver_crimes_grouped.shape

(24, 10)

Since we have turned to frequency analysis, let's create a function that returns the most frequently occurring events. This function will be reused to find the most common venues in the Foursquare data.

In [24]:
def most_common_event(row, top_event):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:top_event]


**Feature Extraction**

Most frequently occurring crimes

In [25]:
top_crimes = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top crimes
columns = ['neighborhood']
for ind in np.arange(top_crimes):
    try:
        columns.append('{}{} Most Common Crime'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Crime'.format(ind+1))

# create a new dataframe
neighborhoods_crimes_sorted = pd.DataFrame(columns=columns)
neighborhoods_crimes_sorted['neighborhood'] = vancouver_crimes_grouped['neighborhood']

for ind in np.arange(vancouver_crimes_grouped.shape[0]):
    neighborhoods_crimes_sorted.iloc[ind, 1:] = most_common_event(vancouver_crimes_grouped.iloc[ind, :], top_crimes)

neighborhoods_crimes_sorted.head()

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime
0,Arbutus Ridge,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Other Theft,Vehicle Collision or Pedestrian Struck (with I...
1,Central Business District,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle
2,Dunbar-Southlands,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Other Theft,Vehicle Collision or Pedestrian Struck (with I...
3,Fairview,Theft from Vehicle,Other Theft,Mischief,Theft of Bicycle,Break and Enter Commercial
4,Grandview-Woodland,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Vehicle
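
The `try`/`except` lookup above labels the first three columns correctly and falls back to "th" for the rest, which is fine for a top five but would produce labels such as "21th" for longer lists. A small helper is one alternative; `ordinal` below is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # covers 11th, 12th and 13th
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

top_crimes = 5
columns = ['neighborhood'] + [
    '{} Most Common Crime'.format(ordinal(i)) for i in range(1, top_crimes + 1)
]
print(columns)
```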


## 4. Data Modeling<a name='modeling'>

We will use **k-means** clustering (scikit-learn's `KMeans`) to model our data

In [26]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 3

vancouver_crimes_grouped_clustering = vancouver_crimes_grouped.drop(columns='neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(vancouver_crimes_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 2, 1, 1, 0, 0, 2, 0, 0], dtype=int32)
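
To make the labels above less of a black box, here is a tiny pure-Python sketch of the idea behind k-means: alternate between assigning each row to its nearest centroid and moving each centroid to the mean of its rows. It is a simplified illustration, not scikit-learn's implementation, which among other things uses a smarter initialization. The function name and the toy rows are assumptions made for this example.

```python
def kmeans_sketch(points, k, n_iter=10):
    """Tiny Lloyd's-algorithm sketch; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy two-feature rows, made up for illustration
toy = [(0.50, 0.10), (0.10, 0.50), (0.52, 0.12), (0.12, 0.48)]
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
```

In the notebook each row is a neighborhood and each feature is the share of one crime type, so neighborhoods with a similar crime mix end up with the same label.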

In [27]:
# add clustering labels
neighborhoods_crimes_sorted.insert(0, 'cluster_labels', kmeans.labels_)

vancouver_crimes_merged = vancouver_data

# merge neighborhoods_crimes_sorted with vancouver_data to add latitude/longitude for each neighborhood
vancouver_crimes_merged = vancouver_crimes_merged.join(neighborhoods_crimes_sorted.set_index('neighborhood'), on='neighborhood')

vancouver_crimes_merged.head() # check the last columns!


Unnamed: 0,neighborhood,latitude,longitude,cluster_labels,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime
0,Central Business District,49.281466,-123.114711,0,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle
1,West End,49.284131,-123.131795,0,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle
2,Riley Park,49.247438,-123.102966,0,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Other Theft,Theft of Vehicle
3,Kerrisdale,49.234673,-123.155389,2,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Vehicle Collision or Pedestrian Struck (with I...,Other Theft
4,Marpole,49.209223,-123.13615,1,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Break and Enter Commercial


We are going to create a map centered on **Riley Park**, so let's use the latitude and longitude that geopy returned for it above.

In [28]:
# Coordinate of Riley Park
latitude = 49.247438
longitude = -123.102966

In [29]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(vancouver_crimes_merged['latitude'], vancouver_crimes_merged['longitude'], vancouver_crimes_merged['neighborhood'], vancouver_crimes_merged['cluster_labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Our model has generated 3 clusters. For now we will store them in dataframes that will be used later in our **Results and Discussion** section

In [30]:
cluster_zero = vancouver_crimes_merged.loc[vancouver_crimes_merged['cluster_labels'] == 0, vancouver_crimes_merged.columns[[0] + list(range(4, vancouver_crimes_merged.shape[1]))]]
cluster_one = vancouver_crimes_merged.loc[vancouver_crimes_merged['cluster_labels'] == 1, vancouver_crimes_merged.columns[[0] + list(range(4, vancouver_crimes_merged.shape[1]))]]
cluster_two = vancouver_crimes_merged.loc[vancouver_crimes_merged['cluster_labels'] == 2, vancouver_crimes_merged.columns[[0] + list(range(4, vancouver_crimes_merged.shape[1]))]]

#### Explore the Foursquare venues

Our goal here is to create a dataframe like **neighborhoods_crimes_sorted**, but for venues rather than crimes; in the end it will be called **neighborhoods_events_sorted**

Of particular interest is what kinds of venues are found in our clustered neighborhoods, so let's run a quick analysis to get an overview from the Foursquare data.
Before anything else, we'll define some important parameters for calling the Foursquare API
* CLIENT_ID
* CLIENT_SECRET
* VERSION
* LIMIT

The cell is hidden for privacy's sake

In [31]:
# The code was removed by Watson Studio for sharing.

Define a function that gets the venues surrounding each neighborhood in Vancouver

In [33]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
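
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response without groups, or a venue without a category, would raise an exception. The sketch below shows a more defensive way to flatten a payload of the same shape. `extract_venues` is a hypothetical helper, and the payload is hand-written for illustration, not real API output.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0].get('name') if categories else None
        rows.append((venue.get('name'), location.get('lat'),
                     location.get('lng'), category))
    return rows

# Hand-written payload in the shape the function above indexes into;
# the venue values are invented for illustration, not real API output
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 49.28, 'lng': -123.11},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 49.27, 'lng': -123.12},
               'categories': []}},
]}]}}

print(extract_venues(fake_payload))
print(extract_venues({'response': {}}))
```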

In [34]:
vancouver_venues = getNearbyVenues(names=vancouver_data['neighborhood'],
                                   latitudes=vancouver_data['latitude'],
                                   longitudes=vancouver_data['longitude']
                                  )

Central Business District
West End
Riley Park
Kerrisdale
Marpole
Kensington-Cedar Cottage
Stanley Park
Mount Pleasant
Renfrew-Collingwood
Dunbar-Southlands
Strathcona
Kitsilano
Grandview-Woodland
Hastings-Sunrise
Sunset
Victoria-Fraserview
West Point Grey
Oakridge
Fairview
Killarney
South Cambie
Shaughnessy
Arbutus Ridge
Musqueam


In [35]:
print(vancouver_venues.shape)
vancouver_venues.head()

(339, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Central Business District,49.281466,-123.114711,Gotham Steakhouse & Cocktail Bar,49.28283,-123.115865,Steakhouse
1,Central Business District,49.281466,-123.114711,Medina Café,49.280565,-123.116859,Breakfast Spot
2,Central Business District,49.281466,-123.114711,Queen Elizabeth Theatre,49.280229,-123.11273,Theater
3,Central Business District,49.281466,-123.114711,L'Hermitage,49.280139,-123.11748,Hotel
4,Central Business District,49.281466,-123.114711,Finch’s Tea & Coffee House,49.282724,-123.111941,Sandwich Place


In [36]:
vancouver_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Arbutus Ridge,4,4,4,4,4,4
Central Business District,30,30,30,30,30,30
Dunbar-Southlands,6,6,6,6,6,6
Fairview,26,26,26,26,26,26
Grandview-Woodland,30,30,30,30,30,30
Hastings-Sunrise,15,15,15,15,15,15
Kensington-Cedar Cottage,20,20,20,20,20,20
Kerrisdale,30,30,30,30,30,30
Killarney,4,4,4,4,4,4
Kitsilano,30,30,30,30,30,30


### We will now analyze each neighborhood

In [37]:
# one hot encoding
vancouver_venues_onehot = pd.get_dummies(vancouver_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
vancouver_venues_onehot['Neighborhood'] = vancouver_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [vancouver_venues_onehot.columns[-1]] + list(vancouver_venues_onehot.columns[:-1])
vancouver_venues_onehot = vancouver_venues_onehot[fixed_columns]

vancouver_venues_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Beach,...,Tea Room,Thai Restaurant,Theater,Tiki Bar,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Shop,Yoga Studio
0,Central Business District,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Central Business District,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Central Business District,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,Central Business District,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Central Business District,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


What is the shape of our dataframe?

In [38]:
vancouver_venues_onehot.shape

(339, 115)

In [39]:
vancouver_venues_grouped = vancouver_venues_onehot.groupby('Neighborhood').mean().reset_index()
vancouver_venues_grouped

Unnamed: 0,Neighborhood,American Restaurant,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Beach,...,Tea Room,Thai Restaurant,Theater,Tiki Bar,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Shop,Yoga Studio
0,Arbutus Ridge,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Central Business District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.033333,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Dunbar-Southlands,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Fairview,0.0,0.0,0.076923,0.038462,0.0,0.0,0.038462,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.0
4,Grandview-Woodland,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0
5,Hastings-Sunrise,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,0.0
6,Kensington-Cedar Cottage,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0
7,Kerrisdale,0.0,0.0,0.033333,0.0,0.0,0.033333,0.033333,0.0,0.0,...,0.066667,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0
8,Killarney,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Kitsilano,0.033333,0.0,0.033333,0.0,0.0,0.1,0.0,0.033333,0.033333,...,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333


We will reuse the **most_common_event** function defined above, this time to find the most common venues rather than the most common crimes

In [40]:
top_events = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(top_events):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_events_sorted = pd.DataFrame(columns=columns)
neighborhoods_events_sorted['Neighborhood'] = vancouver_venues_grouped['Neighborhood']

for ind in np.arange(vancouver_venues_grouped.shape[0]):
    neighborhoods_events_sorted.iloc[ind, 1:] = most_common_event(vancouver_venues_grouped.iloc[ind, :], top_events)

neighborhoods_events_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Arbutus Ridge,Bakery,Pet Store,Grocery Store,Nightlife Spot,Yoga Studio
1,Central Business District,Coffee Shop,Hotel,Steakhouse,Theater,Poke Place
2,Dunbar-Southlands,Sushi Restaurant,Italian Restaurant,Coffee Shop,Liquor Store,Fast Food Restaurant
3,Fairview,Coffee Shop,Park,Asian Restaurant,Sushi Restaurant,Korean Restaurant
4,Grandview-Woodland,Italian Restaurant,Pizza Place,Coffee Shop,Sushi Restaurant,Indian Restaurant


## 5. Results and Discussion <a name='results'> 

Let's examine the three clusters that have been created by the model

#### cluster 0 

In [41]:
cluster_zero

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime
0,Central Business District,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle
1,West End,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle
2,Riley Park,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Other Theft,Theft of Vehicle
5,Kensington-Cedar Cottage,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Vehicle
6,Stanley Park,Theft from Vehicle,Mischief,Theft of Bicycle,Vehicle Collision or Pedestrian Struck (with I...,Other Theft
10,Strathcona,Theft from Vehicle,Mischief,Break and Enter Commercial,Other Theft,Break and Enter Residential/Other
11,Kitsilano,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Bicycle
13,Hastings-Sunrise,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Other Theft,Theft of Vehicle
15,Victoria-Fraserview,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Vehicle Collision or Pedestrian Struck (with I...,Theft of Vehicle
16,West Point Grey,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Theft of Bicycle,Theft of Vehicle


As we will also observe for the other clusters, the 1st Most Common Crime tells us little beyond the fact that *Theft from Vehicle* is a big problem all over Vancouver: it ranks first in every neighborhood listed. But if we look at the 2nd, 3rd and 4th Most Common Crimes, we notice that this cluster has a problem with *Mischief*, defined by the Vancouver Police as "willfully causing malicious destruction, damage, or defacement of property including any public mischief towards another person" <a href='https://data.vancouver.ca/datacatalogue/crime-data-attributes.htm#TYPE'>here</a>, and with *Break and Enter*, be it commercial or residential.
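The observation that the first rank carries no information can be checked by counting distinct values per rank column. The sketch below is illustrative only: `cluster_sample` is a hypothetical name holding four rows copied from the cluster 0 table (first three ranks), not the notebook's actual `cluster_zero` dataframe.

```python
import pandas as pd

# Four rows copied from the cluster 0 table above (first three ranks only).
# `cluster_sample` is a hypothetical stand-in for `cluster_zero`.
cluster_sample = pd.DataFrame({
    'neighborhood': ['Central Business District', 'West End',
                     'Riley Park', 'Stanley Park'],
    '1st Most Common Crime': ['Theft from Vehicle'] * 4,
    '2nd Most Common Crime': ['Other Theft', 'Other Theft',
                              'Break and Enter Residential/Other', 'Mischief'],
    '3rd Most Common Crime': ['Mischief', 'Mischief',
                              'Mischief', 'Theft of Bicycle'],
})

# Count how many distinct crime types appear at each rank.
rank_columns = [c for c in cluster_sample.columns
                if c.endswith('Most Common Crime')]
distinct_per_rank = cluster_sample[rank_columns].nunique()

# A rank with a single distinct value cannot separate neighborhoods.
uninformative = distinct_per_rank[distinct_per_rank == 1].index.tolist()
print(distinct_per_rank)
print(uninformative)
```

On this sample only the 1st rank is constant, which is why the discussion focuses on the 2nd rank onward.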

**Cluster 0 merged with neighborhoods_events_sorted dataframe**

In [42]:
cluster_zero_merged = cluster_zero.merge(neighborhoods_events_sorted, how='inner', left_on='neighborhood', right_on='Neighborhood')
cluster_zero_merged.drop(columns=['Neighborhood'])

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Central Business District,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle,Coffee Shop,Hotel,Steakhouse,Theater,Poke Place
1,West End,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle,Gay Bar,Bakery,Ramen Restaurant,Restaurant,Café
2,Riley Park,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Other Theft,Theft of Vehicle,Arts & Crafts Store,Vegetarian / Vegan Restaurant,Restaurant,Japanese Restaurant,Grocery Store
3,Kensington-Cedar Cottage,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Vehicle,Coffee Shop,Vietnamese Restaurant,Bus Stop,Indian Restaurant,Filipino Restaurant
4,Stanley Park,Theft from Vehicle,Mischief,Theft of Bicycle,Vehicle Collision or Pedestrian Struck (with I...,Other Theft,Trail,Lake,Park,Yoga Studio,Filipino Restaurant
5,Kitsilano,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Bicycle,Bakery,Coffee Shop,Tea Room,Yoga Studio,Grocery Store
6,Hastings-Sunrise,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Other Theft,Theft of Vehicle,Vietnamese Restaurant,Liquor Store,Beer Store,Park,Coffee Shop
7,Victoria-Fraserview,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Vehicle Collision or Pedestrian Struck (with I...,Theft of Vehicle,Convenience Store,Pizza Place,Sandwich Place,Fast Food Restaurant,Yoga Studio
8,West Point Grey,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Theft of Bicycle,Theft of Vehicle,Pool,Yoga Studio,Cuban Restaurant,Dessert Shop,Dim Sum Restaurant
9,Killarney,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Vehicle,Pool,Soccer Field,Italian Restaurant,Gym,Deli / Bodega


The variety of *Most Common Venues* in these neighborhoods might imply that our model has identified them as some of the most active. These are probably the areas where most business takes place and where these kinds of crimes are prominent.
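The merged tables in this section are built with `how='inner'`, which silently drops any neighborhood that has no matching row in the venue dataframe. Strathcona, for example, is listed in cluster 0 but not in the merged output; a missing venue row is one possible cause, though that is an assumption rather than something the notebook confirms. A minimal sketch of how such gaps could be surfaced, using small hypothetical frames (`cluster_sample`, `venues_sample`) in place of the real ones:

```python
import pandas as pd

# Hypothetical stand-ins for `cluster_zero` and `neighborhoods_events_sorted`.
# The venue frame deliberately omits Strathcona to mimic a missing key.
cluster_sample = pd.DataFrame({
    'neighborhood': ['Central Business District', 'Strathcona', 'Kitsilano'],
    '2nd Most Common Crime': ['Other Theft', 'Mischief', 'Mischief'],
})
venues_sample = pd.DataFrame({
    'Neighborhood': ['Central Business District', 'Kitsilano'],
    '1st Most Common Venue': ['Coffee Shop', 'Bakery'],
})

# A left merge keeps every clustered neighborhood; `indicator=True` adds a
# `_merge` column saying whether a venue row was found for it.
checked = cluster_sample.merge(
    venues_sample,
    how='left',
    left_on='neighborhood',
    right_on='Neighborhood',
    indicator=True,
)

# Neighborhoods that an inner merge would have dropped.
missing_venues = checked.loc[checked['_merge'] == 'left_only',
                             'neighborhood'].tolist()
print(missing_venues)
```

Running a check like this before the inner merge makes it explicit which neighborhoods are excluded from the crime and venue comparison.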

#### cluster 1

In [43]:
cluster_one

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime
4,Marpole,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Break and Enter Commercial
7,Mount Pleasant,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Bicycle
8,Renfrew-Collingwood,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Theft of Vehicle
12,Grandview-Woodland,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Vehicle
14,Sunset,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Vehicle
17,Oakridge,Theft from Vehicle,Other Theft,Break and Enter Residential/Other,Mischief,Theft of Vehicle
18,Fairview,Theft from Vehicle,Other Theft,Mischief,Theft of Bicycle,Break and Enter Commercial
20,South Cambie,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Vehicle Collision or Pedestrian Struck (with I...


Again, we will ignore *Theft from Vehicle* and move on to the 2nd and 3rd Most Common Crimes. In most of these neighborhoods, *Other Theft* is followed by *Mischief*. *Other Theft* is defined as "theft of property that includes personal items (purse, wallet, cellphone, laptop, etc.), bicycle, etc" <a href='https://data.vancouver.ca/datacatalogue/crime-data-attributes.htm#TYPE'>here</a>.

**Cluster 1 merged with neighborhoods_events_sorted dataframe**

In [44]:
cluster_one_merged = cluster_one.merge(neighborhoods_events_sorted, how='inner', left_on='neighborhood', right_on='Neighborhood')
cluster_one_merged.drop(columns=['Neighborhood'])

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Marpole,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Break and Enter Commercial,Sushi Restaurant,Pizza Place,Chinese Restaurant,Vietnamese Restaurant,Grocery Store
1,Renfrew-Collingwood,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Theft of Vehicle,Vietnamese Restaurant,Chinese Restaurant,Pharmacy,Shanghai Restaurant,Café
2,Grandview-Woodland,Theft from Vehicle,Mischief,Other Theft,Break and Enter Residential/Other,Theft of Vehicle,Italian Restaurant,Pizza Place,Coffee Shop,Sushi Restaurant,Indian Restaurant
3,Sunset,Theft from Vehicle,Other Theft,Mischief,Break and Enter Commercial,Theft of Vehicle,Indian Restaurant,Dessert Shop,Ski Area,Filipino Restaurant,Deli / Bodega
4,Oakridge,Theft from Vehicle,Other Theft,Break and Enter Residential/Other,Mischief,Theft of Vehicle,Convenience Store,Fast Food Restaurant,Vietnamese Restaurant,Sandwich Place,Sushi Restaurant
5,Fairview,Theft from Vehicle,Other Theft,Mischief,Theft of Bicycle,Break and Enter Commercial,Coffee Shop,Park,Asian Restaurant,Sushi Restaurant,Korean Restaurant
6,South Cambie,Theft from Vehicle,Other Theft,Mischief,Break and Enter Residential/Other,Vehicle Collision or Pedestrian Struck (with I...,Coffee Shop,Park,Liquor Store,Bus Stop,Café


Most of these venues are food places, where people are likely to leave behind the kinds of items listed as examples of *Other Theft*, and where *Mischief* is likely to happen.
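The "mostly food places" reading can be made a little more concrete by counting how many of the listed venue categories look food related. This is a rough sketch: the venue lists are copied from the merged cluster 1 table above, while `FOOD_KEYWORDS` is an assumed, hand-picked list (it leaves out Grocery Store and Liquor Store, for instance), so the resulting share depends on that choice.

```python
import pandas as pd

# Top-5 venue categories per neighborhood, copied from the merged
# cluster 1 table above.
top_venues = {
    'Marpole': ['Sushi Restaurant', 'Pizza Place', 'Chinese Restaurant',
                'Vietnamese Restaurant', 'Grocery Store'],
    'Renfrew-Collingwood': ['Vietnamese Restaurant', 'Chinese Restaurant',
                            'Pharmacy', 'Shanghai Restaurant', 'Café'],
    'Grandview-Woodland': ['Italian Restaurant', 'Pizza Place', 'Coffee Shop',
                           'Sushi Restaurant', 'Indian Restaurant'],
    'Sunset': ['Indian Restaurant', 'Dessert Shop', 'Ski Area',
               'Filipino Restaurant', 'Deli / Bodega'],
    'Oakridge': ['Convenience Store', 'Fast Food Restaurant',
                 'Vietnamese Restaurant', 'Sandwich Place', 'Sushi Restaurant'],
    'Fairview': ['Coffee Shop', 'Park', 'Asian Restaurant',
                 'Sushi Restaurant', 'Korean Restaurant'],
    'South Cambie': ['Coffee Shop', 'Park', 'Liquor Store', 'Bus Stop', 'Café'],
}

# Assumption: a venue counts as "food related" if its category name
# contains one of these hand-picked keywords.
FOOD_KEYWORDS = ['Restaurant', 'Pizza', 'Café', 'Coffee',
                 'Dessert', 'Deli', 'Sandwich']

# Flatten the lists and flag every category that matches a keyword.
venues = pd.Series([v for row in top_venues.values() for v in row])
is_food = venues.str.contains('|'.join(FOOD_KEYWORDS), regex=True)

food_count = int(is_food.sum())
total_count = len(venues)
print(f'{food_count} of {total_count} listed venues look food related')
```

Under this keyword list, 27 of the 35 listed venue slots are food related, which is consistent with the reading above but says nothing about causation.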

#### cluster 2 (last cluster)

In [45]:
cluster_two

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime
3,Kerrisdale,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Vehicle Collision or Pedestrian Struck (with I...,Other Theft
9,Dunbar-Southlands,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Other Theft,Vehicle Collision or Pedestrian Struck (with I...
21,Shaughnessy,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Vehicle Collision or Pedestrian Struck (with I...,Break and Enter Commercial
22,Arbutus Ridge,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Other Theft,Vehicle Collision or Pedestrian Struck (with I...


As usual, looking at the 2nd and 3rd Most Common Crimes, one would deduce that these are mostly residential areas, as *Break and Enter Residential/Other* is prominent.

**Cluster 2 merged with neighborhoods_events_sorted dataframe**

In [46]:
cluster_two_merged = cluster_two.merge(neighborhoods_events_sorted, how='inner', left_on='neighborhood', right_on='Neighborhood')
cluster_two_merged.drop(columns=['Neighborhood'])

Unnamed: 0,neighborhood,1st Most Common Crime,2nd Most Common Crime,3rd Most Common Crime,4th Most Common Crime,5th Most Common Crime,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Kerrisdale,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Vehicle Collision or Pedestrian Struck (with I...,Other Theft,Coffee Shop,Pharmacy,Tea Room,Chinese Restaurant,Sandwich Place
1,Dunbar-Southlands,Theft from Vehicle,Mischief,Break and Enter Residential/Other,Other Theft,Vehicle Collision or Pedestrian Struck (with I...,Sushi Restaurant,Italian Restaurant,Coffee Shop,Liquor Store,Fast Food Restaurant
2,Shaughnessy,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Vehicle Collision or Pedestrian Struck (with I...,Break and Enter Commercial,Park,Bus Stop,French Restaurant,Yoga Studio,Filipino Restaurant
3,Arbutus Ridge,Theft from Vehicle,Break and Enter Residential/Other,Mischief,Other Theft,Vehicle Collision or Pedestrian Struck (with I...,Bakery,Pet Store,Grocery Store,Nightlife Spot,Yoga Studio


These look like primarily residential areas, as most of the venues serve quick needs: Pharmacy, Grocery Store, Pet Store and food places.

## 6. Conclusion <a name='conclusion'> 

The point of this analysis was to give an overview of the neighborhoods in Vancouver based on the types of crime they experience most frequently, as well as how those crimes relate to the venues located in those neighborhoods. To that end, we looked at the crime data provided by the Vancouver police and analyzed the frequency of each crime type in 2019. We extracted the 5 most frequent crimes for each neighborhood as features. We then used KMeans to cluster those features and merged the clusters with the most common venues for those neighborhoods.

Our cluster model divided the neighborhoods into three groups. The first is characterized by crimes related to *Mischief*, followed by *Break and Enter* for commercial and residential places. The venues in these neighborhoods are varied, and this looks like the **Center of business cluster** in Vancouver, where most activity is happening. The second cluster is associated with *Other Theft* (phones, laptops, purses). This is mainly a **Food related cluster**, perhaps where employed people spend most of their day and hence are likely to lose such items. The last cluster is a **Residential Cluster**, as the most common crimes are *Break and Enter Residential* and the associated venues are quick-needs venues like Pharmacy, Grocery Store and Liquor Store.

An individual looking to settle in Vancouver would, with this knowledge, be better placed than someone without it. They can avoid settling in areas where *Break and Enter Residential* is prominent. Likewise, someone trying to open a restaurant could avoid neighborhoods that are already crowded with other restaurants.