# IBM Applied Data Science Capstone Project

## Table of contents
1. [Introduction](#introduction)
2. [Data](#data)
3. [Methodology](#methodology)
4. [Analysis](#analysis)
5. [Results and Conclusion](#results)

## 1. Introduction <a name="introduction"></a>

### 1.1 Business problem

Every big city, such as **New York**, **Berlin** or **Moscow**, is huge and multicultural. The people who live and work there come from all around the world. They differ not only in culture but also in beliefs. What pleases some may offend others. Therefore, customers' interests in products can vary widely.

Another challenge for businesses comes from the wealth level of the citizens. Wealth in these cities spans the full range. In some districts people may be too well-off for a standard-grade product and want only the premium one, while in others poverty motivates people to save every single cent.

In addition, many other factors should be kept in mind: the citizens' age, the city's crime rate, the city's financial resources, temperature and weather data, geographical potential and many others.

So for some businesses (e.g. a clothing retail chain like H&M) it can be easy to find a good spot for opening another branch. For others (e.g. a coffee shop, newspaper stand or small grocery) it is not a trivial task without prior research.

### 1.2 Chosen problem

In this research we assume that we, or our customer, are a startup company whose business is bike sharing. Bike sharing is getting more popular each year, so there are many competitors in this field. As a startup we can't afford to place stations randomly in every city neighbourhood. We therefore have to find the best starting region and candidate regions for later expansion. In this case we will analyse the biggest city in Canada, **Toronto**, which is conveniently located in the south of the country, making it more suitable for cycling.

## 2. Data <a name="data"></a>

### 2.1 Data sources

To do this research we first have to define which kinds of data we need to analyse the current situation.

First of all we must take into account the safety aspect, so that the business is not ruined in its first few hours by a series of accidents and bike thefts. Therefore we need data about crime and accidents in the city.

To collect **_the accident data_** for this city we will use the following datasets provided by the [Toronto Police Service](http://data.torontopolice.on.ca/):

* [Accidents involving cyclists in which people were killed or seriously injured, 2007-2017](http://data.torontopolice.on.ca/datasets/55d5b9f7af7d4710bc98743b2c005f02_0)
* [Bicycle thefts, 2014-2017](http://data.torontopolice.on.ca/datasets/16f2b8a1c76547c69fec14b7f8541ffc_0)

The next thing we have to consider is potential customers. We must collect and process data about the neighbourhoods' population to learn their commuting habits. That information will help us find out which areas need our services most.

To collect **_the neighbourhoods' demographic information_** we will use data from the [City of Toronto Open Data Portal](https://www.toronto.ca/ext/open_data/catalog/data_set_files/2016_neighbourhood_profiles.csv).

### 2.2 Data preparation

First we import the necessary libraries and load the datasets for pre-processing.

**Note:** If you don't have the libraries used in this project but want to run the code, uncomment the corresponding lines to install them.

In [281]:
#!pip install folium
#!pip install bs4
#!pip install opencage
#!pip install foursquare


import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
import requests
import foursquare
from opencage.geocoder import OpenCageGeocode

from time import sleep
from IPython.display import display
import folium

park_out = pd.read_csv("Bicycle Parking Map Data.csv") 
park_in = pd.read_csv("Bicycle Parking Data - Indoor.csv") 
thefts = pd.read_csv("Bicycle_Thefts.csv") 
incidents = pd.read_csv("Cyclists.csv") 
profiles = pd.read_csv("2016_neighbourhood_profiles.csv", encoding='iso-8859-1')

We will start pre-processing with the datasets related to safety and security. Let's take a look at the thefts and incidents datasets.

In [282]:
print ("Thefts:")
thefts.columns.values

Thefts:


array(['X', 'Y', 'Index_', 'event_unique_id', 'Primary_Offence',
       'Occurrence_Date', 'Occurrence_Year', 'Occurrence_Month',
       'Occurrence_Day', 'Occurrence_Time', 'Division', 'City',
       'Location_Type', 'premisetype', 'Bike_Make', 'Bike_Model',
       'Bike_Type', 'Bike_Speed', 'Bike_Colour', 'Cost_of_Bike', 'Status',
       'Hood_ID', 'Neighbourhood', 'Lat', 'Long', 'ObjectId'],
      dtype=object)

We will omit the details about the bike's parameters and the status of the crime, and we will also exclude the time of the event. Identifier fields such as event_unique_id, ObjectId, Hood_ID and Division are of no use to us either. The fields Lat and Long duplicate the coordinates stored in X and Y, so we keep only X and Y. Since all the crimes were committed in Toronto, we do not need the City field. Therefore the only information we keep is:
* Neighbourhood
* Longitude (X)
* Latitude (Y)
* Primary Offence

Unfortunately the data available in the dataset covers only 2014-2017. We will therefore keep only the statistics for 2017, the most recent year, so that they are closer to today's reality in 2019.
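
A minimal sketch of how the restriction to 2017 could be applied is shown here. It assumes that `Occurrence_Year` holds the year as a number or numeric text; the column names come from the thefts dataset, while the sample rows and the names `sample_thefts` and `thefts_2017` are made up for illustration.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the thefts dataset.
sample_thefts = pd.DataFrame({
    "Occurrence_Year": [2014, 2016, 2017, 2017],
    "Neighbourhood": ["Annex (95)", "Dufferin Grove (83)", "Annex (95)", "Moss Park (73)"],
    "Primary_Offence": ["THEFT UNDER", "B&E", "THEFT UNDER - BICYCLE", "THEFT UNDER"],
})

# Assumption: the year may have been read as text, so coerce it to a number first.
year = pd.to_numeric(sample_thefts["Occurrence_Year"], errors="coerce")

# Keep only the rows recorded for 2017 and renumber them from zero.
thefts_2017 = sample_thefts[year == 2017].reset_index(drop=True)
print(thefts_2017)
```

The same boolean mask could be applied to `thefts` before the column selection, so that every later count refers to 2017 only.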

It would be good for business to reduce the chance that our bikes are stolen by buying only the "safe" ones. So let's make another dataframe from this dataset to see what types of bikes best suit the thieves' tastes. We will create a new dataframe **"thief_taste"** with the following columns:
* Bike Make
* Bike Model
* Bike Type
* Bike Colour
* Cost of Bike

There are a lot of unknown bike makes here, so we drop those rows from our dataframe.

In [283]:
theft = thefts[['X', 'Y', 'Neighbourhood', 'Primary_Offence']]
theft.tail(10)

Unnamed: 0,X,Y,Neighbourhood,Primary_Offence
14270,-79.373062,43.642483,Waterfront Communities-The Island (77),B&E
14271,-79.413139,43.663345,Palmerston-Little Italy (80),THEFT UNDER
14272,-79.429436,43.661152,Dovercourt-Wallace Emerson-Junction (93),THEFT UNDER
14273,-79.40271,43.666908,Annex (95),THEFT UNDER - BICYCLE
14274,-79.423355,43.790058,Newtonbrook West (36),THEFT UNDER - BICYCLE
14275,-79.423775,43.7136,Bedford Park-Nortown (39),THEFT UNDER - BICYCLE
14276,-79.534058,43.619293,Islington-City Centre West (14),THEFT UNDER
14277,-79.390793,43.649132,Waterfront Communities-The Island (77),THEFT UNDER
14278,-79.38311,43.661373,Church-Yonge Corridor (75),THEFT UNDER - BICYCLE
14279,-79.455894,43.66584,Dovercourt-Wallace Emerson-Junction (93),THEFT UNDER - BICYCLE


In [284]:
thief_taste = thefts[['Bike_Make', 'Bike_Model', 'Bike_Type', 'Bike_Colour', 'Cost_of_Bike']]
thief_taste = thief_taste[thief_taste.Bike_Make != 'UNKNOWN']
thief_taste.tail(10)

Unnamed: 0,Bike_Make,Bike_Model,Bike_Type,Bike_Colour,Cost_of_Bike
14270,KH,URBAN SOUL,RG,BLK,525
14271,NO,PINNACLE,MT,GRY,300
14272,KO,DEW,RG,BLK,700
14273,CA,CANNONDALE,TO,WHI,1000
14274,RA,HYBRID,RG,GRN,350
14275,OT,STUMPJUMPER FSR,MT,BLK,2500
14276,RM,,MT,RED,1000
14277,IH,,MT,DGR,0
14278,PR,,MT,PLE,150
14279,BI,,TO,GRN,500


In [285]:
print ("Incidents:")
incidents.columns.values

Incidents:


array(['X', 'Y', 'Index_', 'ACCNUM', 'YEAR', 'DATE', 'TIME', 'Hour',
       'STREET1', 'STREET2', 'OFFSET', 'ROAD_CLASS', 'District',
       'LATITUDE', 'LONGITUDE', 'LOCCOORD', 'ACCLOC', 'TRAFFCTL',
       'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE',
       'INVTYPE', 'INVAGE', 'INJURY', 'FATAL_NO', 'INITDIR', 'VEHTYPE',
       'MANOEUVER', 'DRIVACT', 'DRIVCOND', 'PEDTYPE', 'PEDACT', 'PEDCOND',
       'CYCLISTYPE', 'CYCACT', 'CYCCOND', 'PEDESTRIAN', 'CYCLIST',
       'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_', 'EMERG_VEH',
       'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL',
       'DISABILITY', 'Division', 'Ward_Name', 'Ward_ID', 'Hood_ID',
       'Hood_Name', 'FID'], dtype=object)

In the incidents dataset we are interested only in the coordinates where the event happened, the injury level and the neighbourhood name. Therefore we will use the following information in our dataframe:
* Longitude (X)
* Latitude (Y)
* Neighbourhood
* Injury

In [286]:
# rename() returns a new dataframe, which avoids modifying a slice of `incidents` in place
incid = incidents[['X', 'Y', 'Hood_Name', 'INJURY']].rename(columns={'Hood_Name': 'Neighbourhood', 'INJURY': 'Injury'})

incid.head(10)

Unnamed: 0,X,Y,Neighbourhood,Injury
0,-79.366696,43.659267,Moss Park (73),Major
1,-79.366696,43.659267,Moss Park (73),
2,-79.366696,43.659267,Moss Park (73),
3,-79.366696,43.659267,Moss Park (73),
4,-79.366696,43.659267,Moss Park (73),Major
5,-79.439961,43.65094,Dufferin Grove (83),
6,-79.439961,43.65094,Dufferin Grove (83),Major
7,-79.439961,43.65094,Dufferin Grove (83),
8,-79.44399,43.658045,Dufferin Grove (83),Minimal
9,-79.44399,43.658045,Dufferin Grove (83),


In [291]:
print("Profiles output")
profiles.head(10)

Profiles output


Unnamed: 0,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,42,...,37,7,137,64,60,94,100,97,27,31
1,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,27695,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,26918,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,2.90%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%
5,Population,Population and dwellings,Census Profile 98-316-X2016001,Total private dwellings,1179057,9371,8535,4732,18109,12473,...,8054,8721,19098,5620,3604,6185,6103,7475,11051,5847
6,Population,Population and dwellings,Census Profile 98-316-X2016001,Private dwellings occupied by usual residents,1112929,9120,8136,4616,15934,12124,...,7549,8509,18436,5454,3449,5887,5676,7012,10170,5344
7,Population,Population and dwellings,Census Profile 98-316-X2016001,Population density per square kilometre,4334,3929,3034,2435,10863,2775,...,5820,4007,4345,7838,6722,8541,7162,10708,2086,2451
8,Population,Population and dwellings,Census Profile 98-316-X2016001,Land area in square kilometres,630.2,7.41,7.83,4.95,2.81,9.98,...,2.91,5.53,12.31,1.6,1.17,1.68,1.65,1.17,13.23,6.04
9,Population,Age characteristics,Census Profile 98-316-X2016001,Children (0-14 years),398135,3840,3075,1760,2360,3605,...,1785,3555,9625,2325,1165,1860,1800,1210,4045,1960


As we can see, this is a massive dataset that provides information about all neighbourhoods in the city. Let's take a closer look at which demographic data we can use from it.

In [292]:
profiles["Topic"].unique()

array(['Neighbourhood Information', 'Population and dwellings',
       'Age characteristics', 'Household and dwelling characteristics',
       'Marital status', 'Family characteristics', 'Household type',
       'Family characteristics of adults',
       'Knowledge of official languages',
       'First official language spoken', 'Mother tongue',
       'Language spoken most often at home',
       'Other language spoken regularly at home',
       'Knowledge of languages', 'Income of individuals in 2015',
       'Income of households in 2015',
       'Income of economic families in 2015', 'Low income in 2015',
       'Citizenship', 'Immigrant status and period of immigration',
       'Age at immigration', 'Immigrants by selected place of birth',
       'Recent immigrants by selected place of birth',
       'Generation status', 'Admission category and applicant type',
       'Aboriginal population', 'Visible minority population',
       'Ethnic origin population', 'Household characteristi

For our research we will select the following values of the **'Topic'** column: **'Neighbourhood Information', 'Age characteristics', 'Commuting destination', 'Main mode of commuting', 'Commuting duration'.**
The columns **'Category', 'Data Source', 'City of Toronto'** hold no valuable information for us, so we will exclude them from our dataframes. The column **'Topic'** has to be excluded as well, but we will do that after splitting the dataset into the required dataframes.

In [293]:
profiles = profiles.drop(columns=['Data Source', 'Category', 'City of Toronto'])

# drop() returns a new dataframe, so the filtered slices of `profiles` are not modified in place
demo_age = profiles[profiles.Topic == 'Age characteristics'].drop(columns='Topic')

demo_cdes = profiles[profiles.Topic == 'Commuting destination'].drop(columns='Topic')

demo_mmoc = profiles[profiles.Topic == 'Main mode of commuting'].drop(columns='Topic')

demo_cdur = profiles[profiles.Topic == 'Commuting duration'].drop(columns='Topic')

demo_age.head(10)

Unnamed: 0,Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
9,Children (0-14 years),3840,3075,1760,2360,3605,2325,1695,2415,1515,...,1785,3555,9625,2325,1165,1860,1800,1210,4045,1960
10,Youth (15-24 years),3705,3360,1235,3750,2730,1940,6860,2505,1635,...,2230,2625,7660,1035,675,1320,1225,920,4750,1870
11,Working Age (25-54 years),11305,9965,5220,15040,10810,6655,13065,10310,4490,...,7480,8140,21945,6165,3790,6420,5860,5960,12290,5860
12,Pre-retirement (55-64 years),4230,3265,1825,3480,3555,2030,1760,2540,1825,...,2070,2905,6245,1625,1150,1595,1325,1540,2965,1810
13,Seniors (65+ years),6045,4105,2015,5910,6975,2940,2420,3615,3685,...,3370,4905,8010,1380,1095,3150,1600,2905,3530,3295
14,Older Seniors (85+ years),925,555,320,1040,1640,710,330,610,740,...,655,885,1130,170,125,880,165,470,400,775
15,Male: 0 to 04 years,660,575,360,445,570,435,470,455,205,...,355,620,1625,460,225,325,300,220,755,320
16,Male: 05 to 09 years,695,540,270,365,660,355,230,395,260,...,310,625,1705,400,180,350,305,220,685,315
17,Male: 10 to 14 years,660,460,225,325,675,415,130,410,320,...,265,610,1600,330,180,310,280,195,635,370
18,Male: 15 to 19 years,840,780,285,465,715,490,585,520,385,...,415,680,1815,275,160,260,255,145,900,485
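
The filter-and-drop steps above, together with the later re-indexing by 'Characteristic', repeat the same pattern for every topic. A minimal sketch of a helper that does this in one place is given here; the function name `split_by_topic` and the miniature table `mini` are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_by_topic(profiles, topics):
    """Return {topic: rows of that topic, indexed by 'Characteristic'}."""
    frames = {}
    for topic in topics:
        subset = profiles.loc[profiles["Topic"] == topic]
        # drop() without inplace returns a new frame, so the slice is never modified.
        frames[topic] = subset.drop(columns="Topic").set_index("Characteristic")
    return frames

# Illustrative miniature of the profiles table (one neighbourhood column only).
mini = pd.DataFrame({
    "Topic": ["Age characteristics", "Age characteristics", "Commuting duration"],
    "Characteristic": ["Children (0-14 years)", "Youth (15-24 years)", "Less than 15 minutes"],
    "Annex": ["2360", "3750", "2330"],
})

frames = split_by_topic(mini, ["Age characteristics", "Commuting duration"])
print(frames["Age characteristics"])
```

With the real data, `frames['Age characteristics']` would play the role of `demo_age`, and so on for the other topics.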


In the age dataframe we do not need to distinguish between genders, so we keep only the first 5 rows, which hold the combined data for both genders.

In [294]:
demo_age = demo_age[:5]

Let's reindex the rows by the 'Characteristic' value.

In [295]:
demo_age = demo_age.set_index('Characteristic')
demo_cdes = demo_cdes.set_index('Characteristic')
demo_mmoc = demo_mmoc.set_index('Characteristic')
demo_cdur = demo_cdur.set_index('Characteristic')

demo_cdur.head()

Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data,11820,10155,6045,14905,11395,7355,11815,9820,4965,9515,...,6890,9365,21595,5965,4095,6590,5935,6350,12790,6420
Less than 15 minutes,1610,1455,1210,2330,1695,850,3175,860,480,1560,...,680,1155,2740,530,370,585,695,780,1365,850
15 to 29 minutes,3175,2735,1815,6260,3475,1750,4900,2560,1385,3060,...,1680,3165,5360,1450,900,2020,1765,2265,3280,1770
30 to 44 minutes,2940,2515,1495,4000,3110,2120,2200,3065,1320,2995,...,2200,2320,5280,2145,1420,2355,2215,2195,3430,2170
45 to 59 minutes,1410,1230,630,1410,1540,1465,690,1865,770,1200,...,1290,1170,2895,1095,915,915,795,630,1605,845


## 3. Methodology <a name="methodology"></a>

In our project we want to find the best neighbourhoods in which to start our bike sharing business, particularly the safest ones with infrastructure and free bike parking spots. Hence we first prioritise the analysis parameters:

1. _Safety_
2. _Demographic commute data_
3. _Theft security_

In the data section we collected data about incidents involving cyclists that happened in different neighbourhoods. This data includes the **neighbourhood** name and the **injury** level sustained by the victim. Based on this information we can determine the **safety** of the regions.

At the end of the data preparation section we constructed dataframes with demographic data, which include information about the **citizens' age** and their **commute information** in different neighbourhoods. This can help us estimate where our bike sharing services will be most welcome.

Additionally, we will do a small study of which bikes are more likely to be stolen. For this we use the official data from the Toronto Police Service about reported **bike thefts**.

## 4. Analysis  <a name="analysis"></a>

### 4.1 Safety

In [311]:
incid.head(10)

Unnamed: 0,X,Y,Neighbourhood,Injury
0,-79.366696,43.659267,Moss Park (73),Major
1,-79.366696,43.659267,Moss Park (73),
2,-79.366696,43.659267,Moss Park (73),
3,-79.366696,43.659267,Moss Park (73),
4,-79.366696,43.659267,Moss Park (73),Major
5,-79.439961,43.65094,Dufferin Grove (83),
6,-79.439961,43.65094,Dufferin Grove (83),Major
7,-79.439961,43.65094,Dufferin Grove (83),
8,-79.44399,43.658045,Dufferin Grove (83),Minimal
9,-79.44399,43.658045,Dufferin Grove (83),


Let's count the overall number of incidents in Toronto by neighbourhood and select the 10 neighbourhoods with the fewest incidents. Neighbourhoods with no recorded incidents do not appear in this count.

In [360]:
safety = incid['Neighbourhood'].value_counts(sort=True, ascending=True).rename_axis('Neighbourhood').reset_index(name='Incidents')
safety = safety.head(10)
safety

| | Neighbourhood | Incidents |
|---|---|---|
| 0 | Palmerston-Little Italy (80) | 3 |
| 1 | Agincourt South-Malvern West (128) | 3 |
| 2 | Waterfront Communities-The Island (77) | 3 |
| 3 | Wychwood (94) | 3 |
| 4 | Little Portugal (84) | 3 |
| 5 | Mount Pleasant East (99) | 3 |
| 6 | Birchcliffe-Cliffside (122) | 3 |
| 7 | The Beaches (63) | 3 |
| 8 | New Toronto (18) | 3 |
| 9 | Casa Loma (96) | 3 |
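
All ten rows above show the same count, so `head(10)` cuts arbitrarily among tied neighbourhoods. A minimal sketch of one way to make such ties visible is shown here, using `Series.nsmallest` with `keep="all"`. The counts are made up for illustration; the neighbourhood names appear in the outputs above.

```python
import pandas as pd

# Made-up incident counts for illustration only.
counts = pd.Series(
    {
        "Wychwood (94)": 3,
        "Casa Loma (96)": 3,
        "The Beaches (63)": 3,
        "Moss Park (73)": 9,
        "Dufferin Grove (83)": 5,
    },
    name="Incidents",
)

# keep="all" returns every neighbourhood tied at the cut-off instead of
# silently truncating the tie the way head(n) does.
safest = counts.nsmallest(2, keep="all")
print(safest)
```

Although two rows were requested, all three tied neighbourhoods are returned, which shows that the cut-off falls inside a tie.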


### 4.2 Demographic commute data

#### 4.2.1 Age characteristics

In [297]:
demo_age

Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Children (0-14 years),3840,3075,1760,2360,3605,2325,1695,2415,1515,4555,...,1785,3555,9625,2325,1165,1860,1800,1210,4045,1960
Youth (15-24 years),3705,3360,1235,3750,2730,1940,6860,2505,1635,3210,...,2230,2625,7660,1035,675,1320,1225,920,4750,1870
Working Age (25-54 years),11305,9965,5220,15040,10810,6655,13065,10310,4490,8410,...,7480,8140,21945,6165,3790,6420,5860,5960,12290,5860
Pre-retirement (55-64 years),4230,3265,1825,3480,3555,2030,1760,2540,1825,3075,...,2070,2905,6245,1625,1150,1595,1325,1540,2965,1810
Seniors (65+ years),6045,4105,2015,5910,6975,2940,2420,3615,3685,3980,...,3370,4905,8010,1380,1095,3150,1600,2905,3530,3295


Our target customer groups are:
1. Youth (15-24 years) 
2. Working Age (25-54 years) 

Let's find the top 10 neighbourhoods with the highest number of people from these two groups.

In [298]:
demo_age = demo_age[1:3]
demo_age_t = demo_age.T
demo_age_t['Youth (15-24 years)'] = demo_age_t['Youth (15-24 years)'].str.replace(',', '')
demo_age_t['Working Age (25-54 years)'] = demo_age_t['Working Age (25-54 years)'].str.replace(',', '')
demo_age_t[['Youth (15-24 years)', 'Working Age (25-54 years)']] = demo_age_t[['Youth (15-24 years)', 'Working Age (25-54 years)']].astype(int)
demo_age_t['Total'] = demo_age_t.sum(axis = 1)
demo_age_t = demo_age_t.sort_values(by=['Total'], ascending=False)
demo_age_t = demo_age_t.head(10)
demo_age_t

| Neighbourhood | Youth (15-24 years) | Working Age (25-54 years) | Total |
|---|---|---|---|
| Waterfront Communities-The Island | 7840 | 45105 | 52945 |
| Willowdale East | 6940 | 25850 | 32790 |
| Woburn | 7660 | 21945 | 29605 |
| Niagara | 2415 | 23320 | 25735 |
| Islington-City Centre West | 4695 | 20640 | 25335 |
| Rouge | 6700 | 18510 | 25210 |
| Malvern | 6620 | 17865 | 24485 |
| Church-Yonge Corridor | 5060 | 18780 | 23840 |
| Dovercourt-Wallace Emerson-Junction | 3925 | 19790 | 23715 |
| L'Amoreaux | 5730 | 17210 | 22940 |
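
The clean-convert-sum-sort sequence used above is repeated for the other demographic tables further down. A minimal sketch of how it could be wrapped in two helpers is given here. The names `to_int_columns` and `top_by_total` are hypothetical, and the sketch assumes that every cell is a whole number stored as text, possibly with thousands separators.

```python
import pandas as pd

def to_int_columns(frame):
    """Strip thousands separators and convert every column to integers."""
    cleaned = frame.apply(
        lambda col: col.astype(str).str.replace(",", "", regex=False)
    )
    return cleaned.astype(int)

def top_by_total(frame, n=10):
    """Add a row-wise 'Total' column and return the n rows with the largest total."""
    out = to_int_columns(frame)
    out["Total"] = out.sum(axis=1)
    return out.sort_values("Total", ascending=False).head(n)

# Illustrative input: three neighbourhoods with comma-formatted strings (assumed format).
sample = pd.DataFrame(
    {
        "Youth (15-24 years)": ["7,840", "6,940", "7,660"],
        "Working Age (25-54 years)": ["45,105", "25,850", "21,945"],
    },
    index=["Waterfront Communities-The Island", "Willowdale East", "Woburn"],
)

ranked = top_by_total(sample, n=2)
print(ranked)
```

A call such as `top_by_total(demo_age.T)` would then stand in for the individual `str.replace`, `astype`, `sum`, `sort_values` and `head` steps.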


#### 4.2.2 Commuting destination

In [299]:
demo_cdes

Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Total - Commuting destination for the employed labour force aged 15 years and over in private households with a usual place of work - 25% sample data,10330.0,8800.0,5155.0,13540.0,10200.0,6250.0,10970.0,8910.0,4435.0,8735.0,...,6250.0,8160.0,18845.0,5350.0,3580.0,5820.0,5525.0,5810.0,10885.0,5400.0
Commute within census subdivision (CSD) of residence,7210.0,6530.0,3535.0,12470.0,8615.0,4810.0,10035.0,6735.0,3215.0,7320.0,...,4765.0,5620.0,14935.0,4880.0,3285.0,5200.0,4960.0,5325.0,7620.0,4340.0
Commute to a different census subdivision (CSD) within census division (CD) of residence,,,,,,,,,,,...,,,,,,,,,,
Commute to a different census subdivision (CSD) and census division (CD) within province or territory of residence,3095.0,2240.0,1605.0,1025.0,1560.0,1430.0,850.0,2130.0,1185.0,1375.0,...,1450.0,2535.0,3840.0,460.0,290.0,570.0,545.0,470.0,3220.0,1050.0
Commute to a different province or territory,10.0,35.0,10.0,30.0,40.0,15.0,75.0,40.0,35.0,40.0,...,40.0,10.0,70.0,0.0,0.0,45.0,15.0,10.0,40.0,0.0


Here we are interested in the category *Commute within census subdivision (CSD) of residence*, since these people are not taking long trips. Therefore let's find the top 10 neighbourhoods with the most people in this group.

In [300]:
# Keep the first two rows (total and within-CSD commuters) as columns, then drop the total
demo_cdes_t = demo_cdes.T.iloc[:,0:2].drop('Total - Commuting destination for the employed labour force aged 15 years and over in private households with a usual place of work - 25% sample data', axis=1)
demo_cdes_t['  Commute within census subdivision (CSD) of residence'] = demo_cdes_t['  Commute within census subdivision (CSD) of residence'].str.replace(',', '')
demo_cdes_t['  Commute within census subdivision (CSD) of residence'] = demo_cdes_t['  Commute within census subdivision (CSD) of residence'].astype(int)
demo_cdes_t = demo_cdes_t.sort_values(by=['  Commute within census subdivision (CSD) of residence'], ascending=False)
demo_cdes_t = demo_cdes_t.head(10)
demo_cdes_t

| Neighbourhood | Commute within census subdivision (CSD) of residence |
|---|---|
| Waterfront Communities-The Island | 36345 |
| Niagara | 16400 |
| Church-Yonge Corridor | 15360 |
| Willowdale East | 15020 |
| Woburn | 14935 |
| Rouge | 14560 |
| Dovercourt-Wallace Emerson-Junction | 14285 |
| Mount Pleasant West | 13680 |
| Islington-City Centre West | 13680 |
| Malvern | 12835 |


#### 4.2.3 Main mode of commuting

In [301]:
demo_mmoc

Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data,11820,10160,6045,14910,11395,7360,11815,9825,4965,9520,...,6900,9360,21595,5965,4100,6595,5935,6345,12790,6420
"Car, truck, van - as a driver",7155,6135,4090,3290,7150,3910,1780,5270,2920,5940,...,3260,6220,11505,2480,1685,2190,1970,2050,5945,3205
"Car, truck, van - as a passenger",930,665,355,290,500,265,165,385,240,355,...,275,515,1405,165,120,195,155,155,665,355
Public transit,3350,2985,1285,6200,2945,2870,3540,3695,1610,2415,...,2845,2380,7635,2540,1940,3005,2935,3170,5405,2400
Walked,265,280,195,3200,615,215,5840,345,110,550,...,405,140,780,340,145,525,635,715,585,360
Bicycle,70,35,65,1675,65,15,325,30,10,90,...,55,20,45,380,175,610,145,155,115,30
Other method,45,65,65,225,140,90,175,100,65,180,...,55,65,210,70,30,55,90,95,75,85


From this dataset we can see how people commute in different neighbourhoods. We assume that those who already commute by bicycle would not be our customers, and neither would those who travel by car (as a driver or as a passenger). We are left with the following categories of people, who identified their main commute method as:
* Public transit
* Walked
* Other method

Now we are going to find the top 10 neighbourhoods with the most people using these methods.

In [302]:
commute = demo_mmoc[3:]
commute_t = commute.T
commute_t = commute_t.drop('  Bicycle', axis=1)

In [303]:
arr = list(commute_t)
for item in arr:
    commute_t[item] = commute_t[item].str.replace(',', '')
    commute_t[item] = commute_t[item].astype(int) 

commute_t['Total'] = commute_t.sum(axis = 1)
commute_t = commute_t.sort_values(by=['Total'], ascending=False)
commute_t = commute_t.head(10)
commute_t

| Neighbourhood | Public transit | Walked | Other method | Total |
|---|---|---|---|---|
| Waterfront Communities-The Island | 10915 | 20855 | 610 | 32380 |
| Church-Yonge Corridor | 7000 | 7275 | 190 | 14465 |
| Niagara | 6965 | 5070 | 285 | 12320 |
| Mount Pleasant West | 9435 | 1840 | 100 | 11375 |
| Willowdale East | 9390 | 1550 | 215 | 11155 |
| Dovercourt-Wallace Emerson-Junction | 8950 | 1215 | 310 | 10475 |
| Annex | 6200 | 3200 | 225 | 9625 |
| Bay Street Corridor | 3540 | 5840 | 175 | 9555 |
| Islington-City Centre West | 8205 | 795 | 195 | 9195 |
| Woburn | 7635 | 780 | 210 | 8625 |
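
The selection above relies on row positions (`demo_mmoc[3:]`) and on a label with leading spaces (`'  Bicycle'`), so it would break if the row order or the padding changed. A minimal sketch of a selection by cleaned label is shown here. It assumes that the row labels carry leading whitespace, as the `drop` call suggests; the one-column frame `mode` is illustrative.

```python
import pandas as pd

# Illustrative rows for a single neighbourhood; labels mimic the padded ones in the data.
mode = pd.DataFrame(
    {"Annex": [6200, 3200, 1675, 225]},
    index=["  Public transit", "  Walked", "  Bicycle", "  Other method"],
)

# Normalise the labels once, then select rows by name instead of by position.
mode.index = mode.index.str.strip()
wanted = ["Public transit", "Walked", "Other method"]
commute_sel = mode.loc[wanted].T

print(commute_sel)
```

Selecting by name also documents which commute methods are counted, without the reader having to look up the row order.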


#### 4.2.4 Commuting duration

In [304]:
demo_cdur

Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data,11820,10155,6045,14905,11395,7355,11815,9820,4965,9515,...,6890,9365,21595,5965,4095,6590,5935,6350,12790,6420
Less than 15 minutes,1610,1455,1210,2330,1695,850,3175,860,480,1560,...,680,1155,2740,530,370,585,695,780,1365,850
15 to 29 minutes,3175,2735,1815,6260,3475,1750,4900,2560,1385,3060,...,1680,3165,5360,1450,900,2020,1765,2265,3280,1770
30 to 44 minutes,2940,2515,1495,4000,3110,2120,2200,3065,1320,2995,...,2200,2320,5280,2145,1420,2355,2215,2195,3430,2170
45 to 59 minutes,1410,1230,630,1410,1540,1465,690,1865,770,1200,...,1290,1170,2895,1095,915,915,795,630,1605,845
60 minutes and over,2670,2230,895,905,1565,1180,855,1480,1000,725,...,1050,1545,5320,735,495,720,470,460,3095,805


From this dataset we can see how long people's commutes take. We assume that people whose commute takes under 15 minutes or over 60 minutes do not need our services. Therefore we will select the following categories of people, who may want to shorten their commute:

* 15 to 29 minutes
* 30 to 44 minutes
* 45 to 59 minutes

Next we are going to find the top 10 neighbourhoods with the most people in these commute-duration ranges.

In [305]:
demo_cdur = demo_cdur[2:5]
demo_cdur_t = demo_cdur.T

arr = list(demo_cdur_t)
for item in arr:
    demo_cdur_t[item] = demo_cdur_t[item].str.replace(',', '')
    demo_cdur_t[item] = demo_cdur_t[item].astype(int) 

demo_cdur_t['Total'] = demo_cdur_t.sum(axis = 1)
demo_cdur_t = demo_cdur_t.sort_values(by=['Total'], ascending=False)
demo_cdur_t = demo_cdur_t.head(10)

demo_cdur_t

| Neighbourhood | 15 to 29 minutes | 30 to 44 minutes | 45 to 59 minutes | Total |
|---|---|---|---|---|
| Waterfront Communities-The Island | 18250 | 9955 | 3320 | 31525 |
| Niagara | 7395 | 7030 | 2530 | 16955 |
| Willowdale East | 5430 | 7290 | 4175 | 16895 |
| Islington-City Centre West | 5935 | 5795 | 4265 | 15995 |
| Dovercourt-Wallace Emerson-Junction | 5335 | 6665 | 3055 | 15055 |
| Rouge | 5120 | 5145 | 3335 | 13600 |
| Woburn | 5360 | 5280 | 2895 | 13535 |
| Mount Pleasant West | 4655 | 6380 | 2295 | 13330 |
| Mimico (includes Humber Bay Shores) | 4330 | 5620 | 3185 | 13135 |
| Church-Yonge Corridor | 7310 | 4060 | 1445 | 12815 |
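
The four demographic rankings above can be compared with each other. One possible way to do so, sketched here as an illustration rather than as the method of this project, is to count in how many of the four top-10 lists each neighbourhood appears. The list contents are copied from the result tables above; the variable names are hypothetical.

```python
from collections import Counter

# Top-10 lists copied from the four result tables above.
age = ["Waterfront Communities-The Island", "Willowdale East", "Woburn", "Niagara",
       "Islington-City Centre West", "Rouge", "Malvern", "Church-Yonge Corridor",
       "Dovercourt-Wallace Emerson-Junction", "L'Amoreaux"]
destination = ["Waterfront Communities-The Island", "Niagara", "Church-Yonge Corridor",
               "Willowdale East", "Woburn", "Rouge", "Dovercourt-Wallace Emerson-Junction",
               "Mount Pleasant West", "Islington-City Centre West", "Malvern"]
mode = ["Waterfront Communities-The Island", "Church-Yonge Corridor", "Niagara",
        "Mount Pleasant West", "Willowdale East", "Dovercourt-Wallace Emerson-Junction",
        "Annex", "Bay Street Corridor", "Islington-City Centre West", "Woburn"]
duration = ["Waterfront Communities-The Island", "Niagara", "Willowdale East",
            "Islington-City Centre West", "Dovercourt-Wallace Emerson-Junction", "Rouge",
            "Woburn", "Mount Pleasant West", "Mimico (includes Humber Bay Shores)",
            "Church-Yonge Corridor"]

# Each list holds unique names, so the counter gives the number of lists per neighbourhood.
appearances = Counter(
    name for ranking in (age, destination, mode, duration) for name in ranking
)
in_all_four = sorted(name for name, hits in appearances.items() if hits == 4)

print(in_all_four)
```

Counting appearances weights all four criteria equally; a weighted score would be needed if some criteria matter more than others.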


### 4.3 Theft Security

#### 4.3.1 Theft reported incidents

In [310]:
theft.head(10)

Unnamed: 0,X,Y,Neighbourhood,Primary_Offence
0,-79.426674,43.641228,South Parkdale (85),B&E
1,-79.382637,43.652973,Bay Street Corridor (76),THEFT UNDER
2,-79.313347,43.674377,Woodbine Corridor (64),THEFT UNDER
3,-79.396164,43.653347,Kensington-Chinatown (78),THEFT UNDER
4,-79.381584,43.657909,Church-Yonge Corridor (75),THEFT UNDER
5,-79.389504,43.66967,Annex (95),THEFT UNDER
6,-79.445328,43.676571,Corso Italia-Davenport (92),THEFT UNDER - BICYCLE
7,-79.356949,43.683933,Broadview North (57),THEFT UNDER - BICYCLE
8,-79.320396,43.672669,Greenwood-Coxwell (65),THEFT UNDER - BICYCLE
9,-79.385048,43.666065,Church-Yonge Corridor (75),THEFT UNDER - BICYCLE


Let's count the total number of reported thefts in Toronto by neighborhood and select the 10 neighborhoods with the fewest incidents.

In [312]:
theft = theft['Neighbourhood'].value_counts(sort=True, ascending=True).rename_axis('Neighbourhood').reset_index(name='Incidents')
theft = theft.head(10)
theft

Unnamed: 0,Neighbourhood,Incidents
0,Pleasant View (46),5
1,Rexdale-Kipling (4),6
2,Bayview Woods-Steeles (49),7
3,Rustic (28),7
4,Maple Leaf (29),7
5,Beechborough-Greenbrook (112),7
6,Ionview (125),8
7,Pelmo Park-Humberlea (23),8
8,Victoria Village (43),8
9,Humbermede (22),9
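
One caveat with this approach: `value_counts` only lists values that occur in the data, so a neighbourhood with no reported incidents at all would be missing from the ranking. A minimal sketch of how that could be handled, assuming a full list of neighbourhood names is available (`all_neighbourhoods` and the toy records below are hypothetical, not part of the original notebook):

```python
import pandas as pd

# Hypothetical incident records; the names come from the tables above,
# the number of rows per neighbourhood is invented for illustration.
incidents = pd.DataFrame(
    {"Neighbourhood": ["Rustic (28)", "Rustic (28)", "Ionview (125)"]}
)

# Hypothetical complete list of neighbourhoods to rank.
all_neighbourhoods = ["Rustic (28)", "Ionview (125)", "Maple Leaf (29)"]

counts = (
    incidents["Neighbourhood"]
    .value_counts()
    .reindex(all_neighbourhoods, fill_value=0)  # add zero-incident rows
    .sort_values()                              # fewest incidents first
    .rename_axis("Neighbourhood")
    .reset_index(name="Incidents")
)
print(counts)
```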


#### 4.3.2 Bikes which are more likely to be stolen

In [332]:
thief_taste.head(10)

Unnamed: 0,Bike_Make,Bike_Model,Bike_Type,Bike_Colour,Cost_of_Bike
0,FJ,ROUBAIX 3.0,RC,BLU,1400.0
1,OT,DIADORA,RG,BLU,500.0
2,OT,NAVIGATOR SF24,EL,RED,750.0
3,OT,2015VITA STEPTH,RG,PLE,600.0
4,GI,ESCAPE,RG,BLU,778.57
5,OT,SPEED UNO (KAC0,FO,BLK,400.0
6,RM,,RG,SIL,600.0
7,KO,BLAST,MT,ONG,1200.0
8,KO,DAWG,MT,OTH,2000.0
9,GI,16 ESCAPE 1 L,MT,BLK,769.0


To determine which bikes are most likely to be stolen, let's first look at the theft statistics by bike vendor and find the vendor whose bikes are stolen most often.

In [338]:
vendor = thief_taste['Bike_Make'].value_counts(sort=True, ascending=False).rename_axis('Vendor').reset_index(name='Count')
vendor = vendor.head(1)
vendor

Unnamed: 0,Vendor,Count
0,OT,2399


Having determined the top vendor, we can find which of its models thieves prefer to steal.

In [341]:
arr = vendor.Vendor.unique()
models = thief_taste.loc[thief_taste['Bike_Make'] == arr[0]]
models = models['Bike_Model'].value_counts(sort=True, ascending=False).rename_axis('Model').reset_index(name='Count')
models = models.head(10)
models

Unnamed: 0,Model,Count
0,MILANO,19
1,SIRRUS,19
2,HARDROCK,17
3,ALLEZ,16
4,CLASSICO,12
5,VITA,11
6,STOCKHOLM,10
7,CROSSTRAIL,10
8,VENTURA SPORT,10
9,MODENA,9


Another parameter that helps identify the bikes most likely to be stolen is color. Let's find the top 3 colors that thieves in Toronto prefer.

In [342]:
colors = thief_taste['Bike_Colour'].value_counts(sort=True, ascending=False).rename_axis('Color').reset_index(name='Count')
colors = colors.head(3)
colors

Unnamed: 0,Color,Count
0,BLK,3691
1,BLU,1271
2,WHI,1150
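
The vendor, model and color rankings above all follow the same count-and-truncate pattern. As a sketch only, it could be factored into one helper; `top_values` is a hypothetical name introduced here, and the toy records merely reuse make and colour codes seen in the tables above.

```python
import pandas as pd


def top_values(frame, column, n, label):
    """Return the n most frequent values of `column` with their counts."""
    return (
        frame[column]
        .value_counts()          # sorted by frequency, descending
        .head(n)
        .rename_axis(label)
        .reset_index(name="Count")
    )


# Toy records for illustration; not the real theft data.
toy = pd.DataFrame(
    {
        "Bike_Make": ["OT", "OT", "GI", "KO", "OT"],
        "Bike_Colour": ["BLK", "BLU", "BLK", "BLK", "WHI"],
    }
)

top_vendor = top_values(toy, "Bike_Make", 1, "Vendor")
top_color = top_values(toy, "Bike_Colour", 1, "Color")
print(top_vendor)
print(top_color)
```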


## 5. Results and Conclusion <a name="results"></a>

In [365]:
stats = pd.DataFrame()
stats['Safety'] = safety['Neighbourhood'].str.rstrip(')(1234567890')
stats['Security'] = theft['Neighbourhood'].str.rstrip(')(1234567890')
stats['Age'] = demo_age_t.index.values
stats['Commuting destination'] = demo_cdes_t.index.values
stats['Main Commute Mode'] = commute_t.index.values
stats['Commute Time'] = demo_cdur_t.index.values

stats

Unnamed: 0,Safety,Security,Age,Commuting destination,Main Commute Mode,Commute Time
0,Palmerston-Little Italy,Pleasant View,Waterfront Communities-The Island,Waterfront Communities-The Island,Waterfront Communities-The Island,Waterfront Communities-The Island
1,Agincourt South-Malvern West,Rexdale-Kipling,Willowdale East,Niagara,Church-Yonge Corridor,Niagara
2,Waterfront Communities-The Island,Bayview Woods-Steeles,Woburn,Church-Yonge Corridor,Niagara,Willowdale East
3,Wychwood,Rustic,Niagara,Willowdale East,Mount Pleasant West,Islington-City Centre West
4,Little Portugal,Maple Leaf,Islington-City Centre West,Woburn,Willowdale East,Dovercourt-Wallace Emerson-Junction
5,Mount Pleasant East,Beechborough-Greenbrook,Rouge,Rouge,Dovercourt-Wallace Emerson-Junction,Rouge
6,Birchcliffe-Cliffside,Ionview,Malvern,Dovercourt-Wallace Emerson-Junction,Annex,Woburn
7,The Beaches,Pelmo Park-Humberlea,Church-Yonge Corridor,Mount Pleasant West,Bay Street Corridor,Mount Pleasant West
8,New Toronto,Victoria Village,Dovercourt-Wallace Emerson-Junction,Islington-City Centre West,Islington-City Centre West,Mimico (includes Humber Bay Shores)
9,Casa Loma,Humbermede,L'Amoreaux,Malvern,Woburn,Church-Yonge Corridor
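
A note on the name cleanup used for the `Safety` and `Security` columns: `str.rstrip` with a character set removes any trailing run of those characters, so it leaves the space that precedes the parenthesis. A hedged alternative is a regular expression that removes only a trailing numeric id together with that space. The pattern below is an assumption about the naming scheme, based on the names shown in the tables above.

```python
import pandas as pd

names = pd.Series(
    [
        "Pleasant View (46)",
        "Rexdale-Kipling (4)",
        "Mimico (includes Humber Bay Shores)",  # parentheses without an id
    ]
)

# rstrip with a character set keeps the space before "(".
stripped = names.str.rstrip(")(1234567890")

# Remove optional whitespace plus a trailing "(<digits>)" only.
cleaned = names.str.replace(r"\s*\(\d+\)$", "", regex=True)

print(repr(stripped[0]))
print(cleaned.tolist())
```

With clean names, values in the `Safety` and `Security` columns can be compared directly against the other columns, whose names carry no id suffix.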


We have analyzed open datasets provided by the official authorities of the city of Toronto, so there is no doubt about the authenticity of the information. Of course, the data is not the most recent, but it contains statistics collected over the last few years. In my opinion it at least reflects the trends of recent years, so it should be valid today as well.

In the final stats dataframe I combined the rankings of the top neighborhoods for each parameter. The **Waterfront Communities-The Island** district is the clear leader: it ranks first in four of the six categories (Age, Commuting destination, Main Commute Mode and Commute Time) and third in Safety, although it does not appear in the Security top 10. Anyone who wants to start a bike sharing business in Toronto should definitely begin placing bike stations there.
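
The leader can also be read off the stats table programmatically rather than by eye. The sketch below is illustrative only: it uses just the first three rows of each column of the stats table above and counts how often each neighborhood appears and how often it ranks first.

```python
import pandas as pd

waterfront = "Waterfront Communities-The Island"

# First three rows of the stats table above (excerpt, not the full top 10).
stats_top3 = pd.DataFrame(
    {
        "Safety": ["Palmerston-Little Italy", "Agincourt South-Malvern West", waterfront],
        "Security": ["Pleasant View", "Rexdale-Kipling", "Bayview Woods-Steeles"],
        "Age": [waterfront, "Willowdale East", "Woburn"],
        "Commuting destination": [waterfront, "Niagara", "Church-Yonge Corridor"],
        "Main Commute Mode": [waterfront, "Church-Yonge Corridor", "Niagara"],
        "Commute Time": [waterfront, "Niagara", "Willowdale East"],
    }
)

# How many categories does each neighborhood appear in?
appearances = (
    stats_top3
    .melt(var_name="Category", value_name="Neighbourhood")["Neighbourhood"]
    .value_counts()
)

# How many categories does each neighborhood lead (row 0)?
first_places = stats_top3.iloc[0].value_counts()

print(appearances.head(3))
print(first_places)
```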

By analysing thieves' tastes we found that they are most likely to steal **black, blue and white** bikes, and bikes recorded under vendor code **OT**, with the following model preferences: **MILANO, SIRRUS, HARDROCK, ALLEZ, CLASSICO, VITA, STOCKHOLM, CROSSTRAIL, VENTURA SPORT, MODENA**. I would recommend that a bike sharing owner in Toronto avoid buying these bike models and be careful with bikes in these colors.