# Social Determinants of Health: How to Make Toronto a Healthier City

## Background

<a href="http://thecanadianfacts.org/the_canadian_facts.pdf">Social determinants of health</a> (such as income and income distribution, education, unemployment, food insecurity, housing, race and many others) have long been known to have a major impact on health outcomes. However, these impacts are rarely obvious or clearly visualized on a day-to-day basis. In this project, I hope to show the effects of social determinants of health using clustering algorithms. I will first cluster health outcome data to group Toronto's neighbourhoods into clusters. I will then run the same kind of clustering on social determinants data. Overlaying the two sets of clusters should show the effect of social determinants of health on health outcomes.
<img src="https://www.cma.ca/Assets/assets-library/image/en/advocacy/what-makes-canadians-sick-e.png" alt="Social Determinants of Health">

This clustering approach could be a useful tool for policy makers and public health workers. It lets us see which parts of the city are less healthy than others and suggests what we can do about it. For example, if education clusters well with health data, as expected, that can help policy makers decide where to build the next school.

## Data Source

The City of Toronto has a platform called <a href="https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/">Open Data Catalogue</a>. It hosts a number of datasets titled "Wellbeing Toronto - ...". I will use the datasets for demographics, economics, education, environment, housing, recreation, safety and transportation, together with FourSquare data on different venues, for one set of clusters, and the health dataset for the other. Overlaying the two is expected to show the effects of the social determinants of health on health outcomes.
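
Before the social determinants can be clustered together, the separate "Wellbeing Toronto" tables need to be combined into one table per neighbourhood. The sketch below shows one way this might be done, assuming every table carries the `Neighbourhood` and `Neighbourhood Id` columns seen in the health and economics files. The `education` table, its indicator column, and the `combine_tables` helper are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy stand-ins for two "Wellbeing Toronto" tables (values are illustrative only)
economics = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Neighbourhood Id": [1, 2, 3],
    "Businesses": [2550, 273, 236],
})
education = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Neighbourhood Id": [1, 2, 3],
    "Hypothetical Education Indicator": [0.7, 0.5, 0.6],  # made-up column
})

def combine_tables(tables):
    """Inner-join a list of DataFrames on the shared neighbourhood keys."""
    keys = ["Neighbourhood", "Neighbourhood Id"]
    combined = tables[0]
    for table in tables[1:]:
        combined = combined.merge(table, on=keys, how="inner")
    return combined

determinants = combine_tables([economics, education])
print(determinants)
```

An inner join keeps only neighbourhoods present in every table, which avoids introducing missing values at the cost of silently dropping unmatched rows.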

## Methodology

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
health = pd.read_excel('http://opendata.toronto.ca/social.development/wellbeing/WB-Health.xlsx',sheet_name=1,skiprows=1)

In [7]:
health.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Breast Cancer Screenings,Cervical Cancer Screenings,Colorectal Cancer Screenings,Community Food Programs,Diabetes Prevalence,DineSafe Inspections,Female Fertility,Health Providers,Premature Mortality,Student Nutrition,Teen Pregnancy
0,West Humber-Clairville,1,52.3,60.4,38.7,2,14.0,45,60.286693,94,220.397386,1190,30.917874
1,Mount Olive-Silverstone-Jamestown,2,47.8,54.6,36.8,3,14.9,2,71.103462,39,198.255215,5026,37.106918
2,Thistletown-Beaumond Heights,3,54.8,60.1,38.2,2,12.0,0,60.681976,10,244.423537,950,38.418079
3,Rexdale-Kipling,4,56.0,62.6,37.2,2,11.9,0,52.04216,6,224.477055,125,33.879781
4,Elms-Old Rexdale,5,54.7,61.1,35.6,1,12.7,0,65.44963,1,238.482169,1975,39.344262


In [18]:
health.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 13 columns):
Neighbourhood                   140 non-null object
Neighbourhood Id                140 non-null int64
Breast Cancer Screenings        140 non-null float64
Cervical Cancer Screenings      140 non-null float64
Colorectal Cancer Screenings    140 non-null float64
Community Food Programs         140 non-null int64
Diabetes Prevalence             140 non-null float64
DineSafe Inspections            140 non-null int64
Female Fertility                140 non-null float64
Health Providers                140 non-null int64
Premature Mortality             140 non-null float64
Student Nutrition               140 non-null int64
Teen Pregnancy                  140 non-null float64
dtypes: float64(7), int64(5), object(1)
memory usage: 14.3+ KB


In [19]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# drop the text column; note that 'Neighbourhood Id' is still included as a feature
health_cluster = health.drop(columns=['Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(health_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 3, 1, 4, 2, 2, 1, 4, 4, 4])
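
K-means relies on Euclidean distance, so columns with large numeric ranges can dominate the result. In the rows shown earlier, Student Nutrition reaches 5026 while Diabetes Prevalence stays below 15. The cell above clusters the raw values; the sketch below, on toy data, illustrates how standardizing each column first might change the grouping. Using `StandardScaler` here is an assumption about a reasonable preprocessing step, not something the notebook does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 is on a 0-1 scale, column 1 is in the thousands
X = np.array([
    [0.1, 1000.0],
    [0.2, 3000.0],
    [0.9, 1100.0],
    [1.0, 3100.0],
])

# Without scaling, the large-valued column decides the grouping
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Standardize each column to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print(raw_labels)
print(scaled_labels)
```

In the notebook this would amount to fitting the scaler on `health_cluster` and passing the scaled array to `KMeans` instead of the raw DataFrame.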

In [20]:
health['Cluster Labels'] = kmeans.labels_

In [21]:
health.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Breast Cancer Screenings,Cervical Cancer Screenings,Colorectal Cancer Screenings,Community Food Programs,Diabetes Prevalence,DineSafe Inspections,Female Fertility,Health Providers,Premature Mortality,Student Nutrition,Teen Pregnancy,Cluster Labels
0,West Humber-Clairville,1,52.3,60.4,38.7,2,14.0,45,60.286693,94,220.397386,1190,30.917874,1
1,Mount Olive-Silverstone-Jamestown,2,47.8,54.6,36.8,3,14.9,2,71.103462,39,198.255215,5026,37.106918,3
2,Thistletown-Beaumond Heights,3,54.8,60.1,38.2,2,12.0,0,60.681976,10,244.423537,950,38.418079,1
3,Rexdale-Kipling,4,56.0,62.6,37.2,2,11.9,0,52.04216,6,224.477055,125,33.879781,4
4,Elms-Old Rexdale,5,54.7,61.1,35.6,1,12.7,0,65.44963,1,238.482169,1975,39.344262,2


In [45]:
economics = pd.read_excel('http://opendata.toronto.ca/social.development/wellbeing/WB-Economics.xlsx',sheet_name=1,skiprows=1)

In [46]:
economics.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Access to Child Care,Business Licensing,Businesses,Child Care Spaces,Inequality (Gini coeff.),Local Employment,Social Assistance Recipients
0,West Humber-Clairville,1,0.383582,695,2550,180.0,0.35938,63385,2702.0
1,Mount Olive-Silverstone-Jamestown,2,0.245736,106,273,45.0,0.42015,3346,6406.0
2,Thistletown-Beaumond Heights,3,0.325,116,236,25.0,0.38201,1350,1082.0
3,Rexdale-Kipling,4,0.24,49,155,60.0,0.388429,1190,1231.0
4,Elms-Old Rexdale,5,0.321622,31,70,60.0,0.38813,831,1759.0


In [47]:
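# fill missing recipient counts with the column mean; assigning the result back avoids chained in-place modification
economics['Social Assistance Recipients'] = economics['Social Assistance Recipients'].fillna(economics['Social Assistance Recipients'].mean())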

In [48]:
economics.describe()

Unnamed: 0,Neighbourhood Id,Access to Child Care,Business Licensing,Businesses,Child Care Spaces,Inequality (Gini coeff.),Local Employment,Social Assistance Recipients
count,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0
mean,70.5,0.602408,189.621429,534.771429,111.059631,6.044291,9348.671429,1778.604317
std,40.5586,3.442309,170.297939,630.775899,73.244358,66.903276,18490.817103,1489.087377
min,1.0,0.08,30.0,48.0,0.0,0.195639,300.0,16.0
25%,35.75,0.243073,93.75,171.25,55.0,0.374683,2049.5,661.5
50%,70.5,0.293021,135.0,343.0,93.5,0.397599,3828.5,1347.5
75%,105.25,0.348653,222.75,593.75,159.5,0.414502,10066.75,2528.0
max,140.0,41.0,1163.0,4194.0,400.0,792.0,175515.0,6786.0
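
The summary above reports a maximum of 792 for `Inequality (Gini coeff.)` and 41 for `Access to Child Care`, while the 75th percentiles are about 0.41 and 0.35. A Gini coefficient lies between 0 and 1, so values like these are probably data-entry problems, and they can pull a k-means centroid toward a single neighbourhood. The sketch below shows one possible cleanup on toy data, assuming out-of-range values should be replaced by the median of the valid ones. The helper name `replace_out_of_range` is hypothetical.

```python
import pandas as pd

# Toy column with one implausible entry (values are illustrative only)
df = pd.DataFrame({"Inequality (Gini coeff.)": [0.36, 0.42, 792.0, 0.39]})

def replace_out_of_range(series, low, high):
    """Replace values outside [low, high] with the median of the valid values."""
    valid = series.between(low, high)
    cleaned = series.where(valid)  # out-of-range entries become NaN
    return cleaned.fillna(series[valid].median())

df["Inequality (Gini coeff.)"] = replace_out_of_range(
    df["Inequality (Gini coeff.)"], 0.0, 1.0
)
print(df)
```

Whether to replace, drop, or manually correct such rows is a judgment call; the valid range for columns other than the Gini coefficient would need to be checked against the dataset's documentation.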


In [49]:
# set number of clusters
kclusters = 5

economic_cluster = economics.drop(columns=['Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(economic_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 0, 0, 0, 0, 0, 0, 0, 0])
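
Nine of the ten labels shown above fall into cluster 0, and `kclusters = 5` was fixed by hand. One common way to check whether a given number of clusters suits the data is to compare mean silhouette scores across several values of k. The sketch below does this on toy data; the helper `best_k_by_silhouette` is hypothetical, and the silhouette score is only one of several possible criteria.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

def best_k_by_silhouette(X, candidates):
    """Return the candidate k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k, scores)
```

Applied to the notebook, `X` would be the (ideally scaled) feature table, and the chosen k could differ between the health and economics data.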

In [52]:
health['Economics Labels'] = kmeans.labels_

In [53]:
health

Unnamed: 0,Neighbourhood,Neighbourhood Id,Breast Cancer Screenings,Cervical Cancer Screenings,Colorectal Cancer Screenings,Community Food Programs,Diabetes Prevalence,DineSafe Inspections,Female Fertility,Health Providers,Premature Mortality,Student Nutrition,Teen Pregnancy,Cluster Labels,Economics Labels
0,West Humber-Clairville,1,52.3,60.4,38.7,2,14.0,45,60.286693,94,220.397386,1190,30.917874,1,2
1,Mount Olive-Silverstone-Jamestown,2,47.8,54.6,36.8,3,14.9,2,71.103462,39,198.255215,5026,37.106918,3,0
2,Thistletown-Beaumond Heights,3,54.8,60.1,38.2,2,12.0,0,60.681976,10,244.423537,950,38.418079,1,0
3,Rexdale-Kipling,4,56.0,62.6,37.2,2,11.9,0,52.042160,6,224.477055,125,33.879781,4,0
4,Elms-Old Rexdale,5,54.7,61.1,35.6,1,12.7,0,65.449630,1,238.482169,1975,39.344262,2,0
5,Kingsview Village-The Westway,6,55.1,59.2,37.9,0,11.9,0,58.052433,23,225.735504,2058,18.829517,2,0
6,Willowridge-Martingrove-Richview,7,62.3,64.8,40.6,0,10.8,4,48.382560,30,181.669851,1245,15.748031,1,0
7,Humber Heights-Westmount,8,58.8,66.4,37.8,2,9.9,2,43.348281,4,249.064753,0,21.052632,4,0
8,Edenbridge-Humber Valley,9,65.9,71.4,42.3,1,7.2,0,26.871223,10,170.741378,0,7.260726,4,0
9,Princess-Rosethorn,10,67.3,74.2,42.6,0,6.6,0,26.504394,4,209.250891,0,7.407407,4,0


In [75]:
from geopy.geocoders import Nominatim

# create the geocoder once and reuse it for every lookup
geolocator = Nominatim(user_agent="my-application")

for neighbourhood in health['Neighbourhood']:
    # the bare neighbourhood name is used as the query, so a match can fall outside Toronto
    location = geolocator.geocode(neighbourhood)
    if location:
        # .loc with a boolean mask writes to the matching row and creates the column if needed
        mask = health['Neighbourhood'] == neighbourhood
        health.loc[mask, 'latitude'] = location.latitude
        health.loc[mask, 'longitude'] = location.longitude

In [76]:
health

Unnamed: 0,Neighbourhood,Neighbourhood Id,Breast Cancer Screenings,Cervical Cancer Screenings,Colorectal Cancer Screenings,Community Food Programs,Diabetes Prevalence,DineSafe Inspections,Female Fertility,Health Providers,Premature Mortality,Student Nutrition,Teen Pregnancy,Cluster Labels,Economics Labels,latitude,longitude
0,West Humber-Clairville,1,52.3,60.4,38.7,2,14.0,45,60.286693,94,220.397386,1190,30.917874,1,2,53.895042,-0.161158
1,Mount Olive-Silverstone-Jamestown,2,47.8,54.6,36.8,3,14.9,2,71.103462,39,198.255215,5026,37.106918,3,0,,
2,Thistletown-Beaumond Heights,3,54.8,60.1,38.2,2,12.0,0,60.681976,10,244.423537,950,38.418079,1,0,,
3,Rexdale-Kipling,4,56.0,62.6,37.2,2,11.9,0,52.042160,6,224.477055,125,33.879781,4,0,43.721362,-79.565513
4,Elms-Old Rexdale,5,54.7,61.1,35.6,1,12.7,0,65.449630,1,238.482169,1975,39.344262,2,0,43.721362,-79.565513
5,Kingsview Village-The Westway,6,55.1,59.2,37.9,0,11.9,0,58.052433,23,225.735504,2058,18.829517,2,0,43.692999,-79.557143
6,Willowridge-Martingrove-Richview,7,62.3,64.8,40.6,0,10.8,4,48.382560,30,181.669851,1245,15.748031,1,0,,
7,Humber Heights-Westmount,8,58.8,66.4,37.8,2,9.9,2,43.348281,4,249.064753,0,21.052632,4,0,43.697007,-79.520235
8,Edenbridge-Humber Valley,9,65.9,71.4,42.3,1,7.2,0,26.871223,10,170.741378,0,7.260726,4,0,43.678623,-79.510131
9,Princess-Rosethorn,10,67.3,74.2,42.6,0,6.6,0,26.504394,4,209.250891,0,7.407407,4,0,43.659509,-79.539920
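
Row 0 in the output above was geocoded to roughly (53.9, -0.16), which is far from Toronto, and several rows have no coordinates at all. The sketch below shows two simple safeguards: qualifying the query with the city name, and rejecting results that fall outside a bounding box. The bounding-box values are approximate assumptions, and `build_query` and `in_toronto` are hypothetical helpers.

```python
# Approximate bounding box for the City of Toronto (assumed values, adjust as needed)
TORONTO_BOUNDS = {"south": 43.58, "north": 43.86, "west": -79.64, "east": -79.11}

def build_query(neighbourhood):
    """Qualify a neighbourhood name so the geocoder searches in Toronto."""
    return "{}, Toronto, Ontario, Canada".format(neighbourhood)

def in_toronto(lat, lng, bounds=TORONTO_BOUNDS):
    """Return True if the point lies inside the bounding box."""
    return (bounds["south"] <= lat <= bounds["north"]
            and bounds["west"] <= lng <= bounds["east"])

print(build_query("West Humber-Clairville"))
print(in_toronto(53.895042, -0.161158))   # coordinates from row 0 above
print(in_toronto(43.721362, -79.565513))  # coordinates from row 3 above
```

In the geocoding loop, the qualified string would be passed to `geolocator.geocode(...)`, and any result failing `in_toronto` would be treated like a missing location.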


In [77]:
health_drop = health.dropna()

In [81]:
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

# approximate coordinates of downtown Toronto, used to centre the maps
latitude, longitude = 43.6532, -79.3832

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

In [82]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
# loop variables are named health_label / econ_label so the `health` DataFrame is not overwritten
for lat, lng, neighborhood, health_label, econ_label in zip(health_drop['latitude'], health_drop['longitude'], health_drop['Neighbourhood'], health_drop['Cluster Labels'], health_drop['Economics Labels']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[health_label-1],      # outline colour: health cluster
        fill=True,
        fill_color=rainbow[econ_label-1],   # fill colour: economics cluster
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## Conclusion

As the map shows, the health clusters and the economics clusters appear to be related. Markers with a red outline (health cluster) tend to have an orange fill (economics cluster). Markers with a purple outline tend to have a purple fill as well, and markers with a green outline similarly tend to share a green fill.
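
K-means cluster numbers are arbitrary, so a matching outline and fill colour is not meaningful in itself; what matters is whether neighbourhoods that share a health cluster also tend to share an economics cluster. A contingency table is one way to check this numerically. The sketch below uses toy labels standing in for the `Cluster Labels` and `Economics Labels` columns; the values are illustrative, not results from the notebook.

```python
import pandas as pd

# Toy labels standing in for health['Cluster Labels'] and health['Economics Labels']
labels = pd.DataFrame({
    "Cluster Labels":   [1, 1, 3, 4, 4, 2, 2, 4],
    "Economics Labels": [2, 2, 0, 0, 0, 1, 1, 0],
})

# Rows: health clusters, columns: economics clusters, cells: neighbourhood counts
overlap = pd.crosstab(labels["Cluster Labels"], labels["Economics Labels"])

# Share of each health cluster that falls into each economics cluster
overlap_share = overlap.div(overlap.sum(axis=1), axis=0)

print(overlap)
print(overlap_share)
```

Rows whose counts concentrate in a single column indicate health clusters that line up with one economics cluster; rows spread evenly across columns indicate little relationship.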