# Building the dataset for PerfectCity.io

PerfectCity.io needs many features for each city. Gathering them often means consulting several datasets per city and extracting the relevant information, and each dataset requires careful treatment to get the right data. This notebook documents the process of going from essentially no data to a complete PerfectCity.io dataset.

In [131]:
import pandas as pd
import json

In [132]:
# define cities and the parameters we are going to use
cities = ['VANCOUVER', 'MONTREAL', 'TORONTO', 'OTTAWA', 'HAMILTON', 'WINNIPEG', 'EDMONTON', 'CALGARY']
parameters = ['TREES', 'PARKS', 'TRANSIT_SCORE', 'WALKABILITY', 'POPULATION', 'BUSINESSES', 'CRIME', 'LIBRARIES', 'SCHOOLS', 'UNIVERSITIES', 'UNEMPLOYMENT', 'AVERAGE_AGE']

# print() with a single argument behaves the same under Python 2 and Python 3
print('cities x parameters ')
print('{} x {} = {}'.format(len(cities), len(parameters), len(cities) * len(parameters)))

cities x parameters 
8 x 12 = 96


In [133]:
# build an empty object to hold the information
data = { city: { parameter : 0 for parameter in parameters } for city in cities }

#### Dealing with Accents
Many Canadian city names contain accents (for example, Montréal), and the datasets do not spell them consistently. We therefore build a regular expression from the first four characters of each city name, which produced good enough matches for our purposes:
```
search_city = city[:4] # select the first four letters of the city name
search_term = r'\b(?:' + search_city + r')\w?' # incorporate it into the regex
```
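
The prefix match described above can be tried on a few labels without loading any dataset. The following is a minimal sketch for Python 3; `matches_city` and `strip_accents` are hypothetical helper names that do not appear in the notebook, and the accent-stripping step is an alternative technique (Unicode decomposition), not the approach the notebook uses.

```python
import re
import unicodedata


def matches_city(city, label, prefix_len=4):
    """Return True if `label` has a word starting with the first letters of `city`."""
    # re.escape is a defensive assumption; the city names here contain no regex metacharacters
    pattern = r'\b(?:' + re.escape(city[:prefix_len]) + r')\w?'
    return re.search(pattern, label) is not None


def strip_accents(text):
    """Alternative approach: decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Labels in the style of the census table used later in this notebook
labels = ['MONTRÉAL (QUE.)', 'OTTAWA - GATINEAU (ONT./QUE.)', 'TORONTO (ONT.)']

print([label for label in labels if matches_city('MONTREAL', label)])
# ['MONTRÉAL (QUE.)']
print(strip_accents('MONTRÉAL (QUE.)'))
# MONTREAL (QUE.)
```

The prefix match works for the eight cities here because their first four letters are unaccented and distinct from one another. With a larger city list, a four-letter prefix could match more than one row, so it would be worth checking the number of matches.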

---
# Transit Score Ranking
Obtained from [Walkscore 2014 Ranking](http://blog.walkscore.com/2014/03/best-canadian-cities-for-public-transit/#.VYWKBxNJaV4)

In [134]:
# RANKING FROM http://blog.walkscore.com/2014/03/best-canadian-cities-for-public-transit/#.VYWKBxNJaV4
TRANSIT_SCORE = {'TORONTO': 78, 'MONTREAL': 77, 'VANCOUVER': 74, 'WINNIPEG': 51, 'OTTAWA': 49, 'EDMONTON': 44, 'CALGARY':43, 'HAMILTON': 42 }

for city, score in TRANSIT_SCORE.items():
    data[city]['TRANSIT_SCORE'] = score
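
Because the scores are typed in by hand, a misspelled or missing city name would either raise a `KeyError` or leave a city at its placeholder value. A small sanity check along the following lines could catch that before the assignment loop runs. This is only a sketch; `check_coverage` is a hypothetical helper, not part of the notebook.

```python
def check_coverage(expected, mapping):
    """Return (missing, unexpected) key lists, both sorted alphabetically."""
    missing = sorted(set(expected) - set(mapping))
    unexpected = sorted(set(mapping) - set(expected))
    return missing, unexpected


cities = ['VANCOUVER', 'MONTREAL', 'TORONTO', 'OTTAWA',
          'HAMILTON', 'WINNIPEG', 'EDMONTON', 'CALGARY']
TRANSIT_SCORE = {'TORONTO': 78, 'MONTREAL': 77, 'VANCOUVER': 74, 'WINNIPEG': 51,
                 'OTTAWA': 49, 'EDMONTON': 44, 'CALGARY': 43, 'HAMILTON': 42}

missing, unexpected = check_coverage(cities, TRANSIT_SCORE)
print(missing, unexpected)  # [] []
```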

----
# Population Information
[Stats Canada census from 2011](http://www12.statcan.gc.ca/census-recensement/2011/dp-pd/hlt-fst/pd-pl/Table-Tableau.cfm?LANG=Eng&T=205&S=3&RPP=50)

In [135]:
population_df = pd.read_csv('./datasets/canadian_population_census_2011.CSV')
print(population_df.columns[[1,4]]) # we need to keep columns 1 and 4 and delete everything else
population_df = population_df.iloc[:, [1, 4]].copy() # select by position; .copy() avoids chained-assignment warnings
population_df['Geographic name'] = population_df['Geographic name'].str.upper()
population_df.head() #all good

Index([u'Geographic name', u'Population, 2011'], dtype='object')


Unnamed: 0,Geographic name,"Population, 2011"
0,TORONTO (ONT.),5583064
1,MONTRÉAL (QUE.),3824221
2,VANCOUVER (B.C.),2313328
3,OTTAWA - GATINEAU (ONT./QUE.),1236324
4,CALGARY (ALTA.),1214839


In [136]:
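for city in cities:
    # match on the first four letters; the non-capturing group avoids pandas' match-group warning
    search_term = r'\b(?:' + city[:4] + r')\w?'
    matches = population_df[population_df['Geographic name'].str.contains(search_term, na=False)]
    data[city]['POPULATION'] = matches['Population, 2011'].tolist()[0]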

-----
# Age Groups
Curated from *[Statistics Canada. Table 051-0056 - Estimates of population by census metropolitan area, sex and age group for July 1, based on the Standard Geographical Classification (SGC) 2011, annual (persons), CANSIM (database). (accessed: 2015-06-20)](http://www5.statcan.gc.ca/cansim/a47)*
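
The notebook does not show how `AVERAGE_AGE` is derived from this table. One possible approach, sketched here purely as an assumption, is a population-weighted mean of the age-group midpoints. The function name `approximate_mean_age` is hypothetical, and the bins and counts below are made-up illustrative values, not Statistics Canada figures.

```python
def approximate_mean_age(age_groups):
    """Population-weighted mean of bin midpoints.

    age_groups: list of (low, high, count) with inclusive ages in completed
    years, e.g. (0, 4, 100) for the "0 to 4 years" group.
    """
    total = sum(count for _, _, count in age_groups)
    if total == 0:
        raise ValueError('age_groups has no population')
    # People aged `low` to `high` inclusive span the interval [low, high + 1)
    weighted = sum((low + high + 1) / 2.0 * count for low, high, count in age_groups)
    return weighted / total


# Made-up counts, for illustration only
toy_groups = [(0, 4, 100), (5, 9, 100)]
print(approximate_mean_age(toy_groups))  # 5.0
```

An open-ended top group such as "90 years and over" has no natural midpoint, so any real calculation would need an explicit assumption for it.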

----
# Parks and Green Spaces
Curated from *[Statistics Canada. Table 153-0148 - Households and the environment survey, parks and green spaces, Canada, provinces and census metropolitan areas (CMA), every 2 years (percent), CANSIM (database). (accessed: 2015-06-20)](http://www5.statcan.gc.ca/cansim/a26)*

- "Close to home" is defined as being within a 10-minute journey from home

In [137]:
parks_df = pd.read_csv('./datasets/canadian_parks2013.csv')
parks_df['CITY'] = parks_df['CITY'].str.upper()
parks_df.head() #all good

Unnamed: 0,CITY,PARKS
0,"MONTREAL, QUEBEC [24462]",91
1,"OTTAWA-GATINEAU, ONTARIO/QUEBEC [24505 35505]",91
2,"TORONTO, ONTARIO [35535]",85
3,"HAMILTON, ONTARIO [35537]",86
4,"WINNIPEG, MANITOBA [46602]",88


In [138]:
for city in cities:
    # same prefix match as for the population table
    search_term = r'\b(?:' + city[:4] + r')\w?'
    matches = parks_df[parks_df['CITY'].str.contains(search_term, na=False)]
    data[city]['PARKS'] = matches['PARKS'].tolist()[0]

-----
# Crime Severity Index

We obtain the crime severity statistic from [Statistics Canada. Table 252-0052 - Crime severity index and weighted clearance rates, annual (index unless otherwise noted), CANSIM (database). (accessed: 2015-06-20)](http://www5.statcan.gc.ca/cansim/a26)

In [139]:
crime_df = pd.read_csv('./datasets/crime_severity2013.csv')
crime_df['CITY'] = crime_df['CITY'].str.upper()
crime_df.head()

Unnamed: 0,CITY,CRIME_SEVERITY_INDEX
0,"MONTRÉAL, QUEBEC (28,34)",65.93
1,"OTTAWA-GATINEAU, ONTARIO/QUEBEC (6)",53.28
2,"TORONTO, ONTARIO (25)",47.14
3,"HAMILTON, ONTARIO (25)",55.11
4,"WINNIPEG, MANITOBA (9,10,33)",83.17


In [140]:
for city in cities:
    # same prefix match as for the population table
    search_term = r'\b(?:' + city[:4] + r')\w?'
    matches = crime_df[crime_df['CITY'].str.contains(search_term, na=False)]
    data[city]['CRIME'] = matches['CRIME_SEVERITY_INDEX'].tolist()[0]
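
The same lookup is repeated for the population, parks, and crime tables, so it could be factored into one function. The sketch below assumes pandas is installed; `lookup_city_value` is a hypothetical helper that does not exist in the notebook. Unlike `.tolist()[0]`, it raises a descriptive `KeyError` when no row matches instead of an `IndexError`.

```python
import re

import pandas as pd


def lookup_city_value(df, name_col, value_col, city, prefix_len=4):
    """Return `value_col` from the first row whose `name_col` matches the city prefix."""
    # re.escape is a defensive assumption; the city names here need no escaping
    pattern = r'\b(?:' + re.escape(city[:prefix_len]) + r')\w?'
    matches = df.loc[df[name_col].str.contains(pattern, na=False), value_col]
    if matches.empty:
        raise KeyError('no row matching %r in column %r' % (city, name_col))
    return matches.iloc[0]


# Miniature table built from two rows of the crime output shown above
toy_df = pd.DataFrame({
    'CITY': ['TORONTO, ONTARIO (25)', 'HAMILTON, ONTARIO (25)'],
    'CRIME_SEVERITY_INDEX': [47.14, 55.11],
})

print(lookup_city_value(toy_df, 'CITY', 'CRIME_SEVERITY_INDEX', 'HAMILTON'))  # 55.11
```

With such a helper, each section's loop would reduce to a single assignment per city, for example `data[city]['CRIME'] = lookup_city_value(crime_df, 'CITY', 'CRIME_SEVERITY_INDEX', city)`.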

# Final Curated Dataset

In [141]:
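# build the frame directly from the nested dict, then sort rows and columns alphabetically
df = pd.DataFrame.from_dict(data, orient='index').sort_index().sort_index(axis=1)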
df

Unnamed: 0,AVERAGE_AGE,BUSINESSES,CRIME,LIBRARIES,PARKS,POPULATION,SCHOOLS,TRANSIT_SCORE,TREES,UNEMPLOYMENT,UNIVERSITIES,WALKABILITY
CALGARY,0,0,60.4,0,88,1214839,0,43,0,0,0,0
EDMONTON,0,0,84.49,0,89,1159869,0,44,0,0,0,0
HAMILTON,0,0,55.11,0,86,721053,0,42,0,0,0,0
MONTREAL,0,0,65.93,0,91,3824221,0,77,0,0,0,0
OTTAWA,0,0,53.28,0,91,1236324,0,49,0,0,0,0
TORONTO,0,0,47.14,0,85,5583064,0,78,0,0,0,0
VANCOUVER,0,0,90.26,0,87,2313328,0,74,0,0,0,0
WINNIPEG,0,0,83.17,0,88,730018,0,51,0,0,0,0
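
Every parameter starts at the placeholder value `0`, so the zeros in the table above mark features that have not been collected yet. A short check like the following can list what is still outstanding for each city. It is a sketch only; `unfilled_parameters` is a hypothetical helper, and the sample dictionary uses a few values from the Toronto row.

```python
def unfilled_parameters(data, placeholder=0):
    """Map each city to the sorted list of parameters still equal to the placeholder."""
    return {
        city: sorted(name for name, value in values.items() if value == placeholder)
        for city, values in data.items()
    }


# Small stand-in for `data`, using values from the Toronto row above
sample = {'TORONTO': {'CRIME': 47.14, 'PARKS': 85, 'TREES': 0, 'SCHOOLS': 0}}
print(unfilled_parameters(sample))  # {'TORONTO': ['SCHOOLS', 'TREES']}
```

A genuine value of zero cannot be told apart from the placeholder with this scheme. Initialising the dictionary with `None` instead would remove that ambiguity, at the cost of handling missing values when the DataFrame is built.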
