# OPTIMAL BUSINESS INVESTMENT AREA

## Project Introduction 

- **Purpose** - This project produces a list of districts in Toronto, Canada: the neighborhoods that offer the best business opportunities for an entrepreneur who wants to start a gym. 



- **Structure** - The project is long, so it is divided into parts, and each part is split by subheadings to improve readability. 


- **Roadmap** - First gather information about the different aspects of the districts; then clean, process and merge the tables coming from different sources. Finally, optimize the data and feed it to the machine learning model. 

### Required data list
- Income per neighborhood *
- % of the target audience *
- Population *
- Number of gyms*
- Number/size of green parks *
- Crime rate

In [23]:
import requests 
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import pprint
pp = pprint.PrettyPrinter(indent=4, compact=False, width=80)
import folium
import numpy as np
import json


## CANADA API

In this section, I query the City of Toronto open data API to gather information. I chose the API because it keeps the data updated and relevant, which lets the project stay useful longer. I also tried to avoid hard-coding values as much as possible.

The same data is also available as CSV and Excel files in the database. 

### Collect Demographic Data

In [24]:
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show" # package_show endpoint
params = { "id": "6e19a90f-971c-46b3-852c-0c48c436d1fc"}  # package_id 
package = requests.get(url, params = params).json()
demographics_meta_data = package["result"]
 
# Get the data by passing the resource_id to the datastore_search endpoint
# See https://docs.ckan.org/en/latest/maintaining/datastore.html for detailed parameters options
 
for idx, resource in enumerate(package["result"]["resources"]):
    if resource["datastore_active"]:  # check if the data store is still active
        url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/datastore_search" # datastore_search endpoint
        p = { "id": resource["id"]}  # resource_id 
        data = requests.get(url, params = p).json() # get the data for the first 100 data samples 
        df = pd.DataFrame(data["result"]["records"])    # save the first 100 data samples as a dataframe 
        demographics_features_description_dict = data['result']['fields'] # get the features description
        demographics_df = df    # start from the first page so the result exists even when there are 100 rows or fewer
        for i in range(100, data["result"]["total"], 100): # looping over the remaining data
            p = { "id": resource["id"], "offset": i} # get the next 100 data samples 
            data = requests.get(url, params = p).json()
            df2 = pd.DataFrame(data["result"]["records"]) # save them in a new dataframe 
            demographics_df = pd.concat([demographics_df, df2])  # add them to the main dataframe (DataFrame.append was removed in pandas 2.0)
        break
        
demographics_df.reset_index(inplace=True, drop=True)  # reset the index 
print("The shape of the demographics 2016 dataset", demographics_df.shape) # print the shape of the final dataframe 
demographics_df.head(3)  # print the first three rows of the dataframe 

The shape of the demographics 2016 dataset (2383, 146)


Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,1,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,...,37,7,137,64,60,94,100,97,27,31
1,2,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571.0,29113,23757,12054,30526,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804


In [25]:
demographics_df.Topic.unique()    ## To select the data needed, we need subtitles for each category and get appropriate rows

array(['Neighbourhood Information', 'Population and dwellings',
       'Age characteristics', 'Household and dwelling characteristics',
       'Marital status', 'Family characteristics', 'Household type',
       'Family characteristics of adults',
       'Knowledge of official languages',
       'First official language spoken', 'Mother tongue',
       'Knowledge of languages', 'Income of households in 2015',
       'Language spoken most often at home',
       'Income of individuals in 2015',
       'Other language spoken regularly at home', 'Low income in 2015',
       'Income of economic families in 2015',
       'Immigrants by selected place of birth', 'Citizenship',
       'Visible minority population',
       'Immigrant status and period of immigration',
       'Ethnic origin population', 'Age at immigration',
       'Aboriginal population', 'Highest certificate, diploma or degree',
       'Recent immigrants by selected place of birth',
       'Generation status', 'Admission categ

In [26]:
## DATA REQUIRED ## 
income_dist = demographics_df.iloc[[983,982,964,957,950,929,860,839,827,812,669,615,426],:] ## Income groups per 10,000$ 
age_dist = demographics_df.iloc[9:15,:]  ## number of people per age groups
age_distm = demographics_df.iloc[24:26]   ## needed to refine the age interval that is considered as customer of gyms
age_distf = demographics_df.iloc[45:47]    ## needed to refine the age interval that is considered as customer of gyms
avg_income_tax = demographics_df.iloc[[2362],:]    ## Income level is a good insight provider data for regular gym goers.
population = demographics_df.iloc[[2],:]      ## Population numbers are needed to see the overall picture


In [27]:
demographics_df[demographics_df.Category == 'Population'].head(3)

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,4,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,5,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%


### Data for Green Spaces


I reuse the approach above to get data about the green spaces in each neighborhood, where people have a chance to work out.
I need this data because green spaces are an alternative to a gym subscription: large green areas nearby 
may discourage people from joining a gym.
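
The paging loop from the demographics cell is repeated for every dataset, so it may be worth wrapping once. The following is a minimal sketch, not the code used in this notebook: `fetch_all_records` and its `get_json` parameter are hypothetical names, and `get_json` exists only so the function can be tried without network access. It assumes `datastore_search` accepts `limit` and `offset` and reports `total`, as the CKAN documentation linked in the cell above describes.

```python
BASE_URL = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/"


def fetch_all_records(resource_id, page_size=100, get_json=None):
    """Page through datastore_search and return every record as a list of dicts."""
    if get_json is None:
        import requests  # imported lazily so a stub can be used offline

        def get_json(url, params):
            return requests.get(url, params=params, timeout=30).json()

    url = BASE_URL + "datastore_search"
    records, offset, total = [], 0, None
    # Keep requesting pages until the offset passes the reported total.
    while total is None or offset < total:
        params = {"id": resource_id, "limit": page_size, "offset": offset}
        result = get_json(url, params)["result"]
        total = result["total"]
        records.extend(result["records"])
        offset += page_size
    return records
```

With such a helper, each dataset would reduce to one line, for example `green_df = pd.DataFrame(fetch_all_records(resource["id"]))`, and datasets with 100 rows or fewer need no special case.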

In [28]:
base_url = 'https://ckan0.cf.opendata.inter.prod-toronto.ca/'
url = base_url + "api/action/package_show"
params = {"id": "green-spaces"}
package = requests.get(url, params=params).json()

for idx, resource in enumerate(package["result"]["resources"]):
    if resource["datastore_active"]:  # check if the data store is still active
        url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/datastore_search"  # datastore_search endpoint
        p = {"id": resource["id"]}  # resource_id
        data = requests.get(url, params=p).json()  # get the data for the first 100 data samples
        df = pd.DataFrame(data["result"]["records"])  # save the first 100 data samples as a dataframe
        green_dict = data['result']['fields']  # get the features description
        green_df = df  # start from the first page so the result exists even when there are 100 rows or fewer
        for i in range(100, data["result"]["total"], 100):  # looping over the remaining data
            p = {"id": resource["id"], "offset": i}  # get the next 100 data samples
            data = requests.get(url, params=p).json()
            df2 = pd.DataFrame(data["result"]["records"])  # save them in a new dataframe
            green_df = pd.concat([green_df, df2])  # add them to the main dataframe (DataFrame.append was removed in pandas 2.0)
        break

green_df.reset_index(inplace=True, drop=True)  # reset the index
green_df.head(3)

Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_CLASS_ID,AREA_CLASS,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,OBJECTID,geometry
0,1,1,1,,890.0,OTHER_CEMETERY,,4009,ARMADALE FREE METHODIST CEMETERY,ARMADALE FREE METHODIST CEMETERY,1,"{""type"": ""Polygon"", ""coordinates"": [[[-79.2575..."
1,2,2,2,,890.0,OTHER_CEMETERY,,4010,HILLSIDE CEMETERY,HILLSIDE CEMETERY,2,"{""type"": ""Polygon"", ""coordinates"": [[[-79.1896..."
2,3,3,3,,890.0,OTHER_CEMETERY,,4011,HIGHLAND MEMORY GARDENS,HIGHLAND MEMORY GARDENS,3,"{""type"": ""Polygon"", ""coordinates"": [[[-79.3475..."


In [29]:
green_space = green_df[green_df.AREA_CLASS=='Park']    ## Keep only the green spaces that can be used for sports activities.
                                                       ## Other classes, such as cemeteries or zoos, are not useful here
green_space=green_space.drop(columns=['_id']).reset_index(drop=True)   ##  Tidy up the data 

Now I import `shapely`, a library that lets me measure areas and distances and locate points using latitude
and longitude data.

In [30]:
from shapely.geometry import Point, Polygon
coorr = green_space.loc[:,'geometry'].apply(lambda x:json.loads(x)['coordinates'][0]) ## The API returns the geometry as a JSON text string,
## so I parse it and keep the first element of 'coordinates'

def get_area_size(coorr):
    locs = []             ## bounding-box midpoints [lon, lat] of each area
    size = []             ## area measurements (Polygon.area in squared degrees, scaled by 1e6)
    for corri in coorr:                 ## Loop through the coordinate lists to calculate the size and the
                                        ## midpoint of each area.
        ring = corri if len(corri) > 1 else corri[0]     ## some records nest the ring one level deeper
        flat = np.array(ring).ravel()                    ## GeoJSON order is lon, lat, lon, lat, ...
        mid_lon = (flat[::2].min() + flat[::2].max())/2   ## average of min and max longitude
        mid_lat = (flat[1::2].min() + flat[1::2].max())/2  ## average of min and max latitude
        pol = Polygon(ring)                         ## Create polygon object for area size calculation

        size.append(pol.area*1e+6)                ## store the area size
        locs.append([mid_lon, mid_lat])           ## store the midpoint as [lon, lat]
    return locs, size
park_coordinates, park_size = get_area_size(coorr)       

In [31]:
green_space['size'] = park_size               ## add area size to the data provided by the API
green_space['midpoint'] = park_coordinates           ## add the midpoint coordinates as [lon, lat]

In [32]:
green_space.head(3)

Unnamed: 0,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_CLASS_ID,AREA_CLASS,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,OBJECTID,geometry,size,midpoint
0,41931,545,,802.0,Park,545,545,JEFF HEALEY PARK,,4328911,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4965...",6.307529,"[-79.49544686850496, 43.6300613241833]"
1,41321,1493,,802.0,Park,1493,1493,ROYCROFT PARK LANDS,,4328912,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4067...",1.255365,"[-79.40526246712355, 43.6806401198379]"
2,42198,48,,802.0,Park,48,48,ART EGGLETON PARK,,4328913,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4180...",0.634391,"[-79.4183566782689, 43.659316592540094]"
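
`Polygon.area` works on the raw longitude/latitude values, so the `size` column above is in squared degrees (scaled by 1e6), not in a metric unit. Ratios between areas within the same city are barely affected, but if square metres are needed, one option is to project the ring to local metres first. The sketch below is an approximation under stated assumptions: `ring_area_m2` is a hypothetical helper, it uses a simple equirectangular projection with a mean Earth radius, and it is only reasonable for small areas such as parks.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def ring_area_m2(ring):
    """Approximate area (m^2) of a small ring of [lon, lat] points."""
    lon0, lat0 = ring[0][0], ring[0][1]
    mean_lat = math.radians(sum(p[1] for p in ring) / len(ring))
    # Project to local metres relative to the first vertex.
    xs = [math.radians(p[0] - lon0) * math.cos(mean_lat) * EARTH_RADIUS_M for p in ring]
    ys = [math.radians(p[1] - lat0) * EARTH_RADIUS_M for p in ring]
    # Shoelace formula over consecutive vertex pairs.
    n = len(ring)
    twice_area = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n))
    return abs(twice_area) / 2.0
```

It could be applied to the parsed rings, for example `coorr.apply(ring_area_m2)` for records whose `coordinates[0]` is a plain ring.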


In [33]:
# green_space.to_csv(r'C:\Users\yusuf\Desktop\Workspace url\Gym_Toronto\green_spaces.csv') ## save it 

### Get Coordinates

In [34]:
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"  # package_show endpoint
params = {"id": "neighbourhoods"}  # package_id 
package = requests.get(url, params=params).json()
neighborhood_meta_data = package["result"]

for idx, resource in enumerate(package["result"]["resources"]):
    if resource["datastore_active"]:  # check if the data store is still active
        url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/datastore_search"  # datastore_search endpoint
        p = {"id": resource["id"]}  # resource_id 
        data_neighboorhood = requests.get(url, params=p).json()  # get the data for the first 100 data samples 
        df = pd.DataFrame(data_neighboorhood["result"]["records"])  # save the first 100 data samples as a dataframe 
        neighborhood_dict = data_neighboorhood['result']['fields']  # get the features description
        neighboorhood_df = df  # start from the first page so the result exists even when there are 100 rows or fewer
        for i in range(100, data_neighboorhood["result"]["total"], 100):  # looping over the remaining data
            p = {"id": resource["id"], "offset": i}  # get the next 100 data samples 
            data_neighboorhood = requests.get(url, params=p).json()
            df2 = pd.DataFrame(data_neighboorhood["result"]["records"])  # save them in a new dataframe 
            neighboorhood_df = pd.concat([neighboorhood_df, df2])  # add them to the main dataframe (DataFrame.append was removed in pandas 2.0)
        break

neighboorhood_df.reset_index(inplace=True, drop=True)  # reset the index 

neighboorhood_df.head(3) # print the first three rows of the dataframe        

Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,CLASSIFICATION,CLASSIFICATION_CODE,OBJECTID,geometry
0,1,2502366,26022881,,174,174,South Eglinton-Davisville,South Eglinton-Davisville (174),Not an NIA or Emerging Neighbourhood,,17824737,"{""type"": ""Polygon"", ""coordinates"": [[[-79.3863..."
1,2,2502365,26022880,,173,173,North Toronto,North Toronto (173),Not an NIA or Emerging Neighbourhood,,17824753,"{""type"": ""Polygon"", ""coordinates"": [[[-79.3974..."
2,3,2502364,26022879,,172,172,Dovercourt Village,Dovercourt Village (172),Not an NIA or Emerging Neighbourhood,,17824769,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4341..."


In [35]:
neighbrhd =neighboorhood_df
neighbrhd['AREA_DESC']=neighbrhd.AREA_DESC.replace(r'[(\d)]','',regex=True)  ## Remove the digits and parentheses of the "(code)" suffix; a trailing space remains in each name
neighbrhd.set_index('AREA_DESC',drop=True, inplace=True)             ## set the AREA_DESC column as index       
neighbrhd['geometry'] =neighbrhd.geometry.apply(lambda x:json.loads(x)['coordinates'][0])   ## parse the JSON string and keep the first element of 'coordinates'
neighbrhd=neighbrhd.T.reset_index()  
coordinates=neighbrhd.iloc[[10],:]                                   ## row 10 of the transposed frame is the geometry row
cols = sorted(coordinates.drop(columns=['index']).columns)                 ## get the column names and sort them
coordinates = coordinates[cols]                ## pass the sorted column names in the data frame to get desired columns in order
coordinates = coordinates.T                   ## Transpose the dataframe
coordinates.columns = ['coords']               ## name the column 
coordinates.head(3)


Unnamed: 0_level_0,coords
AREA_DESC,Unnamed: 1_level_1
Agincourt North,"[[-79.2577140359756, 43.8099219554704], [-79.2..."
Agincourt South-Malvern West,"[[-79.2837777451522, 43.7978333881815], [-79.2..."
Alderwood,"[[-79.5384742056994, 43.5954328307493], [-79.5..."
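
The character-class pattern used above removes the parentheses and digits but not the space in front of them, so a value such as `North Toronto (173)` becomes `North Toronto ` with a trailing space. That can make later joins on neighborhood names fail silently. The following sketch shows the difference with the standard `re` module; `strip_area_code` is a hypothetical helper, and it assumes every `AREA_DESC` value ends with a code in parentheses, as in the output shown earlier.

```python
import re


def strip_area_code(area_desc):
    """Remove a trailing '(digits)' code together with the whitespace around it."""
    return re.sub(r"\s*\(\d+\)\s*$", "", area_desc)


original = re.sub(r"[(\d)]", "", "North Toronto (173)")  # pattern used in the cell above
cleaned = strip_area_code("North Toronto (173)")
print(repr(original))
print(repr(cleaned))
```

The pandas equivalent would be `neighbrhd.AREA_DESC.str.replace(r"\s*\(\d+\)\s*$", "", regex=True)`.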


Here I do something similar to the step above: I call the function defined earlier to calculate the area and extract the center coordinates of each neighborhood.

In [36]:
## neighboorhood_df.to_csv(r"C:\Users\yusuf\Desktop\Workspace url\Gym_Toronto\neighborhood_coords.csv")  ## save it

In [37]:
cor1 = coordinates.coords
neighborhood_coordinates, neigborhood_size=get_area_size(cor1)

In [38]:
coordinates['area'] = neigborhood_size
coordinates['location'] = neighborhood_coordinates
coordinates.head(3)

Unnamed: 0_level_0,coords,area,location
AREA_DESC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Agincourt North,"[[-79.2577140359756, 43.8099219554704], [-79.2...",812.508771,"[-79.2666983482874, 43.8040378716849]"
Agincourt South-Malvern West,"[[-79.2837777451522, 43.7978333881815], [-79.2...",880.631588,"[-79.26349952820254, 43.789379994082054]"
Alderwood,"[[-79.5384742056994, 43.5954328307493], [-79.5...",555.247713,"[-79.54238602876575, 43.6036851822034]"


### Location of Parks

The green-space data does not include neighborhood information, so I have to derive it from the latitude and longitude. I take the center coordinates of each park and find the neighborhood whose boundary contains them.

In [39]:
def check_location(green_space, coordinates):   ## map each park to the neighborhood that contains it
    
    agge = {}           ## Create an empty dictionary
    for idd,i in enumerate(green_space['midpoint']):          ## loop through the park center coordinates  
        p1 = Point(i[0],i[1])                                ## create a Point object for each coordinate pair.
        for idx,area in zip(coordinates.coords.index,coordinates.coords):    ## loop through the neighborhoods using the 
                                                                              ## geometry data provided by the API
            field = Polygon(area)                            ##  Create Polygon object 
            if p1.within(field):          ## check if the point (park center) is within the neighborhood area
                agge[idd] = idx           ## record the neighborhood the point is within
                break                     ## stop searching once the containing neighborhood is found
    return agge

agge = check_location(green_space,coordinates)
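
To make the containment test concrete, here is a minimal pure-Python sketch of the classic ray-casting check. It is only an illustration of the idea behind `Point.within`, not how shapely implements it, and it ignores holes and points lying exactly on a boundary. `point_in_ring` and `locate` are hypothetical names.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of [lon, lat] vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i][0], ring[i][1]
        x2, y2 = ring[(i + 1) % n][0], ring[(i + 1) % n][1]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def locate(point, areas):
    """Return the name of the first area whose ring contains point, else None."""
    for name, ring in areas.items():
        if point_in_ring(point[0], point[1], ring):
            return name
    return None
```

Each crossing to the right of the point flips the inside/outside state, so an odd number of crossings means the point is inside.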

In [40]:
for k in [i for i in range(len(green_space)) if i not in agge]:     ## Some parks matched no neighborhood, so their indices are missing;
                                        ## I assign 'None' to these indices to fill the gaps
    agge[k] = 'None' 
import collections 
agge=collections.OrderedDict(sorted(agge.items()))    ## Order the dictionary by its keys
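
The fill-and-sort steps above can also be expressed in one pass with `dict.get`, which avoids hard-coding the number of rows. This is a sketch under the assumption that the data frame has a default 0..n-1 index, as `green_space` does after `reset_index`; `fill_missing` is a hypothetical helper name.

```python
def fill_missing(assignments, n_rows, default="None"):
    """Return one label per row position, using default where no match was recorded."""
    return [assignments.get(i, default) for i in range(n_rows)]


labels = fill_missing({0: "Casa Loma", 2: "Alderwood"}, 4)
print(labels)
```

In the notebook this would read `green_space['district'] = fill_missing(agge, len(green_space))`.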


In [41]:
dst=pd.DataFrame(index= agge.keys(),data= agge.values(),columns=['district'])  ## Create a data frame from the dictionary
green_space['district'] = 'None'               ## Create a new column on the green space data frame

In [42]:
green_space['district'] = dst.district         ## Assign the newly created data frame values as values for
                                               ##the green space data frame
green_space.head(3)

Unnamed: 0,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_CLASS_ID,AREA_CLASS,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,OBJECTID,geometry,size,midpoint,district
0,41931,545,,802.0,Park,545,545,JEFF HEALEY PARK,,4328911,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4965...",6.307529,"[-79.49544686850496, 43.6300613241833]",Stonegate-Queensway
1,41321,1493,,802.0,Park,1493,1493,ROYCROFT PARK LANDS,,4328912,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4067...",1.255365,"[-79.40526246712355, 43.6806401198379]",Casa Loma
2,42198,48,,802.0,Park,48,48,ART EGGLETON PARK,,4328913,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4180...",0.634391,"[-79.4183566782689, 43.659316592540094]",Palmerston-Little Italy


## FOURSQUARE API 


I got the data I needed from the Toronto open data API. Now I want data on the gyms located in Toronto, 
so that I can calculate the number and density of gyms per neighborhood. 

In [43]:
import os
foursquare_API = os.environ['FOURSQUARE_API_KEY']   ## API key, read from an environment variable instead of being hard-coded (the variable name is an assumption)

### Get Data for Gyms Around

I define a function that takes a latitude, longitude and radius and returns the list of surrounding gyms.
It is specific to gyms, but it is easy to adapt to other categories.

In [44]:

def extract(radius=5000, Latitude = 51.6252,Longitude = 0.1517):
    
    ## URL generated by the tool in the Foursquare API documentation. 
    ## Converted into an f-string so that I can pass in different values. 
    
    url = f"https://api.foursquare.com/v3/places/search?ll={Latitude}%2C{Longitude}&radius={radius}&categories=18021&limit=50"
    
    headers = {"Accept": "application/json","Authorization": foursquare_API}  ## header (given by the tool)
    response = requests.get(url, headers=headers)  
    data = response.json()             ## Initial data 
    
    rows = []                        ## An empty list to store rows 
    for i in data['results']:          ## Loop through the data 
        f_id = i['fsq_id']                ## collect features
        name = i['name']                           ## collect features
        lat = i['geocodes']['main']['latitude']       ## collect features
        lon = i['geocodes']['main']['longitude']         ## collect features
        if 'formatted_address' in i['location']:           
            formatted_address = i['location']['formatted_address'] ## Some variables are not present in json 
                                                                  ## so I define an if function to avoid errors. 
        else:
            formatted_address = 'null'  

        if 'neighborhood' in i['location']:
            neighborhood = i['location']['neighborhood']
        else:
            neighborhood = 'null'

        url_detail = f"https://api.foursquare.com/v3/places/{f_id}?fields=popularity%2Crating"   ## A second request for each place 
        headers = {"Accept": "application/json","Authorization": foursquare_API} ##  This request asks the place endpoint for the popularity and rating fields
        
        response_detail = requests.get(url_detail, headers=headers).json()
        
        if 'popularity' in response_detail:
            popularity = response_detail['popularity']
        else:
            popularity = 'null'

        if 'rating' in response_detail:
            rating = response_detail['rating']
        else:
            rating = 'null'

        row = [name,neighborhood,formatted_address,lat,lon,rating,popularity]      ## Create a row from collected features
        rows.append(row)                                     ## Store the rows into the list created before
        

    data = pd.DataFrame(columns = ['name','neighborhood','formatted_address','lat','lon','rating','popularity'], data = rows)    
    ## Create a data frame and return it
    return data
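
The repeated `if key in ... else 'null'` checks in `extract` can be shortened with `dict.get`. The sketch below is one possible refactoring, not the code that produced the results in this notebook: `parse_place` is a hypothetical helper, it only uses the field names already read in `extract`, and it returns `None` instead of the string `'null'` so that pandas treats missing values as missing.

```python
def parse_place(place, details=None):
    """Flatten one search result (plus an optional detail response) into a row."""
    details = details or {}
    location = place.get("location", {})
    main = place.get("geocodes", {}).get("main", {})
    return {
        "name": place.get("name"),
        "neighborhood": location.get("neighborhood"),
        "formatted_address": location.get("formatted_address"),
        "lat": main.get("latitude"),
        "lon": main.get("longitude"),
        "rating": details.get("rating"),          # None when the field is absent
        "popularity": details.get("popularity"),  # None when the field is absent
    }
```

A list of such dictionaries can be passed straight to `pd.DataFrame(rows)`, which takes the column names from the keys.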


In [45]:
Districts={}    ## Create an empty dictionary to store neighborhood name and centroid coordinates.
for loc, pair in zip(coordinates.index, coordinates.location):      
    Districts[loc] = pair                        


In [None]:
## Create an empty data frame to store data coming from each API requests.
Toronto_data = pd.DataFrame(columns = ['name','neighborhood','formatted_address','lat','lon','rating','popularity'])
locations = Districts.keys()       ## The neighborhood names
points = Districts.values()        ## Centroid coordinates 
 
for loc,point in zip(locations,points):         ## loop through the lists created above.
    lat = point[1]                         ## Separate the longitude and latitude
    long = point[0]
    part =extract(radius = 2000, Latitude = lat, Longitude=long)    ## Use the defined function above and pull data from API
                                                                   ## for each coordinate pair 
    print('This part is done.',loc)                        ## The process takes time, so this is a sign that everything is going well.

    Toronto_data = pd.concat([Toronto_data,part],ignore_index =True)     ## Store everything in a data frame
        

In [None]:
Toronto_data=Toronto_data.drop(columns = ['neighborhood']).drop_duplicates(ignore_index=True)


In [47]:
Toronto_data.head(3)

Unnamed: 0,name,formatted_address,lat,lon,rating,popularity,District
0,Stott Pilates,"2071 McCowan Rd, Scarborough ON M1S 3Y6",43.795931,-79.261192,,,Agincourt North
1,Tarana Dance Company Ltd,"589 Middlefield Rd, Scarborough ON M1V 4Y6",43.812037,-79.258416,,,Milliken
2,D'ornellas Fitness Factory,"4544 Sheppard Ave E, Scarborough ON M1S 1V2",43.788859,-79.263412,,,Agincourt South-Malvern West


In [48]:
whereabout = {}                    ## An empty dict  
for idx,row in enumerate(Toronto_data.values):      ## Loop through the data
    lat = Toronto_data.lat[idx]                   ## get latitude and longitude for each row
    lon = Toronto_data.lon[idx]
    point1 = Point(lon,lat)                       ## create Point object
    for idb, dist in zip(coordinates.coords.index,coordinates.coords):      ## Loop through the area coordinates  
        zone = Polygon(dist)                                   ## Create a Polygon object
        if point1.within(zone):                 ## Check if the point is within the area
            whereabout[idx] = idb                ## Store the data
            break      


In [49]:
for no in [a for a in range(len(Toronto_data)) if a not in whereabout]:   ## Fill the indices of gyms that matched no neighborhood
    whereabout[no] = 'none'

whereabout=collections.OrderedDict(sorted(whereabout.items()))   ## order the dictionary by keys
Toronto_data['District'] = whereabout.values()    ## Assign the values as district information per each gym row.


In [50]:
Toronto_data.head(3)

Unnamed: 0,name,formatted_address,lat,lon,rating,popularity,District
0,Stott Pilates,"2071 McCowan Rd, Scarborough ON M1S 3Y6",43.795931,-79.261192,,,Agincourt North
1,Tarana Dance Company Ltd,"589 Middlefield Rd, Scarborough ON M1V 4Y6",43.812037,-79.258416,,,Milliken
2,D'ornellas Fitness Factory,"4544 Sheppard Ave E, Scarborough ON M1S 1V2",43.788859,-79.263412,,,Agincourt South-Malvern West


## Process the tables

In [51]:
no_gym=pd.DataFrame(Toronto_data.District.value_counts()).sort_index()  ## number of gyms located in each neighborhood
no_gym.head(3)

Unnamed: 0,District
Agincourt North,1
Agincourt South-Malvern West,8
Alderwood,4
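
The column name of this table depends on the pandas version: to my knowledge, from pandas 2.0 on the counts column produced by `value_counts()` is called `count` rather than `District`, which would make a later rename of `District` a silent no-op. A version-independent sketch is shown below on made-up rows; the explicit name `No_Gym` matches the name used later in the merge step, and dropping the `'none'` placeholder is an optional choice.

```python
import pandas as pd

# Toy data standing in for Toronto_data; the rows are illustrative only.
gyms = pd.DataFrame({"District": ["Alderwood", "Agincourt North", "Alderwood", "none"]})

no_gym = (
    gyms.loc[gyms["District"] != "none", "District"]  # skip gyms with no matched neighborhood
    .value_counts()
    .rename("No_Gym")   # explicit name, independent of the pandas version
    .sort_index()
    .to_frame()
)
print(no_gym)
```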


In [52]:
green_space= green_space.sort_values('district')
green_space = green_space.set_index('district').T


In [53]:
no_parks =pd.DataFrame(green_space.T.index.value_counts()).sort_index()  ## number of parks located in each neighborhood
no_parks.head(3)

Unnamed: 0,district
Agincourt North,14
Agincourt South-Malvern West,18
Alderwood,16


In [54]:
alli = income_dist.copy()                              
alli=alli.drop(columns=['_id','Category','Topic','Data Source']).set_index('Characteristic').T
alli.iloc[:,:15]

Characteristic,"$20,000 to $29,999","$10,000 to $19,999","$50,000 to $59,999","$40,000 to $49,999","$30,000 to $39,999","$100,000 to $149,999","$100,000 and over","$100,000 and over.1","$150,000 and over","$90,000 to $99,999","$80,000 to $89,999","$70,000 to $79,999","$60,000 to $69,999"
City of Toronto,291155,410355,145500,187235,221475,119810,209580,116680,89770,58210,69990,89645,114460
Agincourt North,3520,6325,1265,1895,2465,530,665,245,135,365,435,655,865
Agincourt South-Malvern West,2715,4505,1125,1560,2020,525,685,265,165,315,435,570,825
Alderwood,1360,1505,825,950,1095,620,845,325,225,370,395,530,690
Annex,2605,3615,1655,1935,2150,2190,5255,3660,3055,830,1000,1290,1460
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wychwood,1600,2100,705,910,970,710,1280,775,575,335,345,505,620
Yonge-Eglinton,895,1070,715,780,845,870,2100,1460,1230,360,415,520,595
Yonge-St.Clair,955,1115,810,940,935,1075,2735,1940,1645,425,495,585,720
York University Heights,3515,4850,1440,2080,2700,380,460,140,80,250,405,585,895


In [55]:
income_table =pd.concat([alli.T.iloc[:6],alli.T.iloc[8:13]]).sort_index().reset_index()
high_income=pd.concat([income_table.loc[:1,:],income_table.loc[8:,:]])   ## Get the population who earn more than 70k per year
high_income

Unnamed: 0,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,"$100,000 to $149,999",119810,530,525,620,2190,2035,720,1495,1260,...,1010,935,830,945,475,710,870,1075,380,450
1,"$150,000 and over",89770,135,165,225,3055,1635,375,1265,800,...,515,440,200,495,140,575,1230,1645,80,195
8,"$70,000 to $79,999",89645,655,570,530,1290,1220,535,1090,870,...,605,790,1230,540,400,505,520,585,585,380
9,"$80,000 to $89,999",69990,435,435,395,1000,960,405,925,675,...,515,585,800,465,315,345,415,495,405,320
10,"$90,000 to $99,999",58210,365,315,370,830,820,360,700,565,...,475,565,580,465,265,335,360,425,250,245


In [56]:
def get_aggregate(high_income):             ## define a function to clean and sum the rows 
    high_income = high_income.copy()        ## work on a copy so the caller's (possibly sliced) frame is not modified
    high_sum = {} 
    for i in high_income.columns[1:]:      ## Loop through the columns, skipping 'Characteristic' 
        high_income[i] = high_income[i].replace(r',','',regex=True)        ## Delete commas
        high_income[i] = high_income[i].astype('float')        ## Convert the data from string to float
        high_sum[i] = high_income[i].sum()                  ## Sum the rows
    high_sum = pd.Series(data = high_sum.values(),index=high_sum.keys())         ## Store the sums as a Series 
    return high_sum 
high_inc = get_aggregate(high_income)
high_inc


City of Toronto                 427425.0
Agincourt North                   2120.0
Agincourt South-Malvern West      2010.0
Alderwood                         2140.0
Annex                             8365.0
                                  ...   
Wychwood                          2470.0
Yonge-Eglinton                    3395.0
Yonge-St.Clair                    4225.0
York University Heights           1700.0
Yorkdale-Glen Park                1590.0
Length: 141, dtype: float64
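
The same aggregation can be written without a column loop. The sketch below is a vectorized alternative, not the function used above; `aggregate_columns` is a hypothetical name, the two columns are copied from the `high_income` table shown earlier, and the thousands separators in the `City of Toronto` strings are added here only to demonstrate the comma removal.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Characteristic": ["$100,000 to $149,999", "$150,000 and over",
                           "$70,000 to $79,999", "$80,000 to $89,999",
                           "$90,000 to $99,999"],
        "City of Toronto": ["119,810", "89,770", "89,645", "69,990", "58,210"],
        "Agincourt North": ["530", "135", "655", "435", "365"],
    }
)


def aggregate_columns(table):
    """Sum every neighborhood column without modifying the input table."""
    values = table.set_index("Characteristic")
    values = values.replace(",", "", regex=True).astype(float)  # strip commas, then convert
    return values.sum()


totals = aggregate_columns(sample)
print(totals)
```

The sums for these two columns agree with the first two values of `high_inc` printed above (427425.0 and 2120.0).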

In [57]:
total_sum = get_aggregate(income_table)     ## Use the function above for the all community

## Calculate the percentage of high income people
high_inc_perc=pd.DataFrame((high_inc/total_sum)*100,columns=['high_inc_percent']) 

In [58]:
ccols=['_id','Category','Topic','Data Source']     ## define the columns to be removed

age_dist1=age_dist.drop(columns=ccols)        ## drop columns
age_dist2 = age_dist1.loc[10:11,:]           ## slice the data to get the age between 15 and 55


age_all=get_aggregate(age_dist1)               ## sum numbers for all age interval 
customer = get_aggregate(age_dist2)       ## only the age intervals that we consider as customers 

age_dist3 = pd.concat([age_distf,age_distm]).drop(columns=ccols)  ## number of people who are between 45 and 55
discouraged = get_aggregate(age_dist3)       ## sum of the people who are between 45 and 55
discouraged

## Get the number of people between 15 and 45 and calculate the share of this group in the whole population 

customer_percent =pd.DataFrame(((customer-discouraged) / age_all)*100,columns=['customer_percentage'])
customer_percent



                              customer_percentage
City of Toronto                         42.054100
Agincourt North                         35.307820
Agincourt South-Malvern West            40.020555
Alderwood                               36.808081
Annex                                   48.226726
...                                           ...
Wychwood                                38.095238
Yonge-Eglinton                          45.469729
Yonge-St.Clair                          40.445982
York University Heights                 47.444603
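The customer percentage is the 15 to 55 count minus the 45 to 55 count, divided by the count for all ages. A small worked sketch with hypothetical head counts (not real census figures) shows the arithmetic:

```python
import pandas as pd

# Hypothetical head counts for two neighborhoods.
age_all = pd.Series({"Annex": 30000.0, "Alderwood": 12000.0})    # all ages
customer = pd.Series({"Annex": 18000.0, "Alderwood": 6000.0})    # ages 15 to 55
discouraged = pd.Series({"Annex": 3500.0, "Alderwood": 1600.0})  # ages 45 to 55

# Removing the 45-55 band from the 15-55 band leaves the 15-45 band.
customer_percent = ((customer - discouraged) / age_all * 100).to_frame("customer_percentage")
print(customer_percent.round(2))
```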


In [59]:
avg_incme=avg_income_tax.drop(columns=ccols).set_index('Characteristic').T
avg_incme = avg_incme.rename(columns={'    Net federal tax: Average amount ($)':'avg_tax'})
avg_incme=avg_incme[1:]

In [60]:
avg_incme=avg_incme.avg_tax.replace(r',','',regex=True)       ## average federal tax per neighborhood, thousands separators removed
avg_incme

Agincourt North                  4795
Agincourt South-Malvern West     5102
Alderwood                        7895
Annex                           30496
Banbury-Don Mills               14370
                                ...  
Wychwood                        11560
Yonge-Eglinton                  20794
Yonge-St.Clair                  27170
York University Heights          4420
Yorkdale-Glen Park               6119
Name: avg_tax, Length: 140, dtype: object
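The output above still has `dtype: object` because removing the commas leaves the values as strings. One possible way to get numbers straight away, sketched on a hypothetical two-row Series (the name `raw` and its values are illustrative), is `pd.to_numeric`:

```python
import pandas as pd

# Hypothetical string values with thousands separators.
raw = pd.Series({"Annex": "30,496", "Wychwood": "11,560"}, name="avg_tax")

# Strip the commas as plain text, then parse the strings into numbers.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False))
print(cleaned.dtype)
```

In this notebook the conversion to float happens later, when all object columns are cast at once, so this is an alternative rather than a required step.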

In [61]:
green_space=green_space.T

In [62]:
## Create a pivot table that shows the total size of park areas per neighborhood
piv=pd.pivot_table(green_space, columns = 'district', values = 'size',aggfunc='sum'  )
piv=piv.T
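As a rough illustration of what the pivot table computes, the sketch below runs the same call on a toy frame and compares it with an equivalent `groupby`. The frame `parks` and its values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical park records: one row per park.
parks = pd.DataFrame({
    "district": ["Annex", "Annex", "Alderwood"],
    "size": [1.5, 2.0, 4.0],
})

# Same call as in the notebook: one column per district, then transposed.
piv = pd.pivot_table(parks, columns="district", values="size", aggfunc="sum").T

# An equivalent formulation that returns a Series indexed by district.
by_group = parks.groupby("district")["size"].sum()
print(piv)
```

Both versions give one total per district; the pivot table returns a DataFrame with a `size` column, which is what the merge in the next step expects.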

### Merge Data Frames

In [63]:
## add the park total area to the coordinates data
coordinates = coordinates.merge(piv, how='inner',left_index=True,right_index=True)  

## calculate the percentage of green areas 
coordinates['green'] = (coordinates['size']/coordinates.area)*100  

## add the number of gyms per neighborhood
coordinates = coordinates.merge(no_gym,how='left',left_index=True,right_index=True)

coordinates = coordinates.rename(columns={"District": "No_Gym"})          

## calculate the size of land per gym
coordinates['area_per_gym'] =coordinates.area / coordinates.No_Gym    

## rename the columns in one step
coordinates.rename(columns={'size':'park_size', 'green':'percent_park', 'No_Gym':'gym_number'}, inplace=True)

## delete the whitespaces around the index names
coordinates.index = [x.strip() for x in coordinates.index] 

## add avg income data
coordinates = coordinates.merge(avg_incme,how='left',left_index=True,right_index=True) 

## add the percentage of people who are between 15 and 45
coordinates = coordinates.merge(customer_percent,how='left',left_index=True,right_index=True)

## add the percentage of people with high income
coordinates=coordinates.merge(high_inc_perc,how='left',left_index=True,right_index=True) 
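Left merges on the index silently leave `NaN` where a neighborhood name does not match, which is why the whitespace is stripped from the index above. A minimal sketch of how unmatched rows could be detected with the `indicator` option of `merge`; the frames `left` and `right` and their values are hypothetical:

```python
import pandas as pd

# Hypothetical tables; note the trailing space in "Wychwood ".
left = pd.DataFrame({"area": [7.4, 2.8]}, index=["Annex", "Wychwood "])
right = pd.DataFrame({"avg_tax": [30496.0, 11560.0]}, index=["Annex", "Wychwood"])

# indicator=True adds a "_merge" column that marks rows found only on the left.
merged = left.merge(right, how="left", left_index=True, right_index=True, indicator=True)
unmatched = merged[merged["_merge"] == "left_only"].index.tolist()
print(unmatched)

# After stripping whitespace from the index, every row finds its partner.
left.index = left.index.str.strip()
fixed = left.merge(right, how="left", left_index=True, right_index=True, indicator=True)
still_unmatched = fixed[fixed["_merge"] == "left_only"].index.tolist()
```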

In [64]:
population1=population.drop(columns=ccols).set_index('Characteristic').T[1:]

In [65]:
population1=population1.rename(columns={'Population, 2016':'population'}).replace(r',','',regex=True) ## delete commas

In [66]:
coordinates=coordinates.merge(population1,how='left',left_index=True,right_index=True)  ## add the population numbers

In [67]:
df=coordinates.iloc[:,3:].copy()  ## select columns and copy them into a new data frame (avoids SettingWithCopyWarning)
df['Area'] = coordinates.area     ## add another column to the new data frame

In [68]:
obj=df.select_dtypes(include='object').columns      ## make all object columns float 
for i in obj:
    df[i] = df[i].astype('float')

In [411]:
# df.to_csv('../final_data.csv')   ## Save the collected data

From here on, I continue in a new Jupyter notebook to keep things clear. This part covered only the data collection and rough processing. In the next step, I will prepare the data for the machine learning model and make predictions.

## References

- financesonline.com - 
I used the insights from this website to decide which data to collect. The list of required data is based solely on the information gathered there. Its statistics gave me a good understanding of the characteristics of gym subscribers.


- https://medium.com/mlearning-ai/end-to-end-data-science-project-beginner-version-part-1-96e59bdfbc5b -
My work originated from this Medium post. I was fascinated by the idea and wanted to do a similar project. Thanks to the author of this post, I learned a lot through this project.


- open.toronto.ca - 
This database offers APIs and static data in formats such as xls, xlsx and csv. Most of the data is pulled from this database. It also includes guides for developers.
I used this API to collect demographic information such as income level and age, as well as geographical information such as neighborhood coordinate data and infrastructure locations.



- developer.foursquare.com - 
Foursquare has an amazing API infrastructure. The guides are clear and concise, and the interface is user friendly and easy to understand. The data quality of the free account is sufficient for personal use.
This API is used to collect the locations of gyms in specific areas, along with other attributes such as rating and popularity.
