<h1>Eshan Ratnayake Data Science Capstone Project </h1>

Dataset obtained from Kaggle: https://www.kaggle.com/mnabaee/ontarioproperties

Chinese gentleman immigrating from China to Canada for work. Will be working at the TD Tower in Downtown Toronto. He wants to buy a property to live in.

He needs to find a place that is close to work, inexpensive, and have some elements of China that make is easy to fit in. We will be analyzing the neighbourhoods in the dataset and their prices, proximity to work, and their accessbility to chinese venues.

<h2>Cleaning Data Set</h2>

In [9]:
# The code was removed by Watson Studio for sharing.

In [10]:
# ensure that all prices are not strings, and duplciates are removed
df_ontario_housing['Price'] = df_ontario_housing.Price.astype(int)
df_ontario_housing.drop_duplicates
df_ontario_housing.shape

(24868, 5)

In [11]:
# filtering out for Addresses with 'Toronto' in name
toronto_housing_data = df_ontario_housing[df_ontario_housing['Address'].str.contains('|'.join(['Toronto']))]

In [12]:
toronto_average_prices = toronto_housing_data.groupby('AreaName', as_index=False)['Price', 'Latitude', 'Longitude'].mean()
toronto_average_prices.columns = ['AreaName','Avg Price', 'Latitude', 'Longitude']
toronto_average_prices.head()

Unnamed: 0,AreaName,Avg Price,Latitude,Longitude
0,Agincourt,425287.29,43.79,-79.28
1,Agincourt North,2200000.0,43.8,-79.24
2,Alderwood,993179.93,43.6,-79.55
3,Amesbury,79450.0,43.7,-79.48
4,Armdale,66739.09,43.83,-79.25


<h2>Find Number of Chinese Venues per Area</h2>

In [13]:
CLIENT_ID = 'YSK4NSSJEHR20EHPGGR22MO52KXJMOSS3PQSPB1ASHISUTRA' 
CLIENT_SECRET = 'YKSPEBXNZUPLG0WEIVQPNTCJLJBWKR4HNSAHM5BR22VTO1ND' 
VERSION = '20180605' 
LIMIT = 10 # limit of number of venues returned by Foursquare API

In [14]:
def getNearbyVenues(names, latitudes, longitudes, price, radius=1500):
    
    venues_list=[]
    for name, lat, lng, prc in zip(names, latitudes, longitudes, price):
            
        # =API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            prc,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            v['venue']['location']['distance'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['AreaName', 
                  'Area Latitude', 
                  'Area Longitude', 
                  'Avg Price',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Distance (m)', 
                  'Venue Category']
    
    return(nearby_venues)

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [16]:
toronto_venues = getNearbyVenues(names=toronto_average_prices['AreaName'],
                                   latitudes=toronto_average_prices['Latitude'],
                                   longitudes=toronto_average_prices['Longitude'],
                                   price=toronto_average_prices['Avg Price']
                                  )

In [17]:
toronto_venues.head()

Unnamed: 0,AreaName,Area Latitude,Area Longitude,Avg Price,Venue,Venue Latitude,Venue Longitude,Distance (m),Venue Category
0,Agincourt,43.79,-79.28,425287.29,One2 Snacks,43.79,-79.28,177,Asian Restaurant
1,Agincourt,43.79,-79.28,425287.29,Maple Yip Seafood 陸羽海鮮酒家,43.78,-79.28,376,Chinese Restaurant
2,Agincourt,43.79,-79.28,425287.29,Tim Hortons,43.79,-79.28,285,Coffee Shop
3,Agincourt,43.79,-79.28,425287.29,In Cheon House Korean & Japanese Restaurant 인천관,43.79,-79.28,278,Korean Restaurant
4,Agincourt,43.79,-79.28,425287.29,Yummy Cantonese Restaurant 老西関腸粉,43.79,-79.27,703,Cantonese Restaurant


In [18]:
# boolean variable; true if keyword that indicates Chinese background is found 
chineseVenue = toronto_venues['Venue Category'].str.contains("China") | toronto_venues['Venue Category'].str.contains("Chinese") | toronto_venues['Venue'].str.contains("Cantonese") | toronto_venues['Venue Category'].str.contains("Cantonese") | toronto_venues['Venue'].str.contains("Mandarin") | toronto_venues['Venue Category'].str.contains("Mandarin")  
# filtering for Chinese venues
toronto_venues = toronto_venues[chineseVenue]

In [19]:
toronto_venues = toronto_venues.groupby(['AreaName', 'Area Latitude', 'Area Longitude', 'Avg Price'],as_index=False)['Venue'].count()
toronto_venues.columns = ['AreaName', 'Area Latitude', 'Area Longitude', 'Avg Price', '# of Chinese Venues']
toronto_venues

Unnamed: 0,AreaName,Area Latitude,Area Longitude,Avg Price,# of Chinese Venues
0,Agincourt,43.79,-79.28,425287.29,4
1,Agincourt North,43.8,-79.24,2200000.0,1
2,Amesbury,43.7,-79.48,79450.0,1
3,Bayview Woods - Steeles,43.79,-79.39,1032187.4,1
4,Beechborough - Greenbrook,43.7,-79.48,699000.0,1
5,Ben Jungle,43.76,-79.24,795000.0,1
6,Bendale,43.77,-79.26,482289.71,1
7,Bridlewood,43.8,-79.32,344406.89,3
8,Brookhaven - Amesbury,43.7,-79.49,806418.18,1
9,Don Mills,43.74,-79.35,715269.12,1


<h2>Compute Distances to Work</h2>

In [20]:
# using the Haversine formula

def latLonDistance (area_lat, area_lon):
    
    distance_list = []
    earth_radius = 6731       #in km 
    
    for lat, lon in zip(area_lat, area_lon):
        
        # conversion to radians
        TD_lat_RAD = radians(43.6476)
        TD_lon_RAD = radians(-79.3814)
        Area_lat_RAD = radians(lat)
        Area_lon_RAD = radians(lon)

        dlon = Area_lon_RAD - TD_lon_RAD
        dlat = Area_lat_RAD - TD_lat_RAD

        a = sin(dlat / 2)**2 + cos(TD_lat_RAD) * cos(Area_lat_RAD) * sin(dlon / 2)**2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))

        distance = earth_radius * c
        
        distance_list.append(distance)
        
    return distance_list

In [21]:
distance_to_work = latLonDistance(area_lat=toronto_venues['Area Latitude'] ,area_lon=toronto_venues['Area Longitude'])
distance_to_work_format = ['%.2f' % elem for elem in distance_to_work]
distance_series = pd.Series(distance_to_work_format)
toronto_venues['Distance to Work (km)'] = distance_series.values
toronto_venues

Unnamed: 0,AreaName,Area Latitude,Area Longitude,Avg Price,# of Chinese Venues,Distance to Work (km)
0,Agincourt,43.79,-79.28,425287.29,4,18.68
1,Agincourt North,43.8,-79.24,2200000.0,1,21.75
2,Amesbury,43.7,-79.48,79450.0,1,10.9
3,Bayview Woods - Steeles,43.79,-79.39,1032187.4,1,17.19
4,Beechborough - Greenbrook,43.7,-79.48,699000.0,1,9.97
5,Ben Jungle,43.76,-79.24,795000.0,1,18.1
6,Bendale,43.77,-79.26,482289.71,1,17.48
7,Bridlewood,43.8,-79.32,344406.89,3,18.37
8,Brookhaven - Amesbury,43.7,-79.49,806418.18,1,11.12
9,Don Mills,43.74,-79.35,715269.12,1,11.4


<h2>Compute Euclidean Distances</h2>

In [22]:
def eudDistance(price, dist_work):
    eudDistanceList = []
    
    for p, d in zip(price, dist_work):
        
        # normalizing the prices
        
        normalied_p = p/1000
        euDist = sqrt((normalied_p*normalied_p)+(d*d))
        eudDistanceList.append(euDist)
    
    return eudDistanceList

In [23]:
toronto_venues['Avg Price'] = toronto_venues['Avg Price'].astype(float)
toronto_venues['# of Chinese Venues'] = toronto_venues['# of Chinese Venues'].astype(float)
toronto_venues['Distance to Work (km)'] = toronto_venues['Distance to Work (km)'].astype(float)

euclidean_distance_list = eudDistance(price=toronto_venues['Avg Price'],
                                     dist_work=toronto_venues['Distance to Work (km)'])

In [24]:
euclidean_distance_list
euclidean_distance_list_format = ['%.2f' % elem for elem in euclidean_distance_list]
euclidean_series = pd.Series(euclidean_distance_list_format)

In [25]:
toronto_venues['Euclidean Distance'] = euclidean_series.values
toronto_venues['Euclidean Distance'] = toronto_venues['Euclidean Distance'].astype(float)
toronto_venues.sort_values('Euclidean Distance', ascending=True, inplace=True)
toronto_venues = toronto_venues.reset_index(drop=True)
toronto_venues

Unnamed: 0,AreaName,Area Latitude,Area Longitude,Avg Price,# of Chinese Venues,Distance to Work (km),Euclidean Distance
0,Amesbury,43.7,-79.48,79450.0,1.0,10.9,80.19
1,Bridlewood,43.8,-79.32,344406.89,3.0,18.37,344.9
2,Dorset Park,43.76,-79.28,369159.67,1.0,15.62,369.49
3,The Westway,43.69,-79.55,389596.0,2.0,15.01,389.89
4,Kingsview Village,43.7,-79.55,390883.58,1.0,15.66,391.2
5,Malvern,43.8,-79.23,406304.99,1.0,22.58,406.93
6,Port Royal,43.81,-79.29,419666.67,1.0,21.11,420.2
7,Agincourt,43.79,-79.28,425287.29,4.0,18.68,425.7
8,Tam O'Shanter,43.78,-79.3,453432.5,4.0,17.61,453.77
9,Woburn,43.77,-79.24,469192.72,1.0,19.15,469.58


<h2>Scaling Score According to # of Chinese Venues</h2>

In [26]:
toronto_venues['Final Score'] = toronto_venues['Euclidean Distance']/toronto_venues['# of Chinese Venues']

In [27]:
toronto_venues.sort_values('Final Score', ascending=True, inplace=True)

In [28]:
toronto_venues = toronto_venues.reset_index(drop=True)
toronto_venues.head()

Unnamed: 0,AreaName,Area Latitude,Area Longitude,Avg Price,# of Chinese Venues,Distance to Work (km),Euclidean Distance,Final Score
0,Amesbury,43.7,-79.48,79450.0,1.0,10.9,80.19,80.19
1,Agincourt,43.79,-79.28,425287.29,4.0,18.68,425.7,106.42
2,Tam O'Shanter,43.78,-79.3,453432.5,4.0,17.61,453.77,113.44
3,Bridlewood,43.8,-79.32,344406.89,3.0,18.37,344.9,114.97
4,Steeles,43.81,-79.32,481273.5,3.0,20.19,481.7,160.57
