# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Elias Luoma
#### Applied Data Science Capstone by IBM/Coursera


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

## Introduction: Business Problem of knowledge of neighborhood for real estate agent<a name="introduction"></a>

Real estate agents compete from same customer and customers want to find best home and also best neighborhood for their family. Deep knowledge of area and neighborhood brings advantage in competition. From real estate point of view there is battle of neighborhoods. 

Nobody can remenber or know all venues in Helsinki Finland area and so cannot promote all venues and categories which can found through Foursquare API. We would like to provide targeted information near property to sold for real estate agents. We would also cluster and categorise living areas to quicly tell in which category of property belongs and what are unique charasteristics of that area for example good parks and cafeterias. This makes difference when agent has sales meeting with owner. This information can be even crucial when families deciding where they are going to move and buy new home. 



## Import all needed libraries

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#Lib for html handling
from lxml import html

print('Libraries imported.')

Libraries imported.


## Data <a name="data"></a>

Based on definition of our problem, factors that will help real estate agent are:
* All venues of neighborhood
* Top venue categoeries in neighborhood
* Overall style for example cafes and parks


Following data sources will be needed to generate the required information:
* Wikipedia page of Helsinki neighborhood including 
* All venues or neigborhood area through Foursqueare API
* Geolocator to get coordinates of neighborhoods

We will use the **explore** function to get the most common venue categories in each neighborhood of Helsinki. We will also cluster neighborhoods to give similarity information to end customer. 

In [3]:
#Fetch wikipedia page as html
url = 'https://fi.wikipedia.org/wiki/Helsingin_alueellinen_jako'
pageContent=requests.get(url)
html = html.fromstring(pageContent.content)

In [4]:
# define the dataframe columns
column_names = ['Neighborhood', 'lat', 'lng'] 
df = pd.DataFrame(columns=column_names)
#Get table rows of Helsinki Neigborhoods with xpath
rows = html.xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr')

for row in rows:
    #Append dataframe with variable values
    children = row[0].getchildren()
    
    for child in children:
        if child.tag == 'ul':
            for li in child[0]:
                #print('Suppea')
                df = df.append({'Neighborhood': li.text}, ignore_index=True)
        else:
            df = df.append({'Neighborhood': child.text}, ignore_index=True)
           
df.head()

Unnamed: 0,Neighborhood,lat,lng
0,Kruununhaka,,
1,Kluuvi,,
2,Kaartinkaupunki,,
3,Kamppi,,
4,Punavuori,,


In [5]:
#Create lists for lat and long
lat = []
lng = []

#Loop through all neigborhoods in Helsinki
for adr in df['Neighborhood']:
    
    #Use geolocator to get coordinates of neigborhoods
    loc = geolocator.geocode(adr)
    #Append coordinates to lists
    lat.append(loc.latitude)
    lng.append(loc.longitude)

#Map coordinate lists to data frame 
df['lat'] = lat
df['lng'] = lng

# Dataframe contains all Helsinki neighborhoods or city areas which are in small capital city same thing.
df

NameError: name 'geolocator' is not defined

### Foursquare API
Open sandbox for Coursera project

In [6]:
CLIENT_ID = 'XDSVHHZ0OH2OITHZSB5MJSEHSUVR5J3CYY5EOHOQTV550IQ1' # your Foursquare ID
CLIENT_SECRET = '5MNXPTBQ4GVTH0KMT0H0UKX00KYCEDLTI0XLT5BKJGME4AO3' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: XDSVHHZ0OH2OITHZSB5MJSEHSUVR5J3CYY5EOHOQTV550IQ1
CLIENT_SECRET:5MNXPTBQ4GVTH0KMT0H0UKX00KYCEDLTI0XLT5BKJGME4AO3


#### Use geopy library to get the latitude and longitude values of Helsinki

In [7]:
address = 'Helsinki, FI'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Helsinki City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Helsinki City are 60.1674086, 24.9425683.


In [8]:
# create map of Helsinki using latitude and longitude values
Helsinki_map = folium.Map(location=[latitude, longitude], zoom_start=11)

for adr in df['Neighborhood']:

    loc = geolocator.geocode(adr)
    lat = loc.latitude
    lng = loc.longitude

    label = '{}'.format(adr)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Helsinki_map)  

Helsinki_map

In [9]:
neighborhood_latitude = df.loc[0, 'lat'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'lng'] # neighborhood longitude value

neighborhood_name = df.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Kruununhaka are nan, nan.


In [10]:
#Get Kruunuhaka or City Center venues

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
 # create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=XDSVHHZ0OH2OITHZSB5MJSEHSUVR5J3CYY5EOHOQTV550IQ1&client_secret=5MNXPTBQ4GVTH0KMT0H0UKX00KYCEDLTI0XLT5BKJGME4AO3&v=20180605&ll=nan,nan&radius=500&limit=100'

In [11]:
results = requests.get(url).json()

## Functions for data preparation and analysis

In [12]:
#Function to Get nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [13]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [14]:
# Function for most common venue
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [15]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

KeyError: 'groups'

In [None]:
all_helsinki_venues = getNearbyVenues(names=df['Neighborhood'],latitudes=df['lat'], longitudes=df['lng'])

In [None]:
all_helsinki_venues

In [None]:
# one hot encoding
helsinki_onehot = pd.get_dummies(all_helsinki_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
helsinki_onehot['Neighborhood'] = df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [helsinki_onehot.columns[-1]] + list(helsinki_onehot.columns[:-1])
downtown_onehot = helsinki_onehot[fixed_columns]

helsinki_onehot.head()

In [None]:
helsinki_grouped = helsinki_onehot.groupby('Neighborhood').mean().reset_index()
helsinki_grouped

In [None]:
#Get 10 top venues of postal code area
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = helsinki_grouped['Neighborhood']

for ind in np.arange(helsinki_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(helsinki_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted