# COURSERA - CAPSTONE PROJECT

#### Daniel E. _Rivero Mendoza_

## _Battle of the neighborhoods_ - Week 1

In [1]:
# A description of the problem and a discussion of the background. (15 marks)

## 1. INTRODUCTION

### 1.1. Scenario and background

Moving can be a pretty overwhelming endeavour. Whether you own, rent, or find yourself leaving the nest for the first time, the items in that _"to-do before I move"_ list are likely numerous, tedious, time-consuming, and, once completed, tremendously exhausting. No matter how much you plan and try to organize yourself, this seems to be the inherent nature of moves: they are hardly ever the proverbial _"walk in the park"_. Then, if you consider the very real possibility of having to move to a neighborhood, city, state, country, or continent you are not too familiar with, the tiring practical aspects of the task blend with the trepidation evoked by the _unknown(s)_ and, _voilà_, everything is set for some very daunting days or weeks ahead. In comes the anxiety! Thus, anybody facing such a task will surely hope to find a tool, resource, or service that can help answer questions like: _how can I make my life easier?_ or _am I moving to the right place?_

This Capstone Project - _Battle of the neighborhoods_ - aims to leverage the tools and techniques learned over the past few months in the IBM Data Science Professional Certificate to answer some aspects of the questions above. I should state that the work done will not help pack your plates and cookware, or call the electricity company to arrange the final meter reading and payment, so it will not make anybody's _life easier_ in the most practical sense. Instead, it will tackle the moving to the _right place_ question by providing an approach and methodology to find neighborhoods with the desired characteristics and attributes, assuming that the more familiar and comfortable the destination feels, the better off anybody will be. The case study is a hypothetical move from Boulder (CO, United States) to Sydney (NSW, Australia), two cities halfway across the world from each other, with very different landscapes, but perhaps more similar than anybody would think at first glance.

### 1.2. Problem to be solved

The problem to solve is finding a neighborhood in Sydney (NSW, Australia) with similar characteristics to one in downtown Boulder (CO, United States), plus some extra attributes. Thus, to set the basis for comparison, the following requirements apply to the neighborhood at the destination:

Must haves:

- Surroundings with amenities and venues similar to the ones found in Boulder (CO), USA
- Located within 2 km from a train or light rail station in greater Sydney (NSW), Australia
 
Desirable, pending availability of time and reliable data:

- Rent price in the 500 AUD/week range for a unit with at least 2 bedrooms, 1 bathroom, 1 parking spot, and 75 sqm. 

### 1.3. Interested audience

Regarding the aspect(s) of the work that deal with _easing_ a generic moving endeavour:
- I believe anybody considering moving to a city where information about public transport, services, amenities, etc., is available can find this work interesting and useful, provided the approach and methodologies used here are applicable in North America, Asia, Africa, etc.

Regarding uploading the source code to GitHub:
- Anybody aiming to learn about leveraging Foursquare, mapping techniques, plotting data, running scikit-learn tools, manipulating data in Python, etc., can read running code from my public profile and learn syntax/operations in a "task-oriented" way.

Regarding Capstone Project as an enduring assignment in future courses:
- Anybody going through the same IBM Course may find this work inspiring and may build upon it in future Capstone assignments. 

In [2]:
# A description of the data and how it will be used to solve the problem. (15 marks)

## 2. DATA SECTION

### 2.1. Data required and how it will be used to solve the problem

To establish compliance of the neighborhood(s) at the destination with the list of requirements in Section 1.2, the following data is needed.

- Coordinates for downtown Boulder, CO. 
- List of top venues in downtown Boulder, CO.
- List of neighborhoods in Sydney, NSW.
- Coordinates for the neighborhoods in Sydney, NSW.
- GEOJSON data for neighborhoods in Sydney, NSW. 
- List of Train Stations in Sydney, NSW.
- Coordinates of Train Stations in Sydney, NSW.

Time and reliability of data permitting, the following dataset is needed to establish compliance of neighborhoods at the destination with the desired attributes for rental units.

- Average rental price and characteristics of units per neighborhood in Sydney, NSW.

The data will be used as follows:

- Coordinates for downtown Boulder will be used to retrieve top venues within a 5 km radius _via_ Foursquare.
- List of neighborhoods in Sydney will be used to retrieve the geographic coordinates _via_ geopy, which in turn will be used to get the top venues within a 5 km radius.
- List of Train Stations and their coordinates in Sydney, NSW, will be used to filter out neighborhood(s) with centres further than 2 km from any station.
- Average price of rental units with attributes per Section 1.2. will be used to further filter the list of complying neighborhood(s) in Sydney.
- GEOJSON data for neighborhoods in Sydney will be used to produce maps depicting various metrics.
- List of venues in Boulder and Sydney will be used to establish similarities between their corresponding neighborhoods.
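
The 2 km station filter described in the list above can be sketched as follows. This is a minimal illustration, not the project's final code: it uses a plain haversine formula instead of `geopy.distance` to stay dependency-free, the helper names `haversine_km` and `stations_within` are hypothetical, and the station coordinates are made up for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stations_within(suburb, stations, max_km=2.0):
    """Return names of stations at most max_km away from the suburb centre."""
    _, s_lat, s_lon = suburb
    return [name for name, lat, lon in stations
            if haversine_km(s_lat, s_lon, lat, lon) <= max_km]

# Suburb centres in the (name, latitude, longitude) layout used for Sydney suburbs
suburbs = [
    ('ABBOTSFORD', -33.8505529, 151.129759),
    ('AGNES BANKS', -33.6145082, 150.7114482),
]
# ASSUMPTION: hypothetical station list for illustration only, not real station data
stations = [
    ('STATION A', -33.8600, 151.1300),
    ('STATION B', -33.9500, 151.2000),
]

# Keep only suburbs with at least one station within 2 km
near_rail = {s[0]: stations_within(s, stations) for s in suburbs}
near_rail = {name: hits for name, hits in near_rail.items() if hits}
print(near_rail)
```

Because each suburb maps to a list of nearby stations, the same structure also answers whether a suburb has more than one station within range (`len(hits) > 1`).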

Cleaning and processing the data will make it possible to answer the following and many other questions:

- Which are the neighborhoods located within 2 km from a train station in Sydney? 
- Are there any neighborhoods in Sydney with more than 1 train station closer than 2 km?
- Are there neighborhoods in Sydney offering the same type of venues and amenities as those found in downtown Boulder?
- What is the profile of the different neighborhoods found in Sydney, based on the type of venues and amenities they offer?
- ...
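
One simple way to compare venue profiles between two places is sketched below. It is only an assumption about how similarity could be measured: it turns each list of venue categories into a relative-frequency profile and compares profiles with cosine similarity, which is a simpler stand-in for the k-means clustering that the imported scikit-learn tools suggest. The helper names and the two suburb lists are hypothetical.

```python
import math
from collections import Counter

def category_profile(categories):
    """Relative frequency of each venue category in a list of category names."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Category names of the kind Foursquare returns for venues
boulder = ['Pizza Place', 'Breakfast Spot', 'Coffee Shop', 'Pub']
# ASSUMPTION: hypothetical suburb data, made up for illustration
candidates = {
    'SUBURB X': ['Coffee Shop', 'Pub', 'Pizza Place', 'Breakfast Spot'],
    'SUBURB Y': ['Golf Course', 'Gym'],
}

reference = category_profile(boulder)
scores = {name: cosine_similarity(reference, category_profile(cats))
          for name, cats in candidates.items()}

# Rank candidate suburbs from most to least similar to the reference profile
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A score near 1 means the suburb offers the same mix of venue types as the reference, while 0 means no category in common.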

### 2.2. Raw data download and preliminary processing

To have a frame of reference ahead of the week 2 delivery, some of the relevant data for downtown Boulder, CO, are presented below. This type of query will be repeated once the list of neighborhoods close to train stations in Sydney is available, and both datasets will be used to investigate the similarities between the two locations - see code in Capstone Project Wee 

In [3]:
######### START OF CODE ########

In [4]:
### Download and import all the libraries needed to advance the project.

import numpy as np # Library for data analysis
import time

import pandas as pd # Library for data manipulation
from pandas import DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # Library to handle JSON files

import requests # Library to handle requests

# pd.json_normalize (available since pandas 1.0) transforms JSON into a pandas dataframe; the older pandas.io.json import is not needed

import matplotlib # Library to make plots
import matplotlib.cm as cm
import matplotlib.colors as colors
from matplotlib import pyplot

from sklearn.cluster import KMeans # Import k-means for the clustering stage

# Install geopy if needed
#!pip install geopy
from geopy.geocoders import Nominatim # Convert an address into latitude and longitude values
import geopy.distance

# Install maphandling module if needed
#!pip install folium
import folium # map rendering library

#!pip install beautifulsoup4
#!pip install lxml
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


In [5]:
### Get the coordinates for downtown Boulder, CO.

# Boulder, CO
address_CO = 'Boulder, CO'

geolocator = Nominatim(user_agent='Explorer')
location = geolocator.geocode(address_CO)
latitude_CO = location.latitude
longitude_CO = location.longitude
print('The geographical coordinates of {} are latitude {}, longitude {}.'.format(address_CO, latitude_CO, longitude_CO))

The geographical coordinates of Boulder, CO are latitude 40.0149856, longitude -105.2705456.


In [8]:
### API call to Foursquare to request information about venues in Boulder, CO. 

# Define Foursquare Credentials and Version
CLIENT_ID = 'ID' # your Foursquare ID
CLIENT_SECRET = 'SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

# Details of the Foursquare query
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 5000 # search radius in metres

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude_CO, 
    longitude_CO, 
    radius, 
    LIMIT)
print('The URL is:')
print(url) # display URL

Your credentials:
CLIENT_ID: ID
CLIENT_SECRET: SECRET
The URL is:
https://api.foursquare.com/v2/venues/explore?&client_id=ID&client_secret=SECRET&v=20180605&ll=40.0149856,-105.2705456&radius=5000&limit=100
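
As an alternative to formatting the query string by hand, the same URL can be assembled from a dictionary of parameters. The sketch below uses only the standard library; `build_explore_url` is a hypothetical helper name, and the placeholder credentials are the same `'ID'` and `'SECRET'` used above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble a Foursquare v2 'venues/explore' URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" readable instead of encoding it as %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

example_url = build_explore_url('ID', 'SECRET', '20180605', 40.0149856, -105.2705456, 5000, 100)
print(example_url)
```

Keeping the parameters in a dictionary makes it easier to reuse the query for each Sydney suburb by changing only the coordinates. The `requests` library can also take such a dictionary directly through its `params` argument.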


In [7]:
### Make the call to Foursquare, retrieve the data and store it

# Results
results_CO = requests.get(url).json()

In [9]:
### Define a function that extracts the category of venues from Foursquare data
def get_category_type(row):
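    # The category list sits under 'categories' or, in flattened data, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']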
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [10]:
### Extract the name, type and coordinates of venues in downtown Boulder, CO
venues_CO = results_CO['response']['groups'][0]['items']
    
nearby_venues_CO = pd.json_normalize(venues_CO) # flatten JSON

# filter desired columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues_CO = nearby_venues_CO.loc[:, filtered_columns]

# filter the category for each row
nearby_venues_CO['venue.categories'] = nearby_venues_CO.apply(get_category_type, axis=1)

# clean columns
nearby_venues_CO.columns = [col.split(".")[-1] for col in nearby_venues_CO.columns]

nearby_venues_CO.head(10)

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Pizzeria Locale | Pizza Place | 40.019208 | -105.272611 |
| 1 | Snooze An A.M. Eatery | Breakfast Spot | 40.019127 | -105.274285 |
| 2 | OAK at fourteenth | New American Restaurant | 40.018278 | -105.277302 |
| 3 | Boxcar Coffee Roasters | Coffee Shop | 40.019796 | -105.271443 |
| 4 | Rincon Argentino | Argentinian Restaurant | 40.014883 | -105.262918 |
| 5 | Boulder Farmers' Market | Farmers Market | 40.015536 | -105.277652 |
| 6 | Mountain Sun Pub & Brewery | Pub | 40.018956 | -105.275159 |
| 7 | Boulder Theater | Music Venue | 40.019202 | -105.277391 |
| 8 | Boulder Creek | River | 40.014782 | -105.279779 |
| 9 | Frasca Food and Wine | Italian Restaurant | 40.019314 | -105.272285 |


In [11]:
### Create map of Boulder, CO, using latitude and longitude values
map_CO = folium.Map(location=[latitude_CO, longitude_CO], zoom_start=12)

# Add markers with top venues from Foursquare to the map
for lat, lng, name, cat in zip(nearby_venues_CO['lat'], nearby_venues_CO['lng'],
                               nearby_venues_CO['name'], nearby_venues_CO['categories']):
    label = 'Venue: {} | Type: {} | Coordinates: [{},{}]'.format(name, cat, round(lat,2), round(lng,2))
    label = folium.Popup(label, parse_html=True)
    folium.RegularPolygonMarker(
        [lat, lng],
        number_of_sides=5,
        radius=5,
        popup=label,
        color='red',
        fill_color='red',
        fill_opacity=0.7,
    ).add_to(map_CO)  
    
map_CO

Furthermore, the list and coordinates of neighborhoods in Sydney are downloaded ahead of time to store them as local copies and avoid a time-consuming call to geopy.

In [12]:
######### START OF CODE ########

In [13]:
### List of neighborhoods, scraped from a website online

# Get a list of all the Suburbs in Greater Sydney using beautiful soup
html_SydSubs = requests.get('http://www.walksydneystreets.net/suburbssydneyall.htm')
txt_SydSubs = html_SydSubs.text
Soup_SydSubs = BeautifulSoup(txt_SydSubs, 'lxml')
Soup_SydSubs_Table = Soup_SydSubs.find_all('table')[4]
Soup_SydSubs_Table2 = Soup_SydSubs_Table.find_all('tr')[4]
GSyd_Suburbs = [text.upper() for text in Soup_SydSubs_Table2.stripped_strings]

print('There are ',len(GSyd_Suburbs),' suburbs in the greater Sydney area.')
print('These are 5 suburbs in Sydney: ',GSyd_Suburbs[:5])

There are  669  suburbs in the greater Sydney area.
These are 5 suburbs in Sydney:  ['ABBOTSBURY', 'ABBOTSFORD', 'ACACIA GARDENS', 'AGNES BANKS', 'AIRDS']


In [14]:
### Use geopy to get the coordinates for all, or most, of the Suburbs in Sydney

GSyd_Suburbs_Coord = []
GSyd_Suburbs_NotFound = []

geolocator = Nominatim(user_agent='Explorer')  # create the geocoder once, not once per suburb

for n, suburb in enumerate(GSyd_Suburbs):
    try:
        location = geolocator.geocode(suburb + ' NSW')
        GSyd_Suburbs_Coord.append([suburb, location.latitude, location.longitude])
    except Exception:  # geocode returned None (no match) or the request failed
        GSyd_Suburbs_NotFound.append([n, suburb])
        print(suburb + ', NSW, cannot be found')

print('Geodata completed')
print('These are 5 suburbs in Sydney + their coordinates: ',GSyd_Suburbs_Coord[:5])
print('An example of a suburb in Sydney for which coordinates could not be retrieved: ',GSyd_Suburbs_NotFound[1])

BERRILEE, NSW, cannot be found
FIDDLETOWN, NSW, cannot be found
HURLSTONE PARK, NSW, cannot be found
Geodata completed
These are 5 suburbs in Sydney + their coordinates:  [['ABBOTSBURY', -33.8692846, 150.8667029], ['ABBOTSFORD', -33.8505529, 151.129759], ['ACACIA GARDENS', -33.7324595, 150.9125321], ['AGNES BANKS', -33.6145082, 150.7114482], ['AIRDS', -34.09, 150.8261111]]
An example of a suburb in Sydney for which coordinates could not be retrieved:  [238, 'FIDDLETOWN']


In [15]:
### Transform the information on Sydney suburb location into a dataframe and then save to a CSV
GSyd_Suburbs_Coord = pd.DataFrame(GSyd_Suburbs_Coord,columns=['Suburb','Latitude','Longitude'])
GSyd_Suburbs_Coord.to_csv('GSyd_Suburbs_Coord.csv')

GSyd_Suburbs_Coord.head()

| | Suburb | Latitude | Longitude |
|---|---|---|---|
| 0 | ABBOTSBURY | -33.869285 | 150.866703 |
| 1 | ABBOTSFORD | -33.850553 | 151.129759 |
| 2 | ACACIA GARDENS | -33.732459 | 150.912532 |
| 3 | AGNES BANKS | -33.614508 | 150.711448 |
| 4 | AIRDS | -34.09 | 150.826111 |
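
The idea of keeping a local copy to avoid repeating the slow geocoding step can be sketched as a small "load or build" helper. This is an illustration under assumptions: `load_or_build` and `fake_geocode_all` are hypothetical names, the helper writes its own three-column CSV layout (without the index column that `to_csv` adds by default), and a temporary directory stands in for the project folder.

```python
import csv
import os
import tempfile

def load_or_build(path, build_rows, header=('Suburb', 'Latitude', 'Longitude')):
    """Return rows from the CSV at path; call build_rows() and save only if the file is missing."""
    if os.path.exists(path):
        with open(path, newline='') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            return [[name, float(lat), float(lon)] for name, lat, lon in reader]
    rows = build_rows()
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return rows

calls = []

def fake_geocode_all():
    # Stand-in for the slow geopy loop; the values are copied from the output above
    calls.append(1)
    return [['ABBOTSBURY', -33.8692846, 150.8667029]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'suburbs_cache.csv')
    first = load_or_build(cache, fake_geocode_all)   # builds the rows and writes the file
    second = load_or_build(cache, fake_geocode_all)  # reads the file, no rebuild

print(len(calls), first == second)
```

With this pattern the geocoding loop runs only when the cached file is absent, so re-running the notebook does not trigger hundreds of new geopy requests.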


In [16]:
# Thanks

In [17]:
## 3. METHODOLOGY SECTION

In [18]:
## 4. RESULTS SECTION

In [19]:
## 5. DISCUSSION SECTION

In [20]:
## 6. CONCLUSION