# COURSERA - CAPSTONE PROJECT

#### Daniel E. _Rivero Mendoza_

## FINAL REPORT

### Contents

1. INTRODUCTION
   
   1.1. Scenario and background
   
   1.2. Problem to be solved
   
   1.3. Interested audience
    
    
2. DATA

    2.1. Data required and sources
    
    2.2. Raw data examples
    
    2.3. Main data processing operations
    
    
3. METHODOLOGY
    
    
4. RESULTS


5. DISCUSSION


6. CONCLUSION
    
    

## 1. INTRODUCTION

### 1.1. Scenario and background

Moving can be an overwhelming endeavour. Whether you own, rent, or are leaving the nest for the first time, the items on that _"to do before I move"_ list are likely to be numerous, tedious, time-consuming and, once completed, tremendously exhausting. No matter how much you plan and organize, this seems to be the inherent nature of moving: it is hardly ever the proverbial _"walk in the park"_. If you add the very real possibility of moving to a neighborhood, city, state, country, or continent you are not familiar with, the tiring practical aspects of the task blend with the trepidation evoked by the _unknown_ and, _voilà_, everything is set for some daunting days or weeks ahead. In comes the anxiety! Anybody facing such a task will surely hope to find a tool, resource or service that can help answer questions like _how can I make my life easier?_ or _am I moving to the right place?_

This Capstone Project, _Battle of the Neighborhoods_, aims to leverage the tools and techniques learned over the past few months in the IBM Data Science Professional Certificate to answer some aspects of the questions above. I should state that the work done will not help pack your plates and cookware, or call the electricity company to arrange the final meter reading and payment, so it will not make anybody's _life easier_ in the most practical sense. Instead, it tackles the _right place_ question by providing an approach and methodology to find neighborhoods with the desired characteristics and attributes, on the assumption that the more familiar and comfortable the destination feels, the better off anybody will be. The case study is a hypothetical move from Boulder (CO, United States) to Sydney (NSW, Australia), two cities half a world apart, with very different landscapes, but perhaps more similar than anybody would think at first glance.

### 1.2. Problem to be solved

The problem to solve is finding a neighborhood in Sydney (NSW, Australia) with similar characteristics to one in downtown Boulder (CO, United States), plus some extra attributes. To set the basis for comparison, the following requirements apply to the neighborhood at the destination:

Must haves:

- Surroundings with amenities and venues similar to the ones found in Boulder (CO), USA
- Located within 2 km from a train or light rail station in greater Sydney (NSW), Australia
 
Desirable, pending availability of time and reliable data:

- Rent price in the 500 AUD/week range for a unit with at least 2 bedrooms, 1 bathroom, 1 parking spot, and 75 sqm. 

### 1.3. Interested audience

Regarding the aspect(s) of the work that deal with _easing_ a generic moving endeavour:
- I believe anybody considering a move to a city where information about public transport, services, amenities, etc., is available can find this work interesting and useful, given that the approach and methodologies used here are equally applicable in North America, Asia, Africa, etc.

Regarding uploading the source code to GitHub:
- Anybody aiming to learn about leveraging Foursquare, mapping techniques, plotting data, running scikit-learn tools, manipulating data in Python, etc., can read running code on my public profile and learn syntax and operations in a "task-oriented" way.

Regarding Capstone Project as an enduring assignment in future courses:
- Anybody going through the same IBM Course may find this work inspiring and may build upon it in future Capstone assignments. 

## 2. DATA

### 2.1. Data required and sources

To check whether neighborhoods at the destination comply with the attributes listed in Section 1.2, the following data is needed.

- Coordinates for downtown Boulder, CO.
    - Source: Nominatim tool from geopy.geocoders.
    
    
- List of top venues in downtown Boulder, CO.
    - Source: API call to Foursquare for data (name, coordinates, type, etc.) on venues at a given location.


- List of suburbs in Sydney, NSW.
    - Source: Scraped from the website http://www.walksydneystreets.net/suburbssydneyall.htm


- Coordinates for the suburbs in Sydney, NSW.
    - Source: Nominatim tool from geopy.geocoders, based mostly on the list of suburbs outlined above.


- GEOJSON data for suburbs in Sydney, NSW. 
    - Source: File with all the 4000+ neighborhoods in New South Wales, Australia, downloaded from federal government site https://data.gov.au/data/dataset/91e70237-d9d1-4719-a82f-e71b811154c6


- List of Train Stations in Sydney, NSW.
    - Source: File with all the train and light rail stations in New South Wales, Australia, downloaded from the NSW government's Open Data portal https://opendata.transport.nsw.gov.au/dataset/train-station-entries-and-exits-data


- Coordinates of Train Stations in Sydney, NSW.
    - Source: See above.


Time and reliability of data permitting, the following dataset would be needed to check whether neighborhoods at the destination comply with the desired attributes for rental units.

- Average rental price and characteristics of units per neighborhood in Sydney, NSW.
    - __NOTE:__ No reliable data was found in time for this.

### 2.2. Raw data examples

Selected examples of raw data are provided as follows. 

- GEOJSON with geospatial information of all 4592 suburbs in New South Wales.

![GEOJSON.JPG](GEOJSON.JPG)

- NSW train station data as downloaded from the NSW government website.

![TrainStations.JPG](TrainStations.JPG)

There are many other examples of raw datasets, but the details on those can be found in the Capstone Project - CODE submission.

### 2.3. Main data processing operations

The main processing operations for each dataset outlined in Section 2.1 are listed below.

- Coordinates for downtown Boulder, CO.
    - No major processing needed beyond storing the coordinates as local variables to be used throughout the code.
    
    
- List of top venues in downtown Boulder, CO.
    - Name, latitude, longitude and category extracted for all 100 venues retrieved in the JSON from the Foursquare API call.


- List of suburbs in Sydney, NSW.
    - List scraped from the website using BeautifulSoup, which required special attention during parsing as the information was not encoded in the typical HTML table format.


- Coordinates for the suburbs in Sydney, NSW.
    - Iterative process to get coordinates for all suburbs with Nominatim; the results were then stored locally as a CSV file to avoid repeating such a long process.


- GEOJSON data for neighborhoods in Sydney, NSW. 
    - Dataset filtered to reduce the number of suburbs from 4000+ to the 300+ meeting the problem requirements - see below.

![GEOJSON_cleaned.JPG](GEOJSON_cleaned.JPG)

- List of Train Stations in Sydney, NSW.
    - The list of train stations required dropping attributes that were not needed for the analyses attempted here, along with repeated train station entries. The dataset was complemented with each train station's distance to the city centre, calculated using geopy tools - see below.
    
![TrainStations_Cleaned.JPG](TrainStations_Cleaned.JPG)

- Coordinates of Train Stations in Sydney, NSW.
    - See above.


The rental dataset was to be processed only if time and data reliability permitted.

- Average rental price and characteristics of units per neighborhood in Sydney, NSW.
    - __N/A__ - no reliable data was found in time; see Sections 1.2 and 2.1.
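
The "geocode once, then store as CSV" step described for the Sydney suburbs can be sketched as follows. This is only an illustration under stated assumptions, not the code from the project: `geocode_with_cache`, `fake_lookup`, the CSV column names and the coordinates are all hypothetical, and the lookup is passed in as a plain function so the example runs without network access.

```python
import csv
import os
import tempfile


def geocode_with_cache(names, lookup, cache_path):
    """Return {name: (lat, lon)}; only names missing from the CSV cache are looked up."""
    coords = {}
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                coords[row["suburb"]] = (float(row["lat"]), float(row["lon"]))

    for name in names:
        if name not in coords:
            result = lookup(name)  # expected to return (lat, lon) or None
            if result is not None:
                coords[name] = result

    # Rewrite the cache so the next run can skip every suburb already resolved
    with open(cache_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["suburb", "lat", "lon"])
        for name, (lat, lon) in coords.items():
            writer.writerow([name, lat, lon])

    return {name: coords[name] for name in names if name in coords}


# Stand-in lookup with made-up coordinates (assumption, replaces a real geocoder)
calls = []


def fake_lookup(name):
    calls.append(name)
    return {"Suburb A": (-33.9, 151.2)}.get(name)


cache_file = os.path.join(tempfile.mkdtemp(), "suburb_coords.csv")
first = geocode_with_cache(["Suburb A", "Suburb B"], fake_lookup, cache_file)
second = geocode_with_cache(["Suburb A"], fake_lookup, cache_file)  # served from the CSV
```

In a real run, `lookup` could wrap a geopy `Nominatim` geocoder and return `(location.latitude, location.longitude)`, or `None` when the suburb is not found.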

## 3. METHODOLOGY

The data outlined above will be used as follows:

- Coordinates for downtown Boulder will be used to retrieve the top venues within a 5 km radius _via_ Foursquare.
    - Charts and maps to be used to visualize the data.


- The list of suburbs in Sydney will be used to retrieve their geographic coordinates _via_ geopy, which will be complemented with data from the GEOJSON.


- The list of train stations and their coordinates in Sydney, NSW, will be used to filter out suburbs whose centres are more than 2 km from the nearest station.
    - Maps to be used to visualize the data.


- List of venues in Boulder and Sydney will be used to establish similarities between their corresponding suburbs.
    - Charts to be used to explore the data and look for correlations. 
    - Maps to be used to depict the data. 
    - The k-means algorithm to be used to cluster the suburbs in Sydney together with Boulder, which will pair the latter with one or several suburbs in the former based on the venues available close by.
    - Additional metrics to be used as appropriate to narrow down the list of Sydney suburbs that are suitable for moving.

## 4. RESULTS

Once the coordinates for downtown Boulder, CO, were retrieved, an API call to Foursquare for venues within a 5 km radius of those coordinates yielded a list of the top 100 entries.

A quick review of the data revealed 55 unique venue types among the 100 entries retrieved. The bar chart below shows how frequently each venue type appeared in the data for Boulder.

![BarChart.JPG](BarChart.JPG)

The most frequent categories, together accounting for 50% of the entries, were:

Venue category | Frequency | % Cumulative
--- | --- | ---
Trail | 6 | 6
Pizza place | 5 | 11
Grocery store | 4 | 15
Hotel | 4 | 19
Cafe | 4 | 23
Sandwich place | 3 | 26
Gym | 3 | 29
Sporting goods shop | 3 | 32
Ice cream shop | 3 | 35
French restaurant | 3 | 38
New american restaurant | 3 | 41
Japanese restaurant | 3 | 44
Beer garden | 2 | 46
Spa | 2 | 48
Italian restaurant | 2 | 50
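
A frequency table with a cumulative percentage column, like the one above, can be built with a simple counter. The sketch below is illustrative only: `category_breakdown` is a hypothetical helper and the sample list is made up, not the Boulder data.

```python
from collections import Counter


def category_breakdown(categories):
    """Return (category, count, cumulative %) rows, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


# Made-up sample of 10 venue categories (assumption, for illustration only)
sample = ["Trail"] * 4 + ["Pizza place"] * 3 + ["Cafe"] * 2 + ["Gym"]
rows = category_breakdown(sample)
for row in rows:
    print(row)
```

Reading the cumulative column top-down shows how many categories are needed to cover a given share of the venues, which is how the 50% cut-off in the table was applied.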

Then, a map depicting the venue locations in Boulder was examined.

![MapBoulder.JPG](MapBoulder.JPG)


The data for Sydney's location, suburbs, train stations and GEOJSON was downloaded and processed, resulting in the following dataframe.

![SydneyDataframe.JPG](SydneyDataframe.JPG)


It was found that there are 251 train and light rail stations within 60 km of Sydney's city centre. The train station data was used to filter the list of suburbs in greater Sydney based on the < 2 km condition, which reduced the number of complying suburbs from 669 to 344. Based on this list, the number of suburbs in the NSW GEOJSON was reduced from 4592 to the required 344.
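
The < 2 km filter can be illustrated with a small distance check. The report relies on geopy tools for distances; the sketch below swaps in a plain haversine (great-circle) formula so it runs with the standard library only, and all names and coordinates in it are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def stations_within(suburb, stations, max_km=2.0):
    """Names of the stations no more than max_km from the suburb centre."""
    _, s_lat, s_lon = suburb
    return [name for name, lat, lon in stations
            if haversine_km(s_lat, s_lon, lat, lon) <= max_km]


# Made-up coordinates (assumption, for illustration only)
stations = [("Station A", -33.8830, 151.2060), ("Station B", -33.8920, 151.1990)]
suburbs = [("Suburb near", -33.8900, 151.2000), ("Suburb far", -33.7000, 151.3000)]

# Keep only suburbs with at least one station inside the 2 km radius
kept = {}
for suburb in suburbs:
    hits = stations_within(suburb, stations)
    if hits:
        kept[suburb[0]] = hits
```

The length of each list in `kept` is also the "number of stations within 2 km" figure that can later be used to rank suburbs.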

To better visualize these results, a map overlaying the location of all the train stations (blue) in greater Sydney with all the suburbs within 2 km walking distance of those stations (red choropleth) was produced.

![Syd_Sub&Trains.JPG](Syd_Sub&Trains.JPG)


The list of suitable suburbs within 2 km of any train station was used to execute the Foursquare API call requesting the top 100 venues within 2 km of each suburb centre. A grand total of 19609 venues were retrieved across 335 suburbs, since the API call errored out for Earlwood, Liberty, Macquaire, Penrith, Toongabbie, Turrella, Carlton, Carramar and Kings Park. The results were arranged in a dataframe as shown below.

![Foursquare_APIdf.JPG](Foursquare_APIdf.JPG)


The number of venues retrieved per Sydney suburb was summarized from the data above and plotted against the distance of each suburb to the city centre to check whether the two were correlated. A slight negative correlation was observed (_i.e._ the further from the city centre, the smaller the number of venues retrieved), but no fit was attempted given the noise and the saturation (venue count capped at 100) observed for suburbs < 20 km from Sydney's centre.

![plot_Num2Dist.JPG](plot_Num2Dist.JPG)


Among the 19609 venues in 335 Sydney suburbs there were 353 unique venue categories, which made a category breakdown like the one done for Boulder impractical. Instead, the Foursquare data was further manipulated to produce a dataframe showing the name of each suburb plus its 5 most common venue categories.

![Foursquare_common.JPG](Foursquare_common.JPG)
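
The "5 most common venue categories per suburb" summary can be sketched as below. This is a minimal standard-library illustration rather than the dataframe code used in the project; `most_common_categories` and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def most_common_categories(venues, top_n=5):
    """Map each suburb to its top_n venue categories, most frequent first."""
    per_suburb = defaultdict(Counter)
    for suburb, category in venues:
        per_suburb[suburb][category] += 1
    return {suburb: [category for category, _ in counts.most_common(top_n)]
            for suburb, counts in per_suburb.items()}


# Made-up (suburb, category) pairs (assumption, for illustration only)
venues = [
    ("Suburb A", "Cafe"), ("Suburb A", "Cafe"), ("Suburb A", "Park"),
    ("Suburb B", "Gym"), ("Suburb B", "Thai restaurant"), ("Suburb B", "Gym"),
]
top = most_common_categories(venues, top_n=2)
```

With 353 categories in play, a per-suburb ranking like this is easier to read than a single city-wide frequency chart.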


Cafes and restaurants feature heavily among the 5 most common venue categories across the 335 suburbs analyzed. These types of venues also top the most-common lists almost exclusively, which at first sight differs from what was seen for Boulder, where trails topped the list and gyms and sporting goods shops were not far behind.

The k-means algorithm was run iteratively over the combined set of suburbs (Sydney and Boulder) for a number of clusters (kclusters) varying between 1 and 150. The distortion (_i.e._ a measure of how tightly the points in each cluster group around their cluster centre; lower means tighter) was calculated as a function of kclusters so that the optimum number of clusters could be picked.

![kmeans_plot.JPG](kmeans_plot.JPG)


From the plot above, it was observed that the largest drop in distortion happened between 10 and 20 clusters. Thus, 15 was taken as the _"optimum"_ and the clustering was run with that value. The results were appended to a dataframe with some of the other information retrieved.
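
The distortion-versus-cluster-count scan can be illustrated with a small, dependency-free k-means. This is a simplified stand-in written for this example, not the project's code: a library implementation such as scikit-learn's `KMeans` reports the same sum of squared distances through its `inertia_` attribute. The function names, the seed handling and the toy points are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_distortion(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the final distortion
    (sum of squared distances from each point to its nearest centre)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # random initial centres drawn from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its members; keep it if the cluster is empty
        new_centres = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centres[i]
            for i, members in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return sum(min(sq_dist(p, c) for c in centres) for p in points)


# Toy data: three tight, well-separated groups (assumption, for illustration only)
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1)]
distortions = {k: kmeans_distortion(points, k) for k in range(1, 6)}
```

Plotting `distortions` against `k` gives the kind of curve used to pick the number of clusters. Because the result depends on the random initial centres, library implementations usually repeat the run several times and keep the best one.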

![kmeans_DF.JPG](kmeans_DF.JPG)


A quick exploratory plot of cluster label vs. distance to city centre from the dataframe above was produced to see if there was any correlation between both. The distance to the city centre was not among the attributes used to run the k-means algorithm, so any influence it may have had on the clustering would have been indirect - _e.g._ if distance from the city centre were tied to geographic characteristics that influence the type of venues found in a suburb.

![kmeans_correlation.JPG](kmeans_correlation.JPG)


No indirect correlation between cluster label and distance to city centre was observed in the plot above. 

A bar chart was produced to better show how the suburbs are distributed across the 15 clusters, as follows.

![kmeans_clusteringbar.JPG](kmeans_clusteringbar.JPG)


The plot above shows that clusters 10, 13 and 14 have the largest number of suburbs assigned. Together, these 3 clusters contain ~50% of the 335 suburbs. Cluster 8 has the fewest, with only 1 suburb.

A map of Sydney and suburbs with color coded markers per cluster was produced to visualize the data. 

![kmeans_map.JPG](kmeans_map.JPG)


Clusters 10 and 8 were examined to see if the 5 most common venue categories of their suburbs held a clue as to why they had the most and the fewest suburbs assigned. Cafes and restaurants featured prominently among the 5 most common venue categories for both clusters, so not much could be said about the defining features of these 2 clusters.

![Cluster8.JPG](Cluster8.JPG)
![Cluster10.JPG](Cluster10.JPG)

Cluster 13 was also scrutinized as it was the cluster containing Boulder, CO. Restaurants and cafes feature heavily in this cluster yet again, but venues in the gym and park categories also appear quite frequently among the most common venues. This is in line with the breakdown of venue types for Boulder.

![Cluster13_1.JPG](Cluster13_1.JPG)


Merging the data above with the number of train stations within 2 km of each suburb, and sorting in descending order on that same number, yields the final dataframe used to nominate the suburbs for moving.

![Cluster13.JPG](Cluster13.JPG)
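
The final selection step, keeping the suburbs that share Boulder's cluster and ranking them by nearby stations, can be sketched as follows. The `shortlist` helper and the dictionary keys are hypothetical; the three cluster 13 rows reuse figures reported in the recommendation table of this report, while the cluster 10 row is made up to show the filter at work.

```python
def shortlist(rows, target_cluster, top_n=5):
    """Suburbs in target_cluster, ordered by number of stations within 2 km (descending)."""
    same_cluster = [row for row in rows if row["cluster"] == target_cluster]
    ranked = sorted(same_cluster, key=lambda row: row["stations_2km"], reverse=True)
    return [row["suburb"] for row in ranked[:top_n]]


rows = [
    {"suburb": "ROZELLE", "cluster": 13, "stations_2km": 7},
    {"suburb": "DARLINGTON", "cluster": 13, "stations_2km": 26},
    {"suburb": "EXAMPLE SUBURB", "cluster": 10, "stations_2km": 30},  # hypothetical row
    {"suburb": "REDFERN", "cluster": 13, "stations_2km": 13},
]
picked = shortlist(rows, target_cluster=13)
```

The cluster filter is applied before sorting so that a well-connected suburb from a different cluster cannot displace the ones that resemble Boulder.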


## 5. DISCUSSION

The 335 suburbs within 2 km of train and light rail stations in Sydney were successfully grouped into 15 clusters, based on the similarity of the venues and amenities they offer, and compared with those available in downtown Boulder. Thus, according to the analysis performed, moving to any one of the 50+ suburbs in cluster 13 would meet the objective of finding suburbs with familiar amenities and ready access to public transport.

Further narrowing of the options available in Sydney was possible by looking at the number of train and light rail stations within the 2 km mark for each suburb, the assumption being that more access to public transport is better than less. The final list of recommended suburbs in Sydney, based on the requirements and conditions set in Section 1.2, is shown below.

Suburb | Cluster label | Train Stations < 2km
--- | --- | ---
DARLINGTON | 13 | 26
REDFERN | 13 | 13
RUSHCUTTERS BAY | 13 | 8
CAMPERDOWN | 13 | 8
ROZELLE | 13 | 7

Although the main objective of the project was completed, there are further topics worth discussing in this section. They range from data observations to methodology critique, so they are presented as itemized lists and briefly expounded upon.

On the data exploration done.

- The plot of venues retrieved from Foursquare vs. suburb distance to the city centre showed a general tendency of the former to decrease as the latter increased. The tendency merely depicts, with data, the notion that commercial activity diminishes with distance from the centre of a big city.
- The cluster label vs. suburb distance to city centre did not show any correlation, which suggests distance does not play, even if indirectly, any role in the clustering of suburbs. The interpretation of this finding is outside of the scope of this work, although it may be interesting for 

On the use of the k-means algorithm.

- The cluster-defining features of the Sydney and Boulder suburbs were not evident from the breakdown and analysis of the 5 most frequent venue categories for all the suburbs, which is evidence of the power of ML and, in this particular case, k-means, to quickly find patterns in relatively large datasets.
- Because k-means is an unsupervised ML technique, the question remains whether the clustering really means anything in terms of the _quality_ of the suggestions made to our hypothetical subject. If this were a real case, it would prove useful to get feedback about the suggestions from the subject and improve the methodology as needed.
- The distortion analysis for the iterative k-means runs as a function of the number of clusters does not display the expected sharp decrease and subsequent plateau of the distortion value as the number of clusters increases. This may point to the available data not having distinct, category-defining features, or simply to balancing issues, given that the number of attributes and the number of samples to be clustered were in the same ballpark.

On the data processing methods used. 

- Some data processing steps proved very intensive, so careful consideration needs to be given to the means employed to compare, treat, filter, and generally manipulate data.

On finding suitable suburbs when moving. 

- Finding the right suburb to move to in a country halfway around the world will probably require considering a few more things than just the availability of public transport and the characteristics of close-by venues/amenities. However, the type of approach followed here can be extended to rental costs, service availability, school performance, etc., and thus account for anything and everything anybody would be interested in when looking for that next place to live. The important thing is to find relevant and reliable data that can be translated into specific requirements.

## 6. CONCLUSION

To conclude, data analysis was performed to identify the suburbs in Sydney with train or light rail stations within 2 km and venues/amenities similar to downtown Boulder, CO. During the analysis, several suburb and venue/amenity features were explored and visualized. Finally, clustering helped to find a list of suburbs suitable for moving.