# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

## Introduction: Business Problem <a name="introduction"></a>

In this project we attempt to build a **recommender system** that groups the neighborhoods of Toronto into clusters according to the venue categories found in them. This should help anyone who wants to open a shop, restaurant, or similar business and is in search of a rough estimate of which neighborhood would be the best fit.

The recommender system will present multiple venue categories in each cluster and will leave it to the interested person to select the cluster that best suits their purpose.

The project uses two main data science approaches (k-means clustering and recommender systems) to arrive at the desired results. The following sections explain the data that is used and its sources, the methodology followed, the analysis carried out, and finally the conclusion and discussion.
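To make the clustering idea concrete, here is a minimal sketch that groups a few neighborhoods by their venue-category mix with k-means. The neighborhood names, category columns, frequencies, and the choice of two clusters are all hypothetical values picked for illustration; they are not results from this project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical share of each venue category per neighborhood (each row sums to 1).
venue_freq = pd.DataFrame(
    {
        "Cafe": [0.60, 0.55, 0.05, 0.10],
        "Park": [0.10, 0.15, 0.70, 0.65],
        "Restaurant": [0.30, 0.30, 0.25, 0.25],
    },
    index=["Neighborhood A", "Neighborhood B", "Neighborhood C", "Neighborhood D"],
)

# Fit k-means; n_clusters=2 is an assumed value that only suits this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(venue_freq)

# Attach the cluster label to each neighborhood.
clustered = venue_freq.assign(Cluster=kmeans.labels_)

# Summarize each cluster by its mean category mix, which is what a reader would compare.
cluster_profiles = clustered.groupby("Cluster").mean()

print(clustered)
print(cluster_profiles)
```

A person looking to open a cafe would then look for the cluster whose profile is dominated by (or lacking in) cafes, depending on their strategy, and consider the neighborhoods assigned to it.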

## Data <a name="data"></a>

Based on the definition of the problem, the factors that will influence our decision are:
* The most common venues/venue categories in each neighborhood or borough
* The cut-off point for how many of the most common venues to take into consideration
* Both points above are handled in the section on building the recommendation system
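The two factors above can be sketched with a small pandas example: count the venue categories per neighborhood, then keep only the top few. The venue records and the cut-off `TOP_N = 2` are hypothetical; in the project the records would come from the Foursquare API and the cut-off is a choice still to be made.

```python
import pandas as pd

# Hypothetical venue records (stand-in for the Foursquare API results).
venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "A", "B", "B", "B"],
        "Venue Category": ["Cafe", "Cafe", "Park", "Bakery", "Gym", "Gym", "Cafe"],
    }
)

TOP_N = 2  # assumed cut-off: keep the 2 most common categories per neighborhood

# Count how often each category occurs within each neighborhood.
counts = (
    venues.groupby(["Neighborhood", "Venue Category"])
    .size()
    .reset_index(name="Count")
)

# Sort by count (ties broken alphabetically), then keep the top N rows per neighborhood.
top_venues = (
    counts.sort_values(
        ["Neighborhood", "Count", "Venue Category"],
        ascending=[True, False, True],
    )
    .groupby("Neighborhood")
    .head(TOP_N)
    .reset_index(drop=True)
)

print(top_venues)
```

Breaking ties alphabetically is only one possible rule; it is used here so that the result is reproducible.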

The following data sources will be needed to extract/generate the required information:
* The first data source is a **Wikipedia page**, which provides the neighborhood names, postal codes, and borough names for the city of Toronto
* The venue categories, their locations in every neighborhood, and their counts will be obtained using the **Foursquare API**
* The coordinates (lats/longs) of each neighborhood will be obtained from a **CSV** file that was shared earlier by the course instructors
* The code for obtaining the needed data is shown in detail in **Part 1** and **Part 2** below

## Part Number 1: Web Scraping and Data Wrangling

### We start by importing the libraries that will help us in putting the dataframe into place

In [2]:
# Importing the necessary libraries to complete the assignment - Data Wrangling and Web Scraping
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import requests
from bs4 import BeautifulSoup

In [3]:
# Importing the Libraries that have to do with the geospatial data/plotting/Clustering
# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

### We set up the variables and import the table from the wiki page and prepare the columns and rows for the dataframe

In [4]:
# Setting up the variables
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table',{'class':'wikitable sortable'}).tbody

# Preparing the Rows and Columns for the DataFrame
rows = table.find_all('tr')
columns = [v.text.replace('\n','') for v in rows[0].find_all('th')]
df = pd.DataFrame(columns=columns)

### Populating the dataframe with the data from the wiki table

In [5]:
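# Looping to collect the table rows, then building the dataframe in one step
# (DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first)
records = []
for i in range(1,len(rows)):
    tds = rows[i].find_all('td')
    values = [td.text.replace('\n','') for td in tds]
    records.append(values)
df = pd.DataFrame(records, columns=columns)
df.head()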

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
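The same parsing steps can be tried offline on a small HTML snippet, which is handy for checking the logic without requesting the live page. The snippet below is a hypothetical stand-in that mirrors the first rows of the table; `sample_df` and `records` are names introduced only for this sketch.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the live page).
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", {"class": "wikitable sortable"}).find_all("tr")

# The first row holds the headers; the remaining rows hold the data cells.
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

# Build the dataframe once from the collected rows.
sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

Collecting the rows in a list and constructing the dataframe once avoids growing a dataframe row by row, which gets slow for large tables.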


### Now, we will be removing the rows with a **Not assigned** borough from the dataframe

In [6]:
# data wrangling to remove the "Not assigned" cells
df_new = df.set_index('Borough')
df_new.drop('Not assigned',inplace=True)
df_new = df_new.reset_index()
df_new = df_new[['Postal Code','Borough','Neighborhood']]
df_new.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
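As an aside, the same rows can also be dropped with a boolean mask instead of moving `Borough` into the index and back. This is an alternative sketch on a small hypothetical sample, not the approach used in the cell above; unlike `drop`, it does not raise an error when no "Not assigned" rows are present.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the scraped table.
sample = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
        "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
        "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
    }
)

# Keep only the rows whose borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```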


## Part Number 2: Getting the Lats/Longs for the Neighborhoods

### First, we download the geospatial data

In [7]:
# Downloading the geospatial data
!wget -q -O 'Geospatial_data.csv' http://cocl.us/Geospatial_data
geospatial_df = pd.read_csv('Geospatial_data.csv')
geospatial_df = geospatial_df.rename(columns={'Postal Code':'PC'})
geospatial_df.head()

| | PC | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### We merge the two dataframes in order to have the data in a format ready for analysis

In [8]:
# using df.merge in order to join the two dataframes
Toronto_df = df_new.merge(geospatial_df,left_on='Postal Code',right_on='PC')
Toronto_df = Toronto_df.drop(['PC'],axis=1)
Toronto_df.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
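Because the default inner merge silently drops postal codes that have no match, it can be worth checking whether any were lost. The sketch below shows one way to do that with the `how`, `validate`, and `indicator` parameters of `merge`. The two small frames are hypothetical samples, and `M9Z` is a made-up postal code included only to show what an unmatched row looks like.

```python
import pandas as pd

# Hypothetical samples; "M9Z" is a made-up code with no coordinates on purpose.
postal = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M9Z"],
        "Borough": ["North York", "North York", "Example Borough"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# A left merge keeps every postal code; validate raises if a code is duplicated
# on either side; indicator adds a "_merge" column naming the source of each row.
merged = postal.merge(
    coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` comes back empty, the inner merge used above loses nothing and the check can be dropped.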
