# IBM Professional Data Science Certificate - Capstone Project

## Introduction: Business Problem
**Fitness** is a major global industry, estimated to generate at least $100 billion worldwide. In the US, boutique gym studios saw a drastic rise over the years 2013-2017, growing memberships by 121%. Interestingly, people who exercise are also often in higher demand as workers and earn higher salaries. As such, they are part of a high-demand and relatively mobile workforce. This target demographic will also arrange where they live and work so as to be close to fitness studios.

![A Fit Man Deadlifting](https://image.freepik.com/free-photo/weights-exercise-weightlifter-strong-athletic_1139-709.jpg)

We also live in a dystopian parallel universe where Yelp and similar services don't exist, so we need another way to find out which neighborhoods have the highest prevalence of fitness studios. We will use our fancy data science skills to solve this problem: helping fitness-conscious people in the workforce find Toronto neighborhoods to live in that give them better access to fitness studios to support their lifestyle.

## Data Description
At a high level, the data required to solve this business problem can be drawn from the datasets used previously in this capstone course: **Wikipedia**, the **Geospatial Coordinates** `CSV` file, and the **Foursquare API**. Popular neighborhoods will be identified geospatially by combining the Wikipedia table with the `CSV` file. Venues of type `fitness studio` will then be pulled for those neighborhoods, and we will determine which neighborhoods have the highest frequency of fitness studios.

## Acquire and Pre-Process Data

### Acquire Toronto Postal Coordinates

#### Download Toronto Postal Codes

Wikipedia has a table which provides all of the postal codes for Toronto. Rather than manually copying and pasting the data, which is both time-consuming and error-prone, we will pull the entire page as an HTML document and scrape the table from it. `requests.get` downloads the page, and we store its HTML text in a variable.

In [1]:
import requests
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
#import numpy as np
#import geocoder
from bs4 import BeautifulSoup
#import matplotlib.cm as cm
#import matplotlib.colors as colors

wiki_page: str = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
wiki_soup = BeautifulSoup(wiki_page, 'lxml')

#### Scrape the Wikipedia Page
`BeautifulSoup` objects can be used to scrape structured HTML data. We will identify the HTML `table` with classes `wikitable` and `sortable`, then pull the table headers from the `th` tags, the table rows from the `tr` tags, and the cell data from the `td` tags.

In [2]:
table = wiki_soup.find('table', { 'class': 'wikitable sortable'})
table_headers = table.find_all('th')

parsed_headers = []
for h in table_headers:
    parsed_headers.append(h.text.strip())  # strip() removes the trailing newline

table_rows = table.find_all('tr')
parsed_rows = []
for r in table_rows:
    table_row_data = r.find_all('td')  # the header row has no <td> cells, so it yields an empty row
    row_data = []
    for d in table_row_data:
        row_data.append(d.text.strip())
    parsed_rows.append(row_data)
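
The scraping logic can be tried offline on a small inline table. The snippet below is a minimal sketch, not the notebook's own cell: `SAMPLE_HTML` and `parse_table` are hypothetical names, the sample table is a stand-in for the Wikipedia page, and the built-in `html.parser` is used in place of `lxml`. Unlike the cell above, it skips rows that contain no `td` cells, so the header row does not produce an empty record.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for the Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first table carrying the 'wikitable' class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Skipping empty rows at parse time is one possible design choice; dropping them later with `dropna()` works just as well.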

#### Create a Pandas DataFrame of the Scraped Data

In [3]:
df_can_postal_codes = pd.DataFrame(data=parsed_rows, columns=parsed_headers)
df_can_postal_codes.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


#### Create a DataFrame Containing Only Toronto Boroughs' Postal Codes

In [4]:
df_can_postal_codes = df_can_postal_codes.dropna()  # Drop the empty row left by the table header
df_can_postal_codes = df_can_postal_codes[df_can_postal_codes['Borough'] != 'Not assigned']  # Drop unassigned postal codes
df_can_postal_codes = df_can_postal_codes.reset_index(drop=True)  # Ensure index starts at 0
# Keep only the boroughs whose name contains "Toronto"
is_toronto = df_can_postal_codes['Borough'].str.contains('Toronto', regex=False)
df_toronto_postal_codes = df_can_postal_codes[is_toronto].reset_index(drop=True)
df_toronto_postal_codes.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M5A,Downtown Toronto,"Regent Park, Harbourfront"
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
2,M5B,Downtown Toronto,"Garden District, Ryerson"
3,M5C,Downtown Toronto,St. James Town
4,M4E,East Toronto,The Beaches


#### Import Geospatial Data for Canadian Postal Codes, Create Pandas DataFrame

In [5]:
df_can_coords = pd.read_csv('./week3/geospatial_coordinates.csv')
df_can_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Combine Postal Code Data for Boroughs with Geospatial Data and Create a New Pandas DataFrame
This task requires creating `numpy` arrays from the postal code data and the geospatial data, and then looping through them to look for postal code matches. Where a match exists, we create a row of data which includes the postal code, borough, neighborhood, latitude, and longitude.

In [6]:
toronto_neighborhood_data = df_toronto_postal_codes.to_numpy()
can_coords_data = df_can_coords.to_numpy()

toronto_combined_data = []
for borough in toronto_neighborhood_data:
    for geo_entry in can_coords_data:
        if borough[0] == geo_entry[0]:
            toronto_combined_data.append([borough[0], borough[1], borough[2], geo_entry[1], geo_entry[2]])

df_toronto = pd.DataFrame(data=toronto_combined_data, columns=["Postal Code", "Borough", "Neighborhood", "Latitude", "Longitude"])
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
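
The nested loop compares every postal code against every coordinate row. The same join could also be sketched with `DataFrame.merge`. The two small frames below (`postal` and `coords`) are hypothetical stand-ins built from values shown in the outputs above, not the full datasets.

```python
import pandas as pd

# Stand-in for df_toronto_postal_codes (two rows copied from the output above).
postal = pd.DataFrame({
    "Postal Code": ["M5A", "M4E"],
    "Borough": ["Downtown Toronto", "East Toronto"],
    "Neighborhood": ["Regent Park, Harbourfront", "The Beaches"],
})

# Stand-in for df_can_coords; M1B has no match in `postal`.
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M4E", "M5A"],
    "Latitude": [43.806686, 43.676357, 43.654260],
    "Longitude": [-79.194353, -79.293031, -79.360636],
})

# An inner join keeps only the postal codes present in both frames.
merged = postal.merge(coords, on="Postal Code", how="inner")
print(merged)
```

An inner join mirrors the loop's behavior of emitting a row only when the postal codes match, while avoiding the pairwise comparison in Python.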


#### Visually Confirm the Geospatial Data Overlies the Correct Location

We will loop through all entries in `df_toronto` and place a labeled marker for each neighborhood. The starting position of the map is calculated as the mean of all latitudes and longitudes, which should roughly approximate the center of Toronto.

In [7]:
import folium

class Coordinates:
    def __init__(self, latitude, longitude):
        self.latitude = latitude 
        self.longitude = longitude

starting_coords = Coordinates(df_toronto['Latitude'].mean(), df_toronto['Longitude'].mean())

# Named toronto_map to avoid shadowing the built-in map()
toronto_map = folium.Map(location=[starting_coords.latitude, starting_coords.longitude], zoom_start=12)

for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = f'{neighborhood}, {borough}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='green',
        fill=True,
        fill_color='green',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(toronto_map)
toronto_map

#### Download Foursquare Data

We will first load the environment variables containing the Foursquare client ID and client secret so that we can authenticate our requests to the RESTful API.

In [8]:
%load_ext dotenv
%dotenv -v ./../.env
import os  
CLIENT_ID = os.getenv("FOURSQUARE_CLIENTID")
CLIENT_SECRET = os.getenv("FOURSQUARE_CLIENTSECRET")
VERSION = '20180605'
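
`os.getenv` returns `None` when a variable is not set, which would only surface later as a failed API call. A minimal sketch of failing early is shown below; `require_env` and `EXAMPLE_TOKEN` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Demo only: set a placeholder so the call below succeeds.
os.environ.setdefault("EXAMPLE_TOKEN", "dummy-value")
print(require_env("EXAMPLE_TOKEN"))
```

In the notebook this could be used as `CLIENT_ID = require_env("FOURSQUARE_CLIENTID")`, assuming the `.env` file has already been loaded.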

Make the venues `GET` request to the Foursquare API, transform the returned venues into a Pandas DataFrame, and then keep only the columns we will need.

In [13]:
radius = 500
limit = 10000  # Requested maximum; the API may cap results at a lower number
url = 'https://api.foursquare.com/v2/venues/search'
params = {
    'v': VERSION,
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'll': f'{starting_coords.latitude},{starting_coords.longitude}',
    'radius': radius,
    'limit': limit,
}

results = requests.get(url, params=params).json()
venues = results['response']['venues']
df_venues = json_normalize(venues)

# Keep only the columns we need
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
df_venues = df_venues.loc[:, filtered_columns]
df_venues.head()



Unnamed: 0,name,categories,location.lat,location.lng
0,Loretto College,"[{'id': '4bf58dd8d48988d198941735', 'name': 'C...",43.66718,-79.389414
1,U Condominiums,"[{'id': '4d954b06a243a5684965b473', 'name': 'R...",43.667136,-79.389231
2,Northrop Frye Hall,"[{'id': '4bf58dd8d48988d198941735', 'name': 'C...",43.666262,-79.392377
3,Brennan Hall,[],43.666891,-79.390413
4,Brennan Hall and Sorbara Auditorium,"[{'id': '4bf58dd8d48988d1af941735', 'name': 'C...",43.666624,-79.389868


Data collected from Foursquare needs to be destructured and restructured into something that fits into a Pandas DataFrame. For that, we define the function `get_category_type` to parse venue categories. Then, we'll improve the presentation of the DataFrame by cleaning up the headers.

In [14]:
def get_category_type(row):
    # Fall back to 'venue.categories' if the row has no 'categories' column
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Replace each list of category dicts with the name of its first category
df_venues['categories'] = df_venues.apply(get_category_type, axis=1)

# Clean up the column titles
df_venues.columns = [col.split(".")[-1] for col in df_venues.columns]

df_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Loretto College,College Academic Building,43.66718,-79.389414
1,U Condominiums,Residential Building (Apartment / Condo),43.667136,-79.389231
2,Northrop Frye Hall,College Academic Building,43.666262,-79.392377
3,Brennan Hall,,43.666891,-79.390413
4,Brennan Hall and Sorbara Auditorium,College Auditorium,43.666624,-79.389868


# TODO: Rewrite the code above to loop through all postal codes and pull venues for each one. It currently looks only at the starting location.
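
One possible shape for that loop is sketched below, under stated assumptions. The names `build_search_url`, `fetch_json`, and `venues_for_neighborhoods` are hypothetical; the endpoint, version string, and radius are taken from the cells above; the default `limit=50` is an assumed value. The sketch uses the standard library's `urllib` instead of `requests` so that the HTTP call can be swapped out through the `fetch` argument.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://api.foursquare.com/v2/venues/search"  # endpoint used earlier


def build_search_url(lat, lng, client_id, client_secret,
                     version="20180605", radius=500, limit=50):
    """Build the search URL for one coordinate pair (limit=50 is an assumption)."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    })
    return f"{SEARCH_URL}?{query}"


def fetch_json(url):
    """Download a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def venues_for_neighborhoods(neighborhoods, client_id, client_secret, fetch=fetch_json):
    """Collect venues for each (postal_code, lat, lng) tuple in `neighborhoods`."""
    rows = []
    for postal_code, lat, lng in neighborhoods:
        payload = fetch(build_search_url(lat, lng, client_id, client_secret))
        for venue in payload["response"]["venues"]:
            categories = venue.get("categories") or []
            rows.append({
                "Postal Code": postal_code,
                "name": venue["name"],
                # First category name, or None when the venue has no categories
                "category": categories[0]["name"] if categories else None,
                "lat": venue["location"]["lat"],
                "lng": venue["location"]["lng"],
            })
    return rows
```

With the notebook's variables, this might be called as `venues_for_neighborhoods(zip(df_toronto['Postal Code'], df_toronto['Latitude'], df_toronto['Longitude']), CLIENT_ID, CLIENT_SECRET)`, and the returned list of dictionaries can be passed to `pd.DataFrame`. Tagging each venue with its postal code keeps the link back to the neighborhood, which a later frequency count per neighborhood would need.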