## Capstone Project - The Battle of Neighborhoods (Week 1)

### Identifying the similarities or differences in Arts & Entertainment, Food and Nightlife between New York City and Toronto

This notebook seeks to identify the similarities/dissimilarities in Arts & Entertainment, Food and Nightlife between New York City and Toronto. 

This **Week 1** part of the notebook covers how the data for New York City and Toronto is obtained and prepared for data wrangling.

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed
import folium # map rendering library

#!conda install -c anaconda beautifulsoup4 # uncomment this line if beautifulsoup4 is not installed
from bs4 import BeautifulSoup # library used to scrape webpages
import csv # library used to read and write to csv files

print('Libraries imported.')

Libraries imported.


### 1. Obtaining Data of New York and Toronto Cities

#### 1.1. Obtaining data of New York City

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Fortunately, this dataset exists for free on the web. Here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

It could also be downloaded from http://tiny.cc/n82g7y.
So let's go ahead and do that.

In [2]:
!wget -q -O 'newyork_city_neighborhood.json' http://tiny.cc/n82g7y
print('Data downloaded!')

Data downloaded!
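
If `wget` is not available in your environment (for example on Windows), the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download_file` is hypothetical and is not part of the original notebook.

```python
import shutil
import urllib.request


def download_file(url, dest_path):
    """Stream the resource at `url` into the local file `dest_path`."""
    with urllib.request.urlopen(url) as response, open(dest_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return dest_path


# Usage (performs a network request, so it is left commented out here):
# download_file('http://tiny.cc/n82g7y', 'newyork_city_neighborhood.json')
```

The file is written in binary mode so the JSON bytes are stored exactly as served.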


Now that we've downloaded the json file, let's go ahead and save its json data in a variable *newyork_json* for further analysis.

In [3]:
with open('newyork_city_neighborhood.json') as json_data:
    newyork_json = json.load(json_data)

All the relevant data in *newyork_json* is in the *features* key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data and take a look at its first item in the list.

In [4]:
newyork_neighborhoods_data = newyork_json['features']
newyork_neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

Next, we're going to transform the *newyork_neighborhoods_data* data of nested Python dictionaries into a pandas dataframe. We'll need 'Borough', 'Neighborhood', 'Latitude' and 'Longitude' data.  So let's start by creating an empty dataframe. 

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df_newyork_neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [6]:
df_newyork_neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data, collect one row per neighborhood, and build the dataframe from those rows.

In [7]:
rows = []
for data in newyork_neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
df_newyork_neighborhoods = pd.DataFrame(rows, columns=column_names)
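
As an alternative to the explicit loop, the nested feature dictionaries can be flattened in one step with `pd.json_normalize`. This is only a sketch under stated assumptions: `sample_features` is a trimmed copy of the first feature shown earlier and stands in for `newyork_json['features']`, and `df_sample` is a hypothetical name.

```python
import pandas as pd

# One trimmed feature, standing in for newyork_json['features'] (assumption)
sample_features = [
    {'type': 'Feature',
     'geometry': {'type': 'Point',
                  'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
]

# Flatten nested keys into dotted column names such as 'properties.name'
flat = pd.json_normalize(sample_features)

df_sample = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON points are ordered [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(df_sample)
```

Both approaches should produce the same four columns; the loop is easier to follow, while the flattened version avoids manual dictionary access.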

Quickly examine the resulting dataframe.

In [8]:
df_newyork_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [9]:
print('The New York City dataframe has {} boroughs and {} neighborhoods.'.format(
        len(set(df_newyork_neighborhoods['Borough'])),
        df_newyork_neighborhoods.shape[0]
    )
)

The New York City dataframe has 5 boroughs and 306 neighborhoods.


Now let's take a look at the number of neighborhoods in each borough.

In [10]:
df_newyork_neighborhoods['Borough'].value_counts()

Queens           81
Brooklyn         70
Staten Island    63
Bronx            52
Manhattan        40
Name: Borough, dtype: int64

Now, let's save this dataframe to a *csv* file for later use, using the <code>to_csv</code> method.

In [11]:
df_newyork_neighborhoods.to_csv('df_newyork_neighborhoods.csv', index = False)

#### 1.2. Obtaining data of Toronto City

We first scrape and then parse the data of the table from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M</a> and then iterate over it to create an array of table rows.

In [12]:
# scraping and parsing the data from the Wikipedia page
page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(page, 'lxml')
table = soup.find('table', class_=['wikitable'])

table_rows = []
for table_row in table.findAll('tr'):    
    columns = table_row.findAll('td')
    table_row = []
    for column in columns:
        table_row.append(column.text.rstrip())
    table_rows.append(table_row)
    
header_row = []
for table_head in table.findAll('th'):
    header_row.append(table_head.text.rstrip())    
table_rows[0] = header_row

# Writing table into a CSV file
with open('postalcodes_toronto.csv', 'w', newline='') as csv_file:  # newline='' avoids blank rows on Windows
    writer = csv.writer(csv_file)
    writer.writerows(table_rows)

Next, we load the data and clean it by dropping the rows where the value of *Borough* is *Not assigned*. Where a row has a *Borough* but its *Neighborhood* is *Not assigned*, we assign the *Neighborhood* the same value as the *Borough*. For every duplicate *PostalCode*, we concatenate its *Neighborhood* values, separated by a *,* (comma). We do this by grouping the data by PostalCode and Borough.

In [13]:
# Reading data from 'postalcodes_toronto.csv'
df_postalcodes_toronto = pd.read_csv('postalcodes_toronto.csv')

# Renaming the columns of the dataframe
df_postalcodes_toronto.columns = ['PostalCode', 'Borough', 'Neighborhood']

# Dropping the rows where the value of 'Borough' is 'Not assigned'
df_postalcodes_toronto = df_postalcodes_toronto.drop(df_postalcodes_toronto[df_postalcodes_toronto['Borough'] == 'Not assigned'].index)

# Assigning 'Neighborhood' the same value as 'Borough' if the value of 'Neighborhood' is 'Not assigned'
not_assigned = df_postalcodes_toronto['Neighborhood'] == 'Not assigned'
df_postalcodes_toronto.loc[not_assigned, 'Neighborhood'] = df_postalcodes_toronto.loc[not_assigned, 'Borough']

# Concatenating 'Neighborhood' values for every duplicate 'PostalCode'
df_postalcodes_toronto = df_postalcodes_toronto.groupby(['PostalCode', 'Borough'], sort=False).agg({'Neighborhood': ', '.join}).reset_index()
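
To make the three cleaning steps concrete, here is a small self-contained sketch on made-up rows. The dataframe `toy` and its contents are hypothetical and only mimic the shape of the scraped table; they are not taken from the Wikipedia page.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a borough
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighborhood is missing
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

# 3. Collapse duplicate postal codes into one comma-separated row
toy = (toy.groupby(['PostalCode', 'Borough'], sort=False)
          .agg({'Neighborhood': ', '.join})
          .reset_index())
print(toy)
```

Passing `sort=False` to `groupby` keeps the postal codes in the order in which they first appear, which matches the order of the source table.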

Now let's take a look at the head of <code>df_postalcodes_toronto</code>.

In [14]:
df_postalcodes_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [15]:
print("The shape of 'df_postalcodes_toronto' is", df_postalcodes_toronto.shape)

The shape of 'df_postalcodes_toronto' is (103, 3)


Now that we've built a dataframe of the PostalCode of each Neighborhood along with the Borough, we will next get the latitude and longitude coordinates of each Neighborhood.

We'll use a csv file at http://tiny.cc/od8m7y that has the geographical coordinates of each PostalCode.

In [16]:
# Generating dataframe by reading the Geospatial CSV file and then naming columns
df_geo_coords = pd.read_csv('http://cocl.us/Geospatial_data')
df_geo_coords.columns = ['PostalCode', 'Latitude', 'Longitude']

# Merging the PostalCodes dataframe with the Geospatial dataframe by creating an inner join on 'PostalCode'
df_postalcodes_toronto = pd.merge(df_postalcodes_toronto, df_geo_coords, on=['PostalCode'], how='inner')
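
An inner join silently drops any PostalCode that has no coordinates. One way to check for that before merging is a left join with `indicator=True`, sketched below on made-up data. The dataframes `postal` and `coords`, and the code `M9Z`, are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the postal code and coordinate dataframes
postal = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                       'Borough': ['North York', 'North York', 'Hypothetical']})
coords = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; '_merge' records where each row came from,
# and validate raises if a PostalCode appears more than once on either side
checked = pd.merge(postal, coords, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that would be lost by an inner join
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner join used above loses no rows.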

Now let's take a look at the head of <code>df_postalcodes_toronto</code> again.

In [17]:
df_postalcodes_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


The geographical coordinates of each PostalCode seem to have been added to the dataframe correctly. Now, let's confirm that <code>df_postalcodes_toronto</code> still has 103 rows and now has 5 columns, i.e. a shape of (103, 5).

In [18]:
df_postalcodes_toronto.shape

(103, 5)

In [19]:
print('The Toronto dataframe has {} boroughs and {} neighborhoods.'.format(
        len(set(df_postalcodes_toronto['Borough'])),
        df_postalcodes_toronto.shape[0]
    )
)

The Toronto dataframe has 11 boroughs and 103 neighborhoods.


Now let's take a look at the number of neighborhoods in each borough.

In [20]:
df_postalcodes_toronto['Borough'].value_counts()

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

Let's save the resulting dataframe to a CSV file so that it can be reused later or in other notebooks.

In [21]:
df_postalcodes_toronto.to_csv('df_postalcodes_toronto.csv', index = False)