# Segmenting and Clustering Neighborhoods in Toronto

## 1. Neighborhoods in Toronto: Dataframe

In this section, we extract data on the administrative divisions of Toronto from *https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M* and build a dataframe.

We start by copying and pasting the data into a csv file in which the attributes are separated by tabs (rather than commas), and then load that file into a pandas dataframe.
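
A tab is a convenient separator here because several neighbourhood cells contain commas themselves. The following is a minimal sketch of that behaviour, assuming a three-row excerpt typed in by hand (`sample_tsv` and `sample` are illustrative names, not part of the notebook):

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the pasted, tab-separated text.
sample_tsv = (
    "Postal Code\tBorough\tNeighbourhood\n"
    "M1A\tNot assigned\tNot assigned\n"
    "M3A\tNorth York\tParkwoods\n"
    "M5A\tDowntown Toronto\tRegent Park, Harbourfront\n"
)

# sep="\t" splits each line on tabs only, so a comma inside a field stays data.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample.shape)
print(sample.loc[2, "Neighbourhood"])
```

With a comma separator, the last row would be split into four fields instead of three.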

In [1]:
import pandas as pd

In [2]:
# tn: dataframe holding Toronto's neighborhood data

tn = pd.read_csv('toronto.csv', delimiter='\t')

In [3]:
print(tn.shape,'\n')
tn.head(10)

(180, 3) 



Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


In [4]:
tn.describe()

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,180,180,180
unique,180,11,101
top,M5Y,Not assigned,Not assigned
freq,1,77,76


The summary shows that no postal code is repeated in our dataframe (180 unique values in 180 rows). The preview of the first rows also shows that several postal codes cover more than one neighborhood.
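
Both observations can also be checked directly instead of being read off the summary. This is a small sketch on a hand-typed excerpt (the `sample` dataframe is an assumption for illustration; in the notebook the same calls would be applied to `tn`):

```python
import pandas as pd

# Hypothetical excerpt of the table; the notebook's real dataframe is `tn`.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto", "North York"],
    "Neighbourhood": [
        "Parkwoods",
        "Victoria Village",
        "Regent Park, Harbourfront",
        "Lawrence Manor, Lawrence Heights",
    ],
})

# True when every postal code appears exactly once.
codes_unique = sample["Postal Code"].is_unique

# Rows whose Neighbourhood cell lists several names separated by commas.
multi = sample["Neighbourhood"].str.contains(",", regex=False)
n_multi = int(multi.sum())

print(codes_unique, n_multi)
```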

In [5]:
tn['Borough']=='Not assigned'

0       True
1       True
2      False
3      False
4      False
       ...  
175     True
176     True
177     True
178    False
179     True
Name: Borough, Length: 180, dtype: bool

We create a new dataframe that excludes the postal codes without an assigned borough.

In [6]:
# tnp: "tn prime", the Toronto neighborhood dataframe without unassigned boroughs

tnp=tn[tn['Borough']!='Not assigned']
tnp.shape

(103, 3)

In [7]:
tnp.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [8]:
tnp.reset_index(drop=True,inplace=True)
tnp.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
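
The filtering and re-indexing steps can be combined into one expression. The sketch below assumes a four-row excerpt (`sample`, `assigned`, and `filtered` are illustrative names) and takes an explicit copy so that the result is independent of the original dataframe:

```python
import pandas as pd

# Hypothetical excerpt with two unassigned and two assigned postal codes.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose borough is assigned.
assigned = sample["Borough"] != "Not assigned"

# Keep the assigned rows, copy them, and renumber the index from 0.
filtered = sample.loc[assigned].copy().reset_index(drop=True)

print(filtered.shape)
print(list(filtered.index))
```

Because `drop=True` is passed, the old row labels (2 and 3 here) are discarded rather than kept as an extra column.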


In [9]:
tnp.describe()

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,103,103,103
unique,103,10,99
top,M2L,North York,Downsview
freq,1,24,4


We check whether there are boroughs without an assigned neighborhood:

In [10]:
tnpp=tnp[tnp['Neighbourhood']=='Not assigned']
tnpp.shape

(0, 3)

There are none!
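
An alternative way to run the same check is to count the placeholder value in every column at once. This is only a sketch on a hand-typed excerpt (`sample` and `unassigned_counts` are illustrative names):

```python
import pandas as pd

# Hypothetical excerpt of the filtered dataframe.
sample = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto", "Etobicoke"],
    "Neighbourhood": [
        "Parkwoods",
        "Regent Park, Harbourfront",
        "Islington Avenue, Humber Valley Village",
    ],
})

# eq() compares every cell with the placeholder; sum() counts matches per column.
unassigned_counts = sample.eq("Not assigned").sum()

print(unassigned_counts.to_dict())
```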

Finally, we rename the third column from 'Neighbourhood' to 'Neighborhood':

In [11]:
# Assign the result instead of renaming in place: tnp was obtained by filtering tn,
# and in-place changes on such a slice trigger pandas' SettingWithCopyWarning.
tnp=tnp.rename(columns={'Neighbourhood':'Neighborhood'})


In [12]:
tnp.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
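
Renaming in place on a dataframe that was obtained by filtering another one can make pandas emit a `SettingWithCopyWarning`. One common way to avoid it, sketched here on a two-row excerpt (`sample`, `subset`, and `renamed` are illustrative names), is to take an explicit copy of the filtered rows and assign the result of `rename`:

```python
import pandas as pd

# Hypothetical excerpt; the notebook's real dataframes are `tn` and `tnp`.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A"],
    "Borough": ["Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# .copy() makes the filtered rows an independent dataframe.
subset = sample[sample["Borough"] != "Not assigned"].copy()

# rename() returns a new dataframe; the original column names are untouched.
renamed = subset.rename(columns={"Neighbourhood": "Neighborhood"})

print(list(renamed.columns))
print(list(sample.columns))
```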


In [13]:
tnp.shape

(103, 3)

The Toronto neighborhoods dataframe has 103 rows and 3 columns.

Finally, we save the dataframe to a .csv file, to be used in Section 2.

In [14]:
tnp.to_csv('torontop.csv')
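
Note that `to_csv` writes the row index as an unnamed first column by default, so a plain `pd.read_csv` on the saved file will show an extra `Unnamed: 0` column. The sketch below illustrates the difference on a two-row excerpt, using in-memory buffers instead of files (`sample` and the buffer names are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical excerpt of the final dataframe.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```

Either passing `index=False` when saving or `index_col=0` when loading keeps the three original columns; which one fits depends on how Section 2 reads the file.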