### In the cell below I'll be importing the libraries needed for this notebook:

In [1]:
import pandas as pd 
import numpy as np


!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

!pip install pgeocode
import pgeocode

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Using pd.read_html() to read the Toronto postal code tables from the Wikipedia page and storing the result in the variable 'toronto_wiki_df'.

In [4]:
toronto_wiki_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

### Using the function len() to retrieve the number of tables downloaded.

In [5]:
print('Total tables:', len(toronto_wiki_df))

Total tables: 3


### Saving only the 1st table, which is located at index [0], back into the variable 'toronto_wiki_df'...  'Eew!'

In [6]:
toronto_wiki_df = toronto_wiki_df[0]
toronto_wiki_df

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned
5,M1HScarborough(Cedarbrae),M2HNorth York(Hillcrest Village),M3HNorth York(Bathurst Manor / Wilson Heights ...,M4HEast York(Thorncliffe Park),M5HDowntown Toronto(Richmond / Adelaide / King),M6HWest Toronto(Dufferin / Dovercourt Village),M7HNot assigned,M8HNot assigned,M9HNot assigned
6,M1JScarborough(Scarborough Village),M2JNorth York(Fairview / Henry Farm / Oriole),M3JNorth York(Northwood Park / York University),M4JEast YorkEast Toronto(The Danforth East),M5JDowntown Toronto(Harbourfront East / Union ...,M6JWest Toronto(Little Portugal / Trinity),M7JNot assigned,M8JNot assigned,M9JNot assigned
7,M1KScarborough(Kennedy Park / Ionview / East B...,M2KNorth York(Bayview Village),M3KNorth York(Downsview)East (CFB Toronto),M4KEast Toronto(The Danforth West / Riverdale),M5KDowntown Toronto(Toronto Dominion Centre / ...,M6KWest Toronto(Brockton / Parkdale Village / ...,M7KNot assigned,M8KNot assigned,M9KNot assigned
8,M1LScarborough(Golden Mile / Clairlea / Oakridge),M2LNorth York(York Mills / Silver Hills),M3LNorth York(Downsview)West,M4LEast Toronto(India Bazaar / The Beaches West),M5LDowntown Toronto(Commerce Court / Victoria ...,M6LNorth York(North Park / Maple Leaf Park / U...,M7LNot assigned,M8LNot assigned,M9LNorth York(Humber Summit)
9,M1MScarborough(Cliffside / Cliffcrest / Scarbo...,M2MNorth York(Willowdale / Newtonbrook),M3MNorth York(Downsview)Central,M4MEast Toronto(Studio District),M5MNorth York(Bedford Park / Lawrence Manor East),M6MYork(Del Ray / Mount Dennis / Keelsdale and...,M7MNot assigned,M8MNot assigned,M9MNorth York(Humberlea / Emery)


### Stacking all the columns with the stack() function. It returns a Series, so I converted it into a dataFrame and saved it to a variable called 'toronto_stack'.

In [7]:
toronto_stack = toronto_wiki_df.stack()
toronto_stack = pd.DataFrame(toronto_stack)
toronto_stack.head()

Unnamed: 0,Unnamed: 1,0
0,0,M1ANot assigned
0,1,M2ANot assigned
0,2,M3ANorth York(Parkwoods)
0,3,M4ANorth York(Victoria Village)
0,4,M5ADowntown Toronto(Regent Park / Harbourfront)
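
As a small aside, here is a sketch of what .stack() does on a toy table. The 'grid' dataFrame below is a made-up 2x2 stand-in for the Wikipedia table, not the real data. It shows that the result is a Series with a two-level (row, column) index, and that .to_frame() is one possible alternative to wrapping the Series in pd.DataFrame().

```python
import pandas as pd

# Hypothetical 2x2 grid standing in for the Wikipedia table
grid = pd.DataFrame([
    ["M1ANot assigned", "M3ANorth York(Parkwoods)"],
    ["M1BScarborough(Malvern / Rouge)", "M4ANorth York(Victoria Village)"],
])

stacked = grid.stack()           # Series indexed by (row, column)
stacked_df = stacked.to_frame()  # same values as a one-column DataFrame

print(type(stacked).__name__)
print(stacked.index.nlevels)
print(stacked_df.shape)
```

The extra index level created here is the one that gets removed later with .droplevel().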


### Using .drop() to eliminate all samples which contain the string 'Not assigned'. I achieve this by using .str.contains() to locate all string values in the target column that contain 'Not assigned'. Finally I save the result to the variable 'toronto_stack'.

In [8]:
toronto_stack = toronto_stack.drop(toronto_stack[toronto_stack[0].str.contains('Not assigned')].index.tolist())
toronto_stack.head()

Unnamed: 0,Unnamed: 1,0
0,2,M3ANorth York(Parkwoods)
0,3,M4ANorth York(Victoria Village)
0,4,M5ADowntown Toronto(Regent Park / Harbourfront)
0,5,M6ANorth York(Lawrence Manor / Lawrence Heights)
0,6,M7AQueen's Park(Ontario Provincial Government)
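
The same filtering can also be sketched with a boolean mask instead of .drop(). This is only an alternative under the assumption that the marker text is spelled exactly 'Not assigned' (.str.contains() is case-sensitive by default). The 'cells' dataFrame is a made-up sample.

```python
import pandas as pd

# Made-up sample of stacked cells
cells = pd.DataFrame({0: ["M1ANot assigned",
                          "M3ANorth York(Parkwoods)",
                          "M2ANot assigned"]})

# True where the cell contains the marker text (plain substring match)
mask = cells[0].str.contains("Not assigned", regex=False)

# Keep only the rows where the mask is False
assigned = cells[~mask]
print(assigned)
```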


### Using .droplevel() I'm eliminating the additional row index level which was created when I stacked all the columns with .stack() in 'cell#7' of this notebook. To make sure everything runs smoothly I'm also using .reset_index() to reset the table's index.

In [9]:
toronto_stack = toronto_stack.droplevel(level=0, axis=0).reset_index(drop=True)
toronto_stack.head()

Unnamed: 0,0
0,M3ANorth York(Parkwoods)
1,M4ANorth York(Victoria Village)
2,M5ADowntown Toronto(Regent Park / Harbourfront)
3,M6ANorth York(Lawrence Manor / Lawrence Heights)
4,M7AQueen's Park(Ontario Provincial Government)


### In the next 3 code cells I'll be dividing the only column in this dataFrame into 3 different columns. The 1st column will contain the [PostalCode], the 2nd the [Borough], and the 3rd the [Neighborhood] for the city of Toronto.

1st - In the following lines of code I'll be creating the 1st column, which contains all postal codes. Since I know every postal code occupies exactly the first 3 positions of the string, I'll use .str.slice() with start at index '0' and stop at index '3' to slice it, creating a new 1-column table. Initially I'll save it to a variable called 'toronto_1st_slice' and ultimately to a variable called 'col1'.

Apply .strip() to eliminate all white spaces

In [10]:
toronto_1st_slice = toronto_stack[0].str.slice(start=0, stop=3)
col1 = toronto_1st_slice
col1 = col1.str.strip()
col1 = pd.DataFrame(col1)
col1.head()

Unnamed: 0,0
0,M3A
1,M4A
2,M5A
3,M6A
4,M7A


### 2nd - Still using the toronto_stack dataFrame from 'cell#9', I'll use .str.slice() again, but this time with the start parameter at index 3 instead of '0' like in the previous cell. This will allow me to capture the whole string from index 3 onward. I'll save the result to the variable 'toronto_2nd_slice'.
Take a look at the dataFrame in code 'cell#9'. See how the neighborhoods inside the column values are located between parentheses?!  
With the variable 'toronto_2nd_slice', applying .str.split(), I'll use the first opening parenthesis ('(') as a separator to split the 'Boroughs' from the 'Neighborhoods'. 
I'll use the parameters 'n=1' to split only at the first occurrence of '(', and 'expand=True' to split the one-column data into 2 separate columns, stored in 'toronto_split'.
After splitting 'toronto_2nd_slice' into 2 columns, I'll need only the 1st column for the next step. I'll achieve this by storing the 1st column of the 'toronto_split' table in a variable called 'col2'.

Now we have a single column table with just the 'Boroughs'.

Apply .strip() to eliminate all white spaces

In [11]:
toronto_2nd_slice = toronto_stack[0].str.slice(start=3)
toronto_split = toronto_2nd_slice.str.split('(', n=1, expand=True)
col2 = toronto_split[0]
col2 = col2.str.strip()
col2 = pd.DataFrame(col2)
col2.head()

Unnamed: 0,0
0,North York
1,North York
2,Downtown Toronto
3,North York
4,Queen's Park
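
To make the effect of 'n=1' and 'expand=True' easier to see, here is a minimal sketch on two made-up strings shaped like the sliced values. The names 'tails' and 'parts' are only for this illustration.

```python
import pandas as pd

# Made-up strings shaped like the values after slicing off the postal code
tails = pd.Series(["North York(Parkwoods)", "Scarborough(Malvern / Rouge)"])

# Split once at the first '(' and spread the two pieces over two columns
parts = tails.str.split("(", n=1, expand=True)
print(parts)
```

Column 0 holds the borough and column 1 holds the rest, still carrying the closing ')', which is why the neighborhood column needs extra cleaning.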


### 3rd - In this step I'll use .str.replace() and .str.strip() to clean the 2nd column of the 'toronto_split' dataFrame of all unwanted characters. I'll store the result in a variable called 'col3'.

In [12]:
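# Turn ')' into a space, turn ' / ' separators into ', ', then trim the ends.
# regex=False treats ')' as plain text regardless of the pandas version.
col3 = (toronto_split[1]
        .str.replace(')', ' ', regex=False)
        .str.replace('/', ',', regex=False)
        .str.replace(' ,', ',', regex=False)
        .str.strip())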
col3 = pd.DataFrame(col3)
col3.head()

Unnamed: 0,1
0,Parkwoods
1,Victoria Village
2,"Regent Park, Harbourfront"
3,"Lawrence Manor, Lawrence Heights"
4,Ontario Provincial Government
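
The three slicing and splitting steps above could possibly be combined into a single .str.extract() call. The sketch below is an alternative, not the method used in this notebook, and it assumes every cell has exactly one pair of parentheses at the end. Cells such as 'Don Mills)North' would not match this pattern and would come back as NaN. The pattern and the 'raw' sample are my own assumptions.

```python
import pandas as pd

# Made-up sample of cleaned, stacked cells
raw = pd.Series(["M3ANorth York(Parkwoods)",
                 "M5ADowntown Toronto(Regent Park / Harbourfront)"])

# Assumed layout: 3-character postal code, borough text, then '(neighborhoods)'
pattern = r"^(?P<PostalCode>[A-Z]\d[A-Z])(?P<Borough>[^(]+)\((?P<Neighborhood>.*)\)$"
parts = raw.str.extract(pattern)

# Use ', ' instead of ' / ' between neighborhoods
parts["Neighborhood"] = parts["Neighborhood"].str.replace(" / ", ", ", regex=False)
print(parts)
```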


### In these final steps I'll reset the indexes of [col1], [col2] and [col3] and then use 'pd.concat()' to create a brand new dataFrame from [col1, col2, col3]. The 'axis' parameter is set to '1' so the tables align horizontally, forming 3 columns.
Next I'll store the result in the variable 'toronto_df', name the columns, and lastly print the size of the dataFrame with '.shape'. Duplicates are dropped with 'drop_duplicates' in a later cell. The result is a dataFrame with 103 rows and 3 columns.

In [13]:
# reset_index() returns a new object, so the result must be assigned back
col1 = col1.reset_index(drop=True)
col2 = col2.reset_index(drop=True)
col3 = col3.reset_index(drop=True)
toronto_df = pd.concat([col1, col2, col3], axis=1)
toronto_df.columns = ['PostalCode','Borough','Neighborhood']
print(toronto_df.shape)
toronto_df

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


### As final touches to the 'Borough' column I'll replace the entries that have postal processing center text fused into them with cleaner borough names, using '.replace()'. I'll also add the missing spaces to some of the borough names.

In [14]:
toronto_df['Borough'] = toronto_df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})

### Drop any duplicates in the dataFrame because no one likes a copycat. Too many copycats out here.

In [15]:
toronto_df.drop_duplicates(inplace=True)

### Finally, Display the dataFrame using .head()

In [16]:
print(toronto_df.shape)
toronto_df.head()

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


In [17]:
toronto_df.to_csv('toronto_df.csv', index=False)

### In this 2nd step of the assignment I'll be retrieving the coordinates of Toronto's neighborhoods using their respective postal codes.

pgeocode was already pip installed and imported in the first cell of this notebook.  
First I will use .tolist() to convert the ['PostalCode'] column of the dataFrame to a list and store it in a new variable called 'postal_code'.  
Afterwards I'll create a geocode object with the parameter set to ('ca') for Canada and store it in a variable called 'nomi'.  
Now I'll use .query_postal_code() and feed it the variable 'postal_code', which contains the list I created from the column 'PostalCode'.  
Finally I'll store the coordinate results in variables called 'latitude' and 'longitude'.  

In [18]:
postal_code = toronto_df['PostalCode'].tolist()

nomi = pgeocode.Nominatim('ca')
location = nomi.query_postal_code(postal_code)
latitude = location.latitude
longitude = location.longitude

### In the cell below I'll put the coordinate results into a pd.DataFrame(), .transpose() it, convert the values to float using .astype() and finally store it in a variable called 'coordinates'.

In [19]:
coordinates = pd.DataFrame([latitude,longitude]).transpose().astype(float)
print(coordinates.shape)
coordinates.head()

(103, 2)


Unnamed: 0,latitude,longitude
0,43.7545,-79.33
1,43.7276,-79.3148
2,43.6555,-79.3626
3,43.7223,-79.4504
4,43.6641,-79.3889


### In this step I'll combine the dataFrames 'toronto_df' and 'coordinates' using pd.concat() with the parameter 'axis' set to '1' so they join horizontally.
The result is a new dataframe called 'new_toronto_df'.

In [20]:
new_toronto_df = pd.concat([toronto_df, coordinates], axis=1)
print(new_toronto_df.shape)
new_toronto_df

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Queen's Park,Ontario Provincial Government,43.6641,-79.3889
5,M9A,Etobicoke,Islington Avenue,43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills North,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
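
Joining side by side with pd.concat() relies on both tables being in the same row order. As a hedged alternative, the sketch below joins on the postal code itself with .merge(), so the order no longer matters. The 'codes' and 'coords' dataFrames are made-up stand-ins with illustrative values, and the third postal code is deliberately left without coordinates.

```python
import pandas as pd

# Made-up stand-ins for the postal code table and the coordinate lookup
codes = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M7R"],
                      "Borough": ["North York", "North York", "Mississauga"]})
coords = pd.DataFrame({"PostalCode": ["M4A", "M3A"],          # different order
                       "latitude": [43.7276, 43.7545],
                       "longitude": [-79.3148, -79.3300]})

# Left join keeps every postal code; missing coordinates become NaN
merged = codes.merge(coords, on="PostalCode", how="left")
print(merged)
```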


### Lastly in this step I'll be dropping the only row with 'NaN' values, where the coordinates are missing. 'Mississauga' is not a borough of Toronto. It is a postal center located in the city of Mississauga, though its postal code belongs to Toronto. My speculation is that this postal center probably serves the Toronto area, but it's not a borough.
Next I'll reset the index, store the result in the variable 'new_toronto_df' and finally display it.

In [21]:
new_toronto_df.dropna(inplace=True)
new_toronto_df = new_toronto_df.reset_index(drop=True)
print(new_toronto_df.shape)
new_toronto_df

(102, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Queen's Park,Ontario Provincial Government,43.6641,-79.3889
5,M9A,Etobicoke,Islington Avenue,43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills North,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783
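
One detail worth a quick sketch: .reset_index() returns a new dataFrame instead of changing the existing one, so the result has to be assigned back. The 'df' below is a made-up three-row sample where the middle row stands in for the row with missing coordinates.

```python
import numpy as np
import pandas as pd

# Made-up sample; the middle row has no coordinates
df = pd.DataFrame({"PostalCode": ["M3A", "M7R", "M4A"],
                   "latitude": [43.7545, np.nan, 43.7276]})

cleaned = df.dropna(subset=["latitude"])
index_after_drop = list(cleaned.index)      # a gap is left where the NaN row was

cleaned.reset_index(drop=True)              # result discarded, index unchanged
index_not_assigned = list(cleaned.index)

cleaned = cleaned.reset_index(drop=True)    # assigned back, index renumbered
print(index_after_drop, index_not_assigned, list(cleaned.index))
```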


### Save the dataFrame as a csv file.

In [22]:
new_toronto_df.to_csv('new_toronto.csv', index=False)

### I decided to group the neighborhoods in the city of Toronto into clusters by their respective 'Boroughs'. So in the cell below I'll use .nunique() to find out how many unique values (different boroughs) the column ['Borough'] contains.

In [23]:
print('It looks like the Borough column has {} unique values'.format(new_toronto_df.Borough.nunique()))

It looks like the Borough column has 14 unique values


### In this next step after using .groupby() and .count() I was able to discover how many neighborhoods exist in each of the 14 boroughs according to the dataFrame.

In [24]:
neighborhood_count = new_toronto_df.groupby(['Borough']).count()
neighborhood_count =  neighborhood_count.drop(['Neighborhood', 'latitude', 'longitude'], axis=1).rename(columns={'PostalCode':'Neighborhoods per Borough'})
neighborhood_count

Unnamed: 0_level_0,Neighborhoods per Borough
Borough,Unnamed: 1_level_1
Central Toronto,9
Downtown Toronto,17
Downtown Toronto Stn A,1
East Toronto,4
East Toronto Business,1
East York,4
East York/East Toronto,1
Etobicoke,11
Etobicoke Northwest,1
North York,24


### For this assignment it's required that we do an analysis of the city of Toronto using a clustering algorithm called KMeans. Since KMeans works on numeric features and can't use categorical variables directly, in the cell below I'll use .get_dummies(). This converts the column ['Borough'] into 14 numerical indicator columns, one per borough, and stores the result in a variable called 'toronto_onehot'. Each of the 102 samples (rows) gets a '1' in the column of the borough it belongs to and a '0' everywhere else, so samples from the same borough end up with identical rows. Each sample represents a group of neighborhoods.

In [25]:
toronto_onehot = pd.get_dummies(new_toronto_df[['Borough']], prefix="", prefix_sep="")
print(toronto_onehot.shape)
toronto_onehot.head()

(102, 14)


Unnamed: 0,Central Toronto,Downtown Toronto,Downtown Toronto Stn A,East Toronto,East Toronto Business,East York,East York/East Toronto,Etobicoke,Etobicoke Northwest,North York,Queen's Park,Scarborough,West Toronto,York
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,1,0,0,0
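
Here is a minimal sketch of what .get_dummies() produces, on a made-up three-row sample. I pass 'dtype=int' as a precaution, since newer pandas versions return True/False columns by default rather than 0/1.

```python
import pandas as pd

# Made-up sample with two boroughs
df = pd.DataFrame({"Borough": ["North York", "Downtown Toronto", "North York"]})

# One indicator column per borough, named after the borough itself
onehot = pd.get_dummies(df[["Borough"]], prefix="", prefix_sep="", dtype=int)
print(onehot)
```

Every row sums to 1 because each sample belongs to exactly one borough.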


### Next I'll fit KMeans on the 'toronto_onehot' dataFrame. KMeans groups samples by the distance between their rows, and since samples from the same borough have identical one-hot rows, they receive the same label number, tagging each of the 102 samples (rows) with the borough it belongs to. Each sample represents one neighborhood or a group of neighborhoods. The model uses 14 clusters ('kclusters'), one for each of the 14 boroughs in the dataFrame.

In [26]:
kclusters = 14

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_onehot)

kmeans.labels_

array([ 3,  3,  2,  3, 10,  4,  1,  3,  9,  2,  3,  4,  1,  3,  9,  2,  8,
        4,  1,  7,  2,  8,  1,  9,  2,  2,  1,  3,  3,  9,  2,  6,  1,  3,
        3, 12,  2,  6,  1,  3,  3,  7,  2,  6,  1,  3,  3,  7,  2,  3,  3,
        1,  3,  3,  7,  3,  8,  3,  1,  3,  3,  5,  5,  8,  8,  1,  3,  5,
        5,  6,  4,  1,  3,  5,  5,  6,  4,  1,  5,  2,  6,  1,  5,  2,  1,
        5,  2,  4,  4,  1,  2, 11,  4, 13,  1,  2,  2,  4,  2,  0,  4,  4])
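
To check that the labels really line up with the boroughs, a cross-tabulation can help. The sketch below uses a made-up six-row sample and assumes scikit-learn is installed. 'n_init=10' is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up sample with three boroughs, two rows each
boroughs = pd.Series(["North York", "Scarborough", "North York",
                      "York", "Scarborough", "York"], name="Borough")
onehot = pd.get_dummies(boroughs, dtype=int)

# One cluster per distinct borough
labels = KMeans(n_clusters=boroughs.nunique(), n_init=10,
                random_state=0).fit_predict(onehot)

# Rows are boroughs, columns are cluster labels, cells are row counts
table = pd.crosstab(boroughs, labels)
print(table)
```

If each borough maps to exactly one label, every row and every column of the table has a single non-zero cell. The label numbers themselves are arbitrary.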

### Now I'll store a copy of the 'new_toronto_df' dataFrame in a variable called 'toronto_clustered' and insert a new column with the 'kmeans.labels_'.

In [27]:
toronto_clustered = new_toronto_df.copy()

toronto_clustered.insert(0, 'KmeanLabels', kmeans.labels_)

### Below is a display of the .head() for the new dataFrame with the KmeanLabels column.

In [29]:
toronto_clustered.head()

Unnamed: 0,KmeanLabels,PostalCode,Borough,Neighborhood,latitude,longitude
0,3,M3A,North York,Parkwoods,43.7545,-79.33
1,3,M4A,North York,Victoria Village,43.7276,-79.3148
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,10,M7A,Queen's Park,Ontario Provincial Government,43.6641,-79.3889


### In order to create a map of Toronto I will need its coordinates. For this task I'll use 'Nominatim' from the 'geopy.geocoders' library to get the city's 'latitude' and 'longitude'.

In [30]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude_tor = location.latitude
longitude_tor = location.longitude
print('The geographical coordinates for the city of Toronto are {}, {}.'.format(latitude_tor, longitude_tor))

The geographical coordinates for the city of Toronto are 43.6534817, -79.3839347.


### Now let's try to finish our analysis by displaying the results on a folium map. Each sample of the dataFrame will be represented by a colored datapoint on the map. The color indicates which cluster the datapoint belongs to. The colored clusters of datapoints represent the boroughs in the Toronto dataFrame.

### After analysing the map I noticed some isolated datapoints in the midst of clusters, and the colors of these datapoints don't match those clusters. These datapoints represent 'Boroughs' which don't exist in the city of Toronto.
Let's look at one of the datapoints: the lonely light red datapoint on the northwest side of Toronto. It belongs to 'Etobicoke Northwest', Cluster 13. There is no borough 'Etobicoke Northwest', so let's add it to the 'Etobicoke' cluster.   
### Let's look at another datapoint.  
### How about that confused 'East York/East Toronto', Cluster 12, orange colored datapoint on the east side of 'Downtown Toronto'? Since it looks like this datapoint is in the heart of 'East York' I'll add it to the 'East York' cluster, though it might belong to both boroughs. I'll keep it this way for the sake of simplicity.  
### I'll be repeating the same action with the other outcast datapoints on the map. To achieve this goal we will need to go back and clean up the column 'Borough' in the dataFrame.  
### Before we do that we need to know how many boroughs really exist in Toronto. According to a Google search Toronto has 6 boroughs.  
They are:  
- York  
- East York  
- North York  
- Etobicoke   
- Scarborough  
- Toronto  

It looks like in this dataset the borough of Toronto is divided into 4 separate boroughs.  

- Downtown Toronto  
- Central Toronto  
- East Toronto  
- West Toronto  

I'll be keeping the borough of Toronto divided. So ultimately this dataset should have 9 boroughs instead of 6.  

To achieve this goal we need to go back and analyse the dataset again.

In [31]:
map_clusters = folium.Map(location=[latitude_tor, longitude_tor], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, neighborhood, borough, cluster in zip(toronto_clustered['latitude'], toronto_clustered['longitude'], toronto_clustered['Neighborhood'], toronto_clustered['Borough'], toronto_clustered['KmeanLabels']):
    label = folium.Popup('Neighborhood: ' + str(neighborhood) + ' - Borough: ' + str(borough) + ' - Cluster: ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### For this dataset revision let's create a copy of 'new_toronto_df' and store it in a brand new variable called 'mod_borough_df'.

In [32]:
mod_borough_df = new_toronto_df.copy()

### I decided to print and analyse the unique values in the column ['Borough']. Let's look for names that are not among the 9 boroughs listed above.

In [33]:
print(mod_borough_df.Borough.unique())

['North York' 'Downtown Toronto' "Queen's Park" 'Etobicoke' 'Scarborough'
 'East York' 'York' 'East Toronto' 'West Toronto' 'East York/East Toronto'
 'Central Toronto' 'Downtown Toronto Stn A' 'Etobicoke Northwest'
 'East Toronto Business']
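
Spotting the odd names by eye works for a list this short, but the comparison can also be sketched in code with a set difference. The 'expected' set holds the 9 borough names this notebook settles on, and 'boroughs' is a made-up four-row sample rather than the full column.

```python
import pandas as pd

# The 9 borough names kept in this notebook
expected = {"York", "East York", "North York", "Etobicoke", "Scarborough",
            "Downtown Toronto", "Central Toronto", "East Toronto", "West Toronto"}

# Made-up sample of the 'Borough' column
boroughs = pd.Series(["North York", "Queen's Park", "Etobicoke Northwest", "York"])

# Names present in the data but not in the expected set
unexpected = sorted(set(boroughs) - expected)
print(unexpected)
```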


### I'll manually collect datapoint labels by using their location on the map, then replace their erroneous borough names with one of the 9 boroughs in the city of Toronto.  
### Again, to complete this task I'll be using the map above to collect the borough names of the clusters nearest to the datapoints that need to be changed. In this specific dataset there aren't too many datapoints that need changing, and the choices are visually obvious.  
### This should be an easy manual task.  

In [34]:
mod_borough_df['Borough']= mod_borough_df['Borough'].replace({"Queen's Park":'Downtown Toronto','Downtown Toronto Stn A':'Downtown Toronto','Etobicoke Northwest':'Etobicoke','East Toronto Business':'Scarborough','East York/East Toronto':'East York'})

### Next let's print our 'mod_borough_df' dataFrame and take a quick look to make sure we are on the right track.

In [35]:
mod_borough_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,Ontario Provincial Government,43.6641,-79.3889


### Now let's check how many unique values we have in the ['Borough'] column.

In [36]:
print('It looks like the Borough column has {} unique values'.format(mod_borough_df.Borough.nunique()))

It looks like the Borough column has 9 unique values


### By reducing the number of boroughs in the dataset, the numbers of neighborhoods in certain boroughs should slightly increase.

In [37]:
new_neighborhood_count = mod_borough_df.groupby(['Borough']).count()
new_neighborhood_count =  new_neighborhood_count.drop(['Neighborhood', 'latitude', 'longitude'], axis=1).rename(columns={'PostalCode':'Neighborhoods per Borough'})
new_neighborhood_count

Unnamed: 0_level_0,Neighborhoods per Borough
Borough,Unnamed: 1_level_1
Central Toronto,9
Downtown Toronto,19
East Toronto,4
East York,5
Etobicoke,12
North York,24
Scarborough,18
West Toronto,6
York,5


### Let's create a new numeric dataset with .get_dummies() for the ['Borough'] column. By printing the dimensions of this dataset we can notice a reduction from 14 columns to 9 columns.

In [38]:
toronto_mod_onehot = pd.get_dummies(mod_borough_df[['Borough']], prefix="", prefix_sep="")
print(toronto_mod_onehot.shape)
toronto_mod_onehot.head()

(102, 9)


Unnamed: 0,Central Toronto,Downtown Toronto,East Toronto,East York,Etobicoke,North York,Scarborough,West Toronto,York
0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,1,0,0,0
2,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0
4,0,1,0,0,0,0,0,0,0


### Again we will use KMeans to create some labels. Only this time we change the number of clusters to 9 instead of 14, which represents the number of boroughs in this new modified dataset.

In [39]:
mod_kclusters = 9

mod_kmeans = KMeans(n_clusters=mod_kclusters, random_state=0).fit(toronto_mod_onehot)

mod_kmeans.labels_

array([1, 1, 2, 1, 2, 3, 0, 1, 7, 2, 1, 3, 0, 1, 7, 2, 6, 3, 0, 8, 2, 6,
       0, 7, 2, 2, 0, 1, 1, 7, 2, 5, 0, 1, 1, 7, 2, 5, 0, 1, 1, 8, 2, 5,
       0, 1, 1, 8, 2, 1, 1, 0, 1, 1, 8, 1, 6, 1, 0, 1, 1, 4, 4, 6, 6, 0,
       1, 4, 4, 5, 3, 0, 1, 4, 4, 5, 3, 0, 4, 2, 5, 0, 4, 2, 0, 4, 2, 3,
       3, 0, 2, 2, 3, 3, 0, 2, 2, 3, 2, 0, 3, 3])

### Now we create a copy of this modified dataset and store it in a new variable and insert the column with the new Kmean_labels.

In [40]:
mod_toronto_clustered = mod_borough_df.copy()

mod_toronto_clustered.insert(0, 'KmeanLabels', mod_kmeans.labels_)

### This is how our modified new set looks.

In [41]:
mod_toronto_clustered.head()

Unnamed: 0,KmeanLabels,PostalCode,Borough,Neighborhood,latitude,longitude
0,1,M3A,North York,Parkwoods,43.7545,-79.33
1,1,M4A,North York,Victoria Village,43.7276,-79.3148
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,1,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,2,M7A,Downtown Toronto,Ontario Provincial Government,43.6641,-79.3889


### Finally let's look at the folium map to make sure the datapoints reflect the changes we made in the dataset.

In [42]:
# Start from a fresh map so the markers from the 14-cluster run are not drawn underneath
mod_map_clusters = folium.Map(location=[latitude_tor, longitude_tor], zoom_start=11)

# One color per cluster in the modified dataset
colors_array = cm.rainbow(np.linspace(0, 1, mod_kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, neighborhood, borough, cluster in zip(mod_toronto_clustered['latitude'], mod_toronto_clustered['longitude'], mod_toronto_clustered['Neighborhood'], mod_toronto_clustered['Borough'], mod_toronto_clustered['KmeanLabels']):
    label = folium.Popup('Neighborhood: ' + str(neighborhood) + ' - Borough: ' + str(borough) + ' - Cluster: ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(mod_map_clusters)

mod_map_clusters

### Once again by visually analysing the map I can clearly identify and distinguish each cluster of neighborhoods in their respective boroughs. 

### This concludes the Week_3 assignment for the Coursera Data Science Capstone Project.