# Course Final Project: Applied Data Science

### Project Name: Nested Clustering

##### This project implements a nested clustering process, which analyzes clusters containing a large number of elements (neighborhoods) at a second level of granularity. The hypothesis is that re-clustering the target cluster yields a sub-classification whose new sub-groups let the client select a preferred neighborhood better and/or faster.

##### The project uses location data for Toronto, Canada, specifically the coordinates of the city's neighborhoods. Foursquare location data is also used to obtain the businesses (venues) within a 500-meter radius of each neighborhood's center.
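
The nested process described above can be sketched on synthetic data before applying it to the Toronto venues. This is only a minimal illustration under stated assumptions: the helper name `nested_kmeans`, the values of `k_outer` and `k_inner`, and the random blobs are hypothetical and not part of the project data.

```python
import numpy as np
from sklearn.cluster import KMeans

def nested_kmeans(X, k_outer=3, k_inner=2, random_state=0):
    """Cluster X, then re-cluster only the largest first-level cluster."""
    outer = KMeans(n_clusters=k_outer, n_init=10, random_state=random_state).fit(X)
    labels = outer.labels_
    # the first-level cluster with the most members is the one to subdivide
    target = int(np.bincount(labels).argmax())
    mask = labels == target
    inner = KMeans(n_clusters=k_inner, n_init=10, random_state=random_state).fit(X[mask])
    # -1 marks rows that do not belong to the target cluster
    sub_labels = np.full(len(X), -1)
    sub_labels[mask] = inner.labels_
    return labels, target, sub_labels

# hypothetical data: one large blob and two small ones
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(60, 2)),
    rng.normal(5.0, 0.5, size=(10, 2)),
    rng.normal(-5.0, 0.5, size=(10, 2)),
])
labels, target, sub_labels = nested_kmeans(X)
```

Only the rows of the dominant cluster receive a second-level label, which mirrors the idea of re-grouping a single oversized cluster.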

### Reading the file with neighborhood coordinates in Canada, obtained from: http://cocl.us/Geospatial_data

In [1]:
import pandas as pd

# Read the coordinates file, obtained from: http://cocl.us/Geospatial_data
# The csv file was downloaded and is read locally to create the postal code dataframe (df_PC)
df_PC = pd.read_csv('Geospatial_Coordinates.csv', index_col=0)
df_PC.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


## Reading the table with the required neighborhood data, available on Wikipedia

In [2]:
# Read the dataframe with the neighborhood data
# The table is read with pandas read_html
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

# Walk each row of the scraped dataframe to generate one row per postal code;
# the rows are stored in a dataframe with the final data (df_final)
df_final = pd.DataFrame()
for x in range(0,20):
    df_temp = df.iloc[x].str.extractall(r'(M\d\w)([^\(\(]*)(?:\((.*)\))?')
    df_final = pd.concat([df_final, df_temp])

# Clean null values, reset the index, drop unneeded columns and rename the columns
df_final.dropna(inplace=True)
df_final.reset_index(inplace=True)
df_final.drop(['level_0','match'], axis=1, inplace=True)
df_final.rename(columns={0: 'PostalCode', 1: 'Borough', 2: 'Neighborhood'}, inplace=True)
df_final.replace(' /',',', regex=True, inplace=True)
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


## Add location columns to the master dataframe (df_final)

In [3]:
# Add location columns to the original dataframe
df_final['Latitude'] = df_final.PostalCode.map(df_PC['Latitude'].to_dict())
df_final['Longitude'] = df_final.PostalCode.map(df_PC['Longitude'].to_dict())
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
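
The lookup above relies on `Series.map`, which also accepts a Series directly and leaves `NaN` where a key has no match. The sketch below shows that behavior on toy data; the postal code `M9Z` and the rounded coordinates are hypothetical values chosen only to produce a missing match.

```python
import pandas as pd

# toy coordinate table indexed by postal code, like df_PC
coords = pd.DataFrame(
    {"Latitude": [43.75, 43.72], "Longitude": [-79.33, -79.31]},
    index=pd.Index(["M3A", "M4A"], name="Postal Code"),
)

# toy neighborhood table; "M9Z" is a hypothetical code with no coordinates
hoods = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"]})

# map each postal code to its coordinates; unmatched codes become NaN
hoods["Latitude"] = hoods["PostalCode"].map(coords["Latitude"])
hoods["Longitude"] = hoods["PostalCode"].map(coords["Longitude"])

# list the postal codes that did not receive coordinates
missing = hoods.loc[hoods["Latitude"].isna(), "PostalCode"].tolist()
```

Checking `missing` after the mapping is a cheap way to confirm that every postal code was found before plotting.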


### Generate one row per neighborhood, splitting comma-separated neighborhoods that share a borough

In [4]:
df_final['Neighborhood'] = df_final['Neighborhood'].str.split('[,]')
df_final = df_final.explode('Neighborhood').reset_index(drop=True)
cols = list(df_final.columns)
df_final = df_final[cols]

df_final

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park,43.654260,-79.360636
3,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
4,M6A,North York,Lawrence Manor,43.718518,-79.464763
...,...,...,...,...,...
211,M8Z,Etobicoke,Mimico NW,43.628841,-79.520999
212,M8Z,Etobicoke,The Queensway West,43.628841,-79.520999
213,M8Z,Etobicoke,South of Bloor,43.628841,-79.520999
214,M8Z,Etobicoke,Kingsway Park South West,43.628841,-79.520999
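
The split-and-explode step can be checked on a two-row toy frame. Note that splitting on a bare comma leaves a leading space on every name after the first; the final `str.strip()` call below is an optional extra that the notebook does not perform, shown here only as a possible refinement.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M5A", "M3A"],
    "Neighborhood": ["Regent Park, Harbourfront", "Parkwoods"],
})

# turn each comma-separated string into a list of names
toy["Neighborhood"] = toy["Neighborhood"].str.split(",")

# one row per list element; the other columns are repeated
toy = toy.explode("Neighborhood").reset_index(drop=True)

# optional: remove the whitespace left over from ", "
toy["Neighborhood"] = toy["Neighborhood"].str.strip()
```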


## Count of boroughs and neighborhoods

In [5]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_final['Borough'].unique()),
        df_final.shape[0]
    )
)

The dataframe has 15 boroughs and 216 neighborhoods.


## Neighborhood count grouped by borough

In [6]:
df_final.groupby('Borough')['PostalCode'].count()

Borough
Central Toronto                                                 16
Downtown Toronto                                                36
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
East Toronto                                                     6
East TorontoBusiness reply mail Processing Centre969 Eastern     1
East York                                                        5
East YorkEast Toronto                                            1
Etobicoke                                                       44
EtobicokeNorthwest                                               9
MississaugaCanada Post Gateway Processing Centre                 1
North York                                                      36
Queen's Park                                                     1
Scarborough                                                     38
West Toronto                                                    13
York                                                  

In [7]:
#pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [8]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, CA are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, CA are 43.6534817, -79.3839347.


### Generate a map of Toronto, CA, with the neighborhoods superimposed

In [9]:
#pip install folium
import folium # library for plotting maps

In [10]:
# create a map of Toronto using the latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to the map based on the neighborhood location data
for lat, lng, borough, neighborhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### From this point on we work with the data for the borough of Central Toronto
It was chosen because it has a roughly average number of neighborhoods.

In [11]:
central_toronto_data = df_final[df_final['Borough'] == 'Central Toronto'].reset_index(drop=True)
central_toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M5N,Central Toronto,Roselawn,43.711695,-79.416936
2,M4P,Central Toronto,Davisville North,43.712751,-79.390197
3,M5P,Central Toronto,Forest Hill North & West,43.696948,-79.411307
4,M4R,Central Toronto,North Toronto West,43.715383,-79.405678


### Get the coordinates of Central Toronto and visualize its neighborhoods

In [12]:
address = 'Central Toronto, CA'

geolocator = Nominatim(user_agent="ct_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Central Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Central Toronto are 43.66291795, -79.4088986364304.


In [13]:
# create a map of Central Toronto using the latitude and longitude values
map_central_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add the markers to the map
for lat, lng, label in zip(central_toronto_data['Latitude'], central_toronto_data['Longitude'], central_toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_central_toronto)  
    
map_central_toronto

### Get data from Foursquare

#### Retrieve up to 50 venues per neighborhood in Toronto within a 500-meter radius.


(The code was adjusted for version 3 of the Foursquare API.)

In [14]:
# import required libraries
import numpy as np
import requests
import json

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# Matplotlib and associated modules for plotting
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans


In [15]:
import os

# request URI and access credentials (the loop takes approximately 3 minutes)
limit = 50 # maximum number of venues returned by the Foursquare API
radius = 500 # search radius in meters

venues_list=[]
for name, lat, lng in zip(df_final["Neighborhood"], df_final["Latitude"], df_final["Longitude"]):
            
    # build the API request URI
    url = "https://api.foursquare.com/v3/places/search?ll={}%2C{}&radius={}&limit={}".format(lat, lng, radius, limit)

    # the API key is read from the environment instead of being hard-coded;
    # the variable name FOURSQUARE_API_KEY is an assumption, not part of the API
    headers = {"Accept": "application/json",
               "Authorization": os.environ["FOURSQUARE_API_KEY"]
            }

    response = requests.get(url, headers=headers)
    results = json.loads(response.text)["results"]
    
    # keep only the fields of interest for each nearby venue
    venues_list.append([(name, lat, lng, v['name'], v['geocodes']['main']['latitude'], 
            v['geocodes']['main']['longitude'], v['categories'][0]['name']) 
            for v in results if len(v['categories'])>0 and len(v['geocodes'])>0])
    
central_toronto_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
central_toronto_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
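
The field extraction inside the loop can be isolated into a small pure function and tried without calling the API. The sketch below assumes a payload shaped like the fields the loop reads (`name`, `geocodes.main`, `categories`); the function name `extract_venues` and the sample records are hypothetical.

```python
def extract_venues(neighborhood, lat, lng, results):
    """Return one tuple per venue that has a main geocode and a category."""
    rows = []
    for v in results:
        main = v.get("geocodes", {}).get("main")
        categories = v.get("categories", [])
        # skip venues without coordinates or without any category
        if not main or not categories:
            continue
        rows.append((
            neighborhood, lat, lng,
            v["name"], main["latitude"], main["longitude"],
            categories[0]["name"],
        ))
    return rows

# hypothetical records imitating the fields used by the notebook
sample_results = [
    {"name": "Cafe A",
     "geocodes": {"main": {"latitude": 43.7, "longitude": -79.4}},
     "categories": [{"name": "Cafe"}]},
    {"name": "No category",
     "geocodes": {"main": {"latitude": 43.7, "longitude": -79.4}},
     "categories": []},
    {"name": "No main geocode",
     "geocodes": {},
     "categories": [{"name": "Park"}]},
]

rows = extract_venues("Parkwoods", 43.753259, -79.329656, sample_results)
```

Checking for the `main` geocode explicitly avoids a `KeyError` when a venue reports other geocode types but no main one.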


In [16]:
print('There are {} unique categories.'.format(len(central_toronto_venues['Venue Category'].unique())))

There are 417 unique categories.


In [17]:
# Analyze each neighborhood by each type of business/service identified
# one-hot encoding
central_toronto_onehot = pd.get_dummies(central_toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the neighborhood column back to the dataframe
central_toronto_onehot['Neighborhood'] = central_toronto_venues['Neighborhood'] 

# move the neighborhood column to the first position
fixed_columns = [central_toronto_onehot.columns[-1]] + list(central_toronto_onehot.columns[:-1])
central_toronto_onehot = central_toronto_onehot[fixed_columns]

central_toronto_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accounting and Bookkeeping Service,Adult Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport Service,American Restaurant,Amusement Park,...,Vintage and Thrift Store,Warehouse / Wholesale Store,Waste Management Service,Website Designer,Wholesaler,Wine Bar,Wine Store,Women's Store,Yoga Studio,Youth Organization
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# group by neighborhood, taking the mean frequency of occurrence of each category
central_toronto_grouped = central_toronto_onehot.groupby('Neighborhood').mean().reset_index()
central_toronto_grouped

Unnamed: 0,Neighborhood,ATM,Accounting and Bookkeeping Service,Adult Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport Service,American Restaurant,Amusement Park,...,Vintage and Thrift Store,Warehouse / Wholesale Store,Waste Management Service,Website Designer,Wholesaler,Wine Bar,Wine Store,Women's Store,Yoga Studio,Youth Organization
0,Adelaide,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.02,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
1,Agincourt North,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
2,Albion Gardens,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
3,Bathurst Quay,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
4,Beaumond Heights,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,Willowdale,0.009615,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.009615,0.0,0.0,0.0,0.0,0.000000,0.0
206,Woburn,0.000000,0.052632,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.000000,0.0
207,Woodbine Heights,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.021739,0.0
208,York Mills,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000,0.00,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0
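
Taking the mean of one-hot columns per neighborhood yields the share of that neighborhood's venues falling in each category, so every row sums to 1. A toy example with hypothetical neighborhoods `A` and `B` makes the arithmetic visible:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bakery", "Park", "Park"],
})

# one indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# mean of the indicators = relative frequency of each category
freq = onehot.groupby("Neighborhood").mean()
```

Here neighborhood `A` has 2 cafes out of 4 venues, so its `Cafe` frequency is 0.5, while `B` contains only parks.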


In [19]:
# build the new dataframe showing the top n venue categories of each neighborhood
num_top_venues = 10

# create one column per top venue
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    columns.append('{} Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = central_toronto_grouped['Neighborhood']

for ind in np.arange(central_toronto_grouped.shape[0]):
    row_categories = central_toronto_grouped.iloc[ind, :].iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    neighborhoods_venues_sorted.iloc[ind, 1:] = row_categories_sorted.index.values[0:num_top_venues]

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1 Common Venue,2 Common Venue,3 Common Venue,4 Common Venue,5 Common Venue,6 Common Venue,7 Common Venue,8 Common Venue,9 Common Venue,10 Common Venue
0,Adelaide,Café,Restaurant,Lounge,Pizzeria,Italian Restaurant,Deli,Sushi Restaurant,Music Venue,Burger Joint,Salad Restaurant
1,Agincourt North,"Shipping, Freight, and Material Transportation...",Property Management Office,Eyecare Store,Painter,Elementary School,Hair Salon,Computer Repair Service,Park,Metals Supplier,Construction
2,Albion Gardens,Drugstore,Clothing Store,Real Estate Agency,Pizzeria,Telecommunication Service,Jewelry Store,Shoe Store,Automotive Repair Shop,Furniture and Home Store,Real Estate Development and Title Company
3,Bathurst Quay,Rental Car Location,Sports and Recreation,Home Improvement Service,Property Management Office,Historic and Protected Site,Harbor / Marina,Sculpture Garden,Music Venue,Office Supply Store,Nutritionist
4,Beaumond Heights,Drugstore,Clothing Store,Real Estate Agency,Pizzeria,Telecommunication Service,Jewelry Store,Shoe Store,Automotive Repair Shop,Furniture and Home Store,Real Estate Development and Title Company
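
The ranking step sorts each neighborhood's category frequencies in descending order and keeps the first n names. A minimal sketch on a hypothetical frequency table (the helper name `top_categories` is illustrative only):

```python
import pandas as pd

# hypothetical frequency table: rows are neighborhoods, columns are categories
freq = pd.DataFrame(
    {"Cafe": [0.5, 0.0], "Park": [0.3, 0.7], "Bakery": [0.2, 0.3]},
    index=["A", "B"],
)

def top_categories(row, n):
    """Return the n category names with the highest frequency in the row."""
    return row.sort_values(ascending=False).index[:n].tolist()

top = {name: top_categories(row, 2) for name, row in freq.iterrows()}
```

When several categories share the same frequency (for example many zeros), their relative order depends on the sort algorithm, so tail positions in a top-10 list should be read with caution.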


### First-level clustering

In [21]:
# set the number of clusters
kclusters = 5

central_toronto_grouped_clustering = central_toronto_grouped.drop('Neighborhood', axis=1)

# run k-means
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(central_toronto_grouped_clustering)

# check the cluster labels generated for the first 10 rows of the dataframe
kmeans.labels_[0:10]

array([0, 4, 0, 1, 0, 0, 0, 0, 0, 4])

In [22]:
# add the cluster labels to the neighborhood dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# use the original dataframe with the location data
central_toronto_merged = df_final

# attach the cluster labels and top venues
central_toronto_merged = central_toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

central_toronto_merged.head() # check the last columns

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1 Common Venue,2 Common Venue,3 Common Venue,4 Common Venue,5 Common Venue,6 Common Venue,7 Common Venue,8 Common Venue,9 Common Venue,10 Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4,Business and Professional Services,Accounting and Bookkeeping Service,Financial Service,Engineer,Website Designer,Food and Beverage Retail,Audiovisual Service,Vintage and Thrift Store,Health Food Store,Landscaper and Gardener
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,Car Dealership,Print Store,General Contractor,Automotive Repair Shop,Media Agency,Organization,BBQ Joint,Check Cashing Service,Gift Store,Tailor
2,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636,0,Automotive Repair Shop,Car Dealership,Park,Furniture and Home Store,Restaurant,Italian Restaurant,Coffee Shop,Music Venue,Bakery,Spa
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,0,Automotive Repair Shop,Car Dealership,Park,Furniture and Home Store,Restaurant,Italian Restaurant,Coffee Shop,Music Venue,Bakery,Spa
4,M6A,North York,Lawrence Manor,43.718518,-79.464763,0,Clothing Store,Automotive Repair Shop,Carpet and Flooring Contractor,Housewares Store,Event Service,Retail,Leather Supplier,Computer Repair Service,Electrician,Loans Agency


In [23]:
# show the number of elements per cluster
# this exposes the problem addressed by the project: one cluster absorbs
# most of the neighborhoods
central_toronto_merged.groupby(['Cluster Labels'])['PostalCode'].count()

Cluster Labels
0    141
1      7
2      1
3     17
4     50
Name: PostalCode, dtype: int64
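
The imbalance can be quantified directly from the counts shown above. The sketch below rebuilds a label series with those same counts and computes which cluster dominates and what share of the rows it holds; the variable names are illustrative.

```python
import pandas as pd

# label series reproducing the counts reported above (141, 7, 1, 17, 50)
labels = pd.Series([0] * 141 + [1] * 7 + [2] * 1 + [3] * 17 + [4] * 50,
                   name="Cluster Labels")

counts = labels.value_counts()

# cluster with the most rows and its share of all rows
dominant = counts.idxmax()
share = counts.max() / counts.sum()
```

With these counts, cluster 0 holds 141 of 216 rows, roughly 65% of the neighborhoods, which is why it is the one selected for the nested step.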

In [24]:
# Visualize the resulting clusters
# create the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set the color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(central_toronto_merged['Latitude'], central_toronto_merged['Longitude'], central_toronto_merged['Neighborhood'], central_toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Start nesting the clusters: select one of them and run k-means again (nested)

In [25]:
# dataframes used to filter the cluster of interest and run the new clustering process
central_toronto_grouped_nested = central_toronto_grouped
neighborhoods_venues_sorted_nested = neighborhoods_venues_sorted

In [26]:
# filter the venues dataframe to cluster 0

neighborhoods_venues_sorted_nested = neighborhoods_venues_sorted_nested.loc[neighborhoods_venues_sorted_nested['Cluster Labels'] == 0].copy()



### Filter the one-hot data to the nested cluster


In [27]:
central_toronto_merged_nested = neighborhoods_venues_sorted_nested.loc[neighborhoods_venues_sorted_nested["Cluster Labels"] == 0 ]['Neighborhood']

In [28]:
central_toronto_merged_nested.head()

0              Adelaide
2        Albion Gardens
4      Beaumond Heights
5     Bloordale Gardens
6           Cabbagetown
Name: Neighborhood, dtype: object

In [29]:
central_toronto_grouped_nested = pd.merge(central_toronto_grouped_nested, central_toronto_merged_nested, on=["Neighborhood"], how="outer", indicator=True)

central_toronto_grouped_nested.head()

Unnamed: 0,Neighborhood,ATM,Accounting and Bookkeeping Service,Adult Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport Service,American Restaurant,Amusement Park,...,Warehouse / Wholesale Store,Waste Management Service,Website Designer,Wholesaler,Wine Bar,Wine Store,Women's Store,Yoga Studio,Youth Organization,_merge
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,both
1,Agincourt North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,left_only
2,Albion Gardens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,both
3,Bathurst Quay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,left_only
4,Beaumond Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,both


In [30]:
# Drop the rows that do not belong to the target cluster (0)
central_toronto_grouped_nested = central_toronto_grouped_nested.loc[central_toronto_grouped_nested['_merge'] == 'both', :]
central_toronto_grouped_nested = central_toronto_grouped_nested.drop(['_merge'], axis=1)
central_toronto_grouped_nested.shape


(135, 418)
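
The merge with `indicator=True` followed by keeping the `both` rows selects the neighborhoods present in both tables. On a toy frame, the same selection can also be written with `isin`, shown here only as an alternative sketch; the frequency values are hypothetical.

```python
import pandas as pd

grouped = pd.DataFrame({
    "Neighborhood": ["Adelaide", "Agincourt North", "Albion Gardens"],
    "Cafe": [0.2, 0.0, 0.1],
})
# names of the neighborhoods in the target cluster
target = pd.Series(["Adelaide", "Albion Gardens"], name="Neighborhood")

# approach used above: outer merge with an indicator column
via_merge = grouped.merge(target, on="Neighborhood", how="outer", indicator=True)
via_merge = via_merge[via_merge["_merge"] == "both"].drop(columns="_merge")
via_merge = via_merge.reset_index(drop=True)

# alternative: boolean mask built with isin
via_isin = grouped[grouped["Neighborhood"].isin(target)].reset_index(drop=True)
```

Both versions keep the same rows; the `isin` form avoids the temporary `_merge` column.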

### Generate a new set of clusters with nested k-means

In [31]:
# set the number of clusters
kclusters = 5

central_toronto_grouped_clustering_nested = central_toronto_grouped_nested.drop('Neighborhood', axis=1)

# run k-means
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(central_toronto_grouped_clustering_nested)

# check the cluster labels generated for the first 10 rows of the dataframe
kmeans.labels_[0:10]

array([0, 2, 2, 3, 0, 0, 3, 4, 1, 0])

In [32]:
# Build a new dataframe that includes the nested cluster label and the 10 most common venues of each neighborhood
# add the nested cluster labels
neighborhoods_venues_sorted_nested.insert(0, 'Cluster Labels2', kmeans.labels_)

central_toronto_merged_nested = df_final

# join the nested cluster labels and top venues onto the neighborhood location data
central_toronto_merged_nested = central_toronto_merged_nested.join(neighborhoods_venues_sorted_nested.set_index('Neighborhood'), on='Neighborhood')

central_toronto_merged_nested.head() # check the last columns

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels2,Cluster Labels,1 Common Venue,2 Common Venue,3 Common Venue,4 Common Venue,5 Common Venue,6 Common Venue,7 Common Venue,8 Common Venue,9 Common Venue,10 Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,,,,,,,,,,,,
1,M4A,North York,Victoria Village,43.725882,-79.315572,4.0,0.0,Car Dealership,Print Store,General Contractor,Automotive Repair Shop,Media Agency,Organization,BBQ Joint,Check Cashing Service,Gift Store,Tailor
2,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636,4.0,0.0,Automotive Repair Shop,Car Dealership,Park,Furniture and Home Store,Restaurant,Italian Restaurant,Coffee Shop,Music Venue,Bakery,Spa
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,4.0,0.0,Automotive Repair Shop,Car Dealership,Park,Furniture and Home Store,Restaurant,Italian Restaurant,Coffee Shop,Music Venue,Bakery,Spa
4,M6A,North York,Lawrence Manor,43.718518,-79.464763,2.0,0.0,Clothing Store,Automotive Repair Shop,Carpet and Flooring Contractor,Housewares Store,Event Service,Retail,Leather Supplier,Computer Repair Service,Electrician,Loans Agency


In [33]:
# drop the records with null values (NaN)
central_toronto_merged_nested = central_toronto_merged_nested.dropna()

In [34]:
# we obtain a new set of clusters; one is still dominant, but the elements
# are much better distributed
central_toronto_merged_nested.groupby(['Cluster Labels2'])['PostalCode'].count()

Cluster Labels2
0.0    21
1.0     7
2.0    15
3.0    75
4.0    23
Name: PostalCode, dtype: int64
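
One possible way to present both levels to a client is a combined label such as `0.4` (first-level cluster 0, nested cluster 4). This is only a sketch: the column `Nested Label` and the helper `nested_label` are hypothetical, and the three rows reuse label values shown in the tables above.

```python
import pandas as pd

merged = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village", "Lawrence Manor"],
    "Cluster Labels": [4, 0, 0],
    # NaN means the neighborhood is outside the re-clustered cluster
    "Cluster Labels2": [float("nan"), 4.0, 2.0],
})

def nested_label(row):
    """Combine first-level and nested labels into a single string."""
    first = int(row["Cluster Labels"])
    if pd.isna(row["Cluster Labels2"]):
        return str(first)
    return "{}.{}".format(first, int(row["Cluster Labels2"]))

merged["Nested Label"] = merged.apply(nested_label, axis=1)
```

Keeping the rows without a nested label, instead of dropping them, lets the combined column cover every neighborhood.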

In [35]:
# Visualize the resulting clusters again
# create the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set the color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(central_toronto_merged_nested['Latitude'], central_toronto_merged_nested['Longitude'], central_toronto_merged_nested['Neighborhood'], central_toronto_merged_nested['Cluster Labels2']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters