# Clustering Jundiaí City's Neighbourhoods
### Capstone Project for the final assignment of IBM Data Science for Professional Certificate

## Table of Contents
- [Installing and importing needed packages](#install)
- [Creating the neighbourhood dataset](#dataset)
- [Fetching location data from foursquare](#fetching)
- [Analyze each neighbourhood](#analyze)
- [Creating the clusters](#clustering)

<span id="install"></span>  
## Installing and importing needed packages

For this project, I'll use:
- the pandas dataframes/series to manipulate data
- time to make delays
- numpy to manipulate matrix
- requests to make REST requests to foursquare API
- matplotlib to handle color maps
- sklearn KMeans, to create clusters

In [None]:
!conda install -c conda-forge folium --yes

In [749]:
# import needed packages
import pandas as pd
from geopy.geocoders import Nominatim
import time
import numpy as np
import requests
from sklearn.cluster import KMeans
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

<span id='dataset'></span>
## Creating the neighbourhood dataset
### Dataset sources
<font size="4em">The Jundiaí City has no data readly formated about the boroughs and neighbourhoods. Therefore, I need to get this data by my self. I found data about every borough of the city in the Town Hall web page. Each borough has datas in PDF files, so, I decided to extract these data manually.</font>

<font size="4em">The Town Hall web page that annouces the data mentioned before is [this one (jundiai.sp.gov.br)](https://jundiai.sp.gov.br/planejamento-e-meio-ambiente/publicacoes-da-smpma/conheca-seu-bairro/). The following boroughs will be used:
- Centro ([PDF data about this borough](https://jundiai.sp.gov.br/planejamento-e-meio-ambiente/wp-content/uploads/sites/15/2014/08/Conhe%C3%A7a-seu-bairro-Centro.pdf))
- Anhangabaú ([PDF data about this borough](https://jundiai.sp.gov.br/planejamento-e-meio-ambiente/wp-content/uploads/sites/15/2014/08/Conhe%C3%A7a-seu-bairro-Anhangabau.pdf))
- Ponte São João ([PDF data about this borough](https://jundiai.sp.gov.br/planejamento-e-meio-ambiente/wp-content/uploads/sites/15/2014/08/Conhe%C3%A7a-seu-bairro-Ponte-S%C3%A3o-Jo%C3%A3o.pdf))
- Vila Arens ([PDF data about this borough](https://jundiai.sp.gov.br/planejamento-e-meio-ambiente/wp-content/uploads/sites/15/2014/08/Conhe%C3%A7a-seu-bairro-Vila-Arens-Vila-Progresso.pdf))
- Vila Rio Branco ([PDF data about this borough](https://jundiai.sp.gov.br/planejamento-e-meio-ambiente/wp-content/uploads/sites/15/2014/08/Conhe%C3%A7a-seu-bairro-Vila-Rio-Branco.pdf))

I accessed each PDF and generated the following data dictionary:</font>

In [54]:
# Creating a dictionary of neighbourhoods name. Each key is a borough, each borough is a dictionary.
# Each borough dictionary has Id and list of neighbourhoods
jundiaiNeighbourhood = {
    'Centro': {
        'id': 'cnt',
        'neighbourhoods': [
            'Vila Operário',
            'Vila Pachecó',
            'Vila Padre Nóbrega',
            'Vila Gothardo',
            'Vila Leme',
            'Vila Maria Ignez',
            'Vila Boaventura',
            'Vila Argos Velha',
            'Vila Argos Nova'
            'Chácara Urbana',
            'Vila Torres Neves'
        ]
    },
    'Anhangabaú': {
        'id': 'abu',
        'neighbourhoods': [
            'Jardim Luciana',
            'Chacarás São Roque',
            'Bairro Anhangabaú',
            'Jardim Ana Maria',
            'Jardim da Flórida',
            'Jardim da Serra',
            'Jardim Paulista',
            'Vila Japi',
            'Vila Ana',
            'Vila Iracema',
            'Vila Loyola',
            'Vila Cacilda',
            'Jardim Santa Adelaide',
            'Jardim Anhanguera'
        ]
    },
    'Ponte São João': {
        'id': 'psj',
        'neighbourhoods': [
            'Chácara Alvorada',
            'Vila Belesso',
            'Vila Caodáglio',
            'Vila Agostino Zambom',
            'Vila Graff',
            'Vila Lima',
            'Jardim Vila Rica',
            'Jardim da Fonte',
            'Vila Guilherme',
            'Vila Joana',
            'Vila Veto',
            'Jardim São Miguel'
        ]
    },
    'Vila Arens': {
        'id': 'vla',
        'neighbourhoods': [
            'Vila Princesa Isabel',
            'Vila Santa Rosa',
            'Jardim São Bento',
            'Vila Nadi',
            'Jardim Dupre',
            'Vila Isabel Eber',
            'Vila São Bento',
            'Jardim Merci III',
            'Vila Progresso',
            'Vila de Vecchi',
            'Vila Agrícola',
            'Vila Santana',
            'Vila Leda'
        ]
    },
    'Vila Rio Branco': {
        'id': 'vrb',
        'neighbourhoods': [
            'Jardim Marcos Leite',
            'Jardim Liberdade',
            'Jardim Rio Branco',
            'Vila Margarida',
            'Vila Rio Branco',
            'Vila Carlos W. Muller',
            'Vila Liberdade',
            'Jardim Danúbio',
            'Vila Savieto',
            'Jardim Carlos Gomes'
        ]
    }
}

<font size="4em">The data dictionary is ok for the work, but, instead of using it, I'll use the Pandas DataFrame to manipulate the dataset. So, let's convert it and then, print some checks about dataframe:</font>

In [796]:
# converting the neighbourhoods dictionary to pandas dataframe
df_jundiai = pd.DataFrame(columns=['id', 'borough', 'neighbourhood'])

# for each key in jundiaiNeighbourhood...
for key in list(jundiaiNeighbourhood.keys()):
    
    # get id
    id_ = jundiaiNeighbourhood[key]['id']
    
    # for each neighbourhood in borough...
    for neighbourhood in jundiaiNeighbourhood[key]['neighbourhoods']:
        
        # add new object to dataframe
        df_jundiai = df_jundiai.append({
            'id':id_,
            'borough':key,
            'neighbourhood': neighbourhood
        }, ignore_index=True)
        
# showing info about dataframe
print('Sample of df_jundiai: \n', df_jundiai.head())
print('\nShape of df_jundiai:', df_jundiai.shape)

Sample of df_jundiai: 
     id     borough       neighbourhood
0  abu  Anhangabaú      Jardim Luciana
1  abu  Anhangabaú  Chacarás São Roque
2  abu  Anhangabaú   Bairro Anhangabaú
3  abu  Anhangabaú    Jardim Ana Maria
4  abu  Anhangabaú   Jardim da Flórida

Shape of df_jundiai: (60, 3)


<font size="3em">Now that the dataframe was created, let's get the geolocation data (latitude and longitude) about every neighbourhood. This process will try to get geolocation data 3 times for each neighbourhood. If fail, is because these neighbourhood can't be fetched in the geolocation Nominatim server.

After that, let's show a sample of the dataset</font>

In [None]:
# Retrieving the geolocation data of each neighbourhood
address = '{}, Jundiaí'

# for each neighbourhood...
latlng = []
for neighbourhood in df_jundiai['neighbourhood']:
    geocoder = Nominatim(user_agent='dfe_agent')
    address_ = address.format(neighbourhood)
    location = geocoder.geocode(address_)
    
    # if cannot return geolocation
    try_again=0
    print(address_)
    while (location is None):
        # Try 2 more times
        if (try_again > 1):
            break
        time.sleep(2) # wait 2 seconds before trying again
        location = geocoder.geocode(address_)
        try_again += 1
        
    if (try_again > 1):
        latlng.append([-1, -1])
    else:
        latlng.append([location.latitude, location.longitude])

# insertin lat and lng into dataframe
lat, lng = map(list, zip(*latlng)) # split columns
df_jundiai['latitude'] = lat
df_jundiai['longitude'] = lng

In [801]:
df_jundiai.head()

Unnamed: 0,id,borough,neighbourhood,latitude,longitude
0,abu,Anhangabaú,Jardim Luciana,-23.188419,-46.898869
1,abu,Anhangabaú,Chacarás São Roque,-1.0,-1.0
2,abu,Anhangabaú,Bairro Anhangabaú,-1.0,-1.0
3,abu,Anhangabaú,Jardim Ana Maria,-23.191902,-46.900853
4,abu,Anhangabaú,Jardim da Flórida,-23.194772,-46.902675


<font size="3em">Now, let's see how much neighbourhood has no latitude and longitude data</font>

In [802]:
# Cheking how much rows do not have latitude and longitude
df_jundiai[(df_jundiai['latitude'] == -1) & (df_jundiai['longitude'] == -1)].shape

(15, 5)

<font size="3em">So, 15 of 60 neighbourhoods cannot be fetched from geopy. I decided that it's no worth to fetch these data to this study case, so, I'll remove these neighbourhoods from the dataframe. They represent 25% of entirely dataframe</font>

In [809]:
df_jundiai2 = df_jundiai.copy()
df_jundiai['latitude'] == -1
df_jundiai[df_jundiai['latitude'] == -1] = np.nan
df_jundiai.dropna(inplace=True)

<font size="3em">Now, let's plot the graph of the city and mark the neighbourhoods</font>

In [810]:
# getting jundiai location
geocoder = Nominatim(user_agent='dfe_agent')
location = geocoder.geocode('Jundiaí, SP')

map_jundiai = folium.Map(location=[location.latitude, location.longitude], zoom_start=13, width="70%", height="70%")

for lat, lng, name, bor in zip(df_jundiai['latitude'], df_jundiai['longitude'], df_jundiai['neighbourhood'], df_jundiai['borough']):
    label = folium.Popup(str(name) + ' ('+str(bor)+')')
    folium.CircleMarker(
        [lat, lng],
        popup = label,
        radius=10,
        color='blue',
        fill=True,
        fill_color='green',
        fill_opacity=0.7
    ).add_to(map_jundiai)
map_jundiai

<span id="fetching"></span>
## Fetching location data from foursquare
<font size="3em">Now, I'll fetch the location data (i.e. venues) from each neighbourhood. To do this, I'll use the Foursquare API, loop through every neighbourhood collected in the later section and store this data. This step will create a new dataset that  will contain just venue data, and the neighbourhood ID</font>

In [812]:
# Foursquare API auth
fs_clientid = '***'
fs_clientsecret='***'
fs_version='20190506'

In [247]:
# function to retrieve venues from a neighbourhood
def getVenues(numId, id_, latitude, longitude):
    # form url to request location data
    # not providing radius and limits, I'm trusting in the Foursquare API ;) i.e. radius=250 and limit up to 50
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}'
    
    # venues per neighbourhoods id dataframe
    df_neighborVenues = pd.DataFrame(columns=['numid', 'id', 'venue', 'category', 'latitude', 'longitude'])
    
    # for each id/latitude/longitude...
    for lat, lng, id2_, numId_ in zip(latitude, longitude, id_, numId):
        url_ = url.format(fs_clientid, fs_clientsecret, lat, lng, fs_version)
        
        # send request to API
        fs_response = requests.get(url_).json()['response']['groups'][0]['items']
        
        # insert each venue in the df_neighborVenues
        for index in range(len(fs_response)):
            venue = fs_response[index]['venue']
            
            df_neighborVenues = df_neighborVenues.append({
                'numid': numId_,
                'id': id2_,
                'venue': venue['name'],
                'category': venue['categories'][0]['name'],
                'latitude':venue['location']['lat'],
                'longitude':venue['location']['lng']
            }, ignore_index=True)
        
    return df_neighborVenues

<font size="3em">Here are, the new dataset with venues information.</font>

In [857]:
df_jundiaiVenues = getVenues(list(df_jundiai.index), df_jundiai['id'], df_jundiai['latitude'], df_jundiai['longitude'])
df_jundiaiVenues.head()

Unnamed: 0,numid,id,venue,category,latitude,longitude
0,0,abu,Dom Olívio,Market,-23.186749,-46.898197
1,0,abu,Mako's Poke Co.,Food & Drink Shop,-23.189273,-46.897225
2,0,abu,Casa Toro,Steakhouse,-23.189378,-46.900506
3,0,abu,Roma Mia,Italian Restaurant,-23.189222,-46.898106
4,0,abu,Hoken Sushi,Restaurant,-23.190507,-46.899997


<font size="3em">Now, Jundiaí is not a big city (not too much) so, the neighbourhoods fetched early can share some venues, and this is a problem to the analisys in this project. So, before continuing, I need to clean this dataset removing every duplicated venue. To do this, here how I ruled it:</font>
- <font size="3em">Get the duplicated venues list</font>
- <font size="3em">For every duplicated venue, find the near neighbourhood</font>
- <font size="3em">Remove this venue from the others neighbourhoods and assign it to the closest neighbourhood.</font>  

<font size="3em">There is:</font>

In [858]:
# make a list of duplicated venues
df_jdVenuesDupl = df_jundiaiVenues.groupby('venue').count()['id'] > 1
df_jdVenuesDupl = df_jundiaiVenues.groupby('venue').count()[df_jdVenuesDupl].reset_index()

# get a list of unique venues
df_jdVenuesUnique = df_jundiaiVenues.groupby('venue').count()['id'] == 1
df_jdVenuesUnique = df_jundiaiVenues.set_index('venue')[df_jdVenuesUnique].reset_index()
df_jdVenuesUnique.shape



(111, 6)

In [860]:
# the criteria to removing duplicate venues is keep venue in the nearest neighbourhood, and delete in the others neighbourhoods
#for venue, numid, id_, latitude, longitude in zip(df_jdVenuesDupl['venue'])
new_unique = pd.DataFrame(columns = df_jdVenuesUnique.columns)
for venue in df_jdVenuesDupl['venue']:
    # first, need to know what borough the duplicate venues pertences for and it latitude/longitude
    temp_borough_df = df_jundiaiVenues[df_jundiaiVenues['venue'] == venue].reset_index()
    borough = temp_borough_df['id'].unique()
    loc = temp_borough_df.loc[0, 'latitude':'longitude']
    
    # now, get a dataframe with boroughs containing the duplicated venue
    temp_df = df_jundiai[df_jundiai['id'] == borough[0]]
    if (len(borough) > 1):
        for index in range(1, len(borough)):
            temp_df = temp_df.append(df_jundiai[df_jundiai['id'] == borough[index]])
    
    # remove undesired indexes
    accept_index = pd.Series(temp_df.index)
    accept_index.index = accept_index
    accept_index = accept_index.isin(temp_borough_df['numid'])
    temp_df = temp_df[accept_index]
    
    # compare each
    lat_min=-1
    lng_min=-1
    index=-1
    for i, lat, lng in zip(list(temp_df.index), temp_df['latitude'], temp_df['longitude']):
        if (index != -1):
            lat_min_ = np.absolute(loc[0] - lat)
            lng_min_ = np.absolute(loc[0] - lng)
            if (lat_min_ < lat_min) and (lng_min_ < lng_min):
                lat_min = lat_min_
                lng_min = lng_min_
                index = i
        else:
            lat_min = np.absolute(loc[0] - lat)
            lng_min = np.absolute(loc[1] - lng)
            index = i
    
    # Least distance of venue from neighbourhood in index neighbourhood. So
    # insert unique entry to df_jundiaiVenues
    #df_jundiaiVenues[(df_jundiaiVenues['numid'] == index) & (df_jundiaiVenues['venue'] == df_jdVenuesDupl['venue'][0])]
    new_unique = new_unique.append(df_jundiaiVenues[(df_jundiaiVenues['numid'] == index) & (df_jundiaiVenues['venue'] == venue)])
    
# data cleaned
print('New unique shape: ', new_unique.shape)

New unique shape:  (185, 6)


In [861]:
# join DF Unique Venues with new unique venues
df_jdVenuesUnique = df_jdVenuesUnique.append(new_unique, ignore_index=True).reset_index(drop=True)

In [862]:
#df_jdVenuesUnique.to_csv('df_jdVenuesUnique.csv')
df_jdVenuesUnique.shape

(296, 6)

<font size="3em">Now, the dataset has only unique venues for each neighbourhood. So, let's analysis these venues in the next section.</font>

In [863]:
# rearange columns of df
df_jdVenuesUnique2 = df_jdVenuesUnique[['numid', 'id', 'venue', 'category', 'latitude', 'longitude']].sort_values(by='numid', axis=0).reset_index(drop=True)
df_jdVenuesUnique2.head()

Unnamed: 0,numid,id,venue,category,latitude,longitude
0,0,abu,Lisboa Culinária Portuguesa,Portuguese Restaurant,-23.185658,-46.895798
1,0,abu,Smoked Burgers,Burger Joint,-23.189275,-46.895386
2,0,abu,Lo.La,Latin American Restaurant,-23.185582,-46.898311
3,0,abu,Love Burger,Burger Joint,-23.189636,-46.900677
4,0,abu,Mako's Poke Co.,Food & Drink Shop,-23.189273,-46.897225


<span id="analyze"></span>
## Analyze each neighbourhood
<font size="3em">Before analyze any neighbourhood, the dafarame needs some change. First, lets encode the categories in binary numbers, creating columns for each category and assigning 1 if the venue have the category or 0 if not.</font>

In [867]:
#Hot encoding using pandas
df_jdVenuesEncoded = pd.get_dummies(df_jdVenuesUnique[['category']], prefix='', prefix_sep='')

# add neighbourhood to encoded df
df_jundiaiTemp = df_jundiai.reset_index()[['index', 'neighbourhood', 'borough']].rename(columns={'index':'numid'})

# create the new column and fill it with 'N'
df_jdVenuesUnique['neighbourhood'] = 'N'


# For each numid, fill the correct neighbourhood
for numid in df_jdVenuesUnique['numid'].unique():
    #put the neighbourhood into the new colum
    df_jdVenuesUnique.loc[df_jdVenuesUnique['numid'] == numid, 'neighbourhood'] = list(df_jundiaiTemp.loc[df_jundiaiTemp['numid'] == numid, 'neighbourhood'])[0]
    
# add the neighbourhood column to the encoded datframe
df_jdVenuesEncoded2 = df_jdVenuesEncoded.copy()
df_jdVenuesEncoded2['neighbourhood'] = df_jdVenuesUnique['neighbourhood']
columns_position = [df_jdVenuesEncoded2.columns[-1]] + list(df_jdVenuesEncoded2.columns[:-1])
df_jdVenuesEncoded2 = df_jdVenuesEncoded2[columns_position]

df_jdVenuesEncoded2.head()

Unnamed: 0,neighbourhood,Acai House,Adult Boutique,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beer Bar,...,Street Art,Supermarket,Sushi Restaurant,Tea Room,Theater,Theme Park Ride / Attraction,Vegetarian / Vegan Restaurant,Video Store,Warehouse Store,Wine Shop
0,Jardim da Flórida,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Jardim da Flórida,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Jardim da Flórida,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Jardim Paulista,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Jardim Paulista,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<font size="3em">Now, group by neighbourhood using the mean method to "normalize" the values and get the percentage of each category in that neighbourhoo</font>

In [868]:
# Group by neighbourhood taking the mean
df_jdVenuesGrp = df_jdVenuesEncoded2.groupby('neighbourhood').mean()
df_jdVenuesGrp.head()

Unnamed: 0_level_0,Acai House,Adult Boutique,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beer Bar,Big Box Store,...,Street Art,Supermarket,Sushi Restaurant,Tea Room,Theater,Theme Park Ride / Attraction,Vegetarian / Vegan Restaurant,Video Store,Warehouse Store,Wine Shop
neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Chácara Alvorada,0.0,0.0,0.0,0.0,0.034483,0.0,0.103448,0.068966,0.034483,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0
Chácara Urbana,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,...,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jardim Ana Maria,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jardim Dupre,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0
Jardim Liberdade,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,...,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0


<font size="3em">Let's print the top 5 common venues in each neighbourhood</font>

In [869]:
def getTopCommonVenues(row, top=5):
    return row.T[1:].sort_values(ascending=False)[0:top]

def getTopCommonVenuesCategory(row, top=5):
    row_ = row.T[1:].sort_values(ascending=False)[0:top]
    return row_.index.values

In [870]:
df_jdVenuesGrp = df_jdVenuesGrp.reset_index()
for i in range(0, df_jdVenuesGrp.shape[0]):
    venues = getTopCommonVenues(df_jdVenuesGrp.loc[i,:], 5)
    print(df_jdVenuesGrp.loc[i, 'neighbourhood'])
    print(venues)
    print('\n')

Chácara Alvorada
Gym                      0.137931
Gym / Fitness Center     0.103448
Bakery                   0.103448
Bar                     0.0689655
Nightclub               0.0689655
Name: 0, dtype: object


Chácara Urbana
Steakhouse                  0.125
Café                        0.125
Chinese Restaurant         0.0625
Fruit & Vegetable Store    0.0625
Market                     0.0625
Name: 1, dtype: object


Jardim Ana Maria
Park                    0.142857
Brazilian Restaurant    0.142857
Gym Pool                0.142857
Athletics & Sports      0.142857
Burger Joint            0.142857
Name: 2, dtype: object


Jardim Dupre
Gym / Fitness Center    0.142857
Bakery                  0.142857
Dessert Shop            0.142857
Salon / Barbershop      0.142857
Brazilian Restaurant    0.142857
Name: 3, dtype: object


Jardim Liberdade
Café             0.222222
Pizza Place      0.111111
Grocery Store    0.111111
Theater          0.111111
Bakery           0.111111
Name: 4, dtype: objec

<font size="3em">Finally, let's create a dataset that will contain the top 10 most common venues in each neighbourhood. This dataset will show us with more light how each neighbourhood is made of venues</font>

In [720]:
# create a dataframe with top 10 common venues
columns = ['neighbourhood']
top = 10
for i in range(0, top):
    columns.append('{} most common venues'.format(i))


df_jundiaiTop = pd.DataFrame(columns = columns)
df_jundiaiTop['neighbourhood'] = df_jdVenuesGrp['neighbourhood']
for i in range(0, df_jdVenuesGrp.shape[0]):
    df_jundiaiTop.loc[i, 1:] = getTopCommonVenuesCategory(df_jdVenuesGrp.loc[i, :], 10)
df_jundiaiTop.head()

Unnamed: 0,neighbourhood,0 most common venues,1 most common venues,2 most common venues,3 most common venues,4 most common venues,5 most common venues,6 most common venues,7 most common venues,8 most common venues,9 most common venues
0,Chácara Alvorada,Gym,Gym / Fitness Center,Bakery,Bar,Nightclub,Brazilian Restaurant,Sandwich Place,Food Truck,Warehouse Store,Gymnastics Gym
1,Chácara Urbana,Steakhouse,Café,Chinese Restaurant,Fruit & Vegetable Store,Market,Pharmacy,Burger Joint,Buffet,Botanical Garden,Bar
2,Jardim Ana Maria,Park,Brazilian Restaurant,Gym Pool,Athletics & Sports,Burger Joint,Stadium,Market,Cosmetics Shop,Dance Studio,Deli / Bodega
3,Jardim Dupre,Gym / Fitness Center,Bakery,Dessert Shop,Salon / Barbershop,Brazilian Restaurant,Warehouse Store,Asian Restaurant,BBQ Joint,Food & Drink Shop,Deli / Bodega
4,Jardim Liberdade,Café,Pizza Place,Grocery Store,Theater,Bakery,Bar,Park,Dessert Shop,Wine Shop,Diner


<span id="clustering"></span>
## Creating the clusters
<font size="3em"> Now it's time to generate the clusters. The neighbourshoods constant was initialized with tested value from previous tries.
    
First, let drop the *neighbourhood* column.</font>

In [721]:
df_jdVenuesKmeans = df_jdVenuesGrp.drop('neighbourhood', axis=1)
df_jdVenuesKmeans.head()

Unnamed: 0,Acai House,Adult Boutique,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beer Bar,Big Box Store,...,Street Art,Supermarket,Sushi Restaurant,Tea Room,Theater,Theme Park Ride / Attraction,Vegetarian / Vegan Restaurant,Video Store,Warehouse Store,Wine Shop
0,0.0,0.0,0.0,0.0,0.034483,0.0,0.103448,0.068966,0.034483,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,...,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.142857,0.0,0.0,0.142857,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,...,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0


<font size="3em">Using the KMeans algorithm (from scikit learn) to train our model and then, joining the *neighbourhood*, *borough*, *id* and *latitude/longitude* columns to the labeled dataset</font>

In [782]:
k = 11
kmeans = KMeans(n_clusters = k).fit(df_jdVenuesKmeans)
kmeans.labels_

# insert labels into dataset
df_jundiaiTop.insert(0, 'Cluster Labels', kmeans.labels_)

array([ 0,  0, 10,  0,  0,  0,  0,  0,  6,  0,  4,  7,  0,  3,  7,  0,  9,
        0,  0,  0,  2,  1,  8,  0,  0,  1,  0,  5], dtype=int32)

In [732]:
# merge with information dataframe
df_jundiai_merged = df_jundiai.merge(df_jundiaiTop)
df_jundiai_merged.head()

Unnamed: 0,id,borough,neighbourhood,latitude,longitude,Cluster Labels,0 most common venues,1 most common venues,2 most common venues,3 most common venues,4 most common venues,5 most common venues,6 most common venues,7 most common venues,8 most common venues,9 most common venues
0,abu,Anhangabaú,Jardim Luciana,-23.188419,-46.898869,9,Italian Restaurant,Burger Joint,Pizza Place,Steakhouse,Bar,Restaurant,Sandwich Place,Brazilian Restaurant,Food & Drink Shop,Farmers Market
1,abu,Anhangabaú,Jardim Ana Maria,-23.191902,-46.900853,5,Park,Brazilian Restaurant,Gym Pool,Athletics & Sports,Burger Joint,Stadium,Market,Cosmetics Shop,Dance Studio,Deli / Bodega
2,abu,Anhangabaú,Jardim da Flórida,-23.194772,-46.902675,1,Hotel,Gym,Pizza Place,Dog Run,Clothing Store,Coffee Shop,Cosmetics Shop,Dance Studio,Deli / Bodega,Department Store
3,abu,Anhangabaú,Jardim da Serra,-23.196057,-46.899601,2,Gym / Fitness Center,Hotel,Coffee Shop,Shopping Mall,Steakhouse,Bookstore,Fruit & Vegetable Store,Food Truck,German Restaurant,Cosmetics Shop
4,abu,Anhangabaú,Jardim Paulista,-23.195495,-46.81515,12,Burger Joint,Italian Restaurant,Ice Cream Shop,Hot Dog Joint,Brazilian Restaurant,Gym,Brewery,Coffee Shop,Gymnastics Gym,Food Truck


## Ploting the graph

In [733]:
# getting jundiai location
geocoder = Nominatim(user_agent='dfe_agent')
location = geocoder.geocode('Jundiaí, SP')
print(location.latitude, location.longitude)

-23.1887866 -46.8845122


In [759]:
map_jundiai = folium.Map(location=[location.latitude, location.longitude], zoom_start=13, width='70%', height='70%')

#color schem
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lng, name, bor, cluster in zip(df_jundiai_merged['latitude'], df_jundiai_merged['longitude'], df_jundiai_merged['neighbourhood'], df_jundiai_merged['borough'], df_jundiai_merged['Cluster Labels']):
    label = folium.Popup(name + ' cluster ' + str(cluster) + '('+bor+')')
    folium.CircleMarker(
        [lat, lng],
        popup = label,
        radius=10,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7
    ).add_to(map_jundiai)

In [761]:
map_jundiai

<font size="3em">And there is, the graph with clustered neighbourhoods. Most cluster have just one neighbourhoods, and it's because of low data about neighbourhoods at Foursquare, about this city.</font> 
  
<font size="3em">This shows that this city has a lot to grow in terms of location data (and, for real, is hard to find venues here using the internet. Most of them do not have a web page yet!)</font>  
  
<font size="3em">However, lets pay attention in the clusters 2, 3, 5 and 9:</font>

### Cluster 2
<font size="3em">**Cluster 2** seems to be made of gym, hotel and coffee shops/shoppings. So, it's primarly a "touristic" cluster.</font>

In [774]:
df_jundiai_merged.loc[df_jundiai_merged['Cluster Labels'] == 2,:].drop('latitude', axis=1).drop('longitude', axis=1)

Unnamed: 0,id,borough,neighbourhood,Cluster Labels,0 most common venues,1 most common venues,2 most common venues,3 most common venues,4 most common venues,5 most common venues,6 most common venues,7 most common venues,8 most common venues,9 most common venues
3,abu,Anhangabaú,Jardim da Serra,2,Gym / Fitness Center,Hotel,Coffee Shop,Shopping Mall,Steakhouse,Bookstore,Fruit & Vegetable Store,Food Truck,German Restaurant,Cosmetics Shop
6,abu,Anhangabaú,Vila Ana,2,Italian Restaurant,Chocolate Shop,Gym / Fitness Center,Steakhouse,Furniture / Home Store,Hotel,Coffee Shop,Clothing Store,Pizza Place,Bakery


### Cluster 3
<font size="3em">**Cluster 3** is obviously most made of food venues.</font>

In [775]:
df_jundiai_merged.loc[df_jundiai_merged['Cluster Labels'] == 3,:].drop('latitude', axis=1).drop('longitude', axis=1)

Unnamed: 0,id,borough,neighbourhood,Cluster Labels,0 most common venues,1 most common venues,2 most common venues,3 most common venues,4 most common venues,5 most common venues,6 most common venues,7 most common venues,8 most common venues,9 most common venues
8,abu,Anhangabaú,Vila Loyola,3,Restaurant,Wine Shop,Church,Coffee Shop,Cosmetics Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner
14,vla,Vila Arens,Vila Santana,3,Restaurant,Wine Shop,Church,Coffee Shop,Cosmetics Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner


### Cluster 5
<font size="3em">**Cluster 5** is primarly a region with general shops and coffee shops, so, a kind of "*walk* place"</font>

In [783]:
df_jundiai_merged.loc[df_jundiai_merged['Cluster Labels'] == 5,:].drop('latitude', axis=1).drop('longitude', axis=1)

Unnamed: 0,id,borough,neighbourhood,Cluster Labels,0 most common venues,1 most common venues,2 most common venues,3 most common venues,4 most common venues,5 most common venues,6 most common venues,7 most common venues,8 most common venues,9 most common venues
1,abu,Anhangabaú,Jardim Ana Maria,5,Park,Brazilian Restaurant,Gym Pool,Athletics & Sports,Burger Joint,Stadium,Market,Cosmetics Shop,Dance Studio,Deli / Bodega
7,abu,Anhangabaú,Vila Iracema,5,Dessert Shop,Burger Joint,Brazilian Restaurant,Gym / Fitness Center,Adult Boutique,Pizza Place,Asian Restaurant,Hot Dog Joint,Ice Cream Shop,Farmers Market
9,vla,Vila Arens,Vila Santa Rosa,5,Brazilian Restaurant,Bar,Martial Arts Dojo,Shopping Mall,Farmers Market,Pet Store,Gastropub,Pizza Place,Dance Studio,Burger Joint
12,vla,Vila Arens,Jardim Merci III,5,Pizza Place,Brazilian Restaurant,Department Store,Bar,Food Court,Mexican Restaurant,Coffee Shop,Cosmetics Shop,Dance Studio,Deli / Bodega


### Cluster 9
<font size="3em">And, finally, cluster 9 seems to be kind of region with shops and some restaurants</font>

In [787]:
df_jundiai_merged.loc[df_jundiai_merged['Cluster Labels'] == 9,:].drop('latitude', axis=1).drop('longitude', axis=1)

Unnamed: 0,id,borough,neighbourhood,Cluster Labels,0 most common venues,1 most common venues,2 most common venues,3 most common venues,4 most common venues,5 most common venues,6 most common venues,7 most common venues,8 most common venues,9 most common venues
0,abu,Anhangabaú,Jardim Luciana,9,Italian Restaurant,Burger Joint,Pizza Place,Steakhouse,Bar,Restaurant,Sandwich Place,Brazilian Restaurant,Food & Drink Shop,Farmers Market
11,vla,Vila Arens,Vila Isabel Eber,9,Pizza Place,Brazilian Restaurant,Video Store,Theme Park Ride / Attraction,Hot Dog Joint,Restaurant,Brewery,Diner,Coffee Shop,Cosmetics Shop
16,psj,Ponte São João,Vila Caodáglio,9,Bar,Pet Store,Brazilian Restaurant,Gym / Fitness Center,Pizza Place,Church,Restaurant,Historic Site,Fast Food Restaurant,Italian Restaurant
21,cnt,Centro,Vila Padre Nóbrega,9,Gourmet Shop,Pizza Place,BBQ Joint,Sushi Restaurant,Bakery,Bar,Wine Shop,Farmers Market,Cosmetics Shop,Dance Studio
24,cnt,Centro,Chácara Urbana,9,Steakhouse,Café,Chinese Restaurant,Fruit & Vegetable Store,Market,Pharmacy,Burger Joint,Buffet,Botanical Garden,Bar
26,vrb,Vila Rio Branco,Jardim Liberdade,9,Café,Pizza Place,Grocery Store,Theater,Bakery,Bar,Park,Dessert Shop,Wine Shop,Diner


<span id="conclusion"></span>
## Conclusion
<font size="3em">This project shows that this city (Jundiaí/SP - Brazil) have a lot to expand in the location data and webpages, a lot of work for webdesigns. Relative to the venues, this city is like a small town, with most of the venues beeing small shops and food venues, but with bunch of Coffee Shops and "to Walk" places.
</font>