<h3 style="color:red;">The purpose here is to define the sources of university rankings data, then work out how to scrape each source, and finally how to merge the data from the different sources.</h3>
<h4 style="color:blue;">The plan now is to scrape data from three different sources. This means three different webpages, three different scraping algorithms and, finally, one combined collection of data. One thing I would like to do is to run a similarity analysis between the sources to see how consistent they are with each other.</h4>

<h1>PROJECT TITLE</h1>
<h2>Supplementary information (code development)</h2>
<h5>By: Aurelio Álvarez Ibarra</h5>

<h3>S1.1 Getting information from the ranking tables</h3>

As usual, first we will download and import the necessary packages and libraries.

In [None]:
# Get packages and libraries ready
!pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd

<h4>Source one: Times Higher Education</h4>
The first source of data will be the ranking by <a href="https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking">Times Higher Education</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to some Latin American countries. The ones available in this ranking are:
<ul>
    <li>Argentina (identified as AR)</li>
    <li>Brazil (identified as BR)</li>
    <li>Chile (identified as CL)</li>
    <li>Colombia (identified as CO)</li>
    <li>Costa Rica (identified as CR)</li>
    <li>Cuba (identified as CU)</li>
    <li>Jamaica (identified as JM)</li>
    <li>Mexico (identified as MX)</li>
    <li>Peru (identified as PE)</li>
    <li>Puerto Rico (identified as PR)</li>
    <li>Venezuela (identified as VE)</li>
</ul>
Let's try this first source with the full list. In the process, the structure of the webpage will be studied in order to systematically retrieve the data for every country.

In [None]:
# Save data from webpage
myurl = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/locations/AR/sort_by/rank/sort_order/asc/cols/stats'
## The structure of the webpage is:
##   1.- Main URL and starting page (0):
##       https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/
##   2.- Number of results per page (-1 means "All"):
##       length/-1/
##   3.- Location definition (check the country identifier above):
##       locations/AR/
##   4.- Sorting parameters:
##       sort_by/rank/sort_order/asc/
##   5.- Information shown ("stats"=ranking data; "scores"=scoring data):
##       cols/stats
source = requests.get(myurl)
print(source)

The Times Higher Education server denies requests of this type (403). Some webpages require a <code>User-Agent</code> header to be defined, as explained <a href="https://stackoverflow.com/questions/38489386/python-requests-403-forbidden">here</a>. On top of that, when retrieving the table section of the webpage, the <code>tbody</code> tag (which encloses the data in the body of the table) is empty. According to this <a href="https://stackoverflow.com/questions/49260014/beautifulsoup-returns-empty-td-tags">source</a>, the table data may be generated by a script rather than being part of the HTML code itself. Thus, extra information must be included in the header and in the <code>requests.get</code> arguments to try to get the results of such a script.
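
One detail worth checking, offered as a side note rather than as part of the original analysis: everything after the <code>#</code> in a URL is a fragment, and HTTP clients do not send the fragment to the server. The sketch below only uses the standard library to split the URL from the previous cell, so no request is made; <code>the_url</code> and <code>url_sent</code> are names chosen for this illustration.

```python
from urllib.parse import urlsplit

# Same URL as in the cell above (Argentina, all results, ranking data)
the_url = ('https://www.timeshighereducation.com/world-university-rankings/2020/'
           'world-ranking#!/page/0/length/-1/locations/AR/sort_by/rank/sort_order/asc/cols/stats')

parts = urlsplit(the_url)
print('Path    :', parts.path)
print('Fragment:', parts.fragment)

# The URL without its fragment is what a request is actually made to
url_sent = parts._replace(fragment='').geturl()
print('Sent    :', url_sent)
```

If this applies here, the page, length and location settings are read by a script in the browser, which would be consistent with the empty <code>tbody</code>.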

In [None]:
# Save data from webpage
##AR_url = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/locations/AR/sort_by/rank/sort_order/asc/cols/stats'
myurl = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/sort_by/rank/sort_order/asc/cols/stats'
## old__header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
source = requests.get(myurl,headers=header).text # .text gives the response body as a string (without it, a Response object is returned)
mysoup = BeautifulSoup(source,'lxml')
mytable = mysoup.find('table')
print(mytable.prettify())

The previous code results in a table object with an empty <code>tbody</code> tag as well. Unfortunately, I could not figure out how to get the actual data from the table. I checked many sources, but they were too much for me to digest right away. I tried another retriever but got the same result.

In [84]:
### My try with urllib.request... generates the same result (empty tbody)
from bs4 import BeautifulSoup
import urllib.request

url_to_scrape = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/sort_by/rank/sort_order/asc/cols/stats'
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(),'lxml') # Explicit parser, as in the previous cells

I decided to drop source one.

<h4>Source two: U.S. News</h4>
The second source of data will be the ranking by <a href="https://www.usnews.com/education/best-global-universities/rankings">U.S. News</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to some Latin American countries. The ones available in this ranking are:
<ul>
    <li>Argentina</li>
    <li>Brazil</li>
    <li>Chile</li>
    <li>Colombia</li>
    <li>Costa Rica</li>
    <li>Cuba</li>
    <li>Jamaica</li>
    <li>Mexico</li>
    <li>Peru</li>
    <li>Puerto Rico</li>
    <li>Venezuela</li>
</ul>
Let's try this second source with one country. In the process of gathering this sample data, the structure of the webpages will be studied in order to systematically retrieve the data for all countries.

Looking at the structure of the page, the data is not in a table but in a set of <code>div</code> tags with <code>class="sep"</code>. One concerning detail is that it is not possible to request the full list in one shot, but this will be solved later. For the first page of results for the Latin American universities, the retrieval code is the following:

In [276]:
# Save data from webpage
myurl = 'https://www.usnews.com/education/best-global-universities/search?region=latin-america'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
source = requests.get(myurl,headers=header).text # .text gives the response body as a string (without it, a Response object is returned)
mysoup = BeautifulSoup(source,'lxml')


# A function is needed to find the div tags whose class is exactly "sep"
mydivs = mysoup.find_all(lambda tag: tag.name == 'div' and
                         tag.get('class') == ['sep'])
print("Elements in the search result: ",len(mydivs))
print(mydivs[0])

Elements in the search result:  10
<div class="sep">
<div class="thumb-right">
<div class="t-large t-strong t-constricted">66.4</div>
<div class="t-smaller t-dim">Global Score</div>
<div class="levelmeter" style="background-position: -97.19999999999999px 0px; width:80px; height: 8px;"></div>
</div>
<div class="thumb-left">
<span class="rankscore-bronze">
          #1
        </span>
</div>
<div class="block unwrap">
<h2 class="h-taut">
<a href="https://www.usnews.com/education/best-global-universities/universidade-de-sao-paulo-500437">Universidade de São Paulo</a>
</h2>
<div class="t-taut">
<i class="flag-22 flag-brazil-22"></i>
<span>Brazil</span>
<span class="t-dim t-small">São Paulo</span>
</div>
<div>
<img alt="" class="icon thumb-left-bleed" src="/static/images/badges/micro-badge-silver-20.png" srcset="/static/images/badges/micro-badge-silver-20.png, /static/images/badges/micro-badge-silver-40.png 2x" width="20"/>
          #128 <span>(tied)</span> – <a data-ajax="true" href="/edu

Using this method, I get a list as a result. Its length tells me how many elements there are to analyze. For the first ten results in the Latin America ranking, the following data can be retrieved:

In [277]:
import re # For the regex search in the local ranking
###
univnames = []
for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
    univnames.append(myUname.text.strip())
###
LAranks = []
for myLArank in mysoup.find_all("div",class_="thumb-left"): # All tags with rank in Latin America (see myurl)
    # Rank comes with a # sign and maybe a "tied" string. Keep only the number
    LAranks.append(int(re.findall("[0-9]+",str(myLArank))[0]))
###
globalscores = []
for myGscore in mysoup.find_all("div",class_="t-large t-strong t-constricted"): # All tags with global score number
    globalscores.append(myGscore.text)
###
countries = []
cities = []
for mylocation in mysoup.find_all("div",class_="t-taut"): # All tags with location
    countries.append(mylocation.span.text.strip()) # First span holds the country
    cities.append(mylocation.find("span",class_="t-dim t-small").text.strip()) # Dimmed span holds the city
###
for ind in range(len(mydivs)):
    print('** {}, #{} in Latin America with a global score of {}, is located in {} ({}).'.format(univnames[ind],LAranks[ind],globalscores[ind],cities[ind],countries[ind]))

** Universidade de São Paulo, #1 in Latin America with a global score of 66.4, is located in São Paulo (Brazil).
** Pontificia University Católica de Chile, #2 in Latin America with a global score of 57.2, is located in Santiago (Chile).
** State University of Campinas, #3 in Latin America with a global score of 56.7, is located in Campinas, São Paulo (Brazil).
** Federal University of Rio de Janeiro, #4 in Latin America with a global score of 54.9, is located in Rio de Janeiro (Brazil).
** University of Buenos Aires, #5 in Latin America with a global score of 53.7, is located in Buenos Aires City, Buenos Aires (Argentina).
** National Autonomous University of Mexico, #6 in Latin America with a global score of 53.4, is located in Ciudad de México, Distrito Federal (Mexico).
** University of Chile, #7 in Latin America with a global score of 53.1, is located in Santiago (Chile).
** University of the Andes Colombia, #8 in Latin America with a global score of 51.4, is located in Bogotá, DC

For the next pages, an extra string appears in the URL: <code>page=#</code>, where <code>#</code> goes from 2 to 10 in this particular case. For <code>#</code>=1, this "extra" string does not appear. The following loop retrieves as many pages as stated in <code>numpages</code>.

In [303]:
# Initialize fields
univnames = []
LAranks = []
globalscores = []
countries = []
cities = []
pagenum = []
###

# Defining extra strings in URL and setting up request
myurl = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
extrastring = ['']
for mypages in range(2,numpages+1):
    mytext = '?page='
    extrastring.append(mytext+str(mypages))
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}

# Loop for pages 1 to numpages
for n in range(1,numpages+1):
    url = myurl + extrastring[n-1]
    source = requests.get(url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')

    ### University names
    for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
        univnames.append(myUname.text.strip())
    ### University rank (in Latin America)
    for myLArank in mysoup.find_all("div",class_="thumb-left"): # All tags with rank in LatinAmerica (see myurl)
        try:
            LAranks.append(int(re.findall("[0-9]+",str(myLArank))[0]))
        except IndexError: # No number found in this tag
            LAranks.append(None)
    ### University global score
    for myGscore in mysoup.find_all("div",class_="t-large t-strong t-constricted"): # All tags with global score number
        globalscores.append(myGscore.text.strip())
    ### University location (country and city)
    for mylocation in mysoup.find_all("div",class_="t-taut"): # All tags with location
        countries.append(mylocation.span.text.strip())
        cities.append(mylocation.find("span",class_="t-dim t-small").text.strip())
    ### Page number of this batch of data
    maxscores = len(mysoup.find_all("div",class_="t-large t-strong t-constricted"))
    for mypage in range(1,maxscores+1):
        pagenum.append(n)
# Universities without global score do not have complete data (see below), thus they will be dropped
##for ind in range(len(globalscores)):
##    print('** {}, #{} in Latin America with a global score of {}, is located in {} ({}).'.format(univnames[ind],LAranks[ind],globalscores[ind],cities[ind],countries[ind]))
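
Two pieces of the loop above can be checked without any network access: the list of page suffixes and the conversion of the rank text to a number. The helper names <code>page_suffixes</code> and <code>parse_rank</code> are made up for this sketch and are not part of the original code.

```python
import re

def page_suffixes(numpages):
    """Return the URL suffix for each page: '' for page 1, '?page=N' afterwards."""
    return [''] + ['?page={}'.format(p) for p in range(2, numpages + 1)]

def parse_rank(text):
    """Return the first integer found in text, or None when there is none."""
    found = re.findall('[0-9]+', text)
    return int(found[0]) if found else None

print(page_suffixes(3))
print(parse_rank('#128 (tied)'), parse_rank('Unranked'))
```

Note that the loop applies the regular expression to the whole tag converted to a string, so a digit inside an attribute would be picked up first; passing only the tag text is a possible refinement.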

Let's combine all the lists into a dataframe.

In [314]:
#import pandas as pd
maxlength = len(globalscores)
USN_df=pd.DataFrame({'University':univnames[0:maxlength],
                        'LatinAmericaRank':LAranks[0:maxlength],
                        'GlobalScore':globalscores[0:maxlength],
                        'Country':countries[0:maxlength],
                        'City':cities[0:maxlength],
                        'PageNumber':pagenum[0:maxlength]}
                      )
USN_df.head(10)

Unnamed: 0,University,LatinAmericaRank,GlobalScore,Country,City,PageNumber
0,Universidade de São Paulo,1,66.4,Brazil,São Paulo,1
1,Pontificia University Católica de Chile,2,57.2,Chile,Santiago,1
2,State University of Campinas,3,56.7,Brazil,"Campinas, São Paulo",1
3,Federal University of Rio de Janeiro,4,54.9,Brazil,Rio de Janeiro,1
4,University of Buenos Aires,5,53.7,Argentina,"Buenos Aires City, Buenos Aires",1
5,National Autonomous University of Mexico,6,53.4,Mexico,"Ciudad de México, Distrito Federal",1
6,University of Chile,7,53.1,Chile,Santiago,1
7,University of the Andes Colombia,8,51.4,Colombia,"Bogotá, DC",1
8,Universidad Tecnica Federico Santa Maria,9,51.1,Chile,"Valparaiso, Valparaiso",1
9,Federal University of Rio Grande do Sul,10,50.4,Brazil,"Porto Alegre, Rio Grande do Sul",1
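
The scraped scores are text, so a numeric conversion may be needed before comparing sources. A minimal sketch on a three-row toy dataframe, where the last row is invented to show a missing score:

```python
import pandas as pd

toy = pd.DataFrame({'University': ['Universidade de São Paulo', 'University of Chile', 'Example University'],
                    'GlobalScore': ['66.4', '53.1', 'N/A']})

# errors='coerce' turns anything that is not a number into NaN
toy['GlobalScore'] = pd.to_numeric(toy['GlobalScore'], errors='coerce')
print(toy.dtypes)

# Rows without a score can then be dropped explicitly
print(toy.dropna(subset=['GlobalScore']))
```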


Details of a university's ranking appear when clicking on the university's name. The resulting URL includes a variant of the university's name and a numeric identifier. This information can be found in an <code>a</code> tag inside the <code>h2</code> tag used to retrieve the university name. The previous data retriever will be reused to get the URL for each university.

In [323]:
# Initialize fields
univurl = []

# Defining extra strings in URL and setting up request
myurl = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
extrastring = ['']
for mypages in range(2,numpages+1):
    mytext = '?page='
    extrastring.append(mytext+str(mypages))
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}

# Loop for pages 1 to numpages
for n in range(1,numpages+1):
    url = myurl + extrastring[n-1]
    source = requests.get(url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')

    ### University URLs
    for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
        univurl.append(myUname.a['href'])
# Append URL list to dataframe
tmp = pd.DataFrame({'USN_URL':univurl[0:len(globalscores)]})
USN_df = USN_df.join(tmp)
USN_df

Unnamed: 0,University,LatinAmericaRank,GlobalScore,Country,City,PageNumber,USN_URL
0,Universidade de São Paulo,1,66.4,Brazil,São Paulo,1,https://www.usnews.com/education/best-global-u...
1,Pontificia University Católica de Chile,2,57.2,Chile,Santiago,1,https://www.usnews.com/education/best-global-u...
2,State University of Campinas,3,56.7,Brazil,"Campinas, São Paulo",1,https://www.usnews.com/education/best-global-u...
3,Federal University of Rio de Janeiro,4,54.9,Brazil,Rio de Janeiro,1,https://www.usnews.com/education/best-global-u...
4,University of Buenos Aires,5,53.7,Argentina,"Buenos Aires City, Buenos Aires",1,https://www.usnews.com/education/best-global-u...
...,...,...,...,...,...,...,...
78,Universidade Tecnologica Federal do Parana,79,16.5,Brazil,Curitiba,8,https://www.usnews.com/education/best-global-u...
79,Universidade Federal de Sergipe,80,16.3,Brazil,São Cristóvão,8,https://www.usnews.com/education/best-global-u...
80,Universidade Federal Rural de Pernambuco (UFRPE),81,15.9,Brazil,"Recife, PE",9,https://www.usnews.com/education/best-global-u...
81,Universidad Autonoma de Baja California,82,15.6,Mexico,"Mexicali, Baja California",9,https://www.usnews.com/education/best-global-u...


With the URL retrieved, the data used to score the universities is available. The following code retrieves such data.

<hr></hr><hr></hr><hr></hr>

<code>mytable</code> has the table body (in the webpage there is only one table visible). The data in each row is defined in the following way:
<ol>
    <li>The first column, the ranking position, is a <code>span</code> with <code>class="positionInRankin"</code></li>
    <li>The following columns have (in that order) the name of the university between <code>strong</code> tags, location (city), teaching quality, research, prestige, postgraduate offer, internationalization, accreditation, inclusion and diversity and, finally, Quality Index (year). All of them have the same attributes in their respective <code>a</code> tags.</li>
    <li>There is an underlying table with some details (accessed by clicking on the corresponding row).</li>
</ol>

The idea will be to extract the first (visible) table and then the underlying table. After checking the webpage code (you can try a <code>print(mytable)</code> at the end of the previous code cell), both of them can be accessed at the same level.

In [None]:
# Extracting data from the visible table
mydf = pd.DataFrame(columns = ['Rank','Name','Location','TeachingQuality','Research','Prestige',
                               'PostgraduateOffer','Internationalization','Accreditation','InclusionAndDiversity',
                                'QualityIndex2020'])

for mytr in mytable.find_all('tr',class_='dataRow'): # Looping for each dataRow in the table
    # Initialize data list (row)
    data = []
    for mycell in mytr.find_all('td'): # Looping for each cell in the row
        data.append(mycell.text.strip()) # Strip removes the \n in the end of the cell data
    # Write values from the row
    size = len(mydf) # Current size of dataframe
    mydf.loc[size] = data # Appending data after last row of dataframe
mydf.head()

In [None]:
# Extracting data from the underlying table
myunderdf = pd.DataFrame(columns = ['Name','RK19','TotalProfessors','%FullTimeProfessors','SNIResearchers','PNPCPhDs','PNPCMasters','ExchangeStudents',
                                    'TotalBachelorPrograms','%WomenInSrManagement','OwnNativeSupportProgram'])

for mytr in mytable.find_all('tr',class_='extraDataRow'): # Looping for each extraDataRow in the table
    # Initialize data list (row)
    minititle = mytr.find('h3',class_='extraTitle').text.strip() # University's name
    data = [minititle]
    for mycell in mytr.find_all('span',class_="extraMiniValue"): # Looping for each cell in the row
        data.append(mycell.text.strip()) # Strip removes the \n in the end of the cell data
    # Write values from the row
    size = len(myunderdf) # Current size of dataframe
    myunderdf.loc[size] = data # Appending data after last row of dataframe

myunderdf.head()

In order to manage only one dataframe, let's merge the results.

In [None]:
merged = mydf.merge(myunderdf,on='Name')
merged.head()
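
By default <code>merge</code> performs an inner join, so universities whose name does not match exactly in both tables are dropped without warning. One way to check for this, sketched on invented data, is an outer merge with <code>indicator=True</code>:

```python
import pandas as pd

# Invented tables; the third name differs by one extra space
visible = pd.DataFrame({'Name': ['Univ A', 'Univ B', 'Univ C'], 'Rank': [1, 2, 3]})
details = pd.DataFrame({'Name': ['Univ A', 'Univ B', 'Univ  C'], 'TotalProfessors': [100, 80, 60]})

# The _merge column says whether each row came from both tables or only one
check = visible.merge(details, on='Name', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Name', '_merge']])
```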

In a source with consistent ranking displays (tables), it would be a matter of defining lists with the countries and years and retrieving the information by looping over the elements of these lists. However, upon further inspection of the source, different countries and different years have different contents and formats in their tables.

<hr></hr><hr></hr><hr></hr>

<h4>0.2 Getting coordinates for the neighborhoods from Problem 2</h4>

In [None]:
# Convert file to dataframe
toronto_df = pd.read_csv('toronto_data.csv')
# Changing the name of the first column of downloaded data
toronto_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)
# Merging provided data into the original dataframe
# dataframe is the original data retrieved and cleaned from wikipedia
# toronto_df is the downloaded data
full_df = pd.merge(dataframe, toronto_df, on='PostalCode')
full_df.drop_duplicates(inplace=True) # Dropping duplicated rows
print('Shape of merged dataframe: ',full_df.shape)
full_df.head(10)

<h4>1.1 Analyzing neighborhoods in Toronto</h4>
The purpose of the following code is to group (cluster) different neighborhoods of Toronto in order to see how similar some of them are and which types of facilities (venues) they have. Maybe you would like to visit neighborhoods with coffee shops and bars one day, and visit neighborhoods with malls and beauty shops another day!

In [None]:
# Get required packages and libraries ready

import numpy as np # library to handle data in a vectorized manner

# import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

<h4>1.2 A first look at Toronto</h4>
Let's get some characteristics of the dataframe we have, as well as the location of Toronto on a map.

In [None]:
# How many boroughs and neighborhoods does Toronto have?
print('The dataframe "full_df" for Toronto has {} boroughs and {} neighborhoods.'
      .format(len(full_df['Borough'].unique()),
              full_df.shape[0]
    )
)

In [None]:
# Where is Toronto?
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# Create map of Toronto using its latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers of neighborhoods to map
for lat, lng, borough, neighborhood, pcode in zip(full_df['Latitude'], full_df['Longitude'],
                                                  full_df['Borough'], full_df['Neighborhood'],
                                                  full_df['PostalCode']):
    label = '{} ({}) {}'.format(neighborhood, borough, pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) # Do not forget to add CircleMarker to the map!!  
    
map_toronto

In order to simplify the analysis, the exercise suggests performing it only on boroughs that include 'Toronto' in their name. Let's extract that information:

In [None]:
# Define the dataframe by appending the desired boroughs
tmp = []
for i,x in enumerate(full_df['Borough']): # Create an enumerated list of boroughs
    if 'Toronto' in x: # Check if Toronto appears in the borough's name
        tmp.append(full_df.iloc[i])

justtoronto_df = pd.DataFrame(tmp).reset_index(drop=True) # Transform result to dataframe
print('Shape of dataframe for Toronto boroughs: ',justtoronto_df.shape)
justtoronto_df.head()
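
The same selection can also be written as a boolean mask. The toy dataframe below stands in for <code>full_df</code>; its rows and the name <code>just_toronto</code> are only illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({'PostalCode': ['M4E', 'M1B', 'M5A', 'M3A'],
                       'Borough': ['East Toronto', 'Scarborough', 'Downtown Toronto', 'North York']})

# Keep the rows whose borough name contains 'Toronto' (plain substring, no regex)
mask = toy_df['Borough'].str.contains('Toronto', regex=False)
just_toronto = toy_df[mask].reset_index(drop=True)
print(just_toronto)
```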

Now, let's adapt the map to the Toronto zone:

In [None]:
# Just changed "full_df" to "justtoronto_df"
# And I will overwrite the previous map
# Create map of Toronto using its latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11) # Larger zoom

# Add markers of neighborhoods to map
for lat, lng, borough, neighborhood, pcode in zip(justtoronto_df['Latitude'], justtoronto_df['Longitude'],
                                                  justtoronto_df['Borough'], justtoronto_df['Neighborhood'],
                                                  justtoronto_df['PostalCode']):
    label = '{} ({}) {}'.format(neighborhood, borough, pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) # Do not forget to add CircleMarker to the map!!  
    
map_toronto

<h4>2.1 Setting up Foursquare credentials</h4>
Please don't eat up my calls credit! XD

In [None]:
CLIENT_ID = 'RRYOHBWLN3VNML1RBPM0TRVDW2R41TKNWMZSH0VTOQKGNO2T' # your Foursquare ID
CLIENT_SECRET = 'X22FCK21ZCS0UVXZ11TILJFRGXGWVMD5ZADQLIOSMDHHSHHN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

<h4>2.2 Exploring one neighborhood</h4>
In order to make things clear, let's establish the analysis plan using just one neighborhood. Choose by setting a number between 0 and 38 in the following cell.

In [None]:
# Setting up neighborhood to be analyzed
nnum = 5

myneigh = justtoronto_df.loc[nnum, 'Neighborhood']
myneigh_lat = justtoronto_df.loc[nnum, 'Latitude'] # neighborhood latitude value
myneigh_lon = justtoronto_df.loc[nnum, 'Longitude'] # neighborhood longitude value

print('Your selected neighborhood is {}, located at (latitude,longitude) = ({},{}).'
      .format(myneigh, myneigh_lat, myneigh_lon))
print('Don\'t forget to update this cell when you want to analyze another neighborhood!')

The following code requests the top 100 venues within 500 meters of the location of your neighborhood:

In [None]:
LIMIT=100 # Remember the number and type of calls you have in your credit
radius=500 # in meters
# The URL structure is straightforward to read.
# Just remember the information you have to provide for each type of request.
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
           CLIENT_ID,
           CLIENT_SECRET,
           VERSION,
           myneigh_lat,
           myneigh_lon,
           radius,
           LIMIT)
url
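
As an alternative sketch, the query string can be assembled with <code>urllib.parse.urlencode</code>, which takes care of escaping the values. The credentials and coordinates below are placeholders, not real values, and <code>demo_url</code> is a name chosen for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {'client_id': 'MY_CLIENT_ID',          # placeholder
          'client_secret': 'MY_CLIENT_SECRET',  # placeholder
          'v': '20180605',
          'll': '{},{}'.format(43.65, -79.38),  # placeholder latitude,longitude
          'radius': 500,
          'limit': 100}

# safe=',' keeps the comma between latitude and longitude readable
demo_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(demo_url)

# Round trip: parse the query string back into a dictionary of lists
query = parse_qs(urlsplit(demo_url).query)
print(query['ll'], query['limit'])
```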

In [None]:
# Call to Foursquare. Do not run this cell more often than needed!!!
results = requests.get(url).json()
### results # Careful. Long result ahead. Uncomment just to be sure that it worked

All the information is in the <i>items</i> key. The following function <code>get_category_type</code> is used to extract the name of a category (remember the structure of the information in the <code>json</code> files).

In [None]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # Column name still carries the "venue." prefix
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

The previous function helps to clean the data from the request:

In [None]:
# Getting "items" to work with a smaller amount of data
venues = results['response']['groups'][0]['items']

# Convert JSON-style data into a table
nearby_venues = json_normalize(venues)

# Getting only the columns we will use
# The names come by looking at the json_normalize result
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns] # All rows, only the filtered columns

# venue.categories looks messy from the previous result. This is why you apply "get_category_type"
#   to that column, then you get the cleaned name. Of course, the function's design comes after
#   checking the data structure in "venues".
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# Remove the "venue." prefix (and any other dotted prefix) from the column names
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# Check the result
print('{} venues were returned by Foursquare in {}.'.format(nearby_venues.shape[0],myneigh))
nearby_venues.head()
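
To see what <code>json_normalize</code> does to nested venue records, here is a sketch with two invented items that only mimic the fields used above (it assumes pandas 1.0 or later, where the function is available as <code>pd.json_normalize</code>):

```python
import pandas as pd

fake_items = [
    {'venue': {'name': 'Cafe One',
               'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.65, 'lng': -79.38}}},
    {'venue': {'name': 'Nameless Spot',
               'categories': [],
               'location': {'lat': 43.66, 'lng': -79.39}}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(fake_items)
print(sorted(flat.columns))

# Lists are kept as they are, so the category name still has to be extracted
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if len(cats) > 0 else None)

# Keep only the last part of each dotted column name
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```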

<h4>3.1 Exploring the full zone</h4>
Now that the analysis has been done for one neighborhood, it can be extended to the full set of neighborhoods in the selected region of Toronto.

The following function will do the previous steps with a list of neighborhoods, provided the names and coordinates for each one (and maybe the radius to look for around the location and the limit of venues to search).

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Searching for venues in ',name,'...')
            
        # Create URL for API request
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make GET request, directly retrieving only the interesting part
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results]) # This is a "list comprehension"
        # In this type of list, you include an implicit for, which can be useful to reduce the number of lines
        #   in a code. In this case, it looks in the "results" data for the specific elements and values of the
        #   previously defined lists.

    # Transform result in dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list]) # Nested list comprehension
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print()
    print('Done!!',end='\n\n')
    print('Returned a dataframe with shape ',nearby_venues.shape)
    return(nearby_venues)
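
The nested list comprehension near the end of the function flattens a list of per-neighborhood lists into one list of rows. A toy version with made-up tuples:

```python
# One inner list per neighborhood, one tuple per venue (values are made up)
venues_list = [
    [('Neighborhood A', 'Cafe One'), ('Neighborhood A', 'Bar Two')],
    [],                                   # a neighborhood with no venues
    [('Neighborhood C', 'Park Three')],
]

# Read it left to right: for each inner list, for each item in it, keep the item
flat_rows = [item for venue_list in venues_list for item in venue_list]
print(flat_rows)
```

A neighborhood with no venues simply contributes no rows, so it disappears from the resulting dataframe.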

Now, apply the function to the full set of neighborhoods in Toronto:

In [None]:
toronto_venues = getNearbyVenues(names=justtoronto_df['Neighborhood'],
                                   latitudes=justtoronto_df['Latitude'],
                                   longitudes=justtoronto_df['Longitude']
                                  )

In [None]:
toronto_venues.head(10)

In [None]:
# How many venues does each neighborhood have?
print('Number of venues retrieved per neighborhood (dataframe):')
toronto_venues.groupby('Neighborhood').count()

Check that the number of venues returned by Foursquare here matches the one in your "one neighborhood" analysis.

In [None]:
# How many types of venues are there in this dataframe?
print('There are {} unique categories of venues in the dataframe.'.format(len(toronto_venues['Venue Category'].unique())))

<h4>3.2 Managing the information</h4>
The following code will create a dataframe that shows how many venues of a given type exist in each neighborhood. The dataframe will be large, but this is the preparation step.

In [None]:
# One hot encoding
# Create a dummy dataframe with columns after (unique) values in 'Venue Category'
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
# With this you just create a 'Neighborhood' column in toronto_onehot
#   with the info from toronto_venues['Neighborhood']
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighborhood column to the first column
# The previous code results in an alphabetical order in the columns (left-to-right)
#   thus let's move the 'Neighborhood' column to the beginning.
colind = toronto_onehot.columns.get_loc("Neighborhood") # Getting the position of column in dataframe
fixed_columns = [toronto_onehot.columns[colind]] + list(toronto_onehot.columns[0:colind]) + list(toronto_onehot.columns[colind+1:])
toronto_onehot = toronto_onehot[fixed_columns]

### Warning! In the lab exercise, the 'Neighborhood' column was added at the end of
###   the dataframe. That is why there you see a '-1' index to refer to that column.
###       fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
###       toronto_onehot = toronto_onehot[fixed_columns]
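###   While checking here, I found the column in its alphabetical position instead.
###   A likely cause (not verified here) is that 'Neighborhood' is also a venue category,
###   so the assignment overwrote an existing dummy column instead of appending a new one.
###   Thus, I had to modify the code to look for the column by name.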

toronto_onehot.head()

The previous dataframe records the occurrence of each venue category in each neighborhood. Let's group the rows by neighborhood and take the <code>mean</code> of each category column to get the frequency of each category per neighborhood. That is, out of all the venues in a given neighborhood, how likely it is to find a given type of venue.

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

As you can see in the previous results, it is more likely to find a coffee shop than an art gallery in Berczy Park. This is more easily seen if you print the top 5 venues (according to frequency) for each neighborhood.

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----" + hood + "----") # "plus" signs do not work if you mix strings and numbers!
    # T is for Transposed. It gets the venue categories to the index side.
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 3})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

<b>Note</b>: Remember that this <i>frequency</i> analysis depends on the number of venues in the neighborhood. If you see very small numbers in the top 5, it may mean there are a lot of venues in the neighborhood.
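
To make the note above concrete, here is a minimal sketch on invented data (the neighborhoods and categories in <code>toy_venues</code> are made up, not taken from Foursquare). It repeats the one-hot encoding and the <code>groupby(...).mean()</code> steps so you can check the frequencies by hand.

```python
import pandas as pd

# Invented venues: "Hood A" has 4 venues, "Hood B" has only 2
toy_venues = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood A", "Hood A", "Hood A", "Hood B", "Hood B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bar", "Park", "Coffee Shop", "Bar"],
})

# One-hot encode the categories, then put the neighborhood names back in front
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of a 0/1 column is the fraction of venues that belong to that category
toy_grouped = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

In this toy case a single bar counts for half of the venues in "Hood B" but only a quarter in "Hood A", which is exactly the dependence on the number of venues mentioned in the note.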

To get this information into a dataframe, it is easier to create a function that returns the top venues in a given row. The next cell will create the dataframe in a readable way.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd'] # Not needed if you use "Venue #X" for X = 1 to num_top_venues

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind])) # Indicators for 1, 2 and 3
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1)) # When you run out of "indicators"

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns) # As wide as num_top_venues + 1
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood'] # Copy neighborhoods from dataframe

for ind in np.arange(toronto_grouped.shape[0]): # For the number of neighborhoods in the dataframe...
    # The function returns the first "num_top_venues" from the ordered list from each row
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
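
As a sanity check of the logic above, here is a small sketch on an invented frequency table. The helper <code>top_categories</code> and the "Venue #X" column names are hypothetical and only for illustration. It uses <code>DataFrame.apply</code> instead of an explicit loop over rows, which is one possible alternative and not necessarily better for this notebook.

```python
import pandas as pd

# Invented frequency table with the same layout as toronto_grouped
toy_grouped = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood B"],
    "Bar": [0.30, 0.50],
    "Coffee Shop": [0.50, 0.10],
    "Park": [0.20, 0.40],
})

def top_categories(row, n):
    """Return the n category names with the highest frequency in one row."""
    freqs = row.iloc[1:].astype(float)  # skip the 'Neighborhood' cell
    return list(freqs.sort_values(ascending=False).index[:n])

n_top = 2
columns = ["Venue #{}".format(i + 1) for i in range(n_top)]

# apply(..., axis=1) calls the helper once per row and returns a Series of lists
top = toy_grouped.apply(lambda row: top_categories(row, n_top), axis=1)
toy_sorted = pd.DataFrame(top.tolist(), columns=columns)
toy_sorted.insert(0, "Neighborhood", toy_grouped["Neighborhood"])
print(toy_sorted)
```

The frequencies were chosen without ties on purpose. When two categories have the same frequency, their relative order after sorting is not something to rely on.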

<h4>4.1 Clustering neighborhoods using <i>K means</i></h4>
The following code runs the <code>K means</code> model for several values of the number of clusters and several random-number-generator seeds.

In [None]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # Positional "axis" is not accepted by recent pandas versions
for kclusters in range(3,6):
    print()
    print('Results for K-means with k = ',kclusters)
    for seed in range(0,5):
        # Execute k-means clustering for given conditions
        kmeans = KMeans(n_clusters=kclusters, random_state=seed, n_init=12).fit(toronto_grouped_clustering)    
        # Check cluster labels generated for each row in the dataframe
        print('For k = {} and seed = {} the labels are: \n {}'.format(kclusters,seed,kmeans.labels_[0:]))

The value of the <code>seed</code> for the random number generator that initializes the centroids of the clusters seems to have more influence for lower <code>kclusters</code> values. With <code>kclusters=5</code> the results are the same for every seed tested. Let's use that value for the clustering.

In [None]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # Positional "axis" is not accepted by recent pandas versions
kmeans = KMeans(n_clusters=5, random_state=0, n_init=12).fit(toronto_grouped_clustering)
kmeans
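
Comparing printed label arrays by eye can be misleading, because two runs may find the same groups but number them differently. A possible way to compare runs, sketched below on synthetic data, is the adjusted Rand index from scikit-learn. This is an addition and not part of the original analysis. The array <code>X</code> is an invented stand-in for <code>toronto_grouped_clustering</code>.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# The index ignores how the groups are numbered: these two labelings are the same partition
same_partition = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(same_partition)

# Synthetic stand-in for toronto_grouped_clustering: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Compare the partition obtained with seed 0 against other seeds
reference = KMeans(n_clusters=3, random_state=0, n_init=12).fit(X).labels_
scores = []
for seed in range(1, 5):
    labels = KMeans(n_clusters=3, random_state=seed, n_init=12).fit(X).labels_
    scores.append(adjusted_rand_score(reference, labels))
    print('seed = {}: adjusted Rand index = {:.3f}'.format(seed, scores[-1]))
```

A score close to 1 means the two runs agree on the grouping. Lower scores would point to a real dependence on the seed rather than a simple relabeling.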

Let's complete the dataframe for Toronto neighborhoods with the data from the neighborhoods, cluster label and top venues.

In [None]:
# Add clustering labels to the sorted neighborhood venues
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
# Recover the original dataframe (in this case, "justtoronto_df")
toronto_merged = justtoronto_df

# Add neighborhoods_venues_sorted to toronto_merged according to the neighborhood name
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head(5)
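
One detail of this <code>join</code> is worth checking. Any neighborhood without venues is missing from <code>neighborhoods_venues_sorted</code>, so it gets <code>NaN</code> in the new columns and the cluster labels become floats. The sketch below shows this on invented data (the <code>toy_</code> names are hypothetical). Whether it happens in the real data depends on what Foursquare returned.

```python
import pandas as pd

# Invented data: "Hood C" returned no venues, so it is missing from the right-hand table
toy_hoods = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood B", "Hood C"],
    "Latitude": [43.65, 43.66, 43.67],
    "Longitude": [-79.38, -79.39, -79.40],
})
toy_sorted = pd.DataFrame({
    "Cluster Labels": [0, 1],
    "Neighborhood": ["Hood A", "Hood B"],
    "1st Most Common Venue": ["Coffee Shop", "Bar"],
})

# Same pattern as in the notebook: a left join on the neighborhood name
toy_merged = toy_hoods.join(toy_sorted.set_index("Neighborhood"), on="Neighborhood")
print(toy_merged)
print(toy_merged["Cluster Labels"].dtype)  # a float dtype, because of the NaN

# One option: drop the unmatched rows and restore integer labels
toy_clean = toy_merged.dropna(subset=["Cluster Labels"]).copy()
toy_clean["Cluster Labels"] = toy_clean["Cluster Labels"].astype(int)
print(toy_clean["Cluster Labels"].tolist())
```

Integer labels matter later on if they are used as list indices, for example to pick a marker color.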

For the final presentation, a map with colored markers for each cluster is shown as follows.

In [None]:
# Getting Toronto's coordinates
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [None]:
# Create map object
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for each cluster
# Use the number of clusters of the fitted model instead of the leftover loop variable "kclusters"
n_clusters = kmeans.n_clusters
colors_array = cm.gnuplot(np.linspace(0, 1, n_clusters)) # Look for color maps in matplotlib
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, hood, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'],
                                  toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(hood) + ' (in Cluster ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
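
The color assignment can be checked on its own, without drawing the map. The sketch below assumes five clusters (the value of <code>n_clusters</code> here is an assumption that should match the fitted model) and shows how the <code>cluster-1</code> index picks a color for each label.

```python
import numpy as np
from matplotlib import cm, colors

n_clusters = 5  # assumed to match the fitted model (kmeans.n_clusters in the notebook)

# Sample n_clusters evenly spaced colors from the 'gnuplot' colormap
colors_array = cm.gnuplot(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(rgba) for rgba in colors_array]

# Labels run from 0 to n_clusters - 1, so "label - 1" runs from -1 to n_clusters - 2.
# Python's negative indexing makes label 0 pick the last color in the list.
for label in range(n_clusters):
    print(label, rainbow[label - 1])
```

Every label still gets its own color, so the <code>-1</code> offset is harmless here. Indexing with <code>rainbow[label]</code> would be the more direct choice.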

<h4>4.2 Examining clusters</h4>
Why are so many neighborhoods in one specific cluster? Let's see the top venues in each cluster and compare them. Since cluster 0 is the most populated, let's check that one first.

In [None]:
mycluster = 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

<b>NOTE</b>: If you run this notebook again, the "big" cluster can get another label. In this example, it came to be 0.

For cluster 0, coffee shops and cafés are the common venues at the top of the list. What happens with neighborhoods like "Dufferin, Dovercourt Village" (index 9)? It does not seem very similar. It shares bakery and bar among its top venues with a couple of other neighborhoods, but it still looks rather odd. Maybe the analysis tends to load the separation on the top venues rather than the whole set. Anyway, remember we are looking only at the top venues here, not at every one of them. For the rest of the clusters, the comparison is straightforward:

In [None]:
mycluster = 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
mycluster = 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
mycluster = 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
mycluster = 4
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

For the clusters with more than one element, the top venues are very similar. There, the clustering makes sense. It may be a challenge to further analyze the data in order to see why the clustering puts so many neighborhoods in one of them (remember the results for <code>kclusters</code> from 3 to 4 at the beginning of section 4.1). Some straightforward ideas on this can be found <a href="https://zerowithdot.com/mistakes-with-k-means-clustering/">here</a> and some solutions are suggested <a href="https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/">here</a>. Since this is a high-dimensionality problem, my suggestion is to try several values for the number of clusters and check the label distribution. Just set <code>maxclusters</code> in the following cell and see what's a good candidate! After that, rinse and repeat.

In [None]:
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1) # Positional "axis" is not accepted by recent pandas versions
maxclusters = 10
seed = 0
save_k = 10
for kclusters in range(3,maxclusters+1):
    print()
    print('Results for K-means with k = ',kclusters)
    # Execute k-means clustering for given conditions
    tmp = KMeans(n_clusters=kclusters, random_state=seed, n_init=12).fit(toronto_grouped_clustering)    
    # Check cluster labels generated for each row in the dataframe
    print('For k = {} and seed = {} the labels are: \n {}'.format(kclusters,seed,tmp.labels_[0:]))
    if kclusters == save_k:
        kmeans = tmp
print()
print('Saved results for kclusters = ',save_k,' in "kmeans"')
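
Reading long label arrays for every value of k gets tedious. A possible shortcut, sketched below on synthetic data, is to print the size of each cluster together with the silhouette score from scikit-learn. The silhouette score is a common heuristic that I am adding here for illustration; it is not used elsewhere in this notebook. The array <code>X</code> is an invented stand-in for <code>toronto_grouped_clustering</code>.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for toronto_grouped_clustering: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

summary = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=12).fit(X)
    sizes = np.bincount(model.labels_, minlength=k)  # number of members per cluster
    score = silhouette_score(X, model.labels_)       # between -1 and 1, higher is better
    summary[k] = (sizes.tolist(), score)
    print('k = {}: sizes = {}, silhouette = {:.3f}'.format(k, sizes.tolist(), score))

# Pick the k with the highest silhouette score
best_k = max(summary, key=lambda k: summary[k][1])
print('Highest silhouette score at k =', best_k)
```

On real, high-dimensional data the picture is rarely this clean, so treat the score as one more hint next to the label distribution, not as a final answer.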