<h3 style="color:red;">The purpose here is to define the sources of university rankings data, then work out how to scrape each one, and finally how to merge the data from the different sources.</h3>
<h4 style="color:blue;">The plan now is to scrape data from three different sources. This means three different webpages, three different scraping algorithms and, finally, the collection of the data. One thing I would like to do is to run a similarity analysis between the sources to see how consistent they are with each other.</h4>

<h1>PROJECT TITLE</h1>
<h2>Supplementary information (code development)</h2>
<h5>By: Aurelio Álvarez Ibarra</h5>

<h3>S1.1 Getting information from the ranking tables</h3>

As usual, first we will download and import the necessary packages and libraries.

In [1]:
# Get packages and libraries ready
!pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd



<h4>Source one: Times Higher Education</h4>
The first source of data will be the ranking by <a href="https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking">Times Higher Education</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to some Latin American countries. The ones available in this ranking are:
<ul>
    <li>Argentina (identified as AR)</li>
    <li>Brazil (identified as BR)</li>
    <li>Chile (identified as CL)</li>
    <li>Colombia (identified as CO)</li>
    <li>Costa Rica (identified as CR)</li>
    <li>Cuba (identified as CU)</li>
    <li>Jamaica (identified as JM)</li>
    <li>Mexico (identified as MX)</li>
    <li>Peru (identified as PE)</li>
    <li>Puerto Rico (identified as PR)</li>
    <li>Venezuela (identified as VE)</li>
</ul>
Let's try this first source with the full list. In the process, the structure of the webpage will be studied in order to systematically retrieve the data for every country.

In [2]:
# Save data from webpage
myurl = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/locations/AR/sort_by/rank/sort_order/asc/cols/stats'
## The structure of the webpage is:
##   1.- Main URL and starting page (0):
##       https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/
##   2.- Number of results per page (-1 means "All"):
##       length/-1/
##   3.- Location definition (check the country identifier above):
##       locations/AR/
##   4.- Sorting parameters:
##       sort_by/rank/sort_order/asc/
##   5.- Information shown ("stats"=ranking data; "scores"=scoring data):
##       cols/stats
source = requests.get(myurl)
print(source)

<Response [403]>
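The URL layout described in the comments above lends itself to a small helper that builds the address for any country code. This is only a sketch: <code>build_the_url</code> and its parameter names are hypothetical and not part of the original code, and the path segments are simply the ones listed in the comments.

```python
# Hypothetical helper: assembles the THE ranking URL from the pieces
# described in the comments above (names are illustrative only).
BASE = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking'

def build_the_url(country=None, page=0, length=-1, cols='stats'):
    parts = ['page', str(page), 'length', str(length)]
    if country is not None:
        parts += ['locations', country]  # optional country filter
    parts += ['sort_by', 'rank', 'sort_order', 'asc', 'cols', cols]
    return BASE + '#!/' + '/'.join(parts)

country_codes = ['AR', 'BR', 'CL', 'CO', 'CR', 'CU', 'JM', 'MX', 'PE', 'PR', 'VE']
the_urls = {code: build_the_url(code) for code in country_codes}
print(the_urls['AR'])
```

Calling <code>build_the_url()</code> without a country gives the unfiltered address.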


The Times Higher Education server denies this type of request (status 403). Some webpages require a <code>User-Agent</code> to be defined, as explained <a href="https://stackoverflow.com/questions/38489386/python-requests-403-forbidden">here</a>. On top of that, when retrieving the table section of the webpage, the <code>tbody</code> tag (which encloses the data in the body of the table) is empty. According to this <a href="https://stackoverflow.com/questions/49260014/beautifulsoup-returns-empty-td-tags">source</a>, the table data may be generated by a script rather than being in the HTML code itself. Thus, extra information must be included in the header and in the <code>requests.get</code> arguments to get the results of such a script.

In [4]:
# Save data from webpage
myurl = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/sort_by/rank/sort_order/asc/cols/stats'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
source = requests.get(myurl,headers=header).text # .text gives the response body as a string; without it, source would be a Response object
mysoup = BeautifulSoup(source,'lxml')
mytable = mysoup.find('table')

The previous cell also results in a table object with an empty <code>tbody</code> tag. Unfortunately, I could not figure out how to get the actual data from the table. I checked many sources, but it would take me too long to understand them right away. I tried another retriever but got the same result.

In [2]:
### My try with urllib.request... generates the same result (empty tbody)
from bs4 import BeautifulSoup
import urllib.request

url_to_scrape = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/sort_by/rank/sort_order/asc/cols/stats'
soup = BeautifulSoup(urllib.request.urlopen(url_to_scrape).read(),'lxml') # Parser given explicitly, as in the previous cells
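A possible explanation for the empty <code>tbody</code>, offered here as an assumption rather than a verified diagnosis: everything after the <code>#</code> in these URLs is a fragment, and HTTP clients do not send fragments to the server. Both retrievers therefore ask for the same bare page no matter which filters appear in the URL, and the filtering is presumably done afterwards by the page's own script. The standard library can show the split:

```python
from urllib.parse import urlsplit

url_to_check = ('https://www.timeshighereducation.com/world-university-rankings/'
                '2020/world-ranking#!/page/0/length/-1/sort_by/rank/sort_order/asc/cols/stats')
parts = urlsplit(url_to_check)
# What a client actually requests: the URL without its fragment
requested = parts._replace(fragment='').geturl()
print('Requested from server:', requested)
print('Kept by the client   :', parts.fragment)
```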

I decided to drop source one...

<h4>Source two: U.S. News</h4>
The second source of data will be the ranking by <a href="https://www.usnews.com/education/best-global-universities/rankings">U.S. News</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to the Latin American countries in their search results. <br><br>
Looking at the structure of the page, the data is not in a table but in a set of <code>div</code> tags with <code>class="sep"</code>. A concerning detail is that it is not possible to ask for the full list in one shot (there is no option like "Show all" in this source), but this will be solved later. For the first page of results for the Latin American universities, the retrieving code is the following:

In [341]:
# Save data from webpage
myurl = 'https://www.usnews.com/education/best-global-universities/search?region=latin-america'
#header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
header = {"User-Agent":"Mozilla/5.0"}
source = requests.get(myurl,headers=header).text # .text gives the response body as a string; without it, source would be a Response object
mysoup = BeautifulSoup(source,'lxml')


# I need this function to find the div with the specific class
mydivs = mysoup.find_all(lambda tag: tag.name == 'div' and
                         tag.get('class') == ['sep'])
print("Elements in the search result: ",len(mydivs))

Elements in the search result:  10


Using this method, I get a list as a result. Its length tells me how many elements I have to analyze. For the first ten results in the Latin America ranking, the following data can be retrieved:

In [7]:
import re # For the regex search in the local ranking
###
univnames = []
for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
    univnames.append(myUname.text.strip())
###
LAranks = []
for myLArank in mysoup.find_all("div",class_="thumb-left"): # All tags with rank in Latin America (see myurl)
    # Rank comes with a # sign and maybe a TIE string. Convert to number
    LAranks.append(int(re.findall("[0-9]+",str(myLArank))[0]))
###
globalscores = []
for myGscore in mysoup.find_all("div",class_="t-large t-strong t-constricted"): # All tags with global score number
    globalscores.append(myGscore.text)
###
countries = []
cities = []
for mylocation in mysoup.find_all("div",class_="t-taut"): # All tags with location
    countries.append(mylocation.span.text.strip())
    cities.append(mylocation.find("span",class_="t-dim t-small").text.strip())
###
for ind in range(len(mydivs)):
    print('** {}, #{} in Latin America with a global score of {}, is located in {} ({}).'.format(univnames[ind],LAranks[ind],globalscores[ind],cities[ind],countries[ind]))

** Universidade de São Paulo, #1 in Latin America with a global score of 66.4, is located in São Paulo (Brazil).
** Pontificia University Católica de Chile, #2 in Latin America with a global score of 57.2, is located in Santiago (Chile).
** State University of Campinas, #3 in Latin America with a global score of 56.7, is located in Campinas, São Paulo (Brazil).
** Federal University of Rio de Janeiro, #4 in Latin America with a global score of 54.9, is located in Rio de Janeiro (Brazil).
** University of Buenos Aires, #5 in Latin America with a global score of 53.7, is located in Buenos Aires City, Buenos Aires (Argentina).
** National Autonomous University of Mexico, #6 in Latin America with a global score of 53.4, is located in Ciudad de México, Distrito Federal (Mexico).
** University of Chile, #7 in Latin America with a global score of 53.1, is located in Santiago (Chile).
** University of the Andes Colombia, #8 in Latin America with a global score of 51.4, is located in Bogotá, DC
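The rank conversion used above takes the first run of digits found in the tag. As a stand-alone sketch (the helper name <code>parse_rank</code> and the sample strings are made up for illustration), the same idea can return <code>None</code> when no number is present instead of raising an error:

```python
import re

def parse_rank(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"[0-9]+", text)
    return int(match.group()) if match else None

for sample in ['#1', '#12 Tie', 'Unranked']:
    print(repr(sample), '->', parse_rank(sample))
```

Passing the tag's text instead of <code>str(tag)</code> would also avoid picking up digits that happen to appear in the tag's attributes.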

For the next pages, an extra string appears in the URL: <code>page=#</code>, where <code>#</code> goes from 2 to 10 in this particular case. For <code>#</code>=1, this "extra" string does not appear. The following loop retrieves as many pages as stated in <code>numpages</code>.

In [8]:
# Initialize fields
univnames = []
LAranks = []
globalscores = []
countries = []
cities = []
pagenum = []
###

# Defining extra strings in URL and setting up request
myurl = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
extrastring = ['']
for mypages in range(2,numpages+1):
    mytext = '?page='
    extrastring.append(mytext+str(mypages))
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
header = {"User-Agent":"Mozilla/5.0"}

# Loop for pages 1 to numpages
for n in range(1,numpages+1):
    url = myurl + extrastring[n-1]
    source = requests.get(url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')

    ### University names
    for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
        univnames.append(myUname.text.strip())
    ### University rank (in Latin America)
    for myLArank in mysoup.find_all("div",class_="thumb-left"): # All tags with rank in LatinAmerica (see myurl)
        try:
            LAranks.append(int(re.findall("[0-9]+",str(myLArank))[0]))
        except IndexError: # No digits found in this tag
            LAranks.append(None)
    ### University global score
    for myGscore in mysoup.find_all("div",class_="t-large t-strong t-constricted"): # All tags with global score number
        globalscores.append(myGscore.text.strip())
    ### University location (country and city)
    for mylocation in mysoup.find_all("div",class_="t-taut"): # All tags with location
        countries.append(mylocation.span.text.strip())
        cities.append(mylocation.find("span",class_="t-dim t-small").text.strip())
    ### Page number of this batch of data
    maxscores = len(mysoup.find_all("div",class_="t-large t-strong t-constricted"))
    for mypage in range(1,maxscores+1):
        pagenum.append(n)
# Universities without global score do not have complete data (see below), thus they will be dropped
##for ind in range(len(globalscores)):
##    print('** {}, #{} in Latin America with a global score of {}, is located in {} ({}).'.format(univnames[ind],LAranks[ind],globalscores[ind],cities[ind],countries[ind]))
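The page addresses can also be generated in a single expression. The sketch below follows the rule stated above (no extra string for page 1, <code>?page=#</code> afterwards); <code>page_urls</code> is a hypothetical name that does not appear in the cell above.

```python
base_url = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
# Page 1 is the bare URL; later pages carry the ?page=N suffix
page_urls = [base_url if n == 1 else '{}?page={}'.format(base_url, n)
             for n in range(1, numpages + 1)]
print(page_urls[0])
print(page_urls[-1])
```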

Let's combine all the lists in a dataframe

In [9]:
#import pandas as pd
maxlength = len(globalscores) # Only universities with a global score are kept
USN_df=pd.DataFrame({'University':univnames[0:maxlength],
                        'LatinAmericaRank':LAranks[0:maxlength],
                        'GlobalScore':globalscores[0:maxlength],
                        'Country':countries[0:maxlength],
                        'City':cities[0:maxlength],
                        'PageNumber':pagenum[0:maxlength]}
                      )
USN_df.head(10)

Unnamed: 0,University,LatinAmericaRank,GlobalScore,Country,City,PageNumber
0,Universidade de São Paulo,1,66.4,Brazil,São Paulo,1
1,Pontificia University Católica de Chile,2,57.2,Chile,Santiago,1
2,State University of Campinas,3,56.7,Brazil,"Campinas, São Paulo",1
3,Federal University of Rio de Janeiro,4,54.9,Brazil,Rio de Janeiro,1
4,University of Buenos Aires,5,53.7,Argentina,"Buenos Aires City, Buenos Aires",1
5,National Autonomous University of Mexico,6,53.4,Mexico,"Ciudad de México, Distrito Federal",1
6,University of Chile,7,53.1,Chile,Santiago,1
7,University of the Andes Colombia,8,51.4,Colombia,"Bogotá, DC",1
8,Universidad Tecnica Federico Santa Maria,9,51.1,Chile,"Valparaiso, Valparaiso",1
9,Federal University of Rio Grande do Sul,10,50.4,Brazil,"Porto Alegre, Rio Grande do Sul",1


Details for the university ranking appear when clicking on the university's name. The resulting URL includes a variant of the university name and a numeric identifier. This information can be found in an <code>a</code> tag inside the <code>h2</code> tag used to retrieve the university name. The previous data retriever will be reused to get the URL for each university.

In [10]:
# Initialize fields
univurl = []

# Defining extra strings in URL and setting up request
myurl = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
extrastring = ['']
for mypages in range(2,numpages+1):
    mytext = '?page='
    extrastring.append(mytext+str(mypages))
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
header = {"User-Agent":"Mozilla/5.0"}


# Loop for pages 1 to numpages
for n in range(1,numpages+1):
    url = myurl + extrastring[n-1]
    source = requests.get(url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')

    ### University URLs
    for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
        univurl.append(myUname.a['href'])
# Append URL list to dataframe
tmp = pd.DataFrame({'USN_URL':univurl[0:len(globalscores)]})
USN_df = USN_df.join(tmp)
USN_df

Unnamed: 0,University,LatinAmericaRank,GlobalScore,Country,City,PageNumber,USN_URL
0,Universidade de São Paulo,1,66.4,Brazil,São Paulo,1,https://www.usnews.com/education/best-global-u...
1,Pontificia University Católica de Chile,2,57.2,Chile,Santiago,1,https://www.usnews.com/education/best-global-u...
2,State University of Campinas,3,56.7,Brazil,"Campinas, São Paulo",1,https://www.usnews.com/education/best-global-u...
3,Federal University of Rio de Janeiro,4,54.9,Brazil,Rio de Janeiro,1,https://www.usnews.com/education/best-global-u...
4,University of Buenos Aires,5,53.7,Argentina,"Buenos Aires City, Buenos Aires",1,https://www.usnews.com/education/best-global-u...
...,...,...,...,...,...,...,...
78,Universidade Tecnologica Federal do Parana,79,16.5,Brazil,Curitiba,8,https://www.usnews.com/education/best-global-u...
79,Universidade Federal de Sergipe,80,16.3,Brazil,São Cristóvão,8,https://www.usnews.com/education/best-global-u...
80,Universidade Federal Rural de Pernambuco (UFRPE),81,15.9,Brazil,"Recife, PE",9,https://www.usnews.com/education/best-global-u...
81,Universidad Autonoma de Baja California,82,15.6,Mexico,"Mexicali, Baja California",9,https://www.usnews.com/education/best-global-u...


With the URL retrieved, the data used to score the universities is available. The following code retrieves such data. Let's try and study with the first element of the URL list.

In [270]:
details_url = USN_df['USN_URL'][0]
print(details_url)
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
header = {"User-Agent":"Mozilla/5.0"}
source = requests.get(details_url,headers=header).text
mysoup = BeautifulSoup(source,'lxml')

https://www.usnews.com/education/best-global-universities/universidade-de-sao-paulo-500437


In [271]:
# Getting the address (location)
maincontent = mysoup.find('div',class_='maincontent')
dirdata = maincontent.find('div',class_='directory-data')
address = ''
for data in dirdata.find_all('div'):
    address = address + data.text + ' '
address = address.strip()
print(address)

Av. Prof. Almeida Prado, nº1280 - Butantã São Paulo, 05508-070 Brazil


In [272]:
# Getting the webpage
# Both location and webpage are under the same tree of tags... I have to get both of them in one shot!
maincontent = mysoup.find_all('div',class_='directory-data')
address = maincontent[0]
webpage = maincontent[1]
text = ''
for data in address.find_all('div'):
    text = text + data.text + ' '
address = text.strip()
webpage = webpage.find('a')['href']
print(address)
print(webpage)

Av. Prof. Almeida Prado, nº1280 - Butantã São Paulo, 05508-070 Brazil
http://www5.usp.br/en/ 


In [273]:
# Getting university details
maincontent = mysoup.find('div',id='directoryPageSection-institution-data')
#print(maincontent.prettify())
detailname = []
detailval = []
for name in maincontent.find_all('div',class_="t-dim"):
    detailname.append(name.text.strip())
for value in maincontent.find_all('div',class_="right t-strong"):
    detailval.append(value.text.strip())
# Print dataframe
tmp = pd.DataFrame({'Detail':detailname,'Value':detailval})
tmp

Unnamed: 0,Detail,Value
0,Total number of students,83214
1,Number of international students,3161
2,Total number of academic staff,5230
3,Number of international staff,258
4,Number of undergraduate degrees awarded,8207
5,Number of master's degrees awarded,3742
6,Number of doctoral degrees awarded,3078
7,Number of research only staff,0
8,Number of new undergraduate students,10978
9,Number of new master's students,4697


This dataframe cannot be joined to the ranking dataframe as it is here. The best way to do it is to generate a dataframe with columns labeled as the "Detail" shown here, and rows with the "Values". In that way, after filling the whole dataframe it can be straightforwardly joined to the ranking dataframe. Something like this:

In [274]:
tmp = tmp.T
tmp.rename(columns=tmp.iloc[0],inplace=True)
tmp = tmp.drop(['Detail'])
tmp.reset_index(inplace=True,drop=True)
tmp

Unnamed: 0,Total number of students,Number of international students,Total number of academic staff,Number of international staff,Number of undergraduate degrees awarded,Number of master's degrees awarded,Number of doctoral degrees awarded,Number of research only staff,Number of new undergraduate students,Number of new master's students,Number of new doctoral students
0,83214,3161,5230,258,8207,3742,3078,0,10978,4697,3308


However, some of the pages do not have data at all. Worse, some of them have only part of the data (which rules out simply skipping pages without data). Thus, it is better to create a dictionary for the details of each university. This is taken into account in the following code, which retrieves the available details of each university.

In [None]:
import time # For sleep()

# Initialize data fields
details = [] # list, each element will be a dictionary with details
dictkeys = ['Address','Webpage',
            'Total number of students', 'Number of international students', 'Total number of academic staff', 
            'Number of international staff', 'Number of undergraduate degrees awarded', "Number of master's degrees awarded", 
            'Number of doctoral degrees awarded', 'Number of research only staff', 'Number of new undergraduate students', 
            "Number of new master's students", 'Number of new doctoral students']

# Loop over different URLs given in USN_df['USN_URL']
maxurl = len(USN_df['USN_URL'])

header = {"User-Agent":"Mozilla/5.0"}
for urlnum in range(maxurl):
    time.sleep(1) # Wait 1 second so the server will not deny my next calls
    mydict = {x:None for x in dictkeys} # Initialize dictionary for current university page
    details_url = USN_df['USN_URL'][urlnum]
    source = requests.get(details_url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')
###    # To check that the access was not denied!
###    print('HEAD.TITLE of retrieved mysoup object: ',mysoup.head.title)
    
    # Getting the webpage
    maincontent = mysoup.find_all('div',class_='directory-data')
    address = maincontent[0]
    webpage = maincontent[1]
    text = ''
    for data in address.find_all('div'):
        text = text + data.text + ' '
    mydict['Address'] = text.strip()
    mydict['Webpage'] = webpage.find('a')['href']

    # Getting university details
    maincontent = mysoup.find('div',id='directoryPageSection-institution-data')
    # Initialize details lists
    detailname = []
    detailvalue = []
    for name in maincontent.find_all('div',class_="t-dim"):
        detailname.append(name.text.strip())
    for value in maincontent.find_all('div',class_="right t-strong"):
        detailvalue.append(value.text.strip())
    # At this moment I have a list of details name and values. Put them in the dictionary
    for i in range(len(detailname)):
        mydict[detailname[i]] = detailvalue[i]
    details.append(mydict)
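Why pre-fill each dictionary with <code>None</code>? A stand-alone sketch with made-up values (the two sample pages below are invented, not scraped) shows that every record ends up with the same keys even when a page lists only some of the details, so the records can later be stacked into one table without misaligned columns:

```python
dictkeys = ['Address', 'Total number of students', 'Number of international students']

# Invented scraping results: the second page lacks one of the details
scraped_pages = [
    (['Total number of students', 'Number of international students'], ['1000', '50']),
    (['Total number of students'], ['2000']),
]

details = []
for names, values in scraped_pages:
    record = {key: None for key in dictkeys}  # same keys for every page
    record.update(zip(names, values))         # fill in what the page provides
    details.append(record)

for record in details:
    print(record)
```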

In [342]:
extended_df = pd.DataFrame.from_dict(details)
extended_df.head(10)

Unnamed: 0,Address,Webpage,Total number of students,Number of international students,Total number of academic staff,Number of international staff,Number of undergraduate degrees awarded,Number of master's degrees awarded,Number of doctoral degrees awarded,Number of research only staff,Number of new undergraduate students,Number of new master's students,Number of new doctoral students
0,"Av. Prof. Almeida Prado, nº1280 - Butantã São ...",http://www5.usp.br/en/,83214.0,3161.0,5230.0,258.0,8207.0,3742.0,3078.0,0.0,10978.0,4697.0,3308.0
1,Avda. Libertador Bernardo O'Higgins 340 Santi...,http://www.uc.cl/,28541.0,2007.0,1900.0,207.0,2981.0,,121.0,270.0,5190.0,,268.0
2,"Campus Universitário Zeferino Vaz Campinas, Sã...",http://www.unicamp.br/unicamp/?language=en,28795.0,974.0,1906.0,106.0,2500.0,1342.0,997.0,94.0,3353.0,2110.0,1511.0
3,"Av. Pedro Calmon, 550 Rio de Janeiro, 21941-90...",http://www.ufrj.br/,,,,,,,,,,,
4,"Viamonte 430 st. Buenos Aires City, Buenos Air...",http://www.uba.ar/ingles/index03.php,,,,,,,,,,,
5,"Av. Universidad 3000, Copilco Universidad, Coy...",http://www.unam.mx/index/en,,,,,,,,,,,
6,Av. Libertador Bernardo O'Higgins 1058 Santiag...,http://www.uchile.cl/english,,,,,,,,,,,
7,"Carrera Primera #18A-12 Bogotá, DC Colombia",http://www.uniandes.edu.co/,,,,,,,,,,,
8,"Av. Espana 1680 Valparaiso, Valparaiso Chile",http://www.usm.cl/,,,,,,,,,,,
9,"Av. Paulo Gama, 110 Porto Alegre, Rio Grande d...",http://www.ufrgs.br/english/home,,,,,,,,,,,


Now, the merging of this extended dataframe with the ranking dataframe is straightforward.

In [None]:
USN_fulldf = USN_df.join(extended_df)
USN_fulldf.to_csv('USN_dataframe.csv',encoding="utf-8-sig")
USN_fulldf

This ends the data retrieving for source number two. YAY!

<h4>Source three: QS (Quacquarelli Symonds) Top Universities</h4>
The third source of data will be the ranking by <a href="https://www.topuniversities.com/university-rankings/world-university-rankings/2020">QS (Quacquarelli Symonds) Top Universities</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to the Latin American countries in their search results. <br><br>
Looking at the structure of the page, the data appears to be in a <code>table</code> tag with <code>id="qs-rankings"</code>. The first attempt at retrieving it is the following:

In [3]:
myurl="https://www.topuniversities.com/university-rankings/world-university-rankings/2020"
#header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
header = {"User-Agent":"Mozilla/5.0"}
source = requests.get(myurl,headers=header).text
mysoup = BeautifulSoup(source,'lxml')

In [5]:
#content = mysoup.find('div',class_='panel-pane pane-block pane-qs-rankings-datatables-0')
content = mysoup.find('table',id='qs-rankings')

Again, an empty table. After further reading, it turns out that the table is generated by a script. According to the developer tools in the browser, the files that contain the actual information are located at the URLs given in the next code cell. Those are the ones to be retrieved and processed to generate the corresponding dataframe.

In [6]:
urlranks = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/914824.txt"
urlindicators = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/914824_indicators.txt"

header = {"User-Agent":"Mozilla/5.0"}

sourceranks = requests.get(urlranks,headers=header).text # .text gives the response body as a string (here, JSON text)
sourceindic = requests.get(urlindicators,headers=header).text # .text gives the response body as a string (here, JSON text)
mysoupranks = BeautifulSoup(sourceranks,'lxml')
mysoupindic = BeautifulSoup(sourceindic,'lxml')

In [7]:
import time
import re

# Mysoupranks is a single paragraph (p) of HTML which contains the text file
#   with the information required. It can be converted to a dictionary. However
#   it has only one key with all the data inside. The following code will break
#   the information in order to get individual data in a manageable way.
import json
diction = json.loads(mysoupranks.text)

# By checking the structure of "data", the useful data is:
#   Continent is in ['region'] (I can get the Latin American universities with this)
#   Country is in ['country']
#   University name is in ['title']
#   Global score is in ['score']
#   Rank is in ['rank_display']

details = [] # list, each element will be a dictionary with details
dictkeys = ['Name','Region','Country','GlobalScore','GlobalRank']
## interests = ['title','region','country','score','rank_display']

for i in range(len(diction['data'])):
    if diction['data'][i]['region'] == 'Latin America':
        mydict = {x:None for x in dictkeys} # Initialize dictionary for current university
        mydict['Name']        = diction['data'][i]['title']
        mydict['Region']      = diction['data'][i]['region']
        mydict['Country']     = diction['data'][i]['country']
        mydict['GlobalScore'] = diction['data'][i]['score']
        # Some ranks are tied and have an "=" in their value
        mydict['GlobalRank']  = int(re.findall("[0-9]+",str(diction['data'][i]['rank_display']))[0])
        # The final [0] is because re.findall returns a list, in this case with one element.
        details.append(mydict)
basic_df = pd.DataFrame.from_dict(details)
basic_df.head(10)

Unnamed: 0,Name,Region,Country,GlobalScore,GlobalRank
0,Universidad de Buenos Aires (UBA),Latin America,Argentina,66.0,74
1,Universidad Nacional Autónoma de México (UNAM),Latin America,Mexico,58.8,103
2,Universidade de São Paulo,Latin America,Brazil,55.5,116
3,Pontificia Universidad Católica de Chile (UC),Latin America,Chile,53.4,127
4,Tecnológico de Monterrey,Latin America,Mexico,48.5,158
5,Universidad de Chile,Latin America,Chile,45.0,189
6,Universidade Estadual de Campinas (Unicamp),Latin America,Brazil,42.1,214
7,Universidad de los Andes,Latin America,Colombia,39.6,234
8,Universidad Nacional de Colombia,Latin America,Colombia,37.5,253
9,Pontificia Universidad Católica Argentina,Latin America,Argentina,31.7,344
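The filtering step can be tried without downloading anything. The sketch below uses a tiny invented JSON string that only mimics the keys listed in the comments above (<code>region</code>, <code>country</code>, <code>title</code>, <code>score</code>, <code>rank_display</code>); the university names, numbers and rank formats are placeholders, not real ranking data.

```python
import json
import re

# Invented sample with the same keys as the real file
sample_text = '''{"data": [
  {"title": "University A", "region": "Latin America", "country": "Chile",
   "score": "45.0", "rank_display": "=189"},
  {"title": "University B", "region": "Europe", "country": "Spain",
   "score": "50.1", "rank_display": "150"},
  {"title": "University C", "region": "Latin America", "country": "Brazil",
   "score": "", "rank_display": "801-1000"}
]}'''

records = []
for entry in json.loads(sample_text)['data']:
    if entry['region'] != 'Latin America':
        continue
    records.append({
        'Name': entry['title'],
        'Country': entry['country'],
        'GlobalScore': entry['score'],
        # "=189" and "801-1000" both reduce to their first number
        'GlobalRank': int(re.findall('[0-9]+', entry['rank_display'])[0]),
    })

for rec in records:
    print(rec)
```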


Now, it is the turn of the indicators to be processed.

In [28]:
indicators = json.loads(mysoupindic.text)
indicators.pop('columns')
### In columns, I have this data...
#   {'data': '3791737',
#     'title': '<div class="td-wrap"><div class="labl"><div>Citations per Faculty</div></div><div class="sorter"></div></div>',
#     'searchable': False,
#     'orderable': False}
# It seems that instead of names, they used ID numbers for the indicators. I found these:
#     ID = 3791737, for indicator 'Citations per Faculty'
#     ID = 3791738, for indicator 'International Students'
#     ID = 3791739, for indicator 'International Faculty'
#     ID = 3791740, for indicator 'Faculty Student'
#     ID = 3791741, for indicator 'Employer Reputation'
#     ID = 3791742, for indicator 'Academic Reputation'

# The elements in the indicators are strings made of HTML code. I need to parse them
#   before being able to extract the value of the indicator
#   actual_value = BeautifulSoup(indicators['data'][0]['3791737'],'lxml').text

# Example, for university 5 (index 4)
print('Citations per Faculty = ',BeautifulSoup(indicators['data'][4]['3791737'],'lxml').text)
print('International Students = ',BeautifulSoup(indicators['data'][4]['3791738'],'lxml').text)
print('International Faculty = ',BeautifulSoup(indicators['data'][4]['3791739'],'lxml').text)
print('Faculty Student = ',BeautifulSoup(indicators['data'][4]['3791740'],'lxml').text)
print('Employer Reputation = ',BeautifulSoup(indicators['data'][4]['3791741'],'lxml').text)
print('Academic Reputation = ',BeautifulSoup(indicators['data'][4]['3791742'],'lxml').text)

Citations per Faculty =  100
International Students =  87.3
International Faculty =  99.4
Faculty Student =  100
Employer Reputation =  81.2
Academic Reputation =  97.8
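Rather than hard-coding the ID numbers, the indicator names could in principle be read from the <code>columns</code> entry that the cell above discards. The sketch below assumes <code>columns</code> is a list of dictionaries shaped like the one quoted in the comments; the sample list and the helper <code>strip_tags</code> are illustrative, and the standard-library parser is used only to keep the example self-contained.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text found between HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html_text):
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    return ''.join(collector.chunks).strip()

# Sample entries modelled on the 'columns' item shown in the comments
wrapper = ('<div class="td-wrap"><div class="labl"><div>{}</div></div>'
           '<div class="sorter"></div></div>')
sample_columns = [
    {'data': '3791737', 'title': wrapper.format('Citations per Faculty')},
    {'data': '3791742', 'title': wrapper.format('Academic Reputation')},
]
id_to_name = {col['data']: strip_tags(col['title']) for col in sample_columns}
print(id_to_name)
```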


In [39]:
indicators = json.loads(mysoupindic.text)
#indicators.pop('columns')

# Initialize variables
details = [] # list, each element will be a dictionary with details
# Map each indicator ID used in the data file to its column name
indicator_ids = {'3791737': 'Citations_per_Faculty',
                 '3791738': 'International_Students',
                 '3791739': 'International_Faculty',
                 '3791740': 'Faculty_Student',
                 '3791741': 'Employer_Reputation',
                 '3791742': 'Academic_Reputation'}
dictkeys = list(indicator_ids.values())

for entry in indicators['data']:
    if entry['region'] == 'Latin America':
        mydict = {x:None for x in dictkeys} # Initialize dictionary for current university
        for ind_id, ind_name in indicator_ids.items():
            try:
                mydict[ind_name] = BeautifulSoup(entry[ind_id],'lxml').text
            except Exception:
                pass # Indicator not available for this university: keep None
        details.append(mydict)

indic_df = pd.DataFrame.from_dict(details)
indic_df.head(10)
# For the 'data' (remaining) key, the interests are:
# interests = ['region' = 'Latin America','overall_rank','uni'.text,
#              '','','','','',]

Unnamed: 0,Citations_per_Faculty,International_Students,International_Faculty,Faculty_Student,Employer_Reputation,Academic_Reputation
0,2.4,64.7,50.7,77.4,91.3,87.2
1,3.8,4.3,13.8,57.6,91.0,90.9
2,35.2,3.7,8.9,25.2,73.3,88.3
3,13.6,4.2,19.4,28.6,95.5,85.2
4,4.6,18.4,98.2,89.5,88.9,36.9
5,14.5,8.4,10.1,16.3,90.8,71.6
6,32.7,4.3,9.9,21.1,34.5,67.5
7,8.1,3.1,32.1,27.6,87.9,54.4
8,5.3,1.3,8.3,11.1,89.7,61.5
9,1.1,13.2,2.9,95.3,44.3,17.7


Now, merging the indicators dataframe with the ranking dataframe is straightforward.

In [40]:
QSTopU_fulldf = basic_df.join(indic_df)
QSTopU_fulldf.to_csv('QSTopU_dataframe.csv',encoding="utf-8-sig")
QSTopU_fulldf

Unnamed: 0,Name,Region,Country,GlobalScore,GlobalRank,Citations_per_Faculty,International_Students,International_Faculty,Faculty_Student,Employer_Reputation,Academic_Reputation
0,Universidad de Buenos Aires (UBA),Latin America,Argentina,66,74,2.4,64.7,50.7,77.4,91.3,87.2
1,Universidad Nacional Autónoma de México (UNAM),Latin America,Mexico,58.8,103,3.8,4.3,13.8,57.6,91,90.9
2,Universidade de São Paulo,Latin America,Brazil,55.5,116,35.2,3.7,8.9,25.2,73.3,88.3
3,Pontificia Universidad Católica de Chile (UC),Latin America,Chile,53.4,127,13.6,4.2,19.4,28.6,95.5,85.2
4,Tecnológico de Monterrey,Latin America,Mexico,48.5,158,4.6,18.4,98.2,89.5,88.9,36.9
...,...,...,...,...,...,...,...,...,...,...,...
83,Universidade Federal de São Carlos (UFSCar),Latin America,Brazil,,801,,,,24,,
84,Universidade Federal de Viçosa (UFV),Latin America,Brazil,,801,,,,,,
85,Universidade Federal do Paraná - UFPR,Latin America,Brazil,,801,,,,,,
86,Universidade Federal de Pernambuco (UFPE),Latin America,Brazil,,801,,,,28.5,,
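
One detail worth keeping in mind about the merge above: <code>join</code> pairs rows by index label, so it works here because both dataframes were built in the same order and share the default 0, 1, 2, ... index. The following is a minimal sketch with invented values (<code>left</code> and <code>right</code> are illustrative stand-ins, not the ranking data):

```python
import pandas as pd

# Toy stand-ins for basic_df and indic_df (values invented for illustration)
left = pd.DataFrame({'Name': ['Uni A', 'Uni B', 'Uni C']})
right = pd.DataFrame({'Academic_Reputation': [87.2, 90.9, 88.3]})

# join() matches rows by index label, not by position or by a name column
joined = left.join(right)

# Reordering one frame does not change the pairing: the index labels decide
shuffled = right.iloc[::-1]            # index is now 2, 1, 0
joined_shuffled = left.join(shuffled)

print(joined.equals(joined_shuffled))
```

If the two dataframes had been built from differently ordered lists, the index would no longer identify the same university in both, and merging on an explicit key column would be the safer choice.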


<hr></hr><hr></hr><hr></hr>


<h4>0.2 Getting coordinates for the neighborhoods from Problem 2</h4>

In [None]:
# Convert file to dataframe
toronto_df = pd.read_csv('toronto_data.csv')
# Changing the name of the first column of downloaded data
toronto_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)
# Merging provided data into the original dataframe
# dataframe is the original data retrieved and cleaned from wikipedia
# toronto_df is the downloaded data
full_df = pd.merge(dataframe, toronto_df, on='PostalCode')
full_df.drop_duplicates(inplace=True) # Dropping duplicated rows
print('Shape of merged dataframe: ',full_df.shape)
full_df.head(10)

<h4>1.1 Analyzing neighborhoods in Toronto</h4>
The purpose of the following code is to group (cluster) the neighborhoods of Toronto in order to see how similar they are to one another and which types of facilities (venues) they have. Maybe you would like to visit neighborhoods with coffee shops and bars one day, and neighborhoods with malls and beauty shops another day!

In [None]:
# Get required packages and libraries ready

import numpy as np # library to handle data in a vectorized manner

# import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

<h4>1.2 A first look at Toronto</h4>
Let's get some characteristics of the dataframe we have, as well as the location of Toronto on a map.

In [None]:
# How many boroughs and neighborhoods does Toronto have?
print('The dataframe "full_df" for Toronto has {} boroughs and {} neighborhoods.'
      .format(len(full_df['Borough'].unique()),
              full_df.shape[0]
    )
)

In [None]:
# Where is Toronto?
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# Create map of Toronto using its latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers of neighborhoods to map
for lat, lng, borough, neighborhood, pcode in zip(full_df['Latitude'], full_df['Longitude'],
                                                  full_df['Borough'], full_df['Neighborhood'],
                                                  full_df['PostalCode']):
    label = '{} ({}) {}'.format(neighborhood, borough, pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) # Do not forget to add CircleMarker to the map!!  
    
map_toronto

In order to simplify the analysis, the exercise suggests performing it only on boroughs that include 'Toronto' in their name. Let's extract that information:

In [None]:
# Define the dataframe by appending the desired boroughs
tmp = []
for i,x in enumerate(full_df['Borough']): # Create an enumerated list of boroughs
    if 'Toronto' in x: # Check if Toronto appears in the borough's name
        tmp.append(full_df.iloc[i])

justtoronto_df = pd.DataFrame(tmp).reset_index(drop=True) # Transform result to dataframe
print('Shape of dataframe for Toronto boroughs: ',justtoronto_df.shape)
justtoronto_df.head()
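
The same selection can also be written without an explicit loop, using a boolean mask. This is only a sketch on an invented dataframe (<code>toy_df</code> and its rows are assumptions for illustration); on the real data the mask would be built from <code>full_df['Borough']</code>.

```python
import pandas as pd

# Toy stand-in for full_df (rows invented for illustration)
toy_df = pd.DataFrame({
    'PostalCode': ['M4E', 'M1B', 'M5A', 'M9A'],
    'Borough': ['East Toronto', 'Scarborough', 'Downtown Toronto', 'Etobicoke'],
})

# True for every row whose borough name contains 'Toronto'
mask = toy_df['Borough'].str.contains('Toronto')

# Keep only those rows and renumber the index from zero
just_toronto = toy_df[mask].reset_index(drop=True)

print(just_toronto)
```

Both versions should give the same rows; the mask simply lets pandas do the iteration.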

Now, let's adapt the map to the Toronto zone:

In [None]:
# Just changed "full_df" to "justtoronto_df"
# And I will overwrite the previous map
# Create map of Toronto using its latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11) # Larger zoom

# Add markers of neighborhoods to map
for lat, lng, borough, neighborhood, pcode in zip(justtoronto_df['Latitude'], justtoronto_df['Longitude'],
                                                  justtoronto_df['Borough'], justtoronto_df['Neighborhood'],
                                                  justtoronto_df['PostalCode']):
    label = '{} ({}) {}'.format(neighborhood, borough, pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) # Do not forget to add CircleMarker to the map!!  
    
map_toronto

<h4>2.1 Setting up Foursquare credentials</h4>
Use your own Foursquare credentials here, and keep real values out of any shared copy of the notebook so nobody eats up your call credit!

In [None]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

<h4>2.2 Exploring one neighborhood</h4>
In order to make things clear, let's establish the analysis plan using just one neighborhood. Choose by setting a number between 0 and 38 in the following cell.

In [None]:
# Setting up neighborhood to be analyzed
nnum = 5

myneigh = justtoronto_df.loc[nnum, 'Neighborhood']
myneigh_lat = justtoronto_df.loc[nnum, 'Latitude'] # neighborhood latitude value
myneigh_lon = justtoronto_df.loc[nnum, 'Longitude'] # neighborhood longitude value

print('Your selected neighborhood is {}, located at (latitude,longitude) = ({},{}).'
      .format(myneigh, myneigh_lat, myneigh_lon))
print('Don\'t forget to update this cell when you want to analyze another neighborhood!')

The following code requests the top 100 venues within 500 meters of the location of your neighborhood:

In [None]:
LIMIT=100 # Remember the number and type of calls you have in your credit
radius=500 # in meters
# The URL structure is straightforward to read.
# Just remember the information you have to provide for each type of request.
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
           CLIENT_ID,
           CLIENT_SECRET,
           VERSION,
           myneigh_lat,
           myneigh_lon,
           radius,
           LIMIT)
url
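
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a sketch with placeholder credentials and made-up coordinates; the parameter names are the ones that appear in the URL above.

```python
from urllib.parse import urlencode

# Placeholder values; replace with real credentials and coordinates
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',
    'll': '{},{}'.format(43.65, -79.38),  # latitude,longitude
    'radius': 500,                        # in meters
    'limit': 100,
}

base = 'https://api.foursquare.com/v2/venues/explore'

# urlencode joins the pairs with '&' and escapes special characters
url = base + '?' + urlencode(params)
print(url)
```

With <code>requests</code>, the same dictionary can be passed directly as <code>requests.get(base, params=params)</code>, which performs the encoding for you.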

In [None]:
# Call to Foursquare. Do not run this cell more often than needed!!!
results = requests.get(url).json()
### results # Careful. Long result ahead. Uncomment just to be sure that it worked

All the information is in the <i>items</i> key. The following function <code>get_category_type</code> is used to extract the name of a category (remember the structure of the information in the <code>json</code> files).

In [None]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # the column is still called 'venue.categories' before renaming
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

The previous function helps to clean the data from the request:

In [None]:
# Getting "items" to work with a smaller amount of data
venues = results['response']['groups'][0]['items']

# Convert JSON-style data into a table
nearby_venues = json_normalize(venues)

# Getting only the columns we will use
# The names come by looking at the json_normalize result
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns] # All rows, only the filtered columns

# venue.categories looks messy from the previous result. This is why you apply "get_category_type"
#   to that column, then you get the cleaned name. Of course, the function's design comes after
#   checking the data structure in "venues".
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# Strip the "venue." prefix (and any other dotted prefix) from the column names
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# Check the result
print('{} venues were returned by Foursquare in {}.'.format(nearby_venues.shape[0],myneigh))
nearby_venues.head()
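
To see where column names such as <code>venue.location.lat</code> come from, here is a small sketch on hand-written records that imitate the nesting of the Foursquare <i>items</i>. The data is invented, and <code>pd.json_normalize</code> assumes pandas 1.0 or later.

```python
import pandas as pd

# Hand-written records mimicking the nesting of Foursquare "items" (invented data)
items = [
    {'venue': {'name': 'Cafe X',
               'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.65, 'lng': -79.38}}},
    {'venue': {'name': 'Park Y',
               'categories': [],
               'location': {'lat': 43.66, 'lng': -79.39}}},
]

# Nested dictionaries become dotted column names; lists are left untouched
flat = pd.json_normalize(items)
print(sorted(flat.columns))

# Take the first category name, or None when the list is empty
flat['category'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)
print(flat[['venue.name', 'category']])
```

The second record shows why <code>get_category_type</code> checks for an empty list before indexing.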

<h4>3.1 Exploring the full zone</h4>
Now that the analysis has been done for one neighborhood, it can be extended to the full set of neighborhoods in the selected region of Toronto.

The following function repeats the previous steps for a list of neighborhoods, given the names and coordinates of each one (and, optionally, the search radius around the location and the maximum number of venues to retrieve).

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('Searching for venues in ',name,'...')
            
        # Create URL for API request
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make GET request, directly retrieving only the interesting part
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results]) # This is a "list comprehension"
        # In this type of list, you include an implicit for, which can be useful to reduce the number of lines
        #   in a code. In this case, it looks in the "results" data for the specific elements and values of the
        #   previously defined lists.

    # Transform result in dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list]) # Nested list comprehension
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print()
    print('Done!!',end='\n\n')
    print('Returned a dataframe with shape ',nearby_venues.shape)
    return(nearby_venues)
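
The nested list comprehension used to flatten <code>venues_list</code> can be hard to read at first. The sketch below, on invented data, shows that it is equivalent to two nested <code>for</code> loops.

```python
# One inner list per neighborhood, as built inside the loop (invented data)
venues_list = [
    [('Hood A', 'Cafe X'), ('Hood A', 'Bar Y')],
    [('Hood B', 'Park Z')],
]

# Nested comprehension: the outer "for" comes first, the inner "for" second
flat = [item for venue_list in venues_list for item in venue_list]

# Equivalent explicit loops
flat_loop = []
for venue_list in venues_list:
    for item in venue_list:
        flat_loop.append(item)

print(flat == flat_loop)
```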

Now, apply the function to the full set of neighborhoods in Toronto:

In [None]:
toronto_venues = getNearbyVenues(names=justtoronto_df['Neighborhood'],
                                   latitudes=justtoronto_df['Latitude'],
                                   longitudes=justtoronto_df['Longitude']
                                  )

In [None]:
toronto_venues.head(10)

In [None]:
# How many venues does each neighborhood have?
print('Number of venues retrieved per neighborhood (dataframe):')
toronto_venues.groupby('Neighborhood').count()

Check that the number of venues returned by Foursquare here matches the one in your "one neighborhood" analysis.

In [None]:
# How many types of venues are there in this dataframe?
print('There are {} unique categories of venues in the dataframe.'.format(len(toronto_venues['Venue Category'].unique())))

<h4>3.2 Managing the information</h4>
The following code creates a one-hot encoded dataframe that marks the category of every venue in each neighborhood. The dataframe will be large, but this is only the preparation step.

In [None]:
# One hot encoding
# Create a dummy dataframe with columns after (unique) values in 'Venue Category'
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
# With this you just create a 'Neighborhood' column in toronto_onehot
#   with the info from toronto_venues['Neighborhood']
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move neighborhood column to the first column
# The previous code results in an alphabetical order in the columns (left-to-right)
#   thus let's move the 'Neighborhood' column to the beginning.
colind = toronto_onehot.columns.get_loc("Neighborhood") # Getting the position of column in dataframe
fixed_columns = [toronto_onehot.columns[colind]] + list(toronto_onehot.columns[0:colind]) + list(toronto_onehot.columns[colind+1:])
toronto_onehot = toronto_onehot[fixed_columns]

### Warning! In the lab exercise, the 'Neighborhood' column was added at the end of
###   the dataframe. That is why there you see a '-1' index to refer to that column.
###       fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
###       toronto_onehot = toronto_onehot[fixed_columns]
###   While checking here, I realized the alphabetical order (don't know why!).
###   Thus, I had to modify the code to look for the column by name.

toronto_onehot.head()

The previous dataframe records the occurrence of a given venue category in a particular neighborhood. Let's group the rows by neighborhood and take the <code>mean</code> of each category column to get the frequency of each type (category) of venue per neighborhood. That is, out of all the venues in a given neighborhood, how likely it is to find a given type of venue.
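
Why does a mean give a frequency? Each one-hot column only holds zeros and ones, so its average over the venues of a neighborhood is the share of venues in that category. A minimal sketch on invented data (<code>toy_venues</code> is an assumption for illustration):

```python
import pandas as pd

# Invented venues: three in 'Hood A', one in 'Hood B'
toy_venues = pd.DataFrame({
    'Neighborhood': ['Hood A', 'Hood A', 'Hood A', 'Hood B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Art Gallery', 'Park'],
})

# One row per venue, one 0/1 indicator column per category
onehot = pd.get_dummies(toy_venues[['Venue Category']],
                        prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby('Neighborhood').mean()
print(freq.round(3))
```

In this toy case two of the three venues in 'Hood A' are coffee shops, so that cell holds two thirds.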

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

As you can see in the previous results, you are more likely to find a coffee shop than an art gallery in Berczy Park. This is easier to see if you print the top 5 venues (according to frequency) for each neighborhood.

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----" + hood + "----") # "plus" signs do not work if you mix strings and numbers!
    # T is for Transposed. It gets the venue categories to the index side.
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 3})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

<b>Note</b>: Remember that this <i>frequency</i> analysis depends on the number of venues in the neighborhood. If you see very small numbers in the top 5, it may mean there are many venues in the neighborhood.

To get this information into a dataframe, it is easier to create a function that returns the top venues for a given row. The cell after it will create the dataframe in a readable way.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd'] # Not needed if you use "Venue #X" for X = 1 to num_top_venues

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind])) # Indicators for 1, 2 and 3
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1)) # When you run out of "indicators"

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns) # As wide as num_top_venues + 1
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood'] # Copy neighborhoods from dataframe

for ind in np.arange(toronto_grouped.shape[0]): # For the number of neighborhoods in the dataframe...
    # The function returns the first "num_top_venues" from the ordered list from each row
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
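
The column names above rely on a three-element list plus a fallback to "th", which is fine for 10 columns but would produce "21th" if <code>num_top_venues</code> went past 20. A small helper such as the following (<code>ordinal</code> is a hypothetical name, not part of the original notebook) is one way to cover the general case:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 11 <= n % 100 <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the same kind of column list as in the cell above
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```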

<h4>4.1 Clustering neighborhoods using <i>K means</i></h4>
The following code runs the <code>K means</code> model on several values for number of clusters and random-number-generator seeds.

In [None]:
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
for kclusters in range(3,6):
    print()
    print('Results for K-means with k = ',kclusters)
    for seed in range(0,5):
        # Execute k-means clustering for given conditions
        kmeans = KMeans(n_clusters=kclusters, random_state=seed, n_init=12).fit(toronto_grouped_clustering)    
        # Check cluster labels generated for each row in the dataframe
        print('For k = {} and seed = {} the labels are: \n {}'.format(kclusters,seed,kmeans.labels_[0:]))
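
When comparing label arrays from different seeds, keep in mind that the cluster numbers themselves are arbitrary: two runs can find the same grouping and still number the clusters differently. One simple way to compare them is to renumber the labels in order of first appearance. The helper below is a sketch (<code>canonical_labels</code> is a hypothetical name, and the label arrays are invented):

```python
def canonical_labels(labels):
    """Renumber cluster labels in order of first appearance."""
    mapping = {}
    result = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)  # next unused number
        result.append(mapping[label])
    return result

# Two invented label arrays describing the same grouping with different numbers
run_a = [2, 2, 0, 1, 0]
run_b = [0, 0, 1, 2, 1]

print(canonical_labels(run_a) == canonical_labels(run_b))
```

For a graded measure of agreement rather than a yes/no answer, <code>sklearn.metrics.adjusted_rand_score</code> is a common alternative.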

The <code>seed</code> of the random number generator that initializes the cluster centroids seems to have more influence for lower <code>kclusters</code> values. With <code>kclusters=5</code> the results are the same for every seed tested. Let's use that value for the clustering.

In [None]:
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
kmeans = KMeans(n_clusters=5, random_state=0, n_init=12).fit(toronto_grouped_clustering)
kmeans

Let's complete the dataframe for Toronto neighborhoods with the data from the neighborhoods, cluster label and top venues.

In [None]:
# Add clustering labels to the sorted neighborhood venues
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
# Recover the original dataframe (in this case, "justtoronto_df")
toronto_merged = justtoronto_df

# Add neighborhoods_venues_sorted to toronto_merged according to the neighborhood name
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head(5)

For the final presentation, a map with colored markers for each cluster is shown as follows.

In [None]:
# Getting Toronto's coordinates
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [None]:
# Create map object
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme for each cluster
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.gnuplot(np.linspace(0, 1, len(ys))) # Look for color maps in matplotlib
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, hood, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'],
                                  toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(hood) + ' (in Cluster ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<h4>4.2 Examining clusters</h4>
Why are so many neighborhoods in one specific cluster? Let's see the top venues in each cluster and compare them. Since cluster 0 is the most populated in this run, let's check that one first.

In [None]:
mycluster = 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

<b>NOTE</b>: If you run this notebook again, the "big" cluster can get another label. In this example, it came to be 0.

For cluster 0, coffee shops and cafés are the common venues at the top of the list. What about neighborhoods like "Dufferin, Dovercourt Village" (index 9)? It does not seem very similar. It shares bakery and bar among its top venues with a couple of other neighborhoods, but it still looks rather odd. Maybe the analysis bases the separation on the top venues more than on the whole set. Anyway, remember we are looking at the top venues here, not at every one of them. For the rest of the clusters, the comparison is straightforward:

In [None]:
mycluster = 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
mycluster = 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
mycluster = 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

In [None]:
mycluster = 4
toronto_merged.loc[toronto_merged['Cluster Labels'] == mycluster, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

For the clusters with more than one element, the top venues are very similar, so the clustering makes sense there. It may be a challenge to analyze the data further in order to see why the clustering puts so many neighborhoods in a single cluster (remember the results for <code>kclusters</code> from 3 to 4 at the beginning of section 4.1). Some straightforward ideas on this can be found <a href="https://zerowithdot.com/mistakes-with-k-means-clustering/">here</a> and some solutions are suggested <a href="https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/">here</a>. Since this is a high-dimensional problem, my suggestion is to try several numbers of clusters and check the label distribution. Just set <code>maxclusters</code> in the following cell and see what looks like a good candidate! After that, rinse and repeat.

In [None]:
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
maxclusters = 10
seed = 0
save_k = 10
for kclusters in range(3,maxclusters+1):
    print()
    print('Results for K-means with k = ',kclusters)
    # Execute k-means clustering for given conditions
    tmp = KMeans(n_clusters=kclusters, random_state=seed, n_init=12).fit(toronto_grouped_clustering)    
    # Check cluster labels generated for each row in the dataframe
    print('For k = {} and seed = {} the labels are: \n {}'.format(kclusters,seed,tmp.labels_[0:]))
    if kclusters == save_k:
        kmeans = tmp
print()
print('Saved results for kclusters = ',save_k,' in "kmeans"')
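
Besides reading the label arrays by eye, a numeric score can help when comparing values of <code>kclusters</code>. The sketch below uses the silhouette score on synthetic data (three invented, well-separated groups, not the Toronto venues) together with the cluster sizes; on sparse, high-dimensional data like the venue frequencies the picture may be much less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of 20 points (invented)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=12).fit_predict(points)
    scores[k] = silhouette_score(points, labels)  # ranges from -1 to 1
    sizes = np.bincount(labels)                   # label distribution
    print('k = {}: silhouette = {:.3f}, cluster sizes = {}'.format(k, scores[k], sizes))

# The k with the highest average silhouette is one reasonable candidate
best_k = max(scores, key=scores.get)
print('Highest silhouette score at k =', best_k)
```

To try this on the notebook's data, <code>points</code> would be replaced by <code>toronto_grouped_clustering</code>; a cluster size of 1 in the printed distribution is a hint that a neighborhood is being isolated as an outlier.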