<h3 style="color:red;">The purpose here is to define the source of university rankings data, then check how to scrap each data and then how to merge the data from different sources.</h3>
<h4 style="color:blue;">The plan now is to scrap data from three different sources. This would lead to three different webpages, with three different scrapping algorithms and finally the collection of data. One thing I would like to do is to run a similarity analysis between the sources to see how consistent are between them.</h4>

<h1>PROJECT TITLE</h1>
<h2>Supplementary information (code development)</h2>
<h5>By: Aurelio Álvarez Ibarra</h5>

<h3>S1.1 Getting information from the ranking tables</h3>

As usual, first we will download and import the necessary packages and libraries.

In [1]:
# Get packages and libraries ready
!pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd



<h4>Source one: U.S. News</h4>
The first source of data will be the ranking by <a href="https://www.usnews.com/education/best-global-universities/rankings">U.S. News</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to the Latin American countries in their search results. <br><br>
Looking at the structure of the page, the data is not in a table but in a set of <code>div</code> tags with <code>class="sep"</code>. A concerning details is that it is not possible to ask for the full list in one shot (there is no option like "Show all" in this source), but this will be solved later. For the first page of results for the Latin American universities, the retrieving code is the following:

In [2]:
# Save data from webpage
myurl = 'https://www.usnews.com/education/best-global-universities/search?region=latin-america'
header = {"User-Agent":"Mozilla/5.0"}
source = requests.get(myurl,headers=header).text # Without .text, it gets the response in JSON format
mysoup = BeautifulSoup(source,'lxml')


# I need this function to find the div with the specific class
mydivs = mysoup.find_all(lambda tag: tag.name == 'div' and
                         tag.get('class') == ['sep'])
print("Elements in the search result: ",len(mydivs))

Elements in the search result:  10


Using this method, I get a list as a result. According to its lenght, I know how many elements I shall analyze. For the first ten results in the Latin America ranking, the following data can be retrieved:

In [3]:
import re # For the regex search in the local ranking
###
ind = 0
univnames = []
for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
    univnames.append(myUname.text.strip())
    ind = ind+1
###
ind = 0
LAranks = []
for myLArank in mysoup.find_all("div",class_="thumb-left"): # All tags with rank in LatinAmerica (see myurl)
    # Score comes with a # sign and maybe a TIE string. Convert to number
    LAranks.append(int(re.findall("[0-9]+",str(myLArank))[0]))
    ind = ind+1
###
ind = 0
globalscores = []
for myGscore in mysoup.find_all("div",class_="t-large t-strong t-constricted"): # All tags with global score number
    globalscores.append(myGscore.text)
    ind = ind+1
###
ind = 0
countries = []
cities = []
for mylocation in mysoup.find_all("div",class_="t-taut"): # All tags with location
    countries.append(mylocation.span.text.strip())
    cities.append(mylocation.find("span",class_="t-dim t-small").text.strip())
    ind = ind+1
###
for ind in range(len(mydivs)):
    print('** {}, #{} in Latin America with a global score of {}, is located in {} ({}).'.format(univnames[ind],LAranks[ind],globalscores[ind],cities[ind],countries[ind]))

** Universidade de São Paulo, #1 in Latin America with a global score of 66.4, is located in São Paulo (Brazil).
** Pontificia University Católica de Chile, #2 in Latin America with a global score of 57.2, is located in Santiago (Chile).
** State University of Campinas, #3 in Latin America with a global score of 56.7, is located in Campinas, São Paulo (Brazil).
** Federal University of Rio de Janeiro, #4 in Latin America with a global score of 54.9, is located in Rio de Janeiro (Brazil).
** University of Buenos Aires, #5 in Latin America with a global score of 53.7, is located in Buenos Aires City, Buenos Aires (Argentina).
** National Autonomous University of Mexico, #6 in Latin America with a global score of 53.4, is located in Ciudad de México, Distrito Federal (Mexico).
** University of Chile, #7 in Latin America with a global score of 53.1, is located in Santiago (Chile).
** University of the Andes Colombia, #8 in Latin America with a global score of 51.4, is located in Bogotá, DC

For the next pages, an extra string appears in the URL: <code>page=#</code>, where <code>#</code> goes from 2 to 10 in this particular case. For <code>#</code>=1, this "extra" string does not appear. The following loop retrieves as many pages as stated in <code>numpages</code>.

In [4]:
# Initialize fields
univnames = []
LAranks = []
globalranks = []
globalscores = []
countries = []
cities = []
pagenum = []
###

# Defining extra strings in URL and setting up request
myurl = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
extrastring = ['']
for mypages in range(2,numpages+1):
    mytext = '?page='
    extrastring.append(mytext+str(mypages))
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
header = {"User-Agent":"Mozilla/5.0"}

# Loop for pages 1 to numpages
for n in range(1,numpages+1):
    url = myurl + extrastring[n-1]
    source = requests.get(url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')

    ### University names
    for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
        univnames.append(myUname.text.strip())
    ### University rank (in Latin America)
    for myLArank in mysoup.find_all("div",class_="thumb-left"): # All tags with rank in LatinAmerica (see myurl)
        try:
            LAranks.append(int(re.findall("[0-9]+",str(myLArank))[0]))
        except:
            LAranks.append(None)
    ### University rank (global)
    for grank1 in mysoup.find_all("div",class_="block unwrap"):
        for grank2 in grank1.find_all("div",attrs={'class': None}):
            try:
                globalranks.append(int(re.findall("[0-9]+",str(grank2.text.strip()))[0]))
            except:
                globalranks.append(None)
    ### University global score
    for myGscore in mysoup.find_all("div",class_="t-large t-strong t-constricted"): # All tags with global score number
        globalscores.append(myGscore.text.strip())
    ### University location (country and city)
    for mylocation in mysoup.find_all("div",class_="t-taut"): # All tags with global score
        countries.append(mylocation.span.text.strip())
        cities.append(mylocation.find("span",class_="t-dim t-small").text.strip())
    ### Page number of this batch of data
    maxscores = len(mysoup.find_all("div",class_="t-large t-strong t-constricted"))
    for mypage in range(1,maxscores+1):
        pagenum.append(n)
# Universities without global score do not have complete data (see below), thus they will be dropped

Let's combine all the lists in a dataframe

In [5]:
#import pandas as pd
maxlenght = len(globalscores)
USN_df=pd.DataFrame({'University':univnames[0:maxlenght],
                        'LatinAmericaRank':LAranks[0:maxlenght],
                        'GlobalRank':globalranks[0:maxlenght],
                        'GlobalScore':globalscores[0:maxlenght],
                        'Country':countries[0:maxlenght],
                        'City':cities[0:maxlenght],
                        'PageNumber':pagenum[0:maxlenght]}
                      )
USN_df.head(5)

Unnamed: 0,University,LatinAmericaRank,GlobalRank,GlobalScore,Country,City,PageNumber
0,Universidade de São Paulo,1,128,66.4,Brazil,São Paulo,1
1,Pontificia University Católica de Chile,2,290,57.2,Chile,Santiago,1
2,State University of Campinas,3,305,56.7,Brazil,"Campinas, São Paulo",1
3,Federal University of Rio de Janeiro,4,346,54.9,Brazil,Rio de Janeiro,1
4,University of Buenos Aires,5,374,53.7,Argentina,"Buenos Aires City, Buenos Aires",1


Details for the university ranking appear when clicking on the university's name. The resulting URL includes a variant of the university and a numeric identifier. This information can be found in an <code>a</code> tag inside the <code>h2</code> tag used to retriege the university name. The previous data retriever will be used to get the URL for each university.

In [6]:
# Initialize fields
univurl = []

# Defining extra strings in URL and setting up request
myurl = 'https://www.usnews.com/education/best-global-universities/latin-america'
numpages = 10
extrastring = ['']
for mypages in range(2,numpages+1):
    mytext = '?page='
    extrastring.append(mytext+str(mypages))
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
header = {"User-Agent":"Mozilla/5.0"}


# Loop for pages 1 to numpages
for n in range(1,numpages+1):
    url = myurl + extrastring[n-1]
    source = requests.get(url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')

    ### University URLs
    for myUname in mysoup.find_all("h2",class_="h-taut"): # All tags with university name
        univurl.append(myUname.a['href'])
# Append URL list to dataframe
tmp = pd.DataFrame({'USN_URL':univurl[0:len(globalscores)]})
USN_df = USN_df.join(tmp)
USN_df

Unnamed: 0,University,LatinAmericaRank,GlobalRank,GlobalScore,Country,City,PageNumber,USN_URL
0,Universidade de São Paulo,1,128,66.4,Brazil,São Paulo,1,https://www.usnews.com/education/best-global-u...
1,Pontificia University Católica de Chile,2,290,57.2,Chile,Santiago,1,https://www.usnews.com/education/best-global-u...
2,State University of Campinas,3,305,56.7,Brazil,"Campinas, São Paulo",1,https://www.usnews.com/education/best-global-u...
3,Federal University of Rio de Janeiro,4,346,54.9,Brazil,Rio de Janeiro,1,https://www.usnews.com/education/best-global-u...
4,University of Buenos Aires,5,374,53.7,Argentina,"Buenos Aires City, Buenos Aires",1,https://www.usnews.com/education/best-global-u...
...,...,...,...,...,...,...,...,...
78,Universidade Tecnologica Federal do Parana,79,1476,16.5,Brazil,Curitiba,8,https://www.usnews.com/education/best-global-u...
79,Universidade Federal de Sergipe,80,1480,16.3,Brazil,São Cristóvão,8,https://www.usnews.com/education/best-global-u...
80,Universidade Federal Rural de Pernambuco (UFRPE),81,1492,15.9,Brazil,"Recife, PE",9,https://www.usnews.com/education/best-global-u...
81,Universidad Autonoma de Baja California,82,1497,15.6,Mexico,"Mexicali, Baja California",9,https://www.usnews.com/education/best-global-u...


With the URL retrieved, the data used to score the universities is available. The following code retrieves such data. Let's try and study with the first element of the URL list.

In [7]:
details_url = USN_df['USN_URL'][0]
print(details_url)
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
          +' Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
header = {"User-Agent":"Mozilla/5.0"}
source = requests.get(details_url,headers=header).text
mysoup = BeautifulSoup(source,'lxml')

https://www.usnews.com/education/best-global-universities/universidade-de-sao-paulo-500437


In [8]:
# Getting the address (location)
maincontent = mysoup.find('div',class_='maincontent')
dirdata = maincontent.find('div',class_='directory-data')
address = ''
for data in dirdata.find_all('div'):
    address = address + data.text + ' '
address = address.strip()
print(address)

Av. Prof. Almeida Prado, nº1280 - Butantã São Paulo, 05508-070 Brazil


In [9]:
# Getting the webpage
# Both location and webpage are under the same tree of tags... I have to get both of them in one shot!
maincontent = mysoup.find_all('div',class_='directory-data')
address = maincontent[0]
webpage = maincontent[1]
text = ''
for data in address.find_all('div'):
    text = text + data.text + ' '
address = text.strip()
webpage = webpage.find('a')['href']
print(address)
print(webpage)

Av. Prof. Almeida Prado, nº1280 - Butantã São Paulo, 05508-070 Brazil
http://www5.usp.br/en/ 


In [10]:
# Getting university details
maincontent = mysoup.find('div',id='directoryPageSection-institution-data')
#print(maincontent.prettify())
detailname = []
detailval = []
for name in maincontent.find_all('div',class_="t-dim"):
    detailname.append(name.text.strip())
for value in maincontent.find_all('div',class_="right t-strong"):
    detailval.append(value.text.strip())
# Print dataframe
tmp = pd.DataFrame({'Detail':detailname,'Value':detailval})
tmp

Unnamed: 0,Detail,Value
0,Total number of students,83214
1,Number of international students,3161
2,Total number of academic staff,5230
3,Number of international staff,258
4,Number of undergraduate degrees awarded,8207
5,Number of master's degrees awarded,3742
6,Number of doctoral degrees awarded,3078
7,Number of research only staff,0
8,Number of new undergraduate students,10978
9,Number of new master's students,4697


This dataframe cannot be joined to the ranking dataframe as it is here. The best way to do it is to generate a dataframe with columns labeled as the "Detail" shown here, and rows with the "Values". In that way, after filling the whole dataframe it can be straightforwardly joined to the ranking dataframe. Something like this:

In [11]:
tmp = tmp.T
tmp.rename(columns=tmp.iloc[0],inplace=True)
tmp = tmp.drop(['Detail'])
tmp.reset_index(inplace=True,drop=True)
tmp

Unnamed: 0,Total number of students,Number of international students,Total number of academic staff,Number of international staff,Number of undergraduate degrees awarded,Number of master's degrees awarded,Number of doctoral degrees awarded,Number of research only staff,Number of new undergraduate students,Number of new master's students,Number of new doctoral students
0,83214,3161,5230,258,8207,3742,3078,0,10978,4697,3308


However, some of the pages do not have data at all. And worse, some of them have some of the data (which denies the possibility of just neglecting pages without data). Thus, it is better to create a dictionary for the details of each university. This is taken into account in the definition of the following code, which retrieves the a available details of each university.

In [12]:
import time # For sleep()

# Initialize data fields
details = [] # list, each element will be a dictionary with details
dictkeys = ['Address','Webpage',
            'Total number of students', 'Number of international students', 'Total number of academic staff', 
            'Number of international staff', 'Number of undergraduate degrees awarded', "Number of master's degrees awarded", 
            'Number of doctoral degrees awarded', 'Number of research only staff', 'Number of new undergraduate students', 
            "Number of new master's students", 'Number of new doctoral students']

# Loop over different URLs given in USN_df['USN_URL']
maxurl = len(USN_df['USN_URL'])

header = {"User-Agent":"Mozilla/5.0"}
for urlnum in range(maxurl):
    time.sleep(1) # Wait 1 second so the server will not deny my next calls
    mydict = {x:None for x in dictkeys} # Initialize dictionary for current university page
    details_url = USN_df['USN_URL'][urlnum]
    source = requests.get(details_url,headers=header).text
    mysoup = BeautifulSoup(source,'lxml')
###    # To check that the access was not denied!
###    print('HEAD.TITLE of retrieved mysoup object: ',mysoup.head.title)
    
    # Getting the webpage
    maincontent = mysoup.find_all('div',class_='directory-data')
    address = maincontent[0]
    webpage = maincontent[1]
    text = ''
    for data in address.find_all('div'):
        text = text + data.text + ' '
    mydict['Address'] = text.strip()
    mydict['Webpage'] = webpage.find('a')['href']

    # Getting university details
    maincontent = mysoup.find('div',id='directoryPageSection-institution-data')
    # Initialize details lists
    detailname = []
    detailvalue = []
    for name in maincontent.find_all('div',class_="t-dim"):
        detailname.append(name.text.strip())
    for value in maincontent.find_all('div',class_="right t-strong"):
        detailvalue.append(value.text.strip())
    # At this moment I have a list of details name and values. Put them in the dictionary
    for i in range(len(detailname)):
        mydict[detailname[i]] = detailvalue[i]
    details.append(mydict)

In [13]:
extended_df = pd.DataFrame.from_dict(details)
extended_df.head(10)

Unnamed: 0,Address,Webpage,Total number of students,Number of international students,Total number of academic staff,Number of international staff,Number of undergraduate degrees awarded,Number of master's degrees awarded,Number of doctoral degrees awarded,Number of research only staff,Number of new undergraduate students,Number of new master's students,Number of new doctoral students
0,"Av. Prof. Almeida Prado, nº1280 - Butantã São ...",http://www5.usp.br/en/,83214.0,3161.0,5230.0,258.0,8207.0,3742.0,3078.0,0.0,10978.0,4697.0,3308.0
1,Avda. Libertador Bernardo O'Higgins 340 Santi...,http://www.uc.cl/,28541.0,2007.0,1900.0,207.0,2981.0,,121.0,270.0,5190.0,,268.0
2,"Campus Universitário Zeferino Vaz Campinas, Sã...",http://www.unicamp.br/unicamp/?language=en,28795.0,974.0,1906.0,106.0,2500.0,1342.0,997.0,94.0,3353.0,2110.0,1511.0
3,"Av. Pedro Calmon, 550 Rio de Janeiro, 21941-90...",http://www.ufrj.br/,,,,,,,,,,,
4,"Viamonte 430 st. Buenos Aires City, Buenos Air...",http://www.uba.ar/ingles/index03.php,,,,,,,,,,,
5,"Av. Universidad 3000, Copilco Universidad, Coy...",http://www.unam.mx/index/en,,,,,,,,,,,
6,Av. Libertador Bernardo O'Higgins 1058 Santiag...,http://www.uchile.cl/english,,,,,,,,,,,
7,"Carrera Primera #18A-12 Bogotá, DC Colombia",http://www.uniandes.edu.co/,,,,,,,,,,,
8,"Av. Espana 1680 Valparaiso, Valparaiso Chile",http://www.usm.cl/,,,,,,,,,,,
9,"Av. Paulo Gama, 110 Porto Alegre, Rio Grande d...",http://www.ufrgs.br/english/home,,,,,,,,,,,


Now, the merging of this extended dataframe with the ranking dataframe is straightforward.

In [14]:
USN_fulldf = USN_df.join(extended_df)
USN_fulldf.to_csv('USN_dataframe.csv',encoding="utf-8-sig")
USN_fulldf

Unnamed: 0,University,LatinAmericaRank,GlobalRank,GlobalScore,Country,City,PageNumber,USN_URL,Address,Webpage,...,Number of international students,Total number of academic staff,Number of international staff,Number of undergraduate degrees awarded,Number of master's degrees awarded,Number of doctoral degrees awarded,Number of research only staff,Number of new undergraduate students,Number of new master's students,Number of new doctoral students
0,Universidade de São Paulo,1,128,66.4,Brazil,São Paulo,1,https://www.usnews.com/education/best-global-u...,"Av. Prof. Almeida Prado, nº1280 - Butantã São ...",http://www5.usp.br/en/,...,3161,5230,258,8207,3742,3078,0,10978,4697,3308
1,Pontificia University Católica de Chile,2,290,57.2,Chile,Santiago,1,https://www.usnews.com/education/best-global-u...,Avda. Libertador Bernardo O'Higgins 340 Santi...,http://www.uc.cl/,...,2007,1900,207,2981,,121,270,5190,,268
2,State University of Campinas,3,305,56.7,Brazil,"Campinas, São Paulo",1,https://www.usnews.com/education/best-global-u...,"Campus Universitário Zeferino Vaz Campinas, Sã...",http://www.unicamp.br/unicamp/?language=en,...,974,1906,106,2500,1342,997,94,3353,2110,1511
3,Federal University of Rio de Janeiro,4,346,54.9,Brazil,Rio de Janeiro,1,https://www.usnews.com/education/best-global-u...,"Av. Pedro Calmon, 550 Rio de Janeiro, 21941-90...",http://www.ufrj.br/,...,,,,,,,,,,
4,University of Buenos Aires,5,374,53.7,Argentina,"Buenos Aires City, Buenos Aires",1,https://www.usnews.com/education/best-global-u...,"Viamonte 430 st. Buenos Aires City, Buenos Air...",http://www.uba.ar/ingles/index03.php,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,Universidade Tecnologica Federal do Parana,79,1476,16.5,Brazil,Curitiba,8,https://www.usnews.com/education/best-global-u...,"Av. Sete de Setembro, 3165 Curitiba, 80230-901...",http://www.utfpr.edu.br/,...,,,,,,,,,,
79,Universidade Federal de Sergipe,80,1480,16.3,Brazil,São Cristóvão,8,https://www.usnews.com/education/best-global-u...,"Av. Marechal Rondon, s/n, Jd. Rosa Elze São Cr...",http://www.ufs.br/,...,,,,,,,,,,
80,Universidade Federal Rural de Pernambuco (UFRPE),81,1492,15.9,Brazil,"Recife, PE",9,https://www.usnews.com/education/best-global-u...,"Rua Dom Manuel de Medeiros Recife, PE 52170-90...",http://www.ufrpe.br/en,...,,,,,,,,,,
81,Universidad Autonoma de Baja California,82,1497,15.6,Mexico,"Mexicali, Baja California",9,https://www.usnews.com/education/best-global-u...,"Avenida Alvaro Obregon s/n, Nueva Mexicali, Ba...",http://www.uabc.mx/en/,...,,,,,,,,,,


This ends the data retrieving for source number one. YAY!

<h4>Source two: QS (Quacquarelli Symonds) Top Universities</h4>
The second source of data will be the ranking by <a href="https://www.topuniversities.com/university-rankings/world-university-rankings/2020">QS (Quacquarelli Symonds) Top Universities</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. he data will be restricted to the Latin American countries in their search results. <br><br>
Looking at the structure of the page, the data is not in a table but in a set of <code>div</code> tags with <code>class="sep"</code>. A concerning details is that it is not possible to ask for the full list in one shot (there is no option like "Show all" in this source), but this will be solved later. For the first page of results for the Latin American universities, the retrieving code is the following:

In [15]:
myurl="https://www.topuniversities.com/university-rankings/world-university-rankings/2020"
#header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
header = {"User-Agent":"Mozilla/5.0"}
source = requests.get(myurl,headers=header).text
mysoup = BeautifulSoup(source,'lxml')

In [16]:
#content = mysoup.find('div',class_='panel-pane pane-block pane-qs-rankings-datatables-0')
content = mysoup.find('table',id='qs-rankings')

The resulting data comes with an empty table. If one looks at the developer tools in the web browser, the table is correctly built in the code. Further reading about this, the table is generated by a script. Checking again in the developer tools in the browser, the files that contain the actual information are located in the URLs given in the next code cell. Those are the ones to be retrieved and processed to generate the corresponding dataframe.

In [17]:
urlranks = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/914824.txt"
urlindicators = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/914824_indicators.txt"

header = {"User-Agent":"Mozilla/5.0"}

sourceranks = requests.get(urlranks,headers=header).text # Without .text, it gets the response in JSON format
sourceindic = requests.get(urlindicators,headers=header).text
mysoupranks = BeautifulSoup(sourceranks,'lxml')
mysoupindic = BeautifulSoup(sourceindic,'lxml')

In [18]:
import json
import re
# Initialize parameters
root = 'https://www.topuniversities.com'
header = {"User-Agent":"Mozilla/5.0"}
diction = json.loads(mysoupranks.text)

### Define data to retrieve
dataind = 20 # Michigan Ann Arbor
suburl = diction['data'][dataind]['url']
myurl = root+suburl

### Retrieve data
mysource = requests.get(myurl,headers=header).text

In [19]:
mysoup = BeautifulSoup(mysource,'lxml')
mychunk = mysoup.find("div",class_='student line')
for typestud in mychunk.find_all('div',class_='set'): # One for total students, one for intl. students
    mystring = typestud.h4.text.split(" - ")
    mystring[1] = mystring[1].replace(',','')
    print('NAMES AS IN STRING LIST: ',mystring)
    print(mystring[0],' = ',mystring[1])
    for pcents in typestud.find_all('div',class_='gr'): # One for PG students, one for UG students
        mystring = pcents.text.split("%")
        print('NAMES AS IN STRING LIST: ',mystring)
        print('Percentage of',mystring[1],' = ',mystring[0].strip(),'%')
    print('============')
mychunk = mysoup.find("div",class_='faculty')
faculty = mychunk.h4.text.split(" - ")
facnum = faculty[1].replace(',','')
print('Total faculty staff = ',facnum)
print('NAMES AS IN STRING LIST: ',faculty)
for typefac in mychunk.find_all('div',class_='gr'): # One for intl. faculty, one for domestic faculty
    value = int(re.findall("[0-9]+",typefac.text)[0])
    print(typefac.label.text,' = ',value,end='\n ******** \n')
    print('NAMES AS IN STRING LIST: [',typefac.label.text,',',value,']')

NAMES AS IN STRING LIST:  ['Total students', '45102']
Total students  =  45102
NAMES AS IN STRING LIST:  [' 34', 'PG students']
Percentage of PG students  =  34 %
NAMES AS IN STRING LIST:  [' 66', 'UG students']
Percentage of UG students  =  66 %
NAMES AS IN STRING LIST:  ['International students', '7762']
International students  =  7762
NAMES AS IN STRING LIST:  [' 37', 'UG students']
Percentage of UG students  =  37 %
NAMES AS IN STRING LIST:  [' 63', 'PG students']
Percentage of PG students  =  63 %
Total faculty staff =  7125
NAMES AS IN STRING LIST:  ['Total faculty staff', '7,125']
International staff  =  2056
 ******** 
NAMES AS IN STRING LIST: [ International staff , 2056 ]
Domestic staff  =  5069
 ******** 
NAMES AS IN STRING LIST: [ Domestic staff , 5069 ]


In [20]:
import time
import re

# Mysoupranks is a single paragraph (p) of HTML which contains the text file
#   with the information required. It can be converted to a dictionary. However
#   it has only one key with all the data inside. The following code will break
#   the information in order to get individual data in a manageable way.
import json
diction = json.loads(mysoupranks.text)

# By checking the structure of "data", the useful data is:
#   Continent is in ['region'] (I can get the Latin American universities with this)
#   Country is in ['country']
#   University name is in ['title']
#   Global score is in ['score']
#   Rank is in ['rank_display']

# In the individual page for each university (in ['url']), the useful data is:
root = 'https://www.topuniversities.com'
header = {"User-Agent":"Mozilla/5.0"}
#   Total students
#   PG students (% of postgraduate students) (total)
#   UG students (% of undergraduate students) (total)
#   International students
#   PG students (% of postgraduate students) (international)
#   UG students (% of undergraduate students) (international)
#   Total faculty staff
#   International staff
#   Domestic staff
# This data is retrieved directly from the HTML code


details = [] # list, each element will be a dictionary with details
dictkeys = ['Name','Region','Country','GlobalScore','GlobalRank',
            'Total students','PG students total','UG students total',
            'International students','PG students intl','UG students intl',
            'Total faculty staff','International staff','Domestic staff']
## interests = ['title','region','country','score','rank_display']

#for i in range(len(diction['data'])):
for i in range(120):
    if diction['data'][i]['region'] == 'Latin America':
        time.sleep(0.5)
        mydict = {x:None for x in dictkeys} # Initialize dictionary for current university
        mydict['Name']        = diction['data'][i]['title']
        mydict['Region']      = diction['data'][i]['region']
        mydict['Country']     = diction['data'][i]['country']
        mydict['GlobalScore'] = diction['data'][i]['score']
        # Some ranks are tied, and have an "=" in its value
        mydict['GlobalRank']  = int(re.findall("[0-9]+",str(diction['data'][i]['rank_display']))[0])
        # The final [0] is because re.findall returns a list, in this case with one element.
        ### Extra data from individual page ###
        # Initialize parameters
        # Define data to retrieve
        suburl = diction['data'][i]['url']
        myurl = root+suburl
        ### Retrieve data
        mysource = requests.get(myurl,headers=header).text
        mysoup = BeautifulSoup(mysource,'lxml')
        mychunk = mysoup.find("div",class_='student line')
        for typestud in mychunk.find_all('div',class_='set'): # One for total students, one for intl. students
            mystring = typestud.h4.text.split(" - ")
            mystring[1] = mystring[1].replace(',','')
            mydict[mystring[0]] = mystring[1]
            if mystring[0] == 'Total students':
                studtail = ' total'
            else:
                studtail = ' intl'
            for pcents in typestud.find_all('div',class_='gr'): # One for PG students, one for UG students
                mystring = pcents.text.split("%")
#######                print('Percentage of',mystring[1]+studtail,' = ',mystring[0].strip(),'%')
#######                print('============')
                mydict[mystring[1]+studtail] = mystring[0].strip() + '%'
        mychunk = mysoup.find("div",class_='faculty')
        faculty = mychunk.h4.text.split(" - ")
        facnum = faculty[1].replace(',','')
        mydict[faculty[0]] = facnum
#######        print(faculty)
#######        print(faculty[0],' = ',facnum)
        for typefac in mychunk.find_all('div',class_='gr'): # One for intl. faculty, one for domestic faculty
            value = int(re.findall("[0-9]+",typefac.text)[0])
            mydict[typefac.label.text] = value
#######            print(typefac.label.text,' = ',value,end='\n ******** \n')
        details.append(mydict)

basic_df = pd.DataFrame.from_dict(details)
basic_df.head(10)

Unnamed: 0,Name,Region,Country,GlobalScore,GlobalRank,Total students,PG students total,UG students total,International students,PG students intl,UG students intl,Total faculty staff,International staff,Domestic staff
0,Universidad de Buenos Aires (UBA),Latin America,Argentina,66.0,74,122293,7%,93%,27121,8%,92%,16419,3167,13252
1,Universidad Nacional Autónoma de México (UNAM),Latin America,Mexico,58.8,103,143279,21%,79%,5057,57%,43%,16006,1178,14828
2,Universidade de São Paulo,Latin America,Brazil,55.5,116,66214,45%,55%,1992,85%,15%,5116,261,4855


Now, it is the turn of the indicators to be processed.

In [21]:
indicators = json.loads(mysoupindic.text)
indicators.pop('columns')
### In columns, I have this data...
#   {'data': '3791737',
#     'title': '<div class="td-wrap"><div class="labl"><div>Citations per Faculty</div></div><div class="sorter"></div></div>',
#     'searchable': False,
#     'orderable': False}
# It seems that instead of names, they used ID numbers for the indicators. I found these:
#     ID = 3791737, for indicator 'Citations per Faculty'
#     ID = 3791738, for indicator 'International Students'
#     ID = 3791739, for indicator 'International Faculty'
#     ID = 3791740, for indicator 'Faculty Student'
#     ID = 3791741, for indicator 'Employer Reputation'
#     ID = 3791742, for indicator 'Academic Reputation'

# The elements in the indicators are strings made of HTML code. I need to parse them
#   before being able to extract the value of the indicator
#   actual_value = BeautifulSoup(indicators['data'][0]['3791737'],'lxml').text

# Example, for university 5 (index 4)
print('Citations per Faculty = ',BeautifulSoup(indicators['data'][4]['3791737'],'lxml').text)
print('International Students = ',BeautifulSoup(indicators['data'][4]['3791738'],'lxml').text)
print('International Faculty = ',BeautifulSoup(indicators['data'][4]['3791739'],'lxml').text)
print('Faculty Student = ',BeautifulSoup(indicators['data'][4]['3791740'],'lxml').text)
print('Employer Reputation = ',BeautifulSoup(indicators['data'][4]['3791741'],'lxml').text)
print('Academic Reputation = ',BeautifulSoup(indicators['data'][4]['3791742'],'lxml').text)

Citations per Faculty =  100
International Students =  87.3
International Faculty =  99.4
Faculty Student =  100
Employer Reputation =  81.2
Academic Reputation =  97.8


In [22]:
indicators = json.loads(mysoupindic.text)
#indicators.pop('columns')

# Initialize variables
details = [] # list, each element will be a dictionary with details
citperfalc = 'Citations_per_Faculty'
interstude = 'International_Students'
interfacul = 'International_Faculty'
facstudent = 'Faculty_Student'
employrepu = 'Employer_Reputation'
academrepu = 'Academic_Reputation'
dictkeys = [citperfalc,interstude,interfacul,facstudent,employrepu,academrepu]
indicators['data'][0]['region']

for i in range(len(indicators['data'])):
    if indicators['data'][i]['region'] == 'Latin America':
        mydict = {x:None for x in dictkeys} # Initialize dictionary for current university
        try:
            mydict[citperfalc] = BeautifulSoup(indicators['data'][i]['3791737'],'lxml').text
        except:
            pass
        try:
            mydict[interstude] = BeautifulSoup(indicators['data'][i]['3791738'],'lxml').text
        except:
            pass
        try:
            mydict[interfacul] = BeautifulSoup(indicators['data'][i]['3791739'],'lxml').text
        except:
            pass
        try:
            mydict[facstudent] = BeautifulSoup(indicators['data'][i]['3791740'],'lxml').text
        except:
            pass
        try:
            mydict[employrepu] = BeautifulSoup(indicators['data'][i]['3791741'],'lxml').text
        except:
            pass
        try:
            mydict[academrepu] = BeautifulSoup(indicators['data'][i]['3791742'],'lxml').text
        except:
            pass
        details.append(mydict)

indic_df = pd.DataFrame.from_dict(details)
indic_df.head(10)
# For the 'data' (remaining) key, the interests are:
# interests = ['region' = 'Latin America','overall_rank','uni'.text,
#              '','','','','',]

Unnamed: 0,Citations_per_Faculty,International_Students,International_Faculty,Faculty_Student,Employer_Reputation,Academic_Reputation
0,2.4,64.7,50.7,77.4,91.3,87.2
1,3.8,4.3,13.8,57.6,91.0,90.9
2,35.2,3.7,8.9,25.2,73.3,88.3
3,13.6,4.2,19.4,28.6,95.5,85.2
4,4.6,18.4,98.2,89.5,88.9,36.9
5,14.5,8.4,10.1,16.3,90.8,71.6
6,32.7,4.3,9.9,21.1,34.5,67.5
7,8.1,3.1,32.1,27.6,87.9,54.4
8,5.3,1.3,8.3,11.1,89.7,61.5
9,1.1,13.2,2.9,95.3,44.3,17.7


Now, the merging of this extended dataframe with the ranking dataframe is straightforward.

In [23]:
QSTopU_fulldf = basic_df.join(indic_df)
QSTopU_fulldf.to_csv('QSTopU_dataframe.csv',encoding="utf-8-sig")
QSTopU_fulldf

Unnamed: 0,Name,Region,Country,GlobalScore,GlobalRank,Total students,PG students total,UG students total,International students,PG students intl,UG students intl,Total faculty staff,International staff,Domestic staff,Citations_per_Faculty,International_Students,International_Faculty,Faculty_Student,Employer_Reputation,Academic_Reputation
0,Universidad de Buenos Aires (UBA),Latin America,Argentina,66.0,74,122293,7%,93%,27121,8%,92%,16419,3167,13252,2.4,64.7,50.7,77.4,91.3,87.2
1,Universidad Nacional Autónoma de México (UNAM),Latin America,Mexico,58.8,103,143279,21%,79%,5057,57%,43%,16006,1178,14828,3.8,4.3,13.8,57.6,91.0,90.9
2,Universidade de São Paulo,Latin America,Brazil,55.5,116,66214,45%,55%,1992,85%,15%,5116,261,4855,35.2,3.7,8.9,25.2,73.3,88.3


This ends the data retrieving for source number two. YAY!

<h4>Source three: Times Higher Education</h4>
The first source of data will be the ranking by <a href="https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking">Times Higher Education</a>. <b>SAY SOMETHING ABOUT THE SOURCE...</b>. The data will be restricted to some Latin American countries. According to the dropdown menu, the ones available in this ranking are:
<ul>
    <li>Argentina (identified as AR)</li>
    <li>Brasil (identified as BR)</li>
    <li>Chile (identified as CL)</li>
    <li>Colombia (identified as CO)</li>
    <li>Costa Rica (identified as CR)</li>
    <li>Cuba (identified as CU)</li>
    <li>Jamaica (identified as JM)</li>
    <li>Mexico (identified as MX)</li>
    <li>Peru (identified as PE)</li>
    <li>Puerto Rico (identified as PR)</li>
    <li>Venezuela (identified as VE)</li>
</ul>
The structure of the webpage will be studied in order to systematically retrieve the data for every country.

In [24]:
# Save data from webpage
myurl = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/locations/AR/sort_by/rank/sort_order/asc/cols/stats'
## The structure of the webpage is:
##   1.- Main URL and starting page (0):
##       https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/
##   2.- Number of results per page (-1 means "All"):
##       length/-1/
##   3.- Location definition (check the country identifier above):
##       locations/AR/
##   4.- Sorting parameters:
##       sort_by/rank/sort_order/asc/
##   5.- Information shown ("stats"=ranking data; "scores"=scoring data):
##       cols/stats
source = requests.get(myurl)
print(source)

<Response [403]>


The response from the Times Higher Education server is to deny incoming requests from this type (403). Sometimes, webpages require to define a <code>User-Agent</code> as explained <a href="https://stackoverflow.com/questions/38489386/python-requests-403-forbidden">here</a>. On top of that, when retrieving the table section of the webpage, the <code>tbody</code> tag (which encloses the data in the body of the table) is empty. According to this <a href="https://stackoverflow.com/questions/49260014/beautifulsoup-returns-empty-td-tags">source</a>, the table data may be generated by a script and not with the HTML code itself. Thus, extra information must be included in the header and in the <code>requests.get</code> arguments to get the results of such script.

In [25]:
# Save data from webpage
myurl = 'https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/-1/sort_by/rank/sort_order/asc/cols/stats'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
source = requests.get(myurl,headers=header).text # Without .text, it gets the response in JSON format
mysoup = BeautifulSoup(source,'lxml')
mytable = mysoup.find('table')

The previous results in a table object with an emtpy <code>tbody</code> tag as well. Unfortunately, I could not figure out how to get the actual data from the table. I checked many sources but it would be too much for me to understand it right away. I tried with another retriever but got the same result. Checking again in the developer tools in the browser, the file that contains the actual information is located in the URL given in the next code cell. This is the one to be retrieved and processed to generate the corresponding dataframe.

In [26]:
myurl = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2020_0__24cc3874b05eea134ee2716dbf93f11a.json'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
source = requests.get(myurl,headers=header).text # Without .text, it gets the response in JSON format
mysoup = BeautifulSoup(source,'lxml')

As the extension shows, this is a JSON file. The algorithm to extract the data will be very similar to the one for Source Two.

In [27]:
import json
diction = json.loads(mysoup.text)
# Check the first entry of data to elucidate the structure
diction['data'][0]


{'rank_order': '10',
 'rank': '1',
 'name': 'University of Oxford',
 'scores_overall': '95.4',
 'scores_overall_rank': '10',
 'scores_teaching': '90.5',
 'scores_teaching_rank': '6',
 'scores_research': '99.6',
 'scores_research_rank': '1',
 'scores_citations': '98.4',
 'scores_citations_rank': '26',
 'scores_industry_income': '65.5',
 'scores_industry_income_rank': '178',
 'scores_international_outlook': '96.4',
 'scores_international_outlook_rank': '22',
 'record_type': 'master_account',
 'member_level': '0',
 'url': '/world-university-rankings/university-oxford',
 'nid': 468,
 'location': 'United Kingdom',
 'stats_number_students': '20,664',
 'stats_student_staff_ratio': '11.2',
 'stats_pc_intl_students': '41%',
 'stats_female_male_ratio': '46 : 54',
 'aliases': 'University of Oxford',
 'subjects_offered': 'Mechanical & Aerospace Engineering,Computer Science,Politics & International Studies (incl Development Studies),Biological Sciences,Languages, Literature & Linguistics,Civil Engi

In [28]:
import time
import re
# By checking the structure of diction["data"], the useful data in diction['data'][i] is:
#   University name is in ['name']
#   Country is in ['location']
#   Global score is in ['scores_overall']
#   Global rank is in ['rank']
#   Teaching score is in ['scores_teaching']
#   Research score is in ['scores_research']
#   Citations score is in ['scores_citations']
#   Industry income score is in ['scores_industry_income']
#   International outlook score is in ['scores_international_outlook']
#   Number of students is in ['stats_number_students']
#   Students per staff ratio is in ['stats_student_staff_ratio']
#   % of international students is in ['stats_pc_intl_students']
#   Females-to-males ratio is in ['stats_female_male_ratio']
# There is no "region" identifier. Thus, I will use the Location to check if the country
#   belongs to Latin America.

LACountries = ['Argentina','Brasil','Chile','Colombia','Costa Rica','Cuba','Jamaica',
               'Mexico','Peru','Puerto Rico','Venezuela']

details = [] # list, each element will be a dictionary with details
dictkeys = ['Name','Country','GlobalScore','GlobalRank','Teaching_score','Research_score',
            'Citations_score','Indust_income_score','Intl_outlook_score','Students',
            'Students_per_staff','%_intl_students','Females:males','Address']
#
# To get the address...
# The address is given in the particular page of the university. I will collect the relative URL
#   and get the data in the retrieving loop
rooturl = 'https://www.timeshighereducation.com'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36','Accept': 'application/json, text/javascript, */*; q=0.01'}
#
for i in range(len(diction['data'])):
    if diction['data'][i]['location'] in LACountries:
        mydict = {x:None for x in dictkeys} # Initialize dictionary for current university
        mydict['Name']        = diction['data'][i]['name']
        mydict['Country']     = diction['data'][i]['location']
        mydict['GlobalScore'] = diction['data'][i]['scores_overall']
        # Some ranks are tied, and have an "=" in its value
        mydict['GlobalRank']  = int(re.findall("[0-9]+",str(diction['data'][i]['rank']))[0])
        # The final [0] is because re.findall returns a list, in this case with one element.
        mydict['Teaching_score']      = diction['data'][i]['scores_teaching']
        mydict['Research_score']      = diction['data'][i]['scores_research']
        mydict['Citations_score']     = diction['data'][i]['scores_citations']
        mydict['Indust_income_score'] = diction['data'][i]['scores_industry_income']
        mydict['Intl_outlook_score']  = diction['data'][i]['scores_international_outlook']
        mydict['Students']            = diction['data'][i]['stats_number_students']
        mydict['Students_per_staff']  = diction['data'][i]['stats_student_staff_ratio']
        mydict['%_intl_students']     = diction['data'][i]['stats_pc_intl_students']
        mydict['Females:males']       = diction['data'][i]['stats_female_male_ratio']
        mydict['RelativeURL']         = diction['data'][i]['url']
        relurl = diction['data'][i]['url']
        myurl = rooturl + relurl
        time.sleep(0.5)
        source = requests.get(myurl,headers=header).text
        addrsoup = BeautifulSoup(source,'lxml')
        contact = addrsoup.find('div',class_='institution-info__contact-detail institution-info__contact-detail--address')
        mydict['Address'] = contact.text.strip()
        details.append(mydict)

THE_fulldf = pd.DataFrame.from_dict(details)
THE_fulldf.to_csv('THE_dataframe.csv',encoding="utf-8-sig")
THE_fulldf.head()

Unnamed: 0,Name,Country,GlobalScore,GlobalRank,Teaching_score,Research_score,Citations_score,Indust_income_score,Intl_outlook_score,Students,Students_per_staff,%_intl_students,Females:males,Address,RelativeURL
0,University of Desarrollo,Chile,38.8–42.3,401,13.5,8.8,99.1,36.8,48.1,15384,19.0,4%,56 : 44,"Av. Plaza 680 San Carlos de Apoquindo, Las Con...",/world-university-rankings/university-desarrollo
1,Diego Portales University,Chile,38.8–42.3,401,14.7,10.5,95.3,34.4,50.4,17287,14.9,2%,49 : 51,"Manuel Rodríguez Sur 415, Santiago, Región Met...",/world-university-rankings/diego-portales-univ...
2,Pontifical Javeriana University,Colombia,38.8–42.3,401,15.3,8.9,98.6,34.8,44.6,30048,15.4,1%,55 : 45,"Carrera 7 No. 40 - 62, Bogota, Distrito Capita...",/world-university-rankings/pontifical-javerian...
3,Universidad Autónoma de Chile,Chile,35.3–38.7,501,12.3,8.9,81.6,34.4,50.7,23312,26.2,1%,65 : 35,"Av. Pedro de Valdivia 425, Providencia, Región...",/world-university-rankings/universidad-autonom...
4,Universidad Peruana Cayetano Heredia,Peru,35.3–38.7,501,17.0,11.3,83.1,34.7,46.4,6236,10.6,0%,66 : 34,"Av. Honorio Delgado 430, Urb. Ingeniería, San ...",/world-university-rankings/universidad-peruana...


This ends the data retrieving for source number three. YAY!