# 02 - Data from the Web

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in term of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking in consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of tie, use the order defined in the webpage.


## 1. World University Rankings by TopUniversities

---

We will focus on data scraping from the [TopUniversities 2018 rankings](https://www.topuniversities.com/university-rankings/world-university-rankings/2018), as well as asnwering the assignment questions by performing simple data analysis. The structure of this chapter follows the assignment outline:

1. Data scraping
2. Ranking: faculty members, students and internationalization
3. Ranking: regional and national results


### Libraries utilized

We will use the BeautifulSoup along with Pandas and standard Python utilities for performing the scraping and data analysis.

In [1]:
# importing the utils for Web scraping
import requests
from bs4 import BeautifulSoup

# importing python and data utils
import pandas as pd
import numpy as np
import json
%matplotlib inline
import matplotlib.pyplot as plt

### Swiss army knife for data scraping

[Postman](https://www.getpostman.com/) is utilized outside Python to get more sense about the Web jungle we are trying to harvest our data from. We have used both Interceptor and the application itself to make sense of incoming traffic. 

We have also relied on traffic observations in Chrome Development Tools (*which can be accessed in Chrome browsers on Windows platform by pressing F12*).

### 1.1 Data scraping

The first step in our data scraping process was to get acquainted with the website itself. We have turned on the Interceptor for the traffic and the application itself, then loaded the website and observed the requests. 

#### Making our life easier: Postman filters

It is useful to filter out results of requests sent to the Website of interest. In modern websites it is common to have many 3rd party APIs which would make it somewhat more difficult to find the requests of interest. Additionally, trafic from other tabs in browser (if open) will appear, so it is additionally useful to display only what is the focus of our search. Such benefits are visible in the practical example bellow:

<table width="70%">
  <tr>
    <th style="text-align:center">Requests without any filters</th>
    <th style="text-align:center">After filtering results from *www.topuniversities.com* only</th> 
  </tr>
  <tr>
    <td>![](images/postman0.png)</td>
    <td style="vertical-align:top">![](images/postman1.png)</td> 
  </tr>
</table>

From the vast list of various requests, we have localized what seems to be important for the given task, and the requests for resources for the TopUniversity rankings. By analyzing the requests, we can see one **GET** request which obtains a textual file:

[https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508597583828](https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508597583828)

By analyzing the file, it turns out to be a very useful JSON! All the data needed for scraping at this level is present here, most notably the ranking and the URL towards more information. It correlates to the same information we can directly see from the Chrome Developer Tools:

![](images/chrome0.png)

#####  Design choice
As in the image above can be seen, the HTML data is well organized in the tables, with appropriate classes which designate the content of the cell. One approach would be to use `BeautifulSoup` and parse the HTML of the page. Since we have the data in more concise format (JSON) we decided to process the textual representation. Since only the representation and intermediate steps differ, and results are the same, this is a choice we have made. 

Nevertheless, further steps will have to rely on parsing the HTML and DOM of the subsequent university links with further details (since no equivalent JSON is obtained), so both approaches are showcased.


#### Implementing the scraping procedure

We will extract the before found JSON file and use it as a baseline for scraping this Website. We send a GET request and process the file as JSON.

In [2]:
## TU = TopUniversities
TU_path = 'https://www.topuniversities.com'
base_json_path = '/sites/default/files/qs-rankings-data/357051.txt?_=1508597583828'

req = requests.get(TU_path+base_json_path)

TU_json = req.json()

We convert the JSON file to Pandas `DataFrame` to enable easier exploration and utilization of the file. We are interested in the **data** section of our JSON file, and we take the first 200 top ranking universities in the list. 

Notably, some universities share the same ranking (tie). Per assignment at this point we will take 200 best ranking universities for our analysis, and not the universities until the rank 200.

In [3]:
TU_df = pd.DataFrame.from_dict(TU_json['data']).head(200)

We can observe that we did not need any additional parsing which would add both to code complexity, potential computational overhead in bigger datasets, as well as algorithmic complexity if the website HTML code is not tidy (which is not the case here). The `DataFrame` carries all the information we need for now, of which the region, country, rank, title and url play the significant role in the task.

In [4]:
TU_df.head(2)

Unnamed: 0,cc,core_id,country,guide,logo,nid,rank_display,region,score,stars,title,url
0,US,410,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",294850,1,North America,100.0,6,Massachusetts Institute of Technology (MIT),/universities/massachusetts-institute-technolo...
1,US,573,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",297282,2,North America,98.7,5,Stanford University,/universities/stanford-university


#### 1.1.1 Extracting the information for each University

Some information such as University name or rank is available immediately, while for detailed information we need to follow the subsequent link provided in the **url** field. Unfortunately, as previously mentioned, we don't have information packed in JSON format this time. We use HTML parsing provided by `BeautifulSoup` to obtain all the missing data. We are in luck since the HTML is well organized, with appropriate classes assigned to `div`s containing needed pieces of information.

We iterate through the previously obtained `DataFrame` and scrape and collect the necessary data, and append it to the existing `DataFrame`.

#### Note on undefined values

At least one of the universities has missing data and we need to handle such cases. If the number of such cases is small enough, we could impute the values manually, by consulting the appropriate sources, thus keeping the consistency of the rankings. If we don't have all the data for our specific analysis, we would have to discard such Universities or come up with a methodology for mining up missing data on larger scale.

We choose to be consistent and we will print out the universities which have some missing values and will try to fill in the missing values. In such manner we differentiate **raw values** such as the number of students and **derived values** such as different scores provided. 

   #####      Raw values
We will try to impute raw values from various sources since we will use such data in further data analysis. Such data might be available directly from the websites of Universities.

   #####      Derived values
We will not attempt to impute derived values, such as different scores calculated by the selected ranking Website. Methodologies are often incomplete and all the collected raw values unavailable. Any methodology for imputing such values would be most probably very erroneous. Since such data is not extensively used for analysis, it is acceptable to leave some of such values undefined.

In [5]:
# A helper function to get an number (integer) from the div text
def div2num(div):
    return int(div.find('div', class_='number').text.split()[0].replace(',',''))

# A helper function to get the percentage from the span text
def perc2num(div):
    return int(div.find('span', class_='perc').text.replace('%',''))

#### Additional information from the links

Landing page has provided a general overview of the most important pieces of information, such as names and ranks of universities. We further scrape the data provided in the links for each university. The base `DataFrame` already exists, now we enrich it with more detailed data, and we iterate and fill the available information.

In [8]:
# iterating through the existing base dataframe
for ind, row in TU_df.iterrows():
    # getting the detailed page of the given university
    # relative link to the page is given in the url field of the dataframe
    uni_details = requests.get(TU_path+row['url'])
    
    # get the HTML of the detailed university page and add it to the soup
    soup = BeautifulSoup(uni_details.text, 'html.parser')
    
    # get the name of the university from the DF
    uni_name = row['title']
    
    # get the number of faculty members from the detailed page and append to DF
    faculty_num = soup.find('div', class_='total faculty')
    try:
        # we parse the div text and append it to the DataFrame
        TU_df.loc[ind, 'Total faculty members'] = div2num(faculty_num)
    except:
        # print out the message in order to try to impute the data
        print('Undefined total faculty members for: '+uni_name)
    
    # similar procedure for other required data points
    # Total number of students
    students_num = soup.find('div', class_='total student')
    try:
        TU_df.loc[ind, 'Total students'] = div2num(students_num)
    except:
        print('Undefined total students for: '+uni_name)
    
    # Percentage of postgraduate students
    try:
        total_postgrad_percentage = students_num.find('div', class_='post')
        TU_df.loc[ind, 'Total postgrad percentage'] = perc2num(total_undergrad_percentage)
    except:
        print('Undefined Total postgrad percentage for: '+uni_name)
    
    # Percentage of undergraduate students
    try:
        total_undergrad_percentage = students_num.find('div', class_='grad')
        TU_df.loc[ind, 'Total undergrad percentage'] = perc2num(total_undergrad_percentage)
    except:
        print('Undefined Total undergrad percentage for: '+uni_name)
    
    # Number of international faculty members
    intl_faculty_num = soup.find('div', class_='inter faculty')
    try:
        TU_df.loc[ind, 'International faculty members'] = div2num(intl_faculty_num)
    except:
        print('Undefined international faculty members for: '+uni_name)
    
    # Total number of international students
    intl_students_num = soup.find('div', class_='total inter')
    try:
        TU_df.loc[ind, 'International students'] = div2num(intl_students_num)
    except:
        print('Undefined international students for: '+uni_name)
    
    # International postgraduate students percentage
    try:
        intl_postgrad_percentage = intl_students_num.find('div', class_='post')
        TU_df.loc[ind, 'International postgrad percentage'] = perc2num(intl_postgrad_percentage)
    except:
        print('Undefined International postgrad percentage for: '+uni_name)
    
    # International undergraduate students percentage
    try:
        intl_undergrad_percentage = intl_students_num.find('div', class_='grad')
        TU_df.loc[ind, 'International undergrad percentage'] = perc2num(intl_undergrad_percentage)
    except:
        print('Undefined International undergrad percentage for: '+uni_name)
        
    # Find all the scores, iterate through them
    try:
        scores_list = soup.find('ul', class_='score').find_all('li')
        for score in scores_list:
            try:
                criteria = score.find('div', class_='criteria').text.strip()
                value = float(score.find('div', class_='text').text)
                TU_df.loc[ind, criteria] = value
            except:
                print('Undefined '+criteria+' for: '+uni_name)
    except:
        print('No scores available at all for: '+uni_name)


Undefined total faculty members for: New York University (NYU)
Undefined total students for: New York University (NYU)
Undefined Total postgrad percentage for: New York University (NYU)
Undefined Total undergrad percentage for: New York University (NYU)
Undefined international faculty members for: New York University (NYU)
Undefined international students for: New York University (NYU)
Undefined International postgrad percentage for: New York University (NYU)
Undefined International undergrad percentage for: New York University (NYU)
No scores available at all for: New York University (NYU)
Undefined International undergrad percentage for: Pohang University of Science And Technology (POSTECH)
Undefined International undergrad percentage for: Indian Institute of Technology Delhi (IITD)
Undefined international faculty members for: Indian Institute of Science (IISc) Bangalore
Undefined International undergrad percentage for: Scuola Superiore Sant'Anna Pisa di Studi Universitari e di Perfe

##### Scores

Scores are variables in range from 0.0 to 100.0, floating point value. They are derived from the parameters collected, surveys conducted and described in more detail in the provided [methodology](https://www.topuniversities.com/qs-world-university-rankings/methodology). As prior described they would not be imputed if missing.

#### Imputing undefined values

We can see that the New York University (NYU) has information that is completely missing, while Indian Institute of Science (IISc) Bangalore has one missing raw value. Considering the scope of missing values, they will be manually set using the following sources:

- NYU: https://www.nyu.edu/about/news-publications/nyu-at-a-glance.html, https://www.nyu.edu/admissions/undergraduate-admissions/nyu-facts.html
- IISc: http://www.iisc.ac.in/iisc-in-numbers/#ffs-tabbed-12

This is potentially troubling, since there is no information about the actual data collection for the ranking, methodology or actual date which is available for a confident selection of the figures missing. Frequently, the websites also omit the precise dates. Comparing the potential noise which would be added to the fact that we would discard 2 top 200 universities, the decision is to take the data and impute it, while being aware that it might induce some error in methodology. It is a tradeoff between statistical error in imputed data and the error which would occur in the further aggregate analysis.

Therefore the NYU data is following:
- total faculty members: 7861 (Fall 2015)
- total students: 58419 (Fall 2015)
- international faculty members: 619 (September 2017)
- international students: 26% == 15189 (Fall 2015)

Unfortunately, the data of international faculty members is not available in public records online for IISc, therefore it will be designated as 0. The unavailability might rise from the fact that the figure is rather low (https://scroll.in/article/811696/there-are-no-indian-universities-in-the-worlds-top-250-list-heres-how-javadekar-can-change-that), so it would not be positive for faculty to publicize such information. 

Here we try to follow the rule not to completely penalize omission of data if possible, as described in [THE methodology](https://www.timeshighereducation.com/world-university-rankings/methodology-world-university-rankings-2016-2017):
> *On the rare occasions when a particular data point is not provided we enter a low estimate between the average value of the indicators and the lowest value reported: the 25th percentile of the other indicators. By doing this, we avoid penalising an institution too harshly with a “zero” value for data that it overlooks or does not provide, but we do not reward it for withholding them.*



**If time permitted a good practice would be to contact the institution directly for the actual quote in both cases.**

In [9]:
TU_df.loc[TU_df.title=='New York University (NYU)','Total faculty members'] = 7861
TU_df.loc[TU_df.title=='New York University (NYU)','Total students'] = 58419
TU_df.loc[TU_df.title=='New York University (NYU)','International faculty members'] = 619
TU_df.loc[TU_df.title=='New York University (NYU)','International students'] = 15189


TU_df = TU_df.fillna(0)

#### Pruning the data in DataFrame

We have several pieces of information that are redundant at this point, so we will remove such columns from the DataFrame for a better overview. We can see bellow that *cc*, *rank_display*, *stars*, *core_id*, *guide*, *logo*, *nid* and *url* are present, but redundant in the following analysis.

In [10]:
TU_df.head(2)

Unnamed: 0,cc,core_id,country,guide,logo,nid,rank_display,region,score,stars,...,International postgrad percentage,International undergrad percentage,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage
0,US,410,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",294850,1,North America,100.0,6,...,83.0,17.0,100.0,100.0,99.9,100.0,100.0,100.0,96.1,78.0
1,US,573,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",297282,2,North America,98.7,5,...,83.0,17.0,98.7,100.0,99.4,100.0,100.0,99.6,72.7,40.0


In [11]:
TU_df.drop(['cc', 'stars', 'core_id', 'guide', 'logo', 'nid', 'url'], axis=1, inplace=True)
TU_df.head(1)

Unnamed: 0,country,rank_display,region,score,title,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,International postgrad percentage,International undergrad percentage,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage
0,United States,1,North America,100,Massachusetts Institute of Technology (MIT),2982.0,11067.0,40.0,1679.0,3717.0,83.0,17.0,100.0,100.0,99.9,100.0,100.0,100.0,96.1,78.0


#### We have our DataFrame ready for further analysis!
---

### 1.2 Ranking: faculty members, students and internationalization

In this section we answer the question: **Which are the best universities in term of: (a) ratio between faculty members and students, (b) ratio of international students?**

For easier analysis for (a) and (b) we will add the ratios to the existing DataFrame:

In [12]:
# ratio between faculty members and students
TU_df['Ratio faculty/students'] = TU_df['Total faculty members']/TU_df['Total students']

# ratio of international students
TU_df['Ratio intl/total students'] = TU_df['International students']/TU_df['Total students']

In [13]:
TU_df.head(1)

Unnamed: 0,country,rank_display,region,score,title,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,...,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
0,United States,1,North America,100,Massachusetts Institute of Technology (MIT),2982.0,11067.0,40.0,1679.0,3717.0,...,100.0,100.0,99.9,100.0,100.0,100.0,96.1,78.0,0.26945,0.335863


We have prepared the data for our analysis! The results are obtained by sorting the `DataFrame` by appropriate criterion.

### Best university: (a) ratio between faculty members and students

In [14]:
TU_fac_stud_df = TU_df.sort_values('Ratio faculty/students', ascending=False)

#### Top 5 by ratio of faculty members and students:

In [15]:
TU_fac_stud_df.head(5)

Unnamed: 0,country,rank_display,region,score,title,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,...,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
3,United States,4,North America,97.7,California Institute of Technology (Caltech),953.0,2255.0,44.0,350.0,647.0,...,97.7,99.5,100.0,85.4,100.0,93.4,89.2,31.0,0.422616,0.286918
15,United States,16,North America,90.4,Yale University,4940.0,12402.0,44.0,1708.0,2469.0,...,90.4,100.0,63.2,99.8,100.0,90.7,61.7,82.0,0.398323,0.199081
5,United Kingdom,6,Europe,95.3,University of Oxford,6750.0,19720.0,56.0,2964.0,7353.0,...,95.3,100.0,76.3,100.0,100.0,98.6,98.5,63.0,0.342292,0.37287
4,United Kingdom,5,Europe,95.6,University of Cambridge,5490.0,18770.0,63.0,2278.0,6699.0,...,95.6,100.0,78.3,100.0,100.0,97.4,97.7,44.0,0.292488,0.356899
16,United States,17,North America,89.8,Johns Hopkins University,4462.0,16146.0,37.0,1061.0,4105.0,...,89.8,94.3,83.9,66.4,100.0,87.9,81.3,44.0,0.276353,0.254243


### Best university: (b) ratio of international students

In [16]:
TU_intl_df = TU_df.sort_values('Ratio intl/total students', ascending=False)

#### Top 5 by ratio of international students

In [17]:
TU_intl_df.head(5)

Unnamed: 0,country,rank_display,region,score,title,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,...,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
34,United Kingdom,35,Europe,81.8,London School of Economics and Political Scien...,1088.0,9760.0,47.0,687.0,6748.0,...,81.8,90.3,71.7,100.0,55.9,100.0,100.0,73.0,0.111475,0.691393
11,Switzerland,12,Europe,91.2,Ecole Polytechnique Fédérale de Lausanne (EPFL),1695.0,10343.0,52.0,1300.0,5896.0,...,91.2,83.0,99.2,95.5,92.0,100.0,100.0,82.0,0.163879,0.570047
7,United Kingdom,8,Europe,93.7,Imperial College London,3930.0,16090.0,57.0,2071.0,8746.0,...,93.7,99.4,68.7,100.0,100.0,100.0,100.0,52.0,0.244251,0.543567
198,Netherlands,200,Europe,47.9,Maastricht University,1277.0,16385.0,63.0,502.0,8234.0,...,47.9,30.8,73.3,39.0,35.4,95.9,100.0,83.0,0.077937,0.502533
47,United States,=47,North America,78.6,Carnegie Mellon University,1342.0,13356.0,49.0,425.0,6385.0,...,78.6,85.0,95.6,85.1,43.2,62.3,100.0,78.0,0.100479,0.478062


------
### 1.3 Ranking: regional and national results

In this section we use the results obtained in **1.2** and aggregate them, separately, by country and region, thus providing answer to the question 2:

*Answer the previous question aggregating the data by (c) country and (d) region.*

The `DataFrame` contains all the datapoints we need for answering the questions, the only procedure we need to undertake is aggregation of data. We need to remove the calculated ratios per faculty since the summation would not result in proper result in this case (we would need to normalize the result otherwise). We can further remove the score and name of the university **manually**, since it does not play a role in the dataset, but since it is a variable on which we could not perform aggregation such variables would anyway be ignored by *Pandas*.

In [20]:
TU_aggregate_df = TU_df.copy()

columns_to_drop = ['Ratio faculty/students', 'Ratio intl/total students']

TU_aggregate_df.drop(columns_to_drop, axis=1, inplace=True)
TU_aggregate_df.head(1)

Unnamed: 0,country,rank_display,region,score,title,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,...,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
0,United States,1,North America,100,Massachusetts Institute of Technology (MIT),2982.0,11067.0,40.0,1679.0,3717.0,...,100.0,100.0,99.9,100.0,100.0,100.0,96.1,78.0,0.26945,0.335863


---
#### Aggregating data by country (c)

We will group the data by the country and then sum the values of such aggregate to produce desired data.

In [48]:
# we can compose multiple functions to reduce code clutter with rule - column:['aggregation1', 'aggregation2'...]
rule = {'Total faculty members':['sum'], 'Total students':['sum'], 'Total undergrad percentage':['mean'], 
        'International faculty members':['sum'], 'International students':['sum'], 'International postgrad percentage':['mean'],
       'International undergrad percentage':['mean'], 'Overall Score':['mean'], 'Academic Reputation':['mean'],
       'Citations per Faculty':['mean'], 'Employer Reputation':['mean'], 'Faculty Student':['mean'], 'International Faculty':['mean'],
       'International Students':['mean'], 'Total postgrad percentage':['mean']}

TU_country_df = TU_aggregate_df.groupby(['country']).agg(rule)
TU_country_df.head(2)

Unnamed: 0_level_0,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,International postgrad percentage,International undergrad percentage,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage
Unnamed: 0_level_1,sum,sum,mean,sum,sum,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
Argentina,16421.0,122301.0,93.0,3165.0,27109.0,8.0,92.0,69.1,94.5,0.0,95.5,75.1,50.3,70.6,52.0
Australia,22034.0,301994.0,68.333333,11382.0,106359.0,49.555556,50.444444,72.544444,86.477778,72.011111,86.166667,15.644444,98.355556,93.955556,62.888889


As done previously, desired ratios are calculated and added to the `DataFrame`:

In [49]:
TU_country_df['Ratio faculty/students'] = TU_country_df['Total faculty members']/TU_country_df['Total students']
TU_country_df['Ratio intl/total students'] = TU_country_df['International students']/TU_country_df['Total students']

#### Top 5 countries with Universities with highest ratio of faculty to students

In [50]:
TU_country_df.sort_values('Ratio faculty/students', ascending=False).head(5)

Unnamed: 0_level_0,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,International postgrad percentage,International undergrad percentage,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
Unnamed: 0_level_1,sum,sum,mean,sum,sum,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,Unnamed: 16_level_1,Unnamed: 17_level_1
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2
Russia,6709.0,30233.0,65.0,373.0,5098.0,16.0,84.0,65.0,82.0,0.0,79.6,99.7,0.0,48.7,88.0,0.22191,0.168624
Denmark,11916.0,67223.0,53.333333,3904.0,9543.0,72.0,28.0,63.066667,61.066667,56.933333,49.1,78.0,85.4,46.7,63.666667,0.177261,0.14196
Saudi Arabia,1062.0,6040.0,86.0,665.0,989.0,55.0,45.0,50.3,35.7,27.6,40.5,94.9,100.0,46.6,52.0,0.175828,0.163742
Singapore,9444.0,58466.0,82.0,6079.0,16168.0,47.0,53.0,91.35,96.95,74.75,98.25,91.2,100.0,87.15,56.0,0.16153,0.276537
Malaysia,2755.0,17902.0,54.0,655.0,3476.0,75.0,25.0,60.8,65.7,24.3,57.5,87.8,65.4,59.7,64.0,0.153893,0.194168


#### Top 5 countries with Universities with highest ratio of international students

In [51]:
TU_country_df.sort_values('Ratio intl/total students', ascending=False).head(5)

Unnamed: 0_level_0,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,International postgrad percentage,International undergrad percentage,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
Unnamed: 0_level_1,sum,sum,mean,sum,sum,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,Unnamed: 16_level_1,Unnamed: 17_level_1
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2
Australia,22034.0,301994.0,68.333333,11382.0,106359.0,49.555556,50.444444,72.544444,86.477778,72.011111,86.166667,15.644444,98.355556,93.955556,62.888889,0.072962,0.352189
United Kingdom,79934.0,583621.0,71.75,30216.0,199426.0,48.178571,51.821429,69.385714,70.507143,57.071429,77.546429,63.210714,91.825,92.471429,62.928571,0.136962,0.341705
Hong Kong,10166.0,78838.0,75.6,6296.0,24499.0,53.2,46.8,78.4,85.48,66.14,73.36,69.56,99.96,91.0,54.4,0.128948,0.310751
Austria,4117.0,63446.0,62.5,1572.0,19667.0,42.5,57.5,51.4,57.5,41.85,63.4,0.0,82.15,92.5,68.0,0.06489,0.30998
Switzerland,15323.0,109112.0,50.857143,9208.0,32995.0,66.285714,33.714286,68.442857,61.357143,75.085714,62.785714,67.342857,99.942857,80.314286,53.714286,0.140434,0.302396


---
#### Aggregating data by region (d)

We perform the same procedure with grouping by the region instead of country.

In [52]:

TU_region_df = TU_aggregate_df.groupby(['region']).sum()
TU_region_df.head(2)

Unnamed: 0_level_0,Total faculty members,Total students,Total undergrad percentage,International faculty members,International students,International postgrad percentage,International undergrad percentage,Overall Score,Academic Reputation,Citations per Faculty,Employer Reputation,Faculty Student,International Faculty,International Students,Total postgrad percentage,Ratio faculty/students,Ratio intl/total students
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Africa,1733.0,19593.0,82.0,379.0,3325.0,30.0,70.0,48.9,60.2,35.9,55.1,32.9,59.2,49.2,12.0,0.08845,0.169703
Asia,106734.0,807003.0,2302.0,25462.0,110100.0,2287.0,1513.0,2604.9,2886.9,2342.9,2845.2,2634.2,1474.9,1040.9,2351.0,5.117563,5.030969


We calculate the desired ratios by the region and then present the top 5 regions by each criterion.

In [25]:
TU_region_df['Ratio faculty/students'] = TU_region_df['Total faculty members']/TU_region_df['Total students']
TU_region_df['Ratio intl/total students'] = TU_region_df['International students']/TU_region_df['Total students']

#### Top 5 regions with Universities with the highest faculty to student ratio

In [26]:
TU_region_df.sort_values(['Ratio faculty/students'], ascending=False).head(5)

Unnamed: 0_level_0,Total faculty members,Total students,International faculty members,International students,Ratio faculty/students,Ratio intl/total students
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Asia,106734.0,807003.0,25462.0,110100.0,0.13226,0.136431
North America,189984.0,1604772.0,44455.0,307305.0,0.118387,0.191494
Europe,218358.0,1957251.0,67598.0,449364.0,0.111564,0.229589
Latin America,45382.0,435750.0,5648.0,36871.0,0.104147,0.084615
Africa,1733.0,19593.0,379.0,3325.0,0.08845,0.169703


#### Top 5 regions with Universities with the highest ratio of international students

In [27]:
TU_region_df.sort_values(['Ratio intl/total students'], ascending=False).head(5)

Unnamed: 0_level_0,Total faculty members,Total students,International faculty members,International students,Ratio faculty/students,Ratio intl/total students
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Oceania,25347.0,350167.0,12786.0,118798.0,0.072385,0.339261
Europe,218358.0,1957251.0,67598.0,449364.0,0.111564,0.229589
North America,189984.0,1604772.0,44455.0,307305.0,0.118387,0.191494
Africa,1733.0,19593.0,379.0,3325.0,0.08845,0.169703
Asia,106734.0,807003.0,25462.0,110100.0,0.13226,0.136431


#### Ask google function

In [47]:
def askGoogleForName(university_name):
    go = requests.get('https://www.google.com/search?q='+university_name.replace(' ','+')+"&sourceid=chrome&ie=UTF-8")
    gSoup = BeautifulSoup(go.text, 'html.parser')
    #fac = gSoup.find('div', class_="_B5d")
    #Getting the a name of the first result
    facs = gSoup.find('div', id="ires").find_all('div', class_='g')
    
    for fac in facs:
        f = fac.find('a')
        if(f!=None):
            return f.text
    
    return None

In [48]:
copy = TU_df.copy()

### Adding google name to DF

In [53]:
for ind,row in copy.iterrows():
    print(row['title'])
    copy.loc[ind, 'Google University Name']=askGoogleForName(row['title'])
    print('     is ' + copy.loc[ind,'Google University Name'])


Massachusetts Institute of Technology (MIT)
     is Massachusetts Institute of Technology: MIT
Stanford University
     is Stanford University
Harvard University
     is Harvard University
California Institute of Technology (Caltech)
     is Caltech: Home
University of Cambridge
     is University of Cambridge
University of Oxford
     is University of Oxford
UCL (University College London)
     is UCL - London's Global University
Imperial College London
     is Imperial College London
University of Chicago
     is The University of Chicago
ETH Zurich - Swiss Federal Institute of Technology
     is ETH Zurich - Homepage | ETH Zurich
Nanyang Technological University, Singapore (NTU)
     is Nanyang Technological University, Singapore
Ecole Polytechnique Fédérale de Lausanne (EPFL)
     is EPFL | École polytechnique fédérale de Lausanne
Princeton University
     is Princeton University: Home
Cornell University
     is Cornell University
National University of Singapore (NUS)
     is NUS 

     is USP - Universidade de São Paulo | Universidade pública, autarquia ...
Universidad Nacional Autónoma de México  (UNAM)
     is UNAM | Portal UNAM
Hokkaido University
     is Hokkaido University
Wageningen University
     is University - WUR
Freie Universitaet Berlin
     is Freie Universität Berlin: Homepage
Ghent University
     is Welcome — Ghent University
Queen Mary University of London
     is Queen Mary University of London
Kyushu University
     is KYUSHU UNIVERSITY
University of Maryland, College Park
     is The University of Maryland | A Preeminent Public Research University
Université de Montréal
     is Université de Montréal / UdeM
Université Pierre et Marie Curie (UPMC)
     is UPMC - University Pierre and Marie CURIE - Sciences and ...
University of Southern California
     is University of Southern California
Chalmers University of Technology
     is Chalmers University of Technology | Chalmers
University of California, Santa Barbara (UCSB)
     is Home - Univers

In [54]:
copy.head()

Unnamed: 0,cc,core_id,country,guide,logo,nid,rank_display,region,score,stars,title,url,Total faculty members,Total students,International faculty members,International students,Google University Name
0,US,410,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",294850,1,North America,100.0,6,Massachusetts Institute of Technology (MIT),/universities/massachusetts-institute-technolo...,2982.0,11067.0,1679.0,3717.0,Massachusetts Institute of Technology: MIT
1,US,573,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",297282,2,North America,98.7,5,Stanford University,/universities/stanford-university,4285.0,15878.0,2042.0,3611.0,Stanford University
2,US,253,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",294270,3,North America,98.4,5,Harvard University,/universities/harvard-university,4350.0,22429.0,1311.0,5266.0,Harvard University
3,US,94,United States,"<a href=""/where-to-study/north-america/united-...","<img src=""https://www.topuniversities.com/site...",294562,4,North America,97.7,5,California Institute of Technology (Caltech),/universities/california-institute-technology-...,953.0,2255.0,350.0,647.0,Caltech: Home
4,GB,95,United Kingdom,"<a href=""/where-to-study/europe/united-kingdom...","<img src=""https://www.topuniversities.com/site...",294561,5,Europe,95.6,5,University of Cambridge,/universities/university-cambridge,5490.0,18770.0,2278.0,6699.0,University of Cambridge


In [4]:
from googleapiclient import discovery

In [6]:
import googleapiclient as gap

In [None]:
gap.