# 02 - Data from the Web

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in term of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking in consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of tie, use the order defined in the webpage.


## 1. World University Rankings by TopUniversities

---

We focus on scraping data from the [TopUniversities 2018 rankings](https://www.topuniversities.com/university-rankings/world-university-rankings/2018) and on answering the assignment questions through simple data analysis. The structure of this chapter follows the assignment outline:

1. Data scraping
2. Ranking: faculty members, students and internationalization
3. Ranking: regional and national results


### Libraries utilized

We will use BeautifulSoup along with Pandas and standard Python utilities for the scraping and data analysis.

In [1]:
# importing the utils for Web scraping
import requests
from bs4 import BeautifulSoup

# importing python and data utils
import pandas as pd
import numpy as np
import json
%matplotlib inline
import matplotlib.pyplot as plt

### Swiss army knife for data scraping

[Postman](https://www.getpostman.com/) is used outside Python to get a better sense of the Web jungle we are harvesting our data from. We have used both Interceptor and the application itself to make sense of the incoming traffic.

We have also relied on traffic observations in Chrome Developer Tools (*which can be opened in Chrome on Windows by pressing F12*).

### 1.1 Data scraping

The first step in our data scraping process was to get acquainted with the website itself. We have turned on the Interceptor for the traffic and the application itself, then loaded the website and observed the requests. 

#### Making our life easier: Postman filters

It is useful to filter the captured requests down to those sent to the website of interest. Modern websites commonly call many third-party APIs, which makes the relevant requests harder to find. Traffic from other open browser tabs also shows up, so displaying only the site we are investigating helps further. The benefit is visible in the practical example below:

<table width="70%">
  <tr>
    <th style="text-align:center">Requests without any filters</th>
    <th style="text-align:center">After filtering results from *www.topuniversities.com* only</th> 
  </tr>
  <tr>
    <td>![](images/postman0.png)</td>
    <td style="vertical-align:top">![](images/postman1.png)</td> 
  </tr>
</table>

From the long list of requests, we have isolated those that matter for the task: the requests for the TopUniversities ranking resources. Among them is one **GET** request which fetches a text file:

[https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508597583828](https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508597583828)

On inspection, the file turns out to be very useful JSON! All the data needed for scraping at this level is present here, most notably the ranking and the URL leading to more information. It corresponds to the same information we can see directly in Chrome Developer Tools:

![](images/chrome0.png)

#####  Design choice
As the image above shows, the HTML data is well organized in tables, with classes that designate the content of each cell. One approach would be to use `BeautifulSoup` and parse the HTML of the page. Since the same data is available in a more concise format (JSON), we decided to process the JSON instead. Only the representation and the intermediate steps differ; the results are the same.

Nevertheless, further steps will have to rely on parsing the HTML and DOM of the subsequent university links with further details (since no equivalent JSON is obtained), so both approaches are showcased.
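
For comparison, the HTML-parsing alternative could look roughly like the sketch below. It runs on a small inline HTML sample, not on the real page, and the class names `rank` and `uni` as well as the helper `parse_ranking_table` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are illustrative only and are
# not taken from the actual TopUniversities page.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>University</th></tr>
  <tr><td class="rank">1</td><td class="uni">Example University A</td></tr>
  <tr><td class="rank">2</td><td class="uni">Example University B</td></tr>
</table>
"""

def parse_ranking_table(html):
    """Return one {'rank', 'title'} dict per data row of the table."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        rank_cell = tr.find('td', class_='rank')
        name_cell = tr.find('td', class_='uni')
        if rank_cell is None or name_cell is None:
            continue  # header row or malformed row
        rows.append({'rank': rank_cell.get_text(strip=True),
                     'title': name_cell.get_text(strip=True)})
    return rows

parsed_rows = parse_ranking_table(SAMPLE_HTML)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as the `data` section of the JSON file.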


#### Implementing the scraping procedure

We will fetch the JSON file found above and use it as the baseline for scraping this website. We send a GET request and parse the response as JSON.

In [None]:
## TU = TopUniversities
TU_path = 'https://www.topuniversities.com'
base_json_path = '/sites/default/files/qs-rankings-data/357051.txt?_=1508597583828'

req = requests.get(TU_path+base_json_path)

TU_json = req.json()

We convert the JSON file to Pandas `DataFrame` to enable easier exploration and utilization of the file. We are interested in the **data** section of our JSON file, and we take the first 200 top ranking universities in the list. 

Notably, some universities share the same rank (ties). As the assignment asks, we take the first 200 entries of the list for our analysis, rather than every university with a rank up to 200.
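
The difference between the two selections only shows up when ties straddle the cut-off. The toy data below is invented for illustration and uses a cut-off of 3 instead of 200.

```python
import pandas as pd

# Invented toy ranking with a tie at rank 3.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'rank': [1, 2, 3, 3, 5],
})
n = 3

# First n rows: exactly n universities, ties resolved by page order.
first_n_rows = toy.head(n)

# Everything with rank <= n: may return more than n universities.
up_to_rank_n = toy[toy['rank'] <= n]
```

Here `first_n_rows` holds A, B and C, while `up_to_rank_n` also includes D.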

In [None]:
TU_df = pd.DataFrame.from_dict(TU_json['data']).head(200)

Note that no additional parsing was needed. Parsing HTML would have added code complexity, potential computational overhead on bigger datasets, and algorithmic complexity if the website's HTML were not tidy (which is not the case here). The `DataFrame` carries all the information we need for now, of which the region, country, rank, title and url matter most for the task.

In [None]:
TU_df.head(2)

#### 1.1.1 Extracting the information for each University

Some information such as University name or rank is available immediately, while for detailed information we need to follow the subsequent link provided in the **url** field. Unfortunately, as previously mentioned, we don't have information packed in JSON format this time. We use HTML parsing provided by `BeautifulSoup` to obtain all the missing data. We are in luck since the HTML is well organized, with appropriate classes assigned to `div`s containing needed pieces of information.

We iterate through the previously obtained `DataFrame` and scrape and collect the necessary data, and append it to the existing `DataFrame`.

#### Note on undefined values

At least one of the universities has missing data, and we need to handle such cases. If the number of such cases is small enough, we can impute the values manually by consulting appropriate sources, thus keeping the rankings consistent. Otherwise, we would have to discard those universities or come up with a methodology for recovering missing data on a larger scale.

We choose to be consistent: we print out the universities which have missing values and then try to fill those values in.

In [None]:
# A helper function to get a number (integer) from the div text
def div2num(div):
    return int(div.find('div', class_='number').text.split()[0].replace(',',''))
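
A more defensive variant of this helper might return `None` instead of raising when the block is absent. The sketch below is one possible shape; the name `div2num_or_none` and the sample HTML are assumptions for illustration, while the class names `total faculty`, `total student` and `number` are the ones used in this notebook.

```python
from bs4 import BeautifulSoup

def div2num_or_none(div):
    """Like div2num, but return None when the block or its number is missing."""
    if div is None:
        return None
    number_div = div.find('div', class_='number')
    if number_div is None:
        return None
    tokens = number_div.get_text().split()
    if not tokens:
        return None
    try:
        return int(tokens[0].replace(',', ''))
    except ValueError:
        return None

# Invented sample: a page fragment that has a faculty block but no student block.
sample = BeautifulSoup(
    '<div class="total faculty"><div class="number"> 1,234 </div></div>',
    'html.parser')
found = div2num_or_none(sample.find('div', class_='total faculty'))
missing = div2num_or_none(sample.find('div', class_='total student'))
```

With this variant the caller can test for `None` explicitly instead of relying on exceptions.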

In [None]:
for ind, row in TU_df.iterrows():
    # getting the detailed page of the given university
    # relative link to the page is given in the url field of the dataframe
    uni_details = requests.get(TU_path+row['url'])
    
    # get the HTML of the detailed university page and add it to the soup
    soup = BeautifulSoup(uni_details.text, 'html.parser')
    
    # get the name of the university from the DF
    uni_name = row['title']
    
    # get the number of faculty members from the detailed page and append to DF
    faculty_num = soup.find('div', class_='total faculty')
    try:
        # we parse the div text and append it to the DataFrame
        TU_df.loc[ind, 'Total faculty members'] = div2num(faculty_num)
    except (AttributeError, IndexError, ValueError):
        # print out the message in order to try to impute the data
        print('Undefined total faculty members for: '+uni_name)
    
    # similar procedure for other required data points
    students_num = soup.find('div', class_='total student')
    try:
        TU_df.loc[ind, 'Total students'] = div2num(students_num)
    except (AttributeError, IndexError, ValueError):
        print('Undefined total students for: '+uni_name)
    
    intl_faculty_num = soup.find('div', class_='inter faculty')
    try:
        TU_df.loc[ind, 'International faculty members'] = div2num(intl_faculty_num)
    except (AttributeError, IndexError, ValueError):
        print('Undefined international faculty members for: '+uni_name)
    
    intl_students_num = soup.find('div', class_='total inter')
    try:
        TU_df.loc[ind, 'International students'] = div2num(intl_students_num)
    except (AttributeError, IndexError, ValueError):
        print('Undefined international students for: '+uni_name)

#### Imputing undefined values

We can see that New York University (NYU) is missing all four values, while Indian Institute of Science (IISc) Bangalore is missing one. Given how few values are missing, they will be set manually using the following sources:

- NYU: https://www.nyu.edu/about/news-publications/nyu-at-a-glance.html, https://www.nyu.edu/admissions/undergraduate-admissions/nyu-facts.html
- IISc: http://www.iisc.ac.in/iisc-in-numbers/#ffs-tabbed-12

This is potentially troubling: we have no information about how or when the ranking data was collected, so we cannot confidently pick figures that match it, and the source websites frequently omit precise dates as well. Weighing the noise added by imputation against the loss of 2 of the top 200 universities, we decided to impute the data, while being aware that this might introduce some methodological error. It is a tradeoff between the statistical error in the imputed data and the error that discarding the universities would cause in the later aggregate analysis.

The NYU data is therefore the following:
- total faculty members: 7861 (Fall 2015)
- total students: 58419 (Fall 2015)
- international faculty members: 619 (September 2017)
- international students: 26% of total students, i.e. 15189 (Fall 2015)
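
The international student count follows from the percentage and the total, as this small check shows (the variable names are chosen here for illustration):

```python
total_students = 58419   # NYU total students, Fall 2015
intl_share = 0.26        # reported share of international students

# 58419 * 0.26 = 15188.94, rounded to the nearest whole student
intl_students = round(total_students * intl_share)
```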

Unfortunately, the number of international faculty members at IISc is not available in public records online, so it will be set to 0. The unavailability might stem from the fact that the figure is rather low (https://scroll.in/article/811696/there-are-no-indian-universities-in-the-worlds-top-250-list-heres-how-javadekar-can-change-that), so publicizing it would not be in the institution's interest.

*If time permitted, a good practice would be to contact the institutions directly for the actual figures in both cases.*

In [None]:
TU_df.loc[TU_df.title=='New York University (NYU)','Total faculty members'] = 7861
TU_df.loc[TU_df.title=='New York University (NYU)','Total students'] = 58419
TU_df.loc[TU_df.title=='New York University (NYU)','International faculty members'] = 619
TU_df.loc[TU_df.title=='New York University (NYU)','International students'] = 15189


TU_df = TU_df.fillna(0)

#### Pruning the data in DataFrame

Several pieces of information are redundant at this point, so we will remove those columns from the DataFrame for a better overview. We can see below that *cc*, *rank_display*, *stars*, *core_id*, *guide*, *logo*, *nid* and *url* are present, but redundant in the following analysis.

In [None]:
TU_df.head(1)

In [None]:
TU_df.drop(['cc', 'stars', 'core_id', 'guide', 'logo', 'nid', 'url'], axis=1, inplace=True)
TU_df.head(1)

#### We have our DataFrame ready for further analysis!
---

### 1.2 Ranking: faculty members, students and internationalization

In this section we answer the question: **Which are the best universities in term of: (a) ratio between faculty members and students, (b) ratio of international students?**

For easier analysis for (a) and (b) we will add the ratios to the existing DataFrame:

In [None]:
# ratio between faculty members and students
TU_df['Ratio faculty/students'] = TU_df['Total faculty members']/TU_df['Total students']

# ratio of international students
TU_df['Ratio intl/total students'] = TU_df['International students']/TU_df['Total students']

In [None]:
TU_df.head(1)

We have prepared the data for our analysis! The results are obtained by sorting the `DataFrame` by the appropriate criterion.
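
The assignment hint asks that ties keep the order of the webpage. One way to make this explicit is to request a stable sorting algorithm; the sketch below uses invented toy data and `kind='mergesort'`, which is an addition for illustration and not part of the cells in this notebook.

```python
import pandas as pd

# Invented toy data: B and C share the same ratio, and B comes first on the page.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'ratio': [0.10, 0.25, 0.25, 0.05],
})

# mergesort is stable, so tied rows keep their original relative order.
ranked = toy.sort_values('ratio', ascending=False, kind='mergesort')
```

In `ranked`, B stays ahead of C because it appeared first in the input.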

### Best university: (a) ratio between faculty members and students

In [None]:
TU_fac_stud_df = TU_df.sort_values('Ratio faculty/students', ascending=False)

#### Top 5 by ratio of faculty members and students:

In [None]:
TU_fac_stud_df.head(5)

### Best university: (b) ratio of international students

In [None]:
TU_intl_df = TU_df.sort_values('Ratio intl/total students', ascending=False)

#### Top 5 by ratio of international students

In [None]:
TU_intl_df.head(5)

------
### 1.3 Ranking: regional and national results

In this section we take the results obtained in **1.2** and aggregate them, separately, by country and by region, thus answering the second part of the question:

*Answer the previous question aggregating the data by (c) country and (d) region.*

The `DataFrame` contains all the data points we need to answer the questions; the only remaining step is aggregation. We first remove the per-university ratios, since summing ratios does not give the ratio of the group (that would require normalizing the result). Instead, we sum the raw counts per group and recompute the ratios from the totals. The score and name of the university could also be removed **manually**, but since they are not numeric they do not contribute to the summed counts.
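
A small invented example shows why the ratios have to be recomputed from the summed counts. The country `X` and its figures are made up for illustration.

```python
import pandas as pd

# Two invented universities in the same country, of very different size.
toy = pd.DataFrame({
    'country': ['X', 'X'],
    'Total faculty members': [100, 900],
    'Total students': [1000, 3000],
})
toy['ratio'] = toy['Total faculty members'] / toy['Total students']

# Ratio of the summed counts: 1000 / 4000
sums = toy.groupby('country')[['Total faculty members', 'Total students']].sum()
ratio_of_sums = sums['Total faculty members'] / sums['Total students']

# Summing or averaging the per-university ratios gives different numbers.
sum_of_ratios = toy.groupby('country')['ratio'].sum()
mean_of_ratios = toy.groupby('country')['ratio'].mean()
```

The ratio of sums is 0.25, the sum of ratios is 0.4 and the unweighted mean of ratios is 0.2, so only the first describes the country as a whole.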

In [None]:
TU_aggregate_df = TU_df.copy()
TU_aggregate_df.drop(['Ratio faculty/students', 'Ratio intl/total students'], axis=1, inplace=True)
TU_aggregate_df.head(1)

---
#### Aggregating data by country (c)

We group the data by country and sum the values within each group to produce the desired data.

In [None]:
# we can chain the grouping and the summation to reduce code clutter;
# numeric_only=True keeps text columns such as title out of the sum
TU_country_df = TU_aggregate_df.groupby(['country']).sum(numeric_only=True)
TU_country_df.head(2)

As done previously, desired ratios are calculated and added to the `DataFrame`:

In [None]:
TU_country_df['Ratio faculty/students'] = TU_country_df['Total faculty members']/TU_country_df['Total students']
TU_country_df['Ratio intl/total students'] = TU_country_df['International students']/TU_country_df['Total students']

#### Top 5 countries with Universities with highest ratio of faculty to students

In [None]:
TU_country_df.sort_values('Ratio faculty/students', ascending=False).head(5)

#### Top 5 countries with Universities with highest ratio of international students

In [None]:
TU_country_df.sort_values('Ratio intl/total students', ascending=False).head(5)

---
#### Aggregating data by region (d)

We perform the same procedure with grouping by the region instead of country.

In [None]:
TU_region_df = TU_aggregate_df.groupby(['region']).sum(numeric_only=True)
TU_region_df.head(2)

We calculate the desired ratios by the region and then present the top 5 regions by each criterion.

In [None]:
TU_region_df['Ratio faculty/students'] = TU_region_df['Total faculty members']/TU_region_df['Total students']
TU_region_df['Ratio intl/total students'] = TU_region_df['International students']/TU_region_df['Total students']

#### Top 5 regions with Universities with the highest faculty to student ratio

In [None]:
TU_region_df.sort_values(['Ratio faculty/students'], ascending=False).head(5)

#### Top 5 regions with Universities with the highest ratio of international students

In [None]:
TU_region_df.sort_values(['Ratio intl/total students'], ascending=False).head(5)

## 2. World University Rankings by TimesHigherEducation

In this chapter, we scrape the data from [TimesHigherEducation ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking) and answer the assignment questions by performing simple data analysis.


Pandas options setting for a better view of the data

In [2]:
pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)

### Analysis of the website to scrape

As in the case for task 1, we notice, using Interceptor Chrome app and Postman, that when accessing the main list url

http://timeshighereducation.com/world-university-rankings/2018/world-ranking 


another specific **GET** request is made to one of the server's REST API endpoints

https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json

which serves a file in JSON format that contains a data field with information about each university in the ranking.

#### Implementing the scraping procedure

We begin by fetching the JSON file found previously with Interceptor and Postman and parsing it as JSON. To do so, we implement a utility function that sends a **GET** request to the server endpoint identified by the url.

The result is saved in a dictionary so that the http request is made only once per url.

In [3]:
json_data = {}
def get_json(url):
    if url not in json_data:
        r = requests.get(url)
        json_data[url] = r
    else:
        r = json_data[url]
        
    return r.json()
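
The caching idea can be checked without touching the network by injecting the fetching function. The sketch below is a variant written for illustration: `get_json_cached`, `fake_fetch` and the example URL are hypothetical names, and it caches the parsed JSON instead of the response object.

```python
_parsed_cache = {}

def get_json_cached(url, fetch):
    """Return parsed JSON for url, calling fetch(url) at most once per url."""
    if url not in _parsed_cache:
        _parsed_cache[url] = fetch(url)
    return _parsed_cache[url]

# A stand-in for the real HTTP call that records how often it is used.
calls = []
def fake_fetch(url):
    calls.append(url)
    return {'data': [{'name': 'Example University'}]}

first = get_json_cached('https://example.org/ranking.json', fake_fetch)
second = get_json_cached('https://example.org/ranking.json', fake_fetch)
```

After both calls, `calls` contains a single entry. In the notebook, `fetch` could be `lambda u: requests.get(u).json()`.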

In [4]:
thedu_url = 'https://www.timeshighereducation.com'
thedu_json_url = '/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

json_data_thedu = get_json(thedu_url + thedu_json_url)

#### Analysis of the json object format

In [5]:
# fields of the entire json object
display(json_data_thedu.keys())
# one entry in the data field
display(json_data_thedu['data'][0])

dict_keys(['data', 'subjects', 'locations', 'pillars'])

{'aliases': 'University of Oxford',
 'location': 'United Kingdom',
 'member_level': '0',
 'name': 'University of Oxford',
 'nid': 468,
 'rank': '1',
 'rank_order': '10',
 'record_type': 'master_account',
 'scores_citations': '99.1',
 'scores_citations_rank': '15',
 'scores_industry_income': '63.7',
 'scores_industry_income_rank': '169',
 'scores_international_outlook': '95.0',
 'scores_international_outlook_rank': '24',
 'scores_overall': '94.3',
 'scores_overall_rank': '10',
 'scores_research': '99.5',
 'scores_research_rank': '1',
 'scores_teaching': '86.7',
 'scores_teaching_rank': '5',
 'stats_female_male_ratio': '46 : 54',
 'stats_number_students': '20,409',
 'stats_pc_intl_students': '38%',
 'stats_student_staff_ratio': '11.2',
 'subjects_offered': 'Archaeology,Art, Performing Arts & Design,Biological Sciences,Business & Management,Chemical Engineering,Chemistry,Civil Engineering,Computer Science,Economics & Econometrics,Electrical & Electronic Engineering,General Engineering,Geo

The retrieved JSON object has 4 fields: data, subjects, locations and pillars, of which we are interested only in data. More specifically, we want the first 200 entries of the corresponding array because, again, we focus on the 200 best-ranking universities and not on the universities up to rank 200.

#### Structure of data

We observe that the data field contains the name, the location (which is essentially the country), the number of students, the percentage of international students and the student-to-staff ratio. This is enough to infer all the fields asked for in the assignment, except the region and the number of international faculty members. Therefore we do not need to additionally scrape the detail pages of the universities.

Given that the region of a country does not vary, we will use external information for this field, for example the DataFrame from the previous task, to infer the region of a university.
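
One possible way to do this is a country-to-region lookup built from the first ranking and applied to the `location` column. The sketch below uses invented toy frames (`tu_toy`, `the_toy`) and a made-up country to show what happens when a name has no match; country spellings may differ between the two websites, so unmatched entries should be inspected.

```python
import pandas as pd

# Toy stand-in for the TopUniversities frame (country and region columns).
tu_toy = pd.DataFrame({
    'country': ['United States', 'United Kingdom', 'United States'],
    'region': ['North America', 'Europe', 'North America'],
})

# Toy stand-in for the TimesHigherEducation frame; 'Atlantis' is invented.
the_toy = pd.DataFrame({
    'name': ['Uni 1', 'Uni 2', 'Uni 3'],
    'location': ['United Kingdom', 'United States', 'Atlantis'],
})

# One row per country, then a Series indexed by country name.
country_to_region = (tu_toy.drop_duplicates('country')
                           .set_index('country')['region'])

the_toy['region'] = the_toy['location'].map(country_to_region)

# Countries that could not be matched end up as NaN and need manual attention.
unmatched = the_toy.loc[the_toy['region'].isna(), 'location'].tolist()
```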

The number of international faculty members is not provided by the TimesHigherEducation, but we do not need it in our further analysis related to this task.

Observing the data field is a list of records, it makes sense to load it in a DataFrame for easier manipulation.

In [6]:
data_thedu = pd.DataFrame(json_data_thedu['data'])

In [7]:
data_thedu.head()

Unnamed: 0,aliases,location,member_level,name,nid,rank,rank_order,record_type,scores_citations,scores_citations_rank,scores_industry_income,scores_industry_income_rank,scores_international_outlook,scores_international_outlook_rank,scores_overall,scores_overall_rank,scores_research,scores_research_rank,scores_teaching,scores_teaching_rank,stats_female_male_ratio,stats_number_students,stats_pc_intl_students,stats_student_staff_ratio,subjects_offered,url
0,University of Oxford,United Kingdom,0,University of Oxford,468,1,10,master_account,99.1,15,63.7,169,95.0,24,94.3,10,99.5,1,86.7,5,46 : 54,20409,38%,11.2,"Archaeology,Art, Performing Arts & Design,Biol...",/world-university-rankings/university-oxford
1,University of Cambridge,United Kingdom,0,University of Cambridge,470,2,20,master_account,97.5,29,51.5,260,93.0,35,93.2,20,97.8,3,87.8,3,45 : 55,18389,35%,10.9,"Archaeology,Architecture,Art, Performing Arts ...",/world-university-rankings/university-cambridge
2,California Institute of Technology caltech,United States,0,California Institute of Technology,128779,=3,30,private,99.5,10,92.6,51,59.7,322,93.0,30,97.5,4,90.3,1,31 : 69,2209,27%,6.5,"Architecture,Biological Sciences,Business & Ma...",/world-university-rankings/california-institut...
3,Stanford University,United States,11,Stanford University,467,=3,40,private,99.9,4,60.5,189,77.6,162,93.0,40,96.7,5,89.1,2,42 : 58,15845,22%,7.5,"Archaeology,Architecture,Art, Performing Arts ...",/world-university-rankings/stanford-university
4,Massachusetts Institute of Technology,United States,0,Massachusetts Institute of Technology,471,5,50,private,100.0,1,88.4,63,87.6,81,92.5,50,91.9,9,87.3,4,37 : 63,11177,34%,8.7,"Architecture,Art, Performing Arts & Design,Bio...",/world-university-rankings/massachusetts-insti...


In [8]:
data_thedu.columns

Index(['aliases', 'location', 'member_level', 'name', 'nid', 'rank',
       'rank_order', 'record_type', 'scores_citations',
       'scores_citations_rank', 'scores_industry_income',
       'scores_industry_income_rank', 'scores_international_outlook',
       'scores_international_outlook_rank', 'scores_overall',
       'scores_overall_rank', 'scores_research', 'scores_research_rank',
       'scores_teaching', 'scores_teaching_rank', 'stats_female_male_ratio',
       'stats_number_students', 'stats_pc_intl_students',
       'stats_student_staff_ratio', 'subjects_offered', 'url'],
      dtype='object')

#### Restructuring the data

As argued above, we will keep in the DataFrame only the columns that are relevant to our analysis:
- name
- rank
- location
- stats_number_students
- stats_pc_intl_students
- stats_student_staff_ratio

In [9]:
columns_to_drop = data_thedu.columns.difference(['name', 'rank','location', 'stats_number_students', 'stats_pc_intl_students', 'stats_student_staff_ratio'])
data_thedu = data_thedu.drop(columns_to_drop, axis=1)[0:200]

In [10]:
data_thedu.stats_pc_intl_students.head()

0    38%
1    35%
2    27%
3    22%
4    34%
Name: stats_pc_intl_students, dtype: object

The column stats_pc_intl_students, which holds the percentage of international students, is stored as strings, a format that does not allow the arithmetic needed to infer the number of international students.

We will create a function that parses the string values of this column and transforms them into values in [0, 1].

In [11]:
def percentage_to_ratio(value):
    # we slice the value up until the last character
    # then parse it as an int and divide it by 100 to obtain a value between 0 and 1
    return int(value[:-1])/100
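
If some entries were blank or not strings, the function above would raise. A more tolerant variant might look like the sketch below; the name `percentage_to_ratio_safe` and the choice of returning `None` are assumptions made for illustration.

```python
def percentage_to_ratio_safe(value):
    """Convert a string such as '38%' to 0.38; return None if it cannot be parsed."""
    if not isinstance(value, str):
        return None
    text = value.strip().rstrip('%').strip()
    try:
        return float(text) / 100
    except ValueError:
        return None

ratio_ok = percentage_to_ratio_safe('38%')
ratio_blank = percentage_to_ratio_safe('')
ratio_nan = percentage_to_ratio_safe(float('nan'))
```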

In [12]:
data_thedu['stats_pc_intl_students'] = data_thedu['stats_pc_intl_students'].apply(percentage_to_ratio)
data_thedu.rename(columns={'stats_pc_intl_students':'stats_intl_students_ratio'}, inplace=True)

In [13]:
data_thedu.head()

| | location | name | rank | stats_number_students | stats_intl_students_ratio | stats_student_staff_ratio |
|---|---|---|---|---|---|---|
| 0 | United Kingdom | University of Oxford | 1 | 20409 | 0.38 | 11.2 |
| 1 | United Kingdom | University of Cambridge | 2 | 18389 | 0.35 | 10.9 |
| 2 | United States | California Institute of Technology | =3 | 2209 | 0.27 | 6.5 |
| 3 | United States | Stanford University | =3 | 15845 | 0.22 | 7.5 |
| 4 | United States | Massachusetts Institute of Technology | 5 | 11177 | 0.34 | 8.7 |


### Inferring data

We can now proceed to infer the number of international students and the number of faculty members.

The number of international students can be calculated as the product of the number of students and the international students ratio.

The number of faculty members can be calculated as the number of students divided by the student-to-staff ratio.
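
As a worked example, here are the two formulas applied to the University of Oxford entry shown earlier (20,409 students, 38% international students, 11.2 students per staff member); the variable names are chosen for illustration.

```python
number_students = 20409
intl_students_ratio = 0.38
student_staff_ratio = 11.2

# international students = students * international ratio
intl_students = number_students * intl_students_ratio

# faculty members = students / (students per staff member)
faculty_members = number_students / student_staff_ratio
```

This gives about 7755.42 international students and about 1822.23 faculty members.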

#### Data types

In [16]:
data_thedu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
location                     200 non-null object
name                         200 non-null object
rank                         200 non-null object
stats_number_students        200 non-null object
stats_intl_students_ratio    200 non-null float64
stats_student_staff_ratio    200 non-null object
dtypes: float64(1), object(5)
memory usage: 9.5+ KB


As we can see, all the stats except the transformed international students ratio have the data type object.

To be able to compute the new columns, we have to parse stats_number_students to int and stats_student_staff_ratio to float.

In [17]:
data_thedu['stats_number_students'] = data_thedu.stats_number_students.apply(lambda x: int(x.replace(',','')))
data_thedu['stats_student_staff_ratio'] = data_thedu.stats_student_staff_ratio.apply(lambda x: float(x))
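
The same conversion could also be written with vectorized string methods and `pd.to_numeric`, which some readers may find easier to scan. The sketch below runs on a two-row toy frame (`raw`) built for illustration.

```python
import pandas as pd

# Toy frame with the same string formats as the scraped columns.
raw = pd.DataFrame({
    'stats_number_students': ['20,409', '2,209'],
    'stats_student_staff_ratio': ['11.2', '6.5'],
})

# Remove the thousands separator, then let pandas pick the numeric dtype.
students = pd.to_numeric(
    raw['stats_number_students'].str.replace(',', '', regex=False))
staff_ratio = pd.to_numeric(raw['stats_student_staff_ratio'])
```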

We now compute the new columns as described above

In [19]:
data_thedu['number_international_students_inferred'] = data_thedu.stats_number_students * data_thedu.stats_intl_students_ratio
data_thedu['number_faculty_members_inferred'] = data_thedu.stats_number_students / data_thedu.stats_student_staff_ratio

Since the ratio has only two decimals of precision, the resulting values in the inferred columns are not all integers, but we choose to leave them as they are in order to avoid introducing additional rounding error.

One thing left to do is to transform the student per staff ratio into staff per student ratio as per assignment.

In [20]:
data_thedu['stats_student_staff_ratio'] = 1 / data_thedu.stats_student_staff_ratio
data_thedu.rename(columns={'stats_student_staff_ratio':'stats_staff_student_ratio'}, inplace=True)

In [21]:
data_thedu.head()

| | location | name | rank | stats_number_students | stats_intl_students_ratio | stats_staff_student_ratio | number_international_students_inferred | number_faculty_members_inferred |
|---|---|---|---|---|---|---|---|---|
| 0 | United Kingdom | University of Oxford | 1 | 20409 | 0.38 | 0.089286 | 7755.42 | 1822.232143 |
| 1 | United Kingdom | University of Cambridge | 2 | 18389 | 0.35 | 0.091743 | 6436.15 | 1687.06422 |
| 2 | United States | California Institute of Technology | =3 | 2209 | 0.27 | 0.153846 | 596.43 | 339.846154 |
| 3 | United States | Stanford University | =3 | 15845 | 0.22 | 0.133333 | 3485.9 | 2112.666667 |
| 4 | United States | Massachusetts Institute of Technology | 5 | 11177 | 0.34 | 0.114943 | 3800.18 | 1284.712644 |


We have prepared the data for our analysis! The results are obtained by sorting the `DataFrame` by the appropriate criterion.

### Best university: (a) ratio between faculty members and students

In [22]:
thedu_fac_stud_df = data_thedu.sort_values('stats_staff_student_ratio', ascending=False)

#### Top 5 by ratio of faculty members and students

In [23]:
thedu_fac_stud_df.head()

| | location | name | rank | stats_number_students | stats_intl_students_ratio | stats_staff_student_ratio | number_international_students_inferred | number_faculty_members_inferred |
|---|---|---|---|---|---|---|---|---|
| 105 | United States | Vanderbilt University | =105 | 12011 | 0.13 | 0.30303 | 1561.43 | 3639.69697 |
| 109 | Denmark | University of Copenhagen | =109 | 30395 | 0.14 | 0.243902 | 4255.3 | 7413.414634 |
| 153 | United States | University of Rochester | =153 | 9636 | 0.29 | 0.232558 | 2794.44 | 2240.930233 |
| 11 | United States | Yale University | 12 | 12155 | 0.21 | 0.232558 | 2552.55 | 2826.744186 |
| 12 | United States | Johns Hopkins University | 13 | 15498 | 0.24 | 0.232558 | 3719.52 | 3604.186047 |


### Best university: (b) ratio of international students

In [24]:
thedu_intl = data_thedu.sort_values('stats_intl_students_ratio', ascending=False)

#### Top 5 by ratio of international students

In [25]:
thedu_intl.head()

| | location | name | rank | stats_number_students | stats_intl_students_ratio | stats_staff_student_ratio | number_international_students_inferred | number_faculty_members_inferred |
|---|---|---|---|---|---|---|---|---|
| 24 | United Kingdom | London School of Economics and Political Science | =25 | 10065 | 0.71 | 0.081967 | 7146.15 | 825.0 |
| 178 | Luxembourg | University of Luxembourg | =179 | 4969 | 0.57 | 0.068493 | 2832.33 | 340.342466 |
| 37 | Switzerland | École Polytechnique Fédérale de Lausanne | =38 | 9928 | 0.55 | 0.089286 | 5460.4 | 886.428571 |
| 7 | United Kingdom | Imperial College London | 8 | 15857 | 0.55 | 0.087719 | 8721.35 | 1390.964912 |
| 102 | Netherlands | Maastricht University | 103 | 16727 | 0.5 | 0.055556 | 8363.5 | 929.277778 |
