# 02 - Data from the Web

## Deadline
Wednesday October 25, 2017 at 11:59PM

## Important Notes
* Make sure you push on GitHub your Notebook with all the cells already evaluated (i.e., you don't want your colleagues to generate unnecessary Web traffic during the peer review)
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code.

## Background
In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need!
You are not allowed to download manually the entire ranking -- rather you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this [brief tutorial](https://www.youtube.com/watch?v=jBjXVrS8nXs&list=PLM-7VG-sgbtD8qBnGeQM5nvlpqB_ktaLZ&autoplay=1) to understand quickly how to use it.

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking into consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of tie, use the order defined in the webpage.


# Solution
First, let's request the ranking page at https://www.topuniversities.com/university-rankings/world-university-rankings/2018 to see what we get.

In [39]:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.topuniversities.com/university-rankings/world-university-rankings/2018')
soup = BeautifulSoup(r.text, 'html.parser')
soup

<!DOCTYPE html>

<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:article="http://ogp.me/ns/article#" xmlns:book="http://ogp.me/ns/book#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:product="http://ogp.me/ns/product#" xmlns:profile="http://ogp.me/ns/profile#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:video="http://ogp.me/ns/video#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<head profile="http://www.w3.org/1999/xhtml/vocab">
<meta content="unsafe-url" name="referrer"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"UwUCVVVTGwIAV1VXBQkP"}

If we search (Ctrl+F) for "Imperial College London", for example, we find nothing in the BeautifulSoup output, even though the string is clearly visible on the web page. This is because the ranking is not part of the initial HTML response: it is loaded afterwards from a separate file. Using the Interceptor extension for Google Chrome, we found that this file is at https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt. So this is the resource we will scrape instead of the ranking page:

In [40]:
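from html.parser import HTMLParser
from urllib.parse import urljoin

base_url = 'https://www.topuniversities.com/'

# Each 'uni' field is an HTML link: collect its target URL and its text.
class QSHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.array_hrefs = []
        self.last_tag_url = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Look the href up by name instead of relying on the attribute order,
            # and build the absolute URL with urljoin rather than os.path.join.
            self.last_tag_url = urljoin(base_url, dict(attrs).get('href') or '')

    def handle_data(self, data):
        # Skip text seen before any link, as well as whitespace-only text.
        if self.last_tag_url is not None and data.strip():
            self.array_hrefs.append({"url": self.last_tag_url, "name": data})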

In [41]:
import json

request_ranking_text_file = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt')
json_ranking = json.loads(request_ranking_text_file.text)
# Keep the 200 top-ranked entries (the file lists them in ranking order).
ranking_data = json_ranking['data'][:200]

# Name the instance qs_parser so it cannot be confused with the html.parser module.
qs_parser = QSHTMLParser()
for university in ranking_data:
    qs_parser.feed(university['uni'])

basic_data = qs_parser.array_hrefs
# Both lists are in the same order, so region and country can be copied pairwise.
for curr_uni, university in zip(basic_data, ranking_data):
    curr_uni['region'] = university['region']
    curr_uni['location'] = university['location']
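
To see what the parser extracts from a single `uni` entry without touching the network, here is a minimal, self-contained sketch. The class name `AnchorCollector` and the sample entry are hypothetical: the exact markup of the real `uni` field is assumed to be a plain link with a site-relative `href`, which is what the URLs obtained above suggest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.topuniversities.com/"


class AnchorCollector(HTMLParser):
    """Collect the absolute URL and the text of every <a> element fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; look href up by name.
            self._current_href = urljoin(BASE_URL, dict(attrs).get("href") or "")

    def handle_data(self, data):
        # Only keep non-empty text that appears inside a link.
        if self._current_href is not None and data.strip():
            self.links.append({"url": self._current_href, "name": data.strip()})

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


# Hypothetical entry: the markup of the real 'uni' field is an assumption.
sample_entry = {"uni": '<a href="/universities/stanford-university">Stanford University</a>'}

collector = AnchorCollector()
collector.feed(sample_entry["uni"])
links = collector.links
print(links)
```

Resetting the current link in `handle_endtag` means that text following a link is not attributed to it, which keeps one record per university.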

Now we have the URLs of the first 200 universities:

In [42]:
basic_data

[{'location': 'United States',
  'name': 'Massachusetts Institute of Technology (MIT) ',
  'region': 'North America',
  'url': 'https://www.topuniversities.com/universities/massachusetts-institute-technology-mit'},
 {'location': 'United States',
  'name': 'Stanford University',
  'region': 'North America',
  'url': 'https://www.topuniversities.com/universities/stanford-university'},
 {'location': 'United States',
  'name': 'Harvard University',
  'region': 'North America',
  'url': 'https://www.topuniversities.com/universities/harvard-university'},
 {'location': 'United States',
  'name': 'California Institute of Technology (Caltech)',
  'region': 'North America',
  'url': 'https://www.topuniversities.com/universities/california-institute-technology-caltech'},
 {'location': 'United Kingdom',
  'name': 'University of Cambridge',
  'region': 'Europe',
  'url': 'https://www.topuniversities.com/universities/university-cambridge'},
 {'location': 'United Kingdom',
  'name': 'University of Ox

Here we have the URLs of the details pages of the first 200 universities. Let's take a closer look at the EPFL details page:

In [43]:
import re

url_epfl = basic_data[11]['url']
request_epfl = requests.get(url_epfl)
soup_epfl = BeautifulSoup(request_epfl.text, 'html.parser')
# The four figures sit in <div class="number"> elements inside the academic data block.
div_details = soup_epfl.find_all('div', {"class": "view-academic-data-profile"})[0]
test_det = div_details.find_all('div', {"class": "number"})
# Order in which the figures appear on the page.
student_values = ["staff_total", "staff_international", "students_total", "students_international"]
number_students = {}
for i in range(0, 4):
    # Remove the thousands separators before converting to int.
    number = int(re.sub('[,]', '', test_det[i].get_text()))
    number_students[student_values[i]] = number
number_students

{'staff_international': 1300,
 'staff_total': 1695,
 'students_international': 5896,
 'students_total': 10343}
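
The conversion step above relies on removing the commas and on `int()` tolerating surrounding whitespace. As a small sketch of the same idea, the hypothetical helper `parse_count` below keeps only the digits and returns `None` when there are none, so a missing figure does not raise an exception. The sample strings are assumptions about how the numbers are displayed on the page.

```python
import re


def parse_count(text):
    """Convert a displayed count such as '\n2,982 ' to an int, or None if it has no digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


print(parse_count("\n2,982 "))
print(parse_count("n/a"))
```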

That's it. The code is perhaps a little messy, but with some browser inspection we can figure out where the values we need are located on the EPFL page.
Now let us generalize this to all 200 universities:

In [46]:
import re

# Order in which the figures appear on each details page.
student_values = ["staff_total", "staff_international", "students_total", "students_international"]
number_students_per_fac = {}
for i, data in enumerate(basic_data):
    url = data['url']
    print("Processing ", i, ": ", url)
    request_uni = requests.get(url)
    soup_uni = BeautifulSoup(request_uni.text, 'html.parser')
    div_details = soup_uni.find_all('div', {"class": "view-academic-data-profile"})[0]
    test_det = div_details.find_all('div', {"class": "number"})
    number_students = {}
    print(test_det)
    # Only fill in the figures when all four are present; otherwise they stay missing.
    if len(test_det) == 4:
        for j in range(0, 4):
            number = int(re.sub('[,]', '', test_det[j].get_text()))
            number_students[student_values[j]] = number
            basic_data[i][student_values[j]] = number
    number_students_per_fac[url] = number_students

Processing  0 :  https://www.topuniversities.com/universities/massachusetts-institute-technology-mit
[<div class="number">
2,982 </div>, <div class="number">
1,679 </div>, <div class="number">
11,067 </div>, <div class="number">
3,717 </div>]
Processing  1 :  https://www.topuniversities.com/universities/stanford-university
[<div class="number">
4,285 </div>, <div class="number">
 2,042 </div>, <div class="number">
15,878 </div>, <div class="number">
3,611 </div>]
Processing  2 :  https://www.topuniversities.com/universities/harvard-university
[<div class="number">
4,350 </div>, <div class="number">
1,311 </div>, <div class="number">
22,429 </div>, <div class="number">
5,266 </div>]
Processing  3 :  https://www.topuniversities.com/universities/california-institute-technology-caltech
[<div class="number">
953 </div>, <div class="number">
350 </div>, <div class="number">
2,255 </div>, <div class="number">
647 </div>]
Processing  4 :  https://www.topuniversities.com/universities/university

In [47]:
basic_data

[{'location': 'United States',
  'name': 'Massachusetts Institute of Technology (MIT) ',
  'region': 'North America',
  'staff_international': 1679,
  'staff_total': 2982,
  'students_international': 3717,
  'students_total': 11067,
  'url': 'https://www.topuniversities.com/universities/massachusetts-institute-technology-mit'},
 {'location': 'United States',
  'name': 'Stanford University',
  'region': 'North America',
  'staff_international': 2042,
  'staff_total': 4285,
  'students_international': 3611,
  'students_total': 15878,
  'url': 'https://www.topuniversities.com/universities/stanford-university'},
 {'location': 'United States',
  'name': 'Harvard University',
  'region': 'North America',
  'staff_international': 1311,
  'staff_total': 4350,
  'students_international': 5266,
  'students_total': 22429,
  'url': 'https://www.topuniversities.com/universities/harvard-university'},
 {'location': 'United States',
  'name': 'California Institute of Technology (Caltech)',
  'region':

That's it, we now have the staff and student numbers. We can convince ourselves that they are right by picking a university at random and comparing with the corresponding web page (the more universities we spot-check, the less likely it is that an error goes unnoticed).
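
The spot-check argument can be made concrete. If some of the rows were wrong, the chance that a random sample of checked rows misses all of them is a ratio of binomial coefficients, and it shrinks roughly geometrically with the number of checks. The sketch below is only an illustration: the function name and the figure of 20 wrong rows out of 200 are hypothetical.

```python
from math import comb


def miss_probability(n_rows, n_wrong, n_checked):
    """Probability that n_checked rows drawn without replacement contain no wrong row."""
    return comb(n_rows - n_wrong, n_checked) / comb(n_rows, n_checked)


# Hypothetical scenario: 20 of the 200 rows were scraped incorrectly.
for n_checked in (1, 5, 10, 20):
    print(n_checked, miss_probability(200, 20, n_checked))
```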

Now we simply construct our dataframe:

In [56]:
import pandas as pd
qs_dataframe = pd.DataFrame.from_dict(basic_data)
qs_dataframe.head()

| | location | name | region | staff_international | staff_total | students_international | students_total | url |
|---|---|---|---|---|---|---|---|---|
| 0 | United States | Massachusetts Institute of Technology (MIT) | North America | 1679.0 | 2982.0 | 3717.0 | 11067.0 | https://www.topuniversities.com/universities/m... |
| 1 | United States | Stanford University | North America | 2042.0 | 4285.0 | 3611.0 | 15878.0 | https://www.topuniversities.com/universities/s... |
| 2 | United States | Harvard University | North America | 1311.0 | 4350.0 | 5266.0 | 22429.0 | https://www.topuniversities.com/universities/h... |
| 3 | United States | California Institute of Technology (Caltech) | North America | 350.0 | 953.0 | 647.0 | 2255.0 | https://www.topuniversities.com/universities/c... |
| 4 | United Kingdom | University of Cambridge | Europe | 2278.0 | 5490.0 | 6699.0 | 18770.0 | https://www.topuniversities.com/universities/u... |


That's it! There is just one thing we have to notice here: for two universities (NYU at index 51 and IISc at index 189) we do not have the number of students or staff members. Indeed, if we visit their web pages, we see that the information is not there.
For NYU we have nothing at all, so we leave it as it is (we will see later how to deal with the NaN values, depending on the question). For IISc, three of the four figures are shown on the web page, so we will enter them manually:

In [62]:
qs_dataframe[qs_dataframe.isnull().any(axis=1)]

| | location | name | region | staff_international | staff_total | students_international | students_total | url |
|---|---|---|---|---|---|---|---|---|
| 51 | United States | New York University (NYU) | North America | NaN | NaN | NaN | NaN | https://www.topuniversities.com/universities/n... |
| 189 | India | Indian Institute of Science (IISc) Bangalore | Asia | NaN | NaN | NaN | NaN | https://www.topuniversities.com/universities/i... |


In [70]:
iisc_uni = qs_dataframe.loc[189]

In [71]:
iisc_uni['staff_international'] = 100

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


location                                                              India
name                           Indian Institute of Science (IISc) Bangalore
region                                                                 Asia
staff_international                                                     NaN
staff_total                                                             NaN
students_international                                                  NaN
students_total                                                          NaN
url                       https://www.topuniversities.com/universities/i...
Name: 189, dtype: object
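
The warning above appears because `qs_dataframe.loc[189]` returns a separate row object, so assigning to it may not change the DataFrame itself. A minimal sketch of writing the value straight into the frame with a single `.loc[row, column]` assignment follows. The small frame is a hypothetical stand-in for `qs_dataframe`, and the value 100 is simply the one used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for qs_dataframe, keeping only what the example needs.
df = pd.DataFrame(
    {
        "name": ["New York University (NYU)", "Indian Institute of Science (IISc) Bangalore"],
        "staff_international": [float("nan"), float("nan")],
    },
    index=[51, 189],
)

# One indexing operation on the frame itself: row label 189, column 'staff_international'.
df.loc[189, "staff_international"] = 100

print(df)
```

Because the row and the column are selected in the same `.loc` call, no intermediate copy is involved and the value ends up in the DataFrame.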

## Times Ranking
Now let's try to request the page of the Times ranking:

In [19]:
r_times = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
soup_times = BeautifulSoup(r_times.text, 'html.parser')

In [20]:
soup_times.text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWorld University Rankings 2018 | Times Higher Education (THE)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\njQuery.extend(Drupal.settings, {"basePath":"\\/","pathPrefix":"","ajaxPageState":{"theme":"the_responsive","theme_token":"J0qLQKB4HsSZstfJfUmnvlUWkNCMxB029iiuEOVBR6w","js":{"sites\\/default\\/files\\/minify\\/jquery.once.1.2.min.js":1,"sites\\/default\\/files\\/minify\\/the_data_rankings.1.10.12.min.js":1,"sites\\/default\\/files\\/minify\\/jquery.cookie.67fb34f6a866c40d0570.min.js":1,"sites\\/default\\/files\\/minify\\/notification.min.js":1,"sites\\/default\\/files\\/minify\\/scripts.min.js":1,"sites\\/default\\/files\\/minify\\/the-geography-extras.min.js":1,"sites\\/default\\/files\\/minify\\/most_viewed_commented.min.js":1,"sites\\/default\\/files\\/minify\\/paywall.min.js":1,"sites\\/default\\/files\\/minify\\/the_dfp.min.js":1,"sites\\/default\\/files\\/minify\\/caption-filter.min.js":1,"sites\\/default\\/files\\/minify

Again, if we search (Ctrl+F) for "Oxford", we only find the occurrence in the introductory paragraph explaining the ranking, but not the one from the ranking itself. So we do the same thing we did for the QS ranking: we use Interceptor, and we find all the data at https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json

In [23]:
import json
r_times_ranking = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
# The response body is already JSON, so we decode it directly instead of passing it through BeautifulSoup.
data_times_ranking = json.loads(r_times_ranking.text)
data_times_50_first = data_times_ranking['data'][:50]

Now we have the data in our 'data_times_50_first' list (one dictionary per university). Let's convert it to a pandas DataFrame:

In [32]:
import pandas as pd
times_dataframe = pd.DataFrame.from_dict(data_times_50_first)

Let's take a look at our data frame:

In [34]:
times_dataframe.head()

| | aliases | location | member_level | name | nid | rank | rank_order | record_type | scores_citations | scores_citations_rank | ... | scores_research | scores_research_rank | scores_teaching | scores_teaching_rank | stats_female_male_ratio | stats_number_students | stats_pc_intl_students | stats_student_staff_ratio | subjects_offered | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | University of Oxford | United Kingdom | 0 | University of Oxford | 468 | 1 | 10 | master_account | 99.1 | 15 | ... | 99.5 | 1 | 86.7 | 5 | 46 : 54 | 20409 | 38% | 11.2 | Archaeology,Art, Performing Arts & Design,Biol... | /world-university-rankings/university-oxford |
| 1 | University of Cambridge | United Kingdom | 0 | University of Cambridge | 470 | 2 | 20 | master_account | 97.5 | 29 | ... | 97.8 | 3 | 87.8 | 3 | 45 : 55 | 18389 | 35% | 10.9 | Archaeology,Architecture,Art, Performing Arts ... | /world-university-rankings/university-cambridge |
| 2 | California Institute of Technology caltech | United States | 0 | California Institute of Technology | 128779 | =3 | 30 | private | 99.5 | 10 | ... | 97.5 | 4 | 90.3 | 1 | 31 : 69 | 2209 | 27% | 6.5 | Architecture,Biological Sciences,Business & Ma... | /world-university-rankings/california-institut... |
| 3 | Stanford University | United States | 11 | Stanford University | 467 | =3 | 40 | private | 99.9 | 4 | ... | 96.7 | 5 | 89.1 | 2 | 42 : 58 | 15845 | 22% | 7.5 | Archaeology,Architecture,Art, Performing Arts ... | /world-university-rankings/stanford-university |
| 4 | Massachusetts Institute of Technology | United States | 0 | Massachusetts Institute of Technology | 471 | 5 | 50 | private | 100.0 | 1 | ... | 91.9 | 9 | 87.3 | 4 | 37 : 63 | 11177 | 34% | 8.7 | Architecture,Art, Performing Arts & Design,Bio... | /world-university-rankings/massachusetts-insti... |


We see there are a lot of columns:

In [38]:
print("There are {} columns: {}".format(times_dataframe.shape[1], times_dataframe.columns))

There are 26 columns: Index(['aliases', 'location', 'member_level', 'name', 'nid', 'rank',
       'rank_order', 'record_type', 'scores_citations',
       'scores_citations_rank', 'scores_industry_income',
       'scores_industry_income_rank', 'scores_international_outlook',
       'scores_international_outlook_rank', 'scores_overall',
       'scores_overall_rank', 'scores_research', 'scores_research_rank',
       'scores_teaching', 'scores_teaching_rank', 'stats_female_male_ratio',
       'stats_number_students', 'stats_pc_intl_students',
       'stats_student_staff_ratio', 'subjects_offered', 'url'],
      dtype='object')
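
Several of these columns hold formatted strings rather than numbers, as the first rows of the data frame show: `rank` can be `=3` for a tie, `stats_pc_intl_students` looks like `38%`, and `stats_female_male_ratio` looks like `46 : 54`. Here is a hedged sketch of how such strings could be turned into numbers. The helper names are hypothetical, and the formats are assumed from the five rows displayed, not from the whole file.

```python
def parse_rank(text):
    """'=3' -> 3 and '5' -> 5 (the leading '=' marks a tie)."""
    return int(text.lstrip("="))


def parse_percentage(text):
    """'38%' -> 0.38"""
    return float(text.strip().rstrip("%")) / 100


def parse_female_share(text):
    """'46 : 54' -> 0.46, the female share of the student body."""
    female, male = (float(part) for part in text.split(":"))
    return female / (female + male)


# Values taken from the University of Oxford row shown earlier.
intl_share = parse_percentage("38%")
estimated_intl_students = round(20409 * intl_share)
print(parse_rank("=3"), intl_share, parse_female_share("46 : 54"), estimated_intl_students)
```

Multiplying the share by the number of students only gives an estimate, since the published percentage is rounded.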


We don't need all those columns. The ones we need are the location, the name, the rank and the stats at the end (we will also take the nid). We notice here that we don't have access to the region but only the location. We also notice that there is no information concerning the number of international 