# 02 - Data from the Web

## Deadline
Wednesday October 25, 2017 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated, so that your colleagues do not generate unnecessary Web traffic during the peer review.
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code.

## Background
In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need!
You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this [brief tutorial](https://www.youtube.com/watch?v=jBjXVrS8nXs&list=PLM-7VG-sgbtD8qBnGeQM5nvlpqB_ktaLZ&autoplay=1) to quickly learn how to use it.

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some of this information is not available in the main list; you have to find it on the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking both rankings into consideration? Explain your approach.
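
The ratios in question 1 and their aggregation by country could be computed along the following lines. This is only a minimal sketch on made-up numbers: the column names (`faculty_total`, `students_total`, `students_international`) are hypothetical and should be replaced with whatever your scraped DataFrame actually contains.

```python
import pandas as pd

# Toy rows with invented numbers; all column names are hypothetical.
universities = pd.DataFrame(
    [
        {"name": "Univ A", "country": "X", "region": "R1",
         "faculty_total": 100, "students_total": 1000, "students_international": 250},
        {"name": "Univ B", "country": "X", "region": "R1",
         "faculty_total": 300, "students_total": 2000, "students_international": 200},
        {"name": "Univ C", "country": "Y", "region": "R2",
         "faculty_total": 50, "students_total": 400, "students_international": 200},
    ]
)

# (a) and (b): per-university ratios
universities["faculty_student_ratio"] = (
    universities["faculty_total"] / universities["students_total"]
)
universities["international_student_ratio"] = (
    universities["students_international"] / universities["students_total"]
)

# A stable sort keeps the original (webpage) order among tied rows
best_by_faculty_ratio = universities.sort_values(
    "faculty_student_ratio", ascending=False, kind="mergesort"
)

# (c): sum the raw counts per country first, then divide, so that large
# universities weigh more than small ones in the country-level ratio
count_columns = ["faculty_total", "students_total", "students_international"]
by_country = universities.groupby("country")[count_columns].sum()
by_country["faculty_student_ratio"] = (
    by_country["faculty_total"] / by_country["students_total"]
)
by_country["international_student_ratio"] = (
    by_country["students_international"] / by_country["students_total"]
)
```

Grouping by `"region"` instead of `"country"` gives (d). Averaging the per-university ratios is a different, equally defensible aggregation; whichever you pick, state it in your description. A bar chart can then be drawn with `by_country["faculty_student_ratio"].plot.bar()`.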

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of a tie, use the order defined in the webpage.
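
For the name matching in question 3, one possible strategy is to normalize the names and then fall back on fuzzy matching. The sketch below uses only the standard library; `normalize_name`, `match_names`, and the `cutoff` value are illustrative choices, not a prescribed solution.

```python
import difflib
import re
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents, drop parenthesised parts and punctuation."""
    without_accents = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    without_parentheses = re.sub(r"\([^)]*\)", " ", without_accents)
    letters_only = re.sub(r"[^a-z0-9 ]", " ", without_parentheses.lower())
    return " ".join(letters_only.split())


def match_names(left_names, right_names, cutoff=0.85):
    """Map each left name to its closest right name, or None below the cutoff."""
    normalized_right = {normalize_name(name): name for name in right_names}
    matches = {}
    for name in left_names:
        candidates = difflib.get_close_matches(
            normalize_name(name), list(normalized_right), n=1, cutoff=cutoff
        )
        matches[name] = normalized_right[candidates[0]] if candidates else None
    return matches
```

Dropping parenthesised parts helps with names such as "Massachusetts Institute of Technology (MIT)", but it fails for "UCL (University College London)", where the full name is the part in parentheses. Names left unmatched (value `None`) are best reviewed by hand and fixed with a small manual mapping.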


In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup

In [2]:
r = requests.get('https://www.topuniversities.com/university-rankings/world-university-rankings/2018')

In [3]:
soup = BeautifulSoup(r.text, 'html.parser')

In [4]:
soup.title

<title>QS World University Rankings 2018 | Top Universities</title>

In [5]:
soup.text

'\n\n\n\n(window.NREUM||(NREUM={})).loader_config={xpid:"UwUCVVVTGwIAV1VXBQkP"};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,n,e){r(e.stack)}),s.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(s,function(t,n){return t}).join(", ")))},{}],2:[function(t,n,e){function r(t,n,e,r,o){try{d?d-=1:i("err",[o||new UncaughtException(t,n,e)])}catch(s){try{i("ierr

In [6]:
import sys
import requests
from bs4 import BeautifulSoup
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication

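class Render(QWebPage):
    """Load HTML in a headless WebKit page so that its JavaScript runs."""

    def __init__(self, html):
        self.html = None
        # A QApplication must exist before any QWebPage is created
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().setHtml(html)
        # Block until _loadFinished stops the event loop
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the DOM as it stands once loading has finished, then leave the event loop
        self.html = self.mainFrame().toHtml()
        self.app.quit()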

In [7]:
url = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'
source_html = requests.get(url).text
rendered_html = Render(source_html).html
soup = BeautifulSoup(rendered_html, 'html.parser')

In [12]:
# Display the rendered document
soup

<!DOCTYPE html>
<html class="async-hide js" dir="ltr" version="XHTML+RDFa 1.0" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:article="http://ogp.me/ns/article#" xmlns:book="http://ogp.me/ns/book#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:product="http://ogp.me/ns/product#" xmlns:profile="http://ogp.me/ns/profile#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:video="http://ogp.me/ns/video#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"><head profile="http://www.w3.org/1999/xhtml/vocab">
<meta content="unsafe-url" name="referrer">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"><script src="https://js-agent.newrelic.com/nr-1059.mi

In [15]:
request_get = requests.get(url)

In [21]:
request_get.

True

In [24]:
request_ranking_text_file = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt')

In [28]:
request_ranking_text_file.text

'{"columns":[{"data":"region","title":"REGION","visible":false},{"data":"location","title":"LOCATION","visible":false},{"data":"overall_rank","title":"\\u003Cdiv class=\\u0022td-wrap\\u0022\\u003E\\u003Cdiv class=\\u0022labl\\u0022\\u003E\\u003Cdiv\\u003E#\\u003Cspan\\u003E RANK\\u003C\\/span\\u003E\\u003C\\/div\\u003E\\u003C\\/div\\u003E\\u003Cdiv class=\\u0022sorter\\u0022\\u003E\\u003Cdiv class=\\u0022sel-rep\\u0022\\u003E\\u003C\\/div\\u003E\\u003C\\/div\\u003E\\u003C\\/div\\u003E","visible":false,"className":"rank","searchable":false},{"data":"overall_rank_dis","title":"\\u003Cdiv class=\\u0022td-wrap\\u0022\\u003E\\u003Cdiv class=\\u0022labl\\u0022\\u003E\\u003Cdiv\\u003E#\\u003Cspan\\u003E RANK\\u003C\\/span\\u003E\\u003C\\/div\\u003E\\u003C\\/div\\u003E\\u003Cdiv class=\\u0022sorter\\u0022\\u003E\\u003Cdiv class=\\u0022sel-rep\\u0022\\u003E\\u003C\\/div\\u003E\\u003C\\/div\\u003E\\u003C\\/div\\u003E","className":"rank","searchable":false},{"data":"uni","title":"\\u003Cdiv class

In [41]:
soup = BeautifulSoup(request_ranking_text_file.text, 'html.parser')

In [39]:
for i in soup.children:
    print(i)

{"columns":[{"data":"region","title":"REGION","visible":false},{"data":"location","title":"LOCATION","visible":false},{"data":"overall_rank","title":"\u003Cdiv class=\u0022td-wrap\u0022\u003E\u003Cdiv class=\u0022labl\u0022\u003E\u003Cdiv\u003E#\u003Cspan\u003E RANK\u003C\/span\u003E\u003C\/div\u003E\u003C\/div\u003E\u003Cdiv class=\u0022sorter\u0022\u003E\u003Cdiv class=\u0022sel-rep\u0022\u003E\u003C\/div\u003E\u003C\/div\u003E\u003C\/div\u003E","visible":false,"className":"rank","searchable":false},{"data":"overall_rank_dis","title":"\u003Cdiv class=\u0022td-wrap\u0022\u003E\u003Cdiv class=\u0022labl\u0022\u003E\u003Cdiv\u003E#\u003Cspan\u003E RANK\u003C\/span\u003E\u003C\/div\u003E\u003C\/div\u003E\u003Cdiv class=\u0022sorter\u0022\u003E\u003Cdiv class=\u0022sel-rep\u0022\u003E\u003C\/div\u003E\u003C\/div\u003E\u003C\/div\u003E","className":"rank","searchable":false},{"data":"uni","title":"\u003Cdiv class=\u0022td-wrap\u0022\u003E\u003Cdiv class=\u0022labl\u0022\u003E\u003Cdiv\u003

In [44]:
import json
json_ranking = json.loads(request_ranking_text_file.text)

In [53]:
ranking_data = json_ranking['data']
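
Each entry of `ranking_data` can be flattened into a plain record that `pandas.DataFrame` accepts directly. The sketch below is hedged: only the `uni` key is confirmed by the cells in this notebook, while `region`, `location`, and `overall_rank_dis` are assumed from the column metadata shown above. The sample payload uses placeholder values, and `LinkExtractor` and `row_to_record` are hypothetical helper names.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.topuniversities.com/"

# Illustrative payload mimicking the shape of the ranking file; values are placeholders
sample_payload = json.dumps({
    "data": [
        {
            "region": "Example region",
            "location": "Example country",
            "overall_rank_dis": "2",
            "uni": '<div class="td-wrap"><div class="td-wrap-in">'
                   '<a href="/universities/stanford-university">Stanford University</a>'
                   "</div></div>",
        }
    ]
})


class LinkExtractor(HTMLParser):
    """Collect the href and visible text of the first <a> tag in a snippet."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []
        self._inside_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")
            self._inside_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_link = False

    def handle_data(self, data):
        if self._inside_link:
            self.text_parts.append(data)


def row_to_record(row):
    """Turn one ranking row into a flat dict with name, rank, and details URL."""
    extractor = LinkExtractor()
    extractor.feed(row["uni"])
    extractor.close()
    return {
        "name": "".join(extractor.text_parts).strip(),
        "rank": row.get("overall_rank_dis"),  # assumed key, may be absent
        "country": row.get("location"),       # assumed key, may be absent
        "region": row.get("region"),          # assumed key, may be absent
        "details_url": urljoin(BASE_URL, extractor.href or ""),
    }


records = [row_to_record(row) for row in json.loads(sample_payload)["data"]]
```

With the real data, `pd.DataFrame([row_to_record(row) for row in ranking_data[:200]])` would give one row per university for the top 200, assuming the file lists universities in ranking order.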

In [66]:
for elem in ranking_data:
    print(elem['uni'])

<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/massachusetts-institute-technology-mit">Massachusetts Institute of Technology (MIT) </a></div></div>
<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/stanford-university">Stanford University</a></div></div>
<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/harvard-university">Harvard University</a></div></div>
<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/california-institute-technology-caltech">California Institute of Technology (Caltech)</a></div></div>
<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/university-cambridge">University of Cambridge</a></div></div>
<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/university-oxford">University of Oxford</a></div></div>
<div class="td-wrap"><div class="td-wrap-in"><a href="/universities/ucl-university-college-london">UCL (University College London)</a></div></div>
<div class="

In [94]:
from html.parser import HTMLParser

# Subclass HTMLParser and record the attributes of every <a> tag encountered
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.array_hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [('href', '/universities/...')]
        if tag == 'a':
            self.array_hrefs.append(attrs)

In [95]:
# Feed each university's HTML snippet to the parser to collect its link attributes
parser = MyHTMLParser()
for university in ranking_data:
    parser.feed(university['uni'])

In [105]:
from urllib.parse import urljoin
base_url = 'https://www.topuniversities.com/'
for link_attrs in parser.array_hrefs:
    # Each entry is a list of (name, value) pairs; take the href of the link
    url = urljoin(base_url, dict(link_attrs)['href'])
    print(url)

https://www.topuniversities.com/universities/massachusetts-institute-technology-mit
https://www.topuniversities.com/universities/stanford-university
https://www.topuniversities.com/universities/harvard-university
https://www.topuniversities.com/universities/california-institute-technology-caltech
https://www.topuniversities.com/universities/university-cambridge
https://www.topuniversities.com/universities/university-oxford
https://www.topuniversities.com/universities/ucl-university-college-london
https://www.topuniversities.com/universities/imperial-college-london
https://www.topuniversities.com/universities/university-chicago
https://www.topuniversities.com/universities/eth-zurich-swiss-federal-institute-technology
https://www.topuniversities.com/universities/nanyang-technological-university-singapore-ntu
https://www.topuniversities.com/universities/ecole-polytechnique-f%C3%A9d%C3%A9rale-de-lausanne-epfl
https://www.topuniversities.com/universities/princeton-university
https://www.top
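
The next step is to visit each details URL and read the faculty and student counts. A rough sketch of that loop follows. It deliberately leaves the page-specific extraction to a caller-supplied function, since the markup of the details page has to be inspected first; `parse_count`, `collect_details`, `fetch_page`, and `extract_fields` are all hypothetical names.

```python
import re
import time


def parse_count(text):
    """Turn a displayed number such as '1,234' into an int; return None if absent."""
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None


def collect_details(urls, fetch_page, extract_fields, delay_seconds=0.0):
    """Visit each details URL once and gather the extracted fields per URL."""
    details = {}
    for url in urls:
        details[url] = extract_fields(fetch_page(url))
        time.sleep(delay_seconds)  # pause between requests to limit server load
    return details
```

In the notebook, `fetch_page` could be `lambda url: requests.get(url).text`, and `extract_fields` a function you write after inspecting a details page in the browser. Storing the result in a variable and reusing it avoids fetching the 200 pages again every time a later cell is re-run.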
