# Scrape the Shanghai Ranking with Python

## About 
The Academic Ranking of World Universities (ARWU) was first published in June 2003 by the Center for World-Class Universities (CWCU), Graduate School of Education (formerly the Institute of Higher Education) of Shanghai Jiao Tong University, China, and updated on an annual basis. Since 2009 the Academic Ranking of World Universities (ARWU) has been published and copyrighted by ShanghaiRanking Consultancy. ShanghaiRanking Consultancy is a fully independent organization on higher education intelligence and not legally subordinated to any universities or government agencies.

ARWU uses six objective indicators to rank world universities, including the number of alumni and staff winning Nobel Prizes and Fields Medals, number of highly cited researchers selected by Clarivate Analytics, number of articles published in journals of Nature and Science, number of articles indexed in Science Citation Index - Expanded and Social Sciences Citation Index, and per capita performance of a university. More than 1800 universities are actually ranked by ARWU every year and the best 1000 are published.

## Goals 
In this post I will demonstrate how to get the latest ranking dataset from the official website of the Shanghai Ranking using Pandas and bs4 (BeautifulSoup).

In [1]:
# Import the libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

Let's use the `read_html` function of Pandas to scrape the main table available at __http://www.shanghairanking.com/arwu2019.html__

![shangai](shangai.png)

In [2]:
shangai_ranking = pd.read_html("http://www.shanghairanking.com/arwu2019.html")

In [3]:
type(shangai_ranking)

list

It returns a list with a single element, which is a DataFrame.
We can take a look at the data we just scraped.

In [5]:
shangai_ranking[0].head()

Unnamed: 0,World Rank,Institution*,By location All Argentina Australia Austria Belgium Brazil Bulgaria Canada Chile China China-Hong Kong China-Macau China-Taiwan Colombia Croatia Cyprus Czech Republic Denmark Egypt Estonia Finland France Germany Greece Hungary Iceland India Iran Ireland Israel Italy Japan Lebanon Lithuania Luxembourg Malaysia Mexico Netherlands New Zealand Nigeria Norway Oman Pakistan Poland Portugal Romania Russia Saudi Arabia Serbia Singapore Slovakia Slovenia South Africa South Korea Spain Sweden Switzerland Thailand Tunisia Turkey United Kingdom United Arab Emirates Uruguay United States Vietnam,National/Regional Rank,Total Score,Score on Alumni Award HiCi N&S PUB PCP
0,1,Harvard University,,1,100.0,100.0
1,2,Stanford University,,2,75.1,45.2
2,3,University of Cambridge,,1,72.3,80.7
3,4,Massachusetts Institute of Technology (MIT),,3,69.0,72.0
4,5,"University of California, Berkeley",,4,67.9,67.1


Most of the columns have been correctly scraped by the `pd.read_html()` function. However, the column indicating the countries is empty because the country names are not written as text on the website: only the flags are shown. Its header has also picked up the whole list of locations from the page. We need to come up with a strategy to scrape that missing column.  
First we will download the entire page, then use the `BeautifulSoup` class to parse the HTML. After parsing the HTML, we will look for all `<img>` tags and retrieve the string used to describe each flag. This string is the country name.  
As an illustration, this is how the site stores the titles of the flags.
![illust](illust.png)

Now that we have a clear idea of how the site stores the flag names, we can scrape that particular information.

In [6]:
flags = requests.get("http://www.shanghairanking.com/arwu2019.html")

In [7]:
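# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(flags.text, "html.parser")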

In [8]:
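# Keep 1000 <img> tags, one per ranked institution (the slice skips the first image on the page)
img_src = soup.find_all("img")[1:1001]
# Turn each tag into its HTML string
img_src = [str(img) for img in img_src]
# Keep the third "/"-separated piece of the tag and drop everything after the first "."
img_src = [img.split("/")[2].split(".")[0] for img in img_src]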

Let's display the first 10 elements of that list.

In [13]:
img_src[0:10]

['USA', 'USA', 'UK', 'USA', 'USA', 'USA', 'UK', 'USA', 'USA', 'USA']
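
The string splitting used above depends on the exact text of each tag. As a minimal sketch of an alternative, the snippet below reads the `src` attribute directly and keeps the file name without its extension. The sample HTML, the `image/flag/` folder and the `flag_names` helper are assumptions made for illustration, not the site's confirmed markup.

```python
from pathlib import PurePosixPath

from bs4 import BeautifulSoup

# Hypothetical markup: the tags on the real page may look different.
sample_html = """
<table>
  <tr><td><img src="image/flag/USA.png" title="USA"></td></tr>
  <tr><td><img src="image/flag/UK.png" title="UK"></td></tr>
  <tr><td><img src="image/logo.png"></td></tr>
</table>
"""


def flag_names(soup):
    """Return the file stem of every <img> whose src sits in a 'flag' folder."""
    names = []
    for img in soup.find_all("img"):
        path = PurePosixPath(img.get("src", ""))
        if path.parent.name == "flag":
            names.append(path.stem)
    return names


soup = BeautifulSoup(sample_html, "html.parser")
countries = flag_names(soup)
print(countries)  # ['USA', 'UK']
```

Filtering on the folder name, rather than slicing by position, means an extra logo or banner image would not shift the list.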

We can now turn the `img_src` list into a Pandas Series and use it to replace the empty column.

In [9]:
shangai_ranking[0].iloc[:, 2] = pd.Series(name = "country", data = img_src)

As we saw earlier, the name of the third column is not usable either, so we need to rename it.

In [10]:
shangai_ranking_ = shangai_ranking[0].rename(columns = {shangai_ranking[0].columns[2]: "country"})

In [11]:
shangai_ranking_.head()

Unnamed: 0,World Rank,Institution*,country,National/Regional Rank,Total Score,Score on Alumni Award HiCi N&S PUB PCP
0,1,Harvard University,USA,1,100.0,100.0
1,2,Stanford University,USA,2,75.1,45.2
2,3,University of Cambridge,UK,1,72.3,80.7
3,4,Massachusetts Institute of Technology (MIT),USA,3,69.0,72.0
4,5,"University of California, Berkeley",USA,4,67.9,67.1
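
The replacement only lines up if there is exactly one country name per row. The sketch below, on a three-row toy table, checks the lengths before renaming and filling the column. The `attach_countries` helper and its parameters are hypothetical names introduced here.

```python
import pandas as pd


def attach_countries(table, countries, position=2, name="country"):
    """Return a new table whose column at `position` is renamed and filled."""
    if len(countries) != len(table):
        raise ValueError(
            f"expected {len(table)} country names, got {len(countries)}"
        )
    # rename() returns a new DataFrame, so the input table is left untouched
    out = table.rename(columns={table.columns[position]: name})
    out[name] = list(countries)
    return out


# Toy data shaped like the first rows of the scraped table
toy = pd.DataFrame({
    "World Rank": ["1", "2", "3"],
    "Institution*": ["Harvard University", "Stanford University",
                     "University of Cambridge"],
    "By location": [None, None, None],
})

ranked = attach_countries(toy, ["USA", "USA", "UK"])
print(ranked)
```

If the page ever contained more or fewer flag images than rows, this would raise an error instead of silently misaligning countries and institutions.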


In [12]:
shangai_ranking_.describe(include= "object")

Unnamed: 0,World Rank,Institution*,country,National/Regional Rank
count,1000,1000,1000,1000
unique,87,1000,64,142
top,201-300,University College Dublin,USA,1
freq,100,1,206,61
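
To go one step beyond `describe`, a count per country can be obtained with `value_counts`. The sketch below uses only the five rows displayed earlier rather than the full table, and the `toy` and `per_country` names are illustrative.

```python
import pandas as pd

# The five institutions shown in the head() output above
toy = pd.DataFrame({
    "Institution*": [
        "Harvard University",
        "Stanford University",
        "University of Cambridge",
        "Massachusetts Institute of Technology (MIT)",
        "University of California, Berkeley",
    ],
    "country": ["USA", "USA", "UK", "USA", "USA"],
})

# Number of institutions per country, most frequent first
per_country = toy["country"].value_counts()
print(per_country)
```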


# For reproducibility
You may want to reuse the steps above to scrape the data yourself, so here is a function that you can copy and run to get the dataset.

In [10]:
# Import the dependencies
from requests import get
from bs4 import BeautifulSoup
from pandas import Series, read_html

def get_arwu(url = "http://www.shanghairanking.com/arwu2019.html"):
    """This function scrapes the Academic Ranking of 
    World Universities published by the ShanghaiRanking Consultancy.
    
    url defaults to the 2019 ranking, so no argument is needed.
    
    Returns a Pandas DataFrame"""
    # Scrape the main table
    shangai_ranking = read_html(url)
    # Download the same page to read the flag images
    flags = get(url)
    soup = BeautifulSoup(flags.text, "html.parser")
    # Extract the country name from each of the 1000 flag image tags
    img_src = soup.find_all("img")[1:1001]
    img_src = [str(img) for img in img_src]
    img_src = [img.split("/")[2].split(".")[0] for img in img_src]
    # Fill the empty third column, then rename it
    shangai_ranking[0].iloc[:, 2] = Series(name = "country", data = img_src)
    shangai_ranking_ = shangai_ranking[0].rename(columns = {shangai_ranking[0].columns[2]: "country"})
    return shangai_ranking_

In [11]:
arwu_ranking = get_arwu()

In [13]:
# The last 10 Institutions
arwu_ranking.tail(10)

Unnamed: 0,World Rank,Institution*,country,National/Regional Rank,Total Score,Score on Alumni Award HiCi N&S PUB PCP
990,901-1000,University of Tabriz,Iran,11-13,,0.0
991,901-1000,University of Thessaly,Greece,6-7,,0.0
992,901-1000,University of Toyama,Japan,34-43,,0.0
993,901-1000,University of Yamanashi,Japan,34-43,,11.2
994,901-1000,Vellore Institute of Technology,India,11-16,,0.0
995,901-1000,Williams College,USA,193-206,,18.6
996,901-1000,Worcester Polytechnic Institute,USA,193-206,,0.0
997,901-1000,Wroclaw University of Technology,Poland,7-9,,0.0
998,901-1000,Yokohama National University,Japan,34-43,,0.0
999,901-1000,Zagazig University,Egypt,5,,0.0


Remember that **Institutions within the same rank range are listed alphabetically**.
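
Because ranks such as `901-1000` are ranges, the `World Rank` column holds text rather than numbers. A minimal sketch, assuming every value is either a single number or two numbers joined by a hyphen, splits each rank into numeric bounds that can be sorted or filtered. The `ranks`, `rank_low` and `rank_high` names are illustrative.

```python
import pandas as pd

# A few rank labels in the formats seen in the outputs above
ranks = pd.Series(["1", "2", "201-300", "901-1000"], name="World Rank")

# Split on the hyphen: single ranks get a missing upper bound
bounds = ranks.astype(str).str.split("-", expand=True)

rank_low = bounds[0].astype(int)
# For a single rank, the upper bound is the rank itself
rank_high = bounds[1].fillna(bounds[0]).astype(int)

print(pd.DataFrame({"low": rank_low, "high": rank_high}))
```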

Feel free to comment and suggest how I can improve this article. I also found that my web scraper takes some time to get the data, so if you know how to make it faster, please tell me!
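
One possible saving: the function above requests the page twice, once through `read_html` and once through `get`. The sketch below downloads the HTML a single time and builds the table and the country column from the same parsed document. It swaps `read_html` for a plain BeautifulSoup loop, and the table layout in `sample_html`, as well as the `parse_ranking` and `fetch_ranking` names, are assumptions made for illustration. Whether this is noticeably faster on the real page has not been measured here.

```python
from pathlib import PurePosixPath

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_ranking(html):
    """Build a DataFrame from the first <table>, reading flag images as text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    header = [th.get_text(" ", strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header row: it has <th> cells only
        row = []
        for td in cells:
            img = td.find("img")
            if img is not None and img.get("src"):
                # Flag cell: keep the image file name without its extension
                row.append(PurePosixPath(img["src"]).stem)
            else:
                row.append(td.get_text(" ", strip=True))
        rows.append(row)
    return pd.DataFrame(rows, columns=header)


def fetch_ranking(url):
    """Download the page once and parse it (needs network access)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return parse_ranking(response.text)


# Hypothetical two-row page used to try the parser offline
sample_html = """
<table>
  <tr><th>World Rank</th><th>Institution</th><th>country</th><th>Total Score</th></tr>
  <tr><td>1</td><td>Harvard University</td>
      <td><img src="image/flag/USA.png"></td><td>100.0</td></tr>
  <tr><td>3</td><td>University of Cambridge</td>
      <td><img src="image/flag/UK.png"></td><td>72.3</td></tr>
</table>
"""

df = parse_ranking(sample_html)
print(df)
```

All values come back as strings with this approach, so numeric columns such as the scores would still need converting, for example with `pd.to_numeric`.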

# To cite this article
To cite this article, please use the following:  

Gailloty, A. (2019, January 13). *Scrape the Shanghai Ranking with Python*. Retrieved from https://agailloty.rbind.io/en/post/shangai-ranking/