You have to work on the [University dataset](https://drive.google.com/drive/folders/1Hs3nRtK_F3h8eg59B4-TD1DEua6g8Klv?usp=sharing). It contains three different university rankings:

* The Times Higher Education World University Ranking, *Times* for short,
* The Academic Ranking of World Universities, *Shanghai* for short,
* The Center for World University Rankings, *cwur* for short.

Notes

* It is mandatory to use GitHub for developing the project.
* The project must be a Jupyter notebook.
* There is no restriction on the libraries that can be used, nor on the Python version.
* All questions on the project **must** be asked in a public channel on [Zulip](https://focs.zulipchat.com).

# PROJECT 2020-2021 - Group 37 - Foundations of Computer Science
### Professor Gianluca Della Vedova

##### Authors: Marco Braga, Alessandro Maccario

First, we import the libraries needed for the rest of the analysis.

In [1]:
import re
import math
import numpy as np
import pandas as pd

### 1. For each university, extract from the `times` dataset the most recent and the least recent data, obtaining two separate dataframes

We load the `times` dataset and display its contents:

In [2]:
times = pd.read_csv('dataset_progetto/timesData.csv')
times.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596,7.8,22%,42 : 58,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929,8.4,27%,45 : 55,2011


We create the `times_min` dataframe to hold the least recent data
(`loc` is used because `idxmin` returns index *labels* rather than positions):

In [3]:
times_min = times.loc[times.groupby('university_name')['year'].idxmin()]
times_min

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
2405,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569,17.0,1%,-,2016
501,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422,15.9,15%,48 : 52,2012
502,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099,24.2,17%,32 : 68,2012
166,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895,13.6,14%,54 : 46,2011
476,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252,19.2,18%,48 : 52,2012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,42,École Normale Supérieure,France,66.8,44.9,48.2,95.7,30.7,68.6,2400,7.9,20%,46 : 54,2011
99,100,École Normale Supérieure de Lyon,France,51.1,37.6,34.4,88.8,26.1,57.0,2218,8.0,14%,49 : 51,2011
38,39,École Polytechnique,France,57.9,77.9,56.1,91.4,-,69.5,2429,4.8,30%,18 : 82,2011
47,48,École Polytechnique Fédérale de Lausanne,Switzerland,55.0,100.0,56.1,83.8,38.0,66.5,9666,10.5,54%,27 : 73,2011
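The selection above can be illustrated on a small toy dataframe. This is only a sketch: the values and the non-default index labels (10 to 14) are made up for illustration and are not taken from `times`.

```python
import pandas as pd

# Toy data (hypothetical values); the index deliberately does not start at 0.
toy = pd.DataFrame(
    {
        "university_name": ["A", "A", "B", "B", "B"],
        "year": [2011, 2016, 2012, 2014, 2016],
        "income": [30.0, 35.0, 50.0, 52.0, 49.0],
    },
    index=[10, 11, 12, 13, 14],
)

# idxmin/idxmax return one index label per group ...
oldest_idx = toy.groupby("university_name")["year"].idxmin()
newest_idx = toy.groupby("university_name")["year"].idxmax()

# ... so the rows are retrieved with the label-based accessor .loc.
toy_min = toy.loc[oldest_idx]
toy_max = toy.loc[newest_idx]

print(toy_min)
print(toy_max)
```

With these labels a position-based lookup (`.iloc`) would raise an `IndexError`, which is why label-based access is the safer choice whenever the index is not guaranteed to be `0..n-1`.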


We create the `times_max` dataframe to hold the most recent data:

In [4]:
# new dataset with most recent data for each university
times_max = times.loc[times.groupby('university_name')['year'].idxmax()]
times_max

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
2405,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569,17.0,1%,-,2016
2003,201-250,Aalborg University,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422,15.9,15%,48 : 52,2016
2056,251-300,Aalto University,Finland,31.1,65.4,32.8,62.1,61.6,-,16099,24.2,17%,32 : 68,2016
1908,=106,Aarhus University,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895,13.6,14%,54 : 46,2016
2105,301-350,Aberystwyth University,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252,19.2,18%,48 : 52,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1856,54,École Normale Supérieure,France,70.6,85.5,47.7,87.1,37.1,69.0,2400,7.9,20%,46 : 54,2016
2013,201-250,École Normale Supérieure de Lyon,France,41.6,65.6,30.0,69.0,31.7,-,2218,8.0,14%,49 : 51,2016
1904,=101,École Polytechnique,France,53.5,92.8,44.6,64.7,82.3,57.9,2429,4.8,30%,18 : 82,2016
1833,31,École Polytechnique Fédérale de Lausanne,Switzerland,61.3,98.6,67.5,94.6,65.4,76.1,9666,10.5,54%,27 : 73,2016


### 2. For each university, compute the improvement in income between the least recent and the most recent data points

The `income` column contains many `-` values, which we treat as null or missing. We exclude them from the difference computation and keep only the universities that have a non-missing value in both the least recent and the most recent dataset, so that an actual difference can be computed.

We then build a single dataframe containing the values of both `times_min` and `times_max`.

In [5]:
times_income = pd.merge(times_min, times_max, on='university_name', suffixes=('_min', '_max'))
times_income.head()

Unnamed: 0,world_rank_min,university_name,country_min,teaching_min,international_min,research_min,citations_min,income_min,total_score_min,num_students_min,...,international_max,research_max,citations_max,income_max,total_score_max,num_students_max,student_staff_ratio_max,international_students_max,female_male_ratio_max,year_max
0,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569,...,17.9,3.7,35.7,-,-,35569,17.0,1%,-,2016
1,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422,...,71.0,28.4,73.8,43.7,-,17422,15.9,15%,48 : 52,2016
2,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099,...,65.4,32.8,62.1,61.6,-,16099,24.2,17%,32 : 68,2016
3,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895,...,76.8,50.7,79.8,68.3,57.7,23895,13.6,14%,54 : 46,2016
4,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252,...,72.2,18.9,67.2,31.3,-,9252,19.2,18%,48 : 52,2016
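As a minimal sketch of how the suffixes work, the toy example below merges two small frames that share every column name. The data are invented, and the `validate="one_to_one"` argument is an optional safeguard that is not part of the original cell: it makes `pd.merge` raise an error if a university appears more than once on either side.

```python
import pandas as pd

# Hypothetical "least recent" and "most recent" rows for two universities.
first = pd.DataFrame({
    "university_name": ["A", "B"],
    "income": ["30.0", "-"],
    "year": [2011, 2012],
})
last = pd.DataFrame({
    "university_name": ["A", "B"],
    "income": ["35.0", "49.0"],
    "year": [2016, 2016],
})

# Columns present on both sides (other than the key) receive the suffixes.
merged = pd.merge(
    first,
    last,
    on="university_name",
    suffixes=("_min", "_max"),
    validate="one_to_one",  # assumption: exactly one row per university per side
)
print(merged.columns.tolist())
```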


From it we extract only the columns of interest:

In [6]:
times_income = times_income[['university_name', 'income_min', 'income_max']]
times_income.head()

Unnamed: 0,university_name,income_min,income_max
0,AGH University of Science and Technology,-,-
1,Aalborg University,36.4,43.7
2,Aalto University,61.9,61.6
3,Aarhus University,61.5,68.3
4,Aberystwyth University,35.5,31.3


We therefore keep only the rows that do not contain a dash (`-`), so that the differences can then be computed.

In [7]:
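# .copy() makes the filtered result an independent dataframe, so the later column assignments do not act on a view
times_income = times_income[(~times_income['income_min'].str.contains('-')) & (~times_income['income_max'].str.contains('-'))].copy()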
times_income

Unnamed: 0,university_name,income_min,income_max
1,Aalborg University,36.4,43.7
2,Aalto University,61.9,61.6
3,Aarhus University,61.5,68.3
4,Aberystwyth University,35.5,31.3
5,Adam Mickiewicz University,28.7,28.7
...,...,...,...
812,Zhejiang University,70.3,96.2
813,École Normale Supérieure,30.7,37.1
814,École Normale Supérieure de Lyon,26.1,31.7
816,École Polytechnique Fédérale de Lausanne,38.0,65.4


We convert the column values to float:

In [8]:
times_income['income_min'] = times_income['income_min'].astype('float')
times_income['income_max'] = times_income['income_max'].astype('float')

In [9]:
times_income['income_diff'] = times_income['income_max'] - times_income['income_min']
times_income

Unnamed: 0,university_name,income_min,income_max,income_diff
1,Aalborg University,36.4,43.7,7.3
2,Aalto University,61.9,61.6,-0.3
3,Aarhus University,61.5,68.3,6.8
4,Aberystwyth University,35.5,31.3,-4.2
5,Adam Mickiewicz University,28.7,28.7,0.0
...,...,...,...,...
812,Zhejiang University,70.3,96.2,25.9
813,École Normale Supérieure,30.7,37.1,6.4
814,École Normale Supérieure de Lyon,26.1,31.7,5.6
816,École Polytechnique Fédérale de Lausanne,38.0,65.4,27.4
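The filter, conversion and subtraction steps can also be expressed in a more compact form. The sketch below is an alternative, not the approach used in this notebook: it relies on `pd.to_numeric(..., errors="coerce")` to turn any non-numeric entry such as `-` into `NaN`, and runs on made-up toy data.

```python
import pandas as pd

# Toy data (hypothetical): '-' marks a missing income value.
toy = pd.DataFrame({
    "university_name": ["A", "B", "C"],
    "income_min": ["36.4", "-", "61.9"],
    "income_max": ["43.7", "50.0", "-"],
})

cols = ["income_min", "income_max"]

# Non-numeric strings become NaN instead of raising an error.
numeric = toy[cols].apply(pd.to_numeric, errors="coerce")

result = (
    toy[["university_name"]]
    .join(numeric)
    .dropna(subset=cols)  # keep rows where both values are present
    .assign(income_diff=lambda d: d["income_max"] - d["income_min"])
)
print(result)
```

Only university `A` survives here, because `B` and `C` each lack one of the two values.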


Finally, we check whether any `-` or null values remain:

In [10]:
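# isin() compares values literally and does not interpret regular expressions;
# income_diff is numeric here, so its string form is compared with '-'
times_income[times_income['income_diff'].astype(str) == '-']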

Unnamed: 0,university_name,income_min,income_max,income_diff


In [11]:
times_income[times_income['income_diff'].isna()]

Unnamed: 0,university_name,income_min,income_max,income_diff


### 3. Find the university with the largest increase computed in the previous point

Using `.max()`, we select the record or records with the largest difference in `income` between the least recent and the most recent data points.

In [12]:
times_income[times_income['income_diff'] == times_income['income_diff'].max()]

Unnamed: 0,university_name,income_min,income_max,income_diff
428,TU Dresden,31.9,99.7,67.8
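The boolean mask used above returns every row that attains the maximum, whereas `idxmax` would return only the first one. A small sketch on invented data with a deliberate tie shows the difference:

```python
import pandas as pd

# Toy data (hypothetical) in which B and C share the largest difference.
toy = pd.DataFrame({
    "university_name": ["A", "B", "C"],
    "income_diff": [5.0, 9.5, 9.5],
})

# Mask against the maximum: keeps all tied rows.
top_all = toy[toy["income_diff"] == toy["income_diff"].max()]

# idxmax: keeps only the first row that reaches the maximum.
top_first = toy.loc[[toy["income_diff"].idxmax()]]

print(top_all)
print(top_first)
```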


### 4. For each ranking, consider only the most recent data point. For each university, compute the maximum difference between the rankings (e.g. for Aarhus University the value is 122-73=49). Notice that some rankings are expressed as a range

We consider the datasets relevant to this task, namely `cwur`, `shanghaiData` and `times`, and isolate the most recent data point for each university together with the columns of interest: `world_rank`, `university_name`/`institution`, and `year`.

In [13]:
shanghai_db = pd.read_csv('dataset_progetto/shanghaiData.csv')
shanghai_db.head()

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year
0,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,72.4,2005
1,2,University of Cambridge,1,73.6,99.8,93.4,53.3,56.6,70.9,66.9,2005
2,3,Stanford University,2,73.4,41.1,72.2,88.5,70.9,72.3,65.0,2005
3,4,"University of California, Berkeley",3,72.8,71.8,76.0,69.4,73.9,72.2,52.7,2005
4,5,Massachusetts Institute of Technology (MIT),4,70.1,74.0,80.6,66.7,65.8,64.3,53.0,2005


In [14]:
cwur_db = pd.read_csv('dataset_progetto/cwurData.csv')
cwur_db.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


In [15]:
times_for_max = times.loc[times.groupby('university_name')['year'].idxmax()][['university_name','world_rank','year']]
times_for_max.head()

Unnamed: 0,university_name,world_rank,year
2405,AGH University of Science and Technology,601-800,2016
2003,Aalborg University,201-250,2016
2056,Aalto University,251-300,2016
1908,Aarhus University,=106,2016
2105,Aberystwyth University,301-350,2016


In [16]:
cwurdb_for_max = cwur_db.loc[cwur_db.groupby('institution')['year'].idxmax()][['institution','world_rank','year']]
cwurdb_for_max.head()

Unnamed: 0,institution,world_rank,year
1981,AGH University of Science and Technology,782,2015
1764,Aalborg University,565,2015
1620,Aalto University,421,2015
1321,Aarhus University,122,2015
2013,Aberystwyth University,814,2015


In [17]:
shanghaidb_for_max = shanghai_db.loc[shanghai_db.groupby('university_name')['year'].idxmax()][['university_name','world_rank','year']]
shanghaidb_for_max.head()

Unnamed: 0,university_name,world_rank,year
4697,Aalborg University,301-400,2015
4797,Aalto University,401-500,2015
4469,Aarhus University,73,2015
4497,Aix Marseille University,101-150,2015
3115,Aix-Marseille University,102-150,2011
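The `world_rank` values shown so far come in three shapes: a plain number (`73`), a tied rank with a leading `=` (`=106`), and a range (`601-800`). Before ranks can be subtracted they have to become numbers. The helper below, `parse_rank`, is a hypothetical sketch that returns the lower and upper bound of a rank; how a range is then reduced to a single value (lower bound, midpoint, ...) is a modelling choice left open here.

```python
import re

def parse_rank(rank):
    """Return (low, high) bounds for a rank such as 73, '=106' or '601-800'.

    Returns None when the value does not look like a rank.
    """
    text = str(rank).strip().lstrip("=")
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", text)
    if match is None:
        return None
    low = int(match.group(1))
    # A plain rank has no upper part, so both bounds coincide.
    high = int(match.group(2)) if match.group(2) else low
    return low, high

print(parse_rank("601-800"))
print(parse_rank("=106"))

# Aarhus University example from the task statement: 122 - 73
print(parse_rank(122)[0] - parse_rank("73")[0])
```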


In [18]:
times_for_max[times_for_max['university_name'] == 'Karlsruhe Institute of Technology']

Unnamed: 0,university_name,world_rank,year
1942,Karlsruhe Institute of Technology,=138,2016


In [19]:
shanghaidb_for_max[shanghaidb_for_max['university_name'].str.contains('Karlsruhe Institute of Technology')]

Unnamed: 0,university_name,world_rank,year
4618,Karlsruhe Institute of Technology (KIT),201-300,2015


As the *times_for_max* and *shanghaidb_for_max* outputs already show, the same university can appear under names that differ by one or more characters. To handle such cases, we normalize the university names against the *school_name* column of the *school_and_country_table.csv* dataset.

In [20]:
school_and_country_name = pd.read_csv("dataset_progetto/school_and_country_table.csv", encoding="UTF-8")
school_and_country_name.head()

Unnamed: 0,school_name,country
0,Harvard University,United States of America
1,California Institute of Technology,United States of America
2,Massachusetts Institute of Technology,United States of America
3,Stanford University,United States of America
4,Princeton University,United States of America


In [21]:
school_and_country_name.rename(columns={'school_name':'university_name_normalized'}, inplace=True)

In [22]:
school_and_country_name.head()

Unnamed: 0,university_name_normalized,country
0,Harvard University,United States of America
1,California Institute of Technology,United States of America
2,Massachusetts Institute of Technology,United States of America
3,Stanford University,United States of America
4,Princeton University,United States of America


The function first looks for an exact match. Failing that, for each university in the *school_and_country_name* table it checks whether the name from the dataset being normalized contains, or is contained in, the name listed in that table.

In [23]:
def normalize(column_element):
  # Return the reference name matching column_element, or None if nothing matches.

  # 1) exact match against the reference table
  for university in school_and_country_name['university_name_normalized']:
    if column_element == university:
      return column_element

  # re.escape makes the name a literal pattern: characters such as '(' or '.'
  # would otherwise be interpreted as regex syntax
  name_column_element = re.compile(re.escape(column_element))

  # 2) containment in either direction
  for university in school_and_country_name['university_name_normalized']:
    if name_column_element.search(university):
      return university
    if re.search(re.escape(university), column_element):
      return university

  return None
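One detail worth keeping in mind when names are used as regular-expression patterns: characters such as parentheses are regex syntax. The sketch below uses a name from the Shanghai data, `Karlsruhe Institute of Technology (KIT)`, to show that an unescaped pattern does not even match itself, while `re.escape` restores literal matching. The helper `contains_either` is hypothetical and only illustrates the two-way containment test.

```python
import re

def contains_either(a, b):
    """True if `a` occurs literally inside `b`, or `b` literally inside `a`."""
    return (
        re.search(re.escape(a), b) is not None
        or re.search(re.escape(b), a) is not None
    )

name = "Karlsruhe Institute of Technology (KIT)"

# Unescaped, "(KIT)" is a capture group matching the text "KIT",
# so the pattern expects "... Technology KIT" and fails on the real name.
unescaped_self_match = re.search(name, name) is not None

# Escaped, every character is matched literally.
escaped_self_match = re.search(re.escape(name), name) is not None

print(unescaped_self_match, escaped_self_match)
print(contains_either("Karlsruhe Institute of Technology", name))
```

For plain containment the `in` operator (`a in b or b in a`) gives the same answer without involving regular expressions at all.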

In [24]:
shanghaidb_for_max["university_name_normalized"] = shanghaidb_for_max['university_name'].apply(normalize)

In [25]:
shanghaidb_for_max.head()

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4697,Aalborg University,301-400,2015,Aalborg University
4797,Aalto University,401-500,2015,Aalto University
4469,Aarhus University,73,2015,Aarhus University
4497,Aix Marseille University,101-150,2015,
3115,Aix-Marseille University,102-150,2011,Aix-Marseille University


Having found duplicates, we looked more closely in order to handle the individual cases separately:

In [26]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'].notnull() & shanghaidb_for_max['university_name_normalized'].duplicated()]

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3876,Arizona State University - Tempe,79,2013,Arizona State University
3425,Curtin University of Technology,401-500,2011,Curtin University
925,Louisiana State University Health Sciences Center,401-500,2006,Louisiana State University
4510,Norwegian University of Science and Technology...,101-150,2015,Norwegian University of Science and Technology
4458,Purdue University - West Lafayette,61,2015,Purdue University
4133,Royal Institute of Technology,201-300,2014,KTH Royal Institute of Technology
3606,Texas A&M University - College Station,93,2012,Texas A&M University
3913,The Johns Hopkins University,17,2014,Johns Hopkins University
3461,The University of Connecticut Health Center,401-500,2011,University of Connecticut
4569,The University of Hong Kong,151-200,2015,University of Hong Kong


In [27]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Arizona State University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4489,Arizona State University,93,2015,Arizona State University
3876,Arizona State University - Tempe,79,2013,Arizona State University


In [28]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Curtin University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4606,Curtin University,201-300,2015,Curtin University
3425,Curtin University of Technology,401-500,2011,Curtin University


In [29]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Louisiana State University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4624,Louisiana State University - Baton Rouge,201-300,2015,Louisiana State University
925,Louisiana State University Health Sciences Center,401-500,2006,Louisiana State University


In [30]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Norwegian University of Science and Technology']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3738,Norwegian University of Science and Technology,201-300,2012,Norwegian University of Science and Technology
4510,Norwegian University of Science and Technology...,101-150,2015,Norwegian University of Science and Technology


In [31]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Purdue University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4616,Indiana University-Purdue University at Indian...,201-300,2015,Purdue University
4458,Purdue University - West Lafayette,61,2015,Purdue University


In [32]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'KTH Royal Institute of Technology']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4621,KTH Royal Institute of Technology,201-300,2015,KTH Royal Institute of Technology
4133,Royal Institute of Technology,201-300,2014,KTH Royal Institute of Technology


In [33]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Texas A&M University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4496,Texas A&M University,100,2015,Texas A&M University
3606,Texas A&M University - College Station,93,2012,Texas A&M University


In [34]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Johns Hopkins University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4412,Johns Hopkins University,16,2015,Johns Hopkins University
3913,The Johns Hopkins University,17,2014,Johns Hopkins University


In [35]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Connecticut']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3758,The University of Connecticut - Storrs,201-300,2012,University of Connecticut
3461,The University of Connecticut Health Center,401-500,2011,University of Connecticut
4768,University of Connecticut,301-400,2015,University of Connecticut


In [36]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Arkansas']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4853,University of Arkansas at Fayetteville,401-500,2015,University of Arkansas
4854,University of Arkansas at Little Rock,401-500,2015,University of Arkansas


In [37]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Bordeaux']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4656,University of Bordeaux,201-300,2015,University of Bordeaux
3383,University of Bordeaux 1,301-400,2011,University of Bordeaux


In [38]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'State University of Campinas']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3748,State University of Campinas,201-300,2012,State University of Campinas
4764,University of Campinas,301-400,2015,State University of Campinas


In [39]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Graz']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4825,Medical University of Graz,401-500,2015,University of Graz
4864,University of Graz,401-500,2015,University of Graz


In [40]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Innsbruck']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3346,Medical University of Innsbruck,301-400,2011,University of Innsbruck
4664,University of Innsbruck,201-300,2015,University of Innsbruck


In [41]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Kansas']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4665,University of Kansas,201-300,2015,University of Kansas
3783,University of Kansas - Lawrence,201-300,2012,University of Kansas
2984,University of Kansas Medical Center,401-500,2010,University of Kansas


In [42]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Maryland, Baltimore County']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4675,"University of Maryland, Baltimore",201-300,2015,"University of Maryland, Baltimore County"
4872,"University of Maryland, Baltimore County",401-500,2015,"University of Maryland, Baltimore County"


In [43]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Massachusetts']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4533,University of Massachusetts Amherst,101-150,2015,University of Massachusetts
4534,University of Massachusetts Medical School - W...,101-150,2015,University of Massachusetts


In [44]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Melbourne']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4440,The University of Melbourne,44,2015,University of Melbourne
3853,University of Melbourne,54,2013,University of Melbourne


In [45]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Michigan']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3820,University of Michigan - Ann Arbor,23,2013,University of Michigan
4418,University of Michigan-Ann Arbor,22,2015,University of Michigan


In [46]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Montana']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4250,The University of Montana - Missoula,301-400,2014,University of Montana
4776,University of Montana - Missoula,301-400,2015,University of Montana


In [47]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of New South Wales']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4518,The University of New South Wales,101-150,2015,University of New South Wales
3654,University of New South Wales,101-150,2012,University of New South Wales


In [48]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Newcastle']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4749,"The University of Newcastle, Australia",301-400,2015,University of Newcastle
4275,University of Newcastle,301-400,2014,University of Newcastle


In [49]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'University of Pittsburgh']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
3860,University of Pittsburgh,61,2013,University of Pittsburgh
4466,"University of Pittsburgh, Pittsburgh Campus",70,2015,University of Pittsburgh
3961,University of Pittsburgh-Pittsburgh Campus,65,2014,University of Pittsburgh


In [50]:
shanghaidb_for_max[shanghaidb_for_max['university_name_normalized'] == 'Washington State University']

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4793,Washington State University,301-400,2015,Washington State University
3309,Washington State University - Pullman,201-300,2011,Washington State University


We therefore decided to drop the duplicates and to collect the secondary campuses in a list so that they can be excluded from the study, keeping only the main campus of each university.

In [51]:
campus_shanghai = ['Arizona State University - Tempe',
                              'Louisiana State University Health Sciences Center',
                              'Indiana University-Purdue University at Indianapolis',
                              'The University of Connecticut - Storrs',
                              'The University of Connecticut Health Center',
                              'University of Arkansas at Little Rock',
                              'Medical University of Graz',
                              'Medical University of Innsbruck',
                              'University of Kansas Medical Center',
                              'University of Maryland, Baltimore',
                              'University of Massachusetts Medical School - Worcester']

In [52]:
del shanghaidb_for_max['university_name_normalized']

In [53]:
shanghai_for_max = shanghaidb_for_max.copy()

In [54]:
shanghai_for_max.head()

Unnamed: 0,university_name,world_rank,year
4697,Aalborg University,301-400,2015
4797,Aalto University,401-500,2015
4469,Aarhus University,73,2015
4497,Aix Marseille University,101-150,2015
3115,Aix-Marseille University,102-150,2011


We drop the duplicates by index:

In [55]:
shanghai_for_max.drop([3425, 3738, 4133, 3606, 3913, 3383, 3748, 3783, 3853, 3820, 4250, 3654, 4275, 3860, 3961, 3309], inplace = True)
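Dropping rows by hard-coded index works for this exact file but has to be redone by hand if the data change. A rule-based alternative is sketched below, under the assumption that the row to keep for each normalized name is always the one with the latest `year`. The toy rows and index labels are taken from the Curtin University and Aalto University outputs above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "university_name": [
            "Curtin University",
            "Curtin University of Technology",
            "Aalto University",
        ],
        "year": [2015, 2011, 2015],
        "university_name_normalized": [
            "Curtin University",
            "Curtin University",
            "Aalto University",
        ],
    },
    index=[4606, 3425, 4797],
)

latest = (
    toy.sort_values("year")  # oldest first
    .drop_duplicates(subset="university_name_normalized", keep="last")  # keep newest
    .sort_index()
)
print(latest)
```

Two caveats apply. Rows whose normalized name is missing would be treated as duplicates of one another, so they need to be set aside first. And when two rows share the same year (for example two campuses both ranked in 2015), the rule cannot decide, which is where an explicit list of campuses is still needed.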

Since the list removes the campuses and the drop removes the duplicates, only a few edge cases remain to be handled, such as the presence or absence of spaces around hyphens:

In [56]:
def normalize_shanghai(column_element):
  
  if column_element in campus_shanghai:
    return None
  
  for university in school_and_country_name['university_name_normalized']:
    if column_element == university:
      return column_element

    if column_element == "The Chinese University of Hong Kong" and university == "Chinese University of Hong Kong":
      return university
    
    if column_element == "University of Milan - Bicocca" and university == "University of Milan-Bicocca":
      return university

    if column_element == "University of Wisconsin - Madison" and university == "University of Wisconsin-Madison":
      return university

    if column_element == "University of Wisconsin - Milwaukee" and university == "University of Wisconsin-Milwaukee":
      return university
  
  
  # re.escape makes the name a literal pattern: characters such as '(' or '.'
  # would otherwise be interpreted as regex syntax
  name_column_element = re.compile(re.escape(column_element))

  # containment in either direction
  for university in school_and_country_name['university_name_normalized']:
    if name_column_element.search(university):
      return university
    if re.search(re.escape(university), column_element):
      return university

  return None
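The four special cases handled explicitly (a leading "The" and spaces around a hyphen) follow two simple patterns. As a sketch of a more general alternative, a hypothetical `canonical` helper could rewrite both sides of the comparison into the same form before matching:

```python
import re

def canonical(name):
    """Hypothetical helper: bring a university name into a comparable form."""
    name = name.strip()
    name = re.sub(r"\s*-\s*", "-", name)  # "Milan - Bicocca" -> "Milan-Bicocca"
    name = re.sub(r"^The\s+", "", name)   # drop a leading article
    return name

print(canonical("University of Milan - Bicocca"))
print(canonical("The Chinese University of Hong Kong"))
print(canonical("University of Wisconsin - Madison"))
```

Applying `canonical` to both the dataset name and the reference name before comparing them would cover these cases without listing each one, although it also rewrites names such as campus suffixes (`... - Tempe`), so the campus list would have to be written in the same canonical form.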

In [57]:
shanghai_for_max['university_name_normalized'] = shanghai_for_max['university_name'].apply(normalize_shanghai)

In [58]:
shanghai_for_max.head()

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4697,Aalborg University,301-400,2015,Aalborg University
4797,Aalto University,401-500,2015,Aalto University
4469,Aarhus University,73,2015,Aarhus University
4497,Aix Marseille University,101-150,2015,
3115,Aix-Marseille University,102-150,2011,Aix-Marseille University


Finally, we check for any remaining duplicates to make sure they have all been removed:

In [59]:
shanghai_for_max[shanghai_for_max['university_name_normalized'].notnull() & shanghai_for_max['university_name_normalized'].duplicated()]

Unnamed: 0,university_name,world_rank,year,university_name_normalized


In the same way, we check that the times dataset contains no duplicate values:

In [60]:
times_for_max_copy = times_for_max.copy()

In [61]:
times_for_max_copy['university_name_normalized'] = times_for_max_copy['university_name'].apply(normalize)

In [62]:
times_for_max_copy.head()

Unnamed: 0,university_name,world_rank,year,university_name_normalized
2405,AGH University of Science and Technology,601-800,2016,AGH University of Science and Technology
2003,Aalborg University,201-250,2016,Aalborg University
2056,Aalto University,251-300,2016,Aalto University
1908,Aarhus University,=106,2016,Aarhus University
2105,Aberystwyth University,301-350,2016,Aberystwyth University


We check for duplicate values in *times_for_max_copy*:

In [63]:
times_for_max_copy[times_for_max_copy['university_name_normalized'].notnull() & times_for_max_copy['university_name_normalized'].duplicated()]

Unnamed: 0,university_name,world_rank,year,university_name_normalized


Since there are no duplicate values, there is no need to apply the elimination procedure used earlier for the Shanghai dataset.

Finally, we apply the same procedure to the *cwur* dataset:

In [64]:
cwurdb_for_max_copy = cwurdb_for_max.copy()

In [65]:
cwurdb_for_max_copy.head()

Unnamed: 0,institution,world_rank,year
1981,AGH University of Science and Technology,782,2015
1764,Aalborg University,565,2015
1620,Aalto University,421,2015
1321,Aarhus University,122,2015
2013,Aberystwyth University,814,2015


In [66]:
cwurdb_for_max_copy['university_name_normalized'] = cwurdb_for_max_copy['institution'].apply(normalize)

In [67]:
cwurdb_for_max_copy.head()

Unnamed: 0,institution,world_rank,year,university_name_normalized
1981,AGH University of Science and Technology,782,2015,AGH University of Science and Technology
1764,Aalborg University,565,2015,Aalborg University
1620,Aalto University,421,2015,Aalto University
1321,Aarhus University,122,2015,Aarhus University
2013,Aberystwyth University,814,2015,Aberystwyth University


We check for duplicate values in *cwurdb_for_max_copy*:

In [68]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'].notnull() & cwurdb_for_max_copy['university_name_normalized'].duplicated()]

Unnamed: 0,institution,world_rank,year,university_name_normalized
2094,Nanjing University of Aeronautics and Astronau...,895,2015,Nanjing University
2003,Nanjing University of Science and Technology,804,2015,Nanjing University
2009,Nanjing University of Technology,810,2015,Nanjing University
2115,National Taiwan University of Science and Tech...,916,2015,National Taiwan University
2070,Northeastern University (China),871,2015,Northeastern University
1242,"Purdue University, West Lafayette",43,2015,Purdue University
2168,South China Agricultural University,969,2015,China Agricultural University
1476,Sun Yat-sen University,277,2015,Sun Yat-sen University
1628,University of Arkansas for Medical Sciences,429,2015,University of Arkansas
631,University of Bordeaux I,432,2014,University of Bordeaux


In [69]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'Nanjing University']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1443,Nanjing University,244,2015,Nanjing University
2094,Nanjing University of Aeronautics and Astronau...,895,2015,Nanjing University
2003,Nanjing University of Science and Technology,804,2015,Nanjing University
2009,Nanjing University of Technology,810,2015,Nanjing University


We check the official name to decide which row to keep and which to drop:

In [70]:
school_and_country_name[school_and_country_name['university_name_normalized'].str.contains('Nanjing University') == True]

Unnamed: 0,university_name_normalized,country
119,Nanjing University,China


In [71]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'Northeastern University']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1474,Northeastern University,275,2015,Northeastern University
2070,Northeastern University (China),871,2015,Northeastern University


We check whether *Northeastern University* appears in the other tables:

In [72]:
times[times['university_name'].str.contains('Northeastern University')==True]

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
406,201-225,Northeastern University,United States of America,30.4,38.0,18.9,70.0,31.1,-,18539,15.1,26%,50 : 50,2012
810,201-225,Northeastern University,United States of America,40.1,41.0,21.2,76.4,33.8,-,18539,15.1,26%,50 : 50,2013
1185,184,Northeastern University,United States of America,34.5,48.5,19.8,82.0,34.5,45.4,18539,15.1,26%,50 : 50,2014
1586,185,Northeastern University,United States of America,36.4,54.7,21.9,81.3,34.0,46.8,18539,15.1,26%,50 : 50,2015
2029,201-250,Northeastern University,United States of America,35.5,58.7,20.6,84.0,32.9,-,18539,15.1,26%,50 : 50,2016


We keep the institution located in the USA, since it appears in the other datasets while its Chinese namesake does not.

In [73]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'Purdue University']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1363,Indiana University-Purdue University Indianapolis,164,2015,Purdue University
1242,"Purdue University, West Lafayette",43,2015,Purdue University


In [74]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'China Agricultural University']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1826,China Agricultural University,627,2015,China Agricultural University
2168,South China Agricultural University,969,2015,China Agricultural University


In [75]:
times_for_max_copy[times_for_max_copy['university_name'].str.contains('China Agricultural University') == True]

Unnamed: 0,university_name,world_rank,year,university_name_normalized
2320,China Agricultural University,501-600,2016,China Agricultural University


In [76]:
shanghai_for_max[shanghai_for_max['university_name'].str.contains('China Agricultural University') == True]

Unnamed: 0,university_name,world_rank,year,university_name_normalized
4703,China Agricultural University,301-400,2015,China Agricultural University


In [77]:
# TO BE ANALYZED LATER AS A SPECIAL CASE
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'Sun Yat-sen University']

Unnamed: 0,institution,world_rank,year,university_name_normalized
2002,National Sun Yat-sen University,803,2015,Sun Yat-sen University
1476,Sun Yat-sen University,277,2015,Sun Yat-sen University


In [78]:
school_and_country_name[school_and_country_name['country'].str.contains('Taiwan') == True]

Unnamed: 0,university_name_normalized,country
106,National Tsing Hua University,Taiwan
114,National Taiwan University,Taiwan
162,National Sun Yat-Sen University,Taiwan
180,National Chiao Tung University,Taiwan
320,National Taiwan University of Science and Tech...,Taiwan
374,National Central University,Taiwan
375,National Taiwan Ocean University,Taiwan
403,Yuan Ze University,Taiwan
425,National Cheng Kung University,Taiwan
450,"China Medical University, Taiwan",Taiwan


In [79]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Arkansas']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1749,University of Arkansas - Fayetteville,550,2015,University of Arkansas
1628,University of Arkansas for Medical Sciences,429,2015,University of Arkansas


In [80]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Bordeaux']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1498,University of Bordeaux,299,2015,University of Bordeaux
631,University of Bordeaux I,432,2014,University of Bordeaux
602,University of Bordeaux II,403,2014,University of Bordeaux


In [81]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Graz']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1773,Medical University of Graz,574,2015,University of Graz
1783,University of Graz,584,2015,University of Graz


In [82]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Maryland, Baltimore County']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1374,"University of Maryland, Baltimore",175,2015,"University of Maryland, Baltimore County"
1692,"University of Maryland, Baltimore County",493,2015,"University of Maryland, Baltimore County"


In [83]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Massachusetts']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1418,University of Massachusetts Amherst,219,2015,University of Massachusetts
2024,University of Massachusetts Boston,825,2015,University of Massachusetts
2083,University of Massachusetts Lowell,884,2015,University of Massachusetts
1382,University of Massachusetts Medical School,183,2015,University of Massachusetts


In [84]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Milan']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1371,University of Milan,172,2015,University of Milan
1581,University of Milan - Bicocca,382,2015,University of Milan


In [85]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Missouri']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1393,University of Missouri–Columbia,194,2015,University of Missouri
1753,University of Missouri–Kansas City,554,2015,University of Missouri
1947,University of Missouri–St. Louis,748,2015,University of Missouri


In [86]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Oklahoma']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1549,University of Oklahoma - Norman Campus,350,2015,University of Oklahoma
1675,University of Oklahoma Health Sciences Center,476,2015,University of Oklahoma


In [87]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of São Paulo']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1788,Federal University of São Paulo,589,2015,University of São Paulo
1331,University of São Paulo,132,2015,University of São Paulo


In [88]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'University of Wisconsin']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1224,University of Wisconsin–Madison,25,2015,University of Wisconsin
1709,University of Wisconsin–Milwaukee,510,2015,University of Wisconsin


In [89]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'Zhejiang University']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1390,Zhejiang University,191,2015,Zhejiang University
2183,Zhejiang University of Technology,984,2015,Zhejiang University


In [90]:
cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'] == 'École Polytechnique']

Unnamed: 0,institution,world_rank,year,university_name_normalized
1235,École Polytechnique,36,2015,École Polytechnique
1980,École Polytechnique de Montréal,781,2015,École Polytechnique


We drop the duplicate row with index 631:

In [91]:
cwurdb_for_max_copy.drop([631],inplace=True)

In [92]:
campus_cwur = ['Nanjing University of Aeronautics and Astronautics',
                            'Nanjing University of Science and Technology',
                            'Nanjing University of Technology',
                            'Northeastern University (China)',  
                            'Indiana University-Purdue University Indianapolis',                            
                            'South China Agricultural University',
                            'University of Arkansas for Medical Sciences',
                            'University of Bordeaux II',
                            'Medical University of Graz',
                            'University of Maryland, Baltimore',
                            'University of Massachusetts Boston',
                            'University of Massachusetts Lowell',
                            'University of Massachusetts Medical School',
                            'University of Missouri–Kansas City',
                            'University of Missouri–St. Louis',
                            'University of Oklahoma Health Sciences Center',
                            'Federal University of São Paulo',
                            'Zhejiang University of Technology',
                            'École Polytechnique de Montréal']

In [93]:
def normalize_cwur(column_element):
    # Institutions listed in campus_cwur are separate campuses or namesakes:
    # they must not share a normalized name with another university
    if column_element in campus_cwur:
        return None

    for university in school_and_country_name['university_name_normalized']:
        # Exact match with an official name
        if column_element == university:
            return column_element

        # Names that differ only in spelling or punctuation
        if column_element == 'National Taiwan University of Science and Technology' and university == 'National Taiwan University of Science and Technology (Taiwan Tech)':
            return university

        if column_element == 'National Sun Yat-sen University' and university == 'National Sun Yat-Sen University':
            return university

        if column_element == "University of Milan - Bicocca" and university == "University of Milan-Bicocca":
            return university

        # The cwur names use an en dash, the official ones a hyphen
        if column_element == 'University of Wisconsin–Madison' and university == 'University of Wisconsin-Madison':
            return university

        if column_element == 'University of Wisconsin–Milwaukee' and university == 'University of Wisconsin-Milwaukee':
            return university

    # Fall back to substring matching in both directions; re.escape makes the
    # names match literally even if they contain regex metacharacters
    name_column_element = re.compile(re.escape(column_element))

    for university in school_and_country_name['university_name_normalized']:
        if name_column_element.search(university):
            return university
        if re.search(re.escape(university), column_element):
            return university

    # No official name matches
    return None
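
When university names are used as regular-expression patterns, characters such as parentheses are read as regex syntax rather than as plain text. The sketch below illustrates the effect and how `re.escape` avoids it; `contains_literal` is a hypothetical helper introduced only for this example.

```python
import re

def contains_literal(needle, haystack):
    """Return True if `needle` occurs in `haystack` as plain text."""
    return re.search(re.escape(needle), haystack) is not None

name = "Northeastern University (China)"

# Unescaped, the parentheses form a regex group, so the pattern looks for
# "Northeastern University China" and does not find the literal name.
unescaped = re.search(name, name) is not None
escaped = contains_literal(name, name)
print(unescaped, escaped)
```

The same helper also covers the substring case used for the campuses, for example `contains_literal("Nanjing University", "Nanjing University of Technology")`.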

In [94]:
cwurdb_for_max_copy['university_name_normalized'] = cwurdb_for_max_copy['institution'].apply(normalize_cwur)

In [95]:
cwurdb_for_max_copy.head()

Unnamed: 0,institution,world_rank,year,university_name_normalized
1981,AGH University of Science and Technology,782,2015,AGH University of Science and Technology
1764,Aalborg University,565,2015,Aalborg University
1620,Aalto University,421,2015,Aalto University
1321,Aarhus University,122,2015,Aarhus University
2013,Aberystwyth University,814,2015,Aberystwyth University


We check for any remaining duplicates:

In [96]:
cwurdb_for_max_copy[(cwurdb_for_max_copy['university_name_normalized'].isnull() == False) & (cwurdb_for_max_copy['university_name_normalized'].duplicated() == True)]

Unnamed: 0,institution,world_rank,year,university_name_normalized


In [97]:
shanghai_for_max_finale = shanghai_for_max[shanghai_for_max['university_name_normalized'].isnull() == False]

In [98]:
times_for_max_finale = times_for_max_copy[times_for_max_copy['university_name_normalized'].isnull() == False]

In [99]:
cwur_for_max_finale = cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'].isnull() == False]

Since we want a single final dataset, we first `merge` two of the datasets and then `merge` the result with the remaining one:

In [100]:
fuso_1 = pd.merge(shanghai_for_max_finale, times_for_max_finale, on = 'university_name_normalized', suffixes=('_shanghai', '_times'))
fuso_1.head()

Unnamed: 0,university_name_shanghai,world_rank_shanghai,year_shanghai,university_name_normalized,university_name_times,world_rank_times,year_times
0,Aalborg University,301-400,2015,Aalborg University,Aalborg University,201-250,2016
1,Aalto University,401-500,2015,Aalto University,Aalto University,251-300,2016
2,Aarhus University,73,2015,Aarhus University,Aarhus University,=106,2016
3,Aix-Marseille University,102-150,2011,Aix-Marseille University,Aix-Marseille University,251-300,2016
4,Aristotle University of Thessaloniki,401-500,2015,Aristotle University of Thessaloniki,Aristotle University of Thessaloniki,601-800,2016


In [101]:
times_shanghai_cwur = pd.merge(fuso_1, cwur_for_max_finale, on = 'university_name_normalized')[["university_name_normalized", "world_rank_times", "world_rank_shanghai", "world_rank"]]
times_shanghai_cwur.head()

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank
0,Aalborg University,201-250,301-400,565
1,Aalto University,251-300,401-500,421
2,Aarhus University,=106,73,122
3,Aix-Marseille University,251-300,102-150,206
4,Aristotle University of Thessaloniki,601-800,401-500,459
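
The two-step merge can be sketched on toy data to show two things: an inner join keeps only the names present in all three frames, and the third rank column keeps the plain name `world_rank` because no name clash is left at that point. The frames below are illustrative stand-ins, and "Hypothetical University" is an invented row.

```python
import pandas as pd

# Toy frames; the ranks of the two real universities come from the outputs above.
shanghai_toy = pd.DataFrame({
    "university_name_normalized": ["Aarhus University", "Aalto University"],
    "world_rank": ["73", "401-500"],
})
times_toy = pd.DataFrame({
    "university_name_normalized": ["Aarhus University", "Aalto University"],
    "world_rank": ["=106", "251-300"],
})
cwur_toy = pd.DataFrame({
    # "Hypothetical University" appears only here, so the inner join drops it.
    "university_name_normalized": ["Aarhus University", "Hypothetical University"],
    "world_rank": [122, 999],
})

# Step 1: the shared column name gets a suffix on each side.
step_1 = pd.merge(shanghai_toy, times_toy, on="university_name_normalized",
                  suffixes=("_shanghai", "_times"))
# Step 2: no clash remains, so the cwur column stays "world_rank".
merged = pd.merge(step_1, cwur_toy, on="university_name_normalized")
print(merged)
```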


Some entries in the `world_rank_times` column contain the `=` symbol, so we strip it in order to work with the number that follows it.

In [102]:
times_shanghai_cwur['world_rank_times'] = times_shanghai_cwur['world_rank_times'].str.replace('=', '')

We check that no `=` is left.

In [103]:
times_shanghai_cwur[times_shanghai_cwur['world_rank_times'].str.contains('=')]

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank


We also check that there are no null values and no cells containing only `-`:

In [104]:
times_shanghai_cwur[times_shanghai_cwur['world_rank_times'].isna()]

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank


In [105]:
times_shanghai_cwur[times_shanghai_cwur['world_rank_times'].str.contains('^-$')]

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank


The next step has to deal with the different formats of the world rank columns: at least two of them contain ranges of values, so differences cannot be computed directly. The ranges must first be parsed; only then can the ranks be treated as numbers.

In [106]:
def differenza_max_min(elemento):
    # Match ranks expressed as a range, e.g. '201-250'
    str_times = re.match(r'(?P<inf>\d+)-(?P<sup>\d+)', elemento['world_rank_times'])
    str_shanghai = re.match(r'(?P<inf>\d+)-(?P<sup>\d+)', elemento['world_rank_shanghai'])

    # A range is represented by its upper bound, a single rank by itself
    if str_times:
        sup_times = int(str_times.group('sup'))
    else:
        sup_times = int(elemento['world_rank_times'])
    if str_shanghai:
        sup_shanghai = int(str_shanghai.group('sup'))
    else:
        sup_shanghai = int(elemento['world_rank_shanghai'])

    # The cwur rank is already numeric
    cwur_elem = elemento['world_rank']
    massimo = max(sup_times, sup_shanghai, cwur_elem)
    minimo = min(sup_times, sup_shanghai, cwur_elem)

    # Return the difference between the largest and the smallest rank
    return massimo - minimo
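
The parsing step can also be isolated in a small helper. The sketch below is not the notebook's code: `rank_bounds` and `spread_of_upper_bounds` are hypothetical names, and the second function follows the convention of `differenza_max_min`, which takes each range at its upper bound.

```python
import re

RANGE_PATTERN = re.compile(r"^(?P<inf>\d+)-(?P<sup>\d+)$")

def rank_bounds(rank):
    """Return (lower, upper) for a rank such as '201-250', '73' or 122."""
    match = RANGE_PATTERN.match(str(rank))
    if match:
        return int(match.group("inf")), int(match.group("sup"))
    value = int(rank)
    return value, value

def spread_of_upper_bounds(*ranks):
    """Largest minus smallest rank, each range taken at its upper bound."""
    uppers = [rank_bounds(rank)[1] for rank in ranks]
    return max(uppers) - min(uppers)

# Aalborg University: Times 201-250, Shanghai 301-400, cwur 565.
print(rank_bounds("201-250"))                             # (201, 250)
print(spread_of_upper_bounds("201-250", "301-400", 565))  # 315
```

Keeping both bounds available makes it easy to try other conventions, for example the midpoint of each range, without touching the regular expression.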

We apply the function defined above:

In [107]:
times_shanghai_cwur['differenza_max_min'] = times_shanghai_cwur.apply(differenza_max_min, axis=1)
times_shanghai_cwur

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank,differenza_max_min
0,Aalborg University,201-250,301-400,565,315
1,Aalto University,251-300,401-500,421,200
2,Aarhus University,106,73,122,49
3,Aix-Marseille University,251-300,102-150,206,150
4,Aristotle University of Thessaloniki,601-800,401-500,459,341
...,...,...,...,...,...
430,Yale University,12,11,11,1
431,Yeshiva University,164,201-300,171,136
432,Yonsei University,301-350,201-300,98,252
433,York University,301-350,401-500,337,163


We check that Aarhus University actually has a difference of `49` (that is, `122 - 73`):

In [108]:
times_shanghai_cwur[times_shanghai_cwur['university_name_normalized']=='Aarhus University']

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank,differenza_max_min
2,Aarhus University,106,73,122,49


### 5. Consider only the most recent data point of the `times` dataset. Compute the number of male and female students for each country.

We therefore consider the most recent *data points* of the `times` dataset.

In [109]:
times_max_2 = times[times['year'] == times['year'].max()]
times_max_2.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
1803,1,California Institute of Technology,United States of America,95.6,64.0,97.6,99.8,97.8,95.2,2243,6.9,27%,33 : 67,2016
1804,2,University of Oxford,United Kingdom,86.5,94.4,98.9,98.8,73.1,94.2,19919,11.6,34%,46 : 54,2016
1805,3,Stanford University,United States of America,92.5,76.3,96.2,99.9,63.3,93.9,15596,7.8,22%,42 : 58,2016
1806,4,University of Cambridge,United Kingdom,88.2,91.5,96.7,97.0,55.0,92.8,18812,11.8,34%,46 : 54,2016
1807,5,Massachusetts Institute of Technology,United States of America,89.4,84.0,88.6,99.7,95.4,92.0,11074,9.0,33%,37 : 63,2016


As required, null values in the `female_male_ratio` column are ignored in the following steps, and so are values containing `-`: both are treated as missing and excluded from the operations that follow.

In [110]:
times_max_2 = times_max_2[times_max_2['female_male_ratio'].notnull()]
times_max_2 = times_max_2[~times_max_2['female_male_ratio'].str.contains('-')]

We inspect the type of the `num_students` column, since it is ambiguous:

In [111]:
times_max_2['num_students']

1803     2,243
1804    19,919
1805    15,596
1806    18,812
1807    11,074
         ...  
2597    31,618
2598    21,958
2599    31,268
2601    10,117
2602     8,663
Name: num_students, Length: 736, dtype: object

Since the column holds strings that use `,` as a thousands separator rather than numbers, we define a function that removes the `,` and returns an integer.

In [112]:
def convert_to_int(col):
    repl = int(col.replace(',', ''))
    
    return repl
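
The same conversion can be written without `apply`, using the vectorized string methods of pandas. This is a minimal sketch on a toy Series that mimics the formatting of `num_students`; it is an alternative formulation, not the notebook's code.

```python
import pandas as pd

# Toy column with the same formatting as `num_students` above.
num_students = pd.Series(["2,243", "19,919", "15,596"])

# Remove the thousands separator literally (regex=False), then cast to int.
as_int = num_students.str.replace(",", "", regex=False).astype(int)
print(as_int.tolist())  # [2243, 19919, 15596]
```

Another option that may be worth trying is the `thousands=','` argument of `pd.read_csv`, which parses such values at load time.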

The `num_students` column is converted to `int` with the `convert_to_int` function, a necessary step for the arithmetic that follows.

In [113]:
times_max_2['num_students'] = times_max_2['num_students'].apply(convert_to_int)
times_max_2.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
1803,1,California Institute of Technology,United States of America,95.6,64.0,97.6,99.8,97.8,95.2,2243,6.9,27%,33 : 67,2016
1804,2,University of Oxford,United Kingdom,86.5,94.4,98.9,98.8,73.1,94.2,19919,11.6,34%,46 : 54,2016
1805,3,Stanford University,United States of America,92.5,76.3,96.2,99.9,63.3,93.9,15596,7.8,22%,42 : 58,2016
1806,4,University of Cambridge,United Kingdom,88.2,91.5,96.7,97.0,55.0,92.8,18812,11.8,34%,46 : 54,2016
1807,5,Massachusetts Institute of Technology,United States of America,89.4,84.0,88.6,99.7,95.4,92.0,11074,9.0,33%,37 : 63,2016


We define two functions (one for `female` and one for `male`) that extract the relevant percentage with a regular-expression match and turn it into an absolute number of students.

In [114]:
def female_assoluto(col):
    # Match the 'F : M' percentages with two regex groups
    fem_regex = re.match(r'^(\d+) : (\d+)$', col['female_male_ratio'])

    # Turn the female percentage into an absolute number of students
    female = round(int(fem_regex.group(1)) * col['num_students']/100)

    return female

def male_assoluto(col):
    # Match the 'F : M' percentages with two regex groups
    male_regex = re.match(r'^(\d+) : (\d+)$', col['female_male_ratio'])

    # Turn the male percentage into an absolute number of students
    male = round(int(male_regex.group(2)) * col['num_students']/100)

    return male
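
As a worked example of the arithmetic, the sketch below computes both counts in a single function and applies it to two rows shown earlier. `split_by_ratio` is a hypothetical helper, not part of the notebook.

```python
import re

def split_by_ratio(ratio, num_students):
    """Return (female, male) head counts from a 'F : M' percentage string."""
    match = re.match(r"^(\d+) : (\d+)$", ratio)
    female_pct, male_pct = int(match.group(1)), int(match.group(2))
    female = round(female_pct * num_students / 100)
    male = round(male_pct * num_students / 100)
    return female, male

# California Institute of Technology, 2016: '33 : 67' with 2243 students.
print(split_by_ratio("33 : 67", 2243))   # (740, 1503)
# University of Oxford, 2016: '46 : 54' with 19919 students.
print(split_by_ratio("46 : 54", 19919))  # (9163, 10756)
```

Because the two counts are rounded independently, their sum can in principle differ from `num_students` by one for some inputs.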

We apply the two functions to the `times_max_2` dataframe, creating two columns with the absolute numbers of `female` and `male` students:

In [115]:
times_max_2['female'] = times_max_2.apply(female_assoluto, axis=1)
times_max_2['male'] = times_max_2.apply(male_assoluto, axis=1)

Finally, we group by `country` and sum the new columns to obtain the requested result, selecting only the relevant columns for readability:

In [116]:
times_max_2.groupby("country", as_index=False)[['female', 'male']].sum()

Unnamed: 0,country,female,male
0,Argentina,67191,41182
1,Australia,391736,321640
2,Austria,68364,66113
3,Bangladesh,21323,41393
4,Belarus,20219,9084
...,...,...,...
65,Uganda,18670,18670
66,Ukraine,17846,19250
67,United Arab Emirates,9516,4931
68,United Kingdom,711814,613028


### 6. Find the universities where the ratio between female and male is below the average ratio (computed over all universities)

We define a function that computes the ratio between `female` and `male`. If `male`, which is the denominator, equals `0`, the function returns `100`, because the numerator `female` is then equal to `100`. A ratio of `100` therefore denotes an all-female university.

In [117]:
def fem_mal(col):
    # Match the two percentages in the female_male_ratio column
    find_male_zero = re.match(r'^(?P<female>\d+) : (?P<male>\d+)$', col['female_male_ratio'])

    # Convert the female and male values from strings to integers
    female = int(find_male_zero.group('female'))
    male = int(find_male_zero.group('male'))

    # If 'male' equals 0, return 100 directly and treat the university
    # as all-female
    if male == 0:
        return 100
    # Otherwise return the ratio
    else:
        return female/male
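
A vectorized variant of the same rule can be sketched with `str.extract`, which turns the named groups into columns. The Series below is a toy stand-in for `female_male_ratio`, and its last value is hypothetical, chosen only to exercise the `male == 0` branch.

```python
import pandas as pd

# The first two values come from the outputs above; '100 : 0' is hypothetical.
ratios = pd.Series(["33 : 67", "46 : 54", "100 : 0"])

# Named groups become the column names of the extracted frame.
parts = ratios.str.extract(r"^(?P<female>\d+) : (?P<male>\d+)$").astype(int)

# Dividing by zero yields inf in pandas; `where` replaces those rows
# with the sentinel value 100 used for all-female universities.
f_m_ratio = (parts["female"] / parts["male"]).where(parts["male"] > 0, 100.0)
print(f_m_ratio.round(6).tolist())
```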

We apply the function `fem_mal` and store the result in the new column `f_m_ratio_decimal`:

In [118]:
times_max_2['f_m_ratio_decimal'] = times_max_2.apply(fem_mal, axis = 1)

We check the result:

In [119]:
times_max_2.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female,male,f_m_ratio_decimal
1803,1,California Institute of Technology,United States of America,95.6,64.0,97.6,99.8,97.8,95.2,2243,6.9,27%,33 : 67,2016,740,1503,0.492537
1804,2,University of Oxford,United Kingdom,86.5,94.4,98.9,98.8,73.1,94.2,19919,11.6,34%,46 : 54,2016,9163,10756,0.851852
1805,3,Stanford University,United States of America,92.5,76.3,96.2,99.9,63.3,93.9,15596,7.8,22%,42 : 58,2016,6550,9046,0.724138
1806,4,University of Cambridge,United Kingdom,88.2,91.5,96.7,97.0,55.0,92.8,18812,11.8,34%,46 : 54,2016,8654,10158,0.851852
1807,5,Massachusetts Institute of Technology,United States of America,89.4,84.0,88.6,99.7,95.4,92.0,11074,9.0,33%,37 : 63,2016,4097,6977,0.587302


We apply `mean()` to compute the column average:

In [120]:
times_max_2['f_m_ratio_decimal'].mean()

1.2169695629288875

We then select the universities whose female/male ratio is below the average:

In [121]:
sotto_media = times_max_2[times_max_2['f_m_ratio_decimal'] < times_max_2['f_m_ratio_decimal'].mean()]
sotto_media

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female,male,f_m_ratio_decimal
1803,1,California Institute of Technology,United States of America,95.6,64.0,97.6,99.8,97.8,95.2,2243,6.9,27%,33 : 67,2016,740,1503,0.492537
1804,2,University of Oxford,United Kingdom,86.5,94.4,98.9,98.8,73.1,94.2,19919,11.6,34%,46 : 54,2016,9163,10756,0.851852
1805,3,Stanford University,United States of America,92.5,76.3,96.2,99.9,63.3,93.9,15596,7.8,22%,42 : 58,2016,6550,9046,0.724138
1806,4,University of Cambridge,United Kingdom,88.2,91.5,96.7,97.0,55.0,92.8,18812,11.8,34%,46 : 54,2016,8654,10158,0.851852
1807,5,Massachusetts Institute of Technology,United States of America,89.4,84.0,88.6,99.7,95.4,92.0,11074,9.0,33%,37 : 63,2016,4097,6977,0.587302
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2597,601-800,Xidian University,China,17.9,12.8,12.1,8.9,83.7,-,31618,16.4,2%,29 : 71,2016,9169,22449,0.408451
2598,601-800,Yeungnam University,South Korea,18.6,24.3,10.9,26.5,35.4,-,21958,15.3,3%,48 : 52,2016,10540,11418,0.923077
2599,601-800,Yıldız Technical University,Turkey,14.5,14.9,7.6,19.3,44.0,-,31268,28.7,2%,36 : 64,2016,11256,20012,0.562500
2601,601-800,Yokohama National University,Japan,20.1,23.3,16.0,13.5,40.4,-,10117,12.1,8%,28 : 72,2016,2833,7284,0.388889


### 7. For each country, compute the fraction of the students in the country that are in one of the universities computed in the previous point (that is, the denominator of the ratio is the total number of students over all universities in the country).

Starting from the dataframe just created (the universities with a below-average female/male ratio), we group by `country` and sum the number of students to get the total per country.

In [122]:
stud_paese_sotto_media_parz = sotto_media.groupby("country", as_index=False)["num_students"].sum()
stud_paese_sotto_media_parz.head()

Unnamed: 0,country,num_students
0,Australia,295021
1,Austria,79242
2,Bangladesh,62716
3,Belgium,116129
4,Brazil,438476


We build a new dataframe with the total number of students per country:

In [123]:
stud_paese_totali = times_max_2.groupby("country", as_index=False)["num_students"].sum()
stud_paese_totali.head()

Unnamed: 0,country,num_students
0,Argentina,108373
1,Australia,713376
2,Austria,134477
3,Bangladesh,62716
4,Belarus,29303


We merge the two dataframes:

In [124]:
stud_parz_tot_merge = pd.merge(stud_paese_sotto_media_parz, stud_paese_totali, on = 'country', suffixes=('_parz', '_tot'))
stud_parz_tot_merge.head()

Unnamed: 0,country,num_students_parz,num_students_tot
0,Australia,295021,713376
1,Austria,79242,134477
2,Bangladesh,62716,62716
3,Belgium,116129,169661
4,Brazil,438476,494251


We create the column with the required ratio:

In [125]:
stud_parz_tot_merge['ratio_parz_tot'] = (stud_parz_tot_merge['num_students_parz']/stud_parz_tot_merge['num_students_tot'])
stud_parz_tot_merge.head()

Unnamed: 0,country,num_students_parz,num_students_tot,ratio_parz_tot
0,Australia,295021,713376,0.413556
1,Austria,79242,134477,0.589261
2,Bangladesh,62716,62716,1.0
3,Belgium,116129,169661,0.684477
4,Brazil,438476,494251,0.887152
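
Note that the inner merge keeps only the countries with at least one below-average university: Argentina, for instance, appears in the totals but not in the merged result. If such countries should be reported with a fraction of `0`, one possible approach is to align the two totals on the country index. The sketch below uses toy Series with figures copied from the outputs above; `below_avg`, `totals` and `fraction` are illustrative names.

```python
import pandas as pd

# Students in below-average universities, and in all universities, per country.
below_avg = pd.Series({"Australia": 295021}, name="num_students")
totals = pd.Series({"Argentina": 108373, "Australia": 713376}, name="num_students")

# Align on the country index; countries missing from `below_avg` count as 0.
fraction = below_avg.reindex(totals.index, fill_value=0) / totals
print(fraction.round(6))
```

An equivalent result can be obtained with `pd.merge(..., how='right')` followed by `fillna(0)` on the partial counts.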


### 8. Read the file educational_attainment_supplementary_data.csv, discarding any row with missing country_name or series_name

We read the file into a dataframe:

In [126]:
educational_attainment = pd.read_csv('dataset_progetto/educational_attainment_supplementary_data.csv')
educational_attainment

Unnamed: 0,country_name,series_name,1985,1986,1987,1990,1991,1992,1993,1995,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2015
0,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.33,,,0.44,,,,0.57,...,0.86,,,,,1.27,,,,
1,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1.03,,,1.26,,,,1.54,...,2.18,,,,,2.64,,,,
2,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.83,,,0.95,,,,1.26,...,1.01,,,,,2.45,,,,
3,Afghanistan,"Barro-Lee: Average years of primary schooling,...",2.34,,,2.22,,,,2.37,...,2.26,,,,,3.55,,,,
4,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.54,,,0.92,,,,0.94,...,2.00,,,,,1.29,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79050,,,,,,,,,,,...,,,,,,,,,,
79051,,,,,,,,,,,...,,,,,,,,,,
79052,,,,,,,,,,,...,,,,,,,,,,
79053,Data from database: Education Statistics: Educ...,,,,,,,,,,...,,,,,,,,,,


We drop the rows with null values in the columns listed in the `subset` parameter of `dropna()`:

In [127]:
educational_attainment.dropna(subset=['country_name', 'series_name'], inplace=True)
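
The effect of `subset` can be seen on a toy dataframe: a row is dropped when either key column is missing, while missing values in the year columns are left alone. All values below are placeholders, apart from the first `1985` figure.

```python
import pandas as pd

toy = pd.DataFrame({
    "country_name": ["Afghanistan", "Afghanistan", None, "Footer text"],
    "series_name": ["Series A", "Series B", None, None],
    "1985": [0.33, None, None, None],
})

# Rows 2 and 3 lack at least one key column and are removed;
# row 1 stays even though its "1985" value is missing.
cleaned = toy.dropna(subset=["country_name", "series_name"])
print(cleaned)
```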

We check that the result is correct:

In [128]:
educational_attainment[educational_attainment['country_name'].isnull()]

Unnamed: 0,country_name,series_name,1985,1986,1987,1990,1991,1992,1993,1995,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2015


In [129]:
educational_attainment[educational_attainment['series_name'].isnull()]

Unnamed: 0,country_name,series_name,1985,1986,1987,1990,1991,1992,1993,1995,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2015


### 9. From attainment build a dataframe with the same data, but with 4 columns: country_name, series_name, year, value

We check which columns `educational_attainment` currently has:

In [130]:
educational_attainment.columns

Index(['country_name', 'series_name', '1985', '1986', '1987', '1990', '1991',
       '1992', '1993', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2015'],
      dtype='object')

The goal is to turn the dataframe from *wide format* into *long format*, that is, to convert the year columns into row values. The resulting column must also be named `year`.
To do this we use the `.melt()` function, which is designed for this purpose:

In [131]:
melt_educational_df = pd.melt(educational_attainment, id_vars=['country_name', 'series_name'], var_name="year")
melt_educational_df.head()

Unnamed: 0,country_name,series_name,year,value
0,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.33
1,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,1.03
2,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.83
3,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,2.34
4,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.54
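
The reshaping can be followed step by step on a two-row toy table. The sketch also casts `year` to `int`, which is an optional extra step not performed in the notebook; the series names are placeholders.

```python
import pandas as pd

wide = pd.DataFrame({
    "country_name": ["Afghanistan", "Afghanistan"],
    "series_name": ["Series A", "Series B"],  # placeholder series names
    "1985": [0.33, 1.03],
    "1990": [0.44, 1.26],
})

# Every non-id column becomes a (year, value) pair, one row per pair.
long = pd.melt(wide, id_vars=["country_name", "series_name"],
               var_name="year", value_name="value")
# Column labels are strings, so "year" is cast explicitly for numeric use.
long["year"] = long["year"].astype(int)
print(long)
```

Two id rows and two year columns give four rows in the long table, ordered by year column first.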


### 10. For each university, find the number of rankings in which they appear (it suffices to appear in one year for each ranking).

We build the lists of names from the normalized names, then count the occurrences in a dictionary that is later converted into a dataframe.

In [132]:
shanghai_list = list(shanghai_for_max[shanghai_for_max['university_name_normalized'].isnull() == False]['university_name_normalized'].unique())

In [133]:
times_list = list(times_for_max_copy[times_for_max_copy['university_name_normalized'].isnull() == False]['university_name_normalized'].unique())

In [134]:
cwur_list = list(cwurdb_for_max_copy[cwurdb_for_max_copy['university_name_normalized'].isnull() == False]['university_name_normalized'].unique())

In [135]:
final_list = shanghai_list + times_list + cwur_list

In [136]:
# create an empty dictionary
count_dict = {}

In [137]:
for university in final_list:
  if university in count_dict:
    count_dict[university] += 1
  else:
    count_dict[university] = 1

In [138]:
conteggio_ranking = pd.DataFrame(count_dict.items(), columns=["university", "count"])

In [139]:
conteggio_ranking

Unnamed: 0,university,count
0,Aalborg University,3
1,Aalto University,3
2,Aarhus University,3
3,Aix-Marseille University,3
4,Aristotle University of Thessaloniki,3
...,...,...
813,École Normale Supérieure,1
814,École Normale Supérieure de Lyon,1
815,École Polytechnique,2
816,École Polytechnique Fédérale de Lausanne,1
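
The manual dictionary count can also be expressed with `collections.Counter` from the standard library. The sketch below uses three short toy lists in place of `shanghai_list`, `times_list` and `cwur_list`; the membership of each list is invented for illustration.

```python
from collections import Counter

import pandas as pd

# Toy lists of unique normalized names, one list per ranking.
shanghai_names = ["Aarhus University", "Harvard University"]
times_names = ["Aarhus University", "Harvard University", "Xidian University"]
cwur_names = ["Aarhus University"]

# Names are unique within each list, so a name's count is the number
# of rankings it appears in.
counts = Counter(shanghai_names + times_names + cwur_names)
ranking_counts = pd.DataFrame(sorted(counts.items()), columns=["university", "count"])
print(ranking_counts)
```

The count is only meaningful because each list was deduplicated with `unique()` beforehand, as in the cells above.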


### 11. In the `times` ranking, compute the number of times each university appears

In [140]:
final_list_times = list(times['university_name'])

In [141]:
times_dict = {}

In [142]:
for university in final_list_times:
  if university in times_dict:
    times_dict[university] += 1
  else:
    times_dict[university] = 1

In [143]:
conteggio_ranking_university = pd.DataFrame(times_dict.items(), columns=["university", "count"])

In [144]:
conteggio_ranking_university

Unnamed: 0,university,count
0,Harvard University,6
1,California Institute of Technology,6
2,Massachusetts Institute of Technology,6
3,Stanford University,6
4,Princeton University,6
...,...,...
813,Xidian University,1
814,Yeungnam University,1
815,Yıldız Technical University,1
816,Yokohama City University,1
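
The if/else tally used above is the pattern that `collections.Counter` from the standard library implements. A minimal sketch, assuming a hypothetical toy list in place of `times['university_name']`:

```python
from collections import Counter

import pandas as pd

# Hypothetical toy column (assumption) standing in for times['university_name'].
names_toy = ["Harvard University", "Xidian University", "Harvard University"]

# Counter performs the same tally as the if/else dictionary loop.
appearances = Counter(names_toy)

# Convert to a dataframe with the same column names used in the notebook.
appearances_df = pd.DataFrame(list(appearances.items()), columns=["university", "count"])
print(appearances_df)
```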


### 12. Find the universities that appear at most twice in the times ranking.

In [145]:
conteggio_ranking_university[conteggio_ranking_university['count'] < 3]

Unnamed: 0,university,count
45,University of Wisconsin,1
211,Medical University of South Carolina,2
239,University of Medicine and Dentistry of New Je...,1
257,University of Hamburg,1
300,University of Kentucky,2
...,...,...
813,Xidian University,1
814,Yeungnam University,1
815,Yıldız Technical University,1
816,Yokohama City University,1


### 13. The universities that, in any year, have the same position in all three rankings (they must have the same position in a year).

In [146]:
times_copy = times.copy()

In [147]:
shanghai_copy = shanghai_db.copy()

In [148]:
cwur_copy = cwur_db.copy()

In [149]:
times_copy['university_name_normalized'] = times_copy['university_name'].apply(normalize)

In [150]:
shanghai_copy['university_name'] = shanghai_copy['university_name'].apply(str)

In [151]:
shanghai_copy['university_name_normalized'] = shanghai_copy['university_name'].apply(normalize_shanghai)

In [152]:
cwur_copy['university_name_normalized'] = cwur_copy['institution'].apply(normalize_cwur)

In [153]:
shanghai_copy.head()

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year,university_name_normalized
0,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,72.4,2005,Harvard University
1,2,University of Cambridge,1,73.6,99.8,93.4,53.3,56.6,70.9,66.9,2005,University of Cambridge
2,3,Stanford University,2,73.4,41.1,72.2,88.5,70.9,72.3,65.0,2005,Stanford University
3,4,"University of California, Berkeley",3,72.8,71.8,76.0,69.4,73.9,72.2,52.7,2005,"University of California, Berkeley"
4,5,Massachusetts Institute of Technology (MIT),4,70.1,74.0,80.6,66.7,65.8,64.3,53.0,2005,Massachusetts Institute of Technology


The three dataframes are then combined with `pd.merge`, joining on the normalized name and the year:

In [154]:
t_s_fusion_db = pd.merge(times_copy, shanghai_copy , on = ['university_name_normalized', 'year'], suffixes=('_times', '_shanghai'))[['university_name_normalized', 'world_rank_times', 'world_rank_shanghai', 'year']]

In [155]:
t_s_c_fusion_db = pd.merge(t_s_fusion_db, cwur_copy, on = ['university_name_normalized', 'year'])[['university_name_normalized', 'world_rank_times', 'world_rank_shanghai', 'world_rank', 'year']]

In [156]:
t_s_c_fusion_db.head()

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank,year
0,California Institute of Technology,1,6,5,2012
1,Harvard University,2,1,1,2012
2,Stanford University,2,2,3,2012
3,University of Oxford,4,10,7,2012
4,Princeton University,5,7,6,2012
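
To illustrate what the merge keeps, here is a minimal sketch on hypothetical toy frames (only the column names mirror the notebook). With the default inner join, a row survives only if both the normalized name and the year match in both frames, and `suffixes` disambiguates the two `world_rank` columns.

```python
import pandas as pd

# Hypothetical toy frames (assumption); only the column names mirror the notebook.
times_toy = pd.DataFrame({
    "university_name_normalized": ["Stanford University", "Harvard University"],
    "year": [2013, 2013],
    "world_rank": ["2", "4"],
})
shanghai_toy = pd.DataFrame({
    "university_name_normalized": ["Stanford University", "Harvard University"],
    "year": [2013, 2012],
    "world_rank": ["2", "1"],
})

# Inner join (the default): keep rows whose name AND year appear in both frames.
merged = pd.merge(
    times_toy,
    shanghai_toy,
    on=["university_name_normalized", "year"],
    suffixes=("_times", "_shanghai"),
)
print(merged)
```

In this toy case Harvard is dropped because its years differ between the two frames, which is why the merged result can only contain same-year comparisons.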


The `world_rank` column (coming from cwur) must be converted to string: it is numeric, whereas the other two world rank columns are stored as objects, so a direct comparison would never match:

In [157]:
t_s_c_fusion_db['world_rank'] = t_s_c_fusion_db['world_rank'].apply(str)
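
The effect of the dtype mismatch can be sketched on a hypothetical toy frame: comparing a string column with an integer column yields `False` everywhere, while after the cast equal ranks compare as equal.

```python
import pandas as pd

# Hypothetical toy frame (assumption): two rank columns stored as objects
# holding strings, one stored as integers.
ranks = pd.DataFrame({
    "world_rank_times": pd.Series(["2", "5"], dtype=object),
    "world_rank_shanghai": pd.Series(["2", "7"], dtype=object),
    "world_rank": [2, 6],
})

# Comparing strings with integers never matches element-wise.
mismatched = bool((ranks["world_rank_times"] == ranks["world_rank"]).any())

# After casting to str, as the notebook does, equal ranks compare as equal.
ranks["world_rank"] = ranks["world_rank"].astype(str)
matched = (ranks["world_rank_times"] == ranks["world_rank"]).tolist()
print(mismatched, matched)
```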

The task asks only for the rows in which a university holds the same rank in all three rankings in the same year. The merge on `year` already guarantees the same year, so it is enough to filter on equal ranks:

In [158]:
t_s_c_fusion_db_same_rank = t_s_c_fusion_db[(t_s_c_fusion_db['world_rank_times'] == t_s_c_fusion_db['world_rank_shanghai']) & (t_s_c_fusion_db['world_rank_times'] == t_s_c_fusion_db['world_rank'])]
t_s_c_fusion_db_same_rank.head()

Unnamed: 0,university_name_normalized,world_rank_times,world_rank_shanghai,world_rank,year
77,Stanford University,2,2,2,2013


The final result is that *Stanford University* is the only university that holds exactly the same position in all three rankings in the same year.
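
The pairwise equality filter can also be written so that it scales to any number of rank columns. A minimal sketch, assuming hypothetical toy data whose rank columns are already all strings:

```python
import pandas as pd

# Hypothetical toy data (assumption); ranks are strings in all three columns.
fusion_toy = pd.DataFrame({
    "university_name_normalized": ["Stanford University", "Harvard University"],
    "world_rank_times": ["2", "4"],
    "world_rank_shanghai": ["2", "1"],
    "world_rank": ["2", "1"],
    "year": [2013, 2013],
})

rank_columns = ["world_rank_times", "world_rank_shanghai", "world_rank"]

# A row has identical ranks exactly when it holds a single distinct value.
same_rank = fusion_toy[fusion_toy[rank_columns].nunique(axis=1) == 1]
print(same_rank)
```

As with the original filter, this relies on all rank columns sharing the same dtype, otherwise `"2"` and `2` would count as two distinct values.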