<h1> Homework 2 - Data from the Web </h1>

In [46]:
## Importation of everything useful
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
sns.set_context('notebook')

import requests
from bs4 import BeautifulSoup

In [47]:
parameters_page = requests.get('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247')

In [48]:
parameters_html = parameters_page.text

In [49]:
soup = BeautifulSoup(parameters_html, "html.parser")

In [50]:
row_list = []
for parameter_cat_html in soup.find_all('select'):
    for option in parameter_cat_html.contents:
        row_dict = {'category':parameter_cat_html['name'], 'name':option.string, 'value':option['value']}
        row_list.append(row_dict)
parameters = pd.DataFrame(row_list)
parameters.head()

Unnamed: 0,category,name,value
0,ww_x_UNITE_ACAD,,
1,ww_x_UNITE_ACAD,Architecture,942293.0
2,ww_x_UNITE_ACAD,Chimie et génie chimique,246696.0
3,ww_x_UNITE_ACAD,Cours de mathématiques spéciales,943282.0
4,ww_x_UNITE_ACAD,EME (EPFL Middle East),637841336.0


<h2>Creating the base url</h2>

We used Postman to find the base URL leading to the student list: http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml

We append several parameters to this URL. Some of them stay the same for every request, so we decided to hardcode them. These parameters are:
<ul>
<li>
<b>ww_x_GPS</b> = <em>-1</em><br>
This parameter varies between requests but is not useful here. It is probably linked to the GPS position. We set it to -1, which accepts everything.
</li>
<li>
<b>ww_i_reportModel</b> = <em>133685247</em>:<br>
This parameter selects the type of file to retrieve. We chose HTML files.
</li>
<li><b>ww_i_reportModelXsl</b> = <em>133685270</em>:<br>
This parameter selects the type of file to retrieve. We chose HTML files.
</li>
<li><b>ww_x_HIVERETE</b> = <em>null</em>:<br>
This parameter selects between the Winter and Spring semesters. However, it is redundant: Bachelor 1 can only take place in the Winter semester, Bachelor 2 in Spring, Bachelor 3 in Winter, and so forth. We therefore neutralize this parameter by setting its value to null.
</li>
<li><b>ww_x_UNITE_ACAD</b> = <em>249847</em>:<br>
This value represents the "Informatique" section. Since we were asked to consider only the data of this section, the value is the same for every request. If that were not the case, we could vary it in the same way as the year and the period.
</li>


</ul>
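
As an illustration, the fixed parameters listed above can be assembled into a query string with the standard library. This is only a sketch: the names `BASE_URL`, `FIXED_PARAMS` and `build_url` are ours and are not used elsewhere in the notebook; the values are the ones listed above.

```python
from urllib.parse import urlencode

# Base URL found with Postman (see above)
BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml'

# Parameters that stay the same for every request
FIXED_PARAMS = {
    'ww_x_GPS': '-1',
    'ww_i_reportModel': '133685247',
    'ww_i_reportModelXsl': '133685270',
    'ww_x_UNITE_ACAD': '249847',
    'ww_x_HIVERETE': 'null',
}

def build_url(base, fixed, **extra):
    """Return base + '?' + the encoded fixed parameters followed by the extra ones."""
    params = dict(fixed)
    params.update(extra)
    return base + '?' + urlencode(params)

url = build_url(BASE_URL, FIXED_PARAMS)
print(url)
```

Keeping the parameters in a dictionary makes it easier to see which values are fixed and to add the varying ones later as keyword arguments.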

In [51]:
people_base_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=null'

***Interesting parameters***<br>
Therefore we are left with two interesting parameters we want to vary:
the year ("periode academique") and the period ("periode pedagogique").

In [52]:
year_param_string = 'ww_x_PERIODE_ACAD'
period_param_string = 'ww_x_PERIODE_PEDAGO'

year_param = parameters.loc[parameters['category'] == year_param_string]
period_param = parameters.loc[parameters['category'] == period_param_string]

year_param = year_param.drop(year_param.iloc[0].name,axis=0, level=None)
period_param = period_param.drop(period_param.iloc[0].name,axis=0, level=None)
period_param = period_param[period_param['name'] != 'Mise à niveau']
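
The filtering above can be tried on a small hand-made table. The sketch below is a variant, not the cell itself: it removes the empty placeholder option by testing for a missing name instead of dropping the first row by position. The helper `select_options` and the toy table are ours, and the value `'000000'` is a made-up placeholder.

```python
import pandas as pd

# Toy stand-in for the `parameters` table scraped from the form
toy = pd.DataFrame({
    'category': ['ww_x_PERIODE_PEDAGO'] * 4 + ['ww_x_HIVERETE'],
    'name': [None, 'Bachelor semestre 1', 'Mise à niveau', 'Master semestre 1', None],
    'value': ['null', '249108', '000000', '2230106', 'null'],  # '000000' is invented
})

def select_options(parameters, category, excluded_names=()):
    """Keep the named options of one category, minus the excluded names."""
    rows = parameters[parameters['category'] == category]
    rows = rows[rows['name'].notna()]  # drop the empty placeholder option
    return rows[~rows['name'].isin(list(excluded_names))]

periods = select_options(toy, 'ww_x_PERIODE_PEDAGO', excluded_names=['Mise à niveau'])
print(periods)
```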

In [53]:
year_param.head()

Unnamed: 0,category,name,value
21,ww_x_PERIODE_ACAD,2016-2017,355925344
22,ww_x_PERIODE_ACAD,2015-2016,213638028
23,ww_x_PERIODE_ACAD,2014-2015,213637922
24,ww_x_PERIODE_ACAD,2013-2014,213637754
25,ww_x_PERIODE_ACAD,2012-2013,123456101


In [54]:
period_param

Unnamed: 0,category,name,value
32,ww_x_PERIODE_PEDAGO,Bachelor semestre 1,249108
33,ww_x_PERIODE_PEDAGO,Bachelor semestre 2,249114
34,ww_x_PERIODE_PEDAGO,Bachelor semestre 3,942155
35,ww_x_PERIODE_PEDAGO,Bachelor semestre 4,942163
36,ww_x_PERIODE_PEDAGO,Bachelor semestre 5,942120
37,ww_x_PERIODE_PEDAGO,Bachelor semestre 5b,2226768
38,ww_x_PERIODE_PEDAGO,Bachelor semestre 6,942175
39,ww_x_PERIODE_PEDAGO,Bachelor semestre 6b,2226785
40,ww_x_PERIODE_PEDAGO,Master semestre 1,2230106
41,ww_x_PERIODE_PEDAGO,Master semestre 2,942192


*** Build the dataframes and save them to files: ***

In [55]:
def pretify_df(data_frame):
    # Rename each column with the value found in its second row (the real header)
    for i in range(0, len(data_frame.columns)):
        data_frame = data_frame.rename(columns={i: data_frame.loc[1][i]})

    # The first cell of the first row is a comma-separated string that
    # contains the period information; we use it to add the period columns
    string = data_frame['Civilité'][0]
    splitedString = string.split(',')

    data_frame['Period Academic'] = splitedString[1]
    periodPedagogic = splitedString[2].split(' ')[1:4]
    data_frame['Period pedagogic'] = ' '.join(periodPedagogic)

    # Drop the first two rows, which are no longer needed,
    # and drop the columns that contain only NaN
    data_frame = data_frame.drop(data_frame.index[[0, 1]]).dropna(axis=1, how='all')
    return data_frame

In [56]:
def create_df(url):
    # pd.read_html returns a list of dataframes;
    # in our case the list contains a single dataframe,
    # so we select its first item
    return pretify_df(pd.read_html(url)[0])
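
To make the cleaning step easier to follow without querying the server, here is a minimal sketch on a hand-made table. The layout of the raw table (a title row, then a header row, then the students) and the exact title string are assumptions inferred from `pretify_df`; the function `tidy` and the sample rows are ours.

```python
import pandas as pd

# Hand-made stand-in for what pd.read_html is assumed to return:
# row 0 = title, row 1 = real header, remaining rows = students
raw = pd.DataFrame([
    ['Informatique, 2007-2008, Bachelor semestre 1', None, None, None],
    ['Civilité', 'Nom Prénom', 'No Sciper', 'Mineur'],
    ['Monsieur', 'Doe John', '100001', None],
    ['Madame', 'Roe Jane', '100002', None],
])

def tidy(raw):
    """Promote row 1 to header, add the period columns, drop empty columns."""
    parts = [part.strip() for part in raw.iloc[0, 0].split(',')]
    out = raw.iloc[2:].copy()
    out.columns = raw.iloc[1].tolist()
    out['Period Academic'] = parts[1]
    out['Period pedagogic'] = ' '.join(parts[2].split(' ')[:3])
    return out.dropna(axis=1, how='all')  # 'Mineur' is all-NaN here

clean = tidy(raw)
print(clean)
```

Unlike `pretify_df`, this sketch strips the whitespace around each part of the title, so the academic period has no leading space.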

In [57]:
def url_param_str(param, value):
    return '&' + param + '=' + value

# Request every (year, period) combination and pickle the tables that exist
for i, year in year_param.iterrows():
    for j, period in period_param.iterrows():
        url_i = people_base_url + url_param_str(year.category, year.value) + url_param_str(period.category, period.value)
        file_name_i = './data/' + str(year['name'] + period['name'])

        print(url_i)
        try:
            df_i = create_df(url_i)
        except (ValueError, KeyError):
            print('-------> no file here!')
        else:
            df_i.to_pickle(file_name_i)


http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=null&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=249108
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=null&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=249114
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=null&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=942155
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=null&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=942163
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_re
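
The enumeration of year and period combinations can also be sketched without any network access. In the sketch below the fixed parameters are left out for brevity, the two small dictionaries reuse values shown in the tables above, and `iter_requests` is a hypothetical helper of ours.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml'

# A few (name -> value) pairs taken from the parameter tables above
years = {'2016-2017': '355925344', '2015-2016': '213638028'}
periods = {'Bachelor semestre 1': '249108', 'Bachelor semestre 2': '249114'}

def iter_requests(years, periods):
    """Yield (file name, url) for every year/period combination."""
    for (y_name, y_val), (p_name, p_val) in product(years.items(), periods.items()):
        query = urlencode({'ww_x_PERIODE_ACAD': y_val, 'ww_x_PERIODE_PEDAGO': p_val})
        yield y_name + p_name, BASE_URL + '?' + query

jobs = list(iter_requests(years, periods))
for file_name, url in jobs:
    print(file_name, url)
```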

## Exercise 1
We load the stored data and create one list of dataframes for Bachelor 1 and one for Bachelor 6.

In [58]:
import glob

bachelors_sem1 = glob.glob('.\\data\\[0-9]*-[0-9]*Bachelor semestre 1')
bachelors_sem6 = glob.glob('.\\data\\[0-9]*-[0-9]*Bachelor semestre 6')

sem1_dfs = [pd.read_pickle(data_file) for data_file in bachelors_sem1]
sem6_dfs = [pd.read_pickle(data_file) for data_file in bachelors_sem6]

In [59]:
sem1_dfs[4].head(1)

Unnamed: 0,Civilité,Nom Prénom,Statut,No Sciper,Period Academic,Period pedagogic
2,Monsieur,Aiulfi Loris Sandro,Présent,202293,2011-2012,Bachelor semestre 1


We convert the academic year to a single year: the first semester of the Bachelor starts in the first year of the range, and Bachelor 6 ends in the second year.

Ex: for Bachelor 1 in 2012-2013, the correcting_year function changes it to 2012, and
    for Bachelor 6 in 2014-2015, the correcting_year function changes it to 2015.
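
The conversion rule can be written as two tiny helpers that work on a single string. This is a sketch with names of our own (`start_year`, `end_year`); it assumes the period always has the form `YYYY-YYYY`, possibly surrounded by spaces.

```python
def start_year(academic_period):
    """First year of a 'YYYY-YYYY' period, used for Bachelor 1."""
    return int(academic_period.strip().split('-')[0])

def end_year(academic_period):
    """Second year of a 'YYYY-YYYY' period, used for Bachelor 6."""
    return int(academic_period.strip().split('-')[1])

print(start_year(' 2012-2013'))  # Bachelor 1 example from the text
print(end_year('2014-2015'))     # Bachelor 6 example from the text
```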

In [60]:
def correcting_year(listOne, listSix):
    #For all Bachelor 1, we select the first year of the column Period Academic
    for df in listOne:
        df["Period Academic"]=int(df['Period Academic'][2].split('-')[0].replace(" ", ""))

    #For all Bachelor 6, we select the second year of the column Period Academic
    for df in listSix:
        df["Period Academic"] = int(df['Period Academic'][2].split('-')[1])   

In [61]:
correcting_year(sem1_dfs, sem6_dfs)

In [62]:
sem1_dfs[4].head(1)

Unnamed: 0,Civilité,Nom Prénom,Statut,No Sciper,Period Academic,Period pedagogic
2,Monsieur,Aiulfi Loris Sandro,Présent,202293,2011,Bachelor semestre 1


We first concatenate all the dataframes of each semester into one big dataframe,
then we sort it by ascending Period Academic.
Finally, for Bachelor 1, we drop duplicates and keep the first entry,
and for Bachelor 6, we drop duplicates and keep the last entry.
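
The effect of `keep='first'` versus `keep='last'` after sorting is easiest to see on a few made-up rows. The student numbers and years below are invented for the illustration.

```python
import pandas as pd

# Invented registrations: students 1 and 2 appear in two different years
toy = pd.DataFrame({
    'No Sciper': [1, 2, 1, 3, 2],
    'Period Academic': [2008, 2007, 2007, 2009, 2009],
})

ordered = toy.sort_values('Period Academic', kind='stable')

# Earliest registration of each student (what we want for Bachelor 1)
first_seen = ordered.drop_duplicates(subset='No Sciper', keep='first')
# Latest registration of each student (what we want for Bachelor 6)
last_seen = ordered.drop_duplicates(subset='No Sciper', keep='last')

print(first_seen)
print(last_seen)
```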

In [63]:

bachelors_sem1 = pd.concat(sem1_dfs, ignore_index=True)
bachelors_sem1 = bachelors_sem1.sort_values(['Period Academic'], ascending=True)
bachelors_sem1 = bachelors_sem1.drop_duplicates(subset='No Sciper', keep='first')


bachelors_sem6 = pd.concat(sem6_dfs, ignore_index=True)
bachelors_sem6 = bachelors_sem6.sort_values(['Period Academic'], ascending=True)
bachelors_sem6 = bachelors_sem6.drop_duplicates(subset='No Sciper', keep='last')

bachelors_sem1.shape

(1323, 6)

In [64]:
bachelors_sem6.shape

(516, 9)

We can observe that far fewer students appear in Bachelor 6 than in Bachelor 1: 516 versus 1323.

We merge the two big dataframes on the No Sciper column.

In [65]:
res = pd.merge(bachelors_sem1, bachelors_sem6, on='No Sciper', how='inner')
res.head(2)

Unnamed: 0,Civilité_x,Nom Prénom_x,Statut_x,No Sciper,Period Academic_x,Period pedagogic_x,Civilité_y,Ecole Echange,Filière opt.,Nom Prénom_y,Period Academic_y,Period pedagogic_y,Statut_y,Type Echange
0,Monsieur,Arévalo Christian,Présent,169569,2007,Bachelor semestre 1,Monsieur,,,Arévalo Christian,2010,Bachelor semestre 6,Présent,
1,Monsieur,Obrist Damien,Présent,179194,2007,Bachelor semestre 1,Monsieur,Carnegie Mellon University Pittsburgh,,Obrist Damien,2010,Bachelor semestre 6,Congé,Bilatéral


In [66]:
res.shape

(397, 14)

We observe that the number of students who appear in both Bachelor 1 and Bachelor 6 at EPFL (397) is lower than the total number of students in Bachelor 6 (516). We think this is because of exchange students or students who entered directly in the second year (Passerelle).
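
One way to inspect the students who are in Bachelor 6 but have no Bachelor 1 record is a left merge with `indicator=True`. The sketch below uses invented student numbers; it only shows the technique and was not run on the real data.

```python
import pandas as pd

# Invented student numbers for the illustration
sem1 = pd.DataFrame({'No Sciper': [1, 2, 3]})
sem6 = pd.DataFrame({'No Sciper': [2, 3, 4, 5]})

# '_merge' tells, for each Bachelor 6 row, whether a Bachelor 1 row was found
check = pd.merge(sem6, sem1, on='No Sciper', how='left', indicator=True)
no_sem1 = check[check['_merge'] == 'left_only']

print(no_sem1['No Sciper'].tolist())
```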

We create a new column registering the duration of the Bachelor

In [67]:
final_res = pd.DataFrame({'No Sciper': res['No Sciper'], 'Civilité' :res['Civilité_x'], 'Duration of Bachelor': res['Period Academic_y'] - res['Period Academic_x']})

In [68]:
#Partition the data between male and female students
grouped_gender = final_res.groupby('Civilité')

In [69]:
#average duration for female
df_madame = grouped_gender.get_group('Madame')
df_madame.mean()

Duration of Bachelor    3.310345
dtype: float64

In [70]:
#average duration for male
df_monsieur = grouped_gender.get_group('Monsieur')
df_monsieur.mean()

Duration of Bachelor    3.480978
dtype: float64

In [71]:
import scipy.stats

The Wilcoxon rank-sum test checks the null hypothesis that two sets of measurements are drawn from the same distribution. Here, the null hypothesis is that the male and female Bachelor durations come from the same distribution.

In [72]:
res = scipy.stats.ranksums( df_madame['Duration of Bachelor'].tolist(), df_monsieur['Duration of Bachelor'].tolist())
res.pvalue

0.39364493314799009

As explained in https://en.wikipedia.org/wiki/P-value, a p-value above 0.1 means that we cannot reject the null hypothesis, so we conclude that there is no significant difference between the two groups.
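
The decision rule can be made explicit in a small helper. This is a sketch on invented durations: the function name `rank_sum_decision` is ours, and the 0.1 threshold simply mirrors the one used in the text.

```python
from scipy.stats import ranksums

def rank_sum_decision(sample_a, sample_b, alpha=0.1):
    """Return (p-value, True if the null hypothesis is rejected at level alpha)."""
    result = ranksums(sample_a, sample_b)
    return result.pvalue, bool(result.pvalue < alpha)

# Invented durations (in years), for the illustration only
women = [3, 3, 4, 3, 5]
men = [3, 4, 4, 3, 5, 3]

p_value, rejected = rank_sum_decision(women, men)
print(p_value, rejected)
```

When `rejected` is False, the data do not provide enough evidence of a difference; this is not a proof that the two distributions are identical.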

## Exercise 2

In [360]:
masters_sem1 = glob.glob('.\\data\\[0-9]*-[0-9]*Master semestre 1')
masters_sem2 = glob.glob('.\\data\\[0-9]*-[0-9]*Master semestre 2')
masters_sem3 = glob.glob('.\\data\\[0-9]*-[0-9]*Master semestre 3')
projectsA = glob.glob('.\\data\\[0-9]*-[0-9]*Projet Master automne')
projectsP = glob.glob('.\\data\\[0-9]*-[0-9]*Projet Master printemps')

masters_sem1_df_list = [pd.read_pickle(data_file) for data_file in masters_sem1]
masters_sem2_df_list = [pd.read_pickle(data_file) for data_file in masters_sem2]
masters_sem3_df_list = [pd.read_pickle(data_file) for data_file in masters_sem3]
projectsA_df_list = [pd.read_pickle(data_file) for data_file in projectsA]
projectsP_df_list = [pd.read_pickle(data_file) for data_file in projectsP]

In [361]:
projectsP_df_list[0]

Unnamed: 0,Civilité,Nom Prénom,Spécialisation,Statut,No Sciper,Period Academic,Period pedagogic
2,Monsieur,Brutsche Florian,Internet computing,Congé,159852,2007-2008,Projet Master printemps
3,Monsieur,Dotta Mirco,,Stage,153819,2007-2008,Projet Master printemps
4,Monsieur,Hügli Michael,,Stage,145957,2007-2008,Projet Master printemps
5,Monsieur,Indra Saurabh,,Présent,173257,2007-2008,Projet Master printemps
6,Monsieur,Lépine Simon,Biocomputing,Présent,160150,2007-2008,Projet Master printemps
7,Monsieur,Stewart Conail,,Présent,173527,2007-2008,Projet Master printemps


In [362]:
def correcting_year_Automn(automnDf):
    # We select the first year of the column Period Academic and add 0.5, as the autumn semester falls in the second half of that year
    for df in automnDf:
        df["Period Academic"]=int(df['Period Academic'][2].split('-')[0].replace(" ", "")) + 0.5

In [363]:
correcting_year_Automn(masters_sem1_df_list)
correcting_year_Automn(projectsA_df_list)

In [364]:
masters_sem1_df_list[0].head(3)

Unnamed: 0,Civilité,Nom Prénom,Spécialisation,Statut,Type Echange,Ecole Echange,No Sciper,Period Academic,Period pedagogic
2,Monsieur,Aeberhard François-Xavier,,Présent,,,153066,2007.5,Master semestre 1
3,Madame,Agarwal Megha,,Présent,,,180027,2007.5,Master semestre 1
4,Monsieur,Anagnostaras David,,Présent,,,152232,2007.5,Master semestre 1


In [365]:
def correcting_year_Spring(springDf):
    # We select the second year of the column Period Academic
    for df in springDf:
        df["Period Academic"] = int(df['Period Academic'][2].split('-')[1])  

In [366]:
correcting_year_Spring(masters_sem2_df_list)
correcting_year_Spring(projectsP_df_list)

In [367]:
masters_sem2_df_list[0].head(3)

Unnamed: 0,Civilité,Nom Prénom,Spécialisation,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Period Academic,Period pedagogic
2,Monsieur,Aeberhard François-Xavier,,,Présent,,,153066,2008,Master semestre 2
3,Madame,Agarwal Megha,,,Présent,,,180027,2008,Master semestre 2
4,Monsieur,Anagnostaras David,,"Mineur en Management, technologie et entrepren...",Présent,,,152232,2008,Master semestre 2


We stack the master 2 dataframe after the master 1 dataframe (using pd.concat, since DataFrame.append was removed in pandas 2.0)

In [368]:
master1_df = pd.concat(masters_sem1_df_list, ignore_index=True)
master2_df = pd.concat(masters_sem2_df_list, ignore_index=True)
master_courses = pd.concat([master1_df, master2_df])
master_courses.head(2)

Unnamed: 0,Civilité,Ecole Echange,Mineur,No Sciper,Nom Prénom,Period Academic,Period pedagogic,Spécialisation,Statut,Type Echange
0,Monsieur,,,153066,Aeberhard François-Xavier,2007.5,Master semestre 1,,Présent,
1,Madame,,,180027,Agarwal Megha,2007.5,Master semestre 1,,Présent,


We sort the master 1 and master 2 rows by date and then by semester. If a master 1 row and a master 2 row have the same date, we put master 2 first: this means that the student started the master in spring, so the first registration is marked as master 2.
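
The tie-breaking rule can be sketched with a vectorized key built by `np.where` instead of a row-wise `apply`. The rows below are invented, and student 1 is deliberately given two rows with the same date to trigger the tie.

```python
import numpy as np
import pandas as pd

# Invented rows: student 1 has a master 1 and a master 2 row on the same date
toy = pd.DataFrame({
    'No Sciper': [1, 1, 2, 2],
    'Period Academic': [2008.0, 2008.0, 2007.5, 2008.0],
    'Period pedagogic': ['Master semestre 1', 'Master semestre 2',
                         'Master semestre 1', 'Master semestre 2'],
})

# -2 sorts before -1, so master 2 wins when the dates are equal
toy['sorting semester'] = np.where(toy['Period pedagogic'] == 'Master semestre 1', -1, -2)

first_rows = (toy.sort_values(['Period Academic', 'sorting semester'])
                 .drop_duplicates(subset='No Sciper', keep='first'))
print(first_rows)
```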

In [369]:
def prepare_sort(row):
    if row['Period pedagogic'] == 'Master semestre 1':
        return -1
    else:
        return -2

master_courses2 = master_courses.copy()
master_courses2['sorting semester'] = master_courses2.apply(prepare_sort, axis=1)
master_courses2.tail(2)
#serie = np.where(master_courses2['Period Academic'] == 'Master semester 1', -1, -2)
#master_courses2['sorting semester'] = {-1 if (master_courses2['Period Academic'] == 'Master semestre 1') else -2}
#master_courses2.head()
#serie

Unnamed: 0,Civilité,Ecole Echange,Mineur,No Sciper,Nom Prénom,Period Academic,Period pedagogic,Spécialisation,Statut,Type Echange,sorting semester
1060,Monsieur,,,268709,Dezfuli Seyyed Sina,2017.0,Master semestre 2,,Présent,,-2
1061,Monsieur,,,205771,Marengo Julien Lionel,2017.0,Master semestre 2,,Présent,,-2


In [370]:
master_courses_sorted = master_courses2.sort_values(['Period Academic', 'sorting semester'], ascending=True)
master_courses_sorted = master_courses_sorted.drop_duplicates(subset='No Sciper', keep='first')
master_courses_sorted.head(2)

Unnamed: 0,Civilité,Ecole Echange,Mineur,No Sciper,Nom Prénom,Period Academic,Period pedagogic,Spécialisation,Statut,Type Echange,sorting semester
0,Monsieur,,,153066,Aeberhard François-Xavier,2007.5,Master semestre 1,,Présent,,-1
1,Madame,,,180027,Agarwal Megha,2007.5,Master semestre 1,,Présent,,-1


In [371]:
mp_a = pd.concat(projectsA_df_list, ignore_index=True)
mp_p = pd.concat(projectsP_df_list, ignore_index=True)

In [372]:
master_proj_df = pd.concat([mp_a, mp_p], ignore_index=True)

master_proj_df = master_proj_df.sort_values(['Period Academic'], ascending=True)
master_proj_df = master_proj_df.drop_duplicates(subset='No Sciper', keep='last')
master_proj_df.head()

Unnamed: 0,Civilité,Ecole Echange,Mineur,No Sciper,Nom Prénom,Period Academic,Period pedagogic,Spécialisation,Statut,Type Echange
88,Monsieur,,,145957,Hügli Michael,2008.0,Projet Master printemps,,Stage,
89,Monsieur,,,173257,Indra Saurabh,2008.0,Projet Master printemps,,Présent,
90,Monsieur,,,160150,Lépine Simon,2008.0,Projet Master printemps,Biocomputing,Présent,
91,Monsieur,,,173527,Stewart Conail,2008.0,Projet Master printemps,,Présent,
87,Monsieur,,,153819,Dotta Mirco,2008.0,Projet Master printemps,,Stage,


In [373]:
students = pd.merge(master_courses_sorted, master_proj_df, on='No Sciper', how='inner')
students.head()

Unnamed: 0,Civilité_x,Ecole Echange_x,Mineur_x,No Sciper,Nom Prénom_x,Period Academic_x,Period pedagogic_x,Spécialisation_x,Statut_x,Type Echange_x,sorting semester,Civilité_y,Ecole Echange_y,Mineur_y,Nom Prénom_y,Period Academic_y,Period pedagogic_y,Spécialisation_y,Statut_y,Type Echange_y
0,Madame,,,180027,Agarwal Megha,2007.5,Master semestre 1,,Présent,,-1,Madame,,,Agarwal Megha,2008.5,Projet Master automne,,Stage,
1,Madame,,,154573,Benabdallah Zeineb,2007.5,Master semestre 1,,Présent,,-1,Madame,,,Benabdallah Zeineb,2010.0,Projet Master printemps,Biocomputing,Présent,
2,Monsieur,,,172687,Billaud Joël,2007.5,Master semestre 1,,Présent,,-1,Monsieur,,,Billaud Joël,2009.0,Projet Master printemps,,Stage,
3,Monsieur,,,180072,Campora Simone,2007.5,Master semestre 1,Internet computing,Présent,,-1,Monsieur,,,Campora Simone,2009.0,Projet Master printemps,Internet computing,Stage,
4,Monsieur,,,160225,Cassata Alexandre,2007.5,Master semestre 1,,Présent,,-1,Monsieur,,,Cassata Alexandre,2009.0,Projet Master printemps,,Stage,


In [374]:
students = pd.DataFrame({'No Sciper': students['No Sciper'], 'Civilité' :students['Civilité_x'], 'Spécialisation': students['Spécialisation_y'], 'Mineur': students['Mineur_y'], 'Duration of Master': students['Period Academic_y'] - students['Period Academic_x']})

In [375]:
students.head()

Unnamed: 0,Civilité,Duration of Master,Mineur,No Sciper,Spécialisation
0,Madame,1.0,,180027,
1,Madame,2.5,,154573,Biocomputing
2,Monsieur,1.5,,172687,
3,Monsieur,1.5,,180072,Internet computing
4,Monsieur,1.5,,160225,


In [376]:
def make_final_duration(row):
    duration = float(row['Duration of Master'])
    minor = row['Mineur']
    spec = row['Spécialisation']
    
    no_m_sp = pd.isnull(minor) and pd.isnull(spec)
    
    if duration < 1.5 and no_m_sp:
        return 1.5
    if duration < 2. and not no_m_sp:
        if duration == 1.5:
            return 2.
        else:
            return 2.5
    else:
        return duration

In [377]:
students['Final Duration'] = students.apply(make_final_duration, axis=1)

In [379]:
students.describe()

Unnamed: 0,Duration of Master,Final Duration
count,115.0,115.0
mean,1.617391,1.8
std,0.510119,0.361527
min,1.0,1.5
25%,1.0,1.5
50%,1.5,2.0
75%,2.0,2.0
max,3.5,3.5
