# HW2

We first go over the data extraction part for both exercises. Then we show how we analyze the extracted data.

## Data Extraction

The following code shows the import statements and the links shared by the data extraction for both bachelor and master students:

In [1]:
import numpy as np
import pandas as pd
import sys
from bs4 import BeautifulSoup as BSoup
import requests
formLink = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247"
showLink = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247"
r = requests.get(formLink)
soup = BSoup(r.text, "lxml")
# The following function generates the link to the corresponding HTML page for a particular field of study, 
# academic year, and semester.
def showLinkGen(htmlP, htmlV, fieldP, fieldV, yearP, yearV, semP, semV):
  return "{}&{}={}&{}={}&{}={}&{}={}".format(showLink, htmlP, htmlV, fieldP, fieldV, yearP, yearV, semP, semV)
selectFields = soup.find_all("select")
infoField=selectFields[0].find("option", text="Informatique")["value"]
infoFieldParam=selectFields[0]["name"]
semParam=selectFields[2]["name"]
allYears = [y["value"] for y in selectFields[1].find_all("option")[1:]]
yearParam = selectFields[1]["name"]
htmlradiobutton=soup.find("input", type="radio")
# For a given semester and year, returns the HTML page containing the list of the students' information.
def get_html(sem, year):
    link = showLinkGen(htmlradiobutton["name"], htmlradiobutton["value"], infoFieldParam, infoField, yearParam, 
                 year, semParam, sem)
    return requests.get(link)
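
To make the URL construction concrete, here is a minimal sketch of the same idea using `urllib.parse.urlencode` instead of manual string formatting. The base URL, the helper `build_link`, and the parameter names and values are placeholders for illustration, not the ones the ISA form actually exposes.

```python
from urllib.parse import urlencode

# Placeholder base URL that already carries a query string, like `showLink`.
base_link = "http://example.org/report.html?ww_x_GPS=-1"

def build_link(base, params):
    """Append name=value pairs to a URL that already has a query string."""
    return base + "&" + urlencode(params)

# Hypothetical parameter names and values, for illustration only.
example_params = {"field_param": "111", "year_param": "222", "sem_param": "333"}
example_link = build_link(base_link, example_params)
print(example_link)
```

Unlike plain string formatting, `urlencode` also escapes characters that are not URL-safe, which may matter if a form value contains spaces or accents.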

### Bachelor Students' Data Extraction

Assumption
----------
We assume that all bachelor students finish in their 6th semester, i.e. no bachelor student finishes during the 5th semester.

The following code extracts the semester values (bachelor semesters 1 and 6) that are passed, through `get_html`, to the previously defined function `showLinkGen` for bachelor students of Computer Science.

In [2]:
bsSem1=selectFields[2].find("option", text="Bachelor semestre 1")["value"]
bsSem6=selectFields[2].find("option", text="Bachelor semestre 6")["value"]

The following function converts an HTML page containing the relevant part of the list of the students' information to a DataFrame.

In [3]:
def get_bs_dataframe(request_link):
  soup2 = BSoup(request_link.text, "lxml")
  elems=soup2.find_all("tr")[2:]
  titleinfo = soup2.find("font").text.split(', ')
  semester=titleinfo[2]
  semester=int(semester[(len(semester)-2):])
  year=titleinfo[1]
  all_data=[]
  for elem in elems:
    items=elem.find_all("td")
    gender = "M" if (items[0].text == "Monsieur") else "F"
    sciper = int(items[10].text)
    all_data.append({"Scipper": sciper, "Sex": gender, "Year": year, "Semester": semester})
  return pd.DataFrame(all_data)
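
The title parsing in `get_bs_dataframe` can be illustrated in isolation. The title string below is a made-up example that follows the layout the code relies on (field, academic year, and semester separated by `', '`); the exact title text on the ISA pages may differ.

```python
# Hypothetical page title in the "<field>, <academic year>, <semester>" layout.
example_title = "Informatique, 2016-2017, Bachelor semestre 1"

parts = example_title.split(", ")
year = parts[1]                 # academic year, e.g. "2016-2017"
semester = int(parts[2][-2:])   # the last two characters hold the semester number

print(year, semester)
```

Taking the last two characters and calling `int` works because `int` ignores the leading space in `" 1"`.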

The following function uses the previously defined functions and returns an aggregated DataFrame of all bachelor students' information.

In [4]:
def get_bs_alldata():
  all_data = []
  for bsSem in [bsSem1, bsSem6]:
    for year in allYears:
        all_data.append(get_bs_dataframe(get_html(bsSem, year)))
  return pd.concat(all_data)

### Master Students' Data Extraction

Assumption
----------
Master students are considered to have finished their studies only if they are registered for at least one master project.

The following parameters are used for generating the HTML link for all master students.

In [5]:
msSem1_text="Master semestre 1"
msSemPF_text="Projet Master automne"
msSemPS_text="Projet Master printemps"
msSem1=selectFields[2].find("option", text=msSem1_text)["value"]
msSemPF=selectFields[2].find("option", text=msSemPF_text)["value"]
msSemPS=selectFields[2].find("option", text=msSemPS_text)["value"]

The following function converts the given HTML page to a DataFrame containing the relevant information about the master students listed on it.

In [6]:
def get_ms_dataframe(request_link):
  soup2 = BSoup(request_link.text, "lxml")
  elems=soup2.find_all("tr")[2:]
  titleinfo = soup2.find("font")
  if(titleinfo is None):
      return pd.DataFrame([])
  else:
      titleinfo = titleinfo.text.split(', ')
      semester=titleinfo[2]
      semester=1 if semester==msSem1_text else (2 if semester==msSemPF_text else 3)
      year=titleinfo[1]
      all_data=[]
      for elem in elems:
        items=elem.find_all("td")
        gender = "M" if (items[0].text == "Monsieur") else "F"
        sciper = int(items[10].text)
        # True if the corresponding cell is non-empty
        spec = items[4].text != ""
        minor = items[6].text != ""
        all_data.append({"Scipper": sciper, "Sex": gender, "Year": year, "Semester": semester,
                        "Minor": minor, "Specialization": spec})
      return pd.DataFrame(all_data)

The following function aggregates the DataFrames for all master students. 

In [8]:
def get_ms_alldata():
  all_data = []
  for sem in [msSem1, msSemPF, msSemPS]:
    for year in allYears:
        all_data.append(get_ms_dataframe(get_html(sem, year)))
  return pd.concat(all_data)

## Data Analysis

### Exercise 1

The following line assigns the extracted DataFrame to `bs_data`.

In [10]:
bs_data=get_bs_alldata()
bs_data.head()

Unnamed: 0,Scipper,Semester,Sex,Year
0,235688,1,M,2016-2017
1,274015,1,M,2016-2017
2,268410,1,F,2016-2017
3,271464,1,M,2016-2017
4,274518,1,M,2016-2017


When the extracted DataFrames are concatenated, each one keeps its own row labels, so the index of the combined DataFrame contains duplicates. The following line fixes this problem by resetting the index.

In [12]:
bs_data.reset_index(drop=True, inplace=True)
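
The index problem is easy to reproduce on toy data. The sketch below uses two small made-up frames in place of the scraped pages and shows the index before and after resetting it.

```python
import pandas as pd

# Two toy frames standing in for two scraped pages (values are made up).
page_a = pd.DataFrame({"Scipper": [1, 2], "Semester": [1, 1]})
page_b = pd.DataFrame({"Scipper": [3, 4], "Semester": [6, 6]})

combined = pd.concat([page_a, page_b])
index_before = list(combined.index)   # each frame keeps its own labels

combined = combined.reset_index(drop=True)
index_after = list(combined.index)    # one consecutive range of labels

print(index_before, index_after)
```

With duplicate labels, an expression such as `combined.loc[0, 'Semester']` would select several rows at once, which is why the index has to be unique before rows are accessed by label. Passing `ignore_index=True` to `pd.concat` is an alternative that avoids the duplicates in the first place.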

Next, we replace the academic year in place with the calendar year in which the semester starts (the first year for semester 1, the second year for semester 6):

In [13]:
for i in range(bs_data.shape[0]):
    if (bs_data.loc[i,'Semester'] == 1):
        bs_data.loc[i,'Year'] = bs_data.loc[i,'Year'][0:4]
    else:
        bs_data.loc[i,'Year'] = bs_data.loc[i,'Year'][5:9]

bs_data['Year'] = bs_data['Year'].astype('int')
bs_data.head()

Unnamed: 0,Scipper,Semester,Sex,Year
0,235688,1,M,2016
1,274015,1,M,2016
2,268410,1,F,2016
3,271464,1,M,2016
4,274518,1,M,2016
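
The same conversion can also be written without an explicit loop. The following is only a sketch of that alternative on a made-up two-row frame; it is not the code used to produce the output above.

```python
import pandas as pd

# Toy rows (made up): one semester-1 entry and one semester-6 entry.
toy = pd.DataFrame({
    "Semester": [1, 6],
    "Year": ["2016-2017", "2016-2017"],
})

first_year = toy["Year"].str[0:4].astype(int)    # year in which the autumn semester starts
second_year = toy["Year"].str[5:9].astype(int)   # year in which the spring semester starts

# Keep the first year for semester 1, otherwise take the second year.
toy["Year"] = first_year.where(toy["Semester"] == 1, second_year)
print(toy)
```

`Series.where` keeps the values where the condition holds and takes them from the second argument elsewhere, so the whole column is converted in one step.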


Among all bachelor students, we have to consider only those who finish their studies. Based on the assumption stated earlier, we keep only the students who are registered for both semester 1 and semester 6.
We achieve this in the following three steps. First, we select the students who registered for semester 1. If a student is registered more than once, we keep only the first (earliest) occurrence:


In [14]:
idx_sem1 = (bs_data.Scipper).isin(bs_data[bs_data.Semester == 1].Scipper)
data_sem1 = bs_data[idx_sem1].drop('Semester',axis=1)
data_sem1 = data_sem1.sort_values(by ='Year')
data_sem1 = data_sem1.drop_duplicates(['Scipper'],keep='first')
data_sem1 = data_sem1.rename(columns = {'Year':'Year1'})
data_sem1.head()

Unnamed: 0,Scipper,Sex,Year1
1661,180284,M,2007
1691,180853,M,2007
1690,180094,M,2007
1689,181115,M,2007
1688,175576,M,2007


Then, we do the same for the students who registered for semester 6, this time keeping only the last (latest) occurrence:

In [15]:
idx_sem6 = (bs_data.Scipper).isin(bs_data[bs_data.Semester == 6].Scipper)
data_sem6 = bs_data[idx_sem6].drop('Semester',axis=1)
data_sem6 = data_sem6.sort_values(by ='Year')
data_sem6 = data_sem6.drop_duplicates(['Scipper'],keep='last')
data_sem6 = data_sem6.rename(columns = {'Year':'Year6'})
data_sem6.head()

Unnamed: 0,Scipper,Sex,Year6
2383,171042,M,2008
2382,167439,M,2008
2350,161634,M,2008
2351,170451,M,2008
2352,170219,M,2008
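
The effect of `keep='first'` versus `keep='last'` after sorting by year can be seen on a small made-up example in which one student appears twice.

```python
import pandas as pd

# Made-up records: student 10 appears in two different years.
toy = pd.DataFrame({
    "Scipper": [10, 20, 10],
    "Year": [2009, 2008, 2008],
})

ordered = toy.sort_values(by="Year")
earliest = ordered.drop_duplicates(["Scipper"], keep="first")  # earliest year per student
latest = ordered.drop_duplicates(["Scipper"], keep="last")     # latest year per student

print(earliest)
print(latest)
```

Because the rows are sorted by `Year` first, "first" and "last" correspond to the earliest and latest registration of each student.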


Finally, we join the previously constructed DataFrames into one DataFrame containing the information about the starting year and the finishing year of the study.

In [16]:
data_sem16 = pd.merge(data_sem1,data_sem6,how='inner')
data_sem16.head()

Unnamed: 0,Scipper,Sex,Year1,Year6
0,180094,M,2007,2010
1,181115,M,2007,2010
2,181076,M,2007,2011
3,181298,M,2007,2010
4,178433,M,2007,2010
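
As a small illustration of the inner join, the sketch below merges two made-up frames. Only the students present in both frames survive. The `on` argument is spelled out here for readability; when it is omitted, as in the cell above, `pd.merge` joins on the columns the two frames have in common.

```python
import pandas as pd

# Made-up data: students 1 and 3 appear in both frames, 2 and 4 in only one.
first = pd.DataFrame({"Scipper": [1, 2, 3], "Sex": ["M", "F", "M"],
                      "Year1": [2007, 2007, 2008]})
last = pd.DataFrame({"Scipper": [1, 3, 4], "Sex": ["M", "M", "F"],
                     "Year6": [2010, 2012, 2011]})

both = pd.merge(first, last, how="inner", on=["Scipper", "Sex"])
print(both)
```

Students who never reached semester 6 (or who have no semester 1 record) are dropped by the inner join, which matches the assumption about who counts as having finished.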


Based on this DataFrame, we can now compute the stay time of each student in months by adding the `Staytime` column (the difference between the two years, multiplied by 12) and dropping the columns that are no longer needed (`Year1` and `Year6`):

In [17]:
data_sem16['Staytime'] = (data_sem16.Year6 - data_sem16.Year1)*12
data_sem16 = data_sem16.drop(['Year1','Year6'],axis=1)
data_sem16.head()

Unnamed: 0,Scipper,Sex,Staytime
0,180094,M,36
1,181115,M,36
2,181076,M,48
3,181298,M,36
4,178433,M,36


Now, we can partition the data based on the gender of students and compute the mean of their stay time:

In [18]:
data_grouped = data_sem16.groupby('Sex')
data_grouped['Staytime'].mean()

Sex
F    39.724138
M    41.771739
Name: Staytime, dtype: float64

The results show that, on average, male students take about 2 months longer to graduate than female students (41.8 versus 39.7 months).

Now we study the statistical significance of this difference.
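
As a generic illustration of what such a check can look like (a sketch only, and not necessarily the test used in this report), a permutation test estimates how often a difference in means at least as large as the observed one arises when the group labels are shuffled at random. The helper `permutation_p_value` and the two samples are hypothetical and made up for the example.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_rounds=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_rounds

# Made-up stay times in months, for illustration only.
toy_f = [36, 36, 48, 36, 48]
toy_m = [36, 48, 48, 36, 60, 36]
p_value = permutation_p_value(toy_f, toy_m)
print(p_value)
```

A small p-value would suggest that the observed difference is unlikely to be due to chance alone; a large one means the data are compatible with no difference between the groups.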

In [19]:
import scipy.stats as stats

We start by splitting the data into two populations, female and male:

In [20]:
data_F = data_sem16[data_sem16.Sex == 'F']
data_M = data_sem16[data_sem16.Sex == 'M']

As a first step, we study the significance of the average stay time within each population.