# HW2

We first go over the data extraction part for both exercises. Then we show how we analyze the extracted data.

## Data Extraction

The following code shows the import statements and the links shared by the data extraction for both bachelor and master students:

In [1]:
import numpy as np
import pandas as pd
import sys
from bs4 import BeautifulSoup as BSoup
import requests
formLink = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247"
showLink = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247"
r = requests.get(formLink)
soup = BSoup(r.text, "lxml")
# The following function generates the link to the corresponding HTML page for a particular field of study, 
# academic year, and semester.
def showLinkGen(htmlP, htmlV, fieldP, fieldV, yearP, yearV, semP, semV):
  return "{}&{}={}&{}={}&{}={}&{}={}".format(showLink, htmlP, htmlV, fieldP, fieldV, yearP, yearV, semP, semV)
selectFields = soup.find_all("select")
infoField=selectFields[0].find("option", text="Informatique")["value"]
infoFieldParam=selectFields[0]["name"]
semParam=selectFields[2]["name"]
allYears = [y["value"] for y in selectFields[1].find_all("option")[1:]]
yearParam = selectFields[1]["name"]
htmlradiobutton=soup.find("input", type="radio")
# For a given semester and year, returns the HTML page containing the list of the students' information.
def get_html(sem, year):
    link = showLinkGen(htmlradiobutton["name"], htmlradiobutton["value"], infoFieldParam, infoField, yearParam, 
                 year, semParam, sem)
    return requests.get(link)
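
As a minimal sketch of how `showLinkGen` assembles the query string, the standalone snippet below rebuilds the same pattern with `urllib.parse.urlencode`. The base URL, the helper name `build_link`, and all parameter names and values are placeholders chosen for illustration; they are not the values scraped from the ISA form.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameters, for illustration only.
BASE = "http://example.org/report.html?model=1"

def build_link(base, params):
    """Append key=value pairs to a URL that already has a query string."""
    return "{}&{}".format(base, urlencode(params))

link = build_link(BASE, {"format": "html", "unit": "42", "year": "2007", "sem": "1"})
print(link)
```

Unlike plain string formatting, `urlencode` also escapes special characters in the values, which may matter if an option value contains spaces or accents.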

### Bachelor Students' Data Extraction

Assumption
----------
We assume that all bachelor students finish in their 6th semester, i.e. no bachelor student finishes during the 5th semester.

The following code extracts the parameters which should be passed to the previous function (`showLinkGen`) for all bachelor students of Computer Science.

In [2]:
bsSem1=selectFields[2].find("option", text="Bachelor semestre 1")["value"]
bsSem6=selectFields[2].find("option", text="Bachelor semestre 6")["value"]

The following function converts an HTML page containing the relevant part of the list of the students' information to a DataFrame.

In [3]:
def get_bs_dataframe(request_link):
  soup2 = BSoup(request_link.text, "lxml")
  elems=soup2.find_all("tr")[2:]
  titleinfo = soup2.find("font").text.split(', ')
  semester=titleinfo[2]
  semester=int(semester[(len(semester)-2):])
  year=titleinfo[1]
  all_data=[]
  for elem in elems:
    items=elem.find_all("td")
    gender = "M" if (items[0].text == "Monsieur") else "F"
    sciper = int(items[10].text)
    all_data.append({"Scipper": sciper, "Sex": gender, "Year": year, "Semester": semester})
  return pd.DataFrame(all_data)
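
To make the title parsing in `get_bs_dataframe` easier to follow, here is a small standalone sketch of that step. The helper name `parse_title` and the sample title string are assumptions; the real page title may be formatted differently.

```python
def parse_title(title):
    """Split a 'field, year, semester' title into (year, semester number)."""
    parts = title.split(", ")
    year = parts[1]
    # The semester number is read from the last two characters,
    # as in get_bs_dataframe; int() ignores the leading space.
    semester = int(parts[2][-2:])
    return year, semester

# Hypothetical title string, for illustration only.
print(parse_title("Informatique, 2007-2008, Bachelor semestre 6"))
```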

The following function uses the previously defined functions and returns an aggregated DataFrame of all bachelor students' information.

In [4]:
def get_bs_alldata():
  all_data = []
  for bsSem in [bsSem1, bsSem6]:
    for year in allYears:
        all_data.append(get_bs_dataframe(get_html(bsSem, year)))
  return pd.concat(all_data)

### Master Students' Data Extraction

Assumption
----------
Master students are considered to finish their studies only if they are registered for at least one master project.

The following parameters are used for generating the HTML link for all master students.

In [5]:
msSem1_text="Master semestre 1"
msSemPF_text="Projet Master automne"
msSemPS_text="Projet Master printemps"
msSem1=selectFields[2].find("option", text=msSem1_text)["value"]
msSemPF=selectFields[2].find("option", text=msSemPF_text)["value"]
msSemPS=selectFields[2].find("option", text=msSemPS_text)["value"]

The following function converts the given HTML page to a DataFrame containing the relevant information about a master student.

In [6]:
def get_ms_dataframe(request_link):
  soup2 = BSoup(request_link.text, "lxml")
  elems=soup2.find_all("tr")[2:]
  titleinfo = soup2.find("font")
  if(titleinfo is None):
      return pd.DataFrame([])
  else:
      titleinfo = titleinfo.text.split(', ')
      semester=titleinfo[2]
      semester=1 if semester==msSem1_text else (2 if semester==msSemPF_text else 3)
      year=titleinfo[1]
      all_data=[]
      for elem in elems:
        items=elem.find_all("td")
        gender = "M" if (items[0].text == "Monsieur") else "F"
        sciper = int(items[10].text)
        spec = items[4].text
        minor = items[6].text != ""
        all_data.append({"Scipper": sciper, "Sex": gender, "Year": year, "Semester": semester,
                        "Minor": minor, "Specialization": spec})
      return pd.DataFrame(all_data)

The following function aggregates the DataFrames for all master students. 

In [None]:
def get_ms_alldata():
  all_data = []
  for sem in [msSem1, msSemPF, msSemPS]:
    for year in allYears:
        all_data.append(get_ms_dataframe(get_html(sem, year)))
  return pd.concat(all_data)

## Data Analysis

### Exercise 1

The following line assigns the extracted DataFrame to `bs_data`.

In [None]:
bs_data=get_bs_alldata()
bs_data.head()

Because the extracted DataFrames are simply concatenated, each one keeps its own row labels, so the index of the result contains duplicates. The following line rebuilds a unique index.

In [None]:
bs_data.reset_index(None,drop=True,inplace=True)
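
The effect of the reset can be seen on a toy example. The two small frames below are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"Scipper": [1, 2]})
b = pd.DataFrame({"Scipper": [3, 4]})

stacked = pd.concat([a, b])
print(stacked.index.tolist())   # the labels of a and b are both kept

clean = stacked.reset_index(drop=True)
print(clean.index.tolist())     # a fresh 0..n-1 index
```

Passing `ignore_index=True` to `pd.concat` would likely achieve the same result in one step.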

Then we replace the academic year (e.g. `2007-2008`) in place with the calendar year in which the semester takes place: the first year for semester 1, the second year for semester 6.

In [None]:
for i in range(bs_data.shape[0]):
    if (bs_data.loc[i,'Semester'] == 1):
        bs_data.loc[i,'Year'] = bs_data.loc[i,'Year'][0:4]
    else:
        bs_data.loc[i,'Year'] = bs_data.loc[i,'Year'][5:9]

bs_data['Year'] = bs_data['Year'].astype('int')
bs_data.head()
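
As a possible alternative to the row-by-row loop, the same conversion can be sketched in vectorized form with `np.where`. The frame `df` below is toy data in the same shape as `bs_data`, not the scraped records.

```python
import numpy as np
import pandas as pd

# Toy rows in the same shape as bs_data (values are made up).
df = pd.DataFrame({
    "Semester": [1, 6, 1],
    "Year": ["2007-2008", "2009-2010", "2008-2009"],
})

start = df["Year"].str[0:4].astype(int)   # calendar year in which the academic year starts
end = df["Year"].str[5:9].astype(int)     # calendar year in which it ends
df["Year"] = np.where(df["Semester"] == 1, start, end)
print(df)
```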

Among all bachelor students, we only consider those who finished their studies. Based on the assumption stated above, we keep only the students who registered for both semester 1 and semester 6.
We achieve this in three steps. First, we select the students who registered for semester 1. If a student registered for semester 1 more than once, we keep only the first occurrence:


In [None]:
idx_sem1 = (bs_data.Scipper).isin(bs_data[bs_data.Semester == 1].Scipper)
data_sem1 = bs_data[idx_sem1].drop('Semester',axis=1)
data_sem1 = data_sem1.sort_values(by ='Year')
data_sem1 = data_sem1.drop_duplicates(['Scipper'],keep='first')
data_sem1 = data_sem1.rename(columns = {'Year':'Year1'})
data_sem1.head()

Then, we do the same for the students who registered for semester 6, this time keeping only the last occurrence:

In [None]:
idx_sem6 = (bs_data.Scipper).isin(bs_data[bs_data.Semester == 6].Scipper)
data_sem6 = bs_data[idx_sem6].drop('Semester',axis=1)
data_sem6 = data_sem6.sort_values(by ='Year')
data_sem6 = data_sem6.drop_duplicates(['Scipper'],keep='last')
data_sem6 = data_sem6.rename(columns = {'Year':'Year6'})
data_sem6.head()

Finally, we join the previously constructed DataFrames into one DataFrame containing the information about the starting year and the finishing year of the study.

In [None]:
data_sem16 = pd.merge(data_sem1,data_sem6,how='inner')
data_sem16.head()

Based on this DataFrame, we can now compute the staying time of each student in months by adding the `Staytime` column and dropping the columns that are no longer needed (`Year1` and `Year6`):

In [None]:
data_sem16['Staytime'] = (data_sem16.Year6 - data_sem16.Year1)*12
data_sem16 = data_sem16.drop(['Year1','Year6'],axis=1)
data_sem16.head()
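
The three steps can be traced on a small example. The records below are made up, and for brevity the sketch filters rows by semester directly, which is a simplification of the cells above rather than a copy of them.

```python
import pandas as pd

# Made-up records: student 1 repeats semester 1, student 3 never reaches semester 6.
records = pd.DataFrame({
    "Scipper":  [1, 1, 1, 2, 2, 3],
    "Sex":      ["F", "F", "F", "M", "M", "M"],
    "Semester": [1, 1, 6, 1, 6, 1],
    "Year":     [2007, 2008, 2011, 2008, 2011, 2009],
})

first = (records[records.Semester == 1]
         .sort_values("Year")
         .drop_duplicates("Scipper", keep="first")
         .rename(columns={"Year": "Year1"})
         .drop(columns="Semester"))
last = (records[records.Semester == 6]
        .sort_values("Year")
        .drop_duplicates("Scipper", keep="last")
        .rename(columns={"Year": "Year6"})
        .drop(columns="Semester"))

# The inner join drops student 3, who has no semester-6 record.
merged = pd.merge(first, last, on=["Scipper", "Sex"], how="inner")
merged["Staytime"] = (merged.Year6 - merged.Year1) * 12
print(merged.sort_values("Scipper"))
```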

Now, we can partition the data based on the gender of students and compute the mean of their stay time:

In [None]:
data_grouped = data_sem16.groupby('Sex')
data_grouped['Staytime'].mean()

The results show that, on average, male students take 2 more months to graduate than female students.

Now we study the statistical significance of this difference.

In [None]:
import scipy.stats as stats

First, we start by dividing data into two populations of male and female:

In [None]:
data_F = data_sem16[data_sem16.Sex == 'F']
data_M = data_sem16[data_sem16.Sex == 'M']

First, we test whether the average staying time of each population differs significantly from that of the whole population, using a 1-sample T-Test.

#### 1-sample T-Test

In a 1-sample T-Test, the null hypothesis states that the mean of the sample equals a given reference value. In this case, it means that the mean staying time of each sub-population does not differ from the mean of the whole population.

In [None]:
stats.ttest_1samp(data_M.Staytime,data_sem16.Staytime.mean())

A p-value of 0.7485 means that, if the null hypothesis were true, we would expect to see data at least as extreme as our sample about 74.85% of the time. This p-value is higher than our significance level α = 0.05 (1 minus the confidence level of 0.95), so we cannot reject the null hypothesis.

* The average staying time of males is not significantly different from that of the whole population.

In [None]:
stats.ttest_1samp(data_F.Staytime,data_sem16.Staytime.mean())

A p-value of 0.1268 means that, if the null hypothesis were true, we would expect to see data at least as extreme as our sample only about 12.68% of the time.
This p-value is still higher than our significance level of 0.05, so we cannot reject the null hypothesis here either.

* The average staying time of females is not significantly different from that of the whole population at the 5% level.

#### 2-sample T-Test
In a 2-sample T-Test, the null hypothesis states that the groups are the same.
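
Calling `ttest_ind` with `equal_var=False` selects Welch's variant of the test, which does not assume equal variances in the two groups. As a rough sketch of what the statistic measures, the snippet below computes Welch's t and the Welch-Satterthwaite degrees of freedom with the standard library only. The helper name `welch_t` and the sample values are made up for illustration and are not the scraped data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)   # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up stay times in months, not the scraped data.
males = [36, 36, 48, 36, 48]
females = [36, 36, 36, 48]
t, df = welch_t(males, females)
print(round(t, 3), round(df, 2))
```

The p-value is then obtained from a Student t distribution with `df` degrees of freedom.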

In [None]:
stats.ttest_ind(a= data_M.Staytime, b = data_F.Staytime, equal_var = False)

The test yields a p-value of 0.1219, which means there is a 12.19% chance of seeing sample means this far apart if the two groups actually have the same mean. This is above our significance level of 0.05, so the null hypothesis cannot be rejected.

* We conclude that the difference in average stay time between males and females is not statistically significant.
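
The decision rule used for all three tests can be summarized in a few lines. The helper name `decide` is an assumption; the p-values are the ones reported in this section and the significance level is the 0.05 used throughout.

```python
ALPHA = 0.05  # significance level used in the text

def decide(p_value, alpha=ALPHA):
    """Return the usual verdict for a p-value at the given significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-values reported in this section
reported = [("males vs. all", 0.7485),
            ("females vs. all", 0.1268),
            ("males vs. females", 0.1219)]
for name, p in reported:
    print(name, p, decide(p))
```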

### Exercise 2

#### Getting the master data

In [None]:
ms_data = get_ms_alldata()
ms_data

#### Processing the data

First of all, we reset the indices to make them unique.

In [None]:
ms_data.reset_index(None,drop=True,inplace=True)

Then we build a DataFrame of the master students in their first semester and keep the calendar year in which that semester starts.

In [None]:
# Work on a copy so that assigning to 'Year' does not touch a view of ms_data.
master_semester1 = ms_data[ms_data.Semester == 1].copy()
year = master_semester1['Year'].str.split('-',expand=True)
master_semester1['Year'] = year[0].astype(int)
master_semester1

We do the same for the students who registered for a master project, keeping the calendar year in which the project takes place.

In [None]:
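# Work on a copy so that assigning to 'Year' does not touch a view of ms_data.
master_project = ms_data[ms_data.Semester != 1].copy()
year = master_project['Year'].str.split('-',expand=True)
# Fall projects (Semester 2) take the first year of the academic year, spring projects the second.
master_project['Year'] = np.where(master_project['Semester']==2, year[0].astype(int), year[1].astype(int))
master_project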
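
The year selection can be checked on two toy rows. The frame `proj` is made up; the semester codes follow `get_ms_dataframe`, where 2 is the fall project and 3 the spring project.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): one fall project and one spring project in the same academic year.
proj = pd.DataFrame({
    "Semester": [2, 3],
    "Year": ["2009-2010", "2009-2010"],
})

parts = proj["Year"].str.split("-", expand=True).astype(int)
proj["Year"] = np.where(proj["Semester"] == 2, parts[0], parts[1])
print(proj)
```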

If each student took master semester 1 only once, the Scipper numbers would be unique, but the following instruction shows that this is not the case.

In [None]:
master_semester1.Scipper.is_unique

We therefore keep only the first master semester 1 of each student: we sort the data by "Year", drop the duplicates, and keep the first row only. In addition, we assume that a master student starts the first semester in the fall and settles on a specialization and minor only after the first semester. Thus, we drop the three columns 'Specialization', 'Minor', and 'Semester'.

In [None]:
master_semester1 = master_semester1.sort_values("Year")
master_semester1 = master_semester1.drop_duplicates("Scipper", keep='first')
master_semester1.rename(columns = {'Year':'FirstYear'}, inplace=True)
master_semester1.drop('Specialization',axis=1, inplace=True)
master_semester1.drop('Minor',axis=1, inplace=True)
master_semester1.drop('Semester', axis=1, inplace=True)
master_semester1

We also expected each master student to register only once for the master project. However, the following instruction shows that our expectation was wrong.

In [None]:
master_project.Scipper.is_unique

So we drop the duplicated values and keep only the last one.

In [None]:
master_project = master_project.sort_values("Year")
master_project = master_project.drop_duplicates("Scipper", keep='last')
master_project.rename(columns = {'Year':'LastYear'}, inplace=True)
master_project

To find the students who finished their masters, we merge the two dataframes (join on the scipper number). 

In [None]:
ms = pd.merge(master_semester1,master_project, how='inner')
ms

We compute the duration of the master, in months, for the students who started it in 2007 or later and have finished it by now.

In [None]:
ms['Staytime'] = (ms.LastYear - ms.FirstYear)*12 + (ms.Semester-1)*6 
# Semester 1 is the first (fall) semester; semester 2 denotes the fall master project and semester 3 the spring one.
ms

#### Analyzing the data

First, we measure the average time that the master students spend at EPFL.

In [None]:
ms['Staytime'].mean()

On average each master student spent roughly 29 months at EPFL.

However, this average mixes students with and without a specialization. To separate them, we group the students by specialization.

In [None]:
ms_spec = ms.groupby('Specialization')
ms_spec['Staytime'].mean()

We notice that the average time spent in a master with a specialization is longer than in a master without one. However, we have not yet taken minors into account, so we form a new grouping in the following code.

In [None]:
ms_minor_spec = ms.groupby(['Minor','Specialization'])
ms_minor_spec['Staytime'].mean()
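
Grouping on two keys yields one mean per (minor, specialization) pair. A toy illustration, with made-up stay times and an empty string standing for "no specialization", might look like this:

```python
import pandas as pd

# Made-up stay times in months.
toy = pd.DataFrame({
    "Minor":          [False, False, True, True],
    "Specialization": ["", "", "", "Internet computing"],
    "Staytime":       [24, 30, 36, 36],
})

means = toy.groupby(["Minor", "Specialization"])["Staytime"].mean()
print(means)
```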

As we can see, the master without a minor/specialization takes less time than the master with a minor/specialization. That is expected. 

#### Statistical tests
* Specialization
For each specialization, we apply a 1-sample T-Test, to see whether the results related to the staying-time average are significant or not.

In [None]:
for spec in ms.Specialization.unique():
    print(spec)
    print(stats.ttest_1samp(ms[ms.Specialization == spec].Staytime,ms.Staytime.mean()))

From the results above, we notice that for most specializations we do not have enough data to tell whether the average staying time is meaningful.
The results of some specialization categories are therefore more reliable than others, in the following order:
   1. Students with no specialization.
   2. Students specialized in 'Internet computing'.
   3. Students specialized in 'Foundations of Software'.
   4. 'Computer Engineering - SP', 'Information Security - SP', 'Software Systems'.
   5. Others : 'Biocomputing', 'Signals, Images and Interfaces', 'Service science'
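
One way to judge how much data backs each category is to look at the group sizes next to the means. The sketch below does this on made-up data; in the notebook the same aggregation would presumably be applied to `ms`.

```python
import pandas as pd

# Made-up data: one specialization is represented by a single student.
toy = pd.DataFrame({
    "Specialization": ["", "", "", "Biocomputing"],
    "Staytime":       [24, 30, 36, 42],
})

summary = toy.groupby("Specialization")["Staytime"].agg(["count", "mean"])
print(summary)
```

A category with very few students gives a t-test little to work with, whatever its mean.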

### Bonus

In [None]:
ms_sex = ms.groupby(['Sex', 'FirstYear'])
avg_ms_sex = ms_sex['Staytime'].mean()
avg_ms_sex
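
To compare the two groups year by year, the grouped means can be pivoted so that each sex becomes a column. The snippet below sketches this on made-up numbers.

```python
import pandas as pd

# Made-up stay times in months.
toy = pd.DataFrame({
    "Sex":       ["F", "M", "F", "M"],
    "FirstYear": [2007, 2007, 2008, 2008],
    "Staytime":  [24, 30, 30, 36],
})

avg = toy.groupby(["Sex", "FirstYear"])["Staytime"].mean()
table = avg.unstack("Sex")   # rows: FirstYear, columns: F and M
print(table)
```

A table in this shape can be passed to `table.plot()` to draw one line per sex.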

In [None]:
import matplotlib.pyplot as plt