# Machine Learning Project
Project for the Machine Learning course - SUPSI DTI 2020/2021.

Group members:
* De Santi Massimo
* Aleksandar Stojkovski

## Dataset

The data come from two Portuguese secondary schools.
They were collected from school reports and questionnaires.

The available attributes include:
- demographic attributes
- social attributes
- school-related attributes
- grades

Two separate datasets cover student performance in two subjects:
- Mathematics (student-mat.csv)
- Portuguese language (student-por.csv)

## Description of common attributes
| i | col | description |
| --- | :- | :- |
| 1  | school     | student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
| 2  | sex        | student's sex (binary: "F" - female or "M" - male)
| 3  | age        | student's age (numeric: from 15 to 22)
| 4  | address    | student's home address type (binary: "U" - urban or "R" - rural)
| 5  | famsize    | family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
| 6  | Pstatus    | parent's cohabitation status (binary: "T" - living together or "A" - apart)
| 7  | Medu       | mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
| 8  | Fedu       | father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
| 9  | Mjob       | mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
| 10 | Fjob       | father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
| 11 | reason     | reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
| 12 | guardian   | student's guardian (nominal: "mother", "father" or "other")
| 13 | traveltime | home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
| 14 | studytime  | weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
| 15 | failures   | number of past class failures (numeric: n if 1<=n<3, else 4)
| 16 | schoolsup  | extra educational support (binary: yes or no)
| 17 | famsup     | family educational support (binary: yes or no)
| 18 | paid       | extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
| 19 | activities | extra-curricular activities (binary: yes or no)
| 20 | nursery    | attended nursery school (binary: yes or no)
| 21 | higher     | wants to take higher education (binary: yes or no)
| 22 | internet   | Internet access at home (binary: yes or no)
| 23 | romantic   | with a romantic relationship (binary: yes or no)
| 24 | famrel     | quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
| 25 | freetime   | free time after school (numeric: from 1 - very low to 5 - very high)
| 26 | goout      | going out with friends (numeric: from 1 - very low to 5 - very high)
| 27 | Dalc       | workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
| 28 | Walc       | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
| 29 | health     | current health status (numeric: from 1 - very bad to 5 - very good)
| 30 | absences   | number of school absences (numeric: from 0 to 93)
| 31 | G1         | first period grade (numeric: from 0 to 20)
| 32 | G2         | second period grade (numeric: from 0 to 20)
| 33 | G3         | final grade (numeric: from 0 to 20, output target)

## Notes
* The target attribute G3 (the final grade, issued at the third period) is strongly correlated with G1 (first period grade) and G2 (second period grade).
* Predicting G3 without G1 and G2 is harder, but that prediction is also more useful.
* Several students (382) appear in both datasets.
* These students can be identified by matching on the demographic attributes, as shown in the file student-merge.R

Dataset link: https://archive.ics.uci.edu/ml/datasets/Student+Performance
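
The matching mentioned in the notes can be sketched in pandas as an inner join on the shared demographic columns. The key list below is an assumption recalled from student-merge.R and should be checked against that script. The helper `find_shared_students`, the helper `make_row`, and the toy rows are hypothetical and serve only to illustrate the mechanics.

```python
import pandas as pd

# Columns assumed to identify the same student in both files
# (recalled from student-merge.R; verify against that script).
MERGE_KEYS = [
    "school", "sex", "age", "address", "famsize", "Pstatus",
    "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet",
]


def find_shared_students(df_mat, df_por, keys=MERGE_KEYS):
    """Inner-join the two tables on the key columns.

    Non-key columns present in both tables get the suffixes
    `_mat` and `_por` so the two grades stay distinguishable.
    """
    return df_mat.merge(df_por, on=keys, suffixes=("_mat", "_por"))


def make_row(**overrides):
    """Build one made-up student record (illustrative only, not real data)."""
    row = {
        "school": "GP", "sex": "F", "age": 16, "address": "U",
        "famsize": "GT3", "Pstatus": "T", "Medu": 2, "Fedu": 2,
        "Mjob": "other", "Fjob": "other", "reason": "home",
        "nursery": "yes", "internet": "yes",
    }
    row.update(overrides)
    return row


# Two tiny tables: only the first row of each describes the same student.
toy_mat = pd.DataFrame([make_row(G3=10), make_row(age=17, G3=12)])
toy_por = pd.DataFrame([make_row(G3=13), make_row(sex="M", G3=9)])

shared = find_shared_students(toy_mat, toy_por)
print(shared[["sex", "age", "G3_mat", "G3_por"]])
```

On the real files the same call would take `df_math` and `df_por` as arguments. Two different students with identical demographic values would also be matched, so the result is an approximation rather than an exact identity match.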

# Read dataset

In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

# this allows plots to appear directly in the notebook
%matplotlib inline

In [None]:
# dataset urls
portuguese_dataset_url = "https://raw.githubusercontent.com/aleksandarstojkovski/SUPSI_Machine_Learning/main/dataset/student-por.csv"
math_dataset_url = "https://raw.githubusercontent.com/aleksandarstojkovski/SUPSI_Machine_Learning/main/dataset/student-mat.csv"

# dataframes
df_por = pd.read_csv(portuguese_dataset_url, sep=';')
df_math = pd.read_csv(math_dataset_url, sep=';')
df = pd.concat([df_por, df_math], ignore_index=True)
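
Once the two tables are stacked, the subject each row came from is no longer recorded, and students present in both files appear twice. One possible way to keep that information is to tag each table before concatenating. This is a sketch rather than part of the original notebook: the `subject` column, the helper `concat_with_subject`, and the toy frames are all hypothetical.

```python
import pandas as pd


def concat_with_subject(df_por, df_math):
    """Stack both tables and record the source subject in a new column."""
    return pd.concat(
        [df_por.assign(subject="por"), df_math.assign(subject="mat")],
        ignore_index=True,
    )


# Made-up miniature frames standing in for df_por and df_math.
toy_por = pd.DataFrame({"sex": ["F", "M"], "G3": [11, 13]})
toy_math = pd.DataFrame({"sex": ["F"], "G3": [9]})

toy_all = concat_with_subject(toy_por, toy_math)
print(toy_all)
```

With such a column in place, later analysis can be split by subject, for example with `toy_all.groupby("subject")`.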

# Getting to know the dataset

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.nunique()

In [None]:
df.describe()

In [None]:
df.shape
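
The notes state that G3 is strongly correlated with G1 and G2. A quick way to check this is the pairwise Pearson correlation matrix of the three grade columns. The helper `grade_correlations` and the toy values below are hypothetical, and the numbers are invented purely to make the snippet runnable on its own.

```python
import pandas as pd


def grade_correlations(frame):
    """Return the Pearson correlation matrix of G1, G2 and G3."""
    return frame[["G1", "G2", "G3"]].corr()


# Made-up grades (not taken from the dataset).
toy = pd.DataFrame({
    "G1": [8, 10, 12, 15, 17],
    "G2": [9, 10, 13, 14, 18],
    "G3": [9, 11, 13, 15, 18],
})

corr = grade_correlations(toy)
print(corr.round(2))
```

In the notebook, the same function could be applied to `df` directly.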

# Data wrangling

# Exploration

In [None]:
gender_count = df['sex'].value_counts().reset_index()
gender_count.columns = ['gender', 'count']
gender_count

In [None]:
fig = px.pie(gender_count, values='count', names='gender')
fig.update_layout(
    title=dict(
        text='Gender Distribution',
        y=0.95,
        x=0.5,
        xanchor='center',
        yanchor='top'
    )
)
fig.show()
# TODO: swap the colors