# CS105 Final Project: G1-T15 

#### Table of Contents

1. Introduction
2. Dataset preprocessing
3. Exploratory Data Analysis
    - Univariate analysis
        - Five-number summary
        - Boxplot
        - Histogram subplots
    - Multivariate analysis
        - Features and `G3` mean subplots
        - Heatmap
4. Insights
5. Classification model 
6. Conclusion
7. References


## 1 - Introduction

Academic performance is often seen as an important marker for success in early life. In the absence of other metrics, it may be used to assess an individual’s work ethic and personality.

Furthermore, access to employment opportunities and higher education is heavily predicated on having a good academic record. While academic performance tends to be attributed to personal factors, many external factors play a vital role as well. Indeed, it is well understood that students who come from families with a higher socioeconomic status tend to outperform those from the lower end of the spectrum. While this inequality is difficult to mitigate, there are many ways in which aid is granted to disadvantaged students, such as through scholarships, bursaries and counselling services.

In this project, we seek to identify the most crucial social factors that predict students’ academic performance. This may be useful in helping to identify at-risk youth, or in constructing policies to support those who are the most disadvantaged within the community.

Finally, for the layperson, it can help to broaden perspectives by giving them different ways to understand differences in student performance.

We are using a dataset `student-mat.csv`, which contains the grades achieved by students from two Portuguese schools. The students are enrolled in secondary education and the grades are for the subject of Mathematics. The dataset has no missing values in its columns, and the data dictionary is defined as follows:

| Column number | Column name | Explanation | Data type |
|:---:|:---|:---|:---|
|1 | school | student's school | **string**: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira|
|2 | sex | student's sex | **string**: "F" - female or "M" - male|
|3 | age | student's age | **int**: from 15 to 22|
|4 | address | student's home address type | **string**: "U" - urban or "R" - rural|
|5 | famsize | family size | **string**: "LE3" - less than or equal to 3 or "GT3" - greater than 3|
|6 | Pstatus | parent's cohabitation status | **string**: "T" - living together or "A" - apart|
|7 | Medu | mother's education | **int**: 0 - none,  1 - primary education 4th grade, 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education|
|8 | Fedu | father's education | **int**: 0 - none,  1 - primary education 4th grade, 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education|
|9 | Mjob | mother's job | **string**: "teacher", "health" care related, civil "services" e.g. administrative or police, "at_home" or "other"|
|10| Fjob | father's job | **string**: "teacher", "health" care related, civil "services" e.g. administrative or police, "at_home" or "other"|
|11| reason | reason to choose this school | **string**: close to "home", school "reputation", "course" preference or "other"|
|12| guardian | student's guardian | **string**: "mother", "father" or "other"|
|13| traveltime | home to school travel time | **int**: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour|
|14| studytime | weekly study time | **int**: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours|
|15| failures | number of past class failures | **int**: _n_ if 1<=_n_<3, else 4|
|16| schoolsup | extra educational support | **string**: yes or no|
|17| famsup | family educational support | **string**: yes or no|
|18| paid |  extra paid classes within the course subject Math or Portuguese | **string**: yes or no|
|19| activities |  extra-curricular activities | **string**: yes or no|
|20| nursery |  attended nursery school | **string**: yes or no|
|21| higher |  wants to take higher education | **string**: yes or no|
|22| internet |  Internet access at home | **string**: yes or no|
|23| romantic |  with a romantic relationship | **string**: yes or no|
|24| famrel | quality of family relationships | **int**: from 1 - very bad to 5 - excellent|
|25| freetime | free time after school | **int**: from 1 - very low to 5 - very high|
|26| goout | going out with friends | **int**: from 1 - very low to 5 - very high|
|27| Dalc | workday alcohol consumption | **int**: from 1 - very low to 5 - very high|
|28| Walc | weekend alcohol consumption | **int**: from 1 - very low to 5 - very high|
|29| health | current health status | **int**: from 1 - very bad to 5 - very good|
|30| absences | number of school absences | **int**: from 0 to 93|
|31| G1 | first period grade | **int**: from 0 to 20|
|32| G2 | second period grade | **int**: from 0 to 20|
|33| G3 | final grade | **int**: from 0 to 20, output target|

## 2 - Dataset Preprocessing

In [2]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px

data = pd.read_csv("student/student-mat.csv", sep=';')

In [3]:
data.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

In [4]:
datatypes = data.dtypes.drop(labels=["G1", "G2", "G3"])
all_features = datatypes.index

Many columns have datatype `object` but take only 2 possible values, so they can be treated as binary. We convert these columns (e.g. `sex`, `address`, `famsize`, etc.) into `int` columns taking the value 1 or 0, so that we can also analyse them with methods meant for numerical variables.

In [5]:
data["sex_bin"] = data["sex"].apply(lambda x: 1 if x == "M" else 0)
data["address_bin"] = data["address"].apply(lambda x: 1 if x == "U" else 0)
data["famsize_bin"] = data["famsize"].apply(lambda x: 1 if x == "GT3" else 0)
data["Pstatus_bin"] = data["Pstatus"].apply(lambda x: 1 if x == "T" else 0)
for column in ["schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic"]:
    data[column+"_bin"] = data[column].apply(lambda x: 1 if x == "yes" else 0)
data.head(20)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,sex_bin,address_bin,famsize_bin,Pstatus_bin,schoolsup_bin,famsup_bin,paid_bin,activities_bin,nursery_bin,higher_bin,internet_bin,romantic_bin
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6,0,1,1,0,1,0,0,0,1,1,0,0
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6,0,1,1,1,0,1,0,0,0,1,1,0
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10,0,1,0,1,1,0,1,0,1,1,1,0
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15,0,1,1,1,0,1,1,1,1,1,1,1
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10,0,1,1,1,0,1,1,0,1,1,0,0
5,GP,M,16,U,LE3,T,4,3,services,other,reputation,mother,1,2,0,no,yes,yes,yes,yes,yes,yes,no,5,4,2,1,2,5,10,15,15,15,1,1,0,1,0,1,1,1,1,1,1,0
6,GP,M,16,U,LE3,T,2,2,other,other,home,mother,1,2,0,no,no,no,no,yes,yes,yes,no,4,4,4,1,1,3,0,12,12,11,1,1,0,1,0,0,0,0,1,1,1,0
7,GP,F,17,U,GT3,A,4,4,other,teacher,home,mother,2,2,0,yes,yes,no,no,yes,yes,no,no,4,1,4,1,1,1,6,6,5,6,0,1,1,0,1,1,0,0,1,1,0,0
8,GP,M,15,U,LE3,A,3,2,services,other,home,mother,1,2,0,no,yes,yes,no,yes,yes,yes,no,4,2,2,1,1,1,0,16,18,19,1,1,0,0,0,1,1,0,1,1,1,0
9,GP,M,15,U,GT3,T,3,4,other,other,home,mother,1,2,0,no,yes,yes,yes,yes,yes,yes,no,5,5,1,1,1,5,0,14,15,15,1,1,1,1,0,1,1,1,1,1,1,0
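
The same encoding can also be written without `apply`: comparing a column with its "positive" label gives a boolean Series, which `astype(int)` turns into 0/1. The sketch below is illustrative only; the helper name `encode_binary`, the `POSITIVE_LABELS` mapping and the toy rows are assumptions and not part of the notebook.

```python
import pandas as pd

# Label that maps to 1 for each two-valued column (same choices as the cell above).
POSITIVE_LABELS = {
    "sex": "M",
    "address": "U",
    "famsize": "GT3",
    "Pstatus": "T",
}

def encode_binary(frame, positive_labels):
    """Return a copy of `frame` with a 0/1 `<column>_bin` column per mapping entry."""
    encoded = frame.copy()
    for column, positive in positive_labels.items():
        encoded[column + "_bin"] = (encoded[column] == positive).astype(int)
    return encoded

# Toy rows for illustration only (not taken from student-mat.csv).
toy = pd.DataFrame({
    "sex": ["F", "M"],
    "address": ["U", "R"],
    "famsize": ["GT3", "LE3"],
    "Pstatus": ["A", "T"],
})
toy_encoded = encode_binary(toy, POSITIVE_LABELS)
```

The yes/no columns could be handled the same way by adding entries such as `"schoolsup": "yes"` to the mapping.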


## 3 - Exploratory Data Analysis

### 3.1 - Univariate analysis

In [6]:
n_rows, n_cols = data.shape
print(f"# of rows is {n_rows}")
print(f"# of columns is {n_cols}")

# of rows is 395
# of columns is 45


In [7]:
num_passed = len(data[data["G3"]>=10])
total = len(data)
print(f"Passing rate : {num_passed / total}")

Passing rate : 0.6708860759493671


In [8]:
data[["G1", "G2", "G3"]].describe()

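| | G1 | G2 | G3 |
|:---|---:|---:|---:|
| count | 395.0 | 395.0 | 395.0 |
| mean | 10.908861 | 10.713924 | 10.41519 |
| std | 3.319195 | 3.761505 | 4.581443 |
| min | 3.0 | 0.0 | 0.0 |
| 25% | 8.0 | 9.0 | 8.0 |
| 50% | 11.0 | 11.0 | 11.0 |
| 75% | 13.0 | 13.0 | 14.0 |
| max | 19.0 | 19.0 | 20.0 |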
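
`describe()` reports the five-number summary (min, 25%, 50%, 75%, max) together with the count, mean and standard deviation. If only the five numbers are wanted, they can be obtained from `Series.quantile`. This is a minimal sketch; the helper name `five_number_summary` and the toy grades are assumptions made for illustration.

```python
import pandas as pd

def five_number_summary(series):
    """Return min, Q1, median, Q3 and max of a numeric Series."""
    quantiles = series.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    quantiles.index = ["min", "Q1", "median", "Q3", "max"]
    return quantiles

# Toy grades for illustration only (not taken from student-mat.csv).
toy_grades = pd.Series([0, 8, 10, 11, 14, 15, 20])
summary = five_number_summary(toy_grades)
```

On the notebook's data, the call would presumably be `five_number_summary(data["G3"])`.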


In [9]:
fig = go.Figure()

for G in ["G1", "G2", "G3"]:
    fig.add_trace(go.Box(y=data[G], name=G + " scores", showlegend=False))

fig.show()

We now begin exploring the dataset, filtering out the less relevant features and homing in on the important ones.

We first investigate the distribution of each variable using a histogram, and display them as multiple subplots.

In [10]:
fig = make_subplots(
    rows=6, cols=5,
    subplot_titles=list(all_features))

# One bar chart per feature, filling the 6x5 grid row by row
for i, feature in enumerate(all_features):
    counts = data[feature].value_counts()
    fig.add_trace(
        go.Bar(
            x=counts.index,
            y=counts.values,
            showlegend=False
            ),
        row = i // 5 + 1,
        col = i % 5 + 1
    )

fig.update_layout(
    autosize=True,
    width=1000,
    height=1000,
    title_text="Histogram values for each feature",
    margin=dict(l=50, r=50, t=100, b=100),
)

fig.update_traces(marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

fig.show()

We can make some observations from the above data:

**Spread of variables**

The following binary categorical variables are unevenly distributed, with one value much more common than the other:

- `address` : Most students have urban home addresses
- `famsize` : Most students come from families which have more than 3 members
- `Pstatus` : Most students have parents who are together rather than apart
- `schoolsup` : Most students did not receive educational support from the school
- `nursery` : Most students went to nursery school
- `higher` : Most students intend to pursue higher education after school
- `internet` : Most students have Internet access at home


**Possible correlations between features**

1. The split of students living in urban vs rural areas closely mirrors that of those with and without an internet connection. We postulate that these 2 factors are correlated, since internet connection plans are often tied to a place of residence. If so, `address` may be a causal factor that helps to explain `internet`, although the histograms alone cannot establish this.
2. There should be a correlation between `Dalc` and `Walc`, since both features are linked to the overall habit of alcohol consumption.
3. We suspect that there is a correlation between `Medu` and `Fedu`. People are more likely to enter into relationships with those of a similar social status, and the level of education achieved is a good predictor of this.
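
These postulated correlations can be checked directly rather than read off the histograms. The sketch below computes the Pearson coefficient for a list of column pairs; the helper name `pairwise_correlations` and the toy values are assumptions made for illustration, not results from the dataset.

```python
import pandas as pd

def pairwise_correlations(frame, pairs):
    """Return the Pearson correlation for each (column_a, column_b) pair."""
    return {
        (col_a, col_b): frame[col_a].corr(frame[col_b])
        for col_a, col_b in pairs
    }

# Toy values for illustration only (not taken from student-mat.csv).
toy = pd.DataFrame({
    "Dalc": [1, 1, 2, 3, 5],
    "Walc": [1, 2, 3, 4, 5],
    "Medu": [4, 1, 3, 2, 4],
    "Fedu": [4, 1, 2, 2, 3],
})
result = pairwise_correlations(toy, [("Dalc", "Walc"), ("Medu", "Fedu")])
```

On the notebook's data, the pairs of interest would be `("address_bin", "internet_bin")`, `("Dalc", "Walc")` and `("Medu", "Fedu")`. For two 0/1 columns, the Pearson coefficient coincides with the phi coefficient.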


### 3.2 - Multivariate analysis

Let's also take a look at how the mean of `G3` differs across the values of each feature. Note that these are simple group means, which do not hold the other features constant.

In [11]:
def get_mean_G3_score_set(data, feature):
    """Return the mean G3 for each value of `feature`, in `unique()` order."""
    result = []
    for value in data[feature].unique():
        feature_has_value = data[feature]==value
        data_with_feature_val = data[feature_has_value]
        result.append(data_with_feature_val["G3"].mean())
    return result

fig = make_subplots(
    rows=6, cols=5,
    subplot_titles=all_features)

for i in range(len(all_features)):
    feature = all_features[i]
    fig.append_trace(
        go.Bar(
            x=data[feature].unique(),
            y=get_mean_G3_score_set(data, feature),
            showlegend=False,
        ),
        row = i // 5 + 1,
        col = i % 5 + 1
    )

fig.update_layout(
    autosize=True,
    width=1000,
    height=1000,
    title_text="Mean G3 scores for unique values of each variable",
    margin=dict(l=50, r=50, t=100, b=100),
)

fig.update_traces(marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

fig.show()

Each bar in these subplots is the sample mean of `G3` among students with a given feature value, which serves as an estimate of the expected `G3` for that value. However, it is important to read this chart together with the histogram counts above. Feature values that account for only a small number of students yield means that are more sensitive to outliers.
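
The caveat about small groups is easier to apply when the group size is reported next to each mean. A minimal sketch, assuming a hypothetical helper `mean_g3_with_counts` and a made-up toy frame (neither is part of the notebook):

```python
import pandas as pd

def mean_g3_with_counts(frame, feature):
    """Return the mean `G3` and the number of students for each value of `feature`."""
    return (
        frame.groupby(feature)["G3"]
        .agg(mean_G3="mean", n_students="count")
        .sort_index()
    )

# Toy rows for illustration only (not taken from student-mat.csv).
toy = pd.DataFrame({
    "age": [15, 15, 15, 16, 16, 20],
    "G3": [10, 12, 14, 9, 11, 18],
})
summary = mean_g3_with_counts(toy, "age")
```

In the toy frame, the mean for age 20 rests on a single student, which is exactly the kind of bar that should be interpreted with care.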

**Trends**

- `sex` : Male students tend to perform slightly better in the test
- `age` : Older students tend to have lower `G3` scores, with the exception of some outliers in the age category of 20
- `address` : Those who live in urban environments (`U`) tend to perform better
- `famsize` : Students with a smaller family size (`LE3`) have higher `G3` scores
- `Medu` and `Fedu` : Higher parental education levels are associated with higher `G3` scores
- `guardian` : Students with non-parental guardians have lower `G3` scores
- `traveltime` : Longer travel times are associated with lower `G3` scores
- `studytime` : More time spent studying is associated with a higher `G3` score
- `failures` : Having more past failures is associated with lower `G3` scores
- `schoolsup` : Students having school support have lower `G3` scores
- `paid` : Students taking paid classes have higher `G3` scores
- `higher` : Students who intend to pursue higher education have higher `G3` scores
- `romantic` : Students involved in romantic relationships have lower `G3` scores
- `goout` : Students who go out moderately often have the highest `G3` scores


**Significant factors**

Given the trends observed in the data, we classify features according to their significance in predicting `G3`. We will go on to assess the reasoning behind this in the later sections.

|Significant | Non-significant|
|:---:|:---:|
|`sex`, `age`, `address`, `famsize`, `Medu`, `Fedu`, `guardian`, `traveltime`, `studytime`, `failures`, `schoolsup`, `paid`, `higher`, `romantic`, `goout`|`school`, `Pstatus`, `Mjob`, `Fjob`, `reason`, `famsup`, `activities`, `nursery`, `internet`, `famrel`, `freetime`, `Dalc`, `Walc`, `health`, `absences`|

We can use a heatmap to get a broad overview of any correlations between different variables, including categorical variables which have been converted to binary representation.

This will help to further refine the variables we have identified and minimise redundancy, by eliminating features which are highly correlated.

We first create a new feature, `guardian_bin`, which indicates whether the student's guardian is a parent (1) or not (0).

In [12]:
data["guardian_bin"] = data["guardian"].apply(lambda x : 1 if x != "other" else 0)

In [13]:
significant_features = ["sex_bin", "age", "address_bin", "famsize_bin", "Medu", "Fedu", "guardian_bin", "traveltime", "studytime", 
                             "failures", "schoolsup_bin", "paid_bin", "higher_bin", "romantic_bin", "goout"]

fig = go.Figure(go.Heatmap(
                    x=data[significant_features].columns,
                    y=data[significant_features].columns,
                    z=data[significant_features].corr(),
                    colorscale="RdBu",
                    zmin=-1,
                    zmax=1,
                ),
                layout_title_text="Heatmap of correlation between explanatory variables",
)

fig.update_layout(
    autosize=True,
    width=1000,
    height=1000,
    margin=dict(l=50, r=50, t=100, b=100),
)

fig.show()

Apart from `Medu` and `Fedu`, which have a correlation coefficient of 0.623, most features are not strongly correlated with one another, with coefficients within the range (-0.5, 0.5). This suggests that each of them may carry explanatory power in its own right.
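
Pairs outside a chosen range can also be listed programmatically instead of being read off the heatmap by eye. The sketch below is one possible way to do so; the helper name `strongly_correlated_pairs`, the default threshold of 0.5 and the toy values are assumptions made for illustration.

```python
import pandas as pd

def strongly_correlated_pairs(frame, threshold=0.5):
    """List (col_a, col_b, r) for column pairs whose |Pearson r| exceeds `threshold`."""
    corr = frame.corr()
    columns = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(columns):
        # Only look at columns after col_a so each pair is reported once.
        for col_b in columns[i + 1:]:
            r = corr.loc[col_a, col_b]
            if abs(r) > threshold:
                pairs.append((col_a, col_b, float(r)))
    return pairs

# Toy values for illustration only (not taken from student-mat.csv).
toy = pd.DataFrame({
    "Medu": [0, 1, 2, 3, 4, 4],
    "Fedu": [1, 1, 2, 3, 3, 4],
    "goout": [3, 1, 5, 2, 4, 3],
})
pairs = strongly_correlated_pairs(toy)
```

On the notebook's data, the equivalent call would presumably be `strongly_correlated_pairs(data[significant_features])`.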

To minimise redundancy, we create a new feature, `combined_parental_edu`, which is the sum of the education levels of the student's two parents.

In [14]:
data["combined_parental_edu"] = data["Medu"] + data["Fedu"]
significant_features.remove("Medu")
significant_features.remove("Fedu")
significant_features.append("combined_parental_edu")


## 4 - Insights

Here, we will explore some possible reasons why the variables we have identified influence `G3` scores.

- `sex` : The observation that male students tend to perform slightly better could be due to prevalent societal and cultural influences, which favour males in STEM subjects
- `age` : An older age may imply that a student has failed on certain occasions in the past, and had to retake years in school
- `address` : Those who live in urban environments (`U`) tend to perform better. Generally, this could be because an urban address is an indicator of higher socioeconomic status and better access to resources
- `famsize` : Students with a smaller family size (`LE3`) performed slightly better on average, possibly because these students receive more support from their family
- `Medu` and `Fedu` : Parental education levels are correlated with each other, as people are likely to enter relationships with those of a similar educational background
- `guardian` : Students with non-parental guardians perform slightly worse than others. It is possible that this is an indicator of a difficult family history affecting childhood development
- `schoolsup` : Students under school support would have already been identified as having weaker academic capability
- `higher` : The intention to pursue higher education may influence the amount of effort students put into their studies, as these grades will affect their opportunities for further education
- `goout` : Students who go out moderately often have the highest `G3` scores. It seems that this is a case of striking a balance between healthy levels of social interaction and leaving time for studies

## 5 - Classification model

We will use a categorical Naïve Bayes classifier to assess whether the social factors we identified are accurate predictors of whether a student attains a passing `G3` score. This classifier assumes that the features are conditionally independent of one another given the class.
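
For reference, the independence assumption enters through the product in the standard Naïve Bayes decision rule. The notation is ours rather than the notebook's: $y$ is the class (pass or fail) and $x_1, \dots, x_n$ are the values of the $n$ features.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Each factor $P(x_i \mid y)$ is estimated separately from the training data, which is only justified if the features are independent of one another once the class is known.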

First, we create a new binary column `passed`, which records whether the student attained a passing `G3` score (10 or above).

In [15]:
data["passed"] = data["G3"].apply(lambda x : 1 if x >= 10 else 0)

### 5.1 - Model Building

In [16]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

In [17]:
X = data[significant_features]
y = data["passed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [18]:
model = CategoricalNB().fit(X_train, y_train)

### 5.2 - Model Evaluation

In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

print("Model")
print("-----")
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1 score: {f1_score(y_test, y_pred)}")

Model
-----
Accuracy: 0.7070707070707071
Precision: 0.7023809523809523
Recall: 0.9365079365079365
F1 score: 0.802721088435374


## 6 - Conclusion

In this project, we sought to use data analysis techniques to identify social factors that play a role in student academic performance. We proceeded in several steps. First, we did some basic data preprocessing to ensure that the dataset was suitable for exploratory data analysis (EDA). We then applied a range of data visualisation techniques to find trends and gradually narrow down a set of social factors to focus on. Finally, we put these factors to the test by training a categorical Naïve Bayes classifier to predict whether or not a student would attain a passing score for `G3`.

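Our model performs fairly well on the whole, with an accuracy of about 0.71, a very high recall (0.94) and decent precision (0.70) on the test set. These figures should be read against the base rate: about 67% of students in the dataset pass, so a high recall is easier to achieve than a high precision. While this does not imply any direct causal relationship between academic performance and the social factors identified, it does indicate that these factors are a good place to start from if we want to further study and explain student academic performance.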

## 7 - References
- Dataset and data dictionary: https://archive.ics.uci.edu/ml/datasets/Student+Performance#