# Student Performance Modeling

Approach:
1. Use train/test split data to try a number of different algorithms/techniques.
2. Choose the most promising algorithms and optimize hyperparameters.
3. Validate results using cross-validation on the entire dataset.
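
Step 3 could look roughly like the sketch below. It is a minimal illustration on synthetic data, not this notebook's validation code: the generated data, the choice of `LinearRegression`, and the 5 folds are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the student data (assumption: 200 rows, 4 features).
rng = np.random.default_rng(12)
X = rng.normal(size=(200, 4))
y = 10 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=200)

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
cv = KFold(n_splits=5, shuffle=True, random_state=12)
neg_rmse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
fold_rmse = -neg_rmse

print("RMSE per fold:", np.round(fold_rmse, 3))
print("Mean RMSE: %.3f (+/- %.3f)" % (fold_rmse.mean(), fold_rmse.std()))
```

Reporting the spread across folds alongside the mean gives a sense of how much a single train/test split might mislead.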

Algorithms to try:
- Linear regression
- K-Nearest Neighbors (KNN)
- Decision tree
- Support Vector Machine (SVM)
- Ensemble - RandomForest, XGBoost, stacking, etc.
- Multi-layer perceptron (MLP)

Success criteria: 

Without knowledge of the first two term grades, the best model in the original paper had a root mean squared error (RMSE) of 2.67 on the Portuguese dataset and 3.90 on the mathematics dataset. Because we are combining the two datasets, our results are not directly comparable with those of the original paper. Still, we are aiming for a model with an **RMSE of <2**. A result at that level would demonstrate that the model has significant skill and can produce reasonably accurate predictions of a student's year-long grade.

### Install and import libraries

In [17]:
!pip3 install --quiet mlflow

In [38]:
!pip3 install --quiet xgboost

In [50]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import xgboost as xgb
import mlflow
import mlflow.sklearn
import mlflow.xgboost
import pickle

### Creating directory for model artifacts

In [19]:
artifact_prefix = 'artifacts/'

In [20]:
!mkdir -p $artifact_prefix

### Define constants

In [21]:
RAND_STATE = 12
TRAIN_FILE = 'data/processed/train.csv'
TEST_FILE = 'data/processed/test.csv'

### Load data into memory

In [22]:
train_df = pd.read_csv(TRAIN_FILE, header=0)
test_df = pd.read_csv(TEST_FILE, header=0)

Separate attributes and target variable:

In [23]:
X_train = train_df.loc[:, train_df.columns != 'G3']
y_train = train_df['G3']
X_test = test_df.loc[:, test_df.columns != 'G3']
y_test = test_df['G3']

### Define model evaluation

In [24]:
# Return the root mean squared error (RMSE) and R-squared value for a set of actual and predicted values.
def evaluate(actual, predictions):
    # Taking np.sqrt of the MSE works across scikit-learn versions; `squared=False` was deprecated in 1.4.
    rmse = np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))
    return rmse, r2_score(actual, predictions)
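
To make the two metrics concrete, here is a hand-rolled sketch of both formulas in plain Python. The grades are made up for illustration, and the helper names `rmse` and `r_squared` are not part of the notebook. The example also shows why a constant prediction that differs from the mean of the evaluated targets scores below zero on R-squared.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up grades (illustrative only); their mean is 10.
actual = [10, 12, 8, 15, 5]
constant = [11] * len(actual)  # a constant guess that is not the mean of `actual`

demo_rmse = rmse(actual, constant)
demo_r2 = r_squared(actual, constant)
print("RMSE: %.4f" % demo_rmse)
print("R2:   %.4f" % demo_r2)
```

R-squared compares a model against always predicting the mean of the evaluated targets, so any other constant guess has a larger squared error and a negative score.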

### Create baseline naive model

We first want to create a baseline model to compare results against as we test new algorithms. For our baseline regressor, we will create a naive model that always returns the average grade of the students in the training set.

In [25]:
class NaiveRegressor(mlflow.pyfunc.PythonModel):
    def fit(self, X_train, y_train):
        # calculate average grade in training set
        self.value = np.average(y_train)
    def predict(self, values):
        # return the stored average once per input row
        return [self.value] * len(values)
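
As a cross-check, scikit-learn ships an equivalent mean-predicting baseline, `DummyRegressor(strategy="mean")`. The sketch below uses small made-up arrays rather than the notebook's `X_train` and `y_train`; the names `X_demo` and `y_demo` are assumptions for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Illustrative data only; the features are ignored by the dummy model.
X_demo = np.zeros((5, 3))
y_demo = np.array([10.0, 12.0, 8.0, 15.0, 5.0])

dummy = DummyRegressor(strategy="mean")
dummy.fit(X_demo, y_demo)

# Every prediction equals the mean of the training targets.
preds = dummy.predict(np.zeros((2, 3)))
print(preds)
```

If the custom class and the dummy regressor disagree on the same data, that points to a bug in the custom class.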

Now we want to evaluate our baseline model against the test dataset.

In [26]:
# create and train baseline model
baseline = NaiveRegressor()
baseline.fit(X_train, y_train)

In [27]:
# generate predictions
baseline_predictions = baseline.predict(X_test)

In [28]:
# evaluate predictions
baseline_score = evaluate(y_test, baseline_predictions)
print('Baseline RMSE: ', baseline_score[0])
print('Baseline R2: ', baseline_score[1])

Baseline RMSE:  3.7025421957351212
Baseline R2:  -0.01542370731080589


This baseline RMSE is roughly consistent with the results in the original paper: although that study modeled the math and Portuguese classes separately, the RMSEs of its naive regression model were 4.6 and 3.2, respectively. We will log this model using MLflow.

In [29]:
baseline_prefix = 'baseline/'
!mkdir -p {artifact_prefix + baseline_prefix}
baseline_model_path = artifact_prefix + baseline_prefix + 'model.pkl'
baseline_artifacts = {"baseline_model_path": baseline_model_path}
with open(baseline_model_path, 'wb') as f:
    pickle.dump(baseline, f)

In [30]:
with mlflow.start_run() as run:
    mlflow.log_metric("rmse", baseline_score[0])
    mlflow.log_metric("r2", baseline_score[1])

    # log the fitted baseline instance so the stored model can make predictions
    mlflow.pyfunc.log_model(
        artifact_path='baseline',
        python_model=baseline,
        artifacts=baseline_artifacts,
    )

### Linear Regression

In [31]:
with mlflow.start_run():
    lr = LinearRegression()
    lr.fit(X_train, y_train)

    # Evaluate Metrics
    predictions = lr.predict(X_test)
    rmse, r2 = evaluate(y_test, predictions)

    # Print out metrics
    print("Linear Regression model")
    print("  RMSE: %s" % rmse)
    print("  R2: %s" % r2)

    # Log metrics and model to MLflow
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    mlflow.sklearn.log_model(lr, "LinRegModel")

Linear Regression model
  RMSE: 3.3953687970141377
  R2: 0.14607216161348102


The scikit-learn Linear Regression model with default hyperparameters improved on the baseline model: the RMSE fell from ~3.7 to ~3.4 and the R-squared score rose from ~0 to ~0.15. While this improvement is encouraging, the relationship between the features and the target variable may not be linear, so we will explore some non-linear options next.

### KNN

In [32]:
with mlflow.start_run():
    knn = KNeighborsRegressor()
    knn.fit(X_train, y_train)

    # Evaluate Metrics
    predictions = knn.predict(X_test)
    rmse, r2 = evaluate(y_test, predictions)

    # Print out metrics
    print("KNN model")
    print("  RMSE: %s" % rmse)
    print("  R2: %s" % r2)

    # Log metrics and model to MLflow
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    mlflow.sklearn.log_model(knn, "KNNModel")

KNN model
  RMSE: 3.9511300372714353
  R2: -0.1563516164574994


Surprisingly, the scikit-learn KNN regressor with default hyperparameters did worse than our baseline model on both RMSE and R-squared. We had expected that students who are close together in the feature space would be similar in many ways, including school performance. With the default settings, that assumption does not hold for this set of students.
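
One thing worth checking before ruling out KNN is feature scaling, since the algorithm relies on distances. Whether the processed features in this notebook are already scaled is not shown here, so the sketch below is a possible diagnostic on synthetic data, not a diagnosis of the result above. The data and the helper `holdout_rmse` are assumptions for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
n = 600
informative = rng.uniform(0, 1, size=n)     # drives the target
large_scale = rng.uniform(0, 1000, size=n)  # irrelevant, but numerically huge
X = np.column_stack([informative, large_scale])
y = 10 * informative + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

def holdout_rmse(model):
    """Fit on the training part and return RMSE on the held-out part."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2)))

rmse_raw = holdout_rmse(KNeighborsRegressor(n_neighbors=5))
rmse_scaled = holdout_rmse(
    make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
)

print("KNN, raw features:    RMSE %.2f" % rmse_raw)
print("KNN, scaled features: RMSE %.2f" % rmse_scaled)
```

In this toy setup the unscaled model picks neighbors almost entirely by the irrelevant large-scale column, which is the failure mode to look for.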

### Decision Tree

In [33]:
with mlflow.start_run():
    dt = DecisionTreeRegressor()
    dt.fit(X_train, y_train)

    # Evaluate Metrics
    predictions = dt.predict(X_test)
    rmse, r2 = evaluate(y_test, predictions)

    # Print out metrics
    print("Decision Tree model")
    print("  RMSE: %s" % rmse)
    print("  R2: %s" % r2)

    # Log metrics and model to MLflow
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    mlflow.sklearn.log_model(dt, "DecisionTreeModel")

Decision Tree model
  RMSE: 4.796824346327871
  R2: -0.7043347397275002


The scikit-learn Decision Tree regressor with default hyperparameters performs much worse than our baseline model. We may tune some of its hyperparameters later to see whether it improves, but for now we will move on to other algorithm types.
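
A plausible reason for a poor default result is overfitting: with `max_depth=None`, scikit-learn grows the tree until its leaves are pure. The sketch below illustrates the effect on synthetic, noisy data. The data, the depth of 3, and the helper `rmse` are assumptions for the example, and it does not confirm that this is what happened on the student data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + rng.normal(scale=2.0, size=500)  # linear signal plus noise
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

def rmse(model, X_eval, y_eval):
    """RMSE of a fitted model on the given data."""
    return float(np.sqrt(np.mean((y_eval - model.predict(X_eval)) ** 2)))

results = {}
for depth in (None, 3):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=12).fit(X_tr, y_tr)
    results[depth] = (rmse(tree, X_tr, y_tr), rmse(tree, X_te, y_te))
    print("max_depth=%s  train RMSE=%.2f  test RMSE=%.2f" % (depth, *results[depth]))
```

A train RMSE near zero next to a much larger test RMSE suggests the tree has memorized noise, in which case limiting `max_depth` or raising `min_samples_leaf` are natural first things to try.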

### SVM

In [34]:
with mlflow.start_run():
    svm = SVR()
    svm.fit(X_train, y_train)

    # Evaluate Metrics
    predictions = svm.predict(X_test)
    rmse, r2 = evaluate(y_test, predictions)

    # Print out metrics
    print("SVM model")
    print("  RMSE: %s" % rmse)
    print("  R2: %s" % r2)

    # Log metrics and model to MLflow
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    mlflow.sklearn.log_model(svm, "SVMModel")

SVM model
  RMSE: 3.739125728684253
  R2: -0.03558893804497831


The SVM algorithm from scikit-learn with default hyperparameters performed about as well as the naive baseline model (RMSE of 3.74 versus 3.70). Given these poor initial results, it is most likely not a viable algorithm for this problem.

### Random Forest

In [37]:
with mlflow.start_run():
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    
    # Evaluate metrics
    predictions = rf.predict(X_test)
    rmse, r2 = evaluate(y_test, predictions)
    
    # Print out metrics
    print("Random Forest Model")
    print("  RMSE: %s" % rmse)
    print("  R2: %s" % r2)
    
    # Log metrics and model to MLflow
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    
    mlflow.sklearn.log_model(rf, "RFModel")

Random Forest Model
  RMSE: 3.4052293119064196
  R2: 0.14110516379565163


The Random Forest algorithm from scikit-learn with default hyperparameters produced very similar results to our Linear Regression algorithm, which is our best performing model so far. We will circle back to the two or three best performing algorithms for hyperparameter tuning. 
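
When we return to tuning, a cross-validated grid search is one option. The sketch below runs `GridSearchCV` over a deliberately tiny grid on synthetic data. The grid values, the 3 folds, and the data are assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(12)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=150)

# Small illustrative grid; a real search would cover more values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=12),
    param_grid,
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
    cv=3,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV RMSE: %.3f" % -search.best_score_)
```

Fixing `random_state` on the estimator keeps the comparison between grid points from being blurred by run-to-run randomness.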

### XGBoost

For this algorithm, we will create a simple test harness that takes a dictionary of hyperparameters. The function will train an XGBoost model on the training data and print out the RMSE and R2 on the test data.

In [51]:
def xgboost_test_harness(params):
    with mlflow.start_run():
        dtrain = xgb.DMatrix(X_train, label=y_train)
        xgb_model = xgb.train(params, dtrain)
        
        dtest = xgb.DMatrix(X_test)
        predictions = xgb_model.predict(dtest)
        rmse, r2 = evaluate(y_test, predictions)

        # Print out metrics
        print("XGBoost Model")
        print("  RMSE: %s" % rmse)
        print("  R2: %s" % r2)
        
        # Log parameters, metrics, and model to MLflow
        for key, value in params.items():
            mlflow.log_param(key, value)
        
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)

        mlflow.xgboost.log_model(xgb_model, "XGBModel")
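
Because the harness takes a plain dictionary, a small grid of candidate settings can be enumerated with the standard library. The sketch below is hypothetical: `search_space`, `iter_param_dicts`, and `run_grid` are names invented for the example, and the values are illustrative rather than tuned. `max_depth`, `eta`, and `subsample` are standard XGBoost parameters.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not tuned.
search_space = {
    "max_depth": [3, 6],
    "eta": [0.1, 0.3],       # learning rate
    "subsample": [0.8, 1.0],
}

def iter_param_dicts(space):
    """Yield one params dictionary per combination of values in `space`."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def run_grid(space, run_fn):
    """Call `run_fn(params)` for every combination; return how many were run."""
    count = 0
    for params in iter_param_dicts(space):
        run_fn(params)
        count += 1
    return count

# Stand-in callback that only records the params; in the notebook this
# would be xgboost_test_harness, giving one MLflow run per combination.
seen = []
n_runs = run_grid(search_space, seen.append)
print(n_runs, seen[0])
```

Each call to the harness opens its own MLflow run and logs the parameters, so the combinations can be compared afterwards in the MLflow UI.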

In [52]:
params = {}
xgboost_test_harness(params)

XGBoost Model
  RMSE: 3.2829399891681286
  R2: 0.20168708801440738


Despite the Decision Tree algorithm performing exceptionally poorly on this dataset, the XGBoost algorithm produced the best results so far. It improves significantly on the baseline in terms of both RMSE (3.28 compared to 3.70) and R2 (0.20 compared to -0.02). Later, we will see how much we can improve on these metrics by tuning the model's hyperparameters.
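
To compare the runs so far side by side, the test-set metrics printed above can be collected and ranked. The numbers in the sketch below are copied from the cell outputs in this notebook and rounded to four decimal places; the table layout is just one convenient way to present them.

```python
# Test-set metrics copied from the runs above as (RMSE, R2), rounded to 4 places.
results = {
    "Baseline (mean)": (3.7025, -0.0154),
    "Linear Regression": (3.3954, 0.1461),
    "KNN": (3.9511, -0.1564),
    "Decision Tree": (4.7968, -0.7043),
    "SVM": (3.7391, -0.0356),
    "Random Forest": (3.4052, 0.1411),
    "XGBoost": (3.2829, 0.2017),
}

baseline_rmse = results["Baseline (mean)"][0]
ranking = sorted(results.items(), key=lambda item: item[1][0])  # lowest RMSE first

print("%-18s %6s %8s %10s" % ("Model", "RMSE", "R2", "vs. base"))
for name, (rmse, r2) in ranking:
    change = 100 * (rmse - baseline_rmse) / baseline_rmse
    print("%-18s %6.2f %8.2f %+9.1f%%" % (name, rmse, r2, change))
```

None of the default configurations comes close to the target RMSE of <2, so the improvement would have to come from hyperparameter tuning or other changes.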

### Multi-Layer Perceptron