## Machine Learning Notebook for deployment on Flask

The purpose of this notebook is to create an ML model for a Flask web application.

First, import the relevant libraries and check the characteristics of the dataset. The dataset was obtained from [this Kaggle page](https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention).

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pickle
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv('dataset.csv')
df.info()
print(df.shape)  # a bare expression in the middle of a cell is not displayed
display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 35 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   int64  
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   int64  
 4   Daytime/evening attendance                      4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Nacionality                                     4424 non-null   int64  
 7   Mother's qualification                          4424 non-null   int64  
 8   Father's qualification                          4424 non-null   int64  
 9   Mother's occupation                      

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,6,1,11,1,1,1,1,3,4,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,5,1,1,1,22,27,10,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,8,2,15,1,1,1,23,27,6,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,12,1,3,0,1,1,22,28,10,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [3]:
df.rename(columns = {'Nacionality':'Nationality'}, inplace = True)

In [4]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Target'] = le.fit_transform(df['Target'])
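
`LabelEncoder` assigns integer codes in sorted order of the class labels, so it can help to print the mapping before interpreting correlations or predictions. A minimal sketch on a toy Series, assuming the `Target` column holds the three labels `Dropout`, `Enrolled` and `Graduate` (only `Dropout` and `Graduate` are visible in the preview above; the toy values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df['Target']; the values are illustrative assumptions
target = pd.Series(['Dropout', 'Graduate', 'Enrolled', 'Graduate', 'Dropout'])

le = LabelEncoder()
encoded = le.fit_transform(target)

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform turns integer codes back into the original labels
decoded = le.inverse_transform(encoded)
```

Keeping the fitted `le` around (or recording the mapping) makes it possible to turn model predictions back into readable labels later.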

That covers the data cleaning required for this project. Next, we check the correlation values of the dataset to pick our feature variables.

In [5]:
corr = df.corr()

# Mask the upper triangle (including the diagonal) so each pair appears once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan

# Show the correlation heatmap with 2-digit precision.
# Styler.set_precision() is deprecated; Styler.format(precision=...) replaces it.
(corr
 .style
 .background_gradient(cmap='coolwarm', axis=None, vmin=-1, vmax=1)
 .highlight_null(color='#f1f1f1')  # Color NaNs grey
 .format(precision=2))



Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nationality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Displaced,Educational special needs,Debtor,Tuition fees up to date,Gender,Scholarship holder,Age at enrollment,International,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
Marital status,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Application mode,0.22,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Application order,-0.13,-0.25,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Course,0.02,-0.09,0.12,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Daytime/evening attendance,-0.27,-0.27,0.16,-0.07,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Previous qualification,0.12,0.43,-0.2,-0.16,-0.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Nationality,-0.02,-0.0,-0.03,-0.0,0.02,-0.04,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Mother's qualification,0.19,0.09,-0.06,0.06,-0.2,0.02,-0.04,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Father's qualification,0.13,0.07,-0.05,0.05,-0.14,0.01,-0.09,0.52,,,,,,,,,,,,,,,,,,,,,,,,,,,
Mother's occupation,0.07,0.03,-0.05,0.03,-0.04,0.01,0.04,0.3,0.21,,,,,,,,,,,,,,,,,,,,,,,,,,


We observe the features that are highly correlated with `Target`:
- Tuition fees up to date (0.41)
- Scholarship holder (0.30)
- Curricular units 1st sem (approved) (0.53)
- Curricular units 1st sem (grade) (0.49)
- Curricular units 2nd sem (approved) (0.62)
- Curricular units 2nd sem (grade) (0.57)

These are the variables that will be used for the ML model.
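
A shortlist like the one above can also be produced programmatically by ranking features by their absolute correlation with the target. A minimal sketch on synthetic data: the column names `a`, `b`, `c`, the 0.3 threshold, and the data itself are hypothetical. It calls `corr()` afresh because the masked matrix from the heatmap cell has `NaN` throughout its `Target` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 3, size=n)

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    'a': target + rng.normal(0, 0.5, size=n),   # strongly related to the target
    'b': -target + rng.normal(0, 2.0, size=n),  # weaker, negative relation
    'c': rng.normal(0, 1.0, size=n),            # unrelated noise
    'Target': target,
})

# Correlation of every feature with Target, excluding Target itself
target_corr = toy.corr()['Target'].drop('Target')

# Rank by absolute value so strong negative correlations are not overlooked
ranked = target_corr.abs().sort_values(ascending=False)
selected = ranked[ranked > 0.3].index.tolist()

print(ranked.round(2))
print(selected)
```

With the real data, `selected` could then be passed directly as the column list when building `x`.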

In [6]:
x = df[['Tuition fees up to date','Scholarship holder', 'Curricular units 1st sem (approved)', 'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (approved)', 'Curricular units 2nd sem (grade)']]
y = df['Target']

We select `RandomForestClassifier`, fit it on the training set (70% of the data), and evaluate it on the held-out test set (30%).

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
model = RandomForestClassifier()
model.fit(x_train, y_train)

In [8]:
y_pred = model.predict(x_test)

In [9]:
score = model.score(x_test, y_test)
print(score)

0.7063253012048193


In [10]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Note: these metrics treat the encoded integer class labels as numbers
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(mse, mae)


0.5466867469879518 0.37801204819277107


Earlier runs of this notebook with a linear model gave a poor model score and poor MSE and MAE values, which suggests that the relationship between the features and the target is not linear. Therefore, `RandomForestClassifier` was chosen as the model for this project.

Once the model is fitted, we save it to a pickle file with `pickle.dump()` for the Flask web app deployment.

In [11]:
filename = 'model.pkl'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model, f)
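
On the Flask side, the saved file would be loaded back with `pickle.load()` and used for prediction. A minimal round-trip sketch, assuming a toy model and a temporary path (both hypothetical). Input columns need to be in the same order used for training, unpickling generally expects a compatible scikit-learn version, and only trusted pickle files should be loaded.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for the one trained in this notebook
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.integers(0, 3, size=100)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary location (hypothetical path, for illustration only)
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back, as a web app would do once at startup
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# The loaded model should reproduce the original model's predictions
sample = X[:5]
same = np.array_equal(model.predict(sample), loaded.predict(sample))
print(same)
```

Loading the model once at application startup, rather than on every request, avoids re-reading the file for each prediction.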