Unsure of what to expect. We will, however, drop the educational special needs feature and the 'credited' and 'without evaluations' curricular-unit features (both 1st and 2nd semester) to simplify our model. See ceda_i for the reasoning behind this.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier

import warnings #To make things cleaner...
warnings.filterwarnings('ignore')

In [3]:
from xgboost import XGBClassifier
# from xgboost import plot_importance

In [4]:
df = pd.read_csv('../data/train.csv')
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('.', '', regex=False) #regex=False so '.' is a literal dot on any pandas version

df.drop(['educational_special_needs', 'curricular_units_1st_sem_(credited)', 'curricular_units_1st_sem_(without_evaluations)', 'curricular_units_2nd_sem_(credited)', 'curricular_units_2nd_sem_(without_evaluations)'],axis = 1, inplace=True)

print(df.shape)
df.head()

(76518, 33)


Unnamed: 0,id,marital_status,application_mode,application_order,course,daytime/evening_attendance,previous_qualification,previous_qualification_(grade),nacionality,mother's_qualification,...,curricular_units_1st_sem_(approved),curricular_units_1st_sem_(grade),curricular_units_2nd_sem_(enrolled),curricular_units_2nd_sem_(evaluations),curricular_units_2nd_sem_(approved),curricular_units_2nd_sem_(grade),unemployment_rate,inflation_rate,gdp,target
0,0,1,1,1,9238,1,1,126.0,1,1,...,6,14.5,6,7,6,12.428571,11.1,0.6,2.02,Graduate
1,1,1,17,1,9238,1,1,125.0,1,19,...,4,11.6,6,9,0,0.0,11.1,0.6,2.02,Dropout
2,2,1,17,2,9254,1,1,137.0,1,3,...,0,0.0,6,0,0,0.0,16.2,0.3,-0.92,Dropout
3,3,1,1,3,9500,1,1,131.0,1,19,...,7,12.59125,8,11,7,12.82,11.1,0.6,2.02,Enrolled
4,4,1,1,2,9500,1,1,132.0,1,19,...,6,12.933333,7,12,6,12.933333,7.6,2.6,0.32,Graduate


In [5]:
unique_categories = df['target'].unique()
print(unique_categories)
category_to_int = {category: idx for idx, category in enumerate(unique_categories)}
df['target_encoded'] = df['target'].map(category_to_int)

['Graduate' 'Dropout' 'Enrolled']
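Since the integer codes simply follow the order in which the labels first appear, it helps to keep the inverse lookup around for turning predictions back into text later (e.g. for a submission file). A minimal sketch, assuming the label order printed above; `int_to_category` and `example_predictions` are names made up for illustration:

```python
# Labels in the order df['target'].unique() printed them above.
unique_categories = ['Graduate', 'Dropout', 'Enrolled']

# Same construction as the cell above: label -> integer code.
category_to_int = {category: idx for idx, category in enumerate(unique_categories)}

# Inverse lookup: integer code -> label.
int_to_category = {idx: category for category, idx in category_to_int.items()}

# Hypothetical model output, purely for illustration.
example_predictions = [0, 2, 1, 0]
decoded = [int_to_category[code] for code in example_predictions]
print(decoded)
```

Note that the codes depend on row order, so a reshuffled file could produce a different mapping.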


Hmm, without a formal data science problem it's not obvious which metric to optimize. I suppose we'll go with the default: accuracy.
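For reference, accuracy is just the fraction of correct predictions, and a useful sanity check is the score you'd get by always guessing the most common class. A rough sketch on toy labels (these are NOT the real class proportions of train.csv, and both helper functions are my own, not from any library):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(y_true).most_common(1)[0]
    return most_common_label, count / len(y_true)

# Toy labels, purely hypothetical.
y_true = ['Graduate', 'Graduate', 'Dropout', 'Enrolled', 'Graduate', 'Dropout']
y_pred = ['Graduate', 'Dropout', 'Dropout', 'Enrolled', 'Graduate', 'Graduate']

print(accuracy(y_true, y_pred))   # 4 of 6 correct
print(majority_baseline(y_true))  # always guessing 'Graduate'
```

Any model worth keeping should clear the majority baseline by a comfortable margin.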

I'll try to reserve comments for something of interest.

In [6]:
X = df.drop(['target', 'id', 'target_encoded'], axis = 1) #Starting out, at least, with 'everything'. Except id, of course, as that adds no value.
y = df['target_encoded']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.2, #Perhaps subject to change, but we'll go with it for now
                                                    random_state = 26, #Recall that I like this number
                                                    stratify=y) #Lest a disproportionate amount of one class end up in either split.

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

In [16]:
#Generally I'll only preserve the last bit of parameters I tried in this section of the code, not bothering to record every single iteration...

pipe = Pipeline([
    ('lr', LogisticRegression(max_iter=2500))
])

pipe_params = {'lr__C' : [.85, .95, .9]
              }

gs_lrcv = GridSearchCV(pipe,
                  param_grid=pipe_params,
                  cv=5)

gs_lrcv.fit(X_train, y_train)

print(gs_lrcv.score(X_train, y_train), gs_lrcv.score(X_test, y_test))
print(gs_lrcv.best_score_)
print(gs_lrcv.best_params_)

0.8165452347502206 0.818086774699425
0.815842750975745
{'lr__C': 0.9}


Already off to a nice start. I.e., the simple logistic regression (albeit with a few features already removed) gave us an overall accuracy of .816 on our train split and .818 on test, so if anything it is slightly underfit. The current leaderboard shows the very best at .84179, so it's not like we'd improve all that much in this case by doing something fancier.

Yet, we will anyways!

In [8]:
pipe = Pipeline([
    ('rfc', RandomForestClassifier(random_state=26))
])

pipe_params = {'rfc__min_samples_split' : [2,3,5]
               ,'rfc__min_samples_leaf' : [1,2,3]
               ,'rfc__max_features' : ['sqrt', 'log2']
              }
#Will keep grid search to make things easier
gs_rfc = GridSearchCV(pipe,
                  param_grid=pipe_params,
                  cv=None)

gs_rfc.fit(X_train, y_train)

print(gs_rfc.score(X_train, y_train), gs_rfc.score(X_test, y_test))
print(gs_rfc.best_score_)
print(gs_rfc.best_params_)

0.9528212500408403 0.8261238891792996
0.8257261443622657
{'rfc__max_features': 'log2', 'rfc__min_samples_leaf': 2, 'rfc__min_samples_split': 2}


Reminder that there's a class_weight parameter.
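As a note to self on what that parameter would do: scikit-learn documents the 'balanced' option as weighting each class by n_samples / (n_classes * class_count), so rarer classes count for more. A plain-Python sketch of that formula on toy labels (the counts below are made up, not the real class counts, and `balanced_class_weights` is my own helper):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * count), per class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy encoded labels (0, 1, 2), purely hypothetical.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)
print(weights)  # the rarest class (2) gets the largest weight
```

Whether this is worth using depends on the actual problem; it usually trades some overall accuracy for better recall on the minority classes.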

A nice bit of improvement with relative ease. On the topic, let's play with this a bit more, as we already have an almost .01 improvement in test accuracy just from switching from logistic regression to random forests.

Note that, at the moment, unless I get close to the top of the current leaderboard I won't bother submitting.

In [9]:
pipe = Pipeline([
    ('svc', SVC())
])

pipe_params = {'svc__C' : [1.0, .95, .9]
               ,'svc__kernel' : ['rbf']
               ,'svc__degree' : [3] #Oh, recall this only matters if your kernel is poly.
               #,'svc__' : []
              }
#Will keep grid search to make things easier
gs_svc = GridSearchCV(pipe,
                  param_grid=pipe_params,
                  cv=3)

gs_svc.fit(X_train, y_train)

print(gs_svc.score(X_train, y_train), gs_svc.score(X_test, y_test))
print(gs_svc.best_score_)
print(gs_svc.best_params_)

0.8321299049237103 0.8206351280710925
0.8188322993076408
{'svc__C': 1.0, 'svc__degree': 3, 'svc__kernel': 'rbf'}


Aww, not as accurate. Considering its runtime (30ish minutes?), and that was with just 3 CV folds, we may call it here. Reminder that SVC also has class weights, relevant to the potential data science problem.

In [7]:
#To at least make some actual progress today I'll see if KNN gives anything useful:

pipe = Pipeline([
    ('knn', KNeighborsClassifier())
])

pipe_params = {'knn__n_neighbors' : [17,13,9]
               ,'knn__p' : [1,2,3]
#                ,'knn__' : []
#                ,'knn__' : []
#                ,'knn__' : []
              }
#Will keep grid search to make things easier
gs_knn = GridSearchCV(pipe,
                  param_grid=pipe_params,
                  cv=3)

gs_knn.fit(X_train, y_train)

print(gs_knn.score(X_train, y_train), gs_knn.score(X_test, y_test))
print(gs_knn.best_score_)
print(gs_knn.best_params_)

0.8110726304440161 0.7978959749085206
0.7909301576423821
{'knn__n_neighbors': 17, 'knn__p': 2}


Also has weights, for the record.

Unfortunately a worse test accuracy than even our preferred default logistic regression, yet worth a try. If its accuracy were a bit higher I could see opting for it, as its explainability is quite nice.
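To make the explainability point concrete: a KNN prediction can be justified by literally listing the neighbours that voted. A toy sketch in plain Python (hypothetical 2-D points, not the scaled features used above; `knn_predict_with_neighbors` is my own helper, not the scikit-learn implementation):

```python
import math
from collections import Counter

def knn_predict_with_neighbors(train_points, train_labels, query, k=3):
    """Majority label among the k nearest points, plus those neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    neighbors = distances[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Toy data, purely hypothetical.
train_points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_labels = ['Dropout', 'Dropout', 'Graduate', 'Graduate', 'Graduate']

prediction, neighbors = knn_predict_with_neighbors(train_points, train_labels, (0.2, 0.2))
print(prediction)
for distance, label in neighbors:
    print(f"{label} at distance {distance:.3f}")
```

The printed neighbours are the whole explanation: "we predicted this because these k similar students had that outcome".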

Given that the features are definitely not independent, I won't even pretend to be naive enough to use the various Naive Bayes classifiers.

I suppose I'll bust out a neural net later; eh, I don't know. Part of me wants to (and on that note finally get into xgboost), but just in the context of the data science problem I'm finding it hard to get motivated about it. Especially when this is meant to be a child of a passion project and a temporarily final portfolio piece, I really should step it up... It's one thing to end up with the logistic regression or rf, but without trying...

In [15]:
#When it's my first time using it I don't think I can say 'reminder', but know that the categories need to be integers. Thankfully
#we wrote the code for that already.

xgb = XGBClassifier(random_state=26)

param_grid = {
    'learning_rate': [.4, .5, .3]
    ,'n_estimators': [200, 150, 250]
    ,'max_depth': [2, 3]
    ,'subsample': [.9]
    ,'colsample_bytree': [1.0]
    ,'reg_alpha': [.004]
    ,'reg_lambda': [1.25]
    ,'min_child_weight': [1]
    ,'gamma': [0]
}

gs_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid, 
                           scoring='accuracy', cv=3, verbose=1, n_jobs=-1)

gs_xgb.fit(X_train, y_train)

y_pred = gs_xgb.predict(X_test)

print(gs_xgb.best_params_)
print(gs_xgb.best_score_)
print(accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 18 candidates, totalling 54 fits
{'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.4, 'max_depth': 2, 'min_child_weight': 1, 'n_estimators': 200, 'reg_alpha': 0.004, 'reg_lambda': 1.25, 'subsample': 0.9}
0.8308230390720691
0.8335729221118662
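The '18 candidates' in the log above is just the product of the list lengths in the grid, and each candidate is fit once per fold. A quick sketch that reproduces the count with itertools (the grid is copied from the cell above; `candidates` and `cv_folds` are names I made up):

```python
from itertools import product

# Same grid as in the XGBoost cell above.
param_grid = {
    'learning_rate': [.4, .5, .3],
    'n_estimators': [200, 150, 250],
    'max_depth': [2, 3],
    'subsample': [.9],
    'colsample_bytree': [1.0],
    'reg_alpha': [.004],
    'reg_lambda': [1.25],
    'min_child_weight': [1],
    'gamma': [0],
}

# One candidate per combination of values.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
cv_folds = 3

print(len(candidates), len(candidates) * cv_folds)
```

Adding one more value to any list multiplies the total, which is why these searches get slow quickly. With the default refit=True there is also one final fit of the best candidate on the full training split.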


Played a bit more and got it a bit higher, and I'm quite pleased with XGBoost's performance. Still not at the top, but with my current level of motivation we'll leave it here. Even within this incredible booster, we can do much more. However, tuning, albeit important, is not a matter of skill (save testing one's patience, or one's ability to be satisfied watching YouTube or the like in the interim while the algorithm somebody else coded is running). Furthermore, what CEO or board would get behind using such a model under the mantra, 'Well, mathematically this approach is much more accurate than the simple logistic regression'?

Quite shocking to me that apparently this is still somewhat underfit.

Regardless, it still behooves me to investigate (eventually) the top scorers' methods, whether it was via XGBoost (or its cousins) or some other 'super' package such as this.
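Pulling the printed train/test accuracies from the cells above into one place makes the over/underfit picture easier to see. A small sketch, with the scores copied from the outputs and rounded to four places; XGBoost is left out because its training accuracy was never printed (only its CV score of .8308 and test score of .8336). Reading the train-minus-test gap as a sign of overfitting is only a rough heuristic:

```python
# (train accuracy, test accuracy) copied from the printed outputs, rounded.
reported = {
    'logistic regression': (0.8165, 0.8181),
    'random forest': (0.9528, 0.8261),
    'SVC (rbf)': (0.8321, 0.8206),
    'KNN': (0.8111, 0.7979),
}

# Positive gap: better on train than test (a hint of overfitting).
gaps = {name: round(train_acc - test_acc, 4)
        for name, (train_acc, test_acc) in reported.items()}

# List models from best to worst test accuracy.
for name in sorted(reported, key=lambda n: reported[n][1], reverse=True):
    train_acc, test_acc = reported[name]
    print(f"{name:<20} train={train_acc:.4f}  test={test_acc:.4f}  gap={gaps[name]:+.4f}")
```

By this crude measure the random forest has by far the largest gap, while the logistic regression is the only one that scores (marginally) better on test than on train.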