# <span style='color:Blue'> STUDENT PERFORMANCE ANALYSIS </span>

## 1. Import Libraries
To develop our prediction model, we need to import the necessary Python libraries:

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy.stats import loguniform, sem
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, ShuffleSplit, RepeatedKFold

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import make_scorer

from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline

from sklearn.experimental import enable_halving_search_cv  
from sklearn.model_selection import HalvingGridSearchCV

from sklearn.linear_model import LinearRegression, Ridge

%matplotlib inline
sns.set_style('whitegrid')
# render inline figures as SVG (replaces the deprecated IPython.display.set_matplotlib_formats)
%config InlineBackend.figure_format = 'svg'

## 2. Load Data

This dataset comes from the UCI Machine Learning Repository and contains student performance information (available at https://archive.ics.uci.edu/ml/datasets/Student+Performance). The data contains the following features:
<details>
<summary>
<a class="btnfire small stroke"><em class="fas fa-chevron-circle-down"></em>&nbsp;&nbsp;Description of the variables:</a>    
</summary>

    

* `school` - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
* `sex` - student’s sex (binary: ‘F’ - female or ‘M’ - male)
* `age` - student’s age (numeric: from 15 to 22)
* `address` - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
* `famsize` - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
* `Pstatus` - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
* `Medu` - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* `Fedu` - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* `Mjob` - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
* `Fjob` - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
* `reason` - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
* `guardian` - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
* `traveltime` - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
* `studytime` - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
* `failures` - number of past class failures (numeric: n if 1<=n<3, else 4)
* `schoolsup` - extra educational support (binary: yes or no)
* `famsup` - family educational support (binary: yes or no)
* `paid` - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
* `activities` - extra-curricular activities (binary: yes or no)
* `nursery` - attended nursery school (binary: yes or no)
* `higher` - wants to take higher education (binary: yes or no)
* `internet` - Internet access at home (binary: yes or no)
* `romantic` - with a romantic relationship (binary: yes or no)
* `famrel` - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
* `freetime` - free time after school (numeric: from 1 - very low to 5 - very high)
* `goout` - going out with friends (numeric: from 1 - very low to 5 - very high)
* `Dalc` - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
* `Walc` - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
* `health` - current health status (numeric: from 1 - very bad to 5 - very good)
* `absences` - number of school absences (numeric: from 0 to 93)
* `G1` - first period grade (numeric: from 0 to 20)
* `G2` - second period grade (numeric: from 0 to 20)
* `G3` - final grade (numeric: from 0 to 20, output target)
</details>
<br><br>
    
The target we try to predict is `G3`. It represents the grade at the end of the year and therefore determines whether the student passes or fails the school year.

In [None]:
data = pd.read_csv('student-mat.csv')

## 3. Preprocessing

The dataset has multiple categorical variables that need to be encoded. For this purpose we use the `LabelEncoder`.
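
As a small illustration of what the encoder does, the sketch below applies `LabelEncoder` to a toy column (the values are made up for illustration and are not read from the dataset). The labels are sorted and mapped to the integers 0 to n-1, so the codes of a nominal variable such as `Mjob` carry an arbitrary order.

```python
from sklearn.preprocessing import LabelEncoder

# toy column, made up for illustration
toy_jobs = ["teacher", "health", "other", "teacher", "at_home"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_jobs)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(toy_le.classes_))
print(list(toy_codes))

# the mapping can be reversed to recover the original labels
toy_decoded = list(toy_le.inverse_transform(toy_codes))
print(toy_decoded)
```

Note that the notebook reuses one encoder object for all columns, so after the loop only the mapping of the last column remains available for `inverse_transform`.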

In [None]:
class_le = LabelEncoder()
for column in data[["school", "sex", "address", "famsize", "Pstatus",
                  "Mjob", "Fjob", "reason", "guardian", "schoolsup",
                  "famsup", "paid", "activities", "nursery", "higher",
                  "internet", "romantic"]].columns:
    
    data[column] = class_le.fit_transform(data[column].values)

## 4. Linear Regression

### Splitting
Before we build the model, we need to split the data into a training set and a test set. We use the training set to fit the linear regression model and the test set to assess its performance on unseen data. We use 67% of the data for training and the rest for testing. If we measured the performance only on the training set, we would end up with a far too optimistic estimate. A random split into 2/3 and 1/3 is not the only way to split the data. Change `test_size` and see how it affects the performance estimate.
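
To get a feeling for what `test_size` does before touching the real data, the following sketch splits a placeholder array. The row count of 395 is an assumption about `student-mat.csv`, and the array itself is a dummy.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder array with 395 rows (assumed size of student-mat.csv)
dummy = np.arange(395).reshape(-1, 1)

split_sizes = {}
for test_size in (0.25, 0.33, 0.5):
    tr, te = train_test_split(dummy, test_size=test_size, random_state=100)
    split_sizes[test_size] = (len(tr), len(te))
    print("test_size=%.2f -> train=%d, test=%d" % (test_size, len(tr), len(te)))
```

A larger test set gives a less noisy performance estimate but leaves fewer rows for training, which is the trade-off explored in the rest of this section.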

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data,test_size=0.33, random_state=100)

Next, we split the dataset into the source variables (independent variables) and the target variable (dependent variable).

In [None]:
#create X and Y
X_train = train.iloc[:, :-1]
Y_train = train.iloc[:, -1:]

X_test = test.iloc[:, :-1]
Y_test = test.iloc[:, -1:]

In [None]:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('scaler', StandardScaler(), ["age", "traveltime", "studytime", "failures", "famrel",
                  "freetime", "goout", "Dalc", "Walc", "health",
                  "absences", "G1", "G2"])],remainder='passthrough')
sc = StandardScaler()

Scaling input variables is straightforward. In `scikit-learn`, you can use the scaler objects manually, or the more convenient `Pipeline`, which lets you chain a series of data transform objects together before your model.
The `Pipeline` fits the scalers on the training data for you and applies the transform to new data, for example when the model makes a prediction. If you want to try a more elaborate approach, you can build your `Pipeline` with the `ColumnTransformer` above. The transformer ensures that only the continuous variables are scaled and the categorical variables are not. Whether it makes sense to scale binary variables that encode, for example, the sex or the school is debatable. Unscaled categorical variables are also easier to interpret; otherwise you could end up with a floating-point number indicating whether a student is female or male. In the end, you can decide whether or not to scale categorical variables based on predictive performance.
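
The effect of the `ColumnTransformer` can be seen on a tiny example. The frame below is made up for illustration: one continuous column is standardized while the binary column passes through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# toy frame, values made up: one continuous and one binary-encoded column
toy = pd.DataFrame({"age": [15.0, 16.0, 17.0, 18.0], "sex": [0, 1, 0, 1]})

toy_ct = ColumnTransformer([("scaler", StandardScaler(), ["age"])],
                           remainder="passthrough")
toy_out = toy_ct.fit_transform(toy)

# transformed columns come first, passthrough columns are appended after them
print(toy_out)
```

Because the scaled columns are moved to the front, the column order of the output differs from the input frame, which matters when model coefficients are matched back to feature names.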

### Pipeline

In [None]:
pipe = Pipeline([('scaler', ct),
                 ('ridge_regression', Ridge(5))])

In [None]:
model = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())

### Make scorer
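
Before wrapping the metrics in scorers, it can help to compute them once by hand. The numbers below are made up; the sketch evaluates the RMSE and the Pearson correlation directly from their definitions and compares the latter with `np.corrcoef`.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 8.0, 15.0])   # made-up grades
y_pred = np.array([11.0, 11.0, 9.0, 13.0])   # made-up predictions

# RMSE: square root of the mean squared residual
toy_rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Pearson r: co-deviation divided by the product of the deviation norms
dx, dy = y_true - y_true.mean(), y_pred - y_pred.mean()
toy_r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print("RMSE: %.3f, r_P: %.3f" % (toy_rmse, toy_r))
print("reference r_P from np.corrcoef: %.3f" % np.corrcoef(y_true, y_pred)[0, 1])

# a scorer built with greater_is_better=False reports the negated metric,
# so that "larger is better" holds for every scorer during a search
print("value a search would see for RMSE: %.3f" % -toy_rmse)
```

This sign convention is the reason the cross-validation helpers later in the notebook negate or take the absolute value of the returned scores.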

In [None]:
def rmse(actual, predict):
    predict = np.array(predict)
    actual = np.array(actual)

    distance = predict - actual

    square_distance = distance ** 2

    mean_square_distance = square_distance.mean()

    score = np.sqrt(mean_square_distance)

    return score

rmse_scorer = make_scorer(rmse, greater_is_better=False)

In [None]:
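def pearson_correlation(y_true, y_pred):
    # flatten to 1-D float arrays so DataFrames and (n, 1) arrays are handled alike
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    SPxy = np.sum((y_true - np.mean(y_true)) * (y_pred - np.mean(y_pred)))
    SQx = np.sum(np.square(y_true - np.mean(y_true)))
    SQy = np.sum(np.square(y_pred - np.mean(y_pred)))
    # eps guards against division by zero for constant inputs
    return SPxy / (np.sqrt(SQx * SQy) + np.finfo(np.float64).eps)
pearson_scorer = make_scorer(pearson_correlation, greater_is_better=True)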

### K-Fold Cross Validation

Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test split. It works by splitting the dataset into k parts, called folds (e.g. k = 10). The algorithm is trained on k − 1 folds and tested on the fold that was held back. This is repeated so that each fold of the dataset is given a chance to be the held back test set. After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
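
The splitting step can be sketched in a few lines of NumPy. The helper `kfold_indices` below is hypothetical and only illustrates the idea; the notebook itself relies on scikit-learn's `KFold`.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Every sample appears in exactly one test fold, so each observation contributes to the performance estimate exactly once.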

In [None]:
cv_kf = KFold(n_splits=6,shuffle=True,random_state=1)

fig, axs = plt.subplots(2,3, figsize=(10, 6))
fig.subplots_adjust(hspace = .5, wspace=.4)
ax = axs.ravel()

score = list()
rmse_score = list()
pearson_score = list()

# global grade range, used for shared axis limits and the identity line
gl_min = float(Y_train.values.min())
gl_max = float(Y_train.values.max())
for i, (train_ix, test_ix) in enumerate(cv_kf.split(X_train)):
    # split data
    X_tr, X_te = X_train.iloc[train_ix, :], X_train.iloc[test_ix, :]
    y_tr, y_te = Y_train.iloc[train_ix], Y_train.iloc[test_ix]
    # fit and evaluate a model
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    
    rmse_score.append(rmse(y_te, preds))
    score.append(model.score(X_te,y_te))
    pearson_score.append(pearson_correlation(y_te,preds))
    
    ax[i].scatter(y_te, preds, edgecolors=(0, 0, 0))
    # dashed identity line: perfect predictions would fall on it
    ax[i].plot([gl_min, gl_max], [gl_min, gl_max], 'k--', lw=2)
    ax[i].set_xlabel('Observed')
    ax[i].set_ylabel('Predicted')
    ax[i].set_xlim([gl_min - 2, gl_max + 2])
    ax[i].set_ylim([gl_min - 2, gl_max + 2])
    ax[i].set_title("RMSE: %.3f" % (rmse(y_te, preds)))
    
plt.show()

print("Mean RMSE: %.3f and STD: +/- %.3f" % (np.mean(rmse_score), np.std(rmse_score)))
print("Mean R^2: %.3f and STD: +/- %.3f" % (np.mean(score), np.std(score)))   
print("Mean r_P: %.3f and STD: +/- %.3f" % (np.mean(pearson_score), np.std(pearson_score))) 

Now we want to see the effect of k on our performance estimate and run a "sensitivity analysis", that is, evaluate the same model on the same dataset with different values of k and compare the results.
To judge these estimates we need a reference that comes as close as we can get to an “ideal” estimate of model performance. Normally the dataset is not big enough to set aside one more test set, so we use the `LeaveOneOut()` CV method as the reference.

In [None]:
# evaluate the model using a given test condition
def evaluate_model(model, X, y, cv):
    scores = cross_val_score(model, X, y, scoring=rmse_scorer, cv=cv, n_jobs=-1)
    # the scorer returns negated RMSE values, so flip the sign of the mean
    return -np.mean(scores), scores.std()

In [None]:
# calculate the ideal test condition
ideal, _ = evaluate_model(model, X_train, Y_train, LeaveOneOut())
print('Ideal: %.3f' % ideal)

In [None]:
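# build an increasing list of k values with growing gaps: 2, 3, 5, 8, 12, ...
folds = [2]
for i in range(1, 23):
    folds.append(folds[i-1] + i)
# finish with k = number of training samples, which is equivalent to leave-one-out
folds.append(len(X_train))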

In [None]:
# record mean and standard deviation of each set of results
means, stds = list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_std = evaluate_model(model, X_train, Y_train,cv)
    # report performance
    print('Folds=%d, RMSE=%.3f (+/- %.3f)' % (k, k_mean, k_std))
    # store mean RMSE and its standard deviation
    means.append(k_mean)
    stds.append(k_std)

In [None]:
plt.errorbar(folds, means, yerr=stds, fmt='o')

# plot the ideal case in a separate color
#plt.plot(folds, [ideal for _ in range(len(folds))], color='r')

# show the plot
plt.title("Line plot of k mean values with error bars (std)")
plt.xlabel('k')
plt.ylabel('RMSE')
plt.grid()
plt.show()


The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the entire training set and the resampled subsets gets smaller, and with it the bias of the technique. The computational cost, however, grows with k: for training algorithms whose complexity is linear in the number of training instances, it increases almost linearly (asymptotically linearly) with k.
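
The trade-off can be made concrete with simple arithmetic: in k-fold cross-validation each model is trained on a fraction (k − 1)/k of the data, and k models have to be fitted. The value k = 264 below assumes 264 training rows, which makes it the leave-one-out case.

```python
# fraction of the data each model is trained on, for several choices of k
train_fraction = {k: (k - 1) / k for k in (2, 5, 10, 264)}

for k, frac in train_fraction.items():
    print("k=%3d: each model trains on %.1f%% of the data, %d fits needed"
          % (k, 100 * frac, k))
```

Going from k = 10 to leave-one-out therefore adds less than ten percentage points of training data per model while multiplying the number of fits many times over.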

#### Repeated K-Fold Cross Validation

This is where the k-fold cross-validation procedure is repeated `n_repeats` times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
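
A short sketch on a toy array (made up for illustration) shows what `RepeatedKFold` produces: `n_splits * n_repeats` train/test pairs, where every repetition is a complete k-fold partition of the data.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

toy_X = np.arange(20).reshape(-1, 1)
toy_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=10)

toy_splits = list(toy_cv.split(toy_X))
print("number of train/test pairs:", len(toy_splits))  # n_splits * n_repeats

# first test fold of the first and of the second repetition
print(np.sort(toy_splits[0][1]), np.sort(toy_splits[5][1]))
```

Averaging over all pairs reduces the dependence of the estimate on one particular shuffle, at the price of proportionally more model fits.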

In [None]:
cv_rep_kfold = RepeatedKFold(n_splits=10, n_repeats=3,random_state=10)

In [None]:
repeats = range(1,35)

In [None]:
rep_means, rep_stds = list(), list()

for r in repeats:
    # evaluate using a given number of repeats
    cv = RepeatedKFold(n_splits=10, n_repeats=r, random_state=10)
    rep_mean, rep_std = evaluate_model(model, X_train, Y_train,cv)
    # summarize
    print('%d mean=%.4f std=%.3f' % (r, rep_mean, rep_std))
    # store
    rep_means.append(rep_mean)
    rep_stds.append(rep_std)

In [None]:
plt.errorbar(list(repeats), rep_means, yerr=rep_stds, fmt='o')
# show the plot
plt.xlabel('repetitions')
plt.ylabel('RMSE')
plt.grid()
plt.show()

### Halving GridSearch CV vs GridSearchCV

A common way to find good model hyperparameters is a cross-validated search. Instead of evaluating every candidate in the hyperparameter grid with the full budget, the successive halving search strategy “starts evaluating all the candidates with a small number of resources and iteratively selects the best candidates, using more and more resources.” The default resource is the number of samples, but it can be set to any positive-integer model parameter, such as the number of gradient boosting rounds. The halving approach therefore has the potential to find good hyperparameters in less time.
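
The schedule that successive halving follows can be sketched with plain arithmetic. The helper `halving_schedule` below is hypothetical and simplified; it is not scikit-learn's implementation and only illustrates how the number of candidates shrinks while the number of samples grows. The numbers assume the grid used in this notebook (4 solvers × 17 alpha values) and 264 training rows.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return (n_candidates, n_resources) per iteration of successive halving."""
    schedule = []
    resources = min_resources
    while resources <= max_resources and n_candidates >= 1:
        schedule.append((n_candidates, resources))
        if n_candidates == 1:
            break
        n_candidates = math.ceil(n_candidates / factor)  # keep the best 1/factor
        resources *= factor                               # survivors get more samples
    return schedule

# 4 solvers x 17 alpha values = 68 candidates; 264 training rows assumed
schedule = halving_schedule(68, 264 // 4, 264, factor=2)
for n_cand, n_res in schedule:
    print("%2d candidates evaluated on %3d samples" % (n_cand, n_res))
```

Only the last iteration uses all samples, and by then most of the grid has already been discarded on cheaper subsets.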

In [None]:
# define search space
space = dict()
space['regressor__ridge_regression__solver'] = ['svd', 'cholesky', 'lsqr', 'sag']
space['regressor__ridge_regression__alpha'] = [1e-8,1e-5, 1e-4, 5e-4, 1e-3,5e-3, 1e-2, 5e-2, 1e-1,5e-1, 1,2,5,10,20,50, 100]
#define cv
cv_kfold = KFold(n_splits=10,shuffle=True,random_state=1)

In [None]:
grid_search_params = dict(estimator=model,
                          param_grid=space,
                          scoring=rmse_scorer,
                          return_train_score=True,
                          cv=cv_kfold,
                          verbose=0)

In [None]:
%%time

FACTOR = 2
MAX_RESOURCE_DIVISOR = 4

n_samples = len(X_train)
halving_results_n_samples =\
    HalvingGridSearchCV(resource='n_samples',
                        min_resources=n_samples//\
                        MAX_RESOURCE_DIVISOR,
                        factor=FACTOR,
                        **grid_search_params
                        )\
                        .fit(X_train, Y_train)

pd.DataFrame(halving_results_n_samples.best_params_, index=[0])\
    .assign(RMSE=abs(halving_results_n_samples.best_score_))

In [None]:
%%time

full_results = GridSearchCV(**grid_search_params)\
               .fit(X_train, Y_train)

pd.DataFrame(full_results.best_params_, index=[0])\
    .assign(RMSE=abs(full_results.best_score_))

In [None]:
clf = full_results.best_estimator_
clf_params = full_results.best_params_
clf_score = abs(full_results.best_score_)
clf_stdev = full_results.cv_results_['std_test_score'][full_results.best_index_]
cv_results = full_results.cv_results_

print("best parameters: {}".format(clf_params))
print("best score:      {:0.5f} (+/-{:0.5f})".format(clf_score, clf_stdev))

In [None]:
from sklearn.base import clone

# collect one row of ridge coefficients per CV fold
coef_names = ['coef_' + str(x) for x in range(X_train.shape[1])]
fold_coefs = []
for train_index, test_index in cv_kfold.split(X_train):
    X_train_cv, y_train_cv = X_train.iloc[train_index, :], Y_train.iloc[train_index]
    # fit a clone so the refitted best_estimator_ of the grid search stays untouched
    cve = clone(full_results.best_estimator_).fit(X_train_cv, y_train_cv)
    coefs = cve.regressor_['ridge_regression'].coef_
    fold_coefs.append(pd.DataFrame(coefs.reshape(1, -1), columns=coef_names))
# DataFrame.append was removed in pandas 2.0, so concatenate the rows instead
cfs = pd.concat(fold_coefs, ignore_index=True)
g = sns.catplot(data=cfs)
g.set_xticklabels(rotation=90)

In [None]:
# pick out the best results
# =========================
scores_df = pd.DataFrame(cv_results).sort_values(by='rank_test_score')

best_row = scores_df.iloc[0, :]
best_mean = -best_row['mean_test_score']
best_stdev = best_row['std_test_score']
best_param = best_row['param_' + 'regressor__ridge_regression__alpha']

In [None]:
# plot the results
# ================
scores_df = scores_df.sort_values(by='param_' + 'regressor__ridge_regression__alpha')

means = -scores_df['mean_test_score']

stds = scores_df['std_test_score']
params = scores_df['param_' + 'regressor__ridge_regression__alpha']

# plot

plt.figure(figsize=(8, 4))
plt.errorbar(params, means, yerr=stds)

plt.axhline(y=best_mean + best_stdev, color='red')
plt.axhline(y=best_mean - best_stdev, color='red')
plt.plot(best_param, best_mean, 'or')

plt.title('regressor__ridge_regression__alpha' + " vs Score\nBest Score {:0.5f}".format(clf_score))
plt.xlabel('regressor__ridge_regression__alpha')
plt.ylabel('Score')
plt.show()

In [None]:
Y_pred_train = full_results.best_estimator_.predict(X_train)
Y_pred_test = full_results.best_estimator_.predict(X_test)

In [None]:
print(rmse(Y_pred_train,Y_train.values))
print(rmse(Y_pred_test,Y_test.values))

### Nested Cross-validation

The cross-validation procedure is used to estimate how well a machine learning model performs when it makes predictions on, or classifies, data it has not seen during training.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models. The nested CV has an inner loop CV nested in an outer CV. The inner loop is responsible for model selection/hyperparameter tuning (similar to validation set), while the outer loop is for error estimation (test set).

In the inner loop (e.g. `GridSearchCV`), we try to find good parameters using the "inner test set", also known as the validation set. In the outer loop (e.g. `cross_val_score`), the generalization error is estimated by averaging test set scores over several dataset splits.
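
The key property of the nesting is that hyperparameter tuning never sees the rows of the outer test fold. The sketch below checks this on toy indices (made up for illustration) by running an inner `KFold` only on the outer training part.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_idx = np.arange(12)
outer = KFold(n_splits=3, shuffle=True, random_state=0)
inner = KFold(n_splits=2, shuffle=True, random_state=0)

leaks = 0
for outer_train, outer_test in outer.split(toy_idx):
    # the inner loop only ever sees the outer training part
    for inner_train, inner_val in inner.split(outer_train):
        tune_rows = outer_train[np.concatenate([inner_train, inner_val])]
        leaks += len(np.intersect1d(tune_rows, outer_test))
    print("outer test:", outer_test, "| rows available for tuning:", np.sort(outer_train))

print("rows shared between tuning and outer test folds:", leaks)
```

Passing a `GridSearchCV` object to `cross_val_score` produces this structure automatically, because the search is refitted from scratch on every outer training part.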

In [None]:
model.fit(X_train,Y_train);

In [None]:
Y_pred_train = model.predict(X_train)
Y_pred_test = model.predict(X_test)

print("Training RMSE: {:.2f}".format(rmse(Y_pred_train, Y_train))) 
print("Test RMSE: {:.2f}".format(rmse(Y_pred_test, Y_test)))
print("Training R2: {:.2f}".format(r2_score(Y_train, Y_pred_train))) 
print("Test R2: {:.2f}".format(r2_score(Y_test, Y_pred_test)))

In [None]:
# Number of random trials
NUM_TRIALS = 5

In [None]:
# Arrays to store scores
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

# Loop for each trial
for i in range(NUM_TRIALS):

    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
    print("Trial "+str(i))
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=3, shuffle=True, random_state=i)

    # Non_nested parameter search and scoring
    clf = GridSearchCV(estimator=model, param_grid=space, cv=inner_cv)
    clf.fit(X_train, Y_train)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization
    nested_score = cross_val_score(clf, X=X_train, y=Y_train, cv=outer_cv)
    nested_scores[i] = nested_score.mean()

In [None]:
score_difference = non_nested_scores - nested_scores

# Plot scores on each trial for nested and non-nested CV
plt.figure()
plt.subplot(211)
non_nested_scores_line, = plt.plot(non_nested_scores, 'rs')
nested_line, = plt.plot(nested_scores, 'bd')
plt.ylabel("score", fontsize="14")
plt.legend([non_nested_scores_line, nested_line],
           ["Non-Nested CV", "Nested CV"],
           bbox_to_anchor=(0, .4, .5, 0))
plt.title("Non-Nested and Nested Cross Validation",
          x=.5, y=1.1, fontsize="15")

# Plot bar chart of the difference.
plt.subplot(212)
difference_plot = plt.bar(range(NUM_TRIALS), score_difference)
plt.xlabel("Individual Trial #")
plt.legend([difference_plot],
           ["Non-Nested CV - Nested CV Score"],
           bbox_to_anchor=(0, 1, .8, 0))
plt.ylabel("score difference", fontsize="14")

plt.show()

Nested CV effectively uses a series of `train`, `validation` and `test` set splits. In the inner loop (here executed by `GridSearchCV`), the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting hyperparameters over the validation set. In the outer loop (here in `cross_val_score`), generalization error is estimated by averaging test set scores over several dataset splits.
We compare the performance of non-nested and nested CV strategies by taking the difference between their scores.

## Classification

First, it is useful to look at how the students' grades are distributed, so that we can better understand the results. Following the suggestion in [Paulo Cortez and Alice Silva's paper](http://www3.dsi.uminho.pt/pcortez/student.pdf), the student grades can be analysed using a 5-level classification based on the Erasmus grade conversion system.
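
The five levels can also be computed in a vectorized way. The thresholds in the sketch below are read off the `grade_transform` function used in this notebook, and the grades are made up to cover the boundaries of every level.

```python
import numpy as np

# lower bounds of levels 4, 3, 2 and 1 (grades below 10 fall into level 5)
bounds = [10, 12, 14, 16]

toy_g3 = np.array([0, 9, 10, 11, 12, 13, 14, 15, 16, 20])  # made-up grades
# np.digitize counts how many bounds a grade has reached; more bounds = better level
toy_levels = 5 - np.digitize(toy_g3, bounds)

print(dict(zip(toy_g3.tolist(), toy_levels.tolist())))
```

Checking the boundary grades this way is a quick guard against off-by-one errors in the level definitions.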

In [None]:
def grade_transform(g):
    # map a 0-20 grade to one of five levels (1 = highest grades, 5 = lowest)
    if g > 15:
        return 1
    elif g > 13:
        return 2
    elif g > 11:
        return 3
    elif g > 9:
        return 4
    else:
        return 5


data_c = data.copy()
data_c['grades'] = data_c['G3'].apply(grade_transform)


data_c['grades'].value_counts()

In [None]:
data_c.drop('G3', axis=1, inplace=True)

In [None]:
le = LabelEncoder()

data_c['grades'] = le.fit_transform(data_c['grades'])

In [None]:
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(data_c.iloc[:,:-1],data_c.iloc[:,-1],test_size=0.33,
                                                           random_state=1, stratify=data_c.iloc[:,-1])

In [None]:
f, ax = plt.subplots()
sns.histplot(data_c.iloc[:,-1],kde=False,label='All', ax=ax)
sns.histplot(y_train_c+.05, kde=False, label='train', color='green', ax=ax)
sns.histplot(y_test_c+.05, kde=False, label='test', color='orange', ax=ax)
plt.xlabel('G3')
plt.ylabel('Frequency')
plt.xticks([0.25,1,2.25,3,3.75], ['1','2','3','4','5'])
plt.title("Distribution of Classes")
plt.legend()

In [None]:
from sklearn.linear_model import LogisticRegression

# only non-default settings are passed; the very small C means very strong L1 regularization
clf_2 = LogisticRegression(penalty='l1',
                           tol=0.001,
                           C=0.00001,
                           solver='saga',
                           random_state=1,
                           max_iter=1000000)


model_logreg = Pipeline(steps=[('StandardScaler', StandardScaler()), ('LogisticRegression', clf_2)])

In [None]:
pred_train, pred_test = [], []

intervals = np.arange(10, X_train_c.shape[0], 10)

for i in intervals:
    # fit on the first i training samples only
    model_logreg.fit(X_train_c.iloc[:i,:], y_train_c.iloc[:i])
    p_train = model_logreg.score(X_train_c.iloc[:i,:], y_train_c.iloc[:i])
    # always score on the full, fixed test set so the points of the curve are comparable
    p_test = model_logreg.score(X_test_c, y_test_c)
    pred_train.append(p_train)
    pred_test.append(p_test)

In [None]:
#with plt.style.context(('fivethirtyeight')):
plt.plot(intervals, pred_train, marker='o', label='Train')
plt.plot(intervals, pred_test, marker='s', label='Test')
plt.legend(loc='best', numpoints=1)
plt.xlim([0, X_train_c.shape[0]+30])
plt.axvspan(X_train_c.shape[0], 
            X_train_c.shape[0] + X_test_c.shape[0], 
            alpha=0.2, 
            color='steelblue')
plt.ylim([0.1, .70])
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.tight_layout()

Finding a good balance between bias and variance is important for model evaluation and selection.
A proportionally large test set increases the pessimistic bias because the model may not have reached its full capacity yet. In other words, the learning algorithm could have formulated a more powerful, more generalizable hypothesis for classification if it had seen more data (see [Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning](https://arxiv.org/pdf/1811.12808.pdf)).

In [None]:
from sklearn.metrics import confusion_matrix

model_logreg.fit(X_train_c, y_train_c)

y_pred_train_c = model_logreg.predict(X_train_c)
y_pred_test_c = model_logreg.predict(X_test_c)

conf_train = confusion_matrix(y_train_c,y_pred_train_c)
conf_test = confusion_matrix(y_test_c,y_pred_test_c)

fg, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))
sns.heatmap(conf_train, annot=True, fmt="d", ax=ax1)
ax1.set(xlabel="predicted label")
ax1.set_xticklabels(['1','2','3','4','5'])
ax1.set_yticklabels(['1','2','3','4','5'])
ax1.set(ylabel="actual label")
ax1.set(title="Confusion Matrix for training set")
sns.heatmap(conf_test, annot=True, fmt="d", ax=ax2)
ax2.set(xlabel="predicted label")
ax2.set(ylabel="actual label")
ax2.set_xticklabels(['1','2','3','4','5'])
ax2.set_yticklabels(['1','2','3','4','5'])
ax2.set(title="Confusion Matrix for test set")

In [None]:
grid_logreg={"LogisticRegression__C":np.logspace(-3,3,7), "LogisticRegression__penalty":["l1","l2"]}# l1 lasso l2 ridge

In [None]:
cv_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

In [None]:
logreg_cv=GridSearchCV(model_logreg,grid_logreg,cv=cv_kfold)
logreg_cv

In [None]:
logreg_cv.get_params().keys()

In [None]:
logreg_cv.fit(X_train_c,y_train_c)

In [None]:
print(logreg_cv.best_params_)
print(logreg_cv.best_score_)
logreg_results = pd.DataFrame(logreg_cv.cv_results_).sort_values(by='rank_test_score')
logreg_results.head(10)

In [None]:
y_score_test = logreg_cv.decision_function(X_test_c)
y_score_train = logreg_cv.decision_function(X_train_c)
y_pred_train_c = logreg_cv.predict(X_train_c)
y_pred_test_c = logreg_cv.predict(X_test_c)

In [None]:
conf_train = confusion_matrix(y_train_c,y_pred_train_c)
conf_test = confusion_matrix(y_test_c,y_pred_test_c)

fg, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))
sns.heatmap(conf_train, annot=True, fmt="d", ax=ax1)
ax1.set(xlabel="predicted label")
ax1.set(ylabel="actual label")
ax1.set_xticklabels(['1','2','3','4','5'])
ax1.set_yticklabels(['1','2','3','4','5'])
ax1.set(title="Confusion Matrix for training set")
sns.heatmap(conf_test, annot=True, fmt="d", ax=ax2)
ax2.set(xlabel="predicted label")
ax2.set(ylabel="actual label")
ax2.set_xticklabels(['1','2','3','4','5'])
ax2.set_yticklabels(['1','2','3','4','5'])
ax2.set(title="Confusion Matrix for test set")

In [None]:
from imblearn.over_sampling import RandomOverSampler

In [None]:
ros = RandomOverSampler(random_state=0)
X_data_c_ros, y_data_c_ros = ros.fit_resample(data_c.iloc[:,:-1], data_c.iloc[:,-1])
X_train_c_ros, y_train_c_ros = ros.fit_resample(X_train_c, y_train_c)
X_test_c_ros, y_test_c_ros = X_test_c, y_test_c

In [None]:
f, ax = plt.subplots()
sns.histplot(y_data_c_ros,kde=False,label='All', ax=ax)
sns.histplot(y_train_c_ros+.05, kde=False, label='train', color='green', ax=ax)
sns.histplot(y_test_c_ros+.05, kde=False, label='test', color='orange', ax=ax)
plt.xlabel('G3')
plt.ylabel('Frequency')
plt.xticks([0.25,1,2.25,3,3.75], ['1','2','3','4','5'])
plt.title("Distribution of Classes")
plt.legend()

In [None]:
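# the alias avoids shadowing sklearn's Pipeline, which is imported at the top of the notebook
from imblearn.pipeline import Pipeline as ImbPipeline

In [None]:
model_logreg_ros = ImbPipeline(steps=[('StandardScaler', StandardScaler()),('over',RandomOverSampler()),('LogisticRegression', logreg_cv)])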

In [None]:
model_logreg_ros.fit(X_train_c_ros, y_train_c_ros)

y_pred_train_c_ros = model_logreg_ros.predict(X_train_c_ros)
y_pred_test_c_ros = model_logreg_ros.predict(X_test_c_ros)

conf_train_ros = confusion_matrix(y_train_c_ros,y_pred_train_c_ros)
conf_test_ros = confusion_matrix(y_test_c_ros,y_pred_test_c_ros)

fg, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4))
sns.heatmap(conf_train_ros, annot=True, fmt="d", ax=ax1)
ax1.set(xlabel="predicted label")
ax1.set(ylabel="actual label")
ax1.set_xticklabels(['1','2','3','4','5'])
ax1.set_yticklabels(['1','2','3','4','5'])
ax1.set(title="Confusion Matrix for training set")
sns.heatmap(conf_test_ros, annot=True, fmt="d", ax=ax2)
ax2.set(xlabel="predicted label")
ax2.set(ylabel="actual label")
ax2.set_xticklabels(['1','2','3','4','5'])
ax2.set_yticklabels(['1','2','3','4','5'])
ax2.set(title="Confusion Matrix for test set")

## Bootstrap

The idea behind the bootstrap is to generate "new samples" by sampling from an empirical distribution. [Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning](https://arxiv.org/pdf/1811.12808.pdf)
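
A single bootstrap round can be sketched with NumPy alone. The sample size below is arbitrary and chosen large so that the result is stable: drawing n rows with replacement leaves roughly a fraction 1/e ≈ 0.368 of the rows undrawn, and these out-of-bag rows serve as the test set.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n = 10_000  # arbitrary sample size for illustration
idx_demo = np.arange(n)

# one bootstrap sample: n draws with replacement
boot_idx = rng_demo.choice(idx_demo, size=n, replace=True)
# rows that were never drawn form the out-of-bag set used for testing
oob_idx = np.setdiff1d(idx_demo, boot_idx)

oob_fraction = len(oob_idx) / n
print("out-of-bag fraction: %.3f (theory: about %.3f)" % (oob_fraction, np.exp(-1)))
```

Repeating this round many times and scoring the model on each out-of-bag set yields a distribution of scores from which a mean and a confidence interval can be derived.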

In [None]:
rng = np.random.RandomState(seed=12345)

idx = np.arange(Y_train.shape[0])



accuracies = []

for i in range(200):
    
    train_idx = rng.choice(idx, size=idx.shape[0], replace=True)
    
    test_idx = np.setdiff1d(idx, train_idx, assume_unique=False)
    
    boot_train_X, boot_train_y = X_train.iloc[train_idx,:], Y_train.iloc[train_idx,:]
    boot_test_X, boot_test_y = X_train.iloc[test_idx,:], Y_train.iloc[test_idx,:]
    
    model.fit(boot_train_X, boot_train_y)
    pred = model.predict(boot_test_X)
    # for this regressor, score() returns R^2; the list keeps the name 'accuracies'
    acc = model.score(boot_test_X, boot_test_y)
    accuracies.append(acc)

In [None]:
sns.set_style("ticks")

mean = np.mean(accuracies)

#se = np.sqrt( (1. / (100-1)) * np.sum([(acc - mean)**2 for acc in accuracies])) 
#ci = 1.984 * se

# standard error estimated as the sample standard deviation of the bootstrap scores
se = np.std(accuracies, ddof=1)
ci = 1.97 * se

# 95% percentile interval: 2.5th and 97.5th percentiles
lower = np.percentile(accuracies, 2.5)
upper = np.percentile(accuracies, 97.5)

fig, ax = plt.subplots(figsize=(8, 4))
ax.vlines(mean, [0], 70, lw=2.5, linestyle='-', label='mean')
#ax.vlines(med, [0], 60, lw=2.5, linestyle='--', label='median')
ax.vlines(lower, [0], 20, lw=2.5, linestyle='-.', label='CI95 percentile')
ax.vlines(upper, [0], 40, lw=2.5, linestyle='-.')

ax.vlines(mean + ci, [0], 40, lw=2.5, linestyle=':', label='CI95 standard')
ax.vlines(mean - ci, [0], 20, lw=2.5, linestyle=':')


ax.hist(accuracies, bins=7,
        color='#0080ff', edgecolor="none", 
        alpha=0.3)
plt.legend(loc='upper left')
sns.despine(offset=10, trim=True)
plt.xlim([0.7, 0.9])
plt.tight_layout()


plt.show()