# Student performance (Kaggle) (Moritz)
- https://www.kaggle.com/c/184702-tu-ml-ws-18-student-performance
- small sample size (train = 198), medium dimensionality (32 attributes)
- attribute characteristics: numeric, categorical
- Predict: Grade
- Result file cols: id, Grade
- Missing values: No

## with preprocessing
- scale (fit the scaler on the train data only, then transform both train and test data)
    - _SVR:_ very long runtime without scaling
- merge train and test data
- one-hot encode categorical data (dropping the first dummy column of each attribute)
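
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual helpers: `preprocess` is a hypothetical name, `StandardScaler` is an assumed choice of scaler, and the toy columns `age` and `sex` are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical helper: scale numeric columns, then one-hot encode."""
    num_cols = train.select_dtypes(include="number").columns

    # Fit the scaler on the train data only, then transform both sets
    scaler = StandardScaler().fit(train[num_cols])
    train_s, test_s = train.copy(), test.copy()
    train_s[num_cols] = scaler.transform(train[num_cols])
    test_s[num_cols] = scaler.transform(test[num_cols])

    # Merge train and test so both get identical dummy columns
    data = pd.concat([train_s, test_s], keys=["train", "test"])
    data_oh = pd.get_dummies(data, drop_first=True)

    # Split back into train and test via the outer index level
    return data_oh.loc["train"], data_oh.loc["test"]


# Toy data (assumed column names, not the Kaggle attributes)
train = pd.DataFrame({"age": [15, 16, 17, 18], "sex": ["F", "M", "F", "M"]})
test = pd.DataFrame({"age": [16, 19], "sex": ["M", "F"]})
X_train, X_test = preprocess(train, test)
print(list(X_train.columns))
```

Encoding the merged frame, rather than train and test separately, keeps the dummy columns aligned even if a category occurs in only one of the two sets.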

### Linear Regression
R^2 and RMSE are computed on the training data itself, without(!) a train-test split, so they are optimistic compared to the Kaggle score.
- without scaling 
- with all samples
    - <1 s
    - R^2: 0.40843
    - RMSE: 3.56123
    
    
- with preprocessing
- with all samples:
    - <1 s
    - R^2: 0.40843
    - RMSE: 3.56123
    - Kaggle: 4.75835
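
For reference, the two metrics reported above can be computed as in the sketch below. The function names and the toy values are illustrative only; the notebook itself imports `mean_squared_error` and `r2_score` from scikit-learn.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the student performance data
y_true = [10, 12, 8, 15]
y_pred = [11, 11, 9, 14]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

When `y_pred` comes from a model fitted on the same `y_true`, both numbers describe the fit rather than the generalization error, which is consistent with the gap between the training RMSE and the Kaggle score.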
    
### SVR
- without scaling 
- with all samples
    - few seconds
    - C: 0.2, kernel: linear, epsilon: 0.5, gamma: auto 
    - RMSE: 4.25981
    - Kaggle: 4.47988
   
   
- with preprocessing
- with all samples:
    - few seconds
    - C: 0.2, kernel: linear, epsilon: 0.5, gamma: auto 
    - RMSE: 4.26364
    - Kaggle: 4.51673
    
### Gradient Boosted Decision Tree
- without scaling
- with all samples
    - 91.043 s
    - {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 150}
    - RMSE: 3.88092
    - Kaggle: 4.26606
    
    
- with preprocessing
- with all samples:
    - few minutes
    - {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 150}
    - RMSE: 3.8868
    - Kaggle: __4.24893__
    
### AutoML
- without scaling
- with all samples:
    - few minutes (max 300/600 s)
    - XGBoost_grid_1_AutoML_20190106_131826_model_5
    - RMSE: 3.61816
    - Kaggle: 4.30642
        
        
- more models, more time
    - max_models=50, max_runtime=1800
    - <15 mins
    - GBM_1_AutoML_20190106_151220
    - RMSE: 3.62069
    - Kaggle: 4.39525
    
    
- more models, more time
    - max_models=60, max_runtime=7200
    - 2108 s
    - GBM_1_AutoML_20190106_211629
    - RMSE: 3.62069
    - Kaggle: 4.39525
    
    
- with preprocessing
- with all samples
    - few minutes (max 600 s)
    - XGBoost_grid_1_AutoML_20190106_133539_model_5
    - RMSE: 3.64129
    - Kaggle: 4.30690

In [50]:
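import datetime as dt

import h2o
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Shared helpers used below (scale_data, one_hot, linear_reg, run_svr, ...)
# are expected to come from base.ipynb
%run './base.ipynb'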

In [51]:
# Import and preprocessing

# read train data
train = pd.read_csv('./data/student_performance_kaggle/StudentPerformance.shuf.train.csv')
# index rows by the marker 'train' and the id column
train['_index'] = 'train'
train.set_index(['_index', 'id'], inplace=True)
# extract, then drop 'Grade' col
train_target = train[['Grade']]
train.drop(['Grade'], axis='columns', inplace=True)

# read test data
test = pd.read_csv('./data/student_performance_kaggle/StudentPerformance.shuf.test.csv')
# index rows by the marker 'test' and the id column
test['_index'] = 'test'
test.set_index(['_index', 'id'], inplace=True)

# scale train and test data (currently disabled)
#train_s, test_s = scale_data(train, test)

# concat train and test data for further preprocessing
#data_s = pd.concat([train_s, test_s])
data_s = pd.concat([train, test])

# one hot encode data
data_oh = one_hot(data_s, drop_first=True)

#display(data_oh)

# split data into train and test
X_train = data_oh.loc['train']
y_train = train_target
X_test = data_oh.loc['test']

#display(X_test)

In [49]:
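# Linear Regression
# NOTE: the training data is also passed as evaluation data, so the reported
# R^2 and RMSE are training-set scores (no hold-out split)
reg = linear_reg(X_train, y_train, X_train, y_train)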
result = pd.DataFrame(reg.predict(X_test), columns=['Grade'])

# join id col
result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')

# Save result
filename = f'''lr_{dt.datetime.now()}.csv'''

result.to_csv('./predictions/student_performance_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

#display(result)

  linalg.lstsq(X, y)


R^2 value for model: 0.40843
Predict:
RMSE: 3.56123
R^2 Score: 0.40843
Saved as lr_2019-01-06 11:44:38.341174.csv
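
The saved filename above embeds `str(dt.datetime.now())`, which contains a space and colons; colons are not allowed in Windows file names. A possible alternative, sketched here with a hypothetical helper `result_filename`, formats the timestamp explicitly:

```python
import datetime as dt


def result_filename(prefix, now=None):
    """Hypothetical helper: build '<prefix>_<timestamp>.csv' without spaces or colons."""
    now = now or dt.datetime.now()
    return f"{prefix}_{now:%Y-%m-%d_%H-%M-%S}.csv"


# Fixed timestamp so the result is reproducible
name = result_filename("lr", dt.datetime(2019, 1, 6, 11, 44, 38))
print(name)  # lr_2019-01-06_11-44-38.csv
```

Names in this form also sort chronologically in a directory listing.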


In [12]:
# SVR
# params
param_grid = {
    'C': np.linspace(.2,1,5),
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly'], # poly very slow
    'epsilon': np.linspace(0,.5,6),
    'gamma': ['auto', 'scale']
}

# run grid search
gs = run_svr(X_train, y_train.values.ravel(), cv=5, param_grid=param_grid)

# predict
result = pd.DataFrame(gs.best_estimator_.predict(X_test), columns=['Grade'])

# join id col
result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')
#display(result)

# Create SVR filename
filename = f'''svr_'''\
           f'''C-{gs.best_estimator_.C}_'''\
           f'''k-{gs.best_estimator_.kernel}_'''\
           f'''e-{gs.best_estimator_.epsilon}_'''\
           f'''g-{gs.best_estimator_.gamma}_'''\
           f'''{dt.datetime.now()}.csv'''

result.to_csv('./predictions/student_performance_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

GridSearch initializing...
SVR model in training...
MSE: 18.14595, RMSE: 4.25981, 
C: 0.2, kernel: linear, epsilon: 0.5, gamma: auto 
Saved as svr_C-0.2_k-linear_e-0.5_g-auto_2019-01-05 13:12:21.368568.csv
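
To get a feeling for the cost of the SVR search above, the number of candidates in `param_grid` can be counted directly. The sketch below assumes that `run_svr` performs an exhaustive grid search with 5-fold cross-validation, which the log output suggests but the notebook does not show; the `np.linspace` values are written out by hand.

```python
from itertools import product

param_grid = {
    "C": [0.2, 0.4, 0.6, 0.8, 1.0],             # np.linspace(.2, 1, 5)
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
    "epsilon": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],  # np.linspace(0, .5, 6)
    "gamma": ["auto", "scale"],
}
cv = 5

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * cv  # one fit per candidate and fold

print(n_candidates, n_fits)  # 240 1200
```

Note that `gamma` has no effect for the linear kernel, so some of these candidates are effectively duplicates.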


In [15]:
# Gradient Boosted Decision Tree
param_fix = {
    'learning_rate': .01, 
    'loss': 'ls'
}

param_grid = {
    'n_estimators': (50, 100, 150, 200, 300, 400, 500), 
    'max_depth': (1, 2, 3, 4, 5), 
    'min_samples_split': (2, 3, 5)
}

gs = run_boosted_tree(X_train, y_train.values.ravel(), [], [], param_fix=param_fix, cv=10, param_grid=param_grid)

#plot_scores(gbt.cv_results_)
#plot_training_deviance(gbt, test_data, test_target)

# predict
result = pd.DataFrame(gs.best_estimator_.predict(X_test), columns=['Grade'])

# join id col
result = pd.concat([X_test.reset_index()[['id']], result], axis='columns')
#display(result)

# Create GBDT filename
filename = f'''gbdtree_'''\
           f'''ne-{gs.best_estimator_.n_estimators}_'''\
           f'''md-{gs.best_estimator_.max_depth}_'''\
           f'''mss-{gs.best_estimator_.min_samples_split}_'''\
           f'''{dt.datetime.now()}.csv'''

result.to_csv('./predictions/student_performance_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

GridSearch initializing...
GradientBoostedRegressor model in training...
GradientBoostedRegressor model selected and fitted in 91.043 s

MSE: 15.06155, RMSE: 3.88092
Best parameters selected by GridSearch: {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 150}
Saved as gbdtree_ne-150_md-5_mss-5_2019-01-05 13:16:29.721969.csv
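
`run_boosted_tree` is defined in `base.ipynb` and not shown here. A minimal sketch of what such a helper might look like, assuming it wraps scikit-learn's `GridSearchCV` around a `GradientBoostingRegressor` and reports the cross-validated RMSE, is given below. The name `run_boosted_tree_sketch`, the synthetic data, and the tiny grid are illustrative assumptions. The `loss` entry is left at its default (squared error), because the value `'ls'` used above was renamed to `'squared_error'` in later scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


def run_boosted_tree_sketch(X, y, param_fix, param_grid, cv):
    """Hypothetical stand-in for run_boosted_tree: grid search plus CV RMSE."""
    gs = GridSearchCV(
        GradientBoostingRegressor(random_state=0, **param_fix),
        param_grid=param_grid,
        scoring="neg_mean_squared_error",
        cv=cv,
    )
    gs.fit(X, y)
    # best_score_ is the negated mean CV MSE of the best candidate
    cv_rmse = float(np.sqrt(-gs.best_score_))
    return gs, cv_rmse


# Synthetic regression data, unrelated to the student performance set
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=60)

gs, cv_rmse = run_boosted_tree_sketch(
    X, y,
    param_fix={"learning_rate": 0.1},
    param_grid={"n_estimators": [20, 50], "max_depth": [1, 2]},
    cv=3,
)
print(gs.best_params_, round(cv_rmse, 3))
```

With the full grid from the cell above (7 x 5 x 3 = 105 candidates) and `cv=10`, such a search would train 1050 models before the final refit.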


In [52]:
# AutoML
# reset indices to id col
X_train_r = X_train.reset_index()
#display(X_train_r)
y_train_r = y_train.reset_index().drop(['_index'], axis=1)
#display(y_train_r.info())

# create train-test-split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train_r, y_train_r, random_state=100)
train = pd.merge(X_train1, y_train1, on='id')

# drop id cols from test set
X_test2 = X_test1.drop(['id'], axis=1)
y_test2 = y_test1.drop(['id'], axis=1)

y_name = 'Grade'
aml = run_autoML_moritz(train, y_name, X_test2, y_test2, 
                        max_models=60, max_runtime=7200)

Checking whether there is an H2O instance running at http://localhost:54321. connected.


- H2O cluster uptime: 7 hours 40 mins
- H2O cluster timezone: Europe/Vienna
- H2O data parsing timezone: UTC
- H2O cluster version: 3.22.1.1
- H2O cluster version age: 9 days
- H2O cluster name: H2O_from_python_Moritz_rnh332
- H2O cluster total nodes: 1
- H2O cluster free memory: 2.248 Gb
- H2O cluster total cores: 8
- H2O cluster allowed cores: 4


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
AutoML training performed in 2108.1329061985016 s.


'AutoML Leaderboard'

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
| --- | --- | --- | --- | --- | --- |
| GBM_1_AutoML_20190106_211629 | 13.1094 | 3.62069 | 13.1094 | 2.8155 | 0.649204 |
| XGBoost_grid_1_AutoML_20190106_211629_model_6 | 13.2591 | 3.64131 | 13.2591 | 2.82833 | 0.738679 |
| XGBoost_1_AutoML_20190106_211629 | 13.423 | 3.66374 | 13.423 | 2.79087 | |
| XGBoost_grid_1_AutoML_20190106_211629_model_15 | 13.435 | 3.66537 | 13.435 | 2.88957 | 0.689862 |
| StackedEnsemble_BestOfFamily_AutoML_20190106_211629 | 13.4503 | 3.66746 | 13.4503 | 2.8267 | |
| XGBoost_grid_1_AutoML_20190106_211629_model_14 | 13.5741 | 3.6843 | 13.5741 | 2.8659 | |
| XGBoost_grid_1_AutoML_20190106_211629_model_13 | 13.5962 | 3.68731 | 13.5962 | 2.82454 | 0.627185 |
| StackedEnsemble_AllModels_AutoML_20190106_211629 | 13.6686 | 3.69711 | 13.6686 | 2.82826 | 0.730744 |
| XGBoost_grid_1_AutoML_20190106_211629_model_5 | 13.8476 | 3.72124 | 13.8476 | 2.82065 | 0.68809 |
| GBM_4_AutoML_20190106_211629 | 14.1335 | 3.75946 | 14.1335 | 2.93219 | 0.709803 |




In [53]:
print(aml.leader)

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_1_AutoML_20190106_211629


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.44882574579644285
RMSE: 0.669944584123525
MAE: 0.5051495918651691
RMSLE: 0.20301530801369005
Mean Residual Deviance: 0.44882574579644285

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 13.109400767268239
RMSE: 3.620690647827875
MAE: 2.815498404822502
RMSLE: 0.6492037721377621
Mean Residual Deviance: 13.109400767268239
Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mae | 2.8150265 | 0.1394206 | 2.8430521 | 2.9040978 | 2.7677615 | 2.4792871 | 3.0809343 |
| mean_residual_deviance | 13.119622 | 1.424498 | 12.186991 | 13.756295 | 11.902788 | 11.013168 | 16.73887 |
| mse | 13.119622 | 1.424498 | 12.186991 | 13.756295 | 11.902788 | 11.013168 | 16.73887 |
| r2 | 0.320604 | 0.1188542 | 0.2609466 | 0.4785771 | 0.4665351 | 0.3740150 | 0.0229463 |
| residual_deviance | 13.119622 | 1.424498 | 12.186991 | 13.756295 | 11.902788 | 11.013168 | 16.73887 |
| rmse | 3.6119804 | 0.1913363 | 3.490987 | 3.708948 | 3.4500418 | 3.3186092 | 4.091316 |
| rmsle | 0.641794 | 0.0673895 | 0.497742 | 0.7511578 | 0.7228349 | 0.5683178 | 0.6689175 |


Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
| --- | --- | --- | --- | --- | --- |
| 2019-01-06 21:16:35 | 0.247 sec | 0.0 | 4.5098099 | 3.3380205 | 20.3383857 |
| 2019-01-06 21:16:35 | 0.254 sec | 5.0 | 3.2972115 | 2.4351116 | 10.8716038 |
| 2019-01-06 21:16:35 | 0.260 sec | 10.0 | 2.4488150 | 1.8365814 | 5.9966947 |
| 2019-01-06 21:16:35 | 0.267 sec | 15.0 | 1.8991992 | 1.4378336 | 3.6069576 |
| 2019-01-06 21:16:35 | 0.273 sec | 20.0 | 1.4244858 | 1.0656341 | 2.0291598 |
| 2019-01-06 21:16:35 | 0.278 sec | 25.0 | 1.1230544 | 0.8385357 | 1.2612512 |
| 2019-01-06 21:16:35 | 0.298 sec | 30.0 | 0.8724866 | 0.6554668 | 0.7612329 |
| 2019-01-06 21:16:35 | 0.303 sec | 35.0 | 0.7426656 | 0.5568892 | 0.5515522 |
| 2019-01-06 21:16:35 | 0.306 sec | 37.0 | 0.6699446 | 0.5051496 | 0.4488257 |


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
| --- | --- | --- | --- |
| failures | 2105.8503418 | 1.0 | 0.1358485 |
| absences | 1682.9316406 | 0.7991696 | 0.1085660 |
| id | 1237.9248047 | 0.5878503 | 0.0798586 |
| Medu | 1224.8724365 | 0.5816522 | 0.0790165 |
| goout | 946.3896484 | 0.4494097 | 0.0610516 |
| ... | ... | ... | ... |
| Fjob_health | 20.0193501 | 0.0095065 | 0.0012914 |
| Mjob_teacher | 19.1801281 | 0.0091080 | 0.0012373 |
| guardian_other | 15.8760872 | 0.0075390 | 0.0010242 |



See the whole table with table.as_data_frame()



In [54]:
# create predictions for test data
X_test_h2o = h2o.H2OFrame(X_test.reset_index())
result = aml.predict(X_test_h2o)
#result.head(rows=result.nrows)

result_df = result.as_data_frame()
result_df[['id']] = X_test.reset_index()[['id']]
result_df.rename({'predict': 'Grade'}, axis=1, inplace=True)
#display(result_df)

# save to file
filename = f'''autoML_'''\
           f'''{dt.datetime.now()}.csv'''

result_df.to_csv('./predictions/student_performance_kaggle/' + filename, sep = ",", index=False)
print(f'''Saved as {filename}''')

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
Saved as autoML_2019-01-06 21:53:51.664177.csv


In [39]:
# shutdown h2o cluster
h2o.cluster().shutdown()

H2O session _sid_b5e8 closed.


In [137]:
# Unused
# Get a feeling for the dataset

# Check if train DataFrame has NaNs
if train.isnull().values.any():
    print('NaNs!')
else:
    print('Nons!')

Nons!
