#**Data preparation:**
1. Removed columns "Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", and "Embarked".
2. Converted object columns to numbers with pandas.get_dummies.
3. Filled nulls with a value of 0.0.
4. Transformed data with MinMaxScaler() method.
5. Randomly split the training set into training and validation subsets.

#**Training Gradient Boosting classifier:**
1. Computed the accuracy scores on the training and validation subsets for several learning rates. With a learning rate of 0.5, the accuracy scores on the training and validation subsets were 0.826 and 0.834, respectively.
2. Trained the Gradient Boosting classifier on the training subset with n_estimators=20, learning_rate=0.5, max_features=2, max_depth=2, and random_state=0. On the validation subset, the weighted-average precision, recall, and f1-score were 0.84, 0.83, and 0.83, respectively. The area under the ROC curve (AUC) was 0.88.

In [46]:
#Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
#Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.[1][2]
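
The stage-wise idea described above can be illustrated with a minimal sketch. It is not scikit-learn's implementation: it assumes a squared-error loss on a single feature, uses one-split decision stumps as the weak learners, and the data and the names `fit_stump`, `boost`, and `mse` are made up for illustration. For squared error the negative gradient is simply the residual, so each stage fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split that minimises squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=20, learning_rate=0.5):
    """Stage-wise boosting for squared-error loss (illustrative only)."""
    base = sum(ys) / len(ys)          # stage 0: predict the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # residuals are the negative gradient of 0.5 * (y - pred) ** 2
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data, not taken from the Titanic set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.1, 0.9, 1.1, 3.9, 4.2]

one_stage = boost(xs, ys, n_estimators=1)
twenty_stages = boost(xs, ys, n_estimators=20)
print(mse(one_stage, xs, ys), mse(twenty_stages, xs, ys))
```

The training error shrinks as stages are added. `GradientBoostingClassifier` follows the same pattern but optimises a classification loss rather than squared error.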

In [47]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [48]:
# load data
train = pd.read_csv("data/Gradient/train.csv")
test = pd.read_csv("data/Gradient/test.csv")

In [49]:
train.info(), test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

(None, None)

In [50]:
# set "PassengerId" variable as index
train.set_index("PassengerId", inplace=True)
test.set_index("PassengerId", inplace=True)

In [51]:
# generate training target set (y_train)
y_train = train["Survived"]

In [52]:
# delete column "Survived" from train set
train.drop(labels="Survived", axis=1, inplace=True)

In [53]:
# shapes of train and test sets
train.shape, test.shape

((891, 10), (418, 10))

In [54]:
# join train and test sets to form a new train_test set
train_test = pd.concat([train, test])

In [55]:
# delete columns that are not used as features for training and prediction
columns_to_drop = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
train_test.drop(labels=columns_to_drop, axis=1, inplace=True)

In [56]:
# convert objects to numbers by pandas.get_dummies
train_test_dummies = pd.get_dummies(train_test, columns=["Sex"])
train_test_dummies

PassengerId,Pclass,Fare,Sex_female,Sex_male
1,3,7.2500,0,1
2,1,71.2833,1,0
3,3,7.9250,1,0
4,1,53.1000,1,0
5,3,8.0500,0,1
...,...,...,...,...
1305,3,8.0500,0,1
1306,1,108.9000,1,0
1307,3,7.2500,0,1
1308,3,8.0500,0,1
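
As a rough illustration of what the encoding step does, the sketch below one-hot encodes a single categorical column in plain Python. It is not how pandas implements `get_dummies`; the helper `one_hot` is hypothetical, and the five input values are the `Sex` entries implied by the first five rows shown above.

```python
def one_hot(values, prefix):
    """Return indicator column names and rows for one categorical column."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


# "Sex" values for PassengerId 1..5, inferred from the indicator columns above.
sex = ["male", "female", "female", "female", "male"]

columns, rows = one_hot(sex, prefix="Sex")
print(columns)
for row in rows:
    print(row)
```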


In [57]:
# check the dimension
train_test_dummies.shape

(1309, 4)

In [58]:
# replace nulls with 0.0
train_test_dummies.fillna(value=0.0, inplace=True)
train_test_dummies

PassengerId,Pclass,Fare,Sex_female,Sex_male
1,3,7.2500,0,1
2,1,71.2833,1,0
3,3,7.9250,1,0
4,1,53.1000,1,0
5,3,8.0500,0,1
...,...,...,...,...
1305,3,8.0500,0,1
1306,1,108.9000,1,0
1307,3,7.2500,0,1
1308,3,8.0500,0,1


In [59]:
# generate feature sets (X)
X_train = train_test_dummies.values[0:891]
X_test = train_test_dummies.values[891:]

In [60]:
X_train.shape, X_test.shape

((891, 4), (418, 4))

In [61]:
X_train

array([[ 3.    ,  7.25  ,  0.    ,  1.    ],
       [ 1.    , 71.2833,  1.    ,  0.    ],
       [ 3.    ,  7.925 ,  1.    ,  0.    ],
       ...,
       [ 3.    , 23.45  ,  1.    ,  0.    ],
       [ 1.    , 30.    ,  0.    ,  1.    ],
       [ 3.    ,  7.75  ,  0.    ,  1.    ]])

In [62]:
X_test

array([[ 3.    ,  7.8292,  0.    ,  1.    ],
       [ 3.    ,  7.    ,  1.    ,  0.    ],
       [ 2.    ,  9.6875,  0.    ,  1.    ],
       ...,
       [ 3.    ,  7.25  ,  0.    ,  1.    ],
       [ 3.    ,  8.05  ,  0.    ,  1.    ],
       [ 3.    , 22.3583,  0.    ,  1.    ]])

In [63]:
# transform data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scale = scaler.fit_transform(X_train)
X_train_scale

array([[1.        , 0.01415106, 0.        , 1.        ],
       [0.        , 0.13913574, 1.        , 0.        ],
       [1.        , 0.01546857, 1.        , 0.        ],
       ...,
       [1.        , 0.04577135, 1.        , 0.        ],
       [0.        , 0.0585561 , 0.        , 1.        ],
       [1.        , 0.01512699, 0.        , 1.        ]])

In [64]:
X_test_scale = scaler.transform(X_test)
X_test_scale

array([[1.        , 0.01528158, 0.        , 1.        ],
       [1.        , 0.01366309, 1.        , 0.        ],
       [0.5       , 0.01890874, 0.        , 1.        ],
       ...,
       [1.        , 0.01415106, 0.        , 1.        ],
       [1.        , 0.01571255, 0.        , 1.        ],
       [1.        , 0.0436405 , 0.        , 1.        ]])
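
The scaler learns each column's minimum and maximum from the training features only, then applies the same mapping, (x - min) / (max - min), to the test features. A minimal plain-Python sketch for one column follows. The helper names are hypothetical, and the sample values are the `Pclass` entries from the first rows of the arrays shown above.

```python
def fit_min_max(column):
    """Learn the scaling range from the training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Map values to [0, 1] using a previously learned range (assumes hi > lo)."""
    return [(x - lo) / (hi - lo) for x in column]


pclass_train = [3, 1, 3, 1, 3]   # first five training rows
pclass_test = [3, 3, 2]          # first three test rows

lo, hi = fit_min_max(pclass_train)           # fit on training data only
train_scaled = apply_min_max(pclass_train, lo, hi)
test_scaled = apply_min_max(pclass_test, lo, hi)

print(train_scaled)  # [1.0, 0.0, 1.0, 0.0, 1.0]
print(test_scaled)   # [1.0, 1.0, 0.5]
```

The results agree with the first column of `X_train_scale` and `X_test_scale`, where a `Pclass` of 2 becomes 0.5.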

In [65]:
# split training feature and target sets into training and validation subsets
from sklearn.model_selection import train_test_split

X_train_sub, X_validation_sub, y_train_sub, y_validation_sub = train_test_split(X_train_scale, y_train, random_state=0)
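
`train_test_split` is called here without `test_size`, so scikit-learn's default of 0.25 applies: the validation subset gets ceil(0.25 × 891) = 223 rows and the training subset the remaining 668. The sketch below mimics that shuffle-and-split in plain Python. It is an illustration only: the helper `shuffle_split` is hypothetical, and its shuffling does not reproduce the row order that scikit-learn produces with `random_state=0`.

```python
import math
import random


def shuffle_split(n_samples, test_size=0.25, seed=0):
    """Return (train_indices, validation_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)   # validation size is rounded up
    return indices[n_test:], indices[:n_test]


train_idx, validation_idx = shuffle_split(891)
print(len(train_idx), len(validation_idx))  # 668 223
```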

In [66]:
# import machine learning algorithms
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

In [67]:
# train with Gradient Boosting algorithm
# compute the accuracy scores on train and validation sets when training with different learning rates

learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in learning_rates:
    gb = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb.fit(X_train_sub, y_train_sub)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb.score(X_train_sub, y_train_sub)))
    print("Accuracy score (validation): {0:.3f}".format(gb.score(X_validation_sub, y_validation_sub)))
    print()

Learning rate:  0.05
Accuracy score (training): 0.789
Accuracy score (validation): 0.780

Learning rate:  0.1
Accuracy score (training): 0.792
Accuracy score (validation): 0.780

Learning rate:  0.25
Accuracy score (training): 0.816
Accuracy score (validation): 0.803

Learning rate:  0.5
Accuracy score (training): 0.826
Accuracy score (validation): 0.834

Learning rate:  0.75
Accuracy score (training): 0.831
Accuracy score (validation): 0.789

Learning rate:  1
Accuracy score (training): 0.831
Accuracy score (validation): 0.789
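
The sweep can be summarised by picking the learning rate with the highest validation accuracy. The sketch below hard-codes the scores printed above rather than refitting the models; the dictionary name `scores` is made up for illustration.

```python
# learning rate -> (training accuracy, validation accuracy), copied from the output above
scores = {
    0.05: (0.789, 0.780),
    0.1: (0.792, 0.780),
    0.25: (0.816, 0.803),
    0.5: (0.826, 0.834),
    0.75: (0.831, 0.789),
    1: (0.831, 0.789),
}

# choose the rate whose validation accuracy is highest
best_rate = max(scores, key=lambda rate: scores[rate][1])
print("Best learning rate:", best_rate)  # Best learning rate: 0.5

# gap between training and validation accuracy for each rate
for rate, (train_acc, validation_acc) in scores.items():
    print(rate, round(train_acc - validation_acc, 3))
```

At 0.75 and 1 the training accuracy is slightly higher while the validation accuracy falls to 0.789, which hints at overfitting. Since this rests on a single validation split, it is an indication rather than a firm conclusion.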



In [68]:
# Output confusion matrix and classification report of Gradient Boosting algorithm on validation set

gb = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_features=2, max_depth=2, random_state=0)
gb.fit(X_train_sub, y_train_sub)
predictions = gb.predict(X_validation_sub)

print("Confusion Matrix:")
print(confusion_matrix(y_validation_sub, predictions))
print()
print("Classification Report")
print(classification_report(y_validation_sub, predictions))

Confusion Matrix:
[[131   8]
 [ 29  55]]

Classification Report
              precision    recall  f1-score   support

           0       0.82      0.94      0.88       139
           1       0.87      0.65      0.75        84

    accuracy                           0.83       223
   macro avg       0.85      0.80      0.81       223
weighted avg       0.84      0.83      0.83       223
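
Every figure in the classification report can be derived from the four counts in the confusion matrix. The sketch below recomputes them in plain Python; the helper `precision_recall_f1` is a hypothetical name, and the counts are copied from the output above.

```python
# Confusion matrix rows are true classes, columns are predicted classes.
tn, fp, fn, tp = 131, 8, 29, 55


def precision_recall_f1(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


class1 = precision_recall_f1(tp, fp, fn)
# For class 0 the roles flip: tn are its hits, fn its false alarms, fp its misses.
class0 = precision_recall_f1(tn, fn, fp)

support0, support1 = tn + fp, fn + tp
total = support0 + support1
accuracy = (tn + tp) / total

# The weighted average weights each class's metric by its support.
weighted = tuple((support0 * m0 + support1 * m1) / total
                 for m0, m1 in zip(class0, class1))

print([round(m, 2) for m in class0])    # [0.82, 0.94, 0.88]
print([round(m, 2) for m in class1])    # [0.87, 0.65, 0.75]
print(round(accuracy, 3))               # 0.834
print([round(m, 2) for m in weighted])  # [0.84, 0.83, 0.83]
```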



In [69]:
# ROC curve and Area-Under-Curve (AUC)

y_scores_gb = gb.decision_function(X_validation_sub)
fpr_gb, tpr_gb, _ = roc_curve(y_validation_sub, y_scores_gb)
roc_auc_gb = auc(fpr_gb, tpr_gb)

print("Area under ROC curve = {:0.2f}".format(roc_auc_gb))

Area under ROC curve = 0.88
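
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up labels and scores (not the notebook's data). `pairwise_auc` is a hypothetical helper, and for large datasets the `roc_curve` and `auc` route used above is far more efficient.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: three of the four pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(labels, scores)
print(toy_auc)  # 0.75
```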
