<h2>PyCaret - Titanic dataset - Asir</h2>

In this notebook, I explain how I used PyCaret to build a survival prediction model on the Titanic dataset.

#### First of all, I will install and import some necessary Python libraries to prepare the environment that I need to create a machine learning model.

**pandas** is for data analysis and manipulation;

**matplotlib** is for data visualization;

**numpy** is for mathematical functions and operations;

and **PyCaret** is a low-code machine learning library.


In [1]:
%pip install pandas
%pip install matplotlib
%pip install numpy
%pip install pycaret

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

#### Reading the cleaned Titanic dataset into a variable, titanic_data, and displaying the first five rows.

In [3]:
titanic_data = pd.read_csv("files/Titanic cleaned.csv")
titanic_data.head()

Unnamed: 0.1,Unnamed: 0,passenger_class,survived,name,sex,age,number_of_siblings,number_of_parents,ticket,fare,cabin,embarked,boat,body,destination
0,0,1,1,"Allen, Miss. Elisabeth Walton",0,29.0,0,0,24160,211.3375,B5,S,2,0.0,"St Louis, MO"
1,1,1,1,"Allison, Master. Hudson Trevor",1,1.0,1,2,113781,151.55,C22 C26,S,11,0.0,"Montreal, PQ / Chesterville, ON"
2,2,1,0,"Allison, Miss. Helen Loraine",0,2.0,1,2,113781,151.55,C22 C26,S,0,0.0,"Montreal, PQ / Chesterville, ON"
3,3,1,0,"Allison, Mr. Hudson Joshua Creighton",1,30.0,1,2,113781,151.55,C22 C26,S,0,135.0,"Montreal, PQ / Chesterville, ON"
4,4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",0,25.0,1,2,113781,151.55,C22 C26,S,0,0.0,"Montreal, PQ / Chesterville, ON"
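The `Unnamed: 0` column in the output above is typically what pandas produces when a CSV was saved together with its row index and then read back without `index_col`. The following is a minimal sketch, not part of the original notebook, of how such a column could be kept out of the data at load time. The inline CSV text is a hypothetical three-row excerpt built from values in the table above. Whether to drop the column is a judgment call: the results in this notebook were produced with it included.

```python
import io

import pandas as pd

# Hypothetical excerpt: the first header field is empty because the file
# was (presumably) written with its row index as the first column.
csv_text = (
    ",passenger_class,survived,age\n"
    "0,1,1,29.0\n"
    "1,1,1,1.0\n"
    "2,1,0,2.0\n"
)

# Default read: the saved index comes back as an ordinary column.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```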


#### Now, let's import all the modules and functions from the classification subpackage of PyCaret, which contains tools for creating and comparing classification models.

We then call the **setup** function with the following parameters.

**ignore_features** lists the columns to exclude from modelling: name, embarked, boat, ticket, cabin, and destination.

**target** is the column to predict, survived.

**train_size** is 0.8, which means 80% of the data will be used for training and the remaining 20% for testing.

**data_split_shuffle** is set to **True** so that the rows are randomly shuffled before being split into the train and test sets. To reproduce the same results across different runs, **session_id** is set to **0**.

In [4]:
from pycaret.classification import *
clf = setup(data = titanic_data,
            ignore_features = ['name', 'embarked', 'boat', 'ticket', 'cabin', 'destination'],
            target = 'survived',
            train_size = 0.8,
            data_split_shuffle = True,
            session_id = 0)

Unnamed: 0,Description,Value
0,Session id,0
1,Target,survived
2,Target type,Binary
3,Original data shape,"(1309, 15)"
4,Transformed data shape,"(1309, 9)"
5,Transformed train set shape,"(1047, 9)"
6,Transformed test set shape,"(262, 9)"
7,Ignore features,6
8,Numeric features,8
9,Preprocess,True
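To make the effect of `train_size`, `data_split_shuffle`, and `session_id` concrete, here is a minimal sketch of a seeded, shuffled 80/20 split over row positions. It is not PyCaret's internal implementation, and the helper name `shuffled_split` is made up for illustration. It only shows why 1309 rows end up as 1047 training rows and 262 test rows, and why a fixed seed gives the same split every time.

```python
import random


def shuffled_split(n_rows, train_size=0.8, seed=0):
    """Return (train_idx, test_idx) after a seeded shuffle of row positions."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_size)  # rounds down: int(1047.2) == 1047
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = shuffled_split(1309, train_size=0.8, seed=0)
print(len(train_idx), len(test_idx))  # 1047 262

# Same seed, same split; this is the role session_id plays in setup().
assert shuffled_split(1309, seed=0) == (train_idx, test_idx)
```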


#### Now, let's get the training and test features (without the target column) and save them in two different variables.

In [5]:
train_data = get_config('X_train')
test_data = get_config('X_test')

#### Now, let's compare the results of different models. The table below shows that the Gradient Boosting Classifier (gbc) has the best accuracy, AUC, precision, F1, Kappa, and MCC, although not the best recall, so we are going to use it.

In [6]:
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.8119,0.8692,0.6875,0.7983,0.7345,0.5909,0.5983,0.021
lightgbm,Light Gradient Boosting Machine,0.7995,0.8576,0.6925,0.764,0.7234,0.5673,0.5716,0.104
rf,Random Forest Classifier,0.7871,0.8538,0.68,0.7435,0.708,0.5415,0.545,0.04
ada,Ada Boost Classifier,0.7851,0.8513,0.6975,0.7283,0.7113,0.5406,0.5421,0.019
lr,Logistic Regression,0.7824,0.8413,0.695,0.7227,0.7054,0.5337,0.5367,0.365
et,Extra Trees Classifier,0.7776,0.835,0.675,0.7267,0.6975,0.5225,0.5255,0.035
ridge,Ridge Classifier,0.7775,0.0,0.68,0.7217,0.6975,0.5224,0.5254,0.005
lda,Linear Discriminant Analysis,0.7775,0.8329,0.68,0.7217,0.6975,0.5224,0.5254,0.005
dt,Decision Tree Classifier,0.747,0.7346,0.6825,0.6645,0.6723,0.4664,0.4677,0.006
knn,K Neighbors Classifier,0.6802,0.7023,0.555,0.588,0.5685,0.3153,0.3175,0.245
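The Accuracy, Recall, Prec., and F1 columns are standard binary classification metrics. As a rough illustration of how they relate to each other, the sketch below computes them from true and predicted labels. The labels are made up for the example and are not taken from the Titanic data, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall answers "of the passengers who survived, how many did the model find?", while precision answers "of the passengers predicted to survive, how many actually did?". This is why a model can lead on precision and still trail on recall, as gbc does in the table.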


#### Now, we create the gbc model and tune it. Tuning did not improve the results (mean accuracy across the ten folds drops from 0.8119 to about 0.8061), so we keep the original gbc model for making predictions.

In [7]:
gbc = create_model('gbc');
tuned_gbc = tune_model(gbc)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8381,0.9073,0.825,0.7674,0.7952,0.6616,0.6628
1,0.7429,0.8615,0.725,0.6444,0.6824,0.4676,0.4699
2,0.8,0.8312,0.65,0.7879,0.7123,0.5612,0.5673
3,0.7524,0.8112,0.55,0.7333,0.6286,0.4485,0.4589
4,0.8095,0.8758,0.675,0.7941,0.7297,0.5842,0.5888
5,0.7714,0.8331,0.525,0.8077,0.6364,0.4804,0.5041
6,0.8667,0.8902,0.725,0.9062,0.8056,0.706,0.7162
7,0.875,0.8828,0.8,0.8649,0.8312,0.7322,0.7336
8,0.8269,0.8828,0.725,0.8056,0.7632,0.6274,0.6296
9,0.8365,0.916,0.675,0.871,0.7606,0.6395,0.6515


Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8381,0.8946,0.825,0.7674,0.7952,0.6616,0.6628
1,0.7714,0.8377,0.725,0.6905,0.7073,0.52,0.5204
2,0.8095,0.8669,0.7,0.7778,0.7368,0.5882,0.5902
3,0.7429,0.8162,0.525,0.7241,0.6087,0.4244,0.4365
4,0.819,0.8646,0.65,0.8387,0.7324,0.599,0.6101
5,0.8,0.8408,0.6,0.8276,0.6957,0.5523,0.5681
6,0.8381,0.8871,0.65,0.8966,0.7536,0.6376,0.6559
7,0.8077,0.8945,0.75,0.75,0.75,0.5938,0.5938
8,0.8077,0.8756,0.725,0.7632,0.7436,0.5899,0.5904
9,0.8269,0.8967,0.7,0.8235,0.7568,0.6238,0.6288


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
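One way to check the message above against the two fold tables is to average the per-fold accuracy of each. The values below are copied from the Accuracy columns. The sketch assumes the first table belongs to `create_model` and the second to `tune_model`, which fits the note that the displayed metrics are for the tuned model.

```python
from statistics import mean

# Accuracy per fold, copied from the first table (assumed: original gbc).
original_accuracy = [0.8381, 0.7429, 0.8, 0.7524, 0.8095,
                     0.7714, 0.8667, 0.875, 0.8269, 0.8365]

# Accuracy per fold, copied from the second table (assumed: tuned gbc).
tuned_accuracy = [0.8381, 0.7714, 0.8095, 0.7429, 0.819,
                  0.8, 0.8381, 0.8077, 0.8077, 0.8269]

original_mean = mean(original_accuracy)
tuned_mean = mean(tuned_accuracy)

print(f"original: {original_mean:.4f}")  # original: 0.8119
print(f"tuned:    {tuned_mean:.4f}")     # tuned:    0.8061
```

The first mean matches the 0.8119 reported for gbc by `compare_models()`, and the second is lower, which agrees with the original model being returned.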


#### Now, let's evaluate the model.

In [8]:
evaluate_model(gbc)


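#### Now, let's finalize the model, which refits it on the full dataset (including the held-out test rows), and print the resulting pipeline.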

In [9]:
final_gbc = finalize_model(gbc)
print(final_gbc)



Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['Unnamed: 0', 'passenger_class',
                                             'sex', 'age', 'number_of_siblings',
                                             'number_of_parents', 'fare',
                                             'body'],
                                    transformer=SimpleImputer(add_indicator=False,
                                                              copy=True,
                                                              fill_value=None,
                                                              keep_empty_features=False,
                                                              missing_values=nan,
                                                              strategy='mean',
                                                              verbose='dep...
                       

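#### Finally, let's make predictions on the test data and look at the predicted label and prediction score for each row.

Note that final_gbc has already seen these rows during finalization, so the predictions below demonstrate the workflow rather than provide an unbiased evaluation.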

In [10]:
predictions = predict_model(final_gbc, data = test_data)
predictions.head(10)

Unnamed: 0.1,Unnamed: 0,passenger_class,sex,age,number_of_siblings,number_of_parents,fare,body,prediction_label,prediction_score
544,544,2,1,34.0,1,0,21.0,0.0,0,0.8302
599,599,2,0,24.0,0,0,13.0,0.0,1,0.767
803,803,3,1,26.0,0,0,7.8792,0.0,0,0.8621
1065,1065,3,1,21.0,0,0,7.8,0.0,0,0.7395
454,454,2,1,42.0,0,0,13.0,0.0,0,0.9149
1259,1259,3,1,36.0,0,0,7.8958,0.0,0,0.8002
130,130,1,0,22.0,0,1,59.400002,0.0,1,0.9588
877,877,3,0,27.0,1,0,7.925,0.0,1,0.5022
976,976,3,1,0.0,0,0,7.8792,153.0,0,0.9851
736,736,3,1,59.0,0,0,7.25,0.0,0,0.9137
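In the output above, every `prediction_score` is at least 0.5, including for rows labelled 0. This suggests that the score is the estimated probability of the predicted label and not the probability of survival. Under that assumption, the sketch below converts a label and score into an estimated probability of survival. The helper name `survival_probability` is made up for illustration, and the pairs are copied from rows 544, 599, 803, and 877.

```python
def survival_probability(prediction_label, prediction_score):
    """Convert (label, score of the predicted label) into an estimated P(survived).

    Assumes prediction_score is the probability of the predicted label.
    """
    if prediction_label == 1:
        return prediction_score
    return 1.0 - prediction_score


# (prediction_label, prediction_score) pairs copied from the output above.
rows = [(0, 0.8302), (1, 0.767), (0, 0.8621), (1, 0.5022)]

for label, score in rows:
    print(label, score, round(survival_probability(label, score), 4))
```

Row 877, with a score of 0.5022, is close to a coin flip, while row 544 corresponds to roughly a 17% estimated chance of survival.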
