# Diplodatos Kaggle Competition

We present this piece of code to create the baseline for the competition, and as an example of how to deal with this kind of problem. The main goals are that you:

1. Explore the data and learn from it
1. Try different models and see which one best fits the given data
1. Get a higher score than the given one in the current baseline example
1. Try to get the highest score in the class :)

In [1]:
# Import the required packages
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read Data

In [2]:
train_df = pd.read_csv("travel_insurance_prediction_train.csv")
test_df = pd.read_csv("travel_insurance_prediction_test.csv")

## Explore the Data

It is your task to explore the data, analyze it and extract insights, then use those insights to pick a better model.
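As one possible starting point for the exploration, the sketch below computes the share of customers who bought insurance within each category of a column. The helper `target_rate_by` and the frame `sample_df` are hypothetical names introduced for illustration; `sample_df` holds a few columns of the first five training rows, and in the notebook you would pass `train_df` instead.

```python
import pandas as pd

# Hypothetical sample: a few columns of the first five training rows.
# In the notebook, use train_df directly.
sample_df = pd.DataFrame({
    "Age": [33, 28, 31, 31, 28],
    "FrequentFlyer": ["No", "Yes", "No", "No", "No"],
    "AnnualIncome": [550000, 800000, 1250000, 300000, 1250000],
    "TravelInsurance": [1, 0, 0, 0, 0],
})


def target_rate_by(df, column, target="TravelInsurance"):
    """Return the positive-target rate and the row count per category."""
    return (
        df.groupby(column)[target]
        .agg(rate="mean", count="size")
        .sort_values("rate", ascending=False)
    )


rates = target_rate_by(sample_df, "FrequentFlyer")
print(rates)
```

With only five rows the rates are not meaningful; the point is the pattern. Run on the full training set, the `count` column helps you judge how much to trust each rate.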

In [3]:
train_df.head()

| | Customer | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad | TravelInsurance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 33 | Private Sector/Self Employed | Yes | 550000 | 6 | 0 | No | No | 1 |
| 1 | 2 | 28 | Private Sector/Self Employed | Yes | 800000 | 7 | 0 | Yes | No | 0 |
| 2 | 3 | 31 | Private Sector/Self Employed | Yes | 1250000 | 4 | 0 | No | No | 0 |
| 3 | 4 | 31 | Government Sector | No | 300000 | 7 | 0 | No | No | 0 |
| 4 | 5 | 28 | Private Sector/Self Employed | Yes | 1250000 | 3 | 0 | No | No | 0 |


In [4]:
test_df.head()

| | Customer | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1491 | 29 | Private Sector/Self Employed | Yes | 1100000 | 4 | 0 | No | No |
| 1 | 1492 | 28 | Private Sector/Self Employed | Yes | 750000 | 5 | 1 | Yes | No |
| 2 | 1493 | 31 | Government Sector | Yes | 1500000 | 4 | 0 | Yes | Yes |
| 3 | 1494 | 28 | Private Sector/Self Employed | Yes | 1400000 | 3 | 0 | No | Yes |
| 4 | 1495 | 33 | Private Sector/Self Employed | Yes | 1500000 | 4 | 0 | Yes | Yes |


**TravelInsurance** is the column that we should predict. That column is not present in the test set.

In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Customer             1490 non-null   int64 
 1   Age                  1490 non-null   int64 
 2   Employment Type      1490 non-null   object
 3   GraduateOrNot        1490 non-null   object
 4   AnnualIncome         1490 non-null   int64 
 5   FamilyMembers        1490 non-null   int64 
 6   ChronicDiseases      1490 non-null   int64 
 7   FrequentFlyer        1490 non-null   object
 8   EverTravelledAbroad  1490 non-null   object
 9   TravelInsurance      1490 non-null   int64 
dtypes: int64(6), object(4)
memory usage: 116.5+ KB


In [6]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Customer             497 non-null    int64 
 1   Age                  497 non-null    int64 
 2   Employment Type      497 non-null    object
 3   GraduateOrNot        497 non-null    object
 4   AnnualIncome         497 non-null    int64 
 5   FamilyMembers        497 non-null    int64 
 6   ChronicDiseases      497 non-null    int64 
 7   FrequentFlyer        497 non-null    object
 8   EverTravelledAbroad  497 non-null    object
dtypes: int64(5), object(4)
memory usage: 35.1+ KB


In [7]:
train_df.describe()

| | Customer | Age | AnnualIncome | FamilyMembers | ChronicDiseases | TravelInsurance |
|---|---|---|---|---|---|---|
| count | 1490.0 | 1490.0 | 1490.0 | 1490.0 | 1490.0 | 1490.0 |
| mean | 745.5 | 29.667114 | 927818.8 | 4.777181 | 0.275839 | 0.357047 |
| std | 430.270264 | 2.880994 | 381171.5 | 1.640248 | 0.447086 | 0.47929 |
| min | 1.0 | 25.0 | 300000.0 | 2.0 | 0.0 | 0.0 |
| 25% | 373.25 | 28.0 | 600000.0 | 4.0 | 0.0 | 0.0 |
| 50% | 745.5 | 29.0 | 900000.0 | 5.0 | 0.0 | 0.0 |
| 75% | 1117.75 | 32.0 | 1250000.0 | 6.0 | 1.0 | 1.0 |
| max | 1490.0 | 35.0 | 1800000.0 | 9.0 | 1.0 | 1.0 |


In [8]:
test_df.describe()

| | Customer | Age | AnnualIncome | FamilyMembers | ChronicDiseases |
|---|---|---|---|---|---|
| count | 497.0 | 497.0 | 497.0 | 497.0 | 497.0 |
| mean | 1739.0 | 29.599598 | 947585.5 | 4.68008 | 0.283702 |
| std | 143.615807 | 3.010506 | 363581.8 | 1.51347 | 0.451248 |
| min | 1491.0 | 25.0 | 300000.0 | 2.0 | 0.0 |
| 25% | 1615.0 | 28.0 | 650000.0 | 4.0 | 0.0 |
| 50% | 1739.0 | 29.0 | 950000.0 | 4.0 | 0.0 |
| 75% | 1863.0 | 32.0 | 1250000.0 | 6.0 | 1.0 |
| max | 1987.0 | 35.0 | 1750000.0 | 9.0 | 1.0 |


## Baseline

In this section we present a baseline based on a decision tree classifier.

Many of the attributes are binary, and there are a couple of numeric ones. We might one-hot encode some of them (e.g. family members) or even discretize others (age and annual income). The right choice will become clearer after the EDA.

In [9]:
from sklearn.compose import make_column_transformer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ParameterGrid, train_test_split
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

### Transform the columns into features

First we need to transform the columns into features. The type of features we use will have a direct impact on the final result. In this example we decided to discretize some numeric features and one-hot encode others. The number of bins, the choice of columns to one-hot encode, and so on are all up to you to experiment with.

In [10]:
transformer_b = make_column_transformer(
    (KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"), ["Age", "AnnualIncome"]),
    (OneHotEncoder(categories="auto", dtype="int", handle_unknown="ignore"),
     ["Employment Type", "GraduateOrNot", "FamilyMembers", "FrequentFlyer", "EverTravelledAbroad"]),
    remainder="passthrough")
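To make the effect of such a transformer concrete, here is a minimal sketch on a tiny made-up frame. The names `toy_df`, `toy_transformer` and `toy_features` are hypothetical, the data is invented, and `n_bins=3` and `sparse_threshold=0` (which forces a dense array so the result is easy to print) are choices made for this illustration only.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Invented toy data, not taken from the competition files
toy_df = pd.DataFrame({
    "Age": [25, 27, 29, 31, 33, 35],
    "FrequentFlyer": ["No", "Yes", "No", "No", "Yes", "No"],
    "ChronicDiseases": [0, 1, 0, 0, 1, 0],
})

toy_transformer = make_column_transformer(
    # Age is replaced by the index of its quantile bin (0, 1 or 2)
    (KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile"), ["Age"]),
    # FrequentFlyer becomes one indicator column per category
    (OneHotEncoder(handle_unknown="ignore"), ["FrequentFlyer"]),
    # Columns not listed above (ChronicDiseases) are appended unchanged
    remainder="passthrough",
    # Assumption for this sketch: always return a dense array
    sparse_threshold=0,
)

toy_features = toy_transformer.fit_transform(toy_df)
print(toy_features.shape)
print(toy_features)
```

The output columns follow the order of the transformers: first the binned `Age`, then the two `FrequentFlyer` indicators, then the passed-through `ChronicDiseases`.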

We transform the train and test data. To avoid overfitting, it is better to remove the `Customer` column, which is only an identifier. We don't want the `TravelInsurance` column as part of the attributes either, since it is the target.

In [11]:
# The data for training the model
X_train_b = transformer_b.fit_transform(train_df.drop(columns=["Customer", "TravelInsurance"]))
y_train_b = train_df["TravelInsurance"].values

# The test data is only for generating the submission
X_test_b = transformer_b.transform(test_df.drop(columns=["Customer"]))

### Grid Search

We do a grid search over the decision tree hyperparameters (this can be replaced by a randomized search if the parameter space is too large to explore exhaustively).

In [12]:
search_params = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 2, 5],
    'max_depth': [3, 6, 10]
}
tree = DecisionTreeClassifier(random_state=42)
tree_clf = GridSearchCV(tree, search_params, cv=5, scoring='f1', n_jobs=-1)
tree_clf.fit(X_train_b, y_train_b)

best_tree_clf = tree_clf.best_estimator_
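The randomized alternative mentioned above could look roughly like the following sketch. It runs on synthetic data from `make_classification` as a stand-in for `X_train_b` and `y_train_b`; the names `X_demo`, `y_demo`, `param_distributions` and `random_search`, the sampling ranges and `n_iter=10` are all assumptions made for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train_b / y_train_b
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Lists are sampled uniformly; randint(a, b) draws integers in [a, b)
param_distributions = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": randint(1, 11),
    "max_depth": randint(2, 11),
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=10,        # number of sampled parameter combinations
    cv=5,
    scoring="f1",
    random_state=42,  # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

Unlike the grid search, the cost here is fixed by `n_iter` rather than by the size of the grid, so wider ranges can be explored for the same budget.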

### Check Results

We can print the results of the best estimator on the whole training set. Keep in mind that these scores are optimistic, because the model has already seen this data; setting apart a validation set gives a more honest estimate.

In [13]:
print(classification_report(y_train_b, best_tree_clf.predict(X_train_b)))

              precision    recall  f1-score   support

           0       0.83      0.94      0.88       958
           1       0.86      0.66      0.75       532

    accuracy                           0.84      1490
   macro avg       0.85      0.80      0.82      1490
weighted avg       0.84      0.84      0.84      1490
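A report computed on the training data tends to look better than what the model achieves on unseen data. One way to get a less optimistic number without a separate validation set is to read the mean cross-validated score that the search already computed (`best_score_`). The sketch below illustrates the comparison on synthetic data; `X_demo`, `y_demo`, `search`, `train_f1` and `cv_f1` are hypothetical names, and the reduced parameter grid is only for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in for X_train_b / y_train_b
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.2, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 6, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# F1 on the same data the best estimator was refit on
train_f1 = f1_score(y_demo, search.best_estimator_.predict(X_demo))
# Mean F1 over the 5 held-out folds for the best parameter setting
cv_f1 = search.best_score_

print(f"F1 on the training data: {train_f1:.3f}")
print(f"Mean cross-validated F1: {cv_f1:.3f}")
```

In the notebook, the equivalent value is available as `tree_clf.best_score_` after the grid search has been fitted.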



## Generate the output

The last step is to generate the file that should be *submitted* on Kaggle.

In [15]:
test_id = test_df["Customer"]
test_pred = best_tree_clf.predict(X_test_b)

submission_b = pd.DataFrame(list(zip(test_id, test_pred)), columns=["Customer", "TravelInsurance"])
submission_b.to_csv("travel_insurance_submission.csv", header=True, index=False)
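Before uploading, it can be worth reloading the file and checking its shape. The following is a minimal sketch under the assumption that the competition expects exactly the two columns written above, with one row per test customer and binary predictions. It uses made-up ids and predictions (`demo_ids`, `demo_pred`) and an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Made-up stand-ins for test_df["Customer"] and the model predictions
demo_ids = pd.Series([1491, 1492, 1493], name="Customer")
demo_pred = [0, 1, 0]

demo_submission = pd.DataFrame(
    {"Customer": demo_ids, "TravelInsurance": demo_pred}
)

# Write to an in-memory buffer the same way the notebook writes to disk
buffer = io.StringIO()
demo_submission.to_csv(buffer, header=True, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Sanity checks (assumed requirements, not confirmed by the competition rules)
assert list(reloaded.columns) == ["Customer", "TravelInsurance"]
assert len(reloaded) == len(demo_ids)
assert reloaded["Customer"].is_unique
assert set(reloaded["TravelInsurance"]) <= {0, 1}
print(reloaded)
```

In the notebook, the same checks can be applied to `pd.read_csv("travel_insurance_submission.csv")`, comparing the row count against `len(test_df)`.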

In [28]:
# Split into instances (features) and labels
feature_names = ['Age', 'Employment Type', 'GraduateOrNot', 'AnnualIncome',
       'FamilyMembers', 'ChronicDiseases', 'FrequentFlyer',
       'EverTravelledAbroad']

X, y = train_df[feature_names], train_df['TravelInsurance']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=0)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((1192, 8), (298, 8), (1192,), (298,))

In [29]:
transformer = make_column_transformer(
    (KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"), ["Age", "AnnualIncome"]),
    (OneHotEncoder(categories="auto", dtype="int", handle_unknown="ignore"),
     ["Employment Type", "GraduateOrNot", "FrequentFlyer", "EverTravelledAbroad", "ChronicDiseases"]),
    remainder="passthrough")

In [30]:
# The data for training the model
X_train_tr = transformer.fit_transform(X_train)
y_train_tr = y_train

# The validation data is only used to evaluate the model
X_val_tr = transformer.transform(X_val)
y_val_tr = y_val

### TRY 1

In [34]:
# GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=100)
gbc.fit(X_train_tr, y_train_tr)
y_train_pred1 = gbc.predict(X_train_tr)
y_val_pred1 = gbc.predict(X_val_tr)
#print("Train report")
#print(classification_report(y_train_tr, y_train_pred1))
print("Validation report")
print(classification_report(y_val_tr, y_val_pred1))

Validation report
              precision    recall  f1-score   support

           0       0.81      0.97      0.88       192
           1       0.93      0.58      0.72       106

    accuracy                           0.84       298
   macro avg       0.87      0.78      0.80       298
weighted avg       0.85      0.84      0.82       298



In [35]:
gb_pipe = make_pipeline(transformer, GradientBoostingClassifier(random_state=100))
gb_pipe.fit(X, y)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('kbinsdiscretizer',
                                                  KBinsDiscretizer(encode='ordinal'),
                                                  ['Age', 'AnnualIncome']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(dtype='int',
                                                                handle_unknown='ignore'),
                                                  ['Employment Type',
                                                   'GraduateOrNot',
                                                   'FrequentFlyer',
                                                   'EverTravelledAbroad',
                                                   'ChronicDiseases'])])),
                ('gradientboostingclassifier',
                 GradientBoostingC

In [36]:
test_id = test_df["Customer"]
test_pred_gb = gb_pipe.predict(test_df.drop(columns=["Customer"]))

submission = pd.DataFrame(list(zip(test_id, test_pred_gb)), columns=["Customer", "TravelInsurance"])
submission.to_csv("travel_insurance_submission_gbc3.csv", header=True, index=False)

In [26]:
X.columns

Index(['Age', 'FrequentFlyer', 'AnnualIncome', 'FamilyMembers',
       'ChronicDiseases', 'EverTravelledAbroad', 'GraduateOrNot',
       'Employment Type'],
      dtype='object')

In [27]:
test_df.columns

Index(['Customer', 'Age', 'Employment Type', 'GraduateOrNot', 'AnnualIncome',
       'FamilyMembers', 'ChronicDiseases', 'FrequentFlyer',
       'EverTravelledAbroad'],
      dtype='object')