<a href="https://www.kaggle.com/code/aniruddhapa/insurance-claim-prediction-using-deep-learning?scriptVersionId=194290147" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Project Rationale

In this project, we address a binary classification problem using the dataset from the Tabular Playground Series September 2021 competition. Our goal is to predict a binary target variable based on various features in the provided training and test datasets. The dataset presents challenges such as missing values and varying distributions, which necessitate careful preprocessing to optimize model performance.

## Why This Project

Binary classification is a common machine learning task with applications in finance, healthcare, and customer analysis. This project provides an opportunity to apply data preprocessing techniques and deep learning models to achieve high accuracy in a competitive setting. It serves as a valuable exercise to enhance skills in handling missing data, feature scaling, and constructing effective neural network architectures.

## Goal

Our primary goal is to develop a predictive model that classifies the target variable 'claim' into binary classes (0 or 1) based on the provided features. We aim to optimize the model using data preprocessing, feature scaling, and deep learning techniques, and then evaluate its performance on the test set by submitting predictions.

## Strategy Used to Solve the Problem

1. **Data Preprocessing**:
   - **Handling Missing Values**: Utilizes `SimpleImputer` with a median strategy to address missing values.
   - **Feature Scaling**: Applies `QuantileTransformer` and `KBinsDiscretizer` to transform features into a uniform distribution and discretize them for improved model training.

2. **Model Architecture**:
   - **Neural Network Model**: Constructs a Sequential model with an embedding layer, dense layers, and dropout layers to prevent overfitting.
   - **Optimizer and Learning Rate Scheduling**: Employs `RMSprop` optimizer with a learning rate schedule to dynamically adjust the learning rate during training.
   - **Evaluation Metric**: Uses AUC (Area Under the ROC Curve) to measure the model's performance.

3. **Training and Prediction**:
   - **Model Training**: Trains the model on the preprocessed training data with a batch size of 1024 and for 100 epochs.
   - **Making Predictions**: Uses the trained model to predict the target variable for the test dataset and saves the results in a submission file.

## How the Goal is Achieved

1. **Data Preparation**: Loads and preprocesses the datasets by filling missing values and applying feature scaling transformations to prepare the data for model training.

2. **Model Building**: Constructs and compiles a neural network using TensorFlow and Keras, including an embedding layer for feature representation and dense layers for classification, with dropout layers to reduce overfitting.

3. **Training and Evaluation**: Compiles the model with binary cross-entropy loss and AUC as the metric, trains it on the training data, and makes predictions on the test data.

4. **Submission**: Saves the final predictions in a CSV file for submission to the competition.


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sklearn.pipeline import Pipeline


import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Embedding,  Flatten
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import RMSprop

from tensorflow.data import Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer,  KBinsDiscretizer
from tensorflow import keras
from sklearn import metrics
from sklearn.impute import SimpleImputer

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

In [2]:
import pandas as pd
import numpy as np
df_train=pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv')
df_train.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f110,f111,f112,f113,f114,f115,f116,f117,f118,claim
0,0,0.10859,0.004314,-37.566,0.017364,0.28915,-10.251,135.12,168900.0,399240000000000.0,...,-12.228,1.7482,1.9096,-7.1157,4378.8,1.2096,861340000000000.0,140.1,1.0177,1
1,1,0.1009,0.29961,11822.0,0.2765,0.4597,-0.83733,1721.9,119810.0,3874100000000000.0,...,-56.758,4.1684,0.34808,4.142,913.23,1.2464,7575100000000000.0,1861.0,0.28359,0
2,2,0.17803,-0.00698,907.27,0.27214,0.45948,0.17327,2298.0,360650.0,12245000000000.0,...,-5.7688,1.2042,0.2629,8.1312,45119.0,1.1764,321810000000000.0,3838.2,0.4069,1
3,3,0.15236,0.007259,780.1,0.025179,0.51947,7.4914,112.51,259490.0,77814000000000.0,...,-34.858,2.0694,0.79631,-16.336,4952.4,1.1784,4533000000000.0,4889.1,0.51486,1
4,4,0.11623,0.5029,-109.15,0.29791,0.3449,-0.40932,2538.9,65332.0,1907200000000000.0,...,-13.641,1.5298,1.1464,-0.43124,3856.5,1.483,-8991300000000.0,,0.23049,1


In [3]:
df_test=pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv')
df_test.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f109,f110,f111,f112,f113,f114,f115,f116,f117,f118
0,957919,0.16585,0.48705,1295.0,0.0231,0.319,0.90188,573.29,3743.7,2705700000000.0,...,0.16253,-22.189,2.0655,0.43088,-10.741,81606.0,1.194,198040000000000.0,2017.1,0.46357
1,957920,0.12965,0.37348,1763.0,0.72884,0.33247,-1.2631,875.55,554370.0,595570000000000.0,...,0.81528,-1.6342,1.5736,-1.0712,11.832,90114.0,1.1507,4.388e+16,6638.9,0.28125
2,957921,0.12019,0.44521,736.26,0.04615,0.29605,0.31665,2659.5,317140.0,397780000000000.0,...,0.81831,-32.78,2.1364,-1.9312,-3.2804,37739.0,1.1548,171810000000000.0,5844.0,0.13797
3,957922,0.054008,0.39596,996.14,0.85934,0.36678,-0.1706,386.56,325680.0,-34322000000000.0,...,0.86559,-2.4162,1.5199,-0.011633,1.384,26849.0,1.149,2.1388e+17,6173.3,0.3291
4,957923,0.079947,-0.006919,10574.0,0.34845,0.45008,-1.842,3027.0,428150.0,929150000000.0,...,0.2519,-18.63,3.7387,0.75708,-4.9405,50336.0,1.2488,2.1513e+17,2250.1,0.33796


In [4]:
#n_missing column has the count of total missing values in each row
df_train['n_missing']=df_train.isna().sum(axis=1)
df_test['n_missing']=df_test.isna().sum(axis=1)

In [5]:
df_train.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f111,f112,f113,f114,f115,f116,f117,f118,claim,n_missing
0,0,0.10859,0.004314,-37.566,0.017364,0.28915,-10.251,135.12,168900.0,399240000000000.0,...,1.7482,1.9096,-7.1157,4378.8,1.2096,861340000000000.0,140.1,1.0177,1,1
1,1,0.1009,0.29961,11822.0,0.2765,0.4597,-0.83733,1721.9,119810.0,3874100000000000.0,...,4.1684,0.34808,4.142,913.23,1.2464,7575100000000000.0,1861.0,0.28359,0,0
2,2,0.17803,-0.00698,907.27,0.27214,0.45948,0.17327,2298.0,360650.0,12245000000000.0,...,1.2042,0.2629,8.1312,45119.0,1.1764,321810000000000.0,3838.2,0.4069,1,5
3,3,0.15236,0.007259,780.1,0.025179,0.51947,7.4914,112.51,259490.0,77814000000000.0,...,2.0694,0.79631,-16.336,4952.4,1.1784,4533000000000.0,4889.1,0.51486,1,2
4,4,0.11623,0.5029,-109.15,0.29791,0.3449,-0.40932,2538.9,65332.0,1907200000000000.0,...,1.5298,1.1464,-0.43124,3856.5,1.483,-8991300000000.0,,0.23049,1,8


In [6]:
#Changing the datatype of target column to Object.
df_train['claim']=df_train['claim'].astype(str)

In [7]:
df_train['claim'].dtype

dtype('O')

In [8]:
#Storing all the column names in features,except claim and id.
features=[col for col in df_train.columns if col not in ['claim','id']]

In [9]:
#Creating a pipeline to fill missing values using SimpleImputer Median strategy and transforming the input data to 
#uniform distribution for better training of the model.
pipe=Pipeline([('imputer',SimpleImputer(strategy='mean',missing_values=np.nan)),
               ('scaler',QuantileTransformer(n_quantiles=64,output_distribution='uniform')),
               ('bin',KBinsDiscretizer(n_bins=64,encode='ordinal',strategy='uniform'))])

In [10]:
#Transforming the independant features
df_train[features]=pipe.fit_transform(df_train[features])

In [11]:
df_test[features]=pipe.fit_transform(df_test[features])

In [12]:
df_test[features].head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f110,f111,f112,f113,f114,f115,f116,f117,f118,n_missing
0,61.0,56.0,31.0,12.0,30.0,47.0,17.0,1.0,10.0,49.0,...,21.0,44.0,23.0,5.0,47.0,37.0,14.0,21.0,30.0,28.0
1,53.0,29.0,36.0,61.0,33.0,12.0,22.0,48.0,32.0,30.0,...,58.0,23.0,12.0,55.0,48.0,17.0,45.0,50.0,16.0,0.0
2,49.0,45.0,23.0,17.0,25.0,39.0,50.0,33.0,30.0,18.0,...,13.0,45.0,7.0,18.0,38.0,19.0,14.0,46.0,7.0,28.0
3,9.0,34.0,28.0,63.0,39.0,33.0,13.0,34.0,1.0,8.0,...,55.0,18.0,19.0,33.0,34.0,16.0,60.0,48.0,19.0,0.0
4,23.0,1.0,55.0,51.0,54.0,4.0,54.0,41.0,8.0,48.0,...,27.0,58.0,25.0,14.0,41.0,49.0,60.0,23.0,20.0,0.0


In [13]:
model = Sequential([
    Input(df_train[features].shape[1:]),
    Embedding(input_dim=64, output_dim=4),
    Flatten(),
    Dense(64,  activation='relu'),
    Dropout(0.5),
     Dense(32,  activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])

auc = tf.keras.metrics.AUC(name='aucroc')
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=5e-4,
        decay_steps = 900,
        decay_rate= 0.9)
callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

optimizer = RMSprop(lr=5e-4, rho=0.9, epsilon=1e-08, decay=0.0)

model.compile(loss='binary_crossentropy', optimizer = optimizer, metrics=[auc]) 


In [14]:
model.fit(x = np.float32(df_train[features]), y = np.float32(df_train.claim),
          batch_size = 1024, shuffle = True, epochs = 100,callbacks=[callback])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7aea685282d0>

In [15]:
sub=pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')
sub.head()

Unnamed: 0,id,claim
0,957919,0.5
1,957920,0.5
2,957921,0.5
3,957922,0.5
4,957923,0.5


In [16]:
sub['claim']=model.predict(np.float32(df_test[features]))

In [17]:
sub=sub.set_index('id')

In [18]:
sub.to_csv('submission.csv')

In [19]:
ls

__notebook__.ipynb  submission.csv


In [20]:
sub.head()

Unnamed: 0_level_0,claim
id,Unnamed: 1_level_1
957919,0.480187
957920,0.127608
957921,0.721861
957922,0.112664
957923,0.141387
