# Insurance Cost Prediction

In this notebook we use information such as a person's age, sex, BMI, number of children, smoking habit and region to predict their yearly medical bills. A model like this is useful for insurance companies when setting a person's yearly insurance premium. The dataset for this problem is taken from: https://www.kaggle.com/mirichoi0218/insurance

In [1]:
import torch
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split

Let's load our dataset, which was downloaded from the Kaggle page linked above.

In [2]:
data = pd.read_csv('datasets_insurance.csv')
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


We have age, sex, BMI, number of children, smoking habit and region as the inputs to the model. Based on these, the model needs to predict the person's insurance cost. Let's separate the numerical and the categorical columns, and name the target column separately.

In [4]:
numerical_cols = ['age', 'bmi', 'children']
categorical_cols = ['sex', 'smoker', 'region']
target_cols = ['charges']

The next step is to prepare the data and convert it into torch tensors, the format PyTorch works with.


In [5]:
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
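    # Extract inputs & outputs as numpy arrays
    # (uses the column lists defined above; input_cols/output_cols were never defined)
    inputs_array = dataframe1[numerical_cols + categorical_cols].to_numpy()
    targets_array = dataframe1[target_cols].to_numpy()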
    return inputs_array, targets_array

In [41]:
# categorical cols
data_df = data.copy(deep = True)
for col in categorical_cols:
    data_df[col] = data_df[col].astype('category').cat.codes

# convert input(numerical and categorical inputs) and target cols to numpy arrays
X = data_df[numerical_cols+categorical_cols].to_numpy()
y = data_df[target_cols].to_numpy()
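To make the encoding step concrete, here is a minimal sketch of what `astype('category').cat.codes` does, using a hypothetical three-row frame built from values in the `data.head()` output above. The `mappings` dictionary is an illustrative addition, not part of the notebook. Note that the codes depend on which categories are present, so the toy frame's region codes will not match those of the full dataset.

```python
import pandas as pd

# Hypothetical mini-frame using values from the first rows shown above.
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

encoded = sample.copy()
mappings = {}
for col in sample.columns:
    as_cat = sample[col].astype("category")
    # Categories inferred from strings are sorted, and codes index into them.
    encoded[col] = as_cat.cat.codes
    mappings[col] = dict(enumerate(as_cat.cat.categories))

print(encoded)
print(mappings)
```

Keeping a mapping like this around makes it possible to translate a code back to its label later. One caveat worth keeping in mind: integer codes give `region` an artificial ordering, which a linear model will treat as meaningful.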

In [42]:
# convert to torch tensors
X = torch.tensor(X, dtype = torch.float32)
y = torch.tensor(y, dtype = torch.float32)

We create the torch dataset from our tensors, then split it into training and validation sets.

In [43]:
dataset = TensorDataset(X, y)

In [44]:
num_rows = data.shape[0]
val_percent = 0.2 # fraction of rows held out for validation
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size


train_ds, val_ds = random_split(dataset, [train_size, val_size]) # Use the random_split function to split dataset into 2 parts of the desired length
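Because `random_split` shuffles the indices randomly, the split above differs from run to run, which makes validation losses harder to compare. A minimal sketch of a reproducible split follows, assuming a fixed seed (42 is arbitrary) and a toy dataset standing in for the insurance tensors. The helper name `seeded_split` is hypothetical.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the insurance tensors: 10 rows, 6 features, 1 target.
toy_X = torch.arange(60, dtype=torch.float32).reshape(10, 6)
toy_y = torch.arange(10, dtype=torch.float32).reshape(10, 1)
toy_dataset = TensorDataset(toy_X, toy_y)

def seeded_split(dataset, val_percent, seed=42):
    """Split into (train, val) subsets using a seeded generator."""
    val_size = int(len(dataset) * val_percent)
    train_size = len(dataset) - val_size
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size], generator=generator)

train_a, val_a = seeded_split(toy_dataset, 0.2)
train_b, val_b = seeded_split(toy_dataset, 0.2)

print(len(train_a), len(val_a))
# The same seed yields the same validation rows on both calls.
print(list(val_a.indices) == list(val_b.indices))
```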

Next, let's create a data loader for both the training and validation sets. These will be used during model training.


In [45]:
batch_size = 32 # batch size used by both data loaders
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)


In [46]:
# check if the data loader is working fine
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break

inputs: tensor([[42.0000, 29.0000,  1.0000,  0.0000,  0.0000,  3.0000],
        [45.0000, 30.9000,  2.0000,  0.0000,  0.0000,  3.0000],
        [54.0000, 34.2100,  2.0000,  1.0000,  1.0000,  2.0000],
        [44.0000, 20.2350,  1.0000,  0.0000,  1.0000,  0.0000],
        [63.0000, 36.8500,  0.0000,  0.0000,  0.0000,  2.0000],
        [56.0000, 25.6500,  0.0000,  0.0000,  0.0000,  1.0000],
        [40.0000, 19.8000,  1.0000,  1.0000,  1.0000,  2.0000],
        [63.0000, 23.0850,  0.0000,  0.0000,  0.0000,  0.0000],
        [64.0000, 36.9600,  2.0000,  1.0000,  1.0000,  2.0000],
        [26.0000, 19.8000,  1.0000,  0.0000,  0.0000,  3.0000],
        [57.0000, 25.7400,  2.0000,  0.0000,  0.0000,  2.0000],
        [18.0000, 33.7700,  1.0000,  1.0000,  0.0000,  2.0000],
        [28.0000, 23.8450,  2.0000,  0.0000,  0.0000,  1.0000],
        [28.0000, 23.8000,  2.0000,  1.0000,  0.0000,  3.0000],
        [61.0000, 36.3000,  1.0000,  1.0000,  1.0000,  3.0000],
        [30.0000, 25.4600,  0.00

## Creating the Model

In [47]:
input_size = len(numerical_cols + categorical_cols)
output_size = len(target_cols)

In [53]:
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single linear layer mapping the input features to the predicted charge
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        out = self.linear(xb)
        return out

    def training_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.mse_loss(out, targets)
        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.mse_loss(out, targets)
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))

In [63]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2827,  0.2767, -0.2985,  0.1372, -0.1731,  0.2979]],
        requires_grad=True),
 Parameter containing:
 tensor([-0.3754], requires_grad=True)]

In [64]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        print(result)
        history.append(result)
    return history

In [100]:
model = InsuranceModel()

Let's check that the model class works properly by evaluating the untrained model on the validation set.

In [101]:
result = evaluate(model, val_loader) # Validation loss of the untrained model
print(result)

{'val_loss': 346592000.0}


Now we need to train the model.

In [112]:
train_params = [[ep, lr] for ep in range(20,101,10) for lr in [1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]]
len(train_params)

54

With these training parameters (9 epoch settings times 6 learning rates), we will train 54 models and then pick the one with the lowest validation loss.
This will tell us the best values for epochs and learning rate for training our model.
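As a sketch of how the grid is laid out, the same list can be built with `itertools.product`, which makes the ordering explicit: epochs vary slowest and learning rates vary fastest. The names `epoch_options`, `lr_options`, `grid` and `grid_index` are illustrative and do not appear in the notebook.

```python
from itertools import product

epoch_options = list(range(20, 101, 10))            # 20, 30, ..., 100 (9 values)
lr_options = [1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]   # 6 values

# Same ordering as the nested list comprehension: epochs outer, learning rates inner.
grid = [[ep, lr] for ep, lr in product(epoch_options, lr_options)]
print(len(grid))

def grid_index(epoch_pos, lr_pos):
    """Position in the flat grid of the (epoch_pos, lr_pos) combination."""
    return epoch_pos * len(lr_options) + lr_pos

# Last epoch setting (position 8) with the second learning rate (position 1).
print(grid[grid_index(8, 1)])
```

Knowing this layout makes it easy to map a flat index back to its epoch and learning-rate pair when inspecting results.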

In [108]:
models = []
histories = []
for train_param in train_params:
    
    model = InsuranceModel()
    epochs = train_param[0]
    lr = train_param[1]
    print(f'Training with {epochs} and {lr}')
    history1 = fit(epochs, lr, model, train_loader, val_loader)
    models.append(model)
    histories.append(history1)
    

Training with 20 and 0.001
{'val_loss': inf}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
{'val_loss': nan}
Epoch [20], val_loss: nan
{'val_loss': nan}
Training with 20 and 0.0001
{'val_loss': 145277728.0}
{'val_loss': 144881008.0}
{'val_loss': 144561808.0}
{'val_loss': 144637248.0}
{'val_loss': 144763280.0}
{'val_loss': 143890368.0}
{'val_loss': 143760800.0}
{'val_loss': 144576896.0}
{'val_loss': 145447152.0}
{'val_loss': 147528640.0}
{'val_loss': 145081072.0}
{'val_loss': 142406816.0}
{'val_loss': 144462416.0}
{'val_loss': 145595856.0}
{'val_loss': 143065536.0}
{'val_loss': 144994816.0}
{'val_loss': 141114400.0}
{'val_loss': 144139552.0}
{'val_loss': 141358960.0}
Epoch [20], val_loss: 157572160.0000
{'val_loss': 157572160.0}
Tr

In [115]:
# Final-epoch validation loss of each run (each history entry is a dict)
val_losses = [x[-1]['val_loss'] for x in histories]

In [124]:
sorted(val_losses)

[nan,
 124929808.0,
 126839408.0,
 133521224.0,
 135093616.0,
 136259264.0,
 136950368.0,
 137115584.0,
 138378304.0,
 142812752.0,
 143168480.0,
 143338272.0,
 143433872.0,
 143871184.0,
 144122240.0,
 144250368.0,
 144312128.0,
 144561440.0,
 144789824.0,
 144820736.0,
 144848000.0,
 144854592.0,
 144880704.0,
 144913696.0,
 144945040.0,
 145022192.0,
 145408400.0,
 151928656.0,
 154550896.0,
 157572160.0,
 158122432.0,
 163397200.0,
 170662512.0,
 180745760.0,
 195508128.0,
 215804080.0,
 245006192.0,
 286969408.0,
 291690944.0,
 297426112.0,
 302576064.0,
 308336768.0,
 313811072.0,
 320648512.0,
 332773568.0,
 nan,
 326416672.0,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan]

In [125]:
min_loss = 124929808.0

In [126]:
for i,history in enumerate(histories):
    if history[-1]['val_loss']== min_loss:
        print(i)

49
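The sorted list above has `nan` values scattered through it because every comparison with `nan` is false, so `sorted` cannot place them consistently. Copying the minimum by hand also breaks as soon as the runs are repeated. A minimal sketch of a more robust selection follows, assuming the same `histories` structure (a list of lists of `{'val_loss': ...}` dicts). The helper `best_run` and the toy data are hypothetical.

```python
import math

def best_run(histories):
    """Return (index, loss) of the run with the smallest finite final val_loss."""
    best_index, best_loss = None, math.inf
    for i, history in enumerate(histories):
        final_loss = history[-1]['val_loss']
        # Skip diverged runs whose final loss is nan or inf.
        if math.isfinite(final_loss) and final_loss < best_loss:
            best_index, best_loss = i, final_loss
    return best_index, best_loss

# Toy histories for illustration only (not results from this notebook).
toy_histories = [
    [{'val_loss': float('inf')}, {'val_loss': float('nan')}],  # diverged run
    [{'val_loss': 3.0}, {'val_loss': 2.5}],
    [{'val_loss': 2.0}, {'val_loss': 1.5}],
]
print(best_run(toy_histories))
```

Applied to the real `histories`, the returned index can be passed straight to `train_params` and `models`.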


The training parameters at index 49 gave us the best model. Let's have a look at the values.


In [130]:
train_params[49]

[100, 0.0001]

So, 100 epochs at a learning rate of 0.0001 performed the best.
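The validation losses above are mean squared errors, so they are in squared units of `charges` and hard to read directly. As a rough aid to interpretation, taking the square root gives a root mean squared error in the same units as `charges`. The sketch below only reuses the two loss values printed earlier in this notebook; the variable names are illustrative.

```python
import math

best_val_mse = 124929808.0       # lowest final validation loss reported above
untrained_val_mse = 346592000.0  # validation loss of the untrained model above

# RMSE is the square root of the MSE, in the same units as the target column.
best_rmse = math.sqrt(best_val_mse)
untrained_rmse = math.sqrt(untrained_val_mse)

print(f"untrained RMSE: {untrained_rmse:.0f}")
print(f"best RMSE:      {best_rmse:.0f}")
```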

Now, let's make some predictions.