## Insurance cost prediction using linear regression

In this notebook we'll use information such as a person's age, sex, BMI, number of children, and smoking habits to predict their yearly medical charges. A model like this is useful to insurance companies for setting a person's yearly premium. The dataset for this problem is taken from Kaggle.<br>

We will create a model with the following steps:<br>

1. Download and explore the dataset
2. Prepare the dataset for training
3. Create a linear regression model
4. Train the model to fit the data
5. Make predictions using the trained model

In [1]:
import torch
import jovian
import numpy as np
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
import pickle

### Step 1: Download and explore the data
Let us begin by downloading the data. We'll use the download_url function from torchvision to get the data as a CSV (comma-separated values) file.

In [2]:
DATASET_URL = "https://hub.jovian.ml/wp-content/uploads/2020/05/insurance.csv"
DATA_FILENAME = "insurance.csv"
download_url(DATASET_URL, '.')

Using downloaded and verified file: .\insurance.csv


To load the dataset into memory, we'll use the read_csv function from the pandas library.

In [3]:
dataframe_raw = pd.read_csv(DATA_FILENAME)
dataframe_raw.head()

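| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |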


In [4]:
your_name = "gesha" # at least 5 characters

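The customize_dataset function customizes the dataset slightly, using the characters of your name as a deterministic source of variation: it keeps a 95% sample of the rows, rescales bmi and charges, and may drop the region column.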



In [5]:
def customize_dataset(dataframe_raw, rand_str):
    dataframe = dataframe_raw.copy(deep=True)
    # drop some rows
    dataframe = dataframe.sample(int(0.95*len(dataframe)), random_state=int(ord(rand_str[0])))
    # scale input
    dataframe.bmi = dataframe.bmi * ord(rand_str[1])/100.
    # scale target
    dataframe.charges = dataframe.charges * ord(rand_str[2])/100.
    # drop column
    if ord(rand_str[3]) % 2 == 1:
        dataframe = dataframe.drop(['region'], axis=1)
    return dataframe
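
To see what customize_dataset does for a particular name, the character codes can be worked out by hand. The sketch below only mirrors the arithmetic in the function above; the helper name `describe_customization` is an assumption made for illustration and is not part of the notebook.

```python
def describe_customization(rand_str):
    """Return the settings that customize_dataset derives from rand_str."""
    return {
        "random_state": ord(rand_str[0]),           # seed for the 95% row sample
        "bmi_scale": ord(rand_str[1]) / 100.0,      # multiplier applied to bmi
        "charges_scale": ord(rand_str[2]) / 100.0,  # multiplier applied to charges
        "drops_region": ord(rand_str[3]) % 2 == 1,  # region is dropped for odd codes
    }

settings = describe_customization("gesha")
print(settings)
```

For "gesha" this gives a seed of 103, a bmi multiplier of 1.01, a charges multiplier of 1.15, and the region column is kept.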

In [6]:
dataframe = customize_dataset(dataframe_raw, your_name)
dataframe.head()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 426 | 38 | female | 27.53765 | 1 | no | northeast | 7538.330902 |
| 902 | 26 | male | 27.53765 | 3 | no | northeast | 5360.479303 |
| 309 | 41 | female | 33.3906 | 2 | no | northwest | 8911.52986 |
| 515 | 58 | male | 36.057 | 0 | no | southwest | 13067.16825 |
| 601 | 51 | male | 31.95135 | 0 | no | northwest | 10550.255998 |


In [7]:
num_rows = dataframe.shape[0]
print(num_rows)

1271


In [8]:
num_cols = dataframe.shape[1]
print(num_cols)

7


In [9]:
#Column titles of input variables
input_cols = dataframe.columns[:6]
input_cols = input_cols.tolist()
print(input_cols)

['age', 'sex', 'bmi', 'children', 'smoker', 'region']


In [10]:
#categorical variables

categorical_cols = dataframe.select_dtypes(exclude=["number"]) # non-numeric columns: sex, smoker and (if kept) region
categorical_cols = categorical_cols.columns
categorical_cols = categorical_cols.tolist()
print(categorical_cols)

['sex', 'smoker', 'region']


In [11]:
#target variable
output_cols = dataframe.columns[6:].tolist()
print(output_cols)

['charges']


### Step 2: Prepare the dataset for training
We need to convert the data from the pandas dataframe into PyTorch tensors for training. The first step is to convert it to numpy arrays. If you've filled out input_cols, categorical_cols and output_cols correctly, the following function will perform the conversion to numpy arrays.

In [12]:
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract inputs & outputs as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array
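
The conversion of categorical columns relies on pandas category codes. As a small illustration, using a toy column rather than the real dataset, the snippet below shows how labels map to integers: the unique labels are sorted, and each value is replaced by the position of its label.

```python
import pandas as pd

# Toy column; the values are illustrative, not taken from the dataset
smoker = pd.Series(["yes", "no", "no", "yes"])

as_category = smoker.astype("category")
categories = list(as_category.cat.categories)  # sorted unique labels
codes = as_category.cat.codes.tolist()         # position of each label in categories

print(categories)
print(codes)
```

Here "no" becomes 0 and "yes" becomes 1. For a column with more than two labels, such as region, the codes impose an arbitrary numeric order that a linear model will treat as meaningful, which is worth keeping in mind when interpreting the results.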

In [13]:
inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array

(array([[38.     ,  0.     , 27.53765,  1.     ,  0.     ,  0.     ],
        [26.     ,  1.     , 27.53765,  3.     ,  0.     ,  0.     ],
        [41.     ,  0.     , 33.3906 ,  2.     ,  0.     ,  1.     ],
        ...,
        [24.     ,  1.     , 33.027  ,  0.     ,  1.     ,  3.     ],
        [31.     ,  1.     , 26.159  ,  3.     ,  1.     ,  3.     ],
        [43.     ,  1.     , 30.401  ,  1.     ,  0.     ,  3.     ]]),
 array([[ 7538.3309025],
        [ 5360.4793025],
        [ 8911.52986  ],
        ...,
        [39643.76715  ],
        [22079.9356   ],
        [ 7876.3799   ]]))

In [14]:
inputs = torch.from_numpy(np.float32(inputs_array))
targets = torch.from_numpy(np.float32(targets_array))

In [15]:
inputs.dtype, targets.dtype

(torch.float32, torch.float32)

In [16]:
dataset = TensorDataset(inputs, targets)


Pick a number between 0.1 and 0.2 to determine the fraction of data that will be used for creating the validation set. Then use random_split to create training & validation datasets.

In [17]:
val_percent = 0.15 # between 0.1 and 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size


train_ds, val_ds = random_split(dataset, [train_size, val_size]) # Use the random_split function to split dataset into 2 parts of the desired length
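
Note that random_split produces a different split on every run unless a seeded generator is passed in. The following is a minimal sketch on a stand-in dataset, assuming the row count of 1271 printed earlier; the names `toy_dataset`, `toy_train`, `toy_val` and the seed 42 are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

num_rows = 1271          # row count printed earlier in the notebook
val_percent = 0.15
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size

# Stand-in dataset with the same number of rows (random values, for illustration only)
toy_dataset = TensorDataset(torch.randn(num_rows, 6), torch.randn(num_rows, 1))

# A seeded generator makes the split repeatable; the seed value is arbitrary
generator = torch.Generator().manual_seed(42)
toy_train, toy_val = random_split(toy_dataset, [train_size, val_size], generator=generator)

print(len(toy_train), len(toy_val))
```

With these numbers the split is 1081 training rows and 190 validation rows.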



Finally, we can create data loaders for training & validation.<br>
Pick a batch size for the data loader.



In [18]:
batch_size = 16
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
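
To make the effect of the batch size concrete, here is a small sketch on toy data (10 made-up rows, batch size 4). It shows that the loader yields full batches followed by one smaller final batch when the row count is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 toy samples with 6 features each (illustrative values only)
toy_ds = TensorDataset(torch.arange(60, dtype=torch.float32).reshape(10, 6),
                       torch.zeros(10, 1))
toy_loader = DataLoader(toy_ds, batch_size=4)

batch_shapes = [tuple(xb.shape) for xb, yb in toy_loader]
print(len(toy_loader), batch_shapes)
```

By the same logic, the 1081 training rows implied by val_percent = 0.15 and 1271 total rows give 68 batches per epoch at batch_size = 16, the last one holding 9 rows.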

In [19]:
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break

inputs: tensor([[41.0000,  1.0000, 28.6891,  1.0000,  0.0000,  1.0000],
        [19.0000,  0.0000, 40.0112,  1.0000,  0.0000,  1.0000],
        [53.0000,  1.0000, 21.6140,  1.0000,  0.0000,  3.0000],
        [31.0000,  0.0000, 21.9725,  0.0000,  0.0000,  1.0000],
        [56.0000,  0.0000, 42.3291,  0.0000,  0.0000,  2.0000],
        [54.0000,  0.0000, 32.2190,  1.0000,  0.0000,  2.0000],
        [49.0000,  0.0000, 30.2243,  0.0000,  0.0000,  1.0000],
        [21.0000,  0.0000, 16.9832,  1.0000,  0.0000,  0.0000],
        [27.0000,  1.0000, 28.7850,  0.0000,  1.0000,  1.0000],
        [30.0000,  1.0000, 31.8857,  3.0000,  0.0000,  2.0000],
        [18.0000,  1.0000, 16.1196,  0.0000,  0.0000,  0.0000],
        [34.0000,  1.0000, 21.5888,  0.0000,  0.0000,  0.0000],
        [53.0000,  1.0000, 28.8860,  3.0000,  0.0000,  3.0000],
        [63.0000,  1.0000, 35.4409,  0.0000,  1.0000,  2.0000],
        [49.0000,  1.0000, 31.6635,  1.0000,  0.0000,  0.0000],
        [50.0000,  0.0000, 46.55

### Step 3: Create a Linear Regression Model


In [20]:
input_size = len(input_cols)
output_size = len(output_cols)

In [21]:
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)  # one linear layer: input_size features -> output_size targets
        
    def forward(self, xb):
        # Inputs already have shape (batch_size, input_size), so no reshape is needed
        out = self.linear(xb)
        return out
    
    def training_step(self, batch):
        inputs, targets = batch 
        # Generate predictions
        out = self(inputs)          
        # Calculate loss
        # l1_loss is the mean element-wise absolute difference between predictions and targets
        loss = F.l1_loss(out, targets)
        return loss
    
    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.l1_loss(out, targets)
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
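
The loss used in training_step and validation_step is the mean absolute error. As a quick check on made-up numbers, the snippet below compares F.l1_loss with the same quantity computed by hand.

```python
import torch
import torch.nn.functional as F

# Toy predictions and targets (illustrative values only)
preds = torch.tensor([[100.0], [250.0], [400.0]])
targets = torch.tensor([[110.0], [200.0], [400.0]])

loss = F.l1_loss(preds, targets)          # mean of |pred - target|
manual = (preds - targets).abs().mean()   # same quantity computed by hand

print(loss.item(), manual.item())
```

The absolute errors are 10, 50 and 0, so both values are 20.0. Because the loss is in the same units as charges, a validation loss can be read directly as the average size of the prediction error.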

Let us create a model using the InsuranceModel class. You may need to come back later and re-run the next cell to reinitialize the model, in case the loss becomes nan or infinity.

In [22]:
model = InsuranceModel()


Let's check out the weights and biases of the model using model.parameters.



In [23]:
list(model.parameters())


[Parameter containing:
 tensor([[ 0.3066,  0.1402, -0.2322,  0.0151,  0.3637, -0.3625]],
        requires_grad=True),
 Parameter containing:
 tensor([0.2143], requires_grad=True)]
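
The two parameters listed above are the weight matrix and the bias of the linear layer. As a sketch on a toy layer (not the notebook's model object), the following shows their shapes and confirms that the layer computes `x @ W.T + b`. The seed is arbitrary and only makes the example repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # arbitrary seed for repeatability
layer = nn.Linear(6, 1)              # 6 inputs and 1 output, as in the notebook
xb = torch.randn(4, 6)               # toy batch of 4 rows

out = layer(xb)
manual = xb @ layer.weight.T + layer.bias   # y = x W^T + b

print(layer.weight.shape, layer.bias.shape, out.shape)
print(torch.allclose(out, manual, atol=1e-6))
```

The weight has shape (1, 6), one coefficient per input column, and the bias has shape (1,).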

### Step 4: Train the model to fit the data


In [24]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history
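
To illustrate what one pass through the inner loop of fit does, here is a minimal sketch of a single SGD update on one toy weight with an L1 loss. The values are made up; the point is only to show the backward, step, zero_grad sequence.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)   # single toy weight
optimizer = torch.optim.SGD([w], lr=0.1)

x, y = torch.tensor([2.0]), torch.tensor([6.0])
loss = (w * x - y).abs().mean()   # L1 loss for one sample: |1*2 - 6| = 4
loss.backward()                   # d(loss)/dw = sign(w*x - y) * x = -2
grad = w.grad.item()

optimizer.step()                  # w <- w - lr * grad = 1.0 + 0.2
optimizer.zero_grad()             # clear the gradient before the next batch

print(grad, w.item())
```

The gradient is -2 and the weight moves from 1.0 to about 1.2. With an L1 loss the gradient magnitude depends on the inputs but not on how large the error is, so the step size is governed mainly by the learning rate.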

In [25]:
result = evaluate(model, val_loader) # Validation loss of the untrained model
print(result)

{'val_loss': 16748.201171875}


We are now ready to train the model. You may need to run the training loop many times, with different numbers of epochs and different learning rates, to get a good result. Also, if your loss becomes too large (or nan), you may have to re-initialize the model by re-running the cell model = InsuranceModel().

In [26]:
epochs = 500
lr = 1e-2
history1 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 9755.7412
Epoch [40], val_loss: 9597.8115
Epoch [60], val_loss: 9460.9121
Epoch [80], val_loss: 9344.7764
Epoch [100], val_loss: 9266.5410
Epoch [120], val_loss: 9217.2070
Epoch [140], val_loss: 9196.8838
Epoch [160], val_loss: 9185.2256
Epoch [180], val_loss: 9181.5410
Epoch [200], val_loss: 9179.9482
Epoch [220], val_loss: 9177.1729
Epoch [240], val_loss: 9175.5215
Epoch [260], val_loss: 9173.5146
Epoch [280], val_loss: 9171.5752
Epoch [300], val_loss: 9169.6445
Epoch [320], val_loss: 9168.3857
Epoch [340], val_loss: 9166.3740
Epoch [360], val_loss: 9163.8877
Epoch [380], val_loss: 9162.2959
Epoch [400], val_loss: 9160.9033
Epoch [420], val_loss: 9158.1865
Epoch [440], val_loss: 9156.0615
Epoch [460], val_loss: 9153.7256
Epoch [480], val_loss: 9152.3936
Epoch [500], val_loss: 9151.4014


In [27]:
epochs = 500
lr = 1e-3
history2 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 9150.9736
Epoch [40], val_loss: 9150.7061
Epoch [60], val_loss: 9150.5098
Epoch [80], val_loss: 9150.2764
Epoch [100], val_loss: 9150.1699
Epoch [120], val_loss: 9149.9961
Epoch [140], val_loss: 9149.8076
Epoch [160], val_loss: 9149.6670
Epoch [180], val_loss: 9149.3682
Epoch [200], val_loss: 9149.0498
Epoch [220], val_loss: 9148.9482
Epoch [240], val_loss: 9148.7295
Epoch [260], val_loss: 9148.7393
Epoch [280], val_loss: 9148.6357
Epoch [300], val_loss: 9148.4619
Epoch [320], val_loss: 9148.2559
Epoch [340], val_loss: 9148.0537
Epoch [360], val_loss: 9147.8838
Epoch [380], val_loss: 9147.5654
Epoch [400], val_loss: 9147.3701
Epoch [420], val_loss: 9147.3008
Epoch [440], val_loss: 9147.0400
Epoch [460], val_loss: 9146.7930
Epoch [480], val_loss: 9146.6221
Epoch [500], val_loss: 9146.5508


In [28]:
epochs = 500
lr = 1e-4
history3 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 9146.5361
Epoch [40], val_loss: 9146.5225
Epoch [60], val_loss: 9146.5244
Epoch [80], val_loss: 9146.5068
Epoch [100], val_loss: 9146.4883
Epoch [120], val_loss: 9146.4619
Epoch [140], val_loss: 9146.4443
Epoch [160], val_loss: 9146.4463
Epoch [180], val_loss: 9146.4199
Epoch [200], val_loss: 9146.3975
Epoch [220], val_loss: 9146.3779
Epoch [240], val_loss: 9146.3682
Epoch [260], val_loss: 9146.3525
Epoch [280], val_loss: 9146.3477
Epoch [300], val_loss: 9146.3408
Epoch [320], val_loss: 9146.3271
Epoch [340], val_loss: 9146.3174
Epoch [360], val_loss: 9146.3096
Epoch [380], val_loss: 9146.3047
Epoch [400], val_loss: 9146.2852
Epoch [420], val_loss: 9146.2666
Epoch [440], val_loss: 9146.2588
Epoch [460], val_loss: 9146.2393
Epoch [480], val_loss: 9146.2295
Epoch [500], val_loss: 9146.2217


In [29]:
epochs = 500
lr = 1e-5
history4 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 9146.2217
Epoch [40], val_loss: 9146.2217
Epoch [60], val_loss: 9146.2197
Epoch [80], val_loss: 9146.2197
Epoch [100], val_loss: 9146.2197
Epoch [120], val_loss: 9146.2178
Epoch [140], val_loss: 9146.2197
Epoch [160], val_loss: 9146.2188
Epoch [180], val_loss: 9146.2139
Epoch [200], val_loss: 9146.2139
Epoch [220], val_loss: 9146.2148
Epoch [240], val_loss: 9146.2158
Epoch [260], val_loss: 9146.2158
Epoch [280], val_loss: 9146.2158
Epoch [300], val_loss: 9146.2139
Epoch [320], val_loss: 9146.2139
Epoch [340], val_loss: 9146.2129
Epoch [360], val_loss: 9146.2139
Epoch [380], val_loss: 9146.2139
Epoch [400], val_loss: 9146.2129
Epoch [420], val_loss: 9146.2139
Epoch [440], val_loss: 9146.2119
Epoch [460], val_loss: 9146.2139
Epoch [480], val_loss: 9146.2129
Epoch [500], val_loss: 9146.2119


In [30]:
epochs = 500
lr = 1e-6
history5 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [20], val_loss: 9146.2119
Epoch [40], val_loss: 9146.2109
Epoch [60], val_loss: 9146.2119
Epoch [80], val_loss: 9146.2109
Epoch [100], val_loss: 9146.2109
Epoch [120], val_loss: 9146.2119
Epoch [140], val_loss: 9146.2139
Epoch [160], val_loss: 9146.2139
Epoch [180], val_loss: 9146.2148
Epoch [200], val_loss: 9146.2148
Epoch [220], val_loss: 9146.2129
Epoch [240], val_loss: 9146.2119
Epoch [260], val_loss: 9146.2129
Epoch [280], val_loss: 9146.2139
Epoch [300], val_loss: 9146.2139
Epoch [320], val_loss: 9146.2139
Epoch [340], val_loss: 9146.2139
Epoch [360], val_loss: 9146.2119
Epoch [380], val_loss: 9146.2119
Epoch [400], val_loss: 9146.2119
Epoch [420], val_loss: 9146.2100
Epoch [440], val_loss: 9146.2100
Epoch [460], val_loss: 9146.2119
Epoch [480], val_loss: 9146.2100
Epoch [500], val_loss: 9146.2100
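
Since each call to fit returns a list of per-epoch results, the separate runs can be joined into one loss curve. The helper below is a sketch; the name `collect_losses` and the demo histories are assumptions for illustration, with made-up values in the same format that fit returns.

```python
def collect_losses(*histories):
    """Flatten several fit() histories into one list of validation losses."""
    return [epoch_result["val_loss"] for history in histories for epoch_result in history]

# Hypothetical short histories in the format returned by fit(); values are made up
demo_history_a = [{"val_loss": 3.0}, {"val_loss": 2.0}]
demo_history_b = [{"val_loss": 1.5}]

losses = collect_losses(demo_history_a, demo_history_b)
print(losses)
```

In this notebook it could be called as `collect_losses(history1, history2, history3, history4, history5)` and the result passed to `plt.plot` to see where the loss stops improving.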


In [31]:
# Final validation loss of the model, taken from the last epoch of the last training run
val_loss = history5[-1]['val_loss']

### Step 5: Make predictions using the trained model


In [32]:
def predict_single(input, target, model):
    inputs = input.unsqueeze(0)
    predictions = model(inputs)               
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
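
The unsqueeze(0) call in predict_single turns a single row into a batch of one, which is the shape the model was trained on. The sketch below traces the shapes on a stand-in `nn.Linear` layer (not the trained model) and uses `torch.no_grad()` as an alternative to calling `.detach()` on the output.

```python
import torch
import torch.nn as nn

toy_model = nn.Linear(6, 1)   # stand-in for the trained model
sample = torch.tensor([19.0, 0.0, 28.179, 0.0, 1.0, 3.0])  # one row of 6 features

batched = sample.unsqueeze(0)             # shape (6,) -> (1, 6)
with torch.no_grad():                     # no gradient tracking needed for inference
    prediction = toy_model(batched)[0]    # shape (1, 1) -> (1,)

print(sample.shape, batched.shape, prediction.shape)
```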

In [33]:
input, target = val_ds[0]
predict_single(input, target, model)

Input: tensor([19.0000,  0.0000, 28.1790,  0.0000,  1.0000,  3.0000])
Target: tensor([19417.6621])
Prediction: tensor([2874.9050])


In [34]:
input, target = val_ds[10]
predict_single(input, target, model)

Input: tensor([56.0000,  0.0000, 28.5931,  0.0000,  0.0000,  0.0000])
Target: tensor([13406.3770])
Prediction: tensor([13721.3887])


In [35]:
input, target = val_ds[23]
predict_single(input, target, model)

Input: tensor([47.0000,  0.0000, 23.8360,  1.0000,  0.0000,  3.0000])
Target: tensor([9820.6221])
Prediction: tensor([11435.4199])


In [36]:
# Save the model to disk; the with block closes the file even if pickling fails
filename = 'linear_regression.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
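
Pickling the whole model object ties the saved file to the exact class definition. A commonly used alternative in PyTorch is to save only the parameters with `state_dict`. The following is a sketch on a stand-in layer; the file name and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(6, 1)   # stand-in for the trained InsuranceModel

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_regression.pth")  # hypothetical file name
    torch.save(toy_model.state_dict(), path)               # save only the parameters

    restored = nn.Linear(6, 1)                             # same architecture, fresh weights
    restored.load_state_dict(torch.load(path))             # copy the saved parameters in

same_weights = torch.equal(toy_model.weight, restored.weight)
same_bias = torch.equal(toy_model.bias, restored.bias)
print(same_weights, same_bias)
```

For the notebook's model, the same pattern would apply with `InsuranceModel()` in place of `nn.Linear(6, 1)`.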