# Insurance cost prediction using linear regression


In this assignment, I use a person's age, sex, BMI, number of children and smoking habit to predict their yearly medical bills. A model like this is useful for insurance companies when setting a person's yearly insurance premium. The dataset for this problem is taken from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance).


I will create a model with the following steps:
1. Download and explore the dataset
2. Prepare the dataset for training
3. Create a linear regression model
4. Train the model to fit the data
5. Make predictions using the trained model



In [4]:
import torch
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split

## Step 1: Download and explore the data



To load the dataset into memory, we'll use the `read_csv` function from the `pandas` library. The data will be loaded as a Pandas dataframe. See this short tutorial to learn more: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

In [5]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
dataframe_raw = pd.read_csv(r"C:\python\pytorch\insurance.csv")
dataframe_raw.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


Let's drop the `region` column because it doesn't seem to be an important parameter here.

In [9]:
dataframe = dataframe_raw.drop(['region'], axis=1)
dataframe.head()

Unnamed: 0,age,sex,bmi,children,smoker,charges
0,19,female,27.9,0,yes,16884.924
1,18,male,33.77,1,no,1725.5523
2,28,male,33.0,3,no,4449.462
3,33,male,22.705,0,no,21984.47061
4,32,male,28.88,0,no,3866.8552


In [10]:
num_rows = dataframe.shape[0]
print(num_rows)

1338


In [11]:
num_cols = dataframe.shape[1]
print(num_cols)

6


In [12]:
input_cols = ['age','sex','bmi','children','smoker']

In [13]:
categorical_cols = ['sex','smoker']

In [14]:
output_cols = ['charges']

## Step 2: Prepare the dataset for training

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. The first step is to convert it to numpy arrays. With `input_cols`, `categorical_cols` and `output_cols` filled out correctly, the following function performs that conversion.

In [19]:
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract inputs & outputs as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array

In [20]:
inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array

(array([[19.  ,  0.  , 27.9 ,  0.  ,  1.  ],
        [18.  ,  1.  , 33.77,  1.  ,  0.  ],
        [28.  ,  1.  , 33.  ,  3.  ,  0.  ],
        ...,
        [18.  ,  0.  , 36.85,  0.  ,  0.  ],
        [21.  ,  0.  , 25.8 ,  0.  ,  0.  ],
        [61.  ,  0.  , 29.07,  0.  ,  1.  ]]),
 array([[16884.924 ],
        [ 1725.5523],
        [ 4449.462 ],
        ...,
        [ 1629.8335],
        [ 2007.945 ],
        [29141.3603]]))
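
To see what the category conversion in `dataframe_to_arrays` does in isolation, here is a minimal sketch on a three-row stand-in dataframe (the `sample` frame and the `codes` and `mapping` names are illustrative, not part of the notebook). For string columns, pandas orders the categories alphabetically, which is consistent with `female` becoming 0 and `yes` becoming 1 in the arrays above.

```python
import pandas as pd

# Illustrative stand-in for the 'sex' and 'smoker' columns of the insurance data
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

codes = {}
mapping = {}
for col in sample.columns:
    as_category = sample[col].astype("category")
    # Integer code assigned to each row
    codes[col] = as_category.cat.codes.tolist()
    # Lookup table from integer code back to the original label
    mapping[col] = dict(enumerate(as_category.cat.categories))

print(codes)
print(mapping)
```

Keeping a lookup like `mapping` around makes it easier to interpret the numeric inputs later, for example when reading a single prediction.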

Convert the numpy arrays `inputs_array` and `targets_array` into PyTorch tensors. Make sure that the data type is `torch.float32`.

In [21]:
inputs = torch.tensor(inputs_array, dtype=torch.float32)
targets = torch.tensor(targets_array, dtype=torch.float32)

In [22]:
inputs.dtype, targets.dtype

(torch.float32, torch.float32)
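
The dtype check matters because NumPy arrays of decimals default to `float64`, while `nn.Linear` layers use `float32` weights by default. The sketch below, using a single made-up row, shows one conversion that keeps the NumPy dtype and one that casts explicitly; the `arr`, `shared` and `converted` names are illustrative.

```python
import numpy as np
import torch

# One illustrative row in the same layout as inputs_array
arr = np.array([[19.0, 0.0, 27.9, 0.0, 1.0]])  # NumPy defaults to float64

# from_numpy keeps the NumPy dtype (float64) and shares memory with arr
shared = torch.from_numpy(arr)

# torch.tensor with an explicit dtype copies the data and casts it to float32
converted = torch.tensor(arr, dtype=torch.float32)

print(arr.dtype, shared.dtype, converted.dtype)
```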

Next, we need to create PyTorch datasets & data loaders for training & validation. We'll start by creating a `TensorDataset`.

In [23]:
dataset = TensorDataset(inputs, targets)

In [24]:
inputs.shape, targets.shape

(torch.Size([1338, 5]), torch.Size([1338, 1]))

In [25]:
val_percent = 0.1 
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size


train_ds, val_ds = random_split(dataset, [train_size,val_size]) # Use the random_split function to split dataset into 2 parts of the desired length
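
Because `random_split` shuffles the rows, every run of the notebook yields a different validation set, so validation losses are not directly comparable between runs. One possible way to make the split repeatable is to pass a seeded generator. The sketch below uses a stand-in dataset with the same number of rows; the seed value 42 and the `seeded_split` helper are assumptions for illustration only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with the same number of rows as the insurance data (1338)
features = torch.arange(1338, dtype=torch.float32).unsqueeze(1)
labels = torch.zeros(1338, 1)
toy_dataset = TensorDataset(features, labels)

val_size = int(1338 * 0.1)
train_size = 1338 - val_size

def seeded_split(seed):
    # A generator with a fixed seed makes the shuffle repeatable
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy_dataset, [train_size, val_size], generator=generator)

train_a, val_a = seeded_split(42)
train_b, val_b = seeded_split(42)

print(len(train_a), len(val_a))
# Same seed, same rows in the validation set
print(list(val_a.indices) == list(val_b.indices))
```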

In [26]:
#random_split?
#train_ds?

Next, create data loaders for training & validation.



In [27]:
batch_size = 128

In [28]:
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
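
As a quick sanity check on the loaders, the number of batches per epoch follows from the split sizes and the batch size. This is plain arithmetic using the numbers from this notebook (1338 rows, 10% validation, batch size 128); the variable names are illustrative.

```python
import math

num_rows = 1338
val_size = int(num_rows * 0.1)      # rows held out for validation
train_size = num_rows - val_size    # rows used for training
batch_size = 128

# The last batch is smaller when the size is not a multiple of batch_size
train_batches = math.ceil(train_size / batch_size)
last_train_batch = train_size - (train_batches - 1) * batch_size
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size)                            # 1205 133
print(train_batches, last_train_batch, val_batches)    # 10 53 2
```

With these settings, each epoch therefore performs 10 optimizer steps.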

Let's look at a batch of data to verify everything is working fine so far.

In [29]:
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break

inputs: tensor([[18.0000,  1.0000, 31.6800,  2.0000,  1.0000],
        [36.0000,  0.0000, 29.0400,  4.0000,  0.0000],
        [27.0000,  1.0000, 30.3000,  3.0000,  0.0000],
        [23.0000,  1.0000, 32.5600,  0.0000,  0.0000],
        [62.0000,  1.0000, 27.5500,  1.0000,  0.0000],
        [37.0000,  0.0000, 26.4000,  0.0000,  1.0000],
        [27.0000,  1.0000, 26.0300,  0.0000,  0.0000],
        [19.0000,  0.0000, 35.1500,  0.0000,  0.0000],
        [36.0000,  1.0000, 30.8750,  1.0000,  0.0000],
        [50.0000,  0.0000, 25.6000,  0.0000,  0.0000],
        [26.0000,  1.0000, 30.0000,  1.0000,  0.0000],
        [55.0000,  0.0000, 33.5350,  2.0000,  0.0000],
        [21.0000,  0.0000, 34.8700,  0.0000,  0.0000],
        [25.0000,  0.0000, 41.3250,  0.0000,  0.0000],
        [19.0000,  1.0000, 19.8000,  0.0000,  0.0000],
        [18.0000,  0.0000, 39.1600,  0.0000,  0.0000],
        [21.0000,  0.0000, 25.8000,  0.0000,  0.0000],
        [38.0000,  0.0000, 40.5650,  1.0000,  0.0000],
  

## Step 3: Create a Linear Regression Model



In [31]:
input_size = len(input_cols)
output_size = len(output_cols) 


In [35]:
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Single linear layer: input_size features in, output_size targets out
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        out = self.linear(xb)
        return out

    def training_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss (mean absolute error)
        loss = F.l1_loss(out, targets)
        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss (mean absolute error)
        loss = F.l1_loss(out, targets)
        return {'val_loss': loss.detach()}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}

    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 100th epoch and after the final epoch
        if (epoch+1) % 100 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
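
The model uses `F.l1_loss`, which is the mean absolute error between predictions and targets, so the loss is in the same units as `charges`. A small sketch with made-up numbers (not taken from the dataset) shows that it matches the hand-computed mean of absolute differences, with `F.mse_loss` alongside for contrast.

```python
import torch
import torch.nn.functional as F

# Made-up predictions and targets, shaped like a batch of 3 with 1 output each
predictions = torch.tensor([[100.0], [250.0], [400.0]])
targets = torch.tensor([[110.0], [240.0], [430.0]])

# L1 loss: mean of |prediction - target| = (10 + 10 + 30) / 3
l1 = F.l1_loss(predictions, targets)
manual = (predictions - targets).abs().mean()

# MSE squares each error, so the single large error (30) dominates
mse = F.mse_loss(predictions, targets)

print(l1.item(), manual.item(), mse.item())
```

Since errors are not squared, a few very large bills affect the L1 loss less than they would affect a squared-error loss.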

Let us create a model using the `InsuranceModel` class. 

In [36]:
model = InsuranceModel()

Let's check out the weights and biases of the model using `model.parameters()`. They are randomly initialized, so the exact values differ from run to run.

In [37]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.0592, -0.2975, -0.2417,  0.4435,  0.2920]], requires_grad=True),
 Parameter containing:
 tensor([-0.2773], requires_grad=True)]
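
The two parameters above are a weight matrix of shape `(1, 5)`, one weight per input column, and a bias of shape `(1,)`. As a minimal sketch, the following checks that an `nn.Linear` layer computes `x @ W.T + b`; the input row is the first row of the dataset and the seed is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so the random initial weights are repeatable

layer = nn.Linear(5, 1)  # 5 input features, 1 output (charges)
x = torch.tensor([[19.0, 0.0, 27.9, 0.0, 1.0]])  # age, sex, bmi, children, smoker

by_layer = layer(x)
# The same computation written out: weighted sum of the inputs plus the bias
by_hand = x @ layer.weight.t() + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(by_layer, by_hand))
```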

## Step 4: Train the model to fit the data



In [39]:
def evaluate(model, val_loader):
    # Compute the loss for each validation batch, then average the batch losses
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training phase: one optimizer step per batch
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # compute gradients
            optimizer.step()       # update weights and bias
            optimizer.zero_grad()  # reset gradients for the next batch
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history
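
To make the `loss.backward()` / `optimizer.step()` pair in `fit` more concrete, here is a small sketch showing that one plain SGD step is `parameter - lr * gradient`. The toy model, data and learning rate are made up for illustration and are unrelated to the insurance data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed for repeatable initial weights

toy_model = nn.Linear(2, 1)
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[10.0], [20.0]])
lr = 0.1

# Forward pass and gradient computation, as in training_step + backward
loss = F.l1_loss(toy_model(x), y)
loss.backward()

# What plain SGD should produce: parameter - lr * gradient
expected_weight = toy_model.weight.detach() - lr * toy_model.weight.grad
expected_bias = toy_model.bias.detach() - lr * toy_model.bias.grad

optimizer = torch.optim.SGD(toy_model.parameters(), lr=lr)
optimizer.step()
optimizer.zero_grad()

print(torch.allclose(toy_model.weight, expected_weight))
print(torch.allclose(toy_model.bias, expected_bias))
```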

Use the `evaluate` function to calculate the loss on the validation set before training.

In [40]:
result = evaluate(model, val_loader)
print(result)

{'val_loss': 9050.0}


Train the model 4-5 times with different learning rates and for different numbers of epochs.

Vary learning rates by orders of 10 (e.g. `1e-2`, `1e-3`, `1e-4`, `1e-5`, `1e-6`) to figure out what works.

In [41]:
epochs = 1000
lr = 1e-2
history1 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [100], val_loss: 3805.9246
Epoch [200], val_loss: 3644.7153
Epoch [300], val_loss: 3507.6113
Epoch [400], val_loss: 3343.9238
Epoch [500], val_loss: 3241.2388
Epoch [600], val_loss: 3160.9927
Epoch [700], val_loss: 3123.7441
Epoch [800], val_loss: 3105.5884
Epoch [900], val_loss: 3102.9993
Epoch [1000], val_loss: 3092.7505


In [42]:
epochs = 1000
lr = 1e-3
history2 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [100], val_loss: 3093.2710
Epoch [200], val_loss: 3094.1414
Epoch [300], val_loss: 3095.1921
Epoch [400], val_loss: 3093.6138
Epoch [500], val_loss: 3093.0237
Epoch [600], val_loss: 3093.0742
Epoch [700], val_loss: 3091.5332
Epoch [800], val_loss: 3090.9634
Epoch [900], val_loss: 3090.9390
Epoch [1000], val_loss: 3090.1238


In [43]:
epochs = 1000
lr = 1
history3 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [100], val_loss: 2977.6240
Epoch [200], val_loss: 3085.0601
Epoch [300], val_loss: 3202.4131
Epoch [400], val_loss: 2986.4631
Epoch [500], val_loss: 2902.7234
Epoch [600], val_loss: 3039.8623
Epoch [700], val_loss: 2858.1611
Epoch [800], val_loss: 2709.3862
Epoch [900], val_loss: 3104.8550
Epoch [1000], val_loss: 2769.4980


In [44]:
epochs = 1000
lr = 2
history4 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [100], val_loss: 2677.8403
Epoch [200], val_loss: 3513.7786
Epoch [300], val_loss: 3725.5938
Epoch [400], val_loss: 2552.6465
Epoch [500], val_loss: 3391.4231
Epoch [600], val_loss: 3455.1990
Epoch [700], val_loss: 2424.0386
Epoch [800], val_loss: 2645.6934
Epoch [900], val_loss: 2290.7632
Epoch [1000], val_loss: 2687.0684


In [45]:
epochs = 1000
lr = 1
history5 = fit(epochs, lr, model, train_loader, val_loader)

Epoch [100], val_loss: 2256.8691
Epoch [200], val_loss: 2352.3257
Epoch [300], val_loss: 2385.6714
Epoch [400], val_loss: 2625.7075
Epoch [500], val_loss: 2292.3596
Epoch [600], val_loss: 2482.0195
Epoch [700], val_loss: 2339.1543
Epoch [800], val_loss: 2083.9797
Epoch [900], val_loss: 2209.2368
Epoch [1000], val_loss: 2258.6819
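
Note that every `fit` call above receives the same `model` object, so each run continues from the weights left by the previous one rather than starting over. To compare learning rates in isolation, one option is to create a fresh model for each rate. The sketch below does this on hypothetical synthetic data (`x`, `y`, `true_w` and the `train_fresh` helper are all made up for illustration), using full-batch updates to keep it short.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical synthetic data standing in for the insurance tensors
x = torch.rand(200, 5)
true_w = torch.tensor([[3.0], [-2.0], [1.0], [0.5], [4.0]])
y = x @ true_w + 1.0

def train_fresh(lr, epochs=200):
    torch.manual_seed(1)        # identical initial weights for every learning rate
    model = nn.Linear(5, 1)     # fresh model, so runs do not build on each other
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.l1_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    with torch.no_grad():
        return F.l1_loss(model(x), y).item()

results = {lr: train_fresh(lr) for lr in (1e-1, 1e-2, 1e-3)}
for lr, final_loss in results.items():
    print(f"lr={lr:g}: final loss {final_loss:.4f}")
```

Continuing to train one model, as done above, is still a reasonable way to reach a low final loss; the fresh-model loop only helps when the goal is to judge each learning rate on its own.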


The final validation loss of the model, after all five training runs:

In [46]:
val_loss = history5[-1]['val_loss']  # 2258.6819 in the run above

## Step 5: Make predictions using the trained model



In [49]:
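def predict_single(input, target, model):
    # Add a batch dimension, since the model expects a batch of inputs
    inputs = input.unsqueeze(0)
    predictions = model(inputs)
    # Take the single prediction and detach it from the autograd graph
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)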

In [56]:
input, target = val_ds[1]
predict_single(input, target, model)

Input: tensor([61.0000,  0.0000, 28.2000,  0.0000,  0.0000])
Target: tensor([13041.9209])
Prediction: tensor([13202.2148])


In [51]:
input, target = val_ds[10]
predict_single(input, target, model)

Input: tensor([56.0000,  0.0000, 35.8000,  1.0000,  0.0000])
Target: tensor([11674.1299])
Prediction: tensor([12003.3613])


In [61]:
input, target = val_ds[100]
predict_single(input, target, model)

Input: tensor([31.0000,  1.0000, 31.0650,  3.0000,  0.0000])
Target: tensor([5425.0234])
Prediction: tensor([5957.5327])
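
To put the three predictions above in perspective, the following sketch computes the absolute and relative error for each, using the target and prediction values copied from the outputs. Three samples are far too few to judge the model; the overall validation loss is the more reliable figure.

```python
# (target, prediction) pairs copied from the three outputs above
samples = [
    (13041.9209, 13202.2148),
    (11674.1299, 12003.3613),
    (5425.0234, 5957.5327),
]

abs_errors = [abs(pred - target) for target, pred in samples]
pct_errors = [100 * err / target for err, (target, _) in zip(abs_errors, samples)]

for (target, pred), err, pct in zip(samples, abs_errors, pct_errors):
    print(f"target={target:.2f} prediction={pred:.2f} error={err:.2f} ({pct:.2f}%)")
```

The same absolute error weighs more heavily on a small bill than on a large one, which is why the third sample has the largest relative error.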
