# Insurance Premium Prediction Using a Multiple Linear Regression Model

Background information on the dataset is available __[here](https://www.kaggle.com/mirichoi0218/insurance)__.

Our aim is to predict insurance charges from six features (age, sex, bmi, children, smoker, and region). Our first step is to import the dependencies and load the dataset.

In [1]:
#Import Dependencies 
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt 

In [2]:
#Load And Read Data 
dataset = pd.read_csv('insurance.csv')
dataset.head()

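|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |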


In [3]:
#The dataset has 1338 rows and 7 columns
dataset.shape

(1338, 7)

Before we proceed, let's check whether the dataset has any missing values. We'll do so with the pandas DataFrame.isnull() function.

DataFrame.isnull() detects missing values in the given object. It returns a boolean object of the same shape indicating whether each value is NA: missing values map to True and non-missing values map to False. Chaining .sum() then counts the True values in each column.
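To make the True/False mapping concrete, here is a minimal sketch on a tiny made-up frame. The `toy` data and its values are assumptions for illustration only and are not part of the insurance dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the insurance data) with one missing entry per column
toy = pd.DataFrame({
    "age": [19, np.nan, 28],
    "smoker": ["yes", "no", None],
})

# Boolean frame with the same shape as `toy`: True where a value is missing
mask = toy.isnull()

# True counts as 1 when summed, so this gives the number of missing values per column
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```

A column total of zero, as in the next cell, means that column has no missing entries.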

In [4]:
#check for missing values 
dataset.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

As we can see, the dataset has no missing values.

Our next step is to create the X matrix of features and the y vector of the target variable.

In [5]:
#Create the X-matrix of features (all columns except the last) and the y-vector (the last column, charges)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [6]:
#Encoding Categorical Data for Matrix X for the sex, smoker and region columns
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(transformers=[("oh", OneHotEncoder(), [1, 4, 5])], remainder="passthrough")
X = ct.fit_transform(X)
X

array([[1.0, 0.0, 0.0, ..., 19, 27.9, 0],
       [0.0, 1.0, 1.0, ..., 18, 33.77, 1],
       [0.0, 1.0, 1.0, ..., 28, 33.0, 3],
       ...,
       [1.0, 0.0, 1.0, ..., 18, 36.85, 0],
       [1.0, 0.0, 1.0, ..., 21, 25.8, 0],
       [1.0, 0.0, 0.0, ..., 61, 29.07, 0]], dtype=object)

As you can see, the categorical columns have been converted into numerical data. In the process, the column order changed: ColumnTransformer places the one-hot-encoded columns first, and remainder="passthrough" appends the untransformed columns (age, bmi, children) on the right side of the output.
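The reordering is easier to see on a small example. The sketch below uses a hypothetical two-row array laid out as (age, sex, bmi), which is an assumption for illustration and not the real dataset. It also passes `sparse_threshold=0` so the result is always a dense array, which the notebook cell above does not do.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows laid out as (age, sex, bmi)
toy = np.array([
    [19, "female", 27.9],
    [18, "male", 33.77],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("oh", OneHotEncoder(), [1])],  # encode only column 1 (sex)
    remainder="passthrough",                       # keep age and bmi unchanged
    sparse_threshold=0,                            # always return a dense array
)
encoded = ct.fit_transform(toy)

# The two one-hot columns (female, male) come first; age and bmi follow on the right
print(encoded)
```

Although sex was the second column of the input, its encoded columns lead the output, and the passthrough columns keep their original relative order after them.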

In [7]:
#Splitting dataset Into Training & Testing Data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state=0)

In [8]:
#Fitting Multiple Linear Regression To Training Set
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [9]:
#Predicting The Test Results
y_pred = regressor.predict(X_test)
results_comparison = pd.DataFrame({"Actual Costs": y_test.flatten(), "Predicted Costs": y_pred.flatten().round(2)})
results_comparison

|     | Actual Costs | Predicted Costs |
|-----|--------------|-----------------|
| 0 | 9724.53000 | 11253.19 |
| 1 | 8547.69130 | 9544.91 |
| 2 | 45702.02235 | 37849.80 |
| 3 | 12950.07120 | 16069.27 |
| 4 | 9644.25250 | 6734.41 |
| ... | ... | ... |
| 397 | 3277.16100 | 7412.79 |
| 398 | 17942.10600 | 26648.65 |
| 399 | 10226.28420 | 14064.27 |
| 400 | 14418.28040 | 17140.03 |


In [12]:
# Calculating the performance of the regressor (score() returns R^2, the coefficient of determination)
from sklearn import metrics
print('Train Score: {:.2f} %'.format(regressor.score(X_train, y_train) * 100))
print('Test Score: {:.2f} %'.format(regressor.score(X_test, y_test) * 100))

Train Score: 73.10 %
Test Score: 79.09 %


In [13]:
print("Mean Absolute Error", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error 4011.449679327992
Mean Squared Error 33342497.82695455
Root Mean Squared Error 5774.296305780866
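To help interpret the scores and error metrics printed above, the following sketch computes the same four quantities by hand with NumPy. The four-value arrays `y_true` and `y_hat` are made up for illustration and have nothing to do with the insurance predictions.

```python
import numpy as np

# Hypothetical actual and predicted values (not from the insurance model)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the units of the target

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE", mae)
print("MSE", mse)
print("RMSE", rmse)
print("R^2", r2)
```

Because MAE and RMSE are expressed in the same units as the target, the values reported for the insurance model can be read directly as typical errors in charges, while R^2 is the unitless quantity shown as the train and test scores.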
