<h1 align='center' style='color:black'><b>Insurance Premium Prediction</b></h1>
<h2 align='center' style='color:black'><b>by predicting Medical Expenses</b></h2>

## General Description: -
The dataset was retrieved from Professor Eric Suess's Machine Learning website at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6.
The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). It has 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors, with a numerical value designated for each level.
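
As a quick illustration of the numerical/nominal split described above, the following minimal sketch separates the two groups of columns by dtype. The three rows are invented for illustration and are not taken from insurance.csv; only the column names follow the description.

```python
import pandas as pd

# Hypothetical rows with the same seven columns as insurance.csv
toy = pd.DataFrame({
    "age": [19, 33, 52],
    "sex": ["female", "male", "female"],
    "bmi": [27.9, 22.7, 30.8],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
    "expenses": [16800.0, 4500.0, 11000.0],
})

# Columns with a numeric dtype vs. everything else (the nominal features)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
nominal_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(nominal_cols)
```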

## Aim: -
The purpose of this exercise is to examine the relationships between different features and to fit a regression model that relates an individual's features, such as age, physical/family condition and location, to their existing medical expenses. The model can then be used to predict future medical expenses, which helps a medical insurance company decide what premium to charge.

## Outline: -
1. Import Dataset
2. Data Cleaning and Data Preparation
3. Exploratory Data Analysis
4. Train Test Split
5. Model Building
6. Model Evaluation

In [None]:
# Import Libraries for Analysis
import numpy as np
import pandas as pd

# Import Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import libraries for train test split
from sklearn.model_selection import train_test_split

# Import library for scaling
from sklearn.preprocessing import StandardScaler

# Import library for model building
from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import PolynomialFeatures

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the dataset
df = pd.read_csv('/kaggle/input/insurance-premium-prediction/insurance.csv')
df.head(5)

## Data Cleaning and Preparation

In [None]:
# Checking the info of data set
df.info()

There are 1338 rows and 7 columns in the data set provided

In [None]:
df.describe()

In [None]:
# creating a copy of the dataset
df_cpy = df.copy()

In [None]:
# checking columns names
df_cpy.columns

All the column names are in lowercase letters, and there are no extra spaces in them.

In [None]:
# checking null values
df_cpy.isnull().sum()

There are no null values in our dataset

The last step in data preparation is outlier treatment, so let's create a box plot for expenses and check for outliers.

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(1,2,1)         
df_cpy['expenses'].plot.box()

plt.subplot(1,2,2)      
plt.hist(df_cpy['expenses'], bins=20)

plt.show()

A few entries have expenses far above the rest. They might affect the prediction, so we have to eliminate them.

In [None]:
df_cpy.expenses.describe()

As can be seen, the maximum value is far greater than the mean and the median. Therefore, based on the box plot, let's delete the rows with expenses of 50000 or more.

In [None]:
df_cpy = df_cpy[df_cpy['expenses']< 50000]   
df_cpy.shape
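
The 50000 cutoff used here was read off the box plot. As a cross-check, the upper whisker limit that a standard box plot uses (Q3 + 1.5 × IQR) can be computed directly. The sketch below is a swapped-in alternative, not the rule this notebook applies, and it may flag a different set of rows than the fixed cutoff. The expense values are invented for illustration.

```python
import pandas as pd

# Hypothetical expenses, invented for illustration only
expenses = pd.Series(
    [1200.0, 3500.0, 4800.0, 7200.0, 9100.0, 12500.0, 14000.0, 63000.0]
)

q1 = expenses.quantile(0.25)
q3 = expenses.quantile(0.75)
iqr = q3 - q1

# Upper fence of a standard box plot: Q3 + 1.5 * IQR
upper_fence = q3 + 1.5 * iqr

flagged = expenses[expenses > upper_fence]   # candidate outliers
kept = expenses[expenses <= upper_fence]     # rows that would be retained

print(upper_fence)
print(flagged.tolist())
```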

In [None]:
# checking the box plot again
plt.figure(figsize=(15,7))

plt.subplot(1,2,1)         
df_cpy['expenses'].plot.box()

plt.subplot(1,2,2)      
plt.hist(df_cpy['expenses'], bins=20)

plt.show()

The remaining data points above the 75% line are very close to each other, so we leave them as they are.

### Treating Categorical data

We have 3 columns with categorical values and 4 columns with numerical values. Before proceeding further, we have to convert the categorical values into numerical values.

In [None]:
# applying one-hot encoding on the categorical features 
df_dummy= pd.get_dummies(df_cpy)

In [None]:
df_dummy
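
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a small hand-made frame (the rows are invented for illustration). Each non-numeric column is expanded into one indicator column per level, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
toy = pd.DataFrame({
    "age": [19, 33, 52],
    "sex": ["female", "male", "female"],
    "smoker": ["yes", "no", "no"],
})

# One indicator column per level of each non-numeric column
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Passing `drop_first=True` would drop one level per feature, which avoids perfectly collinear indicator columns in a linear model. This notebook keeps all levels.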

##  Exploratory Data Analysis

In [None]:
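# We will first check the distribution of expenses with a histogram and density curve
# (sns.distplot is deprecated in recent seaborn versions; histplot replaces it)
sns.histplot(df_dummy.expenses, kde=True)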
plt.title("Expenses Distribution Plot",fontsize=15)
plt.show()

The distribution is right-skewed: most people have medical expenses below 30000, and a small number have expenses between 30000 and 50000.

In [None]:
# Medical Expenses of male and female
plt.figure(figsize=(10,5))
df_cpy.groupby(['sex'])['expenses'].mean().plot.bar()
plt.ylabel('Average Medical Expense')
plt.title("Average Expenses of Male and Female",fontsize=18)
plt.xticks(rotation = 0)
plt.show()

The average medical expense of males is greater than that of females.

In [None]:
# Medical expenses of smokers and non-smokers
plt.figure(figsize=(10,5))
df_cpy.groupby(['smoker'])['expenses'].mean().plot.bar()
plt.ylabel('Average Medical Expense')
plt.title("Average Expenses of a smoker and Non-smoker",fontsize=18)
plt.xticks(rotation = 0)
plt.show()

The medical expense of a smoker is much higher than that of a person who doesn't smoke.

In [None]:
# Medical expenses by region
plt.figure(figsize=(10,5))
df_cpy.groupby(['region'])['expenses'].mean().plot.bar()
plt.ylabel('Average Medical Expense')
plt.title("Average Expenses of people of different region",fontsize=18)
plt.xticks(rotation = 0)
plt.show()

### Checking the relationship between different features

In [None]:
# Plot a pair plot (pairplot creates its own figure)
sns.pairplot(df_cpy)
plt.show()

Let's visualize the correlation coefficients using a heatmap.

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(df_dummy.corr(),annot=True) 

There is no strong correlation among the various features of our data.
Only the 'smoker' feature stands out, with the highest correlation to the target.
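
The heatmap can be complemented by ranking the features by their correlation with the target. The following is a minimal sketch on invented, already one-hot-encoded rows; the column names mirror the encoded data, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical, already one-hot-encoded rows (invented for illustration)
toy = pd.DataFrame({
    "age": [19, 33, 52, 45, 28, 60],
    "bmi": [27.9, 22.7, 30.8, 25.1, 33.0, 29.4],
    "smoker_yes": [1, 0, 0, 1, 0, 1],
    "expenses": [16800.0, 4500.0, 11000.0, 24000.0, 3900.0, 30000.0],
})

# Correlation of each feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()["expenses"]
    .drop("expenses")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```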

## Train Test Split

Splitting the dataset into the Training set and Test set using train_test_split

In [None]:
# at first let's create a copy of our data to use in model building
df_2 = df_dummy.copy()
df_2.head()

In [None]:
# Separating dependent and independent variables

y = df_2.pop('expenses')
X = df_2

In [None]:
print(X.shape)
print(y.shape)

In [None]:
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

Checking sizes of data to know whether they are split correctly 

In [None]:
# Shape of train set
print(X_train.shape)

# Shape of test set
print(X_test.shape)

In [None]:
print(y_train.shape)
print(y_test.shape)
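
For reference, here is a minimal sketch of what a 70/30 split with a fixed `random_state` produces on a small toy array. The 10-row arrays are hypothetical; the point is only that `test_size=0.3` sends 30% of the rows to the test set and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target with 10 rows
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Same random_state, same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```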

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

## Model Building

In [None]:
# using multiple linear regression
reg = LinearRegression()
reg.fit(X_train, y_train)

### Visualising the model
Let's plot a scatter plot to show the real values and the predicted values of expenses using our model.
Since BMI and age are the continuous numerical features in our initial data, we will use them to visualise the predicted results.

In [None]:
# Age vs Expenses
plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
plt.scatter(df_cpy['age'], y, color = 'red')
plt.scatter(df_cpy['age'], reg.predict(X), color = 'blue')
plt.title('Actual Expenses and Predicted Expenses', fontsize = 16)
plt.xlabel('Age', fontsize = 14)
plt.ylabel('Expenses',fontsize = 14)

# BMI vs Expenses
plt.subplot(1,2,2)
plt.scatter(df_cpy['bmi'], y, color = 'red')
plt.scatter(df_cpy['bmi'], reg.predict(X), color = 'blue')
plt.title('Actual Expenses and Predicted Expenses', fontsize = 16)
plt.xlabel('BMI', fontsize = 14)
plt.ylabel('Expenses',fontsize = 14)
plt.show()

Here the red points indicate actual expenses and the blue points indicate predicted expenses.

As we can see, **many predicted values are very different from the actual values**, so let's check the model's score.

In [None]:
# checking the model
reg.score(X_test,y_test)

The linear regression model we created scores about 0.76 on the test set. `score` returns the coefficient of determination (R²), referred to here as an **accuracy of 76%**.
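
The `score` method of a scikit-learn regressor returns the coefficient of determination, R² = 1 − SS_res / SS_tot, rather than a classification-style accuracy. The sketch below checks this on synthetic data (invented for illustration) by computing R² by hand and comparing it with `score` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, invented for illustration: linear signal plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y_true = 3.0 * x[:, 0] + rng.normal(0, 2.0, size=50)

model = LinearRegression().fit(x, y_true)
y_hat = model.predict(x)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(x, y_true), r2_score(y_true, y_hat))
```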

**Let's build another model for the given data using Polynomial Regression to get better accuracy**

In [None]:
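# Polynomial regression: expand the features, then fit a linear model
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures()  # default degree is 2
X_poly = poly_reg.fit_transform(X_train)
reg_2 = LinearRegression()
reg_2.fit(X_poly, y_train)
print(X_poly)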
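
To illustrate why polynomial regression can outperform a plain linear fit, here is a minimal sketch on synthetic data with a quadratic relationship. The data, the `degree=2` setting and the use of `make_pipeline` are assumptions made for this example, not details taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (invented for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y_quad = 2.0 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# Plain linear regression cannot capture the curvature
linear_fit = LinearRegression().fit(x, y_quad)

# Polynomial regression = polynomial feature expansion + linear regression
# (degree=2 is an assumption made for this sketch)
poly_fit = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(x, y_quad)

print(linear_fit.score(x, y_quad), poly_fit.score(x, y_quad))
```

Wrapping both steps in a pipeline means the same feature expansion is applied automatically at prediction time.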

### Visualising the Model

In [None]:
# Plotting a scatter plot to show the real values and the predicted values of Expenses using our Model
plt.figure(figsize=(15,6))

# Predict once for all rows, reusing the already fitted polynomial transformer
y_pred_poly = reg_2.predict(poly_reg.transform(X))

# Age vs Expenses
plt.subplot(1,2,1)
plt.scatter(df_cpy['age'], y, color = 'red')
plt.scatter(df_cpy['age'], y_pred_poly, color = 'blue')
plt.title('Actual Expenses and Predicted Expenses', fontsize = 16)
plt.xlabel('Age', fontsize = 14)
plt.ylabel('Expenses', fontsize = 14)

# BMI vs Expenses
plt.subplot(1,2,2)
plt.scatter(df_cpy['bmi'], y, color = 'red')
plt.scatter(df_cpy['bmi'], y_pred_poly, color = 'blue')
plt.title('Actual Expenses and Predicted Expenses', fontsize = 16)
plt.xlabel('BMI', fontsize = 14)
plt.ylabel('Expenses', fontsize = 14)
plt.show()

As we can see, **most of the predicted values are very near to the actual values of Expenses**, which shows that **this model predicts better than the previous model**. Let's check its accuracy too.

## Model Evaluation

In [None]:
reg_2.score(poly_reg.transform(X_test), y_test)

Since **the polynomial regression model reaches a score (R²) of about 0.85, i.e. an accuracy of 85%, it is better than the previous model**.

**The model generated can be used for predicting medical expenses, from which we can estimate the Insurance Premium amount.**

#### Using
#### y_pred = reg_2.predict(poly_reg.transform(X_sample))
#### where X_sample has the same columns as X, we can predict the medical expenses of a person and hence the Insurance Premium amount according to the predicted medical expenses.
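
A practical detail when predicting for a new person is that the sample must be encoded into exactly the same columns the model was trained on. The sketch below shows one way to do this with `reindex`. The training rows, the new person, the `degree=2` setting and the names `train`, `X_fit`, `poly` and `new_person` are all hypothetical and only demonstrate the mechanics, not realistic predictions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical training rows (invented for illustration)
train = pd.DataFrame({
    "age": [19, 33, 52, 45, 28, 60],
    "bmi": [27.9, 22.7, 30.8, 25.1, 33.0, 29.4],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "expenses": [16800.0, 4500.0, 11000.0, 24000.0, 3900.0, 30000.0],
})
X_fit = pd.get_dummies(train.drop(columns="expenses"), dtype=int)

poly = PolynomialFeatures(degree=2)  # degree is an assumption for this sketch
model = LinearRegression().fit(poly.fit_transform(X_fit), train["expenses"])

# A single new person: encode, then align to the training columns.
# Levels missing from the sample (here smoker_yes) are filled with 0.
new_person = pd.DataFrame({"age": [40], "bmi": [26.5], "smoker": ["no"]})
X_sample = pd.get_dummies(new_person, dtype=int).reindex(
    columns=X_fit.columns, fill_value=0
)

y_pred = model.predict(poly.transform(X_sample))
print(X_sample.columns.tolist(), y_pred.shape)
```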