# In this notebook we explore the Medical Insurance dataset, perform exploratory data analysis, and train several regression models.

# Regarding Dataset
The Medical Insurance dataset contains 7 columns and 1338 rows. Our aim in this notebook is to predict the insurance charges from the input features: age, sex, BMI, smoker status, region, and number of children.

The description of the columns is as below:
1. **Age:** The age of the person.
2. **BMI:** BMI stands for Body Mass Index. It is a numerical value derived from an individual's weight and height and is used as an indicator of body fatness and potential health risks. It is calculated by dividing a person's weight in kilograms by the square of their height in meters (BMI = weight (kg) / height^2 (m^2)).
3. **Sex:** Whether the person is male or female.
4. **Smoker:** Whether the person is a smoker.
5. **Children:** The number of children covered by the health insurance.
6. **Region:** The US region the person belongs to: northeast, southeast, southwest, or northwest.
7. **Charges:** The amount paid for medical health insurance.
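
The BMI formula in the column description can be checked with a few lines of code. This is a minimal sketch; the function name `bmi` and the example weight and height are illustrative and do not come from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Return body mass index: weight in kg divided by height in m, squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Illustrative values only (not taken from the dataset)
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```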


# Importing all the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
import warnings 
warnings.filterwarnings("ignore")

In [None]:
dataframe= pd.read_csv("/kaggle/input/insurance/insurance.csv")
dataframe.head()

# Exploratory Data Analysis

In [None]:
# check the shape of the dataframe
dataframe.shape

In [None]:
# check the description of the dataframe
dataframe.info()

In [None]:
dataframe.describe()

In [None]:
# check for null values in the dataframe
dataframe.isnull().sum()

In [None]:
# check for duplicate rows in the dataframe
dataframe.duplicated().sum()

In [None]:
# Remove the duplicates from the dataset
dataframe.drop_duplicates(inplace=True)

In [None]:
# check the shape of the dataset
dataframe.shape

In [None]:
# converting the categorical variables into numerical variables using label encoder
encoder= LabelEncoder()
dataframe['sex']= encoder.fit_transform(dataframe['sex'])
dataframe['smoker']= encoder.fit_transform(dataframe['smoker'])
dataframe['region']= encoder.fit_transform(dataframe['region'])
dataframe.head()
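
To read the encoded columns it helps to know which integer stands for which label. `LabelEncoder` assigns codes in the sorted order of the labels, and the fitted `classes_` attribute exposes that order. The sketch below uses a small made-up list (`toy_smoker`) rather than the real column. Note that the integer codes for a column such as region imply an arbitrary order, which is why the modelling section later switches to one-hot encoding.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only
toy_smoker = ["yes", "no", "no", "yes"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_smoker)

# classes_ holds the original labels in sorted order; the position is the code
toy_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(toy_mapping)
print(list(toy_codes))
```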

In [None]:
# Find the correlation between the variables
correlation=dataframe.corr()
correlation

# Correlation Matrix

In [None]:
# Visualise the correlation matrix with a heatmap
plt.figure(figsize=(8,5))
sns.heatmap(correlation, annot=True, cmap='Wistia', fmt='.2f')
plt.show()

# Univariate Analysis 

In [None]:
# frequency plot for Smoker
dataframe["smoker"].value_counts().plot(kind='bar', figsize=(6, 4), rot=0)
plt.title("Frequencies for Smoker", fontsize=14)
plt.xlabel("Smoker")
plt.ylabel("Count")
plt.show()

In [None]:
# frequency plot for children
dataframe["children"].value_counts().plot(kind="bar", figsize=(6,4), rot=0)
plt.title("Frequencies for Children", fontsize=14)
plt.xlabel("Children")
plt.ylabel("Count")
plt.show()

In [None]:
# frequency plot for Sex
dataframe["sex"].value_counts().plot(kind="bar", figsize=(6,4), rot=0)
plt.title("Frequencies for Sex", fontsize=14)
plt.xlabel("Sex")
plt.ylabel("Count")
plt.show()

In [None]:
# Density plot for age (sns.distplot is deprecated, so kdeplot is used instead)
sns.kdeplot(dataframe["age"])
plt.title("Data distribution for age using density plot", fontsize=14)
plt.xlabel("age")
plt.ylabel("density")
plt.show()

In [None]:
# Density plot for bmi (sns.distplot is deprecated, so kdeplot is used instead)
sns.kdeplot(dataframe["bmi"])
plt.title("Data distribution for bmi using density plot", fontsize=14)
plt.xlabel("bmi")
plt.ylabel("density")
plt.show()

# Bivariate Analysis
In bivariate analysis we look at the relationship between two variables. Here we examine how our target variable (charges) relates to the independent features (smoker, children, sex) using violin plots.


In [None]:
fig=plt.figure(figsize=(15,15))

ax=fig.add_subplot(221)
sns.violinplot(data=dataframe, x="smoker", y="charges", color='c', ax=ax)
ax.set_title('Distribution of charges vs smokers', fontsize=16)

ax=fig.add_subplot(222)
sns.violinplot(data=dataframe, x="children", y="charges", color='c', ax=ax)
ax.set_title('Distribution of charges vs children', fontsize=16)

ax=fig.add_subplot(223)
sns.violinplot(data=dataframe, x="sex", y="charges", color='c', ax=ax)
ax.set_title('Distribution of charges vs sex', fontsize=16)

ax=fig.add_subplot(224)
sns.violinplot(data=dataframe, x="sex", y="charges", hue="smoker", color='c', ax=ax)
ax.set_title('Distribution of charges vs sex vs smoker', fontsize=16)

plt.show()

# How we detect the outliers in our dataset

1. We first look for outliers visually. The bivariate analysis above shows that a person who smokes pays higher medical insurance charges.
2. The correlation matrix shows a correlation of 0.79 between charges and smoker, which means the smoker column has a strong influence on the medical insurance charges.
3. We also consider age and bmi when looking for outliers. The 95th percentile of the charges column is 41210.04980000002, so 95% of the rows have charges at or below this value. The maximum of the charges column is 63770.428010, so the remaining 5% of the rows lie between these two values.
4. We then select all rows whose charges are greater than 41210.04980000002 and inspect them. In these rows the person is a smoker and the age and bmi values tend to be high, which explains the high insurance charges.
5. Since the high charges are consistent with these features, we conclude that there are no outliers in the dataset.

In [None]:
# Find the 95th percentile of the charges column
charge_value=np.quantile(dataframe["charges"], 0.95)
charge_value

In [None]:
# Select all rows whose charges exceed the 95th percentile (charge_value) to inspect them for outliers
dataframe[dataframe["charges"]>charge_value]
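
The inspection above is visual. As a complement, the high-charge rows can also be summarised numerically. The sketch below is only an illustration: it uses a small made-up frame (`toy`, not the real dataset) and a 75th-percentile cut-off, chosen because the toy data has just ten rows.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not the insurance dataset)
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 47, 52, 58, 60, 62, 64],
    "bmi": [24.0, 26.5, 22.7, 29.0, 27.4, 30.8, 25.9, 34.2, 36.1, 38.0],
    "smoker": ["no", "no", "no", "yes", "no", "no", "no", "yes", "yes", "yes"],
    "charges": [1700, 2700, 4400, 21000, 8200, 10500, 11900, 44000, 46000, 48000],
})

# Rows above the chosen percentile of charges
toy_threshold = np.quantile(toy["charges"], 0.75)
high = toy[toy["charges"] > toy_threshold]

# Share of smokers among the high-charge rows
smoker_share = (high["smoker"] == "yes").mean()

# Mean age and bmi: all rows versus high-charge rows
feature_means = pd.DataFrame({
    "all_rows": toy[["age", "bmi"]].mean(),
    "high_charges": high[["age", "bmi"]].mean(),
})

print(smoker_share)
print(feature_means)
```

If the high-charge rows are dominated by smokers with above-average age and bmi, as in this toy frame, that supports reading them as explained by the features rather than as outliers.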

# Data Preprocessing and Modelling

In [None]:
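# Reload the raw data so the categorical columns are strings again for one-hot encoding.
# Reloading also restores the duplicate rows removed earlier, so drop them again.
dataframe= pd.read_csv("/kaggle/input/insurance/insurance.csv").drop_duplicates()
dataframe.head()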

# Split the data into train and test sets

In [None]:
x_train, x_test, y_train, y_test= train_test_split(dataframe.drop("charges", axis=1),
                                                   dataframe["charges"],
                                                   test_size=0.2)

x_train.shape, x_test.shape, y_train.shape, y_test.shape
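
As written, the split above is drawn at random on every run, so the errors reported later can change between runs. A minimal sketch, assuming a fixed `random_state` is acceptable here (the seed 42 and the toy arrays are illustrative, not from the notebook), shows that the same seed reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target, for illustration only
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed return identical train/test partitions
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```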

# X_train Encoding

In [None]:
ohe= OneHotEncoder(handle_unknown="ignore")

x_train_ohe= ohe.fit_transform(x_train[['sex', 'smoker', 'region']])
x_train_ohe= x_train_ohe.toarray()

x_train_ohe_df= pd.DataFrame(x_train_ohe, columns=ohe.get_feature_names_out(['sex', 'smoker', 'region']))

# The encoded frame gets a fresh default index; restore the original index so concat aligns the rows
x_train_ohe_df.index= x_train.index

# Joining the tables
x_train = pd.concat([x_train, x_train_ohe_df], axis=1)

# Dropping old categorical columns
x_train.drop(["sex", "smoker", "region"], axis=1, inplace=True)

# Checking result
x_train.head()


# X_test Encoding

In [None]:
x_test_ohe= ohe.transform(x_test[['sex', 'smoker', 'region']])
x_test_ohe= x_test_ohe.toarray()

x_test_ohe_df= pd.DataFrame(x_test_ohe, columns=ohe.get_feature_names_out(['sex', 'smoker', 'region']))
#print(x_test_ohe_df)

# The encoded frame gets a fresh default index; restore the original index so concat aligns the rows
x_test_ohe_df.index= x_test.index

# Joining the tables
x_test= pd.concat([x_test, x_test_ohe_df], axis=1)

# Dropping old categorical columns
x_test.drop(["sex", "smoker", "region"], axis=1, inplace=True)

# Checking result
x_test.head()

# Create the Regression models and define their parameters

In [None]:
models_parameters= {

       "LinearRegression":[LinearRegression(),  {'n_jobs':[-1]}],
       "RandomForestRegressor": [RandomForestRegressor(), {'n_estimators':[100], 'max_depth':[10], 'min_samples_split':[2], 'criterion':['squared_error']}],
       "DecisionTreeRegressor": [DecisionTreeRegressor(), {'splitter':['best'], 'max_depth':[12], 'min_samples_split':[2],'criterion':['squared_error']}],
       "GradientBoostingRegressor":[GradientBoostingRegressor(), {'n_estimators':[120], 'learning_rate':[0.1],'max_depth':[12], 'min_samples_leaf':[3],'loss':['squared_error']}],
       "SupportVectorRegressor": [SVR(), {'kernel':['rbf'], 'gamma':['scale']}],
       "Lasso":[ Lasso(), {'alpha':[1.0,1.1],'max_iter':[1000,1200],'selection':['cyclic', 'random']}],
       "Ridge":[Ridge(), { 'alpha':[1.0,1.1],'max_iter':[1000,1200],'solver':['auto','svd','lsqr']}]
}


# Train all the Regression models by using Grid Search CV

In [None]:
result={}
for key, value in models_parameters.items():
    # value[0] is the estimator, value[1] is its parameter grid
    # 10-fold grid search on the training set, selecting the best parameters by R^2
    regressor = GridSearchCV(value[0],value[1],cv=10, scoring="r2", n_jobs=-1).fit(x_train, y_train)
    # predict on the held-out test set with the refitted best estimator
    y_pred = regressor.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    root_mse=np.sqrt(mse)
    mae=mean_absolute_error(y_test, y_pred)
    # store [RMSE, MAE] for this model
    result[key]=[root_mse, mae]
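
The loop above keeps only the test errors. A fitted `GridSearchCV` object also records which parameter combination won and its mean cross-validated score. The sketch below shows how these could be inspected; it uses synthetic data from `make_regression` and a small illustrative grid (`demo_grid`), not the insurance data or the grids defined above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Illustrative grid; the search is scored by R^2 as in the loop above
demo_grid = {"alpha": [0.1, 1.0, 10.0]}
demo_search = GridSearchCV(Ridge(), demo_grid, cv=5, scoring="r2").fit(X_tr, y_tr)

# Best parameter combination and its mean cross-validated R^2
print(demo_search.best_params_)
print(demo_search.best_score_)

# R^2 of the refitted best estimator on the held-out test set
test_r2 = demo_search.score(X_te, y_te)
print(test_r2)
```

Note that the search selects parameters by R^2 on the training folds, while the comparison between models in this notebook uses RMSE and MAE on the test set.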

# Getting Results from all the models 

In [None]:
result

In [None]:
# Rows of `result` are [RMSE, MAE] per model; label them and transpose so each model is a row
final_results= pd.DataFrame(result, index=["RootMeanSquaredError", "MeanAbsoluteError"]).T
final_results

# Comparing the Results of Regression models with each other in terms of Root Mean squared Error and Mean Absolute Error
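
For reference, the two metrics compared here can be written as follows, where $y_i$ is the true charge, $\hat{y}_i$ is the predicted charge, and $n$ is the number of test rows (these symbols are introduced here and do not appear in the notebook code):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```

Both are in the same units as charges. RMSE penalises large individual errors more heavily than MAE, so a model can rank differently on the two.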

In [None]:
final_results.plot(kind="bar", figsize=(10, 7)).legend(bbox_to_anchor=(1.0, 1.0));

# Conclusion
1. In the EDA we performed univariate and bivariate analysis and found that the input feature smoker has a correlation of 0.79 with the target variable charges, which shows that the smoker column has a strong influence on the charges column.
2. From the results section we see that the RandomForestRegressor model performs best among all the regression models in terms of root mean squared error and mean absolute error.