# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50">1. EDA for Insurance Premium</p>

### About this file

- The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns): 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors with a numerical value designated for each level.

- The purpose of this exercise is to explore the relationships between the features and to fit a multiple linear regression of existing medical expenses on individual characteristics such as age, physical and family condition, and location. The fitted model can then be used to predict future medical expenses, which helps medical insurers decide what premium to charge.
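
As a rough illustration of what "converted into factors with a numerical value for each level" can look like in pandas, the sketch below encodes three nominal columns as integer codes. The rows are made up and only mimic the column names of insurance.csv; the name `toy` is an assumption for this example.

```python
import pandas as pd

# Made-up rows that only mimic the nominal columns of insurance.csv.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "southwest"],
})

# Cast each nominal column to a categorical ("factor") and read off the
# integer code of each level. Levels are ordered alphabetically here.
encoded = toy.apply(lambda col: col.astype("category").cat.codes)
print(encoded)
```

The codes are arbitrary labels, so for region in particular the numeric order carries no meaning.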



Dataset source link:- https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction

# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50">Import Data and Required Packages</p>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50">Read the Dataset</p>

In [None]:
df = pd.read_csv("./data/insurance.csv")
df.head()

### Dataset info

In [None]:
df.info()

# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50">Check Missing Values in Dataset</p>

In [None]:
df.isna().sum()

- No Missing Values found in the dataset

# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50">Check Duplicates in Dataset</p>

In [None]:
df.duplicated().sum()

In [None]:
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
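
A minimal sketch, on a made-up three-row frame, of why the result of `drop_duplicates` needs to be kept: the call returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Made-up frame with one exact duplicate row.
toy = pd.DataFrame({
    "age": [19, 19, 33],
    "expenses": [1000.0, 1000.0, 2500.0],
})

deduped = toy.drop_duplicates()  # new frame; toy itself is unchanged

print("rows before:", len(toy))
print("rows after: ", len(deduped))
```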

# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50">Descriptive Statistics</p>

## Numerical and Categorical columns separation

In [None]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))
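
An alternative sketch of the same split using `select_dtypes`, shown on a made-up frame (`toy` and the column values are assumptions). It selects by "number or not" rather than comparing against the object dtype, which may matter in pandas versions that store text in a dedicated string dtype.

```python
import pandas as pd

# Made-up rows with two numeric and two text columns.
toy = pd.DataFrame({
    "age": [19, 33],
    "bmi": [27.9, 22.7],
    "sex": ["female", "male"],
    "smoker": ["yes", "no"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:    ", numeric_cols)
print("categorical:", categorical_cols)
```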

## Numerical Columns Description

In [None]:
df.describe().T

## Categorical Columns Description

In [None]:
df[categorical_features].describe().T

## Number of unique values per column

In [None]:
df.nunique()

# <p style="padding:10px;background-color:#87CEEB ;margin:10;color:#000000;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 10px 10px ;overflow:hidden;font-weight:50"> Exploring Data (Analysis with Visualisation)</p>

In [None]:
# distribution of age value
sns.set()
plt.figure(figsize=(6,6))
sns.histplot(df['age'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Age Distribution')


In [None]:
# Gender column
plt.figure(figsize=(6,6))
sns.countplot(x='sex', data=df)
plt.title('Sex Distribution')

In [None]:
df['sex'].value_counts()

In [None]:
# bmi distribution
plt.figure(figsize=(6,6))
sns.histplot(df['bmi'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('BMI Distribution')

Normal BMI Range --> 18.5 to 24.9

In [None]:
# children column
plt.figure(figsize=(6,6))
sns.countplot(x='children', data=df)
plt.title('Children')

In [None]:
df['children'].value_counts()

In [None]:
# smoker column
plt.figure(figsize=(6,6))
sns.countplot(x='smoker', data=df)
plt.title('Smoker Distribution')

In [None]:
df['smoker'].value_counts()

In [None]:
# region column
plt.figure(figsize=(6,6))
sns.countplot(x='region', data=df)
plt.title('Region Distribution')


 

In [None]:
plt.pie(x = df['region'].value_counts(),labels=df['region'].value_counts().index,explode=[0.1,0,0,0],autopct='%1.1f%%',shadow=True)
plt.show()

In [None]:
df['region'].value_counts()
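
The percentages printed on the pie chart can also be read off as numbers. The sketch below uses `value_counts(normalize=True)` on a made-up region column; the values are invented and do not reflect the real dataset.

```python
import pandas as pd

# Made-up region labels: 3 southeast, 2 southwest, 2 northwest, 1 northeast.
region = pd.Series(
    ["southeast", "southeast", "southwest", "northwest",
     "northeast", "southeast", "southwest", "northwest"],
    name="region",
)

# normalize=True returns fractions instead of counts; scale to percent.
shares = region.value_counts(normalize=True).mul(100).round(1)
print(shares)
```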

In [None]:
# distribution of expenses
plt.figure(figsize=(6,6))
sns.histplot(df['expenses'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.title('Expenses Distribution')

In [None]:
# Create a countplot of region with a hue for smoker
plt.figure(figsize=(10,6))
sns.countplot(x='region', data=df, hue='smoker', palette='Blues')
plt.title('Region Distribution by Smoker', size=18)
plt.xlabel('Region', size=14)
plt.ylabel('Count', size=14)
plt.show()

### BIVARIATE ANALYSIS (Does age have any impact on expenses?)

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(x='age',y='expenses',data=df)
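
When several rows share the same x value, `sns.lineplot` by default draws the mean of y at each x, with a confidence band around it. A small sketch of that aggregation on made-up numbers (the frame `toy` is an assumption):

```python
import pandas as pd

# Made-up rows: two people aged 20, two aged 40, one aged 60.
toy = pd.DataFrame({
    "age": [20, 20, 40, 40, 60],
    "expenses": [2000.0, 4000.0, 6000.0, 8000.0, 12000.0],
})

# The central line of the plot corresponds to this per-age mean.
mean_by_age = toy.groupby("age")["expenses"].mean()
print(mean_by_age)
```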

### BIVARIATE ANALYSIS (Does smoking have any impact on expenses?)

In [None]:
## Age vs Expenses
## This also shows who is a smoker or not.

plt.figure(figsize = (10,6))
sns.scatterplot(x='age',y='expenses',hue='smoker', data=df, palette='deep')
plt.title('Age vs Expenses',size=18)
plt.xlabel('Age',size=14)
plt.ylabel('Expenses',size=14)
plt.show()

In [None]:
## Smoker Vs Expenses
sns.violinplot(data=df, x='smoker', y='expenses')

- The violin plot shows that expenses are higher for smokers than for non-smokers.
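
A visual impression like this can be backed with a per-group summary. The sketch below computes mean, median and count per smoker group on made-up numbers; the figures are invented and say nothing about the real data.

```python
import pandas as pd

# Made-up expenses for two smokers and three non-smokers.
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "expenses": [30000.0, 20000.0, 5000.0, 7000.0, 3000.0],
})

summary = toy.groupby("smoker")["expenses"].agg(["mean", "median", "count"])
print(summary)
```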

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=df.age,y=df.expenses,hue=df.smoker)

### Insight
- Within the same age group, smokers incur higher expenses than non-smokers.

### BIVARIATE ANALYSIS (Does region have any impact on expenses?)

In [None]:
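# plt.bar on the raw rows draws one bar per row on top of each other,
# so aggregate to the mean expense per region first
region_mean = df.groupby('region')['expenses'].mean()
plt.bar(region_mean.index, region_mean.values)
plt.title('Impact of region on expenses',fontsize=18)
plt.xlabel('Regions',fontsize=15)
plt.ylabel('Mean expense', fontsize=15)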


In [None]:
sns.violinplot(data=df, x='region', y='expenses')

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=df.age,y=df.expenses,hue=df.region)

In [None]:
sns.violinplot(data=df, x='children', y='expenses')

#### Relation between Sex and Expenses

In [None]:
sex_mean = df.groupby('sex')['expenses'].mean()  # aggregate first; raw rows would overlap
plt.bar(sex_mean.index, sex_mean.values)

In [None]:
sns.violinplot(data=df, x='sex', y='expenses')

No difference in expenses between females and males.

#### BMI

- A bit of feature engineering for this one, as the continuous bmi data is probably better understood in categories. According to the BMI indicators: underweight below 18.5, healthy from 18.5 to below 25, overweight from 25 to below 30, obese 30 and above.


In [None]:
bins = [0,18.5,25,30, 100]
slots = ['under-weight','healthy','over-weight', 'obese']

df['Bmi_range']=pd.cut(df['bmi'],bins=bins,labels=slots,right=False)  # left-closed bins: 18.5 is healthy, 30 is obese
df.head()
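
`pd.cut` closes each interval on the right unless `right=False` is passed, which decides where values sitting exactly on a bin edge end up. A short sketch comparing both settings on a few edge values (the series `edge_values` is made up for illustration):

```python
import pandas as pd

bins = [0, 18.5, 25, 30, 100]
labels = ["under-weight", "healthy", "over-weight", "obese"]

# Values chosen to sit on or next to the bin edges.
edge_values = pd.Series([18.4, 18.5, 25.0, 30.0])

# Default: intervals are (0, 18.5], (18.5, 25], (25, 30], (30, 100].
right_closed = pd.cut(edge_values, bins=bins, labels=labels)

# right=False: intervals are [0, 18.5), [18.5, 25), [25, 30), [30, 100).
left_closed = pd.cut(edge_values, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    "bmi": edge_values,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```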

In [None]:
df['Bmi_range'].value_counts()

In [None]:
sns.histplot(data=df, x='Bmi_range')

In [None]:
df.groupby('Bmi_range')['expenses'].describe()

In [None]:
sns.violinplot(data=df, x='Bmi_range', y='expenses')

### BMI and smokers

In [None]:
df.groupby(['Bmi_range', 'smoker'])['expenses'].describe()

- On average, obese smokers pay about double what overweight smokers pay, and about 5 times more than a healthy non-smoker.
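
Ratios like these can be computed from the grouped means rather than read off the table by eye. The sketch below does so on made-up numbers; the values are invented, so the resulting ratios illustrate the calculation only and are not findings about the dataset.

```python
import pandas as pd

# Made-up rows; the group labels mirror Bmi_range and smoker.
toy = pd.DataFrame({
    "Bmi_range": ["obese", "obese", "over-weight", "over-weight", "healthy", "healthy"],
    "smoker": ["yes", "yes", "yes", "yes", "no", "no"],
    "expenses": [30000.0, 34000.0, 15000.0, 17000.0, 7000.0, 9000.0],
})

# Mean expenses per (BMI range, smoker) group, indexed by both keys.
means = toy.groupby(["Bmi_range", "smoker"])["expenses"].mean()

obese_smoker = means.loc[("obese", "yes")]
ratio_vs_overweight_smoker = obese_smoker / means.loc[("over-weight", "yes")]
ratio_vs_healthy_nonsmoker = obese_smoker / means.loc[("healthy", "no")]

print("obese smoker vs overweight smoker:", ratio_vs_overweight_smoker)
print("obese smoker vs healthy non-smoker:", ratio_vs_healthy_nonsmoker)
```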

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.barplot(x='Bmi_range', y='expenses', hue='smoker', data=df)


### BMI and Age

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.scatterplot(data=df, x='age', y= 'expenses', hue='Bmi_range')

### BMI, smoker & age

In [None]:
markers = {"yes": "s", "no": "X"}
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.scatterplot(data=df, x='age', y= 'expenses', hue='Bmi_range', style='smoker',markers=markers)

#### MULTIVARIATE ANALYSIS USING PAIRPLOT

In [None]:
sns.pairplot(df, 
                 markers="+",
                 diag_kind="kde",
                 kind='reg',
                 plot_kws={'line_kws':{'color':'#aec6cf'}, 
                           'scatter_kws': {'alpha': 0.7, 
                                           'color': 'green'}},
                 corner=True);

#### CHECKING OUTLIERS

In [None]:
# one panel per plotted feature; pass ax explicitly instead of mixing subplots and subplot
fig, axes = plt.subplots(1,3,figsize=(16,5))
sns.boxplot(x=df['age'],color='skyblue',ax=axes[0])
sns.boxplot(x=df['bmi'],color='hotpink',ax=axes[1])
sns.boxplot(x=df['children'],color='yellow',ax=axes[2])
plt.show()
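
By default the whiskers of a seaborn boxplot extend 1.5 times the interquartile range beyond the quartiles, and points outside are drawn as outliers. The sketch below applies the same rule numerically; the helper `iqr_outliers` and the sample values are assumptions made for this example.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up BMI values with one extreme entry.
bmi = pd.Series([22.0, 24.5, 26.0, 27.5, 29.0, 30.5, 32.0, 53.0])

outliers = iqr_outliers(bmi)
print(outliers)
```

On the real data the same helper could be called per column, for example `iqr_outliers(df['bmi'])`.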

### Correlation matrix

In [None]:
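# correlate numeric columns only; text and categorical columns are left out
cmap = sns.diverging_palette(70,20,s=50, l=40, n=6,as_cmap=True)
corrmat = df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(corrmat,cmap=cmap,annot=True,ax=ax)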