## Performing Exploratory Data Analysis (EDA) on `MEDICAL INSURANCE DATASET`

1. Importing Libraries and Reading the Dataset

2. Feature Engineering 

2.1 Numerical Approach

2.2 Visual Approach

2.2.1 Univariate Analysis 

2.2.2 Multivariate Analysis 

3. Preprocessing the data

3.1 Data cleaning

3.2 Feature Transformation

3.3 Feature Scaling (Normalization)

4. Conclusion

### 1. Importing Libraries & Reading the Dataset

In [1]:
#Import Library
import warnings
warnings.filterwarnings('ignore')

#DataFrame Library
import pandas as pd
import numpy as np

#Visualization Library
import matplotlib.pyplot as plt
import seaborn as sns

#Modeling Library
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

In [2]:
# Visualization Plot Settings
sns.set(rc={'figure.figsize':(15,5)})
sns.set_style('whitegrid')
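sns.set_palette('viridis')  # color_palette() only returns the palette without applying it
# matplotlib 3.6 renamed 'seaborn-bright' to 'seaborn-v0_8-bright'; pick whichever exists
style = 'seaborn-v0_8-bright' if 'seaborn-v0_8-bright' in plt.style.available else 'seaborn-bright'
plt.style.use(style)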

In [3]:
# Read the dataset (relative path; adjust it to wherever insurance.csv is stored)
df = pd.read_csv('data/insurance.csv')
df

### 2. Feature Engineering

In [None]:
# Feature Engineering weight_status
# Note: the WHO underweight cutoff is a BMI of 18.5; this notebook uses 18
df['weight_status'] = np.where(df['bmi'] < 18.000, 'underweight',
                               np.where(df['bmi'] < 25.000, 'normal',
                                       np.where(df['bmi'] < 30.000, 'overweight', 'obese')))
df
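The nested `np.where` calls can also be written as a single binning step. The following is a minimal sketch, assuming the same thresholds (18, 25, 30) as above; the `sample` frame is hypothetical and only stands in for the real `bmi` column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real notebook uses the `bmi` column of insurance.csv
sample = pd.DataFrame({'bmi': [17.2, 18.0, 24.9, 25.0, 29.9, 30.0, 41.3]})

# right=False makes each bin [lower, upper), matching the strict `<` comparisons above
sample['weight_status'] = pd.cut(
    sample['bmi'],
    bins=[-np.inf, 18, 25, 30, np.inf],
    labels=['underweight', 'normal', 'overweight', 'obese'],
    right=False,
)
print(sample)
```

Unlike `np.where`, `pd.cut` returns an ordered categorical column, so the categories sort from underweight to obese in group-bys and plots.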

In [None]:
df.info()

In [None]:
# Separating Categorical & Numerical Values
cats = ['sex', 'smoker', 'region', 'weight_status']
nums = ['age', 'bmi', 'children', 'charges']

#### 2.1 Numeric Approach - Describing the data

In [None]:
#Sampling
df.sample(10)

In [None]:
## Describe Categorical Values
df[cats].describe()

In [None]:
## Describe Numerical Values
df[nums].describe()

In [None]:
## Check Smoker Feature
df.groupby(['smoker'])['charges'].count()

In [None]:
## Check Sex Feature
df.groupby(['sex'])['charges'].count()

In [None]:
## Check Region Feature
df.groupby(['region'])['charges'].count()

In [None]:
#Check Children Feature
df.groupby(['children'])['charges'].count()

In [None]:
## Check Weight_Status Feature
df.groupby(['weight_status'])['charges'].count()
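The per-category counts above can also be read as shares of the whole dataset. Here is a minimal sketch using `value_counts`; the `sample` frame and the `category_summary` helper are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    'smoker': ['no', 'no', 'yes', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest'],
})

def category_summary(frame, column):
    """Return counts and percentage shares for one categorical column."""
    counts = frame[column].value_counts()
    share = frame[column].value_counts(normalize=True).mul(100).round(1)
    return pd.DataFrame({'count': counts, 'percent': share})

summary = category_summary(sample, 'smoker')
print(summary)
```

Calling `category_summary(df, col)` for each entry in `cats` would give the same counts as the `groupby` cells, plus the percentage column.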

#### 2.2 Visual Approach

#### 1. Univariate Analysis

#### 1.1 Boxplot

In [None]:
for i in range(0, len(nums)):
    plt.subplot(1, len(nums), i + 1)
    sns.boxplot(y = df[nums[i]])
    plt.tight_layout()

#### 1.2 Kernel Density Estimate (KDE) plot

In [None]:
plt.figure(figsize=(15, 3))
for i in range(0, len(nums)):
    plt.subplot(1, len(nums), i + 1)
    sns.kdeplot(x = df[nums[i]])
    plt.tight_layout()

#### 1.3 Histplot

In [None]:
for i in range(0, len(nums)):
    plt.subplot(1, len(nums), i +1)
    sns.histplot(data = df[nums[i]])
    plt.tight_layout()

#### 1.4 Countplot

In [None]:
for i in range(0, len(cats)):
    plt.subplot(1, len(cats), i + 1)
    sns.countplot(x = df[cats[i]])  # pass the column as x; recent seaborn versions treat the first positional argument as data
    plt.tight_layout()
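The subplot loops in this section can also be written against explicit `Axes` objects, which avoids relying on matplotlib's implicit "current axes". This is a sketch under assumed data: `sample` and `sample_cats` are hypothetical stand-ins for `df` and `cats`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample rows standing in for the insurance DataFrame
sample = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'smoker': ['no', 'yes', 'no', 'no'],
})
sample_cats = ['sex', 'smoker']

# One figure with one Axes per categorical column
fig, axes = plt.subplots(1, len(sample_cats), figsize=(10, 3))
for ax, col in zip(axes, sample_cats):
    sns.countplot(x=col, data=sample, ax=ax)
    ax.set_title(col)

# Adjust spacing once, after all panels are drawn
fig.tight_layout()
```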

#### 2. Multivariate Analysis

#### 2.1 Heatmap

In [None]:
#Heatmap Correlation
plt.figure(figsize=(8, 8))
# Restrict to numeric columns; pandas 2.0+ raises an error if corr() meets string columns
sns.heatmap(df[nums].corr(), annot=True, fmt='.2f')

#### 2.2 Pairplot

In [None]:
sns.pairplot(df, diag_kind='kde')

#### 2.3 Scatterplot

In [None]:
sns.scatterplot(x = 'bmi', y = 'charges', hue = 'smoker', data = df)

In [None]:
sns.scatterplot(x = 'age', y = 'charges', hue = 'smoker', data = df)

#### 2.4 Barplot

In [None]:
#Charges per Region
region_charges = df.groupby(['region']).agg({'charges' : 'sum'}).reset_index()
region_charges.sort_values(['charges'], ascending = False)

In [None]:
#Barplot Charges per Region
sns.barplot(x = 'region', y ='charges', data = region_charges)

In [None]:
#Barplot with Hue Smoker
sns.barplot(x = 'region', y = 'charges', hue = 'smoker', data = df)

In [None]:
#Barplot with Hue Sex
sns.barplot(x = 'region', y = 'charges', hue = 'sex', data = df)

In [None]:
#Count Smoker with Sex Hue
smoker_sex = df.groupby(['smoker', 'sex']).agg({'charges' : 'count'}).reset_index()
smoker_sex.columns = ['smoker', 'sex', 'count']
smoker_sex.sort_values(['smoker', 'count'], ascending = False)

In [None]:
#Barplot Visualization
sns.barplot(x = 'sex', y = 'count', hue = 'smoker', data = smoker_sex)

In [None]:
#Barplot Charges per Children with Hue Sex
sns.barplot(x = 'children', y = 'charges', hue = 'sex', data = df)

In [None]:
#Barplot Charges per Children with Hue Smoker
sns.barplot(x = 'children', y = 'charges', hue = 'smoker', data = df)

In [None]:
#Count Weight_Status with Hue Smoker
smoker_status = df.groupby(['weight_status', 'smoker']).agg({'charges' : 'count'}).reset_index()
smoker_status.columns = ('weight_status', 'smoker', 'count')
smoker_status

In [None]:
#Barplot Smoker_Status
sns.barplot(x = 'weight_status', y = 'count', hue = 'smoker', data = smoker_status)

In [None]:
#Total Charges per Weight_Status and Smoker
smoker_status = df.groupby(['weight_status', 'smoker']).agg({'charges' : 'sum'}).reset_index()
smoker_status.columns = ('weight_status', 'smoker', 'charges')
smoker_status 

In [None]:
#Count per Weight_Status and Smoker, with the total per Weight_Status
smoker_status_charges = df.groupby(['weight_status', 'smoker']).agg({'charges' : 'count'}).reset_index()
smoker_status_charges.columns = ['weight_status', 'smoker', 'count']
smoker_status_charges['total'] = smoker_status_charges.groupby(['weight_status'])['count'].transform('sum')
smoker_status_charges

In [None]:
#Barplot Smoker_Status_Charges
sns.barplot(x = 'weight_status', y = 'count', hue = 'smoker', data = smoker_status_charges)

In [None]:
# Merge Columns
condition_cost = smoker_status.merge(smoker_status_charges,
                                     on = ['weight_status', 'smoker'],
                                     how = 'inner')
condition_cost

In [None]:
# Average Charges
condition_cost['avg_charges'] = condition_cost['charges'] / condition_cost['count']
condition_cost

In [None]:
# Average Charges Barplot
sns.barplot(x = 'weight_status', y = 'avg_charges', hue = 'smoker', data = condition_cost)

People who smoke pay more for insurance than non-smokers, and obese people who smoke have the highest average medical charges.
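The average charge per group can also be computed in one step, without merging separate sum and count tables. The sketch below uses named aggregation on hypothetical rows; the values in `sample` are made up and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from insurance.csv
sample = pd.DataFrame({
    'weight_status': ['obese', 'obese', 'normal', 'normal', 'obese'],
    'smoker': ['yes', 'no', 'yes', 'no', 'yes'],
    'charges': [40000.0, 9000.0, 20000.0, 5000.0, 44000.0],
})

# Named aggregation yields the total, the group size, and the mean in one pass
condition_cost = (
    sample.groupby(['weight_status', 'smoker'])
    .agg(charges=('charges', 'sum'),
         count=('charges', 'count'),
         avg_charges=('charges', 'mean'))
    .reset_index()
)
print(condition_cost)
```

The resulting frame has the same columns that the average-charges barplot reads (`weight_status`, `smoker`, `avg_charges`).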

### 3. Preprocessing the data (Data cleaning, Feature Transformation, Feature Scaling (Normalization))

#### 3.1 Missing Value

In [None]:
df.isna().sum()

There are no missing values.

#### 3.2 Duplicate Value

In [None]:
df.duplicated().sum()

In [None]:
#Remove Duplicate Value
df = df.drop_duplicates()

#Check Duplicate Value
df.duplicated().sum()
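Before dropping duplicates, it can help to look at the rows involved. A minimal sketch on hypothetical data: `duplicated(keep=False)` marks every member of a duplicate group, whereas the default marks only the later copies.

```python
import pandas as pd

# Hypothetical rows; the second and third are exact duplicates
sample = pd.DataFrame({
    'age': [19, 23, 23, 40],
    'bmi': [27.9, 30.5, 30.5, 22.1],
    'charges': [16884.92, 1639.56, 1639.56, 7000.00],
})

# keep=False flags every member of a duplicate group, not just the later copies
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default
deduped = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), '->', len(deduped))
```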

In [None]:
# Creating a Copy DataFrame
dfori = df.copy()

#### 3.3 Feature Transformation

- Ordinal Encoding
- One-Hot Encoding
- Feature Encoding

In [None]:
# Ordinal Encoding (map() avoids the downcasting FutureWarning that replace() can raise in recent pandas)
dfori['smoker'] = dfori['smoker'].map({'yes' : 0, 'no' : 1})

# One Hot Encoding (dtype=int keeps the dummies as 0/1 instead of True/False)
sex_ori = pd.get_dummies(dfori['sex'], prefix = 'sex', dtype = int)
region_ori = pd.get_dummies(dfori['region'], prefix = 'reg', dtype = int)
status_ori = pd.get_dummies(dfori['weight_status'], prefix = 'status', dtype = int)

# Concat Feature Encoding
dfori = pd.concat([dfori, sex_ori, region_ori, status_ori], axis=1)

In [None]:
# Drop Encoded Feature
dfori = dfori.drop(columns = ['sex', 'region', 'weight_status'])

# Check df
dfori.head()

In [None]:
dfori.info()
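The encode, concatenate, and drop steps above can be collapsed into one `pd.get_dummies` call on the whole frame. The following is a sketch on hypothetical rows; it assumes the same prefixes (`sex`, `reg`, `status`) as the cells above.

```python
import pandas as pd

# Hypothetical rows standing in for dfori
sample = pd.DataFrame({
    'age': [19, 33, 45],
    'sex': ['female', 'male', 'male'],
    'region': ['southwest', 'northwest', 'southwest'],
    'weight_status': ['overweight', 'normal', 'obese'],
})

# `columns` selects what to encode; the originals are dropped automatically
encoded = pd.get_dummies(
    sample,
    columns=['sex', 'region', 'weight_status'],
    prefix={'sex': 'sex', 'region': 'reg', 'weight_status': 'status'},
    dtype=int,
)
print(encoded.columns.tolist())
```

Passing `drop_first=True` would remove one dummy per feature, which may be worth considering for linear models, although the notebook keeps every category.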

#### 3.4 Normalization (Feature Scaling)

In [None]:
# Grouping Features for Normalization
norm_ori = dfori.drop(columns = ['charges']).columns
print(norm_ori)

In [None]:
# Normalization Features (MinMaxScaler scales each column independently, so one call covers all of them)
dfori[norm_ori] = MinMaxScaler().fit_transform(dfori[norm_ori])

In [None]:
dfori.sample(10)
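Min-max normalization maps each column to the range [0, 1] using (x - min) / (max - min). As a quick check, the sketch below compares that formula with `MinMaxScaler` on hypothetical values; `sample` is not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two numeric features
sample = pd.DataFrame({'age': [18, 30, 64], 'bmi': [16.0, 28.0, 40.0]})

# Library result: each column is scaled independently
scaled = MinMaxScaler().fit_transform(sample)

# Manual result from the min-max formula, column by column
manual = (sample - sample.min()) / (sample.max() - sample.min())

print(np.allclose(scaled, manual.to_numpy()))
```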

### 4. Conclusion

The insights drawn by performing `Exploratory Data Analysis` (EDA) are:

- Most people are non-smokers, and most are obese
- The `sex` and `region` features are almost evenly balanced
- People who smoke and have a higher BMI have higher medical charges
- Older people who smoke have higher charges
- Obese people who smoke have the highest average charges