## <span style="color:lightblue"> **Problem Statement 3:** </span>
I have been provided with a medical cost dataset and need to predict the individual medical costs billed by health insurance.

Dataset Description:
- age: age of the primary beneficiary
- sex: gender of the primary beneficiary (female, male)
- bmi: body mass index, an objective index of body weight (kg/m^2) computed as weight divided by
  height squared; it indicates whether weight is relatively high or low for a given height, ideally 18.5 to 24.9
- children: number of children covered by health insurance / number of dependents
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- charges: Individual medical costs billed by health insurance
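
To make the bmi feature concrete, here is a minimal sketch of the formula described above (weight in kilograms divided by height in metres squared). The function name `bmi` and the example person (70 kg, 1.75 m) are hypothetical and not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person, for illustration only: 70 kg and 1.75 m tall
example_bmi = bmi(70.0, 1.75)

# Check against the "ideal" range quoted in the dataset description
in_ideal_range = 18.5 <= example_bmi <= 24.9

print(round(example_bmi, 2), in_ideal_range)
```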

I will write Python code to perform the following tasks:
1. Load the data, check its shape, and check for null values
2. Convert categorical features to numerical values (Use One-Hot Encoding)
3. Split the dataset for training and testing
4. Train the model using sklearn - Linear Regression
5. Find the intercept and coefficient from the trained model
6. Predict the charges for the test data and evaluate the model using the R2 score and root mean squared error

In [1]:
#DataFrame manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import math

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
#Task 1: loading the data and previewing the first rows
medic_cost_df = pd.read_csv("./../Assignment_files/insurance_ass4.csv")
medic_cost_df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
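
Task 1 also asks for the shape of the data. A minimal sketch of that check, using only the five rows shown above as a stand-in for the full file (`sample_df` is a hypothetical name; on the real data the same calls would be made on `medic_cost_df`):

```python
import pandas as pd

# Stand-in built from the five rows displayed above, not the full dataset
sample_df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape

# Total number of missing cells across the whole frame
n_missing = int(sample_df.isna().sum().sum())

print(n_rows, n_cols, n_missing)
```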


In [3]:
#checking for null values
medic_cost_df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

No null values, which is good for me :)

In [4]:
#checking for duplicates
medic_cost_df.duplicated().sum()

1

In [5]:
#Dropping the duplicated data
medic_cost_df.drop_duplicates(inplace=True)
medic_cost_df.duplicated().sum()

medic_cost_df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


In [17]:
#Task 2: converting categorical features to numerical values using OneHotEncoder
cat_features = ["sex", "smoker", "region"]

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

#fit_transform returns a dense NumPy array with one column per category
encoded_data = encoder.fit_transform(medic_cost_df[cat_features])

#encoded_df gets a fresh 0..n-1 index; concat(axis=1) joins on index labels
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(cat_features))
transformed_medic = medic_cost_df.drop(cat_features, axis=1)
transformed_medic = pd.concat([transformed_medic, encoded_df], axis=1)
transformed_medic

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19.0,27.9,0.0,16884.924,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,18.0,33.77,1.0,1725.5523,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
2,28.0,33.0,3.0,4449.462,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,33.0,22.705,0.0,21984.47061,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,32.0,28.88,0.0,3866.8552,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
5,31.0,25.74,0.0,3756.6216,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
6,46.0,33.44,1.0,8240.5896,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
7,37.0,27.74,3.0,7281.5056,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
8,37.0,29.83,2.0,6406.4107,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
9,60.0,25.84,0.0,28923.13692,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [18]:
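#rows with NaN appear because drop_duplicates left a gap in medic_cost_df's index,
#so its labels no longer match encoded_df's 0..n-1 labels in the concat above
#(rows after the gap are also paired with the next row's encoded values)
rows_with_null = transformed_medic[transformed_medic.isnull().any(axis=1)]
rows = rows_with_null.index.values.tolist()

In [19]:
#dropping the rows with missing values
transformed_medic.drop(rows, axis=0, inplace=True)
transformed_medic.isna().sum()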
transformed_medic.isna().sum()

age                 0
bmi                 0
children            0
charges             0
sex_female          0
sex_male            0
smoker_no           0
smoker_yes          0
region_northeast    0
region_northwest    0
region_southeast    0
region_southwest    0
dtype: int64

In [20]:
#Task 3: splitting the dataset for training and testing
X = transformed_medic.drop(["charges"], axis=1)
y = transformed_medic["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1068, 11), (268, 11), (1068,), (268,))
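
The shapes above add up to 1336 rows, split 1068 / 268. A small sketch of where those numbers come from, assuming (as I understand scikit-learn's behaviour) that a fractional `test_size` is rounded up and the remainder goes to training. The zero-filled arrays are placeholders with the same dimensions, not the real data:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 1336, 11

# Expected sizes under the rounding assumption stated above
expected_test = math.ceil(0.2 * n_samples)
expected_train = n_samples - expected_test

# Placeholder arrays with the same dimensions as X and y
X_dummy = np.zeros((n_samples, n_features))
y_dummy = np.zeros(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=42
)

print(expected_train, expected_test)
print(X_tr.shape, X_te.shape)
```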

In [21]:
#Task 4: training the model using sklearn - Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
#R2 score on the training data
model.score(X_train, y_train)

0.22437401595456608

In [22]:
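#Task 6: predicting charges for the test data and evaluating with R2, MAE and RMSE
y_preds = model.predict(X_test)

#note: sklearn metrics expect (y_true, y_pred); MAE and RMSE are symmetric, but
#r2_score is not, so the R2 here treats the predictions as the ground truth
r2score = r2_score(y_preds, y_test)
mae = mean_absolute_error(y_preds, y_test)
rmse = math.sqrt(mean_squared_error(y_preds, y_test))
r2score, mae, rmse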

(-3.0980050088006426, 8473.22583843408, 11623.420684269717)
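
Argument order matters for the R2 score. scikit-learn's metric functions take the true values first and the predictions second. A minimal sketch on made-up numbers (`y_true` and `y_pred` below are hypothetical) showing that RMSE is unaffected by swapping the arguments while R2 changes, and can even turn negative:

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.5, 2.5, 3.0])

# RMSE only depends on the squared differences, so the order is irrelevant
rmse_correct = math.sqrt(mean_squared_error(y_true, y_pred))
rmse_swapped = math.sqrt(mean_squared_error(y_pred, y_true))

# R2 divides by the variance of the FIRST argument, so the order matters
r2_correct = r2_score(y_true, y_pred)   # 0.5
r2_swapped = r2_score(y_pred, y_true)   # -4.0

print(rmse_correct, rmse_swapped)
print(r2_correct, r2_swapped)
```

Predictions from a weak model tend to vary much less than the true values, so dividing by their variance inflates the error term.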

In [32]:
#Task 5: finding the coefficients and intercept of the trained model
model.intercept_, model.coef_

(-3550.329956785259,
 array([  248.5329455 ,   287.69346411,   588.00837038,  -535.06220552,
          535.06220552, -4627.83682329,  4627.83682329,   588.28285601,
          399.27577353,  -830.97417124,  -156.58445829]))
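
The coefficient array follows the column order of `X`, so it is easier to read once each value is paired with its feature name. A minimal sketch on a hypothetical two-feature dataset with a known linear relationship (`X_toy`, `y_toy` and `toy_model` are illustrative names, not the model trained above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features and an exactly linear target:
# charges = 100 * age + 5000 * smoker_yes + 250
X_toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "smoker_yes": [0, 1, 0, 1, 0],
})
y_toy = 100 * X_toy["age"] + 5000 * X_toy["smoker_yes"] + 250

toy_model = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(toy_model.coef_, index=X_toy.columns)

print(coef_table)
print(toy_model.intercept_)
```

On the real model the same idea would be `pd.Series(model.coef_, index=X.columns)`. In the output above, the `sex_female`/`sex_male` and `smoker_no`/`smoker_yes` coefficients are equal and opposite, and the four region coefficients sum to zero, because each one-hot group always sums to 1 and is therefore collinear with the intercept.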