# Medical Cost Prediction


### In this project, a decision tree regression model is built to predict medical cost

### About the dataset


| Column | Description |
|:----------|:-----------|
|age|Age of the primary beneficiary|
|sex|Gender of the insurance contractor (female, male)|
|bmi|Body mass index: weight divided by height squared (kg/m^2), an objective measure of whether body weight is high or low relative to height; the ideal range is 18.5 to 24.9|
|children|Number of children covered by health insurance (number of dependents)|
|smoker|Whether the beneficiary smokes (yes, no)|
|region|The beneficiary's residential area in the US (northeast, southeast, southwest, northwest)|
|charges|Medical cost; the target the model predicts|
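
The `bmi` definition above (weight in kilograms divided by height in metres squared) can be written out as a small helper. This is only an illustrative sketch: the function name `bmi` and the example person are assumptions, not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person, not taken from the dataset.
example_bmi = bmi(weight_kg=70.0, height_m=1.75)
print(round(example_bmi, 2))
```

For this made-up person the value falls inside the 18.5 to 24.9 range quoted in the table.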


### Importing libraries


In [1]:
import numpy as np
import pandas as pd

### Load the data


In [2]:
df = pd.read_csv("insurance.csv")
df.head()

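|  | age | sex | bmi | children | smoker | region | charges |
|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |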


In [3]:
df.shape

(1338, 7)

### Check for any missing data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
df.isna().sum() 


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [6]:
# Drop the region column so it is not used as a feature
df.drop("region", axis=1, inplace=True)
df.head()

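|  | age | sex | bmi | children | smoker | charges |
|:--|:--|:--|:--|:--|:--|:--|
| 0 | 19 | female | 27.9 | 0 | yes | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | 3866.8552 |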


### Since `sex` and `smoker` are categorical, we need to encode them as numerical values using label encoding

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [8]:
# The encoder is refit for each column, so afterwards it only remembers the last mapping (smoker)
df['sex'] = le.fit_transform(df['sex'])
df['smoker'] = le.fit_transform(df['smoker'])
df.head()

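|  | age | sex | bmi | children | smoker | charges |
|:--|:--|:--|:--|:--|:--|:--|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 3866.8552 |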
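
`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `female` becomes 0 and `yes` becomes 1 in the output above. The sketch below shows one possible way to keep a separate encoder per column so each mapping can be inspected afterwards. The toy frame `toy` and the `encoders` dictionary are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same two categorical columns (values are illustrative).
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# One encoder per column, so each mapping can be inspected or inverted later.
encoders = {}
for col in ["sex", "smoker"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original labels; the position in the array is the code.
sex_classes = list(encoders["sex"].classes_)
smoker_classes = list(encoders["smoker"].classes_)
print(sex_classes, smoker_classes)
```

Keeping the fitted encoders around also allows `inverse_transform` to recover the original labels if they are needed for reporting.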


### Creating features and label

In [9]:
#split df into x and y
x = df.iloc[:,:-1]
y = df.iloc[:,-1]
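
Positional slicing works here because `charges` is the last column. As an alternative sketch, the same split can be written by column name, which keeps working if the column order changes. The frame `toy` below reuses the first two encoded rows shown earlier, and the names `x_named` and `y_named` are illustrative.

```python
import pandas as pd

# First two encoded rows from the output above, used as a tiny stand-in for df.
toy = pd.DataFrame({
    "age": [19, 18],
    "sex": [0, 1],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": [1, 0],
    "charges": [16884.924, 1725.5523],
})

# Select features and label by name instead of by position.
x_named = toy.drop(columns="charges")
y_named = toy["charges"]
print(list(x_named.columns))
```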

### Splitting data into training and test set

In [19]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)
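
To see what `test_size=0.2` means for a dataset of 1338 rows, the sketch below runs the same call on placeholder arrays of that length. The arrays `x_demo` and `y_demo` are stand-ins, not the insurance data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1338  # same row count that df.shape reports
x_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature column
y_demo = np.arange(n_rows)                 # placeholder target

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=20
)
print(len(x_tr), len(x_te))
```

Fixing `random_state` makes the split repeatable: calling the function again with the same seed returns the same rows in each set.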

### Decision Tree Regression model

In [20]:
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor()

### Fitting data to model

In [21]:
reg.fit(x_train, y_train)

DecisionTreeRegressor()
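
With default settings, `DecisionTreeRegressor()` places no limit on tree depth, so the tree can fit the training rows very closely. The following is a minimal sketch, on synthetic data rather than the insurance dataset, of how one might compare an unconstrained tree with a depth-limited one. The data, the choice of `max_depth=4`, and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the insurance dataset).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 3))
y_syn = 3 * X_syn[:, 0] + np.sin(X_syn[:, 1]) + rng.normal(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Compare an unconstrained tree (max_depth=None) with a shallow one.
results = {}
for depth in (None, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # score() returns R2; store (train R2, test R2) for each setting.
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

A large gap between the training and test R2 of the unconstrained tree is a common sign of overfitting. Whether limiting the depth helps on the insurance data would need to be checked separately.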

### Model predictions

In [22]:
# Predict the target values for the test set
y_pred = reg.predict(x_test)

### Evaluation of the Decision Tree Regression model

In [23]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)


In [24]:
print("R2 score : " ,r2)

R2 score :  0.7961722226055354
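
For reference, R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on a few made-up numbers and compares the result with `r2_score`. The arrays are illustrative and unrelated to the insurance data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.0, 8.0, 9.5])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```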


In [25]:
# Calculate the mean absolute percentage error (MAPE)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print("MAPE: ", mape)

MAPE:  28.573867882727733
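
scikit-learn also provides `mean_absolute_percentage_error` (available in recent versions), which returns a fraction rather than a percentage. The sketch below, on made-up numbers, shows that multiplying it by 100 matches the manual formula used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up values for illustration only.
y_true_mape = np.array([100.0, 200.0, 400.0])
y_pred_mape = np.array([110.0, 180.0, 400.0])

# Manual formula, as in the cell above.
mape_manual = np.mean(np.abs((y_true_mape - y_pred_mape) / y_true_mape)) * 100

# Library version returns a fraction, so scale it to a percentage.
mape_sklearn = mean_absolute_percentage_error(y_true_mape, y_pred_mape) * 100

print(mape_manual, mape_sklearn)
```

Both forms divide by the true value, so they are only meaningful when no true value is zero.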


### K-fold cross validation

In [46]:
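# Perform k-fold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)  # 5 consecutive folds; rows are not shuffled by default
scores = []
for train_index, test_index in kf.split(x):
    # Note: this overwrites the train/test variables from the earlier hold-out split
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    reg.fit(x_train, y_train)  # refit the tree on this fold's training rows
    y_pred = reg.predict(x_test)
    scores.append(r2_score(y_test, y_pred))  # R2 on the held-out fold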

In [47]:
# Print the mean and standard deviation of the k-fold cross validation R2 scores
print("Mean k-fold R2:", np.mean(scores))
print("Std k-fold R2:", np.std(scores))

Mean k-fold R2: 0.7042440955603507
Std k-fold R2: 0.032500032939541415
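
The manual loop can also be expressed with `cross_val_score`. The sketch below is one possible variant, run on synthetic data rather than the insurance dataset. It additionally shuffles the rows before splitting, which the loop above does not do. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the insurance dataset).
rng = np.random.default_rng(0)
X_cv = rng.uniform(0, 10, size=(300, 3))
y_cv = 3 * X_cv[:, 0] + rng.normal(0, 1, size=300)

# Shuffle rows before splitting; fix the seed so folds are repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score clones and refits the estimator for each fold.
cv_scores = cross_val_score(
    DecisionTreeRegressor(random_state=0), X_cv, y_cv, cv=cv, scoring="r2"
)
print(cv_scores.mean(), cv_scores.std())
```

Shuffling can matter when the rows of a file are ordered in some way, since unshuffled folds would then differ systematically from one another.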
