# Decision Tree Regression

### Importing libraries

We start by importing numpy, pandas, matplotlib and scikit-learn. pandas and numpy are used to load and manipulate the dataset, matplotlib to plot graphs, and scikit-learn to split the data, scale it and train the model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D

### Loading the dataset

In [6]:
data = pd.read_csv("Medical_insurance.csv")

data.head()

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [7]:
data_missing = data.isnull().sum()
data_missing

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [8]:
data.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

### Data Preprocessing

Convert categorical features: the dtypes above show that sex, smoker and region are stored as text (object). scikit-learn models expect numeric input, so we one-hot encode these columns with pd.get_dummies. Setting drop_first=True drops the first category of each column, which avoids a redundant dummy column (for example, smoker_yes alone already tells us whether someone smokes).

In [9]:
data = pd.get_dummies(data, columns=["sex", "smoker", "region"], drop_first=True)
print(data.head())

   age     bmi  children      charges  sex_male  smoker_yes  region_northwest  \
0   19  27.900         0  16884.92400     False        True             False   
1   18  33.770         1   1725.55230      True       False             False   
2   28  33.000         3   4449.46200      True       False             False   
3   33  22.705         0  21984.47061      True       False              True   
4   32  28.880         0   3866.85520      True       False              True   

   region_southeast  region_southwest  
0             False              True  
1              True             False  
2              True             False  
3             False             False  
4             False             False  
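
To make the effect of drop_first concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only mirror the column names of the dataset above; it shows that the alphabetically first category of each column is the one that gets dropped.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the categorical columns above
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# drop_first=True removes the first category of each encoded column
# ("no" for smoker, "northwest" for region), which becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=["smoker", "region"], drop_first=True)
print(encoded.columns.tolist())
```

A row with all region dummies set to False therefore belongs to the dropped baseline category, here northwest.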


Next we separate the features from the target. X holds every column except charges, and y holds charges, the variable we want to predict.

In [10]:
X = data.drop(columns=["charges"])
y = data["charges"]

### Split the data for train and test
Now we are ready to separate the training data from the test data. I set aside 80% of the dataset for training and 20% for testing. With 2,772 entries, that gives about 2,217 rows for training and 555 for testing (scikit-learn rounds the test share up). An 80/20 split is a common default, but you can adjust it based on the results you get. Fixing random_state makes the split reproducible.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
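
The exact row counts depend on how scikit-learn rounds a fractional test size, so it can be worth checking them directly. The sketch below uses a placeholder array (X_demo and y_demo are hypothetical stand-ins, not the insurance data) with the same number of rows as quoted in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 2772  # row count quoted in the text

# Placeholder data with the right number of rows; the values do not matter here
X_demo = np.arange(n_samples).reshape(-1, 1)
y_demo = np.arange(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Print how many rows ended up in each part
print(len(X_tr), len(X_te))
```

On the real data, X_train.shape and X_test.shape give the same information.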

### Standardize numerical features
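Another common step is scaling, because the features live on very different ranges: age spans decades and bmi is in the tens, while children is a small count and the dummy columns are 0 or 1. StandardScaler rescales each feature to zero mean and unit variance. We fit it on the training set only and apply the same transformation to the test set, so no information from the test data leaks into training. The target, charges, is not scaled. Decision trees split on thresholds and are not sensitive to feature scale, so this step is optional for this model, but it keeps the workflow consistent with scale-sensitive models.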

In [12]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
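
As a small illustration of what fit_transform and transform do, the sketch below scales a few hypothetical rows (columns standing in for age, bmi and children). The names demo_scaler, train and test are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: columns stand in for age, bmi, children
train = np.array([
    [19, 27.9, 0],
    [18, 33.77, 1],
    [28, 33.0, 3],
    [33, 22.705, 0],
])
test = np.array([[32, 28.88, 0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training mean and std

# After scaling, each training column has mean close to 0 and std close to 1
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```

The test row is scaled with the training statistics, so its values are generally not centered on zero.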

### Train the model

In [18]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

tree_model = DecisionTreeRegressor()
tree_model.fit(X_train, y_train)
y_pred2 = tree_model.predict(X_test)

In [19]:
mae = mean_absolute_error(y_test, y_pred2)
print(f'mean absolute error is: {mae}')

mean absolute error is: 603.9758176


In [20]:
r2_score(y_test, y_pred2)*100 

94.83169057239432
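
To show what these two metrics measure, here is a minimal sketch that computes them by hand and compares them with scikit-learn. The arrays y_true and y_hat are hypothetical values, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true charges and predictions, for illustration only
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])

# MAE: average absolute difference between truth and prediction
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

MAE is expressed in the units of the target (dollars here), while R² is unitless; multiplying it by 100 simply reports it as a percentage.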

### Making a new prediction

Now that the model is trained and evaluated, suppose you want to predict the charges for a new patient. Specifically, the patient is a 38-year-old male non-smoker from the southwest region, with a BMI of 22.5 and no children.

We will create a new dataframe with this data and call it new_patient.

In [21]:
new_patient = pd.DataFrame({
    'age': [38],
    'sex': ['male'],
    'bmi': [22.500],  
    'children': [0],
    'smoker': ['no'],
    'region':['southwest']
})
new_patient


   age   sex   bmi  children smoker     region
0   38  male  22.5         0     no  southwest


In [22]:
new_patient_encoded = pd.get_dummies(new_patient, columns=["sex", "smoker", "region"])
new_patient_encoded

   age   bmi  children  sex_male  smoker_no  region_southwest
0   38  22.5         0      True       True              True


In [30]:
required_columns = ['age','bmi','children','sex_male','smoker_yes','region_northwest','region_southeast','region_southwest']
for col in required_columns:
    if col not in new_patient_encoded.columns:
        new_patient_encoded[col] = 0


In [None]:
new_patient_encoded = new_patient_encoded[required_columns]
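
The loop and the column selection above add the dummy columns the single row lacks and drop the ones the model never saw (such as smoker_no). As a possible alternative, the same alignment can be sketched in one step with DataFrame.reindex. The variable names encoded and aligned are hypothetical; the column list and patient values are copied from the cells above.

```python
import pandas as pd

required_columns = ['age', 'bmi', 'children', 'sex_male', 'smoker_yes',
                    'region_northwest', 'region_southeast', 'region_southwest']

new_patient = pd.DataFrame({
    'age': [38],
    'sex': ['male'],
    'bmi': [22.5],
    'children': [0],
    'smoker': ['no'],
    'region': ['southwest'],
})

encoded = pd.get_dummies(new_patient, columns=["sex", "smoker", "region"])

# reindex keeps only the required columns, in that order,
# and fills any column missing from the row with 0
aligned = encoded.reindex(columns=required_columns, fill_value=0)
print(aligned)
```

Either approach works; the point is that the new row must have exactly the columns used in training, in the same order.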

In [32]:
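# The model was trained on scaled features, so scale the new row with the same fitted scaler
new_patient_pred = tree_model.predict(scaler.transform(new_patient_encoded))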
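
Because the model was trained on scaled features, any new row has to go through the same scaler before predict. One way to make that hard to forget is to bundle both steps in a scikit-learn Pipeline. The sketch below is not part of the notebook above: it uses synthetic data (X_syn, y_syn) and a hypothetical variable named model purely to show the pattern.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = 2.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=200)

# The pipeline fits the scaler and the tree together on the training data
model = make_pipeline(StandardScaler(), DecisionTreeRegressor(random_state=0))
model.fit(X_syn, y_syn)

# predict applies the fitted scaler to the new row automatically
x_new = np.array([[0.5, -1.0, 0.3]])
print(model.predict(x_new))
```

With this pattern, model.predict accepts raw (unscaled) feature rows, and the scaling fitted on the training data is applied internally.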

