<a href="https://colab.research.google.com/github/Zarin-08/ETE-456/blob/main/Regression_Project_1608008_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ETE-456:** Regression Project on Medical Cost Personal Dataset


> Objective: 
 1. *Apply various regression algorithms to a real-world dataset.*

## Dataset (Medical Cost Personal Datasets)
Columns:

**age:** age of primary beneficiary

**sex:** gender of the insurance contractor (female or male)

**bmi:** body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m^2); the ideal range is 18.5 to 24.9

**children:** number of children covered by health insurance (number of dependents)

**smoker:** whether the beneficiary smokes (yes or no)

**region:** the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)

**charges:** Individual medical costs billed by health insurance

In [48]:
import warnings
warnings.filterwarnings("ignore")

### **Import the Libraries**

In [49]:
import numpy as np        
import pandas as pd     
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR 

### **Dataset**

In [50]:
# Download the data
!wget -O insurance.csv https://www.dropbox.com/s/mwgqgjbmfw0xa5p/insurance.csv?dl=0

--2021-12-20 10:29:39--  https://www.dropbox.com/s/mwgqgjbmfw0xa5p/insurance.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/mwgqgjbmfw0xa5p/insurance.csv [following]
--2021-12-20 10:29:39--  https://www.dropbox.com/s/raw/mwgqgjbmfw0xa5p/insurance.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc7e7607911a2297eb0ff48d8cb7.dl.dropboxusercontent.com/cd/0/inline/BcPYdgrPyg_h1OCjeSf39F55qCMdUPkr_IPS8wG3iTw15gaL7pIPkLFxO3uEBsyu_2TbACQPnkrIYvbMuP1r_hyTIUjij6Z0Q2HLC2-Cm_FigoSGPm7xbReuuoeIZpYnoH_Ws54rdUjIX7MUY3CPtlYk/file# [following]
--2021-12-20 10:29:40--  https://uc7e7607911a2297eb0ff48d8cb7.dl.dropboxusercontent.com/cd/0/inline/BcPYdgrPyg_h1OCjeSf39F55qCMdUPkr_IPS8wG3iTw15gaL7pIPkLFxO3uEBsyu_2TbACQPnkrIYvb

In [51]:
"""importing the dataset """

dataset = pd.read_csv('insurance.csv')
dataset

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [52]:
# Feature columns (.copy() so that later column assignments modify an independent DataFrame)
features = dataset[['age', 'bmi', 'smoker', 'children']].copy()
# Target column
target = dataset[['charges']].copy()

In [53]:
features

Unnamed: 0,age,bmi,smoker,children
0,19,27.900,yes,0
1,18,33.770,no,1
2,28,33.000,no,3
3,33,22.705,no,0
4,32,28.880,no,0
...,...,...,...,...
1333,50,30.970,no,3
1334,18,31.920,no,0
1335,18,36.850,no,0
1336,21,25.800,no,0


### **Label Encoding**

In [54]:
from sklearn.preprocessing import LabelEncoder

In [55]:
labelencoder_f = LabelEncoder()
# the smoker column is encoded as numeric values (no -> 0, yes -> 1)
features['smoker'] = labelencoder_f.fit_transform(features['smoker'])

In [56]:
features

Unnamed: 0,age,bmi,smoker,children
0,19,27.900,1,0
1,18,33.770,0,1
2,28,33.000,0,3
3,33,22.705,0,0
4,32,28.880,0,0
...,...,...,...,...
1333,50,30.970,0,3
1334,18,31.920,0,0
1335,18,36.850,0,0
1336,21,25.800,0,0
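
As a minimal sketch of what the encoding step does: `LabelEncoder` sorts the distinct labels and numbers them from 0, which is why `no` becomes 0 and `yes` becomes 1 in the table above. The list `smoker_demo` below is made up for illustration and is not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for features['smoker']
smoker_demo = ["yes", "no", "no", "yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(smoker_demo)

# classes_ lists the labels in sorted order; the position is the integer code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels
print(list(encoder.inverse_transform(codes)))
```

Keeping the fitted encoder around is what makes it possible to translate the codes back to labels later.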


### Taking care of missing values

In [57]:
from sklearn.impute import SimpleImputer

In [58]:
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # replaces each NaN with the mean of its column
imputer = imputer.fit(features[['age', 'bmi', 'children','smoker']])

In [59]:
features[['age', 'bmi', 'children','smoker']]= imputer.transform(features[['age', 'bmi', 'children','smoker']])

In [60]:
features

Unnamed: 0,age,bmi,smoker,children
0,19.0,27.900,1.0,0.0
1,18.0,33.770,0.0,1.0
2,28.0,33.000,0.0,3.0
3,33.0,22.705,0.0,0.0
4,32.0,28.880,0.0,0.0
...,...,...,...,...
1333,50.0,30.970,0.0,3.0
1334,18.0,31.920,0.0,0.0
1335,18.0,36.850,0.0,0.0
1336,21.0,25.800,0.0,0.0


In [61]:
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # replaces each NaN with the mean of its column
imputer = imputer.fit(target[['charges']])

In [62]:
target[['charges']]= imputer.transform(target[['charges']])

In [63]:
target

Unnamed: 0,charges
0,16884.92400
1,1725.55230
2,4449.46200
3,21984.47061
4,3866.85520
...,...
1333,10600.54830
1334,2205.98080
1335,1629.83350
1336,2007.94500


### **Splitting Dataset**

In [64]:
from sklearn.model_selection import train_test_split

In [65]:
"""Splitting the dataset into a training set and a test set"""

X_train,X_test,y_train,y_test=train_test_split(features,target,test_size = 0.2,random_state = 0)
# random_state = 0 fixes the shuffle so that the same split is produced on every run

In [66]:
print(X_train.shape)
print(X_test.shape)

(1070, 4)
(268, 4)
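
With `test_size = 0.2`, scikit-learn rounds the test-set size up, so the 1338 rows (1070 + 268, per the shapes above) give ceil(0.2 × 1338) = 268 test rows. The sketch below reproduces those shapes and checks that a fixed `random_state` yields the same split twice. The arrays are placeholders with the same shape as the data, not the data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and feature columns as the notebook data
X_demo = np.arange(1338 * 4).reshape(1338, 4)
y_demo = np.arange(1338)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the split exactly
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
same_split = np.array_equal(X_te, X_te2) and np.array_equal(y_te, y_te2)
print(same_split)
```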


### **Feature Scaling**

In [67]:
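from sklearn.preprocessing import StandardScaler
# Note: X_sc is created but never applied, so the features keep their original units;
# only the target (charges) is standardized below.
X_sc = StandardScaler()
y_sc = StandardScaler()
# Fit on the training target only, then reuse the same mean and std for the test target
y_train = y_sc.fit_transform(y_train[['charges']])
y_test = y_sc.transform(y_test[['charges']])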
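
Because the target is standardized here, every error reported later is measured in standard deviations of the training charges rather than in the original charge units. The sketch below, on made-up numbers, shows how `inverse_transform` maps standardized values back. The names `y_train_demo` and `y_test_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical charges, shaped as single columns like y_train[['charges']]
y_train_demo = np.array([[1000.0], [2000.0], [3000.0], [6000.0]])
y_test_demo = np.array([[3000.0], [9000.0]])

scaler = StandardScaler()
y_train_std = scaler.fit_transform(y_train_demo)  # learns mean and std from the training data only
y_test_std = scaler.transform(y_test_demo)        # reuses the training mean and std

# Standardized training values have mean 0 and standard deviation 1
print(y_train_std.mean(), y_train_std.std())

# inverse_transform recovers the original units
y_test_back = scaler.inverse_transform(y_test_std)
print(y_test_back.ravel())
```

Under the same assumption, a mean absolute error of `m` in standardized units corresponds to `m * scaler.scale_[0]` in original units.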

### Different types of Regression Algorithms

1. Linear Regression (Univariate or Multivariate)
2. Support Vector Regression
3. Decision Tree Regression
4. Random Forest Regression

### **Linear Regression**

In [68]:
from sklearn.linear_model import LinearRegression


regressor = LinearRegression()


regressor.fit(X_train,y_train)

LinearRegression()

In [69]:
# predicting the Test set Results
y_pred = regressor.predict(X_test)

In [70]:
regressor.score(X_train,y_train)

0.7361379262990395
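
For scikit-learn regressors, `score` returns the coefficient of determination (R-squared) on the data passed to it, so the value above is the R-squared on the training set. A sketch on synthetic data (the arrays and coefficients below are invented for illustration) confirms that `score` and `r2_score` agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: four features and a noisy linear target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# score(X, y) is the same quantity as r2_score(y, model.predict(X))
score = model.score(X_demo, y_demo)
r2 = r2_score(y_demo, model.predict(X_demo))
print(score, r2)
```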

In [71]:
y_test

array([[-2.90360755e-01],
       [-3.88647200e-01],
       [ 2.71438394e+00],
       [-2.09721124e-02],
       [-2.97065319e-01],
       [-7.26671304e-01],
       [-9.18940714e-01],
       [-1.47361673e-01],
       [-4.73043767e-01],
       [-6.49444142e-01],
       [-5.38531882e-01],
       [-2.26101189e-01],
       [-4.89698454e-01],
       [-7.52999425e-01],
       [ 4.26736889e-01],
       [-2.08671402e-01],
       [-5.65894545e-02],
       [-8.11006298e-01],
       [-5.63185752e-01],
       [ 1.69328373e+00],
       [ 8.99164540e-01],
       [-4.65863314e-02],
       [ 8.22176839e-01],
       [ 8.23835038e-01],
       [-9.62666851e-01],
       [-7.12701491e-01],
       [-7.90788753e-01],
       [-4.60891484e-01],
       [-7.88784629e-01],
       [-3.99857588e-01],
       [-4.29404684e-01],
       [ 2.98733744e+00],
       [-1.85261566e-02],
       [ 6.20458841e-01],
       [ 1.14477957e-01],
       [-7.56972780e-01],
       [-4.05394683e-01],
       [ 3.17310604e+00],
       [ 2.2

In [72]:
X_test

Unnamed: 0,age,bmi,smoker,children
578,52.0,30.200,0.0,1.0
610,47.0,29.370,0.0,1.0
569,48.0,40.565,1.0,2.0
1034,61.0,38.380,0.0,0.0
198,51.0,18.050,0.0,0.0
...,...,...,...,...
1084,62.0,30.495,0.0,2.0
726,41.0,28.405,0.0,1.0
1132,57.0,40.280,0.0,0.0
725,30.0,39.050,1.0,3.0


In [73]:
y_pred

array([[-0.1453172 ],
       [-0.27401501],
       [ 2.05297905],
       [ 0.22941353],
       [-0.52791336],
       [-0.80088618],
       [-1.0093192 ],
       [ 0.07092162],
       [-0.37125691],
       [-0.49742414],
       [-0.75660804],
       [-0.2585217 ],
       [-0.34500123],
       [-0.74358949],
       [ 1.21081013],
       [-0.1706158 ],
       [-0.16836915],
       [-0.63765533],
       [-0.43265334],
       [ 1.12025665],
       [ 1.69458544],
       [ 0.07275264],
       [-0.16320059],
       [ 1.61876017],
       [-0.72451664],
       [-0.38015231],
       [-1.03803848],
       [-0.25395931],
       [-0.7478871 ],
       [-0.25052165],
       [-0.36936925],
       [ 2.26665788],
       [ 0.22446636],
       [ 0.07272686],
       [ 0.93324406],
       [-0.71053946],
       [-0.02371172],
       [ 1.47960298],
       [ 1.67402796],
       [-0.77969103],
       [-0.79899852],
       [-0.73156623],
       [ 1.41038016],
       [ 2.16925858],
       [ 1.24274931],
       [-0

### **Evaluation Metrics**

1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. R-Squared (coefficient of determination)
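
The three metrics listed above can be written out directly with NumPy. This is a sketch on a small made-up pair of vectors (`y_true` and `y_hat` are hypothetical), intended to show the definitions rather than to replace the `sklearn.metrics` functions used in the notebook.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat

# MAE: mean of the absolute residuals
mae = np.mean(np.abs(residuals))
# MSE: mean of the squared residuals
mse = np.mean(residuals ** 2)
# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, r2)
```

Lower is better for MAE and MSE, while a higher R-squared (at most 1) is better.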

In [74]:
from sklearn.metrics import mean_absolute_error

# MAE

mean_absolute_error(y_test, y_pred)

0.3291476588325616

In [75]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_pred)

0.22440240237700151

In [76]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)

0.7978274606303823

### **Support Vector Regression**

In [79]:
# Fitting SVR to the dataset
from sklearn.svm import SVR 

regressor = SVR(kernel = 'linear')
regressor.fit(X_train,y_train)

SVR(kernel='linear')
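
`SVR.fit` expects a one-dimensional target, whereas `y_train` is a column of shape `(n, 1)` after scaling. scikit-learn flattens such a column itself and emits a `DataConversionWarning`, which this notebook silences with `warnings.filterwarnings("ignore")`. A sketch, on synthetic data, of passing a flattened target explicitly with `ravel()` (all names and numbers below are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic features and a column-vector target, mimicking the shape of y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo @ np.array([1.0, 0.5, 2.0, 0.1])).reshape(-1, 1)
print(y_demo.shape)

svr = SVR(kernel="linear")
# ravel() turns the (n, 1) column into the (n,) vector that fit expects
svr.fit(X_demo, y_demo.ravel())

pred = svr.predict(X_demo)
print(pred.shape)
```

The predictions come back as a one-dimensional array, which the metric functions accept alongside a column-shaped `y_test`.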

In [80]:
y_pred = regressor.predict(X_test)

In [81]:
regressor.score(X_train,y_train)

0.6785757829098201

In [82]:
r2_score(y_test, y_pred)

0.7660151179057306

In [83]:
mean_absolute_error(y_test, y_pred)

0.2654285114234701

In [84]:
mean_squared_error(y_test, y_pred)

0.25971266832563783

### **Decision Tree Regression**

In [85]:
from sklearn.tree import DecisionTreeRegressor


regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train,y_train) 


DecisionTreeRegressor(random_state=0)

In [86]:
y_pred = regressor.predict(X_test)
r2_score(y_test, y_pred)

0.7328218425585529

In [87]:
regressor.score(X_train,y_train)

0.9982932582589356

In [88]:
mean_absolute_error(y_test, y_pred)

0.2471217344359763

In [89]:
mean_squared_error(y_test, y_pred)

0.29655570721654356

### **Random Forest Regression**

In [90]:
# Fitting the Random Forest Regression with the dataset
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators = 10,random_state = 0) # n_estimators is the number of decision trees
regressor.fit(X_train,y_train) 

RandomForestRegressor(n_estimators=10, random_state=0)

In [91]:
y_pred = regressor.predict(X_test)
r2_score(y_test, y_pred)

0.8599430915361103

In [92]:
regressor.score(X_train,y_train)

0.9650281786414272

In [93]:
mean_absolute_error(y_test, y_pred)

0.2313787994014181

In [94]:
mean_squared_error(y_test, y_pred)

0.15545685297711503

## Result Analysis

In this project, four regression algorithms were trained on the dataset and compared using the evaluation metrics described earlier. Because the target was standardized before training, the error values are in standardized units rather than in the original charge units. The results for each algorithm are given below:

### Linear Regression
1. Mean Absolute Error        - 0.3291
2. Mean Squared Error         - 0.2244
3. R-Squared (test set)       - 0.7978
4. R-Squared (training set)   - 0.7361

### Support Vector Regression
1. Mean Absolute Error        - 0.2654
2. Mean Squared Error         - 0.2597
3. R-Squared (test set)       - 0.7660
4. R-Squared (training set)   - 0.6786

### Decision Tree Regression
1. Mean Absolute Error        - 0.2471
2. Mean Squared Error         - 0.2966
3. R-Squared (test set)       - 0.7328
4. R-Squared (training set)   - 0.9983

### Random Forest Regression
1. Mean Absolute Error        - 0.2314
2. Mean Squared Error         - 0.1555
3. R-Squared (test set)       - 0.8599
4. R-Squared (training set)   - 0.9650
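
The comparison can also be done programmatically. The sketch below copies the values from the cell outputs earlier in the notebook (rounded to four decimals), picks the model with the highest test R-squared, and computes the gap between training and test R-squared. Here `r2_train` is the value returned by `regressor.score(X_train, y_train)`; the dictionary layout and key names are assumptions made for this illustration.

```python
# Metric values taken from the cell outputs above, rounded to four decimals
results = {
    "Linear Regression": {"mae": 0.3291, "mse": 0.2244, "r2_test": 0.7978, "r2_train": 0.7361},
    "Support Vector Regression": {"mae": 0.2654, "mse": 0.2597, "r2_test": 0.7660, "r2_train": 0.6786},
    "Decision Tree": {"mae": 0.2471, "mse": 0.2966, "r2_test": 0.7328, "r2_train": 0.9983},
    "Random Forest": {"mae": 0.2314, "mse": 0.1555, "r2_test": 0.8599, "r2_train": 0.9650},
}

# Model with the highest R-squared on the test set
best_by_r2 = max(results, key=lambda name: results[name]["r2_test"])

# A large positive gap between training and test R-squared hints at overfitting
gaps = {name: round(m["r2_train"] - m["r2_test"], 4) for name, m in results.items()}
largest_gap = max(gaps, key=gaps.get)

print(best_by_r2)
print(gaps)
print(largest_gap)
```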

## Discussion
In this experiment, regression was performed on the **Medical Cost Personal Dataset** using four regression algorithms: **Linear Regression, Support Vector Regression, Decision Tree Regression and Random Forest Regression.**
First, the libraries were imported and the dataset was retrieved from storage. Preprocessing consisted of **encoding, handling of missing data and scaling** (in the code, only the target `charges` is standardized; the features are left in their original units).
The **charges** column was taken as the target and the **age, bmi, children and smoker** columns were taken as features. The dataset was then split into training and test sets, with 80% of the rows used for training.
After each regressor was fitted, the evaluation metrics were computed on the test set. Random Forest Regression performs best on all three: it has the highest test R-squared (0.8599) and the lowest mean absolute error (0.2314) and mean squared error (0.1555). The Decision Tree's training R-squared of 0.9983 against a test R-squared of 0.7328 suggests that it overfits the training data.