<a href="https://colab.research.google.com/github/ahmadalmasri270/training-projects/blob/main/First_Model_(Practice).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Assignment:**

For this exercise, you will create, fit, and evaluate the performance of a linear regression model. The machine learning question is:


How well can the additional charges be predicted based on the age, sex, BMI, number of children, smoking habit, and region of the patient?

This is the dataset you will be using: insurance.csv

For this task, you will need to:

- Create a preprocessing object, such as a column transformer or pipeline, that will:
  - Ordinal encode any ordinal features
  - One-hot encode any nominal features
  - Scale any numeric features
- Instantiate a linear regression model
- Create a model pipeline with your preprocessor first and linear regression model last
- Fit the modeling pipeline on the training data
- Evaluate the model performance on both the training set and the test set using the R-squared score.


## 1. Load the Data

In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [18]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Coding Dojo/insurance_reg.csv')
df.head()

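|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |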


In [19]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree
from sklearn import set_config
set_config(display='diagram')

## 2. Copy the original DataFrame so it is not modified

In [20]:
df_ml = df.copy()

## 3. Check for duplicated and missing values

In [21]:
df_ml.duplicated().sum()

1

In [22]:
df_ml.drop_duplicates(inplace=True)

In [23]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1337 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB
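
The `info()` output above lists 1337 non-null values for every column, so nothing is missing. Counting nulls per column is a more direct check. The sketch below runs on a small made-up frame (`toy` is hypothetical and not taken from the insurance file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing BMI value
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", "no"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing values in the frame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

On the notebook's frame the equivalent call would be `df_ml.isna().sum()`.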


## 4. Validation Split

In [24]:
X = df_ml.drop(columns=['charges'])
y = df_ml['charges']
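
The split itself is done with `train_test_split` and `test_size=.25`, which holds out a quarter of the rows and rounds the test size up. As a rough sketch, placeholder arrays with the same number of rows as the deduplicated data (1337) show the resulting sizes. The arrays are stand-ins, not the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as df_ml (1337)
X_demo = np.arange(1337).reshape(-1, 1)
y_demo = np.arange(1337)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_te))  # 1002 335
```

The 1002 training rows agree with the category counts printed for `X_train` in the next section (for example, 519 + 483 for `sex`).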

## 5. Instantiate Column Selectors

In [25]:
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25 , random_state=42)

# check value counts for each object column
categoricals = X_train.select_dtypes(include='object')

for col in categoricals.columns:
  print(col)
  print(categoricals[col].value_counts(), '\n')

sex
male      519
female    483
Name: sex, dtype: int64 

smoker
no     797
yes    205
Name: smoker, dtype: int64 

region
southeast    269
northwest    253
southwest    251
northeast    229
Name: region, dtype: int64 
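
None of the three object columns printed above has a natural order, so all of them are treated as nominal and one-hot encoded in the cells that follow. For a column that does have an order, scikit-learn's `OrdinalEncoder` can be given the category order explicitly. The `activity` column in this sketch is hypothetical and does not exist in the insurance data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (not part of the insurance dataset)
demo = pd.DataFrame({"activity": ["low", "high", "medium", "low"]})

# Pass the categories in their intended order, lowest first
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = ordinal.fit_transform(demo[["activity"]])

print(encoded.ravel())  # [0. 2. 1. 0.]
```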



In [26]:
#  Selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [27]:
# Instantiate the StandardScaler and OneHotEncoder
scaler = StandardScaler()
# Note: scikit-learn 1.2+ renames `sparse` to `sparse_output`
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

## Create tuples of (transformer, selector) for the ColumnTransformer

cat_tuple = (ohe, cat_selector)
num_tuple = (scaler, num_selector)

In [28]:
## Create ColumnTransformer

preprocessor = make_column_transformer(cat_tuple, num_tuple)

# Modeling

Create a pipeline with
1. preprocessor
2. model


In [29]:
## Additional imports (normally all imports should be at the top of the notebook)
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_squared_error

In [30]:
## Make and fit a linear regression model
reg = LinearRegression()
reg_pipe = make_pipeline(preprocessor, reg)

reg_pipe.fit(X_train, y_train)
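
Once fitted, the pipeline applies the same preprocessing to any new rows before predicting. Because the encoder is created with `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros instead of raising an error. A minimal sketch on made-up data (the frames, values, and the name `pipe` are hypothetical):

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows (made-up values, not from the dataset)
train = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "region": pd.Series(
        ["southwest", "southeast", "southwest", "southeast"], dtype="object"
    ),
})
target = pd.Series([2000.0, 3000.0, 4000.0, 5000.0])

pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include="object")),
        (StandardScaler(), make_column_selector(dtype_include="number")),
        sparse_threshold=0,  # always return a dense array
    ),
    LinearRegression(),
)
pipe.fit(train, target)

# "northwest" never appeared during fit, so its one-hot columns are all zeros
new_row = pd.DataFrame({
    "age": [35],
    "region": pd.Series(["northwest"], dtype="object"),
})
encoded_row = pipe[0].transform(new_row)
prediction = pipe.predict(new_row)

print(encoded_row[0, :2])  # [0. 0.]
print(prediction.shape)    # (1,)
```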

### Measuring model performance - using builtin .score()

In [31]:
## Measuring model performance - using builtin .score()
print(f"Train R-Squared: { round( reg_pipe.score(X_train, y_train),3) }")
print(f"Test R-Squared: { round( reg_pipe.score(X_test, y_test),3) }")

Train R-Squared: 0.73
Test R-Squared: 0.795


### Measuring model performance - using functions from metrics

In [32]:
# get predictions for train and test data
y_hat_train = reg_pipe.predict(X_train)
y_hat_test = reg_pipe.predict(X_test)

In [33]:
## Get r-square for train vs test
print(f"Train R-Squared: { round( r2_score(y_train, y_hat_train), 3)}")
print(f"Test R-Squared: { round( r2_score(y_test, y_hat_test), 3)}")

Train R-Squared: 0.73
Test R-Squared: 0.795
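
The two approaches agree because `.score()` on a regression pipeline returns the same R-squared that `r2_score` computes: 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below works this out by hand on made-up numbers (not the notebook's predictions):

```python
# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

mean_true = sum(y_true) / len(y_true)

# Residual sum of squares: squared errors of the model
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: squared deviations from the mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))  # 0.9625
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.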


In [34]:
## Get RMSE for train vs test
rmse_train =  mean_squared_error(y_train, y_hat_train, squared=False)
rmse_test = mean_squared_error(y_test, y_hat_test, squared=False)
print(f"Train RMSE: { round(rmse_train, 2)}")
print(f"Test RMSE: { round( rmse_test, 2)}")

Train RMSE: 6098.14
Test RMSE: 5947.85
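
The `squared=False` argument used above depends on the installed scikit-learn version: it was deprecated in version 1.4, which added `root_mean_squared_error` as its replacement. Taking the square root of the MSE should work on any version. A minimal sketch with made-up numbers (not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 400.0])

# RMSE is the square root of the mean squared error
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
print(round(rmse, 2))  # 12.25
```

RMSE is in the same units as the target, so the values above can be read directly as a typical error in charges.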
