# Model Training + Pipeline + Feature Scaling + TrainTestSplit

#### Feature Scaling 
Feature scaling is a preprocessing step commonly used in machine learning to standardize or normalize the features of a dataset. It puts all features on a similar scale, which matters for algorithms that are sensitive to feature magnitude, such as distance-based methods (for example k-nearest neighbors) and regularized linear models.

Standardization transforms the data so that it has zero mean and unit variance. It subtracts the mean of each feature and divides by its standard deviation. This technique does not bound the values to a specific range.
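
As a rough illustration of the formula described above, the sketch below standardizes a small made-up matrix by hand and compares the result with scikit-learn's `StandardScaler`. The array values are invented for the example and are not taken from the gemstone dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Manual standardization: z = (x - mean) / std, computed per column.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation via scikit-learn.
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))  # True
print(z_sklearn.min(), z_sklearn.max())  # values are not bounded to [0, 1]
```

After scaling, both columns have mean 0 and standard deviation 1, even though the raw columns differed by a factor of 100.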
#### TrainTestSplit
It is common practice to split your dataset into a training set and a test set. The training set is used to train your model, while the test set is used to evaluate its performance on unseen data.
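
A minimal sketch of such a split, assuming a toy array of ten samples rather than the real dataset. The `test_size` and `random_state` values simply mirror the ones used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples with 2 features each, and one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the samples for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=30
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Every sample ends up in exactly one of the two sets, so the test set stays unseen during training.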
#### Pipeline
A pipeline is a convenient way to chain multiple data preprocessing steps and a machine learning algorithm together. The scikit-learn library provides the `Pipeline` class, which lets you define and execute a sequence of transformations and estimators as a single object.
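
The sketch below shows the idea on synthetic data: a scaler and a regressor are chained so that one `fit` call runs both. The data, the step names `'scaler'` and `'model'`, and the choice of `LinearRegression` are assumptions made for illustration; the notebook itself builds preprocessing-only pipelines further down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): the target is an exact linear function of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

# Each step is a (name, object) pair; every step except the last must be a transformer.
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# fit() fits the scaler on the training rows only, then fits the model on the scaled rows.
pipe.fit(X[:40], y[:40])

# score() applies the already-fitted scaler to the held-out rows before predicting.
score = pipe.score(X[40:], y[40:])
print(score)
```

Because the scaler is fitted inside the pipeline, statistics from the held-out rows never leak into training.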


In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer  # Handling missing values
from sklearn.preprocessing import StandardScaler  # Handling feature scaling
from sklearn.preprocessing import OrdinalEncoder  # Ordinal encoding
## Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

In [2]:
## Independent and dependent features
df = pd.read_csv('C:/pwskills_python_codes/DiamondPricePrediction/notebooks/data/gemstone.csv')
df=df.drop(['id'],axis=1)
X = df.drop(['price'],axis=1)
Y = df[['price']]

In [3]:
# Segregating numerical and categorical variables
categorical_cols = X.select_dtypes(include='object').columns
numerical_cols = X.select_dtypes(exclude='object').columns

In [4]:
print(numerical_cols)
print(categorical_cols)

Index(['carat', 'depth', 'table', 'x', 'y', 'z'], dtype='object')
Index(['cut', 'color', 'clarity'], dtype='object')


In [5]:
# Define the custom ranking for each ordinal variable
cut_categories = ['Fair', 'Good', 'Very Good','Premium','Ideal']
color_categories = ['D', 'E', 'F', 'G', 'H', 'I', 'J']
clarity_categories = ['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF']
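
To make the effect of these rankings concrete, here is a small sketch of how `OrdinalEncoder` maps each label to its position in the supplied list. The four-row DataFrame is made up for the example; only the `cut_categories` list is copied from the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same ranking as in the notebook: index 0 is the lowest grade.
cut_categories = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Toy column (made up) with a few cut grades.
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Good']})

# categories takes one list per encoded column, in column order.
encoder = OrdinalEncoder(categories=[cut_categories])
encoded = encoder.fit_transform(toy)

print(encoded.ravel())  # [4. 0. 3. 1.]
```

Without an explicit `categories` argument the encoder would sort the labels alphabetically, which would not reflect the quality ordering.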

## Pipeline

In [6]:
## Numerical Pipeline
num_pipeline=Pipeline(
    steps=[
    ('remove_features', FunctionTransformer((lambda X: X.drop(['x', 'y', 'z'], axis=1)), validate=False)),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
    ]
)

## Categorical Pipeline
cat_pipeline = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ordinalencoder', OrdinalEncoder(categories=[cut_categories, color_categories, clarity_categories])),
        ('scaler', StandardScaler())
    ]
)


preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])
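
The following sketch, on a made-up three-row DataFrame, illustrates one property of `ColumnTransformer` that the rest of the notebook relies on: output columns follow the order of the transformer list, not the order of the input columns. The transformer names `'num'` and `'cat'` and the shortened category list are assumptions for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frame (made up): the categorical column comes first in the input.
toy = pd.DataFrame({
    'cut': ['Fair', 'Ideal', 'Good'],
    'carat': [0.3, 1.0, 1.7],
})

ct = ColumnTransformer([
    ('num', StandardScaler(), ['carat']),
    ('cat', OrdinalEncoder(categories=[['Fair', 'Good', 'Ideal']]), ['cut']),
])

out = ct.fit_transform(toy)

print(out.shape)  # (3, 2)
print(out[:, 1])  # [0. 2. 1.]  encoded 'cut' is the second output column
```

This is why the column names assigned after preprocessing list the numerical features first and the categorical ones last.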

## Train Test Split

In [7]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.30,random_state=30)


In [8]:
# A flat list of names gives ordinary columns; a nested list would create a MultiIndex
feature_names = ['carat', 'depth', 'table', 'cut', 'color', 'clarity']
X_train = pd.DataFrame(preprocessor.fit_transform(X_train), columns=feature_names)
X_test = pd.DataFrame(preprocessor.transform(X_test), columns=feature_names)

In [9]:
X_train.head()

|   | carat | depth | table | cut | color | clarity |
|---|-------|-------|-------|-----|-------|---------|
| 0 | -0.975439 | -0.849607 | -0.121531 | 0.874076 | 1.528722 | 1.352731 |
| 1 | 0.235195 | 1.833637 | -0.121531 | -2.144558 | -0.935071 | -0.646786 |
| 2 | 0.494617 | 0.815855 | 0.399800 | -0.132136 | 0.296826 | 0.686225 |
| 3 | -1.018676 | 0.260701 | 0.921131 | -0.132136 | 0.296826 | 0.019720 |
| 4 | -0.953821 | -0.664555 | -0.642862 | 0.874076 | 2.144670 | 1.352731 |


## Model Training 

In [16]:
from sklearn.linear_model import LinearRegression,Lasso,Ridge,ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

In [17]:
def evaluate_model(true, predicted):
    """Return MAE, RMSE, and R^2 for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse mse instead of computing it a second time
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
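
As a quick sanity check of what these three metrics measure, the sketch below computes them on four made-up values where the arithmetic can be followed by hand. The numbers are invented for illustration and are unrelated to the results reported below.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; the errors are 10, 10, 20, 20.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mae = mean_absolute_error(y_true, y_pred)  # (10 + 10 + 20 + 20) / 4 = 15
mse = mean_squared_error(y_true, y_pred)   # (100 + 100 + 400 + 400) / 4 = 250
rmse = np.sqrt(mse)                        # sqrt(250), roughly 15.81
r2 = r2_score(y_true, y_pred)              # 1 - 1000 / 50000 = 0.98

print(mae, rmse, r2)
```

RMSE is larger than MAE here because squaring gives the two 20-unit errors more weight than the two 10-unit errors.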

In [18]:
## Train multiple models
## Model Evaluation
models={
    'Decision Tree' :DecisionTreeRegressor(),
    'LinearRegression':LinearRegression(),
    'Lasso':Lasso(),
    'Ridge':Ridge(),
    'Elasticnet':ElasticNet(),
    'KNN':KNeighborsRegressor(),
    'XBG':XGBRegressor(),
    'RandomForest':RandomForestRegressor(),
    'Adaboost':AdaBoostRegressor(),
    'Gradientboost':GradientBoostingRegressor()
}
trained_model_list=[]
model_list=[]
r2_list=[]

for name, model in models.items():
    # ravel() turns the single-column y_train DataFrame into the 1-D array
    # that estimators expect, which avoids a DataConversionWarning
    model.fit(X_train, y_train.values.ravel())

    # Make predictions on the held-out test set
    y_pred = model.predict(X_test)

    mae, rmse, r2_square = evaluate_model(y_test, y_pred)

    print(name)
    model_list.append(name)

    print('Model Training Performance')
    print("RMSE:", rmse)
    print("MAE:", mae)
    print("R2 score for test", r2_square*100)
    print("R2 score for train", r2_score(y_train, model.predict(X_train))*100)

    r2_list.append(r2_square)

    print('='*35)
    print('\n')

Decision Tree
Model Training Performance
RMSE: 827.2447627721448
MAE: 418.28553521680266
R2 score for test 95.7988684968339
R2 score for train 99.83697395085885


LinearRegression
Model Training Performance
RMSE: 1099.6943843143683
MAE: 806.3805022561628
R2 score for test 92.57592692715887
R2 score for train 92.52748141456539


Lasso
Model Training Performance
RMSE: 1099.7070571865745
MAE: 806.0476384650286
R2 score for test 92.57575581621613
R2 score for train 92.52742700593768


Ridge
Model Training Performance
RMSE: 1099.6945713391974
MAE: 806.3751566214534
R2 score for test 92.57592440193709
R2 score for train 92.52748140478235


Elasticnet
Model Training Performance
RMSE: 1831.6608029990882
MAE: 1239.9971996118236
R2 score for test 79.4037418412659
R2 score for train 79.36759398085746


KNN
Model Training Performance
RMSE: 724.1472091506897
MAE: 395.9440900950545
R2 score for test 96.7807704482506
R2 score for train 97.84823635094692


XBG
Model Training Performance
RMSE: 590.4517


RandomForest
Model Training Performance
RMSE: 637.6597693464169
MAE: 329.6385622266084
R2 score for test 97.50381793347167
R2 score for train 99.53918814982065





Adaboost
Model Training Performance
RMSE: 1198.8984069452056
MAE: 790.1264395575438
R2 score for test 91.17605099654375
R2 score for train 91.23840224220335





Gradientboost
Model Training Performance
RMSE: 619.7769284774103
MAE: 333.5549661325968
R2 score for test 97.6418629923036
R2 score for train 97.65249864571588




In [19]:
model_list

['Decision Tree',
 'LinearRegression',
 'Lasso',
 'Ridge',
 'Elasticnet',
 'KNN',
 'XBG',
 'RandomForest',
 'Adaboost',
 'Gradientboost']