# All ML Regression Analyses using 13 Models

In [1]:
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Enable inline plotting in the notebook
%matplotlib inline

# To ignore/hide the warnings
import warnings
warnings.filterwarnings("ignore")

## Read Data

In [2]:
df = pd.read_excel("sample_data.xlsx")
df.head(3)

| | Car_Name | Car_Age | Target | Present_Price | Kms_Driven | Fuel_Type | Seller_Type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ritz | 10 | 3.35 | 5.59 | 27000 | Petrol | Dealer | Manual | 0 |
| 1 | sx4 | 11 | 4.75 | 9.54 | 43000 | Diesel | Dealer | Manual | 0 |
| 2 | ciaz | 7 | 7.25 | 9.85 | 6900 | Petrol | Dealer | Manual | 0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Car_Age        301 non-null    int64  
 2   Target         301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


## Unique Values

In [4]:
# Summarize each column: number of unique values, the values themselves (if few), and dtype

def get_unique_values(df, max_unique=5):

    output_data = []

    for col in df.columns:
        n_unique = df[col].nunique()
        # List the unique values only when there are at most `max_unique` of them
        unique_values = df[col].unique() if n_unique <= max_unique else "-"
        output_data.append([col, n_unique, unique_values, df[col].dtype])

    output_df = pd.DataFrame(output_data, columns=['Column Name', 'Number of Unique Values', 'Unique Values', 'Data Type'])

    return output_df

get_unique_values(df)

| | Column Name | Number of Unique Values | Unique Values | Data Type |
|---|---|---|---|---|
| 0 | Car_Name | 98 | - | object |
| 1 | Car_Age | 16 | - | int64 |
| 2 | Target | 156 | - | float64 |
| 3 | Present_Price | 147 | - | float64 |
| 4 | Kms_Driven | 206 | - | int64 |
| 5 | Fuel_Type | 3 | [Petrol, Diesel, CNG] | object |
| 6 | Seller_Type | 2 | [Dealer, Individual] | object |
| 7 | Transmission | 2 | [Manual, Automatic] | object |
| 8 | Owner | 3 | [0, 1, 3] | int64 |


* `Car_Name` has 98 distinct categories. Dummy encoding it would add nearly 100 sparse columns to a dataset of only 301 rows, so it is not practical here.

* Let's drop it.
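To make the column-count argument concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not the car dataset). It counts how many columns `pd.get_dummies` would create for a high-cardinality column compared with a low-cardinality one:

```python
import pandas as pd

# Hypothetical toy data: one high-cardinality column and one low-cardinality column
toy = pd.DataFrame({
    "name": [f"car_{i}" for i in range(20)],   # 20 distinct labels
    "fuel": ["Petrol", "Diesel"] * 10,         # 2 distinct labels
})

# With drop_first=True each column produces (number of categories - 1) dummy columns
n_dummy_cols = {col: toy[col].nunique() - 1 for col in toy.columns}
print(n_dummy_cols)

# Encode everything and look at the resulting width
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.shape)
```

The same arithmetic applied to `Car_Name` would give 97 extra columns, which is why dropping it is a reasonable first choice here.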

## Drop

In [5]:
df = df.drop('Car_Name', axis=1)
df.head(2)

| | Car_Age | Target | Present_Price | Kms_Driven | Fuel_Type | Seller_Type | Transmission | Owner |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3.35 | 5.59 | 27000 | Petrol | Dealer | Manual | 0 |
| 1 | 11 | 4.75 | 9.54 | 43000 | Diesel | Dealer | Manual | 0 |


## Encoding

* The regression models used below need numeric inputs, so we apply dummy (one-hot) encoding to the categorical columns.

In [6]:
# List the categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Apply dummy encoding with drop_first=True (get_dummies replaces the original categorical columns)
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df.head(2)

| | Car_Age | Target | Present_Price | Kms_Driven | Owner | Fuel_Type_Diesel | Fuel_Type_Petrol | Seller_Type_Individual | Transmission_Manual |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3.35 | 5.59 | 27000 | 0 | False | True | False | True |
| 1 | 11 | 4.75 | 9.54 | 43000 | 0 | True | False | False | True |
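The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical one-column frame (`toy`) with the three fuel types and compares the full encoding with the reduced one. `dtype=int` is used only to make the printed values easier to read.

```python
import pandas as pd

# Hypothetical toy column with the three fuel types seen above
toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

full = pd.get_dummies(toy, columns=["Fuel_Type"], dtype=int)
reduced = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding every row sums to 1, so the dummy columns are
# perfectly collinear with an intercept term in a linear model
row_sums = full.sum(axis=1).tolist()
print(row_sums)
```

Dropping the first level (here `CNG`) removes that redundancy. A row with zeros in both remaining columns is then read as `CNG`.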


## Handling Null Values

In [7]:
df.isna().sum()

Car_Age                   0
Target                    0
Present_Price             0
Kms_Driven                0
Owner                     0
Fuel_Type_Diesel          0
Fuel_Type_Petrol          0
Seller_Type_Individual    0
Transmission_Manual       0
dtype: int64


* There are no null values, so the next two cells are not needed for this dataset.

* They are included in case you need them for other data.

In [8]:
# Fill null values in categorical columns with mode
for col in df.select_dtypes(include=['object', 'category']):
    # Assign back instead of using inplace=True on a column selection
    df[col] = df[col].fillna(df[col].mode()[0])

In [9]:
# Fill null values in numerical columns with mean
for col in df.select_dtypes(include=['number']):
    # Assign back instead of using inplace=True on a column selection
    df[col] = df[col].fillna(df[col].mean())
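One caveat when imputing before a train/test split is that the fill value is computed from all rows, including those that later end up in the test set. A possible alternative, sketched here on hypothetical toy frames (`train` and `test` are made up), is to learn the fill value from the training rows only with scikit-learn's `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing values
train = pd.DataFrame({"Kms_Driven": [10000.0, np.nan, 30000.0]})
test = pd.DataFrame({"Kms_Driven": [np.nan, 50000.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the mean (20000) from train only
test_filled = imputer.transform(test)        # reuses the training mean for the test rows

print(train_filled.ravel())
print(test_filled.ravel())
```

For categorical columns, `strategy="most_frequent"` plays the role of the mode-based fill above.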

## Labelling

In [10]:
X = df.drop(columns="Target")
y = df["Target"]

## Split as Train & Test

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Scaling

Scaling is not necessary for tree-based algorithms, but it is important for regularized linear models (Ridge, Lasso, ElasticNet), KNN and SVM.

Scaled data can also be used with tree-based algorithms. It does no harm, although any gain is usually negligible.


**Machine Learning Methods That Require Scaling**

Scaling is particularly important for algorithms that are based on distance calculations or gradient-based optimization. However, it is not necessary for all machine learning methods. Below is a summary of when scaling is required and when it is not.

**Algorithms That Require Scaling**

These algorithms are sensitive to the scale of features, so scaling is essential:

**K-Nearest Neighbors (KNN):** KNN relies on distance metrics (like Euclidean or Manhattan). Without scaling, features with larger ranges will dominate the distance calculation, leading to biased predictions. Use either StandardScaler or MinMaxScaler.

**Support Vector Machines (SVM):** SVM constructs hyperplanes based on the distance between data points and the decision boundary. Without scaling, it can be difficult to find the optimal hyperplane.

**Logistic Regression:** Logistic regression uses gradient descent for optimization. Different scales across features can slow down or skew the optimization process, affecting model performance.

**Neural Networks (MLP - Multi-Layer Perceptron):** Neural networks update weights using gradient-based optimization. Large differences in feature scales can cause slower convergence and poor performance.

**Principal Component Analysis (PCA):** PCA is based on capturing the variance in the data. Features with larger ranges will contribute more to the variance, overshadowing others, making scaling essential.

**Gradient Boosting Algorithms (XGBoost, LightGBM, CatBoost):** With their default tree base learners, these models split on thresholds and are largely insensitive to feature scale. Scaling mainly matters when a linear booster is used (for example XGBoost's `gblinear`) or when the model shares a pipeline with scale-sensitive steps.

**Linear Regression:** Plain least squares gives the same predictions with or without scaling, but regularized variants such as Ridge or Lasso penalize coefficient magnitude, which depends on the units of each feature. Scaling ensures the penalty treats all features equally.
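The KNN point above can be illustrated with a few lines of NumPy. This is a minimal sketch on made-up values (the three rows of `X_demo` are hypothetical cars, not rows from the dataset). It compares Euclidean distances before and after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: Car_Age (years), Kms_Driven (km); values are illustrative only
X_demo = np.array([
    [2.0, 50000.0],    # a
    [15.0, 50500.0],   # b: very different age, similar mileage
    [2.0, 60000.0],    # c: same age, different mileage
])

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# Raw units: the 13-year age gap barely registers next to the kilometre values
raw_ab, raw_ac = dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2])

# After scaling, both features contribute on a comparable [0, 1] range
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)
scaled_ab = dist(X_demo_scaled[0], X_demo_scaled[1])
scaled_ac = dist(X_demo_scaled[0], X_demo_scaled[2])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

In raw units, car `b` looks about twenty times closer to `a` than `c` does, purely because mileage is measured in large numbers. After scaling, the two distances are nearly equal.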



**Algorithms That Do Not Require Scaling**

These methods are not sensitive to feature scales and therefore do not need scaling:

**Decision Trees (CART, Random Forest):** Decision trees use threshold-based splits, so the scale of features does not affect the way the tree is constructed. Scaling has no impact on the model’s performance.

**Bagging Methods (Random Forest, Bagging):** Since bagging techniques often rely on decision trees, they also do not require scaling. 

**Naive Bayes:** Naive Bayes models work with probabilities based on feature values, which are independent of scale, so no scaling is necessary.

**Tree-based Gradient Boosting Algorithms:** Methods like XGBoost, LightGBM, and CatBoost build decision trees, whose splits do not depend on feature scale, so scaling is normally not needed.

**K-Means Clustering:** K-means is the exception in this list. It relies on distance calculations, so scaling is usually recommended to keep features with large ranges from dominating the clusters.

* Using scaled data with the scale-insensitive models above will not cause a problem.
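The claim that threshold-based splits are unaffected by scaling can be checked directly. The sketch below uses deterministic made-up data (`X_demo`, `y_demo` are hypothetical) and fits the same decision tree on raw and on min-max scaled features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Deterministic toy data with very different feature ranges (values are made up)
x0 = np.arange(200, dtype=float) * 500.0          # 0 .. 99500
x1 = ((np.arange(200) * 37) % 200).astype(float)  # a permutation of 0 .. 199
X_demo = np.column_stack([x0, x1])
y_demo = np.sin(x0 / 20000.0) + 0.01 * x1

X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

# Same hyperparameters and seed, different feature scales
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo_scaled, y_demo)

same_predictions = np.allclose(tree_raw.predict(X_demo), tree_scaled.predict(X_demo_scaled))
print(same_predictions)
```

Min-max scaling preserves the ordering of each feature, so the tree finds the same partitions and only the threshold values change.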



**Summary**

**Algorithms that require scaling:** KNN, SVM, Logistic Regression, Neural Networks, PCA, regularized Linear Regression (Ridge, Lasso, ElasticNet).

**Algorithms that do not require scaling:** Decision Trees, Random Forest, Bagging methods, tree-based Gradient Boosting, Naive Bayes.

In general, distance-based and gradient-based methods require scaling, while tree-based methods do not. However, if you are using a pipeline that involves multiple models, it may be useful to scale all features for consistency across the pipeline.
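One way to keep scaling consistent across models is to bundle the scaler and the estimator in a scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data (`X_demo` and `y_demo` are made up). The choice of `KNeighborsRegressor` with `n_neighbors=3` is only an example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1000, size=(50, 3))
y_demo = 0.01 * X_demo[:, 0] + rng.normal(0, 0.1, size=50)

# The scaler is fitted on whatever data is passed to fit(), then reused in predict()
pipe = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=3))
pipe.fit(X_demo[:40], y_demo[:40])
preds = pipe.predict(X_demo[40:])

print(list(pipe.named_steps))
print(preds.shape)
```

With a pipeline, the scaler cannot accidentally be fitted on test rows, and the same object can be passed to cross-validation or grid search.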




In [12]:
X_train.head(2)

| | Car_Age | Present_Price | Kms_Driven | Owner | Fuel_Type_Diesel | Fuel_Type_Petrol | Seller_Type_Individual | Transmission_Manual |
|---|---|---|---|---|---|---|---|---|
| 184 | 16 | 0.75 | 26000 | 1 | False | True | True | True |
| 132 | 7 | 0.95 | 3500 | 0 | False | True | True | True |


In [13]:
# Initialize the MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

In [29]:
# Optional: view the scaled NumPy array as a DataFrame, for inspection only.
# X_train_scaled itself stays a NumPy array for the modelling cells below.
print(pd.DataFrame(X_train_scaled, columns=X_train.columns).head(2))

    Car_Age  Present_Price  Kms_Driven     Owner  Fuel_Type_Diesel  \
0  0.642857       0.004660    0.051051  0.333333               0.0   
1  0.000000       0.006827    0.006006  0.000000               0.0   

   Fuel_Type_Petrol  Seller_Type_Individual  Transmission_Manual  
0               1.0                     1.0                  1.0  
1               1.0                     1.0                  1.0  


In [14]:
X_test.head(2)

| | Car_Age | Present_Price | Kms_Driven | Owner | Fuel_Type_Diesel | Fuel_Type_Petrol | Seller_Type_Individual | Transmission_Manual |
|---|---|---|---|---|---|---|---|---|
| 177 | 8 | 0.57 | 24000 | 0 | False | True | True | False |
| 289 | 8 | 13.6 | 10980 | 0 | False | True | False | True |


In [25]:
# Transform the test data with the scaler fitted on the training data (no refitting)
X_test_scaled = scaler.transform(X_test)
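Because the scaler is fitted on the training data only, test values are mapped with the training minimum and maximum and can fall outside [0, 1]. A small sketch on made-up numbers (`train` and `test` are hypothetical arrays) shows this and checks the result against the min-max formula `(x - min) / (max - min)`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[2.0], [4.0], [10.0]])  # training min = 2, max = 10
test = np.array([[1.0], [6.0], [12.0]])   # contains values outside the training range

scaler_demo = MinMaxScaler().fit(train)
test_scaled = scaler_demo.transform(test)

# Same computation by hand, using the training statistics
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(test_scaled.ravel())  # -> [-0.125  0.5    1.25 ]
```

Values slightly below 0 or above 1 in the scaled test set are therefore expected and are not a sign of a bug.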

In [32]:
# Optional: view the scaled NumPy array as a DataFrame, for inspection only.
# X_test_scaled itself stays a NumPy array for the evaluation cells below.
print(pd.DataFrame(X_test_scaled, columns=X_test.columns).head(2))

    Car_Age  Present_Price  Kms_Driven  Owner  Fuel_Type_Diesel  \
0  0.071429       0.002709    0.047047    0.0               0.0   
1  0.071429       0.143910    0.020981    0.0               0.0   

   Fuel_Type_Petrol  Seller_Type_Individual  Transmission_Manual  
0               1.0                     1.0                  0.0  
1               1.0                     0.0                  1.0  


## Modelling

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
# from lightgbm import LGBMRegressor

In [19]:
# Fit the models
linear = LinearRegression().fit(X_train_scaled, y_train)
ridge = Ridge().fit(X_train_scaled, y_train)
lasso = Lasso().fit(X_train_scaled, y_train)
enet = ElasticNet().fit(X_train_scaled, y_train)
knn = KNeighborsRegressor().fit(X_train_scaled, y_train)
ada = AdaBoostRegressor().fit(X_train_scaled, y_train)
svm = SVR().fit(X_train_scaled, y_train)
mlpc = MLPRegressor().fit(X_train_scaled, y_train)  # Multilayer perceptron (ANN model)
dtc = DecisionTreeRegressor().fit(X_train_scaled, y_train)
rf = RandomForestRegressor().fit(X_train_scaled, y_train)
xgb = XGBRegressor().fit(X_train_scaled, y_train)
gbm = GradientBoostingRegressor().fit(X_train_scaled, y_train)
# lgb = LGBMRegressor().fit(X_train_scaled, y_train)  # LightGBM
catbost = CatBoostRegressor().fit(X_train_scaled, y_train)

Learning rate set to 0.032678
0:	learn: 5.0363211	total: 47.5ms	remaining: 47.5s
1:	learn: 4.9464546	total: 48.2ms	remaining: 24.1s
2:	learn: 4.8504633	total: 48.9ms	remaining: 16.2s
3:	learn: 4.7574203	total: 49.6ms	remaining: 12.3s
4:	learn: 4.6638347	total: 50.3ms	remaining: 10s
5:	learn: 4.5824986	total: 51ms	remaining: 8.46s
6:	learn: 4.5180984	total: 51.5ms	remaining: 7.31s
7:	learn: 4.4367646	total: 52.2ms	remaining: 6.47s
8:	learn: 4.3558074	total: 52.6ms	remaining: 5.79s
9:	learn: 4.2785412	total: 53.4ms	remaining: 5.29s
10:	learn: 4.2140923	total: 54.1ms	remaining: 4.86s
11:	learn: 4.1565100	total: 54.8ms	remaining: 4.51s
12:	learn: 4.0725655	total: 55.5ms	remaining: 4.22s
13:	learn: 4.0061777	total: 56.4ms	remaining: 3.98s
14:	learn: 3.9433616	total: 57.9ms	remaining: 3.8s
15:	learn: 3.8871514	total: 58.7ms	remaining: 3.61s
16:	learn: 3.8274795	total: 59.4ms	remaining: 3.44s
17:	learn: 3.7576466	total: 60.2ms	remaining: 3.28s
18:	learn: 3.6969935	total: 61ms	remaining: 3.15s
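When many models are fitted with the same data, a dictionary and a loop can keep the code compact and make the scores easier to compare. The sketch below is self-contained and runs on synthetic data from `make_regression`, not the car dataset. The model subset, the `candidates` and `scores` names, and the `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "knn": KNeighborsRegressor(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model and collect train/test R^2 side by side
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name:>6}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Fixing `random_state` for the stochastic models makes the reported scores reproducible between runs.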

## Eval Metrics

In [23]:
models = [linear, ridge, lasso, enet, knn, ada, svm, mlpc, dtc, rf, xgb, gbm, catbost]

# R^2 on the training set (score() returns R^2 for these regressors)
def train_r2(model):
    return model.score(X_train_scaled, y_train)

for model in models:
    print(model, "train R^2:", train_r2(model))

LinearRegression() train R^2: 0.8886517300804564
Ridge() train R^2: 0.8279940941365418
Lasso() train R^2: 0.1444873504727131
ElasticNet() train R^2: 0.200825215920849
KNeighborsRegressor() train R^2: 0.857362402680176
AdaBoostRegressor() train R^2: 0.9639803262775456
SVR() train R^2: 0.610690444665906
MLPRegressor() train R^2: 0.6695101703320103
DecisionTreeRegressor() train R^2: 1.0
RandomForestRegressor() train R^2: 0.9858909443932178
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             

In [24]:
# R^2 on the test set (score() returns R^2 for these regressors)
def test_r2(model):
    return model.score(X_test_scaled, y_test)

for model in models:
    print(model, "test R^2:", test_r2(model))

LinearRegression() test R^2: 0.8489813024899064
Ridge() test R^2: 0.7837567859000831
Lasso() test R^2: 0.1624615665269098
ElasticNet() test R^2: 0.22033692156322826
KNeighborsRegressor() test R^2: 0.920822980791902
AdaBoostRegressor() test R^2: 0.9173885044456718
SVR() test R^2: 0.6970730951191091
MLPRegressor() test R^2: 0.6522677431324668
DecisionTreeRegressor() test R^2: 0.9383347498425312
RandomForestRegressor() test R^2: 0.9628348990161746
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=N
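R² from a single split is only one view of model quality. As a possible complement, the sketch below computes MAE and RMSE and a 5-fold cross-validated R². It runs on synthetic data from `make_regression` and uses `LinearRegression` purely as a stand-in; all variable names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)           # average absolute error, in target units
rmse = np.sqrt(mean_squared_error(y_te, y_pred))  # penalizes large errors more strongly
r2 = r2_score(y_te, y_pred)                       # same value as model.score(X_te, y_te)

# Cross-validated R^2 on the training portion gives a more stable estimate
cv_r2 = cross_val_score(LinearRegression(), X_tr, y_tr, cv=5, scoring="r2")

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
print(f"CV R^2: mean = {cv_r2.mean():.3f}, std = {cv_r2.std():.3f}")
```

A large gap between train and test scores, such as the decision tree's train R² of 1.0 above, is a typical sign of overfitting that cross-validation helps to quantify.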

* You can pick the best-performing models and apply hyperparameter tuning.

* Using the best hyperparameters and all of the data, you can fit the final model and save it.

# Hyperparameter Tuning
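As a starting point, here is a minimal grid search sketch with scikit-learn's `GridSearchCV`. It uses synthetic data and a deliberately tiny, hypothetical parameter grid for `RandomForestRegressor`. The grid values are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data for illustration only
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# Small illustrative grid: 2 x 2 = 4 parameter combinations
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated R^2 of the best combination
print(search.score(X_te, y_te))  # R^2 of the refitted best model on held-out data
```

By default `GridSearchCV` refits the best combination on the whole training set, so `search.best_estimator_` can be used directly afterwards.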

# Final Model 

## Save the Final Model
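One common option for persisting a fitted scikit-learn model is `joblib`. The sketch below is self-contained: it trains a stand-in `LinearRegression` on synthetic data, writes it to a temporary directory, and reloads it. The file name `final_model.joblib` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Stand-in for the final model, trained on synthetic data for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=42)
final_model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load it back
path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(final_model, path)
loaded_model = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(final_model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```

If the model was trained on scaled features, the fitted scaler needs to be saved as well (or bundled with the model in a pipeline), otherwise new data cannot be transformed consistently.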

In [None]:
# TASK DONE!!!!!!!