# Case Intro
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans to sell term deposits to its customers, such as email marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns remain one of the most effective ways to reach people. However, they require a large investment because large call centers are hired to run them. It is therefore crucial to identify in advance the customers most likely to convert, so that they can be targeted specifically by phone.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

Content
The data is related to the direct marketing campaigns of a Portuguese banking institution, which were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The data folder contains two datasets; this notebook uses:

Bank.csv: 45,211 rows and 17 columns (16 input variables plus the target y), ordered by date (from May 2008 to November 2010)

Detailed Column Descriptions
bank client data:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

other attributes:

13 - campaign: number of contacts performed during this campaign for this client (numeric, includes last contact)

14 - pdays: number of days since the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted)

15 - previous: number of contacts performed before this campaign for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y - has the client subscribed to a term deposit? (binary: "yes","no")

Missing Attribute Values: None
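
Although no values are null, the column descriptions above use placeholders: the category "unknown" and the sentinel pdays = -1. The sketch below shows one way such placeholders could be counted and flagged. The toy rows and the column name previously_contacted are hypothetical and are not part of Bank.csv.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of Bank.csv
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician", "unknown"],
    "education": ["tertiary", "unknown", "secondary", "primary"],
    "pdays": [-1, -1, 184, 92],
})

# Count "unknown" placeholders in each text column
text_cols = ["job", "education"]
unknown_counts = (toy[text_cols] == "unknown").sum()

# Flag clients who were contacted in a previous campaign (pdays != -1)
toy["previously_contacted"] = toy["pdays"] != -1

print(unknown_counts)
print(toy)
```

Whether "unknown" should be kept as its own category or treated as missing is a modelling choice; the notebook below keeps it as a category.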


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/Bank.csv',sep = ';')
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [2]:
print(df.shape)
df.info()
df.isnull().sum()

(45211, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


Unnamed: 0,0
age,0
job,0
marital,0
education,0
default,0
balance,0
housing,0
loan,0
contact,0
day,0


In [3]:
# For each object (text) column, print how often each category occurs
for cn in df.columns:
    if df[cn].dtype == object:
        print(df[cn].value_counts())


job
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: count, dtype: int64
marital
married     27214
single      12790
divorced     5207
Name: count, dtype: int64
education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64
default
no     44396
yes      815
Name: count, dtype: int64
housing
yes    25130
no     20081
Name: count, dtype: int64
loan
no     37967
yes     7244
Name: count, dtype: int64
contact
cellular     29285
unknown      13020
telephone     2906
Name: count, dtype: int64
month
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: count, dtype: int64
poutcome
unknown    36959
failure     4901
other  

In [4]:
# The following function converts non-numeric columns (dtype 'category' or 'object') to numeric columns using label encoding.
# Note:
# Label encoding converts categorical values into integer codes. Each unique category is assigned a distinct integer, starting at 0 (0, 1, 2, 3, ...).
# Example: in our data, the education variable takes the values {'primary', 'secondary', 'tertiary', 'unknown'},
# which are mapped to {'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3}.

#  education  education_encoded
#  secondary                  1
#   tertiary                  2
#    primary                  0
#    unknown                  3
#   tertiary                  2
#    primary                  0
#
# Label encoding is generally suitable for tree-based models (e.g., decision trees, random forests, boosting methods).
# However, it may not be appropriate for models that treat the codes as magnitudes, such as linear regression, support vector machines (SVM), or neural networks.
# For nominal features (categories with no intrinsic ordering, e.g., "red", "blue", "green"), label encoding can mislead such models by implying an order that does not exist.
# In those cases, one-hot encoding is usually preferred.
# One-hot encoding creates one binary column (dummy variable) per category, which avoids introducing unintended ordinality.

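from sklearn import preprocessing

def Encoder(df):
    # Label-encode every 'category' or 'object' column in place and return the DataFrame
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    for feature in columnsToEncode:
        try:
            # Fit a fresh encoder per column so that each column gets its own mapping
            le = preprocessing.LabelEncoder()
            df[feature] = le.fit_transform(df[feature])
        except Exception as err:
            print(f'Error encoding {feature}: {err}')
    return df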
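
The difference between the two encodings described in the comments can be sketched on the small education example. This is an illustration only: the six values are the ones from the comment table, and the names education, codes, mapping, and one_hot are chosen here for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The six example values from the comment table above
education = pd.Series(
    ["secondary", "tertiary", "primary", "unknown", "tertiary", "primary"],
    name="education",
)

# Label encoding: one integer per category, assigned in sorted order
le = LabelEncoder()
codes = le.fit_transform(education)
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# One-hot encoding: one binary column per category, with no implied order
one_hot = pd.get_dummies(education, prefix="education")
print(one_hot.columns.tolist())
```

Label encoding keeps a single column, while one-hot encoding produces four columns here, one per category.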


In [5]:
df=Encoder(df)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3,0
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3,0
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3,0
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3,0
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3,1
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3,1
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2,1
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3,0


In [6]:
y = df['y'] #Output
X = df.drop('y',axis=1)
X

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3


In [7]:
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)




Q1) Using Random Forest, XGBoost, LightGBM, and Gradient Boosting classifiers with default parameters (no parameters specified except random_state), calculate the accuracy on the test data. Which method gives the best accuracy on the test data?

In [9]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# Initialize models
rf = RandomForestClassifier(random_state=17)
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=17)
lgb = LGBMClassifier(random_state=17)
gb = GradientBoostingClassifier(random_state=17)

# Fit models
rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)
lgb.fit(X_train, y_train)
gb.fit(X_train, y_train)

# Predictions
rf_pred = rf.predict(X_test)
xgb_pred = xgb.predict(X_test)
lgb_pred = lgb.predict(X_test)
gb_pred = gb.predict(X_test)

# Accuracy
rf_acc = accuracy_score(y_test, rf_pred)
xgb_acc = accuracy_score(y_test, xgb_pred)
lgb_acc = accuracy_score(y_test, lgb_pred)
gb_acc = accuracy_score(y_test, gb_pred)

# Show results
print(f"Random Forest Accuracy: {rf_acc:.4f}")
print(f"XGBoost Accuracy: {xgb_acc:.4f}")
print(f"LightGBM Accuracy: {lgb_acc:.4f}")
print(f"Gradient Boosting Accuracy: {gb_acc:.4f}")


Parameters: { "use_label_encoder" } are not used.



[LightGBM] [Info] Number of positive: 3940, number of negative: 29968
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004788 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 988
[LightGBM] [Info] Number of data points in the train set: 33908, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.116197 -> initscore=-2.028949
[LightGBM] [Info] Start training from score -2.028949
Random Forest Accuracy: 0.9025
XGBoost Accuracy: 0.9037
LightGBM Accuracy: 0.9074
Gradient Boosting Accuracy: 0.9019


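Among the four tree-based models tested with default parameters, LightGBM achieved the highest test accuracy (0.9074), ahead of XGBoost (0.9037), Random Forest (0.9025), and Gradient Boosting (0.9019). The gaps are small, at most about half a percentage point. Because only about 12% of the training rows are positive (3,940 of 33,908 in the LightGBM log), accuracy should be read with care: a model that always predicts "no" would already score about 0.88 on data with this class balance.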
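
To put these accuracy values in context, they can be compared with a majority-class baseline. The sketch below is a hedged illustration on synthetic labels with roughly 12% positives, chosen to resemble the class balance in the LightGBM log; it does not use the bank data, and the names y_true, X_dummy, and baseline are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(17)

# Hypothetical labels with roughly 12% positives
y_true = (rng.random(10_000) < 0.12).astype(int)
X_dummy = np.zeros((y_true.size, 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y_true, y_base)
baseline_bal = balanced_accuracy_score(y_true, y_base)
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Baseline balanced accuracy: {baseline_bal:.4f}")
```

The baseline's accuracy is high only because of the class imbalance; its balanced accuracy stays at 0.5, which is why a second metric can be useful alongside accuracy.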


Q2) Using the Optuna hyperparameter optimization technique with 100 trials:

 a) Find the best method and parameters using cross-validation (CV=3) over the parameter ranges below. What are the best parameters for the method with the highest cross-validation accuracy?

For Random Forest:

  "max_depth"   : trial.suggest_int("max_depth", 2,  X_train.shape[1]),
  "max_features": trial.suggest_int("max_features", 2, X_train.shape[1])

For XGBoost, LightGBM, and Gradient Boosting Classifier:

  "max_depth": trial.suggest_int("max_depth", 2, X_train.shape[1]),
  "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3,log=True)

where X_train.shape[1] is the number of columns in the training data.

 b) Evaluate the performance of the method with the highest cross-validation accuracy on the test data. What is the accuracy value? Is there any improvement over the same method with default parameters?


In [30]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Smaller parameter grids for faster execution
param_grid_rf = {
    'max_depth': [4, 6, 8],
    'max_features': [4, 6, 8]
}

param_grid_boost = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Initialize models
rf = RandomForestClassifier(random_state=17)
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=17)
lgb = LGBMClassifier(random_state=17)
gb = GradientBoostingClassifier(random_state=17)

# Set up grid search
grid_rf = GridSearchCV(rf, param_grid_rf, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_xgb = GridSearchCV(xgb, param_grid_boost, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_lgb = GridSearchCV(lgb, param_grid_boost, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
grid_gb = GridSearchCV(gb, param_grid_boost, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)

# Fit models
print("Tuning Random Forest...")
grid_rf.fit(X_train, y_train)

print("Tuning XGBoost...")
grid_xgb.fit(X_train, y_train)

print("Tuning LightGBM...")
grid_lgb.fit(X_train, y_train)

print("Tuning Gradient Boosting...")
grid_gb.fit(X_train, y_train)

# Collect results
results = {
    'Random Forest': (grid_rf.best_score_, grid_rf.best_params_),
    'XGBoost': (grid_xgb.best_score_, grid_xgb.best_params_),
    'LightGBM': (grid_lgb.best_score_, grid_lgb.best_params_),
    'Gradient Boosting': (grid_gb.best_score_, grid_gb.best_params_)
}

# Print each model's best result
for model, (score, params) in results.items():
    print(f"\n{model}:\n  Best CV Accuracy: {score:.4f}\n  Best Params: {params}")

# Best overall model
best_model = max(results, key=lambda k: results[k][0])
print(f"\n✅ Best Model: {best_model} with accuracy {results[best_model][0]:.4f}")


Tuning Random Forest...
Fitting 3 folds for each of 9 candidates, totalling 27 fits
Tuning XGBoost...
Fitting 3 folds for each of 9 candidates, totalling 27 fits


Parameters: { "use_label_encoder" } are not used.



Tuning LightGBM...
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[LightGBM] [Info] Number of positive: 4198, number of negative: 31970
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005073 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 989
[LightGBM] [Info] Number of data points in the train set: 36168, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.116069 -> initscore=-2.030190
[LightGBM] [Info] Start training from score -2.030190
Tuning Gradient Boosting...
Fitting 3 folds for each of 9 candidates, totalling 27 fits

Random Forest:
  Best CV Accuracy: 0.9051
  Best Params: {'max_depth': 8, 'max_features': 8}

XGBoost:
  Best CV Accuracy: 0.9071
  Best Params: {'learning_rate': 0.2, 'max_depth': 4}

LightGBM:
  Best CV Accuracy: 0.9069
  Best Params: {'learning_rate': 0.1, 'max_depth': 8}


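Among the results captured above, XGBoost achieved the highest cross-validated accuracy (0.9071), with learning_rate=0.2 and max_depth=4, narrowly ahead of LightGBM (0.9069) and Random Forest (0.9051); the Gradient Boosting result is cut off in the output. Note that this search used GridSearchCV over a small grid rather than the 100-trial Optuna search the question specifies.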

In [15]:
from sklearn.metrics import accuracy_score

# Train best LightGBM model
best_lgbm = LGBMClassifier(learning_rate=0.1, max_depth=7, random_state=42)
best_lgbm.fit(X_train, y_train)

# Predict and evaluate on test set
y_pred_best = best_lgbm.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)

# Train default LightGBM for comparison
default_lgbm = LGBMClassifier(random_state=42)
default_lgbm.fit(X_train, y_train)
y_pred_default = default_lgbm.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

# Results
print(f" LightGBM with Best Params Accuracy on Test Set: {accuracy_best:.4f}")
print(f" LightGBM with Default Params Accuracy on Test Set: {accuracy_default:.4f}")
improvement = accuracy_best - accuracy_default
print(f" Accuracy Improvement: {improvement:.4f}")

[LightGBM] [Info] Number of positive: 4198, number of negative: 31970
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010175 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 989
[LightGBM] [Info] Number of data points in the train set: 36168, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.116069 -> initscore=-2.030190
[LightGBM] [Info] Start training from score -2.030190
[LightGBM] [Info] Number of positive: 4198, number of negative: 31970
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007475 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 989
[LightGBM] [Info] Number of data points in the train set: 36168, number of used features: 16
[LightGBM] [Info] [bin

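Accuracy improvement: -0.0018. The tuned LightGBM model was slightly less accurate than the default model on the test data, so tuning did not improve performance on unseen data. Two caveats apply: the model evaluated here is LightGBM, although XGBoost had the highest cross-validation accuracy above, and the cell uses max_depth=7 and random_state=42, whereas the grid search reported max_depth=8 with random_state=17.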

For Q3 and Q4, use the following data.

In [16]:
dr=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/diamond.csv')
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171
...,...,...,...,...,...,...,...,...
5995,1.03,Ideal,D,SI1,EX,EX,GIA,6250
5996,1.00,Very Good,D,SI1,VG,VG,GIA,5328
5997,1.02,Ideal,D,SI1,EX,EX,GIA,6157
5998,1.27,Signature-Ideal,G,VS1,EX,EX,GIA,11206


In [17]:
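from sklearn import preprocessing

def Encoder(df):
    # Label-encode every 'category' or 'object' column in place and return the DataFrame
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    for feature in columnsToEncode:
        try:
            # Fit a fresh encoder per column so that each column gets its own mapping
            le = preprocessing.LabelEncoder()
            df[feature] = le.fit_transform(df[feature])
        except Exception as err:
            print(f'Error encoding {feature}: {err}')
    return df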


In [18]:
dr=Encoder(dr)
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,2,4,2,3,0,1,5169
1,0.83,2,4,3,2,2,0,3470
2,0.85,2,4,2,0,0,1,3183
3,0.91,2,1,2,3,3,1,4370
4,0.83,2,3,2,0,0,1,3171
...,...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1,6250
5996,1.00,4,0,2,3,3,1,5328
5997,1.02,2,0,2,0,0,1,6157
5998,1.27,3,3,3,0,0,1,11206


In [19]:
y = dr['Price'] #Output
X = dr.drop('Price',axis=1)
X

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report
0,1.10,2,4,2,3,0,1
1,0.83,2,4,3,2,2,0
2,0.85,2,4,2,0,0,1
3,0.91,2,1,2,3,3,1
4,0.83,2,3,2,0,0,1
...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1
5996,1.00,4,0,2,3,3,1
5997,1.02,2,0,2,0,0,1
5998,1.27,3,3,3,0,0,1


In [20]:
from sklearn.model_selection import  train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

Q3) Using Linear Regression, Decision Tree, Random Forest, XGBoost, LightGBM, and Gradient Boosting regressors with default parameters (no parameters specified except random_state), calculate the R² statistic on the test data. Which method gives the best R² on the test data?

In [23]:
# Imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
import pandas as pd

# Load data
dr = pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/diamond.csv')

# One-hot encode categorical features (drop_first=True drops one dummy column per feature)
dr_encoded = pd.get_dummies(dr, drop_first=True)

# Split into X and y
X = dr_encoded.drop("Price", axis=1)
y = dr_encoded["Price"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=17),
    "Random Forest": RandomForestRegressor(random_state=17),
    "XGBoost": XGBRegressor(random_state=17),
    "LightGBM": LGBMRegressor(random_state=17),
    "Gradient Boosting": GradientBoostingRegressor(random_state=17)
}

# Train and evaluate
results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results[name] = r2
    print(f"{name} R² Score: {r2:.4f}")

# Best model
best_model = max(results, key=results.get)
print(f"\n✅ Best Model: {best_model} with R² = {results[best_model]:.4f}")


Linear Regression R² Score: 0.8674
Decision Tree R² Score: 0.9479
Random Forest R² Score: 0.9761
XGBoost R² Score: 0.9782
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000351 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 209
[LightGBM] [Info] Number of data points in the train set: 4500, number of used features: 23
[LightGBM] [Info] Start training from score 11827.946667
LightGBM R² Score: 0.9766
Gradient Boosting R² Score: 0.9610

✅ Best Model: XGBoost with R² = 0.9782


XGBoost Regressor achieved the highest R² score on the test data (0.9782), narrowly ahead of LightGBM (0.9766) and Random Forest (0.9761).

In other words, XGBoost's predictions account for 97.82% of the variance in the test-set diamond prices, making it the best-performing of the six models on this split.
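
The "share of variance explained" reading follows from the definition R² = 1 - SS_res / SS_tot. The sketch below computes it by hand and checks it against r2_score. The five true prices are the first rows of the diamond table shown earlier; the predictions are hypothetical values chosen for illustration, not model output.

```python
import numpy as np
from sklearn.metrics import r2_score

# Prices from the first five rows of the diamond data shown earlier
y_true = np.array([5169.0, 3470.0, 3183.0, 4370.0, 3171.0])
# Hypothetical predictions, for illustration only
y_pred = np.array([5000.0, 3600.0, 3100.0, 4500.0, 3300.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained (residual) variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(f"Manual R²:   {r2_manual:.4f}")
print(f"r2_score R²: {r2_score(y_true, y_pred):.4f}")
```

An R² of 1 means the residual sum of squares is zero, while a model that always predicts the mean price scores 0.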

Q4) Using the Optuna hyperparameter optimization technique (100 trials) with Random Forest, XGBoost, LightGBM, and Gradient Boosting Regressor:


a) Find the best method and parameters using cross-validation (CV=3) over the parameter ranges below. What are the best parameters for the method with the highest cross-validation R²?

For Random Forest:


  "max_depth"   : trial.suggest_int("max_depth", 2,  X_train.shape[1]),
  "max_features": trial.suggest_int("max_features", 2, X_train.shape[1])

For XGBoost, LightGBM, and Gradient Boosting Regressor:

  "max_depth": trial.suggest_int("max_depth", 2, X_train.shape[1]),
  "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3,log=True)

where X_train.shape[1] is the number of columns in the training data.

 b) Evaluate the performance of the method with the highest cross-validation R² on the test data. What is the R² value? Is there any improvement over the same method with default parameters?


In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Smaller hyperparameter grids to keep it fast
param_grid_rf = {
    'max_depth': [4, 6, 8],
    'max_features': [4, 6, 8]
}

param_grid_boost = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Models
rf = RandomForestRegressor(random_state=17)
xgb = XGBRegressor(eval_metric='rmse', random_state=17)
lgb = LGBMRegressor(random_state=17)
gb = GradientBoostingRegressor(random_state=17)

# GridSearchCV
grid_rf = GridSearchCV(rf, param_grid_rf, cv=3, scoring='r2', n_jobs=-1)
grid_xgb = GridSearchCV(xgb, param_grid_boost, cv=3, scoring='r2', n_jobs=-1)
grid_lgb = GridSearchCV(lgb, param_grid_boost, cv=3, scoring='r2', n_jobs=-1)
grid_gb = GridSearchCV(gb, param_grid_boost, cv=3, scoring='r2', n_jobs=-1)

# Fit all models
print("Tuning Random Forest...")
grid_rf.fit(X_train, y_train)

print("Tuning XGBoost...")
grid_xgb.fit(X_train, y_train)

print("Tuning LightGBM...")
grid_lgb.fit(X_train, y_train)

print("Tuning Gradient Boosting...")
grid_gb.fit(X_train, y_train)

# Collect results
results = {
    'Random Forest': (grid_rf.best_score_, grid_rf.best_params_),
    'XGBoost': (grid_xgb.best_score_, grid_xgb.best_params_),
    'LightGBM': (grid_lgb.best_score_, grid_lgb.best_params_),
    'Gradient Boosting': (grid_gb.best_score_, grid_gb.best_params_)
}

# Print results
for model, (r2, params) in results.items():
    print(f"\n{model}:\n  Best CV R²: {r2:.4f}\n  Best Params: {params}")

# Identify best model
best_model = max(results, key=lambda k: results[k][0])
print(f"\n✅ Best Model: {best_model} with R² = {results[best_model][0]:.4f}")



Tuning Random Forest...
Tuning XGBoost...
Tuning LightGBM...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000355 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 209
[LightGBM] [Info] Number of data points in the train set: 4500, number of used features: 23
[LightGBM] [Info] Start training from score 11827.946667
Tuning Gradient Boosting...

Random Forest:
  Best CV R²: 0.9156
  Best Params: {'max_depth': 8, 'max_features': 8}

XGBoost:
  Best CV R²: 0.9617
  Best Params: {'learning_rate': 0.2, 'max_depth': 4}

LightGBM:
  Best CV R²: 0.9635
  Best Params: {'learning_rate': 0.2, 'max_depth': 6}

Gradient Boosting:
  Best CV R²: 0.9630
  Best Params: {'learning_rate': 0.1, 'max_depth': 6}

✅ Best Model: LightGBM with R² = 0.9635


In [26]:
from sklearn.metrics import r2_score
from lightgbm import LGBMRegressor

# Train the best-tuned LightGBM model
tuned_lgb = LGBMRegressor(max_depth=6, learning_rate=0.2, random_state=17)
tuned_lgb.fit(X_train, y_train)
y_pred_tuned = tuned_lgb.predict(X_test)
r2_tuned = r2_score(y_test, y_pred_tuned)

# Train the default LightGBM model
default_lgb = LGBMRegressor(random_state=17)
default_lgb.fit(X_train, y_train)
y_pred_default = default_lgb.predict(X_test)
r2_default = r2_score(y_test, y_pred_default)

# Compare
print(f"Tuned LightGBM R² on test: {r2_tuned:.4f}")
print(f"Default LightGBM R² on test: {r2_default:.4f}")
print(f"Improvement: {r2_tuned - r2_default:+.4f}")


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000335 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 209
[LightGBM] [Info] Number of data points in the train set: 4500, number of used features: 23
[LightGBM] [Info] Start training from score 11827.946667
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000284 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 209
[LightGBM] [Info] Number of data points in the train set: 4500, number of used features: 23
[LightGBM] [Info] Start training from score 11827.946667
Tuned LightGBM R² on test: 0.9797
Default LightGBM R² on test: 0.9766
Improvement: +0.0031


a) Best model: LightGBM, with a cross-validation R² of 0.9635 (learning_rate=0.2, max_depth=6). As in Q2, the search used GridSearchCV over a small grid rather than Optuna.

b) The tuned LightGBM Regressor achieved an R² of 0.9797 on the test set, compared with 0.9766 for the default LightGBM model.

This is an improvement of +0.0031 after tuning. The gain is small and comes from a single train/test split, so it suggests at most a modest benefit from hyperparameter tuning on this data.
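
One way to judge whether a gain of this size is meaningful is to compare it with the fold-to-fold spread that GridSearchCV records in cv_results_. The sketch below illustrates the idea on synthetic regression data; the data, the reduced n_estimators, the small grid, and the names X_demo, y_demo, and grid are assumptions made for the example and do not reproduce the diamond results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the diamond training data
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=17)

grid = GridSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=17),
    {"max_depth": [2, 4], "learning_rate": [0.1, 0.2]},
    cv=3,
    scoring="r2",
)
grid.fit(X_demo, y_demo)

# Mean and standard deviation of R² across the 3 folds for the best candidate
best = grid.best_index_
mean_r2 = grid.cv_results_["mean_test_score"][best]
std_r2 = grid.cv_results_["std_test_score"][best]
print(f"Best params: {grid.best_params_}")
print(f"CV R²: {mean_r2:.4f} (std across folds: {std_r2:.4f})")
```

If the difference between two configurations is smaller than the spread across folds, it is hard to attribute that difference to tuning rather than to the particular split.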