In [None]:
'''
Gradient Boosting:

Definition:
- Gradient Boosting is a powerful ensemble machine learning technique that builds models sequentially, 
    where each new model attempts to correct the errors of the previous ones. 
- It combines multiple weak learners, typically decision trees, to create a strong predictive model. 
- The key idea is to optimize a loss function by adding models that minimize the 
    residual errors of the combined model.

Key Features:
- It can be used for both regression and classification tasks.
- It performs gradient descent in function space: each new model is fit to the negative gradient 
    of the loss function with respect to the current predictions.
- For squared-error loss this negative gradient is simply the residual of the previous models, 
    so each new model focuses on the areas where the previous models performed poorly.
- It works directly with numerical features; in scikit-learn, categorical features must be encoded 
    numerically first (e.g., with one-hot encoding).
- Gradient Boosting is prone to overfitting, especially with deep trees, 
    so regularization techniques such as a small learning rate and limited tree depth are often applied.

'''
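
For classification with the binary log loss, the "errors" each new tree is fit to are the negative gradient of the loss with respect to the current raw score, which works out to y - p (label minus predicted probability). The following is a minimal numerical check of that identity; the labels, scores, and helper names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_per_sample(y, raw_score):
    """Binary log loss for each sample, given raw (pre-sigmoid) scores."""
    p = sigmoid(raw_score)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and current raw scores of the ensemble
y = np.array([1.0, 0.0, 1.0, 0.0])
raw_score = np.array([0.3, -1.2, -0.5, 2.0])

# Analytic negative gradient (the pseudo-residual): y - p
analytic = y - sigmoid(raw_score)

# Numerical negative gradient via central differences
eps = 1e-6
numeric = -(log_loss_per_sample(y, raw_score + eps)
            - log_loss_per_sample(y, raw_score - eps)) / (2 * eps)

print("pseudo-residuals:", np.round(analytic, 4))
print("matches numerical gradient:", np.allclose(analytic, numeric, atol=1e-6))
```

Positive-class samples get positive pseudo-residuals (the score should go up) and negative-class samples get negative ones.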

In [None]:
'''
Steps in Gradient Boosting:
1. Initialize the model with a constant value (e.g., the mean of the target variable).
2. For each iteration:
   a. Compute the residuals (errors) of the current model.
   b. Fit a new weak learner (e.g., decision tree) to the residuals.
   c. Update the model by adding the predictions of the new weak learner, scaled by a learning rate.
3. Repeat until a stopping criterion is met (e.g., a maximum number of iterations or convergence).
4. Make predictions using the final model.

Update formula:
a. Learning Rate:
- The learning rate (often denoted as "η") is a hyperparameter that controls the contribution of each weak learner to the final model.
- It is typically a value in (0, 1]; a smaller learning rate means each weak learner has a smaller impact on the final prediction, so more learners are usually needed.
- The formula for updating the model at iteration m is:
    F_m(x) = F_{m-1}(x) + η * h_m(x)
    
    where:
    - F_{m-1}(x) is the prediction of the model built so far.
    - F_m(x) is the updated prediction.
    - η is the learning rate.
    - h_m(x) is the prediction of the new weak learner.


'''
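
The steps above can be sketched from scratch for the simplest case: regression with squared-error loss, where the residuals are exactly the quantity each tree is fit to. This is an illustrative sketch on synthetic data, not what scikit-learn's `GradientBoostingClassifier` does internally; the data, variable names, and the `predict` helper are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # η in the update formula
n_rounds = 50         # stopping criterion: fixed number of iterations

# Step 1: initialize with a constant prediction (the mean of the target)
initial_prediction = y.mean()
F = np.full(y.shape, initial_prediction)
trees = []
mse_history = [np.mean((y - F) ** 2)]

for _ in range(n_rounds):
    residuals = y - F                                   # Step 2a
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                              # Step 2b
    F = F + learning_rate * tree.predict(X)             # Step 2c
    trees.append(tree)
    mse_history.append(np.mean((y - F) ** 2))

def predict(X_new):
    """Step 4: sum the initial constant and every scaled tree prediction."""
    pred = np.full(X_new.shape[0], initial_prediction)
    for fitted_tree in trees:
        pred += learning_rate * fitted_tree.predict(X_new)
    return pred

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

With a smaller `learning_rate`, each round removes less of the remaining error, so more rounds are needed to reach the same training loss.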

In [None]:
'''
Difference with AdaBoost:
- AdaBoost (Adaptive Boosting) and Gradient Boosting are both boosting techniques that add weak learners sequentially, 
    but they differ in how each new learner is directed at the remaining errors.
- AdaBoost increases the weights of misclassified instances so that the next learner concentrates on them, 
    while Gradient Boosting fits the next learner to the negative gradient (pseudo-residuals) of the combined model's loss.
- AdaBoost corresponds to minimizing an exponential loss, 
    while Gradient Boosting works with any differentiable loss function (e.g., squared error or log loss).
- AdaBoost has fewer hyperparameters and is often simpler to tune, while Gradient Boosting can be more powerful and flexible,
    but may require more careful tuning of hyperparameters.

'''
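
To make the reweighting idea concrete, here is a sketch of a single round of the classic (discrete) AdaBoost weight update with labels in {-1, +1}. The labels and predictions are made up; scikit-learn's `AdaBoostClassifier` may differ in details.

```python
import numpy as np

# Hypothetical true labels and one weak learner's predictions (one mistake, at index 1)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])

# Start with uniform sample weights
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner and its vote (alpha)
misclassified = y_pred != y_true
err = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Increase weights of misclassified samples, decrease the rest, then renormalize
weights = weights * np.exp(-alpha * y_true * y_pred)
weights /= weights.sum()

print("alpha:", round(alpha, 4))
print("updated weights:", np.round(weights, 3))
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Gradient Boosting has no such sample weights; it changes the targets (the pseudo-residuals) instead.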

In [1]:
# Data Collection:
#https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction?resource=download&select=Travel.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [2]:
df = pd.read_csv(r'Travel.xls')
df.head()


Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [3]:
df.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [4]:
# Fix the inconsistent 'Fe Male' label
df['Gender'] = df['Gender'].replace({'Fe Male': 'Female'})
df['Gender'].value_counts()

Gender
Male      2916
Female    1972
Name: count, dtype: int64

In [5]:
df['MaritalStatus'] = df['MaritalStatus'].replace({'Single':'Unmarried'})
df['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Unmarried    1598
Divorced      950
Name: count, dtype: int64

In [6]:
## Checking missing values

feature_with_na = [features for features in df.columns if df[features].isnull().sum() > 0]
for feature in feature_with_na:
    #print(f"{feature} has {df[feature].isnull().sum()} missing values.")
    print(feature, np.round(df[feature].isnull().mean()*100, 5), '% missing values')

Age 4.62357 % missing values
TypeofContact 0.51146 % missing values
DurationOfPitch 5.13502 % missing values
NumberOfFollowups 0.92062 % missing values
PreferredPropertyStar 0.53191 % missing values
NumberOfTrips 2.86416 % missing values
NumberOfChildrenVisiting 1.35025 % missing values
MonthlyIncome 4.76678 % missing values


In [7]:
# statistical summary of numerical columns
df[feature_with_na].select_dtypes(exclude='object').describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


In [8]:
# Impute continuous columns with the median and categorical/discrete columns with the mode.
# Assigning the result back avoids the chained `inplace=True` pattern that pandas warns about.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['TypeofContact'] = df['TypeofContact'].fillna(df['TypeofContact'].mode()[0])
df['DurationOfPitch'] = df['DurationOfPitch'].fillna(df['DurationOfPitch'].median())
df['NumberOfFollowups'] = df['NumberOfFollowups'].fillna(df['NumberOfFollowups'].mode()[0])
df['PreferredPropertyStar'] = df['PreferredPropertyStar'].fillna(df['PreferredPropertyStar'].mode()[0])
df['NumberOfTrips'] = df['NumberOfTrips'].fillna(df['NumberOfTrips'].median())
df['NumberOfChildrenVisiting'] = df['NumberOfChildrenVisiting'].fillna(df['NumberOfChildrenVisiting'].mode()[0])
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())
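
The imputation in this notebook uses medians and modes computed on the full dataset before the train/test split. An alternative, sketched here under the assumption that one wants the fill values learned from the training rows only, is scikit-learn's `SimpleImputer`. The toy frame reuses two column names from the dataset, but its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up toy data with missing values
toy = pd.DataFrame({
    "Age": [41.0, 49.0, np.nan, 33.0, 37.0, np.nan, 52.0, 28.0],
    "NumberOfFollowups": [3.0, 4.0, 4.0, np.nan, 3.0, 4.0, np.nan, 4.0],
})
train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Remember the training-only median to compare against the fitted imputer
train_age_median = train["Age"].median()

age_imputer = SimpleImputer(strategy="median")
followup_imputer = SimpleImputer(strategy="most_frequent")

# Fit on the training rows, then apply the same fill values to the test rows
train[["Age"]] = age_imputer.fit_transform(train[["Age"]])
test[["Age"]] = age_imputer.transform(test[["Age"]])
train[["NumberOfFollowups"]] = followup_imputer.fit_transform(train[["NumberOfFollowups"]])
test[["NumberOfFollowups"]] = followup_imputer.transform(test[["NumberOfFollowups"]])

print("Fill value learned for Age:", age_imputer.statistics_[0])
print("Remaining missing values:", int(train.isnull().sum().sum() + test.isnull().sum().sum()))
```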

In [9]:
df.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

In [10]:
df.drop(columns=['CustomerID'], inplace=True)

In [11]:
# Feature engineering: combine the two visitor counts into a single column
df['TotalVisiting'] = df.NumberOfChildrenVisiting + df.NumberOfPersonVisiting

In [12]:
df.drop(columns=['NumberOfChildrenVisiting', 'NumberOfPersonVisiting'], inplace=True)

In [13]:
# get all numerical columns
numerical_cols = [feature for feature in df.columns if df[feature].dtype!= 'O']
print(len(numerical_cols))

12


In [14]:
# get all categorical columns
categorical_cols = [feature for feature in df.columns if df[feature].dtype == 'O']
print(len(categorical_cols))

6


In [15]:
# discrete features - numerical features that take a limited number of unique values
# (here, fewer than 25), so they behave much like categorical features
discrete_features = [feature for feature in numerical_cols if len(df[feature].unique()) < 25]
print(len(discrete_features))

9


In [16]:
# continuous features - are those features which have a large number of unique values
continuous_features = [feature for feature in numerical_cols if feature not in discrete_features]
print(len(continuous_features))

3


In [17]:
from sklearn.model_selection import train_test_split
X = df.drop(columns=['ProdTaken'])
y = df['ProdTaken']

In [18]:
# separate dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3910, 17), (978, 17), (3910,), (978,))

In [19]:
cat_features = X.select_dtypes(include='object').columns
num_features = X.select_dtypes(exclude='object').columns
print("Categorical Features:", cat_features)
print("Numerical Features:", num_features)

Categorical Features: Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
       'MaritalStatus', 'Designation'],
      dtype='object')
Numerical Features: Index(['Age', 'CityTier', 'DurationOfPitch', 'NumberOfFollowups',
       'PreferredPropertyStar', 'NumberOfTrips', 'Passport',
       'PitchSatisfactionScore', 'OwnCar', 'MonthlyIncome', 'TotalVisiting'],
      dtype='object')


In [20]:
# One Hot Encoding for Categorical Features and Standardization for Numerical Features
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')

In [21]:
preprocessor = ColumnTransformer(
    [
    ("OneHotEncoder", categorical_transformer, cat_features),
    ("StandardScaler", numeric_transformer, num_features)
    ]
)

In [22]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
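
One way to keep the "fit on train, transform on test" discipline automatic is to wrap the preprocessor and the model in a single `Pipeline`, so that cross-validation refits the encoder and scaler inside every fold. The sketch below uses a small synthetic frame; the column values, `toy_X`, `toy_y`, and `toy_preprocessor` are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the Travel dataset)
rng = np.random.default_rng(0)
n = 300
toy_X = pd.DataFrame({
    "Age": rng.integers(18, 62, size=n).astype(float),
    "MonthlyIncome": rng.normal(23000, 5000, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Occupation": rng.choice(["Salaried", "Small Business", "Large Business"], size=n),
})
# Made-up target that depends on Age plus noise
toy_y = (toy_X["Age"] + rng.normal(0, 10, size=n) > 40).astype(int)

toy_preprocessor = ColumnTransformer([
    ("OneHotEncoder", OneHotEncoder(drop="first"), ["Gender", "Occupation"]),
    ("StandardScaler", StandardScaler(), ["Age", "MonthlyIncome"]),
])

# Preprocessing and model are fitted together on each training fold
clf = Pipeline([
    ("preprocess", toy_preprocessor),
    ("model", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(clf, toy_X, toy_y, cv=3, scoring="accuracy")
print("Mean CV accuracy:", round(scores.mean(), 3))
```

The same pipeline object can also be passed to `RandomizedSearchCV`, in which case hyperparameter names are prefixed with the step name (for example `model__max_depth`).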

In [25]:
y_train

3995    0
2610    0
3083    0
3973    0
4044    0
       ..
4426    0
466     0
3092    0
3772    0
860     1
Name: ProdTaken, Length: 3910, dtype: int64

In [26]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve, precision_score, recall_score, f1_score

In [27]:
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier()
}

In [29]:
for name, model in models.items():
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # training set performance
    # Note: F1 is weighted across both classes, while recall and precision
    # are reported for the positive class (ProdTaken = 1) only.
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    model_train_recall = recall_score(y_train, y_train_pred)
    model_train_precision = precision_score(y_train, y_train_pred)
    # Note: with hard 0/1 predictions, roc_auc_score equals balanced accuracy;
    # pass predict_proba scores instead for a threshold-free AUC.
    model_train_roc_auc = roc_auc_score(y_train, y_train_pred)

    # test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_roc_auc = roc_auc_score(y_test, y_test_pred)

    print(f"Model: {name}")
    print("Training Set Performance:")
    print(f"Training Accuracy: {model_train_accuracy}")
    print(f"Training F1 Score: {model_train_f1}")
    print(f"Training Recall: {model_train_recall}")
    print(f"Training Precision: {model_train_precision}")
    print(f"Training ROC AUC: {model_train_roc_auc}")

    print("Test Set Performance:")
    print(f"Test Accuracy: {model_test_accuracy}")
    print(f"Test F1 Score: {model_test_f1}")
    print(f"Test Recall: {model_test_recall}")
    print(f"Test Precision: {model_test_precision}")
    print(f"Test ROC AUC: {model_test_roc_auc}")
    print("-"*50)


Model: Logistic Regression
Training Set Performance:
Training Accuracy: 0.8460358056265984
Training F1 Score: 0.8202118738880438
Training Recall: 0.30315500685871055
Training Precision: 0.7015873015873015
Training ROC AUC: 0.6368022755136056
Test Set Performance:
Test Accuracy: 0.83640081799591
Test F1 Score: 0.8086633047343356
Test Recall: 0.2931937172774869
Test Precision: 0.691358024691358
Test ROC AUC: 0.630713758257549
--------------------------------------------------
Model: Random Forest
Training Set Performance:
Training Accuracy: 1.0
Training F1 Score: 1.0
Training Recall: 1.0
Training Precision: 1.0
Training ROC AUC: 1.0
Test Set Performance:
Test Accuracy: 0.9274028629856851
Test F1 Score: 0.9213994793886622
Test Recall: 0.643979057591623
Test Precision: 0.9761904761904762
Test ROC AUC: 0.8200835567500682
--------------------------------------------------
Model: Decision Tree
Training Set Performance:
Training Accuracy: 1.0
Training F1 Score: 1.0
Training Recall: 1.0
Trainin
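
The evaluation loop passes hard 0/1 predictions to `roc_auc_score`. For a binary problem that value equals the balanced accuracy (the mean of recall on each class), which is why it sits well below the accuracy for models with low recall. A sketch on synthetic, imbalanced data shows the difference from an AUC computed on predicted probabilities; the data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (roughly 80% negatives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
hard_labels = clf.predict(X_te)               # 0/1 predictions at the 0.5 threshold
pos_scores = clf.predict_proba(X_te)[:, 1]    # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"roc_auc_score on hard labels:   {auc_from_labels:.4f}")
print(f"balanced accuracy:              {bal_acc:.4f}")
print(f"roc_auc_score on probabilities: {auc_from_scores:.4f}")
```

The first two numbers coincide; the third summarizes ranking quality across all thresholds and is the quantity usually meant by "ROC AUC".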

In [30]:
## Hyperparameter Tuning using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
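rf_params = {
    "max_depth": [5, 8, 15, None, 10],
    # NOTE: "auto" (which meant "sqrt" for classifiers) is no longer accepted by recent
    # scikit-learn releases; candidates that sample it fail to fit.
    "max_features": [5, 7, "auto", 8],
    "n_estimators": [100, 200, 500, 1000],
    "min_samples_split": [2, 8, 15, 20]
}

gradient_params = {
    # NOTE: 'deviance' is the old name of 'log_loss' and 'mse' the old name of 'squared_error';
    # recent scikit-learn releases reject both, so candidates that sample them fail to fit.
    "loss": ['log_loss', 'deviance', 'exponential'],
    "criterion": ['friedman_mse', 'squared_error', 'mse'],
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 8, 15, None, 10]
}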

In [31]:
rf_params

{'max_depth': [5, 8, 15, None, 10],
 'max_features': [5, 7, 'auto', 8],
 'n_estimators': [100, 200, 500, 1000],
 'min_samples_split': [2, 8, 15, 20]}

In [32]:
gradient_params

{'loss': ['log_loss', 'deviance', 'exponential'],
 'criterion': ['friedman_mse', 'squared_error', 'mse'],
 'n_estimators': [100, 200, 500],
 'max_depth': [5, 8, 15, None, 10]}

In [34]:
# model list for hyperparameter tuning
randomcv_models = [
    ("RF", RandomForestClassifier(), rf_params),
    ("Gradient Boosting", GradientBoostingClassifier(), gradient_params)
]

In [35]:
randomcv_models

[('RF',
  RandomForestClassifier(),
  {'max_depth': [5, 8, 15, None, 10],
   'max_features': [5, 7, 'auto', 8],
   'n_estimators': [100, 200, 500, 1000],
   'min_samples_split': [2, 8, 15, 20]}),
 ('Gradient Boosting',
  GradientBoostingClassifier(),
  {'loss': ['log_loss', 'deviance', 'exponential'],
   'criterion': ['friedman_mse', 'squared_error', 'mse'],
   'n_estimators': [100, 200, 500],
   'max_depth': [5, 8, 15, None, 10]})]

In [36]:
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")

model_param = {}
for name, model, params in randomcv_models:
    # named `search` rather than `random` to avoid confusion with the stdlib module
    search = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=100,
                                cv=3,
                                verbose=2,
                                n_jobs=-1,
                                )
    search.fit(X_train, y_train)
    model_param[name] = search.best_params_

for model_name, params in model_param.items():
    print(f"Best parameters for {model_name}: {params}")


Fitting 3 folds for each of 100 candidates, totalling 300 fits


69 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
47 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklea

Fitting 3 folds for each of 100 candidates, totalling 300 fits


168 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
44 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\skle

Best parameters for RF: {'n_estimators': 500, 'min_samples_split': 2, 'max_features': 8, 'max_depth': None}
Best parameters for Gradient Boosting: {'n_estimators': 500, 'max_depth': 10, 'loss': 'exponential', 'criterion': 'squared_error'}


In [37]:
# Refit the models using the best hyperparameters found by the search
models = {
    "Random Forest": RandomForestClassifier(**model_param['RF']),
    "Gradient Boosting": GradientBoostingClassifier(**model_param['Gradient Boosting'])
}

In [38]:
for name, model in models.items():
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # training set performance
    # Note: F1 is weighted across both classes, while recall and precision
    # are reported for the positive class (ProdTaken = 1) only.
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    model_train_recall = recall_score(y_train, y_train_pred)
    model_train_precision = precision_score(y_train, y_train_pred)
    # Note: with hard 0/1 predictions, roc_auc_score equals balanced accuracy;
    # pass predict_proba scores instead for a threshold-free AUC.
    model_train_roc_auc = roc_auc_score(y_train, y_train_pred)

    # test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_roc_auc = roc_auc_score(y_test, y_test_pred)

    print(f"Model: {name}")
    print("Training Set Performance:")
    print(f"Training Accuracy: {model_train_accuracy}")
    print(f"Training F1 Score: {model_train_f1}")
    print(f"Training Recall: {model_train_recall}")
    print(f"Training Precision: {model_train_precision}")
    print(f"Training ROC AUC: {model_train_roc_auc}")

    print("Test Set Performance:")
    print(f"Test Accuracy: {model_test_accuracy}")
    print(f"Test F1 Score: {model_test_f1}")
    print(f"Test Recall: {model_test_recall}")
    print(f"Test Precision: {model_test_precision}")
    print(f"Test ROC AUC: {model_test_roc_auc}")
    print("-"*50)


Model: Random Forest
Training Set Performance:
Training Accuracy: 1.0
Training F1 Score: 1.0
Training Recall: 1.0
Training Precision: 1.0
Training ROC AUC: 1.0
Test Set Performance:
Test Accuracy: 0.9355828220858896
Test F1 Score: 0.9314434086004779
Test Recall: 0.6963350785340314
Test Precision: 0.9637681159420289
Test ROC AUC: 0.8449909191907768
--------------------------------------------------
Model: Gradient Boosting
Training Set Performance:
Training Accuracy: 1.0
Training F1 Score: 1.0
Training Recall: 1.0
Training Precision: 1.0
Training ROC AUC: 1.0
Test Set Performance:
Test Accuracy: 0.9591002044989775
Test F1 Score: 0.957637969504513
Test Recall: 0.8115183246073299
Test Precision: 0.9748427672955975
Test ROC AUC: 0.9032178662426738
--------------------------------------------------
