In [None]:
'''
Ensemble Techniques in Machine Learning:
- Definition: Ensemble techniques combine multiple models to produce a better predictive performance 
than any individual model.
- Types of Ensemble Methods:
    1. Bagging (Bootstrap Aggregating):
        - Combines predictions from multiple models trained on different subsets of the training data.
        - Each model is trained independently, and their predictions are averaged (for regression) or voted on (for classification).
        - Trains models in parallel, which can lead to faster training times.
        - Base learners can be of the different types, but decision trees are commonly used.
        (base learners are the individual models in the ensemble)
        - Reduces variance and helps to avoid overfitting.
        - Example: Random Forest, where multiple decision trees are trained on random subsets of the data.

    2. Boosting:
        - Sequentially trains models, where each new model focuses on correcting errors made by previous models.
        - Models are trained in a way that each subsequent model pays more attention to the instances that were misclassified by previous models.
        - Combines weak learners to create a strong learner.
        - Can be more sensitive to noise in the data.
        - Reduces bias and variance, often leading to better performance than bagging.
        - Example: AdaBoost, Gradient Boosting, XGBoost.

    3. Stacking:
        - Combines multiple models (base learners) and uses another model (meta-learner) to make final predictions based on the outputs of the base learners.
        - Can leverage the strengths of different algorithms.
        - Example: Using logistic regression as a meta-learner for predictions from decision trees and SVMs.
'''

In [None]:
'''
Random Forest Machine Learning Algorithm:

- Definition: A Random Forest is an ensemble learning method (bagging) that constructs multiple decision trees 
            during training and outputs the mode of their predictions for classification 
            or the mean prediction for regression.
- Key Features:
    - Uses bagging to create a "forest" of decision trees.
    - Combines the predictions of multiple decision trees to improve accuracy and control overfitting.
    - Each tree is built from a random subset of the training data, and at each split, a random subset of features is considered.
    - The final prediction is made by averaging the predictions of all trees (for regression) or taking a majority vote (for classification).
    - Each tree is trained on a random subset of the data, which helps in reducing variance.
    - Can handle both classification and regression tasks.

- How it Works: (row sampling and feature selection)
    - Randomly selects a subset of the training data (with replacement) to train each tree.
    - At each node, it randomly selects a subset of features to consider for splitting, which helps in reducing correlation between trees.
    - Each tree is grown to its maximum depth without pruning, which allows for capturing complex patterns in the data.
    - The final prediction is made by aggregating the predictions from all trees.
- Advantages:
    - Robust to overfitting, especially with large datasets.
    - Handles missing values and maintains accuracy for large datasets.
    - Provides feature importance scores, which can be useful for feature selection.  
- Disadvantages:
    - Can be computationally intensive and slower to predict compared to a single decision tree.
    - Less interpretable than a single decision tree due to the complexity of multiple trees. 
- Common Use Cases:
    - Classification tasks such as spam detection, fraud detection, and image classification.
    - Regression tasks like predicting house prices or stock prices.
    - Feature selection and ranking in high-dimensional datasets.


'''

In [None]:
'''
Random Forest Classification Implementation:
'''
# Holiday Package Prediction

'''
Problem statement:
Trips and travel.com company wants to enable and establish a viable business model to expand
the customer base. One of the ways to expand the customer base is to introdcue a new offering of package tours.
Currently, there are 5 types of package tours available:
1. Basic
2. Standard
3. Deluxe
4. Super Deluxe
5. King

Looking at the data of the last year, we observed that 18% of the customers purchased the package.
However, the marketing cost was quite high because customers were contacted at random without
looking at the available information.
The company is now planning to launch a new product i.e. Wellness Package Tour.
Welness Package Tour is defined as Travel that allows the traveler to maintain,
enhance or kickstart a healthy lifesyle, 
and support or increase one's sense of well-being.

Howwever, this time company wants to harness the available data of existing and potential
customers to make the marketing expenditure more efficient.

'''



In [5]:
# Data Collection:
#https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction?resource=download&select=Travel.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [6]:
df = pd.read_csv(r'Travel.xls')
df.head()


Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


## Data Cleaning:
- Handle Missing Values
- Handling Duplicate Data
- Check Data Types
- Understand the Data

In [7]:
df.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [8]:
## Check all the categorical columns
df['Gender'].value_counts()

Gender
Male       2916
Female     1817
Fe Male     155
Name: count, dtype: int64

In [9]:
df['Gender'] = df['Gender'].apply(lambda x:'Female' if x=='Fe Male' else x)
df['Gender'].value_counts()

Gender
Male      2916
Female    1972
Name: count, dtype: int64

In [10]:
df['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: count, dtype: int64

In [11]:
df['MaritalStatus'] = df['MaritalStatus'].replace({'Single':'Unmarried'})
df['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Unmarried    1598
Divorced      950
Name: count, dtype: int64

In [12]:
df['TypeofContact'].value_counts()

TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64

In [13]:
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Unmarried,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [18]:
## Checking missing values

feature_with_na = [features for features in df.columns if df[features].isnull().sum() > 0]
for feature in feature_with_na:
    #print(f"{feature} has {df[feature].isnull().sum()} missing values.")
    print(feature, np.round(df[feature].isnull().mean()*100, 5), '% missing values')

Age 4.62357 % missing values
TypeofContact 0.51146 % missing values
DurationOfPitch 5.13502 % missing values
NumberOfFollowups 0.92062 % missing values
PreferredPropertyStar 0.53191 % missing values
NumberOfTrips 2.86416 % missing values
NumberOfChildrenVisiting 1.35025 % missing values
MonthlyIncome 4.76678 % missing values


In [19]:
# statistical summary of numerical columns
df[feature_with_na].select_dtypes(exclude='object').describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


## Imputing Null Values:
- Impute median value for age column
- impute mode value for type of contract column
- impute median for duration of putch 
- impute mode for number of followups as it is a discrete variable
- impute mode for preferredPropertyStar
- impute median for number of trips
- impute mode for nmber of children visiting
- impute median for monthly income

In [23]:
df.Age.fillna(df.Age.median(), inplace=True)
df.TypeofContact.fillna(df.TypeofContact.mode()[0], inplace=True)
df.DurationOfPitch.fillna(df.DurationOfPitch.median(), inplace=True)
df.NumberOfFollowups.fillna(df.NumberOfFollowups.mode()[0], inplace=True)
df.PreferredPropertyStar.fillna(df.PreferredPropertyStar.mode()[0], inplace=True)
df.NumberOfTrips.fillna(df.NumberOfTrips.median(), inplace=True)
df.NumberOfChildrenVisiting.fillna(df.NumberOfChildrenVisiting.mode()[0], inplace=True)
df.MonthlyIncome.fillna(df.MonthlyIncome.median(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.Age.fillna(df.Age.median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.DurationOfPitch.fillna(df.DurationOfPitch.median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which w

In [25]:
df.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

In [26]:
df.drop(columns=['CustomerID'], inplace=True)

In [27]:
df.head(1)

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,0.0,Manager,20993.0


## Feature Engineering:
### Feature Extraction:

In [28]:
# Create new columns for feature extraction
df['TotalVisiting'] = df.NumberOfChildrenVisiting + df.NumberOfPersonVisiting

In [30]:
df.drop(columns=['NumberOfChildrenVisiting', 'NumberOfPersonVisiting'], inplace=True)

In [34]:
# get all numerical columns
numerical_cols = [feature for feature in df.columns if df[feature].dtype!= 'O']
print(len(numerical_cols))

12


In [37]:
# get all numerical columns
categorical_cols = [feature for feature in df.columns if df[feature].dtype == 'O']
print(len(categorical_cols))

6


In [38]:
# discrete features - also knowns as categorical features
# are those features which have a limited number of unique values
discrete_features = [feature for feature in numerical_cols if len(df[feature].unique()) < 25]
print(len(discrete_features))

9


In [39]:
# continuous features - are those features which have a large number of unique values
continuous_features = [feature for feature in numerical_cols if feature not in discrete_features]
print(len(continuous_features))

3


## Train and Test Split, and Model Training:

In [40]:
from sklearn.model_selection import train_test_split
X = df.drop(columns=['ProdTaken'], axis=1)
y = df['ProdTaken']

In [41]:
y.value_counts()

ProdTaken
0    3968
1     920
Name: count, dtype: int64

In [42]:
X.head()

Unnamed: 0,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisiting
0,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,Manager,20993.0,3.0
1,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Unmarried,7.0,1,3,0,Executive,17090.0,3.0
3,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,36.0,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0


In [55]:
# separate dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3910, 17), (978, 17), (3910,), (978,))

### Fix the categorical features:

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ProdTaken               4888 non-null   int64  
 1   Age                     4888 non-null   float64
 2   TypeofContact           4888 non-null   object 
 3   CityTier                4888 non-null   int64  
 4   DurationOfPitch         4888 non-null   float64
 5   Occupation              4888 non-null   object 
 6   Gender                  4888 non-null   object 
 7   NumberOfFollowups       4888 non-null   float64
 8   ProductPitched          4888 non-null   object 
 9   PreferredPropertyStar   4888 non-null   float64
 10  MaritalStatus           4888 non-null   object 
 11  NumberOfTrips           4888 non-null   float64
 12  Passport                4888 non-null   int64  
 13  PitchSatisfactionScore  4888 non-null   int64  
 14  OwnCar                  4888 non-null   

In [57]:
cat_features = X.select_dtypes(include='object').columns
num_features = X.select_dtypes(exclude='object').columns
print("Categorical Features:", cat_features)
print("Numerical Features:", num_features)

Categorical Features: Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
       'MaritalStatus', 'Designation'],
      dtype='object')
Numerical Features: Index(['Age', 'CityTier', 'DurationOfPitch', 'NumberOfFollowups',
       'PreferredPropertyStar', 'NumberOfTrips', 'Passport',
       'PitchSatisfactionScore', 'OwnCar', 'MonthlyIncome', 'TotalVisiting'],
      dtype='object')


In [58]:
# One Hot Encoding for Categorical Features and Standardization for Numerical Features
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer


In [59]:
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')

In [60]:
preprocessor = ColumnTransformer(
    [
    ("OneHotEncoder", categorical_transformer, cat_features),
    ("StandardScaler", numeric_transformer, num_features)
    ]
)

In [62]:
preprocessor

In [61]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [75]:
# Machine Learning Model - Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve,precision_score,recall_score,f1_score

In [76]:
from sklearn.metrics import roc_auc_score



models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
          }
for i in range(len(list(models))):
    model = list(models.values())[i]
    model.fit(X_train, y_train) # train the model

    # make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # train set metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    train_precision = precision_score(y_train, y_train_pred)
    train_recall = recall_score(y_train, y_train_pred)
    train_rocauc = roc_auc_score(y_train, y_train_pred)

    # test set metrics
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_rocauc = roc_auc_score(y_test, y_test_pred)

    print(f"Model: {list(models.keys())[i]}")

    print(f"Train Accuracy: {train_accuracy:.4f}, Train F1 Score: {train_f1:.4f}, Train Precision: {train_precision:.4f}, Train Recall: {train_recall:.4f}, Train ROC AUC: {train_rocauc:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}, Test F1 Score: {test_f1:.4f}, Test Precision: {test_precision:.4f}, Test Recall: {test_recall:.4f}, Test ROC AUC: {test_rocauc:.4f}")
    print("\n" + "="*35 + "\n")


Model: Decision Tree
Train Accuracy: 1.0000, Train F1 Score: 1.0000, Train Precision: 1.0000, Train Recall: 1.0000, Train ROC AUC: 1.0000
Test Accuracy: 0.9192, Test F1 Score: 0.9186, Test Precision: 0.8043, Test Recall: 0.7749, Test ROC AUC: 0.8646


Model: Random Forest
Train Accuracy: 1.0000, Train F1 Score: 1.0000, Train Precision: 1.0000, Train Recall: 1.0000, Train ROC AUC: 1.0000
Test Accuracy: 0.9305, Test F1 Score: 0.9257, Test Precision: 0.9556, Test Recall: 0.6754, Test ROC AUC: 0.8339




In [77]:
# hyperparameter tuning
rf_params = {
    'max_depth': [5, 8,15, None, 10],
    'n_estimators': [100,200,500,1000],
    'min_samples_split': [2, 8,15,20],
    'max_features': [5, 7, "auto", 8],
}

In [78]:
randomcv_models = [
    ("RF", RandomForestClassifier(),rf_params)
]

In [79]:
from sklearn.model_selection import RandomizedSearchCV

model_param = {}
for name, model, params in randomcv_models:
    random_search = RandomizedSearchCV(model, params, n_iter=10, cv=3, verbose=2, n_jobs=-1)
    random_search.fit(X_train, y_train)
    model_param[name] = random_search.best_params_
    print(f"Best parameters for {name}: {random_search.best_params_}")

Fitting 3 folds for each of 10 candidates, totalling 30 fits


9 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\stuar\Desktop\Data Science Learning\venv\lib\site-packages\sklearn\

Best parameters for RF: {'n_estimators': 100, 'min_samples_split': 2, 'max_features': 5, 'max_depth': 15}


In [80]:
models = {
    "Random Forest": RandomForestClassifier(**model_param['RF'])
}

In [81]:
for i in range(len(list(models))):
    model = list(models.values())[i]
    model.fit(X_train, y_train) # train the model
    # make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    # train set metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_f1 = f1_score(y_train, y_train_pred, average='weighted')
    train_precision = precision_score(y_train, y_train_pred)
    train_recall = recall_score(y_train, y_train_pred)
    train_rocauc = roc_auc_score(y_train, y_train_pred)
    # test set metrics
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred, average='weighted')
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_rocauc = roc_auc_score(y_test, y_test_pred)
    print(f"Model: {list(models.keys())[i]}")
    print(f"Train Accuracy: {train_accuracy:.4f}, Train F1 Score: {train_f1:.4f}, Train Precision: {train_precision:.4f}, Train Recall: {train_recall:.4f}, Train ROC AUC: {train_rocauc:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}, Test F1 Score: {test_f1:.4f}, Test Precision: {test_precision:.4f}, Test Recall: {test_recall:.4f}, Test ROC AUC: {test_rocauc:.4f}")
    print("\n" + "="*35 + "\n")
    

Model: Random Forest
Train Accuracy: 0.9987, Train F1 Score: 0.9987, Train Precision: 1.0000, Train Recall: 0.9931, Train ROC AUC: 0.9966
Test Accuracy: 0.9254, Test F1 Score: 0.9194, Test Precision: 0.9609, Test Recall: 0.6440, Test ROC AUC: 0.8188


