## Holiday Package Prediction

### 1) Problem statement.
"Trips & Travel.Com" wants to establish a viable business model to expand its customer base.
One way to expand the customer base is to introduce a new package offering. The company currently offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of customers purchased a package. However, the marketing cost was quite high because customers were contacted at random, without using the available information.
The company now plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and to support or increase their sense of well-being.
This time, the company wants to use the available data on existing and potential customers to make its marketing expenditure more efficient.
### 2) Data Collection.
The dataset comes from Kaggle: https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction
It consists of 20 columns and 4,888 rows.

In [1]:
## importing important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("Travel.csv")
df.head(2)

,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0


In [3]:
df.columns

Index(['CustomerID', 'ProdTaken', 'Age', 'TypeofContact', 'CityTier',
       'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting',
       'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar',
       'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore',
       'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
      dtype='object')

In [4]:
# statistics on numerical columns
df.describe()

,CustomerID,ProdTaken,Age,CityTier,DurationOfPitch,NumberOfPersonVisiting,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,MonthlyIncome
count,4888.0,4888.0,4662.0,4888.0,4637.0,4888.0,4843.0,4862.0,4748.0,4888.0,4888.0,4888.0,4822.0,4655.0
mean,202443.5,0.188216,37.622265,1.654255,15.490835,2.905074,3.708445,3.581037,3.236521,0.290917,3.078151,0.620295,1.187267,23619.853491
std,1411.188388,0.390925,9.316387,0.916583,8.519643,0.724891,1.002509,0.798009,1.849019,0.454232,1.365792,0.485363,0.857861,5380.698361
min,200000.0,0.0,18.0,1.0,5.0,1.0,1.0,3.0,1.0,0.0,1.0,0.0,0.0,1000.0
25%,201221.75,0.0,31.0,1.0,9.0,2.0,3.0,3.0,2.0,0.0,2.0,0.0,1.0,20346.0
50%,202443.5,0.0,36.0,1.0,13.0,3.0,4.0,3.0,3.0,0.0,3.0,1.0,1.0,22347.0
75%,203665.25,0.0,44.0,3.0,20.0,3.0,4.0,4.0,4.0,1.0,4.0,1.0,2.0,25571.0
max,204887.0,1.0,61.0,3.0,127.0,5.0,6.0,5.0,22.0,1.0,5.0,1.0,3.0,98678.0


Customer & Demographic Information
- CustomerID – A unique identifier for each traveler.
- Age – The customer's age.
- Gender – The traveler's gender.
- MaritalStatus – Whether the traveler is single, married, etc.
- Occupation – The customer’s occupation type (Salaried, Small Business, Large Business, or Free Lancer).
- Designation – The traveler’s job title or rank.
- MonthlyIncome – The traveler’s estimated monthly income.

Travel Preferences & Booking Details
- TypeofContact – How the customer came into contact with the company (Self Enquiry or Company Invited).
- CityTier – Classification of the customer's city (1 being metro, 3 being smaller cities).
- ProductPitched – The travel package or service recommended (e.g., Deluxe, Standard).
- PreferredPropertyStar – The preferred hotel/star rating (e.g., 3-star, 5-star).
- NumberOfPersonVisiting – Total individuals traveling with the customer.
- NumberOfChildrenVisiting – Number of children accompanying the traveler.

Travel History & Indicators
- NumberOfTrips – The number of trips the traveler has taken.
- Passport – Indicates possession of a passport (1: Yes, 0: No).
- OwnCar – Whether the traveler owns a car (1: Yes, 0: No).

Sales & Interaction Metrics
- DurationOfPitch – The length of the sales pitch in minutes.
- NumberOfFollowups – The number of follow-up interactions made with the customer.
- PitchSatisfactionScore – A rating of how satisfied the traveler was with the pitch.
- ProdTaken – Whether the customer purchased the travel package (1: Yes, 0: No).

`Pitch refers to the sales presentation given to potential customers about a travel package or service. The goal of the pitch is to persuade the customer to purchase the offered product.`

## Data Cleaning

#### Missing values

In [5]:
df.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64
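
To judge how much the imputation step matters, it helps to express these counts as a share of the 4,888 rows. A minimal sketch using the counts printed above (the names `missing_counts` and `missing_pct` are only illustrative):

```python
# Missing-value counts copied from the df.isnull().sum() output above.
missing_counts = {
    "Age": 226,
    "TypeofContact": 25,
    "DurationOfPitch": 251,
    "NumberOfFollowups": 45,
    "PreferredPropertyStar": 26,
    "NumberOfTrips": 140,
    "NumberOfChildrenVisiting": 66,
    "MonthlyIncome": 233,
}
n_rows = 4888  # row count reported by df.describe()

# Percentage of rows missing per column, largest first.
missing_pct = {
    col: 100 * count / n_rows
    for col, count in sorted(missing_counts.items(), key=lambda kv: -kv[1])
}
for col, pct in missing_pct.items():
    print(f"{col:<26}{pct:5.2f}%")
```

No column is missing much more than 5% of its values, so simple per-column imputation is a reasonable choice here. On the live DataFrame, `df.isnull().mean() * 100` gives the same percentages directly.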

In [6]:
# Fill missing values with a per-column statistic (median, mean, or mode)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['TypeofContact'] = df['TypeofContact'].fillna(df['TypeofContact'].mode()[0])
df['DurationOfPitch'] = df['DurationOfPitch'].fillna(df['DurationOfPitch'].mean())
df['NumberOfFollowups'] = df['NumberOfFollowups'].fillna(df['NumberOfFollowups'].mean())
df['PreferredPropertyStar'] = df['PreferredPropertyStar'].fillna(df['PreferredPropertyStar'].mean())
df['NumberOfTrips'] = df['NumberOfTrips'].fillna(df['NumberOfTrips'].median())
df['NumberOfChildrenVisiting'] = df['NumberOfChildrenVisiting'].fillna(df['NumberOfChildrenVisiting'].mode()[0])
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].mean())
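
The cell above computes the fill values from the full dataset. A common alternative, sketched here on an invented frame rather than on `Travel.csv`, is to learn the fill values from the training rows only and reuse them for the test rows, so that no test information leaks into preprocessing. The helper name `fit_fill_values` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fit_fill_values(frame, median_cols, mean_cols, mode_cols):
    """Learn one fill value per column from the given rows only."""
    fills = {}
    for col in median_cols:
        fills[col] = frame[col].median()
    for col in mean_cols:
        fills[col] = frame[col].mean()
    for col in mode_cols:
        fills[col] = frame[col].mode()[0]
    return fills

# Toy data standing in for a train/test split (values are invented).
train = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0, 50.0],
    "MonthlyIncome": [20000.0, 22000.0, np.nan, 24000.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry", "Company Invited"],
})
test = pd.DataFrame({
    "Age": [np.nan],
    "MonthlyIncome": [np.nan],
    "TypeofContact": [None],
})

fills = fit_fill_values(train, ["Age"], ["MonthlyIncome"], ["TypeofContact"])
train_filled = train.fillna(fills)
test_filled = test.fillna(fills)  # same values, learned from the training rows only
print(test_filled)
```

Applying this to the notebook would mean splitting before imputing, which changes the order of the cells and possibly the reported metrics.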

In [7]:
df.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

#### Drop unnecessary columns

In [8]:
df.drop(['CustomerID','NumberOfChildrenVisiting'], inplace=True, axis=1)

In [9]:
df.head(2)

,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,Manager,20993.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0


#### Drop duplicate rows

In [10]:
df.shape

(4888, 18)

In [11]:
df[df.duplicated()]

,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome
1483,0,34.0,Self Enquiry,1,25.0,Small Business,Male,3,3.0,Basic,3.0,Married,1.0,0,3,0,Executive,17661.000000
1514,0,36.0,Company Invited,1,6.0,Small Business,Female,2,3.0,Deluxe,3.0,Single,2.0,0,3,1,Manager,23619.853491
1518,0,46.0,Company Invited,3,11.0,Small Business,Male,3,3.0,Deluxe,3.0,Single,5.0,1,5,1,Manager,20772.000000
1520,1,48.0,Self Enquiry,1,6.0,Salaried,Male,3,4.0,Standard,3.0,Single,1.0,1,5,0,Senior Manager,20381.000000
1531,0,38.0,Company Invited,1,35.0,Salaried,Female,2,3.0,Deluxe,3.0,Single,2.0,0,3,1,Manager,17406.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4789,0,45.0,Self Enquiry,1,36.0,Salaried,Male,3,4.0,Deluxe,3.0,Unmarried,3.0,0,5,1,Manager,23219.000000
4793,0,61.0,Self Enquiry,3,14.0,Small Business,Male,3,2.0,Deluxe,3.0,Married,2.0,1,5,0,Manager,23898.000000
4795,0,33.0,Company Invited,1,9.0,Salaried,Fe Male,3,4.0,Deluxe,3.0,Unmarried,6.0,1,5,1,Manager,23676.000000
4798,0,41.0,Self Enquiry,3,17.0,Large Business,Female,3,5.0,Deluxe,4.0,Married,2.0,0,5,1,Manager,25530.000000


In [12]:
df.drop_duplicates(inplace = True)

In [13]:
df.shape

(4566, 18)
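
The two shapes show that 322 rows (4,888 minus 4,566) were removed. Assuming `CustomerID` is unique, as the data dictionary states, rows can only count as duplicates here because that column was dropped first. The sketch below illustrates the effect on an invented three-row frame:

```python
import pandas as pd

# Invented rows: two customers share every attribute except the ID.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [34.0, 34.0, 41.0],
    "Occupation": ["Salaried", "Salaried", "Small Business"],
})

# With the ID column present, no row is an exact copy of another.
dupes_with_id = int(toy.duplicated().sum())
# Without it, the second row duplicates the first.
dupes_without_id = int(toy.drop(columns="CustomerID").duplicated().sum())
print(dupes_with_id, dupes_without_id)

deduped = toy.drop(columns="CustomerID").drop_duplicates()
print(deduped.shape)
```

Whether two different customers with identical attributes should be treated as one record is a modeling decision; dropping them, as done above, also removes any chance of the same row appearing in both the training and test sets.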

#### Categorical columns to numerical columns

In [14]:
df['TypeofContact'].value_counts()

TypeofContact
Self Enquiry       3253
Company Invited    1313
Name: count, dtype: int64

In [15]:
df['Occupation'].value_counts()

Occupation
Salaried          2208
Small Business    1954
Large Business     402
Free Lancer          2
Name: count, dtype: int64

In [16]:
df['Occupation'] = df['Occupation'].map({'Free Lancer':0,'Small Business':2,'Salaried':3,'Large Business':1})

In [17]:
df['Gender'].value_counts()

Gender
Male       2724
Female     1707
Fe Male     135
Name: count, dtype: int64

In [18]:
df['Gender'] = df['Gender'].replace('Fe Male','Female')

In [19]:
df['ProductPitched'].value_counts()

ProductPitched
Basic           1721
Deluxe          1628
Standard         690
Super Deluxe     312
King             215
Name: count, dtype: int64

In [20]:
df['ProductPitched'].unique()

array(['Deluxe', 'Basic', 'Standard', 'Super Deluxe', 'King'],
      dtype=object)

In [21]:
df['ProductPitched'] = df['ProductPitched'].map({'Deluxe':3, 'Basic':4, 'Standard':2, 'Super Deluxe':1, 'King':0})

In [22]:
df['MaritalStatus'].value_counts()

MaritalStatus
Married      2204
Divorced      950
Single        810
Unmarried     602
Name: count, dtype: int64

In [23]:
df['MaritalStatus'].unique()

array(['Single', 'Divorced', 'Married', 'Unmarried'], dtype=object)

In [24]:
df['MaritalStatus'] = df['MaritalStatus'].map({'Single':1, 'Divorced':2, 'Married':3, 'Unmarried':0})

In [25]:
df['Designation'].value_counts()

Designation
Executive         1721
Manager           1628
Senior Manager     690
AVP                312
VP                 215
Name: count, dtype: int64

In [26]:
df['Designation'].unique()

array(['Manager', 'Executive', 'Senior Manager', 'AVP', 'VP'],
      dtype=object)

In [27]:
df['Designation'] = df['Designation'].map({'Manager':3, 'Executive':4, 'Senior Manager':2, 'AVP':1, 'VP':0})
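
`Series.map` with a dictionary returns `NaN` for any category that is missing from the dictionary, so a misspelled or unseen label fails silently. A small guard, sketched here with a hypothetical helper `map_strict`, makes such cases explicit:

```python
import pandas as pd

def map_strict(series, mapping):
    """Map categories to integers and fail loudly on unmapped labels."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped categories in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)

# Same codes as the Designation mapping used above.
designation_codes = {'Manager': 3, 'Executive': 4, 'Senior Manager': 2, 'AVP': 1, 'VP': 0}

ok = map_strict(pd.Series(['Manager', 'VP'], name='Designation'), designation_codes)
print(ok.tolist())

# 'Director' is an invented label that is not in the mapping.
error_message = None
try:
    map_strict(pd.Series(['Manager', 'Director'], name='Designation'), designation_codes)
except ValueError as err:
    error_message = str(err)
print(error_message)
```

The same check is useful at prediction time, where a new customer record may contain a label that never appeared in the training data.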

In [28]:
df = pd.get_dummies(df,drop_first=True)
df.head(1)

,ProdTaken,Age,CityTier,DurationOfPitch,Occupation,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TypeofContact_Self Enquiry,Gender_Male
0,1,41.0,3,6.0,3,3,3.0,3,3.0,1,1.0,1,2,1,3,20993.0,True,False
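
With `drop_first=True`, `pd.get_dummies` encodes each remaining string column and drops its first category in sorted order, which then acts as the baseline. That is why only `TypeofContact_Self Enquiry` and `Gender_Male` appear above. A minimal sketch on an invented three-row frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [41.0, 49.0, 37.0],
    "TypeofContact": ["Self Enquiry", "Company Invited", "Self Enquiry"],
    "Gender": ["Female", "Male", "Male"],
})

# "Company Invited" and "Female" sort first, so they become the dropped baselines.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded["Gender_Male"].tolist())
```

A value of `False` in `Gender_Male` therefore means "Female", and `False` in `TypeofContact_Self Enquiry` means "Company Invited".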


## Train Test Split and Model Training

In [29]:
df.columns

Index(['ProdTaken', 'Age', 'CityTier', 'DurationOfPitch', 'Occupation',
       'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched',
       'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport',
       'PitchSatisfactionScore', 'OwnCar', 'Designation', 'MonthlyIncome',
       'TypeofContact_Self Enquiry', 'Gender_Male'],
      dtype='object')

In [30]:
X = df[['Age', 'CityTier', 'DurationOfPitch', 'Occupation',
       'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched',
       'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport',
       'PitchSatisfactionScore', 'OwnCar', 'Designation', 'MonthlyIncome',
       'TypeofContact_Self Enquiry', 'Gender_Male']]
y = df[['ProdTaken']]

In [31]:
y.value_counts()
# Tree ensembles such as Random Forest, XGBoost, and Gradient Boosting often cope reasonably well with imbalanced datasets

ProdTaken
0            3709
1             857
Name: count, dtype: int64
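
Because the classes are imbalanced, accuracy needs a reference point. The short calculation below, using only the two counts printed above, gives the share of purchasers and the accuracy of a trivial model that always predicts "not purchased" (the variable names are illustrative):

```python
# Class counts copied from the y.value_counts() output above.
n_not_purchased = 3709
n_purchased = 857
total = n_not_purchased + n_purchased

positive_rate = n_purchased / total
# Accuracy of a trivial model that always predicts class 0.
majority_baseline_accuracy = n_not_purchased / total

print(f"positive rate:     {positive_rate:.4f}")
print(f"majority baseline: {majority_baseline_accuracy:.4f}")
```

Any accuracy figure should be read against this baseline of roughly 81%, which is why precision and recall on the purchasing class matter more than accuracy for this problem.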

In [32]:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((3652, 17), (914, 17))
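
The split above is purely random, so the share of purchasers can differ slightly between the training and test sets. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A sketch on synthetic data with a similar imbalance (the arrays `X_demo` and `y_demo` are invented, not the travel data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with 19% positives, close to ProdTaken.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 190 + [0] * 810)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Both parts keep (almost exactly) the original positive rate.
print(y_tr.mean(), y_te.mean())
```

Using `stratify=y` in the notebook would change which rows land in the test set, so the metrics reported below would shift somewhat.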

#### Scaler

In [33]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [34]:
import pickle
with open("scaler.pkl","wb") as file:
    pickle.dump(scaler,file)

#### Model Classifier Training

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve 

In [36]:
models={"Logistic Regression":LogisticRegression(),
        "Decision Tree":DecisionTreeClassifier(),
        "Random Forest":RandomForestClassifier(),
        "Gradient Boost":GradientBoostingClassifier(),
        "Adaboost": AdaBoostClassifier()}
print(list(models))
print(list(models.values()))
print(list(models.values())[0])

['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boost', 'Adaboost']
[LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), AdaBoostClassifier()]
LogisticRegression()


In [37]:
models={"Logistic Regression":LogisticRegression(),
        "Decision Tree":DecisionTreeClassifier(),
        "Random Forest":RandomForestClassifier(),
        "Gradient Boost":GradientBoostingClassifier(),
        "Adaboost": AdaBoostClassifier()}

for name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Training set performance
    # F1 is the support-weighted average over both classes; precision and recall are for class 1 only
    model_train_accuracy = accuracy_score(y_train, y_train_pred) # Calculate Accuracy
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted') # Calculate F1-score
    model_train_precision = precision_score(y_train, y_train_pred) # Calculate Precision
    model_train_recall = recall_score(y_train, y_train_pred) # Calculate Recall
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred) # Calculate ROC AUC (from hard 0/1 predictions)


    # Test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred) # Calculate Accuracy
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted') # Calculate F1-score
    model_test_precision = precision_score(y_test, y_test_pred) # Calculate Precision
    model_test_recall = recall_score(y_test, y_test_pred) # Calculate Recall
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred) # Calculate ROC AUC (from hard 0/1 predictions)


    print(name)

    print('Model performance for Training set')
    print(f"- Accuracy: {model_train_accuracy:.4f}")
    print(f"- F1 score: {model_train_f1:.4f}")
    print(f"- Precision: {model_train_precision:.4f}")
    print(f"- Recall: {model_train_recall:.4f}")
    print(f"- Roc Auc Score: {model_train_rocauc_score:.4f}")


    print('-'*35)

    print('Model performance for Test set')
    print(f"- Accuracy: {model_test_accuracy:.4f}")
    print(f"- F1 score: {model_test_f1:.4f}")
    print(f"- Precision: {model_test_precision:.4f}")
    print(f"- Recall: {model_test_recall:.4f}")
    print(f"- Roc Auc Score: {model_test_rocauc_score:.4f}")


    print('='*35)
    print('\n')

Logistic Regression
Model performance for Training set
- Accuracy: 0.8453
- F1 score: 0.8163
- Precision: 0.7061
- Recall: 0.2749
- Roc Auc Score: 0.6245
-----------------------------------
Model performance for Test set
- Accuracy: 0.8304
- F1 score: 0.7981
- Precision: 0.7101
- Recall: 0.2663
- Roc Auc Score: 0.6195


Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
-----------------------------------
Model performance for Test set
- Accuracy: 0.9103
- F1 score: 0.9113
- Precision: 0.7602
- Recall: 0.8098
- Roc Auc Score: 0.8727


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
-----------------------------------
Model performance for Test set
- Accuracy: 0.9387
- F1 score: 0.9355
- Precision: 0.9571
- Recall: 0.7283
- Roc Auc Score: 0.8600


Gradient Boost
Model performance for Training

In [38]:
# Of the results shown above, Random Forest has the best test accuracy (0.9387), weighted F1 (0.9355), and precision (0.9571),
# although the Decision Tree has higher test recall (0.8098 vs. 0.7283). We select Random Forest as the best model.
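
To see how the reported metrics relate to one another, the sketch below recomputes them from a confusion matrix. The four counts are not printed anywhere in the notebook; they are inferred here as one set of counts consistent with the Random Forest test metrics above (914 test rows), so treat them as an assumption. The sketch also shows that `roc_auc_score` applied to hard 0/1 predictions, as in the loop above, reduces to the mean of recall and specificity, and that the weighted F1 is dominated by the majority class.

```python
# Assumed confusion counts, inferred from the reported Random Forest test metrics.
tp, fp, fn, tn = 134, 6, 50, 724
n_pos, n_neg = tp + fn, tn + fp

accuracy = (tp + tn) / (n_pos + n_neg)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# F1 for each class, then the support-weighted mean (average='weighted').
f1_pos = 2 * tp / (2 * tp + fp + fn)
f1_neg = 2 * tn / (2 * tn + fn + fp)
f1_weighted = (n_pos * f1_pos + n_neg * f1_neg) / (n_pos + n_neg)

# ROC AUC computed from hard labels equals the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1_pos={f1_pos:.4f} f1_weighted={f1_weighted:.4f} auc={auc_from_labels:.4f}")
```

Under these counts, the F1 for the purchasing class alone is about 0.83, noticeably lower than the weighted figure of about 0.94. Passing `model.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of the predicted labels would give the usual threshold-independent AUC.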

In [39]:
model = RandomForestClassifier()
model.fit(X_train,y_train)
with open("best_model.pkl","wb") as file:
    pickle.dump(model,file)
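
Saving the scaler and the model as two separate files means they must always be loaded together and applied in the right order. One alternative, sketched here on synthetic data with illustrative names, is to wrap both steps in a scikit-learn `Pipeline` and pickle a single object. Setting `random_state` also makes the saved model reproducible.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the travel dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Scaling and classification bundled into one estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_demo, y_demo)

# Round-trip through pickle in memory; a file opened in "wb"/"rb" mode works the same way.
restored = pickle.loads(pickle.dumps(pipeline))
same_predictions = bool((restored.predict(X_demo) == pipeline.predict(X_demo)).all())
print(same_predictions)
```

With a pipeline, the prediction step calls `restored.predict(raw_features)` directly on unscaled features, since the scaler is applied internally.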

#### New data prediction

In [40]:
CustomerID=200000
Age = 41.0
TypeofContact ="Self Enquiry"
CityTier =3
DurationOfPitch =6.0
Occupation ="Salaried"
Gender ="Female"
NumberOfPersonVisiting=3
NumberOfFollowups=3.0
ProductPitched="Deluxe"
PreferredPropertyStar=3.0
MaritalStatus="Single"
NumberOfTrips=1.0
Passport=1
PitchSatisfactionScore=2
OwnCar=1
NumberOfChildrenVisiting=0.0
Designation="Manager"
MonthlyIncome=20993.0

In [41]:
new_df = pd.DataFrame({
"CustomerID" : [CustomerID] ,
"Age" : [Age] ,
"TypeofContact" : [TypeofContact] ,
"CityTier" : [CityTier] ,
"DurationOfPitch" : [DurationOfPitch] ,
"Occupation" : [Occupation] ,
"Gender" : [Gender] ,
"NumberOfPersonVisiting" : [NumberOfPersonVisiting] ,
"NumberOfFollowups" : [NumberOfFollowups] ,
"ProductPitched" : [ProductPitched] ,
"PreferredPropertyStar" : [PreferredPropertyStar] ,
"MaritalStatus" : [MaritalStatus] ,
"NumberOfTrips" : [NumberOfTrips] ,
"Passport" : [Passport] ,
"PitchSatisfactionScore" : [PitchSatisfactionScore] ,
"OwnCar" : [OwnCar] ,
"NumberOfChildrenVisiting" : [NumberOfChildrenVisiting] ,
"Designation" : [Designation] ,
"MonthlyIncome" : [MonthlyIncome]})

In [42]:
new_df

,CustomerID,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0


In [43]:
new_df.drop(['CustomerID','NumberOfChildrenVisiting'], inplace=True, axis=1)
new_df['Occupation'] = new_df['Occupation'].map({'Free Lancer':0,'Small Business':2,'Salaried':3,'Large Business':1})
new_df['ProductPitched'] = new_df['ProductPitched'].map({'Deluxe':3, 'Basic':4, 'Standard':2, 'Super Deluxe':1, 'King':0})
new_df['MaritalStatus'] = new_df['MaritalStatus'].map({'Single':1, 'Divorced':2, 'Married':3, 'Unmarried':0})
new_df['Designation'] = new_df['Designation'].map({'Manager':3, 'Executive':4, 'Senior Manager':2, 'AVP':1, 'VP':0})

In [44]:
if new_df['TypeofContact'][0] == "Self Enquiry":
    new_df['TypeofContact_Self Enquiry'] = True
    new_df.drop("TypeofContact",inplace = True, axis = 1)
elif new_df['TypeofContact'][0] == "Company Invited":
    new_df['TypeofContact_Self Enquiry'] = False
    new_df.drop("TypeofContact",inplace = True, axis = 1)

In [45]:
if new_df['Gender'][0] == "Female":
    new_df['Gender_Male'] = False
    new_df.drop("Gender", inplace = True, axis = 1)

elif new_df['Gender'][0] == "Male":
    new_df['Gender_Male'] = True
    new_df.drop("Gender", inplace = True, axis = 1)
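
The two cells above rebuild the dummy columns by hand. A more general pattern, sketched below with a hypothetical helper `encode_like_training`, is to one-hot encode the new row and then reindex it to the training column list, so that the column set and order always match what the scaler and model were fitted on.

```python
import pandas as pd

# Column order the scaler and model were fitted on (df.columns above, minus the target).
TRAIN_COLUMNS = ['Age', 'CityTier', 'DurationOfPitch', 'Occupation',
                 'NumberOfPersonVisiting', 'NumberOfFollowups', 'ProductPitched',
                 'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport',
                 'PitchSatisfactionScore', 'OwnCar', 'Designation', 'MonthlyIncome',
                 'TypeofContact_Self Enquiry', 'Gender_Male']

def encode_like_training(frame):
    """One-hot encode the string columns, then align to the training columns."""
    # No drop_first here: with a single row it would drop the only category present.
    dummies = pd.get_dummies(frame, columns=['TypeofContact', 'Gender'])
    # Keep training columns only; dummy columns absent from this row become False.
    return dummies.reindex(columns=TRAIN_COLUMNS, fill_value=False)

# Same customer as above, after the integer mappings have been applied.
row = pd.DataFrame({
    'Age': [41.0], 'TypeofContact': ['Self Enquiry'], 'CityTier': [3],
    'DurationOfPitch': [6.0], 'Occupation': [3], 'Gender': ['Female'],
    'NumberOfPersonVisiting': [3], 'NumberOfFollowups': [3.0], 'ProductPitched': [3],
    'PreferredPropertyStar': [3.0], 'MaritalStatus': [1], 'NumberOfTrips': [1.0],
    'Passport': [1], 'PitchSatisfactionScore': [2], 'OwnCar': [1],
    'Designation': [3], 'MonthlyIncome': [20993.0],
})
aligned = encode_like_training(row)
print(aligned.iloc[0].to_dict())
```

Note that `fill_value=False` is applied to any training column missing from the input, so this sketch assumes all numeric columns are present in the new record.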

In [46]:
new_df

,Age,CityTier,DurationOfPitch,Occupation,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TypeofContact_Self Enquiry,Gender_Male
0,41.0,3,6.0,3,3,3.0,3,3.0,1,1.0,1,2,1,3,20993.0,True,False


In [47]:
new_df.shape

(1, 17)

In [48]:
X_train.shape

(3652, 17)

In [49]:
new_df = scaler.transform(new_df)
prediction = model.predict(new_df)

In [50]:
if prediction[0] == 1:
    print("The predicted result is that the customer will purchase the package.")
elif prediction[0] == 0:
    print("The predicted result is that the customer will not purchase the package.")

The predicted result is that the customer will purchase the package.
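
Since the business goal is to spend the marketing budget on the most promising customers, an estimated purchase probability is often more useful than a yes/no label. A minimal sketch on synthetic data (not the travel dataset; all variable names are illustrative) using `predict_proba`, which `RandomForestClassifier` provides:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced training data standing in for the real features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] > 0.8).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Column 1 of predict_proba is the estimated probability of class 1 (purchase).
candidates = rng.normal(size=(10, 5))
purchase_prob = clf.predict_proba(candidates)[:, 1]

# Contact order: highest estimated probability first.
contact_order = np.argsort(-purchase_prob)
print(contact_order[:3], purchase_prob[contact_order[:3]].round(2))
```

Ranking customers this way lets the marketing team contact only the top fraction of the list, and the cut-off can be tuned to trade precision against recall.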
