# Task 2

---

## Predictive modeling of customer bookings

This Jupyter notebook includes some code to get you started with this predictive modeling task. We will use various packages for data manipulation, feature engineering and machine learning.

### Exploratory data analysis

First, we must explore the data in order to better understand what we have and the statistical properties of the dataset.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/content/customer_booking.csv", encoding="ISO-8859-1")
df.head()

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,2,Internet,RoundTrip,262,19,7,Sat,AKLDEL,New Zealand,1,0,0,5.52,0
1,1,Internet,RoundTrip,112,20,3,Sat,AKLDEL,New Zealand,0,0,0,5.52,0
2,2,Internet,RoundTrip,243,22,17,Wed,AKLDEL,India,1,1,0,5.52,0
3,1,Internet,RoundTrip,96,31,4,Sat,AKLDEL,New Zealand,0,0,1,5.52,0
4,2,Internet,RoundTrip,68,22,15,Wed,AKLDEL,India,1,0,1,5.52,0


The `.head()` method shows the first 5 rows of the dataset, which is useful for a quick visual inspection of the columns.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   num_passengers         50000 non-null  int64  
 1   sales_channel          50000 non-null  object 
 2   trip_type              50000 non-null  object 
 3   purchase_lead          50000 non-null  int64  
 4   length_of_stay         50000 non-null  int64  
 5   flight_hour            50000 non-null  int64  
 6   flight_day             50000 non-null  object 
 7   route                  50000 non-null  object 
 8   booking_origin         50000 non-null  object 
 9   wants_extra_baggage    50000 non-null  int64  
 10  wants_preferred_seat   50000 non-null  int64  
 11  wants_in_flight_meals  50000 non-null  int64  
 12  flight_duration        50000 non-null  float64
 13  booking_complete       50000 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.3+ 

The `.info()` method gives us a data description, telling us the names of the columns, their data types and how many non-null values each contains. Fortunately, we have no null values. It looks like some of these columns should be converted into different data types, e.g. `flight_day`.

To provide more context, below is a more detailed data description, explaining exactly what each column means:

- `num_passengers` = number of passengers travelling
- `sales_channel` = sales channel booking was made on
- `trip_type` = trip Type (Round Trip, One Way, Circle Trip)
- `purchase_lead` = number of days between travel date and booking date
- `length_of_stay` = number of days spent at destination
- `flight_hour` = hour of flight departure
- `flight_day` = day of week of flight departure
- `route` = origin -> destination flight route
- `booking_origin` = country from where booking was made
- `wants_extra_baggage` = if the customer wanted extra baggage in the booking
- `wants_preferred_seat` = if the customer wanted a preferred seat in the booking
- `wants_in_flight_meals` = if the customer wanted in-flight meals in the booking
- `flight_duration` = total duration of flight (in hours)
- `booking_complete` = flag indicating if the customer completed the booking

Before we compute any statistics on the data, let's do any necessary data conversion.

In [None]:
df["flight_day"].unique()

array(['Sat', 'Wed', 'Thu', 'Mon', 'Sun', 'Tue', 'Fri'], dtype=object)

In [None]:
mapping = {
    "Mon": 1,
    "Tue": 2,
    "Wed": 3,
    "Thu": 4,
    "Fri": 5,
    "Sat": 6,
    "Sun": 7,
}

df["flight_day"] = df["flight_day"].map(mapping)

In [None]:
df["flight_day"].unique()

array([6, 3, 4, 1, 7, 2, 5])
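
One thing to keep in mind with a dictionary-based `.map()` is that values without a matching key become `NaN` rather than raising an error. The sketch below uses a small hypothetical series (not the booking data) to show a quick coverage check:

```python
import pandas as pd

# Hypothetical sample; "Sunday" is deliberately not a key in the mapping.
days = pd.Series(["Sat", "Wed", "Sunday"])
mapping = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

encoded = days.map(mapping)

# Values without a matching key come back as NaN instead of raising an error,
# so list them explicitly before relying on the encoded column.
unmapped = days[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean every value was covered by the mapping.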

In [None]:
df.describe()

Unnamed: 0,num_passengers,purchase_lead,length_of_stay,flight_hour,flight_day,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,1.59124,84.94048,23.04456,9.06634,3.81442,0.66878,0.29696,0.42714,7.277561,0.14956
std,1.020165,90.451378,33.88767,5.41266,1.992792,0.470657,0.456923,0.494668,1.496863,0.356643
min,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.67,0.0
25%,1.0,21.0,5.0,5.0,2.0,0.0,0.0,0.0,5.62,0.0
50%,1.0,51.0,17.0,9.0,4.0,1.0,0.0,0.0,7.57,0.0
75%,2.0,115.0,28.0,13.0,5.0,1.0,1.0,1.0,8.83,0.0
max,9.0,867.0,778.0,23.0,7.0,1.0,1.0,1.0,9.5,1.0


The `.describe()` method gives us a summary of descriptive statistics over the entire dataset (by default it covers only numeric columns). This gives us a quick overview of a few things such as the mean, min, max and overall distribution of each column.
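
Non-numeric columns can be summarised as well by passing `include="all"`. The following is a minimal sketch on a hypothetical toy frame, not the booking data:

```python
import pandas as pd

# Toy frame with one text column and one numeric column (illustrative values).
toy = pd.DataFrame({
    "route": ["AKLDEL", "AKLDEL", "AKLHGH"],
    "num_passengers": [2, 1, 2],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # adds unique/top/freq for text columns

print(full_summary)
```

For text columns the extra rows report the number of distinct values, the most frequent value and its frequency.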

From this point, you should continue exploring the dataset with some visualisations and other metrics that you think may be useful. Then, you should prepare your dataset for predictive modelling. Finally, you should train your machine learning model, evaluate it with performance metrics and output visualisations for the contributing variables. All of this analysis should be summarised in your single slide.

In [None]:
df['trip_type'].value_counts()

RoundTrip     49497
OneWay          387
CircleTrip      116
Name: trip_type, dtype: int64

In [None]:
from sklearn import preprocessing
label_encoder=preprocessing.LabelEncoder()

df['trip_type']=label_encoder.fit_transform(df['trip_type'])
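
`LabelEncoder` assigns integer codes in sorted label order, which is why `RoundTrip` appears as `2` later on. A small sketch on a hand-written list (the variable names are illustrative) shows how to read the mapping back from a fitted encoder:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["RoundTrip", "OneWay", "CircleTrip", "RoundTrip"])

# classes_ holds the labels in sorted order; the position is the assigned code.
code_for_label = {label: code for code, label in enumerate(encoder.classes_)}
print(code_for_label)
```

Keeping this dictionary around makes it easier to interpret the encoded column when reading model outputs.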

In [None]:
df['sales_channel'].value_counts()

Internet    44382
Mobile       5618
Name: sales_channel, dtype: int64

In [None]:
df['sales_channel']=df['sales_channel'].map({'Internet':1, 'Mobile':0})

In [None]:
df['booking_origin'].value_counts()

Australia               17872
Malaysia                 7174
South Korea              4559
Japan                    3885
China                    3387
                        ...  
Panama                      1
Tonga                       1
Tanzania                    1
Bulgaria                    1
Svalbard & Jan Mayen        1
Name: booking_origin, Length: 104, dtype: int64

In [None]:
df['booking_origin'].unique()

array(['New Zealand', 'India', 'United Kingdom', 'China', 'South Korea',
       'Japan', 'Malaysia', 'Singapore', 'Switzerland', 'Germany',
       'Indonesia', 'Czech Republic', 'Vietnam', 'Thailand', 'Spain',
       'Romania', 'Ireland', 'Italy', 'Slovakia', 'United Arab Emirates',
       'Tonga', 'Réunion', '(not set)', 'Saudi Arabia', 'Netherlands',
       'Qatar', 'Hong Kong', 'Philippines', 'Sri Lanka', 'France',
       'Croatia', 'United States', 'Laos', 'Hungary', 'Portugal',
       'Cyprus', 'Australia', 'Cambodia', 'Poland', 'Belgium', 'Oman',
       'Bangladesh', 'Kazakhstan', 'Brazil', 'Turkey', 'Kenya', 'Taiwan',
       'Brunei', 'Chile', 'Bulgaria', 'Ukraine', 'Denmark', 'Colombia',
       'Iran', 'Bahrain', 'Solomon Islands', 'Slovenia', 'Mauritius',
       'Nepal', 'Russia', 'Kuwait', 'Mexico', 'Sweden', 'Austria',
       'Lebanon', 'Jordan', 'Greece', 'Mongolia', 'Canada', 'Tanzania',
       'Peru', 'Timor-Leste', 'Argentina', 'New Caledonia', 'Macau',
       'Myanmar (

In [None]:
df['booking_origin']=df['booking_origin'].map({'New Zealand':0,
                                               'India':1,
                                               'United Kingdom':2,
                                               'China':3,
                                               'South Korea':4,
                                                'Japan':5,
                                                'Malaysia':6,
                                                'Singapore':7,
                                                'Switzerland':8,
                                                'Germany':9,
                                                'Indonesia':10, 'Czech Republic':11, 'Vietnam':12, 'Thailand':13, 'Spain':14,
                                                'Romania':15, 'Ireland':16, 'Italy':17, 'Slovakia':18, 'United Arab Emirates':19,
                                                'Tonga':20, 'Réunion':21, '(not set)':22, 'Saudi Arabia':23, 'Netherlands':24,
                                                'Qatar':25, 'Hong Kong':26, 'Philippines':27, 'Sri Lanka':28, 'France':29,
                                                'Croatia':30, 'United States':31, 'Laos':32, 'Hungary':33, 'Portugal':34,
                                                'Cyprus':35, 'Australia':36, 'Cambodia':37, 'Poland':38, 'Belgium':39, 'Oman':40,
                                                'Bangladesh':41, 'Kazakhstan':42, 'Brazil':43, 'Turkey':44, 'Kenya':45, 'Taiwan':46,
                                                'Brunei':47, 'Chile':48, 'Bulgaria':49, 'Ukraine':50, 'Denmark':51, 'Colombia':52,
                                                'Iran':53, 'Bahrain':54, 'Solomon Islands':55, 'Slovenia':56, 'Mauritius':57,
                                                'Nepal':58, 'Russia':59, 'Kuwait':60, 'Mexico':61, 'Sweden':62, 'Austria':63,
                                                'Lebanon':64, 'Jordan':65, 'Greece':66, 'Mongolia':67, 'Canada':68, 'Tanzania':69,
                                                'Peru':70, 'Timor-Leste':71, 'Argentina':72, 'New Caledonia':73, 'Macau':74,
                                                'Myanmar (Burma)':75, 'Norway':76, 'Panama':77, 'Bhutan':78, 'Norfolk Island':79,
                                                'Finland':80, 'Nicaragua':81, 'Maldives':82, 'Egypt':83, 'Israel':84, 'Tunisia':85,
                                                'South Africa':86, 'Papua New Guinea':87, 'Paraguay':88, 'Estonia':89,
                                                'Seychelles':90, 'Afghanistan':91, 'Guam':92, 'Czechia':93, 'Malta':94, 'Vanuatu':95,
                                                'Belarus':96, 'Pakistan':97, 'Iraq':98, 'Ghana':99, 'Gibraltar':100, 'Guatemala':101,
                                                'Algeria':102, 'Svalbard & Jan Mayen':103})
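
Long hand-written mappings are easy to get wrong, so it may help to either generate the codes or check them. The sketch below, on hypothetical sample values, uses `pd.factorize` to assign codes in order of first appearance and then flags duplicate codes in an illustrative dictionary:

```python
import pandas as pd

# Hypothetical sample of origins (not the full column).
origins = pd.Series(["New Zealand", "India", "New Zealand", "China"])

# factorize returns one code per row plus the unique labels in order of appearance.
codes, categories = pd.factorize(origins)

# Illustrative hand-written mapping in which two labels share a code by mistake.
hand_mapping = {"A": 0, "B": 1, "C": 1}
code_series = pd.Series(hand_mapping)
clashes = code_series[code_series.duplicated(keep=False)]
print(clashes)
```

Any label listed in `clashes` shares its integer code with at least one other label.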

In [None]:
df.head()

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,2,1,2,262,19,7,6,AKLDEL,0,1,0,0,5.52,0
1,1,1,2,112,20,3,6,AKLDEL,0,0,0,0,5.52,0
2,2,1,2,243,22,17,3,AKLDEL,1,1,1,0,5.52,0
3,1,1,2,96,31,4,6,AKLDEL,0,0,0,1,5.52,0
4,2,1,2,68,22,15,3,AKLDEL,1,1,0,1,5.52,0


In [None]:
df['route'].unique()

array(['AKLDEL', 'AKLHGH', 'AKLHND', 'AKLICN', 'AKLKIX', 'AKLKTM',
       'AKLKUL', 'AKLMRU', 'AKLPEK', 'AKLPVG', 'AKLTPE', 'AORICN',
       'AORKIX', 'AORKTM', 'AORMEL', 'BBIMEL', 'BBIOOL', 'BBIPER',
       'BBISYD', 'BDOCTS', 'BDOCTU', 'BDOHGH', 'BDOICN', 'BDOIKA',
       'BDOKIX', 'BDOMEL', 'BDOOOL', 'BDOPEK', 'BDOPER', 'BDOPUS',
       'BDOPVG', 'BDOSYD', 'BDOTPE', 'BDOXIY', 'BKICKG', 'BKICTS',
       'BKICTU', 'BKIHND', 'BKIICN', 'BKIKIX', 'BKIKTM', 'BKIMEL',
       'BKIMRU', 'BKIOOL', 'BKIPEK', 'BKIPER', 'BKIPUS', 'BKIPVG',
       'BKISYD', 'BKIXIY', 'BLRICN', 'BLRMEL', 'BLRPER', 'BLRSYD',
       'BOMMEL', 'BOMOOL', 'BOMPER', 'BOMSYD', 'BTJJED', 'BTUICN',
       'BTUPER', 'BTUSYD', 'BTUWUH', 'BWNCKG', 'BWNDEL', 'BWNHGH',
       'BWNIKA', 'BWNKTM', 'BWNMEL', 'BWNOOL', 'BWNPER', 'BWNSYD',
       'BWNTPE', 'CANDEL', 'CANIKA', 'CANMEL', 'CANMRU', 'CANOOL',
       'CANPER', 'CANSYD', 'CCUMEL', 'CCUMRU', 'CCUOOL', 'CCUPER',
       'CCUSYD', 'CCUTPE', 'CEBMEL', 'CEBOOL', 'CEBPER', 'CEBS

In [None]:
from sklearn import preprocessing
label_encoder=preprocessing.LabelEncoder()

df['route']=label_encoder.fit_transform(df['route'])

In [None]:
df.head()

Unnamed: 0,num_passengers,sales_channel,trip_type,purchase_lead,length_of_stay,flight_hour,flight_day,route,booking_origin,wants_extra_baggage,wants_preferred_seat,wants_in_flight_meals,flight_duration,booking_complete
0,2,1,2,262,19,7,6,0,0,1,0,0,5.52,0
1,1,1,2,112,20,3,6,0,0,0,0,0,5.52,0
2,2,1,2,243,22,17,3,0,1,1,1,0,5.52,0
3,1,1,2,96,31,4,6,0,0,0,0,1,5.52,0
4,2,1,2,68,22,15,3,0,1,1,0,1,5.52,0


In [None]:
df['route'].unique()

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  36,  37,  38,  39,  41,  42,
        43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
        56,  57,  58,  59,  60,  61,  62,  64,  65,  66,  67,  68,  69,
        70,  71,  72,  73,  74,  75,  76,  77,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 121, 122, 125,
       126, 127, 129, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140,
       141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
       154, 155, 157, 158, 159, 160, 161, 162, 163, 165, 166, 167, 170,
       171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
       185, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197, 19

In [None]:
df['booking_complete'].value_counts()

0    42522
1     7478
Name: booking_complete, dtype: int64

The counts above show that the classes are imbalanced: only 7,478 of 50,000 bookings (about 15%) were completed. We therefore rebalance the classes before training.
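
The degree of imbalance can be quantified directly from these counts. A minimal sketch, re-entering the two counts shown above by hand:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 42522, 1: 7478}, name="booking_complete")

shares = counts / counts.sum()           # fraction of rows per class
imbalance_ratio = counts[0] / counts[1]  # majority rows per minority row

print(shares.round(4).to_dict(), round(imbalance_ratio, 2))
```

On the real frame, `df['booking_complete'].value_counts(normalize=True)` gives the same shares in one call.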

In [None]:
!pip install imbalanced-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import imblearn

In [None]:
df['flight_duration'].unique()

array([5.52, 5.07, 7.57, 6.62, 7.  , 4.75, 8.83, 7.42, 6.42, 5.33, 4.67,
       5.62, 8.58, 8.67, 4.72, 8.15, 6.33, 5.  , 4.83, 9.5 , 5.13])

In [None]:
df.isna().sum()

num_passengers           0
sales_channel            0
trip_type                0
purchase_lead            0
length_of_stay           0
flight_hour              0
flight_day               0
route                    0
booking_origin           0
wants_extra_baggage      0
wants_preferred_seat     0
wants_in_flight_meals    0
flight_duration          0
booking_complete         0
dtype: int64

In [None]:
X=df.drop(['booking_complete'], axis =1)
y=df['booking_complete']

In [None]:
from imblearn.combine import SMOTEENN

smt = SMOTEENN(sampling_strategy='all')
X_smt, y_smt = smt.fit_resample(X,y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_smt.values, y_smt.values, test_size=0.2)
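
Resampling before the split means that the test set also contains resampled points, which can make test scores look better than they would on untouched data. A common alternative is to split first and rebalance only the training part. The sketch below illustrates that ordering under two assumptions: it uses synthetic data from `make_classification`, and it swaps in plain random oversampling via `sklearn.utils.resample` instead of SMOTEENN to stay dependency-light.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data with roughly 15% positives (not the booking dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.85, 0.15], random_state=0
)

# 1. Split first, stratified so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 2. Rebalance the training part only (random oversampling of the minority class).
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print(np.bincount(y_tr_bal), round(float(y_te.mean()), 3))
```

The test part keeps the original class ratio, so metrics computed on it reflect the distribution the model would face in practice.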

In [None]:
X_train.shape, y_train.shape

((49748, 13), (49748,))

In [None]:
from sklearn.preprocessing import StandardScaler
std = StandardScaler()

X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
for lr in [0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.11,0.12,0.13,0.14,0.15,0.2,0.5,0.7,1]:
  xgb=XGBClassifier(learning_rate=lr,n_estimators=100)
  xgb.fit(X_train, y_train)
  print("Learning rate : ", lr, " Train score : ", xgb.score(X_train,y_train), " Cross-Val score : ", np.mean(cross_val_score(xgb, X_train, y_train, cv=10)))

Learning rate :  0.01  Train score :  0.7859009407413363  Cross-Val score :  0.7854187584484545
Learning rate :  0.02  Train score :  0.7965747366728311  Cross-Val score :  0.7965548692396441
Learning rate :  0.03  Train score :  0.8014191525287448  Cross-Val score :  0.801057539405916
Learning rate :  0.04  Train score :  0.8081128889603603  Cross-Val score :  0.8068466942674772
Learning rate :  0.05  Train score :  0.815630779126799  Cross-Val score :  0.8138017227270247
Learning rate :  0.06  Train score :  0.8229074535659725  Cross-Val score :  0.8212592152560146
Learning rate :  0.07  Train score :  0.8307469647020985  Cross-Val score :  0.8266061873501
Learning rate :  0.08  Train score :  0.831169092224813  Cross-Val score :  0.8323551048366076
Learning rate :  0.09  Train score :  0.8413403553911715  Cross-Val score :  0.8373603401001792
Learning rate :  0.1  Train score :  0.8423655222320495  Cross-Val score :  0.840958467447814
Learning rate :  0.11  Train score :  0.84915976
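
The same kind of sweep can also be expressed with `GridSearchCV`, which runs the cross-validation and records the best setting. The sketch below is an illustration only: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, with a deliberately small grid so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem (not the booking dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={"learning_rate": [0.01, 0.1, 0.5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_params_ holds the setting with the highest mean cross-validation score.
best_lr = search.best_params_["learning_rate"]
print(best_lr, round(search.best_score_, 3))
```

`search.cv_results_` keeps the per-setting scores if the full table is needed.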

In [None]:
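# The cross-validation score was highest at a learning rate of 0.7, so use learning_rate=0.7.
# XGBClassifier names this parameter `learning_rate`; `lr` is not a recognised argument.
xgb = XGBClassifier(learning_rate=0.7, n_estimators=100)
xgb.fit(X_train, y_train)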

XGBClassifier(lr=0.7)

In [None]:
y_pred = xgb.predict(X_test)

In [None]:
df1=pd.DataFrame({'Predicted':y_pred, 'Actual':y_test})
df1.head()

Unnamed: 0,Predicted,Actual
0,0,0
1,0,0
2,0,0
3,0,0
4,0,1
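
A predicted-versus-actual table like this can be condensed into a confusion matrix with `pd.crosstab`. A minimal sketch on made-up labels (not the model's real predictions):

```python
import pandas as pd

# Hypothetical predictions and ground truth, for illustration only.
results = pd.DataFrame({
    "Predicted": [0, 0, 1, 1, 0, 1],
    "Actual":    [0, 0, 1, 0, 1, 1],
})

# Rows are the true classes, columns the predicted classes.
confusion = pd.crosstab(results["Actual"], results["Predicted"])
print(confusion)
```

The off-diagonal cells count the two kinds of error separately, which a single accuracy figure hides.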


In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.836375331671625

In [None]:
from sklearn.metrics import roc_auc_score
# roc_auc_score expects the true labels first, then the scores or predictions
auc1 = roc_auc_score(y_test, y_pred)
auc1

0.8358682453845504
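
Unlike accuracy, `roc_auc_score` is not symmetric in its arguments, and it is usually computed from class probabilities (for example from `predict_proba`) rather than hard labels. The toy example below uses made-up labels and probabilities to show that the three variants can give different numbers:

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 1]
y_proba = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard labels obtained by thresholding the probabilities at 0.5.
y_label = [int(p >= 0.5) for p in y_proba]

auc_from_proba = roc_auc_score(y_true, y_proba)   # true labels first, scores second
auc_from_labels = roc_auc_score(y_true, y_label)  # hard labels discard ranking information
auc_swapped = roc_auc_score(y_label, y_true)      # arguments in the wrong order

print(auc_from_proba, auc_from_labels, auc_swapped)
```

With a fitted classifier, the probabilities would typically come from `model.predict_proba(X_test)[:, 1]`, where `model` stands for whatever estimator is in use.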

In [None]:
# Get numerical feature importances
importances = list(xgb.feature_importances_)

# Pair each feature name with its importance (X holds the features only, without the target)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X.columns, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print out each feature and its importance
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: booking_origin       Importance: 0.33000001311302185
Variable: wants_in_flight_meals Importance: 0.11999999731779099
Variable: flight_duration      Importance: 0.10999999940395355
Variable: flight_day           Importance: 0.10000000149011612
Variable: wants_preferred_seat Importance: 0.09000000357627869
Variable: route                Importance: 0.07000000029802322
Variable: length_of_stay       Importance: 0.05999999865889549
Variable: num_passengers       Importance: 0.05000000074505806
Variable: purchase_lead        Importance: 0.03999999910593033
Variable: flight_hour          Importance: 0.029999999329447746
Variable: sales_channel        Importance: 0.009999999776482582
Variable: trip_type            Importance: 0.0
Variable: wants_extra_baggage  Importance: 0.0
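
For reporting, the same ranking can be held in a pandas Series, which sorts and slices more conveniently than a list of tuples. The sketch below re-enters a few of the printed importances by hand, rounded to two decimals, purely as an illustration:

```python
import pandas as pd

# A hand-copied subset of the importances printed above (rounded).
importance = pd.Series({
    "sales_channel": 0.01,
    "flight_duration": 0.11,
    "booking_origin": 0.33,
    "wants_in_flight_meals": 0.12,
}).sort_values(ascending=False)

top_two = importance.head(2)
print(top_two)
```

With a fitted model, the Series could be built from `feature_importances_` with the feature names as the index, and `importance.plot.barh()` would then give a quick chart for the summary slide.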
