
**Hotel Booking Prediction - with Data Analysis and Logistic Regression**

Importing the Libraries:
To load the data, which is available as a CSV file, and manipulate it, we'll use **pandas**. For numerical operations on the data, we'll use **numpy**.

In [None]:
import numpy as np
import pandas as pd

Importing the dataset.

In [None]:
data = pd.read_csv('/kaggle/input/hotel-booking-demand/hotel_bookings.csv')

**Analysing the Data**

Let's first have a look at the dataset and try to get a sense of the information it contains.

In [None]:
data.head(10)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,...,No Deposit,,,0,Transient,107.0,0,0,Check-Out,2015-07-03
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,...,No Deposit,303.0,,0,Transient,103.0,0,1,Check-Out,2015-07-03
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,...,No Deposit,240.0,,0,Transient,82.0,0,1,Canceled,2015-05-06
9,Resort Hotel,1,75,2015,July,27,1,0,3,2,...,No Deposit,15.0,,0,Transient,105.5,0,0,Canceled,2015-04-22


In [None]:
data.shape

(119390, 32)

We can see that there are 32 features (columns) and 119390 records (rows) in our dataset.

Our main objective with this data is to predict whether a customer who makes a reservation will actually keep the booking or cancel it, within the constraints of our data.

Now that we have defined our objective, let's see which features (columns) won't be of any use to us in meeting it.

Upon inspecting, we can see that the following features won't be useful for our objective:
1. hotel - It doesn't matter which type of hotel the reservation is for; the main objective is to see whether the booking goes ahead at all
2. agent - The agent that got the reservation for us won't matter
3. company - The same logic applies to the company as to the agent
4. reservation_status_date - We have other features (like arrival_date_week_number, arrival_date_day_of_month, etc.) that give us the same information

Hence all these 4 columns need to be dropped from the data.

In [None]:
data.drop(inplace=True, axis=1, labels=['agent', 'company','hotel','reservation_status_date'])

Note:
* inplace = True - The changes will be reflected in the original dataframe
* axis = 1 - indicates that columns (rather than rows) are to be dropped
* labels = [...] - The names of the columns that need to be dropped
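
As a small illustration of the `inplace` note above, here is a sketch on a made-up three-column frame (the values are invented; only the column names echo the dataset). It uses the non-inplace form of `drop`, which returns a new dataframe and leaves the original untouched, so the two can be compared side by side.

```python
import pandas as pd

# Toy frame: column names mirror a few from the dataset, values are made up.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "agent": [304.0, None],
    "lead_time": [13, 7],
})

# Without inplace=True, drop() returns a new frame and `toy` is unchanged.
# columns=[...] is equivalent to labels=[...] combined with axis=1.
trimmed = toy.drop(columns=["hotel", "agent"])

print(list(toy.columns))      # the original still has all three columns
print(list(trimmed.columns))  # only 'lead_time' remains
```

Which form to use is mostly a matter of taste; the non-inplace version makes it easier to go back to the untouched data while experimenting.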

*P.S. Dropping these columns is based purely on my intuition, so you could just as well keep all of them and drop some others, or maybe none. It's recommended to play with the data and take an iterative approach to solving the problem.*

It will be interesting to have a look at all the unique values that every column contains

In [None]:
cols = data.columns
for i in cols:
    print('\n',i,'\n',data[i].unique(),'\n','-'*80)


 is_canceled 
 [0 1] 
 --------------------------------------------------------------------------------

 lead_time 
 [342 737   7  13  14   0   9  85  75  23  35  68  18  37  12  72 127  78
  48  60  77  99 118  95  96  69  45  40  15  36  43  70  16 107  47 113
  90  50  93  76   3   1  10   5  17  51  71  63  62 101   2  81 368 364
 324  79  21 109 102   4  98  92  26  73 115  86  52  29  30  33  32   8
 100  44  80  97  64  39  34  27  82  94 110 111  84  66 104  28 258 112
  65  67  55  88  54 292  83 105 280 394  24 103 366 249  22  91  11 108
 106  31  87  41 304 117  59  53  58 116  42 321  38  56  49 317   6  57
  19  25 315 123  46  89  61 312 299 130  74 298 119  20 286 136 129 124
 327 131 460 140 114 139 122 137 126 120 128 135 150 143 151 132 125 157
 147 138 156 164 346 159 160 161 333 381 149 154 297 163 314 155 323 340
 356 142 328 144 336 248 302 175 344 382 146 170 166 338 167 310 148 165
 172 171 145 121 178 305 173 152 354 347 158 185 349 183 352 177 200 192
 361 

Let's check for null values in the remaining dataset.

In [None]:
data.isnull().sum()

Unnamed: 0,0
is_canceled,0
lead_time,0
arrival_date_year,0
arrival_date_month,0
arrival_date_week_number,0
arrival_date_day_of_month,0
stays_in_weekend_nights,0
stays_in_week_nights,0
adults,0
children,4


As can be seen, the 'children' column has 4 null values; the 'country' column, which falls in the part of the output not shown here, has null values as well. We can deal with these by choosing one of the following methods:
1. Replacing the null values with the most frequent value in the column (for 'country', that would be the most frequent country).
2. Deleting the records (rows) that contain null values.
3. Developing a model to predict the null values from existing data.

All of the above are reasonable options for a dataset of this size. I chose the first one because it is the easiest to implement.

In [None]:
data.fillna(data.mode().iloc[0], inplace=True)

Note: 'data.mode().iloc[0]' holds the most frequent value of each column, and 'fillna()' replaces the 'NaN's in every column with that column's value.
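
To make the options listed above concrete, here is a minimal sketch on a made-up four-row frame (the values are invented for illustration). It shows option 1 (fill with the most frequent value) next to option 2 (delete the rows); option 3 is left out because it needs a separate model.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up).
toy = pd.DataFrame({
    "country": ["PRT", "PRT", np.nan, "GBR"],
    "children": [0.0, np.nan, 0.0, 1.0],
})

# Option 1: fill each column with its own most frequent value.
# toy.mode() returns a frame of modes; .iloc[0] picks the first mode per column.
filled = toy.fillna(toy.mode().iloc[0])

# Option 2: drop every row that contains at least one null.
dropped = toy.dropna()

print(filled)
print(len(dropped), "rows survive dropna()")
```

Filling keeps every record but nudges the column towards its most common value; dropping keeps the values honest but loses rows. With only a handful of nulls in a dataset of this size, either choice should have little effect.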

Let's check again for the null values and have a look at how our data looks now

In [None]:
data.isnull().sum()

Unnamed: 0,0
is_canceled,0
lead_time,0
arrival_date_year,0
arrival_date_month,0
arrival_date_week_number,0
arrival_date_day_of_month,0
stays_in_weekend_nights,0
stays_in_week_nights,0
adults,0
children,0


In [None]:
data.head()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,...,reserved_room_type,assigned_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status
0,0,342,2015,July,27,1,0,0,2,0.0,...,C,C,3,No Deposit,0,Transient,0.0,0,0,Check-Out
1,0,737,2015,July,27,1,0,0,2,0.0,...,C,C,4,No Deposit,0,Transient,0.0,0,0,Check-Out
2,0,7,2015,July,27,1,0,1,1,0.0,...,A,C,0,No Deposit,0,Transient,75.0,0,0,Check-Out
3,0,13,2015,July,27,1,0,1,1,0.0,...,A,A,0,No Deposit,0,Transient,75.0,0,0,Check-Out
4,0,14,2015,July,27,1,0,2,2,0.0,...,A,A,0,No Deposit,0,Transient,98.0,0,1,Check-Out


As we can see, there are only 28 features left, after we removed 4 columns from our data and there are no null values left in our data.

Let's now separate the dependent and independent variables from each other. The dependent variable, which we eventually need to predict, is the 'is_canceled' column, as it tells us whether that particular reservation was cancelled or not. If the reservation was cancelled, 'is_canceled' holds the value '1' for that record; otherwise it holds the value '0'. All the remaining columns are the independent variables.

In [None]:
X = data.iloc[:,1:]
y = data.iloc[:,0]
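
The positional split above relies on 'is_canceled' being the first column. As an alternative sketch (on a made-up three-row frame, not the real data), the same split can be done by column name, which keeps working even if the column order changes.

```python
import pandas as pd

# Toy frame; values are made up, column names follow the dataset.
toy = pd.DataFrame({
    "is_canceled": [0, 1, 0],
    "lead_time": [342, 85, 7],
    "adults": [2, 2, 1],
})

y_toy = toy["is_canceled"]                 # dependent variable (target)
X_toy = toy.drop(columns=["is_canceled"])  # independent variables (features)

print(X_toy.shape, y_toy.shape)
```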

Now, we can see that our data doesn't only have numerical values; it also has strings as values. Machine learning models, logistic regression included, compute on numbers (weighted sums, or distances such as Euclidean, Manhattan and Minkowski), so they require all the values to be numeric. Hence, we convert all the categorical variables (columns with string values) to numeric representations. To do that, we will be using One Hot Encoder.

In [None]:
# Importing relevant libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

In [None]:
#Implementing Column Transformer
ct = make_column_transformer(
    (OneHotEncoder(),['meal','distribution_channel','reservation_status','country','arrival_date_month','market_segment','deposit_type','customer_type', 'reserved_room_type','assigned_room_type' ]), remainder = 'passthrough'
    )

Here, the Column Transformer is given the One Hot Encoder and the list of all categorical columns; 'remainder = passthrough' keeps the other columns unchanged. Now, we simply need to apply fit and transform to our independent variables.

In [None]:
X = ct.fit_transform(X).toarray()

Please note that 'X' is no longer a dataframe; it has been changed to a numpy array, and the number of columns has increased from 27 (the 28 remaining columns minus the target) to 256. This is because the One Hot Encoder has converted each unique value of every categorical variable into its own dedicated column.
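
To see how this column expansion happens, here is a minimal sketch on a made-up frame with one categorical column and one numeric column (the meal codes and lead times are illustrative values, not taken from the full dataset).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one categorical column with three distinct values, one numeric column.
toy = pd.DataFrame({
    "meal": ["BB", "HB", "BB", "SC"],
    "lead_time": [342, 7, 13, 14],
})

toy_ct = make_column_transformer(
    (OneHotEncoder(), ["meal"]),
    remainder="passthrough",  # keep 'lead_time' as it is
)
encoded = toy_ct.fit_transform(toy)

# The result can be sparse or dense depending on how many zeros it holds,
# so convert only when a sparse matrix comes back.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded.shape)  # 3 one-hot columns for 'meal' + 1 passthrough column
print(encoded)
```

Two input columns become four: one per distinct 'meal' value, followed by the untouched 'lead_time'. A column like 'country', with many distinct values, contributes many more columns in the same way.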

In [None]:
X

array([[  1.  ,   0.  ,   0.  , ...,   0.  ,   0.  ,   0.  ],
       [  1.  ,   0.  ,   0.  , ...,   0.  ,   0.  ,   0.  ],
       [  1.  ,   0.  ,   0.  , ...,  75.  ,   0.  ,   0.  ],
       ...,
       [  1.  ,   0.  ,   0.  , ..., 157.71,   0.  ,   4.  ],
       [  1.  ,   0.  ,   0.  , ..., 104.4 ,   0.  ,   0.  ],
       [  0.  ,   0.  ,   1.  , ..., 151.2 ,   0.  ,   2.  ]])

In [None]:
y

Unnamed: 0,is_canceled
0,0
1,0
2,0
3,0
4,0
...,...
119385,0
119386,0
119387,0
119388,0


Perfect.

Now, we need to split our data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Note: We are splitting the data so that 20% of the records go to the test set and the remaining 80% to the training set. You can play with this number if you think it will have a serious impact on the prediction rate.
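
One knob worth knowing when playing with the split is `stratify`, which is not used in the cell above. The sketch below, on synthetic data with an assumed 60/40 class balance, shows that a stratified 80/20 split keeps the same class proportions in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 60 zeros and 40 ones.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=0,    # fixed seed so the split is reproducible
    stratify=y_toy,    # preserve the 60/40 class ratio in both parts
)

print(X_tr.shape, X_te.shape)
print("share of class 1 in the test set:", y_te.mean())
```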

Another important note to make here is that we just saw the number of features explode from 27 (the 28 remaining columns minus the target) to 256. That's a huge number. Generally, a large number of features in a dataset leads to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). It simply means that our model will have too much unnecessary information to process, which will eventually hamper its processing time and efficiency.

To avoid the curse of dimensionality, we use something known as [Dimensionality Reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) algorithms. One of the most widely used is [PCA - Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), and that is what we are going to use. However, one small requirement of PCA is that the data it is applied to should be on a standard scale, which can be achieved with sklearn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) as follows:

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print("X_train ---------->\n", X_train, "\nX_test -------->\n", X_test)

X_train ---------->
 [[ 0.54036534 -0.0823272  -0.37034568 ...  1.38185952 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ...  0.44713919 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ...  0.98096836 -0.25462991
   1.80114067]
 ...
 [ 0.54036534 -0.0823272  -0.37034568 ... -0.19470211 -0.25462991
   3.06166858]
 [ 0.54036534 -0.0823272  -0.37034568 ... -0.24455386 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ...  1.20945555 -0.25462991
  -0.71991517]] 
X_test -------->
 [[ 0.54036534 -0.0823272  -0.37034568 ... -1.33381462 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ... -0.32556295 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ... -0.82615762 -0.25462991
  -0.71991517]
 ...
 [-1.85059983 -0.0823272  -0.37034568 ...  0.37859303 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ... -0.14069604 -0.25462991
  -0.71991517]
 [ 0.54036534 -0.0823272  -0.37034568 ... -0.47304105 -0.2

As you can see, now all the values are in a standardised scale.
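
To make "standardised scale" concrete, here is a small sketch on made-up numbers. It follows the same pattern as the cell above: the scaler learns the mean and standard deviation from the training data only and then reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (values are made up).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from the training data
test_scaled = scaler.transform(test)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled)
```

Fitting on the training data only means no information from the test set influences the preprocessing.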

Now, we can safely implement PCA.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 100)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

Please note that upon running the PCA for the first time, set 'n_components' to 'None' and then evaluate the 'explained_variance' variable for choosing the optimal number of n_components. In this case, 100 should be fine.
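
The procedure described above can be sketched as follows. This runs on synthetic data (10 features generated from 3 hidden factors), and the 95% threshold is an assumption chosen for illustration, not a value taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 samples, 10 features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# First pass: keep every component and inspect the explained variance.
pca_full = PCA(n_components=None).fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed cut-off; choose what suits the problem
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print("components to keep:", n_keep)
```

On the real data, the same cumulative curve computed from the scaled training set would show how much variance the chosen 100 components retain. scikit-learn's PCA also accepts a float between 0 and 1 for 'n_components' and then picks the number of components needed to reach that share of variance.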

Now, we are finally done with everything else except fitting the Logistic Regression model on our data. Let's do that now.

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, max_iter=1000)
classifier.fit(X_train, y_train)

Now, let's see how our model performs on the test data

In [None]:
y_pred = classifier.predict(X_test)

To calculate the accuracy of our model, the simplest way is to construct a confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[14906,    28],
       [   21,  8923]])

Accuracy can be calculated as:

(14906 + 8923) (total number of correct predictions) / (14906 + 8923 + 28 + 21) (total number of predictions)

= 23829 / 23878 * 100

= 99.79%
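
The same arithmetic can be checked in code. The sketch below hard-codes the confusion matrix printed above; the correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total.

```python
import numpy as np

# Confusion matrix as printed above (rows: actual class, columns: predicted class).
cm = np.array([[14906, 28],
               [21, 8923]])

correct = np.trace(cm)  # diagonal entries: correctly predicted records
total = cm.sum()        # every prediction made on the test set
accuracy = correct / total

print(correct, total, round(accuracy * 100, 2))
```

With 'y_test' and 'y_pred' in memory, 'sklearn.metrics.accuracy_score(y_test, y_pred)' gives the same number directly.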

That's a GREAT accuracy rate.

BUT

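This is almost certainly too good to be true. Such a high accuracy on any dataset should always ring alarm bells. Because it was measured on the held-out test set, plain overfitting is unlikely to be the whole story; a more probable cause is data leakage, since the 'reservation_status' column (with values such as 'Check-Out' and 'Canceled') is still among the features and reveals the target directly.
I'll try to bring the accuracy down to a realistic score, but for now, the basics are all set up.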
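
One thing worth checking when accuracy is this high is whether any feature directly encodes the target; in the rows shown earlier, 'reservation_status' reads 'Canceled' exactly where 'is_canceled' is 1. The sketch below uses purely synthetic data (not the hotel dataset) to show what such a feature does to the test accuracy. The helper name `holdout_accuracy` is made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)           # random binary target
noise = rng.normal(size=(n, 3))               # features unrelated to the target
leaky = y_demo.reshape(-1, 1).astype(float)   # feature that simply copies the target

def holdout_accuracy(features):
    """Fit logistic regression on 80% of the rows and score it on the other 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y_demo, test_size=0.2, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_with_leak = holdout_accuracy(np.hstack([noise, leaky]))
acc_without = holdout_accuracy(noise)

print("with the leaky feature:   ", acc_with_leak)
print("without the leaky feature:", acc_without)
```

If dropping a single column from the real data makes the score fall from near-perfect to something much more modest, that column was most likely leaking the answer.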


**Hope this was useful. Please leave suggestions, mistakes or any other tips in the comments.**
