This is Part 1 of a multi-part series exploring a dataset of hotel bookings.

I found the original hotel booking demand data at <a href="https://www.sciencedirect.com/science/article/pii/S2352340918315191" target="_blank">ScienceDirect</a>. It's also available in a slightly more processed form on <a href="https://www.kaggle.com/jessemostipak/hotel-booking-demand" target="_blank">Kaggle</a>.

Let's import the data into Python and see what we have.

In [1]:
# Import pandas 
# so we can view the data as a dataframe
import pandas as pd 

# Read the resort hotel data (H1.csv) 
# into a dataframe 
# named resort_hotel_data
resort_hotel_data = pd.read_csv('H1.csv')

# Display the resort_hotel_data dataframe
display(resort_hotel_data)

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40055,0,212,2017,August,35,31,2,8,2,1,...,No Deposit,143,,0,Transient,89.75,0,0,Check-Out,2017-09-10
40056,0,169,2017,August,35,30,2,9,2,0,...,No Deposit,250,,0,Transient-Party,202.27,0,1,Check-Out,2017-09-10
40057,0,204,2017,August,35,29,4,10,2,0,...,No Deposit,250,,0,Transient,153.57,0,3,Check-Out,2017-09-12
40058,0,211,2017,August,35,31,4,10,2,0,...,No Deposit,40,,0,Contract,112.80,0,1,Check-Out,2017-09-14


Okay. Good. Right away we know we have 40,060 records, each of which contains 31 variables.

Let's see what we get when we read the city hotel data into a dataframe.

In [2]:
# Read the city hotel data (H2.csv) 
# into a dataframe 
# named city_hotel_data
city_hotel_data = pd.read_csv('H2.csv')

# Display the city_hotel_data dataframe
display(city_hotel_data)

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,6,2015,July,27,1,0,2,1,0.0,...,No Deposit,6,,0,Transient,0.00,0,0,Check-Out,2015-07-03
1,1,88,2015,July,27,1,0,4,2,0.0,...,No Deposit,9,,0,Transient,76.50,0,1,Canceled,2015-07-01
2,1,65,2015,July,27,1,0,4,1,0.0,...,No Deposit,9,,0,Transient,68.00,0,1,Canceled,2015-04-30
3,1,92,2015,July,27,1,2,4,2,0.0,...,No Deposit,9,,0,Transient,76.50,0,2,Canceled,2015-06-23
4,1,100,2015,July,27,2,0,2,2,0.0,...,No Deposit,9,,0,Transient,76.50,0,1,Canceled,2015-04-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,2017,August,35,30,2,5,2,0.0,...,No Deposit,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06
79326,0,102,2017,August,35,31,2,5,3,0.0,...,No Deposit,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07
79327,0,34,2017,August,35,31,2,5,2,0.0,...,No Deposit,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07
79328,0,109,2017,August,35,31,2,5,2,0.0,...,No Deposit,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07


And we have 79,330 records of city hotel data.

Based on the first paragraph in the <a href="https://www.sciencedirect.com/science/article/pii/S2352340918315191" target="_blank">Hotel booking demand datasets</a> article, this is exactly what we should expect:
* 40,060 records of resort hotel data
* 79,330 records of city hotel data

This means that, when we combine the two (currently) separate dataframes into a single dataframe, we should end up with one dataframe with 119,390 records.

Before we combine the two dataframes into one larger dataframe, let's take a closer look at the <a href="https://www.sciencedirect.com/science/article/pii/S2352340918315191" target="_blank">first paragraph in the article</a>. The first paragraph also states that "Both datasets share the same structure, with 31 variables . . ."

I take this to mean that the two datasets should each have 31 variables, the variables should have the same name and datatype in each dataset, and the variables should be in the same order in each dataset.

Let's double check the variable names, datatypes, and order in each dataset.

Here are the variables for the resort hotel dataframe.

In [3]:
# Display all of the columns 
# in the resort hotel dataframe
resort_hotel_data.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40060 entries, 0 to 40059
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   IsCanceled                   40060 non-null  int64  
 1   LeadTime                     40060 non-null  int64  
 2   ArrivalDateYear              40060 non-null  int64  
 3   ArrivalDateMonth             40060 non-null  object 
 4   ArrivalDateWeekNumber        40060 non-null  int64  
 5   ArrivalDateDayOfMonth        40060 non-null  int64  
 6   StaysInWeekendNights         40060 non-null  int64  
 7   StaysInWeekNights            40060 non-null  int64  
 8   Adults                       40060 non-null  int64  
 9   Children                     40060 non-null  int64  
 10  Babies                       40060 non-null  int64  
 11  Meal                         40060 non-null  object 
 12  Country                      39596 non-null  object 
 13  MarketSegment   

And here are the variables for the city hotel dataframe.

In [4]:
# Display all of the columns 
# in the city hotel dataframe
city_hotel_data.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79330 entries, 0 to 79329
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   IsCanceled                   79330 non-null  int64  
 1   LeadTime                     79330 non-null  int64  
 2   ArrivalDateYear              79330 non-null  int64  
 3   ArrivalDateMonth             79330 non-null  object 
 4   ArrivalDateWeekNumber        79330 non-null  int64  
 5   ArrivalDateDayOfMonth        79330 non-null  int64  
 6   StaysInWeekendNights         79330 non-null  int64  
 7   StaysInWeekNights            79330 non-null  int64  
 8   Adults                       79330 non-null  int64  
 9   Children                     79326 non-null  float64
 10  Babies                       79330 non-null  int64  
 11  Meal                         79330 non-null  object 
 12  Country                      79306 non-null  object 
 13  MarketSegment   

When I compare the info for the two dataframes, I can see that both dataframes have the same:
* number of columns (31)
* column names in the same order

That's all good, but something's off with the datatypes. The resort hotel data has 1 float, 17 ints, and 13 objects, and the city hotel data has 2 floats, 16 ints, and 13 objects.

After reviewing the info for each dataframe again, it looks like the difference is the Children column. The resort hotel dataframe info indicates the Children column is an *int* while the city hotel dataframe info indicates that the Children column is a *float*.
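Rather than comparing the two info listings by eye, the column labels and datatypes can also be compared directly. Here is a minimal sketch of that idea. It uses two small stand-in dataframes (`resort_sample` and `city_sample`, with made-up values) instead of the real H1 and H2 data, so treat it as an illustration rather than the exact check used in this post.

```python
import pandas as pd

# Small stand-in dataframes (hypothetical values, not the real H1/H2 data)
resort_sample = pd.DataFrame({
    "Adults": [2, 2, 1],
    "Children": [0, 1, 0],
    "Meal": ["BB", "BB", "HB"],
})
city_sample = pd.DataFrame({
    "Adults": [1, 2, 2],
    "Children": [0.0, None, 2.0],  # the missing value makes this column a float
    "Meal": ["HB", "BB", "BB"],
})

# Are the column labels the same, and in the same order?
same_columns = resort_sample.columns.equals(city_sample.columns)

# Put the datatypes side by side and keep only the rows that differ
dtype_comparison = pd.DataFrame({
    "resort": resort_sample.dtypes,
    "city": city_sample.dtypes,
})
mismatched = dtype_comparison[dtype_comparison["resort"] != dtype_comparison["city"]]

print(same_columns)
print(mismatched)
```

With the real dataframes, the same comparison would use `resort_hotel_data` and `city_hotel_data` in place of the two samples.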

Okay. Well, the best way I can think of to figure out what datatype the Children variable *should* be is to take a look at the data.

Let's look at the first 5 rows of the resort hotel dataframe.

In [5]:
# Display the first 5 rows 
# of the resort hotel dataframe
resort_hotel_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


When I scroll to the right, all I see are zeroes with no decimal places in the Children column. This makes me think the Children variable should be an int, but I'm not sure yet.

Let's look at the first five rows of the city hotel dataframe to see if that reveals anything else.

In [6]:
# Display the first 5 rows 
# of the city hotel dataframe
city_hotel_data.head()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,6,2015,July,27,1,0,2,1,0.0,...,No Deposit,6,,0,Transient,0.0,0,0,Check-Out,2015-07-03
1,1,88,2015,July,27,1,0,4,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,1,Canceled,2015-07-01
2,1,65,2015,July,27,1,0,4,1,0.0,...,No Deposit,9,,0,Transient,68.0,0,1,Canceled,2015-04-30
3,1,92,2015,July,27,1,2,4,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,2,Canceled,2015-06-23
4,1,100,2015,July,27,2,0,2,2,0.0,...,No Deposit,9,,0,Transient,76.5,0,1,Canceled,2015-04-02


Ah, okay. It looks like the city hotel dataframe is storing the number of children as a float because each of the values in the first five rows is 0.0, and the *point* in "zero point zero" is what makes this a float. 

Of course, there are no "fractions" of children that show up to a hotel, but I just want to make sure there isn't perhaps an error in one or more of the Children records. 

Let's look at all of the unique values in the Children column in each dataframe, starting with the resort hotel dataframe.

In [7]:
# What are the unique values 
# in the Children column 
# of the resort hotel dataframe?
resort_hotel_data['Children'].unique()    

array([ 0,  1,  2, 10,  3], dtype=int64)

Okay, the resort hotel data only has 5 unique values in the Children column (0, 1, 2, 3, and 10). In other words, some parties showed up with no children, some with 1 child, some with 2, and so on.

Now let's look at the unique Children values in the city hotel dataframe.

In [8]:
# What are the unique values 
# in the Children column 
# of the city hotel dataframe?
city_hotel_data['Children'].unique()   

array([ 0.,  1.,  2., nan,  3.])

Hmm . . . well, none of these values need to be floats because 0, 1, 2, and 3 are all whole numbers.

But we must be missing some values in the Children column of the city hotel dataframe because we're getting that nan (**n**ot **a** **n**umber) value.
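That missing value is also the likely reason the column was read in as a float. A standard integer column in pandas has no way to store NaN, so a numeric column with a gap in it falls back to float64. A small sketch with made-up values (not the hotel data) shows the effect:

```python
import numpy as np
import pandas as pd

# The same whole numbers, once without and once with a missing value
complete = pd.Series([0, 1, 2, 3])
with_gap = pd.Series([0, 1, 2, np.nan, 3])

print(complete.dtype)          # an integer dtype
print(with_gap.dtype)          # float64, because NaN is a float value
print(with_gap.isna().sum())   # number of missing values
```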

Let's see how many nan values we actually have.

In [9]:
# How many NaN values are there 
# in the Children column 
# of the city hotel dataframe?
city_hotel_data['Children'].isna().sum()   

4

Ah, okay. Only 4 of the values are nan. I'm inclined to fix this before I combine the two dataframes.
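The Children column isn't the only one with gaps (the info output above shows fewer non-null values for Country, too). One way to see every column with missing values at once is to count NaNs across the whole dataframe. This is a sketch on a small stand-in dataframe named `bookings_sample`, with invented values:

```python
import numpy as np
import pandas as pd

# Stand-in dataframe (hypothetical values) with gaps in two columns
bookings_sample = pd.DataFrame({
    "Adults": [2, 2, 3, 2],
    "Children": [0.0, np.nan, np.nan, 1.0],
    "Country": ["PRT", None, "GBR", "ESP"],
})

# Count missing values per column, then keep only columns that have any
missing_per_column = bookings_sample.isna().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]

print(columns_with_gaps)
```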

To fix this issue, I want to fill those four missing values either with the most common value in the Children column (the <a href="http://gettoknowdata.com/2021/10/13/mode/" target="_blank">mode</a>), with the <a href="http://gettoknowdata.com/2021/10/01/describing-data-with-an-arithmetic-mean/" target="_blank">average</a> of the values in the Children column, or with the <a href="http://gettoknowdata.com/2021/10/06/median/" target="_blank">median</a> of the values in the Children column.

Let's count how many times 0, 1, 2, and 3 appear in the Children column of the city hotel dataframe.

In [10]:
# Of the non-NaN values 
# in the Children column 
# of the city hotel dataframe,
# how many are there of each?
city_hotel_data['Children'].value_counts()   

0.0    74220
1.0     3023
2.0     2024
3.0       59
Name: Children, dtype: int64

Well there you go. Of the 79,326 customer reservations in the Children column, 74,220 (or about 94% of them) did not include any children.
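To double-check that percentage, here is a quick sketch that rebuilds the counts from the output above and divides each one by the total. On the real column, `value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
children_counts = pd.Series({0: 74220, 1: 3023, 2: 2024, 3: 59})

# Share of reservations for each number of children
shares = children_counts / children_counts.sum()

print(children_counts.sum())
print(shares.round(4))
```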

Let's just make sure 0 is the <a href="http://gettoknowdata.com/2021/10/13/mode/" target="_blank">mode</a>.

In [11]:
# What is the mode of the values 
# in the Children column 
# of the city hotel dataframe?
city_hotel_data['Children'].mode()   

0    0.0
dtype: float64

There you have it. Even Python is telling us 0.0 is the <a href="http://gettoknowdata.com/2021/10/13/mode/" target="_blank">mode</a>.

I'm already sure I want to fill those nan values with 0, but let's just check what the <a href="http://gettoknowdata.com/2021/10/01/describing-data-with-an-arithmetic-mean/" target="_blank">average</a> number of children is in the city hotel dataframe.

In [12]:
# What is the average (arithmetic mean) of the values 
# in the Children column 
# of the city hotel dataframe?
city_hotel_data['Children'].mean()   

0.09136979048483473

How about the <a href="http://gettoknowdata.com/2021/10/06/median/" target="_blank">median</a> number of children?

In [13]:
# What is the median of the values 
# in the Children column 
# of the city hotel dataframe?
city_hotel_data['Children'].median() 

0.0

Yeah, this confirms it for me. No reservation would ever include 0.09 children (the <a href="http://gettoknowdata.com/2021/10/01/describing-data-with-an-arithmetic-mean/" target="_blank">average</a>). And because both the <a href="http://gettoknowdata.com/2021/10/13/mode/" target="_blank">mode</a> and the <a href="http://gettoknowdata.com/2021/10/06/median/" target="_blank">median</a> are 0.0, it makes sense to fill those 4 nan values with 0.0.

Let's do that.

First, though, let's take a look at those 4 reservations with nan values.

In [14]:
# Which records in the city hotel dataframe
# have a NaN value
# in the Children column?
city_hotel_data[city_hotel_data['Children'].isna()]

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
540,1,2,2015,August,32,3,1,0,2,,...,No Deposit,,,0,Transient-Party,12.0,0,1,Canceled,2015-08-01
607,1,1,2015,August,32,5,0,2,2,,...,No Deposit,14.0,,0,Transient-Party,12.0,0,1,Canceled,2015-08-04
619,1,1,2015,August,32,5,0,2,3,,...,No Deposit,,,0,Transient-Party,18.0,0,2,Canceled,2015-08-04
1100,1,8,2015,August,33,13,2,5,2,,...,No Deposit,9.0,,0,Transient-Party,76.5,0,1,Canceled,2015-08-09


Scroll right and we can see that these four reservations all have NaN in the Children column.

Let's change all those NaNs to zeroes.

In [15]:
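# Change all of the NaN values
# in the Children column
# of the city hotel dataframe
# from NaN to 0
# (assign the result back to the column; calling
# fillna with inplace=True on a selected column
# may not update the dataframe in newer versions of pandas)
city_hotel_data['Children'] = city_hotel_data['Children'].fillna(value=0)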

Now let's double-check to make sure the NaNs are gone.

In [16]:
city_hotel_data['Children'].unique() 

array([0., 1., 2., 3.])

Excellent. No more NaNs in the Children column of the city hotel dataframe.

If we did this correctly, we should now have 74,224 reservations with zero children.

Let's see.

In [17]:
# In the newly updated
# city hotel dataframe,
# what are the unique values
# in the Children column,
# and how many of each unique value
# do we have?
city_hotel_data['Children'].value_counts() 

0.0    74224
1.0     3023
2.0     2024
3.0       59
Name: Children, dtype: int64

Cool. That's what we expected.

Now let's change the datatype of the Children column in the city hotel dataframe from float to int.

In [18]:
# Change the datatype 
# of the Children column
# in the city hotel dataframe
# to int64
city_hotel_data['Children'] = city_hotel_data['Children'].astype('int64')
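The order of those last two steps matters: the NaNs have to be filled before the column can be converted to int64. The sketch below, on a short made-up series, shows the conversion failing while a NaN is still present and succeeding after the fill. It also shows pandas' nullable `Int64` datatype as an alternative that keeps the gaps; that option is not what this post uses, just something to be aware of.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the Children column
children = pd.Series([0.0, 1.0, np.nan, 2.0])

# Converting while a NaN is still present raises an error
cast_failed = False
try:
    children.astype("int64")
except ValueError as error:
    cast_failed = True
    print(f"Cast failed: {error}")

# Fill the gap first, then convert
filled = children.fillna(0).astype("int64")
print(filled.tolist())

# Alternative: the nullable integer dtype keeps the missing value
nullable = children.astype("Int64")
print(nullable.dtype)
```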

And now let's double-check our work just to make sure the resort hotel and city hotel dataframes have the exact same column labels and datatypes.

Here's the resort hotel info again.

In [19]:
# Display all of the columns 
# in the resort hotel dataframe
resort_hotel_data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40060 entries, 0 to 40059
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   IsCanceled                   40060 non-null  int64  
 1   LeadTime                     40060 non-null  int64  
 2   ArrivalDateYear              40060 non-null  int64  
 3   ArrivalDateMonth             40060 non-null  object 
 4   ArrivalDateWeekNumber        40060 non-null  int64  
 5   ArrivalDateDayOfMonth        40060 non-null  int64  
 6   StaysInWeekendNights         40060 non-null  int64  
 7   StaysInWeekNights            40060 non-null  int64  
 8   Adults                       40060 non-null  int64  
 9   Children                     40060 non-null  int64  
 10  Babies                       40060 non-null  int64  
 11  Meal                         40060 non-null  object 
 12  Country                      39596 non-null  object 
 13  MarketSegment   

And here's the (*newly updated*) city hotel dataframe info again.

In [20]:
# Display all of the columns 
# in the city hotel dataframe
city_hotel_data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79330 entries, 0 to 79329
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   IsCanceled                   79330 non-null  int64  
 1   LeadTime                     79330 non-null  int64  
 2   ArrivalDateYear              79330 non-null  int64  
 3   ArrivalDateMonth             79330 non-null  object 
 4   ArrivalDateWeekNumber        79330 non-null  int64  
 5   ArrivalDateDayOfMonth        79330 non-null  int64  
 6   StaysInWeekendNights         79330 non-null  int64  
 7   StaysInWeekNights            79330 non-null  int64  
 8   Adults                       79330 non-null  int64  
 9   Children                     79330 non-null  int64  
 10  Babies                       79330 non-null  int64  
 11  Meal                         79330 non-null  object 
 12  Country                      79306 non-null  object 
 13  MarketSegment   

Excellent. Now all of the column labels and datatypes are identical in each dataframe.

There are more things I'd like to change in the dataframes. However, I want to make the same changes to the same columns in each dataframe. In this case, it doesn't make sense to do that twice (once for the resort hotel dataframe and then again for the city hotel dataframe). I'd rather combine the two dataframes first, then make all the other changes to that larger dataframe. That way I only make those additional changes once.

In [21]:
# Combine (concatenate) the
# resort hotel dataframe and the
# city hotel dataframe
# into a single dataframe
# named resort_and_city_hotel_data
resort_and_city_hotel_data = pd.concat([resort_hotel_data, city_hotel_data], axis=0)

If we did this right, we should now have a single dataframe with 119,390 reservation records.

Let's see.

In [22]:
# How many rows and columns
# are there in the 
# resort and city hotel dataframe?
# (# rows, # columns)
resort_and_city_hotel_data.shape

(119390, 31)

Looking good so far.

Let's take a peek at some of the rows in our new dataframe.

In [23]:
# Display the first five
# and the last five rows
# of the resort and city hotel dataframe
resort_and_city_hotel_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79325,0,23,2017,August,35,30,2,5,2,0,...,No Deposit,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06
79326,0,102,2017,August,35,31,2,5,3,0,...,No Deposit,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07
79327,0,34,2017,August,35,31,2,5,2,0,...,No Deposit,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07
79328,0,109,2017,August,35,31,2,5,2,0,...,No Deposit,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07


Wait a minute.

Why does the output say we have 119,390 rows, but when we look at the last row in the dataframe it shows us it's only row 79,329?

Because we combined two dataframes, and each one brought its own row labels along.

Here, look at the shape of the resort hotel data. It has 40,060 rows.

In [24]:
resort_hotel_data.shape

(40060, 31)

Now look at the shape of the city hotel dataframe. It has . . . 79,330 rows.

In [25]:
city_hotel_data.shape

(79330, 31)

So, when we see that the last record in the combined dataframe is record 79,329 . . .

In [26]:
# Display the last five rows
# of the resort and city hotel dataframe
resort_and_city_hotel_data.tail()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
79325,0,23,2017,August,35,30,2,5,2,0,...,No Deposit,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06
79326,0,102,2017,August,35,31,2,5,3,0,...,No Deposit,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07
79327,0,34,2017,August,35,31,2,5,2,0,...,No Deposit,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07
79328,0,109,2017,August,35,31,2,5,2,0,...,No Deposit,89,,0,Transient,104.4,0,0,Check-Out,2017-09-07
79329,0,205,2017,August,35,29,2,7,2,0,...,No Deposit,9,,0,Transient,151.2,0,2,Check-Out,2017-09-07


. . . what we're seeing (at this point) is the last five rows of the *city* hotel dataframe at the *bottom* of the newly combined resort and city hotel dataframe. 

Say what? Here, just look at the last 5 rows of the city hotel dataframe.

In [27]:
# Display the last five rows
# of the city hotel dataframe
city_hotel_data.tail()

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
79325,0,23,2017,August,35,30,2,5,2,0,...,No Deposit,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06
79326,0,102,2017,August,35,31,2,5,3,0,...,No Deposit,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07
79327,0,34,2017,August,35,31,2,5,2,0,...,No Deposit,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07
79328,0,109,2017,August,35,31,2,5,2,0,...,No Deposit,89,,0,Transient,104.4,0,0,Check-Out,2017-09-07
79329,0,205,2017,August,35,29,2,7,2,0,...,No Deposit,9,,0,Transient,151.2,0,2,Check-Out,2017-09-07


Now compare those to the last five rows of the resort and city hotel dataframe. They're identical.

Why?

Because when we used this code . . .

<code>resort_and_city_hotel_data = pd.concat([resort_hotel_data, city_hotel_data], axis=0)
</code>

. . . to combine (concatenate) the resort hotel dataframe with the city hotel dataframe, all pandas did was, essentially, stack the resort hotel dataframe *on top of* the city hotel dataframe, leaving each dataframe's original index labels untouched.

It's basically the same thing as taking this table . . .

| Index  | Animal  | Vegetable | Mineral  |
|---|---|---|---|
| 7|  Dog |  Carrot | Marrite  |
| 23 |  Cat |  Bell pepper |  Violarite |

. . . and this table . . .

| Index  | Animal  | Vegetable | Mineral  |
|---|---|---|---|
| 92 | Bird |  Celery | Variscite  |
| 118 | Horse |  Broccoli |  Monazite |

. . . and combining them into this table . . .

| Index  | Animal  | Vegetable | Mineral  |
|---|---|---|---|
| 7|  Dog |  Carrot | Marrite  |
| 23 |  Cat |  Bell pepper |  Violarite |
| 92 | Bird |  Celery | Variscite  |
| 118 | Horse |  Broccoli |  Monazite |
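The tables above can be reproduced with a short sketch. The dataframe names here (`first`, `second`, `stacked`, `overlapping`) are made up for the illustration. The second half shows what happens when both dataframes use the default labels starting at 0, which is the situation with the resort and city hotel dataframes: the labels repeat after stacking.

```python
import pandas as pd

# The two small tables from above, with their original index labels
first = pd.DataFrame(
    {"Animal": ["Dog", "Cat"],
     "Vegetable": ["Carrot", "Bell pepper"],
     "Mineral": ["Marrite", "Violarite"]},
    index=[7, 23],
)
second = pd.DataFrame(
    {"Animal": ["Bird", "Horse"],
     "Vegetable": ["Celery", "Broccoli"],
     "Mineral": ["Variscite", "Monazite"]},
    index=[92, 118],
)

# Stacking keeps every original index label
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())

# If both tables start counting at 0, the labels repeat after stacking
overlapping = pd.concat(
    [first.reset_index(drop=True), second.reset_index(drop=True)],
    axis=0,
)
print(overlapping.index.tolist())
print(overlapping.index.is_unique)
```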

In other words, all we need to do to fix this in our new dataframe is reset the index.

Let's do that.

In [28]:
# Reset the index of the
# resort and city hotel dataframe
resort_and_city_hotel_data.reset_index(drop=True, inplace=True)

Let's take a look . . .

In [29]:
resort_and_city_hotel_data

Unnamed: 0,IsCanceled,LeadTime,ArrivalDateYear,ArrivalDateMonth,ArrivalDateWeekNumber,ArrivalDateDayOfMonth,StaysInWeekendNights,StaysInWeekNights,Adults,Children,...,DepositType,Agent,Company,DaysInWaitingList,CustomerType,ADR,RequiredCarParkingSpaces,TotalOfSpecialRequests,ReservationStatus,ReservationStatusDate
0,0,342,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,0,737,2015,July,27,1,0,0,2,0,...,No Deposit,,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,0,7,2015,July,27,1,0,1,1,0,...,No Deposit,,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,0,13,2015,July,27,1,0,1,1,0,...,No Deposit,304,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,0,14,2015,July,27,1,0,2,2,0,...,No Deposit,240,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,0,23,2017,August,35,30,2,5,2,0,...,No Deposit,394,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,0,102,2017,August,35,31,2,5,3,0,...,No Deposit,9,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,0,34,2017,August,35,31,2,5,2,0,...,No Deposit,9,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,0,109,2017,August,35,31,2,5,2,0,...,No Deposit,89,,0,Transient,104.40,0,0,Check-Out,2017-09-07


Looking good! We now have one dataframe (resort_and_city_hotel_data) that combines the original two dataframes (resort_hotel_data and city_hotel_data) into a single dataframe with an index relevant to the *new* dataframe.
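As a side note, `pd.concat` has an `ignore_index` parameter that does the stacking and the renumbering in one step. The sketch below compares the two approaches on small stand-in dataframes with made-up values; it's an alternative to the two-step approach used in this post, not a correction of it.

```python
import pandas as pd

# Stand-in dataframes (hypothetical values)
resort_sample = pd.DataFrame({"Adults": [2, 1], "Children": [0, 1]})
city_sample = pd.DataFrame({"Adults": [2, 2, 3], "Children": [0, 2, 0]})

# Two steps: concatenate, then reset the index
two_step = pd.concat([resort_sample, city_sample], axis=0).reset_index(drop=True)

# One step: let concat build a fresh 0..n-1 index
one_step = pd.concat([resort_sample, city_sample], axis=0, ignore_index=True)

print(one_step.index.tolist())
print(one_step.equals(two_step))
```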

Plenty more to do before we start asking the dataframe questions, including:

* combining the arrival date columns into one datetime column
* changing the datatype of the reservation status date column to datetime
* changing the categorical variables represented by numbers to letters

Continue to Part 2.