# Phase 1: Data Ingestion into MongoDB

This notebook demonstrates:
1. Setting up a MongoDB Atlas connection
2. Loading the hotel booking dataset into MongoDB
3. Creating indexes for efficient querying
4. Running MongoDB aggregation queries

## Dataset
- **Source**: Hotel Booking Demand Dataset from Kaggle
- **Records**: 119,390 hotel bookings
- **Features**: 32 columns


In [1]:
# Install required packages
%pip install pymongo dnspython python-dotenv pandas tqdm -q


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import libraries
import os
import sys
import pandas as pd
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from dotenv import load_dotenv
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
sys.path.append('../src')

print("✓ Libraries imported successfully")


✓ Libraries imported successfully


## Step 1: MongoDB Connection Configuration

**Note**: See README.md for MongoDB setup instructions. Ensure you have created a `.env` file with your MongoDB connection string.
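
The configuration cell below reads three environment variables: `MONGODB_URI` (required), plus `MONGODB_DB` and `MONGODB_COLLECTION` (optional, with defaults). As a minimal sketch of that lookup logic, the hypothetical helper `read_mongo_config` (not part of this project) takes any mapping, so it can be tried without a `.env` file. The URI in the example is a placeholder, not a real cluster.

```python
from typing import Mapping, NamedTuple


class MongoConfig(NamedTuple):
    uri: str
    db_name: str
    collection_name: str


def read_mongo_config(env: Mapping[str, str]) -> MongoConfig:
    """Resolve connection settings from a mapping such as os.environ."""
    uri = env.get("MONGODB_URI")
    if not uri:
        # Mirrors the notebook: a missing URI is a hard error.
        raise ValueError("MONGODB_URI not set")
    return MongoConfig(
        uri=uri,
        # Same defaults as the configuration cell in this notebook.
        db_name=env.get("MONGODB_DB", "hotel_bookings"),
        collection_name=env.get("MONGODB_COLLECTION", "bookings"),
    )


# Placeholder URI for illustration only.
example = read_mongo_config({"MONGODB_URI": "mongodb://localhost:27017"})
print(example.db_name, example.collection_name)
```

In the notebook itself, the equivalent call would be `read_mongo_config(os.environ)` after `load_dotenv()` has run.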


In [None]:
# MongoDB Connection Configuration
from dotenv import load_dotenv
load_dotenv()  # Load variables from .env file

MONGODB_URI = os.getenv('MONGODB_URI')
DB_NAME = os.getenv('MONGODB_DB', 'hotel_bookings')  # Default if not set
COLLECTION_NAME = os.getenv('MONGODB_COLLECTION', 'bookings')  # Default if not set

# Verify connection string is set
if not MONGODB_URI:
    raise ValueError("MONGODB_URI not set. Please see README.md for MongoDB setup instructions.")
else:
    print("✓ MongoDB configuration loaded from .env file")
    print(f"✓ Database: {DB_NAME}")
    print(f"✓ Collection: {COLLECTION_NAME}")


✓ MongoDB configuration loaded from .env file
✓ Database: hotel_bookings
✓ Collection: bookings


In [4]:
# Test MongoDB connection
try:
    client = MongoClient(MONGODB_URI, serverSelectionTimeoutMS=5000)
    # Test connection
    client.admin.command('ping')
    print("✓ Successfully connected to MongoDB Atlas")
    
    # List databases
    print(f"\nAvailable databases: {client.list_database_names()}")
    
except ConnectionFailure as e:
    print(f"✗ Failed to connect to MongoDB: {e}")
    print("\nPlease check your connection string and network settings.")


✓ Successfully connected to MongoDB Atlas

Available databases: ['hotel_bookings', 'admin', 'local']


## Step 2: Load Dataset

**Note**: Download the dataset from [Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) and place it in the `data/` directory.


In [5]:
# Load dataset
import os

# Use relative path from notebook location
csv_path = os.path.join("..", "data", "hotel_bookings.csv")
csv_path = os.path.abspath(csv_path)  # Convert to absolute path
print(f"CSV path: {csv_path}")

# Verify CSV file exists
if not os.path.exists(csv_path):
    raise FileNotFoundError(
        f"CSV file not found at: {csv_path}\n"
        f"Please ensure 'hotel_bookings.csv' is in the data folder.\n"
        f"Current working directory: {os.getcwd()}"
    )

# Read the first few rows as a quick sanity check.
# The existence check above already raises if the file is missing.
df_sample = pd.read_csv(csv_path, nrows=5)
print("✓ Dataset found!")
print(f"Columns: {df_sample.columns.tolist()}")
print("\nSample data:")
display(df_sample.head())


CSV path: /Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/data/hotel_bookings.csv
✓ Dataset found!
Columns: ['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']

Sample data:


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98 | 0 | 1 | Check-Out | 2015-07-03 |


## Step 3: Load Data into MongoDB


In [6]:
# Connect to database
db = client[DB_NAME]
collection = db[COLLECTION_NAME]

# Check if collection already exists and has data
load_data = True  # Default: load when the collection is empty or has just been dropped
existing_count = collection.count_documents({})
if existing_count > 0:
    print(f"⚠ Collection already contains {existing_count} documents.")
    response = input("Do you want to drop and reload? (yes/no): ")
    if response.strip().lower() == 'yes':
        collection.drop()
        print("✓ Collection dropped")
    else:
        print("Skipping data load. Using existing data.")
        load_data = False


⚠ Collection already contains 119390 documents.
Skipping data load. Using existing data.


In [7]:
# Load data into MongoDB in chunks
if load_data:
    print("Loading data into MongoDB...")
    chunk_size = 10000
    total_inserted = 0
    
    # Read CSV in chunks
    for chunk in tqdm(pd.read_csv(csv_path, chunksize=chunk_size), desc="Loading chunks"):
        # Convert DataFrame to list of dictionaries
        records = chunk.to_dict('records')
        # Insert into MongoDB
        try:
            result = collection.insert_many(records)
            total_inserted += len(result.inserted_ids)
        except Exception as e:
            print(f"Error inserting chunk: {e}")
    
    print(f"\n✓ Successfully inserted {total_inserted} documents into MongoDB")
else:
    total_inserted = collection.count_documents({})
    print(f"Using existing {total_inserted} documents")


Using existing 119390 documents
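
In the sample rows shown earlier, `agent` and `company` are empty for several bookings. pandas reads empty CSV cells as NaN, and `to_dict('records')` keeps them as float NaN, so they reach MongoDB as numeric NaN values rather than nulls. If null is the preferred representation, one option is to convert the records before calling `insert_many`. The following is a minimal sketch; `nan_to_none` is a hypothetical helper and not part of this notebook.

```python
import math
from typing import Any, Dict, List


def nan_to_none(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the records with float NaN values replaced by None."""
    cleaned = []
    for record in records:
        cleaned.append({
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in record.items()
        })
    return cleaned


# Small hand-made record shaped like one row of chunk.to_dict('records').
rows = [{"hotel": "Resort Hotel", "agent": float("nan"), "adr": 75.0}]
print(nan_to_none(rows))
```

Inside the loading loop this would be applied as `records = nan_to_none(chunk.to_dict('records'))`. Whether the conversion is worthwhile depends on how later queries treat missing values.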


## Step 4: Create Indexes

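Single-field indexes speed up queries that filter or sort on the indexed field, at the cost of extra storage and some write overhead. The fields indexed below are the ones this project expects to filter on.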
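
One way to check whether a query actually uses an index is to inspect the output of `explain()` on a cursor. The sketch below walks a winning plan and collects its stage names, where `IXSCAN` indicates an index scan and `COLLSCAN` a full collection scan. `plan_stages` is a hypothetical helper, and the sample plan is hand-written for illustration. The exact layout of explain output varies between server versions, so the key names used here are an assumption.

```python
from typing import Any, Dict, List


def plan_stages(plan: Dict[str, Any]) -> List[str]:
    """Collect stage names from a winning plan, outermost stage first."""
    stages = []
    if "stage" in plan:
        stages.append(plan["stage"])
    # A stage may wrap a single child ("inputStage") or several ("inputStages").
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        stages.extend(plan_stages(child))
    return stages


# Hand-written example shaped like the "winningPlan" part of explain output.
sample_plan = {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "is_canceled_1"},
}
print(plan_stages(sample_plan))  # ['FETCH', 'IXSCAN']
```

Against a live collection, the input would come from something like `collection.find({"is_canceled": 1}).explain()["queryPlanner"]["winningPlan"]`, assuming the server reports the plan under those keys.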


In [8]:
# Create indexes for efficient querying
indexes = [
    ("is_canceled", 1),  # Target variable - most important for filtering
    ("hotel", 1),  # Hotel type filter
    ("arrival_date_year", 1),  # Year filter
    ("arrival_date_month", 1),  # Month filter
    ("country", 1),  # Country filter
    ("market_segment", 1),  # Market segment filter
    ("deposit_type", 1),  # Deposit type filter
    ("customer_type", 1),  # Customer type filter
]

print("Creating indexes...")
for field, direction in indexes:
    try:
        collection.create_index([(field, direction)])
        print(f"✓ Created index on {field}")
    except Exception as e:
        print(f"✗ Failed to create index on {field}: {e}")

# List all indexes
print("\nCurrent indexes:")
for index in collection.list_indexes():
    print(f"  - {index['name']}: {index.get('key', {})}")


Creating indexes...
✓ Created index on is_canceled
✓ Created index on hotel
✓ Created index on arrival_date_year
✓ Created index on arrival_date_month
✓ Created index on country
✓ Created index on market_segment
✓ Created index on deposit_type
✓ Created index on customer_type

Current indexes:
  - _id_: SON([('_id', 1)])
  - is_canceled_1: SON([('is_canceled', 1)])
  - hotel_1: SON([('hotel', 1)])
  - arrival_date_year_1: SON([('arrival_date_year', 1)])
  - arrival_date_month_1: SON([('arrival_date_month', 1)])
  - country_1: SON([('country', 1)])
  - market_segment_1: SON([('market_segment', 1)])
  - deposit_type_1: SON([('deposit_type', 1)])
  - customer_type_1: SON([('customer_type', 1)])


## Step 5: Verify Data and Basic Statistics


In [9]:
# Get collection statistics
total_docs = collection.count_documents({})
cancelled = collection.count_documents({"is_canceled": 1})
not_cancelled = collection.count_documents({"is_canceled": 0})
cancellation_rate = (cancelled / total_docs * 100) if total_docs > 0 else 0

print("=== Collection Statistics ===")
print(f"Total documents: {total_docs:,}")
print(f"Cancelled bookings: {cancelled:,} ({cancellation_rate:.2f}%)")
print(f"Not cancelled bookings: {not_cancelled:,} ({100-cancellation_rate:.2f}%)")


=== Collection Statistics ===
Total documents: 119,390
Cancelled bookings: 44,224 (37.04%)
Not cancelled bookings: 75,166 (62.96%)


In [10]:
# Sample a few documents to verify structure
print("\n=== Sample Documents ===")
sample_docs = list(collection.find().limit(2))
for i, doc in enumerate(sample_docs, 1):
    print(f"\nDocument {i}:")
    for key, value in list(doc.items())[:10]:  # Show first 10 fields
        print(f"  {key}: {value}")
    if len(doc) > 10:
        print(f"  ... and {len(doc) - 10} more fields")



=== Sample Documents ===

Document 1:
  _id: 695397a615e0a76cad0189f0
  hotel: Resort Hotel
  is_canceled: 0
  lead_time: 342
  arrival_date_year: 2015
  arrival_date_month: July
  arrival_date_week_number: 27
  arrival_date_day_of_month: 1
  stays_in_weekend_nights: 0
  stays_in_week_nights: 0
  ... and 23 more fields

Document 2:
  _id: 695397a615e0a76cad0189f1
  hotel: Resort Hotel
  is_canceled: 0
  lead_time: 737
  arrival_date_year: 2015
  arrival_date_month: July
  arrival_date_week_number: 27
  arrival_date_day_of_month: 1
  stays_in_weekend_nights: 0
  stays_in_week_nights: 0
  ... and 23 more fields


## Step 6: MongoDB Aggregation Queries

This step uses MongoDB's aggregation framework to compute cancellation rates and average lead times grouped by hotel type, deposit type, country, and market segment.
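
Queries 1 to 3 in this step share one shape: group by a field, count bookings, sum `is_canceled`, then derive a percentage. As a minimal sketch, that shape can be built by a small function. `cancellation_rate_pipeline` is a hypothetical helper, not part of this notebook, and it only constructs the pipeline without contacting the database.

```python
from typing import Any, Dict, List, Optional


def cancellation_rate_pipeline(
    field: str,
    sort_by: str = "cancellation_rate",
    limit: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build a group/project/sort pipeline for cancellation rate by `field`."""
    pipeline: List[Dict[str, Any]] = [
        {"$group": {
            "_id": f"${field}",
            "total_bookings": {"$sum": 1},
            # is_canceled is 0 or 1, so its sum is the cancellation count.
            "cancelled": {"$sum": "$is_canceled"},
        }},
        {"$project": {
            field: "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "cancellation_rate": {
                "$multiply": [{"$divide": ["$cancelled", "$total_bookings"]}, 100]
            },
        }},
        {"$sort": {sort_by: -1}},
    ]
    if limit is not None:
        pipeline.append({"$limit": limit})
    return pipeline


# Same shape as the "top 10 countries" query in this step.
top_countries = cancellation_rate_pipeline("country", sort_by="total_bookings", limit=10)
print([next(iter(stage)) for stage in top_countries])
```

The result can be passed to `collection.aggregate(...)` in the same way as the hand-written pipelines.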


In [11]:
# Query 1: Cancellation rate by hotel type
print("=== Cancellation Rate by Hotel Type ===")
pipeline1 = [
    {
        "$group": {
            "_id": "$hotel",
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"},
            "not_cancelled": {"$sum": {"$subtract": [1, "$is_canceled"]}}
        }
    },
    {
        "$project": {
            "hotel": "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "not_cancelled": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"cancellation_rate": -1}}
]

results1 = list(collection.aggregate(pipeline1))
df_hotel = pd.DataFrame(results1)
display(df_hotel)


=== Cancellation Rate by Hotel Type ===


| | _id | total_bookings | cancelled | not_cancelled | hotel | cancellation_rate |
|---|---|---|---|---|---|---|
| 0 | City Hotel | 79330 | 33102 | 46228 | City Hotel | 41.726963 |
| 1 | Resort Hotel | 40060 | 11122 | 28938 | Resort Hotel | 27.763355 |


In [12]:
# Query 2: Cancellation rate by deposit type
print("=== Cancellation Rate by Deposit Type ===")
pipeline2 = [
    {
        "$group": {
            "_id": "$deposit_type",
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"}
        }
    },
    {
        "$project": {
            "deposit_type": "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"cancellation_rate": -1}}
]

results2 = list(collection.aggregate(pipeline2))
df_deposit = pd.DataFrame(results2)
display(df_deposit)


=== Cancellation Rate by Deposit Type ===


| | _id | total_bookings | cancelled | deposit_type | cancellation_rate |
|---|---|---|---|---|---|
| 0 | Non Refund | 14587 | 14494 | Non Refund | 99.362446 |
| 1 | No Deposit | 104641 | 29694 | No Deposit | 28.377022 |
| 2 | Refundable | 162 | 36 | Refundable | 22.222222 |


In [13]:
# Query 3: Top 10 countries by booking count
print("=== Top 10 Countries by Booking Count ===")
pipeline3 = [
    {
        "$group": {
            "_id": "$country",
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"}
        }
    },
    {
        "$project": {
            "country": "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"total_bookings": -1}},
    {"$limit": 10}
]

results3 = list(collection.aggregate(pipeline3))
df_countries = pd.DataFrame(results3)
display(df_countries)


=== Top 10 Countries by Booking Count ===


| | _id | total_bookings | cancelled | country | cancellation_rate |
|---|---|---|---|---|---|
| 0 | PRT | 48590 | 27519 | PRT | 56.63511 |
| 1 | GBR | 12129 | 2453 | GBR | 20.224256 |
| 2 | FRA | 10415 | 1934 | FRA | 18.569371 |
| 3 | ESP | 8568 | 2177 | ESP | 25.408497 |
| 4 | DEU | 7287 | 1218 | DEU | 16.714697 |
| 5 | ITA | 3766 | 1333 | ITA | 35.395645 |
| 6 | IRL | 3375 | 832 | IRL | 24.651852 |
| 7 | BEL | 2342 | 474 | BEL | 20.239112 |
| 8 | BRA | 2224 | 830 | BRA | 37.320144 |
| 9 | NLD | 2104 | 387 | NLD | 18.393536 |


In [14]:
# Query 4: Average lead time by market segment
print("=== Average Lead Time by Market Segment ===")
pipeline4 = [
    {
        "$group": {
            "_id": "$market_segment",
            "avg_lead_time": {"$avg": "$lead_time"},
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"}
        }
    },
    {
        "$project": {
            "market_segment": "$_id",
            "avg_lead_time": {"$round": ["$avg_lead_time", 2]},
            "total_bookings": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"avg_lead_time": -1}}
]

results4 = list(collection.aggregate(pipeline4))
df_segment = pd.DataFrame(results4)
display(df_segment)


=== Average Lead Time by Market Segment ===


| | _id | total_bookings | market_segment | avg_lead_time | cancellation_rate |
|---|---|---|---|---|---|
| 0 | Groups | 19811 | Groups | 186.97 | 61.062036 |
| 1 | Offline TA/TO | 24219 | Offline TA/TO | 135.0 | 34.316033 |
| 2 | Online TA | 56477 | Online TA | 83.0 | 36.721143 |
| 3 | Direct | 12606 | Direct | 49.86 | 15.341901 |
| 4 | Corporate | 5295 | Corporate | 22.13 | 18.734655 |
| 5 | Complementary | 743 | Complementary | 13.29 | 13.055182 |
| 6 | Aviation | 237 | Aviation | 4.44 | 21.940928 |
| 7 | Undefined | 2 | Undefined | 1.5 | 100.0 |


## Step 7: Export Data for Next Steps

Export data from MongoDB to CSV for use in subsequent notebooks.
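
The cells below load the whole collection into a DataFrame before writing it out. For larger collections, an alternative is to stream documents straight to CSV while dropping the `_id` field. The following is a minimal sketch using only the standard library; `write_documents_csv` is a hypothetical helper, and it assumes every document has the same fields as the first one.

```python
import csv
import io
from typing import Any, Dict, Iterable, TextIO


def write_documents_csv(docs: Iterable[Dict[str, Any]], out: TextIO) -> int:
    """Stream documents to CSV without the `_id` field; return rows written."""
    writer = None
    count = 0
    for doc in docs:
        row = {key: value for key, value in doc.items() if key != "_id"}
        if writer is None:
            # Column order is taken from the first document.
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# In-memory demonstration with two hand-made documents.
buffer = io.StringIO()
written = write_documents_csv(
    [
        {"_id": 1, "hotel": "Resort Hotel", "is_canceled": 0},
        {"_id": 2, "hotel": "City Hotel", "is_canceled": 1},
    ],
    buffer,
)
print(written)
print(buffer.getvalue())
```

With a live collection, the first argument would be `collection.find()` and the second a file opened with `open(path, "w", newline="")`.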


In [15]:
# Load all data from MongoDB to pandas DataFrame
print("Loading data from MongoDB...")
cursor = collection.find()
df_mongodb = pd.DataFrame(list(cursor))

# Remove MongoDB _id field
if '_id' in df_mongodb.columns:
    df_mongodb = df_mongodb.drop('_id', axis=1)

print(f"✓ Loaded {len(df_mongodb)} records")
print(f"Shape: {df_mongodb.shape}")
print(f"\nColumns: {df_mongodb.columns.tolist()}")


Loading data from MongoDB...
✓ Loaded 119390 records
Shape: (119390, 32)

Columns: ['hotel', 'is_canceled', 'lead_time', 'arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies', 'meal', 'country', 'market_segment', 'distribution_channel', 'is_repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'reserved_room_type', 'assigned_room_type', 'booking_changes', 'deposit_type', 'agent', 'company', 'days_in_waiting_list', 'customer_type', 'adr', 'required_car_parking_spaces', 'total_of_special_requests', 'reservation_status', 'reservation_status_date']


In [16]:
# Save to CSV for next notebook (optional - can also load directly from MongoDB)
import os

# Use relative path from notebook location
output_path = os.path.join("..", "data", "hotel_bookings_from_mongodb.csv")
output_path = os.path.abspath(output_path)  # Convert to absolute path

df_mongodb.to_csv(output_path, index=False)
print(f"✓ Data exported to {output_path}")

# Display summary
print("\n=== Data Summary ===")
df_mongodb.info()  # info() prints its report directly and returns None
print("\n=== First few rows ===")
display(df_mongodb.head())


✓ Data exported to /Users/abdelrahman/Developer/Hotel Booking Cancellation Prediction/data/hotel_bookings_from_mongodb.csv

=== Data Summary ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


## Summary

- ✓ Data successfully loaded into MongoDB Atlas
- ✓ Indexes created for efficient querying
- ✓ Aggregation queries demonstrated
- ✓ Data ready for EDA and ML processing

**Next Steps**: Proceed to `02_eda_analysis.ipynb` for exploratory data analysis.
