# Phase 1: Data Ingestion into MongoDB

This notebook demonstrates:
1. Setting up MongoDB Atlas connection
2. Loading hotel booking dataset into MongoDB
3. Creating indexes for efficient querying
4. Demonstrating MongoDB aggregation queries

## Dataset
- **Source**: Hotel Booking Demand Dataset from Kaggle
- **Records**: 119,390 hotel bookings
- **Features**: 32 columns


In [None]:
# Install required packages
%pip install pymongo dnspython python-dotenv pandas tqdm -q


In [None]:
# Import libraries
import os
import sys
import pandas as pd
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure
from dotenv import load_dotenv
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports (Colab uses /content/src, local Jupyter uses ../src)
src_path = '/content/src' if 'google.colab' in str(get_ipython()) else '../src'
if src_path not in sys.path:
    sys.path.append(src_path)

print("✓ Libraries imported successfully")


## Step 1: MongoDB Atlas Setup

**Instructions:**
1. Go to https://www.mongodb.com/cloud/atlas and create a free account
2. Create a new cluster (free tier M0)
3. Create a database user with read/write permissions
4. Add your IP address to the cluster's IP access list (Colab has no fixed IP, so use 0.0.0.0/0 there; this allows connections from any address)
5. Get your connection string
6. Replace the placeholder connection string in the next cell with your own


In [None]:
# MongoDB Connection Configuration
# Replace with your MongoDB Atlas connection string
MONGODB_URI = "mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority"
DB_NAME = "hotel_bookings"
COLLECTION_NAME = "bookings"

# Alternative: Load from environment variable
# from dotenv import load_dotenv
# load_dotenv()
# MONGODB_URI = os.getenv('MONGODB_URI')

print("MongoDB configuration set")
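
The commented-out environment-variable alternative in the cell above can be fleshed out as follows. This is a minimal sketch, not part of the original notebook: `get_mongodb_uri` and `mask_credentials` are hypothetical helper names, and the regular expression assumes a `scheme://user:password@host` layout.

```python
import os
import re


def get_mongodb_uri(env_var="MONGODB_URI", default=None):
    """Return the connection string from the environment, or `default` if unset."""
    uri = os.environ.get(env_var, default)
    if not uri:
        raise RuntimeError(f"Set the {env_var} environment variable or pass a default")
    return uri


def mask_credentials(uri):
    """Hide the password in a 'scheme://user:password@host' URI before printing it."""
    return re.sub(r"(://[^:/@]+:)[^@]+(@)", r"\1***\2", uri)


# The placeholder string from this notebook is used as the fallback here.
uri = get_mongodb_uri(
    default="mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority"
)
print(mask_credentials(uri))
```

Reading the URI from the environment keeps real credentials out of the saved notebook, and masking avoids leaking the password into cell output.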


In [None]:
# Test MongoDB connection
try:
    client = MongoClient(MONGODB_URI, serverSelectionTimeoutMS=5000)
    # Test connection
    client.admin.command('ping')
    print("✓ Successfully connected to MongoDB Atlas")
    
    # List databases
    print(f"\nAvailable databases: {client.list_database_names()}")
    
except ConnectionFailure as e:
    print(f"✗ Failed to connect to MongoDB: {e}")
    print("\nPlease check your connection string and network settings.")


## Step 2: Load Dataset

**Note**: Download the dataset from [Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand) and upload it to Colab or mount Google Drive.


In [None]:
# Load dataset
# Option 1: From uploaded file in Colab
csv_path = "/content/hotel_bookings.csv"

# Option 2: From Google Drive (uncomment if using)
# from google.colab import drive
# drive.mount('/content/drive')
# csv_path = "/content/drive/MyDrive/hotel_bookings.csv"

# Option 3: From local path (for local Jupyter)
# csv_path = "../data/hotel_bookings.csv"

try:
    # Read first few rows to check
    df_sample = pd.read_csv(csv_path, nrows=5)
    print("✓ Dataset found!")
    print(f"Columns: {df_sample.columns.tolist()}")
    print(f"\nSample data:")
    display(df_sample.head())
except FileNotFoundError:
    print("✗ Dataset not found. Please upload the CSV file to Colab or update the path.")
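
Before loading, it can help to confirm that the CSV header contains the columns the later cells depend on. The sketch below uses only the standard library. `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, the column list is taken from the index and aggregation cells in this notebook, and the in-memory CSV is an illustrative stand-in for the real file.

```python
import csv
import io

# Columns used by the index list and aggregation queries in this notebook.
REQUIRED_COLUMNS = {
    "hotel", "is_canceled", "lead_time", "arrival_date_year", "arrival_date_month",
    "country", "market_segment", "deposit_type", "customer_type",
}


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(required) - {name.strip() for name in header})


# Tiny in-memory stand-in for hotel_bookings.csv (illustrative data only).
demo = io.StringIO("hotel,is_canceled,lead_time\nResort Hotel,0,342\n")
print(missing_columns(demo))
```

Against the real file, the call would look like `with open(csv_path, newline="") as f: print(missing_columns(f))`; an empty list means every expected column is present.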


## Step 3: Load Data into MongoDB


In [None]:
# Connect to database
db = client[DB_NAME]
collection = db[COLLECTION_NAME]

# Check if collection already exists and has data
load_data = True  # Default: load unless the user keeps existing data
existing_count = collection.count_documents({})
if existing_count > 0:
    print(f"⚠ Collection already contains {existing_count} documents.")
    response = input("Do you want to drop and reload? (yes/no): ")
    if response.strip().lower() == 'yes':
        collection.drop()
        print("✓ Collection dropped")
    else:
        print("Skipping data load. Using existing data.")
        load_data = False


In [None]:
# Load data into MongoDB in chunks
if load_data:
    print("Loading data into MongoDB...")
    chunk_size = 10000
    total_inserted = 0
    
    # Read CSV in chunks
    for chunk in tqdm(pd.read_csv(csv_path, chunksize=chunk_size), desc="Loading chunks"):
        # Convert DataFrame to list of dictionaries
        records = chunk.to_dict('records')
        # Insert into MongoDB
        try:
            result = collection.insert_many(records)
            total_inserted += len(result.inserted_ids)
        except Exception as e:
            print(f"Error inserting chunk: {e}")
    
    print(f"\n✓ Successfully inserted {total_inserted} documents into MongoDB")
else:
    total_inserted = collection.count_documents({})
    print(f"Using existing {total_inserted} documents")
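
One detail worth handling in the loading step: pandas represents missing CSV values as `NaN`, and `to_dict('records')` passes them through, so they are stored in MongoDB as the floating-point value NaN rather than as null. If the CSV has missing values, a query such as `{"country": None}` will not match those documents. The sketch below shows one way to convert them before insertion; `clean_record` is a hypothetical helper, not part of the original notebook.

```python
import math


def clean_record(record):
    """Return a copy of `record` with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Illustrative record shaped like one row of chunk.to_dict('records').
records = [{"hotel": "City Hotel", "children": float("nan"), "lead_time": 12}]
cleaned = [clean_record(r) for r in records]
print(cleaned)  # [{'hotel': 'City Hotel', 'children': None, 'lead_time': 12}]
```

In the loading loop this would become `records = [clean_record(r) for r in chunk.to_dict('records')]` before the `insert_many` call.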


## Step 4: Create Indexes

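Indexes let MongoDB answer filters on a field without scanning every document in the collection. The cell below creates a single-field ascending index on each column used for filtering.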


In [None]:
# Create indexes for efficient querying
indexes = [
    ("is_canceled", 1),  # Target variable - most important for filtering
    ("hotel", 1),  # Hotel type filter
    ("arrival_date_year", 1),  # Year filter
    ("arrival_date_month", 1),  # Month filter
    ("country", 1),  # Country filter
    ("market_segment", 1),  # Market segment filter
    ("deposit_type", 1),  # Deposit type filter
    ("customer_type", 1),  # Customer type filter
]

print("Creating indexes...")
for field, direction in indexes:
    try:
        collection.create_index([(field, direction)])
        print(f"✓ Created index on {field}")
    except Exception as e:
        print(f"✗ Failed to create index on {field}: {e}")

# List all indexes
print("\nCurrent indexes:")
for index in collection.list_indexes():
    print(f"  - {index['name']}: {index.get('key', {})}")
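
To check whether a query actually uses one of these indexes, the result of pymongo's `Cursor.explain()` can be searched for an `IXSCAN` stage. The sketch below is an assumption-laden illustration: `plan_stages` and `uses_index` are hypothetical helpers, and `sample` is a hand-written dictionary shaped like an explain result, not output captured from a real server.

```python
def plan_stages(plan):
    """Collect every 'stage' value found anywhere in a nested explain document."""
    stages = []
    if isinstance(plan, dict):
        if "stage" in plan:
            stages.append(plan["stage"])
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages


def uses_index(explain_output):
    """Return True if the winning plan contains an index scan (IXSCAN) stage."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "IXSCAN" in plan_stages(winning)


# Hand-written example resembling an explain() result for an indexed query.
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "is_canceled_1"},
        }
    }
}
print(uses_index(sample))  # True
```

With a live connection, the call would be `uses_index(collection.find({"is_canceled": 1}).explain())`. A `COLLSCAN` stage in the plan instead means the whole collection was scanned.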


## Step 5: Verify Data and Basic Statistics


In [None]:
# Get collection statistics
total_docs = collection.count_documents({})
cancelled = collection.count_documents({"is_canceled": 1})
not_cancelled = collection.count_documents({"is_canceled": 0})
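# Guard against an empty collection so both percentages are 0 rather than 0% and 100%
cancellation_rate = (cancelled / total_docs * 100) if total_docs > 0 else 0
not_cancelled_rate = (not_cancelled / total_docs * 100) if total_docs > 0 else 0

print("=== Collection Statistics ===")
print(f"Total documents: {total_docs:,}")
print(f"Cancelled bookings: {cancelled:,} ({cancellation_rate:.2f}%)")
print(f"Not cancelled bookings: {not_cancelled:,} ({not_cancelled_rate:.2f}%)")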


In [None]:
# Sample a few documents to verify structure
print("\n=== Sample Documents ===")
sample_docs = list(collection.find().limit(2))
for i, doc in enumerate(sample_docs, 1):
    print(f"\nDocument {i}:")
    for key, value in list(doc.items())[:10]:  # Show first 10 fields
        print(f"  {key}: {value}")
    if len(doc) > 10:
        print(f"  ... and {len(doc) - 10} more fields")


## Step 6: MongoDB Aggregation Queries

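The queries below use MongoDB's aggregation framework (`$group`, `$project`, `$sort`, `$limit`) to compute booking counts, cancellation rates, and average lead times on the server.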


In [None]:
# Query 1: Cancellation rate by hotel type
print("=== Cancellation Rate by Hotel Type ===")
pipeline1 = [
    {
        "$group": {
            "_id": "$hotel",
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"},
            "not_cancelled": {"$sum": {"$subtract": [1, "$is_canceled"]}}
        }
    },
    {
        "$project": {
            "hotel": "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "not_cancelled": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"cancellation_rate": -1}}
]

results1 = list(collection.aggregate(pipeline1))
df_hotel = pd.DataFrame(results1)
display(df_hotel)


In [None]:
# Query 2: Cancellation rate by deposit type
print("=== Cancellation Rate by Deposit Type ===")
pipeline2 = [
    {
        "$group": {
            "_id": "$deposit_type",
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"}
        }
    },
    {
        "$project": {
            "deposit_type": "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"cancellation_rate": -1}}
]

results2 = list(collection.aggregate(pipeline2))
df_deposit = pd.DataFrame(results2)
display(df_deposit)


In [None]:
# Query 3: Top 10 countries by booking count
print("=== Top 10 Countries by Booking Count ===")
pipeline3 = [
    {
        "$group": {
            "_id": "$country",
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"}
        }
    },
    {
        "$project": {
            "country": "$_id",
            "total_bookings": 1,
            "cancelled": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"total_bookings": -1}},
    {"$limit": 10}
]

results3 = list(collection.aggregate(pipeline3))
df_countries = pd.DataFrame(results3)
display(df_countries)


In [None]:
# Query 4: Average lead time by market segment
print("=== Average Lead Time by Market Segment ===")
pipeline4 = [
    {
        "$group": {
            "_id": "$market_segment",
            "avg_lead_time": {"$avg": "$lead_time"},
            "total_bookings": {"$sum": 1},
            "cancelled": {"$sum": "$is_canceled"}
        }
    },
    {
        "$project": {
            "market_segment": "$_id",
            "avg_lead_time": {"$round": ["$avg_lead_time", 2]},
            "total_bookings": 1,
            "cancellation_rate": {
                "$multiply": [
                    {"$divide": ["$cancelled", "$total_bookings"]},
                    100
                ]
            }
        }
    },
    {"$sort": {"avg_lead_time": -1}}
]

results4 = list(collection.aggregate(pipeline4))
df_segment = pd.DataFrame(results4)
display(df_segment)
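
Queries 1 to 3 share the same `$group` and `$project` shape and differ only in the grouping field, the sort key, and an optional limit. A small builder can remove the repetition. This is a sketch, not part of the original notebook, and `cancellation_rate_pipeline` is a hypothetical name.

```python
def cancellation_rate_pipeline(group_field, sort_field="cancellation_rate", limit=None):
    """Build a pipeline that reports bookings and cancellation rate per `group_field`."""
    pipeline = [
        {
            "$group": {
                "_id": f"${group_field}",
                "total_bookings": {"$sum": 1},
                "cancelled": {"$sum": "$is_canceled"},
            }
        },
        {
            "$project": {
                group_field: "$_id",
                "total_bookings": 1,
                "cancelled": 1,
                "cancellation_rate": {
                    "$multiply": [{"$divide": ["$cancelled", "$total_bookings"]}, 100]
                },
            }
        },
        {"$sort": {sort_field: -1}},  # Descending, as in the queries above
    ]
    if limit is not None:
        pipeline.append({"$limit": limit})
    return pipeline


# Equivalent to pipeline3: top 10 countries by booking count.
top_countries = cancellation_rate_pipeline("country", sort_field="total_bookings", limit=10)
print(len(top_countries))  # 4
```

The result is passed to `collection.aggregate(...)` exactly like the hand-written pipelines; for example, `cancellation_rate_pipeline("deposit_type")` produces the same stages as `pipeline2`.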


## Step 7: Export Data for Next Steps

Export data from MongoDB to CSV for use in subsequent notebooks.


In [None]:
# Load all data from MongoDB to pandas DataFrame
print("Loading data from MongoDB...")
cursor = collection.find()
df_mongodb = pd.DataFrame(list(cursor))

# Remove MongoDB _id field
if '_id' in df_mongodb.columns:
    df_mongodb = df_mongodb.drop('_id', axis=1)

print(f"✓ Loaded {len(df_mongodb)} records")
print(f"Shape: {df_mongodb.shape}")
print(f"\nColumns: {df_mongodb.columns.tolist()}")


In [None]:
# Save to CSV for next notebook (optional - can also load directly from MongoDB)
output_path = "/content/hotel_bookings_from_mongodb.csv"
df_mongodb.to_csv(output_path, index=False)
print(f"✓ Data exported to {output_path}")

# Display summary
print("\n=== Data Summary ===")
df_mongodb.info()  # info() prints its own report and returns None
print("\n=== First few rows ===")
display(df_mongodb.head())
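
The export above builds the full DataFrame in memory with `pd.DataFrame(list(cursor))`, which is fine at this dataset's size. For larger collections, documents can be written to CSV one at a time instead. The sketch below is one possible approach using only the standard library. `write_docs_to_csv` is a hypothetical helper, it assumes every document has the same fields as the first one, and the two documents are illustrative stand-ins for `collection.find()`.

```python
import csv
import io


def write_docs_to_csv(docs, out_file, drop_fields=("_id",)):
    """Write an iterable of dict documents to CSV row by row and return the row count."""
    writer = None
    count = 0
    for doc in docs:
        row = {k: v for k, v in doc.items() if k not in drop_fields}
        if writer is None:
            # Column order comes from the first document (assumed representative).
            writer = csv.DictWriter(out_file, fieldnames=list(row), extrasaction="ignore")
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# Illustrative documents standing in for a MongoDB cursor.
docs = [
    {"_id": 1, "hotel": "City Hotel", "is_canceled": 0},
    {"_id": 2, "hotel": "Resort Hotel", "is_canceled": 1},
]
buffer = io.StringIO()
print(write_docs_to_csv(docs, buffer))  # 2
```

With a live connection this would be `with open(output_path, "w", newline="") as f: write_docs_to_csv(collection.find(), f)`, since a pymongo cursor is iterable and yields one dictionary per document.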


## Summary

- ✓ Data successfully loaded into MongoDB Atlas
- ✓ Indexes created for efficient querying
- ✓ Aggregation queries demonstrated
- ✓ Data ready for EDA and ML processing

**Next Steps**: Proceed to `02_eda_analysis.ipynb` for exploratory data analysis.
