# TravelTide Customer Segmentation Project
## 02_EDA_behavioral_analysis.ipynb

**Author:** Alberto Diaz Durana  
**Date:** October 2025  
**Purpose:** Behavioral analysis and user-level aggregation

---

## Objectives

This notebook consolidates the work from Week 1, Days 4-5: it analyzes customer behavior patterns and builds user-level aggregations.

**Mission:** Understand demographics, booking patterns, and temporal trends, then aggregate session-level data into user-level metrics.

## Business Context

**From Notebook 01:** We established a cohort of 5,765 qualified users with 43,344 clean sessions.

**Now:** Analyze behavioral patterns to identify segmentation opportunities and prepare user-level features.

**Key Questions:**
- Who are our customers? (demographics)
- What do they book? (flights, hotels, packages)
- When do they book? (seasonality, timing)
- How engaged are they? (session patterns, conversion)

**Deliverables:**
- Demographic insights
- Booking behavior analysis
- Temporal pattern identification
- User-level aggregated dataset
- Preliminary CLV analysis

**Outputs:**
- `../data/processed/user_base_complete.csv`
- `../data/results/eda/eda_summary_stats.csv`
- `../outputs/figures/eda/` (8-10 visualizations)

---

## 1. Setup & Load Cohort Data

In [5]:
import os
import warnings
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Suppress all warnings to keep cell output readable
# (note: this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Visualization settings
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# Path constants
DATA_RAW = '../data/raw/'
DATA_PROCESSED = '../data/processed/'
DATA_RESULTS_EDA = '../data/results/eda/'
FIGURES_EDA = '../outputs/figures/eda/'

# Ensure output directories exist
os.makedirs(DATA_PROCESSED, exist_ok=True)
os.makedirs(DATA_RESULTS_EDA, exist_ok=True)
os.makedirs(FIGURES_EDA, exist_ok=True)

print(f"02_EDA_behavioral_analysis.ipynb - {datetime.now().strftime('%Y-%m-%d')}")
print("="*80)


02_EDA_behavioral_analysis.ipynb - 2025-10-25
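
The setup cell builds file paths by concatenating strings. As an alternative, the sketch below creates the same directory layout with `pathlib`. The helper `make_project_dirs` is hypothetical and not part of the original notebook, and a temporary directory stands in for the project root so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def make_project_dirs(root: Path) -> dict:
    """Create the notebook's output directories under `root` and return them by name.

    Hypothetical helper for illustration; the directory names mirror the
    path constants defined in the setup cell.
    """
    dirs = {
        "processed": root / "data" / "processed",
        "results_eda": root / "data" / "results" / "eda",
        "figures_eda": root / "outputs" / "figures" / "eda",
    }
    for path in dirs.values():
        # parents=True creates intermediate folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    return dirs


with tempfile.TemporaryDirectory() as tmp:
    dirs = make_project_dirs(Path(tmp))
    # Path objects join with "/" and handle separators themselves
    cohort_path = dirs["results_eda"] / "eda_cohort_qualified.csv"
    all_exist = all(p.is_dir() for p in dirs.values())
```

With `Path` objects, a missing or doubled trailing slash in a constant can no longer produce a malformed file path.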


In [6]:
print("LOADING COHORT DATA FROM NOTEBOOK 01")
print("="*80)

# Load cleaned cohort from notebook 01
session_base = pd.read_csv(f'{DATA_RESULTS_EDA}eda_cohort_qualified.csv')

# Convert date columns
date_columns = ['birthdate', 'sign_up_date', 'session_start', 'session_end', 
                'departure_time', 'return_time', 'check_in_time', 'check_out_time']
for col in date_columns:
    if col in session_base.columns:
        session_base[col] = pd.to_datetime(session_base[col])

print("\nOK: Cohort data loaded")
print(f"  Shape: {session_base.shape}")
print(f"  Sessions: {len(session_base):,}")
print(f"  Users: {session_base['user_id'].nunique():,}")
print(f"  Date range: {session_base['session_start'].min().date()} to {session_base['session_start'].max().date()}")

# Quick validation
print("\nData Validation:")
print(f"  Missing user_id: {session_base['user_id'].isna().sum()}")
print(f"  Missing session_id: {session_base['session_id'].isna().sum()}")
print(f"  Age range: {session_base['age'].min():.0f}-{session_base['age'].max():.0f} years")

print("\n" + "="*80)

LOADING COHORT DATA FROM NOTEBOOK 01

OK: Cohort data loaded
  Shape: (43344, 41)
  Sessions: 43,344
  Users: 5,765
  Date range: 2023-01-04 to 2023-07-24

Data Validation:
  Missing user_id: 0
  Missing session_id: 0
  Age range: 18-88 years
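
The printed counts can also be checked programmatically against the figures carried over from Notebook 01 (5,765 users, 43,344 sessions). The following is a minimal sketch, not the notebook's own method: `check_cohort` is a hypothetical helper, the uniqueness check on `session_id` is an assumption about the cleaned data, and the frame used here is a toy example.

```python
import pandas as pd


def check_cohort(sessions: pd.DataFrame, expected_users: int, expected_sessions: int) -> dict:
    """Summarize a session-level frame and compare it with expected counts.

    Hypothetical helper; assumes one row per session with `user_id`
    and `session_id` columns.
    """
    summary = {
        "n_sessions": len(sessions),
        "n_users": int(sessions["user_id"].nunique()),
        # Assumption: a clean cohort has no repeated session IDs
        "duplicate_session_ids": int(sessions["session_id"].duplicated().sum()),
    }
    summary["matches_expected"] = bool(
        summary["n_sessions"] == expected_sessions
        and summary["n_users"] == expected_users
        and summary["duplicate_session_ids"] == 0
    )
    return summary


# Toy data: two users, three sessions
toy_sessions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "session_id": ["a", "b", "c"],
})
result = check_cohort(toy_sessions, expected_users=2, expected_sessions=3)
```

In this notebook the call would look like `check_cohort(session_base, expected_users=5765, expected_sessions=43344)`.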



## 2. Demographic Analysis

This section profiles our customers by age, gender, family status, location, and tenure.
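
Collapsing sessions to one row per user can be done in more than one way, and the choice matters when user attributes contain missing values. The sketch below uses a toy frame with made-up values to contrast `groupby('user_id').first()`, which returns the first non-null value in each column, with `drop_duplicates`, which keeps one actual row per user.

```python
import numpy as np
import pandas as pd

# Toy data: user 1 has a missing gender in their earliest session
sessions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "session_start": pd.to_datetime(["2023-01-05", "2023-02-01", "2023-01-10"]),
    "gender": [np.nan, "F", "M"],
})

# GroupBy.first() skips nulls column by column, so the result can mix
# values taken from different sessions of the same user
first_non_null = sessions.groupby("user_id").first().reset_index()

# drop_duplicates keeps the whole earliest row, nulls included
first_row = (
    sessions.sort_values("session_start")
    .drop_duplicates("user_id", keep="first")
    .reset_index(drop=True)
)

gender_from_groupby = first_non_null.loc[first_non_null["user_id"] == 1, "gender"].iloc[0]
gender_from_row = first_row.loc[first_row["user_id"] == 1, "gender"].iloc[0]
```

Here `gender_from_groupby` is `"F"` while `gender_from_row` is missing. If user attributes are identical across all of a user's sessions, both approaches give the same table.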

In [7]:
print("DEMOGRAPHIC ANALYSIS")
print("="*80)

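# Get unique users (one row per user).
# Note: GroupBy.first() returns the first non-null value per column, so a user's
# row may combine values from different sessions. This is harmless only if
# user attributes are constant across that user's sessions.
users = session_base.groupby('user_id').first().reset_index()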

print(f"\nTotal Users: {len(users):,}")

# Age distribution
print("\nAGE DISTRIBUTION:")
print("-" * 80)
print(f"  Mean: {users['age'].mean():.1f} years")
print(f"  Median: {users['age'].median():.0f} years")
print(f"  Std Dev: {users['age'].std():.1f} years")
print(f"  Range: {users['age'].min():.0f}-{users['age'].max():.0f} years")

# Age groups
age_bins = [18, 25, 35, 45, 55, 65, 100]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
users['age_group'] = pd.cut(users['age'], bins=age_bins, labels=age_labels, right=False)

print("\nAge Group Distribution:")
age_dist = users['age_group'].value_counts().sort_index()
for group, count in age_dist.items():
    pct = count / len(users) * 100
    print(f"  {group}: {count:,} ({pct:.1f}%)")

# Gender distribution
print("\nGENDER DISTRIBUTION:")
print("-" * 80)
gender_dist = users['gender'].value_counts()
for gender, count in gender_dist.items():
    pct = count / len(users) * 100
    print(f"  {gender}: {count:,} ({pct:.1f}%)")

# Family status
print("\nFAMILY STATUS:")
print("-" * 80)
married_count = users['married'].sum()
children_count = users['has_children'].sum()
print(f"  Married: {married_count:,} ({married_count/len(users)*100:.1f}%)")
print(f"  Has Children: {children_count:,} ({children_count/len(users)*100:.1f}%)")

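# Tenure (days since signup), measured against a fixed reference date.
# Note: sessions run to 2023-07-24, so users who signed up after 2023-04-30
# get negative tenure (see the minimum of the range in the output).
users['days_since_signup'] = (pd.Timestamp('2023-04-30') - users['sign_up_date']).dt.days
users['years_active'] = users['days_since_signup'] / 365.25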

print("\nTENURE (Account Age):")
print("-" * 80)
print(f"  Mean: {users['years_active'].mean():.2f} years")
print(f"  Median: {users['years_active'].median():.2f} years")
print(f"  Range: {users['years_active'].min():.2f}-{users['years_active'].max():.2f} years")

print("\n" + "="*80)

DEMOGRAPHIC ANALYSIS

Total Users: 5,765

AGE DISTRIBUTION:
--------------------------------------------------------------------------------
  Mean: 42.3 years
  Median: 42 years
  Std Dev: 11.2 years
  Range: 18-88 years

Age Group Distribution:
  18-24: 380 (6.6%)
  25-34: 918 (15.9%)
  35-44: 2,085 (36.2%)
  45-54: 1,724 (29.9%)
  55-64: 416 (7.2%)
  65+: 242 (4.2%)

GENDER DISTRIBUTION:
--------------------------------------------------------------------------------
  F: 5,081 (88.1%)
  M: 674 (11.7%)
  O: 10 (0.2%)

FAMILY STATUS:
--------------------------------------------------------------------------------
  Married: 2,642 (45.8%)
  Has Children: 1,901 (33.0%)

TENURE (Account Age):
--------------------------------------------------------------------------------
  Mean: 0.27 years
  Median: 0.28 years
  Range: -0.05-1.77 years
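
The negative minimum in the tenure range follows directly from the formula: tenure is the reference date minus the signup date, so any signup after the reference date yields a negative value. The sketch below reproduces this on three made-up signup dates and flags the affected rows. The helper `tenure_years` is hypothetical; how such users should be handled (a later reference date, clipping, or exclusion) is left open here.

```python
import pandas as pd


def tenure_years(sign_up_dates: pd.Series, reference_date: pd.Timestamp) -> pd.Series:
    """Account age in years at `reference_date` (negative if signup came later)."""
    days = (reference_date - sign_up_dates).dt.days
    return days / 365.25


# Toy signup dates: one year before, on, and ten days after the reference date
sign_ups = pd.Series(pd.to_datetime(["2022-04-30", "2023-04-30", "2023-05-10"]))
reference = pd.Timestamp("2023-04-30")

tenure = tenure_years(sign_ups, reference)

# Flag users whose signup falls after the reference date
signed_up_after_reference = tenure < 0
n_negative = int(signed_up_after_reference.sum())
```

Applied to the `users` table, `signed_up_after_reference.sum()` would show how many accounts are affected before tenure is used as a feature.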

