# Exploratory Data Analysis

In this notebook, we will perform exploratory data analysis (EDA) on the airline discount dataset. The goal is to understand the data better and derive insights that can help in generating customized discounts for airline routes based on passenger travel history.

In [4]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style='whitegrid')

In [6]:
# Connect to the database
import sys
from pathlib import Path

# Add project root to path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

from src.data.database import get_connection

# Create database connection
db = get_connection()
database_connection = db.connection

print("✓ Connected to database successfully")

✓ Database connection successful: /Users/maria/airlst-github-copilot-training/airline-discount-ml/data/airline_discount.db
✓ Connected to database successfully


In [7]:
# Load the dataset from the passengers table
# Available tables: passengers, routes, discounts
data = pd.read_sql_query('SELECT * FROM passengers', con=database_connection)

# Display the first few rows of the dataset
data.head()

|   | id | name | travel_history | created_at |
|---|----|------|----------------|------------|
| 0 | 1 | John Smith | {"flights": 10, "miles": 5000} | 2025-10-12 20:09:22 |
| 1 | 2 | Jane Doe | {"flights": 25, "miles": 15000} | 2025-10-12 20:09:22 |
| 2 | 3 | Bob Johnson | {"flights": 5, "miles": 2500} | 2025-10-12 20:09:22 |
| 3 | 4 | John Smith | {"flights": 10, "miles": 5000} | 2025-10-12 20:11:06 |
| 4 | 5 | Jane Doe | {"flights": 25, "miles": 15000} | 2025-10-12 20:11:06 |
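
The `travel_history` column in the output above holds JSON strings, so its `flights` and `miles` values are not directly usable as numeric features. A minimal sketch of one way to expand them into columns follows. The `passengers` DataFrame is a hand-built stand-in that mirrors the first three rows shown above, and the `miles_per_flight` feature is an illustrative assumption, not part of the original notebook.

```python
import json

import pandas as pd

# Stand-in sample mirroring the first three passengers rows shown above
passengers = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["John Smith", "Jane Doe", "Bob Johnson"],
    "travel_history": [
        '{"flights": 10, "miles": 5000}',
        '{"flights": 25, "miles": 15000}',
        '{"flights": 5, "miles": 2500}',
    ],
})

# Decode each JSON string into a dict, then expand the dicts into columns
history = pd.json_normalize(passengers["travel_history"].apply(json.loads).tolist())
passengers_expanded = pd.concat(
    [passengers.drop(columns="travel_history"), history], axis=1
)

# Illustrative derived feature (an assumption): average miles per flight
passengers_expanded["miles_per_flight"] = (
    passengers_expanded["miles"] / passengers_expanded["flights"]
)
print(passengers_expanded)
```

In the notebook itself, the same two steps could be applied to `data` in place of the stand-in DataFrame, provided every `travel_history` value is valid JSON.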


In [None]:
# Explore the data structure
print("Dataset shape:", data.shape)
print("\nColumn names:", data.columns.tolist())
print("\nData types:")
print(data.dtypes)
print("\nFirst few rows:")
data.head()

In [8]:
# Join discounts with passengers and routes into a single dataset
query = """
SELECT 
    p.id as passenger_id,
    p.name as passenger_name,
    p.travel_history,
    r.origin,
    r.destination,
    r.distance,
    d.discount_value
FROM discounts d
JOIN passengers p ON d.passenger_id = p.id
JOIN routes r ON d.route_id = r.id
"""

discount_data = pd.read_sql_query(query, con=database_connection)
print(f"Loaded {len(discount_data)} discount records")
discount_data.head()

Loaded 6 discount records


|   | passenger_id | passenger_name | travel_history | origin | destination | distance | discount_value |
|---|--------------|----------------|----------------|--------|-------------|----------|----------------|
| 0 | 1 | John Smith | {"flights": 10, "miles": 5000} | New York | London | 3459.0 | 15.0 |
| 1 | 2 | Jane Doe | {"flights": 25, "miles": 15000} | Los Angeles | Tokyo | 5478.0 | 25.0 |
| 2 | 3 | Bob Johnson | {"flights": 5, "miles": 2500} | San Francisco | Paris | 5558.0 | 10.0 |
| 3 | 1 | John Smith | {"flights": 10, "miles": 5000} | New York | London | 3459.0 | 15.0 |
| 4 | 2 | Jane Doe | {"flights": 25, "miles": 15000} | Los Angeles | Tokyo | 5478.0 | 25.0 |
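
In the output above, rows 0 and 3 are identical, as are rows 1 and 4, which suggests the discount records may have been loaded more than once. Repeated rows would inflate counts in the plots and statistics that follow, so it may be worth checking for them first. The sketch below uses a hand-built `discount_sample` DataFrame that mirrors the five displayed rows; whether duplicates should actually be dropped depends on how the table was populated, which the notebook does not state.

```python
import pandas as pd

# Stand-in sample mirroring the five joined rows displayed above
discount_sample = pd.DataFrame({
    "passenger_id": [1, 2, 3, 1, 2],
    "passenger_name": ["John Smith", "Jane Doe", "Bob Johnson", "John Smith", "Jane Doe"],
    "travel_history": [
        '{"flights": 10, "miles": 5000}',
        '{"flights": 25, "miles": 15000}',
        '{"flights": 5, "miles": 2500}',
        '{"flights": 10, "miles": 5000}',
        '{"flights": 25, "miles": 15000}',
    ],
    "origin": ["New York", "Los Angeles", "San Francisco", "New York", "Los Angeles"],
    "destination": ["London", "Tokyo", "Paris", "London", "Tokyo"],
    "distance": [3459.0, 5478.0, 5558.0, 3459.0, 5478.0],
    "discount_value": [15.0, 25.0, 10.0, 15.0, 25.0],
})

# Count rows that exactly repeat an earlier row
n_dupes = discount_sample.duplicated().sum()
print(f"Exact duplicate rows: {n_dupes}")

# Show every row that takes part in a duplicate group
print(discount_sample[discount_sample.duplicated(keep=False)])

# Keep the first occurrence of each distinct row
deduped = discount_sample.drop_duplicates().reset_index(drop=True)
print(f"Rows after de-duplication: {len(deduped)}")
```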


In [None]:
# Visualize the distribution of discount values
plt.figure(figsize=(10, 6))
sns.histplot(discount_data['discount_value'], bins=20, kde=True, color='steelblue')
plt.title('Distribution of Discount Values')
plt.xlabel('Discount Value (%)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Analyze relationship between distance and discount
plt.figure(figsize=(10, 6))
sns.scatterplot(data=discount_data, x='distance', y='discount_value', s=100, alpha=0.7)
plt.title('Route Distance vs Discount Value')
plt.xlabel('Distance (miles)')
plt.ylabel('Discount Value (%)')
plt.grid(True, alpha=0.3)
plt.show()

# Show summary statistics
print("\nDiscount Statistics by Route:")
print(discount_data.groupby(['origin', 'destination'])['discount_value'].describe())
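
The scatter plot shows the distance relationship visually; a correlation table gives a rough numeric counterpart and also lets travel history be compared against the discount. The sketch below is an illustration under stated assumptions: `sample` is a hand-built DataFrame with the three distinct records from the joined output, and the `flights` and `miles` columns are assumed to have already been decoded from `travel_history`. With so few rows the coefficients are only indicative.

```python
import pandas as pd

# Stand-in sample: the three distinct records from the joined output,
# with flights and miles assumed to be already parsed from travel_history
sample = pd.DataFrame({
    "flights": [10, 25, 5],
    "miles": [5000, 15000, 2500],
    "distance": [3459.0, 5478.0, 5558.0],
    "discount_value": [15.0, 25.0, 10.0],
})

# Pearson correlation between every pair of numeric columns
corr = sample.corr()

# Correlation of each feature with the discount value
print(corr["discount_value"].drop("discount_value").round(2))
```

The same `corr` matrix could be passed to `sns.heatmap` if a visual summary is preferred.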

## Conclusion

This exploratory data analysis examined the distribution of discount values and the relationship between route distance and discount value. With only six discount records loaded, any patterns are preliminary. Further analysis, particularly of the passenger travel history, can be conducted to refine the discount generation strategy.