---
## 1. Setup & Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ Libraries loaded successfully!")

In [None]:
# Load the dataset
# TODO: Update the file path if needed
data_path = '../DATA/airline_delays_2022_2024.csv'

df = pd.read_csv(data_path)

print("✅ Dataset loaded!")
print(f"   Total records: {len(df):,}")
print(f"   Columns: {len(df.columns)}")
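
The cells that follow assume certain columns exist in the CSV. A minimal sketch of an early check is shown here, assuming the column names referenced later in this notebook (`CarrierName`, `Year`, `Month`, `ArrDelay15`, `ArrDelayMinutes`, `Cancelled`). The helper `find_missing_columns` and the inline sample CSV are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

# Columns referenced later in this notebook
REQUIRED_COLUMNS = ("CarrierName", "Year", "Month",
                    "ArrDelay15", "ArrDelayMinutes", "Cancelled")

def find_missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Tiny made-up CSV standing in for the real file; it omits ArrDelayMinutes on purpose
sample_csv = io.StringIO(
    "CarrierName,Year,Month,ArrDelay15,Cancelled\n"
    "Carrier A,2022,1,0,0\n"
    "Carrier B,2022,1,1,0\n"
)
sample_df = pd.read_csv(sample_csv)

missing_cols = find_missing_columns(sample_df)
print(f"Missing columns: {missing_cols}")
```

On the real data, `find_missing_columns(df)` should return an empty list before you continue.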

---
## 2. Data Exploration

Before analyzing, let's understand the structure and contents of our dataset.

In [None]:
# Display first few rows
print("📊 First 5 rows of the dataset:")
df.head()

In [None]:
# Dataset information
print("📋 Dataset Info:")
df.info()

In [None]:
# Check for missing values
print("🔍 Missing Values by Column:")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

In [None]:
# Explore key variables
print("✈️ Airlines in Dataset:")
print(df['CarrierName'].value_counts())
print(f"\nTotal airlines: {df['CarrierName'].nunique()}")

In [None]:
# Check date range
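print("📅 Date Range:")
# Combine year and month so each month is read from its own year
# (taking min/max of Year and Month separately can pair the wrong month with a year)
period = df['Year'] * 100 + df['Month']
print(f"   From: {period.min() // 100}-{period.min() % 100:02d}")
print(f"   To: {period.max() // 100}-{period.max() % 100:02d}")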

In [None]:
# Explore the key delay indicator
print("⏰ ArrDelay15 Distribution (BTS On-Time Standard):")
print(df['ArrDelay15'].value_counts())
print(f"\nOverall delay rate: {df['ArrDelay15'].mean() * 100:.2f}%")

In [None]:
# Check cancellation rate
print("❌ Cancellation Statistics:")
print(df['Cancelled'].value_counts())
print(f"\nOverall cancellation rate: {df['Cancelled'].mean() * 100:.2f}%")

---
## 3. Data Cleaning & Preparation

### Decision: How to handle cancelled flights?

**Option A:** Exclude cancelled flights from delay calculations (they didn't arrive, so no arrival delay).  
**Option B:** Count cancellations separately and report cancellation rate alongside delay rate.

For this analysis, we'll use **Option B**: exclude cancelled flights from the delay metrics (as in Option A) and report the cancellation rate separately.
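
To see why the choice matters, here is a small sketch on made-up data. It assumes cancelled rows carry a missing `ArrDelay15`, which you should confirm in the missing-values check above. The frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up flights: two of five are cancelled and so have no arrival-delay flag
toy = pd.DataFrame({
    "ArrDelay15": [1, 0, 0, np.nan, np.nan],
    "Cancelled":  [0, 0, 0, 1, 1],
})

# Delay rate over completed flights only (the approach used in this notebook)
completed = toy[toy["Cancelled"] == 0]
delay_rate_completed = completed["ArrDelay15"].mean() * 100

# Treating cancelled flights as "not delayed" dilutes the delay rate
delay_rate_diluted = toy["ArrDelay15"].fillna(0).mean() * 100

# Cancellation rate is computed over all flights
cancellation_rate = toy["Cancelled"].mean() * 100

print(f"Delay rate (completed only): {delay_rate_completed:.1f}%")
print(f"Delay rate (cancelled counted as on time): {delay_rate_diluted:.1f}%")
print(f"Cancellation rate: {cancellation_rate:.1f}%")
```

Keeping the two rates separate avoids letting a high cancellation count make an airline's delay rate look better than it is.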

In [None]:
# Create a subset of completed (non-cancelled) flights
df_completed = df[df['Cancelled'] == 0].copy()

print("📊 Filtering Results:")
print(f"   Total flights: {len(df):,}")
print(f"   Cancelled flights: {(df['Cancelled'] == 1).sum():,}")
print(f"   Completed flights: {len(df_completed):,}")
print(f"   Cancellation rate: {(df['Cancelled'].mean() * 100):.2f}%")

In [None]:
# TODO: Add any additional data cleaning steps here
# Examples:
# - Handle missing values in specific columns
# - Filter to specific time periods if needed
# - Create additional derived variables

# STUDENT: Add your data cleaning code here
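
As a starting point only, the three example steps listed above might look like the sketch below. It runs on a made-up frame (`toy_completed`) rather than the real `df_completed`, and the `Quarter` column and the 2022 to 2024 filter are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for df_completed
toy_completed = pd.DataFrame({
    "CarrierName": ["Carrier A", "Carrier A", "Carrier B", "Carrier B"],
    "Year": [2021, 2022, 2023, 2024],
    "Month": [12, 1, 7, 10],
    "ArrDelay15": [0, 1, np.nan, 0],
})

# 1. Drop rows with no delay flag, since they cannot count toward the delay rate
cleaned = toy_completed.dropna(subset=["ArrDelay15"])

# 2. Keep only the study period
cleaned = cleaned[cleaned["Year"].between(2022, 2024)].copy()

# 3. Derive a calendar quarter from the month (1-3 -> 1, 4-6 -> 2, ...)
cleaned["Quarter"] = (cleaned["Month"] - 1) // 3 + 1

print(cleaned)
```

Whatever steps you choose, record how many rows each one removes so the effect on the final metrics is traceable.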

---
## 4. Airline-Level Metrics Calculation

Now we'll compute key metrics for each airline:
1. **Delay Rate**: % of flights with `ArrDelay15 = 1`
2. **Cancellation Rate**: % of flights cancelled
3. **Average Delay** (for delayed flights only)
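
Written out for an airline $a$, the three metrics are as follows. The symbols are notation introduced here for convenience and do not come from the dataset: $N_a$ is all of the airline's flights, $X_a$ the cancelled ones, $C_a$ the completed flights with a recorded `ArrDelay15`, $\mathcal{D}_a$ the set of completed flights with `ArrDelay15 = 1`, $D_a$ the size of that set, and $m_i$ the `ArrDelayMinutes` of flight $i$.

$$
\text{DelayRate}_a = 100 \times \frac{D_a}{C_a}, \qquad
\text{CancellationRate}_a = 100 \times \frac{X_a}{N_a}, \qquad
\text{AvgDelay}_a = \frac{1}{D_a} \sum_{i \in \mathcal{D}_a} m_i
$$

Note that the delay rate and the cancellation rate use different denominators: completed flights for the first, all flights for the second.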

### Metric 1: Delay Rate

In [None]:
# Calculate delay rate by airline (using completed flights only)
delay_metrics = df_completed.groupby('CarrierName').agg(
    total_flights=('ArrDelay15', 'count'),
    delayed_flights=('ArrDelay15', 'sum')
).reset_index()

# Calculate delay rate percentage
delay_metrics['delay_rate_pct'] = (
    delay_metrics['delayed_flights'] / delay_metrics['total_flights'] * 100
)

# Sort by worst performers
delay_metrics_sorted = delay_metrics.sort_values('delay_rate_pct', ascending=False)

print("📊 Delay Rate by Airline (Sorted by Worst Performer):")
print(delay_metrics_sorted.to_string(index=False))

### Metric 2: Cancellation Rate

In [None]:
# Calculate cancellation rate by airline (using ALL flights)
cancellation_metrics = df.groupby('CarrierName').agg(
    total_flights=('Cancelled', 'count'),
    cancelled_flights=('Cancelled', 'sum')
).reset_index()

cancellation_metrics['cancellation_rate_pct'] = (
    cancellation_metrics['cancelled_flights'] / cancellation_metrics['total_flights'] * 100
)

cancellation_metrics_sorted = cancellation_metrics.sort_values('cancellation_rate_pct', ascending=False)

print("📊 Cancellation Rate by Airline (Sorted by Worst Performer):")
print(cancellation_metrics_sorted.to_string(index=False))

### Metric 3: Average Delay Magnitude (for delayed flights)

In [None]:
# Calculate average delay for flights that were actually delayed
df_delayed_only = df_completed[df_completed['ArrDelay15'] == 1].copy()

avg_delay_metrics = df_delayed_only.groupby('CarrierName').agg(
    avg_delay_minutes=('ArrDelayMinutes', 'mean'),
    median_delay_minutes=('ArrDelayMinutes', 'median'),
    max_delay_minutes=('ArrDelayMinutes', 'max')
).reset_index()

avg_delay_metrics_sorted = avg_delay_metrics.sort_values('avg_delay_minutes', ascending=False)

print("📊 Average Delay Magnitude for Delayed Flights:")
print(avg_delay_metrics_sorted.to_string(index=False))

### Combine All Metrics into Scorecard

In [None]:
# Merge all metrics into one comprehensive scorecard
scorecard = delay_metrics[['CarrierName', 'total_flights', 'delay_rate_pct']].copy()

# Add cancellation rate
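scorecard = scorecard.merge(
    cancellation_metrics[['CarrierName', 'cancellation_rate_pct']],
    on='CarrierName',
    how='left'  # keep every airline, even if a metric is missing
)

# Add average delay
scorecard = scorecard.merge(
    avg_delay_metrics[['CarrierName', 'avg_delay_minutes']],
    on='CarrierName',
    how='left'  # airlines with no delayed flights get NaN instead of being dropped
)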

# Sort by delay rate
scorecard = scorecard.sort_values('delay_rate_pct', ascending=False)

# Round for readability
scorecard['delay_rate_pct'] = scorecard['delay_rate_pct'].round(2)
scorecard['cancellation_rate_pct'] = scorecard['cancellation_rate_pct'].round(2)
scorecard['avg_delay_minutes'] = scorecard['avg_delay_minutes'].round(1)

print("\n" + "="*80)
print("✈️ COMPREHENSIVE AIRLINE DELAY SCORECARD (2022-2024)")
print("="*80)
print(scorecard.to_string(index=False))
print("="*80)

In [None]:
# TODO: Compute at least one additional metric
# Ideas:
# - % of flights with extreme delays (≥60 minutes)
# - On-time rate (inverse of delay rate)
# - Standard deviation of delays
# - Delay rate by month or season

# STUDENT: Add your additional metric calculation here
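
One of the ideas above, the share of flights with extreme delays, could be sketched as follows. The frame `toy_completed` is made-up data standing in for `df_completed`, and the constant `EXTREME_THRESHOLD_MIN` is a name introduced here for the 60-minute cutoff.

```python
import pandas as pd

# Made-up completed flights standing in for df_completed
toy_completed = pd.DataFrame({
    "CarrierName": ["Carrier A"] * 4 + ["Carrier B"] * 4,
    "ArrDelayMinutes": [0, 20, 75, 130, 5, 0, 61, 10],
})

EXTREME_THRESHOLD_MIN = 60  # assumed cutoff, taken from the idea list above

# Flag each flight, then average the flag per carrier to get a share
extreme_metrics = (
    toy_completed
    .assign(is_extreme=toy_completed["ArrDelayMinutes"] >= EXTREME_THRESHOLD_MIN)
    .groupby("CarrierName")
    .agg(total_flights=("is_extreme", "count"),
         extreme_delay_rate_pct=("is_extreme", "mean"))
    .reset_index()
)
extreme_metrics["extreme_delay_rate_pct"] *= 100

print(extreme_metrics.to_string(index=False))
```

The result can be merged into `scorecard` on `CarrierName` in the same way as the other metrics.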

---
## 5. Visualizations

Create clear, professional charts to communicate your findings.

### Visualization 1: Delay Rate by Airline

In [None]:
# Bar chart: Delay rate by airline
plt.figure(figsize=(12, 6))

# Sort for visualization (worst at top)
scorecard_sorted = scorecard.sort_values('delay_rate_pct', ascending=True)

# Create horizontal bar chart
bars = plt.barh(scorecard_sorted['CarrierName'], scorecard_sorted['delay_rate_pct'], 
                color='steelblue', edgecolor='black')

# Highlight worst performer in red
# (rows are sorted ascending, so the worst performer is the last bar)
bars[-1].set_color('crimson')

# Labels and formatting
plt.xlabel('Delay Rate (%)', fontsize=12, fontweight='bold')
plt.ylabel('Airline', fontsize=12, fontweight='bold')
plt.title('Which Airline Is Most Likely to Make You Late?\nDelay Rate by Carrier (2022-2024)', 
          fontsize=14, fontweight='bold', pad=20)

# Add value labels on bars
for i, (idx, row) in enumerate(scorecard_sorted.iterrows()):
    plt.text(row['delay_rate_pct'] + 0.3, i, f"{row['delay_rate_pct']:.1f}%", 
             va='center', fontsize=10)

plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n🔴 Worst Performer: {scorecard.iloc[0]['CarrierName']} ({scorecard.iloc[0]['delay_rate_pct']}% delay rate)")
print(f"🟢 Best Performer: {scorecard.iloc[-1]['CarrierName']} ({scorecard.iloc[-1]['delay_rate_pct']}% delay rate)")

### Visualization 2: Cancellation Rate by Airline

In [None]:
# Bar chart: Cancellation rate by airline
plt.figure(figsize=(12, 6))

scorecard_sorted_cancel = scorecard.sort_values('cancellation_rate_pct', ascending=True)

bars = plt.barh(scorecard_sorted_cancel['CarrierName'], scorecard_sorted_cancel['cancellation_rate_pct'], 
                color='coral', edgecolor='black')

plt.xlabel('Cancellation Rate (%)', fontsize=12, fontweight='bold')
plt.ylabel('Airline', fontsize=12, fontweight='bold')
plt.title('Flight Cancellation Rate by Carrier (2022-2024)', 
          fontsize=14, fontweight='bold', pad=20)

for i, (idx, row) in enumerate(scorecard_sorted_cancel.iterrows()):
    plt.text(row['cancellation_rate_pct'] + 0.05, i, f"{row['cancellation_rate_pct']:.2f}%", 
             va='center', fontsize=10)

plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

### Visualization 3: Combined Scatter Plot

In [None]:
# Scatter plot: Delay rate vs. Cancellation rate
plt.figure(figsize=(10, 7))

plt.scatter(scorecard['delay_rate_pct'], scorecard['cancellation_rate_pct'], 
            s=200, alpha=0.6, c='steelblue', edgecolors='black', linewidth=1.5)

# Add airline labels
for idx, row in scorecard.iterrows():
    plt.annotate(row['CarrierName'], 
                 (row['delay_rate_pct'], row['cancellation_rate_pct']),
                 xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.xlabel('Delay Rate (%)', fontsize=12, fontweight='bold')
plt.ylabel('Cancellation Rate (%)', fontsize=12, fontweight='bold')
plt.title('Airline Reliability: Delay Rate vs. Cancellation Rate (2022-2024)', 
          fontsize=14, fontweight='bold', pad=20)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
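
To put a number on the relationship the scatter plot shows, one option is a Pearson correlation between the two rates. The sketch below uses a made-up `toy_scorecard`; on the real data you would pass the corresponding `scorecard` columns instead.

```python
import pandas as pd

# Made-up scorecard values, for illustration only
toy_scorecard = pd.DataFrame({
    "CarrierName": ["Carrier A", "Carrier B", "Carrier C", "Carrier D"],
    "delay_rate_pct": [15.0, 18.0, 22.0, 25.0],
    "cancellation_rate_pct": [1.0, 1.6, 1.4, 2.4],
})

# Pearson correlation between the two rates (Series.corr defaults to Pearson)
pearson_r = toy_scorecard["delay_rate_pct"].corr(toy_scorecard["cancellation_rate_pct"])

print(f"Pearson r = {pearson_r:.2f} across {len(toy_scorecard)} carriers")
```

With only a handful of carriers, a single airline can swing the coefficient, so treat it as a rough summary rather than strong evidence.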

In [None]:
# TODO: Create at least one additional visualization
# Ideas:
# - Average delay magnitude by airline (bar chart)
# - Delay rate over time (line chart by month/year)
# - Stacked bar chart showing delay causes by airline

# STUDENT: Add your additional visualization here
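
As one possible starting point for the "delay rate over time" idea, the sketch below aggregates by year and month and draws a line chart. It uses a made-up frame (`toy_completed`) in place of `df_completed`; the `period` label column is introduced here for the x-axis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up completed flights standing in for df_completed
toy_completed = pd.DataFrame({
    "Year":       [2022, 2022, 2022, 2022, 2023, 2023],
    "Month":      [1, 1, 2, 2, 1, 1],
    "ArrDelay15": [1, 0, 0, 0, 1, 1],
})

# Share of delayed flights per (Year, Month), as a percentage
monthly = (
    toy_completed.groupby(["Year", "Month"])["ArrDelay15"]
    .mean()
    .mul(100)
    .reset_index(name="delay_rate_pct")
)

# Build a sortable "YYYY-MM" label for the x-axis
monthly["period"] = (
    monthly["Year"].astype(str) + "-" + monthly["Month"].astype(str).str.zfill(2)
)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(monthly["period"], monthly["delay_rate_pct"], marker="o", color="steelblue")
ax.set_xlabel("Month")
ax.set_ylabel("Delay Rate (%)")
ax.set_title("Delay Rate Over Time (toy data)")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
plt.show()
```

Adding `CarrierName` to the `groupby` keys and plotting one line per airline would extend this to a per-carrier comparison.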

---
## 6. Comparison to BTS Official Rankings

Now let's validate our findings by comparing them with the Bureau of Transportation Statistics' (BTS) official annual rankings.

**Instructions:**
1. Open `SUPPLEMENTAL_MATERIALS/BTS_Annual_Rankings_2022_2024.pdf`
2. Find the on-time performance table for 2022-2024 (or the most recent year available)
3. Compare your top/bottom airlines to BTS's official rankings

In [None]:
# Display your ranking for easy comparison
print("📊 YOUR COMPUTED RANKING (by delay rate, worst to best):")
print("="*60)
# scorecard is already sorted worst to best, so the loop position is the rank
for rank, (_, row) in enumerate(scorecard.iterrows(), start=1):
    print(f"{rank}. {row['CarrierName']:<25} {row['delay_rate_pct']:.2f}%")
print("="*60)

### TODO: Manual Comparison

**STUDENT: Complete the following based on the BTS rankings document:**

1. **BTS Official Ranking (worst to best):**
   - List the airlines as they appear in BTS's official table
   - [FILL IN HERE]

2. **Similarities:**
   - Which airlines appear in similar positions in both rankings?
   - [FILL IN HERE]

3. **Differences:**
   - Which airlines differ significantly between your ranking and BTS?
   - [FILL IN HERE]

4. **Possible Explanations for Differences:**
   - Sample differences (e.g., you may have fewer routes)
   - Time aggregation (monthly vs. annual)
   - Inclusion/exclusion criteria (domestic only vs. all flights)
   - [ADD YOUR ANALYSIS HERE]
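
Once you have typed the BTS ranking in by hand, the comparison above can also be summarized numerically. The sketch below is one way to do it; every carrier name and rank in it is an invented placeholder, not a real BTS figure. It computes the Spearman rank correlation as the Pearson correlation of the two rank columns.

```python
import pandas as pd

# Placeholder rankings (1 = worst delay rate); every value here is invented
computed_rank = pd.Series(
    {"Carrier A": 1, "Carrier B": 2, "Carrier C": 3, "Carrier D": 4},
    name="computed_rank",
)
bts_rank = pd.Series(
    {"Carrier A": 2, "Carrier B": 1, "Carrier C": 3, "Carrier D": 4},
    name="bts_rank",
)

# Align on carrier name and measure how far each carrier moved
comparison = pd.concat([computed_rank, bts_rank], axis=1)
comparison["rank_diff"] = comparison["computed_rank"] - comparison["bts_rank"]

# Spearman rank correlation, computed as the Pearson correlation of the ranks
spearman_rho = comparison["computed_rank"].rank().corr(comparison["bts_rank"].rank())

print(comparison)
print(f"Spearman rho = {spearman_rho:.2f}")
```

A value near 1 means the two rankings largely agree; large entries in `rank_diff` point to the carriers worth discussing under "Differences".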

---
## 7. Conclusions & Limitations

### Key Findings

**STUDENT: Summarize your key findings here:**

1. **Which airline is most likely to make you late?**
   - [FILL IN: Name the airline with the highest delay rate and state the percentage]

2. **How much do delay rates vary across airlines?**
   - [FILL IN: Compare best vs. worst performer]

3. **Is there a relationship between delay rate and cancellation rate?**
   - [FILL IN: Based on your scatter plot]

4. **Do your findings align with BTS official rankings?**
   - [FILL IN: Summarize comparison from Section 6]

### Limitations

**STUDENT: List at least 3 limitations of your analysis:**

1. **Route Mix Not Considered:**
   - [EXPLAIN: e.g., Airlines flying to more congested airports may have higher delays for reasons beyond their control]

2. **Seasonal Effects:**
   - [EXPLAIN: e.g., Delays may be higher in winter; aggregating all months may hide important patterns]

3. **External Factors:**
   - [EXPLAIN: e.g., Weather, air traffic control delays, airport infrastructure differences]

4. **Sample vs. Population:**
   - [IF APPLICABLE: If you sampled the data, note that results may differ from full dataset]

5. **[ADD YOUR OWN]:**
   - [FILL IN]
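
The route-mix limitation can be made concrete with a small invented example. In the sketch below, hypothetical carrier `X` has a lower delay rate than carrier `Y` at each airport, yet a higher rate overall, because most of its flights use the congested airport. The `Origin` column and all counts are assumptions made up for illustration and may not match the columns in your dataset.

```python
import pandas as pd

# Invented counts for two hypothetical carriers at two hypothetical airports
counts = pd.DataFrame({
    "CarrierName": ["X", "X", "Y", "Y"],
    "Origin": ["HUB", "QUIET", "HUB", "QUIET"],
    "flights": [80, 20, 20, 80],
    "delayed": [24, 2, 7, 12],
})

# Delay rate per carrier at each airport
counts["delay_rate_pct"] = counts["delayed"] / counts["flights"] * 100
per_airport = counts.pivot(index="Origin", columns="CarrierName", values="delay_rate_pct")

# Overall delay rate per carrier, pooling both airports
totals = counts.groupby("CarrierName")[["flights", "delayed"]].sum()
overall = totals["delayed"] / totals["flights"] * 100

print(per_airport.round(1))
print(overall.round(1))
```

This is why an airline-level ranking alone cannot separate carrier performance from the airports a carrier happens to serve.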

### Recommendations for TravelSmart

**STUDENT: Based on your findings, what should TravelSmart recommend to students?**

- [FILL IN: e.g., "Students should avoid Airline X for time-sensitive travel, as it has a 25% delay rate, meaning 1 in 4 flights will be at least 15 minutes late."]
- [FILL IN: Additional recommendations]

**Caveats:**
- [FILL IN: e.g., "These are averages across all routes. Individual route performance may vary."]

---
## Next Steps

✅ **You've completed the analysis!** Now:

1. **Export this notebook** as HTML or PDF for your deliverables
2. **Write your 2-3 page report** summarizing findings, comparison, and limitations
3. **Review the rubric** (`RUBRIC.pdf`) to ensure you've met all criteria
4. **Proofread** your code and commentary for clarity

**Great work!** 🎉✈️📊