Graph-based flight performance analytics using GraphFrames and Databricks. Analyzes 1M+ flight records to identify delay patterns, rank airports with PageRank, discover optimal routes, and visualize network relationships through motif finding and BFS algorithms.

abidaziz1/Flight-Performance-Analysis-using-GraphFrames-and-Databricks

Flight Performance Analysis using GraphFrames and Databricks

📋 Overview

A graph-based analytics solution leveraging GraphFrames and Databricks to analyze flight performance data with over 1 million records. This project transforms traditional tabular flight data into graph structures, enabling advanced network analysis, airport ranking, and delay pattern discovery that would be difficult with conventional SQL queries.

🎯 Project Impact

  • Graph-Based Insights: Unlock complex relationship patterns between airports and flight routes
  • Performance Optimization: Identify critical delay patterns and bottleneck airports
  • Network Analysis: Discover hub airports, popular routes, and transfer cities using PageRank
  • Predictive Intelligence: Determine high-risk routes for delays and optimize scheduling
  • Scalable Processing: Handle millions of flight records efficiently using Spark's distributed computing

🏗️ Architecture

CSV Data Source → DBFS Storage (Raw/Parquet) → Spark SQL Processing
                                                        ↓
                                            GraphFrames Creation
                                                        ↓
                                Graph Analytics & Visualization
                                    (PageRank, Motif Finding, BFS)

🛠️ Tech Stack

| Component | Technology |
| --- | --- |
| Language | Python |
| Framework | PySpark |
| Platform | Databricks |
| Graph Library | GraphFrames |
| Processing Engine | Apache Spark |
| Storage | DBFS (Databricks File System) |
| Format | CSV, Parquet |

✨ Key Features

GraphFrames Capabilities

  • 🔍 Motif Finding - Discover complex flight patterns and connections
  • 📊 PageRank Algorithm - Rank airports by importance and connectivity
  • 🔄 Triangle Count - Identify closely connected airport clusters
  • 🛤️ Breadth-First Search - Find shortest paths between cities
  • 🎯 Subgraph Queries - Filter and analyze specific route segments

Analysis Features

  • ✅ Delay Analysis - On-time vs delayed flight statistics
  • 🏆 Airport Ranking - Most connected and busiest airports
  • 🌐 Route Optimization - Popular routes and transfer hubs
  • 📈 Performance Metrics - Delay trends by origin and destination
  • 🔗 Network Relationships - Complex multi-hop connection patterns

📊 Dataset

Size: 1,048,576 flight records

Parameters:

  • Date - Flight date
  • Delay - Departure delay in minutes
  • Distance - Flight distance in miles
  • Origin - Origin airport code
  • Destination - Destination airport code

🔬 Analysis Questions Answered

  1. Basic Metrics

    • Total number of airports and trips
    • Longest delay in the dataset
    • Delayed vs on-time/early flight distribution
  2. Delay Patterns

    • Flights from SFO most likely to have significant delays
    • Destinations with highest delay tendencies
    • Seattle (SEA) departures with significant delays
  3. Network Analysis

    • Airport importance ranking (PageRank)
    • Most popular flight routes
    • Top transfer cities
    • Relationship patterns through motif finding
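
The basic metrics and delay questions above reduce to simple counts and group-by aggregations. The sketch below is a plain-Python illustration on a handful of made-up rows (the values and the date encoding are assumptions, not taken from the dataset); the notebook would express the same logic over Spark DataFrames.

```python
from collections import defaultdict

# Hypothetical sample rows: (date, delay_minutes, distance_miles, origin, destination).
# These values are invented for illustration only.
flights = [
    ("01011245", 6, 602, "SFO", "SEA"),
    ("01020600", -8, 602, "SEA", "SFO"),
    ("01021245", 125, 2246, "SFO", "JFK"),
    ("01031605", 0, 2246, "JFK", "SFO"),
    ("01031900", 45, 602, "SFO", "SEA"),
]

# Total number of airports and trips.
airports = {code for _, _, _, origin, dest in flights for code in (origin, dest)}
trips = len(flights)

# Longest delay, and delayed vs on-time/early split (delay <= 0 counts as on-time/early).
longest_delay = max(delay for _, delay, *_ in flights)
delayed = sum(1 for _, delay, *_ in flights if delay > 0)
on_time_or_early = trips - delayed

# Average delay per destination for flights departing SFO.
delays_from_sfo = defaultdict(list)
for _, delay, _, origin, dest in flights:
    if origin == "SFO":
        delays_from_sfo[dest].append(delay)
avg_delay_from_sfo = {dest: sum(v) / len(v) for dest, v in delays_from_sfo.items()}

print(len(airports), trips, longest_delay, delayed, on_time_or_early)
print(avg_delay_from_sfo)
```

Swapping `"SFO"` for `"SEA"` gives the corresponding view of Seattle departures.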

🚀 Quick Start

Prerequisites

  • Databricks Account (Community Edition or Paid)
  • Basic knowledge of Python and Spark
  • Understanding of graph theory concepts (helpful but not required)

Setup Steps

  1. Login to Databricks

    • Access your Databricks workspace
    • Navigate to the workspace home
  2. Create Cluster

    Cluster Configuration:
    - Runtime: Latest Spark version (3.x+)
    - Worker Type: Standard_DS3_v2 (or similar)
    - Min Workers: 2
    - Max Workers: 4
    
  3. Install GraphFrames Library

    Compute → Select Cluster → Libraries → Install New
    Maven Coordinates: graphframes:graphframes:0.8.2-spark3.2-s_2.12
    (this build targets Spark 3.2 and Scala 2.12; choose the build that matches your cluster runtime)
    
  4. Upload Data Files

    DBFS Path Structure:
    /FileStore/tables/raw/
    ├── airport_codes_na.txt
    └── departuredelays.csv
    
  5. Import & Run Notebook

    • Import the provided notebook
    • Attach to your cluster
    • Execute cells sequentially

📁 Project Structure

FileStore/
└── tables/
    └── raw/
        ├── airport_codes_na.txt      # Airport metadata
        └── departuredelays.csv       # Flight delay records

🔄 Workflow

  1. Data Ingestion - Load CSV files into Spark DataFrames
  2. Data Preparation - Clean and transform data for graph structure
  3. Graph Creation - Build GraphFrame with vertices (airports) and edges (flights)
  4. Analysis Execution
    • Run PageRank for airport importance
    • Execute motif finding for pattern discovery
    • Perform BFS for shortest paths
    • Calculate delay statistics and aggregations
  5. Visualization - Display results and insights
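
The workflow above might look roughly like the following in a notebook. This is a sketch, not the project's actual notebook code: it assumes the CSV has a header row with lowercase columns `origin`, `destination`, `delay`, and `distance`, that `spark` is the session Databricks provides, and that the airport codes `SFO`, `JFK`, and `SEA` appear in the data. The function names and the `MOTIF_TWO_HOP` constant are hypothetical.

```python
# Motif for one-stop itineraries: a -> b -> c (hypothetical constant name).
MOTIF_TWO_HOP = "(a)-[ab]->(b); (b)-[bc]->(c)"


def build_flight_graph(spark, delays_path="/FileStore/tables/raw/departuredelays.csv"):
    """Steps 1-3: load the CSV and build a GraphFrame (airports = vertices, flights = edges)."""
    # Imports are deferred so this file can be loaded where Spark is not installed.
    from pyspark.sql import functions as F
    from graphframes import GraphFrame

    flights = (
        spark.read.option("header", True)
        .option("inferSchema", True)
        .csv(delays_path)  # column names below are assumptions about this file
    )
    # GraphFrames expects edge columns named "src" and "dst".
    edges = flights.select(
        F.col("origin").alias("src"),
        F.col("destination").alias("dst"),
        "delay",
        "distance",
    )
    # GraphFrames expects a vertex column named "id".
    vertices = (
        edges.select(F.col("src").alias("id"))
        .union(edges.select(F.col("dst").alias("id")))
        .distinct()
    )
    return GraphFrame(vertices, edges)


def run_analyses(g):
    """Step 4: PageRank, motif finding, BFS, and a delay aggregation."""
    ranks = (
        g.pageRank(resetProbability=0.15, maxIter=5)
        .vertices.orderBy("pagerank", ascending=False)
    )
    # One-stop SFO -> ? -> JFK itineraries whose first leg departed late.
    layovers = g.find(MOTIF_TWO_HOP).filter(
        "a.id = 'SFO' AND c.id = 'JFK' AND ab.delay > 0"
    )
    # Routes from SEA to SFO using at most two flights.
    paths = g.bfs("id = 'SEA'", "id = 'SFO'", maxPathLength=2)
    # Average departure delay per origin airport.
    avg_delay = g.edges.groupBy("src").avg("delay")
    return ranks, layovers, paths, avg_delay
```

In a Databricks notebook, `g = build_flight_graph(spark)` followed by `display(run_analyses(g)[0])` would show the ranked airports (step 5).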

🎓 Key Concepts

Graph Components

  • Vertices - Airports (nodes in the network)
  • Edges - Flights (connections between airports)
  • Directed vs Undirected - Flight routes have direction (origin → destination)
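
To make these components concrete, here is a small plain-Python sketch with a made-up set of routes (the airport pairs are illustrative, not drawn from the dataset). It shows vertices derived from edges and why direction matters.

```python
from collections import Counter

# Hypothetical directed edges: (origin, destination).
edges = [("SFO", "SEA"), ("SEA", "SFO"), ("SFO", "JFK"), ("JFK", "SEA")]

# Vertices are the distinct airports that appear at either end of an edge.
vertices = sorted({airport for edge in edges for airport in edge})

# In a directed graph, departures (out-degree) and arrivals (in-degree) differ.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

# SFO -> JFK exists in this sample, but the reverse route does not.
edge_set = set(edges)
print(vertices, out_degree["SFO"], in_degree["SEA"])
print(("SFO", "JFK") in edge_set, ("JFK", "SFO") in edge_set)
```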

Algorithms Used

  • PageRank - Measures airport importance based on incoming connections
  • Triangle Count - Identifies tightly connected airport groups
  • Motif Finding - Discovers specific patterns (e.g., A→B→C→A cycles)
  • Breadth-First Search - Finds the route with the fewest flights (hops) between two cities
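
For intuition, the sketch below implements simplified versions of three of these algorithms in plain Python on a tiny made-up route list. It is not how GraphFrames implements them (GraphFrames runs distributed versions on Spark, and its PageRank scores are scaled differently); the function names and sample routes are hypothetical.

```python
from collections import deque

# Hypothetical directed routes.
edges = [("SFO", "SEA"), ("SEA", "JFK"), ("JFK", "SFO"), ("SFO", "LAX"), ("LAX", "SFO")]


def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank; ranks are normalized to sum to 1."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # a node with no departures spreads rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def bfs_path(edges, start, goal):
    """Return a path with the fewest hops from start to goal, or None."""
    adj = {}
    for s, d in edges:
        adj.setdefault(s, []).append(d)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def three_cycles(edges):
    """Find A->B->C->A cycles over three distinct airports, one entry per cycle."""
    edge_set, found = set(edges), set()
    for a, b in edge_set:
        for b2, c in edge_set:
            if b2 == b and (c, a) in edge_set and len({a, b, c}) == 3:
                # Store the rotation that sorts first so each cycle appears once.
                found.add(min((a, b, c), (b, c, a), (c, a, b)))
    return sorted(found)


ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
print(bfs_path(edges, "SEA", "LAX"))
print(three_cycles(edges))
```

In this sample, SFO ranks highest because two airports send all of their outgoing traffic to it, which mirrors the "incoming connections" idea in the PageRank description.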

📚 Learn More

Official Documentation

Tutorials & Resources

Related Topics

  • Network Analysis
  • Social Network Graphs
  • Recommendation Systems
  • Route Optimization

💡 Use Cases

Aviation Industry

  • Flight delay prediction
  • Route optimization
  • Hub airport identification
  • Scheduling improvements

Similar Applications

  • Social network analysis
  • Recommendation engines
  • Fraud detection networks
  • Supply chain optimization
  • Traffic flow analysis

🀝 Contributing

Contributions are welcome! Feel free to:

  • Report issues
  • Suggest improvements
  • Submit pull requests
  • Share your analysis insights

📝 License

This project is licensed under the MIT License.

🙏 Acknowledgments

  • Built on Databricks platform
  • Powered by Apache Spark and GraphFrames
  • Inspired by network analysis and graph theory principles

Note: When working with Databricks Community Edition, remember that clusters automatically terminate after 2 hours of inactivity. Always save your work and export notebooks regularly.
