A graph-based analytics solution leveraging GraphFrames and Databricks to analyze flight performance data with over 1 million records. This project transforms traditional tabular flight data into graph structures, enabling advanced network analysis, airport ranking, and delay pattern discovery that would be difficult with conventional SQL queries.
- Graph-Based Insights: Unlock complex relationship patterns between airports and flight routes
- Performance Optimization: Identify critical delay patterns and bottleneck airports
- Network Analysis: Discover hub airports, popular routes, and transfer cities using PageRank
- Predictive Intelligence: Determine high-risk routes for delays and optimize scheduling
- Scalable Processing: Handle millions of flight records efficiently using Spark's distributed computing
CSV Data Source → DBFS Storage (Raw/Parquet) → Spark SQL Processing
↓
GraphFrames Creation
↓
Graph Analytics & Visualization
(PageRank, Motif Finding, BFS)
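The first two stages of this pipeline could look roughly like the sketch below. It assumes the CSV has a header row and lowercase column names matching the dataset fields (`date`, `delay`, `distance`, `origin`, `destination`). The Parquet output path and the helper name `stage_delays_as_parquet` are hypothetical, and `spark` is the session that Databricks notebooks provide.

```python
# The raw path follows the DBFS layout used in this README;
# the Parquet path is a hypothetical choice.
RAW_DELAYS_CSV = "/FileStore/tables/raw/departuredelays.csv"
PARQUET_DELAYS = "/FileStore/tables/parquet/departuredelays"

# Assumed column names and types, written as a Spark DDL schema string.
# `date` is kept as a string so any leading zeros survive the load.
DELAYS_SCHEMA = (
    "date STRING, delay INT, distance INT, origin STRING, destination STRING"
)


def stage_delays_as_parquet(spark, csv_path=RAW_DELAYS_CSV,
                            parquet_path=PARQUET_DELAYS):
    """Read the raw CSV with an explicit schema and rewrite it as Parquet."""
    delays = spark.read.csv(csv_path, header=True, schema=DELAYS_SCHEMA)
    delays.write.mode("overwrite").parquet(parquet_path)
    # Later stages read the columnar copy instead of re-parsing the CSV.
    return spark.read.parquet(parquet_path)
```

In a notebook, `delays = stage_delays_as_parquet(spark)` would return the DataFrame that feeds the Spark SQL and GraphFrames stages.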
| Component | Technology |
|---|---|
| Language | Python |
| Framework | PySpark |
| Platform | Databricks |
| Graph Library | GraphFrames |
| Processing Engine | Apache Spark |
| Storage | DBFS (Databricks File System) |
| Format | CSV, Parquet |
- Motif Finding - Discover complex flight patterns and connections
- PageRank Algorithm - Rank airports by importance and connectivity
- Triangle Count - Identify closely connected airport clusters
- Breadth-First Search - Find shortest paths between cities
- Subgraph Queries - Filter and analyze specific route segments
- Delay Analysis - On-time vs delayed flight statistics
- Airport Ranking - Most connected and busiest airports
- Route Optimization - Popular routes and transfer hubs
- Performance Metrics - Delay trends by origin and destination
- Network Relationships - Complex multi-hop connection patterns
Size: 1,048,576 flight records
Parameters:
- Date - Flight date
- Delay - Departure delay in minutes
- Distance - Flight distance in miles
- Origin - Origin airport code
- Destination - Destination airport code
- Basic Metrics
  - Total number of airports and trips
  - Longest delay in the dataset
  - Delayed vs on-time/early flight distribution
- Delay Patterns
  - Flights from SFO most likely to have significant delays
  - Destinations with highest delay tendencies
  - Seattle (SEA) departures with significant delays
- Network Analysis
  - Airport importance ranking (PageRank)
  - Most popular flight routes
  - Top transfer cities
  - Relationship patterns through motif finding
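Several of these questions reduce to short DataFrame queries over the graph's edges. The sketch below assumes a GraphFrame `g` whose edges carry `src`, `dst`, and `delay` columns. Treating a delay above 100 minutes as "significant" is an illustrative assumption, as are the helper names; neither is taken from the notebook.

```python
def significant_delay_filter(origin, min_delay=100):
    """Build a Spark SQL filter for badly delayed departures from one airport."""
    return f"src = '{origin}' AND delay > {int(min_delay)}"


def basic_metrics(g):
    """Airport count, trip count, and the longest delay in the dataset."""
    longest = g.edges.groupBy().max("delay").collect()[0][0]
    return {
        "airports": g.vertices.count(),
        "trips": g.edges.count(),
        "longest_delay": longest,
    }


def delay_split(g):
    """Count delayed flights versus on-time or early ones."""
    return {
        "delayed": g.edges.filter("delay > 0").count(),
        "on_time_or_early": g.edges.filter("delay <= 0").count(),
    }


def worst_destinations_from(g, origin, min_delay=100, top_n=10):
    """Routes from `origin` ranked by average delay among significant delays."""
    return (
        g.edges.filter(significant_delay_filter(origin, min_delay))
        .groupBy("src", "dst")
        .avg("delay")
        .orderBy("avg(delay)", ascending=False)
        .limit(top_n)
    )
```

For example, `worst_destinations_from(g, "SFO").show()` would list the SFO routes with the highest average delay, and the same call with `"SEA"` covers the Seattle question.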
- Databricks Account (Community Edition or Paid)
- Basic knowledge of Python and Spark
- Understanding of graph theory concepts (helpful but not required)
- Login to Databricks
  - Access your Databricks workspace
  - Navigate to the workspace home
- Create Cluster
  - Cluster Configuration:
    - Runtime: Latest Spark version (3.x+)
    - Worker Type: Standard_DS3_v2 (or similar)
    - Min Workers: 2
    - Max Workers: 4
- Install GraphFrames Library
  - Compute → Select Cluster → Libraries → Install New
  - Maven Coordinates: graphframes:graphframes:0.8.2-spark3.2-s_2.12 (this build targets Spark 3.2 and Scala 2.12; choose the build that matches your cluster runtime)
- Upload Data Files
  - DBFS Path Structure: /FileStore/tables/raw/ containing airport_codes_na.txt and departuredelays.csv
- Import & Run Notebook
  - Import the provided notebook
  - Attach to your cluster
  - Execute cells sequentially
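Before running the notebook, it can help to confirm that the GraphFrames library is actually visible to Python on the attached cluster. The check below is a minimal sketch; the helper name `graphframes_available` is hypothetical.

```python
import importlib.util


def graphframes_available():
    """Return True if the graphframes Python package can be imported."""
    return importlib.util.find_spec("graphframes") is not None


if not graphframes_available():
    print("GraphFrames not found: install the Maven library on the cluster.")
```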
FileStore/
└── tables/
    └── raw/
        ├── airport_codes_na.txt   # Airport metadata
        └── departuredelays.csv    # Flight delay records
- Data Ingestion - Load CSV files into Spark DataFrames
- Data Preparation - Clean and transform data for graph structure
- Graph Creation - Build GraphFrame with vertices (airports) and edges (flights)
- Analysis Execution
- Run PageRank for airport importance
- Execute motif finding for pattern discovery
- Perform BFS for shortest paths
- Calculate delay statistics and aggregations
- Visualization - Display results and insights
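The graph creation and analysis steps above can be sketched with the GraphFrames Python API as follows. This assumes the airport metadata exposes an `IATA` code column and the delay records use `origin` and `destination` columns; those column names and the helper names are assumptions, not confirmed details of the notebook.

```python
# Motif for one-stop itineraries: a flight a->b followed by a flight b->c.
TRANSFER_MOTIF = "(a)-[ab]->(b); (b)-[bc]->(c)"


def build_flight_graph(airports_df, delays_df):
    """Build a GraphFrame: airports become vertices, flights become edges."""
    # Imported lazily so this module still loads where GraphFrames is absent.
    from graphframes import GraphFrame

    # GraphFrames requires an `id` column on vertices and `src`/`dst` on edges.
    vertices = airports_df.withColumnRenamed("IATA", "id").dropDuplicates(["id"])
    edges = (
        delays_df.withColumnRenamed("origin", "src")
        .withColumnRenamed("destination", "dst")
    )
    return GraphFrame(vertices, edges)


def run_analyses(g, origin="SEA", destination="SFO"):
    """Run PageRank, motif finding, and BFS on the flight graph."""
    ranks = g.pageRank(resetProbability=0.15, maxIter=5)
    transfers = g.find(TRANSFER_MOTIF)
    paths = g.bfs(
        fromExpr=f"id = '{origin}'",
        toExpr=f"id = '{destination}'",
        maxPathLength=2,
    )
    return {
        "top_airports": ranks.vertices.orderBy("pagerank", ascending=False),
        "transfers": transfers,
        "paths": paths,
    }
```

On the full dataset the unfiltered two-hop motif result is very large, so it is usually worth filtering it (for example by a specific origin or hub) before displaying or collecting anything.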
- Vertices - Airports (nodes in the network)
- Edges - Flights (connections between airports)
- Directed vs Undirected - Flight routes have direction (origin → destination)
- PageRank - Measures airport importance based on incoming connections
- Triangle Count - Identifies tightly connected airport groups
- Motif Finding - Discovers specific patterns (e.g., A→B→C→A cycles)
- Breadth-First Search - Finds the route with the fewest hops (flights) between two cities
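To make PageRank and breadth-first search concrete, here is a plain-Python illustration on a toy route map. The routes are invented for the example, and this is not how GraphFrames implements these algorithms (it runs them distributed on Spark); the damping factor of 0.85 corresponds to a reset probability of 0.15.

```python
from collections import deque

# Hypothetical toy route map: each origin maps to its direct destinations.
ROUTES = {
    "SEA": ["SFO", "ORD"],
    "SFO": ["ORD"],
    "ORD": ["JFK"],
    "JFK": ["SEA"],
}


def pagerank(routes, damping=0.85, iterations=100):
    """Power-iteration PageRank; assumes every airport has an outgoing route."""
    airports = sorted(routes)
    rank = {a: 1.0 / len(airports) for a in airports}
    for _ in range(iterations):
        # Every airport gets a small baseline share, then inherits rank
        # from the airports that fly into it.
        new_rank = {a: (1.0 - damping) / len(airports) for a in airports}
        for src, dests in routes.items():
            share = damping * rank[src] / len(dests)
            for dst in dests:
                new_rank[dst] += share
        rank = new_rank
    return rank


def fewest_hops(routes, start, goal):
    """Breadth-first search: a path with the fewest flights, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in routes.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


ranks = pagerank(ROUTES)
print(max(ranks, key=ranks.get))           # highest-ranked airport
print(fewest_hops(ROUTES, "SFO", "SEA"))   # one fewest-hop itinerary
```

An airport ranks highly here because many routes, or routes from highly ranked airports, lead into it, which is the "incoming connections" idea described above.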
- Network Analysis
- Social Network Graphs
- Recommendation Systems
- Route Optimization
- Flight delay prediction
- Route optimization
- Hub airport identification
- Scheduling improvements
- Social network analysis
- Recommendation engines
- Fraud detection networks
- Supply chain optimization
- Traffic flow analysis
Contributions are welcome! Feel free to:
- Report issues
- Suggest improvements
- Submit pull requests
- Share your analysis insights
This project is licensed under the MIT License.
- Built on Databricks platform
- Powered by Apache Spark and GraphFrames
- Inspired by network analysis and graph theory principles
Note: When working with Databricks Community Edition, remember that clusters automatically terminate after 2 hours of inactivity. Always save your work and export notebooks regularly.