# US Airline Dataset

https://www.kaggle.com/datasets/shaivyac/us-airline-dataset

## Dataset content:

- FL_DATE:
The date of the flight.Instead of storing all attributes to store date we use a single attribute.Date of the Flight in yyyymmdd Airline Reporting Unique.
- AIRLINE_ID:
An identification number assigned by US DOT to identify a unique airline (carrier). It has a single value for each individual carrier.
- TAIL_NUM:
An identification number to store the tail number of the airlines.
- FLIGHT_NUM:
The flight number is stored in form of unique value for each flight.
- ORIGIN_SEQ_ID:
Unique id for storing each origin airport of the flight.
- ORIGIN_AIRPORT:
Stores the origin airport.It can be used for displaying the airport which are easier to understand. For example, JFK.
- DEST_SEQ_ID:
Unique id for storing each destination airport of the flight.
- DEST_AIRPORT:
Stores the destination airport.It can be used for displaying the airport which are easier to understand.For example, JFK.
- DEP_TIME:
Actual Departure Time in local time: hhmm

Preprocessing:

Checklist:
- https://ft-interactive.github.io/visual-vocabulary/: use at least all of the categories in the summary of
- Think of the storyline according to this and the theory

- TO do the dashboard check pyviz tools. (we liked it more streamlit)


Create app:

# Task
Generate Python code to create a Streamlit application that analyzes flight data and provides insights on general flight patterns, airlines, and airports, based on the requirements provided in the user message. The application should include visualizations and data summaries for various aspects of flight data. The code should be placed in the empty cell with id "8nlGZX_DPPQT".

## Data loading and preprocessing

### Subtask:
Load the data from the CSV file into a pandas DataFrame and perform necessary preprocessing steps, such as converting data types and handling missing values.


**Reasoning**:
Load the data and perform the specified preprocessing steps, including type conversions and handling missing values.



## Create streamlit app structure

### Subtask:
Set up the basic structure of the Streamlit application, including the title and sidebar for navigation.


In [1]:
!pip install streamlit
!pip install -q kagglehub[pandas-datasets]

Collecting streamlit
  Downloading streamlit-1.51.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.51.0-py3-none-any.whl (10.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.2/10.2 MB[0m [31m38.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.9/6.9 MB[0m [31m45.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pydeck, streamlit
Successfully installed pydeck-0.9.1 streamlit-1.51.0


In [2]:
%%writefile app.py
import streamlit as st
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import networkx as nx

###
# Install dependencies
#!pip install -q kagglehub[pandas-datasets]

import kagglehub
from kagglehub import KaggleDatasetAdapter
import os


# Load the dataset with caching
@st.cache_data
def load_data():
    #df = pd.read_csv('/content/Airline_dataset.csv')
####################################
    # Download dataset (returns local directory path)
    path = kagglehub.dataset_download("shaivyac/us-airline-dataset")

    # Now, load a specific file (replace with actual filename you see printed)
    file_to_load = os.path.join(path, "Airline_dataset.csv")  # Corrected filename

    # Load the dataset using pandas directly from the local path

    df = pd.read_csv(file_to_load)
#############################
    df['FL_DATE'] = pd.to_datetime(df['FL_DATE'], format='%m/%d/%y')
    cols = ["AIRLINE_ID", "FLIGHT_NUM", "ORIGIN_SEQ_ID","DEST_SEQ_ID"]
    df[cols] = df[cols].astype(int)
    df['DEP_DELAY'] = df['DEP_DELAY'].fillna(0)
    df['ARR_DELAY'] = df['ARR_DELAY'].fillna(0)
    df['WEATHER_DELAY'] = df['WEATHER_DELAY'].fillna(0) # Ensure WEATHER_DELAY is also filled
    return df

df = load_data()


st.title('US Airline Data Analysis')

st.sidebar.title('Navigation')
page = st.sidebar.radio("Go to", ["General Visualizations", "Airlines Analysis", "Airports Analysis"])

if page == "General Visualizations":
    st.header("General Flight Patterns")

    st.subheader("Most Common Flight Trajectories")

    # Calculate the frequency of each origin-destination pair
    trajectory_counts = df.groupby(['ORIGIN_AIRPORT', 'DEST_AIRPORT']).size().reset_index(name='count')
    # Get the top N trajectories
    n_trajectories = st.slider("Select number of top trajectories to display", 10, 100, 50)
    top_trajectories = trajectory_counts.nlargest(n_trajectories, 'count')

    # Create a network graph
    G = nx.DiGraph()
    for index, row in top_trajectories.iterrows():
        G.add_edge(row['ORIGIN_AIRPORT'], row['DEST_AIRPORT'], weight=row['count'])

    # Get positions for the nodes using a layout
    pos = nx.spring_layout(G, k=0.15, iterations=20)

    # Create Edges
    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.append(x0)
        edge_x.append(x1)
        edge_x.append(None)
        edge_y.append(y0)
        edge_y.append(y1)
        edge_y.append(None)

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    # Create Nodes
    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = pos[node]
        node_x.append(x)
        node_y.append(y)

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=[],
            size=10,
            colorbar=dict(
                thickness=15,
                title='Node Connections',
                xanchor='left',
                titleside='right'
            ),
            line_width=2))

    # Color Node Points
    node_adjacencies = []
    node_text = []
    for node, adjacencies in enumerate(G.adjacency()):
        node_adjacencies.append(len(adjacencies[1]))
        node_text.append(f'{list(G.nodes())[node]}: # of connections: {len(adjacencies[1])}')

    node_trace.marker.color = node_adjacencies
    node_trace.text = node_text

    # Create figure
    fig = go.Figure(data=[edge_trace, node_trace],
                 layout=go.Layout(
                    title='Network graph of most common flight trajectories',
                    titlefont_size=16,
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20,l=5,r=5,t=40),
                    annotations=[ dict(
                        text="Python code: <a href='https://plotly.com/ipython-notebooks/network-graphs/'> https://plotly.com/ipython-notebooks/network-graphs/</a>",
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002 ) ],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                    )

    st.plotly_chart(fig)


    st.subheader("Best Airline Suggester")

    origin_airport_suggester = st.selectbox("Select Origin Airport", df['ORIGIN_AIRPORT'].unique(), key='origin_suggester')
    dest_airport_suggester = st.selectbox("Select Destination Airport", df['DEST_AIRPORT'].unique(), key='dest_suggester')

    if st.button("Suggest Best Airline"):
        # Filter data for the selected route
        route_df = df[(df['ORIGIN_AIRPORT'] == origin_airport_suggester) & (df['DEST_AIRPORT'] == dest_airport_suggester)]

        if route_df.empty:
            st.write("No flights found for this route.")
        else:
            # Calculate average arrival delay for each airline on this route
            airline_performance = route_df.groupby('AIRLINE_ID')['ARR_DELAY'].mean().reset_index()
            # Suggest the airline with the lowest average arrival delay
            best_airline_id = airline_performance.nsmallest(1, 'ARR_DELAY')['AIRLINE_ID'].iloc[0]
            # Find the airline name (you might need a mapping for airline IDs to names)
            # For now, we'll just display the ID
            st.write(f"The suggested best airline for this route is Airline ID: {best_airline_id} (based on lowest average arrival delay).")


elif page == "Airlines Analysis":
    st.header("Airline Specific Insights")

    # Get unique airline IDs
    airline_ids = df['AIRLINE_ID'].unique()

    # Add a dropdown to select an airline
    selected_airline_id = st.selectbox("Select an Airline ID", airline_ids)

    # Filter data for the selected airline
    airline_df = df[df['AIRLINE_ID'] == selected_airline_id]

    st.subheader(f"Analysis for Airline ID: {selected_airline_id}")

    # Number of flights per year and month
    st.write("### Number of Flights Over Time")
    airline_df['year'] = airline_df['FL_DATE'].dt.year
    airline_df['month'] = airline_df['FL_DATE'].dt.month

    flights_per_year = airline_df['year'].value_counts().sort_index().reset_index()
    flights_per_year.columns = ['Year', 'Number of Flights']
    fig_year = px.bar(flights_per_year, x='Year', y='Number of Flights', title=f'Number of Flights per Year for Airline ID {selected_airline_id}')
    st.plotly_chart(fig_year)

    flights_per_month = airline_df['month'].value_counts().sort_index().reset_index()
    flights_per_month.columns = ['Month', 'Number of Flights']
    fig_month = px.line(flights_per_month, x='Month', y='Number of Flights', title=f'Number of Flights per Month for Airline ID {selected_airline_id}')
    st.plotly_chart(fig_month)


    # Average delays
    st.write("### Average Delays")
    avg_dep_delay = airline_df['DEP_DELAY'].mean()
    avg_arr_delay = airline_df['ARR_DELAY'].mean()

    st.metric("Average Departure Delay (minutes)", f"{avg_dep_delay:.2f}")
    st.metric("Average Arrival Delay (minutes)", f"{avg_arr_delay:.2f}")


    # Most common trips and map visualization
    st.write("### Most Common Trips")
    # Calculate the frequency of each origin-destination pair for the selected airline
    airline_trajectory_counts = airline_df.groupby(['ORIGIN_AIRPORT', 'DEST_AIRPORT']).size().reset_index(name='count')
    # Get the top N trajectories for this airline
    n_airline_trajectories = st.slider(f"Select number of top trips to display for Airline ID {selected_airline_id}", 5, 50, 10, key=f'airline_trips_slider_{selected_airline_id}')
    top_airline_trajectories = airline_trajectory_counts.nlargest(n_airline_trajectories, 'count')

    st.write("Top Common Trips:")
    st.dataframe(top_airline_trajectories)

    # To plot on a map, we need geographical coordinates.
    # This dataset doesn't include them, so we'll need to use a separate source or a placeholder.
    # For demonstration, let's assume we have a way to get coordinates for airports.
    # In a real scenario, you would join with an airport codes dataset that has lat/lon.

    # Placeholder for map visualization (requires airport coordinates)
    st.write("### Map of Most Common Trips")
    st.write("Map visualization of common trips for this airline would appear here if airport coordinates were available.")

    # Example of how a map visualization with Plotly Express might look IF coordinates were available:
    # fig_map = px.scatter_geo(top_airline_trajectories,
    #                         locations="ORIGIN_AIRPORT", # Assuming ORIGIN_AIRPORT is an IATA code or similar
    #                         locationmode="airports",
    #                         hover_name="ORIGIN_AIRPORT",
    #                         size="count",
    #                         title=f'Most Common Trips for Airline ID {selected_airline_id}')
    # st.plotly_chart(fig_map)



elif page == "Airports Analysis":
    st.header("Airport Specific Insights")

    # Create a dropdown menu to select an airport
    selected_airport = st.selectbox("Select an Airport", df['ORIGIN_AIRPORT'].unique())

    # Filter data for the selected airport as the origin
    airport_df = df[df['ORIGIN_AIRPORT'] == selected_airport]

    st.subheader(f"Analysis for {selected_airport}")

    # Number of flights per year
    st.write("### Number of Flights per Year")
    airport_df['year'] = airport_df['FL_DATE'].dt.year
    flights_per_year_airport = airport_df['year'].value_counts().sort_index().reset_index()
    flights_per_year_airport.columns = ['Year', 'Number of Flights']
    fig_year_airport = px.bar(flights_per_year_airport, x='Year', y='Number of Flights', title=f'Number of Flights per Year from {selected_airport}')
    st.plotly_chart(fig_year_airport)

    # Number of flights per month
    st.write("### Number of Flights per Month")
    airport_df['month'] = airport_df['FL_DATE'].dt.month
    flights_per_month_airport = airport_df['month'].value_counts().sort_index().reset_index()
    flights_per_month_airport.columns = ['Month', 'Number of Flights']
    fig_month_airport = px.line(flights_per_month_airport, x='Month', y='Number of Flights', title=f'Number of Flights per Month from {selected_airport}')
    st.plotly_chart(fig_month_airport)

    # Number of flights per day of the week
    st.write("### Number of Flights per Day of the Week")
    airport_df['day_of_week'] = airport_df['FL_DATE'].dt.day_name()
    # Order the days of the week
    days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    flights_per_day_airport = airport_df['day_of_week'].value_counts().reindex(days_order).reset_index()
    flights_per_day_airport.columns = ['Day of Week', 'Number of Flights']
    fig_day_airport = px.bar(flights_per_day_airport, x='Day of Week', y='Number of Flights', title=f'Number of Flights per Day of the Week from {selected_airport}')
    st.plotly_chart(fig_day_airport)

    # Most common airlines operating from the selected airport
    st.write("### Most Common Airlines")
    airline_counts_airport = airport_df['AIRLINE_ID'].value_counts().reset_index()
    airline_counts_airport.columns = ['Airline ID', 'Number of Flights']

    n_airlines = st.slider(f"Select number of top airlines to display for {selected_airport}", 1, 10, 5, key=f'airport_airlines_slider_{selected_airport}')
    top_airlines_airport = airline_counts_airport.head(n_airlines)
    st.dataframe(top_airlines_airport)

    # Average delays for the selected airport
    st.write("### Average Delays")
    avg_dep_delay_airport = airport_df['DEP_DELAY'].mean()
    avg_weather_delay_airport = airport_df['WEATHER_DELAY'].mean()

    st.metric(f"Average Departure Delay from {selected_airport} (minutes)", f"{avg_dep_delay_airport:.2f}")
    st.metric(f"Average Weather Delay from {selected_airport} (minutes)", f"{avg_weather_delay_airport:.2f}")


Writing app.py


## Summary:

### Data Analysis Key Findings

*   The analysis successfully loaded and preprocessed the flight dataset, converting date and identifier columns to appropriate types and filling missing delay values with 0.
*   A Streamlit application structure was built with sections for General Visualizations, Airlines Analysis, and Airports Analysis.
*   The General Visualizations section includes a network graph showing the most common flight trajectories, with a slider to adjust the number of trajectories displayed.
*   A "Best Airline Suggester" was implemented to recommend the airline with the lowest average arrival delay for a user-selected origin and destination.
*   The Airlines Analysis section allows users to select an airline and view the number of flights per year and month using bar and line charts. It also displays the average departure and arrival delays for the selected airline and lists the most common trips.
*   The Airports Analysis section allows users to select an airport and view the number of flights originating from that airport per year, month, and day of the week using bar and line charts. It also lists the most common airlines operating from the airport and displays the average departure and weather delays.

### Insights or Next Steps

*   Consider adding a mapping from airline IDs to airline names for better readability in the Airlines Analysis and Best Airline Suggester sections.
*   To implement the map visualization for common trips, acquire or integrate a dataset containing geographical coordinates (latitude and longitude) for the airports.


In [3]:
# Install ngrok
!pip install pyngrok -q

In [4]:
# Authenticate ngrok (optional, but recommended for stability and more features)
# You can get your authtoken from https://dashboard.ngrok.com/get-started/your-authtoken
# Add your authtoken to Colab's secrets and name it 'NGROK_AUTH_TOKEN'
from google.colab import userdata
import os

# Get the authtoken from Colab secrets
authtoken = userdata.get('NGROK_AUTH_TOKEN')

# Authenticate ngrok
if authtoken:
  !ngrok authtoken $authtoken
else:
  print("Ngrok authtoken not found in Colab secrets. Running without authentication (might be less stable).")

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
# Run Streamlit app with ngrok
from pyngrok import ngrok
import threading
import time
import os
import psutil # Import the psutil library

# Terminate any existing ngrok tunnels
# More robust way to kill ngrok processes
for proc in psutil.process_iter():
    if "ngrok" in proc.name().lower():
        try:
            proc.kill()
            print(f"Killed existing ngrok process: {proc.pid}")
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

# Give a moment for processes to terminate
time.sleep(2)


# Start a ngrok tunnel to the Streamlit port (8501)
# We'll use a Thread to keep the Colab cell running
def run_streamlit():
    !streamlit run app.py --server.port 8501

print("Starting Streamlit app...")
thread = threading.Thread(target=run_streamlit)
thread.start()

# Give Streamlit a moment to start
time.sleep(10)

# Get the ngrok tunnel URL
try:
    public_url = ngrok.connect(8501).public_url
    print(f"Streamlit app is running at: {public_url}")
except Exception as e:
    print(f"Failed to start ngrok tunnel: {e}")
    print("Please check the ngrok documentation or try restarting the Colab runtime.")
    public_url = None


# Keep the cell alive while the Streamlit app is running
if public_url:
    try:
        while thread.is_alive():
            time.sleep(1)
    except KeyboardInterrupt:
        print("Stopping Streamlit app and ngrok tunnel.")
        ngrok.kill()
else:
    print("Streamlit app started, but ngrok tunnel failed. Cell will terminate after a short delay.")
    time.sleep(60) # Keep the cell alive for a minute to inspect logs if needed

Starting Streamlit app...

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
[0m
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Local URL: [0m[1mhttp://localhost:8501[0m
[34m  Network URL: [0m[1mhttp://172.28.0.12:8501[0m
[34m  External URL: [0m[1mhttp://34.23.80.76:8501[0m
[0m
Streamlit app is running at: https://substriated-rick-reunitable.ngrok-free.dev
Downloading from https://www.kaggle.com/api/v1/datasets/download/shaivyac/us-airline-dataset?dataset_version_number=2...
100% 22.1M/22.1M [00:00<00:00, 85.0MB/s]
Extracting files...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] 