# US Airline Dataset

https://www.kaggle.com/datasets/shaivyac/us-airline-dataset

## Dataset content:

- FL_DATE:
The date of the flight, stored as a single attribute (format: yyyymmdd) rather than as separate year, month, and day attributes.
- AIRLINE_ID:
An identification number assigned by the US DOT to identify a unique airline (carrier). Each carrier has exactly one value.
- TAIL_NUM:
The tail number that identifies the aircraft.
- FLIGHT_NUM:
The flight number, stored as a unique value for each flight.
- ORIGIN_SEQ_ID:
Unique ID of the flight's origin airport.
- ORIGIN_AIRPORT:
The origin airport as a code that is easier to read and display, for example JFK.
- DEST_SEQ_ID:
Unique ID of the flight's destination airport.
- DEST_AIRPORT:
The destination airport as a code that is easier to read and display, for example JFK.
- DEP_TIME:
Actual departure time in local time, in hhmm format.
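
The two time fields above can be decoded with the standard library alone. The sketch below assumes `FL_DATE` arrives as a `yyyymmdd` string and `DEP_TIME` as an `hhmm` string or number; check both against the actual file first. The helper names are hypothetical, and treating `2400` as midnight is an assumption about how the source encodes it.

```python
from datetime import date, datetime, time


def parse_fl_date(raw: str) -> date:
    """Parse a yyyymmdd string such as '20150314' into a date."""
    return datetime.strptime(raw.strip(), "%Y%m%d").date()


def parse_dep_time(raw) -> time:
    """Parse an hhmm value such as '0935', 935, or '935.0' into a time.

    Assumption: 2400 means midnight and is mapped to 00:00.
    """
    digits = str(int(float(raw))).zfill(4)  # 935 -> '0935'
    hours, minutes = int(digits[:2]) % 24, int(digits[2:])
    return time(hours, minutes)


flight_date = parse_fl_date("20150314")
departure = parse_dep_time("0935")
print(flight_date.isoformat(), departure.strftime("%H:%M"))
```

Going through `float` first lets the helper accept values that a CSV reader has already turned into floating-point numbers.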

Preprocessing:

Checklist:
- https://ft-interactive.github.io/visual-vocabulary/: use at least all of the categories in the summary of
- Think of the storyline according to this and the theory

- To build the dashboard, check the PyViz tools (we preferred Streamlit).


Create app:

# Task
Generate Python code to create a Streamlit application that analyzes flight data and provides insights on general flight patterns, airlines, and airports, based on the requirements provided in the user message. The application should include visualizations and data summaries for various aspects of flight data. The code should be placed in the empty cell with id "8nlGZX_DPPQT".

## Data loading and preprocessing

### Subtask:
Load the data from the CSV file into a pandas DataFrame and perform necessary preprocessing steps, such as converting data types and handling missing values.


**Reasoning**:
Load the data and perform the specified preprocessing steps, including type conversions and handling missing values.
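
As an illustration of the type conversions described above, the following standard-library sketch roughly mirrors what `pd.to_numeric(..., errors='coerce').fillna(0)` does, one value at a time. The sample CSV text and the helper name `to_number_or_zero` are made up for the example. Replacing a missing delay with 0 pulls averages toward zero, so it is a modelling choice rather than a neutral default.

```python
import csv
import io
import math


def to_number_or_zero(raw: str) -> float:
    """Return raw as a float; empty, unparseable, or NaN values become 0.0."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    return 0.0 if math.isnan(value) else value


# Hypothetical sample rows; the real file has many more columns.
sample_csv = """AIRLINE_ID,DEP_DELAY,ARR_DELAY
19805,12,7
19805,,NaN
19790,abc,-3
"""

rows = []
for record in csv.DictReader(io.StringIO(sample_csv)):
    rows.append({
        "AIRLINE_ID": int(to_number_or_zero(record["AIRLINE_ID"])),
        "DEP_DELAY": to_number_or_zero(record["DEP_DELAY"]),
        "ARR_DELAY": to_number_or_zero(record["ARR_DELAY"]),
    })

print(rows)
```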



## Create streamlit app structure

### Subtask:
Set up the basic structure of the Streamlit application, including the title and sidebar for navigation.


In [2]:
!pip install streamlit
!pip install -q kagglehub[pandas-datasets]

Collecting streamlit
  Downloading streamlit-1.51.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.51.0-py3-none-any.whl (10.2 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Installing collected packages: pydeck, streamlit
Successfully installed pydeck-0.9.1 streamlit-1.51.0


In [17]:
import streamlit as st
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import networkx as nx
import os

# Optional: kagglehub dataset download (user already used it)
import kagglehub

st.set_page_config(layout="wide")

# --- Data loading ------------------------------------------------------------------
@st.cache_data
def load_flight_data():
    # download dataset via kagglehub if present
    path = None
    try:
        path = kagglehub.dataset_download("shaivyac/us-airline-dataset")
        file_to_load = os.path.join(path, "Airline_dataset.csv")
        df = pd.read_csv(file_to_load)
    except Exception as e:
        st.warning("Could not auto-download the Airline dataset via kagglehub. Please upload a CSV named Airline_dataset.csv or provide the file via Streamlit uploader.")
        df = pd.DataFrame()
    return df

@st.cache_data
def parse_and_clean(df):
    if df.empty:
        return df
    df = df.copy()
    # Try to parse date with multiple common formats
    try:
        df['FL_DATE'] = pd.to_datetime(df['FL_DATE'], format='%m/%d/%y')
    except Exception:
        df['FL_DATE'] = pd.to_datetime(df['FL_DATE'], errors='coerce')
    # convert some columns to int if possible
    for c in ["AIRLINE_ID", "FLIGHT_NUM", "ORIGIN_SEQ_ID","DEST_SEQ_ID"]:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors='coerce').fillna(0).astype(int)
    for c in ['DEP_DELAY', 'ARR_DELAY', 'WEATHER_DELAY']:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors='coerce').fillna(0)
    return df

# Load flights
uploaded_flights = st.file_uploader("Upload Airline_dataset.csv (optional) — if not provided the app will try to download via kagglehub", type=['csv'])
if uploaded_flights is not None:
    df = pd.read_csv(uploaded_flights)
else:
    df = load_flight_data()

if df.empty:
    st.error("No flight data available yet. Upload the CSV or ensure kagglehub download works. Many visualizations will be disabled until flight data is available.")
else:
    df = parse_and_clean(df)

# Optional airport dataset (auto-downloaded only; no user uploads allowed)
# The app will automatically download and filter a curated US airport list from ourairports.com
import requests
from io import StringIO

st.sidebar.markdown("---")
st.sidebar.subheader("Airport data (auto-downloaded — uploads disabled)")

# IATA codes of the airports to keep
iata_codes = {
 'ABE','ABI','ABQ','ABR','ABY','ACK','ACT','ACV','ACY','ADK','ADQ','AEX','AGS','AKN','ALB','ALO',
 'ALW','AMA','ANC','APN','ART','ASE','ATL','ATW','ATY','AUS','AVL','AVP','AZA','AZO','BDL','BET',
 'BFF','BFL','BFM','BGM','BGR','BHM','BIL','BIS','BJI','BKG','BLI','BLV','BMI','BNA','BOI','BOS',
 'BPT','BQK','BQN','BRD','BRO','BRW','BTM','BTR','BTV','BUF','BUR','BWI','BZN','CAE','CAK','CDC',
 'CDV','CGI','CHA','CHO','CHS','CID','CIU','CKB','CLE','CLL','CLT','CMH','CMI','CMX','CNY','COD',
 'COS','COU','CPR','CRP','CRW','CSG','CVG','CWA','CYS','DAB','DAL','DAY','DBQ','DCA','DEN','DFW',
 'DHN','DIK','DLG','DLH','DRO','DRT','DSM','DTW','DUT','DVL','EAR','EAT','EAU','ECP','EGE','EKO',
 'ELM','ELP','ERI','ESC','EUG','EVV','EWN','EWR','EYW','FAI','FAR','FAT','FAY','FCA','FLG','FLL',
 'FLO','FNT','FSD','FSM','FWA','GCC','GCK','GEG','GFK','GGG','GJT','GNV','GPT','GRB','GRI','GRK',
 'GRR','GSO','GSP','GST','GTF','GTR','GUC','GUM','HDN','HGR','HHH','HIB','HLN','HNL','HOB','HOU',
 'HPN','HRL','HSV','HTS','HVN','HYA','HYS','IAD','IAG','IAH','ICT','IDA','ILM','IMT','IND','INL',
 'IPT','ISN','ISP','ITH','ITO','JAC','JAN','JAX','JFK','JHM','JLN','JMS','JNU','KOA','KTN','LAN',
 'LAR','LAS','LAW','LAX','LBB','LBE','LBF','LBL','LCH','LCK','LEX','LFT','LGA','LGB','LIH','LIT',
 'LNK','LNY','LRD','LSE','LWB','LWS','LYH','MAF','MBS','MCI','MCO','MDT','MDW','MEI','MEM','MFE',
 'MFR','MGM','MHK','MHT','MIA','MKE','MKG','MKK','MLB','MLI','MLU','MMH','MOB','MOT','MQT','MRY',
 'MSN','MSO','MSP','MSY','MTJ','MVY','MYR','OAJ','OAK','OGD','OGG','OGS','OKC','OMA','OME','ONT',
 'ORD','ORF','ORH','OTH','OTZ','OWB','PAE','PAH','PBG','PBI','PDX','PGD','PGV','PHF','PHL','PHX',
 'PIA','PIB','PIE','PIH','PIR','PIT','PLN','PNS','PPG','PQI','PRC','PSC','PSE','PSG','PSM','PSP',
 'PUB','PUW','PVD','PVU','PWM','RAP','RDD','RDM','RDU','RFD','RHI','RIC','RIW','RKS','RNO','ROA',
 'ROC','ROW','RST','RSW','SAF','SAN','SAT','SAV','SBA','SBN','SBP','SBY','SCC','SCE','SCK','SDF',
 'SEA','SFB','SFO','SGF','SGU','SHD','SHR','SHV','SIT','SJC','SJT','SJU','SLC','SLN','SMF','SMX',
 'SNA','SPI','SPN','SPS','SRQ','STC','STL','STS','STT','STX','SUN','SUX','SWF','SWO','SYR','TLH',
 'TOL','TPA','TRI','TTN','TUL','TUS','TVC','TWF','TXK','TYR','TYS','UIN','USA','VEL','VLD','VPS',
 'WRG','WYS','XNA','XWA','YAK','YKM','YUM'
}

filtered = None
try:
    st.sidebar.write("Downloading airports list from ourairports.com...")
    url = "https://ourairports.com/data/airports.csv"
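    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early instead of parsing an HTTP error page as CSV
    csv_data = response.text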
    airports = pd.read_csv(StringIO(csv_data))
    airports_us = airports[airports['iso_country'] == 'US']
    filtered = airports_us[airports_us['iata_code'].isin(iata_codes)][['iata_code','name','municipality','iso_region','latitude_deg','longitude_deg']]
    filtered = filtered.rename(columns={
        'iata_code': 'IATA',
        'name': 'Airport_Name',
        'municipality': 'City',
        'iso_region': 'State',
        'latitude_deg': 'Latitude',
        'longitude_deg': 'Longitude'
    })
    st.sidebar.success(f"Loaded {len(filtered)} airports (auto-downloaded)")
except Exception as e:
    st.sidebar.error("Failed to load airport data: " + str(e))
    filtered = None

# --- App layout ------------------------------------------------------------------
st.title('US Airline Data Analysis — Enhanced')
st.sidebar.title('Navigation')
page = st.sidebar.radio("Go to", ["General Visualizations", "Airlines Analysis", "Airports Analysis"])

# --------------------- General Visualizations ------------------------------------
if page == "General Visualizations":
    st.header("General Flight Patterns")

    st.subheader("Most Common Flight Trajectories (network)")
    if not df.empty:
        trajectory_counts = df.groupby(['ORIGIN_AIRPORT', 'DEST_AIRPORT']).size().reset_index(name='count')
        n_trajectories = st.slider("Select number of top trajectories to display", 10, 200, 50)
        top_trajectories = trajectory_counts.nlargest(n_trajectories, 'count')

        # network graph (existing)
        G = nx.DiGraph()
        for _, row in top_trajectories.iterrows():
            G.add_edge(row['ORIGIN_AIRPORT'], row['DEST_AIRPORT'], weight=row['count'])
        pos = nx.spring_layout(G, k=0.15, iterations=20)

        edge_x, edge_y = [], []
        for edge in G.edges():
            x0, y0 = pos[edge[0]]
            x1, y1 = pos[edge[1]]
            edge_x += [x0, x1, None]
            edge_y += [y0, y1, None]

        edge_trace = go.Scatter(x=edge_x, y=edge_y, line=dict(width=0.5, color='#888'), hoverinfo='none', mode='lines')

        node_x, node_y = [], []
        for node in G.nodes():
            x, y = pos[node]
            node_x.append(x); node_y.append(y)

        node_trace = go.Scatter(x=node_x, y=node_y, mode='markers', hoverinfo='text',
                                marker=dict(showscale=True, colorscale='YlGnBu', reversescale=True, color=[], size=10, line_width=2))

        node_adjacencies = []
        node_text = []
        # G.adjacency() yields (node, neighbors) pairs in node order;
        # for a DiGraph the neighbors are the outgoing connections only
        for node, neighbors in G.adjacency():
            node_adjacencies.append(len(neighbors))
            node_text.append(f'{node}: # of connections: {len(neighbors)}')

        node_trace.marker.color = node_adjacencies
        node_trace.text = node_text

        fig_net = go.Figure(data=[edge_trace, node_trace], layout=go.Layout(title=dict(text='Network graph of most common flight trajectories', font=dict(size=16)),
                                                                          showlegend=False, hovermode='closest',
                                                                          margin=dict(b=20,l=5,r=5,t=40),
                                                                          xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                                                                          yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
        st.plotly_chart(fig_net, use_container_width=True)
    else:
        st.info("Upload flight data to see network graph.")

    st.subheader("Top Airport Links (map)")
    # New: map visualization of strongest links using airport coordinates when available
    if (filtered is not None) and (not df.empty):
        try:
            # Merge flight data with airport coordinates
            df_merged = pd.merge(df, filtered, left_on='ORIGIN_AIRPORT', right_on='IATA', how='left')
            df_merged = pd.merge(df_merged, filtered, left_on='DEST_AIRPORT', right_on='IATA', how='left', suffixes=('_ORIGIN', '_DEST'))

            # flight counts
            flight_counts = df_merged.groupby(['ORIGIN_AIRPORT', 'DEST_AIRPORT']).size().reset_index(name='count')
            flight_counts = pd.merge(flight_counts, df_merged[['ORIGIN_AIRPORT', 'Latitude_ORIGIN', 'Longitude_ORIGIN']].drop_duplicates(), on='ORIGIN_AIRPORT', how='left')
            flight_counts = pd.merge(flight_counts, df_merged[['DEST_AIRPORT', 'Latitude_DEST', 'Longitude_DEST']].drop_duplicates(), on='DEST_AIRPORT', how='left')

            n_links = st.slider('Number of top links to show on the map', 10, 500, 100)
            top_links = flight_counts.nlargest(n_links, 'count')

            fig_links = go.Figure()
            # airport markers
            fig_links.add_trace(go.Scattergeo(lon=filtered['Longitude'], lat=filtered['Latitude'], text=filtered.get('Airport_Name', filtered.get('IATA', '')),
                                              mode='markers', marker=dict(size=5, color='rgb(255,0,0)', line=dict(width=1))))

            # add lines
            max_count = top_links['count'].max() if not top_links.empty else 1
            for _, row in top_links.iterrows():
                lon = [row['Longitude_ORIGIN'], row['Longitude_DEST']]
                lat = [row['Latitude_ORIGIN'], row['Latitude_DEST']]
                width = max(0.5, row['count'] / max_count * 5)
                fig_links.add_trace(go.Scattergeo(lon=lon, lat=lat, mode='lines', line=dict(width=width, color='blue'), opacity=0.6))

            fig_links.update_layout(title_text=f'Top {n_links} Strongest US Airport Links', showlegend=False,
                                    geo=dict(scope='usa', projection_type='albers usa', showland=True, landcolor='rgb(243,243,243)', countrycolor='rgb(204,204,204)'))
            st.plotly_chart(fig_links, use_container_width=True)
        except Exception as e:
            st.error("Error building links map: " + str(e))
    else:
        st.info("The map of top links needs both flight data and the auto-downloaded airport coordinates (IATA/Latitude/Longitude).")

    st.subheader("Best Airline Suggester (top 3)")
    if not df.empty:
        origin_airport_suggester = st.selectbox("Select Origin Airport", df['ORIGIN_AIRPORT'].unique(), key='origin_suggester')
        dest_airport_suggester = st.selectbox("Select Destination Airport", df['DEST_AIRPORT'].unique(), key='dest_suggester')

        if st.button("Suggest Best Airline"):
            route_df = df[(df['ORIGIN_AIRPORT'] == origin_airport_suggester) & (df['DEST_AIRPORT'] == dest_airport_suggester)]
            if route_df.empty:
                st.write("No flights found for this route.")
            else:
                airline_performance = route_df.groupby('AIRLINE_ID')['ARR_DELAY'].mean().reset_index()
                # AIRLINE_ID values are shown as-is; map them to names if a lookup table is available
                top3 = airline_performance.nsmallest(3, 'ARR_DELAY')
                st.write("Top 3 airlines (by lowest average arrival delay) for this route:")
                st.dataframe(top3.reset_index(drop=True))
    else:
        st.info("Upload flight data to use the airline suggester.")

# --------------------- Airlines Analysis -----------------------------------------
elif page == "Airlines Analysis":
    st.header("Airline Specific Insights")
    if df.empty:
        st.error("Upload flight data to use this page.")
    else:
        # global airline-level plots
        st.subheader("Total Flights per Airline")
        airline_flight_counts = df['AIRLINE_ID'].value_counts().reset_index(name='total_flights')
        airline_flight_counts.rename(columns={'index': 'AIRLINE_ID'}, inplace=True)
        airline_flight_counts['AIRLINE_ID'] = airline_flight_counts['AIRLINE_ID'].astype(str)
        fig_all_airlines = px.bar(airline_flight_counts.sort_values('total_flights', ascending=False), x='AIRLINE_ID', y='total_flights', title='Total Number of Flights per Airline')
        fig_all_airlines.update_layout(xaxis_title='Airline ID', yaxis_title='Total Number of Flights')
        st.plotly_chart(fig_all_airlines, use_container_width=True)

        st.subheader("Average Arrival Delay per Airline")
        airline_avg_delay = df.groupby('AIRLINE_ID')['ARR_DELAY'].mean().reset_index(name='average_delay')
        airline_avg_delay['AIRLINE_ID'] = airline_avg_delay['AIRLINE_ID'].astype(str)
        fig_avg_delay = px.bar(airline_avg_delay.sort_values('average_delay'), x='AIRLINE_ID', y='average_delay', title='Average Arrival Delay per Airline')
        fig_avg_delay.update_layout(xaxis_title='Airline ID', yaxis_title='Average Arrival Delay (minutes)')
        st.plotly_chart(fig_avg_delay, use_container_width=True)

        st.subheader("Most Common Airline by Destination (map)")
        if filtered is not None:
            try:
                most_common_airline_per_dest = df.groupby('DEST_AIRPORT')['AIRLINE_ID'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None).reset_index(name='Most_Common_Airline')
                airport_most_common_airline = pd.merge(filtered, most_common_airline_per_dest, left_on='IATA', right_on='DEST_AIRPORT', how='left')
                airport_most_common_airline.dropna(subset=['Most_Common_Airline'], inplace=True)
                airport_most_common_airline['Most_Common_Airline_Str'] = airport_most_common_airline['Most_Common_Airline'].astype(str)

                fig_dest_airline = px.scatter_geo(airport_most_common_airline, lat='Latitude', lon='Longitude', color='Most_Common_Airline_Str', hover_name='Airport_Name', projection='albers usa', title='Most Common Airline by Destination Airport')
                fig_dest_airline.update_layout(geo=dict(scope='usa'))
                st.plotly_chart(fig_dest_airline, use_container_width=True)
            except Exception as e:
                st.error("Error building destination-airline map: " + str(e))
        else:
            st.info("Airport coordinates could not be downloaded, so the most-common-airline map is unavailable.")

        # airline-specific insights (keeping user's original behavior)
        st.markdown("---")
        st.subheader("Airline-specific drilldown")
        airline_ids = df['AIRLINE_ID'].unique()
        selected_airline_id = st.selectbox("Select an Airline ID", airline_ids.astype(int) if airline_ids.dtype != object else airline_ids)
        airline_df = df[df['AIRLINE_ID'] == int(selected_airline_id)].copy()

        st.write(f"### Analysis for Airline ID: {selected_airline_id}")
        if airline_df.empty:
            st.write("No records for selected airline.")
        else:
            airline_df['year'] = airline_df['FL_DATE'].dt.year
            airline_df['month'] = airline_df['FL_DATE'].dt.month
            flights_per_year = airline_df['year'].value_counts().sort_index().reset_index()
            flights_per_year.columns = ['Year', 'Number of Flights']
            fig_year = px.bar(flights_per_year, x='Year', y='Number of Flights', title=f'Number of Flights per Year for Airline ID {selected_airline_id}')
            st.plotly_chart(fig_year, use_container_width=True)

            flights_per_month = airline_df['month'].value_counts().sort_index().reset_index()
            flights_per_month.columns = ['Month', 'Number of Flights']
            fig_month = px.line(flights_per_month, x='Month', y='Number of Flights', title=f'Number of Flights per Month for Airline ID {selected_airline_id}')
            st.plotly_chart(fig_month, use_container_width=True)

            avg_dep_delay = airline_df['DEP_DELAY'].mean()
            avg_arr_delay = airline_df['ARR_DELAY'].mean()
            st.metric("Average Departure Delay (minutes)", f"{avg_dep_delay:.2f}")
            st.metric("Average Arrival Delay (minutes)", f"{avg_arr_delay:.2f}")

            # Most common trips
            airline_trajectory_counts = airline_df.groupby(['ORIGIN_AIRPORT', 'DEST_AIRPORT']).size().reset_index(name='count')
            n_airline_trajectories = st.slider(f"Select number of top trips to display for Airline ID {selected_airline_id}", 5, 50, 10, key=f'airline_trips_slider_{selected_airline_id}')
            top_airline_trajectories = airline_trajectory_counts.nlargest(n_airline_trajectories, 'count')
            st.write("Top Common Trips:")
            st.dataframe(top_airline_trajectories)
            st.write("Map visualization of common trips for this airline would appear here if airport coordinates were available.")

# --------------------- Airports Analysis -----------------------------------------
elif page == "Airports Analysis":
    st.header("Airport Specific Insights")
    if df.empty:
        st.error("Upload flight data to use this page.")
    else:
        selected_airport = st.selectbox("Select an Airport", df['ORIGIN_AIRPORT'].unique())
        airport_df = df[df['ORIGIN_AIRPORT'] == selected_airport].copy()
        st.subheader(f"Analysis for {selected_airport}")

        airport_df['year'] = airport_df['FL_DATE'].dt.year
        flights_per_year_airport = airport_df['year'].value_counts().sort_index().reset_index()
        flights_per_year_airport.columns = ['Year', 'Number of Flights']
        fig_year_airport = px.bar(flights_per_year_airport, x='Year', y='Number of Flights', title=f'Number of Flights per Year from {selected_airport}')
        st.plotly_chart(fig_year_airport, use_container_width=True)

        airport_df['month'] = airport_df['FL_DATE'].dt.month
        flights_per_month_airport = airport_df['month'].value_counts().sort_index().reset_index()
        flights_per_month_airport.columns = ['Month', 'Number of Flights']
        fig_month_airport = px.line(flights_per_month_airport, x='Month', y='Number of Flights', title=f'Number of Flights per Month from {selected_airport}')
        st.plotly_chart(fig_month_airport, use_container_width=True)

        airport_df['day_of_week'] = airport_df['FL_DATE'].dt.day_name()
        days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
        # fill_value=0 keeps weekdays without flights as 0 instead of NaN
        flights_per_day_airport = airport_df['day_of_week'].value_counts().reindex(days_order, fill_value=0).reset_index()
        flights_per_day_airport.columns = ['Day of Week', 'Number of Flights']
        fig_day_airport = px.bar(flights_per_day_airport, x='Day of Week', y='Number of Flights', title=f'Number of Flights per Day of the Week from {selected_airport}')
        st.plotly_chart(fig_day_airport, use_container_width=True)

        airline_counts_airport = airport_df['AIRLINE_ID'].value_counts().reset_index()
        airline_counts_airport.columns = ['Airline ID', 'Number of Flights']
        n_airlines = st.slider(f"Select number of top airlines to display for {selected_airport}", 1, 10, 5, key=f'airport_airlines_slider_{selected_airport}')
        top_airlines_airport = airline_counts_airport.head(n_airlines)
        st.dataframe(top_airlines_airport)

        avg_dep_delay_airport = airport_df['DEP_DELAY'].mean()
        avg_weather_delay_airport = airport_df['WEATHER_DELAY'].mean()
        st.metric(f"Average Departure Delay from {selected_airport} (minutes)", f"{avg_dep_delay_airport:.2f}")
        st.metric(f"Average Weather Delay from {selected_airport} (minutes)", f"{avg_weather_delay_airport:.2f}")

        # Additional airport-level plots that require airport coordinates
        st.markdown("---")
        st.subheader("Airport heatmaps (require airport coordinates)")
        if filtered is not None:
            try:
                # significant weather-delayed flights (>5 mins)
                weather_delayed_flights_significant = df[df['WEATHER_DELAY'] > 5].copy()
                weather_delay_counts_significant = weather_delayed_flights_significant['ORIGIN_AIRPORT'].value_counts().reset_index(name='weather_delay_count_significant')
                weather_delay_counts_significant.rename(columns={'index': 'ORIGIN_AIRPORT'}, inplace=True)
                total_flight_counts = df['ORIGIN_AIRPORT'].value_counts().reset_index(name='total_count')
                total_flight_counts.rename(columns={'index': 'ORIGIN_AIRPORT'}, inplace=True)
                weather_delay_percentage_significant = pd.merge(weather_delay_counts_significant, total_flight_counts, on='ORIGIN_AIRPORT', how='left').fillna(0)
                weather_delay_percentage_significant['weather_delay_percentage_significant'] = (weather_delay_percentage_significant['weather_delay_count_significant'] / weather_delay_percentage_significant['total_count']) * 100
                airport_weather_delay_percentage_significant = pd.merge(filtered, weather_delay_percentage_significant, left_on='IATA', right_on='ORIGIN_AIRPORT', how='left').fillna(0)

                fig = px.density_mapbox(airport_weather_delay_percentage_significant, lat='Latitude', lon='Longitude', z='weather_delay_percentage_significant', radius=10, color_continuous_scale="Hot", center=dict(lat=37.0902, lon=-95.7129), zoom=3, mapbox_style="carto-positron", title='Heatmap of Percentage of Significant Weather Delayed Flights (> 5 mins) by Origin Airport', hover_name="Airport_Name", hover_data={'IATA': True, 'weather_delay_percentage_significant': ':.2f', 'Latitude': False, 'Longitude': False})
                fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
                st.plotly_chart(fig, use_container_width=True)

                # Non-weather delays average
                non_weather_delayed_flights = df[df['WEATHER_DELAY'] == 0].copy()
                non_weather_delay_avg = non_weather_delayed_flights.groupby('ORIGIN_AIRPORT')['ARR_DELAY'].mean().reset_index(name='average_non_weather_delay')
                airport_non_weather_delay_avg = pd.merge(filtered, non_weather_delay_avg, left_on='IATA', right_on='ORIGIN_AIRPORT', how='left').fillna(0)
                fig2 = px.density_mapbox(airport_non_weather_delay_avg, lat='Latitude', lon='Longitude', z='average_non_weather_delay', radius=10, color_continuous_scale="Viridis", center=dict(lat=37.0902, lon=-95.7129), zoom=3, mapbox_style="carto-positron", title='Heatmap of Average Non-Weather Arrival Delays by Origin Airport', hover_name="Airport_Name", hover_data={'IATA': True, 'average_non_weather_delay': ':.2f', 'Latitude': False, 'Longitude': False})
                fig2.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
                st.plotly_chart(fig2, use_container_width=True)

            except Exception as e:
                st.error("Error building airport heatmaps: " + str(e))
        else:
            st.info("Airport coordinates could not be downloaded, so airport heatmaps and maps are unavailable.")

# ---------------------------------------------------------------------------------

st.sidebar.markdown("---")
st.sidebar.caption("Enhanced app: airport coordinates are downloaded automatically to enable maps. If you have a mapping of AIRLINE_ID to names, provide it separately to show airline names instead of IDs.")


  Stopping...


2025-11-06 12:47:46.955 No runtime found, using MemoryCacheStorageManager
2025-11-06 12:47:46.963 No runtime found, using MemoryCacheStorageManager
2025-11-06 12:47:49.040 Please replace `use_container_width` with `width`.

`use_container_width` will be removed after 2025-12-31.

For `use_container_width=True`, use `width='stretch'`. For `use_container_width=False`, use `width='content'`.


DeltaGenerator(_root_container=1, _parent=DeltaGenerator())

## Summary:

### Data Analysis Key Findings

*   The analysis successfully loaded and preprocessed the flight dataset, converting date and identifier columns to appropriate types and filling missing delay values with 0.
*   A Streamlit application structure was built with sections for General Visualizations, Airlines Analysis, and Airports Analysis.
*   The General Visualizations section includes a network graph showing the most common flight trajectories, with a slider to adjust the number of trajectories displayed.
*   A "Best Airline Suggester" was implemented to list the three airlines with the lowest average arrival delay for a user-selected origin and destination.
*   The Airlines Analysis section allows users to select an airline and view the number of flights per year and month using bar and line charts. It also displays the average departure and arrival delays for the selected airline and lists the most common trips.
*   The Airports Analysis section allows users to select an airport and view the number of flights originating from that airport per year, month, and day of the week using bar and line charts. It also lists the most common airlines operating from the airport and displays the average departure and weather delays.
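
The ranking behind the airline suggester can be sketched without pandas. This is a minimal illustration, assuming flights are available as dictionaries keyed by the dataset's column names; the sample records and the function name `suggest_airlines` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def suggest_airlines(flights, origin, dest, top_n=3):
    """Rank airlines on one route by lowest mean arrival delay."""
    delays = defaultdict(list)
    for flight in flights:
        if flight["ORIGIN_AIRPORT"] == origin and flight["DEST_AIRPORT"] == dest:
            delays[flight["AIRLINE_ID"]].append(flight["ARR_DELAY"])
    averages = [(airline, mean(values)) for airline, values in delays.items()]
    # Sort ascending by mean delay; ties fall back to airline ID for stable output.
    return sorted(averages, key=lambda pair: (pair[1], pair[0]))[:top_n]


# Hypothetical sample flights.
flights = [
    {"AIRLINE_ID": 1, "ORIGIN_AIRPORT": "JFK", "DEST_AIRPORT": "LAX", "ARR_DELAY": 10},
    {"AIRLINE_ID": 1, "ORIGIN_AIRPORT": "JFK", "DEST_AIRPORT": "LAX", "ARR_DELAY": 20},
    {"AIRLINE_ID": 2, "ORIGIN_AIRPORT": "JFK", "DEST_AIRPORT": "LAX", "ARR_DELAY": -5},
    {"AIRLINE_ID": 3, "ORIGIN_AIRPORT": "JFK", "DEST_AIRPORT": "SFO", "ARR_DELAY": 0},
]

print(suggest_airlines(flights, "JFK", "LAX"))
```

An airline with only a handful of flights on a route can rank first on a small sample, so showing the flight count next to the mean may be worthwhile.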

### Insights or Next Steps

*   Consider adding a mapping from airline IDs to airline names for better readability in the Airlines Analysis and Best Airline Suggester sections.
*   To implement the per-airline map of common trips, reuse the airport coordinates (latitude and longitude) that the app already downloads for the map of top airport links.
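
A sketch of the first suggestion, assuming a lookup table from `AIRLINE_ID` to a display name is available. The IDs and names below are placeholders, not real carrier codes, and `airline_label` is a hypothetical helper; unknown IDs fall back to the raw ID so charts still render.

```python
# Placeholder lookup; replace with a real AIRLINE_ID-to-name table.
AIRLINE_NAMES = {
    1001: "Example Air",
    1002: "Sample Airways",
}


def airline_label(airline_id: int, names=AIRLINE_NAMES) -> str:
    """Return 'Name (ID)' when the ID is known, else the ID as text."""
    name = names.get(airline_id)
    return f"{name} ({airline_id})" if name else str(airline_id)


labels = [airline_label(i) for i in (1001, 1002, 9999)]
print(labels)
```

With pandas, the same idea is `df['AIRLINE_ID'].map(AIRLINE_NAMES)`, followed by a fallback for IDs that did not match.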


In [10]:
# Install ngrok
!pip install pyngrok -q

ERROR: Operation cancelled by user

In [15]:
# Authenticate ngrok (optional, but recommended for stability and more features)
# You can get your authtoken from https://dashboard.ngrok.com/get-started/your-authtoken
# Add your authtoken to Colab's secrets and name it 'NGROK_AUTH_TOKEN'
from google.colab import userdata
import os

# Get the authtoken from Colab secrets
authtoken = userdata.get('NGROK_AUTH_TOKEN')

# Authenticate ngrok
if authtoken:
  !ngrok authtoken $authtoken
else:
  print("Ngrok authtoken not found in Colab secrets. Running without authentication (might be less stable).")

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
# Run Streamlit app with ngrok
from pyngrok import ngrok
import threading
import time
import os
import psutil # Import the psutil library

# Terminate any existing ngrok tunnels
# More robust way to kill ngrok processes
for proc in psutil.process_iter():
    if "ngrok" in proc.name().lower():
        try:
            proc.kill()
            print(f"Killed existing ngrok process: {proc.pid}")
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

# Give a moment for processes to terminate
time.sleep(2)


# Start a ngrok tunnel to the Streamlit port (8501)
# We'll use a Thread to keep the Colab cell running
def run_streamlit():
    !streamlit run app.py --server.port 8501

print("Starting Streamlit app...")
thread = threading.Thread(target=run_streamlit)
thread.start()

# Give Streamlit a moment to start
time.sleep(10)

# Get the ngrok tunnel URL
try:
    public_url = ngrok.connect(8501).public_url
    print(f"Streamlit app is running at: {public_url}")
except Exception as e:
    print(f"Failed to start ngrok tunnel: {e}")
    print("Please check the ngrok documentation or try restarting the Colab runtime.")
    public_url = None


# Keep the cell alive while the Streamlit app is running
if public_url:
    try:
        while thread.is_alive():
            time.sleep(1)
    except KeyboardInterrupt:
        print("Stopping Streamlit app and ngrok tunnel.")
        ngrok.kill()
else:
    print("Streamlit app started, but ngrok tunnel failed. Cell will terminate after a short delay.")
    time.sleep(60) # Keep the cell alive for a minute to inspect logs if needed

Starting Streamlit app...

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://35.222.76.38:8501

Streamlit app is running at: https://substriated-rick-reunitable.ngrok-free.dev
2025-11-06 12:48:21.141 Please replace `use_container_width` with `width`.

`use_container_width` will be removed after 2025-12-31.

For `use_container_width=True`, use `width='stretch'`. For `use_container_width=False`, use `width='content'`.
2025-11-06 12:48:33.187 Please replace `use_container_width` with `w