In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from transformers import pipeline  # needed by the GPT-2 report step at the end of this cell

# Load the dataset
df = pd.read_csv('./DataSets/KaggleData/flights_gold.csv')

# Display available columns in the dataset
print("Available columns in dataset:", df.columns)

# Handle missing values in the delay columns by filling them with the column mean.
# Assigning the result back avoids the chained-assignment warning raised by inplace=True on a single column.
df['arr_delay'] = df['arr_delay'].fillna(df['arr_delay'].mean())
df['dep_delay'] = df['dep_delay'].fillna(df['dep_delay'].mean())

# Convert year, month, and day into a single datetime feature (flight_date)
df['flight_date'] = pd.to_datetime(df[['year', 'month', 'day']])

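# Extract additional time-based features: day of the week and hour of departure
df['day_of_week'] = df['flight_date'].dt.dayofweek  # Monday=0 ... Sunday=6; overwrites the existing column
# Assumes dep_time stores local time as HHMM (e.g. 517 = 05:17), so integer division by 100 gives the hour.
# Slicing the first two characters of the string form would turn 517 into 51.
df['hour_of_day'] = (df['dep_time'] // 100).astype(int)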

# Fill missing values in the remaining numeric columns with the column mean
df['flight_duration'] = df['flight_duration'].fillna(df['flight_duration'].mean())
df['distance'] = df['distance'].fillna(df['distance'].mean())

# If 'carrier' is a categorical column, encode it numerically
le_carrier = LabelEncoder()
df['carrier'] = le_carrier.fit_transform(df['carrier'])

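# Feature selection: use columns that are available in the dataset.
# NOTE: 'arr_delay' also defines the target below (arr_delay > 15), so keeping it here
# leaks the label into the features and makes a perfect score trivial to reach.
features = ['day_of_week', 'hour_of_day', 'dep_delay', 'arr_delay', 'carrier', 'distance', 'flight_duration']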

# Target variable: Predict whether there will be a delay (binary classification)
# Example: We create a binary target variable based on arrival delay
df['target_binary'] = (df['arr_delay'] > 15).astype(int)  # Label flights with arrival delay > 15 minutes as delayed

# Safeguard only: target_binary is a 0/1 integer and cannot be NaN, so this drops no rows
df = df.dropna(subset=['target_binary'])

# Define X (features) and y (target)
X = df[features]
y = df['target_binary']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Generate a free-text report on steps to mitigate delays using GPT-2 (publicly available).
# The prompt below is written by hand; it does not use the classifier's predictions.
generator = pipeline('text-generation', model='gpt2')

# Define a prompt for the model
prompt = """
The following airlines are showing a prediction of delays based on their flight performance:
- Carrier A: High chances of delay in the coming weeks due to frequent departures with high delays.
- Carrier B: Moderate chances of delay, with some disruptions due to weather conditions.
- Carrier C: Low chances of delay, currently maintaining on-time arrivals.

What actions should airline leaders take to mitigate unfavorable circumstances that are not good for business?
"""

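# Generate text based on the prompt.
# max_length counts the prompt tokens as well, so the continuation stops once 200 tokens are reached in total.
generated_text = generator(prompt, max_length=200, num_return_sequences=1)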

# Display the output
print(generated_text[0]['generated_text'])


Available columns in dataset: Index(['id', 'year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',
       'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',
       'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',
       'time_hour', 'name', 'flight_duration', 'day_of_week', 'total_delay'],
      dtype='object')
Accuracy of the model: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     49987
           1       1.00      1.00      1.00     10196

    accuracy                           1.00     60183
   macro avg       1.00      1.00      1.00     60183
weighted avg       1.00      1.00      1.00     60183
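
A score of 1.00 on every metric usually points to target leakage rather than a strong model. Here `target_binary` is computed from `arr_delay`, and `arr_delay` is also one of the features. The sketch below illustrates the effect on synthetic data (the values and the names `demo`, `leaky_columns`, and `safe_features` are made up for illustration, not taken from the dataset): a one-line threshold rule on the leaked column reproduces the label exactly, and the feature list is then rebuilt without it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flights table (values are invented for illustration)
rng = np.random.default_rng(42)
n_rows = 1000
demo = pd.DataFrame({
    "dep_delay": rng.integers(-10, 120, size=n_rows),
    "distance": rng.integers(100, 2500, size=n_rows),
})
demo["arr_delay"] = demo["dep_delay"] + rng.integers(-15, 15, size=n_rows)

# Same target definition as in the cell above
demo["target_binary"] = (demo["arr_delay"] > 15).astype(int)

# A one-line rule on the leaked column reproduces the label exactly,
# so any model given arr_delay only has to rediscover this threshold
rule_pred = (demo["arr_delay"] > 15).astype(int)
leak_accuracy = float((rule_pred == demo["target_binary"]).mean())

# Rebuild the feature list without the column that defines the target
features = ["day_of_week", "hour_of_day", "dep_delay", "arr_delay",
            "carrier", "distance", "flight_duration"]
leaky_columns = {"arr_delay"}
safe_features = [name for name in features if name not in leaky_columns]

print("Accuracy of the threshold rule:", leak_accuracy)
print("Features without the leaked column:", safe_features)
```

Retraining with `X = df[safe_features]` should give a more realistic estimate of how well delays can be predicted from information available before arrival. Whether other columns such as `total_delay` or `flight_duration` also encode the arrival delay depends on how they were derived and would need to be checked separately.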



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



The following airlines are showing a prediction of delays based on their flight performance:
- Carrier A: High chances of delay in the coming weeks due to frequent departures with high delays.
- Carrier B: Moderate chances of delay, with some disruptions due to weather conditions.
- Carrier C: Low chances of delay, currently maintaining on-time arrivals.

What actions should airline leaders take to mitigate unfavorable circumstances that are not good for business?

The airlines should start the elimination of the risk factor for carrier cancellations based on the actual time of the scheduled arrival date as outlined below:

What will the airline's decision be for flights coming on the cancelled aircraft be affected by on-time departures?

As detailed in our Airports and Terminal Policy, there are several aspects to consider when taking action to mitigate any adverse situation you may face before taking a call for a cancellation:

Whether a customer is cancelling a flight for travel re
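
The generated text above stops mid-word because `max_length=200` covers the prompt and the continuation together, and the warnings ask for an explicit `device`, `truncation=True`, and a `pad_token_id`. A minimal sketch of a call that addresses these points follows. The helper names `generation_kwargs` and `generate_report` and the 150-token budget are assumptions chosen for illustration; `transformers` is imported lazily so the settings can be inspected without loading the model.

```python
def generation_kwargs(eos_token_id, max_new_tokens=150):
    """Build keyword arguments for a text-generation pipeline call."""
    return {
        "max_new_tokens": max_new_tokens,  # budget for generated tokens only, excluding the prompt
        "truncation": True,                # explicit truncation, as the warning suggests
        "pad_token_id": eos_token_id,      # GPT-2 has no pad token, so reuse the end-of-sequence id
        "num_return_sequences": 1,
    }


def generate_report(prompt, device=-1):
    """Run GPT-2 on the prompt; device=-1 is CPU, device=0 is the first GPU."""
    from transformers import pipeline  # lazy import: only needed when text is actually generated

    generator = pipeline("text-generation", model="gpt2", device=device)
    kwargs = generation_kwargs(generator.tokenizer.eos_token_id)
    return generator(prompt, **kwargs)[0]["generated_text"]


# Inspect the settings without loading the model (50256 is GPT-2's end-of-sequence id)
print(generation_kwargs(eos_token_id=50256))
```

Calling `generate_report(prompt, device=0)` would place the model on the GPU that the first warning mentions. The base GPT-2 model is not instruction-tuned, so the continuation may still drift away from the question even with these settings.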

In [14]:
from transformers import pipeline

# Using GPT-2 (Publicly available)
generator = pipeline('text-generation', model='gpt2')

# Define a prompt for the model
prompt = """
The following airlines are showing a prediction of delays based on their flight performance:
- Carrier A: High chances of delay in the coming weeks due to frequent departures with high delays.
- Carrier B: Moderate chances of delay, with some disruptions due to weather conditions.
- Carrier C: Low chances of delay, currently maintaining on-time arrivals.

What actions should airline leaders take to mitigate unfavorable circumstances that are not good for business?
"""

# Generate text based on the prompt
generated_text = generator(prompt, max_length=200, num_return_sequences=1)

# Display the output
print(generated_text[0]['generated_text'])


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



The following airlines are showing a prediction of delays based on their flight performance:
- Carrier A: High chances of delay in the coming weeks due to frequent departures with high delays.
- Carrier B: Moderate chances of delay, with some disruptions due to weather conditions.
- Carrier C: Low chances of delay, currently maintaining on-time arrivals.

What actions should airline leaders take to mitigate unfavorable circumstances that are not good for business?

In order to reduce delays in general and a small number of particular airports on one chain, airlines are required to use their own time and resources. Air carriers should:

Consider the impact of any scheduled and delayed departures in the future and the impact of these delayed departures on the overall business performance of their airline. A number of important factors such as airline employees, employees' travel habits, the quality of the travel the airline conducts with its employees, and the ability to pay staff are i
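
The prompt in both cells is written by hand, so the report does not depend on the classifier. One possible way to connect the two steps is to build the carrier lines from per-carrier predicted delay rates. The sketch below assumes hypothetical helper names (`risk_label`, `build_prompt`) and arbitrary thresholds of 30% and 15%; the carrier codes and predictions in the example are invented.

```python
import pandas as pd

# Hypothetical thresholds for turning a predicted delay rate into a risk label
HIGH_RISK = 0.30
MODERATE_RISK = 0.15


def risk_label(delay_rate):
    """Map a share of flights predicted delayed to a coarse label."""
    if delay_rate >= HIGH_RISK:
        return "High"
    if delay_rate >= MODERATE_RISK:
        return "Moderate"
    return "Low"


def build_prompt(carriers, predicted_delayed):
    """Build the report prompt from carrier names and 0/1 delay predictions."""
    frame = pd.DataFrame({"carrier": carriers, "delayed": predicted_delayed})
    # Share of flights predicted delayed per carrier, highest first
    rates = frame.groupby("carrier")["delayed"].mean().sort_values(ascending=False)
    lines = [
        f"- Carrier {carrier}: {risk_label(rate)} chances of delay "
        f"({rate:.0%} of flights predicted delayed)."
        for carrier, rate in rates.items()
    ]
    header = ("The following airlines are showing a prediction of delays "
              "based on their flight performance:")
    question = ("What actions should airline leaders take to mitigate unfavorable "
                "circumstances that are not good for business?")
    return "\n".join([header, *lines, "", question])


# Invented example: 2 flights for AA, 5 for UA, 5 for DL
example_prompt = build_prompt(
    carriers=["AA", "AA", "UA", "UA", "UA", "UA", "UA", "DL", "DL", "DL", "DL", "DL"],
    predicted_delayed=[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
)
print(example_prompt)
```

With the notebook's variables, the inputs could come from `le_carrier.inverse_transform(X_test['carrier'])` and `y_pred`, since `carrier` was label-encoded before training.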