Import the libraries the notebook needs: PyTorch for building and training the neural network, scikit-learn for the train/test split and K-fold cross-validation, NumPy for numerical operations, Matplotlib for plotting, and Pandas for loading and manipulating the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

import torch
import torch.nn as nn
import torch.optim as optim

Load the dataset from dataset.txt into a Pandas DataFrame. The file has no header row, so the column names are supplied explicitly: four features (total_flights, num_cancellations, time_since_booking, season_cancelled) and the target label cancel_label. The features and labels are converted to float32 PyTorch tensors, with the labels reshaped into a column vector. The data is then split into training and test sets using an 80/20 split with a fixed random seed.

In [2]:
# Load the data
df = pd.read_csv(
    "dataset.txt",
    header=None,
    names=[
        'total_flights',
        'num_cancellations',
        'time_since_booking',
        'season_cancelled',
        'cancel_label'
    ]
)

# Prepare the data: convert features and labels to tensors
X = torch.tensor(
    df[
        [
            'total_flights',
            'num_cancellations',
            'time_since_booking',
            'season_cancelled'
        ]
    ].values,
    dtype=torch.float32
)
y = torch.tensor(
    df['cancel_label'].values,
    dtype=torch.float32
).view(-1, 1)

# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

Define a neural network model called CancellationPredictor using PyTorch. The model has three fully connected layers with 64, 32, and 1 output units; ReLU follows the first two layers, and a sigmoid on the last layer turns the output into a probability. The input size is taken from the number of feature columns in the training data. The Adam optimizer (learning rate 0.001) and binary cross-entropy loss (BCELoss) are set up for training.

In [3]:
class CancellationPredictor(nn.Module):
    """Model for predicting flight cancellations."""
    def __init__(self, input_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        """Forward pass through the network."""
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x


# Initialize the model
input_size = X_train.shape[1]
model = CancellationPredictor(input_size)

# Optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
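
For reference, binary cross-entropy averages the per-sample quantity -[y log(p) + (1 - y) log(1 - p)] over the batch. The following is a minimal plain-Python sketch of that formula, not PyTorch's implementation; the helper name binary_cross_entropy and the eps clamp are assumptions made for illustration.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clamp p away from 0 and 1 so log() stays finite (illustrative choice).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A maximally uncertain prediction (p = 0.5) costs log(2) per sample.
loss = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
print(loss)
```

A confident correct prediction yields a loss near zero, while a confident wrong one is penalized heavily, which is why the validation loss can differ noticeably between folds that have similar accuracy.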

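Define a helper function that computes classification accuracy: predicted probabilities above 0.5 are mapped to class 1, and the resulting labels are compared with the true labels.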

In [5]:
def accuracy(y_true, y_pred):
    y_pred_labels = (y_pred > 0.5).float()
    correct = (y_pred_labels == y_true).float()
    return correct.mean().item()
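
As a worked example of the thresholding rule, here is a plain-Python sketch that mirrors accuracy on ordinary lists. The name accuracy_from_lists and the sample values are hypothetical and used only for illustration. Because the comparison is strict, a probability of exactly 0.5 maps to class 0.

```python
def accuracy_from_lists(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    # Strictly greater than the threshold counts as class 1.
    y_pred = [1.0 if p > threshold else 0.0 for p in y_prob]
    matches = [1.0 if pred == true else 0.0 for pred, true in zip(y_pred, y_true)]
    return sum(matches) / len(matches)


# Predictions 1, 0, 0, 1 against labels 1, 0, 1, 1: three of four are correct.
score = accuracy_from_lists([1.0, 0.0, 1.0, 1.0], [0.9, 0.2, 0.4, 0.7])
print(score)
```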

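Perform 5-fold cross-validation on the training data. For each fold, the model is trained for 15 full-batch epochs on the training part of the fold, and the validation loss and accuracy are then computed on the held-out part. The same model and optimizer instances are reused for every fold, so the weights carry over from one fold to the next; later folds are validated on samples the model has already been trained on, and the per-fold scores are therefore not independent estimates. The results for each fold are stored, and the average validation loss and accuracy across all folds are printed.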

In [6]:
from sklearn.model_selection import KFold

# Initialize KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
results = {'val_loss': [], 'val_accuracy': []}

# Cross-validation
for fold, (train_ids, val_ids) in enumerate(kfold.split(X_train)):
    print(f'Fold {fold + 1}')
    X_train_fold, X_val_fold = X_train[train_ids], X_train[val_ids]
    y_train_fold, y_val_fold = y_train[train_ids], y_train[val_ids]

    # Train the model
    for epoch in range(15):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train_fold)
        loss = criterion(outputs, y_train_fold)
        loss.backward()
        optimizer.step()

    # Validate the model
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_val_fold)
        val_loss = criterion(val_outputs, y_val_fold)
        val_acc = accuracy(y_val_fold, val_outputs)

        results['val_loss'].append(val_loss.item())
        results['val_accuracy'].append(val_acc)

        print(f'Validation Loss: {val_loss.item()}, Validation Accuracy: {val_acc}')

# Print the average results
print(f'Average Validation Loss: {np.mean(results["val_loss"])}')
print(f'Average Validation Accuracy: {np.mean(results["val_accuracy"])}')

Fold 1
Validation Loss: 0.32722264528274536, Validation Accuracy: 0.96875
Fold 2
Validation Loss: 0.5417284369468689, Validation Accuracy: 0.934374988079071
Fold 3
Validation Loss: 0.24359531700611115, Validation Accuracy: 0.9593750238418579
Fold 4
Validation Loss: 0.2850986421108246, Validation Accuracy: 0.9281250238418579
Fold 5
Validation Loss: 0.1614392250776291, Validation Accuracy: 0.96875
Average Validation Loss: 0.3118168532848358
Average Validation Accuracy: 0.9518750071525574
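
The loop above reuses one model and one optimizer for every fold. A common variant creates both afresh inside the loop so that each fold starts from untrained weights. The sketch below shows only that control flow in plain Python and is not the notebook's method: kfold_indices, cross_validate, and the callback names are hypothetical, the shuffling does not reproduce scikit-learn's KFold, and a toy majority-class "model" stands in for CancellationPredictor. In the notebook, make_model would plausibly build a new CancellationPredictor(input_size) together with a new Adam optimizer.

```python
import random


def kfold_indices(n_samples, n_splits, seed=42):
    """Yield (train_ids, val_ids) pairs that partition a shuffled index range."""
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_ids = ids[start:start + size]
        train_ids = ids[:start] + ids[start + size:]
        yield train_ids, val_ids
        start += size


def cross_validate(make_model, fit, evaluate, X, y, n_splits=5):
    """Score a freshly created model on each fold and return the scores."""
    scores = []
    for train_ids, val_ids in kfold_indices(len(X), n_splits):
        model = make_model()  # new, untrained state for every fold
        fit(model, [X[i] for i in train_ids], [y[i] for i in train_ids])
        scores.append(
            evaluate(model, [X[i] for i in val_ids], [y[i] for i in val_ids])
        )
    return scores


# Toy stand-in model (assumption): remembers the majority label of its fold.
def make_model():
    return {"majority": None}


def fit(model, X_fold, y_fold):
    model["majority"] = 1 if sum(y_fold) * 2 >= len(y_fold) else 0


def evaluate(model, X_fold, y_fold):
    return sum(1 for label in y_fold if label == model["majority"]) / len(y_fold)


X_toy = [[i] for i in range(20)]
y_toy = [1] * 15 + [0] * 5
fold_scores = cross_validate(make_model, fit, evaluate, X_toy, y_toy)
print(sum(fold_scores) / len(fold_scores))
```

Creating the model inside the loop keeps every validation fold unseen by the weights that are scored on it, at the cost of training from scratch five times.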


Evaluate the trained model on the held-out test set (the 20% split off earlier). The model is switched to evaluation mode, gradient tracking is disabled with torch.no_grad(), and the test loss and accuracy are computed and printed.

In [7]:
# Evaluate the model on the test data
model.eval()
with torch.no_grad():
    test_outputs = model(X_test)
    test_loss = criterion(test_outputs, y_test)
    test_acc = accuracy(y_test, test_outputs)
    print(f'Test Loss: {test_loss.item()}, Test Accuracy: {test_acc}')

Test Loss: 0.19809608161449432, Test Accuracy: 0.9549999833106995


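Demonstrate how to use the trained model to make predictions on new data. Three sample passengers are given as rows of feature values in the same order as the training columns (total_flights, num_cancellations, time_since_booking, season_cancelled), and the model predicts the probability of cancellation for each. The probabilities are collected in a list and printed.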

In [8]:
# Data to predict on
fly = [
    [112, 9, 133, 3],
    [68, 5, 365, 3],
    [56, 3, 209, 1]
]

# Copy the data and prepare a list for the results
data = np.copy(fly)
predictions = []

# Predict for each example
for person in data:
    # Arrange the data as a dictionary
    person_data = {
        'total_flights': person[0],
        'num_cancellations': person[1],
        'time_since_booking': person[2],
        'season_cancelled': person[3]
    }

    # Convert the data to a tensor
    person_tensor = torch.tensor(
        [
            person_data['total_flights'],
            person_data['num_cancellations'],
            person_data['time_since_booking'],
            person_data['season_cancelled']
        ],
        dtype=torch.float32
    ).unsqueeze(0)

    # Model prediction
    model.eval()
    with torch.no_grad():
        prediction = model(person_tensor)
        probability = prediction.item()

    predictions.append(probability)

# Print the results
print(predictions)

[0.06784028559923172, 0.013978885486721992, 0.05624682456254959]


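Calculate the probability that at least one of the three sample passengers will cancel their booking. Assuming the cancellations are independent, this is one minus the product of the individual probabilities of not cancelling, computed from the predicted probabilities of the previous cell.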

In [14]:
predicted_probabilities = np.array(predictions)
prob_no_one_cancels = np.prod(1 - predicted_probabilities)
prob_at_least_one_cancels = 1 - prob_no_one_cancels
print(f'Probability that at least one passenger cancels: {prob_at_least_one_cancels:.4f}')

Probability that at least one passenger cancels: 0.1311
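
The calculation above is the complement rule, 1 - prod(1 - p_i), which holds only if the passengers' decisions are treated as independent. A minimal standard-library sketch follows; prob_at_least_one is a hypothetical helper name, and the probabilities are made up for illustration rather than taken from the model.

```python
import math


def prob_at_least_one(probabilities):
    """Probability that at least one of several independent events occurs."""
    # Multiply the chances that each event does NOT occur, then take the complement.
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Two independent 50% events: 1 - 0.5 * 0.5.
print(prob_at_least_one([0.5, 0.5]))
```

If cancellations are correlated, for example passengers booked on the same trip, the product form no longer applies and this value should be read only as an approximation.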
