___

<div style="text-align: center;">
  <span style="font-family: 'Playfair Display', serif; font-size: 24px; font-weight: bold;">
    Training an MLP Classifier with TransE Embeddings for Price Group Prediction
  </span>
</div>

___

In this notebook, we utilize embeddings generated by the `transE.ipynb` notebook to train a Multi-Layer Perceptron (MLP) for predicting price groups. The main steps include:

- Data Acquisition: We load the embeddings produced by the TransE model in the previous notebook (`transE.ipynb`).
- Target Definition: The price group, which serves as the target variable for our prediction model, is defined and prepared.
- Model Design and Training: We design and train an MLP model using the TransE embeddings as input features and the price group as the target.
- Evaluation: The trained MLP model is evaluated to assess its performance in predicting the price group.
- Integration and Application: The trained model is integrated into the pipeline for further use in analysis and decision-making processes.

In [None]:
#!pip install torchkge

In [None]:
# Imports
import pickle

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from sklearn.metrics import accuracy_score, confusion_matrix

We load the pre-processed datasets and a pre-trained model from pickle files: the training data (`train`), the knowledge graph training data (`kg_train`), the test features (`test_X`), the test labels (`test_y`), and the model (`model`). These objects are used for further training, analysis, and prediction tasks.

In [None]:
with open('./objects/train.pkl', 'rb') as f:
    train = pickle.load(f)

with open('./objects/kg_train.pkl', 'rb') as f:
    kg_train = pickle.load(f)

with open('./objects/test_X.pkl', 'rb') as f:
    test_X = pickle.load(f)

with open('./objects/test_y.pkl', 'rb') as f:
    test_y = pickle.load(f)

with open('./objects/model.pkl', 'rb') as f:
    model = pickle.load(f)

Data Preparation

Now, we filter and process the training and test data for the MLP. First, we select the triples with the `price_discretized` relation, load the corresponding precomputed embeddings in batches, and concatenate them into a single tensor. We then convert both the features (`train_X`) and the target values (`train_y`) into PyTorch tensors for training.
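The loading cells below assume that the embeddings were saved in chunks of `batch_size` rows, one file per chunk, with each file named after the starting row index of its chunk. A minimal, self-contained sketch of that layout follows. The temporary directory, the toy array, and the helper names `save_in_batches` and `load_batches` are hypothetical and only illustrate the naming convention; they are not the code used in `transE.ipynb`.

```python
import os
import tempfile

import numpy as np


def save_in_batches(array, directory, batch_size):
    """Write `array` to `directory` in chunks named batch_<start>.npy."""
    for start in range(0, len(array), batch_size):
        path = os.path.join(directory, f"batch_{start}.npy")
        np.save(path, array[start:start + batch_size])


def load_batches(directory, n_rows, batch_size):
    """Reload the chunks in ascending start order and stack them row-wise."""
    parts = [
        np.load(os.path.join(directory, f"batch_{start}.npy"))
        for start in range(0, n_rows, batch_size)
    ]
    return np.concatenate(parts, axis=0)


# Toy data: 25 "entities" with 4-dimensional embeddings (made-up values).
toy_embeddings = np.arange(100, dtype=np.float32).reshape(25, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    save_in_batches(toy_embeddings, tmp_dir, batch_size=10)
    saved_files = sorted(os.listdir(tmp_dir))
    restored = load_batches(tmp_dir, n_rows=len(toy_embeddings), batch_size=10)

print(saved_files)
print(restored.shape)
```

Because the file names are derived from `range(0, n_rows, batch_size)`, the original row order is recovered only if the same row count and `batch_size` are used for saving and loading.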

Train:

In [None]:
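batch_size = 1000  # assumed to match the batch size used when the embeddings were saved

# Entities that have a price group, i.e. the subjects of the 'price_discretized' triples
filtered_items = train[train['rel'] == 'http://example.org/apartment/price_discretized']['from']

# Load the per-batch embedding files and stack them into a single array
train_embedding_list = [np.load(f'./train_X/batch_{start}.npy') for start in range(0, len(filtered_items), batch_size)]
train_embedding_list = np.concatenate(train_embedding_list, axis=0)
train_embedding_tensor = torch.tensor(train_embedding_list, dtype=torch.float32)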

In [None]:
train_X = train_embedding_tensor
# Price-group labels of the same 'price_discretized' triples
train_y = train[train['rel'] == 'http://example.org/apartment/price_discretized']['to']
train_y = torch.tensor(train_y.astype(float).to_numpy())

Test:

In [None]:
filtered_items = test_X['from']

test_embedding_list = [np.load(f'./test_X/batch_{start}.npy') for start in range(0, len(filtered_items), batch_size)]
test_embedding_list = np.concatenate(test_embedding_list, axis=0)
test_embedding_tensor = torch.tensor(test_embedding_list, dtype=torch.float32)

In [None]:
test_X = test_embedding_tensor
test_y = torch.tensor(test_y.astype(float).to_numpy())

MLP Training:

Next, we define the MLP, train it for 80 epochs, and evaluate its accuracy on both the training and test sets using PyTorch. Training is full-batch: each epoch performs a single Adam update computed on the entire training set.
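Two details of this setup are easy to miss: `nn.CrossEntropyLoss` expects raw logits, so the network's `forward` applies no softmax, and it expects integer class indices as targets, which is why the trainer casts the labels with `.long()`. The following sketch illustrates both points with made-up logits that are not taken from the notebook's data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 3 samples and 4 price groups (illustrative values only).
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.2, 0.1, 3.0, 0.0],
                       [0.1, 0.2, 0.3, 0.4]])
# Targets are class indices of dtype int64, not one-hot vectors or floats.
targets = torch.tensor([0, 2, 3])

loss = nn.CrossEntropyLoss()(logits, targets)

# Equivalent manual computation: log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Predicted class = index of the largest logit; softmax would not change it.
predictions = torch.argmax(logits, dim=1)

print(loss.item(), manual.item())
print(predictions.tolist())
```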

In [None]:
# Model definition: MLP classifier with two hidden layers
class MLPClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax internally
        return self.fc3(x)

class Trainer:
    def __init__(self, train_X, train_y, test_X, test_y, input_size, hidden_size=128, output_size=4, lr=0.001, num_epochs=80):
        self.train_X = train_X
        self.train_y = train_y.long()
        self.test_X = test_X
        self.test_y = test_y.long()
        self.num_epochs = num_epochs
        
        self.model = MLPClassifier(input_size, hidden_size, output_size)
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)

    def train(self):
        for epoch in range(self.num_epochs):
            self.model.train()
            self.optimizer.zero_grad()
            outputs = self.model(self.train_X)
            loss = self.criterion(outputs, self.train_y)
            loss.backward()
            self.optimizer.step()
            
            if (epoch + 1) % 2 == 0:
                print(f'Epoch [{epoch + 1}/{self.num_epochs}], Loss: {loss.item():.4f}')

    def evaluate(self):
        self.model.eval()
        with torch.no_grad():
            train_outputs = self.model(self.train_X)
            train_predictions = torch.argmax(train_outputs, dim=1)
            train_accuracy = accuracy_score(self.train_y.numpy(), train_predictions.numpy())

            test_outputs = self.model(self.test_X)
            test_predictions = torch.argmax(test_outputs, dim=1)
            test_accuracy = accuracy_score(self.test_y.numpy(), test_predictions.numpy())

        print(f'Train Accuracy: {train_accuracy * 100:.2f}%')
        print(f'Test Accuracy: {test_accuracy * 100:.2f}%')
        return test_predictions

In [None]:
input_size = train_X.shape[1]
trainer = Trainer(train_X, train_y, test_X, test_y, input_size)
trainer.train()
test_predictions = trainer.evaluate()

In [None]:
# Number of test samples assigned to each predicted class
pd.Series(test_predictions.numpy()).value_counts()

In [None]:
# Calculate the confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(test_y.long().numpy(), test_predictions.numpy())

# Create a heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

# Add labels to the plot
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

As it stands, the model is not a strong predictor. It performs better than uniform random guessing over the four price groups (25% accuracy), but it rarely predicts the two most expensive classes, and when it does, those predictions are often wrong.
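To make this observation measurable, per-class recall can be read off the confusion matrix: the diagonal entry of each row divided by the row sum. The sketch below uses a made-up 4×4 matrix purely for illustration; its numbers are not results from this notebook, and `per_class_recall` is a hypothetical helper.

```python
import numpy as np


def per_class_recall(cm):
    """Recall per class: correct predictions (diagonal) / actual count (row sum)."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1)
    # Classes that never occur in the test set get a recall of 0 instead of NaN.
    return np.divide(np.diag(cm), row_sums,
                     out=np.zeros_like(row_sums), where=row_sums > 0)


# Illustrative confusion matrix (rows = actual class, columns = predicted class).
toy_cm = np.array([
    [40, 10, 0, 0],
    [15, 30, 5, 0],
    [5, 20, 5, 0],
    [2, 10, 6, 2],
])

recall = per_class_recall(toy_cm)
accuracy = np.trace(toy_cm) / toy_cm.sum()
# Share of all predictions assigned to each class (column sums / total).
predicted_share = toy_cm.sum(axis=0) / toy_cm.sum()

print(np.round(recall, 2))
print(round(float(accuracy), 2))
print(np.round(predicted_share, 2))
```

Applied to the real `cm` computed above, low recall together with a low predicted share for the last two classes would confirm that the model both under-predicts and mispredicts the expensive price groups.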

Through experimentation, we found that these are the best results we could achieve with the current data before the model begins to overfit. Possible causes include inherent unpredictability in the data (though this is unlikely), too few data instances, or a suboptimal embedding method (TransE), possibly because the embedding dimension is too small.
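One common way to locate the point at which overfitting begins is to monitor the loss on a held-out validation split and stop once it has not improved for a fixed number of epochs. The notebook does not state how its stopping point was chosen, so the following is only a sketch of that general idea; `early_stopping_epoch`, `patience`, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement (0 if no losses)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch


# Made-up validation losses: improving at first, then rising again.
toy_val_losses = [1.40, 1.25, 1.18, 1.15, 1.16, 1.19, 1.23, 1.30]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=3)
print(stop_epoch)
```

Using this in the training loop above would require holding out part of the training data as a validation split, which the current `Trainer` does not do.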