# GNN Model
The original A* model takes far too long to run, so we are building a GNN to make up for it.

The gist: the model is a node classifier that predicts, for each node, the likelihood that it is part of the solution path, as in ordinary classification. At prediction time there is an extra step: a node is only added if it is valid in the current answer, i.e. it is connected to the nodes already selected.
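The prediction step described above could look roughly like the sketch below. This is only an illustration, not the decoder used later: `constrained_decode`, the `scores` dict and the toy graph are all made up, and it assumes "connected" means reachable through an out-link from a node that is already in the answer.

```python
def constrained_decode(scores, successors, source, target, max_steps=None):
    """Greedily grow a node list from `source`, only ever adding a node that is
    linked from a node already selected, until `target` has been added."""
    selected = [source]
    # Candidates that are currently valid: out-neighbours of the selected nodes
    frontier = set(successors.get(source, ())) - {source}
    max_steps = len(scores) if max_steps is None else max_steps
    for _ in range(max_steps):
        if target in selected or not frontier:
            break
        # Among the valid candidates, take the one the classifier scores highest
        best = max(frontier, key=lambda node: scores.get(node, 0.0))
        selected.append(best)
        frontier.discard(best)
        frontier |= set(successors.get(best, ())) - set(selected)
    return selected


# Toy example: D has the highest score but is not valid until C has been added
toy_successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
toy_scores = {"A": 0.9, "B": 0.2, "C": 0.8, "D": 0.95}
decoded = constrained_decode(toy_scores, toy_successors, "A", "D")
print(decoded)  # ['A', 'C', 'D']
```

The point of the validity check is that raw classifier scores alone can pick nodes that do not form a walkable path; restricting candidates to the frontier keeps the answer connected at every step.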

In [1]:
import sys
import os
sys.path.append('../')
import data_readers

import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import math

# networkx
import networkx as nx
from networkx.drawing.nx_pydot import graphviz_layout

# For semantic similarity
from urllib.parse import unquote
from sentence_transformers import SentenceTransformer
import torch

# Python functions in .py file to read data
import machine_searchers
import time

import warnings
from tqdm import TqdmWarning
warnings.filterwarnings('ignore', category=TqdmWarning)
# I'll ignore the data embeddings, as that is 

In [2]:
wikispeedia = nx.read_edgelist('../datasets/wikispeedia_paths-and-graph/links.tsv',
                               create_using=nx.DiGraph)

def decode_word(word):
    word = word.replace('_', ' ')
    return unquote(word)

# Create a new graph with decoded node labels
decoded_wikispeedia = nx.DiGraph()

for node in wikispeedia.nodes():
    decoded_node = decode_word(node)
    decoded_wikispeedia.add_node(decoded_node)

# Copy the edges from the original graph to the new graph with decoded node labels
for edge in wikispeedia.edges():
    decoded_edge = tuple(decode_word(node) for node in edge)
    decoded_wikispeedia.add_edge(*decoded_edge)

# Building the embeddings
There are many ways of building the embeddings. For now, we take a simpler approach based on semantic similarity: each node is embedded from its article title. The embeddings also need to be extended with a marker for whether a node is the source or the target.
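As a rough illustration of extending the embeddings with a source/target marker, here is a plain-Python sketch on toy data. The helper name `add_endpoint_flag` and the toy vectors are hypothetical; the notebook itself works with torch tensors and does the equivalent with `torch.cat` further down.

```python
def add_endpoint_flag(embeddings, node_order, source, target):
    """Return a copy of `embeddings` with one extra column per row:
    1.0 for the source and target nodes, 0.0 for every other node."""
    endpoints = {source, target}
    return [
        list(vector) + [1.0 if node in endpoints else 0.0]
        for node, vector in zip(node_order, embeddings)
    ]


toy_order = ["A", "B", "C"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
features = add_endpoint_flag(toy_embeddings, toy_order, source="A", target="C")
print(features)  # [[0.1, 0.2, 1.0], [0.3, 0.4, 0.0], [0.5, 0.6, 1.0]]
```

One thing to keep in mind: a single combined column does not tell the model which endpoint is the source and which is the target. Since the graph is directed, two separate columns might be worth trying, but that is an untested idea.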

To create the training dataset, we also need to turn the shortest path for each pair of nodes into a classification target vector. This conversion still needs to be tested properly.

A decent amount of data creation...

# Getting all the embeddings

We need an embedding for each node. I'll just write the code and assume someone else can run it more efficiently with CUDA.

In [3]:
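# TODO: move the model to CUDA here if a GPU is available
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to get embeddings using sentence transformer
def get_embedding(text):
    temp_embed = model.encode(text, convert_to_tensor=True)
    # L2-normalize the vector. The model printed below already ends in a
    # Normalize() layer, so this is just a harmless safeguard.
    temp_embed = temp_embed / temp_embed.norm(p=2, dim=0, keepdim=True)
    return temp_embed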

model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [5]:
temp = model.encode('Test', convert_to_tensor=True)
temp.shape

torch.Size([384])

In [11]:
# This cell was interrupted before it finished; it only needs one full run
# Freeze the node order in a list so the embedding rows always follow it
node_list = list(decoded_wikispeedia.nodes())

text_embeddings = torch.zeros((len(node_list), 384))

for i, node in enumerate(node_list):
    text_embeddings[i] = get_embedding(node)

KeyboardInterrupt: 

In [12]:
torch.save(text_embeddings, 'text_embeddings.pt')

# Getting the classification task

For this, we compute the shortest paths between all pairs of nodes to get the classification targets.

We then turn each pair into a 2D tensor with the following rows:
- The first row marks the source and the target node with a 1. Everything else is 0
- The second row has a 1 for every node on the shortest path, including the source and target

The reason for the 2D layout is that the first row can be extracted and added to the embeddings, while the second row is the actual classification target. It's just a way of keeping everything that belongs to one pair together
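To make the two rows concrete, here is a small plain-Python sketch for a single pair. `encode_pair` and the toy node names are hypothetical; it assumes the path is given as a list of node names with the source first, and that `node_order` is the same order used for the embeddings.

```python
def encode_pair(node_order, path):
    """Build the two rows for one (source, target) pair.

    `path` is the shortest path as a list of node names, source first."""
    endpoints = {path[0], path[-1]}
    on_path = set(path)
    # Row 1: marks only the source and the target
    endpoint_row = [1.0 if node in endpoints else 0.0 for node in node_order]
    # Row 2: marks every node on the shortest path (the classification target)
    path_row = [1.0 if node in on_path else 0.0 for node in node_order]
    return endpoint_row, path_row


toy_order = ["A", "B", "C", "D"]
endpoint_row, path_row = encode_pair(toy_order, ["A", "C", "D"])
print(endpoint_row)  # [1.0, 0.0, 0.0, 1.0]
print(path_row)      # [1.0, 0.0, 1.0, 1.0]
```

Both rows have one entry per node, so they only make sense as long as `node_order` stays fixed across the embeddings, the targets and the adjacency matrix.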

This also means the overall output is either a dictionary keyed by (source, target) or a 3D tensor. I don't know which of the two is better...

I'll go with a dictionary for now because it is the simpler option. We can just pickle it

In [15]:
all_shortest_paths = dict(nx.all_pairs_shortest_path(decoded_wikispeedia))
shortest_paths_df = pd.DataFrame(all_shortest_paths)

In [17]:
shortest_paths_df.head()

Unnamed: 0,Áedán mac Gabráin,Bede,Columba,Dál Riata,Great Britain,Ireland,Isle of Man,Monarchy,Orkney,Picts,...,Wyndham Robertson,X-Men The Last Stand,X Window System protocols and architecture,X Window core protocol,Xanadu House,Yellowhammer,Yotsuya Kaidan,You're Still the One,"Yungay, Peru",Zara Yaqob
Áedán mac Gabráin,[Áedán mac Gabráin],,,,,,,,,,...,,,,,,,,,,
Bede,"[Áedán mac Gabráin, Bede]",[Bede],"[Columba, Dál Riata, Bede]","[Dál Riata, Bede]","[Great Britain, British Isles (terminology), B...","[Ireland, Anno Domini, Bede]","[Isle of Man, England, Bede]","[Monarchy, England, Bede]","[Orkney, Dál Riata, Bede]","[Picts, Bede]",...,"[Wyndham Robertson, American Civil War, Jesus,...","[X-Men The Last Stand, Vancouver, England, Bede]","[X Window System protocols and architecture, A...","[X Window core protocol, Unix, AT&T, United Ki...","[Xanadu House, Business, England, Bede]","[Yellowhammer, Bird migration, England, Bede]","[Yotsuya Kaidan, Tokyo, England, Bede]","[You're Still the One, California, Bird migrat...","[Yungay, Peru, Peru, 8th century, Bede]","[Zara Yaqob, Ethiopia, Jesus, Bede]"
Columba,"[Áedán mac Gabráin, Columba]","[Bede, Abbot, Christian monasticism, Columba]",[Columba],"[Dál Riata, Columba]","[Great Britain, Picts, Columba]","[Ireland, Book of Kells, Columba]","[Isle of Man, Scotland, Columba]","[Monarchy, Scotland, Columba]","[Orkney, Dál Riata, Columba]","[Picts, Columba]",...,"[Wyndham Robertson, Abraham Lincoln, Charles D...","[X-Men The Last Stand, Canada, Irish people, ...","[X Window System protocols and architecture, A...","[X Window core protocol, Unix, AT&T, United Ki...","[Xanadu House, Florida, Irish people, Columba]","[Yellowhammer, Europe, Anglicanism, Columba]","[Yotsuya Kaidan, Tokyo, Scotland, Columba]","[You're Still the One, California, English lan...","[Yungay, Peru, Earthquake, Netherlands, Scotla...","[Zara Yaqob, Ethiopia, 6th century, Columba]"
Dál Riata,"[Áedán mac Gabráin, Dál Riata]","[Bede, Durham Cathedral, Oswald of Northumbria...","[Columba, Dál Riata]",[Dál Riata],"[Great Britain, British Isles, Dál Riata]","[Ireland, British Isles, Dál Riata]","[Isle of Man, British Isles, Dál Riata]","[Monarchy, British monarchy, Dál Riata]","[Orkney, Dál Riata]","[Picts, Dál Riata]",...,"[Wyndham Robertson, U.S. state, British Isles,...","[X-Men The Last Stand, Canada, Irish people, ...","[X Window System protocols and architecture, A...","[X Window core protocol, Unix, AT&T, United Ki...","[Xanadu House, Florida, Irish people, Dál Riata]","[Yellowhammer, Carolus Linnaeus, Orkney, Dál R...","[Yotsuya Kaidan, Osaka, Tree, British Isles, D...","[You're Still the One, California, English lan...","[Yungay, Peru, Earthquake, United Kingdom, Bri...","[Zara Yaqob, Europe, British Isles, Dál Riata]"
Great Britain,"[Áedán mac Gabráin, Great Britain]","[Bede, Great Britain]","[Columba, Dál Riata, Great Britain]","[Dál Riata, Great Britain]",[Great Britain],"[Ireland, Great Britain]","[Isle of Man, British Isles, Great Britain]","[Monarchy, Anguilla, Great Britain]","[Orkney, Dál Riata, Great Britain]","[Picts, Aberdeen, Great Britain]",...,"[Wyndham Robertson, Abraham Lincoln, Andrew Jo...","[X-Men The Last Stand, Canada, Afghanistan, G...","[X Window System protocols and architecture, A...","[X Window core protocol, Unix, Latin, Great Br...","[Xanadu House, Computer, Great Britain]","[Yellowhammer, Animal, Latin, Great Britain]","[Yotsuya Kaidan, Osaka, Aquarium, Great Britain]","[You're Still the One, California, Computer, G...","[Yungay, Peru, Earthquake, Indonesia, Great Br...","[Zara Yaqob, Islam, Great Britain]"


In [19]:
series_of_elements = pd.Series(shortest_paths_df.index)
series_of_elements.head()

0    Áedán mac Gabráin
1                 Bede
2              Columba
3            Dál Riata
4        Great Britain
dtype: object

In [71]:
# In the resulting DF the column is the source node
# The target is the row

# There's probably a nicer pandas way of doing this than nested for loops,
# but that's okay for now
shortest_paths_classification_dict = {}

# Maps node names to positions in the classification vectors.
# IMPORTANT: the positions must match the rows of text_embeddings and the adjacency
# matrix, so the series is built from node_list (the order used for both) rather than
# from shortest_paths_df.index, whose order is not guaranteed to be the same.
series_of_elements = pd.Series(list(node_list))

for source in node_list:
    for target in node_list:
        # shortest_paths_df.loc[target, source] is the path from source to target
        path = shortest_paths_df.loc[target, source]

        # Pairs without a path hold NaN instead of a list; skip them,
        # as there is nothing to classify
        if not isinstance(path, list):
            continue

        source_and_target = series_of_elements.isin([source, target])
        shortest_path_here = series_of_elements.isin(path)

        # Each dict entry is a tuple: the first element marks source and target,
        # the second is the solution (the nodes on the shortest path)
        shortest_paths_classification_dict[(source, target)] = (
            torch.tensor(source_and_target.to_numpy(), dtype=torch.float32),
            torch.tensor(shortest_path_here.to_numpy(), dtype=torch.float32)
        )



KeyboardInterrupt: 

In [None]:
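# At this point the shortest_paths_classification_dict should be pickled,
# so the expensive loop above does not have to be rerun.
# The file name below is a placeholder, not something fixed elsewhere in the project.
import pickle

with open('shortest_paths_classification_dict.pkl', 'wb') as f:
    pickle.dump(shortest_paths_classification_dict, f)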


In [27]:
temp = shortest_paths_classification_dict[('Áedán mac Gabráin',
                                    'German language')]

temp

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.]])

In [29]:
# Someone please double check that the orders line up!
adj_matrix = nx.adjacency_matrix(decoded_wikispeedia, nodelist=node_list)

# Building the problem generator
All the parts are there; now it's time to put them together.

Namely, the problem generator does the following:
- Goes over the shortest_paths_classification_dict and attaches the relevant source/target marker and the y value where needed
- Checks the human queries and only supplies problems that have not been played by humans. This keeps the testing fair and avoids data leakage
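The filtering step in the second bullet can be sketched in a few lines of plain Python. `drop_human_pairs` and the toy data are hypothetical; the idea is simply to keep only the (source, target) keys that never appear among the human-played pairs.

```python
def drop_human_pairs(classification_dict, human_pairs):
    """Return a new dict without the (source, target) keys listed in `human_pairs`."""
    blocked = set(human_pairs)  # set lookup keeps the filter fast for many pairs
    return {
        pair: value
        for pair, value in classification_dict.items()
        if pair not in blocked
    }


toy_problems = {("A", "C"): "path-1", ("A", "D"): "path-2", ("B", "D"): "path-3"}
toy_human_pairs = [("A", "D"), ("X", "Y")]  # ("X", "Y") is not in the dict; that's fine
remaining = drop_human_pairs(toy_problems, toy_human_pairs)
print(sorted(remaining))  # [('A', 'C'), ('B', 'D')]
```

Returning a new dict instead of popping keys in place leaves the full dictionary available, which may be handy if the human-played pairs are later needed as a held-out evaluation set.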

There is some extra polishing to do here, particularly because PyTorch and PyTorch Geometric already provide data loader classes. I don't yet know how to hook into them.
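On the data loader question, one possible direction: PyTorch's `torch.utils.data.DataLoader` works with map-style datasets, i.e. objects that implement `__len__` and `__getitem__`. The sketch below shows that protocol on toy data. The class name `PathProblemDataset` is hypothetical, and the values are plain lists so the example runs without torch; in the notebook they would be the tensors from the classification dict, and one would normally subclass `torch.utils.data.Dataset`.

```python
class PathProblemDataset:
    """Map-style dataset with one item per (source, target) pair."""

    def __init__(self, classification_dict):
        # Sort the keys once so that index -> pair is stable between runs
        self.keys = sorted(classification_dict)
        self.classification_dict = classification_dict

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        # Each entry is (source/target marker, solution), as in the dict built earlier
        source_and_target, solution = self.classification_dict[self.keys[index]]
        return source_and_target, solution


toy_dict = {
    ("B", "C"): ([0.0, 1.0, 1.0], [0.0, 1.0, 1.0]),
    ("A", "C"): ([1.0, 0.0, 1.0], [1.0, 1.0, 1.0]),
}
dataset = PathProblemDataset(toy_dict)
print(len(dataset))  # 2
print(dataset[0])    # ([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
```

Assuming the values are tensors, something like `DataLoader(dataset, batch_size=32, shuffle=True)` should then take care of batching and shuffling. How the adjacency matrix fits into a PyTorch Geometric loader is a separate question that this sketch does not address.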

In [32]:
# TODO: We really should store this article_combinations as a df... it's pretty important
#   and calculating it all the time is a pain

finished_paths = pd.read_csv('../datasets/wikispeedia_paths-and-graph/paths_finished.tsv', sep='\t', skiprows=15,
                             names=['hashedIpAddress', 'timestamp', "durationInSec", 'path', "rating"])
finished_paths['first_article'] = finished_paths['path'].apply(lambda x: x.split(';')[0])
finished_paths['last_article'] = finished_paths['path'].apply(lambda x: x.split(';')[-1])
finished_paths['path_length'] = finished_paths['path'].apply(lambda x: len(x.split(';')))
finished_paths['date'] = pd.to_datetime(finished_paths['timestamp'], unit='s')

# How many times each pair of articles has been played
article_combinations_count = finished_paths.groupby(['first_article', 'last_article']).size().reset_index(name='count')

# The mean and std of the path length for each pair of articles
article_combinations_stats = finished_paths.groupby(['first_article', 'last_article'])['path_length'].agg(['mean', 'std']).reset_index()
article_combinations_stats['std'] = article_combinations_stats['std'].fillna(0)
article_combinations_stats.rename(columns={'mean': 'mean_length', 'std': 'std_length'}, inplace=True)

# The mean and std of the rating for each pair of articles.
# Note that mean and std may be nan if there are nan ratings. We purposely leave them as nan, as we don't want to fill them with 0s or 1s.
# Depending on the application, we could change this in the future if needed.
rating_combinations_stats_rating = finished_paths.groupby(['first_article', 'last_article'])['rating'].agg(['mean', 'std']).reset_index()
#rating_combinations_stats_rating['std'] = rating_combinations_stats_rating['std'].fillna(0)
mask = rating_combinations_stats_rating['mean'].notnull()
rating_combinations_stats_rating.loc[mask, 'std'] = rating_combinations_stats_rating.loc[mask, 'std'].fillna(0)
rating_combinations_stats_rating.rename(columns={'mean': 'mean_rating', 'std': 'std_rating'}, inplace=True)

# The mean and std of the time for each pair of articles.
rating_combinations_stats_time = finished_paths.groupby(['first_article', 'last_article'])['durationInSec'].agg(['mean', 'std']).reset_index()
rating_combinations_stats_time['std'] = rating_combinations_stats_time['std'].fillna(0)
rating_combinations_stats_time.rename(columns={'mean': 'mean_durationInSec', 'std': 'std_durationInSec'}, inplace=True)

# Merging all the dataframes
article_combinations = pd.merge(article_combinations_count, article_combinations_stats, on=['first_article', 'last_article'])
article_combinations = pd.merge(article_combinations, rating_combinations_stats_rating, on=['first_article', 'last_article'])
article_combinations = pd.merge(article_combinations, rating_combinations_stats_time, on=['first_article', 'last_article'])

article_combinations['first_article'] = article_combinations['first_article'].apply(decode_word)
article_combinations['last_article'] = article_combinations['last_article'].apply(decode_word)


In [33]:
article_combinations.head()

Unnamed: 0,first_article,last_article,count,mean_length,std_length,mean_rating,std_rating,mean_durationInSec,std_durationInSec
0,€2 commemorative coins,Irish Sea,1,3.0,0.0,1.0,0.0,15.0,0.0
1,10th century,11th century,3,2.0,0.0,2.333333,2.309401,4.333333,1.527525
2,10th century,Banknote,1,5.0,0.0,3.0,0.0,48.0,0.0
3,10th century,Country,1,3.0,0.0,1.0,0.0,15.0,0.0
4,10th century,Harlem Globetrotters,2,4.5,0.707107,2.0,0.0,75.0,24.041631


In [41]:
result_list = list(zip(article_combinations['first_article'], article_combinations['last_article']))

[('€2 commemorative coins', 'Irish Sea'),
 ('10th century', '11th century'),
 ('10th century', 'Banknote'),
 ('10th century', 'Country'),
 ('10th century', 'Harlem Globetrotters'),
 ('10th century', 'History of democracy'),
 ('10th century', 'Marco Polo'),
 ('10th century', 'Soviet Union'),
 ('11th century', 'Art'),
 ('11th century', 'Dimetrodon'),
 ('11th century', 'Education in the United States'),
 ('11th century', 'Ho Chi Minh'),
 ('11th century', 'Hurricane Alex (2004)'),
 ('11th century', 'John Adams'),
 ('11th century', 'Lhasa'),
 ('11th century', 'Nintendo Entertainment System'),
 ('11th century', 'Plum'),
 ('11th century', 'Taiwan'),
 ('11th century', 'Warsaw'),
 ('12th century', 'Belfast'),
 ('12th century', 'Edinburgh'),
 ('12th century', 'Flat Earth'),
 ('12th century', 'Geography of Ireland'),
 ('12th century', 'Guitar'),
 ('12th century', 'Islam'),
 ('12th century', 'Katana'),
 ('12th century', 'Mexico'),
 ('12th century', 'Suleiman the Magnificent'),
 ('12th century', 'U

In [93]:
import random

class ProblemGenerator:
    def __init__(self, embedded_nodes: torch.Tensor = None, classification_dict: dict = None,
                 adjacency_matrix: torch.Tensor = None, paths_not_to_use: pd.DataFrame = None):
        if embedded_nodes is None:
            # TODO: Add a way of reading in the pickled data of the tensors
            pass
        
        if classification_dict is None:
            # TODO: Pickle data here too
            pass
        
        # TODO: Do the pickling thing also for the adjacency matrix and the paths not to use!
        
        self.embedded_nodes = embedded_nodes
        self.classification_dict = classification_dict
        
        self.adjacency_matrix = adjacency_matrix

        # Only filter when a df of pairs to exclude was actually supplied
        if paths_not_to_use is not None:
            self.remove_paths_not_to_use(paths_not_to_use)
        
    def remove_paths_not_to_use(self, paths_not_to_use: pd.DataFrame):
        """Goes over the classification dict and removes any node pairs that
        are contained in the paths_not_to_use df.
        
        Note that this modifies the dict that was passed in."""
        
        # Build the list of (first_article, last_article) pairs to drop.
        # Uses the argument, not the global article_combinations df.
        elems_to_remove = list(zip(paths_not_to_use['first_article'], paths_not_to_use['last_article']))
        for elem in elems_to_remove:
            self.classification_dict.pop(elem, None)
        
    def generate_problems(self, number_of_problems_to_generate: int = 200):
        """Generates problems.
        
        For number_of_problems_to_generate randomly picked pairs, attach the
        source/target marker to the embeddings as an extra column, and output
        the matching y value alongside.
        
        The X output is therefore 3D: (problem, node, features).
        
        I'm not sure how to include the adjacency matrix in the output yet."""
        
        # Pick the random keys (sorted first so the population order is stable)
        random_keys = random.sample(sorted(self.classification_dict.keys()), number_of_problems_to_generate)
        
        # Now, for each key generate a version of the problem
        # The info is stored in a 3D tensor, which is filled one problem at a time
        # It's shape[1] + 1 because we add the source/target marker column to the inputs
        emb_nodes_shape = self.embedded_nodes.shape
        X_values = torch.zeros((number_of_problems_to_generate, emb_nodes_shape[0], emb_nodes_shape[1] + 1))
        
        # This is a 2D matrix, first dim is the problem,
        # second dim is the solution for each node
        y_values = torch.zeros((number_of_problems_to_generate, emb_nodes_shape[0]))
        
        # enumerate advances i, so each problem lands in its own slot
        for i, elem in enumerate(random_keys):
            cur_source_and_target, cur_solution = self.classification_dict[elem]
            y_values[i] = cur_solution
            
            # TODO: I'm not sure if it's worth copying the embedded nodes to avoid
            # odd interactions with the gradient. torch.cat already returns a new tensor.
            X_values[i] = torch.cat([self.embedded_nodes, cur_source_and_target.unsqueeze(dim=1)], dim=1)
            
        return X_values, y_values
        

In [102]:
# This is to show what the structure is for the prob_gen!
prob_gen = ProblemGenerator(text_embeddings, shortest_paths_classification_dict,
                            adj_matrix, article_combinations)

test_X, test_y = prob_gen.generate_problems(5)

print(test_X.shape)
print(test_y.shape)

torch.Size([5, 4592, 385])
torch.Size([5, 4592])


Okay, the previous parts work, which is good!

They need to be added to separate files and we need to do some pickling, that's okay!

# The actual GNN

