# **Hands-on Part 4/4: Education Pipeline in Action**

---

![](https://drive.google.com/uc?id=1QUpBcnRMLS_N79I1Bmo1ASyPLJOzAhb9)

## **Acknowledgment**

---

The code use in this tutorial directly derive from our [PEARLM Library](https://github.com/Chris1nexus/pearlm). If this tutorial is useful for your research, we would appreciate an acknowledgment by citing our paper:

> Balloccu, G., Boratto, L., Cancedda, C., Fenu, G., & Marras, M. (2023). Faithful Path Language Modelling for Explainable Recommendation over Knowledge Graph. ArXiv, abs/2310.16452.


## **Get Started**

---


### This notebook

By now, you should already have the Tutorial folder in your google drive. You just need to mount your drive executing the following line.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


And browsing in the working directory.

In [None]:
# Your path to hands on
%cd '/content/drive/MyDrive/ExpRecSys Tutorial Series/2024 ECIR/Hands-On'

/content/drive/MyDrive/ExpRecSys Tutorial Series/2024 ECIR/Hands-On


If you followed Part1, you are ready! 🤘 You can skip the next lines.

### Instead, if you joined late

Open the google drive folder [https://tinyurl.com/ecir2024-tutorial1](https://tinyurl.com/ecir2024-tutorial1) containing the material and follow the instrucions inside `GetStarted.ipynb`

## Outline

---



In the previous part we:

1️⃣ Sampled paths from an existing knowledge graph which will be used as **training data for PLM and PEARLM**.

2️⃣ Trained the **PLM** [[37]](#p37) and **PEARLM [[38]](#p38)** models and use their decoding to generate paths.

3️⃣ **Evaluated the models** and converted their path into **texual explanations via templates** [[33]](#p33).

4️⃣ **Measured the hallucination phenomena** [[38]](#p38) in PLM and see how PEARLM's constraint decoding solves it.

In **this part**,  we will execute again the whole pipeline with the CoCo dataset.


In this part we will:

1️⃣ Learn about the Educational **CoCo dataset**.

2️⃣ Train and inference with **path reasoning methods** (PGPR & CAFE).

3️⃣ Train and inference with **Causal Language Models for path reasoning** methods (PLM & PEARLM).



- [ 0 - Packages](#0)
- [ 1 - Prerequisites](#1)
- [ 2 - Training TransE Knowledge Graph Embedding](#2)
- [ 3 - PGPR train pipeline](#3)
- [ 4 - CAFE train pipeline](#4)
- [ 5 - Causal Language Modeling for Path Reasoning](#5)
- [ 6 - PLM train pipeline](#6)
- [ 7 - PEARLM train pipeline](#7)

<a name="0"></a>

## 0 - Packages

---

In [96]:
import numpy as np
import pandas as pd
import random
from collections import defaultdict
import pickle
import warnings
!pip install datasets



In [None]:
! pip install .

Processing /content/drive/MyDrive/ExpRecSys Tutorial Series/2024 ECIR/Hands-On
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pathlm
  Building wheel for pathlm (setup.py) ... [?25l[?25hdone
  Created wheel for pathlm: filename=pathlm-0.0.0-py3-none-any.whl size=236948 sha256=2e37065aca7c5fdfe76d64a5264b8f7ec052380eb22e5cdd7ecea3b34854fd73
  Stored in directory: /root/.cache/pip/wheels/5d/d6/fa/857159ee5e51c820ba3cb52d5f60b61adc5ab379954711150d
Successfully built pathlm
Installing collected packages: pathlm
  Attempting uninstall: pathlm
    Found existing installation: pathlm 0.0.0
    Uninstalling pathlm-0.0.0:
      Successfully uninstalled pathlm-0.0.0
Successfully installed pathlm-0.0.0


<a name="1"></a>

## 1 - CoCo Dataset

---


In [97]:
dataset_name = 'coco'
coco_preprocessed_path = f'data/{dataset_name}/preprocessed'

<a name="3.1"></a>
### 3.1 Dataset

This dataset is located under `data/coco`. The dataset has been obtained adding side informations like `target_audience`, `bloom taxonomy`, [[34, 39]](#p34). And performing the preprocessing illustrated with the other datasets in Notebook 1. You can find the script and the final dataset in this [github repository](https://github.com/giacoballoccu/CoCo-Educational-Recommendation-Dataset).



#### Dataset Description

The dataset is composed by the following files:

- 👥 `data/lfm1m/coco/users.txt`: List of users.

- 📚 `data/lfm1m/coco/products.txt`: Items catalog of courses and associated teacher.

- ⭐ `data/lfm1m/coco/ratings.txt`: List of users' positive interactions with courses.

The files are composed by rows **separated by** `\t` that contain crucial and side-information for the recommendation task.

This version is a reduced version of the original CoCo dataset [[34]](#p34). It has been obtained reducing the original dataset performing a k-core with k=10.

All the preprocessingis already done! We will just load the data, visualize it's content, and print some stats.

#### Load and visualize

Before starting, let's save a variable with the path to the `data/lfm1m/`; in this way, we can avoid repeating code.

In [None]:
coco_path = 'data/coco/preprocessed'

Let's load into a dataframe the users and visualize them.

In [None]:
coco_users_df = pd.read_csv(f"{coco_path}/users.txt", sep="\t")
display(coco_users_df.head(5))
print(f"Number of users: {coco_users_df.shape[0]}")

Unnamed: 0,uid,gender,age
0,34039642,-1,-1
1,26079730,-1,-1
2,35306780,-1,-1
3,12445364,-1,-1
4,1530750,-1,-1


Number of users: 24036


As you may notice for this dataset we do not have sensible attributes for users 😿

Let's visualise now the courses available in the dataset.

In [None]:
coco_courses_df = pd.read_csv(f"{coco_path}/products.txt", sep="\t")
display(coco_courses_df.head(5))
print(f"Number of tracks: {coco_courses_df.shape[0]}")

Unnamed: 0,pid,name,provider_id,genre,pop_item,pop_provider
0,497736,social-psychology,12056666,academics,0.008343,0.119248
1,27696,introductory-psychology,345797,academics,0.0441,0.001979
2,992220,international-politics-online-course,26526572,academics,0.02205,0.000742
3,508508,amazing-psychology-experiments,12056666,academics,0.018474,0.119248
4,1054944,advanced-psychology,17111598,academics,0.010727,0.01237


Number of tracks: 8196


And the ratings.

In [None]:
coco_ratings_df = pd.read_csv(f"{coco_path}/ratings.txt", sep="\t")
display(coco_ratings_df.head(5))
print(f"Number of ratings: {coco_ratings_df.shape[0]}")

Unnamed: 0,uid,pid,rating,timestamp
0,34039642,497736,1,1506674923
1,26079730,497736,1,1506499865
2,35306780,497736,1,1505932107
3,12445364,497736,1,1505664516
4,1530750,497736,1,1504780240


Number of ratings: 378469


<a name="3.2"></a>
### 3.2 Knowledge Graph (KG)


This KG is located under `data/coco/preprocessed`. It was extracted from the metadata available from the dataset.

#### KG description

The Freebase KG is composed of 4 main files:

- `i2kg_map.tsv`: Mapping between the item id from thd CoCo dataset and the corresponding entity in the KG.

- `e_map.txt`: Set of entities (Including items).

- `r_map.txt`: Set of relations.

- `kg.txt`: Set of triplets *(entity_head, relation, entity_tail)*.

The files, internally, again follow the standard format.

#### KG Files

First the item mapping to the kg `i2kg_map.txt`.

In [None]:
courses_to_kg_df = pd.read_csv(f"{coco_path}/i2kg_map.txt", sep="\t")
display(courses_to_kg_df.head(5))
print(f"Number of songs mapped to KG: {courses_to_kg_df.shape[0]}")

Unnamed: 0,eid,pid,name,entity
0,0,497736,social-psychology,497736
1,1,27696,introductory-psychology,27696
2,2,992220,international-politics-online-course,992220
3,3,508508,amazing-psychology-experiments,508508
4,4,1054944,advanced-psychology,1054944


Number of songs mapped to KG: 8196


Then the entity list `e_map.txt`.

In [None]:
entities_df = pd.read_csv(f"{coco_path}/e_map.txt", sep="\t")
display(entities_df.tail(5))
print(f"Number of entities in the KG: {entities_df.shape[0]}")

Unnamed: 0,eid,name,entity
11241,11241,Food preparation assistants,Food preparation assistants
11242,11242,"Food processing, wood working, garment and oth...","Food processing, wood working, garment and oth..."
11243,11243,"Labourers in mining, construction, manufacturi...","Labourers in mining, construction, manufacturi..."
11244,11244,Refuse workers and other elementary workers,Refuse workers and other elementary workers
11245,11245,"Subsistence farmers, fishers, hunters and gath...","Subsistence farmers, fishers, hunters and gath..."


Number of entities in the KG: 11246


Then the KG triplets `kg_final.txt`.

In [None]:
kg_df = pd.read_csv(f"{coco_path}/kg_final.txt", sep="\t")
display(kg_df.head(5))
print(f"Number of triplets in the KG: {kg_df.shape[0]}")

Unnamed: 0,entity_head,relation,entity_tail
0,0,0,8196
1,0,0,8197
2,1,0,8196
3,1,0,8197
4,2,0,8196


Number of triplets in the KG: 104983


And the relations `r_map.txt`.

In [None]:
relations_df = pd.read_csv(f"{coco_path}/r_map.txt", sep="\t")
relations_df

Unnamed: 0,id,kb_relation,name
0,0,belong_to_category,belong_to_category
1,1,related_to_concept,related_to_concept
2,2,taught_in_level,taught_in_level
3,3,taught_in_language,taught_in_language
4,4,has_target_audience,has_target_audience


<a name="2"></a>

## 2. Training TransE Knowledge Graph Embedding


### 2.1 - Data Mapper

---

Recall from previous notebooks that to train the TransE KGE we need firstly to map our standard formatted dataset to a format readable by the TransE code.

In [None]:
! python pathlm/data_mappers/map_dataset.py --data {dataset_name} --model transe

The model selected saves in preprocessed/mapping/ directory
Creating data/coco/preprocessed/mapping/ filesystem
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  triplets_grouped_by_rel.entity_head = triplets_grouped_by_rel.entity_head.map(eid2new_id[PRODUCT])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  triplets_grouped_by_rel.entity_tail = triplets_grouped_by_rel.entity_tail.map(eid2new_id[entity_name])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documenta

### 2.2 Training TransE Embeddings

Recall that most of the methods we are going to train have dependecies with the TransE embeddings (PGPR, CAFE, PLM) we firstly learn the **TransE representation** of our KG by executing the `train_transe_model.py`.

The transE hyperparameter list is reported as follow:
- `--epochs`: number of epochs to train.
- `--batch_size`: batch size.
- `--lr`: learning rate
- `--weight_decay`: weight decay for adam.
- `--l2_lambda`: l2 normalization
- `--max_grad_norm`: clipping gradient
- `--embed_size`: knowledge embedding size.
- `--num_neg_samples`: number of negative samples.

For simplicity, we have already set the values so we can just run the script indicating the `dataset_name`.

⚠️ We have already the **precomputed transE for all datasets**. So you **don't need** to run this command now.

⏲️ Estimate time: 1h with `ML1M`

In [None]:
! python pathlm/models/embeddings/train_transe_model.py --dataset {dataset_name}

The output will be save into `weights/<dataset_name>/embeddings` as a `.pkl` file named `transe_embed.pkl`

In [None]:
! ls weights/{dataset_name}/embeddings

ckpt  transe_embed.pkl


✅ Done! We now have the TransE embedding that will be used by PGPR, CAFE and PLM (in the next notebook).

<a name="3"></a>


## 3. - PGPR train pipeline

---

![](https://drive.google.com/uc?id=1aP6Sg7WBhEVuK-Ln7pqzVqcu4efOK5Ea)


Let's now execute the whole PGPR pipeline with the CoCo dataset

In [98]:
curr_model = 'pgpr'
dataset_name = 'coco'

### 3.1 - Prepare data for PGPR data Loader
---

To run PGPR, as TransE, we need to have the files in a format which is readable by the model.

To do that, let's use the script `map_dataset.py` from the `data_mappers` module takes as **arguments** the following parameters:
- `dataset_name`: One among `{ml1m, lfm1m, cellphones, coco}`
- `model_name`: One among `{transe, pgpr, cafe}`

Let's run this step for TransE which will be trained and later used by PGPR, CAFE and PLM

In [None]:
! python pathlm/data_mappers/map_dataset.py --data {dataset_name} --model {curr_model}

Creating data/coco/preprocessed/pgpr/ filesystem
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  triplets_grouped_by_rel.entity_head = triplets_grouped_by_rel.entity_head.map(eid2new_id[PRODUCT])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  triplets_grouped_by_rel.entity_tail = triplets_grouped_by_rel.entity_tail.map(eid2new_id[entity_name])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/i

Since TransE and PGPR have the same dataloader this process will create identical files to the one created for TransE.

But the output will be saved in `preprocessed/pgpr`

In [None]:
! ls data/{dataset_name}/preprocessed/{curr_model}

audience.txt.gz		   has_target_audience.txt.gz  related_to_concept.txt.gz  tmp
belong_to_category.txt.gz  language.txt.gz	       taught_in_language.txt.gz  train.txt.gz
category.txt.gz		   level.txt.gz		       taught_in_level.txt.gz	  user.txt.gz
concept.txt.gz		   product.txt.gz	       test.txt.gz		  valid.txt.gz


### 3.2 - Run PGPR preprocessing

---

Now, the last thing to do before training the agent is to run the `pathlm/models/rl/pgpr/preprocess.py`.

This PGPR code takes as input the `dataset_name` and uses it to located the files that we have created in the previous step.

Internally, it will use these files to instantiate a class object dataset and a class object KG. These objects are then stored in the `data/<dataset>/preprocessed/pgpr/tmp` folder for loading them later.

In [None]:
! python pathlm/models/rl/pgpr/preprocess.py --dataset {dataset_name}

Load coco dataset from file...
Load product of size 8196
Load level of size 4
Load audience of size 43
Load concept of size 2844
Load language of size 13
Load user of size 24036
Load category of size 146
Load has_target_audience of size 60765
Load taught_in_level of size 8195
Load belong_to_category of size 16392
Load taught_in_language of size 8196
Load related_to_concept of size 11435
Invalid users: 0, invalid items: 0
Load review of size 218805
Create coco knowledge graph from dataset...
^C


In [None]:
! ls data/{dataset_name}/preprocessed/{curr_model}/tmp

dataset.pkl  pgpr_hparams_file.json  train_agent      valid_label.pkl
kg.pkl	     test_label.pkl	     train_label.pkl


### 3.3 - Train Agent
---



We can learn the policy by executing `pathlm/models/rl/pgpr/train_agent.py`.

The train_agent hyperparameter list is reported as follows:
- `--epochs`: Max number of epochs.
- `--batch_size`: Batch size.
- `--lr`: Learning rate.
- `--max_acts`: Max number of actions.
- `--max_path_len`: Max path length.
- `--gamma`: reward discount factor.
- `--ent_weight`: weight factor for entropy loss.
- `--act_dropout`: action dropout rate.
- `--state_history`: state history length.
- `--hidden`: Number of samples.

For simplicity, **we have already set the values** so we can just run the script indicating the `dataset_name`.

⚠️ We already have the **precomputed policy for all datasets**. So you **don't need** to run this command now. The `policy_model_epoch50.pkl` file will be store in `data/{dataset_name}/preprocessed/pgpr/tmp/train_agent`

⏲️ Estimate time: 20m with `ML1M`

In [None]:
! python pathlm/models/rl/pgpr/train_agent.py --dataset {dataset_name}

In [None]:
! ls weights/{dataset_name}/{curr_model}

ckpt


### 3.4 Extract paths from policy
---

Let's **extract the paths** from the previously learned policy $\pi$. We can extract the paths by using the **beam search** running `test_agent.py`.



The `test_agent.py` hyperparameter list is reported as follow:
- `--epochs`: Max number of epochs.
- `--max_acts`: Max number of actions.
- `--max_path_len`: Max path length.
- `--gamma`: reward discount factor.
- `--hidden`: Number of samples.
- `--act_dropout`: Action dropout rate.
- `--state_history`: State history length.
- `--topk`: number of samples for the batch beam search. (e.g. [Z1, Z2, Z3])
- `--run_path`: Generate predicted path? (takes long time)
- `--run_eval`: Run evaluation?

For simplicity, we have already set the values so we can just run the script indicating the `dataset_name`.

⚠️ We already have the **paths extracted from the policy for all datasets**. So you **don't need** to run this command now. The `policy_paths_epoch50.pkl` file will be stored in `weights/{dataset_name}/pgpr/`

⏲️ Estimate time: 25m with `ML1M`

In [None]:
! python pathlm/models/rl/pgpr/test_agent.py --dataset {dataset_name}

Traceback (most recent call last):
  File "/content/drive/MyDrive/ExpRecSys Tutorial Series/2024 ECIR/Hands-On/pathlm/models/rl/pgpr/test_agent.py", line 18, in <module>
    from pathlm.models.rl.PGPR.train_agent import ActorCritic
  File "/usr/local/lib/python3.10/dist-packages/pathlm/models/rl/PGPR/train_agent.py", line 7, in <module>
    from pathlm.models.wadb_utils import MetricsLogger
  File "/usr/local/lib/python3.10/dist-packages/pathlm/models/wadb_utils.py", line 1, in <module>
    import wandb
ModuleNotFoundError: No module named 'wandb'


The predicted paths stored in `policy_paths_epoch50.pkl` have also been sorted base on probability and used to create the top`k` (`k`=10) of items and paths which will be store in `results/dataset/pgpr`.

In [None]:
! ls weights/{dataset_name}/{curr_model}

ckpt  policy_paths_epoch50.pkl


✅ We are done! Let's now evaluate the performance of our model

### 3.5 - Evaluating results

---

The final top`k` of predicted item stored in `results/<dataset>/pgpr/top{k}_items.pkl` as dictionary where each key is a user and its value the k recommended items.

In [99]:
pgpr_results_path = f'results/{dataset_name}/{curr_model}'

In [100]:
with open(f"{pgpr_results_path}/top10_items.pkl", 'rb') as pred_top_items_file:
    pgpr_item_topks = pickle.load(pred_top_items_file)
pred_top_items_file.close()

In [101]:
list(pgpr_item_topks.items())[:5]

[(0, [2122, 159, 2246, 56, 3599, 2196, 2004, 230, 374, 2143]),
 (1, [1673, 1090, 6067, 5207, 28, 5059, 5052, 4817, 5321, 5322]),
 (2, [5322, 1069, 4260, 139, 5319, 5060, 674, 7326, 7700, 2837]),
 (3, [5978, 5322, 2151, 2139, 3317, 1217, 4821, 1561, 4099, 2143]),
 (4, [30, 5052, 3924, 2137, 1850, 5756, 2163, 4825, 2170, 4820])]

Meanwhile the associated explanations are stored in `results/<dataset>/pgpr/top{k}_paths.pkl`

In [102]:
with open(f"{pgpr_results_path}/top10_paths.pkl", 'rb') as pred_top_paths_file:
    pgpr_path_topks = pickle.load(pred_top_paths_file)
pred_top_paths_file.close()

In [103]:
list(pgpr_path_topks.items())[0]

(0,
 [[('self_loop', 'user', 0),
   ('interacted', 'product', 924),
   ('taught_in_level', 'level', 0),
   ('taught_in_level', 'product', 2122)],
  [('self_loop', 'user', 0),
   ('interacted', 'product', 923),
   ('taught_in_level', 'level', 1),
   ('taught_in_level', 'product', 159)],
  [('self_loop', 'user', 0),
   ('interacted', 'product', 924),
   ('has_target_audience', 'audience', 22),
   ('has_target_audience', 'product', 2246)],
  [('self_loop', 'user', 0),
   ('interacted', 'product', 929),
   ('interacted', 'user', 1871),
   ('interacted', 'product', 56)],
  [('self_loop', 'user', 0),
   ('interacted', 'product', 1028),
   ('interacted', 'user', 10843),
   ('interacted', 'product', 3599)],
  [('self_loop', 'user', 0),
   ('interacted', 'product', 1028),
   ('taught_in_language', 'language', 0),
   ('taught_in_language', 'product', 2196)],
  [('self_loop', 'user', 0),
   ('interacted', 'product', 924),
   ('taught_in_language', 'language', 0),
   ('taught_in_language', 'produc

To evaluate the produced topks we can use the `pathlm/evaluation/evaluate_results.py` script.

In [104]:
! python pathlm/evaluation/evaluate_results.py --dataset {dataset_name} --model {curr_model} --k 10

Evaluating rec quality for ['ndcg', 'mrr', 'precision', 'recall', 'serendipity', 'diversity', 'novelty']: 100% 24036/24036 [00:03<00:00, 7897.58it/s]
Number of users: 24036, average topk size: 10.00
ndcg: 0.04, mrr: 0.03, precision: 0.01, recall: 0.01, serendipity: 0.74, diversity: 0.34, novelty: 0.99, coverage: 0.18


### 3.6 - Textual Explanation generation

---

To convert the explanations to plain text, we need to **substitute the entities IDs with their name**. To do that let's load **these mappings** for both models using the `get_local_eid_to_name` function from the `pathlm.datasets.data_utils` module.

*Note: The path predicted by PGPR and CAFE have local ids so we need a mapping for (entity_type, local_id) to name*

In [105]:
from pathlm.datasets.data_utils import get_local_eid_to_name
eid_type2local_eid2name = get_local_eid_to_name(dataset_name)
eid_type2local_eid2name['product'][0]

'social-psychology'

In [106]:
# Path Structure: [('self_loop', 'user', 0), ('watched', 'product', 2433), ('watched', 'user', 932), ('watched', 'product', 1694)],
def template(path):
    path = [piece for tuple in path for piece in tuple]
    if path[0] == "self_loop":
        path = path[1:]

    for i in range(1, len(path)):
        s = str(path[i])
        if s.isnumeric():
            if path[i-1] == 'user': continue
            entity_type = path[i-1]
            path[i] = eid_type2local_eid2name[entity_type][path[i]]
    _, uid, rel_0, e_type_1, e_1, rel_1, e_type_2, e_2, rel_k, _, pid  = path
    if e_type_2 == 'user':
        return f"You may be interested in {pid} because you {rel_0} {e_1} also {rel_k} by another user"
    else:
        return f"You may be interested in {pid} because you {rel_0} {e_1} also {rel_k} by {e_2}"

In [107]:
import collections
pgpr_explanations = defaultdict(list)
random_user = random.randint(0, len(pgpr_path_topks.keys()))
for i, pid_exp_tuple in enumerate(pgpr_path_topks[random_user]):
    exp = pid_exp_tuple
    pid = pid_exp_tuple[-1][-1]
    pgpr_explanations[random_user].append([pid, template(exp)])

In [108]:
pgpr_explanations[random_user]

[[1565,
  'You may be interested in wordpress-for-beginners-course because you interacted fiverr-selling-101-fiverr-for-complete-beginners also interacted by another user'],
 [3592,
  'You may be interested in the-python-bible because you interacted java-for-noobs-beginners also interacted by another user'],
 [54,
  'You may be interested in machlearn1 because you interacted java-for-noobs-go-from-noob-to-semi-noob-noob-coder also taught_in_language by english'],
 [1187,
  'You may be interested in r-basics because you interacted python-for-absolute-beginners-u also interacted by another user'],
 [2004,
  'You may be interested in 2d-game-art-for-non-artists because you interacted youtube-beginners-guide-to-a-successful-channel also interacted by another user'],
 [3930,
  'You may be interested in java-programming-basics because you interacted java-for-noobs-go-from-noob-to-semi-noob-noob-coder also interacted by another user'],
 [3701,
  'You may be interested in learn-c-sharp-program

As you can see many of this explanations derive from user to user relations. We define this as collaborative explanations. We refer to our works related to path explanation quality perspectives [[32, 33, 36]](#p32) which may act as an in-depth if you are interest in these perspectives.

<a name="40"></a>


## 4. - CAFE train pipeline

---


In [109]:
dataset_name = 'coco'
curr_model = 'cafe'

### 4.1 - Prepare data for CAFE data Loader

---

Again like for the other models to run CAFE we need our datasets and KG files in a **format** which is **readable by the model**. To do that, we will use the `map_to_CAFE` function from our `mapper` module.

The script `pathlm/data_mappers/map_dataset.py` takes as **input** the `dataset_name` and the model we want to create the files to. This function performs **both the relation/entity grouping** and the **train-test split** due to internal dependencies of the model.

Internally, it will extract entities from our KG stardard format and create the following files:
- `kg_entities.txt.gz`: Set of all entities. It is represented by triplets (`entity_global_id`, `entity_local_id`, `entity_name`). The entitiy name will be user **later for textual explanation generation**.
- `kg_relations.txt.gz`: Set of relations. It includes **implicitly the reverse** relation (e.g. watched: being_watched, starring: starred).
- `kg_rules.txt.gz`: Defines the metapath rules.
- `kg_triples.txt.gz`: Set of KG triplets. Differently from PGPR also the **reverse triplets** for example (**movie**, *starred_by*, **actor**) will have (**actor**, *starring*, **movie**). It includes also the user interaction triplets like (user, watched, movie).
- `train.txt.gz`: Train set, where every row is composed by a `user_id` and the list of products interacted by that user.
- `test.txt.gz`: Test set, where every row is composed by a `user_id` and the list of products interacted by that user.

Let's execute the `pathlm/data_mappers/map_dataset.py` giving it as parameter the current dataset and the cafe model

In [None]:
! python pathlm/data_mappers/map_dataset.py --data {dataset_name} --model {curr_model}

Creating data/coco/preprocessed/cafe/ filesystem
Mapping to CAFE...
Getting splits...
Writing split CAFE...
Writing UID and PID mappings...


To see the created file, let's do a `ls` command in the destination folder `data/ml1m/preprocessed/cafe`.

In [None]:
%ls data/{dataset_name}/preprocessed/cafe

kg_entities.txt.gz   kg_rules.txt.gz    test.txt.gz  train.txt.gz
kg_relations.txt.gz  kg_triples.txt.gz  [0m[01;34mtmp[0m/         valid.txt.gz


### 4.2 - Run CAFE preprocessing

---

Now, the first thing to do before training the agent is to run the `preprocessing.py`.

This CAFE code takes as input the `dataset_name` and uses it to locate the files that we have created in the previous step.

Internally, it will use these files to instantiate a class object for the dataset, the KG and the embeddings. In addition it will also perform the user profiles composition. These objects are then stored in the `data/<dataset>/preprocessed/cafe/tmp` folder for loading them later.


⚠️ We have already the **preprocessed CAFE files**. So you **don't need** to run this command now. The files will be store in `data/{dataset_name}/preprocessed/cafe/tmp/`

⏲️ Estimate time: 1h with `ML1M`

In [None]:
! python pathlm/models/rl/CAFE/preprocess.py --dataset {dataset_name}

>>> Load KG embeddings ...
Load embedding: ./weights/ml1m/embeddings
>>> user: (6041, 100)
>>> product: (2985, 100)
>>> actor: (2412, 100)
>>> director: (602, 100)
>>> prodcompany: (306, 100)
>>> editor: (221, 100)
>>> writter: (666, 100)
>>> cinematographer: (237, 100)
>>> composer: (296, 100)
>>> country: (33, 100)
>>> category: (1822, 100)
>>> producer: (670, 100)
>>> wikipage: (10840, 100)
File is saved to "/content/drive/MyDrive/ExpRecSys Tutorial Series/2024 ECIR/Hands-On/data/ml1m/preprocessed/cafe/tmp/embed.pkl".
>>> 19844 entities are loaded.
>>> 24 relations are loaded.
>>> Discarted 0 triplets
 1525736 triples are loaded (including reverse edges).
>>> 12 rules are loaded.
[[(None, 'user'), ('watched', 'product'), ('rev_watched', 'user'), ('watched', 'product')], [(None, 'user'), ('watched', 'product'), ('cinematography_by_cinematographer', 'cinematographer'), ('rev_cinematography_by_cinematographer', 'product')], [(None, 'user'), ('watched', 'product'), ('produced_by_prodcom

In [None]:
%ls data/{dataset_name}/preprocessed/cafe/tmp

cafe_hparams_file.json  embed.pkl  kg.pkl  path_count.pkl  user_products_pos.npy


✅ Preprocessing done! We are ready for train the **Neural Symbolic Network**.

### 4.3 - Train Neural Symbol

We can learn the policy by executing `train_neural_symbolic.py`.

The train_agent hyperparameter list is reported as follow:
- `--epochs`: Max number of epochs.
- `--batch_size`: Batch size.
- `--lr`: Learning rate.
- `--deep_module`: Use deep module or not.
- `--embed_size`: KG embedding size.
- `use_dropout`: use dropout or not.
- `rank_weight`: weighting factor for ranking loss.
- `topk_candidates`: weighting factor for ranking loss.

For simplicity **we have already set the values** so we can just run the script indicating the `dataset_name`.

⚠️ We have already execute the **preprocessing all datasets**. So you **don't need** to run this command now. The `symbolic_model_epoch20.ckpt` file will be store in `weights/{dataset_name}/cafe/tmp/neural_symbolic_model`

⏲️ Estimate time: 20m with `ML1M`

In [None]:
! python pathlm/models/rl/CAFE/train_neural_symbol.py --dataset {dataset_name}

In [None]:
! ls data/{dataset_name}/preprocessed/cafe/tmp/

embed.pkl  kg.pkl  neural_symbolic_model  path_count.pkl  user_products_pos.npy


### 4.4 - Extract paths

---

We can extract the paths from the policy running `execute_neural_symbol.py`.

The test_agent hyperparameter list is reported as follow:
- `--sample_size`: sample size for model.
- `--do_infer`: Whether to infer paths after training.
- `--do_execute`: Whether to execute neural programs.

For simplicity we have **already set the values** so we can just run the script indicating the `dataset_name`.

⚠️ We have already the **precomputed policy for all datasets**. So you **don't need** to run this command now. The `infer_path_data.pkl` file will be stored in `weights/{dataset_name}/cafe`

⏲️ Estimate time: 25m with `ML1M`

In [None]:
! python pathlm/models/rl/CAFE/execute_neural_symbol.py --dataset {dataset_name} --do_infer True --do_execute False

Again the predicted paths stored in `infer_path_data.pkl` have been sorted base on probability and used to create the top`k` (`k`=10) of items and paths which will be store in `results/dataset/cafe`.

In [None]:
! ls weights/{dataset_name}/cafe

ckpt  program_exe_heuristic_ss50.txt


✅ We are done! Let's now evaluate the performance of our model

### 4.5 - Evaluating Results

---

In [110]:
cafe_results_path = f"results/{dataset_name}/{curr_model}"

This paths are sorted by path score to produce the final topk of predicted item stored in `results/<dataset>/pgpr/top{k}_items.pkl`

In [111]:
with open(f"{cafe_results_path}/top10_items.pkl", 'rb') as pred_top_items_file:
    cafe_item_topks = pickle.load(pred_top_items_file)
pred_top_items_file.close()

In [112]:
list(cafe_item_topks.items())[:5]

[(0, [2120, 4820, 2147, 2143, 2129, 2121, 2126, 2837, 3300, 2152]),
 (1, [2120, 2143, 2129, 2121, 2126, 2147, 3963, 4820, 2837, 3322]),
 (2, [2120, 2143, 2129, 2121, 2126, 4820, 2147, 2837, 3300, 2152]),
 (3, [2120, 2143, 2129, 2121, 2126, 4820, 3300, 2837, 3963, 2944]),
 (4, [2120, 2143, 2129, 2121, 2126, 3963, 2837, 2147, 4099, 4820])]

And the associated explanations store in `results/<dataset>/pgpr/top{k}_paths.pkl`

In [113]:
with open(f"{cafe_results_path}/top10_paths.pkl", 'rb') as pred_top_paths_file:
    cafe_path_topks = pickle.load(pred_top_paths_file)
pred_top_paths_file.close()

In [114]:
list(cafe_path_topks.items())[0]

(0,
 [(-11.041087,
   -1.5421276,
   [('self_loop', 'user', 0),
    ('interacted', 'product', 924),
    ('taught_in_level', 'level', 0),
    ('rev_taught_in_level', 'product', 2120)]),
  (-11.103316,
   -2.4428632,
   [('self_loop', 'user', 0),
    ('interacted', 'product', 923),
    ('taught_in_language', 'language', 0),
    ('rev_taught_in_language', 'product', 4820)]),
  (-11.1446905,
   -2.4842374,
   [('self_loop', 'user', 0),
    ('interacted', 'product', 923),
    ('taught_in_language', 'language', 0),
    ('rev_taught_in_language', 'product', 2147)]),
  (-11.254278,
   -1.7553186,
   [('self_loop', 'user', 0),
    ('interacted', 'product', 924),
    ('taught_in_level', 'level', 0),
    ('rev_taught_in_level', 'product', 2143)]),
  (-11.566606,
   -2.067646,
   [('self_loop', 'user', 0),
    ('interacted', 'product', 924),
    ('taught_in_level', 'level', 0),
    ('rev_taught_in_level', 'product', 2129)]),
  (-11.6139765,
   -2.115017,
   [('self_loop', 'user', 0),
    ('interac

In [115]:
! python pathlm/evaluation/evaluate_results.py --dataset {dataset_name} --model {curr_model} --k 10

Evaluating rec quality for ['ndcg', 'mrr', 'precision', 'recall', 'serendipity', 'diversity', 'novelty']: 100% 24036/24036 [00:01<00:00, 12366.99it/s]
Number of users: 24036, average topk size: 10.00
ndcg: 0.04, mrr: 0.03, precision: 0.01, recall: 0.01, serendipity: 0.21, diversity: 0.18, novelty: 0.99, coverage: 0.01


### 4.6 - Textual Explanation generation

---


Let's convert the explanation paths to textual explanations! Let's use the previously defined `template` function and `eid2name` map.

In [116]:
import collections
cafe_explanations = defaultdict(list)
random_user = random.randint(0, len(cafe_path_topks.keys()))
for i, pid_exp_tuple in enumerate(cafe_path_topks[random_user]):
    prob, score, exp = pid_exp_tuple
    cafe_explanations[random_user].append([exp[-1][-1], template(exp)])

In [117]:
cafe_explanations[random_user]

[[2120,
  'You may be interested in understand-javascript because you interacted goal-setting also rev_taught_in_language by english'],
 [2143,
  'You may be interested in the-web-developer-bootcamp because you interacted goal-setting also rev_taught_in_language by english'],
 [2129,
  'You may be interested in react-redux because you interacted goal-setting also rev_taught_in_language by english'],
 [2121,
  'You may be interested in learn-angularjs because you interacted goal-setting also rev_taught_in_language by english'],
 [2126,
  'You may be interested in understand-nodejs because you interacted goal-setting also rev_taught_in_language by english'],
 [2837,
  'You may be interested in web-design-secrets because you interacted goal-setting also rev_has_target_audience by Teaching professionals'],
 [2152,
  'You may be interested in the-complete-guide-to-angular-2 because you interacted goal-setting also rev_has_target_audience by Teaching professionals'],
 [3963,
  'You may be in

You can notice that in this example and as evidenced by previous studies [] CAFE produces more diverse explanations in term of type of relations used in a topk

✅ We are done with this notebook!

<a name="5"></a>


## 5. - Causual Language Modeling for Path Reasoning

---

### 5.1 - Path sampling

---

To use **Causal Language Models (CLM) for path reasoning** the first step is to sample **user-centric paths** from the knowledge graph. This paths will be tokenized and used as training sequences by our CLM based path reasoning models.

The sampled paths will start from a user, connect him to a seen product through its interaction in the train and bring to another seen products. This will allow the data to have patterns between the items interacted by each user.

To perform the sampling we can employ our `create_dataset.sh` script giving as positional parameters:
1. `dataset`: the dataset we want to sample for `{ml1m, lfm1m}`
2. `sample_size`: represent the amount of paths sampled for each user
3. `n_hop`: represent the fixed hop size for the paths sampled
4. `n_proc`: number of processors to employ for multiprocessing operations


In [118]:
sample_size = 50
n_hop = 3
n_proc = 2

⚠️ We already have the **sampled paths for all datasets**. So you **don't need** to run this command now. The `paths_end-to-end_250_3.txt` file will be store in `data/<dataset>/path_random_walk/`

⏲️ Estimate time: 20m with `ML1M`

In [None]:
! bash create_dataset.sh {dataset_name} {sample_size} {n_hop} {n_proc}

This code will create the dataset into `data/<dataset>/path_random_walk/paths_end-to-end_<sample_size>_<n_hop>.txt`

In [None]:
! ls data/{dataset_name}/paths_random_walk

paths_end-to-end_50_3.txt


Let's see how the paths look like

In [None]:
! head -10 data/{dataset_name}/paths_random_walk/paths_end-to-end_{sample_size}_{n_hop}.txt

U11786 R-1 P3591 R3 E11190 R3 P3589
U5807 R-1 P1898 R2 E11186 R2 P1785
U16015 R-1 P2190 R3 E11190 R3 P2164
U3359 R-1 P5194 R3 E11190 R3 P6899
U17286 R-1 P2877 R2 E11187 R2 P5105
U23352 R-1 P4125 R0 E8240 R0 P4193
U8334 R-1 P646 R4 E11207 R4 P2837
U18381 R-1 P2161 R2 E11187 R2 P4441
U23097 R-1 P7517 R-1 U24009 R-1 P7168
U21243 R-1 P5727 R4 E11208 R4 P3591


### 5.2 - Tokenized datasets creation

---

Training PLM or PEARLM requires the learning of a Whitespace tokenizer that possess as vocubary all the entities and relations token. Additionally we need to tokenize the sampled path using this learned tokenizer.

To do this we will execute `pathlm/models/lm/tokenize_dataset.py`. This will create our tokenzier that will be stored in `tokenizers/<dataset_name>/WordLevel.json` and our tokenized dataset as hugginface dataset object in `data/<dataset_name>/WordLevel/end-to-end_{sample_size}_{n_hop}_tokenized_dataset.hf[link text](https://)`

In [None]:
! python pathlm/models/lm/tokenize_dataset.py --data {dataset_name} --sample_size {sample_size} --n_hop {n_hop} --nproc {n_proc}

<a name="6"></a>


## 6. - PLM train pipeline

---

![](https://drive.google.com/uc?id=1w_GwOaNPNsITfSTEX-LgkiGiQuqTsE88)


### 6.1 Train PLM

---

The  hyperparameter list is reported as follow:
- `--num_epochs`: Max number of epochs.
- `--model`: The base huggingface model from where eredit the architecture one from `{distilgpt2, gpt2, gpt2-large}`
- `--batch_size`: Batch size.
- `--sample_size`: Dataset sample size (to dermine which dataset to use)
- `--n_hop`: Dataset hop size (to dermine which dataset to use)
- `--logit_processor_type`: Decoding strategy `gcd` PEARLM, empty for PLM
- `--n_seq_infer`: Number of sequences generated for each user should be `> k`

⚠️ We have already the **precomputed PLM for all datasets**. So you **don't need** to run this command now.

In [None]:
! python pathlm/models/lm/plm_main.py --data {dataset_name} --sample_size {sample_size} --n_hop {n_hop} --nproc {n_proc}

Again like with previous models the train will save during evaluation the topks and topks paths in `results` and the final model in `weights` under the name of `end-to-end@<dataset_name>@plm-rec@<model>@<sample_size>@<n_hop>@<logit_processor_type>`

### 6.2 Evaluate PLM

---

Let's now load the paths from `results/`

In [119]:
model_base = 'distilgpt2'
logit_constraint = ''
curr_model = f'end-to-end@{dataset_name}@plm-rec@{model_base}@{sample_size}@{n_hop}@{logit_constraint}'
plm_results_path = f"results/{dataset_name}/{curr_model}"

This paths are sorted by path score to produce the final topk of predicted item stored in `results/<dataset>/<curr_model>/top{k}_items.pkl`

In [120]:
with open(f"{plm_results_path}/top10_items.pkl", 'rb') as pred_top_items_file:
    plm_item_topks = pickle.load(pred_top_items_file)
pred_top_items_file.close()

In [121]:
list(plm_item_topks.items())[:5]

[(43, [223, 8023, 6, 938, 7223, 8055, 7700, 1766, 7380, 905]),
 (32, [1221, 890, 6115, 1218, 7565, 6183, 6118, 5978, 5980, 6188]),
 (46, [101, 7219, 95, 5, 159, 292, 4730, 7118, 105, 7700]),
 (10, [833, 181, 8167, 295, 827, 1455, 217, 8171, 791, 156]),
 (27, [7149, 7213, 7112, 7219, 7143, 7118, 7595, 7591, 7592, 7598])]

And the associated explanations store in `results/<dataset>/<curr_model>/top{k}_paths.pkl`

In [122]:
with open(f"{plm_results_path}/top10_paths.pkl", 'rb') as pred_top_paths_file:
    plm_path_topks = pickle.load(pred_top_paths_file)
pred_top_paths_file.close()

In [123]:
list(plm_path_topks.items())[0]

(43,
 [['[BOS]', 'U43', 'R-1', 'P6339', 'R3', 'E11190', 'R2', 'P223'],
  ['[BOS]', 'U43', 'R-1', 'P7619', 'R4', 'E11233', 'R4', 'P8023'],
  ['[BOS]', 'U43', 'R-1', 'P7619', 'R4', 'E11233', 'R1', 'P6'],
  ['[BOS]', 'U43', 'R-1', 'P6339', 'R3', 'E11190', 'R2', 'P938'],
  ['[BOS]', 'U43', 'R-1', 'P6339', 'R3', 'E11190', 'R2', 'P7223'],
  ['[BOS]', 'U43', 'R-1', 'P5923', 'R1', 'E8351', 'R3', 'P8055'],
  ['[BOS]', 'U43', 'R-1', 'P6339', 'R3', 'E11190', 'R2', 'P7700'],
  ['[BOS]', 'U43', 'R-1', 'P1779', 'R4', 'E11233', 'R3', 'P1766'],
  ['[BOS]', 'U43', 'R-1', 'P6339', 'R3', 'E11190', 'R2', 'P7380'],
  ['[BOS]', 'U43', 'R-1', 'P6339', 'R3', 'E11190', 'R2', 'P905']])

In [124]:
command = f'python pathlm/evaluation/evaluate_results.py --dataset {dataset_name} --model plm-rec@{model_base} --k 10 --sample_size 50'
!$command

Evaluating rec quality for ['ndcg', 'mrr', 'precision', 'recall', 'serendipity', 'diversity', 'novelty']: 100% 18750/18750 [00:01<00:00, 9438.36it/s] 
Number of users: 24036, average topk size: 9.88
ndcg: 0.08, mrr: 0.05, precision: 0.02, recall: 0.02, serendipity: 0.88, diversity: 0.25, novelty: 0.98, coverage: 0.25


### 6.3 Textual Explanation generation

---

As done before with PGPR and CAFE to convert the explanation path to textual we need to remap the entities and relations to their names. To do so let's import `get_eid_to_name` and `get_rid_to_name` from `pathlm.datasets.data_utils`.

In [125]:
from pathlm.datasets.data_utils import get_eid_to_name
eid2name = get_eid_to_name(dataset_name)
eid2name['0']

'social-psychology'

In [126]:
from pathlm.datasets.data_utils import get_rid_to_name
rid2name = get_rid_to_name(dataset_name)
rid2name['0']

'belong_to_category'

Let's now create the template function to handle the paths in the form `['[BOS]', 'U15', 'R-1', 'P1113', 'R5', 'E7174', 'R0', 'P1767']`

In [127]:
def template(path):
    if path[0] == "[BOS]":
        path = path[1:]
    for i in range(len(path)):
        s = str(path[i])[1:]
        if i % 2 == 0: #Entity
            if s not in eid2name: # It has predicted a user
                path[i] = f'U{s}'
            else:
                path[i] = eid2name[s]
        else: #Relation
            if s == "-1":
                path[i] = 'interacted'
                continue
            path[i] = rid2name[s]
    u, r, pi, r1, e1, r2, rp  = path
    if e1 == 'user':
        return f"You may be interested in {rp} because you {r} {pi} also {r2} by another user"
    else:
        return f"You may be interested in {rp} because you {r} {pi} also {r2} by {e1}"

Let's convert the explanation paths for a random user to textual explanation

In [128]:
import collections
plm_textual_exps = collections.defaultdict(list)
random_user = random.randint(0, len(plm_path_topks.keys()))
for i, pid_exp_tuple in enumerate(plm_path_topks[random_user]):
    exp = pid_exp_tuple
    pid = pid_exp_tuple[-1][-1]
    plm_textual_exps[random_user].append([pid, template(exp)])

In [129]:
plm_textual_exps[random_user]

[['1',
  'You may be interested in bootstrap-to-wordpress because you interacted web-design-secrets also interacted by Landing page'],
 ['4',
  'You may be interested in master-web-design-in-photoshop because you interacted web-design-secrets also taught_in_language by Landing page'],
 ['8',
  'You may be interested in logo-design-fundamentals because you interacted master-web-design-in-photoshop also related_to_concept by design'],
 ['4',
  'You may be interested in introduction-to-graphic-design because you interacted master-web-design-in-photoshop also related_to_concept by design'],
 ['3',
  'You may be interested in the-web-developer-bootcamp because you interacted design-and-develop-a-killer-website-with-html5-and-css3 also taught_in_level by Health associate professionals'],
 ['7',
  'You may be interested in ux-web-design-master-course-strategy-design-development because you interacted design-and-develop-a-killer-website-with-html5-and-css3 also belong_to_category by Health ass

Recall that:

> Hallucinations can arise within an explanation when a model incorrectly establishes **incoherent semantic relations** between entities in the KG, e.g., when user-item connections extend beyond mere interaction relations, which would constitute the sole viable option in the KG.

**Incorrect semantics** may lead to provide explanations connecting the two by a "starred by" relation, which is coherent only between an actor and a movie item as for the KG. **Incoherence** can also manifest between entities that are semantically linked in the real world but **do not have such corresponding** relationships in the KG (e.g., entity "Johnny Depp", linked by the relation "starred in", to the item "interstellar", which does not exist in the underlying KG).

Such inaccuracies in explanations compromise the fundamental rationale for utilizing a KG, as the factual truths presented in the explanations become misaligned and incoherent with the underlying KG.

To measure the extend of these phenomena let's calculate what is the rate of corrupted paths among the predicted ones. To do so let's use `` function from `` module

In [130]:
command = f'python pathlm/models/lm/assess_faithfulness.py --dataset {dataset_name} --model plm-rec@{model_base} --k 10 --sample_size {sample_size}'
!$command

Inputing a PLM
Loading KG
Load user of size 24036
Load category of size 146
Load audience of size 43
Load level of size 4
Load concept of size 2844
Load language of size 13
Load product of size 8196
Load taught_in_level of size 8195
Load taught_in_language of size 8196
Load belong_to_category of size 16392
Load related_to_concept of size 11435
Load has_target_audience of size 60765
Invalid users: 0, invalid items: 0
Load review of size 218805
Loading from  data/coco/preprocessed  the dataset  coco
(104983, 3)
(104983, 3)
{0: 'belong_to_category', 1: 'related_to_concept', 2: 'taught_in_level', 3: 'taught_in_language', 4: 'has_target_audience', -1: 'interacted'}
Creating augmented kg
Created augmented kg
Creating token index
Created token index
Check faithfulness
2024-03-23 18:51:39.888365: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-23 18:51:

<a name="7"></a>

## 7. PEARLM train pipeline
---


![](https://drive.google.com/uc?id=1leZB7KU-KgtnZPga2j8EXimktkNFAk1K)


   

### 7.1 Train PEARLM

---

The  hyperparameter list is reported as follow:
- `--num_epochs`: Max number of epochs.
- `--model`: The base huggingface model from where eredit the architecture one from `{distilgpt2, gpt2, gpt2-large}`
- `--batch_size`: Batch size.
- `--sample_size`: Dataset sample size (to dermine which dataset to use)
- `--n_hop`: Dataset hop size (to dermine which dataset to use)
- `--logit_processor_type`: Decoding strategy `gcd PEARLM, empty for PLM
- `--n_seq_infer`: Number of sequences generated for each user should be `> k`

⚠️ We have already the **precomputed PEARLM for all datasets**. So you **don't need** to run this command now. The hugginface model will be stored into `end-to-end@<dataset_name>@<model>@<sample_size>@<n_hop>@<logit_processor_type>`.

⏲️

In [None]:
! python pathlm/models/lm/pearlm_main.py --data {dataset_name} --sample_size {sample_size} --n_hop {n_hop} --nproc {n_proc}

### 7.2 Evaluate PEARLM

---

In [131]:
model_base = 'distilgpt2'
logit_constraint = 'gcd'
curr_model = f'end-to-end@{dataset_name}@{model_base}@{sample_size}@{n_hop}@{logit_constraint}'
pearlm_results_path = f"results/{dataset_name}/{curr_model}"

This paths are sorted by path score to produce the final topk of predicted item stored in `results/<dataset>/<curr_model>/top{k}_items.pkl`

In [132]:
with open(f"{pearlm_results_path}/top10_items.pkl", 'rb') as pred_top_items_file:
    pearlm_item_topks = pickle.load(pred_top_items_file)
pred_top_items_file.close()

In [133]:
list(pearlm_item_topks.items())[:5]

[(2, [7205, 7703, 7680, 7304, 4807, 4688, 7266, 7432, 14, 11]),
 (11, [370, 7673, 2693, 7430, 6460, 938, 2719, 833, 7691, 7706]),
 (14, [5, 108, 7572, 4721, 7772, 7153, 7266, 0, 5911, 4728]),
 (22, [6486, 2721, 6204, 5981, 7140, 6108, 6235, 6209, 7458, 6227]),
 (25, [7676, 4724, 7124, 7128, 7700, 7673, 7213, 249, 7120, 5])]

And the associated explanations store in `results/<dataset>/<curr_model>/top{k}_paths.pkl`

In [134]:
with open(f"{pearlm_results_path}/top10_paths.pkl", 'rb') as pred_top_paths_file:
    pearlm_path_topks = pickle.load(pred_top_paths_file)
pred_top_paths_file.close()

In [135]:
list(pearlm_path_topks.items())[0]

(2,
 [['[BOS]', 'U2', 'R-1', 'P5', 'R2', 'E11186', 'R2', 'P7205'],
  ['[BOS]', 'U2', 'R-1', 'P5', 'R2', 'E11186', 'R2', 'P7703'],
  ['[BOS]', 'U2', 'R-1', 'P7581', 'R0', 'E8303', 'R0', 'P7680'],
  ['[BOS]', 'U2', 'R-1', 'P6', 'R2', 'E11187', 'R2', 'P7304'],
  ['[BOS]', 'U2', 'R-1', 'P6', 'R4', 'E11214', 'R4', 'P4807'],
  ['[BOS]', 'U2', 'R-1', 'P6', 'R2', 'E11187', 'R2', 'P4688'],
  ['[BOS]', 'U2', 'R-1', 'P6', 'R2', 'E11187', 'R2', 'P7266'],
  ['[BOS]', 'U2', 'R-1', 'P6', 'R2', 'E11187', 'R2', 'P7432'],
  ['[BOS]', 'U2', 'R-1', 'P6', 'R2', 'E11187', 'R2', 'P14'],
  ['[BOS]', 'U2', 'R-1', 'P8', 'R1', 'E8342', 'R1', 'P11']])

In [136]:
command = f'python pathlm/evaluation/evaluate_results.py --dataset {dataset_name} --model {curr_model} --k 10'
!$command

Evaluating rec quality for ['ndcg', 'mrr', 'precision', 'recall', 'serendipity', 'diversity', 'novelty']: 100% 18750/18750 [00:01<00:00, 10878.06it/s]
Number of users: 24036, average topk size: 9.96
ndcg: 0.34, mrr: 0.28, precision: 0.07, recall: 0.07, serendipity: 0.93, diversity: 0.28, novelty: 0.98, coverage: 0.84


### 7.3 Textual explanation generation

---

Let's convert some PEARLM predicted explanation paths to textual explanations. To do so we will leverage the previously defined `template` function and the dictionaries `eid2name` and `rid2name`.

In [137]:
import collections
pearlm_textual_exps = collections.defaultdict(list)
random.seed(2024)
random_user = random.randint(0, len(pearlm_path_topks.keys()))
for i, pid_exp_tuple in enumerate(pearlm_path_topks[random_user]):
    exp = pid_exp_tuple
    pid = pid_exp_tuple[-1][-1]
    pearlm_textual_exps[random_user].append([pid, template(exp)])

In [138]:
pearlm_textual_exps[random_user]

[['0',
  'You may be interested in learn-angular-2-from-beginner-to-advanced because you interacted mongoosejs-essentials also taught_in_language by english'],
 ['5',
  'You may be interested in getting-started-with-typescript because you interacted mongodb-essentials also taught_in_level by beginner'],
 ['1',
  'You may be interested in how-to-use-javascript-objects-json-ajax-explained because you interacted mongoosejs-essentials also taught_in_language by english'],
 ['4',
  'You may be interested in javascript-from-beginner-to-pro-best-course because you interacted mongoosejs-essentials also taught_in_language by english'],
 ['8',
  'You may be interested in node-js-training-and-fundamentals because you interacted javascript-based-website-in-minutes-using-the-mean-stack also taught_in_language by english'],
 ['3',
  'You may be interested in ionic-2-the-practical-guide-to-building-ios-android-apps because you interacted javascript-based-website-in-minutes-using-the-mean-stack also h

In [139]:
command = f'python pathlm/models/lm/assess_faithfulness.py --dataset {dataset_name} --model {model_base} --k 10 --sample_size 50'
!$command

Loading KG
Load user of size 24036
Load category of size 146
Load audience of size 43
Load level of size 4
Load concept of size 2844
Load language of size 13
Load product of size 8196
Load taught_in_level of size 8195
Load taught_in_language of size 8196
Load belong_to_category of size 16392
Load related_to_concept of size 11435
Load has_target_audience of size 60765
Invalid users: 0, invalid items: 0
Load review of size 218805
Loading from  data/coco/preprocessed  the dataset  coco
(104983, 3)
(104983, 3)
{0: 'belong_to_category', 1: 'related_to_concept', 2: 'taught_in_level', 3: 'taught_in_language', 4: 'has_target_audience', -1: 'interacted'}
Creating augmented kg
Created augmented kg
Creating token index
Created token index
Check faithfulness
Average rate of valid sequences per user: 1.0
0/187500 corrupted sequences, specifically at position Counter()
Traceback (most recent call last):
  File "/content/drive/MyDrive/ExpRecSys Tutorial Series/2024 ECIR/Hands-On/pathlm/models/lm/asse

## References

<a name="p1">[1]</a> F. Maxwell Harper, Joseph A. Konstan:
The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5(4): 19:1-19:19 (2016)

<a name="p2">[2]</a> Markus Schedl: The LFM-1b Dataset for Music Retrieval and Recommendation. ICMR 2016: 103-110

<a name="p3">[3]</a> Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, Zachary G. Ives:
DBpedia: A Nucleus for a Web of Open Data. ISWC/ASWC 2007: 722-735

<a name="p4">[4]</a> Denny Vrandecic, Markus Krötzsch:
Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10): 78-85 (2014)

<a name="p5">[5]</a> Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, Tat-Seng Chua:
Unifying Knowledge Graph Learning and Recommendation: Towards a Better Understanding of User Preferences. WWW 2019: 151-161


<a name="p6">[6]</a> Qingyao Ai, Vahid Azizi, Xu Chen, Yongfeng Zhang:
Learning Heterogeneous Knowledge Base Embeddings for Explainable Recommendation. Algorithms 11(9): 137 (2018)

<a name="p7">[7]</a> Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, Jamie Taylor:
Freebase: a collaboratively created graph database for structuring human knowledge. SIGMOD Conference 2008: 1247-1250

<a name="p8">[8]</a> Wayne Xin Zhao, Gaole He, Kunlin Yang, Hongjian Dou, Jin Huang, Siqi Ouyang, Ji-Rong Wen:
KB4Rec: A Data Set for Linking Knowledge Bases with Recommender Systems. Data Intell. 1(2): 121-136 (2019)

<a name="p9">[9]</a> Yongfeng Zhang, Qingyao Ai, Xu Chen, W. Bruce Croft:
Joint Representation Learning for Top-N Recommendation with Heterogeneous Information Sources. CIKM 2017: 1449-1458

<a name="p10">[10]</a> Yikun Xian, Zuohui Fu, S. Muthukrishnan, Gerard de Melo, Yongfeng Zhang:
Reinforcement Knowledge Graph Reasoning for Explainable Recommendation. SIGIR 2019: 285-294

<a name="p11">[11]</a> Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, Oksana Yakhnenko:
Translating Embeddings for Modeling Multi-relational Data. NIPS 2013: 2787-2795

<a name="p12">[12]</a> Yikun Xian, Zuohui Fu, Handong Zhao, Yingqiang Ge, Xu Chen, Qiaoying Huang, Shijie Geng, Zhou Qin, Gerard de Melo, S. Muthukrishnan, Yongfeng Zhang:
CAFE: Coarse-to-Fine Neural Symbolic Reasoning for Explainable Recommendation. CIKM 2020: 1645-1654

<a name="p13">[13]</a> Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, Chi Xu:
Recurrent knowledge graph embedding for effective recommendation. RecSys 2018: 297-305

<a name="p14">[14]</a> Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, Minyi Guo:
Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation. CoRR abs/1901.08907 (2019)

<a name="p15">[15]</a> Xiang Wang, Tinglin Huang, Dingxian Wang, Yancheng Yuan, Zhenguang Liu, Xiangnan He, Tat-Seng Chua:
Learning Intents behind Interactions with Knowledge Graph for Recommendation. WWW 2021: 878-887

<a name="p16">[16]</a> Song, Weiping, Zhijian Duan, Ziqing Yang, Hao Zhu, Ming Zhang, and Jian Tang. "Explainable knowledge graph-based recommendation via deep reinforcement learning." arXiv preprint arXiv:1906.09506 (2019).

<a name="p17">[17]</a>	Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, Minyi Guo:
RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems. CIKM 2018: 417-426

<a name="p18">[18]</a> Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, Tat-Seng Chua:
Explainable Reasoning over Knowledge Graphs for Recommendation. AAAI 2019: 5329-5336

<a name="p19">[19]</a> Binbin Hu, Chuan Shi, Wayne Xin Zhao, Philip S. Yu: Leveraging Meta-path based Context for Top- N Recommendation with A Neural Co-Attention Model. KDD 2018: 1531-1540

<a name="p20">[20]</a>
Chuan Shi, Binbin Hu, Wayne Xin Zhao, Philip S. Yu:
Heterogeneous Information Network Embedding for Recommendation. CoRR abs/1711.10730 (2017)

<a name="p21">[21]</a> Xiaowen Huang, Quan Fang, Shengsheng Qian, Jitao Sang, Yan Li, Changsheng Xu:
Explainable Interaction-driven User Modeling over Knowledge Graph for Sequential Recommendation. ACM Multimedia 2019: 548-556

<a name="p22">[22]</a> Song, Weiping, et al. "Explainable knowledge graph-based recommendation via deep reinforcement learning." arXiv preprint arXiv:1906.09506 (2019).

<a name="p23">[23]</a> Chang-You Tai, Liang-Ying Huang, Chien-Kun Huang, Lun-Wei Ku:
User-Centric Path Reasoning towards Explainable Recommendation. SIGIR 2021: 879-889

<a name="p24">[24]</a> Xiting Wang, Kunpeng Liu, Dongjie Wang, Le Wu, Yanjie Fu, Xing Xie:
Multi-level Recommendation Reasoning over Knowledge Graphs with Reinforcement Learning. WWW 2022: 2098-2108

<a name="p25">[25]</a> Danyang Liu, Jianxun Lian, Zheng Liu, Xiting Wang, Guangzhong Sun, Xing Xie:
Reinforced Anchor Knowledge Graph Generation for News Recommendation Reasoning. KDD 2021: 1055-1065

<a name="p26">[26]</a> Zhen Wang, Jianwen Zhang, Jianlin Feng, Zheng Chen:
Knowledge Graph Embedding by Translating on Hyperplanes. AAAI 2014: 1112-

<a name="p27">[27]</a> Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, Jian Tang:
RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. ICLR (Poster) 2019

<a name="p28">[28]</a>  Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, Xuan Zhu:
Learning Entity and Relation Embeddings for Knowledge Graph Completion. AAAI 2015: 2181-2187

<a name="p29">[29]</a>  Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel:
Convolutional 2D Knowledge Graph Embeddings. AAAI 2018: 1811-1818

<a name="p30">[30]</a> Ni Lao, Tom M. Mitchell, William W. Cohen:
Random Walk Inference and Learning in A Large Scale Knowledge Base. EMNLP 2011: 529-539


<a name="p31">[31]</a> Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Tie-Yan Liu:
A Theoretical Analysis of NDCG Type Ranking Measures. COLT 2013: 25-54

<a name="p32">[32]</a> Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, and Mirko Marras. 2022. Post Processing Recommender Systems with Knowledge Graphs for Recency, Popularity, and Diversity of Explanations. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22). Association for Computing Machinery, New York, NY, USA, 646–656. https://doi.org/10.1145/3477495.3532041

<a name="p33">[33]</a> Balloccu G, Boratto L, Fenu G, Marras M. Reinforcement recommendation reasoning through knowledge graphs for explanation path quality. Knowledge-Based Systems. 2023 Jan 25;260:110098.

<a name="p34">[34]</a> Dessì D, Fenu G, Marras M, Reforgiato Recupero D. Coco: Semantic-enriched collection of online courses at scale with experimental use cases. InTrends and Advances in Information Systems and Technologies: Volume 2 6 2018 (pp. 1386-1396). Springer International Publishing.

<a name="p35">[35]</a>  Ni J, Li J, McAuley J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) 2019 Nov (pp. 188-197).

<a name="p36">[36]</a> Balloccu G, Boratto L, Cancedda C, Fenu G, Marras M. Knowledge is power, understanding is impact: Utility and beyond goals, explanation quality, and fairness in path reasoning recommendation. InEuropean Conference on Information Retrieval 2023 Mar 16 (pp. 3-19). Cham: Springer Nature Switzerland.

<a name="p37">[37]</a> Geng S, Fu Z, Tan J, Ge Y, De Melo G, Zhang Y. Path language modeling over knowledge graphsfor explainable recommendation. InProceedings of the ACM Web Conference 2022 2022 Apr 25 (pp. 946-955).

<a name="p38">[38]</a> Balloccu G, Boratto L, Cancedda C, Fenu G, Marras M. Faithful Path Language Modelling for Explainable Recommendation over Knowledge Graph. arXiv preprint arXiv:2310.16452. 2023 Oct 25.

<a name="p39">[39]</a> Afreen N, Balloccu G, Boratto L, Fenu G, Marras M. Towards explainable educational recommendation through path reasoning methods. InCEUR WORKSHOP PROCEEDINGS 2023 (Vol. 3448, pp. 131-136). CEUR-WS.