# Explainable Recommender System with Knowledge Graphs

**Name: Virkha Kumari**<br>
**RS Project**
<br>

**Code with explanation:** https://explainablerecsys.github.io/recsys2022/
<BR><br>
**TUTORIAL**: 
https://www.youtube.com/watch?v=qPiIvcCOyBg
<br><br>
**Folder containing all files**
https://drive.google.com/drive/folders/1FBnh8SJvdTgmJoUoMvrzg7BppiHO8oIc
<br><br>
**Datasets=>** ***MovieLens-1M*** = 1 million user-movie ratings (1-5) from about 6,000 users on about 3,900 movies.

## Datasets, KGs, and pre-processing

In [24]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

# LOAD MOVIELENS DATA
ml1m_path='data/ml1m'
ml1m_users_df=pd.read_csv(f'{ml1m_path}/users.dat', sep='::', names=["UserID","Gender","Age","Occupation","Zip-code"], header=None, engine='python')  # multi-character separators need the python engine
display(ml1m_users_df.head(5))
print(f'Unique Users: {len(ml1m_users_df)}')

ml1m_movies_df=pd.read_csv(f'{ml1m_path}/movies.dat', sep='::', names=["movie_id", "movie_name", "genre"], header=None, encoding='latin-1', engine='python')
display(ml1m_movies_df.head(5))
print(f'Unique Videos: {len(ml1m_movies_df)}')

ml1m_ratings_df=pd.read_csv(f'{ml1m_path}/ratings.dat', sep="::", names=["user_id", "movie_id", "rating", "timestamp"], header=None, engine='python')
display(ml1m_ratings_df.head(5))
print(f"Unique interactions: {len(ml1m_ratings_df)}")

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


Unique Users: 6040


Unnamed: 0,movie_id,movie_name,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


Unique Videos: 3883


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Unique interactions: 1000209


**The entity URL opens the entity's page in the DBpedia knowledge base, where you can read more about the entity and see its relations with other entities.**

In [25]:
# LOAD DBPEDIA Knowledge Graph (KG) data for MovieLens
entities_df=pd.read_csv(f'{ml1m_path}/kg/e_map.dat', sep="\t", names=["entity_id", "entity_url"])
display(entities_df.head(5))

movies_to_kg_df=pd.read_csv(f'{ml1m_path}/kg/i2kg_map.tsv', sep="\t", names=["dataset_id", "movie_name", "entity_url"])
display(movies_to_kg_df.head(5))
print(f"Items mapped in the KG: {movies_to_kg_df.shape[0]}")

kg_df=pd.read_csv(f'{ml1m_path}/kg/kg.dat', sep="\t")
display(kg_df.head(5))
print(f"Number of triplets: {kg_df.shape[0]}")

relations_df=pd.read_csv(f'{ml1m_path}/kg/r_map.dat', sep="\t", names=["relation_id", "relation_url"])
display(relations_df.head(5))

Unnamed: 0,entity_id,entity_url
0,0,http://dbpedia.org/resource/Roger_Carel
1,1,http://dbpedia.org/resource/Soundtrack_album
2,2,http://dbpedia.org/resource/1982_in_film
3,3,http://dbpedia.org/resource/Category:Films_set...
4,4,http://dbpedia.org/resource/Plaza_Hotel


Unnamed: 0,dataset_id,movie_name,entity_url
0,781,Stealing Beauty (1996),http://dbpedia.org/resource/Stealing_Beauty
1,1799,Suicide Kings (1997),http://dbpedia.org/resource/Suicide_Kings
2,521,Romeo Is Bleeding (1993),http://dbpedia.org/resource/Romeo_Is_Bleeding
3,3596,Screwed (2000),http://dbpedia.org/resource/Screwed_(2000_film)
4,3682,Magnum Force (1973),http://dbpedia.org/resource/Magnum_Force


Items mapped in the KG: 3301


Unnamed: 0,entity_head,entity_tail,relation
0,9386,4955,12
1,6851,1770,1
2,210,5210,8
3,4205,406,12
4,11533,12345,3


Number of triplets: 434189


Unnamed: 0,relation_id,relation_url
0,0,http://dbpedia.org/ontology/cinematography
1,1,http://dbpedia.org/property/productionCompanies
2,2,http://dbpedia.org/property/composer
3,3,http://purl.org/dc/terms/subject
4,4,http://dbpedia.org/ontology/openingFilm


In [26]:
# Problem: the linking process can fail, so some products end up not mapped to the KG
# Below we fix this "MISSING ITEM IN KG" problem
print(f'Items in original dataset: {ml1m_movies_df.shape[0]}')
print(f'Items correctly mapped in KG: {movies_to_kg_df.shape[0]}')

# Let's remove from the ml1m_movies_df DataFrame the movies that are not in movies_to_kg_df.
num_movies=ml1m_movies_df.shape[0]
ml1m_movies_df=ml1m_movies_df[ml1m_movies_df['movie_id'].isin(movies_to_kg_df.dataset_id)]
ml1m_movies_df.reset_index()
display(ml1m_movies_df.head(5))
print(f'Number of rows removed due to missing links in KG: {num_movies-ml1m_movies_df.shape[0]}')

# Now lets remove videos/items not present in entities_df by merging with movies_to_kg_df
movies_to_kg_df=pd.merge(movies_to_kg_df, entities_df, on=['entity_url'])
print('\nUpdated movies_to_kg_df: ')
display(movies_to_kg_df.head(5))
print(f'Correct Mapped Videos/Items: {movies_to_kg_df.shape[0]}')

print(f'\nMovies Before: {ml1m_movies_df.shape[0]}')
movies_to_kg_df=movies_to_kg_df[movies_to_kg_df.entity_id.isin(entities_df.entity_id)]
print(f'Number of rows due to missing entity data in KG: {movies_to_kg_df.shape[0]}')

# NOW LETS PROPAGATE CHANGES CONSISTENTLY
num_ratings=ml1m_ratings_df.shape[0]
ml1m_ratings_df=ml1m_ratings_df[ml1m_ratings_df['movie_id'].isin(movies_to_kg_df.dataset_id)]
ml1m_ratings_df.reset_index()
display(ml1m_ratings_df.head(5))
print(f"Num rows removed due to interaction with removed movies: {num_ratings-ml1m_ratings_df.shape[0]}")

Items in original dataset: 3883
Items correctly mapped in KG: 3301


Unnamed: 0,movie_id,movie_name,genre
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller


Number of rows removed due to missing links in KG: 582

Updated movies_to_kg_df: 


Unnamed: 0,dataset_id,movie_name,entity_url,entity_id
0,781,Stealing Beauty (1996),http://dbpedia.org/resource/Stealing_Beauty,5474
1,1799,Suicide Kings (1997),http://dbpedia.org/resource/Suicide_Kings,9337
2,521,Romeo Is Bleeding (1993),http://dbpedia.org/resource/Romeo_Is_Bleeding,8520
3,3596,Screwed (2000),http://dbpedia.org/resource/Screwed_(2000_film),4713
4,3682,Magnum Force (1973),http://dbpedia.org/resource/Magnum_Force,14676


Correct Mapped Videos/Items: 3266

Movies Before: 3301
Number of rows due to missing entity data in KG: 3266


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Num rows removed due to interaction with removed movies: 58904
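The two filtering steps above can be checked on a toy example. The following is a minimal sketch, assuming made-up frames (`movies`, `mapping` and `entities` are hypothetical stand-ins, not the project data). It shows how `isin` and an inner `merge` drop unmapped items, and how a left merge with `indicator=True` reveals which mappings would be lost.

```python
import pandas as pd

# Toy stand-ins for ml1m_movies_df, movies_to_kg_df and entities_df (hypothetical data).
movies = pd.DataFrame({"movie_id": [1, 2, 3, 4]})
mapping = pd.DataFrame({
    "dataset_id": [2, 3, 4],
    "entity_url": ["url_b", "url_c", "url_d"],
})
entities = pd.DataFrame({"entity_id": [10, 11], "entity_url": ["url_b", "url_c"]})

# Step 1: keep only the movies that have a link to the KG.
linked = movies[movies["movie_id"].isin(mapping["dataset_id"])]

# Step 2: an inner merge keeps only the mappings whose URL exists in the entity map.
mapped = pd.merge(mapping, entities, on="entity_url")

# A left merge with indicator=True shows which mappings have no matching entity.
audit = pd.merge(mapping, entities, on="entity_url", how="left", indicator=True)
lost = audit.loc[audit["_merge"] == "left_only", "dataset_id"].tolist()

print(linked["movie_id"].tolist())
print(mapped["dataset_id"].tolist())
print(lost)
```

Running an audit like this before filtering makes it easier to tell whether rows disappear because of a missing link or because of a missing entity.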


In [27]:
# NOW LET'S REDUCE THE DATASET TO ITS K-CORE, i.e., discard items and/or users that occur fewer than k times in the ratings
counts_col_user=ml1m_ratings_df.groupby('user_id')['user_id'].transform(len)
counts_col_movies=ml1m_ratings_df.groupby('movie_id')['movie_id'].transform(len)
print(counts_col_user.head(5))

k_user, k_movie=5,5 #LESS THAN 5 OCCURRENCES DISCARDED
mask_user=counts_col_user>=k_user
mask_movies=counts_col_movies>=k_movie
print(mask_movies.head(5))

print(f'\nNum ratings before: {ml1m_ratings_df.shape[0]}')
ml1m_ratings_df=ml1m_ratings_df[mask_user & mask_movies]
print(f'Num ratings after: {ml1m_ratings_df.shape[0]}')
display(ml1m_ratings_df.head(5))

# now propagate these changes
ml1m_users_df=ml1m_users_df[ml1m_users_df.UserID.isin(ml1m_ratings_df.user_id.unique())]
ml1m_movies_df=ml1m_movies_df[ml1m_movies_df.movie_id.isin(ml1m_ratings_df.movie_id.unique())]
print(f'Final Num Users: {ml1m_users_df.shape[0]}, Num Movies: {ml1m_movies_df.shape[0]}')

0    48
1    48
2    48
3    48
4    48
Name: user_id, dtype: int64
0    True
1    True
2    True
3    True
4    True
Name: movie_id, dtype: bool

Num ratings before: 941305
Num ratings after: 940963


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Final Num Users: 6040, Num Movies: 3030
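The cell above applies both thresholds in a single pass. A strict k-core repeats the filter until nothing changes, because removing an item can push a user below the threshold and vice versa. The sketch below illustrates that iteration on toy data; `k_core` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

def k_core(ratings, k_user=5, k_item=5, user_col="user_id", item_col="movie_id"):
    """Repeatedly drop users/items with fewer than k ratings until nothing changes."""
    while True:
        user_counts = ratings.groupby(user_col)[user_col].transform("count")
        item_counts = ratings.groupby(item_col)[item_col].transform("count")
        keep = (user_counts >= k_user) & (item_counts >= k_item)
        if keep.all():
            return ratings
        ratings = ratings[keep]

# Hypothetical toy ratings: movie 12 is rated once, so user 3 ends up with one rating.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "movie_id": [10, 11, 10, 11, 11, 12],
})
core = k_core(toy, k_user=2, k_item=2)
print(core)
```

In this toy case a single pass would keep the row `(3, 11)`, while the iterative version removes it in the second round.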


In [28]:
# NOW THAT THE CHANGES ARE MADE, WE NEED TO PROPAGATE THEM TO THE KG DATA
from knowledge_graph_utils import propagate_item_removal_to_kg
movies_to_kg_df, entities_df, kg_df=propagate_item_removal_to_kg(ml1m_movies_df, movies_to_kg_df, entities_df, kg_df)

Removed 236 entries from i2kg map.
Removed 236 entries from e_map
Removed 11916 triplets from kg_df


**NOW DONE WITH PRE-PROCESSING OF DATASET**

In [29]:
ml1m_preprocessed_path='data/ml1m/preprocessed'

# SAVE IN Standard Format for Users
#remove unnecessary data
ml1m_users_df=ml1m_users_df.drop(["Gender", "Age", "Occupation", "Zip-code"], axis=1)
display(ml1m_users_df.head(5))
#add new_id as per Standard format
ml1m_users_df.insert(0, 'new_id', range(ml1m_users_df.shape[0]))
display(ml1m_users_df.head(5))
#save
ml1m_users_df.to_csv(f'{ml1m_preprocessed_path}/users.txt', header=["new_id", "raw_dataset_id"], index=False, sep='\t', mode='w+')
user_id2new_id=dict(zip(ml1m_users_df['UserID'], ml1m_users_df.new_id))

# SAVE IN Standard Format for Movies
ml1m_movies_df=ml1m_movies_df.drop(["movie_name", "genre"], axis=1)
ml1m_movies_df.insert(0, 'new_id', range(ml1m_movies_df.shape[0]))
display(ml1m_movies_df.head(5))
print(ml1m_movies_df.shape[0])
ml1m_movies_df.to_csv(f'{ml1m_preprocessed_path}/products.txt', header=["new_id", "raw_dataset_id"], index=False, sep='\t', mode='w+')
movie_id2new_id = dict(zip(ml1m_movies_df["movie_id"], ml1m_movies_df.new_id))

# CHANGE IN RATINGS
ml1m_ratings_df["user_id"]=ml1m_ratings_df['user_id'].map(user_id2new_id)
ml1m_ratings_df["movie_id"]=ml1m_ratings_df['movie_id'].map(movie_id2new_id)
display(ml1m_ratings_df.head(5))
#Save ratings
ml1m_ratings_df.to_csv(f'{ml1m_preprocessed_path}/ratings.txt', header=["uid", "pid", "rating", "timestamp"], index=False, sep='\t', mode='w+')

Unnamed: 0,UserID
0,1
1,2
2,3
3,4
4,5


Unnamed: 0,new_id,UserID
0,0,1
1,1,2
2,2,3
3,3,4
4,4,5


Unnamed: 0,new_id,movie_id
1,0,2
2,1,3
3,2,4
4,3,5
5,4,6


3030


Unnamed: 0,user_id,movie_id,rating,timestamp
0,0,872,5,978300760
1,0,537,3,978302109
2,0,679,3,978301968
3,0,2606,4,978300275
4,0,1790,5,978824291


In [30]:
# Preprocessing KG
#keep only relation between entity head and entity tail
display(movies_to_kg_df.head(5))
print(f"Number of movies correctly mapped: {movies_to_kg_df.shape[0]}")
mask=kg_df['entity_tail'].isin(movies_to_kg_df.entity_id) & ~kg_df['entity_head'].isin(movies_to_kg_df.entity_id)
kg_df.loc[mask, ['entity_head', 'entity_tail']]=(kg_df.loc[mask, ['entity_tail', 'entity_head']].values)

n_of_triplets=kg_df.shape[0]
kg_df=kg_df[(kg_df['entity_head'].isin(movies_to_kg_df.entity_id) & ~kg_df['entity_tail'].isin(movies_to_kg_df.entity_id))]
display(kg_df.head(5))
print(f"Number of triplets before: {n_of_triplets}")
print(f"Number of triplets after: {kg_df.shape[0]}")

print(len(kg_df.relation.unique()))

Unnamed: 0,dataset_id,movie_name,entity_url,entity_id
0,781,Stealing Beauty (1996),http://dbpedia.org/resource/Stealing_Beauty,5474
1,1799,Suicide Kings (1997),http://dbpedia.org/resource/Suicide_Kings,9337
2,521,Romeo Is Bleeding (1993),http://dbpedia.org/resource/Romeo_Is_Bleeding,8520
3,3596,Screwed (2000),http://dbpedia.org/resource/Screwed_(2000_film),4713
4,3682,Magnum Force (1973),http://dbpedia.org/resource/Magnum_Force,14676


Number of movies correctly mapped: 3030


Unnamed: 0,entity_head,entity_tail,relation
0,9386,4955,12
1,6851,1770,1
2,210,5210,8
3,4205,406,12
4,11533,12345,3


Number of triplets before: 422273
Number of triplets after: 409887
20
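The head/tail normalization in the cell above is easier to follow on a small example. This is a minimal sketch with hypothetical entity ids (`movie_entities` and the toy `kg` frame are assumptions). Triplets whose tail is a movie are flipped so that the movie becomes the head, and only movie-to-non-movie triplets are kept.

```python
import pandas as pd

movie_entities = {1, 2}  # hypothetical entity ids that correspond to movies
kg = pd.DataFrame({
    "entity_head": [1, 7, 1, 8],
    "entity_tail": [7, 2, 2, 9],
    "relation":    [0, 0, 1, 2],
})

# Triplets where only the tail is a movie get flipped so the movie becomes the head.
flip = kg["entity_tail"].isin(movie_entities) & ~kg["entity_head"].isin(movie_entities)
kg.loc[flip, ["entity_head", "entity_tail"]] = kg.loc[flip, ["entity_tail", "entity_head"]].values

# Keep only movie -> non-movie triplets (movie-movie and non-movie-only rows are dropped).
keep = kg["entity_head"].isin(movie_entities) & ~kg["entity_tail"].isin(movie_entities)
kg = kg[keep].reset_index(drop=True)
print(kg)
```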


In [31]:
# Make the KG denser by discarding relations that appear 300 times or fewer
v=kg_df[['relation']]
n_of_triplets=kg_df.shape[0]
kg_df=kg_df[v.replace(v.apply(pd.Series.value_counts)).gt(300).all(1)]
display(kg_df.head(5))
print(f"Number of triplets before: {n_of_triplets}")
print(f"Number of triplets after: {kg_df.shape[0]}")

print(len(kg_df.relation.unique()))

# Update our relations_df keeping only the relations that are currently in our kg_df and save them in r_map_filtered.dat
display(relations_df)
relations_df=relations_df[relations_df['relation_id'].isin(kg_df.relation.unique())]
relations_df.reset_index()
display(relations_df.head(5))

# remove relations that are not semantically meaningful
relations_df=relations_df[(relations_df['relation_id'] != 13) & (relations_df['relation_id'] != 8)]
display(relations_df)

# Propagate this change in e_map and kg_df.
print(f"Triplets before: {kg_df.shape[0]}")
kg_df=kg_df[kg_df.relation.isin(relations_df.relation_id)]
print(f"Triplets after: {kg_df.shape[0]}")
print(f"Entities before: {entities_df.shape[0]}")
entities_df=entities_df[entities_df.entity_id.isin(kg_df.entity_head) | entities_df.entity_id.isin(kg_df.entity_tail)]
print(f"Entities after: {entities_df.shape[0]}")

Unnamed: 0,entity_head,entity_tail,relation
0,9386,4955,12
1,6851,1770,1
2,210,5210,8
3,4205,406,12
4,11533,12345,3


Number of triplets before: 409887
Number of triplets after: 409477
13


Unnamed: 0,relation_id,relation_url
0,0,http://dbpedia.org/ontology/cinematography
1,1,http://dbpedia.org/property/productionCompanies
2,2,http://dbpedia.org/property/composer
3,3,http://purl.org/dc/terms/subject
4,4,http://dbpedia.org/ontology/openingFilm
5,5,http://www.w3.org/2000/01/rdf-schema#seeAlso
6,6,http://dbpedia.org/property/story
7,7,http://dbpedia.org/ontology/series
8,8,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
9,9,http://dbpedia.org/ontology/basedOn


Unnamed: 0,relation_id,relation_url
0,0,http://dbpedia.org/ontology/cinematography
1,1,http://dbpedia.org/property/productionCompanies
2,2,http://dbpedia.org/property/composer
3,3,http://purl.org/dc/terms/subject
8,8,http://www.w3.org/1999/02/22-rdf-syntax-ns#type


Unnamed: 0,relation_id,relation_url
0,0,http://dbpedia.org/ontology/cinematography
1,1,http://dbpedia.org/property/productionCompanies
2,2,http://dbpedia.org/property/composer
3,3,http://purl.org/dc/terms/subject
10,10,http://dbpedia.org/ontology/starring
11,11,http://dbpedia.org/ontology/country
12,12,http://dbpedia.org/ontology/wikiPageWikiLink
14,14,http://dbpedia.org/ontology/editing
15,15,http://dbpedia.org/property/producers
16,16,http://dbpedia.org/property/allWriting


Triplets before: 409477
Triplets after: 323499
Entities before: 14472
Entities after: 13804
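The frequency filter used above (`replace` combined with `value_counts`) can also be written with `map`, which may be easier to read. This is a sketch on toy data, with a hypothetical threshold `min_count`, not a drop-in replacement that has been tested on the real KG.

```python
import pandas as pd

kg = pd.DataFrame({
    "entity_head": [1, 1, 2, 2, 3],
    "entity_tail": [7, 8, 7, 9, 9],
    "relation":    [0, 0, 0, 1, 2],
})
min_count = 2  # hypothetical threshold; the notebook cell uses 300

# Map every row to the frequency of its relation, then keep the frequent ones.
relation_freq = kg["relation"].map(kg["relation"].value_counts())
dense_kg = kg[relation_freq > min_count]
print(dense_kg)
```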


In [32]:
# Saving preprocessed KG in Standard Format
ml1m_kg_preprocessed_path='data/ml1m/preprocessed/'
relations_df.to_csv(f'{ml1m_kg_preprocessed_path}/r_map.txt', header=["relation_id", "relation_url"], index=False, sep='\t', mode='w+')
display(entities_df.head(5))
entities_df.to_csv(f'{ml1m_kg_preprocessed_path}/e_map.txt', header=["entity_id", "entity_url"], index=False, sep='\t', mode='w+')
display(movies_to_kg_df.head(5))
movies_to_kg_df.to_csv(f'{ml1m_kg_preprocessed_path}/i2kg_map.txt', header=["dataset_id", "movie_name", 'entity_url', 'entity_id'], index=False, sep='\t', mode='w+')
display(kg_df.head(5))
kg_df.to_csv(f'{ml1m_kg_preprocessed_path}/kg_final.txt', header=["entity_head", "entity_tail", 'relation'], index=False, sep='\t', mode='w+')

Unnamed: 0,entity_id,entity_url
0,0,http://dbpedia.org/resource/Roger_Carel
1,1,http://dbpedia.org/resource/Soundtrack_album
2,2,http://dbpedia.org/resource/1982_in_film
3,3,http://dbpedia.org/resource/Category:Films_set...
4,4,http://dbpedia.org/resource/Plaza_Hotel


Unnamed: 0,dataset_id,movie_name,entity_url,entity_id
0,781,Stealing Beauty (1996),http://dbpedia.org/resource/Stealing_Beauty,5474
1,1799,Suicide Kings (1997),http://dbpedia.org/resource/Suicide_Kings,9337
2,521,Romeo Is Bleeding (1993),http://dbpedia.org/resource/Romeo_Is_Bleeding,8520
3,3596,Screwed (2000),http://dbpedia.org/resource/Screwed_(2000_film),4713
4,3682,Magnum Force (1973),http://dbpedia.org/resource/Magnum_Force,14676


Unnamed: 0,entity_head,entity_tail,relation
0,9386,4955,12
1,6851,1770,1
3,4205,406,12
4,11533,12345,3
5,2075,7182,12


## Policy Guided Path Reasoning (PGPR)

* Explainable recommendation over a KG is done here with a reasoning-path method, which extracts the recommendation and its explanation path in the same step. This gives high fidelity: the explanation path is the path the model actually followed, so it reflects the real model behavior.
* A knowledge graph path is a chain of entities and relations. For explainable recommendation, we are interested in user-centric paths that start from a user entity and end at a product entity.
* Meta-paths constrain the possible paths to those that follow structured rules on length and semantics, because some paths make no sense or are not user-centric. A meta-path is a sequence of relations connecting two entities, proposed to capture the structural and semantic relation between objects.
* Paths are a good source for textual explanations, and some methods extract them while producing the recommendations, so the path encodes how the model actually works. Paths that have this property and are built step by step are called reasoning paths.
* TransE is a translational knowledge graph embedding (KGE) method. KGE methods aim to learn low-dimensional representations of a KG's entities and relations while preserving their semantic meaning.
* WHAT IS PGPR? PGPR is a self-explainable recommendation method that relies on a Reinforcement Learning (RL) agent conditioned on the user and trained to navigate to potentially relevant products. Starting from a user node, the agent performs explicit multi-step path reasoning over the graph to discover suitable items to recommend to the target user. The path from the user to the product can then be used to explain the recommendation.
<br>
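To make the TransE idea from the list above concrete: in the original TransE formulation a triplet (head, relation, tail) is plausible when `head + relation` lies close to `tail` in the embedding space. The sketch below uses random vectors with a hypothetical embedding size; it illustrates the translational principle only and is not the training code used by PGPR.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding size
head = rng.normal(size=dim)
relation = rng.normal(size=dim)
true_tail = head + relation + 0.01 * rng.normal(size=dim)  # nearly satisfies h + r = t
random_tail = rng.normal(size=dim)                         # unrelated entity

def transe_distance(h, r, t):
    """L2 distance between the translated head (h + r) and the tail t."""
    return float(np.linalg.norm(h + r - t))

print(transe_distance(head, relation, true_tail))    # small: plausible triplet
print(transe_distance(head, relation, random_tail))  # larger: implausible triplet
```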

To model an **RL problem** we need to define 4 things:
- `Environment`: The KG
- `Set of States`: Nodes of the KG
- `Set of Actions`: Edges of the KG
- `Reward`: Computed using TransE embeddings

The KG environment for the RL agent is implemented in <a href='models/PGPR/kg_env.py'>models/PGPR/kg_env.py</a>
<br>
* During learning, the agent starts from a random user node (state) and searches for valid paths (according to our meta-paths), composing the path step by step until it reaches a product.
* The next action is chosen from the action space by considering both random actions (exploration) and actions that maximize the scoring function (exploitation).
* To handle the potentially large number of available actions, the authors also propose pruning the action space based on the scoring function.
* The multi-hop scoring function for a path of length k is defined as the sum of the dot products in the translational space (<...>) between the entities that compose the path.
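The state/action view and the score-based pruning described above can be sketched on a toy graph. Everything here is an assumption made for illustration: the node names, the random embeddings, and the simplified one-hop `score` function, which stands in for the multi-hop scoring function and is not the implementation in `kg_env.py`.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4  # hypothetical embedding size
nodes = ["user_0", "product_0", "product_1", "product_2", "actor_0"]
emb = {n: rng.normal(size=dim) for n in nodes}
rel_emb = {r: rng.normal(size=dim) for r in ["watched", "starring"]}

# State = current node; actions = outgoing (relation, next_node) edges.
edges = {
    "user_0": [("watched", "product_0"), ("watched", "product_1"), ("watched", "product_2")],
    "product_0": [("starring", "actor_0"), ("watched", "user_0")],
}

def score(user, relation, target):
    """Dot product between the translated user (u + r) and the candidate entity."""
    return float(np.dot(emb[user] + rel_emb[relation], emb[target]))

def pruned_actions(user, node, max_acts):
    """Keep the max_acts highest-scoring outgoing edges of the current node."""
    candidates = edges.get(node, [])
    ranked = sorted(candidates, key=lambda a: score(user, a[0], a[1]), reverse=True)
    return ranked[:max_acts]

actions = pruned_actions("user_0", "user_0", max_acts=2)
print(actions)
```

Pruning keeps the action space bounded (the training log in this notebook shows `max_acts=250`), so the policy network only has to rank a manageable number of candidate edges per step.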

In [8]:
import gzip
from mapper import write_time_based_train_test_split

dataset_name='ml1m'
# Create a time-based, per-user split: for every user, the first n_train ratings sorted by timestamp go to the training set and the rest go to the test set.

train_size=0.8
write_time_based_train_test_split(dataset_name, "pgpr", train_size)

# In these files, every row represents an interaction and is formatted as uid, pid, 1, timestamp, separated by \t.
# The with block closes the file automatically, so no explicit close() is needed.
with gzip.open('data/ml1m/preprocessed/pgpr/train.txt.gz', 'rt') as train_file:
    for i, line in enumerate(train_file):
        if i>5: 
            break
        print(line)

0	2455	1	978300019



0	937	1	978300055



0	1279	1	978300055
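The split performed by `write_time_based_train_test_split` is not shown in this notebook. The following is a minimal sketch of a time-based, per-user split under stated assumptions: `time_based_split` is a hypothetical helper, and the real function may round the training share differently.

```python
import pandas as pd

def time_based_split(ratings, train_size=0.8):
    """Per user, put the earliest train_size share of ratings in train, the rest in test."""
    ratings = ratings.sort_values(["uid", "timestamp"])
    rank = ratings.groupby("uid").cumcount()  # 0-based position within each user
    n_ratings = ratings.groupby("uid")["uid"].transform("count")
    is_train = rank < (n_ratings * train_size).astype(int)
    return ratings[is_train], ratings[~is_train]

# Hypothetical toy interactions: two users with five ratings each.
toy = pd.DataFrame({
    "uid": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "pid": [5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
    "timestamp": [50, 10, 40, 20, 30, 3, 1, 2, 5, 4],
})
train, test = time_based_split(toy)
print(test)
```

With five ratings per user and `train_size=0.8`, each user's most recent rating ends up in the test set.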





In [9]:
# To run PGPR, the files must be in a format the model can read. In particular, PGPR needs all entities (e.g., actor, category) and relations (e.g., starring, belong_to) grouped by type.
from mapper import map_to_PGPR

map_to_PGPR(dataset_name)

# The with blocks close the files automatically, so no explicit close() is needed.
with gzip.open("data/ml1m/preprocessed/pgpr/editor.txt.gz", 'rt') as editor_file:
    for i, line in enumerate(editor_file):
        if i > 5: break
        print(line)

with gzip.open("data/ml1m/preprocessed/pgpr/category_p_ca.txt.gz", 'rt') as belong_to_file:
    for i, line in enumerate(belong_to_file):
        if i > 5: break
        print(line)

new_id	name



0	14104



1	648



1132	1194	109	1741	638	770	1817	1218	718	1401	1230	679	802	1021	849	1167	1742	187	680	1818	450	1117	1660	1121	509	328	1044	1496	1154



1167	315	652	691	942	1500	990	1144	1022	1316	1230	638	1004



727	1144	1359	905	276	1230	1167	255	291	1128	1424	638	990	1748	417





In [10]:
#Run PGPR preprocessing
%cd models/PGPR/

c:\Users\IT\OneDrive\Desktop\Projects\sem8_RS_project\models\PGPR


In [4]:
%run preprocess.py --dataset {dataset_name}

Load ml1m dataset from file...
Load user of size 6040
Load product of size 3030
Load actor of size 2411
Load composer of size 295
Load director of size 601
Load producer of size 669
Load production_company of size 305
Load category of size 1821
Load country of size 32
Load editor of size 220
Load writter of size 665
Load cinematographer of size 236
Load wikipage of size 10839
Load directed_by of size 1884
Load composed_by of size 2005
Load produced_by_company of size 4329
Load produced_by_producer of size 2287
Load produced_in of size 311
Load starring of size 9599
Load belong_to of size 40470
Load edited_by of size 1193
Load wrote_by of size 1756
Load cinematography_by of size 1480
Load related_to of size 258018
0 0
Load review of size 750376 with positive reviews= 0  and negative reviews= 750376
Create ml1m knowledge graph from dataset...
Load entities...
Total 27177 nodes.
Load reviews...
Total 1500752 review edges.
Load knowledge produced_by_company...
Total 8458 produced_by_compan

In [5]:
#learn the TransE representation of our KG by executing the train_transe_model.py.
%run train_transe_model --dataset {dataset_name}

../../data/ml1m/preprocessed/pgpr/tmp
[INFO]  Namespace(dataset='ml1m', name='train_transe_model', seed=123, gpu='0', epochs=30, batch_size=64, lr=0.5, weight_decay=0, l2_lambda=0, max_grad_norm=5.0, embed_size=100, num_neg_samples=5, steps_per_checkpoint=200, device='cpu', log_dir='../../data/ml1m/preprocessed/pgpr/tmp\\train_transe_model')


[INFO]  Parameters:['watched', 'produced_by_company', 'produced_by_producer', 'belong_to', 'directed_by', 'starring', 'wrote_by', 'edited_by', 'cinematography_by', 'composed_by', 'produced_in', 'related_to', 'user.weight', 'product.weight', 'actor.weight', 'composer.weight', 'director.weight', 'production_company.weight', 'producer.weight', 'writter.weight', 'editor.weight', 'cinematographer.weight', 'category.weight', 'country.weight', 'wikipage.weight', 'watched_bias.weight', 'produced_by_company_bias.weight', 'produced_by_producer_bias.weight', 'belong_to_bias.weight', 'directed_by_bias.weight', 'starring_bias.weight', 'wrote_by_bias.weight', 'edited_by_bias.weight', 'cinematography_by_bias.weight', 'composed_by_bias.weight', 'produced_in_bias.weight', 'related_to_bias.weight']
[INFO]  Epoch: 01 | Review: 12800/22511281 | Lr: 0.49972 | Smooth loss: 41.98669
[INFO]  Epoch: 01 | Review: 25600/22511281 | Lr: 0.49943 | Smooth loss: 36.39988
[INFO]  Epoch: 01 | Review: 38400/22511281 | L

  state_dict = torch.load(model_file, map_location=lambda storage, loc: storage)


In [5]:
# Now going to learn the policy π. The goal is to find a policy that maximizes the cumulative reward
%run train_agent --dataset {dataset_name}

[INFO]  Namespace(dataset='ml1m', name='train_agent', seed=123, gpu='0', epochs=50, batch_size=32, lr=0.0001, max_acts=250, max_path_len=3, gamma=0.99, ent_weight=0.001, act_dropout=0, state_history=1, hidden=[512, 256], device='cpu', log_dir='../../data/ml1m/preprocessed/pgpr/tmp\\train_agent')
Load embedding: ../../data/ml1m/preprocessed/pgpr/tmp/transe_embed.pkl
[INFO]  Parameters:['l1.weight', 'l1.bias', 'l2.weight', 'l2.bias', 'actor.weight', 'actor.bias', 'critic.weight', 'critic.bias']
[INFO]  epoch/step=1/100 | loss=0.56705 | ploss=0.36920 | vloss=0.21244 | entropy=-14.60015 | reward=0.22313
[INFO]  epoch/step=2/200 | loss=0.33977 | ploss=0.15067 | vloss=0.20368 | entropy=-14.58161 | reward=0.22005
[INFO]  epoch/step=2/300 | loss=0.22393 | ploss=0.04188 | vloss=0.19661 | entropy=-14.56329 | reward=0.21802
[INFO]  epoch/step=3/400 | loss=0.23307 | ploss=0.05366 | vloss=0.19404 | entropy=-14.62686 | reward=0.21858
[INFO]  epoch/step=3/500 | loss=0.28138 | ploss=0.10770 | vloss=0.

In [4]:
# The only remaining step is to extract the paths from the previously learned policy π.
# This is done with a beam search that scores every step with the probability π(a_t | s_t, A_{u,t}).
%run test_agent --dataset {dataset_name}

Predicting paths...
Load embedding: ../../data/ml1m/preprocessed/pgpr/tmp/transe_embed.pkl


  pretrain_sd = torch.load(policy_file)
6048it [1:30:33,  1.11it/s]                          
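Path extraction with beam search can be illustrated on a toy policy. In the sketch below the `policy` table, the node names and the per-hop beam sizes are all hypothetical. The real `test_agent` script queries the trained policy network instead of a lookup table.

```python
import math

# Hypothetical toy policy: probability of each (relation, node) action from a node.
policy = {
    "user_0": {("watched", "product_0"): 0.7, ("watched", "product_1"): 0.3},
    "product_0": {("starring", "actor_0"): 0.6, ("watched", "user_1"): 0.4},
    "product_1": {("starring", "actor_0"): 1.0},
    "actor_0": {("starring", "product_2"): 0.9, ("starring", "product_3"): 0.1},
    "user_1": {("watched", "product_3"): 1.0},
}

def beam_search(start, beam_sizes):
    """Expand paths hop by hop, keeping the beam_sizes[hop] most probable actions per path."""
    beams = [([start], 0.0)]  # (path, log-probability of the path)
    for k in beam_sizes:
        expanded = []
        for path, logp in beams:
            actions = policy.get(path[-1], {})
            best = sorted(actions.items(), key=lambda kv: kv[1], reverse=True)[:k]
            for (relation, node), prob in best:
                expanded.append((path + [relation, node], logp + math.log(prob)))
        beams = expanded
    return sorted(beams, key=lambda b: b[1], reverse=True)

paths = beam_search("user_0", beam_sizes=[2, 1, 1])
for path, logp in paths:
    print(" ".join(path), round(math.exp(logp), 3))
```

The path probability is the product of the step probabilities, which corresponds to the `path_probability` column stored with each predicted path.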


In [12]:
%run test_agent --dataset {dataset_name} --run_path False --run_eval True

Load embedding: ../../data/ml1m/preprocessed/pgpr/tmp/transe_embed.pkl
Normalizing items scores...
Saving pred_paths...
Average test set size:  31.553973509933776
Overall for noOfUser=6040, ndcg=0.2987


Overall for noOfUser=6040, hr=0.5358


Overall for noOfUser=6040, precision=0.1004


Overall for noOfUser=6040, recall=0.0473




## Textual explanation creation and evaluation

In [1]:
import numpy as np
import pandas as pd
from random import seed, randint, choice
from collections import defaultdict
import pickle

dataset_name='ml1m'
models=['pgpr']
users_topk={model: defaultdict(list) for model in models}
train_labels={}
test_labels={}
# EVALUATION METRICS: Calculate Normalized Discounted Cumulative Gain (NDCG), Recall and Precision.
ndcgs={model:[] for model in models}
recalls={model:[] for model in models}
precisions={model:[] for model in models}

In [10]:
def dcg_at_k(topk, k, method=1):
    topk=np.asarray(topk, dtype=float)[:k]
    if topk.size:
        if method==0:
            return topk[0]+np.sum(topk[1:] / np.log2(np.arange(2, topk.size+1)))
        elif method == 1:
            return np.sum(topk / np.log2(np.arange(2, topk.size+2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.

def ndcg_at_k(topk, k, method=1):
    dcg_max = dcg_at_k(sorted(topk, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(topk, k, method) / dcg_max

def recall_at_k(topk, test_pids):
    return sum(topk) / len(test_pids)

def precision_at_k(topk, k):
    return sum(topk) / k
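A small worked example may help to read the metric functions above. The sketch recomputes NDCG (the `method=1` variant), precision and recall by hand for a hypothetical hit list, assuming the user has 4 items in the test set.

```python
import numpy as np

hits = [1, 0, 1]  # hypothetical top-3 list: positions 1 and 3 are test items

discounts = np.log2(np.arange(2, len(hits) + 2))     # log2(2), log2(3), log2(4)
dcg = float(np.sum(np.asarray(hits) / discounts))    # 1/1 + 0 + 1/2
ideal = sorted(hits, reverse=True)                   # [1, 1, 0]
idcg = float(np.sum(np.asarray(ideal) / discounts))  # 1/1 + 1/log2(3) + 0
ndcg = dcg / idcg

precision = sum(hits) / len(hits)
recall = sum(hits) / 4  # assuming the user has 4 test items

print(round(dcg, 4), round(ndcg, 4), round(precision, 4), recall)
```

Note that the ideal ranking is built from the hit list itself, as in `ndcg_at_k` above, so NDCG equals 1 whenever all hits are ranked first, regardless of how many test items were missed.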

In [3]:
# To convert the explanations to plain text, we need to substitute the entity IDs with their names. To do that, let's load these mappings for the model using the entity2plain_text function of the knowledge_graph_utils module.
from knowledge_graph_utils import entity2plain_text
entity2name = {}
entity2name["pgpr"] = entity2plain_text(dataset_name, "pgpr")

def path_len(path):
    # Count the entity IDs (numeric tokens) in the path
    count=0
    for s in path:
        if str(s).isnumeric():
            count+=1
    return count

# Path Structure: user 5038 watched product 2430 watched user 1498 watched product 1788
def template(curr_model, path):
    if path[0]=="self_loop":
        path=path[1:]
    # Compute the length before the IDs are replaced by names
    path_length=path_len(path)
    for i in range(1, len(path)):
        s=str(path[i])
        if s.isnumeric():
            if path[i-1]=='user': 
                continue
            if int(path[i]) not in entity2name[curr_model][path[i-1]]: 
                continue
            path[i] = entity2name[curr_model][path[i-1]][int(path[i])]
    if path_length == 4:
        _, uid, rel_0, e_type_1, e_1, rel_1, e_type_2, e_2, rel_k, _, pid  = path
        return f"{pid} is recommended to you because you {rel_0} {e_1} also {rel_k} by {e_2}"
    elif path_length == 3:
        _, uid, rel_0, e_type_1, e_1, rel_1, _, pid  = path
        return f"{pid} is recommended to you because it is {rel_1} with {e_1} that you previously {rel_0}"

Opening file: data/ml1m/preprocessed/pgpr/mappings.txt.gz
Header: ['product_0', 'Jumanji (1995)']
Row 0: []
Row 1: ['product_1', 'Grumpier Old Men (1995)']
Row 2: []
Row 3: ['product_2', 'Waiting to Exhale (1995)']
Row 4: []
Opening file: data/ml1m/preprocessed/pgpr/mappings.txt.gz
Header: ['product_0', 'Jumanji (1995)']
Row 0: []
Row 1: ['product_1', 'Grumpier Old Men (1995)']
Row 2: []
Row 3: ['product_2', 'Waiting to Exhale (1995)']
Row 4: []


In [4]:
curr_model="pgpr"

with open(f"results/{dataset_name}/{curr_model}/pred_paths.pkl", 'rb') as pred_paths_file:
    pred_paths_pgpr=pickle.load(pred_paths_file)
pd.DataFrame(pred_paths_pgpr[:10], columns=["uid", "pid", "path_score", "path_probability", "path"])

Unnamed: 0,uid,pid,path_score,path_probability,path
0,0,584,0.5891642130483046,4e-06,self_loop user 0 watched product 875 watched u...
1,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 875 watched u...
2,0,584,0.5891642130483046,5e-06,self_loop user 0 watched product 569 watched u...
3,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 760 watched u...
4,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 537 watched u...
5,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 2157 watched ...
6,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 499 watched u...
7,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 679 watched u...
8,0,584,0.5891642130483046,1e-06,self_loop user 0 watched product 2402 watched ...
9,0,584,0.5891642130483046,2e-06,self_loop user 0 watched product 1514 watched ...


In [5]:
header = ["uid", "pid", "path_score", "path_prob", "path"]
pred_paths_map_pgpr = defaultdict(dict)
for record in pred_paths_pgpr:
    uid, pid, path_score, path_prob, path = record
    if pid not in pred_paths_map_pgpr[uid]:
        pred_paths_map_pgpr[uid][pid] = []
    pred_paths_map_pgpr[uid][pid].append((float(path_score), float(path_prob), path))

n_users = len(pred_paths_map_pgpr.keys())
random_user = choice(list(pred_paths_map_pgpr.keys()))  # randint(0, n_users) is inclusive and could pick a missing uid
random_product = choice(list(pred_paths_map_pgpr[random_user].keys()))
print(pred_paths_map_pgpr[random_user][random_product])

[(0.7232408397882328, 1.3698380598725635e-06, 'self_loop user 766 watched product 671 watched user 3407 watched product 683'), (0.7232408397882328, 1.018620423565153e-06, 'self_loop user 766 watched product 671 watched user 4793 watched product 683'), (0.7232408397882328, 8.111001648103411e-07, 'self_loop user 766 watched product 953 watched user 3209 watched product 683'), (0.7232408397882328, 2.8296349228185136e-06, 'self_loop user 766 watched product 497 watched user 1486 watched product 683'), (0.7232408397882328, 3.8145597613947757e-07, 'self_loop user 766 watched product 118 watched user 1421 watched product 683'), (0.7232408397882328, 1.4622210073866881e-06, 'self_loop user 766 watched product 674 watched user 1372 watched product 683')]


In [6]:
%cd models/PGPR

c:\Users\IT\OneDrive\Desktop\Projects\sem8_RS_project\models\PGPR


In [7]:
from pgpr_utils import load_labels

train_labels[curr_model] = load_labels(dataset_name, 'train')
test_labels[curr_model] = load_labels(dataset_name, 'test')

%cd ../..

c:\Users\IT\OneDrive\Desktop\Projects\sem8_RS_project


In [8]:
best_pred_paths = {}
for uid in pred_paths_map_pgpr:
    # Users without training labels have no items to exclude
    train_pids = set(train_labels[curr_model].get(uid, []))
    best_pred_paths[uid] = []
    for pid in pred_paths_map_pgpr[uid]:
        if pid in train_pids:
            continue
        # Get the path with highest probability
        sorted_path = sorted(pred_paths_map_pgpr[uid][pid], key=lambda x: x[1], reverse=True)
        best_pred_paths[uid].append(sorted_path[0])

random_user = choice(list(best_pred_paths.keys()))
print(best_pred_paths[random_user][:5])

n=10
for uid in best_pred_paths:
    sorted_paths = sorted(best_pred_paths[uid], key=lambda x: (x[0], x[1]), reverse=True)
    sorted_paths = [[path[0], path[1], path[-1].split(" ")] for path in sorted_paths]
    topk_products = [int(path[-1][-1]) for path in sorted_paths[:n]]
    topk_explanations = [path[-1] for path in sorted_paths[:n]]
    users_topk[curr_model][uid] = list(zip(topk_products, topk_explanations))

random_user = randint(0, n_users)
users_topk[curr_model][random_user][0]

[(0.4039733102461305, 1.4403590284928214e-06, 'self_loop user 293 watched product 1225 related_to wikipage 7209 related_to product 1241'), (0.4551168962339, 4.050557890877826e-06, 'self_loop user 293 watched product 1225 related_to wikipage 709 related_to product 21'), (0.5104558979026127, 9.069613042811397e-06, 'self_loop user 293 watched product 1225 belong_to category 1759 belong_to product 2495'), (0.2740205975655808, 4.073062154930085e-05, 'self_loop user 293 watched product 1225 related_to wikipage 3560 related_to product 1258'), (0.5359952611805613, 1.2168482044216944e-06, 'self_loop user 293 watched product 1225 related_to wikipage 9608 related_to product 2835')]


(1377,
 ['self_loop',
  'user',
  '1283',
  'watched',
  'product',
  '498',
  'watched',
  'user',
  '3376',
  'watched',
  'product',
  '1377'])

In [11]:
# Evaluation
# For each user, build a binary hits array over the top-n recommendations
# (1 if the recommended product is in the user's test set, 0 otherwise)
# and compute the utility metrics from it.
# The per-user values are stored and then averaged over the whole dataset.

for uid, rec_exp_tuples in users_topk[curr_model].items():
    hits = []
    for rec_exp_tuple in rec_exp_tuples:
        recommended_pid = rec_exp_tuple[0]
        if recommended_pid in test_labels[curr_model][uid]:
            hits.append(1)
        else:
            hits.append(0)
    # Pad with misses when fewer than n products were recommended
    while len(hits) < n:
        hits.append(0)
    ndcg = ndcg_at_k(hits, n)
    precision = precision_at_k(hits, n)
    recall = recall_at_k(hits, test_labels[curr_model][uid])
    ndcgs[curr_model].append(ndcg)
    precisions[curr_model].append(precision)
    recalls[curr_model].append(recall)
print(f"Overall NDCG: {np.mean(ndcgs[curr_model])}, Precision: {np.mean(precisions[curr_model])}, Recall: {np.mean(recalls[curr_model])}")

Overall NDCG: 0.2987081828554165, Precision: 0.10038079470198677, Recall: 0.04730047882034777
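
The definitions of `ndcg_at_k`, `precision_at_k` and `recall_at_k` are not shown in this section. As a rough illustration of what such metrics compute from a binary hits array, here is a minimal sketch using the common textbook definitions. The `_sketch` functions are hypothetical stand-ins, and the choice of ideal ranking for NDCG (the same hits sorted in descending order) is an assumption that may differ from the notebook's helpers.

```python
import math


def dcg_at_k_sketch(hits, k):
    """Discounted cumulative gain: a hit at rank r (0-based) contributes 1 / log2(r + 2)."""
    return sum(h / math.log2(rank + 2) for rank, h in enumerate(hits[:k]))


def ndcg_at_k_sketch(hits, k):
    """DCG normalised by the DCG of the same hits placed at the top of the list (assumption)."""
    ideal = dcg_at_k_sketch(sorted(hits, reverse=True), k)
    return dcg_at_k_sketch(hits, k) / ideal if ideal > 0 else 0.0


def precision_at_k_sketch(hits, k):
    """Fraction of the top-k recommendations that are hits."""
    return sum(hits[:k]) / k


def recall_at_k_sketch(hits, relevant_pids):
    """Fraction of the user's test items that were recommended."""
    return sum(hits) / len(relevant_pids) if relevant_pids else 0.0


# Toy example: 2 hits in a top-10 list, 4 relevant test items (made-up ids)
hits = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
relevant = [11, 22, 33, 44]
print(ndcg_at_k_sketch(hits, 10))
print(precision_at_k_sketch(hits, 10))   # 0.2
print(recall_at_k_sketch(hits, relevant))  # 0.5
```

NDCG rewards hits that appear near the top of the list, whereas precision and recall only count them, which is one reason the NDCG value reported above is noticeably higher than the precision and recall values.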


In [12]:
# textual explanation generation
# randint includes both endpoints, so subtract 1 to stay within the valid user ids
random_user = randint(0, len(users_topk[curr_model]) - 1)
for i, pid_exp_tuple in enumerate(users_topk[curr_model][random_user]):
    pid, exp = pid_exp_tuple[0], pid_exp_tuple[1]
    users_topk[curr_model][random_user][i] = (pid, template(curr_model, exp))

pd.DataFrame(users_topk[curr_model][random_user], columns=["pid", "textual explanation"])

Unnamed: 0,pid,textual explanation
0,2199,American Beauty (1999) is recommend to you bec...
1,1167,Men in Black (1997) is recommend to you becaus...
2,498,Silence of the Lambs is recommend to you becau...
3,1279,Titanic (1997) is recommend to you because you...
4,449,Schindler's List (1993) is recommend to you be...
5,2376,Sleepy Hollow (1999) is recommend to you becau...
6,2401,End of Days (1999) is recommend to you because...
7,2087,Haunting is recommend to you because you watch...
8,874,Star Wars: Episode V - The Empire Strikes Back...
9,1825,Shakespeare in Love (1998) is recommend to you...
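
The `template` function used above is defined elsewhere in the notebook. To illustrate the general idea of turning a tokenized reasoning path into a sentence, the sketch below handles the path layout seen in the earlier outputs (a `self_loop` followed by three hops). `explain_path_sketch` and its wording are assumptions made for illustration and will not reproduce the exact text generated by `template`.

```python
def explain_path_sketch(tokens, product_names=None):
    """Build a sentence from a tokenized path: self_loop + three (relation, type, id) hops.

    product_names is an optional {product_id: title} lookup; ids without a title
    fall back to 'product <id>'.
    """
    product_names = product_names or {}
    hops = [tuple(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]
    if len(hops) != 4 or any(len(hop) != 3 for hop in hops):
        raise ValueError("Expected a self_loop followed by three hops")
    _, (rel_seen, _, seen_id), (rel_shared, shared_type, shared_id), (_, _, rec_id) = hops
    seen = product_names.get(int(seen_id), f"product {seen_id}")
    rec = product_names.get(int(rec_id), f"product {rec_id}")
    if shared_type == "user":
        link = f"which was also {rel_shared} by user {shared_id}"
    else:
        link = f"which shares {shared_type} {shared_id} with it"
    return f"{rec} is recommended to you because you {rel_seen} {seen}, {link}"


# Tokenized path taken from the users_topk output shown earlier
tokens = ["self_loop", "user", "1283", "watched", "product", "498",
          "watched", "user", "3376", "watched", "product", "1377"]
sentence = explain_path_sketch(tokens)
print(sentence)
```

Passing a `{product_id: title}` dictionary as `product_names` replaces the numeric ids with movie titles, which is what makes the explanations in the table above readable.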


# Conclusion
**We investigated reasoning path quality and how the properties of reasoning paths can be measured.<br>
Because reasoning paths are the source from which the textual explanations are generated, improving reasoning path quality (with respect to specific properties) consequently enhances the quality of the textual explanations.<br>
We formalized three metrics to measure these properties: Linking Interaction Recency (LIR), Shared Entity Popularity (SEP), and Path Type Diversity (PTD).<br>
We also showed how these metrics can be incorporated into the training process by adapting the multi-hop scoring function and the reward function of PGPR.<br>
The results showed that explanations produced by the in-train model cause fewer side effects in terms of LID, SED, and PTC and produce smaller decreases in NDCG, and that combining post-processing with in-train optimization gives the best results in terms of both explanation path quality and recommendation utility.**