# **Project: Personalized information Retrieval using the SE-PQA dataset**

**Goal:** implement retrieval methods, such as TF-IDF and BM25, with neural re-rankers and personalized models to evaluate how the model performs before and after adding advanced recomender systems' features to the traditional information retrieval methods.


# Downloading the dataset

In [1]:
!gdown 1HhgXzyEpsZNcenU9XhJuOYyDUKEzUse4

Downloading...
From (original): https://drive.google.com/uc?id=1HhgXzyEpsZNcenU9XhJuOYyDUKEzUse4
From (redirected): https://drive.google.com/uc?id=1HhgXzyEpsZNcenU9XhJuOYyDUKEzUse4&confirm=t&uuid=5eeab327-b468-476e-8290-5d4d5758d667
To: /content/pir_data.zip
100% 3.30G/3.30G [00:24<00:00, 136MB/s]


## Loading the data from the "community question answering" dataset



### Necessary libraries to use use and check the dataset

In [2]:
import json         # to read the json files
import os           # to find the current directories
import pandas as pd # to view the datasets
from collections import defaultdict
from pathlib import Path

### Install PyTerrier

In [3]:
!pip install python-terrier



In [4]:
#Initialise PyTerrier.
import pyterrier as pt
if not pt.started():
  pt.init()

  if not pt.started():
Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


### Install Retriv

Sometimes, in dowloading the Retriv library the runtime disconnects, therefore it is only necessary to re-run everything starting from the beginning

In [5]:
!pip install retriv ranx



### Understanding the structure of the drive folder "/content"

In [6]:
# Understanding the files in the directory to which we are connected
print(os.getcwd())  # prints your current directory
print(os.listdir()) # lists files in the current directory

# Note: 'PIR_data' gotten from unzipping 'pir_data.zip'

/content
['.config', 'PIR_data', 'pir_data.zip', 'sample_data']


Here we unzip the zip file "pir_data.zip"

In [7]:
!unzip /content/pir_data.zip -d /content/

Archive:  /content/pir_data.zip
replace /content/PIR_data/tags.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/PIR_data/tags.csv  
replace /content/PIR_data/questions_with_answer.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/PIR_data/questions_with_answer.csv  
  inflating: /content/PIR_data/questions.csv  
  inflating: /content/PIR_data/comments.csv  
  inflating: /content/PIR_data/users.csv  
  inflating: /content/PIR_data/answers.csv  
   creating: /content/PIR_data/answer_retrieval/
   creating: /content/PIR_data/answer_retrieval/val/
  inflating: /content/PIR_data/answer_retrieval/val/subset_data.jsonl  
  inflating: /content/PIR_data/answer_retrieval/val/qrels.json  
   creating: /content/PIR_data/answer_retrieval/train/
  inflating: /content/PIR_data/answer_retrieval/train/subset_data.jsonl  
  inflating: /content/PIR_data/answer_retrieval/train/qrels.json  
   creating: /content/PIR_data/answer_retrieval/test/
  inflating: /content/PIR_da

We check the content of the folders

In [88]:
!ls /content/pir_data.zip

/content/pir_data.zip


In [89]:
!ls /content/sample_data

anscombe.json		     california_housing_train.csv  mnist_train_small.csv
california_housing_test.csv  mnist_test.csv		   README.md


In [90]:
!ls /content/PIR_data
# Unzipped "pir_data.zip"

answer_retrieval  comments.csv	 questions.csv		    tags.csv
answers.csv	  postlinks.csv  questions_with_answer.csv  users.csv


Extract the datasets from "PIR_data"

In [91]:
# Current directory
print("Current directory:", os.getcwd())

# Files in the current directory
print("Files in current dir:", os.listdir('.'))

# Going to the folder with the datasets we are interested in
os.chdir('/content/PIR_data')
print("Files in current child dir:", os.listdir('.'))
files_inDirectory = os.listdir('.')

# Filtering to get only the csv files
csv_files = [f for f in files_inDirectory if f.endswith('.csv')]
print(csv_files)

Current directory: /content/PIR_data/answer_retrieval/val
Files in current dir: ['subset_data.jsonl', 'qrels.json', 'dataset_with_expandedQueries.json']
Files in current child dir: ['postlinks.csv', 'questions_with_answer.csv', 'questions.csv', 'users.csv', 'answer_retrieval', 'comments.csv', 'answers.csv', 'tags.csv']
['postlinks.csv', 'questions_with_answer.csv', 'questions.csv', 'users.csv', 'comments.csv', 'answers.csv', 'tags.csv']


From the folder of "PIR_data" we go to the folder "answer retrieval", and check the content of each folder in it

In [92]:
# Current directory
print("Current directory:", os.getcwd())

# Files in the current directory
print("Files in current dir:", os.listdir('.'))

# Going to the folder with the datasets we are interested in
os.chdir('/content/PIR_data/answer_retrieval')
print("Files in current child dir:", os.listdir('.'))

# In this folder we have the train, test and validation sets

Current directory: /content/PIR_data
Files in current dir: ['postlinks.csv', 'questions_with_answer.csv', 'questions.csv', 'users.csv', 'answer_retrieval', 'comments.csv', 'answers.csv', 'tags.csv']
Files in current child dir: ['val', 'subset_answers.json', 'test', 'train', 'answers.csv']


## Extracting the necessary data

From the folder "PIR_data" we extract the data that is necessary to compute the standard and the personalized information retrieval model.


We take the data from "subset_answer" as the corpus

In [93]:
corpus_df = pd.read_json('/content/PIR_data/answer_retrieval/subset_answers.json', orient='index')
corpus_df = corpus_df.reset_index()
corpus_df.columns = ['docno', 'text']
corpus_df

Unnamed: 0,docno,text
0,writers_2010,TL;DRIf you're going to do present tense do it...
1,writers_2018,"Your writing style is stream-of-consciousness,..."
2,writers_2023,Place emphasis on uncomfortable things. Depend...
3,writers_2026,The answer to this depends a lot on what you'r...
4,writers_2095,Short Answer: Read a book on writing stand up ...
...,...,...
9393,academia_138970,"Generally no, there is no time set aside for a..."
9394,academia_139396,"He took a break, but the comics are back now."
9395,academia_143753,If you provide enough context and frame the is...
9396,academia_148936,I've edited journals without holding a PhD in ...


The following folders contain the data with which we have to train and test the information retrieval system

In [94]:
# Get to know the name of the files inside the folders
os.chdir('/content/PIR_data/answer_retrieval/train')
print(os.getcwd())
print("Files in current child dir:", os.listdir('.'))
files_inTrain = os.listdir('.')

os.chdir('/content/PIR_data/answer_retrieval/test')
print(os.getcwd())
print("Files in current child dir:", os.listdir('.'))
files_inTest = os.listdir('.')

os.chdir('/content/PIR_data/answer_retrieval/val')
print(os.getcwd())
print("Files in current child dir:", os.listdir('.'))
files_inVal = os.listdir('.')

/content/PIR_data/answer_retrieval/train
Files in current child dir: ['subset_data.jsonl', 'qrels.json']
/content/PIR_data/answer_retrieval/test
Files in current child dir: ['subset_data.jsonl', 'qrels.json']
/content/PIR_data/answer_retrieval/val
Files in current child dir: ['subset_data.jsonl', 'qrels.json', 'dataset_with_expandedQueries.json']


In [95]:
# Assigning variables to the train datasets
print("Seeing the data file")
trainData_df = pd.read_json(
    '/content/PIR_data/answer_retrieval/train/subset_data.jsonl',
    lines=True
)
# "lines=True" tells pandas that each line in the file is a separate JSON object
print(trainData_df.head())

print("Seeing the qrels file")
trainQrels_df = pd.read_json('/content/PIR_data/answer_retrieval/train/qrels.json',typ="series")
print(trainQrels_df.head())

Seeing the data file
                id                                               text  \
0  academia_100305  What are CNRS research units and how are they ...   
1  academia_100456  Is there a free (as in freedom) alternative to...   
2  academia_103390  Search for StackExchange citations with Google...   
3   academia_10481  Reproducible research and corporate identity M...   
4   academia_10649  Advantages of second marking In the UK a porti...   

                                               title           timestamp  \
0  What are CNRS research units and how are they ... 2017-12-11 16:30:20   
1  Is there a free (as in freedom) alternative to... 2017-12-13 19:02:32   
2  Search for StackExchange citations with Google... 2018-02-06 16:40:59   
3       Reproducible research and corporate identity 2013-06-06 09:11:05   
4                       Advantages of second marking 2013-06-17 12:24:37   

   score  views  favorite  user_id  \
0     14   2484       2.0  1106095   
1     1

In [96]:
# Assigning variables to the validation datasets
print("Seeing the data file")
valData_df = pd.read_json(
    '/content/PIR_data/answer_retrieval/val/subset_data.jsonl',
    lines=True
)
# "lines=True" tells pandas that each line in the file is a separate JSON object
print(valData_df.head())

print("Seeing the qrels file")
valQrels_df = pd.read_json('/content/PIR_data/answer_retrieval/val/qrels.json',typ="series")
print(valQrels_df.head())

Seeing the data file
                id                                               text  \
0  academia_143743  On answering a question that no one has asked ...   
1  academia_148899  How much domain expertise and network does a s...   
2      anime_56513  Does Overhaul need to touch with his hands to ...   
3      anime_59459  Why did Kanon reincarnate in another race? So ...   
4     apple_408963  How do I disallow screen sharing for Messages?...   

                                               title           timestamp  \
0      On answering a question that no one has asked 2020-02-03 05:38:49   
1  How much domain expertise and network does a s... 2020-05-09 09:10:08   
2  Does Overhaul need to touch with his hands to ... 2020-01-18 18:01:25   
3         Why did Kanon reincarnate in another race? 2020-08-31 12:37:37   
4     How do I disallow screen sharing for Messages? 2020-12-15 20:46:16   

   score  views  favorite  user_id  \
0      1    325       NaN  1582241   
1      

In [97]:
# Assigning variables to the test datasets
print("Seeing the data file")
testData_df = pd.read_json(
    '/content/PIR_data/answer_retrieval/test/subset_data.jsonl',
    lines=True
)
# "lines=True" tells pandas that each line in the file is a separate JSON object
print(testData_df.head())

print("Seeing the qrels file")
testQrels_df = pd.read_json('/content/PIR_data/answer_retrieval/test/qrels.json',typ="series")
print(testQrels_df.head())

Seeing the data file
                id                                               text  \
0  academia_185177  After what George was Georgetown University na...   
1      anime_67047  Can someone explain why Garou made Saitama do ...   
2      anime_64177  Why did Madara want to resurrect if with the E...   
3     apple_429543  What is the ~/Applications directory for? I wa...   
4     apple_444310  How to make notification but no noise when tim...   

                                               title           timestamp  \
0  After what George was Georgetown University na... 2022-05-12 21:27:50   
1  Can someone explain why Garou made Saitama do ... 2022-07-21 12:04:38   
2  Why did Madara want to resurrect if with the E... 2021-07-10 15:29:43   
3          What is the ~/Applications directory for? 2021-10-26 11:34:11   
4  How to make notification but no noise when tim... 2022-07-25 14:40:32   

   score  views  favorite  user_id  \
0     -2    180       NaN  1532620   
1      

# Phase 1: retrieval and baseline neural networks implementation

## Preprocessing and choosing the best model

### Selecting the stopwords

To be sure of what is going to be removed from the text, a personalized list of custom stopwords is created.

In [98]:
# Costum stop words
custom_stopwords = [
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves",
    "you", "your", "yours", "yourself", "yourselves", "he", "him",
    "his", "himself", "she", "her", "hers", "herself", "it", "its",
    "itself", "they", "them", "their", "theirs", "themselves", "this",
    "that", "these", "those", "am", "is", "are", "was", "were", "be",
    "been", "being", "have", "has", "had", "having", "a", "an", "the",
    "but", "if", "as", "until", "while", "of", "at", "by", "for", "with",
    "about", "against", "between", "into", "through", "above", "below", "to",
    "from", "up", "down", "in", "out", "on", "off", "over", "under",
    "again", "further", "then", "once", "here", "there", "all", "any",
    "both", "each", "few", "more", "most", "other", "some", "such",
    "no", "nor", "only", "own", "same", "so", "than", "too",
    "very", "s", "t", "will", "just", "don", "should", "now"
]

### Preprocessing functions

To improve performance the text must be pre-processed. We have chosen to divide the preprocessing step in two sub-steps for clarity.

These are:
- *Basic text preprocessing*, used to remove special characters and possible inconsistencies in the text, such as double white spaces.
- *Normalization of the text*, used to tokenize, remove stopwords and stem the words (in order)

Please note: Applying stemming implies that the text will not be readable anymore for the user. Therefore, an answer to a user's question must be retrieved directly from "*corpus_df*" (before personalization) through the corresponding document id.

In [99]:
import string
import re
from textblob import TextBlob

def preprocess_text_basics(text):
  # remove punctuation
  text = text.translate(str.maketrans('', '', string.punctuation))
  # lowercase
  text = text.lower()
  # remove numbers
  text = re.sub(r'\d', '', text)
  # remove links
  text = re.sub(r'http\S+', '<URL>', text) # remove occurences that start with https
  text = re.sub(r"www.[A-Za-z]*\.com", "<URL>", text) # remove occurences that end with .com
  #Remove extra spaces
  text = re.sub(r'\s+', ' ', text).strip()

  return text

In [100]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Function for further preprocessing
def preprocess_text_norm(text):
  # tokenization
  sentences = sent_tokenize(text)
  words = []
  for sentence in sentences:
    words_in_sentence = word_tokenize(sentence)
    words.extend(words_in_sentence)
  # Remove stop words
  stop_words = set(stopwords.words('english')).union(set(custom_stopwords))
  tokenized_text_without_stopwords = [word for word in words if word.lower() not in stop_words]
  # Stemming
  stemmer = PorterStemmer()
  text = [stemmer.stem(word) for word in tokenized_text_without_stopwords] # Used with the removal of stopwords
  normalized_text = ' '.join(text)
  return normalized_text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Preprocessing of the queries

Applying the pre-processing to all the queries' datasets:
- train queries
- validation queries
- test queries

Preprocessing of the train queries

In [101]:
trainData_df.head()

Unnamed: 0,id,text,title,timestamp,score,views,favorite,user_id,user_questions,user_answers,tags,rel_ids,rel_scores,rel_timestamps,best_answer
0,academia_100305,What are CNRS research units and how are they ...,What are CNRS research units and how are they ...,2017-12-11 16:30:20,14,2484,2.0,1106095,"[workplace_40845, workplace_40899, workplace_9...","[travel_45926, travel_46391, travel_47403, tra...","[funding, france]",[academia_100217],[1],"[1512814966, 1513014615, 1513020822]",academia_100217
1,academia_100456,Is there a free (as in freedom) alternative to...,Is there a free (as in freedom) alternative to...,2017-12-13 19:02:32,13,1117,2.0,1106095,"[workplace_40845, workplace_40899, workplace_9...","[travel_45926, travel_46391, travel_47403, tra...","[peer-review, open-access]",[academia_100462],[1],"[1513205016, 1536615064, 1553005541, 1615097827]",academia_100462
2,academia_103390,Search for StackExchange citations with Google...,Search for StackExchange citations with Google...,2018-02-06 16:40:59,2,157,1.0,1532620,"[writers_27613, writers_29562, sound_42166, so...","[skeptics_39944, philosophy_3098, philosophy_9...","[citations, google-scholar]",[academia_103391],[1],[1517936080],academia_103391
3,academia_10481,Reproducible research and corporate identity M...,Reproducible research and corporate identity,2013-06-06 09:11:05,18,372,1.0,1106095,"[academia_1698, academia_1772, academia_1911, ...","[academia_1699, academia_1700, academia_1701, ...","[copyright, creative-commons]",[academia_10499],[1],"[1370596608, 1370601095]",academia_10499
4,academia_10649,Advantages of second marking In the UK a porti...,Advantages of second marking,2013-06-17 12:24:37,6,1235,2.0,1106095,"[academia_1698, academia_1772, academia_1911, ...","[academia_1699, academia_1700, academia_1701, ...",[assessment],[academia_10650],[1],"[1371477146, 1371477156, 1371552185]",academia_10650


In [102]:
# Preprocessing the of the queries in the train set
trainData_df['text'] = trainData_df['text'].apply(preprocess_text_basics)
trainData_df['text'] = trainData_df['text'].apply(preprocess_text_norm)

In [103]:
trainData_df['text'][0]

'cnr research unit staf centr nation de la recherch scientifiqu cnr major fund bodi franc nearli staff member bigger us nih us nsf combin yet budgetboth cnr nih multipl institut although nih institut health relat cnr cover rang scienc cnr mix research unit proper research unit servic unit well intern unit nih larg number intramur labswhat differ research unit staf full time research academ teach duti like nih intramur lab mrc centr'

Preprocessing of the validation queries

In [104]:
valData_df.head()

Unnamed: 0,id,text,title,timestamp,score,views,favorite,user_id,user_questions,user_answers,tags,rel_ids,rel_scores,rel_timestamps,best_answer
0,academia_143743,On answering a question that no one has asked ...,On answering a question that no one has asked,2020-02-03 05:38:49,1,325,,1582241,"[travel_149904, travel_151531, sound_39671, so...","[politics_13376, politics_37453, politics_3793...","[publications, publishability, history]",[academia_143753],[1],"[1580711589, 1580737712]",academia_143753
1,academia_148899,How much domain expertise and network does a s...,How much domain expertise and network does a s...,2020-05-09 09:10:08,4,341,,935589,"[writers_9077, workplace_6846, workplace_9638,...","[writers_27741, writers_43414, writers_44312, ...","[publications, editors, special-issue]",[academia_148936],[1],"[1589030916, 1589126809]",academia_148936
2,anime_56513,Does Overhaul need to touch with his hands to ...,Does Overhaul need to touch with his hands to ...,2020-01-18 18:01:25,0,769,,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[my-hero-academia],[anime_56712],[1],"[1579573343, 1580831901]",anime_56712
3,anime_59459,Why did Kanon reincarnate in another race? So ...,Why did Kanon reincarnate in another race?,2020-08-31 12:37:37,2,1466,2.0,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[mao-gakuin-no-futekigosha],[anime_59463],[1],[1598920506],anime_59463
4,apple_408963,How do I disallow screen sharing for Messages?...,How do I disallow screen sharing for Messages?,2020-12-15 20:46:16,0,30,,331923,"[travel_4961, travel_46275, travel_46638, trav...","[travel_4647, travel_90661, skeptics_6876, ske...","[security, screen-sharing]",[apple_408965],[1],[1608068121],apple_408965


In [105]:
# Preprocessing the of the queries in the validation set
valData_df['text'] = valData_df['text'].apply(preprocess_text_basics)
valData_df['text'] = valData_df['text'].apply(preprocess_text_norm)

In [106]:
valData_df['text'][0]

'answer question one ask histor research came across novel question invest time answer help origin project explan novel question substanti would like publish separatelyit well known x thesi propos articl z explain xs decis unfortun cant point public ask x z obviou relev present day issueswhi review want see articl address unknown problem justifi project'

Preprocessing of the test queries

In [107]:
testData_df.head()

Unnamed: 0,id,text,title,timestamp,score,views,favorite,user_id,user_questions,user_answers,tags,rel_ids,rel_scores,rel_timestamps,best_answer
0,academia_185177,After what George was Georgetown University na...,After what George was Georgetown University na...,2022-05-12 21:27:50,-2,180,,1532620,"[writers_27613, writers_29562, writers_43973, ...","[vegetarianism_1871, skeptics_39944, skeptics_...","[academic-history, history]",[academia_185179],[1],"[1652393822, 1652394194]",academia_185179
1,anime_67047,Can someone explain why Garou made Saitama do ...,Can someone explain why Garou made Saitama do ...,2022-07-21 12:04:38,1,93,,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[one-punch-man],[anime_67049],[1],[1658449137],anime_67049
2,anime_64177,Why did Madara want to resurrect if with the E...,Why did Madara want to resurrect if with the E...,2021-07-10 15:29:43,1,279,,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[naruto],[anime_64193],[1],[1626111337],anime_64193
3,apple_429543,What is the ~/Applications directory for? I wa...,What is the ~/Applications directory for?,2021-10-26 11:34:11,3,1124,,67390,"[travel_138383, sports_7496, sports_12466, spo...","[writers_51235, writers_51280, workplace_15391...","[macos, applications, big-sur]",[apple_429546],[1],[1635249109],apple_429546
4,apple_444310,How to make notification but no noise when tim...,How to make notification but no noise when tim...,2022-07-25 14:40:32,1,31,,4019407,"[writers_48189, writers_48714, sports_14845, s...","[writers_52925, parenting_33054, music_43040, ...","[iphone, ios]",[apple_444709],[1],[1659522791],apple_444709


In [108]:
# Preprocessing the of the queries in the test set
testData_df['text'] = testData_df['text'].apply(preprocess_text_basics)
testData_df['text'] = testData_df['text'].apply(preprocess_text_norm)

In [109]:
testData_df['text'][0]

'georg georgetown univers name georg georgetown univers name name locat'

### Basic preprocessing of the corpus

Applying the preprocessing to the corpus, so the dataset with all the possible answers.

In [110]:
# Preprocessing the of the corpus
corpus_df['text']= corpus_df['text'].apply(preprocess_text_basics)
corpus_df['text']= corpus_df['text'].apply(preprocess_text_norm)

In [111]:
corpus_df['text'][0]

'tldrif your go present tens good reason mitig downsideslong versionpres tens lend sens immediaci work also may make feel like read screenplay drama oppos typic past tens novel that good part want urgenc sens close action present tens help present tens natur suspens give reader sens nobodi even narrat know comingunfortun present tens come lot downsid typic english languag novel charl dicken bleak hous midth centuri written half present tens probabl first exampl presenttens novel histori go back th centuri reader notic use present tens may make difficult read book unless done quit well time step fiction norm best art reader noticewhen someth happen present tens reader arent go expect lot narrat charact monologu present tens go car chase reader arent go believ time monologu drive wrong way one way mile per hour speed limit hand reader believ bit monologu know narrat current drive car chase rather rememb itpres tens flatten stori well present tens narrat close emot charact especi narrat p

### Converting Data to Retriv-friendly formats

In [112]:
# Converting corpus_df
docs = [
    {"id": row["docno"], "text": row["text"]}
    for _, row in corpus_df.iterrows()
]

print("Number of documents:", len(docs))
print(docs[0])  # Example

Number of documents: 9398
{'id': 'writers_2010', 'text': 'tldrif your go present tens good reason mitig downsideslong versionpres tens lend sens immediaci work also may make feel like read screenplay drama oppos typic past tens novel that good part want urgenc sens close action present tens help present tens natur suspens give reader sens nobodi even narrat know comingunfortun present tens come lot downsid typic english languag novel charl dicken bleak hous midth centuri written half present tens probabl first exampl presenttens novel histori go back th centuri reader notic use present tens may make difficult read book unless done quit well time step fiction norm best art reader noticewhen someth happen present tens reader arent go expect lot narrat charact monologu present tens go car chase reader arent go believ time monologu drive wrong way one way mile per hour speed limit hand reader believ bit monologu know narrat current drive car chase rather rememb itpres tens flatten stori we

In [113]:
# Converting train queries
train_queries = [
    {"id": row["id"], "text": row["text"]}
    for _, row in trainData_df.iterrows()
]

print("Number of queries:", len(train_queries))
print(train_queries[0])

Number of queries: 10000
{'id': 'academia_100305', 'text': 'cnr research unit staf centr nation de la recherch scientifiqu cnr major fund bodi franc nearli staff member bigger us nih us nsf combin yet budgetboth cnr nih multipl institut although nih institut health relat cnr cover rang scienc cnr mix research unit proper research unit servic unit well intern unit nih larg number intramur labswhat differ research unit staf full time research academ teach duti like nih intramur lab mrc centr'}


In [114]:
# Converting val queries
val_queries = [
    {"id": row["id"], "text": row["text"]}
    for _, row in valData_df.iterrows()
]

print("Number of queries:", len(val_queries))
print(val_queries[0])

Number of queries: 100
{'id': 'academia_143743', 'text': 'answer question one ask histor research came across novel question invest time answer help origin project explan novel question substanti would like publish separatelyit well known x thesi propos articl z explain xs decis unfortun cant point public ask x z obviou relev present day issueswhi review want see articl address unknown problem justifi project'}


In [115]:
# Converting test queries
test_queries = [
    {"id": row["id"], "text": row["text"]}
    for _, row in testData_df.iterrows()
]

print("Number of queries:", len(test_queries))
print(test_queries[0])

Number of queries: 100
{'id': 'academia_185177', 'text': 'georg georgetown univers name georg georgetown univers name name locat'}


In [116]:
trainQrels_df

Unnamed: 0,0
academia_100305,academia_100217
academia_100456,academia_100462
academia_103390,academia_103391
academia_10481,academia_10499
academia_10649,academia_10650
...,...
rpg_116892,rpg_116894
gaming_183540,gaming_183541
rpg_100984,rpg_100987
scifi_95624,scifi_95628


In [117]:
# Converting Train Qrels
# Note: trainQrels_df is a pandas.core.series.Series
trainQrels_df = trainQrels_df.reset_index(drop=False) # Given this function, NEVER re-run just this cell twice,
                                                      # otherwise it gives error because another coluln is created
# inserting a column (for consistency) in the qrels dataframe
trainQrels_df.insert(loc=2, column='score', value = 1)

if len(trainQrels_df.columns) == 3:
  trainQrels_df.columns = ["q_id", "doc_id", "score_relevance"]
  print(trainQrels_df)
else:
  raise ValueError("trainQrels_df must have 2 columns (q_id and doc_id), re-run from <understanding the structure of the data>")

train_relevance_data = [
    {"q_id": row["q_id"], "doc_id": row["doc_id"]}
    for _, row in trainQrels_df.iterrows()
]

print("Number of relevant judgements:", len(train_relevance_data))
print(train_relevance_data[0])

                 q_id           doc_id  score_relevance
0     academia_100305  academia_100217                1
1     academia_100456  academia_100462                1
2     academia_103390  academia_103391                1
3      academia_10481   academia_10499                1
4      academia_10649   academia_10650                1
...               ...              ...              ...
9211       rpg_116892       rpg_116894                1
9212    gaming_183540    gaming_183541                1
9213       rpg_100984       rpg_100987                1
9214      scifi_95624      scifi_95628                1
9215   outdoors_14757   outdoors_14758                1

[9216 rows x 3 columns]
Number of relevant judgements: 9216
{'q_id': 'academia_100305', 'doc_id': 'academia_100217'}


In [118]:
# Converting Val Qrels
# Note: valQrels_df is a pandas.core.series.Series
valQrels_df = valQrels_df.reset_index(drop=False) # Given this function, NEVER re-run just this cell twice,
                                                      # otherwise it gives error because another coluln is created
# inserting a column (for consistency) in the qrels dataframe
valQrels_df.insert(loc=2, column='score', value = 1)

if len(valQrels_df.columns) == 3:
  valQrels_df.columns = ["q_id", "doc_id", "score_relevance"]
  print(valQrels_df)
else:
  raise ValueError("valQrels_df must have 2 columns (q_id and doc_id), re-run from <understanding the structure of the data>")

val_relevance_data = [
    {"q_id": row["q_id"], "doc_id": row["doc_id"]}
    for _, row in valQrels_df.iterrows()
]

print("Number of relevant judgements:", len(val_relevance_data))
print(val_relevance_data[0])

                  q_id              doc_id  score_relevance
0      academia_143743     academia_143753                1
1      academia_148899     academia_148936                1
2          anime_56513         anime_56712                1
3          anime_59459         anime_59463                1
4         apple_408963        apple_408965                1
..                 ...                 ...              ...
93      judaism_110801      judaism_110803                1
94  christianity_75499  christianity_75501                1
95       gaming_363535       gaming_363542                1
96      skeptics_47727      skeptics_47728                1
97      politics_53742      politics_53745                1

[98 rows x 3 columns]
Number of relevant judgements: 98
{'q_id': 'academia_143743', 'doc_id': 'academia_143753'}


In [119]:
# Converting Test Qrels
# Note: testQrels_df is a pandas.core.series.Series
testQrels_df = testQrels_df.reset_index(drop=False) # Given this function, NEVER re-run just this cell twice,
                                                    # otherwise it gives error because another coluln is created
# inserting a column (for consistency) in the qrels dataframe
testQrels_df.insert(loc=2, column='score', value = 1)

if len(testQrels_df.columns) == 3:
  testQrels_df.columns = ["q_id", "doc_id", "score_relevance"]
  print(testQrels_df)
else:
  raise ValueError("testQrels_df must have 2 columns (q_id and doc_id), re-run from <understanding the structure of the data>")

test_relevance_data = [
    {"q_id": row["q_id"], "doc_id": row["doc_id"]}
    for _, row in testQrels_df.iterrows()
]

print("Number of relevant judgements:", len(test_relevance_data))
print(test_relevance_data[0])

                  q_id              doc_id  score_relevance
0      academia_185177     academia_185179                1
1          anime_67047         anime_67049                1
2          anime_64177         anime_64193                1
3         apple_429543        apple_429546                1
4         apple_444310        apple_444709                1
..                 ...                 ...              ...
93      politics_62030      politics_62578                1
94      politics_74901      politics_74939                1
95        scifi_251849        scifi_251851                1
96       gaming_386587       gaming_386590                1
97  christianity_91603  christianity_91611                1

[98 rows x 3 columns]
Number of relevant judgements: 98
{'q_id': 'academia_185177', 'doc_id': 'academia_185179'}


### Preparing for the indexing of the corpus

To index the corpus we have chosen to use the library "*Retriv*".
For the purpose of the project two information retrieval models were compared:
- bm25 (Best Model)
- tf-idf (Term-Frequency Inverse Document Frequency)

In [120]:
from retriv import SparseRetriever
from retriv import *

SparseRetriever.delete('index_docs_retriv') # if it exists

# Creating the bm25 indexer
sr_bm25 = SparseRetriever(
  index_name="index_bm25",              # help to manage the index and to identify it - name of the index
  model="bm25",                         # retrieval model to use for searching
  min_df=2,                             # if a term appears just once it is ignored
  tokenizer="whitespace",               # splitting text after each whitespace
  stemmer="krovetz",                    # "krovetz" (less "aggressive" (less strict) than just 'english')
  stopwords=custom_stopwords,           # stopwords to remove during preprocessing
  do_lowercasing=True,                  # lowercase texts
  do_ampersand_normalization=True,      # coverting "&" in "and"
  do_special_chars_normalization=True,  # remove special characters for letters,
  do_acronyms_normalization=True,       # remove full stop symbols from acronyms without splitting them
  do_punctuation_removal=True,          # remove punctuation
)

# Creating the tfidf indexer
sr_tfidf = SparseRetriever(
  index_name="index_tfidf",             # help to manage the index and to identify it - name of the index
  model="tf-idf",                       # retrieval model to use for searching
  min_df=2,                             # if a term appears just once it is ignored
  tokenizer="whitespace",               # splitting text after each whitespace
  stemmer="krovetz",                    # "krovetz" (less "aggressive" (less strict) than just 'english')
  stopwords=custom_stopwords,           # stopwords to remove during preprocessing
  do_lowercasing=True,                  # lowercase texts
  do_ampersand_normalization=True,      # coverting "&" in "and"
  do_special_chars_normalization=True,  # remove special characters for letters,
  do_acronyms_normalization=True,       # remove full stop symbols from acronyms without splitting them
  do_punctuation_removal=True,          # remove punctuation
)

index_docs_retriv successfully removed.


### Indexing of the corpus

In [121]:
# Apply the indexing to the docs and save the retriever
sr_bm25.index(docs)
sr_bm25.save()

sr_tfidf.index(docs)
sr_tfidf.save()

Building TDF matrix: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 9398/9398 [00:02<00:00, 4059.03it/s]
Building inverted index: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 24674/24674 [00:02<00:00, 9495.39it/s]
Building TDF matrix: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 9398/9398 [00:02<00:00, 4119.64it/s]
Building inverted index: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 24674/24674 [00:02<00:00, 9684.41it/s]


In [122]:
# check if it is correctely indexed
print(f"number of documents indexed: {sr_bm25.doc_count}")
docs[0]

number of documents indexed: 9398


{'id': 'writers_2010',
 'text': 'tldrif your go present tens good reason mitig downsideslong versionpres tens lend sens immediaci work also may make feel like read screenplay drama oppos typic past tens novel that good part want urgenc sens close action present tens help present tens natur suspens give reader sens nobodi even narrat know comingunfortun present tens come lot downsid typic english languag novel charl dicken bleak hous midth centuri written half present tens probabl first exampl presenttens novel histori go back th centuri reader notic use present tens may make difficult read book unless done quit well time step fiction norm best art reader noticewhen someth happen present tens reader arent go expect lot narrat charact monologu present tens go car chase reader arent go believ time monologu drive wrong way one way mile per hour speed limit hand reader believ bit monologu know narrat current drive car chase rather rememb itpres tens flatten stori well present tens narrat cl

## Run an experiment on test collection: test a query to find the best model

### List of the queries

In [123]:
# Selecting the queries
query_strings = [q['text'] for q in train_queries]

In [124]:
# Initialize lists to store results
results_bm25 = []
results_tfidf = []

# Iterate over each query string and perform the search
for query in query_strings:
    bm25_result = sr_bm25.search(
        query,
        return_docs=False,
        cutoff = 100
    )
    tfidf_result = sr_tfidf.search(
        query,
        return_docs=False,
        cutoff = 100
    )
    results_bm25.append(bm25_result)
    results_tfidf.append(tfidf_result)

In [125]:
final_results_bm25 = {
    q['id']: results_bm25[idx] for idx, q in enumerate(train_queries)
}
final_results_tfidf = {
    q['id']: results_tfidf[idx] for idx, q in enumerate(train_queries)
}

### Evaluate

In [126]:
# The library ranx was used to be able to measure and evaluate the quality of the indexes
from ranx import compare, Qrels, Run

qrels = Qrels.from_df(
    df= trainQrels_df,
    q_id_col="q_id",
    doc_id_col="doc_id",
    score_col="score_relevance"
)

In [127]:
# For bm25
resultFor_bm25 = Run(final_results_bm25, name="bm25")
# For tfidf
resultFor_tfidf = Run(final_results_tfidf, name="tfidf")

In [128]:
# Compare different runs and perform Two-sided Paired Student's t-Test
report = compare(
    qrels=qrels,
    runs=[resultFor_bm25, resultFor_tfidf],
    metrics=["map@100", "f1", "precision@5", "precision@10"]
)
print(report)

#    Model    MAP@100    F1      P@5     P@10
---  -------  ---------  ------  ------  ------
a    bm25     0.721áµ‡     0.018áµ‡  0.160áµ‡  0.084áµ‡
b    tfidf    0.380      0.018   0.108   0.066


Given the results of the experiment, bm25 is understood to be the best model to use. Therefore, bm25 will be the base of our neural re-ranking.

## Neural re-ranking

For the neural re-ranking we have chosen to use **cross encoder**.

The reason for this is that cross encoder is ideal for question-answer pair verification, as well as sentence entailment which will be useful for the second phase of the project.

### Preparing the text for the cross encoder

SparseRetriever doesn't directly provide an index compatible with PyTerrier, so we initialize a PyTerrier indec using **docs**

In [129]:
docs[0]

{'id': 'writers_2010',
 'text': 'tldrif your go present tens good reason mitig downsideslong versionpres tens lend sens immediaci work also may make feel like read screenplay drama oppos typic past tens novel that good part want urgenc sens close action present tens help present tens natur suspens give reader sens nobodi even narrat know comingunfortun present tens come lot downsid typic english languag novel charl dicken bleak hous midth centuri written half present tens probabl first exampl presenttens novel histori go back th centuri reader notic use present tens may make difficult read book unless done quit well time step fiction norm best art reader noticewhen someth happen present tens reader arent go expect lot narrat charact monologu present tens go car chase reader arent go believ time monologu drive wrong way one way mile per hour speed limit hand reader believ bit monologu know narrat current drive car chase rather rememb itpres tens flatten stori well present tens narrat cl

In [130]:
for doc in docs:
    if 'id' in doc:  # Check if the key 'id' exists
        doc['docno'] = doc.pop('id')  # Rename 'id' to 'docno'

docs[0]

{'text': 'tldrif your go present tens good reason mitig downsideslong versionpres tens lend sens immediaci work also may make feel like read screenplay drama oppos typic past tens novel that good part want urgenc sens close action present tens help present tens natur suspens give reader sens nobodi even narrat know comingunfortun present tens come lot downsid typic english languag novel charl dicken bleak hous midth centuri written half present tens probabl first exampl presenttens novel histori go back th centuri reader notic use present tens may make difficult read book unless done quit well time step fiction norm best art reader noticewhen someth happen present tens reader arent go expect lot narrat charact monologu present tens go car chase reader arent go believ time monologu drive wrong way one way mile per hour speed limit hand reader believ bit monologu know narrat current drive car chase rather rememb itpres tens flatten stori well present tens narrat close emot charact especi

In [131]:
# Moved from Retriv indexing to Pyterrier indexing

pt_index_path = '/content/pyterrier_index'
os.makedirs(pt_index_path, exist_ok=True)

if not os.path.exists(pt_index_path + '/data.properties'):
  indexer = pt.IterDictIndexer(
      pt_index_path,
      meta={'docno':50, 'text':2048},
      text_attrs=['text'],
      meta_reverse=['docno'],
      fields = True,
      overwrite=True
  )
  index_ref = indexer.index(corpus_df.to_dict(orient="records"))
else:
  index_ref = pt.IndexRef.of(pt_index_path + '/data.properties')

# Create the index
index = pt.IndexFactory.of(index_ref)

# Print index properties for verification
print(index.getMetaIndex().getKeys())

['docno', 'text']


In [132]:
print('Collection Statistics', index.getCollectionStatistics().toString())

Collection Statistics Number of documents: 9398
Number of terms: 89665
Number of postings: 694418
Number of fields: 1
Number of tokens: 1031104
Field names: [text]
Positions:   false



### Importing the cross encoder

In [133]:
!pip install -q sentence_transformers ipdb
from sentence_transformers import CrossEncoder

In [134]:
crossmodel = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

### Applying the cross encoder

In [135]:
def _crossencoder_apply(df, column='text'):
  return crossmodel.predict(list(zip(df['query'].values, df[column].values)))

from functools import partial
crossencoder_apply_text = partial(_crossencoder_apply, column='text')
cross_encT = pt.apply.doc_score(crossencoder_apply_text, batch_size=64)

In [136]:
br = pt.BatchRetrieve(index, wmodel='BM25') % 100 # we want to rerank only the top-100

cross_pipeline = br >> pt.text.get_text(index, 'text') >> cross_encT  # take the query text and run it through the pipeline (coss_encT)

normalized_br = br >> pt.pipelines.PerQueryMaxMinScoreTransformer()

normalized_cross_pipeline = cross_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()

cross_sum_1_pipeline = .1*normalized_cross_pipeline + (1-.1)*normalized_br #apply it on the validation test
cross_sum_2_pipeline = .2*normalized_cross_pipeline + (1-.2)*normalized_br
cross_sum_3_pipeline = .3*normalized_cross_pipeline + (1-.3)*normalized_br
cross_sum_4_pipeline = .4*normalized_cross_pipeline + (1-.4)*normalized_br
cross_sum_5_pipeline = .5*normalized_cross_pipeline + (1-.5)*normalized_br
cross_sum_6_pipeline = .6*normalized_cross_pipeline + (1-.6)*normalized_br
cross_sum_7_pipeline = .7*normalized_cross_pipeline + (1-.7)*normalized_br
cross_sum_8_pipeline = .8*normalized_cross_pipeline + (1-.8)*normalized_br
cross_sum_9_pipeline = .9*normalized_cross_pipeline + (1-.9)*normalized_br

  br = pt.BatchRetrieve(index, wmodel='BM25') % 100 # we want to rerank only the top-100


In [137]:
# Rename columns of valData_df and valQrels_df
valData_df = valData_df.rename(columns={"id":"qid", "text":"query"})
valQrels_df = valQrels_df.rename(columns={"q_id":"qid", "doc_id":"docno", "score_relevance":"relevance"})

# Sample for the pt.Experiment later
subsetVal_data = valData_df.sample(frac=0.2) # Selecting 20% of the dataset randomly
# Used to do a partial training of model before applying everything (shorter runtime)

In [138]:
# Rename columns of testData_df and testQrels_df
testData_df = testData_df.rename(columns={"id":"qid", "text":"query"})
testQrels_df = testQrels_df.rename(columns={"q_id":"qid", "doc_id":"docno", "score_relevance":"relevance"})

Running an experiment using the validation set

In [139]:
pt.Experiment(
    [
        br,
        cross_pipeline,
        cross_sum_1_pipeline,  # .1*CrossEnc + .9*BM25
        cross_sum_2_pipeline,  # .2*CrossEnc + .8*BM25
        cross_sum_3_pipeline,  # .3*CrossEnc + .7*BM25
        cross_sum_4_pipeline,  # .4*CrossEnc + .6*BM25
        cross_sum_5_pipeline,  # .5*CrossEnc + .5*BM25
        cross_sum_6_pipeline,  # .6*CrossEnc + .4*BM25
        cross_sum_7_pipeline,  # .7*CrossEnc + .3*BM25
        cross_sum_8_pipeline,  # .8*CrossEnc + .2*BM25
        cross_sum_9_pipeline,  # .9*CrossEnc + .1*BM25
    ],
    subsetVal_data, # apply on a subset of the validation set
    valQrels_df,
    names=[
        'BM25',
        'CrossEnc',
        '.1*CrossEnc + .9*BM25',
        '.2*CrossEnc + .8*BM25',
        '.3*CrossEnc + .7*BM25',
        '.4*CrossEnc + .6*BM25',
        '.5*CrossEnc + .5*BM25',
        '.6*CrossEnc + .4*BM25',
        '.7*CrossEnc + .3*BM25',
        '.8*CrossEnc + .2*BM25',
        '.9*CrossEnc + .1*BM25'
    ],
    eval_metrics=["map", "P.5", "P.10", "ndcg", 'recall']
)

Unnamed: 0,name,map,P.5,P.10,ndcg,R@5,R@10,R@15,R@20,R@30,R@100,R@200,R@500,R@1000
0,BM25,0.892045,0.19,0.095,0.917027,0.95,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,CrossEnc,0.721474,0.17,0.085,0.780957,0.85,0.85,0.85,0.85,0.95,1.0,1.0,1.0,1.0
2,.1*CrossEnc + .9*BM25,0.892045,0.19,0.095,0.917027,0.95,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,.2*CrossEnc + .8*BM25,0.904545,0.19,0.095,0.92704,0.95,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,.3*CrossEnc + .7*BM25,0.905,0.19,0.1,0.927546,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,.4*CrossEnc + .6*BM25,0.905,0.19,0.1,0.927546,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,.5*CrossEnc + .5*BM25,0.904545,0.19,0.095,0.92704,0.95,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,.6*CrossEnc + .4*BM25,0.858333,0.19,0.095,0.891592,0.95,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,.7*CrossEnc + .3*BM25,0.820625,0.18,0.095,0.861313,0.9,0.95,0.95,1.0,1.0,1.0,1.0,1.0,1.0
9,.8*CrossEnc + .2*BM25,0.811875,0.17,0.09,0.852482,0.85,0.9,0.9,1.0,1.0,1.0,1.0,1.0,1.0


## Evaluation on the test set

Given the best pipeline above we select it and run it on the test set

In [140]:
pt.Experiment(
    [
        br,
        cross_pipeline,
        cross_sum_2_pipeline, # Best pipeline
    ],
    testData_df,  # apply on the test set
    testQrels_df,
    names=[
        'BM25',
        'CrossEnc',
        '.2*CrossEnc + .8*BM25',
      ],
    eval_metrics=["map", "P.5", "P.10", "ndcg", 'recall']
)

Unnamed: 0,name,map,P.5,P.10,ndcg,R@5,R@10,R@15,R@20,R@30,R@100,R@200,R@500,R@1000
0,BM25,0.813984,0.173469,0.088776,0.845546,0.867347,0.887755,0.908163,0.928571,0.94898,0.959184,0.959184,0.959184,0.959184
1,CrossEnc,0.716131,0.163265,0.086735,0.76832,0.816327,0.867347,0.877551,0.877551,0.908163,0.959184,0.959184,0.959184,0.959184
2,.2*CrossEnc + .8*BM25,0.827523,0.179592,0.089796,0.856757,0.897959,0.897959,0.928571,0.938776,0.94898,0.959184,0.959184,0.959184,0.959184


# Phase 2: extending models with user features for personalized information retrieval

## Query expansion

### Setting of the pre-trained model

Here, an LLM is used for query expansions based on user data and context.

In [141]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [142]:
torch.random.manual_seed(0)

# Pre-trained LLM model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [143]:
# Corresponding tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

def query_expander(id, content, query, keywords):
  # Message
  messages = [
    {"role": "system", "content": "You are a helpful AI assistant."}, # Prompt
    # it would be ideal to get keywords from all possible previous user questions, here is just and example
    {"role": f"{id}", "content": f"{content}", "Query": f"{query}", "Keywords": f"{keywords}"},
  ]

  # High-level pipeline for text generation
  pipe = pipeline(
  "text-generation",
  model=model,
  tokenizer=tokenizer,
  )

  # Parameters for generating text
  generation_args = {
      "max_new_tokens": 500,
      "return_full_text": False,
      "temperature": 0.0,
      "do_sample": False,
  }

  return pipe(query, **generation_args)

### Application of the model on a subset of the train data

In [144]:
# Rename columns (if it was not already done)
if "title" in trainData_df.columns:
    trainData_df.rename(columns={'title': 'query'}, inplace=True)  # Drop the column
else:
    print("Already renamed")

In [145]:
# Check if the column was already created from the previous runs
if "expanded_query" in trainData_df.columns:
    trainData_df = trainData_df.drop('expanded_query', axis=1)  # Drop the column
else:
    print("Column 'expanded_query' does not exist.")

Column 'expanded_query' does not exist.


In [146]:
# We use 10% of the dataset for efficiency purposes (too slow otherwise)
subsetTrain_data = trainData_df.sample(frac=0.1)

We use as user context the tags he/she added to the query.

In [147]:
# This code had been run once, and took >5 hours, so the new dataset with the "query expanded" column was saved into the drive
# The dataset was accessed and used below

# Adding a column to the dataset where we keep the expanded queries
"""
query_expanded_list = []

for index, row in subsetTrain_data.iterrows():
    id = row["id"]
    text = row["text"]
    query = row["query"]
    keywords = row["tags"]

    expanded_query = query_expander(id=id,
                                    content=text,
                                    query=query,
                                    keywords=keywords)

    # Append the expanded query to the list
    query_expanded_list.append(expanded_query)

subsetTrain_data["query_expanded"] = query_expanded_list
"""

'\nquery_expanded_list = []\n\nfor index, row in subsetTrain_data.iterrows():\n    id = row["id"]\n    text = row["text"]\n    query = row["query"]\n    keywords = row["tags"]\n\n    expanded_query = query_expander(id=id,\n                                    content=text,\n                                    query=query,\n                                    keywords=keywords)\n\n    # Append the expanded query to the list\n    query_expanded_list.append(expanded_query)\n\nsubsetTrain_data["query_expanded"] = query_expanded_list\n'

In [148]:
# Used to upload the new dataset
"""
from google.colab import drive
drive.mount('/content/drive')
"""

"\nfrom google.colab import drive\ndrive.mount('/content/drive')\n"

In [149]:
# Save the DataFrame to a JSON file
"""
file_path = "/content/drive/MyDrive/Colab Notebooks/subsetTrain_data.json"
subsetTrain_data.to_json(file_path, orient="records", lines=True)

print(f"Dataset saved as JSON to: {file_path}")
"""

'\nfile_path = "/content/drive/MyDrive/Colab Notebooks/subsetTrain_data.json"\nsubsetTrain_data.to_json(file_path, orient="records", lines=True)\n\nprint(f"Dataset saved as JSON to: {file_path}")\n'

Downloading the dataset previously run:

In [150]:
# Getting the saved dataset from the drive
!gdown 1-3B4AbfxLswo7I5TUr0hpmUBQY-gkWgT -O dataset_with_expandedQueries.json

Downloading...
From: https://drive.google.com/uc?id=1-3B4AbfxLswo7I5TUr0hpmUBQY-gkWgT
To: /content/PIR_data/answer_retrieval/val/dataset_with_expandedQueries.json
  0% 0.00/9.47M [00:00<?, ?B/s] 11% 1.05M/9.47M [00:00<00:00, 10.3MB/s]100% 9.47M/9.47M [00:00<00:00, 51.6MB/s]


In [151]:
# Print of the first 5 rows of the dataset
dataset_with_expandedQueries = pd.read_json("dataset_with_expandedQueries.json", orient="records", lines=True)
print(dataset_with_expandedQueries.head())

                   id                                               text  \
0        history_1208  offici posit china sinosoviet border clash kno...   
1        movies_16035  talent doubl use chopstick usag american drama...   
2       history_10030  happen israel persia attack egypt see wiki bat...   
3    boardgames_13317  affect haunt roll item omen effect betray hous...   
4  hermeneutics_26728  messeng coven malachi kjvmalachi behold send m...   

                                               query           timestamp  \
0  What was the official position of China during... 2012-01-27 17:35:31   
1  Are talent doubles used for chopstick usage in... 2013-12-20 08:16:19   
2  What happened to Israel when Persia attacked E... 2013-08-27 03:10:00   
3  Can you affect the Haunt roll with items, omen... 2013-10-25 19:29:04   
4  Who is the "messenger of the covenant" Malachi... 2017-01-23 17:00:17   

   score  views  favorite  user_id  \
0      4    361       NaN   407237   
1      3  

In [152]:
dataset_with_expandedQueries["query_expanded"][0]

[{'generated_text': "\n\nA) China supported the Soviet Union's actions.\nB) China remained neutral and did not intervene.\nC) China supported North Korea's actions.\nD) China supported South Korea's actions.\n\nAnswer: B) China remained neutral and did not intervene.\n\n2. What was the primary reason for the Sino-Soviet border conflict in 1969?\n\nA) Disagreements over trade policies\nB) Disputes over territorial claims\nC) Ideological differences between the two nations\nD) A border incident involving a Soviet map\n\nAnswer: D) A border incident involving a Soviet map\n\n3. What was the outcome of the Sino-Soviet border conflict in terms of territorial changes?\n\nA) China gained significant territory from the Soviet Union.\nB) The border remained unchanged after the conflict.\nC) The Soviet Union gained territory from China.\nD) Both countries agreed to a new border line.\n\nAnswer: B) The border remained unchanged after the conflict.\n\n4. How did the Sino-Soviet border conflict aff

In [153]:
dataset_with_expandedQueries["query_expanded"][1]

[{'generated_text': "\n\nNo, talent doubles are not used for chopstick usage in American dramas. Talent doubles, also known as body doubles, are actors or models who stand in for the main actor in scenes where the actor's presence is not required, or to maintain continuity in a scene. This practice is common in various forms of media, including movies, television shows, and sometimes in live performances.\n\nChopstick usage, on the other hand, is a skill that involves using chopsticks to pick up food, which is a common utensil in many East Asian cultures. The portrayal of chopstick usage in American dramas would depend on the cultural setting of the story. If the drama is set in an East Asian context or aims to depict an authentic representation of that culture, then actors might be required to learn and perform chopstick usage. However, this would not involve the use of talent doubles.\n\nIn summary, talent doubles are used for reasons related to the actor's presence and continuity in

## Information retrieval system using only the expanded queries - Adding personalization

### Preprocessing the queries

Please note the structure of the text in the column "query_expanded" is the following: [{'generated_text': actual generated text}]

In [154]:
print(type(dataset_with_expandedQueries["query_expanded"].iloc[0]))

<class 'list'>


In the next piece of code we reduce the list of dictionaries into a string, such that later it is possible to apply the preprocessing

In [155]:
import pandas as pd

def item_to_string(item):
    if isinstance(item, list):
        # If it's a list of dictionaries
        if len(item) > 0 and isinstance(item[0], dict):
            # Extract the *value* from each dictionary (assuming each dict has one key)
            values = []
            for d in item:
                if len(d) == 1:
                    # Extract the single value from the dict
                    value = next(iter(d.values()))
                    values.append(str(value))
                else:
                    # If the dict has multiple keys, handle as needed
                    # (Here we convert the entire dictionary to string)
                    values.append(str(d))
            return " ".join(values)
        else:
            # Otherwise, assume it's a list of strings
            return " ".join(str(x) for x in item)

    # If it's already a string, just return it as is
    elif isinstance(item, str):
        return item

# Example usage:
dataset_with_expandedQueries["query_expanded"] = dataset_with_expandedQueries["query_expanded"].apply(item_to_string)


In [156]:
# Preprocessing the expanded queries
dataset_with_expandedQueries["query_expanded"] = dataset_with_expandedQueries["query_expanded"].apply(preprocess_text_basics)
dataset_with_expandedQueries["query_expanded"] = dataset_with_expandedQueries["query_expanded"].apply(preprocess_text_norm)

In [157]:
# Changing the name of the column to "query"
dataset_with_expandedQueries.rename(columns={"query": "basic_query"}, inplace=True)

In [158]:
dataset_with_expandedQueries.head()

Unnamed: 0,id,text,basic_query,timestamp,score,views,favorite,user_id,user_questions,user_answers,tags,rel_ids,rel_scores,rel_timestamps,best_answer,query_expanded
0,history_1208,offici posit china sinosoviet border clash kno...,What was the official position of China during...,2012-01-27 17:35:31,4,361,,407237,"[philosophy_1285, linguistics_1235, linguistic...","[philosophy_1346, linguistics_1238, history_10...","[china, russia, cold-war]",[history_1210],[1],"[1327686923, 1327789142]",history_1210,china support soviet union action b china rema...
1,movies_16035,talent doubl use chopstick usag american drama...,Are talent doubles used for chopstick usage in...,2013-12-20 08:16:19,3,309,0.0,17355,"[workplace_10560, travel_932, travel_3965, tra...","[travel_4174, travel_4717, travel_5063, travel...",[film-techniques],[movies_16899],[1],[1390753002],movies_16899,talent doubl use chopstick usag american drama...
2,history_10030,happen israel persia attack egypt see wiki bat...,What happened to Israel when Persia attacked E...,2013-08-27 03:10:00,2,1018,,359320,"[scifi_13437, scifi_23869, scifi_27365, scifi_...","[history_6055, history_6058, gaming_81786, gam...",[bible],[history_10031],[1],[1377575079],history_10031,answer persia attack egypt israel directli inv...
3,boardgames_13317,affect haunt roll item omen effect betray hous...,"Can you affect the Haunt roll with items, omen...",2013-10-25 19:29:04,8,1664,0.0,1077413,"[writers_6201, writers_6562, workplace_12372, ...","[scifi_7887, scifi_8424, scifi_8426, scifi_855...",[betrayal-at-house-on-the-hill],[boardgames_13334],[1],[1382928506],boardgames_13334,im play betray hous hill im traitor im tri fig...
4,hermeneutics_26728,messeng coven malachi kjvmalachi behold send m...,"Who is the ""messenger of the covenant"" Malachi...",2017-01-23 17:00:17,5,6511,0.0,9101990,"[hermeneutics_24986, hermeneutics_25002, herme...","[hermeneutics_25688, hermeneutics_26056, herme...",[malachi],[hermeneutics_29145],[1],"[1503499479, 1503525222]",hermeneutics_29145,responseth messeng coven mention malachi refer...


### Splitting the dataset into train and test for the future evaluation

Splitting the dataset into train and test set to later evaluate the quality of the model

In [159]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Suppose 'df' is your entire dataset (with or without labels).
df_train, df_test = train_test_split(dataset_with_expandedQueries, test_size=0.2, random_state=42)

print("Training set:", df_train.shape)
print("Testing set:", df_test.shape)

Training set: (800, 16)
Testing set: (200, 16)


### Converting to Retriv friendly format

In [160]:
# Converting train queries
train_queries_list = [
    {"id": row["id"], "text": row["query_expanded"]}
    for _, row in df_train.iterrows()
]

print("Number of queries:", len(train_queries_list))
print(train_queries_list[0])

Number of queries: 800
{'id': 'gaming_116133', 'text': 'answer provid accur answer would need specif detail game puzzl your refer includ natur frog question context ask without inform possibl determin correct answer reward get correct provid detail game puzzl rule question ask frog possibl answer would abl assist better pleas share relev inform help find correct answer understand reward system'}


In [161]:
# Converting test queries
test_queries_list = [
    {"id": row["id"], "text": row["query_expanded"]}
    for _, row in df_test.iterrows()
]

print("Number of queries:", len(test_queries_list))
print(test_queries_list[0])

Number of queries: 200
{'id': 'rpg_60357', 'text': 'im tri draw layout mansionmanor im sure best tool use would im look someth allow draw layout room hallway ive tri use microsoft paint good ive also tri use googl sketchup bit complic im look tool easi use allow draw layout mansionmanor recommend respons draw layout mansion manor might want consid use special tool cater architectur interior design need recommend userfriendli suitabl creat detail layout sketchup mention bit complic sketchup free version call sketchup free quit userfriendli wide use architectur interior design intuit interfac power featur also mani tutori avail onlin help get start autocad professionalgrad tool use architect design complex sketchup offer precis control design free version like autocad lt might suffici need adob illustr your look someth allow artist control use design adob illustr great choic part adob creativ cloud suit offer wide rang tool creat detail layout inkscap free opensourc vector graphic editor

In [162]:
trainQrels_df

Unnamed: 0,q_id,doc_id,score_relevance
0,academia_100305,academia_100217,1
1,academia_100456,academia_100462,1
2,academia_103390,academia_103391,1
3,academia_10481,academia_10499,1
4,academia_10649,academia_10650,1
...,...,...,...
9211,rpg_116892,rpg_116894,1
9212,gaming_183540,gaming_183541,1
9213,rpg_100984,rpg_100987,1
9214,scifi_95624,scifi_95628,1


### Applying the cross encoder on the train set

In [163]:
# Rename columns of testData_df and testQrels_df

df_train = df_train.rename(columns={"id":"qid", "query_expanded":"query"})
df_test = df_test.rename(columns={"id":"qid", "query_expanded":"query"}) # Needed for the evaluation

In [164]:
# Define the qrels
trainQrels_df = trainQrels_df.rename(columns={"q_id":"qid", "doc_id":"docno", "score_relevance":"relevance"})
trainQrels_df_copy = trainQrels_df.copy()

# Keep only the trainQrels rows whose qid is in df_train
Qrels_df_forTrain = trainQrels_df_copy [trainQrels_df_copy ["qid"].isin(df_train["qid"])]

# Keep only the trainQrels rows whose qid is in df_test
Qrels_df_forTest = trainQrels_df[trainQrels_df["qid"].isin(df_test["qid"])] # Needed for the evaluation

In [165]:
# Experiment to find the best pipeline when using the expanded queries

pt.Experiment(
    [
        br,
        cross_pipeline,
        cross_sum_1_pipeline,  # .1*CrossEnc + .9*BM25
        cross_sum_2_pipeline,  # .2*CrossEnc + .8*BM25
        cross_sum_3_pipeline,  # .3*CrossEnc + .7*BM25
        cross_sum_4_pipeline,  # .4*CrossEnc + .6*BM25
        cross_sum_5_pipeline,  # .5*CrossEnc + .5*BM25
        cross_sum_6_pipeline,  # .6*CrossEnc + .4*BM25
        cross_sum_7_pipeline,  # .7*CrossEnc + .3*BM25
        cross_sum_8_pipeline,  # .8*CrossEnc + .2*BM25
        cross_sum_9_pipeline,  # .9*CrossEnc + .1*BM25
    ],
    df_train,  #apply them on the train set
    Qrels_df_forTrain,
    names=[
        'BM25',
        'CrossEnc',
        '.1*CrossEnc + .9*BM25',
        '.2*CrossEnc + .8*BM25',
        '.3*CrossEnc + .7*BM25',
        '.4*CrossEnc + .6*BM25',
        '.5*CrossEnc + .5*BM25',
        '.6*CrossEnc + .4*BM25',
        '.7*CrossEnc + .3*BM25',
        '.8*CrossEnc + .2*BM25',
        '.9*CrossEnc + .1*BM25'
    ],
    eval_metrics=["map", "P.5", "P.10", "ndcg", 'recall']
)

  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(


Unnamed: 0,name,map,P.5,P.10,ndcg,R@5,R@10,R@15,R@20,R@30,R@100,R@200,R@500,R@1000
0,BM25,0.494262,0.116792,0.064035,0.556573,0.58396,0.640351,0.676692,0.70802,0.73183,0.789474,0.789474,0.789474,0.789474
1,CrossEnc,0.315796,0.077694,0.047744,0.409622,0.388471,0.477444,0.5401,0.580201,0.641604,0.789474,0.789474,0.789474,0.789474
2,.1*CrossEnc + .9*BM25,0.501099,0.118296,0.064536,0.562175,0.591479,0.645363,0.684211,0.713033,0.735589,0.789474,0.789474,0.789474,0.789474
3,.2*CrossEnc + .8*BM25,0.504156,0.118296,0.065539,0.564835,0.591479,0.655388,0.692982,0.719298,0.735589,0.789474,0.789474,0.789474,0.789474
4,.3*CrossEnc + .7*BM25,0.507903,0.118546,0.066165,0.567796,0.592732,0.661654,0.694236,0.715539,0.736842,0.789474,0.789474,0.789474,0.789474
5,.4*CrossEnc + .6*BM25,0.50305,0.120301,0.065664,0.564239,0.601504,0.656642,0.694236,0.710526,0.738095,0.789474,0.789474,0.789474,0.789474
6,.5*CrossEnc + .5*BM25,0.498718,0.118045,0.065414,0.560369,0.590226,0.654135,0.680451,0.70802,0.736842,0.789474,0.789474,0.789474,0.789474
7,.6*CrossEnc + .4*BM25,0.478002,0.113784,0.064035,0.543708,0.568922,0.640351,0.674185,0.696742,0.733083,0.789474,0.789474,0.789474,0.789474
8,.7*CrossEnc + .3*BM25,0.43613,0.106516,0.061529,0.510343,0.532581,0.615288,0.657895,0.685464,0.718045,0.789474,0.789474,0.789474,0.789474
9,.8*CrossEnc + .2*BM25,0.391628,0.097995,0.057268,0.473964,0.489975,0.572682,0.625313,0.659148,0.696742,0.789474,0.789474,0.789474,0.789474


### Evaluation: applying the cross encoder on the test set

Checking the quality of the model

In [166]:
pt.Experiment(
    [
        br,
        cross_pipeline,
        cross_sum_3_pipeline, # Best pipeline
    ],
    df_test,  #apply them on the test set
    Qrels_df_forTest,
    names=[
        'BM25',
        'CrossEnc',
        '.3*CrossEnc + .7*BM25',
      ],
    eval_metrics=["map", "P.5", "P.10", "ndcg", 'recall']
)

Unnamed: 0,name,map,P.5,P.10,ndcg,R@5,R@10,R@15,R@20,R@30,R@100,R@200,R@500,R@1000
0,BM25,0.570942,0.131658,0.071357,0.630714,0.658291,0.713568,0.738693,0.758794,0.78392,0.859296,0.859296,0.859296,0.859296
1,CrossEnc,0.379263,0.094472,0.053266,0.473097,0.472362,0.532663,0.582915,0.623116,0.683417,0.859296,0.859296,0.859296,0.859296
2,.3*CrossEnc + .7*BM25,0.579312,0.134673,0.073869,0.638117,0.673367,0.738693,0.758794,0.763819,0.788945,0.859296,0.859296,0.859296,0.859296


## Recommender systems scores

We incorporate content-based filtering and user-based collaborative filtering to add personalization.

In [167]:
# Used to upload the new datasets
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Retrieval scores of the expanded queries matrix

*Retrieval Scores:* these are numerical scores assigned to documents for each query by the retrieval models

In [169]:
# Retrieve results for a query
query = dataset_with_expandedQueries["query_expanded"].iloc[0]
results = sr_bm25.search(query, cutoff=dataset_with_expandedQueries.shape[0])

# Inspect the results
print(results[0]) # List of dictionaries ordered by relevance
print(results[1])

print(f"Dimension: {len(results)}")

{'id': 'politics_32701', 'text': 'answer sort beat around bush russia motiv pure simpl russian peopl feel secur threat nato eu worri variou sort depriv seek defend action putin russian govern gener popular russianswhi earth would russian fear eu nato bunch popular reasonsmemori food shortag econom upheav mani peopl hungri mani lost pension econom stabil shortag view direct consequ lose cold war enemi aggressorincess border war year russia fought war vast major border mani million die wwii veteran hot war afghanistan china late searli still around commun memori civil war great game conflict japan unit state find hard know feel like histor commun fear invas start realli mongol golden horderesidu soviet propaganda soviet union enemi unit state europ decad propaganda newsprint left mani russia retain feelingscurr propaganda putin effort stay power control much press manipul popul alreadi suscept accord friend region steal wealth countrya vocal hostil usa usa militari hegemon today world vo

In [170]:
def matrix_retrivScores(dataset):
    # Initialize an empty dictionary to store query results
    query_doc_scores = {}

    # Check if the required column exists
    if "query_expanded" in dataset.columns:
        for query in dataset["query_expanded"]:
          for query_id in dataset["id"]:
            # Retrieve results for the query
            results = sr_bm25.search(query, cutoff=dataset.shape[0])
            # Extract document IDs and relevance scores
            doc_scores = {}
            for result_dict in results:
                doc_id = result_dict.get("id")
                score = result_dict.get("score")
                doc_scores[doc_id] = score
            # Store the scores for the current query
            query_doc_scores[query_id] = doc_scores
    else:
        raise ValueError("The column 'query_expanded' is missing in the dataset.")

    df = pd.DataFrame.from_dict(query_doc_scores, orient="index")  # orient="index" makes outer keys rows
    df.insert(0, "query id", df.index) # Set the query id as one of the columns

    # Return the DataFrame
    return df


Note: Here each
- each row is a query
- each column is a document

In [171]:
# The dataset was saved, and can be seen below
# This choice was taken because it takes >4 hours to run
"""
retrivMatrix = matrix_retrivScores(dataset_with_expandedQueries)
retrivMatrix
"""

'\nretrivMatrix = matrix_retrivScores(dataset_with_expandedQueries)\nretrivMatrix\n'

Saving the data to avoid running the cose (computationally expensive code)

In [172]:
import os
os.chdir('/content/drive/MyDrive')
print(os.getcwd())

/content/drive/MyDrive


In [173]:
# Saving the file
"""
file_path = "/content/drive/MyDrive/retrivMatrix.csv"
retrivMatrix.to_csv(file_path, index=False)

print(f"Dataset saved as CSV to: {file_path}")
"""

'\nfile_path = "/content/drive/MyDrive/retrivMatrix.csv"\nretrivMatrix.to_csv(file_path, index=False)\n\nprint(f"Dataset saved as CSV to: {file_path}")\n'

Importing directly the data

In [174]:
!gdown 1I2VsBsxvVN-QxcONrY7eW5TPH68fPMJQ -O retrivMatrix.csv

Downloading...
From: https://drive.google.com/uc?id=1I2VsBsxvVN-QxcONrY7eW5TPH68fPMJQ
To: /content/drive/MyDrive/retrivMatrix.csv
100% 9.46M/9.46M [00:00<00:00, 34.5MB/s]


In [175]:
retrivMatrix = pd.read_csv("retrivMatrix.csv")
print(retrivMatrix.head())

             query id  gaming_350826  gaming_130973  gaming_103827  rpg_11641  \
0        history_1208      280.88104      254.92812      198.31427  196.08018   
1        movies_16035      280.88104      254.92812      198.31427  196.08018   
2       history_10030      280.88104      254.92812      198.31427  196.08018   
3    boardgames_13317      280.88104      254.92812      198.31427  196.08018   
4  hermeneutics_26728      280.88104      254.92812      198.31427  196.08018   

   anime_4030  boardgames_12204  rpg_93293  rpg_92534  rpg_89524  ...  \
0   190.77309         188.51622  187.84375  174.77057  171.01512  ...   
1   190.77309         188.51622  187.84375  174.77057  171.01512  ...   
2   190.77309         188.51622  187.84375  174.77057  171.01512  ...   
3   190.77309         188.51622  187.84375  174.77057  171.01512  ...   
4   190.77309         188.51622  187.84375  174.77057  171.01512  ...   

   gaming_148656  scifi_117433  boardgames_28348  woodworking_13203  \
0  

### Content based filtering matrix

In [176]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [168]:
from joblib import Parallel, delayed # Library for induced parallel computing

**Content-based filtering**, using user-item similarity

In [177]:
def content_based_filtering(query_id, query, doc_ids, docs):
  # we are interested in the query id and the text of the query
  # we are interested in the doc id and the text of the document

  tfidf = TfidfVectorizer() # TF-IDF vector representation
  tfidf_matrix = tfidf.fit_transform([query] + docs)

  # Cosine similarity between the query and all documents
  cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1:]).flatten()

  # Create a DataFrame with query_id and document_ids
  result = pd.DataFrame({
    'query_id':[query_id] * len(doc_ids),
    'document_id': doc_ids,
    'similarity_score': cosine_sim
  })

  return result

**Scores for 1 query**

In [178]:
valid_docs = []
valid_docs_ids = []
for i in range(corpus_df.shape[0]):
  for j in range(retrivMatrix.shape[1]):
    if corpus_df.iloc[i]["docno"] == retrivMatrix.columns[j]:
        text = corpus_df.iloc[i]["text"]
        doc_id = retrivMatrix.columns[j]
        valid_docs.append(text)
        valid_docs_ids.append(doc_id)

In [179]:
docs = list(set(valid_docs)) # To get the unique values
query = dataset_with_expandedQueries["query_expanded"].iloc[0]
query_id = dataset_with_expandedQueries["id"].iloc[0]

if len(valid_docs_ids) != len(docs):
    raise ValueError("doc_ids and docs must have the same length")

In [180]:
cbf_scores = content_based_filtering(
    query_id = query_id,
    query = query,
    doc_ids = valid_docs_ids,
    docs = docs
)

cbf_scores

Unnamed: 0,query_id,document_id,similarity_score
0,history_1208,writers_2010,0.006858
1,history_1208,writers_2026,0.027136
2,history_1208,writers_2542,0.009086
3,history_1208,writers_2548,0.013093
4,history_1208,writers_2582,0.011226
...,...,...,...
995,history_1208,academia_30610,0.004104
996,history_1208,academia_34151,0.040910
997,history_1208,academia_38238,0.001876
998,history_1208,academia_69225,0.012918


**Scores for many queries**

In [181]:
def matrix_cbf(dataset, valid_docs, valid_docs_ids):

    # Check if the required column exists
    if "query_expanded" not in dataset.columns:
      raise ValueError("The column 'query_expanded' is missing in the dataset.")

    docs = list(set(valid_docs)) # To get the unique values
    doc_ids = valid_docs_ids

    if len(doc_ids) != len(docs):
        raise ValueError("doc_ids and docs must have the same length")

    # Helper function to process a single query
    def process_query(query_id, query):
      # Extract document IDs and relevance scores
      cbf_scores = content_based_filtering(
          query_id = query_id,
          query = query,
          doc_ids = doc_ids,
          docs = docs
      )

      # Extract document IDs and scores into a dictionary
      return {
          cbf_scores.iloc[i]["document_id"]: cbf_scores.iloc[i]["similarity_score"]
          for i in range(cbf_scores.shape[0])
      }

    # Use Parallel processing for efficiency
    results = Parallel(n_jobs= -1)(
        delayed(process_query)(row["id"], row["query_expanded"])
        for _, row in dataset.iterrows()
    )

    # Construct the DataFrame
    query_ids = dataset["id"].tolist()
    df = pd.DataFrame(results, index=query_ids).fillna(0)
    # Return the DataFrame
    return df


In [182]:
cbf_matrix = matrix_cbf(
    dataset = dataset_with_expandedQueries,
    valid_docs = valid_docs,
    valid_docs_ids = valid_docs_ids
)

# Print the dataframe to see the result
cbf_matrix

Unnamed: 0,writers_2010,writers_2026,writers_2542,writers_2548,writers_2582,writers_2617,writers_3290,writers_5447,writers_6108,writers_6195,...,anime_55736,anime_56051,academia_3093,academia_11573,academia_16379,academia_30610,academia_34151,academia_38238,academia_69225,academia_73666
history_1208,0.006858,0.027136,0.009086,0.013093,0.011226,0.004835,0.002400,0.007132,0.001515,0.003476,...,0.001692,0.011986,0.021690,0.000000,0.003392,0.004104,0.040910,0.001876,0.012918,0.005851
movies_16035,0.019753,0.013647,0.003042,0.005922,0.007348,0.005918,0.000000,0.033664,0.005269,0.004661,...,0.016581,0.038354,0.000000,0.017071,0.005559,0.018124,0.020446,0.005044,0.008515,0.017901
history_10030,0.004542,0.003169,0.008065,0.009527,0.011056,0.007092,0.007934,0.004669,0.001541,0.003059,...,0.002498,0.002017,0.009360,0.000631,0.002041,0.006891,0.001985,0.005399,0.001075,0.019380
boardgames_13317,0.001334,0.000000,0.016415,0.067443,0.004285,0.002595,0.000000,0.032673,0.000000,0.000000,...,0.003001,0.018948,0.000000,0.031615,0.044456,0.000420,0.004885,0.000000,0.000711,0.002010
hermeneutics_26728,0.004157,0.019078,0.011430,0.002700,0.014965,0.000000,0.000000,0.012296,0.048728,0.011471,...,0.006223,0.012973,0.001289,0.000000,0.012540,0.015920,0.000549,0.022294,0.002106,0.004148
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
scifi_53120,0.036393,0.019275,0.010323,0.003463,0.011698,0.019111,0.010466,0.003132,0.031676,0.027863,...,0.060021,0.014863,0.001559,0.003021,0.013747,0.015750,0.022010,0.013267,0.004987,0.004154
english_88035,0.024539,0.007325,0.005370,0.009299,0.023726,0.006454,0.000000,0.008901,0.000000,0.011668,...,0.045927,0.049744,0.000000,0.027251,0.004889,0.007018,0.007463,0.019218,0.032660,0.010424
judaism_1957,0.022056,0.032603,0.007218,0.006254,0.025508,0.004615,0.000000,0.031007,0.000000,0.010419,...,0.011215,0.019257,0.050742,0.008594,0.014025,0.016816,0.031431,0.016477,0.018050,0.007411
judaism_26134,0.011868,0.006148,0.004419,0.001731,0.002827,0.000930,0.000000,0.004697,0.015169,0.005766,...,0.001346,0.012027,0.018899,0.000000,0.008165,0.013177,0.005895,0.004793,0.007549,0.004662


### User-based collaborative filtering matrix

**Checking some statistics**

In [183]:
# Count unique users
n_users = dataset_with_expandedQueries['user_id'].nunique()
print(f"Number of users: {n_users}")

# Count unique movies
n_queries = dataset_with_expandedQueries['id'].nunique()
print(f"Number of queries: {n_queries}")

# Count total scores
n_scores = dataset_with_expandedQueries["score"].count()
print(f"Number of scores: {n_scores}")

# Count the unique values of the scores
unique_scores = dataset_with_expandedQueries['score'].unique()
print(f"Unique scores: {unique_scores}")

# Count non-zero scores
n_scores_non_zero = (dataset_with_expandedQueries['score'] != 0).sum()
print(f"Number of non-zero scores: {n_scores_non_zero}")

# Count zero scores
n_scores_zero = (dataset_with_expandedQueries['score'] == 0).sum()
print(f"Number of zero scores: {n_scores_zero}")

Number of users: 88
Number of queries: 997
Number of scores: 1000
Unique scores: [  4   3   2   8   5   0   1  14  13   9  46  25   6  45   7  12  -1  69
  11  30  15 212  20  26  10  54  -2  31  19  17  18  35  37  23  16  21
  33  27  22  36  28  63  44  42  40  50  32  24  -4  49  80  39  48  74
  29 181  -3  66  -5  59  34  47  91]
Number of non-zero scores: 919
Number of zero scores: 81


**User-based collaborative filtering on one user**

In [184]:
def user_based_cf(user_id, user_ids, rel_ids, rel_scores):
  # build user-item matrix
  user_item_matrix = pd.DataFrame(
      index = user_ids,
      columns = rel_ids,
      data = 0
  )

  for user_id, rel_id, rel_score in zip(user_ids, rel_ids, rel_scores):
    user_item_matrix.at[user_id, rel_id] = rel_score

  # compute similarity
  cosine_sim = cosine_similarity(user_item_matrix.fillna(0))
  user_index = list(user_item_matrix.index).index(user_id)
  user_similarities = cosine_sim[user_index]

  scores = user_item_matrix.T.dot(user_similarities)

  score_dict = scores.to_dict()

  return score_dict

In [185]:
user_id = dataset_with_expandedQueries["user_id"][0]
user_ids = dataset_with_expandedQueries["user_id"].tolist()
rel_ids = dataset_with_expandedQueries["rel_ids"].tolist()
rel_scores = dataset_with_expandedQueries["rel_scores"].tolist()

In [186]:
# Flatten nested lists if necessary
from itertools import chain
user_ids = list(chain.from_iterable(user_ids)) if isinstance(user_ids[0], list) else user_ids
rel_ids = list(chain.from_iterable(rel_ids)) if isinstance(rel_ids[0], list) else rel_ids
rel_scores = list(chain.from_iterable(rel_scores)) if isinstance(rel_scores[0], list) else rel_scores


In [187]:
# Check for nested lists or improper types
if any(isinstance(i, list) for i in user_ids + rel_ids + rel_scores):
    raise ValueError("Input data contains nested lists. Please flatten your data.")

In [188]:
cf_scores = user_based_cf(user_id, user_ids, rel_ids, rel_scores)
cf_scores

{'history_1210': 0.0,
 'movies_16899': 0.0,
 'history_10031': 0.0,
 'boardgames_13334': 0.0,
 'hermeneutics_29145': 0.0,
 'apple_231445': 0.0,
 'outdoors_17156': 0.0,
 'english_200923': 0.0,
 'pets_15755': 0.0,
 'english_69482': 0.0,
 'islam_2117': 0.0,
 'history_40444': 0.0,
 'christianity_63045': 0.0,
 'academia_73666': 0.0,
 'gaming_13511': 0.0,
 'workplace_9641': 0.0,
 'scifi_91082': 0.0,
 'scifi_56127': 0.0,
 'hsm_471': 0.0,
 'pets_111': 0.0,
 'philosophy_23442': 0.0,
 'gaming_325143': 18.999999999999996,
 'travel_134511': 0.0,
 'rpg_68972': 0.0,
 'scifi_21557': 0.0,
 'history_7799': 0.0,
 'gaming_42216': 0.0,
 'apple_187740': 0.0,
 'scifi_69447': 0.0,
 'gaming_116142': 0.0,
 'gardening_38598': 0.0,
 'sports_24961': 18.999999999999996,
 'judaism_3347': 0.0,
 'hermeneutics_34682': 0.0,
 'history_38428': 0.0,
 'hermeneutics_31434': 0.0,
 'english_212729': 0.0,
 'skeptics_37441': 0.0,
 'english_54676': 0.0,
 'english_2134': 0.0,
 'english_253921': 0.0,
 'scifi_134795': 0.0,
 'history

In [189]:
def matrix_userCF(dataset):
  scores = {}

  user_ids = dataset["user_id"].tolist()
  rel_ids = dataset["rel_ids"].tolist()
  rel_scores = dataset["rel_scores"].tolist()

  # Flatten nested lists if necessary
  from itertools import chain
  user_ids = list(chain.from_iterable(user_ids)) if isinstance(user_ids[0], list) else user_ids
  rel_ids = list(chain.from_iterable(rel_ids)) if isinstance(rel_ids[0], list) else rel_ids
  rel_scores = list(chain.from_iterable(rel_scores)) if isinstance(rel_scores[0], list) else rel_scores


  for user in dataset_with_expandedQueries['user_id'].unique():
    score = user_based_cf(user, user_ids, rel_ids, rel_scores)
    scores[user] = score

  df = pd.DataFrame.from_dict(scores, orient="index")  # orient="index" makes outer keys rows

  return df

In [190]:
matrix_userCF = matrix_userCF(dataset_with_expandedQueries)
matrix_userCF

Unnamed: 0,history_1210,movies_16899,history_10031,boardgames_13334,hermeneutics_29145,apple_231445,outdoors_17156,english_200923,pets_15755,english_69482,...,diy_124582,english_160130,politics_47336,outdoors_20790,scifi_142165,scifi_53123,english_88079,judaism_1958,judaism_26136,gaming_350826
407237,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
17355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
359320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
1077413,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
9101990,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1665141,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
1824780,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
1352706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
3581268,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0


To make the "matrix" above compatible with "matrix_retrivScores" and "matrix_cbf"

In [191]:
def expanded_matrix_userCF(matrix_userCF, dataset):
    for target_user in matrix_userCF.index:
        # Get query IDs corresponding to the user
        query_ids = dataset.loc[dataset["user_id"] == target_user, "id"].tolist()

        # Get the user row from the matrix
        user_row = matrix_userCF.loc[target_user]

        # Create new rows for each query_id
        expanded_rows = pd.DataFrame(
            [user_row.values] * len(query_ids),
            columns=matrix_userCF.columns,
            index=query_ids
        )

        # Remove target user and concatenate the expanded rows
        matrix_userCF = pd.concat([matrix_userCF.drop(index=target_user), expanded_rows])

    return matrix_userCF


In [192]:
expanded_matrix_userCF = expanded_matrix_userCF(matrix_userCF, dataset_with_expandedQueries)
expanded_matrix_userCF

Unnamed: 0,history_1210,movies_16899,history_10031,boardgames_13334,hermeneutics_29145,apple_231445,outdoors_17156,english_200923,pets_15755,english_69482,...,diy_124582,english_160130,politics_47336,outdoors_20790,scifi_142165,scifi_53123,english_88079,judaism_1958,judaism_26136,gaming_350826
history_1208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
english_200432,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
history_7788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
history_1175,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
english_42786,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
history_54533,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
judaism_22216,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
judaism_48401,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0
literature_40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0


### Integrating the scores to get an unique matrix defined above

Assign a weight to each score

In [193]:
# Since the first column is the query_id column:
retrivMatrix = retrivMatrix.set_index("query id")
retrivMatrix

Unnamed: 0_level_0,gaming_350826,gaming_130973,gaming_103827,rpg_11641,anime_4030,boardgames_12204,rpg_93293,rpg_92534,rpg_89524,rpg_2963,...,gaming_148656,scifi_117433,boardgames_28348,woodworking_13203,boardgames_48501,rpg_91898,boardgames_33566,scifi_127253,apple_23130,money_15386
query id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
history_1208,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
movies_16035,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
history_10030,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
boardgames_13317,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
hermeneutics_26728,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
scifi_53120,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
english_88035,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
judaism_1957,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021
judaism_26134,280.88104,254.92812,198.31427,196.08018,190.77309,188.51622,187.84375,174.77057,171.01512,164.36914,...,52.49648,52.474277,52.447056,52.429096,52.418526,52.411976,52.390495,52.323242,52.275,52.16021


In [194]:
# Check for duplicates in the rows of the retriv Dataset
if retrivMatrix.index.duplicated().any():
    print("There are duplicate rows in the retriv DataFrame.")
    retrivMatrix = retrivMatrix[~retrivMatrix.index.duplicated(keep='first')]
    print(f"Now: {retrivMatrix.index.duplicated().sum()}")
else:
    print("There are no duplicate rows in the DataFrame.")

# Check for duplicates in the rows of the content based filtering Dataset
if cbf_matrix.index.duplicated().any():
    print("There are duplicate rows in the cbf DataFrame.")
    cbf_matrix = cbf_matrix[~cbf_matrix.index.duplicated(keep='first')]
    print(f"Now: {cbf_matrix.index.duplicated().sum()}")
else:
    print("There are no duplicate rows in the DataFrame.")

# Check for duplicates in the rows of the user collaborative filtering Dataset
if expanded_matrix_userCF.index.duplicated().any():
    print("There are duplicate rows in the userCF DataFrame.")
    expanded_matrix_userCF = expanded_matrix_userCF[~expanded_matrix_userCF.index.duplicated(keep='first')]
    print(f"Now: {expanded_matrix_userCF.index.duplicated().sum()}")
else:
    print("There are no duplicate rows in the DataFrame.")

There are no duplicate rows in the DataFrame.
There are duplicate rows in the cbf DataFrame.
Now: 0
There are duplicate rows in the userCF DataFrame.
Now: 0


In [195]:
# Check for duplicates in the columns of the retriv Dataset
if retrivMatrix.columns.duplicated().any():
    print("There are duplicate columns in the retriv DataFrame.")
    retrivMatrix = retrivMatrix[~retrivMatrix.columns.duplicated(keep='first')]
    print(f"Now: {retrivMatrix.columns.duplicated().sum()}")
else:
    print("There are no duplicate columns in the DataFrame.")

# Check for duplicates in the columns of the content based filtering Dataset
if cbf_matrix.columns.duplicated().any():
    print("There are duplicate columns in the cbf DataFrame.")
    cbf_matrix = cbf_matrix[~cbf_matrix.columns.duplicated(keep='first')]
    print(f"Now: {cbf_matrix.columns.duplicated().sum()}")
else:
    print("There are no duplicate columns in the DataFrame.")

# Check for duplicates in the columns of the user collaborative filtering Dataset
if expanded_matrix_userCF.columns.duplicated().any():
    print("There are duplicate columns in the userCF DataFrame.")
    expanded_matrix_userCF = expanded_matrix_userCF[~expanded_matrix_userCF.columns.duplicated(keep='first')]
    print(f"Now: {expanded_matrix_userCF.columns.duplicated().sum()}")
else:
    print("There are no duplicate columns in the DataFrame.")

There are no duplicate columns in the DataFrame.
There are no duplicate columns in the DataFrame.
There are no duplicate columns in the DataFrame.


In [196]:
import pandas as pd

def integrate_scores(retriv, cbf, ucf, weight_retriv, weight_cbf, weight_ucf):

    weights = [weight_retriv, weight_cbf, weight_ucf]

    # Ensure the weights sum to 1
    if sum(weights) != 1:
        raise ValueError("Weights must sum to 1.")

    # Align the DataFrames to a common shape (union of all indices and columns)
    all_indices = sorted(set(retriv.index) | set(cbf.index) | set(ucf.index))
    all_columns = sorted(set(retriv.columns) | set(cbf.columns) | set(ucf.columns))

    # Reindex matrices to ensure alignment
    retriv = retriv.reindex(index=all_indices, columns=all_columns, fill_value=0)
    cbf = cbf.reindex(index=all_indices, columns=all_columns, fill_value=0)
    ucf = ucf.reindex(index=all_indices, columns=all_columns, fill_value=0)

    # Ensure all matrices have the same shape
    if not (retriv.shape == cbf.shape == ucf.shape):
        raise ValueError("All input matrices must have the same shape.")

    # Compute the weighted average
    weighted_avg = retriv * weights[0] + cbf * weights[1] + ucf * weights[2]
    weighted_avg.to_csv("weighted_avg_matrix.csv")

    return weighted_avg


In [197]:
pers_df = personalized_scores = integrate_scores(
    retriv = retrivMatrix,
    cbf = cbf_matrix,
    ucf = expanded_matrix_userCF,
    weight_retriv = 0.5,
    weight_cbf = 0.3,
    weight_ucf = 0.2
)

In [198]:
pers_df

# Columns: doc_id
# rows: queries

Unnamed: 0_level_0,academia_110666,academia_11573,academia_122342,academia_16379,academia_16534,academia_19548,academia_2329,academia_2529,academia_2680,academia_30610,...,writers_8895,writers_8945,writers_8961,writers_9031,writers_9106,writers_9321,writers_9419,writers_9527,writers_9833,writers_9930
query id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
academia_110662,0.0,59.523432,0.0,47.355030,0.0,0.0,0.0,0.0,0.0,26.316149,...,37.917103,44.347625,0.0,0.0,32.347761,0.0,61.653085,46.802125,57.956150,51.727898
academia_122339,0.0,59.527250,0.0,47.365908,0.0,0.0,0.0,0.0,0.0,26.319562,...,37.918624,44.350820,0.0,0.0,32.345243,0.0,61.668275,46.803694,57.962844,51.727958
academia_16532,0.0,59.517082,0.0,47.356329,0.0,0.0,0.0,0.0,0.0,26.326836,...,37.918202,44.349458,0.0,0.0,32.341940,0.0,61.653369,46.805416,57.957246,51.719125
academia_19511,0.0,59.519020,0.0,47.355030,0.0,0.0,0.0,0.0,0.0,26.323617,...,37.916037,44.350086,0.0,0.0,32.343738,0.0,61.654175,46.802306,57.963068,51.720277
academia_2326,0.0,59.532109,0.0,47.359698,0.0,0.0,0.0,0.0,0.0,26.339353,...,37.932543,44.353112,0.0,0.0,32.346871,0.0,61.656102,46.802125,57.956150,51.723913
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
writers_8554,0.0,59.520101,0.0,47.359004,0.0,0.0,0.0,0.0,0.0,26.318822,...,37.919541,44.357747,0.0,0.0,32.347593,0.0,61.671346,46.805611,57.966644,51.720433
writers_8710,0.0,59.516160,0.0,47.363710,0.0,0.0,0.0,0.0,0.0,26.330885,...,37.931612,44.352576,0.0,0.0,32.343978,0.0,61.654725,46.807602,57.958069,51.721671
writers_8948,0.0,59.521942,0.0,47.361556,0.0,0.0,0.0,0.0,0.0,26.323579,...,37.931312,44.352526,0.0,0.0,32.342858,0.0,61.655700,46.805192,57.961060,51.724454
writers_9029,0.0,59.521027,0.0,47.357353,0.0,0.0,0.0,0.0,0.0,26.330680,...,37.949084,44.367734,0.0,0.0,32.346626,0.0,61.656818,46.808086,57.956862,51.720110


### Personalized Qrels

In [199]:
# Personalized Qrels

# q_id: query id --> qid
# doc_id: document_id --> docno
# score_rel: score combined between personalized scores and retrieval scores --> relevance

# For each query we extract the most relevant document using the bm25 search

def combinationIn_Qrels(pers_df):
  q_ids = pers_df.index
  doc_ids = pers_df.columns

  qrels = []

  for q_id in q_ids:
    for doc_id in doc_ids:
      score_rel = pers_df.loc[q_id, doc_id]
      score_rel = int(round(score_rel))

      qrels.append([q_id, doc_id, score_rel])

  qrels_df = pd.DataFrame(qrels, columns=["qid", "docno", "relevance"])

  return qrels_df


In [200]:
qrels_df = combinationIn_Qrels(pers_df)
qrels_df

Unnamed: 0,qid,docno,relevance
0,academia_110662,academia_110666,0
1,academia_110662,academia_11573,60
2,academia_110662,academia_122342,0
3,academia_110662,academia_16379,47
4,academia_110662,academia_16534,0
...,...,...,...
1873358,writers_9318,writers_9321,0
1873359,writers_9318,writers_9419,62
1873360,writers_9318,writers_9527,47
1873361,writers_9318,writers_9833,58


### Applying the cross encoder with the personalized Qrels on the train set

In [201]:
# Keep only the trainQrels rows whose qid is in df_test
persQrels_df_forTest = qrels_df[qrels_df["qid"].isin(df_test["qid"])]

# Keep only the trainQrels rows whose qid is in df_train
persQrels_df_forTrain = qrels_df[qrels_df ["qid"].isin(df_train["qid"])]

In [202]:
# Experiment to find the best pipeline when using the expanded queries

pt.Experiment(
    [
        br,
        cross_pipeline,
        cross_sum_1_pipeline,  # .1*CrossEnc + .9*BM25
        cross_sum_2_pipeline,  # .2*CrossEnc + .8*BM25
        cross_sum_3_pipeline,  # .3*CrossEnc + .7*BM25
        cross_sum_4_pipeline,  # .4*CrossEnc + .6*BM25
        cross_sum_5_pipeline,  # .5*CrossEnc + .5*BM25
        cross_sum_6_pipeline,  # .6*CrossEnc + .4*BM25
        cross_sum_7_pipeline,  # .7*CrossEnc + .3*BM25
        cross_sum_8_pipeline,  # .8*CrossEnc + .2*BM25
        cross_sum_9_pipeline,  # .9*CrossEnc + .1*BM25
    ],
    df_train,  #apply them on the train set
    persQrels_df_forTrain,
    names=[
        'BM25',
        'CrossEnc',
        '.1*CrossEnc + .9*BM25',
        '.2*CrossEnc + .8*BM25',
        '.3*CrossEnc + .7*BM25',
        '.4*CrossEnc + .6*BM25',
        '.5*CrossEnc + .5*BM25',
        '.6*CrossEnc + .4*BM25',
        '.7*CrossEnc + .3*BM25',
        '.8*CrossEnc + .2*BM25',
        '.9*CrossEnc + .1*BM25'
    ],
    eval_metrics=["map", "P.5", "P.10", "ndcg", 'recall']
)

  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(


Unnamed: 0,name,map,P.5,P.10,ndcg,R@5,R@10,R@15,R@20,R@30,R@100,R@200,R@500,R@1000
0,BM25,0.006203,0.163659,0.164286,0.027384,0.000809,0.001623,0.002443,0.003276,0.004937,0.016737,0.016737,0.016737,0.016737
1,CrossEnc,0.006315,0.174937,0.174311,0.027754,0.000864,0.001722,0.002566,0.003347,0.005006,0.016737,0.016737,0.016737,0.016737
2,.1*CrossEnc + .9*BM25,0.006215,0.163158,0.164536,0.027404,0.000806,0.001626,0.002447,0.003272,0.004949,0.016737,0.016737,0.016737,0.016737
3,.2*CrossEnc + .8*BM25,0.00622,0.162155,0.164536,0.027415,0.000801,0.001626,0.002428,0.003263,0.004967,0.016737,0.016737,0.016737,0.016737
4,.3*CrossEnc + .7*BM25,0.006232,0.162406,0.165288,0.027434,0.000802,0.001633,0.002465,0.003278,0.004978,0.016737,0.016737,0.016737,0.016737
5,.4*CrossEnc + .6*BM25,0.00625,0.162406,0.165038,0.027485,0.000802,0.001631,0.002469,0.003317,0.00499,0.016737,0.016737,0.016737,0.016737
6,.5*CrossEnc + .5*BM25,0.006262,0.164411,0.166541,0.027503,0.000812,0.001646,0.002498,0.003373,0.004974,0.016737,0.016737,0.016737,0.016737
7,.6*CrossEnc + .4*BM25,0.006286,0.171178,0.168045,0.027574,0.000846,0.001661,0.00252,0.003361,0.00497,0.016737,0.016737,0.016737,0.016737
8,.7*CrossEnc + .3*BM25,0.006301,0.171679,0.169549,0.027631,0.000848,0.001675,0.002553,0.003398,0.004964,0.016737,0.016737,0.016737,0.016737
9,.8*CrossEnc + .2*BM25,0.006316,0.174687,0.172932,0.027708,0.000863,0.001709,0.002558,0.003395,0.004998,0.016737,0.016737,0.016737,0.016737


## Evaluation: Applying the cross encoder with the personalized Qrels on the test set

In [204]:
pt.Experiment(
    [
        br,
        cross_pipeline,
        cross_sum_8_pipeline, # Best pipeline
    ],
    df_test,  #apply them on the test set
    Qrels_df_forTest,
    names=[
        'BM25',
        'CrossEnc',
        '.8*CrossEnc + .2*BM25',
      ],
    eval_metrics=["map", "P.5", "P.10", "ndcg", 'recall']
)

Unnamed: 0,name,map,P.5,P.10,ndcg,R@5,R@10,R@15,R@20,R@30,R@100,R@200,R@500,R@1000
0,BM25,0.570942,0.131658,0.071357,0.630714,0.658291,0.713568,0.738693,0.758794,0.78392,0.859296,0.859296,0.859296,0.859296
1,CrossEnc,0.379263,0.094472,0.053266,0.473097,0.472362,0.532663,0.582915,0.623116,0.683417,0.859296,0.859296,0.859296,0.859296
2,.8*CrossEnc + .2*BM25,0.44609,0.108543,0.062814,0.530571,0.542714,0.628141,0.658291,0.693467,0.753769,0.859296,0.859296,0.859296,0.859296
