# Homework 4, Part 2

In Part 1, we saw how to create a bi-encoder to estimate the relevance of a query-document pair and generate these relevance scores. In Part 2, we'll see how to integrate those scores into a learning to rank (L2R) model with a few features.

For this part, you are going to:
1. Create a dataset that is ready to use with Pyterrier.
2. Integrate the cosine similarity scores you computed in Part 1 into the features of learning to rank models.


Learning goals for Homework 4, Part 2:
* Improve familiarity with installing and running Pyterrier code
* Learn how to use L2R models in Pyterrier
* Learn how to add custom features to L2R models with Pyterrier.
* Deepen your understanding of how different models perform in mixed-domain settings (e.g., text queries / code docs)


### Step 0: install things as needed

In case you didn't do any of Homework 3 (which was extra credit), please be sure to have the following libraries installed and ready. Run each installation command below as needed.

In [1]:
!pip install fastrank
!pip install lightgbm



# Task 1: Creating a dataset with precomputed features

## Task 1.1

Load the dataset used for evaluation, which is in `final_evaluation_set.csv`, as a pandas data frame. Then print the number of unique queries (99) and the number of unique code-documents (958) to verify it was loaded correctly.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("final_evaluation_set.csv")
print(len(df['Query'].unique()))
print(len(df['code'].unique()))

99
958


In [4]:
df

Unnamed: 0.1,Unnamed: 0,Language,Query,GitHubUrl,code,relevance
0,54,Python,write csv,https://github.com/sentinel-hub/sentinelhub-py...,"def write_csv(filename, data, delimiter=CSV_DE...",3
1,176,Python,write csv,https://github.com/jordanjoz1/flickr-views-cou...,"def write_to_csv(fname, header, rows): with op...",3
2,357,Python,write csv,https://github.com/fastai/fastai/blob/9fb84a5c...,def write_csv(self): # Get first element's fil...,3
3,615,Python,write csv,https://github.com/majerteam/sylk_parser/blob/...,"def to_csv(self, fbuf, quotechar='""', delimite...",3
4,722,Python,write csv,https://github.com/gem/oq-engine/blob/8294553a...,"def write_to_csv(self, filename): ''' Exports ...",3
...,...,...,...,...,...,...
966,1145,Python,aes encryption,https://github.com/konomae/lastpass-python/blo...,"def decode_aes256(cipher, iv, data, encryption...",3
967,1162,Python,aes encryption,https://github.com/jcassee/django-geckoboard/b...,"def _encrypt(data): """"""Equivalent to OpenSSL u...",3
968,1359,Python,aes encryption,https://github.com/ontio/ontology-python-sdk/b...,"def aes_cbc_encrypt(plain_text: bytes, key: by...",3
969,14,Python,aes encryption,https://github.com/etingof/pysnmp/blob/cde062d...,"def encryptData(self, encryptKey, privParamete...",0


## Task 1.2: Creating an index  (5 points)

Since the code documents are text, we can still create an index to store them (just like regular documents before). Before, we mostly used pre-built indices or loaded them from file. In this part, you'll see how to create your own index from a pandas dataframe. 

The rough steps are as follows:
* Start pyterrier
* Map each unique code document to a unique string identifier (keep this around in a dictionary!)
* Create a pandas DataFrame of each unique code-document with two columns:
  * `text` containing the contents of the code-document 
  * `docid` a unique string identifier for that code-document
* Use pyterrier's [`DFIndexer`](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html) to create an index from the data frame.

Once you're finished with these steps, print the collection statistics, which should look something like this:
```
Number of documents: 958
Number of terms: 4929
Number of postings: 26358
Number of fields: 0
Number of tokens: 65017
Field names: []
Positions:   false
```

In [5]:
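# TODO: Set this based on where Java is installed.
# Note: `!export` runs in a throwaway subshell, so it does not affect this kernel.
# If pt.init() cannot find Java, set the variable from Python instead, e.g.:
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-18-openjdk-amd64/"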

!pip install python-terrier
import pyterrier as pt
import pandas as pd
import os
if not pt.started():
    pt.init()


PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [6]:
a = dict(enumerate(df['code'].unique()))
df1 = pd.DataFrame(a.items())
df1=df1.rename(columns={0: "docno", 1: "text"}) #note: in order to avoid java errors, we are using 'docno' instead of 'docid' here
df1['docno'] = df1['docno'].apply(str)
df1['docno']

0        0
1        1
2        2
3        3
4        4
      ... 
953    953
954    954
955    955
956    956
957    957
Name: docno, Length: 958, dtype: object

In [7]:
index_dir = './df1'
indexer = pt.DFIndexer(index_dir)
index_ref = indexer.index(df1["text"], df1["docno"])
index_ref.toString()
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 958
Number of terms: 4929
Number of postings: 26358
Number of fields: 0
Number of tokens: 65017
Field names: []
Positions:   false



## Task 1.3: Preparing the query data

We'll be using Pyterrier's `Experiment` framework to do our evaluation so we'll need to organize our queries in the test set into a pandas `DataFrame`. Create a new dataframe for all unique queries with two columns:
* `query` the text of the query
* `qid` a unique string identifier for that query

In [8]:
test_data = pd.read_csv("final_evaluation_set.csv")
a = dict(enumerate(test_data['Query'].unique()))
test = pd.DataFrame(a.items())
test=test.rename(columns={0: "qid", 1: "query"})
test['qid'] = test['qid'].apply(str)
test

Unnamed: 0,qid,query
0,0,write csv
1,1,unzipping large files
2,2,unique elements
3,3,underline text in label widget
4,4,string to date
...,...,...
94,94,concatenate several file remove header lines
95,95,buffered file reader read text
96,96,binomial distribution
97,97,all permutations of a list


In [9]:
test_data

Unnamed: 0.1,Unnamed: 0,Language,Query,GitHubUrl,code,relevance
0,54,Python,write csv,https://github.com/sentinel-hub/sentinelhub-py...,"def write_csv(filename, data, delimiter=CSV_DE...",3
1,176,Python,write csv,https://github.com/jordanjoz1/flickr-views-cou...,"def write_to_csv(fname, header, rows): with op...",3
2,357,Python,write csv,https://github.com/fastai/fastai/blob/9fb84a5c...,def write_csv(self): # Get first element's fil...,3
3,615,Python,write csv,https://github.com/majerteam/sylk_parser/blob/...,"def to_csv(self, fbuf, quotechar='""', delimite...",3
4,722,Python,write csv,https://github.com/gem/oq-engine/blob/8294553a...,"def write_to_csv(self, filename): ''' Exports ...",3
...,...,...,...,...,...,...
966,1145,Python,aes encryption,https://github.com/konomae/lastpass-python/blo...,"def decode_aes256(cipher, iv, data, encryption...",3
967,1162,Python,aes encryption,https://github.com/jcassee/django-geckoboard/b...,"def _encrypt(data): """"""Equivalent to OpenSSL u...",3
968,1359,Python,aes encryption,https://github.com/ontio/ontology-python-sdk/b...,"def aes_cbc_encrypt(plain_text: bytes, key: by...",3
969,14,Python,aes encryption,https://github.com/etingof/pysnmp/blob/cde062d...,"def encryptData(self, encryptKey, privParamete...",0


## Task 1.4: Preparing the Evaluation data

In the final step, we'll create a single data frame that contains the queries, documents, and true relevance scores, which we'll use to evaluate our models using `pt.Experiment`. Your dataframe should have three columns:
* `qid` the unique string identifier for a query
* `docno` the unique string identifier for a code-document
* `label` the relevance score for that query-document pair

In [10]:
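# Look up the qid and docno for every labeled row instead of scanning all
# (query, document, row) triples; DataFrame.append was removed in pandas 2.0.
qid_of = dict(zip(test["query"], test["qid"]))
docno_of = dict(zip(df1["text"], df1["docno"]))
qrels = pd.DataFrame({
    "qid": test_data["Query"].map(qid_of),
    "docno": test_data["code"].map(docno_of),
    "label": test_data["relevance"],
})
# Keep the first label per (qid, docno) pair and order rows by numeric id,
# matching what the nested loops produced.
qrels = (qrels.drop_duplicates(["qid", "docno"])
              .sort_values(["qid", "docno"], key=lambda s: s.astype(int))
              .reset_index(drop=True))

qrels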

Unnamed: 0,qid,docno,label
0,0,0,3
1,0,1,3
2,0,2,3
3,0,3,3
4,0,4,3
...,...,...,...
966,98,953,3
967,98,954,3
968,98,955,3
969,98,956,0


In [11]:
qrels["label"] = pd.to_numeric(qrels["label"])

# Task 2: Learning to Rank

The steps in Task 2 will have you running some evaluations and setting up a Learning to Rank model that we'll extend later to incorporate the bi-encoder features.

First, we'll split our labeled query-document data into train, development, and test sets so we can train models and evaluate unsupervised models.

In [12]:
SEED=42
from sklearn.model_selection import train_test_split
tr_va_topics, test_topics = train_test_split(test, test_size=30, random_state=SEED)
train_topics, valid_topics =  train_test_split(tr_va_topics, test_size=10, random_state=SEED)
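
As a sanity check on the split sizes: with 99 queries, holding out 30 for test leaves 69, and holding out 10 of those for validation leaves 59 for training. The sketch below reproduces that arithmetic with a plain shuffle. `split_topics` is a hypothetical helper that only stands in for the two `train_test_split` calls; it will not reproduce sklearn's exact assignment of queries to splits.

```python
import random

def split_topics(qids, n_test, n_valid, seed=42):
    """Shuffle query ids and carve off fixed-size test and validation sets.

    Hypothetical stand-in for two chained train_test_split calls; the
    specific ids in each split will differ from sklearn's.
    """
    shuffled = list(qids)
    random.Random(seed).shuffle(shuffled)
    test_ids = shuffled[:n_test]
    valid_ids = shuffled[n_test:n_test + n_valid]
    train_ids = shuffled[n_test + n_valid:]
    return train_ids, valid_ids, test_ids

train_ids, valid_ids, test_ids = split_topics(
    [str(i) for i in range(99)], n_test=30, n_valid=10
)
print(len(train_ids), len(valid_ids), len(test_ids))  # 59 10 30
```

On the real data, `len(train_topics)`, `len(valid_topics)`, and `len(test_topics)` should give the same three numbers.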

## Task 3.1: Test baseline models (5 points)

In this initial step, create two `BatchRetrieve` rankers, one using "BM25" and one using "TF_IDF", and run a `pt.Experiment` with them on the code index, using "map" and "ndcg" to evaluate their performance. We'll evaluate these only on the test data (no hyperparameter fine-tuning).

In [13]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
# Note: `test` holds all 99 queries; pass `test_topics` to score only the 30 held-out queries.
pt.Experiment(
    [bm25,tfidf],
    test,qrels,
    eval_metrics=["map","ndcg"])

| | name | map | ndcg |
|---|---|---|---|
| 0 | BR(BM25) | 0.760663 | 0.823839 |
| 1 | BR(TF_IDF) | 0.760833 | 0.826133 |
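
To make the two metrics concrete, here is a minimal sketch of average precision and NDCG for a single query, given the relevance labels of the documents in ranked order. It assumes linear gain (the label itself) with a `log2(rank + 1)` discount and treats any label above 0 as relevant for AP; evaluation tools differ on these conventions, so the numbers may not match `pt.Experiment` exactly. The function names are illustrative and not part of Pyterrier.

```python
import math

def average_precision(ranked_labels):
    """AP for one query; a label > 0 counts as relevant.

    Assumes every relevant document appears somewhere in the ranking.
    """
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0

def ndcg(ranked_labels, k=None):
    """NDCG with linear gain and a log2(rank + 1) discount."""
    def dcg(labels):
        return sum(l / math.log2(r + 1) for r, l in enumerate(labels, start=1))

    ideal = sorted(ranked_labels, reverse=True)  # best possible ordering
    if k is not None:
        ranked_labels, ideal = ranked_labels[:k], ideal[:k]
    best = dcg(ideal)
    return dcg(ranked_labels) / best if best else 0.0

# A ranking that places a non-relevant document between two relevant ones.
print(average_precision([3, 0, 3]))
print(ndcg([3, 0, 3]))
```

"map" is the mean of the per-query AP values, and "ndcg_cut_10" corresponds to calling `ndcg` with `k=10`.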


## Task 3.2: Creating our first pipeline (10 points)

Let's start getting more complex with our pipelines. Create a feature pipeline that has three features:
1.   the BM25 code score;
2.   the TF-IDF code score;
3.   the coordinate match score for the query--i.e. how many query terms appear in the code;

We'll use these features later in learning to rank.

In [14]:
ltr_feats1= pt.BatchRetrieve(index) >>(
    bm25
    **
    tfidf
    ** 
    pt.BatchRetrieve(index, wmodel="CoordinateMatch")
)
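
For intuition about the third feature, the sketch below computes a simplified coordinate match score: the number of distinct query terms that also occur in the code. It lowercases and splits on non-alphanumeric characters, which is an assumption made for illustration; Terrier's own tokenization, stopword removal, and stemming are not reproduced, so its scores can differ.

```python
import re

def coordinate_match(query, code):
    """Count how many distinct query terms occur in the code text.

    Simplified tokenization: lowercase, then split on anything that is
    not a letter or digit (so `write_csv` yields `write` and `csv`).
    """
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    return len(tokenize(query) & tokenize(code))

print(coordinate_match("write csv", "def write_csv(filename, data): ..."))  # 2
```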


## Setting up the Learning to Rank (L2R) models

For the next part, you won't need to write any code (we've done it for you) but you will need to run the cells to train a few different kinds of L2R models on the training set. Each of the models captures a different kind of L2R that we talked about.

Train the following three models on our training set:
 - random forests from `scikit-learn`, a pointwise regression tree technique
 - coordinate ascent from FastRank, a listwise linear technique
 - LambdaMART from LightGBM, a listwise regression tree technique

In [15]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=400, verbose=1, random_state=SEED, n_jobs=2)

rf_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(rf)

%time rf_pipe.fit(train_topics, qrels)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.5s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    1.9s


CPU times: user 11.7 s, sys: 336 ms, total: 12.1 s
Wall time: 6.97 s


[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    3.8s finished


In [16]:
import fastrank
train_request = fastrank.TrainRequest.coordinate_ascent()

params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567

ca_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(train_request, form='fastrank')

%time ca_pipe.fit(train_topics, qrels)

CPU times: user 6.06 s, sys: 133 ms, total: 6.19 s
Wall time: 4.17 s


In [17]:
import lightgbm as lgb

# this configures LightGBM as LambdaMART
lmart_l = lgb.LGBMRanker(
    task="train",
    silent=False,
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=1,
    max_bin=255,
    num_leaves=31,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[10],
    ndcg_at=[10],
    eval_at=[10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=100,
    early_stopping_rounds=5
)

lmart_x_pipe = ltr_feats1 >> pt.ltr.apply_learned_model(lmart_l, form="ltr", fit_kwargs={'eval_at':[20]})

%time lmart_x_pipe.fit(train_topics, qrels, valid_topics, qrels)

[1]	valid_0's ndcg@20: 0.741616
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's ndcg@20: 0.722364
[3]	valid_0's ndcg@20: 0.675198
[4]	valid_0's ndcg@20: 0.670466
[5]	valid_0's ndcg@20: 0.663902
[6]	valid_0's ndcg@20: 0.665603
Early stopping, best iteration is:
[1]	valid_0's ndcg@20: 0.741616
CPU times: user 4.96 s, sys: 92 ms, total: 5.06 s
Wall time: 3.09 s
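
The log above shows early stopping at work: the validation score never beats round 1, so training halts after five more rounds. A minimal sketch of that rule, using the logged NDCG@20 values, is below. It is a simplified model of the behaviour and not LightGBM's actual implementation; `early_stop` is a hypothetical helper.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round), both 1-indexed.

    Training stops once `patience` consecutive rounds fail to beat the
    best validation score seen so far.
    """
    best_round, best_score = 0, float("-inf")
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:
            return best_round, round_no
    return best_round, len(scores)

# Validation NDCG@20 values copied from the training log above.
logged = [0.741616, 0.722364, 0.675198, 0.670466, 0.663902, 0.665603]
print(early_stop(logged, patience=5))  # (1, 6)
```

With only 10 validation queries, a best iteration of 1 is a hint that the validation signal is noisy, which is worth keeping in mind when reading the LambdaMART results.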




## Task 3.4: Comparing L2R performance (10 points)

Now that we have all of our models, let's compare them with the baselines we had before. Run another `Experiment` that compares the three L2R models with the two baselines (BM25 and tf-idf). This time, we'll add "ndcg_cut_10" to see their performance on just the top 10 docs and "mrt" to see how fast the models are.

In [18]:
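# Note: `test` holds all 99 queries, including those used for training;
# pass `test_topics` to score only the 30 held-out queries.
pt.Experiment(
    [bm25, tfidf,rf_pipe,ca_pipe,lmart_x_pipe],
    test,qrels,
    eval_metrics=["map","ndcg", "ndcg_cut_10","mrt"],
    names=["BM25","tf-idf","random forest","fastrank","LambdaMart"])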

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.2s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    0.4s finished


| | name | map | ndcg | ndcg_cut_10 | mrt |
|---|---|---|---|---|---|
| 0 | BM25 | 0.760663 | 0.823839 | 0.739832 | 7.774374 |
| 1 | tf-idf | 0.760833 | 0.826133 | 0.73987 | 4.095725 |
| 2 | random forest | 0.838264 | 0.895204 | 0.837209 | 41.274258 |
| 3 | fastrank | 0.764237 | 0.831876 | 0.741953 | 35.042528 |
| 4 | LambdaMart | 0.596328 | 0.648062 | 0.499968 | 35.235348 |


# Task 4: Incorporating new features

We didn't expect those approaches to do too well, since queries might not reflect the content in the code-documents. But our bi-encoder model knows how to compare both! In Task 4's steps, you'll incorporate its relevance predictions into the model as another feature.

**Note**: For your course projects, if you use Pyterrier, this code should give you some idea of how to incorporate ranking features (or other information) that you've calculated from elsewhere.

## Task 4.1 Loading in the precomputed relevance data

Read in the dataframe with the bi-encoder's estimated relevance scores for each query-document pair (i.e., its cosine similarity), which we produced in Part 1. The length of the dataframe should be (number of unique queries) * (number of unique documents).

In [21]:
relevance = pd.read_csv('relevance_scores.csv')
relevance=relevance[['Query_id', 'Doc_id','sim']]
print(len(relevance))
relevance

94842


Unnamed: 0,Query_id,Doc_id,sim
0,0,0,0.934156
1,0,1,0.938960
2,0,2,0.933673
3,0,3,0.897763
4,0,4,0.938280
...,...,...,...
94837,98,947,-0.088894
94838,98,948,0.185306
94839,98,949,-0.103721
94840,98,956,0.724715


In [23]:
relevance.dtypes

Query_id      int64
Doc_id        int64
sim         float64
dtype: object
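
The expected length is 99 queries × 958 documents = 94,842 rows, which matches the count printed above. If the count were off, a quick way to find the gaps is to compare the pairs present against the full cross product. The sketch below shows the idea on toy data; `missing_pairs` is a hypothetical helper and the toy pairs are made up.

```python
from itertools import product

def missing_pairs(pairs, query_ids, doc_ids):
    """Return the (query_id, doc_id) combinations absent from `pairs`."""
    return sorted(set(product(query_ids, doc_ids)) - set(pairs))

# Toy data (hypothetical): 2 queries x 3 documents with one pair left out.
toy_pairs = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)]
print(missing_pairs(toy_pairs, range(2), range(3)))  # [(1, 1)]

# Expected row count for the real data.
print(99 * 958)  # 94842
```

On the real frame, `pairs` could be built with `zip(relevance["Query_id"], relevance["Doc_id"])`.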

## Task 4.2: Adding new features (10 points)

Once we have our bi-encoder estimates, we'll create a new pipeline that adds the score as a new feature. Recall that a Pyterrier [Pipeline](https://pyterrier.readthedocs.io/en/latest/pipeline_examples.html) is a transformation on a pandas `DataFrame` object. For us, that means we can write a function that operates on each row of the data frame and use pyterrier's [`apply`](https://pyterrier.readthedocs.io/en/latest/apply.html) (which is much like pandas' [`apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)). Specifically, we'll write some code that, for a given row with a document and query, looks up the precomputed relevance score.

While there are many ways to do this, your steps should probably look something like this:
* Create some data structure that can map a tuple of the query id and document id to the bi-encoder's relevance score
* Write a function that takes in a row from a `DataFrame` and uses the query id and document id in the row's columns to look up the bi-encoder's relevance.
* Copy and extend your earlier pipeline by adding one new feature that uses pyterrier's `apply` function with your new function. Call this new pipeline `bienc_ltr_feats` so the later training functions can use it

Once you have this pipeline in place, use the code below to retrain the models. 

Add the cosine similarity between the query and code embeddings as a feature in the feature pipeline. Train the three models and run the experiments again.

In [25]:
# Map each (query id, document id) pair to the bi-encoder's cosine similarity
dic = dict(zip(zip(relevance["Query_id"], relevance["Doc_id"]), relevance["sim"]))

def func(row):
  # qid and docno are strings in the pipeline but integers in `relevance`;
  # look them up by column name rather than by position
  return dic[(int(row["qid"]), int(row["docno"]))]

bienc_ltr_feats = pt.BatchRetrieve(index) >>(
    tfidf
    **
    bm25
    ** 
    pt.BatchRetrieve(index, wmodel="CoordinateMatch")
    **
    pt.apply.doc_score(func)
)
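
The lookup above raises a `KeyError` if a query-document pair has no precomputed score. One possible variation, sketched below, returns a default instead. The helper name `make_score_lookup` and the default of 0.0 are assumptions for illustration; whether a silent default is preferable to failing loudly depends on how complete the score file is.

```python
def make_score_lookup(scores, default=0.0):
    """Build a row -> score function over a {(qid, docno): score} dict.

    `default` is returned for pairs with no precomputed score; 0.0 is an
    arbitrary choice for illustration.
    """
    def lookup(row):
        key = (int(row["qid"]), int(row["docno"]))
        return scores.get(key, default)
    return lookup

# Toy scores (hypothetical values); rows mimic the string ids used in the pipeline.
lookup = make_score_lookup({(0, 0): 0.93, (0, 1): 0.12})
print(lookup({"qid": "0", "docno": "1"}))  # 0.12
print(lookup({"qid": "5", "docno": "7"}))  # 0.0
```

Because the returned function only needs key access to `qid` and `docno`, it could be passed to `pt.apply.doc_score` in place of `func`.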


In [26]:
rf = RandomForestRegressor(n_estimators=400, verbose=1, random_state=SEED, n_jobs=2)
rf_pipe = bienc_ltr_feats >> pt.ltr.apply_learned_model(rf)
rf_pipe.fit(train_topics, qrels)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.7s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    2.7s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    5.5s finished


In [27]:
train_request = fastrank.TrainRequest.coordinate_ascent()
params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567
ca_pipe = bienc_ltr_feats >> pt.ltr.apply_learned_model(train_request, form='fastrank')
ca_pipe.fit(train_topics, qrels)

In [28]:
lmart_l = lgb.LGBMRanker(
    task="train",
    silent=False,
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=1,
    max_bin=255,
    num_leaves=31,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[10],
    ndcg_at=[10],
    eval_at=[10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=100,
    early_stopping_rounds=5
)
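# Rebuild the pipeline on the new features. Refitting the old `lmart_x_pipe` would retrain on
# `ltr_feats1` only, leaving the LambdaMART scores identical to the three-feature run
# (as in the output recorded below).
lmart_x_pipe = bienc_ltr_feats >> pt.ltr.apply_learned_model(lmart_l, form="ltr", fit_kwargs={'eval_at':[20]})
lmart_x_pipe.fit(train_topics, qrels, valid_topics, qrels)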

[1]	valid_0's ndcg@20: 0.741616
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's ndcg@20: 0.722364
[3]	valid_0's ndcg@20: 0.675198
[4]	valid_0's ndcg@20: 0.670466
[5]	valid_0's ndcg@20: 0.663902
[6]	valid_0's ndcg@20: 0.665603
Early stopping, best iteration is:
[1]	valid_0's ndcg@20: 0.741616




## Task 4.3 Re-run the experiment here using the new features! (10 points)

In [29]:
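# Note: `test` holds all 99 queries, including those used for training;
# pass `test_topics` to score only the 30 held-out queries.
pt.Experiment(
    [bm25, rf_pipe,ca_pipe,lmart_x_pipe],
    test,qrels,
    eval_metrics=["map","ndcg", "ndcg_cut_10","mrt"],
    names=["BM25","random forest","fastrank","LambdaMart"])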

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.2s
[Parallel(n_jobs=2)]: Done 400 out of 400 | elapsed:    0.3s finished


| | name | map | ndcg | ndcg_cut_10 | mrt |
|---|---|---|---|---|---|
| 0 | BM25 | 0.760663 | 0.823839 | 0.739832 | 4.346506 |
| 1 | random forest | 0.881997 | 0.917861 | 0.875782 | 52.10234 |
| 2 | fastrank | 0.802718 | 0.858604 | 0.788942 | 44.102675 |
| 3 | LambdaMart | 0.596328 | 0.648062 | 0.499968 | 33.923937 |


# _Optional_: Evaluating the different models (20 points total; this is part 2)

How much training does the model actually need to recognize relevance? Would one epoch be enough? What if we did 10? or 100? (100 might be too many for Great Lakes limits...). In this **optional part**, we'll describe a series of steps you can take to explore this part!
 
The instructions in Part 1 had you update that notebook to save the model after each epoch and then generate relevance predictions for each, saving those to a file. In Part 2, we'll load those files and compare the performance:
 
Here's what you need to do:
* Using the code from the blocks above, create new versions of the test data DataFrame that have predictions from each trained bi-encoder model (i.e., you should have predictions from the model trained on one epoch worth of data, predictions from the model trained on two epochs, etc.).
* Retrain each L2R model using each of these new features, using just one feature at a time. This should give you (number of L2R models) * (number of different-epoch-trained-biencoder-models) worth of results.
* Create a line plot where
  * the x-axis is the number of epochs the bi-encoder model was trained
  * the y-axis is NDCG_cut_10
  * there are different lines for each L2R model (with different colors/hues for each model)
 
This plot should show you how much the bi-encoder's training time influences the scores. Compare that with the F1 performance plot you produced for Part 1. Does increasing F1 performance lead to increasing NDCG@10? How many epochs do you think you need to train to maximize performance?
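
One way to organize the results for the line plot is sketched below: collect one `(model, epochs, ndcg_cut_10)` record per experiment, group the records into a sorted series per L2R model, and draw one line per series. Both helper names are hypothetical, the plotting function assumes matplotlib is installed, and the records shown are placeholders that only illustrate the data shape, not real results.

```python
from collections import defaultdict

def group_by_model(results):
    """Turn (model, epochs, ndcg_cut_10) records into one sorted series per model."""
    series = defaultdict(list)
    for model, epochs, score in results:
        series[model].append((epochs, score))
    return {model: sorted(points) for model, points in series.items()}

def plot_series(series):
    """Draw one line per L2R model; requires matplotlib."""
    import matplotlib.pyplot as plt
    for model, points in series.items():
        xs, ys = zip(*points)
        plt.plot(xs, ys, marker="o", label=model)
    plt.xlabel("Bi-encoder training epochs")
    plt.ylabel("NDCG@10")
    plt.legend()
    plt.show()

# Placeholder records only: these scores are made up to show the data shape.
placeholder = [
    ("random forest", 2, 0.2),
    ("random forest", 1, 0.1),
    ("fastrank", 1, 0.3),
]
series = group_by_model(placeholder)
print(series["random forest"])  # [(1, 0.1), (2, 0.2)]
```

Calling `plot_series(series)` on the real records would then produce the requested figure, with one colored line per L2R model.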

**TODO:** For full credit, submit a separate doc/pdf with the plots from Parts 1 and 2 and a short paragraph describing your observations on the performance (see the questions above).