<a href="https://colab.research.google.com/github/VARSHAJOSHY/multi-lingual-stance-dataset/blob/main/IR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Retrieval Exercise 3 Notebook


This is the template notebook for Exercise 3. The specification for the exercise and the corresponding Exercise 3 Quiz submission instance are available on the Moodle page of the course.

This exercise builds upon Exercise 2, and assumes that you are now familiar with concepts we have introduced in both Exercise 1 and Exercise 2, including:
 - [PyTerrier operators](https://pyterrier.readthedocs.io/en/latest/operators.html)
 - [PyTerrier apply transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 - [PyTerrier pt.Experiment()](https://pyterrier.readthedocs.io/en/latest/experiments.html)


## PyTerrier Setup

First, let's install PyTerrier as usual.

In [None]:
%pip install -q python-terrier lightgbm==2.2.3

Let's start PyTerrier:

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

# we require a specific version of LightGBM for this exercise
import lightgbm
assert lightgbm.__version__ == '2.2.3'

## Index, Topics & Qrels for Exercise 3

You will need your login & password credentials from Exercise 2. We will again be using the "50pct" and "trec-wt-2004" datasets from Exercise 2.


In [None]:
UNAME="2699662j"
PWORD="ca0648c4"

# we will again be using the "50pct" and "trec-wt-2004" datasets
Fiftypct = pt.get_dataset("50pct",  user=UNAME, password=PWORD)
dotgov_topicsqrels = pt.get_dataset("trec-wt-2004")

On the other hand, you will be using a slightly updated index for Exercise 3. It is a bit bigger than the Exercise 2 index, hence it takes about 2-3 minutes to download to Colab.


In [None]:

indexref = Fiftypct.get_index(variant="ex3")
index = pt.IndexFactory.of(indexref)


15:18:52.742 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 860.9 MiB of memory would be required.


Let's check out the new index. Compared to the index we used for Exercise 2, you can see that this index has `Field Names: [TITLE, ELSE]`, which means that we can provide statistics about how many times each term occurs in the title of each document (the "TITLE" field), vs the rest of the document (the "ELSE" field). Refer to Lecture 8 for more information about fields.
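As a rough illustration of what per-field statistics mean, the sketch below counts how often a term occurs in the title of a toy document versus the rest of its text. The document, the whitespace tokenisation and the `field_term_frequencies` helper are all hypothetical; this is not how Terrier stores or queries its index.

```python
from collections import Counter

def field_term_frequencies(doc: dict) -> dict:
    """Count term occurrences separately for each field of a toy document."""
    return {field: Counter(text.lower().split()) for field, text in doc.items()}

# Hypothetical two-field document, mirroring the TITLE / ELSE split.
toy_doc = {
    "TITLE": "Chemical Reactions Database",
    "ELSE": "A database of chemical reactions and chemical kinetics data",
}

tf = field_term_frequencies(toy_doc)
# Frequency of "chemical" in the title, then in the rest of the document.
print(tf["TITLE"]["chemical"], tf["ELSE"]["chemical"])
```

A field-aware weighting model can then treat a match in the title differently from a match in the body.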

Let's also display the keys in the meta index - this is the metadata that we have stored for each document. You can see that we are storing the "url" and the "body" (content) of the document. These will particularly come in handy for Q2 and Q3 of Exercise 3, respectively.


In [None]:
print(index.getCollectionStatistics())
print("In the meta index: " + str(index.getMetaIndex().getKeys()))

Number of documents: 807775
Number of terms: 2043788
Number of postings: 177737957
Number of fields: 2
Number of tokens: 572916194
Field names: [TITLE, ELSE]
Positions:   true

In the meta index: ['docno', 'url', 'title', 'body']


Finally, these are all of the topics and qrels (including the training and validation datasets) that you will need to conduct Exercise 3.

In [None]:
tr_topics = Fiftypct.get_topics("training")
va_topics = Fiftypct.get_topics("validation")

tr_qrels = Fiftypct.get_qrels("training")
va_qrels = Fiftypct.get_qrels("validation")

test_topics = dotgov_topicsqrels.get_topics("hp")
test_qrels = dotgov_topicsqrels.get_qrels("hp")

## Baseline Setup

We introduce here the BatchRetrieve for our baseline. Note that:
 - We are using PL2 as our weighting model to generate the sample (the candidate set of documents to re-rank).
 - We expose more document metadata, namely "url" and "body" for each document retrieved, which you will need to deploy your two new features.
 - By setting `verbose=True`, we display a progress bar while retrieval executes.

In [None]:
firstpassUB = pt.BatchRetrieve(index, wmodel="PL2", metadata=["docno", "url", "body"], verbose=True)

Let's see the resulting output: there are now "url" and "body" attributes for each retrieved document. (The progress bar is enabled by `verbose=True`.)

In [None]:
firstpassUB.search("chemical reactions").head(5)

BR(PL2):   0%|          | 0/1 [00:00<?, ?q/s]

Unnamed: 0,qid,docid,docno,url,body,rank,score,query
0,1,513586,G18-38-1767991,http://www.boulder.nist.gov/div838/tar/file03....,NIST - Physical and Chemical Properties Divi...,0,12.755546,chemical reactions
1,1,38544,G01-14-2537005,http://www.labtrain.noaa.gov/shemtfa/chemhaz/n...,. ...,1,11.906524,chemical reactions
2,1,707122,G26-06-3754605,http://www.aps.anl.gov/xfd/tech/safetyenvelope...,APS Experiment Safety Envelope 6: Chemicals ...,2,11.87755,chemical reactions
3,1,382754,G13-59-3981168,http://response.restoration.noaa.gov/chemaids/...,"""); } else { document.write(...",3,11.858475,chemical reactions
4,1,70292,G02-16-2617043,http://www.symp14.nist.gov/PDF/COR04MAY.PDF,A Database of Chemical Reactions Designed to A...,4,11.73149,chemical reactions


# Standard list of features

Let's introduce the list of features we need to deploy a baseline learning-to-rank approach.

In [None]:
pagerankfile = indexref + "/data-pagerank.oos"
features = [
    "SAMPLE", #ie PL2
    "WMODEL:SingleFieldModel(BM25,0)", #BM25 title
    "QI:StaticFeature(OIS,%s)" % pagerankfile,
]

stdfeatures = pt.FeaturesBatchRetrieve(index, features, verbose=True)
stage12 = firstpassUB >> stdfeatures

This is our feature set. We will be using FeaturesBatchRetrieve to compute these extra features on the fly. Let's see the output. You can see that there is now a "features" column.

In [None]:
df = stage12.search("chemical reactions").head(2)
df

BR(PL2):   0%|          | 0/1 [00:00<?, ?q/s]

FBR(3 features):   0%|          | 0/1 [00:00<?, ?q/s]

Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,chemical reactions,513586,0,"[12.755545561073266, 3.0924078763629836, 0.000...",G18-38-1767991,12.755546
1,1,chemical reactions,38544,1,"[11.90652405775751, 10.789390732195702, 0.0002...",G01-14-2537005,11.906524


Let's look in more detail at the features. There are 3 numbers for each document. The first is the PL2 score (1.27555456e+01 == 12.7555), the second is the BM25 score, and the third is the PageRank (a link analysis feature, discussed in more detail in Lecture 10).
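To make the ordering explicit, a minimal sketch that pairs each position in a feature vector with a label is shown below. The `name_features` helper and the labels are hypothetical, and the example vector uses illustrative values rather than exact figures from the index.

```python
# Labels follow the order of the `features` list: sample, BM25 on the title field, PageRank.
FEATURE_NAMES = ["PL2 (sample)", "BM25 on TITLE", "PageRank"]

def name_features(vector):
    """Pair each feature value with its label, in order."""
    if len(vector) != len(FEATURE_NAMES):
        raise ValueError("expected %d features, got %d" % (len(FEATURE_NAMES), len(vector)))
    return dict(zip(FEATURE_NAMES, vector))

# Illustrative values only (hypothetical, not read from the index).
named = name_features([12.76, 3.09, 0.0002])
for label, value in named.items():
    print(label, value)
```

Keeping such a list of labels next to the pipeline definition makes it easier to interpret feature importances later on.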

# Q1

You now have everything you need to attempt Q1. You will need to refer to the specification, and to PyTerrier's [learning to rank documentation](https://pyterrier.readthedocs.io/en/latest/ltr.html).

You should use a LightGBM LambdaMART implementation (*not* XGBoost), instantiated using the configuration suggested in the PyTerrier documentation.

Hints:
 - You will need to use the provided separate “training” and “validation” topic sets and qrels to train the learning-to-rank.
 - There is no need to vary the configuration of LightGBM from that in the documentation.

In [None]:
#YOUR SOLUTION
import lightgbm as lgb

lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)

lmart_l_pipe = stage12 >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe.fit(tr_topics, tr_qrels, va_topics, va_qrels)

#MAP performance of the LTR model using the 3 features where PL2 was used to generate the sample
df = pt.Experiment(
    [firstpassUB, lmart_l_pipe],
    test_topics,
    test_qrels,
    eval_metrics = ["map"],
    round = {'map':4},
    names=["PL2 Baseline",  "LTR Baseline" ],
    baseline=0
)
print(df)

#P@5 performance of the LTR model using the 3 features where PL2 was used to generate the sample.
df = pt.Experiment(
    [firstpassUB, lmart_l_pipe],
    test_topics,
    test_qrels,
    eval_metrics = ["P.5"],
    round = {"P.5":4},
    names=["PL2 Baseline",  "LTR Baseline" ],
    baseline=0
)
print(df)


[1]	valid_0's ndcg@1: 0.277778
[2]	valid_0's ndcg@1: 0.351852
[3]	valid_0's ndcg@1: 0.388889
[4]	valid_0's ndcg@1: 0.407407
[5]	valid_0's ndcg@1: 0.407407
[6]	valid_0's ndcg@1: 0.388889
[7]	valid_0's ndcg@1: 0.388889
[8]	valid_0's ndcg@1: 0.388889
[9]	valid_0's ndcg@1: 0.388889
[10]	valid_0's ndcg@1: 0.388889


           name     map  map +  map -  map p-value
0  PL2 Baseline  0.2251    NaN    NaN          NaN
1  LTR Baseline  0.4000   46.0   14.0     0.000123


           name     P.5  P.5 +  P.5 -  P.5 p-value
0  PL2 Baseline  0.0693    NaN    NaN          NaN
1  LTR Baseline  0.1173   23.0    6.0     0.001094


# Q2 - URL Length Features

In this block, please provide your code for Q2 concerning your two URL Length features, namely URL Length by counting slashes (URL-slashes) and URL Length through using the type of the URL (URL-type). The two different URL length features that you will need to implement are detailed in the specification. Do carefully read and follow the Exercise 3 specification before starting the implementation of the features.

Some hints:

 - You will need to use a [pt.apply function](https://pyterrier.readthedocs.io/en/latest/apply.html) for computing your URL feature(s). The dataframe of results obtained from the upstream transformer has all of the information you need.

 - You can use a `**` operator for combining feature sets.

 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

 - You may wish to refer to Python's [`urlparse()`](https://docs.python.org/3/library/urllib.parse.html) function.

 - Use Python assertions to test that your feature implementation(s) give the expected results.
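As a small illustration of the `urlparse()` hint, the sketch below splits a URL into its components and derives the path depth. The `describe_url` helper is hypothetical and is not one of the features required by the specification.

```python
from urllib.parse import urlparse

def describe_url(url: str) -> dict:
    """Break a URL into the parts most useful for URL-length features."""
    parts = urlparse(url)
    # Non-empty path segments, e.g. ["pubs", "trec9", "t9_proceedings.html"].
    segments = [s for s in parts.path.split("/") if s]
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "depth": len(segments),
        "last_segment": segments[-1] if segments else "",
    }

info = describe_url("http://trec.nist.gov/pubs/trec9/t9_proceedings.html")
print(info)
```

Working from the parsed `path` rather than the raw string avoids counting the two slashes that belong to the `http://` prefix.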


## Q2 (a) URL-Slashes Feature

In this block you should define your URL-Slashes feature, and test it.

In [None]:
#YOUR SOLUTION
from urllib.parse import urlparse

# Returns the number of slashes in the document's URL.
def urlSlashes(row):
  o = urlparse(row['url'])  # split the URL into its components
  url = o._replace(fragment="").geturl().lower()  # drop any #fragment, then rebuild the URL string
  return url.count('/')  # number of slashes in the URL


#sample URLs for testing
url1 = "http://trec.nist.gov"      #root - category 1
url2 = "http://trec.nist.gov/pubs/"   #subroot - category 2
url3 = "http://trec.nist.gov/pubs/trec9/papers/"    #path - category 3
url4 = "http://trec.nist.gov/pubs/trec9/t9_proceedings.html"    #file - category 4

# urlSlashes expects a row with a 'url' entry, so each test URL is wrapped in a dict
assert urlSlashes({'url': url1}) == 2
assert urlSlashes({'url': url2}) == 4
assert urlSlashes({'url': url3}) == 6
assert urlSlashes({'url': url4}) == 5

#### (i) URL-Slashes as a PL2 re-ranker

Now you should evaluate your URL-slashes score by re-ranking PL2. You can now answer the corresponding quiz questions.
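Conceptually, re-ranking by a feature means replacing each document's score with the feature value and re-sorting. The plain-Python sketch below illustrates that idea on hypothetical rows; the `rerank` helper is an assumption for illustration and is not how PyTerrier's `pt.apply.doc_score` is implemented internally.

```python
def rerank(results, score_fn):
    """Replace each row's score with score_fn(row), then sort by descending score."""
    rescored = [dict(row, score=score_fn(row)) for row in results]
    # Python's sort is stable, so tied documents keep their original order.
    rescored.sort(key=lambda row: row["score"], reverse=True)
    for rank, row in enumerate(rescored):
        row["rank"] = rank
    return rescored

# Hypothetical first-pass results, already ordered by a retrieval score.
first_pass = [
    {"docno": "d1", "url": "http://a.gov/", "score": 12.0},
    {"docno": "d2", "url": "http://a.gov/x/y/z.html", "score": 11.0},
    {"docno": "d3", "url": "http://a.gov/x/", "score": 10.0},
]

reranked = rerank(first_pass, lambda row: row["url"].count("/"))
print([(row["docno"], row["score"], row["rank"]) for row in reranked])
```

Because a higher score means a better rank, using the raw slash count as the score places the deepest URLs first.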

In [None]:
#YOUR SOLUTION - Answer to Que 8
p_url = firstpassUB >> pt.apply.url_slashes(urlSlashes)  # adds a 'url_slashes' column, computed row by row
URL_slashes = p_url >> pt.apply.doc_score(urlSlashes)  # urlSlashes is called once per document; its return value becomes the new score
URL_slashes.search("cryption").head(5)  # Answer to Que 8: the 5 top-ranked documents for the query 'cryption'

Unnamed: 0,qid,docid,docno,url,body,score,query,url_slashes,rank
1,1,494954,G17-68-2584616,http://www.ncs.gov/n2/content/technote/tnv7n4/...,OFFICE OF THE MANAGER ...,7,cryption,7,0
0,1,434993,G15-50-1054100,http://cs-www.ncsl.nist.gov/publications/nistp...,"References[BOCK 88] Peter Bocker, ISDN The Int...",6,cryption,6,1
6,1,457024,G16-34-3764782,http://w3.access.gpo.gov/bxa/ear/txt/734.txt,Part 734--Scope of the Export Administration R...,6,cryption,6,2
7,1,424551,G15-11-3633588,http://cs-www.ncsl.nist.gov/publications/nistp...,Special Publication 800-41 Guidelines on Firew...,6,cryption,6,3
8,1,427549,G15-22-3805523,http://cs-www.ncsl.nist.gov/publications/nistp...,Security Issues in the Database Language SQLW....,6,cryption,6,4


#### (ii) URL-Slashes within an LTR model

Now you should evaluate your URL-slashes score as a feature within a new learned model. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION - Answer to Que 9 and 10
stage3 = firstpassUB >> URL_slashes  # URL-slashes re-ranker; URL_slashes already starts with firstpassUB, so PL2 runs twice here

# MAP performance of re-ranking PL2 using the URL-slashes feature.
df = pt.Experiment(
    [stage3],
    test_topics,
    test_qrels,
    eval_metrics = ["map"],
    round = {'map':4},
    names=["PL2 URL-slashes" ],
    baseline=0
)
print(df)

# P@5 performance of re-ranking PL2 using your URL-slashes feature implementation.
df = pt.Experiment(
    [stage3],
    test_topics,
    test_qrels,
    eval_metrics = ["P.5"],
    round = {"P.5":4},
    names=["PL2 URL-slashes" ],
    baseline=0
)
print(df)

              name     map map + map - map p-value
0  PL2 URL-slashes  0.0022  None  None        None


              name  P.5 P.5 + P.5 - P.5 p-value
0  PL2 URL-slashes  0.0  None  None        None


In [None]:
#YOUR SOLUTION - Answers to Que number 11 to 16
stage123 = firstpassUB >> (stdfeatures ** URL_slashes) #URL-slashes feature as a 4th feature

lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)

lmart_l_pipe_url_slash = stage123 >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe_url_slash.fit(tr_topics, tr_qrels, va_topics, va_qrels)

#MAP and P@5 performance of LTR model with 4 features
df = pt.Experiment(
    [lmart_l_pipe, lmart_l_pipe_url_slash],
    test_topics,
    test_qrels,
    eval_metrics = ["map","P.5"],
    round = 4,
    names=["LTR baseline",  "LTR URL-slashes" ],
    baseline=0
)
print(df)

[1]	valid_0's ndcg@1: 0.277778
[2]	valid_0's ndcg@1: 0.296296
[3]	valid_0's ndcg@1: 0.333333
[4]	valid_0's ndcg@1: 0.333333
[5]	valid_0's ndcg@1: 0.37037
[6]	valid_0's ndcg@1: 0.388889
[7]	valid_0's ndcg@1: 0.425926
[8]	valid_0's ndcg@1: 0.444444
[9]	valid_0's ndcg@1: 0.444444
[10]	valid_0's ndcg@1: 0.5


              name    map     P.5  map +  map -  map p-value  P.5 +  P.5 -  \
0     LTR baseline  0.400  0.1173    NaN    NaN          NaN    NaN    NaN   
1  LTR URL-slashes  0.384  0.1040   25.0   23.0     0.680398    3.0    8.0   

   P.5 p-value  
0          NaN  
1     0.132599  


In [None]:
#YOUR SOLUTION - Answer to Que no 18
# Rank all 4 features by feature importance. The higher the rank, the more important the feature.

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=400)
rf_pipe = stage123 >> pt.ltr.apply_learned_model(rf)
rf_pipe.fit(tr_topics, tr_qrels)
rf.feature_importances_  # order: PL2, BM25, PageRank, URL-slashes

array([0.27646951, 0.33077239, 0.35045227, 0.04230583])
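To turn such an array into an explicit ranking, one option is to pair the values with feature labels and sort in descending order. The sketch below uses the importances printed above; the labels and the `rank_by_importance` helper are hypothetical names introduced for illustration.

```python
# Labels follow the feature order in the pipeline: PL2, BM25, PageRank, URL-slashes.
FEATURE_LABELS = ["PL2", "BM25 (title)", "PageRank", "URL-slashes"]
# Values copied from the feature_importances_ output shown above.
importances = [0.27646951, 0.33077239, 0.35045227, 0.04230583]

def rank_by_importance(labels, values):
    """Return (label, importance) pairs, most important first."""
    return sorted(zip(labels, values), key=lambda pair: pair[1], reverse=True)

ranking = rank_by_importance(FEATURE_LABELS, importances)
for position, (label, value) in enumerate(ranking, start=1):
    print(position, label, round(value, 4))
```

For this run, PageRank comes first and URL-slashes last, although the exact values vary between runs because the random forest is not seeded.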

## Q2 (b) URL Type Feature

In this block you should define your URL Type feature and test it.

In [None]:
#YOUR SOLUTION
from urllib.parse import urlparse

def urlType(row):
  o = urlparse(row['url'])
  url = o._replace(fragment="").geturl().lower()  # drop any #fragment and lower-case the URL
  slashes = url.count('/')
  is_dir_or_index = url.endswith('/') or url.endswith('/index.html')

  if slashes == 2 or (slashes == 3 and url.endswith('/index.html')):
    category = 1  # root: a bare domain name, optionally followed by "index.html"
  elif slashes == 4 and is_dir_or_index:
    category = 2  # subroot: a domain name plus a single directory, optionally followed by "index.html"
  elif slashes > 4 and is_dir_or_index:
    category = 3  # path: a domain name plus an arbitrarily deep path, not ending in a file other than "index.html"
  else:
    category = 4  # file: anything else, e.g. a URL ending in a filename other than "index.html"
  return category

#sample URLs for testing
url1 = "http://trec.nist.gov"      #root - category 1
url2 = "http://trec.nist.gov/pubs/"   #subroot - category 2
url3 = "http://trec.nist.gov/pubs/trec9/papers/"    #path - category 3
url4 = "http://trec.nist.gov/pubs/trec9/t9_proceedings.html"    #file - category 4
url5 = "http://www.atsdr.cdc.gov/toxprofiles/phs105.html" #file - category 4

# urlType expects a row with a 'url' entry, so each test URL is wrapped in a dict
assert urlType({'url': url1}) == 1
assert urlType({'url': url5}) == 4
assert urlType({'url': url2}) == 2
assert urlType({'url': url3}) == 3
assert urlType({'url': url4}) == 4

#### (i) URL Type as a PL2 re-ranker

Now you should evaluate your URL type score by re-ranking PL2. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION - Answer to Que 20
p_url_type = firstpassUB >> pt.apply.url_type(urlType)  # adds a 'url_type' column, computed row by row
url_type = p_url_type >> pt.apply.doc_score(urlType)  # urlType is called once per document; its return value becomes the new score
url_type.search("aaie").head(5)  # Answer to Que 20: the 5 top-ranked documents for the query 'aaie'

Unnamed: 0,qid,docid,docno,url,body,score,query,url_type,rank
1,1,88532,G02-80-0379929,http://sunshine.jpl.nasa.gov/1rst%20Tier/Photo...,C ol o Photo Album This section is f...,4,aaie,4,0
2,1,301428,G10-61-1895354,http://www.cdpr.ca.gov/docs/ipminov/01awards.htm,The 2001 IPM Innovators Awards The 2001 A...,4,aaie,4,1
3,1,375914,G13-35-3399834,http://www.cdpr.ca.gov/docs/pressrls/9pestinno...,Media Contacts: Glenn Brank 916/445-3974 ...,4,aaie,4,2
0,1,543541,G19-52-0995113,http://sunshine.jpl.nasa.gov/AAIE%20Site%20%c4...,AAIE Photo Album The Jet Propulsion Labo...,3,aaie,3,3
4,1,51341,G01-54-3873617,http://goldmine.cde.ca.gov/calendar/,BODY { margin-left : 0; margin-...,2,aaie,2,4


#### (ii) URL Type within an LTR model

Now you should evaluate your URL type score as a feature within a new learned model. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION - ANSWER to Que 21 and 22

stage4 = firstpassUB >> url_type

# MAP and P@5 performance of re-ranking the PL2 candidate set using the URL-type feature
df = pt.Experiment(
    [stage4],
    test_topics,
    test_qrels,
    eval_metrics = ["map", "P_5"],
    round = 4,
    names=["PL2 URL-type" ],
    baseline=0
)
print(df)

           name     map  P_5 map + map - map p-value P_5 + P_5 - P_5 p-value
0  PL2 URL-type  0.0013  0.0  None  None        None  None  None        None


In [None]:
#YOUR SOLUTION - ANSWER to Que 23 to 30

stage124 = firstpassUB >> (stdfeatures ** url_type) #URL Type feature as 4th feature

lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)

lmart_l_pipe_url_type = stage124 >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe_url_type.fit(tr_topics, tr_qrels, va_topics, va_qrels)

# MAP and P@5 performance of the LTR model with 4 features
df = pt.Experiment(
    [ lmart_l_pipe_url_type],
    test_topics,
    test_qrels,
    eval_metrics = ["map","P.5"],
    round = 4,
    names=[ "LTR URL-type" ],
    baseline=0
)
print(df)

# MAP and P@5 performance of the LTR model with 4 features, compared against the LTR baseline
df = pt.Experiment(
    [lmart_l_pipe, lmart_l_pipe_url_type],
    test_topics,
    test_qrels,
    eval_metrics = ["map","P.5"],
    round = 4,
    names=["LTR baseline",  "LTR URL-type" ],
    baseline=0
)
print(df)

[1]	valid_0's ndcg@1: 0.296296
[2]	valid_0's ndcg@1: 0.351852
[3]	valid_0's ndcg@1: 0.388889
[4]	valid_0's ndcg@1: 0.388889
[5]	valid_0's ndcg@1: 0.388889
[6]	valid_0's ndcg@1: 0.407407
[7]	valid_0's ndcg@1: 0.462963
[8]	valid_0's ndcg@1: 0.462963
[9]	valid_0's ndcg@1: 0.462963
[10]	valid_0's ndcg@1: 0.481481


           name     map     P.5 map + map - map p-value P.5 + P.5 -  \
0  LTR URL-type  0.4371  0.1147  None  None        None  None  None   

  P.5 p-value  
0        None  


           name     map     P.5  map +  map -  map p-value  P.5 +  P.5 -  \
0  LTR baseline  0.4000  0.1173    NaN    NaN          NaN    NaN    NaN   
1  LTR URL-type  0.4371  0.1147   28.0   22.0     0.320884    5.0    6.0   

   P.5 p-value  
0          NaN  
1     0.765264  


In [None]:
#YOUR SOLUTION - Answer to Que No. 31
# Rank all 4 features by feature importance. The higher the rank, the more important the feature.
rf_pipe_url_type = stage124 >> pt.ltr.apply_learned_model(rf)
rf_pipe_url_type.fit(tr_topics, tr_qrels)
rf.feature_importances_  # order: PL2, BM25, PageRank, URL-type

array([0.27444668, 0.34140169, 0.33246325, 0.05168837])

# Q3 Proximity Search Feature

Now you will implement a new query-dependent feature, using the MinDist() function, as discussed in the specification. Do carefully read the specification before starting the implementation.

Hints:
 - Again, remember to use assertions to test your feature implementations.
 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

As mentioned in the specification, you should implement a function called `avgmindist()`, which takes the text of the query and the text of the document, and returns a score for the document, i.e. it must conform to the following Python specification:
```python
def avgmindist(query : str, document : str) -> float
```

NB: There are particular specific requirements for your implementations of MinDist() and avgmindist() that are detailed in the specification.
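For orientation only, the sketch below shows one way to find the closest pair of occurrences of two terms in a single pass over the tokens. It assumes that distance means the difference in token positions and that a missing term yields the document length; both are assumptions made here, and the specification's own requirements for MinDist() take precedence. The `closest_distance` name is hypothetical.

```python
def closest_distance(a: str, b: str, tokens: list) -> int:
    """Smallest difference in token positions between any occurrence of a and any of b.

    Returns len(tokens) when either term does not occur (an assumption of this sketch).
    """
    best = len(tokens)
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == a:
            last_a = i
            if last_b is not None:
                best = min(best, i - last_b)  # nearest earlier occurrence of b
        if tok == b:
            last_b = i
            if last_a is not None:
                best = min(best, i - last_a)  # nearest earlier occurrence of a
    return best

tokens = "the dom sits below the ice surface near another dom surface".split()
print(closest_distance("dom", "surface", tokens))
```

In this example the first "dom" is five positions away from the first "surface", but the second "dom" is adjacent to the second "surface", so a scan that stops at the first co-occurrence would miss the minimum.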

In [None]:
#YOUR AVGMINDIST IMPLEMENTATION
import string
import itertools

def minDist(a : str, b : str, D : str):
  # Intended as the minimum distance (in number of tokens) between occurrences of a and b in document D.
  # Note: the scan stops at the first position where both terms have been seen and returns the gap there.
  if a == b:
    return 0.0  # identical terms are at distance 0

  doc_terms = D.split()  # split the document into terms
  a_pos = None
  b_pos = None

  for i, term in enumerate(doc_terms):
    if term == a:
      a_pos = i  # latest position of term a
    if term == b:
      b_pos = i  # latest position of term b
    if a_pos is not None and b_pos is not None:
      return abs(a_pos - b_pos) - 1  # number of tokens between a and b

  # at least one of the terms is absent: return the number of terms in the document
  return len(doc_terms)


def avgmindist(query : str, document : str) -> float:
  strip_punct = str.maketrans('', '', string.punctuation)
  document = document.translate(strip_punct).lower()  # remove punctuation and lower-case the document
  terms = query.translate(strip_punct).lower().split(' ')  # same for the query, then split it into terms

  if len(terms) == 1:
    term_pair_list = [(terms[0], '')]  # single-term query: pair the term with the empty string
  else:
    term_pair_list = list(zip(terms, terms[1:]))  # adjacent pairs of query terms
    #term_pair_list = list(itertools.combinations(terms, 2)) #all pairs of query terms; needed to answer another question

  aggMinDist = 0
  for term1, term2 in term_pair_list:  # sum the distances between the paired terms within the document
    aggMinDist += minDist(term1, term2, document)

  return aggMinDist / len(term_pair_list)  # average over the pairs

assert avgmindist('!hi. wh?at is the tim[e] now?#.','hi, how are you?') == 4
assert avgmindist('Are you OKAY',']er lik?e.", " and stay updated on the LATEST weather news with the comprehensiv') == 13

You should test your implementation yourself. However, to allow us to verify your implementation, we have created 9 test cases. Please run `run_test_cases()` and use its responses to answer the relevant quiz questions.



In [None]:
#DO NOT ALTER THIS CELL
TEST_CASES = [
  ('fermilab directory', 45, 567257), #1
  ('webcam', 45, 567257), #2
  ('DOM surface', 384034, 388292), #3
  ('DOM surface', 45, 384034), #4
  ('DOM surface document', 388292, 384034), #5
  ('DOM software AMANDA', 639302, 384034), #6
  ('fermilab directory', 388292, 384034), #7
  ('trigger data', 596532, 639302), #8
  ('underlying hardware', 384034, 333649) #9
]

def run_test_cases():
  docno=0
  body=3
  for i, (query, docid1, docid2) in enumerate(TEST_CASES):
    meta1 = index.getMetaIndex().getAllItems(docid1)
    meta2 = index.getMetaIndex().getAllItems(docid2)
    s1 = avgmindist(query, meta1[body])
    s2 = avgmindist(query, meta2[body])
    if s1 > s2:
      result = meta1[docno]
      cmpD = "%s > %s" % (meta1[docno],meta2[docno])
    elif s2 > s1:
      result = meta2[docno]
      cmpD = "%s > %s" % (meta2[docno],meta1[docno])
    else:
      result = "EQUAL"
      cmpD = "%s == %s" % (meta1[docno],meta2[docno])
    print("TEST CASE %d result %s " % (i+1, result))

run_test_cases()

TEST CASE 1 result G20-36-1335992 
TEST CASE 2 result G20-36-1335992 
TEST CASE 3 result G13-80-1271020 
TEST CASE 4 result G00-00-0478398 
TEST CASE 5 result G13-80-1271020 
TEST CASE 6 result G13-64-2457111 
TEST CASE 7 result G13-80-1271020 
TEST CASE 8 result G21-44-1000362 
TEST CASE 9 result G13-64-2457111 


You should now integrate your avgmindist() function into a new LTR model, and compare its MAP & P@5 performance to the LTR baseline. You can now answer the corresponding quiz questions.
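For reference, the two measures being compared can be sketched in a few lines, assuming binary relevance. PyTerrier computes them for you inside `pt.Experiment()`; the helper names and the toy ranking below are hypothetical.

```python
def precision_at_k(ranked_docnos, relevant, k=5):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked_docnos[:k] if d in relevant) / k

def average_precision(ranked_docnos, relevant):
    """Mean of the precision values at the rank of each relevant document retrieved,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranked_docnos, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical ranking for one query, with two relevant documents.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(precision_at_k(ranking, relevant), average_precision(ranking, relevant))
```

MAP is the mean of `average_precision` over all queries in the topic set. For home-page finding topics with few relevant documents per query, P@5 can only take a handful of distinct values, which helps explain why it moves less than MAP.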

In [None]:
#YOUR SOLUTION - Answers to Questions 43 and 44
p_term_dist = firstpassUB >> pt.apply.avg_min_dist(lambda row: avgmindist(row['query'], row['body']))  # adds an 'avg_min_dist' column
term_dist_score = p_term_dist >> pt.apply.doc_score(lambda row: avgmindist(row['query'], row['body']))  # uses avgmindist as the document score
stage125 = firstpassUB >> (stdfeatures ** term_dist_score)  # average min distance as the 4th feature

lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)

lmart_l_pipe_url_type = stage125 >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe_url_type.fit(tr_topics, tr_qrels, va_topics, va_qrels)

# MAP and P@5 performance of the LTR model with 4 features
df = pt.Experiment(
    [lmart_l_pipe, lmart_l_pipe_url_type],
    test_topics,
    test_qrels,
    eval_metrics = ["map","P.5"],
    round = 4,
    names=["LTR baseline",  "PL2 sample" ],
    baseline=0
)
print(df)


[1]	valid_0's ndcg@1: 0.296296
[2]	valid_0's ndcg@1: 0.37037
[3]	valid_0's ndcg@1: 0.37037
[4]	valid_0's ndcg@1: 0.388889
[5]	valid_0's ndcg@1: 0.388889
[6]	valid_0's ndcg@1: 0.407407
[7]	valid_0's ndcg@1: 0.407407
[8]	valid_0's ndcg@1: 0.407407
[9]	valid_0's ndcg@1: 0.388889
[10]	valid_0's ndcg@1: 0.425926


           name     map     P.5  map +  map -  map p-value  P.5 +  P.5 -  \
0  LTR baseline  0.4000  0.1173    NaN    NaN          NaN    NaN    NaN   
1    PL2 sample  0.4138  0.1040   23.0   21.0     0.625537    3.0    8.0   

   P.5 p-value  
0          NaN  
1     0.132599  


In [None]:
#YOUR SOLUTION
# Rank all 4 features by feature importance. The higher the rank, the more important the feature.
rf_pipe_min_dist = stage125 >> pt.ltr.apply_learned_model(rf)
rf_pipe_min_dist.fit(tr_topics, tr_qrels)
rf.feature_importances_ # first is PL2 feature, second is  BM25 feature, third is  PageRank and fourth is avgmindist feature

array([0.26077217, 0.2478426 , 0.34911522, 0.14227001])

# Q4 A 5-feature Learning-to-Rank Model

You will now experiment with the LightGBM LambdaMART technique, including both of your added features (URL Type and AvgMinDist) along with the 3 initial features, including the PL2 sample (5 features in total).

You need to learn a *new* model when using your final selection of 5 features.

Evaluate the performance of your resulting LTR system in comparison to the LTR baseline and answer the quiz questions.

In [None]:
#YOUR SOLUTION Answers to Que from 45 to 53
stage1245 = firstpassUB >> (stdfeatures ** url_type ** term_dist_score) #apply URL_TYPE and AVG_MIN_DISTANCE as 4th and 5th feature
#stage1245.search("chemical reactions").head(2)

lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate=0.1,
    importance_type="gain",
    num_iterations=10)

lmart_l_pipe_url_type = stage1245 >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe_url_type.fit(tr_topics, tr_qrels, va_topics, va_qrels)

# MAP and P@5 of the 5-feature LTR model, compared against the LTR baseline.
df = pt.Experiment(
    [lmart_l_pipe, lmart_l_pipe_url_type],
    test_topics,
    test_qrels,
    eval_metrics = ["map","P.5"],
    round = 4,
    names=["LTR baseline",  "LTR 5 feature" ],
    baseline=0
)
print(df)





[1]	valid_0's ndcg@1: 0.296296
[2]	valid_0's ndcg@1: 0.333333
[3]	valid_0's ndcg@1: 0.425926
[4]	valid_0's ndcg@1: 0.425926
[5]	valid_0's ndcg@1: 0.444444
[6]	valid_0's ndcg@1: 0.425926
[7]	valid_0's ndcg@1: 0.407407
[8]	valid_0's ndcg@1: 0.407407
[9]	valid_0's ndcg@1: 0.388889
[10]	valid_0's ndcg@1: 0.407407
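
The `valid_0's ndcg@1` lines above report NDCG at cutoff 1 on the validation set after each boosting iteration. As a reminder of what that number measures, here is a minimal sketch of NDCG@k. It assumes an exponential gain of `2**rel - 1` and a `log2` rank discount, which is a common formulation; LightGBM's exact gain settings and its handling of queries without relevant documents may differ. The function names and the example labels are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k labels, given in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: a query with no relevant documents scores 0
    return dcg_at_k(relevances, k) / ideal


# Hypothetical relevance labels for one query, listed in ranked order.
ranked_labels = [0, 1, 0, 1]
print(ndcg_at_k(ranked_labels, 1))
print(ndcg_at_k(ranked_labels, 3))
```

With binary labels, NDCG@1 for a single query is either 0 or 1, so the averaged value mostly reflects how often a relevant document is ranked first.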



            name     map     P.5  map +  map -  map p-value  P.5 +  P.5 -  P.5 p-value
0   LTR baseline  0.4000  0.1173    NaN    NaN          NaN    NaN    NaN          NaN
1  LTR 5 feature  0.4474  0.1067   28.0   15.0     0.064689    2.0    6.0     0.158688
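
When reading this table for the quiz, it helps to turn the raw scores into relative changes and to check each p-value against a threshold. The sketch below does that arithmetic on the values copied from the table. The 0.05 threshold is an assumption (a conventional choice, not something the exercise states), and `relative_change` is a hypothetical helper. The p-values come from the significance test that `pt.Experiment` runs when `baseline` is set, which to the best of our knowledge is a paired t-test by default.

```python
def relative_change(baseline, system):
    """Percentage change of system relative to baseline."""
    return 100.0 * (system - baseline) / baseline


ALPHA = 0.05  # assumption: conventional significance threshold

# (baseline value, 5-feature value, p-value), copied from the table above
results = {
    "map": (0.4000, 0.4474, 0.064689),
    "P.5": (0.1173, 0.1067, 0.158688),
}

for metric, (base, system, p_value) in results.items():
    change = relative_change(base, system)
    verdict = "significant" if p_value < ALPHA else "not significant"
    print(f"{metric}: {change:+.2f}% (p={p_value:.4f}, {verdict} at {ALPHA})")
```

Under that threshold neither difference is significant: MAP improves by roughly 12% and P@5 drops by roughly 9%. The `map +` and `map -` columns count the queries on which the 5-feature model does better (28) or worse (15) than the baseline.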


In [None]:
#YOUR SOLUTION
# Rank all 5 features by feature importance. The higher the importance value, the more important the feature.
rf_pipe_5 = stage1245 >> pt.ltr.apply_learned_model(rf)
rf_pipe_5.fit(tr_topics, tr_qrels)
# Feature order in the array: PL2, BM25, PageRank, URL Type, AvgMinDist
rf.feature_importances_


array([0.23622017, 0.30608509, 0.285928  , 0.05208967, 0.11967706])

In [None]:
stage12.search("usda food nutrition and consumer services").head(2)


Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,usda food nutrition and consumer services,288808,0,"[22.283267262620598, 7.132655209043121, 0.0003...",G10-14-0643453,22.283267
1,1,usda food nutrition and consumer services,754309,1,"[21.016091113545205, 3.6749475723212752, 8.115...",G28-00-2145108,21.016091


In [None]:
stage123.search("usda food nutrition and consumer services").head(2)


Unnamed: 0,qid,docid,docno,url,body,rank,score,query,url_slashes,features
0,1,288808,G10-14-0643453,http://www.nal.usda.gov/fnic/pubs_and_db.html,Nutrition Information Center) Accessib...,0,22.283267,usda food nutrition and consumer services,4,"[22.283267262620598, 7.132655209043121, 0.0003..."
1,1,754309,G28-00-2145108,http://www.ers.usda.gov/briefing/InformationPo...,0 { d=parent.frames[n.substring(p+1)].docu...,1,21.016091,usda food nutrition and consumer services,5,"[21.016091113545205, 3.6749475723212752, 8.115..."
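
As a sanity check on the `url_slashes` column: judging only from the output above, the value for the top-ranked document is consistent with counting every `/` character in its URL, including the two after `http:`. The sketch below illustrates that reading. `count_url_slashes` is a hypothetical helper and may not match how the feature is actually defined earlier in the notebook.

```python
def count_url_slashes(url):
    """Count every '/' character in the URL, including the two after the scheme."""
    return url.count("/")


# URL of the top-ranked document in the output above (shown there in full).
url = "http://www.nal.usda.gov/fnic/pubs_and_db.html"
print(count_url_slashes(url))  # → 4
```

The second result's URL is truncated in the output, so its value of 5 cannot be checked the same way.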


In [None]:
stage124.search("usda food nutrition and consumer services").head(2)


Unnamed: 0,qid,docid,docno,url,body,rank,score,query,url_type,features
0,1,288808,G10-14-0643453,http://www.nal.usda.gov/fnic/pubs_and_db.html,Nutrition Information Center) Accessib...,0,22.283267,usda food nutrition and consumer services,4,"[22.283267262620598, 7.132655209043121, 0.0003..."
1,1,754309,G28-00-2145108,http://www.ers.usda.gov/briefing/InformationPo...,0 { d=parent.frames[n.substring(p+1)].docu...,1,21.016091,usda food nutrition and consumer services,4,"[21.016091113545205, 3.6749475723212752, 8.115..."


In [None]:
stage125.search("usda food nutrition and consumer services").head(2)


Unnamed: 0,qid,docid,docno,url,body,rank,score,query,avg_min_dist,features
0,1,288808,G10-14-0643453,http://www.nal.usda.gov/fnic/pubs_and_db.html,Nutrition Information Center) Accessib...,0,22.283267,usda food nutrition and consumer services,170.4,"[22.283267262620598, 7.132655209043121, 0.0003..."
1,1,754309,G28-00-2145108,http://www.ers.usda.gov/briefing/InformationPo...,0 { d=parent.frames[n.substring(p+1)].docu...,1,21.016091,usda food nutrition and consumer services,67.0,"[21.016091113545205, 3.6749475723212752, 8.115..."


# That's all Folks

**Submission Instructions:** Complete this notebook. All your answers to Exercise 3 must be submitted on the Exercise 3 Quiz instance on Moodle with your completed notebook (showing **both** your solutions and the results of their executions).


Your answers to the Quiz questions along with your .ipynb notebook file (showing code and outputs) must be submitted by the stated Exercise 3 deadline.