<a href="https://colab.research.google.com/github/d1p013/testing/blob/master/AI6122_Assignment_Simple_Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTerrier Indexing Demo

This notebook takes you through indexing using [PyTerrier](https://github.com/terrier-org/pyterrier).

## Prerequisites

You will need PyTerrier installed. PyTerrier also requires Java, and will locate most Java installations automatically.

In [1]:
!pip install python-terrier
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Downloading python-terrier-0.7.0.tar.gz (95 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Collecting pyjnius~=1.3.0
  Downloading pyjnius-1.3.0-cp37-cp37m-manylinux2010_x86_64.whl (1.1 MB)

## Init 

You must run `pt.init()` before using any other PyTerrier functions or classes.

Optional arguments:    
 - `version` - Terrier version, e.g. "5.2"    
 - `mem` - megabytes allocated to Java, e.g. "4096"      
 - `packages` - external Java packages for Terrier to load, e.g. ["org.terrier:terrier.prf"]
 - `logging` - logging level for Terrier. Defaults to "WARN"; use "INFO" or "DEBUG" for more output.

NB: PyTerrier needs Java 11 installed. If it cannot find your Java installation, you can set the `JAVA_HOME` environment variable.
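
If `pt.init()` cannot find Java, it can help to check what is visible from Python first. The following is a minimal sketch using only the standard library. `locate_java` is a hypothetical helper, not part of PyTerrier, and it assumes a POSIX-style layout in which the executable lives at `$JAVA_HOME/bin/java`.

```python
import os
import shutil


def locate_java(env=None):
    """Return a best-guess path to a Java executable, or None if none is found.

    Hypothetical helper for illustration; assumes $JAVA_HOME/bin/java (POSIX layout).
    """
    env = os.environ if env is None else env

    # Prefer an explicit JAVA_HOME if it points at a real executable file.
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate

    # Otherwise fall back to searching the PATH from the same environment.
    return shutil.which("java", path=env.get("PATH"))


print(locate_java())
```

If this prints `None`, setting `os.environ["JAVA_HOME"]` to your JDK directory before calling `pt.init()` is a reasonable next step.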

In [2]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.6  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.6  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


In [3]:
vaswani_dataset = pt.datasets.get_dataset("vaswani")
indexref = vaswani_dataset.get_index()
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

Downloading vaswani index to /root/.pyterrier/corpora/vaswani/index


Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 1
Number of tokens: 271581
Field names: [text]
Positions:   false



In [4]:
topics = vaswani_dataset.get_topics()
topics.head(5) 

Downloading vaswani topics to /root/.pyterrier/corpora/vaswani/query-text.trec


| | qid | query |
|---|---|---|
| 0 | 1 | measurement of dielectric constant of liquids ... |
| 1 | 2 | mathematical analysis and design details of wa... |
| 2 | 3 | use of digital computers in the design of band... |
| 3 | 4 | systems of data coding for information transfer |
| 4 | 5 | use of programs in engineering testing of comp... |


In [5]:
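# Retrieve the top 10 documents per query using the TF_IDF weighting model
retr = pt.BatchRetrieve(index, controls={"wmodel": "TF_IDF"}, num_results=10)
# Controls can also be changed after construction, one at a time or as a dict
retr.setControl("wmodel", "TF_IDF")
retr.setControls({"wmodel": "TF_IDF"})
res = retr.transform(topics)
res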

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 8171 | 8172 | 0 | 13.746087 | measurement of dielectric constant of liquids ... |
| 1 | 1 | 9880 | 9881 | 1 | 12.352666 | measurement of dielectric constant of liquids ... |
| 2 | 1 | 5501 | 5502 | 2 | 12.178153 | measurement of dielectric constant of liquids ... |
| 3 | 1 | 1501 | 1502 | 3 | 10.993585 | measurement of dielectric constant of liquids ... |
| 4 | 1 | 9858 | 9859 | 4 | 10.271452 | measurement of dielectric constant of liquids ... |
| ... | ... | ... | ... | ... | ... | ... |
| 925 | 93 | 3255 | 3256 | 5 | 12.761340 | high frequency oscillators using transistors t... |
| 926 | 93 | 6609 | 6610 | 6 | 12.740825 | high frequency oscillators using transistors t... |
| 927 | 93 | 150 | 151 | 7 | 12.660943 | high frequency oscillators using transistors t... |
| 928 | 93 | 3536 | 3537 | 8 | 12.580599 | high frequency oscillators using transistors t... |


In [6]:
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
#
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
# file_id = '1YkltHAl3Ro9xv25EkGB56h271_STNlPA'  # 'AI6122_Dataset_B1.csv'
file_id = '1aVXMJ_luTXISxMwP5_Bt2xE0ObXqEsQ2'
downloaded = drive.CreateFile({'id': file_id})
#print('Downloaded content "{}"'.format(downloaded.GetContentString()))
#downloaded.GetContentFile('AI6122_Dataset_B1.csv')
# Save the contents of the Drive file to a local CSV file.
downloaded.GetContentFile('Dataset_B1to8.csv')

## Indexing a Pandas dataframe

Sometimes we have the documents that we want to index in memory. Terrier makes it easy to index standard Python data structures, particularly [Pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

To do this, we can use a `pt.DFIndexer()` object.

In [7]:
# Load the dataset into a dataframe, reading every column as a string
import pandas as pd
#df = pd.read_csv("AI6122_Dataset_B1.csv", dtype=str)
df = pd.read_csv("Dataset_B1to8.csv", dtype=str)

# Give each document a unique string identifier: d1, d2, ...
docno = ["d" + str(idx) for idx in range(1, len(df) + 1)]
df["docno"] = docno
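
The cell above assigns each row a string `docno` of the form `d1`, `d2`, and so on. As a rough sketch of the same idea in plain Python, the hypothetical helpers below (`make_docnos` and `check_docnos`, neither of which is part of PyTerrier or pandas) generate the identifiers and verify that they are unique strings before they are attached to a dataframe.

```python
def make_docnos(n_docs, prefix="d"):
    """Return one identifier per document: prefix1, prefix2, ..., prefixN."""
    return [f"{prefix}{i}" for i in range(1, n_docs + 1)]


def check_docnos(docnos):
    """Return docnos unchanged, or raise ValueError if they are not unique strings."""
    if not all(isinstance(d, str) for d in docnos):
        raise ValueError("docnos must be strings")
    if len(set(docnos)) != len(docnos):
        raise ValueError("docnos must be unique")
    return docnos


docnos = check_docnos(make_docnos(3))
print(docnos)
```

With a dataframe `df` loaded as in the cell above, the result could be attached with `df["docno"] = check_docnos(make_docnos(len(df)))`.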


In [8]:
# Remove any index left over from a previous run
!rm -rf ./pd_index
pd_indexer = pt.DFIndexer("./pd_index", overwrite=True, verbose=True)

# optionally modify properties
# index_properties = {"block.indexing":"true", "invertedfile.lexiconscanner":"pointers"}
# pd_indexer.setProperties(**index_properties)

There are then a number of ways to index the dataframe:    
The first argument should always be a `pandas.Series` object of strings, which specifies the body of each document.    
Any arguments after that specify metadata.


In [9]:
# Index the text, passing the entire dataframe as metadata
indexref = pd_indexer.index(df["text"], df)
indexinfo = pt.IndexFactory.of(indexref)
print(indexinfo.getCollectionStatistics().toString())

# Alternative ways to call index():

# No metadata
# pd_indexer.index(df["text"])

# Add metadata fields as pandas.Series objects; the name of each Series becomes the name of the meta field
# pd_indexer.index(df["text"], df["docno"], df["review_id"], df["user_id"], df["business_id"], df["stars"], df["useful"], df["funny"], df["cool"])
# pd_indexer.index(df["text"], df["docno"], df["url"])

# Add metadata fields as lists passed as keyword arguments
# pd_indexer.index(df["text"], docno=["1","2","3"], url=["url1", "url2", "url3"])

# Add the metadata fields with a dictionary
# meta_fields={"docno":["1","2","3"],"url":["url1", "url2", "url3"]}
# pd_indexer.index(df["text"], **meta_fields)

Number of documents: 6928
Number of terms: 12049
Number of postings: 323743
Number of fields: 0
Number of tokens: 388543
Field names: []
Positions:   false



## Retrieval

Let's see how we can use this index for retrieval. Retrieval takes place using the `BatchRetrieve` object, by invoking its `transform()` method for one or more queries. For a quick test, you can pass a single query string to `search()`.

BatchRetrieve will return the results as a Pandas dataframe.
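
To make the `docno`, `rank` and `score` columns concrete, here is a toy ranker in plain Python. It is only a sketch: the scoring is a simplified TF-IDF sum, not the formula of any Terrier weighting model, and `toy_search` and the sample documents are hypothetical. As in the result frames in this notebook, ranks start at 0 and at most `num_results` rows are returned.

```python
import math
from collections import Counter


def toy_search(docs, query, num_results=5):
    """Rank docs (a dict of docno -> text) for a query with a simple TF-IDF sum."""
    # Term frequencies per document, using naive lowercase whitespace tokenisation.
    term_freqs = {docno: Counter(text.lower().split()) for docno, text in docs.items()}
    n_docs = len(docs)

    scores = {}
    for term in query.lower().split():
        doc_freq = sum(1 for tf in term_freqs.values() if term in tf)
        if doc_freq == 0:
            continue  # the term does not occur in the collection
        idf = math.log(1 + n_docs / doc_freq)
        for docno, tf in term_freqs.items():
            if term in tf:
                scores[docno] = scores.get(docno, 0.0) + tf[term] * idf

    # Highest score first; ties are broken by docno to keep the order deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:num_results]
    return [{"docno": d, "rank": r, "score": s} for r, (d, s) in enumerate(ranked)]


docs = {"d1": "really cute restaurant", "d2": "cute dog", "d3": "bad service"}
for row in toy_search(docs, "cute restaurant"):
    print(row)
```

Documents that match no query term get no row at all, so a query with no matching terms yields an empty result.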


In [10]:
pt.BatchRetrieve(indexref).search("so many")

Unnamed: 0,docid,docno,rank,score,qid,query


In [11]:

# This ranker produces the candidate set of documents for each query
BM25 = pt.BatchRetrieve(indexref, controls={"wmodel": "BM25"}, num_results=5)
# These rankers are used to re-score the BM25 results
TF_IDF = pt.BatchRetrieve(indexref, controls={"wmodel": "TF_IDF"}, num_results=5)
PL2 = pt.BatchRetrieve(indexref, controls={"wmodel": "PL2"}, num_results=5)

pipe = BM25 >> (TF_IDF ** PL2)
pipe.transform("Really cute restaurant")



| | qid | docid | docno | rank | score | query | features |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 5656 | d5657 | 0 | 11.656181 | Really cute restaurant | [6.957669848484342, 6.861145712753511] |
| 1 | 1 | 247 | d248 | 1 | 10.478894 | Really cute restaurant | [6.254937186160457, 5.707333751062654] |
| 2 | 1 | 3374 | d3375 | 2 | 10.271409 | Really cute restaurant | [6.131087801312275, 5.541288977958049] |
| 3 | 1 | 5785 | d5786 | 3 | 10.009293 | Really cute restaurant | [5.875657207663913, 5.262894075279536] |
| 4 | 1 | 5183 | d5184 | 4 | 9.889212 | Really cute restaurant | [5.995664430264314, 5.490947108412864] |
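
In the pipeline above, `>>` passes the BM25 results to the next stage, and `**` combines the TF_IDF and PL2 scores into the `features` column. The toy sketch below imitates that behaviour with Python operator overloading. `Stage` and `scorer` are hypothetical names, rows are plain dicts rather than a dataframe, and this is not how PyTerrier itself is implemented.

```python
class Stage:
    """Toy pipeline stage wrapping a function from a list of rows to a list of rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # "then": feed this stage's output into the next stage.
        return Stage(lambda rows: other(self(rows)))

    def __pow__(self, other):
        # "feature union": run both stages on the same input and join their features.
        def union(rows):
            left, right = self(rows), other(rows)
            return [{**l, "features": l["features"] + r["features"]}
                    for l, r in zip(left, right)]
        return Stage(union)


def scorer(weight):
    """Hypothetical scorer: one feature equal to weight times the number of words."""
    return Stage(lambda rows: [{**row, "features": [weight * len(row["text"].split())]}
                               for row in rows])


first_pass = Stage(lambda rows: rows[:2])  # keep only the first two candidates
pipe = first_pass >> (scorer(1.0) ** scorer(0.5))

candidates = [
    {"docno": "d1", "text": "really cute restaurant"},
    {"docno": "d2", "text": "cute"},
    {"docno": "d3", "text": "bad service"},
]
for row in pipe(candidates):
    print(row["docno"], row["features"])
```

Each surviving candidate ends up with one feature per scorer, in the order the scorers appear in the union, which matches the two-element `features` lists in the output above.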


In [12]:
pt.BatchRetrieve(indexref, controls = {"wmodel": "PL2"}, num_results=5).search("Really cute restaurant")


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 5656 | d5657 | 0 | 6.861146 | Really cute restaurant |
| 1 | 1 | 247 | d248 | 1 | 5.707334 | Really cute restaurant |
| 2 | 1 | 3374 | d3375 | 2 | 5.541289 | Really cute restaurant |
| 3 | 1 | 5183 | d5184 | 3 | 5.490947 | Really cute restaurant |
| 4 | 1 | 5237 | d5238 | 4 | 5.427991 | Really cute restaurant |


In [16]:
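# Example of setting the weighting model and term pipeline; this retriever is constructed but not used below
pt.BatchRetrieve(indexref, wmodel="BM25", properties={"termpipelines" : "Stopwords,PorterStemmer"})
# Retrieve with the default weighting model, returning two metadata fields with each result
pt.BatchRetrieve(indexref, metadata=["business_id", "stars"], num_results=10).search("Really cute restaurant")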


| | qid | docid | business_id | stars | rank | score | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 5785 | UacakYbLnef2TYU2YDrtuw | 2 | 0 | 6.212172 | Really cute restaurant |
| 1 | 1 | 5183 | UacakYbLnef2TYU2YDrtuw | 5 | 1 | 6.208631 | Really cute restaurant |
| 2 | 1 | 0 | Slj9yz_RfDRqiMRH8VxUMQ | 3 | 2 | 6.12212 | Really cute restaurant |
| 3 | 1 | 5656 | UacakYbLnef2TYU2YDrtuw | 4 | 3 | 5.880452 | Really cute restaurant |
| 4 | 1 | 247 | Slj9yz_RfDRqiMRH8VxUMQ | 5 | 4 | 5.856493 | Really cute restaurant |
| 5 | 1 | 3374 | gUHpQYwW_fd0l0hcE2i6Dg | 5 | 5 | 5.792231 | Really cute restaurant |
| 6 | 1 | 5229 | UacakYbLnef2TYU2YDrtuw | 5 | 6 | 5.774188 | Really cute restaurant |
| 7 | 1 | 414 | Slj9yz_RfDRqiMRH8VxUMQ | 3 | 7 | 5.692445 | Really cute restaurant |
| 8 | 1 | 3357 | gUHpQYwW_fd0l0hcE2i6Dg | 5 | 8 | 5.670671 | Really cute restaurant |
| 9 | 1 | 742 | Slj9yz_RfDRqiMRH8VxUMQ | 5 | 9 | 5.625087 | Really cute restaurant |


However, most IR experiments use a set of queries. You can pass such a set as a dataframe with `qid` and `query` columns.

In [14]:
import pandas as pd
topics = pd.DataFrame([["q1", "Really cute restaurant"]],columns=['qid','query'])
pt.BatchRetrieve(indexref, metadata=["docno", "review_id"], num_results=10).transform(topics)

| | qid | docid | docno | review_id | rank | score | query |
|---|---|---|---|---|---|---|---|
| 0 | q1 | 5785 | d5786 | MdsUfha2lMQEyWHqvJrWNQ | 0 | 6.212172 | Really cute restaurant |
| 1 | q1 | 5183 | d5184 | 1GiY-SjY_izEIGkjhUJuww | 1 | 6.208631 | Really cute restaurant |
| 2 | q1 | 0 | d1 | 1dZ2Vjex6D4FcGUvc633xw | 2 | 6.12212 | Really cute restaurant |
| 3 | q1 | 5656 | d5657 | p3_II5M5kYCVczKlSht_ow | 3 | 5.880452 | Really cute restaurant |
| 4 | q1 | 247 | d248 | 0C16-USlaLuynYWUp1Ftyw | 4 | 5.856493 | Really cute restaurant |
| 5 | q1 | 3374 | d3375 | NRmJu_mG32-q0EVOoYgOKQ | 5 | 5.792231 | Really cute restaurant |
| 6 | q1 | 5229 | d5230 | HCUgThSO-0eiYItaRefiUA | 6 | 5.774188 | Really cute restaurant |
| 7 | q1 | 414 | d415 | Ob5_X4lPsQlERiTJm0t0IQ | 7 | 5.692445 | Really cute restaurant |
| 8 | q1 | 3357 | d3358 | 1414gLiIp2fxK2kUUy_7Ew | 8 | 5.670671 | Really cute restaurant |
| 9 | q1 | 742 | d743 | ukcxyc_lxQWhl1r5PCeLeg | 9 | 5.625087 | Really cute restaurant |


In [15]:
pt.new.queries(["Really cute restaurant"], qid=["q1"])

| | qid | query |
|---|---|---|
| 0 | q1 | Really cute restaurant |
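
The topics frames above have two columns, `qid` and `query`. When starting from a plain list of query strings, rows in that shape can be prepared as in the sketch below. `make_topics` is a hypothetical helper, and stripping punctuation is an assumption made here for safety (some punctuation may be interpreted by Terrier's query parser), not something this notebook does.

```python
import re


def make_topics(queries, prefix="q"):
    """Return a list of {"qid", "query"} rows, one per query string."""
    rows = []
    for i, text in enumerate(queries, start=1):
        # Assumption: replace anything that is not a letter, digit or space.
        cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
        # Collapse runs of whitespace left behind by the replacement.
        cleaned = " ".join(cleaned.split())
        rows.append({"qid": f"{prefix}{i}", "query": cleaned})
    return rows


rows = make_topics(["Really cute restaurant", "What's good?"])
for row in rows:
    print(row)
```

The resulting rows can be turned into a dataframe with `pd.DataFrame(rows)` and passed to `transform()` in the same way as `topics` above.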


That's the end of the indexing tutorial. You can continue with the other example tutorials.