<a href="https://colab.research.google.com/github/d1p013/testing/blob/master/examples/notebooks/indexing_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTerrier Indexing Demo

This notebook takes you through indexing using [PyTerrier](https://github.com/terrier-org/pyterrier).

## Prerequisites

You will need PyTerrier installed. PyTerrier also needs Java to be installed, and will find most installations.

In [19]:
!pip install python-terrier
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier



## Init 

You must run `pt.init()` before using any other PyTerrier functions or classes.

Optional arguments:    
 - `version` - Terrier version to use, e.g. "5.2"    
 - `mem` - megabytes of memory allocated to Java, e.g. "4096"      
 - `packages` - external Java packages for Terrier to load, e.g. ["org.terrier:terrier.prf"]
 - `logging` - logging level for Terrier. Defaults to "WARN"; use "INFO" or "DEBUG" for more output.

NB: PyTerrier needs Java 11 installed. If it cannot find your Java installation, you can set the `JAVA_HOME` environment variable.
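
If `pt.init()` cannot find Java, it can help to check what the environment exposes before starting. The following is a minimal sketch using only the standard library. `locate_java` is a hypothetical helper (not part of PyTerrier, which performs its own lookup); it checks `JAVA_HOME` first and then falls back to the `PATH`.

```python
import os
import shutil
from typing import Mapping, Optional


def locate_java(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to a `java` executable, or None if none is found.

    Hypothetical helper for illustration only; PyTerrier does its own lookup.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        # A JDK/JRE normally keeps its launcher under $JAVA_HOME/bin.
        for name in ("java", "java.exe"):
            candidate = os.path.join(java_home, "bin", name)
            if os.path.isfile(candidate):
                return candidate
    # Fall back to whatever `java` is reachable through the PATH.
    return shutil.which("java", path=env.get("PATH"))


print(locate_java() or "No Java found; set JAVA_HOME before calling pt.init()")
```

If this prints no path, setting `JAVA_HOME` to the root of a Java installation before calling `pt.init()` is the first thing to try.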

In [20]:
import pyterrier as pt
if not pt.started():
  pt.init()

## TREC Indexing

Here, we are going to make use of PyTerrier's dataset API. We will use the [vaswani_npl corpus](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/), a very small information retrieval test collection. 

In [21]:
dataset = pt.datasets.get_dataset("vaswani")

print("Files in vaswani corpus: %s " % dataset.get_corpus())

Downloading vaswani corpus to /root/.pyterrier/corpora/vaswani/corpus



Files in vaswani corpus: ['/root/.pyterrier/corpora/vaswani/corpus/doc-text.trec'] 


In [22]:
index_path = "./index"

Create a `pt.TRECCollectionIndexer` object.    
The `index_path` argument specifies where to store the index. Passing `blocks=True` also records term positions in the index.

In [23]:
!rm -rf ./index
indexer = pt.TRECCollectionIndexer(index_path, blocks=True)

Index the files by calling the `index()` method on the `TRECCollectionIndexer` object.

In [24]:
indexref = indexer.index(dataset.get_corpus())

# index() takes either a single file name (a string) or a list of file names
# indexer.index(["/vaswani_corpus/doc-text.trec",])
# indexer.index("/vaswani_corpus/doc-text.trec")


Let's see what we got from the indexer.

The returned `indexref` is a Python object representing a Terrier [IndexRef](http://terrier.org/docs/current/javadoc/org/terrier/querying/IndexRef.html) object. You can think of this like a pointer, or a URI. In this case, it points to the location of the main index file.

In [25]:
indexref.toString()

'./index/data.properties'
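
Because the IndexRef points at `data.properties` inside the index directory, the presence of that file is a reasonable signal that an index has already been built there. A minimal sketch, assuming that convention holds for your index; `index_exists` is a hypothetical helper, not a PyTerrier function:

```python
import os


def index_exists(index_path: str) -> bool:
    """Return True if `index_path` appears to hold a Terrier index.

    Assumption: a built index contains a `data.properties` file, as the
    IndexRef string printed above suggests.
    """
    return os.path.isfile(os.path.join(index_path, "data.properties"))


print(index_exists("./index"))
```

A check like this lets you skip re-indexing on repeated runs, rather than deleting the directory with `rm -rf` each time.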

We can use that to get more information about the index. For instance, to see the statistics of the index, let's use `index.getCollectionStatistics().toString()`. You can see that we have indexed 11429 documents, containing a total of 7756 unique terms.

In [26]:
index = pt.IndexFactory.of(indexref)
print(index.getCollectionStatistics().toString())

Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 0
Number of tokens: 271581
Field names: []
Positions:   true
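
The printed statistics are plain `name: value` lines, so they are easy to turn into numbers for quick sanity checks such as the average document length. A sketch, assuming the output format shown above; `parse_stats` is a hypothetical helper:

```python
def parse_stats(text: str) -> dict:
    """Parse 'name: value' lines into a dict, converting integers where possible."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        stats[name.strip()] = int(value) if value.isdigit() else value
    return stats


# Statistics copied from the output above.
printed = """Number of documents: 11429
Number of terms: 7756
Number of postings: 224573
Number of fields: 0
Number of tokens: 271581
Field names: []
Positions:   true"""

stats = parse_stats(printed)
avg_doc_len = stats["Number of tokens"] / stats["Number of documents"]
print(f"Average document length: {avg_doc_len:.2f} tokens")
```

For this index the figure works out to roughly 23.8 tokens per document.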



To index TXT, PDF, Microsoft Word and similar files, use `pt.FilesIndexer` instead of `pt.TRECCollectionIndexer`.
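
A files-based indexer still needs to be told which files to read. One way to gather them is a recursive directory walk. The sketch below uses only the standard library; `collect_files`, the directory name, and the extension list are illustrative assumptions rather than anything PyTerrier requires.

```python
from pathlib import Path


def collect_files(root, extensions=(".txt", ".pdf", ".docx")):
    """Return sorted paths (as strings) under `root` whose suffix matches."""
    root = Path(root)
    if not root.is_dir():
        return []  # nothing to index if the directory is missing
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Hypothetical directory name, for illustration only.
files = collect_files("./my_documents")
print(f"Found {len(files)} files")
```

The resulting list could then be handed to the indexer, for example `pt.FilesIndexer("./files_index").index(files)`, assuming it accepts a list of file names in the same way as `TRECCollectionIndexer.index()` above.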

## Indexing a Pandas dataframe

Sometimes the documents that we want to index are already in memory. PyTerrier makes it easy to index standard Python data structures, particularly [Pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

To do this, we can use a `pt.DFIndexer()` object.

In [38]:
import pandas as pd
!rm -rf ./pd_index
pd_indexer = pt.DFIndexer("./pd_index")

# optionally modify properties
# index_properties = {"block.indexing":"true", "invertedfile.lexiconscanner":"pointers"}
# pd_indexer.setProperties(**index_properties)

In [30]:
# load the dataset into a dataframe, reading every column as a string
df = pd.read_csv("AI6122_Dataset_B1.csv", dtype = str)
df

Unnamed: 0.1,docno,Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,1,5160031,1dZ2Vjex6D4FcGUvc633xw,S2IaUH7jqGg021oJ1jo27w,Slj9yz_RfDRqiMRH8VxUMQ,3,0,0,0,Really cute restaurant with lots of character....,3/23/2018 13:51
1,2,5161915,XTOOwso4XocpgvfzUrLahQ,1-ikyrBXfTij5k6M4nTaZg,Slj9yz_RfDRqiMRH8VxUMQ,5,1,0,0,We visited when my wife and I came to Atlanta....,2/1/2018 20:08
2,3,5162697,Dq--78gS7JHHHCoryoUSsA,1ohGAyANSWmHI1k5h5t3BA,Slj9yz_RfDRqiMRH8VxUMQ,4,0,0,0,I came here for lunch one day to pick up an or...,1/21/2018 15:25
3,4,5162961,K1aPKM6fRpNXmpiRgjclbg,cFNmQ_EUjcNeF7aS4sz19Q,Slj9yz_RfDRqiMRH8VxUMQ,5,0,0,0,Our waitress was EXTREMELY courteous. She chec...,12/15/2017 23:56
4,5,5164768,CM9C7n3QvBwA9nsaPRSExg,TZFhgwVEkGurFBEsqmi21A,Slj9yz_RfDRqiMRH8VxUMQ,3,1,0,0,I had the lasagna calzone. The calzone was sm...,6/7/2017 0:03
...,...,...,...,...,...,...,...,...,...,...,...
745,746,6025575,tw7ehcqssbIT-eiMXwP9rw,uLgaY0g3DB4M1WY5-vYx1g,Slj9yz_RfDRqiMRH8VxUMQ,4,0,0,0,Great food and service. Arancini and Caesar sa...,9/11/2020 23:35
746,747,6032330,uFWBSp2D8M_0BKuVgA7xFw,lnVxcGJpuHjr60oALMWJFQ,Slj9yz_RfDRqiMRH8VxUMQ,3,0,1,0,I got Pizza Amalfi. It tastes good. But if our...,8/5/2017 4:49
747,748,6033789,uY2dQ1afBW34Lv9SUBMXIQ,SvXzxrg6DV1yF8uf0pbFuw,Slj9yz_RfDRqiMRH8VxUMQ,4,0,0,0,Picked up a pizza to go during the marathon we...,3/10/2020 14:11
748,749,6033928,GOTi_Qvr1jduk0H4reTZWw,utBSeTC2ZT5Gl33G1UiaHw,Slj9yz_RfDRqiMRH8VxUMQ,4,1,0,2,We came here on a Friday night during the Summ...,6/20/2020 0:38


There are a number of ways to index the dataframe.    
The first argument should always be a `pandas.Series` object of strings, which specifies the body of each document.    
Any arguments after that specify metadata.
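
Metadata does not have to come from a dataframe column. When documents start out as a list of Python records, they can be transposed into one list per field, which is the shape that the keyword-argument form of `index()` takes. A minimal sketch; `records_to_columns` and the sample records are hypothetical, and values are cast to strings on the assumption that metadata is stored as text:

```python
def records_to_columns(records, fields):
    """Transpose a list of dicts into {field: [str, ...]}, one list per field."""
    columns = {field: [] for field in fields}
    for i, record in enumerate(records):
        for field in fields:
            if field not in record:
                raise KeyError(f"record {i} is missing field {field!r}")
            columns[field].append(str(record[field]))
    return columns


# Made-up sample records, for illustration only.
records = [
    {"docno": 1, "url": "url1", "text": "first document"},
    {"docno": 2, "url": "url2", "text": "second document"},
]
meta_fields = records_to_columns(records, ["docno", "url"])
print(meta_fields)
```

`meta_fields` can then be unpacked as `pd_indexer.index(texts, **meta_fields)`, where `texts` is a Series holding the document bodies in the same order.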


In [39]:
# no metadata
# pd_indexer.index(df["text"])

# Add metadata fields as pandas.Series objects; the name of each Series becomes the name of the meta field.
indexref2 = pd_indexer.index(df["text"], df["cool"], df["funny"], df["useful"], df["stars"], df["user_id"], df["review_id"], df["docno"])
# pd_indexer.index(df["text"], df["docno"], df["url"])

# Add metadata fields as lists passed as keyword arguments
# pd_indexer.index(df["text"], docno=["1","2","3"], url=["url1", "url2", "url3"])

# Add the metadata fields with a dictionary
# meta_fields={"docno":["1","2","3"],"url":["url1", "url2", "url3"]}
# pd_indexer.index(df["text"], **meta_fields)

# Add the entire dataframe as metadata
# pd_indexer.index(df["text"], df)

## Retrieval

Let's see how we can use one of these indices for retrieval. Retrieval takes place using the `BatchRetrieve` object, by invoking its `transform()` method for one or more queries. For a quick test, you can pass a single query string to `search()`. 

BatchRetrieve will return the results as a Pandas dataframe.


In [35]:
pt.BatchRetrieve(indexref2).search("pizza")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,721,722,0,1.406280,pizza
1,1,84,85,1,1.401368,pizza
2,1,457,458,2,1.393727,pizza
3,1,699,700,3,1.388808,pizza
4,1,54,55,4,1.376573,pizza
...,...,...,...,...,...,...
608,1,153,154,608,-0.571830,pizza
609,1,64,65,609,-0.591046,pizza
610,1,100,101,610,-0.649032,pizza
611,1,15,16,611,-0.717392,pizza


However, most IR experiments use a set of queries. You can pass such a set as a dataframe with `qid` and `query` columns.

In [36]:
import pandas as pd
topics = pd.DataFrame([["1", "Really cute restaurant"]],columns=['qid','query'])
pt.BatchRetrieve(indexref2).transform(topics)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,0,1,0,5.440632,Really cute restaurant
1,1,247,248,1,5.279715,Really cute restaurant
2,1,742,743,2,5.029153,Really cute restaurant
3,1,414,415,3,4.990202,Really cute restaurant
4,1,270,271,4,4.761451,Really cute restaurant
...,...,...,...,...,...,...
193,1,100,101,193,0.353070,Really cute restaurant
194,1,395,396,194,0.344944,Really cute restaurant
195,1,167,168,195,0.189095,Really cute restaurant
196,1,372,373,196,0.114713,Really cute restaurant
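
Topic sets are often assembled from raw query strings rather than typed by hand. The sketch below builds rows with string `qid` values and strips punctuation from the query text, on the assumption that some punctuation has special meaning to Terrier's query parser (worth verifying against your version). `make_topics` is a hypothetical helper, and the second sample query is made up.

```python
import re


def make_topics(queries):
    """Build topic rows with string qids and punctuation-free query text."""
    rows = []
    for i, query in enumerate(queries, start=1):
        # Replace anything that is not a word character or whitespace
        # with a space, then collapse repeated whitespace.
        cleaned = re.sub(r"[^\w\s]", " ", query)
        cleaned = " ".join(cleaned.split())
        if cleaned:  # drop queries that are empty after cleaning
            rows.append({"qid": str(i), "query": cleaned})
    return rows


rows = make_topics(["Really cute restaurant", "What's good: pizza/pasta?"])
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` yields a dataframe with the `qid` and `query` columns used above.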


That's the end of the indexing tutorial; you can continue with the other example tutorials.