
# PyTerrier Indexing Demo

This notebook takes you through indexing using [PyTerrier](https://github.com/terrier-org/pyterrier).

## Prerequisites

You will need PyTerrier installed. PyTerrier also needs Java to be installed, and will find most installations.

In [2]:
!pip install python-terrier
#!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

Collecting python-terrier
  Downloading python-terrier-0.7.0.tar.gz (95 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Collecting pyjnius~=1.3.0
  Downloading pyjnius-1.3.0-cp37-cp37m-manylinux2010_x86_64.whl (1.1 MB)

## Init 

You must run `pt.init()` before using any other PyTerrier functions or classes.

Optional Arguments:    
 - `version` - terrier IR version e.g. "5.2"    
 - `mem` - megabytes allocated to java e.g. "4096"      
 - `packages` - external java packages for Terrier to load e.g. ["org.terrier:terrier.prf"]
 - `logging` - logging level for Terrier. Defaults to "WARN", use "INFO" or "DEBUG" for more output.

NB: PyTerrier needs Java 11 installed. If it cannot find your Java installation, you can set the `JAVA_HOME` environment variable.
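
If `pt.init()` cannot locate Java, it can help to check what the environment exposes before setting `JAVA_HOME`. The following is a minimal sketch using only the standard library; `find_java` is a hypothetical helper name and is not part of PyTerrier.

```python
import os
import shutil


def find_java():
    """Return a likely path to a Java installation, or None if none is visible.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    # In this sketch an explicit JAVA_HOME wins when it points at a real directory.
    java_home = os.environ.get("JAVA_HOME")
    if java_home and os.path.isdir(java_home):
        return java_home
    # Otherwise fall back to whichever `java` executable is on the PATH.
    return shutil.which("java")


location = find_java()
if location is None:
    print("No Java found; install a JDK and set JAVA_HOME before calling pt.init()")
else:
    print("Java candidate:", location)
```

If the helper prints nothing useful, setting `os.environ["JAVA_HOME"]` to the JDK directory before `pt.init()` is the usual next step.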

In [3]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.6  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.6  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


In [4]:
 #vaswani_dataset = pt.datasets.get_dataset("vaswani")
 #indexref = vaswani_dataset.get_index()
 #index = pt.IndexFactory.of(indexref)
 #print(index.getCollectionStatistics().toString()) 

In [5]:
#topics = vaswani_dataset.get_topics()
#topics.head(5) 

In [6]:
#retr = pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"}, num_results=10)
#retr.setControl("wmodel", "TF_IDF")
#retr.setControls({"wmodel": "TF_IDF"})
#res=retr.transform(topics)
#res 

In [7]:
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
#
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
# file_id = '1YkltHAl3Ro9xv25EkGB56h271_STNlPA' #'AI6122_Dataset_B1.csv'
file_id = '1aVXMJ_luTXISxMwP5_Bt2xE0ObXqEsQ2' #"Dataset_B1to8"
#file_id = '14N45V84iAf6q59vzL6OslMXpyQgWskuh' #"dataset_review.csv"
#https://drive.google.com/file/d/14N45V84iAf6q59vzL6OslMXpyQgWskuh/view?usp=sharing
downloaded = drive.CreateFile({'id': file_id})
#print('Downloaded content "{}"'.format(downloaded.GetContentString()))
#downloaded.GetContentFile('AI6122_Dataset_B1.csv')
downloaded.GetContentFile('Dataset_B1to8.csv')
#downloaded.GetContentFile('dataset_review.csv')

## Indexing a Pandas dataframe

Sometimes we have the documents that we want to index in memory. Terrier makes it easy to index standard Python data structures, particularly [Pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

To do this, we can use a `pt.DFIndexer()` object.

In [9]:
## load data into df
import pandas as pd
#df = pd.read_csv("AI6122_Dataset_B1.csv", dtype = str)
df = pd.read_csv("Dataset_B1to8.csv", dtype = str)

# Drop any docno list left over from a previous run of this cell.
try:
  del docno
except NameError:
  pass

##docno = list(range(1,len(df)+1))
##df['docno'] = docno
##df = df.astype({"docno" : str})
#print(type(df["docno"][0]))

# Build string document IDs "d1", "d2", ... for every row.
docno = ["d" + str(idx) for idx in range(1, len(df) + 1)]

df["docno"] = docno


In [74]:
#import pandas as pd
!rm -rf ./pd_index
pd_indexer = pt.DFIndexer("./pd_index", overwrite=True, verbose=True)

# optionally modify properties
# index_properties = {"block.indexing":"true", "invertedfile.lexiconscanner":"pointers"}
# pd_indexer.setProperties(**index_properties)

There are a number of options for indexing the dataframe.    
The first argument should always be a `pandas.Series` object of strings, which specifies the body of each document.    
Any arguments after that specify metadata.


In [75]:
import time
# no metadata
# pd_indexer.index(df["text"])

# Add metadata fields as Pandas.Series objects, with the name of the Series object becoming the name of the meta field.
#indexref = pd_indexer.index(df["text"], df["docno"], df["review_id"], df["user_id"], df["business_id"], df["stars"], df["useful"], df["funny"], df["cool"])
Tstart = time.perf_counter()
indexref = pd_indexer.index(df["text"], df)
Tend = time.perf_counter()
print(f"indexing completed in {Tend - Tstart:0.4f} seconds")
indexinfo = pt.IndexFactory.of(indexref)
print(indexinfo.getCollectionStatistics().toString()) 
# pd_indexer.index(df["text"], df["docno"], df["url"])

# Add metadata fields as lists to a keyword argument
# pd_indexer.index(df["text"], docno=["1","2","3"], url=["url1", "url2", "url3"])

# Add the metadata fields with a dictionary
# meta_fields={"docno":["1","2","3"],"url":["url1", "url2", "url3"]}
# pd_indexer.index(df["text"], **meta_fields)

# Add the entire dataframe as metadata
# pd_indexer.index(df["text"], df)

  0%|          | 0/6928 [00:00<?, ?documents/s]

indexing completed in 7.2864 seconds
Number of documents: 6928
Number of terms: 12049
Number of postings: 323743
Number of fields: 0
Number of tokens: 388543
Field names: []
Positions:   false
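
The collection statistics above are easier to read against a toy inverted index. The sketch below builds one in plain Python and derives the same four counts; the three documents are made-up illustrative data, and unlike Terrier's default term pipeline it applies no stopword removal or stemming, so it shows what the numbers mean rather than how Terrier computes them.

```python
from collections import Counter, defaultdict

# Three toy documents standing in for review texts (illustrative data only).
docs = {
    "d1": "really cute restaurant",
    "d2": "cute pizza place really cute",
    "d3": "great food great music",
}

# Inverted index: term -> {docno -> term frequency in that document}.
inverted = defaultdict(dict)
for docno, text in docs.items():
    for term, tf in Counter(text.split()).items():
        inverted[term][docno] = tf

num_documents = len(docs)
# Distinct terms in the vocabulary.
num_terms = len(inverted)
# One posting per (term, document) pair.
num_postings = sum(len(postings) for postings in inverted.values())
# Every term occurrence, counting repeats within a document.
num_tokens = sum(tf for postings in inverted.values() for tf in postings.values())

print(num_documents, num_terms, num_postings, num_tokens)
```

Postings can never exceed tokens, because a term repeated inside one document adds tokens but only one posting.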



In [76]:
for idx in range(1,11):
  # Index the first 10%, 20%, ..., 100% of the dataframe and time each run.
  df_resize = df.iloc[:len(df) * idx // 10, :]
  !rm -rf ./pd_index
  pd_indexer = pt.DFIndexer("./pd_index", overwrite=True, verbose=True)
  Tstart = time.perf_counter()
  indexref = pd_indexer.index(df_resize["text"], df_resize)
  Tend = time.perf_counter()
  print(f"{10*idx}% df indexing completed in {Tend - Tstart:0.4f} seconds")

  0%|          | 0/692 [00:00<?, ?documents/s]

10% df indexing completed in 1.1398 seconds

  0%|          | 0/1385 [00:00<?, ?documents/s]

20% df indexing completed in 1.8369 seconds

  0%|          | 0/2078 [00:00<?, ?documents/s]

30% df indexing completed in 2.5531 seconds

  0%|          | 0/2771 [00:00<?, ?documents/s]

40% df indexing completed in 3.1788 seconds

  0%|          | 0/3464 [00:00<?, ?documents/s]

50% df indexing completed in 3.8119 seconds

  0%|          | 0/4156 [00:00<?, ?documents/s]

60% df indexing completed in 4.5575 seconds

  0%|          | 0/4849 [00:00<?, ?documents/s]

70% df indexing completed in 5.0883 seconds

  0%|          | 0/5542 [00:00<?, ?documents/s]

80% df indexing completed in 5.9937 seconds

  0%|          | 0/6235 [00:00<?, ?documents/s]

90% df indexing completed in 6.5859 seconds

  0%|          | 0/6928 [00:00<?, ?documents/s]

100% df indexing completed in 7.2143 seconds
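
The timings above grow roughly in step with the number of documents. As a rough check, the sketch below fits a straight line to the ten recorded measurements with ordinary least squares, using only the standard library; it summarises this single run and is not a general performance model.

```python
# (documents indexed, seconds) pairs copied from the timing output above.
timings = [
    (692, 1.1398), (1385, 1.8369), (2078, 2.5531), (2771, 3.1788),
    (3464, 3.8119), (4156, 4.5575), (4849, 5.0883), (5542, 5.9937),
    (6235, 6.5859), (6928, 7.2143),
]

n = len(timings)
mean_x = sum(x for x, _ in timings) / n
mean_y = sum(y for _, y in timings) / n

# Ordinary least-squares fit of: seconds = intercept + slope * documents.
slope = sum((x - mean_x) * (y - mean_y) for x, y in timings) / sum(
    (x - mean_x) ** 2 for x, _ in timings
)
intercept = mean_y - slope * mean_x

print(f"about {slope * 1000:.2f} ms per document, {intercept:.2f} s fixed cost")
```

For this run the fit comes out at roughly one millisecond per document plus about half a second of fixed overhead per indexing call.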


In [28]:
index = pt.IndexFactory.of(indexref)

# Let's see what type index is.
type(index)
IIndex = index.getInvertedIndex()
IIndex

#for kv in index.getLexicon():
#  print("%s (%s) -> %s (%s)" % (kv.getKey(), type(kv.getKey()), kv.getValue().toString(), type(kv.getValue()) ) )
  

<org.terrier.structures.PostingIndex at 0x7f8adc465ef0 jclass=org/terrier/structures/PostingIndex jself=<LocalRef obj=0x5575c4386298 at 0x7f8adeb6f7d0>>

In [21]:
# kv is the loop variable from the lexicon loop in the previous cell (commented out there).
print(type(kv))

<class 'jnius.reflect.org.terrier.structures.Lexicon$LexiconFileEntry'>


## Retrieval

Let's see how we can use the index for retrieval. Retrieval takes place using the `BatchRetrieve` object, by invoking its `transform()` method for one or more queries. For a quick test, you can pass a single query string directly to `transform()` or `search()`.

BatchRetrieve will return the results as a Pandas dataframe.


In [13]:
pt.BatchRetrieve(indexref).search("so many")

| | docid | docno | rank | score | qid | query |
|---|---|---|---|---|---|---|


In [14]:

#this ranker will make the candidate set of documents for each query
BM25 = pt.BatchRetrieve(indexref, controls = {"wmodel": "BM25"}, num_results=5)
#these rankers we will use to re-rank the BM25 results
TF_IDF = pt.BatchRetrieve(indexref, controls = {"wmodel": "TF_IDF"}, num_results=5)
PL2 =  pt.BatchRetrieve(indexref, controls = {"wmodel": "PL2"}, num_results=5)

pipe = BM25 >> (TF_IDF ** PL2)
pipe.transform("Really cute restaurant") 




| | qid | docid | docno | rank | score | query | features |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 5656 | d5657 | 0 | 11.656181 | Really cute restaurant | [6.957669848484342, 6.861145712753511] |
| 1 | 1 | 247 | d248 | 1 | 10.478894 | Really cute restaurant | [6.254937186160457, 5.707333751062654] |
| 2 | 1 | 3374 | d3375 | 2 | 10.271409 | Really cute restaurant | [6.131087801312275, 5.541288977958049] |
| 3 | 1 | 5785 | d5786 | 3 | 10.009293 | Really cute restaurant | [5.875657207663913, 5.262894075279536] |
| 4 | 1 | 5183 | d5184 | 4 | 9.889212 | Really cute restaurant | [5.995664430264314, 5.490947108412864] |
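
In the pipeline above, `>>` feeds the BM25 candidates into the next stage, and `**` combines the two re-rankers so that each contributes one entry to the `features` column (the second entry matches the PL2 score reported for the same document in the next cell). The sketch below mimics that data flow in plain Python. The scorers `overlap` and `length_normalised` and the helpers `first_stage` and `feature_union` are hypothetical stand-ins, not Terrier's BM25, TF_IDF or PL2, and not PyTerrier's API.

```python
# Toy stand-ins for weighting models: each maps (query, text) to a score.
def overlap(query, text):
    words = text.lower().split()
    return float(sum(words.count(term) for term in query.lower().split()))


def length_normalised(query, text):
    return overlap(query, text) / len(text.split())


# Illustrative documents only.
docs = {
    "d1": "really cute restaurant with lots of character",
    "d2": "cute pizza place",
    "d3": "great music and food",
}


def first_stage(query, k):
    """Candidate generation: score every document, keep the top k matches."""
    scored = [(docno, overlap(query, text)) for docno, text in docs.items()]
    scored = [row for row in scored if row[1] > 0]
    return sorted(scored, key=lambda row: row[1], reverse=True)[:k]


def feature_union(query, candidates, scorers):
    """Re-score each candidate with every scorer, one feature per scorer."""
    return [
        {
            "docno": docno,
            "score": score,
            "features": [scorer(query, docs[docno]) for scorer in scorers],
        }
        for docno, score in candidates
    ]


query = "really cute restaurant"
results = feature_union(query, first_stage(query, k=5), [overlap, length_normalised])
for row in results:
    print(row)
```

The ranking and the `score` value come from the first stage; the second stage only attaches features, which is why the order in the table follows BM25 rather than either re-ranker.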


In [15]:
pt.BatchRetrieve(indexref, controls = {"wmodel": "PL2"}, num_results=5).search("Really cute restaurant")


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 5656 | d5657 | 0 | 6.861146 | Really cute restaurant |
| 1 | 1 | 247 | d248 | 1 | 5.707334 | Really cute restaurant |
| 2 | 1 | 3374 | d3375 | 2 | 5.541289 | Really cute restaurant |
| 3 | 1 | 5183 | d5184 | 3 | 5.490947 | Really cute restaurant |
| 4 | 1 | 5237 | d5238 | 4 | 5.427991 | Really cute restaurant |


In [16]:
pt.BatchRetrieve(indexref, wmodel="BM25", properties={"termpipelines" : "Stopwords,PorterStemmer"})
pt.BatchRetrieve(indexref, metadata=["business_id", "stars"], num_results=10).search("Really cute restaurant")


| | qid | docid | business_id | stars | rank | score | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 5785 | UacakYbLnef2TYU2YDrtuw | 2 | 0 | 6.212172 | Really cute restaurant |
| 1 | 1 | 5183 | UacakYbLnef2TYU2YDrtuw | 5 | 1 | 6.208631 | Really cute restaurant |
| 2 | 1 | 0 | Slj9yz_RfDRqiMRH8VxUMQ | 3 | 2 | 6.12212 | Really cute restaurant |
| 3 | 1 | 5656 | UacakYbLnef2TYU2YDrtuw | 4 | 3 | 5.880452 | Really cute restaurant |
| 4 | 1 | 247 | Slj9yz_RfDRqiMRH8VxUMQ | 5 | 4 | 5.856493 | Really cute restaurant |
| 5 | 1 | 3374 | gUHpQYwW_fd0l0hcE2i6Dg | 5 | 5 | 5.792231 | Really cute restaurant |
| 6 | 1 | 5229 | UacakYbLnef2TYU2YDrtuw | 5 | 6 | 5.774188 | Really cute restaurant |
| 7 | 1 | 414 | Slj9yz_RfDRqiMRH8VxUMQ | 3 | 7 | 5.692445 | Really cute restaurant |
| 8 | 1 | 3357 | gUHpQYwW_fd0l0hcE2i6Dg | 5 | 8 | 5.670671 | Really cute restaurant |
| 9 | 1 | 742 | Slj9yz_RfDRqiMRH8VxUMQ | 5 | 9 | 5.625087 | Really cute restaurant |


However, most IR experiments use a set of queries. You can pass such a set to `transform()` as a dataframe.
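
A set of queries is simply a table with one `qid` and one `query` per row. The sketch below builds such rows in plain Python; `make_topics` and its `q1`, `q2`, ... numbering are hypothetical choices for illustration, and the extra query strings are made up.

```python
def make_topics(queries):
    """Pair each query string with a generated qid ("q1", "q2", ...).

    Hypothetical helper; the qid scheme is an assumption for illustration.
    """
    return [
        {"qid": f"q{i}", "query": query}
        for i, query in enumerate(queries, start=1)
    ]


rows = make_topics(["Really cute restaurant", "great pizza", "live music"])
for row in rows:
    print(row["qid"], row["query"])

# In the notebook, pd.DataFrame(rows) would turn these rows into a
# qid/query dataframe with one row per query.
```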

In [46]:
import pandas as pd
topics = pd.DataFrame([["q1", "Really cute restaurant"]],columns=['qid','query'])
pt.BatchRetrieve(indexref, metadata=["text"], num_results=10).transform(topics)

| | qid | docid | text | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | q1 | 5785 | Perlas the restaurant is very beachy and cute.... | 0 | 6.212172 | Really cute restaurant |
| 1 | q1 | 5183 | Really cute restaurant on So Co. Great music a... | 1 | 6.208631 | Really cute restaurant |
| 2 | q1 | 0 | Really cute restaurant with lots of character.... | 2 | 6.12212 | Really cute restaurant |
| 3 | q1 | 5656 | Very cute restaurant. Great atmosphere and ind... | 3 | 5.880452 | Really cute restaurant |
| 4 | q1 | 247 | Such a cute pizza place! The ambiance was very... | 4 | 5.856493 | Really cute restaurant |
| 5 | q1 | 3374 | This restaurant was super cute and the food wa... | 5 | 5.792231 | Really cute restaurant |
| 6 | q1 | 5229 | OUTSTANDING FOOD!!! A little pricey but worth ... | 6 | 5.774188 | Really cute restaurant |
| 7 | q1 | 414 | I was staying in downtown and wanted to grab a... | 7 | 5.692445 | Really cute restaurant |
| 8 | q1 | 3357 | So, I went out with a friend to catch up and t... | 8 | 5.670671 | Really cute restaurant |
| 9 | q1 | 742 | Went to this pizza spot in Downtown Atlanta ye... | 9 | 5.625087 | Really cute restaurant |


In [42]:
pt.new.queries(["Really cute restaurant"], qid=["q1"])

| | qid | query |
|---|---|---|
| 0 | q1 | Really cute restaurant |


In [57]:
import time
search_str = input("Please enter your search string: ")
#print("Search string: ", search_str)
TopN = input("Please enter number results to display: ")
#print("Top N results: ", TopN)
Tstart = time.perf_counter()
topics = pd.DataFrame([["q1", search_str]],columns=['qid','query'])
results = pt.BatchRetrieve(indexref, metadata=["text"], num_results=int(TopN)).transform(topics)
Tend = time.perf_counter()
print(f"search completed in {Tend - Tstart:0.4f} seconds")
# Show the results table, or a message when nothing matched.
if len(results) == 0:
  print("no result found")
else:
  display(results)

Please enter your search string: xavier
Please enter number results to display: 20
search completed in 0.0424 seconds
no result found
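
The cell above passes raw `input()` text straight into the query and into `int()`. A non-numeric result count raises `ValueError`, and, as an assumption worth verifying against your Terrier version, some punctuation in the query may be rejected by the query parser. The sketch below shows one defensive approach; `clean_query` and `parse_top_n` are hypothetical helpers, and the ASCII-only filter assumes English text.

```python
import re


def clean_query(raw):
    """Keep ASCII letters, digits and spaces; collapse runs of whitespace."""
    return " ".join(re.sub(r"[^0-9A-Za-z\s]", " ", raw).split())


def parse_top_n(raw, default=10):
    """Return a positive result count, falling back to `default` on bad input."""
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


print(clean_query("  What's good: pizza/pasta?  "))
print(parse_top_n("20"), parse_top_n("twenty"), parse_top_n("-3"))
```

In the notebook these would wrap the two `input()` calls before the values reach `BatchRetrieve`.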


In [55]:
len(results)

10

That's the end of the indexing tutorial. You can continue with the other example tutorials.