<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Build-Indexer-and-index-data.-Store-the-index-to-disk" data-toc-modified-id="Build-Indexer-and-index-data.-Store-the-index-to-disk-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Build Indexer and index data. Store the index to disk</a></span></li><li><span><a href="#Load-inverted-index-from-folder" data-toc-modified-id="Load-inverted-index-from-folder-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load inverted index from folder</a></span></li><li><span><a href="#Make-queries-(single-term)-from-the-loaded-indexer" data-toc-modified-id="Make-queries-(single-term)-from-the-loaded-indexer-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Make queries (single term) from the loaded indexer</a></span></li><li><span><a href="#Storing-list-of-strings-to-disk" data-toc-modified-id="Storing-list-of-strings-to-disk-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Storing list of strings to disk</a></span></li><li><span><a href="#Make-queries-(multiple-terms)-from-the-loaded-indexer" data-toc-modified-id="Make-queries-(multiple-terms)-from-the-loaded-indexer-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Make queries (multiple terms) from the loaded indexer</a></span><ul class="toc-item"><li><span><a href="#How-can-we-store-metadata-in-the-inverted-index-and-return-it-when-the-user-calls-.search?" 
data-toc-modified-id="How-can-we-store-metadata-in-the-inverted-index-and-return-it-when-the-user-calls-.search?-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>How can we store metadata in the inverted index and return it when the user calls .search?</a></span></li></ul></li><li><span><a href="#Row-serialization-process-for-dataframe" data-toc-modified-id="Row-serialization-process-for-dataframe-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Row serialization process for dataframe</a></span><ul class="toc-item"><li><span><a href="#Retrieve-serialized-rows-for-a-query-search" data-toc-modified-id="Retrieve-serialized-rows-for-a-query-search-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Retrieve serialized rows for a query search</a></span></li></ul></li><li><span><a href="#benchmark" data-toc-modified-id="benchmark-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>benchmark</a></span></li></ul></div>

## Build Indexer and index data. Store the index to disk

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import lsearch

In [2]:
from lsearch import InvertedIndex
from lsearch.text_processing.tokenization.tokenizers import get_vocabulary_and_tdf_tuples, build_inv_index_from_tdf_tuples

In [7]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
docs = newsgroups_train['data']
len(docs)

11314

In [8]:
inv_ind = InvertedIndex()

In [9]:
inv_ind.index(docs, folder_store="inv_index_store")

InvertedIndex stored in inv_index_store


After indexing, the necessary components are written to the `inv_index_store` folder:

In [10]:
ls inv_index_store/

data_by_term.bin            word2pos.pkl
doc_freq.pkl                word_freq.pkl
postings_term_pointers.pkl


We can find all documents that contain a particular term using `.search`.

## Load inverted index from folder

In [11]:
inv_index_from_disk = InvertedIndex.read_inv_index('./inv_index_store')

In [12]:
inv_index_from_disk.__dict__.keys()

dict_keys(['word2pos', 'n_docs_seen', 'folder_store', 'word_freq', 'doc_freq', 'postings_term_pointers'])

In [13]:
word_pos_tuples = inv_ind.word2pos.items()
words = [x[0] for x in word_pos_tuples]
pos = [x[1] for x in word_pos_tuples]

inv_ind.write_strings_to_file("word2pos_test.bin", inv_ind.word2pos)

In [14]:
vocab_from_file = inv_ind.read_strings_from_file("word2pos_test.bin")

In [15]:
vocab_from_file[0:10]

['from', 'lerxst', 'wam', 'umd', 'edu', 'where', 's', 'my', 'thing', 'subject']

## Make queries (single term) from the loaded indexer

In [16]:
%%timeit
inv_index_from_disk.get_tuples_for_term_id(12312)

90.5 µs ± 771 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [17]:
%%timeit
inv_index_from_disk.get_tuples_for_term_id_slow(12312)

1.55 ms ± 19.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [18]:
inv_index_from_disk.get_tuples_for_term_id(12312)

[(12312, 227, 1),
 (12312, 954, 1),
 (12312, 1672, 1),
 (12312, 2721, 1),
 (12312, 3568, 1),
 (12312, 7209, 1)]

In [19]:
inv_index_from_disk.get_tuples_for_term_id_slow(12312)

[(12312, 227, 1),
 (12312, 954, 1),
 (12312, 1672, 1),
 (12312, 2721, 1),
 (12312, 3568, 1),
 (12312, 7209, 1)]
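Both methods return the same tuples, but the first is roughly 17x faster. The file names `postings_term_pointers.pkl` and `data_by_term.bin` suggest that the fast path seeks directly to a term's block in the binary file, while the slow path scans more of it. That is an assumption about the design, not code taken from `lsearch`. A minimal sketch of the idea, with a hypothetical fixed-width record layout and hypothetical function names:

```python
import os
import struct
import tempfile

# Assumed layout: each posting is three unsigned 32-bit ints (term_id, doc_id, term_freq)
RECORD = struct.Struct("<III")


def write_postings(path, postings_by_term):
    """Write postings grouped by term; return {term_id: (byte_offset, n_records)}."""
    pointers = {}
    with open(path, "wb") as f:
        for term_id in sorted(postings_by_term):
            records = postings_by_term[term_id]
            pointers[term_id] = (f.tell(), len(records))
            for doc_id, tf in records:
                f.write(RECORD.pack(term_id, doc_id, tf))
    return pointers


def read_postings(path, pointers, term_id):
    """Seek straight to the term's block and decode only its records."""
    if term_id not in pointers:
        return []
    offset, n = pointers[term_id]
    with open(path, "rb") as f:
        f.seek(offset)
        buf = f.read(n * RECORD.size)
    return [RECORD.unpack_from(buf, i * RECORD.size) for i in range(n)]


def read_postings_scan(path, term_id):
    """Baseline: decode every record in the file and keep the matching ones."""
    with open(path, "rb") as f:
        buf = f.read()
    return [rec for rec in RECORD.iter_unpack(buf) if rec[0] == term_id]


# Toy data: two terms, postings given as (doc_id, term_freq)
postings = {7: [(3, 1), (9, 2)], 12312: [(227, 1), (954, 1), (1672, 1)]}
path = os.path.join(tempfile.mkdtemp(), "data_by_term.bin")
pointers = write_postings(path, postings)
fast = read_postings(path, pointers, 12312)
slow = read_postings_scan(path, 12312)
```

The pointer table costs one dictionary entry per term, and in exchange a lookup reads only the bytes belonging to that term.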

We can use a `term` (string) instead of the `term_id` (integer).

In [20]:
inv_index_from_disk.get_tuples_for_term('nintendo')

[(23577, 525, 3),
 (23577, 830, 2),
 (23577, 1125, 2),
 (23577, 2138, 1),
 (23577, 2839, 1),
 (23577, 3867, 1),
 (23577, 4922, 1),
 (23577, 5855, 1),
 (23577, 6016, 2),
 (23577, 6470, 1),
 (23577, 7890, 1),
 (23577, 8815, 1),
 (23577, 9023, 2),
 (23577, 9382, 2),
 (23577, 9941, 3)]
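The tuples above read naturally as `(term_id, doc_id, term_frequency)`, with the term mapped to its id through `word2pos`. The sketch below shows one way such tuples could be produced from raw documents. The tokenizer (lowercase, split on non-alphanumerics) and the names `build_tdf_tuples` and `tuples_for_term` are illustrative assumptions, not the `lsearch` implementation.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase alphanumeric runs
TOKEN = re.compile(r"[a-z0-9]+")


def build_tdf_tuples(docs):
    """Return (word2pos, tuples), where tuples are (term_id, doc_id, term_freq)."""
    word2pos = {}
    tuples = []
    for doc_id, text in enumerate(docs):
        counts = Counter(TOKEN.findall(text.lower()))
        for word, tf in counts.items():
            # New words get the next free id, in order of first appearance
            term_id = word2pos.setdefault(word, len(word2pos))
            tuples.append((term_id, doc_id, tf))
    # Sort by term_id, then doc_id, so each term's postings are contiguous
    tuples.sort()
    return word2pos, tuples


def tuples_for_term(word2pos, tuples, term):
    """Translate the string term to its id, then filter the tuples."""
    term_id = word2pos.get(term.lower())
    if term_id is None:
        return []
    return [t for t in tuples if t[0] == term_id]


docs = [
    "Nintendo sells Mario games",
    "Super Mario, super fun",
    "nintendo nintendo switch",
]
word2pos, tuples = build_tdf_tuples(docs)
result = tuples_for_term(word2pos, tuples, "nintendo")
print(result)  # [(0, 0, 1), (0, 2, 2)]
```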

## Storing list of strings to disk

In [21]:
vocab = list(inv_index_from_disk.word2pos.keys())

In [22]:
%%timeit
inv_index_from_disk.write_strings_to_file('test.bin', vocab)

35.8 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [23]:
%%timeit
v = inv_index_from_disk.read_strings_from_file('test.bin')

7.55 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
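A common way to store a list of strings in a single binary file is to prefix each string with its byte length. The sketch below illustrates that format. It is an assumption about what `write_strings_to_file` and `read_strings_from_file` might do, and the function names and the 4-byte length prefix are hypothetical.

```python
import os
import struct
import tempfile


def write_strings(path, strings):
    """Write each string as a 4-byte little-endian length followed by its UTF-8 bytes."""
    with open(path, "wb") as f:
        for s in strings:
            data = s.encode("utf-8")
            f.write(struct.pack("<I", len(data)))
            f.write(data)


def read_strings(path):
    """Read the whole file once, then walk it record by record."""
    with open(path, "rb") as f:
        buf = f.read()
    out, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        out.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return out


vocab = ["from", "lerxst", "wam", "umd", "edu", "", "café"]
path = os.path.join(tempfile.mkdtemp(), "vocab.bin")
write_strings(path, vocab)
restored = read_strings(path)
```

The prefix stores the byte length rather than the character count, so non-ASCII strings such as `café` survive the round trip.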


## Make queries (multiple terms) from the loaded indexer


The notebook `chap_6_processing_boolean_queries.ipynb` describes in detail how to compute intersections between postings lists sorted by `doc_id`.
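For reference, the standard two-pointer merge for intersecting sorted postings can be sketched as follows. This is the textbook algorithm, not necessarily the code used by `.search`, and the doc ids below are toy data chosen for illustration.

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending doc_id lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(postings):
    """Intersect several lists, shortest first, so intermediate results stay small."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect_sorted(result, p)
        if not result:
            break  # nothing can match any more
    return result


# Toy doc_id lists, one per query term
nintendo = [525, 830, 1125, 9023]
super_ = [70, 173, 525, 830, 1125, 9023]
mario = [830, 2000, 9023]
print(intersect_all([nintendo, super_, mario]))  # [830, 9023]
```

Each pairwise merge is linear in the combined length of the two lists, which is why keeping postings sorted by `doc_id` pays off at query time.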

In [24]:
query = 'nintendo super mario'
inv_index_from_disk.search_postings_for_terms(query)

[[(23577, 525, 3),
  (23577, 830, 2),
  (23577, 1125, 2),
  (23577, 2138, 1),
  (23577, 2839, 1),
  (23577, 3867, 1),
  (23577, 4922, 1),
  (23577, 5855, 1),
  (23577, 6016, 2),
  (23577, 6470, 1),
  (23577, 7890, 1),
  (23577, 8815, 1),
  (23577, 9023, 2),
  (23577, 9382, 2),
  (23577, 9941, 3)],
 [(5259, 70, 1),
  (5259, 173, 2),
  (5259, 525, 2),
  (5259, 830, 1),
  (5259, 1125, 2),
  (5259, 1331, 1),
  (5259, 1395, 1),
  (5259, 1401, 1),
  (5259, 1461, 1),
  (5259, 1624, 3),
  (5259, 1822, 1),
  (5259, 2179, 1),
  (5259, 2194, 1),
  (5259, 2269, 1),
  (5259, 2303, 1),
  (5259, 2316, 2),
  (5259, 2379, 1),
  (5259, 2526, 1),
  (5259, 2665, 1),
  (5259, 2676, 1),
  (5259, 2705, 1),
  (5259, 2741, 1),
  (5259, 2749, 1),
  (5259, 3024, 1),
  (5259, 3169, 2),
  (5259, 3179, 1),
  (5259, 3209, 1),
  (5259, 3226, 1),
  (5259, 3243, 1),
  (5259, 3480, 4),
  (5259, 3574, 1),
  (5259, 3696, 1),
  (5259, 3777, 1),
  (5259, 3822, 3),
  (5259, 3867, 1),
  (5259, 3869, 1),
  (5259, 3966, 1),
  (

In [25]:
%%timeit
inv_index_from_disk.search(query)

268 µs ± 452 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [26]:
search_results = inv_index_from_disk.search(query)
search_results

[830, 9023]

In [27]:
print(docs[search_results[1]])

From: fields@cis.ohio-state.edu (jonathan david fields)
Subject: Misc. Stuff for Sale
Article-I.D.: penguin.1po5lqINN749
Distribution: usa
Organization: The Ohio State University Dept. of Computer and Info. Science
Lines: 46
NNTP-Posting-Host: penguin.cis.ohio-state.edu


Misc. Items for sale:


Walkman:  Aiwa Model HS-T30, 1 year old, mint condition, hardly used, 
          autoreverse, 3 band Equalizer, Super Bass, Dolby Noise Reduction,
          AM FM tuner..........Paid $70.......Asking $40+shipping.

Mount Plate:  Sony Model CPM-203P, mounting plate for Sony portable CD players
for Portable: plugs into car lighter, snaps onto the bottom of any Sony
CD Player:    Portable CD player, perfect condition. Will also throw in a 
	      cassette adapter in SO SO condition.
	      Paid $45...............Asking $30+shipping.

AM FM:	    Factory Stereo from Toyota with AM FM radio, autoreverse cassette
Cassette:   deck, digital tuning, like new condition only in car 6 months,
Car Stereo: As

### How can we store metadata in the inverted index and return it when the user calls .search?

In [28]:
from datasets import load_from_disk

train_dataset = load_from_disk("/Users/dbuchaca/Desktop/text_retrieval_and_search_engines/datasets/amazon_2023_genq/train_dataset")
# If running from QNAP 
#train_dataset = load_from_disk("/Users/davidbuchaca1/Datasets/amazon_2023_genq/train_dataset")
train_dataset

Dataset({
    features: ['parent_asin', 'main_category', 'title', 'description', 'features', 'embellished_description', 'brand', 'images', 'short_query', 'long_query'],
    num_rows: 205637
})

In [29]:
df = train_dataset.to_pandas()
df.head()
df.to_parquet('./dataset.parquet')

Read the dataset back from the parquet file:

In [43]:
df_ = pd.read_parquet('dataset.parquet')

One option is to store the document metadata in an external parquet file, as discussed in https://chatgpt.com/c/68080f54-97c0-8008-8c4f-7de93aa43048

## Row serialization process for dataframe

In [7]:
from lsearch import TableSerializer

In [8]:
df_ = pd.read_parquet('dataset.parquet')

In [9]:
cols = ["parent_asin", 
        "main_category",
        "title",
        "description",
        "features", 
        "brand",
        "short_query"]

schema = {col:'str' for col in cols}
schema

{'parent_asin': 'str',
 'main_category': 'str',
 'title': 'str',
 'description': 'str',
 'features': 'str',
 'brand': 'str',
 'short_query': 'str'}

In [20]:
table_ser = TableSerializer(bin_path="variable_bin_storage.bin", 
                            schema=schema,
                            variable_length_columns = cols)

In [19]:
%%time
table_ser.serialize(df_)

CPU times: user 15.1 s, sys: 379 ms, total: 15.5 s
Wall time: 15.9 s


In [17]:
%%time
table_ser.serialize_slow(df_)

CPU times: user 29.1 s, sys: 590 ms, total: 29.7 s
Wall time: 30.3 s


In [36]:
%%timeit
table_ser.read_rows_parallel([0, 520, 40])

1.05 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
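Random access to individual rows is cheap when the serializer records where each row starts. The sketch below shows one possible layout for variable-length string columns: every field is length-prefixed, and a list of byte offsets (one per row) allows seeking directly to any row. The layout and the function names are assumptions for illustration. The actual `TableSerializer` format may differ.

```python
import os
import struct
import tempfile


def serialize_rows(path, rows, columns):
    """Write each row as length-prefixed UTF-8 fields; return the byte offset of every row."""
    offsets = []
    with open(path, "wb") as f:
        for row in rows:
            offsets.append(f.tell())
            for col in columns:
                data = str(row[col]).encode("utf-8")
                f.write(struct.pack("<I", len(data)))
                f.write(data)
    return offsets


def read_row(f, offset, columns):
    """Seek to the row start and decode one field per column."""
    f.seek(offset)
    row = {}
    for col in columns:
        (n,) = struct.unpack("<I", f.read(4))
        row[col] = f.read(n).decode("utf-8")
    return row


def read_rows(path, offsets, indices, columns):
    with open(path, "rb") as f:
        return [read_row(f, offsets[i], columns) for i in indices]


columns = ["title", "brand"]
rows = [
    {"title": "NHL Molded Auto Emblem", "brand": "Rico Industries"},
    {"title": "Rain Boots", "brand": "SUPER MARIO"},
    {"title": "Swim Trunks", "brand": "Nintendo"},
]
path = os.path.join(tempfile.mkdtemp(), "variable_bin_storage.bin")
offsets = serialize_rows(path, rows, columns)
picked = read_rows(path, offsets, [2, 0], columns)
```

Only the requested rows are decoded, so the read cost depends on the number of rows fetched rather than on the size of the table.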


### Retrieve serialized rows for a query search

In [23]:
(df_['title'] + ' ' + df_['description']).iloc[0]

'NHL Molded Auto Emblem Showcase your team spirit with this eye-catching Chrome Finished Auto Emblem 3D Sticker by Rico Industries. This auto emblem 3D sticker measures 5-inches by 5-inches and is decorated with a dynamic and bold team logo. It easily adheres to any vehicle or other hard surface and is made of weather resistant materials. Made in the USA.'

In [25]:
%%time
inv_ind = InvertedIndex()
inv_ind.index(df_['title'] + ' ' + df_['description'], folder_store="inv_index_store_amazon_2023_genq")

InvertedIndex stored in inv_index_store_amazon_2023_genq
CPU times: user 46.6 s, sys: 1.21 s, total: 47.8 s
Wall time: 48.5 s


In [35]:
query = 'nintendo super mario'
%time search_results = inv_ind.search(query)
%time df_res = table_ser.read_rows_parallel(search_results)
print(df_res.shape)
df_res.head()

CPU times: user 18.1 ms, sys: 849 µs, total: 18.9 ms
Wall time: 18.3 ms
CPU times: user 6.19 ms, sys: 5.56 ms, total: 11.7 ms
Wall time: 7.38 ms
(49, 7)


| | parent_asin | main_category | title | description | features | brand | short_query |
|---|---|---|---|---|---|---|---|
| 0 | B08JZMQMNF | AMAZON FASHION | Super Mario Nintendo Rain Boots,Mid Height Sli... | Stomping, hopping, and walking in the rain has... | Rubber sole\nSUPER MARIO FUN: Super Mario Bros... | SUPER MARIO | Super Mario Rain Boots for Kids |
| 1 | B09TRWSQYG | AMAZON FASHION | Nintendo Super Mario Little & Big Boys Swim Tr... | Little and big kids will love these Super Mari... | 100% Polyester\nImported\nElastic closure\nSwi... | Nintendo | Super Mario boys swim trunks |
| 2 | B07ZZLH18W | AMAZON FASHION | Mario Kart Nintendo Boys' Super Mario Drifting... | This tee is a special edition not available an... | 100% Cotton\nPull On closure\nMachine Wash\nLO... | Mario Kart | Mario Kart t-shirt for boys |
| 3 | B0BXVXKFLQ | AMAZON FASHION | Nintendo Boys' Super Mario Boxer Briefs Availa... | All packs available are fun and adds a unique ... | 92% Polyester, 8% Spandex\nImported\nPull On c... | Handcraft Children's Apparel | Nintendo Boxer Briefs |
| 4 | B07HYBLKRD | AMAZON FASHION | Jumping Beans Boys 4-10 Nintendo Super Mario B... | Boys 4-10 Jumping Beans Nintendo Super Mario B... | 100% Cotton\nPull On closure\nMachine Wash\nCr... | Jumping Beans | Boys Yoshi Graphic Tee |


In [36]:
query = 'black'
%time search_results = inv_ind.search(query)
%time df_res = table_ser.read_rows_parallel(search_results)
df_res.shape
df_res.head()

CPU times: user 13.6 ms, sys: 728 µs, total: 14.3 ms
Wall time: 13.7 ms
CPU times: user 2.51 s, sys: 2.03 s, total: 4.54 s
Wall time: 2.66 s


| | parent_asin | main_category | title | description | features | brand | short_query |
|---|---|---|---|---|---|---|---|
| 0 | B07K4CWWZH | All Beauty | Hvaxing Synthetic Afro Kinkys Curly Crochet Br... | Since 2014 crochet braids have been rising in ... | Style: Synthetic Crochet Braiding Hair,Afro Ki... | HVAXING | Afro Kinky Curly Crochet Hair |
| 1 | B01GTEH4F4 | Sports & Outdoors | Attwood 11772-1 Stand-Up Paddle Board (SUP) Pa... | This Attwood Stand-Up Paddle Board (SUP) Paddl... | Length-adjustable between 55 and 82 inches\nVe... | Attwood | Attwood Adjustable SUP Paddle |
| 2 | B08G4K88YZ | Unknown | denisel Waterproof Small Toiletry Bag for Men ... | Made with good quality, water-resistant improv... | [Waterproof] Made with Good quality polyester ... | Ecoland | Compact waterproof toiletry bag |
| 3 | B07HJVVBXB | All Beauty | BLISSHAIR 2x6 Deep Middle Part Lace Closure Hu... | BLISSHAIR Brazilian 2x6 Deep Middle Part Lace ... | Bliss Hair - Brazilian Virgin 2x6 Lace Closure... | BLISSHAIR | Bliss Hair Lace Closure |
| 4 | B08BLB1ZZH | AMAZON FASHION | Mens Basketball Jerseys Lola#10 Bunny Space Mo... | It is the LOLA #10 Men's Basketball Jersey Spa... | Jersey,Polyester,Mesh\nPull On closure\n100% P... | Ki Cut | Lola #10 basketball jersey |


In [41]:
query = 'black'
%time search_results = inv_ind.search(query)
%time df_res = table_ser.read_rows_parallel_mmap(search_results)
df_res.shape
df_res.head()

CPU times: user 15 ms, sys: 890 µs, total: 15.9 ms
Wall time: 15.3 ms
CPU times: user 2.02 s, sys: 751 ms, total: 2.78 s
Wall time: 2.04 s


| | parent_asin | main_category | title | description | features | brand | short_query |
|---|---|---|---|---|---|---|---|
| 0 | B07K4CWWZH | All Beauty | Hvaxing Synthetic Afro Kinkys Curly Crochet Br... | Since 2014 crochet braids have been rising in ... | Style: Synthetic Crochet Braiding Hair,Afro Ki... | HVAXING | Afro Kinky Curly Crochet Hair |
| 1 | B01GTEH4F4 | Sports & Outdoors | Attwood 11772-1 Stand-Up Paddle Board (SUP) Pa... | This Attwood Stand-Up Paddle Board (SUP) Paddl... | Length-adjustable between 55 and 82 inches\nVe... | Attwood | Attwood Adjustable SUP Paddle |
| 2 | B08G4K88YZ | Unknown | denisel Waterproof Small Toiletry Bag for Men ... | Made with good quality, water-resistant improv... | [Waterproof] Made with Good quality polyester ... | Ecoland | Compact waterproof toiletry bag |
| 3 | B07HJVVBXB | All Beauty | BLISSHAIR 2x6 Deep Middle Part Lace Closure Hu... | BLISSHAIR Brazilian 2x6 Deep Middle Part Lace ... | Bliss Hair - Brazilian Virgin 2x6 Lace Closure... | BLISSHAIR | Bliss Hair Lace Closure |
| 4 | B08BLB1ZZH | AMAZON FASHION | Mens Basketball Jerseys Lola#10 Bunny Space Mo... | It is the LOLA #10 Men's Basketball Jersey Spa... | Jersey,Polyester,Mesh\nPull On closure\n100% P... | Ki Cut | Lola #10 basketball jersey |
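For the same `black` query, the mmap variant brought the read wall time down from 2.66 s to 2.04 s. One plausible explanation is that a memory-mapped file can be sliced directly, without a `seek`/`read` pair per row, and a single read-only map can be shared by all threads. The sketch below illustrates that pattern under a hypothetical length-prefixed record layout. It is not the `read_rows_parallel_mmap` implementation, and the helper names are made up for the example.

```python
import mmap
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_records(path, texts):
    """Write length-prefixed UTF-8 records; return the byte offset of each one."""
    offsets = []
    with open(path, "wb") as f:
        for t in texts:
            data = t.encode("utf-8")
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(data)) + data)
    return offsets


def read_record(mm, offset):
    """Decode one record by slicing the map; no file position is involved."""
    (n,) = struct.unpack_from("<I", mm, offset)
    return mm[offset + 4:offset + 4 + n].decode("utf-8")


def read_records_mmap(path, offsets, indices, max_workers=4):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The pool is shut down before the map is closed
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                return list(pool.map(lambda i: read_record(mm, offsets[i]), indices))


texts = [f"row {i} " + "x" * i for i in range(100)]
path = os.path.join(tempfile.mkdtemp(), "records.bin")
offsets = write_records(path, texts)
picked = read_records_mmap(path, offsets, [0, 52, 40, 99])
```

`pool.map` returns results in the order of `indices`, so the output lines up with the requested rows.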


## benchmark

In [55]:
import time
import random

def benchmark_read_methods(ts: TableSerializer, num_rows: int = 1000, runs: int = 3, max_workers: int = 4):
    """
    Benchmark different read_rows_parallel strategies.

    Args:
        ts: An instance of TableSerializer.
        num_rows: Number of rows to sample.
        runs: Number of benchmark repetitions.
        max_workers: Number of threads/processes.
    """
    # Sample indices to simulate random access
    total_rows = len(ts._load_offsets())
    indices = random.sample(range(total_rows), min(num_rows, total_rows))

    results = {}

    # Method 1: ThreadPoolExecutor
    def method_threadpool():
        return ts.read_rows_parallel(indices, max_workers=max_workers)

    # Method 2: mmap + ThreadPoolExecutor
    def method_mmap():
        return ts.read_rows_parallel_mmap(indices, max_workers=max_workers)  # Make sure your class uses mmap version

    # Method 3: ProcessPoolExecutor
    def method_processpool():
        return ts.read_rows_parallel_multiprocess(indices, max_workers=max_workers)

    methods = {
        "ThreadPoolExecutor": method_threadpool,
        "mmap + ThreadPoolExecutor": method_mmap,
#        "ProcessPoolExecutor": method_processpool, # does not work in jupyter
    }

    for name, method in methods.items():
        times = []
        for _ in range(runs):
            start = time.perf_counter()  # monotonic, high-resolution timer
            _ = method()
            times.append(time.perf_counter() - start)
        avg = sum(times) / runs
        results[name] = avg
        print(f"{name:<30} | Avg Time: {avg:.4f} sec")

    return results


In [56]:
benchmark_read_methods(table_ser, num_rows=10000)

ThreadPoolExecutor             | Avg Time: 1.2743 sec
mmap + ThreadPoolExecutor      | Avg Time: 0.7878 sec


{'ThreadPoolExecutor': 1.2742706934611003,
 'mmap + ThreadPoolExecutor': 0.7878332932790121}