Eric Phann  
DSBA 6188  
Calmcode 3: lunr.py

lunr.js is a lightweight full-text search engine that works well on small datasets.  
One example is the search box on sites built with [mkdocs material](https://squidfunk.github.io/mkdocs-material/).  
It has been ported to Python as lunr.py.

First, let's grab a dataset.

In [11]:
# command line for dataset from calmcode... not found
# https://calmcode.io/datasets/%7B'name':%20'lunr.py',%20'link':%20'/shorts/lunr.py.html'%7D returns a 404...
# !curl https://calmcode.io/datasets/clinc.csv

In [17]:
import pandas as pd
import os

os.chdir(r"C:\Users\Eric\DSBA 6188\Calmcode\data")
df = pd.read_csv("clinc.csv").assign(idx=lambda d: d.index)
df.sample(3)

| | text | label | idx |
|---|---|---|---|
| 12326 | how many calories are in a honey bun | calories | 12326 |
| 19534 | i need 2000 to go to my chase checking from my... | transfer | 19534 |
| 19171 | i need you to buy a laptop | order | 19171 |


lunr expects a list of dictionaries rather than a DataFrame.

In [18]:
documents = df.to_dict(orient="records")
documents

[{'text': 'how would you say fly in italian', 'label': 'translate', 'idx': 0},
 {'text': "what's the spanish word for pasta", 'label': 'translate', 'idx': 1},
 {'text': 'how would they say butter in zambia',
  'label': 'translate',
  'idx': 2},
 {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
 {'text': "what's the word for trees in norway",
  'label': 'translate',
  'idx': 4},
 {'text': 'how does one say wonderful in german',
  'label': 'translate',
  'idx': 5},
 {'text': 'how do they say tacos in mexico', 'label': 'translate', 'idx': 6},
 {'text': 'how would one say cruiser in china',
  'label': 'translate',
  'idx': 7},
 {'text': "what's the french word you use for potato",
  'label': 'translate',
  'idx': 8},
 {'text': 'what would the word for grass be in finland',
  'label': 'translate',
  'idx': 9},
 {'text': 'how do you say please in french', 'label': 'translate', 'idx': 10},
 {'text': 'how would i say nice to meet you if i were russian',
  'label': 't
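If pandas is not needed for anything else, the same list-of-dictionaries shape can be built with the standard library. The sketch below is only an illustration: it reads a tiny in-memory CSV (a hypothetical stand-in for `clinc.csv`, with made-up variable names) and adds the running `idx` with `enumerate`.

```python
import csv
import io

# Hypothetical stand-in for clinc.csv: a header row plus two sample rows.
raw = io.StringIO(
    "text,label\n"
    "how do you say fast in spanish,translate\n"
    "i need you to buy a laptop,order\n"
)

# DictReader yields one mapping per row; enumerate supplies the running idx.
documents_sketch = [
    {**row, "idx": i} for i, row in enumerate(csv.DictReader(raw))
]

print(documents_sketch[0])
```

Each element has the same `text`, `label`, and `idx` keys as the records produced by `df.to_dict(orient="records")`.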

lunr needs to be told which key uniquely identifies each document (`ref`) and which keys to index (`fields`).

In [20]:
# ! pip install lunr

Collecting lunr
  Downloading lunr-0.7.0.post1-py3-none-any.whl.metadata (6.8 kB)
Downloading lunr-0.7.0.post1-py3-none-any.whl (35 kB)
Installing collected packages: lunr
Successfully installed lunr-0.7.0.post1


In [21]:
from lunr import lunr

index = lunr(ref="idx", fields=('text', ), documents=documents)

Now we can search through the index.

In [22]:
index.search("spanish")

[{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
 {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '27', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '28', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4526', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4529', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4556', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4573', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4575', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4576', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4585', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '5638', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19505', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19507', 'score': 

Each result only carries a `ref` (as a string), so we can use a list comprehension to map the results back to the original records.

In [23]:
[documents[int(i['ref'])] for i in index.search("spanish")]

[{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
  'label': 'translate',
  'idx': 4501},
 {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
 {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
 {'text': 'how do you say dog in spanish', 'label': 'translate', 'idx': 27},
 {'text': 'dog in spanish', 'label': 'translate', 'idx': 28},
 {'text': 'how can i say not now in spanish',
  'label': 'translate',
  'idx': 4526},
 {'text': 'how do you say goodbye in spanish',
  'label': 'translate',
  'idx': 4529},
 {'text': 'what is spanish for hello', 'label': 'translate', 'idx': 4556},
 {'text': 'how do you say thank you in spanish',
  'label': 'translate',
  'idx': 4573},
 {'text': 'how can i say thank you in spanish',
  'label': 'translate',
  'idx': 4575},
 {'text': 'what is thank you in spanish', 'label': 'translate', 'idx': 4576},
 {'text': 'how do you say cat in spanish', 'label': 'translate', 'idx': 4585},
 {'text': 
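The list comprehension above works because each record's `idx` equals its position in `documents`. If the list were ever filtered or reordered, positional indexing would return the wrong record. A dictionary keyed by `ref` avoids that assumption. The sketch below uses hypothetical records and hand-written hits shaped like the search output, so it runs without an index.

```python
# Hypothetical records whose idx values are not list positions.
records = [
    {"text": "dog in spanish", "label": "translate", "idx": 28},
    {"text": "what is spanish for hello", "label": "translate", "idx": 4556},
]

# Hypothetical hits, shaped like the search output above (ref is a string).
hits = [{"ref": "4556", "score": 7.62}, {"ref": "28", "score": 7.62}]

# Key the lookup by the string form of idx so it matches ref directly.
by_ref = {str(rec["idx"]): rec for rec in records}

# Keep the ranking order of the hits and carry the score along.
ranked = [{**by_ref[hit["ref"]], "score": hit["score"]} for hit in hits]
print(ranked[0]["text"])
```

Carrying the score into each record is optional; it is included here because it is often handy when displaying results.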

We can also serialize the index, which turns it into a dictionary.

In [24]:
index.serialize()

{'version': '2.3.9',
 'fields': ['text'],
 'fieldVectors': [['text/0', [0, 8.342, 1, 8.161]],
  ['text/1', [2, 3.398, 3, 6.085, 4, 5.985, 5, 6.726]],
  ['text/2', [6, 8.649, 7, 12.234]],
  ['text/3', [3, 7.62, 8, 8.553]],
  ['text/4', [2, 3.398, 4, 5.985, 9, 8.66, 10, 8.66]],
  ['text/5', [11, 4.707, 12, 7.772, 13, 7.772]],
  ['text/6', [14, 8.23, 15, 7.556]],
  ['text/7', [11, 4.707, 16, 10.864, 17, 7.821]],
  ['text/8', [2, 3.087, 4, 5.437, 18, 5.291, 19, 3.33, 20, 6.052]],
  ['text/9', [4, 6.655, 21, 9.216, 22, 9.404]],
  ['text/10', [18, 7.292, 23, 3.335]],
  ['text/11', [24, 7.517, 25, 4.839, 26, 8.24]],
  ['text/12', [3, 6.085, 27, 4.731, 28, 4.838, 29, 9.254]],
  ['text/13', [30, 4.748, 31, 8.24, 32, 7.443]],
  ['text/14', [33, 6.622, 34, 8.649]],
  ['text/15', [18, 7.292, 35, 4.902]],
  ['text/16', [13, 7.772, 36, 3.877, 37, 4.515]],
  ['text/17', [1, 8.161, 38, 9.2]],
  ['text/18', [39, 7.688, 40, 11.588]],
  ['text/19', [13, 7.772, 33, 5.881, 41, 3.279]],
  ['text/20', [13, 8

This is useful for storing and loading the search index on disk.

In [27]:
import json
from lunr.index import Index

serialized = index.serialize()

# Save the index
with open('idx.json', 'w') as fd:
    json.dump(serialized, fd)

# Load it again
with open("idx.json") as fd:
    reloaded = json.load(fd)

idx = Index.load(reloaded)
idx.search("plant")

[{'ref': '11998', 'score': 9.056, 'match_data': <MatchData "plant">},
 {'ref': '9435', 'score': 8.144, 'match_data': <MatchData "plant">},
 {'ref': '2097', 'score': 7.399, 'match_data': <MatchData "plant">},
 {'ref': '9433', 'score': 7.399, 'match_data': <MatchData "plant">},
 {'ref': '23246', 'score': 7.399, 'match_data': <MatchData "plant">},
 {'ref': '9439', 'score': 6.778, 'match_data': <MatchData "plant">},
 {'ref': '19441', 'score': 6.254, 'match_data': <MatchData "plant">}]
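One way to use this in practice is a small cache helper that loads the JSON file if it exists and builds it otherwise. This is a sketch, not part of lunr.py: `load_or_build` is a hypothetical name, and the demo uses a stand-in dictionary instead of a real serialized index so it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(path, build):
    """Return the JSON payload cached at `path`, building it on a miss.

    `build` is any zero-argument callable returning a JSON-serializable
    object, for example `index.serialize`.
    """
    path = Path(path)
    if path.exists():
        with path.open() as fd:
            return json.load(fd)
    payload = build()
    with path.open("w") as fd:
        json.dump(payload, fd)
    return payload


# Stand-in payload (hypothetical) so the sketch runs without lunr installed.
fake_serialized = {"version": "2.3.9", "fields": ["text"]}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "idx.json"
    first = load_or_build(cache, lambda: fake_serialized)    # builds and writes
    second = load_or_build(cache, lambda: {"unused": True})  # reads from disk

print(first == second)
```

With a real index, the call would look like `Index.load(load_or_build("idx.json", index.serialize))`. The helper never checks whether the underlying documents changed, so the cache file has to be deleted by hand when the data is updated.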

lunr.py works well as a search engine for small datasets and rapid prototyping, e.g., searching through short texts.  
The benchmarks below compare how quickly it retrieves matching records against a pandas filter and a list comprehension.

In [29]:
%timeit df.loc[lambda d: d['text'].str.contains("spanish")]

%timeit [d for d in documents if 'spanish' in d['text']]

%timeit index.search('spanish')

%timeit [documents[int(i['ref'])] for i in index.search('spanish')]

10.4 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
389 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
392 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
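`%timeit` only exists inside IPython and Jupyter; in a plain script the standard-library `timeit` module does the same job. The sketch below also illustrates why an index lookup can beat a scan. It uses a toy inverted index built on a made-up corpus, which is a simplification for illustration and not how lunr.py is implemented internally. The two approaches also differ in meaning: the scan matches substrings, while the toy index matches whole tokens.

```python
import timeit
from collections import defaultdict

# Hypothetical corpus: 10,000 short texts, 1 in 100 mentioning "spanish".
texts = [
    "how do you say hello in spanish"
    if i % 100 == 0
    else f"set a timer for {i} minutes"
    for i in range(10_000)
]

# Toy inverted index: token -> positions of the texts containing it.
inverted = defaultdict(list)
for pos, text in enumerate(texts):
    for token in set(text.split()):
        inverted[token].append(pos)


def scan(term):
    """Linear scan: inspect every text on every query."""
    return [pos for pos, text in enumerate(texts) if term in text]


def lookup(term):
    """Index lookup: one dictionary access per query."""
    return inverted.get(term, [])


same_results = scan("spanish") == lookup("spanish")

# timeit.timeit returns the total seconds taken by `number` calls.
scan_seconds = timeit.timeit(lambda: scan("spanish"), number=20)
lookup_seconds = timeit.timeit(lambda: lookup("spanish"), number=20)
print(same_results, scan_seconds, lookup_seconds)
```

The absolute timings depend on the machine, so none are quoted here. The point is that the scan does work proportional to the corpus size on every query, while the lookup pays that cost once when the index is built.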
