# Index Receipts


In [1]:
from ir import IndexDataframe, search_loop, make_dictionary, make_raw_to_web, \
SimpleQueryMaker, WildQueryMaker, MashedWildQueryMaker, FuzzyMashedWildQueryMaker, Config, SimpleSearcher, evaluate
import pandas as pd
from datetime import datetime
from org.apache.lucene.analysis.standard import StandardAnalyzer

In [2]:
INDEX_DIR = "indexes/receipts1"

In [3]:
path = '../data/raw_web_joined/703_00198_2020-03-20_3_1391204_joined.json'
df = pd.read_json(path)

In [4]:
def index_test(quiet=False):
    """Build the index from df and print how long it took."""
    start = datetime.now()
    try:
        IndexDataframe(df, INDEX_DIR, StandardAnalyzer(), quiet)
        end = datetime.now()
        print('Elapsed: %s' % (end - start))
    except Exception as e:
        print("Failed: %s" % e)
        raise  # re-raise with the original traceback

In [5]:
index_test(True)

.done
Elapsed: 0:00:00.216182


In [6]:
search_loop(INDEX_DIR, 'web', explain=True)

Hit enter with no input to quit.
Query:water
Searching for: water
3 total matching documents.
(LMTD QTY) Essentia Ionized Alkaline Water | ESNT WATER | 1.4246102571487427
1.4246103 = weight(web:water in 0) [BM25Similarity], result of:
  1.4246103 = score(freq=1.0), product of:
    3.1898882 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
      3 = n, number of documents containing term
      84 = N, total number of documents with field
    0.44660193 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
      1.0 = freq, occurrences of term within document
      1.2 = k1, term saturation parameter
      0.75 = b, length normalization parameter
      6.0 = dl, length of field
      5.75 = avgdl, average length of field

------------
(LMTD QTY) FIJI Natural Artesian Water | FIJI WATER | 1.4246102571487427
1.4246103 = weight(web:water in 1) [BM25Similarity], result of:
  1.4246103 = score(freq=1.0), product of:
    3.1898882 = idf, computed as log(1 + (N - n + 
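
The explain output above spells out the BM25 arithmetic term by term. As a sanity check, here is a minimal sketch that recomputes the `water` score from the printed values. The helper name `bm25_score` is made up for this note and is not part of `ir` or Lucene.

```python
import math

def bm25_score(freq, n, N, dl, avgdl, k1=1.2, b=0.75, boost=1.0):
    """Recompute a single-term BM25 score from the values printed by explain."""
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

# Values copied from the explain output for the first `water` hit.
score = bm25_score(freq=1.0, n=3, N=84, dl=6.0, avgdl=5.75)
print(round(score, 4))  # 1.4246
```

Both `water` hits have the same score because they share the same `freq`, `dl`, and `n`; only field length or term frequency would separate them.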

## Equivalency Evaluation
```
Build a table of raw -> [web],
where [web] is the list of all web values that count as a match for raw.
Example: 'FFST CAT FOOD' -> ['Fancy Feast Flaked Fish Cat Food', 'Purina Fancy Feast Chicken Cat Food']

Search on the raw query (e.g. FFST CAT FOOD). If the top result is any of the web values associated with that query, it counts as a hit.
```
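
A minimal sketch of that table, assuming the input is a list of (raw, web) pairs. The real `make_raw_to_web(df)` lives in `ir` and is not shown here, so `build_raw_to_web` and `counts_as_hit` are hypothetical stand-ins.

```python
from collections import defaultdict

def build_raw_to_web(pairs):
    """Map each raw receipt string to the set of web names seen with it."""
    table = defaultdict(set)
    for raw, web in pairs:
        table[raw].add(web)
    return dict(table)

def counts_as_hit(raw, top_result, table):
    """A search counts as a hit when the top result is one of the webs for raw."""
    return top_result in table.get(raw, set())

pairs = [
    ("FFST CAT FOOD", "Fancy Feast Flaked Fish Cat Food"),
    ("FFST CAT FOOD", "Purina Fancy Feast Chicken Cat Food"),
]
raw_to_web = build_raw_to_web(pairs)
print(counts_as_hit("FFST CAT FOOD", "Purina Fancy Feast Chicken Cat Food", raw_to_web))  # True
```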

## Wildcard Technique
```
When building a query, check each token against the dictionary of seen web terms.
Unseen tokens get the wildcard treatment.
Wildcard treatment means inserting * between each pair of adjacent letters,
e.g., FFST CAT FOOD -> F*F*S*T CAT FOOD,
since CAT and FOOD exist in the dictionary.
```
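
The rule above can be sketched in a few lines. This is only an illustration of the idea, not the actual `WildQueryMaker`; the function name and the lowercase dictionary lookup are assumptions.

```python
def wildcard_query(raw, dictionary):
    """Leave known tokens alone; put * between the letters of unseen tokens."""
    terms = []
    for token in raw.split():
        if token.lower() in dictionary:
            terms.append(token)
        else:
            terms.append("*".join(token))
    return " ".join(terms)

print(wildcard_query("FFST CAT FOOD", {"cat", "food"}))  # F*F*S*T CAT FOOD
```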

In [7]:
WORDS = make_dictionary(df)

In [8]:
RAW_TO_WEB = make_raw_to_web(df)

## Scoring
### Interface
```
def is_hit(raw, config): -> bool

config:
  Index
  Searcher
  QueryMaker
```

Algorithm:
* Look up the webs for the raw to generate the hit candidates.
* Process the raw into a query using QueryMaker.
* Run the query using Searcher (Index may not be necessary).
* If the top result is in webs, return true; otherwise return false.
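
A rough sketch of the scoring loop, under stated assumptions: the query maker and searcher are plain callables, the searcher returns a ranked list of web strings, and the raw -> [web] table is passed explicitly. `SketchConfig`, `evaluate_sketch`, and the toy index are invented for illustration; the real `Config` and `evaluate` come from `ir`.

```python
from collections import namedtuple

SketchConfig = namedtuple("SketchConfig", ["query_maker", "searcher"])

def is_hit(raw, config, raw_to_web):
    """True when the top search result is one of the webs recorded for raw."""
    query = config.query_maker(raw)
    results = config.searcher(query)
    return bool(results) and results[0] in raw_to_web.get(raw, ())

def evaluate_sketch(queries, config, raw_to_web):
    """Return (fraction of hits, list of missed queries)."""
    misses = [q for q in queries if not is_hit(q, config, raw_to_web)]
    score = (len(queries) - len(misses)) / len(queries)
    return score, misses

# Toy data, made up so the sketch runs without an index.
raw_to_web = {
    "AVOCADO": {"Avocado"},
    "FFST CAT FOOD": {"Fancy Feast Flaked Fish Cat Food"},
}
toy_index = {
    "avocado": ["Some Other Product"],
    "ffst cat food": ["Fancy Feast Flaked Fish Cat Food"],
}
config = SketchConfig(query_maker=str.lower, searcher=lambda q: toy_index.get(q, []))
print(evaluate_sketch(["AVOCADO", "FFST CAT FOOD"], config, raw_to_web))
# (0.5, ['AVOCADO'])
```

Returning the misses alongside the score is what makes the set differences between configurations possible further down.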

In [9]:
qm = SimpleQueryMaker()

In [10]:
ss = SimpleSearcher(INDEX_DIR)

In [11]:
simple_config = Config(qm, ss)

In [12]:
queries = ['AVOCADO', 'FFST CAT FOOD']

In [13]:
evaluate(queries, simple_config, RAW_TO_WEB)

(0.5, ['AVOCADO'])

In [14]:
queries = df.raw.unique()

In [15]:
simple_score, simple_misses = evaluate(queries, simple_config, RAW_TO_WEB)

In [16]:
len(queries)

69

In [17]:
wqm = WildQueryMaker(WORDS)

In [18]:
wild_config = Config(wqm, ss)

In [19]:
wild_score, wild_misses = evaluate(queries, wild_config, RAW_TO_WEB)

In [20]:
simple_score

0.6231884057971014

In [21]:
wild_score

0.8405797101449275

In [22]:
simple_misses

['KRO WATER',
 'CA REDEM VAL',
 'ARTICHOKES',
 'BYND SSG HT ITLN',
 'BRHD CHEESE',
 'CUCUMBERS',
 'FRGO STR CHS',
 'GLBNI STR CHS',
 'KRO SOAP',
 'KRO CCNT MK',
 'MSHRM BYBL WHL',
 'LES PET CHEESE BAR',
 'BROWN ONIONS',
 'ASP ORG',
 'PPRS BL GRN ORGN',
 'RADISH ORG',
 'SQSH YLW ORG',
 'TOMATO ORGNC',
 'PRSL MPL TKY GNG',
 'PRSL MUENSTR',
 'STO CRT BABY ORGNC',
 'STO CCNT MILK',
 'STO BROTH',
 'STO CARROTS ORGNC',
 'STN CHCK RST',
 'SFTSOAP KTCHN FRSH']

### wild_misses
`wild_misses` is the set of queries that weren't matched by using wildcard queries on terms.  
Examples:
* `CA REDEM VAL` - difficult to match since there is little lexical overlap
* `ARTICHOKES`, `CUCUMBERS` - plural mismatch
* `BRHD *`, `PRSL *`, `STO *` - Wildcard won't match terms because BRHD, PRSL, and STO each span multiple terms. However, a wildcard match against the entire web string might work.
* `KRO SOAP` - Matched `Kroger® Pear & Coconut Hand Soap` which had associated raw field of `KRO PEAR COCONUT`.  The actual hit was at rank 2.  Measuring precision@2 would have caught it.
* `LES PET CHEESE BAR` - Matched `Les Petites Havarati Cheese Wedge`.  The second hit was correct, with web value of `Les Petites Kosher Colby Jack Cheese`.  Note that neither web had disambiguating `BAR` term present.
* `BROWN ONIONS` - Matched `Onions - Green`.  Correct hit was ranked second: `Onions - Yellow`.  Note that there was no Yellow Onions in the web texts.
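
Several of these misses had the correct web at rank 2. A small sketch of a top-k variant of the hit test, using the `LES PET CHEESE BAR` results quoted above; `hit_at_k` is a hypothetical helper and not part of `ir`.

```python
def hit_at_k(ranked_webs, expected_webs, k=1):
    """True if any of the first k results is an accepted web name."""
    return any(web in expected_webs for web in ranked_webs[:k])

ranked = [
    "Les Petites Havarati Cheese Wedge",     # top result, wrong
    "Les Petites Kosher Colby Jack Cheese",  # second result, correct
]
expected = {"Les Petites Kosher Colby Jack Cheese"}

print(hit_at_k(ranked, expected, k=1))  # False
print(hit_at_k(ranked, expected, k=2))  # True
```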

In [23]:
wild_misses

['CA REDEM VAL',
 'ARTICHOKES',
 'BRHD CHEESE',
 'CUCUMBERS',
 'KRO SOAP',
 'LES PET CHEESE BAR',
 'BROWN ONIONS',
 'PRSL MUENSTR',
 'STO CCNT MILK',
 'STO BROTH',
 'STO CARROTS ORGNC']

### wild_eliminees
`wild_eliminees` is the set of simple_misses that were eliminated by using wildcard queries on terms
Example: `ASP ORG` is eliminated by wildcard queries, probably due to match of `Asparagus` and `Organic`

In [24]:
wild_eliminees = set(simple_misses) - set(wild_misses)
wild_eliminees

{'ASP ORG',
 'BYND SSG HT ITLN',
 'FRGO STR CHS',
 'GLBNI STR CHS',
 'KRO CCNT MK',
 'KRO WATER',
 'MSHRM BYBL WHL',
 'PPRS BL GRN ORGN',
 'PRSL MPL TKY GNG',
 'RADISH ORG',
 'SFTSOAP KTCHN FRSH',
 'SQSH YLW ORG',
 'STN CHCK RST',
 'STO CRT BABY ORGNC',
 'TOMATO ORGNC'}

In [25]:
"""
Note that there are no `wild_misses` that were not in `simple_misses`.
This means that using wildcards doesn't hurt performance
"""
set(wild_misses) - set(simple_misses)

set()

## TODO
* Try not analyzing the entire web string and doing wildcard matches against only the unanalyzed string (don't do wildcard matches against terms)
* Try wildcard matching against terms and entire unanalyzed web string

### Fuzzy match
Notice that wildcard queries can overmatch, resulting in false positives.
For example, `m*u*e*n*s*t*r*` matches `MoisturePartSkimOriginalMozzarellaStringCheese`.
Combining the wildcard match with a fuzzy match makes Muenster surface to the top (see the following examples).
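
The overmatch can be reproduced outside Lucene. This sketch uses Python's `fnmatch`, whose `*` also matches any run of characters including none, as a stand-in for the wildcard query; it says nothing about how Lucene scores the match.

```python
from fnmatch import fnmatchcase

pattern = "m*u*e*n*s*t*r*"
mozzarella = "moisturepartskimoriginalmozzarellastringcheese"
muenster = "muenstercheese"

# Both mashed terms contain m, u, e, n, s, t, r in that order,
# so the pattern alone cannot tell them apart.
print(fnmatchcase(mozzarella, pattern))  # True
print(fnmatchcase(muenster, pattern))    # True
```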

In [26]:
ss.search('mashed_web:m*u*e*n*s*t*r*')[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:FRGO STR CHS> stored,indexed,tokenized<web:Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese> stored,indexed,tokenized<mashed_web:frigocheeseheadslowmoisturepartskimoriginalmozzarellastringcheese cheeseheadslowmoisturepartskimoriginalmozzarellastringcheese headslowmoisturepartskimoriginalmozzarellastringcheese lowmoisturepartskimoriginalmozzarellastringcheese moisturepartskimoriginalmozzarellastringcheese partskimoriginalmozzarellastringcheese skimoriginalmozzarellastringcheese originalmozzarellastringcheese mozzarellastringcheese stringcheese cheese> stored,indexed,tokenized<bigrams:mozzarella_string string_cheese> stored,indexed,tokenized<trigrams:mozzarella_string_cheese> stored<id:4171623216>>>

In [27]:
ss.explain('mashed_web:m*u*e*n*s*t*r*')

mashed_web:m*u*e*n*s*t*r*
Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese | FRGO STR CHS | 1.0
1.0 = mashed_web:m*u*e*n*s*t*r*

------------
Private Selection™ Grab & Go Muenster Cheese | PRSL MUENSTR | 1.0
1.0 = mashed_web:m*u*e*n*s*t*r*

------------


In [28]:
ss.search('mashed_web:m*u*e*n*s*t*r*  meunstr~')[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:PRSL MUENSTR> stored,indexed,tokenized<web:Private Selection™ Grab & Go Muenster Cheese> stored,indexed,tokenized<mashed_web:privateselectiongrabgomuenstercheese selectiongrabgomuenstercheese grabgomuenstercheese gomuenstercheese gomuenstercheese muenstercheese cheese> stored<id:20615390000>>>

In [29]:
ss.explain('mashed_web:m*u*e*n*s*t*r*  meunstr~')

mashed_web:m*u*e*n*s*t*r*  meunstr~
Private Selection™ Grab & Go Muenster Cheese | PRSL MUENSTR | 2.2037241458892822
2.2037241 = sum of:
  1.0 = mashed_web:m*u*e*n*s*t*r*
  1.2037241 = weight(web:muenster in 59) [BM25Similarity], result of:
    1.2037241 = score(freq=1.0), product of:
      0.71428573 = boost
      4.037186 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1 = n, number of documents containing term
        84 = N, total number of documents with field
      0.41742286 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 = freq, occurrences of term within document
        1.2 = k1, term saturation parameter
        0.75 = b, length normalization parameter
        7.0 = dl, length of field
        5.75 = avgdl, average length of field

------------
Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese | FRGO STR CHS | 1.0
1.0 = sum of:
  1.0 = mashed_web:m*u*e*n*s*t*r*

------------


In [30]:
ss.search('mashed_web:p*r*s*l* prsl~ mashed_web:m*u*e*n*s*t*r*  meunstr~')[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:PRSL MUENSTR> stored,indexed,tokenized<web:Private Selection™ Grab & Go Muenster Cheese> stored,indexed,tokenized<mashed_web:privateselectiongrabgomuenstercheese selectiongrabgomuenstercheese grabgomuenstercheese gomuenstercheese gomuenstercheese muenstercheese cheese> stored<id:20615390000>>>

In [31]:
ss.explain('mashed_web:p*r*s*l* prsl~ mashed_web:m*u*e*n*s*t*r*  meunstr~')

mashed_web:p*r*s*l* prsl~ mashed_web:m*u*e*n*s*t*r*  meunstr~
Private Selection™ Grab & Go Muenster Cheese | PRSL MUENSTR | 3.2037241458892822
3.2037241 = sum of:
  1.0 = mashed_web:p*r*s*l*
  1.0 = mashed_web:m*u*e*n*s*t*r*
  1.2037241 = weight(web:muenster in 59) [BM25Similarity], result of:
    1.2037241 = score(freq=1.0), product of:
      0.71428573 = boost
      4.037186 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1 = n, number of documents containing term
        84 = N, total number of documents with field
      0.41742286 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 = freq, occurrences of term within document
        1.2 = k1, term saturation parameter
        0.75 = b, length normalization parameter
        7.0 = dl, length of field
        5.75 = avgdl, average length of field

------------
Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese | FRGO STR CHS | 2.0
2.0 = sum of:
  1.0 = mashed_

## Mashed and Fuzzy
* mashed wildcard (matching mashed_web field with wildcard query)
* fuzzy mashed wildcard (mix in fuzzy terms to mashed wildcard)
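
For reference, the `mashed_web` values stored in the documents above look like every suffix of the web token list, lowercased, stripped of punctuation, and concatenated. The sketch below is inferred from those stored values and is not the actual indexing code; `mash` is a made-up name.

```python
import re

def mash(web):
    """Concatenate every suffix of the token list into one whitespace-free term."""
    tokens = [re.sub(r"[^a-z0-9]", "", t.lower()) for t in web.split()]
    return " ".join("".join(tokens[i:]) for i in range(len(tokens)))

print(mash("Beef Choice For Stir Fry"))
# beefchoiceforstirfry choiceforstirfry forstirfry stirfry fry
```

A token that is pure punctuation (such as `&`) collapses to an empty string, which would explain the duplicated `gomuenstercheese` in the Muenster document.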

In [32]:
mwqm = MashedWildQueryMaker(WORDS)

In [33]:
mashed_wild_config = Config(mwqm, ss)

In [34]:
mashed_wild_score, mashed_wild_misses = evaluate(queries, mashed_wild_config, RAW_TO_WEB)

In [35]:
mashed_wild_score

0.8840579710144928

In [36]:
mashed_wild_misses

['CA REDEM VAL',
 'ARTICHOKES',
 'CUCUMBERS',
 'KRO SOAP',
 'LES PET CHEESE BAR',
 'BROWN ONIONS',
 'PRSL MUENSTR',
 'STO CARROTS ORGNC']

### mashed_wild_eliminees
`mashed_wild_eliminees` is the set of `wild_misses` that were eliminated by using mashed wildcard queries.  
Example: `BRHD CHEESE` is eliminated by mashed wildcard queries, probably because `BRHD` now matches `Boar's Head`

In [37]:
mashed_wild_eliminees = set(wild_misses) - set(mashed_wild_misses)
mashed_wild_eliminees

{'BRHD CHEESE', 'STO BROTH', 'STO CCNT MILK'}

In [40]:
regressions = set(mashed_wild_misses) - set(wild_misses)
regressions

set()

In [41]:
fmwqm = FuzzyMashedWildQueryMaker(WORDS)

In [42]:
fuzzy_mashed_wild_config = Config(fmwqm, ss)

In [43]:
fuzzy_mashed_wild_score, fuzzy_mashed_wild_misses = evaluate(queries, fuzzy_mashed_wild_config, RAW_TO_WEB)

In [44]:
fuzzy_mashed_wild_score

0.8985507246376812

In [45]:
fuzzy_mashed_wild_misses

['CA REDEM VAL',
 'FRGO STR CHS',
 'KRO SOAP',
 'LES PET CHEESE BAR',
 'OCEANS HALO BROTH',
 'BROWN ONIONS',
 'STO CARROTS ORGNC']

### fuzzy_mashed_wild_eliminees
By introducing fuzzy matching, we pick up plurals (`ARTICHOKES`, `CUCUMBERS`) in addition to mild misspellings (`MUENSTR`).

In [48]:
fuzzy_mashed_wild_eliminees = set(mashed_wild_misses) - set(fuzzy_mashed_wild_misses)
fuzzy_mashed_wild_eliminees

{'ARTICHOKES', 'CUCUMBERS', 'PRSL MUENSTR'}

In [51]:
fuzzy_mashed_regressions = set(fuzzy_mashed_wild_misses) - set(mashed_wild_misses)
fuzzy_mashed_regressions

{'FRGO STR CHS', 'OCEANS HALO BROTH'}

In [50]:
ss.search(fmwqm.make_query('FRGO STR CHS'))[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:BEEF STR FRY> stored,indexed,tokenized<web:Beef Choice For Stir Fry> stored,indexed,tokenized<mashed_web:beefchoiceforstirfry choiceforstirfry forstirfry stirfry fry> stored<id:20254600000>>>

`FRGO STR CHS` matches `Beef Choice For Stir Fry`:
* `frgo~` matches `fry`, I think
* `frgo~` matches `for`, I think (although `for` is also only two substitutions away from `str`, so `str~` may be responsible)
* `str~` matches `stir`
* `c*h*s*` matches `choice for stir`. This would be eliminated if we only matched against bigram/trigram terms (`choice for stir` hopefully isn't a trigram)

This suggests that we should exclude web words that are shorter than the raw word: both `for` and `fry` are shorter than `frgo`.

The False Negative is the second hit
* Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese
* frgo~ matches Frigo
* chs~ does not match Cheese... why?
* str~ does not match String... why?
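
One possible answer to the two "why?" questions, assuming Lucene's fuzzy queries accept at most two edits (insert, delete, substitute, or swap of adjacent letters): `chs` -> `cheese` and `str` -> `string` both need three insertions. The sketch below checks this with a plain edit-distance function. The boost formula `1 - edits / min(len(query), len(term))` is a guess that happens to agree with the boost values printed by explain; treat it as an assumption.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

MAX_EDITS = 2  # assumed limit for fuzzy queries

def fuzzy_boost(query_term, index_term):
    """Assumed boost for a fuzzy match, or None when the term is out of range."""
    edits = edit_distance(query_term, index_term)
    if edits > MAX_EDITS:
        return None
    return 1 - edits / min(len(query_term), len(index_term))

for q, t in [("chs", "cheese"), ("str", "string"), ("frgo", "frigo"), ("meunstr", "muenster")]:
    print(q, t, edit_distance(q, t), fuzzy_boost(q, t))
```

Under these assumptions `chs~` and `str~` fall outside the fuzzy range for `cheese` and `string`, while `frgo~` reaches `frigo` with one edit and `meunstr~` reaches `muenster` with two (boost of about 0.714).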

I think bonus points should maybe be awarded for prefix matches.
Examples:
* KRO => Kroger
* STR => String

This could be accomplished by adding prefix queries,
e.g., in addition to S*T*R*, have STR*
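
A sketch of what such a query maker might emit, modeled on the query string that `fmwqm.make_query` prints in this notebook. The function name and the exact clause order are assumptions, and whether the extra prefix clause actually helps ranking is untested.

```python
def fuzzy_mashed_wild_with_prefix(raw, dictionary):
    """Build mashed-wildcard + fuzzy clauses and add a prefix clause per unseen token."""
    clauses = []
    for token in raw.lower().split():
        if token in dictionary:
            clauses.append(token)  # known word: plain term query
            continue
        wild = "".join(ch + "*" for ch in token)  # str -> s*t*r*
        clauses += [f"mashed_web:{wild}", f"{token}~", f"{token}*"]
    return " ".join(clauses)

print(fuzzy_mashed_wild_with_prefix("FRGO STR CHS", set()))
# mashed_web:f*r*g*o* frgo~ frgo* mashed_web:s*t*r* str~ str* mashed_web:c*h*s* chs~ chs*
```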


In [49]:
ss.explain(fmwqm.make_query('FRGO STR CHS'))

mashed_web:f*r*g*o* frgo~ mashed_web:s*t*r* str~ mashed_web:c*h*s* chs~
Beef Choice For Stir Fry | BEEF STR FRY | 3.921565532684326
3.9215655 = sum of:
  0.5105596 = weight(web:fry in 10) [BM25Similarity], result of:
    0.5105596 = score(freq=1.0), product of:
      0.3333333 = boost
      3.1898882 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        3 = n, number of documents containing term
        84 = N, total number of documents with field
      0.48016697 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 = freq, occurrences of term within document
        1.2 = k1, term saturation parameter
        0.75 = b, length normalization parameter
        5.0 = dl, length of field
        5.75 = avgdl, average length of field
  1.0 = mashed_web:s*t*r*
  0.47033533 = weight(web:for in 10) [BM25Similarity], result of:
    0.47033533 = score(freq=1.0), product of:
      0.3333333 = boost
      2.9385738 = idf, computed as log(1 + (N - n + 0.5)

Note that the explanation above includes a match on a stop word (`web:for`).
Eliminating stop words might help.

In [45]:
ss.search(mwqm.make_query('FRGO STR CHS'))[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:FRGO STR CHS> stored,indexed,tokenized<web:Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese> stored,indexed,tokenized<mashed_web:FrigoCheeseHeadsLowMoisturePartSkimOriginalMozzarellaStringCheese CheeseHeadsLowMoisturePartSkimOriginalMozzarellaStringCheese HeadsLowMoisturePartSkimOriginalMozzarellaStringCheese LowMoisturePartSkimOriginalMozzarellaStringCheese MoisturePartSkimOriginalMozzarellaStringCheese PartSkimOriginalMozzarellaStringCheese SkimOriginalMozzarellaStringCheese OriginalMozzarellaStringCheese MozzarellaStringCheese StringCheese Cheese> stored<id:4171623216>>>

In [46]:
fmwqm.make_query('FRGO STR CHS')

'mashed_web:f*r*g*o* frgo~ mashed_web:s*t*r* str~ mashed_web:c*h*s* chs~'

OCEANS HALO BROTH matches Pero Organic Green Beans
* o*c*e*a*n*s* matches OrganiC green bEANS - this would have been eliminated using ngrams

In [52]:
ss.explain(fmwqm.make_query('OCEANS HALO BROTH'))

mashed_web:o*c*e*a*n*s* oceans~ mashed_web:h*a*l*o* halo~ broth
Pero Organic Green Beans | GREEN BEANS ORGNC | 2.257633686065674
2.2576337 = sum of:
  1.0 = mashed_web:o*c*e*a*n*s*
  1.2576336 = weight(web:beans in 56) [BM25Similarity], result of:
    1.2576336 = score(freq=1.0), product of:
      0.6 = boost
      4.037186 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1 = n, number of documents containing term
        84 = N, total number of documents with field
      0.51918733 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 = freq, occurrences of term within document
        1.2 = k1, term saturation parameter
        0.75 = b, length normalization parameter
        4.0 = dl, length of field
        5.75 = avgdl, average length of field

------------
NO CHICKEN BROTH TETRA | OCEANS HALO BROTH | 1.8308416604995728
1.8308417 = sum of:
  1.8308417 = weight(web:broth in 40) [BM25Similarity], result of:
    1.8308417 = score(freq=1.

## TODO 2020_05_01
* Improve receipt collocations to eliminate hyphens, blanks, and dashes
* Enhance index with 2 new fields: bigrams and trigrams
* Create new nwqm - a competitor for mwqm - that matches against ngram fields instead of mashed.  nwqm stands for Ngram wildcard matcher
* Instead of doing wildcard matching on mashed wildcard field, do against bigram and trigram fields
* Create fnwqm - fuzzy ngram wildcard query maker
* expected ordering: mwqm < nwqm < fmwqm < fnwqm
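
A minimal sketch of how bigram and trigram terms could be generated for such fields, joining words with `_` as in the `bigrams` and `trigrams` values visible in one of the stored documents earlier. It emits every n-gram; if the index should only hold receipt collocations, a filter would be needed on top. `ngram_terms` is a hypothetical helper.

```python
def ngram_terms(web, n):
    """Return all word n-grams of a web string, lowercased and joined with '_'."""
    tokens = web.lower().split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngram_terms("Mozzarella String Cheese", 2))  # ['mozzarella_string', 'string_cheese']
print(ngram_terms("Mozzarella String Cheese", 3))  # ['mozzarella_string_cheese']
```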

## TODO
* Read Deep Learning + Search chapters to see how they handle synonyms
* Rerun word2vec notebook
* Train/test word2vec on some big data
* Train/test word2vec on the receipt data
* Figure out how to integrate wordnet
* Maybe: index json data into lucene including receipt Id, product id etc.  Use as a basis for data manipulation instead of dataframes
* Get latest receipt data
* Detailed query.explain() to figure out why we're still missing and missing more on fmwqm.  Also why can't we AND the mwqm queries?
* More data (divide receipts into dev and test)
* Instead of wildcard matching on the concatenation of the entire web phrase, do bigrams, trigrams, 4-grams (maybe)
* synonyms
  * wordnet
  * word2vec