# Index Receipts


In [33]:
from ir import IndexDataframe, search_loop, make_dictionary, make_raw_to_web, \
SimpleQueryMaker, WildQueryMaker, MashedWildQueryMaker, Config, SimpleSearcher, evaluate
import pandas as pd
from datetime import datetime
from org.apache.lucene.analysis.standard import StandardAnalyzer

In [2]:
INDEX_DIR = "indexes/receipts1"

In [3]:
path = '../data/raw_web_joined/703_00198_2020-03-20_3_1391204_joined.json'
df = pd.read_json(path)

In [4]:
def index_test(quiet=False):
    """Index df into INDEX_DIR and print the elapsed time."""
    start = datetime.now()
    try:
        IndexDataframe(df, INDEX_DIR, StandardAnalyzer(), quiet)
        end = datetime.now()
        print('Elapsed: %s' % (end - start))
    except Exception as e:
        print("Failed: %s" % e)
        raise  # re-raise with the original traceback

In [5]:
index_test(True)

.done
Elapsed: 0:00:00.167107


In [6]:
search_loop(INDEX_DIR, 'web')

Hit enter with no input to quit.
Query:


## Equivalency Evaluation
```
build a table of raw -> [web]
where [web] is all the possible values of matches for raw
example: 'FFST CAT FOOD' -> ['Fancy Feast Flaked Fish Cat Food', 'Purina Fancy Feast Chicken Cat Food']

search on the query (e.g. FFST CAT FOOD); if the top result is any of the webs associated with the query, it counts as a hit
```
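
The table and the hit rule described above can be sketched in a few lines of plain Python. This is only an illustration: `build_raw_to_web` and `top_result_is_hit` are hypothetical names, not the `make_raw_to_web` helper from `ir`, whose implementation is not shown in this notebook.

```python
from collections import defaultdict

def build_raw_to_web(pairs):
    """Group every web description seen for a raw receipt string."""
    table = defaultdict(set)
    for raw, web in pairs:
        table[raw].add(web)
    return dict(table)

def top_result_is_hit(raw, top_web, raw_to_web):
    """True when the top-ranked web string is one of the known matches for raw."""
    return top_web in raw_to_web.get(raw, set())

# (raw, web) pairs taken from the example in the notes above
pairs = [
    ("FFST CAT FOOD", "Fancy Feast Flaked Fish Cat Food"),
    ("FFST CAT FOOD", "Purina Fancy Feast Chicken Cat Food"),
]
table = build_raw_to_web(pairs)
print(top_result_is_hit("FFST CAT FOOD", "Purina Fancy Feast Chicken Cat Food", table))
```

Using a set per raw string keeps the membership test cheap and ignores duplicate (raw, web) rows.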

## Wildcard Technique
```
when building a query evaluate against dictionary of seen web terms
unseen tokens get wildcard treatment
wildcard treatment means insert * between each letter
e.g., FFST CAT FOOD -> F*F*S*T CAT FOOD
since cat and food exist in dictionary
```
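
A minimal sketch of the wildcard treatment described above. The function names are hypothetical and this is not the `WildQueryMaker` implementation; the case-insensitive dictionary lookup is an assumption.

```python
def wildcard_token(token):
    """Insert '*' between each letter, e.g. FFST -> F*F*S*T."""
    return "*".join(token)

def make_wild_query(raw, known_words):
    """Keep tokens found in the dictionary; wildcard the unseen ones."""
    # Assumption: the dictionary lookup ignores case.
    known = {word.lower() for word in known_words}
    parts = []
    for token in raw.split():
        if token.lower() in known:
            parts.append(token)
        else:
            parts.append(wildcard_token(token))
    return " ".join(parts)

print(make_wild_query("FFST CAT FOOD", {"cat", "food"}))  # F*F*S*T CAT FOOD
```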

In [7]:
WORDS = make_dictionary(df)

In [8]:
RAW_TO_WEB = make_raw_to_web(df)

## Scoring
### Interface
```
def is_hit(raw, config) -> bool: ...

config:
  Index
  Searcher
  QueryMaker
```

Algorithm:
* Look up the webs associated with the raw to generate the hit candidates.
* Process the raw into a query using QueryMaker.
* Run the query using Searcher.
* If the top result is in the webs, return true; otherwise return false.

The Index entry in the config may not be necessary.
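
The scoring steps above could look roughly like the following. This is a sketch under assumptions: the query maker and searcher are passed as plain callables instead of a `Config` object, `evaluate_sketch` is a hypothetical stand-in for `evaluate`, and the index is a toy dictionary with a made-up miss (`XYZ`).

```python
def is_hit(raw, make_query, search, raw_to_web):
    """True when the top search result is one of the known webs for raw."""
    candidates = raw_to_web.get(raw, set())
    results = search(make_query(raw))
    return bool(results) and results[0] in candidates

def evaluate_sketch(raws, make_query, search, raw_to_web):
    """Return (fraction of hits, list of raws that missed)."""
    misses = [r for r in raws if not is_hit(r, make_query, search, raw_to_web)]
    score = 1 - len(misses) / len(raws) if raws else 0.0
    return score, misses

# Toy stand-ins for the index and query maker (illustration only)
raw_to_web = {
    "FFST CAT FOOD": {"Fancy Feast Flaked Fish Cat Food"},
    "XYZ": {"Some Product"},
}
toy_index = {
    "ffst cat food": ["Fancy Feast Flaked Fish Cat Food"],
    "xyz": ["Another Product", "Some Product"],
}
make_query = str.lower
search = lambda query: toy_index.get(query, [])

print(evaluate_sketch(["FFST CAT FOOD", "XYZ"], make_query, search, raw_to_web))
```

Returning the misses alongside the score makes it easy to compare query makers with set differences, as the later cells do.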

In [9]:
qm = SimpleQueryMaker()

In [10]:
ss = SimpleSearcher(INDEX_DIR)

In [11]:
simple_config = Config(qm, ss)

In [12]:
queries = ['AVOCADO', 'FFST CAT FOOD']

In [13]:
evaluate(queries, simple_config, RAW_TO_WEB)

(0.5, ['AVOCADO'])

In [14]:
queries = df.raw.unique()

In [15]:
simple_score, simple_misses = evaluate(queries, simple_config, RAW_TO_WEB)

In [16]:
len(queries)

69

In [17]:
wqm = WildQueryMaker(WORDS)

In [18]:
wild_config = Config(wqm, ss)

In [19]:
wild_score, wild_misses = evaluate(queries, wild_config, RAW_TO_WEB)

In [20]:
simple_score

0.6231884057971014

In [21]:
wild_score

0.8405797101449275

In [22]:
simple_misses

['KRO WATER',
 'CA REDEM VAL',
 'ARTICHOKES',
 'BYND SSG HT ITLN',
 'BRHD CHEESE',
 'CUCUMBERS',
 'FRGO STR CHS',
 'GLBNI STR CHS',
 'KRO SOAP',
 'KRO CCNT MK',
 'MSHRM BYBL WHL',
 'LES PET CHEESE BAR',
 'BROWN ONIONS',
 'ASP ORG',
 'PPRS BL GRN ORGN',
 'RADISH ORG',
 'SQSH YLW ORG',
 'TOMATO ORGNC',
 'PRSL MPL TKY GNG',
 'PRSL MUENSTR',
 'STO CRT BABY ORGNC',
 'STO CCNT MILK',
 'STO BROTH',
 'STO CARROTS ORGNC',
 'STN CHCK RST',
 'SFTSOAP KTCHN FRSH']

### wild_misses
`wild_misses` is the set of queries that were not matched when using wildcard queries on terms.

Examples:
* `CA REDEM VAL` - difficult to match because there is little lexical overlap
* `ARTICHOKES`, `CUCUMBERS` - plural mismatch
* `BRHD *`, `PRSL *`, `STO *` - The wildcard won't match terms because `BRHD`, `PRSL`, and `STO` each span multiple terms. However, a wildcard match against the entire web string might work.
* `KRO SOAP` - Matched `Kroger® Pear & Coconut Hand Soap`, which had an associated raw field of `KRO PEAR COCONUT`. The actual hit was at rank 2. Measuring precision@2 would have caught it.
* `LES PET CHEESE BAR` - Matched `Les Petites Havarati Cheese Wedge`. The second hit was correct, with a web value of `Les Petites Kosher Colby Jack Cheese`. Note that neither web had the disambiguating `BAR` term present.
* `BROWN ONIONS` - Matched `Onions - Green`. The correct hit was ranked second: `Onions - Yellow`. Note that there were no Yellow Onions in the web texts.
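
Several of these misses have the correct web at rank 2, so a top-k variant of the hit test would count them. The sketch below uses a hypothetical `hit_at_k` helper, which is loosely what the notes call precision@2; it is not part of `ir`.

```python
def hit_at_k(ranked_webs, candidates, k=1):
    """True when any of the first k results is a known web for the raw."""
    return any(web in candidates for web in ranked_webs[:k])

# Ranking reported above for BROWN ONIONS
ranked = ["Onions - Green", "Onions - Yellow"]
candidates = {"Onions - Yellow"}

print(hit_at_k(ranked, candidates, k=1))  # False
print(hit_at_k(ranked, candidates, k=2))  # True
```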

In [23]:
wild_misses

['CA REDEM VAL',
 'ARTICHOKES',
 'BRHD CHEESE',
 'CUCUMBERS',
 'KRO SOAP',
 'LES PET CHEESE BAR',
 'BROWN ONIONS',
 'PRSL MUENSTR',
 'STO CCNT MILK',
 'STO BROTH',
 'STO CARROTS ORGNC']

### wild_eliminees
`wild_eliminees` is the set of `simple_misses` that were eliminated by using wildcard queries on terms.

Example: `ASP ORG` is eliminated by wildcard queries, probably because the wildcards match `Asparagus` and `Organic`.

In [24]:
wild_eliminees = set(simple_misses) - set(wild_misses)
wild_eliminees

{'ASP ORG',
 'BYND SSG HT ITLN',
 'FRGO STR CHS',
 'GLBNI STR CHS',
 'KRO CCNT MK',
 'KRO WATER',
 'MSHRM BYBL WHL',
 'PPRS BL GRN ORGN',
 'PRSL MPL TKY GNG',
 'RADISH ORG',
 'SFTSOAP KTCHN FRSH',
 'SQSH YLW ORG',
 'STN CHCK RST',
 'STO CRT BABY ORGNC',
 'TOMATO ORGNC'}

In [25]:
"""
Note that there are no `wild_misses` that were not already in `simple_misses`.
This means that, on this dataset, using wildcards introduces no new misses.
"""
set(wild_misses) - set(simple_misses)

set()

## TODO
* Try not analyzing the entire web string and doing wildcard matches against only the unanalyzed string (don't do wildcard matches against terms)
* Try wildcard matching against terms and entire unanalyzed web string

### Fuzzy match
Notice that a wildcard can overmatch, resulting in false positives.
For example, `m*u*e*n*s*t*r*` matches `MoisturePartSkimOriginalMozzarellaStringCheese`.
Combining the wildcard match with a fuzzy match makes Muenster surface to the top (see the following examples).
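
The overmatch can be reproduced outside Lucene with Python's `fnmatch`, whose `*` also matches any run of characters. This is only an approximation of the index behaviour: the terms are lowercased here as an assumption, and `letter_wildcard` and `wildcard_plus_fuzzy` are hypothetical helpers that mirror the shape of the query strings in the cells that follow.

```python
from fnmatch import fnmatchcase

def letter_wildcard(token):
    """Put '*' after every letter, e.g. muenstr -> m*u*e*n*s*t*r*."""
    return "*".join(token) + "*"

def wildcard_plus_fuzzy(token, field="mashed_web"):
    """Pair the field-scoped wildcard clause with a fuzzy clause (token~)."""
    return f"{field}:{letter_wildcard(token)} {token}~"

pattern = letter_wildcard("muenstr")
terms = [
    "moisturepartskimoriginalmozzarellastringcheese",  # the false positive
    "muenstercheese",                                  # the intended match
]
matches = [term for term in terms if fnmatchcase(term, pattern)]

print(matches)                          # both terms match the pattern
print(wildcard_plus_fuzzy("muenstr"))   # mashed_web:m*u*e*n*s*t*r* muenstr~
```

Because the wildcard alone cannot separate the two terms, the extra fuzzy clause is what lets the scorer prefer the document that actually contains a term close to the abbreviation.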

In [29]:
ss.search('mashed_web:m*u*e*n*s*t*r*')[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:FRGO STR CHS> stored,indexed,tokenized<web:Frigo Cheese Heads Low Moisture Part Skim Original Mozzarella String Cheese> stored,indexed,tokenized<mashed_web:FrigoCheeseHeadsLowMoisturePartSkimOriginalMozzarellaStringCheese CheeseHeadsLowMoisturePartSkimOriginalMozzarellaStringCheese HeadsLowMoisturePartSkimOriginalMozzarellaStringCheese LowMoisturePartSkimOriginalMozzarellaStringCheese MoisturePartSkimOriginalMozzarellaStringCheese PartSkimOriginalMozzarellaStringCheese SkimOriginalMozzarellaStringCheese OriginalMozzarellaStringCheese MozzarellaStringCheese StringCheese Cheese> stored<id:4171623216>>>

In [30]:
ss.search('mashed_web:m*u*e*n*s*t*r*  meunstr~')[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:PRSL MUENSTR> stored,indexed,tokenized<web:Private Selection™ Grab & Go Muenster Cheese> stored,indexed,tokenized<mashed_web:PrivateSelection™Grab&GoMuensterCheese Selection™Grab&GoMuensterCheese Grab&GoMuensterCheese &GoMuensterCheese GoMuensterCheese MuensterCheese Cheese> stored<id:20615390000>>>

In [31]:
ss.search('mashed_web:p*r*s*l* prsl~ mashed_web:m*u*e*n*s*t*r*  meunstr~')[0]

<Document: Document<stored,indexed,tokenized,indexOptions=DOCS_AND_FREQS<raw:PRSL MUENSTR> stored,indexed,tokenized<web:Private Selection™ Grab & Go Muenster Cheese> stored,indexed,tokenized<mashed_web:PrivateSelection™Grab&GoMuensterCheese Selection™Grab&GoMuensterCheese Grab&GoMuensterCheese &GoMuensterCheese GoMuensterCheese MuensterCheese Cheese> stored<id:20615390000>>>

## TODO
* mashed wildcard (matching mashed_web field with wildcard query)
* fuzzy mashed wildcard (mix in fuzzy terms to mashed wildcard)
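
Judging from the stored documents shown above, the `mashed_web` field appears to hold every suffix of the web string with the spaces removed. A minimal sketch of that construction, assuming whitespace tokenization (`mashed_suffixes` is a hypothetical name; the actual indexing code is not shown here):

```python
def mashed_suffixes(web):
    """Join the words of each suffix of the web string without spaces."""
    words = web.split()
    return ["".join(words[i:]) for i in range(len(words))]

for value in mashed_suffixes("Private Selection™ Grab & Go Muenster Cheese"):
    print(value)
```

Indexing every suffix lets a wildcard that starts at any word boundary, such as one built from `PRSL` or `MUENSTR`, match a single term that spans several words.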

In [34]:
mwqm = MashedWildQueryMaker(WORDS)

In [35]:
mashed_wild_config = Config(mwqm, ss)

In [36]:
mashed_wild_score, mashed_wild_misses = evaluate(queries, mashed_wild_config, RAW_TO_WEB)

In [37]:
mashed_wild_score

0.8695652173913043

In [38]:
mashed_wild_misses

['CA REDEM VAL',
 'ARTICHOKES',
 'CUCUMBERS',
 'KRO SOAP',
 'LES PET CHEESE BAR',
 'BROWN ONIONS',
 'PRSL MUENSTR',
 'STO CARROTS ORGNC',
 'STN CHCK RST']

### mashed_wild_eliminees
`mashed_wild_eliminees` is the set of `wild_misses` that were eliminated by using mashed wildcard queries.

Example: `BRHD CHEESE` is eliminated by mashed wildcard queries, probably because `BRHD` matches the mashed form of `Boar's Head`.

In [40]:
mashed_wild_eliminees = set(wild_misses) - set(mashed_wild_misses)
mashed_wild_eliminees

{'BRHD CHEESE', 'STO BROTH', 'STO CCNT MILK'}