## Installing Jupyter
```
pyenv activate search_eng
/home/gitpod/.pyenv/versions/3.9.7/envs/search_eng/bin/python3.9 -m pip install jupyter
```

In [2]:
import os

import pandas as pd

## Level 1: Experimenting with indexing performance

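- Indexing with the default (aka dynamic) mappings is faster: 2.01 minutes versus 2.44 minutes with the provided mappings (see the runs below). This is probably because dynamic mapping sticks to default field types and default analysis, with no custom analyzers or extra sub-fields.
- For reference, both runs index the following fields: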
```
    "sku":"sku/text()",
    "productId": "productId/text()",
    "name": "name/text()",
    "type":"type/text()",
    "shortDescription": "shortDescription/text()",
    "startDate": "startDate/text()",
```
- The provided mappings assign the data types below. The extra analyzers and sub-fields (`english`, `smarter_hyphens`, `completion`) may be what slows indexing down.
```
    "sku": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
    },
    "productId": {
        "type": "long"
    },
    "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "hyphens": {
            "type": "text",
            "analyzer": "smarter_hyphens"
          },
          "suggest": {
            "type": "completion"
          }
        }
    },
    "type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
    },
    "shortDescription": {
        "type": "text",
        "analyzer": "english"
    },
    "startDate": {
        "type": "date"
    },
```
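
One rough way to see why the provided mappings could cost more per document is to count how many separate indexed representations each field produces. The sketch below hard-codes the mapping shown above and counts the top-level field plus each entry under `fields`. The helper name `count_representations` is made up for illustration, and the count says nothing about how expensive each analyzer is, only how many times each value gets indexed.

```python
# Mapping copied from the provided-mappings snippet above (six fields).
PROVIDED = {
    "sku": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
    "productId": {"type": "long"},
    "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": 2048},
            "hyphens": {"type": "text", "analyzer": "smarter_hyphens"},
            "suggest": {"type": "completion"},
        },
    },
    "type": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
    "shortDescription": {"type": "text", "analyzer": "english"},
    "startDate": {"type": "date"},
}


def count_representations(mapping):
    """Return {field: 1 + number of multi-fields} for each top-level field."""
    return {name: 1 + len(spec.get("fields", {})) for name, spec in mapping.items()}


counts = count_representations(PROVIDED)
for field, n in counts.items():
    print(f"{field}: {n}")
print("total:", sum(counts.values()))
```

`name` stands out here because it is indexed as analyzed text, as a keyword, through the `smarter_hyphens` analyzer, and as a completion field.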

In [14]:
os.environ['HOST'] = 'localhost'
os.environ['BBUY_DATA'] = '/workspace/datasets/product_data/products/'

In [31]:
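# Set environment variables.
# Note: `!export` runs in a throwaway subshell and does not persist between commands,
# so use the %env magic instead (equivalent to the os.environ assignments above).
%env HOST=localhost
%env BBUY_DATA=/workspace/datasets/product_data/products/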

In [34]:
# Index with provided field mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products.json
!python index.py -s $BBUY_DATA

INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 2.436928406183142 minutes.  Total accumulated time spent in `bulk` indexing: 10.582057921632561 minutes


In [35]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

{"acknowledged":true}

In [36]:
# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA

{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 2.009230298833184 minutes.  Total accumulated time spent in `bulk` indexing: 7.754665959007495 minutes
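
To compare the two runs on the same footing, the `Done` log lines can be turned into documents per second. This is a small sketch that only parses the two log lines printed above; the regular expression and helper names are illustrative and not part of `index.py`.

```python
import re

# The two "Done" lines from the runs above (provided vs. default mappings).
LOGS = {
    "provided": "INFO:Done. 1275077 were indexed in 2.436928406183142 minutes.  "
                "Total accumulated time spent in `bulk` indexing: 10.582057921632561 minutes",
    "default": "INFO:Done. 1275077 were indexed in 2.009230298833184 minutes.  "
               "Total accumulated time spent in `bulk` indexing: 7.754665959007495 minutes",
}

DONE = re.compile(
    r"Done\. (\d+) were indexed in ([\d.]+) minutes\.\s+"
    r"Total accumulated time spent in `bulk` indexing: ([\d.]+) minutes"
)


def parse_done(line):
    """Extract (docs, wall_minutes, bulk_minutes) from a 'Done' log line."""
    m = DONE.search(line)
    if m is None:
        raise ValueError(f"not a Done line: {line!r}")
    return int(m.group(1)), float(m.group(2)), float(m.group(3))


def docs_per_second(line):
    """Wall-clock indexing throughput for one run."""
    docs, wall_minutes, _bulk = parse_done(line)
    return docs / (wall_minutes * 60)


rates = {name: docs_per_second(line) for name, line in LOGS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} docs/s")
print(f"default is {rates['default'] / rates['provided'] - 1:.0%} faster")
```

Each configuration was run only once, so the gap should be read as indicative rather than as a measured constant.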


## Updating refresh intervals
- Even though we're only indexing six fields, there is a difference of about 0.3 minutes between a 60s refresh interval (1.90 minutes) and a 1s refresh interval (2.19 minutes)

In [37]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --refresh_interval 60s

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of 60s to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 1.9028761528495428 minutes.  Total accumulated time spent in `bulk` indexing: 7.048447553357497 minutes


In [38]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --refresh_interval 1s

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of 1s to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 2.189135347933431 minutes.  Total accumulated time spent in `bulk` indexing: 8.229896282364885 minutes
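
The source of `index.py` isn't shown here, so how it applies `--refresh_interval` is an assumption. One plausible route is the index settings API. The sketch below builds, but does not send, a `PUT bbuy_products/_settings` request with the standard library. The helper name is hypothetical, and authentication and TLS verification (the `-u admin:admin` and `-k` used with curl above) are deliberately left out.

```python
import json
import urllib.request


def refresh_interval_request(host, index, interval):
    """Build (but do not send) a PUT <index>/_settings request.

    `interval` is a string such as "1s", "60s", or "-1" (refresh disabled).
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}:9200/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = refresh_interval_request("localhost", "bbuy_products", "60s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A longer interval means fewer refreshes while documents stream in, which is consistent with the 60s run finishing ahead of the 1s run.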


## Batch size
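- Hard to say what's going on; there is no clear trend. Batch sizes of 3200 and above were fastest (2.13 minutes at 3200, 1.89 minutes at 5000), 800 (2.28 minutes) was slightly ahead of 400 (2.32 minutes), and 1600 was slowest (2.58 minutes). For comparison, the earlier default-mappings run with a batch size of 200 took 2.01 minutes, so some of these differences may just be run-to-run noise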

In [41]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --batch_size 400

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 400 per batch.
INFO:Done. 1275077 were indexed in 2.32198421498324 minutes.  Total accumulated time spent in `bulk` indexing: 8.021191667257032 minutes


In [42]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --batch_size 800

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 800 per batch.
INFO:Done. 1275077 were indexed in 2.2801962692664044 minutes.  Total accumulated time spent in `bulk` indexing: 7.852303329568046 minutes


In [43]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --batch_size 1600

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 1600 per batch.
INFO:Done. 1275077 were indexed in 2.5825834213833634 minutes.  Total accumulated time spent in `bulk` indexing: 9.253081009297846 minutes


In [44]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --batch_size 3200

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 3200 per batch.
INFO:Done. 1275077 were indexed in 2.1275613109833404 minutes.  Total accumulated time spent in `bulk` indexing: 7.271591539253617 minutes


In [45]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --batch_size 5000

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 5000 per batch.
INFO:Done. 1275077 were indexed in 1.8943774112170406 minutes.  Total accumulated time spent in `bulk` indexing: 6.500708215699221 minutes
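
For reference, batching just means grouping documents before each `bulk` call. The following is a minimal sketch and not the actual `index.py` logic, which also caps the number of documents per file per worker. The generator chunks any iterable into lists of at most `batch_size` items, and the loop shows how many bulk requests each tested batch size would need under the simplifying assumption that all documents arrive as a single stream.

```python
import math
from itertools import islice


def batched(docs, batch_size):
    """Yield lists of at most `batch_size` items from `docs`."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


TOTAL_DOCS = 1275077  # from the "Done" log lines above

for size in (200, 400, 800, 1600, 3200, 5000):
    # Request count if the documents formed one stream (an idealisation).
    print(f"batch_size={size}: {math.ceil(TOTAL_DOCS / size)} bulk requests")

# Small demonstration of the generator itself.
print([len(b) for b in batched(range(10), 4)])
```

Larger batches cut the number of round trips, but each request also gets heavier, which may be part of why the timings above do not improve monotonically.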


## Worker count
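- More workers is better, up to a point: 4 workers was fastest (1.94 minutes), 8 was close behind (2.01 minutes), and both 2 workers (3.42 minutes) and 16 workers (3.12 minutes) were much slower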

In [46]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --workers 2

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 2 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 3.416948190666638 minutes.  Total accumulated time spent in `bulk` indexing: 3.1127306910270516 minutes


In [47]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --workers 4

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 4 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 1.9412308881500697 minutes.  Total accumulated time spent in `bulk` indexing: 3.8542129833382206 minutes


In [48]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --workers 8

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 2.008232675149823 minutes.  Total accumulated time spent in `bulk` indexing: 7.589799960780268 minutes


In [49]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with default mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products_no_map.json
!python index.py -s $BBUY_DATA --workers 16

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 16 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 200 per batch.
INFO:Done. 1275077 were indexed in 3.1152460792000056 minutes.  Total accumulated time spent in `bulk` indexing: 25.727708799493847 minutes
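
The accumulated `bulk` time exceeds wall-clock time in most runs, so it appears to be summed across workers. Dividing it by the worker count gives a rough per-worker figure to set against the wall-clock time. This sketch assumes the sum is spread evenly across workers, which the logs do not confirm; the numbers are copied from the four runs above.

```python
# (workers, wall-clock minutes, accumulated bulk minutes) from the logs above
RUNS = [
    (2, 3.416948190666638, 3.1127306910270516),
    (4, 1.9412308881500697, 3.8542129833382206),
    (8, 2.008232675149823, 7.589799960780268),
    (16, 3.1152460792000056, 25.727708799493847),
]


def per_worker_bulk(runs):
    """Return {workers: accumulated bulk minutes / workers}."""
    return {workers: bulk / workers for workers, _wall, bulk in runs}


avg = per_worker_bulk(RUNS)
fastest = min(RUNS, key=lambda run: run[1])[0]

print("workers  wall_min  bulk_min  bulk_min_per_worker")
for workers, wall, bulk in RUNS:
    print(f"{workers:>7}  {wall:8.2f}  {bulk:8.2f}  {avg[workers]:19.2f}")
print("fastest wall-clock:", fastest, "workers")
```

At 16 workers the per-worker `bulk` time rises again, which would be consistent with the workers contending for the same resources rather than adding throughput.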


## Level 2: Experimenting with query performance

In [17]:
# Delete index
!curl -k -X DELETE -u admin:admin https://localhost:9200/bbuy_products

# Index with provided field mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products.json
!python index.py -s $BBUY_DATA --refresh_interval 60s --batch_size 5000

{"acknowledged":true}{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 8 workers, refresh_interval of 60s to host localhost with a maximum number of docs sent per file per worker of 200000 and 5000 per batch.
INFO:Done. 1275077 were indexed in 12.747937300683407 minutes.  Total accumulated time spent in `bulk` indexing: 37.63850982505343 minutes


In [21]:
os.environ['QUERY_FILE'] = '/workspace/datasets/train.csv'

In [22]:
!python query.py --query_file $QUERY_FILE --max_queries 10000

INFO:Loading query file from /workspace/datasets/train.csv
INFO:Running queries, checking in every 1000 queries:
INFO:Query: Bad teacher has 10 hits.
INFO:Query: martin has 10 hits.
INFO:Query: netgear n has 10 hits.
INFO:Query: iPad case has 10 hits.
INFO:Query: digital tv converter has 10 hits.
INFO:Query: Bluetooth has 10 hits.
INFO:Query: Xbox 360 battery charger has 10 hits.
INFO:Query: Bose sound dock has 10 hits.
INFO:Query: HP touchpad has 10 hits.
INFO:Query: rechargeable batteries has 10 hits.
INFO:Finished running 10000 queries in 3.0797756292663205 minutes


In [28]:
# Change the name matching field to no longer be a fuzzy match.
!python query.py --query_file $QUERY_FILE --max_queries 10000

INFO:Loading query file from /workspace/datasets/train.csv
INFO:Running queries, checking in every 1000 queries:
INFO:Query: Bad teacher has 5 hits.
INFO:Query: martin has 10 hits.
INFO:Query: netgear n has 10 hits.
INFO:Query: iPad case has 10 hits.
INFO:Query: digital tv converter has 10 hits.
INFO:Query: Bluetooth has 10 hits.
INFO:Query: Xbox 360 battery charger has 10 hits.
INFO:Query: Bose sound dock has 10 hits.
INFO:Query: HP touchpad has 10 hits.
INFO:Query: rechargeable batteries has 10 hits.
INFO:Finished running 10000 queries in 1.583093975750186 minutes


In [29]:
# Drop the function scores
!python query.py --query_file $QUERY_FILE --max_queries 10000

INFO:Loading query file from /workspace/datasets/train.csv
INFO:Running queries, checking in every 1000 queries:
INFO:Query: Bad teacher has 5 hits.
INFO:Query: martin has 10 hits.
INFO:Query: netgear n has 10 hits.
INFO:Query: iPad case has 10 hits.
INFO:Query: digital tv converter has 10 hits.
INFO:Query: Bluetooth has 10 hits.
INFO:Query: Xbox 360 battery charger has 10 hits.
INFO:Query: Bose sound dock has 10 hits.
INFO:Query: HP touchpad has 10 hits.
INFO:Query: rechargeable batteries has 10 hits.
INFO:Finished running 10000 queries in 1.331303205349832 minutes


In [30]:
# Drop every other matching function except the multi_match
!python query.py --query_file $QUERY_FILE --max_queries 10000

INFO:Loading query file from /workspace/datasets/train.csv
INFO:Running queries, checking in every 1000 queries:
INFO:Query: Bad teacher has 5 hits.
INFO:Query: martin has 10 hits.
INFO:Query: netgear n has 10 hits.
INFO:Query: iPad case has 10 hits.
INFO:Query: digital tv converter has 10 hits.
INFO:Query: Bluetooth has 10 hits.
INFO:Query: Bose sound dock has 10 hits.
INFO:Query: HP touchpad has 10 hits.
INFO:Query: rechargeable batteries has 10 hits.
INFO:Finished running 10000 queries in 1.031394305100063 minutes


In [31]:
# Last, but not least, change the multi_match to only search the name and shortDescription field.
!python query.py --query_file $QUERY_FILE --max_queries 10000

INFO:Loading query file from /workspace/datasets/train.csv
INFO:Running queries, checking in every 1000 queries:
INFO:Query: Bad teacher has 5 hits.
INFO:Query: martin has 10 hits.
INFO:Query: netgear n has 10 hits.
INFO:Query: iPad case has 10 hits.
INFO:Query: digital tv converter has 8 hits.
INFO:Query: Bluetooth has 10 hits.
INFO:Query: HP touchpad has 10 hits.
INFO:Query: rechargeable batteries has 10 hits.
INFO:Finished running 10000 queries in 0.7386545902166594 minutes
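
To make the five query runs easier to compare, the totals can be converted into queries per second, milliseconds per query, and speedup over the first run. The timings below are copied from the logs above; the labels are shorthand for the cell comments, and the sketch assumes each change was applied on top of the previous one.

```python
QUERIES = 10000

# (change applied, minutes for 10000 queries) from the runs above, in order
STEPS = [
    ("baseline", 3.0797756292663205),
    ("no fuzzy match on name", 1.583093975750186),
    ("no function scores", 1.331303205349832),
    ("multi_match only", 1.031394305100063),
    ("multi_match on name + shortDescription", 0.7386545902166594),
]


def summarize(steps, queries=QUERIES):
    """Return rows of (label, queries/sec, ms/query, speedup vs. first row)."""
    base = steps[0][1]
    rows = []
    for label, minutes in steps:
        seconds = minutes * 60
        rows.append((label, queries / seconds, 1000 * seconds / queries, base / minutes))
    return rows


rows = summarize(STEPS)
for label, qps, ms, speedup in rows:
    print(f"{label:<40} {qps:6.1f} q/s  {ms:5.2f} ms/query  {speedup:4.2f}x")
```

The largest single gain comes from dropping the fuzzy match, though the hit counts in the logs show that some queries also return fewer results after these simplifications.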
