## Starting the environment

```bash
# In one terminal
cd docker
docker-compose -f docker-compose-w2.yml start
```

Then, in another terminal:

```bash
cd docker-grafana
./install-plugin.sh
docker compose -f monitoring.yml start
```
Then, add port 3000 to your VS Code port forwarding.

Finally, open this URL: [http://localhost:3000/d/opensearch/opensearch-prometheus](http://localhost:3000/d/opensearch/opensearch-prometheus)
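
If the dashboard does not load, it can help to first confirm that something is actually listening on the forwarded port. The snippet below is a minimal sketch of such a check; the helper name `port_is_open` is hypothetical, and it only tests that a TCP connection can be opened, not that Grafana itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port 3000 is the Grafana port forwarded in the step above.
    print("port 3000 reachable:", port_is_open("localhost", 3000))
```

A `False` result usually points at the port forwarding or at the monitoring containers not running, rather than at the dashboard URL.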

In [2]:
import os

In [8]:
# Set environment variables
os.environ['HOST'] = 'localhost'
os.environ['BBUY_DATA'] = '/workspace/datasets/product_data/products/'
os.environ['BBUY_QUERIES'] = '/workspace/datasets/'
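
Because `BBUY_QUERIES` ends with a slash, the later `$BBUY_QUERIES/train.csv` expands to `/workspace/datasets//train.csv`, which is what the query logs show. The double slash is harmless on Linux, but when a path is built in Python, `os.path.join` avoids it. A minimal sketch, with the directory value copied from the cell above:

```python
import os

# Same value as BBUY_QUERIES above, including the trailing slash.
queries_dir = "/workspace/datasets/"

# os.path.join does not add a second separator when one is already present.
train_csv = os.path.join(queries_dir, "train.csv")
print(train_csv)  # /workspace/datasets/train.csv
```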

In [5]:
# Index with provided field mappings
!curl -k -X PUT -u admin:admin "https://$HOST:9200/bbuy_products" -H 'Content-Type: application/json' -d @/workspace/search_engineering/week1/bbuy_products.json

{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}
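
When this step is scripted rather than run by hand, the response body can be checked instead of read by eye. The following is a sketch only: `index_created` is a hypothetical helper, and the response text is the one printed above.

```python
import json

# Response body copied from the index-creation call above.
RESPONSE = '{"acknowledged":true,"shards_acknowledged":true,"index":"bbuy_products"}'


def index_created(response_text: str, expected_index: str) -> bool:
    """Return True if the response acknowledges creation of expected_index."""
    body = json.loads(response_text)
    return (
        body.get("acknowledged") is True
        and body.get("shards_acknowledged") is True
        and body.get("index") == expected_index
    )


print(index_created(RESPONSE, "bbuy_products"))  # True
```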

## First indexing of the BB product data

In [6]:
# Index with 16 worker threads and 500 batch size
!python index.py -s $BBUY_DATA -w 16 -b 500

INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 16 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 200000 and 500 per batch.
INFO:Done. 1275077 were indexed in 21.870381166016646 minutes.  Total accumulated time spent in `bulk` indexing: 257.9378352573321 minutes


Indexing took about 22 minutes, or just under 1,000 docs per second (1,275,077 docs in 21.87 minutes, roughly 970 docs/s).

![](assets/L1-first-indexing.png)
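
The throughput figure can be derived directly from the `Done.` log line. The sketch below parses that line and also estimates how much of the wall-clock time each worker spent in `bulk` calls. It assumes that the accumulated `bulk` time is a sum over the 16 workers and that the workers ran concurrently; the log does not confirm either. The regular expression and the `summarize` helper are illustrative, not part of `index.py`.

```python
import re

# Log line copied verbatim from the indexing run above.
LOG_LINE = (
    "INFO:Done. 1275077 were indexed in 21.870381166016646 minutes.  "
    "Total accumulated time spent in `bulk` indexing: 257.9378352573321 minutes"
)

# Hypothetical pattern for the summary line; adjust if the log format changes.
DONE_PATTERN = re.compile(
    r"Done\. (?P<docs>\d+) were indexed in (?P<wall>[\d.]+) minutes\.\s+"
    r"Total accumulated time spent in `bulk` indexing: (?P<bulk>[\d.]+) minutes"
)


def summarize(log_line: str, workers: int) -> dict:
    """Extract document count and timings, then derive throughput figures."""
    match = DONE_PATTERN.search(log_line)
    if match is None:
        raise ValueError("not an indexing summary line")
    docs = int(match["docs"])
    wall_minutes = float(match["wall"])
    bulk_minutes = float(match["bulk"])
    # Assumption: bulk time is summed across workers that ran in parallel.
    bulk_per_worker = bulk_minutes / workers
    return {
        "docs": docs,
        "docs_per_second": docs / (wall_minutes * 60),
        "bulk_minutes_per_worker": bulk_per_worker,
        "bulk_share_of_wall_time": bulk_per_worker / wall_minutes,
    }


summary = summarize(LOG_LINE, workers=16)
print(f"{summary['docs_per_second']:.0f} docs/s")
print(f"{summary['bulk_share_of_wall_time']:.0%} of wall time in bulk calls per worker")
```

Under those assumptions, each worker spent roughly three quarters of the run waiting on `bulk` requests, which suggests the cluster rather than the client was the limiting side.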

## Reindexing the same content

In [None]:
# Index with 16 worker threads and 500 batch size (and a limit of 1,000,000 documents per file per worker)
!python index.py -s $BBUY_DATA -w 16 -b 500 -m 1000000

## Testing query load

In [14]:
!python ./query.py -q $BBUY_QUERIES/train.csv -w 4 -m 10000

INFO:Loading query file from /workspace/datasets//train.csv and using seed 42 for worker: 0
INFO:Loading query file from /workspace/datasets//train.csv and using seed 84 for worker: 1
INFO:Loading query file from /workspace/datasets//train.csv and using seed 126 for worker: 2
INFO:Loading query file from /workspace/datasets//train.csv and using seed 168 for worker: 3
^C

Caught SIGINT. Shutting down workers...



Querying ran at approximately 35 queries/second.

![](assets/L1-querying.png)

The bottleneck appears to be CPU, which stayed at or near 100% for most of the run.

![](assets/L1-cpu-usage.png)

## Benchmarking with double CPU (from 1 to 2 CPUs)

In [12]:
# Index with 16 worker threads and 500 batch size, limiting to 1,000 documents
!python index.py -s $BBUY_DATA -w 16 -b 500 -m 1000

INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 16 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 100 and 500 per batch.
^C

Caught SIGINT. Shutting down workers...



In [15]:
!python ./query.py -q $BBUY_QUERIES/train.csv -w 4 -m 10000

INFO:Loading query file from /workspace/datasets//train.csv and using seed 42 for worker: 0
INFO:Loading query file from /workspace/datasets//train.csv and using seed 84 for worker: 1
INFO:Loading query file from /workspace/datasets//train.csv and using seed 126 for worker: 2
INFO:Loading query file from /workspace/datasets//train.csv and using seed 168 for worker: 3
INFO:WN: 3: Running queries, checking in every 1000 queries:
INFO:WN: 3: Query: Canon DSLR 400D has 10 hits.
INFO:WN: 3: First result: {'_index': 'bbuy_products', '_id': '1980124', '_score': 13.078028, '_source': {'sku': ['1980124'], 'productId': ['1218304066943'], 'name': ['Canon - EOS Rebel T3i 18.0-Megapixel DSLR Camera with 18-55mm Lens - Black'], 'type': ['HardGood'], 'shortDescription': ['Vari-angle 3.0-inch Clear View LCD screen1080 full HD video3.7 FPS (frames per second)ISO 100-6400 (expandable to 12800)'], 'startDate': ['2011-02-07'], 'active': ['true'], 'regularPrice': ['799.99'], 'salePrice': ['738.99'], 'short

### Impact on indexing

Doubling CPU did not seem to affect indexing, which stayed at about 1,000 docs per second.

![](assets/L2-cpu2-indexing.png)

Looking more closely at resource usage during indexing, CPU usage was not maxed out, but memory usage was maxed at 2GB. Thus, during indexing, the bottleneck was memory.



### Impact on querying

On the other hand, doubling CPU roughly doubled query throughput, from about 35 to 70-80 queries/second. Thus, CPU is the main bottleneck for querying.

![](assets/L2-cpu2-querying.png)
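
To make the comparison concrete, the speedup and scaling efficiency can be computed from the two measurements. This is a small sketch using the approximate figures reported in this document (35 queries/second before, 70-80 after); the function names are illustrative.

```python
def speedup(baseline_qps: float, new_qps: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_qps / baseline_qps


def scaling_efficiency(baseline_qps: float, new_qps: float, resource_factor: float) -> float:
    """Speedup divided by the factor by which the resource was increased."""
    return speedup(baseline_qps, new_qps) / resource_factor


BASELINE_QPS = 35.0             # approximate rate with 1 CPU
DOUBLED_CPU_QPS = (70.0, 80.0)  # approximate range with 2 CPUs
CPU_FACTOR = 2.0                # CPU was doubled

for qps in DOUBLED_CPU_QPS:
    print(
        f"{qps:.0f} qps: speedup {speedup(BASELINE_QPS, qps):.2f}x, "
        f"efficiency {scaling_efficiency(BASELINE_QPS, qps, CPU_FACTOR):.2f}"
    )
```

An efficiency near 1.0 means throughput scaled about linearly with CPU, which is consistent with querying being CPU-bound. Values slightly above 1.0 are within the noise of these rough readings.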

## Benchmarking with double RAM (from 2GB to 4GB)

In [16]:
# Index with 16 worker threads and 500 batch size, limiting to 1,000 documents
!python index.py -s $BBUY_DATA -w 16 -b 500 -m 1000

INFO:Indexing /workspace/datasets/product_data/products/ to bbuy_products with 16 workers, refresh_interval of -1 to host localhost with a maximum number of docs sent per file per worker of 1000 and 500 per batch.
