# Data ingestion

Download the dataset [Health_and_Personal_Care.jsonl.gz](https://drive.google.com/file/d/12N52kB4D1iqgzSuoWEfNSY3KqVRp10wL/view?usp=drive_link) and put it into the `data` directory (the path that the `DATA_DIR` environment variable points to).

In [1]:
%load_ext autoreload
%autoreload 2

import os

print(os.environ['DATA_DIR'])

root_data_dir = os.environ['DATA_DIR']
print(os.listdir(root_data_dir))

/Users/username/PycharmProjects/ml_for_products/data
['model_dockerized.cb', 'zinc_data', 'model.cb', 'Health_and_Personal_Care.jsonl.gz', 'mlflow', 'minio', 'bidmachine_task_data', 'bidmachine_logs.zip', 'downloaded_model.cb', 'item_cards.gzip', 'meta_Health_and_Personal_Care.jsonl.gz']


In [2]:
from utils import read_raw_data

file_name = 'Health_and_Personal_Care.jsonl.gz'
data_path = os.path.join(root_data_dir, file_name)

json_data = read_raw_data(data_path, limit=1000)
print(len(json_data))

.env loaded:  True
Dataset num items: 1000 from /Users/username/PycharmProjects/ml_for_products/data/Health_and_Personal_Care.jsonl.gz
1000
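
The implementation of `read_raw_data` lives in `utils` and is not shown here. As a rough sketch of what such a reader might look like, the hypothetical `read_jsonl_gz` below streams a gzip-compressed JSON Lines file and stops after `limit` lines; the function name and its behaviour are assumptions, not the author's code.

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, limit=None):
    """Read JSON records from a gzip-compressed JSON Lines file.

    Hypothetical stand-in for utils.read_raw_data. `limit` caps the number
    of lines read (blank lines count towards it); None reads the whole file.
    """
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        # islice stops iterating after `limit` lines, so the remainder
        # of the archive is never decompressed.
        return [json.loads(line) for line in islice(fh, limit) if line.strip()]
```

Reading lazily with `islice` keeps memory use proportional to `limit` rather than to the size of the full dataset.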


In [3]:
json_data[0]

{'rating': 4.0,
 'title': '12 mg is 12 on the periodic table people! Mg for magnesium',
 'text': 'This review is more to clarify someone else’s review bc they didn’t understand understand the labeling!  It shows 1000mg as advertised & another little label says 12mg bc 12 is on the periodic table for magnesium!  I realize not everyone takes chemistry, but 4 ppl liked his review & so misinformation is spreading.  This works. If however you are on opiate level medications that are causing constipation you should talk to your pain dr or your gastrointestinal dr & ask for a medication called Linzess which works must better & must faster, but is unnecessary for most people.  If magnesium is working for you just make sure to take it with food & drink 6-8 glasses of water per day.  Staying hydrated will really help.  Before switching to Linzess I used to take one 1,000 mg pill am & pm every day with meals & always with an 8 ounce glass of water or other liquid.',
 'images': [],
 'asin': 'B07TD

In [4]:
import pandas as pd

df = pd.json_normalize(json_data)

df.head(3)

| | rating | title | text | images | asin | parent_asin | user_id | timestamp | helpful_vote | verified_purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 12 mg is 12 on the periodic table people! Mg f... | This review is more to clarify someone else’s ... | [] | B07TDSJZMR | B07TDSJZMR | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | 1580950175902 | 3 | True |
| 1 | 5.0 | Save the lanet using less plastic. | Love these easy multitasking bleach tablets. B... | [] | B08637FWWF | B08637FWWF | AEVWAM3YWN5URJVJIZZ6XPD2MKIA | 1604354586880 | 3 | True |
| 2 | 5.0 | Fantastic | I have been suffering a couple months with hee... | [] | B07KJVGNN5 | B07KJVGNN5 | AHSPLDNW5OOUK2PLH7GXLACFBZNQ | 1563966838905 | 0 | True |


Dummy search: a naive, case-insensitive substring match over the review text.

In [7]:
user_query = 'cough'

df[df['text'].apply(
    lambda x: user_query in x.lower())
].head(5).iloc[0, 2]

'These seem like great quality cough drops with beneficial ingredients for when you are under the weather. A couple downsides for me are that the outside of them are rough so they cut up your mouth a bit and secondly no place saying where they are made. They have a mildly sweet and minty taste.'
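
The `apply` + `lambda` filter above raises an `AttributeError` if a review text is missing, because `None` and `NaN` have no `.lower()`. A slightly more defensive variant, sketched here on a toy DataFrame with made-up rows, uses the vectorized `Series.str.contains`; the helper name `substring_search` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame (hypothetical rows).
toy_df = pd.DataFrame(
    {'text': ['Great cough drops', 'Nice gloves', 'COUGH syrup works', None]}
)


def substring_search(frame, query, column='text', top=5):
    """Return up to `top` rows whose `column` contains `query`, ignoring case."""
    # regex=False treats the query as a literal string;
    # na=False makes missing texts count as non-matches.
    mask = frame[column].str.contains(query, case=False, regex=False, na=False)
    return frame[mask].head(top)


matches = substring_search(toy_df, 'cough')
print(matches['text'].tolist())  # ['Great cough drops', 'COUGH syrup works']
```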

# Keyword search

* build a sparse TF-IDF index over the review texts
* match documents to the query using cosine similarity
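
For reference, the two steps can be written out as follows, assuming scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 normalization). Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathbf{w}_q$, $\mathbf{w}_d$ the weight vectors of the query and a document:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
\mathrm{sim}(q, d) = \frac{\mathbf{w}_q \cdot \mathbf{w}_d}{\lVert \mathbf{w}_q \rVert \, \lVert \mathbf{w}_d \rVert}
$$

Because the vectors are L2-normalized, the cosine similarity reduces to a dot product, and a document sharing no terms with the query scores exactly 0.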


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer='word',
    lowercase=True,
    token_pattern=r'\b[\w\d]{3,}\b'
)

vectorizer.fit(df['text'].values)
print('vectorizer fitted')

vectorizer fitted


In [9]:
document_vectors = vectorizer.transform(df['text'].values)
print('Index matrix:', document_vectors.shape)

Index matrix: (1000, 6529)


In [10]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query_vector = vectorizer.transform([user_query])
print(query_vector.shape)

# matching
similarities = cosine_similarity(query_vector, document_vectors)[0]
print(similarities.shape)
top_k = 10
top_indices = np.argsort(similarities)[::-1][:top_k]
print('Top indices:', top_indices)

(1, 6529)
(1000,)
Top indices: [781 101 896 329 341 340 339 338 337 336]


In [11]:
for ind, row in df.iloc[top_indices].iterrows():
    print(ind, '%.3f' % similarities[ind], row['text'][:100], '...')
    print()

781 0.245 VERY soothing.  I get that awful tickly dry cough a lot during allergy season, especially 5 minutes  ...

101 0.221 These seem like great quality cough drops with beneficial ingredients for when you are under the wea ...

896 0.109 We have a big variety of probiotics that we rotate and use daily.  This one has a very impressive la ...

329 0.000 This massager brush cleans scalp effectively and makes your head feel great when you use it. The bri ...

341 0.000 Astragalus has many health benefits including immune support, liver cleansing, healthy skin, and str ...

340 0.000 Beet root powder has a plethora of uses. It's excellent for cooking. It enhances any chocolate desse ...

339 0.000 This is a great set of gloves. They can be used for many different tasks. You get a few pairs so thi ...

338 0.000 If you want a good facial cleansing brush, but don't want to spend $200, this is a good alternative. ...

337 0.000 I think every household needs cleaning gloves. These have the 
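
Only three documents actually contain the query term; the rest of the top 10 has a similarity of 0.000 and is there only because `argsort` has to return something. One possible way to drop such non-matches, sketched with a hypothetical helper `top_k_nonzero` and toy scores:

```python
import numpy as np


def top_k_nonzero(scores, k=10):
    """Indices of the k highest scores, best first, excluding zero scores."""
    scores = np.asarray(scores)
    # Sort descending and keep at most k candidates...
    order = np.argsort(scores)[::-1][:k]
    # ...then drop candidates that share no terms with the query.
    return order[scores[order] > 0]


toy_scores = np.array([0.0, 0.245, 0.0, 0.109, 0.221])
print(top_k_nonzero(toy_scores, k=3))  # [1 4 3]
```

In the notebook this would be applied as `top_k_nonzero(similarities, top_k)` in place of the plain `argsort` slice.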

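# Elasticsearch

The search backend used in this section is ZincSearch (note the `ZINCSEARCH_*` environment variables and the `0.4.10` version string), whose search responses follow the Elasticsearch format.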

In [12]:
from utils import request_elastic, load_config, create_index

request_elastic('version', debug=True, method='get')

http://localhost:4080/version {}


{'version': '0.4.10',
 'build': '0',
 'commit_hash': 'f8b1436487807b107659d6f444a52c9fa442d3c0',
 'branch': '0',
 'build_date': '2024-01-14T09:43:19Z'}

Document indexing

In [13]:
import json
import requests

ZINCSEARCH_URL = os.environ["ZINCSEARCH_URL"]
USERNAME=os.environ['ZINCSEARCH_USERNAME']
PASSWORD=os.environ['ZINCSEARCH_PASSWORD']

index_config = load_config()
index_name = index_config['name']
create_index(index_config)

Failed to request: 400, {"error":"index [index11] already exists"}
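
The helpers in `utils` hide how `ZINCSEARCH_URL`, `USERNAME`, and `PASSWORD` are used. A plausible sketch, assuming the server expects HTTP basic authentication, is shown below. The function `build_request` and the placeholder credentials are hypothetical; the request is only constructed, not sent.

```python
import base64
import urllib.request


def build_request(base_url, path, username, password, method='GET'):
    """Build (without sending) an HTTP request carrying a basic-auth header."""
    # Basic auth is "username:password", base64-encoded.
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
    url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
    return urllib.request.Request(
        url, method=method, headers={'Authorization': f'Basic {token}'}
    )


# Placeholder credentials for illustration only.
req = build_request('http://localhost:4080', 'version', 'admin', 'secret')
print(req.full_url)  # http://localhost:4080/version
```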


In [21]:
index_config

{'mappings': {'properties': {'category': {'aggregatable': True,
    'index': True,
    'sortable': True,
    'type': 'keyword'},
   'content': {'highlightable': True,
    'index': True,
    'store': True,
    'type': 'text'},
   'content_len': {'aggregatable': False,
    'index': True,
    'sortable': True,
    'type': 'integer'},
   'doc_id': {'highlightable': True,
    'index': True,
    'store': True,
    'type': 'text'}}},
 'name': 'index11',
 'settings': {'analysis': {'analyzer': {'default': {'type': 'standard'}}}},
 'shard_num': 1,
 'storage_type': 'disk'}

In [14]:
print(index_config['mappings']['properties'].keys())

dict_keys(['category', 'content', 'content_len', 'doc_id'])


In [15]:
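import hashlib
import time

from utils import clean_text

def eval_doc_id(seed=None, limit=10):
    # Derive a short, deterministic document id from the seed string;
    # fall back to the current Unix time when no seed is given.
    if seed is None:
        seed = str(int(time.time()))
    # Keep the first `limit` hex characters of the MD5 digest.
    return hashlib.md5(seed.encode('utf-8')).hexdigest()[:limit]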

docs = [
    {
        'category': 'health',
        'content': clean_text(i['text']),
        'content_len': len(i['text']),
        'doc_id': eval_doc_id(i['user_id']+str(i['timestamp'])),
        '_id': eval_doc_id(i['user_id']+str(i['timestamp']))
    }
    for i in json_data 
]
print(len(docs))

1000
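
Since the ids are MD5 digests truncated to 10 hex characters and derived from `user_id` + `timestamp`, two records with the same seed (or, far less likely, a truncated-hash collision) end up with the same `_id` and would presumably overwrite each other in the index. A quick sanity check could look like this; `short_id`, `find_duplicate_ids`, and the seeds are hypothetical illustrations:

```python
import hashlib
from collections import Counter


def short_id(seed, limit=10):
    """First `limit` hex characters of the MD5 digest of `seed`."""
    return hashlib.md5(seed.encode('utf-8')).hexdigest()[:limit]


def find_duplicate_ids(ids):
    """Return the ids that occur more than once, mapped to their counts."""
    return {doc_id: n for doc_id, n in Counter(ids).items() if n > 1}


# Made-up seeds in the same user_id + timestamp shape; the third repeats the first.
seeds = ['userA1580950175902', 'userB1604354586880', 'userA1580950175902']
ids = [short_id(s) for s in seeds]
print(find_duplicate_ids(ids))
```

In the notebook, `find_duplicate_ids([d['_id'] for d in docs])` returning an empty dict would confirm that all 1000 ids are unique.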


In [16]:
from utils import load_document, load_bulk_documents, search, pretty

load_document(docs[0], index_name)

{'message': 'ok',
 'id': 'b48b7da01e',
 '_id': 'b48b7da01e',
 '_index': 'index11',
 '_version': 1,
 '_seq_no': 0,
 '_primary_term': 0,
 'result': 'created'}

In [17]:
search(index_name, 'cough')

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 3},
  'max_score': 3.800283804728009,
  'hits': [{'_index': 'index11',
    '_type': '_doc',
    '_id': '9588e668a0',
    '_score': 3.800283804728009,
    '@timestamp': '2024-11-10T19:25:44.530689536Z',
    '_source': {'@timestamp': '2024-11-10T19:25:44.530689536Z',
     '_id': '9588e668a0',
     'category': 'health',
     'content': 'VERY soothing.  I get that awful tickly dry cough a lot during allergy season, especially 5 minutes after lying down to sleep, and this actually quiets it down with zero side effects.  I just wish it came in a purse-size.',
     'content_len': 221,
     'doc_id': '9588e668a0'}},
   {'_index': 'index11',
    '_type': '_doc',
    '_id': '35185fb3a1',
    '_score': 3.463888525405098,
    '@timestamp': '2024-11-10T19:25:40.982337536Z',
    '_source': {'@timestamp': '2024-11-10T19:25:40.982337536Z',
     '_id': '35185fb3a1',
     '

Build the search index

In [18]:
load_bulk_documents(index_name, docs)

Document loaded successfully.
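
How `load_bulk_documents` talks to the server is not shown. If the backend accepts the Elasticsearch-style bulk format (an assumption here), the request body is newline-delimited JSON: one action line followed by one source line per document. The hypothetical helper below only builds that payload:

```python
import json


def to_bulk_ndjson(index_name, docs):
    """Serialise docs as NDJSON: an action line, then a source line, per doc."""
    lines = []
    for doc in docs:
        # The id travels in the action line, not in the document body.
        source = {k: v for k, v in doc.items() if k != '_id'}
        action = {'index': {'_index': index_name}}
        if '_id' in doc:
            action['index']['_id'] = doc['_id']
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format expects a newline after the last line as well.
    return '\n'.join(lines) + '\n'


payload = to_bulk_ndjson('index11', [{'_id': 'b48b7da01e', 'category': 'health'}])
print(payload)
```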


In [19]:
res = search(index_name, 'cough')
res

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 3},
  'max_score': 4.535997886006282,
  'hits': [{'_index': 'index11',
    '_type': '_doc',
    '_id': '9588e668a0',
    '_score': 4.535997886006282,
    '@timestamp': '2024-11-12T17:04:04.335334656Z',
    '_source': {'@timestamp': '2024-11-12T17:04:04.335334656Z',
     '_id': '9588e668a0',
     'category': 'health',
     'content': 'VERY soothing.  I get that awful tickly dry cough a lot during allergy season, especially 5 minutes after lying down to sleep, and this actually quiets it down with zero side effects.  I just wish it came in a purse-size.',
     'content_len': 221,
     'doc_id': '9588e668a0'}},
   {'_index': 'index11',
    '_type': '_doc',
    '_id': '35185fb3a1',
    '_score': 4.320778193388964,
    '@timestamp': '2024-11-12T17:03:54.119958784Z',
    '_source': {'@timestamp': '2024-11-12T17:03:54.119958784Z',
     '_id': '35185fb3a1',
     '

In [20]:
pretty(res['hits']['hits'])

[{'content': 'VERY soothing.  I get that awful tickly dry cough a lot during allergy season, especially 5 minutes after lying down to sleep, and this actually quiets it down with zero side effects.  I just wish it came in a purse-size.'},
 {'content': 'These seem like great quality cough drops with beneficial ingredients for when you are under the weather. A couple downsides for me are that the outside of them are rough so they cut up your mouth a bit and secondly no place saying where they are made. They have a mildly sweet and minty taste.'},
 {'content': "We have a big variety of probiotics that we rotate and use daily.  This one has a very impressive label.  Now, I have no way to test any of these claims, so I just have to believe them.  This one gives more capsules for less money than many others, all-the-while giving lots of CFU's.<br /><br />Not long before the Covid19 outbreak, I had to take another round of antibiotics.  Since this wipes out all of the good bacteria in my inte
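
`pretty` keeps only the `content` field of each hit. When comparing rankings it can help to keep the id and score as well. The sketch below is a hypothetical alternative (`summarize_hits` is not part of `utils`) that flattens a response of the shape shown above into short rows:

```python
def summarize_hits(response, snippet_len=80):
    """Flatten an Elasticsearch-style response into (id, score, snippet) rows."""
    rows = []
    for hit in response.get('hits', {}).get('hits', []):
        content = hit.get('_source', {}).get('content', '')
        rows.append(
            (hit.get('_id'), round(hit.get('_score', 0.0), 3), content[:snippet_len])
        )
    return rows


# Trimmed-down response in the same shape as the search output above.
toy_response = {'hits': {'hits': [
    {'_id': '9588e668a0', '_score': 4.535997886006282,
     '_source': {'content': 'VERY soothing.  I get that awful tickly dry cough a lot'}},
]}}
print(summarize_hits(toy_response))
```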

# Embeddings search

GPU-powered [embeddings evaluation](https://colab.research.google.com/drive/1avq9WrUSOwsfUUXZZhgyNxKNe62kG-fk?usp=sharing)