# Data ingestion

Download the dataset [Health_and_Personal_Care.jsonl.gz](https://drive.google.com/file/d/12N52kB4D1iqgzSuoWEfNSY3KqVRp10wL/view?usp=drive_link) and put it in the `data` directory.

In [1]:
# %load_ext autoreload
# %autoreload 2

import os
import sys

run_env = os.getenv('RUN_ENV', 'COLLAB')
if run_env == 'COLLAB':
  from google.colab import drive
  ROOT_DIR = '/content/drive'
  drive.mount(ROOT_DIR)
  print('Google drive connected')
  root_data_dir = os.path.join(ROOT_DIR, 'MyDrive', 'ml_course_data')
  sys.path.append(os.path.join(ROOT_DIR, 'MyDrive', 'src'))
else:
  root_data_dir = os.getenv('DATA_DIR', '/srv/data')

print(os.listdir(root_data_dir))

['messages.db', 'labeled_data_corpus.csv', 'nltk_data', 'pipelines-data', 'zinc_data', 'nltk-data', 'Health_and_Personal_Care.jsonl.gz', 'mlflow', 'models', 'minio', 'final_dataset.zip', 'logs', 'ocr_dataset.zip', 'scored_corpus.csv', 'brand_tweets.csv', 'brand_tweets_valid.csv']


In [2]:
from utils import read_raw_data

file_name = 'Health_and_Personal_Care.jsonl.gz'
data_path = os.path.join(root_data_dir, file_name)

json_data = read_raw_data(data_path, limit=1000)
print(len(json_data))

Dataset num items: 1000 from /Users/adzhumurat/PycharmProjects/ai_product_engineer/data/Health_and_Personal_Care.jsonl.gz
1000
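
`read_raw_data` is imported from the course's `utils` module and its source is not shown here. As a rough sketch of what such a reader could look like, the hypothetical `read_jsonl_gz` below reads a gzip-compressed JSON Lines file (one JSON object per line) and optionally stops after `limit` lines. The name and behavior are assumptions, not the actual implementation.

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, limit=None):
    """Read up to `limit` lines of a .jsonl.gz file into a list of dicts.

    Hypothetical stand-in for utils.read_raw_data; limit=None reads everything.
    """
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        # islice with stop=None iterates to the end of the file
        lines = islice(fh, limit)
        # Blank lines are skipped, so fewer than `limit` items may be returned
        return [json.loads(line) for line in lines if line.strip()]
```

Reading line by line keeps memory use proportional to the number of parsed records rather than to the size of the decompressed file.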


In [3]:
json_data[0]

{'rating': 4.0,
 'title': '12 mg is 12 on the periodic table people! Mg for magnesium',
 'text': 'This review is more to clarify someone else’s review bc they didn’t understand understand the labeling!  It shows 1000mg as advertised & another little label says 12mg bc 12 is on the periodic table for magnesium!  I realize not everyone takes chemistry, but 4 ppl liked his review & so misinformation is spreading.  This works. If however you are on opiate level medications that are causing constipation you should talk to your pain dr or your gastrointestinal dr & ask for a medication called Linzess which works must better & must faster, but is unnecessary for most people.  If magnesium is working for you just make sure to take it with food & drink 6-8 glasses of water per day.  Staying hydrated will really help.  Before switching to Linzess I used to take one 1,000 mg pill am & pm every day with meals & always with an 8 ounce glass of water or other liquid.',
 'images': [],
 'asin': 'B07TD

In [4]:
import pandas as pd

df = pd.json_normalize(json_data)

df.head(3)

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,4.0,12 mg is 12 on the periodic table people! Mg f...,This review is more to clarify someone else’s ...,[],B07TDSJZMR,B07TDSJZMR,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1580950175902,3,True
1,5.0,Save the lanet using less plastic.,Love these easy multitasking bleach tablets. B...,[],B08637FWWF,B08637FWWF,AEVWAM3YWN5URJVJIZZ6XPD2MKIA,1604354586880,3,True
2,5.0,Fantastic,I have been suffering a couple months with hee...,[],B07KJVGNN5,B07KJVGNN5,AHSPLDNW5OOUK2PLH7GXLACFBZNQ,1563966838905,0,True


Naive substring search

In [5]:
user_query = 'cough'

df[df['text'].apply(
    lambda x: user_query in x.lower())
].head(5).iloc[0, 2]

'These seem like great quality cough drops with beneficial ingredients for when you are under the weather. A couple downsides for me are that the outside of them are rough so they cut up your mouth a bit and secondly no place saying where they are made. They have a mildly sweet and minty taste.'

# Keyword search

* build a sparse TF-IDF index
* match documents to the query using cosine similarity
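
For reference, these two steps can be written out as formulas. This assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 normalization), which the cells below do not override. Here $n$ is the number of documents, $\mathrm{tf}(t,d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \big(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\big)^2}}
$$

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert_2 \, \lVert d \rVert_2}
$$

Because every row $w_{\cdot,d}$ already has unit L2 norm, the cosine similarity between a query vector and a document vector reduces to their dot product.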


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer='word',
    lowercase=True,
    token_pattern=r'\b[\w\d]{3,}\b'
)

vectorizer.fit(df['text'].values)
print('vectorizer fitted')

vectorizer fitted


In [7]:
document_vectors = vectorizer.transform(df['text'].values)
print('Index matrix:', document_vectors.shape)

Index matrix: (1000, 6529)


In [8]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query_vector = vectorizer.transform([user_query])
print(query_vector.shape)

# matching
similarities = cosine_similarity(query_vector, document_vectors)[0]
print(similarities.shape)
top_k = 10
top_indices = np.argsort(similarities)[::-1][:top_k]
print('Top indices:', top_indices)

(1, 6529)
(1000,)
Top indices: [781 101 896 329 341 340 339 338 337 336]


In [9]:
for ind, row in df.iloc[top_indices].iterrows():
    print(ind, '%.3f' % similarities[ind], row['text'][:100], '...')
    print()

781 0.245 VERY soothing.  I get that awful tickly dry cough a lot during allergy season, especially 5 minutes  ...

101 0.221 These seem like great quality cough drops with beneficial ingredients for when you are under the wea ...

896 0.109 We have a big variety of probiotics that we rotate and use daily.  This one has a very impressive la ...

329 0.000 This massager brush cleans scalp effectively and makes your head feel great when you use it. The bri ...

341 0.000 Astragalus has many health benefits including immune support, liver cleansing, healthy skin, and str ...

340 0.000 Beet root powder has a plethora of uses. It's excellent for cooking. It enhances any chocolate desse ...

339 0.000 This is a great set of gloves. They can be used for many different tasks. You get a few pairs so thi ...

338 0.000 If you want a good facial cleansing brush, but don't want to spend $200, this is a good alternative. ...

337 0.000 I think every household needs cleaning gloves. These have the 
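
In the output above only the first three results have a non-zero score. The remaining rows share no indexed term with the query and appear only because `argsort` always returns `top_k` positions. A minimal sketch of one way to drop them, assuming a zero score should mean "no match" (`top_k_nonzero` is a hypothetical helper, not part of the notebook):

```python
import numpy as np


def top_k_nonzero(similarities, top_k=10):
    """Return indices of the top_k highest scores, skipping zero scores."""
    # Sort descending and keep at most top_k candidates
    order = np.argsort(similarities)[::-1][:top_k]
    # Keep only documents that share at least one term with the query
    return [int(i) for i in order if similarities[i] > 0]


scores = np.array([0.0, 0.245, 0.0, 0.109, 0.221])
print(top_k_nonzero(scores, top_k=4))
```

With this filter a query that matches nothing returns an empty list instead of ten arbitrary documents.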

# Elasticsearch

This section will not work in Colab.

To run it locally, first start ZincSearch:

```shell
run-search
```

In [10]:
from utils import get_auth

auth_config = get_auth(env_path=os.environ['ENV_PATH'])
print(auth_config)

ENV loaded from /Users/adzhumurat/PycharmProjects/ai_product_engineer/.env: True
AuthConfig(ZINCSEARCH_URL='http://localhost:4080', USERNAME='admin', PASSWORD='admin')


In [11]:
from utils import request_elastic, load_config, create_index

request_elastic('version', debug=True, method='get', conf=auth_config)

http://localhost:4080/version {}


{'version': '0.4.10',
 'build': '0',
 'commit_hash': 'f8b1436487807b107659d6f444a52c9fa442d3c0',
 'branch': '0',
 'build_date': '2024-01-14T09:43:19Z'}

Indexing documents

In [12]:
index_config = load_config()
index_name = index_config['name']
create_index(index_config, conf=auth_config)
print(index_name, index_config['mappings']['properties'].keys())

Failed to request: 400, {"error":"index [index12] already exists"}
index12 dict_keys(['category', 'content', 'asin', 'parent_asin', 'content_len', 'doc_id'])


Loading full dataset

In [13]:

json_data = read_raw_data(data_path)
print(len(json_data))

Dataset num items: 494121 from /Users/adzhumurat/PycharmProjects/ai_product_engineer/data/Health_and_Personal_Care.jsonl.gz
494121


In [14]:
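import hashlib
import time

from utils import clean_text

def eval_doc_id(seed=None, limit=10):
    # Short deterministic ID: the first `limit` hex characters of the MD5 of `seed`.
    # Without a seed, fall back to the current Unix time in seconds.
    if seed is None:
        seed = str(int(time.time()))
    return hashlib.md5(seed.encode('utf-8')).hexdigest()[:limit]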

docs = [
    {
        'category': 'health',
        'content': clean_text(i['text']),
        'asin': i['asin'],
        'parent_asin': i['parent_asin'],
        'content_len': len(i['text']),
        'doc_id': eval_doc_id(i['user_id']+str(i['timestamp'])),
        '_id': eval_doc_id(i['user_id']+str(i['timestamp']))
    }
    for i in json_data
]
print(len(docs), docs[0])

494121
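
`eval_doc_id` keeps only 10 hex characters (40 bits) of the MD5 digest, so with 494121 documents an ID collision is plausible. The estimate below uses the standard birthday approximation and assumes every seed (`user_id` plus `timestamp`) is distinct, which the notebook does not verify. `collision_probability` is a hypothetical helper introduced for this illustration.

```python
import math


def collision_probability(n_items, hex_chars):
    """Approximate chance that at least two of n_items random IDs collide."""
    space = 16 ** hex_chars                 # number of possible IDs
    pairs = n_items * (n_items - 1) / 2     # number of distinct ID pairs
    # Birthday approximation: P(collision) ~ 1 - exp(-pairs / space)
    return 1.0 - math.exp(-pairs / space)


print(f'{collision_probability(494121, 10):.3f}')
print(f'{collision_probability(494121, 16):.2e}')
```

Under these assumptions the chance of at least one collision with 10 characters is roughly one in ten, while 16 characters make it negligible. Passing a larger `limit` is a cheap safeguard if unique IDs matter.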


In [15]:
from utils import load_document, load_bulk_documents, search, pretty

load_document(docs[0], index_name, conf=auth_config)

{'message': 'ok',
 'id': 'b48b7da01e',
 '_id': 'b48b7da01e',
 '_index': 'index12',
 '_version': 1,
 '_seq_no': 0,
 '_primary_term': 0,
 'result': 'created'}

In [16]:
search(index_name, 'cough', conf=auth_config)

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 0}, 'max_score': 0, 'hits': []}}

Build the search index

In [17]:
load_bulk_documents(index_name, docs, conf=auth_config)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
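
The `ConnectionError` above may be related to sending all 494121 documents in a single call, although the notebook does not confirm the cause. One common workaround is to send the documents in smaller batches. The sketch below is an assumption, not the author's fix: `iter_batches` and the batch size of 5000 are hypothetical, and whether `load_bulk_documents` already batches internally is not shown.

```python
def iter_batches(items, batch_size=5000):
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the notebook's objects:
# for batch in iter_batches(docs):
#     load_bulk_documents(index_name, batch, conf=auth_config)

print([len(b) for b in iter_batches(list(range(12)), batch_size=5)])
```

Smaller requests also make it possible to retry a single failed batch instead of restarting the whole load.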

In [19]:
res = search(index_name, 'cough', conf=auth_config)
res

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 43},
  'max_score': 8.755762607332912,
  'hits': [{'_index': 'index12',
    '_type': '_doc',
    '_id': '1d52dd497e',
    '_score': 8.755762607332912,
    '@timestamp': '2025-10-21T12:41:00.384955648Z',
    '_source': {'@timestamp': '2025-10-21T12:41:00.384955648Z',
     '_id': '1d52dd497e',
     'asin': 'B003UWD92Q',
     'category': 'health',
     'content': "As a child, Smith Bros. Black Cough Drops were a real treat for me. These have been very hard to find in stores. Recently, the brand was purchased be a new company and maybe they are making a comeback. I've noticed Smith Bros. Cough Drops in bags recently, but alas, not the black licorice ones. I believe that was the original flavor way back when. Since I'm a life-time licorice lover these are just what I want! Will they cure a cough? I really don't know. Will they relieve a cough? Possibly, same as

In [20]:
pretty(res['hits']['hits'])

[{'content': 'As a child, Smith Bros. Black Cough Drops were a real treat for me. These have been very hard to find in stores. Recently, the brand was purchased be ...'},
 {'content': "I had a bad persisting cough that wouldn't go away. Tried my prescription syrup which only made me sleepy for a week and it didn't improve. Tried taki..."},
 {'content': "This pack was totally sweet! I hate shopping for supplements in those lame mall-supplement stores (cough, cough GNC). They're really not oriented at a..."},
 {'content': 'I love herbs the herbs that god created in this earth.....The herbs are the miracle to our cure.........Forget going to the doctor everytime you have ...'},
 {'content': 'Although it is expensive, it is really big in Europe.  A waitress in Switzerland put us onto it when wife had really bad cough in Lucerne. She said sh...'},
 {'content': 'They also help with cough...'},
 {'content': 'Seems to really help when I have a cough....'},
 {'content': 'Very good quality,and h
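
`pretty` is imported from `utils` and its source is not shown. Judging only by the response structure printed earlier (`hits.hits[*]._source.content`, `_id`, `_score`), a comparable summary could be produced as sketched below. `summarize_hits` is a hypothetical helper, and the 150-character cut-off is an assumption.

```python
def summarize_hits(response, max_chars=150):
    """Extract id, score and a short content snippet from a search response."""
    rows = []
    for hit in response['hits']['hits']:
        content = hit['_source']['content']
        rows.append({
            'id': hit['_id'],
            'score': hit['_score'],
            # Truncate long reviews so the listing stays readable
            'snippet': content[:max_chars] + '...',
        })
    return rows


example = {'hits': {'hits': [
    {'_id': 'abc', '_score': 1.5, '_source': {'content': 'They also help with cough'}},
]}}
print(summarize_hits(example))
```

Keeping the score next to each snippet makes it easier to compare this ranking with the TF-IDF results from the keyword search section.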

# Embeddings search

Embeddings are computed on a GPU in a separate Colab notebook: [embeddings evaluation](https://colab.research.google.com/drive/1avq9WrUSOwsfUUXZZhgyNxKNe62kG-fk?usp=sharing)