# Data ingestion

Download the dataset [Health_and_Personal_Care.jsonl.gz](https://drive.google.com/file/d/12N52kB4D1iqgzSuoWEfNSY3KqVRp10wL/view?usp=drive_link).

Put it into the `data` directory.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys

run_env = os.getenv('RUN_ENV', 'COLLAB')
if run_env == 'COLLAB':
  from google.colab import drive
  ROOT_DIR = '/content/drive'
  drive.mount(ROOT_DIR)
  print('Google drive connected')
  root_data_dir = os.path.join(ROOT_DIR, 'MyDrive', 'ml_course_data')
  sys.path.append(os.path.join(ROOT_DIR, 'MyDrive', 'src'))
else:
  root_data_dir = os.getenv('DATA_DIR', '/srv/data')

print(os.listdir(root_data_dir))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google drive connected
['nyt-ingredients-snapshot-2015.csv', 'insurance.csv', 'non_linear.csv', 'client_segmentation.csv', 'eigen.pkl', 'clustering.pkl', 'boosting_toy_dataset.csv', 'politic_meme.jpg', 'gray_goose.jpg', 'memes', 'optimal_push_time', 'sklearn_data', 'my_little_recsys', 'corpora', 'logs', 'nltk_data', 'recsys_data', 'MNIST', 'hymenoptera_data', 'pet_projects', 'ocr_dataset_sample.csv', 'geo_points.csv.gzip', 'scored_corpus.csv', 'labeled_data_corpus.csv', 'memes_stat_dataset.zip', 'als_model.pkl', 'raw_data.zip', 'json_views.tar.gz', 'sales_timeseries_dataset.csv.gz', 'brand_tweets_valid.csv', 'brand_tweets.csv', 'Health_and_Personal_Care.jsonl.gz', 'models']


In [9]:
from utils import read_raw_data

file_name = 'Health_and_Personal_Care.jsonl.gz'
data_path = os.path.join(root_data_dir, file_name)

json_data = read_raw_data(data_path, limit=1000)
print(len(json_data))

Dataset num items: 1000 from /content/drive/MyDrive/ml_course_data/Health_and_Personal_Care.jsonl.gz
1000
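
`read_raw_data` comes from the course's `utils` module, and its source is not shown here. As a rough idea of what such a reader might look like, here is a minimal sketch. It assumes the file is gzip-compressed JSON Lines (one JSON object per line) and that `limit` caps how many lines are read. The name `read_jsonl_gz` is hypothetical and the real helper may behave differently.

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, limit=None):
    """Read up to `limit` lines from a gzip-compressed JSON Lines file.

    Hypothetical stand-in for `utils.read_raw_data`; the real helper may differ.
    """
    with gzip.open(path, mode='rt', encoding='utf-8') as fh:
        # islice stops early, so the archive is not fully decompressed
        # when only the first `limit` lines are needed (None reads everything).
        lines = islice(fh, limit)
        # Blank lines are skipped rather than parsed.
        return [json.loads(line) for line in lines if line.strip()]
```

Reading with a small `limit` first, as the cell above does, keeps iteration fast while the rest of the pipeline is being developed.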


In [None]:
json_data[0]

In [10]:
import pandas as pd

df = pd.json_normalize(json_data)

df.head(3)

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,4.0,12 mg is 12 on the periodic table people! Mg f...,This review is more to clarify someone else’s ...,[],B07TDSJZMR,B07TDSJZMR,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,1580950175902,3,True
1,5.0,Save the lanet using less plastic.,Love these easy multitasking bleach tablets. B...,[],B08637FWWF,B08637FWWF,AEVWAM3YWN5URJVJIZZ6XPD2MKIA,1604354586880,3,True
2,5.0,Fantastic,I have been suffering a couple months with hee...,[],B07KJVGNN5,B07KJVGNN5,AHSPLDNW5OOUK2PLH7GXLACFBZNQ,1563966838905,0,True


Dummy search: a plain case-insensitive substring match, with no ranking.

In [11]:
user_query = 'cough'

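# Case-insensitive substring match; show the text of the first matching review
# (column 2 is 'text')
df[df['text'].apply(
    lambda x: user_query in x.lower())
].head(5).iloc[0, 2]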

'These seem like great quality cough drops with beneficial ingredients for when you are under the weather. A couple downsides for me are that the outside of them are rough so they cut up your mouth a bit and secondly no place saying where they are made. They have a mildly sweet and minty taste.'
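
The cell above raises an `IndexError` when nothing matches, and an `AttributeError` if a review text is missing. A slightly more defensive version of the same idea is sketched below on plain dictionaries. The function name `substring_search` and the toy reviews are made up for illustration.

```python
from itertools import islice


def substring_search(records, query, field='text', limit=5):
    """Return up to `limit` records whose `field` contains `query`, ignoring case."""
    needle = query.lower()
    matching = (
        record for record in records
        # `or ''` tolerates a missing or None text field
        if needle in (record.get(field) or '').lower()
    )
    return list(islice(matching, limit))


# Toy data, not taken from the dataset
toy_reviews = [
    {'text': 'Great cough drops, mildly sweet.'},
    {'text': 'Soft gloves for cleaning.'},
    {'text': None},
    {'text': 'COUGH syrup that actually works.'},
]

hits = substring_search(toy_reviews, 'cough')
first_text = hits[0]['text'] if hits else None  # safe when nothing matches
print(len(hits), first_text)
```

With pandas, `df['text'].str.contains(user_query, case=False, regex=False)` builds the same kind of boolean mask without a Python-level `apply`. Either way the result is unranked: every match counts equally, which is what the keyword search below improves on.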

# Keyword search

* build a sparse TF-IDF index over the review texts
* match documents against the query using cosine similarity
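
To make the two steps concrete, here is a pure-Python sketch of TF-IDF weighting and cosine matching on three made-up sentences. It follows the smoothed IDF formula documented for scikit-learn, `ln((1 + n) / (1 + df)) + 1`, with L2-normalised vectors, and reuses the notebook's "three or more word characters" token rule. It is a simplified illustration, not a replacement for `TfidfVectorizer`; all function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w{3,}\b')  # tokens of 3+ word characters


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse, L2-normalised TF-IDF vector stored as {term: weight}."""
    counts = Counter(t for t in tokenize(text) if t in idf)  # drop unseen terms
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


# Toy corpus, not taken from the dataset
toy_docs = [
    'Soothing cough drops for a dry cough',
    'Cough syrup with honey',
    'Cleaning gloves for every household',
]
idf = fit_idf(toy_docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in toy_docs]
scores = [cosine(tfidf_vector('cough', idf), vec) for vec in doc_vectors]
print([round(s, 3) for s in scores])
```

The first sentence mentions the query term twice and scores highest, the second mentions it once, and the third shares no term with the query and scores exactly zero.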


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer='word',
    lowercase=True,
    token_pattern=r'\b[\w\d]{3,}\b'
)

vectorizer.fit(df['text'].values)
print('vectorizer fitted')

vectorizer fitted


In [13]:
document_vectors = vectorizer.transform(df['text'].values)
print('Index matrix:', document_vectors.shape)

Index matrix: (1000, 6529)


In [14]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query_vector = vectorizer.transform([user_query])
print(query_vector.shape)

# matching
similarities = cosine_similarity(query_vector, document_vectors)[0]
print(similarities.shape)
top_k = 10
top_indices = np.argsort(similarities)[::-1][:top_k]
print('Top indices:', top_indices)

(1, 6529)
(1000,)
Top indices: [781 101 896 329 341 340 339 338 337 336]


In [15]:
for ind, row in df.iloc[top_indices].iterrows():
    print(ind, '%.3f' % similarities[ind], row['text'][:100], '...')
    print()

781 0.245 VERY soothing.  I get that awful tickly dry cough a lot during allergy season, especially 5 minutes  ...

101 0.221 These seem like great quality cough drops with beneficial ingredients for when you are under the wea ...

896 0.109 We have a big variety of probiotics that we rotate and use daily.  This one has a very impressive la ...

329 0.000 This massager brush cleans scalp effectively and makes your head feel great when you use it. The bri ...

341 0.000 Astragalus has many health benefits including immune support, liver cleansing, healthy skin, and str ...

340 0.000 Beet root powder has a plethora of uses. It's excellent for cooking. It enhances any chocolate desse ...

339 0.000 This is a great set of gloves. They can be used for many different tasks. You get a few pairs so thi ...

338 0.000 If you want a good facial cleansing brush, but don't want to spend $200, this is a good alternative. ...

337 0.000 I think every household needs cleaning gloves. These have the 
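
In the output above only three documents have a non-zero score. The remaining slots are filled with reviews that do not contain the query term at all, because `argsort` always returns `top_k` indices. One possible way to drop them is sketched below with the standard library; `top_k_positive` is a hypothetical helper and the scores are toy values.

```python
import heapq


def top_k_positive(scores, k=10):
    """Return (index, score) pairs for the k best scores, skipping zero scores."""
    positive = ((i, s) for i, s in enumerate(scores) if s > 0)
    return heapq.nlargest(k, positive, key=lambda pair: pair[1])


# Toy similarity scores, not the notebook's real vector
toy_scores = [0.0, 0.221, 0.0, 0.245, 0.109, 0.0]
ranked = top_k_positive(toy_scores, k=10)
print(ranked)  # [(3, 0.245), (1, 0.221), (4, 0.109)]
```

With the NumPy arrays from the cells above, the same filter can be written as `top_indices[similarities[top_indices] > 0]`.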

# Elasticsearch

This section will not work in Colab.

In [None]:
from utils import request_elastic, load_config, create_index

request_elastic('version', debug=True, method='get')

Document indexing

In [None]:
index_config = load_config()
index_name = index_config['name']
create_index(index_config)
print(index_name, index_config['mappings']['properties'].keys())
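
The configuration returned by `load_config()` is not shown in this notebook. Judging only from how it is accessed above (`index_config['name']`, `index_config['mappings']['properties']`), it might look roughly like the sketch below. Everything here is an assumption: the index name is made up, the field names are taken from the documents built later in this notebook, and the field types are a plausible guess using standard Elasticsearch mapping types.

```python
# Hypothetical index configuration; the real one is returned by load_config().
index_config_example = {
    'name': 'reviews_example',  # made-up index name
    'settings': {'number_of_shards': 1, 'number_of_replicas': 0},
    'mappings': {
        'properties': {
            'category': {'type': 'keyword'},     # exact-match filter field
            'content': {'type': 'text'},         # analysed full-text field
            'asin': {'type': 'keyword'},
            'parent_asin': {'type': 'keyword'},
            'content_len': {'type': 'integer'},
            'doc_id': {'type': 'keyword'},
        }
    },
}


def create_index_body(config):
    """Request body for creating the index: everything except our own 'name' key."""
    return {key: value for key, value in config.items() if key != 'name'}


print(index_config_example['name'],
      list(index_config_example['mappings']['properties']))
```

The split between `text` and `keyword` matters for search: only `text` fields are tokenised and scored for relevance, while `keyword` fields match on the exact value.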

Load the full dataset

In [None]:

json_data = read_raw_data(data_path)
print(len(json_data))

In [None]:
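import hashlib
import time

from utils import clean_text

def eval_doc_id(seed=None, limit=10):
    # Fall back to the current timestamp when no seed is given
    if seed is None:
        seed = str(int(time.time()))
    # Keep the first `limit` hex characters of the MD5 digest
    res = hashlib.md5(seed.encode('utf-8')).hexdigest()[:limit]
    return res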

docs = [
    {
        'category': 'health',
        'content': clean_text(i['text']),
        'asin': i['asin'],
        'parent_asin': i['parent_asin'],
        'content_len': len(i['text']),
        'doc_id': eval_doc_id(i['user_id']+str(i['timestamp'])),
        '_id': eval_doc_id(i['user_id']+str(i['timestamp']))
    }
    for i in json_data
]
print(len(docs))
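
Because the ID is a hash of `user_id` plus `timestamp`, the same review always gets the same ID, so re-running the indexing step overwrites documents instead of duplicating them. The price of truncating the digest to 10 hex characters is a small chance that two different reviews collide. The sketch below estimates that chance with the usual birthday-bound approximation; `short_md5` mirrors the hashing step above, `collision_probability` is a hypothetical helper, and the corpus size of 500,000 is a made-up figure, not the size of this dataset.

```python
import hashlib
import math


def short_md5(seed, length=10):
    """First `length` hex characters of the MD5 digest of `seed`."""
    return hashlib.md5(seed.encode('utf-8')).hexdigest()[:length]


def collision_probability(n_docs, hex_len=10):
    """Approximate chance that at least two of n_docs random IDs collide."""
    id_space = 16 ** hex_len  # number of distinct IDs of this length
    exponent = n_docs * (n_docs - 1) / (2 * id_space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), numerically stable


# Same seed, same ID: indexing is idempotent
seed = 'AFKZENTNBQ7A7V7UXW5JJI6UGRYQ' + str(1580950175902)
assert short_md5(seed) == short_md5(seed)

# 500_000 is an assumed corpus size, used only for illustration
for hex_len in (10, 16):
    print(hex_len, collision_probability(500_000, hex_len))
```

If the estimate for the real corpus size is uncomfortably high, keeping more characters of the digest shrinks it quickly, since every extra hex character multiplies the ID space by 16.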

In [None]:
from utils import load_document, load_bulk_documents, search, pretty

load_document(docs[0], index_name)

In [None]:
search(index_name, 'cough')

Build the search index

In [None]:
load_bulk_documents(index_name, docs)
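
How `load_bulk_documents` talks to Elasticsearch is not shown. If it uses the Bulk API, the request body is newline-delimited JSON: one action line followed by one document line per item, with a trailing newline. The sketch below builds such a body under that assumption. `build_bulk_body` is a hypothetical name, and moving `_id` out of the document into the action line is an assumption about what the real helper does (Elasticsearch treats `_id` as metadata, not as a document field).

```python
import json


def build_bulk_body(index_name, docs):
    """Serialise docs into the newline-delimited JSON expected by the Bulk API."""
    lines = []
    for doc in docs:
        source = dict(doc)                # copy so the caller's dict is untouched
        doc_id = source.pop('_id', None)  # metadata goes into the action line
        action = {'index': {'_index': index_name}}
        if doc_id is not None:
            action['index']['_id'] = doc_id
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline
    return ''.join(line + '\n' for line in lines)


# Toy documents with made-up IDs
toy_docs = [
    {'_id': 'a1b2c3d4e5', 'content': 'great cough drops', 'category': 'health'},
    {'_id': 'f6a7b8c9d0', 'content': 'soft cleaning gloves', 'category': 'health'},
]
print(build_bulk_body('reviews_example', toy_docs))
```

For a large corpus the documents would normally be sent in chunks rather than as one body, so that a single request stays within the server's size limits.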

In [None]:
res = search(index_name, 'cough')
res

In [None]:
pretty(res['hits']['hits'])
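
The `search` and `pretty` helpers are part of `utils` and are not shown. A minimal sketch of what they might do: build a `match` query against the `content` field and pull the ID, score, and a text preview out of `res['hits']['hits']`. The function names, the choice of the `content` field, and the mock response below are all assumptions for illustration; only the overall shape of the query and of the response follows the Elasticsearch search API.

```python
def build_match_query(text, field='content', size=10):
    """Full-text `match` query body returning at most `size` hits."""
    return {'query': {'match': {field: text}}, 'size': size}


def summarize_hits(response, preview=100):
    """Extract (id, score, text preview) tuples from a search response."""
    return [
        (hit['_id'], hit['_score'], hit['_source'].get('content', '')[:preview])
        for hit in response['hits']['hits']
    ]


# Mock response with made-up IDs and scores, shaped like a search result
mock_response = {
    'hits': {
        'total': {'value': 2, 'relation': 'eq'},
        'hits': [
            {'_id': 'a1b2c3d4e5', '_score': 7.1,
             '_source': {'content': 'very soothing for a dry cough'}},
            {'_id': 'f6a7b8c9d0', '_score': 5.4,
             '_source': {'content': 'great quality cough drops'}},
        ],
    }
}

print(build_match_query('cough'))
for doc_id, score, text in summarize_hits(mock_response):
    print(doc_id, '%.3f' % score, text)
```

Unlike the TF-IDF cosine scores earlier in the notebook, Elasticsearch scores are not bounded by 1, so the two rankings can be compared by order but not by raw value.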

# Embeddings search

GPU-powered [embeddings evaluation](https://colab.research.google.com/drive/1avq9WrUSOwsfUUXZZhgyNxKNe62kG-fk?usp=sharing)