<img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/snake-chdb.png" height=100>


Inspired by ClickHouse Blog: [ANN Vector Search with SQL-powered LSH & Random Projections](https://clickhouse.com/blog/approximate-nearest-neighbour-ann-with-sql-powered-local-sensitive-hashing-lsh-random-projections). 

This demo shows how to use chDB to build a simple similarity search engine for a set of movies. The trained movie embeddings look like this:
```
movieId,embedding
318,"[-0.32907996  3.2970035   0.15050603  1.9187577  -5.8646975  -3.7843416
 -2.6874192  -6.161338    1.98583    -2.6736846   2.1889842   5.162994
  1.654852   -0.7761136   1.5172766  -0.85932654]"
296,"[-0.01519391  2.443479   -1.480839    0.10609777 -5.6971617  -1.3988643
 -4.1634355  -6.399832    4.8691964  -2.7901962   1.738929    3.839515
  1.5430368   1.4577994   0.56058794 -0.9734406 ]"
356,"[-1.8876978   1.6772441  -1.9821857  -0.93794477 -2.5182424  -3.8408334
 -3.87617    -4.512172    0.8053944  -2.081389    1.454333    6.7315516
  0.22428921  0.72071487  2.211912   -1.3959718 ]"
593,"[-1.4681095   2.4807196  -2.990346    0.239727   -5.800576   -2.9217808
 -2.9491336  -6.646222    4.2070146  -2.650232    0.6342644   5.38617
  1.0954435  -0.71700466  0.43723348 -0.8792468 ]"
2571,"[-2.5742574   1.3490096  -2.0755954   3.0196552  -7.46083    -3.2669234
 -5.8962264  -4.022377    0.9717742   0.75643456  3.016018    4.7698874
 -0.34867725  3.7842882   0.4231439  -0.81689113]"
```
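The `embedding` column in the sample above is stored in NumPy's printed form: values separated by spaces, wrapped over several lines, with no commas. As a minimal sketch, assuming that exact textual format, such a string could be turned back into a list of floats as follows. The helper name `parse_numpy_vector` is hypothetical and is not part of chDB or gensim.

```python
from typing import List


def parse_numpy_vector(text: str) -> List[float]:
    """Parse a NumPy-printed vector such as '[-0.3  3.2\n  0.1]' into floats."""
    # Drop the surrounding brackets, then split on any whitespace (newlines included).
    inner = text.strip().strip("[]")
    return [float(token) for token in inner.split()]


# A shortened, two-line fragment in the same format as the sample data.
raw = "[-0.32907996  3.2970035   0.15050603\n  1.9187577  -5.8646975 ]"
vector = parse_numpy_vector(raw)
print(len(vector), vector)
```

Writing the vectors as comma-separated lists in the first place avoids this parsing step.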


# Recommendation systems in recent years

Recommendation systems have seen several major advances over the past decade or so:

1. 2009-2015: LR (Logistic Regression) combined with sophisticated feature engineering displaced SVM and collaborative filtering, the algorithms of the previous generation.
1. 2012-2015: NN (Neural Networks) transformed the CV (Computer Vision) and NLP (Natural Language Processing) industries, then came back to the recommendation system field, greatly reducing the importance of traditional feature-combination skills.
1. 2013: Embeddings were brought back into the spotlight by Google and later developed into techniques like Item2vec, sparking a trend in mining user behavior.
1. 2015-2016: Wide & Deep inspired "grafting" NN onto various older models.
1. 2016-2017: Tree models such as XGBoost and LightGBM, which were fast, accurate, and efficient, staged a strong comeback.
1. 2017: The Transformer was popularized by the paper "Attention Is All You Need."
1. 2018-now: The focus has mainly been on deeper modeling of features, especially user features. DIEN is a well-known representative.

# About this demo

Item2vec builds on Word2vec. Its core idea is to treat each user's historical behavior sequence as a sentence and to train a vector representation for each item with Word2vec. Items are then recommended based on the similarity of their vectors.

The main purpose of this demo is to show how to train item vectors with Word2vec and how to recommend items based on the similarity of those vectors. It consists of four parts:
1. Prepare item sequences based on user behavior.
2. Train a CBOW model using the Word2Vec module of the gensim library.
3. Extract all embedding data and write it to chDB.
4. Query chDB by cosine distance to find movies similar to the input movie.
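To make the first step concrete, here is a minimal pure-Python sketch of turning rating rows into per-user "sentences". The rows, the `build_sequences` helper, and the `min_rating` parameter are made up for illustration; the notebook itself does this step with a SQL query in chDB.

```python
from collections import defaultdict

# Hypothetical (userId, movieId, rating, timestamp) rows, invented for this sketch.
ratings = [
    (1, 233, 5.0, 100),
    (1, 11, 4.0, 300),
    (1, 21, 4.5, 200),
    (1, 99, 2.0, 150),  # dropped: the rating is not above the threshold
    (2, 33, 4.0, 10),
    (2, 11, 5.0, 20),
    (2, 21, 4.0, 30),
]


def build_sequences(rows, min_rating=3.5):
    """Group well-rated movies by user, ordered by timestamp, as string tokens."""
    per_user = defaultdict(list)
    for user_id, movie_id, rating, timestamp in rows:
        if rating > min_rating:
            per_user[user_id].append((timestamp, movie_id))
    # Sort users for a stable result, and each user's events by timestamp.
    return [
        [str(movie_id) for _, movie_id in sorted(events)]
        for _, events in sorted(per_user.items())
    ]


sequences = build_sequences(ratings)
print(sequences)
```

Each inner list plays the role of one sentence, and each movie id plays the role of one word.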


# A brief introduction to Word2Vec

Word2Vec was introduced in two papers by a team of researchers at Google, published between September and October 2013. Alongside the papers, the researchers released their implementation in C. The Python implementation followed shortly after the first paper, courtesy of Gensim.

The fundamental premise of Word2Vec is that words appearing in similar contexts have similar meanings and therefore end up with similar vector representations in the model. For example, "dog," "puppy," and "pup" are frequently used in analogous situations, surrounded by similar words like "good," "fluffy," or "cute." Word2Vec will thus assign them similar vectors.

Based on this assumption, Word2Vec can be utilized to discover relationships between words in a dataset, calculate their similarity, or employ the vector representation of these words as input for other applications such as text classification or clustering.
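The notion of "context" can be illustrated with a small sketch of how CBOW-style training examples are formed: each word is predicted from the words inside a window around it. This is a simplified illustration under stated assumptions, not gensim's implementation; the `cbow_pairs` helper is hypothetical, and real implementations add details such as random window shrinking and negative sampling that are omitted here.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_tokens, target_token) pairs for one sequence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` tokens before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` tokens after the target
        context = left + right
        if context:
            yield context, target


sentence = ["good", "fluffy", "dog", "cute"]
pairs = list(cbow_pairs(sentence, window=1))
for context, target in pairs:
    print(context, "->", target)
```

Words that keep appearing with the same context tokens receive similar updates during training, which is why their vectors drift toward each other.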

<img src="https://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" alt="Word2Vec" style="max-width:800px">

In [1]:
%pip install -q -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade tensorflow-cpu gensim chdb pandas pyarrow scikit-learn numpy matplotlib
%pip show tensorflow-cpu chdb gensim

Note: you may need to restart the kernel to use updated packages.
Name: tensorflow-cpu
Version: 2.15.0.post1
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /home/Clickhouse/.venv/lib/python3.9/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: 
---
Name: chdb
Version: 1.0.2
Summary: chDB is an in-process SQL OLAP Engine powered by ClickHouse
Home-page: https://github.com/auxten/chdb
Author: auxten
Author-email: auxtenwpc@gmail.com
License: Apache-2.0
Location: /home/Clickhouse/.venv/lib/python3.9/site-packages
Requires: 
Required-by: 
---
Name: gensim
Version: 4.3.2
Summary: Python framework fo

In [2]:
import pandas as pd
import zipfile
import urllib.request
import os
import chdb
from chdb import session

# Download and extract the dataset
if not os.path.exists("ml-25m/ratings.csv"):
    url = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"
    import ssl
    ssl._create_default_https_context = ssl._create_unverified_context
    filehandle, _ = urllib.request.urlretrieve(url)
    zip_file_object = zipfile.ZipFile(filehandle, "r")
    zip_file_object.extractall()

!ls -l ml-25m

total 1129584
-rw-r--r-- 1 root root     10460 Dec 11 08:24 README.txt
-rw-r--r-- 1 root root 435164157 Dec 11 08:24 genome-scores.csv
-rw-r--r-- 1 root root     18103 Dec 11 08:24 genome-tags.csv
-rw-r--r-- 1 root root   1368578 Dec 11 08:24 links.csv
-rw-r--r-- 1 root root   3038099 Dec 11 08:24 movies.csv
-rw-r--r-- 1 root root 678260987 Dec 11 08:24 ratings.csv
-rw-r--r-- 1 root root  38810332 Dec 11 08:24 tags.csv


In [3]:
# Peek at the data
print(chdb.query("SELECT * FROM file('ml-25m/ratings.csv') LIMIT 5"))

1,296,5,1147880044
1,306,3.5,1147868817
1,307,5,1147868828
1,665,5,1147878820
1,899,3.5,1147868510



In [4]:
# Create views over the CSV files of the MovieLens dataset
chs = session.Session()
chs.query("CREATE DATABASE IF NOT EXISTS movielens ENGINE = Atomic")
chs.query("USE movielens")
chs.query(
    "CREATE VIEW movies AS SELECT movieId, title, genres FROM file('ml-25m/movies.csv')"
)
chs.query(
    "CREATE VIEW ratings AS SELECT userId, movieId, rating, timestamp FROM file('ml-25m/ratings.csv')"
)
chs.query(
    "CREATE VIEW tags AS SELECT userId, movieId, tag, timestamp FROM file('ml-25m/tags.csv')"
)
print(chs.query("SELECT * FROM movies LIMIT 5", "CSVWithNames"))
print(chs.query("SELECT * FROM ratings LIMIT 5", "CSVWithNames"))
print(chs.query("SELECT * FROM tags LIMIT 5", "CSVWithNames"))

"movieId","title","genres"
1,"Toy Story (1995)","Adventure|Animation|Children|Comedy|Fantasy"
2,"Jumanji (1995)","Adventure|Children|Fantasy"
3,"Grumpier Old Men (1995)","Comedy|Romance"
4,"Waiting to Exhale (1995)","Comedy|Drama|Romance"
5,"Father of the Bride Part II (1995)","Comedy"

"userId","movieId","rating","timestamp"
1,296,5,1147880044
1,306,3.5,1147868817
1,307,5,1147868828
1,665,5,1147878820
1,899,3.5,1147868510

"userId","movieId","tag","timestamp"
3,260,"classic",1439472355
3,260,"sci-fi",1439472256
4,1732,"dark comedy",1573943598
4,1732,"great dialogue",1573943604
4,7569,"so bad it's good",1573943455



# Use word2vec to train the embeddings of movies

In [5]:
# Build one movie id sequence per user: keep the movies the user rated above 3.5,
# order them by timestamp, and join the ids with " ".
# These sequences are used to train the movie embeddings.
# E.g. if user 1 rated movies 233, 21, 11 and user 2 rated movies 33, 11, 21,
# then the movie id sequences are
# "233 21 11"
# "33 11 21"
movie_id_seq = chs.query("""SELECT arrayStringConcat(groupArray(movieId), ' ') FROM (
                            SELECT userId, movieId FROM ratings WHERE rating > 3.5  ORDER BY userId, timestamp
                            ) GROUP BY userId""")   


# Split the query result into a list with one sequence string per user
moive_list = str(movie_id_seq).split("\n")

print("Length of movie list: ", len(moive_list))
print("First 5 movie list: ", moive_list[:5])


Length of movie list:  162343
First 5 movie list:  ['"858 1193 2959 50 183837 201773 122914 195159 8961 33794 6377 1203 904 912 2019 79132 58559 593 4226 122912 122916 4973 750 122926 68954 3504 955 4963 8984 53322 158783 45720 117887 178827 171253 1387 6533 49272 71745 154 7209 164909 1256 166461 2648 1340 5769 1198 1732 91094 1265 6893 1945 307 104283 3645 2716 1 78499 3114 201588 898 3546 1220 2936 50872 364 4262 200332 1079 195161 2700 946 1267 180031 953 176371 6273 111734 152970 35836 26393 164179 118466 25865 127202 56782 4361 1276 134853 5712 2065 61934 2761 145150 3362 3928 108932 112852 103980 7064 2423 8360 112450 141 82926 916 3039 2203 2132 3088 26171 951 1148 8641 60756 34162 106918 3948 1923 83976 1682 3988 1485 6373 180265 72294 3556 2987 2406 4681 188301 89745 59315 122920 288 38061 102125 2770 79702 110102 93840 104913 8798 142488 106920 6711 137857 111781 111759 648 1961 45186 189333 55820 139644 158872 69122 140174 93510 169992 122906 185029 5989 56152 4022 150 2797

In [10]:
import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()

# Strip the CSV quotes and split each sequence string into a list of movie id tokens
movie_id_seq_list = [seq.strip("\"").split() for seq in moive_list]
print("Length of movie id sequence list: ", len(movie_id_seq_list))
print("First 5 movie id sequence list: ", movie_id_seq_list[:5])

# Train the Word2Vec model using CBOW
model = Word2Vec(sg=0, window=5, vector_size=16, min_count=1, workers=cores-1)
model.build_vocab(movie_id_seq_list, progress_per=10000)
print("Vocabulary size: ", len(model.wv))

# Count the distinct movie ids with at least one rating > 3.5 (expected to match the vocabulary size)
print("Distinct movie id count: ", chs.query("SELECT count(DISTINCT movieId) FROM ratings WHERE rating > 3.5"))

model.train(movie_id_seq_list, total_examples=model.corpus_count, epochs=10, report_delay=1)

# Print model info
print("Vocabulary content: ", model.wv.index_to_key)


Length of movie id sequence list:  162343
First 5 movie id sequence list:  [['858', '1193', '2959', '50', '183837', '201773', '122914', '195159', '8961', '33794', '6377', '1203', '904', '912', '2019', '79132', '58559', '593', '4226', '122912', '122916', '4973', '750', '122926', '68954', '3504', '955', '4963', '8984', '53322', '158783', '45720', '117887', '178827', '171253', '1387', '6533', '49272', '71745', '154', '7209', '164909', '1256', '166461', '2648', '1340', '5769', '1198', '1732', '91094', '1265', '6893', '1945', '307', '104283', '3645', '2716', '1', '78499', '3114', '201588', '898', '3546', '1220', '2936', '50872', '364', '4262', '200332', '1079', '195161', '2700', '946', '1267', '180031', '953', '176371', '6273', '111734', '152970', '35836', '26393', '164179', '118466', '25865', '127202', '56782', '4361', '1276', '134853', '5712', '2065', '61934', '2761', '145150', '3362', '3928', '108932', '112852', '103980', '7064', '2423', '8360', '112450', '141', '82926', '916', '3039', '

# Test finding similar movies

In [7]:
input_movie_id = 1
top_k = 10
print("Input movie: ", chs.query(f"SELECT title FROM movies WHERE movieId = {input_movie_id}", "CSV"))
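print(f"Top {top_k} similar movies: ")
similar_movies = model.wv.most_similar(str(input_movie_id), topn=top_k)
# Note: the IN filter does not keep the similarity ranking; rows come back in whatever order the query produces
print(chs.query(f"SELECT movieId, title FROM movies WHERE movieId IN ({','.join([str(m[0]) for m in similar_movies])})", "CSV"))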

Input movie:  "Toy Story (1995)"

Top 10 similar movies: 
34,"Babe (1995)"
150,"Apollo 13 (1995)"
356,"Forrest Gump (1994)"
364,"Lion King, The (1994)"
588,"Aladdin (1992)"
595,"Beauty and the Beast (1991)"
1197,"Princess Bride, The (1987)"
1265,"Groundhog Day (1993)"
1270,"Back to the Future (1985)"
3114,"Toy Story 2 (1999)"



# Save movieId and embeddings to a temporary CSV file

In [8]:
import csv

# Open the CSV file in write mode
with open('movie_embeddings.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['movieId', 'embedding'])

    # Iterate over each movieId and its corresponding embedding
    for movieId in model.wv.index_to_key:
        embedding = model.wv[movieId]
        # Convert the NumPy array (printed as [0.1 0.2 ...]) into a list of floats, e.g. [0.1, 0.2, ...]
        embedding = embedding.tolist()

        # Write the movieId and embedding as a row in the CSV file
        writer.writerow([movieId, embedding])



# Use brute force to find similar movies

In [None]:
print(chs.query("SELECT * FROM file('movie_embeddings.csv') LIMIT 5"))

318,"[-0.92671806  4.0658607  -0.9362241   2.3389215  -5.5646834  -3.0489233
 -2.3870916  -3.9461932   2.5411336  -4.2200975   2.2373397   4.5567446
  4.2845583  -0.27262357  1.7597772  -0.7714879 ]"
296,"[-2.320194    2.9284086  -0.23645285  0.8121122  -6.580175   -2.6938596
 -4.4205527  -3.6644669   3.6217701  -4.7963047   2.8589551   2.1375341
  1.4287919   1.2680439   1.5670657  -0.12797607]"
356,"[-2.8729887   2.6795983  -3.0608099   0.34714407 -2.6804838  -2.955097
 -4.0935082  -2.248234    1.4665883  -3.4244392   1.334923    5.6284447
  2.4682024   1.7087619   2.0542164  -2.622762  ]"
593,"[-3.350593    3.40134    -2.7125394   1.3866315  -6.415367   -4.108528
 -3.4653835  -4.563889    3.4765499  -4.066826    1.3893566   3.6057796
  1.9603899   0.25266844  1.17256    -0.29026085]"
2571,"[-2.725035    3.2000797  -1.1294255   3.7750504  -6.250042   -3.7862737
 -3.8041484  -2.58676     0.06922363 -0.8518675   4.711027    4.6145434
  1.7265469   4.670463   -0.9652608   0.14176556]"

In [None]:
# Switch to the movie_embeddings database
from chdb import session
chs = session.Session()
chs.query("CREATE DATABASE IF NOT EXISTS movie_embeddings ENGINE = Atomic")
chs.query("USE movie_embeddings")
chs.query('DROP TABLE IF EXISTS embeddings')
chs.query('DROP TABLE IF EXISTS embeddings_with_title')


chs.query("""CREATE TABLE embeddings (
      movieId UInt32 NOT NULL,
      embedding Array(Float32) NOT NULL
  ) ENGINE = MergeTree()
  ORDER BY movieId""")

print("Inserting movie embeddings into the database")
chs.query("INSERT INTO embeddings (movieId, embedding) VALUES (318, [-0.92671806, 4.0658607, -0.9362241, 2.3389215, -5.5646834, -3.0489233, -2.3870916, -3.9461932, 2.5411336, -4.2200975, 2.2373397, 4.5567446, 4.2845583, -0.27262357, 1.7597772, -0.7714879])")

print(chs.query('SELECT * FROM embeddings LIMIT 5'))


# # Join the embeddings table with the movies table to get the title
# chs.query('CREATE TABLE embeddings_with_title '
#           '(movieId UInt32 NOT NULL, title String NOT NULL, embedding Array(Float32) NOT NULL) '
#             'ENGINE = MergeTree() '
#             'ORDER BY movieId '
#             'AS SELECT e.movieId, m.title, e.embedding '
#             'FROM file(\'movie_embeddings.csv\') AS e '
#             'JOIN movielens.movies AS m ON e.movieId = m.movieId')


# print(chs.query('SELECT * FROM embeddings_with_title LIMIT 5'))
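
A brute-force search simply compares the query movie's vector with every other vector and keeps the closest ones. The following is a minimal pure-Python sketch of that idea using cosine distance. The toy vectors and the `cosine_distance` and `top_k_similar` helpers are made up for illustration and are not part of the notebook.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_similar(query_id, embeddings, k=2):
    """Scan every embedding and return the k ids closest to the query."""
    query = embeddings[query_id]
    scored = [
        (cosine_distance(query, vector), movie_id)
        for movie_id, vector in embeddings.items()
        if movie_id != query_id  # skip the query movie itself
    ]
    return [movie_id for _, movie_id in sorted(scored)[:k]]


# Toy 3-dimensional vectors, invented for this sketch (the demo trains 16 dimensions).
toy = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
    4: [-1.0, 0.0, 0.0],
}
print(top_k_similar(1, toy, k=2))
```

In chDB the same ranking could likely be pushed into SQL with ClickHouse's `cosineDistance` function, for example `SELECT movieId FROM embeddings ORDER BY cosineDistance(embedding, <query vector>) LIMIT 10`. That query is a sketch and has not been run against this notebook's tables.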
