<img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/snake-chdb.png" height=100>


Inspired by ClickHouse Blog: [ANN Vector Search with SQL-powered LSH & Random Projections](https://clickhouse.com/blog/approximate-nearest-neighbour-ann-with-sql-powered-local-sensitive-hashing-lsh-random-projections).

This demo shows how to use chDB to build a simple similarity search engine for a set of movies. Each movie is represented by an embedding vector, for example:
```
movieId,embedding
318,"[-0.32907996  3.2970035   0.15050603  1.9187577  -5.8646975  -3.7843416
 -2.6874192  -6.161338    1.98583    -2.6736846   2.1889842   5.162994
  1.654852   -0.7761136   1.5172766  -0.85932654]"
296,"[-0.01519391  2.443479   -1.480839    0.10609777 -5.6971617  -1.3988643
 -4.1634355  -6.399832    4.8691964  -2.7901962   1.738929    3.839515
  1.5430368   1.4577994   0.56058794 -0.9734406 ]"
356,"[-1.8876978   1.6772441  -1.9821857  -0.93794477 -2.5182424  -3.8408334
 -3.87617    -4.512172    0.8053944  -2.081389    1.454333    6.7315516
  0.22428921  0.72071487  2.211912   -1.3959718 ]"
593,"[-1.4681095   2.4807196  -2.990346    0.239727   -5.800576   -2.9217808
 -2.9491336  -6.646222    4.2070146  -2.650232    0.6342644   5.38617
  1.0954435  -0.71700466  0.43723348 -0.8792468 ]"
2571,"[-2.5742574   1.3490096  -2.0755954   3.0196552  -7.46083    -3.2669234
 -5.8962264  -4.022377    0.9717742   0.75643456  3.016018    4.7698874
 -0.34867725  3.7842882   0.4231439  -0.81689113]"
```


# Recommendation systems in recent years

Recommendation systems have made several major advances over the past 10 years:

1. 2009-2015: LR (Logistic Regression) combined with sophisticated feature engineering outperformed SVM and collaborative filtering, the algorithms of the previous generation.
1. 2012-2015: NN (Neural Networks) transformed the CV (Computer Vision) and NLP (Natural Language Processing) fields, then made their way back into recommendation systems, greatly reducing the importance of traditional feature-combination skills.
1. 2013: Embedding techniques came out of Google's research and were later developed into methods such as Item2vec, sparking a trend of mining user behavior.
1. 2015-2016: Wide & Deep inspired "grafting" NNs onto various older models.
1. 2016-2017: Tree models such as XGBoost and LightGBM, which are fast, effective, and efficient, staged a strong comeback.
1. 2017: The Transformer was popularized by the paper "Attention Is All You Need."
1. 2018-now: The focus has mainly been on deeper modeling of features, especially user features. DIEN is a well-known representative.

# About this demo

Item2vec builds on Word2vec. Its core idea is to treat each user's historical behavior sequence as a sentence and to train a vector representation for each item with Word2vec. Item recommendations are then made based on the similarity of the item vectors.

The main purpose of this demo is to demonstrate how to train the vector representation of items using Word2vec and make item recommendations based on the similarity of item vectors. It mainly consists of 4 parts:
1. Prepare item sequences based on user behavior.
2. Train a CBOW model using the Word2Vec module of the gensim library.
3. Extract all embedding data and write it to chDB.
4. Perform queries on chDB based on cosine distance to find similar movies to the input movie.
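
Step 1 can be sketched in plain Python before looking at the SQL version used in this demo. The helper name `build_sequences` and the toy rating rows are made up for illustration; the threshold of 3.5 mirrors the one used in the demo's query.

```python
from collections import defaultdict


def build_sequences(ratings, min_rating=3.5):
    """Group the movies each user liked into one timestamp-ordered sequence."""
    per_user = defaultdict(list)
    for user_id, movie_id, rating, timestamp in ratings:
        if rating > min_rating:
            per_user[user_id].append((timestamp, movie_id))
    # Sort by timestamp and keep the movie ids as string tokens, like words in a sentence
    return {
        user_id: [str(movie_id) for _, movie_id in sorted(items)]
        for user_id, items in per_user.items()
    }


# Toy rows: (userId, movieId, rating, timestamp); all values are invented
toy_ratings = [
    (1, 233, 5.0, 100),
    (1, 21, 4.0, 200),
    (1, 7, 2.0, 250),  # dropped: rating is not above 3.5
    (1, 11, 4.5, 300),
    (2, 33, 4.0, 50),
    (2, 11, 5.0, 60),
    (2, 21, 4.5, 70),
]
sequences = build_sequences(toy_ratings)
print(sequences)
```

Each value in `sequences` plays the role of one sentence in the Word2Vec training corpus.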


# Briefing about Word2Vec

Word2Vec was introduced in two papers by a team of researchers at Google, published between September and October 2013. Alongside the papers, the researchers released their implementation in C. The Python implementation followed shortly after the first paper, courtesy of Gensim.

The fundamental premise of Word2Vec is that words that appear in similar contexts have similar meanings and therefore end up with similar vector representations in the model. For example, "dog," "puppy," and "pup" are frequently used in analogous situations, with similar surrounding words like "good," "fluffy," or "cute." According to Word2Vec, they will thus have similar vector representations.

Based on this assumption, Word2Vec can be utilized to discover relationships between words in a dataset, calculate their similarity, or employ the vector representation of these words as input for other applications such as text classification or clustering.
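
To make "context" concrete, here is a minimal sketch of how CBOW-style training examples could be drawn from one sequence with a fixed window. The function `cbow_pairs` is hypothetical and is not gensim's implementation; real implementations add details such as subsampling and negative sampling.

```python
def cbow_pairs(sentence, window=2):
    """Return (context, target) pairs: the model predicts target from context."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip sequences too short to have any context
            pairs.append((context, target))
    return pairs


# A toy "sentence" of movie ids, as used by Item2vec
pairs = cbow_pairs(["233", "21", "11", "7"], window=1)
for context, target in pairs:
    print(context, "->", target)
```

Items that keep appearing with the same neighbours receive similar training signals, which is why their vectors end up close together.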

<img src="https://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" alt="Word2Vec" style="max-width:800px">

In [43]:
%pip install -q --upgrade tensorflow gensim chdb pandas pyarrow numpy==1.23.5 matplotlib
%pip show tensorflow chdb gensim numpy

Name: tensorflow
Version: 2.15.0.post1
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: dopamine-rl
---
Name: chdb
Version: 1.0.2
Summary: chDB is an in-process SQL OLAP Engine powered by ClickHouse
Home-page: https://github.com/auxten/chdb
Author: auxten
Author-email: auxtenwpc@gmail.com
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: 
---
Name: gensim
Version: 4.3.2
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author:

In [44]:
import numpy as np
print(np.__version__)

1.23.5


In [45]:
import pandas as pd
import zipfile
import urllib.request
import os
import chdb
from chdb import session

# Download and extract the dataset
if not os.path.exists("ml-25m/ratings.csv"):
    url = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"
    import ssl
    # NOTE: this disables HTTPS certificate verification; acceptable for a demo only
    ssl._create_default_https_context = ssl._create_unverified_context
    filehandle, _ = urllib.request.urlretrieve(url)
    # Extract into the current directory, which creates ml-25m/
    with zipfile.ZipFile(filehandle, "r") as zip_file_object:
        zip_file_object.extractall()

!ls -l ml-25m

total 1129588
-rw-r--r-- 1 root root 435164157 Dec 14 08:14 genome-scores.csv
-rw-r--r-- 1 root root     18103 Dec 14 08:14 genome-tags.csv
-rw-r--r-- 1 root root   1368578 Dec 14 08:14 links.csv
-rw-r--r-- 1 root root   3038099 Dec 14 08:14 movies.csv
-rw-r--r-- 1 root root 678260987 Dec 14 08:14 ratings.csv
-rw-r--r-- 1 root root     10460 Dec 14 08:14 README.txt
-rw-r--r-- 1 root root  38810332 Dec 14 08:14 tags.csv


In [46]:
# Peek at the data
print(chdb.query("SELECT * FROM file('ml-25m/ratings.csv') LIMIT 5"))

1,296,5,1147880044
1,306,3.5,1147868817
1,307,5,1147868828
1,665,5,1147878820
1,899,3.5,1147868510



In [47]:
# Create views over the CSV files of the MovieLens dataset
chs = session.Session()
chs.query("CREATE DATABASE IF NOT EXISTS movielens ENGINE = Atomic")
chs.query("USE movielens")
chs.query(
    "CREATE VIEW movies AS SELECT movieId, title, genres FROM file('ml-25m/movies.csv')"
)
chs.query(
    "CREATE VIEW ratings AS SELECT userId, movieId, rating, timestamp FROM file('ml-25m/ratings.csv')"
)
chs.query(
    "CREATE VIEW tags AS SELECT userId, movieId, tag, timestamp FROM file('ml-25m/tags.csv')"
)
print(chs.query("SELECT * FROM movies LIMIT 5", "CSVWithNames"))
print(chs.query("SELECT * FROM ratings LIMIT 5", "CSVWithNames"))
print(chs.query("SELECT * FROM tags LIMIT 5", "CSVWithNames"))

"movieId","title","genres"
1,"Toy Story (1995)","Adventure|Animation|Children|Comedy|Fantasy"
2,"Jumanji (1995)","Adventure|Children|Fantasy"
3,"Grumpier Old Men (1995)","Comedy|Romance"
4,"Waiting to Exhale (1995)","Comedy|Drama|Romance"
5,"Father of the Bride Part II (1995)","Comedy"

"userId","movieId","rating","timestamp"
1,296,5,1147880044
1,306,3.5,1147868817
1,307,5,1147868828
1,665,5,1147878820
1,899,3.5,1147868510

"userId","movieId","tag","timestamp"
3,260,"classic",1439472355
3,260,"sci-fi",1439472256
4,1732,"dark comedy",1573943598
4,1732,"great dialogue",1573943604
4,7569,"so bad it's good",1573943455



# Use word2vec to train the embeddings of movies

In [48]:
# Build one movie id sequence per user from the ratings: keep only movies rated > 3.5,
# order them by timestamp within each user, and join the ids with " ".
# These sequences are the "sentences" used to train the movie embeddings,
# e.g. if user 1 liked movies 233, 21, 11 and user 2 liked movies 33, 11, 21,
# then the movie id sequences are
# "233 21 11"
# "33 11 21"
movie_id_seq = chs.query("""SELECT arrayStringConcat(groupArray(movieId), ' ') FROM (
                            SELECT userId, movieId FROM ratings WHERE rating > 3.5  ORDER BY userId, timestamp
                            ) GROUP BY userId""")


# Split the query result into a list with one user's sequence per element
movie_list = str(movie_id_seq).split("\n")

print("Length of movie list: ", len(movie_list))
# print("First 3 movie list: ", movie_list[:3])


Length of movie list:  162343


In [49]:
import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()

# Split the movie id sequence into a list of lists
movie_id_seq_list = [seq.strip("\"").split() for seq in movie_list]
print("Length of movie id sequence list: ", len(movie_id_seq_list))
# print("First 5 movie id sequence list: ", movie_id_seq_list[:5])

# Train the Word2Vec model using CBOW (sg=0); keep at least one worker on single-core machines
model = Word2Vec(sg=0, window=5, vector_size=16, min_count=1, workers=max(1, cores - 1))
model.build_vocab(movie_id_seq_list, progress_per=10000)
print("Vocabulary size: ", len(model.wv))

# Count the distinct movie ids that have at least one rating > 3.5
print("Distinct movie id count: ", chs.query("SELECT count(DISTINCT movieId) FROM ratings WHERE rating > 3.5"))

model.train(movie_id_seq_list, total_examples=model.corpus_count, epochs=10, report_delay=1)

# Print model info
print("Vocabulary content: ", model.wv.index_to_key[:100])


Length of movie id sequence list:  162343
Vocabulary size:  40858
Distinct movie id count:  40858

Vocabulary content:  ['318', '296', '356', '593', '2571', '260', '527', '2959', '50', '1196', '858', '1198', '4993', '110', '2858', '589', '1210', '47', '1', '7153', '5952', '608', '480', '457', '2028', '1270', '2762', '58559', '4226', '32', '150', '3578', '79132', '1136', '1193', '1704', '1197', '1221', '541', '1214', '1291', '1089', '364', '1213', '4973', '1240', '293', '4306', '1036', '1265', '588', '2329', '590', '7361', '1200', '6539', '6874', '3147', '4886', '111', '4995', '6377', '1258', '1682', '1206', '912', '750', '33794', '1617', '778', '780', '2997', '1580', '1097', '924', '1208', '1527', '595', '60069', '48516', '380', '1732', '8961', '1222', '4963', '5418', '377', '2716', '2324', '4011', '4878', '1961', '5989', '5618', '7438', '3996', '2918', '592', '733', '68954']


# Test finding similar movies

In [50]:
input_movie_id = 1
top_k = 10
print("Input movie: ", chs.query(f"SELECT title FROM movies WHERE movieId = {input_movie_id}", "CSV"))
print("Top 10 similar movies: ")
similar_movies = model.wv.most_similar(str(input_movie_id), topn=top_k)
print(chs.query(f"SELECT movieId, title FROM movies WHERE movieId IN ({','.join([str(m[0]) for m in similar_movies])})", "CSV"))

Input movie:  "Toy Story (1995)"

Top 10 similar movies: 
34,"Babe (1995)"
150,"Apollo 13 (1995)"
356,"Forrest Gump (1994)"
364,"Lion King, The (1994)"
588,"Aladdin (1992)"
595,"Beauty and the Beast (1991)"
1197,"Princess Bride, The (1987)"
1265,"Groundhog Day (1993)"
1270,"Back to the Future (1985)"
3114,"Toy Story 2 (1999)"



# Save movieId and embeddings to a temporary CSV file

In [51]:
import csv

# Open the CSV file in write mode
with open('movie_embeddings.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['movieId', 'embedding'])

    # Iterate over each movieId and its corresponding embedding
    for movieId in model.wv.index_to_key:
        embedding = model.wv[movieId]
        # Convert the NumPy array (printed as [0.1 0.2 ...]) into a list of floats, e.g. [0.1, 0.2, ...]
        embedding = embedding.tolist()

        # Write the movieId and embedding as a row in the CSV file
        writer.writerow([movieId, embedding])
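
As a sanity check on this file format, the sketch below writes two toy rows the same way and reads them back. It relies on the fact that `csv.writer` stores a Python list of finite floats as its `str()` form, which is also valid JSON. The file name `toy_embeddings.csv` and the values are made up for illustration.

```python
import csv
import json
import os
import tempfile

# Toy (movieId, embedding) rows; the numbers are invented
rows = [("318", [0.5, -1.25]), ("296", [2.0, 0.0])]

path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["movieId", "embedding"])
    for movie_id, embedding in rows:
        # The list is written as the quoted string "[0.5, -1.25]"
        writer.writerow([movie_id, embedding])

with open(path, newline="") as f:
    reader = csv.DictReader(f)
    # Parse the bracketed string back into a list of floats
    loaded = {int(r["movieId"]): json.loads(r["embedding"]) for r in reader}

print(loaded)
```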


# Use brute force to find similar movies
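
Brute force here means comparing the target embedding against every other row and keeping the closest ones. ClickHouse's `cosineDistance` is one minus the cosine similarity, so smaller values mean more similar vectors. A minimal pure-Python sketch of the same idea follows; the helper names and the toy vectors are assumptions for illustration, not part of chDB.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(embeddings, target_id, k=10):
    """Score every other item against the target and return the k closest."""
    target = embeddings[target_id]
    scored = [
        (cosine_distance(vector, target), item_id)
        for item_id, vector in embeddings.items()
        if item_id != target_id  # not self
    ]
    return [(item_id, distance) for distance, item_id in sorted(scored)[:k]]


# Toy 2-dimensional embeddings keyed by a made-up movie id
toy = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [-1.0, 0.0]}
top = brute_force_top_k(toy, target_id=1, k=2)
print(top)
```

The cost grows linearly with the number of items, which is acceptable for the tens of thousands of movies in this dataset.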

In [52]:
chs.query('SELECT * FROM file(\'movie_embeddings.csv\') LIMIT 5')

318,"[-2.2265753746032715,3.5254011154174805,-0.8498407602310181,2.891636848449707,-5.75970458984375,-4.4655680656433105,-2.910050392150879,-4.874805927276611,1.8407117128372192,-4.037372589111328,0.5827102065086365,3.5602872371673584,2.98940110206604,0.626388669013977,-0.8868098855018616,-2.443618059158325]"
296,"[-2.526492118835449,1.1677000522613525,-0.8071125149726868,1.050546407699585,-6.952147483825684,-3.644843339920044,-4.3457231521606445,-3.9279022216796875,3.377218008041382,-3.938249349594116,0.4486609697341919,1.5516788959503174,-0.06512846052646637,2.3503451347351074,-1.0555064678192139,-3.5413312911987305]"
356,"[-3.7432684898376465,2.5020227432250977,-2.216348648071289,-0.7881428003311157,-2.4738593101501465,-3.0312352180480957,-4.331355571746826,-3.506458044052124,0.7487499117851257,-3.7879791259765625,0.40928640961647034,4.914693355560303,1.905363917350769,2.227639675140381,0.6767667531967163,-3.2117021083831787]"
593,"[-4.163858890533447,2.1928892135620117,-3.188163757

In [53]:
# Switch to the movie_embeddings database
chs.query("CREATE DATABASE IF NOT EXISTS movie_embeddings ENGINE = Atomic")
chs.query("USE movie_embeddings")
chs.query('DROP TABLE IF EXISTS embeddings')
chs.query('DROP TABLE IF EXISTS embeddings_with_title')


chs.query("""CREATE TABLE embeddings (
      movieId UInt32 NOT NULL,
      embedding Array(Float32) NOT NULL
  ) ENGINE = MergeTree()
  ORDER BY movieId""")

print("Inserting movie embeddings into the database")
chs.query("INSERT INTO embeddings FROM INFILE 'movie_embeddings.csv' FORMAT CSV")
print(chs.query('SELECT * FROM embeddings LIMIT 5'))

# print(chs.query("SELECT * FROM movielens.movies LIMIT 5"))

# print(chs.query("""SELECT e.movieId,
#        m.title,
#        e.embedding
# FROM embeddings AS e
# JOIN movielens.movies AS m ON e.movieId = m.movieId
# LIMIT 5"""))

# Join the embeddings table with the movies table to get the title
chs.query("""CREATE TABLE embeddings_with_title (
        movieId UInt32 NOT NULL,
        title String NOT NULL,
        embedding Array(Float32) NOT NULL
    ) ENGINE = MergeTree()
ORDER BY movieId AS
SELECT e.movieId,
       m.title,
       e.embedding
FROM embeddings AS e
JOIN movielens.movies AS m ON e.movieId = m.movieId""")

print("Movie Id, Title, Embeddings")
print(chs.query('SELECT * FROM embeddings_with_title LIMIT 5'))


Inserting movie embeddings into the database
1,"[-5.5897427,2.357738,-0.5006245,0.1455938,-2.2831314,-0.87840086,-2.5551517,-3.6584768,-0.10373427,-4.9113913,-0.30503443,5.5157366,-1.5590336,5.983279,4.385269,-3.1866481]"
2,"[-5.338505,-1.5594655,-4.355125,0.069642186,1.8488991,0.33051878,-1.7176361,2.1713862,3.0727508,3.4173298,1.4632888,6.5680175,1.2017039,1.98483,4.0459323,-2.002859]"
3,"[-3.5466137,1.669789,1.8515323,0.06255668,1.3897773,-4.7042356,3.6903667,-3.7350867,4.38805,5.6368246,3.0906188,2.5778446,-1.5468398,0.23956613,5.3413,-1.9792594]"
4,"[-2.3924065,3.2029195,0.5825141,2.8336196,3.1721113,-1.7251801,4.581799,-1.233385,3.5758104,2.6230328,-0.72014546,1.3625188,-0.13874735,-5.3615384,3.1370401,-4.161427]"
5,"[-1.5131618,1.6629226,0.3938448,-0.31881937,3.5624661,-3.0411289,3.7884533,-4.752323,5.4196496,2.502349,3.7858236,4.6285796,-1.576958,-0.91796184,6.0343866,-0.4933876]"

Movie Id, Title, Embeddings
1,"Toy Story (1995)","[-5.5897427,2.357738,-0.5006245,0.1455938,-2.28

In [54]:
target_movieId = 318
topN = chs.query(f"""
          WITH
            {target_movieId} AS theMovieId,
            (SELECT embedding FROM embeddings_with_title WHERE movieId = theMovieId LIMIT 1) AS targetEmbedding
          SELECT
            movieId,
            title,
            cosineDistance(embedding, targetEmbedding) AS distance
            FROM embeddings_with_title
            WHERE movieId != theMovieId -- Not self
            ORDER BY distance ASC
            LIMIT 10
          """, "Pretty")
print(f"Scanned {topN.rows_read()} rows, "
      f"Top 10 similar movies to movieId {target_movieId} in {topN.elapsed()}")
print("Target Movie:")
print(chs.query(f"SELECT * FROM movielens.movies WHERE movieId={target_movieId}", "Pretty"))
print("Top10 Similar:")
print(topN)


Scanned 10 rows, Top 10 similar movies to movieId 318 in 0.037433266
Target Movie:
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ movieId ┃ title                            ┃ genres      ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│     318 │ Shawshank Redemption, The (1994) │ Crime|Drama │
└─────────┴──────────────────────────────────┴─────────────┘

Top10 Similar:
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ movieId ┃ title                            ┃   distance ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│     527 │ Schindler's List (1993)          │ 0.04840994 │
├─────────┼──────────────────────────────────┼────────────┤
│     593 │ Silence of the Lambs, The (1991) │ 0.06956363 │
├─────────┼──────────────────────────────────┼────────────┤
│      50 │ Usual Suspects, The (1995)       │ 0.10293013 │
├─────────┼──────────────────────────────────┼────────────┤
│     296