<a href="https://colab.research.google.com/github/YoshiyukiKono/semantic-text-search/blob/main/semantic_text_search-en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Text Search by Astra DB Vector Search

## Getting Started with Astra DB

1. Create a new ***vector search enabled database*** in [Astra](https://astra.datastax.com/).
1. Create a keyspace (`demo`)
1. Get an application token

We will create a table and an index in this walkthrough.




## Data Set and Sentence Transformers

First, we prepare the data set used for this demo and set up the tool that embeds the data into vectors.

To begin we must install the required prerequisite libraries:


In [None]:
!pip install -U \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.12.0
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.12.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets==2.12.0)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

### Data Preprocessing
The dataset preparation process requires a few steps:

1. We download the Quora dataset from Hugging Face Datasets.
2. The text content of the dataset is embedded into vectors.



In [None]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:320000]')
dataset

Downloading and preparing dataset quora/default to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04...


Dataset quora downloaded and prepared to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04. Subsequent calls will reuse this data.


Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 80000
})

The full Quora dataset contains ~400K pairs of natural language questions (404,290 in the train split); the slice loaded above covers 80,000 of them.

In [None]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

Whether or not the questions are duplicates is not important here; all we need for this example is the text itself. We can extract all of the texts into a single `questions` list.

In [None]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])

# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

How do I prepare for Microsoft’s 98-379 software testing fundamental certification exam?
What is the best book to learn React?
How do I become a great quizzer?
Can one be an extrovert but also an introvert?
Is it OK to quit my job and take a break of say three months refresh my skills, learn new thing and still find a job after that?
136057
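
Note that `list(set(...))` does not preserve order, so the five questions printed above can differ from run to run. If a stable order matters, one option is an order-preserving de-duplication. The following is a minimal sketch; the helper name `unique_in_order` and the sample strings are made up for illustration:

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(items))


sample = ['What is AI?', 'What is ML?', 'What is AI?']
print(unique_in_order(sample))  # ['What is AI?', 'What is ML?']
```

Applied to the notebook, this would replace `list(set(questions))` with `unique_in_order(questions)`; the resulting count stays the same.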


### Building Embeddings

To create our embeddings we will use the `all-MiniLM-L6-v2` sentence transformer model, an efficient semantic similarity embedding model from the sentence-transformers library. We initialize it like so:

In [None]:
from sentence_transformers import SentenceTransformer

device = 'cpu'

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

There are three interesting bits of information in the above model printout:

 - `max_seq_length` is `256`. At most `256` tokens (word pieces, roughly words) can be encoded into a single vector embedding; anything beyond that is truncated.

 - `word_embedding_dimension` is `384`. This is the dimensionality of the vectors output by the model. We will need this number later when defining the vector column in our Astra DB vector-enabled database.

 - `Normalize()`. This final step normalizes every vector the model produces to unit length. As a result, wherever we would typically use cosine similarity we can also use the dot product similarity metric: with normalized vectors, cosine and dot product are equivalent.
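
The equivalence of cosine and dot product for normalized vectors can be checked with a few lines of plain Python. This is only an illustration: the two toy vectors and the helper names are made up and are not output of the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Two arbitrary toy vectors (hypothetical values, not real embeddings).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# Both norms are 1, so the denominator of the cosine is 1 as well.
print(math.isclose(dot(a, b), cosine(a, b)))  # True
```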

Moving on, we can create a sentence embedding using this model like so:

In [None]:
query = 'which city is the most populated in the world?'

xq = model.encode(query)
xq.shape

(384,)

In [None]:
query = 'Is it true that the coordinaate of a point on x-axis can be taken as (y,0) while on y-axis it can be taken as (0,x)?'

xq = model.encode(query)
xq.shape

(384,)

We will use this model to embed all questions when upserting them to Astra DB.

In [None]:
def get_embeddings(text):
  return model.encode(text).tolist()

## Astra DB Connection

### Cassandra Driver Install

In [None]:
!pip install cassandra-driver

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cassandra-driver
  Downloading cassandra_driver-3.28.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB)
Collecting geomet<0.3,>=0.1 (from cassandra-driver)
  Downloading geomet-0.2.1.post1-py3-none-any.whl (18 kB)
Installing collected packages: geomet, cassandra-driver
Successfully installed cassandra-driver-3.28.0 geomet-0.2.1.post1


In [None]:
import cassandra

print(cassandra.__version__)

3.28.0


### Astra DB Security Settings


If you like, you can download your secure connect bundle from Astra directly into your Colab environment (**please modify the cell below**). Note that the download URL is not static: the link expires shortly after it is generated (the example below carries `X-Amz-Expires=300`), so you will need to copy a fresh URL whenever you run this demo again in a later Colab session.

In [None]:
!wget -O secure-connect-demo.zip "https://datastax-cluster-config-prod.s3.us-east-2.amazonaws.com/d5556151-ea9a-4309-8be3-b8ea2b1cd03d-1/secure-connect-demo.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA2AIQRQ76S2JCB77W%2F20230620%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230620T072829Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=a7891f268e675a05d8af328d414f9258e8812bfc4b7d2f0f83f6160486bc624f"

--2023-06-20 07:28:56--  https://datastax-cluster-config-prod.s3.us-east-2.amazonaws.com/d5556151-ea9a-4309-8be3-b8ea2b1cd03d-1/secure-connect-demo.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA2AIQRQ76S2JCB77W%2F20230620%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230620T072829Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=a7891f268e675a05d8af328d414f9258e8812bfc4b7d2f0f83f6160486bc624f
Resolving datastax-cluster-config-prod.s3.us-east-2.amazonaws.com (datastax-cluster-config-prod.s3.us-east-2.amazonaws.com)... 52.219.98.18
Connecting to datastax-cluster-config-prod.s3.us-east-2.amazonaws.com (datastax-cluster-config-prod.s3.us-east-2.amazonaws.com)|52.219.98.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12247 (12K) [application/zip]
Saving to: ‘secure-connect-demo.zip’


2023-06-20 07:28:56 (168 MB/s) - ‘secure-connect-demo.zip’ saved [12247/12247]



Modify the following variables to match your environment. The client ID and client secret come from the application token you generated in Astra.

In [None]:
SECURE_CONNECT_BUNDLE_PATH = 'secure-connect-demo.zip'
ASTRA_CLIENT_ID = 'XXX'
ASTRA_CLIENT_SECRET = 'XXX'
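
Hard-coding the client ID and secret makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the values could be read from environment variables. The helper `read_astra_settings` and the variable name `ASTRA_BUNDLE_PATH` are hypothetical and not part of any Astra tooling; the other two names simply mirror the variables above.

```python
import os


def read_astra_settings(env=os.environ):
    """Collect connection settings from a mapping of environment variables.

    The variable names are assumptions made for this sketch; use any names you like.
    """
    settings = {
        'bundle_path': env.get('ASTRA_BUNDLE_PATH', 'secure-connect-demo.zip'),
        'client_id': env.get('ASTRA_CLIENT_ID'),
        'client_secret': env.get('ASTRA_CLIENT_SECRET'),
    }
    # Fail early with a clear message instead of a later authentication error.
    missing = [key for key in ('client_id', 'client_secret') if not settings[key]]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return settings
```

With the variables set, `settings = read_astra_settings()` returns a dictionary whose values can be assigned to `SECURE_CONNECT_BUNDLE_PATH`, `ASTRA_CLIENT_ID`, and `ASTRA_CLIENT_SECRET`.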

### Connection

In [None]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config= {
  'secure_connect_bundle': SECURE_CONNECT_BUNDLE_PATH
}
auth_provider = PlainTextAuthProvider(ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

row = session.execute("select release_version from system.local").one()
if row:
  print(row[0])
else:
  print("An error occurred.")

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(140367846588272) d5556151-ea9a-4309-8be3-b8ea2b1cd03d-us-east1.db.astra.datastax.com:29042:30b8d54f-3fa9-4a59-8414-5c696c28857d> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


4.0.7-e86f91c568f8


In [None]:
session.set_keyspace('demo')
session

<cassandra.cluster.Session at 0x7fa9ef6200d0>

## Vector Search powered by Astra DB

### Environment Preparation

We will create a table and an index for the demo.

In [None]:
session.execute(f"""CREATE TABLE IF NOT EXISTS demo.questions
(id uuid,
 question text,
 question_embedding vector<float, 384>,

 PRIMARY KEY (id))""")

<cassandra.cluster.ResultSet at 0x7fa9ef66d900>

In [None]:
session.execute(f"""CREATE CUSTOM INDEX IF NOT EXISTS vector_search_index ON demo.questions (question_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'""")

<cassandra.cluster.ResultSet at 0x7fa9ef6039d0>
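
The `384` in the table definition has to match the embedding dimension reported by the model. As a minimal sketch, the two statements could be generated from that number so that switching models only requires changing one value. The helper `build_schema_cql` is hypothetical; the CQL it produces is the same as in the cells above.

```python
def build_schema_cql(keyspace, table, dimension):
    """Return the CREATE TABLE and CREATE INDEX statements for a vector size."""
    if not isinstance(dimension, int) or dimension <= 0:
        raise ValueError('dimension must be a positive integer')
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} "
        f"(id uuid, question text, "
        f"question_embedding vector<float, {dimension}>, "
        f"PRIMARY KEY (id))"
    )
    create_index = (
        f"CREATE CUSTOM INDEX IF NOT EXISTS vector_search_index "
        f"ON {keyspace}.{table} (question_embedding) "
        f"USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'"
    )
    return create_table, create_index


table_cql, index_cql = build_schema_cql('demo', 'questions', 384)
print(table_cql)
print(index_cql)
```

In the notebook, the dimension could be taken from the model with `model.get_sentence_embedding_dimension()` and each returned string passed to `session.execute`.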

Before registering the demo data set, let's just check the created table and index using a sample record.

In [None]:
question = 'Is it true that the coordinaate of a point on x-axis can be taken as (y,0) while on y-axis it can be taken as (0,x)?'
embedding = get_embeddings(question)
embedding

In [None]:
question

'Is it true that the coordinaate of a point on x-axis can be taken as (y,0) while on y-axis it can be taken as (0,x)?'

In [None]:
from cassandra.query import SimpleStatement
query = SimpleStatement(
                f"""
                INSERT INTO demo.questions
                (id, question, question_embedding)
                VALUES (now(), %s, %s)
                """
            )
session.execute(query,(question, embedding))


<cassandra.cluster.ResultSet at 0x7fa9ef66e080>

In [None]:
query = SimpleStatement(
    f"""
    SELECT id, question, question_embedding
    FROM demo.questions
    ORDER BY question_embedding ANN OF {embedding} LIMIT 5;
    """
    )

In [None]:
results = session.execute(query)
# Iterate the ResultSet itself rather than reading its private _current_rows attribute.
top_5_questions = list(results)

for row in top_5_questions:
  print(f"""{row.id}, {row.question}, {row.question_embedding}\n""")
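
Since the same query shape is used again in the search demo, it may be convenient to wrap it in a small helper. The sketch below is an assumption-laden illustration rather than part of the original notebook: `find_similar` is a hypothetical name, it expects a `session` object that behaves like the driver session created earlier, and it leaves the embedding column out of the `SELECT` so the printed results stay readable.

```python
def find_similar(session, embedding, limit=5):
    """Run an ANN query and return (id, question) pairs.

    `session` only needs an `execute(cql)` method that returns rows
    with `id` and `question` attributes.
    """
    # Coerce every value to float so only numeric literals reach the CQL text.
    vector_literal = '[' + ', '.join(repr(float(x)) for x in embedding) + ']'
    cql = (
        'SELECT id, question FROM demo.questions '
        f'ORDER BY question_embedding ANN OF {vector_literal} LIMIT {int(limit)}'
    )
    return [(row.id, row.question) for row in session.execute(cql)]
```

A call such as `find_similar(session, get_embeddings('How do I promote my e-commerce website?'))` would then return the five nearest questions.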

### Data Registration

**PLEASE NOTE:** Running the following cell can take a couple of hours. To shorten it, slice the `questions` list (for example `questions[:N]`). Keep in mind, though, that the point of the demo is to use a reasonably large amount of data (similar to the Pinecone sample) in order to show the power of Astra DB vector search.

In [None]:
# Build the statement once; it does not change between rows.
query = SimpleStatement(
    """
    INSERT INTO demo.questions
    (id, question, question_embedding)
    VALUES (now(), %s, %s)
    """
)

for question in questions:
  print(question)
  embedding = get_embeddings(question)
  session.execute(query, (question, embedding))
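
Much of the running time goes into encoding one question at a time. `model.encode` also accepts a list of sentences and returns one vector per sentence, so a possible variation is to embed the questions in batches and then insert the rows. The following is a rough sketch under that assumption; the helper names and the batch size of 64 are made up for illustration, and `session` and `encode` are passed in so the function does not depend on notebook globals.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def register_questions(session, encode, questions, batch_size=64):
    """Embed questions batch by batch and insert them one row at a time.

    `encode` is any callable mapping a list of strings to a sequence of
    vectors (for example `model.encode`); `session` is the driver session.
    Returns the number of rows inserted.
    """
    insert_cql = (
        'INSERT INTO demo.questions (id, question, question_embedding) '
        'VALUES (now(), %s, %s)'
    )
    inserted = 0
    for batch in iter_batches(questions, batch_size):
        vectors = encode(batch)
        for question, vector in zip(batch, vectors):
            # Convert to plain Python floats before handing the row to the driver.
            session.execute(insert_cql, (question, [float(x) for x in vector]))
            inserted += 1
    return inserted
```

In the notebook this would be called as `register_questions(session, model.encode, questions)`. The inserts themselves are still sequential, so this only reduces the embedding part of the work.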

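Finally, you should see about 136K rows in your table: the 136,057 unique questions plus any sample records inserted while checking the table. In the run shown here, the count was 136061.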
```
token@cqlsh:demo> select count(*) FROM questions;

 count
--------
 136061

(1 rows)
```

### Vector Search Demo

In [None]:
question = 'How do I promote my e-commerce website?'
embedding = get_embeddings(question)
embedding

In [None]:
query = SimpleStatement(
    f"""
    SELECT id, question, question_embedding
    FROM demo.questions
    ORDER BY question_embedding ANN OF {embedding} LIMIT 5;
    """
    )

In [None]:
results = session.execute(query)
# Iterate the ResultSet itself rather than reading its private _current_rows attribute.
top_5_questions = list(results)

for row in top_5_questions:
  print(f"""{row.id}, {row.question}, {row.question_embedding}\n""")

3acbeca0-0f4f-11ee-993d-ffc8cc4ee2cb, How do I promote my e-commerce website?, [0.006989856716245413, -0.04213973507285118, -0.07815293222665787, -0.009970189072191715, 0.0773971825838089, 0.041939426213502884, -0.006748857907950878, 0.03079938143491745, -0.07842333614826202, -0.032576367259025574, 0.05655835196375847, -0.03520318120718002, 0.07220874726772308, 0.02828521467745304, 0.07695522904396057, -0.05429326370358467, 0.009619835764169693, 0.05280487239360809, -0.01875203847885132, -0.1190795823931694, 0.014776432886719704, -0.0002378705976298079, 0.06035558879375458, 0.0033046738244593143, -0.06253878772258759, -0.04962963983416557, 0.0070444573648273945, 0.05366763100028038, -0.00863986648619175, -0.11296579241752625, 0.040419481694698334, -0.07675008475780487, 0.08271502703428268, 0.048342011868953705, -0.023130204528570175, 0.01384661439806223, -0.03671254962682724, -0.1086340919137001, -0.042044661939144135, 0.025294043123722076, -0.009149031713604927, -0.09623941034078598, 

## Cleanup

In [None]:
session.execute(f"""DROP TABLE IF EXISTS demo.questions""")

<cassandra.cluster.ResultSet at 0x7fce1d882620>