<a href="https://colab.research.google.com/github/YoshiyukiKono/semantic-text-search/blob/main/semantic_text_search_astrapy-en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Text Search with Astra DB Vector Search and the New AstraPy Interface

## Prerequisites

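### OpenAI API Access

An OpenAI API key is required. It is entered later in this walkthrough and used to generate the embeddings.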

### Astra DB

1. Create a new ***vector search enabled database*** in [Astra](https://astra.datastax.com/).
1. Get an application token and API endpoint.

We will create a collection in the vector database in this walkthrough.




## Data Set

First, we prepare the data set used for this demo, along with the tooling used to embed the data into vectors.

To begin, install the required libraries:


In [2]:
!pip install -U \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

Collecting datasets==2.12.0
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.12.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting multiprocess (from datasets==2.12.0)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting responses<0.19 (from datasets==2.12.0)

### Data Preprocessing
The dataset preparation process requires a few steps:

1. We download the Quora dataset from Hugging Face Datasets.
2. The text content of the dataset is embedded into vectors.



In [3]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:320000]')
dataset

Downloading and preparing dataset quora/default to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04...

Dataset quora downloaded and prepared to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04. Subsequent calls will reuse this data.

Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 80000
})

The full Quora dataset contains about 400K pairs of natural language questions; the `train[240000:320000]` slice loaded above keeps 80,000 of them.

In [4]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

Whether the questions are duplicates is not important here; all we need for this example is the text itself. We can extract every question into a single `questions` list.

In [5]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])

# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

How should I download movies in my phone?
Doing excessive masturbation is a cause of less weight. How do I gain weight naturally?
I love TV and Film acting but I hate theatre acting, is there any way I can just audtion for TV and Film?
What are the worst things about being a teenage mother?
What happens if you get bitten by a redback spider? How do you treat a bite from a redback spider?
136057
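
Because `set` does not preserve insertion order, the five questions printed above can differ from run to run. Below is a minimal sketch of an order-preserving alternative, which may be useful when slices of `questions` should select the same items on every run. The `records` literal is a small sample copied from the output above (with one record repeated for illustration), standing in for `dataset['questions']`; the names `flat` and `unique_questions` are not part of the original notebook.

```python
# Sample records shaped like dataset['questions'] (copied from the output above)
records = [
    {'id': [207550, 351729],
     'text': ['What is the truth of life?', "What's the evil truth of life?"]},
    {'id': [37799, 94186],
     'text': ['What is a good way to lose 30 pounds in 2 months?',
              'What can I do to lose 30 pounds in 2 months?']},
    # A repeated record, added here only to illustrate the deduplication
    {'id': [207550, 351729],
     'text': ['What is the truth of life?', "What's the evil truth of life?"]},
]

# Flatten every pair of questions into one list
flat = []
for record in records:
    flat.extend(record['text'])

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order
unique_questions = list(dict.fromkeys(flat))

print(len(flat), len(unique_questions))
```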


## Astra DB Connection

### Cassandra Driver Install

In [11]:
!pip install cassandra-driver

Collecting cassandra-driver
  Downloading cassandra_driver-3.28.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB)
Collecting geomet<0.3,>=0.1 (from cassandra-driver)
  Downloading geomet-0.2.1.post1-py3-none-any.whl (18 kB)
Installing collected packages: geomet, cassandra-driver
Successfully installed cassandra-driver-3.28.0 geomet-0.2.1.post1


In [10]:
import cassandra

print(cassandra.__version__)

3.28.0


In [12]:
!pip install astrapy --upgrade

Collecting astrapy
  Downloading astrapy-0.6.2-py3-none-any.whl (21 kB)
Collecting cassio~=0.1.3 (from astrapy)
  Downloading cassio-0.1.3-py3-none-any.whl (40 kB)
Collecting httpx[http2]~=0.25.1 (from astrapy)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)
Collecting httpcore==1.* (from httpx[http2]~=0.25.1->astrapy)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h2<5,>=3 (from httpx[http2]~=0.25.1->astrapy)
  Downloading h2-4.1.0-py3-none-any.whl (57 kB)
Collecting h11<0.15,>=0

In [13]:
import getpass

YOUR_TOKEN = getpass.getpass()

··········


In [14]:
YOUR_API_ENDPOINT = input("API ENDPOINT:")

API ENDPOINT:https://1014346a-a40c-4d1a-b1a3-78769cc72312-us-east1.apps.astra.datastax.com


In [15]:
from astrapy.db import AstraDB

# Initialization
db = AstraDB(
  token=YOUR_TOKEN,
  api_endpoint=YOUR_API_ENDPOINT)

print(f"Connected to Astra DB: {db.get_collections()}")

Connected to Astra DB: {'status': {'collections': ['semantics']}}
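
The response printed above is a plain dictionary, so it can be inspected to check whether a collection already exists. Below is a minimal sketch, assuming the response keeps the `{'status': {'collections': [...]}}` shape shown in the output. `collection_exists` is a hypothetical helper, not part of AstraPy.

```python
def collection_exists(response: dict, name: str) -> bool:
    """Return True if `name` is listed in a get_collections()-style response."""
    # Fall back to empty containers so a missing key does not raise
    collections = response.get("status", {}).get("collections", [])
    return name in collections


# Response copied from the output above
response = {'status': {'collections': ['semantics']}}

print(collection_exists(response, "semantics"))
print(collection_exists(response, "other"))
```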


In [16]:
!pip install langchain==0.0.340

Collecting langchain==0.0.340
  Downloading langchain-0.0.340-py3-none-any.whl (2.0 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.0.340)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain==0.0.340)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.63 (from langchain==0.0.340)
  Downloading langsmith-0.0.69-py3-none-any.whl (48 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain==0.0.340)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4

In [17]:

!pip install openai==1.3.5 tiktoken==0.5.1 cohere==4.36

Collecting openai==1.3.5
  Downloading openai-1.3.5-py3-none-any.whl (220 kB)
Collecting tiktoken==0.5.1
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting cohere==4.36
  Downloading cohere-4.36-py3-none-any.whl (48 kB)
Collecting backoff<3.0,>=2.0 (from cohere==4.36)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting fastavro==1.8.2 (from cohere==4.36)
  Downloading fastavro-1.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)

In [18]:

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass()

··········


In [19]:

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment
client = OpenAI()
# List the available models to confirm that the key works
client.models.list()

SyncPage[Model](data=[Model(id='text-search-babbage-doc-001', created=1651172509, object='model', owned_by='openai-dev'), Model(id='curie-search-query', created=1651172509, object='model', owned_by='openai-dev'), Model(id='text-davinci-003', created=1669599635, object='model', owned_by='openai-internal'), Model(id='text-search-babbage-query-001', created=1651172509, object='model', owned_by='openai-dev'), Model(id='babbage', created=1649358449, object='model', owned_by='openai'), Model(id='babbage-search-query', created=1651172509, object='model', owned_by='openai-dev'), Model(id='text-babbage-001', created=1649364043, object='model', owned_by='openai'), Model(id='text-similarity-davinci-001', created=1651172505, object='model', owned_by='openai-dev'), Model(id='davinci-similarity', created=1651172509, object='model', owned_by='openai-dev'), Model(id='code-davinci-edit-001', created=1649880484, object='model', owned_by='openai'), Model(id='curie-similarity', created=1651172510, object=

In [20]:


from langchain.embeddings.openai import OpenAIEmbeddings

# Uses the OPENAI_API_KEY environment variable set above
embedding_function = OpenAIEmbeddings()

In [21]:
YOUR_COLLECTION_NAME = "semantics"

In [22]:
# Note: this import rebinds the name `AstraDB`, shadowing the astrapy class imported earlier
from langchain.vectorstores import AstraDB

vector_store = AstraDB(
  embedding=embedding_function,
  collection_name=YOUR_COLLECTION_NAME,
  api_endpoint=YOUR_API_ENDPOINT,
  token=YOUR_TOKEN,
)

In [33]:
print(len(questions))

136057


In [40]:
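# Pass a slice (a list of strings): questions[1000] would be a single string,
# which add_texts would iterate over character by character
vector_store.add_texts(questions[:1000])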

['ff17d872bfeb4d92b7d749140fffaed9',
 '20b8da4183b240be9726f1c24d65f958',
 '83672b9d139c4c769ad7280163cc713d',
 '82505ec1742c405ba0a07090c5100374',
 '5b348876d8ad4523a2e6add48d002e3c',
 'e3f4938d43b7452f929a6a45d5c07ef4',
 '867dc3237ecb4c3fada4fac59160f3a2',
 'f7b02ee3d5674ec69b5c1669805604ea',
 '3c41bfd444fa4a848417664320667e52',
 '1b49b09bcca94a859c13d416f4297ba3',
 '308e0c922b9c4bb5a43e56545bf49599',
 '596a32d6b86a49d2a8426e46094dfb47',
 'fc213923932e4028aaad4b4981208c9b',
 'bc6595227cd1423891576a61e55f1203',
 'db3591f8a9c34b1094208ed1e4d5665d',
 '741c3d11c4dc49a0acca7f6567dd01c6',
 'bc8a132ce32c4159a8833bd25eca9b9c',
 '17e5c5431d3043c38a53f3cc88aae049',
 '624efca5e8684f4f8f70182297932c8c',
 '47ead5a349744f0089ffdb8cf4e2a595',
 '125a061050b84e7981bf49e2314e7427',
 '925e97e4c7e04d178cbd58717f052686',
 'ecce3dd44b0c41728e8cb37fd5004f69',
 '99aadb9d12f24e5aa7df33532b9c5aba',
 '7f133368a9984f87a734c3a9a8758f4c',
 'fa18248cf0f84887a8b3047f0307c272',
 'dfb396a4959a468488b3b8858c0a00ca',
 

In [41]:
# Asynchronous variant; pass a slice (a list of strings), not a single element
await vector_store.aadd_texts(questions[1000:10000])

['ffb48fe2d00c4d6fb7e7766477a6cc5d',
 '8fa38683ef9b43bc991d069100c828a0',
 'd39713cc42834b71b186418e75f4a749',
 '0007163b316042669747976c34fbb0cc',
 'da8917ad48064c2588df5042bcbd9db4',
 '9c1b89b2104143a5b35d1753e9f2067d',
 '853d1cb2e0c54276811e0850062d43c7',
 'f6ceed58255b48ca8042bd1a15b97756',
 '03a3b0c73e3f410994456db05a74366f',
 '0bb8112f72e64b40bdf3332acb5bd335',
 'a05b4834476b41799a38bd1ee57509bb',
 '29d6ca0641404e6eb33bd09e8f61f039',
 '767907008cef4873934e7567c1ce2b15',
 '4b6014a1f47c45e0995a839cf56f6f40',
 '737303d26f2b4da79df9c6cb78839026',
 'a37ed645edf840e2af0fca32cd12f227',
 '1d526c7718d04916bf2416c8a45e1ec4',
 'd441c2ff84e54b948d8135fc85cee913',
 '63be4417cf57429284b9b2ed9939f750',
 '148c2404a33e41fdaf36583c5033f40e',
 '65459f0e96624da59dfe7173c4b2f8d5',
 '4acefe6ebb72423dac55cc72ec8c5a34',
 'be87bd8876dd41d8810b0e615e9d6950',
 'd5af8006912b472f9863bf2cd0fd2856',
 'e65d32d4df264daeba87fdb21cac34b0',
 '0656cb1bc83847949e9e5813cf5fbbc3',
 'cb633d0af4e74089a4e38f00dfcbc212',
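
Embedding and inserting many texts in a single call may run into API rate limits or timeouts, so one option is to feed the list in fixed-size batches. Below is a minimal sketch, using a batch size of 3 purely for illustration. `batched` is a hypothetical helper, and `fake_add_texts` is a stand-in for `vector_store.add_texts` so that the example runs without credentials.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_add_texts(texts: List[str]) -> List[str]:
    """Stand-in for vector_store.add_texts: returns one made-up ID per text."""
    return [f"id-for-{text}" for text in texts]


sample = [f"question {i}" for i in range(7)]

ids = []
for batch in batched(sample, 3):
    # In the notebook this call would be vector_store.add_texts(batch)
    ids.extend(fake_add_texts(batch))

print([len(batch) for batch in batched(sample, 3)])
print(len(ids))
```

Inserting in batches also means a failure affects only one batch, which can then be retried without resending the rest.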
 

In [25]:
query = 'Is it true that the coordinate of a point on x-axis can be taken as (y,0) while on y-axis it can be taken as (0,x)?'

In [38]:
result = vector_store.similarity_search(query)

In [39]:
result

[Document(page_content="Can anyone clear up my confusion as to how the zero vector is in Nul(A)? Aren't the vectors in Nul(A) the non-zero vectors that make some matrix A equal to 0?"),
 Document(page_content='What does the straight line of PPC mean? Do the appear on the to lie along the straight line?'),
 Document(page_content='What is the difference between (–3) square and -3 square?'),
 Document(page_content='Why do people say the earth is flat when we have seen photos of the earth (which is round)?')]
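
`similarity_search` returns the stored documents whose embeddings are closest to the query embedding; LangChain's default is the top 4, which matches the four results above. The ranking idea can be sketched with cosine similarity on toy vectors. This is an illustration only: the three-dimensional vectors and labels are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and whether the collection actually uses cosine similarity depends on how it was configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=4):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings: real ones have far more dimensions
docs = [
    ("geometry question", [0.9, 0.1, 0.0]),
    ("cooking question", [0.0, 0.2, 0.9]),
    ("algebra question", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for text, score in top_k(query_vec, docs, k=2):
    print(f"{score:.3f}  {text}")
```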