<a href="https://colab.research.google.com/github/YoshiyukiKono/semantic-text-search/blob/main/semantic_text_search_ragstack-ai-en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Text Search by Astra DB Vector Search with new AstraPy interface

## Prerequisites

### OpenAI API Access

### Astra DB

1. Create a new ***vector search enabled database*** in [Astra](https://astra.datastax.com/).
1. Get an application token and API endpoint.

We will create a collection in the vector database in this walkthrough.




## Data Set

First, we walk through preparing the data set used for this demo and the tool used to embed the data into vectors.

To begin, install the required libraries:


In [2]:
!pip install ragstack-ai



In [2]:
!pip install ragstack-ai==0.2.0



In [3]:
import cassandra
print(cassandra.__version__)

3.28.0


In [4]:
!pip show ragstack-ai

Name: ragstack-ai
Version: 0.2.0
Summary: RAGStack
Home-page: 
Author: DataStax
Author-email: 
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: astrapy, cassio, langchain, langchain-core, llama-index, unstructured
Required-by: 


In [5]:
import langchain

print(langchain.__version__)

0.0.343


In [7]:
import langsmith

print(langsmith.__version__)

0.0.69


In [6]:
import numpy
import openai

print(numpy.__version__)
print(openai.__version__)

1.23.5
1.3.8


In [5]:
!pip install -U \
  datasets==2.12.0 \
  sentence-transformers==2.2.2



### Data Preprocessing
The dataset preparation process requires a few steps:

1. We download the Quora dataset from Hugging Face Datasets.
2. The text content of the dataset is embedded into vectors.



In [6]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:320000]')
dataset



Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 80000
})

The full Quora dataset contains ~400K pairs of natural language questions; the slice loaded here covers 80,000 of those pairs.

In [7]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

Whether the questions are duplicates is not important here; all we need for this example is the text itself. We can extract all of the question texts into a single `questions` list.

In [8]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])

# remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

Why do guys feel weak after sex?
What is the original meaning of the phrase "ever and anon"?
How do find a gay businessman?
If there were an index to indicate the all around usefulness of Spanish, French, German and Italian, which would be the highest scoring language for an English speaker who would be choosing a second language to learn?
How much is the President of the USA paid?
136057
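
Note that `list(set(...))` does not preserve the original order, and for strings the resulting order can differ between Python sessions, so the first five questions printed on another run may not match the output above. As a minimal sketch, assuming a reproducible order is wanted, `dict.fromkeys` removes duplicates while keeping first-seen order. The helper name `dedupe_keep_order` and the sample list are illustrative and not part of the original notebook.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


sample = [
    "What is the truth of life?",
    "How do you graph x + 2y = -2?",
    "What is the truth of life?",
]
unique_sample = dedupe_keep_order(sample)
print(unique_sample)
```

Applied to this notebook, the call would be `questions = dedupe_keep_order(questions)` in place of `list(set(questions))`.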


## Astra DB Connection

### Credentials and Connection

In [9]:
import getpass

YOUR_TOKEN = getpass.getpass()

··········


In [10]:
YOUR_API_ENDPOINT = input("API ENDPOINT:")

API ENDPOINT:https://1014346a-a40c-4d1a-b1a3-78769cc72312-us-east1.apps.astra.datastax.com


In [11]:
from astrapy.db import AstraDB

# Initialization
db = AstraDB(
  token=YOUR_TOKEN,
  api_endpoint=YOUR_API_ENDPOINT)

print(f"Connected to Astra DB: {db.get_collections()}")

Connected to Astra DB: {'status': {'collections': ['semantics', 'semantics_ragstack']}}


In [12]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass()

··········


In [13]:
import openai
from openai import OpenAI

client = OpenAI()
client.models.list()

SyncPage[Model](data=[Model(id='text-search-babbage-doc-001', created=1651172509, object='model', owned_by='openai-dev'), Model(id='curie-search-query', created=1651172509, object='model', owned_by='openai-dev'), Model(id='text-davinci-003', created=1669599635, object='model', owned_by='openai-internal'), Model(id='text-search-babbage-query-001', created=1651172509, object='model', owned_by='openai-dev'), Model(id='babbage', created=1649358449, object='model', owned_by='openai'), Model(id='babbage-search-query', created=1651172509, object='model', owned_by='openai-dev'), Model(id='text-babbage-001', created=1649364043, object='model', owned_by='openai'), Model(id='text-similarity-davinci-001', created=1651172505, object='model', owned_by='openai-dev'), Model(id='davinci-similarity', created=1651172509, object='model', owned_by='openai-dev'), Model(id='code-davinci-edit-001', created=1649880484, object='model', owned_by='openai'), Model(id='curie-similarity', created=1651172510, object=

In [14]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding_function = OpenAIEmbeddings()

In [15]:
YOUR_COLLECTION_NAME = "semantics_ragstack"

In [16]:
from langchain.vectorstores import AstraDB

vector_store = AstraDB(
  embedding=embedding_function,
  collection_name=YOUR_COLLECTION_NAME,
  api_endpoint=YOUR_API_ENDPOINT,
  token=YOUR_TOKEN,
)

In [17]:
print(len(questions))

136057


In [None]:
#vector_store.add_texts()

['ff17d872bfeb4d92b7d749140fffaed9',
 '20b8da4183b240be9726f1c24d65f958',
 '83672b9d139c4c769ad7280163cc713d',
 '82505ec1742c405ba0a07090c5100374',
 '5b348876d8ad4523a2e6add48d002e3c',
 'e3f4938d43b7452f929a6a45d5c07ef4',
 '867dc3237ecb4c3fada4fac59160f3a2',
 'f7b02ee3d5674ec69b5c1669805604ea',
 '3c41bfd444fa4a848417664320667e52',
 '1b49b09bcca94a859c13d416f4297ba3',
 '308e0c922b9c4bb5a43e56545bf49599',
 '596a32d6b86a49d2a8426e46094dfb47',
 'fc213923932e4028aaad4b4981208c9b',
 'bc6595227cd1423891576a61e55f1203',
 'db3591f8a9c34b1094208ed1e4d5665d',
 '741c3d11c4dc49a0acca7f6567dd01c6',
 'bc8a132ce32c4159a8833bd25eca9b9c',
 '17e5c5431d3043c38a53f3cc88aae049',
 '624efca5e8684f4f8f70182297932c8c',
 '47ead5a349744f0089ffdb8cf4e2a595',
 '125a061050b84e7981bf49e2314e7427',
 '925e97e4c7e04d178cbd58717f052686',
 'ecce3dd44b0c41728e8cb37fd5004f69',
 '99aadb9d12f24e5aa7df33532b9c5aba',
 '7f133368a9984f87a734c3a9a8758f4c',
 'fa18248cf0f84887a8b3047f0307c272',
 'dfb396a4959a468488b3b8858c0a00ca',
 

In [None]:

#await vector_store.aadd_texts(questions)

I noticed that running the cell above (before it was commented out) crashed the session with the following message.

```
Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro.

```
No data was written to Astra DB, although an empty collection was created.

So I restarted from the beginning, after commenting out the cell above.
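
One way to avoid submitting all 136,057 questions in a single call is to insert them in fixed-size batches. The following is a minimal sketch, not the method used in this notebook: `iter_batches` and `add_in_batches` are hypothetical helpers, the batch size of 500 is an arbitrary assumption, and `RecordingStore` is a stand-in that exists only so the example runs without a database. Whether batching alone avoids the RAM crash was not verified here.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def add_in_batches(store, texts: Sequence[str], batch_size: int = 500) -> List[str]:
    """Call store.add_texts once per batch and collect the returned IDs."""
    ids: List[str] = []
    for batch in iter_batches(texts, batch_size):
        ids.extend(store.add_texts(batch))
    return ids


class RecordingStore:
    """Stand-in for a vector store (assumption): records batch sizes, returns fake IDs."""

    def __init__(self) -> None:
        self.batch_sizes: List[int] = []
        self._count = 0

    def add_texts(self, texts: List[str]) -> List[str]:
        self.batch_sizes.append(len(texts))
        ids = [f"id-{self._count + i}" for i in range(len(texts))]
        self._count += len(texts)
        return ids


demo_store = RecordingStore()
demo_ids = add_in_batches(demo_store, [f"question {i}" for i in range(1050)], batch_size=500)
print(demo_store.batch_sizes)  # [500, 500, 50]
```

Any object that exposes `add_texts` and accepts a list of strings could be passed in place of `RecordingStore`, for example `add_in_batches(vector_store, questions)`.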

In [18]:
await vector_store.aadd_texts(questions[10])

['24255316b06f4b2aa203b8a628c7ba11',
 '80601ea54f0f44029750c8d623fe5d87',
 '091c101af90441b69a8ca148fbf3c3c3',
 '404a619ba5244c6c921fba72e1f3c5db',
 'da783c96d247491ba23311e13f96c9c6',
 '9a70c0c9ce654e25a189931e718ed9c6',
 '3ce4a238e6724f33b3d086d4508cda8a',
 '67589cb75995426a9b5b2003935eee25',
 '52488965af524889b2c375940ca8e1df',
 '6abe60603cdc4c25bd70b069a11e210c',
 '4cdf6a282a1a42698dec49fa4b282c8d',
 'b7c5043339484023bd8e1e6624cd7c28',
 '9a47e661ea0c44c9953949637d96ebdd',
 'b4a907eff55d4250a411510b75ee2fae',
 '26c90b4fec6a406db6a8e3404a0f7d5a',
 '7e721e8251424ce7a86c2c2c4896ecd8',
 '7b7e55dc1eee431c9a856d318cf69f09',
 '165199e374524f1eabbd6926df29718c',
 'ba6756e3a9fb4a65bf60e3d4809a9bd4',
 '093d65b0faf14fd5897c5e99484c54f0',
 '7bb76cca3aaa48adaa74b944e8b3042e',
 'f3d469b2b0194a8aabd8fb57c90ecd1d',
 'e13b5b513868427586fc36f9c5022332',
 '34b57c5386d1497b83584d32af7848d0',
 'e3c1a766c30c468183dfa985a41fe181',
 'b94cc8218be74c01afa85dc855c7c817',
 'dfea4b1a6daf48428b2ae8a21e8c4d83',
 

The above worked immediately.

In [19]:
await vector_store.aadd_texts(questions[100])

['e60a31fe15774ae2b7e8429ac4ec9017',
 '6177bfcccf0b4cd7aa78037659a5ce2d',
 '199f7a13cb6b4ef69ceed13692a745a8',
 'e710180546b4496ca111e8e06223bea7',
 'bbfa73f5a4fc472b9178d315ee8393b1',
 '7fd5a19b2c46435b84bf86b6accf7e45',
 '54c1750e098845c588730e39e259690f',
 'a1a54c767a094c748e9cf110c1846adb',
 '0fd8d56fc2404576a996d59c4c719330',
 'a7e25e055f1c46f1beabfff4b45ae0da',
 'c5b4890d61d04c6a8fd639f8e8b1e7e0',
 'f63769bf61004bed871ee33440b149e8',
 'b7fa39e446f741e0a67089b2e5decb9f',
 'd01dd6c82ddb4afdb64cda352a118f64',
 '428a650b481e44ac83055bbe950eeeea',
 'e028dc8680894907890bf447b20021a3',
 'eb9506b434ea4cddbfa325391d0ee026',
 '266cc54ccee148868c390a0bcbff9cc2',
 '5f54d87884ac4b25b1f36bacf054bab5',
 'c378f5cfe0414fcc8dbba75811a8a830',
 'dc27869d05f0445c83f300003e1fbc98',
 '5e2708a4b588467fad87b9a3342d4f57',
 'fc032f24c2e24a3eb7d6096251ac0958',
 'a21d735b4d4a439dbf3402faced4afc6',
 '50b12a0f6fe240efb4e4db6a9d182a5c',
 '76f5bcd32d1f4794874935f99473a847',
 '123bd2edcca44412a103efc68323f37c',
 

In [20]:
await vector_store.aadd_texts(questions[1000])

['ab74f02e6d6c451db9b43a18626e472d',
 '3efa23a9976b4a92ad097c0459a91109',
 'e34bb5fe4d5445c58c9bb18d14ce3554',
 '057d91a2706846c485bade64df1838aa',
 '558ee14f207c4137827b1140ea7dba80',
 '21d6a3d3c61741ccb436418a3c56d57a',
 '62ff4f37fb0c44b591795af6ef6dc312',
 'a17a70677a6c4faea1395852af0560af',
 '1936c0f363804b30af2e30569572abcd',
 '4f1badaa724646b3a2bc6b3710fea6af',
 '570bbb8c5f0f44388bc63eafd3b80786',
 '9c874022606d4a8f925487c6900722f1',
 'a01531e7945b410cac9f9b5df33a2a7b',
 'a918cfeea1444a40a84bfe3207b117e2',
 'e8df9ad2d8bb4cf587aea9e1e6ce217b',
 '6a4628bec5e3445e8485fd137c9c22b1',
 'a70537a1c6fa4609aaa94c1d29919cfc',
 '02340ebdbf1d4bdb976f2070c4ca34ae',
 'acb15e5e27bc4f96848a09ed000c9b9e',
 '90e794f51b804e3aba995d9fc7dcb1f8',
 '076c908905584f1ab786cbc82b5b3539',
 'e1971110d4964619a3d15538d669dbf0',
 'eddd7e8151034af0943ac4aa7072b4f7',
 '39fe3d4f3f9c4f91b4d38b5dbb4fe131',
 'ce044d66cd7d4b18932a22c033772005',
 '383ac316cd1d4e048fc3fee54de3abeb',
 'bb147ef55d8741078635911791b842e4',
 

The cell above finished quickly, but I observed only 201 records in the collection on the control plane.

In [21]:
query = 'Is it true that the coordinaate of a point on x-axis can be taken as (y,0) while on y-axis it can be taken as (0,x)?'

In [24]:
results = vector_store.similarity_search(query)

In [25]:
results

[Document(page_content='y'),
 Document(page_content='y'),
 Document(page_content='y'),
 Document(page_content='y')]

In [26]:
for result in results:
  print(result.page_content)

y
y
y
y


The returned content is broken: every document contains only a single character. The same workflow returned intact text when I ran it earlier with AstraPy and with the original Cassandra API, as you can see in the other notebooks in this repository.
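
One possible explanation, not confirmed against the collection contents, is the indexing used in the insert cells: `questions[10]` selects a single question (a `str`), and Python iterates over a `str` character by character, whereas `questions[:10]` selects the first ten questions as a list. If the store consumes its argument as a sequence of texts, each character would be stored as its own document. The sketch below illustrates the difference with made-up sample data; `ensure_text_list` is a hypothetical guard and not part of LangChain.

```python
from typing import Iterable, List, Union


def ensure_text_list(texts: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a bare string in a list so it is treated as one text, not as characters."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)


sample_questions = ["Why is the sky blue?", "What is a vector?", "How do magnets work?"]

as_index = list(sample_questions[1])   # indexing returns one str; list() splits it into characters
as_slice = list(sample_questions[:2])  # slicing returns a list of whole questions

print(as_index[:4])   # ['W', 'h', 'a', 't']
print(as_slice)       # ['Why is the sky blue?', 'What is a vector?']
print(ensure_text_list(sample_questions[1]))  # ['What is a vector?']
```

Under this assumption, the insert cells would use slices instead, for example `await vector_store.aadd_texts(questions[:10])`.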