## Astra DB's Hybrid Search and LLM Evaluation

This tutorial demonstrates how to build a system that efficiently identifies and evaluates the best job opportunities for candidates.

We will build a sample end-to-end use case that finds matching job postings with Astra DB's hybrid search (vector similarity combined with text search) and then evaluates the matches with an LLM.


## Install Dependencies

In [1]:
!pip install python-dotenv langchain openai sentence-transformers cassio tiktoken
!pip install cassandra-driver==3.28.0

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting langchain
  Downloading langchain-0.0.304-py3-none-any.whl (1.7 MB)
Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting cassio
  Downloading cassio-0.1.3-py3-none-any.whl (40 kB)
Collecting tiktoken
  Downloading tiktoken-0.5.1-c

## Importing necessary libraries and organizing configuration

Use the following `conf.env` template. The configuration cell reads `ASTRA_CLIENT_TOKEN`, so it is included here:
```
SECURE_CONNECT_BUNDLE_PATH=
ASTRA_CLIENT_ID=
ASTRA_CLIENT_SECRET=
ASTRA_CLIENT_TOKEN=
OPENAI_API_KEY=
```



In [1]:
# Config
import os
import pandas as pd
import numpy as np
import json
from dotenv import dotenv_values


from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from langchain.vectorstores import Cassandra
from langchain.schema.document import Document

from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

# Load settings from conf.env and set parameters for Astra DB
config = dotenv_values('conf.env')
astradb_token = config['ASTRA_CLIENT_TOKEN']  # token used to authenticate against Astra DB
ASTRA_DB_KEYSPACE = 'vector'
ASTRA_DB_TABLE_NAME = 'jobs'
astradb_secure_bundle_path = config['SECURE_CONNECT_BUNDLE_PATH']
api_key = config['OPENAI_API_KEY']
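
The cell above reads `ASTRA_CLIENT_TOKEN`, `SECURE_CONNECT_BUNDLE_PATH` and `OPENAI_API_KEY` from `conf.env`. A missing or empty entry would otherwise only surface later as a `KeyError` or a failed connection. The following is a minimal sketch of an up-front check; the helper name `find_missing_keys` is hypothetical and not part of python-dotenv, and the sample dictionary stands in for the result of `dotenv_values('conf.env')`.

```python
# Hypothetical helper: report which required settings are absent or blank.
REQUIRED_KEYS = ("ASTRA_CLIENT_TOKEN", "SECURE_CONNECT_BUNDLE_PATH", "OPENAI_API_KEY")


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are missing, None, or whitespace-only."""
    return [key for key in required if not (config.get(key) or "").strip()]


# In-memory stand-in for dotenv_values('conf.env'); the values are made up.
sample_config = {
    "ASTRA_CLIENT_TOKEN": "AstraCS:example",
    "SECURE_CONNECT_BUNDLE_PATH": "",
}
missing = find_missing_keys(sample_config)
print("Missing settings:", missing)
```

Calling `find_missing_keys(config)` right after loading the file and stopping when the list is non-empty keeps configuration problems separate from connection problems.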


## Prepare table schema

Navigate to the CQL Console in the Astra portal and run the table and index creation scripts below:
```
CREATE TABLE vector.jobs (
  job_id text PRIMARY KEY,
  job_title text,
  skills text,
  salary text,
  location text,
  embedding_vector vector<float, 1536>
);

CREATE CUSTOM INDEX IF NOT EXISTS ann_index
  ON vector.jobs(embedding_vector) USING 'StorageAttachedIndex';

CREATE CUSTOM INDEX ix_location ON vector.jobs(location) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = {
'index_analyzer': '{
        "tokenizer" : {"name" : "standard"},
        "filters" : [{"name" : "porterstem"}]
}'};


CREATE CUSTOM INDEX ix_salary ON vector.jobs(salary) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = {
'index_analyzer': '{
        "tokenizer" : {"name" : "standard"},
        "filters" : [{"name" : "porterstem"}]
}'};

```



## Configure AstraDB connection

In [2]:

cluster = Cluster(
    cloud={
        "secure_connect_bundle": astradb_secure_bundle_path,
    },
    auth_provider=PlainTextAuthProvider(
        "token", astradb_token
    ),
)

session = cluster.connect()
session.execute("use vector;")

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(137240490095040) 4a68c548-f1bd-4264-8934-bf4e8d2f85fe-us-east1.db.astra.datastax.com:29042:05d703a1-936c-48cc-9ed8-c195f51b1c07> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


<cassandra.cluster.ResultSet at 0x7cd1caa93430>

## Defining function for embedding texts

In [6]:
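import openai

openai.api_key = api_key

def generate_embedding(text):
    """Return the text-embedding-ada-002 embedding (1536 floats) for the given text."""
    model = "text-embedding-ada-002"
    # openai.Embedding is the pre-1.0 interface of the openai package (0.28.1 is installed above)
    response = openai.Embedding.create(model=model, input=text)
    return response.data[0]['embedding']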
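
The table stores embeddings in a `vector<float, 1536>` column, so every vector written to it needs exactly 1536 numeric entries. A minimal sketch of a guard that could run before each insert is shown below; `validate_embedding` and `EXPECTED_DIM` are hypothetical names introduced for illustration, not part of the openai or cassandra-driver packages.

```python
import math

EXPECTED_DIM = 1536  # matches vector<float, 1536> in the table definition


def validate_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return the vector unchanged, or raise ValueError if it cannot be stored."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("embedding contains non-numeric or non-finite values")
    return vector


# A dummy vector of the right size passes; a short one is rejected.
ok = validate_embedding([0.0] * EXPECTED_DIM)
try:
    validate_embedding([0.0] * 3)
except ValueError as err:
    print("Rejected:", err)
```

Wrapping the call as `validate_embedding(generate_embedding(text))` would surface a model or schema mismatch before the database rejects the write.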

## Loading a CSV file into the Astra vector database after creating embeddings for job_description

In [None]:
import csv
from cassandra.query import SimpleStatement

count = 0

# Open the input CSV file for reading
input_csv_file = 'jobs.csv'

# The INSERT statement is identical for every row, so build it once
insert_cql = f"""INSERT INTO {ASTRA_DB_KEYSPACE}.{ASTRA_DB_TABLE_NAME} (job_id,job_title,skills,salary,location,embedding_vector ) VALUES (%s, %s, %s,%s, %s, %s )"""
query = SimpleStatement(insert_cql)

try:
    with open(input_csv_file, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=';')
        next(csvreader)  # skip the header row

        for row in csvreader:
            count += 1
            # Slicing (row[i:i+1]) yields '' instead of raising IndexError when a column is missing
            job_title = ' '.join(row[0:1])
            skills = ' '.join(row[2:3])
            salary = ' '.join(row[3:4])
            location = ' '.join(row[4:5])
            combined_text = ' '.join(row[1:2])  # job description: the only column that is embedded
            print(count, row[0:1], row[2:3], row[3:4], row[4:5])
            embedding_res = generate_embedding(combined_text)
            print(insert_cql)
            # job_id is the 1-based row number, stored as text
            session.execute(query, (str(count), job_title, skills, salary, location, embedding_res))

except FileNotFoundError:
    print(f"File '{input_csv_file}' not found.")
except Exception as e:
    print(f"An error occurred: {str(e)}")

1 ['Software Engineer'] ['Java, Python, JavaScript, SQL, Agile'] ['Salary: $90,000 - $120,000 per year'] ['Location: San Francisco, CA']
INSERT INTO vector.jobs (job_id,job_title,skills,salary,location,embedding_vector ) VALUES (%s, %s, %s,%s, %s, %s )
2 ['Data Analyst'] ['SQL, Excel, Data Visualization, Statistics'] ['Salary: $70,000 - $90,000 per year'] ['Location: New York, NY']
INSERT INTO vector.jobs (job_id,job_title,skills,salary,location,embedding_vector ) VALUES (%s, %s, %s,%s, %s, %s )
3 ['Network Administrator'] ['Cisco, VPN, Network Security, Troubleshooting'] ['Salary: $75,000 - $100,000 per year'] ['Location: Los Angeles, CA']
INSERT INTO vector.jobs (job_id,job_title,skills,salary,location,embedding_vector ) VALUES (%s, %s, %s,%s, %s, %s )
4 ['UX Designer'] ['User Research, Wireframing, Prototyping, UI Design'] ['Salary: $80,000 - $110,000 per year'] ['Location: Austin, TX']
INSERT INTO vector.jobs (job_id,job_title,skills,salary,location,embedding_vector ) VALUES (%s, %
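
The loading cell relies on a fixed column order in `jobs.csv`: title, description, skills, salary and location, separated by semicolons. The sketch below shows the same parsing step with named fields, assuming that layout. `JobPosting`, `parse_jobs`, the header names and the description text are hypothetical; the remaining sample values are taken from the first row of the output above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class JobPosting:
    job_title: str
    description: str
    skills: str
    salary: str
    location: str


def parse_jobs(csv_text):
    """Parse semicolon-delimited text with a header row into JobPosting records."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=';')
    next(reader, None)  # skip the header row
    # Rows with fewer than five columns are skipped rather than padded.
    return [JobPosting(*row[:5]) for row in reader if len(row) >= 5]


# Header names and the description are placeholders, not the real file contents.
sample = (
    "job_title;job_description;skills;salary;location\n"
    "Software Engineer;Placeholder description;Java, Python, JavaScript, SQL, Agile;"
    "Salary: $90,000 - $120,000 per year;Location: San Francisco, CA\n"
)
jobs = parse_jobs(sample)
print(jobs[0].job_title, "|", jobs[0].location)
```

Named fields make it harder to mix up column indexes, for instance embedding the skills column instead of the description.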

## Defining a function that will be used for evaluating the results

The function uses gpt-4 to evaluate the most similar results returned by Astra DB.

In [7]:
def get_completion_from_messages(messages, model="gpt-4", temperature=0):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]
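
The prompt used in the matching cell asks the model for a score between 1 and 10, given as a number at the end of its explanation. To use that score programmatically it has to be pulled out of free text. The following is a heuristic sketch only: `extract_final_score` is a hypothetical helper, and it assumes the last number between 1 and 10 in the response is the score, which the model is not guaranteed to honor.

```python
import re

# A number, optionally followed by "/10" or "out of 10" style suffixes,
# so that "8 out of 10" yields 8 rather than 10.
_SCORE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(?:\s*(?:/|out of)\s*\d+)?")


def extract_final_score(response_text, low=1.0, high=10.0):
    """Return the last number within [low, high] in the text, or None if absent."""
    candidates = [float(value) for value in _SCORE_PATTERN.findall(response_text)]
    in_range = [value for value in candidates if low <= value <= high]
    return in_range[-1] if in_range else None


print(extract_final_score("The candidate is a strong match. Final score: 9"))
print(extract_final_score("I would rate this candidate 8 out of 10."))
print(extract_final_score("No numeric verdict was given."))
```

A stricter alternative would be to ask the model for a fixed output format and reject responses that do not match it.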

## Testing with sample CVs to find the best possible job posting for a given CV



*   For each sample CV in the file, combine a `text search` on `location` with a vector similarity search to get the job postings in Astra DB that are most similar to the CV.
*   If the `similarity_search score` is greater than a threshold (0.91 in the code below), send the result to the LLM for a second score that confirms the job is a good fit for the candidate.

There is only one Marketing Manager position, and it is in Chicago, so the `text search` on `location` returned no result for Bob Smith.
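
The score that is compared against the threshold comes from the `similarity_cosine` function in the query. As a rough illustration of the quantity behind it, the sketch below computes plain cosine similarity in pure Python and applies the same `> 0.91` check. The three-dimensional vectors are made up, and the database function may normalize its result differently (for example into the range 0 to 1), so the numbers are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_threshold(score, threshold=0.91):
    """Mirror the `row.score > 0.91` check used in the matching cell."""
    return score > threshold


# Toy vectors standing in for the 1536-dimensional CV and job embeddings.
cv_vec = [0.2, 0.1, 0.7]
job_vec = [0.25, 0.05, 0.65]
score = cosine_similarity(cv_vec, job_vec)
print(round(score, 4), passes_threshold(score))
```

Only results that pass this check are sent to the LLM, which keeps the number of gpt-4 calls small.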


In [8]:
import csv
from cassandra.query import SimpleStatement

count = 0

input_csv_file = 'test_cvs.csv'
SCORE_THRESHOLD = 0.91  # minimum similarity score before asking the LLM

try:
    with open(input_csv_file, 'r', newline='') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=';')
        next(csvreader)  # skip the header row

        for row in csvreader:
            count += 1
            # Slicing (row[i:i+1]) yields '' instead of raising IndexError when a column is missing
            name = ' '.join(row[0:1])
            job_title = ' '.join(row[1:2])
            cv = ' '.join(row[2:3])
            location = ' '.join(row[3:4])
            salary = ' '.join(row[4:5])
            print("Search similarity for this CV:", cv, "\n")
            embedding_res = generate_embedding(cv)
            # Hybrid query: analyzed text match on location (the ':' operator)
            # combined with an ANN search ordered by vector similarity
            query = SimpleStatement(f"SELECT job_id,job_title,skills,salary,location,embedding_vector,similarity_cosine(embedding_vector, {embedding_res}) as score  FROM {ASTRA_DB_KEYSPACE}.{ASTRA_DB_TABLE_NAME} where location: '{location}' ORDER BY embedding_vector ANN OF {embedding_res} LIMIT 3")
            print(name, location)
            res = session.execute(query)
            # Use a separate loop variable so the CSV row is not shadowed
            for match in res:
                res_job = match.job_title
                job_id = match.job_id
                if match.score > SCORE_THRESHOLD:
                    print('Result: Score', match.score, ' Job_id ', job_id, ' ', res_job, "\n")
                    messages = [
                        {'role': 'system', 'content': 'You are a chatbot for giving scores for the result of a job posting and CV comparison that are sent in []. You will help eliminate the candidates that do not fit the role by ranking them close to 1.'},
                        {'role': 'system', 'content': 'You need to give a score between 1 and 10. If it is a good candidate for the job, you can give 10.'},
                        {'role': 'system', 'content': "You should give a detailed explanation how you decide the ranking and give the ranking result as a number at the end."},
                        {'role': 'user', 'content': f'[{cv}],[{res_job}]'},
                    ]
                    response = get_completion_from_messages(messages, temperature=0)
                    print(response)
            if count == 5:  # only process the first five CVs
                break
            print("#########################################################")

except FileNotFoundError:
    print(f"File '{input_csv_file}' not found.")
except Exception as e:
    print(f"An error occurred: {str(e)}")

Search similarity for this CV: I am an experienced Software Engineer with expertise in Java, Python, and JavaScript. I have a strong background in software development and have worked on various projects, including web applications and backend systems. My skills include database design, API development, and problem-solving. I am passionate about writing clean and maintainable code and enjoy working in agile teams to deliver high-quality software solutions. 

John Doe San Francisco
Result: Score 0.9292247295379639  Job_id  1   Software Engineer 

Based on the job posting for a Software Engineer and the CV provided, the candidate seems to be a strong match. The candidate has experience in software development and has worked with Java, Python, and JavaScript, which are commonly used languages in software engineering. They also have experience with database design and API development, which are valuable skills for a Software Engineer. The candidate's passion for writing clean and maintaina