Let’s use the same 9 data points from the previous chapter and pretend that they make up a list of Frequently Asked Questions (FAQ). Whenever a new query comes in, we want to match it to the closest FAQ so we can provide the most relevant answer. Here is the list again, followed by one extra query from the sampled data (the last item) that the search below does not use:

- which airlines fly from boston to washington dc via other cities
- show me the airlines that fly between toronto and denver
- show me round trip first class tickets from new york to miami
- i'd like the lowest fare from denver to pittsburgh
- show me a list of ground transportation at boston airport
- show me boston ground transportation
- of all airlines which airline has the most arrivals in atlanta
- what ground transportation is available in boston
- i would like your rates between atlanta and boston on september third
- which airlines fly between boston and pittsburgh

Let’s say a person enters the query “How can I find a taxi or a bus when the plane lands?”. Note that the keywords “taxi” and “bus” don’t appear anywhere in our FAQ, so let’s see what results we get with semantic search.

Implementation-wise, there are many ways to approach this. Here, we use cosine similarity to compare the embedding of the search query with the embeddings of the FAQ entries and find the most similar ones.
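To make the idea concrete, here is a minimal sketch of cosine similarity written directly in NumPy. The helper name `cosine_sim` and the 3-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and the notebook itself relies on scikit-learn's `cosine_similarity` rather than this helper.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" with made-up values, for illustration only
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, different length
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_sim(query_vec, doc_close))  # approximately 1.0
print(cosine_sim(query_vec, doc_far))    # 0.0
```

Because the score depends only on the direction of the vectors, `doc_close` scores about 1.0 even though it is twice as long as the query vector.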

# Step 1: Embed the Documents

The first step is to turn the documents into embeddings. We embed each inquiry by calling Cohere’s Embed endpoint with co.embed(). It takes in texts as input and returns embeddings as output. We supply three parameters:

- texts: The list of texts you want to embed
- model: The model to use to generate the embeddings. Several models are available; in this chapter we use embed-english-v3.0
- input_type: Specifies the type of text to be embedded. At the time of writing, there are four options:
  - search_document: For documents against which search is performed
  - search_query: For search queries
  - classification: For when the embeddings will be used as input to a text classifier
  - clustering: For when you want to cluster the embeddings
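As a rough sketch of how these parameters fit together, the hypothetical helper below checks `input_type` against the four options listed above before calling `embed()`. The names `embed_texts`, `VALID_INPUT_TYPES`, and `FakeClient` are assumptions made for this example. `FakeClient` is an offline stand-in that returns made-up vectors so the snippet runs without an API key; in practice you would pass a real `cohere.Client` instead.

```python
from types import SimpleNamespace

# The four input_type options described above
VALID_INPUT_TYPES = {"search_document", "search_query", "classification", "clustering"}

def embed_texts(client, texts, input_type, model="embed-english-v3.0"):
    """Hypothetical helper: validate input_type, then call client.embed()."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"Unknown input_type: {input_type!r}")
    response = client.embed(texts=list(texts), model=model, input_type=input_type)
    return response.embeddings

class FakeClient:
    """Offline stand-in for a real client; returns made-up 2-d vectors."""
    def embed(self, texts, model, input_type):
        # One fake vector per text: [number of characters, 0.0]
        return SimpleNamespace(embeddings=[[float(len(t)), 0.0] for t in texts])

# Documents use "search_document"; the search query would use "search_query"
doc_embeds = embed_texts(
    FakeClient(), ["show me boston ground transportation"], "search_document"
)
print(len(doc_embeds), len(doc_embeds[0]))
```

Validating the value up front is only a convenience: a typo such as `"search_doc"` fails immediately with a clear message instead of surfacing later as an API error.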

In [1]:
!pip install cohere


In [54]:
#import needed libraries
import numpy as np
import pandas as pd
import cohere
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA


In [3]:
co = cohere.Client('COHERE_API_KEY')  # replace the placeholder with your own API key

# Data preprocessing

In [4]:
#load the dataset
dataset = pd.read_csv('https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/atis_intents_train.csv', names = ['intent', 'query'])

In [7]:
dataset.head()

|   | intent | query |
|---|---|---|
| 0 | atis_flight | i want to fly from boston at 838 am and arriv... |
| 1 | atis_flight | what flights are available from pittsburgh to... |
| 2 | atis_flight_time | what is the arrival time in san francisco for... |
| 3 | atis_airfare | cheapest airfare from tacoma to orlando |
| 4 | atis_airfare | round trip fares from pittsburgh to philadelp... |


In [8]:
# Take a small sample for illustration purposes
sampled_row_data = ['atis_airfare', 'atis_airline', 'atis_ground_service']
# Randomly sample 10% of the whole dataset (fixed seed for reproducibility)
df = dataset.sample(frac = 0.1, random_state=30)
# Keep only the sampled rows whose intent is one of the three listed above
df = df[df['intent'].isin(sampled_row_data)]
# Remove the selected rows from the original dataset
dataset = dataset.drop(df.index)
df.reset_index(drop=True, inplace=True)

In [9]:
# Remove unnecessary column
intent = df.intent #save for a later need
df.drop('intent', axis = 1, inplace = True)

In [10]:
df.head()

|   | query |
|---|---|
| 0 | which airlines fly from boston to washington ... |
| 1 | show me the airlines that fly between toronto... |
| 2 | show me round trip first class tickets from n... |
| 3 | i'd like the lowest fare from denver to pitts... |
| 4 | show me a list of ground transportation at bo... |


In [63]:
for i in df.head(10)["query"]:
    print(i)

 which airlines fly from boston to washington dc via other cities
 show me the airlines that fly between toronto and denver
 show me round trip first class tickets from new york to miami
 i'd like the lowest fare from denver to pittsburgh
 show me a list of ground transportation at boston airport
 show me boston ground transportation
 of all airlines which airline has the most arrivals in atlanta
 what ground transportation is available in boston
 i would like your rates between atlanta and boston on september third
 which airlines fly between boston and pittsburgh


# Embeddings

In [31]:
def get_embeddings(texts, model='embed-english-v3.0', input_type='search_document'):
  # Call the Embed endpoint and return one embedding per input text
  response = co.embed(
      texts=texts,
      model=model,
      input_type=input_type,
  )
  return response.embeddings

In [32]:
df["query_embed"] = get_embeddings(df['query'].tolist())

In [49]:
# Function to return the principal components
def get_pc(arr, n):
    pca = PCA(n_components=n)
    embeds_transform = pca.fit_transform(arr)
    return embeds_transform

In [52]:
# Function to generate the 2D plot
def generate_chart(df,xcol,ycol,lbl='on',color='basic',title=''):
    chart = alt.Chart(df).mark_circle(size=500).encode(
        x=
        alt.X(xcol,
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
             ),
        y=
        alt.Y(ycol,
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
             ),
        color= alt.value('#333293') if color == 'basic' else color,
        tooltip=['query']
    )

    if lbl == 'on':
        text = chart.mark_text(align='left', baseline='middle',dx=15, size=13,color='black').encode(text='query', color= alt.value('black'))
    else:
        text = chart.mark_text(align='left', baseline='middle',dx=10).encode()

    result = (chart + text).configure(background="#FDF7F0").properties(
        width=800,
        height=500,
        title=title
    ).configure_legend(orient='bottom', titleFontSize=18,labelFontSize=18)

    return result

In [56]:
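# Stack the document embeddings into a 2D array (one row per query)
embed = np.array(df['query_embed'].tolist())
# Number of FAQ entries (the first rows of df) to search over
sample = 9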

# Step 2: Embed the Search Query

Next, we embed the query using the same get_embeddings() function, this time setting input_type to search_query because the text being embedded is a search query rather than a document.

In [57]:
# Define a query
query = "How can I find a taxi or a bus when the plan lands?"

# Get the embedding of the new query (the first and only item in the returned list)
query_embeds = get_embeddings([query], input_type="search_query")[0]

# Step 3: Perform Search

Next, we create a function get_similarity() that uses cosine similarity to determine how similar each of the documents is to the query.

In [58]:
# Calculate the similarities between the search query embedding and the document embeddings
def get_similarity(query_embed, document_embed):
  # Ensure that both inputs are two-dimensional arrays
  document_embed = np.array(document_embed)
  query_embed = np.expand_dims(np.array(query_embed), axis=0)

  # Calculate cosine similarities
  sim = cosine_similarity(query_embed, document_embed)

  # Remove the extra dimension, then convert to a list
  sim = np.squeeze(sim).tolist()
  # Indices of the documents, sorted from most to least similar
  sort_index = np.argsort(sim)[::-1]
  # Similarity scores in the same (descending) order
  sort_score = [sim[i] for i in sort_index]
  # Pair each document index with its similarity score
  similarity_scores = zip(sort_index, sort_score)

  # Return the (index, score) pairs; note that zip() can be iterated only once
  return similarity_scores
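To see what the sorting step does in isolation, here is a small sketch on made-up scores. The helper `rank_by_score` and its optional `top_k` argument are hypothetical additions for illustration and are not part of the notebook's get_similarity() function.

```python
import numpy as np

def rank_by_score(scores, top_k=None):
    """Return (index, score) pairs sorted from highest to lowest score."""
    scores = np.asarray(scores, dtype=float)
    # argsort sorts ascending, so reverse it to get descending order
    order = np.argsort(scores)[::-1]
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Made-up similarity scores for four documents (indices 0 to 3)
toy_scores = [0.23, 0.38, 0.19, 0.34]
ranked = rank_by_score(toy_scores, top_k=2)
print(ranked)  # [(1, 0.38), (3, 0.34)]
```

Returning a list rather than a `zip` object means the ranking can be iterated more than once, which is convenient when the results are both printed and plotted.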



In [59]:
# Get the similarity between the search query and existing queries
similarity = get_similarity(query_embeds, embed[:sample])

# View the Most Similar Documents

In [60]:
print("Query:")
print(query, "\n")

print("Most similar documents:")
for idx, score in similarity:
  print(f"Similarity: {score:.2f}", df.iloc[idx]['query'])

Query:
How can I find a taxi or a bus when the plan lands? 

Most similar documents:
Similarity: 0.38  what ground transportation is available in boston
Similarity: 0.34  show me a list of ground transportation at boston airport
Similarity: 0.32  show me boston ground transportation
Similarity: 0.23  which airlines fly from boston to washington dc via other cities
Similarity: 0.21  show me the airlines that fly between toronto and denver
Similarity: 0.20  i would like your rates between atlanta and boston on september third
Similarity: 0.19  of all airlines which airline has the most arrivals in atlanta
Similarity: 0.19  i'd like the lowest fare from denver to pittsburgh
Similarity: 0.18  show me round trip first class tickets from new york to miami


# Step 4: Visualize the Results in a 2D Plot
To prepare for plotting, we add the query to the FAQ data and use PCA to reduce all the embeddings (documents and query together) to 2 dimensions with the get_pc() function we defined earlier.
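For intuition about what the reduction does, the following sketch performs PCA by hand with NumPy's SVD on made-up data. It is not scikit-learn's implementation, and the function name `pca_project` and the random vectors are assumptions for illustration. The projected coordinates may also differ from scikit-learn's output in the sign of each component.

```python
import numpy as np

def pca_project(points, n_components=2):
    """Project points onto their top principal components (illustrative sketch)."""
    X = np.asarray(points, dtype=float)
    # Center the data so each dimension has zero mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
toy_embeds = rng.normal(size=(10, 8))  # 10 made-up vectors of dimension 8
projected = pca_project(toy_embeds, 2)
print(projected.shape)  # (10, 2)
```

Each row of `projected` is a 2D point that can be plotted directly, which is the role the two principal-component columns play in the chart below.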

In [61]:
# Create new dataframe and append new query
df_sem = df.copy()
df_sem.loc[len(df_sem.index)] = [query, query_embeds]

# Reduce embeddings dimension to 2
embeds_sem = np.array(df_sem['query_embed'].tolist())
embeds_sem_pc2 = get_pc(embeds_sem, 2)

# Add the principal components to dataframe
df_sem_pc2 = pd.concat([df_sem, pd.DataFrame(embeds_sem_pc2)], axis=1)

In [62]:
# Create column for representing chart legend
df_sem_pc2['Source'] = 'Existing'
df_sem_pc2.at[len(df_sem_pc2)-1, 'Source'] = "New"

# Plot on a chart
df_sem_pc2.columns = df_sem_pc2.columns.astype(str)
selection = list(range(sample)) + [-1]
generate_chart(df_sem_pc2.iloc[selection],'0','1',color='Source',title='Semantic Search')



In the plot, the query about "a taxi or a bus" is located closest to the documents about ground transportation.