# Compare Candidate Search Approaches

`Note` because of changes to YouTube API, I had to use alternate data files to truly show the ML solution.

These files are 
* video_transcripts_full.parquet
    * A more robust file of video ids, datetimes, titles, and transcripts

* video_index_full.parquet
    * A 

* eval_raw.csv
    * A file of user queries, and video ids. This file is used to evaluate the Machine Learning Solutions. 


## Imports

In [None]:
import numpy as np
import polars as pl
from sentence_transformers import SentenceTransformer, util
from  sklearn.metrics import DistanceMetric



# Load the data 

In [2]:
# Assign paths for the various files
video_transcript_file_path = "/Users/lancehester/Documents/semantic_search_yt/data/video_transcripts_full.parquet"
video_index_full_file_path = "/Users/lancehester/Documents/semantic_search_yt/data/video_index_full.parquet"
eval_raw_file_path = "/Users/lancehester/Documents/semantic_search_yt/data/eval_raw.csv"

## Create the Dataframes for the Transcripts and Evaluation file

In [3]:
# Data Frame from the transcripts
df = pl.read_parquet(video_transcript_file_path)
df.head()

video_id,datetime,title,transcript
str,datetime[μs],str,str
"""03x2oYg9oME""",2024-04-25 15:16:00,"""Data Science Project Managemen…","""this video is part of a larger…"
"""O5i_mMUM94c""",2024-04-19 14:05:54,"""How I’d learned #datascience (…","""here's how I'd learn data scie…"
"""xm9devSQEqU""",2024-04-18 15:59:02,"""4 Skills You Need to Be a Full…","""although it is common to deleg…"
"""Z6CmuVEi7QY""",2024-04-11 10:00:27,"""How I'd Learn Data Science (if…","""when I was first learning data…"
"""INlCLmWlojY""",2024-04-04 18:45:00,"""I Was Wrong About AI Consultin…","""last year I quit my corporate …"


In [7]:
df.shape

(83, 4)

In [8]:
# Data Frame from the evaluation file
df_eval = pl.read_csv(eval_raw_file_path)
df_eval.head()

query,video_id
str,str
"""ai consulting""","""INlCLmWlojY"""
"""fine tuning llm""","""eC6Hd1hFvos"""
"""When do you recommend fine tun…","""eC6Hd1hFvos"""
"""llm from scratch""","""ZLbVdvOoTKM"""
"""What if you could make a small…","""ZLbVdvOoTKM"""


In [9]:
df_eval.shape

(64, 2)

# Get the Embeded Titles and Transcripts


In [10]:
# Define paramters

# Get cols we will need
column_to_embed_list = ["title", "transcript"]

# List out the Deep Learning sentence transformer models we will use - These are Hugging Face models
transformer_model_list = ["all-MiniLM-L6-v2", "multi-qa-distilbert-cos-v1", "multi-qa-mpnet-base-dot-v1"]

### Generate Embeddings for the combinations of column and model

In [None]:
#Initialize dictionary to hold all of the data
text_embedding_dict = {}

for transformer_model in transformer_model_list:
    # Initialze the embedding model
    model = SentenceTransformer(transformer_model)

    # Go through each column to ge get list
    for col_name in column_to_embed_list:

        # Define text embedding identifier
        key_name = f"{transformer_model}_{col_name}"
        print(key_name)

        # Generate text embedding for the text in each column
        embedding_array = model.encode(df[col_name].to_list())
        print("")

        # Add embeddings to dictionary using key_names as keys
        text_embedding_dict[key_name] = embedding_array

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

all-MiniLM-L6-v2_title

all-MiniLM-L6-v2_transcript



modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/9.52k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/523 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/333 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

multi-qa-distilbert-cos-v1_title

multi-qa-distilbert-cos-v1_transcript



modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/212 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/8.71k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

multi-qa-mpnet-base-dot-v1_title

multi-qa-mpnet-base-dot-v1_transcript



# Now, Let's Embed the Evaluation Queries

In [12]:
#Initialize dictionary to hold all of the data
query_embedding_dict = {}

for transformer_model in transformer_model_list:
    # Initialze the embedding model
    model = SentenceTransformer(transformer_model)
    print(transformer_model)

    #embed query text
    %time embedding_array = model.encode(df_eval["query"].to_list())
    print("")

    # Add embeddings to dictionary using key_names as keys
    query_embedding_dict[transformer_model] = embedding_array

all-MiniLM-L6-v2
CPU times: user 77.8 ms, sys: 21.4 ms, total: 99.2 ms
Wall time: 331 ms

multi-qa-distilbert-cos-v1
CPU times: user 63.8 ms, sys: 42.2 ms, total: 106 ms
Wall time: 208 ms

multi-qa-mpnet-base-dot-v1
CPU times: user 83.1 ms, sys: 25.6 ms, total: 109 ms
Wall time: 1.04 s



# Evaluate The Search Methods

In [None]:
def return_video_id_index(
        df: pl.dataframe.frame.DataFrame,
        df_eval: pl.dataframe.frame.DataFrame,
        query_nth: int
) -> int:
    """
    Method to return the index of a dataframe corresponding to the nth row in the evaluation dataframe

    Args:
        df (pl.dataframe.frame.DataFrame): Polars dataframe of video ids, datetimes, titles, and transcripts
        df_eval (pl.dataframe.frame.DataFrame): Polars dataframe of user queries, and video ids used for evaluations
        query_nth (int): the nth row in the evaluation dataframe

    Returns:
        int: the nth row index of the evaluation dataframe.
    """
    return [i for i in range(len(df)) if df['video_id'][i]==df_eval['video_id'][query_nth]][0]

In [35]:
def evaluate_true_rankings(
        dist_array_isorted: np.ndarray,
        df: pl.dataframe.frame.DataFrame,
        df_eval: pl.dataframe.frame.DataFrame
) -> np.ndarray:
    """
    Method to return "true" video ID rankings for each evaluation query

    Args:
        dist_array_isorted (np.ndarray): Numpy array of distance metric values
        df (pl.dataframe.frame.DataFrame): Polars dataframe of video ids, datetimes, titles, and transcripts
        df_eval (pl.dataframe.frame.DataFrame): Polars dataframe of user queries, and video ids used for evaluations

    Returns:
        np.ndarray: Numpy array of "true" video ID rankings for each evaluation query
    """
    # Initialize array to store rankings of "correct" search reult
    true_rank_array = np.empty((1, dist_array_isorted.shape[1]))

    # Evaluate Rankings of Correct result for each query
    for query_nth in range(dist_array_isorted.shape[1]):
        # Return "true" video ID's found in df dataframe
        video_id_index = return_video_id_index(df, df_eval, query_nth)

        # Evaluate the ranking of the "true" video ID
        true_rank = np.argwhere(dist_array_isorted[:,query_nth]==video_id_index)[0][0]

        # Store the "true" video ID's ranking in array
        true_rank_array[0,query_nth] = true_rank

    return true_rank_array

In [36]:
# Initialize distance metrics to experiment
dist_name_list = ["euclidean", "manhattan", "chebyshev"]
sim_name_list = ["cos_sim", "dot_score"]

In [37]:
# Evaluate all possible combinations of model, columns to embed, and distance metrics
# 3 x 3 x 5 = 45 possible combinations

# Intitialize list to store results
eval_results = []

# Loop through all models (3 Options)
for transformer_model in transformer_model_list:
    # Generate query embedding
    query_embedding = query_embedding_dict[transformer_model]

    #Loop through the text columns of the dataframe (3 options)
    for col_name in column_to_embed_list:
        # Generate column embedding
        embedding_array = text_embedding_dict[f"{transformer_model}_{col_name}"]

        # Loop through the distance measures (5 options)
        for dist_measure in dist_name_list:

            # Compute distance between video text and query
            dist = DistanceMetric.get_metric(dist_measure)
            dist_array= dist.pairwise(embedding_array, query_embedding)

            # Sort Indices of the Distance Array - argsort returns indices
            dist_array_isorted = np.argsort(dist_array, axis=0)

            # Define label for search method
            method_name = "_".join([transformer_model, col_name, dist_measure])

            # Evaluate the ranking of the ground truth
            true_rank_array = evaluate_true_rankings(dist_array_isorted, df, df_eval)

            # Store results by appending to eval_results
            eval_list = [method_name] + true_rank_array.tolist()[0]
            eval_results.append(eval_list)

        # Loop through sbert similarty scores (2 Options)       
        for sim_name in sim_name_list:
            # Apply Similarty Score from sbert
            # minus because similarity scores is the opposite of a distance score
            cmd = f"dist_array = -util.{sim_name}(embedding_array, query_embedding)"
            exec(cmd)

            # Sort Indices of distance array (Notice minus sign in front of cosine similarity)
            dist_array_isorted = np.argsort(dist_array, axis=0)

            # Define label for search Method
            method_name = "_".join([transformer_model, col_name, sim_name.replace("_","-")])

            # Evaluate the ranking of the ground truth
            true_rank_array = evaluate_true_rankings(dist_array_isorted, df, df_eval)

            # store results 
            eval_list = [method_name] + true_rank_array.tolist()[0]
            eval_results.append(eval_list)

In [38]:
cmd

'dist_array = -util.dot_score(embedding_array, query_embedding)'

In [None]:
eval_results[0:5]

[['all-MiniLM-L6-v2_title_euclidean',
  0.0,
  0.0,
  16.0,
  0.0,
  7.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  3.0,
  0.0,
  1.0,
  0.0,
  2.0,
  0.0,
  0.0,
  0.0,
  1.0,
  3.0,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  2.0,
  0.0,
  8.0,
  1.0,
  0.0,
  0.0,
  1.0,
  0.0,
  6.0,
  1.0,
  0.0,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  9.0,
  5.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  0.0,
  1.0,
  0.0],
 ['all-MiniLM-L6-v2_title_manhattan',
  0.0,
  0.0,
  9.0,
  0.0,
  7.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  2.0,
  0.0,
  0.0,
  0.0,
  1.0,
  0.0,
  0.0,
  0.0,
  1.0,
  3.0,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  2.0,
  0.0,
  7.0,
  1.0,
  0.0,
  0.0,
  1.0,
  0.0,
  3.0,
  1.0,
  0.0,
  1.0,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  10.0,
  5.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  0.0,
  1.0,
  0.0],
 ['all-MiniLM-L6-v2_titl

### Compute rankings for title + transcripts embedding

In [43]:
for transformer_model in transformer_model_list:

    # generate embeddings
    embedding_array1 = text_embedding_dict[f"{transformer_model}_title"]
    embedding_array2 = text_embedding_dict[f"{transformer_model}_transcript"]
    query_embedding = query_embedding_dict[transformer_model]

    for dist_measure in dist_name_list:
        # Compute Distance Between Video Text and Query
        dist = DistanceMetric.get_metric(dist_measure)
        dist_arr = dist.pairwise(embedding_array1, query_embedding) + dist.pairwise(embedding_array2, query_embedding)

        # Sort Indices of distance array
        dist_array_isorted = np.argsort(dist_arr, axis=0)

        # Define Label for Search Method
        method_name = "_".join([transformer_model,"title_transcript", dist_measure])

        # Evaluate the ranking of the ground truth
        true_rank_array = evaluate_true_rankings(dist_array_isorted, df, df_eval)

        # Store results in eval_results from above 
        eval_list = [method_name] + true_rank_array.tolist()[0]
        eval_results.append(eval_list)

    #Loop through sbert similarity scores
    for sim_name in sim_name_list:
        #apply similarity score from sbert
        cmd = "dist_arr = -util." + sim_name + "(embedding_array1, query_embedding) - util."+ sim_name + "(embedding_array2, query_embedding)"
        exec(cmd)

        # sort indexes of distance array (notice minus sign in front of cosine similarity)
        dist_array_isorted = np.argsort(dist_array, axis=0)

        # define label for search method
        method_name = "_".join([transformer_model, "title_transcript", sim_name.replace("_","-")])

        # Evaluate the rankings of the ground truth
        true_rank_array = evaluate_true_rankings(dist_array_isorted, df, df_eval)

        # Store events
        eval_list = [method_name] + true_rank_array.tolist()[0]
        eval_results.append(eval_list)



In [44]:
len(eval_results)

45

In [46]:
# Define schema for results dataframe
schema_dict = {"method_name": str}
for i in range(len(eval_results[0])-1):
    schema_dict["rank_query_"+str(i)] = float

# Store results in dataframe
df_results = pl.DataFrame(eval_results, schema=schema_dict, orient="row")
df_results.head()

method_name,rank_query_0,rank_query_1,rank_query_2,rank_query_3,rank_query_4,rank_query_5,rank_query_6,rank_query_7,rank_query_8,rank_query_9,rank_query_10,rank_query_11,rank_query_12,rank_query_13,rank_query_14,rank_query_15,rank_query_16,rank_query_17,rank_query_18,rank_query_19,rank_query_20,rank_query_21,rank_query_22,rank_query_23,rank_query_24,rank_query_25,rank_query_26,rank_query_27,rank_query_28,rank_query_29,rank_query_30,rank_query_31,rank_query_32,rank_query_33,rank_query_34,rank_query_35,rank_query_36,rank_query_37,rank_query_38,rank_query_39,rank_query_40,rank_query_41,rank_query_42,rank_query_43,rank_query_44,rank_query_45,rank_query_46,rank_query_47,rank_query_48,rank_query_49,rank_query_50,rank_query_51,rank_query_52,rank_query_53,rank_query_54,rank_query_55,rank_query_56,rank_query_57,rank_query_58,rank_query_59,rank_query_60,rank_query_61,rank_query_62,rank_query_63
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""all-MiniLM-L6-v2_title_euclide…",0.0,0.0,16.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,8.0,1.0,0.0,0.0,1.0,0.0,6.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
"""all-MiniLM-L6-v2_title_manhatt…",0.0,0.0,9.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,7.0,1.0,0.0,0.0,1.0,0.0,3.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,10.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
"""all-MiniLM-L6-v2_title_chebysh…",0.0,2.0,46.0,0.0,60.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,30.0,0.0,0.0,4.0,57.0,0.0,3.0,0.0,24.0,0.0,0.0,0.0,8.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43.0,1.0,0.0,0.0,1.0,0.0,6.0,8.0,0.0,1.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0,5.0,5.0,1.0,70.0,11.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
"""all-MiniLM-L6-v2_title_cos-sim""",0.0,0.0,16.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,8.0,1.0,0.0,0.0,1.0,0.0,6.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
"""all-MiniLM-L6-v2_title_dot-sco…",0.0,0.0,16.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,1.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,8.0,1.0,0.0,0.0,1.0,0.0,6.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [47]:
# Compute mean rankings of ground truth search results
df_results = df_results.with_columns(new_col=pl.mean_horizontal(df_results.columns[1:])).rename({"new_col": "rank_query-mean"})


In [48]:
# Compute number of ground truth results which appear in top 3
for i in [1,3]:
    df_results = df_results.with_columns(new_col=pl.sum_horizontal(df_results[:,1:-1]<i)).rename({"new_col": "num_in_top-"+str(i)})

--

## Let's Look at the Top Results

In [49]:
df_summary = df_results[['method_name', "rank_query-mean", "num_in_top-1", "num_in_top-3"]]

In [50]:
print(df_summary.sort('rank_query-mean').head())

shape: (5, 4)
┌─────────────────────────────────┬─────────────────┬──────────────┬──────────────┐
│ method_name                     ┆ rank_query-mean ┆ num_in_top-1 ┆ num_in_top-3 │
│ ---                             ┆ ---             ┆ ---          ┆ ---          │
│ str                             ┆ f64             ┆ u32          ┆ u32          │
╞═════════════════════════════════╪═════════════════╪══════════════╪══════════════╡
│ all-MiniLM-L6-v2_title_transcr… ┆ 0.875           ┆ 41           ┆ 60           │
│ all-MiniLM-L6-v2_title_manhatt… ┆ 0.921875        ┆ 44           ┆ 58           │
│ all-MiniLM-L6-v2_title_transcr… ┆ 0.96875         ┆ 41           ┆ 61           │
│ all-MiniLM-L6-v2_title_euclide… ┆ 1.09375         ┆ 45           ┆ 57           │
│ all-MiniLM-L6-v2_title_cos-sim  ┆ 1.09375         ┆ 45           ┆ 57           │
└─────────────────────────────────┴─────────────────┴──────────────┴──────────────┘


In [51]:
df_summary.sort('rank_query-mean').head()[0,0]

'all-MiniLM-L6-v2_title_transcript_manhattan'

In [52]:
print(df_summary.sort("num_in_top-1", descending=True).head())

shape: (5, 4)
┌─────────────────────────────────┬─────────────────┬──────────────┬──────────────┐
│ method_name                     ┆ rank_query-mean ┆ num_in_top-1 ┆ num_in_top-3 │
│ ---                             ┆ ---             ┆ ---          ┆ ---          │
│ str                             ┆ f64             ┆ u32          ┆ u32          │
╞═════════════════════════════════╪═════════════════╪══════════════╪══════════════╡
│ all-MiniLM-L6-v2_title_euclide… ┆ 1.09375         ┆ 45           ┆ 57           │
│ all-MiniLM-L6-v2_title_cos-sim  ┆ 1.09375         ┆ 45           ┆ 57           │
│ all-MiniLM-L6-v2_title_dot-sco… ┆ 1.09375         ┆ 45           ┆ 57           │
│ multi-qa-mpnet-base-dot-v1_tit… ┆ 1.8125          ┆ 45           ┆ 57           │
│ all-MiniLM-L6-v2_title_manhatt… ┆ 0.921875        ┆ 44           ┆ 58           │
└─────────────────────────────────┴─────────────────┴──────────────┴──────────────┘


In [53]:
df_summary.sort("num_in_top-1", descending=True).head()[0,0]

'all-MiniLM-L6-v2_title_euclidean'

In [54]:
print(df_summary.sort("num_in_top-3", descending=True).head())

shape: (5, 4)
┌─────────────────────────────────┬─────────────────┬──────────────┬──────────────┐
│ method_name                     ┆ rank_query-mean ┆ num_in_top-1 ┆ num_in_top-3 │
│ ---                             ┆ ---             ┆ ---          ┆ ---          │
│ str                             ┆ f64             ┆ u32          ┆ u32          │
╞═════════════════════════════════╪═════════════════╪══════════════╪══════════════╡
│ all-MiniLM-L6-v2_title_transcr… ┆ 0.96875         ┆ 41           ┆ 61           │
│ multi-qa-distilbert-cos-v1_tit… ┆ 1.59375         ┆ 43           ┆ 61           │
│ multi-qa-distilbert-cos-v1_tit… ┆ 1.765625        ┆ 44           ┆ 60           │
│ multi-qa-distilbert-cos-v1_tit… ┆ 1.765625        ┆ 44           ┆ 60           │
│ multi-qa-distilbert-cos-v1_tit… ┆ 1.765625        ┆ 44           ┆ 60           │
└─────────────────────────────────┴─────────────────┴──────────────┴──────────────┘


In [55]:
df_summary.sort("num_in_top-3", descending=True).head()[0,0]

'all-MiniLM-L6-v2_title_transcript_euclidean'

In [56]:
for i in range(4):
    print(df_summary.sort("num_in_top-3", descending=True)['method_name'][i])

all-MiniLM-L6-v2_title_transcript_euclidean
multi-qa-distilbert-cos-v1_title_transcript_euclidean
multi-qa-distilbert-cos-v1_title_euclidean
multi-qa-distilbert-cos-v1_title_cos-sim


## Summary

The best embedding method overall was `all-MiniLM-L6-v2_title_transcript_manhattan` for this project because:

* all-MiniLM-L6-v2 is the most compact sentence transfomer and gives computational efficiency.

* It ranked hig as far as Mean. It was in the top 3 of the methods, and although not number 1 in the top-1
it appears fairly efficient for its low cost complexity and size. 

* It utilizes both title and transcript.

┌─────────────────────────────────┬─────────────────┬──────────────┬──────────────┐
│ method_name                     ┆ rank_query-mean ┆ num_in_top-1 ┆ num_in_top-3 │
│ ---                             ┆ ---             ┆ ---          ┆ ---          │
│ str                             ┆ f64             ┆ u32          ┆ u32          │
╞═════════════════════════════════╪═════════════════╪══════════════╪══════════════╡
│ all-MiniLM-L6-v2_title_transcr… ┆ 0.875           ┆ 41           ┆ 60           │
│ all-MiniLM-L6-v2_title_manhatt… ┆ 0.921875        ┆ 44           ┆ 58           │
│ all-MiniLM-L6-v2_title_transcr… ┆ 0.96875         ┆ 41           ┆ 61           │
│ all-MiniLM-L6-v2_title_euclide… ┆ 1.09375         ┆ 45           ┆ 57           │
│ all-MiniLM-L6-v2_title_cos-sim  ┆ 1.09375         ┆ 45           ┆ 57           │
└─────────────────────────────────┴─────────────────┴──────────────┴──────────────┘