"Instead of word embeddings, we can use Bidirectional Encoder Representations from Transformer (BERT) embeddings. A BERT model, like word embeddings, is a pretrained model and gives a vector representation, but it takes context into account and can represent a whole sentence instead of individual words." 
"Hugging Face code makes using BERT very easy. The first time the code runs, it will download the necessary model, which might take some time. After the download, it’s just a matter of encoding the sentences using the model." (NLP Cookbook)

https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb
"You can input a query or a question. The script then uses semantic search to find relevant passages. (...) For semantic search, we use SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') and retrieve 32 potentially passages that answer the input query.

Next, we use a more powerful CrossEncoder (cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')) that scores the query and all retrieved passages for their relevancy. The cross-encoder further boost the performance, especially when you search over a corpus for which the bi-encoder was not trained for."
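
To make the retrieval step concrete, here is a minimal pure-Python sketch of what semantic search over embeddings amounts to: rank the passages by cosine similarity to the query vector and keep the top k. The toy 2-dimensional vectors and the names cosine and top_k_hits are illustrative assumptions. The notebook itself relies on util.semantic_search for this step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_hits(query_vec, corpus_vecs, k):
    """Return the k best passages as dicts with 'corpus_id' and 'score'."""
    scored = [{'corpus_id': i, 'score': cosine(query_vec, v)}
              for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda h: h['score'], reverse=True)[:k]

# Toy corpus: three "passage embeddings" and one "query embedding"
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k_hits([1.0, 0.1], toy_corpus, k=2)
print([h['corpus_id'] for h in toy_hits])  # [0, 2]
```

The returned dicts mirror the 'corpus_id' and 'score' keys that the search function further down reads from each hit.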

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Downloading huggingface_hub-0.29.3-py3-none-any.whl.metadata (13 kB)
Collecting fsspec>=2023.5.0 (from huggingface-hub>=0.20.0->sentence-transformers)
  Using cached fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers<5.0.0,>=4.41.0->sentence-transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.1 (from transformers<5.0.0,>=4.41.0->sentence-transformers)
  Downloading safetensors-0.5.3-cp38-abi3-macosx_11_0_arm64.whl.metadata (3.8 kB)
Downloading sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Downloading huggingface_hub-0.29.3-py3-none-any.whl (468 kB)

PyTorch's CUDA backend only works with NVIDIA GPUs, and Macs do not ship with them, so torch.cuda.is_available() returns False on a MacBook. That does not mean the code must run on the CPU. On Apple-silicon Macs (the macosx_11_0_arm64 wheels in the install log above indicate one), PyTorch can use the GPU through its Metal Performance Shaders (MPS) backend, which is selected with the device name "mps" and detected with torch.backends.mps.is_available().

Alternative: Using TensorFlow with macOS GPU Support
TensorFlow can also use the Mac GPU through Apple's tensorflow-metal plugin. The sentence-transformers code in this notebook is built on PyTorch, however, so switching frameworks is not needed here.

If you have no supported GPU, the code runs on the CPU. Avoid unconditional .cuda() calls in your code: guard them with a device check, or move tensors with .to(device) instead.
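
The device choice can be sketched as a small helper. Note that recent PyTorch versions also expose an "mps" device for Apple-silicon GPUs, so a Mac is not necessarily limited to the CPU. The function name choose_device is a hypothetical helper, not part of PyTorch or sentence-transformers; it takes the availability flags as plain booleans so it can be read and run without a GPU.

```python
def choose_device(cuda_available, mps_available):
    """Pick a PyTorch device name, preferring CUDA, then Apple's MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# In the notebook the two flags would come from PyTorch, for example:
#   device = choose_device(torch.cuda.is_available(), torch.backends.mps.is_available())
#   bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device)
print(choose_device(cuda_available=False, mps_available=True))  # mps
```

Passing the device explicitly is optional, but it makes the choice visible instead of leaving it to the library default.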

In [1]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch

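# CUDA is NVIDIA-only; on Apple-silicon Macs the GPU is exposed through the "mps" backend instead
if not (torch.cuda.is_available() or torch.backends.mps.is_available()):
    print("Warning: No GPU found. Encoding will run on the CPU and may be slow")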

#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 512     #Or truncate long passages to 256 tokens
top_k = 100                          #Number of passages we want to retrieve with the bi-encoder

#The bi-encoder will retrieve 100 documents. We use a cross-encoder, to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')



In [2]:
import pandas as pd

df_movies = pd.read_csv('../raw_data/wiki_movie_plots_deduped.csv')
df_movies.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [3]:
# Calculate the largest number of words in any plot in df_movies['Plot']
max_length = df_movies['Plot'].apply(lambda x: len(x.split())).max()

print(f"The longest number of words in df_movies['Plot'] is: {max_length}")

The longest number of words in df_movies['Plot'] is: 6752


In [14]:
# Calculate the length of each plot
df_movies['Plot_length'] = df_movies['Plot'].apply(lambda x: len(x.split()))

# Find the index of the plot with the maximum length
max_length_index = df_movies['Plot_length'].idxmax()

# Retrieve the row with the maximum length
max_length_plot = df_movies.loc[max_length_index]

print(f"The row with the largest number of words in df_movies['Plot'] is:\n{max_length_plot}")

The row with the largest number of words in df_movies['Plot'] is:
Release Year                                                     2000
Title                                             The Prince of Light
Origin/Ethnicity                                            Bollywood
Director                                                    Yugo Sako
Cast                Bryan Cranston, Edie Mirman, Tom Wyner, Richar...
Genre                                                         unknown
Wiki Page           https://en.wikipedia.org/wiki/The_Prince_of_Light
Plot                After a brief introduction to some of the main...
Plot_length                                                      6752
Name: 26064, dtype: object


In [4]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34886 entries, 0 to 34885
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Release Year      34886 non-null  int64 
 1   Title             34886 non-null  object
 2   Origin/Ethnicity  34886 non-null  object
 3   Director          34886 non-null  object
 4   Cast              33464 non-null  object
 5   Genre             34886 non-null  object
 6   Wiki Page         34886 non-null  object
 7   Plot              34886 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB


In [4]:
# The UKPLab example has 169597 passages. They encode all passages into the vector space. This takes about 5 minutes (depends on your GPU speed)
# We encode all 34886 PLOTS into our vector space. This takes me 3.5 minutes using CPU with max_seq_length = 256

#Encode all Plots (or first split them into smaller chunks - below)
#corpus_embeddings = bi_encoder.encode(df_movies['Plot'], convert_to_tensor=True, show_progress_bar=True)

# Split the plots into smaller chunks so that long plots are not cut off at max_seq_length

# Function to split text into chunks of at most max_length whitespace-separated words.
# Note: max_seq_length counts tokens, not words. A chunk of 512 words usually produces
# more than 512 tokens, so the model may still truncate the end of a chunk.
def split_into_chunks(text, max_length):
    words = text.split()
    return [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]

# Split every plot into chunks and remember which movie each chunk came from
all_chunks = []
chunk_to_row_index = []  # Original row index in df_movies for each chunk
for idx, plot in enumerate(df_movies['Plot']):
    chunks = split_into_chunks(plot, bi_encoder.max_seq_length)
    all_chunks.extend(chunks)
    chunk_to_row_index.extend([idx] * len(chunks))

# We encode the chunks of all 34886 plots into our vector space. This takes me over 60 minutes using CPU with max_seq_length = 512
corpus_embeddings = bi_encoder.encode(all_chunks, convert_to_tensor=True, show_progress_bar=True)

Batches:   0%|          | 0/1462 [00:00<?, ?it/s]
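
The chunking above counts words, while max_seq_length counts tokens. A minimal sketch of a token-aware alternative follows: it packs whole words into a chunk until a token budget is reached. The function split_by_token_budget and its count_tokens parameter are hypothetical names introduced for illustration. In the notebook, count_tokens could plausibly be built on bi_encoder.tokenizer, with a budget slightly below max_seq_length to leave room for special tokens; that pairing is an assumption, not something the notebook does.

```python
def split_by_token_budget(text, count_tokens, max_tokens):
    """Greedily pack whole words into chunks whose token count stays within max_tokens.

    count_tokens is any callable that returns the number of tokens for one word.
    A single word that exceeds the budget on its own still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for word in text.split():
        n = count_tokens(word)
        if current and used + n > max_tokens:
            chunks.append(' '.join(current))
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(' '.join(current))
    return chunks

# Toy tokenizer: pretend every word costs two tokens
print(split_by_token_budget("a b c d e", lambda w: 2, max_tokens=4))  # ['a b', 'c d', 'e']

# Possible use in the notebook (assumption, not run here):
#   count = lambda w: len(bi_encoder.tokenizer.tokenize(w))
#   chunks = split_by_token_budget(plot, count, bi_encoder.max_seq_length - 2)
```

Counting tokens word by word is slower than splitting on whitespace, so this trades preprocessing time for chunks that fit the model's input limit.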

The self-attention layers of transformer models, like the ones used in sentence_transformers, have a computational cost that scales quadratically with the sequence length, while the remaining layers scale linearly. Doubling the sequence length therefore increases the computation by a factor between two and four, and the quadratic term matters more as sequences grow longer.
Using a max_seq_length of 256 is generally faster and uses less memory than a max_seq_length of 512. The trade-off is that inputs longer than 256 tokens are truncated, so information at the end of long passages is lost.
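
A rough back-of-the-envelope calculation shows how sequence length affects cost. The cost model below is a simplifying assumption: per layer it counts about 12·n·d² multiply-adds for the projections and feed-forward block and about 2·n²·d for the attention itself, and it ignores everything else. The hidden size d = 384 is assumed for a MiniLM-L6 model.

```python
def layer_cost(n, d):
    """Rough multiply-add count for one transformer layer (assumed cost model)."""
    linear = 12 * n * d * d      # Q, K, V, output projections (4*d*d) + feed-forward (8*d*d) per token
    attention = 2 * n * n * d    # attention scores plus weighted sum of values
    return linear + attention

d = 384  # assumed hidden size for a MiniLM-L6 model
ratio = layer_cost(512, d) / layer_cost(256, d)
print(f"512 vs 256 tokens: {ratio:.2f}x")  # 512 vs 256 tokens: 2.20x
```

Under these assumptions, going from 256 to 512 tokens costs a little more than twice as much per sequence, because at these lengths the linear terms still dominate. Measured runtimes can differ, since they also depend on batching, padding and hardware.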

In [5]:
# Save the embeddings to a file
torch.save(corpus_embeddings, '../raw_data/corpus_embeddings512.pt')

# Load the embeddings from the file (for future use)
corpus_embeddings = torch.load('../raw_data/corpus_embeddings512.pt')
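
The saved embeddings are only useful together with the chunk texts and the chunk-to-row mapping they were computed from, since search looks both up by corpus_id. A minimal sketch of persisting them with the standard library follows. The file name chunk_index.json and the helper names are hypothetical, and the toy data stands in for all_chunks and chunk_to_row_index.

```python
import json
import os
import tempfile

def save_chunk_index(path, all_chunks, chunk_to_row_index):
    """Write the chunk texts and their source row indices to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'chunks': all_chunks, 'rows': chunk_to_row_index}, f)

def load_chunk_index(path):
    """Read back the chunk texts and source row indices."""
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    return data['chunks'], data['rows']

# Toy round trip in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'chunk_index.json')
save_chunk_index(demo_path, ['first chunk', 'second chunk'], [0, 0])
demo_chunks, demo_rows = load_chunk_index(demo_path)
print(demo_chunks, demo_rows)
```

Reloading all three pieces in a later session avoids re-running the chunking and the long encoding step.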

In [10]:
# This function searches all movie plot chunks for passages that
# match the query
def search(query):
    print("Input question:", query)

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Keep the query on the same device as the corpus embeddings (CPU, CUDA or MPS)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first (and only) query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, all_chunks[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-10 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-10 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:10]:
        original_row_index = chunk_to_row_index[hit['corpus_id']]
        print("\t{:.3f}\t{}\t{}".format(hit['score'], df_movies['Title'][original_row_index], all_chunks[hit['corpus_id']].replace("\n", " ")))

    # Output of top-10 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-10 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:10]:
        original_row_index = chunk_to_row_index[hit['corpus_id']]
        print("\t{:.3f}\t{}\t{}".format(hit['cross-score'], df_movies['Title'][original_row_index], all_chunks[hit['corpus_id']].replace("\n", " ")))
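
Because plots are split into chunks, several chunks of the same movie can appear among the hits. One possible refinement, sketched here under assumptions, keeps only the best-scoring chunk per movie before printing. The helper best_hit_per_movie is a hypothetical name and is not used by search above; the toy hits mimic the dicts built inside search.

```python
def best_hit_per_movie(hits, chunk_to_row_index, score_key='cross-score'):
    """Keep only the highest-scoring chunk for each source row, sorted by score."""
    best = {}
    for hit in hits:
        row = chunk_to_row_index[hit['corpus_id']]
        if row not in best or hit[score_key] > best[row][score_key]:
            best[row] = hit
    return sorted(best.values(), key=lambda h: h[score_key], reverse=True)

toy_hits = [
    {'corpus_id': 0, 'cross-score': 1.5},
    {'corpus_id': 1, 'cross-score': 3.0},   # same movie as chunk 0
    {'corpus_id': 2, 'cross-score': 2.0},
]
toy_mapping = [7, 7, 9]  # chunk index -> row index in the DataFrame
print([h['corpus_id'] for h in best_hit_per_movie(toy_hits, toy_mapping)])  # [1, 2]
```

Whether to deduplicate depends on the goal: for finding movies it reduces repetition, while for finding specific passages the per-chunk list may be preferable.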


In [11]:
search(query = "dinosaur adventure")

Input question: dinosaur adventure

-------------------------

Top-10 Bi-Encoder Retrieval hits
	0.636	Jurassic World	embryos. Hoskins reveals his plan to create other hybrid dinosaurs like Indominus for use as superweapons, but a raptor breaks in and mauls him to death. Owen re-establishes his bond with the raptors before the Indominus reappears. The raptors attack the Indominus, but three of the four are killed. Claire lures the park's Tyrannosaurus rex into a battle with the Indominus. The two dinosaurs fight, with the Indominus gaining the upper hand until Blue, the lone surviving raptor, joins the battle. Overwhelmed, the Indominus is backed up to the lagoon, where the Mosasaurus leaps out and drags it underwater. The T. rex retreats, followed by Blue, who turns to acknowledge Owen before leaving. Isla Nublar is once again abandoned, and the survivors are successfully evacuated to the mainland. Zach and Gray are reunited with their parents, while the T. rex roams freely on Isla Nu

In [12]:
search(query = "dinosaur park")

Input question: dinosaur park

-------------------------

Top-10 Bi-Encoder Retrieval hits
	0.547	Jurassic World	Brothers Zach and Gray Mitchell visit Isla Nublar, the site of the original Jurassic Park, where a new theme park named Jurassic World has operated for years. Simon Masrani, the park's owner, has encouraged geneticist Dr. Henry Wu to create a hybrid dinosaur to attract visitors. The two boys meet their aunt, Claire Dearing, the park's operations manager. Claire assigns her assistant Zara to be their guide, but the boys evade her and explore the resort on their own. Owen Grady, a Navy veteran, has been researching the intelligence of the park's four Velociraptors. InGen Security chief Vic Hoskins believes the raptors should be trained for military use despite objections from Owen and his assistant Barry. Masrani has Owen evaluate the enclosure of the park's new hybrid dinosaur, the Indominus rex, before the attraction opens. Owen warns Claire about the danger of raising Indom

In [13]:
search(query = "dinosaur")

Input question: dinosaur

-------------------------

Top-10 Bi-Encoder Retrieval hits
	0.658	Jurassic World	embryos. Hoskins reveals his plan to create other hybrid dinosaurs like Indominus for use as superweapons, but a raptor breaks in and mauls him to death. Owen re-establishes his bond with the raptors before the Indominus reappears. The raptors attack the Indominus, but three of the four are killed. Claire lures the park's Tyrannosaurus rex into a battle with the Indominus. The two dinosaurs fight, with the Indominus gaining the upper hand until Blue, the lone surviving raptor, joins the battle. Overwhelmed, the Indominus is backed up to the lagoon, where the Mosasaurus leaps out and drags it underwater. The T. rex retreats, followed by Blue, who turns to acknowledge Owen before leaving. Isla Nublar is once again abandoned, and the survivors are successfully evacuated to the mainland. Zach and Gray are reunited with their parents, while the T. rex roams freely on Isla Nublar.
	0.6

In [15]:
search(query = "crime and drugs")

Input question: crime and drugs

-------------------------

Top-10 Bi-Encoder Retrieval hits
	0.552	New Jack City	The story begins in Harlem, 1986, and Nino Brown and his gang, the Cash Money Brothers (CMB), become the dominant drug ring in New York City once crack cocaine is introduced to the streets. His gang consists of his best friend, Gee Money; enforcer Duh Duh Duh Man; gun moll Keisha; Nino's girlfriend, Selina, and her tech-savvy cousin, Kareem. Nino converts the Carter, an apartment complex, into a crack house. Gee Money and Keisha kill rival Fat Smitty, the CMB throws out the tenants, and Nino forces the landlord out onto the streets naked. Meanwhile, Undercover detective Scotty Appleton attempts to make a deal with stick-up kid Pookie, but Pookie runs off with the money. Scotty chases Pookie and shoots him in the leg, but the police let him go. Nino's gang successfully run the streets of Harlem over the next three years. When Det. Stone comes under pressure, Scotty volunteer

In [16]:
search(query = "love affair hate")

Input question: love affair hate

-------------------------

Top-10 Bi-Encoder Retrieval hits
	0.582	Jodi Kya Banayi Wah Wah Ramji	The story is a riotous comedy about a suspense thriller writer's search for a love affair.[1]
	0.562	The End of the Affair	Novelist Maurice Bendrix narrates the film as he begins a book with the line "This is a diary of hate." On a rainy London night in 1946, Bendrix has a chance meeting with Henry Miles, husband of his former mistress Sarah, who abruptly ended their affair two years before. Bendrix's obsession with Sarah is rekindled: he succumbs to his own jealousy and works his way back into her life. As the story unfolds in 1946, we also see flashbacks of Bendrix with Sarah as they began their affair during World War II. Henry tells Bendrix that he believes Sarah is having an affair, so Bendrix hires the bumbling but amiable Mr. Parkis, who uses his young birthmarked son Lance to investigate. Sarah asks Bendrix to meet to talk about Henry and the cold t