In [8]:
# Step 1: Install required libraries
!pip install sentence-transformers pymupdf pymilvus
!pwd

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


/Users/calvindu/School/Projects/PDF_Semantic_Modeller/backend
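
The tokenizers warning in this output appears when the process forks (here, for a `!` shell command) after the tokenizer has already used parallelism. One way to follow the warning's second suggestion is to set the variable from Python. This is a minimal sketch, assuming it runs at the top of the notebook before any model is loaded:

```python
import os

# Assumption: this runs before sentence_transformers is imported or used.
# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

Setting the value to "false" disables tokenizer parallelism explicitly, which is what the library falls back to anyway once it detects the fork.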


In [9]:

# Step 2: Extract text from the PDF
import fitz  # PyMuPDF

pdf_path = "test_data/Priscilla Wald - Introduction.pdf"

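def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, concatenated in page order."""
    # Open the document as a context manager so the file handle is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Iterating a Document yields its pages; join once instead of repeated +=
        return "".join(page.get_text("text") for page in doc)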

extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text[:1000])  # Print the first 1000 characters of the extracted text

Introduction
When the World Health Organization (who) issued a global alert on
12 March 2003, the especially virulent and ‘‘unexplained atypical pneu-
monia’’ soon to be known as severe acute respiratory syndrome (sars) had
already crossed a dozen national borders.∞ The disease had surfaced in
China’s Guangdong Province during the previous November, and a world-
wide research effort soon identiﬁed ‘‘the ﬁrst novel infectious disease epi-
demic of the 21st century, caused by a brand-new coronavirus.’’≤ Epide-
miologists rushed to identify its source and the means and routes of its
transmission; journalists scrambled to inform the public of the danger; and
medical researchers labored to ﬁnd a cure or at least produce a vaccine.
Through their accounts of the outbreak, they quickly turned sars into one
of the ‘‘emerging infections’’ that had been identiﬁed as a phenomenon two
decades earlier.≥
While the coronavirus was new to medical science, the scenario of disease
emergence was entirely 
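
The raw extraction above keeps the PDF's line breaks, end-of-line hyphenation ("pneu-" / "monia"), and ligature characters, and later output also shows a per-page "Downloaded from ..." notice mixed into the text. A possible cleanup pass is sketched below. The helper name `clean_extracted_text` and the notice pattern are assumptions based on the printed output, not part of the original notebook. Page numbers and running headers are not handled, and a genuine hyphen that happens to fall at a line end is lost.

```python
import re
import unicodedata

# Hypothetical helper, not part of the notebook above.
def clean_extracted_text(text):
    # NFKC normalization expands ligature characters such as U+FB01 into "fi"
    text = unicodedata.normalize("NFKC", text)
    # Drop the per-page download notice (pattern assumed from the printed output)
    text = re.sub(r"^Downloaded from \S+ by .*$\n?", "", text, flags=re.MULTILINE)
    # Rejoin words hyphenated across a line break: "pneu-\nmonia" -> "pneumonia"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat the remaining single line breaks as spaces; keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

sample = (
    "atypical pneu-\nmonia was identi\ufb01ed\n"
    "Downloaded from http://example.org/x.pdf by SOME user on 1 Jan 2000\n"
    "early"
)
print(clean_extracted_text(sample))
```

In the notebook this would be applied to `extracted_text` before splitting it into sections.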

In [10]:
# Step 3: Generate embeddings using PyTorch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Split text into sections
text_sections = extracted_text.split(". ")  # Naive split on ". "; a period followed by a newline does not split

def generate_embeddings(text_sections):
    embeddings = model.encode(text_sections, convert_to_tensor=True)
    return embeddings

embeddings = generate_embeddings(text_sections)

print(f"Generated {len(embeddings)} embeddings.")

Generated 288 embeddings.
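
Splitting on ". " can leave blank sections (they show up as empty bullets in the clustering output) and misses sentences that end at a line break. A slightly more careful splitter is sketched below. The helper name `split_into_sentences` is an assumption, and abbreviations such as "e.g." would still be split incorrectly.

```python
import re

# Hypothetical helper, not part of the notebook above.
def split_into_sentences(text):
    # Split on whitespace (including newlines) that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text)
    # Keep only pieces that contain at least one letter or digit
    return [part.strip() for part in parts if any(ch.isalnum() for ch in part)]

sample = "The disease had surfaced.\nA second sentence follows. . . \n\n"
print(split_into_sentences(sample))
```

Unlike `str.split(". ")`, this keeps the closing punctuation on each sentence, so the number of sections (and therefore embeddings) would differ from the 288 reported above.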


In [11]:
# Step 4: Find sentences related to given topics
from sklearn.metrics.pairwise import cosine_similarity

# Function to find sentences related to given topics
def find_sentences_related_to_topics(topics, model, sentence_embeddings, sentences, top_k=5):
    # Generate embeddings for the topics and move both sets of embeddings to NumPy
    topic_matrix = model.encode(topics, convert_to_tensor=True).cpu().numpy()
    sentence_matrix = sentence_embeddings.cpu().numpy()

    # One row per topic, one column per sentence
    similarities = cosine_similarity(topic_matrix, sentence_matrix)

    related_sentences = {}

    # For each topic, keep the top_k most similar sentences, most similar first
    for i, topic in enumerate(topics):
        top_indices = similarities[i].argsort()[-top_k:][::-1]
        related_sentences[topic] = [sentences[idx] for idx in top_indices]

    return related_sentences

# Example list of topics
topics = ["artificial intelligence", "machine learning", "climate change"]

# Call the function with topics and your pre-generated embeddings
related_sentences = find_sentences_related_to_topics(topics, model, embeddings, text_sections)

# Print sentences related to each topic
for topic, sentences in related_sentences.items():
    print(f"Topic: {topic}")
    for sentence in sentences:
        print(f" - {sentence}")

Topic: artificial intelligence
 - That is the project of this book
 - They represented the question of culpability in the absence not only
of intention but more fundamentally of self-knowledge
 - The memory of epidemics, how-
ever, is typically harnessed in the service of reinforcement
 - As communicability person-
Downloaded from http://read.dukeupress.edu/books/book/chapter-pdf/638348/9780822390572-001.pdf by UBC LIBRARY user on 16 May 2022
22
Introduction
iﬁed, carriers are its (human) ﬁgures, its agents, running the gamut of
human agency from unwitting germ disseminators to intentional dispens-
ers of contagion
 - Like
Oedipus, we do not know who—or what—we are
Topic: machine learning
 - That is the project of this book
 - The stories
Downloaded from http://read.dukeupress.edu/books/book/chapter-pdf/638348/9780822390572-001.pdf by UBC LIBRARY user on 16 May 2022
Introduction
25
derive their authority from their predictability and, in turn, establish the
scientiﬁc validity of the ap
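
The function always returns five sentences per topic, even when nothing in the document is about that topic, which is why "artificial intelligence" and "machine learning" produce unrelated matches here. One option is to discard matches below a similarity cutoff. The sketch below is dependency-free for illustration. The helper names and the `min_score=0.3` default are assumptions, and a useful cutoff would have to be chosen by inspecting real scores.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either has zero norm
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical helper; min_score=0.3 is an illustrative cutoff, not a tuned value.
def top_matches(query_vec, sentence_vecs, sentences, top_k=5, min_score=0.3):
    scored = [(cosine(query_vec, vec), sent) for vec, sent in zip(sentence_vecs, sentences)]
    # Drop weak matches first, then keep the best top_k of what remains
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
labels = ["about the topic", "partly related", "unrelated"]
matches = top_matches([1.0, 0.0], vectors, labels)
print(matches)
```

With a cutoff in place, a topic that is absent from the document returns an empty list instead of five arbitrary sentences.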

In [12]:
# Step 5: Discover topics automatically with KMeans
from sklearn.cluster import KMeans

# Function to automatically find topics using KMeans clustering
def find_topics_automatically(sentence_embeddings, num_topics=5, sentences=None):
    # Fall back to the notebook-level text_sections when no sentences are passed in
    if sentences is None:
        sentences = text_sections

    # Perform KMeans clustering on sentence embeddings
    kmeans = KMeans(n_clusters=num_topics, random_state=0)
    kmeans.fit(sentence_embeddings.cpu().numpy())

    # Assign each sentence to a topic (cluster label)
    topic_assignments = kmeans.labels_

    # Group sentences by topics
    topics = {i: [] for i in range(num_topics)}
    for i, topic in enumerate(topic_assignments):
        topics[topic].append(sentences[i])

    return topics

# Call the function to find topics
num_topics = 5  # Set the number of topics you want to find
topics = find_topics_automatically(embeddings, num_topics)

# Print sentences grouped by topics
for topic, sentences in topics.items():
    print(f"Topic {topic + 1}:")
    for sentence in sentences[:5]:  # Print first 5 sentences for each topic
        print(f" - {sentence}")
    print("\n")

Topic 1:
 - 
 - 
 - 
 - 
 - 


Topic 2:
 - Introduction
When the World Health Organization (who) issued a global alert on
12 March 2003, the especially virulent and ‘‘unexplained atypical pneu-
monia’’ soon to be known as severe acute respiratory syndrome (sars) had
already crossed a dozen national borders.∞ The disease had surfaced in
China’s Guangdong Province during the previous November, and a world-
wide research effort soon identiﬁed ‘‘the ﬁrst novel infectious disease epi-
demic of the 21st century, caused by a brand-new coronavirus.’’≤ Epide-
miologists rushed to identify its source and the means and routes of its
transmission; journalists scrambled to inform the public of the danger; and
medical researchers labored to ﬁnd a cure or at least produce a vaccine.
Through their accounts of the outbreak, they quickly turned sars into one
of the ‘‘emerging infections’’ that had been identiﬁed as a phenomenon two
decades earlier.≥
While the coronavirus was new to medical science, the 
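
The loop above prints the first sentences of each cluster in document order, which are not necessarily the most typical ones. A common alternative is to show the members closest to each cluster's centroid. The sketch below is dependency-free for illustration, and the helper name `representative_items` is an assumption.

```python
from collections import defaultdict

# Hypothetical helper, not part of the notebook above.
def representative_items(vectors, labels, items, per_cluster=3):
    # Group member indices by cluster label
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    result = {}
    for label, indices in members.items():
        dim = len(vectors[indices[0]])
        # Centroid = component-wise mean of the cluster's vectors
        centroid = [sum(vectors[i][d] for i in indices) / len(indices) for d in range(dim)]

        # Squared Euclidean distance to the centroid, the quantity KMeans minimizes
        def dist(i):
            return sum((vectors[i][d] - centroid[d]) ** 2 for d in range(dim))

        result[label] = [items[i] for i in sorted(indices, key=dist)[:per_cluster]]
    return result

vectors = [[0.0, 0.0], [0.0, 4.0], [0.0, 1.0], [10.0, 10.0]]
labels = [0, 0, 0, 1]
items = ["a", "b", "c", "d"]
print(representative_items(vectors, labels, items, per_cluster=2))
```

In the notebook, `vectors` would correspond to `embeddings.cpu().numpy()` and `labels` to `kmeans.labels_`. Since `find_topics_automatically` returns only the grouped sentences, it would need to return the labels as well for this to be applied.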

In [15]:
# Example topics
topics = ["disease", "spores"]

# Find related sentences
related_sentences = find_sentences_related_to_topics(topics, model, embeddings, text_sections)
for topic, sentences in related_sentences.items():
    print(f"Topic: {topic}")
    for sentence in sentences:
        print(f" - {sentence}")

Topic: disease
 - He explains that these persons
are ‘‘impossible’’ because of their association (often accidental) for the ob-
sessive patients with forbidden ideas, desires, and even spaces, but he does
not address why this impossibility takes the form of communicable disease.
The idea of a healthy human carrier of disease was one of the most
publicized and transformative discoveries of bacteriology
 - The observation captures the chaotic
and recombinatory nature of communicable disease, as the ultimate famil-
iars become the ultimate strangers
 - Rather, the disease is associated with dangerous prac-
tices and behaviors that allegedly mark intrinsic cultural difference, and it
expresses the destructive transformative power of the group
 - Communicable disease marks both the poten-
tial destruction of the community and the consequences of its survival
 - Communicable disease
marks the increasing connections of the inhabitants of the global village as
both biological and social, the c

In [14]:
# Automatically discover topics
num_topics = 3
topics = find_topics_automatically(embeddings, num_topics)
for topic, sentences in topics.items():
    print(f"Topic {topic + 1}:")
    for sentence in sentences[:3]:  # Print the first 3 sentences for each topic
        print(f" - {sentence}")
    print("\n")

Topic 1:
 - 
 - 
 - 


Topic 2:
 - Introduction
When the World Health Organization (who) issued a global alert on
12 March 2003, the especially virulent and ‘‘unexplained atypical pneu-
monia’’ soon to be known as severe acute respiratory syndrome (sars) had
already crossed a dozen national borders.∞ The disease had surfaced in
China’s Guangdong Province during the previous November, and a world-
wide research effort soon identiﬁed ‘‘the ﬁrst novel infectious disease epi-
demic of the 21st century, caused by a brand-new coronavirus.’’≤ Epide-
miologists rushed to identify its source and the means and routes of its
transmission; journalists scrambled to inform the public of the danger; and
medical researchers labored to ﬁnd a cure or at least produce a vaccine.
Through their accounts of the outbreak, they quickly turned sars into one
of the ‘‘emerging infections’’ that had been identiﬁed as a phenomenon two
decades earlier.≥
While the coronavirus was new to medical science, the scenario