In [1]:
!python -m pip install numpy scikit-learn sentence-transformers -q

In [2]:
documents = [
    "The sports world is centered on Melbourne for the 2026 Australian Open, where tennis elites are battling for the first Grand Slam of the year. Meanwhile, the 'Luke Littler era' in darts has been solidified after his stunning victory at the 2026 PDC World Darts Championship. In cricket, anticipation is building for the Under-19 Men’s Cricket World Cup, which kicked off in mid-January, showcasing the next generation of global talent.",
    
    "European football is hitting a fever pitch as the UEFA Champions League league phase reaches its dramatic finale. A major storyline has emerged at Benfica, where legendary manager José Mourinho is set to face his former club, Real Madrid, in a high-stakes 'must-win' encounter. This comes amidst a turbulent season for Real Madrid, who recently appointed new leadership after the departure of Xabi Alonso.",
    
    "Geopolitical tensions are high following the return of Donald Trump to the U.S. presidency. Significant shifts in international relations are underway, notably with Mexico's President Claudia Sheinbaum cancelling oil shipments to Cuba amid U.S. pressure. In Asia, China's military leadership is facing a massive shakeup as top general Zhang Youxia, a long-time ally of Xi Jinping, has been placed under investigation for alleged corruption and security leaks.",
    
    "The 'January 2026 AI Revolution' is transforming the workforce with the rise of 'Agentic AI'—autonomous systems capable of self-verification and long-term goal planning. Boston Dynamics has officially moved its humanoid robot, Atlas, into field tests at Hyundai factories, marking a 'ChatGPT moment' for physical robotics. Meanwhile, OpenAI's release of GPT-5.2 has set a new benchmark for professional reasoning and autonomous coding capabilities."
]

In [3]:
import re

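def preprocessing(text):
  """Lowercase the text and strip punctuation."""
  text = text.lower()
  # Drop every character that is neither a word character nor whitespace.
  # Hyphenated terms are fused as a result, e.g. "mid-January" -> "midjanuary".
  text = re.sub(r'[^\w\s]', '', text)
  return text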


prepocessed_documents = [preprocessing(doc) for doc in documents]

for doc in prepocessed_documents:
  print(doc)

the sports world is centered on melbourne for the 2026 australian open where tennis elites are battling for the first grand slam of the year meanwhile the luke littler era in darts has been solidified after his stunning victory at the 2026 pdc world darts championship in cricket anticipation is building for the under19 mens cricket world cup which kicked off in midjanuary showcasing the next generation of global talent
european football is hitting a fever pitch as the uefa champions league league phase reaches its dramatic finale a major storyline has emerged at benfica where legendary manager josé mourinho is set to face his former club real madrid in a highstakes mustwin encounter this comes amidst a turbulent season for real madrid who recently appointed new leadership after the departure of xabi alonso
geopolitical tensions are high following the return of donald trump to the us presidency significant shifts in international relations are underway notably with mexicos president cla
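
Deleting punctuation outright fuses hyphenated terms, as in `under19` and `highstakes` in the output above. A minimal sketch of one alternative, assuming it is acceptable to split such terms into separate tokens; `preprocess_keep_word_boundaries` is a hypothetical helper, not part of the original notebook:

```python
import re

def preprocess_keep_word_boundaries(text):
    """Lowercase text and replace punctuation with spaces instead of deleting it."""
    text = text.lower()
    # Swap each punctuation character for a space so "high-stakes" becomes "high stakes".
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace that the substitution leaves behind.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_keep_word_boundaries("A high-stakes 'must-win' encounter."))
# a high stakes must win encounter
```

The trade-off is that possessives are split as well (`men's` becomes `men s`), so which variant works better depends on the corpus.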

In [4]:
test_query = "machine learning is subset of artificial intelligence"

#### Keyword search

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy

In [6]:
vectorizer = TfidfVectorizer()

sparse_vectors = vectorizer.fit_transform(prepocessed_documents)
sparse_vectors.shape[1]  # vocabulary size, read without densifying the matrix

185

In [7]:
test_query_sparse_vector = vectorizer.transform([test_query])
test_query_sparse_vector.shape[1]  # same width as the document matrix

185
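
The query vector has the same 185 columns as the document matrix because `transform` reuses the vocabulary learned by `fit_transform`; query words outside that vocabulary are dropped. A simplified sketch of the idea, assuming whitespace tokenization and raw counts instead of TF-IDF weights; `fit_vocabulary` and `to_count_vector` are hypothetical helpers:

```python
def fit_vocabulary(texts):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({token for text in texts for token in text.lower().split()})
    return {token: index for index, token in enumerate(tokens)}

def to_count_vector(text, vocabulary):
    """Count known tokens; tokens missing from the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector

vocabulary = fit_vocabulary(["the cat sat", "the dog ran"])
query_vector = to_count_vector("the bird sat", vocabulary)

print(len(vocabulary), len(query_vector))  # 5 5
print(query_vector)                        # [0, 0, 0, 1, 1]
```

Here `bird` contributes nothing because it never appeared in the fitted corpus, which mirrors what happens to most words of `test_query`.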

In [8]:
keyword_similarities = cosine_similarity(sparse_vectors,test_query_sparse_vector)
keyword_similarities

array([[0.16045258],
       [0.14610709],
       [0.14092294],
       [0.20090165]])
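
None of the four documents contains `machine`, `learning`, `subset`, `artificial`, or `intelligence`, so the nonzero scores above can only come from function words the query shares with them. A minimal sketch that checks this on an excerpt of the fourth document, assuming a tokenizer that mimics the default `TfidfVectorizer` pattern (tokens of two or more word characters); `shared_tokens` is a hypothetical helper:

```python
import re

def shared_tokens(query, document):
    """Return the sorted tokens that occur in both texts."""
    def tokenize(text):
        # Tokens of two or more word characters, lowercased.
        return set(re.findall(r"\b\w\w+\b", text.lower()))
    return sorted(tokenize(query) & tokenize(document))

test_query = "machine learning is subset of artificial intelligence"
excerpt = "The 'January 2026 AI Revolution' is transforming the workforce with the rise of 'Agentic AI'"

print(shared_tokens(test_query, excerpt))  # ['is', 'of']
```

The keyword ranking therefore reflects how the weights of `is` and `of` fall in each document rather than topical relevance. Passing `stop_words='english'` to `TfidfVectorizer` is one option for filtering such words before scoring.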

In [9]:
keyword_similarities[0]

array([0.16045258])

In [10]:
ranked_keyword_indices = numpy.argsort(keyword_similarities[:,0])[::-1]
for index in ranked_keyword_indices:
  print(f"Document: {documents[index]}")
  print(f"Similarity Score: {keyword_similarities[index][0]}")
  print("--------------------------------------------------")

Document: The 'January 2026 AI Revolution' is transforming the workforce with the rise of 'Agentic AI'—autonomous systems capable of self-verification and long-term goal planning. Boston Dynamics has officially moved its humanoid robot, Atlas, into field tests at Hyundai factories, marking a 'ChatGPT moment' for physical robotics. Meanwhile, OpenAI's release of GPT-5.2 has set a new benchmark for professional reasoning and autonomous coding capabilities.
Similarity Score: 0.2009016470281335
--------------------------------------------------
Document: The sports world is centered on Melbourne for the 2026 Australian Open, where tennis elites are battling for the first Grand Slam of the year. Meanwhile, the 'Luke Littler era' in darts has been solidified after his stunning victory at the 2026 PDC World Darts Championship. In cricket, anticipation is building for the Under-19 Men’s Cricket World Cup, which kicked off in mid-January, showcasing the next generation of global talent.
Similar

#### Semantic Search

In [11]:
!python -m pip install tf-keras -q

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

In [13]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

In [18]:
dense_vectors = embedding_model.encode(prepocessed_documents)
dense_vectors[0].shape

(384,)

In [19]:
test_query_dense_vector = embedding_model.encode([test_query])
test_query_dense_vector.shape

(1, 384)

In [20]:
semantic_similarities = cosine_similarity(dense_vectors,test_query_dense_vector)
semantic_similarities

array([[ 0.04144429],
       [ 0.02646324],
       [-0.02699398],
       [ 0.3561782 ]], dtype=float32)
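
Cosine similarity compares vector directions rather than magnitudes, which is why a dense score can be negative (as for the third document) while TF-IDF scores, built from non-negative weights, cannot. A minimal pure-Python sketch of the formula, assuming plain lists of floats; `cosine` is a hypothetical helper, not the scikit-learn implementation:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```

The three calls cover the unrelated, aligned, and opposed cases, which score at or near 0, 1, and -1 respectively.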

In [21]:
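# Rank by the semantic scores; the earlier version reused the keyword scores here.
ranked_semantic_indices = np.argsort(semantic_similarities[:,0])[::-1]
for index in ranked_semantic_indices:
  print(f"Document: {documents[index]}")
  print(f"Similarity Score: {semantic_similarities[index][0]:.4f}")
  print("--------------------------------------------------")

Document: The 'January 2026 AI Revolution' is transforming the workforce with the rise of 'Agentic AI'—autonomous systems capable of self-verification and long-term goal planning. Boston Dynamics has officially moved its humanoid robot, Atlas, into field tests at Hyundai factories, marking a 'ChatGPT moment' for physical robotics. Meanwhile, OpenAI's release of GPT-5.2 has set a new benchmark for professional reasoning and autonomous coding capabilities.
Similarity Score: 0.3562
--------------------------------------------------
Document: The sports world is centered on Melbourne for the 2026 Australian Open, where tennis elites are battling for the first Grand Slam of the year. Meanwhile, the 'Luke Littler era' in darts has been solidified after his stunning victory at the 2026 PDC World Darts Championship. In cricket, anticipation is building for the Under-19 Men’s Cricket World Cup, which kicked off in mid-January, showcasing the next generation of global talent.
Similarity Score: 0.0414
--------------------------------------------------
Document: European football is hitting a fever pitch as the UEFA Champions League league phase reaches its dramatic finale. A major storyline has emerged at Benfica, where legendary manager José Mourinho is set to face his former club, Real Madrid, in a high-stakes 'must-win' encounter. This comes amidst a turbulent season for Real Madrid, who recently appointed new leadership after the departure of Xabi Alonso.
Similarity Score: 0.0265
--------------------------------------------------
Document: Geopolitical tensions are high following the return of Donald Trump to the U.S. presidency. Significant shifts in international relations are underway, notably with Mexico's President Claudia Sheinbaum cancelling oil shipments to Cuba amid U.S. pressure. In Asia, China's military leadership is facing a massive shakeup as top general Zhang Youxia, a long-time ally of Xi Jinping, has been placed under investigation for alleged corruption and security leaks.
Similarity Score: -0.0270
--------------------------------------------------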
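
Both ranking cells repeat the same sort-and-print loop, which makes it easy to pass the wrong score array. A minimal sketch of a shared helper, assuming the scores are given as a flat sequence (for example `semantic_similarities[:, 0]`); `rank_documents` and the short labels are hypothetical stand-ins for the full documents:

```python
def rank_documents(documents, scores, top_k=None):
    """Return (document, score) pairs sorted from most to least similar."""
    pairs = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return pairs if top_k is None else pairs[:top_k]

# Short labels stand in for the four documents; scores are the values printed above.
labels = ["sports", "football", "geopolitics", "ai"]
keyword_scores = [0.16045258, 0.14610709, 0.14092294, 0.20090165]
semantic_scores = [0.04144429, 0.02646324, -0.02699398, 0.3561782]

for name, scores in [("keyword", keyword_scores), ("semantic", semantic_scores)]:
    print(name, [label for label, _ in rank_documents(labels, scores)])
```

Both methods produce the same order on this corpus, but the semantic scores separate the AI document from the rest far more clearly (0.36 against at most 0.04) than the keyword scores do (0.20 against 0.14 to 0.16).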