# Search Wikipedia data - Vector queries

Explore semantic search capabilities using the imported Wikipedia collection with pre-computed embeddings.

## Connect to Weaviate

Connect to the Weaviate instance containing our Wikipedia collection.

In [None]:
# Refresh credentials & load the Weaviate IP
from helpers import update_creds

AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_SESSION_TOKEN = update_creds()

%store -r WEAVIATE_IP

In [1]:
import weaviate

client = weaviate.connect_to_local(
    host=WEAVIATE_IP,
    headers={
        "X-AWS-Access-Key": AWS_ACCESS_KEY,
        "X-AWS-Secret-Key": AWS_SECRET_KEY,
        "X-AWS-Session-Token": AWS_SESSION_TOKEN,
    },
)

client.is_ready()

True

## Verify Wikipedia collection

Check that our Wikipedia collection is available and populated.

In [None]:
# Get collection and check size
wiki = client.collections.use("Wiki")
total_articles = len(wiki)

print(f"Wikipedia collection contains {total_articles:,} articles")

# Quick preview of collection contents
if total_articles > 0:
    sample = wiki.query.fetch_objects(limit=1)
    article = sample.objects[0].properties
    print(f"\nSample article: '{article['title']}'")
    print(f"Content preview: {article['text'][:100]}...")
else:
    print("⚠️  No articles found. Please run 2.2-wiki-import-complete.ipynb first.")

Wikipedia collection contains 25,000 articles

Sample article: 'Unicode'
Content preview: The Unicode Standard includes more than just the base code. Alongside the character encodings, the C...


## Basic semantic search

Perform vector-based searches on Wikipedia articles.

In [3]:
# Search for articles about musical instruments
response = wiki.query.near_text(
    query="musical instruments",
    limit=5,
    target_vector="main_vector"
)

print("🎵 Articles about musical instruments:")
print("=" * 40)

for i, article in enumerate(response.objects, 1):
    props = article.properties
    print(f"\n{i}. {props['title']}")
    print(f"   Preview: {props['text'][:120]}...")
    print(f"   URL: {props['url']}")

🎵 Articles about musical instruments:

1. Ensemble
   Preview: The term is most used in music. A musical ensemble is a group of people who perform instrumental or vocal music. In clas...
   URL: https://simple.wikipedia.org/wiki/Ensemble

2. Saxophone
   Preview: A saxophone is a type of musical instrument in the woodwind family. The saxophone uses a piece of wood, called a reed, t...
   URL: https://simple.wikipedia.org/wiki/Saxophone

3. Arrangement (music)
   Preview: Arrangements are often made by people who play instruments that have not had much music written for them.  People who pl...
   URL: https://simple.wikipedia.org/wiki/Arrangement%20%28music%29

4. Cuban music
   Preview: The African slaves and their descendants made many percussion instruments and preserved rhythms they had known in their ...
   URL: https://simple.wikipedia.org/wiki/Cuban%20music

5. Percussion instrument
   Preview: Untuned percussion instruments include: bass drum, side drum (snare drum), maracas,
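
Conceptually, `near_text` embeds the query and returns the objects whose vectors lie closest to it. The sketch below illustrates only that ranking step, using a brute-force cosine-distance scan in plain Python. The three-dimensional vectors are made up for illustration and are not taken from the Wiki collection; real embeddings have far more dimensions, and Weaviate searches an approximate nearest-neighbor index rather than scanning every object.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query_vec, docs, limit=2):
    """Rank (title, vector) pairs by cosine distance to query_vec."""
    scored = [(title, cosine_distance(query_vec, vec)) for title, vec in docs]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:limit]

# Hypothetical toy embeddings (not real vectors from the collection)
toy_docs = [
    ("Saxophone", [0.9, 0.1, 0.0]),
    ("Rome", [0.0, 0.2, 0.9]),
    ("Percussion instrument", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # stands in for the embedding of "musical instruments"

for title, dist in nearest(toy_query, toy_docs):
    print(f"{title}: distance {dist:.3f}")
```

With these toy vectors the two instrument articles rank ahead of "Rome", because their vectors point in nearly the same direction as the query vector.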

## Advanced search examples

Explore different types of queries and topics.

### Science and Technology

In [4]:
response = wiki.query.near_text(
    query="artificial intelligence machine learning",
    limit=4,
    target_vector="main_vector"
)

print("🤖 AI and Machine Learning articles:")
for i, article in enumerate(response.objects, 1):
    print(f"{i}. {article.properties['title']}")

🤖 AI and Machine Learning articles:
1. Natural language generation
2. Artificial neural network
3. Machine code
4. Intelligence


### Historical Topics

In [5]:
response = wiki.query.near_text(
    query="ancient Rome Roman Empire history",
    limit=4,
    target_vector="main_vector"
)

print("🏛️  Roman Empire and ancient history:")
for i, article in enumerate(response.objects, 1):
    print(f"{i}. {article.properties['title']}")

🏛️  Roman Empire and ancient history:
1. Religion in ancient Rome
2. Rome
3. Byzantine Empire
4. Roman Republic


### Natural Sciences

In [6]:
response = wiki.query.near_text(
    query="biology evolution species animals",
    limit=4,
    target_vector="main_vector"
)

print("🧬 Biology and evolution articles:")
for i, article in enumerate(response.objects, 1):
    print(f"{i}. {article.properties['title']}")

🧬 Biology and evolution articles:
1. Charles Darwin
2. Biological species concept
3. Adaptation
4. Mammalia (taxonomy)


## Search with metadata

Request the vector distance for each result and convert it to a similarity score.

In [7]:
from weaviate.classes.query import MetadataQuery

response = wiki.query.near_text(
    query="space exploration NASA astronauts",
    limit=5,
    target_vector="main_vector",
    return_metadata=MetadataQuery(distance=True)
)

print("🚀 Space exploration articles with similarity scores:")
print("=" * 55)

for i, article in enumerate(response.objects, 1):
    props = article.properties
    distance = article.metadata.distance
    similarity = 1 - distance  # Convert distance to similarity (assumes cosine distance)

    print(f"\n{i}. {props['title']}")
    print(f"   Similarity: {similarity:.3f} (distance: {distance:.3f})")
    print(f"   Preview: {props['text'][:100]}...")

🚀 Space exploration articles with similarity scores:

1. University of Southern California
   Similarity: 0.415 (distance: 0.585)
   Preview: Neil Armstrong (1960s and 1970, graduate studies) (M.S. Aerospace Engineering) – astronaut and the f...

2. Outer planet
   Similarity: 0.393 (distance: 0.607)
   Preview: Trautman and Bethke – Human Outer Planet Exploration (2003) – NASA Langley Research Center and Princ...

3. Scott Kelly (astronaut)
   Similarity: 0.393 (distance: 0.607)
   Preview: Scott Joseph Kelly (born February 21, 1964) is an American engineer, retired astronaut, and naval av...

4. Edward Gibson
   Similarity: 0.391 (distance: 0.609)
   Preview: Edward George "Ed" Gibson (born November 8, 1936) is a former NASA astronaut, pilot, engineer, and p...

5. Ronald McNair
   Similarity: 0.389 (distance: 0.611)
   Preview: Ronald Erwin McNair (October 21, 1950 – January 28, 1986) was an American NASA astronaut and physici...
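
Reading `1 - distance` as a similarity only makes sense for cosine distance. The collection's distance metric is not shown in this notebook, so that is an assumption. Under it, distance runs from 0 (same direction) to 2 (opposite direction), so the derived similarity runs from 1 down to -1. The sketch below applies a client-side distance cutoff to the titles and distances printed above. `filter_by_distance` is a hypothetical helper, not part of the Weaviate client.

```python
def filter_by_distance(results, max_distance):
    """Keep (title, distance) pairs at or below max_distance, adding similarity."""
    kept = []
    for title, distance in results:
        if distance is None:  # distance metadata was not requested
            continue
        if distance <= max_distance:
            kept.append((title, distance, 1.0 - distance))
    return kept

# Titles and distances copied from the output above
results = [
    ("University of Southern California", 0.585),
    ("Outer planet", 0.607),
    ("Scott Kelly (astronaut)", 0.607),
    ("Edward Gibson", 0.609),
    ("Ronald McNair", 0.611),
]

for title, distance, similarity in filter_by_distance(results, max_distance=0.608):
    print(f"{title}: similarity {similarity:.3f}")
```

`near_text` also accepts a `distance` argument, which applies a comparable cutoff on the server side instead of after the results come back.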


## Comparative searches

Compare search results for related but different topics.

In [8]:
queries = [
    "classical music composers",
    "jazz music musicians",
    "rock music bands"
]

print("🎼 Comparing different music genres:")
print("=" * 40)

for query in queries:
    response = wiki.query.near_text(
        query=query,
        limit=3,
        target_vector="main_vector"
    )

    print(f"\n📝 Query: '{query}'")
    for i, article in enumerate(response.objects, 1):
        print(f"  {i}. {article.properties['title']}")

🎼 Comparing different music genres:

📝 Query: 'classical music composers'
  1. Karl Böhm
  2. Oliver Wallace
  3. Richard Wagner

📝 Query: 'jazz music musicians'
  1. Music of Australia
  2. Bill Tapia
  3. Toots Thielemans

📝 Query: 'rock music bands'
  1. Rock music
  2. New wave music
  3. The Rolling Stones
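
One simple way to quantify how differently the three queries behave is to measure the overlap between their result sets. The sketch below computes the Jaccard overlap of the titles printed above. With only three results per query this is a coarse signal, and `jaccard` is a hypothetical helper introduced for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap: size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Titles copied from the output above
results_by_query = {
    "classical music composers": ["Karl Böhm", "Oliver Wallace", "Richard Wagner"],
    "jazz music musicians": ["Music of Australia", "Bill Tapia", "Toots Thielemans"],
    "rock music bands": ["Rock music", "New wave music", "The Rolling Stones"],
}

# Compare every pair of queries
for (q1, r1), (q2, r2) in combinations(results_by_query.items(), 2):
    print(f"{q1!r} vs {q2!r}: overlap {jaccard(r1, r2):.2f}")
```

All three pairs share no titles, so each overlap is zero. That is consistent with the queries landing in distinct regions of the embedding space.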


## Explore collection diversity

Randomly sample from the first 20 articles returned to get a sense of the breadth of topics in our Wikipedia collection.

In [9]:
# Randomly pick 8 of the first 20 objects returned
import random

response = wiki.query.fetch_objects(limit=20)
sample_articles = random.sample(response.objects, min(8, len(response.objects)))

print("🌍 Random sample of Wikipedia articles in our collection:")
print("=" * 60)

for i, article in enumerate(sample_articles, 1):
    props = article.properties
    print(f"\n{i}. {props['title']}")
    print(f"   Wiki ID: {props['wiki_id']}")
    print(f"   Content: {props['text'][:80]}...")

🌍 Random sample of Wikipedia articles in our collection:

1. Taliban insurgency
   Wiki ID: 20231101.simple_586827_4
   Content: Areas where the security situation is worse produce more Opium; areas that are m...

2. Len Goodman
   Wiki ID: 20231101.simple_963722_0
   Content: Leonard "Len" Gordon Goodman (25 April 1944 – 22 April 2023) was an English prof...

3. Nigel Haywood
   Wiki ID: 20231101.simple_1033374_0
   Content: Nigel Robert Haywood  (born 17 March 1955) is a British diplomat. He was the Bri...

4. Habesha peoples
   Wiki ID: 20231101.simple_839274_8
   Content: Egyptian inscriptions refer to the people that they traded with in Punt as , "th...

5. One Wall Centre
   Wiki ID: 20231101.simple_458654_0
   Content: One Wall Centre, also known as the Sheraton Vancouver Wall Centre Hotel, is a mi...

6. Satoshi Kukino
   Wiki ID: 20231101.simple_241540_0
   Content: |2006||rowspan="4"|Kawasaki Frontale||rowspan="4"|J. League 1||0||0||0||0||0||0|...

7. Hilary Hahn
   Wiki
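
Because `fetch_objects(limit=20)` only ever sees the first 20 objects, a sample that spans the whole collection needs a different approach. A minimal sketch of one option is to pick random offsets and fetch one object per offset. `random_offsets` and its `max_offset` parameter are hypothetical. The cap is there because servers typically limit how deep offset pagination can go, and the value 10,000 used below is an assumption, not a setting read from this instance.

```python
import random

def random_offsets(total, k, max_offset=None, seed=None):
    """Pick up to k distinct random offsets in [0, min(total, max_offset))."""
    upper = total if max_offset is None else min(total, max_offset)
    rng = random.Random(seed)  # local RNG so a seed gives repeatable picks
    return sorted(rng.sample(range(upper), min(k, upper)))

offsets = random_offsets(total=25_000, k=8, max_offset=10_000, seed=42)
print(offsets)

# Each offset could then be fetched individually, for example:
#   wiki.query.fetch_objects(limit=1, offset=offset)
```

This trades one request for eight, which is fine for exploration but would be slow for large samples.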

## Search quality analysis

Spot-check search quality with specific queries from different domains.

In [10]:
test_queries = [
    "programming languages Python Java",
    "European countries geography",
    "Olympic sports athletics swimming",
    "renewable energy solar wind power"
]

print("🔍 Testing search quality across different domains:")
print("=" * 55)

for query in test_queries:
    response = wiki.query.near_text(
        query=query,
        limit=2,
        target_vector="main_vector",
        return_metadata=MetadataQuery(distance=True)
    )

    print(f"\n🔎 '{query}'")
    for i, article in enumerate(response.objects, 1):
        similarity = 1 - article.metadata.distance
        print(f"  {i}. {article.properties['title']} (similarity: {similarity:.3f})")

🔍 Testing search quality across different domains:

🔎 'programming languages Python Java'
  1. Object-oriented programming (similarity: 0.365)
  2. PHP (similarity: 0.344)

🔎 'European countries geography'
  1. Western Europe (similarity: 0.510)
  2. Balkans (similarity: 0.442)

🔎 'Olympic sports athletics swimming'
  1. Olympic Games (similarity: 0.491)
  2. Michael Phelps (similarity: 0.481)

🔎 'renewable energy solar wind power'
  1. Solar energy (similarity: 0.512)
  2. Sustainable energy (similarity: 0.443)
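
The similarity scores above show how close the top hits are to each query, but not whether those hits are relevant. Measuring precision requires relevance judgments. The sketch below computes precision at k against a hand-labeled set. The relevance labels are hypothetical and chosen only to illustrate the calculation.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved titles that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for title in retrieved[:k] if title in relevant)
    return hits / k

# Retrieved titles copied from the output above
retrieved = {
    "programming languages Python Java": ["Object-oriented programming", "PHP"],
    "renewable energy solar wind power": ["Solar energy", "Sustainable energy"],
}

# Hypothetical relevance judgments, for illustration only
relevant = {
    "programming languages Python Java": {"PHP", "Python (programming language)"},
    "renewable energy solar wind power": {"Solar energy", "Sustainable energy"},
}

for query, titles in retrieved.items():
    score = precision_at_k(titles, relevant[query], k=2)
    print(f"{query!r}: precision@2 = {score:.2f}")
```

Under these made-up labels the programming query scores 0.50 and the energy query scores 1.00. Real labels would have to come from a reviewer or a benchmark.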


## Summary

This notebook demonstrated:
- Basic semantic search on Wikipedia articles
- Advanced queries across different topics and domains
- Similarity scoring and metadata analysis
- Comparative searches to understand semantic relationships
- Collection exploration and quality assessment

The pre-vectorized Wikipedia collection provides a rich dataset for testing various vector search scenarios and understanding semantic similarity in practice.

## Close the client

Always close your connection when finished.

In [11]:
client.close()
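
To make sure the connection is closed even if a cell raises an exception, the close call can be tied to a `with` block. The sketch below uses `contextlib.closing`, which works with any object that has a `close()` method. `FakeClient` is a hypothetical stand-in so the snippet runs without a Weaviate server.

```python
from contextlib import closing

class FakeClient:
    """Hypothetical stand-in for a client object that exposes close()."""

    def __init__(self):
        self.closed = False

    def is_ready(self):
        return not self.closed

    def close(self):
        self.closed = True

fake = FakeClient()
try:
    with closing(fake) as client:
        assert client.is_ready()
        raise RuntimeError("simulated failure mid-notebook")
except RuntimeError:
    pass

print(fake.closed)  # True: close() ran despite the exception
```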