# Import pre-vectorized Wikipedia data

Learn how to import large datasets with pre-computed embeddings. This notebook shows how to work with the Weaviate Wikipedia dataset that includes Snowflake Arctic embeddings.

## Connect to Weaviate

Connect to a Weaviate instance for importing pre-vectorized data.

In [1]:
# Refresh credentials & load the Weaviate IP
from helpers import update_creds

AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_SESSION_TOKEN = update_creds()

%store -r WEAVIATE_IP

In [2]:
import weaviate

client = weaviate.connect_to_local(
    host=WEAVIATE_IP,
    headers={
        "X-AWS-Access-Key": AWS_ACCESS_KEY,
        "X-AWS-Secret-Key": AWS_SECRET_KEY,
        "X-AWS-Session-Token": AWS_SESSION_TOKEN,
    },
)

client.is_ready()

True

## Create Wikipedia collection

Create a collection configured for pre-vectorized Wikipedia data with named vectors.

In [3]:
from weaviate.classes.config import Configure, Property, DataType

def create_wiki_collection():
    # Delete existing collection if it exists
    if client.collections.exists("Wiki"):
        client.collections.delete("Wiki")

    # Create collection for pre-vectorized Wikipedia data
    client.collections.create(
        name="Wiki",

        # Named vector for the pre-computed embeddings. Imported objects supply
        # their own vectors; the model here matches the one used to create them.
        vector_config=[
            Configure.Vectors.text2vec_weaviate(
                name="main_vector",
                model="Snowflake/snowflake-arctic-embed-l-v2.0",
                source_properties=['title', 'text']  # Properties used for embedding
            )
        ],

        # Define the schema for Wikipedia articles
        properties=[
            Property(name="wiki_id", data_type=DataType.TEXT),
            Property(name="title", data_type=DataType.TEXT),
            Property(name="text", data_type=DataType.TEXT),
            Property(name="url", data_type=DataType.TEXT),
        ],
    )

    print("Created Wiki collection with Snowflake Arctic embeddings")

create_wiki_collection()

Created Wiki collection with Snowflake Arctic embeddings


## Load pre-vectorized Wikipedia dataset

Load the Weaviate Wikipedia dataset with pre-computed Snowflake Arctic embeddings.

[Dataset source](https://huggingface.co/datasets/weaviate/wiki-sample)

In [8]:
%%bash
python prep-data.py

Downloading weaviate/snowflake-arctic-v2/0001.parquet
Downloading weaviate/snowflake-arctic-v2/0002.parquet
Downloading weaviate/snowflake-arctic-v2/0003.parquet
Downloading weaviate/snowflake-arctic-v2/0004.parquet
Downloading weaviate/snowflake-arctic-v2/0005.parquet
Downloading weaviate/snowflake-arctic-v2/0006.parquet
Downloading weaviate/snowflake-arctic-v2/0007.parquet
Downloading weaviate/snowflake-arctic-v2/0008.parquet
Downloading weaviate/snowflake-arctic-v2/0009.parquet
Downloading weaviate/snowflake-arctic-v2/0010.parquet


In [9]:
from datasets import load_dataset

def prepare_dataset():
    """Load the pre-vectorized Wikipedia dataset"""
    return load_dataset(
        'parquet',
        data_files={'train': ['wiki-data/weaviate/snowflake-arctic-v2/*.parquet']},
        split="train",
        streaming=True
    )

print("Dataset loading function prepared")

# Create the streaming dataset (rows are read lazily when iterated)
dataset = prepare_dataset()

Dataset loading function prepared


Using custom data configuration default-bbe2e31989cedbce


### Preview the dataset structure

In [10]:
print("Sample Wikipedia articles with pre-computed embeddings:")
print("=" * 60)

counter = 3
for item in dataset:
    print(f"\nTitle: {item['title']}")
    print(f"Wiki ID: {item['wiki_id']}")
    print(f"Text preview: {item['text'][:100]}...")
    print(f"Vector dimensions: {len(item['vector'])}")
    print(f"URL: {item['url']}")

    counter -= 1
    if counter == 0:
        break

Sample Wikipedia articles with pre-computed embeddings:

Title: Unicode
Wiki ID: 20231101.simple_64846_4
Text preview: The Unicode Standard includes more than just the base code. Alongside the character encodings, the C...
Vector dimensions: 1024
URL: https://simple.wikipedia.org/wiki/Unicode

Title: Book of Genesis
Wiki ID: 20231101.simple_11278_4
Text preview: The people of the world attempted to build a high tower (Tower of Babel) to show the power of mankin...
Vector dimensions: 1024
URL: https://simple.wikipedia.org/wiki/Book%20of%20Genesis

Title: Rock Demers
Wiki ID: 20231101.simple_864656_0
Text preview: Rock Demers,  (December 11, 1933 – August 17, 2021) was a Canadian movie producer.  He was the found...
Vector dimensions: 1024
URL: https://simple.wikipedia.org/wiki/Rock%20Demers
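
The manual counter in the preview loop can also be written with `itertools.islice`, which stops pulling rows from a streaming iterable after a fixed number of items. The following is a minimal sketch, not part of the original notebook: `fake_stream` is a hypothetical generator standing in for the real dataset so the example runs without the parquet files.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for the streaming dataset: yields dict rows lazily."""
    i = 0
    while True:
        yield {
            "wiki_id": f"example_{i}",      # made-up ID, for illustration only
            "title": f"Title {i}",
            "vector": [0.0] * 1024,         # same length as the vectors shown above
        }
        i += 1

def preview(stream, n=3):
    """Return the first n rows without consuming the rest of the stream."""
    return list(islice(stream, n))

rows = preview(fake_stream(), n=3)
for row in rows:
    print(row["title"], len(row["vector"]))
```

Because `islice` is lazy, the generator is never asked for a fourth row, which is the same behavior the `break` provides in the loop above.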


## Import Wikipedia data with vectors

Import the pre-vectorized Wikipedia articles in batches, passing each article's stored vector alongside its properties.

In [11]:
from tqdm import tqdm
from weaviate.util import generate_uuid5

def import_wiki_data(max_rows=25000):
    """Import Wikipedia articles with their pre-computed vectors"""
    print(f"Importing up to {max_rows:,} Wikipedia articles with embeddings...")

    dataset = prepare_dataset()
    wiki = client.collections.use("Wiki")

    counter = 0
    error_threshold = 10

    with wiki.batch.fixed_size(batch_size=500, concurrent_requests=2) as batch:
        for item in tqdm(dataset, total=max_rows, desc="Importing articles"):

            # Prepare the article data
            article_data = {
                "wiki_id": item["wiki_id"],
                "text": item["text"],
                "title": item["title"],
                "url": item["url"],
            }

            # Generate consistent UUID from wiki_id
            article_uuid = generate_uuid5(item["wiki_id"])

            # Prepare the pre-computed vector
            article_vector = {
                "main_vector": item["vector"]
            }

            # Add to batch
            batch.add_object(
                properties=article_data,
                uuid=article_uuid,
                vector=article_vector
            )

            # Count every object handed to the batch
            counter += 1

            # Check for errors during import
            if batch.number_errors > error_threshold:
                print(f"\nStopping import: reached {batch.number_errors} errors")
                break

            # Stop when reaching max_rows limit
            if counter >= max_rows:
                break

    # Final error check
    failed_objects = wiki.batch.failed_objects
    if len(failed_objects) > 0:
        print(f"\nImport completed with {len(failed_objects)} errors")
        print("Sample error:", failed_objects[-1])
    else:
        print("\nImport completed successfully with no errors")

    # Report only the objects that were added and did not fail
    imported = counter - len(failed_objects)
    print(f"Successfully imported {imported:,} Wikipedia articles")
    return imported

# Import the dataset
imported_count = import_wiki_data(25000)

Importing up to 25,000 Wikipedia articles with embeddings...


Using custom data configuration default-bbe2e31989cedbce
Importing articles: 100%|█████████▉| 24999/25000 [00:37<00:00, 657.93it/s] 



Import completed successfully with no errors
Successfully imported 25,000 Wikipedia articles
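
The import derives each object's UUID from its `wiki_id`, so a repeated import targets the same object IDs instead of generating fresh ones. The sketch below illustrates that property with the standard library only. It is a stand-in, not the implementation of `generate_uuid5`: the helper name `stable_uuid` and the choice of `uuid.NAMESPACE_DNS` are assumptions made for illustration.

```python
import uuid

def stable_uuid(identifier: str) -> str:
    """Derive a deterministic UUID from a string identifier.

    Assumption: NAMESPACE_DNS is used here only for illustration; the
    namespace used by the Weaviate client may differ.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, identifier))

first = stable_uuid("20231101.simple_64846_4")
second = stable_uuid("20231101.simple_64846_4")
other = stable_uuid("20231101.simple_11278_4")

print(first == second)  # same wiki_id, same UUID
print(first == other)   # different wiki_id, different UUID
```

Random UUIDs would give every run a new set of IDs, which makes it harder to tell whether an article is already in the collection.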


## Verify the import

Check that articles were imported correctly with their embeddings.

In [12]:
# Check total count in collection
wiki = client.collections.use("Wiki")
total_articles = len(wiki)

print(f"Total articles in Wiki collection: {total_articles:,}")
print(f"Expected articles imported: {imported_count:,}")
print(f"Import success rate: {(total_articles/imported_count*100):.1f}%" if imported_count > 0 else "No articles to compare")

Total articles in Wiki collection: 25,000
Expected articles imported: 25,000
Import success rate: 100.0%


In [13]:
# Verify article content and vectors
response = wiki.query.fetch_objects(limit=2, include_vector=True)

print("Sample imported articles:")
print("=" * 50)

for i, article in enumerate(response.objects, 1):
    props = article.properties
    vector = article.vector

    print(f"\nArticle {i}:")
    print(f"  Title: {props['title']}")
    print(f"  Wiki ID: {props['wiki_id']}")
    print(f"  Text preview: {props['text'][:100]}...")
    print(f"  Vector dimensions: {len(vector['main_vector'])}")
    print(f"  Vector sample: {vector['main_vector'][:5]}...")

Sample imported articles:

Article 1:
  Title: Unicode
  Wiki ID: 20231101.simple_64846_4
  Text preview: The Unicode Standard includes more than just the base code. Alongside the character encodings, the C...
  Vector dimensions: 1024
  Vector sample: [-0.0174560546875, 0.041229248046875, -0.050750732421875, 0.03729248046875, 0.03704833984375]...

Article 2:
  Title: Book of Genesis
  Wiki ID: 20231101.simple_11278_4
  Text preview: The people of the world attempted to build a high tower (Tower of Babel) to show the power of mankin...
  Vector dimensions: 1024
  Vector sample: [0.04656982421875, 0.08062744140625, 0.031402587890625, -0.0224761962890625, -0.0400390625]...
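
The check above inspects two stored objects by hand. The same dimension check can be applied to rows before they are added to a batch, so malformed rows are caught early. This is a minimal sketch under stated assumptions: `has_expected_dim` is a hypothetical helper, the rows are made-up examples, and the expected length of 1024 is taken from the output shown above.

```python
def has_expected_dim(row: dict, expected_dim: int = 1024) -> bool:
    """Return True if the row carries a vector of the expected length."""
    vector = row.get("vector")
    return vector is not None and len(vector) == expected_dim

# Made-up rows for illustration only
good = {"wiki_id": "a", "vector": [0.1] * 1024}
short = {"wiki_id": "b", "vector": [0.1] * 768}
missing = {"wiki_id": "c"}

print([has_expected_dim(r) for r in (good, short, missing)])
# prints [True, False, False]
```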


## Quick search test

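Run a semantic search against the collection. `near_text` embeds the query string with the collection's configured vectorizer, so the call succeeds only when that vectorizer can authenticate. The output below shows the failure when no credentials for it are available.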

In [14]:
# Test semantic search on the imported data
response = wiki.query.near_text(
    query="space exploration NASA",
    limit=3,
    target_vector="main_vector"
)

print("Search results for 'space exploration NASA':")
print("=" * 45)

for i, article in enumerate(response.objects, 1):
    print(f"\n{i}. {article.properties['title']}")
    print(f"   Content: {article.properties['text'][:120]}...")
    print(f"   URL: {article.properties['url']}")

WeaviateQueryError: Query call with protocol GRPC search failed with message explorer: get class: concurrentTargetVectorSearch): explorer: get class: vectorize search vector: vectorize params: vectorize params: vectorize keywords: remote client vectorize: authentication token: neither authentication token found in request header: Authorization nor api key in environment variable under WEAVIATE_APIKEY.
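
The error is raised while vectorizing the query text, not while searching the stored vectors. One way to exercise the index without query-time vectorization is to search with a vector that is already stored in the collection. The sketch below assumes the v4 Python client's `fetch_objects` and `near_vector` query methods; `more_like_first` is a hypothetical helper name, and the function was not run as part of the original notebook.

```python
def more_like_first(collection, vector_name="main_vector", limit=3):
    """Search with the stored vector of the first fetched object.

    No query text is embedded, so the configured vectorizer is not called.
    """
    # Fetch one object together with its named vector
    seed = collection.query.fetch_objects(limit=1, include_vector=True).objects[0]

    # Use that stored vector as the query vector
    response = collection.query.near_vector(
        near_vector=seed.vector[vector_name],
        target_vector=vector_name,
        limit=limit,
    )
    return [obj.properties["title"] for obj in response.objects]

# Usage in this notebook (assumes the `wiki` handle created earlier):
# print(more_like_first(wiki))
```

The seed object itself is expected to appear among the results, since its distance to its own vector is zero.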

## Summary

This notebook demonstrated:
- Creating collections for pre-vectorized data
- Loading large datasets with embeddings from external sources
- Efficient batch import with error handling
- Verifying the imported data and running a test search

The imported Wikipedia collection can be used for search and RAG applications once query-time vectorization is authenticated, as the `near_text` error above shows.

## Close the client

Always close your connection when finished.

In [None]:
client.close()