# Wikipedia with custom vectors

Working through the Weaviate tutorial with a larger dataset (25k articles from Simple English Wikipedia).

Follow the links on [this page](https://weaviate.io/developers/weaviate/tutorials/wikipedia) to download the dataset.

### Setup

In [8]:
import os
import ast
import json
import requests
import weaviate

import pandas as pd

from dotenv import load_dotenv, find_dotenv

In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file

weaviate_url = os.getenv("WEAVIATE_URL") 
weaviate_key = os.getenv("WEAVIATE_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
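If one of these variables is missing from `.env`, `os.getenv` quietly returns `None`, and the problem only shows up later when the client tries to connect. A minimal sketch of an early check, assuming the three variable names used above; the helper name `missing_env_vars` is hypothetical and not part of any library:

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Variable names taken from the cell above
required = ["WEAVIATE_URL", "WEAVIATE_API_KEY", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing `env` explicitly makes the helper easy to try against a plain dictionary without touching the real environment.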

In [3]:
# Connect to local Weaviate instance running in docker
weaviate_client = weaviate.Client(
    url=weaviate_url,  
    auth_client_secret=weaviate.auth.AuthApiKey(api_key=weaviate_key),  
    additional_headers={
        "X-OpenAI-Api-Key": openai_key
    }
)
weaviate_client.is_ready()

True

### Create the schema

In [4]:
generation_config = {
    "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": False
        }
}

In [5]:
article_class = {
    "class": "Article",
    "description": "An article from the Simple English Wikipedia data set",
    "vectorizer": "text2vec-openai",
    "moduleConfig": generation_config,
    "properties": [
        {
            "name": "title",
            "description": "The title of the article",
            "dataType": ["text"],
            # Don't vectorize the title
            "moduleConfig": {"text2vec-openai": {"skip": True}}
        },
        {
            "name": "content",
            "description": "The content of the article",
            "dataType": ["text"],
        }
    ]
}

# Add the Article class to the schema
weaviate_client.schema.create_class(article_class)
print('Created schema')

Created schema
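The per-property `skip` flag is easy to overlook in a larger class definition. As a rough sketch, the hypothetical helper below (not part of the Weaviate client) walks a class dictionary of the same shape as `article_class` and lists the properties that are not marked with `skip`:

```python
def vectorized_properties(class_def, module="text2vec-openai"):
    """Return the names of properties not marked skip=True for the given module."""
    names = []
    for prop in class_def.get("properties", []):
        module_cfg = prop.get("moduleConfig", {}).get(module, {})
        if not module_cfg.get("skip", False):
            names.append(prop["name"])
    return names

# Trimmed-down copy of the class definition used above
example_class = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
        {"name": "content", "dataType": ["text"]},
    ],
}
print(vectorized_properties(example_class))
```

For the definition above this lists only `content`, since `title` is explicitly skipped.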


### Import the articles

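Load the articles into the vector database using batch import. The CSV already contains a precomputed embedding for each article (the `content_vector` column), so each vector is passed in directly with its object.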

In [7]:
csv_iterator = pd.read_csv(
    '../data/vector_database_wikipedia_articles_embedded.csv',
    usecols=['id', 'url', 'title', 'text', 'content_vector'],
    chunksize=100,  # number of rows per chunk
    # nrows=350  # optionally limit the number of rows to import
)

In [9]:
counter = 0
interval = 100  # print progress every this many records

# Iterate through the dataframe chunks and add each CSV record to the batch
weaviate_client.batch.configure(batch_size=100)  # Configure batch

with weaviate_client.batch as batch:
    for chunk in csv_iterator:
        for index, row in chunk.iterrows():

            properties = {
                "title": row.title,
                "content": row.text,
                "url": row.url
            }

            # Convert the vector from CSV string back to array of floats
            vector = ast.literal_eval(row.content_vector)

            # Add the object to the batch, and set its vector embedding
            batch.add_data_object(properties, "Article", vector=vector)

            # Calculate and display progress
            counter += 1
            if counter % interval == 0:
                print(f"Imported {counter} articles...")


print(f"Finished importing {counter} articles.")

Imported 100 articles...
Imported 200 articles...
Imported 300 articles...
Imported 400 articles...
Imported 500 articles...
Imported 600 articles...
Imported 700 articles...
Imported 800 articles...
Imported 900 articles...
Imported 1000 articles...
Imported 1100 articles...
Imported 1200 articles...
Imported 1300 articles...
Imported 1400 articles...
Imported 1500 articles...
Imported 1600 articles...
Imported 1700 articles...
Imported 1800 articles...
Imported 1900 articles...
Imported 2000 articles...
Imported 2100 articles...
Imported 2200 articles...
Imported 2300 articles...
Imported 2400 articles...
Imported 2500 articles...
Imported 2600 articles...
Imported 2700 articles...
Imported 2800 articles...
Imported 2900 articles...
Imported 3000 articles...
Imported 3100 articles...
Imported 3200 articles...
Imported 3300 articles...
Imported 3400 articles...
Imported 3500 articles...
Imported 3600 articles...
Imported 3700 articles...
Imported 3800 articles...
Imported 3900 articles...
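The import loop above relies on `ast.literal_eval` to turn each `content_vector` string back into a list of floats. A minimal sketch of that step in isolation, with an optional length check; the helper name `parse_vector` and the `expected_dim` parameter are illustrative assumptions, not part of the tutorial:

```python
import ast

def parse_vector(raw, expected_dim=None):
    """Parse a stringified list such as "[0.1, 0.2]" into a list of floats."""
    vector = ast.literal_eval(raw)  # evaluates Python literals only, not arbitrary code
    if not isinstance(vector, (list, tuple)):
        raise ValueError(f"Expected a list of numbers, got {type(vector).__name__}")
    vector = [float(x) for x in vector]
    if expected_dim is not None and len(vector) != expected_dim:
        raise ValueError(f"Expected {expected_dim} dimensions, got {len(vector)}")
    return vector

sample = "[0.1, -0.25, 3]"
print(parse_vector(sample, expected_dim=3))
```

If the embeddings were produced by OpenAI's `text-embedding-ada-002`, which returns 1536-dimensional vectors, `expected_dim=1536` would catch truncated or malformed rows before they reach the batch.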

Sense-check that the data has been imported correctly.

In [11]:
count = weaviate_client.query.aggregate("Article").with_meta_count().do()
print(count)

{'data': {'Aggregate': {'Article': [{'meta': {'count': 25000}}]}}}
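The count sits several levels deep in the response. A small sketch of pulling it out and comparing it with the expected total, assuming the response shape printed above; `article_count` is a hypothetical helper name:

```python
def article_count(response, class_name="Article"):
    """Extract the object count from an aggregate response of the shape shown above."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]

# Response copied from the output above
sample_response = {'data': {'Aggregate': {'Article': [{'meta': {'count': 25000}}]}}}

expected_total = 25000  # the dataset is described as 25k articles
found = article_count(sample_response)
print(f"{found} of {expected_total} articles imported")
```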


In [12]:
response = weaviate_client.query.get("Article", ["title","url"]).with_additional("id").with_limit(1).do()
print(response)

{'data': {'Get': {'Article': [{'_additional': {'id': '0000e74a-900b-4960-afe8-065d00ff694f'}, 'title': "Zaiger's Genetics", 'url': 'https://simple.wikipedia.org/wiki/Zaiger%27s%20Genetics'}]}}}


In [13]:
response = weaviate_client.query.get("Article", ["title","url","content"]).with_additional("id").with_limit(1).do()
print(response)

{'data': {'Get': {'Article': [{'_additional': {'id': '0000e74a-900b-4960-afe8-065d00ff694f'}, 'content': 'Zaiger\'s Genetics is an American company that breeds fruit trees. They are in Modesto, California. They have created fruits such as the Aprium (apricot and plum), the Nectarcot (nectarine and apricot), Peacotum (peach, apricot and plum) and the pluot (plum and apricot).\n\nThey are dedicated to improving fruit worldwide.\n\nIn 2009 Floyd Zaiger was named one of the "top ten most creative people in food" by Fast Company.\n\nZaiger\'s Genetics gives fruit tours to commercial growers every Wednesday. An article in Western Fruit Grower titled "Wednesdays With Floyd" described a typical Wednesday with the Zaiger family.\n\nReferences\n\nOther websites\nFamily story \n\nCompanies based in California\nAgriculture', 'title': "Zaiger's Genetics", 'url': 'https://simple.wikipedia.org/wiki/Zaiger%27s%20Genetics'}]}}}


### Queries

In [14]:
response = (
    weaviate_client.query
    .get("Article", ["title", "content"])
    .with_near_text({"concepts": ["modern art in Europe"]})
    .with_limit(1)
    .do()
)
print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Article": [
                {
                    "content": "documenta is one of the most important exhibitions of modern art in the world. Since 1955, it takes place every five years in Kassel, Germany. More than 1.2 million people visited the last one, documenta\u00a014, which was held in 2017. The next one, documenta\u00a015, will be from June 18 to September 25, 2022.\n\nRelated pages\n\nGerman art\n\nArt",
                    "title": "Documenta"
                }
            ]
        }
    }
}
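Conceptually, a `nearText` query embeds the query text and returns the objects whose vectors lie closest to it. The sketch below illustrates that ranking with brute-force cosine similarity over toy two-dimensional vectors. It is only an illustration: the vectors and titles are made up, and Weaviate itself uses an approximate nearest-neighbour index (HNSW by default) rather than comparing against every object.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vector, vectors_by_title, limit=1):
    """Return the titles whose vectors are most similar to the query vector."""
    ranked = sorted(
        vectors_by_title.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:limit]]

# Toy vectors standing in for real embeddings (hypothetical values)
toy_vectors = {
    "Documenta": [0.9, 0.1],
    "Jackfruit": [0.1, 0.9],
}
print(nearest([1.0, 0.0], toy_vectors, limit=1))
```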


In [15]:
response = (
    weaviate_client.query
    .get("Article", ["title", "content"])
    .with_hybrid("jackfruit", alpha=0.5)  # default 0.75
    .with_limit(3)
    .do()
)
print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Article": [
                {
                    "content": "Jackfruit (also called \"Jakfruit\") is a type of fruit from India, Bangladesh (National fruit) and Sri Lanka. When a Jackfruit ripens, it changes from green to slightly yellow.\n\nReferences\n\nOther websites \n \n\nTropical fruit\nMoraceae\nNational symbols of Bangladesh\nNational symbols of Sri Lanka",
                    "title": "Jackfruit"
                },
                {
                    "content": "The cherry tomato is a type of tomato that is a fruit. This type of tomato was originally developed in Israel.\n\nTomatoes\n\nde:Kirschtomate",
                    "title": "Cherry tomato"
                },
                {
                    "content": "In botany, a fruit is a plant structure that contains the plant's seeds. \n\nTo a botanist, the word fruit is used only if it comes from the part of the flower which was an ovary. It is an extra layer round the seeds,