# Build and Load the Event Corpus
We'll now build our event corpus, which stores each event with its full description. This is the "raw" description, before we run a curation step to extract the individual talks.

We'll re-use the Parquet file produced in the previous notebook.

A key feature of this corpus is its five filter attributes, which demonstrate the power of combining semantic search with key-value filtering. We define them below, along with a `url` attribute that is stored but not indexed:

* **event_date:** When the event occurred in yyyy-mm-dd format
* **event_year:** Which year the event occurred
* **event_month:** Which month the event occurred
* **event_type:** Delivery format (online or physical)
* **is_online:** Whether this was an online event (boolean)
* **url:** A trackback to meetup.com
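
As a rough sketch of how these attributes might be combined at query time, the helper below assembles a filter expression string. It assumes Vectara's `doc.<attribute>` filter syntax for document-level metadata. `build_filter` is a hypothetical helper and not part of vectara-skunk-client, and how a literal should be quoted depends on the type each attribute was declared with.

```python
def build_filter(**conditions):
    """Join attribute conditions into a single 'and' filter expression."""
    parts = []
    for name, value in conditions.items():
        if isinstance(value, bool):
            # Booleans are written as bare true/false (assumed syntax).
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Everything else is treated as a quoted text literal.
            literal = f"'{value}'"
        parts.append(f"doc.{name} = {literal}")
    return " and ".join(parts)


filter_expr = build_filter(event_type="online", is_online=True)
print(filter_expr)
```

The resulting string would be passed alongside the semantic query, so that only events matching the key-value conditions are ranked.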

In [None]:
import logging

logging.basicConfig(format='%(asctime)s:%(name)-35s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S %z')
logging.getLogger("OAuthUtil").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)

## Initialize our Client
We'll use vectara-skunk-client as our client SDK for working with Vectara.

We use implicit configuration so that API keys never have to be pasted into a notebook,
which also means there are no secrets to delete before committing.

In [None]:
from vectara_client.core import Factory
from vectara_client.admin import CorpusBuilder

client = Factory().build()
manager = client.corpus_manager

## Structure our Corpus
We'll now do corpus modelling based on the available data.

In [None]:
corpus = (CorpusBuilder("AICamp Events")
         .description("This is where we put our events with their raw description")
         .add_attribute("event_date", "When the event occurred in yyyy-mm-dd format", type="text")
         .add_attribute("event_year", "Which year the event occurred")
         .add_attribute("event_month", "Which month the event occurred")
         .add_attribute("event_type", "Delivery format: (online or physical)")
         .add_attribute("is_online", "Whether this was an online event (boolean)", type="boolean")
         .add_attribute("url", "A trackback to meetup.com", indexed=False)
         .build())

corpus_id = manager.create_corpus(corpus, delete_existing=True)

## Load our Data from Parquet

We'll load the data from the prior notebook.

In [None]:
import pandas as pd
import duckdb
import pyarrow as pa

con = duckdb.connect()
con.execute("CREATE TABLE meetups_raw AS SELECT * FROM '../output/meetups_raw.parquet';")

description_df = con.execute("DESCRIBE meetups_raw;").fetchdf()
description_df
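
Before indexing, it can help to confirm that the table exposes every column the indexing step reads. The check below is a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, and the column list is simply the set of keys accessed later in this notebook.

```python
# Columns the indexing cell reads from each event record.
REQUIRED_COLUMNS = [
    "id", "title", "description",
    "event_date", "event_year", "event_month",
    "event_type", "is_online", "url",
]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


# Stand-in column list so the sketch runs on its own.
example_columns = ["id", "title", "description", "event_date", "url"]
print(missing_columns(example_columns))
```

In the notebook, the same check could be run as `missing_columns(description_df["column_name"])`, since DuckDB's `DESCRIBE` output lists the names in a `column_name` column.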

## Structured Indexing
Whilst Vectara can automatically ingest binary documents,
for this use case we'll use the structured indexing API.

https://docs.vectara.com/docs/api-reference/indexing-apis/indexing

In [None]:
import json

events_df = con.execute("SELECT * FROM meetups_raw;").fetchdf()
events = events_df.to_dict('records')

vectara_documents = []

for event in events:
    metadata = {
        "event_date": event["event_date"],
        "event_year": event["event_year"],
        "event_month": event["event_month"],
        "event_type": event["event_type"],
        "is_online": event["is_online"],
        "url": event["url"]
    }
    metadata_json = json.dumps(metadata)
    
    to_index = {
      "document_id": event["id"],
      "title": event["title"],
      "metadata_json": metadata_json,
      "section": [
        {
          "text": event["description"]
        }
      ]
    }
    vectara_documents.append(to_index)
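
One thing to watch in the loop above: `json.dumps` raises a `TypeError` for values such as `datetime.date`. Whether `event_date` arrives as a string or a date object depends on the Parquet schema, which is not shown here, so the following is only a defensive sketch. `to_metadata_json` is a hypothetical helper that falls back to `str()` for anything JSON cannot encode natively.

```python
import datetime
import json


def to_metadata_json(metadata):
    """Serialize metadata, converting non-JSON types (e.g. dates) via str()."""
    return json.dumps(metadata, default=str)


sample = {"event_date": datetime.date(2023, 5, 17), "is_online": True}
encoded = to_metadata_json(sample)
print(encoded)
```

Note that `str()` on a pandas `Timestamp` includes the time component, so a date column loaded that way may need explicit formatting to keep the yyyy-mm-dd shape.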
    

## Parallel Ingest
The following code calls the indexing API from several threads in parallel to increase throughput.

NB: I'll move this into the vectara-skunk-client to make it easier for users.

In [None]:
class SubIndexer:

    def __init__(self, indexer_service, corpus_id):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.indexer_service = indexer_service
        self.corpus_id = corpus_id
        self.docs = []

    def add_doc(self, doc):
        self.docs.append(doc)

    def index_docs(self):
        try:
            for doc in self.docs:
                self.indexer_service.index_doc(self.corpus_id, doc)
        except Exception as e:
            # Ignore for lab: log the error; the rest of this batch is skipped
            self.logger.error(f"Error: {e}")

thread_count = 10
sub_indexers = [SubIndexer(client.indexer_service, corpus_id) for _ in range(thread_count)]


for index, doc in enumerate(vectara_documents):
    thread_index = index % thread_count
    sub_indexers[thread_index].add_doc(doc)



In [None]:
from threading import Thread

threads = []
for sub_indexer in sub_indexers:
    thread = Thread(target=sub_indexer.index_docs)
    threads.append(thread)
    thread.start()


for index, thread in enumerate(threads):
    logger.info(f"Joining thread {index}")
    thread.join()
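
The round-robin split and manual threads above could also be expressed with the standard library's `ThreadPoolExecutor`, which hands documents to idle workers and makes it easy to record failures per document. This is a sketch of an alternative, not the approach used in this notebook: `index_all` and `fake_index_doc` are hypothetical names, and the fake indexer stands in for the real call so the example runs without a Vectara account.

```python
from concurrent.futures import ThreadPoolExecutor


def index_all(index_doc, docs, max_workers=10):
    """Call index_doc(doc) for every doc on a thread pool.

    Returns (number of successes, list of (document_id, exception)).
    """
    failures = []
    succeeded = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(index_doc, doc): doc for doc in docs}
        for future, doc in futures.items():
            error = future.exception()  # blocks until this task has finished
            if error is None:
                succeeded += 1
            else:
                failures.append((doc["document_id"], error))
    return succeeded, failures


# Stand-in for the real indexing call: rejects documents without a title.
def fake_index_doc(doc):
    if not doc["title"]:
        raise ValueError("missing title")


docs = [{"document_id": str(i), "title": "" if i == 3 else f"Event {i}"} for i in range(8)]
succeeded, failures = index_all(fake_index_doc, docs, max_workers=4)
print(succeeded, [doc_id for doc_id, _ in failures])
```

With the objects from this notebook, the call might look like `index_all(lambda doc: client.indexer_service.index_doc(corpus_id, doc), vectara_documents)`. Unlike the batch-level `try` above, one failed document does not stop the others from being indexed.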
    