# Build General Corpus

The first step is to build our general corpus, which serves as our static asset. We've taken some material from the meetup web page, and we'll now use our "unofficial" SDK to load it.

In [1]:
%pip install -q vectara-skunk-client

Note: you may need to restart the kernel to use updated packages.


## Initialize our Client
We've tried to make this SDK as streamlined as possible to reduce
boilerplate in your codebase. Behind the scenes, this code relies on
implicit configuration (loaded from the user's home directory, as the
log output shows) to set up our OAuth2 authentication, which provides
access to all admin APIs.

In [7]:
from vectara_client.core import Factory
from vectara_client.admin import CorpusBuilder
import logging

logging.basicConfig(format='%(asctime)s:%(name)-35s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S %z')
logging.getLogger("OAuthUtil").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)


client = Factory().build()
manager = client.corpus_manager

09:45:56 +1000:Factory                             INFO:initializing builder
09:45:56 +1000:Factory                             INFO:Factory will load configuration from home directory
09:45:56 +1000:HomeConfigLoader                    INFO:Loading configuration from users home directory [C:\Users\david]
09:45:56 +1000:HomeConfigLoader                    INFO:Loading default configuration [default]
09:45:56 +1000:HomeConfigLoader                    INFO:Parsing config
09:45:56 +1000:root                                INFO:We are processing authentication type [OAuth2]
09:45:56 +1000:root                                INFO:initializing Client


## Create our Corpus
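We'll now use the convenience classes "CorpusBuilder" and "CorpusManager" to create our first corpus, "meetup-general". It has no special configuration. Passing delete_existing=True removes any existing corpus with the same name first, as the log output shows.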

In [8]:
corpus = CorpusBuilder("meetup-general").build()
corpus_id = manager.create_corpus(corpus, delete_existing=True)

09:46:02 +1000:CorpusManager                       INFO:Performing account checks before corpus creation for name [meetup-general]
09:46:03 +1000:RequestUtil                         INFO:URL for operation list-corpora is: https://api.vectara.io/v1/list-corpora
09:46:05 +1000:CorpusManager                       INFO:Checking corpus with name [meetup-general]
09:46:05 +1000:CorpusManager                       INFO:We found the following corpora with name [meetup-general]: [719]
09:46:05 +1000:CorpusManager                       INFO:We found existing corpus with name [meetup-general]
09:46:05 +1000:CorpusManager                       INFO:Deleting existing corpus named [meetup-general]
09:46:05 +1000:RequestUtil                         INFO:URL for operation list-corpora is: https://api.vectara.io/v1/list-corpora
09:46:06 +1000:CorpusManager                       INFO:We found [1] potential matches
09:46:06 +1000:CorpusManager                       INFO:Deleting existing corpus with id [

## Load our Corpus
We'll now load our general corpus with content from the folder "../resources/general".

We can directly ingest data in Word (docx) format as well as many others.

In [9]:
from pathlib import Path

for path in Path("../resources/general").glob("*.docx"):
    client.indexer_service.upload(corpus_id, path)
    

09:46:21 +1000:IndexerService                      INFO:Headers: {"c": "1623270172", "o": "720"}
About Us.docx: 14.4kB [00:03, 4.25kB/s]                                                                                
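
Before uploading, it can help to check which files the glob will actually pick up. The sketch below uses only `pathlib`; the helper name `find_documents` is hypothetical and not part of the SDK. It fails early if the folder is missing and skips the temporary `~$` lock files that Word leaves beside open documents.

```python
from pathlib import Path


def find_documents(folder, pattern="*.docx"):
    """Return the files in `folder` matching `pattern`, sorted by name.

    Hypothetical helper for illustration; not part of the SDK.
    """
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Folder not found: {root}")
    # Word creates "~$..." lock files next to open documents; skip them.
    return sorted(p for p in root.glob(pattern) if not p.name.startswith("~$"))


# Usage with the objects created earlier in this notebook:
# for path in find_documents("../resources/general"):
#     client.indexer_service.upload(corpus_id, path)
```

Sorting the paths keeps the upload order stable between runs, which makes the log output easier to compare.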


## Test the Corpus
We'll now run a few test questions to confirm we get good responses.

In [20]:
response = client.query_service.query(
    "what is the motto for DataEngBytes?", corpus_id, summary=True, 
    summarizer="vectara-summary-ext-v1.3.0", summary_result_count=5)
logger.info(f"Response was: {response.summary[0].text}")

09:53:18 +1000:RequestUtil                         INFO:URL for operation query is: https://api.vectara.io/v1/query
09:53:26 +1000:__main__                            INFO:Response was: The event organizer for DataEngBytes is Alicia Cheah. DataEngBytes is a conference that was established in 2019 and is known for highlighting the work of data engineers in Australia and New Zealand. Despite challenges posed by COVID-19, the conference successfully transitioned from an online format to in-person events, expanding to several Australian cities and New Zealand [1][3]. The core team of DataEngBytes also includes its founder, Peter Hanssens, and the marketing coordinator, Mohammed Alim [1][5].



In [None]:
response = client.query_service.query(
    "Who is the event organizer?", corpus_id, summary=True, 
    summarizer="vectara-summary-ext-v1.3.0", summary_result_count=5)
logger.info(f"Response was: {response.summary[0].text}")
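
The two test cells above repeat the same call with a different question. A minimal sketch of how the questions could be batched follows; `ask_all` is a hypothetical helper, not part of the SDK. It takes any callable that accepts a question and returns an object shaped like the responses used above (with `summary[0].text`), so it makes no further assumptions about the SDK itself.

```python
def ask_all(query_fn, questions):
    """Run each question through `query_fn` and return {question: summary text}.

    Hypothetical helper for illustration; `query_fn` is any callable that
    takes a question string and returns a response exposing summary[0].text,
    as the query cells in this notebook do.
    """
    answers = {}
    for question in questions:
        response = query_fn(question)
        answers[question] = response.summary[0].text
    return answers


# Usage with the objects created earlier in this notebook:
# run_query = lambda q: client.query_service.query(
#     q, corpus_id, summary=True,
#     summarizer="vectara-summary-ext-v1.3.0", summary_result_count=5)
# questions = ["what is the motto for DataEngBytes?", "Who is the event organizer?"]
# for question, answer in ask_all(run_query, questions).items():
#     logger.info(f"{question} -> {answer}")
```

Keeping the query call in one place also makes it easier to swap the summarizer or result count for every test question at once.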