# Build General Corpus

The first step is to build our general corpus, which is our static asset. We've taken some material from the meetup web page, and we'll now use our "unofficial" SDK to build the corpus from it.

In [1]:
%pip install -q vectara-skunk-client

Note: you may need to restart the kernel to use updated packages.


## Initialize our Client
We've tried to make this SDK as streamlined as possible to reduce
boilerplate in your codebase. Behind the scenes, the client loads its
configuration implicitly from your home directory and authenticates
with OAuth2, which provides access to all admin APIs.

In [2]:
from vectara_client.core import Factory
from vectara_client.admin import CorpusBuilder
import logging

logging.basicConfig(format='%(asctime)s:%(name)-35s %(levelname)s:%(message)s', level=logging.INFO, datefmt='%H:%M:%S %z')
logging.getLogger("OAuthUtil").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)


client = Factory().build()
manager = client.corpus_manager

19:05:25 +1000:Factory                             INFO:initializing builder
19:05:25 +1000:Factory                             INFO:Factory will load configuration from home directory
19:05:25 +1000:HomeConfigLoader                    INFO:Loading configuration from users home directory [C:\Users\david]
19:05:25 +1000:HomeConfigLoader                    INFO:Loading default configuration [default]
19:05:25 +1000:HomeConfigLoader                    INFO:Parsing config
19:05:25 +1000:root                                INFO:We are processing authentication type [OAuth2]
19:05:25 +1000:root                                INFO:initializing Client


## Create our Corpus
We'll now use the convenience class "CorpusManager" to create our first corpus, "meetup-general". It has no special configuration.

In [3]:
corpus = CorpusBuilder("meetup-general").build()
corpus_id = manager.create_corpus(corpus, delete_existing=True)

19:05:54 +1000:CorpusManager                       INFO:Performing account checks before corpus creation for name [meetup-general]
19:05:56 +1000:RequestUtil                         INFO:URL for operation list-corpora is: https://api.vectara.io/v1/list-corpora
19:05:57 +1000:CorpusManager                       INFO:We found the following corpora with name [meetup-general]: []
19:05:57 +1000:CorpusManager                       INFO:Account checks complete, creating the new corpus
19:05:57 +1000:RequestUtil                         INFO:URL for operation create-corpus is: https://api.vectara.io/v1/create-corpus
19:06:00 +1000:AdminService                        INFO:Created new corpus with 729


## Load our Corpus
We'll now load our general corpus with content from the folder "../resources/general".

We can directly ingest data in Word (docx) format as well as many others.

In [4]:
from pathlib import Path

for path in Path("../resources/general").glob("*.docx"):
    client.indexer_service.upload(corpus_id, path)
    

19:06:35 +1000:IndexerService                      INFO:Headers: {"c": "1623270172", "o": "729"}
About Us.docx: 14.4kB [00:03, 4.44kB/s]                                                                                
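
The loop above only picks up `.docx` files. If the folder also holds other formats, one option is to filter by file extension before uploading. This is a minimal sketch using only `pathlib`; the helper name `find_documents` and the extension set are assumptions for illustration, not a list of formats the service is confirmed to accept.

```python
from pathlib import Path

# Hypothetical set of extensions: adjust it to the formats you actually
# have on disk and that the service accepts.
SUFFIXES = {".docx", ".pdf", ".md"}


def find_documents(folder, suffixes=SUFFIXES):
    """Return the files in `folder` whose extension is in `suffixes`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        # A missing folder yields an empty list rather than an exception.
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

The upload loop would then become `for path in find_documents("../resources/general"): client.indexer_service.upload(corpus_id, path)`, reusing the same `upload` call as the cell above.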


## Test the Corpus
We'll now run a few test questions to confirm that we get sensible responses.

In [5]:
response = client.query_service.query(
    "what is the motto for DataEngBytes?", corpus_id, summary=True, 
    summarizer="vectara-summary-ext-v1.3.0", summary_result_count=5)
logger.info(f"Response was: {response.summary[0].text}")

19:06:54 +1000:RequestUtil                         INFO:URL for operation query is: https://api.vectara.io/v1/query
19:07:00 +1000:__main__                            INFO:Response was: The motto for DataEngBytes is "Run by data engineers, for data engineers" [4]. DataEngBytes was established in 2019 as an evolution from a data engineering meetup in Sydney, aiming to fill a gap in technical dialogue and community engagement [1][3]. It quickly evolved into a full-day conference and plays an important role in highlighting the work of data engineers in Australia and New Zealand, positioning the region as a global epicenter of data engineering innovation [2][3][5].
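
The bracketed numbers in the summary (`[4]`, `[1][3]`, and so on) appear to be citations pointing at the search results the summarizer drew on. To see which results a summary relies on, the markers can be pulled out with a regular expression. This is a small sketch that works on the summary text alone; the helper name `cited_results` is an assumption, not part of the SDK.

```python
import re


def cited_results(summary_text):
    """Return the distinct citation numbers found in a summary, in ascending order."""
    # Matches markers such as [1] or [12]; any other bracketed text is ignored.
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", summary_text)})
```

Calling `cited_results(response.summary[0].text)` on the response above would list each cited result number once.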


In [6]:
response = client.query_service.query(
    "Who is the event organizer?", corpus_id, summary=True, 
    summarizer="vectara-summary-ext-v1.3.0", summary_result_count=5)
logger.info(f"Response was: {response.summary[0].text}")

19:07:28 +1000:RequestUtil                         INFO:URL for operation query is: https://api.vectara.io/v1/query
19:07:35 +1000:__main__                            INFO:Response was: The event organizer for DataEngBytes, a conference highlighting the work of data engineers in Australia and New Zealand, is Alicia Cheah [1][3][5]. DataEngBytes was established in 2019 and has evolved to include in-person events in various Australian cities and New Zealand [4]. The core team also includes Peter Hanssens as the founder and Mohammed Alim as the marketing coordinator [1][3][5].
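
Both test cells repeat the same call with a different question, so a small wrapper keeps further tests short. This is a sketch that assumes only the `query_service.query` call and the `response.summary[0].text` attribute used in the cells above; the helper name `ask` and its parameter names are hypothetical.

```python
def ask(client, corpus_id, question,
        summarizer="vectara-summary-ext-v1.3.0", result_count=5):
    """Run one summarised query and return the summary text."""
    # Same arguments as the notebook cells above; only the question varies.
    response = client.query_service.query(
        question, corpus_id, summary=True,
        summarizer=summarizer, summary_result_count=result_count)
    return response.summary[0].text
```

For example, `logger.info(ask(client, corpus_id, "Who is the event organizer?"))` would reproduce the second test with one line.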
