## Workshop demo

In [5]:
import os
import weaviate
from weaviate import EmbeddedOptions

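# Start (or attach to) an embedded Weaviate instance and forward the OpenAI key
# so the OpenAI vectorizer and generative modules can call the API.
client = weaviate.Client(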
    embedded_options=EmbeddedOptions(version="1.20.2"),
    additional_headers={"X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]}
)

embedded weaviate is already listening on port 6666


{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-07-20T09:21:30+01:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-07-20T09:21:30+01:00"}


In [6]:
client.is_ready()

Embedded weaviate wasn't listening on port 6666, so starting embedded weaviate again
Started /Users/jphwang/.cache/weaviate-embedded: process ID 16784


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-07-20T09:21:34+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-07-20T09:21:34+01:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-07-20T09:21:34+01:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-07-20T09:21:34+01:00"}


True

In [38]:
# Vectorize objects with text2vec-openai and enable generative-openai for generative search.
class_definition = {
    "class": "Chunk",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "generative-openai": {}
    },
}
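
The class above leaves `properties` empty, so the schema for each field is inferred when data is first imported. A minimal sketch of declaring the fields up front instead, assuming the three fields (`body`, `chunk_no`, `source`) that this demo imports; the name `class_definition_explicit` is hypothetical, and the data types are my own choice rather than what auto-schema would necessarily pick:

```python
# Hypothetical variant of the class definition with explicit properties.
class_definition_explicit = {
    "class": "Chunk",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {"generative-openai": {}},
    "properties": [
        {"name": "body", "dataType": ["text"]},
        {"name": "chunk_no", "dataType": ["int"]},
        {"name": "source", "dataType": ["text"]},
    ],
}

# List the declared property names as a quick sanity check.
property_names = [p["name"] for p in class_definition_explicit["properties"]]
print(property_names)
```

Declaring types explicitly avoids surprises from inference, at the cost of keeping the definition in sync with the objects being imported.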

In [39]:
client.schema.create_class(class_definition)

{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"chunk_C4fZXCMGYgeG","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-07-20T09:39:55+01:00","took":57042}


In [40]:
client.schema.get("Chunk")

{'class': 'Chunk',
 'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
  'cleanupIntervalSeconds': 60,
  'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
 'moduleConfig': {'generative-openai': {},
  'text2vec-openai': {'model': 'ada',
   'modelVersion': '002',
   'type': 'text',
   'vectorizeClassName': True}},
 'multiTenancyConfig': {'enabled': False},
 'properties': [],
 'replicationConfig': {'factor': 1},
 'shardingConfig': {'virtualPerPhysical': 128,
  'desiredCount': 1,
  'actualCount': 1,
  'desiredVirtualCount': 128,
  'actualVirtualCount': 128,
  'key': '_id',
  'strategy': 'hash',
  'function': 'murmur3'},
 'vectorIndexConfig': {'skip': False,
  'cleanupIntervalSeconds': 300,
  'maxConnections': 64,
  'efConstruction': 128,
  'ef': -1,
  'dynamicEfMin': 100,
  'dynamicEfMax': 500,
  'dynamicEfFactor': 8,
  'vectorCacheMaxObjects': 1000000000000,
  'flatSearchCutoff': 40000,
  'distance': 'cosine',
  'pq': {'enabled': False,
   'bitCompression': False,


In [41]:
with open("srcdata/llm_wiki.txt", "r", encoding="utf-8") as f:
    text = f.read()

text[:100]

'A large language model (LLM) is a computerized language model, embodied by an artificial neural netw'

In [42]:
len(text)

39108

In [43]:
len(text.split(" "))

7206

In [44]:
# Naive chunking: treat each blank-line-separated paragraph as one chunk.
chunks = text.split("\n\n")
for chunk in chunks:
    print(chunk[:50], "...\n=====\n")

A large language model (LLM) is a computerized lan ...
=====

History
Precursors
The basic idea of LLMs, which i ...
=====

Lead-up to the transformer framework
The earliest  ...
=====

The seq2seq model (380 million parameters) used tw ...
=====

BERT and GPT
While there are many models with diff ...
=====

Origin of the term and disambiguation
While the te ...
=====

Linguistic foundations
Cognitive Linguistics offer ...
=====

Architecture
Large language models have most commo ...
=====

Tokenizers, which convert text into machine-readab ...
=====

Tokenization
LLMs are mathematical functions whose ...
=====

Output
The output of a LLM is a probability distri ...
=====

Upon receiving a text, the bulk of the LLM outputs ...
=====

Context window
The context window of a LLM is the  ...
=====

Training
In the pre-training, LLMs may be trained  ...
=====

autoregressive (i.e. predicting how the segment co ...
=====

Dataset size and compression
In 2018, the BookCorp ...
=====

Training
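
Splitting on a literal `"\n\n"` can leave empty or whitespace-only chunks when the source has runs of blank lines. A minimal sketch of a cleanup step, assuming a hypothetical helper `split_paragraphs` and a `min_chars` threshold, neither of which is part of the original notebook:

```python
import re


def split_paragraphs(text: str, min_chars: int = 1) -> list[str]:
    """Split text on blank lines, strip each piece, and drop pieces shorter than min_chars."""
    # A "blank line" here is a newline, optional whitespace, then another newline.
    pieces = re.split(r"\n\s*\n", text)
    stripped = (piece.strip() for piece in pieces)
    return [piece for piece in stripped if len(piece) >= min_chars]


# Illustrative input, not the demo's source file.
sample = "Intro paragraph.\n\nHistory\nPrecursors\nBody text.\n\n\n\n  \n\nLast paragraph.\n"
sample_chunks = split_paragraphs(sample)
for piece in sample_chunks:
    print(repr(piece[:50]))
```

Raising `min_chars` would also drop very short fragments, though that is a judgment call that depends on the source text.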

In [45]:
from weaviate.util import generate_uuid5

# Import all chunks in batches; each object gets a deterministic UUID derived from its contents.
with client.batch() as batch:
    for i, chunk in enumerate(chunks):
        weaviate_object = {
            "body": chunk,
            "chunk_no": i,
            "source": "llm_wiki.txt",
        }
        batch.add_data_object(
            class_name="Chunk",
            data_object=weaviate_object,
            uuid=generate_uuid5(weaviate_object)
        )
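
Deriving the UUID from the object's contents means the same object always maps to the same ID, so re-running the import targets existing IDs instead of minting new ones. The idea can be illustrated with the standard library alone; `content_uuid` below is a hypothetical stand-in and makes no claim to reproduce the IDs that `generate_uuid5` returns:

```python
import json
import uuid


def content_uuid(data_object: dict) -> str:
    """Return a name-based (version 5) UUID computed from the object's JSON form."""
    # sort_keys makes the serialisation independent of key insertion order.
    canonical = json.dumps(data_object, sort_keys=True, ensure_ascii=False)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


first = content_uuid({"body": "Some text", "chunk_no": 0, "source": "llm_wiki.txt"})
again = content_uuid({"source": "llm_wiki.txt", "chunk_no": 0, "body": "Some text"})
other = content_uuid({"body": "Some text", "chunk_no": 1, "source": "llm_wiki.txt"})

# Same contents give the same ID; changing any field gives a different one.
print(first == again, first == other)
```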


In [46]:
client.query.aggregate("Chunk").with_meta_count().do()

{'data': {'Aggregate': {'Chunk': [{'meta': {'count': 48}}]}}}

In [47]:
len(chunks)

48
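
The two counts above agree, which is a quick way to confirm that every chunk was imported. A small sketch that makes the comparison explicit, using a hypothetical helper `aggregate_count` and a hard-coded response copied from the output above rather than a live query:

```python
def aggregate_count(response: dict, class_name: str) -> int:
    """Pull the object count out of an Aggregate response that requested the meta count."""
    return response["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Response shape copied from the aggregate query output; no server is contacted here.
sample_response = {"data": {"Aggregate": {"Chunk": [{"meta": {"count": 48}}]}}}
expected = 48  # the value of len(chunks) in the notebook

imported = aggregate_count(sample_response, "Chunk")
assert imported == expected, f"expected {expected} objects, found {imported}"
print(imported)
```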

In [48]:
# Semantic search: fetch the 3 chunks whose vectors are closest to the query text.
res = (
    client.query.get("Chunk", ["body", "chunk_no"])
    .with_near_text(
        {"concepts": ["history of large language models"]}
    )
    .with_limit(3)
    .do()
)

In [49]:
import json
print(json.dumps(res, indent=2))

{
  "data": {
    "Get": {
      "Chunk": [
        {
          "body": "Lead-up to the transformer framework\nThe earliest \"large\" language models were built with recurrent architectures such as the long short-term memory (LSTM) (1997). After AlexNet (2012) demonstrated the effectiveness of large neural networks for image recognition, researchers applied large neural networks to other tasks. In 2014, two main techniques were proposed.",
          "chunk_no": 2
        },
        {
          "body": "Origin of the term and disambiguation\nWhile the term of Large Language Models has itself emerged around 2018, it gained visibility in 2019 and 2020, with the release of DistilBERT and Stochastic Parrots papers respectively. Both focused on the \"Large-scale pretrained models\", citing as an example of LLMs the BERT family, starting at 110M parameters and referring to models in the 340M parameters range as \"very large LMs\".\nPerhaps surprisingly, both cite the pre-transformer RNN-based

In [55]:
# Generative search: retrieve 5 chunks, then send them together with one prompt (grouped task).
res = (
    client.query.get("Chunk", ["body", "chunk_no"])
    .with_near_text(
        {"concepts": ["history"]}
    )
    .with_limit(5)
    .with_generate(
        grouped_task="Explain the history of large language models in plain language, based on this text"
    )
    .do()
)

In [56]:
res

{'data': {'Get': {'Chunk': [{'_additional': {'generate': {'error': None,
       'groupedResult': 'Large language models (LLMs) have a history that dates back to the 1950s. However, it was not until the 2010s that they became feasible due to the use of GPUs (graphics processing units) for massively parallelized processing. Before this, the idea of using a simple repetitive architecture to train a neural network on a large language corpus remained just an idea.\n\nOne of the precursors to LLMs was the Elman network, which was trained on simple sentences like "dog chases man." The trained model was then used to convert each word into a vector, which represented its internal representation. These vectors were clustered based on their closeness, forming a tree-like structure. Within this structure, verbs and nouns belonged to one large cluster, and within the noun cluster, there were further clusters for inanimates and animates.\n\nIn the 1990s, IBM alignment models for statistical machine 

In [57]:
res["data"]["Get"]["Chunk"][0]["_additional"]["generate"]["groupedResult"]

'Large language models (LLMs) have a history that dates back to the 1950s. However, it was not until the 2010s that they became feasible due to the use of GPUs (graphics processing units) for massively parallelized processing. Before this, the idea of using a simple repetitive architecture to train a neural network on a large language corpus remained just an idea.\n\nOne of the precursors to LLMs was the Elman network, which was trained on simple sentences like "dog chases man." The trained model was then used to convert each word into a vector, which represented its internal representation. These vectors were clustered based on their closeness, forming a tree-like structure. Within this structure, verbs and nouns belonged to one large cluster, and within the noun cluster, there were further clusters for inanimates and animates.\n\nIn the 1990s, IBM alignment models for statistical machine translation hinted at the future success of LLMs. However, it was not until 2001 that an early wo

In [58]:
res = (
    client.query.get("Chunk", ["body", "chunk_no"])
    .with_near_text(
        {"concepts": ["history"]}
    )
    .with_limit(5)
    .with_generate(
        grouped_task="Explain the history of large language models in a tweet with emojis, based on the following text"
    )
    .do()
)

In [59]:
res["data"]["Get"]["Chunk"][0]["_additional"]["generate"]["groupedResult"]

'📚🔍 LLMs history: In the 1950s, the idea of using a simple repetitive architecture to learn natural language was just an idea. In the 1990s, IBM alignment models paved the way for LLMs. In the 2010s, GPUs enabled massively parallel processing, making LLMs feasible. 🖥️💡🔢🌐 #LLM #History'
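
The lookup above indexes straight into the response, so it would raise a bare `KeyError` or `IndexError` if no objects came back, and it would not surface a populated `error` field. A defensive sketch, assuming the response shape shown in the outputs above; `grouped_result` is a hypothetical helper, and the sample dictionaries are illustrative rather than real responses:

```python
def grouped_result(response: dict, class_name: str) -> str:
    """Return the grouped generative result, raising if it is absent or errored."""
    hits = response.get("data", {}).get("Get", {}).get(class_name) or []
    if not hits:
        raise ValueError(f"no {class_name} objects in response")
    # The grouped result is read from the first returned object, as in the cells above.
    generate = hits[0].get("_additional", {}).get("generate") or {}
    if generate.get("error"):
        raise RuntimeError(f"generation failed: {generate['error']}")
    if generate.get("groupedResult") is None:
        raise ValueError("response has no groupedResult")
    return generate["groupedResult"]


# Illustrative response with the same nesting as the notebook output.
ok = {"data": {"Get": {"Chunk": [
    {"_additional": {"generate": {"error": None, "groupedResult": "A short summary."}}}
]}}}
print(grouped_result(ok, "Chunk"))

try:
    grouped_result({"data": {"Get": {"Chunk": []}}}, "Chunk")
except ValueError as exc:
    print("caught:", exc)
```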