## Demo notebook - personal knowledge repo
### Powered by Weaviate :)

Import the module and start the database. Then connect to the data (the collection).

You'll be interacting with the data through the `collection`.

In [1]:
import wkb
import os
client = wkb.start_db()
collection = wkb.Collection(client, wkb.WV_CLASS, user_agent=f'My Project ({os.getenv("MY_EMAIL")})')
collection.set_apikey(openai_key=os.getenv("OPENAI_APIKEY"))  # The class is configured to use `text2vec-openai` and `generative-openai`
collection.reinitialize_db()  # This will delete your database!

Started /Users/jphwang/.cache/weaviate-embedded: process ID 72993


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-07-31T15:21:05+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-07-31T15:21:05+01:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"knowledge_chunk_EjegPmJwteX8","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-07-31T15:21:05+01:00","took":38417}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-07-31T15:21:05+01:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-07-31T15:21:05+01:00"}


Found class. Skipping class creation
Creating a new class:


{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"knowledge_chunk_RzEWYPF2HONc","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-07-31T15:21:05+01:00","took":46750}
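
The first cell reads `MY_EMAIL` and `OPENAI_APIKEY` with `os.getenv`, which silently returns `None` when a variable is unset. A minimal sketch of a pre-flight check follows; `require_env` is a hypothetical helper and not part of `wkb`, and the values in `demo_env` are placeholders.

```python
import os


def require_env(*names, env=os.environ):
    """Return the values for the given variable names, or raise if any is unset or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [env[name] for name in names]


# Placeholder mapping so the sketch runs anywhere; omit `env=` to read the real environment.
demo_env = {"MY_EMAIL": "me@example.com", "OPENAI_APIKEY": "sk-placeholder"}
email, openai_key = require_env("MY_EMAIL", "OPENAI_APIKEY", env=demo_env)
print(email)
```

Failing early this way avoids starting the embedded database only to hit an authentication error on the first vectorization call.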


### Load data

Several built-in functions make it easier to add data to your knowledge base.

To add a text file, specify its path:

In [2]:
collection.add_text_file("./srcdata/kubernetes_concepts_overview.txt")

Adding the data from ./srcdata/kubernetes_concepts_overview.txt


13
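
The value returned above is presumably the number of chunks written to the database (the index name `knowledge_chunk_...` in the logs suggests that sources are stored as chunks). How `wkb` splits text is not shown in this notebook, so the following is only a sketch of one common approach, fixed-size word windows with overlap; `chunk_words` and its parameters are assumptions.

```python
def chunk_words(text, chunk_size=100, overlap=20):
    """Split text into chunks of up to `chunk_size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(sample, chunk_size=100, overlap=20)
print(len(chunks))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.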

In [3]:
collection.summarize_topic("kubernetes")

"Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services. It provides service discovery and load balancing, storage orchestration, automated rollouts and rollbacks, automatic bin packing, self-healing, secret and configuration management. Kubernetes aims to support a diverse variety of workloads and does not limit the types of applications supported. It is not a traditional, all-inclusive PaaS system and does not provide application-level services or deployment of source code. Kubernetes eliminates the need for orchestration and provides a set of independent, composable control processes. It is highly portable and can run on various platforms. Kubernetes combines Google's experience with best practices from the community. It evolved from the traditional deployment era to the container deployment era, providing benefits such as agility, scalability, and cost reduction."

To add a Wikipedia article (currently only the summary is added), provide the article title:

In [4]:
for wiki_title in [
    "Database",
    "Vector database",
    "Containerization (computing)",
    "Formula One",
]:
    collection.add_wiki_article(wiki_title)

Adding the data from Database
Adding the data from Vector database
Adding the data from Containerization (computing)
Adding the data from Formula One


In [5]:
print(collection.summarize_topic("what is formula one in a couple of sentences?"))

Formula One (F1) is the highest class of international racing for open-wheel single-seater formula racing cars. It is governed by the Fédération Internationale de l'Automobile (FIA) and consists of a series of races known as Grands Prix, which take place in multiple countries and continents around the world. The sport follows a set of rules and regulations, and there are two annual World Championships, one for the drivers and one for the constructors (teams).


In [5]:
print(collection.summarize_topic("what is a database in a couple of sentences?"))

A database is an organized collection of data that is stored and accessed electronically through a database management system (DBMS). It can be used to store and retrieve large quantities of information and provides various functions for managing and organizing data.


If the prompt is set up correctly, it should decline to answer questions like the following, where the knowledge base has no relevant content:

In [6]:
print(collection.summarize_topic("explain space travel in a couple of sentences"))

The text does not contain any information about space travel.
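
The actual prompt used by `summarize_topic` is not shown in this notebook. As a rough sketch of how such a guard might be phrased, the hypothetical `build_grounded_prompt` below restricts the model to the retrieved passages and tells it what to say when they do not cover the question.

```python
def build_grounded_prompt(question, passages):
    """Build a prompt that asks the model to answer only from the given passages."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the text below.\n"
        "If the text does not contain the answer, reply that the text "
        "does not contain any information about it.\n\n"
        f"Text:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "explain space travel in a couple of sentences",
    ["A database is an organized collection of data.", "Formula One is a racing series."],
)
print(prompt)
```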


To add a YouTube video, provide its URL:

In [7]:
youtube_url = "https://youtu.be/iFUeV3aYynI"  # Weaviate Academy - what is a vector?
collection.add_from_youtube(youtube_url)

Found Hello Weaviate - What Is a Vector? - downloading
Successfully Downloaded to temp_audio.mp3
Audio file under 900 seconds. No split required.
Getting transcripts from 1 audio files...
Processing transcript 1 of 1...




Adding the data from https://youtu.be/iFUeV3aYynI


5
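
The log above indicates that audio longer than 900 seconds is split before transcription. How `wkb` chooses the split points is not shown; the sketch below assumes roughly equal segments, and `split_points` is a hypothetical helper.

```python
import math


def split_points(duration_s, max_len_s=900):
    """Return (start, end) offsets in seconds for near-equal segments no longer than `max_len_s`."""
    if duration_s <= 0:
        return []
    n_segments = math.ceil(duration_s / max_len_s)
    return [
        (i * duration_s // n_segments, (i + 1) * duration_s // n_segments)
        for i in range(n_segments)
    ]


print(split_points(600))   # a single segment, no split required
print(split_points(2000))  # three segments
```

Equal-length segments avoid a very short final piece, which tends to transcribe poorly when it starts mid-sentence.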

### Make use of your knowledge base

You can use the data in your knowledge base in various ways.

You can summarize a particular entry - for example, the YouTube video that we just ingested:

In [8]:
print(collection.summarize_entry(youtube_url))

This collection of videos explores the topic of vectors and their role in Weaviate. Vectors are sequences of numbers that are used to capture and represent meaning in various contexts. They are versatile and can represent complex meanings, such as colors or the essence of objects. The videos also discuss vector embeddings, which are representations of meaning obtained through deep learning models.

Weaviate, being a vector database, makes use of vector embeddings to perform similarity-based searches and other operations on different types of data. By watching these videos, the reader will gain an understanding of the concept of vectors, how they capture meaning, and their practical application in Weaviate for powerful operations.

In summary, this collection of videos covers the topics of vectors, vector embeddings, and their role in Weaviate. By watching these videos, the reader can learn about the concept of vectors and how they are used to capture meaning, as well as how Weaviate le

Or you can ask it to suggest key ideas to learn about for a given topic.

You can also pass a custom prompt, for example to share what you learned from a video:

In [9]:
prompt = """
Summarize the key learnings in a tweet or two
 - use emojis and make it interesting.
"""
print(collection.summarize_entry(youtube_url, custom_prompt=prompt))

📽️ Learn about vectors in Weaviate! They're series of numbers that capture meaning in different contexts. 🌈 They're used in photo editing, color selection, and more. Weaviate uses vector embeddings for searching and data operations. Watch the videos to understand vectors' role and applications! #Weaviate #Vectors


You can also ask questions about a particular object:

In [10]:
print(collection.ask_object(youtube_url, "what is a vector? Explain in one tweet with emojis"))

A vector is a series of numbers 📊 that captures meaning in various contexts. It can be used to represent colors 🎨 or more complex concepts. Modern AI models use vectors to represent the essence of objects, like text, code, images, and videos. These vectors are derived through deep learning models 🧠. Weaviate leverages vector embeddings for similarity-based searches and powerful operations on different types of data. 🚀 #Vectors #MeaningCapture #AI

PDFs can be added by URL in the same way:


In [4]:
pdf_url = "https://arxiv.org/pdf/2302.07842.pdf"
collection.add_pdf(pdf_url)

In [5]:
prompt = """
Summarize the key learnings in a tweet or two
 - use emojis and make it interesting.
"""
print(collection.summarize_entry(pdf_url, custom_prompt=prompt))

📚 Key Learnings from the Texts on Language Models and Reasoning:
- Language models have made progress in NLP tasks, but they often provide non-factual predictions.
- Combining reasoning and tools within language models can solve complex tasks.
- External tools like document retrievers and search engines improve LM outputs.
- LM reasoning abilities can be enhanced through strategies like prompting and retrieval.
- LMs can be used in various domains like question-answering, decision-making, and robotics.
- Augmented language models (ALMs) can reason, use tools, and act, outperforming regular LMs.
- Challenges in ALMs include interpretability, consistency, and scalability.
- Reinforcement learning and prompt pre-training improve LM performance.
- Research explores the integration of LMs with external tools and their ethical implications. 🧠🔧🚀


In [6]:
pdf_url = "https://arxiv.org/pdf/2210.03629.pdf"
collection.add_pdf(pdf_url)
prompt = """
Summarize the key learnings in a tweet or two
 - use emojis and make it interesting.
"""
print(collection.summarize_entry(pdf_url, custom_prompt=prompt))

Adding the data from https://arxiv.org/pdf/2210.03629.pdf
Key learnings from the provided text excerpts:
1️⃣ The ReAct paradigm combines reasoning and acting in language models for solving various language reasoning and decision-making tasks. It outperforms other baselines in performance, generalization, interpretability, and controllability.
2️⃣ Language models can be used for object search, improving dialogue agent alignment, task-oriented dialogue, product search and decision-making, biographical information search, and switching between different methods based on task requirements.
3️⃣ Language models can enhance an agent's performance in complex reasoning and decision-making tasks by augmenting an agent's action space with language, known as thoughts or reasoning traces.
4️⃣ Augmenting language models with thoughts or reasoning traces can be scalable and can improve performance in various tasks, including interactive information retrieval, task solving, and reinforcement learning.

You can also add a local video file; its audio track is extracted, split if necessary, and transcribed:

In [2]:
movie_path = "srcdata/Weaviate 1.21 features overview (2023-07-31 09_03 GMT+1).mp4"
collection.add_from_movie_local(movie_path)

MoviePy - Writing audio in /var/folders/hw/hsnpzf0926n_30_bkr09mb200000gn/T/tmpbv5wgvwi.mp3


                                                                        

MoviePy - Done.
Audio saved to temporary file: /var/folders/hw/hsnpzf0926n_30_bkr09mb200000gn/T/tmpbv5wgvwi.mp3
Splitting audio to 3
Getting transcripts from 3 audio files...
Processing transcript 1 of 3...
Processing transcript 2 of 3...
Processing transcript 3 of 3...
Adding the data from srcdata/Weaviate 1.21 features overview (2023-07-31 09_03 GMT+1).mp4
Adding the data from srcdata/Weaviate 1.21 features overview (2023-07-31 09_03 GMT+1).mp4
Adding the data from srcdata/Weaviate 1.21 features overview (2023-07-31 09_03 GMT+1).mp4


67

In [3]:
print(collection.summarize_entry(movie_path))

The audio recordings cover various topics related to the features of Weaviate 1.21. These include the contains operators that provide a more user-friendly syntax, optimizations for arm 64 to improve performance, joint marketing efforts with Intel, and the development of new features such as contains any and period. By listening to the audio, the reader can learn about the new features in Weaviate 1.21, the community's priorities, and the efforts to improve performance and user experience.

Other topics discussed in the recordings include the use of the map function and p read in Weaviate, window support, tenant activation and deactivation in multi-tenancy, scoring changes, and active shards. The recordings also touch on concepts such as memory and file descriptor usage, background process scheduling, fetch limits, and improvements for large-scale multi-tenancy. By listening to these discussions, the reader can gain insights into the specific features and improvements in Weaviate 1.21, 

In [5]:
for question in [
    "ContainsAny and ContainsAll operators",
    "pread function vs mmap",
    "Arm64 neon",
    "Windows support",
    "Delta encode in HNSW",
    "Reduce HNSW lock contention",
    "Deactivate / activate tenant shard",
    "Hybrid fusion scoring",
    "Large scale multi tenancy",
    "Backup improvements",
]:
    print("==========")
    print(question)
    print(collection.ask_object(movie_path, question=question))

ContainsAny and ContainsAll operators
Embedded weaviate wasn't listening on port 6666, so starting embedded weaviate again
Started /Users/jphwang/.cache/weaviate-embedded: process ID 74515


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-07-31T15:37:33+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-07-31T15:37:33+01:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"knowledge_chunk_RzEWYPF2HONc","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-07-31T15:37:33+01:00","took":619000}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-07-31T15:37:33+01:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-07-31T15:37:33+01:00"}


The 'ContainsAny' and 'ContainsAll' operators are features that are highly demanded by the community in the Weaviate 1.21 release. These operators provide a more convenient syntax to avoid long chains of 'and' or 'or' conditions in queries. 

The 'ContainsAny' operator allows users to search for objects that contain any of the specified values in an array. For example, if an object has an array of languages ['Chinese', 'French', 'English'], and the user searches for 'French' and 'English' using the 'ContainsAny' operator, the object will be returned because it contains either 'French' or 'English'.

On the other hand, the 'ContainsAll' operator is more restrictive. It allows users to search for objects that contain all of the specified values in an array. For example, if the user searches for 'French', 'English', and 'Turkish', the object will not be returned because it only contains 'French' and 'English', but not 'Turkish'.

These operators are primarily intended for arrays, but they
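
The semantics described in the answer above can be illustrated in plain Python. This is not Weaviate query syntax, only a sketch of the set logic using the example values from the answer; the function names are chosen for illustration.

```python
def contains_any(values, targets):
    """True if at least one target appears in values."""
    return bool(set(values) & set(targets))


def contains_all(values, targets):
    """True only if every target appears in values."""
    return set(targets) <= set(values)


languages = ["Chinese", "French", "English"]
print(contains_any(languages, ["French", "English"]))             # True
print(contains_all(languages, ["French", "English", "Turkish"]))  # False
```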

In [9]:
for question in [
    "Scoring by overfetching",
]:
    print("==========")
    print(question)
    print(collection.ask_object(movie_path, question=question))

Changes to the scoring by overfetching
The changes to the scoring by overfetching in the upcoming release of Weaviate include:

1. Contains operators: The highest upvoted feature on the roadmap, which prioritizes overfetching for zero and once, and then 150.

2. Load balancing improvement: Resolving issues where error messages were displayed due to nodes being stuck fetching data from disk.

3. Data encoding for graph connections: Storing the graph in memory in a more space-efficient way, resulting in memory enhancements and potential performance improvements.

4. Reduce HNSW lock contention: Improving the current HNSW implementation by reducing unnecessary locks, which could potentially improve performance.

5. Scoring program enhancement: Addressing the issue where changing the limit parameter in queries caused scoring changes, by internally setting a higher limit and cutting off the results.

6. Backup improvements: Making backups more efficient by sending fewer files and reducing t
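
Item 5 above mentions that scores changed with the `limit` parameter and that this was addressed by fetching more results internally and then cutting them off. The sketch below shows one way such an effect can arise and be avoided. It assumes scores are min-max normalized over the fetched results, which is an assumption made for illustration and not taken from Weaviate's code; `normalize`, `search`, and `internal_limit` are hypothetical names.

```python
def normalize(scores):
    """Min-max normalize raw scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def search(raw_scores, limit, internal_limit=None):
    """Return normalized scores for the top `limit` results.

    With `internal_limit`, normalize over the larger fetched set first, then truncate.
    """
    ranked = sorted(raw_scores, reverse=True)
    fetch = max(limit, internal_limit or 0)
    return normalize(ranked[:fetch])[:limit]


raw = [9.0, 7.0, 4.0, 1.0]
naive_2 = search(raw, limit=2)
naive_3 = search(raw, limit=3)
stable_2 = search(raw, limit=2, internal_limit=4)
stable_3 = search(raw, limit=3, internal_limit=4)
print(naive_2, naive_3)    # the second result's score depends on the limit
print(stable_2, stable_3)  # shared results keep the same scores
```

In the naive version the lowest fetched score defines the bottom of the scale, so asking for more results rescales every score; fixing the internal fetch size removes that dependency for any limit up to that size.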