## Demo notebook - personal knowledge repo
### Powered by Weaviate :)

Import the module and start the database. Then connect to the data (the collection).

You'll be interacting with the data through the `collection`.
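The setup cell reads `MY_EMAIL` and `OPENAI_APIKEY` with `os.getenv`, which silently returns `None` when a variable is unset. A minimal sketch of a check you could run first is shown here; the helper name `missing_env_vars` is hypothetical and not part of `wkb`.

```python
import os

# Variable names taken from the setup cell in this notebook.
REQUIRED_VARS = ("MY_EMAIL", "OPENAI_APIKEY")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these environment variables first: " + ", ".join(missing))
```

Passing a plain dict as `environ` makes the helper easy to try without touching the real environment.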

In [1]:
import wkb
import os
client = wkb.start_db()
collection = wkb.Collection(client, wkb.WV_CLASS, user_agent=f'My Project ({os.getenv("MY_EMAIL")})')
collection.set_apikey(openai_key=os.getenv("OPENAI_APIKEY"))  # The class is configured to use `text2vec-openai` and `generative-openai`
collection.reinitialize_db()  # This will delete your database!

Started /Users/peej/.cache/weaviate-embedded: process ID 15697


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-07-22T19:42:10+01:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-07-22T19:42:10+01:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-07-22T19:42:10+01:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"knowledge_chunk_1Poy225PNopx","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-07-22T19:42:10+01:00","took":3344792}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-07-22T19:42:10+01:00"}


Found class. Skipping class creation
Creating a new class:


{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"knowledge_chunk_rh5Ah0ASYpHe","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-07-22T19:42:10+01:00","took":41542}


### Load data

Several built-in functions make it easier to add data to your knowledge base.

To add a text file, specify its path:

In [2]:
collection.add_text_file("./srcdata/kubernetes_concepts_overview.txt")

Adding the data from ./srcdata/kubernetes_concepts_overview.txt


13
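The number printed after a load is presumably the count of chunks stored, though the notebook does not say so. How `wkb` splits text is not shown here either. As a rough illustration only, one common approach is fixed-size word windows with some overlap; the function name `chunk_words` and its default sizes are assumptions made for this sketch.

```python
def chunk_words(text, chunk_size=100, overlap=20):
    """Split text into windows of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 250 placeholder words.
sample = " ".join(f"w{i}" for i in range(250))
print(len(chunk_words(sample)))
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some words twice.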

In [3]:
collection.summarize_topic("kubernetes")

"Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services. It provides service discovery and load balancing, storage orchestration, automated rollouts and rollbacks, automatic bin packing, self-healing, secret and configuration management. Kubernetes aims to support a diverse variety of workloads and does not limit the types of applications supported. It is not a traditional, all-inclusive PaaS system and does not provide application-level services or deployment of source code. Kubernetes eliminates the need for orchestration and provides a set of independent, composable control processes. It is highly portable and can run on various platforms. Kubernetes combines Google's experience with best practices from the community. It evolved from the need to manage containers in a production environment and provides benefits such as agility, scalability, and resource utilization."

In [4]:
pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"
collection.add_from_pdf_online(pdf_url)

Adding the data from https://arxiv.org/pdf/1706.03762.pdf


52

In [5]:
collection.ask_object(pdf_url, "What is attention in a language model? Explain in plain language")

'In a language model, attention refers to the ability of the model to focus on specific parts of the input sequence when generating the output. It is like a spotlight that allows the model to pay more attention to certain words or phrases that are important for understanding the context and generating accurate predictions.\n\nAttention helps the model to capture dependencies between different words in the input sequence, regardless of their distance from each other. It allows the model to consider the relevance and importance of each word when generating the output. By attending to different parts of the input sequence, the model can better understand the relationships between words and generate more accurate and coherent responses.\n\nIn simpler terms, attention in a language model is like a mechanism that helps the model to pay attention to important words and phrases in the input sequence, so that it can generate more accurate and meaningful output.'

In [2]:
pdf_path = "srcdata/llama-2.pdf"
collection.add_from_pdf_local(pdf_path)

Adding the data from srcdata/llama-2.pdf


274

In [4]:
collection.summarize_entry(pdf_path)

'The provided text is a collection of excerpts from a document titled "Llama 2" that covers various topics related to the development and evaluation of pretrained and fine-tuned large language models (LLMs) optimized for dialogue use cases. The document discusses the performance of the Llama 2-Chat models compared to open-source chat models, the importance of safety testing and tuning before deploying applications of Llama 2, the impact of safety data scaling on model performance, the measurement of false refusal rates, and the use of context distillation and answer templates to enhance safety capabilities. By listening to the document, the reader can learn about the advancements in LLMs for dialogue use cases, the challenges and considerations related to safety and ethical use, and the techniques used to improve model performance and mitigate risks.\n\nAdditionally, the document covers topics such as the distribution of safety RM scores when adding different types of prompts, the use 

To add a Wiki article (currently only its summary is added), provide the article title:

In [5]:
collection.summarize_entry(pdf_path, custom_prompt="summarize the information into a series or two or three tweets with some emojis to make them fun and interesting")

'📚 "Llama 2" document discusses pretrained and fine-tuned large language models (LLMs) optimized for dialogue use cases. It covers model performance, safety testing, scaling safety data, false refusal rates, and safety techniques like context distillation. Advancements, applications, and risks of LLMs are explored. #AI #LanguageModels\n\n🔒 "Llama 2-Chat" model evaluation mentions safety RM scores, context distillation, and risks. Fine-tuning improves truthfulness and reduces toxicity. Reinforcement learning and human-machine synergy are effective. In-context temperature rescaling is observed. Safety and effectiveness of Llama 2-Chat are highlighted. #ModelEvaluation #Safety #FineTuning\n\n🌐 Llama 2 and Llama 2-Chat, pretrained and fine-tuned Large Language Models (LLMs), are discussed. Generalization ability, integration with tools, math dataset performance, fine-tuning, safety improvements, and comparisons with other models are covered. Learn about capabilities, benchmarks, methodolog

In [6]:
print(collection.summarize_entry(pdf_path, custom_prompt="summarize the information into a series or two or three short tweets, with emojis to make them fun and interesting"))

📚 Summary of the information:
1. "Llama 2" document discusses pretrained language models optimized for dialogue use cases, safety testing, and model performance compared to open-source models. 🗂️🚀
2. Llama 2-Chat model explores safety, context distillation, and enhancing safety capabilities. 🛡️🔒
3. Llama 2 is a family of pretrained models with impressive generalization ability and potential as a substitute for closed-source models. 🧠💪
4. The document covers performance comparisons, responsible use, safety considerations, and collaboration in the AI community. ⚖️🤝
5. Topics in "llama-2.pdf" include safety labels, human annotations, training details, reward models, and safety evaluation methods. 🔖💡 
6. The text touches on multi-turn prompts, haircuts, language model safety, techniques for fine-tuning, performance benchmarks, and related work. 💇‍♂️📊
7. The document highlights evaluation, benchmarks limitations, data freshness, pronoun analysis, and addressing false information. 📜📊❌
8. Top

In [20]:
collection.summarize_entry(pdf_path, "Did any aspects of Llama 2 design and training involve improving safety? Summarize the answer into a short, fun, interesting tweet with emojis")

'🦙🔬🚀 Llama 2 design & training focused on safety! High importance on safety testing & tuning 👩\u200d🔬✅ They used context distillation & answer templates to enhance safety capabilities 📚🛡️ Learn about advancements, challenges, & how they improved safety! #Llama2 #AI 🦙🔒\n\n'

In [24]:
collection.ask_object(pdf_path, "what is false refusal? Explain in a short tweet with emojis")

'False refusal is when a model incorrectly refuses to answer a legitimate user prompt due to irrelevant safety concerns. 🚫❌ This can happen when the model mistakenly identifies a prompt as unsafe or potentially harmful, leading to a refusal to provide a response. It is important to train models to accurately assess safety and avoid false refusals to ensure helpful and informative interactions. #AIethics #safety'

In [22]:
collection.ask_object(pdf_path, "how was safety performance measured with llama 2")

'The safety performance of Llama 2 was measured using several methods, including safety fine-tuning, red teaming, and safety evaluation. The safety fine-tuning process involved using safety-specific data annotation and tuning to increase the safety of the models. Red teaming exercises were conducted by a set of experts to evaluate the robustness of the models and identify any vulnerabilities. Safety evaluation was performed using a set of adversarial prompts, and human raters judged the model generations for safety violations. The overall safety performance of Llama 2-Chat was compared to other open-source and closed-source models, and it generally performed better or on par with them. However, it is important to interpret these safety results carefully due to limitations such as the subjectivity of the review guidelines and individual raters. The safety performance of Llama 2 was continuously improved through iterative evaluations and model refinements. The details of these safety mea

In [4]:
for wiki_title in [
    "Database",
    "Vector database",
    "Containerization (computing)",
    "Formula One",
]:
    collection.add_wiki_article(wiki_title)

Adding the data from Database
Adding the data from Vector database
Adding the data from Containerization (computing)
Adding the data from Formula One


In [5]:
print(collection.summarize_topic("what is formula one in a couple of sentences?"))

Formula One (F1) is the highest class of international racing for open-wheel single-seater formula racing cars. It is governed by the Fédération Internationale de l'Automobile (FIA) and consists of a series of races known as Grands Prix, which take place in multiple countries and continents around the world. The sport follows a set of rules and regulations, and there are two annual World Championships, one for the drivers and one for the constructors (teams).


In [5]:
print(collection.summarize_topic("what is a database in a couple of sentences?"))

A database is an organized collection of data that is stored and accessed electronically through a database management system (DBMS). It can be used to store and retrieve large quantities of information and provides various functions for managing and organizing data.


If the prompt is right, it should decline to answer questions like this one, which the knowledge base has no data on:

In [6]:
print(collection.summarize_topic("explain space travel in a couple of sentences"))

The text does not contain any information about space travel.
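The prompt that `wkb` actually sends is not shown in this notebook. A minimal sketch of the general idea, restricting the model to the retrieved text and telling it to say so when the answer is missing, might look like this; `build_grounded_prompt` and its wording are hypothetical.

```python
def build_grounded_prompt(question, passages):
    """Build a prompt that restricts the model to the supplied passages."""
    # Number the passages so they stay distinguishable in the prompt.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the text below.\n"
        "If the text does not contain the answer, say so and do not guess.\n\n"
        f"Text:\n{context}\n\n"
        f"Question: {question}"
    )


print(
    build_grounded_prompt(
        "explain space travel in a couple of sentences",
        ["A database is an organized collection of data."],
    )
)
```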


To add a YouTube video, provide its URL:

In [7]:
youtube_url = "https://youtu.be/iFUeV3aYynI"  # Weaviate Academy - what is a vector?
collection.add_from_youtube(youtube_url)

Found Hello Weaviate - What Is a Vector? - downloading
Successfully Downloaded to temp_audio.mp3
Audio file under 900 seconds. No split required.
Getting transcripts from 1 audio files...
Processing transcript 1 of 1...




Adding the data from https://youtu.be/iFUeV3aYynI


5
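The log line "Audio file under 900 seconds. No split required." suggests that longer audio is cut into pieces before transcription. The splitting code itself is not shown, so the following is only a sketch of how segment boundaries could be computed; `split_points` is a hypothetical name, and only the 900-second limit comes from the log.

```python
def split_points(duration_s, max_len_s=900):
    """Return (start, end) offsets in seconds covering `duration_s` in segments of at most `max_len_s`."""
    if duration_s <= 0 or max_len_s <= 0:
        raise ValueError("duration_s and max_len_s must be positive")
    segments = []
    start = 0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


print(split_points(600))   # short clip: a single segment
print(split_points(2000))  # longer clip: several segments
```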

### Make use of your knowledge base

You can use the data in your knowledge base in various ways.

You can summarize a particular entry - for example, the YouTube video that we just ingested:

In [8]:
print(collection.summarize_entry(youtube_url))

This collection of videos explores the topic of vectors and their role in Weaviate. Vectors are sequences of numbers that are used to capture and represent meaning in various contexts. They are versatile and can represent complex meanings, such as colors or the essence of objects. The videos also discuss vector embeddings, which are representations of meaning obtained through deep learning models.

Weaviate, being a vector database, makes use of vector embeddings to perform similarity-based searches and other operations on different types of data. By watching these videos, the reader will gain an understanding of the concept of vectors, how they capture meaning, and their practical application in Weaviate for powerful operations.

In summary, this collection of videos covers the topics of vectors, vector embeddings, and their role in Weaviate. By watching these videos, the reader can learn about the concept of vectors and how they are used to capture meaning, as well as how Weaviate le

Or, you can ask it to suggest key ideas that you can learn about, given a particular topic.

Or maybe you want to share what you learned from a video:

In [9]:
prompt = """
Summarize the key learnings in a tweet or two
 - use emojis and make it interesting.
"""
print(collection.summarize_entry(youtube_url, custom_prompt=prompt))

📽️ Learn about vectors in Weaviate! They're series of numbers that capture meaning in different contexts. 🌈 They're used in photo editing, color selection, and more. Weaviate uses vector embeddings for searching and data operations. Watch the videos to understand vectors' role and applications! #Weaviate #Vectors


You can also ask questions about a particular object:

In [10]:
print(collection.ask_object(youtube_url, "what is a vector? Explain in one tweet with emojis"))

A vector is a series of numbers 📊 that captures meaning in various contexts. It can be used to represent colors 🎨 or more complex concepts. Modern AI models use vectors to represent the essence of objects, like text, code, images, and videos. These vectors are derived through deep learning models 🧠. Weaviate leverages vector embeddings for similarity-based searches and powerful operations on different types of data. 🚀 #Vectors #MeaningCapture #AI
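The answers above mention similarity-based search over vectors. As a small illustration of what "similar" can mean for two vectors, here is a sketch using cosine similarity on RGB colour triples. The colours are toy values chosen for this example; real embeddings come from a model and have many more dimensions, and the metric a given Weaviate class uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy "embeddings": colours as RGB triples.
colours = {"orange": (255, 165, 0), "blue": (0, 0, 255)}
query = (255, 0, 0)  # red

# Rank the stored colours by similarity to the query, most similar first.
ranked = sorted(colours, key=lambda name: cosine_similarity(query, colours[name]), reverse=True)
print(ranked)
```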
