## Weaviate (intro) workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-workshop/blob/main/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Goals:

#### What you will see:


- Create a vector database with Weaviate,
- Add data to the database, and
- Interact with the data, including searching it and using LLMs with your data in Weaviate.

#### What you will learn:

- What Weaviate is,
- How it stores the data (based on its "meaning"), and
- What you can do with Weaviate, like semantic searches, and using LLMs to transform data.

## Preparation: Get the data

We'll use a dataset of movies from TMDB. 

Pre-processed version: "./data/movies.csv"


Load (or download) the data, and preview it

In [1]:
import pandas as pd

# movie_df = pd.read_csv("./data/movies.csv")
movie_df = pd.read_csv("https://raw.githubusercontent.com/weaviate-tutorials/intro-workshop/main/data/movies.csv")
movie_df.head()

| | backdrop_path | genre_ids | id | original_language | original_title | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /rH0DPF7pB35jxLxKb3JRUgCrrnp.jpg | [10751, 14, 16, 10749] | 11224 | en | Cinderella | Cinderella has faith her dreams of a better li... | 100.819 | /avz6S9HYWs4O8Oe4PenBFNX4uDi.jpg | 1950-02-22 | Cinderella | False | 7.044 | 6523 | 1950 |
| 1 | /p47ihFj4A7EpBjmPHdTj4ipyq1S.jpg | [18] | 599 | en | Sunset Boulevard | A hack screenwriter writes a screenplay for a ... | 57.74 | /sC4Dpmn87oz9AuxZ15Lmip0Ftgr.jpg | 1950-08-10 | Sunset Boulevard | False | 8.312 | 2485 | 1950 |
| 2 | /zyO6j74DKMWfp5snWg6Hwo0T3Mz.jpg | [80, 18, 9648] | 548 | ja | 羅生門 | Brimming with action while incisively examinin... | 21.011 | /vL7Xw04nFMHwnvXRFCmYYAzMUvY.jpg | 1950-08-26 | Rashomon | False | 8.091 | 2121 | 1950 |
| 3 | /b4yiLlIFuiULuuLTxT0Pt1QyT6J.jpg | [16, 10751, 14, 12] | 12092 | en | Alice in Wonderland | On a golden afternoon, young Alice follows a W... | 75.465 | /20cvfwfaFqNbe9Fc3VEHJuPRxmn.jpg | 1951-07-28 | Alice in Wonderland | False | 7.2 | 5697 | 1951 |
| 4 | /mxf8hJJkHTCqZP3m4o8E1TtwHHs.jpg | [35, 10749] | 872 | en | Singin' in the Rain | In 1927 Hollywood, a silent film production co... | 31.407 | /w03EiJVHP8Un77boQeE7hg9DVdU.jpg | 1952-04-09 | Singin' in the Rain | False | 8.2 | 3036 | 1952 |


## Step 1: Connect to Weaviate

This workshop connects to a Weaviate instance running locally. You can also use a hosted instance on Weaviate Cloud, or install Weaviate anywhere using the open-source distribution.

If you are using Cohere or OpenAI, uncomment and update the relevant lines in the following cell with your actual keys. If you are using Ollama, you do not need to do anything.

In [9]:
headers={
    # "X-Cohere-Api-Key": "<your_cohere_apikey>",  # Replace this with your actual key
    # "X-OpenAI-Api-Key": "<your_openai_apikey>",  # Replace this with your actual key
}

In [11]:
import weaviate

# If you have got Weaviate running locally with Kubernetes or Docker:
client = weaviate.connect_to_local(
    port=80,  # Or 8080 for Docker instances
    headers=headers
)

# # If you are waiting for Docker to download, comment out the above, and uncomment this instead:
# client = weaviate.connect_to_embedded(
#     version="1.26.3",
#     headers=headers,
#     environment_variables={
#         "ENABLE_API_BASED_MODULES": "true"
#     }
# )

Retrieve Weaviate instance information to check our configuration.

In [12]:
client.get_meta()

{'hostname': 'http://[::]:8080',
 'modules': {'generative-anthropic': {'documentationHref': 'https://docs.anthropic.com/en/api/getting-started',
   'name': 'Generative Search - Anthropic'},
  'generative-anyscale': {'documentationHref': 'https://docs.anyscale.com/endpoints/overview',
   'name': 'Generative Search - Anyscale'},
  'generative-aws': {'documentationHref': 'https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html',
   'name': 'Generative Search - AWS'},
  'generative-cohere': {'documentationHref': 'https://docs.cohere.com/reference/chat',
   'name': 'Generative Search - Cohere'},
  'generative-mistral': {'documentationHref': 'https://docs.mistral.ai/api/',
   'name': 'Generative Search - Mistral'},
  'generative-octoai': {'documentationHref': 'https://octo.ai/docs/text-gen-solution/getting-started',
   'name': 'Generative Search - OctoAI'},
  'generative-ollama': {'documentationHref': 'https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion'

## Step 2: Add data to Weaviate

### Add collection definition

The equivalent of a SQL "table" is called a "collection" in Weaviate, as it is in some NoSQL databases.

In case a demo collection called "Movie" already exists, let's delete it.

In [40]:
client.collections.delete("Movie")

Now, let's create a new collection definition.
We'll set up a collection called "Movie" with:
- Two "named vectors", which will save different "meanings" of the data,
- A "generative" module, which will allow us to use LLMs with our data, and
- Properties to save our movie data (which are like SQL columns).
    - Just the title, overview, year and popularity for now.

In [41]:
from weaviate.classes.config import Configure, DataType, Property

client.collections.create(
    name="Movie",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="overview",
            data_type=DataType.TEXT,
        ),
        Property(
            name="popularity",
            data_type=DataType.NUMBER,
        ),
        Property(
            name="year",
            data_type=DataType.INT,
        ),
    ],
    # ========================================
    # For those using Ollama:
    # ========================================
    vectorizer_config=[
        Configure.NamedVectors.text2vec_ollama(
            name="title",
            source_properties=["title"],
            api_endpoint="http://host.docker.internal:11434",  # If using Docker, use this to contact your local Ollama instance
            model="nomic-embed-text",  # The model to use, e.g. "snowflake-arctic-embed"
        ),
        Configure.NamedVectors.text2vec_ollama(
            name="all_text",
            source_properties=["title", "overview"],
            api_endpoint="http://host.docker.internal:11434",  # If using Docker, use this to contact your local Ollama instance
            model="nomic-embed-text",  # The model to use, e.g. "snowflake-arctic-embed"
        ),
    ],
    generative_config=Configure.Generative.ollama(
        api_endpoint="http://host.docker.internal:11434",
        model="gemma2:2b"
    ),
    # ========================================
    # END - Ollama setup
    # ========================================
    # # ========================================
    # # For those using Cohere:
    # # ========================================
    # vectorizer_config=[
    #     Configure.NamedVectors.text2vec_cohere(
    #         name="title",
    #         source_properties=["title"]
    #     ),
    #     Configure.NamedVectors.text2vec_cohere(
    #         name="all_text",
    #         source_properties=["title", "overview"]
    #     ),
    # ],
    # generative_config=Configure.Generative.cohere(),
    # # ========================================
    # # END - Cohere setup
    # # ========================================
    # # ========================================
    # # For those using OpenAI:
    # # ========================================
    # vectorizer_config=[
    #     Configure.NamedVectors.text2vec_openai(
    #         name="title",
    #         source_properties=["title"]
    #     ),
    #     Configure.NamedVectors.text2vec_openai(
    #         name="all_text",
    #         source_properties=["title", "overview"]
    #     ),
    # ],
    # generative_config=Configure.Generative.openai(),
    # # ========================================
    # # END - OpenAI setup
    # # ========================================
)

<weaviate.collections.collection.sync.Collection at 0x127705a90>

> Tip: You can get example collection definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/collections

Was our collection created successfully? Let's take a look

In [22]:
client.collections.exists("Movie")

True

### Add data

We'll add actual objects (the equivalent of SQL rows) to our collection.

First, let's build objects to add - and take a look at a couple.

In [42]:
data_columns = ['title', 'overview', 'year', 'popularity']

df = movie_df[data_columns]

df.head()

| | title | overview | year | popularity |
|---|---|---|---|---|
| 0 | Cinderella | Cinderella has faith her dreams of a better li... | 1950 | 100.819 |
| 1 | Sunset Boulevard | A hack screenwriter writes a screenplay for a ... | 1950 | 57.74 |
| 2 | Rashomon | Brimming with action while incisively examinin... | 1950 | 21.011 |
| 3 | Alice in Wonderland | On a golden afternoon, young Alice follows a W... | 1951 | 75.465 |
| 4 | Singin' in the Rain | In 1927 Hollywood, a silent film production co... | 1952 | 31.407 |


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [43]:
from tqdm import tqdm

movies = client.collections.get("Movie")

with movies.batch.fixed_size(200) as batch:
    for i, row in tqdm(df.iterrows()):
        obj_body = {
            c: row[c] for c in data_columns
        }
        batch.add_object(
            properties=obj_body
        )

1322it [00:15, 87.68it/s]
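To get a feel for what a fixed-size batch does, here is a minimal sketch in plain Python. It is not the Weaviate client: `FixedSizeBuffer` and its `send` callback are hypothetical names made up for illustration, and we are only assuming that a fixed-size batcher buffers objects and sends them in groups of 200, with a final smaller group at the end.

```python
from typing import Callable, Dict, List


class FixedSizeBuffer:
    """Toy stand-in for a fixed-size batcher (hypothetical, not the Weaviate client)."""

    def __init__(self, batch_size: int, send: Callable[[List[Dict]], None]):
        self.batch_size = batch_size
        self.send = send
        self.pending: List[Dict] = []

    def add_object(self, properties: Dict) -> None:
        # Buffer the object, and send the buffer once it is full.
        self.pending.append(properties)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered (if anything) and start a new buffer.
        if self.pending:
            self.send(self.pending)
            self.pending = []


# Record the size of each batch instead of sending it anywhere.
sent_batches: List[int] = []
buffer = FixedSizeBuffer(batch_size=200, send=lambda objs: sent_batches.append(len(objs)))

for i in range(1322):  # same object count as the import above
    buffer.add_object({"title": f"Movie {i}"})
buffer.flush()  # send the final, partially filled batch

print(sent_batches)
```

With 1322 objects and a batch size of 200, this sends six full batches and one batch of 122, rather than 1322 individual requests.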


#### Confirm data load

Do we have data? 

Let's get an object count

In [44]:
print(len(movies))

1322


Does the data look right?

Let's grab a few objects from Weaviate!

In [45]:
response = movies.query.fetch_objects(limit=3)
for o in response.objects:
    print(o.properties)

{'title': 'Good Will Hunting', 'overview': 'When professors discover that an aimless janitor is also a math genius, a therapist helps the young man confront the demons that are holding him back.', 'year': 1997, 'popularity': 144.285}
{'title': 'The Amazing Spider-Man', 'overview': "Peter Parker is an outcast high schooler abandoned by his parents as a boy, leaving him to be raised by his Uncle Ben and Aunt May. Like most teenagers, Peter is trying to figure out who he is and how he got to be the person he is today. As Peter discovers a mysterious briefcase that belonged to his father, he begins a quest to understand his parents' disappearance – leading him directly to Oscorp and the lab of Dr. Curt Connors, his father's former partner. As Spider-Man is set on a collision course with Connors' alter ego, The Lizard, Peter will make life-altering choices to use his powers and shape his destiny to become a hero.", 'year': 2012, 'popularity': 120.385}
{'title': 'The Intouchables', 'overview

Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.
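For comparison, here is roughly what the same idea looks like as a SQL `WHERE` clause, sketched with Python's built-in `sqlite3` module. The `movie` table and its two columns are assumptions made for this sketch; the titles and years are copied from outputs elsewhere in this notebook.

```python
import sqlite3

# A handful of (title, year) pairs taken from this notebook's outputs.
rows = [
    ("Good Will Hunting", 1997),
    ("The Amazing Spider-Man", 2012),
    ("Guardians of the Galaxy Vol. 2", 2017),
    ("Guardians of the Galaxy Vol. 3", 2023),
    ("Galaxy Quest", 1999),
]

# Hypothetical in-memory table, only for illustrating the SQL analogy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO movie VALUES (?, ?)", rows)

# The condition keeps or drops each row; it does not rank the rows that pass.
filtered_titles = [
    title
    for (title,) in conn.execute(
        "SELECT title FROM movie WHERE year > ? LIMIT 3", (2015,)
    )
]
conn.close()

print(filtered_titles)
```

As with the Weaviate filter, every row either passes the condition or does not, and there is no notion of one match being "better" than another.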

In [46]:
from weaviate.classes.query import Filter

response = movies.query.fetch_objects(
    filters=Filter.by_property("year").greater_than(2015),
    limit=3
)

for o in response.objects:
    print(o.properties["title"])

Suicide Squad
Doctor Strange
X-Men: Apocalypse


But a filter does not rank the results in any meaningful way. It only includes or excludes objects.

To rank results, we need a search, such as a keyword search (as opposed to a *filter*).

### Keyword search

Unlike a filter, a keyword search looks for the keyword and ranks the results with a BM25 score, which is based largely on how often the keyword appears in each object.
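To see why keyword frequency affects the ranking, here is a simplified BM25 scorer in plain Python. This is a sketch, not Weaviate's implementation: the whitespace tokenization, the parameter values `k1=1.2` and `b=0.75`, and the three short example texts are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            tf = counts[term]
            # Term frequency, dampened by k1 and normalized by document length.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Three made-up texts of equal length, so only the keyword count differs.
example_texts = [
    "a crew must save the galaxy",
    "a galaxy far from our galaxy",
    "a hack screenwriter writes a screenplay",
]

scores = bm25_scores("galaxy", example_texts)
for text, score in zip(example_texts, scores):
    print(round(score, 3), text)
```

The text that mentions "galaxy" twice scores highest, the one that mentions it once scores lower, and the one without the keyword scores zero.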

In [47]:
from weaviate.classes.query import MetadataQuery

response = movies.query.bm25(
    query="galaxy",
    limit=5,
    return_metadata=MetadataQuery(score=True, last_update_time=True)
)

for o in response.objects:
    print(o.metadata.score)
    print(o.metadata.last_update_time)
    print(o.properties)

2.310587167739868
2024-09-06 17:38:55.270000+00:00
{'title': 'Guardians of the Galaxy Vol. 3', 'overview': 'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.', 'year': 2023, 'popularity': 165.416}
2.310587167739868
2024-09-06 17:38:50.464000+00:00
{'year': 2017, 'title': 'Guardians of the Galaxy Vol. 2', 'overview': "The Guardians must fight to keep their newfound family together as they unravel the mysteries of Peter Quill's true parentage.", 'popularity': 142.267}
2.1399571895599365
2024-09-06 17:38:37.638000+00:00
{'year': 2002, 'title': 'Star Wars: Episode II - Attack of the Clones', 'overview': 'Following an assassination attempt on Senator Padmé Amidala, Jedi Knights Anakin Skywalker and Obi-Wan Kenobi investigate a mysterious plot that could change the galaxy forever.'

### Semantic search

A semantic search, on the other hand, finds objects based on how similar their meaning is to the query, rather than on exact keyword matches.

In [48]:
import json

response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
)

for o in response.objects:
    print(json.dumps(o.properties, indent=2))

{
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "popularity": 62.01
}
{
  "title": "Star Wars",
  "overview": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in 

#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.
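As a rough illustration of what "most similar" means, here is a brute-force sketch in plain Python. It assumes cosine distance as the metric, and the three-dimensional vectors are invented for this example (real embeddings have hundreds of dimensions). Weaviate uses a vector index rather than comparing against every object like this.

```python
import math
from typing import List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity: 0 means same direction, 1 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "Galaxy Quest": [0.9, 0.1, 0.0],
    "Star Wars": [0.7, 0.6, 0.1],
    "Sunset Boulevard": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedding of "galaxy"

# Rank titles from the smallest distance (most similar) to the largest.
ranked = sorted(
    toy_vectors,
    key=lambda title: cosine_distance(query_vector, toy_vectors[title]),
)

for title in ranked:
    print(title, round(cosine_distance(query_vector, toy_vectors[title]), 3))
```

A smaller distance means a closer match, which is why the results above are ordered by increasing `distance`.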

In [49]:
import json

response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata)
    print(json.dumps(o.properties, indent=2))

MetadataReturn(creation_time=None, last_update_time=None, distance=0.2529594898223877, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
{
  "year": 1999,
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "popularity": 62.01
}
MetadataReturn(creation_time=None, last_update_time=None, distance=0.4024823307991028, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=

This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [50]:
response = movies.query.near_text(
    query="galaxy",
    limit=3,
    target_vector="title",  # or "all_text"
    include_vector=True,
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.metadata.distance)
    print(json.dumps(o.properties, indent=2))
    print(o.vector["title"])

0.2529594898223877
{
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "year": 1999,
  "title": "Galaxy Quest",
  "popularity": 62.01
}
[-0.5893528461456299, 0.5016435980796814, -3.231532096862793, 0.6001119017601013, -0.4437905550003052, 0.6938812732696533, -0.6911576986312866, 0.24620701372623444, -0.40899139642715454, 0.6123637557029724, 0.6818543076515198, 2.7342238426208496, 1.2442903518676758, 0.9821367263793945, 0.15932443737983704, -0.617268025875

These vector representations come from deep learning models similar to those that power LLMs. They capture meaning, and are called vector "embeddings".

### Generative search

A generative search transforms your data at retrieval time. 
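There are two kinds of prompt here: a single prompt is applied to each retrieved object, with `{property}` placeholders filled in from that object, while a grouped task is applied once to the whole result set. The sketch below mimics this in plain Python. The placeholder substitution uses Python's `str.format`, and the way the grouped prompt is assembled is purely an assumption for illustration, since the exact prompt Weaviate sends to the LLM is not shown in this notebook.

```python
import json

# Two objects copied from query results shown in this notebook.
results = [
    {
        "title": "Guardians of the Galaxy Vol. 2",
        "overview": "The Guardians must fight to keep their newfound family together "
        "as they unravel the mysteries of Peter Quill's true parentage.",
    },
    {
        "title": "Good Will Hunting",
        "overview": "When professors discover that an aimless janitor is also a math genius, "
        "a therapist helps the young man confront the demons that are holding him back.",
    },
]

single_prompt = "Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}."
grouped_task = "What audience demographic might enjoy this group of movies?"

# Single prompt: one prompt per object, with placeholders filled from its properties.
per_object_prompts = [single_prompt.format(**obj) for obj in results]

# Grouped task: one prompt for the whole result set.
# ASSUMPTION: serializing the objects as JSON is illustrative only.
grouped_prompt = grouped_task + "\n" + json.dumps(results, indent=2)

for prompt in per_object_prompts:
    print(prompt)
print(grouped_prompt)
```

This is why a generative query can return both one grouped output and one output per object.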

In [51]:
response = movies.generate.near_text(
    query="galaxy",
    limit=5,
    target_vector="title",
    single_prompt="Write a tweet promoting the movie with TITLE: {title} and OVERVIEW: {overview}.",
    grouped_task="What audience demographic might enjoy this group of movies?"
)

print(response.generated)
for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

Based on the movie descriptions, here's an analysis of potential audience demographics:

**General Audience:**  All of these movies appeal to a wide range of viewers, but they each have specific audiences within that general group.

* **Casual Sci-Fi Fans:** These films all share themes of space exploration and adventure, offering compelling stories with iconic characters that are easily recognizable. 
    * This audience enjoys escapism through entertainment value and will likely enjoy the comedy elements present in "Galaxy Quest". 
* **Millennial & Gen X Audiences:**  "Star Wars", "Guardians of the Galaxy", and "Interstellar" appeal to a generation raised on these classics, offering nostalgia and emotional connection.
* **Family-Friendly Fans:** The themes of heroism and justice, as seen in "Star Wars" and "Guardians of the Galaxy", are appealing for families with children who enjoy adventure stories. 

**Specific Demographic Breakdown:**

* **"Galaxy Quest":**  This film is targeted

You can see here ⬆️ that the LLM answered the grouped task once for the whole set of results, and transformed each object into a tweet based on our single prompt.

You can ask LLMs to perform all sorts of tasks

In [52]:
response = movies.generate.near_text(
    query="galaxy",
    target_vector="title",
    limit=3,
    single_prompt="Summarise the following movie overview into a short French sentence: {overview}."
)

for o in response.objects:
    print(o.generated)
    print(json.dumps(o.properties, indent=2))

Après une cancellation, les acteurs d'une série de science-fiction doivent sauver l'univers en interprétant "Galaxy Quest". 

{
  "year": 1999,
  "title": "Galaxy Quest",
  "overview": "For four years, the courageous crew of the NSEA protector - \"Commander Peter Quincy Taggart\" (Tim Allen), \"Lt. Tawny Madison (Sigourney Weaver) and \"Dr.Lazarus\" (Alan Rickman) - set off on a thrilling and often dangerous mission in space...and then their series was cancelled! Now, twenty years later, aliens under attack have mistaken the Galaxy Quest television transmissions for \"historical documents\" and beam up the crew of has-been actors to save the universe. With no script, no director and no clue, the actors must turn in the performances of their lives.",
  "popularity": 62.01
}
Leia est en captivité et les forces impérialistes cherchent à instaurer une domination sur l'Empire, alors que Luke, Han et leurs compagnons vont la libérer et restaurer la paix. 

{
  "year": 1977,
  "title": "Star 

In [None]:
client.close()