## Tutorial 1: Academy Weaviate
### [101T Work with: Text Data](https://docs.weaviate.io/academy/py/starter_text_data)



### 👉 Populate the database
- **Configure a collection** with typical settings and vectorizer set.
- **Create a collection** and work with a collection object.
- Import data using **batch imports**.

###  ➡️ Preparation
Populate our Weaviate instance with a movie dataset, using the OpenAI API to embed the text data.

- 👁️⚠️ 1. Make sure your `Weaviate instance` is set up and running.
- 👁️⚠️ 2. You will need an `OpenAI API Key`. If you don't have one, sign up for an account on the OpenAI website and create an API key.

- **Source Data**: 
    - We are going to use a movie dataset sourced from [TMDB](https://www.themoviedb.org/). 
    - The dataset can be found in this [GitHub repository](https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json), and it contains bibliographic information on ~700 movies released between 1990 and 2024.

###  ➡️ Create a Collection
Weaviate stores data in "collections". 

A **collection** is a set of objects that share the same data structure. 

`Note`: In our movie database, we might have a `collection of movies`, a `collection of actors`, and a `collection of reviews`.

⚠️ Each collection definition must have a name.

```python
import weaviate

import weaviate.classes.config as wc
import os


# Instantiate your client (not shown). e.g.:
# headers = {"X-OpenAI-Api-Key": os.getenv("OPENAI_APIKEY")}  # Replace with your OpenAI API key
# client = weaviate.connect_to_weaviate_cloud(..., headers=headers) or
# client = weaviate.connect_to_local(..., headers=headers)

client.collections.create(
    name="Movie",
    properties=[
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        wc.Property(name="overview", data_type=wc.DataType.TEXT),
        wc.Property(name="vote_average", data_type=wc.DataType.NUMBER),
        wc.Property(name="genre_ids", data_type=wc.DataType.INT_ARRAY),
        wc.Property(name="release_date", data_type=wc.DataType.DATE),
        wc.Property(name="tmdb_id", data_type=wc.DataType.INT),
    ],
    # Define the vectorizer module
    vector_config=wc.Configure.Vectors.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai()
)

client.close()

```

### Properties
Properties are the object attributes that you want to store in the collection. Each property has a name and a data type.

###  Vectorizer configuration
If you do not supply a vector yourself, Weaviate uses the vectorizer configured for the collection (here, `text2vec_openai`) to generate vector embeddings from your data.

### Generative configuration
If you wish to use your collection with a generative model (e.g. a large language model), you must specify the generative module.


### Python classes
The code example makes use of classes such as `Property`, `DataType` and `Configure`. They are defined in the `weaviate.classes.config` **submodule** and are used to define the collection. 

For convenience, we import the submodule as `wc` and access the classes through it (for example, `wc.Property`).

```python
import weaviate.classes.config as wc
import os
```

###  ➡️ Import Data

```python
import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Get the collection
movies = client.collections.use("Movie")

# Enter context manager
with movies.batch.fixed_size(batch_size=200) as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
            # references=reference_obj  # You can add references here
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

```

### ⬜ Explain the code

### ◽ Preparation

- 1. The code uses the `requests` library to load the data from the source, in this case a JSON file.
- 2. The data is then converted to a Pandas DataFrame for easier manipulation.
- 3. It then gets a collection object (with `client.collections.use`) to interact with the `Movie` collection.


### ◽  Batch context manager
The batch object is a context manager that allows you to add objects to the batcher.
This is useful when you have a large amount of data to import, as it abstracts away the complexity of managing the batch size and when to send the batch.
This example uses the `.fixed_size()` method to create a batcher which sets the number of objects per batch.

```python
with movies.batch.fixed_size(batch_size=200) as batch:
```

There are also other batcher types, like `.rate_limit()` <u>for specifying the number of objects per minute</u> and `.dynamic()` <u>to create a dynamic batcher, which automatically determines and updates the batch size during the import process.</u>
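
To make the idea of a fixed-size batcher concrete, here is a minimal, purely conceptual sketch in plain Python. It is not how the Weaviate client is implemented; `fixed_size_batches` is a hypothetical helper that only illustrates grouping a stream of objects into batches of at most `batch_size` items, with a smaller final batch for the remainder.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def fixed_size_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` items (hypothetical helper, illustration only)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch  # a full batch is ready to be "sent"
            batch = []
    if batch:
        yield batch  # flush the final, partially filled batch


# 450 objects with batch_size=200 give two full batches and one partial batch.
batch_lengths = [len(b) for b in fixed_size_batches(range(450), batch_size=200)]
print(batch_lengths)  # [200, 200, 50]
```

With the real client you never write this loop yourself: the context manager queues the objects and sends each batch for you.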


### ◽  Add data to the batcher

#### ▫️ Convert data types
Some fields arrive as strings in the source data and are converted to the data types declared in the collection:
- the `release_date` is converted to a **datetime** object.
- the `genre_ids` are converted to a **list of integers**.

```python
# Convert a JSON date to `datetime` and add time zone information
release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
    tzinfo=timezone.utc
)
# Convert a JSON array to a list of integers
genre_ids = json.loads(movie["genre_ids"])

```
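
The two conversions can be tried in isolation, without a Weaviate instance. The sketch below uses a made-up record (`raw_movie` and its values are hypothetical) and assumes, as the tutorial code implies, that both fields arrive as strings.

```python
import json
from datetime import datetime, timezone

# Hypothetical record; the values are made up for illustration.
raw_movie = {"release_date": "1999-03-31", "genre_ids": "[28, 878]"}

# Parse the date string, then attach UTC so the datetime is timezone-aware.
release_date = datetime.strptime(raw_movie["release_date"], "%Y-%m-%d").replace(
    tzinfo=timezone.utc
)

# Parse the JSON-encoded array into a Python list of integers.
genre_ids = json.loads(raw_movie["genre_ids"])

print(release_date.isoformat())  # 1999-03-31T00:00:00+00:00
print(genre_ids)  # [28, 878]
```

Attaching the time zone explicitly avoids a naive `datetime`, whose offset would otherwise be ambiguous.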

#### ▫️ Add objects to the batcher
We loop through the data and add each object to the batcher.
The `batch.add_object` method is used <u>to add the object to the batcher</u>, and the batcher will send the batch according to the specified batcher type.

```python
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
            # references=reference_obj  # You can add references here
        )
```
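
The `uuid=generate_uuid5(movie["id"])` argument derives the object ID from the TMDB ID. A likely reason is determinism: the same input always yields the same UUID, so re-running the import targets the same object IDs instead of generating new random ones. The sketch below illustrates that property with the standard library only; `deterministic_uuid` is a hypothetical stand-in and is not claimed to produce the same values as Weaviate's `generate_uuid5`.

```python
import uuid


def deterministic_uuid(identifier: object) -> str:
    """Hypothetical stand-in: derive a stable UUIDv5 from an identifier."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(identifier)))


first = deterministic_uuid(603)
second = deterministic_uuid(603)
other = deterministic_uuid(604)

print(first == second)  # True: same identifier, same UUID
print(first == other)   # False: different identifier, different UUID
```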

### ◽ Error handling

Because a batch includes multiple objects, some objects may fail to import while others succeed. The batcher records these failures in `failed_objects`.
You can inspect them to see what went wrong, and then decide how to handle them, for example by raising an exception.
In this example, we simply print the number of failed objects.

```python
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

```

`Note` that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
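
If a bare count is not enough, one option is to build a short report right after the `with` block exits and before another batcher is created. The sketch below is an assumption-laden illustration: `summarize_failures` is a hypothetical helper, and it only uses `repr()` on each entry so that it makes no assumptions about the attributes of the client's error objects. The sample list stands in for a copy of `movies.batch.failed_objects`.

```python
from typing import List, Sequence


def summarize_failures(failed_objects: Sequence[object], max_shown: int = 3) -> List[str]:
    """Build report lines for failed objects (hypothetical helper)."""
    if not failed_objects:
        return []
    lines = [f"Failed to import {len(failed_objects)} objects"]
    # Show only the first few entries to keep the output readable.
    for failed in list(failed_objects)[:max_shown]:
        lines.append(f"  {failed!r}")
    return lines


# Stand-in data; in the tutorial this would be the list of failed objects
# read right after the batch context manager exits.
example_failures = ["error A", "error B", "error C", "error D"]
report = summarize_failures(example_failures)
print("\n".join(report))
```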