# Embedding Toolkit Tutorial Notebook

 __Run this notebook on the runtime `py-embedding`.__

In [1]:
import random
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

## Notebook Content Overview

0. [**Background**](#Background): Introduction to the Embedding Toolkit tutorial notebook.

1. [**Load Sample Data**](#Load-Sample-Data): Loads the sample data required to run this notebook.

2. [**Example Embedding Extraction Pipe**](#Example-Embedding-Extraction-Pipe): The `SentenceBERT` pipe processes text and generates embeddings for the sample data.

3. [**Use a Different Pipe**](#Use-a-Different-Pipe): The `BERT` pipe processes text and generates embeddings for the sample data.


## 0. Background 

This tutorial walks you through using the __C3 AI Embedding Toolkit__ to generate embeddings from text data. It covers two embedding extraction pipes, each of which generates embeddings for the loaded sample data.

The first example uses the `Embedding.Extraction.SentenceBertPipe` pipe, and the second uses the `Embedding.Extraction.BertPipe` pipe.

Running this tutorial notebook end-to-end is expected to take approximately 5 minutes. 

## 1. Load Sample Data

We will load the `Twenty Newsgroups` sample dataset from `sklearn.datasets`. For more information about the dataset, refer to the scikit-learn tutorial [here.](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

We will randomly sample 20 rows of text data to feed into the embedding extraction pipe.

In [2]:
categories = ["sci.space"]  ## space category from the dataset
twenty_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42
)  ## fetch the dataset
sample_dataset = random.sample(twenty_train["data"], 20)  ## sample 20 rows
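
Because the module-level `random` generator is not seeded here, `random.sample` draws a different subset on each run, so the rows shown in the outputs below may not match yours. The following is a minimal sketch of a repeatable draw. The stand-in list `documents` (in place of `twenty_train["data"]`) and the seed value `42` are assumptions for illustration only.

```python
import random

# Stand-in corpus; in the notebook this would be twenty_train["data"].
documents = [f"document {i}" for i in range(100)]

# A dedicated generator with a fixed seed makes the draw repeatable
# without touching the global random state.
rng = random.Random(42)
sample_dataset = rng.sample(documents, 20)

# Re-creating the generator with the same seed yields the same sample.
repeat_sample = random.Random(42).sample(documents, 20)
print(sample_dataset == repeat_sample)
```

Using a local `random.Random` instance rather than `random.seed` keeps the seeding from affecting any other code that relies on the global generator.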

Transform the list of text into a DataFrame, then convert it to `c3.Data` so it can be fed into the embedding extraction pipe.

In [3]:
paragraph_ids = [f"par{i}" for i in range(len(sample_dataset))]
sample_data = c3.Data.from_pandas(pd.DataFrame({"subject": paragraph_ids, "textString": sample_dataset}))
sample_data.head()

,subject,textString
0,par0,From: 18084TM@msu.edu (Tom)\nSubject: Level 5?...
1,par1,From: dante@shakala.com (Charlie Prael)\nSubje...
2,par2,From: henry@zoo.toronto.edu (Henry Spencer)\nS...
3,par3,From: dbm0000@tm0006.lerc.nasa.gov (David B. M...
4,par4,From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...


## 2. Example Embedding Extraction Pipe 

### Load and train the pipe

We will load `all-mpnet-base-v2`, one of the pre-trained `SentenceBERT` language models. Refer to the pre-trained sentence embedding models [here.](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/)

If a GPU is set up in the environment, the embedding extraction pipes are automatically configured to use it. Whether the GPU or the CPU is used is logged to OpenSearch.



In [4]:
embedding_extraction_pipe = c3.Embedding.Extraction.SentenceBertPipe(modelName="all-mpnet-base-v2").withName(
    "GeneralEmbeddingExtraction"
)
embedding_extraction_pipe

{
  "type" : "Embedding.Extraction.SentenceBertPipe",
  "name" : "GeneralEmbeddingExtraction",
  "modelName" : "all-mpnet-base-v2"
}

In [5]:
trained_embedding_extraction_pipe = embedding_extraction_pipe.train().result()
trained_embedding_extraction_pipe

{
  "type" : "Embedding.Extraction.SentenceBertPipe",
  "id" : "371a5395-d71e-4528-8c2b-a5c9ea651fd0",
  "name" : "GeneralEmbeddingExtraction_trained",
  "meta" : {
    "created" : "2024-01-02T18:34:00Z",
    "updated" : "2024-01-02T18:34:00Z",
    "timestamp" : "2024-01-02T18:34:00Z"
  },
  "version" : 1,
  "typeIdent" : "ATOM:EMBEP:EESBP"
}

### Generate embeddings

Run `process` on the trained pipe to generate embeddings.

In [6]:
embeddings = trained_embedding_extraction_pipe.process(
    x=sample_data,
).result()

embeddings_df = embeddings.to_pandas()
embeddings_df.head()

subject,textString,embedding
par0,From: 18084TM@msu.edu (Tom)\nSubject: Level 5?...,"[0.029849377, 0.005736005, -0.0009050925, 0.00..."
par1,From: dante@shakala.com (Charlie Prael)\nSubje...,"[0.027563378, 0.0036168434, 0.0041022715, 0.04..."
par2,From: henry@zoo.toronto.edu (Henry Spencer)\nS...,"[0.04398752, 0.10623252, 0.060237866, 0.017894..."
par3,From: dbm0000@tm0006.lerc.nasa.gov (David B. M...,"[0.02024305, 0.05232053, 0.026834602, 0.031365..."
par4,From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...,"[0.08486572, -0.031874884, 0.012951204, -0.008..."
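
Each row of the result holds one embedding vector as a list of floats, so the `embedding` column can be stacked into a matrix and used for similarity comparisons. The following is a minimal sketch of a cosine-similarity lookup. It uses a toy DataFrame (`toy_df`, with made-up 4-dimensional vectors) as a stand-in for `embeddings_df`, and all names in it are illustrative rather than part of the toolkit.

```python
import numpy as np
import pandas as pd

# Toy stand-in for embeddings_df: three paragraphs with 4-dimensional vectors.
toy_df = pd.DataFrame(
    {
        "embedding": [
            [1.0, 0.0, 0.0, 0.0],
            [0.9, 0.1, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
        ]
    },
    index=["par0", "par1", "par2"],
)

# Stack the per-row lists into a (num_rows, num_dims) matrix.
matrix = np.stack(toy_df["embedding"].to_list())

# L2-normalize each row so that the dot product equals cosine similarity.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T, index=toy_df.index, columns=toy_df.index)

# Hide the diagonal (self-similarity), then pick the closest other paragraph.
off_diagonal = similarity.mask(np.eye(len(similarity), dtype=bool))
nearest = off_diagonal.idxmax(axis=1)
print(nearest)
```

With real embeddings, the same steps would apply to `embeddings_df["embedding"]`, assuming every row has the same vector length.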


### Validate embeddings 

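Run basic validation checks to confirm that the embeddings were generated correctly: the result index should match the input subjects, the embedding dimension should match the model, and the embeddings should not be all zeros.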

In [7]:
embeddings_df.index.name = None
pd.testing.assert_index_equal(pd.Index(sample_data["subject"]), embeddings_df.index)

embedding_array = np.stack(list(embeddings_df["embedding"]))
num_dim = embedding_array.shape[1]
model_dimension = 768  # embedding dimension of all-mpnet-base-v2
assert (
    num_dim == model_dimension
), f"Result embedding dimension {num_dim} should be equal to expected model dimension {model_dimension}"
# Compare element-wise: a plain sum could be zero even when the values are not.
assert np.any(embedding_array != 0), "Result embedding array should not be all zeros!"
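
The checks above can be extended to catch a few more failure modes, such as non-finite values or individual all-zero rows. The following is a minimal sketch under that assumption. The helper `validate_embeddings` is hypothetical (it is not part of the Embedding Toolkit) and is exercised here on small synthetic arrays rather than on `embedding_array`.

```python
import numpy as np


def validate_embeddings(array, expected_dim):
    """Return a list of problems found in a (num_rows, num_dims) array."""
    # A wrong shape makes the row-wise checks meaningless, so stop early.
    if array.ndim != 2 or array.shape[1] != expected_dim:
        return [f"expected shape (n, {expected_dim}), got {array.shape}"]

    problems = []
    if not np.isfinite(array).all():
        problems.append("array contains NaN or infinite values")

    # A row is suspicious if every one of its components is exactly zero.
    zero_rows = np.flatnonzero(~array.any(axis=1))
    if zero_rows.size:
        problems.append(f"all-zero rows at positions {zero_rows.tolist()}")
    return problems


good = np.ones((3, 4))
bad = good.copy()
bad[1] = 0.0  # simulate one row that failed to embed

print(validate_embeddings(good, expected_dim=4))
print(validate_embeddings(bad, expected_dim=4))
```

Returning a list of messages instead of raising on the first failure makes it easier to see every problem in one pass.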

### Clean up the pipe

Clean up (i.e., remove from the database or delete from the filesystem) any artifacts or dependencies related to this pipe.

In [8]:
trained_embedding_extraction_pipe.cleanUp()

## 3. Use a Different Pipe

Instead of the `SentenceBERT` pipe, use the `BERT` pipe (here with the `bert-base-uncased` model) as the embedding extraction pipe to generate embeddings.

Refer to the pre-trained `BERT` models on Hugging Face [here.](https://huggingface.co/models?other=bert)

In [9]:
embedding_extraction_bert_pipe = c3.Embedding.Extraction.BertPipe(modelName="bert-base-uncased").withName(
    "DomainEmbeddingExtraction"
)
trained_embedding_extraction_bert_pipe = embedding_extraction_bert_pipe.train().result()

embeddings_bert_pipe = trained_embedding_extraction_bert_pipe.process(
    x=sample_data,
).result()

embeddings_bert_pipe_df = embeddings_bert_pipe.to_pandas()
embeddings_bert_pipe_df.head()

subject,textString,embedding
par0,From: 18084TM@msu.edu (Tom)\nSubject: Level 5?...,"[-0.3665319, -0.14466904, 0.35170555, 0.265233..."
par1,From: dante@shakala.com (Charlie Prael)\nSubje...,"[-0.0123363575, 0.20874111, 0.118544504, 0.255..."
par2,From: henry@zoo.toronto.edu (Henry Spencer)\nS...,"[0.20646484, 0.08387961, 0.008860225, -0.17527..."
par3,From: dbm0000@tm0006.lerc.nasa.gov (David B. M...,"[0.08737392, -0.1271832, 0.12903346, 0.3338226..."
par4,From: baalke@kelvin.jpl.nasa.gov (Ron Baalke)\...,"[-0.7950698, -0.1124316, 0.029607521, -0.09579..."
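
Vectors produced by different models are not directly comparable with each other, and their scales can differ: the `BERT` values printed above are visibly larger in magnitude than the `SentenceBERT` ones. The following is a minimal sketch of a side-by-side summary of two embedding matrices. The helper `summarize` is hypothetical, and the two arrays are synthetic stand-ins for the stacked outputs of the two pipes.

```python
import numpy as np


def summarize(name, array):
    """Report row count, dimensionality, and mean L2 norm of a 2-D array."""
    norms = np.linalg.norm(array, axis=1)
    return {
        "pipe": name,
        "rows": array.shape[0],
        "dims": array.shape[1],
        "mean_norm": float(norms.mean()),
    }


rng = np.random.default_rng(0)

# Synthetic stand-in with unit-length rows.
unit_rows = rng.normal(size=(5, 8))
unit_rows /= np.linalg.norm(unit_rows, axis=1, keepdims=True)

# Synthetic stand-in with unnormalized, larger-magnitude rows.
raw_rows = rng.normal(size=(5, 8)) * 3.0

summary_unit = summarize("unit_rows", unit_rows)
summary_raw = summarize("raw_rows", raw_rows)
print(summary_unit)
print(summary_raw)
```

If the norms differ substantially between pipes, cosine similarity is usually a safer comparison within each pipe than raw dot products.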


In [10]:
trained_embedding_extraction_bert_pipe.cleanUp()