# Featureform MLOps Podcast Chatbot

This is an example of building a chatbot that is contextualized with statements from the MLOps Weekly Podcast.

## Requirements

* Python 3.7+
* `.env` file with one or both sets of credentials (visit [Pinecone](https://www.pinecone.io/) and/or [Weaviate](https://weaviate.io/) for instructions on creating an account and getting credentials):
```
# Pinecone

PINECONE_PROJECT_ID=
PINECONE_ENVIRONMENT=
PINECONE_API_KEY=

# Weaviate

WEAVIATE_URL=
WEAVIATE_API_KEY=

# OpenAI
#
# Set your OpenAI key near the bottom of this example.
# You'll also need the openai PyPI library: pip install openai
```
* [`python-dotenv 1.0.0`](https://pypi.org/project/python-dotenv/)
* [Topic Labeled News Dataset](https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset)
* Featureform installed:
```shell
pip install featureform
```
* Hugging Face [`sentence-transformers`](https://huggingface.co/sentence-transformers) installed:
```shell
pip install sentence-transformers
```
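
Before running the steps below, it can help to confirm that the `.env` values are actually set. The following is a minimal sketch using only the standard library. The helper name `missing_credentials` is hypothetical (it is not part of Featureform or python-dotenv), and it assumes the variables have already been loaded into the environment, for example with `dotenv.load_dotenv`.

```python
import os

# Variable names taken from the .env template above.
PINECONE_VARS = ["PINECONE_PROJECT_ID", "PINECONE_ENVIRONMENT", "PINECONE_API_KEY"]
WEAVIATE_VARS = ["WEAVIATE_URL", "WEAVIATE_API_KEY"]


def missing_credentials(required, env=None):
    """Return the names in `required` that are unset or blank in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "PINECONE_PROJECT_ID": "abc123",
    "PINECONE_ENVIRONMENT": "",
    "PINECONE_API_KEY": "secret",
}
print(missing_credentials(PINECONE_VARS, example_env))
```

Calling `missing_credentials(PINECONE_VARS)` with no second argument checks the real environment, which is useful right after loading `.env`.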

## Step 1. Register Source

`data/files` is a directory of CSV files, which use `;` as a delimiter and hold transcripts of recent episodes of the MLOps podcast. Each row is a comment made by a speaker and has the following columns:

* Speaker
* Start time
* End time
* Duration
* Text
* filename

We'll register the entire directory at once.
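
To make the file layout concrete, here is a small sketch that parses one `;`-delimited transcript with the standard library. The sample rows are invented for illustration (the real files live in `data/files`), and the column names follow the list above. The sample omits `filename`, which the Step 2 transformation fills in.

```python
import csv
from io import StringIO

# Hypothetical sample; the real transcripts are in data/files.
sample = (
    "Speaker;Start time;End time;Duration;Text\n"
    "Host;00:00:01;00:00:05;00:00:04;Welcome to the show.\n"
    "Guest;00:00:06;00:00:12;00:00:06;Thanks for having me.\n"
)

# DictReader maps each row to the header names, splitting on ";".
rows = list(csv.DictReader(StringIO(sample), delimiter=";"))
for row in rows:
    print(row["Speaker"], "->", row["Text"])
```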

In [None]:
import featureform as ff
from featureform import local

client = ff.Client(local=True)

**NOTE:** We'll create an instance of the client to register resources as we define them.

In [None]:
episodes = local.register_directory(
    name="mlops-episodes",
    path="data/files",
    description="Transcripts from recent MLOps episodes",
)

In [None]:
client.apply()

## Step 2. Transform Transcripts

When registering a directory, files are converted into a table with columns `"filename"` and `"body"`. This avoids registering each file individually; in our case, though, we'll need to parse each file's body into rows to get the table ready for vectorization.

In [None]:
@local.df_transformation(inputs=[episodes], name="process_episode_files")
def process_episode_files(dir_df):
    from io import StringIO
    import pandas as pd

    episode_dfs = []
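    for _, row in dir_df.iterrows():
        # Each row holds one file: parse its body as a ";"-delimited CSV.
        csv_str = StringIO(row["body"])
        r_df = pd.read_csv(csv_str, sep=";")
        # Record which file each comment came from.
        r_df["filename"] = row["filename"]
        episode_dfs.append(r_df)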

    return pd.concat(episode_dfs)

We can verify this worked as we expected by serving this source as a dataframe and inspecting the results.

In [None]:
df = client.dataframe(process_episode_files)

df.head()

## Step 3. Entity ID Transformation

For our purposes, we'll need a unique identifier for each speaker's comment, so we'll combine `"Speaker"`, `"Start time"` and `"filename"` into a new column, `"PK"`.

In [None]:
@local.df_transformation(inputs=[process_episode_files])
def speaker_primary_key(episodes_df):
    episodes_df["PK"] = episodes_df.apply(lambda row: f"{row['Speaker']}_{row['Start time']}_{row['filename']}", axis=1)
    
    return episodes_df

In [None]:
client.apply()

In [None]:
df = client.dataframe(speaker_primary_key)

df.head()
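
The key format can be checked outside Featureform. This sketch rebuilds the same f-string for a few invented rows and confirms that the resulting keys are unique. `make_pk` is a hypothetical helper and the rows are made up for illustration.

```python
def make_pk(row):
    """Build the composite key from Speaker, Start time, and filename."""
    return f"{row['Speaker']}_{row['Start time']}_{row['filename']}"


# Invented rows: the same speaker and timestamp appear in two different files.
rows = [
    {"Speaker": "Host", "Start time": "00:00:01", "filename": "episode_1.csv"},
    {"Speaker": "Host", "Start time": "00:00:01", "filename": "episode_2.csv"},
    {"Speaker": "Guest", "Start time": "00:00:06", "filename": "episode_1.csv"},
]

pks = [make_pk(row) for row in rows]
assert len(pks) == len(set(pks)), "composite keys should be unique"
print(pks[0])
```

Including `filename` in the key is what keeps the first two rows distinct, since the same speaker can start talking at the same timestamp in different episodes.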

## Step 4. Embeddings Transformation

We'll use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to create embeddings for each speaker's comments. When we register an entity and associate a feature with it, this transformation will be materialized and the embeddings will be persisted in a Pinecone index.

In [None]:
@local.df_transformation(inputs=[speaker_primary_key])
def vectorize_comments(episodes_df):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(episodes_df["Text"].tolist())
    episodes_df["Vector"] = embeddings.tolist()
    
    return episodes_df

## Step 5. Register Pinecone

We'll be using Pinecone for this example, but you can also choose to use Weaviate.

This step assumes you have a `.env` file with your Pinecone credentials.

In [None]:
import dotenv
import os

dotenv.load_dotenv(".env")

pinecone = ff.register_pinecone(
    name="pinecone",
    project_id=os.getenv("PINECONE_PROJECT_ID", ""),
    environment=os.getenv("PINECONE_ENVIRONMENT", ""),
    api_key=os.getenv("PINECONE_API_KEY", ""),
)

In [None]:
client.apply()

## Step 6. Register Entity, Features, and Embeddings and Write Them to the Vector DB

We'll now register an entity and a feature, which will kick off the materialization process.

**NOTE:**
This may take some time to complete. See the progress bar for status.

In [None]:
@ff.entity
class Speaker:
    comment_embeddings = ff.Embedding(
        vectorize_comments[["PK", "Vector"]],
        dims=384,
        vector_db=pinecone,
        description="Embeddings created from speakers' comments in episodes",
        variant="v1"
    )
    comments = ff.Feature(
        speaker_primary_key[["PK", "Text"]],
        type=ff.String,
        description="Speakers' original comments",
        variant="v1"
    )

In [None]:
client.apply()
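
The `dims=384` argument has to match the length of the vectors produced in Step 4 (`all-MiniLM-L6-v2` produces 384-dimensional embeddings). A quick pre-flight check along these lines can catch a mismatch before materialization. `check_dims` is a hypothetical helper, not a Featureform API, and the vectors are toy data.

```python
def check_dims(vectors, dims=384):
    """Return the indices of vectors whose length differs from `dims`."""
    return [i for i, vec in enumerate(vectors) if len(vec) != dims]


# Toy data: two well-formed vectors and one that is too short.
vectors = [[0.0] * 384, [0.1] * 384, [0.2] * 100]
print(check_dims(vectors))
```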

## Step 7. Register On-Demand Features to Retrieve Relevant Context

We'll want to query the embeddings we created and then fetch their related documents. We can do both with Featureform's on-demand feature decorator, which creates a feature that's calculated on the client at serving time.

In [None]:
@ff.ondemand_feature()
def relevent_comments(client, params, entity):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    search_vector = model.encode(params["query"])
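    # `params` is a dict, so read the neighbor count from an optional "k" key
    # (the key name and the default of 3 are assumptions made for this example).
    res = client.nearest("comment_embeddings", "v1", search_vector, k=params.get("k", 3))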
    return res
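
Conceptually, a nearest-neighbor lookup ranks the stored vectors by similarity to the query vector and returns the keys of the top `k`. The sketch below illustrates that idea with cosine similarity over a tiny in-memory index. It is not how Pinecone or `client.nearest` is implemented, and the similarity metric actually used depends on how the index is configured. `nearest_keys` and the toy vectors are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_keys(index, query, k):
    """Return the k keys in `index` whose vectors are most similar to `query`."""
    ranked = sorted(index, key=lambda pk: cosine(index[pk], query), reverse=True)
    return ranked[:k]


# Toy index keyed by the PK format from Step 3; vectors are made up.
index = {
    "Host_00:00:01_episode_1.csv": [1.0, 0.0, 0.0],
    "Guest_00:00:06_episode_1.csv": [0.9, 0.1, 0.0],
    "Host_00:01:30_episode_2.csv": [0.0, 1.0, 0.0],
}
print(nearest_keys(index, [1.0, 0.0, 0.0], k=2))
```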

In [None]:
@ff.ondemand_feature()
def contextualized_prompt(client, params, entity):
    pks = client.features([relevent_comments], {}, params=params)
    prompt = "Use the following snippets from our podcast to answer the following question\n"

    # Wrap the original text of each relevant comment in a fenced snippet.
    for pk in pks:
        prompt += "```"
        prompt += client.features([("comments", "v1")], {"PK": pk})[0]
        prompt += "```\n"
    prompt += "Question: "
    prompt += params["query"]
    prompt += "?"
    return prompt
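
The string assembly can be tested on its own, without a running client. The sketch below mirrors the layout used in `contextualized_prompt`. `build_prompt` is a hypothetical helper and the snippets are invented.

```python
FENCE = "`" * 3  # triple backtick used to delimit each snippet


def build_prompt(snippets, query):
    """Assemble the prompt: instruction line, fenced snippets, then the question."""
    prompt = "Use the following snippets from our podcast to answer the following question\n"
    for snippet in snippets:
        prompt += f"{FENCE}{snippet}{FENCE}\n"
    prompt += f"Question: {query}?"
    return prompt


print(build_prompt(
    ["Start with monitoring.", "Version your features."],
    "What should I know about MLOps",
))
```

Keeping the formatting in a plain function like this makes it easy to adjust the instruction line or the delimiters and see the effect immediately.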

In [None]:
client.apply()

## Step 8. Feed the Prompt into OpenAI

In [None]:
import openai

# Note: `openai.Completion` is the legacy interface from openai versions before 1.0.
openai.organization = ""
openai.api_key = ""

prompt = client.features([contextualized_prompt], {}, params={"query": "What should I know about MLOps for Enterprise"})
print(openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
)["choices"][0]["text"])