# Featureform MLOps Podcast Chatbot

This is an example of building a chatbot that is contextualized with statements from the MLOps Weekly Podcast.

## Requirements

* Python 3.7+
* `.env` file with one or both sets of credentials (visit [Pinecone](https://www.pinecone.io/) and/or [Weaviate](https://weaviate.io/) for instructions on creating an account and getting credentials):
```
# Pinecone

PINECONE_PROJECT_ID=
PINECONE_ENVIRONMENT=
PINECONE_API_KEY=

# Weaviate

WEAVIATE_URL=
WEAVIATE_API_KEY=

# OpenAI (read in the final step of this example; also install the openai library from PyPI using pip)

OPENAI_ORG=
OPENAI_KEY=
```
* [`python-dotenv 1.0.0`](https://pypi.org/project/python-dotenv/)
* [Topic Labeled News Dataset](https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset)
* Featureform installed:
```shell
pip install featureform
```
* Hugging Face [`sentence-transformers`](https://huggingface.co/sentence-transformers) installed:
```
pip install sentence-transformers
```

## Step 1. Register Source

`data/files` is a directory of CSV files, which use `;` as a delimiter and hold transcripts of recent episodes of the MLOps podcast. Each row is a comment made by a speaker and has the following columns:

* Speaker
* Start time
* End time
* Duration
* Text
* filename

We'll register the entire directory at once.
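
To make the file layout concrete, here is a minimal sketch that parses one `;`-delimited row with Python's standard `csv` module and checks that `Duration` equals `End time` minus `Start time`. The sample row reuses the timestamps of the first transcript row displayed in Step 2, with shortened text. The names `SAMPLE` and `parse_timestamp` are illustrative and not part of Featureform.

```python
import csv
from datetime import timedelta
from io import StringIO

# Illustrative sample in the same layout as the transcript files (text shortened).
SAMPLE = (
    "Number;Speaker;Start time;End time;Duration;Text\n"
    "0;Simba Khadder;00:00:06.170;00:00:22.730;00:00:16.560;Hey everyone.\n"
)


def parse_timestamp(value: str) -> timedelta:
    """Convert an 'HH:MM:SS.mmm' string into a timedelta."""
    hours, minutes, seconds = value.split(":")
    whole, _, millis = seconds.partition(".")
    return timedelta(
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(whole),
        milliseconds=int(millis.ljust(3, "0")[:3]) if millis else 0,
    )


# The delimiter must be set explicitly because the files use ';' rather than ','.
rows = list(csv.DictReader(StringIO(SAMPLE), delimiter=";"))
first = rows[0]
elapsed = parse_timestamp(first["End time"]) - parse_timestamp(first["Start time"])
duration_matches = elapsed == parse_timestamp(first["Duration"])
print(first["Speaker"], elapsed, duration_matches)
```

A check like this is a cheap way to confirm that the delimiter and column names are what the later steps expect.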

In [1]:
import featureform as ff
from featureform import local

client = ff.Client(local=True)



**NOTE:** We'll create an instance of the client to register resources as we define them.

In [2]:
episodes = local.register_directory(
    name="mlops-episodes",
    path="data/files",
    description="Transcripts from recent MLOps episodes",
)

In [3]:
client.dataframe(episodes)

Applying Run: quizzical_wiles
Creating user default_user 
Creating provider local-mode 
Creating source mlops-episodes  quizzical_wiles


Unnamed: 0,filename,body
0,MLOps Weekly - 04-03-2023.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
1,MLOps Weekly - Stefan.mp3 (2).csv,﻿Number;Speaker;Start time;End time;Duration;T...
2,MLOps Weekly - Atindriyo Sanyal (2).mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
3,MLOps Weekly - 07-27-2022 V2.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
4,MLOps Weekly - David Stein.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
5,MLOps Weekly - Piero.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
6,MLOps Weekly - Stefan.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
7,MLOps Weekly - Liran Hason.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
8,MLOps Weekly - 02-02-2022.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...
9,MLOps Weekly - Doris.mp3.csv,﻿Number;Speaker;Start time;End time;Duration;T...


In [4]:
!featureform dash

 * Serving Flask app 'featureform.cli'
 * Debug mode: off
Address already in use
Port 3000 is in use by another program. Either identify and stop that program, or start the server with a different port.


## Step 2. Transform Transcripts

When registering a directory, files are converted into a table with columns `"filename"` and `"body"`. This saves us from registering each file individually; in our case, however, we'll need to process this table to get it ready for vectorization.

In [5]:
@local.df_transformation(inputs=[episodes])
def process_episode_files(dir_df):
    from io import StringIO
    import pandas as pd

    episode_dfs = []
    for _, row in dir_df.iterrows():
        # Each row holds one file: its name and its raw CSV contents.
        csv_str = StringIO(row["body"])
        r_df = pd.read_csv(csv_str, sep=";")
        r_df["filename"] = row["filename"]
        episode_dfs.append(r_df)

    return pd.concat(episode_dfs)
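
The flattening logic can also be sketched without pandas or Featureform, which may help when debugging a single file. The sketch below uses only the standard library. It assumes a file body may start with a byte-order mark (U+FEFF), in which case the `csv` module would otherwise read the first header as `\ufeffNumber`. The function `flatten_bodies` and the two sample files are hypothetical.

```python
import csv
from io import StringIO

BOM = "\ufeff"  # byte-order mark that some exported CSV files start with


def flatten_bodies(files):
    """Turn (filename, body) pairs into one list of row dicts tagged with filename."""
    rows = []
    for filename, body in files:
        # Strip a leading BOM so the first header is read as 'Number'.
        reader = csv.DictReader(StringIO(body.lstrip(BOM)), delimiter=";")
        for row in reader:
            row["filename"] = filename
            rows.append(row)
    return rows


# Hypothetical two-file directory listing with shortened transcript text.
files = [
    ("a.csv", BOM + "Number;Speaker;Text\n0;Host;Welcome back.\n1;Guest;Thanks.\n"),
    ("b.csv", "Number;Speaker;Text\n0;Host;Hello again.\n"),
]
flat = flatten_bodies(files)
print(len(flat), sorted(flat[0]))
```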

We can verify this worked as we expected by serving this source as a dataframe and inspecting the results.

In [6]:
df = client.dataframe(process_episode_files)

df.head()

Applying Run: quizzical_wiles
Creating provider local-mode 
Creating source process_episode_files  quizzical_wiles


Unnamed: 0,Number,Speaker,Start time,End time,Duration,Text,filename
0,0,Simba Khadder,00:00:06.170,00:00:22.730,00:00:16.560,Hey everyone. Simba Khadder here and you are l...,MLOps Weekly - 04-03-2023.mp3.csv
1,1,Mikiko Bazeley,00:00:22.930,00:00:51.720,00:00:28.790,"Hey, everyone, my name is Mikiko Bazeley and I...",MLOps Weekly - 04-03-2023.mp3.csv
2,2,Mikiko Bazeley,00:00:51.780,00:01:16.460,00:00:24.680,I've also worked in a ton of different industr...,MLOps Weekly - 04-03-2023.mp3.csv
3,3,Simba Khadder,00:01:16.610,00:01:43.330,00:00:26.720,I know we've been talking about doing this for...,MLOps Weekly - 04-03-2023.mp3.csv
4,4,Mikiko Bazeley,00:01:43.430,00:01:58.780,00:00:15.350,"Yeah, it's funny because when I've talked to o...",MLOps Weekly - 04-03-2023.mp3.csv


## Step 3. Entity ID Transformation

For our purposes, we'll need a unique identifier for each speaker's comment, so we'll combine `"Speaker"`, `"Start time"`, and `"filename"` into a new column, `"PK"`.

In [7]:
@local.df_transformation(inputs=[process_episode_files])
def speaker_primary_key(episodes_df):
    episodes_df["PK"] = episodes_df.apply(lambda row: f"{row['Speaker']}_{row['Start time']}_{row['filename']}", axis=1)
    
    return episodes_df
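
Since `"PK"` is meant to be unique, it can be worth checking that assumption before materializing anything. The following is a minimal sketch in plain Python that builds keys in the same `Speaker_Start time_filename` layout and reports duplicates. The helpers `make_pk` and `duplicate_keys` are illustrative names, and the rows are taken from the output shown in Step 2.

```python
from collections import Counter


def make_pk(speaker: str, start_time: str, filename: str) -> str:
    """Build a key in the same 'Speaker_Start time_filename' layout as the PK column."""
    return f"{speaker}_{start_time}_{filename}"


def duplicate_keys(keys):
    """Return the keys that occur more than once, in sorted order."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)


rows = [
    ("Simba Khadder", "00:00:06.170", "MLOps Weekly - 04-03-2023.mp3.csv"),
    ("Mikiko Bazeley", "00:00:22.930", "MLOps Weekly - 04-03-2023.mp3.csv"),
    ("Mikiko Bazeley", "00:00:51.780", "MLOps Weekly - 04-03-2023.mp3.csv"),
]
keys = [make_pk(*row) for row in rows]
print(keys[0])
print(duplicate_keys(keys))  # an empty list means every key is unique
```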

In [8]:
df = client.dataframe(speaker_primary_key)

df.head()

Applying Run: quizzical_wiles
Creating provider local-mode 
Creating source speaker_primary_key  quizzical_wiles


Unnamed: 0,Number,Speaker,Start time,End time,Duration,Text,filename,PK
0,0,Simba Khadder,00:00:06.170,00:00:22.730,00:00:16.560,Hey everyone. Simba Khadder here and you are l...,MLOps Weekly - 04-03-2023.mp3.csv,Simba Khadder_00:00:06.170_MLOps Weekly - 04-0...
1,1,Mikiko Bazeley,00:00:22.930,00:00:51.720,00:00:28.790,"Hey, everyone, my name is Mikiko Bazeley and I...",MLOps Weekly - 04-03-2023.mp3.csv,Mikiko Bazeley_00:00:22.930_MLOps Weekly - 04-...
2,2,Mikiko Bazeley,00:00:51.780,00:01:16.460,00:00:24.680,I've also worked in a ton of different industr...,MLOps Weekly - 04-03-2023.mp3.csv,Mikiko Bazeley_00:00:51.780_MLOps Weekly - 04-...
3,3,Simba Khadder,00:01:16.610,00:01:43.330,00:00:26.720,I know we've been talking about doing this for...,MLOps Weekly - 04-03-2023.mp3.csv,Simba Khadder_00:01:16.610_MLOps Weekly - 04-0...
4,4,Mikiko Bazeley,00:01:43.430,00:01:58.780,00:00:15.350,"Yeah, it's funny because when I've talked to o...",MLOps Weekly - 04-03-2023.mp3.csv,Mikiko Bazeley_00:01:43.430_MLOps Weekly - 04-...


## Step 4. Embeddings Transformation

We'll use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to create embeddings for each speaker's comments. When we register an entity and associate a feature with this entity, this transformation will be materialized and the embeddings will be persisted in a Pinecone index.

In [9]:
@local.df_transformation(inputs=[speaker_primary_key])
def vectorize_comments(episodes_df):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(episodes_df["Text"].tolist())
    episodes_df["Vector"] = embeddings.tolist()
    
    return episodes_df

## Step 5. Register Pinecone

We'll be using Pinecone for this example, but you can also choose to use Weaviate.

This step assumes you have a `.env` file with your Pinecone credentials.

In [59]:
import dotenv
import os

dotenv.load_dotenv(".env")

pinecone = ff.register_pinecone(
    name="pinecone",
    project_id=os.getenv("PINECONE_PROJECT_ID", ""),
    environment=os.getenv("PINECONE_ENVIRONMENT", ""),
    api_key=os.getenv("PINECONE_API_KEY", ""),
)

True
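
Because `os.getenv(name, "")` falls back to an empty string, a missing credential would likely only surface later, when the provider is actually used. A small pre-flight check can catch this earlier. This is a sketch under the assumption that the variable names match the `.env` template in the Requirements section; `missing_vars` is a hypothetical helper, not a Featureform or `python-dotenv` function.

```python
import os

# Variable names taken from the `.env` template in the Requirements section.
REQUIRED_PINECONE_VARS = (
    "PINECONE_PROJECT_ID",
    "PINECONE_ENVIRONMENT",
    "PINECONE_API_KEY",
)


def missing_vars(names, env=None):
    """Return the names that are unset or blank in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {"PINECONE_PROJECT_ID": "abc123", "PINECONE_API_KEY": "  "}
print(missing_vars(REQUIRED_PINECONE_VARS, example_env))
```

In the notebook, calling `missing_vars(REQUIRED_PINECONE_VARS)` after `dotenv.load_dotenv(".env")` would check the real environment.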

In [11]:
client.apply()

Applying Run: quizzical_wiles
Creating provider local-mode 
Creating provider pinecone 
Creating source vectorize_comments  quizzical_wiles


## Step 6. Register Entity, Features, and Embeddings, and Write Them to the Vector DB

We'll now register an entity and a feature, which will kick off the materialization process.

**NOTE:**
This may take some time to complete. See the progress bar for status.

In [12]:
@ff.entity
class Speaker:
    comment_embeddings = ff.Embedding(
        vectorize_comments[["PK", "Vector"]],
        dims=384,
        vector_db=pinecone,
        description="Embeddings created from speakers' comments in episodes",
        variant="v2"
    )
    comments = ff.Feature(
        speaker_primary_key[["PK", "Text"]],
        type=ff.String,
        description="Speakers' original comments",
        variant="v2"
    )
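
The `dims=384` argument matches the length of the vectors that `all-MiniLM-L6-v2` produces. If you swap in a different model, the two need to stay in sync. The sketch below shows one way to verify vector lengths before writing them; `check_dims` is a hypothetical helper and the vectors are placeholder data.

```python
def check_dims(vectors, dims=384):
    """Return the positions of vectors whose length differs from dims."""
    return [i for i, vector in enumerate(vectors) if len(vector) != dims]


# Placeholder data: the second vector is one element short.
vectors = [[0.0] * 384, [0.0] * 383, [0.0] * 384]
bad_positions = check_dims(vectors)
print(bad_positions)
```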

In [13]:
!pip install pinecone-client



In [14]:
client.apply()

Applying Run: quizzical_wiles
Creating provider local-mode 
Creating entity speaker 
Creating feature comment_embeddings  v2
Creating feature comments  v2


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Updating Feature Table: |██████████████████████████████████████████████████| 100% Complete

Updating Feature Table: |██████████████████████████████████████████████████| 100% Complete



## Step 7. Register On-Demand Features to Retrieve Relevant Context

We want to query the embeddings we created and then fetch the comments they were generated from. We can do both with Featureform's on-demand feature decorator, which creates a feature that is calculated on the client at serving time.

In [52]:
@ff.ondemand_feature(variant="calhacks")
def relevent_comments(client, params, entity):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    search_vector = model.encode(params["query"])
    res = client.nearest("comment_embeddings", "v2", search_vector, k=3)
    return res

In [54]:
client.apply()
client.features([("relevent_comments", "calhacks")], {}, params={"query": "enterprise MLOps"})

Applying Run: quizzical_wiles
Creating provider local-mode 


array([['Sam Ramji_00:06:20.590_MLOps Weekly - Sam.mp3.csv',
        'Simba Khadder_00:13:31.630_MLOps Weekly - Liran Hason.mp3.csv',
        'Simba Khadder_00:07:02.820_MLOps Weekly - Liran Hason.mp3.csv']],
      dtype='<U61')
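
Conceptually, a nearest-neighbor lookup returns the keys of the `k` stored vectors that are most similar to the query vector. The sketch below illustrates the idea in plain Python with toy 3-dimensional vectors. It assumes cosine similarity as the metric, whereas the metric actually used depends on how the vector index is configured, and `nearest_keys` is an illustrative function rather than Featureform's or Pinecone's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_keys(index, query, k=3):
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda key: cosine_similarity(index[key], query),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings"; the real ones have 384 dimensions.
index = {
    "pk_a": [1.0, 0.0, 0.0],
    "pk_b": [0.9, 0.1, 0.0],
    "pk_c": [0.0, 1.0, 0.0],
    "pk_d": [0.0, 0.0, 1.0],
}
top_two = nearest_keys(index, [1.0, 0.0, 0.0], k=2)
print(top_two)
```

A brute-force scan like this is only practical for small collections; a vector database exists to answer the same question efficiently at scale.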

In [57]:
@ff.ondemand_feature(variant="calhack")
def contextualized_prompt(client, params, entity):
    pks = client.features([("relevent_comments", "calhacks")], {}, params=params)
    prompt = "Use the following snippets from our podcast to answer the following question\n"
    for pk in pks[0]:
        prompt += "```"
        prompt += client.features([("comments", "v2")], {"speaker": pk})[0]
        prompt += "```\n"
    prompt += "Question: "
    prompt += params["query"]
    prompt += "?"
    return prompt


In [58]:
client.apply()
client.features([("contextualized_prompt", "calhack")], {}, params={"query": "enterprise MLOps"})

Applying Run: quizzical_wiles
Resource ondemand_feature already registered.
Creating provider local-mode 
Creating ondemand_feature contextualized_prompt  calhack
Creating ondemand_feature contextualized_prompt  calhacks


array(["Use the following snippets from our podcast to answer the following question\n```The second thing is that MLOps is incredibly valuable because it does create an environment where you can predictably make better decision engines faster. Every business is basically a decision factory.```\n```I want to zoom out a little bit. You mentioned a bit about the creation of MLOps as a category. I have a similar story where I built a whole MLOps platform for my last company. It wasn't called MLOps. I don't even know what we called it.```\n```I think a lot of what I've seen, especially in ML, is everyone has such different pain points. If you're doing your vision and you're a large enterprise or you're fraud detection at a startup, problems you face are different. The MLOps vendors you need, or let's say, Observability versus Feature Store, based on what you're doing, you might eventually need both.```\nQuestion: enterprise MLOps?"],
      dtype='<U927')
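
The prompt layout itself can be isolated in a pure function, which makes it easy to test without a vector database. The sketch below reproduces the layout seen in the output above; `build_prompt` is a hypothetical helper and the snippets are placeholders.

```python
FENCE = "`" * 3  # three backticks, built this way so the example stays readable in Markdown


def build_prompt(snippets, query):
    """Assemble a prompt: instruction line, fenced snippets, then the question."""
    parts = ["Use the following snippets from our podcast to answer the following question\n"]
    for snippet in snippets:
        parts.append(f"{FENCE}{snippet}{FENCE}\n")
    parts.append(f"Question: {query}?")
    return "".join(parts)


prompt = build_prompt(["First snippet.", "Second snippet."], "enterprise MLOps")
print(prompt)
```

Note that the layout always appends a `?`, so a query that already ends with a question mark will end with two.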

## Step 8. Feed the Prompt into OpenAI

Finally, we can feed our prompt into OpenAI!

In [73]:
import openai

client.apply()
q = "What should I know about MLOps testing"
prompt = client.features([("contextualized_prompt", "calhack")], {}, params={"query": q})[0]

# Credentials are read from the environment (for example, from the `.env` file loaded in Step 5).
openai.organization = os.getenv("OPENAI_ORG", "")
openai.api_key = os.getenv("OPENAI_KEY", "")

# NOTE: `openai.Completion` is the legacy interface found in openai versions before 1.0.
print(openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=1000,  # The maximum number of tokens to generate
    temperature=1.0,  # Sampling temperature; higher values give more random output
)["choices"][0]["text"])

Applying Run: quizzical_wiles
Creating provider local-mode 

Answer: MLOps testing involves ensuring your models and their inputs remain consistent and accurate over time. This lifecycle management includes tracking data for experimentation, building the model, and then deploying it for use. Additionally, it is important to monitor the input over time to identify if there are any changes or drifts.
