# Quickstart (Localmode with Pinecone and Weaviate)

## Requirements

* Python 3.7+
* `.env` file with one or both sets of credentials (visit [Pinecone](https://www.pinecone.io/) and/or [Weaviate](https://weaviate.io/) for instructions on creating an account and getting credentials):
```
# Pinecone
PINECONE_PROJECT_ID=
PINECONE_ENVIRONMENT=
PINECONE_API_KEY=

# Hugging Face
HUGGING_FACE_TOKEN=

# Weaviate
WEAVIATE_URL=
WEAVIATE_API_KEY=
```
* [Topic Labeled News Dataset](https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset)
* Featureform installed:
```shell
pip install featureform
```
* Hugging Face [`sentence-transformers`](https://huggingface.co/sentence-transformers) installed:
```shell
pip install sentence-transformers
```
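
The registration code later in this guide reads credentials with `os.getenv(name, "")`, so a missing value silently becomes an empty string. A minimal sketch of a pre-flight check follows; `REQUIRED_VARS` and `missing_credentials` are hypothetical helpers written for this guide, not part of Featureform, and the variable names are taken from the `.env` template above.

```python
import os

# Variable names copied from the .env template in the requirements list.
REQUIRED_VARS = {
    "pinecone": ["PINECONE_PROJECT_ID", "PINECONE_ENVIRONMENT", "PINECONE_API_KEY"],
    "weaviate": ["WEAVIATE_URL", "WEAVIATE_API_KEY"],
}


def missing_credentials(backend, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[backend] if not env.get(name)]


# Report the status of each backend for the current environment.
for backend in REQUIRED_VARS:
    missing = missing_credentials(backend)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{backend}: {status}")
```

If the values live in a `.env` file, call `dotenv.load_dotenv(".env")` before running the check. Only the backend you plan to use needs to report `ok`.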

## Step 1. Register Provider and Source

We'll be using Pinecone for this example, but you can also choose to use Weaviate.

**NOTE:**
Until `sep=";"` is a supported parameter in `register_file`,
you'll need to preprocess the semicolon-delimited dataset into a comma-delimited CSV by doing the following:
```python
import pandas as pd

df = pd.read_csv("labelled_newscatcher_dataset.csv", sep=";")
df.to_csv("newscatcher_dataset.csv", index=False)
```
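
As an alternative sketch, the same delimiter conversion can be done row by row with the standard-library `csv` module, which avoids loading the whole file into memory. `convert_delimiter` is a hypothetical helper, and the sample rows below are made up for illustration.

```python
import csv
import io


def convert_delimiter(src, dst, src_delimiter=";"):
    """Copy CSV rows from src to dst, rewriting them as comma-separated."""
    reader = csv.reader(src, delimiter=src_delimiter)
    writer = csv.writer(dst, lineterminator="\n")
    rows = 0
    for row in reader:
        # Fields that contain a comma are quoted automatically by the writer.
        writer.writerow(row)
        rows += 1
    return rows


# Tiny in-memory example standing in for the real dataset files.
sample = io.StringIO("topic;title\nSCIENCE;Asteroid, seen over England\n")
converted = io.StringIO()
convert_delimiter(sample, converted)
print(converted.getvalue())
```

With real files, open both the source and the destination with `newline=""`, as the `csv` module documentation recommends, and pass the file objects to the function in place of the `StringIO` buffers.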

In [None]:
import featureform as ff
from featureform import local
import dotenv
import os

dotenv.load_dotenv(".env")

pinecone = ff.register_pinecone(
    name="pinecone4",
    project_id=os.getenv("PINECONE_PROJECT_ID", ""),
    environment=os.getenv("PINECONE_ENVIRONMENT", ""),
    api_key=os.getenv("PINECONE_API_KEY", ""),
)

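# `path` must point to the preprocessed, comma-delimited CSV;
# adjust it to wherever you saved the file.
news = local.register_file(
    name="news",
    description="108,774 news articles labelled with 8 topics (balanced)",
    path="ucb-hackathon/newscatcher_dataset.csv",
)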

Now we'll create an instance of the Featureform client and apply the changes. We'll call `client.apply()` after each step to register our changes and checkpoint our work.

In [None]:
from featureform import Client

client = Client(local=True)

client.apply()

In [None]:
!featureform list sources --local

## Step 2. Register Transformation

Given the size of the Newscatcher dataset, we'll create embeddings only for science-related articles. Once we've filtered by topic, we'll use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to create embeddings of the article titles so we can search them later.

In [None]:
@local.df_transformation(inputs=[news])
def vectorize_science_news(news_df):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

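    # Copy the filtered rows so the new column is not written to a view of news_df.
    science_news = news_df[news_df["topic"] == "SCIENCE"].copy()

    # Encode every title; each embedding has 384 dimensions.
    embeddings = model.encode(science_news["title"].tolist())

    science_news["title_embedding"] = embeddings.tolist()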

    print(science_news)

    return science_news

In [None]:
client.apply()

## Step 3. Register Entity and Feature

We'll now register an entity and a feature, which will kick off the materialization process.

**NOTE:**
This may take some time to complete. See the progress bar for status.

In [None]:
@ff.entity
class NewsDomain:
    science_news = ff.Embedding(
        vectorize_science_news[["link", "title_embedding"]],
        dims=384,
        vector_db=pinecone,
    )

In [None]:
client.apply()
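
The `dims=384` argument has to match the length of every vector in `title_embedding`; `all-MiniLM-L6-v2` produces 384-dimensional embeddings. A minimal sketch of a sanity check you could run on the embeddings before materializing is shown below. `find_bad_vectors` is a hypothetical helper, not a Featureform function, and the vectors are placeholders.

```python
EXPECTED_DIMS = 384  # output size of all-MiniLM-L6-v2, matching dims=384 above


def find_bad_vectors(vectors, dims=EXPECTED_DIMS):
    """Return the indices of vectors whose length differs from dims."""
    return [i for i, vec in enumerate(vectors) if len(vec) != dims]


# Placeholder vectors: the second one is deliberately one element short.
vectors = [[0.0] * 384, [0.0] * 383, [0.0] * 384]
print(find_bad_vectors(vectors))  # [1]
```

An empty list means every vector has the expected length.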

## Step 4. Register On-Demand Feature

We'll want to query the embeddings we created, and we can do so using Featureform's on-demand feature decorator. This creates a feature that's calculated on the client at serving time.

In [None]:
@ff.ondemand_feature(name="science_news_search")
def search_science_news(serving_client, params, entity):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    # params[0] is the query text; params[1] is the (name, variant) of the embedding feature.
    search_vector = model.encode(params[0])
    feature, variant = params[1]

    return serving_client.nearest(feature, variant, search_vector.tolist(), k=2)

In [None]:
client.apply()
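
Conceptually, the `nearest` call above asks the vector database for the `k` stored vectors closest to the query vector. The sketch below illustrates that idea with a brute-force cosine-similarity search in plain Python. It is only an illustration: a vector database such as Pinecone uses its own index structures and whichever similarity metric the index is configured with, and the function names and toy data here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_brute_force(index, query, k=2):
    """Return the k keys whose vectors are most similar to query."""
    ranked = sorted(
        index,
        key=lambda key: cosine_similarity(index[key], query),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors keyed by made-up links.
index = {
    "example.com/asteroid": [0.9, 0.1, 0.0],
    "example.com/meteor": [0.8, 0.3, 0.1],
    "example.com/vaccine": [0.0, 0.2, 0.9],
}
print(nearest_brute_force(index, [1.0, 0.0, 0.0], k=2))
# ['example.com/asteroid', 'example.com/meteor']
```

Brute force compares the query against every stored vector, which is fine for a toy example but is the cost a dedicated vector index is designed to avoid.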

## Step 5. Serve On-Demand Feature (i.e. Semantic Search)

Now we'll query the vector database via our on-demand feature.

In [None]:
query = "asteroids over England"
# The variant name below comes from the original run;
# substitute the variants registered in yours if they differ.
embedding_feature_variant = ("science_news", "quizzical_goldstine")
ondemand_feature_variant = ("science_news_search", "quizzical_goldstine")

features = client.features(
    [ondemand_feature_variant],
    {"link": ""},
    params=[query, embedding_feature_variant],
)

df = features[0]
# The result column is named "<feature>.<variant>" and holds the matching links.
column = f"{ondemand_feature_variant[0]}.{ondemand_feature_variant[1]}"
results = "\n".join(df[column].values[0])

print(f"SEARCH RESULTS:\n{results}")

In [None]:
client.apply()