# Getting Started With Embeddings: Notebook Companion



![](/../assets/80_getting_started_with_embeddings/thumbnail.png)

## 1. Embedding a dataset


In [None]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "hf_xxx"  # placeholder: use your own token from https://huggingface.co/settings/tokens

The first time you generate the embeddings, the API may take a while (approximately 20 seconds) to return them. We use the `retry` decorator (install with `pip install retry`) so that if the first call to `query(texts)` fails, the function waits 10 seconds and tries again, for up to three attempts in total. This happens because on the first request the model needs to be downloaded and loaded on the server; subsequent calls are much faster.

In [None]:
%%capture
!pip install retry

In [None]:
import requests
from retry import retry

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}
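
Rather than pasting a token into the notebook, the header can be built from an environment variable. The following is a minimal sketch, not part of the original notebook: the helper `build_headers` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how you export your token.

```python
import os


def build_headers(env_var="HF_TOKEN"):
    """Build the Authorization header from an environment variable.

    `env_var` is an assumed name; use whichever variable holds your token.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": f"Bearer {token}"}
```

With the variable set, `headers = build_headers()` could stand in for the hard-coded `headers` dictionary above.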

In [None]:
@retry(tries=3, delay=10)
def query(texts):
    # Send the texts to the feature-extraction endpoint.
    response = requests.post(api_url, headers=headers, json={"inputs": texts})
    result = response.json()
    # A list means embeddings were returned; anything else is an error payload.
    if isinstance(result, list):
        return result
    # Raising lets @retry wait and call the API again.
    raise RuntimeError(f"No embeddings returned (the model may still be loading): {result}")
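
To make the retry behaviour concrete, here is a rough plain-Python sketch of what `@retry(tries=3, delay=10)` amounts to: up to three attempts, with a pause after each failed one. The helper `call_with_retries` and its `sleep` parameter are hypothetical names introduced for illustration; this is not the `retry` package's implementation.

```python
import time


def call_with_retries(func, tries=3, delay=10, sleep=time.sleep):
    """Call func() up to `tries` times, pausing `delay` seconds after each failure."""
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            sleep(delay)  # give the server time to finish loading the model
```

Passing `sleep` as a parameter keeps the sketch easy to try without actually waiting, for example `call_with_retries(lambda: query(texts))`.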

In [None]:
texts = ["How do I get a replacement Medicare card?",
        "What is the monthly premium for Medicare Part B?",
        "How do I terminate my Medicare Part B (medical insurance)?",
        "How do I sign up for Medicare?",
        "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
        "How do I sign up for Medicare Part B if I already have Part A?",
        "What are Medicare late enrollment penalties?",
        "What is Medicare and who can get it?",
        "How can I get help with my Medicare Part A and Part B premiums?",
        "What are the different parts of Medicare?",
        "Will my Medicare premiums be higher because of my higher income?",
        "What is TRICARE ?",
        "Should I sign up for Medicare Part B if I have Veterans’ Benefits?"]

output = query(texts)

In [None]:
import pandas as pd

embeddings = pd.DataFrame(output)

In [None]:
print(embeddings)

## 2. Host embeddings for free on the Hugging Face Hub


In [None]:
%%capture
!pip install huggingface-hub

In [None]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens.

In [None]:
!huggingface-cli repo create embedded_faqs_medicare --type dataset --organization ITESM

In [None]:
# Set up Git LFS for this user; git-lfs itself is already installed on Colab instances.
!git lfs install

In [None]:
# Replace USERNAME with your Hub username; IPython fills in {hf_token} from the variable defined in section 1.
!git clone https://USERNAME:{hf_token}@huggingface.co/datasets/ITESM/embedded_faqs_medicare

In [None]:
embeddings.to_csv("embedded_faqs_medicare/embeddings.csv", index=False)
print(embeddings.shape)
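
As a rough picture of what ends up in `embeddings.csv`: one header row of column indices, then one row of floats per question. The sketch below mimics that layout with the standard `csv` module and toy 3-dimensional vectors. It is an illustration only; the helper names are hypothetical and it does not use pandas.

```python
import csv
import io


def write_embeddings_csv(rows, handle):
    """Write one embedding per row, with column indices as the header."""
    writer = csv.writer(handle)
    writer.writerow(range(len(rows[0])))  # header: 0, 1, 2, ...
    writer.writerows(rows)


def read_embeddings_csv(handle):
    """Read the file back, skipping the header and parsing each value as a float."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return [[float(value) for value in row] for row in reader]


toy_rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy stand-in for the real embeddings
buffer = io.StringIO()
write_embeddings_csv(toy_rows, buffer)
buffer.seek(0)
restored_rows = read_embeddings_csv(buffer)
print(len(restored_rows), len(restored_rows[0]))
```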

Change into the cloned repository directory, `embedded_faqs_medicare`.

In [None]:
%cd embedded_faqs_medicare/

In [None]:
!git lfs track "*.csv"
!git add .gitattributes
!git add embeddings.csv

In [None]:
!git config --global user.email "your email here"
!git config --global user.name "your git user here"

In [None]:
!git commit -m "First version of the embedded_faqs_medicare dataset"
!git push

## 3. Get the most similar Frequently Asked Questions to a query


In [None]:
%%capture
!pip install datasets

In [None]:
import torch
from datasets import load_dataset

faqs_embeddings = load_dataset('ITESM/embedded_faqs_medicare')
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

In [None]:
question = ["How can Medicare help me?"]
output = query(question)

In [None]:
query_embeddings = torch.FloatTensor(output)
print(f"The size of our embedded dataset is {dataset_embeddings.shape} and of our embedded query is {query_embeddings.shape}.")

In [None]:
%%capture
!pip install -U sentence-transformers

In [None]:
from sentence_transformers.util import semantic_search

hits = semantic_search(query_embeddings, dataset_embeddings, top_k=5)

In [None]:
[texts[hits[0][i]['corpus_id']] for i in range(len(hits[0]))]
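
For intuition, the ranking above can be approximated in plain Python: score every corpus vector against the query with cosine similarity, sort by score, and keep the best `top_k`. This is a simplified sketch on toy 2-dimensional vectors, not the `sentence_transformers` implementation; `cosine` and `top_k_hits` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_hits(query_vec, corpus, top_k=5):
    """Return the top_k corpus entries as dicts, best score first."""
    scored = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-in for dataset_embeddings
toy_hits = top_k_hits([1.0, 0.1], toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])
```

Each hit carries a `corpus_id` and a `score`, which is why the list comprehension above can index `texts` with `hits[0][i]['corpus_id']`.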