# Getting Started With Embeddings: Notebook Companion



![](/../assets/80_getting_started_with_embeddings/thumbnail.png)

## 1. Embedding a dataset


In [39]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "YOUR_HF_TOKEN"  # placeholder: use your own token from https://huggingface.co/settings/tokens and never commit it

The first time you generate the embeddings, the API may take a while (approximately 20 seconds) to return them. We use the `retry` decorator (install with `pip install retry`) so that if `output = query(texts)` fails, the call is retried after a 10-second wait, for up to three attempts in total. The delay occurs because, on the first request, the model has to be downloaded and loaded on the server; subsequent calls are much faster.
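To make the retry behaviour concrete, here is a minimal standard-library sketch of what a decorator like `retry(tries=3, delay=10)` does. The names `simple_retry` and `flaky_query` are hypothetical and exist only for this illustration; the notebook itself uses the `retry` package, whose internals may differ.

```python
import functools
import time


def simple_retry(tries=3, delay=10):
    """Hypothetical stand-in for `retry.retry`: call the function up to `tries` times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Re-raise once the last allowed attempt has failed.
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(tries=3, delay=0)  # delay=0 keeps the demo instant
def flaky_query():
    # Simulates a model that is still loading on the first two requests.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("model is loading")
    return [[0.1, 0.2, 0.3]]


result = flaky_query()
print(calls["count"], result)
```

The simulated call fails twice and succeeds on the third attempt, which is the last one that `tries=3` allows; a fourth failure would have propagated the exception to the caller.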

In [40]:
%%capture
!pip install retry

In [41]:
import requests
from retry import retry

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

In [42]:
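@retry(tries=3, delay=10)
def query(texts):
    response = requests.post(api_url, headers=headers, json={"inputs": texts})
    result = response.json()
    # A successful call returns a list of embeddings, one per input text.
    if isinstance(result, list):
        return result
    # Anything else (typically {"error": ...} while the model loads) raises,
    # which makes the decorator retry instead of silently returning None.
    raise RuntimeError(
        f"The model is currently loading, please re-run the query. Response: {result}"
    )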

In [43]:
texts = ["How do I get a replacement Medicare card?",
        "What is the monthly premium for Medicare Part B?",
        "How do I terminate my Medicare Part B (medical insurance)?",
        "How do I sign up for Medicare?",
        "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
        "How do I sign up for Medicare Part B if I already have Part A?",
        "What are Medicare late enrollment penalties?",
        "What is Medicare and who can get it?",
        "How can I get help with my Medicare Part A and Part B premiums?",
        "What are the different parts of Medicare?",
        "Will my Medicare premiums be higher because of my higher income?",
        "What is TRICARE ?",
        "Should I sign up for Medicare Part B if I have Veterans’ Benefits?"]

output = query(texts)

In [44]:
import pandas as pd

embeddings = pd.DataFrame(output)

In [45]:
print(embeddings)

         0         1         2         3         4         5         6    \
0  -0.023889  0.055259 -0.011655 -0.033414 -0.012261 -0.024873 -0.012663   
1  -0.012688  0.046874 -0.010502 -0.020384 -0.013361  0.042322  0.016628   
2   0.000494  0.119412  0.005230 -0.092734  0.007773 -0.005325  0.034506   
3  -0.029711  0.023298 -0.057041 -0.012183 -0.013710  0.029796  0.063739   
4  -0.025628  0.070389 -0.017380 -0.056567  0.028576  0.052823  0.067063   
5  -0.022656  0.021160  0.005105 -0.046494  0.009074  0.041495  0.054268   
6  -0.002911  0.060791 -0.009176 -0.006133  0.040492  0.036594  0.002054   
7  -0.080526  0.059888 -0.048847 -0.040176 -0.063342  0.041848  0.119045   
8  -0.034388  0.072501  0.014440 -0.036695  0.014019  0.063070  0.034683   
9  -0.005964  0.025044 -0.003182 -0.025243 -0.039823 -0.012772  0.044713   
10 -0.039008 -0.010609 -0.007383 -0.050190 -0.002518 -0.041641  0.026969   
11 -0.095983 -0.063012 -0.116906 -0.059075 -0.051323 -0.003439  0.018687   
12 -0.011629

## 2. Host embeddings for free on the Hugging Face Hub


In [46]:
%%capture
!pip install huggingface-hub

In [47]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate thro

In [48]:
#!huggingface-cli repo create embedded_faqs_medicare --type dataset

In [49]:
# This command installs git-lfs; it is already installed on Colab instances.
#!git lfs install

Updated git hooks.
Git LFS initialized.


In [50]:
!git clone https://huggingface.co/datasets/Uiin/embedded_faqs_medicare

Cloning into 'embedded_faqs_medicare'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 808 bytes | 808.00 KiB/s, done.


In [51]:
embeddings.to_csv("embedded_faqs_medicare/embeddings.csv", index=False)
print(embeddings.shape)

(13, 384)


Change into the cloned repository directory, `embedded_faqs_medicare`.

In [52]:
%cd embedded_faqs_medicare/

/content/embedded_faqs_medicare/embedded_faqs_medicare/embedded_faqs_medicare


In [53]:
!git lfs track *.csv
!git add .gitattributes
!git add embeddings.csv

Tracking "embeddings.csv"


In [54]:
!git config --global user.email "you@example.com"
!git config --global user.name "Uiin"

In [72]:
!git commit -m "Initial commit"
# Authenticate with the access token held in `hf_token`; never embed an account password in the URL.
!git push https://Uiin:{hf_token}@huggingface.co/datasets/Uiin/embedded_faqs_medicare

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
Uploading LFS objects: 100% (1/1), 106 KB | 0 B/s, done.
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 512 bytes | 512.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0)
To https://huggingface.co/datasets/Uiin/embedded_faqs_medicare
   170ac70..d67f75a  main -> main


## 3. Get the most similar Frequently Asked Questions to a query


In [56]:
%%capture
!pip install datasets

In [73]:
import torch
from datasets import load_dataset

faqs_embeddings = load_dataset('Uiin/embedded_faqs_medicare')
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

Downloading and preparing dataset csv/Uiin--embedded_faqs_medicare to /root/.cache/huggingface/datasets/Uiin___csv/Uiin--embedded_faqs_medicare-3d4643c1840d72d2/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/Uiin___csv/Uiin--embedded_faqs_medicare-3d4643c1840d72d2/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.

In [74]:
question = ["How can Medicare help me?"]
output = query(question)

In [75]:
query_embeddings = torch.FloatTensor(output)
# FloatTensor is a 32-bit floating-point tensor type, used here so the embeddings can be used in computations.
print(f"The size of our embedded dataset is {dataset_embeddings.shape} and of our embedded query is {query_embeddings.shape}.")

The size of our embedded dataset is torch.Size([13, 384]) and of our embedded query is torch.Size([1, 384]).


In [76]:
%%capture
!pip install -U sentence-transformers

In [77]:
from sentence_transformers.util import semantic_search

hits = semantic_search(query_embeddings, dataset_embeddings, top_k=5)

In [78]:
[texts[hits[0][i]['corpus_id']] for i in range(len(hits[0]))]

['How can I get help with my Medicare Part A and Part B premiums?',
 'What is Medicare and who can get it?',
 'How do I sign up for Medicare?',
 'What are the different parts of Medicare?',
 'Will my Medicare premiums be higher because of my higher income?']
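For intuition about the ranking above, the following pure-Python sketch scores a query vector against a small corpus by cosine similarity and keeps the best `top_k` matches. `semantic_search` uses cosine similarity by default, but this is not its implementation: the helper names `cosine` and `top_k_cosine` and the two-dimensional toy vectors are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_cosine(query_vec, corpus, top_k=5):
    """Hypothetical helper: rank corpus vectors by similarity to the query."""
    scored = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus)
    ]
    # Highest similarity first, then keep only the best `top_k` entries.
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings"; the real ones in this notebook have 384 dimensions.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k_cosine([1.0, 0.1], toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])
```

Each hit carries a `corpus_id` and a `score`, mirroring the keys read from `hits[0]` in the cell above, so the index can be mapped back to the original `texts` list in the same way.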