<a href="https://colab.research.google.com/github/Vonewman/Hugginface-course/blob/main/Embedding_as_a_service.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started With Embeddings

## 1. Embedding a dataset

In [1]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "<YOUR_HF_TOKEN>"  # placeholder: paste your own access token; never publish a real one
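
Hard-coding an access token in a shared notebook exposes it to anyone who can read the file. A minimal sketch of one alternative, assuming the token has been exported in an environment variable named `HF_TOKEN`; the helper `load_token` and its parameters are hypothetical names used only for illustration:

```python
import os


def load_token(env=None, name="HF_TOKEN"):
    """Return the access token stored under `name`, or raise if it is missing.

    `env` defaults to the process environment; passing a plain dict makes the
    helper easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable to your access token.")
    return token


# Demonstration with a stand-in mapping and a dummy value (not a real token).
print(load_token({"HF_TOKEN": "hf_dummy_value"}))
```

In the notebook, `hf_token = load_token()` could then take the place of the literal assignment.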

The first time you request embeddings, the API may take a while (approximately 20 seconds) to respond, because the model has to be downloaded and loaded on the server; subsequent calls are much faster. We use the `retry` decorator (install it with `pip install retry`) so that if `output = query(texts)` fails, the call waits 10 seconds and tries again, for up to three attempts in total.

In [2]:
%%capture
!pip install retry

In [3]:
import requests
from retry import retry

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

In [5]:
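@retry(tries=3, delay=10)
def query(texts):
    # Send the texts to the feature-extraction endpoint.
    response = requests.post(api_url, headers=headers, json={"inputs": texts})
    result = response.json()
    # A successful call returns a list with one embedding per input text.
    if isinstance(result, list):
        return result
    # Anything else is an error payload (typically "model is loading").
    # Raising lets @retry wait and call the function again.
    raise RuntimeError(
        f"The API did not return embeddings (the model may still be loading): {result}"
    )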
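
The retry behaviour described above can be illustrated without any network access. The following is a minimal sketch, not the `retry` package's implementation: `retry_call`, `flaky_query`, and the zero-second delay are hypothetical stand-ins, and `flaky_query` simulates a model that is still loading during the first two calls.

```python
import time


def retry_call(func, tries=3, delay=10):
    """Call `func` up to `tries` times, sleeping `delay` seconds between failures.

    Only RuntimeError is retried here, to keep the sketch narrow.
    """
    for attempt in range(1, tries + 1):
        try:
            return func()
        except RuntimeError:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            time.sleep(delay)


calls = {"count": 0}


def flaky_query():
    """Stand-in for `query`: fails twice, then returns a fake embedding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("The model is currently loading.")
    return [[0.1, 0.2, 0.3]]


# delay=0 keeps the demonstration instant; the notebook waits 10 seconds.
result = retry_call(flaky_query, tries=3, delay=0)
print(result, "after", calls["count"], "calls")
```

With `tries=3`, the third call succeeds and its result is returned; with `tries=2`, the second failure would be re-raised to the caller.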

In [6]:
texts = ["How do I get a replacement Medicare card?",
        "What is the monthly premium for Medicare Part B?",
        "How do I terminate my Medicare Part B (medical insurance)?",
        "How do I sign up for Medicare?",
        "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
        "How do I sign up for Medicare Part B if I already have Part A?",
        "What are Medicare late enrollment penalties?",
        "What is Medicare and who can get it?",
        "How can I get help with my Medicare Part A and Part B premiums?",
        "What are the different parts of Medicare?",
        "Will my Medicare premiums be higher because of my higher income?",
        "What is TRICARE ?",
        "Should I sign up for Medicare Part B if I have Veterans' Benefits?"]

output = query(texts)

In [7]:
import pandas as pd

embeddings = pd.DataFrame(output)

In [8]:
embeddings

index,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.023889,0.055259,-0.011655,-0.033414,-0.012261,-0.024873,-0.012663,0.025346,0.018508,-0.083508,...,-0.161688,-0.046426,0.006004,0.005281,-0.003342,0.027754,0.020411,0.005778,0.034098,-0.006889
1,-0.012688,0.046874,-0.010502,-0.020384,-0.013361,0.042322,0.016628,-0.004099,-0.002607,-0.010188,...,-0.061594,-0.020717,-0.009082,-0.02926,-0.066253,0.065257,0.013229,-0.023103,-0.002785,0.010474
2,0.000494,0.119412,0.005229,-0.092734,0.007773,-0.005325,0.034506,-0.051981,-0.006265,-0.006111,...,-0.108326,-0.049646,-0.073399,-0.029898,-0.102734,0.062121,0.034606,0.016877,-0.023861,0.005264
3,-0.029711,0.023298,-0.057041,-0.012183,-0.01371,0.029796,0.063739,0.001101,-0.045124,-0.040748,...,-0.117682,0.031924,0.000854,0.0202,-0.020666,-0.005167,0.03837,0.003617,0.033993,-0.010255
4,-0.025628,0.070389,-0.01738,-0.056567,0.028577,0.052823,0.067063,-0.052618,-0.054702,-0.11623,...,-0.118145,0.013343,-0.055188,-0.032723,0.008436,0.019169,0.048212,-0.040412,0.083346,0.026855
5,-0.022656,0.02116,0.005105,-0.046494,0.009074,0.041495,0.054268,-0.024185,-0.013483,-0.075966,...,-0.10011,0.01075,-0.031469,-0.004822,0.039657,0.026384,0.045514,0.059089,-0.017509,0.007166
6,-0.002911,0.060791,-0.009176,-0.006133,0.040492,0.036594,0.002054,-0.031345,0.031806,-0.023495,...,-0.028763,-0.060458,-0.018598,-0.040189,-0.031486,-0.018299,0.002286,-0.07342,0.016235,-0.000244
7,-0.080526,0.059888,-0.048847,-0.040176,-0.063342,0.041848,0.119045,0.010652,-0.030095,-0.004561,...,-0.144566,0.020404,0.023088,0.005077,-0.055645,-0.007675,0.050791,-0.005989,0.134562,0.034817
8,-0.034388,0.072501,0.01444,-0.036695,0.014019,0.06307,0.034683,-0.014531,-0.059862,-0.045383,...,-0.114763,-0.035894,-0.019877,-0.033375,-0.030168,0.039412,0.044993,0.000578,-0.025124,0.034191
9,-0.005964,0.025044,-0.003182,-0.025243,-0.039823,-0.012772,0.044713,0.014535,-0.038213,-0.041149,...,-0.057621,0.021594,0.048983,-0.044541,-0.030137,0.006779,0.054854,0.029937,0.070214,0.041565


## 2. Host embeddings for free on the Hugging Face Hub

In [9]:
%%capture
!pip install huggingface-hub

In [10]:
!huggingface-cli login


To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens.

Token: 
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [11]:
embeddings.shape

(13, 384)
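
To host the embeddings on the Hub as a dataset, they first need to be written to a file. A minimal sketch, assuming CSV as the storage format: the small random frame stands in for the real 13 × 384 `embeddings`, and an in-memory buffer stands in for a file on disk (a path such as `embeddings.csv` would be a hypothetical name).

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the real `embeddings` DataFrame (13 texts, 384 dimensions).
rng = np.random.default_rng(0)
embeddings = pd.DataFrame(rng.normal(size=(13, 384)))

# index=False writes only the 384 embedding columns.
without_index = io.StringIO()
embeddings.to_csv(without_index, index=False)

# The default (index=True) adds the row index as an extra first column.
with_index = io.StringIO()
embeddings.to_csv(with_index)

# Read both buffers back to compare the shapes that a later load would see.
without_index.seek(0)
with_index.seek(0)
shape_without = pd.read_csv(without_index).shape
shape_with = pd.read_csv(with_index).shape
print(shape_without)
print(shape_with)
```

Keeping the index out of the file matters here: an extra column would make the stored vectors 385-dimensional and no longer comparable with a 384-dimensional query embedding. Replacing the buffer with a file path produces a file that can then be uploaded to a dataset repository on the Hub.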

In [12]:
!pip install datasets

In [13]:
import torch
from datasets import load_dataset

faqs_embeddings = load_dataset("vonewman/word-embeddings-dataset")
dataset_embeddings = torch.from_numpy(faqs_embeddings["train"].to_pandas().to_numpy()).to(torch.float)

In [14]:
question = ["How can Medicare help me?"]
output = query(question)

query_embeddings = torch.FloatTensor(output)

In [15]:
!pip install -U sentence-transformers 

In [16]:
from sentence_transformers.util import semantic_search

hits = semantic_search(query_embeddings, dataset_embeddings, top_k=5)

In [None]:
# hits[0] holds the matches for the first (and only) query, best match first.
print([texts[hit["corpus_id"]] for hit in hits[0]])
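
By default, `semantic_search` scores each corpus embedding against the query with cosine similarity and returns the `top_k` best matches as dictionaries with `corpus_id` and `score` keys. A minimal pure-Python sketch of that ranking, using made-up 3-dimensional vectors rather than real embeddings; `cosine` and `top_k_hits` are hypothetical helper names, and this is not the library's implementation:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_hits(query, corpus, top_k=5):
    """Rank corpus vectors by similarity to `query`, best first."""
    scored = [
        {"corpus_id": i, "score": cosine(query, vec)}
        for i, vec in enumerate(corpus)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy data: vector 2 points the same way as the query, vector 1 is orthogonal.
corpus = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [2.0, 0.0, 0.0]]
query = [1.0, 0.0, 0.0]

toy_hits = top_k_hits(query, corpus, top_k=2)
print([hit["corpus_id"] for hit in toy_hits])
```

Because cosine similarity ignores vector length, vector 2 ranks first even though it is twice as long as the query.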