Using Hugging Face tools and pretrained models

In [19]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import pipeline

original_sentence = "After your workout, remember to focus on maintaining a good water balance."
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

masked_sentence = f"After your workout, remember to focus on maintaining a good {tokenizer.mask_token} balance."

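# Tokenize the masked sentence and run the model without tracking gradients
input_ids = tokenizer.encode(masked_sentence, return_tensors="pt")
with torch.no_grad():
    result = model(input_ids=input_ids)

# Position of the mask token along the sequence dimension of input_ids
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
# Highest-scoring vocabulary id at the mask position, decoded back to text
predicted_token = tokenizer.decode(result.logits[0, mask_token_index].argmax(dim=-1))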

print(f"Original Sentence: {original_sentence}")
print(f"Masked Sentence: {masked_sentence}")
print(f"Predicted Token: {predicted_token}")
print("\nExtended Sentences:")


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Original Sentence: After your workout, remember to focus on maintaining a good water balance.
Masked Sentence: After your workout, remember to focus on maintaining a good <mask> balance.
Predicted Token:  body

Extended Sentences:
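
The cell above keeps only the single highest-scoring token at the mask position. As a minimal sketch of how several ranked candidates could be read from the same kind of logits, the snippet below applies a softmax and sorts the result. The vocabulary and logit values are hypothetical toy numbers chosen for illustration, not output from distilroberta-base, and `top_k_from_logits` is an illustrative helper rather than a library function.

```python
import math


def top_k_from_logits(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs."""
    # Numerically stable softmax: subtract the max before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort tokens by probability, highest first
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy vocabulary and logits, for illustration only
toy_vocab = [" body", " water", " energy", " work"]
toy_logits = [3.0, 2.5, 1.0, 0.5]

top = top_k_from_logits(toy_logits, toy_vocab, k=2)
for token, prob in top:
    print(f"{token!r}: {prob:.3f}")
```

With the real model, the tensor equivalent would likely be a softmax over `result.logits[0, mask_token_index]` followed by `torch.topk`, with each returned id passed through `tokenizer.decode`.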


In [20]:
from sentence_transformers import SentenceTransformer, util
# Using a masked LM via the fill-mask pipeline
pipe = pipeline("fill-mask", model="distilroberta-base")
filled_sequence = pipe("Remember to <mask> enough water to restore and maintain your body's hydration after your cardio training.")
print(filled_sequence)

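# Using sentence-transformers for validation
# NOTE: this model ID is not available on the Hugging Face Hub, so loading it fails
# with the RepositoryNotFoundError shown in the cell output. A published
# DistilRoBERTa-based checkpoint such as 'sentence-transformers/all-distilroberta-v1'
# is one possible substitute (an assumption; it was not used in this run).
model_st = SentenceTransformer('distilroberta-base-nli-stsb-mean-tokens')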
sentence1 = "After your workout, remember to focus on maintaining a good water balance."
sentence2 = "Remember to drink enough water to restore and maintain your body's hydration after your cardio training."
embeddings1 = model_st.encode(sentence1, convert_to_tensor=True)
embeddings2 = model_st.encode(sentence2, convert_to_tensor=True)
cosine_similarity = util.pytorch_cos_sim(embeddings1, embeddings2)
print(cosine_similarity.item())

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.8579320907592773, 'token': 4076, 'token_str': ' drink', 'sequence': "Remember to drink enough water to restore and maintain your body's hydration after your cardio training."}, {'score': 0.017472777515649796, 'token': 18533, 'token_str': ' boil', 'sequence': "Remember to boil enough water to restore and maintain your body's hydration after your cardio training."}, {'score': 0.017121493816375732, 'token': 14623, 'token_str': ' consume', 'sequence': "Remember to consume enough water to restore and maintain your body's hydration after your cardio training."}, {'score': 0.013029470108449459, 'token': 304, 'token_str': ' use', 'sequence': "Remember to use enough water to restore and maintain your body's hydration after your cardio training."}, {'score': 0.011379820294678211, 'token': 185, 'token_str': ' take', 'sequence': "Remember to take enough water to restore and maintain your body's hydration after your cardio training."}]


RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65a3dbf1-15a8b127241c327441082658;f7535df5-a842-4ac3-b713-cd42c47a55a3)

Repository Not Found for url: https://huggingface.co/api/models/sentence-transformers/distilroberta-base-nli-stsb-mean-tokens.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
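
Because the model failed to load, the cell never reached the similarity step, so no score was printed. As a rough illustration of what `util.pytorch_cos_sim` is meant to compute for the two sentence embeddings, here is a minimal sketch of cosine similarity in plain Python. The vectors are hypothetical toy values, not real embeddings, and `cosine_similarity` is an illustrative helper rather than part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, for illustration only
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing the same way score close to 1 and orthogonal vectors score close to 0, so two paraphrases of the same advice would be expected to land nearer the top of that range once a working embedding model is loaded.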