## Text sentiment

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

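# Load the model and tokenizer.
# Note: "bert-base-uncased" ships without a fine-tuned classification head, so the
# classifier weights are randomly initialized (see the warning in the output) and the
# predicted class is not meaningful until the model is fine-tuned on a sentiment task.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)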

# Sample text for classification
text = "I love this product!"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Get the model's prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get the predicted class (argmax over the label dimension)
predicted_class = logits.argmax(dim=-1).item()
print(f"Predicted class: {predicted_class}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Predicted class: 1
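
The argmax above returns only a class index. To see how confident the classifier is, the logits can be turned into probabilities with a softmax. The following is a minimal sketch in plain Python; the logit values are hypothetical (they are not taken from the run above), and no meaning is assumed for either class index.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# ASSUMPTION: hypothetical logits for a two-class head, for illustration only
logits = [-0.3, 0.9]

probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(f"Probabilities: {probs}")
print(f"Predicted class: {predicted_class}")
```

With the tensors from the cell above, `torch.softmax(logits, dim=-1)` should give the same kind of result directly on the model output.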


## Named entity recognition

In [3]:
from transformers import BertTokenizer, BertForTokenClassification
import torch

# Load the model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)

# Sample text for NER
text = "Hugging Face is based in New York City."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Get the model's predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get the predicted labels
predicted_ids = logits.argmax(dim=2)

# Decode the tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Pair each token with its predicted label id
# (model.config.id2label maps these ids to label names)
predictions = [(token, predicted_id.item()) for token, predicted_id in zip(tokens, predicted_ids[0])]

print(predictions)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[('[CLS]', 0), ('Hu', 6), ('##gging', 6), ('Face', 6), ('is', 0), ('based', 0), ('in', 0), ('New', 8), ('York', 8), ('City', 8), ('.', 0), ('[SEP]', 0)]
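
The printed list pairs WordPiece tokens with numeric label ids, which is hard to read. The sketch below merges `##` continuation pieces back into words and groups consecutive words of the same type into entities. The `predictions` list is copied from the output above. The `id2label` dictionary and the `group_entities` helper are assumptions made for illustration; in the notebook, the mapping should be read from `model.config.id2label` rather than hard-coded.

```python
# Predictions copied from the output above: (wordpiece token, label id)
predictions = [('[CLS]', 0), ('Hu', 6), ('##gging', 6), ('Face', 6), ('is', 0),
               ('based', 0), ('in', 0), ('New', 8), ('York', 8), ('City', 8),
               ('.', 0), ('[SEP]', 0)]

# ASSUMPTION: illustrative id -> label map covering only the ids seen above.
# In practice, use model.config.id2label from the loaded checkpoint.
id2label = {0: "O", 6: "I-ORG", 8: "I-LOC"}

def group_entities(predictions, id2label, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Hypothetical helper: merge wordpieces, then group adjacent same-type words."""
    # Step 1: rebuild whole words; a word keeps the label of its first piece
    words = []
    for token, label_id in predictions:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            words[-1][0] += token[2:]
        else:
            words.append([token, label_id])

    # Step 2: join consecutive words that share an entity type
    entities = []
    prev_type = None
    for word, label_id in words:
        label = id2label.get(label_id, "O")
        if label == "O":
            prev_type = None
            continue
        entity_type = label.split("-")[-1]  # drop the B-/I- prefix
        if entity_type == prev_type:
            entities[-1] = (entities[-1][0] + " " + word, entity_type)
        else:
            entities.append((word, entity_type))
        prev_type = entity_type
    return entities

entities = group_entities(predictions, id2label)
print(entities)  # [('Hugging Face', 'ORG'), ('New York City', 'LOC')]
```

This simple grouping cannot separate two adjacent entities of the same type; the `B-` prefix would be needed for that.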


## Question answering

In [4]:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# Load the model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Sample context and question
context = "Hugging Face is creating a tool that democratizes AI."
question = "What is Hugging Face creating?"

# Tokenize the input
inputs = tokenizer(question, context, return_tensors="pt", truncation=True)

# Get the model's predictions
with torch.no_grad():
    outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

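# Get the start and end positions of the answer.
# Note: each argmax is taken independently, which assumes the end does not precede the start.
start_index = start_logits.argmax().item()
end_index = end_logits.argmax().item() + 1  # +1 so the slice includes the end token

# Decode the answer (the model is uncased, so the answer comes back in lowercase)
answer_ids = inputs["input_ids"][0][start_index:end_index]
answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
answer = tokenizer.convert_tokens_to_string(answer_tokens)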
print(f"Answer: {answer}")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Answer: a tool that democratizes ai
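
Taking the start and end argmax independently works for this example, but it can produce an end position that comes before the start, which yields an empty answer. One common remedy is to score every valid span jointly and keep the best one. The sketch below shows the idea in plain Python; `best_span`, `max_answer_len`, and the logit values are hypothetical and chosen so that the independent argmax fails.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end."""
    best = (None, None, float("-inf"))
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best

# ASSUMPTION: hypothetical logits where the naive approach picks end < start
start_logits = [0.1, 0.2, 0.3, 5.0, 0.4]
end_logits = [0.2, 6.0, 0.1, 0.3, 4.0]

naive_start = start_logits.index(max(start_logits))  # 3
naive_end = end_logits.index(max(end_logits))        # 1, which precedes the start

start, end, score = best_span(start_logits, end_logits)
print(f"Naive span: ({naive_start}, {naive_end})")
print(f"Best valid span: ({start}, {end}) with score {score}")
```

With the tensors from the cell above, the inputs would be `start_logits[0].tolist()` and `end_logits[0].tolist()`, and the answer tokens would be sliced as `[start:end + 1]`.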


## Text similarity

In [5]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load the model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

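def get_embedding(text):
    """Return one sentence vector: the mean of BERT's final-layer token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over the token dimension (includes [CLS] and [SEP]); shape (1, hidden_size)
    return outputs.last_hidden_state.mean(dim=1)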

# Sample texts
text1 = "I love programming."
text2 = "Coding is my passion."

# Get embeddings
embedding1 = get_embedding(text1)
embedding2 = get_embedding(text2)

# Calculate cosine similarity between the two sentence vectors
vec1 = embedding1.numpy()[0]
vec2 = embedding2.numpy()[0]
cosine_similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(f"Cosine Similarity: {cosine_similarity}")

Cosine Similarity: 0.808419942855835
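
`get_embedding` encodes one text at a time, so a plain mean over tokens is fine. If several texts of different lengths are encoded in one batch, the shorter ones are padded, and the padding positions should be excluded from the mean. The sketch below illustrates mask-aware mean pooling with NumPy on a tiny hand-made array; `masked_mean_pool`, the `hidden` values, and the `mask` are assumptions for illustration, standing in for `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring positions where mask == 0."""
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# ASSUMPTION: toy "hidden states" with hidden size 2; the third token of the
# first sequence is padding and carries values that should not affect the mean
hidden = np.array([
    [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
similarity = cosine_similarity(pooled[0], pooled[1])
print(pooled)
print(f"Cosine Similarity: {similarity}")
```

Without the mask, the padding vector `[9.0, 9.0]` would pull the first sequence's mean away from `[0.5, 0.5]`.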
