In [1]:
import torch
from transformers import AutoModel, AutoTokenizer, pipeline, set_seed



# Recovering Masked Tokens (Masked Language Modeling)

In [2]:
def fill_mask():
    print("--- EXERCISE 1: Masked Language Modeling ---")

    # Load the "fill-mask" pipeline
    mask_filler = pipeline("fill-mask", model="bert-base-uncased")

    # Input sentence; [MASK] marks the position the model has to fill in
    input_sentence = "Hanoi is the [MASK] of Vietnam."

    # Predict the 5 most likely tokens for the masked position
    predictions = mask_filler(input_sentence, top_k=5)

    print(f"Original sentence: {input_sentence}")
    for i, pred in enumerate(predictions, 1):
        print(f"{i}. Predicted token: '{pred['token_str']}' | Confidence: {pred['score']:.4f}")
        print(f"   -> Sentence: {pred['sequence']}")

In [3]:
fill_mask()

--- EXERCISE 1: Masked Language Modeling ---


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Original sentence: Hanoi is the [MASK] of Vietnam.
1. Predicted token: 'capital' | Confidence: 0.9991
   -> Sentence: hanoi is the capital of vietnam.
2. Predicted token: 'center' | Confidence: 0.0001
   -> Sentence: hanoi is the center of vietnam.
3. Predicted token: 'birthplace' | Confidence: 0.0001
   -> Sentence: hanoi is the birthplace of vietnam.
4. Predicted token: 'headquarters' | Confidence: 0.0001
   -> Sentence: hanoi is the headquarters of vietnam.
5. Predicted token: 'city' | Confidence: 0.0001
   -> Sentence: hanoi is the city of vietnam.
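
The `score` field reported for each prediction is a probability: the model produces one raw score (logit) per vocabulary entry at the `[MASK]` position, a softmax turns those scores into probabilities, and `top_k` keeps the highest ones. The sketch below illustrates that idea in plain Python. The vocabulary and logits are made-up toy values, not BERT's real outputs, and the helper names `softmax` and `top_k` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits for the [MASK] position (made-up numbers).
toy_vocab = ["capital", "center", "city", "river", "banana"]
toy_logits = [12.0, 3.0, 2.5, 0.5, -4.0]

for token, prob in top_k(toy_vocab, toy_logits, k=3):
    print(f"{token}: {prob:.4f}")
```

Because softmax is exponential in the logits, one clearly dominant logit absorbs almost all of the probability mass, which is the same pattern as the 0.9991 versus 0.0001 split in the output above.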


# Predicting the Next Token (Next Token Prediction)

In [4]:
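def text_generation():
    print("\n--- EXERCISE 2: Text Generation ---")

    # Load the "text-generation" pipeline with GPT-2
    generator = pipeline("text-generation", model="gpt2")

    prompt = "The best thing about learning NLP is"
    set_seed(42)  # fix the random seed so the sampled text is reproducible

    # Generate text.
    # Note: the warning in the output shows that max_new_tokens (=256) is also set
    # and takes precedence over max_length=50, so the result runs well past 50 tokens.
    # Pass max_new_tokens=... instead of max_length to actually cap the length.
    output = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)

    print(f"Prompt: '{prompt}'")
    print("-" * 30)
    print(output[0]['generated_text'])

In [5]:
text_generation()


--- EXERCISE 2: Text Generation ---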


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Prompt: 'The best thing about learning NLP is'
------------------------------
The best thing about learning NLP is that you learn to learn something, and you learn it through hard work and hard work. It's not like you learn all at once, but you learn at what is most important.

If you want to learn more, there are a lot of books out there that are written about NLP, and I think that's why they make sense. But at the same time, there are many books that have been written about NLP that people have done that have actually been successful.

JUAN GONZÁLEZ: Well, this is a little bit of an aside. On a personal note, I'm really curious to know what the people who have been on the frontlines of NLP are doing right now. I know there's an organization called the NLP Project, and they're making a lot of changes going forward, and I'm curious about how they're doing with the NLP Project.

So, the NLP Project is a group of people who are doing about two things right now. One is trying to get the w
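
To make the roles of `set_seed(42)` and `max_new_tokens` more concrete, here is a minimal sketch of an autoregressive sampling loop in plain Python. It is only an illustration under stated assumptions: `TOY_MODEL` is a hypothetical hand-written next-token table standing in for GPT-2, and `generate` is a hypothetical helper, not the Transformers API. The loop repeatedly samples one token from the distribution conditioned on the last token, stops at an end-of-sequence marker, and never produces more than `max_new_tokens` tokens.

```python
import random

# Hypothetical next-token table: each token maps to (candidates, weights).
TOY_MODEL = {
    "<start>": (["nlp", "learning"], [0.5, 0.5]),
    "nlp": (["is", "research"], [0.7, 0.3]),
    "learning": (["nlp", "is"], [0.6, 0.4]),
    "is": (["fun", "hard", "<eos>"], [0.5, 0.3, 0.2]),
    "research": (["is", "<eos>"], [0.5, 0.5]),
    "fun": (["<eos>"], [1.0]),
    "hard": (["<eos>"], [1.0]),
}


def generate(seed, max_new_tokens=10):
    """Sample tokens one at a time until <eos> or the token budget is reached."""
    rng = random.Random(seed)  # a fixed seed makes the sampling reproducible
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        candidates, weights = TOY_MODEL[tokens[-1]]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker


print(generate(seed=42))
print(generate(seed=42))                    # same seed, same sequence
print(generate(seed=42, max_new_tokens=2))  # at most 2 generated tokens
```

The budget here counts only newly generated tokens, which mirrors the meaning of `max_new_tokens`; `max_length`, by contrast, bounds the prompt and the generated tokens together.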

# Computing a Sentence Vector (Sentence Representation)

In [6]:
def sentence_embedding():
    print("\n--- EXERCISE 3: Sentence Representation (Mean Pooling) ---")

    # Choose the model and its tokenizer
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Input sentence
    sentences = ["This is a sample sentence"]
    # Tokenize
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # Shape is [1, 7]: 1 sentence, 7 tokens including [CLS] and [SEP]
    print("Input IDs shape:", inputs['input_ids'].shape)

    # Forward pass; no gradients are needed for inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Shape: (batch_size, sequence_length, hidden_size) -> (1, 7, 768)
    last_hidden_state = outputs.last_hidden_state

    # Mean pooling:
    # average the token vectors while ignoring padding tokens
    attention_mask = inputs['attention_mask']

    # Expand the mask so that it matches the shape of the hidden states
    mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()

    # Sum the token vectors (only tokens with mask = 1 contribute)
    sum_embeddings = torch.sum(last_hidden_state * mask_expanded, 1)

    # Count the real tokens (clamp avoids division by zero)
    sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)

    # Divide to get the mean (named `embedding` to avoid shadowing the function)
    embedding = sum_embeddings / sum_mask

    print("Sentence embedding shape:", embedding.shape)
    print("First 5 values of the vector:", embedding[0][:5])

In [7]:
sentence_embedding()


--- EXERCISE 3: Sentence Representation (Mean Pooling) ---
Input IDs shape: torch.Size([1, 7])
Sentence embedding shape: torch.Size([1, 768])
First 5 values of the vector: tensor([-0.2424, -0.3832, -0.0138, -0.2991, -0.2145])
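
With a single sentence there is no padding, so the attention mask has no visible effect in the run above. The following plain-Python sketch shows why the mask matters once a sentence is padded. It is an illustration only: the 2-dimensional "hidden states" are made-up numbers, and `masked_mean_pool` is a hypothetical helper that mirrors the tensor operations on ordinary lists.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padding (mask 0)."""
    hidden_size = len(token_vectors[0])
    summed = [0.0] * hidden_size
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    count = max(count, 1e-9)  # same role as torch.clamp(..., min=1e-9)
    return [s / count for s in summed]


# Hypothetical hidden states for one sentence padded to 4 positions.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 1, 0]

pooled = masked_mean_pool(hidden, mask)
naive = [sum(col) / len(hidden) for col in zip(*hidden)]

print(pooled)  # [3.0, 4.0]    -> padding ignored
print(naive)   # [27.25, 28.0] -> padding distorts the average
```

The unmasked average is pulled toward the padding row, while the masked version averages only the three real tokens.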
