# Lab 11: Transformer Architectures
## Libraries

In [1]:
import warnings 

warnings.filterwarnings('ignore')

import torch
from transformers import DistilBertTokenizer
from transformers import DistilBertForQuestionAnswering

---
## Context-based question answering

In [2]:
# Load the pretrained QA model and tokenizer
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

# Answer a question by extracting the most likely span from the context
def answer_question(question, context):
    inputs = tokenizer(question, context, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Most likely start and end positions of the answer (the end index is exclusive)
    answer_start = torch.argmax(answer_start_scores).item()
    answer_end = torch.argmax(answer_end_scores).item() + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    return answer

# Example context and question
context = "The Apollo program was a series of space missions undertaken by NASA between 1961 and 1972. Its goal was to land humans on the Moon and ensure their safe return to Earth."
question = "What was the goal of the Apollo program?"

# Answer
answer = answer_question(question, context)
print("Answer:", answer)


Answer: to land humans on the moon and ensure their safe return to earth
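
Taking the two argmax positions independently, as `answer_question` does, can in principle yield an end position that lies before the start, which produces an empty answer. Below is a minimal sketch of a constrained alternative that searches for the best valid span. It uses plain Python lists in place of the model's logits, and `best_span` and `max_answer_len` are hypothetical names, not part of the Transformers API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len. `end` is inclusive."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy logits: the independent argmaxes would give start=3, end=1 (invalid).
start_logits = [0.1, 0.2, 0.3, 2.0, 0.0]
end_logits = [0.0, 1.5, 0.1, 0.2, 1.0]
print(best_span(start_logits, end_logits))  # (3, 4)
```

With real model outputs, one could pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` and then slice `input_ids[start:end + 1]`. This sketch does not exclude positions inside the question, which a fuller implementation would likely do.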


---
## Sentiment analysis
Is the sentiment of a sentence positive or negative?  
In this example, a DistilBERT model is fine-tuned on the IMDB dataset.

In [3]:
import torch
from transformers import Trainer
from transformers import TrainingArguments
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification
from datasets import load_dataset

# Load and preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

dataset = load_dataset("imdb")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Load the model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

# Create the Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"]
)

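# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()

# Save the model and tokenizer
model.save_pretrained("./distilbert-imdb")
tokenizer.save_pretrained("./distilbert-imdb")

# Predict the sentiment of a single text
def predict_sentiment(text):
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    # Move the inputs to the model's device (it may be on a GPU after training)
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Index 1 is the positive class in the IMDB labels
    return "Positive" if predictions[0][1] > 0.5 else "Negative"

# Example prediction
print(predict_sentiment("This movie was fantastic!"))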

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

{'loss': 0.3188, 'learning_rate': 1.7867349114949884e-05, 'epoch': 0.32}
{'loss': 0.2443, 'learning_rate': 1.5734698229899766e-05, 'epoch': 0.64}
{'loss': 0.2289, 'learning_rate': 1.3602047344849649e-05, 'epoch': 0.96}
{'loss': 0.1702, 'learning_rate': 1.1469396459799531e-05, 'epoch': 1.28}
{'loss': 0.147, 'learning_rate': 9.336745574749414e-06, 'epoch': 1.6}
{'loss': 0.1403, 'learning_rate': 7.204094689699297e-06, 'epoch': 1.92}
{'loss': 0.1018, 'learning_rate': 5.07144380464918e-06, 'epoch': 2.24}
{'loss': 0.0807, 'learning_rate': 2.9387929195990615e-06, 'epoch': 2.56}
{'loss': 0.0798, 'learning_rate': 8.061420345489445e-07, 'epoch': 2.88}
{'train_runtime': 42283.4916, 'train_samples_per_second': 1.774, 'train_steps_per_second': 0.111, 'train_loss': 0.16447305547120755, 'epoch': 3.0}
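
The logged learning rates are consistent with a linear decay from the initial `learning_rate=2e-5` to zero without warmup, which is the `Trainer` default schedule. The sketch below reproduces the logged values under that assumption. `linear_lr` is a hypothetical helper, and the 500-step logging interval is inferred from the logged epoch values rather than stated in the notebook.

```python
import math


def linear_lr(step, total_steps, base_lr):
    """Learning rate after `step` optimizer steps of a linear decay to zero."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# 25,000 training reviews, batch size 16, 3 epochs (a single device is assumed).
steps_per_epoch = math.ceil(25000 / 16)  # 1563
total_steps = 3 * steps_per_epoch        # 4689

for step in (500, 1000, 4500):
    print(step, linear_lr(step, total_steps, 2e-5))
```

Step 500 gives about 1.7867e-05 and step 4500 about 8.06e-07, which agree with the first and last logged entries.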


Positive
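
For a two-class head, the test `predictions[0][1] > 0.5` in `predict_sentiment` is equivalent to checking which of the two logits is larger, because softmax preserves the ordering of the logits. A small sketch with made-up logits (the numbers are illustrative, not actual model outputs):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # hypothetical [negative, positive] scores
probs = softmax(logits)
label = "Positive" if probs[1] > 0.5 else "Negative"
print(probs, label)
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow for large logits.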


In [4]:
print(predict_sentiment("I haven't felt so low in years."))

Negative


In [5]:
print(predict_sentiment("I think I am getting to know you better every day"))

Positive


In [7]:
print(predict_sentiment("I have been in a car crash today"))

Negative


In [8]:
print(predict_sentiment("This day could have gone better"))

Negative
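
Spot checks on individual sentences do not quantify model quality, and `trainer.evaluate()` reports the loss and runtime statistics but no accuracy unless a `compute_metrics` function is passed to `Trainer`. A minimal sketch of such a function follows. It assumes the argument unpacks into `(logits, labels)`, as `EvalPrediction` does, and it uses plain Python sequences so that it runs without extra dependencies.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), taking the argmax over each row of logits."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        row = list(row)
        predicted = row.index(max(row))
        correct += int(predicted == label)
        total += 1
    return {"accuracy": correct / total}


# Toy check with three examples, two of them classified correctly.
toy_logits = [[2.0, -1.0], [0.3, 0.9], [1.1, 0.2]]
toy_labels = [0, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`. With NumPy arrays, `np.argmax(logits, axis=-1)` is the more idiomatic way to obtain the predictions.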


## Loading the model and tokenizer

In [None]:
model_path = './distilbert-imdb/'
# Load the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
# Load the model
model = DistilBertForSequenceClassification.from_pretrained(model_path)

---
## Text generation with a transformer model

In [17]:
# Import libraries
from transformers import Trainer
from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel
from transformers import TrainingArguments
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset

# Pretrained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Load the dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Only a small part of the dataset is used
n_samples = 100000
small_dataset = dataset.select(range(min(len(dataset), n_samples)))  # the sample count can be adjusted freely

def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)

small_dataset = small_dataset.map(encode, batched=True)

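# Split into training and validation sets; shuffle once so the two subsets cannot overlap
shuffled_dataset = small_dataset.shuffle(seed=42)  # fixed seed for reproducibility
small_train_dataset = shuffled_dataset.select(range(800))  # 80% training
small_eval_dataset = shuffled_dataset.select(range(800, 1000))  # 20% validation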

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False)

# Fine-tune the model
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

trainer.train()


{'loss': 3.6394, 'learning_rate': 2.916666666666667e-05, 'epoch': 1.25}
{'loss': 2.8596, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.5}
{'train_runtime': 4012.6991, 'train_samples_per_second': 0.598, 'train_steps_per_second': 0.299, 'train_loss': 3.1412313334147135, 'epoch': 3.0}


TrainOutput(global_step=1200, training_loss=3.1412313334147135, metrics={'train_runtime': 4012.6991, 'train_samples_per_second': 0.598, 'train_steps_per_second': 0.299, 'train_loss': 3.1412313334147135, 'epoch': 3.0})
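
A language-model loss is easier to interpret as perplexity, the exponential of the mean per-token cross-entropy. The sketch below converts the logged losses. These are training losses, so the result is not a held-out perplexity, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)


logged = [
    ("epoch 1.25", 3.6394),
    ("epoch 2.5", 2.8596),
    ("run average", 3.1412313334147135),
]
for name, loss in logged:
    print(f"{name}: {perplexity(loss):.1f}")
```

The run average corresponds to a perplexity of roughly 23. Computing the same quantity from the loss on `small_eval_dataset` would give a fairer picture of the fine-tuned model.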

In [18]:
# Generate text
def gen_text(input_text, n_return_sequences=3):
    inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

    # Pass the attention mask and pad token id explicitly to avoid generation warnings
    sample_outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        max_length=50,
        top_k=50,
        top_p=0.95,
        num_return_sequences=n_return_sequences
    )

    for i, sample_output in enumerate(sample_outputs):
        print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

In [20]:
gen_text('This day has been')


0: This day has been known as'Golden Jubilee '. 
'The evening of October 21, 1817, on the eve of Christmas, in New York City, was celebrated by tens of thousands of people. 
  
 
1: This day has been remembered by people who heard it's story. The words " " and " " " " are said to represent words that are present in our bodies and elsewhere in our lives, to represent moments of thought ", to represent experiences
2: This day has been plagued by some tragic events, but one of the most serious is the death of a loved one. Two of the victims were involved in sex acts with minors. The first was in November, 2003, when a 15 @-@
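
The samples above were drawn with `top_k=50` and `top_p=0.95`: at each step the candidates are first cut to the 50 most likely tokens, then to the smallest set whose cumulative probability reaches 0.95, and the next token is sampled from what remains. Below is a simplified sketch of that filtering on a toy distribution. `filter_top_k_top_p` is a hypothetical helper, not the Transformers implementation, which operates on logits.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized cumulative probability reaches top_p.
    Returns {token: renormalized probability}."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    kept_total = sum(p for _, p in kept)
    return {token: p / kept_total for token, p in kept}


# Toy next-token distribution (made-up values).
toy = {"sunny": 0.5, "long": 0.3, "strange": 0.15, "purple": 0.04, "zebra": 0.01}
filtered = filter_top_k_top_p(toy, top_k=4, top_p=0.9)
print(filtered)
```

Here `top_k=4` drops "zebra" and `top_p=0.9` then drops "purple", so sampling is restricted to the three most likely tokens. Lower values of either parameter make the output more conservative, while higher values allow more varied text.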
