<a href="https://colab.research.google.com/github/bodorcy/hazifeladatok/blob/main/ml_7_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Szöveggenerálás

A notebook runtime típusa GPU legyen!

In [1]:
!nvidia-smi

Wed Nov 12 14:21:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Magyar nyelvű LLM generárotok

https://juniper.nytud.hu/demo/puli

https://huggingface.co/NYTK

## Szöveggenerálás

Vegyük példaként a Huggingface-ről elérhető, előtanított GPT-2 modellt, töltsük be a hozzá tartozó tokenizálót is.

Célszerű az alábbiakhoz GPU-s runtime-ra váltani

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

tokenizer.pad_token = tokenizer.eos_token

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

model = model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Using device: cuda


A tokenizáló működése:

In [3]:
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

print(input_ids, "\n")

for id in input_ids[0]:
  print(id, tokenizer.decode(id, skip_special_tokens=True))

tensor([[7454, 2402,  257,  640]], device='cuda:0') 

tensor(7454, device='cuda:0') Once
tensor(2402, device='cuda:0')  upon
tensor(257, device='cuda:0')  a
tensor(640, device='cuda:0')  time


A generálás folyamata, és egy lépése:

In [4]:
output = model.generate(input_ids,
                        max_length=50,
                        num_return_sequences=1,
                        pad_token_id=tokenizer.pad_token_id)


generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a


In [5]:
input_text = "The spiderman was"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

import torch.nn.functional as F

with torch.no_grad():
    outputs = model(input_ids)

# logit-ok az utolsó kimeneti tokenre
last_token_logits = outputs.logits[0, -1, :]

# alkalmazzunk softmax-et, hogy valószínűségi értékeket kapjunk
probs = F.softmax(last_token_logits, dim=-1)

# az 5 legmagasabb valószínűségi értékkel rendelkező token
top_k_probs, top_k_indices = torch.topk(probs, 5)

top_k_tokens = [tokenizer.decode([token]) for token in top_k_indices]

for i, (token, prob) in enumerate(zip(top_k_tokens, top_k_probs)):
    print(f"Top {i+1} token: '{token}', probability:", round(prob.item(), 3), "-->", input_text + f"{token}")

Top 1 token: ' a', probability: 0.072 --> The spiderman was a
Top 2 token: ' able', probability: 0.024 --> The spiderman was able
Top 3 token: ' the', probability: 0.024 --> The spiderman was the
Top 4 token: ' also', probability: 0.023 --> The spiderman was also
Top 5 token: ' not', probability: 0.018 --> The spiderman was not


In [6]:
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

output = model.generate(input_ids,
                        max_length=50,
                        num_return_sequences=1,
                        pad_token_id=tokenizer.pad_token_id,
                        do_sample=True,
                        top_k=0,
                        temperature=0.5)


generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Once upon a time, the world was a land of peace, where people lived side by side, where they traded and lived together.

And yet, some people, like the children, were burned to death. And others, like the adults


## Generatív Fine-tuning

Szeretnénk a fenti modellt finomhangolni arra, hogy "rossz online értékeléseket" írjon. Ehhez egy online szöveges értékeléseket tartalmazó adatbázisból használjuk az 1 csillagos review-kat.

In [7]:
!pip install datasets



In [8]:
import pandas as pd
from datasets import Dataset

df = pd.read_parquet("hf://datasets/Yelp/yelp_review_full/yelp_review_full/train-00000-of-00001.parquet")
df

Unnamed: 0,label,text
0,4,dr. goldberg offers everything i look for in a...
1,1,"Unfortunately, the frustration of being Dr. Go..."
2,3,Been going to Dr. Goldberg for over 10 years. ...
3,3,Got a letter in the mail last week that said D...
4,0,I don't know what Dr. Goldberg was like before...
...,...,...
649995,4,I had a sprinkler that was gushing... pipe bro...
649996,0,Phone calls always go to voicemail and message...
649997,0,Looks like all of the good reviews have gone t...
649998,4,I was able to once again rely on Yelp to provi...


In [9]:
df = df[df.label < 1]
df = df[["text"]].iloc[:10_000]

dataset = Dataset.from_pandas(df)
dataset = dataset.remove_columns("__index_level_0__")

def tokenize_function(examples):
    encoding = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
    encoding['labels'] = encoding['input_ids'].copy()
    return encoding

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Az alábbi lépés, a finomhangolás folyamata GPU-n körülbelül 5 percet vesz igénybe.

In [10]:
import os
from transformers import Trainer, TrainingArguments

os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    logging_dir='./logs',
    logging_steps=50,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

trainer.train()

trainer.save_model('./results')

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
50,3.061
100,2.8654
150,2.7851
200,2.7885
250,2.8149
300,2.7687
350,2.7763
400,2.7849
450,2.788
500,2.782


Nézzük meg, hogy az eredeti és a finomhangolt modell mit generál, ha "The restaurant" szavakkal promptoljuk be őket.

In [11]:
input_text = "The restaurant was"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

model = AutoModelForCausalLM.from_pretrained('gpt2')
model = model.to(device)

pre_trained_output = model.generate(input_ids, max_length=50, pad_token_id=tokenizer.pad_token_id, do_sample=True, temperature=0.7)
pre_trained_text = tokenizer.decode(pre_trained_output[0], skip_special_tokens=True)
print("Pre-trained model output:")
print(pre_trained_text)



fine_tuned_model = AutoModelForCausalLM.from_pretrained('./results').to(device)

fine_tuned_output = fine_tuned_model.generate(input_ids, max_length=50, pad_token_id=tokenizer.pad_token_id, do_sample=True, temperature=0.7 )
fine_tuned_text = tokenizer.decode(fine_tuned_output[0], skip_special_tokens=True)
print("\nFine-tuned model output:")
print(fine_tuned_text)
#

Pre-trained model output:
The restaurant was just a couple of blocks from the office of President Trump, who was visiting his wife.

The employees at the restaurant were asked to leave with the president's family.

"We're not going to get into that,"

Fine-tuned model output:
The restaurant was absolutely horrible. I was disappointed to find that the chicken was so hot I had to wait for a few minutes before they could even eat my food. I was so disappointed that another Yelper in the restaurant was so rude that I was


## Zero-shot predikció

Az eddigi GPT2-es modell helyett használjuk a Llama 3.2 1 milliárd paraméteres változatát. Ennek a modellnek előnye, hogy hosszabb szövegekkel is megbírkózik, viszont sokkal tovább tartott volna fine-tuneing rajta.

Az előző részben az `AutoTokenizer` és az `AutoModelForCausalLM` osztályokat használtuk a nyelvi modellek betöltésére. Ezek lényegesen nagyobb szabadságot adnak, de csak ha szöveget akarunk generálni, akkor használhatjuk a magasabb szintű *text-generation* `pipeline`-t is.

Elsőre olvassunk be a modell instrukció tanítást nem tartalmazó változatát.

In [12]:
from transformers import pipeline

generator_non_instruct = pipeline('text-generation', model='unsloth/Llama-3.2-1B')

config.json:   0%|          | 0.00/889 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

Device set to use cuda:0


In [13]:
generator_non_instruct("What is the capital of Hungary?", max_new_tokens=20, do_sample=False)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


[{'generated_text': 'What is the capital of Hungary? What is the time zone in Hungary? What is the population of Hungary? What is the air temperature'}]

In [14]:
summarization_text = """Phishing is a form of social engineering and a scam
where attackers deceive people into revealing sensitive information or
installing malware such as viruses, worms, adware, or ransomware.
Phishing attacks have become increasingly sophisticated and often transparently
mirror the site being targeted, allowing the attacker to observe everything
while the victim navigates the site, and transverses any additional security
boundaries with the victim. As of 2020, it is the most common type of
cybercrime, with the Federal Bureau of Investigation's Internet Crime Complaint
Center reporting more incidents of phishing than any other type of cybercrime."
"""

generated_text = generator_non_instruct(
    f"Summarize the following text in 1 sentence: {summarization_text}",
    max_new_tokens=200,
    do_sample=False
)

print(generated_text[0]["generated_text"])

Summarize the following text in 1 sentence: Phishing is a form of social engineering and a scam
where attackers deceive people into revealing sensitive information or
installing malware such as viruses, worms, adware, or ransomware.
Phishing attacks have become increasingly sophisticated and often transparently
mirror the site being targeted, allowing the attacker to observe everything
while the victim navigates the site, and transverses any additional security
boundaries with the victim. As of 2020, it is the most common type of
cybercrime, with the Federal Bureau of Investigation's Internet Crime Complaint
Center reporting more incidents of phishing than any other type of cybercrime."
Phishing is a form of social engineering and a scam where attackers deceive people into revealing sensitive information or installing malware such as viruses, worms, adware, or ransomware. Phishing attacks have become increasingly sophisticated and often transparently mirror the site being targeted, all

Nézzük meg mennyire más válaszokat generál ugyanennek a modellnek az instrukció tanításon átesett változata.

In [15]:
generator = pipeline('text-generation', model='unsloth/Llama-3.2-1B-Instruct')

config.json:   0%|          | 0.00/894 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Device set to use cuda:0


In [16]:
generator("What is the capital of Hungary?", max_new_tokens=20, do_sample=False)

[{'generated_text': 'What is the capital of Hungary? Budapest?\nYes, that is correct. Budapest is the capital and largest city of Hungary. It is'}]

In [17]:
summarization_text = """"Phishing is a form of social engineering and a scam
where attackers deceive people into revealing sensitive information or
installing malware such as viruses, worms, adware, or ransomware.
Phishing attacks have become increasingly sophisticated and often transparently
mirror the site being targeted, allowing the attacker to observe everything
while the victim navigates the site, and transverses any additional security
boundaries with the victim. As of 2020, it is the most common type of
cybercrime, with the Federal Bureau of Investigation's Internet Crime Complaint
Center reporting more incidents of phishing than any other type of cybercrime."

Summary:
"""

generated_text = generator(
    f"Summarize the following text in one sentence: {summarization_text}",
    max_new_tokens=200,
    do_sample=False
)

print(generated_text[0]["generated_text"])

Summarize the following text in one sentence: "Phishing is a form of social engineering and a scam
where attackers deceive people into revealing sensitive information or
installing malware such as viruses, worms, adware, or ransomware.
Phishing attacks have become increasingly sophisticated and often transparently
mirror the site being targeted, allowing the attacker to observe everything
while the victim navigates the site, and transverses any additional security
boundaries with the victim. As of 2020, it is the most common type of
cybercrime, with the Federal Bureau of Investigation's Internet Crime Complaint
Center reporting more incidents of phishing than any other type of cybercrime."

Summary:
Phishing is a type of social engineering scam where attackers deceive people into revealing sensitive information or installing malware, often mirroring the target site and allowing them to observe and exploit security boundaries.


### Phishing adatbázis

A feladatunk az lesz, hogy email-ek tárgya alapján próbáljuk meg eldönteni, hogy az üzenet adathalászatot tartalmazó szándékosan megtévesztő-e, phishing-e?

In [18]:
import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("naserabdullahalam/phishing-email-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/naserabdullahalam/phishing-email-dataset?dataset_version_number=1...


100%|██████████| 77.1M/77.1M [00:00<00:00, 103MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/naserabdullahalam/phishing-email-dataset/versions/1


In [19]:
data = pd.read_csv(path + "/SpamAssasin.csv")

In [20]:
data

Unnamed: 0,sender,receiver,date,subject,body,label,urls
0,Robert Elz <kre@munnari.OZ.AU>,Chris Garrigues <cwg-dated-1030377287.06fa6d@D...,"Thu, 22 Aug 2002 18:26:25 +0700",Re: New Sequences Window,"Date: Wed, 21 Aug 2002 10:54:46 -0500 ...",0,1
1,Steve Burt <Steve_Burt@cursor-system.com>,"""'zzzzteana@yahoogroups.com'"" <zzzzteana@yahoo...","Thu, 22 Aug 2002 12:46:18 +0100",[zzzzteana] RE: Alexander,"Martin A posted:\nTassos Papadopoulos, the Gre...",0,1
2,"""Tim Chapman"" <timc@2ubh.com>",zzzzteana <zzzzteana@yahoogroups.com>,"Thu, 22 Aug 2002 13:52:38 +0100",[zzzzteana] Moscow bomber,Man Threatens Explosion In Moscow \n\nThursday...,0,1
3,Monty Solomon <monty@roscom.com>,undisclosed-recipient: ;,"Thu, 22 Aug 2002 09:15:25 -0400",[IRR] Klez: The Virus That Won't Die,Klez: The Virus That Won't Die\n \nAlready the...,0,1
4,Stewart Smith <Stewart.Smith@ee.ed.ac.uk>,zzzzteana@yahoogroups.com,"Thu, 22 Aug 2002 14:38:22 +0100",Re: [zzzzteana] Nothing like mama used to make,"> in adding cream to spaghetti carbonara, whi...",0,1
...,...,...,...,...,...,...,...
5804,Professional_Career_Development_Institute@Frug...,yyyy@netnoteinc.com,"Tue, 3 Dec 2002 13:19:58 -0800",Busy? Home Study Makes Sense!,\n\n \n--- \n![](http://images.pcdi-homestud...,1,1
5805,"""IQ - TBA"" <tba@insiq.us>",<yyyy@spamassassin.taint.org>,"Tue, 3 Dec 2002 18:52:29 -0500",Preferred Non-Smoker Rates for Smokers,This is a multi-part message in MIME format. -...,1,1
5806,Mike <raye@yahoo.lv>,Mailing.List@user2.pro-ns.net,"Sun, 20 Jul 2003 16:19:44 +0800","How to get 10,000 FREE hits per day to any web...","Dear Subscriber,\n\nIf I could show you a way ...",1,1
5807,"""Mr. Clean"" <cweqx@dialix.oz.au>",<Undisclosed.Recipients@webnote.net>,"Wed, 05 Aug 2020 04:01:50 -1900",Cannabis Difference,****Mid-Summer Customer Appreciation SALE!****...,1,0


In [21]:
data.label.value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,4091
1,1718


A példákat összekeverjük.

In [22]:
data_shuffled = data.dropna().sample(frac=1, random_state=42)

Mivel a nagy nyelvi modellek sokáig futnak, ezért csak 100 példát hagyjunk kiértékelésre.

In [23]:
train = data_shuffled[:-100]

In [24]:
test = data_shuffled[-100:]

In [25]:
train_labels = train["label"]
test_labels = test["label"]

### Zero-shot a phishing adatnázison

In [26]:
prompt_text = """
Is the following email subject contains phishing?
Answer with yes or no only.

Email subject: {}
Answer:"""

In [27]:
def get_prediction(text):
    if 'yes' in text.lower():
        return 1
    elif 'no' in text.lower():
        return 0
    else:
         return -1

In [28]:
def eval(test_data, model, prompt_text):
    predictions = []
    labels = []

    for i in range(len(test_data)):
        text = test_data.iloc[i]['subject']
        prompt = prompt_text.format(text)
        result = model(prompt, max_new_tokens=2, return_full_text=False)
        generated_text = result[0]['generated_text']

        predictions.append(get_prediction(generated_text))
        labels.append(test_data.iloc[i]['label'])

        print('-----------------------------------------------------------------------------------------------------')
        print(text)
        print(f"\noutput: {generated_text.strip()}")
        print(f"predicted: {predictions[i]} label: {labels[i]}")

    return predictions, labels

In [29]:
predictions, labels = eval(test, generator, prompt_text)

-----------------------------------------------------------------------------------------------------
[SAdev] [Bug 779] rule broken: CORRUPT_MSGID

output: yes
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
Re: [ILUG] RH7.3 on Cobalt - the saga continues

output: YES
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
Vintage Music Archive

output: No
predicted: 0 label: 0
-----------------------------------------------------------------------------------------------------
Trading for a living (All you should know about FOREX)

output: Yes
predicted: 1 label: 1
-----------------------------------------------------------------------------------------------------
Custom Software Development Services Available Right Now..

output: No
predicted: 0 label: 1


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


-----------------------------------------------------------------------------------------------------
Re: [ILUG] 3c509 & 2.4.19 problems

output: Yes
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
find the bug

output: yes
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
Do you dream of the latest gadgets? (ZDNET SHOPPER)

output: YES
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
History of the tilde

output: No
predicted: 0 label: 0
-----------------------------------------------------------------------------------------------------
Hey wassup, Remember me ;)

output: Yes
predicted: 1 label: 1
-----------------------------------------------------------------------------------------------------
Re: whoa

output: yes
predicted: 1 label: 0
----------------------

In [30]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(labels, predictions)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.39


## Few-shot learning

A nyelvi modell döntését példákkal segítjük.



```
Examples:
Email subject: Complete Online Pharmacy with same day shipping
Answer: yes

Email subject: Re: building as non-root / newbie question
Answer: no

Email subject: Re: Ouch... [Bebergflame]
Answer: no

Email subject: Re: [BFTQ] JAVA question?
Answer: no


```




In [31]:
prompt_text = """Is the following email subject contains phishing?
Answer with yes or no only.

Examples:
Email subject: Complete Online Pharmacy with same day shipping
Answer: yes

Email subject: Re: building as non-root / newbie question
Answer: no

Email subject: Re: Ouch... [Bebergflame]
Answer: no

Email subject: Re: [BFTQ] JAVA question?
Answer: no

Question:
Email subject: {}
Answer:"""

In [32]:
predictions, labels = eval(test, generator, prompt_text)

-----------------------------------------------------------------------------------------------------
[SAdev] [Bug 779] rule broken: CORRUPT_MSGID

output: yes
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
Re: [ILUG] RH7.3 on Cobalt - the saga continues

output: yes
predicted: 1 label: 0
-----------------------------------------------------------------------------------------------------
Vintage Music Archive

output: no
predicted: 0 label: 0
-----------------------------------------------------------------------------------------------------
Trading for a living (All you should know about FOREX)

output: yes
predicted: 1 label: 1
-----------------------------------------------------------------------------------------------------
Custom Software Development Services Available Right Now..

output: yes
predicted: 1 label: 1
---------------------------------------------------------------------------------------

In [33]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(labels, predictions)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.59


# Szózsák modell

Nézzük meg milyen eredményt kapunk a már jól ismert szózsák alapú osztályozással, ha van elég példánk.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(train["subject"])
test_features = vectorizer.transform(test["subject"])

In [35]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
model = SGDClassifier().fit(features, train.label)
accuracy_score(y_true=test.label, y_pred=model.predict(test_features))

0.94

In [36]:
sorted(zip(model.coef_[0], vectorizer.get_feature_names_out()),reverse=True)[:20]

[(np.float64(1.8127605843339796), 'page'),
 (np.float64(1.8127605843339785), 'assistance'),
 (np.float64(1.7502515986672915), 'rftmbswfn'),
 (np.float64(1.7502515986672909), 't4hvhyesgjop4dzqbvillg'),
 (np.float64(1.7502515986672904), 'surprise'),
 (np.float64(1.7502515986672902), 'girls'),
 (np.float64(1.750251598667289), 'pic'),
 (np.float64(1.6252336273339139), 'zebpj'),
 (np.float64(1.6252336273339119), 'patrick'),
 (np.float64(1.562724641667224), 'joke'),
 (np.float64(1.5627246416672238), 'éå'),
 (np.float64(1.500215656000536), '26792'),
 (np.float64(1.5002156560005355), 'commissions'),
 (np.float64(1.500215656000535), 'bnimanie'),
 (np.float64(1.5002156560005349), 'cash'),
 (np.float64(1.4377066703338468), 'information'),
 (np.float64(1.437706670333846), 'feel'),
 (np.float64(1.4377066703338455), 'truth'),
 (np.float64(1.4377066703338452), 'want'),
 (np.float64(1.3751976846671583), 'url')]

# Gyakorló feladatok

*   Fine-tuneoljuk a GPT2-t a phishing osztályozási feladatra! Milyen eredményeket ér el?

In [109]:
train["subject"]
train["label"]

df2 = train.copy()


def make_sentence(row):
  ret = "Is this phising? " + row["subject"] +  " Answer is: " \
        + ("yes." if row["label"] == 1 else "no.")

  print(ret)
  return ret

df2["example"] = df2.apply(make_sentence, axis=1)

df2 = df2[["example"]]

print(df2.iloc[5247])
"""
dataset = Dataset.from_pandas(df2)

dataset.remove_columns('__index_level_0__')

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

def tokenize_phising(subject):
  encoding = tokenizer(subject['example'], truncation=True, padding='max_length', max_length=128)
  encoding['labels'] = encoding['input_ids'].copy()

  return encoding
"""

"""
tokenized_ds = dataset.map(tokenize_phising)

tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    logging_dir='./logs',
    logging_steps=50,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

trainer.train()

trainer.save_model('./results')
"""

Is this phising? Jobs, Jobs, Jobs: HEUTE, 03.07.02 ist virtueller Messetag der jobfair24!!! Answer is: no.
Is this phising? ADV: Direct email blaster, email addresses extractor, maillist verify, maillist manager........... Answer is: yes.
Is this phising? Re: Java is for kiddies Answer is: no.
Is this phising? [zzzzteana] "Put this in your stereo and smoke it ... " Answer is: no.
Is this phising? Re: Ouch... [Bebergflame] Answer is: no.
Is this phising? RE: CO2 and climate (was RE: Goodbye Global Warming) Answer is: no.
Is this phising? Re: alsa-driver.spec tweak for homemade kernels ... Answer is: no.
Is this phising? [SAtalk] Huh? Answer is: no.
Is this phising? Bush veto on Middle East talks Answer is: no.
Is this phising? NYTimes.com Article: Texas Pacific Goes Where Others Fear to Spend Answer is: no.
Is this phising? I could be JAILED for selling this CD! Answer is: yes.
Is this phising? Are you being Freeserved? Answer is: no.
Is this phising? Re: use new apt to do null to RH8 u

'\ntokenized_ds = dataset.map(tokenize_phising)\n\ntokenized_datasets.set_format(type=\'torch\', columns=[\'input_ids\', \'attention_mask\', \'labels\'])\n\nos.environ["WANDB_DISABLED"] = "true"\n\ntraining_args = TrainingArguments(\n    output_dir=\'./results\',\n    num_train_epochs=1,\n    per_device_train_batch_size=16,\n    logging_dir=\'./logs\',\n    logging_steps=50,\n    report_to=[],\n)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=tokenized_datasets,\n)\n\ntrainer.train()\n\ntrainer.save_model(\'./results\')\n'

In [114]:

fine_tuned_model = AutoModelForCausalLM.from_pretrained('./results')

fine_tuned_output = fine_tuned_model.generate(input_ids, max_new_tokens = 1, pad_token_id=tokenizer.pad_token_id)
fine_tuned_text = tokenizer.decode(fine_tuned_output[0], skip_special_tokens=True)

predictions = []

for i in range(len(test)):

    subject = test.iloc[i]['subject'] # Ensure this matches your column name

    # CRITICAL: Matches your training function 'make_sentence' exactly
    # We include the typo "phising" and the space after "Answer is: "
    input_text = "Is this phising? " + str(subject) + " Answer is: "

    # Tokenize
    inputs = tokenizer(input_text, return_tensors='pt')

    # 3. Generate
    # We limit max_new_tokens to 5. We only need "yes." or "no.",
    # but 5 allows for extra whitespace or weird behavior without breaking.
    with torch.no_grad(): # Disables gradient calculation to save memory
        outputs = fine_tuned_model.generate(
            inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_new_tokens=5,
            pad_token_id=tokenizer.eos_token_id
        )

    # 4. Decode
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # 5. Parsing Logic
    # The model returns: "Is this phising? [Subject] Answer is: yes."
    # We split by "Answer is: " and take the last part.
    answer_part = full_text.split("Answer is:")[-1]

    # Clean the text: lowercase it, remove spaces, remove the period '.'
    clean_answer = answer_part.strip().lower().replace(".", "")

    # 6. Classification
    # We check if "yes" exists in the clean answer
    if "yes" in clean_answer:
        predictions.append(1)
    else:
        predictions.append(0)

print(predictions)


[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
