<a href="https://colab.research.google.com/github/damlakaynarca/Big-Data/blob/main/Untitled20Big_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install the required libraries
!pip install transformers datasets evaluate scikit-learn

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import os

# Disable Weights & Biases (wandb) logging
os.environ["WANDB_DISABLED"] = "true"

# 1. Load the dataset
dataset = load_dataset("tweet_eval", "emotion")

# 2. Set up the model and tokenizer (DistilBERT)
model_name = "distilbert-base-uncased"  # DistilBERT checkpoint (a distilled version of BERT)
tokenizer = AutoTokenizer.from_pretrained(model_name)
num_labels = dataset["train"].features["label"].num_classes  # Number of emotion classes in the dataset
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# 3. Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=64)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Select the training and validation splits
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]

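# 4. Set the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate at the end of every epoch (newer transformers releases name this eval_strategy)
    save_strategy="epoch",  # Save a checkpoint at the end of every epoch
    logging_dir="./logs",
    per_device_train_batch_size=8,  # Training batch size per device
    per_device_eval_batch_size=8,  # Evaluation batch size per device
    num_train_epochs=3,  # Number of epochs
    learning_rate=5e-5,  # Learning rate
    logging_steps=10,  # Log training metrics every 10 steps
    save_steps=500,  # Has no effect while save_strategy="epoch"
    save_total_limit=1,  # Limit how many checkpoints are kept on disk
    load_best_model_at_end=True,  # Reload the best checkpoint (lowest validation loss by default) after training
    report_to="none",  # Disable logging integrations, as the WANDB_DISABLED deprecation warning recommends
)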

# 5. Metric function used during evaluation
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="weighted")
    return {"accuracy": acc, "f1": f1}

# 6. Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

# 7. Evaluate the model on the test split
eval_results = trainer.evaluate(tokenized_datasets["test"])
print("Test results:", eval_results)


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/3257 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1421 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/374 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch  Training Loss  Validation Loss  Accuracy  F1
1      0.6273         0.647762         0.772727  0.767765
2      0.3329         0.730722         0.81016   0.806408
3      0.1975         0.914716         0.78877   0.787419
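
In the table above, training loss falls every epoch while validation loss rises after epoch 1, which is a common sign of overfitting. Because `load_best_model_at_end=True` is set without `metric_for_best_model`, the Trainer is expected to rank checkpoints by validation loss by default. The following is a minimal sketch in plain Python of which epoch each criterion would pick; the `history` list and the `best_epoch` helper are hypothetical names, and the numbers are copied from the table.

```python
# Per-epoch results copied from the training table above.
history = [
    {"epoch": 1, "train_loss": 0.6273, "val_loss": 0.647762, "accuracy": 0.772727, "f1": 0.767765},
    {"epoch": 2, "train_loss": 0.3329, "val_loss": 0.730722, "accuracy": 0.81016, "f1": 0.806408},
    {"epoch": 3, "train_loss": 0.1975, "val_loss": 0.914716, "accuracy": 0.78877, "f1": 0.787419},
]


def best_epoch(rows, key, higher_is_better):
    """Return the epoch number whose value for `key` is best (hypothetical helper)."""
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[key])["epoch"]


# Lower is better for loss; higher is better for F1.
by_loss = best_epoch(history, "val_loss", higher_is_better=False)
by_f1 = best_epoch(history, "f1", higher_is_better=True)

print("best epoch by validation loss:", by_loss)
print("best epoch by F1:", by_f1)
```

With these numbers the two criteria disagree: validation loss favors epoch 1, while F1 favors epoch 2. If selecting on F1 is preferred, passing `metric_for_best_model="f1"` to `TrainingArguments` is one way to do it.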


Test results: {'eval_loss': 0.6139195561408997, 'eval_accuracy': 0.7909922589725545, 'eval_f1': 0.7850507742797088, 'eval_runtime': 156.1103, 'eval_samples_per_second': 9.103, 'eval_steps_per_second': 1.14, 'epoch': 3.0}
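
The `eval_accuracy` and `eval_f1` values come from `compute_metrics`, which takes the argmax over the logits and then computes accuracy and a weighted F1. As a rough illustration of what `average="weighted"` means, here is a pure-Python sketch on made-up toy data. The `weighted_f1` helper, the toy logits and the toy labels are all hypothetical and are not the notebook's predictions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy logits for four examples and three classes (illustrative only).
logits = [
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.9, 0.8, 0.1],
    [0.1, 0.2, 3.0],
]
labels = [0, 1, 1, 2]

# Argmax over each row of logits, as compute_metrics does with np.argmax.
preds = [row.index(max(row)) for row in logits]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)

print("predictions:", preds)
print("accuracy:", accuracy)
print("weighted F1:", weighted_f1(labels, preds))
```

Weighting by support means frequent classes dominate the score. On an imbalanced label set, `average="macro"` would give each class equal weight instead.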
