# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">LLM - Detect AI Generated Text</p>

<h1 align='center'>Introduction 📝</h1>
Welcome to a competition where the task is to determine whether an essay was written by a student or generated by an LLM. In this notebook, I first explore the dataset with plotly, then move on to data processing and modeling.

<h1 align='center'>Dataset Info 📈</h1>
<h2>Training Data</h2>
<b>{test|train}_essays.csv</b><br>

* ```id``` - A unique identifier for each essay.
* ```prompt_id``` - Identifies the prompt the essay was written in response to.
* ```text``` - The essay text itself.
* ```generated``` - Whether the essay was written by a student (0) or generated by an LLM (1). This field is the target and is not present in test_essays.csv.

<b>train_prompts.csv - Essays were written in response to information in these fields.</b><br>

* ```prompt_id``` - A unique identifier for each prompt.
* ```prompt_name``` - The title of the prompt.
* ```instructions``` - The instructions given to students.
* ```source_text``` - The text of the article(s) the essays were written in response to, in Markdown format.

<h1 align='center'>Evaluation Metric 📐</h1>
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
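
To make the metric concrete, here is a minimal plain-Python sketch of ROC AUC, read as the probability that a randomly chosen LLM essay gets a higher score than a randomly chosen student essay. The function name `roc_auc` and the toy labels and scores are mine, for illustration only; the notebook itself relies on `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Only the ordering of the scores matters here, so the metric does not depend on any classification threshold.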

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">TABLE OF CONTENTS</p>
<ul style="list-style-type:square">
    <li><a href="#1">Importing Libraries</a></li>
    <li><a href="#2">Reading the data</a></li>
    <li><a href="#3">Exploratory Data Analysis</a></li>
    <li><a href="#4">Create Folds</a></li>
    <li><a href="#5">Dataset Class</a></li>
    <li><a href="#6">Baseline Model</a></li>
    <li><a href="#7">Utility Functions</a></li>
</ul>

<a id='1'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">IMPORTING LIBRARIES</p>

In [None]:
import gc
import os
import wandb
import random
import requests
import numpy as np 
import pandas as pd 
import seaborn as sns
from PIL import Image
import plotly.io as pio
import plotly.express as px
import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
from kaggle_secrets import UserSecretsClient

import torch
import torch.nn as nn
from tqdm import tqdm
from torch.optim import AdamW
from sklearn.metrics import roc_auc_score
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import StratifiedKFold
from transformers import (AutoTokenizer, AutoConfig, AutoModel,
                          get_linear_schedule_with_warmup,
                          get_cosine_schedule_with_warmup)  # cosine variant is used by get_scheduler

import warnings
warnings.simplefilter('ignore')

# Set global template and layout colors
pio.templates.default = "plotly_dark"
pio.templates[pio.templates.default].layout['paper_bgcolor'] = '#1F1F1F'
pio.templates[pio.templates.default].layout['plot_bgcolor'] = '#1F1F1F'

In [None]:
class CONFIG:
    seed=300
    num_fold = 3
    model = 'roberta-base'
    max_len = 512
    train_batch_size = 16
    valid_batch_size = 16
    epochs = 2
    learning_rate = 1e-5
    scheduler = 'linear'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    tokenizer = AutoTokenizer.from_pretrained(model)
    
CONFIG.tokenizer.save_pretrained('./tokenizer/')

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything(CONFIG.seed)

<img src = "https://awsmp-logos.s3.amazonaws.com/3426ff8f-c4da-4d6d-ae6c-46a02b4b0dc4/afbc6b177af177c2243749dfa88e0dec.png" style ="height:65%;width:65%;">

### I'll be utilizing W&B to monitor the model's performance.

https://github.com/ultralytics/yolov5/issues/2362

In [None]:
# Weights & Biases (optional)
%pip install -q wandb 

In [None]:
import wandb
wandb.login()

<a id='2'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">READING THE DATA</p>

In [None]:
df_ess = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_essays.csv")
df_ess.head()

In [None]:
df_ess.info()

In [None]:
df_pro = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv")
df_pro.head()

In [None]:
df_pro.info()

Thanks to @radek1. Please also upvote his dataset: [LLM Generated Essays for the Detect AI Comp!](https://www.kaggle.com/datasets/radek1/llm-generated-essays)

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
# An external dataset has been created; it is imported here.

In [None]:
## External Dataset
df_gpt_3_5 = pd.read_csv("/kaggle/input/llm-generated-essays/ai_generated_train_essays.csv")
df_gpt_4 = pd.read_csv("/kaggle/input/llm-generated-essays/ai_generated_train_essays_gpt-4.csv")

In [None]:
df_gpt_3_5.info()

In [None]:
df_gpt_4.info()

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
### The external dataset is combined with the original dataset.

In [None]:
## Combining the original and external dataset
df_new = pd.concat([df_ess, df_gpt_3_5, df_gpt_4]).reset_index(drop=True)

<a id='3'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">EXPLORATORY DATA ANALYSIS</p>

### To begin, we'll initiate the exploration process exclusively on the **original dataset**.<br>
## Distribution of Target Variable - Generated

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
   
    
### We visualize the number of samples for each value of the `generated` variable.

In [None]:
# Create a temporary dataframe with counts of each category
count_df = df_ess['generated'].value_counts().reset_index()
count_df.columns = ['generated', 'count']

fig = px.bar(
    count_df,
    x='generated',
    y='count',
    title='Distribution of Generated Label',
    color=['#2E86AB', '#E84545'],
    color_discrete_map="identity"
)

# Customize layout for value display
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=[0, 1])
)

# Display values on top of the bars
fig.update_traces(
    texttemplate='%{y}',  
    textposition='outside', 
)

fig.show()

### INSIGHTS

* The graph shows that the original dataset is highly imbalanced, with class 0 (student-written) as the overwhelming majority.
* Techniques such as upsampling the minority class or downsampling the majority class can address class imbalance.
* Here we upsample the minority class by adding LLM-generated essays from an external dataset.
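
For comparison, classic random oversampling duplicates existing minority-class rows instead of bringing in new ones. The sketch below shows that idea on toy data with the standard library only; `oversample_minority` and the toy rows are hypothetical names and values, and this is not what the notebook does (it adds external essays).

```python
import random
from collections import Counter

def oversample_minority(rows, label_key="generated", seed=300):
    """Duplicate random minority-class rows until all classes have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Toy data: 8 student essays (0) and 2 LLM essays (1)
toy_rows = [{"id": i, "generated": 0} for i in range(8)]
toy_rows += [{"id": 100 + i, "generated": 1} for i in range(2)]

balanced_rows = oversample_minority(toy_rows)
class_counts = Counter(row["generated"] for row in balanced_rows)
print(class_counts)
```

Duplicating rows adds no new information, which is one reason genuinely new external essays can be the better choice when they are available.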

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
### Findings:
The dataset is imbalanced: essays written by students dominate.
Techniques such as upsampling and downsampling can be used to correct this imbalance. This notebook uses upsampling, which is why we imported the external dataset.

## Exploration of Essay Text

In [None]:
df_ess['essay_len'] = df_ess['text'].str.split().map(lambda x : len(x))

fig = px.histogram(
    df_ess,
    x='essay_len',
    title='Distribution of word count of text',
    color_discrete_sequence=['#FF7F0E'], 
)

fig.show()

### INSIGHTS

* The histogram shows that most essays have around 500-600 words.
* There are also some outliers, with a few essays exceeding 1200 words.

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
    
# Findings
Most essays have 500-600 words. There are a few outliers (essays exceeding 1200 words).

In [None]:
fig = px.histogram(
    df_ess,
    x='essay_len',
    title='Word Count Distribution Across Prompts',
    color="prompt_id",
    color_discrete_sequence=px.colors.qualitative.Bold,
)

fig.show()

## INSIGHTS
* Essays written in response to prompt id 0 are generally longer than those written in response to prompt id 1.

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
   
   
    
# Findings
Essays written in response to prompt_id 0 are generally longer than those written in response to prompt_id 1.

### After this preliminary exploration, we now turn to the **updated dataset**, which incorporates an **external dataset**. This is necessary because the original dataset is highly imbalanced, which makes it unsuitable for training an effective model.

## Distribution of Target Variable - Generated

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
    
    
    
#### The rest of the code uses the updated dataset, which includes the external dataset. This gives a more balanced dataset.

In [None]:
# Create a temporary dataframe with counts of each category
count_df = df_new['generated'].value_counts().reset_index()
count_df.columns = ['generated', 'count']

fig = px.bar(
    count_df,
    x='generated',
    y='count',
    title='Distribution of Generated Label',
    color=['#FECB52', '#FF97FF'],
    color_discrete_map="identity"
)

# Customize layout for value display
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=[0, 1])
)

# Display values on top of the bars
fig.update_traces(
    texttemplate='%{y}',  
    textposition='outside', 
)

fig.show()

## INSIGHTS
* The dataset is now more balanced after upsampling with the external data.

## Distribution of Prompt ID

In [None]:
fig_hist = px.histogram(
    df_new,
    x='prompt_id',
    title='Prompt ID Distribution by Generated Category',
    color='generated',
    barmode='group',
    color_discrete_sequence=px.colors.qualitative.D3[3:],
)

# Display values on top of the bars
fig_hist.update_traces(
    texttemplate='%{y}',  
    textposition='outside',  
)

fig_hist.show()

## INSIGHTS
* The graph shows that the two prompts have similar distributions.
* For both prompts there are more student-written essays than LLM-generated ones, which is expected.

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
## The prompts are distributed almost equally.

## Exploration of Essay Text

In [None]:
df_new['essay_len'] = df_new['text'].str.split().map(lambda x : len(x))
df_llm = df_new[df_new["generated"]==1]

fig = px.histogram(
    df_llm,
    x='essay_len',
    title='Distribution of word count of text generated by LLM',
    color_discrete_sequence=['#73AF48']
)

fig.show()

### INSIGHTS

* The distribution is not perfectly normal; it is slightly skewed to the right. This means most essays cluster at lower word counts, with a tail of a few much longer essays.
* The majority of essays have a word count between 400-600 words.
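
A quick way to check the direction of skew without a plot is to compare the mean with the median: a long right tail tends to pull the mean above the median. The sketch below uses made-up word counts, and `skew_direction` is a hypothetical helper. This is a rough heuristic rather than a formal skewness statistic.

```python
import statistics

def skew_direction(values):
    """Rough skew check: a mean above the median hints at a right (positive) skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "symmetric"

# Hypothetical word counts: most essays are short, a few are very long
toy_lengths = [380, 400, 410, 420, 430, 450, 470, 520, 700, 1200]
direction = skew_direction(toy_lengths)
print(direction)
```

On the real data, the same check could be run on `df_llm['essay_len']`.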

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
## The distribution is not normal; it is slightly skewed to the right. Most essays fall in the 400-600 word range.

In [None]:
fig = px.histogram(
    df_new,
    x='essay_len',
    title='Distribution of word count of text grouped by prompt_id ',
    color="prompt_id",
    color_discrete_sequence=px.colors.qualitative.Safe,
    barmode="group",
)

fig.show()


## INSIGHTS
* The word count distribution for prompt id 0 is more right-skewed than that for prompt id 1. This suggests that prompt id 0 has more essays with a very high word count.


<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
    
### The distribution for essays written for prompt_id 0 extends further to the right, while the distribution for prompt_id 1 sits further to the left. This means that essays written for prompt_id 0 tend to contain more words.

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
    
# The code below ranks the most frequently used words, first for essays written by students and then for essays generated by LLMs.
    
Stopwords (the, on, off, etc.) are included in the first ranking; the ranking is then recomputed without them.

In [None]:
word_occ = Counter([word.lower() for words in df_ess['text'] for word in words.split()])
df_temp = pd.DataFrame(word_occ.most_common(10))
df_temp.columns = ['Common Words','count']

fig = px.bar(df_temp, 
             x='count', 
             y='Common Words', 
             title='Most Common Words (including stopwords) in Essays Written by Students', 
             orientation='h', 
             width=900,
             height=700, 
             color='Common Words',
            color_discrete_sequence=px.colors.qualitative.Vivid)

fig.show()

In [None]:
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words("english"))
word_occ = Counter([word.lower() for words in df_ess['text'] for word in words.split() if word.lower() not in stop_words])
df_temp = pd.DataFrame(word_occ.most_common(10))
df_temp.columns = ['Common Words','count']

fig = px.bar(df_temp, 
             x='count', 
             y='Common Words', 
             title='Most Common Words (excluding stopwords) in Essays Written by Students', 
             orientation='h', 
             width=900,
             height=700, 
             color='Common Words',
            color_discrete_sequence=px.colors.qualitative.Vivid)

fig.show()

In [None]:
word_occ = Counter([word.lower() for words in df_llm['text'] for word in words.split()])
df_temp = pd.DataFrame(word_occ.most_common(10))
df_temp.columns = ['Common Words','count']

fig = px.bar(df_temp, 
             x='count', 
             y='Common Words', 
             title='Most Common Words (including stopwords) in Essays Generated by LLMs', 
             orientation='h', 
             width=900,
             height=700, 
             color='Common Words',
            color_discrete_sequence=px.colors.qualitative.Set1)

fig.show()

In [None]:
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words("english"))
word_occ = Counter([word.lower() for words in df_llm['text'] for word in words.split() if word.lower() not in stop_words])
df_temp = pd.DataFrame(word_occ.most_common(10))
df_temp.columns = ['Common Words','count']

fig = px.bar(df_temp, 
             x='count', 
             y='Common Words', 
             title='Most Common Words (excluding stopwords) in Essays Generated by LLMs', 
             orientation='h', 
             width=900,
             height=700, 
             color='Common Words',
            color_discrete_sequence=px.colors.qualitative.Set1)

fig.show()

## INSIGHTS
* The above plots reveal differences in common words between essays written by students and those generated by LLMs. 
* This observation implies that such differences could be valuable for classifying essays.
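
One simple way to turn this observation into something measurable is to compare the top-k word sets of the two groups and look at the words that appear in only one of them. The sketch below does this on tiny made-up corpora; `top_words`, the sample sentences, and the toy stopword set are all hypothetical stand-ins for the real essays and the NLTK stopword list.

```python
from collections import Counter

def top_words(texts, k, stop_words=frozenset()):
    """Return the set of the k most frequent lowercased words across texts."""
    counts = Counter(
        word.lower()
        for text in texts
        for word in text.split()
        if word.lower() not in stop_words
    )
    return {word for word, _ in counts.most_common(k)}

# Hypothetical mini-corpora standing in for the student and LLM essays
student_texts = ["cars are bad cars pollute", "i think cars are bad"]
llm_texts = ["furthermore cars impact society", "furthermore society benefits"]
toy_stop_words = {"are", "i"}

student_top = top_words(student_texts, k=2, stop_words=toy_stop_words)
llm_top = top_words(llm_texts, k=2, stop_words=toy_stop_words)

only_student = student_top - llm_top  # frequent for students, not for LLMs
only_llm = llm_top - student_top      # frequent for LLMs, not for students
print(only_student, only_llm)
```

Words that land in only one of the two sets are candidates for simple bag-of-words features.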

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
### The most frequent words differ depending on whether an essay was written by a student or generated by an LLM. This can be used for classification.

## WordCloud

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
### A word cloud is created. It visualizes the most frequently used words together in a single image.

In [None]:
plt.figure(figsize=(10, 10))

text = df_new['text'].values
url = 'https://www.llmsolicitors.co.uk/wp-content/themes/llmsolicitors/image/avatar.png'
img = np.array(Image.open(requests.get(url, stream=True).raw))

# Center cropping
width, height = img.shape[1], img.shape[0]
mid_x, mid_y = int(width/2), int(height/2)
cw2, ch2 = int(width/4), int(height/4) 
crop_img = img[mid_y-ch2:mid_y+ch2, mid_x-cw2:mid_x+cw2]

cloud = WordCloud(stopwords = STOPWORDS,
                  background_color='white',
                  min_font_size=3,
                  mask = crop_img,
                  max_words = 500,
                  colormap='Dark2'
                  ).generate(" ".join(text))

plt.imshow(cloud)
plt.axis('off')
plt.show()

<a id='4'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">CREATE FOLDS</p>

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    
# Stratified k-fold is applied to split the dataset for cross-validation. Stratified K-Fold keeps the class distribution of the target variable in each fold approximately the same as in the full dataset, which matters most when the classes are imbalanced. Cross-validation helps detect overfitting.

In [None]:
# Use Stratified K-Fold for cross-validation
skf = StratifiedKFold(n_splits=CONFIG.num_fold, shuffle=True, random_state=CONFIG.seed)

# Assign folds to the dataframe
for k, (_, val_ind) in enumerate(skf.split(X=df_new, y=df_new['generated'])):
    df_new.loc[val_ind, 'fold'] = k
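
To show what "stratified" means in practice, here is a simplified plain-Python sketch: shuffle the indices within each class, then deal them out to the folds round-robin so every fold ends up with a similar class mix. `stratified_folds` and the toy labels are hypothetical, and this is not scikit-learn's exact algorithm, only the underlying idea.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=300):
    """Assign a fold to each sample so every fold has a similar class mix.

    Simplified sketch: shuffle indices within each class, then deal them
    out round-robin. This is not scikit-learn's exact algorithm.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[idx] = position % n_splits
    return folds

# Toy labels: 9 student essays (0) and 3 LLM essays (1)
toy_labels = [0] * 9 + [1] * 3
toy_folds = stratified_folds(toy_labels, n_splits=3)

# Count (fold, label) pairs to confirm each fold has the same class mix
per_fold = Counter(zip(toy_folds, toy_labels))
print(sorted(per_fold.items()))
```

The same kind of check can be run on the real folds with `df_new.groupby('fold')['generated'].value_counts()`.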

<a id='5'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">DATASET CLASS</p>

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
    

## A custom class called LLMTextData is created.
#### `__len__(self)`: Returns the number of samples in the dataset.
#### `__getitem__(self, index)`: Returns one sample for the given index (the tokenized essay and its label).

In [None]:
class LLMTextData(Dataset):
    def __init__(self, df):
        self.text = df['text']
        self.generated = df['generated']
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, index):
        text = self.text[index]
        generated = self.generated[index]
        
        inputs = CONFIG.tokenizer(text, padding='max_length', max_length=CONFIG.max_len, truncation=True)
        
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]
        
        return {
            "ids": torch.tensor(input_ids, dtype=torch.long),
            "mask": torch.tensor(attention_mask, dtype=torch.long),
            "label": torch.tensor(generated, dtype=torch.int),
        }
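
The `padding='max_length'` and `truncation=True` arguments are what make every sample come out with the same length. The sketch below imitates that effect on a plain list of token ids. `pad_or_truncate`, the sample ids, and `pad_id=1` are assumptions for illustration; the real tokenizer also handles special tokens when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])   # truncation: drop anything past max_len
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_len - len(ids)
    ids += [pad_id] * padding         # padding: fill up to max_len
    mask += [0] * padding             # 0 marks a padding position
    return ids, mask

# Hypothetical token ids: one short sequence, one that is too long
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_len = 512` and essays of roughly 400-600 words, a fair share of essays will be truncated, since a word often maps to more than one token.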

<a id='6'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">BASELINE MODEL</p>

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">
    
#### The code below defines the model class. It is designed to work on text data and uses RoBERTa, a transformer model.
## LLMDetectModel Class

## 1. `__init__`

Takes the model name, an optional model path, and whether to load pretrained weights. If `pretrained` is false, the model is built from the configuration alone, with randomly initialized weights (for example, before loading a saved checkpoint).
 
## 2. `forward`
Defines how the model makes predictions. The inputs first pass through RoBERTa; the token outputs are then average-pooled and passed through a linear layer to produce a single logit.
    
    
The class is designed to make predictions on text data using the RoBERTa neural network. It loads the model configuration, uses pretrained weights when requested, and adds the extra layers.
    

In [None]:
class LLMDetectModel(nn.Module):
    def __init__(self, model_name, model_path=None, pretrained=True):
        super(LLMDetectModel, self).__init__()
        
        if model_path is None:
            self.config = AutoConfig.from_pretrained(model_name)
            self.config.update({"output_hidden_states":True, 
                           "hidden_dropout_prob": 0.0,
                           "layer_norm_eps": 1e-7})
        else:
            self.config = torch.load(model_path+"/config.pth")
        
        if pretrained:
            self.roberta = AutoModel.from_pretrained(model_name, config=self.config)
        else:
            self.roberta = AutoModel.from_config(self.config)
            
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.linear = nn.Linear(self.config.hidden_size,1)
        
    def forward(self, input_ids, attention_mask=None):
        x = self.roberta(input_ids, attention_mask)[0]
        x = self.pool(x.permute(0, 2, 1)).view(x.size(0), -1)
        x = self.linear(x)
        
        return x
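
Note that `nn.AdaptiveAvgPool1d(1)` averages over every position in the sequence, padding included. A common variation is to average only the positions where the attention mask is 1. The plain-Python sketch below contrasts the two on a toy sequence; `mean_pool` and the numbers are hypothetical, and this shows a possible alternative, not what the model above does.

```python
def mean_pool(hidden, mask=None):
    """Average token vectors; with a mask, padded positions are ignored."""
    if mask is None:
        mask = [1] * len(hidden)
    dim = len(hidden[0])
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask must keep at least one token")
    return [
        sum(vec[d] for vec, m in zip(hidden, mask) if m) / kept
        for d in range(dim)
    ]

# One toy sequence: two real tokens followed by two padding positions
toy_hidden = [[1.0, 3.0], [3.0, 5.0], [8.0, 8.0], [8.0, 8.0]]
toy_mask = [1, 1, 0, 0]

plain = mean_pool(toy_hidden)             # average over all positions
masked = mean_pool(toy_hidden, toy_mask)  # padding excluded
print(plain, masked)
```

The shorter the essay relative to `max_len`, the more the padded positions pull the plain average away from the masked one.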

<a id='7'></a>

# <p style="background-color:#C1FFC1;font-family:cursive;color:#A0522D;font-size:150%;text-align:center;border-radius:10px 10px;line-height:1.5">UTILITY FUNCTIONS</p>

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">

### Data Loaders
    
#### 1. get_data(fold): 
#### Returns the training and validation data loaders
<hr>

#### 2. loss_fn 
#### Computes the mismatch between the model outputs and the true labels.
<hr>

#### 3. calculate_roc_auc 
#### Computes the ROC-AUC score
<hr>

#### 4. get_optimizer
#### Builds the optimizer used to update the model parameters. This code uses the AdamW optimization algorithm.
<hr>

#### 5. get_scheduler
#### Linear or cosine schedule: after a short warmup, the learning rate is decreased according to the chosen strategy. This lets the model make faster progress early in training and finer adjustments toward the end.
<hr>

Overall Purpose

.1. **Data Preparation:** 
    
.2. **Loss Function:** A function used to measure and improve the model's performance.
    
.3. **Performance Metric:** The metric used to evaluate how well the model works.
    
.4. **Optimization and Scheduling:** Used to update the model's weights.
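
To show what the loss function computes, here is a plain-Python sketch of binary cross-entropy on a raw logit, written in the numerically stable form, next to the direct formula for comparison. `bce_with_logits`, `sigmoid`, and the toy values are hypothetical names and numbers. This handles one sample at a time, whereas `nn.BCEWithLogitsLoss` averages over the batch by default.

```python
import math

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for one raw logit and a 0/1 label."""
    # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

toy_logit, toy_label = 2.0, 1.0
stable = bce_with_logits(toy_logit, toy_label)
naive = -math.log(sigmoid(toy_logit))  # direct formula for label == 1
print(stable, naive)
```

Working on logits rather than probabilities is why the model's last layer has no sigmoid; the sigmoid is only applied when collecting predictions for the AUC.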



In [None]:
# Function to get training and validation data loaders for a given fold
def get_data(fold):
    train_df = df_new[df_new['fold'] != fold].reset_index(drop=True)
    valid_df = df_new[df_new['fold'] == fold].reset_index(drop=True)
    
    train_dataset = LLMTextData(train_df)
    valid_dataset = LLMTextData(valid_df)
    
    train_loader = DataLoader(train_dataset, batch_size=CONFIG.train_batch_size, shuffle=True)
    # Validation data does not need shuffling
    valid_loader = DataLoader(valid_dataset, batch_size=CONFIG.valid_batch_size, shuffle=False)
    
    return train_loader, valid_loader

# Loss function definition
def loss_fn(outputs, labels):
    return nn.BCEWithLogitsLoss()(outputs, labels)

# Function to calculate ROC-AUC score
def calculate_roc_auc(y_true, y_pred):
    score = roc_auc_score(y_true, y_pred)
    return score

# Function to get the optimizer
def get_optimizer(model):
    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {
            "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.01
        },
        {
            "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0
        },
    ]

    optimizer = AdamW(optimizer_parameters, lr=CONFIG.learning_rate)

    return optimizer

# Function to get learning rate scheduler based on the specified type in CONFIG
def get_scheduler(cfg, optimizer, train_loader):
    total_steps = cfg.epochs * len(train_loader)
    warmup_steps = int(total_steps * 6 / 100)  # warm up over the first 6% of steps
    if cfg.scheduler == 'linear':
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=warmup_steps,
            num_training_steps=total_steps,
        )
    elif cfg.scheduler == 'cosine':
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=warmup_steps,
            num_training_steps=total_steps,
        )
    else:
        # Fail early instead of returning an undefined scheduler
        raise ValueError(f"Unknown scheduler: {cfg.scheduler}")
    return scheduler
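
To see what the linear schedule does to the learning rate, the sketch below reproduces the multiplier in plain Python: a linear ramp from 0 to 1 during warmup, then a linear decay to 0 at the last step. `linear_warmup_factor` and the run sizes are hypothetical. It follows the behavior of `get_linear_schedule_with_warmup` as I understand it, so treat it as an illustration rather than the library's code.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Hypothetical run: 100 batches per epoch, 2 epochs, 6% warmup
total_steps = 100 * 2
warmup_steps = int(total_steps * 6 / 100)  # same 6% rule as get_scheduler
base_lr = 1e-5

lr_at = {s: base_lr * linear_warmup_factor(s, warmup_steps, total_steps)
         for s in (0, 6, 12, 106, 200)}
print(warmup_steps, lr_at)
```

The learning rate peaks at `base_lr` right when warmup ends and is back to half of that midway through the decay phase.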

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:center;
           padding: 10px 10px">

#### The two functions below define how the model is trained and evaluated. `train_fn` trains the model on the true labels; `valid_fn` evaluates how the model performs during training.





In [None]:
# Function to train the model
def train_fn(model, data_loader, optimizer, scheduler, device, epoch):
    # Set the model to training mode
    model.train()
    
    running_loss = 0
    progress_bar = tqdm(data_loader, position=0)
    preds = []
    label = []
    
    for step, data in enumerate(progress_bar):
        # Move data to the specified device
        ids = data['ids'].to(device)
        masks = data['mask'].to(device)
        labels = data['label'].to(device, dtype = torch.float)
        
        # Forward pass
        outputs = model(ids, masks)
        
        # Compute the loss
        loss = loss_fn(outputs, labels.unsqueeze(1))
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        # Adjust learning rate if a scheduler is provided
        if scheduler is not None:
            scheduler.step()
        
        running_loss += loss.item()
        
        # Collect predictions and labels for later evaluation
        preds.extend(torch.sigmoid(outputs).view(-1).cpu().detach().numpy())
        label.extend(labels.view(-1).cpu().detach().numpy())
        
        # Update progress bar
        progress_bar.set_description(f"Epoch [{epoch+1}/{CONFIG.epochs}]")
        progress_bar.set_postfix(loss=running_loss/(step+1))
        
        # Log the loss
        wandb.log({"Train Loss": running_loss/(step+1)})
    
    # Calculate ROC-AUC on the training set
    train_auc = calculate_roc_auc(np.array(label), np.array(preds))
    
    return train_auc
        
# Function to validate the model        
def valid_fn(model, data_loader, device, epoch):
    # Set the model to evaluation mode
    model.eval()
    
    running_loss = 0
    progress_bar = tqdm(data_loader, position=0)
    preds = []
    label = []
    
    with torch.no_grad():
        
        for step, data in enumerate(progress_bar):
            # Move data to the specified device
            ids = data['ids'].to(device)
            masks = data['mask'].to(device)
            labels = data['label'].to(device, dtype = torch.float)
            
            # Forward pass
            outputs = model(ids, masks)
            
            # Compute the loss
            loss = loss_fn(outputs, labels.unsqueeze(1))

            running_loss += loss.item()
            
            # Collect predictions and labels for later evaluation
            preds.extend(torch.sigmoid(outputs).view(-1).cpu().detach().numpy())
            label.extend(labels.view(-1).cpu().detach().numpy())
            
            # Update progress bar
            progress_bar.set_description(f"Epoch [{epoch+1}/{CONFIG.epochs}]")
            progress_bar.set_postfix(loss=running_loss/(step+1))
            
            # Log the loss
            wandb.log({"Valid Loss": running_loss/(step+1)})
        
        # Calculate ROC-AUC on the validation set
        valid_auc = calculate_roc_auc(np.array(label), np.array(preds))
    
    return valid_auc         

<div style="color:white;
           display:fill;
           border-radius:20px;
           background-color:#4b7068;
           font-size:120%;
           font-family: Lucida Console, Courie;
           letter-spacing:0.5px;
           text-align:;
           padding: 10px 20px">

* The run function manages the model's training and validation and evaluates its performance.
    
    
It uses ROC-AUC scores to measure how well the model has learned from the data.
During training, the weights from the epoch with the best validation performance are saved, so the model can later be used in that best state.

In [None]:
# Function to execute the training and validation process for a specific fold.
def run(fold):
    # Get data loaders for the specified fold
    train_loader, valid_loader = get_data(fold)
    
    # Instantiate the LLMDetectModel and move it to the specified device
    model = LLMDetectModel(CONFIG.model).to(CONFIG.device)
    
    if not os.path.exists("./config.pth"):
        torch.save(model.config,"./config.pth")
    
    # Get the optimizer for the model
    optimizer = get_optimizer(model)
    
    # Get the learning rate scheduler based on the specified configuration
    scheduler = get_scheduler(CONFIG, optimizer, train_loader)
    
    # Initialize the best validation ROC-AUC score
    best_valid_auc = 0
        
    # Training loop for the specified number of epochs
    for epoch in range(CONFIG.epochs):
        train_auc = train_fn(model, train_loader, optimizer, scheduler, CONFIG.device, epoch)
        valid_auc = valid_fn(model, valid_loader, CONFIG.device, epoch)
        print(f"Train ROC AUC - {train_auc}, Valid ROC AUC - {valid_auc}")
        
        # Log the metrics
        wandb.log({"Train AUC": train_auc})
        wandb.log({"Valid AUC": valid_auc})
        
        # Check if the validation ROC-AUC has improved, and if so, save the model checkpoint
        if valid_auc > best_valid_auc:
            print(f"Validation ROC AUC Improved - {best_valid_auc} ---> {valid_auc}")
            torch.save(model.state_dict(), f'./model_{fold}.bin')
            print(f"Saved model checkpoint at ./model_{fold}.bin")
            best_valid_auc = valid_auc
    
    return best_valid_auc

In [None]:
# Main loop for training on all folds
for fold in range(CONFIG.num_fold):
    print("=" * 30)
    print("Training Fold - ", fold)
    print("=" * 30)
    
    wandb_run = wandb.init(project='Detect Authors', 
                     group=f"Fold - {fold}"
                     )
    
    best_valid_auc = run(fold)
    print(f'Best ROC AUC: {best_valid_auc:.5f}')
    
    wandb_run.finish()
    
    gc.collect()
    torch.cuda.empty_cache()    

### [Click here to view the entire dashboard](https://wandb.ai/utcarshagrawal/Detect%20Authors)
<img src="https://i.imgur.com/dTVNOnR.png">

<div class="alert alert-block alert-info">
    <h2 align='center'>THANK YOU!!</h2>
    <h3 align='center'>Please upvote if you found the notebook useful.</h3>
</div>