## Download Dataset

In [1]:
!pip install gdown

Collecting gdown
  Downloading gdown-4.7.1-py3-none-any.whl (15 kB)
Installing collected packages: gdown
Successfully installed gdown-4.7.1


In [2]:
train_file_id = '1jokcrLwNE1-Z5als-uZACSu0xsitTukp'
url = f'https://drive.google.com/uc?id={train_file_id}'
output = 'train.json'

!gdown $url -O $output

Downloading...
From: https://drive.google.com/uc?id=1jokcrLwNE1-Z5als-uZACSu0xsitTukp
To: /kaggle/working/train.json
100%|██████████████████████████████████████| 82.9M/82.9M [00:00<00:00, 87.0MB/s]


In [3]:
val_file_id = '1q3m1Xr-8ucDnVn5zG4UoKRLqx4zE3hFd'
url = f'https://drive.google.com/uc?id={val_file_id}'
output = 'val.json'

!gdown $url -O $output

Downloading...
From: https://drive.google.com/uc?id=1q3m1Xr-8ucDnVn5zG4UoKRLqx4zE3hFd
To: /kaggle/working/val.json
100%|███████████████████████████████████████| 3.75M/3.75M [00:00<00:00, 209MB/s]


## Install Libraries

In [4]:
!pip install -U transformers

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/20/0a/739426a81f7635b422fbe6cb8d1d99d1235579a6ac8024c13d743efa6847/transformers-4.36.2-py3-none-any.whl.metadata
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.36.0
    Uninstalling transformers-4.36.0:
      Successfully uninstalled transformers-4.36.0
Successfully installed transformers-4.36.2


In [5]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
# AdamW is not imported: it is unused here and has been dropped from recent transformers releases
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import pandas as pd
import json
import random
from tqdm import tqdm



## Dataset, Dataloader and Model

In [12]:
class EssaysDataset(Dataset):
    """Reads a JSON-lines file and exposes each record as one formatted string."""

    def __init__(self, file_path, tokenizer):
        # `tokenizer` is accepted but unused: tokenization happens later,
        # when the exported text file is loaded for training.
        self.row_data = []  # raw parsed JSON records
        self.data = []      # formatted training strings
        models = []
        sources = []
        with open(file_path, 'r', encoding='utf-8') as file:
            for line in tqdm(file):
                sample = json.loads(line)
                self.row_data.append(sample)
                model = sample["model"]
                source = sample["source"]
                text = sample["text"]
                # Prefix each text with its generator label and source domain.
                self.data.append(f"<SOS> Model: {model} <BOT> Source: {source}, Text: {text} <EOS>")
                models.append(model)
                sources.append(source)

        # Sorted arrays of the distinct labels seen in the file.
        self.models = np.unique(models)
        self.sources = np.unique(sources)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return the formatted string; no tokenization is applied at this stage.
        return self.data[idx]
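
The template applied in `__init__` can be exercised on its own, without reading the file. A minimal sketch, assuming a record with the same `model`, `source`, and `text` keys; the helper name `format_sample` and the example record are hypothetical and not part of the notebook:

```python
import json

def format_sample(sample: dict) -> str:
    """Render one JSON record with the same template EssaysDataset uses."""
    return (
        f"<SOS> Model: {sample['model']} <BOT> "
        f"Source: {sample['source']}, Text: {sample['text']} <EOS>"
    )

# Hypothetical JSON-lines record, shaped like one line of train.json.
line = '{"model": "human", "source": "arxiv", "text": "We study X."}'
formatted = format_sample(json.loads(line))
print(formatted)
```

Keeping the template in one function makes it easier to reuse the exact same layout when building prompts at inference time.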

In [13]:
ds_path = 'train.json'

# `tokenizer` comes from cell In [7] (shown below), which was executed before this cell.
dataset = EssaysDataset(ds_path, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

print(len(dataset))
print(dataset.models)
print(dataset.sources)
print(dataset[1])

55776it [00:00, 78835.05it/s]

55776
['bloomz' 'chatGPT' 'cohere' 'davinci' 'dolly' 'human']
['arxiv' 'reddit' 'wikihow' 'wikipedia']
<SOS> Model: chatGPT <BOT> Source: wikihow, Text: Are you planning an international backpacking trip? Packing can be a daunting task, especially when you're trying to pack light. In this article, you will learn how to pack for an international backpacking trip with ease. 

1. Select the right kind of backpack from a reputable company, such as Lowe Alpine or North Face. Make sure your backpack has comfortable straps and is durable enough to handle the wear and tear of travel. 

2. Create a list of what you think you'll need for an international backpacking trip. This will help you stay organized and focused while packing. 

3. Pack a backpack with the following clothing items: two pairs of lightweight pants made of cotton or another fabric that dries quickly, two pairs of shoes (walking shoes and relaxing shoes), and weather-appropriate gear based on your location. 

4. Use see-through




In [14]:
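# Write the formatted samples to a plain-text file for TextDataset.
# Texts may contain newlines, so one sample can span several lines.
with open('train_data.txt', 'w', encoding='utf-8') as train_data:
    for d in dataset.data:
        train_data.write(d + '\n')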

In [7]:
# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


model_name = "gpt2"  # or another variant like "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens({"pad_token": "<PAD>", "bos_token": "<SOS>", "eos_token": "<EOS>"})
tokenizer.add_tokens(["<BOT>"])


model = GPT2LMHeadModel.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Embedding(50261, 768)
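
The embedding size printed above is consistent with the tokens added in this cell. A quick arithmetic check, assuming the stock GPT-2 vocabulary of 50,257 entries and that none of the four added strings already exists in it:

```python
base_vocab_size = 50257                       # stock GPT-2 vocabulary (assumed)
special_tokens = ["<PAD>", "<SOS>", "<EOS>"]  # added via add_special_tokens
extra_tokens = ["<BOT>"]                      # added via add_tokens

expected_size = base_vocab_size + len(special_tokens) + len(extra_tokens)
print(expected_size)
```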

## Training

In [15]:
train_data_path = "train_data.txt"

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_data_path,
    block_size=400,
)
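
`TextDataset` does not keep one example per essay: it tokenizes the whole file as a single stream and cuts it into consecutive blocks of `block_size` tokens, so a block can start or end mid-essay. The sketch below imitates that behaviour on a plain list of integers. `chunk_token_stream` is a hypothetical helper, and dropping the trailing partial block is an assumption about the library's behaviour rather than something this notebook states:

```python
def chunk_token_stream(token_ids, block_size):
    """Split a flat token stream into full blocks, discarding the remainder."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

stream = list(range(10))  # stand-in for the tokenized contents of train_data.txt
blocks = chunk_token_stream(stream, block_size=4)
print(blocks)
```

With this scheme no padding is needed, which is why the collator in the next cell only has to build labels for causal language modeling.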



In [16]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

In [20]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="/kaggle/output/gpt2-model",
    overwrite_output_dir=True,
    num_train_epochs=10,  # Adjust as needed
    per_device_train_batch_size=4,  # Adjust based on your GPU memory
    save_steps=10_000,
    save_total_limit=2,
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("m-fine_tuned_model")
tokenizer.save_pretrained("t-fine_tuned_model")



| Step | Training Loss |
|------|---------------|
| 500  | 2.5347 |
| 1000 | 2.5302 |
| 1500 | 2.5253 |
| 2000 | 2.5277 |
| 2500 | 2.5355 |
| 3000 | 2.5461 |
| 3500 | 2.5498 |
| 4000 | 2.5497 |
| 4500 | 2.5644 |
| 5000 | 2.5684 |
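
The logged loss barely moves and drifts slightly upward over these steps. Assuming it is the mean token-level cross-entropy in nats (the usual convention for causal language-model training), it can be converted to perplexity with `exp(loss)`. The values below are copied from the log above:

```python
import math

# Training loss per logging step, copied from the Trainer output.
logged = {500: 2.5347, 1000: 2.5302, 1500: 2.5253, 2000: 2.5277, 2500: 2.5355,
          3000: 2.5461, 3500: 2.5498, 4000: 2.5497, 4500: 2.5644, 5000: 2.5684}

# Perplexity is the exponential of the cross-entropy loss.
perplexity = {step: math.exp(loss) for step, loss in logged.items()}
for step, ppl in perplexity.items():
    print(f"step {step:>5}: loss {logged[step]:.4f} -> perplexity {ppl:.2f}")
```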




('t-fine_tuned_model/tokenizer_config.json',
 't-fine_tuned_model/special_tokens_map.json',
 't-fine_tuned_model/vocab.json',
 't-fine_tuned_model/merges.txt',
 't-fine_tuned_model/added_tokens.json')

## Inference

In [21]:
# Hugging Face Hub ID of the fine-tuned checkpoint (a local directory path also works)
fine_tuned_model_path = "alinourian/GPT2-SemEval2023"

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained(fine_tuned_model_path).to(device)
tokenizer = GPT2Tokenizer.from_pretrained(fine_tuned_model_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [22]:
model.eval()

input_text = "Model: ChatGPT"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
attention_mask = torch.ones_like(input_ids)


output = model.generate(
    input_ids, 
    attention_mask=attention_mask, 
    max_length=400, 
    num_beams=1, 
    temperature=0.8, 
    do_sample=True, 
    top_k=50, 
    top_p=0.95
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

generated_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
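
The prompt in the cell above uses `ChatGPT` and omits `<SOS>` and `<BOT>`, whereas the training strings were built as `<SOS> Model: chatGPT <BOT> Source: ...`. A prompt that mirrors the training template, plus a small parser for the decoded generation, might look like the sketch below. `build_prompt` and `parse_generation` are hypothetical helpers, and whether a template-matched prompt actually improves generations with this checkpoint is untested here:

```python
import re

def build_prompt(model_label: str) -> str:
    """Prefix in the same layout as the training strings, up to <BOT>."""
    return f"<SOS> Model: {model_label} <BOT>"

def parse_generation(generated: str):
    """Pull the source and text fields out of a decoded generation."""
    match = re.search(r"Source:\s*(\w+),\s*Text:\s*(.*)", generated, flags=re.DOTALL)
    if match is None:
        return None
    return {"source": match.group(1), "text": match.group(2).strip()}

prompt = build_prompt("chatGPT")
parsed = parse_generation(
    "Model: ChatGPT  <BOT>  Source: wikipedia, Text: The network offers programming."
)
print(prompt)
print(parsed)
```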


"Model: ChatGPT  <BOT>  Source: wikipedia, Text: The International Network of Broadcasters (INBO) is an international broadcasting network that provides quality television programming across over 30 countries around the world. Founded in 1979, the network offers programming including prime-time television, news, entertainment, education, science, technology, business, social, political, and cultural programming. INBO was initially funded by the US government through the National Broadcasting Treaty Act of 1987. Since then, INBO has grown to become one of the most successful broadcasting networks in the world. The network's programming is broadcast in over 30 countries worldwide, including France, Italy, Germany, Spain, Austria, Denmark, Norway, Sweden, Switzerland, and the United Kingdom. In addition to its programming, INBO also provides regular programming to international broadcasters, including major networks such as RTL, Telefonica, Deutsche Telekom, and Telemundo. INBO is committ