<h1> TDA course project </h1>

Name: Botond Ortutay

Pair: If you did this project as pair work, name the other student here, leave empty otherwise. If you work in pair, <b>both</b> hand out the same project report in Moodle.


## Environment & library info

This notebook assumes it is running in a Python venv with all the necessary libraries pre-installed.

It also assumes the following file structure:

```
.
├── project_template.ipynb
└── tda25-responses.jsonl.gz
```

The data has been downloaded from http://dl.turkunlp.org/tda-course-2025/tda25-responses.jsonl.gz .

In [1]:
# Importing libraries

# Reading the data
import gzip
import jsonlines
from typing import List, Dict

# Data-exploration & preparation
import pprint
import re

# Raw data to Huggingface datasets
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Tokenization
from transformers import DistilBertTokenizer

# The model
from transformers import DistilBertForSequenceClassification

# Training the model
from transformers import TrainingArguments, Trainer

# Evaluation
import evaluate
import numpy as np

### Function for opening jsonl.gz
Source: <br>
serialize_jsolines_gz.py, https://gist.github.com/luminoso/0581b7f6760ea9a26b06115c2993f351 <br>
by Guilherme Cardoso, https://github.com/luminoso <br>

In [2]:
def read_jsonl_gz(filename) -> List[Dict]:
    data = []
    with gzip.open(filename) as fp:
        j_reader = jsonlines.Reader(fp)

        for obj in j_reader:
            data.append(obj)

    return data

<h1> Step 1: Load the data with LLM judgements </h1>

### Loading in the data

In [3]:
myJsonlGz = read_jsonl_gz("tda25-responses.jsonl.gz")

### Data-exploration

In [4]:
# Just printing a few elements to know what we're dealing with...
for i in range(10):
    pprint.pprint(myJsonlGz[i])

{'document': 'Peeling an onion seems like an trivial task, but if you’ve never '
             'peeled an onion before, it can be quite intimidating. Don’t '
             'worry – it is pretty easy to peel an onion.\n'
             'You can now learn how to peel an onion by following these '
             'illustrated step-by-step instructions.\n'
             'Step #1: Put the whole onion on the cutting board\n'
             'Step 2: Cut off one end of the onion with a knife, as shown on '
             'the picture below:\n'
             'Here’s a picture of the onion with that end already cut off. The '
             'end of the onion is laying on the right side of the onion on the '
             'cutting board.\n'
             'Step 3: Cut off another end of the onion with a knife, as show '
             'on the picture below:\n'
             'After both ends of the onion have been cut off, the onion is '
             'ready to be peeled. Here’s the picture of the onion without its '
 

As we can see, each data item has a "document" and a "response" field. As defined in the instructions, the response consists of three things: an AI evaluation of whether the document is step-by-step data, an AI evaluation of whether the document is reasoning data, and a short summary of the document. The format can be described as:

```
Step-by-step: Yes/No
Training: Yes/No

Summary
```

We want to extract the AI evaluations into their own data fields for classifier training later. Because the format is well defined, a simple regex search can do this. There is one problem: the LLM sometimes inserts extra spaces and \*-characters into the response. I therefore also wrote a regex to remove these extra characters. With that, we can extract the relevant information into its own data fields and then turn this list of dictionaries into a proper Huggingface dataset. Before doing that, I looped through all the responses in the data and found a single item where the format is not followed:

In [5]:
pprint.pprint(myJsonlGz[460])

{'document': 'All our after-sale service staff is professional and patience so '
             "you don't need to have any worry anything about purchasing our "
             'IBM C1000-141 exam simulation: IBM Maximo Manage v8.x '
             'Administrator, The C1000-141 pdf training guide can help you to '
             'figure out the actual area where you are confused, IBM C1000-141 '
             'Latest Study Materials As the unprecedented intensity of talents '
             'comes in great numbers, what abilities should a talent of modern '
             'time possess and finally walk to the success, So with our '
             'C1000-141 exam questions, not only you can pass the exam with '
             'ease with 100% pass guarantee, but also you can learn the most '
             'professional and specilized knowledge in this field!\n'
             'With the help of our learning materials, especially the online '
             'practice Latest C1000-141 Exam Book exam, you can pra

As this is just a single item, I'll simply delete it.

In [6]:
del myJsonlGz[460]
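
The format check described above (looping through all responses to find items that do not follow the format) could be sketched as follows. This is a minimal sketch, not the code that was actually used to find item 460: the helpers `parse_judgements` and `find_malformed` are hypothetical, and the sketch assumes that the two judgements sit on the first two non-empty lines of the response.

```python
import re

def parse_judgements(response):
    """Return the two Yes/No judgements as a tuple of 1/0, or None if the format is not followed.

    Assumption: the judgements are on the first two non-empty lines of the response.
    """
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    values = []
    for ln in lines[:2]:
        _, sep, answer = ln.partition(":")
        # Drop the stray whitespace and asterisks the LLM sometimes inserts
        answer = re.sub(r"[\s*]", "", answer).lower()
        if not sep or answer not in ("yes", "no"):
            return None
        values.append(1 if answer == "yes" else 0)
    return tuple(values)

def find_malformed(items):
    """Return the indices of all items whose response cannot be parsed."""
    return [i for i, item in enumerate(items) if parse_judgements(item["response"]) is None]
```

Calling `find_malformed(myJsonlGz)` before the deletion would list the indices to inspect by hand. Note that deleting by a hard-coded index only works once per loaded list, so rerunning the deletion cell without reloading the data would remove a different item.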

Now that that's done, let's do some...

### Data preparation

In [7]:
for line in myJsonlGz:
    # Capture the text after each colon in the response; the first two matches
    # are the LLM's step-by-step and reasoning evaluations ("judgements")
    judgement = re.findall(r":(.*)", line["response"])

    # Remove the extra spaces and asterisks the LLM sometimes inserts
    cleanedJudgement = [re.sub(r"[ *]", "", i) for i in judgement]

    # cleanedJudgement[0]: does the document contain step-by-step data? (1 = yes, 0 = no)
    line["step-by-step"] = int(cleanedJudgement[0].lower() == "yes")

    # cleanedJudgement[1]: does the document contain reasoning data? (1 = yes, 0 = no)
    line["reasoning"] = int(cleanedJudgement[1].lower() == "yes")

In [8]:
# Calculating train-test-split
trainList, testList = train_test_split(myJsonlGz, test_size=0.2)

In [9]:
# Now we can create a dataset from our data
trainDs = Dataset.from_list(trainList)
testDs = Dataset.from_list(testList)

<h1> Step 2: Classifier training and evaluation </h1>


*   Which target did you choose?
*   Label distribution and majority baseline
*   Classifier performance
*   Manual inspection of the classifier output, what kinds of mistakes it makes?
*   What is the composition of the data we gave you? What does it mean for your results?
*   Conclusions
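
For the label distribution and majority baseline item in the list above, a minimal sketch could look like this. The helper `majority_baseline` is hypothetical and not part of the notebook; it only assumes that the labels are available as a plain list of integers.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (label counts, majority label, accuracy of always predicting the majority label)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return counts, majority_label, majority_count / len(labels)
```

With the datasets built earlier, this could be called as `majority_baseline(testDs["step-by-step"])`. Any classifier accuracy should then be read against the returned baseline accuracy.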




In [10]:
### Tokenization ###

# tokenizer import
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# tokenization function
def tokenize(dataset):
    """
    NOTE: DistilBERT has a processing limit of 512 tokens, and truncation cuts off everything beyond it. This 
    loses data whenever the input document is longer than 512 tokens. In the data exploration phase I noted that 
    some documents are indeed longer than 512 words, and since the tokenizer splits text into subword pieces a 
    document usually has at least as many tokens as words, so this does matter here. For now I'll keep using 
    DistilBERT and hope that the 512-token limit doesn't hurt performance too much (the first few hundred words 
    should be enough to decide whether a document has step-by-step instructions or not, but if performance is 
    obviously really bad, I might have to switch to an underlying model that can handle more tokens).
    """
    return tokenizer(dataset["document"], padding="max_length", truncation=True)

In [11]:
# tokenizing datasets
trainTokenized = trainDs.map(tokenize, batched=True)
testTokenized = testDs.map(tokenize, batched=True)

Map:   0%|          | 0/2011 [00:00<?, ? examples/s]

Map:   0%|          | 0/503 [00:00<?, ? examples/s]
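
The concern about the 512-token limit can be quantified before training. The following is a minimal sketch under stated assumptions: the helper `truncation_stats` and its `count_tokens` argument are hypothetical and not part of the notebook, and the caller decides how tokens are counted.

```python
def truncation_stats(documents, count_tokens, max_length=512):
    """Summarise how many documents exceed max_length tokens.

    count_tokens is any function that maps a document string to its token count.
    """
    lengths = [count_tokens(doc) for doc in documents]
    too_long = sum(1 for n in lengths if n > max_length)
    return {
        "n_documents": len(lengths),
        "n_truncated": too_long,
        "share_truncated": too_long / len(lengths) if lengths else 0.0,
        "max_length_seen": max(lengths, default=0),
    }

def whitespace_count(doc):
    """Rough stand-in counter: whitespace-separated words, not real subword tokens."""
    return len(doc.split())
```

With the notebook's tokenizer, `count_tokens` could be `lambda doc: len(tokenizer(doc)["input_ids"])`, which counts the special tokens as well. That wiring is an assumption and was not run here.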

In [12]:
# Defining labels
"""
After reading through a ton of unclear documentation, random Kaggle notebooks and whatever ChatGPT hallucinated, I 
now know that supervised fine-tuning with the transformers library requires the dataset to have a "label" 
column that contains integers. So basically I can't just say to the API, "Hey, my labels are in the 
"step-by-step" column"; instead I have to explicitly define a "label" column in my dataset. Ain't that just 
wonderful! So here is code that duplicates a given column as "label". 

PS: Labels have to be integers and they were booleans at first, so I modified and reran the data preparation code.
"""
LABELS_AT = "step-by-step"    # Change this if you want to train for the other prediction target
trainTokenized = trainTokenized.add_column("label", trainTokenized[LABELS_AT])
testTokenized = testTokenized.add_column("label", testTokenized[LABELS_AT])

print(testTokenized.column_names)

['document', 'response', 'step-by-step', 'reasoning', 'input_ids', 'attention_mask', 'label']


In [13]:
# Downloading the classification model. num_labels = 2 for binary classification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)

In [14]:
### Training ###

# Training arguments for the trainer
training_args = TrainingArguments("out", eval_strategy="epoch", logging_strategy="no", save_strategy="no")

In [15]:
# Setting up evaluation metrics for the trainer

# This is copied from the Huggingface documentation for training (fine-tuning), although I may modify it later
# URL: https://huggingface.co/docs/transformers/training

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # convert the logits to their predicted class
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
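
Accuracy alone can look good next to the majority baseline when the labels are imbalanced, so precision, recall and F1 for the positive class may be worth reporting as well. A minimal sketch in plain Python, assuming predictions and references are given as lists of 0/1 integers; the helper `binary_scores` is hypothetical and not part of the notebook or of the `evaluate` library.

```python
def binary_scores(predictions, references, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive and r == positive)
    fp = sum(1 for p, r in pairs if p == positive and r != positive)
    fn = sum(1 for p, r in pairs if p != positive and r == positive)
    correct = sum(1 for p, r in pairs if p == r)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Inside `compute_metrics`, the argmax predictions and the labels could be converted with `.tolist()` and passed to this helper.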

In [17]:
# Trainer setup
trainer = Trainer(model, training_args, train_dataset=trainTokenized, eval_dataset=testTokenized, compute_metrics=compute_metrics)

In [18]:
# Actually training (fine-tuning) the model
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

<h1> Bonus step </h1>

(leave empty if you do not do this)

*   Prompt design
*   Build (prompt,response pairs)
*   Turn into HF Dataset and save



In [None]:
#work here

<h1> Summary and Conclusions </h1>

* Brief TL;DR -style summary and main conclusions of your project.