# Try some LLMs on the task

The notebook tests how different pre-trained language models perform code generation from a textual problem description. Three prompting techniques are explored:

* Zero-shot: The model receives only the problem statement.

* One-shot: One example of a solved problem is provided.

* Few-shot: Multiple examples are given before asking the model to solve a new problem.

Three different LLMs are compared across these settings:


* TinyLlama-1.1B: 1.1B parameters (https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
* deepseek-coder-1.3b-base: 1.3B parameters (https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base)
* Mistral-7B-Instruct-v0.1: 7B parameters (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)



## Imports

This section installs and imports all the libraries the notebook needs.

In [None]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.1.1-py3-none-any.whl (19 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.1.1 sacrebleu-2.5.1


In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=cd3991a622edd739af4a9ae10905bdea7e64d0c09a9b9a6bc7ac3fc56813fa84
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
!pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-

In [None]:
!pip install -U transformers accelerate

Collecting accelerate
  Downloading accelerate-1.7.0-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.7.0-py3-none-any.whl (362 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 1.6.0
    Uninstalling accelerate-1.6.0:
      Successfully uninstalled accelerate-1.6.0
Successfully installed accelerate-1.7.0


In [None]:
!pip install -q -U langchain

In [None]:
!pip install -q langchain-community langchain-core

In [None]:
from google.colab import drive
import os
import torch
import pandas as pd
from transformers import pipeline
import sacrebleu
from rouge_score import rouge_scorer
import json


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [None]:
# HuggingFacePipeline is provided by the langchain-community package installed above
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

## Load Dataset

Load the dataset, which has already been preprocessed.

In [None]:
drive.mount('/content/drive')

path = 'Colab Notebooks/NLP/NLP_Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

Mounted at /content/drive


'/content/drive/.shortcut-targets-by-id/17WgJO1gfIBADpYX2jVdb41q7HCbwWcOU/NLP_Project'

In [None]:
df = pd.read_csv('final_ds.csv')
print(df.head())

                                 problem_description solution_id  \
0  Xenia has a set of weights and pan scales. Eac...         0_0   
1  Xenia has a set of weights and pan scales. Eac...         0_2   
2  Xenia has a set of weights and pan scales. Eac...         0_4   
3  Xenia has a set of weights and pan scales. Eac...         0_6   
4  Xenia has a set of weights and pan scales. Eac...         0_8   

                                       solution_code  \
0  __author__ = 'ratnesh.mishra'\n\nweights = map...   
1  import sys\nsys.setrecursionlimit (1000000)\n\...   
2  import sys\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...   
3  MOD = 10**9 + 7\nI = lambda:list(map(int,input...   
4  to_print = []\ndef dfs(d, ini, s, depth, m):\n...   

               problem_name time_complexity_inferred space_complexity_inferred  
0  339_C. Xenia and Weights                     O(1)                   O(n**2)  
1  339_C. Xenia and Weights                     O(1)                      O(1)  
2  339_C. X

In [None]:
df.shape

(244876, 6)

## Try TinyLlama on the task

The TinyLlama-1.1B-Chat-v1.0 is a compact open-source language model with 1.1 billion parameters, trained for chat and code generation tasks. This version is fine-tuned for multi-turn conversations and instruction-following behavior.

Here we initialize a Hugging Face text-generation pipeline with the pre-trained model.

In [None]:
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


This function creates a list of chat-style messages to prompt a language model for code generation, using zero-shot, one-shot or few-shot prompting strategies.

The returned messages list is formatted with "system", "user", and "assistant" roles.

In [None]:
def build_messages(df, index, mode='zero-shot', num_few_shot=2, random_state=0):
    problem = clean_problem_description(df.loc[index, 'problem_description'])
    messages = [
        {
            "role": "system",
            "content": "You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation."
        }
    ]

    if mode == 'zero-shot':
        messages.append({
            "role": "user",
            "content": f"Can you please solve the problem below by writing the corresponding Python code.\n\n"
                       f"### Problem:\n{problem}\n### Your Solution:"
        })

    elif mode == 'one-shot':
        example_idx = df.drop(index).sample(1, random_state=random_state).index[0]
        example_problem = clean_problem_description(df.loc[example_idx, 'problem_description'])
        example_solution = df.loc[example_idx, 'solution_code']

        messages.append({
            "role": "user",
            "content": f"### Problem:\n{example_problem}\n### Solution Expected:"
        })
        messages.append({
            "role": "assistant",
            "content": f"{example_solution}"
        })
        messages.append({
            "role": "user",
            "content": (
                "Now please solve the problem below by writing the corresponding Python code.\n\n"
                f"### Problem:\n{problem}\n### Your Solution:"
            )
        })

    elif mode == 'few-shot':
        example_indices = df.drop(index).sample(num_few_shot, random_state=random_state).index

        for i in example_indices:
            example_problem = clean_problem_description(df.loc[i, 'problem_description'])
            example_solution = df.loc[i, 'solution_code']
            messages.append({
                "role": "user",
                "content": f"### Problem:\n{example_problem}\n### Solution Expected:"
            })
            messages.append({
                "role": "assistant",
                "content": f"{example_solution}"
            })

        messages.append({
            "role": "user",
            "content": (
                "Now please solve the problem below by writing the corresponding Python code.\n\n"
                f"### Problem:\n{problem}\n### Your Solution:"
            )
        })

    return messages
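
Chat templates generally expect the roles to alternate between "user" and "assistant" after the system message, ending on a "user" turn. The check below is a small sketch of that assumption; `roles_are_valid` is a hypothetical helper, not part of the notebook or of transformers.

```python
def roles_are_valid(messages):
    """Return True if messages are system, then user/assistant alternating, ending on user."""
    roles = [m["role"] for m in messages]
    if not roles or roles[0] != "system" or roles[-1] != "user":
        return False
    body = roles[1:]
    # After the system message the turns should go user, assistant, user, ...
    expected = ["user" if i % 2 == 0 else "assistant" for i in range(len(body))]
    return body == expected


# Shape of a one-shot prompt: system, example question, example answer, real question.
one_shot_shape = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "example problem"},
    {"role": "assistant", "content": "example solution"},
    {"role": "user", "content": "problem to solve"},
]
print(roles_are_valid(one_shot_shape))
```

Running this check on the output of `build_messages` for each mode is a cheap way to catch a malformed prompt before it reaches the model.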


This function uses the chat-formatted messages to build a prompt and pass it to the language model for code generation. The following generation parameters are used:

* max_new_tokens=256: limits the number of new tokens the model can generate
* do_sample=True: enables sampling, which allows the model to produce more diverse and creative outputs instead of always choosing the most likely next token.
* temperature=0.7: controls the randomness of the predictions.
* top_k=50: restricts sampling to the top 50 most probable tokens at each step.
* top_p=0.95: enables nucleus sampling, where the model considers only the smallest possible set of tokens whose cumulative probability is at least 95%.

The function returns the pipeline's generated text as a plain string, which contains the prompt followed by the model's completion.
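
To make the interaction of temperature, top_k and top_p concrete, here is a simplified sketch on a toy distribution. It is not the transformers implementation, which works on logit tensors; `filter_distribution` and the toy logits are illustrative assumptions.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token: probability} after temperature scaling, top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # top-k: keep the k most probable tokens and renormalize them.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)
    ranked = [(tok, p / k_total) for tok, p in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1 before sampling.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.85))
```

With these toy values, top_k=3 drops "d" and top_p=0.85 then drops "c", so sampling would only ever pick "a" or "b".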

In [None]:
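def generate_solution(messages, max_new_tokens=256):
    # Render the chat messages into a single prompt string using the model's chat template
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Pass the max_new_tokens argument through instead of hard-coding 256
    outputs = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    return outputs[0]["generated_text"]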
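
The printed generations in this notebook start with the prompt itself, because `generated_text` holds the prompt plus the completion. If only the completion is wanted (for example before computing metrics), one option is to strip the prompt prefix. The helper below is a hypothetical sketch, not part of the notebook; the text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the prompt prefix from the generated text, if it is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Leave the text untouched when the prefix does not match exactly.
    return generated_text


demo_prompt = "<|user|>\nSolve the problem.</s>\n<|assistant|>\n"
demo_output = demo_prompt + "print(42)"
print(strip_prompt(demo_output, demo_prompt))
```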

This function evaluates the similarity between a generated solution and a reference solution:
* BLEU: Uses sacrebleu to compute a corpus-level BLEU score (modified n-gram precision with a brevity penalty), on a 0-100 scale.
* ROUGE-1 / ROUGE-L: Measures unigram overlap and longest-common-subsequence overlap; the F-measure of each is reported, on a 0-1 scale.
* Exact Match: Checks if the strings are exactly equal after stripping leading and trailing whitespace.

Returns a dictionary with all four metrics.

In [None]:
def evaluate_metrics(reference, hypothesis):
    bleu_score = sacrebleu.corpus_bleu([hypothesis], [[reference]]).score
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = scorer.score(reference, hypothesis)
    exact = int(reference.strip() == hypothesis.strip())
    return {
        'bleu': bleu_score,
        'rouge1': rouge_scores['rouge1'].fmeasure,
        'rougeL': rouge_scores['rougeL'].fmeasure,
        'exact_match': exact
    }
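
To show what the ROUGE-1 F-measure captures, here is a rough standard-library approximation based on whitespace tokens. It is only a sketch: the rouge_score package uses its own tokenizer and optional stemming, so its numbers will differ. `unigram_f1` is a hypothetical name.

```python
from collections import Counter


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Approximate ROUGE-1 F-measure using whitespace-separated tokens."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("a b c d", "a b"))
```

Because precision is divided by the hypothesis length, any extra text in the hypothesis lowers the score even when the relevant tokens are all present.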

This function removes the example inputs and outputs from the problem description, so that the one-shot and few-shot prompts are less confusing for the model.

In [None]:
def clean_problem_description(text: str) -> str:
    lines = text.strip().splitlines()
    result = []
    for line in lines:
        if line.strip().lower().startswith("examples"):
            break
        result.append(line)
    return "\n".join(result).strip()
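
A quick illustration of the truncation on a made-up problem statement. The function is repeated here only so the snippet runs on its own; the toy text is an assumption, not a row from the dataset.

```python
def clean_problem_description(text: str) -> str:
    # Copy of the notebook function: keep every line before the "Examples" header.
    lines = text.strip().splitlines()
    result = []
    for line in lines:
        if line.strip().lower().startswith("examples"):
            break
        result.append(line)
    return "\n".join(result).strip()


toy_description = (
    "Add two numbers.\n\n"
    "Input\n\na b\n\n"
    "Examples\n\nInput\n\n1 2\n\nOutput\n\n3"
)
cleaned = clean_problem_description(toy_description)
print(cleaned)
```

Everything from the "Examples" header onward is dropped, while the task statement and input format are kept.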


Here we draw one random sample from the dataset so we can test the three prompt types.

In [None]:
example = df.sample(n=1, random_state=40)

In [None]:
n = example.index[0]

In [None]:
n

np.int64(15399)

In [None]:
example_text = df.loc[n]  # n is an index label, so select by label

In [None]:
clean_text = clean_problem_description(df.loc[n, 'problem_description'])

In [None]:
clean_text

'There are H rows and W columns of white square cells.\n\nYou will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.\n\nHow many white cells will remain?\n\nIt can be proved that this count does not depend on what rows and columns are chosen.\n\nConstraints\n\n* All values in input are integers.\n* 1 \\leq H, W \\leq 20\n* 1 \\leq h \\leq H\n* 1 \\leq w \\leq W\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nH W\nh w\n\n\nOutput\n\nPrint the number of white cells that will remain.'

In [None]:
print(example_text['problem_description'])

There are H rows and W columns of white square cells.

You will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.

How many white cells will remain?

It can be proved that this count does not depend on what rows and columns are chosen.

Constraints

* All values in input are integers.
* 1 \leq H, W \leq 20
* 1 \leq h \leq H
* 1 \leq w \leq W

Input

Input is given from Standard Input in the following format:


H W
h w


Output

Print the number of white cells that will remain.

Examples

Input

3 2
2 1


Output

1


Input

5 5
2 3


Output

6


Input

2 4
2 4


Output

0


Here we can see the description of the problem we are going to use to test the prompts. The program receives four numbers a, b, c and d as inputs (H, W, h and w in the statement) and should output (a-c)*(b-d) as the number of remaining white cells.

In [None]:
example_text['solution_code']

'a,b=map(int,input().split())\nc,d=map(int,input().split())\nprint((a-c)*(b-d))'
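
As a sanity check on the closed form (H-h)*(W-w), the sketch below counts the unpainted cells by brute force and compares the result with the formula and with the three sample cases from the problem statement. Painting the first h rows and first w columns is an arbitrary choice, which is fine because the statement says the count does not depend on the choice.

```python
def remaining_white_cells(H, W, h, w):
    """Count cells that are in neither a painted row nor a painted column."""
    painted_rows = set(range(h))  # paint the first h rows
    painted_cols = set(range(w))  # paint the first w columns
    return sum(
        1
        for r in range(H)
        for c in range(W)
        if r not in painted_rows and c not in painted_cols
    )


# Sample cases from the problem statement: (H, W, h, w) -> expected answer
samples = {(3, 2, 2, 1): 1, (5, 5, 2, 3): 6, (2, 4, 2, 4): 0}
for (H, W, h, w), expected in samples.items():
    assert remaining_white_cells(H, W, h, w) == expected == (H - h) * (W - w)
print("closed form matches brute force on all samples")
```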

First we build the zero-shot prompt by calling the build_messages function with mode='zero-shot'.

In [None]:
zero_shot_messages = build_messages(df, example.index[0], mode='zero-shot')
print(zero_shot_messages)

[{'role': 'system', 'content': 'You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation.'}, {'role': 'user', 'content': 'Can you please solve the problem below by writing the corresponding Python code.\n\n### Problem:\nThere are H rows and W columns of white square cells.\n\nYou will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.\n\nHow many white cells will remain?\n\nIt can be proved that this count does not depend on what rows and columns are chosen.\n\nConstraints\n\n* All values in input are integers.\n* 1 \\leq H, W \\leq 20\n* 1 \\leq h \\leq H\n* 1 \\leq w \\leq W\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nH W\nh w\n\n\nOutput\n\nPrint the number of white cells that will remain.\n### Your Solution:'}]


Now we pass this prompt to the model and print the generated output.

In [None]:
zero_shot_generated = generate_solution(zero_shot_messages)
print(zero_shot_generated)

<|system|>
You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation.</s>
<|user|>
Can you please solve the problem below by writing the corresponding Python code.

### Problem:
There are H rows and W columns of white square cells.

You will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.

How many white cells will remain?

It can be proved that this count does not depend on what rows and columns are chosen.

Constraints

* All values in input are integers.
* 1 \leq H, W \leq 20
* 1 \leq h \leq H
* 1 \leq w \leq W

Input

Input is given from Standard Input in the following format:


H W
h w


Output

Print the number of white cells that will remain.
### Your Solution:</s>
<|assistant|>
Here's the Python code for the problem:

```python
import sys

# Check input format
if len(sys.argv)!= 2:
    print("Usage: python3 solution.py <input

The code generated by the model for this problem was:

In [None]:
import sys

# Check input format
if len(sys.argv)!= 2:
    print("Usage: python3 solution.py <input_file_name>")
    sys.exit(1)

try:
    input_file = open(sys.argv[1], "r")
except FileNotFoundError:
    print("Input file not found!")
    sys.exit(1)

try:
    n = int(input_file.readline())
    h = int(input_file.readline())
    w = int(input_file.readline())
    input_file.close()
except ValueError:
    print("Invalid input format!")
    sys.exit(1)

# Check if all values are integers
while h > 0 or w > 0:
    try:
        h = int(input_file.readline())
    except ValueError:
        print("Invalid input format!")
        sys.exit(1)
    try:
        w = int(input_file.read

The generated code attempts to read input from a file passed via sys.argv, which is incorrect for this problem because the input arrives on standard input. It also closes the file before it has finished reading, contains incomplete and invalid fragments, and neither solves the problem nor prints the expected output.

However, we can see that the model is already generating Python code with a reasonable structure, including basic input handling, error checking, and use of standard libraries.

Now we check whether the model performs better with the one-shot approach. In this cell we build the one-shot prompt, which also includes a randomly sampled example.

In [None]:
one_shot_messages = build_messages(df, example.index[0], mode='one-shot')
print(one_shot_messages)

[{'role': 'system', 'content': 'You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation.'}, {'role': 'user', 'content': '### Problem:\nYou are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.\n\nAt least how many operations are necessary to satisfy the following conditions?\n\n* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.\n* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.\n\nConstraints\n\n* 2 ≤ n ≤ 10^5\n* |a_i| ≤ 10^9\n* Each a_i is an integer.\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nn\na_1 a_2 ... a_n\n\n\nOutput\n\nPrint the minimum necessary count of operations.\n### 

And now we generate the output for this prompt

In [None]:
one_shot_generated = generate_solution(one_shot_messages)
print(one_shot_generated)

<|system|>
You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation.</s>
<|user|>
### Problem:
You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.
### Solution Expected:</s>
<|assistant|>
N = int(input())
a = list(map

Now the code generated was:

In [None]:
N = int(input())
h = int(input())
w = int(input())

if h * w <= 0:
    print(1)
else:
    cells_remaining = 0
    for row in range(h):
        for col in range(w):
            if row * w + col <= h * w:
                cells_remaining += 1
    print(cells_remaining)

The output is still well-structured Python, but its logic does not solve the proposed problem. After the code, the model also tried to explain the logic behind it.

The last approach is few-shot, giving the model two examples instead of just one.

In [None]:
few_shot_messages = build_messages(df, example.index[0], mode='few-shot')
print(few_shot_messages)

[{'role': 'system', 'content': 'You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation.'}, {'role': 'user', 'content': '### Problem:\nYou are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.\n\nAt least how many operations are necessary to satisfy the following conditions?\n\n* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.\n* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.\n\nConstraints\n\n* 2 ≤ n ≤ 10^5\n* |a_i| ≤ 10^9\n* Each a_i is an integer.\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nn\na_1 a_2 ... a_n\n\n\nOutput\n\nPrint the minimum necessary count of operations.\n### 

In [None]:
few_shot_generated = generate_solution(few_shot_messages)
print(few_shot_generated)

<|system|>
You are an assistent who helps solve programming problems by writing the corresponding Python code. Please respond with the code only, without explanation.</s>
<|user|>
### Problem:
You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.
### Solution Expected:</s>
<|assistant|>
N = int(input())
a = list(map

The code generated this time was:

In [None]:
N=int(input())
h=int(input()[0])
w=int(input()[1])

# Find the number of white cells in H rows and W columns
rows_num=0
for I in range(h):
    for j in range(w):
        if input()[i*w+j] == 'w':
            rows_num+=1

# Calculate the number of white cells in the remaining H rows and W columns
remaining_rows_num=0
for I in range(h-1, -1, -1):
    for j in range(w):
        if input()[i*w+j] == 'w':
            remaining_rows_num+=1

# Calculate the number of white cells remaining in H rows and W columns
remaining_cells_num=rows_num-remaining_rows_num

print(remaining_cells_num)

The few-shot output better aligns with the grid structure of the problem and introduces more descriptive variable names. However, it still misinterprets the input format and fails to implement the correct logic.

Now we run the three prompting approaches on five random samples to collect some metrics. Note that the few-shot setting here uses three examples (num_few_shot=3).

In [None]:
results = {'zero-shot': [], 'one-shot': [], 'few-shot': []}

In [None]:
sample_indices = df.sample(5, random_state=0).index

In [None]:
saved_results = []

for mode in ['zero-shot', 'one-shot', 'few-shot']:
    for idx in sample_indices:
        messages = build_messages(df, idx, mode=mode, num_few_shot=3)
        generated = generate_solution(messages, max_new_tokens=256)

        expected = df.loc[idx, 'solution_code']

        metrics = evaluate_metrics(expected, generated)

        print(f"Mode: {mode}, Example {idx}")
        #print("Generated:\n", generated)
        #print("Expected:\n", expected)
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")
        print("\n")

        results[mode].append(metrics)

        saved_results.append({
            'mode': mode,
            'index': idx,
            'problem_description': df.loc[idx, 'problem_description'],
            'expected_solution': expected,
            'generated_solution': generated,
            'bleu': metrics['bleu'],
            'rouge1': metrics['rouge1'],
            'rougeL': metrics['rougeL'],
            'exact_match': metrics['exact_match']
        })


Mode: zero-shot, Example 219235
bleu: 1.1744
rouge1: 0.0337
rougeL: 0.0337
exact_match: 0.0000


Mode: zero-shot, Example 116625
bleu: 4.5137
rouge1: 0.1049
rougeL: 0.0918
exact_match: 0.0000


Mode: zero-shot, Example 161440
bleu: 1.8577
rouge1: 0.1154
rougeL: 0.0726
exact_match: 0.0000


Mode: zero-shot, Example 177012
bleu: 8.3524
rouge1: 0.2725
rougeL: 0.1755
exact_match: 0.0000


Mode: zero-shot, Example 26857
bleu: 7.2144
rouge1: 0.2652
rougeL: 0.1215
exact_match: 0.0000


Mode: one-shot, Example 219235
bleu: 1.2996
rouge1: 0.0182
rougeL: 0.0182
exact_match: 0.0000




You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Mode: one-shot, Example 116625
bleu: 2.4819
rouge1: 0.0784
rougeL: 0.0610
exact_match: 0.0000


Mode: one-shot, Example 161440
bleu: 1.9589
rouge1: 0.1223
rougeL: 0.0699
exact_match: 0.0000


Mode: one-shot, Example 177012
bleu: 5.9250
rouge1: 0.2358
rougeL: 0.1440
exact_match: 0.0000


Mode: one-shot, Example 26857
bleu: 4.6816
rouge1: 0.2055
rougeL: 0.1054
exact_match: 0.0000


Mode: few-shot, Example 219235
bleu: 0.6755
rouge1: 0.0095
rougeL: 0.0095
exact_match: 0.0000




This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Mode: few-shot, Example 116625
bleu: 1.6039
rouge1: 0.0285
rougeL: 0.0268
exact_match: 0.0000


Mode: few-shot, Example 161440
bleu: 1.8570
rouge1: 0.0734
rougeL: 0.0500
exact_match: 0.0000


Mode: few-shot, Example 177012
bleu: 3.7599
rouge1: 0.1353
rougeL: 0.0844
exact_match: 0.0000


Mode: few-shot, Example 26857
bleu: 4.0815
rouge1: 0.1560
rougeL: 0.0705
exact_match: 0.0000




In [None]:
for mode in results:
    avg_metrics = {}
    for metric in results[mode][0].keys():
        avg_metrics[metric] = sum([ex[metric] for ex in results[mode]]) / len(results[mode])

    print(f"\nAverage metrics for {mode}:")
    for metric, value in avg_metrics.items():
        print(f"{metric}: {value:.4f}")


Average metrics for zero-shot:
bleu: 4.6225
rouge1: 0.1583
rougeL: 0.0990
exact_match: 0.0000

Average metrics for one-shot:
bleu: 3.2694
rouge1: 0.1321
rougeL: 0.0797
exact_match: 0.0000

Average metrics for few-shot:
bleu: 2.3955
rouge1: 0.0805
rougeL: 0.0482
exact_match: 0.0000
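
The same aggregation can be written as a small reusable function. The sketch below is a hypothetical helper, checked here against the five zero-shot BLEU scores printed in the log above.

```python
def average_metrics(per_example):
    """Average each metric over a non-empty list of metric dictionaries."""
    keys = per_example[0].keys()
    return {k: sum(ex[k] for ex in per_example) / len(per_example) for k in keys}


# Zero-shot BLEU scores copied from the per-example output above.
zero_shot_bleu = [
    {"bleu": 1.1744},
    {"bleu": 4.5137},
    {"bleu": 1.8577},
    {"bleu": 8.3524},
    {"bleu": 7.2144},
]
print(f"{average_metrics(zero_shot_bleu)['bleu']:.4f}")  # 4.6225, the reported zero-shot average
```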


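These results show that zero-shot outperforms one-shot and few-shot on BLEU, ROUGE-1, and ROUGE-L, while exact match is zero in every setting. This suggests that providing examples (one-shot or few-shot) did not help the model improve code generation quality for this task, and may have introduced noise or confusion.
Possible causes include poorly formatted prompts and overly long or complex examples: the warning in the output above shows that at least one few-shot call exceeded the model's maximum length of 2048 tokens. Two caveats apply. The scores are computed on the full pipeline output, which includes the prompt text (as the printed generations show), so longer prompts tend to lower the overlap with the reference regardless of the quality of the generated code. Each average is also based on only five samples.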

In [None]:
df_results = pd.DataFrame(saved_results)
df_results.to_csv('generated_codes.csv', index=False)

In [None]:
with open('generated_codes.json', 'w', encoding='utf-8') as f:
    json.dump(saved_results, f, ensure_ascii=False, indent=4)

## Try deepseek-coder on the task

DeepSeek-Coder-1.3B-Base is a 1.3 billion parameter language model specialized in code generation and understanding. It is trained on a diverse set of programming languages and designed to assist with tasks such as code completion, bug fixing, and problem-solving by generating syntactically correct and contextually relevant code snippets.

We start by loading the model and the tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True).cuda()


This function builds the prompt for each of the approaches. It aims for a well-structured prompt with clearly separated examples.

In [None]:
def build_prompt_deepseek(df, index, mode='zero-shot', num_few_shot=3, random_state=0):
    problem = clean_problem_description(df.loc[index, 'problem_description'])
    if mode == 'zero-shot':
        prompt = f"# Task: {problem}\n# Solution:\n"

    elif mode == 'one-shot':
        example_idx = df.drop(index).sample(1, random_state=random_state).index[0]
        example_problem = clean_problem_description(df.loc[example_idx, 'problem_description'])
        example_solution = df.loc[example_idx, 'solution_code']
        prompt = (
            f"# === Example 1 ===\n"
            f"# Task: {example_problem}\n# Solution:\n{example_solution}\n\n"
            f"# Task: {problem}\n# Solution:\n"
        )

    elif mode == 'few-shot':
        example_indices = df.drop(index).sample(num_few_shot, random_state=random_state).index
        prompt = ""
        for j, i in enumerate(example_indices):
            example_problem = clean_problem_description(df.loc[i, 'problem_description'])
            example_solution = df.loc[i, 'solution_code']
            prompt += (
                f"# === Example {j+1} ===\n"
                f"# Task: {example_problem}\n"
                f"# Solution:\n{example_solution}\n\n"
            )
        prompt += f"# Task: {problem}\n# Solution:\n"

    return prompt


This function generates the model output from the prompt built above and returns it as a string.

In [None]:
def generate_solution_deepseek(prompt, max_new_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
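Since the decoded string contains the prompt followed by the completion, it can help to separate the two at the token level. The sketch below illustrates the idea on plain Python lists; the function name `strip_prompt_tokens` and the toy token ids are hypothetical and only stand in for real tokenizer output.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the token ids that were generated after the prompt."""
    return output_ids[prompt_length:]


# Toy ids standing in for a real tokenizer's output (hypothetical values).
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]  # a causal LM returns prompt + new tokens

completion_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(completion_ids)
```

In the function above, the same idea would amount to decoding `outputs[0][inputs["input_ids"].shape[1]:]` instead of `outputs[0]`. The notebook keeps the full decoded text, so the outputs shown in the following cells include the prompt.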

We will use the same example problem as before for the demonstration.

In [None]:
n

np.int64(14480)

In [None]:
example

Unnamed: 0,problem_description,solution_id,solution_code,problem_name,time_complexity_inferred,space_complexity_inferred
14480,There are H rows and W columns of white square...,190_3,"a,b=map(int,input().split())\nc,d=map(int,inpu...",p03101 AtCoder Beginner Contest 121 - White Cells,O(1),O(1)


In [None]:
zero_shot_prompt = build_prompt_deepseek(df, example.index[0], mode='zero-shot')
print(zero_shot_prompt)

# Task: There are H rows and W columns of white square cells.

You will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.

How many white cells will remain?

It can be proved that this count does not depend on what rows and columns are chosen.

Constraints

* All values in input are integers.
* 1 \leq H, W \leq 20
* 1 \leq h \leq H
* 1 \leq w \leq W

Input

Input is given from Standard Input in the following format:


H W
h w


Output

Print the number of white cells that will remain.
# Solution:



In [None]:
zero_shot_generated = generate_solution_deepseek(zero_shot_prompt)
print(zero_shot_generated)

# Task: There are H rows and W columns of white square cells.

You will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.

How many white cells will remain?

It can be proved that this count does not depend on what rows and columns are chosen.

Constraints

* All values in input are integers.
* 1 \leq H, W \leq 20
* 1 \leq h \leq H
* 1 \leq w \leq W

Input

Input is given from Standard Input in the following format:


H W
h w


Output

Print the number of white cells that will remain.
# Solution:


def white_cells(H, W, h, w):
    # Write your code here
    return (H - h) * (W - w)


if __name__ == '__main__':
    H, W = map(int, input().split())
    h, w = map(int, input().split())
    print(white_cells(H, W, h, w))


The code generated by the zero-shot approach was:

In [None]:
def white_cells(H, W, h, w):
    # Write your code here
    return (H - h) * (W - w)


if __name__ == '__main__':
    H, W = map(int, input().split())
    h, w = map(int, input().split())
    print(white_cells(H, W, h, w))

This correctly solves the proposed problem.

TinyLlama, being a general-purpose language model with fewer parameters and less specialized training on code, may struggle with understanding problem constraints and generating correct, concise solutions. In contrast, DeepSeek-Coder is specifically trained on diverse programming tasks and code generation, enabling it to better grasp problem requirements and produce accurate, efficient code.

In [None]:
one_shot_prompt = build_prompt_deepseek(df, example.index[0], mode='one-shot')
print(one_shot_prompt)

# === Example 1 ===
# Task: You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.
# Solution:
N = int(input())
a = list(map(int, input().split()))

ans1 = 0
s = 0
flg = 1
for ai in a:
    s += ai
    if s * flg <= 0:
        ans1 += abs(s) + 1
        s = flg
    flg *= -1

ans2 = 0
s = 0
flg = -1
for ai in a:
    s

In [None]:
one_shot_generated = generate_solution_deepseek(one_shot_prompt)
print(one_shot_generated)

# === Example 1 ===
# Task: You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.
# Solution:
N = int(input())
a = list(map(int, input().split()))

ans1 = 0
s = 0
flg = 1
for ai in a:
    s += ai
    if s * flg <= 0:
        ans1 += abs(s) + 1
        s = flg
    flg *= -1

ans2 = 0
s = 0
flg = -1
for ai in a:
    s

The solution generated by the one-shot prompt was:

In [None]:
# Solution:
H, W = map(int, input().split())
h, w = map(int, input().split())
print((H - h + 1) * (W - w + 1) - len(set(input().split())))

The generated solution overcomplicates the problem by using an incorrect formula and expecting extra input that is not required. This likely happened because the one-shot example used in the prompt added noise or confusion, leading the model to misinterpret the problem’s requirements and produce a more complex, incorrect solution than necessary.

In [None]:
few_shot_prompt = build_prompt_deepseek(df, example.index[0], mode='few-shot')
print(few_shot_prompt)

# === Example 1 ===
# Task: You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.
# Solution:
N = int(input())
a = list(map(int, input().split()))

ans1 = 0
s = 0
flg = 1
for ai in a:
    s += ai
    if s * flg <= 0:
        ans1 += abs(s) + 1
        s = flg
    flg *= -1

ans2 = 0
s = 0
flg = -1
for ai in a:
    s

In [None]:
few_shot_generated = generate_solution_deepseek(few_shot_prompt)
print(few_shot_generated)

# === Example 1 ===
# Task: You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.
# Solution:
N = int(input())
a = list(map(int, input().split()))

ans1 = 0
s = 0
flg = 1
for ai in a:
    s += ai
    if s * flg <= 0:
        ans1 += abs(s) + 1
        s = flg
    flg *= -1

ans2 = 0
s = 0
flg = -1
for ai in a:
    s

The code generated by the model was:

In [None]:
H,W=map(int,input().split())
h,w=map(int,input().split())
print((H-h)*(W-w))

This correctly solves the problem and is almost identical to the solution provided in the dataset:

In [None]:
a,b=map(int,input().split())
c,d=map(int,input().split())
print((a-c)*(b-d))

However, after producing this solution the model kept generating new tasks and solutions, as if it were continuing the prompt. Long or confusing examples can make it harder for the model to recognize where to stop.
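One way to limit the effect of this run-on generation is to post-process the text and keep only what comes before the next task or example header. The following is a minimal sketch, assuming the `# Task:` and `# === Example` markers used by `build_prompt_deepseek`; the helper name `extract_first_solution` and the constant `STOP_MARKERS` are hypothetical and not part of the notebook.

```python
STOP_MARKERS = ("# Task:", "# === Example")  # headers used in the prompt format


def extract_first_solution(generated, prompt, stop_markers=STOP_MARKERS):
    """Drop the echoed prompt and cut the completion at the first stop marker."""
    # Remove the prompt only if the output really starts with it.
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    cut = len(completion)
    for marker in stop_markers:
        pos = completion.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return completion[:cut].strip()


demo_prompt = "# Task: add two numbers\n# Solution:\n"
demo_output = (
    demo_prompt
    + "a, b = map(int, input().split())\nprint(a + b)\n\n"
    + "# Task: something else\n# Solution:\nprint(0)\n"
)
print(extract_first_solution(demo_output, demo_prompt))
```

This is a text-level heuristic: it would also cut a legitimate solution that happens to contain one of the markers in a comment.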

In [None]:
results_deepseek = {'zero-shot': [], 'one-shot': [], 'few-shot': []}

In [None]:
saved_results_deepseek = []

for mode in ['zero-shot', 'one-shot', 'few-shot']:
    for idx in sample_indices:
        prompt = build_prompt_deepseek(df, idx, mode=mode, num_few_shot=3)
        generated = generate_solution_deepseek(prompt, max_new_tokens=256)

        expected = df.loc[idx, 'solution_code']

        metrics = evaluate_metrics(expected, generated)

        print(f"Mode: {mode}, Example {idx}")
        #print("Generated:\n", generated)
        #print("Expected:\n", expected)
        for metric, value in metrics.items():
            print(f"{metric}: {value:.4f}")
        print("\n")

        results_deepseek[mode].append(metrics)

        saved_results_deepseek.append({
            'mode': mode,
            'index': idx,
            'problem_description': df.loc[idx, 'problem_description'],
            'expected_solution': expected,
            'generated_solution': generated,
            'bleu': metrics['bleu'],
            'rouge1': metrics['rouge1'],
            'rougeL': metrics['rougeL'],
            'exact_match': metrics['exact_match']
        })


Mode: zero-shot, Example 219235
bleu: 3.1388
rouge1: 0.0556
rougeL: 0.0556
exact_match: 0.0000


Mode: zero-shot, Example 116625
bleu: 2.3394
rouge1: 0.1863
rougeL: 0.1242
exact_match: 0.0000


Mode: zero-shot, Example 161440
bleu: 8.3381
rouge1: 0.1355
rougeL: 0.1030
exact_match: 0.0000


Mode: zero-shot, Example 177012
bleu: 11.4481
rouge1: 0.2696
rougeL: 0.1569
exact_match: 0.0000


Mode: zero-shot, Example 26857
bleu: 14.0492
rouge1: 0.3170
rougeL: 0.1445
exact_match: 0.0000


Mode: one-shot, Example 219235
bleu: 1.2779
rouge1: 0.0173
rougeL: 0.0173
exact_match: 0.0000


Mode: one-shot, Example 116625
bleu: 2.8054
rouge1: 0.0677
rougeL: 0.0558
exact_match: 0.0000


Mode: one-shot, Example 161440
bleu: 5.9441
rouge1: 0.1477
rougeL: 0.0937
exact_match: 0.0000


Mode: one-shot, Example 177012
bleu: 11.4539
rouge1: 0.2831
rougeL: 0.1788
exact_match: 0.0000


Mode: one-shot, Example 26857
bleu: 4.1924
rouge1: 0.1988
rougeL: 0.0906
exact_match: 0.0000


Mode: few-shot, Example 219235
ble

Having run the three prompting approaches on a few samples, we now average the metrics for each mode.

In [None]:
for mode in results_deepseek:
    avg_metrics = {}
    for metric in results_deepseek[mode][0].keys():
        avg_metrics[metric] = sum([ex[metric] for ex in results_deepseek[mode]]) / len(results_deepseek[mode])

    print(f"\nAverage metrics for {mode}:")
    for metric, value in avg_metrics.items():
        print(f"{metric}: {value:.4f}")


Average metrics for zero-shot:
bleu: 7.8627
rouge1: 0.1928
rougeL: 0.1168
exact_match: 0.0000

Average metrics for one-shot:
bleu: 5.1347
rouge1: 0.1429
rougeL: 0.0872
exact_match: 0.0000

Average metrics for few-shot:
bleu: 2.6901
rouge1: 0.0828
rougeL: 0.0537
exact_match: 0.0000
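As a quick sanity check, the averages can be recomputed from the per-example scores printed by the evaluation loop. The sketch below does this for BLEU in the zero-shot and one-shot modes only, because the few-shot per-example output is truncated above; the dictionary name `bleu_scores` is just for illustration.

```python
from statistics import mean

# Per-example BLEU values copied from the evaluation output above.
bleu_scores = {
    "zero-shot": [3.1388, 2.3394, 8.3381, 11.4481, 14.0492],
    "one-shot": [1.2779, 2.8054, 5.9441, 11.4539, 4.1924],
}

averages = {mode: round(mean(scores), 4) for mode, scores in bleu_scores.items()}
for mode, value in averages.items():
    print(f"{mode}: {value:.4f}")
```

The recomputed values agree with the reported averages for these two modes.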


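Again, the few-shot approach scores worse than the other two. One likely cause is that the model keeps generating after the solution, as in the example above. In addition, `generate_solution_deepseek` returns the prompt together with the completion, so the longer one-shot and few-shot prompts are part of the text that is compared against the expected solution, which probably lowers the overlap scores further.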

In [None]:
df_results_deepseek = pd.DataFrame(saved_results_deepseek)
df_results_deepseek.to_csv('generated_codes_deepseek.csv', index=False)

In [None]:
with open('generated_codes_deepseek.json', 'w', encoding='utf-8') as f:
    json.dump(saved_results_deepseek, f, ensure_ascii=False, indent=4)

## Try Mistral-7B-Instruct-v0.1 on the task

Mistral-7B-Instruct-v0.1 is a 7-billion parameter language model designed for instruction-following tasks. It offers strong performance in natural language understanding and generation. Due to its large size, we will use quantization techniques to reduce memory usage and enable efficient inference on limited hardware.

This code configures 4-bit quantization for loading a model with the BitsAndBytes library. The model weights are loaded in 4-bit precision using the "nf4" quantization type, computations run in float16, and double quantization (which also quantizes the quantization constants) saves some additional memory.

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
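To get a feeling for why quantization matters here, the weight memory can be estimated with simple arithmetic. This is a rough sketch that assumes a rounded count of 7 billion parameters and ignores activations, the KV cache, and quantization metadata; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumption: rounded parameter count for a 7B model

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights shrink to about a quarter of their float16 size, which is what makes a 7B model practical on a single Colab GPU.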

To access the model, we need to log in to Hugging Face.

In [None]:
from huggingface_hub import login
from google.colab import userdata
token = userdata.get('HF_TOKEN')  # read the access token from the Colab secrets
login(token=token)

In [None]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

This code loads the Mistral-7B-Instruct-v0.1 model with 4-bit quantization enabled. The device_map="auto" argument automatically assigns model layers to available hardware devices. The tokenizer is also loaded to preprocess input text for the model.

In [None]:
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Now we create a text-generation pipeline using the 4-bit quantized Mistral model and its tokenizer. We configure generation parameters like maximum output length, sampling with top-k filtering, and device placement. The pipeline is then wrapped in a HuggingFacePipeline object for easy integration with other tools.

In [None]:
pipeline_inst = pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=2500,
        truncation=True,
        do_sample=True,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline_inst)

Device set to use cuda:0
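The `top_k=5` setting restricts sampling at each step to the five most likely tokens. The sketch below illustrates the idea on a toy distribution in pure Python; the function names and the probabilities are made up for illustration and this is not how the library implements it internally.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


def sample_top_k(probs, k, rng):
    """Draw one token from the renormalised top-k distribution."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (hypothetical values).
toy_probs = {"print": 0.5, "return": 0.3, "import": 0.15, "banana": 0.05}
filtered = top_k_filter(toy_probs, k=2)
print(filtered)
print(sample_top_k(toy_probs, k=2, rng=random.Random(0)))
```

A small `k` keeps the output focused on likely tokens, at the cost of less diverse generations.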


This function builds a prompt and generates a Python code solution for a programming problem based on the given mode: zero-shot, one-shot, or few-shot. The prompt is wrapped in a PromptTemplate and passed to an LLM chain (llm_chain) to produce the response, which is intended to contain only code.

In [None]:
import random

template = """You are a helpful assistant that generates Python code to solve the given problem.
You should respond only with the code, without explanation.

### Problem
{question}

### Solution
"""

def generate_response(df, index, mode='zero-shot', num_few_shot=3, random_state=0):
    random.seed(random_state)

    question = clean_problem_description(df.loc[index, 'problem_description'])

    if mode == 'zero-shot':
        prompt = template.format(question=question)

    elif mode == 'one-shot':
        example_idx = df.drop(index).sample(1, random_state=random_state).index[0]
        example_problem = clean_problem_description(df.loc[example_idx, 'problem_description'])
        example_solution = df.loc[example_idx, 'solution_code']

        prompt = (
            "You are a helpful assistant that generates Python code to solve the given problem.\n"
            "You should respond only with the code, without explanation.\n\n"
            "### Problem\n"
            f"{example_problem}\n\n"
            "### Solution\n"
            f"{example_solution.strip()}\n"
            "---\n"
            "### Problem\n"
            f"{question}\n\n"
            "### Solution\n"
        )

    elif mode == 'few-shot':
        example_indices = df.drop(index).sample(num_few_shot, random_state=random_state).index
        prompt = (
            "You are a helpful assistant that generates Python code to solve the given problem.\n"
            "You should respond only with the code, without explanation.\n\n"
        )
        for i in example_indices:
            example_problem = clean_problem_description(df.loc[i, 'problem_description'])
            example_solution = df.loc[i, 'solution_code']
            prompt += (
                "### Problem\n"
                f"{example_problem}\n\n"
                "### Solution\n"
                f"{example_solution.strip()}\n"
                "---\n"
            )
        prompt += (
            "### Problem\n"
            f"{question}\n\n"
            "### Solution\n"
        )

    else:
        raise ValueError("Invalid mode. Choose between 'zero-shot', 'one-shot', or 'few-shot'.")

    prompt_template = PromptTemplate(template=prompt, input_variables=["question"])
    llm_chain = LLMChain(prompt=prompt_template, llm=llm)

    response = llm_chain.run({"question": question})
    return response
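One thing to watch for with this design is that the already formatted prompt, including problem text and example solutions, is used as a template again. Assuming the template follows `str.format`-style placeholder rules, any literal `{` or `}` in a problem or solution would be read as a placeholder. The sketch below shows the effect with plain `str.format` and a hypothetical `escape_braces` helper.

```python
def escape_braces(text):
    """Double curly braces so str.format-style templates treat them literally."""
    return text.replace("{", "{{").replace("}", "}}")


problem = "For each query {x} print the answer."  # made-up problem text

# Unescaped text is interpreted as a template with a placeholder named x.
try:
    problem.format()
    raised = False
except (KeyError, IndexError, ValueError):
    raised = True
print("unescaped text raised an error:", raised)

# Escaped text survives formatting unchanged.
rendered = escape_braces(problem).format()
print(rendered)
```

Escaping the dynamic parts before building the template, or passing them in as template variables, avoids this class of error.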


We will again use the same example as before for the demonstration.

In [None]:
example.index[0]

np.int64(14480)

In [None]:
df.loc[example.index[0], 'problem_description']


'There are H rows and W columns of white square cells.\n\nYou will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.\n\nHow many white cells will remain?\n\nIt can be proved that this count does not depend on what rows and columns are chosen.\n\nConstraints\n\n* All values in input are integers.\n* 1 \\leq H, W \\leq 20\n* 1 \\leq h \\leq H\n* 1 \\leq w \\leq W\n\nInput\n\nInput is given from Standard Input in the following format:\n\n\nH W\nh w\n\n\nOutput\n\nPrint the number of white cells that will remain.\n\nExamples\n\nInput\n\n3 2\n2 1\n\n\nOutput\n\n1\n\n\nInput\n\n5 5\n2 3\n\n\nOutput\n\n6\n\n\nInput\n\n2 4\n2 4\n\n\nOutput\n\n0'

In [None]:
output = generate_response(df, example.index[0], mode='zero-shot')
print(output)



You are a helpful assistant that generates Python code to solve the given problem.
You should respond only with the code, without explanation.
Problem description:
There are H rows and W columns of white square cells.

You will choose h of the rows and w of the columns, and paint all of the cells contained in those rows or columns.

How many white cells will remain?

It can be proved that this count does not depend on what rows and columns are chosen.

Constraints

* All values in input are integers.
* 1 \leq H, W \leq 20
* 1 \leq h \leq H
* 1 \leq w \leq W

Input

Input is given from Standard Input in the following format:


H W
h w


Output

Print the number of white cells that will remain.
Solution (Python code):

from itertools import combinations

n, m = map(int, input().split())
h, w = map(int, input().split())

combs = [(i, j) for i, j in combinations(range(n), w)]

white_cells = sum([1 for _, j in combs if j < m])
print(white_cells)

Note: This code uses Python 3.7 and above. I

The code generated by the zero-shot approach was:

In [None]:
from itertools import combinations

n, m = map(int, input().split())
h, w = map(int, input().split())

combs = [(i, j) for i, j in combinations(range(n), w)]

white_cells = sum([1 for _, j in combs if j < m])
print(white_cells)

This code attempts to solve the problem by enumerating combinations, but it is wrong: `combinations(range(n), w)` yields tuples of length `w`, so unpacking them into `(i, j)` only works when `w` is 2, and the count it computes is unrelated to the number of remaining white cells. It also renames H and W to n and m, which is harmless but makes the code harder to compare with the problem statement.

In [None]:
output = generate_response(df, example.index[0], mode='one-shot')
print(output)

You are a helpful assistant that generates Python code to solve the given problem.
You should respond only with the code, without explanation.

### Problem
You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.

### Solution
N = int(input())
a = list(map(int, input().split()))

ans1 = 0
s = 0
flg = 1
for ai in a:
   

The solution generated by the one-shot prompt was:

In [None]:
def count_white_cells(h: int, w: int) -> int:

    n = (h + 1) * (w + 1)
    white_count = 0

    for i in range(n):
        row = i // (w + 1)
        col = i % (w + 1)
        if i % 2 == 0:
            white_count += 1 if row % 2 == 1 and col % 2 == 1 else (row % 2 == 0 and col % 2 == 0)
        else:
            white_count += 1 if row % 2 == 0 and col % 2!= 1 else (row % 2!= 0 and col % 2 == 1)

    return white_count

h = int(input())
w = int(input())
h = int(input())
w = int(input())
ans = count_white_cells(h, w)
print(ans)

This does not solve the problem either. The model also kept generating more examples and solutions after the one that was requested.

In [None]:
output = generate_response(df, example.index[0], mode='few-shot')
print(output)

You are a helpful assistant that generates Python code to solve the given problem.
You should respond only with the code, without explanation.

### Problem
You are given an integer sequence of length N. The i-th term in the sequence is a_i. In one operation, you can select a term and either increment or decrement it by one.

At least how many operations are necessary to satisfy the following conditions?

* For every i (1≤i≤n), the sum of the terms from the 1-st through i-th term is not zero.
* For every i (1≤i≤n-1), the sign of the sum of the terms from the 1-st through i-th term, is different from the sign of the sum of the terms from the 1-st through (i+1)-th term.

Constraints

* 2 ≤ n ≤ 10^5
* |a_i| ≤ 10^9
* Each a_i is an integer.

Input

Input is given from Standard Input in the following format:


n
a_1 a_2 ... a_n


Output

Print the minimum necessary count of operations.

### Solution
N = int(input())
a = list(map(int, input().split()))

ans1 = 0
s = 0
flg = 1
for ai in a:
   

The code generated was:

In [None]:
n = int(input())
m = int(input())
h = int(input())
w = int(input())
ans = 1

for i in range(h):
    cnt = 0
    for j in range(w):
        if m - h >= i and m - h >= j:
            cnt += 1
        else:
            cnt -= 1
    if cnt == 0 or cnt == 1:
        ans *= 1
    else:
        ans *= -1
ans *= -1
print(ans)

This code is incorrect and unrelated to the problem logic; it uses nested loops and conditions that do not correspond to counting remaining white cells. The approach is confused and does not calculate or print the correct result for the given problem.

The model likely failed to solve the problem in all three approaches because the problem requires a simple, direct mathematical insight, but the prompts or examples may have been unclear or noisy. This can cause the model to overcomplicate or misunderstand the task.
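The mathematical insight in question is that painting h rows and w columns leaves exactly the cells that lie in none of them, which gives (H - h) * (W - w). The sketch below checks this closed form against a brute-force simulation on the three sample cases from the problem description; the function names are illustrative only.

```python
def remaining_white_cells(H, W, h, w):
    """Closed form: cells outside the h painted rows and the w painted columns."""
    return (H - h) * (W - w)


def remaining_by_simulation(H, W, h, w):
    """Brute force: paint the first h rows and first w columns, count the rest."""
    return sum(1 for r in range(H) for c in range(W) if r >= h and c >= w)


# Sample cases from the problem description: (H, W, h, w) -> expected answer.
samples = [((3, 2, 2, 1), 1), ((5, 5, 2, 3), 6), ((2, 4, 2, 4), 0)]
for args, expected in samples:
    assert remaining_white_cells(*args) == expected
    assert remaining_by_simulation(*args) == expected
print("all samples pass")
```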

In [None]:
# Clear GPU cache
torch.cuda.empty_cache()

# Fine-tuning a small LLM for the task

## Fine tuning GPT-2

GPT-2 Medium (https://huggingface.co/openai-community/gpt2-medium) is a transformer-based language model developed by OpenAI, containing approximately 345 million parameters. It is part of the GPT-2 family and was pretrained on a large corpus of internet text using unsupervised learning. Due to its moderate size, GPT-2 Medium is suitable for fine-tuning on domain-specific tasks using limited resources (e.g., Google Colab).

### Imports

We start by installing and importing the necessary libraries.

In [None]:
!pip install transformers datasets evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from google.colab import drive
import os
import pandas as pd
from datasets import Dataset
from transformers import TrainingArguments, Trainer

### Load the model

Before fine-tuning, let's see how the pretrained model performs on our code generation task.

In the next cell we load the pretrained model and its tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
model.resize_token_embeddings(len(tokenizer))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Embedding(50257, 1024)

And now we create the text generation pipeline

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Device set to use cuda:0


For the demonstration we will use the following example prompt, which is not in the dataset. The task is to return YES if the weight w can be split into two positive even parts (that is, if w is even and greater than 2) and NO otherwise.

In [None]:
prompt = """### Problem:
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight.
Write a Python function that receives an integer `w` (1 <= w <= 100) and returns "YES" if it's possible, or "NO" otherwise.

#### Input:
- An integer w, the weight of the watermelon.

#### Output:
- "YES" if the watermelon can be split into two even positive integers.
- "NO" otherwise.

#### Examples:
Input: 8
Output: YES

Input: 3
Output: NO

Input: 4
Output: YES

### Solution:
"""

With the previous prompt, we generate the model's output

In [None]:
output = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print("Output before fine-tuning:\n")
print(output[0]["generated_text"])

Output before fine-tuning:

### Problem:
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight.
Write a Python function that receives an integer `w` (1 <= w <= 100) and returns "YES" if it's possible, or "NO" otherwise.

### Solution:

Add a new type of function called partition with the function `partition(w,partitionDim,w,partitionDim,w)`. This function takes an integer of `w` and returns a tuple containing two items: `partitionDim` and `partitionDim`.

## Examples:

>>> import multiprocessing import time >>> partition = partition(1, 2, 3, 4) >>> partition(0, 1) 'NO' >>> partition(0, 1) 'YES' >>> partition(1, 10) 'YES' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'NO' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'NO' >>> partition(0, 1) 'YES' >>> partition(0, 1) 'NO' >>> partition(0, 1) 'NO' >>> partition(0, 1) 'NO' >>> partition(0, 1) 'NO' 

We can see that the model does not generate code for the solution. It starts to explain how the problem could be solved but never writes any working code.
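For reference, a correct solution to this prompt is very short. The sketch below is one possible implementation, with the hypothetical function name `can_split`; it relies on the fact that a sum of two positive even numbers is even and at least 4.

```python
def can_split(w):
    """Return "YES" if w can be split into two positive even parts, else "NO"."""
    return "YES" if w % 2 == 0 and w > 2 else "NO"


# Examples from the prompt, plus the edge case w = 2.
for w in (8, 3, 4, 2):
    print(w, can_split(w))
```

The case w = 2 is the one a plain even/odd check would get wrong, since the only split is 1 + 1.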

### Load the dataset

Now we will load our dataset to start the fine-tuning.

In [None]:
drive.mount('/content/drive')

path = 'Colab Notebooks/NLP/NLP_Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


'/content/drive/.shortcut-targets-by-id/17WgJO1gfIBADpYX2jVdb41q7HCbwWcOU/NLP_Project'

In [None]:
df = pd.read_csv('final_ds.csv')
print(df.head())

                                 problem_description solution_id  \
0  Xenia has a set of weights and pan scales. Eac...         0_0   
1  Xenia has a set of weights and pan scales. Eac...         0_2   
2  Xenia has a set of weights and pan scales. Eac...         0_4   
3  Xenia has a set of weights and pan scales. Eac...         0_6   
4  Xenia has a set of weights and pan scales. Eac...         0_8   

                                       solution_code  \
0  __author__ = 'ratnesh.mishra'\n\nweights = map...   
1  import sys\nsys.setrecursionlimit (1000000)\n\...   
2  import sys\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...   
3  MOD = 10**9 + 7\nI = lambda:list(map(int,input...   
4  to_print = []\ndef dfs(d, ini, s, depth, m):\n...   

               problem_name time_complexity_inferred space_complexity_inferred  
0  339_C. Xenia and Weights                     O(1)                   O(n**2)  
1  339_C. Xenia and Weights                     O(1)                      O(1)  
2  339_C. X

### Fine-tuning with 3k samples

We start with a small experiment, fine-tuning on 3000 samples from the dataset. In the next cell we prepare the training text by joining each problem description with its code solution, and then split the result into train and test sets.

In [None]:
df_small = df[["problem_description", "solution_code"]].dropna().sample(3000, random_state=42).reset_index(drop=True)

df_small["text"] = df_small.apply(
    lambda row: f"### Problem:\n{row['problem_description']}\n### Solution:\n{row['solution_code']}", axis=1
)

dataset = Dataset.from_pandas(df_small[["text"]])
dataset = dataset.train_test_split(test_size=0.1)

Now we tokenize the selected data with the model's tokenizer

In [None]:
def tokenize(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens
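Note that copying `input_ids` into `labels` means the padded positions also count as prediction targets. A common alternative is to replace the labels at padded positions with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the masking step on plain lists; `mask_padding_labels` is a hypothetical helper and is not what the notebook uses.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy example: three real tokens followed by two padding tokens.
# 50256 is GPT-2's end-of-text id, which the notebook reuses as the pad token.
input_ids = [318, 257, 1332, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

With batched tokenization, the same function would be applied to each sequence in the batch.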

In [None]:
tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/2700 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

In the next step we create the Trainer, defining the training arguments and the model to be trained. We will train for 3 epochs.

In [None]:
training_args = TrainingArguments(
    output_dir="./gpt2-medium-finetuned",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    save_total_limit=1,
    fp16=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer
)




Start training

In [None]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
100,2.037
200,1.6461
300,1.5395
400,1.4558
500,1.4367
600,1.4121
700,1.4259
800,1.4014
900,1.2939
1000,1.3351


TrainOutput(global_step=4050, training_loss=1.179877754493996, metrics={'train_runtime': 2023.7052, 'train_samples_per_second': 4.003, 'train_steps_per_second': 2.001, 'total_flos': 7522475625676800.0, 'train_loss': 1.179877754493996, 'epoch': 3.0})
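
The `global_step` value can be cross-checked with a little arithmetic. This is a sketch that assumes a single device and no gradient accumulation (the `Trainer` defaults, which the numbers here are consistent with); the function name is ours.

```python
import math


def expected_optimizer_steps(n_train: int, per_device_batch_size: int,
                             epochs: int, n_devices: int = 1) -> int:
    """Number of optimizer steps when every batch triggers one update."""
    batches_per_epoch = math.ceil(n_train / (per_device_batch_size * n_devices))
    return batches_per_epoch * epochs


# 2700 training examples, batch size 2, 3 epochs
print(expected_optimizer_steps(2700, 2, 3))
```

With 2,700 training examples and a batch size of 2 there are 1,350 steps per epoch, so 3 epochs give the 4,050 steps reported in `TrainOutput`.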

After training, we save the fine-tuned model.

In [None]:
trainer.save_model("./gpt2-medium-finetuned")
tokenizer.save_pretrained("./gpt2-medium-finetuned")

('./gpt2-medium-finetuned/tokenizer_config.json',
 './gpt2-medium-finetuned/special_tokens_map.json',
 './gpt2-medium-finetuned/vocab.json',
 './gpt2-medium-finetuned/merges.txt',
 './gpt2-medium-finetuned/added_tokens.json',
 './gpt2-medium-finetuned/tokenizer.json')

Now we generate an output with the same prompt as at the beginning to see how the fine-tuned model behaves.

First we create the pipeline with the trained model.

In [None]:
pipe_finetuned = pipeline("text-generation", model="./gpt2-medium-finetuned", tokenizer=tokenizer)

Device set to use cuda:0


Then we pass the same prompt again.

In [None]:
prompt = """### Problem:
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight.
Write a Python function that receives an integer `w` (1 <= w <= 100) and returns "YES" if it's possible, or "NO" otherwise.

#### Input:
- An integer w, the weight of the watermelon.

#### Output:
- "YES" if the watermelon can be split into two even positive integers.
- "NO" otherwise.

#### Examples:
Input: 8
Output: YES

Input: 3
Output: NO

Input: 4
Output: YES

### Solution:
"""

output_finetuned = pipe_finetuned(prompt, max_new_tokens=256)
print("After fine-tuning:\n", output_finetuned[0]["generated_text"])


After fine-tuning:
 ### Problem:
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight.
Write a Python function that receives an integer `w` (1 <= w <= 100) and returns "YES" if it's possible, or "NO" otherwise.

#### Input:
- An integer w, the weight of the watermelon.

#### Output:
- "YES" if the watermelon can be split into two even positive integers.
- "NO" otherwise.

#### Examples:
Input: 8  
Output: YES

Input: 3  
Output: NO

Input: 4  
Output: YES

### Solution:
w = int(input())
sum = 0
for i in range(1, w+1):
    while(i & 1):
        if(w & (i & 0x00)) == 0:
           sum += 1
    else:
         sum += sum*(i & 0x00)
          sum *= 1
          sum *= i

print('YES' if sum == 'YES' else 'NO')


The output is still not a correct solution, and its inconsistent indentation means it would not run as written. It does, however, look like Python (it reads the input, loops, and prints a verdict), which is an improvement over the output generated before fine-tuning.
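
For reference when judging the generated outputs, the problem has a very short solution: the weight must be even, and it must be greater than 2, because the only split of 2 is 1 + 1. The function name below is our own choice.

```python
def can_split_watermelon(w: int) -> str:
    """Return "YES" if w can be split into two even positive parts, else "NO"."""
    return "YES" if w % 2 == 0 and w > 2 else "NO"


# The three examples from the prompt, plus the w = 2 edge case.
for weight in (8, 3, 4, 2):
    print(weight, can_split_watermelon(weight))
```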

### Fine-tuning with 10k samples

Now we take a larger slice of the dataset for fine-tuning, 10,000 samples, and again split it into train and test sets and tokenize it.

In [None]:
df_10k = df[["problem_description", "solution_code"]].dropna().sample(10000, random_state=42).reset_index(drop=True)
df_10k["text"] = df_10k.apply(
    lambda row: f"### Problem:\n{row['problem_description']}\n### Solution:\n{row['solution_code']}", axis=1
)

dataset_10k = Dataset.from_pandas(df_10k[["text"]])
dataset_10k = dataset_10k.train_test_split(test_size=0.1)

In [None]:
tokenized_10k_dataset = dataset_10k.map(tokenize, batched=True)

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

We define the Trainer and the training arguments, this time with 2 epochs.

In [None]:
training_args = TrainingArguments(
    output_dir="./gpt2-medium-finetuned-10kds",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    learning_rate=5e-5,
    save_steps=5000,
    eval_steps=1000,
    logging_steps=100,
    save_total_limit=1,
    fp16=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_10k_dataset["train"],
    eval_dataset=tokenized_10k_dataset["test"],
    tokenizer=tokenizer
)




And start the training

In [None]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
100,1.9619
200,1.6114
300,1.5833
400,1.4696
500,1.4735
600,1.4504
700,1.3648
800,1.3784
900,1.3638
1000,1.2851


TrainOutput(global_step=9000, training_loss=1.0219143998887803, metrics={'train_runtime': 3736.2941, 'train_samples_per_second': 4.818, 'train_steps_per_second': 2.409, 'total_flos': 1.6716612501504e+16, 'train_loss': 1.0219143998887803, 'epoch': 2.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.7891965508460999,
 'eval_runtime': 49.0426,
 'eval_samples_per_second': 20.39,
 'eval_steps_per_second': 10.195,
 'epoch': 2.0}
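
The evaluation loss is a mean cross-entropy in nats, so it can be turned into a perplexity by exponentiating it. The sketch below does this for the value reported above. Because the labels in this notebook include padded positions, the result may be lower than a perplexity computed over real tokens only, so it is best read as a relative indicator.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(mean_cross_entropy)


eval_loss = 0.7891965508460999  # value reported by trainer.evaluate()
print(f"perplexity: {perplexity(eval_loss):.2f}")
```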

In [None]:
trainer.save_model("./gpt2-medium-finetuned-10k")
tokenizer.save_pretrained("./gpt2-medium-finetuned-10k")

('./gpt2-medium-finetuned-10k/tokenizer_config.json',
 './gpt2-medium-finetuned-10k/special_tokens_map.json',
 './gpt2-medium-finetuned-10k/vocab.json',
 './gpt2-medium-finetuned-10k/merges.txt',
 './gpt2-medium-finetuned-10k/added_tokens.json',
 './gpt2-medium-finetuned-10k/tokenizer.json')

Now we compute some metrics on 30 samples from the test set. First we generate the outputs of the fine-tuned model for these samples.

In [None]:
import torch
from tqdm import tqdm

model_path = "./gpt2-medium-finetuned-10k"

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=-1  # -1 selects the CPU
)

test_dataset = dataset_10k["test"]

expected_codes = []
generated_codes = []

for example in tqdm(test_dataset.select(range(30))):
    full_text = example["text"]

    try:
        parts = full_text.split("### Problem:\n")[1].split("### Solution:\n")
        description = parts[0].strip()
        real_code = parts[1].strip()
    except (IndexError, AttributeError):
        continue

    prompt = f"### Problem:\n{description}\n### Solution:\n"

    if not prompt.strip():
        continue

    max_prompt_tokens = 768
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_prompt_tokens)
    prompt = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)

    try:
        output = generator(prompt, max_new_tokens=256, num_return_sequences=1, do_sample=False)[0]["generated_text"]
    except RuntimeError as e:
        print("Generation error:", e)
        continue

    generated_code = output.split("### Solution:\n")[-1].strip()

    expected_codes.append(real_code)
    generated_codes.append(generated_code)



Device set to use cpu
100%|██████████| 30/30 [12:44<00:00, 25.49s/it]
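
The prompt budget of 768 tokens plus 256 new tokens adds up to GPT-2's 1,024-token context window. Note that tokenizer truncation cuts from the right by default, so for a description longer than the budget the trailing `### Solution:` marker would be cut off as well. One way around this, sketched here on plain lists of token IDs, is to trim only the description and then append the marker. The helper `fit_prompt` is hypothetical and is not part of the loop above.

```python
def fit_prompt(body_ids, suffix_ids, max_prompt_tokens):
    """Trim body_ids so that body + suffix fits the budget, keeping the suffix intact.

    body_ids:   token IDs of "### Problem:\\n" plus the description
    suffix_ids: token IDs of the "\\n### Solution:\\n" marker
    """
    budget = max_prompt_tokens - len(suffix_ids)
    if budget < 0:
        raise ValueError("suffix alone exceeds the prompt budget")
    return list(body_ids[:budget]) + list(suffix_ids)


# Toy example with integer IDs: the body is trimmed, the marker (9, 9) survives.
print(fit_prompt([1, 2, 3, 4, 5], [9, 9], max_prompt_tokens=4))
```

In practice, `body_ids` and `suffix_ids` would come from tokenizing the two pieces separately, and the combined IDs would be decoded back to text before being passed to the pipeline.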


In [None]:
import pandas as pd

df_codes = pd.DataFrame({
    "reference_code": expected_codes,
    "generated_code": generated_codes
})

df_codes.to_csv("codes_comparison.csv", index=False)


Let's also run the model on the previous example to see the output

In [None]:
prompt = """### Problem:
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight.
Write a Python function that receives an integer `w` (1 <= w <= 100) and returns "YES" if it's possible, or "NO" otherwise.

#### Examples:
Input: 8
Output: YES

Input: 3
Output: NO

### Solution:
"""

output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])


### Problem:
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight.
Write a Python function that receives an integer `w` (1 <= w <= 100) and returns "YES" if it's possible, or "NO" otherwise.

#### Examples:
Input: 8
Output: YES

Input: 3
Output: NO

### Solution:
w=int(input())
for i in range(len(str(w))):
    if str(i)==str(w-1):
        print("YES")
    else:
        print("NO")


The generated code still does not solve the problem, but it improves on the output of the very first model: it is now syntactically valid, well-structured Python.

We will also generate outputs from the original model for comparison.
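
Whether an output is "well-structured Python" can be checked mechanically rather than by eye. The sketch below uses the built-in `compile`, which only checks the syntax and does not execute the code; the helper name `is_valid_python` is ours.

```python
def is_valid_python(source: str) -> bool:
    """Return True if source compiles as a Python module (the code is not run)."""
    try:
        compile(source, "<generated>", "exec")
    except (SyntaxError, ValueError):
        # IndentationError is a subclass of SyntaxError; ValueError covers
        # sources containing null bytes on some Python versions.
        return False
    return True


generated = (
    "w=int(input())\n"
    "for i in range(len(str(w))):\n"
    "    if str(i)==str(w-1):\n"
    '        print("YES")\n'
    "    else:\n"
    '        print("NO")\n'
)
print(is_valid_python(generated))
```

A check like this could be added as an extra column next to the similarity metrics, since it says something the text-overlap scores do not.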

In [None]:
from tqdm import tqdm

model_name = "gpt2-medium"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=-1
)

test_dataset = dataset_10k["test"]

expected_codes_gpt2 = []
generated_codes_gpt2 = []

for example in tqdm(test_dataset.select(range(30))):
    full_text = example["text"]

    try:
        parts = full_text.split("### Problem:\n")[1].split("### Solution:\n")
        description = parts[0].strip()
        real_code = parts[1].strip()
    except (IndexError, AttributeError):
        continue

    prompt = f"### Problem:\n{description}\n### Solution:\n"

    if not prompt.strip():
        continue

    max_prompt_tokens = 768
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_prompt_tokens)
    prompt = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)

    try:
        output = generator(prompt, max_new_tokens=256, num_return_sequences=1, do_sample=False)[0]["generated_text"]
    except RuntimeError as e:
        print("Generation error:", e)
        continue

    generated_code = output.split("### Solution:\n")[-1].strip()

    expected_codes_gpt2.append(real_code)
    generated_codes_gpt2.append(generated_code)



Device set to use cpu
  0%|          | 0/30 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  3%|▎         | 1/30 [00:58<28:19, 58.60s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  7%|▋         | 2/30 [02:03<29:06, 62.36s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 10%|█         | 3/30 [03:03<27:31, 61.17s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 13%|█▎        | 4/30 [04:00<25:48, 59.56s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 17%|█▋        | 5/30 [04:57<24:25, 58.63s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 20%|██        | 6/30 [05:54<23:14, 58.09s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 23%|██▎       | 7/30 [06:51<22:07, 57.70s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 27%|██▋       | 8/30 [07:45<20:47, 56.73s

In [None]:
df_codes = pd.DataFrame({
    "reference_code": expected_codes_gpt2,
    "generated_code": generated_codes_gpt2
})

df_codes.to_csv("codes_comparison_gpt2.csv", index=False)


In [None]:
!pip install rouge_score



In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

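Here are the functions that calculate the metrics. Exact Match checks whether the generated code is identical to the reference after stripping surrounding whitespace. Levenshtein Similarity is meant to measure how close two strings are in terms of edit operations; note that `levenshtein_ratio` below actually uses `difflib.SequenceMatcher.ratio()`, which scores matching blocks (2·M/T, with M matched characters and T the total length of both strings) rather than a true edit distance. BLEU Score evaluates n-gram overlap between generated and reference texts and is commonly used in machine translation. ROUGE-L is based on the longest common subsequence, so it rewards tokens that appear in the same order in both texts.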

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import numpy as np
import difflib

def exact_match(pred, ref):
    return pred.strip() == ref.strip()

def levenshtein_ratio(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def bleu_score(pred, ref):
    smoothie = SmoothingFunction().method4
    return sentence_bleu([nltk.word_tokenize(ref)], nltk.word_tokenize(pred), smoothing_function=smoothie)

def rouge_l_score(pred, ref):
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    return scorer.score(ref, pred)['rougeL'].fmeasure

def compare_all(expected_codes, generated, label=""):
    print(f"\n=== Metrics for: {label} ===")

    em_list = []
    lev_list = []
    bleu_list = []
    rouge_list = []

    for pred, ref in zip(generated, expected_codes):
        em_list.append(exact_match(pred, ref))
        lev_list.append(levenshtein_ratio(pred, ref))
        bleu_list.append(bleu_score(pred, ref))
        rouge_list.append(rouge_l_score(pred, ref))

    print(f"Exact Match: {np.mean(em_list):.3f}")
    print(f"Levenshtein Similarity: {np.mean(lev_list):.3f}")
    print(f"BLEU Score: {np.mean(bleu_list):.3f}")
    print(f"ROUGE-L Score: {np.mean(rouge_list):.3f}")
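
Since `levenshtein_ratio` relies on `difflib`, here is a small sketch of what a true Levenshtein-based similarity would look like, in case the edit-distance definition is preferred. It is a plain dynamic-programming implementation with function names of our own, and it was not used for the numbers reported in this notebook.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP rows as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))
```

The running time grows with the product of the two string lengths, so for long solutions a dedicated library would be faster.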


In [None]:
codes_finetuned = pd.read_csv('codes_comparison.csv')
#codes_gpt2 = pd.read_csv('codes_comparison_gpt2.csv')

In [None]:
expected_codes = codes_finetuned["reference_code"].tolist()
generated_codes_finetuned = codes_finetuned["generated_code"].tolist()

#expected_codes_gpt2 = codes_gpt2["reference_code"].tolist()
#generated_codes_gpt2 = codes_gpt2["generated_code"].tolist()

In [None]:
compare_all(expected_codes_gpt2, generated_codes_gpt2, label="GPT-2 Original")
compare_all(expected_codes, generated_codes_finetuned, label="GPT-2 Fine-tuned")


=== Metrics for: GPT-2 Original ===
Exact Match: 0.000
Levenshtein Similarity: 0.026
BLEU Score: 0.004
ROUGE-L Score: 0.057

=== Metrics for: GPT-2 Fine-tuned ===
Exact Match: 0.000
Levenshtein Similarity: 0.282
BLEU Score: 0.157
ROUGE-L Score: 0.326


The metrics show that, on these 30 test samples, the fine-tuned GPT-2 model produces code much closer to the expected solutions than the original GPT-2. Both models have an Exact Match score of 0, so neither reproduces a reference solution exactly, but the fine-tuned model scores far higher on the similarity metrics: Levenshtein Similarity rises from 0.026 to 0.282, BLEU from 0.004 to 0.157, and ROUGE-L from 0.057 to 0.326. This suggests that fine-tuning helps the model produce code that is structurally and lexically more similar to the reference. These metrics measure textual overlap, not whether the generated code actually solves the problem.
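
To make the ROUGE-L number more concrete, the sketch below computes an LCS-based F-measure by hand. It assumes simple whitespace tokenization, whereas `rouge_score` applies its own tokenizer and optional stemming, so the values will not match the library exactly. The function names are ours.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token sequences."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(pred: str, ref: str) -> float:
    """F-measure of LCS precision (vs. pred length) and recall (vs. ref length)."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("w = int ( input ( ) )", "n = int ( input ( ) )"))
```

Two solutions that share boilerplate such as input parsing already earn a fair amount of overlap this way, which is one reason the score can rise without the logic being correct.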

## Fine-tuning TinyLlama

TinyLlama-1.1B-Chat-v1.0 (https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) is a compact, chat-optimized language model with 1.1B parameters, based on the Llama 2 architecture. It was pretrained on 3T tokens and fine-tuned using UltraChat and UltraFeedback datasets to improve dialogue quality. Despite its small size, it performs well in chat tasks, especially on limited hardware.

### Imports

We start by doing the necessary library imports

In [None]:
!pip install -q peft accelerate transformers datasets bitsandbytes trl


In [None]:
!pip install datasets



In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import pandas as pd
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from google.colab import drive
import os
from datasets import Dataset

### Load the model

Before fine-tuning, let's see how the pretrained model does on our code generation task.

In the next cell we create a text-generation pipeline with the pretrained model.

In [None]:
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")


Device set to use cuda:0


As the messages, we pass a problem that is not in our larger dataset.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful Python assistant."},
    {"role": "user", "content": "Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight. Write a Python function that receives an integer `w` (1 <= w <= 100) and returns 'YES' if it's possible, or 'NO' otherwise."}
]

Now we will generate the output using the messages above as input

In [None]:
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

<|system|>
You are a helpful Python assistant.</s>
<|user|>
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight. Write a Python function that receives an integer `w` (1 <= w <= 100) and returns 'YES' if it's possible, or 'NO' otherwise.</s>
<|assistant|>
Here's a Python function that returns 'YES' if it's possible to split a watermelon of weight `w` into two parts, each of even positive weight, and 'NO' otherwise:

```python
def is_splitable(w):
    """
    Takes a weight `w` and returns 'YES' if it's possible to split the watermelon into two parts,
    each of even positive weight, or 'NO' otherwise.
    """
    if w <= 0:
        return "Watermelon is not splitable!"
    return "YES" if (w % 2) == 0 else "NO"
```

Here's an example usage:

```python
watermelon_wgt = 100
is_splitable = is_splitable(watermelon_wgt)
print(is_splitable)  # Output: YES
```

This function takes an integer `w` (1 <= w <= 100) as input and returns either 'YES' (if it'

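We can see that the model already produces Python code for the task, as requested in the system prompt, and it also tries to explain the generated code. The function overlooks one edge case, though: for `w = 2` it returns 'YES', but the only split is 1 + 1, which are not even weights.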
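
The `<|system|>`, `<|user|>` and `<|assistant|>` markers in the output come from the tokenizer's chat template. The function below is only an approximation of that layout, reconstructed from the printed output above. The authoritative version is the template shipped with the tokenizer, and details such as whitespace may differ. The name `render_chat` is ours.

```python
def render_chat(messages, add_generation_prompt=True, eos="</s>"):
    """Approximate the role-tagged prompt layout seen in the output above."""
    parts = []
    for message in messages:
        # Each turn: role tag on its own line, then the content closed by EOS.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|assistant|>\n")
    return "".join(parts)


demo = [
    {"role": "system", "content": "You are a helpful Python assistant."},
    {"role": "user", "content": "Write a function that adds two numbers."},
]
print(render_chat(demo))
```

With `add_generation_prompt=False` the open assistant tag is omitted, which is the setting used later when complete conversations, including the assistant's answer, are rendered for training.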

### Load the dataset

Now we will load our dataset to start the fine-tuning.

In [None]:
drive.mount('/content/drive')

path = 'Colab Notebooks/NLP/NLP_Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


'/content/drive/.shortcut-targets-by-id/17WgJO1gfIBADpYX2jVdb41q7HCbwWcOU/NLP_Project'

In [None]:
df = pd.read_csv('final_ds.csv')
print(df.head())

                                 problem_description solution_id  \
0  Xenia has a set of weights and pan scales. Eac...         0_0   
1  Xenia has a set of weights and pan scales. Eac...         0_2   
2  Xenia has a set of weights and pan scales. Eac...         0_4   
3  Xenia has a set of weights and pan scales. Eac...         0_6   
4  Xenia has a set of weights and pan scales. Eac...         0_8   

                                       solution_code  \
0  __author__ = 'ratnesh.mishra'\n\nweights = map...   
1  import sys\nsys.setrecursionlimit (1000000)\n\...   
2  import sys\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...   
3  MOD = 10**9 + 7\nI = lambda:list(map(int,input...   
4  to_print = []\ndef dfs(d, ini, s, depth, m):\n...   

               problem_name time_complexity_inferred space_complexity_inferred  
0  339_C. Xenia and Weights                     O(1)                   O(n**2)  
1  339_C. Xenia and Weights                     O(1)                      O(1)  
2  339_C. X

### Fine-tuning with 1000 samples

We start with a small experiment, fine-tuning on 1,000 samples from the dataset. The next cells prepare the training data: they combine each description and code solution in the messages format and split the result into train and test sets.

In [None]:
df_small = df[["problem_description", "solution_code"]].dropna().sample(1000, random_state=42)

In [None]:
def format_messages(row):
    return [
        {"role": "system", "content": "You are a helpful Python assistant."},
        {"role": "user", "content": row["problem_description"]},
        {"role": "assistant", "content": row["solution_code"]}
    ]

In [None]:
df_small["messages"] = df_small.apply(format_messages, axis=1)
dataset = Dataset.from_pandas(df_small[["messages"]])
dataset = dataset.train_test_split(test_size=0.1)

Here we define the base model name and load its tokenizer, reusing the end-of-sequence token as the padding token.

In [None]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token


The next function renders each example with the chat template and tokenizes the result with the model's tokenizer.

In [None]:
def apply_chat_template(example):
    prompt = tokenizer.apply_chat_template(example["messages"], tokenize=False, add_generation_prompt=False)

    tokenized = tokenizer(
        prompt,
        truncation=True,
        padding="max_length",
        max_length=512
    )

    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized


In [None]:
tokenized_dataset = dataset.map(apply_chat_template)

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

This next cell loads the base model in 4-bit precision and prepares it for k-bit training. LoRA adapters are configured to fine-tune only the q_proj and v_proj layers in a memory-efficient way.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
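
To get a feel for how small the trainable part is, we can count the LoRA parameters by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs adds two matrices of rank `r`, i.e. `r * (in_features + out_features)` weights. The dimensions below are assumptions based on the published TinyLlama configuration as we understand it (hidden size 2048, 22 layers, and a 256-dimensional `v_proj` output because of grouped-query attention). They should be verified against `model.print_trainable_parameters()`.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


def lora_trainable_params(hidden_size: int = 2048, kv_dim: int = 256,
                          num_layers: int = 22, r: int = 8) -> int:
    """Total LoRA weights when adapting q_proj and v_proj in every layer.

    The default dimensions are assumed, not read from the loaded model.
    """
    q_proj = lora_param_count(hidden_size, hidden_size, r)
    v_proj = lora_param_count(hidden_size, kv_dim, r)
    return (q_proj + v_proj) * num_layers


print(lora_trainable_params())
print("scaling factor lora_alpha / r:", 16 / 8)
```

Under these assumptions only about 1.1 million weights are trained, roughly 0.1% of the 1.1B parameters of the base model. The LoRA update is scaled by `lora_alpha / r`, which is 2 with the settings above.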


Next we create the Trainer, defining the training arguments and the model to be trained. We will train for 2 epochs.

In [None]:
training_args = TrainingArguments(
    output_dir="./tinyllama-chat-finetuned",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    learning_rate=2e-4,
    logging_steps=100,
    eval_steps=250,
    save_steps=250,
    save_total_limit=1,
    fp16=True,
    report_to="none"
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer
)


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Start training

In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
100,1.5375
200,1.1766
300,1.1996
400,1.1782
500,1.1326
600,1.1601
700,1.1472
800,1.1505
900,1.0597




TrainOutput(global_step=900, training_loss=1.1935513051350912, metrics={'train_runtime': 689.5834, 'train_samples_per_second': 2.61, 'train_steps_per_second': 1.305, 'total_flos': 5726668220006400.0, 'train_loss': 1.1935513051350912, 'epoch': 2.0})

After training, we save the fine-tuned LoRA adapter and the tokenizer.

In [None]:
model.save_pretrained("tinyllama-lora-finetuned")
tokenizer.save_pretrained("tinyllama-lora-finetuned")

('tinyllama-lora-finetuned/tokenizer_config.json',
 'tinyllama-lora-finetuned/special_tokens_map.json',
 'tinyllama-lora-finetuned/tokenizer.model',
 'tinyllama-lora-finetuned/added_tokens.json',
 'tinyllama-lora-finetuned/tokenizer.json')

Now we generate an output with the same prompt as at the beginning to see how the fine-tuned model behaves.

In [None]:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "tinyllama-lora-finetuned")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful Python assistant."},
    {"role": "user", "content": "Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight. Write a Python function that receives an integer `w` (1 <= w <= 100) and returns 'YES' if it's possible, or 'NO' otherwise."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = pipe(prompt, max_new_tokens=256, do_sample=True)
print(output[0]["generated_text"])

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForConditionalGen

<|system|>
You are a helpful Python assistant.</s>
<|user|>
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight. Write a Python function that receives an integer `w` (1 <= w <= 100) and returns 'YES' if it's possible, or 'NO' otherwise.</s>
<|assistant|>
w = int(input())
if w % 2 == 0:
    if w == 0:
        print('NO')
        return
    print('YES')
else:
    print('NO')


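The generated code is now close to the intended parity check and comes without explanations, which is more aligned with the examples in the dataset. It is not fully correct, however: the top-level `return` is a syntax error outside a function, and `w = 2` would still print 'YES' even though 2 cannot be split into two even positive parts.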

### Fine-tuning with 5000 samples

Now we repeat the fine-tuning with more samples.

We take 5,000 samples from the dataset, again split them into train and test sets, apply the messages format, and tokenize them.

In [None]:
df_medium = df[["problem_description", "solution_code"]].dropna().sample(5000, random_state=42)

In [None]:
df_medium["messages"] = df_medium.apply(format_messages, axis=1)
dataset_medium = Dataset.from_pandas(df_medium[["messages"]])
dataset_medium = dataset_medium.train_test_split(test_size=0.1)

In [None]:
tokenized_dataset_medium = dataset_medium.map(apply_chat_template)

Map:   0%|          | 0/4500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Here we again define the training arguments and the Trainer, this time with the new dataset.

In [None]:
training_args = TrainingArguments(
    output_dir="./tinyllama-chat-finetuned-mediumds",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    learning_rate=2e-4,
    logging_steps=100,
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    fp16=True,
    report_to="none"
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_medium["train"],
    eval_dataset=tokenized_dataset_medium["test"],
    tokenizer=tokenizer
)


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Then we start the training.

In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
100,1.5614
200,1.2378
300,1.2267
400,1.1724
500,1.1617
600,1.1388
700,1.1616
800,1.1123
900,1.0879
1000,1.1399




TrainOutput(global_step=4500, training_loss=1.0915350663926866, metrics={'train_runtime': 3303.0539, 'train_samples_per_second': 2.725, 'train_steps_per_second': 1.362, 'total_flos': 2.8633341100032e+16, 'train_loss': 1.0915350663926866, 'epoch': 2.0})
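
The reported `global_step=4500` can be sanity-checked from the configuration above. The sketch below assumes a single device and the default `gradient_accumulation_steps=1`, neither of which is stated explicitly in the notebook.

```python
import math

train_samples = 4500        # 90% of the 5000 sampled rows
per_device_batch_size = 2   # per_device_train_batch_size
num_epochs = 2              # num_train_epochs
num_devices = 1             # assumption: a single GPU
grad_accum_steps = 1        # assumption: Trainer default

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 4500

# Cross-check with the logged runtime and throughput.
print(round(3303.0539 * 1.362))  # 4499, consistent with 4500 given the rounded rate
```

Under these assumptions the step count also matches the `checkpoint-4500` directory that is loaded afterwards.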

In [None]:
trainer.evaluate()

{'eval_loss': 1.0485718250274658,
 'eval_runtime': 58.3072,
 'eval_samples_per_second': 8.575,
 'eval_steps_per_second': 4.288,
 'epoch': 2.0}
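
One common way to read a cross-entropy loss is as a perplexity, `exp(loss)`. This is an interpretation added here, not something the notebook computes, and it assumes the reported loss is a mean per-token cross-entropy in nats.

```python
import math

eval_loss = 1.0485718250274658   # from trainer.evaluate() above
train_loss = 1.0915350663926866  # from the TrainOutput above

# Perplexity is the exponential of the mean per-token cross-entropy.
eval_perplexity = math.exp(eval_loss)
train_perplexity = math.exp(train_loss)

print(f"eval perplexity:  {eval_perplexity:.2f}")
print(f"train perplexity: {train_perplexity:.2f}")
```

The evaluation loss being slightly below the average training loss is not unusual here, since the training figure averages over the whole run, including the early high-loss steps.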

Now we generate an output for the first prompt again to see how the model behaves.

In [None]:
from transformers import BitsAndBytesConfig

base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

adapter_path = "./tinyllama-chat-finetuned-mediumds/checkpoint-4500"
model = PeftModel.from_pretrained(base_model, adapter_path)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)


In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful Python assistant."},
    {"role": "user", "content": "Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight. Write a Python function that receives an integer `w` (1 <= w <= 100) and returns 'YES' if it's possible, or 'NO' otherwise."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = pipe(prompt, max_new_tokens=256, do_sample=True)

print(output[0]["generated_text"])


Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', ...]

<|system|>
You are a helpful Python assistant.</s>
<|user|>
Petya wants to split a watermelon of weight `w` into two parts, each of even positive weight. Write a Python function that receives an integer `w` (1 <= w <= 100) and returns 'YES' if it's possible, or 'NO' otherwise.</s>
<|assistant|>
def is_positive_even(w):
    if 1000/w == 54:
        return True
    if w % 2 == 0:
        return True
    
    
print("YES", (is_positive_even(w) == True))


Next, we compute some metrics on 20 samples from the test set.

In [None]:
!pip install rouge_score



In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Here is the function that calculates the metrics:

* Exact Match: checks whether the generated code is identical to the reference, ignoring leading and trailing whitespace.
* Levenshtein Similarity: measures how similar two strings are based on the number of edit operations needed to turn one into the other. The code below approximates it with `difflib.SequenceMatcher.ratio()`, which scores matching blocks rather than computing a true edit distance.
* BLEU Score: evaluates n-gram overlap between the generated and reference texts; it is commonly used in machine translation.
* ROUGE-L: is based on the longest common subsequence, capturing structural similarity between the outputs.

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import numpy as np
import difflib

def exact_match(pred, ref):
    return pred.strip() == ref.strip()

def levenshtein_ratio(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def bleu_score(pred, ref):
    smoothie = SmoothingFunction().method4
    return sentence_bleu([nltk.word_tokenize(ref)], nltk.word_tokenize(pred), smoothing_function=smoothie)

def rouge_l_score(pred, ref):
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    return scorer.score(ref, pred)['rougeL'].fmeasure

def compare_all(expected_codes, generated, label=""):
    print(f"\n=== Metrics for: {label} ===")

    em_list = []
    lev_list = []
    bleu_list = []
    rouge_list = []

    for pred, ref in zip(generated, expected_codes):
        em_list.append(exact_match(pred, ref))
        lev_list.append(levenshtein_ratio(pred, ref))
        bleu_list.append(bleu_score(pred, ref))
        rouge_list.append(rouge_l_score(pred, ref))

    print(f"Exact Match: {np.mean(em_list):.3f}")
    print(f"Levenshtein Similarity: {np.mean(lev_list):.3f}")
    print(f"BLEU Score: {np.mean(bleu_list):.3f}")
    print(f"ROUGE-L Score: {np.mean(rouge_list):.3f}")


Now we generate outputs with the original model and with the fine-tuned model, and run the metrics function on both sets of outputs.

In [None]:
from tqdm import tqdm
from transformers import BitsAndBytesConfig
from peft import PeftModel

def extract_code_from_output(output_text):
    if "<|assistant|>" in output_text:
        return output_text.split("<|assistant|>")[-1].strip()
    else:
        return output_text.strip()

base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True,
                               bnb_4bit_compute_dtype=torch.bfloat16)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

adapter_path = "./tinyllama-chat-finetuned-mediumds/checkpoint-4500"
finetuned_model = PeftModel.from_pretrained(base_model, adapter_path)

pipe_orig = pipeline("text-generation", model=base_model, tokenizer=tokenizer, device_map="auto")
pipe_ft = pipeline("text-generation", model=finetuned_model, tokenizer=tokenizer, device_map="auto")

test_dataset = dataset_medium["test"].select(range(20))

expected_codes = []
generated_orig = []
generated_ft = []

for example in tqdm(test_dataset):
    messages = example["messages"]
    description = ""
    real_code = ""

    for m in messages:
        if m["role"] == "user":
            description = m["content"].strip()
        elif m["role"] == "assistant":
            real_code = m["content"].strip()

    if not description or not real_code:
        continue

    prompt_messages = [
        {"role": "system", "content": "You are a helpful Python assistant."},
        {"role": "user", "content": description}
    ]

    prompt = tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)

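    # PeftModel.from_pretrained injects the LoRA layers into base_model in place,
    # so the adapter is disabled here to get outputs from the original weights.
    with finetuned_model.disable_adapter():
        output_orig = pipe_orig(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
    code_orig = extract_code_from_output(output_orig)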

    output_ft = pipe_ft(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
    code_ft = extract_code_from_output(output_ft)

    expected_codes.append(real_code)
    generated_orig.append(code_orig)
    generated_ft.append(code_ft)

df_orig = pd.DataFrame({
    "reference_code": expected_codes,
    "generated_code_original": generated_orig
})
df_orig.to_csv("codes_comparison_tinyllama_original.csv", index=False)

df_ft = pd.DataFrame({
    "reference_code": expected_codes,
    "generated_code_finetuned": generated_ft
})
df_ft.to_csv("codes_comparison_tinyllama_finetuned.csv", index=False)

compare_all(expected_codes, generated_orig, label="TinyLlama Original")
compare_all(expected_codes, generated_ft, label="TinyLlama Fine-tuned")


Device set to use cuda:0
Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', ...]


=== Metrics for: TinyLlama Original ===
Exact Match: 0.000
Levenshtein Similarity: 0.178
BLEU Score: 0.105
ROUGE-L Score: 0.239

=== Metrics for: TinyLlama Fine-tuned ===
Exact Match: 0.000
Levenshtein Similarity: 0.202
BLEU Score: 0.148
ROUGE-L Score: 0.300





The evaluation metrics show that the fine-tuned TinyLlama model scores higher than the original on all three similarity metrics: Levenshtein similarity increased from 0.178 to 0.202, BLEU improved from 0.105 to 0.148, and ROUGE-L rose from 0.239 to 0.300. Exact Match stays at 0.000 for both models. The improvements are moderate, and since they come from only 20 test samples generated with sampling enabled (`do_sample=True`), they should be read as indicative rather than conclusive. A relatively small increase is expected because the base model was already capable of generating Python code to some extent before fine-tuning.
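
To put the gains in proportion, the absolute and relative changes can be computed directly from the scores printed above. This is plain arithmetic on the reported numbers; the dictionary layout is just for illustration.

```python
# Mean scores reported above as (original, fine-tuned).
scores = {
    "Levenshtein Similarity": (0.178, 0.202),
    "BLEU Score": (0.105, 0.148),
    "ROUGE-L Score": (0.239, 0.300),
}

gains = {}
for name, (original, finetuned) in scores.items():
    absolute = finetuned - original
    relative = absolute / original
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

On these numbers the relative gains are roughly 13% for the similarity ratio, 41% for BLEU, and 26% for ROUGE-L, so BLEU shows the largest proportional change even though all absolute differences are small.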