---
title: "Distilling DeepSeek reasoning to ModernBERT classifiers"
description: "How can we use the reasoning ability of DeepSeek to generate synthetic labels for fine tuning a ModernBERT model?"
author: "Daniel van Strien"
date: "2025-01-29"
categories: ["huggingface", "datasets", "arxiv", "synthetic-data", "deepseek"]
image: https://github.com/davanstrien/blog/raw/refs/heads/main/posts/2025/modern-bert-sythetic-labels/bert-illustration.webp
twitter-card:
  title: "Distilling DeepSeek reasoning to ModernBERT classifiers"
  description: "How can we use the reasoning ability of DeepSeek to generate synthetic labels for fine tuning a ModernBERT model?"
  image: https://github.com/davanstrien/blog/raw/refs/heads/main/posts/2025/modern-bert-sythetic-labels/bert-illustration.webp
  card-style: summary_large_image
open-graph:
  title: "Distilling DeepSeek reasoning to ModernBERT classifiers"
  description: "How can we use the reasoning ability of DeepSeek to generate synthetic labels for fine tuning a ModernBERT model?"
  image: https://github.com/davanstrien/blog/raw/refs/heads/main/posts/2025/modern-bert-sythetic-labels/bert-illustration.webp
toc-depth: 3
toc: true
---

In [1]:
# | output: false
%pip install polars huggingface_hub datasets openai --upgrade



## How can we get the best of both worlds?

tl;dr, how can we use LLMs to generate labels to fine-tune a ModernBERT model?

It's fair to say that [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) has made quite an impact in the last few weeks. It's a powerful reasoning model that excels at many tasks that require reasoning. One particularly exciting aspect of the release of this model, though, is the distilled versions of the model. These models are much smaller but still retain a lot of the reasoning ability of the larger models.

### Classification often requires reasoning

While the interest in reasoning models often focuses on use cases like mathematics and coding, there are many other use cases where reasoning can be helpful. One example is classification. Although some classification problems are very simple and mostly require "pattern matching," there are many other problems where reasoning is needed. This is where a reasoning model could be helpful.

### Can we distil even smaller models?

While the distilled models are fairly small (the smallest is 1.5B), we may still prefer to have an even smaller model for many use cases. If you can remember all the way back to December 2024, the ModernBERT release introduced a new BERT model, which is a good candidate for this kind of efficient classification use case. The main challenge is that in order to train a classifier, we need labeled data. This is where we can use a reasoning model to generate synthetic labels.

## The use case: classifying ArXiv papers that introduce a newly created dataset

As the Machine Learning Librarian at Hugging Face, I want to keep track of new datasets being shared on ArXiv. While you can search for "dataset" or "benchmark" in the title or abstract, this returns any papers that mention datasets or benchmarks. I'm only interested in papers that introduce a newly created dataset. 


So the goal is to classify whether a given paper introduces a newly created dataset.

I'll use Polars to load the ArXiv dataset from the Hub, but you can use whichever data tool you want.

::: {.callout-note}
Feel free to skip this section if you are not interested in the use case and just want to see how to do the labelling part.
:::

In [2]:
import os
import polars as pl
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # turn on HF_TRANSFER

In [3]:
files = snapshot_download(
    repo_id="librarian-bots/arxiv-metadata-snapshot",
    allow_patterns=["*.parquet"],
    repo_type="dataset",
)

In [4]:
df = pl.scan_parquet(files)

Let's look at the first row. We see a bunch of metadata about the paper and then the title and abstract. These are probably the columns we'll want to use as input for our model.


In [5]:
df.head(1).collect()

id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
str,str,str,str,str,str,str,str,str,str,str,list[struct[2]],datetime[ms],list[list[str]]
"""1004.3702""","""Lizhi Du""","""Lizhi Du""","""A Polynomial time Algorithm fo…","""26 pages. This time, I add a d…",,,,"""cs.DS""","""http://arxiv.org/licenses/none…",""" Based on the famous Rotation…","[{""v1"",""Mon, 12 Apr 2010 04:39:27 GMT""}, {""v10"",""Mon, 5 Nov 2012 01:44:46 GMT""}, … {""v9"",""Wed, 29 Aug 2012 06:39:31 GMT""}]",2025-01-24 00:00:00,"[[""Du"", ""Lizhi"", """"]]"


You will see there is a categories column. This is a string that contains a list of categories that the paper belongs to. We can grab a few examples of the categories.



In [6]:
df.head(10).collect().select("categories").to_series().to_list()

['cs.DS',
 'math.GM',
 'math.CA math.AT math.DG math.DS',
 'cond-mat.mtrl-sci',
 'cond-mat.mtrl-sci',
 'math.GT',
 'math.GT',
 'math.GT',
 'math.AP',
 'math.AP math-ph math.MP math.SP']

For my particular use case, I'm mostly interested in papers in the computer science categories, i.e., papers where at least one entry in the categories column starts with "cs.".

In [7]:
df = df.filter(
    pl.col("categories")
    .str.split(" ")
    .list.eval(pl.element().str.starts_with("cs."))
    .list.any()
)
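
To make the logic of this filter concrete, here is a plain-Python sketch of the same check on individual category strings. The helper name `has_cs_category` is made up for illustration. The sketch also shows why matching on a prefix per category is safer than a plain substring search: the real arXiv category `physics.optics` happens to contain the characters `cs.`.

```python
def has_cs_category(categories: str) -> bool:
    """Return True if any space-separated arXiv category starts with 'cs.'."""
    return any(cat.startswith("cs.") for cat in categories.split(" "))


# The first two strings come from the sample output above; the last two are
# added for illustration.
samples = [
    "cs.DS",
    "math.CA math.AT math.DG math.DS",
    "stat.ML cs.LG",
    "physics.optics",  # contains the substring "cs." but is not a cs category
]

cs_only = [s for s in samples if has_cs_category(s)]
print(cs_only)  # ['cs.DS', 'stat.ML cs.LG']

# A naive substring test would wrongly keep physics.optics.
print("cs." in "physics.optics")  # True
```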

We'll filter papers to only include those that contain the word "dataset" in the title or abstract. Again, you could easily change this to use other words.



::: {.callout-note}
One thing to consider here is that ideally you want the distribution of data you use for training the model to be similar to the distribution of data you will use in practice. Since I will only check for ArXiv papers that contain the word "dataset" in the title or abstract, I will filter out a lot of the data before it even gets passed to the model. For your use case, consider the distribution of data you'll be using in practice and filter the data accordingly.
:::

In [8]:
df = df.filter(
    pl.col("title").str.contains("dataset") | pl.col("abstract").str.contains("dataset")
)
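
One detail worth keeping in mind: to the best of my knowledge, Polars' `str.contains` treats its pattern as a case-sensitive regular expression by default, so a title that only says "Dataset" with a capital D would not match on the title alone. The sketch below shows the difference using Python's `re` module on a few made-up titles. Prefixing the pattern with `(?i)` is one way to make the match case-insensitive, and the same inline flag should also work in the Polars pattern, though that is an assumption worth checking against the Polars docs.

```python
import re

# Hypothetical titles, made up for illustration.
titles = [
    "A new dataset for keyword extraction",
    "Dataset Distillation at Scale",
    "Benchmarking graph neural networks",
]

# Case-sensitive match, mirroring the filter used above.
case_sensitive = [t for t in titles if re.search("dataset", t)]

# Case-insensitive match using the inline (?i) flag.
case_insensitive = [t for t in titles if re.search("(?i)dataset", t)]

print(len(case_sensitive), len(case_insensitive))  # 1 2
```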

Since we're using the Polars lazy API, we need to call `collect()` to actually get the data.

In [9]:
df = df.collect()

## Generate labels not synthetic data

There has been significant growth in the use of LLMs for synthetic data generation over the past couple of years. While we could generate fully synthetic data, i.e., both the "input" and "target" columns, it makes more sense to generate only labels if we already have some data we want to work with. One of the significant challenges with synthetic data generation is that the generated data is often not representative of the data we want to use in practice. For generative tasks, this might matter slightly less. Classifiers, however, often target quite a narrow use case or domain, so the data we use to train them must be representative of the data they will see in practice.

In this case, it might be more sensible to use a model's reasoning ability to generate labels rather than generate synthetic data. 



Let's grab a few examples from the data to use as a starting point.

In [168]:
examples = df.head(4).select(pl.col(["abstract", "title"])).to_dicts()
examples[0]

{'abstract': '  This paper presents a new fuzzy k-means algorithm for the clustering of high\ndimensional data in various subspaces. Since, In the case of high dimensional\ndata, some features might be irrelevant and relevant but may have different\nsignificance in the clustering. For a better clustering, it is crucial to\nincorporate the contribution of these features in the clustering process. To\ncombine these features, in this paper, we have proposed a new fuzzy k-means\nclustering algorithm in which the objective function of the fuzzy k-means is\nmodified using two different entropy term. The first entropy term helps to\nminimize the within-cluster dispersion and maximize the negative entropy to\ndetermine clusters to contribute to the association of data points. The second\nentropy term helps to control the weight of the features because different\nfeatures have different contributing weights in the clustering process for\nobtaining the better partition of the data. The efficacy 

## Structured generation?

We'll start by using a structured generation approach to generate the labels: we define a schema for the model's output and constrain generation to that schema. I've written more about this in a [previous blog post](https://danielvanstrien.xyz/posts/plain-text/synthetic-data-generation/2024-05-03-synethic-data-1.html). The benefit is that we don't have to do a lot of work to parse the output of the model, and we can be sure we can easily train on the output.

In this case, we define a Pydantic model with two fields: an explanation and a label.


In [169]:
from enum import Enum
from pydantic import BaseModel, constr
from typing import Annotated


class DatasetLabel(str, Enum):
    NEW = "new_dataset"
    NOT_NEW = "no_new_dataset"


class IntroducesNewDataset(BaseModel):
    explanation: constr(min_length=40)
    label: DatasetLabel

We define a function to format the data as a prompt. This function takes a dictionary with the title and abstract and formats it as a prompt for the model.

In [170]:
def format_text_as_prompt(data: dict[str, str]):
    return f"""Look at the title and abstract for the following arXiv paper. Assess whether the paper is likely to introduce a newly created dataset.


Title: {data['title']}
Abstract: {data['abstract']}

Your role is to decide whether the paper introduces a newly created dataset. First you should think about whether the paper is likely to introduce a newly created dataset. You should then return your reasoning and the label you've chosen. 
You should choose out of the "new_dataset" or "no_new_dataset" labels.

Return your reasoning and the label you've chosen as a JSON object like this:
```json
{{
    "label": "new_dataset" | "no_new_dataset",
    "explanation": "The reasoning the model used to come to its conclusion"
}}
```
"""


In [171]:
print(format_text_as_prompt(examples[0]))

Look at the title and abstract for the following arXiv paper. Assess whether the paper is likely to introduce a newly created dataset.


Title: An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for
  High Dimensional Data
Abstract:   This paper presents a new fuzzy k-means algorithm for the clustering of high
dimensional data in various subspaces. Since, In the case of high dimensional
data, some features might be irrelevant and relevant but may have different
significance in the clustering. For a better clustering, it is crucial to
incorporate the contribution of these features in the clustering process. To
combine these features, in this paper, we have proposed a new fuzzy k-means
clustering algorithm in which the objective function of the fuzzy k-means is
modified using two different entropy term. The first entropy term helps to
minimize the within-cluster dispersion and maximize the negative entropy to
determine clusters to contribute to the association of data poi

## Using LM Studio to develop our approach

One of the powerful features of open source is that it makes it easier to run models in different places. While developing our approach, we can use a smaller version of the model to test it and then switch to a hosted version once we're happy with it.

We'll run the model using [LM Studio](https://lmstudio.ai/). LM Studio is primarily known as a UI for running local LLMs, but it also has a server mode, which we'll use here. We can interact with the server using the CLI. To start the server, we can run the following command.

In [172]:
!lms server start

Starting server...
Success! Server is now running on port 1234


We can use `lms ls` to see the models that are available. We'll filter these to only show the DeepSeek models.

In [135]:
!lms ls | grep DeepSeek

lmstudio-community/DeepSeek-R1-Distill-Qwen-1.5B-GGUF      1.12 GB          Qwen2
lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF        4.68 GB          Qwen2
lmstudio-community/DeepSeek-R1-Distill-Qwen-14B-GGUF       8.99 GB          Qwen2
lmstudio-community/DeepSeek-R1-Distill-Llama-8B-GGUF       4.92 GB          Llama


::: {.callout-note}
Note that the output here is showing models I already have locally. There are many models LM Studio can download from the Hugging Face Hub.
:::

We can load the model by running the following command. If the model is not already downloaded, LM Studio will download it. We'll try and see how well the 7B model does.

In [137]:
!lms load DeepSeek-R1-Distill-Qwen-7B-GGUF


Loading model "lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"...
[LMStudioClient][LLM] Start loading model lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf...

Since LM Studio has an OpenAI-compatible API, we can use the OpenAI Python client to interact with the server. We just need to set the base URL to the LM Studio server and set the API key to `lm-studio`.

In [153]:
from openai import OpenAI

In [154]:
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

Once we've created the client, we can interact with it in the usual way. For example, to see the available models, we can run the following command.

In [140]:
client.models.list()


SyncPage[Model](data=[Model(id='deepseek-r1-distill-qwen-7b', created=None, object='model', owned_by='organization_owner')], object='list')

## Generating labels

We can now generate labels for our examples. We'll use the `format_text_as_prompt` function to format the data as a prompt and then pass it to the model. Since we're using a structured output, we need to use the `beta.chat.completions` endpoint. We pass in our Pydantic model as the `response_format` argument.

In [173]:
messages = [
    {"role": "user", "content": format_text_as_prompt(examples[0])},
]


response = client.beta.chat.completions.parse(
    model="deepseek-r1-distill-qwen-7b",
    messages=messages,
    temperature=0.7,
    response_format=IntroducesNewDataset,
)


We can check that we can parse the output of the model into our Pydantic model.

In [174]:
IntroducesNewDataset.model_validate_json(response.choices[0].message.content)

IntroducesNewDataset(explanation="The paper discusses an entropy-based fuzzy k-means algorithm designed for high-dimensional data. While it mentions incorporating feature contributions into clustering, there's no information about introducing a new dataset.", label=<DatasetLabel.NOT_NEW: 'no_new_dataset'>)

We'll wrap this in a function so we can easily use it for a lot of examples.

In [175]:
def predict_label(
    data: dict[str, str], model: str = "deepseek-r1-distill-qwen-1.5b", client=client
) -> IntroducesNewDataset | None:
    try:
        prompt = format_text_as_prompt(data)
        messages = [
            {"role": "user", "content": prompt},
        ]
        response = client.beta.chat.completions.parse(
            model=model,
            messages=messages,
            temperature=0.7,
            response_format=IntroducesNewDataset,
        )
        return IntroducesNewDataset.model_validate_json(
            response.choices[0].message.content
        )
    except Exception as e:
        print(e)
        return None


Before doing a big batch of predictions, let's run the model on a few examples so we can see how it does.

In [176]:
from rich import print as rich_print

structured_results = []
for example in examples:
    title = example["title"]
    abstract = example["abstract"]
    prediction = predict_label(example)
    structured_results.append(prediction)
    rich_print(title)
    rich_print(abstract)
    rich_print(prediction)
    rich_print("---")

## Room to think?

One of the features of the R1 model is that it produces "reasoning", which is delineated by `<think>` and `</think>` tags. Since our structured output doesn't leave room for this, let's see how well the model does when we drop the structured-output constraint and let it respond freely.

In [215]:
def predict_label_without_structured_output(
    data: dict[str, str], model: str = "deepseek-r1-distill-qwen-1.5b", client=client
) -> str:
    prompt = format_text_as_prompt(data)
    messages = [
        {"role": "user", "content": prompt},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content


We'll compare the results from the two approaches.

In [178]:
# compare the results vs structured output
for i, example in enumerate(examples):
    rich_print(example["title"])
    rich_print(example["abstract"])
    prediction = predict_label_without_structured_output(example)
    print(f"Previous: {structured_results[i].label}")
    print(f"New: {prediction}")
    rich_print("---")

Previous: DatasetLabel.NOT_NEW
New: <think>
Okay, so I need to figure out whether the paper introduces a newly created dataset. The title and abstract are provided.

The title is: "An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data." It mentions an algorithm related to clustering high-dimensional data using fuzzy k-means with some entropy terms and feature weighting.

Looking at the abstract, it says they've proposed a new fuzzy k-means algorithm. The focus is on modifying the objective function by adding two different entropy terms: one to minimize within-cluster dispersion and another to control feature weights because features have varying contributions in clustering.

The paper mentions that their method was tested against various datasets and compared with state-of-the-art methods, but there's no explicit mention of introducing a new dataset. They evaluate performance on multiple existing datasets without specifying any novel data creation

Previous: DatasetLabel.NEW
New: <think>
Okay, so I need to figure out whether the paper titled "Identifying Influential Brokers on Social Media from Social Network Structure" introduces a new dataset. Let me break this down.

First, looking at the title, it's about identifying influential brokers in social media using network structure. The abstract mentions they used three social media datasets to study these influencers. They compared brokers with source spreaders and central nodes based on centrality measures.

The abstract also talks about tackling the problem of identifying brokers from both centrality measures and node embeddings. It evaluates the effectiveness of network features, getting some F1 scores as a result.

So, I'm trying to see if they created any new dataset or used existing ones. They mention using three datasets: Twitter in their experiments. The paper doesn't seem to introduce any entirely new type of data beyond what's commonly available, like Twitter datasets. T

Previous: DatasetLabel.NOT_NEW
New: <think>
Okay, I'm trying to figure out whether the paper "Improving Performance of Automatic Keyword Extraction (AKE) Methods Using PoS-Tagging and Enhanced Semantic-Awareness" introduces a newly created dataset. 

First, looking at the title suggests that it's about improving an existing AKE method, which implies they're working with existing datasets rather than creating new ones.

The abstract mentions experiments conducted on 17 selected datasets for five SOTA AKE methods. They used these datasets to test their approach but didn't mention any new data collection or creation here. 

So, there's no indication that the paper includes a newly created dataset in its methodology or results section.
</think>

The paper focuses on enhancing existing AKE methods using PoS-tagging and semantic-aware criteria without introducing new datasets.

```json
{
    "label": "no_new_dataset",
    "explanation": "The paper does not mention any new datasets being crea

Previous: DatasetLabel.NEW
New: <think>
Alright, so I'm trying to figure out whether the paper introduces a newly created dataset. Let's look at the information given.

First, the title is "LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy." It mentions an acronym LOCUS, which seems to be the main focus of the paper—sound source localization using deep learning methods to handle missing data in batteryless systems.

Now looking at the abstract. The authors mention that their method addresses missing channels by leveraging information entropy estimation and conditional interpolation through three modules: InFo, LaFS, and GRep. They demonstrate performance improvements on two datasets called DCASE and LargeSet, achieving up to 36.91% reduction in DoA error compared to existing methods.

The abstract also includes a real-world evaluation across three environments with intermittent power sources, showing significant performance improvements when channels are stochastically m

While this is definitely a vibes-based assessment, it does seem like the model does better when it has room to think, so we'll proceed with this approach. 

::: {.callout-note}
There are ways to allow for both structured generation and reasoning. I'll post more on that in the future!
:::
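
Since the free-form responses shown above wrap the reasoning in `<think>` tags ahead of the final answer, it can be handy to separate the two, for example to inspect the reasoning on its own. Here is a minimal sketch. The function name `split_reasoning` and the sample response are made up for illustration, and the sketch assumes at most one `<think>` block at the start of the response.

```python
from typing import Optional


def split_reasoning(text: str) -> tuple[Optional[str], str]:
    """Split a response into (reasoning, answer) around the closing </think> tag."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole response as the answer.
        return None, text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical response, shortened for illustration.
response = (
    "<think>\nThe abstract only mentions existing datasets.\n</think>\n\n"
    '{"label": "no_new_dataset"}'
)
reasoning, answer = split_reasoning(response)
print(reasoning)  # The abstract only mentions existing datasets.
print(answer)  # {"label": "no_new_dataset"}
```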

We'll now write a function to extract the JSON from the model's output.

In [179]:
#| code-fold: true
import contextlib
import re
import json

JSON_PATTERN = re.compile(r"```json\n(.*?)```", re.DOTALL)
DIRECT_JSON_PATTERN = re.compile(r"\{[^}]*\}", re.DOTALL)


def try_extract_json_from_text(text: str) -> tuple[str, dict | None]:
    if match := JSON_PATTERN.search(text):
        json_results = match.group(1)
        with contextlib.suppress(json.JSONDecodeError):
            return text, json.loads(json_results)
    if match := DIRECT_JSON_PATTERN.search(text):
        json_text = match.group(0)
        with contextlib.suppress(json.JSONDecodeError):
            return text, json.loads(json_text)
    return text, None
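
One limitation of the `DIRECT_JSON_PATTERN` fallback is that it stops at the first `}`, so an explanation that itself contains braces produces a fragment that fails to parse. If that turns out to matter in practice, one possible alternative (a sketch, not what is used in the rest of this post) is to let `json.JSONDecoder.raw_decode` find where the object ends. The function name `extract_first_json_object` and the sample reply are hypothetical.

```python
import json
import re
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object that decodes cleanly from text, if any."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This "{" did not start a valid object; try the next one.
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply whose explanation contains braces.
reply = 'Final answer: {"label": "new_dataset", "explanation": "Introduces {NAME}, a new corpus."}'

# The simple regex cuts the match off at the first closing brace.
naive = re.search(r"\{[^}]*\}", reply).group(0)
print(naive.endswith("{NAME}"))  # True

result = extract_first_json_object(reply)
print(result["label"])  # new_dataset
```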

In [180]:
prediction = predict_label_without_structured_output(examples[0])
try_extract_json_from_text(prediction)


('<think>\nOkay, so I\'m trying to figure out whether the paper titled "An Entropy-based Variable Feature Weighted Fuzzy k-Means Algorithm for High Dimensional Data" introduces a newly created dataset. Let me go through this step by step.\n\nFirst, I look at the title. The title mentions an algorithm called fuzzy k-means that\'s been modified with entropy terms and feature weighting to handle high-dimensional data. It doesn\'t explicitly say anything about introducing new datasets, but it does focus on improving clustering in such data environments, which often involves dealing with irrelevant or less important features.\n\nNow, looking at the abstract. The paper discusses modifying the fuzzy k-means algorithm by incorporating two entropy terms: one for within-cluster dispersion and negative entropy to determine clusters, and another to control feature weights because different features have varying contributions. They compare their method\'s efficacy using various clustering measures 

Let's see how well this works on all the examples we had before.

In [181]:
results = [predict_label_without_structured_output(example) for example in examples]
parsed_results = [try_extract_json_from_text(result) for result in results]
[p for p in parsed_results if p[1] is None]

[]

In this case, every response yielded a valid JSON object, which is why we get back an empty list.

We may still hit a few responses without a valid JSON object when running over the full dataset, but let's proceed with this approach since the model seems to do better when given room to reason.

We'll now use the hosted version of the model to generate labels for the entire dataset. For this version, we'll use a dedicated Hugging Face inference endpoint, but if we wanted to use the full R1 model, we could use the new Inference Providers feature on the Hub. See this [blog post](https://huggingface.co/blog/inference-providers) for more information.

In [221]:
#| code-fold: true
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url="https://tgtdz7g5h3sd1lov.us-east-1.aws.endpoints.huggingface.cloud/v1/",
    api_key=os.getenv("HF_TOKEN"),
)


In [222]:
rich_print(
    predict_label_without_structured_output(
        examples[0], model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", client=client
    )
)

We'll now sample 3000 examples from the dataset and use the hosted model to generate labels for them.

In [223]:
sample_df = df.sample(3000, seed=42)

In [224]:
examples = sample_df.select(pl.col(["abstract", "title"])).to_dicts()

We create a function to predict the labels using the hosted model. We'll use the `stamina` library to retry the request if it fails.

In [225]:
#| code-fold: true
import stamina
from openai import APIConnectionError, APIStatusError


@stamina.retry(on=(APIConnectionError, APIStatusError), attempts=3)
def predict_hf_endpoint(data: dict[str, str], model: str = "tgi", client=client):
    return predict_label_without_structured_output(data, model, client)


def predict(data):
    try:
        return predict_hf_endpoint(data)
    except Exception as e:
        print(e)
        return None
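
For readers who haven't used a retry library before, the idea behind the decorator is roughly the following: call the function, and if it raises one of the listed exceptions, wait a little and try again, up to a fixed number of attempts. The stdlib sketch below illustrates that idea only. It is not how `stamina` is implemented (the library does more than this), and `call_with_retries` and `flaky` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.1,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be at least 1")


calls = {"n": 0}


def flaky() -> str:
    """Hypothetical stand-in for a request that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, retry_on=(ConnectionError,), base_delay=0.01)
print(result, calls["n"])  # ok 3
```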

Get the results from the model.

In [None]:
from tqdm.contrib.concurrent import thread_map

results = thread_map(predict, examples, max_workers=5)
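
One property we rely on later is that the results come back in the same order as the inputs, since we attach them to the sampled dataframe by position. As far as I understand, `thread_map` is a progress-bar wrapper around `ThreadPoolExecutor.map`, which preserves input order even when later items finish first. A small stdlib sketch of that behaviour, with `fake_predict` as a hypothetical stand-in for the real request:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(example: dict) -> str:
    """Hypothetical stand-in for a network call; later items finish sooner."""
    time.sleep(example["delay"])
    return example["title"].upper()


toy_examples = [
    {"title": "a", "delay": 0.03},
    {"title": "b", "delay": 0.02},
    {"title": "c", "delay": 0.01},
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in input order, not completion order.
    toy_results = list(pool.map(fake_predict, toy_examples))

print(toy_results)  # ['A', 'B', 'C']
```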

Let's take a look at the first result.


In [192]:
rich_print(results[0])

In [193]:
try_extract_json_from_text(results[0])

('<think>\nOkay, so I need to figure out if the given arXiv paper introduces a newly created dataset. Let\'s look at the title and abstract carefully.\n\nThe title is "Holistic Autonomous Driving Understanding by Bird\'s-Eye-View Injected Multi-Modal Large Models." That immediately suggests it\'s about a dataset related to autonomous driving. The abstract mentions that the paper introduces a dataset called NuInstruct, which has 91K multi-view video-QA pairs across 17 subtasks. Each task requires holistic information like temporal, multi-view, and spatial data, making the challenges higher.\n\nThe authors propose a method using SQL to generate instruction-response pairs automatically, inspired by human driving logic. They also introduce BEV-InMLLM, an end-to-end method that enhances large language models by integrating BEV features, language alignment, and tasks like multi-view, spatial awareness, and temporal semantics. They note that their BEV injection module is plug-and-play for exi

We'll do a bit of cleaning up of the results to get them in a format we can add to our existing dataframe.

In [194]:
#| code-fold: true
parsed_results = [try_extract_json_from_text(result) for result in results]

In [195]:
parsed_results[:3]

[('<think>\nOkay, so I need to figure out if the given arXiv paper introduces a newly created dataset. Let\'s look at the title and abstract carefully.\n\nThe title is "Holistic Autonomous Driving Understanding by Bird\'s-Eye-View Injected Multi-Modal Large Models." That immediately suggests it\'s about a dataset related to autonomous driving. The abstract mentions that the paper introduces a dataset called NuInstruct, which has 91K multi-view video-QA pairs across 17 subtasks. Each task requires holistic information like temporal, multi-view, and spatial data, making the challenges higher.\n\nThe authors propose a method using SQL to generate instruction-response pairs automatically, inspired by human driving logic. They also introduce BEV-InMLLM, an end-to-end method that enhances large language models by integrating BEV features, language alignment, and tasks like multi-view, spatial awareness, and temporal semantics. They note that their BEV injection module is plug-and-play for ex

In [196]:
#| code-fold: true
labels_and_explanations = [
    (result[1].get("label"), result[1].get("explanation"))
    if result[1] is not None and isinstance(result[1], dict)
    else (None, None)
    for result in parsed_results
]

# Unzip the list of tuples into separate lists
labels, explanations = zip(*labels_and_explanations)
labels = list(labels)
explanations = list(explanations)
sample_df = sample_df.with_columns(
    pl.Series(labels).alias("labels"),
    pl.Series(explanations).alias("explanations"),
)
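
The code above keeps whatever label string the model returned and leaves the clean-up for a later step. A possible variant, sketched here on toy data, is to validate the label while unpacking so that anything off-schema becomes `None` straight away. The names `VALID_LABELS`, `to_label_and_explanation`, and the toy inputs are all hypothetical.

```python
from typing import Optional

VALID_LABELS = {"new_dataset", "no_new_dataset"}

# Toy stand-ins for (raw_text, parsed_json) pairs; the contents are made up.
toy_parsed = [
    ("...", {"label": "new_dataset", "explanation": "Introduces a corpus."}),
    ("...", None),  # no JSON could be extracted
    ("...", {"label": "maybe", "explanation": "Unclear."}),  # off-schema label
]


def to_label_and_explanation(
    parsed: Optional[dict],
) -> tuple[Optional[str], Optional[str]]:
    """Map a parsed result to (label, explanation); None for anything unusable."""
    if not isinstance(parsed, dict) or parsed.get("label") not in VALID_LABELS:
        return None, None
    return parsed["label"], parsed.get("explanation")


pairs = [to_label_and_explanation(parsed) for _text, parsed in toy_parsed]
# zip(*pairs) turns a list of pairs into two parallel columns.
toy_labels, toy_explanations = (list(column) for column in zip(*pairs))

print(toy_labels)  # ['new_dataset', None, None]
```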

In [200]:
sample_df.head(1)

id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed,labels,explanations
str,str,str,str,str,str,str,str,str,str,str,list[struct[2]],datetime[ms],list[list[str]],str,str
"""2401.00988""","""Xinpeng Ding""","""Xinpeng Ding and Jinahua Han a…","""Holistic Autonomous Driving Un…",,,,,"""cs.CV""","""http://arxiv.org/licenses/none…",""" The rise of multimodal large…","[{""v1"",""Tue, 2 Jan 2024 01:54:22 GMT""}]",2024-01-03 00:00:00,"[[""Ding"", ""Xinpeng"", """"], [""Han"", ""Jinahua"", """"], … [""Li"", ""Xiaomeng"", """"]]","""new_dataset""","""The paper introduces a dataset…"


Let's take a look at the distribution of the labels. 

In [201]:
sample_df.select(pl.col("labels").value_counts()).unnest("labels")

labels,count
str,u32
"""new_dataset""",648
"""no_new_dataset""",2350
,2


We only get a few examples where the output doesn't match the labels we want. We can filter these out.

In [202]:
sample_df = sample_df.filter(pl.col("labels").is_in(["new_dataset", "no_new_dataset"]))
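
The label counts above also give us a useful reference point for later: because the classes are imbalanced, a model that always predicts the majority class would already get a fairly high accuracy. A quick calculation using the counts from the table (the variable names are just for illustration):

```python
# Counts copied from the value_counts output above (the two null rows are dropped).
counts = {"new_dataset": 648, "no_new_dataset": 2350}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(total, majority_label, round(shares[majority_label], 3))
# 2998 no_new_dataset 0.784
```

So roughly 78% accuracy is what "always say `no_new_dataset`" would achieve on this label distribution, which is worth keeping in mind when reading accuracy numbers for the fine-tuned model.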

We'll now convert the dataframe to a Hugging Face dataset and push it to the Hub.

In [203]:
#| code-fold: true
from datasets import Dataset, Features, Value, ClassLabel
ds = Dataset.from_polars(
    sample_df.select(["id", "title", "abstract", "labels", "explanations"]),
)
large_string_columns = [
    k
    for k, v in ds.features.items()
    if isinstance(v, Value) and v.dtype == "large_string"
]
for column in large_string_columns:
    ds = ds.cast_column(column, Value("string"))
ds = ds.cast_column("labels", ClassLabel(names=["new_dataset", "no_new_dataset"]))
ds.push_to_hub("davanstrien/arxiv-new-datasets", token=os.getenv("HF_TOKEN"))

Here is the resulting dataset.

<iframe
  src="https://huggingface.co/datasets/davanstrien/arxiv-new-datasets/embed/viewer/default/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>
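
Casting the `labels` column to `ClassLabel` means each label is stored as an integer id given by its position in `names`, so the order of that list matters. A minimal pure-Python sketch of the same mapping, where the `raw` list is made-up example data rather than rows from the dataset:

```python
# Same order as the names passed to ClassLabel above
names = ["new_dataset", "no_new_dataset"]

# Position in the list becomes the integer id
label2id = {name: i for i, name in enumerate(names)}
id2label = {i: name for name, i in label2id.items()}

# Hypothetical string labels, encoded to ids and decoded back
raw = ["no_new_dataset", "new_dataset", "no_new_dataset"]
encoded = [label2id[label] for label in raw]
decoded = [id2label[i] for i in encoded]

print(label2id)
print(encoded)
print(decoded)
```

The fine-tuning code later rebuilds these two dictionaries from `ds.features["labels"].names`, so the model's predictions can be read back as label strings.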

## Fine tuning ModernBERT

Since the focus of this blog post is on the data generation part, I won't go into too much detail here, but you can see the code and the final results below.

In [None]:
#| code-fold: true
%pip install datasets evaluate scikit-learn setfit transformers accelerate --upgrade
%pip install flash-attn --no-build-isolation

from datasets import load_dataset
from evaluate import load
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)
import numpy as np

# Load data
ds = load_dataset("davanstrien/arxiv-new-datasets", split="train")

# label info
labels = ds.features["labels"].names
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# prep a text column combining title and abstract
ds = ds.map(lambda x: {"text": x["title"] + " " + x["abstract"]})
ds = ds.train_test_split(test_size=0.2, stratify_by_column="labels")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")


# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)


# Tokenize datasets
tokenized_datasets = ds.map(tokenize_function, batched=True)

# Load metrics
accuracy = load("accuracy")
f1 = load("f1")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(
        predictions=predictions, references=labels, average="weighted"
    )

    return {
        "accuracy": accuracy_score["accuracy"],
        "f1": f1_score["f1"],
    }


# Load model with a two-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    label2id=label2id,
    id2label=id2label,
)

# Define improved training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=3e-5,  # Slightly higher initial learning rate
    per_device_train_batch_size=8,  # Reduced batch size
    per_device_eval_batch_size=64,
    num_train_epochs=20,  # Upper bound; early stopping can end training sooner
    # Learning rate schedule
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    # Evaluation and saving
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    # Regularization
    weight_decay=0.01,
    max_grad_norm=1.0,
    label_smoothing_factor=0.1,
    # Logging
    logging_dir="./logs",
    logging_strategy="epoch",
)

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize Trainer with early stopping
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.001)
    ],
)
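
# Train the model (early stopping may end training before the final epoch)
trainer.train()

# Evaluate the best checkpoint on the held-out test split
eval_results = trainer.evaluate()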
print("\nFinal evaluation results:", eval_results)

# Save the best model
trainer.save_model("./best_model")

```python
Final evaluation results: {'eval_loss': 0.32631951570510864, 'eval_accuracy': 0.945, 'eval_f1': 0.9442747450661002, 'eval_runtime': 5.8106, 'eval_samples_per_second': 103.26, 'eval_steps_per_second': 1.721, 'epoch': 10.0}
```
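
The `eval_f1` value comes from `average="weighted"`, which computes F1 separately for each class and then averages those scores weighted by how many true examples each class has. Here is a minimal pure-Python sketch of that calculation on a made-up ten-example toy set. The helper names `f1_for_class` and `weighted_f1` are illustrative and not part of `evaluate`.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * n / total for cls, n in support.items()
    )


# Toy, imbalanced example: class 0 is the minority class
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("F1 class 0:", f1_for_class(y_true, y_pred, 0))
print("F1 class 1:", f1_for_class(y_true, y_pred, 1))
print("weighted F1:", weighted_f1(y_true, y_pred))
```

In this toy case a single mistake on the minority class pulls that class's F1 well below the accuracy, while the weighted average stays closer to the majority class's score. With imbalanced labels like ours, it can be worth looking at per-class scores as well as the weighted figure.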

# Conclusion

In this blog post, we've seen how the reasoning abilities of an LLM can be used to generate labels for training an effective classifier. Since lack of training data is one of the main reasons people may use an LLM over a fine-tuned model, using an LLM's reasoning to label data for a smaller model is a great way to get the best of both worlds.