## A simple test of classification data generation

This notebook exercises the generation of synthetic data for classification.

It instantiates all the "variability" elements (the generators for source types, topics, subtopics, personas, and even languages) so that the generated data instances are diverse.
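
As a rough illustration of how these elements multiply, the sketch below enumerates combinations with `itertools.product`, using the counts from the configuration in this notebook. The placeholder values and the flat Cartesian product are assumptions made for illustration: the library may expand the elements hierarchically (for example, topics per source type), and the LLM may return fewer items than requested, so the real number of results can differ.

```python
from itertools import product

# Placeholder values; in the real pipeline each list would come from an LLM call.
source_types = ["source_1", "source_2"]
topics = ["topic_1", "topic_2"]
subtopics = ["subtopic_1", "subtopic_2"]
personas = ["persona_1", "persona_2"]
labels = ["positive", "negative"]
languages = ["Spanish"]
num_instances_per_combination = 2

# Every combination of variability elements (assumed to be a flat product here).
combinations = list(
    product(source_types, topics, subtopics, personas, labels, languages)
)
expected_instances = len(combinations) * num_instances_per_combination

print(f"Combinations: {len(combinations)}")
print(f"Instances if every step returns the requested count: {expected_instances}")
```

Under these assumptions, adding a second language or a third label multiplies the total, which is why small per-element counts are used in this test.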

Note that the language setting essentially tells the LLM to write each example in that language, so the quality of the output depends on how well the chosen LLM handles that particular language.
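
To make that concrete, here is a minimal sketch of a prompt builder that passes the language to the model as plain text. The function `build_instance_prompt` and the prompt wording are hypothetical and are not taken from `synthetic_data_gen`; the library's actual prompts may look quite different.

```python
def build_instance_prompt(task_description: str, label: str, language: str) -> str:
    """Build a hypothetical generation prompt that names the target language."""
    return (
        f"Task: {task_description}\n"
        f"Write one example whose label is '{label}'.\n"
        f"Write the example in {language}."
    )


prompt = build_instance_prompt(
    task_description="A sentiment classification task with positive and negative examples",
    label="positive",
    language="Spanish",
)
print(prompt)
```

Because the language is only an instruction in the prompt, nothing here guarantees that the model answers in that language; checking the output is left to the user.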

In [3]:
import asyncio
import time
from synthetic_data_gen.tasks.classification.classification import ClassificationDataGeneration
from synthetic_data_gen.tasks.elements_generators import (
    SourceTypesGenerationProcessor, TopicsGenerationProcessor, SubtopicsGenerationProcessor,
    PersonasGenerationProcessor, LanguagesGenerationProcessor, LabelsGenerationProcessor,
    ClassificationInstancesGenerationProcessor, ElementsGenerationProcessorBase,
)
from synthetic_data_gen.llms.llm_client import OllamaLLMClient
from synthetic_data_gen.tasks.schemas import ClassificationDataGenConfig, DataGenLabel, GenerationTaskType

# Instantiating a very simple configuration for the generation.
config = ClassificationDataGenConfig(
    task_description='A sentiment classification task with positive and negative examples',
    labels=[DataGenLabel(name='positive', desc='A positive piece of content'), DataGenLabel(name='negative', desc='A negative piece of content')],
    languages=['Spanish'],
    source_types=2,
    topics=2,
    subtopics=2,
    personas=2,
    num_instances_per_combination=2,
)

# Instantiating the LLM client (Ollama for local LLMs for now; clients for external LLMs would be easy to add)
llm_client = OllamaLLMClient(model_name='hf.co/unsloth/gemma-3-4b-it-qat-GGUF:Q4_K_M')

# All the "variability elements"
source_types_generator = SourceTypesGenerationProcessor(llm_client)
topics_generator = TopicsGenerationProcessor(llm_client)
subtopics_generator = SubtopicsGenerationProcessor(llm_client)
personas_generator = PersonasGenerationProcessor(llm_client)
languages_generator = LanguagesGenerationProcessor(llm_client)
labels_generator = LabelsGenerationProcessor(llm_client)
instances_generator = ClassificationInstancesGenerationProcessor(llm_client, temperature=0.6, max_tokens=768)

# Task-type -> task-processor mapping used by the asynchronous task processing
# (it should probably live inside the constructor, since it is unlikely to change)
generation_processors_map: dict[GenerationTaskType, ElementsGenerationProcessorBase] = {
    GenerationTaskType.SOURCE_TYPES: source_types_generator,
    GenerationTaskType.TOPICS: topics_generator,
    GenerationTaskType.SUBTOPICS: subtopics_generator,
    GenerationTaskType.PERSONAS: personas_generator,
    GenerationTaskType.LANGUAGES: languages_generator,
    GenerationTaskType.LABELS: labels_generator,
    GenerationTaskType.INSTANCES: instances_generator,
}

# The main class that generates the data
generator = ClassificationDataGeneration(
    llm_client=llm_client,
    generation_config=config,
    generation_processors_map=generation_processors_map
)

# Data generation process
# perf_counter() is a monotonic clock, which suits measuring elapsed time
start_time = time.perf_counter()
# In a regular Python script you would need asyncio.run(); in Jupyter you can await the coroutine directly
# results = asyncio.run(generator.generate_data_async(num_workers=3))
results = await generator.generate_data_async(num_workers=3)
end_time = time.perf_counter()

print(f'Num results: {len(results)}')
print(f'Time elapsed: {end_time - start_time:.2f} seconds')

Worker 0 is waiting for task...
Worker 0 got a task: generation_elements_tuple=GenerationElementsTuple(source_type=None, topic=None, subtopic=None, persona=None, label=None, language=None, instance=None) next_task=<GenerationTaskType.SOURCE_TYPES: 'source_types'>
Worker 1 is waiting for task...
Worker 2 is waiting for task...
Marking task as done...
Worker 0 is waiting for task...
Worker 0 got a task: generation_elements_tuple=GenerationElementsTuple(source_type='Reddit', topic=None, subtopic=None, persona=None, label=None, language=None, instance=None) next_task=<GenerationTaskType.TOPICS: 'topics'>
Worker 1 got a task: generation_elements_tuple=GenerationElementsTuple(source_type='Twitter', topic=None, subtopic=None, persona=None, label=None, language=None, instance=None) next_task=<GenerationTaskType.TOPICS: 'topics'>
Marking task as done...
Worker 0 is waiting for task...
Worker 0 got a task: generation_elements_tuple=GenerationElementsTuple(source_type='Reddit', topic='Gaming', su
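
The log above suggests a worker pool pulling tasks from a shared queue, where finishing one task can enqueue follow-up tasks. The following is a minimal sketch of that general pattern with `asyncio.Queue`. It is not the implementation of `generate_data_async`; the names `worker` and `run_pipeline`, the tuple-based task format, and the fan-out of two are all assumptions made for illustration.

```python
import asyncio


async def worker(worker_id: int, queue: asyncio.Queue, results: list) -> None:
    """Take tasks from the queue until cancelled; a task may enqueue follow-ups."""
    while True:
        depth, path = await queue.get()
        try:
            await asyncio.sleep(0)  # Stand-in for an LLM call.
            if depth == 0:
                results.append(path)  # Leaf task: record a finished item.
            else:
                # Non-leaf task: fan out into two follow-up tasks.
                for child in range(2):
                    queue.put_nowait((depth - 1, path + (child,)))
        finally:
            queue.task_done()  # Comparable to "Marking task as done..." in the log.


async def run_pipeline(num_workers: int = 3, depth: int = 3) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    queue.put_nowait((depth, ()))  # Seed task, like the initial SOURCE_TYPES task.
    workers = [
        asyncio.create_task(worker(i, queue, results)) for i in range(num_workers)
    ]
    await queue.join()  # Returns once every enqueued task has been marked done.
    for task in workers:
        task.cancel()  # Workers loop forever, so stop them explicitly.
    await asyncio.gather(*workers, return_exceptions=True)
    return results


# In a script, use asyncio.run(); in Jupyter, use `await run_pipeline()` instead.
pipeline_results = asyncio.run(run_pipeline())
print(len(pipeline_results))
```

Follow-up tasks are enqueued before the parent calls `task_done()`, so `queue.join()` cannot return while work is still pending.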

In [4]:
# Print results
for result in results:
    print(result)

source_type='Reddit' topic='Gaming' subtopic='Console Gaming' persona='David is a 28-year-old avid console gamer and Redditor who frequently discusses game mechanics and competitive strategies on r/Gaming.' label='positive' language='Spanish' instance='¡Qué pasada el nuevo parche de Elden Ring! Las mejoras en la jugabilidad son increíbles, me está encantando.'
source_type='Reddit' topic='Gaming' subtopic='Console Gaming' persona='David is a 28-year-old avid console gamer and Redditor who frequently discusses game mechanics and competitive strategies on r/Gaming.' label='positive' language='Spanish' instance='¡Por fin conseguí el boss final de God of War Ragnarök! La satisfacción es inmensa.'
source_type='Reddit' topic='Gaming' subtopic='Console Gaming' persona='David is a 28-year-old avid console gamer and Redditor who frequently discusses game mechanics and competitive strategies on r/Gaming.' label='negative' language='Spanish' instance='¡Qué frustrante! El lag en este juego es insop