# Text Processing with LLMs

In [5]:
!pip install transformers datasets evaluate --quiet

from transformers import pipeline, set_seed, AutoModelForCausalLM, AutoTokenizer
import torch
import random
import evaluate




In [6]:
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


Using device: cpu


In [7]:
prompts = [
    "The sun sets behind the hill,",
    "Through forest deep and shadowed glen,",
    "Upon the sea's eternal crest,",
    "Soft winds blow through fields of rye,"
]


## a. Generation with a pre-trained (general-purpose) LLM

In [8]:
generator_general = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B", device=0 if device == "cuda" else -1)
results_general = []

for prompt in prompts:
    generated = generator_general(prompt, max_length=30, num_return_sequences=1, temperature=0.7)[0]['generated_text']
    results_general.append(generated)
    print(f"Prompt: {prompt}\n{generated}\n{'-'*40}")


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: The sun sets behind the hill,
The sun sets behind the hill, but it’s not the only color change. The grass is still green, but the leaves are brown,
----------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Through forest deep and shadowed glen,
Through forest deep and shadowed glen,

The great, wild, untamed,

The Great, Wild, Untamed


----------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Upon the sea's eternal crest,
Upon the sea's eternal crest, when the wind blows cold,

And the waves beat in the light of the moon,

O'
----------------------------------------
Prompt: Soft winds blow through fields of rye,
Soft winds blow through fields of rye, oats, and the like, and the
greater the distance from the village, the more distinctly the sky
----------------------------------------


## b. Generation with an LLM adapted to a poetry corpus

In [9]:
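from transformers import AutoTokenizer, AutoModelForCausalLM

# Open-source, capable model. Note: this is the same base checkpoint used in
# section a; no poetry fine-tuning is applied in this cell.
model_name = "EleutherAI/gpt-neo-1.3B"

tokenizer_poetry = AutoTokenizer.from_pretrained(model_name)
model_poetry = AutoModelForCausalLM.from_pretrained(model_name).to(device)

def generate_poetry(prompt):
    """Tokenize the prompt, generate a continuation, and decode it to text."""
    inputs = tokenizer_poetry(prompt, return_tensors="pt").to(device)
    # Without do_sample=True, generate() decodes greedily and ignores temperature;
    # max_length counts the prompt tokens as well as the generated ones.
    outputs = model_poetry.generate(**inputs, max_length=40, temperature=0.8)
    return tokenizer_poetry.decode(outputs[0], skip_special_tokens=True)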

results_poetry = []
for prompt in prompts:
    poem = generate_poetry(prompt)
    results_poetry.append(poem)
    print(f"Prompt: {prompt}\n{poem}\n{'-'*40}")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: The sun sets behind the hill,
The sun sets behind the hill, and the sky is a deep blue. The wind is blowing, and the trees are swaying in the wind. The grass is green, and the grass is green
----------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Through forest deep and shadowed glen,
Through forest deep and shadowed glen, the

darkness of the night

is the only light

that I see

and I am

the only one


----------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Upon the sea's eternal crest,
Upon the sea's eternal crest,

The sea's eternal crest,

The sea's eternal crest,

The sea's eternal crest,

The sea's eternal crest,

----------------------------------------
Prompt: Soft winds blow through fields of rye,
Soft winds blow through fields of rye, and the air is filled with the smell of burning hay. The sun is hot, and the air is heavy with the scent of burning wood. The wind is
----------------------------------------
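
The repeated lines in the third output above are typical of greedy decoding: `generate()` applies `temperature` only when `do_sample=True`. Below is a minimal sketch of sampling settings that could be passed to `model_poetry.generate`. The helper name `sampling_kwargs` and the specific values are illustrative assumptions, not tuned settings; the default `pad_token_id` of 50256 is taken from the log messages above.

```python
def sampling_kwargs(temperature=0.8, top_p=0.95, max_new_tokens=40, pad_token_id=50256):
    """Build keyword arguments that switch generate() from greedy decoding to sampling.

    Hypothetical helper: the values are assumptions chosen for illustration.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    return {
        "do_sample": True,             # enables sampling, so temperature/top_p take effect
        "temperature": temperature,    # lower values make the output more conservative
        "top_p": top_p,                # nucleus sampling cutoff
        "no_repeat_ngram_size": 3,     # forbids repeating any 3-gram verbatim
        "max_new_tokens": max_new_tokens,  # counts generated tokens only, not the prompt
        "pad_token_id": pad_token_id,  # silences the pad_token_id log message
    }
```

Inside `generate_poetry`, this could be used as `model_poetry.generate(**inputs, **sampling_kwargs())`. Sampled outputs vary with the seed, so results would differ from the ones recorded above.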


## c. Qualitative and linguistic analysis

In [10]:
bleu = evaluate.load("bleu")

# Simplified BLEU comparison between the general and the poetic model.
# Caveat: each reference is just the prompt plus " ...", so the score mostly
# reflects overlap with the prompt and output length, not poetic quality.
references = [[p + " ..."] for p in prompts]  # simple references
candidates = results_general[:len(prompts)]
score_general = bleu.compute(predictions=candidates, references=references)

candidates_poetic = results_poetry[:len(prompts)]
score_poetry = bleu.compute(predictions=candidates_poetic, references=references)

print("BLEU score - general LLM:", score_general)
print("BLEU score - poetic LLM:", score_poetry)


BLEU score - general LLM: {'bleu': 0.23386215282576533, 'precisions': [0.29292929292929293, 0.25263157894736843, 0.21978021978021978, 0.1839080459770115], 'brevity_penalty': 1.0, 'length_ratio': 2.475, 'translation_length': 99, 'reference_length': 40}
BLEU score - poetic LLM: {'bleu': 0.1796755015116255, 'precisions': [0.24615384615384617, 0.19047619047619047, 0.16393442622950818, 0.13559322033898305], 'brevity_penalty': 1.0, 'length_ratio': 3.25, 'translation_length': 130, 'reference_length': 40}
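
BLEU measured against the prompts says little about how varied the generated text is. A common complementary measure is the distinct-n ratio (unique n-grams divided by total n-grams). The sketch below is one simple way to compute it, assuming whitespace tokenization; the function name `distinct_n` is illustrative and not part of the `evaluate` library.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts.

    Uses lowercased whitespace tokens; returns 0.0 when there are no n-grams.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

It could be applied to the lists built earlier, for example `distinct_n(results_general)` and `distinct_n(results_poetry)`; a lower value indicates more repetition.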


### Questions:
- **c.1** Texts generated with the poetic LLM are more stylistically coherent, but less diverse.
- **c.2** English prompts work well with models trained on English.
- **c.3** Romanian prompts do not give good results without models trained on Romanian.
- **c.4** Romanian prompt + English model => major incoherence.
- **c.5** For pastel poems (*pasteluri*), use fine-tuning on a thematic corpus or prompt engineering with explicit instructions (e.g. "Write in the style of a Romanian pastel poet").
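
The prompt-engineering option in c.5 can be sketched as a small template function that prefixes an explicit style instruction to the opening line. The name `build_style_prompt` and the instruction wording are assumptions made for illustration. A base model such as GPT-Neo is not instruction-tuned, so it may not follow the instruction reliably.

```python
def build_style_prompt(first_line, style="a Romanian pastel poet", n_lines=4):
    """Prefix an explicit style instruction to the opening line of the poem.

    Hypothetical helper: the instruction text is an illustrative assumption.
    """
    instruction = f"Write a {n_lines}-line nature poem in the style of {style}."
    return f"{instruction}\n\n{first_line.strip()}\n"
```

The result could be passed to the generators above in place of the raw prompt, for example `generate_poetry(build_style_prompt(prompts[0]))`.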


## 2. Saving the favorite poem

In [11]:
poezie_preferata = results_poetry[0]
# Write with an explicit encoding so non-ASCII characters are saved consistently.
with open("poezie_preferata.txt", "w", encoding="utf-8") as f:
    f.write(poezie_preferata)
print("The poem was saved.")

The poem was saved.
