# 🔬 Hands-On Persian NLP with 🤗 Transformers

This notebook walks you through **three core tasks** (Named-Entity Recognition, Sentiment
Analysis, and Text Generation) using publicly available Persian checkpoints on
Hugging Face.  
All code cells are complete and runnable; feel free to tweak the examples or
replace the input strings with your own Persian text.


## 🛠 0  Environment Setup


In [None]:
# Core Hugging Face stack
!pip install --upgrade transformers accelerate sentencepiece -q


## 📑 1  Named-Entity Recognition (NER) with BERT-fa
We will load a BERT-based sequence-labeling model fine-tuned on
the PEYMA news corpus.

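With `aggregation_strategy="simple"`, the pipeline merges sub-token predictions into
word-level groups. This checkpoint's labels use an underscore (`B_PER`, `I_PER`), and in the
output below the two parts of a person's name still come back as separate groups.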

In [None]:
# Research task 1
# 1. Go to the Hugging Face Model Hub and search for a Persian NER checkpoint (hint: keywords "bert-fa" and "ner").
# 2. Read the model card to note its model-id.
# 3. In a new cell, create a pipeline("ner", model=MODEL_ID, aggregation_strategy="simple").
# 4. Route it to GPU if available (device=0); otherwise stay on CPU (device=-1).
# 5. Test the pipeline on a Persian sentence of your choice and pretty-print the entities.

!pip install transformers torch
from transformers import pipeline
import pprint
import torch

MODEL_ID = "HooshvareLab/bert-fa-base-uncased-ner-peyma"
device = 0 if torch.cuda.is_available() else -1

ner = pipeline(
    task="ner",
    model=MODEL_ID,
    aggregation_strategy="simple",
    device=device
)


In [2]:
test_sentence = "دکتر سارا گل‌محمدی در تهران سخنرانی کرد."
entities = ner(test_sentence)

print("Input:", test_sentence)
print("Detected entities:")
pprint.pp(entities, compact=True)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Input: دکتر سارا گل‌محمدی در تهران سخنرانی کرد.
Detected entities:
[{'entity_group': 'B_PER',
  'score': np.float32(0.999181),
  'word': 'سارا',
  'start': 5,
  'end': 9},
 {'entity_group': 'I_PER',
  'score': np.float32(0.9997402),
  'word': 'گلمحمدی',
  'start': 10,
  'end': 18},
 {'entity_group': 'B_LOC',
  'score': np.float32(0.99893755),
  'word': 'تهران',
  'start': 22,
  'end': 27}]
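
In the output above, the person's name comes back as two groups (`B_PER` and `I_PER`) rather than one entity. Below is a minimal post-processing sketch. It assumes the groups arrive in text order, and that an `I_` group continues the previous one only when the labels match and at most one character (a space) separates them. `merge_bio_groups` and its `max_gap` parameter are hypothetical, not part of Transformers.

```python
def merge_bio_groups(groups, text, max_gap=1):
    """Merge adjacent B_X / I_X groups into single entity spans.

    Hypothetical helper: `groups` is the list returned by the NER pipeline,
    `text` is the original input string.
    """
    merged = []
    for group in groups:
        tag = group["entity_group"]
        prefix, sep, label = tag.partition("_")
        if not sep:  # tag without a B_/I_ prefix: treat it as a new entity
            prefix, label = "B", tag
        previous = merged[-1] if merged else None
        continues_previous = (
            prefix == "I"
            and previous is not None
            and previous["label"] == label
            and group["start"] - previous["end"] <= max_gap
        )
        if continues_previous:
            previous["end"] = group["end"]
            # Keep the weakest score as a conservative confidence.
            previous["score"] = min(previous["score"], float(group["score"]))
        else:
            merged.append({
                "label": label,
                "start": group["start"],
                "end": group["end"],
                "score": float(group["score"]),
            })
    for entity in merged:
        entity["text"] = text[entity["start"]:entity["end"]]
    return merged


# Offsets and scores copied from the output above; \u200c is the
# zero-width non-joiner inside the surname.
sentence = "دکتر سارا گل\u200cمحمدی در تهران سخنرانی کرد."
sample_groups = [
    {"entity_group": "B_PER", "score": 0.999181, "start": 5, "end": 9},
    {"entity_group": "I_PER", "score": 0.9997402, "start": 10, "end": 18},
    {"entity_group": "B_LOC", "score": 0.99893755, "start": 22, "end": 27},
]
merged = merge_bio_groups(sample_groups, sentence)
print([(e["label"], e["start"], e["end"]) for e in merged])
# [('PER', 5, 18), ('LOC', 22, 27)]
```

Taking the minimum score is one conservative choice; averaging the group scores would be an equally reasonable alternative.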


## 😊 2  Sentiment Analysis with BERT-fa (Digikala Reviews)
This checkpoint classifies short product reviews with the Digikala labels
`recommended`, `not_recommended`, and `no_idea` rather than generic POSITIVE/NEGATIVE tags.

We will score a handful of sample sentences.



In [None]:
# Research task 2
# 1. Browse the Hugging Face Hub for a Persian sentiment-analysis model (hint: search "bert-fa sentiment digikala" or "parsbert sentiment").
# 2. Copy its model-id from the model card.
# 3. Build a pipeline("sentiment-analysis", model=MODEL_ID, device=0 or -1) in a new cell.
# 4. Run the pipeline on at least four Persian reviews (two positive, two negative) and print the predicted labels.
# 5. Briefly comment on whether the predictions match your intuition.

In [3]:
from transformers import pipeline

MODEL_ID = "HooshvareLab/bert-fa-base-uncased-sentiment-digikala"
sentiment_pipe = pipeline("sentiment-analysis", model=MODEL_ID)

reviews = [
    "این گوشی واقعاً بی‌نظیره!",  # "This phone is truly exceptional!"
    "طراحی زیبا ولی باتری خیلی ضعیفه.",  # "Beautiful design but very weak battery."
    "کیفیت ساختش عالیه و ارزش خرید داره.",  # "Its build quality is excellent and worth buying."
    "قیمت نسبت به امکاناتش خیلی بالاست.",  # "The price is too high for its features."
]

for r in reviews:
    result = sentiment_pipe(r)[0]
    print(f"{r} → {result['label']} (confidence: {result['score']:.2f})")



Device set to use cuda:0


این گوشی واقعاً بی‌نظیره! → recommended (confidence: 0.98)
طراحی زیبا ولی باتری خیلی ضعیفه. → no_idea (confidence: 0.74)
کیفیت ساختش عالیه و ارزش خرید داره. → recommended (confidence: 0.97)
قیمت نسبت به امکاناتش خیلی بالاست. → no_idea (confidence: 0.58)
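
The labels printed above are Digikala recommendation tags, not polarity classes. If downstream code expects positive/negative values, one option is a small mapping with a confidence floor. The mapping, the `0.6` threshold, and the helper name `to_polarity` are illustrative assumptions, and treating `no_idea` as neutral is a simplification.

```python
# Hypothetical mapping from the checkpoint's labels to coarse polarity.
LABEL_TO_POLARITY = {
    "recommended": "positive",
    "not_recommended": "negative",
    "no_idea": "neutral",
}


def to_polarity(prediction, min_score=0.6):
    """Map one pipeline prediction to a polarity string.

    Predictions below `min_score` are reported as "uncertain" rather than
    being forced into a class; unseen labels fall back to "unknown".
    """
    if prediction["score"] < min_score:
        return "uncertain"
    return LABEL_TO_POLARITY.get(prediction["label"], "unknown")


# Labels and scores copied from the output above.
examples = [
    {"label": "recommended", "score": 0.98},
    {"label": "no_idea", "score": 0.74},
    {"label": "no_idea", "score": 0.58},
]
polarities = [to_polarity(p) for p in examples]
print(polarities)  # ['positive', 'neutral', 'uncertain']
```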


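## ✍️ 3  Text Generation with GPT-2-fa
GPT2-fa is a 124M-parameter GPT-2 adapted to 500M Persian tokens.
Below we build a system + user prompt, generate up to 200 new tokens, and print the response after stripping the prompt. GPT2-fa is a plain causal language model rather than an instruction-tuned chat model, so it treats the "system" instruction as ordinary text to continue.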
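
To make the `temperature` and `top_p` arguments used in the next cell less opaque, here is a conceptual sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. It uses only the standard library and is not the Transformers implementation; the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches `top_p`, then renormalise over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id from the temperature-scaled nucleus."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    token_ids = list(nucleus)
    weights = [nucleus[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Toy distribution: the least likely token falls outside the nucleus.
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(nucleus))  # [0, 1, 2]
```

A lower temperature sharpens the distribution before filtering, so fewer tokens are needed to reach the `top_p` mass and the output becomes more deterministic.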

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "HooshvareLab/gpt2-fa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
gpt2fa = AutoModelForCausalLM.from_pretrained(model_name).to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

system_instruction = "شما یک دستیار فارسی هستید که به سؤالات با لحن دوستانه پاسخ می‌دهید."
user_prompt = "سلام! میشه دربارهٔ هوش مصنوعی توضیح بدی؟"
full_prompt = f"{system_instruction}\n\n{user_prompt}"

inputs = tokenizer(full_prompt, return_tensors="pt").to(gpt2fa.device)
outputs = gpt2fa.generate(
    **inputs,
    max_new_tokens=200,
    top_p=0.9,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

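# Decode only the newly generated tokens; slicing the decoded string by
# len(full_prompt) breaks if decoding does not reproduce the prompt exactly.
prompt_len = inputs["input_ids"].shape[1]
assistant_reply = tokenizer.decode(
    outputs[0][prompt_len:], skip_special_tokens=True
).strip()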

print("🟢 Assistant reply:\n")
print(assistant_reply)


🟢 Assistant reply:

ما از هوش مصنوعی استفاده می‌کنیم. به این معنی که ما به جای اینکه دربارهٔ یک موضوع خاص صحبت کنیم، دربارهٔ آن موضوع صحبت می‌کنیم. مثلا به جای اینکه راجع به موضوعی خاص صحبت کنیم، دربارهٔ موضوع مورد بحث صحبت می‌کنیم. به این صورت که اگر در مورد موضوعی صحبت می‌کنیم، دربارهٔ آن موضوع صحبت می‌کنیم. به این معنی که به جای اینکه دربارهٔ موضوعی صحبت کنیم، دربارهٔ آن موضوع صحبت می‌کنیم. به این معنی که راجع به موضوعی صحبت می‌کنیم که دربارهٔ آن صحبت می‌کنیم. به این معنی که وقتی راجع به موضوعی صحبت می‌کنیم، راجع به آن موضوع صحبت می‌کنیم. به این معنی
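
The reply above keeps cycling through near-identical sentences. One rough way to quantify such looping is the share of distinct word n-grams in the text. The sketch below is an illustrative metric with a hypothetical helper name and naive whitespace tokenisation. If the ratio is low, `generate()` arguments such as `repetition_penalty` or `no_repeat_ngram_size` are worth experimenting with.

```python
def distinct_ngram_ratio(text, n=3):
    """Return distinct n-grams / total n-grams over whitespace tokens.

    Values near 1.0 mean little repetition; values near 0 mean the text
    keeps reusing the same phrases. Texts shorter than n tokens return 1.0.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)


looping = "a b c a b c a b c"
varied = "a b c d e f g h i"
print(distinct_ngram_ratio(looping, n=2))  # 0.375
print(distinct_ngram_ratio(varied, n=2))   # 1.0
```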

## 🎯 4  End-to-End Demo: Generate → Sentiment → NER
The function below:

* Generates text from a short seed using GPT-2-fa.

* Predicts the overall sentiment of the generated paragraph.

* Extracts named entities inside that text.


In [6]:
def generate_analyse(seed: str, max_new: int = 120):
    full_prompt = system_instruction + "\n\n" + seed
    inputs = tokenizer(full_prompt, return_tensors="pt").to(gpt2fa.device)
    gen_ids = gpt2fa.generate(
        **inputs,
        max_new_tokens=max_new,
        do_sample=True,
        top_p=0.9,
        temperature=0.85,
    )
    # Decode only the new tokens rather than slicing the decoded string,
    # which assumes decoding reproduces the prompt character for character.
    prompt_len = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(
        gen_ids[0][prompt_len:], skip_special_tokens=True
    ).strip()

    sentiment = sentiment_pipe(generated)[0]
    entities  = ner(generated)

    return {
        "generated_text": generated,
        "sentiment": sentiment,
        "entities": entities,
    }

result = generate_analyse("یک مشتری در مورد لپ‌تاپ جدید ایسوس اینگونه نوشت:")
print("📜 Generated text:\n", result["generated_text"])
print("\n🔶 Sentiment:", result["sentiment"])
print("\n🔷 Entities:")
pprint.pp(result["entities"], compact=True)


Setting `pad_token_id` to `eos_token_id`:5 for open-end generation.


📜 Generated text:
 «ما در این لپ‌تاپ از پردازنده‌های نسل ششم اینتل و کارت گرافیک انویدیا استفاده کرده‌ایم و تمام تلاش خود را برای استفاده حداکثری از توان گرافیکی آن به کار برده‌ایم تا تجربه کار با لپ‌تاپ‌های نسل ششم را بیش از پیش بهبود بخشیم.» ایسوس برای اولین بار در تاریخ خود، یک لپ‌تاپ مجهز به پردازنده‌های نسل ششم اینتل را با عنوان UX305 معرفی کرده است. در نگاه اول، UX305 در مقایسه با UX305 دارای قیمت پایین‌تری است و در مقایسه با UX305 قیمت بالاتری دارد. نمایشگر UX

🔶 Sentiment: {'label': 'recommended', 'score': 0.7739930748939514}

🔷 Entities:
[{'entity_group': 'B_ORG',
  'score': np.float32(0.7389317),
  'word': 'اینتل',


## 📝 5  Reflection (write your answers in Markdown)
1. Which task (NER, sentiment, or generation) appeared most robust on your custom inputs? Why?

2. How could you improve sentiment accuracy on longer, mixed-tone reviews?

3. Suggest a real-world product idea that chains generation + sentiment + NER.