<a href="https://colab.research.google.com/github/abdelhadidjafer02-beep/GPT-2/blob/main/arabic_hedging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Do Arabic LLMs show stable hedging behavior across prompts?

### Goal

Test whether a simple linguistic feature, the frequency of Arabic hedging words such as **"قد"** ("may", "might") and **"ربما"** ("perhaps"), varies **systematically across prompts** in *AraGPT2-medium*, beyond what can be explained by sampling randomness alone.

This notebook performs a **behavioral screening step** intended to decide whether this phenomenon is sufficiently stable to justify later mechanistic interpretability analysis.
**No mechanistic claims are made here.**

---

### Why this matters

If hedging frequency varies reliably across prompts while remaining relatively stable within a prompt, this is consistent with (though does not prove) the presence of internally represented distinctions (e.g. uncertainty, caution, or topic sensitivity).

Establishing such stability is a prerequisite for responsible mechanistic analysis.

---

### Scope and constraints

* This notebook is intentionally **minimal and exploratory**
* **AraGPT2-medium** is used solely as a **low-cost behavioral screening model**
* Results are used only to decide whether to proceed further
* Failure to observe stability would be treated as a stopping condition


In [None]:
!pip install transformers torch



In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "aubmindlab/aragpt2-medium"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(64000, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=64000, bias=False)
)

### Prompt selection

We use a small set of **neutral, high-level prompts**.

These prompts:

* Are not adversarial
* Do not explicitly induce uncertainty or caution
* Avoid emotionally charged or safety-critical framing

This is intended to reduce the risk of **prompt-induced hedging**, allowing us to probe whether hedging varies naturally with topic rather than instruction.


In [None]:
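prompts = [
    "ما هو تأثير التغير المناخي ؟",  # What is the impact of climate change?
    "هل يجب علينا استخدام الطاقة النووية ؟",  # Should we use nuclear energy?
    "كيف يمكن تحسين التعليم ؟",  # How can education be improved?
    "ما هي فوائد القراءة ؟",  # What are the benefits of reading?
    "هل يمكن الاعتماد على الذكاء الاصطناعي بالكامل ؟",  # Can AI be relied on completely?
]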


### Sampling strategy

For each prompt, we generate multiple continuations in order to distinguish:

* **Within-prompt variation** (sampling randomness)
* **Across-prompt variation** (prompt-dependent behavior)

The random seed is set per sample (`seed + i`) so that runs are reproducible and the observed within-prompt variance reflects sampling effects rather than nondeterministic execution differences.

This comparison is central to avoiding over-interpretation of noise.


In [None]:
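def generate_samples(prompt, n_samples=6, max_new_tokens=180, seed=0):
    """Sample n_samples continuations of prompt, seeding each draw with seed + i."""
    # Tokenise once; the inputs are identical for every sample.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    samples = []
    for i in range(n_samples):
        torch.manual_seed(seed + i)  # reproducible per-sample seed
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id warning
        )
        # The decoded text contains the prompt followed by the continuation.
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        samples.append(text)
    return samples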
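
For decoder-only models, `model.generate` returns the prompt tokens followed by the continuation, so each decoded sample begins with the prompt itself (as the printed samples show). If only the generated text should be scored, the prompt can be removed first. A minimal sketch, assuming the decoded text starts with the exact prompt string; `strip_prompt` is a hypothetical helper, not part of the original notebook:

```python
def strip_prompt(text: str, prompt: str) -> str:
    """Return the continuation only, dropping a leading copy of the prompt."""
    # Hypothetical helper: if decoding altered the prompt (e.g. whitespace
    # normalisation), fall back to the full text so nothing is lost.
    if text.startswith(prompt):
        return text[len(prompt):].lstrip()
    return text


example = "ما هي فوائد القراءة ؟ القراءة تنمي العقل"
print(strip_prompt(example, "ما هي فوائد القراءة ؟"))
```

None of the five prompts used here contains a hedging word, so this step would not change the counts in this notebook; it matters only if prompts with hedging words are added later.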


In [None]:
baseline_outputs = {}

for prompt in prompts:
    baseline_outputs[prompt] = generate_samples(prompt)

for prompt, samples in baseline_outputs.items():
    print("PROMPT:", prompt)
    for i, s in enumerate(samples, 1):
        print(f"Sample {i}: {s[:300]}...")
    print("-" * 60)



PROMPT: ما هو تأثير التغير المناخي ؟
Sample 1: ما هو تأثير التغير المناخي ؟ هل يمكن أن تؤثر على الاقتصاد الأمريكي ؟ كيف يؤثر ذلك على السياسة الأمريكية ؟ وماذا تعني هذه التغييرات ؟ هذا ما سوف نعرفه في نهاية هذا المقال .تغير المناخ الذي شهدته الولايات المتحدة منذ عدة سنوات قد يكون له أثر كبير على الاقتصاد العالمي ، حيث إن التغيرات المناخية لم تكن ...
Sample 2: ما هو تأثير التغير المناخي ؟ وكيف يمكن أن يؤثر على البيئة ؟ وهل تؤثر تلك التغيرات على حياتنا اليومية ؟ . . .في كتابه " كيف تعمل الحياة اليومية " ، يتحدث الكاتب روبرت سابولسكي عن ظاهرة الاحتباس الحراري ، التي وصفها بأنها أحد أخطر المشاكل البيئية العالمية . ويقول في هذا الكتاب إنه " في الوقت الذي أصبح...
Sample 3: ما هو تأثير التغير المناخي ؟ وماذا عن التأثيرات الاجتماعية والاقتصادية ؟ ولماذا لا يمكن معالجة هذه الأزمة ؟ ما هي الحلول التي تقترحها الحكومة اللبنانية لمعالجة هذا الوضع ؟* * من بين كل الأمراض التي يمكن أن تصيب لبنان ، هناك أمراضا مزمنة أخرى غير قابلة للشفاء ، مثل السرطان ، وأمراض القلب ، والسكري ، ...
Sample 4: ما هو تأثير

### Feature definition: hedging words

We focus on a **single, explicitly defined feature**: the presence of common Arabic hedging words.

This choice is intentional:

* Easy to define and audit
* Linguistically meaningful in Arabic
* Less fragile than morphology or syntax
* Avoids reliance on complex NLP pipelines

At this stage, the goal is **feature viability**, not semantic completeness or exhaustiveness.


In [None]:
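HEDGING_TOKENS = ["قد", "ربما"]  # "qad" (may, might) and "rubbama" (perhaps)

def count_hedges(text):
    """Count occurrences of the hedging strings in text."""
    # str.count matches substrings, so "قد" inside a longer word
    # (e.g. "فقد", "القدرة") is counted as well.
    return sum(text.count(tok) for tok in HEDGING_TOKENS)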
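
Because `str.count` matches substrings, a stricter variant that counts only standalone words may be worth comparing against. The sketch below is an assumption-laden alternative, not the counter used to produce the results in this notebook: `count_hedges_whole_word` and `HEDGE_PATTERN` are hypothetical names, and the pattern treats a word as any run of Unicode word characters, so forms with attached clitics (e.g. "وقد") or diacritics are not matched.

```python
import re

HEDGING_WORDS = ["قد", "ربما"]

# (?<!\w) and (?!\w) require that no word character touches the match,
# so "قد" inside "القدرة" or "فقد" is ignored.
HEDGE_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(w) for w in HEDGING_WORDS) + r")(?!\w)"
)


def count_hedges_whole_word(text: str) -> int:
    """Count hedging words that appear as standalone words (hypothetical variant)."""
    return len(HEDGE_PATTERN.findall(text))


for sample in ["قد يكون ذلك صحيحا", "القدرة على التعلم", "ربما قد يحدث"]:
    substring_count = sum(sample.count(w) for w in HEDGING_WORDS)
    print(sample, "| substring:", substring_count,
          "| whole word:", count_hedges_whole_word(sample))
```

Even the whole-word count remains a surface measure: "قد" before a past-tense verb typically signals completion or emphasis rather than uncertainty, which neither counter distinguishes.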


In [None]:
import pandas as pd

rows = []

for prompt, samples in baseline_outputs.items():
    for i, text in enumerate(samples):
        rows.append({
            "prompt": prompt,
            "sample_id": i,
            "hedge_count": count_hedges(text),
        })

df = pd.DataFrame(rows)
df


| | prompt | sample_id | hedge_count |
| --- | --- | --- | --- |
| 0 | ما هو تأثير التغير المناخي ؟ | 0 | 3 |
| 1 | ما هو تأثير التغير المناخي ؟ | 1 | 2 |
| 2 | ما هو تأثير التغير المناخي ؟ | 2 | 1 |
| 3 | ما هو تأثير التغير المناخي ؟ | 3 | 0 |
| 4 | ما هو تأثير التغير المناخي ؟ | 4 | 1 |
| 5 | ما هو تأثير التغير المناخي ؟ | 5 | 4 |
| 6 | هل يجب علينا استخدام الطاقة النووية ؟ | 0 | 1 |
| 7 | هل يجب علينا استخدام الطاقة النووية ؟ | 1 | 0 |
| 8 | هل يجب علينا استخدام الطاقة النووية ؟ | 2 | 0 |
| 9 | هل يجب علينا استخدام الطاقة النووية ؟ | 3 | 1 |


In [None]:
df.groupby("prompt")["hedge_count"].agg(["mean", "std"])

| prompt | mean | std |
| --- | --- | --- |
| كيف يمكن تحسين التعليم ؟ | 0.666667 | 0.816497 |
| ما هو تأثير التغير المناخي ؟ | 1.833333 | 1.47196 |
| ما هي فوائد القراءة ؟ | 1.166667 | 1.47196 |
| هل يجب علينا استخدام الطاقة النووية ؟ | 0.5 | 0.547723 |
| هل يمكن الاعتماد على الذكاء الاصطناعي بالكامل ؟ | 0.5 | 0.83666 |
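
The `std` column is the sample standard deviation (pandas uses `ddof=1` by default). As a quick sanity check, the climate-change row can be reproduced from the six per-sample counts shown earlier using only the standard library; the variable name `climate_counts` is introduced here for illustration.

```python
import statistics

# hedge_count values for the climate-change prompt, copied from the per-sample rows.
climate_counts = [3, 2, 1, 0, 1, 4]

mean = statistics.mean(climate_counts)
std = statistics.stdev(climate_counts)  # sample standard deviation (divides by n - 1)

print(f"mean={mean:.6f} std={std:.6f}")  # mean=1.833333 std=1.471960
```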


### Observations

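Across this small sample (six generations per prompt):

* Mean hedge counts differ **across prompts**, from 0.50 to 1.83 per generation
* Within-prompt standard deviations (0.55 to 1.47) are **of the same order as** the differences between prompt means, so the separation between prompts is suggestive rather than clear
* The climate-change prompt elicits the most hedging words on average, while the nuclear-energy and AI prompts elicit the fewest

---

### Interpretation

This pattern hints at prompt-dependent hedging, but at this sample size it does not establish that hedging behavior is **not dominated by sampling noise**.

Importantly, any prompt dependence that does exist would be consistent with *multiple explanations*: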

* Internally represented uncertainty-related features
* Shallow statistical correlations learned during training
* Topic-specific stylistic conventions (e.g. news-like text)

At this stage, we do not attempt to distinguish between these possibilities.


### Limitations

* Sample size is small and intended only for qualitative screening
* Some generations exhibit repetition or newswire-like structure
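* Hedging is measured via surface substring counts, not semantic parsing, so "قد" inside longer words and non-hedging uses of "قد" (e.g. before past-tense verbs) are also counted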
* Results should not be interpreted as evidence of any specific internal mechanism

These limitations are acceptable for a **filtering step**, but would need to be addressed before any substantive mechanistic claims.


### Stopping criterion

If within-prompt variance were comparable to or larger than across-prompt variance, we would conclude that hedging behavior is dominated by sampling noise and **abandon mechanistic follow-up**.

Any observed prompt-level difference is treated only as a necessary (not sufficient) condition for proceeding.
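
One way to make this criterion concrete is to compare a between-prompt mean square with a within-prompt mean square, as in a one-way ANOVA. The sketch below is an illustrative assumption rather than part of the original analysis: `variance_ratio` is a hypothetical helper, it assumes equal group sizes, and it works from the per-prompt means and standard deviations reported in the summary table with `n_samples=6`.

```python
import statistics


def variance_ratio(means, stds, n_per_group):
    """Between-prompt over within-prompt mean square for equal-sized groups."""
    # Between-prompt mean square: n * sample variance of the prompt means.
    between = n_per_group * statistics.variance(means)
    # Within-prompt mean square: average of the per-prompt sample variances.
    within = statistics.mean([s ** 2 for s in stds])
    return between / within


# Values copied from the per-prompt summary table; six samples per prompt.
means = [0.666667, 1.833333, 1.166667, 0.5, 0.5]
stds = [0.816497, 1.47196, 1.47196, 0.547723, 0.83666]

ratio = variance_ratio(means, stds, n_per_group=6)
print(f"between/within variance ratio: {ratio:.2f}")
```

For the table values this gives a ratio of roughly 1.6. Ratios well above 1 point to prompt-level differences beyond sampling noise, while ratios near 1 do not; with only six samples per prompt the ratio is itself noisy, so it is a descriptive aid rather than a significance test.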


### Why we stop here

We deliberately avoid internal activation analysis at this stage.

If no stable behavioral signal exists, mechanistic interpretability would be premature.
This notebook establishes whether such a signal is plausibly present.

Any future work would:

* Re-establish the behavioral effect on the target model
* Introduce additional controls
* Proceed with mechanistic tools only after renewed validation
