<div style="
    background-color:#1e1e1e;
    color:#f5f5f5;
    padding:35px;
    border-radius:12px;
    font-family:'Segoe UI', 'Helvetica Neue', sans-serif;
    line-height:1.7;
    font-size:15px;
">

<h1 style="text-align:center; color:#ffc107;">🧩 Understanding Perplexity in Language Models</h1>

<p>
<b>Perplexity (PP)</b> quantifies how well a language model predicts each next token in a text.
It is essentially the model’s <i>“degree of surprise.”</i>
Low perplexity means the model found the text predictable; high perplexity means the model was surprised or uncertain.
</p>

<hr style="border:0.5px solid #444;">

<h2 style="color:#ffb74d;">📘 Mathematical Definition</h2>

<p style="text-align:center;">
PP = exp(H) = exp( −(1/N) Σ<sub>i=1</sub><sup>N</sup> log P(w<sub>i</sub> | w<sub>1</sub>, ..., w<sub>i−1</sub>) )
</p>

<ul>
  <li><b>P(w<sub>i</sub> | w<sub>1</sub>, ..., w<sub>i−1</sub>)</b>: Probability the model assigns to token w<sub>i</sub> given the preceding tokens.</li>
  <li><b>N</b>: Number of tokens in the sequence.</li>
  <li><b>log P</b>: Natural-log probability of each token. H is the average negative log-likelihood, which is exponentiated to return to the probability scale.</li>
</ul>
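
<p>
As a minimal sketch of the formula above, the helper below computes perplexity directly from a list of per-token probabilities. The function name <code>perplexity_from_probs</code> and the probabilities are hypothetical and chosen only for illustration; a real model would supply the values.
</p>

```python
import math

def perplexity_from_probs(token_probs):
    """Return exp of the mean negative natural-log probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to the 4 tokens of a sequence
probs = [0.5, 0.25, 0.5, 0.25]
print(round(perplexity_from_probs(probs), 4))  # 2.8284
```

<p>
The result equals the inverse geometric mean of the token probabilities, which is why a model that assigns probability 1 to every token scores exactly 1.
</p>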

<p><i>Lower PP → more predictable and fluent text.<br>Higher PP → randomness or topic drift.</i></p>

<hr style="border:0.5px solid #444;">

<h2 style="color:#ffb74d;">📊 Numerical Interpretation</h2>

<table style="width:100%; border-collapse:collapse; margin-top:8px;">
<thead>
<tr style="background-color:#2a2a2a; color:#ffd54f;">
  <th style="padding:8px; border-bottom:1px solid #444;">Perplexity Range</th>
  <th style="padding:8px; border-bottom:1px solid #444;">Meaning (rough guide; varies with model, tokenizer, and domain)</th>
</tr>
</thead>
<tbody>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;">10 – 25</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Excellent: coherent and human-like fluency.</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;">30 – 70</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Moderate: the model struggles slightly with context.</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;">&gt; 100</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Poor: the model is confused or off-domain.</td>
</tr>
</tbody>
</table>

<hr style="border:0.5px solid #444;">

<h2 style="color:#ffb74d;">⚙️ Use Cases in NLP</h2>

<ol>
<li><b>Model Evaluation:</b> Compare language models on the same test set and tokenization; the lower the PP, the better the fit.</li>
<li><b>AI vs Human Detection:</b> AI text tends to show low, stable PP; human text tends to show higher, more variable PP.</li>
<li><b>Dataset Filtering:</b> Identify unnatural or corrupted samples during preprocessing.</li>
</ol>

<hr style="border:0.5px solid #444;">

<h2 style="color:#ffb74d;">🧠 Intuitive Example</h2>

<p>If a model predicts the next word in <code style="background-color:#2f2f2f; color:#fff;">I love ___</code>:</p>

<ul>
<li><b>“you”</b> with probability 0.9 → low PP (expected, fluent).</li>
<li><b>“bananas”</b> with probability 0.01 → high PP (unexpected, off-context).</li>
</ul>
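
<p>
To put numbers on the example above, the sketch below converts each probability into a surprisal (negative log-probability) and into the perplexity of that single prediction, which is simply 1/p. The dictionary of probabilities restates the two hypothetical values from the list; it is not model output.
</p>

```python
import math

# Hypothetical next-token probabilities after the prompt "I love"
next_token_probs = {"you": 0.9, "bananas": 0.01}

def surprisal(p):
    """Negative natural-log probability, in nats."""
    return -math.log(p)

for token, p in next_token_probs.items():
    s = surprisal(p)
    # For a single token, perplexity is exp(surprisal) = 1 / p
    print(f"{token!r}: surprisal={s:.3f} nats, single-token PP={math.exp(s):.2f}")
```

<p>
The expected word costs about 0.1 nats (PP of roughly 1.11), while the unexpected word costs about 4.6 nats (PP of 100).
</p>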

<hr style="border:0.5px solid #444;">

<h2 style="color:#ffb74d;">🐍 Example in Python (Kaggle-Ready)</h2>

<pre style="background-color:#2b2b2b; color:#e0e0e0; padding:14px; border-radius:8px; overflow-x:auto;">
<code class="language-python">
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch, math

# Load GPT-2
model_name = "gpt2"
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Example text
text = "I love you so much."

# Encode and evaluate
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels supplied, .loss is the mean next-token cross-entropy (in nats)
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = math.exp(loss.item())

print(f"Perplexity: {perplexity:.2f}")
</code>
</pre>

<hr style="border:0.5px solid #444;">

<h2 style="color:#ffb74d;">🧩 Summary</h2>

<ul>
<li><b>Definition:</b> Exponentiated average negative log-likelihood; a statistical indicator of how well a model predicts a text.</li>
<li><b>Goal:</b> Lower perplexity on held-out text = stronger language model.</li>
<li><b>AI Detection Insight:</b> Human writing → variable PP; AI → smooth &amp; uniform PP.</li>
</ul>



</div>


<div style="
    background-color:#1c1c1c;
    color:#f5f5f5;
    padding:40px;
    border-radius:12px;
    font-family:'Segoe UI','Helvetica Neue',sans-serif;
    line-height:1.8;
    font-size:15px;
">

<h1 style="text-align:center; color:#ffca28;">🤖 Perplexity as a Statistical Signal for AI vs Human Text Detection</h1>

<p>
<b>Perplexity (PP)</b> is a central measurement in computational linguistics used to evaluate
how well a language model predicts a given piece of text. In the context of 
<b>AI vs human text detection</b>, perplexity captures the difference between 
machine-generated fluency and human unpredictability. Because AI text is optimized to seem smooth and linguistically probable,
it tends to yield <b>low, consistent perplexity</b>, whereas human text is more
erratic, imaginative, and semantically diverse, resulting in <b>higher and more variable perplexity</b>.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 1. Conceptual Foundation</h2>

<p>
Perplexity reflects the “confusion” of a language model when reading a sequence of tokens.
Defined as the exponential of the average negative log-likelihood, it measures how 
uncertain the model is about each next token. 
A perfect model that assigns probability 1 to every observed token would score a PP of <b>1</b>, 
while a model that keeps encountering tokens it considered unlikely shows a high PP value.
</p>

<p>
When used for authorship classification:
</p>
<ul>
<li>AI text aims for <i>optimal predictability</i>; a language model scores text written in a similar style as predictable.</li>
<li>Human text is <i>less optimized</i>, embedding emotion, digression, and rare phrasing, all of which increase PP variance.</li>
<li>Thus, analyzing the perplexity distribution over segments (sentences or paragraphs) helps indicate the nature of authorship.</li>
</ul>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 2. Behavioral Patterns in AI and Human Text</h2>

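<p>
The following trends typically emerge when plotting perplexity values across sentences.
The numeric ranges are illustrative; actual values depend on the reference model, tokenizer, and text domain:
</p>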

<table style="width:100%; border-collapse:collapse; margin-top:8px;">
<thead>
<tr style="background-color:#2a2a2a; color:#ffd54f;">
<th style="padding:8px; border-bottom:1px solid #444;">Metric</th>
<th style="padding:8px; border-bottom:1px solid #444;">AI-Generated Text</th>
<th style="padding:8px; border-bottom:1px solid #444;">Human-Written Text</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Average Perplexity</td>
<td style="padding:8px; border-bottom:1px solid #333;">Lower (≈ 20–40)</td>
<td style="padding:8px; border-bottom:1px solid #333;">Higher (≈ 60–120)</td>
</tr>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Variance</td>
<td style="padding:8px; border-bottom:1px solid #333;">Low, steady across sentences</td>
<td style="padding:8px; border-bottom:1px solid #333;">High, with irregular fluctuations</td>
</tr>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Local Peaks</td>
<td style="padding:8px; border-bottom:1px solid #333;">Flattened or absent</td>
<td style="padding:8px; border-bottom:1px solid #333;">Frequent and sharp</td>
</tr>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Entropy Landscape</td>
<td style="padding:8px; border-bottom:1px solid #333;">Smooth, narrow entropy basin</td>
<td style="padding:8px; border-bottom:1px solid #333;">Rough, multi-modal entropy surface</td>
</tr>
</tbody>
</table>

<p>
This statistical asymmetry forms the mathematical basis for AI detection:
low average perplexity + low variance often signals automated generation,
while irregular high-entropy traces often point to genuine human authorship.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 3. Entropy Profile and Signal Extraction</h2>

<p>
Instead of relying on a single perplexity number, 
a more reliable indicator is the <b>perplexity profile</b>: the sequence of PP values
computed over multiple contiguous sentences.  
From this curve, one can derive features such as:
</p>

<ul>
<li><b>Mean PP</b>: overall predictability of the text.</li>
<li><b>Standard Deviation PP</b>: stylistic variability or creativity.</li>
<li><b>Range PP</b>: difference between the highest and lowest PP in the profile.</li>
<li><b>Max Jump PP</b>: largest absolute change between consecutive PP values.</li>
<li><b>Skewness PP</b>: asymmetry of the distribution; a long right tail corresponds to occasional bursts of novelty.</li>
</ul>

<p>
Human authors often demonstrate broader range and higher standard deviation, 
while AI-generated sequences cluster tightly around a stable mean.
Computing these derived signals produces a rich "entropy fingerprint"
that models can use for classification.
</p>
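
<p>
The features listed above can be sketched with the standard library alone. The function name <code>profile_features</code>, the dictionary keys, and the two example profiles are hypothetical; skewness is computed here as the population third standardized moment, which is one common definition among several.
</p>

```python
import statistics

def profile_features(pp_values):
    """Summarize a sequence of per-sentence perplexities."""
    if len(pp_values) < 2:
        raise ValueError("need at least two segments")
    mean = statistics.fmean(pp_values)
    pstd = statistics.pstdev(pp_values)  # population std, used for skewness
    # Absolute change between each pair of consecutive segments
    jumps = [abs(b - a) for a, b in zip(pp_values, pp_values[1:])]
    third_moment = sum((x - mean) ** 3 for x in pp_values) / len(pp_values)
    return {
        "mean_pp": mean,
        "std_pp": statistics.stdev(pp_values),  # sample std (ddof=1)
        "range_pp": max(pp_values) - min(pp_values),
        "max_jump_pp": max(jumps),
        "skewness_pp": third_moment / pstd ** 3 if pstd > 0 else 0.0,
    }

# Hypothetical per-sentence perplexities, not measured values
steady_profile = [21.0, 19.5, 20.4, 22.1, 20.0]
spiky_profile = [18.0, 95.0, 30.0, 140.0, 25.0]

for name, profile in [("steady", steady_profile), ("spiky", spiky_profile)]:
    feats = profile_features(profile)
    print(name, {k: round(v, 2) for k, v in feats.items()})
```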

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 4. Interpretive Perspective: What Perplexity Reveals</h2>

<ul>
<li><b>Semantic Control:</b> AI text has high local coherence, so perplexity remains low.</li>
<li><b>Idea Drift:</b> Humans allow narrative tangents or emotional flourish, creating peaks.</li>
<li><b>Cognitive Noise:</b> Variability introduced by mood, bias, opinion, or ambiguity increases PP.</li>
<li><b>Algorithmic Consistency:</b> Likelihood-driven decoding keeps generated tokens in high-probability regions, which dampens PP spikes.</li>
</ul>

<p>
Thus, perplexity offers a rough measure of <i>how mechanical</i> or <i>how inspired</i> a piece of writing appears.
Even without stylistic analysis, its statistics alone can give a first estimate of whether a text is AI- or human-written.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 5. Hybrid Feature Integration</h2>

<p>
While perplexity itself is powerful, combining it with syntactic or emotional descriptors 
yields stronger performance in distinguishing humans from machines.
Common hybrid features include:
</p>

<ul>
<li><b>Part-of-Speech ratios</b> – measuring functional diversity of language.</li>
<li><b>Lexical rarity</b> – frequency of uncommon tokens relative to a corpus baseline.</li>
<li><b>Sentiment entropy</b> – emotional irregularity correlating with genuine tone shifts.</li>
<li><b>Burstiness</b> – uneven punctuation and clause density patterns found in creative writing.</li>
</ul>

<p>
Perplexity acts as the quantitative core: a signal of probability smoothness, while the others
add linguistic nuance. Combined, they construct a robust multidimensional fingerprint for AI-vs-human classification.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 6. Strengths and Limitations</h2>

<table style="width:100%; border-collapse:collapse; margin-top:8px;">
<thead>
<tr style="background-color:#2a2a2a; color:#ffd54f;">
<th style="padding:8px; border-bottom:1px solid #444;">Aspect</th>
<th style="padding:8px; border-bottom:1px solid #444;">Advantages</th>
<th style="padding:8px; border-bottom:1px solid #444;">Limitations</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Interpretability</td>
<td style="padding:8px; border-bottom:1px solid #333;">Simple mathematical meaning: lower is more predictable</td>
<td style="padding:8px; border-bottom:1px solid #333;">Requires reference LM calibrated to domain vocabulary</td>
</tr>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Effectiveness</td>
<td style="padding:8px; border-bottom:1px solid #333;">Strong accuracy on raw LLM text spotting</td>
<td style="padding:8px; border-bottom:1px solid #333;">Less decisive on short or mixed texts</td>
</tr>
<tr>
<td style="padding:8px; border-bottom:1px solid #333;">Robustness</td>
<td style="padding:8px; border-bottom:1px solid #333;">Works even when style-masking features fail</td>
<td style="padding:8px; border-bottom:1px solid #333;">Advanced LLMs can simulate human-like PP variability</td>
</tr>
</tbody>
</table>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 7. Practical Insights</h2>

<p>
In real detection pipelines (academic or industrial), perplexity features are extracted at the
document or sentence level, standardized, and fed into classifiers such as
Random Forests, SVMs, or neural discriminators.
The discriminative power of such systems rests on the observation that human creative irregularity
is hard to replicate with likelihood-driven sampling from neural networks,
at least for now.
</p>
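
<p>
The standardize-then-classify step can be illustrated without external libraries. The sketch below z-scores two perplexity features and uses a nearest-centroid rule as a deliberately simple stand-in for the Random Forest or SVM mentioned above; it is not the classifier a production pipeline would use. All function names and feature values are hypothetical.
</p>

```python
import statistics

def fit_standardizer(rows):
    """Per-column mean and population std (std of 0 is replaced by 1)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def standardize(row, means, stds):
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def fit_centroids(rows, labels):
    """Mean feature vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]
    return centroids

def predict(row, centroids):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(row, centroids[label]))

# Hypothetical (mean PP, std PP) features per document; 1 = AI, 0 = human
train = [[9.5, 3.0], [10.2, 4.1], [8.8, 2.5], [33.0, 17.0], [43.0, 22.0], [29.0, 15.0]]
labels = [1, 1, 1, 0, 0, 0]

means, stds = fit_standardizer(train)
centroids = fit_centroids([standardize(r, means, stds) for r in train], labels)

print(predict(standardize([11.0, 3.5], means, stds), centroids))   # 1 (AI)
print(predict(standardize([38.0, 19.0], means, stds), centroids))  # 0 (human)
```

<p>
The scaling statistics are fitted on the training rows only and then reused for new documents, so that test data does not leak into the standardization.
</p>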

<p>
Perplexity thus functions as a <b>quantitative window into cognitive spontaneity</b>.
Humans tend to surprise language models; language models rarely surprise themselves.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 8. Summary</h2>

<p>
Perplexity transforms linguistic predictability into a diagnostic signal for authorship.
Low, uniform perplexity trajectories reflect algorithmic continuity, while high and uneven
profiles reveal human expressiveness.
As AI text grows more sophisticated, future research is likely to focus on modeling the
<i>dynamics of perplexity variance</i> and blending it with semantic and emotional irregularity
for reliable, explainable AI-vs-human classification.
</p>



</div>


In [1]:
import pandas as pd

# Load dataset (adjust file name if needed)
df = pd.read_csv('/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv')

# Quick structural glance
print(df.shape)          # (rows, columns)
print(df.columns)        # column names
df.head()                # display first 5 rows


(44868, 5)
Index(['text', 'label', 'prompt_name', 'source', 'RDizzl3_seven'], dtype='object')


Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


In [2]:
# you can get all unique prompt names:
unique_prompts = df['prompt_name'].unique()
print(unique_prompts)

# now split df into smaller DataFrames
prompt_dfs = {}

for prompt in unique_prompts:
    prompt_dfs[prompt] = df[df['prompt_name'] == prompt].copy()

# Example access:
distance_learning_df = prompt_dfs['Distance learning']
car_free_df = prompt_dfs['Car-free cities']
electoral_df = prompt_dfs['Does the electoral college work?']

# Quick verification
for name, subdf in prompt_dfs.items():
    print(f"{name}: {subdf.shape}")

['Phones and driving' 'Car-free cities' 'Summer projects'
 '"A Cowboy Who Rode the Waves"' 'Mandatory extracurricular activities'
 'Exploring Venus' 'Facial action coding system' 'The Face on Mars'
 'Community service' 'Grades for extracurricular activities'
 'Driverless cars' 'Does the electoral college work?'
 'Cell phones at school' 'Distance learning' 'Seeking multiple opinions']
Phones and driving: (1583, 5)
Car-free cities: (4717, 5)
Summer projects: (2701, 5)
"A Cowboy Who Rode the Waves": (1896, 5)
Mandatory extracurricular activities: (3077, 5)
Exploring Venus: (2176, 5)
Facial action coding system: (3084, 5)
The Face on Mars: (1893, 5)
Community service: (2092, 5)
Grades for extracurricular activities: (2116, 5)
Driverless cars: (2250, 5)
Does the electoral college work?: (4434, 5)
Cell phones at school: (2119, 5)
Distance learning: (5554, 5)
Seeking multiple opinions: (5176, 5)


In [None]:
electoral_df.head()

In [4]:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Setup 
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()

def compute_perplexity(text):
    """Compute perplexity for a single text sample."""
    with torch.no_grad():
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        input_ids = enc.input_ids.to(device)
        attention_mask = enc.attention_mask.to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        return torch.exp(loss).item()

# Topic Selection 
selected_topics = [
    "Distance learning",
    "Car-free cities",
    "Does the electoral college work?"
]

selected_dfs = {topic: df[df["prompt_name"] == topic].copy() for topic in selected_topics}

#  Perplexity by Class 
stats_per_topic = {}

for topic, topic_df in selected_dfs.items():
    stats_per_topic[topic] = {}
    for label_val, class_name in zip([1, 0], ["AI", "Human"]):
        sub_df = topic_df[topic_df["label"] == label_val]
        pp_values = sub_df["text"].apply(compute_perplexity)
        stats_per_topic[topic][class_name] = {
            "mean": pp_values.mean(),
            "std": pp_values.std(),
            "count": len(pp_values)
        }

# Summary Table
result_df = pd.DataFrame([
    {
        "prompt_name": topic,
        "AI_mean_PP": stats["AI"]["mean"],
        "AI_std_PP": stats["AI"]["std"],
        "AI_count": stats["AI"]["count"],
        "Human_mean_PP": stats["Human"]["mean"],
        "Human_std_PP": stats["Human"]["std"],
        "Human_count": stats["Human"]["count"],
    }
    for topic, stats in stats_per_topic.items()
])

display(result_df)


Unnamed: 0,prompt_name,AI_mean_PP,AI_std_PP,AI_count,Human_mean_PP,Human_std_PP,Human_count
0,Distance learning,10.195369,5.025817,3397,28.757841,15.570113,2157
1,Car-free cities,9.484958,6.152635,2051,43.356668,22.414555,2666
2,Does the electoral college work?,9.26725,5.251502,1720,33.620551,17.076914,2714


In [12]:
import pandas as pd

records = []

for topic, class_data in stats_per_topic.items():
    for class_name, metrics in class_data.items():
        # Accept both old and new key names. Metrics missing from stats_per_topic
        # (e.g. burstiness, which the scoring cell above does not compute) fall back to 0.
        mean_pp = metrics.get("mean_perplexity", metrics.get("mean", 0))
        std_pp  = metrics.get("std_perplexity", metrics.get("std", 0))
        b_vmr   = metrics.get("burstiness_VMR", metrics.get("burstiness_vmr", 0))
        b_cv    = metrics.get("burstiness_CV", metrics.get("burstiness_cv", 0))
        count   = metrics.get("sample_count", metrics.get("count", 0))
        
        record = {
            "Topic": topic,
            "Class": class_name,
            "Mean Perplexity": round(mean_pp, 4),
            "Std Perplexity": round(std_pp, 4),
            "Burstiness VMR": round(b_vmr, 4),
            "Burstiness CV": round(b_cv, 4),
            "Samples": count
        }
        records.append(record)

df_summary = pd.DataFrame(records)
df_summary.sort_values(by=["Topic", "Class"], inplace=True)

# Display neatly by topic
for topic, sub_df in df_summary.groupby("Topic"):
    print(f"\n─── {topic.upper()} ───")
    display(sub_df.drop(columns=["Topic"]).reset_index(drop=True))



─── CAR-FREE CITIES ───


Unnamed: 0,Class,Mean Perplexity,Std Perplexity,Burstiness VMR,Burstiness CV,Samples
0,AI,9.485,6.1526,0.0,0.0,2051
1,Human,43.3567,22.4146,0.0,0.0,2666



─── DISTANCE LEARNING ───


Unnamed: 0,Class,Mean Perplexity,Std Perplexity,Burstiness VMR,Burstiness CV,Samples
0,AI,10.1954,5.0258,0.0,0.0,3397
1,Human,28.7578,15.5701,0.0,0.0,2157



─── DOES THE ELECTORAL COLLEGE WORK? ───


Unnamed: 0,Class,Mean Perplexity,Std Perplexity,Burstiness VMR,Burstiness CV,Samples
0,AI,9.2673,5.2515,0.0,0.0,1720
1,Human,33.6206,17.0769,8.6739,0.5079,2714
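

The summary cell looks up `burstiness_VMR` and `burstiness_CV`, but the scoring cell never stores them, which is why most rows show 0.0. The one populated row (Human, "Does the electoral college work?") is consistent with VMR = variance / mean and CV = std / mean, so the sketch below assumes those definitions. The helper names are hypothetical, and the 0.0 entries in the tables should be read as missing values, not as measurements.

```python
import statistics

def burstiness_from_values(pp_values):
    """Variance-to-mean ratio and coefficient of variation of perplexities."""
    mean = statistics.fmean(pp_values)
    std = statistics.stdev(pp_values)  # sample std (ddof=1), the pandas Series.std() default
    return {"burstiness_VMR": std ** 2 / mean, "burstiness_CV": std / mean}

def burstiness_from_summary(mean_pp, std_pp):
    """Same metrics, computed from an already-aggregated mean and std."""
    return {"burstiness_VMR": std_pp ** 2 / mean_pp, "burstiness_CV": std_pp / mean_pp}

# Mean and std taken from the Human row of the electoral-college table
human_electoral = burstiness_from_summary(33.6206, 17.0769)
print({k: round(v, 4) for k, v in human_electoral.items()})
```

Storing the output of `burstiness_from_values(pp_values)` next to `mean`, `std`, and `count` in the scoring loop would let the summary cell pick the values up through its existing key lookups.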


<div style="
    background-color:#1c1c1c;
    color:#f5f5f5;
    padding:35px;
    border-radius:10px;
    font-family:'Segoe UI','Helvetica Neue',sans-serif;
    line-height:1.8;
    font-size:15px;
">

<h1 style="text-align:center; color:#ffd54f;">🧩 Observation: Validity of GPT-2 for Perplexity-Based Experiments</h1>

<p>
In this experiment, we questioned whether using the <b>GPT-2</b> model for computing perplexity 
remains scientifically valid, given the release of newer models like GPT-Neo, LLaMA, and Mistral. 
On the three prompts scored above, GPT-2 separates the classes clearly (mean PP of roughly 9–10 for AI text versus 29–43 for human text),
which suggests that it is still a <b>valid and reliable reference model</b> for 
perplexity-based AI vs Human detection tasks, mainly due to its stability and interpretability.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 1. Why GPT-2 Remains Valid</h2>

<ul>
    <li><b>Consistency Over Recency:</b> Perplexity measures depend on having a fixed, 
        reproducible reference model, not on having the latest model. GPT-2’s 
        public, frozen weights give the same scores across runs and datasets.</li>
    <li><b>Transparent Probability Outputs:</b> GPT-2 is a plain causal language model without instruction tuning, 
        so its next-token likelihoods map directly onto the PP definition.</li>
    <li><b>Benchmark Reliability:</b> Many academic and Kaggle implementations (2023–2025) 
        still use GPT-2 to extract PP features because it is 
        <b>computationally light, reproducible, and well-documented</b>.</li>
</ul>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 2. When to Consider Newer Models</h2>

<p>
While GPT-2 is fully valid for comparative PP analysis, more advanced models can offer benefits
in certain cases:
</p>
<table style="width:100%; border-collapse:collapse; margin-top:8px;">
<thead>
<tr style="background-color:#2a2a2a; color:#ffd54f;">
  <th style="padding:8px; border-bottom:1px solid #444;">Model</th>
  <th style="padding:8px; border-bottom:1px solid #444;">Improvement</th>
  <th style="padding:8px; border-bottom:1px solid #444;">Limitations</th>
</tr>
</thead>
<tbody>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;">GPT-Neo / GPT-J</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Larger capacity with the same GPT-2 BPE tokenizer; smoother PP curves</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Slower scoring; high VRAM usage</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;">LLaMA-2 / LLaMA-3</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Modern architecture, better long-context modeling</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Tokenizer mismatch with GPT-2, so PP values are not directly comparable; limited open PP support</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;">Mistral / Mixtral</td>
  <td style="padding:8px; border-bottom:1px solid #333;">More calibrated probabilities for newer datasets</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Too large for Kaggle runtime without quantization</td>
</tr>
</tbody>
</table>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 3. Best Practice for Kaggle Experiments</h2>

<ul>
    <li>Start with <b>GPT-2 or GPT-2-Medium</b> as your reference for sentence-level PP scoring.</li>
    <li>Optionally validate results by comparing with a secondary model like GPT-Neo 
        or OPT-1.3B to ensure consistent PP distribution profiles.</li>
    <li>Ensure all PP statistics (mean, std, range, skew) are computed with the 
        same reference model for each dataset to maintain fair comparison.</li>
</ul>
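
<p>
One possible way to run the cross-model check suggested above is a rank correlation: score the same texts with both reference models and test whether they order the texts similarly, which sidesteps the fact that absolute PP values differ between models. The sketch below implements Spearman correlation with the standard library; the function names and the two score lists are hypothetical, not outputs of GPT-2 or any other model.
</p>

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical PP scores for the same five texts under two reference models
pp_model_a = [9.3, 33.6, 10.2, 43.4, 28.8]
pp_model_b = [7.1, 25.0, 8.0, 30.2, 21.5]
print(round(spearman(pp_model_a, pp_model_b), 3))
```

<p>
A correlation close to 1 would indicate that both models rank the texts alike, even though their absolute PP scales differ.
</p>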

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">🔹 4. Overall Conclusion</h2>

<p>
<b>GPT-2 remains a scientifically valid and practical choice</b> for perplexity-based AI vs Human 
detection. Its interpretability, reproducibility, and manageable size outweigh its age. 
Upgrading to newer models is beneficial only for specialized or domain-adapted experiments, 
not for baseline PP feature extraction.
</p>



</div>


