
<div style="
  background-color:#1c1c1c;
  color:#f5f5f5;
  padding:35px;
  border-radius:12px;
  font-family:'Segoe UI','Helvetica Neue',sans-serif;
  line-height:1.7;
  font-size:15px;
">

<h1 style="text-align:center; color:#ffca28;">🔬 Syntactic Analysis: AI vs. Human POS Patterns</h1>

<p>
This section outlines the theoretical differences in Part-of-Speech (POS) usage between human-written and AI-generated text. By quantifying these structural patterns, we can develop a new set of features to complement our Perplexity and Burstiness metrics, leading to a more robust classification model.
</p>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">1. The Core Hypothesis: Uniformity vs. Idiosyncrasy</h2>

<p>
The fundamental distinction arises from the nature of the author:
</p>
<ul>
    <li>üß† <b>Humans:</b> Write with specific intent, cognitive biases, and fluctuating focus. Our grammar is often purposeful but imperfect, leading to unique, sometimes "spiky" or repetitive structural patterns.</li>
    <li>ü§ñ <b>AI (LLMs):</b> Generate text by sampling from a smoothed-out statistical distribution of all the text they were trained on. Their goal is to be coherent and probable, which often results in grammatically correct but structurally "generic" or "median" text.</li>
</ul>

<p>Here is a summary of the expected differences in POS patterns:</p>

<table style="width:100%; border-collapse:collapse; margin-top:15px;">
<thead>
<tr style="background-color:#2a2a2a; color:#ffd54f;">
  <th style="padding:10px; border-bottom:1px solid #444;">Syntactic Feature</th>
  <th style="padding:10px; border-bottom:1px solid #444;">ü§ñ AI Tendency (Smoothed)</th>
  <th style="padding:10px; border-bottom:1px solid #444;">üß† Human Tendency (Spiky)</th>
</tr>
</thead>
<tbody>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;"><b>Lexical Density</b> (Content Words)</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Balanced and consistent ratio of nouns, verbs, and adjectives.</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Can be highly variable. More descriptive passages have high noun/adjective density; narrative passages have high verb density.</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;"><b>POS Tag Entropy</b> (Variety)</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Higher entropy. The sequence of POS tags is more varied and less repetitive, reflecting the model's diverse training data.</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Lower entropy. Humans often fall into repetitive syntactic patterns within a single text, reducing tag variety.</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;"><b>Syntactic Complexity</b></td>
  <td style="padding:8px; border-bottom:1px solid #333;">Tends to use a safe, medium level of complexity. Avoids extremely simple or highly convoluted sentences.</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Highly variable. Uses a mix of very short, punchy sentences and long, complex ones with multiple subordinate clauses.</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;"><b>Modifier Usage</b> (Adjectives/Adverbs)</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Often uses common, high-probability modifiers (e.g., "very," "really," "importantly").</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Employs more specific, sometimes unusual, modifiers. Prone to stylistic overuse of certain adverbs.</td>
</tr>
<tr>
  <td style="padding:8px; border-bottom:1px solid #333;"><b>Function Word Usage</b></td>
  <td style="padding:8px; border-bottom:1px solid #333;">Consistent use of conjunctions and prepositions (e.g., "and," "but," "in," "on").</td>
  <td style="padding:8px; border-bottom:1px solid #333;">Usage is author-dependent and can be a strong stylistic signal (a "telltale").</td>
</tr>
</tbody>
</table>

<hr style="border:0.5px solid #333;">

<h2 style="color:#ffb74d;">2. Roadmap for Feature Engineering</h2>

<p>
To turn this theory into measurable features, our plan is as follows:
</p>
<ol>
    <li>
        <b>POS Tagging:</b> First, we'll process each text sample through a POS tagger (like <b>spaCy</b>) to get a sequence of tags for each document (e.g., <code>['NOUN', 'VERB', 'DET', 'ADJ', 'NOUN', 'PUNCT']</code>).
    </li>
    <li>
        <b>Feature Extraction:</b> From the sequence of tags, we will calculate several metrics:
        <ul>
            <li><b>POS Frequencies:</b> The normalized count (ratio) of each major POS tag (Nouns, Verbs, Adjectives, Adverbs, Pronouns, Conjunctions).</li>
            <li><b>Lexical Density Ratio:</b> The ratio of content words (Nouns, Verbs, Adjectives, Adverbs) to function words (Pronouns, Prepositions, Conjunctions, etc.).</li>
            <li><b>POS Entropy:</b> The Shannon entropy of the POS tag distribution. A higher value means more diverse and unpredictable tag usage.</li>
            <li><b>POS Transition Probabilities (Advanced):</b> Analyze the probability of moving from one tag to another (e.g., what's the probability that a Noun is followed by a Verb?). We can use the entropy of this transition matrix as a feature.</li>
        </ul>
    </li>
    <li>
        <b>Analysis & Integration:</b> We will then group the data by label (AI vs. Human) and compare the means and distributions of these new syntactic features. The most discriminative ones will be combined with our existing Perplexity and Burstiness features to train the final, more powerful classifier.
    </li>
</ol>

<p style="color:#aaa; text-align:center; font-size:13px;">
This gives us a concrete plan for turning these linguistic theories into powerful predictive features.
</p>

</div>
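The first three metrics in the roadmap can be sketched in plain Python, working on a tag list like the one shown above. This is a minimal sketch rather than the notebook's final implementation: the function name `pos_features`, the `ratio_` key prefix, the choice to count `PROPN` as a content tag, and the choice to ignore punctuation-like tags are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: Universal POS tags treated as content words in this sketch.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
# Assumption: tags that are not words are left out of every ratio.
NON_WORD_TAGS = {"PUNCT", "SYM", "SPACE", "X"}


def pos_features(tags):
    """Return POS ratios, lexical density ratio and tag entropy for one document."""
    words = [t for t in tags if t not in NON_WORD_TAGS]
    if not words:
        return {"lexical_density": 0.0, "pos_entropy": 0.0}

    counts = Counter(words)
    total = len(words)

    # Normalized frequency of each tag that occurs in the document.
    features = {f"ratio_{tag}": n / total for tag, n in counts.items()}

    # Content words divided by function words; max(..., 1) guards against
    # division by zero when a document has no function words at all.
    content = sum(n for tag, n in counts.items() if tag in CONTENT_TAGS)
    function = total - content
    features["lexical_density"] = content / max(function, 1)

    # Shannon entropy (in bits) of the tag distribution.
    features["pos_entropy"] = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return features


example = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
feats = pos_features(example)
print(feats)
```

For the example sequence, four of the five word tags are content tags, so the lexical density ratio is 4.0. Documents with different tag inventories produce different `ratio_` keys, so any missing ratios would need to be filled with zero before the features are stacked into a table.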



In [1]:
import pandas as pd

data_path = "/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv"

df = pd.read_csv(data_path, encoding="utf-8")

# quick inspection
print("Data loaded successfully! Shape:", df.shape)
df.head(10)

Data loaded successfully! Shape: (44868, 5)


Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False
5,Cell phone use should not be legal while drivi...,0,Phones and driving,persuade_corpus,False
6,Phones and Driving\n\nDriving is a good way to...,0,Phones and driving,persuade_corpus,False
7,PHONES AND DRIVING\n\nIn this world in which w...,0,Phones and driving,persuade_corpus,False
8,People are debating whether if drivers should ...,0,Phones and driving,persuade_corpus,False
9,Texting and driving\n\nOver half of drivers in...,0,Phones and driving,persuade_corpus,False


In [2]:
import re


def clean_text(text):
    if not isinstance(text, str):
        return ""
    # remove hyperlinks (http, https, www)
    text = re.sub(r"http\S+|www\.\S+", "", text)
    # turn every whitespace run (line breaks, tabs, Unicode spaces) into one space
    text = re.sub(r"\s+", " ", text)
    # keep printable ASCII only; this drops control characters but also
    # accented letters, curly quotes and emoji, while punctuation is preserved
    text = re.sub(r"[^\x20-\x7E]", "", text)
    # the removals above can leave double or edge spaces, so tidy up last
    text = re.sub(r" {2,}", " ", text).strip()
    return text

# apply the cleaning
df["clean_text"] = df["text"].apply(clean_text)

print("✅ Cleaning complete. Preview:")
print(df[["text", "clean_text"]].head(3))

✅ Cleaning complete. Preview:
                                                text  \
0  Phones\n\nModern humans today are always on th...   
1  This essay will explain if drivers should or s...   
2  Driving while the use of cellular devices\n\nT...   

                                          clean_text  
0  Phones Modern humans today are always on their...  
1  This essay will explain if drivers should or s...  
2  Driving while the use of cellular devices Toda...  


In [3]:
target_value = "Distance learning"
df_distance = df[df["prompt_name"] == target_value].copy()

print("✅ Filtered DataFrame created.")
print("Shape:", df_distance.shape)
print(df_distance.head(3))

✅ Filtered DataFrame created.
Shape: (5554, 6)
                                                    text  label  \
22287  Technology has advanced tremendously in the pa...      0   
22288  I disagree with online classes because, if sch...      0   
22289  I believe that students would benefit from bei...      0   

             prompt_name           source  RDizzl3_seven  \
22287  Distance learning  persuade_corpus          False   
22288  Distance learning  persuade_corpus          False   
22289  Distance learning  persuade_corpus          False   

                                              clean_text  
22287  Technology has advanced tremendously in the pa...  
22288  I disagree with online classes because, if sch...  
22289  I believe that students would benefit from bei...  


In [4]:
# POS Tag Extraction for 'Distance learning' subset 
import spacy
from tqdm.notebook import tqdm

# Load the small English spaCy model; NER and the parser are disabled because only POS tags are needed
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

# Apply efficient batching through nlp.pipe
texts = df_distance["clean_text"].tolist()

pos_sequences = []
pos_counts = []

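print("⚙️ Starting POS tagging on 'Distance learning' subset...")
# nlp.pipe returns a generator, so pass total explicitly for a meaningful progress bar
for doc in tqdm(nlp.pipe(texts, batch_size=64), total=len(texts)):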
    tags = [token.pos_ for token in doc]
    pos_sequences.append(tags)
    # quick frequency count per document
    counts = {}
    for tag in tags:
        counts[tag] = counts.get(tag, 0) + 1
    pos_counts.append(counts)

# attach results to dataframe
df_distance["pos_tags"] = pos_sequences
df_distance["pos_counts"] = pos_counts

print("✅ POS tagging complete!")
print(df_distance[["clean_text", "pos_tags"]].head(3))


⚙️ Starting POS tagging on 'Distance learning' subset...



✅ POS tagging complete!
                                              clean_text  \
22287  Technology has advanced tremendously in the pa...   
22288  I disagree with online classes because, if sch...   
22289  I believe that students would benefit from bei...   

                                                pos_tags  
22287  [NOUN, AUX, VERB, ADV, ADP, DET, ADJ, ADJ, NOU...  
22288  [PRON, VERB, ADP, ADJ, NOUN, SCONJ, PUNCT, SCO...  
22289  [PRON, VERB, SCONJ, NOUN, AUX, VERB, ADP, AUX,...  
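
With tag sequences like these available, the "advanced" transition feature from the roadmap could be computed along the following lines. This is a sketch under stated assumptions, not the notebook's confirmed method: it reads "entropy of the transition matrix" as the conditional entropy of the next tag given the current tag, and the helper name `transition_entropy`, the toy sequences, and their labels are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import mean


def transition_entropy(tags):
    """Conditional entropy H(next tag | current tag), in bits, for one sequence."""
    # Count each adjacent (current, next) tag pair.
    pair_counts = Counter(zip(tags, tags[1:]))
    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        return 0.0

    # How often each tag appears as the first element of a pair.
    prev_counts = Counter()
    for (prev, _), n in pair_counts.items():
        prev_counts[prev] += n

    entropy = 0.0
    for (prev, _), n in pair_counts.items():
        p_pair = n / total_pairs                 # joint probability of the pair
        p_next_given_prev = n / prev_counts[prev]  # one transition-matrix entry
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


# Toy data: (label, tag sequence). Labels here are placeholders only.
docs = [
    (0, ["PRON", "VERB", "DET", "NOUN", "PRON", "VERB", "DET", "NOUN"]),
    (1, ["DET", "NOUN", "VERB", "ADJ", "NOUN", "ADP", "DET", "NOUN"]),
]

# Average the feature per label, mirroring the planned group comparison.
by_label = defaultdict(list)
for label, tags in docs:
    by_label[label].append(transition_entropy(tags))
summary = {label: mean(values) for label, values in by_label.items()}
print(summary)
```

The first toy sequence repeats one fixed pattern, so every tag has a single possible successor and the entropy is zero; the second has a tag with two different successors and scores higher. In the notebook itself, the same function could be applied with `df_distance["pos_tags"].apply(transition_entropy)` and the result compared across groups with `groupby("label")`.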
