# Notebook 03 – Feature Extraction  
*This notebook enriches the CMV post dataset with structural and linguistic features that may influence persuasion.*

> **Goal in one line:** extract interpretable, distributed features—like pronoun use, sentiment, readability, and evidence signals—that help model what makes an argument persuasive.



### Import statements

In [None]:
# Standard library
import re

# NLP Processing
import nltk
from nltk.corpus import stopwords, words
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# PySpark Core
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, 
    IntegerType, BooleanType, ArrayType,
    LongType, FloatType
)

# PySpark Functions
from pyspark.sql.functions import (
    from_unixtime, year, month, date_format,
    udf, col, size, split, when, lit,
    concat_ws
)

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('words')
nltk.download('vader_lexicon')  # Lexicon required by SentimentIntensityAnalyzer (section 3.5)

stop_words = set(stopwords.words('english'))
english_words = set(words.words())
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/local/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to /usr/local/share/nltk_data...
[nltk_data]   Package words is already up-to-date!


## 1 – Data preprocessing

In [None]:
# Define schema following Notebook 02
schema_three = StructType([
    StructField("num_comments", IntegerType(), True),    # Number of comments on the post
    StructField("score", IntegerType(), True),           # Reddit score (upvotes - downvotes)
    StructField("delta", BooleanType(), True),           # Whether the post received a delta (changed view)
    StructField("urls", ArrayType(StringType()), True),  # URLs mentioned in the post
    StructField("processed", ArrayType(StringType()), True),  # Preprocessed tokens
    StructField("merged", StringType(), True),           # Merged text field (likely title + selftext)
    StructField("year_month", StringType(), True),       # Time period for temporal analysis
    StructField("category_title", StringType(), True),   # Interpretable category title
    StructField("title", StringType(), True),            # Title of the post
    StructField("selftext", StringType(), True),         # Body text of the post
])

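# Reuse the active Spark session, or create one if the kernel did not provide it
spark = SparkSession.builder.getOrCreate()

# Read the preprocessed data from the GCS bucket
n2_categorized = spark.read.schema(schema_three).json("gs://st446-cmv/n2_categorized_df")
n2_categorized.show(5, truncate=False)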

                                                                                


## 2 – Exploration

In [None]:
# Count posts per category, most frequent first
n2_categorized.groupBy("category_title").count().orderBy("count", ascending=False).show()



+--------------+-----+
|category_title|count|
+--------------+-----+
|       Society|22749|
|         Other|20176|
|      Politics|13118|
|        Gender| 3518|
|   Environment| 2816|
|       Culture| 1605|
|        Health|  396|
|       Economy|  298|
|        Sports|  187|
|       Animals|  122|
|    Technology|   64|
|     Education|   55|
|          Food|   43|
|    Philosophy|   22|
+--------------+-----+



                                                                                

In [None]:
# Sanity-check the classifier on one example: this Politics post from 2013 discusses the Israel-Palestine conflict.
n2_categorized.filter(col("category_title") == "Politics").show(1, truncate=False)

+------------+-----+-----+----+--------------------+--------------------+----------+--------------+
|num_comments|score|delta|urls|           processed|              merged|year_month|category_title|
+------------+-----+-----+----+--------------------+--------------------+----------+--------------+
|          42|   11|false|  []|["history","confl...|All through histo...|   2013-07|      Politics|
+------------+-----+-----+----+--------------------+--------------------+----------+--------------+
only showing top 1 row




## 3 – Feature extraction


### 3.1  Pronoun‑use features

We create two linguistic counters that capture how often an author refers to themselves individually versus collectively:

| Column | What it counts | Linguistic intuition |
|--------|----------------|----------------------|
| `first_person_singular_count` | Occurrences of **I, me, my, mine** | Signals personal experience, introspection, or ego‑involvement (*“I feel…”, “my view is…”*). |
| `first_person_plural_count`   | Occurrences of **we, us, our, ours** | Signals group identity, shared responsibility, or coalition framing (*“we should…”, “our society…”*). |

#### Why these features matter  
Research in discourse analysis suggests that **self‑focused language** can convey vulnerability or authenticity, while **collective pronouns** can invoke solidarity and broaden appeal.  
Both strategies may influence persuasion in online debates.

#### How they’re computed (distributed)  
1. Lower‑case the merged text.  
2. Tokenise with a simple word‑boundary regex (`\b\w+\b`).  
3. Tally tokens that belong to the predefined singular or plural pronoun sets.  
4. Wrap the Python functions as Spark UDFs so counting runs in parallel on every partition.

The resulting integer counts become inputs for the downstream classification model, allowing us to test whether pronoun choice correlates with Δ‑success.
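
The counting steps above can be sketched in plain Python on a single string, before any Spark wrapping. The helper name `count_pronouns` and the sample sentences are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Pronoun sets as defined in this section
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine"}
FIRST_PERSON_PLURAL = {"we", "us", "our", "ours"}

def count_pronouns(text, pronouns):
    """Hypothetical helper: count tokens of `text` that appear in `pronouns`."""
    if not text:
        return 0
    # Steps 1-2: lower-case, then tokenise on word boundaries
    tokens = re.findall(r"\b\w+\b", text.lower())
    # Step 3: tally tokens that belong to the pronoun set
    return sum(1 for token in tokens if token in pronouns)

sample = "I think my view is wrong, but we can fix it. I'm open to that."
singular = count_pronouns(sample, FIRST_PERSON_SINGULAR)
plural = count_pronouns(sample, FIRST_PERSON_PLURAL)
print(singular, plural)
```

Two side effects of this tokenisation are worth keeping in mind. Contractions such as "I'm" split into `i` and `m`, so the pronoun is still counted. Lower-casing also means that "US" (as in the country) is counted as the plural pronoun `us`.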


In [None]:
# Define pronoun sets
first_person_singular_pronouns = {"i", "me", "my", "mine"}
first_person_plural_pronouns = {"we", "us", "our", "ours"}

# Counting functions
def count_first_singular_person(text):
    if not text:
        return 0
    # Normalize and tokenize (named `tokens` to avoid shadowing nltk's `words` corpus)
    tokens = re.findall(r"\b\w+\b", text.lower())
    return sum(1 for token in tokens if token in first_person_singular_pronouns)

def count_first_plural_person(text):
    if not text:
        return 0
    # Normalize and tokenize
    tokens = re.findall(r"\b\w+\b", text.lower())
    return sum(1 for token in tokens if token in first_person_plural_pronouns)


# Wrap functions as Spark UDFs
count_fp_udf = udf(count_first_singular_person, IntegerType())
count_fp_plural_udf = udf(count_first_plural_person, IntegerType())

# Augment the DataFrame with new features
df = n2_categorized.withColumn("first_person_singular_count", count_fp_udf(col("merged")))
df = df.withColumn("first_person_plural_count", count_fp_plural_udf(col("merged")))

df.select("first_person_singular_count", "first_person_plural_count").show(1)


+---------------------------+-------------------------+
|first_person_singular_count|first_person_plural_count|
+---------------------------+-------------------------+
|                          1|                        0|
+---------------------------+-------------------------+
only showing top 1 row



                                                                                

### 3.2  Length‑based features  

Next, we add two simple yet informative size metrics:

| Column | Definition | Why it matters |
|--------|------------|----------------|
| `post_content_length`  | Number of whitespace‑delimited tokens in the OP’s body (`selftext`). | Longer posts may provide more context or evidence, potentially affecting Δ‑rates. |
| `title_content_length` | Number of tokens in the headline (`title`). | Extremely short or overly long titles can influence click‑through and engagement. |

Both counts are computed with Spark’s `size(split())`, keeping the operation fully distributed.
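
One subtlety of counting tokens with a regex split is how empty or padded strings behave. The plain-Python sketch below uses `re.split` as a stand-in for Spark's `split`; we assume the two agree on these inputs, which is worth confirming on the cluster before relying on it. Both helper names are illustrative.

```python
import re

def split_token_count(text):
    """Stand-in for a regex split on whitespace: counts all pieces, empty ones included."""
    return len(re.split(r"\s+", text))

def strict_token_count(text):
    """Alternative that ignores empty pieces caused by blank or padded text."""
    return len(text.split())

for text in ["", " I believe that", "I believe that"]:
    print(repr(text), split_token_count(text), strict_token_count(text))
```

Under that assumption, an empty `selftext` yields a length of 1 rather than 0, and leading whitespace adds one extra token. If this matters downstream, trimming the column first or discarding empty pieces would be one way to adjust.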


In [None]:
# Augment the DataFrame with new length-based features
df = df.withColumn("post_content_length", size(split(col("selftext"), r"\s+")))
df = df.withColumn("title_content_length", size(split(col("title"), r"\s+")))

### 3.3  Evidence indicator: `has_url`  

A binary flag that equals **1** if the OP contains at least one outbound link and **0** otherwise.  
Links often signal external evidence or citations, which prior work links to higher persuasive power.  



In [None]:
df = df.withColumn(
    "has_url",
    when(size(col("urls")) > 0, lit(1)).otherwise(lit(0))
)

### 3.4 Readability feature: Flesch–Kincaid Grade Level  

We add a readability score for each post using the Flesch–Kincaid (FK) formula:

**FK Grade** = 0.39 * **ASL** + 11.8 * **ASW** - 15.59



where  
* **ASL** = average sentence length (words / sentence)  
* **ASW** = average syllables per word  

| FK grade | Interpretation |
|----------|----------------|
| **≤ 6**  | Easy / elementary |
| 7 – 9   | Middle school |
| 10 – 12 | High school |
| **> 12** | College and above |

We include readability because persuasion can depend on cognitive load: text that is too complex (high FK) or too simplistic (very low FK) may reduce credibility or engagement.  

The score is computed in a UDF that:

1. Splits the text into sentences on `.`, `!`, and `?`.  
2. Counts words via regex and approximates syllables as runs of consecutive vowels (at least one per word).  
3. Applies the FK formula and rounds to two decimals.

Using a Spark UDF keeps the computation parallelised across the cluster.
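
To make the arithmetic concrete, the sketch below exposes the intermediate quantities for one short sample text. It follows the steps listed above but is only an illustration: `fk_components` is a hypothetical helper, the sample text is made up, and the function assumes non-empty input.

```python
import re

def fk_components(text):
    """Illustrative helper: return the intermediate quantities behind the FK grade."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    tokens = re.findall(r"\w+", text)
    # Approximate syllables as runs of consecutive vowels (at least one per word)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", t.lower()))) for t in tokens)
    asl = len(tokens) / len(sentences)   # average sentence length
    asw = syllables / len(tokens)        # average syllables per word
    grade = round(0.39 * asl + 11.8 * asw - 15.59, 2)
    return {"sentences": len(sentences), "words": len(tokens),
            "syllables": syllables, "ASL": asl, "ASW": asw, "fk_grade": grade}

result = fk_components("The cat sat on the mat. It was happy.")
print(result)
```

For this sample, 9 words in 2 sentences with 10 estimated syllables give ASL = 4.5 and ASW ≈ 1.11, so the grade is about -0.72. Very simple text can therefore score below zero. The vowel-run heuristic also overcounts words with a silent "e": "made", for example, is counted as two syllables.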


In [None]:

# Function to estimate the number of syllables in a single word
def count_syllables(word):
    word = word.lower()
    return max(1, len(re.findall(r'[aeiouy]+', word))) # Count contiguous groups of vowels

# Function to compute the Flesch–Kincaid readability grade for a given text
def fk_grade(text):
    if not text:
        return None
    sentences = re.split(r'[.!?]', text)
    sentences = [s for s in sentences if s.strip()]
    num_sentences = len(sentences)

    # Extract all words using a basic word regex
    tokens = re.findall(r'\w+', text)
    num_words = len(tokens)
    syllables = sum(count_syllables(token) for token in tokens)

    if num_sentences == 0 or num_words == 0:
        return None
    # Compute FK components:
    ASL = num_words / num_sentences  # Average Sentence Length
    ASW = syllables / num_words      # Average Syllables per Word

    # Return the Flesch–Kincaid grade, rounded to two decimals
    return round(0.39 * ASL + 11.8 * ASW - 15.59, 2)

# Register the function as a Spark UDF so it can be applied to a DataFrame column
fk_grade_udf = udf(fk_grade, FloatType())

In [None]:
df_flesch = df.withColumn("fk_grade", fk_grade_udf(col("merged")))
df_flesch.show(4)



+------------+-----+-----+-----------+--------------------+--------------------+----------+--------------+--------------------+--------------------+---------------------------+-------------------------+-------------------+--------------------+-------+--------+
|num_comments|score|delta|       urls|           processed|              merged|year_month|category_title|               title|            selftext|first_person_singular_count|first_person_plural_count|post_content_length|title_content_length|has_url|fk_grade|
+------------+-----+-----+-----------+--------------------+--------------------+----------+--------------+--------------------+--------------------+---------------------------+-------------------------+-------------------+--------------------+-------+--------+
|           1|    1|false|         []|["apple","product...| I believe that A...|   2013-07|       Culture|I believe that Ap...|                    |                          1|                        0|               

                                                                                

In [None]:
# Display posts with very low Flesch–Kincaid Grade Level
df_flesch.filter(col("fk_grade") < 4).show(5, truncate=False)



+------------+-----+-----+----+------------------------------+-----------------------------------------+----------+--------------+----------------------------------------+--------+---------------------------+-------------------------+-------------------+--------------------+-------+--------+
|num_comments|score|delta|urls|processed                     |merged                                   |year_month|category_title|title                                   |selftext|first_person_singular_count|first_person_plural_count|post_content_length|title_content_length|has_url|fk_grade|
+------------+-----+-----+----+------------------------------+-----------------------------------------+----------+--------------+----------------------------------------+--------+---------------------------+-------------------------+-------------------+--------------------+-------+--------+
|32          |48   |false|[]  |["self","post","karma"]       | I think self posts should get karma. CMV|2013-05   |Other 

                                                                                

### 3.5 Sentiment Analysis with VADER

We enrich our dataset with a sentiment score using the **VADER (Valence Aware Dictionary and sEntiment Reasoner)** tool from NLTK, which is well-suited for analyzing social media and informal online text.

#### What we compute:
- **Compound score**: A normalized value ranging from -1 (very negative) to +1 (very positive).
- We apply the sentiment model to the entire post (`title + selftext`) using a Spark UDF.

#### Why compound only?
Although VADER also outputs separate scores for positive, neutral, and negative sentiment, we focus on the **compound** score since it summarizes the overall emotional tone in a single interpretable metric:
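- `x ≥ 0.5` → positive
- `-0.5 < x < 0.5` → neutral
- `x ≤ -0.5` → negative

(VADER's own documentation uses the narrower ±0.05 cut-offs; the ±0.5 bands here are deliberately stricter.)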

This feature allows us to test whether emotionally charged posts correlate with persuasion outcomes or differ across topic categories.
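
The bands above can be turned into a categorical label with a small helper. This is a sketch only: `label_compound` is a hypothetical name, the notebook itself stores the raw compound score, and assigning the boundary values ±0.5 to the positive and negative classes is our assumption.

```python
def label_compound(score, threshold=0.5):
    """Hypothetical helper: bucket a VADER compound score into a coarse label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Example scores are made up for illustration
for score in (0.82, 0.10, -0.64):
    print(score, label_compound(score))
```

Keeping the threshold as a parameter makes it easy to compare these bands against other cut-offs without touching the stored scores.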


In [None]:


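# Reuse one analyzer across rows instead of rebuilding it (and reloading the lexicon) on every call
_vader_analyzer = None

# Define UDF for compound sentiment score
def get_vader_compound(text):
    global _vader_analyzer
    if not text:
        return 0.0  # Default to neutral if text is missing
    if _vader_analyzer is None:
        _vader_analyzer = SentimentIntensityAnalyzer()
    # VADER returns a dictionary with 'neg', 'neu', 'pos', and 'compound' scores
    return float(_vader_analyzer.polarity_scores(text)['compound'])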

# Register the UDF with Spark
vader_udf = udf(get_vader_compound, FloatType())
df_sentiment = df_flesch.withColumn("sentiment", vader_udf(col("merged")))

# Display the sentiment scores alongside the post text
df_sentiment.select("merged", "sentiment").show(5, truncate=False)


                                                                                

## 4 – Upload to bucket


This output will be used as input for Notebooks 04 and 05.

In [None]:
n3_features_df = df_sentiment

In [None]:
# JSON records carry their own field names, so no "header" option is needed (it applies to CSV)
n3_features_df.write \
    .mode("overwrite") \
    .json("gs://st446-cmv/n3_features_df/")

                                                                                