# **Sarcasm Classification: Feature Engineering & Ablation (Part 1)**

**Author: Shanmugam Udhaya**

*Contact:* [@frostbitepillars](https://t.me/frostbitepillars) for any clarifications  

---

This notebook is the first in a series exploring the task of sarcasm detection on a headline dataset.

In this part, we focus on:

- Pre-processing options
- Exploring linguistic features
- Evaluating the features found through ablation analysis

---

## **Baseline Model Reference**
We set up a baseline model with:
- Pre-processing: stopword removal only, with no stemming or lemmatization
- Feature engineering: `tf-idf` with unigrams only
- Model: `LogisticRegression` from sklearn with `class_weight='balanced'`; `max_iter` is increased as needed if a convergence warning is raised
- Split: `train_test_split` with `random_state=42` and `stratify=df['is_sarcastic']`
- Score: **0.797** macro F1


## **Summary of Findings**

### Preprocessing:

- **No text pre-processing is applied** (e.g., no stemming, no lowercasing, no punctuation removal)
- This increases the macro F1 score to **0.836**

### Linguistic Features Used:
- The following 17 handcrafted features were selected through ablation to capture structural, lexical, and stylistic cues.
- Adding them increases the score from **0.836** to **0.850**.

| Feature Name               | Description |
|----------------------------|-------------|
| `text_length`              | Number of tokens in the headline |
| `noun_count`               | Number of nouns |
| `verb_count`               | Number of verbs |
| `adj_count`                | Number of adjectives |
| `adv_count`                | Number of adverbs |
| `dale_chall_score`         | Readability score |
| `sentiment_score`          | Compound sentiment polarity |
| `char_count`               | Total number of characters |
| `capital_char_count`       | Number of uppercase characters |
| `capital_word_count`       | Number of words in all caps |
| `stopword_count`           | Number of stopwords |
| `stopwords_vs_words`       | Ratio of stopwords to total words |
| `contrastive_marker`       | Presence of words like "but", "however", etc. |
| `entropy`                  | Lexical entropy (word distribution randomness) |
| `lexical_diversity`        | Unique words divided by total words |
| `sentiment_incongruity`    | 1 if the headline contains both a strongly positive and a strongly negative word (word-level compound score ≥ 0.5 and ≤ -0.5), else 0 |
| `difficult_word_count`     | Number of complex/difficult words |

---

In [1]:
import pandas as pd
import numpy as np
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from collections import Counter

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [2]:
!gdown 1eVjlFhPvpgXmj-wH52qAzdIUyHrfs92G

Downloading...
From: https://drive.google.com/uc?id=1eVjlFhPvpgXmj-wH52qAzdIUyHrfs92G
To: /content/archive (7).zip
100% 3.46M/3.46M [00:00<00:00, 70.6MB/s]


In [3]:
!unzip "archive (7).zip"

Archive:  archive (7).zip
  inflating: Sarcasm_Headlines_Dataset.json  
  inflating: Sarcasm_Headlines_Dataset_v2.json  


In [4]:
import pandas as pd

df = pd.read_json("Sarcasm_Headlines_Dataset_v2.json", lines=True)

In [5]:
print(df.columns)
print(df.isnull().sum())
print(Counter(df['is_sarcastic']))
print(df.head(10))

Index(['is_sarcastic', 'headline', 'article_link'], dtype='object')
is_sarcastic    0
headline        0
article_link    0
dtype: int64
Counter({0: 14985, 1: 13634})
   is_sarcastic                                           headline  \
0             1  thirtysomething scientists unveil doomsday clo...   
1             0  dem rep. totally nails why congress is falling...   
2             0  eat your veggies: 9 deliciously different recipes   
3             1  inclement weather prevents liar from getting t...   
4             1  mother comes pretty close to using word 'strea...   
5             0                               my white inheritance   
6             0         5 ways to file your taxes with less stress   
7             1  richard branson's global-warming donation near...   
8             1  shadow government getting too large to meet in...   
9             0                 lots of parents know this scenario   

                                        article_link  
0  https:

## Pre-Processing

### No Pre-Processing at all

In [6]:
df['clean_headline'] = df['headline']

### With Pre-Processing

In [7]:
def preprocess_text(text, action, stopword):
    """Tokenize a headline and optionally remove stopwords, stem, or lemmatize.

    action: "S" to stem, "L" to lemmatize, anything else leaves tokens unchanged.
    stopword: if True, drop NLTK English stopwords.
    """
    # Optional steps, currently disabled: lowercasing, removing digits,
    # removing punctuation and special characters, stripping HTML tags.
    # Reference: https://www.geeksforgeeks.org/text-preprocessing-for-nlp-tasks/
    words = word_tokenize(text)
    if stopword:
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word.lower() not in stop_words]

    if action == "S":
        stemmer = PorterStemmer()
        words = [stemmer.stem(word) for word in words]
    elif action == "L":
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)

# With action="" and stopword=False the headline is only tokenized and re-joined with spaces
df['clean_headline'] = df['headline'].apply(lambda text: preprocess_text(text, "", False))

## Download GloVe

In [None]:
!gdown 1HPpXpaVGK1G6W1G5wPZwF9p091qacR8i
!unzip "glove.6B.zip"

Downloading...
From (original): https://drive.google.com/uc?id=1HPpXpaVGK1G6W1G5wPZwF9p091qacR8i
From (redirected): https://drive.google.com/uc?id=1HPpXpaVGK1G6W1G5wPZwF9p091qacR8i&confirm=t&uuid=9efc0b3b-99ac-446e-9cac-27a33941f0ca
To: /content/glove.6B.zip
100% 862M/862M [00:24<00:00, 35.7MB/s]
Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
import numpy as np

def load_glove_embeddings(glove_path):
    embeddings = {}
    with open(glove_path, 'r', encoding='utf8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

glove_path = "glove.6B.300d.txt"  # adjust path if needed
glove_dict = load_glove_embeddings(glove_path)

# Example: get vector for a word
print(glove_dict['amazing'])

[ 1.4999e-01  5.3597e-02  9.4669e-02  1.2415e-01 -1.0623e-01  3.2981e-01
 -3.6563e-02 -4.9109e-01  5.0600e-02 -4.8218e-01  6.9264e-01 -1.5298e-01
 -2.3069e-01  8.3252e-02  5.6969e-02 -4.4769e-01  2.7878e-01  7.0629e-02
 -2.8340e-01  4.1989e-01  3.3607e-01  3.3273e-01 -4.2430e-01  1.3433e-01
  2.4444e-01  3.6712e-01 -4.7969e-01 -3.8191e-01  1.8654e-01 -1.9120e-01
 -1.7775e-01 -2.2396e-01 -1.2442e+00 -4.2139e-01 -1.2342e+00  4.5623e-01
  1.9550e-02  7.4867e-01  4.7384e-02 -7.7133e-02 -2.6682e-01 -3.6488e-01
 -2.4977e-02 -6.0338e-02  4.1059e-02  4.3062e-01  2.4870e-01  3.4548e-02
  6.1338e-01 -4.3779e-02 -5.3384e-02  4.8766e-01 -4.4736e-02  9.4678e-02
 -2.7967e-01  7.3181e-01  5.5861e-01  8.9743e-02 -1.2702e-01 -4.8329e-02
  1.3241e-01 -2.1868e-01  4.7130e-01  2.3780e-01 -1.1905e-01  1.4091e-01
  3.4236e-02  5.8102e-02 -1.0685e-01 -1.2360e-01 -6.4432e-01 -1.2913e-02
  5.6400e-02  4.5082e-01 -1.1311e-01 -2.9463e-01 -4.4107e-02 -1.0306e-01
  5.9227e-02  8.7667e-02 -6.0326e-01 -1.5421e-01  4

In [None]:
def sentence_to_glove_vector(sentence, glove_dict, dim=300):
    words = sentence.lower().split()
    vectors = [glove_dict[word] for word in words if word in glove_dict]
    if len(vectors) == 0:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

df['word_embedding'] = df['clean_headline'].apply(
    lambda x: sentence_to_glove_vector(x, glove_dict, dim=300)
)
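
To see what the mean-pooling step produces, here is a small sketch with a made-up three-dimensional embedding table. The `toy_embeddings` values and the `mean_pool` helper are hypothetical stand-ins for the 300-d GloVe dictionary and the function above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-d GloVe table.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "news": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

def mean_pool(sentence, embeddings, dim):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

pooled = mean_pool("Good news everyone", toy_embeddings, dim=3)  # "everyone" is skipped
empty = mean_pool("zzz qqq", toy_embeddings, dim=3)              # all out-of-vocabulary
print(pooled, empty)
```

Out-of-vocabulary words are silently dropped, so a headline made up mostly of rare words is represented by very few vectors, or by the all-zero vector.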

## tf-idf

In [8]:
tf_idf = TfidfVectorizer()
X_train, X_test, Y_train, Y_test = train_test_split(df['clean_headline'], df['is_sarcastic'], random_state=42, stratify=df['is_sarcastic'])

X_train_idf = tf_idf.fit_transform(X_train)
X_test_idf = tf_idf.transform(X_test)
print(X_train_idf.shape, X_test_idf.shape)

(21464, 23183) (7155, 23183)
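
The row counts in these shapes follow from the default split: `train_test_split` holds out 25% of the rows when no size is given. The arithmetic below is a sketch of that rounding as I understand it (the test share is rounded up and the remainder goes to training), not a copy of scikit-learn's code.

```python
import math

n_samples = 28619          # 14985 + 13634 headlines, from the class counts above
test_size = 0.25           # scikit-learn's default when neither size is given

n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
n_train = n_samples - n_test                # the remainder goes to training
print(n_train, n_test)
```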


In [9]:
from collections import Counter
count_y_train = Counter(Y_train)
total = sum(count_y_train.values())
print(count_y_train[0] / total)

0.5236209467014536


In [10]:
print(Counter(df['is_sarcastic']))
count_is_sarcastic = Counter(Y_train)
total_y = sum(count_is_sarcastic.values())
print(count_is_sarcastic[0] / total_y)

Counter({0: 14985, 1: 13634})
0.5236209467014536
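
The classes are close to balanced (about 52% non-sarcastic), so `class_weight='balanced'` only nudges the model. A minimal sketch of the weighting formula scikit-learn documents for that option, `n_samples / (n_classes * count)`, applied to the full-dataset counts printed above; the helper name is hypothetical, and the notebook's model computes its weights from the training labels only.

```python
from collections import Counter

def balanced_class_weights(label_counts):
    """Weights in the style of class_weight='balanced': n_samples / (n_classes * count)."""
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: n_samples / (n_classes * count) for label, count in label_counts.items()}

# Full-dataset counts printed above; the fitted model sees only the training labels.
weights = balanced_class_weights(Counter({0: 14985, 1: 13634}))
print(weights)
```

Both weights stay within about 5% of 1.0, which suggests the option has only a small effect on this dataset.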


## Linguistic Features

### POS features

In [13]:
def get_pos_counts(text):
    """
    Returns a dictionary with counts of certain POS tags (NOUN, VERB, ADJ, ADV)
    """
    pos_tags = pos_tag(word_tokenize(text))
    counts = {
        'noun_count': 0,
        'verb_count': 0,
        'adj_count': 0,
        'adv_count': 0
    }
    for word, tag in pos_tags:
        if tag.startswith('NN'):
            counts['noun_count'] += 1
        elif tag.startswith('VB'):
            counts['verb_count'] += 1
        elif tag.startswith('JJ'):
            counts['adj_count'] += 1
        elif tag.startswith('RB'):
            counts['adv_count'] += 1
    return counts

def get_text_length(text):
    return len(word_tokenize(text))

import spacy
nlp = spacy.load("en_core_web_sm")

def get_ner_count(text):
    doc = nlp(text)
    return len(doc.ents)

In [14]:
df['pos_counts'] = df['clean_headline'].apply(get_pos_counts)
df['text_length'] = df['clean_headline'].apply(get_text_length)
df['ner_count'] = df['clean_headline'].apply(get_ner_count)

df['noun_count'] = df['pos_counts'].apply(lambda x: x['noun_count'])
df['verb_count'] = df['pos_counts'].apply(lambda x: x['verb_count'])
df['adj_count'] = df['pos_counts'].apply(lambda x: x['adj_count'])
df['adv_count'] = df['pos_counts'].apply(lambda x: x['adv_count'])

print(df[['clean_headline', 'noun_count', 'verb_count', 'adj_count', 'adv_count', 'text_length', 'ner_count']].head(5))

                                      clean_headline  noun_count  verb_count  \
0  thirtysomething scientists unveil doomsday clo...           4           1   
1  dem rep. totally nails why congress is falling...           5           2   
2  eat your veggies : 9 deliciously different rec...           2           1   
3  inclement weather prevents liar from getting t...           3           3   
4  mother comes pretty close to using word 'strea...           2           3   

   adj_count  adv_count  text_length  ner_count  
0          2          0            8          1  
1          3          1           14          2  
2          1          1            8          1  
3          0          0            8          0  
4          1          2           10          0  


### Readability

In [16]:
!pip install textstat

Collecting textstat
  Downloading textstat-0.7.5-py3-none-any.whl.metadata (15 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Collecting cmudict (from textstat)
  Downloading cmudict-1.0.32-py3-none-any.whl.metadata (3.6 kB)
Downloading textstat-0.7.5-py3-none-any.whl (105 kB)
Downloading cmudict-1.0.32-py3-none-any.whl (939 kB)
Downloading pyphen-0.17.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pyphen, cmudict, textstat
Successfully installed cmudict-1.0.32 pyphen-0.17.2 textstat-0.7.5


In [17]:
import textstat
df['flesch_reading_ease'] = df['clean_headline'].apply(lambda text: textstat.flesch_reading_ease(text))
df['dale_chall_score'] = df['clean_headline'].apply(lambda text: textstat.dale_chall_readability_score(text))

### Sentiment Analysis

In [18]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [19]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
df['sentiment_score'] = df['clean_headline'].apply(lambda text: analyzer.polarity_scores(text)['compound'])

### Incongruity

In [20]:
def detect_incongruity(text):
    tokens = word_tokenize(text)
    pos_words = 0
    neg_words = 0

    for word in tokens:
        score = analyzer.polarity_scores(word)['compound']
        if score >= 0.5:
            pos_words += 1
        elif score <= -0.5:
            neg_words += 1

    # Return 1 if both positive and negative words exist → sentiment conflict
    return int(pos_words > 0 and neg_words > 0)

# Apply to the DataFrame
df['sentiment_incongruity'] = df['clean_headline'].apply(detect_incongruity)
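
The rule above flags a headline only when it mixes at least one strongly positive word with at least one strongly negative word. The sketch below reproduces that rule with a tiny hand-made polarity table; the `toy_polarity` scores are hypothetical and merely stand in for VADER's per-word compound scores.

```python
# Hypothetical word-level polarity scores; the notebook uses VADER's compound score instead.
toy_polarity = {"love": 0.64, "great": 0.62, "hate": -0.57, "disaster": -0.62, "fine": 0.2}

def has_sentiment_conflict(tokens, polarity, threshold=0.5):
    """Return 1 if at least one strongly positive and one strongly negative token occur."""
    scores = [polarity.get(tok.lower(), 0.0) for tok in tokens]
    has_positive = any(s >= threshold for s in scores)
    has_negative = any(s <= -threshold for s in scores)
    return int(has_positive and has_negative)

conflict = has_sentiment_conflict("i love this disaster".split(), toy_polarity)    # mixed polarity
no_conflict = has_sentiment_conflict("everything is fine".split(), toy_polarity)   # weak polarity only
print(conflict, no_conflict)
```

Because of the 0.5 threshold, mildly polar words never trigger the flag, so the feature is expected to be sparse.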

### Emotion

In [21]:
!pip install text2emotion
!pip install emoji==1.6.3

Collecting text2emotion
  Downloading text2emotion-0.0.5-py3-none-any.whl.metadata (3.1 kB)
Collecting emoji>=0.6.0 (from text2emotion)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading text2emotion-0.0.5-py3-none-any.whl (57 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji, text2emotion
Successfully installed emoji-2.14.1 text2emotion-0.0.5
Collecting emoji==1.6.3
  Downloading emoji-1.6.3.tar.gz (174 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ...

### Text Structure

In [22]:
def count_chars(text):
    return len(text)

def count_words(text):
    return len(text.split())

def count_capital_chars(text):
    return sum(1 for char in text if char.isupper())

def count_capital_words(text):
    return sum(map(str.isupper,text.split()))

def count_unique_words(text):
    return len(set(text.split()))

def count_exclamation(text):
    return text.count("!")

df['exclamation_count'] = df['clean_headline'].apply(count_exclamation)
df['char_count'] = df['clean_headline'].apply(count_chars)
df['word_count'] = df['clean_headline'].apply(count_words)
df['capital_char_count'] = df['clean_headline'].apply(count_capital_chars)
df['capital_word_count'] = df['clean_headline'].apply(count_capital_words)
# Build the stopword set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words('english'))
df['stopword_count'] = df['clean_headline'].apply(lambda x: sum(1 for word in x.split() if word in english_stopwords))


df['avg_wordlength'] = df['char_count']/df['word_count']
df['stopwords_vs_words'] = df['stopword_count']/df['word_count']

### Irony Markers

In [23]:
def has_contrastive_conjunction(text):
    contrastive_words = {"but", "although", "yet", "however", "though"}
    return int(any(word in text.split() for word in contrastive_words))

df['contrastive_marker'] = df['clean_headline'].apply(has_contrastive_conjunction)

### Contextual Similarity

In [None]:
import torch
from sentence_transformers import SentenceTransformer
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model = SentenceTransformer('all-mpnet-base-v2', device=device)

# Define neutral statement embedding once (avoid redundant computation)
neutral_statement = "This is a neutral news headline."
neutral_embedding = model.encode([neutral_statement], convert_to_tensor=True).to(device)

# Encode all headlines in batches (efficient batch processing with GPU)
headlines = df['clean_headline'].tolist()
headline_embeddings = model.encode(headlines, batch_size=128, convert_to_tensor=True, device=device)
# Compute cosine similarity using GPU tensors
neutral_embedding = neutral_embedding / neutral_embedding.norm(dim=-1, keepdim=True)
headline_embeddings = headline_embeddings / headline_embeddings.norm(dim=-1, keepdim=True)

cosine_similarities = (headline_embeddings @ neutral_embedding.T).cpu().numpy().flatten()

# Store results
df['contextual_similarity'] = cosine_similarities


Using device: cpu



### Syntactic Complexity

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def get_syntactic_complexity(text):
    doc = nlp(text)
    if len(doc) == 0:  # empty headline: avoid max() on an empty sequence and division by zero
        return 0, 0, 0.0
    depth = max(len(list(token.ancestors)) for token in doc)  # depth of the deepest token
    clause_count = sum(1 for token in doc if token.dep_ in ("ccomp", "advcl", "acl"))
    mean_dep_length = sum(abs(token.head.i - token.i) for token in doc) / len(doc)

    return depth, clause_count, mean_dep_length

df[['sentence_depth', 'clause_count', 'mean_dep_length']] = df['clean_headline'].apply(
    lambda x: pd.Series(get_syntactic_complexity(x))
)
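
The depth and mean dependency length can be checked by hand on a small parse. The sketch below works on a plain list of head indices rather than a spaCy `Doc`; the `heads` array is a hypothetical parse, and it follows spaCy's convention that the root token is its own head.

```python
def depth_and_mean_dep_length(heads):
    """heads[i] is the index of token i's head; the root points to itself (as in spaCy)."""
    def n_ancestors(i):
        steps = 0
        while heads[i] != i:   # climb until the root is reached
            i = heads[i]
            steps += 1
        return steps

    if not heads:
        return 0, 0.0
    depth = max(n_ancestors(i) for i in range(len(heads)))
    mean_dep_length = sum(abs(h - i) for i, h in enumerate(heads)) / len(heads)
    return depth, mean_dep_length

# Hypothetical parse of a four-token headline: token 1 is the root.
heads = [1, 1, 3, 1]
depth, mean_len = depth_and_mean_dep_length(heads)
print(depth, mean_len)
```

Token 2 attaches to token 3, which attaches to the root, giving a depth of 2; the head distances 1, 0, 1 and 2 average to 1.0.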


### Additional Lexical Features (reference: https://www.sciencedirect.com/science/article/pii/S1877050924007762)

In [None]:
import numpy as np
import pandas as pd
import textstat
import string
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from textblob import TextBlob
from scipy.stats import entropy

nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")

In [None]:
from scipy.stats import entropy
def calculate_entropy(text):
    words = word_tokenize(text.lower())
    freq_dist = Counter(words)
    probs = np.array(list(freq_dist.values())) / sum(freq_dist.values())
    return entropy(probs, base=2)  # Shannon Entropy

df["entropy"] = df["clean_headline"].apply(calculate_entropy)

### 2. **Lexical Diversity (Unique Words / Total Words)**
def lexical_diversity(text):
    words = word_tokenize(text.lower())
    return len(set(words)) / len(words) if len(words) > 0 else 0

df["lexical_diversity"] = df["clean_headline"].apply(lexical_diversity)

### 6. **Wrong Words (Words Not in WordNet)**
def count_wrong_words(text):
    words = word_tokenize(text.lower())
    return sum(1 for word in words if not wordnet.synsets(word))

df["wrong_word_count"] = df["clean_headline"].apply(count_wrong_words)

### 7. **Difficult Words (Hard-to-Read Words)**
df["difficult_word_count"] = df["clean_headline"].apply(textstat.difficult_words)

### 8. **Lengthy Words (Words > 2 Characters)**
# Split into words first: iterating over the raw string would yield single characters
df["lengthy_word_count"] = df["clean_headline"].apply(lambda text: sum(1 for word in text.split() if len(word) > 2))

### 9. **Two-Letter Words**
df["two_letter_words"] = df["clean_headline"].apply(lambda text: sum(1 for word in text.split() if len(word) == 2))

### 10. **Single-Letter Words**
df["single_letter_words"] = df["clean_headline"].apply(lambda text: sum(1 for word in text.split() if len(word) == 1))
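
Word-length features depend on iterating over words rather than characters, and a Python string iterates character by character. A short sketch of the difference, using one of the headlines shown earlier:

```python
headline = "my white inheritance"   # one of the headlines shown earlier

# Iterating over the string yields single characters, so every item has length 1.
char_lengths = [len(item) for item in headline]

# Splitting first yields whole words, which is what a word-length feature needs.
word_lengths = [len(word) for word in headline.split()]

long_words_wrong = sum(1 for item in headline if len(item) > 2)          # no character is longer than 1
long_words_right = sum(1 for word in headline.split() if len(word) > 2)  # counts real words
print(long_words_wrong, long_words_right)
```

Applied directly to a string, a "words longer than 2 characters" count is always zero and a "single-letter words" count equals the number of characters, so these features only carry information once the text is split.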


In [None]:
import numpy as np
import pandas as pd
import textstat
import spacy
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from scipy.stats import entropy
import string
import re

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Example dataset
# df = pd.read_csv("sarcasm_dataset.csv")

# Feature extraction functions
def lexical_diversity(text):
    words = word_tokenize(text)
    return len(set(words)) / max(len(words), 1)  # Avoid division by zero

def dale_chall_score(text):
    return textstat.dale_chall_readability_score(text)

def flesch_reading_ease(text):
    return textstat.flesch_reading_ease(text)

def stopword_count(text):
    stop_words = set(stopwords.words('english'))
    return sum(1 for word in word_tokenize(text) if word in stop_words)

def word_entropy(text):
    words = word_tokenize(text)
    word_freq = Counter(words)
    probs = np.array(list(word_freq.values())) / len(words)
    return entropy(probs)

def pos_counts(text):
    pos_tags = pos_tag(word_tokenize(text))
    counts = Counter(tag for _, tag in pos_tags)
    return counts

def count_pronouns(text):
    doc = nlp(text)
    return sum(1 for token in doc if token.pos_ == "PRON")

def count_negations(text):
    negation_words = {"not", "never", "none", "n't"}
    return sum(1 for word in word_tokenize(text) if word.lower() in negation_words)

def count_modals(text):
    modal_words = {"should", "could", "might", "must"}
    return sum(1 for word in word_tokenize(text) if word.lower() in modal_words)

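def count_hedges(text):
    # Multi-word hedges can never equal a single token, so match them as phrases
    hedge_words = {"maybe", "perhaps"}
    lowered = text.lower()
    single = sum(1 for word in word_tokenize(lowered) if word in hedge_words)
    return single + len(re.findall(r"\b(?:kind|sort) of\b", lowered))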

def dependency_features(text):
    doc = nlp(text)
    noun_count = sum(1 for token in doc if token.pos_ == "NOUN")
    verb_count = sum(1 for token in doc if token.pos_ == "VERB")
    dependent_clauses = sum(1 for token in doc if token.dep_ in {"acl", "advcl"})
    return noun_count, verb_count, dependent_clauses

def punctuation_features(text):
    question_count = text.count("?")
    consecutive_punctuation = len(re.findall(r"[!?]{2,}", text))
    return question_count, consecutive_punctuation

df["pronoun_count"] = df["clean_headline"].apply(count_pronouns)
df["negation_count"] = df["clean_headline"].apply(count_negations)
df["modal_verbs_count"] = df["clean_headline"].apply(count_modals)
df["hedge_word_count"] = df["clean_headline"].apply(count_hedges)
#df[["noun_count", "verb_count", "dependent_clauses"]] = df["clean_headline"].apply(lambda x: pd.Series(dependency_features(x)))
df[["question_count", "consecutive_punctuation"]] = df["clean_headline"].apply(lambda x: pd.Series(punctuation_features(x)))

# Display first few rows
print(df.head())


### Linguistics standalone

In [None]:
def combine_features(row):
    additional_feats = np.array([
        row['text_length'],
        row['ner_count'],
        row['noun_count'],
        row['verb_count'],
        row['adj_count'],
        row['adv_count']
    ], dtype=float)

    # The GloVe vector could be concatenated here; for now only the linguistic
    # features are returned, to test which of them help on their own
    return additional_feats

df['combined_features'] = df.apply(combine_features, axis=1)

In [None]:
X = np.stack(df['combined_features'].values, axis=0)
y = df['is_sarcastic'].values

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

X_train_idf = X_train
X_test_idf = X_test

In [None]:
None+1  # raises TypeError; this looks like a deliberate stop so that "Run all" halts here

### Linguistics and tf-idf

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorization
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['clean_headline'])

# Extract and scale additional features
scaler = StandardScaler()
linguistic_columns = [
    'text_length', 'noun_count', 'verb_count', 'adj_count', 'adv_count',
    'dale_chall_score', 'sentiment_score', 'char_count', 'capital_char_count',
    'capital_word_count', 'stopword_count', 'stopwords_vs_words',
    'contrastive_marker', 'entropy', 'lexical_diversity',
    'sentiment_incongruity', 'difficult_word_count',
]
additional_features = scaler.fit_transform(df[linguistic_columns])  # now scaled
additional_features_sparse = csr_matrix(additional_features)  # convert to sparse

# Combine TF-IDF with scaled additional features
combined_features = hstack([X_tfidf, additional_features_sparse])
print("Combined shape:", combined_features.shape)

# Target
y = df['is_sarcastic'].values

# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(
    combined_features, y, random_state=42, stratify=y
)

# Train logistic regression
lr = LogisticRegression(class_weight='balanced', max_iter=10000)
lr.fit(X_train, Y_train)

# Predict and evaluate
y_pred = lr.predict(X_test)
baseline_score_all_features = f1_score(Y_test, y_pred, average='macro')
print("F1 Score:", baseline_score_all_features)
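
In the cell above, the TF-IDF vocabulary and the scaler statistics are fitted on the whole frame before the split, so the test rows influence the features; this may make the score slightly optimistic. One way to avoid that is to wrap both transformers and the classifier in a scikit-learn `Pipeline`, so that everything is fitted on the training rows only. The following is a sketch under stated assumptions: the `toy` frame and its values are made up, and only two of the linguistic columns are included.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical frame standing in for df; only two numeric features are shown.
toy = pd.DataFrame({
    "clean_headline": [
        "area man wins argument with himself", "senate passes budget bill",
        "nation shocked by obvious outcome", "city council approves new park",
        "local dog elected mayor again", "storm closes schools on monday",
        "scientists discover water is wet", "company reports quarterly earnings",
    ],
    "text_length": [6, 4, 5, 5, 5, 5, 5, 4],
    "sentiment_score": [0.5, 0.0, -0.3, 0.4, 0.0, -0.2, 0.0, 0.0],
    "is_sarcastic": [1, 0, 1, 0, 1, 0, 1, 0],
})

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "clean_headline"),  # text column, passed as a string
    ("ling", StandardScaler(), ["text_length", "sentiment_score"]),
])
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=10000)),
])

X = toy.drop(columns="is_sarcastic")
y = toy["is_sarcastic"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# fit() learns the vocabulary, IDF weights and scaling statistics from the training rows only.
model.fit(X_train, y_train)
toy_macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
print(f"Toy macro F1: {toy_macro_f1:.3f}")
```

The score on eight made-up rows is meaningless; the point is the structure, which would carry over to the real `df` by listing all selected columns in the `"ling"` step.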


### Feature Ablation

In [None]:
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


# ✅ Define all features you want to evaluate
original_features = df[[
    "text_length", "ner_count", "noun_count", "verb_count", "adj_count", "adv_count",
    "flesch_reading_ease", "dale_chall_score", "sentiment_score", "char_count",
    "exclamation_count", "word_count", "capital_char_count", "capital_word_count",
    "stopword_count", "avg_wordlength", "stopwords_vs_words", "contrastive_marker",
    'entropy', 'lexical_diversity', 'sentiment_incongruity', "wrong_word_count", "difficult_word_count", "lengthy_word_count"
]]

# Compute baseline model with all features
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['clean_headline'])

scaler = StandardScaler()
scaled_features = scaler.fit_transform(original_features)
scaled_features_sparse = csr_matrix(scaled_features)

combined_features = hstack([X_tfidf, scaled_features_sparse])
y = df['is_sarcastic'].values

X_train, X_test, Y_train, Y_test = train_test_split(combined_features, y, random_state=42, stratify=y)

lr = LogisticRegression(class_weight='balanced', max_iter=10000)
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)
baseline_score_all_features = f1_score(Y_test, y_pred, average='macro')
print(f"Baseline Macro F1 Score: {baseline_score_all_features:.4f}")

# Feature Evaluation (Ablation Test)
recommended_features = []

for i in range(len(original_features.columns)):
    feature = original_features.columns[i]

    new_features = original_features.drop(columns=[feature])

    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(new_features)
    scaled_features_sparse = csr_matrix(scaled_features)

    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(df['clean_headline'])

    combined_features = hstack([X_tfidf, scaled_features_sparse])
    y = df['is_sarcastic'].values

    X_train, X_test, Y_train, Y_test = train_test_split(combined_features, y, random_state=42, stratify=y)

    lr = LogisticRegression(class_weight='balanced', max_iter=10000)
    lr.fit(X_train, Y_train)
    y_pred = lr.predict(X_test)

    new_score = f1_score(Y_test, y_pred, average='macro')
    diff = new_score - baseline_score_all_features

    if diff < 0:
        print(f"📈 {feature} is important (-{abs(diff):.4f})")
        recommended_features.append(feature)
    elif diff > 0.001:
        print(f"⚠️ {feature} may be hurting performance (+{diff:.4f})")
    else:
        print(f"⚖️ {feature} is neutral ({diff:.4f})")

# Retrain model with only recommended features
print("\n Retraining with recommended features:", recommended_features[:5], "..." if len(recommended_features) > 5 else "")
selected_features = df[recommended_features]

scaler = StandardScaler()
scaled_features = scaler.fit_transform(selected_features)
scaled_features_sparse = csr_matrix(scaled_features)

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['clean_headline'])

combined_features = hstack([X_tfidf, scaled_features_sparse])
y = df['is_sarcastic'].values

X_train, X_test, Y_train, Y_test = train_test_split(combined_features, y, random_state=42, stratify=y)

lr = LogisticRegression(class_weight='balanced', max_iter=10000)
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)

final_score = f1_score(Y_test, y_pred, average='macro')
print(f"\n🚀 Final Macro F1 Score using best features: {final_score:.4f}")
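
The decision rule in the loop above can be isolated from the model code. The sketch below applies the same three-way verdict to an arbitrary scoring function; `ablate`, `toy_score`, and the per-feature contributions are hypothetical and exist only to make the rule easy to trace.

```python
def ablate(features, score_fn, tolerance=0.001):
    """Leave-one-out ablation: compare each reduced feature set with the full one."""
    full_score = score_fn(features)
    verdicts = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        diff = score_fn(reduced) - full_score
        if diff < 0:
            verdicts[feature] = "important"   # score drops without it
        elif diff > tolerance:
            verdicts[feature] = "hurting"     # score rises without it
        else:
            verdicts[feature] = "neutral"
    return verdicts

# Hypothetical scorer: each feature contributes a fixed amount to the score.
contribution = {"entropy": 0.010, "ner_count": -0.005, "word_count": 0.0}

def toy_score(features):
    return 0.80 + sum(contribution[f] for f in features)

verdicts = ablate(list(contribution), toy_score)
print(verdicts)
```

Note the asymmetry of the rule: any drop, however small, marks a feature as important, while a gain has to exceed 0.001 before the feature is reported as hurting. With a single train/test split, differences of that size may be within noise.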


In [None]:
print(recommended_features)

## Final Test on Selected Features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

# Prepare the features: the GloVe sentence vectors are stacked here,
# but only the 17 selected linguistic features go into the final matrix
X_glove = np.vstack(df['word_embedding'].values)
X_ling = df[['text_length', 'noun_count', 'verb_count', 'adj_count', 'adv_count', 'dale_chall_score',
             'sentiment_score', 'char_count', 'capital_char_count', 'capital_word_count',
             'stopword_count', 'stopwords_vs_words', 'contrastive_marker', 'entropy',
             'lexical_diversity', 'sentiment_incongruity', 'difficult_word_count']].values

X_gling = np.hstack([X_ling])  # linguistic features only; X_glove is not included
y = df['is_sarcastic'].values

# 👇 split while keeping indices for TF-IDF
X_train_gling, X_test_gling, Y_train, Y_test, idx_train, idx_test = train_test_split(
    X_gling, y, df.index, stratify=y, random_state=42
)

# Standardize the linguistic features (scaler fitted on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_gling)
X_test_scaled = scaler.transform(X_test_gling)

X_train_sparse = csr_matrix(X_train_scaled)
X_test_sparse = csr_matrix(X_test_scaled)

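# TF-IDF vectorization (the vocabulary is fitted on all rows, train and test)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_tfidf_all = tfidf.fit_transform(df['clean_headline'])

# Slice rows with the split indices. These are df.index labels used as row
# positions, which assumes df has a default 0..n-1 index.
X_train_tfidf = X_tfidf_all[idx_train]
X_test_tfidf = X_tfidf_all[idx_test]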

# Combine
X_train_combined = hstack([X_train_tfidf, X_train_sparse])
X_test_combined = hstack([X_test_tfidf, X_test_sparse])

# Logistic Regression
lr = LogisticRegression(class_weight='balanced', max_iter=10000)
lr.fit(X_train_combined, Y_train)

y_pred = lr.predict(X_test_combined)
print(classification_report(Y_test, y_pred))
print(confusion_matrix(Y_test, y_pred))
print(f1_score(Y_test, y_pred, average='macro'))


              precision    recall  f1-score   support

           0       0.87      0.85      0.86      3746
           1       0.84      0.86      0.85      3409

    accuracy                           0.85      7155
   macro avg       0.85      0.85      0.85      7155
weighted avg       0.85      0.85      0.85      7155

[[3173  573]
 [ 479 2930]]
0.8528001655852933
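
The row slicing in the cell above passes `df.index` through `train_test_split` and then uses the resulting labels as row positions in the TF-IDF matrix. That is only equivalent when `df` has a default `0..n-1` index. Below is a minimal sketch, assuming a DataFrame whose index has gaps (for example after filtering rows), that splits integer positions instead. The `toy_df` contents are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a non-default index.
toy_df = pd.DataFrame(
    {
        "clean_headline": [
            "alpha beta", "beta gamma", "gamma delta", "delta epsilon",
            "epsilon zeta", "zeta eta", "eta theta", "theta iota",
        ],
        "is_sarcastic": [0, 1, 0, 1, 0, 1, 0, 1],
    },
    index=[3, 7, 10, 11, 15, 20, 21, 30],
)

y = toy_df["is_sarcastic"].values
positions = np.arange(len(toy_df))  # 0..n-1 row positions, independent of toy_df.index

pos_train, pos_test, y_train, y_test = train_test_split(
    positions, y, stratify=y, random_state=42
)

# As in the notebook, the TF-IDF matrix is built once for all rows.
X_tfidf_all = TfidfVectorizer().fit_transform(toy_df["clean_headline"])
X_train_tfidf = X_tfidf_all[pos_train]  # positional row slicing on the sparse matrix
X_test_tfidf = X_tfidf_all[pos_test]

# The same positions select the matching DataFrame rows via .iloc.
train_rows = toy_df.iloc[pos_train]
print(X_train_tfidf.shape[0], X_test_tfidf.shape[0])
```

With this toy index, the largest label (30) exceeds the number of matrix rows (8), so slicing by label would fail outright. With a shuffled but in-range index it would silently pick the wrong rows.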


## Logistic Regression

In [11]:
lr = LogisticRegression(class_weight='balanced', max_iter=10000)
lr.fit(X_train_idf, Y_train)

In [12]:
y_pred = lr.predict(X_test_idf)
print(classification_report(Y_test, y_pred))
print(confusion_matrix(Y_test, y_pred))
print(f1_score(Y_test, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.86      0.82      0.84      3746
           1       0.81      0.86      0.83      3409

    accuracy                           0.84      7155
   macro avg       0.84      0.84      0.84      7155
weighted avg       0.84      0.84      0.84      7155

[[3059  687]
 [ 484 2925]]
0.8362808014816456
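
The macro F1 values printed in this notebook can be recomputed from the confusion matrices alone, which is a quick sanity check on the reported numbers. A minimal sketch in plain Python, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the helper name `macro_f1_from_confusion` is introduced here for illustration only:

```python
def macro_f1_from_confusion(cm):
    """Macro F1 for a square confusion matrix (rows = true, columns = predicted)."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true class k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, true class differs
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n  # unweighted mean over classes


# Confusion matrices copied from the two outputs above.
cm_selected_features = [[3173, 573], [479, 2930]]  # TF-IDF + selected linguistic features
cm_logreg_idf = [[3059, 687], [484, 2925]]         # Logistic Regression on X_train_idf

f1_selected = macro_f1_from_confusion(cm_selected_features)
f1_idf = macro_f1_from_confusion(cm_logreg_idf)
print(f"{f1_selected:.4f} {f1_idf:.4f} gain={f1_selected - f1_idf:.4f}")
# → 0.8528 0.8363 gain=0.0165
```

Per class, F1 reduces to `2*TP / (2*TP + FP + FN)`, and macro averaging takes the plain mean of the per-class values, so both classes count equally regardless of support.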


In [None]:
# Intentional error: apparently a stop marker so that "Run All" halts before the cells below.
None+1

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

## Naive-Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import MinMaxScaler

# Initialize Naive Bayes classifier
nb = MultinomialNB()

# Rescale features to [0, 1]: MultinomialNB requires non-negative inputs.
# MinMaxScaler does not accept sparse matrices, hence the dense conversion
# (this can be memory-hungry for a large vocabulary).
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_idf.toarray())
X_test_scaled = scaler.transform(X_test_idf.toarray())


# Train the model using the scaled features
nb.fit(X_train_scaled, Y_train)

# Make predictions using the scaled features
y_pred = nb.predict(X_test_scaled)

# Evaluation
print("Classification Report:\n", classification_report(Y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(Y_test, y_pred))
print("Macro F1 Score:", f1_score(Y_test, y_pred, average='macro'))
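
`MultinomialNB` rejects negative feature values, which is why the cell above rescales with `MinMaxScaler`. A minimal sketch, assuming the feature matrix holds TF-IDF weights only (non-negative by construction), in which case the model can be fitted directly on the sparse matrix without `.toarray()`. The toy headlines and labels are hypothetical. If standardized linguistic features are stacked into `X_train_idf`, this shortcut does not apply as-is.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy headlines and labels (not the notebook's data).
train_texts = [
    "Local man heroically finishes entire sandwich",
    "Senate passes budget bill after long debate",
    "Nation shocked that Monday arrived again",
    "City council approves new transit plan",
    "Cat announces plans to ignore everyone forever",
    "Scientists publish results of climate survey",
]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["Area man shocked by sandwich", "Council passes transit bill"]

tfidf = TfidfVectorizer()
X_train_nb = tfidf.fit_transform(train_texts)  # sparse and non-negative
X_test_nb = tfidf.transform(test_texts)

nb_sparse = MultinomialNB()
nb_sparse.fit(X_train_nb, train_labels)   # accepts scipy sparse input directly
pred_nb = nb_sparse.predict(X_test_nb)
print(pred_nb)
```

Keeping the matrix sparse avoids allocating a dense array of shape (rows, vocabulary size), which matters once the vocabulary grows to tens of thousands of terms.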


## SVM

In [None]:
# from sklearn.svm import LinearSVC

# svm = LinearSVC(class_weight='balanced', max_iter=10000)
# svm.fit(X_train_idf, Y_train)
# y_pred = svm.predict(X_test_idf)

# print("SVM F1 Score:", f1_score(Y_test, y_pred, average='macro'))


In [None]:
# Intentional error: apparently a stop marker so that "Run All" halts before the cells below.
None+1

## Neural Network

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, BatchNormalization, Concatenate
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Ensure reproducibility
tf.random.set_seed(42)

### TF-IDF Feature Extraction
tfidf = TfidfVectorizer()  # default settings; no max_features cap is applied
X_tfidf = tfidf.fit_transform(df['clean_headline'])
print(f"TF-IDF shape: {X_tfidf.shape}")

### Linguistic Features
additional_features = df[[ "noun_count", "adv_count",
                          "exclamation_count",  "avg_wordlength", "stopwords_vs_words"]].values

# Standardizing the numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(additional_features)
scaled_features_sparse = csr_matrix(scaled_features)  # Convert to sparse matrix

### Combine Features
X_combined = hstack([X_tfidf, scaled_features_sparse])
print("Combined features shape:", X_combined.shape)

# Target variable
y = df['is_sarcastic'].values

# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X_combined, y, random_state=42, stratify=y)

# Convert sparse matrix to dense arrays for Keras
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()

### Neural Network Model
input_dim = X_train_dense.shape[1]  # Number of features

# Input Layer
input_layer = Input(shape=(input_dim,))

# # Hidden Layers
# x = Dense(64, activation='relu')(input_layer)
# x = BatchNormalization()(x)
# x = Dropout(0.5)(x)

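# Output Layer
# With the hidden layers commented out, the network is a single sigmoid unit
# on the raw inputs, i.e. equivalent in form to logistic regression.
output_layer = Dense(1, activation='sigmoid')(input_layer)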

# Define Model
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy', AUC(name="AUC")])

# Summary
model.summary()

### Train the Model
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
lr_reducer = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, min_lr=1e-6)

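# Note: the test split doubles as validation data here, so early stopping and
# learning-rate reduction are tuned on the rows used for the final evaluation.
history = model.fit(X_train_dense, Y_train,
                    validation_data=(X_test_dense, Y_test),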
                    epochs=50,
                    batch_size=32,
                    callbacks=[early_stop, lr_reducer])

### Evaluate Model
loss, accuracy, auc_score = model.evaluate(X_test_dense, Y_test)
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test AUC: {auc_score:.4f}")


In [None]:
y_pred = model.predict(X_test_dense)
y_pred_binary = (y_pred > 0.5).astype(int)  # Convert probabilities to binary labels (0 or 1)
print(classification_report(Y_test, y_pred_binary))  # Use binary predictions
print(confusion_matrix(Y_test, y_pred_binary))
print(f1_score(Y_test, y_pred_binary, average='macro'))
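
The training cell above passes the test split as `validation_data`, so `EarlyStopping` and `ReduceLROnPlateau` react to the same rows that produce the final report. A minimal sketch, assuming a separate validation split carved out of the training data with a second `train_test_split` call; the random features, array shapes, and the 20% validation fraction are illustrative assumptions, not values from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))   # hypothetical dense features
y_demo = np.array([0, 1] * 100)      # hypothetical balanced labels

# First split: hold out the test set (same call pattern as the notebook).
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
# Second split: carve a validation set out of the remaining training rows.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42, stratify=y_trainval
)

print(X_tr.shape[0], X_val.shape[0], X_te.shape[0])
# With Keras, (X_val, y_val) would be passed as validation_data, and
# (X_te, y_te) would be used only for the final evaluation.
```

Under this arrangement the test metrics are no longer influenced by the stopping point, at the cost of training on somewhat fewer rows.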