<a href="https://www.kaggle.com/code/aabdollahii/news-headlines-sarcasm-detection?scriptVersionId=264550387" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="background-color:#121212; color:#f5f5f5; font-family:Arial, sans-serif; padding:20px; line-height:1.6; border-radius:10px;">

 <h1 style="color:#00bfff; text-align:center;">Sarcasm Detection in News Headlines</h1>

 <h2 style="color:#ffd700;">📌 Introduction</h2>
    <p>
        Sarcasm is a form of expression where the intended meaning is often the opposite of the literal meaning. Detecting sarcasm in text is a challenging task in Natural Language Processing (NLP) because it often depends on subtle linguistic cues and contextual knowledge. 
        In this project, we aim to build a robust <strong>binary classification model</strong> that can predict whether a given news headline is sarcastic (<code>1</code>) or not (<code>0</code>).
    </p>

 <h2 style="color:#ffd700;">🎯 Motivation</h2>
    <p>
        Understanding sarcasm is essential for improving sentiment analysis, conversational AI systems, and human-computer interaction. 
        Misinterpreting sarcasm can lead to inaccurate sentiment scores or inappropriate responses in automated systems such as chatbots, virtual assistants, or recommendation models.
    </p>

<h2 style="color:#ffd700;">🛠 Project Steps</h2>
    <ol>
        <li><strong>Data Loading & Exploration</strong> – Import the dataset, inspect key features, and check distribution of classes.</li>
        <li><strong>Data Preprocessing</strong> – Tokenize text, remove unwanted characters, apply lemmatization, and handle stopwords.</li>
        <li><strong>Feature Representation</strong> – Use methods like TF-IDF, Word2Vec, or BERT embeddings to convert text into numerical vectors.</li>
        <li><strong>Model Selection & Training</strong> – Train classification models (Logistic Regression, SVM, Random Forest, LSTM, Transformers).</li>
        <li><strong>Evaluation</strong> – Use metrics such as Accuracy, Precision, Recall, and F1-score to assess performance.</li>
        <li><strong>Explainability</strong> – Apply techniques like SHAP or LIME to interpret model predictions.</li>
        <li><strong>Insights & Conclusion</strong> – Summarize findings, challenges, and possible improvements.</li>
    </ol>

<h2 style="color:#ffd700;">📂 Dataset Overview</h2>
    <p>
        The dataset contains <strong>28,619 headlines</strong> collected from news sources. Each headline is labeled with:
        <ul>
            <li><strong>headline</strong>: The news headline text.</li>
            <li><strong>is_sarcastic</strong>: Target variable (<code>1</code> for sarcastic, <code>0</code> for non-sarcastic).</li>
            <li><strong>article_link</strong>: Link to the full article (optional).</li>
        </ul>
        This dataset is relatively clean but may contain duplicates or minimal noise, which will be handled during preprocessing.
    </p>

<h2 style="color:#ffd700;">🚀 Expected Outcomes</h2>
    <p>
        By the end of this project, we expect to have:
        <ul>
            <li>An accurate sarcasm classifier for news headlines.</li>
            <li>Insights into linguistic patterns that contribute to sarcasm.</li>
            <li>Visualizations and explainability analyses to make the model more transparent.</li>
        </ul>
    </p>

 <p style="text-align:center; color:#00ff7f; font-size:14px; margin-top:30px;">
        <em>“The power of NLP lies not just in understanding words, but in understanding intent.”</em>
    </p>
</div>
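The evaluation metrics named in the project steps (Accuracy, Precision, Recall, F1-score) can be made concrete with a small hand-computed sketch. The helper `binary_metrics` and the toy label lists are hypothetical and only illustrate the definitions; in practice a library such as `sklearn.metrics` would normally be used on the real test set.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = sarcastic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sarcastic, predicted sarcastic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-sarcastic, predicted sarcastic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sarcastic, predicted non-sarcastic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-sarcastic, predicted non-sarcastic

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up for illustration, not taken from the dataset)
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Because the two classes are close to balanced in this dataset, accuracy is a reasonable headline number, but precision and recall still show which kind of error the model makes more often.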


<div style="background-color:#121212; color:#f5f5f5; font-family:Arial, sans-serif; padding:20px; line-height:1.6; border-radius:10px;">

 <h2 style="color:#00bfff; text-align:center;">Step 1: Data Loading & Initial Understanding</h2>

<p>
        In this stage, we import the dataset and perform a brief exploration to understand its structure, size, and basic characteristics.
        Since we are working with <strong>Sarcasm_Headlines_Dataset_v2.json</strong>, our goal is to ensure the data is read correctly into a format suitable for analysis.
    </p>

<h3 style="color:#ffd700;">📌 Actions in this Step:</h3>
    <ul>
        <li>Read the JSON file into a Pandas DataFrame.</li>
        <li>Inspect the column names and understand what each one represents.</li>
        <li>Check the total number of records (rows) and attributes (columns).</li>
        <li>Verify the presence or absence of missing values.</li>
        <li>Look at a few sample rows to confirm the structure and quality of the data.</li>
        <li>Examine the distribution of the target variable (<code>is_sarcastic</code>).</li>
    </ul>

 <h3 style="color:#ffd700;">🎯 Purpose of This Step:</h3>
    <p>
        By gaining an initial understanding of the dataset, we can make informed decisions for preprocessing, feature representation, and model selection in upcoming steps.
        This stage lays the foundation for the data cleaning and transformation processes that follow.
    </p>

 <p style="color:#00ff7f; text-align:center; font-size:14px; margin-top:20px;">
        <em>Next, we will move on to data preprocessing and text normalization.</em>
    </p>
</div>


In [1]:
import pandas as pd

# Path to dataset
path = "/kaggle/input/news-headlines-dataset-for-sarcasm-detection/Sarcasm_Headlines_Dataset_v2.json"

# Load dataset
df = pd.read_json(path, lines=True)

# Basic info
shape = df.shape
columns = df.columns.tolist()
missing = df.isnull().sum()
class_counts = df['is_sarcastic'].value_counts()
sarcastic_ratio = (class_counts[1] / len(df)) * 100

# Display some sample rows
sample_data = df.head(5)

shape, columns, missing, class_counts, sarcastic_ratio, sample_data


((28619, 3),
 ['is_sarcastic', 'headline', 'article_link'],
 is_sarcastic    0
 headline        0
 article_link    0
 dtype: int64,
 is_sarcastic
 0    14985
 1    13634
 Name: count, dtype: int64,
 47.6396799329117,
    is_sarcastic                                           headline  \
 0             1  thirtysomething scientists unveil doomsday clo...   
 1             0  dem rep. totally nails why congress is falling...   
 2             0  eat your veggies: 9 deliciously different recipes   
 3             1  inclement weather prevents liar from getting t...   
 4             1  mother comes pretty close to using word 'strea...   
 
                                         article_link  
 0  https://www.theonion.com/thirtysomething-scien...  
 1  https://www.huffingtonpost.com/entry/donna-edw...  
 2  https://www.huffingtonpost.com/entry/eat-your-...  
 3  https://local.theonion.com/inclement-weather-p...  
 4  https://www.theonion.com/mother-comes-pretty-c...  )

<div style="background-color:#121212; color:#f5f5f5; font-family:Arial, sans-serif; padding:20px; border-radius:10px; line-height:1.6;">

 <h2 style="color:#00bfff; text-align:center;">Initial Dataset Insights</h2>

<p>
        After loading <strong>Sarcasm_Headlines_Dataset_v2.json</strong>, we confirmed that the dataset contains 
        <strong>28,619 rows</strong> and <strong>3 columns</strong>. These columns are:
    </p>
    <ul>
        <li><strong>is_sarcastic</strong> – Binary target variable (<code>1</code> for sarcastic headlines, <code>0</code> for non‑sarcastic).</li>
        <li><strong>headline</strong> – Short news headline text.</li>
        <li><strong>article_link</strong> – URL to the original news article.</li>
    </ul>

 <h3 style="color:#ffd700;">📊 Missing Values Check</h3>
    <p>
        No missing values were found in any column, so no imputation or row removal is needed before preprocessing.
    </p>

 <h3 style="color:#ffd700;">⚖ Class Distribution</h3>
    <p>
        The dataset is slightly imbalanced:
        <ul>
            <li><strong>Non‑sarcastic (0):</strong> 14,985 entries</li>
            <li><strong>Sarcastic (1):</strong> 13,634 entries</li>
        </ul>
        Sarcastic headlines make up approximately <strong>47.64%</strong> of the total records.
    </p>

 <h3 style="color:#ffd700;">🔍 Sample Headlines</h3>
    <p>Here are some examples from the dataset:</p>
    <table style="border-collapse:collapse; width:100%; border:1px solid #444;">
        <tr style="background-color:#1e1e1e;">
            <th style="border:1px solid #444; padding:8px;">is_sarcastic</th>
            <th style="border:1px solid #444; padding:8px;">headline</th>
            <th style="border:1px solid #444; padding:8px;">article_link</th>
        </tr>
        <tr>
            <td style="border:1px solid #444; padding:8px;">1</td>
            <td style="border:1px solid #444; padding:8px;">thirtysomething scientists unveil doomsday clock</td>
            <td style="border:1px solid #444; padding:8px;">https://www.theonion.com/thirtysomething-scien...</td>
        </tr>
        <tr>
            <td style="border:1px solid #444; padding:8px;">0</td>
            <td style="border:1px solid #444; padding:8px;">dem rep. totally nails why congress is falling apart</td>
            <td style="border:1px solid #444; padding:8px;">https://www.huffingtonpost.com/entry/donna-edw...</td>
        </tr>
        <tr>
            <td style="border:1px solid #444; padding:8px;">0</td>
            <td style="border:1px solid #444; padding:8px;">eat your veggies: 9 deliciously different recipes</td>
            <td style="border:1px solid #444; padding:8px;">https://www.huffingtonpost.com/entry/eat-your-...</td>
        </tr>
    </table>

<p style="margin-top:20px; font-size:14px; color:#00ff7f; text-align:center;">
        The dataset is clean and balanced enough for effective model training. 
        Next, we will proceed to <strong>text preprocessing</strong> and feature representation.
    </p>
</div>
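The missing-value check does not cover duplicates, which the dataset overview mentions as a possibility. A minimal sketch of how duplicate headlines could be counted and dropped is shown here on a hypothetical toy frame (`toy`); the same two calls could be applied to `df` with `subset="headline"`, assuming exact-match duplicates are the concern.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "headline": ["a b", "c d", "a b", "e f"],
    "is_sarcastic": [1, 0, 1, 0],
})

# Count rows whose headline already appeared earlier in the frame
n_dupes = int(toy.duplicated(subset="headline").sum())

# Keep the first occurrence of each headline and renumber the index
deduped = toy.drop_duplicates(subset="headline", keep="first").reset_index(drop=True)

print(f"Duplicate headlines: {n_dupes}")
print(f"Rows after deduplication: {len(deduped)}")
```

Dropping duplicates before the train-test split helps avoid the same headline appearing in both sets.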


<div style="background-color:#121212; color:#e0e0e0; padding:20px; font-family:Segoe UI, sans-serif; line-height:1.6; border-radius:8px;">

<h2 style="color:#ffcc00;">📄 Text Preprocessing Plan for Sarcasm Detection</h2>

<p>To prepare the news headlines dataset for deep learning, we will process the text data so it can be converted into numerical form for the model. Our steps are:</p>

<ol style="margin-left:20px;">
  <li><strong style="color:#ff9966;">Lowercasing</strong> – Convert all headlines to lowercase to ensure uniformity and reduce vocabulary size.</li>
  <li><strong style="color:#66ccff;">Punctuation Removal</strong> – Eliminate commas, periods, quotes, and other punctuation marks that can add noise.</li>
  <li><strong style="color:#99ff99;">Stopword Removal (Optional)</strong> – Remove common words like "the", "is", "and", which often carry little semantic weight in classification. This step is optional and is not applied in the code cell that follows.</li>
  <li><strong style="color:#ff6699;">Tokenization</strong> – Use Keras' <code>Tokenizer</code> to map each word to an integer index.</li>
  <li><strong style="color:#cc99ff;">Padding</strong> – Make all sequences the same length to match the input requirements of neural networks.</li>
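  <li><strong style="color:#ffcc66;">Train-Test Split</strong> – Divide the dataset (80% training, 20% testing) to evaluate our model’s generalization ability. In the code, the split is performed before tokenization so that the tokenizer vocabulary is built from training headlines only.</li>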
</ol>

<p>These preprocessing steps will allow us to feed clean, consistent, numerical representations of headlines into an LSTM/CNN-based sarcasm detection model.</p>

</div>
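The optional stopword-removal step in the plan can be sketched as follows. The stopword set `TOY_STOPWORDS` and the helper `remove_stopwords` are hypothetical and kept deliberately tiny; a real run would more likely use a full list from a library such as NLTK or spaCy.

```python
# A tiny illustrative stopword set (an assumption, not a standard list)
TOY_STOPWORDS = {"the", "is", "and", "a", "to", "of"}


def remove_stopwords(text, stopwords=TOY_STOPWORDS):
    """Drop whitespace-separated tokens that appear in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)


example = "the weather is bad and the news is worse"
filtered = remove_stopwords(example)
print(filtered)
```

Whether this helps is an empirical question: function words may carry part of the sarcastic signal, so it is worth comparing model performance with and without the step.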


In [2]:
import re
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



# Function for text cleaning
def clean_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # remove extra spaces
    return text

# Apply cleaning
df['clean_headline'] = df['headline'].apply(clean_text)

# Prepare data and labels
X = df['clean_headline'].values
y = df['is_sarcastic'].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenization
vocab_size = 10000  
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
train_sequences = tokenizer.texts_to_sequences(X_train)
test_sequences = tokenizer.texts_to_sequences(X_test)

# Padding
max_length = max(len(seq) for seq in train_sequences)  
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding='post', truncating='post')
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

print(f"Training shape: {train_padded.shape}")
print(f"Testing shape: {test_padded.shape}")
print(f"Max sequence length: {max_length}")




Training shape: (22895, 39)
Testing shape: (5724, 39)
Max sequence length: 39
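The split above is purely random. Since the classes are slightly imbalanced, one option is a stratified split, which keeps the class ratio the same in both subsets. The sketch below uses hypothetical toy arrays (`toy_X`, `toy_y`) and the `stratify` parameter of `train_test_split`; it is an alternative to consider, not what the notebook cell does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder headlines, 40 sarcastic and 60 non-sarcastic
toy_X = np.array([f"headline {i}" for i in range(100)])
toy_y = np.array([1] * 40 + [0] * 60)

# stratify=toy_y preserves the 40/60 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(f"Train sarcastic share: {y_tr.mean():.2f}")
print(f"Test sarcastic share:  {y_te.mean():.2f}")
```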


<div style="background-color:#121212; color:#e0e0e0; padding:20px; font-family:Segoe UI, sans-serif; border-radius:8px; line-height:1.6;">

<h2 style="color:#ffcc00;">✅ Text Preprocessing Completed</h2>

<h3 style="color:#66ccff;">What We Did</h3>
<ul style="margin-left:20px;">
  <li>🔹 Converted all headlines to <strong>lowercase</strong> for consistency.</li>
  <li>🔹 Removed <strong>punctuation marks</strong> and extra spaces to reduce noise.</li>
  <li>🔹 Used <strong>Keras Tokenizer</strong> to convert words into integer indices (<em>numerical representation</em>).</li>
  <li>🔹 Applied <strong>padding</strong> so all sequences have the same length.</li>
  <li>🔹 Split dataset into <strong>training</strong> (80%) and <strong>testing</strong> (20%) sets.</li>
</ul>

<h3 style="color:#99ff99;">Results</h3>
<ul style="margin-left:20px;">
  <li>📏 The longest headline in the <strong>training set</strong> has <strong>39 tokens</strong>, which is used as the <strong>maximum sequence length</strong>.</li>
  <li>📊 Training set contains <strong>22,895 headlines</strong> padded to length 39.</li>
  <li>📊 Testing set contains <strong>5,724 headlines</strong> padded to length 39.</li>
  <li>🗂️ All the text is now stored in <em>integer‑encoded, fixed‑size arrays</em> ready for neural network input.</li>
</ul>

<p style="margin-top:10px;">From this point onward, the model will process sequences of integers representing words, instead of raw text. This ensures consistent input dimensions and allows embedding layers or pre-trained word vectors to be applied effectively.</p>

</div>
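To make the padding step concrete, here is a small NumPy sketch of what post-padding and post-truncation do to variable-length integer sequences. The helper `pad_post` and the toy sequences are hypothetical; the sketch mirrors the behaviour of `pad_sequences(..., padding='post', truncating='post')` for illustration and is not a replacement for it.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end and cut off tokens beyond `maxlen`."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]  # post-truncation keeps the first maxlen tokens
        padded[i, :len(trimmed)] = trimmed  # remaining positions stay at `value`
    return padded


# Toy integer-encoded headlines of different lengths
toy_sequences = [[5, 2, 9], [7], [4, 4, 8, 1, 3]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```

The padding value `0` is reserved by the Keras `Tokenizer`, whose word indices start at 1, so padded positions cannot be confused with real words.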
