**Libraries to Import**

Sarcasm detection is a natural language processing and binary classification task. We can train a machine learning model to detect whether or not a sentence is sarcastic using a dataset of sarcastic and non-sarcastic sentences that I found on Kaggle.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

**Dataset Preparation**

In [4]:
data = pd.read_json("/kaggle/input/data-sarcasam/sarcasam.json", lines=True)
print(data.head())

                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...   
1  https://www.huffingtonpost.com/entry/roseanne-...   
2  https://local.theonion.com/mom-starting-to-fea...   
3  https://politics.theonion.com/boehner-just-wan...   
4  https://www.huffingtonpost.com/entry/jk-rowlin...   

                                            headline  is_sarcastic  
0  former versace store clerk sues over secret 'b...             0  
1  the 'roseanne' revival catches up to our thorn...             0  
2  mom starting to fear son's web series closest ...             1  
3  boehner just wants wife to listen, not come up...             1  
4  j.k. rowling wishes snape happy birthday in th...             0  



The “is_sarcastic” column in this dataset contains the labels that we have to predict for the task of sarcasm detection. It holds binary values, where 1 means sarcastic and 0 means not sarcastic. To make the output easier to read, I will map these values to “Sarcasm” and “Not Sarcasm” instead of 1 and 0:

In [5]:
data["is_sarcastic"] = data["is_sarcastic"].map({0: "Not Sarcasm", 1: "Sarcasm"})
print(data.head())

                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...   
1  https://www.huffingtonpost.com/entry/roseanne-...   
2  https://local.theonion.com/mom-starting-to-fea...   
3  https://politics.theonion.com/boehner-just-wan...   
4  https://www.huffingtonpost.com/entry/jk-rowlin...   

                                            headline is_sarcastic  
0  former versace store clerk sues over secret 'b...  Not Sarcasm  
1  the 'roseanne' revival catches up to our thorn...  Not Sarcasm  
2  mom starting to fear son's web series closest ...      Sarcasm  
3  boehner just wants wife to listen, not come up...      Sarcasm  
4  j.k. rowling wishes snape happy birthday in th...  Not Sarcasm  


In [6]:
data.shape

(26709, 3)

Now let’s prepare the data for training a machine learning model. This dataset has three columns, of which we only need the “headline” column as the feature and the “is_sarcastic” column as the label. So let’s select these two columns, convert the headlines into word-count vectors, and split the data into an 80% training set and a 20% test set:

In [7]:
data = data[["headline", "is_sarcastic"]]  # keep only these two columns; article_link is dropped
x = np.array(data["headline"])
y = np.array(data["is_sarcastic"])

In [8]:
cv = CountVectorizer()
X = cv.fit_transform(x)  # learn the vocabulary and turn each headline into a word-count vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
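One thing to keep in mind: the cell above fits the vectorizer on all headlines before splitting, so the vocabulary also includes words from the test set. A possible alternative, shown here only as a minimal sketch, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training part only. The toy headlines and all `toy_`/`text_`/`label_` names are made up for illustration and are not part of the Kaggle dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy headlines standing in for data["headline"] and data["is_sarcastic"]
# (invented for this sketch, not taken from the dataset).
toy_headlines = [
    "area man wins argument with himself",
    "local cat declares itself mayor",
    "nation's dogs demand longer walks",
    "man heroically finishes entire pizza alone",
    "senate passes annual budget bill",
    "city council approves new bus routes",
    "scientists publish study on sleep habits",
    "company reports quarterly earnings growth",
]
toy_labels = ["Sarcasm"] * 4 + ["Not Sarcasm"] * 4

# Split the raw text first, so the test headlines stay unseen.
text_train, text_test, label_train, label_test = train_test_split(
    toy_headlines, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits CountVectorizer on the training text only,
# then trains BernoulliNB on the resulting vectors.
pipe = make_pipeline(CountVectorizer(), BernoulliNB())
pipe.fit(text_train, label_train)

# Raw strings go straight in; the pipeline vectorizes them internally.
toy_predictions = pipe.predict(text_test)
print(toy_predictions)

# The learned vocabulary contains training words only.
train_vocabulary = set(pipe.named_steps["countvectorizer"].vocabulary_)
print(len(train_vocabulary))
```

With the real data, the same idea would mean passing `x` and `y` to `train_test_split` and fitting the pipeline on the text split, instead of vectorizing first.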

Now I am using the Bernoulli Naive Bayes algorithm to train the model:

In [9]:
model = BernoulliNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.8448146761512542
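`model.score` reports accuracy only, which does not show which class the model gets wrong. As a minimal sketch, a confusion matrix can break the errors down per class. The labels below are made up for illustration, so the counts say nothing about this dataset; on the real data you would pass `y_test` and `model.predict(X_test)` instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, invented for this sketch.
toy_true = ["Sarcasm", "Sarcasm", "Not Sarcasm", "Not Sarcasm", "Sarcasm", "Not Sarcasm"]
toy_pred = ["Sarcasm", "Not Sarcasm", "Not Sarcasm", "Not Sarcasm", "Sarcasm", "Sarcasm"]
class_names = ["Not Sarcasm", "Sarcasm"]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(toy_true, toy_pred, labels=class_names)
cm_table = pd.DataFrame(
    cm,
    index=[f"true: {name}" for name in class_names],
    columns=[f"pred: {name}" for name in class_names],
)
print(cm_table)

# Recall for "Sarcasm": correctly flagged sarcastic items / all sarcastic items.
sarcasm_recall = cm[1, 1] / cm[1].sum()
print(round(sarcasm_recall, 2))
```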



Now let’s use a sarcastic text as input to test whether our machine learning model detects sarcasm or not:

In [10]:
user = input("Enter a Text: ")
features = cv.transform([user]).toarray()  # use a new name so the `data` DataFrame is not overwritten
output = model.predict(features)
print(output)

Enter a Text:  Cows lose their jobs as milk prices drop


['Sarcasm']
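Since the same "vectorize, then predict" steps come up for every model in this notebook, they could be wrapped in a small helper. This is only a sketch: `predict_label` is a hypothetical helper name, and the tiny training set and `toy_` names are made up so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB


def predict_label(text, vectorizer, classifier):
    """Vectorize one string with an already fitted vectorizer and return its predicted label."""
    features = vectorizer.transform([text])  # sparse matrix; BernoulliNB accepts it directly
    return classifier.predict(features)[0]


# Toy data invented for this sketch (not from the Kaggle dataset).
toy_headlines = [
    "cows lose their jobs as milk prices drop",
    "area man wins argument with himself",
    "milk prices drop after record harvest",
    "city council approves new bus routes",
]
toy_labels = ["Sarcasm", "Sarcasm", "Not Sarcasm", "Not Sarcasm"]

toy_cv = CountVectorizer()
toy_model = BernoulliNB().fit(toy_cv.fit_transform(toy_headlines), toy_labels)

result = predict_label("Cows lose their jobs as milk prices drop", toy_cv, toy_model)
print(result)
```

In the notebook itself, the call would look like `predict_label(user, cv, model)`, always pairing each model with the vectorizer it was trained with.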


**Now We Try to Increase the Accuracy of the Model**

**1. Using TF-IDF Instead of Count Vectorizer**

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
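To see what changes when switching vectorizers, here is a small sketch on three made-up sentences (the `toy_docs` corpus is an assumption for illustration). `CountVectorizer` stores raw word counts, while `TfidfVectorizer` gives lower weight to words that appear in many documents and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus invented for this sketch: "the" appears in every document, "cat" in one.
toy_docs = ["the cat sat", "the dog sat", "the bird flew"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs).toarray()

# Raw counts: "the" and "cat" each occur once in the first document.
the_count = counts[0, count_vec.vocabulary_["the"]]
cat_count = counts[0, count_vec.vocabulary_["cat"]]
print(the_count, cat_count)

# TF-IDF: the common word "the" gets a smaller weight than the rarer word "cat".
the_weight = weights[0, tfidf_vec.vocabulary_["the"]]
cat_weight = weights[0, tfidf_vec.vocabulary_["cat"]]
print(the_weight < cat_weight)

# Each TF-IDF row is L2-normalized by default.
row_norms = np.linalg.norm(weights, axis=1)
print(row_norms)
```

Note that `BernoulliNB` thresholds its inputs at 0 by default (`binarize=0.0`), so it only sees whether a word is present. The TF-IDF weights therefore matter for the other models, not for Bernoulli Naive Bayes.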

**2. Trying Different Models**

    1. Logistic Regression
    2. Random Forest
    3. SVM with Linear Kernel
    

In [29]:
from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression(max_iter=1000)

In [30]:
model1.fit(X_train, y_train)
print(model1.score(X_test, y_test))

0.845376263571696


In [34]:
user = input("Enter a Text: ")
features = tfidf.transform([user]).toarray()  # must use the same TF-IDF vectorizer the model was trained with
output = model1.predict(features)
print(output)

Enter a Text:  Cows lose their jobs as milk prices drop


['Not Sarcasm']


In [35]:
from sklearn.ensemble import RandomForestClassifier
model2 = RandomForestClassifier()

In [36]:
model2.fit(X_train, y_train)
print(model2.score(X_test, y_test))

0.8163609135155373


In [37]:
user = input("Enter a Text: ")
features = tfidf.transform([user]).toarray()  # must use the same TF-IDF vectorizer the model was trained with
output = model2.predict(features)
print(output)

Enter a Text:  Cows lose their jobs as milk prices drop


['Not Sarcasm']


In [38]:
from sklearn.svm import LinearSVC
model3 = LinearSVC()

In [39]:
model3.fit(X_train, y_train)
print(model3.score(X_test, y_test))

0.840134780980906


In [40]:
user = input("Enter a Text: ")
features = tfidf.transform([user]).toarray()  # must use the same TF-IDF vectorizer the model was trained with
output = model3.predict(features)
print(output)


Enter a Text:   Cows lose their jobs as milk prices drop


['Sarcasm']
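The three models above are each scored on a single train/test split, and their accuracies are close to each other. A possible way to compare them more evenly, sketched here under the assumption of a tiny made-up dataset, is to loop over the candidates and cross-validate each one inside a pipeline. The `toy_` data, the `candidates` dictionary, and the reduced `n_estimators` are illustrative choices, not the notebook's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy headlines invented for this sketch (not from the Kaggle dataset).
toy_headlines = [
    "area man wins argument with himself",
    "local cat declares itself mayor",
    "nation's dogs demand longer walks",
    "man heroically finishes entire pizza alone",
    "senate passes annual budget bill",
    "city council approves new bus routes",
    "scientists publish study on sleep habits",
    "company reports quarterly earnings growth",
]
toy_labels = ["Sarcasm"] * 4 + ["Not Sarcasm"] * 4

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Linear SVM": LinearSVC(),
}

mean_scores = {}
for name, clf in candidates.items():
    # The vectorizer is refit inside every fold, so no fold sees its own test text.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(pipe, toy_headlines, toy_labels, cv=2, scoring="accuracy")
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

On eight toy sentences the numbers are meaningless; the point is only the loop structure, which would take `x` and `y` and a larger `cv` on the real data.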


**Now We Try Ensemble Learning**

In [43]:
# Import additional required libraries
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Enhanced Ensemble Model Setup
# Define individual models
model1 = LogisticRegression(max_iter=1000, random_state=42)
model2 = BernoulliNB()
model3 = SVC(kernel='linear', probability=True, random_state=42)

# Create voting classifier (soft voting for probability estimates)
voting_clf = VotingClassifier(
    estimators=[
        ('lr', model1),
        ('bnb', model2),
        ('svc', model3)
    ],
    voting='soft'  # Soft voting averages the predicted class probabilities of the models
)

# 2. Train and Evaluate the Ensemble Model
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)

# Print accuracy and detailed classification report
print("Ensemble Model Accuracy:", voting_clf.score(X_test, y_test))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


# 3. Cross-Validation for a More Reliable Evaluation
scores = cross_val_score(voting_clf, X, y, cv=5, scoring='accuracy')
print("\nCross-Validation Scores:", scores)
print("Mean CV Accuracy:", scores.mean())




Ensemble Model Accuracy: 0.8538000748783228

Classification Report:
              precision    recall  f1-score   support

 Not Sarcasm       0.85      0.89      0.87      2996
     Sarcasm       0.85      0.81      0.83      2346

    accuracy                           0.85      5342
   macro avg       0.85      0.85      0.85      5342
weighted avg       0.85      0.85      0.85      5342


Cross-Validation Scores: [0.8573568  0.85754399 0.8659678  0.84930738 0.84871747]
Mean CV Accuracy: 0.8557786865394474
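To make clearer what soft voting does, the sketch below fits a small `VotingClassifier` on made-up headlines and checks that its probabilities match the plain average of the individual models' probabilities. The toy data and the two-model ensemble are assumptions for illustration; the ensemble above uses three models on the real dataset.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Toy headlines invented for this sketch (not from the Kaggle dataset).
toy_headlines = [
    "area man wins argument with himself",
    "local cat declares itself mayor",
    "nation's dogs demand longer walks",
    "senate passes annual budget bill",
    "city council approves new bus routes",
    "scientists publish study on sleep habits",
]
toy_labels = ["Sarcasm"] * 3 + ["Not Sarcasm"] * 3

toy_features = TfidfVectorizer().fit_transform(toy_headlines)

toy_voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("bnb", BernoulliNB()),
    ],
    voting="soft",
)
toy_voter.fit(toy_features, toy_labels)

# estimators_ holds the fitted copies of the individual models.
individual_probas = [est.predict_proba(toy_features) for est in toy_voter.estimators_]
manual_average = np.mean(individual_probas, axis=0)

# With no weights given, soft voting is the unweighted mean of the class probabilities.
ensemble_proba = toy_voter.predict_proba(toy_features)
print(np.allclose(manual_average, ensemble_proba))

# The predicted label is the class with the highest averaged probability.
manual_labels = toy_voter.classes_[manual_average.argmax(axis=1)]
print(manual_labels)
```

This is also why `SVC` is created with `probability=True` in the ensemble cell: soft voting needs `predict_proba` from every model.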
