<a href="https://www.kaggle.com/code/aki887/deep-sarcasm-analysis?scriptVersionId=213096470" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from textblob import TextBlob

pandas (pd): A powerful data manipulation and analysis library for Python. It provides data structures like DataFrame and Series for handling tabular data.

numpy (np): A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and many mathematical functions.

re: The regular expression library in Python. It provides functions to search, match, and manipulate strings using regular expressions.

sklearn.model_selection.train_test_split: A function from Scikit-learn (a machine learning library) used to split arrays or matrices into random train and test subsets.

sklearn.feature_extraction.text.TfidfVectorizer: A class from Scikit-learn used to convert a collection of raw documents to a matrix of TF-IDF features.

sklearn.ensemble.RandomForestClassifier: A class from Scikit-learn used to implement a random forest classifier, an ensemble learning method for classification.

sklearn.linear_model.LogisticRegression: A class from Scikit-learn used to implement logistic regression, a statistical method for binary classification.

sklearn.metrics.classification_report: A function from Scikit-learn used to generate a text report showing the main classification metrics.

textblob.TextBlob: A class from TextBlob, a library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks, such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In [2]:
# Preprocess the data
def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # Replace non-word characters with spaces
    text = text.lower()  # Convert text to lowercase
    return text.strip()  # Remove leading and trailing whitespace


# Compute the sentiment polarity with TextBlob
def get_sentiment_polarity(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity


Explanation of the code, point by point:
preprocess_text function:
1. Removes non-word characters:

Uses re.sub(r'\W', ' ', text) to replace every non-word character (e.g., symbols, punctuation) with a space.
Because \W matches anything other than letters, digits, and underscores, only those characters and spaces remain in the text.

2. Converts text to lowercase:

Uses text.lower() to standardize the text by converting all characters to lowercase.
Helps ensure case insensitivity during text analysis.

3. Strips leading and trailing spaces:

Uses text.strip() to remove unnecessary spaces at the beginning and end of the text, ensuring a cleaner input.
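
The three steps above can be checked on a made-up string (the sample text is an assumption, not taken from the datasets). The sketch also shows two details that are easy to miss: `\W` leaves underscores in place, and runs of spaces are not collapsed. The final `re.sub(r'\s+', ...)` step is an optional extra that the notebook does not perform.

```python
import re

def preprocess_text(text):
    # Same three steps as the notebook's function
    text = re.sub(r'\W', ' ', text)
    text = text.lower()
    return text.strip()

sample = "  Wow, GREAT job_2024!!  "
cleaned = preprocess_text(sample)
print(repr(cleaned))

# Optional extra step (not in the notebook): collapse repeated whitespace
collapsed = re.sub(r'\s+', ' ', cleaned)
print(repr(collapsed))
```

Since TfidfVectorizer tokenizes on word boundaries, the doubled spaces should not affect the features built later; collapsing them mainly helps when the cleaned text is inspected or saved.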

get_sentiment_polarity function:

1. Analyzes the sentiment of the text:

Leverages the TextBlob library to perform a sentiment analysis on the input text. TextBlob uses lexicon-based methods to determine the sentiment.

2. Calculates sentiment polarity:

Extracts the sentiment.polarity attribute, a float in the range [-1, 1]:
Negative polarity (below 0): Indicates negative sentiment.
Zero polarity (0): Indicates neutral sentiment.
Positive polarity (above 0): Indicates positive sentiment.

3. Returns the polarity score:

Provides a numerical representation of the sentiment to allow further categorization or analysis.
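
One way the "further categorization" could look is sketched below. The helper `polarity_to_label` and its `threshold` parameter are hypothetical and not part of the notebook or of TextBlob; the function only assumes it receives a polarity score in [-1, 1], such as the one returned by get_sentiment_polarity.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    `threshold` is a hypothetical dead zone around zero: scores whose
    absolute value does not exceed it are treated as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

# Invented scores, only to exercise each branch
for score in (-0.6, 0.0, 0.05, 0.8):
    print(score, polarity_to_label(score, threshold=0.1))
```

A small non-zero threshold keeps weakly scored texts from being pushed into the positive or negative class; the value 0.1 is an arbitrary choice for illustration.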

In [3]:
# Load dataset 1 (for sentiment analysis)
dataset1_path = '/kaggle/input/dataset/Dataset1.csv'
dataset1 = pd.read_csv(dataset1_path)
dataset1['text'] = dataset1['text'].apply(preprocess_text)

# Load dataset 2 (for sarcasm detection)
dataset2_path = '/kaggle/input/dataset/Dataset2.json'
dataset2 = pd.read_json(dataset2_path)
dataset2['headline'] = dataset2['headline'].apply(preprocess_text)


Explanation of the code, point by point:

Loading Dataset 1 (for Sentiment Analysis):

1. Defines the path to the CSV dataset:

The path /kaggle/input/dataset/Dataset1.csv specifies the location of the first dataset, which is assumed to be in CSV format.

2. Loads the CSV file into a DataFrame:

Uses pd.read_csv(dataset1_path) to load the dataset into a Pandas DataFrame for structured data manipulation.

3. Applies text preprocessing to the text column:

Uses dataset1['text'].apply(preprocess_text) to preprocess the text column in the DataFrame.
Removes non-word characters, converts text to lowercase, and strips extra spaces.
Prepares the text data for sentiment analysis by ensuring consistency and cleanliness.

Loading Dataset 2 (for Sarcasm Detection):

1. Defines the path to the JSON dataset:

The path /kaggle/input/dataset/Dataset2.json specifies the location of the second dataset, which is assumed to be in JSON format.

2. Loads the JSON file into a DataFrame:

Uses pd.read_json(dataset2_path) to load the dataset into a Pandas DataFrame, converting JSON structured data into tabular form.

3. Applies text preprocessing to the headline column:

Uses dataset2['headline'].apply(preprocess_text) to preprocess the headline column.
Cleans the text data for sarcasm detection by standardizing and removing irrelevant characters.

Outcome of These Steps:

Dataset 1 is preprocessed and ready for sentiment analysis tasks.
Dataset 2 is preprocessed and prepared for sarcasm detection tasks.
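
A minimal sketch of the load-then-preprocess step, using an in-memory CSV as a stand-in for Dataset1.csv (the rows are invented). It assumes the real file may contain empty text cells; pandas reads those as NaN, which is a float and would make re.sub raise a TypeError, so the sketch fills them before applying the function.

```python
import io
import re
import pandas as pd

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)
    return text.lower().strip()

# Stand-in for Dataset1.csv; the rows are invented for illustration
csv_data = io.StringIO(
    "text,sentiment\n"
    "Great day!,positive\n"
    ",neutral\n"
    "So BAD...,negative\n"
)
df = pd.read_csv(csv_data)

# An empty cell is read as NaN (a float), which re.sub cannot handle,
# so replace it with an empty string before applying the function
df['text'] = df['text'].fillna('').astype(str).apply(preprocess_text)
print(df)
```

Whether empty rows should be kept as empty strings or dropped with dropna is a design choice that depends on the dataset; the sketch keeps them so the row count stays unchanged.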



In [4]:
# Create a TF-IDF Vectorizer object with a max of 5000 features and English stop words
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit the vectorizer on the 'headline' column of dataset2 and transform it into a dense TF-IDF matrix
X_sarcasm = vectorizer.fit_transform(dataset2['headline']).toarray()

# Extract the target variable 'is_sarcastic'
y_sarcasm = dataset2['is_sarcastic']

# Split the data into training and testing sets with a test size of 20% and a random state of 42
X_sarcasm_train, X_sarcasm_test, y_sarcasm_train, y_sarcasm_test = train_test_split(
    X_sarcasm, y_sarcasm, test_size=0.2, random_state=42)

# Create a Random Forest Classifier object with 100 trees and a random state of 42
sarcasm_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the Random Forest model to the training data
sarcasm_model.fit(X_sarcasm_train, y_sarcasm_train)

1. Create a TF-IDF Vectorizer:

Converts text data into a numerical matrix with a maximum of 5000 features.
Removes common English stop words (e.g., "the", "and") to focus on important terms.

2. Transform the text data:

Converts the headline column from dataset2 into a TF-IDF matrix (X_sarcasm), making it suitable for machine learning models.

3. Extract the target variable:

y_sarcasm holds the sarcasm labels (e.g., 1 for sarcastic, 0 for non-sarcastic).

4. Split the data:

Splits the features (X_sarcasm) and target (y_sarcasm) into training (80%) and testing (20%) sets for model training and evaluation.

5. Create the Random Forest Classifier:

Initializes a Random Forest model with 100 decision trees for sarcasm classification.

6. Train the model:

Fits the Random Forest model on the training data (X_sarcasm_train and y_sarcasm_train).
The model learns to classify sarcasm based on the patterns in the data.
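
The notebook imports classification_report and creates a test split but never scores the model on it. The sketch below shows what that evaluation could look like on a handful of invented headlines; the data, the variable names, and the `stratify` and `zero_division` arguments are assumptions added for the toy example. It also keeps the TF-IDF matrix sparse, which RandomForestClassifier accepts, instead of calling .toarray().

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Invented toy headlines; 1 = sarcastic, 0 = not sarcastic
headlines = [
    "area man wins argument with himself",
    "local cat declares itself mayor",
    "nation shocked that monday follows sunday",
    "scientists confirm water still wet",
    "city council approves new budget",
    "stocks close higher after earnings report",
    "school board announces new calendar",
    "storm expected to reach coast friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vec = TfidfVectorizer(max_features=5000, stop_words='english')
X = vec.fit_transform(headlines)  # kept sparse; no .toarray() needed here

# stratify keeps both classes present in the tiny test set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Precision, recall, and F1 per class on the held-out rows
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

With eight rows the scores themselves are meaningless; the point is only the shape of the evaluation step. In the notebook, the same call could be applied to sarcasm_model with X_sarcasm_test and y_sarcasm_test.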

In [5]:
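# Reuse the vectorizer fitted on the sarcasm headlines to transform the sentiment text
X_sentiment = vectorizer.transform(dataset1['text']).toarray()

# Map the sentiment labels to integers
y_sentiment = dataset1['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2})

# Split the data into training and testing sets with a test size of 20% and a random state of 42
X_sentiment_train, X_sentiment_test, y_sentiment_train, y_sentiment_test = train_test_split(
    X_sentiment, y_sentiment, test_size=0.2, random_state=42)

# Create a Logistic Regression model and fit it to the training data
sentiment_model = LogisticRegression()
sentiment_model.fit(X_sentiment_train, y_sentiment_train)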


1. Transform sentiment text data:

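Converts the text column in dataset1 into a numerical matrix (X_sentiment) using the same TF-IDF vectorizer, which was fitted on the dataset2 headlines.
Because the vectorizer is only transformed here, not refitted, words in dataset1 that are missing from the headline vocabulary are ignored.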

2. Map sentiment labels to numbers:

Converts the sentiment labels:
'negative' → 0
'neutral' → 1
'positive' → 2
Stores these numeric values in y_sentiment.

3. Split the data:

Divides the features (X_sentiment) and labels (y_sentiment) into training (80%) and testing (20%) sets for training and evaluation.

4. Create the Logistic Regression model:

Initializes a Logistic Regression model to classify text into sentiment categories.

5. Train the model:

Fits the model on the training data (X_sentiment_train and y_sentiment_train) to learn how to predict sentiment.
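
A consequence of reusing the vectorizer is worth seeing directly. The sketch below, built on invented texts, shows that transform() silently drops any word that was not seen during fit(), so a sentence made entirely of unseen words becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Vocabulary is learned from these invented "headline-like" texts only
vec = TfidfVectorizer(stop_words='english')
vec.fit(["council approves budget", "storm reaches coast"])

# transform() keeps only words that were seen during fit()
seen = vec.transform(["budget storm"])
unseen = vec.transform(["terrible customer service"])

print(seen.nnz)    # number of non-zero features in the first text
print(unseen.nnz)  # number of non-zero features in the second text
```

If many rows in the sentiment data end up empty in this way, fitting a separate vectorizer per task, or fitting one vectorizer on both corpora, may be worth considering; this is a suggestion, not something the notebook does.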

In [6]:
def predict_sentiment(text):
    preprocessed_text = preprocess_text(text)
    vectorized_text = vectorizer.transform([preprocessed_text]).toarray()

    # Predict sarcasm
    is_sarcastic = sarcasm_model.predict(vectorized_text)[0]

    # Predict sentiment
    predicted_sentiment = sentiment_model.predict(vectorized_text)[0]

    # Adjust sentiment if sarcasm is detected
    if is_sarcastic:
        if predicted_sentiment == 0:
            predicted_sentiment = 2  # Negative to Positive
        elif predicted_sentiment == 2:
            predicted_sentiment = 0  # Positive to Negative

    # Map sentiment to labels
    sentiment_labels = {0: 'negative', 1: 'neutral', 2: 'positive'}
    final_sentiment = sentiment_labels[predicted_sentiment]

    return final_sentiment
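
The adjustment rule inside predict_sentiment can be isolated and checked without any trained model. The helper `adjust_for_sarcasm` and the constant names below are hypothetical; they only mirror the rule in the function above: when sarcasm is detected, negative and positive swap, and neutral stays as it is.

```python
NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2
LABELS = {NEGATIVE: 'negative', NEUTRAL: 'neutral', POSITIVE: 'positive'}

def adjust_for_sarcasm(sentiment, is_sarcastic):
    """Hypothetical helper: flip negative/positive when sarcasm is detected."""
    if is_sarcastic and sentiment != NEUTRAL:
        return POSITIVE if sentiment == NEGATIVE else NEGATIVE
    return sentiment

# Walk through every combination of sentiment class and sarcasm flag
for sentiment in (NEGATIVE, NEUTRAL, POSITIVE):
    for is_sarcastic in (0, 1):
        adjusted = adjust_for_sarcasm(sentiment, is_sarcastic)
        print(LABELS[sentiment], is_sarcastic, '->', LABELS[adjusted])
```

Keeping the rule in a small function of its own makes it straightforward to test and to replace later, for example with a rule that uses the sarcasm model's predicted probability instead of a hard 0/1 decision.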

In [7]:
# Example sentence
example_sentence = "So happy with the service, it's been 2 days of power cut and this amazing company is not even replying to my complaint."

# Predict sentiment
predicted_sentiment = predict_sentiment(example_sentence)

# Print results
print(f"Sentence: {example_sentence}")
print(f"Predicted Sentiment: {predicted_sentiment}")

Sentence: So happy with the service, it's been 2 days of power cut and this amazing company is not even replying to my complaint.
Predicted Sentiment: negative


In [8]:
# preprocess_text is reused from cell [2]; it does not need to be redefined here

# Load dataset 3
dataset3_path = '/kaggle/input/new-cleaned/New_cleaned.xlsx'
dataset3 = pd.read_excel(dataset3_path)

# Preprocess text in dataset 3
dataset3['text'] = dataset3['text'].astype(str).apply(preprocess_text)

# Verify the updated dataset by printing the first few rows
print(dataset3.head())


                                                text
0  1   when people ask me stupid questions  it is...
1  2   i m not saying i hate you  what i m saying...
2         3   silence is golden  duct tape is silver
3  4   i am busy right now  can i ignore you some...
4          5   find your patience before i lose mine


In [9]:
# Predict the sentiment of each text in dataset 3 and store the result in a new column
dataset3['predicted_sentiment'] = dataset3['text'].apply(predict_sentiment)

In [10]:
# Compare predicted sentiments with actual sentiments if available in dataset 3
if 'actual_sentiment' in dataset3.columns:
    accuracy = (dataset3['predicted_sentiment'] == dataset3['actual_sentiment']).mean()
    print(f"Accuracy on dataset 3: {accuracy:.2f}")

# Set a default writable output path in Kaggle environment
default_output_path = '/kaggle/working/New_cleaned1.xlsx'

# Save the results to the default output path
dataset3.to_excel(default_output_path, index=False)
print(f"Results saved to {default_output_path}")


Results saved to /kaggle/working/New_cleaned1.xlsx
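
The accuracy check in this cell only runs when an actual_sentiment column exists, and it compares label strings exactly. The sketch below uses invented rows to show the same comparison after normalizing the label text, plus a cross-tabulation of actual against predicted labels; the normalization step and the crosstab are additions for illustration and are not in the notebook.

```python
import pandas as pd

# Invented rows; 'actual_sentiment' is the optional column the notebook checks for
df = pd.DataFrame({
    'predicted_sentiment': ['negative', 'positive', 'neutral', 'negative'],
    'actual_sentiment':    ['negative', 'negative', 'neutral', 'Negative '],
})

# Normalize label text so 'Negative ' and 'negative' compare equal
actual = df['actual_sentiment'].str.strip().str.lower()

# Fraction of rows where the prediction matches the normalized label
accuracy = (df['predicted_sentiment'] == actual).mean()
print(f"Accuracy: {accuracy:.2f}")

# Rows: actual labels, columns: predicted labels
confusion = pd.crosstab(actual, df['predicted_sentiment'])
print(confusion)
```

Without the normalization step, the last row would count as a mismatch even though the labels agree, which would understate the accuracy.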
