---
---
# **Sarcasm Detection 🎯**

> Sarcasm detection is a challenging task in natural language processing (NLP) due to the subtle and often context-dependent nature of sarcastic comments. By accurately identifying sarcasm, airlines can gain more insightful feedback from customer reviews, leading to better service improvements and customer satisfaction.



---
---

---
# Mounting your Google Drive in Google Colab

---

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


---
# Loading the CSV file into a pandas DataFrame

---

In [2]:
import pandas as pd

file_path = '/content/drive/My Drive/Airline_Reviews.csv'
df = pd.read_csv(file_path)
print(df.head())


   Unnamed: 0   Airline Name Overall_Rating  \
0           0    AB Aviation              9   
1           1    AB Aviation              1   
2           2    AB Aviation              1   
3           3  Adria Airways              1   
4           4  Adria Airways              1   

                            Review_Title          Review Date  Verified  \
0                "pretty decent airline"   11th November 2019      True   
1                   "Not a good airline"       25th June 2019      True   
2         "flight was fortunately short"       25th June 2019      True   
3    "I will never fly again with Adria"  28th September 2019     False   
4  "it ruined our last days of holidays"  24th September 2019      True   

                                              Review       Aircraft  \
0    Moroni to Moheli. Turned out to be a pretty ...            NaN   
1   Moroni to Anjouan. It is a very small airline...           E120   
2    Anjouan to Dzaoudzi. A very small airline an... 

In [3]:
# Select columns for the project
columns_to_keep = ['Review', 'Review_Title', 'Overall_Rating', 'Recommended']

# Drop other columns
df_reduced = df[columns_to_keep]

# Display the first few rows of the reduced dataframe
df_reduced.head()


|   | Review | Review_Title | Overall_Rating | Recommended |
|---|---|---|---|---|
| 0 | Moroni to Moheli. Turned out to be a pretty ... | "pretty decent airline" | 9 | yes |
| 1 | Moroni to Anjouan. It is a very small airline... | "Not a good airline" | 1 | no |
| 2 | Anjouan to Dzaoudzi. A very small airline an... | "flight was fortunately short" | 1 | no |
| 3 | Please do a favor yourself and do not fly wi... | "I will never fly again with Adria" | 1 | no |
| 4 | Do not book a flight with this airline! My fr... | "it ruined our last days of holidays" | 1 | no |


---
# **Data Preprocessing**
-> Clean the Review column.

-> Label the data for sarcasm detection
  using the Recommended column (a review marked "no" is labeled 1, otherwise 0).

---



In [6]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

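def clean_text(text):
    """Normalize a review: strip punctuation and stray single letters, lowercase."""
    # Replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r'\W', ' ', str(text))
    # Remove single letters surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Remove a single letter at the start of the text
    # (the original pattern r'\^...' matched a literal caret and never fired)
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)
    # Collapse runs of whitespace into one space
    text = re.sub(r'\s+', ' ', text)
    # Drop a leading 'b' left over from byte-string literals
    text = re.sub(r'^b\s+', '', text)
    text = text.lower()
    return text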

df_reduced['cleaned_reviews'] = df_reduced['Review'].apply(clean_text)

# Create labels
df_reduced['label'] = df_reduced['Recommended'].apply(lambda x: 1 if x == 'no' else 0)

# Save the preprocessed dataframe
preprocessed_file_path = '/content/drive/My Drive/Airline_Reviews_Preprocessed.csv'
df_reduced.to_csv(preprocessed_file_path, index=False)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_reduced['cleaned_reviews'] = df_reduced['Review'].apply(clean_text)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_reduced['label'] = df_reduced['Recommended'].apply(lambda x: 1 if x == 'no' else 0)
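
The warnings above appear because `df_reduced` was created by selecting columns from `df`, so pandas cannot tell whether the new columns are being written to a copy. One common way to avoid this is to take an explicit `.copy()` before adding columns. The sketch below assumes that approach; the two-row DataFrame and the `toy_` names are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the airline reviews (illustration only)
toy_df = pd.DataFrame({
    "Review": ["Lovely crew, smooth flight.", "Oh great, another five-hour delay."],
    "Recommended": ["yes", "no"],
    "Aircraft": ["A320", None],
})

# .copy() makes the subset an independent DataFrame, so adding columns is unambiguous
toy_reduced = toy_df[["Review", "Recommended"]].copy()

# Same labeling rule as the notebook: 1 when the reviewer did not recommend the airline
toy_reduced["label"] = (toy_reduced["Recommended"] == "no").astype(int)

print(toy_reduced["label"].tolist())
```

The vectorized comparison is equivalent to the `apply` with a lambda used in the cell above.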


---
# Feature Extraction

Convert text data to numerical features using TF-IDF.

---

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(df_reduced['cleaned_reviews']).toarray()
y = df_reduced['label']

# Save the features and labels
import numpy as np

np.save('/content/drive/My Drive/X_features.npy', X)
np.save('/content/drive/My Drive/y_labels.npy', y)
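
To see what TF-IDF does to a review, here is a small sketch on a made-up three-sentence corpus (not taken from the dataset). It uses the vectorizer's default settings rather than the `max_features` and `stop_words` values above, and it keeps the matrix sparse instead of calling `.toarray()`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus (illustration only, not taken from the dataset)
toy_docs = [
    "great flight friendly crew",
    "flight delayed rude crew",
    "great service great food",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: one row per document

vocab = toy_vectorizer.vocabulary_  # maps each term to its column index
row = toy_X[1].toarray().ravel()    # TF-IDF weights for the second document

# "delayed" occurs in one document and "flight" in two,
# so "delayed" receives the larger weight within this row
print(row[vocab["delayed"]] > row[vocab["flight"]])

# With the default settings each row is scaled to unit Euclidean length
print(np.linalg.norm(row))
```

Terms that are rare across the corpus are weighted up, which is why words specific to a few reviews tend to carry more signal than words that appear everywhere.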


---
# Split Data

We split the data into training (80%) and testing (20%) sets: the model is trained on one portion and evaluated on the other, unseen portion, which shows how well it generalizes to new data.

---

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save the split data
np.save('/content/drive/My Drive/X_train.npy', X_train)
np.save('/content/drive/My Drive/X_test.npy', X_test)
np.save('/content/drive/My Drive/y_train.npy', y_train)
np.save('/content/drive/My Drive/y_test.npy', y_test)
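
The split above is purely random. If the two classes are unevenly represented, an optional variation is to pass `stratify` so both partitions keep the same class ratio. This is not what the cell above does. The sketch below assumes made-up data with a 60/40 class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples, 30 positive and 20 negative (illustration only)
toy_X = np.arange(100).reshape(50, 2)
toy_y = np.array([1] * 30 + [0] * 20)

# stratify=toy_y keeps the 60/40 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Number of test samples and how many of them are positive
print(len(y_te), int(y_te.sum()))
```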


---
# Model Building

We build a Logistic Regression model, a linear model commonly used for binary classification tasks.

---

In [9]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Save the untrained model
import joblib

joblib.dump(model, '/content/drive/My Drive/logistic_regression_model.pkl')


['/content/drive/My Drive/logistic_regression_model.pkl']
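
To make "linear model" concrete: a fitted binary Logistic Regression computes a weighted sum of the features plus an intercept and passes it through the sigmoid function to obtain a probability. The sketch below checks this by hand on made-up two-feature data; the `toy_` names and the data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-feature data (illustration only)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 2))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Linear score w.x + b, computed by hand from the fitted coefficients
scores = toy_X @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# The sigmoid turns the score into the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Compare with the probabilities reported by scikit-learn
print(np.allclose(manual_proba, toy_model.predict_proba(toy_X)[:, 1]))
```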

---
# Train Model

We then train the model using the training data.

---

In [10]:
# Load the split data
X_train = np.load('/content/drive/My Drive/X_train.npy')
y_train = np.load('/content/drive/My Drive/y_train.npy')

model.fit(X_train, y_train)

# Save the trained model
joblib.dump(model, '/content/drive/My Drive/logistic_regression_model_trained.pkl')


['/content/drive/My Drive/logistic_regression_model_trained.pkl']
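
In the cells shown here the trained classifier is saved, but the fitted `TfidfVectorizer` is not, and it would be needed to transform new reviews in the same way. One possible arrangement, sketched below under that assumption, is to wrap both steps in a scikit-learn pipeline and persist them together. The reviews, labels, and file name are made up for illustration, and a temporary directory stands in for Google Drive.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews and labels (illustration only; 1 = not recommended)
toy_texts = [
    "lovely crew and smooth flight",
    "comfortable seats and great food",
    "lost my luggage and rude staff",
    "delayed for hours with no explanation",
]
toy_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier together
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_texts, toy_labels)

# Persist both steps in one file (a temporary directory here, not Google Drive)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly
new_reviews = ["rude staff and delayed flight"]
print(restored.predict(new_reviews))
```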

---
# Evaluate Model
Finally, we evaluate the model's performance on the test data.

---

In [11]:
# Load the test data
X_test = np.load('/content/drive/My Drive/X_test.npy')
y_test = np.load('/content/drive/My Drive/y_test.npy')

y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

evaluation_results = {
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1
}

# Save the evaluation results
evaluation_results_file_path = '/content/drive/My Drive/evaluation_results.json'
import json

with open(evaluation_results_file_path, 'w') as f:
    json.dump(evaluation_results, f)

print(evaluation_results)


{'Accuracy': 0.9052858683926646, 'Precision': 0.9133858267716536, 'Recall': 0.9464751958224543, 'F1 Score': 0.9296361596409681}
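
For reference, the four metrics can be derived from the counts in a confusion matrix. The sketch below does this by hand on made-up labels and predictions (not the notebook's test set), treating 1 as the positive class as the scikit-learn scorers do by default.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions (illustration only; 1 is the positive class)
toy_true = [1, 1, 1, 1, 0, 0, 0, 1]
toy_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

toy_accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
toy_precision = tp / (tp + fp)                  # share of predicted positives that are correct
toy_recall = tp / (tp + fn)                     # share of actual positives that were found
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

Because the label is 1 for reviews marked "no", precision and recall here describe how well the not-recommended class is identified.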


---
# Results

---
The initial model achieved the following performance metrics on the test data:

* Accuracy: 0.9053
* Precision: 0.9134
* Recall: 0.9465
* F1 Score: 0.9296

These results indicate that the model reliably predicts the label derived from the Recommended column, which this notebook uses as its proxy for sarcasm in the given dataset.
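
One way to put the accuracy figure in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The helper below is a hypothetical sketch of that comparison; `majority_baseline_accuracy` is not part of the notebook, and the labels are made up and do not reflect the real class balance.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size

# Made-up labels: 6 positives and 4 negatives (not the real class balance)
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(majority_baseline_accuracy(toy_labels))
```

On the notebook's data this would be called with `y_test`; the gap between that value and the model's accuracy shows how much the model adds beyond the class balance.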

---