<a href="https://colab.research.google.com/github/brianellis1997/Sarcasm_Detection/blob/Elena/DS_440_Test_xgb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sarcasm Detection using NLP Techniques
### Team 2: Brian Ellis, Lindsey Rich, Elena Kern

## Load Libraries and Dataset

In [1]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

In [2]:
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import torch.optim as optim
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix

In [4]:
!git clone https://github.com/brianellis1997/Sarcasm_Detection.git

Cloning into 'Sarcasm_Detection'...
remote: Enumerating objects: 167, done.[K
remote: Counting objects: 100% (164/164), done.[K
remote: Compressing objects: 100% (102/102), done.[K
remote: Total 167 (delta 91), reused 112 (delta 62), pack-reused 3[K
Receiving objects: 100% (167/167), 3.58 MiB | 11.68 MiB/s, done.
Resolving deltas: 100% (91/91), done.


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
train_bal = pd.read_csv('/content/drive/MyDrive/Sarcasm_Data/Train_Balanced.csv')   # Make sure path is correct in your google drive
train_bal.head()

Unnamed: 0.1,Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,522142,0,"I personally wasn't a huge Garrosh fan, I've a...",cromemako83,AskReddit,2,2,0,2015-07-01,2015-07-11 01:55:53,Fuck Vol'jin. Garrosh Hellscream did nothing w...
1,907864,1,you forgot the,_SharkWeek_,AskReddit,1,1,0,2013-03-01,2013-03-14 03:03:46,That's a lie fed to you by the LIEberal media....
2,604170,1,"Nah man, she's clearly an ad carry",jdswift13,leagueoflegends,1,1,0,2015-10-01,2015-10-21 23:22:17,she isnt already?
3,110635,1,This sub in a nutshell.,trickz-M-,GlobalOffensive,1,-1,-1,2016-12-01,2016-12-05 03:50:18,Cloud 9 Qualify! (ONLY C9 FANS ALLOWED(
4,997758,0,Yes... I do.,guriboysf,videos,4,4,0,2010-01-01,2010-01-17 21:32:40,"""so, i hear you have a fat cock."""


In [7]:
train_bal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808618 entries, 0 to 808617
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Unnamed: 0      808618 non-null  int64 
 1   label           808618 non-null  int64 
 2   comment         808618 non-null  object
 3   author          808618 non-null  object
 4   subreddit       808618 non-null  object
 5   score           808618 non-null  int64 
 6   ups             808618 non-null  int64 
 7   downs           808618 non-null  int64 
 8   date            808618 non-null  object
 9   created_utc     808618 non-null  object
 10  parent_comment  808618 non-null  object
dtypes: int64(5), object(6)
memory usage: 67.9+ MB


In [8]:
train_bal['label'].value_counts()

1    404369
0    404249
Name: label, dtype: int64

We can see our dataset is balanced.

In [9]:
train_bal['comment'] = train_bal['comment'].astype('str')
train_bal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 808618 entries, 0 to 808617
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Unnamed: 0      808618 non-null  int64 
 1   label           808618 non-null  int64 
 2   comment         808618 non-null  object
 3   author          808618 non-null  object
 4   subreddit       808618 non-null  object
 5   score           808618 non-null  int64 
 6   ups             808618 non-null  int64 
 7   downs           808618 non-null  int64 
 8   date            808618 non-null  object
 9   created_utc     808618 non-null  object
 10  parent_comment  808618 non-null  object
dtypes: int64(5), object(6)
memory usage: 67.9+ MB


In [10]:
train_bal[:20]['comment']

0     I personally wasn't a huge Garrosh fan, I've a...
1                                        you forgot the
2                    Nah man, she's clearly an ad carry
3                               This sub in a nutshell.
4                                          Yes... I do.
5                                    But but mushrooms!
6     Because there's wood in both, especially when ...
7                   I wish I had half their confidence.
8        Do you see the Syrian opposition falling soon?
9                              You must HATE ALL WOMEN!
10                    NRA, the supporter of terrorists.
11    That's offensive, use another term pls, im sic...
12                FTFY Yeah fuck those guys having fun.
13                                 Working for the TSA.
14                 Well, that's such a huge difference!
15                       Show population screen please?
16                     I got a tick stuck in my peehole
17              Still with Asus after all their 

## XGBoost Model

In [16]:
# Split into train and test (80% train, 20% test)
sub_train, sub_test = train_test_split(train_bal, test_size=0.2, random_state=22)

#create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

#fit and transform training and test data
train_x_tfidf = tfidf_vectorizer.fit_transform(sub_train['comment'])
test_x_tfidf = tfidf_vectorizer.transform(sub_test['comment'])

#dimension reduction
svd = TruncatedSVD(n_components = 100)
X_train_svd = svd.fit_transform(train_x_tfidf)
X_test_svd = svd.fit_transform(test_x_tfidf)

#make sure target is in right format
train_target = sub_train['label'].values
test_target = sub_test['label'].values

X_dtrain_svd = xgb.DMatrix(X_train_svd, label=train_target)
X_dtest_svd = xgb.DMatrix(X_test_svd, label=test_target)
#create DMatrix
#xgtrain = xgb.DMatrix(train_x_tfidf, label=train_target.tolist())
#xgtest = xgb.DMatrix(test_x_tfidf, label=test_target.tolist())

In [17]:
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'gamma': 0.1,
    'scale_pos_weight': 1
}

num_rounds = 100

# Training the model
xgb_model = xgb.train(params, X_dtrain_svd, num_rounds)


In [18]:
# Predict on the test data
predictions_proba = xgb_model.predict(X_dtest_svd)

# Convert predicted probabilities to binary predictions
predictions = [1 if p > 0.5 else 0 for p in predictions_proba]

# Evaluate the predictions
accuracy = accuracy_score(test_target, predictions)
print("Accuracy:", accuracy)


Accuracy: 0.5529791496623878


##SVM

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train the SVM model
svm_model = SVC(kernel='linear', random_state=22, verbose=1)
svm_model.fit(X_train_svd, train_target)

# Predict on the validation set
y_pred_svm = svm_model.predict(X_test_svd)

# Calculate accuracy
accuracy_svm = accuracy_score(test_target, y_pred_svm)
print("SVM Accuracy with TruncatedSVD:", accuracy_svm)


[LibSVM]