<a href="https://colab.research.google.com/github/fatihcelikeee/MachineLearningHandsOnExperiences/blob/master/sarcasm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sarcasm or Irony Detection Learning

---


* **Language:** Python 

* **Libraries:** Keras/TensorFlow

* **Cloud:** Google Cloud, free Colab service (Tesla K80 GPU)

* **Dataset:** [Datasets for natural language processing](https://www.kaggle.com/toygarr/datasets-for-natural-language-processing)

**Extra resources:**
1. **Pre-processing and training:** [Pre-Processing and training](https://www.kaggle.com/toygarr/synthetic-text-data-augmentation#Augmentation-Method)

##Importing Libraries

In [None]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
/gdrive


In [None]:
import os

os.chdir('/gdrive/My Drive/workSpace/Natural Language Processing/data')

In [None]:
!ls

'food review'  'link of databases.txt'	 sarcasm  'toxic word detection'


In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

import re
import nltk

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


##Reading Dataset

In [None]:
root = "/gdrive/My Drive/workSpace/Natural Language Processing/data/sarcasm/"
datas = pd.read_csv(root + "train.csv")
datas.shape

(20033, 2)

##Preprocessing

In [None]:
# Build the stop-word set once instead of once per word.
stop_words = set(stopwords.words('english'))

corpus = []
for text in datas['text']:
    data = re.sub('[^a-zA-Z]', ' ', text)   # keep letters only
    data = data.lower().split()
    # Drop stop words and reduce the remaining words to their stems.
    data = [ps.stem(word) for word in data if word not in stop_words]
    corpus.append(' '.join(data))
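
To make the effect of this loop concrete, here is a minimal sketch of the cleaning steps applied to a single headline. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) in place of NLTK's English list and skips stemming, so it runs without any downloads; the pipeline above additionally stems every token with `PorterStemmer`.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"a", "an", "the", "to", "of", "over", "is"}

def clean(text):
    """Keep letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

sample = "Sarah Palin calls Obama 'lazy' over approach to VA scandal!"
cleaned = clean(sample)
print(cleaned)
```

Punctuation and capitalization disappear and the stop words are removed, which is the form the vectorizer sees later.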

##Feature Extraction

In [None]:
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
Y = datas.iloc[:,0].values

In [None]:
gnb = GaussianNB()
gnb.fit(X,Y)

GaussianNB()

##Testing

###Importing test data

In [None]:
root = "/gdrive/My Drive/workSpace/Natural Language Processing/data/sarcasm/"
test_datas = pd.read_csv(root + "test.csv")
test_datas.shape

(8586, 2)

###Prediction

In [None]:
y_datas=test_datas.iloc[:,1].values
y_datas

array(['man wondering if there might be some sort of website featuring footage of sexual acts one may view for purposes of self gratification',
       'white house official reportedly said mass shooting was reprieve from chaos',
       'sarah palin calls obama lazy over approach to va scandal', ...,
       'the most beautiful acceptance speech this week came from queer korean',
       'mars probe destroyed by orbiting spielberg gates space palace',
       'dad clarifies this not food stop'], dtype=object)

In [None]:
# Apply the same cleaning steps to the test texts (not the training texts).
stop_words = set(stopwords.words('english'))

test_corpus = []
for text in test_datas.iloc[:, 1]:
    data = re.sub('[^a-zA-Z]', ' ', text)
    data = data.lower().split()
    data = [ps.stem(word) for word in data if word not in stop_words]
    test_corpus.append(' '.join(data))

# Reuse the vectorizer fitted on the training corpus so that the test
# columns line up with the features the classifier was trained on.
test_values = cv.transform(test_corpus).toarray()
test_results = test_datas.iloc[:,0].values

In [None]:
test_values

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
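
The classifier was trained on columns defined by the training vocabulary, so test texts need to be mapped onto that same vocabulary for the predictions to be meaningful. The sketch below illustrates the idea with a hand-rolled bag-of-words in plain Python. `build_vocab` and `vectorize` are hypothetical helpers, not scikit-learn functions; they only stand in for what `CountVectorizer.fit` and `CountVectorizer.transform` do conceptually.

```python
def build_vocab(corpus):
    """Map each distinct word in the corpus to a column index (sorted order)."""
    words = sorted({w for doc in corpus for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(corpus, vocab):
    """Count words per document, using only columns already in vocab."""
    rows = []
    for doc in corpus:
        row = [0] * len(vocab)
        for w in doc.split():
            if w in vocab:          # words unseen during fitting are ignored
                row[vocab[w]] += 1
        rows.append(row)
    return rows

train = ["man wonder websit", "mar probe destroy"]
test = ["probe wonder palac"]

vocab = build_vocab(train)            # learned from the training text only
test_rows = vectorize(test, vocab)    # same columns as the training matrix

refit_vocab = build_vocab(test)       # what re-fitting on the test text gives
print(vocab["probe"], refit_vocab["probe"])  # same word, different column
```

When the vocabulary is rebuilt from the test text, the same word can land in a different column, so a model trained on the original columns would read unrelated features.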

In [None]:
y_pred = gnb.predict(test_values)

###Confusion matrix

In [None]:
cm = confusion_matrix(test_results,y_pred)
print(cm)

[[2207 2299]
 [2005 2075]]
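
The matrix is easier to judge once it is reduced to a few summary numbers. The sketch below recomputes accuracy, precision, and recall for class 1 from the four counts printed above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above; scikit-learn puts true labels on rows
# and predicted labels on columns.
cm = [[2207, 2299],
      [2005, 2075]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # share of class-1 predictions that are correct
recall_1 = tp / (tp + fn)      # share of true class-1 samples that are found

print('Accuracy: %.4f' % accuracy)
print('Precision (class 1): %.4f' % precision_1)
print('Recall (class 1): %.4f' % recall_1)
```

The accuracy comes out at roughly 0.50, which is about what random guessing would achieve on two nearly balanced classes.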


##BERT Sentence Embeddings and MLP Classifier

In [None]:
pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.4-py3-none-any.whl (248 kB)
Collecting streamlit
  Downloading streamlit-1.4.0-py2.py3-none-any.whl (9.3 MB)
Collecting transformers>=4.6.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting wandb>=0.10.32
  Downloading wandb-0.12.9-py2.py3-none-any.whl (1.7 MB)
Collecting datasets
  Downloading datasets-1.18.0-py3-none-any.whl (311 kB)
Collecting tokenizers
  Downloading tokenizers-0.11.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)

In [None]:
from simpletransformers.language_representation import RepresentationModel
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, classification_report, confusion_matrix

In [None]:
X = datas.iloc[:,1].values
Y = datas.iloc[:,0].values
model = RepresentationModel(model_type="bert", model_name='bert-base-uncased', use_cuda=False)
word_vectors_train = model.encode_sentences(X, combine_strategy="mean")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTextRepresentation: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTextRepresentation from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTextRepresentation from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
X_test = test_datas.iloc[:,1].values
Y_test = test_datas.iloc[:,0].values
model_test = RepresentationModel(model_type="bert", model_name='bert-base-uncased', use_cuda=False)
word_vectors_test = model_test.encode_sentences(X_test, combine_strategy="mean")


In [None]:
word_vectors_train.shape

(20033, 768)

In [None]:
word_vectors_test.shape

(8586, 768)
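
Both arrays have 768 columns regardless of sentence length because `combine_strategy="mean"` produces one fixed-length vector per sentence. Conceptually this amounts to averaging the per-token vectors. The sketch below shows such mean pooling in plain Python with made-up 3-dimensional token vectors in place of BERT's 768-dimensional ones; it illustrates the idea only and is not the library's internal code.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Hypothetical embeddings for a three-token sentence (3 dims instead of 768).
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 2.0],
    [2.0, 4.0, 2.0],
]
sentence_vector = mean_pool(tokens)
print(sentence_vector)
```

The result has as many entries as one token vector, independent of how many tokens the sentence contains.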

In [None]:
from sklearn.neural_network import MLPClassifier
X_train, X_test = word_vectors_train, word_vectors_test
clf = MLPClassifier(max_iter=10, hidden_layer_sizes = (75,75)).fit(X_train, Y)
y_pred = clf.predict(X_test)



In [None]:
print('Precision: %.4f' % precision_score(Y_test, y_pred, average='weighted'))
print('Recall: %.4f' % recall_score(Y_test, y_pred, average='weighted'))
print('Accuracy: %.4f' % accuracy_score(Y_test, y_pred))
print('F1 Score: %.4f' % f1_score(Y_test, y_pred, average='weighted'))
print(classification_report(Y_test, y_pred))

Precision: 0.8694
Recall: 0.8691
Accuracy: 0.8691
F1 Score: 0.8691
              precision    recall  f1-score   support

           0       0.88      0.87      0.87      4506
           1       0.85      0.87      0.86      4080

    accuracy                           0.87      8586
   macro avg       0.87      0.87      0.87      8586
weighted avg       0.87      0.87      0.87      8586
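
Recall and accuracy print the same value above, and that is not a coincidence. With `average='weighted'`, each class's recall is weighted by its support, which reduces algebraically to the overall fraction of correct predictions. A small sketch with a made-up confusion matrix (the counts are illustrative and not taken from this run):

```python
# Hypothetical 2x2 confusion matrix: rows are true labels, columns predictions.
cm = [[40, 10],
      [5, 45]]

supports = [sum(row) for row in cm]                  # samples per true class
total = sum(supports)
recalls = [cm[i][i] / supports[i] for i in range(len(cm))]

# Support-weighted mean of the per-class recalls.
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / total
# Fraction of all samples on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

print(weighted_recall, accuracy)
```

Weighted precision and F1 have no such identity, which is why they can differ slightly from accuracy.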



In [None]:
import os

os.chdir('/gdrive/My Drive/workSpace/Natural Language Processing/sarcasm')


In [None]:
import pickle

# Persist the trained scikit-learn classifier; the context manager
# closes the file handle once the dump is complete.
with open("sarcasm", 'wb') as f:
    pickle.dump(clf, f)


In [None]:
!ls

sarcasm  sarcasm.ipynb


In [None]:
# Reload the pickled classifier and confirm it scores the same on the test set.
with open("sarcasm", 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, Y_test)
print(result)

0.8690892150011646
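
The save-and-reload step can also be checked in isolation. The sketch below runs a pickle round trip with a plain dictionary standing in for the classifier and writes to a temporary directory, so nothing touches Drive; the file name `model.pkl` and the `fake_model` contents are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted classifier `clf`.
fake_model = {"hidden_layer_sizes": (75, 75), "max_iter": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Context managers close the file handles even if an error occurs.
    with open(path, "wb") as f:
        pickle.dump(fake_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

Unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.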
