# Irony Detection in English 

#### The goal of this project is to train a model that classifies whether a tweet in English is ironic or not.

Using different classification techniques, we intend to find the one that performs best, using a data set of around 4000 tweets of different types as training data.

This project was made by 3rd-year students as part of the Artificial Intelligence course at FEUP. You are free to use the code for any purpose, but beware that this is just an academic project from an introductory course in Artificial Intelligence.

### Datasets:

The data sets for this project are not owned by our group; they are supplied by the SemEval competition and, outside of the competition, are free to use for academic purposes only.

There are two data sets available, one marked for training and the other for testing. Originally they were in .txt format, but to simplify the work of the pandas library, we manually converted those files to .csv.

**The SemEval organization itself already provides a division of the data set into two subsets: test data and train data.**

This division is useful for the final evaluation of our work. Since the competition took place in 2018, the ranking of the participants for that data set is already public. This way we can keep track of which place we would have scored in that professional competition.



![title](../assets/ranking.png)
##### Fig1: Ranking of the SemEval competition

#### Problem with word delimiter

Pandas is a library well suited to reading data from a file, since it automatically sets up a formatted, organized view of that data.

Pandas has specific functions to read from different formats. Storing 4000 tweets in JSON would make the data file very dense.
Since we are dealing with just 2 attributes, it is pointless to make our life harder in this step. We chose the .csv format to represent the data.

CSV files containing text raise another problem: people use commas in their texts, and pandas would treat those commas as field separators. So we introduced a different delimiter, "_ç", which exists in Portuguese but doesn't exist in English, so that we are able to do the task.

In [6]:
import pandas
import numpy

train_data=pandas.read_csv('data/train.csv', engine='python', encoding="utf-8",delimiter="_ç")
test_data=pandas.read_csv('data/test-labeled.csv',engine='python', encoding="utf-8",delimiter="_ç")
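
As a rough illustration of why the custom delimiter helps, the sketch below reads a small in-memory sample whose tweets contain commas. The sample text and the names `raw_sample` and `sample_frame` are made up for this example; only `pandas.read_csv` and the "_ç" delimiter come from the cell above.

```python
import io

import pandas

# Hypothetical two-row sample in the same layout as the project files:
# a label column and a tweet column separated by "_ç".
raw_sample = (
    "label_çtweet\n"
    "1_çGreat, just great, another Monday\n"
    "0_çLunch was good\n"
)

# A separator longer than one character needs the python parsing engine.
sample_frame = pandas.read_csv(io.StringIO(raw_sample), engine="python", sep="_ç")

# The commas stay inside the tweet text instead of splitting it into columns.
print(sample_frame.shape)
print(sample_frame["tweet"][0])
```

Because the tweets are split only on "_ç", the commas inside the first tweet do not create extra columns.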

In [7]:
train_data.head()

| | label | tweet |
|---|---|---|
| 0 | 1 | Sweet United Nations video. Just in time for C... |
| 1 | 1 | @mrdahl87 We are rumored to have talked to Erv... |
| 2 | 1 | Hey there! Nice to see you Minnesota/ND Winter... |
| 3 | 0 | 3 episodes left I'm dying over here |
| 4 | 1 | "I can't breathe!" was chosen as the most nota... |


In [8]:
test_data.head()

| | label | tweet |
|---|---|---|
| 0 | 0 | @Callisto1947 Can U Help?\|\|More conservatives ... |
| 1 | 1 | Just walked in to #Starbucks and asked for a "... |
| 2 | 0 | #NOT GONNA WIN http://t.co/Mc9ebqjAqj |
| 3 | 0 | @mickymantell He is exactly that sort of perso... |
| 4 | 1 | So much #sarcasm at work mate 10/10 #boring 10... |


### Bag of words

Now that we have the data in a pandas data set, we can continue our search for an accurate model to identify ironic tweets. For that, we need to convert our tweets to a **bag of words** model format.

A bag of words is a model in which each tweet has its words tokenized and counted according to a specified formula. There are essentially 2 paths that we can follow:

* Use a simple counting of words.
* Use the TF-IDF measure

##### The process of taking raw text as input and producing a bag of words as output is called **Vectorization**
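
To make the two paths concrete, here is a minimal sketch that vectorizes three made-up sentences with both approaches. The toy corpus and the `*_demo` names are assumptions for illustration; the vectorizers run with their default settings, not the settings used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for real tweets.
toy_tweets = ["great great day", "great traffic again", "traffic again"]

# Path 1: simple word counts.
count_vectorizer_demo = CountVectorizer()
count_matrix = count_vectorizer_demo.fit_transform(toy_tweets).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(
    count_vectorizer_demo.vocabulary_, key=count_vectorizer_demo.vocabulary_.get
)
print(vocabulary)
print(count_matrix)

# Path 2: TF-IDF weights, which down-weight words that occur in many documents.
tfidf_vectorizer_demo = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer_demo.fit_transform(toy_tweets).toarray()
print(tfidf_matrix.round(2))
```

Both matrices have one row per tweet and one column per vocabulary word; only the values in the cells differ.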



## Initial analysis

In order to have an objective evaluation criterion for our final answer to this problem, it is important to make a quick initial analysis of the data without any kind of pre-processing, using the simplest model possible: a naive Bayes classifier.

#### Simple Word Counter Vectorization
#### Train Data

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

count_vectorizer = CountVectorizer(analyzer='word', stop_words='english', lowercase=False)

X_train=count_vectorizer.fit_transform(train_data['tweet']).toarray()
y_train=train_data['label']
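
The `.toarray()` call in the cell above deserves a short note. `fit_transform` returns a SciPy sparse matrix, while the Gaussian naive Bayes model used in this notebook is fitted on dense arrays (as far as we know, it expects dense input). The sketch below shows the conversion on two made-up sentences; the variable names are hypothetical.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer()

# fit_transform returns a sparse matrix that stores only the non-zero counts.
sparse_counts = demo_vectorizer.fit_transform(["so much fun", "so much traffic"])
print(sparse.issparse(sparse_counts))

# toarray() expands it into a regular dense NumPy array, zeros included.
dense_counts = sparse_counts.toarray()
print(type(dense_counts).__name__, dense_counts.shape)
```

With a few thousand tweets the dense array still fits in memory, but it grows with the number of tweets times the vocabulary size.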

#### Test Data

In [10]:
# Here we only transform, because the vocabulary is the one learned from the training set
X_test=count_vectorizer.transform(test_data['tweet']).toarray()
y_test=test_data['label']
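
The difference between `fit_transform` and `transform` can be seen on a tiny example. This is only a sketch with made-up sentences and hypothetical variable names: the vocabulary is learned from the training sentences, and a word that only appears in the test sentence gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; "surprise" never appears in the training sentences.
toy_train = ["what a lovely monday", "lovely weather today"]
toy_test = ["lovely surprise today"]

demo_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and counts the training words.
train_matrix = demo_vectorizer.fit_transform(toy_train).toarray()

# transform reuses that vocabulary, so both matrices share the same columns.
test_matrix = demo_vectorizer.transform(toy_test).toarray()

print(train_matrix.shape[1] == test_matrix.shape[1])
print("surprise" in demo_vectorizer.vocabulary_)
print(test_matrix)
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the classifier trained on the first set of columns could not be applied to it.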

In [11]:
classificator=GaussianNB()

classificator.fit(X_train,y_train)

y_predicted=classificator.predict(X_test)

print(classification_report(y_test,y_predicted))
print(confusion_matrix(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.64      0.42      0.51       473
           1       0.42      0.65      0.51       311

    accuracy                           0.51       784
   macro avg       0.53      0.53      0.51       784
weighted avg       0.56      0.51      0.51       784

[[199 274]
 [110 201]]
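
The figures in the report can be recomputed by hand from the confusion matrix printed above, where rows are the true labels and columns are the predicted labels. The sketch below does that arithmetic with NumPy; the variable names are ours.

```python
import numpy

# Confusion matrix copied from the output above (rows: true, columns: predicted).
matrix = numpy.array([[199, 274],
                      [110, 201]])

correct = matrix.diagonal()

# Precision: correct predictions divided by everything predicted as that class.
precision = correct / matrix.sum(axis=0)

# Recall: correct predictions divided by everything that truly is that class.
recall = correct / matrix.sum(axis=1)

# Accuracy: all correct predictions divided by the number of tweets.
accuracy = correct.sum() / matrix.sum()

print(precision.round(2), recall.round(2), round(float(accuracy), 2))
```

Rounded to two decimals, these values agree with the classification report: the model catches about 65% of the ironic tweets, but only about 42% of the tweets it flags as ironic really are.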


### TF-IDF Word Counter Vectorization

#### Train Data

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer=TfidfVectorizer(analyzer='word', stop_words='english', lowercase=False)

X_train=tf_idf_vectorizer.fit_transform(train_data['tweet']).toarray()
y_train=train_data['label']

#### Test Data

In [13]:
X_test=tf_idf_vectorizer.transform(test_data['tweet']).toarray()
y_test=test_data['label']

In [49]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

classificator=GaussianNB()

classificator.fit(X_train,y_train)
# Prediction
y_predicted=classificator.predict(X_test)



# Raw metrics
accuracy_value=accuracy_score(y_test,y_predicted)
precision_value=precision_score(y_test,y_predicted,average='weighted')
f1_value=f1_score(y_test,y_predicted,average='weighted')

print(classification_report(y_test,y_predicted))
print("Confusion Matrix")
print(confusion_matrix(y_test, y_predicted))



initial_conclusion=pandas.DataFrame(
    {
    'Our Results':[accuracy_value,precision_value,f1_value],
    'Winners Result':[0.7347,0.6304,0.8006]
    } , index=['Accuracy','Precision','F1-Value'])

              precision    recall  f1-score   support

           0       0.64      0.45      0.53       473
           1       0.42      0.60      0.50       311

    accuracy                           0.51       784
   macro avg       0.53      0.53      0.51       784
weighted avg       0.55      0.51      0.52       784

Confusion Matrix
[[215 258]
 [123 188]]
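
The `average='weighted'` option used above averages the per-class scores, weighting each class by its number of true samples. To show what that means, the sketch below rebuilds label vectors that reproduce the confusion matrix above (an illustration only; the real predictions come from the classifier) and compares the weighted scores with the default score for class 1 alone.

```python
import numpy
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Label vectors rebuilt from the confusion matrix [[215, 258], [123, 188]].
y_true_demo = numpy.array([0] * 473 + [1] * 311)
y_pred_demo = numpy.array([0] * 215 + [1] * 258 + [0] * 123 + [1] * 188)

accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)

# Weighted average over both classes, as in the cell above.
weighted_precision = precision_score(y_true_demo, y_pred_demo, average="weighted")
weighted_f1 = f1_score(y_true_demo, y_pred_demo, average="weighted")

# Default behaviour: score for the positive class (label 1) only.
positive_precision = precision_score(y_true_demo, y_pred_demo)

print(round(accuracy_demo, 6), round(weighted_precision, 6), round(weighted_f1, 6))
print(round(positive_precision, 4))
```

The weighted precision is noticeably higher than the precision for the ironic class alone, so any comparison with published scores depends on which averaging those scores use.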


### Conclusions

The raw performance of the simple model is well below the best scores of the competition:

In [50]:
initial_conclusion

| | Our Results | Winners Result |
|---|---|---|
| Accuracy | 0.514031 | 0.7347 |
| Precision | 0.550978 | 0.6304 |
| F1-Value | 0.516916 | 0.8006 |


#### The differences are significant. We need to introduce data pre-processing into our project in order to improve the data