# Irony Detection in English

#### The goal of this project is to train models that classify irony in English tweets. It covers two tasks.

* Task 1: Create a model that performs a binary classification of whether a tweet is ironic or not
* Task 2: Create a model that determines the type of irony

This project was originally introduced as a task in a Machine Learning competition, [SemEval2018](https://competitions.codalab.org/competitions/17468#participate).

Using different classification techniques, we intend to find the model that performs best in each of the two proposed tasks.

This project was made by 3 students as part of the Artificial Intelligence course at FEUP. You are free to use the code for any purpose, but be aware that this is just an academic project from an introductory Artificial Intelligence course.

* André Malheiro up201706280@fe.up.pt
* Diogo Gomes up201806572@fe.up.pt
* Rúben Almeida up201704618@fe.up.pt

Libraries Used: 
* Pandas
* Numpy
* Scikit-learn
* Matplotlib
* nltk

### Datasets:

The datasets for this project are not owned by our group. They are supplied by the SemEval competition and, outside of the competition, are free to use for academic purposes only.

Both tasks use the same two datasets, labelled appropriately for each task. For the first task the labels are binary: 1 stands for an ironic tweet and 0 for a non-ironic tweet.

Of the two datasets, one is meant for training and the other for testing. Originally they were in .txt format, but to simplify the work with the pandas library we manually converted those files to .csv.

**The SemEval organization already provides a division of the data into two subsets, a test dataset and a train dataset.**

This division is useful for the final evaluation of our work. Since the competition took place in 2018, the ranking of the participants on that dataset is already public. This way we can keep track of which place we would have reached in that competition.

![title](../assets/ranking.png)
##### Fig1: Ranking for task 1 of the SemEval competition

![title](../assets/ranking2.png)
##### Fig2: Ranking for task 2 of the SemEval competition

#### Problem with the word delimiter

Pandas is well suited to reading data from a file, since it automatically presents that data in a formatted, organized way.

Pandas has specific functions to read different formats. The datasets are originally in .txt format, so a conversion to either JSON or CSV, both formats accepted by Pandas, was needed. Storing 4000 tweets as JSON would make the data file unnecessarily verbose.
Since we are dealing with just 2 attributes, it is pointless to make our life harder in this step.

**We chose the .csv format to represent the data**

CSV files that hold text raise another problem: people use commas in their tweets, and pandas would read those commas as column separators. So we use a different delimiter, '\t', which lets each tweet be read as a single field.
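The effect of the tab delimiter can be checked in isolation. The sketch below uses Python's standard `csv` module instead of pandas, on two made-up rows; it also shows that `csv.QUOTE_NONE` has the integer value 3, the same value the loading cell passes as `quoting = 3`, so quote characters inside a tweet are kept as plain text.

```python
import csv
import io

# Two made-up rows in the tab-separated layout described above (label, tweet).
raw = 'label\ttweet\n1\tOh great, another "fun" Monday\n0\tCoffee, then work\n'

# With a tab delimiter, commas stay inside the tweet field.
# QUOTE_NONE tells the parser to treat quote characters as ordinary text.
reader = csv.DictReader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row['label'], '|', row['tweet'])
```

Each row comes back with exactly two fields, even though both tweets contain commas.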

## Task 1: Ironic vs. non-ironic

### Create a classifier that can properly handle this binary classification task.

In this first task we will use the dataset split the way it was provided by the competition.

In the second task we will apply other techniques, like **cross validation**.

In [5]:
from IPython.display import display, HTML
import pandas
import numpy

train_data=pandas.read_csv('data/task1/train.csv',  delimiter = '\t', quoting = 3, encoding="utf-8")
test_data=pandas.read_csv('data/task1/test-labeled.csv', delimiter = '\t', quoting = 3, encoding="utf-8")

In [None]:
train_data.head()

In [None]:
test_data.head()

### Analysing the datasets

Some initial checks are needed to ensure the quality of the data. The data provided doesn't contain blank fields.

Properly testing for duplicates in the datasets would require knowledge outside the scope of the course, so we assume that there are no duplicate entries in the datasets.


**One very important consideration regarding the data is the possibility of the dataset being unbalanced**. With both of our datasets loaded, we can use **Matplotlib** to check whether that is the case.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
train_data_ntrue=train_data['label'].sum()
train_data_nfalse=train_data['label'].count()-train_data_ntrue
values=[train_data_ntrue,train_data_nfalse]
names=['Ironic','Not ironic']
plt.bar(names,values)
plt.suptitle('Train data set - distribution')
plt.show()
print('Ironic tweets - ',train_data_ntrue)
print('Non Ironic tweets - ',train_data_nfalse)
print('Percentage of ironic tweets - ',train_data_ntrue*1.0/train_data['label'].count()*100.0)

In [None]:
test_data_ntrue=test_data['label'].sum()
test_data_nfalse=test_data['label'].count()-test_data_ntrue
values=[test_data_ntrue,test_data_nfalse]
names=['Ironic','Non Ironic']
plt.suptitle('Test data set - distribution')
plt.bar(names,values)
plt.show()
print('Ironic tweets - ',test_data_ntrue)
print('Non Ironic tweets - ',test_data_nfalse)
print('Percentage of ironic tweets - ',test_data_ntrue*1.0/test_data['label'].count()*100.0)

#### Conclusion:

While the training data is **balanced** between ironic and non-ironic tweets, the same can't be said for the test dataset originally provided by the organization.

In our second graph, it is visually clear that the number of non-ironic tweets is significantly larger.
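One practical consequence of an unbalanced test set is that accuracy alone becomes harder to read, because always predicting the most frequent label already scores above 50%. The sketch below illustrates this with made-up label lists (not the SemEval counts); `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up label lists, for illustration only.
balanced = [0, 1] * 50            # 50 ironic, 50 non-ironic
unbalanced = [0] * 60 + [1] * 40  # 40 ironic, 60 non-ironic

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(unbalanced))  # 0.6
```

Applied to `test_data['label']`, the same function would give the accuracy any model has to beat on the test set.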

### Bag of words

Now that we have the data in a pandas DataFrame, we can continue towards an accurate model for identifying ironic tweets. For that we need to convert our tweets to a **bag of words** representation.

A bag of words is a model in which each tweet has its words tokenized and counted according to a specified formula. There are 2 essential paths that we can follow:

* Use a simple counting of words.
* Use the TF-IDF measure

##### The process of taking raw text as input and producing a bag of words as output is called **Vectorization**
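The two paths can be compared by hand on a toy example. This is a minimal sketch, assuming whitespace tokenization and the smoothed IDF formula that scikit-learn documents as the default of `TfidfVectorizer` (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the three tweets are made up.

```python
import math
from collections import Counter

docs = ["i love mondays", "i love rain", "great weather"]  # toy tweets
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Path 1: plain word counts, one row per tweet and one column per word.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Path 2: TF-IDF. Words that appear in fewer tweets get a larger weight.
n = len(docs)
df = [sum(1 for doc in tokenized if w in doc) for w in vocab]
idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])

print(vocab)
print(counts[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first tweet, "love" and "mondays" both have a count of 1, but "mondays" gets the higher TF-IDF weight because it appears in only one tweet.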

## Initial analysis


There are two important questions that we need to answer before proceeding with our model design. We also need a raw analysis without preprocessing: it lets us estimate how much preprocessing is needed to achieve a reasonable result, and it lets us keep track of the improvement obtained with each attempt.

* Does the TF-IDF measure perform better than a simple counting of words?
* How well does the simplest model perform on our dataset without preprocessing?

To have an objective evaluation criterion for our final answer to this problem, it is important to make a fast initial analysis of the data without any kind of preprocessing, using the simplest model possible, a **Gaussian Naïve Bayes**.

### Simple Word Counter Vectorization

#### Train Data

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

count_vectorizer = CountVectorizer(analyzer='word', stop_words='english', lowercase=False)

X_train=count_vectorizer.fit_transform(train_data['tweet']).toarray()
y_train=train_data['label']

#### Test Data

In [None]:
# Here we only transform, because the vocabulary is the one learned from the train set
X_test=count_vectorizer.transform(test_data['tweet']).toarray()
y_test=test_data['label']
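The reason for calling only `transform` on the test tweets is that the columns of the matrix are fixed by the training vocabulary. A minimal sketch of that idea in plain Python, with hypothetical helpers `fit_vocabulary` and `transform` and made-up texts:

```python
def fit_vocabulary(texts):
    """Map each distinct word of the training texts to a column index."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocabulary):
    """Count words per text, skipping words that are absent from the vocabulary."""
    rows = []
    for t in texts:
        row = [0] * len(vocabulary)
        for w in t.split():
            if w in vocabulary:
                row[vocabulary[w]] += 1
        rows.append(row)
    return rows

train = ["love this weather", "hate this traffic"]  # toy training texts
test = ["love this traffic jam"]                    # "jam" never appears in training

vocabulary = fit_vocabulary(train)
X_test_example = transform(test, vocabulary)

print(vocabulary)
print(X_test_example)
```

The test row has the same number of columns as the training vocabulary, and the unseen word "jam" is simply dropped.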

#### Model execution

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

classificator=GaussianNB()

classificator.fit(X_train,y_train)
#Prediction
y_predicted=classificator.predict(X_test)



#Metrics Raw
accuracy_value=accuracy_score(y_test,y_predicted)
precision_value=precision_score(y_test,y_predicted,average='weighted')
f1_value=f1_score(y_test,y_predicted,average='weighted')


initial_word_count_conclusion=pandas.DataFrame(
    {
    'Our Results':[accuracy_value,precision_value,f1_value],
    'Winners Result':[0.7347,0.6304,0.8006]
    } , index=['Accuracy','Precision','F1-Value'])

### TF-IDF Vectorization

#### Train Data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=False)

X_train=tf_idf_vectorizer.fit_transform(train_data['tweet']).toarray()
y_train=train_data['label']

#### Test Data

In [None]:
X_test=tf_idf_vectorizer.transform(test_data['tweet']).toarray()
y_test=test_data['label']

#### Model execution

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

classificator=GaussianNB()

classificator.fit(X_train,y_train)
#Prediction
y_predicted=classificator.predict(X_test)



#Metrics Raw
accuracy_value=accuracy_score(y_test,y_predicted)
precision_value=precision_score(y_test,y_predicted,average='weighted')
f1_value=f1_score(y_test,y_predicted,average='weighted')


initial_tf_idf_conclusion=pandas.DataFrame(
    {
    'Our Results':[accuracy_value,precision_value,f1_value],
    'Winners Result':[0.7347,0.6304,0.8006]
    } , index=['Accuracy','Precision','F1-Value'])

### Initial Analysis conclusion:

* The differences between the competition results and our naive model are significant. Preprocessing of the data is required to improve these values. Other modelling techniques should also be used.


* Another relevant point is that using simple word counting or TF-IDF doesn't produce any relevant difference in the results achieved. Since neither is clearly better, **we keep TF-IDF** for the rest of the work.


* A naive classifier without preprocessing is no better than a random classifier. Its accuracy is around 50%, which is what random guessing would be expected to achieve.
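When comparing against the reference values used in the cells above, it helps to remember that F1 is the harmonic mean of precision and recall, so it can never exceed the larger of the two. The sketch below applies that check to the two reference values listed as precision (0.6304) and F1 (0.8006), under the assumption that both are positive-class scores. The result suggests the third value may actually be a recall, which is worth verifying against the ranking figure.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_recall(precision, f1_score):
    """Recall that would be needed to reach f1_score at the given precision."""
    return f1_score * precision / (2 * precision - f1_score)

# The second and third reference values used in the comparison tables above.
p, third_value = 0.6304, 0.8006

# Greater than 1, so 0.8006 cannot be a binary F1 next to a precision of 0.6304.
print(implied_recall(p, third_value))

# F1 obtained if the third value is read as recall instead (an assumption).
print(f1(p, third_value))
```

Note also that the cells above compute our own scores with `average='weighted'`, which is not the same quantity as a positive-class score.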


#### Simple Word Counting

In [None]:
initial_word_count_conclusion

#### TF-IDF

In [None]:
initial_tf_idf_conclusion

#### Initial Vocabulary 
Without any preprocessing, the vectorization of the raw data produces a vocabulary with the following number of words:

In [None]:
print(len(tf_idf_vectorizer.vocabulary_))

**This would create an input matrix with 14042 columns. That number is not reasonable for any algorithm. Preprocessing ought to reduce this value**

## Data Preprocessing Pipeline

The poor initial results can be explained by several reasons:

* We are dealing with social network data. That data contains lots of useless information like links, mentions or emojis, and also lots of slang and non-standard vocabulary

* We didn't apply any type of Stemming or Lemmatization

* We used a simple Gaussian naïve Bayes rather than a complex neural network.

* We used the built-in tokenizer of scikit-learn rather than a dedicated one.

* The initial vocabulary is very large, about 14k words. This decreases performance and increases overfitting to the train data.

* A final problem that could be related to the poor performance of the initial model is the traditional difficulty NLP algorithms have in detecting contradictions. Irony often implies a contradiction.



#### Remove Web Hyperlinks from tweets

In [None]:
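# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
train_data['tweet']=train_data['tweet'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)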

#### Normalization of Twitter mentions

On Twitter, a reply mentions the person being replied to by their nickname. On this social network, mentions are written as "@" followed by a single word that identifies the nickname. Here they are removed from the tweets.

In [None]:
train_data['tweet']=train_data['tweet'].str.replace(r'@\S+', '', regex=True)

#### Remove Numbers from the tweets

Numbers don't add discriminative power when identifying ironic tweets.

In [None]:
train_data['tweet']=train_data['tweet'].str.replace(r'\d+', '', regex=True)

#### Remove Hashtags

Treat hashtags as simple words by removing the "#" before them.

In [None]:
train_data['tweet']=train_data['tweet'].str.replace('#', '', regex=False)
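The four removals above can also be gathered in one function, which makes it easy to apply exactly the same cleaning to both the train and the test tweets. This is a minimal sketch using only the standard `re` module; `clean_tweet` is a hypothetical helper, and the final whitespace collapse is an addition of this sketch, not a step of the cells above.

```python
import re

URL_RE = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)
MENTION_RE = re.compile(r'@\S+')
NUMBER_RE = re.compile(r'\d+')

def clean_tweet(text):
    """Remove links, mentions, numbers and '#' signs, in that order."""
    text = URL_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    text = NUMBER_RE.sub('', text)
    text = text.replace('#', '')
    # Collapse the gaps left behind by the removals (addition of this sketch).
    return ' '.join(text.split())

print(clean_tweet('@user check http://t.co/abc #irony 2018 WWW.example.com/x'))
```

With pandas, the function could be applied as `train_data['tweet'].apply(clean_tweet)` and, in the same way, to `test_data['tweet']`.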

#### Convert emojis and emoticons to text

For this job we use one of several open-source libraries, emoji, chosen because it was the best rated one we found on the web.

In [None]:
import emoji
import re

def emoji_to_text(tweet):
    # Replace each emoji by its textual name, which the library wraps in ":"
    text = emoji.demojize(tweet)
    # Turn every ":", "_" and "-" into a space (same effect as the three separate substitutions)
    return re.sub(r"[:_-]", " ", text)

# apply() avoids the chained assignment train_data['tweet'][i] = ..., which pandas warns about
train_data['tweet'] = train_data['tweet'].apply(emoji_to_text)

### Contractions

Contractions is a third-party Python library that deals with a specific task in NLP for the English language: expanding contractions. Words like "Can't" would otherwise be split at the apostrophe during tokenization. To avoid this, we convert such cases to their expanded form, e.g. "Can not".

In [None]:
import contractions

# Expand the contractions of every tweet; apply() avoids chained assignment
train_data['tweet'] = train_data['tweet'].apply(contractions.fix)
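To see why the expansion matters, the sketch below tokenizes a tweet with the regular expression that scikit-learn documents as the default `token_pattern` of its vectorizers, before and after expansion. `TOY_EXPANSIONS` and `expand` are hypothetical stand-ins used only for illustration; they are not the behaviour of the contractions library.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers: words of 2+ characters.
TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')

# Toy mapping for illustration only.
TOY_EXPANSIONS = {"can't": "can not", "don't": "do not"}

def expand(text):
    """Replace each known contraction by its expanded form."""
    for short, full in TOY_EXPANSIONS.items():
        text = text.replace(short, full)
    return text

tweet = "I can't wait for Monday"
print(TOKEN_RE.findall(tweet))          # the negation is lost
print(TOKEN_RE.findall(expand(tweet)))  # "not" survives as its own token
```

Without the expansion, "can't" is reduced to "can" and the negation disappears, which matters for a task where meaning often hinges on a reversal.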

### Lemmatization
According to Wikipedia: "In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning."

In [None]:
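from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus, e.g. nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

lemmatization_result = []

for tweet_text in train_data['tweet']:
    # lemmatize() works on a single word, so the tweet is lemmatized token by token
    tweet = ' '.join(lemmatizer.lemmatize(w) for w in tweet_text.split())
    lemmatization_result.append(tweet)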

### Tokenization and Stemming with normalization of Uppercase words to Lowercase.

The goal is to reduce the vocabulary to what is strictly necessary.

In [None]:
import nltk
import re
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

ps = PorterStemmer()
# Build the stop word set once instead of once per word
stop_words = set(stopwords.words('english'))
corpus = []

for i in range(len(lemmatization_result)):
    # replace every non-word character (anything but letters, digits and "_") with a space
    tweet = re.sub(r'\W', ' ', lemmatization_result[i])
    # to lower-case and tokenize
    tweet = tweet.lower().split()
    # stemming and stop word removal
    tweet = ' '.join([ps.stem(w) for w in tweet if w not in stop_words])
    corpus.append(tweet)

* After the preprocessing pipeline the useful vocabulary was reduced to 7588 words, a decrease of roughly 46% from the initial 14042. This reduction will help boost the performance of the models.

In [None]:
_dummy_vectorizer=TfidfVectorizer()
_dummy_vectorizer.fit_transform(corpus).toarray()

print(len(_dummy_vectorizer.vocabulary_))

### The Vocabulary Dilemma

The performance of the algorithms depends on the size of the vocabulary. From our analysis, the best score (0.64) was obtained with a vocabulary of 800 words.

In [None]:
import matplotlib.pyplot as plt

xpoints = numpy.array([1,10,100,200,500,600,700,800,900,1000,2000,3000,4000,5000,6000,7588])
ypoints = numpy.array([0.395, 0.61,0.61,0.61,0.61,0.62,0.60,0.64,0.61,0.62,0.49,0.49,0.46,0.48,0.50,0.49])

plt.plot(xpoints, ypoints)
plt.xlabel("Number of Words")
plt.ylabel("Performance of Naive Bayes")
plt.show()
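One way to cap the vocabulary at a given size is to keep only the most frequent words, which is how scikit-learn documents its `max_features` option (terms ordered by frequency across the corpus). The sketch below mimics that selection in plain Python on a toy corpus; `top_k_vocabulary` and the sample texts are hypothetical.

```python
from collections import Counter

def top_k_vocabulary(texts, k):
    """Keep the k words with the highest total count across all texts."""
    totals = Counter(w for t in texts for w in t.split())
    return [w for w, _ in totals.most_common(k)]

corpus_example = ["so much fun", "so so tired", "so much today"]  # toy corpus
print(top_k_vocabulary(corpus_example, 2))
```

Rare words such as "tired" or "today" are dropped first, which is the kind of word that tends to cause overfitting in a large vocabulary.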

##### Helper Functions

In [None]:
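def refresh_data():
    # Rebuild the train and test matrices with the vocabulary capped at 800 words
    vectorizer = TfidfVectorizer(max_features = 800)

    X_train=vectorizer.fit_transform(corpus).toarray()
    y_train=train_data['label'].copy()
    
    # Note: the test tweets are vectorized as loaded; the cleaning steps above were only applied to the train set
    X_test=vectorizer.transform(test_data['tweet']).toarray()
    y_test=test_data['label']
    
    return (X_train,y_train,X_test,y_test)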

In [None]:
def print_stats():
    
    print(classification_report(y_test,y_predicted))
    print(confusion_matrix(y_test, y_predicted))
    accuracy_value=accuracy_score(y_test,y_predicted)
    precision_value=precision_score(y_test,y_predicted,average='weighted')
    f1_value=f1_score(y_test,y_predicted,average='weighted')

    actual_conclusion=pandas.DataFrame(
        {
        'Our Results':[accuracy_value,precision_value,f1_value],
        'Winners Result':[0.7347,0.6304,0.8006]
        } , index=['Accuracy','Precision','F1-Value'])

    print(actual_conclusion.head())

#### Naive Bayes

In [None]:
classifier=GaussianNB()

X_train,y_train,X_test,y_test=refresh_data()

classifier.fit(X_train,y_train)

y_predicted=classifier.predict(X_test)

print_stats()

#### Neural Network

In [None]:
# Neural Network
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# refresh_data() already returns the vectorized test set and its labels
X_train,y_train,X_test,y_test=refresh_data()

scaler = StandardScaler()
classificator = MLPClassifier(max_iter=1500,activation='relu',solver='adam')

# Fit the scaler on the training data only, then apply it to both sets
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

classificator.fit(X_train, y_train)
y_predicted = classificator.predict(X_test)

print_stats()
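The scaling step in the cell above learns its statistics from the training matrix and reuses them on the test matrix. A minimal sketch of that idea in plain Python, assuming the population standard deviation and a divisor of 1 for constant columns (which is how scikit-learn documents `StandardScaler`); `fit_scaler`, `apply_scaler` and the toy rows are hypothetical.

```python
import statistics

def fit_scaler(rows):
    """Per-column mean and population standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant column has zero deviation; dividing by 1 keeps it at 0 after centring.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Centre and scale each value with the statistics learned on the training rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train_rows = [[0.0, 2.0], [2.0, 2.0]]  # toy feature rows
test_rows = [[1.0, 3.0]]

means, stds = fit_scaler(train_rows)
print(apply_scaler(train_rows, means, stds))
print(apply_scaler(test_rows, means, stds))
```

The test row is transformed with the training mean and deviation, so no information from the test set leaks into the model.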

#### Decision Tree

In [None]:
# Decision Tree

from sklearn.tree import DecisionTreeClassifier

X_train,y_train,X_test,y_test=refresh_data()

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_predicted = classifier.predict(X_test)

print_stats()

#### K-Nearest Neighbors

In [None]:
# KNN
from sklearn.neighbors import KNeighborsClassifier

X_train,y_train,X_test,y_test=refresh_data()

classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
y_predicted = classifier.predict(X_test)

print_stats()

#### SVM

In [None]:
from sklearn.svm import SVC

X_train,y_train,X_test,y_test=refresh_data()

classifier = SVC()
classifier.fit(X_train, y_train)
y_predicted=classifier.predict(X_test)

print_stats()

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

X_train,y_train,X_test,y_test=refresh_data()

classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
y_predicted = classifier.predict(X_test)

print_stats()

### Conclusions For Task 1:

TODO

## Task 2 - Different Types of Irony

Visit the notebook by clicking [here](notebook2.ipynb)