# Sentiment Analysis on a Machine-Translated Icelandic Corpus

Students
- Ólafur Aron Jóhannsson, Eysteinn Örn, Birkir Arndal

Supervisors
- Hrafn Loftsson (hrafn@ru.is)
- Stefán Ólafsson (stefanola@ru.is)


# Contents
1 [Abstract](#abstract)\
2 [Introduction](#introduction)\
3 [Machine Translations](#machine-translations)\
3.1 [Google Translate](#google-translate)\
3.2 [Miðeind](#miðeind)\
4 [Pre-Processing and feature extraction](#pre-processing-and-feature-extraction)\
5 [Baseline Classifier Evaluation](#baseline-classifier-evaluation)\
5.1 [Support Vector Classifier](#support-vector-classifier)\
5.2 [Logistic Regression](#logistic-regression)\
5.3 [Naive Bayes](#naive-bayes)\
5.4 [Testing](#testing)


## Abstract

Machine translating English text into a low-resource language, and examining whether sentiment classification still works on the result, has received significant research attention for various low-resource languages. However, Icelandic remains relatively underexplored in this domain.

In this study, we employed machine translation to convert English IMDB movie reviews into Icelandic, utilizing Google Translate and Miðeind Vélþýðing. Our investigation aimed to determine if sentiment classification could be effectively transferred across the translation.

We employed a variety of baseline classifiers and deep learning models to assess the performance of the sentiment classification system we developed.


## Introduction

Our motivation for this research endeavour is that there is no Icelandic dataset tailored for sentiment analysis that is open and readily accessible to everyone, and creating one from scratch can be an expensive process, especially for low-resource languages. Utilising machine translation serves as an inexpensive method to create such a dataset and allows us to explore whether similar sentiment analysis results can be emulated. 

Sentiment analysis involves natural language processing and text analysis to identify, extract, and quantify subjective information, such as positive, negative, or neutral sentiments. It has utility for many applications, such as gauging public opinions, enabling businesses to ascertain and categorise customer satisfaction, and providing valuable insights into user-generated content across diverse digital platforms, e.g., customer reviews, complaints, and comments.

We utilized an IMDB dataset comprising 50,000 reviews, each categorized as either positive or negative in sentiment, with 25,000 being positive and 25,000 being negative.

Our methodology involved the translation of these reviews using both Google Translate and Miðeind Translate. Subsequently, we subjected all three datasets, including the original English version and the two translations, to analysis using three baseline classifiers. The primary objective was to investigate whether machine translation exerted any influence on the results of sentiment analysis and to determine the superior performer between Miðeind and Google translations. Our aim was to assess the transferability of sentiment across machine translation processes.

We further assessed the performance of our classifiers and deep learning models using written reviews sourced from an Icelandic website, which comprised 1066 positive reviews and 66 negative reviews.

## Pre-Processing and feature extraction

Pre-processing is the act of transforming raw data to a form that can be used for the next part of the machine learning process. 

We used the following preprocessing steps:
-    Data cleaning: Removes errors and irrelevant data.
-    Tokenization: Breaks sentences or paragraphs into individual words.
-    Lower casing: Normalizes the text and reduces the dimensionality of the dataset.
-    Lemmatization: Converts different forms of the same word to a standardized form. This reduces the number of unique words in the dataset, which helps to improve the performance of NLP tasks.
-    Mark negation: Marks the portion of text that follows a negation word, up to the next punctuation mark, with a _NEG suffix. The purpose of marking negation is to help NLP models understand the context of a sentence.

Preprocessing gives the machine learning algorithms a reliable dataset, which in turn leads to better performance and more accurate results.
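
The cleaning, lower casing, tokenization, and negation-marking steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the function names, the English negation word list, and the set of scope-ending punctuation marks are assumptions, and lemmatization is left out because it requires a language-specific lemmatizer.

```python
import re

# Hypothetical English negation cues; an Icelandic pipeline needs its own list.
NEGATION_WORDS = {"not", "no", "never", "cannot"}
# Punctuation marks that end the scope of a negation (an assumption).
SCOPE_END = {".", ",", ":", ";", "!", "?"}

# A token is a word (optionally with an apostrophe part) or a punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[.,:;!?]")


def strip_html(text):
    """Data cleaning: replace HTML tags such as <br /> with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def mark_negation(tokens):
    """Append _NEG to every token between a negation word and the next punctuation mark."""
    marked, in_scope = [], False
    for token in tokens:
        if token in SCOPE_END:
            in_scope = False          # punctuation closes the negation scope
            marked.append(token)
        elif in_scope:
            marked.append(token + "_NEG")
        else:
            marked.append(token)
            if token in NEGATION_WORDS or token.endswith("n't"):
                in_scope = True       # following tokens get the suffix
    return marked


review = "I did not like this movie,<br />but the music was good."
print(mark_negation(tokenize(strip_html(review))))
```

With this marking, "like" and "like_NEG" become separate features, so a vectorizer can assign them different weights.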

For the original English dataset, we removed noise (e.g., HTML tags), lowercased, tokenized, lemmatized, removed stop words, and appended a _NEG suffix to words that follow a negation word, to help the vectorizer pick up negated sentiment.

For all the Icelandic datasets, we removed noise, punctuation (except in abbreviations), and stop words. We also lowercased, tokenized, and lemmatized the texts and marked negations.

For the Miðeind translated dataset we additionally had to remove long nonsense words (e.g., “…BARNABARNABARNAÞÁTTURINN”) and collapse repeated characters into one (e.g., “jááááááá” -> “já”).
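
One way to implement these two extra clean-up steps is sketched below. The function names, the run length of three, and the 30-character cut-off are assumptions chosen for illustration; the thresholds used in the project are not stated.

```python
import re

MAX_WORD_LENGTH = 30  # hypothetical cut-off for "long nonsense words"


def collapse_repeats(text):
    """Replace any run of three or more identical characters with a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", text)


def drop_long_words(text, max_len=MAX_WORD_LENGTH):
    """Remove whitespace-separated words that are longer than max_len characters."""
    return " ".join(word for word in text.split() if len(word) <= max_len)


def clean_translation(text):
    """Collapse repeated characters first, then drop the words that are still too long."""
    return drop_long_words(collapse_repeats(text))


print(collapse_repeats("j\u00e1\u00e1\u00e1\u00e1\u00e1\u00e1\u00e1"))  # -> já
```

Requiring a run of at least three characters leaves legitimate double letters untouched, at the cost of also shortening an ellipsis ("...") to a single full stop.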

Three baseline classifier pipelines were created to serve as a baseline metric for scoring the English dataset and the machine-translated Google and Miðeind datasets. All classifiers use a TF-IDF vectorizer, which weighs how often a term occurs in a document against how many documents in the corpus contain it, so that terms that are frequent in one review but rare overall receive the highest scores. The scoring of these terms is shown in (3.2).
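
To make the weighting concrete, here is a small pure-Python sketch of the textbook TF-IDF formula (term frequency multiplied by the logarithm of the inverse document frequency). The function name and the toy documents are illustrative assumptions. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its absolute values differ from the ones computed here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): "movie" occurs everywhere and gets weight 0.
docs = [["good", "movie"], ["bad", "movie"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that appears in every document, such as "movie" in the toy corpus, carries no information about sentiment and is weighted down to zero.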

![](machine_learning.png)

## Machine Translations

We employed the Google Translator API, which relies on Google's Neural Machine Translation featuring an LSTM architecture. Additionally, we utilized the Miðeind Vélþýðing API for the purpose of machine-translating the reviews. The Miðeind Vélþýðing API is constructed using the multilingual BART model, which was trained using the Fairseq sequence modeling toolkit within the PyTorch framework.

### Google Translate

All the reviews were effectively translated using the API, and the only preprocessing step performed on the raw data was the removal of \<br\/\>. The absence of errors during the translation process could be attributed to the API's maturity and extensive user adoption. Nevertheless, it's worth noting that the quality of Icelandic language reviews occasionally exhibited idiosyncrasies.

### Miðeind

The Miðeind Translator encountered challenges when translating the English corpus into Icelandic. To prepare the text for translation, several preprocessing steps were necessary. These steps included consolidating consecutive punctuation marks, eliminating all HTML tags, ensuring there was a whitespace character following punctuation marks, and removing asterisks. Subsequently, we divided the reviews into segments of 128 tokens, which were then processed in batches by the Miðeind translator.
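
The preparation and segmentation steps could look roughly like the sketch below. It is only an approximation under stated assumptions: tokens are taken to be whitespace-separated words (the translator's own tokenizer may count differently), "consolidating" punctuation is interpreted as keeping the first mark of a run, and the batch size of 2 is a placeholder. The call to the Miðeind API itself is not shown.

```python
import re

PUNCT = ".,!?;:"


def prepare_for_translation(text):
    """Clean a raw review before sending it to the translator."""
    text = re.sub(r"<[^>]+>", " ", text)                        # remove HTML tags
    text = text.replace("*", "")                                 # remove asterisks
    text = re.sub(rf"([{PUNCT}])[{PUNCT}]+", r"\1", text)        # keep first mark of a run
    text = re.sub(rf"([{PUNCT}])(?=\S)", r"\1 ", text)           # whitespace after punctuation
    return re.sub(r"\s+", " ", text).strip()                     # normalize whitespace


def split_into_segments(text, max_tokens=128):
    """Split text into segments of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def batched(items, batch_size):
    """Yield consecutive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


review = prepare_for_translation("Great!!!<br />I *loved* it...so much")
print(review)
for batch in batched(split_into_segments(review, max_tokens=4), batch_size=2):
    print(batch)
```

The translated segments of a review then have to be joined back together in their original order.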


## Baseline Classifier Evaluation

We utilized the classifiers available in the Scikit-learn Python package for implementing our machine learning models. These models were trained with their default parameters, and hyperparameter tuning was not conducted. It is important to note that superior results can be attained by fine-tuning the hyperparameters.

When assessing the statistical measures to gauge the model's performance, we applied equations 1, 2, 3, and 4.

\begin{align}
&\text{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}
\\
&\text{Recall} = \frac{TP}{TP+FN}
\\
&\text{Precision} = \frac{TP}{TP+FP}
\\
&\text{F1 Score} = \frac{2(\text{Recall} \times \text{Precision})}{\text{Recall}+\text{Precision}}
\end{align}

True Positive (TP) refers to positive sentiments correctly identified as positive, while False Positive (FP) signifies negative sentiments incorrectly identified as positive. True Negative (TN) denotes negative sentiments correctly identified as negative, and False Negative (FN) represents positive sentiments incorrectly identified as negative.
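
As a worked example, the standard definitions of the four measures can be computed directly from the confusion counts. The function name and the counts in the usage line are hypothetical and do not come from our experiments; the sketch also assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that are correct
    recall = tp / (tp + fn)                      # share of actual positives that were found
    precision = tp / (tp + fp)                   # share of predicted positives that are correct
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts for illustration only.
print(classification_metrics(tp=8, fp=2, tn=7, fn=3))
```

With these counts, 15 of the 20 predictions are correct, so the accuracy is 0.75, while the precision is 8/10 and the recall is 8/11.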

The data was divided into training and test sets, with 67% (33,500 reviews) allocated for training the models and 33% (16,500 reviews) reserved for testing the model's performance.
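
The split sizes can be reproduced with a few lines of standard-library Python. This is a sketch of a shuffled 67/33 split, not necessarily the exact procedure used in the project (scikit-learn provides `train_test_split` for the same purpose); the function name and the seed are assumptions.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.33, seed=42):
    """Shuffle the sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # fixed seed for a reproducible split
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(50_000)
print(len(train_idx), len(test_idx))  # 33500 16500
```

Using the same seed for the English, Google, and Miðeind datasets keeps the same reviews in the test set for all three, which makes the scores directly comparable.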

Across the classification reports of all classifiers, Support Vector Classification (SVC) outperformed the other models. Taking SVC as our baseline model and the weighted F1 score as our evaluation metric, we obtain the following results: the English dataset reached an F1 score of 89.67%, the Google-translated dataset 89.02%, and the Miðeind-translated dataset 88.14%. These figures suggest that sentiment carries across machine translation when state-of-the-art machine translation APIs are used. The loss from translation is small: the F1 score drops by 1.53 percentage points for Miðeind and by 0.65 percentage points for Google, favoring Google.

### Support Vector Classifier

The SVC (Support Vector Classifier) was the best-performing machine learning algorithm for classifying sentiment. It is a linear binary classification algorithm whose output is either zero or one.

| English Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90     | 0.89   | 0.90     |
| positive              |  0.89     | 0.91   | 0.90     |

| Google Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90   | 0.88 | 0.89   |
| positive              |  0.88   | 0.90 | 0.89   |

| Miðeind Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.89   | 0.87 | 0.88   |
| positive              |  0.88   | 0.89 | 0.88   |

| Official Station Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.   | 0. | 0.   |
| positive              |  0.   | 0. | 0.   |

When the classifier is trained, it yields a list of coefficients that describe the relationship between the input features and the output variable of the model. Each coefficient can be interpreted as the relative importance of a word for the class it points towards, in this case negative or positive. The charts below show the ten most negative and the ten most positive coefficients. A review is classified as positive when the model outputs a value of one.

![SVC Score](SVC_English_Important.png)

![SVC Score](SVC_Google_Important.png)

![SVC Score](SVC_Miðeind_Important.png)
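
The way a linear classifier turns such coefficients into a prediction, and how the top terms in the charts can be ranked, is sketched below. The weights, the bias, and the feature values are hypothetical toy numbers, not coefficients from our trained models.

```python
def decision_score(weights, bias, features):
    """Weighted sum of the feature values (e.g. TF-IDF scores) plus the bias."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias


def predict(weights, bias, features):
    """Return 1 (positive) if the score is above zero, otherwise 0 (negative)."""
    return 1 if decision_score(weights, bias, features) > 0 else 0


def top_terms(weights, k=10):
    """Return the k most negative and the k most positive (term, coefficient) pairs."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    return ranked[:k], ranked[::-1][:k]


# Hypothetical coefficients for illustration only.
weights = {"great": 1.8, "boring": -2.1, "movie": 0.05}
print(predict(weights, 0.0, {"great": 0.6, "movie": 0.3}))   # 1
print(predict(weights, 0.0, {"boring": 0.7, "movie": 0.3}))  # 0
print(top_terms(weights, k=1))
```

Words with large positive coefficients push the score towards the positive class, and words with large negative coefficients push it towards the negative class.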

### Logistic Regression

Logistic Regression is a binary classification algorithm whose output is either zero or one.
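
Internally, logistic regression maps a weighted sum of the features to a probability with the logistic (sigmoid) function and thresholds it to obtain the binary output. The sketch below illustrates this; the weights, the feature values, and the 0.5 threshold are assumptions for illustration, not parameters of our trained model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, mapping any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability that a review is positive, given its feature values."""
    score = sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias
    return sigmoid(score)


def predict(weights, bias, features, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, otherwise 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical weights for illustration only.
weights = {"great": 2.0, "boring": -2.0}
print(predict_proba(weights, 0.0, {"great": 1.0}))
print(predict(weights, 0.0, {"boring": 1.0}))  # 0
```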


| English Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.   | 0. | 0.   |
| positive              |  0.   | 0. | 0.   |

| Google Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90   | 0.88 | 0.89   |
| positive              |  0.88   | 0.90 | 0.89   |

| Miðeind Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.89   | 0.87 | 0.88   |
| positive              |  0.87   | 0.89 | 0.88   |

| Official Station Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.   | 0. | 0.   |
| positive              |  0.   | 0. | 0.   |

When the classifier is trained, it yields a list of coefficients that describe the relationship between the input features and the output variable of the model. Each coefficient can be interpreted as the relative importance of a word for the class it points towards, in this case negative or positive. The charts below show the ten most negative and the ten most positive coefficients. A review is classified as positive when the model outputs a value of one.

![LR Score](LR_English_Important.png)

![LR Score](LR_Google_Important.png)

![LR Score](LR_Miðeind_Important.png)

### Naive Bayes

The Naive Bayes classifier we used is intended for multinomial models, although we employed it for binary classification. Of all the classifiers, it performed the worst.
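
For readers unfamiliar with the method, the following is a compact sketch of multinomial Naive Bayes with Laplace smoothing on raw word counts. It is an illustration under assumptions: the function names and the toy reviews are made up, and our pipeline feeds the classifier TF-IDF features rather than raw counts.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {term for doc in documents for term in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        term_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: all term occurrences in the class plus alpha per vocabulary term.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(documents)),
            "log_likelihood": {
                term: math.log((term_counts[label][term] + alpha) / total)
                for term in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][term]
            for term in doc if term in params["log_likelihood"]
        )
    return max(model, key=score)


# Toy training data (hypothetical).
docs = [["great", "film"], ["loved", "it"], ["boring", "film"], ["hated", "it"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_naive_bayes(docs, labels)
print(predict(model, ["great", "great"]))  # positive
print(predict(model, ["boring"]))          # negative
```

The "naive" assumption is that words occur independently of each other given the class, which is why the log likelihoods of the individual words can simply be added up.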

| English Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.85   | 0.87 | 0.86   |
| positive              |  0.87   | 0.84 | 0.86   |

| Google Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.85   | 0.87 | 0.86   |
| positive              |  0.87   | 0.84 | 0.86   |

| Miðeind Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.84   | 0.86 | 0.85   |
| positive              |  0.86   | 0.84 | 0.85   |

| Official Station Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.   | 0. | 0.   |
| positive              |  0.   | 0. | 0.   |

### Testing

Additional testing of the SVC classifiers was carried out on Icelandic text from the website https://www.officialstation.com/kvikmyndir, which is run by an Icelandic author who appears to review movies in Icelandic (see screenshot).

![Station](station.png)


We picked 10 random reviews written by the author to further verify the SVC classifiers. Of the 10 reviews, one was misclassified by both classifiers and another was misclassified by the Miðeind classifier only, which gives an accuracy of 90% for Google and 80% for Miðeind.

![Testing](testing.png)
