# Sentiment Analysis on a Machine-Translated Icelandic Corpus

- Ólafur Aron Jóhannsson
- Eysteinn Örn
- Birkir Arndal



# Contents
1. [Abstract](#abstract)
2. [Introduction](#introduction)
3. [Machine Translations](#machine-translations)
4. [Google Translate](#google-translate)
5. [Miðeind](#miðeind)
6. [Pre-Processing and feature extraction](#pre-processing-and-feature-extraction)


## Abstract


Sentiment analysis of text machine-translated from English into low-resource languages has been studied extensively for many languages, yet Icelandic remains relatively unexplored in this context. We use a range of baseline classifiers and deep learning models to investigate whether sentiment is preserved across languages when the text is translated by machine translation services such as Google Translate and Miðeind's machine translation.


## Introduction

We used an IMDB dataset of 50,000 reviews, each labeled as either positive or negative. We translated these reviews with both Google Translate and Miðeind Translate, and then ran three baseline classifiers on all three datasets: the original English version and the two translations. The primary objective was to determine whether machine translation affects the results of sentiment analysis, and which of the two translations, Miðeind or Google, performs better. More broadly, we aim to assess how well sentiment transfers through machine translation.

## Machine Translations

We used the Google Translator API, which relies on Google's Neural Machine Translation with an LSTM architecture, and the Miðeind Vélþýðing API to machine-translate the reviews. The Miðeind Vélþýðing API is built on the multilingual BART model, which was trained with the Fairseq sequence modeling toolkit in PyTorch.

### Google Translate

All reviews were translated successfully with the API; the only preprocessing applied to the raw data was the removal of `<br />` tags. The absence of errors during translation may be attributable to the API's maturity and wide adoption. That said, the translated Icelandic reviews occasionally contained idiosyncratic phrasing.

### Miðeind

The Miðeind translator had issues translating the raw data. Before translation we:

- removed all HTML tags
- collapsed repeated punctuation into a single mark
- inserted a whitespace after punctuation
- removed all asterisks
- split each review into chunks of at most 128 tokens, always splitting on punctuation so that the translator kept enough context for each chunk
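The cleaning and chunking steps for the Miðeind input could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clean_review` and `split_into_chunks` are hypothetical, "tokens" are approximated by whitespace-separated words, and the exact punctuation set is an assumption.

```python
import re

MAX_TOKENS = 128  # chunk limit stated in the text; words stand in for tokens here


def clean_review(text: str) -> str:
    """Clean one raw review before it is sent to the translator."""
    text = re.sub(r"<[^>]+>", " ", text)              # remove HTML tags such as <br />
    text = text.replace("*", "")                       # remove asterisks
    text = re.sub(r"([.!?,;:])\1+", r"\1", text)       # collapse repeats: "!!!" -> "!"
    # Insert a space after punctuation that is directly followed by a letter.
    text = re.sub(r"([.!?,;:])(?=[^\W\d_])", r"\1 ", text)
    return re.sub(r"\s+", " ", text).strip()           # normalize whitespace


def split_into_chunks(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens is kept intact in this sketch.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)       # split after sentence-ending marks
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


cleaned = clean_review("Great movie!!!<br /><br />Loved it.But **slow** ending...")
chunks = split_into_chunks("One two three. Four five. Six.", max_tokens=5)
```

Splitting only at sentence boundaries means every chunk handed to the translator is made of complete sentences, which is the property the text asks for.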







## Pre-Processing and feature extraction

Once the reviews were translated into Icelandic, we lowercased, tokenized, and lemmatized the words.
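A rough sketch of this step is shown below. The source does not say which Icelandic lemmatizer was used, so the lookup table `TOY_LEMMAS` is a hypothetical stand-in covering only the example sentence, and `preprocess` is an illustrative name; a real pipeline would plug a proper lemmatizer into the `lemmatize` parameter.

```python
import re
from typing import Callable

# Hypothetical, deliberately tiny lookup standing in for a real Icelandic lemmatizer.
TOY_LEMMAS = {"myndin": "mynd", "var": "vera", "leikararnir": "leikari"}


def toy_lemmatize(token: str) -> str:
    """Return the lemma from the toy table, or the token unchanged."""
    return TOY_LEMMAS.get(token, token)


def preprocess(text: str, lemmatize: Callable[[str], str] = toy_lemmatize) -> list[str]:
    """Lowercase, tokenize on runs of letters, then lemmatize each token."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # Unicode letters only
    return [lemmatize(token) for token in tokens]


lemmas = preprocess("Myndin var frábær, leikararnir líka!")
```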



# Baseline Classifiers

We ran three baseline classifiers on each dataset: Multinomial Naive Bayes, SVC, and Logistic Regression. All of them use TF-IDF vectorization.
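To make the TF-IDF weighting concrete, here is a small pure-Python sketch. It uses a smoothed IDF with L2 normalization, the kind of formulation scikit-learn's `TfidfVectorizer` applies by default; the vectorizer settings actually used in this project are not stated, so treat the formula and the function name `tfidf` as assumptions.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute L2-normalized TF-IDF weights for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Two toy lemmatized reviews; "mynd" occurs in both, so it gets a lower weight.
vectors = tfidf([["frábær", "mynd"], ["lélegur", "mynd"]])
```

Terms shared by every review, such as "mynd" here, carry little weight, while terms that distinguish reviews, such as "frábær" and "lélegur", carry more.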


### Original English Dataset

### Google Translate Icelandic Dataset

| Classifier            | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| *MultinomialNB*       |           |        |          |
| negative              |  0.8448   | 0.8810 | 0.8625   |         
| positive              |  0.8770   | 0.8398 | 0.8580   |           
| *SVC*                 |           |        |          |
| negative              |  0.8977   | 0.8907 | 0.8942   |
| positive              |  0.8926   | 0.8995 | 0.8960   |
| *Logistic Regression* |           |        |          |
| negative              |  0.8963   | 0.8808 | 0.8885   |
| positive              |  0.8840   | 0.8991 | 0.8915   |
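As a sanity check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The sketch below does this for the rows of the table above; the helper name `f1_score` is illustrative, and small differences are expected because the table values are rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the Google Translate table.
rows = {
    ("MultinomialNB", "negative"): (0.8448, 0.8810, 0.8625),
    ("MultinomialNB", "positive"): (0.8770, 0.8398, 0.8580),
    ("SVC", "negative"): (0.8977, 0.8907, 0.8942),
    ("SVC", "positive"): (0.8926, 0.8995, 0.8960),
    ("Logistic Regression", "negative"): (0.8963, 0.8808, 0.8885),
    ("Logistic Regression", "positive"): (0.8840, 0.8991, 0.8915),
}

# Largest gap between the recomputed and the reported F1 across all rows.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```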

### Miðeind Icelandic Dataset







## Naive Bayes

## Logistic Regression

Logistic Regression is a binary classification algorithm, where the result is either zero or one. Training the classifier yields a list of coefficients that represent the relationship between the input variables and the output variable in the model. Each coefficient can be interpreted as the relative importance of a word for the class it points towards, in this case negative or positive.

The chart shows the ten words with the most negative and the most positive coefficients. A sentence is classified as positive when the model predicts class one.

Some examples from our tests:

- (hræðilegur frábær) Positive, score is 1.124940
- (slæmur vel besta) Positive, score is 4.491666
- (lélegur vel) Negative, score is 0.107679

Negative                   |  Positive
:-------------------------:|:-------------------------:
![Negative Score](negative_lr.jpg)  | ![Positive Score](positive_lr.jpg)  
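The way per-word coefficients combine into a prediction can be sketched as follows. The coefficient values and the intercept are made up for illustration and are not the ones learned in this project; the sketch also treats each word as a simple count instead of a TF-IDF weight, and it does not reproduce the scores listed above.

```python
import math

# Hypothetical coefficients for illustration only: positive values push a
# sentence towards class one (positive), negative values towards class zero.
COEFFICIENTS = {"frábær": 2.0, "vel": 1.0, "hræðilegur": -1.5, "lélegur": -2.5}
INTERCEPT = 0.0


def positive_probability(tokens: list[str]) -> float:
    """Sum the coefficients of the tokens and squash the total with the sigmoid."""
    logit = INTERCEPT + sum(COEFFICIENTS.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def classify(tokens: list[str]) -> str:
    """Predict class one (positive) when the probability is at least 0.5."""
    return "positive" if positive_probability(tokens) >= 0.5 else "negative"


mixed = classify(["hræðilegur", "frábær"])  # opposing words: the larger coefficient wins
weak = classify(["lélegur", "vel"])
```

With these made-up values, a strongly positive word outweighs a weaker negative one, which is how a sentence containing both can still be classified as positive.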








## Support Vector Machines

# Models

# Results

# Conclusions