# Sentiment Analysis on Machine Translated Icelandic corpus

Students
- Ólafur Aron Jóhannsson, Eysteinn Örn, Birkir Arndal

Supervisors
- Hrafn Loftsson (hrafn@ru.is)
- Stefán Ólafsson (stefanola@ru.is)


# Contents
1 [Sentiment Analysis on Machine Translated Icelandic corpus](#sentiment-analysis-on-machine-translated-icelandic-corpus)\
2 [Contents](#contents)\
2.1 [Abstract](#abstract)\
2.2 [Introduction](#introduction)\
2.3 [Machine Translations](#machine-translations)\
2.3.1 [Google Translate](#google-translate)\
2.3.2 [Miðeind](#miðeind)\
2.4 [Pre-Processing and feature extraction](#pre-processing-and-feature-extraction)\
3 [Baseline Classifier Evaluation](#baseline-classifier-evaluation)\
3.1 [Support Vector Classifier](#support-vector-classifier)\
3.2 [Logistic Regression](#logistic-regression)\
3.3 [Naive Bayes](#naive-bayes)\
3.4 [Testing](#testing)\
4 [Next Steps](#next-steps)\
5 [Burndown Chart](#burndown-chart)\
6 [Risk Analysis](#risk-analysis)\
7 [Meeting Notes](#meeting-notes)


## Abstract


Translating English text into low-resource languages and assessing sentiment is a subject that has received extensive research attention for numerous languages, yet Icelandic remains relatively unexplored in this context. We leverage a range of baseline classifiers and deep learning models to investigate whether sentiment can be effectively conveyed across languages, even when employing machine translation services such as Google Translate and Miðeind machine translation.


## Introduction

In this study we used an IMDB dataset of 50,000 reviews, each labeled as either positive or negative in sentiment, with 25,000 positive and 25,000 negative. We translated these reviews with both Google Translate and Miðeind Translate, and then analyzed all three datasets (the original English version and the two translations) with three baseline classifiers. The primary objective was to investigate whether machine translation influences the results of sentiment analysis and to determine which of the Miðeind and Google translations performs better. In short, we aimed to assess how well sentiment transfers across machine translation.

## Machine Translations

We employed the Google Translator API, which relies on Google's Neural Machine Translation featuring an LSTM architecture. Additionally, we utilized the Miðeind Vélþýðing API for the purpose of machine-translating the reviews. The Miðeind Vélþýðing API is constructed using the multilingual BART model, which was trained using the Fairseq sequence modeling toolkit within the PyTorch framework.

### Google Translate

All the reviews were effectively translated using the API, and the only preprocessing step performed on the raw data was the removal of \<br\/\>. The absence of errors during the translation process could be attributed to the API's maturity and extensive user adoption. Nevertheless, it's worth noting that the quality of Icelandic language reviews occasionally exhibited idiosyncrasies.

### Miðeind

The Miðeind Translator encountered challenges when translating the English corpus into Icelandic. To prepare the text for translation, several preprocessing steps were necessary. These steps included consolidating consecutive punctuation marks, eliminating all HTML tags, ensuring there was a whitespace character following punctuation marks, and removing asterisks. Subsequently, we divided the reviews into segments of 128 tokens, which were then processed in batches by the Miðeind translator.
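The cleanup and segmentation steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function names `clean_review` and `split_into_segments` are hypothetical, the punctuation set is an assumption, and "tokens" are approximated by whitespace-separated words, which may differ from the tokenization the Miðeind API expects.

```python
import re

PUNCTUATION = ".,!?;:"  # assumption: the punctuation marks that needed cleanup


def clean_review(text: str) -> str:
    """Apply the four cleanup steps described above to one raw review."""
    text = re.sub(r"<[^>]+>", " ", text)          # eliminate HTML tags such as <br/>
    text = text.replace("*", "")                  # remove asterisks
    # consolidate runs of the same punctuation mark ("!!!" -> "!")
    text = re.sub(rf"([{PUNCTUATION}])\1+", r"\1", text)
    # ensure whitespace after punctuation that is directly followed by a word
    # (simplification: this would also split decimals such as "3.5")
    text = re.sub(rf"([{PUNCTUATION}])(?=[^\s{PUNCTUATION}])", r"\1 ", text)
    return re.sub(r"\s+", " ", text).strip()      # tidy up leftover whitespace


def split_into_segments(text: str, max_tokens: int = 128) -> list[str]:
    """Split a cleaned review into segments of at most max_tokens words."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


cleaned = clean_review("Great movie!!!<br /><br />Loved it...so **good**")
segments = split_into_segments(" ".join(["word"] * 300))
```

Each segment can then be sent to the translator in a batch and the translated segments joined back into one review.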

## Pre-Processing and feature extraction

We lowercased, tokenized, and lemmatized the original English dataset and removed stop words. The same steps were applied to the Icelandic machine-translated corpora. In addition, we added the prefix _NEG to Icelandic words that were deemed negative, to help the vectorizer pick up negative remarks.

We created three classifier pipelines that serve as the baseline for scoring the English dataset and the machine-translated Google and Miðeind datasets. All classifiers use a TF-IDF vectorizer, which weights how often a term occurs in a document by how rare that term is across all documents, so that terms distinctive to a document score higher than terms that appear everywhere. The scoring of these terms can be seen in the [Logistic Regression](#logistic-regression) section.
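For reference, the weighting can be written as below. This assumes the Scikit-learn `TfidfVectorizer` with its default settings, which this report does not state explicitly. Here $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, $n$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. With those defaults, each document vector is additionally scaled to unit Euclidean length.

\begin{align}
&\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\\
&\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\end{align}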

![](machine_learning.png)

# Baseline Classifier Evaluation

We used the classifiers available in the Scikit-learn Python package to implement our machine learning models. These models were trained with their default parameters, and no hyperparameter tuning was conducted. Better results could likely be obtained by tuning the hyperparameters.
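A minimal sketch of what such default-parameter baselines could look like in Scikit-learn is shown below. The exact estimator classes are assumptions: `LinearSVC` is chosen because the SVC is described as linear, and `MultinomialNB` because of the "multinomial" wording in the Naive Bayes section. The function name `build_baselines` and the toy corpus are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_baselines():
    """Return one TF-IDF + classifier pipeline per baseline, all with defaults."""
    return {
        # Assumption: LinearSVC stands in for the linear SVC used in the report.
        "svc": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "logistic_regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
        # Assumption: MultinomialNB stands in for the Naive Bayes baseline.
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    }


# Tiny made-up corpus, only to show that the pipelines fit and predict.
toy_texts = ["a wonderful film", "truly great acting", "a terrible plot", "boring and awful"]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

predictions = {}
for name, pipeline in build_baselines().items():
    pipeline.fit(toy_texts, toy_labels)
    predictions[name] = pipeline.predict(["great film"])
```

Bundling the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training reviews only, and the same pipeline can be reused unchanged for the English, Google, and Miðeind datasets.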

When assessing the statistical measures to gauge the model's performance, we applied equations 1, 2, 3, and 4.

\begin{align}
&\text{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}
\\
&\text{Recall} = \frac{TP}{TP+FN}
\\
&\text{Precision} = \frac{TP}{TP+FP}
\\
&\text{F1 Score} = \frac{2(\text{Recall} \cdot \text{Precision})}{\text{Recall}+\text{Precision}}
\end{align}

True Positive (TP) refers to correctly identified positive sentiments, while False Positive (FP) signifies incorrectly identified positive sentiments. True Negative (TN) denotes correctly identified negative sentiments, and False Negative (FN) represents incorrectly identified negative sentiments.
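As a worked example, the four measures can be computed directly from the confusion counts. The sketch below uses the standard definition of accuracy, (TP + TN) divided by all predictions. The function name `classification_metrics` and the counts are hypothetical and are not results from this project.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, recall, precision and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts for 200 reviews, chosen only for illustration.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

With these counts the accuracy is 170/200 = 0.85, the precision is 90/100 = 0.90, the recall is 90/110, and the F1 score is 180/210.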

The data was divided into training and test sets, with 67% (33,500 reviews) allocated for training the models and 33% (16,500 reviews) reserved for testing the model's performance.
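A split of this size could be produced with Scikit-learn's `train_test_split`, as sketched below. The placeholder data and the fixed `random_state` are assumptions for illustration. The report does not state which seed was used or whether the split was stratified.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 50,000 reviews and their labels.
reviews = [f"review {i}" for i in range(50_000)]
sentiments = ["positive", "negative"] * 25_000

train_texts, test_texts, train_labels, test_labels = train_test_split(
    reviews,
    sentiments,
    test_size=0.33,   # 33% of the reviews are held out for testing
    random_state=42,  # assumption: a fixed seed to make the split reproducible
)

print(len(train_texts), len(test_texts))
```

Using the same seed for the English, Google, and Miðeind datasets would keep the same reviews in the test set for all three, which makes the scores directly comparable.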

![](English_Classification_Report.png)

![](Icelandic_Google_Classification_Report.png)

![](Icelandic_Miðeind_Classification_Report.png)

In these classification reports covering all classifiers, we observe that the Support Vector Classifier (SVC) outperformed the other models on the data. Taking SVC as our baseline model for comparison and the weighted F1 score as our evaluation metric, the results across the datasets are as follows: the English dataset reached an F1 score of 89.67%, the Google dataset 89.33%, and the Miðeind dataset 88.36%. These figures suggest that sentiment can carry across machine translation when state-of-the-art machine translation APIs are used. The loss from translation is small: the weighted F1 score drops by 0.34 percentage points for Google and by 1.31 percentage points for Miðeind, which favors Google.
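The drops quoted above are plain differences between the weighted F1 scores, expressed in percentage points. A short check using only the figures reported in this section (the variable names are made up for illustration):

```python
# Weighted F1 scores (in percent) for the SVC baseline, as reported above.
weighted_f1 = {"English": 89.67, "Google": 89.33, "Miðeind": 88.36}

# Drop relative to the English baseline, in percentage points.
drops = {
    name: round(weighted_f1["English"] - score, 2)
    for name, score in weighted_f1.items()
    if name != "English"
}
print(drops)
```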

## Support Vector Classifier

The Support Vector Classifier (SVC) was the best-performing machine learning algorithm for classifying sentiment. It is a linear binary classification algorithm, where the result is defined as zero or one.

| English Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90     | 0.89   | 0.90     |
| positive              |  0.89     | 0.91   | 0.90     |

| Google Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90   | 0.88 | 0.89   |
| positive              |  0.89   | 0.90 | 0.89   |

| Miðeind Sentiment  | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.89   | 0.88 | 0.88   |
| positive              |  0.88   | 0.89 | 0.89   |

When the classifier is trained, it gives us a list of coefficients that represent the relationship between the input variables and the output variable in the model. Each coefficient can be interpreted as the relative importance of a word for the class it points toward, in this case negative or positive. The charts below show the 10 words with the most negative and the most positive coefficients. For a review to be classified as positive, the model output has to be one.

![SVC Score](SVC_English_Important.png)

![SVC Score](SVC_Google_Important.png)

![SVC Score](SVC_Miðeind_Important.png)

## Logistic Regression

Logistic Regression is a binary classification algorithm, where the result is defined as zero or one.


| English Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90   | 0.88 | 0.89   |
| positive              |  0.89   | 0.91 | 0.90   |

| Google Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.90   | 0.88 | 0.89   |
| positive              |  0.88   | 0.90 | 0.89   |

| Miðeind Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.88   | 0.87 | 0.88   |
| positive              |  0.87   | 0.89 | 0.88   |


When the classifier is trained, it gives us a list of coefficients that represent the relationship between the input variables and the output variable in the model. Each coefficient can be interpreted as the relative importance of a word for the class it points toward, in this case negative or positive. The charts below show the 10 words with the most negative and the most positive coefficients. For a review to be classified as positive, the model output has to be one.

![LR Score](LR_English_Important.png)

![LR Score](LR_Google_Important.png)

![LR Score](LR_Miðeind_Important.png)

## Naive Bayes

Naive Bayes is a classifier for multinomial models, although we employed it here for binary classification.

| English Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.85   | 0.87 | 0.86   |
| positive              |  0.87   | 0.84 | 0.86   |

| Google Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.85   | 0.88 | 0.86   |
| positive              |  0.88   | 0.85 | 0.86   |

| Miðeind Sentiment     | Precision | Recall | F1-Score |
|-----------------------|-----------|--------|----------|
| negative              |  0.84   | 0.87 | 0.86   |
| positive              |  0.87   | 0.84 | 0.85   |

## Testing



# Next Steps

The next steps are to use classifiers that support scalable sentiment and to start looking into deep learning models such as BERT.

# Burndown Chart

Given the research-oriented nature of our project, as opposed to corporate work, we opted for a Kanban approach rather than Scrum. We started well in advance, with initial preparations and research activities commencing in late July to early August. This timeframe allowed us to familiarize ourselves with the intricacies of machine learning, particularly since only one team member possessed prior experience in Machine Learning and Deep Learning.

We've allocated a collective 40 hours per week for all team members, distributed across a span of 20 weeks, aiming to complete this project within this timeframe. This amounts to a total of 800 hours dedicated to the project. We expect the burndown to go under the planned line in October since we are picking up the pace but still keeping the 40 hours as a median.

The total hours spent so far by each team member are:
- Ólafur Aron Jóhannsson (100)
- Eysteinn Örn (44)
- Birkir Arndal (50)

Ólafur Aron has accumulated more hours because he will be abroad from late November until early December, during which period his hours will be reduced.

![JIRA](jira.png)

![Burndown](burndown.png)

# Risk Analysis

| Risk                                  | Likelihood (1-5)  | Impact (1-5)  | Responsibility    | Mitigation Strategy |
|---------------------------------------|-------------------|---------------|-------------------|---------------------|
| Resource Constraints (Time and Computing Power) | 4       | 5             | Eysteinn          | Prioritize key features and models that are critical to the project. Consider using cloud computing resources. |
| Training stops/Computer crashes       | 4                 | 4             | Ólafur            | Regular backups and distributed training could mitigate this risk. |
| Sprint/Project delay                  | 3                 | 5             | Ólafur            | Address the problem in the stand-ups and through frequent reassessments. |
| Incompatibility of Translation APIs   | 3                 | 3             | Birkir            | Have fallback methods for each API, and make the system modular to easily swap out one service for another. |
| Classifier Model Inefficiency         | 3                 | 3             | Eysteinn          | Use baseline models for initial testing before using more complex models like BERT, RoBERTa, and IceBERT. |
| Overfitting in Model Training         | 2                 | 4             | Birkir            | Utilize techniques such as cross-validation and dropout layers. |
| Illness in team                       | 2                 | 4             | Whole Team        | Cross-training and comprehensive documentation can help other team members pick up the slack. Try not to get other team members sick. |
| API Rate Limiting or Costs            | 2                 | 3             | Birkir            | Caching translated data and batch processing could help in minimizing the number of API calls. |
| A team member quits                   | 1                 | 5             | Whole Team        | Having a documented and modular project architecture allows for easier transition of responsibilities. |
| External Dependency Failures (APIs down) | 1              | 2             | Whole Team        | Have a contingency plan, such as a local translation model; otherwise wait and focus on a different task. |


# Meeting Notes

In addition to all the data gathered, we also kept meeting notes going back as far as 26 July.

![1](1.png)

![2](2.png)

![3](3.png)

![4](4.png)