Lexical features, semantic features and combination. Use all 
new components: such as ML, or other options from the table


Rubric:

	- Bad approach: Not using feature selection -> is penalizable using hand crafted approaches
		  - ((Using test set for anything, especially for selecting the model))
	- Explain why we use a subset of the features
	- Lexical features, semantic features and combination. Show results at least in feature selection.
  

In [None]:
# Project modules: data loading, feature extraction and model definitions
from code_.data_reader import data_reader
from code_.feature_extractor import Features
from code_.model import Models

# Semantic Textual Similarity

Authors:

    - Benjami Parellada
    - Clara Rivadulla

---


# Introduction

Semantic Textual Similarity (STS) measures the degree of semantic equivalence between two texts. It is closely related to paraphrase detection, where a pair of texts is a paraphrase when both texts express the same meaning with different words. This was a difficult problem to solve before the advent of neural networks and word embeddings, such as Google's `word2vec` in 2013. An "embedding" is a numeric vector representation of natural language text that lets computers capture its context and meaning. The nuances of natural language make it hard for machines to understand the context and meaning of different texts; for example, two sentences can have no words in common and still mean the same thing. STS is related to numerous NLP tasks, such as machine translation, text summarization, machine reading comprehension, and question answering.

Thus, in this project, we will travel back in time, to an era before word embeddings reigned supreme, to understand and extract different features from text in order to compare the similarity between two sentences. Concretely, we use the dataset and task description of the Semantic Textual Similarity task from SemEval-2012 Task 6. Using this data, we extract lexical and syntactic features in order to train a model that is able to detect when two sentences are similar. Moreover, we will comment on and compare the different features in this scenario, explaining which are the most relevant.

## Structure of the project

The whole project is contained in a general folder, `STS`, which contains the following directories with their respective files:
- `code_`: contains the `.py` files used for reading the data (`data_reader.py`), pre-processing it, extracting the desired features (`feature_extractor.py`) and defining the models to be trained (`model.py`).
- `test-gold` and `train`: contain the test and training data (`'SMTeuroparl'` `.txt` files).

## Dataset

The dataset comes from *SemEval-2012 Task 6*, for which 3 datasets from different sources were manually annotated. All three share the same format: each item is a pair of sentences $S_1$ and $S_2$, and the objective is to compute how similar these sentences are to each other. The datasets are:

- MSR-Paraphrase, Microsoft Research Paraphrase Corpus, which contains 750 pairs of sentences taken from thousands of news sources on the web.
- MSR-Video, Microsoft Research Video Description Corpus, which contains 750 pairs of sentences describing a video.
- SMTeuroparl, WMT2008 development dataset (Europarl section), which contains 734 pairs of sentences from French-to-English translations.

These three datasets are given to us for training and model selection. In addition, there are 5 more datasets, not available during training, which contain the evaluation test partition of the data. They are summarized below:

- For MSR-Paraphrase, there are 750 unseen pairs of sentences, on which the final model should be evaluated.
- For MSR-Video, there are 750 unseen pairs of sentences.
- For SMTeuroparl, there are 459 unseen pairs of sentences.

Additionally, two extra datasets are provided, which come from different contexts:
- SMTnews, which contains 399 news conversation sentence pairs from WMT.
- OnWN, which contains 750 pairs of sentences where the first comes from OntoNotes and the second from a WordNet definition.

We describe these datasets to give the context of the task; however, we will not use any external information that could bias our test evaluation metrics. We train the machine learning models and perform feature selection and model selection using only the original training data.
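One common way to keep feature selection and model selection confined to the training data is k-fold cross-validation. The following is only a minimal sketch under that assumption, not necessarily the procedure used in `model.py`; the helper `k_fold_indices` and its parameters are hypothetical names introduced for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split range(n_samples) into k (train_idx, valid_idx) pairs.

    Hypothetical helper: every sample lands in exactly one validation fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, valid_idx))
    return splits


# 750 + 750 + 734 training pairs, as listed in the dataset description.
n_train = 750 + 750 + 734
splits = k_fold_indices(n_train, k=5)
for train_idx, valid_idx in splits:
    # Fit on train_idx and score on valid_idx; the test set is never touched.
    print(len(train_idx), len(valid_idx))
```

Each candidate feature subset or model would be scored on the validation folds only, and the test partition would be used once, for the final evaluation.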

## System Evaluation

As already mentioned, the target we are trying to predict is how similar two sentences are. Given two sentences $S_1$ and $S_2$, we are trying to predict their similarity score. Performance is evaluated using the Pearson product-moment correlation coefficient between the output scores of our system and the human scores, henceforth referred to as the Gold Standard. Reading through the overview paper on SemEval-2012 Task 6, we see that there is some controversy, because the results of all the datasets are concatenated before the Pearson correlation is calculated. This is said to obscure the individual scores on each dataset. Nevertheless, for the purpose of this project, we concatenate all the datasets into one to train and evaluate. We feel it makes more sense to have a global dataset that tries to predict how similar two sentences are, independent of the original context.
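To make the metric concrete, here is a minimal sketch of the Pearson correlation $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$ in plain Python. The function name `pearson` and the example scores are illustrative assumptions, not values from the dataset or code from the project.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores on the 0-5 scale, for illustration only.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [0.4, 1.0, 2.5, 4.0, 4.8]
r = pearson(predicted, gold)
print(f"Pearson r = {r:.3f}")
```

Because the correlation is invariant to shifting and positive rescaling of the predictions, a system can score well even if its outputs do not lie exactly on the 0 to 5 scale, as long as they rank and space the pairs consistently with the Gold Standard.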

Regarding the Gold Standard, human annotators were asked to score each pair according to the following scale:

0. The two sentences are on different topics.
1. The two sentences are not equivalent, but are on the same topic.
2. The two sentences are not equivalent, but share some details.
3. The two sentences are roughly equivalent, but some important information differs/missing.
4. The two sentences are mostly equivalent, but some unimportant details differ.
5. The two sentences are completely equivalent, as they mean the same thing.  

For each sentence pair, the Gold Standard represents the average of 5 scores from different annotators.

# Methodology
After this not-so-short introduction, we present the work we have done in order to complete the STS task. The sections we cover are:
- Preprocessing.
- Feature Extraction.
- Model Selection.
- Feature Importance.
- System Evaluation.


## Preprocessing



## Feature extraction
Before training the model, we tried many different features to see which ones give the best possible results. Here's a short explanation of each one:

- **Jaccard Similarity** between **Tokens**: This is one of the simplest features, which computes the *Jaccard Similarity* between the *tokens* of every pair of sentences. We've tried this both considering and ignoring stopwords.
- **Jaccard Similarity** between **Lemmas**: This is almost the same as the previous one, but using *lemmas* instead of tokens. We've also computed the similarity with and without stopwords.
- **Jaccard Similarity** between **(1, 2 and 3) grams**: We've computed *unigrams*, *bigrams* and *trigrams* (with `nltk`'s `ngrams` function) for every pair of sentences and compared them with the *Jaccard Similarity*. Again, we've computed them both with and without stopwords.
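
The Jaccard features above can be sketched as follows. This is a simplified illustration, not the code in `feature_extractor.py`: the whitespace tokenizer, the tiny `STOPWORDS` set and the local `ngrams` helper (a stand-in for `nltk`'s `ngrams`, so the snippet runs without dependencies) are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given similarity 0
    return len(a & b) / len(union)


def ngrams(tokens, n):
    """Contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"a", "the", "is", "on", "in", "of"}


def tokenize(sentence):
    """Naive lowercase whitespace tokenizer (assumption, for illustration)."""
    return sentence.lower().split()


s1 = "A man is playing a guitar"
s2 = "A man plays the guitar"
t1, t2 = tokenize(s1), tokenize(s2)

token_sim = jaccard(t1, t2)
token_sim_no_stop = jaccard(
    [t for t in t1 if t not in STOPWORDS],
    [t for t in t2 if t not in STOPWORDS],
)
bigram_sim = jaccard(ngrams(t1, 2), ngrams(t2, 2))

print(token_sim, token_sim_no_stop, bigram_sim)
```

In this toy pair, "playing" and "plays" count as different tokens, which is the kind of mismatch that comparing lemmas instead of tokens is meant to reduce.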

In [None]:
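# Load the train/test sentence pairs and their gold scores
x_train, x_test, y_train, y_test = data_reader('')
# Extract every implemented feature for both partitions
train_features = Features(x_train).extract_all()
test_features = Features(x_test).extract_all()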

## Model: *Random Forest*

In [None]:
# Run the Random Forest model on the extracted features
models = Models(train_features, test_features, y_train, y_test)
models.RF()