# Project phase 1: Baseline

The goal of this phase is to create a baseline model. Note that the word "baseline" can mean different things. In the course we distinguished three types of baselines:
1. The simplest possible approach (a majority baseline, e.g. labeling every review as positive or every word as a noun)
2. A simple machine learning classifier (e.g. logistic regression with words as features)
3. The "state-of-the-art" approach on which you want to improve (your starting point)

For this phase you need to build a baseline of type 2 or 3.

If you plan to have a research question like "can we improve sentiment detection systems by doing X?", the answer is most relevant if you have a competitive baseline (type 3). In this case we suggest using a BiLSTM or even a transformer-based model, so that you can re-use the baseline for the final research question (phase 3).
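To make the first baseline type concrete, here is a minimal sketch of a majority baseline. It is useful as a sanity check for a type 2 or 3 model, not as a submission for this phase. The toy labels and the helper names `fit_majority` and `accuracy` are invented for illustration.

```python
from collections import Counter

def fit_majority(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy labels, invented for illustration only
train_labels = ["positive", "positive", "negative", "positive"]
dev_labels = ["negative", "positive", "positive"]

majority = fit_majority(train_labels)
dev_pred = [majority] * len(dev_labels)  # predict the same label everywhere
print(majority, accuracy(dev_labels, dev_pred))
```

Any type 2 or 3 baseline should score clearly above this number on the dev set; if it does not, something is probably wrong in the pipeline.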

Pick one of the following tasks and build your baseline for it.

## Task 1: Sentiment classification
* The data can be found in the `classification` folder.
* The goal is to predict the label in the `sentiment` field.
* **You have to upload your predictions for `music_reviews_test_masked.json` to CodaLab: https://competitions.codalab.org/competitions/34307?secret_key=af4dce64-f2ab-47c2-bc3c-a04abe2a2725 Note that the format should match the JSON files in the repository, and the file should be zipped.**
* **Make sure to add your group number (can be found on learnit), or the ITU username of at least one member to the submission.**
* **Also fill out the Method description on the CodaLab submission page.**

*Hint: if you do not get a score in CodaLab, you can click on "Download output from scoring step" to see the error.*
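The submission step could be sketched as follows. This is only an illustration under assumptions: it supposes the repository files hold one JSON object per line, and the `reviewText` field, the output file names and the helper `write_submission` are hypothetical. Only the `sentiment` field name comes from the task description, so check the actual files before relying on it.

```python
import json
import os
import tempfile
import zipfile

def write_submission(records, predictions, json_path, zip_path):
    """Write one JSON object per line with the predicted label, then zip it."""
    with open(json_path, "w", encoding="utf-8") as out:
        for record, label in zip(records, predictions):
            filled = dict(record)        # copy, so the input stays untouched
            filled["sentiment"] = label  # field name taken from the task text
            out.write(json.dumps(filled) + "\n")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # store the file at the top level of the archive
        archive.write(json_path, arcname=os.path.basename(json_path))

# Toy records, invented for illustration; real ones come from the masked test file
records = [{"reviewText": "Great album"}, {"reviewText": "Not for me"}]
predictions = ["positive", "negative"]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "music_reviews_test.json")  # assumed name
    zip_path = os.path.join(tmp, "submission.zip")            # assumed name
    write_submission(records, predictions, json_path, zip_path)
    # Read the archive back to check what would be uploaded
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        lines = archive.read(names[0]).decode("utf-8").splitlines()

print(names, len(lines))
```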

The data can be read as follows:

In [3]:
# our functions
import functions as f

# readers
import gzip
import json
import codecs

In [4]:
PATH = {}
PATH["dataset_classification"] = "dataset/classification/"
PATH["music_reviews_train"] = PATH["dataset_classification"] + "music_reviews_train.json.gz"
PATH["music_reviews_test"] = PATH["dataset_classification"] + "music_reviews_test_masked.json.gz"
PATH["music_reviews_dev"] = PATH["dataset_classification"] + "music_reviews_dev.json.gz"

In [5]:
trainM = f.readJson(PATH["music_reviews_train"])
devM = f.readJson(PATH["music_reviews_dev"])
testM = f.readJson(PATH["music_reviews_test"])

Number of data:  100000
Number of data:  10000
Number of data:  10000
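The helper `f.readJson` lives in the group's own `functions` module and is not shown here. A reader along the following lines would produce similar output, assuming the `.json.gz` files contain one JSON object per line. The function name `read_json_lines`, the `reviewText` field and the toy file are hypothetical.

```python
import gzip
import json
import os
import tempfile

def read_json_lines(path):
    """Read a gzip-compressed file with one JSON object per line."""
    data = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore blank lines
                data.append(json.loads(line))
    print("Number of data: ", len(data))
    return data

# Build a tiny toy file so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(json.dumps({"reviewText": "Great album", "sentiment": "positive"}) + "\n")
        handle.write(json.dumps({"reviewText": "Not for me", "sentiment": "negative"}) + "\n")
    toy = read_json_lines(path)
```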


## Task 2: Sentiment Expression Labeling
* The data can be found in the `seq_labeling` folder
* The goal is to predict the BIO-labels in the third column
* Note that the evaluation metric is Span-F1, which means that you will only get "points" if you get the whole span correct! We provide an evaluation script in `seq_labeling/eval.py`.
* **You have to upload your predictions for `opener_en-test-masked.conll` to CodaLab: https://competitions.codalab.org/competitions/34307?secret_key=af4dce64-f2ab-47c2-bc3c-a04abe2a2725 Note that the format should match the CoNLL files in the repository, and the file should be zipped.**
* **Make sure to add your group number (can be found on learnit), or the ITU username of at least one member to the submission.**
* **Also fill out the Method description on the CodaLab submission page.**

* Note that if you use BERT-based embeddings, you need to make sure that the number of labels matches the number of tokens. This is commonly done by only using the embedding of the first subword of each token.
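The first-subword selection can be sketched as below. The sketch assumes the tokenizer provides, for every subword, the index of the word it belongs to, with `None` for special tokens (Hugging Face fast tokenizers expose such a list via `word_ids()`, but verify this for the tokenizer you use). The function name and the toy input are hypothetical.

```python
def first_subword_indices(word_ids):
    """Return the positions of the first subword of each word.

    Special tokens (word id None) are skipped.
    """
    indices = []
    previous = None
    for position, word_id in enumerate(word_ids):
        if word_id is not None and word_id != previous:
            indices.append(position)
        previous = word_id
    return indices

# Hypothetical subword sequence: [CLS] wonder ##ful things . [SEP]
word_ids = [None, 0, 0, 1, 2, None]
labels = ["B-Positive", "I-Positive", "O"]  # one label per original token

keep = first_subword_indices(word_ids)
assert len(keep) == len(labels)  # embeddings and labels now line up
print(keep)  # [1, 3, 4]
```

The embeddings at the positions in `keep` are the ones passed to the classification layer, so each original token gets exactly one prediction.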


*Hint: if you do not get a score in CodaLab, you can click on "Download output from scoring step" to see the error.*
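To illustrate why Span-F1 is stricter than token accuracy, here is a minimal sketch of the metric. It is not the provided `seq_labeling/eval.py`, which remains the reference and may treat malformed tag sequences differently. In this sketch an `I-` tag without a matching preceding `B-` tag is ignored. The function names and toy sequences are invented.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) spans in a BIO sequence (end exclusive)."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def span_f1(gold_sentences, pred_sentences):
    """Micro-averaged F1 over exactly matching spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [["O", "B-Positive", "I-Positive", "I-Positive", "O"],
        ["B-Negative", "I-Negative", "O"]]
pred = [["O", "B-Positive", "I-Positive", "O", "O"],  # span one token too short
        ["B-Negative", "I-Negative", "O"]]            # exact match

score = span_f1(gold, pred)
print(score)  # 0.5
```

In the first sentence seven of eight tokens overall are tagged correctly, yet the shortened span earns no credit, which is why the score is 0.5 rather than close to 1.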

The data looks as follows:

In [6]:
PATH["dataset_labeling"] = "dataset/seq_labeling/"
PATH["labeling_train"] = PATH["dataset_labeling"] + "opener_en-train.conll"
PATH["labeling_test"] = PATH["dataset_labeling"] + "opener_en-test-masked.conll"
PATH["labeling_dev"] = PATH["dataset_labeling"] + "opener_en-dev.conll"

In [45]:
t = readConllFile(PATH["labeling_train"])

['O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'B-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'O', [...]]
['O', 'O', 'O', 'B-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'O', [...]]
['O', 'O', 'B-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['B-Negative', 'I-Negative', 'O', 'O'

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', [...]]
['B-Positive', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Positive', 'I-Positive', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'O', 'B-Positive', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Positive', 'O', [...]]
['O', 'O', 'O', 'O', 'B-Positive', 'O', 'B-Positive', 'O', 'O', 'O', 'B-Positive', 'O', 'O', 'O', 'O', 'B-Positive', 'O', 'B-Positive', 'I-Positive', [...]]
['O', 'B-Positive', 'I-Positive', 'I-Positive', 'I-Positive', 'O', 'O', 'B-Positive', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Positive', 'I-Positive', 'O', [...]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', [...]]
['O', 'O', 'B-Po

In [52]:
t[1][3][4]

IndexError: list index out of range

In [58]:
def read_arto(file_name):
    """
    Read a tab-separated CoNLL file.

    :param file_name: path to read from
    :returns: list of (words, tags) tuples, one per sentence
    """
    current_words = []
    current_tags = []
    data = []

    with codecs.open(file_name, encoding='utf-8') as conll_file:
        for line in conll_file:
            line = line.strip()

            if line:
                if line[0] == '#':
                    continue  # skip comments
                tok = line.split('\t')
                word = tok[1]  # second column: word
                tag = tok[2]   # third column: BIO label

                current_words.append(word)
                current_tags.append(tag)
            else:
                # a blank line marks a sentence boundary
                if current_words:  # ignore consecutive blank lines
                    data.append((current_words, current_tags))
                current_words = []
                current_tags = []

    # add the last sentence if the file does not end with a blank line
    if current_tags:
        data.append((current_words, current_tags))
    return data

In [59]:
t = read_arto(PATH["labeling_train"])

['India',
 'as',
 'a',
 'country',
 'has',
 'always',
 'fascinated',
 'me',
 'and',
 'all',
 'of',
 'my',
 'friends',
 'who',
 'have',
 'been',
 'there',
 'always',
 'have',
 'wonderful',
 'things',
 'to',
 'say',
 'about',
 'it',
 '.']