# Feature extraction and Dataset building
This notebook demonstrates how to convert the [\*SEM 2012 shared task](http://www.clips.ua.ac.be/sem2012-st-neg/) corpus (CD-SCO), extract features from it, and generate a Hugging Face dataset for future use.

## Reading and preprocessing
The data is of the following structure:
- Each line of the data is a word or punctuation mark of a sentence. Sentences are separated by empty lines.
- The first seven columns consist of basic information about the word: file name, sentence number, token number, the word, its lemma, part of speech, and syntactic phrase tree.
- **If the sentence does not contain negation, the eighth column will be `***` and this will be the end of the row.**
- If the sentence does contain negation, the eighth column will be the negation cue, the ninth column the negation scope and the tenth column the negated event.
- If the word does not belong to one of these negation elements, the corresponding column is marked with `_`.
- **If the sentence contains two or three negations, these three columns are repeated, so the row has 13 or 16 columns respectively. (The maximum number of negations in a sentence across the dataset is four, resulting in 19 columns in total.)**
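
The column layout above can be unpacked mechanically. The sketch below is only an illustration and is not the code in `convert_conll.py`: the function name `parse_row` and the keys of the returned dict are hypothetical, and the sample row (file name, parse-tree fragment, labels) is made up for the example.

```python
def parse_row(line):
    """Split one tab-separated corpus row into basic info and negation triples."""
    cols = line.rstrip("\n").split("\t")
    basic = cols[:7]      # file name, sentence no., token no., word, lemma, POS, parse tree
    rest = cols[7:]
    if rest == ["***"]:   # sentence without negation: the row ends after the eighth column
        negations = []
    else:
        # Every negation contributes one (cue, scope, event) triple
        negations = [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return {"basic": basic, "negations": negations}


# Made-up row with two negations, i.e. 7 + 2 * 3 = 13 columns
row = "\t".join(["doc", "0", "3", "never", "never", "RB", "*",
                 "never", "_", "_", "_", "never", "_"])
parsed = parse_row(row)
print(len(parsed["negations"]))  # 2
```

The number of negations therefore follows directly from the row length: a row with 7 + 3k columns carries k negations.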

### The goal is to separate sentences with multiple negations by repeating them. That is, every row that has more than 10 columns is turned into rows of exactly 10 columns.

For example, if a sentence has 3 negations, columns 1-7 hold the basic information, columns 8-10 the information of the first negation, columns 11-13 that of the second and columns 14-16 that of the third. **We want to separate this sentence into three copies, whose rows consist of columns 1-10, columns 1-7 plus 11-13, and columns 1-7 plus 14-16 respectively.**

Implementation details can be found in `convert_conll.py`. With this script, we can convert the original CoNLL-format file into the 10-column format described above.
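
The conversion can be pictured as follows. This is a sketch under stated assumptions, not the contents of `convert_conll.py`: the helper name `split_negations` is hypothetical, each row is taken to be a list of column strings, and sentences with at most one negation are assumed to pass through unchanged (the actual script may treat them differently). The sample sentence is made up.

```python
def split_negations(sentence_rows):
    """Return one copy of the sentence per negation, each row cut down to 10 columns."""
    n_cols = len(sentence_rows[0])
    if n_cols <= 10:
        # Zero or one negation: nothing to split (assumption: rows are kept as they are)
        return [sentence_rows]
    n_negations = (n_cols - 7) // 3
    copies = []
    for k in range(n_negations):
        start = 7 + 3 * k  # 0-based index of the k-th (cue, scope, event) triple
        copies.append([row[:7] + row[start:start + 3] for row in sentence_rows])
    return copies


# Made-up two-token sentence with two negations (13 columns per row)
sentence = [
    ["doc", "0", "0", "not", "not", "RB", "*", "not", "_", "_", "_", "not", "_"],
    ["doc", "0", "1", "unhappy", "unhappy", "JJ", "*", "_", "unhappy", "_", "un", "happy", "happy"],
]
for copy in split_negations(sentence):
    # Row lengths and the cue column of each copy
    print([len(row) for row in copy], [row[7] for row in copy])
```

Each copy keeps the seven basic columns and exactly one negation triple, so every resulting row has 10 columns.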

In [1]:
def extract_sentences_and_labels(df):
    """
    Gets a list of sentences, spaces per sentence, negation cues and a list of negation scopes.
    Each sentence is a list of token dicts, spaces per sentence is a list of booleans,
    each negation cue is a list of token indexes and each negation scope is a list of
    the strings 'True'/'False' (one per token).
    """
    sentences,spaces,neg_cues,scope_sentences = [],[],[],[]
    sentence,sent_spaces,neg_cue,neg_scope = [],[],[],[]
    n = 0
    for i, row in df.iterrows():
        token_dict = row.to_dict()
        # Check for the start of a new sentence (token_id == 0)
        if row['token_id'] == 0:
            n = 0
            sentences.append(sentence)
            sentence = [token_dict]
            
            spaces.append(sent_spaces)
            sent_spaces = [False if i == len(df)-1 else df['pos_tag'].iloc[i+1].isalpha()]

            scope_sentences.append(neg_scope)
            neg_scope = [str(row['negation_scope'] != '_')]
                
            neg_cues.append(neg_cue)
            neg_cue = []
        else:
            n += 1
            sentence.append(token_dict)
            neg_scope.append(str(row['negation_scope'] != '_'))
            sent_spaces.append(False if i == len(df)-1 else df['pos_tag'].iloc[i+1].isalpha())
            
        if row['negation_word'] != '_' and row['negation_word'] != '***':
            neg_cue.append(n)
            
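    # The loop only stores a sentence when the next one starts, so store the last one here
    sentences.append(sentence)
    spaces.append(sent_spaces)
    neg_cues.append(neg_cue)
    scope_sentences.append(neg_scope)

    # Drop the empty placeholder stored before the first sentence
    return sentences[1:], spaces[1:], neg_cues[1:], scope_sentences[1:]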

## Feature engineering
For traditional feature engineering, there are many available features to extract, including dependency relations obtained with [spaCy](https://spacy.io/).

`extract_features.py` contains the complete feature extraction implementation. See details in the script.

Note that we use the GPU for dependency parsing, which significantly increases speed. If your device does not have GPU support, comment out `spacy.require_gpu()` in the script.

In [None]:
import extract_features
import pandas as pd

In [8]:
def filter_and_extract(filename):
    raw_df = pd.read_csv(filename, sep='\t', names=['document_id', 'sentence_id', 'token_id', 'token', 'lemma', 'pos_tag', 'parsing_tree', 'negation_word', 'negation_scope', 'negation_event'])
    sentences, spaces, neg_cues, scope_sentences = extract_sentences_and_labels(raw_df)
    
    X = []
    for sentence, sent_spaces, neg_cue in zip(sentences, spaces, neg_cues):
        X.append(extract_features.extract_sentence_features(sentence, sent_spaces, neg_cue, extract_features.dependency_parser))
    
    return X, scope_sentences

In [9]:
X_train, y_train = filter_and_extract('converted/converted_train.tsv')

Let's inspect the first sentence from the training set:

In [12]:
X_train[0]

[{'token': 'Chapter',
  'neg_type': '',
  'lemma': 'Chapter',
  'pos_tag': 'NN',
  'is_neg': False,
  'same_segment': True,
  'common_ancester': '',
  'cue_distance': 0,
  'dependency_relation': 'compound',
  'dependency_distance': 0,
  'dependency_path': 0},
 {'token': '1.',
  'neg_type': '',
  'lemma': '1.',
  'pos_tag': 'CD',
  'is_neg': False,
  'same_segment': True,
  'common_ancester': '',
  'cue_distance': 0,
  'dependency_relation': 'ROOT',
  'dependency_distance': 0,
  'dependency_path': 0},
 {'token': 'Mr.',
  'neg_type': '',
  'lemma': 'Mr.',
  'pos_tag': 'NNP',
  'is_neg': False,
  'same_segment': True,
  'common_ancester': '',
  'cue_distance': 0,
  'dependency_relation': 'pobj',
  'dependency_distance': 0,
  'dependency_path': 0},
 {'token': 'Sherlock',
  'neg_type': '',
  'lemma': 'Sherlock',
  'pos_tag': 'NNP',
  'is_neg': False,
  'same_segment': True,
  'common_ancester': '',
  'cue_distance': 0,
  'dependency_relation': 'conj',
  'dependency_distance': 0,
  'dependen

## Generate Hugging Face Dataset for future use
In this section, we rearrange the dataset and convert it to the Hugging Face `Dataset` type (see details [here](https://huggingface.co/docs/datasets/v2.16.1/en/package_reference/main_classes#datasets.Dataset.from_pandas)) for convenience.

Since BERT-based approaches do not need traditional feature engineering, we extract only the id, the gold negation scopes and the tokens.

We follow the Augment method described in [NegBERT (Khandelwal, et al. 2020)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.704.pdf).
That is, we add a special token ([NEG]) immediately before the negation cue:
> This is [NEG] not a sentence.

Note that **the special token and the negation cue are treated as a single token**. That is, the actual token sequence looks like
> 'This' 'is' **'[NEG] not'** 'a' 'sentence' '.'

In [13]:
import datasets

In [21]:
def data_augment(Xdict, ydict):
    '''
    Rearranges the dataset with the Augment method described above:
    every negation cue token is prefixed with the special token [NEG].
    '''
    redict = []
    for x in range(len(Xdict)):
        tokens, labels = [], []
        for i, item in enumerate(Xdict[x]):
            # Skip entries whose token is not a string (e.g. NaN)
            if not isinstance(item['token'], str):
                continue

            # Prefix the negation cue with [NEG]; other tokens are kept as they are
            tokens.append(f"[NEG] {item['token']}" if item['is_neg'] else item['token'])
            # Gold scope labels are stored as the strings 'True'/'False'
            labels.append(1 if ydict[x][i] == 'True' else 0)

        redict.append({'id': x, 'negation_scope_tags': labels, 'tokens': tokens})
    return redict

In [19]:
def to_huggingface_dataset(Xdict, ydict):
    datadict = data_augment(Xdict, ydict)
    ds = datasets.Dataset.from_pandas(pd.DataFrame(data=datadict))
    return ds

In [16]:
X_dev, y_dev = filter_and_extract('converted/converted_fulldev.tsv')
X_test, y_test = filter_and_extract('converted/converted_test_circle.tsv')
X_test2, y_test2 = filter_and_extract('converted/converted_test_cardboard.tsv')

The original corpus provides two test sets (circle and cardboard, as in the file names above). We combine them to form the full test set.

In [17]:
X_test_full = X_test + X_test2
y_test_full = y_test + y_test2

In [22]:
train_ds = to_huggingface_dataset(X_train, y_train)
dev_ds = to_huggingface_dataset(X_dev, y_dev)
test_full_ds = to_huggingface_dataset(X_test_full, y_test_full)

We can inspect a sentence:

In [23]:
train_ds[10]

{'id': 10,
 'negation_scope_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  0,
  1,
  1,
  1,
  1,
  0],
 'tokens': ['Holmes',
  'was',
  'sitting',
  'with',
  'his',
  'back',
  'to',
  'me',
  ',',
  'and',
  'I',
  'had',
  'given',
  'him',
  '[NEG] no',
  'sign',
  'of',
  'my',
  'occupation',
  '.']}
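
To check how the labels line up with the tokens, the two lists can be zipped together. The sketch below uses a plain dict copied from the output above instead of the `datasets` object, and the helper name `scope_tokens` is hypothetical.

```python
def scope_tokens(example):
    """Return the tokens tagged as inside the negation scope (tag 1)."""
    pairs = zip(example["tokens"], example["negation_scope_tags"])
    return [token for token, tag in pairs if tag == 1]


# Plain-dict copy of train_ds[10] as printed above
example = {
    "id": 10,
    "negation_scope_tags": [0] * 10 + [1, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    "tokens": ["Holmes", "was", "sitting", "with", "his", "back", "to", "me", ",", "and",
               "I", "had", "given", "him", "[NEG] no", "sign", "of", "my", "occupation", "."],
}
print(" ".join(scope_tokens(example)))  # I had given him sign of my occupation
```

In this example the cue token `[NEG] no` carries tag 0, so the cue itself is not part of the gold scope.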

### The datasets can now be saved to disk or pushed to the Hugging Face Hub

To save to disk, use `train_ds.save_to_disk('...')`.

To push to the hub, use `train_ds.push_to_hub("<organization>/<dataset_id>", split="train")`, once per split.

To push to the hub, you need to install `huggingface_hub` and log in, either with `huggingface-cli login` or, from a notebook, with:
```
from huggingface_hub import notebook_login
notebook_login()
```

### You can find this dataset at https://huggingface.co/datasets/dannashao/sem2012forNegbert, and load it with `load_dataset("dannashao/sem2012forNegbert")`