## Other features

As mentioned before, POS tags alone are usually not enough to give our algorithm the information it needs to make accurate inferences. Luckily for us, we can provide it with many more features.

In [None]:
from vuelax.tokenisation import index_emoji_tokenize
import pandas as pd
import csv

We start from our already labelled dataset (remember I have a file called `data/to_label.csv`; its labelled version, `data/to_label-done.csv`, is the one loaded below). The following are just a few helper functions to read and augment our dataset:

In [None]:
labelled_data = pd.read_csv("data/to_label-done.csv")
labelled_data.head()

In [None]:
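# Little helper to read from our labelled dataset, one whole offer at a time
def read_whole_offers(dataset):
    current_offer = None
    rows = []
    for _, row in dataset.iterrows():
        if row['offer_id'] != current_offer:
            # A new offer starts: emit the previous one, if there is one.
            # (Without this check an empty list is emitted whenever the
            # first offer_id is not 0.)
            if rows:
                yield rows
            current_offer = row['offer_id']
            rows = []
        rows.append(list(row.values))
    # Emit the last offer
    if rows:
        yield rows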
            
offers = read_whole_offers(labelled_data)
for _ in range(3):
    offer_ids, tokens, positions, pos_tags, labels = zip(*next(offers))
    print(offer_ids)
    print(tokens)
    print(positions)
    print(pos_tags)
    print(labels)
    print()
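
As a point of comparison, the same "one offer at a time" grouping can be sketched with the standard library's `itertools.groupby`, which also groups runs of consecutive rows that share a key. This is only an alternative sketch, not the helper used in the rest of this notebook: the name `group_offers` and the toy rows (tokens, positions, tags and labels) are made up for illustration, and it assumes `offer_id` is the first column.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows in the column order used above:
# offer_id, token, position, pos_tag, label
toy_rows = [
    (0, "CDMX", 0, "NOUN", "o"),
    (0, "a", 1, "ADP", "s"),
    (0, "Madrid", 2, "NOUN", "d"),
    (1, "Vuelos", 0, "NOUN", "n"),
    (1, "baratos", 1, "ADJ", "n"),
]

def group_offers(rows):
    """Yield one list of rows per run of consecutive rows sharing an offer_id."""
    for _, offer_rows in groupby(rows, key=itemgetter(0)):
        yield [list(row) for row in offer_rows]

grouped = list(group_offers(toy_rows))
for offer in grouped:
    offer_ids, tokens, positions, pos_tags, labels = zip(*offer)
    print(offer_ids, tokens, labels)
```

With the real data, something like `group_offers(labelled_data.itertuples(index=False, name=None))` should behave the same way, again assuming `offer_id` comes first in the file.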

## Building our training set  

The features I decided to augment the data with are the following:  

 - Length of each individual token
 - Length of the whole offer (counted in tokens)
 - The POS tag of the token to the left
 - The POS tag of the token to the right
 - Whether the token is uppercase or not
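
A note on the last feature: the check used in this notebook asks whether every single character of the token is uppercase, which is stricter than Python's built-in `str.isupper()`. The small sketch below compares the two on a few made-up tokens; `is_all_uppercase` is a hypothetical name used only here.

```python
def is_all_uppercase(token):
    # Character-by-character check: every character must itself be uppercase,
    # so digits and punctuation make the result False.
    return all(char.isupper() for char in token)

# Made-up tokens, chosen to show where the two checks disagree
examples = ["CDMX", "Madrid", "MEX-MAD", "$9,999"]

comparison = {
    token: (is_all_uppercase(token), token.isupper()) for token in examples
}
for token, (per_char, builtin) in comparison.items():
    print(f"{token:>8}  per-character={per_char}  str.isupper()={builtin}")
```

The two disagree on tokens such as `MEX-MAD`: the hyphen fails the per-character check, while `str.isupper()` ignores characters that have no case.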


In [None]:
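def generate_more_features(tokens, pos_tags):
    # Number of characters in each token
    lengths = [len(token) for token in tokens]
    # Length of the whole offer in tokens, repeated once per token
    n_tokens = [len(tokens)] * len(tokens)
    # Pad the tags so that the first and last tokens also have a neighbour
    augmented = ['<p>'] + list(pos_tags) + ['</p>']
    # True only when every character of the token is uppercase
    uppercase = [all(char.isupper() for char in token) for token in tokens]
    # augmented[i] is the tag left of token i, augmented[i + 2] the tag right of it
    return lengths, n_tokens, augmented[:len(tokens)], augmented[2:], uppercase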
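
To see why the two slices give the left and right neighbours, here is the padding trick on its own, applied to a toy offer. The tokens and POS tags are made up for illustration and do not come from the real dataset.

```python
# Toy offer; the tokens and POS tags are made up for illustration.
tokens = ["CDMX", "a", "Madrid"]
pos_tags = ["NOUN", "ADP", "NOUN"]

# Pad the tag sequence with boundary markers on both ends.
augmented = ["<p>"] + list(pos_tags) + ["</p>"]

# Token i now sits at augmented[i + 1], so its left neighbour is
# augmented[i] and its right neighbour is augmented[i + 2].
pos_left = augmented[:len(tokens)]
pos_right = augmented[2:]

for token, left, tag, right in zip(tokens, pos_left, pos_tags, pos_right):
    print(f"{token:>8}: left={left:<5} tag={tag:<5} right={right}")
```

The first token gets `<p>` as its left neighbour and the last one gets `</p>` as its right neighbour, so every token has a value for both features.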

In [None]:
offers = read_whole_offers(labelled_data)

extended_headers = [
    "offer_id", 
    "token", 
    "position", 
    "pos_tag", 
    "pos_left", 
    "pos_right", 
    "token_length", 
    "token_count",
    "uppercase",
    "label"
]

# newline="" is what the csv module recommends for files handed to csv.writer
with open("data/features-labels.csv", "w", newline="") as w:
    writer = csv.writer(w)
    writer.writerow(extended_headers)
    for offer in offers:
        offer_ids, tokens, positions, pos_tags, labels = zip(*offer)
        lengths, n_tokens, lefts, rights, uppercase = generate_more_features(tokens, pos_tags)
        data = zip(offer_ids, tokens, positions, pos_tags, lefts, rights, lengths, n_tokens, uppercase, labels)
        for row in data:
            writer.writerow(row)

Once this cell has run, the file `data/features-labels.csv` holds our dataset, ready to be used to train our algorithm.
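
One thing to keep in mind when reading the file back: the `csv` module writes every value as text, so numbers and booleans come back as strings unless they are converted. The sketch below shows this with an in-memory buffer instead of the real file; the single row is made up, and only the headers mirror the ones written above.

```python
import csv
import io

headers = [
    "offer_id", "token", "position", "pos_tag", "pos_left",
    "pos_right", "token_length", "token_count", "uppercase", "label",
]
# One made-up row, in the same column order as the headers
row = [0, "CDMX", 0, "NOUN", "<p>", "ADP", 4, 3, True, "o"]

# Write to an in-memory buffer rather than to data/features-labels.csv
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerow(row)

# Read it back: every field is now a string
buffer.seek(0)
records = list(csv.DictReader(buffer))
first = records[0]
print(type(first["token_length"]), repr(first["uppercase"]))
```

Whether that matters depends on how the file is loaded later; a reader that infers column types would hide it, while plain `csv` reading would not.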