# Create the C -> A dataset to be used by the fine-tuned BERT classifier

## Questions and thoughts
- Tutorial: https://huggingface.co/docs/transformers/custom_datasets
- Context texts must be limited to 512 tokens (the input limit of the BERT model).
- When labeling the dataset, should the labels be start and end, or start and inside? Other projects (with answer extraction) seem to use start and end.
- Another option is to insert a highlight token around the sentence containing the answer, and then append the answers after a [SEP] token. As in:
- There are multiple answer spans in the same context text. Should those be labeled jointly, or should I have multiple instances of the same text?
- My idea is to use the original text, no stopword removal or lemmatization.
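
To make the "start and inside" option with joint labeling concrete, here is a minimal sketch. It assumes whitespace tokens and the encoding 0 = outside, 1 = answer start, 2 = inside an answer, which is a guess at what the three label values in the data mean. The function name `label_tokens`, the example sentence, and the span indices are hypothetical. A real pipeline would work on BERT word-piece tokens and truncate at 512 tokens.

```python
from typing import List, Tuple


def label_tokens(tokens: List[str], answer_spans: List[Tuple[int, int]]) -> List[int]:
    """Label every token: 0 = outside, 1 = answer start, 2 = inside an answer.

    answer_spans holds (start, end) token indices with end exclusive.
    All spans are written into the same label sequence (joint labeling).
    """
    labels = [0] * len(tokens)
    for start, end in answer_spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span ({start}, {end}) is outside the token range")
        labels[start] = 1                 # first token of the answer
        for i in range(start + 1, end):
            labels[i] = 2                 # remaining tokens of the answer
    return labels


# Hypothetical example with two answer spans in one context text.
tokens = "Stockholm is the capital of Sweden and lies by the Baltic Sea".split()
spans = [(0, 1), (10, 12)]  # "Stockholm" and "Baltic Sea"
labels = label_tokens(tokens, spans)
print(list(zip(tokens, labels)))
```

With joint labeling the context text appears once in the dataset. The alternative (one instance per answer span) would call `label_tokens` once per span and store each result as a separate row.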

In [2]:
# necessary library imports
import pandas as pd
import numpy as np

In [3]:
# data imports, to be combined into the final datastructure
CA_df = pd.read_pickle("./data/labeled_CA_training_data.pkl")
CAR_df = pd.read_pickle("./data/labeled_CAR_training_data.pkl")
CRA_df = pd.read_pickle("./data/labeled_CRA_training_data.pkl")

In [4]:
# compute the class weights to use in the training of the C -> A model (to account for the scarce dataset)
def get_class_distribution(labeled_df):
    num_labels = 0
    num_zeros = 0
    num_ones = 0
    num_twos = 0
    # each row holds one sequence of per-token labels
    for labels in labeled_df['labels']:
        for label in labels:
            num_labels += 1
            if label == 0:
                num_zeros += 1
            elif label == 1:
                num_ones += 1
            else:
                # any value other than 0 or 1 is counted as class 2
                num_twos += 1
    print('num labels: ', num_labels)
    print('num zeros: ', num_zeros)
    print('num ones: ', num_ones)
    print('num twos: ', num_twos)

    # inverse-frequency weights, scaled to unit L2 norm (assumes every class occurs at least once)
    weights = np.array([1/num_zeros, 1/num_ones, 1/num_twos])
    norm = np.linalg.norm(weights)
    normal_array = weights/norm
    print(normal_array)




In [5]:
get_class_distribution(CA_df)

num labels:  210487
num zeros:  204714
num ones:  1453
num twos:  4320
[0.00672723 0.9478027  0.31878642]


In [6]:
get_class_distribution(CAR_df)

num labels:  861146
num zeros:  792465
num ones:  3805
num twos:  64876
[0.00479318 0.99827303 0.05854906]


In [7]:
get_class_distribution(CRA_df)

num labels:  868756
num zeros:  862905
num ones:  1477
num twos:  4374
[0.0016217  0.94744016 0.31992892]
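
The weights printed above are inverse class frequencies scaled to unit L2 norm. As a cross-check, the sketch below recomputes them directly from the printed counts with vectorized NumPy. The helper name `class_weights_from_counts` and the small `toy_labels` example are hypothetical and not part of the notebook above.

```python
import numpy as np


def class_weights_from_counts(counts):
    """Return inverse-frequency class weights scaled to unit L2 norm."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts <= 0):
        raise ValueError("every class needs at least one occurrence")
    inverse = 1.0 / counts
    return inverse / np.linalg.norm(inverse)


# Counts for labels 0, 1, 2 as printed for CA_df.
ca_weights = class_weights_from_counts([204714, 1453, 4320])
print(ca_weights)

# Counting can also be vectorized; toy_labels is a made-up stand-in for
# the 'labels' column (non-negative integer labels only).
toy_labels = [[0, 0, 1, 2], [0, 1, 2, 2, 0]]
toy_counts = np.bincount(np.concatenate(toy_labels), minlength=3)
print(toy_counts)
```

Because the vector is normalized, only the ratios between the weights matter. If the weights are passed to a weighted loss, the resulting loss scale depends on that normalization, so the learning rate may need adjusting.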
