In [1]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import os

# Data format

## Original data format

The dataset is originally structured in the following way:
* The `train.csv` file contains one row per classified segment, with its classification tag and the id of the essay it came from (plus other fields; see the preview from `train_csv_df.head()` below).
* The directory `train/` contains the full text of each essay in its own file, named `<id>.txt` after the essay id. We need these files because unlabeled segments of the essays do not appear in `train.csv`.

## Desired data format

For training, we would like to restructure this data:
* Regardless of architecture, the model should take the entire essay as input. Since `train.csv` only contains the segments that have an assigned category, we have to read the full original essays from `train/`.
* For its output, we label each word in the original essay with one of the provided categories (with caveats explained later) or as unannotated. This lets us approach the problem as token classification, as in part-of-speech (POS) tagging or named entity recognition (NER).

In [2]:
# Load `train.csv`
train_csv_df = pd.read_csv('../data/train.csv')

In [3]:
train_csv_df.head()

,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622628000000.0,8.0,229.0,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1,423A1CA112E2,1622628000000.0,230.0,312.0,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2,423A1CA112E2,1622628000000.0,313.0,401.0,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3,423A1CA112E2,1622628000000.0,402.0,758.0,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...
4,423A1CA112E2,1622628000000.0,759.0,886.0,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...


In [4]:
train_csv_df.columns

Index(['id', 'discourse_id', 'discourse_start', 'discourse_end',
       'discourse_text', 'discourse_type', 'discourse_type_num',
       'predictionstring'],
      dtype='object')

In [5]:
len(train_csv_df.id.unique())

15594

For a NER approach, let us enumerate the labels a token can take. To mark where segments start and end, each token is either the beginning of a segment of some discourse type (i.e., the first token of a run with the same discourse type) or an intermediate token inside such a segment. Alternatively, a word may not belong to any argumentative component; such words get a label of their own, unannotated (spelled `Unnanotated` in the code).

The letter that prefixes each NER tag indicates the position of the token within the entity:
* `B-<component name>` indicates the (B)eginning of an entity, and is followed by the argumentative component's name (e.g., "B-Lead").
* `I-<component name>` indicates a token (I)nside the entity opened by the preceding `B-` tag.
* `Unnanotated` indicates the token doesn’t correspond to any entity.
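
To make the tagging scheme concrete, here is a minimal sketch on a made-up sentence. The words and spans are hypothetical and not taken from the dataset; the spans simply mimic the word indices that `predictionstring` lists for each segment.

```python
# Toy example: the sentence and spans are invented for illustration.
words = "Phones are distracting . Schools should ban them now".split()

# Each span is (discourse type, word indices), mirroring `discourse_type`
# and the indices listed in `predictionstring`.
spans = [("Position", [0, 1, 2]), ("Claim", [4, 5, 6, 7])]

# Every word starts out unannotated (label spelled as in this notebook).
tags = ["Unnanotated"] * len(words)
for discourse_type, indices in spans:
    for i in indices:
        tags[i] = f"I-{discourse_type}"
    # The first word of the span opens the entity.
    tags[indices[0]] = f"B-{discourse_type}"

for word, tag in zip(words, tags):
    print(f"{word:12s} {tag}")
```

Words outside every span (here the period and the final word) keep the `Unnanotated` tag.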


In [6]:
labels = ['Unnanotated']
for label in train_csv_df.discourse_type.unique():
    labels.append(f"B-{label.replace(' ', '_')}")
    labels.append(f"I-{label.replace(' ', '_')}")

labels

['Unnanotated',
 'B-Lead',
 'I-Lead',
 'B-Position',
 'I-Position',
 'B-Evidence',
 'I-Evidence',
 'B-Claim',
 'I-Claim',
 'B-Concluding_Statement',
 'I-Concluding_Statement',
 'B-Counterclaim',
 'I-Counterclaim',
 'B-Rebuttal',
 'I-Rebuttal']

Construct two lookup dicts, `id2label` and `label2id`, so that converting between label names and integer ids is an O(1) lookup in either direction.

In [7]:
id2label = {}
label2id = {}

# `label_id` avoids shadowing the built-in `id`.
for label_id, label in enumerate(labels):
    id2label[label_id] = label
    label2id[label] = label_id

id2label, label2id

({0: 'Unnanotated',
  1: 'B-Lead',
  2: 'I-Lead',
  3: 'B-Position',
  4: 'I-Position',
  5: 'B-Evidence',
  6: 'I-Evidence',
  7: 'B-Claim',
  8: 'I-Claim',
  9: 'B-Concluding_Statement',
  10: 'I-Concluding_Statement',
  11: 'B-Counterclaim',
  12: 'I-Counterclaim',
  13: 'B-Rebuttal',
  14: 'I-Rebuttal'},
 {'Unnanotated': 0,
  'B-Lead': 1,
  'I-Lead': 2,
  'B-Position': 3,
  'I-Position': 4,
  'B-Evidence': 5,
  'I-Evidence': 6,
  'B-Claim': 7,
  'I-Claim': 8,
  'B-Concluding_Statement': 9,
  'I-Concluding_Statement': 10,
  'B-Counterclaim': 11,
  'I-Counterclaim': 12,
  'B-Rebuttal': 13,
  'I-Rebuttal': 14})

Now let's read the complete essays from the `train/` directory. As discussed above, these are the model's input. Not every part of an essay has a tag, so concatenating the segments listed in `train.csv` would not necessarily reproduce the full original essay.

In [8]:
essay_ids, essay_contents = list(), list()
train_path = '../data/train/'

# Iterate over all files in the `train/` dir, where each file is named `<id>.txt`.
for filename in tqdm(os.listdir(train_path)):
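    essay_ids.append(filename.split('.')[0])    # Extract essay id from file name.
    # Use a context manager so each file handle is closed promptly.
    with open(os.path.join(train_path, filename), 'r') as essay_file:
        essay_contents.append(essay_file.read())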

data = pd.DataFrame({'id': essay_ids, 'content': essay_contents})

data.head()

100%|██████████| 15594/15594 [00:00<00:00, 107067.05it/s]


,id,content
0,73DC1D49FAD5,eletoral college can be a very good thing caus...
1,D840AC3957E5,"STUDENT_NAME\n\nADDRESS_NAME\n\nFebruary 22, 2..."
2,753E320B186B,In my opinion as a student: I don't agree at t...
3,C2ABDAC2BC2C,When it comes to at home learning and attendin...
4,B2DDBAAC084C,Y\n\nou can ask many different people for advi...


In [9]:
# Essay ids must be unique before they can serve as the index.
assert data['id'].is_unique

data = data.set_index('id')

Now, for NER, let us look up the class of each word in the original essays and build a parallel array of label ids, in word order. This will be our final processed dataset. Here we also distinguish between beginning-of-segment tokens (prefixed with `B-`) and intermediate tokens (prefixed with `I-`).

In [10]:
ner_labels = list()

for essay_id, x in tqdm(data.iterrows(), total=data.shape[0]):
    # Start with every word labeled 0 (`Unnanotated`).
    essay_labels = np.zeros(len(x['content'].split()), dtype=np.int32)

    # Find all rows in `train.csv` that belong to this essay.
    for _, y in train_csv_df[train_csv_df['id'] == essay_id].iterrows():
        segment_label = y['discourse_type']
        segment_indices = [int(idx) for idx in y['predictionstring'].split()]
        for idx in segment_indices:
            essay_labels[idx] = label2id[f"I-{segment_label.replace(' ', '_')}"]
        # Finally, overwrite the first index to mark the beginning of the segment.
        essay_labels[segment_indices[0]] = label2id[f"B-{segment_label.replace(' ', '_')}"]

    ner_labels.append(essay_labels)

data['labels'] = ner_labels

100%|██████████| 15594/15594 [01:01<00:00, 252.64it/s]
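
The loop above filters the whole of `train_csv_df` once per essay, which is likely where most of the minute of runtime goes. A possible alternative, sketched here on toy data, is to group the segments by essay id once and then look each group up by key. The `toy_*` names and values are hypothetical stand-ins for `train_csv_df`, `data`, and `label2id`; no timing is claimed for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `train_csv_df`, `data`, and `label2id` (hypothetical values).
toy_segments = pd.DataFrame({
    "id": ["A", "A", "B"],
    "discourse_type": ["Lead", "Concluding Statement", "Claim"],
    "predictionstring": ["0 1", "3 4", "1 2 3"],
})
toy_essays = pd.DataFrame(
    {"content": ["a b c d e", "v w x y", "p q"]},
    index=pd.Index(["A", "B", "C"], name="id"),
)
toy_label2id = {
    "Unnanotated": 0,
    "B-Lead": 1, "I-Lead": 2,
    "B-Claim": 3, "I-Claim": 4,
    "B-Concluding_Statement": 5, "I-Concluding_Statement": 6,
}

# Group once, so each essay's segments are found by key
# instead of by re-scanning the whole table.
segments_by_essay = {
    essay_id: group for essay_id, group in toy_segments.groupby("id")
}

toy_labels = []
for essay_id, content in toy_essays["content"].items():
    essay_labels = np.zeros(len(content.split()), dtype=np.int32)
    group = segments_by_essay.get(essay_id)
    if group is not None:  # Essays without segments stay all zeros.
        for discourse_type, prediction in zip(
            group["discourse_type"], group["predictionstring"]
        ):
            name = discourse_type.replace(" ", "_")
            indices = [int(i) for i in prediction.split()]
            essay_labels[indices] = toy_label2id[f"I-{name}"]
            essay_labels[indices[0]] = toy_label2id[f"B-{name}"]
    toy_labels.append(essay_labels)

print([labels.tolist() for labels in toy_labels])
```

The labeling logic is unchanged from the cell above; only the lookup of an essay's rows differs.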


In [11]:
data.head()

id,content,labels
73DC1D49FAD5,eletoral college can be a very good thing caus...,"[3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
D840AC3957E5,"STUDENT_NAME\n\nADDRESS_NAME\n\nFebruary 22, 2...","[0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 4, 4, 4, 4, ..."
753E320B186B,In my opinion as a student: I don't agree at t...,"[1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ..."
C2ABDAC2BC2C,When it comes to at home learning and attendin...,"[3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
B2DDBAAC084C,Y\n\nou can ask many different people for advi...,"[3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."


As a sanity check, assert that the length of each labels array matches the number of words in its essay.

In [12]:
for essay_id, row in tqdm(data.iterrows()):
    assert len(row['labels']) == len(row['content'].split())

15594it [00:00, 35134.87it/s]
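
The length check confirms the arrays line up with the words, but not that the tags themselves look right. For spot checks, one option is to collapse a label array back into segments and compare them by eye with the rows of `train.csv`. The helper below is a hypothetical sketch, not part of the original pipeline; `decode_spans` and the toy mapping are names introduced here for illustration.

```python
def decode_spans(label_ids, id2label):
    """Collapse per-word label ids into (discourse_type, start, end) spans.

    `end` is exclusive. An `I-` tag that does not continue the current
    span is treated leniently as the start of a new span.
    """
    spans = []
    current_type, start = None, None
    for i, label_id in enumerate(label_ids):
        label = id2label[int(label_id)]
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == current_type
        if not continues:
            # Close the span that was open, if any.
            if current_type is not None:
                spans.append((current_type, start, i))
            # Open a new span on B-/I- tags; otherwise nothing is open.
            current_type, start = (name, i) if prefix in ("B", "I") else (None, None)
    if current_type is not None:
        spans.append((current_type, start, len(label_ids)))
    return spans


# Toy mapping and labels (hypothetical), in the same format as `id2label`.
toy_id2label = {0: "Unnanotated", 1: "B-Lead", 2: "I-Lead",
                3: "B-Position", 4: "I-Position"}
toy_spans = decode_spans([0, 1, 2, 2, 0, 3, 4, 3, 4], toy_id2label)
print(toy_spans)
```

Note that two adjacent segments of the same type stay separate only because the second one starts with a `B-` tag, which is the reason for the `B-`/`I-` distinction.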


Finally, save the preprocessed dataset.

In [13]:
data.to_csv('../dataset.csv')
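
One caveat when writing `labels` to CSV: pandas stores each NumPy array as its printed string, and NumPy abbreviates arrays longer than its print threshold (1000 elements by default) with `...`, so labels for long essays may not survive a round trip. A possible workaround, sketched on toy data, is to serialize each array as a space-separated string and parse it back on load. The toy frame and the in-memory buffer are assumptions for illustration; the buffer stands in for a file path.

```python
import io

import numpy as np
import pandas as pd

# Toy frame shaped like `data` (hypothetical values).
toy = pd.DataFrame(
    {
        "content": ["a b c", "d e"],
        "labels": [
            np.array([1, 2, 0], dtype=np.int32),
            np.array([3, 4], dtype=np.int32),
        ],
    },
    index=pd.Index(["A", "B"], name="id"),
)

# Serialize each label array as a space-separated string before writing.
out = toy.copy()
out["labels"] = out["labels"].map(lambda arr: " ".join(str(int(v)) for v in arr))

buffer = io.StringIO()  # Stands in for a path such as '../dataset.csv'.
out.to_csv(buffer)
buffer.seek(0)

# Read back and parse the strings into integer arrays again.
restored = pd.read_csv(buffer, index_col="id")
restored["labels"] = restored["labels"].map(
    lambda s: np.array([int(v) for v in s.split()], dtype=np.int32)
)
print(restored.loc["A", "labels"])
```

A binary format that preserves arrays natively would be another option; the string form is shown only because it keeps the CSV output used above.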