Broadly, we turn the CONLL files into a (pickled) list of custom dataclass objects.
The dataclasses are defined in `src/utils/data.py`. If you run into any problems, I would recommend going over them or contacting Priyansh :)
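
To give a feel for the storage format, here is a minimal sketch that pickles a list of dataclass objects and reads it back. `ToyDocument` and the file name `dump.pkl` are made up for this example; the real dataclasses in `src/utils/data.py` have more fields, and the parsers may name their output files differently.

```py
import pickle
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class ToyDocument:
    # Hypothetical stand-in for the real dataclass in src/utils/data.py
    docname: str
    document: List[List[str]] = field(default_factory=list)


def dump_and_load(docs: List[ToyDocument], path: Path) -> List[ToyDocument]:
    """Pickle a list of dataclass objects to `path`, then read it back."""
    with path.open("wb") as f:
        pickle.dump(docs, f)
    with path.open("rb") as f:
        return pickle.load(f)


docs = [ToyDocument("toy/doc_0", [["I", "saw", "a", "dog", "."]])]
with tempfile.TemporaryDirectory() as tmp:
    restored = dump_and_load(docs, Path(tmp) / "dump.pkl")

# Dataclasses compare field by field, so the round trip can be checked directly
print(restored == docs)
```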

## Set Up

This is what the ideal directory tree looks like:

```sh
priyansh@priyansh-ele:~/Dev/research/coref/mtl$ tree -L 4
.
├── data
│   ├── linked
│   ├── manual
│   │   ├── ner_ontonotes_tag_dict.json
│   │   ├── ner_scierc_tag_dict.json
│   │   ├── rel_scierc_tag_dict.json
│   │   └── replacements.json
│   ├── parsed
│   │ <...> 
│   ├── raw
│   │   ├── codicrac-ami
│   │   │   └── AMI_dev.CONLLUA
│   │   ├── codicrac-arrau
│   │   │   └── ARRAU2.0_UA_v3_LDC2021E05.zip
│   │   ├── codicrac-arrau-gnome
│   │   │   └── Gnome_Subset2.CONLL
│   │   ├── codicrac-arrau-pear
│   │   │   └── Pear_Stories.CONLL
│   │   ├── codicrac-arrau-rst
│   │   │   ├── RST_DTreeBank_dev.CONLL
│   │   │   ├── RST_DTreeBank_test.CONLL
│   │   │   └── RST_DTreeBank_train.CONLL
│   │   ├── codicrac-arrau-t91
│   │   │   └── Trains_91.CONLL
│   │   ├── codicrac-arrau-t93
│   │   │   └── Trains_93.CONLL
│   │   ├── codicrac-light
│   │   │   ├── light_dev.CONLLUA
│   │   │   ├── light_dev.CONLLUA.zip
│   │   │   └── __MACOSX
│   │   ├── codicrac-persuasion
│   │   │   └── Persuasion_dev.CONLLUA
│   │   ├── codicrac-switchboard
│   │   │   ├── __MACOSX
│   │   │   ├── Switchboard_3_dev.CONLL
│   │   │   └── Switchboard_3_dev.CONLL_LDC2021E05.zip
│   │   ├── ontonotes
│   │   │   ├── conll-2012
│   │   │   ├── ontonotes-release-5.0
│   │   │   ├── ontonotes-release-5.0_LDC2013T19.tgz
│   │   │   └── v12.tar.gz
│   │   └── scierc
│   │       ├── dev.json
│   │       ├── sciERC_processed.tar.gz
│   │       ├── test.json
│   │       └── train.json
│   └── runs
│       └── ne_coref
│           ├── goldner_all.json
│           ├── goldner_some.json
│           ├── spacyner_all.json
│           └── spacyner_some.json
├── g5k.sh
├── models
│ <...>
├── preproc.sh
├── README.md
├── requirements.txt
├── setup.sh
├── src
│   ├── analysis
│   │   └── ne_coref.py
│   ├── config.py
│   ├── dataiter.py
│   ├── eval.py
│   ├── loops.py
│   ├── models
│   │   ├── autoregressive.py
│   │   ├── embeddings.py
│   │   ├── modules.py
│   │   ├── multitask.py
│   │   ├── _pathfix.py
│   │   └── span_clf.py
│   ├── _pathfix.py
│   ├── playing-with-data-codicrac.ipynb
│   ├── preproc
│   │   ├── codicrac.py
│   │   ├── commons.py
│   │   ├── ontonotes.py
│   │   ├── _pathfix.py
│   │   └── scierc.py
│   ├── run.py
│   ├── utils
│   │   ├── data.py
│   │   ├── exceptions.py
│   │   ├── misc.py
│   │   ├── nlp.py
└── todo.md
```

Focus on the `data/raw` folder, especially the CODICRAC subfolders. This is how we want the raw data to be laid out.
Once it is, we can proceed to preprocess it.
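
If you want to double check the layout before preprocessing, something like the sketch below could help. `missing_raw_dirs` is a hypothetical helper (it is not part of the repo), and the folder list is simply copied from the tree above, restricted to the CODICRAC folders that are loaded later in this document.

```py
from pathlib import Path
from typing import List

# Folder names copied from the directory tree above
EXPECTED_RAW_DIRS = [
    "codicrac-ami", "codicrac-arrau-gnome", "codicrac-arrau-pear",
    "codicrac-arrau-rst", "codicrac-arrau-t91", "codicrac-arrau-t93",
    "codicrac-light", "codicrac-persuasion", "codicrac-switchboard",
]


def missing_raw_dirs(raw_root: Path) -> List[str]:
    """Return the expected sub-folders that are absent or contain no files."""
    missing = []
    for name in EXPECTED_RAW_DIRS:
        folder = raw_root / name
        has_files = folder.is_dir() and any(p.is_file() for p in folder.iterdir())
        if not has_files:
            missing.append(name)
    return missing


# Run from the repository root; an empty list means every folder has at least one file
print(missing_raw_dirs(Path("data/raw")))
```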

## How to Preprocess raw files


```sh
python src/preproc/codicrac.py -a
```

Ideally, the command above should do it for you. If it does not, you can invoke the `CODICRACParser` class yourself with the right path.

For instance:

In [None]:
from pathlib import Path
from preproc.codicrac import CODICRACParser
path_to_conll_file = Path('../data/raw/codicrac-light')
parser = CODICRACParser(path_to_conll_file)

parser.run()

The outputs should now have been written to `../data/parsed/codicrac-light`.
By default, everything is stored in the `data/parsed` directory. 
But you can change that, if you want, by specifying the `write_dir` argument. For example:

In [2]:
from pathlib import Path
from preproc.codicrac import CODICRACParser

path_to_conll_file = Path('../data/raw/codicrac-light')
parser = CODICRACParser(path_to_conll_file, write_dir='../potato/tomato/basilic')

parser.run()

Fixing paths from /home/priyansh/Dev/research/coref/mtl/src
Successfully written 20 at ../data/parsed/codicrac-light.


PS: you can also have a look at the parsers declared in the `run` function at the end of `src/preproc/codicrac.py` to get an idea of how each dataset is preprocessed.

## Loading PreProcessed Data

In [3]:
from dataiter import DocumentReader
dr = DocumentReader(src="codicrac-light")
len(dr)

20

In [4]:
# Accessing this data by index
dr[4].docname, dr[4].coref.spans[:3], dr[4].ner.words[:3], dr[4].ner.tags[:3]

('light_dev/episode_678',
 [[[0, 2], [12, 14], [87, 89]], [[2, 4]], [[4, 6]]],
 [['a', 'candle'], ['a', 'wall'], ['a', 'temple']],
 ['concrete', 'concrete', 'concrete'])

In [5]:
# Accessing this data by loop
for i, instance in enumerate(dr):
    
    if i > 2: break
    
    print(instance.docname)
    

light_dev/episode_6686
light_dev/episode_5399
light_dev/episode_6557


In [6]:
# Let's try and load every dataset
print(len(DocumentReader(src="codicrac-persuasion")))
print(len(DocumentReader(src="codicrac-ami")))
print(len(DocumentReader(src="codicrac-light")))
print(len(DocumentReader(src="codicrac-switchboard")))
print(len(DocumentReader(src="codicrac-arrau-t91")))
print(len(DocumentReader(src="codicrac-arrau-t93")))
print(len(DocumentReader(src="codicrac-arrau-gnome")))
print(len(DocumentReader(src="codicrac-arrau-pear")))
print(len(DocumentReader(src="codicrac-arrau-rst", split="train")))
print(len(DocumentReader(src="codicrac-arrau-rst", split="dev")))
print(len(DocumentReader(src="codicrac-arrau-rst", split="test")))

21
7
20
11
16
98
5
20
335
18
60


## What processed data looks like

A list of `Document` instances (src/utils/data.py).
Each document has the following fields:

**document**: `List[List[str]]`: A list of sentences where each sentence itself is a list of strings. For instance: 

```py
[
    ["I", "see", "a", "little", "silhouette", "of", "a", "man"], 
    ["Scaramouche", "Scaramouche", "will", "you", "do", "the", "Fandango"]
]
```

**pos**: `List[List[str]]`: The same as above except every string is replaced by its POS tag. 
Warning: this is not an optional field. If your document is not annotated with POS tags, you can pass fake POS tags (and simply not use them down the line). You can generate them like this:

In [7]:
from pprint import pprint
from utils.data import Document
doc_text = [
    ["I", "see", "a", "little", "silhouette", "of", "a", "man"], 
    ["Scaramouche", "Scaramouche", "will", "you", "do", "the", "Fandango"]
]
fake_pos = Document.generate_pos_tags(doc_text)
pprint(fake_pos)
print("Corresponding to")
pprint(doc_text)

[['FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE'],
 ['FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE', 'FAKE']]
Corresponding to
[['I', 'see', 'a', 'little', 'silhouette', 'of', 'a', 'man'],
 ['Scaramouche', 'Scaramouche', 'will', 'you', 'do', 'the', 'Fandango']]
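
If you want to see what this boils down to, here is a stand-alone sketch that produces the same shape of output. `fake_pos_like` is a made-up name and this is not the repo's implementation of `Document.generate_pos_tags`; it only assumes that the fake tags mirror the sentence/token structure of the document, as the output above suggests.

```py
from typing import List


def fake_pos_like(document: List[List[str]], tag: str = "FAKE") -> List[List[str]]:
    """Build placeholder POS tags with the same sentence/token shape as `document`."""
    return [[tag for _ in sentence] for sentence in document]


doc_text = [
    ["I", "see", "a", "little", "silhouette", "of", "a", "man"],
    ["Scaramouche", "Scaramouche", "will", "you", "do", "the", "Fandango"],
]
fake_pos = fake_pos_like(doc_text)

# Every sentence keeps its length, so pos[i][j] lines up with document[i][j]
print([len(sentence) for sentence in fake_pos])  # [8, 7]
```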


**docname**: str

**genre**: str 

are both metadata fields that you can use however you want. Ideally, `docname` should contain the name of the document. `genre` can be left empty.

**coref**: Cluster
    
**ner**: NamedEntities
    
**bridging**: BridgingAnaphor
    
**rel**: TypedRelations
    
are the fields which contain task-specific annotations.
All four are represented by their own custom data classes (also found in `src/utils/data.py`).

## Cluster (Coreference Annotations)

Primarily, it consists of a list of spans (indices). 
These indices correspond to the Document.document field (see "I see ... fandango" snippet above). 
So, let's imagine our document looks something like:

```py
# I saw a dog in a car. It was really cute. Its skin was brown with white spots !
doc = [
    ["I", "saw", "a", "dog", "in", "a", "car", "."],                # 8 tokens
    ["It", "was", "really", "cute", "."],                           # 5 tokens
    ["Its", "skin", "was", "brown", "with", "white", "spots", "!"]  # 8 tokens
] # total: 21 tokens
```

The clusters here would be <"I">, <"a dog", "it", and "its">, and <"a car">. That is, two singletons and one cluster with three spans. It would be represented by something like:

```py
clusters = [ 
    [[0, 1]],                        # one cluster with one span (about I) 
    [[2, 4], [8, 9], [13, 14]],      # next cluster with three spans (about the dog)
    [[5, 7]]                         # last cluster (a car) with one span
]
```

A span, by the way, is a list of two integers: a start index and an exclusive end index. `[6, 14]` is a span of tokens 6, 7, 8, 9, 10, 11, 12 and 13 (not 14; this is how Python slicing works).
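
A tiny sketch of that convention, in case it helps. `span_indices` is a made-up helper for illustration, not something from the repo.

```py
from typing import List


def span_indices(span: List[int]) -> List[int]:
    """Token indices covered by a half-open span [start, end)."""
    start, end = span
    return list(range(start, end))


print(span_indices([6, 14]))       # [6, 7, 8, 9, 10, 11, 12, 13]
print(len(span_indices([6, 14])))  # 8, i.e. end - start
```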

In [8]:
# I saw a dog in a car. It was really cute. Its skin was brown with white spots !
doc = [
    ["I", "saw", "a", "dog", "in", "a", "car", "."],                # 8 tokens
    ["It", "was", "really", "cute", "."],                           # 5 tokens
    ["Its", "skin", "was", "brown", "with", "white", "spots", "!"]  # 8 tokens
] # total: 21 tokens

clusters = [ 
    [[0, 1]],                        # one cluster with one span (about I) 
    [[2, 4], [8, 9], [13, 14]],      # next cluster with three spans (about the dog)
    [[5, 7]]                         # last cluster (a car) with one span
]

**Important**: you might notice that you can't directly access the text of a span with `doc[span[0]: span[1]]`. This is because the spans index into a flat list of tokens, whereas the document is structured as a list of sentences. So we need to flatten the document first. You can do that with `to_toks`:

In [9]:
from utils.nlp import to_toks
# to_toks (think of to_tokens)

flattened_doc = to_toks(doc)
print(f"While doc has {len(doc)} items, each representing a sentence, "
      f"a flattened doc has {len(flattened_doc)} items, each representing a token.")

span = [13, 20]
flattened_doc[span[0]: span[1]]

While doc has 3 items, each representing a sentence, a flattened doc has 21 items, each representing a token.


['Its', 'skin', 'was', 'brown', 'with', 'white', 'spots']

The cluster object has other helpful fields as well, such as:

**words**: every span, e.g. `[13, 16]`, replaced by the corresponding list of tokens, e.g. `['the', 'red', 'car']`.

**pos**: the same as **words**, but with every word replaced by its POS tag.

In addition, corresponding to spans, words, and pos, we also have **span_head**, **words_head**, and **pos_head**, which contain each span's head information (as detected by spaCy's span head detection).
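
To make the relation between spans and words concrete, here is a self-contained sketch that derives the words of every cluster from its spans. `flatten` and `clusters_to_words` are made-up names; `flatten` is only assumed to do what `to_toks` does in the cell above, and the real class may compute its `words` field differently.

```py
from itertools import chain
from typing import List

Span = List[int]  # [start, end), end exclusive


def flatten(document: List[List[str]]) -> List[str]:
    """Concatenate all sentences into one flat token list."""
    return list(chain.from_iterable(document))


def clusters_to_words(document: List[List[str]],
                      clusters: List[List[Span]]) -> List[List[List[str]]]:
    """Replace every span in every cluster by the tokens it covers."""
    tokens = flatten(document)
    return [[tokens[start:end] for start, end in cluster] for cluster in clusters]


doc = [
    ["I", "saw", "a", "dog", "in", "a", "car", "."],
    ["It", "was", "really", "cute", "."],
    ["Its", "skin", "was", "brown", "with", "white", "spots", "!"],
]
clusters = [[[0, 1]], [[2, 4], [8, 9], [13, 14]], [[5, 7]]]

for cluster in clusters_to_words(doc, clusters):
    print(cluster)
# [['I']]
# [['a', 'dog'], ['It'], ['Its']]
# [['a', 'car']]
```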

## Other Annotations

### NamedEntities

Similar to `Cluster` above, except we don't need to group spans into clusters. So here, we just have a flat list of spans.

`ner.spans: List[List[int]] = [ [2, 4], [4, 9] ... ]`

Corresponding to each, we also have the NER tag, the words, the POS and the span head information as above.
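
As a rough picture of "one tag per span", consider the sketch below. `ToyNamedEntities` is a hypothetical stand-in (the real `NamedEntities` class lives in `src/utils/data.py` and has more fields), and the tag names `animal` and `vehicle` are invented for the example.

```py
from dataclasses import dataclass
from typing import List


@dataclass
class ToyNamedEntities:
    # Hypothetical stand-in for the real NamedEntities class
    spans: List[List[int]]  # one [start, end) span per entity
    tags: List[str]         # one tag per span, in the same order

    def __post_init__(self) -> None:
        if len(self.spans) != len(self.tags):
            raise ValueError("need exactly one tag per span")

    def words(self, flat_tokens: List[str]) -> List[List[str]]:
        """Tokens covered by each span, given the flattened document."""
        return [flat_tokens[start:end] for start, end in self.spans]


flat_tokens = ["I", "saw", "a", "dog", "in", "a", "car", "."]
ner = ToyNamedEntities(spans=[[2, 4], [5, 7]], tags=["animal", "vehicle"])

for entity_words, tag in zip(ner.words(flat_tokens), ner.tags):
    print(entity_words, tag)
# ['a', 'dog'] animal
# ['a', 'car'] vehicle
```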

### BridgingAnaphors

Each element of `bridging.spans` looks like `[ [2, 5], [9, 13] ]`, i.e., a list of two spans. The first is the anaphor, and the second is the antecedent.

You also have more convenient attributes at your disposal, like `bridging.anaphors` and `bridging.antecedents`, to access them directly.
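
For intuition, splitting the pairs by hand would look roughly like the sketch below. `split_bridging_pairs` is a made-up helper, the second pair is invented for the example, and the real class may compute these attributes differently.

```py
from typing import List, Tuple

Span = List[int]  # [start, end), end exclusive


def split_bridging_pairs(pairs: List[List[Span]]) -> Tuple[List[Span], List[Span]]:
    """Split [anaphor, antecedent] pairs into two parallel lists."""
    anaphors = [pair[0] for pair in pairs]
    antecedents = [pair[1] for pair in pairs]
    return anaphors, antecedents


pairs = [
    [[2, 5], [9, 13]],     # the pair from the text above
    [[20, 22], [14, 16]],  # an invented second pair
]
anaphors, antecedents = split_bridging_pairs(pairs)
print(anaphors)     # [[2, 5], [20, 22]]
print(antecedents)  # [[9, 13], [14, 16]]
```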