# Brat Tags Data Analysis

## IDs in Brat 

- T: text-bound annotation
- R: relation
- E: event
- A: attribute
- M: modification (alias for attribute, for backward compatibility)
- N: normalization [new in v1.3]

<br> <br>

### Annotation ID conventions
All annotations IDs consist of a single upper-case character identifying the annotation type and a number. The initial ID characters relate to annotation types as follows:

<br>

#### Entity annotations
Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a "start end" offset pair).

| ID 	| Type And Span      	| Text     	|
|----	|--------------------	|----------	|
| T1 	| Organization 0 4   	| Sony     	|
| T3 	| Organization 33 41 	| Ericsson 	|
| T3 	| Country 75 81      	| Sweden   	|

<br> 

#### Attribute and modification annotations
Attribute annotations are binary or multi-valued "flags" that specify further aspects of other annotations. Attributes have a unique ID and are defined by reference to the ID of the annotation that the attribute marks and the attribute value.

| ID 	| Type & Entity ID  	|
|----	|-------------------	|
| A1 	| Negation T1       	|
| A2 	| Confidence T2     	|

<br>

#### Relation annotations
Binary relations have a unique ID and are defined by their type (e.g. Origin, Part-of) and their arguments.

| ID 	| Type and Args          	|
|----	|------------------------	|
| R1 	| Origin Arg1:T3 Arg2:T4 	|



## First Iteration

- Hand-picked


| id_parsed 	|          annotation_parsed 	| Count 	|
|----------:	|---------------------------:	|------:	|
| A         	| without-service            	| 63    	|
|           	| location                   	| 38    	|
|           	| duration                   	| 16    	|
|           	| time                       	| 10    	|
|           	| fake-information           	| 2     	|
|           	| with-service               	| 2     	|
|           	| reason                     	| 1     	|
| T         	| circumstantial-information 	| 65    	|
|           	| social-report              	| 39    	|
|           	| electricity                	| 33    	|
|           	| gasoline                   	| 25    	|
|           	| water                      	| 6     	|
|           	| gas                        	| 4     	|

<br>

- Sampled

| id_parsed 	|          annotation_parsed 	| Freq 	|
|----------:	|---------------------------:	|-----:	|
| A         	| without-service            	| 24   	|
|           	| location                   	| 24   	|
|           	| time                       	| 10   	|
|           	| utility-company            	| 5    	|
|           	| duration                   	| 4    	|
|           	| fake-information           	| 4    	|
|           	| politician                 	| 4    	|
|           	| reason                     	| 3    	|
|           	| news-company               	| 3    	|
|           	| with-service               	| 2    	|
|           	| other                      	| 1    	|
| T         	| circumstantial-information 	| 41   	|
|           	| electricity                	| 26   	|
|           	| social-report              	| 25   	|
|           	| twitter-account            	| 12   	|
|           	| gasoline                   	| 3    	|
|           	| water                      	| 1    	|
|           	| water                      	| 1    	|

In [53]:
import pandas as pd
import numpy as np 
pd.set_option('display.max_colwidth', None)


### Read Data
original_df = pd.read_csv('brat-v1.3_Crunchy_Frog/data/first-iter/balanced_dataset_brat.ann', sep = '\t',header = None)

# Rename coumns 
original_df.columns = ['id', 'annotation', 'text']

# Remove the ID numbers to know if it's an entity (T) or Attribute (A)
original_df['id_parsed'] = original_df.id.str.replace('\d', '')

# Remove text span and IDs (T & A) from column. This columns has the name of the attributes and etitites 
original_df['annotation_parsed'] = original_df.annotation.str.replace('[\dTA]', '')


# Remove Relation tags
# Change Relation Id to Null
original_df.id_parsed.replace('R', np.nan, inplace= True)

# Remove nulls
original_df.dropna(subset=['id_parsed'], inplace= True)

# Group by id_parsed, annotation parsed and count results
df = original_df[['id_parsed', 'annotation_parsed']].groupby(['id_parsed', 'annotation_parsed'], sort = True).agg({'annotation_parsed':['count']}).copy()

# After the group by there's multi-index columns. We rename the columns to have the level that we want (count)
df.columns = df.columns.levels[1]

# sort_values by index. Here the trick is to also use sort_index!!
df.sort_values('count', ascending=False)\
    .sort_index(level=[0], ascending=[True])

Unnamed: 0_level_0,Unnamed: 1_level_0,count
id_parsed,annotation_parsed,Unnamed: 2_level_1
A,without-service,63
A,location,38
A,duration,16
A,time,10
A,news-company,4
A,fake-information,2
A,with-service,2
A,reason,1
T,circumstantial-information,65
T,social-report,39


## Sampled

The following data was randomly sampled with the helper function ``` data_sampler.py ``` 

- Random_state: 58 
- Sample: 30 

` python data_sampler.py 58 30 brat-v1.3_Crunchy_Frog/data/first-iter/sampled_58_30.txt `


In [52]:
complete_df = pd.read_csv('tagging-set-original_for_jupyter_tagging.csv')

# pd.DataFrame(complete_df.sample(30, random_state = 9).full_text)



test_balance = pd.read_csv('brat-v1.3_Crunchy_Frog/data/first-iter/sampled_58_30.ann', sep = '\t')


# Rename coumns 
test_balance.columns = ['id', 'annotation', 'text']

# Remove the ID numbers to know if it's an entity (T) or Attribute (A)
test_balance['id_parsed'] = test_balance.id.str.replace('\d', '')

# Remove text span and IDs (T & A) from column. This columns has the name of the attributes and etitites 
test_balance['annotation_parsed'] = test_balance.annotation.str.replace('[\dTA]', '')


# Remove Relation tags
# Change Relation Id to Null
test_balance.id_parsed.replace('R', np.nan, inplace= True)

# Remove nulls
test_balance.dropna(subset=['id_parsed'], inplace= True)

# Group by id_parsed, annotation parsed and count results
df = test_balance[['id_parsed', 'annotation_parsed']].groupby(['id_parsed', 'annotation_parsed'], sort = True).agg({'annotation_parsed':['count']}).copy()

# After the group by there's multi-index columns. We rename the columns to have the level that we want (count)
df.columns = df.columns.levels[1]

# sort_values by index. Here the trick is to also use sort_index!!
df.sort_values('count', ascending=False)\
    .sort_index(level=[0], ascending=[True])


Unnamed: 0_level_0,Unnamed: 1_level_0,count
id_parsed,annotation_parsed,Unnamed: 2_level_1
A,without-service,24
A,location,24
A,time,10
A,utility-company,5
A,duration,4
A,fake-information,4
A,politician,4
A,reason,3
A,news-company,3
A,with-service,2
