# Assignment 1

## Objective
The goal of this exercise is to run a named-entity tagger on Spanish text and evaluate the results. One option is the named-entity recognizer in spaCy, which ships with a pretrained model for Spanish.

## Materials

We will use test texts from the test sets of the Spanish `conll2002` corpus:

* https://www.clips.uantwerpen.be/conll2002/ner/data/
* Original paper in which the shared task was proposed: [PDF](https://www.aclweb.org/anthology/W02-2024.pdf)

## Format

The reference corpus uses one line per word, with the word followed by its entity annotation in IOB format. The entity types considered are:

* PER: Person
* ORG: Organization
* LOC: Location
* MISC: Miscellaneous
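
To make the IOB annotation concrete: a `B-` prefix marks the first token of an entity, `I-` a token that continues it, and `O` a token outside any entity. The following is a minimal sketch, not part of the original notebook, that groups a tag sequence into entity spans. The function name `iob_to_spans` is hypothetical, and it assumes that a stray `I-` tag with no open entity should start a new one.

```python
def iob_to_spans(tags):
    """Return (entity_type, start, end) triples; `end` is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition('-')
        # The open entity continues only on an I- tag of the same type.
        continues = prefix == 'I' and current is not None and etype == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # Open a new entity on B-, or on an I- tag when no entity is open
        # (assumption made for this sketch).
        if prefix in ('B', 'I') and current is None:
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ['La', 'Coruña', ',', '23', 'may']
tags = ['B-LOC', 'I-LOC', 'O', 'O', 'O']
for etype, start, end in iob_to_spans(tags):
    print(etype, ' '.join(tokens[start:end]))
```

Working with spans rather than individual tags is what makes it possible to count whole entities later, during evaluation.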

There are three files per language:

* One training file: [esp.train](https://www.clips.uantwerpen.be/conll2002/ner/data/esp.train)
* Two test files:
    * testa: used during development to find good parameters for the learning system: [esp.testa](https://www.clips.uantwerpen.be/conll2002/ner/data/esp.testa)
    * testb: used for the final evaluation: [esp.testb](https://www.clips.uantwerpen.be/conll2002/ner/data/esp.testb)
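
As a possible way to read these files while keeping sentence boundaries, here is a small sketch that uses only the standard library. It assumes that blank lines separate sentences and that the tag is the last column of each line. The name `read_conll` and the in-memory sample are hypothetical.

```python
import io


def read_conll(handle):
    """Yield sentences as lists of (token, tag) pairs; blank lines end a sentence."""
    sentence = []
    for line in handle:
        parts = line.split()
        if not parts:
            if sentence:
                yield sentence
                sentence = []
            continue
        # The token is the first column and the entity tag the last one.
        sentence.append((parts[0], parts[-1]))
    if sentence:
        yield sentence


# Hypothetical two-sentence sample in the corpus layout.
sample = io.StringIO("La B-LOC\nCoruña I-LOC\n, O\n\nMelbourne B-LOC\n( O\n")
sentences = list(read_conll(sample))
print(len(sentences), sentences[0])
```

With a real file, the handle would come from something like `open("files/esp.testa.txt", encoding="latin-1")`, matching the path and encoding used in the cells below.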

## Solution

### Step 1: Read the data from the files for use with a spaCy model

In [1]:
import pandas as pd

In [2]:
# Raw string so that \s reaches pandas as a regular expression (any whitespace character)
esp_testa_df = pd.read_csv("files/esp.testa.txt", delimiter=r'\s', header=None, engine='python', encoding='latin-1')
esp_testa_df.head(0)

Unnamed: 0,0,1


In [3]:
esp_testb_df = pd.read_csv("files/esp.testb.txt", delimiter=r'\s', header=None, engine='python', encoding='latin-1')
esp_testb_df.head()

Unnamed: 0,0,1
0,La,B-LOC
1,Coruña,I-LOC
2,",",O
3,23,O
4,may,O


In [4]:
esp_train_df = pd.read_csv("files/esp.train.txt", delimiter=r'\s', header=None, engine='python', encoding='latin-1')
esp_train_df.head()

Unnamed: 0,0,1
0,Melbourne,B-LOC
1,(,O
2,Australia,B-LOC
3,),O
4,",",O


In [5]:
def data_to_text(data):
    """Join the tokens in column 0 into one string, each followed by a space."""
    text = ''
    for row in data[0].map(str):
        text += row + ' '

    return text

In [6]:
esp_testa_text = data_to_text(esp_testa_df)
esp_testa_text



In [7]:
esp_testb_text = data_to_text(esp_testb_df)
esp_testb_text



In [8]:
esp_train_text = data_to_text(esp_train_df)
esp_train_text



### Step 2: Load the spaCy tagger

In [9]:
import spacy
import es_core_news_md

nlp = es_core_news_md.load()

In [10]:
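def text_to_tags(text):
    """Run the spaCy pipeline on `text` and return one (token, IOB tag) row per spaCy token."""
    tokens = nlp(text)
    data = []

    for token in tokens:
        word = token.text
        # Tokens inside an entity get a tag such as "B-LOC"; all others keep the bare IOB flag ("O").
        pred = token.ent_iob_ + '-' + token.ent_type_ if token.ent_type_ else token.ent_iob_

        data.append([word, pred])

    return pd.DataFrame(data=data)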

In [11]:
esp_testa_result = text_to_tags(esp_testa_text)
esp_testa_result

Unnamed: 0,0,1
0,Sao,B-LOC
1,Paulo,I-LOC
2,(,O
3,Brasil,B-LOC
4,),O
...,...,...
53911,Santander,B-LOC
53912,618,O
53913,+11,O
53914,Dycasa,O


In [12]:
esp_testb_data = text_to_tags(esp_testb_text)
esp_testb_data

Unnamed: 0,0,1
0,La,O
1,Coruña,B-ORG
2,",",O
3,23,O
4,may,O
...,...,...
51966,relojes,O
51967,",",O
51968,entre,O
51969,otros,O


In [13]:
esp_testa_df.size

105846

In [14]:
esp_testa_result.size

107832

In [68]:
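# Note: the two de-duplicated frames below are not used by the merge that follows.
esp_testa_no_dups = esp_testa_df.drop_duplicates()
esp_testa_no_dups_result = esp_testa_result.drop_duplicates()

# Inner join on the token text (column 0). This pairs every occurrence of a word in the
# reference with every occurrence of the same word in the prediction (many-to-many),
# so the result has far more rows than either input and is not aligned by position.
esp_testa_compare_df = esp_testa_df.merge(esp_testa_result, on=0, how='inner')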

In [72]:
esp_testa_compare_df

Unnamed: 0,0,1_x,1_y
0,Sao,B-LOC,B-LOC
1,Sao,B-LOC,B-LOC
2,Sao,B-LOC,B-LOC
3,Sao,B-LOC,B-LOC
4,Sao,B-LOC,I-LOC
...,...,...,...
38156725,052,O,O
38156726,+19,O,O
38156727,Francés,I-ORG,I-LOC
38156728,+11,O,O
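
Because the reference and the spaCy output contain different numbers of tokens, comparing them tag by tag requires pairing tokens by position rather than by text. One possible approach, shown here only as a sketch, aligns the two token sequences with `difflib.SequenceMatcher` from the standard library and keeps the tokens that match in both. The function name `align_tags` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def align_tags(ref_tokens, ref_tags, sys_tokens, sys_tags):
    """Return (token, reference_tag, system_tag) rows for tokens matched in both sequences."""
    matcher = SequenceMatcher(a=ref_tokens, b=sys_tokens, autojunk=False)
    rows = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            rows.append((ref_tokens[i], ref_tags[i], sys_tags[j]))
    return rows


# Hypothetical example: the system splits "+11" into two tokens.
ref_tokens = ['Santander', '618', '+11', 'Dycasa']
ref_tags = ['B-ORG', 'O', 'O', 'B-ORG']
sys_tokens = ['Santander', '618', '+', '11', 'Dycasa']
sys_tags = ['B-LOC', 'O', 'O', 'O', 'O']
for row in align_tags(ref_tokens, ref_tags, sys_tokens, sys_tags):
    print(row)
```

Tokens that the two tokenizations split differently are dropped in this sketch, and `SequenceMatcher` may be slow on sequences of tens of thousands of tokens. An alternative that avoids the mismatch altogether is to construct the spaCy `Doc` from the corpus tokens (`spacy.tokens.Doc(nlp.vocab, words=...)`) and run the pipeline components on it.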


### Step 3: Evaluation

* Counts for each test document:
    * Total number of entities in the reference document: N
    * Total number of entities extracted by the system: E
    * Number of extracted entities that are correct: C

* The most common metrics:
    * Recall = C/N
    * Precision = C/E
    * F-measure (harmonic mean of precision and recall) = 2 x Precision x Recall / (Precision + Recall)
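
The formulas above can be sketched directly in Python. The function name `prf` and the counts in the example are hypothetical. The zero checks are an added assumption so that empty inputs yield 0 instead of a division error.

```python
def prf(n_reference, n_extracted, n_correct):
    """Return (precision, recall, f_measure) from the counts N, E and C."""
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_reference if n_reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts: N = 10 reference entities, E = 8 extracted, C = 6 correct.
precision, recall, f_measure = prf(10, 8, 6)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
# prints: precision=0.750 recall=0.600 F=0.667
```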

In [19]:
# %load conlleval.py
#!/usr/bin/env python

# Python version of the evaluation script from CoNLL'00-

# Intentional differences:
# - accept any space as delimiter by default
# - optional file argument (default STDIN)
# - option to set boundary (-b argument)
# - LaTeX output (-l argument) not supported
# - raw tags (-r argument) not supported

import sys
import re

from collections import defaultdict, namedtuple

ANY_SPACE = '<SPACE>'

class FormatError(Exception):
    pass

Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore')

class EvalCounts(object):
    def __init__(self):
        self.correct_chunk = 0    # number of correctly identified chunks
        self.correct_tags = 0     # number of correct chunk tags
        self.found_correct = 0    # number of chunks in corpus
        self.found_guessed = 0    # number of identified chunks
        self.token_counter = 0    # token counter (ignores sentence breaks)

        # counts by type
        self.t_correct_chunk = defaultdict(int)
        self.t_found_correct = defaultdict(int)
        self.t_found_guessed = defaultdict(int)

def parse_args(argv):
    import argparse
    parser = argparse.ArgumentParser(
        description='evaluate tagging results using CoNLL criteria',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    arg = parser.add_argument
    arg('-b', '--boundary', metavar='STR', default='-X-',
        help='sentence boundary')
    arg('-d', '--delimiter', metavar='CHAR', default=ANY_SPACE,
        help='character delimiting items in input')
    arg('-o', '--otag', metavar='CHAR', default='O',
        help='alternative outside tag')
    arg('file', nargs='?', default=None)
    return parser.parse_args(argv)

def parse_tag(t):
    m = re.match(r'^([^-]*)-(.*)$', t)
    return m.groups() if m else (t, '')

def evaluate(iterable, options=None):
    if options is None:
        options = parse_args([])    # use defaults

    counts = EvalCounts()
    num_features = None       # number of features per line
    in_correct = False        # whether the chunk currently being processed is correct so far
    last_correct = 'O'        # previous chunk tag in corpus
    last_correct_type = ''    # type of previous chunk tag in corpus
    last_guessed = 'O'        # previously identified chunk tag
    last_guessed_type = ''    # type of previously identified chunk tag

    for line in iterable:
        line = line.rstrip('\r\n')

        if options.delimiter == ANY_SPACE:
            features = line.split()
        else:
            features = line.split(options.delimiter)

        if num_features is None:
            num_features = len(features)
        elif num_features != len(features) and len(features) != 0:
            raise FormatError('unexpected number of features: %d (%d)' %
                              (len(features), num_features))

        if len(features) == 0 or features[0] == options.boundary:
            features = [options.boundary, 'O', 'O']
        if len(features) < 3:
            raise FormatError('unexpected number of features in line %s' % line)

        guessed, guessed_type = parse_tag(features.pop())
        correct, correct_type = parse_tag(features.pop())
        first_item = features.pop(0)

        if first_item == options.boundary:
            guessed = 'O'

        end_correct = end_of_chunk(last_correct, correct,
                                   last_correct_type, correct_type)
        end_guessed = end_of_chunk(last_guessed, guessed,
                                   last_guessed_type, guessed_type)
        start_correct = start_of_chunk(last_correct, correct,
                                       last_correct_type, correct_type)
        start_guessed = start_of_chunk(last_guessed, guessed,
                                       last_guessed_type, guessed_type)

        if in_correct:
            if (end_correct and end_guessed and
                last_guessed_type == last_correct_type):
                in_correct = False
                counts.correct_chunk += 1
                counts.t_correct_chunk[last_correct_type] += 1
            elif (end_correct != end_guessed or guessed_type != correct_type):
                in_correct = False

        if start_correct and start_guessed and guessed_type == correct_type:
            in_correct = True

        if start_correct:
            counts.found_correct += 1
            counts.t_found_correct[correct_type] += 1
        if start_guessed:
            counts.found_guessed += 1
            counts.t_found_guessed[guessed_type] += 1
        if first_item != options.boundary:
            if correct == guessed and guessed_type == correct_type:
                counts.correct_tags += 1
            counts.token_counter += 1

        last_guessed = guessed
        last_correct = correct
        last_guessed_type = guessed_type
        last_correct_type = correct_type

    if in_correct:
        counts.correct_chunk += 1
        counts.t_correct_chunk[last_correct_type] += 1

    return counts

def uniq(iterable):
    seen = set()
    return [i for i in iterable if not (i in seen or seen.add(i))]

def calculate_metrics(correct, guessed, total):
    tp, fp, fn = correct, guessed-correct, total-correct
    p = 0 if tp + fp == 0 else 1.*tp / (tp + fp)
    r = 0 if tp + fn == 0 else 1.*tp / (tp + fn)
    f = 0 if p + r == 0 else 2 * p * r / (p + r)
    return Metrics(tp, fp, fn, p, r, f)

def metrics(counts):
    c = counts
    overall = calculate_metrics(
        c.correct_chunk, c.found_guessed, c.found_correct
    )
    by_type = {}
    for t in uniq(list(c.t_found_correct) + list(c.t_found_guessed)):
        by_type[t] = calculate_metrics(
            c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t]
        )
    return overall, by_type

def report(counts, out=None):
    if out is None:
        out = sys.stdout

    overall, by_type = metrics(counts)

    c = counts
    out.write('processed %d tokens with %d phrases; ' %
              (c.token_counter, c.found_correct))
    out.write('found: %d phrases; correct: %d.\n' %
              (c.found_guessed, c.correct_chunk))

    if c.token_counter > 0:
        out.write('accuracy: %6.2f%%; ' %
                  (100.*c.correct_tags/c.token_counter))
        out.write('precision: %6.2f%%; ' % (100.*overall.prec))
        out.write('recall: %6.2f%%; ' % (100.*overall.rec))
        out.write('FB1: %6.2f\n' % (100.*overall.fscore))

    for i, m in sorted(by_type.items()):
        out.write('%17s: ' % i)
        out.write('precision: %6.2f%%; ' % (100.*m.prec))
        out.write('recall: %6.2f%%; ' % (100.*m.rec))
        out.write('FB1: %6.2f  %d\n' % (100.*m.fscore, c.t_found_guessed[i]))

def end_of_chunk(prev_tag, tag, prev_type, type_):
    # check if a chunk ended between the previous and current word
    # arguments: previous and current chunk tags, previous and current types
    chunk_end = False

    if prev_tag == 'E': chunk_end = True
    if prev_tag == 'S': chunk_end = True

    if prev_tag == 'B' and tag == 'B': chunk_end = True
    if prev_tag == 'B' and tag == 'S': chunk_end = True
    if prev_tag == 'B' and tag == 'O': chunk_end = True
    if prev_tag == 'I' and tag == 'B': chunk_end = True
    if prev_tag == 'I' and tag == 'S': chunk_end = True
    if prev_tag == 'I' and tag == 'O': chunk_end = True

    if prev_tag != 'O' and prev_tag != '.' and prev_type != type_:
        chunk_end = True

    # these chunks are assumed to have length 1
    if prev_tag == ']': chunk_end = True
    if prev_tag == '[': chunk_end = True

    return chunk_end

def start_of_chunk(prev_tag, tag, prev_type, type_):
    # check if a chunk started between the previous and current word
    # arguments: previous and current chunk tags, previous and current types
    chunk_start = False

    if tag == 'B': chunk_start = True
    if tag == 'S': chunk_start = True

    if prev_tag == 'E' and tag == 'E': chunk_start = True
    if prev_tag == 'E' and tag == 'I': chunk_start = True
    if prev_tag == 'S' and tag == 'E': chunk_start = True
    if prev_tag == 'S' and tag == 'I': chunk_start = True
    if prev_tag == 'O' and tag == 'E': chunk_start = True
    if prev_tag == 'O' and tag == 'I': chunk_start = True

    if tag != 'O' and tag != '.' and prev_type != type_:
        chunk_start = True

    # these chunks are assumed to have length 1
    if tag == '[': chunk_start = True
    if tag == ']': chunk_start = True

    return chunk_start

def main(argv):
    args = parse_args(argv[1:])

    if args.file is None:
        counts = evaluate(sys.stdin, args)
    else:
        with open(args.file) as f:
            counts = evaluate(f, args)
    report(counts)

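# Note: inside Jupyter, sys.argv holds the kernel's own arguments (such as -f),
# so running this block in a notebook cell makes argparse exit with an error.
# The functions defined above are still available afterwards.
if __name__ == '__main__':
    sys.exit(main(sys.argv))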


usage: ipykernel_launcher.py [-h] [-b STR] [-d CHAR] [-o CHAR] [file]
ipykernel_launcher.py: error: unrecognized arguments: -f


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [76]:
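# evaluate() reads one whitespace-separated "token gold predicted" string per item;
# sample(100000) draws 100000 rows at random from the merged frame.
data = esp_testa_compare_df.sample(100000).to_string(header=False, index=False, index_names=False).split('\n')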

In [77]:
result = evaluate(data)

In [78]:
report(result)

processed 100000 tokens with 5068 phrases; found: 7610 phrases; correct: 406.
accuracy:  88.58%; precision:   5.34%; recall:   8.01%; FB1:   6.40
              LOC: precision:   3.04%; recall:   8.81%; FB1:   4.52  1415
             MISC: precision:   2.88%; recall:   8.91%; FB1:   4.35  4136
              ORG: precision:  12.33%; recall:   7.42%; FB1:   9.27  1662
              PER: precision:   9.82%; recall:   8.11%; FB1:   8.88  397
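
`evaluate` decides where a chunk starts and ends by comparing the tags on each line with those on the previous line, so the order of the lines matters. The following sketch shows one way to build its input while preserving corpus order. It assumes that the gold and predicted tags are already aligned by position. The function name `to_conll_lines` is hypothetical, and the second sentence in the sample is made up for illustration.

```python
def to_conll_lines(sentences):
    """Turn sentences of (token, gold_tag, predicted_tag) triples into evaluation lines.

    Sentences are kept in order and each one is followed by an empty line.
    """
    lines = []
    for sentence in sentences:
        for token, gold, predicted in sentence:
            lines.append(f"{token} {gold} {predicted}")
        lines.append("")
    return lines


# Position-aligned input with two sentences (the second one is hypothetical).
sentences = [
    [("La", "B-LOC", "O"), ("Coruña", "I-LOC", "B-ORG"), (",", "O", "O")],
    [("Melbourne", "B-LOC", "B-LOC"), ("(", "O", "O")],
]
lines = to_conll_lines(sentences)
print("\n".join(lines))
```

The resulting list could then be passed to `evaluate(lines)` and the counts to `report(...)`, both defined in the cell above. `evaluate` treats an empty line as a sentence boundary.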
