# Exercise 2 for submission: *Gendered Pronoun Resolution*

In natural-language text analysis, some sentences are hard to understand even for people. Ambiguous pronouns are among the most troublesome cases. In 2018, a dataset was published together with the paper [A Balanced Corpus of Gendered Ambiguous Pronouns](https://arxiv.org/pdf/1810.05201.pdf), which proposes a collection of texts containing ambiguous gendered pronouns.

The goal of the task defined on this dataset is to find the name in the text that the ambiguous pronoun refers to.

To that end, we are given a dataset with the following fields:

* `ID`: Identifier of the text fragment.
* `Text`: The text, as a string.
* `Pronoun`: String with the ambiguous pronoun.
* `Pronoun-offset`: Zero-based index of the character where the pronoun starts within the text.
* `A`: String with the first candidate name that the pronoun may refer to.
* `A-offset`: Zero-based index of the character where name A starts within the text.
* `A-coref`: Boolean indicating whether the pronoun refers to name A.
* `B`: String with the second candidate name that the pronoun may refer to.
* `B-offset`: Zero-based index of the character where name B starts within the text.
* `B-coref`: Boolean indicating whether the pronoun refers to name B.
* `URL`: Web page from which the text fragment was taken.
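
As a quick illustration of how the offset fields relate to `Text`, the following sketch checks that each mention starts exactly at its stated offset. The helper names `span_matches` and `check_row` are hypothetical (they are not part of the dataset tooling), and the row is written as a plain dictionary whose values are copied from the second example of the development file explored later in this notebook.

```python
def span_matches(text, mention, offset):
    """Return True if `mention` occurs in `text` starting at character `offset`."""
    return text[offset:offset + len(mention)] == mention


def check_row(row):
    """Check the Pronoun, A and B offsets of one row (any dict-like mapping)."""
    return {
        field: span_matches(row["Text"], row[field], row[field + "-offset"])
        for field in ("Pronoun", "A", "B")
    }


# Values copied from row 1 of the development file (see the cells below).
row = {
    "Text": (
        "He grew up in Evanston, Illinois the second oldest of five children "
        "including his brothers, Fred and Gordon and sisters, Marge (Peppy) "
        "and Marilyn. His high school days were spent at New Trier High School "
        "in Winnetka, Illinois. MacKenzie studied with Bernard Leach from 1949 "
        "to 1952. His simple, wheel-thrown functional pottery is heavily "
        "influenced by the oriental aesthetic of Shoji Hamada and Kanjiro Kawai."
    ),
    "Pronoun": "His", "Pronoun-offset": 284,
    "A": "MacKenzie", "A-offset": 228,
    "B": "Bernard Leach", "B-offset": 251,
}

print(check_row(row))
```

The same function can be applied to every row of a DataFrame (for example with `DataFrame.apply(check_row, axis=1)`) to spot misaligned offsets before building features.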

## Problem statement:

Predict which of the two names marked in each sentence the selected pronoun refers to, using **two different** NLP **models** and following the format described below:

* **MODEL 1**: Can be **any model covered in the NLP seminars or in other courses**, such as: Count vectorizer, HMM, Structured Perceptron, RNN, Logistic Regressor, XGBoost, etc.

    * Justify the choice of model.
    * Train the model.
    * Report the model's accuracy.
    * Interpret and explain the model's results.

* **MODEL 2**: Must be a **Transformer-based** model that incorporates the concept of ***attention***.

    * Justify the choice of model.
    * Train the model.
    * Report the model's accuracy.
    * Interpret and explain the model's results.
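
For reference, the *attention* required of the second model usually means the scaled dot-product operation, in which every token builds its output as a weighted average of all tokens. The sketch below is a minimal NumPy version of a single attention head without learned projections, run on random placeholder embeddings (an assumption made only for illustration). It shows the concept; it is not a model that the assignment prescribes.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row-wise maximum before exponentiating, for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


# Placeholder "embeddings": 4 tokens with 8 dimensions each (random values).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Self-attention: queries, keys and values all come from the same tokens.
output, weights = scaled_dot_product_attention(X, X, X)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

In a pronoun-resolution setting, the row of `weights` belonging to the pronoun token indicates how strongly that token attends to every other token, including the two candidate names.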
    
    
### Submission

* **FORMAT**: Submit a Jupyter Notebook (this one or a new one) containing the two trained models. The justifications and explanations of the models and results can be included in the same notebook or submitted as a separate PDF document.

* **DUE DATE**: The exercise must be submitted on 16/05/2023 in the corresponding drop box on e-study.


_________________________________________________

### Reading the data:

In [1]:
import pandas as pd
import numpy as np

import colorama
from colorama import Fore

In [2]:
data_df = pd.read_csv('./input/gap-coreference-master/gap-development.tsv', delimiter='\t')

In [3]:
data_df.head()

| | ID | Text | Pronoun | Pronoun-offset | A | A-offset | A-coref | B | B-offset | B-coref | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | development-1 | Zoe Telford -- played the police officer girlf... | her | 274 | Cheryl Cassidy | 191 | True | Pauline | 207 | False | http://en.wikipedia.org/wiki/List_of_Teachers_... |
| 1 | development-2 | He grew up in Evanston, Illinois the second ol... | His | 284 | MacKenzie | 228 | True | Bernard Leach | 251 | False | http://en.wikipedia.org/wiki/Warren_MacKenzie |
| 2 | development-3 | He had been reelected to Congress, but resigne... | his | 265 | Angeloz | 173 | False | De la Sota | 246 | True | http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_... |
| 3 | development-4 | The current members of Crime have also perform... | his | 321 | Hell | 174 | False | Henry Rosenthal | 336 | True | http://en.wikipedia.org/wiki/Crime_(band) |
| 4 | development-5 | Her Santa Fe Opera debut in 2005 was as Nuria ... | She | 437 | Kitty Oppenheimer | 219 | False | Rivera | 294 | True | http://en.wikipedia.org/wiki/Jessica_Rivera |


In [4]:
def gap_printer2(data_df_row):
    """Print the text of one row with name A, name B and the pronoun in blue.

    Assumes that the three character spans do not overlap.
    """
    text    = data_df_row["Text"]
    word_A  = data_df_row["A"]
    word_B  = data_df_row["B"]
    pronoun = data_df_row["Pronoun"]

    # Character span [begin, end) of each mention
    pronoun_begin = data_df_row["Pronoun-offset"]
    pronoun_end   = pronoun_begin + len(pronoun)
    word_A_begin  = data_df_row["A-offset"]
    word_A_end    = word_A_begin + len(word_A)
    word_B_begin  = data_df_row["B-offset"]
    word_B_end    = word_B_begin + len(word_B)

    # Sort the six boundaries and pair them up as (begin, end), in order of appearance
    word_boundaries = np.sort([word_A_begin, word_A_end, pronoun_begin, pronoun_end, word_B_begin, word_B_end])
    word_boundaries = list(zip(word_boundaries[::2], word_boundaries[1::2]))

    # Rebuild the text piece by piece, wrapping each span in colour codes.
    # Joining the pieces (instead of str.format) also works when the text contains braces.
    pieces = []
    previous_end = 0
    for begin, end in word_boundaries:
        pieces.append(text[previous_end:begin])
        # Fore.BLACK switches back to black text; Fore.RESET would restore the terminal default
        pieces.append(Fore.BLUE + text[begin:end] + Fore.BLACK)
        previous_end = end
    pieces.append(text[previous_end:])

    print("".join(pieces))

In [5]:
gap_printer2(data_df.loc[3])

The current members of Crime have also performed in San Francisco under the band name ''Remote Viewers``. Strike has published two works of fiction in recent years: Ports of Hell, which is listed in the Rock and Roll Hall of Fame Library, and A Loud Humming Sound Came from Above. Rank has produced numerous films (under his real name, Henry Rosenthal) including the hit The Devil and Daniel Johnston.


In [6]:
url = data_df["URL"][1]
url

'http://en.wikipedia.org/wiki/Warren_MacKenzie'

In [7]:
text = data_df["Text"][1]
text

'He grew up in Evanston, Illinois the second oldest of five children including his brothers, Fred and Gordon and sisters, Marge (Peppy) and Marilyn. His high school days were spent at New Trier High School in Winnetka, Illinois. MacKenzie studied with Bernard Leach from 1949 to 1952. His simple, wheel-thrown functional pottery is heavily influenced by the oriental aesthetic of Shoji Hamada and Kanjiro Kawai.'

In [10]:
pronoun = data_df["Pronoun"][1]
pronoun

'His'

In [11]:
pronoun_offset = data_df["Pronoun-offset"][1]
pronoun_offset

284

In [12]:
text[pronoun_offset:pronoun_offset+len(pronoun)]

'His'

In [13]:
A = data_df["A"][1]
A_offset = data_df["A-offset"][1]
A, A_offset

('MacKenzie', 228)

In [14]:
text[A_offset:A_offset+len(A)]

'MacKenzie'

In [15]:
B = data_df["B"][1]
B_offset = data_df["B-offset"][1]
B, B_offset

('Bernard Leach', 251)

In [16]:
text[B_offset:B_offset+len(B)]

'Bernard Leach'
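
Once the offsets are known to line up with the text, one common way to expose the three mentions to a model is to wrap them in marker tags. The sketch below is one hypothetical way of doing that: the function `mark_mentions` and the tags `<A>`, `<B>` and `<P>` are illustrative choices, not part of the dataset, and the short sentence is invented for the example.

```python
def mark_mentions(text, spans):
    """Wrap each (offset, mention, tag) span of `text` in <tag>...</tag> markers."""
    # Process spans from right to left so that the offsets of the
    # spans still to be processed are not shifted by earlier insertions.
    for offset, mention, tag in sorted(spans, reverse=True):
        end = offset + len(mention)
        if text[offset:end] != mention:
            raise ValueError(f"{mention!r} not found at offset {offset}")
        text = f"{text[:offset]}<{tag}>{mention}</{tag}>{text[end:]}"
    return text


# Invented example sentence; offsets are zero-based character positions.
example = "Alice told Mary that she won."
marked = mark_mentions(example, [(0, "Alice", "A"), (11, "Mary", "B"), (21, "she", "P")])
print(marked)
# prints: <A>Alice</A> told <B>Mary</B> that <P>she</P> won.
```

For a row of `data_df`, the spans would be built from the `A`, `B` and `Pronoun` columns together with their `-offset` columns.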

In [17]:
x_data = data_df[["Text", "Pronoun", "Pronoun-offset", "A", "A-offset", "B", "B-offset"]]
y_data = data_df[["A-coref", "B-coref"]]

In [18]:
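# Positional split: the first 1500 rows for training, the rest for testing.
# .iloc excludes the end position, so row 1500 appears only in the test set
# (label-based .loc[:1500] would also include it in the training set).
x_train = x_data.iloc[:1500]
y_train = y_data.iloc[:1500]
# drop=True discards the old index instead of keeping it as an extra "index" column
x_test = x_data.iloc[1500:].reset_index(drop=True)
y_test = y_data.iloc[1500:].reset_index(drop=True)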
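
A point worth checking in any split like this is how pandas interprets the slice bounds: label-based `.loc` slices include both endpoints, whereas positional `.iloc` slices exclude the end, and `reset_index()` keeps the old index as a new column unless `drop=True` is passed. The following sketch shows these behaviours on a hypothetical ten-row frame (the column name `value` is made up for the example).

```python
import pandas as pd

# Hypothetical toy frame with the default RangeIndex 0..9
df = pd.DataFrame({"value": range(10)})

by_label = df.loc[:3]      # label slice: includes label 3, so 4 rows
by_position = df.iloc[:3]  # positional slice: excludes position 3, so 3 rows
print(len(by_label), len(by_position))  # prints: 4 3

# A non-overlapping positional split
train = df.iloc[:7]
test = df.iloc[7:].reset_index(drop=True)   # index restarts at 0, no extra column
test_keep = df.iloc[7:].reset_index()       # old index kept as an "index" column

print(list(test.columns), list(test_keep.columns))
```

With `.iloc` on both sides of the cut, the sizes of the two parts always add up to the size of the original frame.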

In [19]:
x_train

| | Text | Pronoun | Pronoun-offset | A | A-offset | B | B-offset |
|---|---|---|---|---|---|---|---|
| 0 | Zoe Telford -- played the police officer girlf... | her | 274 | Cheryl Cassidy | 191 | Pauline | 207 |
| 1 | He grew up in Evanston, Illinois the second ol... | His | 284 | MacKenzie | 228 | Bernard Leach | 251 |
| 2 | He had been reelected to Congress, but resigne... | his | 265 | Angeloz | 173 | De la Sota | 246 |
| 3 | The current members of Crime have also perform... | his | 321 | Hell | 174 | Henry Rosenthal | 336 |
| 4 | Her Santa Fe Opera debut in 2005 was as Nuria ... | She | 437 | Kitty Oppenheimer | 219 | Rivera | 294 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1496 | In 2014, Rockov also signed his first professi... | he | 282 | Nemanja Buzad*ija | 395 | Rockov | 446 |
| 1497 | Mr Telford was of the most genial disposition ... | him | 209 | George Turnbull | 158 | Telford | 186 |
| 1498 | In 1865, the theatre was renovated and named t... | He | 306 | Henry Byron | 184 | Squire Bancroft | 289 |
| 1499 | As well as similarities as regards the type of... | his | 399 | Don Quixote | 303 | Cervantes | 518 |


In [20]:
y_train

| | A-coref | B-coref |
|---|---|---|
| 0 | True | False |
| 1 | True | False |
| 2 | False | True |
| 3 | False | True |
| 4 | False | True |
| ... | ... | ... |
| 1496 | False | True |
| 1497 | True | False |
| 1498 | False | True |
| 1499 | False | False |
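
Row 1499 shows that both flags can be `False`, so the target is effectively a three-way choice. The sketch below collapses the two booleans into a single label and computes accuracy for a deliberately naive baseline that picks the candidate whose offset is closest to the pronoun. The label names, the helper functions and the baseline itself are hypothetical illustrations (not one of the models asked for), the code assumes that at most one of the two flags is `True`, and the six rows are copied from the tables displayed in this notebook, so the resulting number says nothing about performance on the full data.

```python
def to_label(a_coref, b_coref):
    """Collapse the two flags into one label (assumes at most one flag is True)."""
    if a_coref:
        return "A"
    if b_coref:
        return "B"
    return "NEITHER"


def nearest_candidate(pronoun_offset, a_offset, b_offset):
    """Naive baseline: choose the candidate that starts closest to the pronoun."""
    if abs(pronoun_offset - a_offset) <= abs(pronoun_offset - b_offset):
        return "A"
    return "B"


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# (Pronoun-offset, A-offset, B-offset, A-coref, B-coref) for rows 0-4 and 1499
rows = [
    (274, 191, 207, True, False),
    (284, 228, 251, True, False),
    (265, 173, 246, False, True),
    (321, 174, 336, False, True),
    (437, 219, 294, False, True),
    (399, 303, 518, False, False),
]

y_true = [to_label(a, b) for _, _, _, a, b in rows]
y_pred = [nearest_candidate(p, a_off, b_off) for p, a_off, b_off, _, _ in rows]

print(y_true)
print(y_pred)
print(accuracy(y_true, y_pred))  # prints: 0.5
```

Because this baseline never outputs `NEITHER`, it is wrong on every row where neither name is the referent, which is one reason to treat the task as three-class when reporting accuracy.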
