# *Gendered Pronoun Resolution*

In natural-language text analysis, some sentences are hard to understand even for people. Ambiguous pronouns are among the most problematic cases. In 2018, a dataset was published together with the paper [A Balanced Corpus of Gendered Ambiguous Pronouns](https://arxiv.org/pdf/1810.05201.pdf), which proposes a set of texts containing ambiguous gendered pronouns.

The goal of this dataset is to find the name in the text that the ambiguous pronoun refers to.

To that end, the dataset provides the following fields:

* `ID`: Identifier of the sentence.
* `Text`: Text, as a string.
* `Pronoun`: String with the ambiguous pronoun.
* `Pronoun-offset`: Index of the character where the pronoun starts within the text.
* `A`: String with the first candidate name the pronoun may refer to.
* `A-offset`: Index of the character where name A starts within the text.
* `A-coref`: Boolean indicating whether the pronoun refers to name A.
* `B`: String with the second candidate name the pronoun may refer to.
* `B-offset`: Index of the character where name B starts within the text.
* `B-coref`: Boolean indicating whether the pronoun refers to name B.
* `URL`: Web page the text fragment was taken from.
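The three offset fields are only useful if they really point at the strings they describe. A minimal sketch of that check follows; the function name `offsets_are_consistent` and the toy row are illustrative assumptions, not part of the dataset or of the original notebook.

```python
def offsets_are_consistent(row):
    """Return True if every offset in `row` points at the string it claims to."""
    text = row["Text"]
    for name_col, offset_col in [("Pronoun", "Pronoun-offset"),
                                 ("A", "A-offset"),
                                 ("B", "B-offset")]:
        span = row[name_col]
        start = row[offset_col]
        # The slice starting at the offset must reproduce the span exactly
        if text[start:start + len(span)] != span:
            return False
    return True


# Made-up row for illustration only (not taken from the dataset)
toy_text = "Alice met Beth before she left."
toy_row = {
    "Text": toy_text,
    "Pronoun": "she", "Pronoun-offset": toy_text.index("she"),
    "A": "Alice", "A-offset": toy_text.index("Alice"),
    "B": "Beth", "B-offset": toy_text.index("Beth"),
}
print(offsets_are_consistent(toy_row))
```

Because the function only indexes the row by column name, it should also work row-wise on a dataframe, for example `train_df.apply(offsets_are_consistent, axis=1).all()`.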

## Objective:

Predict which of the two names marked in each sentence the selected pronoun refers to, using **two different** NLP models and following the format described below:

* **MODEL 1**: Can be **any model seen in the NLP seminars or in other courses**, such as: Count vectorizer, HMM, Structured Perceptron, RNN, Logistic Regressor, XGBoost, etc.

    * Justify the choice of model.
    * Train the model.
    * Report the accuracy of the model.
    * Interpret and explain the results of the model.
 

* **MODEL 2**: Must be a **Transformer-based** model that incorporates the concept of ***attention***.

    * Justify the choice of model.
    * Train the model.
    * Report the accuracy of the model.
    * Interpret and explain the results of the model.
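Both models are asked to report an accuracy. As a rough sketch of what that number means, assuming each prediction has been reduced to a single label per sentence (the labels `"A"` and `"B"` below are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


gold = ["A", "B", "B", "A"]  # made-up gold labels
pred = ["A", "B", "A", "A"]  # made-up predictions, 3 of 4 match
print(accuracy(gold, pred))
```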
    
    





## Libraries

In [2]:
import os
import sys

import pandas as pd
import numpy as np

import re
import contractions
import string 
import colorama
from colorama import Fore

## Load data

In [3]:
print(os.listdir("./input/gap-coreference-master"))

['gap-development.tsv', 'gap-test.tsv', 'gap-validation.tsv']


In [4]:
DATA_ROOT = './input/'
GAP_DATA_FOLDER = os.path.join(DATA_ROOT, 'gap-coreference-master')
GAP_DATA_FOLDER

'./input/gap-coreference-master'

In [5]:
train_df_path = os.path.join(GAP_DATA_FOLDER, 'gap-development.tsv')
test_df_path = os.path.join(GAP_DATA_FOLDER, 'gap-test.tsv')
val_df_path = os.path.join(GAP_DATA_FOLDER, 'gap-validation.tsv')

train_df = pd.read_csv(train_df_path, sep='\t') # train_df
test_df = pd.read_csv(test_df_path, sep='\t')
val_df = pd.read_csv(val_df_path, sep='\t')


In [6]:
train_df.head()

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,True,Pauline,207,False,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,True,Bernard Leach,251,False,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,False,De la Sota,246,True,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,False,Henry Rosenthal,336,True,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,False,Rivera,294,True,http://en.wikipedia.org/wiki/Jessica_Rivera


In [80]:
train_df.isnull().sum()

Text              0
Pronoun           0
Pronoun-offset    0
A                 0
A-offset          0
B                 0
B-offset          0
A-coref           0
B-coref           0
dtype: int64

In [7]:
# Helper to visualize a sentence with the pronoun and both candidate names highlighted
def gap_printer2(data_df_row):
        
    text   = data_df_row["Text"]
    word_A = data_df_row["A"]
    word_B = data_df_row["B"]
    
    pronoun       = data_df_row["Pronoun"]
    pronoun_begin = data_df_row["Pronoun-offset"]
    pronoun_end   = pronoun_begin + len(pronoun)
    
    word_A_begin = data_df_row["A-offset"]
    word_A_end   = data_df_row["A-offset"] + len(word_A)
    word_B_begin = data_df_row["B-offset"]
    word_B_end   = data_df_row["B-offset"] + len(word_B)
    
    
    word_boundaries = np.sort([word_A_begin, word_A_end, pronoun_begin, pronoun_end, word_B_begin, word_B_end])
    word_boundaries = list(zip(word_boundaries[::2], word_boundaries[1::2]))
    
    P1 = [0,word_boundaries[0][0]]
    P2 = [word_boundaries[0][1],word_boundaries[1][0]]
    P3 = [word_boundaries[1][1],word_boundaries[2][0]]
    P4 = [word_boundaries[2][1],len(text)]

    text_f = text[P1[0]:P1[1]] + "{}" + text[P2[0]:P2[1]] +  "{}" + text[P3[0]:P3[1]] + "{}" + text[P4[0]:P4[1]]
 
    print(text_f.format( Fore.BLUE  + text[word_boundaries[0][0]:word_boundaries[0][1]]  + Fore.BLACK,
                         Fore.BLUE  + text[word_boundaries[1][0]:word_boundaries[1][1]] + Fore.BLACK,
                         Fore.BLUE  + text[word_boundaries[2][0]:word_boundaries[2][1]]  + Fore.BLACK))

In [5]:
gap_printer2(train_df.loc[3])

The current members of Crime have also performed in San Francisco under the band name ''Remote Viewers``. Strike has published two works of fiction in recent years: Ports of Hell, which is listed in the Rock and Roll Hall of Fame Library, and A Loud Humming Sound Came from Above. Rank has produced numerous films (under his real name, Henry Rosenthal) including the hit The Devil and Daniel Johnston.


- The pronoun we have to resolve is `his`.
- There are 2 candidate names it may refer to: `Hell` and `Henry Rosenthal`.
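The same three spans can also be marked with plain-text tags instead of terminal colours, which keeps the highlighting visible when the output is saved to a file. This is only a sketch: `mark_spans`, `mark_row`, the tag names and the toy row are assumptions for illustration, and the code assumes the three spans do not overlap.

```python
def mark_spans(text, spans):
    """Wrap each (start, end, tag) span of `text` in <tag>...</tag> markers."""
    # Insert from the rightmost span to the leftmost so that markers
    # already added never shift the offsets still to be processed.
    for start, end, tag in sorted(spans, reverse=True):
        text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
    return text


def mark_row(row):
    """Tag the pronoun (P) and both candidates (A, B) of a dataset row."""
    spans = [
        (row["Pronoun-offset"], row["Pronoun-offset"] + len(row["Pronoun"]), "P"),
        (row["A-offset"], row["A-offset"] + len(row["A"]), "A"),
        (row["B-offset"], row["B-offset"] + len(row["B"]), "B"),
    ]
    return mark_spans(row["Text"], spans)


# Made-up row for illustration only (not taken from the dataset)
toy_text = "Alice met Beth before she left."
toy_row = {
    "Text": toy_text,
    "Pronoun": "she", "Pronoun-offset": toy_text.index("she"),
    "A": "Alice", "A-offset": toy_text.index("Alice"),
    "B": "Beth", "B-offset": toy_text.index("Beth"),
}
print(mark_row(toy_row))
# <A>Alice</A> met <B>Beth</B> before <P>she</P> left.
```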


In [6]:
# extract the URL of the second row of the dataset
url = train_df["URL"][1]
url

'http://en.wikipedia.org/wiki/Warren_MacKenzie'

In [8]:
# extract the text of the second row of the dataset
text = train_df["Text"][1]
text

'He grew up in Evanston, Illinois the second oldest of five children including his brothers, Fred and Gordon and sisters, Marge (Peppy) and Marilyn. His high school days were spent at New Trier High School in Winnetka, Illinois. MacKenzie studied with Bernard Leach from 1949 to 1952. His simple, wheel-thrown functional pottery is heavily influenced by the oriental aesthetic of Shoji Hamada and Kanjiro Kawai.'

In [9]:
# extract the ambiguous pronoun of the second row of the dataset
pronoun = train_df["Pronoun"][1]
pronoun

'His'

In [10]:
# extract the pronoun offset of the second row of the dataset
pronoun_offset = train_df["Pronoun-offset"][1]
pronoun_offset

284

In [11]:
# Another way to look at the pronoun:
# slicing Text from the pronoun offset up to the pronoun offset
# plus the length of the pronoun returns exactly the pronoun.
text[pronoun_offset:pronoun_offset+len(pronoun)]

'His'

In [12]:
# extract name A and its offset
A = train_df["A"][1]
A_offset = train_df["A-offset"][1]
A, A_offset

('MacKenzie', 228)

In [13]:
# name A, recovered from the text through its offset
text[A_offset:A_offset+len(A)]

'MacKenzie'

In [14]:
# name B and its offset
B = train_df["B"][1]
B_offset = train_df["B-offset"][1]
B, B_offset

('Bernard Leach', 251)

In [15]:
# name B, recovered from the text through its offset
text[B_offset:B_offset+len(B)]

'Bernard Leach'
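The offsets also give a cheap positional signal: the character distance between the pronoun and each candidate. The sketch below uses the offsets printed above for the second row (284, 228 and 251); the function name `char_distances` is an illustrative assumption, and the distance is not proposed here as the notebook's actual feature set.

```python
def char_distances(pronoun_offset, a_offset, b_offset):
    """Absolute character distance from the pronoun to candidates A and B."""
    return abs(pronoun_offset - a_offset), abs(pronoun_offset - b_offset)


# Offsets of the second row: pronoun 'His', A 'MacKenzie', B 'Bernard Leach'
dist_a, dist_b = char_distances(284, 228, 251)
print(dist_a, dist_b)  # 56 33
```

In this row B is the closer candidate, yet the gold answer is A (`A-coref` is `True`), so picking the nearest name would be wrong here. Distance alone is at best a weak baseline.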

## Variable selection

Build a dataframe that keeps only the relevant variables (the URL is not needed for a text-processing problem).

In [8]:
imp_features =["Text", "Pronoun", "Pronoun-offset", "A", "A-offset", "B", "B-offset"]
target_col = ["A-coref", "B-coref"]

train_df = train_df[imp_features+target_col]
test_df = test_df[imp_features+target_col]
val_df = val_df[imp_features+target_col]

## Text cleaning 

In [9]:
# TODO: delete this cell; it only reloads the data to reset the cleaning steps
train_df = pd.read_csv(train_df_path, sep='\t') # train_df
imp_features =["Text", "Pronoun", "Pronoun-offset", "A", "A-offset", "B", "B-offset"]
target_col = ["A-coref", "B-coref"]

train_df = train_df[imp_features+target_col]


### To lower case 

In [10]:
# To lower case
def lower_case(df):
    df["text_clean"] = df["Text"].apply(lambda x: x.lower())
    df["A"] = df["A"].apply(lambda x: x.lower())
    df["B"] = df["B"].apply(lambda x: x.lower())

    return df

In [11]:
train_df_clean = lower_case(train_df)
train_df_clean.head()

Unnamed: 0,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,A-coref,B-coref,text_clean
0,Zoe Telford -- played the police officer girlf...,her,274,cheryl cassidy,191,pauline,207,True,False,zoe telford -- played the police officer girlf...
1,"He grew up in Evanston, Illinois the second ol...",His,284,mackenzie,228,bernard leach,251,True,False,"he grew up in evanston, illinois the second ol..."
2,"He had been reelected to Congress, but resigne...",his,265,angeloz,173,de la sota,246,False,True,"he had been reelected to congress, but resigne..."
3,The current members of Crime have also perform...,his,321,hell,174,henry rosenthal,336,False,True,the current members of crime have also perform...
4,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,kitty oppenheimer,219,rivera,294,False,True,her santa fe opera debut in 2005 was as nuria ...


### Expand the contractions

In [12]:
def expand_contractions(df):
    df["text_clean"] = df["text_clean"].apply(lambda x: contractions.fix(x))
    return df

In [13]:
train_df_clean = expand_contractions(train_df_clean)
# double check
print(train_df_clean["Text"][0])
print(train_df_clean["text_clean"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["text_clean"][100])

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
zoe telford -- played the police officer girlfriend of simon, maggie. dumped by simon in the final episode of series 1, after he slept with jenny, and is not seen again. phoebe thomas played cheryl cassidy, pauline's friend and also a year 11 pupil in simon's class. dumped her boyfriend following simon's advice after he would not have sex with her but later realised this was due to him catching crabs off her friend pauline.
Re-elected in the 2007 election, she was re-named the Minister of International Relations, La Francophonie and for the Estrie Region as well as t

### Remove non-ASCII characters and punctuation

In [14]:
def remove_non_ascii_characters(df, col='text_clean'):
    df[col] = df[col].apply(lambda text: re.sub(r'[^\x00-\x7f]', r'', text)) # drop every non-ASCII character
    return df

In [15]:
train_df_clean = remove_non_ascii_characters(train_df_clean)

# double check
print(train_df_clean["Text"][0])
print(train_df_clean["text_clean"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["text_clean"][100])

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
zoe telford -- played the police officer girlfriend of simon, maggie. dumped by simon in the final episode of series 1, after he slept with jenny, and is not seen again. phoebe thomas played cheryl cassidy, pauline's friend and also a year 11 pupil in simon's class. dumped her boyfriend following simon's advice after he would not have sex with her but later realised this was due to him catching crabs off her friend pauline.
Re-elected in the 2007 election, she was re-named the Minister of International Relations, La Francophonie and for the Estrie Region as well as t
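Deleting non-ASCII characters outright turns a name such as "José" into "Jos". One alternative, shown here as a sketch and not as the approach used in this notebook, is to decompose accented letters with the standard-library `unicodedata` module and drop only the combining marks. The function name `strip_accents` and the sample string are illustrative assumptions.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their base letter instead of deleting them."""
    # NFKD splits e.g. 'é' into 'e' followed by a combining acute accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("José Manuel"))  # Jose Manuel
```

Letters without a decomposition (for example `ø`) stay as they are, so the ASCII filter would still be needed afterwards if strictly ASCII text is required.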

In [16]:
def remove_punctuations(df, col='text_clean'):
    """
     - str.maketrans('', '', string.punctuation) builds a translation table
       that maps every punctuation character to None, i.e. deletes it.
     - string.punctuation is a predefined string in the string module that contains
       all ASCII punctuation characters.
     - text.translate(translator) applies the table to the text, which removes
       the punctuation characters.
    """
    df[col] = df[col].apply(lambda text: text.translate(str.maketrans('', '', string.punctuation)))
    # return re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)
    return df

In [17]:
train_df_clean = remove_punctuations(train_df_clean)
# double check
print(train_df_clean["Text"][0])
print(train_df_clean["text_clean"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["text_clean"][100])

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
zoe telford  played the police officer girlfriend of simon maggie dumped by simon in the final episode of series 1 after he slept with jenny and is not seen again phoebe thomas played cheryl cassidy paulines friend and also a year 11 pupil in simons class dumped her boyfriend following simons advice after he would not have sex with her but later realised this was due to him catching crabs off her friend pauline
Re-elected in the 2007 election, she was re-named the Minister of International Relations, La Francophonie and for the Estrie Region as well as the Vice-Chair
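Removing characters shifts everything that follows them, so `Pronoun-offset`, `A-offset` and `B-offset` no longer point at the right place in `text_clean`. A minimal sketch of how an offset could be remapped after punctuation removal is shown below. `remap_offset` and the toy sentence are assumptions for illustration; the sketch covers only character deletion (not the length changes introduced by expanding contractions) and assumes the span itself contains no punctuation.

```python
import string


def remap_offset(text, offset, removed=string.punctuation):
    """Offset of the same character after deleting every char in `removed` from `text`."""
    removed = set(removed)
    # Count the characters before `offset` that survive the deletion
    return sum(1 for ch in text[:offset] if ch not in removed)


# Made-up sentence for illustration only
raw = "Alice, who met Beth, said she left."
clean = raw.translate(str.maketrans("", "", string.punctuation))

old_offset = raw.index("she")
new_offset = remap_offset(raw, old_offset)
print(old_offset, new_offset)             # 26 24
print(clean[new_offset:new_offset + 3])   # she
```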

## Text Preprocessing 

In [None]:
# def fast_encode(texts, tokenizer, chunk_size=256, maxlen=128):
#     tokenizer.enable_truncation(max_lenght=maxlen)
#     tokenizer.enable_padding(max_lenght=maxlen)

#     all_ids = []

#     for i in range(0, len(texts),chunk_size):
#         text_chunk = texts[i:i+chunk_size].to_list()
#         encs = tokenizer.encode_batch(text_chunk)
#         all_ids.extend([enc.ids for enc in encs])
    
#     return np.array(all_ids)


### Tokenization 

In [18]:
import nltk
nltk.download('punkt') # word_tokenize
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bernatsort/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [19]:
train_df_clean['tokenized'] = train_df_clean['text_clean'].apply(word_tokenize)
# double check
print(train_df_clean["Text"][0])
print(train_df_clean["tokenized"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["tokenized"][100])
display(train_df_clean.head())

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
['zoe', 'telford', 'played', 'the', 'police', 'officer', 'girlfriend', 'of', 'simon', 'maggie', 'dumped', 'by', 'simon', 'in', 'the', 'final', 'episode', 'of', 'series', '1', 'after', 'he', 'slept', 'with', 'jenny', 'and', 'is', 'not', 'seen', 'again', 'phoebe', 'thomas', 'played', 'cheryl', 'cassidy', 'paulines', 'friend', 'and', 'also', 'a', 'year', '11', 'pupil', 'in', 'simons', 'class', 'dumped', 'her', 'boyfriend', 'following', 'simons', 'advice', 'after', 'he', 'would', 'not', 'have', 'sex', 'with', 'her', 'but', 'later', 'realised', 'this', 'was', 'due', 'to',

Unnamed: 0,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,A-coref,B-coref,text_clean,tokenized
0,Zoe Telford -- played the police officer girlf...,her,274,cheryl cassidy,191,pauline,207,True,False,zoe telford played the police officer girlfri...,"[zoe, telford, played, the, police, officer, g..."
1,"He grew up in Evanston, Illinois the second ol...",His,284,mackenzie,228,bernard leach,251,True,False,he grew up in evanston illinois the second old...,"[he, grew, up, in, evanston, illinois, the, se..."
2,"He had been reelected to Congress, but resigne...",his,265,angeloz,173,de la sota,246,False,True,he had been reelected to congress but resigned...,"[he, had, been, reelected, to, congress, but, ..."
3,The current members of Crime have also perform...,his,321,hell,174,henry rosenthal,336,False,True,the current members of crime have also perform...,"[the, current, members, of, crime, have, also,..."
4,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,kitty oppenheimer,219,rivera,294,False,True,her santa fe opera debut in 2005 was as nuria ...,"[her, santa, fe, opera, debut, in, 2005, was, ..."


### Remove Stop Words (to be deleted: it removes the pronouns)

In [167]:
# Removing stopwords.
# nltk.download("stopwords")
# from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bernatsort/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [170]:
# stop = set(stopwords.words('english'))
# train_df_clean['stopwords_removed'] = train_df_clean['tokenized'].apply(lambda x: [word for word in x if word not in stop])

# # double check
# print(train_df_clean["Text"][0])
# print(train_df_clean["stopwords_removed"][0])
# print(train_df_clean["Text"][100])
# print(train_df_clean["stopwords_removed"][100])
# display(train_df_clean.head())

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
['zoe', 'telford', 'played', 'police', 'officer', 'girlfriend', 'simon', 'maggie', 'dumped', 'simon', 'final', 'episode', 'series', '1', 'slept', 'jenny', 'seen', 'phoebe', 'thomas', 'played', 'cheryl', 'cassidy', 'paulines', 'friend', 'also', 'year', '11', 'pupil', 'simons', 'class', 'dumped', 'boyfriend', 'following', 'simons', 'advice', 'would', 'sex', 'later', 'realised', 'due', 'catching', 'crabs', 'friend', 'pauline']
Re-elected in the 2007 election, she was re-named the Minister of International Relations, La Francophonie and for the Estrie Region as well as t

Unnamed: 0,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,A-coref,B-coref,text_clean,tokenized,stopwords_removed
0,Zoe Telford -- played the police officer girlf...,her,274,cheryl cassidy,191,pauline,207,True,False,zoe telford played the police officer girlfri...,"[zoe, telford, played, the, police, officer, g...","[zoe, telford, played, police, officer, girlfr..."
1,"He grew up in Evanston, Illinois the second ol...",His,284,mackenzie,228,bernard leach,251,True,False,he grew up in evanston illinois the second old...,"[he, grew, up, in, evanston, illinois, the, se...","[grew, evanston, illinois, second, oldest, fiv..."
2,"He had been reelected to Congress, but resigne...",his,265,angeloz,173,de la sota,246,False,True,he had been reelected to congress but resigned...,"[he, had, been, reelected, to, congress, but, ...","[reelected, congress, resigned, 1990, accept, ..."
3,The current members of Crime have also perform...,his,321,hell,174,henry rosenthal,336,False,True,the current members of crime have also perform...,"[the, current, members, of, crime, have, also,...","[current, members, crime, also, performed, san..."
4,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,kitty oppenheimer,219,rivera,294,False,True,her santa fe opera debut in 2005 was as nuria ...,"[her, santa, fe, opera, debut, in, 2005, was, ...","[santa, fe, opera, debut, 2005, nuria, revised..."
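In the `stopwords_removed` output above, `he` and `her` are gone, which is a problem for a task whose target is a pronoun. If stop-word removal were still wanted, one option would be to protect the pronouns explicitly. The sketch below is an assumption-laden illustration: `PRONOUNS`, `remove_stopwords_keep` and the tiny `toy_stop` set are hypothetical, and in practice the stop list would come from a library.

```python
# Hypothetical list of pronouns to protect; adjust to the pronouns in the data
PRONOUNS = {"he", "him", "his", "she", "her", "hers"}


def remove_stopwords_keep(tokens, stop_words, keep=PRONOUNS):
    """Drop stop words from `tokens`, except the ones listed in `keep`."""
    effective_stop = set(stop_words) - set(keep)
    return [tok for tok in tokens if tok not in effective_stop]


toy_stop = {"the", "of", "and", "he", "her", "with"}  # illustrative stop list
tokens = ["he", "slept", "with", "jenny", "and", "her", "friend"]
print(remove_stopwords_keep(tokens, toy_stop))
# ['he', 'slept', 'jenny', 'her', 'friend']
```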


### Stemming

In [None]:
# try all three stemmers, then pick the one that gives the best results

#### PorterStemmer

In [20]:
from nltk.stem import PorterStemmer

In [21]:
def porter_stemmer(text):
    """
        Stem words in list of tokenized words with PorterStemmer
    """
    stemmer = nltk.PorterStemmer()
    stems = [stemmer.stem(i) for i in text]
    return stems

In [22]:
train_df_clean['porter_stemmer'] = train_df_clean['tokenized'].apply(lambda x: porter_stemmer(x))

# double check
print(train_df_clean["Text"][0])
print(train_df_clean["porter_stemmer"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["porter_stemmer"][100])

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
['zoe', 'telford', 'play', 'the', 'polic', 'offic', 'girlfriend', 'of', 'simon', 'maggi', 'dump', 'by', 'simon', 'in', 'the', 'final', 'episod', 'of', 'seri', '1', 'after', 'he', 'slept', 'with', 'jenni', 'and', 'is', 'not', 'seen', 'again', 'phoeb', 'thoma', 'play', 'cheryl', 'cassidi', 'paulin', 'friend', 'and', 'also', 'a', 'year', '11', 'pupil', 'in', 'simon', 'class', 'dump', 'her', 'boyfriend', 'follow', 'simon', 'advic', 'after', 'he', 'would', 'not', 'have', 'sex', 'with', 'her', 'but', 'later', 'realis', 'thi', 'wa', 'due', 'to', 'him', 'catch', 'crab', 'off

#### SnowballStemmer


In [23]:
from nltk.stem import SnowballStemmer


In [24]:
def snowball_stemmer(text):
    """
        Stem words in list of tokenized words with SnowballStemmer
    """
    stemmer = nltk.SnowballStemmer("english")
    stems = [stemmer.stem(i) for i in text]
    return stems

In [25]:
train_df_clean['snowball_stemmer'] = train_df_clean['tokenized'].apply(lambda x: snowball_stemmer(x))

# double check
print(train_df_clean["Text"][0])
print(train_df_clean["snowball_stemmer"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["snowball_stemmer"][100])

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
['zoe', 'telford', 'play', 'the', 'polic', 'offic', 'girlfriend', 'of', 'simon', 'maggi', 'dump', 'by', 'simon', 'in', 'the', 'final', 'episod', 'of', 'seri', '1', 'after', 'he', 'slept', 'with', 'jenni', 'and', 'is', 'not', 'seen', 'again', 'phoeb', 'thoma', 'play', 'cheryl', 'cassidi', 'paulin', 'friend', 'and', 'also', 'a', 'year', '11', 'pupil', 'in', 'simon', 'class', 'dump', 'her', 'boyfriend', 'follow', 'simon', 'advic', 'after', 'he', 'would', 'not', 'have', 'sex', 'with', 'her', 'but', 'later', 'realis', 'this', 'was', 'due', 'to', 'him', 'catch', 'crab', 'o

#### LancasterStemmer 

In [26]:
from nltk.stem import LancasterStemmer

In [27]:
def lancaster_stemmer(text):
    """
        Stem words in list of tokenized words with LancasterStemmer
    """
    stemmer = nltk.LancasterStemmer()
    stems = [stemmer.stem(i) for i in text]
    return stems

In [28]:
train_df_clean['lancaster_stemmer'] = train_df_clean['tokenized'].apply(lambda x: lancaster_stemmer(x))

# double check
print(train_df_clean["Text"][0])
print(train_df_clean["lancaster_stemmer"][0])
print(train_df_clean["Text"][100])
print(train_df_clean["lancaster_stemmer"][100])
display(train_df_clean.head())

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
['zoe', 'telford', 'play', 'the', 'pol', 'off', 'girlfriend', 'of', 'simon', 'maggy', 'dump', 'by', 'simon', 'in', 'the', 'fin', 'episod', 'of', 'sery', '1', 'aft', 'he', 'slept', 'with', 'jenny', 'and', 'is', 'not', 'seen', 'again', 'phoeb', 'thoma', 'play', 'cheryl', 'cassidy', 'paulin', 'friend', 'and', 'also', 'a', 'year', '11', 'pupil', 'in', 'simon', 'class', 'dump', 'her', 'boyfriend', 'follow', 'simon', 'adv', 'aft', 'he', 'would', 'not', 'hav', 'sex', 'with', 'her', 'but', 'lat', 'real', 'thi', 'was', 'due', 'to', 'him', 'catch', 'crab', 'off', 'her', 'frien

Unnamed: 0,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,A-coref,B-coref,text_clean,tokenized,porter_stemmer,snowball_stemmer,lancaster_stemmer
0,Zoe Telford -- played the police officer girlf...,her,274,cheryl cassidy,191,pauline,207,True,False,zoe telford played the police officer girlfri...,"[zoe, telford, played, the, police, officer, g...","[zoe, telford, play, the, polic, offic, girlfr...","[zoe, telford, play, the, polic, offic, girlfr...","[zoe, telford, play, the, pol, off, girlfriend..."
1,"He grew up in Evanston, Illinois the second ol...",His,284,mackenzie,228,bernard leach,251,True,False,he grew up in evanston illinois the second old...,"[he, grew, up, in, evanston, illinois, the, se...","[he, grew, up, in, evanston, illinoi, the, sec...","[he, grew, up, in, evanston, illinoi, the, sec...","[he, grew, up, in, evanston, illino, the, seco..."
2,"He had been reelected to Congress, but resigne...",his,265,angeloz,173,de la sota,246,False,True,he had been reelected to congress but resigned...,"[he, had, been, reelected, to, congress, but, ...","[he, had, been, reelect, to, congress, but, re...","[he, had, been, reelect, to, congress, but, re...","[he, had, been, reelect, to, congress, but, re..."
3,The current members of Crime have also perform...,his,321,hell,174,henry rosenthal,336,False,True,the current members of crime have also perform...,"[the, current, members, of, crime, have, also,...","[the, current, member, of, crime, have, also, ...","[the, current, member, of, crime, have, also, ...","[the, cur, memb, of, crim, hav, also, perform,..."
4,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,kitty oppenheimer,219,rivera,294,False,True,her santa fe opera debut in 2005 was as nuria ...,"[her, santa, fe, opera, debut, in, 2005, was, ...","[her, santa, fe, opera, debut, in, 2005, wa, a...","[her, santa, fe, opera, debut, in, 2005, was, ...","[her, sant, fe, oper, debut, in, 2005, was, as..."


#### Compare 

In [29]:
print(train_df_clean["Text"][0])
print(train_df_clean["porter_stemmer"][0])
print(train_df_clean["snowball_stemmer"][0])
print(train_df_clean["lancaster_stemmer"][0])

Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.
['zoe', 'telford', 'play', 'the', 'polic', 'offic', 'girlfriend', 'of', 'simon', 'maggi', 'dump', 'by', 'simon', 'in', 'the', 'final', 'episod', 'of', 'seri', '1', 'after', 'he', 'slept', 'with', 'jenni', 'and', 'is', 'not', 'seen', 'again', 'phoeb', 'thoma', 'play', 'cheryl', 'cassidi', 'paulin', 'friend', 'and', 'also', 'a', 'year', '11', 'pupil', 'in', 'simon', 'class', 'dump', 'her', 'boyfriend', 'follow', 'simon', 'advic', 'after', 'he', 'would', 'not', 'have', 'sex', 'with', 'her', 'but', 'later', 'realis', 'thi', 'wa', 'due', 'to', 'him', 'catch', 'crab', 'off

- We discard the Lancaster stemmer because it is too aggressive.
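"Too aggressive" can be given a rough number by measuring how many characters each stemmer removes. The sketch below uses the first six tokens of row 0 and their Porter and Lancaster stems as printed in the outputs above; the helper name `fraction_removed` is an illustrative assumption, and six tokens are of course not a representative sample.

```python
def fraction_removed(tokens, stems):
    """Share of characters that stemming removed from `tokens`."""
    total = sum(len(t) for t in tokens)
    kept = sum(len(s) for s in stems)
    return (total - kept) / total


# First six tokens of row 0 and their stems, copied from the outputs above
tokens    = ['zoe', 'telford', 'played', 'the', 'police', 'officer']
porter    = ['zoe', 'telford', 'play', 'the', 'polic', 'offic']
lancaster = ['zoe', 'telford', 'play', 'the', 'pol', 'off']

print(fraction_removed(tokens, porter))     # 0.15625
print(fraction_removed(tokens, lancaster))  # 0.28125
```

On this small sample Lancaster removes almost twice as many characters as Porter, and stems such as `pol` and `off` are no longer recognizable as `police` and `officer`.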


In [77]:

from nltk.stem import PorterStemmer, WordNetLemmatizer#, LancasterStemmer, SnowballStemmer
from nltk.corpus import stopwords



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bernatsort/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [73]:
# Tokenization
tokens = nltk.word_tokenize(clean_text)
tokens

['He',
 'grew',
 'up',
 'in',
 'Evanston',
 'Illinois',
 'the',
 'second',
 'oldest',
 'of',
 'five',
 'children',
 'including',
 'his',
 'brothers',
 'Fred',
 'and',
 'Gordon',
 'and',
 'sisters',
 'Marge',
 'Peppy',
 'and',
 'Marilyn',
 'His',
 'high',
 'school',
 'days',
 'were',
 'spent',
 'at',
 'New',
 'Trier',
 'High',
 'School',
 'in',
 'Winnetka',
 'Illinois',
 'MacKenzie',
 'studied',
 'with',
 'Bernard',
 'Leach',
 'from',
 'to',
 'His',
 'simple',
 'wheelthrown',
 'functional',
 'pottery',
 'is',
 'heavily',
 'influenced',
 'by',
 'the',
 'oriental',
 'aesthetic',
 'of',
 'Shoji',
 'Hamada',
 'and',
 'Kanjiro',
 'Kawai']

In [168]:
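def preprocess_text(df, stop=None, n=1, col='text_clean'):
    '''Tokenize a text column, drop stop words, lemmatize, and return one flat token list.'''
    # The original default `stop=stop` referenced a name that was never defined
    # (the cell that built it is commented out), which raised a NameError.
    if stop is None:
        stop = set(stopwords.words('english'))  # needs nltk.download('stopwords')
    lem = WordNetLemmatizer()  # needs nltk.download('wordnet')

    new_corpus = []
    for text in df[col]:
        # Keep tokens that are not stop words and are longer than n characters
        words = [w for w in nltk.word_tokenize(text) if w not in stop]
        words = [lem.lemmatize(w) for w in words if len(w) > n]
        new_corpus.append(words)

    # Flatten the per-row token lists into a single list
    return [word for l in new_corpus for word in l]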

## X_train and y_train

In [None]:
X_train = train_df[imp_features]
y_train = train_df[target_col]

In [None]:
# features (train dataset)
X_train

Unnamed: 0,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset
0,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,Pauline,207
1,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,Bernard Leach,251
2,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,De la Sota,246
3,The current members of Crime have also perform...,his,321,Hell,174,Henry Rosenthal,336
4,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,Rivera,294
...,...,...,...,...,...,...,...
1995,"Faye's third husband, Paul Resnick, reported t...",her,433,Nicole,255,Faye,328
1996,The plot of the film focuses on the life of a ...,her,246,Doris Chu,111,Mei,215
1997,Grant played the part in Trevor Nunn's movie a...,she,348,Maria,259,Imelda Staunton,266
1998,The fashion house specialised in hand-printed ...,She,284,Helen,145,Suzanne Bartsch,208


In [None]:
# target (train dataset)
y_train

Unnamed: 0,A-coref,B-coref
0,True,False
1,True,False
2,False,True
3,False,True
4,False,True
...,...,...
1995,False,True
1996,False,True
1997,True,False
1998,True,False


- There may be cases where A-coref == False and B-coref == False, which means the pronoun refers to neither of the two proposed names.
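Since neither candidate may be correct, the two boolean columns describe three possible outcomes. One way to collapse them into a single target, sketched under the assumption that both flags are never `True` at once (the function name `coref_label` and the label strings are illustrative):

```python
def coref_label(a_coref, b_coref):
    """Collapse the two boolean targets into one of 'A', 'B' or 'NEITHER'."""
    if a_coref and b_coref:
        # Assumed not to occur; fail loudly rather than pick one silently
        raise ValueError("a pronoun cannot refer to both A and B")
    if a_coref:
        return "A"
    if b_coref:
        return "B"
    return "NEITHER"


print(coref_label(True, False))   # A
print(coref_label(False, True))   # B
print(coref_label(False, False))  # NEITHER
```

Applied row-wise, for example `train_df.apply(lambda r: coref_label(r["A-coref"], r["B-coref"]), axis=1)`, this would give a single three-class target column.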