
### Project 

In [16]:
import os
import pandas as pd
import numpy as np

In [17]:
ls

overview.ipynb


In [18]:
print(os.listdir("../input/"))

['test_stage_1.tsv', 'sample_submission_stage_1.csv', 'gap-coreference-master']


In [19]:
data_df = pd.read_csv('../input/test_stage_1.tsv', delimiter='\t')
submission = pd.read_csv('../input/sample_submission_stage_1.csv')

In [20]:
data_df.shape, submission.shape

((2000, 9), (2000, 4))

#### Objective

- Identify the target of a pronoun within a text passage.

- You are provided with the pronoun and two candidate names to which the pronoun could refer. 

- You must create an algorithm capable of deciding whether the pronoun refers to name A, name B, or neither.

#### Evaluation

- Submissions are evaluated using the multi-class logarithmic loss.

- Each pronoun has been labeled with whether it refers to A, B, or NEITHER. 

- For each pronoun, you must submit a set of predicted probabilities (one for each class). The formula is given in the second Evaluation section, after the submission format.


#### Submission

```
ID,             A,        B,        NEITHER
development-1,  0.33333,  0.33333,  0.33333
development-2,  0.33333,  0.33333,  0.33333
development-3,  0.33333,  0.33333,  0.33333
```
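A file in this format can be produced from a data frame. The sketch below builds the uniform baseline shown above; the three IDs are hard-coded for illustration only, whereas in the notebook they would presumably come from `data_df["ID"]`.

```python
import pandas as pd

# Hypothetical IDs; in the notebook they would come from data_df["ID"]
ids = ["development-1", "development-2", "development-3"]

baseline = pd.DataFrame({"ID": ids})
for col in ["A", "B", "NEITHER"]:
    baseline[col] = 1.0 / 3.0  # uniform probability for each class

# Serialise without the index so only the four expected columns are written
csv_text = baseline.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of capturing the string would write the submission to disk.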

#### Evaluation

$$
l = - \frac{1}{\text{n_samples}} \sum_{i = 1}^{\text{n_samples}} \sum_{j = 1}^{\text{n_classes}} y_{ij} \log(p_{ij})
$$

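Where $y_{ij}$ is 1 if the $i$-th sample has true label class $j$ (one of the three classes A, B, NEITHER) and 0 otherwise, and $p_{ij}$ is the predicted probability that the $i$-th sample belongs to class $j$.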
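As a sanity check on the formula, the loss can be computed with an explicit double loop that mirrors the two sums. This is an illustrative sketch, not the competition's scoring code: the function name `multiclass_log_loss` and the clipping constant `eps` are assumptions. It shows that the uniform prediction from the sample submission scores $\log 3$ whatever the true labels are.

```python
import math

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean multi-class log loss.

    y_true: list of one-hot rows (true labels).
    y_pred: list of rows of predicted probabilities, same shape as y_true.
    eps: hypothetical clipping constant that keeps log() away from zero.
    """
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):   # sum over samples i
        for y, p in zip(true_row, pred_row):         # sum over classes j
            p = min(max(p, eps), 1.0 - eps)
            total += y * math.log(p)
    return -total / len(y_true)

# Three samples whose true classes are A, B and NEITHER respectively
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
uniform_loss = multiclass_log_loss(y_true, uniform)
print(uniform_loss, math.log(3))
```

Any model worth submitting should score below this uniform baseline.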

In [21]:
data_df.head()

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,Pauline,207,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,Bernard Leach,251,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,De la Sota,246,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,Henry Rosenthal,336,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,Rivera,294,http://en.wikipedia.org/wiki/Jessica_Rivera


#### What is this all about 

In [22]:
URL = data_df["URL"][1]
URL

'http://en.wikipedia.org/wiki/Warren_MacKenzie'

In [23]:
# text: String containing the ambiguous pronoun and two candidate names. About a paragraph in length
text = data_df["Text"][1]
print(text)

He grew up in Evanston, Illinois the second oldest of five children including his brothers, Fred and Gordon and sisters, Marge (Peppy) and Marilyn. His high school days were spent at New Trier High School in Winnetka, Illinois. MacKenzie studied with Bernard Leach from 1949 to 1952. His simple, wheel-thrown functional pottery is heavily influenced by the oriental aesthetic of Shoji Hamada and Kanjiro Kawai.


In [24]:
pronoun = data_df["Pronoun"][1]
pronoun

'His'

In [25]:
# pronoun_offset: Character offset of Pronoun in Column 2 (Text),
# i.e. the 0-based index in Text at which the pronoun starts.
pronoun_offset = data_df["Pronoun-offset"][1]
pronoun_offset

284

In [26]:
text[pronoun_offset:pronoun_offset+len(pronoun)]

'His'

In [27]:
A = data_df["A"][1]
A_offset = data_df["A-offset"][1]
A, A_offset

('MacKenzie', 228)

In [28]:
text[A_offset:A_offset+len(A)]

'MacKenzie'

In [29]:
B = data_df["B"][1]
B_offset = data_df["B-offset"][1]
B, B_offset

('Bernard Leach', 251)

In [30]:
text[B_offset:B_offset+len(B)]

'Bernard Leach'
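The three slicing checks above follow the same pattern, so they can be folded into one helper that verifies every offset of a row at once. This is a minimal sketch: `offsets_match` is a hypothetical name, and the function assumes only the column names shown in `data_df.head()`. It accepts any mapping, so it should work on a row such as `data_df.loc[1]` as well as on the plain dictionary used here.

```python
def offsets_match(row):
    """Return True if every offset points at its word inside row["Text"]."""
    text = row["Text"]
    for word_col, offset_col in [("Pronoun", "Pronoun-offset"),
                                 ("A", "A-offset"),
                                 ("B", "B-offset")]:
        word, start = row[word_col], row[offset_col]
        if text[start:start + len(word)] != word:
            return False
    return True

# Toy row (not taken from the dataset) with hand-computed offsets
toy_row = {
    "Text": "Alice met Beth before she left.",
    "Pronoun": "she", "Pronoun-offset": 22,
    "A": "Alice", "A-offset": 0,
    "B": "Beth", "B-offset": 10,
}
print(offsets_match(toy_row))
```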

In [31]:
submission.head()

Unnamed: 0,ID,A,B,NEITHER
0,development-1,0.33333,0.33333,0.33333
1,development-2,0.33333,0.33333,0.33333
2,development-3,0.33333,0.33333,0.33333
3,development-4,0.33333,0.33333,0.33333
4,development-5,0.33333,0.33333,0.33333


### Pretty printing of examples

In [32]:
from colorama import Fore

def gap_printer2(data_df_row):
    """Print the row's text with A, B and the pronoun highlighted in blue."""
    text = data_df_row["Text"]

    # (begin, end) character spans of the two candidate names and the pronoun,
    # sorted by position in the text
    spans = sorted(
        (data_df_row[offset_col], data_df_row[offset_col] + len(data_df_row[word_col]))
        for word_col, offset_col in [("A", "A-offset"),
                                     ("B", "B-offset"),
                                     ("Pronoun", "Pronoun-offset")]
    )

    # Walk the text left to right, coloring each span and copying the rest unchanged.
    # Concatenating pieces (instead of str.format) keeps any literal braces in the text safe.
    pieces, cursor = [], 0
    for begin, end in spans:
        pieces.append(text[cursor:begin])
        pieces.append(Fore.BLUE + text[begin:end] + Fore.BLACK)
        cursor = end
    pieces.append(text[cursor:])

    print("".join(pieces))
    

In [33]:
k = 200
gap_printer2(data_df.loc[k])

He established public works, a bank, churches, and charitable institutions and sought good relations with the Aborigines. In 1813 he sent **Blaxland**, Wentworth and **Lawson** on an expedition across the Blue Mountains, where they found the great plains of the interior. Central, however to Macquarie's policy was **his** treatment of the emancipists, whom he decreed should be treated as social equals to free-settlers in the colony.


In [34]:
data_df.loc[k]["URL"]

'http://en.wikipedia.org/wiki/History_of_New_South_Wales'

True paragraph from the URL

```
Macquarie served as the last autocratic Governor of New South Wales, from 1810 to 1821 and had a leading role in the social and economic development of New South Wales which saw it transition from a penal colony to a budding free society. He established public works, a bank, churches, and charitable institutions and sought good relations with the Aborigines. In 1813 he sent Blaxland, Wentworth and Lawson on an expedition across the Blue Mountains, where they found the great plains of the interior.[15] Central, however to Macquarie's policy was his treatment of the emancipists, whom he decreed should be treated as social equals to free-settlers in the colony. Against opposition, he appointed emancipists to key government positions including Francis Greenway as colonial architect and William Redfern as a magistrate. London judged his public works to be too expensive and society was scandalised by his treatment of emancipists.[16] His legacy lives on with Macquarie Street, Sydney bearing his name as the well as the New South Wales Parliament and various buildings designed during his tenure including the UNESCO listed Hyde Park Barracks.
```

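It seems this example should be classified as NEITHER: the pronoun "his" refers to Macquarie, who is not one of the two highlighted candidates (Blaxland and Lawson).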

### Dataset with labels

A labeled dataset for this problem is available at the following URL:

https://github.com/google-research-datasets/gap-coreference


In [46]:
print(os.listdir("../input/gap-coreference-master"))

['gap-validation.tsv', 'gap_scorer.py', 'gap-development.tsv', 'CONTRIBUTING.md', 'gap-test.tsv', 'LICENSE', 'constants.py', 'README.md']


In [47]:
train_df = pd.read_csv('../input/gap-coreference-master/gap-development.tsv', delimiter='\t')
submission = pd.read_csv('../input/sample_submission_stage_1.csv')

In [48]:
train_df.head()

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,True,Pauline,207,False,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,True,Bernard Leach,251,False,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,False,De la Sota,246,True,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,False,Henry Rosenthal,336,True,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,False,Rivera,294,True,http://en.wikipedia.org/wiki/Jessica_Rivera


Notice that `data_df` is essentially the same data as `gap-development.tsv` from the repository linked above. The difference is that the repository version also contains the targets, in the columns `A-coref` and `B-coref`.
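Since the competition expects one of three classes, the two boolean columns have to be combined into a single target. A minimal sketch of that mapping follows. The function names `coref_label` and `one_hot` are hypothetical, and the code assumes `A-coref` and `B-coref` are never both true for the same row, which this notebook does not verify.

```python
def coref_label(row):
    """Map the A-coref / B-coref flags of one row to 'A', 'B' or 'NEITHER'."""
    if row["A-coref"]:
        return "A"
    if row["B-coref"]:
        return "B"
    return "NEITHER"

def one_hot(label, classes=("A", "B", "NEITHER")):
    """One-hot vector in the column order of the sample submission."""
    return [1 if label == c else 0 for c in classes]

# Flags copied from rows of train_df.head(), plus a made-up NEITHER case
examples = [
    {"A-coref": True, "B-coref": False},   # development-1
    {"A-coref": False, "B-coref": True},   # development-3
    {"A-coref": False, "B-coref": False},  # hypothetical row
]
labels = [coref_label(r) for r in examples]
print(labels)
print([one_hot(lab) for lab in labels])
```

With pandas, the same function could be applied row-wise, for example `train_df.apply(coref_label, axis=1)`.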

In [49]:
np.mean(train_df["URL"] == data_df["URL"])

1.0

In [53]:
gap_printer2(train_df.loc[k])

He established public works, a bank, churches, and charitable institutions and sought good relations with the Aborigines. In 1813 he sent **Blaxland**, Wentworth and **Lawson** on an expedition across the Blue Mountains, where they found the great plains of the interior. Central, however to Macquarie's policy was **his** treatment of the emancipists, whom he decreed should be treated as social equals to free-settlers in the colony.
