# Tag Sentences with spaCy

Use this code if you manually download a CSV of results from SPIKE's UI (as opposed to using the API).
See the example queries below. The relevant lists are in the repository under `../data/lists`, so you can upload them to SPIKE's UI.

In [1]:
import pandas as pd
import spacy
pd.set_option('max_colwidth', 400)

In [2]:
nlp = spacy.load("en_core_web_trf")

## Creating a Test Set

In the previous step you created a list of musicians scraped from Wikipedia.
Upload the list to SPIKE and run a simple query with it in the basic search:
`musicians:{musicians}`. Since the data is taken from Wikipedia, which is well edited, it is safe to require an exact case match for names, since we are looking for PERSONs.

[Here is an example](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoibXVzaWNpYW5zJTNBdyUzRCU3Qm11c2ljaWFucyUzQWM1MGZiZDZlNjQ1YTk3YzE2ZjgwZDE4MWVkNzJhZjNhYmFiZWYyYjcxNTRkMTZiNmExZTkzYjM0ZjBkNWE2MjIlN0QiLCJleHBhbnNpb25PdmVycmlkZXMiOiIiLCJjYXNlU3RyYXRlZ3kiOiJleGFjdCIsImZpbHRlcnMiOiJ7fSIsImlzRnV6enkiOiJmYWxzZSJ9&autoRun=true)

Run this query, click "Download CSV", and save the file as `../data/queries/simple_musicians.csv`.

In [11]:
df = pd.read_csv('../data/queries/simple_musicians.csv')
df

Unnamed: 0,sentence_id,musicians,title,musicians_first_index,musicians_last_index,sentence_text
0,3215,Johann Sebastian Bach,Emory Remington,25,27,"A large ensemble of trombonists would gather to play music written for multiple trombones or transcribed from other sources , such as the chorales of Johann Sebastian Bach ."
1,7323,Arapaho,Minnie Devereaux,35,35,"Movie trade magazines claimed she studied at the Carlisle Indian Industrial School , a Pennsylvania boarding school for Native American students , and she appears on the Carlisle rolls as Minerva Burgess of Cheyenne and Arapaho heritage ."
2,9251,John Tchicai,Andrew Cyrille,1,2,With John Tchicai and Reggie Workman
3,16092,Justin Bieber,James Corden,17,18,"The sketch included appearances by then UK Prime Minister Gordon Brown , JLS , Paul McCartney and Justin Bieber ."
4,18726,Jools Holland,Golden Silvers,10,11,"The band performed two songs on "" Later ... with Jools Holland "" during May 2009 and have supported Blur at Hyde Park , London ."
...,...,...,...,...,...,...
94464,192847243,Lauryn Hill,End of Time (song),40,41,"Priya Elan of "" NME "" commented that "" End of Time "" is "" much more instantaneous "" than "" Run the World ( Girls ) "" , and added that its sneaky bassline , which is reminiscent of Lauryn Hill 's work , "" is the perfect counter - balance to those pesky military drums "" ."
94465,192848655,Johann Sebastian Bach,Johann Gramann,0,2,"Johann Sebastian Bach used it in cantatas and organ preludes , including "" Gottlob ! nun geht das Jahr zu Ende "" , BWV 28 for the Sunday after Christmas ."
94466,192849985,Faith Hill,HanaLena (Nash Street),15,16,"The group continued honing their craft playing bluegrass festivals and opening for bigger names like Faith Hill , Tim McGraw , and Jeff Bates ."
94467,192851082,Nicolas Bacri,David Stern (conductor),20,21,"He has premiered four new operas since 2010 : Gil Shochat 's A Child Dreams at the Israel Opera , Nicolas Bacri 's Cosi Fanciulli commissioned by Opera Fuoco and performed at the Théâtre des Champs - Elysées in Paris , Ben Moore 's Enemies , A Love Story in Palm Beach and Jan Sandstrom 's The Rococo Machine in Drottningholm , Sweden ."


If the aggregation pane shows that some musicians appear very often while others appear only once, it may be best to remove duplicates so that the dataset covers as many different musicians as possible.

In [12]:
# Keep a single sentence per musician.
df = df.drop_duplicates(subset=['musicians'])
df = df[df["sentence_text"].str.len() > 50]  # remove short sentences
# The same sentence might appear twice, for example when it mentions two musicians. Remove the duplicates.
df = df.drop_duplicates(subset=["sentence_text"])
df


Unnamed: 0,sentence_id,musicians,title,musicians_first_index,musicians_last_index,sentence_text
0,3215,Johann Sebastian Bach,Emory Remington,25,27,"A large ensemble of trombonists would gather to play music written for multiple trombones or transcribed from other sources , such as the chorales of Johann Sebastian Bach ."
1,7323,Arapaho,Minnie Devereaux,35,35,"Movie trade magazines claimed she studied at the Carlisle Indian Industrial School , a Pennsylvania boarding school for Native American students , and she appears on the Carlisle rolls as Minerva Burgess of Cheyenne and Arapaho heritage ."
3,16092,Justin Bieber,James Corden,17,18,"The sketch included appearances by then UK Prime Minister Gordon Brown , JLS , Paul McCartney and Justin Bieber ."
4,18726,Jools Holland,Golden Silvers,10,11,"The band performed two songs on "" Later ... with Jools Holland "" during May 2009 and have supported Blur at Hyde Park , London ."
5,21434,Dann Huff,Gems: The Duets Collection,17,18,"It features productions by acclaimed songwriters , producers and musicians including A.R. Rahman , David Foster , Dann Huff and Rudy Perez ."
...,...,...,...,...,...,...
62173,126943626,Zoran Lesendrić,Srđan Šaper,8,9,The following year Nebojša Krstić and Piloti singer Zoran Lesendrić Kiki and he formed Dobrovoljno Pevačko Društvo .
66882,136917572,Scots Gaels,Gaels,1,2,"The Scots Gaels derive from the kingdom of Dál Riata , which included parts of western Scotland and northern Ireland ."
71627,146176582,Magali Luyten,Beautiful Sin,9,10,"Beautiful Sin began when Uli Kusch met Belgian singer Magali Luyten , and wanted to record an album with her band ."
86086,175634528,Walking bass,Bassline,0,1,"Walking bass in the pedal keyboard part of Baroque organ music ( J.S. Bach 's "" Nun komm , der Heiden Heiland "" , BWV 659 , from the Great Eighteen Chorale Preludes ):"


## Data Manipulation

We want to tag each musician as `[B]` (first token) or `[I]` (following tokens), any other non-musician PERSON as `[PB]` or `[PI]`, and all other tokens as `[O]`:
```
Trumpeter-[O]
Ted-[B]
Curson-[I]
introduced-[O]
him-[O]
to-[O]
pianist-[O]
Cecil-[B]
Taylor-[I]
when-[O]
Cyrille-[O]
was-[O]
18-[O]
```

Note, however, that SPIKE's output shows only one match per line. If a sentence has several matches, it appears several times in the CSV, each time pointing to a different match: one row marks "Ted Curson" as the musician and another row marks "Cecil Taylor". Also note that while we may suspect (from common sense or background knowledge) that Cyrille is also a musician, the query above does not capture that, so `Cyrille` is tagged as `[O]`.
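Because the CSV repeats a sentence once per match, the matches can be merged back into one tagged sentence. The sketch below is one possible way to do that, not part of the notebook. It assumes that the SPIKE indices count the space-separated tokens of `sentence_text` and that rows are plain dictionaries (for example from `df.to_dict("records")`, taken before duplicate sentences are dropped). The helper names `merge_matches` and `tag_spans` are hypothetical, and the `sentence_id` value is made up. It only produces `[B]`, `[I]` and `[O]`, not the PERSON tags.

```python
from collections import defaultdict


def merge_matches(rows):
    """Collect every (first_index, last_index) span reported for the same sentence."""
    spans = defaultdict(set)
    texts = {}
    for row in rows:
        sid = row["sentence_id"]
        texts[sid] = row["sentence_text"]
        spans[sid].add((row["musicians_first_index"], row["musicians_last_index"]))
    return {sid: (texts[sid], sorted(spans[sid])) for sid in texts}


def tag_spans(sentence_text, spans):
    """Tag whitespace-separated tokens: [B]/[I] inside a span, [O] elsewhere."""
    tokens = sentence_text.split(" ")
    tags = ["O"] * len(tokens)
    for first, last in spans:
        tags[first] = "B"
        for i in range(first + 1, last + 1):
            tags[i] = "I"
    return " ".join(f"{tok}-[{tag}]" for tok, tag in zip(tokens, tags))


# Two CSV rows for the same sentence, each pointing to a different match.
sentence = "Trumpeter Ted Curson introduced him to pianist Cecil Taylor when Cyrille was 18"
rows = [
    {"sentence_id": 1, "sentence_text": sentence,
     "musicians_first_index": 1, "musicians_last_index": 2},
    {"sentence_id": 1, "sentence_text": sentence,
     "musicians_first_index": 7, "musicians_last_index": 8},
]

text, spans = merge_matches(rows)[1]
tagged = tag_spans(text, spans)
print(tagged)
```

With both rows merged, `Ted Curson` and `Cecil Taylor` are tagged in the same output line, while `Cyrille` stays `[O]` because no row points to it.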


In [13]:
musicians = df["musicians"].unique()
musicians

array(['Johann Sebastian Bach', 'Arapaho', 'Justin Bieber',
       'Jools Holland', 'Dann Huff', 'Robbie Williams',
       'Scott Fitzgerald', 'Albert Lee', 'Jimmy Page', 'Kevin Sharp',
       'Karl Wallinger', 'Don Cherry', 'Carlos Santana', 'Paco de Lucía',
       'Jack Black', 'Rosalía', 'Donna Summer', 'Syd Barrett',
       'Justin Timberlake', 'Lauryn Hill', 'Duke Ellington', 'Yoko Ono',
       'Wes Montgomery', 'Luke Kelly', 'Richard Manuel',
       'Michael Chapman', 'Armin van Buuren', 'Joe Walsh', 'Hank Ballard',
       'Niacin', 'Andrew Bird', 'Shirley Manson', 'Eric Dolphy',
       'Bloodgood', 'Martina McBride', 'Noel Gallagher', 'Gene Pitney',
       'Paul Nicholas', 'Paul Ryan', 'Chris Brown', 'John Lennon',
       'Narada', 'Scotty Emerick', 'Becky G', 'Charlie Chan',
       'Katchafire', 'Angèle', 'Janis Joplin', 'Rosendo', 'Miss Kittin',
       'Tony Kanal', 'Otis Rush', 'Larry Bell', 'Yui Horie', 'Fred Durst',
       'Wizkid', 'Soulja Boy', 'Steve Swallow', 'Gene Autr

In [7]:
df = df.sample(n=200)

In [14]:
def tag_sentence(row):
    """Tag the SPIKE match as [B]/[I], other PERSON tokens as [PB]/[PI], and the rest as [O]."""
    tokens = []
    start = row["musicians_first_index"]
    end = row["musicians_last_index"]
    start_entity = "O"  # class of the previous token: "O", "PB" or "PI"
    for i, token in enumerate(nlp(row["sentence_text"])):
        if start == i:
            tag = "B"  # first token of the match reported by SPIKE
        elif token.text == "'s":
            tag = "O"  # a possessive marker is never part of a name
        elif start < i <= end:
            tag = "I"  # remaining tokens of the SPIKE match
        elif token.ent_type_ == "PERSON":
            # Heuristic: in the dependency parse a first name usually attaches to the
            # surname, so the surname is the closest ancestor of the first name and
            # the first name is a child of the surname.
            surname = [t.text for t in token.ancestors]
            first_name = [t.text for t in token.children]
            if not first_name and surname and f"{token.text} {surname[0]}" in musicians:
                tag = "B"  # first name of another listed musician
            elif not first_name and not surname and token.text in musicians:
                tag = "B"  # single-token name of a listed musician
            elif first_name and f"{first_name[0]} {token.text}" in musicians:
                tag = "I"  # surname of another listed musician
            elif start_entity == "O":
                tag = "PB"  # first token of a non-musician PERSON
            else:
                tag = "PI"  # continuation of a non-musician PERSON
        else:
            tag = "O"
        # Every token gets exactly one tag, so no token is dropped from the output.
        tokens.append(f"{token.text}-[{tag}]")
        start_entity = tag if tag in ("PB", "PI") else "O"
    return " ".join(tokens)


# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df = df.assign(tagged_sentence=df.apply(tag_sentence, axis=1))


In [15]:
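# Show the first 10 rows. The index labels have gaps after filtering,
# so limit by position rather than by comparing the label to 10.
for i, row in df.head(10).iterrows():
    print(i, row["musicians"])
    print(row["sentence_text"])
    print(row["tagged_sentence"], '\n')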

0 Johann Sebastian Bach
A large ensemble of trombonists would gather to play music written for multiple trombones or transcribed from other sources , such as the chorales of Johann Sebastian Bach .
A-[O] large-[O] ensemble-[O] of-[O] trombonists-[O] would-[O] gather-[O] to-[O] play-[O] music-[O] written-[O] for-[O] multiple-[O] trombones-[O] or-[O] transcribed-[O] from-[O] other-[O] sources-[O] ,-[O] such-[O] as-[O] the-[O] chorales-[O] of-[O] Johann-[B] Sebastian-[I] Bach-[I] .-[O] 

1 Arapaho
Movie trade magazines claimed she studied at the Carlisle Indian Industrial School , a Pennsylvania boarding school for Native American students , and she appears on the Carlisle rolls as Minerva Burgess of Cheyenne and Arapaho heritage .
Movie-[O] trade-[O] magazines-[O] claimed-[O] she-[O] studied-[O] at-[O] the-[O] Carlisle-[O] Indian-[O] Industrial-[O] School-[O] ,-[O] a-[O] Pennsylvania-[O] boarding-[O] school-[O] for-[O] Native-[O] American-[O] students-[O] ,-[O] and-[O] she-[O] appears-[O

After tagging, write the sentences to a file. Make sure to go over the file manually and correct any wrong tags.

In [16]:
with open("../data/output/test_set_dirty.txt", "w", encoding="utf-8") as f:
    for i, row in df.iterrows():
        # Optional: skip sentences that contain long lists of people.
        if row['tagged_sentence'].count('[PB]') < 4:
            f.write(f"{row['tagged_sentence']}\n")
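When correcting tags by hand it is easy to leave an `[I]` or `[PI]` without the `[B]` or `[PB]` that should precede it. The following is a minimal sketch of a consistency check, assuming each line holds space-separated `token-[TAG]` items as written above. The function name `check_tagged_line` and the rules it enforces are illustrative assumptions, not part of the original workflow.

```python
import re

# One item looks like "Bach-[I]"; the token itself may contain hyphens.
TOKEN_TAG = re.compile(r"^(.*)-\[(O|B|I|PB|PI)\]$")


def check_tagged_line(line):
    """Return a list of problems found in one tagged sentence (empty if none)."""
    problems = []
    previous = "O"
    for position, item in enumerate(line.split(" ")):
        match = TOKEN_TAG.match(item)
        if match is None:
            problems.append(f"item {position}: cannot parse {item!r}")
            previous = "O"
            continue
        tag = match.group(2)
        if tag == "I" and previous not in ("B", "I"):
            problems.append(f"item {position}: [I] does not follow [B] or [I]")
        elif tag == "PI" and previous not in ("PB", "PI"):
            problems.append(f"item {position}: [PI] does not follow [PB] or [PI]")
        previous = tag
    return problems


good = "Trumpeter-[O] Ted-[B] Curson-[I] introduced-[O] him-[O]"
bad = "Trumpeter-[O] Ted-[O] Curson-[I] introduced-[O] him-[O]"
print(check_tagged_line(good))
print(check_tagged_line(bad))
```

To check a whole file, call the function on each line with the trailing newline stripped and print the line number together with any reported problems.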

## Negative Examples

To create negative examples, simply search for [PERSONs](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoiJTNDRSUzRW5lZ2F0aXZlJTNBZSUzRFBFUlNPTiIsImV4cGFuc2lvbk92ZXJyaWRlcyI6IiIsImNhc2VTdHJhdGVneSI6ImV4YWN0IiwiZmlsdGVycyI6Int9IiwiaXNGdXp6eSI6ImZhbHNlIn0=&autoRun=true), or you can curate a list of random names and run a [query](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoiJTNDRSUzRW5lZ2F0aXZlcyUzQSU3QnJhbmRvbV9wZW9wbGUlM0E5ZGVhOGMzNTg5NTU2MGZlMTc0MmEzY2VlZjg2NjNiNGI3MzJjZjdiZWQyMGRlYTViNWEzNjYzNDAxNzE3ZjA4JTdEIiwiZXhwYW5zaW9uT3ZlcnJpZGVzIjoiIiwiY2FzZVN0cmF0ZWd5IjoiZXhhY3QiLCJmaWx0ZXJzIjoie30iLCJpc0Z1enp5IjoiZmFsc2UifQ==&autoRun=true) looking for them (to create such a list, you can use the aggregation panel with the previous query). Download results and repeat the above process, but this time, tag all PERSONs as PB/PI.

### Limit to proper nouns
Since NER is not 100% accurate, when running a query as wide as `:e=PERSON`, some false positives may surface. We can avoid the most obvious ones by [limiting the results to proper nouns](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoibmVnYXRpdmUlM0FlJTNEUEVSU09OJTI2dGFnJTNETk5QIiwiZXhwYW5zaW9uT3ZlcnJpZGVzIjoiIiwiY2FzZVN0cmF0ZWd5IjoiZXhhY3QiLCJmaWx0ZXJzIjoie30iLCJpc0Z1enp5IjoiZmFsc2UifQ==&autoRun=true). This removes results like `3-2` by adding another level of validation from the syntactic representation. 

### Download more examples than you need, shuffle and sample
SPIKE's indexing is at the document level. While multi-processing ensures some minimal level of randomization, the sentences are always retrieved in the same order, and each document is searched completely before the search moves on to the next document.
When downloading the results of a general query such as `:e=PERSON`, you may notice that many results come from the same document and naturally mention the same person. To avoid this, you can download a larger batch of 20,000 sentences, shuffle it, and only then tag a subset of the batch.

In [8]:
from random import shuffle

# Shuffle the downloaded rows, keeping the header line first.
with open('../data/queries/simple_negatives_large.csv', 'r') as fin, \
        open('../data/queries/simple_negatives_shuffled.csv', 'w') as fout:
    lines = fin.readlines()
    cols = lines[0]
    sents = lines[1:]
    fout.write(cols)
    shuffle(sents)
    for line in sents:
        fout.write(line)

In [11]:
df_neg = pd.read_csv('../data/queries/simple_negatives_large.csv')
df_neg = df_neg.sample(frac=1)  # shuffle the rows
df_neg = df_neg.drop_duplicates(subset=['negative'])
df_neg = df_neg[df_neg["sentence_text"].str.len() > 50]  # remove short sentences
df_neg = df_neg.drop_duplicates(subset=["sentence_text"])
df_sample = df_neg.head(500).copy()  # copy so that columns can be added later
df_sample

Unnamed: 0,sentence_id,negative,negative_first_index,negative_last_index,sentence_text
2603,9173,Max Roach,21,22,"His first drum teachers were fellow Brooklyn - based drummers Willie Jones and Lenny McBrowne ; through them , Cyrille met Max Roach ."
476,1313,Biel,35,35,"I love New York "" , Crossover der aktuellen Kunst , Museum Ludwig , Köln , Verlag DuMont , 1999 \n „ Transfert : Kunst I m Urbanen Raum "" Art dans l’espace Urban , Biel , 2000 \n „"
314,808,Khoperia,13,13,"The General - Director of Rustavi 2 , Nika Tabatadze , denied that Khoperia , was under pressure from the Rustavi 2 management or the authorities ."
2957,10446,Laffer,9,9,Economist John Quiggin distinguishes between the Laffer curve and Laffer 's analysis of tax rates .
1564,5696,ET JO,21,22,"It was executed in 1513 as shown on the work itself - Francesco Tassi believed it was inscribed "" NOB.PAULUS , ET JO . FRATRES DE CASPOTTIS TRINO OBTULERUNT HAEC 1513 "" ."
...,...,...,...,...,...
854,3150,Benjamin Coplen,11,12,"This bog was discovered in May 1876 by a homesteader , Benjamin Coplen , who found what seemed to be a gigantic bone in the peat - covered water ."
919,3335,James Teit,23,24,"Signed in Spences Bridge on May 10 , 1911 by a committee of 16 chiefs of the St'at'imc , taken down by anthropologist James Teit , it is an assertion of sovereignty over traditional territories as well as a protest against recent alienations of land by white settlers at Seton Portage due to railway construction ."
1455,5302,Ammirul Emmran bin Mazlan,0,3,Ammirul Emmran bin Mazlan ( born 18 April 1995 ) is a Singaporean footballer who plays as a midfielder for S.League club Warriors .
1930,7232,Nancy Phillips,1,2,Counselor Nancy Phillips was instrumental in the initial success of the LEAP Program .


In [12]:
def negatives_tag_sentence(row):
    tokens = []
    start_entity = "O"
    for i, token in enumerate(nlp(row["sentence_text"])):
        ent_type = token.ent_type_
        if ent_type == "PERSON":
            if start_entity == "O":
                tokens.append(f"{token.text}-[PB]")
                start_entity = "PB"
            else:
                tokens.append(f"{token.text}-[PI]")
        else:
            tokens.append(f"{token.text}-[O]")
            start_entity = "O"
            
    return " ".join(tokens)


# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_sample = df_sample.assign(tagged_sentence=df_sample.apply(negatives_tag_sentence, axis=1))


In [13]:
df_sample['tagged_sentence']

2603                                                                                                                                                                      His-[O] first-[O] drum-[O] teachers-[O] were-[O] fellow-[O] Brooklyn-[O] --[O] based-[O] drummers-[O] Willie-[PB] Jones-[PI] and-[O] Lenny-[PB] McBrowne-[PI] ;-[O] through-[O] them-[O] ,-[O] Cyrille-[PB] met-[O] Max-[PB] Roach-[PI] .-[O]
476                                                              I-[O] love-[O] New-[O] York-[O] "-[O] ,-[O] Crossover-[O] der-[O] aktuellen-[O] Kunst-[O] ,-[O] Museum-[O] Ludwig-[O] ,-[O] Köln-[O] ,-[O] Verlag-[O] DuMont-[O] ,-[O] 1999-[O] \n -[O] „-[O] Transfert-[O] :-[O] Kunst-[O] I-[O] m-[O] Urbanen-[O] Raum-[O] "-[O] Art-[O] dans-[O] l’espace-[O] Urban-[O] ,-[O] Biel-[O] ,-[O] 2000-[O] \n -[O] „-[O]
314                                                                                                                                                   The-[O] General-[O] --[O] Director

In [14]:
df_sample['tagged_sentence'].to_csv("../data/output/tagged_negatives_shuffled.txt", sep="\t", index=False)

After manually cleaning both files (musicians and negatives), you might want to select about 50 sentences of each (or another ratio), merge them and shuffle.
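The final step can be sketched as follows. This is a minimal example under stated assumptions rather than the notebook's own code: the helper name `sample_and_merge`, the default of 50 sentences per class and the fixed seed are illustrative, and the two input lists stand in for the lines of the manually cleaned files.

```python
import random


def sample_and_merge(positives, negatives, n_each=50, seed=0):
    """Pick up to n_each tagged sentences from each list and shuffle them together."""
    rng = random.Random(seed)  # fixed seed so the test set can be reproduced
    picked = rng.sample(positives, min(n_each, len(positives)))
    picked += rng.sample(negatives, min(n_each, len(negatives)))
    rng.shuffle(picked)
    return picked


# Stand-ins for the cleaned musician and negative sentences.
positives = [f"Musician{i}-[B]" for i in range(80)]
negatives = [f"Person{i}-[PB]" for i in range(60)]

test_set = sample_and_merge(positives, negatives)
print(len(test_set))
```

In practice the two lists would be read from the cleaned files, one tagged sentence per line, and `test_set` would be written out the same way. Changing `n_each` per class is one way to get a different ratio.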