# Tag Sentences with SpaCy

Use this code if you manually download a csv of results from SPIKE's UI (as opposed to using the API).
See examples of queries below. The repository has the relevant lists under `../data/lists`, so you can upload them to SPIKE's UI. 

In [1]:
import pandas as pd
import spacy
pd.set_option('max_colwidth', 400)

In [2]:
nlp = spacy.load("en_core_web_trf")

## Creating a Test Set

In the previous step you created a list of musicians scraped from wikipedia. 
Upload the list to spike, and run a simple query with this list in the basic search:
`<E>:{musicians}`, where `<E>` makes SPIKE capture the full name. Since the data is taken from wikipedia, which is well edited, it is safe to include case restrictions for names, in case we are looking for PERSONs. 

[Here is an example](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoiJTNDRSUzRW5lZ2F0aXZlJTNBZSUzRFBFUlNPTiIsImV4cGFuc2lvbk92ZXJyaWRlcyI6IiIsImNhc2VTdHJhdGVneSI6ImV4YWN0IiwiZmlsdGVycyI6Int9IiwiaXNGdXp6eSI6ImZhbHNlIn0=&autoRun=true)

Simply run this query and click on "Download CSV" and name the file `simple_musicians.csv`. 

In [21]:
df = pd.read_csv('../data/queries/simple_musicians.csv')
df.head()

Unnamed: 0,sentence_id,musicians,article_link,musicians_first_index,musicians_last_index,sentence_text
0,1516,Carole King,https://en.wikipedia.org/wiki?curid=3601134,19,20,"Little Eva had recently had a hit with "" The Loco - Motion , "" after being discovered by Carole King , for whom she babysat ."
1,3201,Jack Benny,https://en.wikipedia.org/wiki?curid=3601297,12,13,One of his bunkmates in the Navy orchestra was comedian / violinist Jack Benny .
2,5954,Bix Beiderbecke,https://en.wikipedia.org/wiki?curid=17655263,29,30,"He held positions with Roger Wolfe Kahn and Don Voorhees , and became a prolific studio pianist , recording with Fred Rich , Nat Shilkret , Frankie Trumbauer , Bix Beiderbecke , and the Charleston Chasers ."
3,7043,Brian Wilson,https://en.wikipedia.org/wiki?curid=32353467,13,14,"The players featured included Matt Cain , Barry Zito , Pablo Sandoval , Brian Wilson , Buster Posey , and Ryan Vogelsong ."
4,7203,Mac Davis,https://en.wikipedia.org/wiki?curid=1780664,54,55,"Part of the Lubbock independent school District , the school is known for its academic program and for the fact that it has produced a number of talented musicians , vocalists , businessmen , and scientists over the years ( including Buddy Holly and The Crickets , Natalie Maines , Ralna English , and Mac Davis ) ."


If you see in the aggregation pane that some musicians appear a lot, while others only once, it might be best to remove duplicates, so you will have as many different musicians in your dataset.

In [22]:
df.drop_duplicates(subset=['musicians'], inplace=True)
df = df[df["sentence_text"].str.len() > 50] # remove short sentences
 # the same sentence might appear twice, for example, when having two musicians in the text. Remove the duplicates.
df.drop_duplicates(subset=["sentence_text"], inplace=True)
df

Unnamed: 0,sentence_id,musicians,article_link,musicians_first_index,musicians_last_index,sentence_text
0,1516,Carole King,https://en.wikipedia.org/wiki?curid=3601134,19,20,"Little Eva had recently had a hit with "" The Loco - Motion , "" after being discovered by Carole King , for whom she babysat ."
1,3201,Jack Benny,https://en.wikipedia.org/wiki?curid=3601297,12,13,One of his bunkmates in the Navy orchestra was comedian / violinist Jack Benny .
2,5954,Bix Beiderbecke,https://en.wikipedia.org/wiki?curid=17655263,29,30,"He held positions with Roger Wolfe Kahn and Don Voorhees , and became a prolific studio pianist , recording with Fred Rich , Nat Shilkret , Frankie Trumbauer , Bix Beiderbecke , and the Charleston Chasers ."
3,7043,Brian Wilson,https://en.wikipedia.org/wiki?curid=32353467,13,14,"The players featured included Matt Cain , Barry Zito , Pablo Sandoval , Brian Wilson , Buster Posey , and Ryan Vogelsong ."
4,7203,Mac Davis,https://en.wikipedia.org/wiki?curid=1780664,54,55,"Part of the Lubbock independent school District , the school is known for its academic program and for the fact that it has produced a number of talented musicians , vocalists , businessmen , and scientists over the years ( including Buddy Holly and The Crickets , Natalie Maines , Ralna English , and Mac Davis ) ."
...,...,...,...,...,...,...
1328,1692175,Fred Rose,https://en.wikipedia.org/wiki?curid=25647640,21,22,"In Williams ' original draft , the song had been titled "" I Lose Again "" but was reversed at producer Fred Rose 's insistence ."
1357,1728400,Scouting for Girls,https://en.wikipedia.org/wiki?curid=16327885,12,14,""" It 's Not About You "" is the fourth single by Scouting for Girls from their debut album ."
1482,1854583,Big Mama Thornton,https://en.wikipedia.org/wiki?curid=3155242,44,46,"When asked about her musical influences , she replied : "" Bob Dylan , Joan Baez , Kris Kristofferson , Guy Clark , Waylon Jennings , Willie Nelson , Dolly Parton , Janis Joplin , Robert Johnson , Karen Dalton , Fred Koller , Big Mama Thornton , Billie Holiday , Hank Williams , Tammy Wynette and J.J. Cale . """
1494,1887205,Gary Richrath,https://en.wikipedia.org/wiki?curid=56173764,17,18,"By late 1970 , REO Speedwagon had finalised its first recording lineup with the addition of guitarist Gary Richrath in place of Scorfina ."


## Data Manipulation

We want to tag each musician as `[B]` or `[I]`, any other non-musician PERSON as `[PB]` or `[PI]` and all other tokens as `[O]`
```
Trumpeter-[O]
Ted-[B]
Curson-[I]
introduced-[O]
him-[O]
to-[O]
pianist-[O]
Cecil-[B]
Taylor-[I]
when-[O]
Cyrille-[O]
was-[O]
18-[O]
```

However, note that "Ted Curson" is tagged as musician in one instance, and Cecil Taylor in another. SPIKE's output only shows one match per line. If a sentence has several matches, the sentence will be found several times in the csv, each time pointing to a different match. Also note, that while we can suspect (by common sense or knowledge) that Cyrille is also a musician, the above patterns do not capture that, so `Cyrille` is tagged as `[O]`.


In [23]:
musicians  = df["musicians"].unique()

In [24]:
def tag_sentence(row):
    tokens = []
    start = row["musicians_first_index"]
    end = row["musicians_last_index"]
    start_entity = "O"
    for i, token in enumerate(nlp(row["sentence_text"])):
        ent_type = token.ent_type_
        if start == i:
            tokens.append(f"{token.text}-[B]")
            start_entity = "O"
        elif token.text == "'s":
            tokens.append(f"{token.text}-[O]")
            start_entity = "O"
        elif start < i <= end:
            tokens.append(f"{token.text}-[I]")
            start_entity = "O"
        elif ent_type == "PERSON":
            surname = [t.text for t in token.ancestors]
            first_name = [t.text for t in token.children]
            if not first_name:
                if surname:
                    try:
                        if f"{token.text} {surname[0]}" in musicians:
                            tokens.append(f"{token.text}-[B]")
                            start_entity = "O"
                        else:
                            if start_entity != "PB":
                                tokens.append(f"{token.text}-[PB]")
                                start_entity = "PB"
                    except:
                        print(token.text, row["sentence_text"])
                        tokens.append(f"{token.text}-[O]")
                elif f"{token.text}" in musicians:
                    tokens.append(f"{token.text}-[B]")
                    start_entity = "O"
                else:
                    tokens.append(f"{token.text}-[B]")
                    start_entity = "O"
            elif f"{first_name[0]} {token.text}" in musicians:
                tokens.append(f"{token.text}-[I]")
                start_entity = "O"            
            elif start_entity != "PB":
                tokens.append(f"{token.text}-[PB]")
                start_entity = "PB"
            else:
                tokens.append(f"{token.text}-[PI]")
                start_entity = "PI"
        else:
            tokens.append(f"{token.text}-[O]")
            start_entity = "O"
            
    return " ".join(tokens)
        

df["tagged_sentence"] = df.apply(tag_sentence, axis=1)

In [12]:
for i, row in df.iterrows():
    print(i, row["musicians"])
    print(row["sentence_text"])
    print(row["tagged_sentence"], '\n')
    if i == 10:
        break

0 Carole King
Little Eva had recently had a hit with " The Loco - Motion , " after being discovered by Carole King , for whom she babysat .
Little-[O] Eva-[PB] had-[O] recently-[O] had-[O] a-[O] hit-[O] with-[O] "-[O] The-[O] Loco-[O] --[O] Motion-[O] ,-[O] "-[O] after-[O] being-[O] discovered-[O] by-[O] Carole-[B] King-[I] ,-[O] for-[O] whom-[O] she-[O] babysat-[O] .-[O] 

1 Jack Benny
One of his bunkmates in the Navy orchestra was comedian / violinist Jack Benny .
One-[O] of-[O] his-[O] bunkmates-[O] in-[O] the-[O] Navy-[O] orchestra-[O] was-[O] comedian-[O] /-[O] violinist-[O] Jack-[B] Benny-[I] .-[O] 

2 Bix Beiderbecke
He held positions with Roger Wolfe Kahn and Don Voorhees , and became a prolific studio pianist , recording with Fred Rich , Nat Shilkret , Frankie Trumbauer , Bix Beiderbecke , and the Charleston Chasers .
He-[O] held-[O] positions-[O] with-[O] Roger-[PB] Kahn-[PI] and-[O] Don-[PB] Voorhees-[PI] ,-[O] and-[O] became-[O] a-[O] prolific-[O] studio-[O] pianist-[O] ,-[

After tagging, write the sentences to file. Make sure to manually go over the file and correct wrong tags.

In [14]:
with open("../data/output/test_set_dirty.txt", "w") as f:
    for i, row in df.iterrows():
        if row['tagged_sentence'].count('[PB]') < 3: # avoid sentences with big lists of people. Not mandatory
            f.write(f"{row['tagged_sentence']}\n")

## Negative Examples

To create negative examples, simply search for [PERSONs](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoiJTNDRSUzRW5lZ2F0aXZlJTNBZSUzRFBFUlNPTiIsImV4cGFuc2lvbk92ZXJyaWRlcyI6IiIsImNhc2VTdHJhdGVneSI6ImV4YWN0IiwiZmlsdGVycyI6Int9IiwiaXNGdXp6eSI6ImZhbHNlIn0=&autoRun=true), or you can curate a list of random names and run a [query](https://spike.staging.apps.allenai.org/datasets/wikipedia/search#query=eyJ0eXBlIjoiQiIsInF1ZXJ5IjoiJTNDRSUzRW5lZ2F0aXZlcyUzQSU3QnJhbmRvbV9wZW9wbGUlM0E5ZGVhOGMzNTg5NTU2MGZlMTc0MmEzY2VlZjg2NjNiNGI3MzJjZjdiZWQyMGRlYTViNWEzNjYzNDAxNzE3ZjA4JTdEIiwiZXhwYW5zaW9uT3ZlcnJpZGVzIjoiIiwiY2FzZVN0cmF0ZWd5IjoiZXhhY3QiLCJmaWx0ZXJzIjoie30iLCJpc0Z1enp5IjoiZmFsc2UifQ==&autoRun=true) looking for them (to create such a list, you can use the aggregation panel with the previous query). Download results and repeat the above process, but this time, tag all PERSONs as PB/PI.

In [18]:
df_neg = pd.read_csv('../data/queries/simple_negatives.csv')
df_neg.drop_duplicates(subset=['negatives'], inplace=True)
df_neg = df[df["sentence_text"].str.len() > 50] # remove short sentences
df_neg.drop_duplicates(subset=["sentence_text"], inplace=True)
df_neg.head()

Unnamed: 0,sentence_id,negatives,negatives_first_index,negatives_last_index,sentence_text,tagged_sentence
0,855,Kralj Dmitar Zvonimir,32,34,"The next live firings did n't happen until 2015 when a single missile was launched from a coastal launcher and another one from "" Petar Krešimir IVs "" sister ship , "" Kralj Dmitar Zvonimir "" ( RTOP-12 ) .","The-[O] next-[O] live-[O] firings-[O] did-[O] n't-[O] happen-[O] until-[O] 2015-[O] when-[O] a-[O] single-[O] missile-[O] was-[O] launched-[O] from-[O] a-[O] coastal-[O] launcher-[O] and-[O] another-[O] one-[O] from-[O] ""-[O] Petar-[O] Krešimir-[O] IVs-[O] ""-[O] sister-[O] ship-[O] ,-[O] ""-[O] Kralj-[O] Dmitar-[O] Zvonimir-[O] ""-[O] (-[O] RTOP-12-[O] )-[O] .-[O]"
1,1315,Pierre Huber,10,11,"Yves Aupetitallot , Private View 1980 - 2000 : Collection Pierre Huber , JRP / Ringier","Yves-[PB] Aupetitallot-[PI] ,-[O] Private-[O] View-[O] 1980-[O] --[O] 2000-[O] :-[O] Collection-[O] Pierre-[PB] Huber-[PI] ,-[O] JRP-[O] /-[O] Ringier-[O]"
2,3750,Oddbjørn Jonstad,5,6,"The party is led by Oddbjørn Jonstad , a former local leader of the Progress Party who was expelled from the party following some controversial proposals he made on immigration issues .","The-[O] party-[O] is-[O] led-[O] by-[O] Oddbjørn-[PB] Jonstad-[PI] ,-[O] a-[O] former-[O] local-[O] leader-[O] of-[O] the-[O] Progress-[O] Party-[O] who-[O] was-[O] expelled-[O] from-[O] the-[O] party-[O] following-[O] some-[O] controversial-[O] proposals-[O] he-[O] made-[O] on-[O] immigration-[O] issues-[O] .-[O]"
4,3911,Henry Sage,3,4,"In 1884 , Henry Sage endowed several of the first scholarships in the nation earmarked especially for women .","In-[O] 1884-[O] ,-[O] Henry-[PB] Sage-[PI] endowed-[O] several-[O] of-[O] the-[O] first-[O] scholarships-[O] in-[O] the-[O] nation-[O] earmarked-[O] especially-[O] for-[O] women-[O] .-[O]"
7,5535,São João,2,3,"Born in São João do Araguaia , Pará , Tiago Alves graduated from Santos ' youth setup .","Born-[O] in-[O] São-[O] João-[O] do-[O] Araguaia-[O] ,-[O] Pará-[O] ,-[O] Tiago-[PB] Alves-[PI] graduated-[O] from-[O] Santos-[O] '-[O] youth-[O] setup-[O] .-[O]"


In [19]:
def negatives_tag_sentence(row):
    tokens = []
    start_entity = "O"
    for i, token in enumerate(nlp(row["sentence_text"])):
        ent_type = token.ent_type_
        if ent_type == "PERSON":
            if start_entity != "PB":
                tokens.append(f"{token.text}-[PB]")
                start_entity = "PB"
            else:
                tokens.append(f"{token.text}-[PI]")
                start_entity = "PI"
        else:
            tokens.append(f"{token.text}-[O]")
            start_entity = "O"
            
    return " ".join(tokens)


df_neg["tagged_sentence"] = df_neg.apply(negatives_tag_sentence, axis=1)

In [20]:
df_neg['tagged_sentence'].to_csv("../data/output/negative_sentences.csv", sep="\t")

After manually cleaning both files (musicians and negatives), you might want to select about 50 sentences of each (or another ratio), merge them and shuffle.