# Assignment 2 starter code
This notebook contains code to run [coreferee](https://github.com/explosion/coreferee), a coreference resolution system that runs under spaCy and extracts coreference chains (or clusters) from text.
To run the notebook, you first have to install coreferee. Full instructions are at https://spacy.io/universe/project/coreferee, but essentially you need to run the following from a command prompt:

`$ python -m pip install coreferee`

`$ python -m coreferee install en`
    
You'll also need to download the spaCy large language model and the transformer model for English. spaCy has recently released new versions that coreferee is not yet compatible with, so you need to download specific versions of each model:

`$ python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz`

`$ python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0.tar.gz`

You can also install all of this by running the next two cells instead. You'll only need to do this once.

## Part 1: Installation

You only need to do this **once**, no matter how many times you modify this notebook. Alternatively, you can do it from the command prompt, with the commands above. There is no harm in doing it more than once; it just takes a very long time.

The first cell installs coreferee and the English model for it (coreferee works for other languages too). The second cell installs the right versions of the English models for spaCy. You need these specific versions because newer versions of the English models don't work with coreferee. 
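
Because the models are pinned to specific versions, it can help to confirm what actually ended up installed before moving on. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is made up for this example, and the distribution names are assumed to match the package names used in the install commands.

```python
from importlib import metadata

# Distribution names assumed from the install commands above.
PACKAGES = ["spacy", "coreferee", "en_core_web_lg", "en_core_web_trf"]

def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If both models report 3.4.0 and spaCy reports a 3.4.x version, the environment matches what the two installation cells set up.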

In [13]:
!python -m pip install coreferee
!python -m coreferee install en

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting spacy<3.6.0,>=3.0.0 (from coreferee)
  Using cached spacy-3.5.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (25 kB)
Collecting thinc<8.2.0,>=8.1.8 (from spacy<3.6.0,>=3.0.0->coreferee)
  Using cached thinc-8.1.12-cp311-cp311-macosx_11_0_arm64.whl.metadata (15 kB)
Using cached spacy-3.5.4-cp311-cp311-macosx_11_0_arm64.whl (6.5 MB)
Using cached thinc-8.1.12-cp311-cp311-macosx_11_0_arm64.whl (776 kB)
Installing collected packages: thinc, spacy
  Attempting uninstall: thinc
    Found existing installation: thinc 8.2.3
    Uninstalling thinc-8.2.3:
      Successfully uninstalled thinc-8.2.3
  Attempting uninstall: spacy
    Found existing installation: spacy 3.7.4
    Uninstalling spacy-3.7.4:
      Suc

In [14]:
!python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz
!python -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0.tar.gz

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.0/en_core_web_lg-3.4.0.tar.gz (587.7 MB)
     587.7/587.7 MB 4.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting spacy<3.5.0,>=3.4.0 (from en_core_web_lg==3.4.0)
  Using cached spacy-3.4.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (24 kB)
Using cached spacy-3.4.4-cp311-cp311-macosx_11_0_arm64.whl (6.4 MB)
Installing collected packages: spacy
  Attempting uninstall



## Part 2: Test coreferee

In [15]:
# import what we need, load the transformer model,
# and add coreferee to the spaCy nlp pipeline

import coreferee
import spacy
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')



<coreferee.manager.CorefereeBroker at 0x381de5c90>

In [16]:
# this is just a test, so that you can see what the coreference chains look like
# you may get a CUDA warning here. As long as it's only a warning, things should run just fine

doc = nlp('Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.')

In [17]:
# now we print the coreference chains found

doc._.coref_chains.print()

0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)


A few things to note about the output:

* We have 4 coreference chains, relating to: *Peter, work, wife(+Peter), Spain*
* Coreferee is able to deal with cataphora, where the pronoun (*he*) appears before the referent (*Peter*)
* Coreferee can deal with groups: *\[he+wife\], they*
* The wife does not appear as an entity with a chain of her own, because no other expression refers to her alone. She only appears as part of the group *he and his wife*
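
To make the printed format concrete, here is a small sketch that parses one line of that output into Python data. It uses only the standard library. `parse_chain_line` is a hypothetical helper written for this illustration and is not part of coreferee, and it assumes that tokens do not themselves contain brackets, parentheses, commas or semicolons.

```python
import re

# One token with its index, e.g. "Peter(9)"
TOKEN_RE = re.compile(r"([^\s;,\[\]()]+)\((\d+)\)")
# One mention: either a bracketed group or a single token
MENTION_RE = re.compile(r"\[[^\]]*\]|[^\s;,\[\]()]+\(\d+\)")

def parse_chain_line(line):
    """Parse a printed chain line into (chain number, list of mentions).

    Each mention is a list of (token text, token index) pairs;
    a group mention such as [He(16); wife(19)] has more than one pair.
    """
    label, _, body = line.partition(":")
    mentions = [
        [(text, int(index)) for text, index in TOKEN_RE.findall(raw)]
        for raw in MENTION_RE.findall(body)
    ]
    return int(label), mentions

chain_id, mentions = parse_chain_line("2: [He(16); wife(19)], they(21), They(26), they(31)")
print(chain_id)      # 2
print(mentions[0])   # [('He', 16), ('wife', 19)]
print(mentions[-1])  # [('they', 31)]
```

The numbers in parentheses are token positions in the document, so `doc[31]` is the last *they* in the test sentence.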

In [18]:
# once we have the token index of a particular referring expression,
# we can ask coreferee to resolve it. For instance, printing
# the following expression gives us the referent for
# the referring expression at token index 31 (they)

print(doc._.coref_chains.resolve(doc[31]))

[Peter, wife]
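
When resolving many tokens in a loop, it is convenient to have a fallback for tokens that are not part of any chain. The sketch below assumes that `resolve` returns `None` in that case, which is my reading of the coreferee documentation, so check it on your installation. `resolve_or_self` is a hypothetical helper name.

```python
def resolve_or_self(doc, token):
    """Return the tokens that `token` refers to, or [token] if there is no referent.

    Assumes `doc` comes from a pipeline with coreferee added,
    so that `doc._.coref_chains` exists.
    """
    referents = doc._.coref_chains.resolve(token)
    # Fall back to the token itself when nothing was resolved.
    return referents if referents else [token]
```

With the test document above, `resolve_or_self(doc, doc[31])` should give the same `[Peter, wife]`, while a token outside every chain would come back as a one-element list containing itself.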


## Part 3: Run coreferee on local files

In [19]:
# build coreference chains for 5 documents in the A1_data/ directory
# below is a sample for the first text

with open("A1_data/5c1dbe1d1e67d78e2797d611.txt", "r", encoding='utf-8') as f:
    text1 = f.read()

In [20]:
doc1 = nlp(text1)

In [21]:
doc1._.coref_chains.print()

0: couple(9), couple(76)
1: years(16), their(19)
2: letter(48), them(59), they(78)
3: Ayo(72), Ayo(150)
4: custody(87), it(96)
5: News(112), News(136)
6: fact(117), It(165)
7: orphanage(161), orphanage(206)
8: letter(171), letter(219)
9: Kim(174), Kim(211)
10: children(254), their(266)
11: adoption(300), adoption(333)
12: headlines(317), they(325)
13: Kim(338), Kim(401), Kim(425), her(437), she(452)
14: Nigeria(345), country(359)
15: papers(372), them(376)
16: Clark(395), Clark(432), He(444), him(478), him(486), him(496), he(501)
17: [Kim(401); son(404)], their(403)
18: Canada(474), They(484)
19: Nigeria(489), They(494)
20: Morans(561), Morans(602)
21: family(578), family(636), they(658), their(660), they(684)
22: government(589), it(598)
23: Kim(633), she(652), she(695)
24: agency(648), it(672)
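
Part 3 asks for chains from five documents, while the sample above loads only one. One way to load several files is sketched here with `pathlib`. The helper name `read_texts` and the choice to take the first five files in sorted order are assumptions for illustration, not part of the assignment instructions.

```python
from pathlib import Path

def read_texts(directory="A1_data", limit=5):
    """Read up to `limit` .txt files from `directory`, in sorted filename order.

    Returns a dict mapping each filename to its text.
    """
    paths = sorted(Path(directory).glob("*.txt"))[:limit]
    return {path.name: path.read_text(encoding="utf-8") for path in paths}
```

Each text can then be passed to `nlp` in the same way as `text1`, for example `docs = {name: nlp(text) for name, text in read_texts().items()}`.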


In [22]:
# example: who does she(452) refer to?

print(doc1._.coref_chains.resolve(doc1[452]))

[Kim]


In [23]:
# print the named entities that are people (label PERSON)
# we are interested in those because they are the sources of quotes
for ent in doc1.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.label_)

Kim PERSON
Clark Moran PERSON
Ayo PERSON
Kim PERSON
Ayo PERSON
Kim PERSON
Clark PERSON
Kim PERSON
Ayo PERSON
Morans PERSON
Kim PERSON
Ayo PERSON
Clark PERSON
Kim PERSON
Kim PERSON
Clark PERSON
Ayo PERSON
Kim PERSON
Morans PERSON
Ayo PERSON
Kim PERSON
Ben Miljure PERSON
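
The list above repeats the same names many times. A quick way to see who is mentioned most often is to count the entity texts. This is a minimal sketch: `count_people` is a hypothetical helper, and it relies only on each entity having the `text` and `label_` attributes used in the cell above.

```python
from collections import Counter

def count_people(ents):
    """Count how often each PERSON entity text occurs in an iterable of entities."""
    return Counter(ent.text for ent in ents if ent.label_ == "PERSON")
```

Calling `count_people(doc1.ents).most_common()` lists the names from most to least frequent. Variants such as *Clark Moran* and *Clark* are counted separately, because the count is based on surface text only.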


In [34]:
with open("A1_data/5c1de1661e67d78e27984d34.txt", "r", encoding='utf-8') as f:
    text2 = f.read()

In [35]:
doc2 = nlp(text2)

In [36]:
doc2._.coref_chains.print()

0: Scheer(5), Scheer(24)
1: Trudeau(10), Trudeau(29)
2: Canada(21), Canada(48)
3: [Trudeau(29); party(33)], them(40)
4: Trudeau(76), his(87), His(107), Trudeau(113)
5: Scheer(104), Scheer(119)
6: [Scheer(119); Conservatives(122)], themselves(128)
7: accusations(157), they(163)
8: Convoy(187), his(236)
9: pipelines(193), their(197), they(204)
10: Scheer(242), he(244), his(259), Scheer(289)
11: Twitter(284), it(291)
12: reaction(294), it(303)
13: defence(352), it(358)
14: Chicken(365), he(401), he(406)
15: language(388), it(391), it(425)
16: WelcomeToCanada(440), Canada(525)
17: U.S.(450), country(467)
18: Trudeau(452), Trudeau(458), Trudeau(483), he(487)
19: Scheer(462), Scheer(533), he(535), his(537), Scheer(570), he(659)
20: [he(535); party(538)], they(545)
21: system(633), it(643)
22: Canada(653), It(666)
23: Conservatives(675), Conservatives(728), Conservatives(765)
24: spokeswoman(686), he(690)
25: Trudeau(788), Trudeau(824), Trudeau(826)
26: Conservatives(829), they(836)
27: sover

## Side note: visualizations
If you want to see all this in a much prettier format, you can use [displaCy](https://spacy.io/usage/visualizers), spaCy's built-in visualizer.

In [24]:
from spacy import displacy

In [25]:
options = {"ents": ["PERSON"],
          "colors": {"PERSON": "lightsteelblue"}}

displacy.render(doc1, style="ent", options=options, jupyter=True)

## Part 4: Run the quote extraction from Assignment 1
I suggest using the Matcher quote extraction system from A1. If you implemented your own version, or improved on this one, feel free to use that instead.

In [26]:
#import what we need for this
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

In [31]:
# we don't need to load the text again; use text1 from above

matcher = Matcher(nlp.vocab)
# a straight quote, one or more alphabetic tokens, optional punctuation, a straight quote
pattern_q = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}]
matcher.add("QUOTES", [pattern_q], greedy='LONGEST')
doc = nlp(text1)
matches_q = matcher(doc)
matches_q.sort(key=lambda x: x[1])  # sort matches by start token index
print(len(matches_q))
for match in matches_q[:10]:
    print(match, doc[match[1]:match[2]])

3
(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."
(16432004385153140588, 164, 174) "It does say that in the letter,"
(16432004385153140588, 179, 209) "I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage."
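
The `greedy='LONGEST'` argument asks the Matcher to drop overlapping matches in favour of longer ones. To illustrate the idea (this is not spaCy's actual implementation), here is a plain-Python sketch that keeps the longest non-overlapping `(match_id, start, end)` tuples. The function name `keep_longest` and the tie-breaking rule (the earlier start wins) are assumptions made for this example.

```python
def keep_longest(matches):
    """Keep the longest non-overlapping (match_id, start, end) tuples, sorted by start."""
    kept = []
    taken = set()  # token indices already covered by a kept match
    # Longest first; among equal lengths, the earlier start wins.
    for match_id, start, end in sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1])):
        span = range(start, end)
        if taken.isdisjoint(span):
            kept.append((match_id, start, end))
            taken.update(span)
    return sorted(kept, key=lambda m: m[1])

# Toy token offsets, made up for this example
matches = [(1, 0, 4), (1, 2, 10), (1, 12, 15), (1, 9, 13)]
print(keep_longest(matches))  # [(1, 2, 10), (1, 12, 15)]
```

Tracking covered token indices in a set is simple rather than fast, which is fine for the handful of matches found in one article.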


In [30]:
# "new approach/own version" of Matcher


with open("A1_data/5c1dbe1d1e67d78e2797d611.txt", "r", encoding='utf-8') as f:
    text = f.read()

matcher = Matcher(nlp.vocab)

# one pattern for straight quotes and one for curly quotes
pattern_q16 = [{'ORTH': '"'}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "*"}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '"'}] #7
pattern_q17 = [{'ORTH': '“'}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "*"}, {'IS_ALPHA': True, "OP": "*"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': '”'}] #4

matcher.add("QUOTES", [pattern_q16, pattern_q17])
doc = nlp(text)
matches_q = matcher(doc)
print(len(matches_q))
for match in matches_q[:10]:
    print(match, doc[match[1]:match[2]])
    print("\n")  # blank lines between outputs

7
(16432004385153140588, 108, 116) " Kim told CTV News Friday. "


(16432004385153140588, 115, 133) "The fact that we are being accused right now of an unethical adoption is crazy."


(16432004385153140588, 108, 133) " Kim told CTV News Friday. "The fact that we are being accused right now of an unethical adoption is crazy."


(16432004385153140588, 164, 174) "It does say that in the letter,"


(16432004385153140588, 173, 180) " Kim confirmed, adding that "


(16432004385153140588, 179, 209) "I have no idea where that information came from because both Clark and I were there in the office with all of the workers from the orphanage."


(16432004385153140588, 280, 309) "in some cases, extra steps in the citizenship or immigration process may be needed to make sure the adoption meets all requirements of international adoption."
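
Several of the spans above, such as `" Kim told CTV News Friday. "`, run from the closing mark of one quote to the opening mark of the next, since a straight `"` looks the same in both roles. One possible way around this, sketched below in plain Python, is to pair straight quote marks in order of appearance (first with second, third with fourth, and so on). This is a different technique from the Matcher patterns above, the name `pair_quotes` is hypothetical, and the sketch assumes that the quote marks in the text are balanced.

```python
def pair_quotes(tokens, mark='"'):
    """Return (start, end) token offsets for each quoted span, end exclusive.

    Marks are paired in order of appearance: 1st with 2nd, 3rd with 4th, ...
    An unpaired final mark is ignored.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == mark]
    return [(positions[k], positions[k + 1] + 1) for k in range(0, len(positions) - 1, 2)]

# Toy token list, made up for this example
tokens = ['"', 'It', 'is', 'crazy', ',', '"', 'Kim', 'said', '.', '"', 'Really', '.', '"']
for start, end in pair_quotes(tokens):
    print(start, end, " ".join(tokens[start:end]))
```

With a spaCy document, `pair_quotes([t.text for t in doc])` gives offsets that can be used directly as `doc[start:end]`. A single missing quote mark in the source text would shift every later pair, so it is worth checking the result by eye.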




## Your turn

Check the instructions on Canvas for what to do and what to submit.