<h2>Part 1: extracting all direct quotes</h2>

In [1]:
#import modules
import spacy
import re
nlp = spacy.load("en_core_web_sm")

In [2]:
#save file paths
import os
fileids = os.listdir("texts")
fileids = [("texts/" + fileid) for fileid in fileids]
fileids

['texts/5c1452701e67d78e276ee126.txt',
 'texts/5c146e42795bd2fcce2ea8e5.txt',
 'texts/5c149ffc1e67d78e276fbd44.txt',
 'texts/5c15488f1e67d78e277161d7.txt',
 'texts/5c1548a31e67d78e2771624f.txt']

In [3]:
#open and merge files
textlist = []
for fileid in fileids:
    with open (fileid, mode ="r", encoding = "utf-8") as f:
        text = f.read()
        textlist.append(text)

In [4]:
textlist

["The question was common for mayoral candidates to hear back in the October municipal election: what do you think of ride-hailing and should it come to Metro Vancouver?. \n “I was clear when I was mayor – I don’t support Uber at all,” then-Surrey mayoral candidate Doug McCallum responded at one debate. \n It was an odd thing to say: McCallum’s last term as Surrey mayor ended in 2005, while Uber as a company began in 2009, briefly entering the Vancouver market only in 2012. There was no Uber not to support in 2005. \n Rival candidate Bruce Hayne joked in his turn to answer the question, “It was a twinkle in some engineer’s eye some years ago.”. \n McCallum didn’t correct himself. \n But the strange remark may have foreshadowed a growing number of curious statements from the returning mayor of B.C.’s second-largest, rapidly growing city, which is wrestling with big changes after his come-from-behind election win. \n CTV News has analyzed about three weeks' worth of McCallum's speeches, 

In [5]:
#standardize quotation marks
pipeline = [('“', '"'), ('´´', '"'), ('”', '"'), ('’’', '"')]
for old, new in pipeline:
    textlist = [(x.replace(old, new)) for x in textlist]

In [7]:
#remove newlines and replace straight apostrophes with curly ones (list comprehensions)
textlist = [(x.replace("\n", "")) for x in textlist]
textlist = [(x.replace("\'", "’")) for x in textlist]

In [8]:
textlist

['The question was common for mayoral candidates to hear back in the October municipal election: what do you think of ride-hailing and should it come to Metro Vancouver?.  "I was clear when I was mayor – I don’t support Uber at all," then-Surrey mayoral candidate Doug McCallum responded at one debate.  It was an odd thing to say: McCallum’s last term as Surrey mayor ended in 2005, while Uber as a company began in 2009, briefly entering the Vancouver market only in 2012. There was no Uber not to support in 2005.  Rival candidate Bruce Hayne joked in his turn to answer the question, "It was a twinkle in some engineer’s eye some years ago.".  McCallum didn’t correct himself.  But the strange remark may have foreshadowed a growing number of curious statements from the returning mayor of B.C.’s second-largest, rapidly growing city, which is wrestling with big changes after his come-from-behind election win.  CTV News has analyzed about three weeks’ worth of McCallum’s speeches, statements a

In [9]:
#return the text between each pair of double quotation marks (non-greedy match)
def get_quotes(text):
    quotes = re.findall(r'"(.*?)"', text)
    return quotes
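
The non-greedy `.*?` is what keeps the quotes separate: it stops at the nearest closing quotation mark, whereas a greedy `.*` would run on to the last one in the text. A small check on a made-up sentence (the sample string and `get_quotes_greedy` are assumptions for illustration, not part of the corpus or the notebook):

```python
import re

def get_quotes(text):
    # non-greedy: each match ends at the nearest closing quotation mark
    return re.findall(r'"(.*?)"', text)

def get_quotes_greedy(text):
    # greedy variant, shown only for contrast
    return re.findall(r'"(.*)"', text)

sample = 'She said "yes," then paused, and he said "no."'
print(get_quotes(sample))         # two separate quotes
print(get_quotes_greedy(sample))  # one over-long match spanning both
```

Because `.` does not match a newline by default, stripping `\n` beforehand also lets a quote that was broken across lines match as one string.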

In [10]:
#print quotes and save as .txt file
#note: the file is opened once in write mode, so rerunning this cell overwrites it instead of appending
#also note: this prints one list per text, where each list is the quotes from that text
with open("newapproachoutput.txt", "w", encoding="utf-8") as x:
    for text in textlist:
        found_quotes = get_quotes(text)
        if len(found_quotes) > 0:
            print(found_quotes)
        for quote in found_quotes:
            print(quote, file=x)

['I was clear when I was mayor – I don’t support Uber at all,', 'It was a twinkle in some engineer’s eye some years ago.', 'Mayor McCallum’s statements vary greatly from truth,', 'There’s a tried-and-true method in Canadian politics: after an election a new government takes office and says, ‘Oh my gosh, the cupboards are bare.’ Or, ‘We’re much deeper in debt than I thought we were, and now I’ve seen the real books.’ So I think there’s an element of that kind of gamesmanship going on,', 'Then there’s the fact that McCallum has been out of office for quite some time, thinking he knew the job, but some things have changed,', 'If you take Fraser Highway SkyTrain and if we’re building that seven days a week around the clock, we probably can save, and this is TransLink’s figures, we can probably save $2-300 million,', 'TransLink has not conducted any detailed study on potential construction methods for a SkyTrain route from Surrey to Langley. The most recent cost estimate (2017 Hatch report)

<h2>Part 2: extracting speakers</h2>

In [11]:
#import modules
import spacy
import re
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

In [12]:
#save file paths
import os
fileids = os.listdir("texts")
fileids = [("texts/" + fileid) for fileid in fileids]
fileids

['texts/5c1452701e67d78e276ee126.txt',
 'texts/5c146e42795bd2fcce2ea8e5.txt',
 'texts/5c149ffc1e67d78e276fbd44.txt',
 'texts/5c15488f1e67d78e277161d7.txt',
 'texts/5c1548a31e67d78e2771624f.txt']

In [13]:
#open files one at a time for this
#if we do them all at once it creates issues with the indexing
text1list = []
with open (fileids[0], mode ="r", encoding = "utf-8") as f:
    text = f.read()
    text1list.append(text)

In [14]:
#standardize quotation marks
pipeline = [('“', '"'), ('´´', '"'), ('”', '"'), ('’’', '"')]
for old, new in pipeline:
    text1list = [(x.replace(old, new)) for x in text1list]

In [15]:
#remove newlines and replace straight apostrophes with curly ones
text1list = [(x.replace("\n", "")) for x in text1list]
text1list = [(x.replace("\'", "’")) for x in text1list]

In [16]:
text1list

['The question was common for mayoral candidates to hear back in the October municipal election: what do you think of ride-hailing and should it come to Metro Vancouver?.  "I was clear when I was mayor – I don’t support Uber at all," then-Surrey mayoral candidate Doug McCallum responded at one debate.  It was an odd thing to say: McCallum’s last term as Surrey mayor ended in 2005, while Uber as a company began in 2009, briefly entering the Vancouver market only in 2012. There was no Uber not to support in 2005.  Rival candidate Bruce Hayne joked in his turn to answer the question, "It was a twinkle in some engineer’s eye some years ago.".  McCallum didn’t correct himself.  But the strange remark may have foreshadowed a growing number of curious statements from the returning mayor of B.C.’s second-largest, rapidly growing city, which is wrestling with big changes after his come-from-behind election win.  CTV News has analyzed about three weeks’ worth of McCallum’s speeches, statements a

In [17]:
doc1 = nlp(text1list[0])

In [18]:
#return the text between each pair of double quotation marks (non-greedy match)
def get_quotes(text):
    quotes = re.findall(r'"(.*?)"', text)
    return quotes

In [19]:
#print the quotes and add them to a list
quotelist = []
for text in text1list:
    found_quotes = get_quotes(text)
    if len(found_quotes) > 0:
        print(found_quotes)
    quotelist.extend(found_quotes)

['I was clear when I was mayor – I don’t support Uber at all,', 'It was a twinkle in some engineer’s eye some years ago.', 'Mayor McCallum’s statements vary greatly from truth,', 'There’s a tried-and-true method in Canadian politics: after an election a new government takes office and says, ‘Oh my gosh, the cupboards are bare.’ Or, ‘We’re much deeper in debt than I thought we were, and now I’ve seen the real books.’ So I think there’s an element of that kind of gamesmanship going on,', 'Then there’s the fact that McCallum has been out of office for quite some time, thinking he knew the job, but some things have changed,', 'If you take Fraser Highway SkyTrain and if we’re building that seven days a week around the clock, we probably can save, and this is TransLink’s figures, we can probably save $2-300 million,', 'TransLink has not conducted any detailed study on potential construction methods for a SkyTrain route from Surrey to Langley. The most recent cost estimate (2017 Hatch report)

In [20]:
#define a function to find people and orgs with spaCy
def findspeakers(doc):
    return [ent.text for ent in doc.ents if (ent.label_ == "PERSON" or ent.label_ == "ORG")]

In [21]:
speakerlist = findspeakers(doc1)

In [22]:
speakerlist

['Uber',
 'Doug McCallum',
 'McCallum',
 'Surrey',
 'Uber',
 'Bruce Hayne',
 'McCallum',
 'CTV News',
 'McCallum',
 'Cloverdale Sport and Ice Complex',
 'Grandview Heights Community Centre',
 'Library',
 'SkyTrain',
 'Langley',
 'McCallum',
 'SkyTrain',
 'McCallum',
 'Cindy Dalglish',
 'Surrey',
 'McCallum',
 'Hamish Telford',
 'McCallum',
 'Telford',
 'McCallum',
 'Surrey',
 'CTV News',
 'Mayors’ Council',
 'CTV News',
 'the Mayors Council',
 'McCallum',
 'Newton',
 'SkyTrain',
 'Langley',
 'Langley SkyTrain',
 'LRT',
 'McCallum',
 'CTV News',
 'SkyTrain',
 'Surrey',
 'Langley',
 'SkyTrain',
 'LRT',
 'Surrey',
 'SkyTrain',
 'Evergreen Line',
 'the Evergreen Line',
 'the Evergreen Line',
 'SkyTrain',
 'McCallum',
 'LRT',
 'Surrey',
 'SkyTrain',
 'LRT',
 'John Horgan',
 'SkyTrain',
 'Kevin Desmond',
 'McCallum',
 'SkyTrain',
 'the Mayors’ Council',
 'LRT',
 'The Mayors’ Council',
 'Fleetwood',
 'SkyTrain',
 'Telford',
 'Overstating Surrey’s',
 'McCallum',
 'Surrey',
 'Surrey',
 'McCallu

In [25]:
#gives a list of tuples, each holding a speaker and its index
#this is the character index (instead of the token index that spaCy would give)
#known issue: str.index always returns the first occurrence,
#so a speaker that occurs multiple times gets the same index every time
speakerindexlist = []
str_text = str(doc1)
for speaker in speakerlist:
    if speaker in str_text:
        speakercharindex = (speaker, str_text.index(speaker))
        speakerindexlist.append(speakercharindex)
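
One possible fix for the repeated-speaker problem is to search forward from the end of the previous match instead of always searching from the start of the text. The sketch below assumes the names are listed in the order they occur in the text, which holds for `doc.ents`. The helper name `index_in_order` and the sample text are hypothetical. Reading `ent.start_char` from each spaCy entity would avoid the string search altogether.

```python
def index_in_order(names, text):
    """Pair each name with the character index of its next occurrence.

    Assumes `names` is listed in the order the names appear in `text`.
    """
    indexed = []
    cursor = 0  # the search resumes here, so earlier occurrences are skipped
    for name in names:
        position = text.find(name, cursor)
        if position == -1:
            continue  # name not found after the cursor; skip it
        indexed.append((name, position))
        cursor = position + len(name)
    return indexed

# made-up sample, not taken from the corpus
sample_text = "McCallum spoke first. Telford replied. McCallum answered again."
sample_names = ["McCallum", "Telford", "McCallum"]
print(index_in_order(sample_names, sample_text))
```

With this approach the second `McCallum` is paired with its own position rather than with the position of the first one.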

In [26]:
speakerindexlist

[('Uber', 218),
 ('Doug McCallum', 262),
 ('McCallum', 267),
 ('Surrey', 237),
 ('Uber', 218),
 ('Bruce Hayne', 532),
 ('McCallum', 267),
 ('CTV News', 915),
 ('McCallum', 267),
 ('Cloverdale Sport and Ice Complex', 1350),
 ('Grandview Heights Community Centre', 1571),
 ('Library', 1610),
 ('SkyTrain', 1652),
 ('Langley', 1674),
 ('McCallum', 267),
 ('SkyTrain', 1652),
 ('McCallum', 267),
 ('Cindy Dalglish', 2195),
 ('Surrey', 237),
 ('McCallum', 267),
 ('Hamish Telford', 2705),
 ('McCallum', 267),
 ('Telford', 2712),
 ('McCallum', 267),
 ('Surrey', 237),
 ('CTV News', 915),
 ('Mayors’ Council', 3017),
 ('CTV News', 915),
 ('the Mayors Council', 3473),
 ('McCallum', 267),
 ('Newton', 3634),
 ('SkyTrain', 1652),
 ('Langley', 1674),
 ('Langley SkyTrain', 3768),
 ('LRT', 3658),
 ('McCallum', 267),
 ('CTV News', 915),
 ('SkyTrain', 1652),
 ('Surrey', 237),
 ('Langley', 1674),
 ('SkyTrain', 1652),
 ('LRT', 3658),
 ('Surrey', 237),
 ('SkyTrain', 1652),
 ('Evergreen Line', 5029),
 ('the Everg

In [23]:
#gives a list of tuples, each holding a quote and its character index
#the same first-occurrence issue applies here: str.index returns the first match for a repeated string
#in the corpus the repeated matches (for example "sextortion") look like quoted terms rather than speech
#actual quotes are unlikely to be repeated exactly
quoteindexlist = []
str_text = str(doc1)
found_quotes = get_quotes(str_text)
for quote in found_quotes:
    quotecharindex = (quote, str_text.index(quote))
    quoteindexlist.append(quotecharindex)
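
`re.finditer` is one way to sidestep the lookup: each match object already records where it starts, so repeated quotes keep distinct positions. A minimal sketch; `get_quotes_with_index` and the sample string are illustrative names, not part of the notebook.

```python
import re

def get_quotes_with_index(text):
    """Return (quote, start index) tuples, one per match, in document order."""
    # m.start(1) is the index of the first character inside the quotation marks
    return [(m.group(1), m.start(1)) for m in re.finditer(r'"(.*?)"', text)]

# made-up sample with the same quoted word twice
sample = 'He called it "odd" twice: "odd" indeed.'
print(get_quotes_with_index(sample))
```

The tuples have the same shape as the entries of `quoteindexlist`, so they could be passed to the same downstream code.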

In [24]:
quoteindexlist

[('I was clear when I was mayor – I don’t support Uber at all,', 171),
 ('It was a twinkle in some engineer’s eye some years ago.', 587),
 ('Mayor McCallum’s statements vary greatly from truth,', 2141),
 ('There’s a tried-and-true method in Canadian politics: after an election a new government takes office and says, ‘Oh my gosh, the cupboards are bare.’ Or, ‘We’re much deeper in debt than I thought we were, and now I’ve seen the real books.’ So I think there’s an element of that kind of gamesmanship going on,',
  2343),
 ('Then there’s the fact that McCallum has been out of office for quite some time, thinking he knew the job, but some things have changed,',
  2728),
 ('If you take Fraser Highway SkyTrain and if we’re building that seven days a week around the clock, we probably can save, and this is TransLink’s figures, we can probably save $2-300 million,',
  3267),
 ('TransLink has not conducted any detailed study on potential construction methods for a SkyTrain route from Surrey to

In [29]:
#this function takes an indexed quote, iterates through the indexed potential speakers,
#and returns the (speaker, index) tuple whose index is closest to the quote's index
#credit to ChatGPT for correcting my code block
def findnearestspeaker(quote, speakers):
    #this is what my attempts were missing
    #I was trying to iterate through the indices but integers aren't iterable
    quoteinteger = quote[1]
    #ChatGPT also added these lines, which I wouldn't have known to do
    nearestspeakername = None
    mindifference = float('inf')  # Initialize with positive infinity
    #I had figured out the abs(speakerindex - quoteindex) part, but ChatGPT made it run properly
    #mainly by using an if statement instead of min(), which was what I was trying to do
    for speaker in speakers:
        difference = abs(speaker[1] - quoteinteger)
        # Check if the current speaker is closer than the previously found closest speaker
        if difference < mindifference:
            mindifference = difference
            nearestspeakername = speaker
    return nearestspeakername
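
For comparison, the same search can be written with `min()` by passing a `key` function: `min()` then compares the distances but returns the whole tuple. This is an alternative sketch, not the version used in the notebook; `findnearestspeaker_min` is a hypothetical name and the sample data is made up.

```python
def findnearestspeaker_min(quote, speakers):
    """Return the (name, index) tuple whose index is closest to the quote's index."""
    if not speakers:
        return None  # nothing to compare against
    quote_index = quote[1]
    # key= tells min() what to compare; the tuple itself is what gets returned
    return min(speakers, key=lambda speaker: abs(speaker[1] - quote_index))

# made-up sample in the same (text, index) shape as the notebook's lists
sample_speakers = [("Uber", 218), ("Doug McCallum", 262), ("Bruce Hayne", 532)]
sample_quote = ("It was a twinkle in some engineer's eye some years ago.", 587)
print(findnearestspeaker_min(sample_quote, sample_speakers)[0])
```

On a tie, `min()` keeps the first of the equally close candidates, which matches the strict `<` comparison in the loop version.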

In [30]:
#this is a very crude way of finding the speaker
#given a quote and a list of potential speakers (people and orgs, as recognized by spaCy)
#it finds the speaker nearest to the quote
#regex works really well for finding direct quotes because it matches exact strings
#but it's not well-suited to a task like this that requires more meta-knowledge of the text
for quote in quoteindexlist:
    nearestspeaker = findnearestspeaker(quote, speakerindexlist)
    print(nearestspeaker[0], ":", quote[0])

Uber : I was clear when I was mayor – I don’t support Uber at all,
Bruce Hayne : It was a twinkle in some engineer’s eye some years ago.
Cindy Dalglish : Mayor McCallum’s statements vary greatly from truth,
Cindy Dalglish : There’s a tried-and-true method in Canadian politics: after an election a new government takes office and says, ‘Oh my gosh, the cupboards are bare.’ Or, ‘We’re much deeper in debt than I thought we were, and now I’ve seen the real books.’ So I think there’s an element of that kind of gamesmanship going on,
Telford : Then there’s the fact that McCallum has been out of office for quite some time, thinking he knew the job, but some things have changed,
the Mayors Council : If you take Fraser Highway SkyTrain and if we’re building that seven days a week around the clock, we probably can save, and this is TransLink’s figures, we can probably save $2-300 million,
Langley SkyTrain : TransLink has not conducted any detailed study on potential construction methods for a Sky
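
The output above attributes the first quote to Uber, an organization named inside the quote itself. One way the heuristic could be tightened, sketched here under stated assumptions, is to consider only `PERSON` entities that lie outside the quote and to measure the gap from the nearest edge of the quote rather than from its start. The functions `span_gap` and `nearest_person`, the `(text, label, start, end)` tuple layout, and the sample sentence are all hypothetical. In spaCy the tuples could be built from `ent.text`, `ent.label_`, `ent.start_char` and `ent.end_char`.

```python
def span_gap(quote_start, quote_end, ent_start, ent_end):
    """Characters between a quote span and an entity span (0 if they touch or overlap)."""
    if ent_end <= quote_start:
        return quote_start - ent_end
    if ent_start >= quote_end:
        return ent_start - quote_end
    return 0

def nearest_person(quote_span, entities):
    """Pick the closest PERSON entity that lies outside the quote.

    entities: (text, label, start, end) tuples.
    """
    q_start, q_end = quote_span
    candidates = [
        e for e in entities
        if e[1] == "PERSON" and (e[3] <= q_start or e[2] >= q_end)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: span_gap(q_start, q_end, e[2], e[3]))

# made-up sample; offsets are computed rather than typed in by hand
text = '"I don\'t support Uber at all," candidate Doug McCallum responded.'
q_start, q_end = text.index('"'), text.rindex('"') + 1
entities = [
    (name, label, text.index(name), text.index(name) + len(name))
    for name, label in [("Uber", "ORG"), ("Doug McCallum", "PERSON")]
]
print(nearest_person((q_start, q_end), entities)[0])
```

This is still a proximity heuristic: it would not handle pronouns, or a speaker who is named several sentences away from the quote.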