This notebook provides an introduction/setup for error-reporting on the QT30 corpus.

It would be useful to have a general error-check service that could be run on any corpus or file. Most immediately, however, there is a need to resolve issues in the data of the QT30 corpus (http://corpora.aifdb.org/qt30), which is being expanded to QT50 with another 20 episodes and re-released.

A significant number of errors were uncovered while trying to run a new set of analytics (currently under development, https://github.com/arg-tech/papa) on the corpus. These analytics make more complex use of the graph structure and of the contents of the text field.

There are a few ways of dividing the error types.

Firstly, errors may be general or corpus-specific. Some of the issues are structural mistakes that should never be found in any IAT data (e.g. missing connections between nodes), while others are specific to QT30 (e.g. text whose in-text timestamps are not formatted to the corpus standard).

Secondly, errors may be 'visible' errors that can be seen (and therefore corrected) by annotators, such as edges going in the wrong direction. Others are 'invisible', such as markup in the text that refers to nodes which do not exist elsewhere in the file (and were presumably deleted, but without the text markup being updated).
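As a rough illustration of the 'invisible' case, the sketch below lists node IDs that are marked up in the text but absent from the node list. It assumes the markup takes the form `id="node78_19747"`, as seen in the QT30 files handled later in this notebook; the function name `orphan_markup_ids` is made up for this example.

```python
import re

def orphan_markup_ids(xaif):
    """Return node IDs referenced by markup in xaif['text'] that are
    missing from xaif['AIF']['nodes'] (sorted, without repeats)."""
    # Assumption: text markup looks like id="node<nodeID>"
    marked = set(re.findall(r'id="node([^"]+)"', xaif.get('text', '')))
    known = {str(n['nodeID']) for n in xaif['AIF']['nodes']}
    return sorted(marked - known)
```

An empty list means every marked-up span has a matching node; anything else would go in the report rather than being patched by script.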

Overall, the conclusion has been that anything fixable by annotators should be fixed by them, rather than automatically by script (even if it's something that *could* be scripted), and only errors that annotators can't fix (like duplicate copies of a node appearing in the JSON) should be passed on to someone with direct database access.

The goal therefore is to produce a comprehensive report of errors that can then be forwarded to the appropriate person/team to fix. It is likely that many of the 'general' errors are already picked up by Erisa's code.
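A minimal sketch of what one kind of report entry might look like, using duplicate nodes as the example, since those need database access to fix. The row fields (`file`, `error`, `nodeID`, `fix_by`) and both function names are assumptions for illustration, not an agreed report format.

```python
import json
from collections import Counter

def duplicate_node_ids(xaif):
    """Return the nodeID of every node entry that appears more than once
    with identical contents in xaif['AIF']['nodes']."""
    counts = Counter(json.dumps(n, sort_keys=True) for n in xaif['AIF']['nodes'])
    return [json.loads(key)['nodeID'] for key, count in counts.items() if count > 1]

def report_rows(filename, xaif):
    """Build one report row per duplicated node (hypothetical row layout)."""
    return [
        {'file': filename, 'error': 'duplicate node', 'nodeID': node_id, 'fix_by': 'database'}
        for node_id in duplicate_node_ids(xaif)
    ]
```

Rows in this shape could be collected across all files and written out with `csv.DictWriter` or loaded into a pandas DataFrame.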

----

The original solution to the errors was to patch them on a local copy and keep a running tally, but it became clear that the errors were too frequent (albeit trivial) for this to be a practical solution.

Below are some pieces of code used to download the corpus locally and start that patching.

In [1]:
from tools import ova_combo
from tools import aifdb_fetcher
from src import xaif_toolbox
from src import papa
from src import analytics

import pandas as pd 
import regex as re 
import os
import json
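import urllib.request  # urlopen is used below; a bare `import urllib` does not guarantee the submodule is loaded
import urllib.error    # HTTPError is caught in the download loop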
import csv
import wget
import importlib
import requests
from glob import glob

In [9]:
importlib.reload(papa)

<module 'src.papa' from '/Users/eimear/GitHub/papa/src/papa.py'>

Not all QT30 files are in the current format: some are old and will need converting from OVA2 to OVA3 (the function is in xaif_toolbox).

We need to work with the OVA XAIF files rather than the plain AIF, because issues in the text field need to be found as well.

# Downloading and patching

Download QT30

In [2]:
with open('./tests/qt30_test/qt_ep_true_shortnames.csv', 'r', encoding='utf-8-sig') as f:
    qt30_ep_tuples = [tuple(line) for line in csv.reader(f)]
ep_names = [t[1] for t in qt30_ep_tuples]

url = "http://corpora.aifdb.org/nodesets.php?shortname=qt30"
# Use a separate name for the response so the url string is not overwritten
with urllib.request.urlopen(url) as response:
    nodesets = json.loads(response.read().decode())

with open('./tests/qt30_test/qt30_shortcorps.txt', 'r') as file:
    qt30_episode_list = file.read().splitlines()

(Haven't made this dir yet, but...)

Download, and convert any old-format files.

In [3]:
base_dir = "./qt30_error_finding/orig_episode_maps"
map_type = 'ova'

for shortname, epname in qt30_ep_tuples:
    dir_out = f"{base_dir}/{epname}"
    #os.path.join(base_dir, epname)

    if not os.path.exists(dir_out):
        os.makedirs(dir_out)


    url = f"http://corpora.aifdb.org/nodesets.php?shortname={shortname}"
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode())
    map_list = data['nodeSets']
    print(f"\nNodesets for {epname}:", data['nodeSets'])
    missing = []
    
    for i, n in enumerate(data['nodeSets']):
        print("----")
        print(f"Checking for argmap at nodeset {n} (node {i+1}/{len(data['nodeSets'])})")
        filepath = f"{dir_out}/{str(n)}.json" #os.path.join(dir_out, f"{str(n)}.json")

        # print(f"Checking for file {filepath}")
        if os.path.exists(filepath):
            pass
            # print("Already downloaded!")
        else:
            # print("File doesn't exist yet.")
            # fileurl = url_base + n
            if map_type == 'ova':
                fileurl = f"http://ova.arg.tech/db/{n}"
                # print(f"\nGetting OVA-friendly file", end='')
            else:
                fileurl = f"http://www.aifdb.org/json/{n}"
                # print(f"\nGetting json file at {fileurl}",  end='')

            print(f" Getting url {fileurl}... ")
            try:
                wget.download(fileurl, filepath)
                print("\nDone")
            except urllib.error.HTTPError:
                missing.append(n)
                print(f"404: Nodeset {n} appears not to exist")
print("----")

for ep in ep_names:
    ova_combo.ova_all_ova3(f'./qt30_error_finding/orig_episode_maps/{ep}', f'./qt30_error_finding/ova3_episode_maps/{ep}')


Nodesets for 11november2021: [23460, 23473, 23475, 23476, 23478, 23479, 23480, 23481, 23483, 23484, 23485, 23487, 23488, 23491, 23494, 23495, 23496, 23497, 23498, 23500, 23502, 23503, 23504, 23505, 23506, 23507, 23508, 23509, 23510, 23511, 23512, 23513, 23514, 23515, 23517, 23555, 23688, 24042, 25475, 25409, 25690, 25691]
----
Checking for argmap at nodeset 23460 (node 1/42)
 Getting url http://ova.arg.tech/db/23460... 

Done
----
Checking for argmap at nodeset 23473 (node 2/42)
 Getting url http://ova.arg.tech/db/23473... 

Done
----
Checking for argmap at nodeset 23475 (node 3/42)
 Getting url http://ova.arg.tech/db/23475... 

Done
----
Checking for argmap at nodeset 23476 (node 4/42)
 Getting url http://ova.arg.tech/db/23476... 

Done
----
Checking for argmap at nodeset 23478 (node 5/42)
 Getting url http://ova.arg.tech/db/23478... 

Done
----
Checking for argmap at nodeset 23479 (node 6/42)
 Getting url http://ova.arg.tech/db/23479... 

Done
----
Checking for argmap at nodeset 234

Functions for consistent/repeat errors

In [4]:
def qt30_text_namefix(text_to_fix, filename, verbose=False):
    for fix_it in [('Luke McCroy Jones', 'Luke McCroy-Jones'), ('Kate Frobes', 'Kate Forbes'),
                   ('Chris Philip', 'Chris Philp'), ('Goerge Mpanga', 'George Mpanga'), ('George MPanga', 'George Mpanga')]:
        if fix_it[0] in text_to_fix:
            if verbose:
                print(f"{os.path.basename(filename)}: {fix_it[0]} error")
            # text_to_fix = text_to_fix.replace(fix_it[0], fix_it[1])
            text_to_fix = re.sub(fix_it[0], fix_it[1], text_to_fix)
    return text_to_fix


# For consistency, replace instances of 'Andrew RT Davies' in the transcript in L-nodes with 'Andrew Davies'
def andy_fix(xaif, verbose=False):
    in_l = False
    in_txt = False
    
    corrected_nodes = []
    # Convert L-node speakers
    for l in xaif['AIF']['nodes']:
        if l['type'] == 'L':
            if re.search('Andrew RT Davies[\s]*:', l['text']):
                l['text'] = re.sub('Andrew RT Davies[\s]*:', 'Andrew Davies :', l['text'])
                in_l = True
                if verbose:
                    print(f"Node {l['nodeID']}: {l['text']}")
        corrected_nodes.append(l)
    xaif['AIF']['nodes'] = corrected_nodes
    
    # Convert transcript attributions
    if 'Andrew RT Davies' in xaif['text']:
        if verbose:
            print('\tAndrew RT Davies found in text')
        xaif['text'] = re.sub('Andrew RT Davies', 'Andrew Davies', xaif['text'])
        # xaif['text'] = re.sub(r"(^|(?<=[\.\"?\'!’…;>]))(Andrew RT Davies)(?=(<.*>)*\[.*?)", '\\1Andrew Davies\\3', xaif['text'])
    
    if verbose:
        if in_l:
            print('\tAndrew RT Davies found in L-node(s)')
    
    return xaif

# One case for fixing names to be consistent across text and L-nodes : gave up on this immediately
def misc_name_fixes(xaif):
    for l in [n for n in xaif['AIF']['nodes'] if n['type'] == 'L']:
        l['text'] = re.sub('Craig Davis :', 'Craig Unknown :', l['text'])
    return xaif


# Make use of 'Unknown' consistent across nodes and transcript
# Assumes speaker attribution is identifiable by a space after speaker name.
def audience_unk(xaif, verbose=False):
    # Check L-nodes
    corrected_nodes = []
    nodes_with_unk = []
    unk_spkrs = []
    # Make sure capitalisation of L-node speakers is consistent
    for n in xaif['AIF']['nodes']:
        if n['type'] != 'L':
            corrected_nodes.append(n)
        else:
            # Check for start to skip annotator nodes
            if re.findall(r"^[\w*\d*]+ [Uu]nknown[\s]*:", n['text']):
                n['text'] = re.sub(r'[Uu]nknown[\s]*:', 'Unknown :', n['text'])
                corrected_nodes.append(n)
                nodes_with_unk.append(n)

                splits = n['text'].split(':')
                if len(splits) < 2:
                    print(f"L-node with no recognisable speaker:\t{n['nodeID']}")
                else:
                    spkr = splits[0].strip()
                    # Only record a speaker when one was actually parsed
                    if spkr not in unk_spkrs:
                        unk_spkrs.append(spkr)
            else:
                corrected_nodes.append(n)
    if verbose:
        if len(unk_spkrs) > 0:
            print(unk_spkrs)
            print(xaif['text'])
    # Handle text
    for spkr in unk_spkrs:
        first_name = spkr.split()[0]
        # Allow for either immediate timestamp or immediate tag.
        # Escape the name: the L-node pattern above allows '*' in speaker names
        xaif['text'] = re.sub(rf"{re.escape(first_name)}[\s]*\[", f"{first_name} Unknown[", xaif['text'])
        xaif['text'] = re.sub(rf"{re.escape(first_name)}[\s]*<", f"{first_name} Unknown<", xaif['text'])

    xaif['AIF']['nodes'] = corrected_nodes
    return xaif

# Some audience names were partially converted to codes, but only in L-nodes
# If this has been done, update their name in the transcript
def audience_name2code(xaif, verbose=False):
    codename_nodes = []

    # Collect all L-nodes that have an AudienceMember speaker and are linked to the text
    l_nodes = [n for n in xaif['AIF']['nodes'] if n['type'] == 'L']
    if verbose:
        print(f"{len(l_nodes)} L-nodes to check")
    
    for l in l_nodes:
        codename = re.findall("^AudienceMember \d\d\d\d\d\d\d\dQT\d\d", l['text'])
        if codename:
            if verbose:
                print(f"\tFound {codename[0]} in node {l['nodeID']}, ", end='')
                # print(f"\t\t" + l['text'])
            direct_loc = re.findall(f"node{l['nodeID']}", xaif['text'])
            if direct_loc:
                if verbose:
                    print(f"node {l['nodeID']} found in text")
                codename_nodes.append({'name': codename[0], 'nodeID': l['nodeID']})
                # if verbose:
                    # print(codename_nodes[-1])
            else:
                if verbose:
                    print(f"node {l['nodeID']} not found in text")
            if verbose:
                print(f"\t{l['text']}")
                print()

    if verbose and len(codename_nodes) > 0:
        print("Now checking for each codename entry")
    # Check how the speaker is named in the transcript: if stripped version isn't codename, replace
    for n in codename_nodes:
        if verbose:
            print("Checking for node ", n['nodeID'])
        match_iter = re.finditer(f"(^|(?<=[\.\"?\'!’…;>]))([\'\s\w+]+)?(?=(<.*>)*\[.*?id=\"node{n['nodeID']})", xaif['text'])
        match_list = [m for m in match_iter]
        if verbose:
            print(f"\nList of matches for {n['nodeID']}")
            print(match_list)
        if len(match_list) == 0:
            print(f"Failed to match name {n['name']} for node {n['nodeID']}")
            continue
        else:
            # Orig picking last won't work if there's a linebreak between the name and stamp
            target = match_list[-1]
            # Change: work backwards skipping blanks until first non-blank
            for t in reversed(match_list):
                if t.group().strip() != '':
                    target = t
                    break
            if verbose:
                print(f"Matching ", target)
            if target.group().strip() != n['name'].strip():
                if verbose:
                    print('MISMATCH! Going to correct text')
                    print(f"{target.group().strip()} != {n['name'].strip()}")
            # print(xaif['text'][:target.span()[0]] + n['name'] + xaif['text'][target.span()[1]:])
                xaif['text'] = xaif['text'][:target.span()[0]] + n['name'] + xaif['text'][target.span()[1]:]
            else:
                print(f"Already correct for node {n['nodeID']}")
    return xaif

Sample of running one in-place fixer:

In [5]:
for ep in ep_names:
    print(f'\n*** {ep} ***')
    ep_files = glob(f'./qt30_error_finding/ova3_episode_maps/{ep}/*.json')
    print(ep_files)
    for mapfile in ep_files:
        print('\t', mapfile)

        with open(mapfile, 'r') as f:
            argmap = json.loads(f.read())
        
        for n in argmap['AIF']['nodes']:
            n['text'] = qt30_text_namefix(n['text'], mapfile)
        argmap['text'] = qt30_text_namefix(argmap['text'], mapfile)
        continue # don't actually run this here
        with open(mapfile, 'w') as f:
            json.dump(argmap, f, indent=4)


*** 11november2021 ***
['./qt30_error_finding/ova3_episode_maps/11november2021/23515.json', './qt30_error_finding/ova3_episode_maps/11november2021/23503.json', './qt30_error_finding/ova3_episode_maps/11november2021/23485.json', './qt30_error_finding/ova3_episode_maps/11november2021/23688.json', './qt30_error_finding/ova3_episode_maps/11november2021/23488.json', './qt30_error_finding/ova3_episode_maps/11november2021/23484.json', './qt30_error_finding/ova3_episode_maps/11november2021/23555.json', './qt30_error_finding/ova3_episode_maps/11november2021/23502.json', './qt30_error_finding/ova3_episode_maps/11november2021/23514.json', './qt30_error_finding/ova3_episode_maps/11november2021/23509.json', './qt30_error_finding/ova3_episode_maps/11november2021/23460.json', './qt30_error_finding/ova3_episode_maps/11november2021/23476.json', './qt30_error_finding/ova3_episode_maps/11november2021/23513.json', './qt30_error_finding/ova3_episode_maps/11november2021/23505.json', './qt30_error_finding/o

Trial and error: running analytics on each file and adding a hacky fix to problems as identified.

In [6]:
# 1) Systematic fixes
for i, ep in enumerate(ep_names):
    print(f"\n=== Episode {ep} ({i+1}/30)===")
    # maps  = glob(f'./qt30_error_finding/ova3_mini_name_edits/{ep}/*.json')
    maps  = glob(f'./qt30_error_finding/ova3_episode_maps/{ep}/*.json')
    if not os.path.exists(f'./qt30_error_finding/ova3_systemic_fix/{ep}'):
        os.makedirs(f'./qt30_error_finding/ova3_systemic_fix/{ep}')

    for mapfile in maps:
        # print(os.path.basename(mapfile))
        print(f"Opening {mapfile}")
        with open(mapfile) as f:
            xaif = json.loads(f.read())
        
        # xaif = xaif_toolbox.remove_all_meta(xaif)
        xaif = andy_fix(xaif)
        xaif = misc_name_fixes(xaif)
        xaif = audience_unk(xaif)
        xaif = audience_name2code(xaif, verbose=True)

        for n in xaif['AIF']['nodes']:
            n['text'] = qt30_text_namefix(n['text'], mapfile)
        xaif['text'] = qt30_text_namefix(xaif['text'], mapfile)

        with open(f'./qt30_error_finding/ova3_systemic_fix/{ep}/{os.path.basename(mapfile)}', 'w') as f:
            json.dump(xaif, f, indent=4)


=== Episode 11november2021 (1/30)===
Opening ./qt30_error_finding/ova3_episode_maps/11november2021/23515.json
52 L-nodes to check
	Found AudienceMember 20211111QT18 in node 140_23515, node 140_23515 found in text
	AudienceMember 20211111QT18 : A lot of this comes to the cost of the person

	Found AudienceMember 20211111QT18 in node 147_23515, node 147_23515 found in text
	AudienceMember 20211111QT18 : Being in Eastleigh we're close to an airport and a station which have fantastic ability to get anywhere

	Found AudienceMember 20211111QT18 in node 155_23515, node 155_23515 found in text
	AudienceMember 20211111QT18 : I wanted to go to Edinburgh

	Found AudienceMember 20211111QT18 in node 166_23515, node 166_23515 found in text
	AudienceMember 20211111QT18 : I can't get there for £30 on a flight

	Found AudienceMember 20211111QT18 in node 188_23515, node 188_23515 found in text
	AudienceMember 20211111QT18 : I want to drive to be green

	Found AudienceMember 20211111QT18 in node 195_235

In [7]:
# 2) Hack fixes
for i, ep in enumerate(ep_names):
    print(f"\n=== Episode {ep} ({i+1}/30)===")
    maps  = glob(f'./qt30_error_finding/ova3_systemic_fix/{ep}/*.json')
    if not os.path.exists(f'./qt30_error_finding/ova3_patch_fix/{ep}'):
        os.makedirs(f'./qt30_error_finding/ova3_patch_fix/{ep}')

    for mapfile in maps:
        print(os.path.basename(mapfile))
        with open(mapfile) as f:
            xaif = json.loads(f.read())
        # xaif = xaif_toolbox.remove_all_meta(xaif)

        xaif['text'] = re.sub('\[DONE\]', '', xaif['text'])
        xaif['text'] = re.sub('--', '--.', xaif['text'])

        # Hacky individual fixes for dodgy maps    
        # Main text fixes
        if os.path.basename(mapfile) == '20880.json':
            xaif['text'] = re.sub('Andrew Unknown', 'Andrew Davies', xaif['text'])
        elif os.path.basename(mapfile) == '18798.json':
            faulty_node = [n for n in xaif['AIF']['nodes'] if n['nodeID'] == "125_18798"][0]
            faulty_node.update({'text' : re.sub('Fiona Bruce', 'George Mpanga', faulty_node['text'])})
        elif os.path.basename(mapfile) == '25556.json':
            xaif['text'] = re.sub('\[DONE\]' ,'', xaif['text'])
        elif os.path.basename(mapfile) == '23603.json':
            # Assign (not compare) so the unwanted node is actually dropped
            xaif['AIF']['nodes'] = [n for n in xaif['AIF']['nodes'] if n['text'] != "he said nothing"]
            xaif['text'] = "Rosie Jones [0:15:19]" + xaif['text']
        elif os.path.basename(mapfile) == '23594.json':
            xaif['text'] = "Minette Batters [0:13:30]" + xaif['text']
        elif os.path.basename(mapfile) == '22954.json':
            xaif['text'] = "Fiona Bruce" + xaif['text']
        elif os.path.basename(mapfile) == '19757.json':
            xaif['text'] = re.sub('\[Sue\]:', 'Sue Unknown[00:30:56]', xaif['text'])
            xaif['text'] = re.sub('\[Dave\]:', 'Dave Unknown[00:31:07]', xaif['text'])
            xaif['text'] = re.sub('\[DONE\]', '', xaif['text'])
        elif os.path.basename(mapfile) == '23440.json':
            xaif['text'] = re.sub('\[James\]:', 'James Allcock[00:49:48]', xaif['text'])
        elif os.path.basename(mapfile) == '19750.json':
            xaif['text'] = re.sub('\[Colette\]:', 'Colette Unknown[00:12:31]', xaif['text'])
            xaif['text'] = re.sub('Anneleise', 'Anneliese', xaif['text'])
            xaif['text'] = re.sub('\[Speaker\]', 'Anneliese Dodds[00:13:16]', xaif['text'])
        elif os.path.basename(mapfile) == '19747.json':
            # xaif['text'] = re.sub('\[Bethany\]: I think', 'Bethany Unknown[00:26:48] I think', xaif['text'])
            xaif['text'] = re.sub('\[Bethany\]:&nbsp;I think', 'Bethany Unknown[00:26:48] I think', xaif['text'])
            xaif['text'] = re.sub('\[Bethany\]:&nbsp;<span class=\"highlighted\" id=\"node78_19747\">No', 'Bethany Unknown[00:27:09]&nbsp;<span class=\"highlighted\" id=\"node78_19747\">No', xaif['text'])
            xaif['text'] = re.sub('\[Bethany\]:&nbsp;<span class=\"highlighted\" id=\"node109_19747\">Yes', 'Bethany Unknown[00:27:18]&nbsp;<span class=\"highlighted\" id=\"node109_19747\">Yes', xaif['text'])
            xaif['text'] = re.sub("s --", 's --.', xaif['text']) # hacky
        elif os.path.basename(mapfile) == '19771.json':
            xaif['text'] = re.sub('\[Andrea\]:', 'Andrea Unknown[00:55:09]', xaif['text'])
        elif os.path.basename(mapfile) == '19764.json':
            xaif['text'] = re.sub('\[Josh\]:', 'Josh Unknown[00:40:48]', xaif['text'])
            xaif['text'] = re.sub('\[James\]:', 'James Unknown[00:41:29]', xaif['text'])
        elif os.path.basename(mapfile) == '19772.json':
            xaif['text'] = re.sub('\[James\]:', 'James Unknown[00:59:36]', xaif['text'])
        elif os.path.basename(mapfile) == '19769.json':
            xaif['text'] = re.sub('\[Nicola\]:', 'Nicola Unknown[00:43:15]', xaif['text'])
        elif os.path.basename(mapfile) == '19753.json':
            xaif['text'] = re.sub('\[Andrea\]', 'Andrea Unknown[00:11:35]', xaif['text'])
            xaif['text'] = re.sub('Audience Member', 'Audience Unknown', xaif['text'])
        elif os.path.basename(mapfile) == '19749.json':
            xaif['text'] = re.sub('\[George\]:', 'George Unknown [00:32:13]', xaif['text'])
        elif os.path.basename(mapfile) == '19739.json':
            xaif['text'] = re.sub('Part 1', '', xaif['text'])
        elif os.path.basename(mapfile) == '19774.json':
            xaif['text'] = re.sub('\[Colette\]:', 'Colette Unknown[00:52:15]', xaif['text'])
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Collette' in n['text']]
            for n in faulty_nodes:
                n['text'] = re.sub('Collette', 'Colette Unknown', n['text'])
        elif os.path.basename(mapfile) == '19763.json':
            xaif['text'] = re.sub('\[Speaker\]', 'Unknown Speaker[00:38:55]', xaif['text'])
        elif os.path.basename(mapfile) == '19759.json':
            xaif['text'] = re.sub('\[Alex\]:', 'Alex Unknown[00:41:53]', xaif['text'])
        elif os.path.basename(mapfile) == '19755.json':
            xaif['text'] = re.sub('\[Speaker\]:', 'Nadhim Zahawi[00:35:03]', xaif['text'])


        # Node text fixes
        elif os.path.basename(mapfile) == '18760.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if n['type'] == 'L']
            for n in faulty_nodes:
                n.update({'text': re.sub('Female Audience[\s]*Member1', 'Kirsty Unknown', n['text'])})
                print('updating')
        elif os.path.basename(mapfile) == '18812.json':
            faulty_node = [n for n in xaif['AIF']['nodes'] if n['nodeID'] == "171_18812"][0]
            faulty_node.update({'text' : re.sub('Female Audience[\s]*Member', 'Emma Stephenson', faulty_node['text'])})
            print(faulty_node)
        elif os.path.basename(mapfile) == '25554.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Unknown Unknown' in n['text']]
            for n in faulty_nodes:
                n.update({'text' : re.sub('Unknown Unknown', 'Unknown Speaker', n['text'])})
        elif os.path.basename(mapfile) == '21270.json':
            xaif['text'] = re.sub('Lias ', 'Lisa ', xaif['text'])
        elif os.path.basename(mapfile) == '21305.json':
            faulty_node = [n for n in xaif['AIF']['nodes'] if n['nodeID'] == "21_21305"][0]
            faulty_node['text'] = "Andi Unknown : " + faulty_node['text']
        elif os.path.basename(mapfile) == '21281.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Tim Unknown' in n['text']]
            for n in faulty_nodes:
                n['text'] = re.sub('Tim Unknown', 'Tim Down', n['text'])
        elif os.path.basename(mapfile) == '18848.json':
            for n in xaif['AIF']['nodes']:
                if (n['type'] == 'L') and ('Berlinda' in n['text']):
                    n['text'] = re.sub('Berlinda', 'Belinda', n['text'])
        elif os.path.basename(mapfile) == '18849.json':
            for n in xaif['AIF']['nodes']:
                if (n['type'] == 'L') and ('Deborah Unknown' in n['text']):
                    n['text'] = re.sub('Deborah Unknown', 'Deborah Norrish', n['text'])
        elif os.path.basename(mapfile) == '23820.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Mehdi Hasan' in n['text']]
            for n in faulty_nodes:
                n['text'] = re.sub('Mehdi Hasan', 'Nelufar Hedayat', n['text'])
        elif os.path.basename(mapfile) == '19737.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Suan Desouza' in n['text']]
            for n in faulty_nodes:
                n['text'] = re.sub('Suan', 'Susan', n['text'])



        # Node type fixes
        elif os.path.basename(mapfile) == '18809.json':
            faulty_node = [n for n in xaif['AIF']['nodes'] if n['nodeID'] == '38_18809'][0]
            faulty_node.update({'type': 'I'})
        elif os.path.basename(mapfile) == '23607.json':
            faulty_node = [n for n in xaif['AIF']['nodes'] if n['nodeID'] == "209_23607"][0]
            faulty_node.update({'type': 'I'})
        elif os.path.basename(mapfile) == '21306.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if n['nodeID'] in ["125_21306", "127_21306"]]
            for n in faulty_nodes:
                n['text'] = "Asserting"
                n['type'] = "YA"

        # Wrong reported speech
        elif os.path.basename(mapfile) == '23154.json': 
            # flip edges
            faulty_edges = [e for e in xaif['AIF']['edges'] if e['edgeID'] in [10, 11, 71, 70, 77, 78]]
            for e in faulty_edges:
                e.update({'fromID': e['toID'], 'toID': e['fromID']})

            # flip types
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if n['nodeID'] in ['40_23154', '36_23154', '146_23154', '83_23154', '123_23154', '117_23154']]
            for n in faulty_nodes:
                if n['type'] == 'L':
                    n.update({'type': 'I'})
                elif n['type'] == 'I':
                    n.update({'type': 'L'})
        
        elif os.path.basename(mapfile) == '23602.json': 
            # flip edges
            faulty_edges = [e for e in xaif['AIF']['edges'] if e['edgeID'] in [69, 68]]
            for e in faulty_edges:
                e.update({'fromID': e['toID'], 'toID': e['fromID']})

            # flip types
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if n['nodeID'] in ['94_23602', '88_23602']]
            for n in faulty_nodes:
                if n['type'] == 'L':
                    n.update({'type': 'I'})
                elif n['type'] == 'I':
                    n.update({'type': 'L'})


        # Multiple fixes
        elif os.path.basename(mapfile) == '18790.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Female Audience Member 3 :' in n['text']]
            for n in faulty_nodes:
                n.update({'text' : re.sub('Female Audience Member 3', 'Katie Unknown', n['text'])})
            xaif['text'] = re.sub('Katie\[', 'Katie Unknown[', xaif['text'])
        elif os.path.basename(mapfile) == '18866.json':
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'Mike Maleaudience' in n['text']]
            for n in faulty_nodes:
                n.update({'text' : re.sub('Mike Maleaudience', 'Mike Unknown', n['text'])})
            xaif['text'] = re.sub('Mike\[', 'Mike Unknown[', xaif['text'])
        
        elif os.path.basename(mapfile) == '19756.json':
            # Fix text
            xaif['text'] = re.sub('\[Jim\]', 'Jim Unknown[00:29:52]', xaif['text'])
            xaif['text'] = re.sub('\[DONE\]', '', xaif['text'])
            xaif['text'] = re.sub('\[James\]', 'James Unknown[00:30:32]', xaif['text'])

            # Fix James nodes
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if 'James ' in n['text']]
            for n in faulty_nodes:
                n.update({'text' : re.sub(r'James[\s]*:', 'James Unknown :', n['text'])})

            # Fix Jim nodes
            faulty_nodes = [n for n in xaif['AIF']['nodes'] if n['nodeID'] in ['40_19756', '47_19756', '54_19756', '69_19756', '80_19756']]
            for n in faulty_nodes:
                n.update({'text': re.sub('Fiona Bruce', 'Jim Unknown', n['text'])})


        elif os.path.basename(mapfile) == '21280.json':
            # Remove double name
            faulty_node = [n for n in xaif['AIF']['nodes'] if 'Tony Unkown : Tony Unkown :' in n['text']][0]
            faulty_node['text'] = re.sub('Tony Unkown : Tony Unkown :', 'Tony Unkown :', faulty_node['text'])
            
            # Fix L-nodes
            for n in [n for n in xaif['AIF']['nodes'] if n['type'] == 'L']:
                n.update({'text': re.sub('Tony Unkown', 'Tony Unknown', n['text'])})
        

        elif os.path.basename(mapfile) == '22947.json':
            aif_edges = [e for e in xaif['AIF']['edges'] if e['edgeID'] != 55]
            xaif['AIF']['edges'] = aif_edges
            ova_edges = [e for e in xaif['OVA']['edges'] if not (e['fromID'] == "61_22946" and e['toID'] == "60_22946")]
            xaif['OVA']['edges'] = ova_edges

        # Misc: duplicates
        elif os.path.basename(mapfile) == '23597.json':
            # Remove all duplicate nodes
            seen = []
            new_nodelist = []
            for node in xaif['AIF']['nodes']:
                t = tuple(node.items())
                if t not in seen:
                    seen.append(t)
                    new_nodelist.append(node)
            xaif['AIF']['nodes'] = new_nodelist

            # Remove all duplicate edges
            # only check to/from IDs, edge IDs of duplicates are unique
            seen = []
            new_edgelist = []
            for edge in xaif['AIF']['edges']:
                t = tuple([edge['fromID'], edge['toID']])
                if t not in seen:
                    seen.append(t)
                    new_edgelist.append(edge)
            xaif['AIF']['edges'] = new_edgelist

        # just too much wrong with this one...
        elif os.path.basename(mapfile) == '19761.json':
            continue

        with open(f'./qt30_error_finding/ova3_patch_fix/{ep}/{os.path.basename(mapfile)}', 'w') as f:
            json.dump(xaif, f, indent=4)
        # print()


=== Episode 11november2021 (1/30)===
23515.json
23503.json
23485.json
23688.json
23488.json
23484.json
23555.json
23502.json
23514.json
23509.json
23460.json
23476.json
23513.json
23505.json
25691.json
23483.json
23495.json
23494.json
25690.json
23504.json
24042.json
23512.json
25409.json
23498.json
23508.json
23497.json
23478.json
23481.json
23507.json
23511.json
23510.json
23506.json
23480.json
23496.json
23479.json
23475.json
23491.json
23487.json
23517.json
23473.json
23500.json

=== Episode 28october2021 (2/30)===
23542.json
23578.json
23581.json
23539.json
23558.json
23523.json
23574.json
23562.json
23535.json
23563.json
23575.json
23559.json
23538.json
23580.json
23579.json
23548.json
23525.json
23572.json
23533.json
23544.json
23552.json
23587.json
23568.json
23586.json
23569.json
23553.json
23545.json
23565.json
23573.json
23549.json
23531.json
23566.json
23570.json
23527.json
23585.json
23550.json
23547.json
23551.json
25692.json
23584.json
23526.json
23571.json
23567.json
2

# Trying to run analytics

In [10]:
for i, ep in enumerate(ep_names):
    ep_maps = glob(f"./qt30_error_finding/ova3_patch_fix/{ep}/*")
    print(f"======= {os.path.basename(ep)} ({i+1}/30) =======")
    for j, map_file in enumerate(ep_maps):
        print(os.path.basename(map_file), f"({j+1}/{len(ep_maps)})")
        with open(map_file) as f:
            xaif = json.loads(f.read())
        results = papa.all_analytics(xaif)

23515.json (1/41)
23503.json (2/41)
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node list
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node list
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node list
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node list
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node list
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node list
Node 39_23503 found in text but no node 39_23503 found in node list
Node 152_23503 found in text but no node 152_23503 found in node lis

RecursionError: maximum recursion depth exceeded
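Because one malformed map aborts the whole loop, a reporting run would probably want to record the failure and carry on. The sketch below shows one way of doing that; `run_collecting_failures` and the layout of the failure records are assumptions for illustration, and the analytics function is passed in as a parameter so the example stays self-contained.

```python
def run_collecting_failures(named_maps, analyse):
    """Apply `analyse` to each (name, xaif) pair.

    Returns (results, failures): results maps name -> analytics output,
    failures is a list of dicts describing each exception raised.
    """
    results, failures = {}, []
    for name, xaif in named_maps:
        try:
            results[name] = analyse(xaif)
        except Exception as err:  # RecursionError is a subclass of Exception
            failures.append({
                'file': name,
                'error': type(err).__name__,
                'detail': str(err),
            })
    return results, failures
```

In this notebook `analyse` would be `papa.all_analytics`, and the `failures` list could feed straight into the error report.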

Example error in 23552.json (file in 28october2021)

```
File ~/GitHub/papa/src/xaif_toolbox.py:574, in xaif_preanalytic_info_collection(xaif, verbose)
    571 all_nodes = node_setup(xaif)
    572 all_nodes = add_edge_info(xaif, all_nodes)
--> 574 all_nodes, said = add_speakers(all_nodes, verbose=verbose)
    575 all_nodes = add_assumed_speakers(all_nodes)
    576 all_nodes = add_agreement(all_nodes)

File ~/GitHub/papa/src/xaif_toolbox.py:335, in add_speakers(all_nodes, verbose)
    329                     all_nodes[ya_out]['introby'].append(all_nodes[n]['nodeID'])
    330 # Reported speech: I-node should be attributed to the quoting speaker
    331                     
    332     
    333 else:
    334     # Get quoting speaker
--> 335     quoter = reporting_speaker(n, all_nodes)
    337     for e_out in all_nodes[n]['eout']:
    338         if all_nodes[e_out]['type'] == 'YA':

File ~/GitHub/papa/src/xaif_toolbox.py:154, in reporting_speaker(l_node_id, all_nodes)
    151 if len(ya_anchor) > 1:
    152     print(f"Multiply-anchored YA {quoting_ya}: anchored by ", *ya_anchor)
--> 154 return reporting_speaker(ya_anchor[0], all_nodes)

File ~/GitHub/papa/src/xaif_toolbox.py:154, in reporting_speaker(l_node_id, all_nodes)
    151 if len(ya_anchor) > 1:
    152     print(f"Multiply-anchored YA {quoting_ya}: anchored by ", *ya_anchor)
--> 154 return reporting_speaker(ya_anchor[0], all_nodes)

    [... skipping similar frames: reporting_speaker at line 154 (2966 times)]

File ~/GitHub/papa/src/xaif_toolbox.py:154, in reporting_speaker(l_node_id, all_nodes)
    151 if len(ya_anchor) > 1:
    152     print(f"Multiply-anchored YA {quoting_ya}: anchored by ", *ya_anchor)
--> 154 return reporting_speaker(ya_anchor[0], all_nodes)

File ~/GitHub/papa/src/xaif_toolbox.py:133, in reporting_speaker(l_node_id, all_nodes)
    132 def reporting_speaker(l_node_id, all_nodes):
--> 133     quoting_ya = [n for n in all_nodes[l_node_id]['ein'] if all_nodes[n]['type'] == 'YA']
    135     # No incoming YA to this node, or the YA is Analysing: this was the original locution so should return this spkr
    136     if len(quoting_ya) == 0 or 'Analysing' in [all_nodes[q]['text'] for q in quoting_ya]:

RecursionError: maximum recursion depth exceeded
```

The recursion comes from an error in the file itself, as this kind of cycle shouldn't exist: an L-node has an edge to a YA-node, and that YA-node has an edge back to the same L-node.
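A check for exactly this pattern could run before the analytics, so the problem is reported instead of hitting the recursion limit. The sketch below is one possible version, assuming edges carry `fromID`/`toID` and nodes carry `nodeID`/`type` as in the files above; `l_ya_two_cycles` is a made-up name, and it only finds the direct L -> YA -> L loop, not longer cycles.

```python
def l_ya_two_cycles(xaif):
    """Return sorted (l_id, ya_id) pairs where an L-node has an edge to a
    YA-node and that YA-node has an edge straight back to the same L-node."""
    types = {n['nodeID']: n['type'] for n in xaif['AIF']['nodes']}
    edges = {(e['fromID'], e['toID']) for e in xaif['AIF']['edges']}
    return sorted(
        (src, dst) for src, dst in edges
        if types.get(src) == 'L' and types.get(dst) == 'YA' and (dst, src) in edges
    )
```

Since the edges are drawn in by annotators, a hit here is a 'visible' error and would be passed back to them with the file name and the two node IDs.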