# Investigating the "如字" pattern
Question: what does "如字" mean?

In [38]:
# load and initialize our best-performing spancat pipeline
import spacy
nlp = spacy.load("../training/tok2vec_ngram/model-best")
print(f"Loaded spaCy span categorizer pipeline with {(nlp.meta['performance']['spans_sc_f'] * 100):.1f}% F1 score")

# load phonetic data from the songben guangyun
import csv
from collections import defaultdict
sbgy = defaultdict(list)
with open("../assets/GDR-SBGY-full.csv", "r", encoding="utf-8", newline="") as f:
  reader = csv.DictReader(f)
  for entry in reader:
    sbgy[entry["char"]].append(entry)
sbgy = dict(sbgy)
print(f"Loaded {len(sbgy)} entries from the Songben Guangyun")

Loaded spaCy span categorizer pipeline with 86.9% F1 score
Loaded 19791 entries from the Songben Guangyun


## Process
- Find annotations with single-character headwords to remove ambiguity.
- Filter these annotations to those that include "如字" and at least one other phonetic (`PHON`) span.
- Limit to annotations whose headword is also listed at least twice in the _Songben Guangyun_.


In [45]:
# load the entire JDSW
import srsly
jdsw = list(srsly.read_jsonl("../assets/annotations.jsonl"))
print(f"Loaded {len(jdsw)} entries from the JDSW")

# filter to annotations with a single-character headword
jdsw = [anno for anno in jdsw if len(anno["meta"]["headword"]) == 1]
print(f"Filtered to {len(jdsw)} single-character headword annotations")

# filter to annotations whose headword appears in the SBGY
jdsw = [anno for anno in jdsw if anno["meta"]["headword"] in sbgy]
print(f"Filtered to {len(jdsw)} annotations whose headword appears in the Songben Guangyun")

# filter to annotations where the SBGY includes multiple pronunciations
jdsw = [anno for anno in jdsw if len(sbgy[anno["meta"]["headword"]]) > 1]
print(f"Filtered to {len(jdsw)} annotations whose headword appears in the Songben Guangyun with multiple pronunciations")

# filter to annotations that include "如字"
jdsw = [anno for anno in jdsw if "如字" in anno["text"]]
print(f"Filtered to {len(jdsw)} annotations that include '如字'")

# run the annotations through the pipeline
import time
start = time.time()
docs = list(nlp.pipe([(anno["text"], anno["meta"]) for anno in jdsw], as_tuples=True))
end = time.time()
print(f"Predicted spans for {len(docs)} documents in {end - start:.2f} seconds")

# filter to annotations with at least two phonetic spans
multi_phon_docs = []
for doc, meta in docs:
  phon_spans = [span for span in doc.spans["sc"] if span.label_ == "PHON"]
  if len(phon_spans) > 1:
    doc.user_data["meta"] = meta
    multi_phon_docs.append(doc)
print(f"Filtered to {len(multi_phon_docs)} documents with at least two phonetic spans")


Loaded 55717 entries from the JDSW
Filtered to 9990 single-character headword annotations
Filtered to 9540 annotations whose headword appears in the Songben Guangyun
Filtered to 4097 annotations whose headword appears in the Songben Guangyun with multiple pronunciations
Filtered to 197 annotations that include '如字'
Predicted spans for 197 documents in 0.40 seconds
Filtered to 159 documents with at least two phonetic spans


In [46]:
# display a few docs using spacy's span visualizer
from spacy import displacy
for doc in multi_phon_docs[:10]:
  headword = doc.user_data['meta']['headword']
  entries = set([entry['reading'] for entry in sbgy[headword]])
  print(f"{headword} ({', '.join(entries)})")
  displacy.render(doc, style="span", options={
    "colors": {
      "PHON": "#FF99C8",
      "GRAF": "#FCF6BD",
      "SEM": "#D0F4DE",
      "META": "#FFFFFF",
      "PER": "#A9DEF9",
      "WORK": "#E4C1F9",
    }
  })

參 (tshomH, sam, srim, tshom, tsrhim)


中 (trjuwng, trjuwngH)


三 (samH, sam)


說 (sywejH, ywet, sywet)


出 (tsyhwijH, tsyhwit)


齊 (dzej, dzejH)


喪 (sangH, sang)


視 (dzyijH, dzyijX)


惡 ('uH, 'u, 'ak)


池 (da, drje)


## Hypothesis
"如字" is used to distinguish between paired pronunciations that differ only in tone.

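One way to check this hypothesis against the Songben Guangyun readings is to compare each pair of readings with the tone marker removed. The sketch below assumes the `reading` values follow Baxter's Middle Chinese transcription, in which a final `X` marks the rising tone and a final `H` the departing tone. `strip_tone` and `differ_only_in_tone` are hypothetical helper names, not part of the notebook, and the sample pairs are copied from the output above.

```python
def strip_tone(reading: str) -> str:
    """Drop a trailing tone letter (assumed Baxter notation: X = rising, H = departing)."""
    return reading[:-1] if reading.endswith(("X", "H")) else reading


def differ_only_in_tone(a: str, b: str) -> bool:
    """True when two distinct readings become identical once tone letters are removed."""
    return a != b and strip_tone(a) == strip_tone(b)


# sample reading pairs taken from the notebook output above
pairs = {
    "中": ("trjuwng", "trjuwngH"),
    "視": ("dzyijH", "dzyijX"),
    "池": ("da", "drje"),
    "約": ("'jiewH", "'jak"),
}
for headword, (a, b) in pairs.items():
    print(headword, differ_only_in_tone(a, b))
# 中 True
# 視 True
# 池 False
# 約 False
```

By this test 中 and 視 are tone-only pairs, while 池 and 約 differ in their segments, so the hypothesis may not cover every "如字" annotation.
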
In [60]:
# filter to annotations with exactly two phonetic spans and exactly two sbgy entries
pair_phon_docs = []
for doc in multi_phon_docs:
  headword = doc.user_data['meta']['headword']
  entries = set([entry['reading'] for entry in sbgy[headword]])
  phon_spans = [span for span in doc.spans["sc"] if span.label_ == "PHON"]
  if len(phon_spans) == 2 and len(entries) == 2:
    pair_phon_docs.append({
      'headword': headword,
      'annotation': doc.text,
      'sbgy_entries': ','.join(entries),
      'phon_spans': ','.join([span.text for span in phon_spans]),
    })
print(f"Filtered to {len(pair_phon_docs)} documents with paired Songben Guangyun entries and paired phonetic spans")

# print some of the results
for doc in pair_phon_docs[:10]:
  print(f"{doc['headword']} ({doc['sbgy_entries']})")
  print(f"'{doc['annotation']}'")
  for span in doc['phon_spans'].split(','):
    if span.endswith("如字"):
      print(f"  {span[-2:]}")
    elif span.endswith("反"):
      fanqie = span[-3:]
      print(f"  {fanqie}")
      initial = fanqie[0]
      rime = fanqie[1]
      initial_entries = sbgy.get(initial, [])
      rime_entries = sbgy.get(rime, [])
      for initial_entry in initial_entries:
        for rime_entry in rime_entries:
          print(f"    {initial_entry['initial']} + {rime_entry['rime']}")
    else:
      print(f"  {span}")

# save the results to a CSV file
import csv
with open("../assets/ruzi.csv", "w", encoding="utf-8", newline="") as f:
  writer = csv.DictWriter(f, fieldnames=["headword", "annotation", "sbgy_entries", "phon_spans"])
  writer.writeheader()
  writer.writerows(pair_phon_docs)
  

Filtered to 113 documents with paired Songben Guangyun entries and paired phonetic spans
中 (trjuwng,trjuwngH)
'如字馬丁仲反'
  如字
  丁仲反
    tr + juwngH
    t + juwngH
三 (samH,sam)
'息暫反注同或如字'
  息暫反
    s + amH
  如字
齊 (dzej,dzejH)
'才細反又如字'
  才細反
    dz + ejH
  如字
喪 (sangH,sang)
'息浪反注同荀如字'
  如字
  息浪反
    s + ang
    s + angH
視 (dzyijH,dzyijX)
'如字徐市至反'
  如字
  市至反
    dzy + ijH
池 (da,drje)
'如字又大河反'
  如字
  大河反
    d + a
    d + a
穜 (drjowng,duwng)
'直龍反本或作重音同先種後熟曰穜案如字書禾旁作重是種稑之字作童是穜殖之字今俗則反之'
  如字
  直龍反
    dr + owng
約 ('jiewH,'jak)
'於妙反又如字'
  於妙反
    ' + jiewH
    ' + jiewH
  如字
空 (khuwng,khuwngH)
'音孔又如字下同'
  音孔
  如字
決 (xwet,kwet)
'如字又烏穴反'
  如字
  烏穴反
    ' + wet
