# CATSS Text-Fabric Enrichments

The raw data from the CCAT resource was converted to Text-Fabric (TF) format in `tf_conversion.ipynb`. This notebook enriches that dataset in several ways:

* Morphological tags are split and parsed into individual word-level features.
* Word-level plain text is converted into a UTF-8 representation feature.
* A first attempt is made to connect the ETCBC BHSA Hebrew database with the CATSS Hebrew parallel text, using parts of speech for word-level connections. Phrase-level connections are created based on which phrases are present on each line.
* Some rudimentary phrase divisions are also added to the CATSS data based on the parallel data.

In [1]:
import collections, glob, re, Levenshtein
from tf.fabric import Fabric
from greekutils import beta2unicode # do: pip install greek-utils==0.2

# some cells are blocked from being run;
# toggle True to allow a given task
run_corrections = False
run_morphology = False
run_unicode = False

## Corrections

Corrections to errors in the data that surfaced during the decoding processes.

In [2]:
metadata = {

'': {            
        'created_by':'R. Kraft et al., CCAT, University of Pennsylvania',
        'converted_by':'Cody Kingham',
        'source':'http://ccat.sas.upenn.edu/rak//catss.html',
        'license':'http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/0-user-declaration.txt',
    },
}


# TO FIX, 2017-12-11:
# advb - 173450 (D.P.E)PA/NW) 1 == styp == D.P
# 5058 & 5159 (N.N.M.*MESRAIM) 0-2 == styp == N.N.M.

run_corrections = False

if run_corrections:

    # instantiate/load old data
    TF = Fabric(locations=['tf'], modules=[''])
    api = TF.load('book chapter verse morph trans', silent=True)
    api.makeAvailableIn(globals())
    
    nodes = collections.defaultdict(dict)

    # prime metadata
    metadata['morph'] = {'valueType': 'str'}
    
    # corrections for morph feature
    for word in F.otype.s('word'):

        if word == 11436:
            new_morph = 'VA.AAD2P.I(/ZW.KATA'
            nodes['morph'][word] = new_morph
        else:
            nodes['morph'][word] = F.morph.v(word)

    save_TF = Fabric(locations=['tf'], modules=['enrichments'])
    save_TF.save(nodeFeatures=nodes, metaData=metadata)
    del metadata['morph'] # clean out metadata dict
    
    print('corrections done!')
    
else:
    
    print('run_corrections toggled False. Doing nothing...')

run_corrections toggled False. Doing nothing...


## Load Enriched, Corrected CATSS Data 

In [3]:
# load in enriched data
TF = Fabric(locations=['tf'], modules=['', 'enrichments'])
api = TF.load('book chapter verse morph trans typ')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

22 features found and 1 ignored
  0.00s loading features ...
   |     0.00s B book                 from tf
   |     0.00s B chapter              from tf
   |     0.01s B verse                from tf
   |     0.17s B trans                from tf
   |     0.20s B morph                from tf/enrichments
   |     0.15s B typ                  from tf/enrichments
   |     0.00s Feature overview: 20 for nodes; 1 for edges; 1 configs; 7 computed
  1.02s All features loaded/computed - for details use loadLog()


## Generate New Morphology Features

See the [CATSS morphology documentation](http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/*Morph-Coding). In the source data, morphology codes are space-separated; in the TF version they are dot-separated. The tags have to be split, recognized, and converted before they are added as separate word-level features.

Morphology codes have two or three columns, depending on the part of speech. From the documentation:

> 1. "TYPE" CODES (3 columns maximum, to identify part of speech)
> 2. "PARSE" CODE (up to 6 columns, as needed, to parse each form) [\*OPTIONAL]
> 3. [lexeme]
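
As a rough illustration of the column structure quoted above, the sketch below splits a dot-separated morph string into type code, parse code, and lexeme. The helper name `split_morph_code` is hypothetical, and the rule that adverbs and conjunctions never carry a parse code simply mirrors the check used in the parsing cell of this notebook. The string `C.KAI/` is an illustrative input, not a value checked against the data.

```python
# Hypothetical helper (not part of the notebook): split a dot-separated
# CATSS morph string into (type code, parse code, lexeme).

# Assumption: type letters treated as having no parse code,
# mirroring the notebook's check for adverbs (D) and conjunctions (C).
NO_PARSE = {'D', 'C'}

def split_morph_code(morph):
    parts = morph.split('.')
    type_code = parts[0]
    if len(parts) == 2 or type_code[0] in NO_PARSE:
        # two columns only: type + lexeme
        return type_code, '', '.'.join(parts[1:])
    # three or more columns: type + parse + lexeme;
    # the lexeme itself may contain dots, so re-join the remainder
    return type_code, parts[1], '.'.join(parts[2:])

print(split_morph_code('VA.AAD2P.I(/ZW.KATA'))
print(split_morph_code('C.KAI/'))
```

Re-joining everything after the parse column keeps overloaded lexemes such as `I(/ZW.KATA` in one piece.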

Proposals for new features:

* **typ** — part of speech, derived from the type codes. It is the first letter of the type code and can have one of the following values (the proposed new feature value is in bold):
    * N — noun — **noun**
    * V — verb — **verb**
    * A — adjective — **adjv**
    * R — pronoun — **pron**
    * C — conjunction — **conj**
    * X — particle — **part**
    * I — interjection — **intj**
    * M — indeclinable number — **inum**
    * P — preposition — **prep**
    * D — adverb — **advb**
* **styp** — subtype of part of speech, e.g. 1st declension or 3rd declension of various stems. There are many categories, so the simple code is preserved as-is; refer to the documentation for their meanings. The preserved code also keeps the part-of-speech letter (N, A, R, etc.).

* case — **case**
* gender — **gender**
* number — **number**
* tense — **tense**
* voice — **voice**
* mood — **mood**
* person — **person**
* degree — **degree**
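
To make the proposed features concrete, here is a minimal sketch of how a parse code decodes by position. It assumes fully specified codes only: three characters for nominals (`[case][number][gender]`) and five for finite verbs (`[tense][voice][mood][person][number]`). The function names are hypothetical, and shorter codes, participles, and infinitives are left out on purpose.

```python
# Lookup tables copied from the conversion dicts used in this notebook
CASES = {'N': 'nom', 'G': 'gen', 'D': 'dat', 'A': 'acc', 'V': 'voc'}
NUMBERS = {'S': 'sg', 'D': 'du', 'P': 'pl'}
GENDERS = {'M': 'm', 'F': 'f', 'N': 'n'}
TENSES = {'P': 'present', 'I': 'imperfect', 'F': 'future',
          'A': 'aorist', 'X': 'perfect', 'Y': 'pluperfect'}
VOICES = {'A': 'active', 'M': 'middle', 'P': 'passive'}
MOODS = {'I': 'indc', 'D': 'impv', 'S': 'subj',
         'O': 'optv', 'N': 'infv', 'P': 'ptcp'}

def decode_nominal(parse):
    """Decode a complete [case][number][gender] code (exactly 3 chars)."""
    case, number, gender = parse  # ValueError if the code is not 3 chars
    return {'case': CASES[case],
            'number': NUMBERS[number],
            'gender': GENDERS[gender]}

def decode_finite_verb(parse):
    """Decode a complete [tense][voice][mood][person][number] code."""
    tense, voice, mood, person, number = parse
    return {'tense': TENSES[tense],
            'voice': VOICES[voice],
            'mood': MOODS[mood],
            'person': person,
            'number': NUMBERS[number]}

print(decode_nominal('DSM'))
print(decode_finite_verb('AAD2P'))
```

Position is what disambiguates the reused letters: `D` is dative in the first slot but dual in the second, and `N` is nominative in the first slot but neuter in the third.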

In [4]:
# Notes to myself
# prototypical counts per type:
# 3 - adjv, noun, verb
# 2 - advb, conj, intj, part, prep, inum
# 3 - inum, pron, propn
# 4 - propn (N.N.M.MESRAIM), verb (participle)

# those with overloaded lexemes:
# verb(>3, not participle), verb(>4, participle)

# store new features here: feature_name to node to feature 
features = collections.defaultdict(dict)

# conversion dicts
typs = {'N': 'noun',
        'V': 'verb',
        'A': 'adjv',
        'R': 'pron',
        'C': 'conj',
        'X': 'part',
        'I': 'intj',
        'M': 'inum',
        'P': 'prep',
        'D': 'advb'}
       #'N': 'propn' proper noun, added below with special rule
    
# nominals 
# [case][number][gender]
cases = {'N': 'nom',
         'G': 'gen',
         'D': 'dat',
         'A': 'acc',
         'V': 'voc'}
numbers = {'S': 'sg',
          'D': 'du',
          'P': 'pl'}
genders = {'M': 'm',
          'F': 'f',
          'N': 'n'}
degrees = {'C': 'comparative',
          'S': 'superlative'}

# verbs
# [tense][voice][mood][person][number] [case][number][gender]

tenses = {'P': 'present',
         'I': 'imperfect',
         'F': 'future',
         'A': 'aorist',
         'X': 'perfect',
         'Y': 'pluperfect'}
voices = {'A': 'active',
         'M': 'middle',
         'P': 'passive'}
moods = {'I': 'indc',
         'D': 'impv',
         'S': 'subj',
         'O': 'optv',
         'N': 'infv',
         'P': 'ptcp'}

if run_morphology:

    # big loop
    for word in F.otype.s('word'):

        morph = F.morph.v(word)
        split_morph = morph.split('.')

        # parse morphology codes in order of appearance:


        # 1. assign subtypes and types

        styp = split_morph[0] # subtype

        # get type; exception for proper nouns; nouns with no subtypes
        if styp == 'N':
            typ = 'propn'
        else:
            typ = typs[styp[0]] # type is only first char of code, convert it


        # 2. assign parsing data

        # indeclinable words
        if len(split_morph) == 2 or typ in {'advb', 'conj'}:
            case, gender, number, degree, tense, voice, mood, person = ('' for i in range(1,9))
            lexeme = '.'.join(split_morph[1:])


        # nominal words with case/gender/number
        elif typ in {'adjv', 'noun', 'inum', 'pron', 'propn'}:

            parsing_data = split_morph[1]
            case = ''
            gender = ''
            number = ''
            degree = ''

            # get parsing; some parsing codes have < 3 values, loop is thus necessary
            for i, char in enumerate(parsing_data):

                # dative/dual disambiguation
                if i == 0 and char == 'D': 
                    case = 'dat'
                elif i != 0 and char == 'D':
                    number = 'du'

                # disambiguation for 'S' superlative
                elif all([char == 'S' or char == 'C', len(parsing_data) == 4,
                          typ == 'adjv', i != 1]):
                    degree = degrees.get(char, '')

                # all other parsings
                elif char != 'D':
                    case = cases.get(char, '') if not case else case
                    gender = genders.get(char, '') if not gender else gender
                    number = numbers.get(char, '') if not number else number
                    degree = '' if not degree else degree

            # set non applicable values to null
            person, tense, voice, mood = ('' for i in range(1,5))

            lexeme = '.'.join(split_morph[2:])


        # verbs
        elif typ == 'verb':

            parsing_data = split_morph[1]
            tense = tenses[parsing_data[0] ]
            try:
                voice = voices[parsing_data[1]]
            except (KeyError, IndexError):
                # unknown or missing voice code: report it and leave voice empty
                print(word, morph)
                voice = ''
            mood = moods[parsing_data[2]]

            # handle participles 
            try:
                gender = genders[parsing_data[5]] # only participles have >4 chars
                number = numbers[parsing_data[4]]
                case = cases[parsing_data[3]]
                person = '' # non-applicable values
                degree = ''

            except IndexError:

                # all normal verbs
                try:
                    person = parsing_data[3]
                    number = numbers[parsing_data[4]]
                    case = '' # non-applicable values
                    gender = '' 
                    degree = ''

                # handle infinitives
                except IndexError: 
                    person = '' # non-applicable values
                    number = ''
                    case = ''
                    gender = ''
                    degree = ''

            lexeme = '.'.join(split_morph[2:])

        # assign features
        features['typ'][word] = typ
        features['styp'][word] = styp
        features['lex'][word] = lexeme
        features['case'][word] = case
        features['number'][word] = number
        features['gender'][word] = gender
        features['degree'][word] = degree
        features['tense'][word] = tense
        features['voice'][word] = voice
        features['mood'][word] = mood
        features['person'][word] = person

    print(f'done! with {len(features)} new features logged...')
    
    # EXPORT new features
    
    for feat in features:
        metadata[feat] = {'valueType': 'str'}
    
    TF.save(nodeFeatures=features, metaData=metadata)
    
else:
    print('run_morphology toggled False. Doing nothing...')

run_morphology toggled False. Doing nothing...


## Generate UTF8 Feature

The transcription table is available at [the CCAT documentation page](http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/0-betacode.txt).
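
The conversion cell in this section applies a final-sigma fix after the beta-code conversion. In isolation, that step might look like the sketch below. It assumes the input is a single word that has already been converted to Unicode, and the function name `fix_final_sigma` is hypothetical. It does not call `greekutils`.

```python
def fix_final_sigma(word):
    """Replace a word-final medial sigma with a final sigma."""
    # str.endswith is safe on empty strings, unlike word[-1]
    if word.endswith('σ'):
        return word[:-1] + 'ς'
    return word

print(fix_final_sigma('λογοσ'))
print(fix_final_sigma('σωμα'))
```

Only the last character is touched, so a sigma at the start or in the middle of a word is left alone.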

In [5]:
if run_unicode:
    
    for word in F.otype.s('word'):

        unicode = beta2unicode.convert(F.trans.v(word))

        # add final sigma fix! (endswith also handles empty strings)
        if unicode.endswith('σ'):
            unicode = unicode[:-1] + 'ς' 
            
        # save feature
        features['utf8'][word] = unicode
            
    # export unicode features
    for feat in features:
        metadata[feat] = {'valueType': 'str'}
    
    TF.save(nodeFeatures=features, metaData=metadata)
    
else: 
    print('run_unicode toggled False. Doing nothing...')

run_unicode toggled False. Doing nothing...


## Integration of Parallel Data with ETCBC Hebrew Data

In [6]:
# instantiate ETCBC data

hebrew_etcbc = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c')
etcbc = hebrew_etcbc.load('g_cons')

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored
  0.00s loading features ...
   |     0.11s B g_cons               from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.00s Feature overview: 108 for nodes; 5 for edges; 1 configs; 7 computed
  3.67s All features loaded/computed - for details use loadLog()


### Method

The connections are made in two processes.

Process 1 matches the data in the CATSS parallel files to their respective databases. After the Hebrew/Greek data is separated from the text-critical notations, surface forms are matched (within 1 character, using Levenshtein distance) to the corresponding transcription in the BHSA/CATSS TF data. Once matched, the relevant slot data is stored in a tuple, which is in turn keyed to its line number in the dictionaries `greekLine_slots` and `hebrewLine_slots` (to be accessed in process 2).
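
The fuzzy surface-form comparison can be sketched without any third-party package. The pure-Python `edit_distance` below is a stand-in for `Levenshtein.distance`, written out only so the example is self-contained, and `forms_match` is a hypothetical helper. The default tolerance of 1 follows the description above; note that the `is_match` function defined later in this notebook defaults to a tolerance of 2.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def forms_match(form_a, form_b, tolerance=1):
    """True if the two surface forms differ by at most `tolerance` edits."""
    return edit_distance(form_a, form_b) <= tolerance

print(forms_match('BR>CJT', 'BR>CT'))  # one deleted character
print(forms_match('KL', '>CR'))        # unrelated forms
```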

Process 2 matches the Hebrew and Greek language data using the tuples stored in process 1. Matches are based on matching part-of-speech tags and are made in sequence, to avoid mismatching words when a part-of-speech tag occurs more than once on a line.
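
Process 2 is not yet implemented in the cells below, so the following is only one possible reading of "matches made in sequence": a greedy, left-to-right pairing of `(slot, part of speech)` tuples. The function name, the slot numbers, and the assumption that both languages already share one tag set are all hypothetical.

```python
def match_by_pos(hebrew, greek):
    """Pair (slot, pos) tuples from one line greedily, left to right.

    Each Greek item is used at most once, so a repeated tag on the
    Hebrew side is paired with the next unused Greek item of that tag.
    """
    pairs = []
    used = set()
    for h_slot, h_pos in hebrew:
        for index, (g_slot, g_pos) in enumerate(greek):
            if index not in used and g_pos == h_pos:
                pairs.append((h_slot, g_slot))
                used.add(index)
                break
    return pairs

# made-up slot numbers for a single line
hebrew_line = [(1, 'prep'), (2, 'noun'), (3, 'verb'), (4, 'noun')]
greek_line = [(10, 'prep'), (11, 'noun'), (12, 'verb'),
              (13, 'art'), (14, 'noun')]

print(match_by_pos(hebrew_line, greek_line))
```

Words with no counterpart of the same tag, such as the Greek article in this example, simply stay unmatched.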

### Definitions and Functions

In [118]:
# CCAT to ETCBC Transcription
ccat_etcbc = {')': '>',
             'B': 'B',
             'G': 'G',
             'D': 'D',
             'H': 'H',
             'W': 'W',
             'Z': 'Z',
             'X': 'X',
             '+': 'V',
             'Y': 'J',
             'K': 'K',
             'L': 'L',
             'M': 'M',
             'N': 'N',
             'S': 'S',
             '(': '<',
             'P': 'P',
             'C': 'Y',
             'Q': 'Q',
             'R': 'R',
             '&': 'F',
             '$': 'C',
             'T': 'T',
             }

# characters for detecting Hebrew/Greek surface forms
hchars = ''.join(ccat_etcbc.keys()) # string of hebrew characters in transcription
gchars = ''.join(set(l for w in F.otype.s('word') # string of greek characters in transcription
                       for l in F.trans.v(w)
                    ))
gchars = gchars.replace('\\', '\\\\').replace('-', '\\-') # fix escape bug with re module

# regex to extract Hebrew/Greek words
hwords = f'^\*?[/{hchars}]+|\s\*?[/{hchars}]+'
gwords = f'^[{gchars}]*|\s[{gchars}]*'

# TEXTUAL CRITICISM NOTATIONS:
# regex patterns + descriptions
# to find and add:
# ~, ~~~
            
# tc chars + descriptions (no capture groups)
tchars = {'{#}':
              'Asterized passage (in Job).',
          '{g}':
              'Reference to difference between the text of Rahlfs and that of the relevant Göttingen edition.',
          f'\*\*[/?{hchars}]+':
              'Qere',
          '\*\*?z': 
              'Qere wela ketib, ketib wela qere.',
          '\.\.a':
              'Word included in one of the Aramaic sections.',
          '\[\d+\??\]':
              'Reference of number of verse in LXX, different from MT.',
          '\[\[\d+\??\]\]':
              'Reference number of verse in MT, different from the LXX.',
          '---\s{x}':
              'Apparent minus',
          '--\+\s{x}':
              'Apparent plus created by lack of equivalence between long stretches of text in the LXX and MT.',
          '---?\s|---?\t|---?$':
              'In the Greek column:  Hebrew counterpart lacking in the LXX (minus in the LXX).',
          '-?--\+':
              'In col a. of the Hebrew:  element added in the Greek (plus in the LXX).',
          '\'\'|\'':
              'Long minus or plus (at least four lines).',
          '{\.\.~}':# CHECK
              'Stylistic or grammatical transposition.',
          '{d\??}':
              'Reference to doublet (occurring between the two elements of the doublet).',
          '{'+f'\.\.\.[/?\s,{hchars}]*'+'}|{'+f'\.\.\.[{gchars}\s/?]*' + '}':
              'Equivalent reflected elsewhere in the text, disregarded by indexing program.',
          '{'+f'\.\.d[{gchars}]*'+'}': 
              'Distributive rendering, occurring once in the translation but referring to more than one Hebrew word.',
          '{'+f'\.\.\^[\s{gchars}]*'+'}': 
              '[unknown]',
          '{'+ f'\.\.r[/?,{hchars}]*'+'}': 
              'Notation in Hebrew column of elements repeated in the translation.',
          '\?': 
              'Questionable notation, equivalent, etc.',
          '{p}':
              'Greek preverb representing Hebrew preposition.',
          '{'+ f'\.\.p[{gchars}]*' +'}': 
              'Preposition added in the LXX in accordance with the rules of the Greek language or translational habits.',
          '{!}[a-z\-]*': 
              'Infinitive absolute',
          '{s}':
              'Hebrew M/, MN (comparative, superlative) reflected by Greek comparative or superlative.',
          '{t}':
              'Transliterated Hebrew word.',
          '{v}':
              'The reading of the main text of the LXX seems to reflect a secondary text, while the original reading is reflected in a variant.',
          '=%\s|=%$':
              'Introducing categories of translation technique recorded in col. b.',
          '=%vap|-%vap': # second one is ft. a mistake in text
              'Change from active to passive form in verbs.',
          '=%vpa|%vpa': # second one mistake?
              'Change from passive to active form in verbs.',
          '=%p\s|=%p$':
              'Difference in preposition or particle.',
          '=%p\+|=%p-':
              'Addition[/subtraction] of preposition or particle.',
          '=p%-':
              'Omission of preposition or particle.',
          f'=;\S*|;=\S*': # second maybe mistake
              'Retroversion in col. b based on equivalence occurring in immediate or remote context.',
          '^G\s|\sG\s|\sG$':
              'Hebrew variant, but at this stage no plausible retroversion is suggested.',
          '=\+|-\+':
              'Difference in numbers between MT and the LXX.',
          '=?@\S*':
              'Etymological exegesis.',
          f'=@[?/\s,{hchars}]*a':
              'Etymological exegesis according to Aramaic.',
          f'=:\S*':
              'Introducing reconstructed proper noun.',
          '=v\s':
              'Difference in vocalization (reading).',
          '=vs':
              'Difference in vocalization (reading). [shin/sin]',
          f'=r[/?\s,{hchars}]*':
              'Incomplete retroversion.',
          '{\*}':
              'Agreement of LXX with ketib.',
          '{\*\*}':
              'Agreement of LXX with qere.',
          '^\.\s|\s\.\s|\s\.$':
              'Interchange of consonants between MT and the presumed Hebrew parent text of the LXX.',
          '\.rd\S*':
              'Interchange of consonants R/D, etc.',
          '\.m\S*':
              'Interchange of consonants, metathesis',
          '\.z\S*':
              'Possible abbreviation',
          '\.s\S*':
              'One word of MT separated into two or more words in the parent text of the LXX.',
          '\.j\S*':
              'Two words of MT joined into one word in the parent text of the LXX.',
          '\.w\S*':
              'Different word-division reflected in the parent text of the LXX.',
          '<.*>':
              '[verse backreference/editor remark?]',
          '\^+':
              '[difference in sequence?]',
          '\.\S*':
              '[unknown]',
          '{.*}':
              '[catch other comments]',
          '--=\S*':
              '[unknown]',
          '<\S*\s|<\S*\t|<\S*$':
              '[incomplete notes?]',
          ':.*':
              '[unknown]',
          '--\S':
              '[unknown]',
          ',\S*':
              '[unknown]',
          '{\S*':
              '[ft. broken angle bracket]'
             }

old_equal = f'=[?/\s\S,{hchars}'+'{}]+'

col_b = {'=\S*':
              'Introducing col. b of the Hebrew (a selection of retroverted readings, presumably found in the parent text of the LXX).',}

tcharsG = r'|'.join(tchars)
tchars.update(col_b)
tchars = r'|'.join(tchars)

def convert_ccat(string):
    
    '''
    Simply assembles and returns an ETCBC transcription string
    from the CCAT Hebrew transcription.
    '''
    
    converted = ''.join(ccat_etcbc[char] if char in ccat_etcbc else char
                            for char in string)
    
    return converted
    
def is_match(hebrew_list, greek_list, tolerance=2):
    
    '''
    Match two supplied lists of surface forms.
    Returns boolean.
    tolerance defines the number of characters that can differ per word.
    '''
    
    # check word by word
    for hebrew_w, greek_w in zip(hebrew_list, greek_list):
        if Levenshtein.distance(hebrew_w, greek_w) > tolerance:
            return False
    # true if it reaches this point
    return True
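
A quick check of the transcription step, using a small subset of the CCAT-to-ETCBC table above so that the example runs on its own. The names `ccat_etcbc_demo` and `convert_ccat_demo` are hypothetical stand-ins for `ccat_etcbc` and `convert_ccat`.

```python
# Subset of the CCAT-to-ETCBC table, enough for the demo
ccat_etcbc_demo = {')': '>', '(': '<', '$': 'C', '&': 'F',
                   '+': 'V', 'Y': 'J', 'C': 'Y'}

def convert_ccat_demo(string):
    # map each character once; unmapped characters pass through unchanged
    return ''.join(ccat_etcbc_demo.get(char, char) for char in string)

print(convert_ccat_demo(')$R'))  # CCAT ')$R' becomes ETCBC '>CR'
print(convert_ccat_demo('$C'))
```

Mapping character by character matters here because the table both produces and consumes `C` (`$` becomes `C`, while `C` becomes `Y`). Chained `str.replace` calls could convert the same character twice.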

In [120]:
files = sorted(glob.glob('source/parallel/*.par')) # all parallel data
#files = ['source/parallel/01.Genesis.par'] # load only practice file for now


# line numbers to slot numbers
greekLine_slots = {} 
hebrewLine_slots = {}

# increment as matches are found
catss_slot = 1
etcbc_slot = 1

# process each file
for file in files:
    
    # open and load file data
    with open(file, 'r') as infile:
        file_data = infile.read().split('\n')
    
    # make the links
    skip_line = False
    for i, line in enumerate(file_data): 
        
        # handle "#" indicator, means data continues to next line
        if '#' in line and not skip_line:
            line = line.strip() + file_data[i+1].strip() # add next line
            line = re.sub('\t#$|##\t', '', line).replace('#', '') # remove unwanted chars
            skip_line = True 
        elif '#' in line and skip_line: # skip over line that's been gathered
            skip_line = False
            continue
        
        # -- PROCESS 1: CONNECT PARALLEL FILE DATA TO BHSA/CATSS TF SLOTS --
         
        data = [dat for dat in line.split('\t') if dat] # Heb/Gre. data tab separated

        # skip lines without relevant data
        if len(data) < 2:
            continue
        elif len(data) > 2:
            raise Exception(f'>2 tabs in {file} {i+1}\n{data}\nraw: {file_data[i]+file_data[i+1]}') # sanity check

        hebrew, greek = data # 2 element list

        # process text-critical notations first
        # some t.c. notations are blended with the surface text
        
        # separate t.c. notations
        c_hebrew = re.sub(tchars, '', hebrew).replace('*', '').split()
        c_greek =  re.sub(tcharsG, '', greek).split()
        hebrew_tc = ' '.join([tc.strip() for tc in re.findall(tchars, hebrew)])
        greek_tc = ' '.join([tc.strip() for tc in re.findall(tcharsG, greek)])
        
        
        if False:
            if hebrew_tc and c_hebrew:
                
                # print tests
                print(i+1)
                print('SRC:', hebrew)
                print('HEB: ', c_hebrew)
                print('TC: ', hebrew_tc or 'NONE')
                print()
                
        if False:
            if greek_tc and c_greek:
                # print tests
                print(i+1)
                print('SRC:', greek)
                print('GRE: ', c_greek)
                print('TC: ', greek_tc or 'NONE')
                print()

        #if i > 2000:
        #    break
                
                
        if c_hebrew:
            for word in c_hebrew:
                if not re.findall(hwords, word):
                    raise Exception(f'{file} – {i+1}\nSRC: {line}\nHEB:{c_hebrew}\nTC: {hebrew_tc}')
            
        if c_greek:
            for word in c_greek:
                if not re.findall(gwords, word):
                    raise Exception(f'{file} – {i+1}\nSRC: {line}\nGRE:{c_greek}\nTC: {greek_tc}')
        
        
        # process Hebrew words
        hebrew_words = [subword for word in re.findall(hwords, hebrew)
                            for subword in word.strip().replace('*', '').split('/')]                                

        # < INSERT BHSA TF SLOT MATCHING>
        
        # process Greek words
        
        greek_words = [word.strip() for word in re.findall(gwords, greek)]    

        # < INSERT CATSS TF SLOT MATCHING>
        
        # test prints
        #print(i+1)
        #print(hebrew)
        #print(hebrew_words)
        #print()

        #if i > 24:
            #break

Exception: source/parallel/11.1Sam.par – 6644
SRC: 	A)NAQEMATIEI=S AU)TO\N})T KL )$R	KAI\ PA/NTA TA\ {d} {...KAI\ PA/NTA TA\}
HEB:['A)NAQEMATIEI', 'AU)TO\\N})T', 'KL', ')$R']
TC: =S
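
The separation of text-critical notations from surface text can be tried on a single column. The sketch below uses the Greek column of the failing line reported above and three simplified patterns. These are assumptions made for the demo: `[^}]*` stands in for the explicit character classes of the full `tchars` table, and the names `demo_patterns` and `demo_tc` are hypothetical.

```python
import re

# Simplified stand-ins for a few patterns from the notation table
demo_patterns = [
    r'\{d\??\}',         # reference to doublet
    r'\{\.\.\.[^}]*\}',  # equivalent reflected elsewhere in the text
    r'\{[^}]*\}',        # catch-all for other braced comments, listed last
]
demo_tc = re.compile('|'.join(demo_patterns))

# Greek column of the line reported in the exception above
greek = r'KAI\ PA/NTA TA\ {d} {...KAI\ PA/NTA TA\}'

surface = demo_tc.sub('', greek).split()  # words left after stripping
notations = demo_tc.findall(greek)        # notations that were stripped

print(surface)
print(notations)
```

Python's `re` tries alternatives from left to right, so in a joined pattern like this the specific notations should come before the catch-alls. The full table follows that order.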