# CATSS Text-Fabric Enrichments

The raw data from the CCAT resource was converted to Text-Fabric (TF) format in `tf_conversion.ipynb`. This notebook enriches the dataset in several ways:

* Morphological tags are split and parsed into individual word-level features.
* Word-level plain-text is processed into a UTF8 representation feature.
* A first effort is made to connect the ETCBC BHSA Hebrew database with the CATSS Hebrew parallel text using parts of speech for word-level connections. Phrase-level connections are created based on phrases' presence per line.
* Some rudimentary phrase divisions are also added to the CATSS data based on the parallel data.

In [7]:
import collections
from tf.fabric import Fabric
from greekutils import beta2unicode # do: pip install greek-utils==0.2

# some cells are blocked from being run;
# toggle to True to allow a given task
run_corrections = False
run_morphology = False
run_unicode = False

## Corrections

Corrections to the data based on the decoding processes.

In [2]:
metadata = {

'': {            
        'created_by':'R. Kraft et al., CCAT, University of Pennsylvania',
        'converted_by':'Cody Kingham',
        'source':'http://ccat.sas.upenn.edu/rak//catss.html',
        'license':'http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/0-user-declaration.txt',
    },
}


# TO FIX, 2017-12-11:
# advb - 173450 (D.P.E)PA/NW) 1 == styp == D.P
# 5058 & 5159 (N.N.M.*MESRAIM) 0-2 == styp == N.N.M.

run_corrections = False

if run_corrections:

    # instantiate/load old data
    TF = Fabric(locations=['tf'], modules=[''])
    api = TF.load('book chapter verse morph trans', silent=True)
    api.makeAvailableIn(globals())
    
    nodes = collections.defaultdict(dict)

    # prime metadata
    metadata['morph'] = {'valueType': 'str'}
    
    # corrections for morph feature
    for word in F.otype.s('word'):

        if word == 11436:
            new_morph = 'VA.AAD2P.I(/ZW.KATA'
            nodes['morph'][word] = new_morph
        else:
            nodes['morph'][word] = F.morph.v(word)

    save_TF = Fabric(locations=['tf'], modules=['enrichments'])
    save_TF.save(nodeFeatures=nodes, metaData=metadata)
    del metadata['morph'] # clean out metadata dict
    
    print('corrections done!')
    
else:
    
    print('run_corrections toggled False. Doing nothing...')

run_corrections toggled False. Doing nothing...
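
The correction pattern in the cell above (copy every value, override a few nodes) can also be sketched with plain dictionaries. This is a minimal illustration, not the notebook's code: `apply_corrections` is a hypothetical helper, and the "old" values are stand-ins, since the real ones come from `F.morph.v(node)`. Only node 11436 and its corrected tag are taken from the cell.

```python
def apply_corrections(values, corrections):
    """Return a copy of a node-to-value mapping with corrections laid on top."""
    corrected = dict(values)       # leave the input mapping untouched
    corrected.update(corrections)  # corrected nodes win over original values
    return corrected

# stand-in values for illustration only; real values come from F.morph.v(node)
original = {11435: 'old-value-a', 11436: 'old-value-b'}
corrections = {11436: 'VA.AAD2P.I(/ZW.KATA'}

fixed = apply_corrections(original, corrections)
print(fixed[11436])
```

Keeping the overrides in one dictionary makes it easier to add further fixes, such as the items listed under "TO FIX", without growing the `if`/`else` chain.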


## Load Enriched, Corrected CATSS Data 

In [3]:
# load in enriched data
TF = Fabric(locations=['tf'], modules=['', 'enrichments'])
api = TF.load('book chapter verse morph trans')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

21 features found and 1 ignored
  0.00s loading features ...
   |     0.00s B book                 from tf
   |     0.00s B chapter              from tf
   |     0.01s B verse                from tf
   |     0.16s B trans                from tf
   |     0.21s B morph                from tf/enrichments
   |     0.00s Feature overview: 19 for nodes; 1 for edges; 1 configs; 7 computed
  0.89s All features loaded/computed - for details use loadLog()


## Generate New Morphology Features

See the [CATSS morphology documentation](http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/*Morph-Coding). In the source data, morphology codes are space-separated; in the TF version they are dot-separated. Tags have to be split, recognized, and converted. They are then added as separate word-level features.

Morphology codes have 3 or 2 columns, depending on part of speech type. From the documentation:

> 1. "TYPE" CODES (3 columns maximum, to identify part of speech)
> 2. "PARSE" CODE (up to 6 columns, as needed, to parse each form) [\*OPTIONAL]
> 3. [lexeme]
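
As a rough sketch of that column layout, the snippet below splits a dot-separated tag into its type code, optional parse code, and lexeme. `split_morph_tag` is a hypothetical helper name, and the two-column tag `C.KAI/` is an illustrative value rather than one quoted from the data. The sketch ignores the special cases the notebook handles, such as adverbs and conjunctions whose type code itself contains a dot.

```python
def split_morph_tag(morph):
    """Split a dot-separated tag into (type code, parse code, lexeme)."""
    parts = morph.split('.')
    type_code = parts[0]
    if len(parts) == 2:
        # two columns: type code and lexeme, no parse code
        return type_code, '', parts[1]
    # three or more columns: the lexeme itself may contain dots, so rejoin it
    return type_code, parts[1], '.'.join(parts[2:])

print(split_morph_tag('VA.AAD2P.I(/ZW.KATA'))
print(split_morph_tag('C.KAI/'))
```

The lexeme is rejoined with dots because compound lexemes such as `I(/ZW.KATA` would otherwise be cut at the first dot.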

Proposals for new features:

* **typ** — part of speech, derived from the type codes. It is the first letter of the type code and can have the following values (bold is the proposed new feature value):
    * N — noun — **noun**
    * V — verb — **verb**
    * A — adjective — **adjv**
    * R — pronoun — **pron**
    * C — conjunction — **conj**
    * X — particle — **part**
    * I — interjection — **intj**
    * M — indeclinable number — **inum**
    * P — preposition — **prep**
    * D — adverb — **advb**
* **styp** — subtype of part of speech, e.g. 1st declension, 3rd declension of various stems. There are many categories, so the original code is preserved as-is; refer to the documentation for the meanings. The stored code keeps its leading part-of-speech letter (N, A, R, etc.).

* case — **case**
* gender — **gender**
* number — **number**
* tense — **tense**
* voice — **voice**
* mood — **mood**
* person — **person**
* degree — **degree**
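
To make the proposed features concrete, here is a minimal sketch that decodes the parse code of a finite verb into `tense`, `voice`, `mood`, `person`, and `number`. The lookup tables are modeled on the notebook's conversion dicts; `decode_finite_verb` is a hypothetical name, and the sketch assumes a five-character code, so participles and infinitives are out of scope.

```python
# lookup tables modeled on the notebook's conversion dicts
TENSES = {'P': 'present', 'I': 'imperfect', 'F': 'future',
          'A': 'aorist', 'X': 'perfect', 'Y': 'pluperfect'}
VOICES = {'A': 'active', 'M': 'middle', 'P': 'passive'}
MOODS = {'I': 'indc', 'D': 'impv', 'S': 'subj',
         'O': 'optv', 'N': 'infv', 'P': 'ptcp'}
NUMBERS = {'S': 'sg', 'D': 'du', 'P': 'pl'}

def decode_finite_verb(parse_code):
    """Decode a five-character code: tense, voice, mood, person, number."""
    if len(parse_code) != 5:
        raise ValueError(f'expected 5 characters, got {parse_code!r}')
    tense, voice, mood, person, number = parse_code
    return {
        'tense': TENSES[tense],
        'voice': VOICES[voice],
        'mood': MOODS[mood],
        'person': person,          # kept as the raw digit, e.g. '2'
        'number': NUMBERS[number],
    }

print(decode_finite_verb('AAD2P'))
```

For the parse code `AAD2P` this yields aorist, active, imperative, second person, plural.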

In [4]:
# Notes to myself
# prototypical counts per type:
# 3 - adjv, noun, verb
# 2 - advb, conj, intj, part, prep, inum
# 3 - inum, pron, propn
# 4 - propn (N.N.M.MESRAIM), verb (participle)

# those with overloaded lexemes:
# verb(>3, not participle), verb(>4, participle)

# store new features here: feature_name to node to feature 
features = collections.defaultdict(dict)

# conversion dicts
typs = {'N': 'noun',
        'V': 'verb',
        'A': 'adjv',
        'R': 'pron',
        'C': 'conj',
        'X': 'part',
        'I': 'intj',
        'M': 'inum',
        'P': 'prep',
        'D': 'advb'}
       #'N': 'propn' proper noun, added below with special rule
    
# nominals 
# [case][number][gender]
cases = {'N': 'nom',
         'G': 'gen',
         'D': 'dat',
         'A': 'acc',
         'V': 'voc'}
numbers = {'S': 'sg',
          'D': 'du',
          'P': 'pl'}
genders = {'M': 'm',
          'F': 'f',
          'N': 'n'}
degrees = {'C': 'comparative',
          'S': 'superlative'}

# verbs
# [tense][voice][mood][person][number] [case][number][gender]

tenses = {'P': 'present',
         'I': 'imperfect',
         'F': 'future',
         'A': 'aorist',
         'X': 'perfect',
         'Y': 'pluperfect'}
voices = {'A': 'active',
         'M': 'middle',
         'P': 'passive'}
moods = {'I': 'indc',
         'D': 'impv',
         'S': 'subj',
         'O': 'optv',
         'N': 'infv',
         'P': 'ptcp'}

if run_morphology:

    # big loop
    for word in F.otype.s('word'):

        morph = F.morph.v(word)
        split_morph = morph.split('.')

        # parse morphology codes in order of appearance:


        # 1. assign subtypes and types

        styp = split_morph[0] # subtype

        # get type; exception for proper nouns; nouns with no subtypes
        if styp == 'N':
            typ = 'propn'
        else:
            typ = typs[styp[0]] # type is only first char of code, convert it


        # 2. assign parsing data

        # indeclinable words
        if len(split_morph) == 2 or typ in {'advb', 'conj'}:
            case, gender, number, degree, tense, voice, mood, person = ('' for i in range(1,9))
            lexeme = '.'.join(split_morph[1:])


        # nominal words with case/gender/number
        elif typ in {'adjv', 'noun', 'inum', 'pron', 'propn'}:

            parsing_data = split_morph[1]
            case = ''
            gender = ''
            number = ''
            degree = ''

            # get parsing; some parsing codes have < 3 values, loop is thus necessary
            for i, char in enumerate(parsing_data):

                # dative/dual disambiguation
                if i == 0 and char == 'D': 
                    case = 'dat'
                elif i != 0 and char == 'D':
                    number = 'du'

                # disambiguation for 'S' superlative
                elif all([char == 'S' or char == 'C', len(parsing_data) == 4,
                          typ == 'adjv', i != 1]):
                    degree = degrees.get(char, '')

                # all other parsings
                elif char != 'D':
                    case = cases.get(char, '') if not case else case
                    gender = genders.get(char, '') if not gender else gender
                    number = numbers.get(char, '') if not number else number
                    degree = '' if not degree else degree

            # set non applicable values to null
            person, tense, voice, mood = ('' for i in range(1,5))

            lexeme = '.'.join(split_morph[2:])


        # verbs
        elif typ == 'verb':

            parsing_data = split_morph[1]
            tense = tenses[parsing_data[0] ]
            try:
                voice = voices[parsing_data[1]]
            except (KeyError, IndexError):
                # report unrecognized voice codes and leave the value empty
                print(word, morph)
                voice = ''
            mood = moods[parsing_data[2]]

            # handle participles 
            try:
                gender = genders[parsing_data[5]] # only participles have a 6th character
                number = numbers[parsing_data[4]]
                case = cases[parsing_data[3]]
                person = '' # non-applicable values
                degree = ''

            except IndexError:

                # all normal verbs
                try:
                    person = parsing_data[3]
                    number = numbers[parsing_data[4]]
                    case = '' # non-applicable values
                    gender = '' 
                    degree = ''

                # handle infinitives
                except IndexError: 
                    person = '' # non-applicable values
                    number = ''
                    case = ''
                    gender = ''
                    degree = ''

            lexeme = '.'.join(split_morph[2:])

        # assign features
        features['typ'][word] = typ
        features['styp'][word] = styp
        features['lex'][word] = lexeme
        features['case'][word] = case
        features['number'][word] = number
        features['gender'][word] = gender
        features['degree'][word] = degree
        features['tense'][word] = tense
        features['voice'][word] = voice
        features['mood'][word] = mood
        features['person'][word] = person

    print(f'done! with {len(features)} new features logged...')
    
    # EXPORT new features
    
    for feat in features:
        metadata[feat] = {'valueType': 'str'}
    
    TF.save(nodeFeatures=features, metaData=metadata)
    
else:
    print('run_morphology toggled False. Doing nothing...')

run_morphology toggled False. Doing nothing...


## Generate UTF8 Feature

The transcription table is available at [the CCAT documentation page](http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/0-betacode.txt).

In [8]:
if run_unicode:
    
    for word in F.otype.s('word'):

        unicode = beta2unicode.convert(F.trans.v(word))

        # add final sigma fix!
        if unicode.endswith('σ'):
            unicode = unicode[:-1] + 'ς' 
            
        # save feature
        features['utf8'][word] = unicode
            
    # export unicode features
    for feat in features:
        metadata[feat] = {'valueType': 'str'}
    
    TF.save(nodeFeatures=features, metaData=metadata)
    
else: 
    print('run_unicode toggled False. Doing nothing...')

run_unicode toggled False. Doing nothing...
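
The final-sigma step from the cell above can be shown in isolation. This is a small sketch that leaves out the beta-code conversion itself; `fix_final_sigma` is a hypothetical helper name, and the input word is an unaccented example chosen for illustration.

```python
def fix_final_sigma(word):
    """Replace a word-final medial sigma (σ) with a final sigma (ς)."""
    if word.endswith('σ'):
        return word[:-1] + 'ς'
    return word

print(fix_final_sigma('λογοσ'))
```

Using `str.endswith` rather than indexing the last character keeps the helper safe on empty strings.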
