# CATSS Text-Fabric Enrichments

The raw data from the CCAT resource has been converted into Text-Fabric (TF) format in `tf_conversion.ipynb`. This notebook enriches the dataset in several ways:

* Morphological tags are split and parsed into individual word-level features.
* Word-level plain text is processed into a UTF-8 representation feature.
* A first effort is made to connect the ETCBC BHSA Hebrew database with the CATSS Hebrew parallel text using parts of speech for word-level connections. Phrase-level connections are created based on phrases' presence per line.
* Some rudimentary phrase divisions are also added to the CATSS data based on the parallel data.

## Instantiate CATSS TF Data

In [9]:
import collections
from tf.fabric import Fabric

In [14]:
TF = Fabric(locations=['tf'], modules=[''])

api = TF.load('book chapter verse morph trans')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

10 features found and 0 ignored
  0.00s loading features ...
   |     0.00s B book                 from tf
   |     0.00s B chapter              from tf
   |     0.01s B verse                from tf
   |     0.19s B trans                from tf
   |     0.21s B morph                from tf
   |     0.00s Feature overview: 8 for nodes; 1 for edges; 1 configs; 7 computed
  0.91s All features loaded/computed - for details use loadLog()


## Generate New Morphology Features

See the [CATSS morphology documentation](http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/*Morph-Coding). In the source data, the morphology fields are space-separated; in the TF version they are dot-separated. The tags have to be split, recognized, and converted, and are then added as separate word-level features.

A morphology code usually has two or three fields, depending on the part of speech: indeclinable words such as prepositions and conjunctions lack the parse field. From the documentation:

> 1. "TYPE" CODES (3 columns maximum, to identify part of speech)
> 2. "PARSE" CODE (up to 6 columns, as needed, to parse each form) [\*OPTIONAL]
> 3. [lexeme]
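
The three-part layout quoted above can be sketched as a small splitting helper. This is only an illustration, not the notebook's final parser: the function name `split_morph_code` is hypothetical, and it assumes that the first field is always the type code, that a parse code is present only when there are more than two fields, and that anything after the parse code can be kept untouched.

```python
def split_morph_code(morph):
    """Split a dot-separated morph string into (type_code, parse_code, rest).

    Assumptions (not confirmed by the CATSS documentation):
    field 0 is the type code, field 1 is the parse code when more than
    two fields are present, and the remaining fields (normally just the
    lexeme) are returned unchanged as a list.
    """
    fields = morph.split('.')
    type_code = fields[0]
    if len(fields) > 2:
        parse_code, rest = fields[1], fields[2:]
    else:
        # indeclinable words: type code followed directly by the lexeme
        parse_code, rest = None, fields[1:]
    return type_code, parse_code, rest


# morph strings taken from the Genesis 1 sample in this notebook
for morph in ['P.E)N', 'N1.DSF.A)RXH/', 'VAI.AAI3S.POIE/W']:
    print(split_morph_code(morph))
```

Returning the trailing fields as a list rather than a single lexeme keeps codes with more than three fields intact, so they can be inspected later instead of being silently truncated.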

Proposals for new features:

* **typ** — part of speech, derived from the type codes. It is the first letter of the type code and can have one of the following values (the proposed new value is in bold):
    * N — noun — **noun**
    * V — verb — **verb**
    * A — adjective — **adjv**
    * R — pronoun — **pron**
    * C — conjunction — **conj**
    * X — particle — **part**
    * I — interjection — **intj**
    * M — indeclinable number — **inum**
    * P — preposition — **prep**
    * D — adverb — **advb**
* **styp** — subtype of part of speech, e.g. 1st declension, 3rd declension of various stems. There are many categories, so the original code is preserved as is; refer to the documentation for the meanings. The preserved code also keeps its part-of-speech letter (N, A, R, etc.).

* case — **case**
* gender — **gender**
* number — **number**
* tense — **tense**
* voice — **voice**
* mood — **mood**
* person — **person**
* degree — **degree**
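
As a rough illustration of how the parse code could feed the features listed above, the sketch below decodes two layouts. Both layouts are assumptions inferred from sample values in this dataset such as `DSF` and `AAI3S`: case, number, gender for nominal forms, and tense, voice, mood, person, number for finite verbs. Participles, infinitives, and degree are not handled. The function names are hypothetical, and so is the label `indc` for the indicative.

```python
CASES = {'N': 'nom', 'G': 'gen', 'D': 'dat', 'A': 'acc', 'V': 'voc'}
NUMBERS = {'S': 'sg', 'D': 'du', 'P': 'pl'}
GENDERS = {'M': 'm', 'F': 'f', 'N': 'n'}
TENSES = {'P': 'present', 'I': 'imperfect', 'F': 'future',
          'A': 'aorist', 'X': 'perfect', 'Y': 'pluperfect'}
VOICES = {'A': 'active', 'M': 'middle', 'P': 'passive'}
MOODS = {'I': 'indc',  # hypothetical label for the indicative
         'D': 'impv', 'S': 'subj', 'O': 'optv'}


def decode_nominal(parse):
    """Decode a 3-character nominal parse code (assumed order: case, number, gender)."""
    if len(parse) != 3:
        raise ValueError(f'unexpected nominal parse code: {parse!r}')
    case, number, gender = parse
    return {'case': CASES[case],
            'number': NUMBERS[number],
            'gender': GENDERS[gender]}


def decode_finite_verb(parse):
    """Decode a 5-character finite-verb parse code
    (assumed order: tense, voice, mood, person, number)."""
    if len(parse) != 5:
        raise ValueError(f'unexpected finite-verb parse code: {parse!r}')
    tense, voice, mood, person, number = parse
    return {'tense': TENSES[tense],
            'voice': VOICES[voice],
            'mood': MOODS[mood],
            'person': person,
            'number': NUMBERS[number]}


print(decode_nominal('DSF'))        # parse code of A)RXH=|
print(decode_finite_verb('AAI3S'))  # parse code of E)POI/HSEN
```

Raising on an unexpected length or an unknown letter (a `KeyError` from the lookup tables) is deliberate here: a code that does not fit the assumed layout is surfaced rather than mapped to a wrong value.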


In [21]:
genesis_1 = T.nodeFromSection(('Genesis', '1'))

for word in L.d(genesis_1, otype='word')[:10]:
    print(F.trans.v(word), ' — ', F.morph.v(word))

E)N  —  P.E)N
A)RXH=|  —  N1.DSF.A)RXH/
E)POI/HSEN  —  VAI.AAI3S.POIE/W
O(  —  RA.NSM.O(
QEO\S  —  N2.NSM.QEO/S
TO\N  —  RA.ASM.O(
OU)RANO\N  —  N2.ASM.OU)RANO/S
KAI\  —  C.KAI/
TH\N  —  RA.ASF.O(
GH=N  —  N1.ASF.GH=


In [36]:
# store new features here: feature name -> node -> feature value
features = collections.defaultdict(dict)

# conversion dicts
typs = {'N': 'noun',
        'V': 'verb',
        'A': 'adjv',
        'R': 'pron',
        'C': 'conj',
        'X': 'part',
        'I': 'intj',
        'M': 'inum',
        'P': 'prep',
        'D': 'advb'}
       #'N': 'propn' proper noun, added below with special rule
cases = {'N': 'nom',
         'G': 'gen',
         'D': 'dat',
         'A': 'acc',
         'V': 'voc'}
numbers = {'S': 'sg',
          'D': 'du',
          'P': 'pl'}
genders = {'M': 'm',
          'F': 'f',
          'N': 'n'}
degrees = {'C': 'comparative',
          'S': 'superlative'}
tenses = {'P': 'present',
         'I': 'imperfect',
         'F': 'future',
         'A': 'aorist',
         'X': 'perfect',
         'Y': 'pluperfect'}
voices = {'A': 'active',
         'M': 'middle',
         'P': 'passive'}
moods = {'D': 'impv',
        'S': 'subj',
        'O': 'optv',
        'N': 'infv',
        'P': 'ptcp'}

typ_counts = collections.defaultdict(lambda: collections.Counter())

# big loop
for word in F.otype.s('word'):
    
    morph = F.morph.v(word)
    split_morph = morph.split('.')
    
    # parse morphology codes in order of appearance:
    
    # first position is always the type (part of speech):
    styp = split_morph[0] # subtype
    # get type:
    if styp == 'N': # exception for proper nouns; nouns with no subtypes
        typ = 'propn'
    else:
        typ = typs[styp[0]] # type is only first char of code, convert it
    
    # assign parsing data
    #if len(split_morph) == 2: # indeclinable word
    #    case, gender, number, degree, tense, voice, mood = ('na' for i in range(1,8))
    
    typ_counts[typ][len(split_morph)] += 1
        
len(typ_counts)

11

In [37]:
from pprint import pprint

In [38]:
pprint(typ_counts)

defaultdict(<function <lambda> at 0x129140d90>,
            {'adjv': Counter({3: 33656, 2: 6, 4: 2}),
             'advb': Counter({2: 21091, 3: 13}),
             'conj': Counter({2: 74270, 3: 4}),
             'intj': Counter({2: 1385}),
             'inum': Counter({2: 3360, 3: 193}),
             'noun': Counter({3: 138910}),
             'part': Counter({2: 8718}),
             'prep': Counter({2: 54909}),
             'pron': Counter({3: 158131, 2: 8, 4: 1}),
             'propn': Counter({3: 27091, 2: 38, 4: 2}),
             'verb': Counter({3: 65566, 4: 34635, 5: 1694, 6: 2})})
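
The counts above show a few codes with an unusual number of fields for their type. One possible way to look at them is to collect a handful of examples per type letter and field count. The helper below is a sketch: the name `unexpected_field_counts`, the `expected` table, and the `limit` parameter are all hypothetical, and the string `'C.X.KAI/'` is made up purely to exercise the function.

```python
import collections


def unexpected_field_counts(morphs, expected, limit=5):
    """Collect example morph strings whose field count is not expected.

    morphs   -- iterable of dot-separated morph strings
    expected -- dict mapping the first letter of the type code to the
                set of field counts considered normal for that type
    limit    -- maximum number of examples kept per (letter, count) pair
    """
    odd = collections.defaultdict(list)
    for morph in morphs:
        fields = morph.split('.')
        letter = fields[0][:1]
        count = len(fields)
        if count not in expected.get(letter, set()):
            examples = odd[(letter, count)]
            if len(examples) < limit:
                examples.append(morph)
    return dict(odd)


# hypothetical expectations: 2 fields for indeclinables, 3 for nouns
expected = {'C': {2}, 'P': {2}, 'N': {3}}
sample = ['C.KAI/', 'P.E)N', 'N1.DSF.A)RXH/', 'C.X.KAI/']  # last item is made up
print(unexpected_field_counts(sample, expected))
```

In the notebook the same function could be fed with `(F.morph.v(w) for w in F.otype.s('word'))`, which would list concrete strings behind the rare counts before any parsing rule is written for them.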
