# Advanced Python and Text-Fabric

By Christian Højgaard Jensen (chj@dbi.edu)

*Adapted from Martijn Naaijer (https://github.com/MartijnNaaijer/Shebanq_Course_Files/blob/master/Introduction_to_text_fabric.ipynb)*

In [1]:
import sys, os

In [2]:
from tf.app import use

In [3]:
A = use('bhsa', hoist=globals())

Using etcbc/bhsa/tf - c r1.4 in C:\Users\Ejer/text-fabric-data
Using etcbc/phono/tf - c r1.1 in C:\Users\Ejer/text-fabric-data
Using etcbc/parallels/tf - c r1.1 in C:\Users\Ejer/text-fabric-data


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis">BHSA</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Writing/Hebrew" title="('Hebrew characters and transcriptions',)">Character table</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="bhsa API documentation">bhsa API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 7.0.3</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

## More features and tuples

If you want to know which range of nodes is used for one specific object type, you use F.otype.sInterval().

In [None]:
print(F.otype.sInterval('word'))
print(F.otype.sInterval('phrase'))
print(F.otype.sInterval('clause'))

We have seen the word feature lex, but there are more features, such as sp, part of speech.

In [None]:
for word_slot in range(1, 11):
    print(F.sp.v(word_slot))

The output above shows the range of phrase nodes. What are the phrase types and phrase functions of the first ten phrases within this range?

In [None]:
first_phrase, last_phrase = F.otype.sInterval('phrase')
for phr in range(first_phrase, first_phrase + 10): # only the first ten phrases
    print(F.typ.v(phr), F.function.v(phr))

In which book can you find word slot 100000? To locate it, use T.sectionFromNode(), which returns a tuple with the book (by default in English), chapter and verse.


In [None]:
T.sectionFromNode(100000)

If you need the name of the book in a different language, use the argument "lang".


In [None]:
T.sectionFromNode(100000, lang = 'fr')

T.sectionFromNode() returns a tuple. Like a list, a tuple is an ordered sequence of elements, but it is immutable: its values cannot be changed after creation. Its elements can be indexed with [], and a tuple can be recognized by the parentheses that enclose it.


In [None]:
this_tuple = (2, 4, 4, 5)
print(this_tuple[3])
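
The two properties just mentioned, indexing and immutability, can be tried out in a short sketch. It also shows tuple unpacking, which is convenient for section tuples. The tuple `section` is a made-up value in the (book, chapter, verse) shape, not actual Text-Fabric output.

```python
# Hypothetical section tuple in the (book, chapter, verse) shape.
section = ('Genesis', 1, 1)

# Indexing works as with lists.
book = section[0]

# Unpacking assigns all three elements in one statement.
book, chapter, verse = section
print(book, chapter, verse)

# Tuples are immutable: item assignment raises a TypeError.
immutable = False
try:
    section[0] = 'Exodus'
except TypeError as error:
    immutable = True
    print('Cannot change a tuple:', error)
```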

T.sectionFromNode() returns a tuple of length 3.


In [None]:
print(len(T.sectionFromNode(10)))

The tuple starts with the biblical book of a node.


In [None]:
print(T.sectionFromNode(10)[0])

We have seen the feature "lex", which is a word feature. Other important word features are "sp" (part of speech), "gn" (gender), "nu" (number), "vt" (verbal tense), "vs" (verbal stem). The latter two have a value only if the word is a verb. Suppose you want to know the values of all of these features for the first 10 words of Genesis. You can do it like this:


In [None]:
for word in F.otype.s('word'):
    if word < 11: # an alternative way of finding the first 10 words
        print(F.lex.v(word), F.sp.v(word), F.gn.v(word), F.nu.v(word), F.vt.v(word), F.vs.v(word))

If a feature has no value in the case of a specific word, NA is returned.

## Sets and counting the number of unique lexemes in the Hebrew Bible

The set is another basic data type in Python. In contrast to a list, it contains only unique elements and has no order. You can use a set if you want to know which unique elements there are in a large amount of data. First we look at a simple example.

In [None]:
integer_set = set() # an empty set is initialized
print(integer_set) # prints the empty set

With .add() you can add an element to a set. In this case we add the integer 5.


In [None]:
integer_set.add(5)
print(integer_set)

Now we add another integer.


In [None]:
integer_set.add(27)
print(integer_set)

Now we try to add 5 again.


In [None]:
integer_set.add(5)
print(integer_set)

Now you see that the set still contains 27 and 5, each only once, because a set grows only when a new unique element is added.


In the following cell we investigate how many unique lexemes occur in the Hebrew parts of the Hebrew Bible.

In [None]:
lex_set = set()

for word in F.otype.s('word'):
    if F.language.v(word) == 'Hebrew':
        lex_set.add(F.lex.v(word))
    
#print(lex_set) #uncomment this line if you want to see the whole set of lexemes. You will see that there is no specific order
print(len(lex_set)) # returns the number of elements in the set
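
The loop above can also be written as a set comprehension. The sketch below uses a small made-up list of (lexeme, language) pairs in place of the word nodes, so it runs without Text-Fabric; the data are an assumption for illustration only.

```python
# Hypothetical (lexeme, language) pairs standing in for word nodes.
words = [
    ('B', 'Hebrew'),
    ('R>CJT/', 'Hebrew'),
    ('BR>[', 'Hebrew'),
    ('B', 'Hebrew'),       # a repeated lexeme
    ('MLK/', 'Aramaic'),   # skipped by the language filter
]

# Set comprehension: collect the lexeme of every Hebrew word; duplicates collapse.
hebrew_lexemes = {lex for lex, language in words if language == 'Hebrew'}
print(len(hebrew_lexemes))
```

With Text-Fabric loaded, the same pattern would read `{F.lex.v(w) for w in F.otype.s('word') if F.language.v(w) == 'Hebrew'}`.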

### Use a set or a list?

You could have solved the problem of counting the unique lexemes in the Hebrew Bible with a list as well, by simply appending a lexeme to a list if it does not occur in the list already. The code would be only one line longer, but there is another difference, and that is the time it takes to do the calculation. Let us compare the time it would take using a list and a set.

We start with the list-implementation. First we import the class datetime from the module datetime.

In [None]:
from datetime import datetime

In [None]:
startTime = datetime.now()  # the variable startTime saves the date and the time of the start of the execution of the cell

lex_list = []

for word in F.otype.s('word'):
    if F.language.v(word) == 'Hebrew':
        if F.lex.v(word) not in lex_list: # this is the extra line of code. It checks whether a lexeme occurs in the list already.
            lex_list.append(F.lex.v(word)) 

print(len(lex_list))

print(datetime.now() - startTime) # again, the time is measured and the starttime is subtracted.

And now we look at the set-implementation again and measure how long it takes to execute the code.

In [None]:
startTime = datetime.now()

lex_set = set()

for word in F.otype.s('word'):
    if F.language.v(word) == 'Hebrew':
        lex_set.add(F.lex.v(word)) 

print(len(lex_set))

print(datetime.now() - startTime)

It is clear that the set-implementation is much faster than the list-implementation. Checking whether a lexeme already occurs in a list means scanning the list element by element, which takes longer as the list grows, whereas a set uses hashing and finds an element in roughly constant time.
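
The same comparison can be reproduced without the corpus. The sketch below uses synthetic strings instead of lexemes and the standard-library module `timeit` instead of `datetime`; the data sizes are arbitrary assumptions, and the measured times will differ per machine.

```python
import timeit

# Synthetic data: 5,000 items drawn from 1,000 distinct "lexemes".
items = ['lex{}'.format(i % 1000) for i in range(5000)]

def unique_with_list(data):
    unique = []
    for item in data:
        if item not in unique:   # scans the list: cost grows with its length
            unique.append(item)
    return unique

def unique_with_set(data):
    unique = set()
    for item in data:
        unique.add(item)         # hash lookup: roughly constant cost
    return unique

# Time one run of each implementation.
list_time = timeit.timeit(lambda: unique_with_list(items), number=1)
set_time = timeit.timeit(lambda: unique_with_set(items), number=1)

print(len(unique_with_list(items)), len(unique_with_set(items)))
print('list: {:.4f}s  set: {:.4f}s'.format(list_time, set_time))
```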


## Levels: L.u() and L.d()

The L in L.u() and L.d() stands for Locality, and u and d stand for up and down. Going up and down between linguistic levels is very important if you want to get access to various elements of linguistic units like clauses. In the figure below the use of L is schematically pictured.

<img src="Layer_picture.png">

Let us take a few examples to explore L.u():

In [None]:
word_node = 1

clause = L.u(word_node, 'clause')
print(clause)

L.u() returns a tuple containing all relevant clauses, in this case one clause. We can extract the clause node using index [0]:

In [None]:
print(clause[0])

L.d() works the opposite way, as it navigates down from one level to another level, e.g. from clause to phrase, or from book to chapter, etc.:

In [None]:
phrase = L.d(clause[0], 'phrase')
print(phrase)

Here the tuple contains four elements, each element corresponding to one of the four phrases contained in the clause. Using index [] we can access each of these phrases:

In [None]:
print(phrase[3])
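
To make the idea of moving up and down more concrete, here is a toy model of containment in plain Python. It is only a sketch under stated assumptions: the node numbers, the dictionary `nodes` and the helper functions `up()` and `down()` are all hypothetical, and this is not how Text-Fabric implements L.u() and L.d().

```python
# Toy containment model (hypothetical node numbers, not real BHSA nodes).
# Each non-word node maps to (object type, set of word slots it contains).
nodes = {
    100: ('clause', {1, 2, 3, 4}),
    200: ('phrase', {1}),
    201: ('phrase', {2}),
    202: ('phrase', {3, 4}),
}

def slots_of(node):
    """Return the word slots covered by a node; a word covers only itself."""
    return nodes[node][1] if node in nodes else {node}

def up(node, otype):
    """Nodes of type otype that contain all slots of node."""
    return tuple(n for n, (t, slots) in nodes.items()
                 if t == otype and slots_of(node) <= slots and n != node)

def down(node, otype):
    """Nodes of type otype contained in node, in slot order."""
    if otype == 'word':
        return tuple(sorted(slots_of(node)))
    found = [n for n, (t, slots) in nodes.items()
             if t == otype and slots <= slots_of(node) and n != node]
    return tuple(sorted(found, key=lambda n: min(nodes[n][1])))

clause = up(3, 'clause')[0]      # from word 3 up to its clause
print(clause)                    # 100
print(down(clause, 'phrase'))    # (200, 201, 202)
print(down(202, 'word'))         # (3, 4)
```

As in the real API, both helpers return tuples, so a single result still has to be picked out with an index such as [0].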

We can also use L.d() in combination with T.nodeFromSection() to find out which nodes belong to a certain section of the text.

In [None]:
exod_1_1 = T.nodeFromSection(('Exodus', 1, 1))
words_exod_1_1 = L.d(exod_1_1, 'word') # retrieve the word nodes with L.d()
print(words_exod_1_1)

You can print the text of a sequence of words with T.text(). T.text() takes a sequence of nodes, so you can pass either a range() over an interval of slots or a list (or tuple) of word nodes.

In [None]:
T.text(range(1, 5)) # iterating over an interval; word slots start at 1, not 0

In [None]:
T.text(words_exod_1_1) #Iterating over a list

The following is nearly the same, but now we print the whole chapter:


In [None]:
exod_1 = T.nodeFromSection(('Exodus', 1, ))
words_exod_1 = L.d(exod_1, 'word') # retrieve the word nodes with L.d()
print(T.text(words_exod_1))

## Who is eating? Retrieving the subject of a predicate

Very often you are interested not only in the features of a specific object, but also in the features of other objects in its environment. Suppose you are interested in eating habits in the Hebrew Bible. You decide to search for cases of the verb >KL[ (to eat), used as the predicate of a clause, and you are interested in those cases in which that clause has an explicit subject. You would like to know all the lexemes in the subject.

The strategy is as follows. First we search for all cases of >KL[. From the word >KL[ we move upward to the phrase in which it occurs, using L.u(). We check whether this phrase is a predicate. We then move upward again, to the level of the clause, and check whether the clause has an explicit subject by moving downward to the phrases in the clause, which is done with L.d().

In [None]:
for word in F.otype.s('word'):
    if F.lex.v(word) == '>KL[':
        phrase = L.u(word, 'phrase')[0] # L.u() returns a tuple.
        # We want to know the node of the phrase, so we add the index [0], which is the first value of the tuple.
        
        # now we check if the phrase is a predicate; we also check for cases in a nominal predicate (PreC)
        # and predicates with an object suffix (PreO):
        if F.function.v(phrase) in {'Pred','PreC','PreO'}:
            # we move upwards to the clause
            clause = L.u(phrase, 'clause')[0]
            # and we go down again to all the phrases in that clause
            phrases = L.d(clause, 'phrase')
            
            # we loop over all the phrases to check if there is an explicit subject:
            for phr in phrases:
                if F.function.v(phr) == 'Subj':
                    # we create an empty list, in which all lexemes are stored.
                    lex_list = []
                    
                    # we move down to the lexemes of the subject:
                    words = L.d(phr, 'word')
                    
                    for subj_word in words: # a separate name, so the outer loop variable word is not overwritten
                        lex_list.append(F.lex.v(subj_word))
                        
                    print(lex_list)

## Making structured datasets with text-fabric, a standard recipe

In the following cells a csv file containing a number of features related to a topic of interest is created. The file is saved on the hard disk and can be processed further. The strategy of making this file is as follows.
The features of an observation are stored in a list, and all these lists are stored in a dictionary:

{id1 : [feat11, feat12, feat13, ...],  
  id2 : [feat21, feat22, feat23, ...],  
  id3 : [feat31, feat32, feat33, ...],  
  ...  
  ...  
 }
 
The keys of this dictionary have to be unique of course, so in general it is convenient to use the tf-id (the node number) of the object under investigation as key. Python guarantees that a dictionary remembers the order in which keys were added only from version 3.7 onwards, so we record the order explicitly in a second structure. We use a list to do this. The id's are added to the list in the canonical order in which our observations occur:

[id1, id2, id3, ...]

After all the observations are stored in the list and the dictionary the csv file is made by looping over the id's in the list, then for each id the feature list is fetched in the dictionary and added to the csv.
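
The recipe can be sketched in a few lines before applying it to the corpus. The observations below are made-up values, and `io.StringIO` stands in for a real file so that the sketch leaves nothing on disk; the names `obs_order` and `obs_feats` are hypothetical.

```python
import io

obs_order = []   # keeps the ids in the order in which they were found
obs_feats = {}   # maps each id to its list of features

# Made-up observations: (node id, book, chapter, verse).
for node, book, chapter, verse in [(740, 'Genesis', 2, 4), (801, 'Genesis', 2, 7)]:
    obs_order.append(node)
    obs_feats[node] = [str(node), book, str(chapter), str(verse)]

# io.StringIO behaves like a file opened for writing.
out = io.StringIO()
out.write('slot,book,chapter,verse\n')
for node in obs_order:                       # loop over the ids in order
    out.write(','.join(obs_feats[node]) + '\n')   # fetch the features per id

print(out.getvalue())
```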

We will first show this with a very simple example, in which we make a csv file which contains all the places where the name JHWH occurs. The csv file will contain 4 columns: slot of the lexeme JHWH/, book, chapter, verse. On every row there will be one observation of the name.

In [4]:
jhwh_list = []
jhwh_dict = {}

for word in F.otype.s('word'):
    if F.lex.v(word) == 'JHWH/':
        where = T.sectionFromNode(word)
        info = [str(word), where[0], str(where[1]), str(where[2])]
        jhwh_list.append(word) #The word-node
        jhwh_dict[word] = info
            
print(len(jhwh_list)) # prints total number of occurrences of the name

jhwh_dict[740]

6828


['740', 'Genesis', '2', '4']

In [8]:
with open(r'jhwh_data.csv', "w", encoding='utf-8') as csv_file:
    
    # it is often useful to make a header
    header = ['slot', 'book', 'chapter', 'verse']
    csv_file.write('{}\n'.format(','.join(header)))

    for case in jhwh_list:
        info = jhwh_dict[case]
        
        # with .write() the information from info is added to our file, but the elements in info need to be formatted first.
        csv_file.write('{}\n'.format(','.join(info))) #see below for explanation of this line
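
Writing the lines by hand works as long as no value contains a comma. As an alternative, Python's standard `csv` module takes care of quoting. The sketch below is an assumption about how one might use it here, with made-up rows and an in-memory buffer instead of a file.

```python
import csv
import io

rows = [
    ['slot', 'book', 'chapter', 'verse'],
    [740, 'Genesis', 2, 4],
    [1, 'a, made-up value', 1, 1],   # hypothetical field containing a comma
]

buffer = io.StringIO()    # stands in for open('some_file.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerows(rows)    # numbers are converted to strings automatically

# Reading the text back shows that the comma inside the field survived.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[2])
```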

## join() and string formatting

In the last line of the previous code cell the information in the info-list was added to the csv file. How exactly was this done?

### join()

First all the elements in the list were concatenated into one string with join(). join() concatenates everything in the list into one long string, and the separator is specified first. We want to make a csv file, so the comma is the natural choice, but you can use any character. Note that the list should contain only strings.

In [None]:
info = ['Genesis', '1', '1']

In [None]:
new_string = ','.join(info)
print(new_string)
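
The note that the list should contain only strings matters in practice, because chapter and verse numbers are integers. A small sketch of what goes wrong and how to convert the elements first:

```python
info = ['Genesis', 1, 1]   # chapter and verse are integers here

# join() refuses non-string elements.
try:
    ','.join(info)
except TypeError as error:
    print('join failed:', error)

# Convert every element to a string first.
line = ','.join(str(element) for element in info)
print(line)
```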

### String formatting

String formatting is used if you want to organize various kinds of information in a string.

In [None]:
t = 5           # integer
p = 'Amsterdam' # string
f = True        # boolean 

If you want to put these together in one string, you use the placeholders {} and separate them with a character of choice. In the example below they are separated by spaces.

In [None]:
'{} {} {}'.format(t, p, f)

In [None]:
'The temperature in {} is {} degrees'.format(p, t)

In the cell in which the jhwh_data.csv file was made, you saw {}\n . \n is the newline. With this the next string that is added to the csv file comes on a new line.
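
As an alternative to .format(), Python 3.6 and later also offer f-strings, in which the variable names are written directly inside the placeholders. This is not used elsewhere in this notebook; the sketch only shows that both forms give the same result.

```python
t = 5
p = 'Amsterdam'

with_format = 'The temperature in {} is {} degrees'.format(p, t)
with_fstring = f'The temperature in {p} is {t} degrees'

print(with_fstring)
print(with_format == with_fstring)
```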

## Semi-structured datasets

In the previous examples a so-called structured dataset was made, which we saved in a csv file. The dataset is called structured because it has a fixed format. It consists of a number of columns, and in each column you find the same kind of information for each case in the dataset.

You should also be able to make unstructured or semi-structured datasets. An unstructured dataset contains data that are closer to the raw data as we find them in 'nature'. An example is a picture of a Dead Sea Scroll. In a semi-structured dataset the data are structured partly. In our case you could for instance make a text file with the consonantal text of the Hebrew Bible. In the following example, a text file is made in which the biblical text is represented per verse as a sequence of lexemes, separated by spaces.

In [None]:
with open("lexemes.txt", "w") as lex_file:    
    for verse in F.otype.s('verse'):
        where = T.sectionFromNode(verse)
        # do not forget to make strings of chapter and verse
        verse_string = '{} {} {} '.format(where[0], where[1], where[2])
        words = L.d(verse, 'word')
        
        for word in words: # this is an alternative approach to string formatting
            if word != words[-1]: # if the word is not the last word,
                verse_string += F.lex.v(word) #add the lexeme
                verse_string += ' ' # and add a space
                
            else: # this is the last word
                verse_string += F.lex.v(word)
                verse_string += '\n' # add a newline
                
        lex_file.write(verse_string)
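
The check for the last word can be avoided with join(), which puts the separator only between the elements. The sketch below builds one line in the same layout as the cell above; the section tuple and the list of lexemes are made-up stand-ins for T.sectionFromNode() and the lexemes of a verse.

```python
# Hypothetical section tuple and lexemes for one verse.
where = ('Genesis', 1, 1)
lexemes = ['B', 'R>CJT/', 'BR>[', '>LHJM/']

# join() puts a space between the lexemes only, so the last word needs no special case.
verse_string = '{} {} {} '.format(*where) + ' '.join(lexemes) + '\n'
print(repr(verse_string))
```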

In the previous script a file was made that contains the text of the whole MT. If you want to make a separate file for each book, you can do the following.

In [None]:
def lexeme_processing(v):
    """ 
    This function returns a string of lexemes for a verse node, which is the input.
    It is identical to part of the code you have seen in the previous cell.
    """
    where = T.sectionFromNode(v)
    verse_string = '{} {} {} '.format(where[0], where[1], where[2])
    
    words = L.d(v, 'word') # use the function argument v, not a variable from outside the function
    for word in words:
        if word != words[-1]:
            verse_string += F.lex.v(word)
            verse_string += ' '
        else:
            verse_string += F.lex.v(word)
            # a new verse gets a new line.
            verse_string += '\n'    
    return verse_string

In [None]:
# for every book a new file is created.
for book in F.otype.s('book'):
    book_file = F.book.v(book) + '.txt'
    
    with open(book_file, "w") as new_file:
        verses = L.d(book, 'verse')
        for verse in verses:
            # here the function lexeme_processing is called
            new_string = lexeme_processing(verse)
            new_file.write(new_string)

## Final example of a structured dataset: the eat-data

Now we continue with the >KL[ data. On every row in the csv file we store information about one clause containing >KL[, with the following information in columns:  
slot of >KL[,  
book,  
chapter,  
verse,  
verbal tense,  
verbal stem,  
predicate type (Pred, PreC, PreO, PreS),  
lexemes of the subject, concatenated in a string, separated by underscores.  
The first 6 columns contain information about the verb, the predicate type contains information about the phrase in which >KL[ occurs, and the last column contains information about the subject of the clause. It may look like a lot of work, but you will notice that it is done straightforwardly.

In [None]:
eat_list = []
eat_dict = {}

# this part is nearly identical to what you have already seen
for w in F.otype.s('word'):
    
    # select the words with the right lexeme and make sure the language is Hebrew.
    if F.lex.v(w) == '>KL[' and F.language.v(w) == 'Hebrew':
        phrase = L.u(w, 'phrase')[0] 
        
        # we include cases with a subject suffix
        if F.function.v(phrase) == 'PreS':
            suffix = F.prs.v(w)
            # now we collect the information needed in a list called info
            where = T.sectionFromNode(w)
            info = [w, where[0], where[1], where[2], F.vt.v(w), F.vs.v(w), F.function.v(phrase), suffix]
            eat_dict[w] = info
            eat_list.append(w)
            
        # the other predicate types are processed now
        else:
            if F.function.v(phrase) in {'Pred','PreC','PreO'}:
                where = T.sectionFromNode(w)
                clause = L.u(phrase, 'clause')[0]
                phrases = L.d(clause, 'phrase')
                subject = False # we only include those cases that have an explicit subject
                
                for phr in phrases:
                    if F.function.v(phr) == 'Subj':
                        subject = True
                        words = L.d(phr, 'word')
                        words_lex = (F.lex.v(subj_word) for subj_word in words) # subj_word avoids confusion with the verb node w
                        subj_lexemes = "_".join(words_lex)
                    
                if subject: # this is the same as: if subject == True:, but it is a bit cleaner 
                    info = [w, where[0], where[1], where[2], F.vt.v(w), F.vs.v(w), F.function.v(phrase), subj_lexemes]
                    eat_dict[w] = info
                    eat_list.append(w)

In [None]:
eat_dict

In [None]:
with open(r'eat_data.csv', "w") as csv_file:
    
    # we make a header again
    header = ['slot', 'book', 'chapter', 'verse', 'tense', 'stem', 'predicate', 'subj_lex']
    csv_file.write('{}\n'.format(','.join(header)))

    for case in eat_list:
        info_list = eat_dict[case]
        line = [str(element) for element in info_list]
        csv_file.write('{}\n'.format(','.join(line)))

## Data analysis

In [None]:
import numpy as np
import pandas as pd

Using the Pandas module we can import csv files and explore them. First, we use the function `read_csv()` to import a csv file:

In [None]:
dataframe = pd.read_csv("eat_data.csv")

Use `type()` to see the type of the object:

In [None]:
type(dataframe)

We can easily display the dataframe:

In [None]:
dataframe

Or perhaps we prefer to see only the first few rows. For this we use a slice inside the index []:

In [None]:
dataframe[:5]

We can sort our dataframe according to the values of the columns using the function `sort_values()`. The function takes an argument `by` for which the values can be either a string or a list depending on whether we want to sort our dataframe according to one or more columns. The order of the list stipulates the order of sorting, so the first column name in the list will be given highest priority.

In [None]:
dataframe.sort_values(by=['stem','subj_lex'])

We can also display only one column using the index []:

In [None]:
dataframe['tense']

We can create a new dataframe from one of the columns in the dataframe. This dataframe will automatically count the different values of that column. The function `crosstab()` takes at least two arguments, `index` and `columns`, which specify the rows and the columns of the new dataframe, respectively:

In [None]:
tense_tab = pd.crosstab(index=dataframe["tense"], columns="count")
tense_tab

We can count the total number of values in our dataframe:

In [None]:
tense_tab.sum()

And we can use the sum of all values to create a proportional table:

In [None]:
tense_tab/tense_tab.sum()
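
To see what the count table and the proportional table contain, the same arithmetic can be spelled out in plain Python with `collections.Counter`. The list of tense values is made up for illustration and is not taken from the eat-data.

```python
from collections import Counter

# Made-up tense values standing in for the 'tense' column.
tenses = ['impf', 'perf', 'impf', 'wayq', 'impf', 'perf', 'wayq', 'impf']

counts = Counter(tenses)          # how often each value occurs
total = sum(counts.values())      # the total number of values

# Each proportion is the count divided by the total.
proportions = {tense: count / total for tense, count in counts.items()}

print(counts)
print(proportions)
```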

### Two-way tables

We can also create a two-way table based on two of our columns in the dataframe. This is called cross tabulation. Here we want to cross tabulate our columns 'tense' and 'stem':

In [None]:
tense_stem = pd.crosstab(index=dataframe['tense'],columns=dataframe['stem'])

tense_stem

We can change the names of the columns and rows:

In [None]:
tense_stem.columns = ['hiphil','niphal', 'pual', 'qal']

tense_stem.index = ['yiqtol', 'infc','qatal','ptca','wayq']

tense_stem

You can easily make marginal counts (that is, row counts and column counts) by adding the argument `margins = True`:

In [None]:
tense_stem = pd.crosstab(index=dataframe['tense'],columns=dataframe['stem'], margins=True)

tense_stem.columns = ['hif','nif','pual','qal','rowtotal']
tense_stem.index = ['impf','infc','perf','ptca','wayq','coltotal']

tense_stem

And you can also display proportions rather than counts by dividing the table by the grand total. For this we need to identify the location of the marginal counts, and we do this using either `.loc` or `.iloc` for either labels (e.g. 'rowtotal') or integers (e.g. 4):

`.loc` uses labels for index

`.iloc` uses integers for index

See more in [docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated)

The grand total sits at the intersection of rowtotal and coltotal, so we specify this location inside the index []:

In [None]:
tense_stem/tense_stem.loc["coltotal","rowtotal"]

In [None]:
tense_stem/tense_stem.iloc[5,4]

If you only want the proportions of each column, divide by coltotal:

In [None]:
tense_stem/tense_stem.loc['coltotal']

By default, dividing a dataframe by a single row or column matches its labels against the column labels of the dataframe, which is what gave us the proportions of each column above. To get the proportions of each row instead, we divide by rowtotal and match on the row labels. We do so by using the function `.div()` and setting the argument `axis = 'index'` or `axis = 0`:

In [None]:
new_tab = tense_stem.div(tense_stem['rowtotal'], axis='index')

Display in percentage:

In [None]:
new_tab*100

We can use the function `round()` to round off the numbers. The first argument is the object to be rounded off, while the second argument is the number of decimals to be rounded off to:

In [None]:
round(new_tab*100, 2)

## Plotting

In this course, we do not want to go too deep into data modelling, statistics, and plotting of data, however interesting, since these subjects require a course of their own. For now, we will only look at a few functions that help visualize counts and proportions in order to make preliminary observations.

First, let us find the 20 most frequent subjects in our 'eat' data set.

The first step is to create a table with subjects and counts:

In [None]:
table_subj = pd.crosstab(index=dataframe['subj_lex'], columns='count')
table_subj[:10]

The next step is to sort the table in descending order:

In [None]:
table_subj_sort = table_subj.sort_values(by=['count'],ascending=False)
table_subj_sort[:10]

We only want the 20 most frequent lexemes:

In [None]:
top_20 = table_subj_sort[:20]
top_20

Now, let's go plotting!

We use another python package. See [docs](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html)

In [None]:
import matplotlib.pyplot as plt

A standard plot is very easy to create. We simply use the function `plot()` and require the plot to be a barplot, setting the kind='bar'.
Finally, we display the plot with the function `show()`:

In [None]:
ax = top_20.plot(kind='bar')
plt.show()

We can modify our plot in many various ways:

In [None]:
ax = top_20.plot(kind='bar', figsize=(10, 5)) # size of the figure: first value is width, second is height

plt.title('Top 20 subject lexemes of the verb "eat"') # adding a title

plt.ylabel('frequency') # adding Y-axis label
plt.xlabel('subject lexeme') # changing X-axis label

plt.legend(loc='best', shadow=True) # modifying legend

plt.show()

Apart from these few modifications, there are a thousand other possibilities. Keep in mind that the intention of a plot is to visualize specific data so that a reader can quickly grasp the results. For this, we will certainly need a title, X- and Y-axis labels, scales and possibly a legend. These are basic requirements, and for most purposes they will also suffice. Remember this before you spend hours on the layout of the graph.