In [1]:
import ipykernel
import pandas as pd
import numpy as np
import re   # regular expressions
from TexSoup import TexSoup

In [2]:
# set full view of dataframe columns
pd.set_option('display.max_colwidth', None)


# Goal

I want to create the following csv tables for a well-rounded CLDF dataset:

- ValueTable
- ParameterTable
- CodeTable
- LanguageTable
- ExampleTable

For the ExampleTable, I aim to extract the examples automatically from the LaTeX source of Supplementary Material S3 of [Höhn (2024)](https://www.degruyter.com/document/doi/10.1515/lingty-2023-0080/html). Some (manual?) postprocessing might be needed to properly convert various special characters from LaTeX macros into UTF-8.

Later I will take care of the first four tables, which can (mostly) be derived from the existing csv.

# Creating the examples table

<https://github.com/cldf/cldf/blob/master/components/examples/README.md> notes the following regarding examples.csv:

The [examples](https://github.com/cldf-datasets/lgr/blob/v1.0/cldf/examples.csv) used in the 
[Leipzig Glossing Rules](https://doi.org/10.5281/zenodo.10275705) document are available as CLDF dataset. The
`ExampleTable` is described here: https://github.com/cldf-datasets/lgr/blob/v1.0/cldf/Generic-metadata.json#L43-L137

## [ExampleTable](http://cldf.clld.org/v1.0/terms.rdf#ExampleTable): `examples.csv`

Name/Property | Datatype | Cardinality | Description
 --- | --- | --- | --- 
[ID](http://cldf.clld.org/v1.0/terms.rdf#id) | `string` | singlevalued | <div> <p>A unique identifier for a row in a table.</p> <p> To allow usage of identifiers as path components of URLs IDs must only contain alphanumeric characters, underscore and hyphen. </p> </div> 
[Language_ID](http://cldf.clld.org/v1.0/terms.rdf#languageReference) | `string` | singlevalued | <div> <p> An identifier referencing a language either </p> <ul> <li>by providing a foreign key to <code>LanguageTable</code> or</li> <li>by using a known encoding scheme.</li> </ul> </div> <br>References <code>LanguageTable</code>
[Primary_Text](http://cldf.clld.org/v1.0/terms.rdf#primaryText) | `string` | singlevalued | The example text in the source language.
[Analyzed_Word](http://cldf.clld.org/v1.0/terms.rdf#analyzedWord) | list of `string` (separated by `	`) | multivalued | The sequence of words of the primary text to be aligned with glosses
[Gloss](http://cldf.clld.org/v1.0/terms.rdf#gloss) | list of `string` (separated by `	`) | multivalued | The sequence of glosses aligned with the words of the primary text
[Translated_Text](http://cldf.clld.org/v1.0/terms.rdf#translatedText) | `string` | singlevalued | The translation of the example text in a meta language
[Meta_Language_ID](http://cldf.clld.org/v1.0/terms.rdf#metaLanguageReference) | `string` | singlevalued | References the language of the translated text<br>References <code>LanguageTable</code>
[LGR_Conformance](http://cldf.clld.org/v1.0/terms.rdf#lgrConformance) | `string` | singlevalued | The level of conformance of the example with the Leipzig Glossing Rules
[Comment](http://cldf.clld.org/v1.0/terms.rdf#comment) | `string` | unspecified | <div> <p> A human-readable comment on a resource, providing additional context. </p> </div> 

Load the tex file into a list line by line

In [3]:
with open('S3.tex','r',encoding='utf-8') as s3:
    rawtex = []
    for line in s3:
        # Keep the line as raw as possible
        rawtex.append(line.rstrip('\n'))

Create a list of the indices of all lines containing the "subsubsection" string, corresponding to the beginning index of data for a new language

In [4]:
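# enumerate gives each line's own position; rawtex.index(i) would return the first occurrence for duplicate lines
langstart=[idx for idx, line in enumerate(rawtex) if 'subsubsection' in line]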

Create a list of lists containing the lines between the starting index of each language and the starting index of the next one. The `else` branch handles the last index in `langstart` by taking everything up to the end of the file. Thus, each element of `langlist` is a list of all lines containing the examples for an individual language.

In [5]:
langlist=[]

for i in range(len(langstart)):
    if i < len(langstart)-1:
        langlist.append(rawtex[langstart[i]:langstart[i+1]])
    else:
        langlist.append(rawtex[langstart[i]:])
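
The chunking step above can also be sketched more compactly. This is only an illustration on a toy list (`toy_rawtex`, `starts` and `chunks` are made-up names, not part of the notebook): each start index is paired with the following one, and `None` as the final end index slices to the end of the list.

```python
# Toy stand-in for rawtex; the real list is read from S3.tex
toy_rawtex = [
    r'\subsubsection{Hausa (haus1257), West Chadic}',
    r'\gla we//',
    r'\subsubsection{Mupun (mwag1236), West Chadic}',
    r'\gla she//',
    r'\xe',
]

# Position of every heading line
starts = [idx for idx, line in enumerate(toy_rawtex) if 'subsubsection' in line]

# Pair each start with the next start; None as the last end slices to the end of the list
chunks = [toy_rawtex[a:b] for a, b in zip(starts, starts[1:] + [None])]

print(starts)                       # positions of the two headings
print([len(c) for c in chunks])     # number of lines per language chunk
```

The result has the same shape as `langlist`: one list of lines per language heading.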

In [6]:
langlist    # it's working

[['\\subsubsection{Hausa (haus1257), West Chadic}',
  '',
  '\\pex',
  '\\a \\begingl',
  '\\gla \\textbf{m\\={u}} Háus\\textgravemacron{a}w\\={a}//',
  '\\glb we Hausa//',
  "\\glft `we Hausa'\\\\\\citep[371]{newman2000}//",
  '\\endgl',
  '\\a \\begingl',
  '\\gla \\textbf{sh\\={\\textsci}} wannàn m\\={a}làm\\={\\textsci}//',
  '\\glb he \\Dem{}.1 teacher//',
  "\\glft `he (this) teacher'\\\\{\\citep[after][371]{newman2000}}//",
  '\\endgl',
  '\\a',
  '\\begingl',
  '\\gla \\textbf{m\\={u}} m\\textgravemacron{a}làman-nàn//',
  '\\glb we teacher-\\Dem.\\Prox{}//',
  "\\glft `we these teachers'\\\\\\citep[155]{newman2000}//",
  '\\endgl',
  '\\xe',
  '',
  'See \\citet[63, 155, 370f.]{newman2000} and also \\citet[330f.]{jaggar2001} for further examples.',
  '',
  '',
  '',
  ''],
 ['\\subsubsection{Mupun (mwag1236), West Chadic}',
  ' ',
  '\\ex',
  '\\begingl',
  '\\gla \\textbf{war} manaja n\\textschwa//',
  '\\glb 3\\F{} manager \\Def{}//',
  "\\glft `she, the manager'\\\\\\citep[a

In [7]:
len(langstart)  # we don't need these indices anymore

114

Now I build a processor for each language data chunk in order to extract the relevant data and save it in a list of dictionaries.
I will need to account for languages with one or more examples and make sure every example is captured.

The data to extract are:

1. `Language_Name` (might eventually be dropped once the table is linked to a languages.csv, but I'll keep it for clarity for now)
2. glottocode
3. `Analyzed_Word` for the example with morpheme boundaries, needs to be cleaned up.
4. `Primary_Text` cannot be directly extracted, but should be generated based on 3 after `Analyzed_Word` is cleaned up.
5. `Gloss` for the glosses, needs to be cleaned up.
6. `Translated_Text` for the free translation, needs to be cleaned up.
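
As a minimal sketch of items 1 and 2, the language name and glottocode could be read from a heading line with one regular expression. The names `header_pattern` and `parse_header` are hypothetical; the sketch assumes that every heading has the shape `\subsubsection{Name (glottocode), ...}` and relies on glottocodes consisting of four alphanumeric characters followed by four digits.

```python
import re

# Language name, optional whitespace, then the glottocode in the first pair of brackets
header_pattern = re.compile(r'\\subsubsection\{(.+?)\s*\(([a-z0-9]{4}[0-9]{4})\)')

def parse_header(line):
    """Return (language name, glottocode), or None if the line is not a language heading."""
    match = header_pattern.search(line)
    return match.groups() if match else None

print(parse_header(r'\subsubsection{Hausa (haus1257), West Chadic}'))
print(parse_header(r'\subsubsection{Hua (huaa1250), \gls{tng}, Siane-Yagaria}'))
```

Because the whitespace before the opening bracket is optional in this pattern, a heading with a missing space would still be parsed.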





In [8]:
# A function to remove any non-alphanumeric characters from a string
# eventually not needed due to use of TexSoup
def remove_special_chars(s):
    return ''.join(c for c in s if c.isalnum())



# pattern matching plain gl[abc] lines for expex examples and identifies a group excluding the initial macro and the closing // 
stripgl = re.compile(r'^\\gl[abc]\s(.*?)//\s*$')

# pattern matching glft lines accounting for the possibility that they may, but need not contain a citation block
# the first group matches the free translation 
# (initially excluding quotation marks for easier processing, but there may be examples with material outside )
glft = re.compile(
    r"^\\glft\s(.+?)(?:(?:\s*\\\\)|\s*(?://))?(?:[^\\]*(\\cite(?:al)?p(?:\[[^\]]*\])*\{.*?\}.*))?(?://)?$"
)

# function that returns the single matching group based on a re pattern if there is exactly one group, otherwise returns a tuple of all matches
def extract_glcontent(text,pattern):
    match = pattern.match(text)
    if not match:
        return text
    
    if len(match.groups()) == 1:
        return match.group(1)
    else:
        return match.groups()
    
#    return match if match else text  # Extract the middle part or return original

In [9]:
# testing ground
line=r"\glft `Then we men ate some other food.'\\"
line2=r"\glft `[Addressing the volcano:] We' \\{\citep[155, (285)]{wegener2012} quoting \citep{otherperson1987} and so forth//"
line3=r"\glft `I, an intelligent person, have long known what you, stupid man, are just discovering.'//"
line4=r"\glft `Here are yours, the Khwe's cows that we, the Whites, give you.' \\{(\citealp[41, (1)]{kilianhatz2008} quoting \citealp[514f.]{koehler1989})}//"
#stripgl.match(line3).group(1)
print(len(extract_glcontent(line3,glft)))
a=extract_glcontent(line,glft)
print(a)
#print(b)
#trans=tempstring[0][1:-1]           # remove quotation marks around translation
#print(tempstring, trans)

2
("`Then we men ate some other food.'", None)
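
The same kind of check can be run for the simpler `stripgl` pattern. The sketch below repeats the pattern under a different name (`stripgl_demo`, with a hypothetical wrapper `strip_gl`) so that it is self-contained. It shows that the leading macro and the closing `//` are removed, and that a non-matching line is returned unchanged.

```python
import re

# Same pattern as stripgl: drop the leading \gla/\glb/\glc macro and the closing //
stripgl_demo = re.compile(r'^\\gl[abc]\s(.*?)//\s*$')

def strip_gl(line):
    match = stripgl_demo.match(line)
    return match.group(1) if match else line    # fall back to the unchanged line

print(strip_gl(r'\glb we teacher-\Dem.\Prox{}//'))
print(strip_gl(r'\endgl'))
```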


The following code loops through `langlist`, first extracting the language name and glottocode from the first element of each chunk. It then loops through the lines for that language, extracting the example text, gloss, free translation and source strings.

A dictionary with those values is appended whenever an `\endgl` line closes an example block. This ensures that multiple example blocks per language are captured, since the next block of `\gla`, `\glb`, `\glft` lines that the `for line in langlist[i]` loop encounters is added as a separate entry.

The bool variable `examplesfound` is set to `True` iff at least one example dictionary is written within the `for line in langlist[i]` loop. If no example block is encountered (most likely because the examples under the language heading are not coded as standard glossed examples), the index and name of the language are added to the `missing_langs` list, allowing me to check the entries for those languages further.

The key-value pair `'Meta_Language_ID': 'stan1293'` (for English as the language of translation) is added to all dictionaries by default.

The value of `'LGR_Conformance'` is either `'WORD_ALIGNED'` or `'MORPHEME_ALIGNED'`. It is `'MORPHEME_ALIGNED'` for the languages on a manually created list (`lgr_morphemealign`) whose examples contain at least some morpheme-level alignment, and defaults to `'WORD_ALIGNED'` otherwise. Depending on the purpose, this marking may not be fully reliable, since it is assigned at the language level: a language may have some examples that mark morpheme boundaries and others that do not, for instance because the construction in question has no (marked?) morpheme boundaries in the source.
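
Since the language-level marking may be too coarse, a per-example cross-check is conceivable. The following is only a heuristic sketch and not what the notebook does: the hypothetical function `guess_lgr_conformance` inspects the gloss line, on the assumption that `-` and `=` occur there only as morpheme and clitic boundary markers. The gloss is used rather than the example line because the latter still contains LaTeX macros such as `\={u}`, which would cause false positives.

```python
def guess_lgr_conformance(gloss):
    """Heuristic: a gloss containing '-' or '=' is taken to mark morpheme or clitic boundaries."""
    return 'MORPHEME_ALIGNED' if ('-' in gloss or '=' in gloss) else 'WORD_ALIGNED'

print(guess_lgr_conformance(r'we teacher-\Dem.\Prox{}'))
print(guess_lgr_conformance('we Hausa'))
```

Comparing this guess with the language-level value would flag examples that deserve a manual look.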

In [10]:
lgr_morphemealign=['Hausa','Kambaata','Gulf Arabic','Cairene Egyptian Colloquial Arabic','Maltese',
                   'Mangarrayi','Diyari','Warlpiri','Pitjantjatjara','Guugu Yimidhirr',
                   'Kuku Yalanji','Windesi Wamesa','Kwaio','Cheke Holo','Hoava','Kokota','Norwegian',
                   'Swedish','Aromanian','Bulgarian','Pomak','Khoekhoe/Nama','Khwe/Kxoe','Kinyarwanda',
                   'Nkore-Kiga','Swahili','Menya','Momu','Imonda','Bilua','Lavukaleve','Moskona',
                   'Hatam','Maybrat','Urim','Savosavo','Manambu','Awtuw','Alamblak','Fore','Yagaria',
                   'Amele','Kobon','Usan','Kaera','Kamang','Wersing','Western Pantar','East Geshiza',
                   'Mandarin','Hungarian','Mi\'kmaq','Lezgian','Abkhaz','Basque','Classical Nahuatl',
                   'Evenki','Kalaallisut/West Greenlandic','Turkish']

In [11]:
exlist=[]
missing_langs=[]
nosource_examples=[]
source=''
counter=0
counter2=0
for i in range(len(langlist)):
    examplesfound=False
    langname,rest=langlist[i][0].split('{')[1].split(' (')      # extract language name from first line 
    glt=rest.split(')')[0]                                      # extract glottocode contained in first pair of brackets
    for idx,line in enumerate(langlist[i]):
        if 'gla' in line:
            ex=extract_glcontent(line,stripgl)                  # extract content from \gla line
            counter+=1
        elif 'glb' in line:
            gloss=extract_glcontent(line,stripgl)               # extract content from \glb line
        elif 'glft' in line:
            trans,rawsource=extract_glcontent(line,glft)        # extract translation and (if possible) source information from glft line
            if not rawsource:                                   # if the glft regex in extract_glcontent didn't identify a citation/source
                if 'cite' in langlist[i][idx+1]:                # check the next line
                    rawsource=langlist[i][idx+1]
                else:                                           # if the next line also doesn't contain a citation
                    nosource_examples.append((langname,ex))     # store language name and example with unidentified source
                    continue                                       # and skip the source extraction for this example
            
            if 'citep' in rawsource:                            # extract the arguments depending on the specific citation macro used         
                cite=TexSoup(rawsource).citep.args
            elif 'citealp' in rawsource:
                cite=TexSoup(rawsource).citealp.args
            elif 'citealt' in rawsource:
                cite=TexSoup(rawsource).citealt.args
            else:                                               # just in case we find uncaptured citation macros
                print("Whoops, couldn't find a source for example '",ex,"' for language",langname)
                break

            if len(cite) == 3:                                  # if cite(al)p/t has three arguments, the first one can be dropped (would typically be "after" or "part of" and I'm content with a plain reference)
                source=cite[2].contents[0]+'['+cite[1].contents[0]+']'      # put the parts of the citation in the order `bibtexkey[pages and other information]`
            if len(cite) == 2:                                  # if cite(al)p/t has two arguments, we just need to reverse their order and put the then second one in square brackets 
                source=cite[1].contents[0]+'['+cite[0].contents[0]+']'  

            if 'endgl' in line:
                print("In lang",langname, 'endgl occurred on the same line as glft') 

        elif 'endgl' in line:           # endgl marks the end of an example block; write dictionary entry with what could be extracted and reset the variables
            if ex != '':
                 examplesfound=True                 # flag up that at least one example was found for the language (for identifying languages without well-formed expex blocks)
                 counter2+=1
                 if langname in lgr_morphemealign:  # check which type of LGR-conformity should be assumed
                     lgrconform='MORPHEME_ALIGNED'
                 else:
                     lgrconform='WORD_ALIGNED'
                
                 exlist.append({'Language_Name': langname, 
                                'Language_ID': glt, 
                                'Primary_Text': '',
                                'Analyzed_Word': ex, 
                                'Gloss': gloss, 
                                'Translated_Text': trans, 
                                'Source': source, 
                                'Meta_Language_ID': 'stan1293', 
                                'LGR_Conformance': lgrconform,
                                'Comment': ''})
                 
                 # reset variables
                 ex=''
                 gloss=''
                 trans=''
                 source=''
                 
    if not examplesfound:                       # if no example was identified for a language, store the language and index for the block
        missing_langs.append((i,langname))
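
The source handling in the loop above turns a natbib citation into the form `bibtexkey[pages and other information]`. The sketch below illustrates the same reordering with a plain regular expression instead of TexSoup. It is a swapped-in technique for illustration only, and `cite_pattern` and `cite_to_source` are hypothetical names. It relies on the natbib convention that the last optional argument is the post-note (pages), so that a pre-note such as "after" is dropped.

```python
import re

# \citep, \citet, \citealp or \citealt, any number of [optional] arguments, then {key}
cite_pattern = re.compile(r'\\cite(?:al)?[pt]((?:\[[^\]]*\])*)\{([^}]*)\}')

def cite_to_source(rawsource):
    """Return 'key[post-note]' for the first citation in rawsource, or '' if there is none."""
    match = cite_pattern.search(rawsource)
    if not match:
        return ''
    optional, key = match.groups()
    notes = re.findall(r'\[([^\]]*)\]', optional)   # contents of the optional arguments
    if not notes:
        return key                                  # citation without page information
    return f'{key}[{notes[-1]}]'                    # the last optional argument is the post-note

print(cite_to_source(r'\citep[371]{newman2000}//'))
print(cite_to_source(r'{\citep[after][371]{newman2000}}//'))
```

Only the first citation on a line is used, so lines quoting a second source would still need manual checking.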





Manually add comment fields as desired.

In [12]:
for d in exlist:
    if d.get("Language_Name") == 'Kwaio':
        d['Comment'] = 'The status of the -a intervening between pronoun and noun is not clear, see Höhn (2020: 25) for speculation that this might be a reduced article.'
    # note: assignment (=) is needed here; a comparison (==) would leave the comment empty
    if d.get('Language_Name') == 'Maybrat':
        d['Comment'] = 'See Dol (2007: 64, fn. 5) concerning the 1PL.INCL use of the pronoun anu.'
    if d.get('Language_Name') == 'Koromfe':
        d['Comment'] = 'The article may be dropped in fast speech, for more discussion see Rennison (1997:242, 250f.).'
    if 'Khoekhoe' in d.get('Language_Name'):
        d['Comment'] = "Acknowledging Menán du Plessis (pers. comm.) for help with glossing. Khoekhoe glosses for PersN-expressions are Höhn's (2024) interpretation of Haacke (1977)."
    if 'Mi\'kmaq' in d.get('Language_Name'):
        d['Comment'] = "Acknowledging Watson Williams (pers. comm.) for help with glossing."

In [13]:
# there are items that don't seem to split properly, display them
#for i in range(len(langlist)):
#    res=langlist[i][0].split('{')[1].split(' (')
#    if len(res)<2:
#        print(res, i)

# fixed now in source file (Wari' was missing a space before the opening bracket)

In [14]:
# bugfixing testing
# the original re.split pattern was too effective in also splitting the final parts of the source line
#
# solution: only perform the first split
#re.split(r'gl[a-z][a-z]* ',"\glft `\textbf{the} woman'\\{\citep[after][202; gloss extrapolated]{patz2002}}//",1)[1]  

# now deprecated by more precise matching to strip both beginning and end of a gl-line

#pattern = re.compile(r'^\\gl[a-z]{1,2}\s(.*?)//$')
#
#def extract_middle(text):
#    match = pattern.match(text)
#    return match.group(1) if match else text  # Extract the middle part or return original

# test
#extract_middle("\\glft `\\textbf{the} woman'\\\\{\\citep[after][202; gloss extrapolated]{patz2002}}//")


This has extracted 148 example blocks.
    

In [15]:
print("Number of automatically extracted example blocks:", len(exlist))
print("Verify counter:", counter)
print("Verify counter2:", counter2)

print("Number of languages without detected entries:", len(missing_langs))
print("Number of example blocks missing a source entry:", len(nosource_examples))


print(missing_langs)
print(nosource_examples)

Number of automatically extracted example blocks: 148
Verify counter: 149
Verify counter2: 148
Number of languages without detected entries: 3
Number of example blocks missing a source entry: 13
[(35, 'English'), (81, 'Hua'), (102, 'Chitimacha')]
[('Tuvaluan', '\\textbf{Au} ttino poto koo leva ne iloa nee au mea kolaa faatoaa iloa nee koe ttagata valea.'), ('Kwaio', "\\textbf{'a-gauru-a} ta'a i 'Ai'eda"), ('Welsh', '\\textbf{ni} fyfyrwyr'), ('Aromanian', '\\textbf{noi} pikurar-li adrem pini.'), ('Luganda', '\\textbf{Ffe} abantu abaavu ffe tubonaabona.'), ('Swahili', '\\textbf{Nyinyi} wa-nafunzi m-me-cheka.'), ('Menya', 'Nyi tä=\\ng{}ga=\\ng{}i Matiu i=qu=k=\\textbf{i} kuk\\ng{}uä + hn=i yat\\ng{}qä k-i-m=\\ng{}qä=i.'), ('Urim', '\\textbf{tu} melnum'), ('Alamblak', 'yima-\\textbf{k\\"{e}}'), ('Alamblak', 'yima-\\textbf{n\\"em}'), ('Yagaria', "Ovu-\\textbf{da} ma-lo' bei-d-u-e"), ('Katu', '\\textbf{yi} manuih'), ('Classical Nahuatl', '\\textbf{Ni}cu\\={\\i}ca \\textbf{ni}Petoloh.')]


Three languages are missing entries. 

- There is no glossed example for English, since it is the metalanguage of the datasource.
- For Hua and Chitimacha, the quoted sources do not seem to provide standard glossing.

I am going to add these examples semi-manually to the dataset next.

In [16]:
missing_langs

[(35, 'English'), (81, 'Hua'), (102, 'Chitimacha')]

In [17]:
for i in missing_langs:
    print(langlist[i[0]])

['\\subsubsection{English (stan1293), Germanic}', '', '\\ex \\textbf{you} linguists\\xe', '', 'See among others: \\citet{postal1969, delormedougherty1972, sommerstein1972, pesetsky1978, keizer2016}.', '']
['\\subsubsection{Hua (huaa1250), \\gls{tng}, Siane-Yagaria}', '', '\\pex ', '\\a', "\\emph{Forapi' + da} $\\rightarrow$ /\\emph{forapi \\textbf{da}}/ `I, Forapi'", '\\a', "\\emph{Forapi' + Ka} $\\rightarrow$ /\\emph{forapi\\textbf{ga}}/ `You, Forapi'", '\\a', "\\emph{nono' + 'Kama' + da} $\\rightarrow$ /\\emph{nonokama \\textbf{da}}/ `I your maternal uncle'\\\\{\\citep[226]{haiman1980}}", '\\xe', '', '\\pex Person marked genitive forms{\\citep[after][240]{haiman1980}}', "\\a vimata \\newline`of us men'", "\\a ademata \\newline `of us women'", "\\a vi'ita \\newline `of you men'", "\\a adita \\newline `of you women'", '\\xe', '', 'See \\citet[226--232, 239f.]{haiman1980}.', '']
['\\subsubsection{Chitimacha (chit1248), Chitimacha}', '', "\\ex \\textbf{\\textglotstop{}u\\v{s}} pan\\v{s}'


In [18]:
# manually append English example entry
exlist.append({'Language_Name': 'English', 
                'Language_ID': 'stan1293', 
                'Analyzed_Word': 'you linguist-s', 
                'Gloss': r'2\Sg.\Pl{} linguist-\Pl{}', 
                'Translated_Text': 'you linguists', 
                'Source': '', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': 'MORPHEME_ALIGNED',
                'Comment': ''})


# manually append Hua example entries
exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Analyzed_Word': 'forapi da', 
                'Gloss': r'Forapi 1\Sg{}', 
                'Translated_Text': 'I, Forapi', 
                'Source': 'haiman1980[226]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': 'MORPHEME_ALIGNED',
                'Comment': ''})

exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Analyzed_Word': 'forapi-ga', 
                'Gloss': r'Forapi-2\Sg{}', 
                'Translated_Text': 'You, Forapi', 
                'Source': 'haiman1980[226]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': 'MORPHEME_ALIGNED',
                'Comment': ''})

exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Primary_Text': 'nonokama da',
                'Analyzed_Word': '', 
                'Gloss': '', 
                'Translated_Text': 'I your maternal uncle', 
                'Source': 'haiman1980[226]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': '',
                'Comment': ''})


exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Primary_Text': 'vimata',
                'Analyzed_Word': '', 
                'Gloss': '', 
                'Translated_Text': 'of us men', 
                'Source': 'haiman1980[240]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': '',
                'Comment': 'person marked genitive forms'})

exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Analyzed_Word': 'ademata', 
                'Gloss': '', 
                'Translated_Text': 'of us women', 
                'Source': 'haiman1980[240]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': '',
                'Comment': 'person marked genitive forms'})

exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Analyzed_Word': "vi'ita", 
                'Gloss': '', 
                'Translated_Text': 'of you men', 
                'Source': 'haiman1980[240]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': '',
                'Comment': 'person marked genitive forms'})


exlist.append({'Language_Name': 'Hua', 
                'Language_ID': 'huaa1250', 
                'Analyzed_Word': "adita", 
                'Gloss': '', 
                'Translated_Text': 'of you women', 
                'Source': 'haiman1980[240]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': '',
                'Comment': 'person marked genitive forms'})


# manually append Chitimacha example entry

exlist.append({'Language_Name': 'Chitimacha', 
                'Language_ID': 'chit1248', 
                'Primary_Text': r"\textglotstop{}u\v{s} pan\v{s}' ha hananki' namkinada'",
                'Analyzed_Word': '', 
                'Gloss': '', 
                'Translated_Text': 'We people who live in this house.', 
                'Source': 'swadesh1967[333]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': '',
                'Comment': ''})

# manually adding a Korean entry not detected due to missing glossing
exlist.append({'Language_Name': 'Korean',
                'Language_ID': 'kore1280', 
                'Primary_Text': '',
                'Analyzed_Word': "wuli-(tul) hankwuk salam", 
                'Gloss': 'we-PL Korean person', 
                'Translated_Text': 'we Koreans', 
                'Source': 'sohn1994[292]', 
                'Meta_Language_ID': 'stan1293', 
                'LGR_Conformance': 'MORPHEME_ALIGNED',
                'Comment': 'Gloss added (GFKH)'})
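
Several of the manual entries above omit the `Primary_Text` key, which then shows up as `NaN` in the dataframe. A small helper with default values would keep the columns uniform. This is a sketch only: `EXAMPLE_DEFAULTS` and `make_entry` are hypothetical names, not part of the notebook.

```python
# Default value for every column used in exlist
EXAMPLE_DEFAULTS = {
    'Language_Name': '', 'Language_ID': '', 'Primary_Text': '', 'Analyzed_Word': '',
    'Gloss': '', 'Translated_Text': '', 'Source': '',
    'Meta_Language_ID': 'stan1293', 'LGR_Conformance': '', 'Comment': '',
}

def make_entry(**fields):
    """Return a complete example dictionary; reject misspelled column names."""
    unknown = set(fields) - set(EXAMPLE_DEFAULTS)
    if unknown:
        raise KeyError(f'unexpected column(s): {sorted(unknown)}')
    return {**EXAMPLE_DEFAULTS, **fields}

entry = make_entry(Language_Name='Hua', Language_ID='huaa1250', Analyzed_Word='adita',
                   Translated_Text='of you women', Source='haiman1980[240]',
                   Comment='person marked genitive forms')
print(entry['Primary_Text'] == '')   # the key exists and holds an empty string
```

An entry built this way could be passed to `exlist.append(...)` exactly like the literal dictionaries.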


In [19]:
df = pd.DataFrame(exlist)
len(df)

158

In [20]:
df


Unnamed: 0,Language_Name,Language_ID,Primary_Text,Analyzed_Word,Gloss,Translated_Text,Source,Meta_Language_ID,LGR_Conformance,Comment
0,Hausa,haus1257,,\textbf{m\={u}} Háus\textgravemacron{a}w\={a},we Hausa,`we Hausa',newman2000[371],stan1293,MORPHEME_ALIGNED,
1,Hausa,haus1257,,\textbf{sh\={\textsci}} wannàn m\={a}làm\={\textsci},he \Dem{}.1 teacher,`he (this) teacher',newman2000[371],stan1293,MORPHEME_ALIGNED,
2,Hausa,haus1257,,\textbf{m\={u}} m\textgravemacron{a}làman-nàn,we teacher-\Dem.\Prox{},`we these teachers',newman2000[155],stan1293,MORPHEME_ALIGNED,
3,Mupun,mwag1236,,\textbf{war} manaja n\textschwa,3\F{} manager \Def{},"`she, the manager'","frajzyngier1993[172, (154)]",stan1293,WORD_ALIGNED,
4,Gorwaa,goro1270,,\textbf{atén} oo hhawató,Pro1\Pl{} \Anaph.\M{} men.\Lnk.\M{},`we men',"harvey2018[163, (2.205)]",stan1293,WORD_ALIGNED,
...,...,...,...,...,...,...,...,...,...,...
153,Hua,huaa1250,,ademata,,of us women,haiman1980[240],stan1293,,person marked genitive forms
154,Hua,huaa1250,,vi'ita,,of you men,haiman1980[240],stan1293,,person marked genitive forms
155,Hua,huaa1250,,adita,,of you women,haiman1980[240],stan1293,,person marked genitive forms
156,Chitimacha,chit1248,\textglotstop{}u\v{s} pan\v{s}' ha hananki' namkinada',,,We people who live in this house.,swadesh1967[333],stan1293,,


In [21]:
df.iloc[155]

Language_Name                                Hua
Language_ID                             huaa1250
Primary_Text                                 NaN
Analyzed_Word                              adita
Gloss                                           
Translated_Text                     of you women
Source                           haiman1980[240]
Meta_Language_ID                        stan1293
LGR_Conformance                                 
Comment             person marked genitive forms
Name: 155, dtype: object

## Data cleaning

What needs to be done:

1. Remove LaTeX macros and superfluous symbols in `Analyzed_Word` to allow for clean UTF-8 representation.
2. Generate `Primary_Text` based on the cleaned `Analyzed_Word` column. Note that a small number of (manually added) entries have this field already filled.
3. Remove LaTeX macros and superfluous symbols in `Translated_Text` to allow for clean UTF-8 representation.
4. (optional?) Also remove LaTeX macros in the `Gloss` column and replace relevant instances by all-caps glosses. The LaTeX macros are generally quite readable, so they could also stay, but removal is probably better in the interest of an agnostic representation.
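
A possible shape for the second step, as a sketch under an explicit assumption: `-` and `=` between word characters are treated purely as boundary markers and dropped. This may be wrong for languages where the hyphen is part of the orthography. The functions `primary_from_analyzed` and `fill_primary_text` are hypothetical and operate on the example dictionaries rather than on the dataframe.

```python
import re

def primary_from_analyzed(analyzed):
    """Drop '-' and '=' between word characters (assumed to be boundary markers only)."""
    return re.sub(r'(?<=\w)[-=](?=\w)', '', analyzed)

def fill_primary_text(entry):
    """Fill Primary_Text from Analyzed_Word unless it has already been set manually."""
    if not entry.get('Primary_Text'):
        entry['Primary_Text'] = primary_from_analyzed(entry.get('Analyzed_Word', ''))
    return entry

print(primary_from_analyzed('Nyinyi wa-nafunzi m-me-cheka.'))
print(fill_primary_text({'Primary_Text': 'vimata', 'Analyzed_Word': ''})['Primary_Text'])
```

Manually filled values such as `vimata` are left untouched.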

### Cleaning up `Analyzed_Word`


1. Check which macros need to be accounted for.
2. There are also macros for the display of specially formatted characters. These should be replaced with appropriate UTF-8 symbols.
3. Take care of presentation-related macros (definitely textbf, check if there are others) that should just be removed and replaced by the content of their argument. Use TexSoup for this. 
4. There are also extraneous curly brackets at the beginnings of some lines for LaTeX-internal reasons. Those should be removed, but best at the end to avoid making recognition of macros harder. (I might also remove square brackets used for highlighting nominal person constructions -- or better keep this in considering these are dedicated examples for adnominal person? Keep in for now.)
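
Steps 2 and 3 can be sketched as follows. The replacement table here is deliberately partial and the function names are hypothetical. The real table would have to cover all macros found in the data. Accent macros are converted via Unicode combining characters and NFC normalisation, and `\textbf` is unwrapped last, once no nested braces remain inside its argument.

```python
import re
import unicodedata

# Partial, illustrative table of symbol macros (tipa and standard LaTeX) and their UTF-8 targets
SYMBOLS = {
    r'\textschwa': 'ə',
    r'\textglotstop': 'ʔ',
    r'\textopeno': 'ɔ',
    r'\textsci': 'ɪ',
    r'\ng': 'ŋ',
    r'\ae': 'æ',
}
# Accent macros mapped to combining characters: macron, caron, diaeresis
ACCENTS = {'=': '\u0304', 'v': '\u030c', '"': '\u0308'}

def replace_symbols(s):
    # Longest macro names first; a macro may be followed by {} but not by another letter
    for macro in sorted(SYMBOLS, key=len, reverse=True):
        s = re.sub(re.escape(macro) + r'(?:\{\}|(?![A-Za-z]))', SYMBOLS[macro], s)
    return s

def replace_accents(s):
    def combine(m):
        # base character + combining mark, composed into one code point where possible
        return unicodedata.normalize('NFC', m.group(2) + ACCENTS[m.group(1)])
    return re.sub(r'\\([=v"])\{(\w)\}', combine, s)

def strip_textbf(s):
    # Replace \textbf{...} by its argument
    return re.sub(r'\\textbf\{([^{}]*)\}', r'\1', s)

def clean_analyzed(s):
    return strip_textbf(replace_accents(replace_symbols(s)))

print(clean_analyzed(r'\textbf{war} manaja n\textschwa'))
print(clean_analyzed(r'\textglotstop{}u\v{s}'))
```

Running the symbol replacement before the accent replacement matters for cases such as `\={\textsci}`, where the accent takes a macro as its argument.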

#### Checking the macros

The function below returns a list of the macros found in a given column of a dataframe.

For reasons I have not been able to understand in spite of considerable research, certain patterns fail to be captured although they should be within the scope of `macro_pattern`. As a workaround, I supplement it with a list of additional, more specific patterns. (The only case that is rightfully not matched is `r'k\\textbottomtiebar{[^}]+}'`, which is deliberately added to account for a digraph in the replacement table later on.)
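
One observation that may account for part of this (from reading the regex, not a confirmed diagnosis): the character class in the first alternative of `macro_pattern` contains no curly brackets, so for a macro with an argument the match stops right before the opening bracket. The sketch below reproduces this on a single string. `accent_with_arg` is a hypothetical name for a pattern equivalent to the first entry of `special_patterns`.

```python
import re

# macro_pattern as defined in the notebook
macro_pattern = re.compile(r'(?:\\[a-zA-Z~\'\"^=]+)|(?:\$[^_]\{?[A-Za-z0-9]+\}?\$)')
# Pattern that additionally consumes the argument in curly brackets
accent_with_arg = re.compile(r'\\=\{[^}]+\}')

sample = r'\textbf{m\={u}}'
print(macro_pattern.findall(sample))     # macro names only, without their arguments
print(accent_with_arg.findall(sample))   # accent macro together with its argument
```

This would be consistent with both `\=` and `\={u}` appearing in the result list when the patterns are combined.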

In [22]:
# original regex, for unclear reason not capturing some cases
macro_pattern = re.compile(r'(?:\\[a-zA-Z~\'\"^=]+)|(?:\$[^_]\{?[A-Za-z0-9]+\}?\$)') 

special_patterns = [       
    r'\\={[^}]+}',              # \={x}
    r'\\"[a-zA-Z]+',            # \"em
    r'\\"{\w+}',                # \"{i}
    r'\\v{[^}]+}',              # \v{C}
    r'k\\textbottomtiebar{[^}]+}',  # k\textbottomtiebar{h} - the only pattern that would indeed need to be manually added
    r'\\text\w+',               # \textupsilon
    r'\$\^{[0-9]+}\$'           # $^{51}$
]


def listmacros(patternlist,dafra,col,excludelist=None):
    '''
    Returns a sorted list of unique matches for a list of patterns in a particular column of a given dataframe.

    Arguments:
    - patternlist: list of search patterns as regular expressions (strings or re.compile objects)
    - dafra: dataframe to be searched
    - col: identifier of column to be searched
    - excludelist (optional): list of strings; hits containing any of them are dropped (default: no exclusions)
    '''
    excludelist = excludelist or []     # avoid a mutable default argument
    unique_macros = set()

    # Collect the matches of every pattern; non-string cells (e.g. NaN) yield no matches
    for pattern in patternlist:
        intermed_macros = dafra[col].apply(lambda x: re.findall(pattern, x) if isinstance(x, str) else [])
        # Flatten the per-row lists and keep only hits that contain no string from the excludelist
        unique_macros.update([hit for sublist in intermed_macros for hit in sublist if not any(excl in hit for excl in excludelist)])

    # Return the unique hits in sorted order
    return sorted(unique_macros)

# Print all unique macros
total_macros = listmacros([macro_pattern]+special_patterns,df,'Analyzed_Word') # supply regular main pattern and special patterns as one list 
print(len(total_macros))
total_macros

67


['$\\emptyset$',
 '$\\epsilon$',
 '$^h$',
 '$^o$',
 '$^{11}$',
 '$^{24}$',
 '$^{51}$',
 '$^{55}$',
 '\\"',
 '\\"em',
 '\\"{e}',
 '\\"{i}',
 "\\'",
 '\\=',
 '\\={\\i}',
 '\\={\\textsci}',
 '\\={a}',
 '\\={e}',
 '\\={g}',
 '\\={ii}',
 '\\={i}',
 '\\={u}',
 '\\={ã}',
 '\\^',
 '\\ae',
 '\\b',
 '\\c',
 '\\cb',
 '\\d',
 '\\i',
 '\\ldots',
 '\\ng',
 '\\super',
 '\\textbari',
 '\\textbf',
 '\\textbottomtiebar',
 '\\textcrh',
 '\\textctz',
 '\\textdoublebarpipe',
 '\\textdoublepipe',
 '\\textdownstep',
 '\\textepsilon',
 '\\textglotstop',
 '\\textgravemacron',
 '\\textltailn',
 '\\textopeno',
 '\\textperiodcentered',
 '\\textrevglotstop',
 '\\textrtaill',
 '\\textrtailn',
 '\\textrtailt',
 '\\textschwa',
 '\\textschwa~t',
 '\\textsci',
 '\\texttildelow',
 '\\textturnv',
 '\\textupsilon',
 '\\u',
 '\\underline',
 '\\unt',
 '\\v',
 '\\v{C}',
 '\\v{c}',
 '\\v{i}',
 '\\v{s}',
 '\\~',
 'k\\textbottomtiebar{h}']

Most macros are concerned with generating special characters and will be taken care of with replacement rules.

- `\textbf` is the main layout-related macro that just needs to be replaced by its argument. I will take care of this **after** inserting UTF-8 symbols where necessary, so that I can simply drop all remaining TeX macros at that point (doing this as a first step would have to be restricted to specific TeX macros, so it is easier to do it last).
- Check instances of `\super`, `\unt` and `\underline`, as they may be related to character representation, but might also be there for layout purposes (in which case they can be removed in the final detexify step).


The code below displays all entries with `\super`, `\unt` and `\underline`.
We find that:
1. `\super` and `\underline` are indeed used for character formatting and should be replaced alongside the other combinations below with appropriate UTF-8 symbols.
2. `\unt` is used to indicate syntactic categories or semantic functions as subscripts on square brackets. While the brackets might be able to stay (see above), it's probably best to remove the subscripts in order to reduce clutter (and avoid confusion). This shall be done later in the detexify step.
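
These two findings could translate into replacement rules along the following lines. The sketch is illustrative only: the function names are hypothetical, and the choice of code points (modifier letter small h for `\super{h}`, combining low line for `\underline`) is an assumption that would need checking against the orthographic conventions of the respective sources.

```python
import re

SUPERSCRIPTS = {'h': '\u02b0'}      # modifier letter small h (assumed target for \super{h})
COMBINING_LOW_LINE = '\u0332'       # assumed target for \underline on a single letter

def replace_super(s):
    # Leave the macro untouched if no superscript character is known for its argument
    return re.sub(r'\\super\{(\w)\}', lambda m: SUPERSCRIPTS.get(m.group(1), m.group(0)), s)

def replace_underline(s):
    return re.sub(r'\\underline\{(\w)\}', lambda m: m.group(1) + COMBINING_LOW_LINE, s)

def drop_unt(s):
    # Remove the category subscripts but keep the square brackets
    return re.sub(r'\\unt\{[^{}]*\}', '', s)

print(replace_super(r'mp\super{h}ri'))
print(drop_unt(r'{}[aning du \textbf{girra}]\unt{A} [parra]\unt{P} laata'))
```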

In [23]:
# Auxiliary function for finding the rows of a dataframe whose column contains a particular string
def list_col_string(dafra,colname,targetstring):
    """
    Returns a list of (index, content) tuples for all rows of a dataframe (dafra) whose column (colname) contains a particular string (targetstring).
    Note that targetstring is interpreted as a regular expression (the default behaviour of str.contains).
    """
    mask = dafra[colname].str.contains(targetstring, na=False)   # boolean mask, computed once
    return list(zip(dafra.index[mask], dafra.loc[mask, colname].values))


In [24]:
for i in [r'\\super', r'\\unt', r'\\underline']:
    l=list_col_string(df,'Analyzed_Word',i)
    print(f"Instances of {i}: {len(l)}")
    print(*l,sep='\n')
    print('\n')


Instances of \\super: 1
(119, 'rd\\textctz{}\\ae{} \\textbf{lm\\ae{}=\\textltailn{}\\textschwa}=t\\super{h}\\textschwa~t\\super{h}\\textschwa{} mp\\super{h}ri v-s\\super{h}\\ae{}=b\\textopeno{}, rd\\textctz{}\\ae{}. b\\ae{} \\ng\\ae{}=\\textltailn{}\\textschwa{}=t\\super{h}\\textschwa{} mp\\super{h}ri mi-s\\super{h}o\\ng{}.')


Instances of \\unt: 4
(97, '{}[[No \\textbf{mapa=gha}]\\unt{NP} [\\textbf{ave}]\\unt{NP}]\\unt{NP}=na kula ata no-va nito=la.')
(113, 'Sa [Bain \\textbf{\\textglotstop{}ari}]\\unt{NP} b$\\epsilon$h.')
(116, '{}[aning du \\textbf{girra}]\\unt{A} [parra]\\unt{P} laata')
(118, '{}[[Tabang alaku Duinni Maggangkala]\\unt{NP} [\\textbf{ging}]\\unt{NP}]\\unt{NP} a-raung yattu ga-ung misingup.')


Instances of \\underline: 3
(15, 'Minyma \\textbf{palu\\underline{r}u} ngayu-nya nya-ngu')
(16, '\\textbf{Palu\\underline{r}u} wati nyara wa\\underline{r}a-ngku mutaka palya-nu')
(38, 'Maki lavati sa pa Solomone, gi ta-\\underline{n}ani \\textbf{gita} nikana hupa.')




Get list of unique LaTeX macros including one argument in order to find the appropriate UTF-8 replacement. (There's no need to deal with macros with more arguments in this dataset.) 

For simplicity, `pattern_nonvacargs` does not capture vacuous curly brackets, to avoid listing double matches (`\ldots` is sometimes used with `{}`, sometimes without); variants with empty curly brackets will be added automatically later.
Because of the unresolved issue noted above, where certain matches are missed, the `special_patterns` list needs to be combined with `pattern_nonvacargs`. `\textbf` and `\unt` are excluded from matching here because these purely layout-related macros will simply be stripped later using TexSoup.
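As a quick sanity check of what this kind of pattern picks up, here is a small self-contained sketch. The regular expression is a copy of `pattern_nonvacargs` from the next cell; the string `sample` is invented for illustration and is not taken from the dataset.

```python
import re

# Copy of pattern_nonvacargs: a macro with an optional NON-empty argument,
# or a simple inline maths expression such as $^{51}$
pattern_demo = re.compile(r'(?:\\[A-Za-z\~\'\"\^\=]?[A-Za-z]*(?:\{[^}]+\})?)|(?:\$[_\^\\]\{?[A-Za-z0-9]+\}?\$)')

# Hypothetical test string (not from the dataset)
sample = r'm\={u} th\textschwa{}ng tau$^{51}$ \ldots'

found = pattern_demo.findall(sample)
print(found)
```

Note that `\textschwa{}` is reported without its empty braces, which is the behaviour described above.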

In [25]:
# macro_pattern_args = re.compile(r'(\\[a-zA-Z]+)(?:\{([^}]*)\})?')
#pattern_args = re.compile(r'\\[a-zA-Z]+(?:\{[^}]*\})?')
pattern_nonvacargs = re.compile(r'(?:\\[A-Za-z\~\'\"\^\=]?[A-Za-z]*(?:\{[^}]+\})?)|(?:\$[_\^\\]\{?[A-Za-z0-9]+\}?\$)')

unique_macros = listmacros([pattern_nonvacargs]+special_patterns,df,'Analyzed_Word',['textbf','unt'])
print(len(unique_macros))
unique_macros

69


['$\\emptyset$',
 '$\\epsilon$',
 '$^h$',
 '$^o$',
 '$^{11}$',
 '$^{24}$',
 '$^{51}$',
 '$^{55}$',
 '\\"em',
 '\\"{e}',
 '\\"{i}',
 "\\'{\\textbari}",
 "\\'{\\textepsilon}",
 "\\'{\\textschwa}",
 '\\={\\i}',
 '\\={\\textsci}',
 '\\={a}',
 '\\={e}',
 '\\={g}',
 '\\={ii}',
 '\\={i}',
 '\\={u}',
 '\\={ã}',
 '\\^{\\textschwa}',
 '\\ae',
 '\\b{ô}',
 '\\cb{t}',
 '\\c{s}',
 '\\d{l}',
 '\\d{n}',
 '\\d{t}',
 '\\i',
 '\\ldots',
 '\\ng',
 '\\super{h}',
 '\\textbari',
 '\\textbottomtiebar',
 '\\textbottomtiebar{h}',
 '\\textcrh',
 '\\textctz',
 '\\textdoublebarpipe',
 '\\textdoublepipe',
 '\\textdownstep',
 '\\textepsilon',
 '\\textglotstop',
 '\\textgravemacron',
 '\\textgravemacron{a}',
 '\\textltailn',
 '\\textopeno',
 '\\textperiodcentered',
 '\\textrevglotstop',
 '\\textrtaill',
 '\\textrtailn',
 '\\textrtailt',
 '\\textschwa',
 '\\textsci',
 '\\texttildelow',
 '\\textturnv',
 '\\textupsilon',
 '\\underline{n}',
 '\\underline{r}',
 '\\u{a}',
 '\\v{C}',
 '\\v{c}',
 '\\v{i}',
 '\\v{s}',
 '\\~{i

For convenience of editing, produce a csv to facilitate mapping macros to UTF-8 symbols.

In [26]:
import csv

with open('macrotounicode_empty.csv', 'w', encoding='utf-8', newline='') as f:  # newline='' as recommended for csv.writer
    writer = csv.writer(f)
    for val in unique_macros:
        writer.writerow([val, ""])

Fill in the appropriate UTF-8 symbols in the csv manually, then load the completed csv into a dictionary. (To avoid accidentally overwriting the completed file when rerunning this notebook, save it under a different filename; here `macrotounicode.csv`.) If necessary, fine-tune the order so that shorter matches do not precede more complex ones that contain them.

- extended \textbottomtiebar{h} to include the preceding k in order to use appropriate UTF digraph
- the following elements can be simplified in the list because they are actually included in other macros, so it's preferable to use the more complex form directly for replacement: ['\\textbottomtiebar',
 '\\textbottomtiebar{h}',
 '\\textgravemacron',
 '\\textsci']
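Why the order matters can be seen in a minimal sketch. The mapping `demo`, the helper `replace_in_order` and the string `sample` are hypothetical names made up for this illustration; sorting by key length is only one possible way to get a safe order and is not what the notebook does (the order there is fixed by hand in the csv).

```python
# Hypothetical mini-mapping in an unfavourable order: simple macros come first
demo = {r'\textbari': 'ɨ', r"\'{\textbari}": 'ɨ́', r'\i': 'ı', r'\={\i}': 'ī'}

# One way to obtain a safe order: longest keys first
ordered = dict(sorted(demo.items(), key=lambda kv: len(kv[0]), reverse=True))

def replace_in_order(text, mapping):
    """Apply plain substring replacement in the iteration order of the mapping."""
    for macro, char in mapping.items():
        text = text.replace(macro, char)
    return text

sample = r"k\'{\textbari}t m\={\i}n"   # invented string
print(replace_in_order(sample, demo))     # inner macros replaced first, accents left behind
print(replace_in_order(sample, ordered))  # complex macros replaced as a whole
```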

In [27]:
dict_utf={}

with open('macrotounicode.csv', 'r', encoding='utf-8') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        if len(row) == 2:  # Ensure correct structure
            key, value = row
            dict_utf[key] = value

# Debugging: Verify how keys are stored
for k, v in dict_utf.items():
    print(f"Key: {repr(k)}, Value: {repr(v)}")

Key: '\\^{\\textschwa}', Value: 'ə̂'
Key: "\\'{\\textbari}", Value: 'ɨ́'
Key: "\\'{\\textepsilon}", Value: 'έ'
Key: "\\'{\\textschwa}", Value: 'ə́'
Key: '\\={\\i}', Value: 'ī'
Key: '\\={a}', Value: 'ā'
Key: '\\={e}', Value: 'ē'
Key: '\\={g}', Value: 'ḡ'
Key: '\\={ii}', Value: 'ĩĩ'
Key: '\\={i}', Value: 'i̅'
Key: '\\={u}', Value: 'ū'
Key: '\\={ã}', Value: 'ã̅'
Key: '\\={\\textsci}', Value: 'ī'
Key: '\\ae', Value: 'æ'
Key: '\\b{ô}', Value: 'ô̱'
Key: '\\cb{t}', Value: 'ț'
Key: '\\c{s}', Value: 'ş'
Key: '\\d{l}', Value: 'ḷ'
Key: '\\d{n}', Value: 'ṇ'
Key: '\\d{t}', Value: 'ṭ'
Key: '$\\emptyset$', Value: '∅'
Key: '$\\epsilon$', Value: 'ε'
Key: '\\i', Value: 'ı'
Key: '\\ldots', Value: '…'
Key: '\\ng', Value: 'ŋ'
Key: '\\super{h}', Value: 'ʰ'
Key: '\\textbari', Value: 'ɨ'
Key: 'k\\textbottomtiebar{h}', Value: 'k͜h'
Key: '\\textcrh', Value: 'ħ'
Key: '\\textctz', Value: 'ʑ'
Key: '\\textdoublebarpipe', Value: 'ǂ'
Key: '\\textdoublepipe', Value: 'ǁ'
Key: '\\textdownstep', Value: '↓'
Key: '\\te

Extend `dict_utf` to include variants with and without trailing empty curly brackets. So that the version with braces is selected for replacement where applicable, make sure it precedes the version without braces in the dictionary.

In [28]:
ext_dict_utf = {}
for key, value in dict_utf.items():
    ext_dict_utf[key + "{}"] = value  # Add version with braces; in order to select the version with braces for replacement if applicable, make sure this version precedes the one without brackets
    ext_dict_utf[key] = value
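The effect of putting the braced variant first can be illustrated with a small sketch. The names `demo_map`, `braces_first` and `replace_all` as well as the string `sample` are made up for this illustration.

```python
# Hypothetical two-entry mapping without braced variants
demo_map = {r'\ae': 'æ', r'\textschwa': 'ə'}

# Same construction as above: braced variant first, then the bare macro
braces_first = {}
for macro, char in demo_map.items():
    braces_first[macro + '{}'] = char
    braces_first[macro] = char

def replace_all(text, mapping):
    """Plain substring replacement in dictionary order."""
    for macro, char in mapping.items():
        text = text.replace(macro, char)
    return text

sample = r'rd\ae{} th\textschwa{}ng'      # invented string
print(replace_all(sample, demo_map))      # rdæ{} thə{}ng  (empty braces left behind)
print(replace_all(sample, braces_first))  # rdæ thəng
```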


In [29]:
ext_dict_utf

{'\\^{\\textschwa}{}': 'ə̂',
 '\\^{\\textschwa}': 'ə̂',
 "\\'{\\textbari}{}": 'ɨ́',
 "\\'{\\textbari}": 'ɨ́',
 "\\'{\\textepsilon}{}": 'έ',
 "\\'{\\textepsilon}": 'έ',
 "\\'{\\textschwa}{}": 'ə́',
 "\\'{\\textschwa}": 'ə́',
 '\\={\\i}{}': 'ī',
 '\\={\\i}': 'ī',
 '\\={a}{}': 'ā',
 '\\={a}': 'ā',
 '\\={e}{}': 'ē',
 '\\={e}': 'ē',
 '\\={g}{}': 'ḡ',
 '\\={g}': 'ḡ',
 '\\={ii}{}': 'ĩĩ',
 '\\={ii}': 'ĩĩ',
 '\\={i}{}': 'i̅',
 '\\={i}': 'i̅',
 '\\={u}{}': 'ū',
 '\\={u}': 'ū',
 '\\={ã}{}': 'ã̅',
 '\\={ã}': 'ã̅',
 '\\={\\textsci}{}': 'ī',
 '\\={\\textsci}': 'ī',
 '\\ae{}': 'æ',
 '\\ae': 'æ',
 '\\b{ô}{}': 'ô̱',
 '\\b{ô}': 'ô̱',
 '\\cb{t}{}': 'ț',
 '\\cb{t}': 'ț',
 '\\c{s}{}': 'ş',
 '\\c{s}': 'ş',
 '\\d{l}{}': 'ḷ',
 '\\d{l}': 'ḷ',
 '\\d{n}{}': 'ṇ',
 '\\d{n}': 'ṇ',
 '\\d{t}{}': 'ṭ',
 '\\d{t}': 'ṭ',
 '$\\emptyset${}': '∅',
 '$\\emptyset$': '∅',
 '$\\epsilon${}': 'ε',
 '$\\epsilon$': 'ε',
 '\\i{}': 'ı',
 '\\i': 'ı',
 '\\ldots{}': '…',
 '\\ldots': '…',
 '\\ng{}': 'ŋ',
 '\\ng': 'ŋ',
 '\\super{h}

#### Replacing macros by UTF symbols

The function `tex_to_utf` converts LaTeX macros representing special symbols or symbols with diacritics to the appropriate UTF-8 characters, based on the correspondences supplied in the csv file.

In [30]:
def tex_to_utf(text):
    """Convert LaTeX macros to Unicode characters using the replacement pairs in ext_dict_utf."""
    if not isinstance(text, str):
        return text  # Ignore non-string entries
    
    # Replace LaTeX macros based on dictionary entries
    for pattern, replacement in ext_dict_utf.items():
        text = text.replace(pattern, replacement)

    return text


text=r"this \aerial th\textschwa{}ng"

newtext=tex_to_utf(text)

print(newtext)

# Apply function and filter out non-matching rows
#df['converted'] = df['latex_strings'].apply(clean_latex)
#df_filtered = df.dropna(subset=['converted'])  # Keep only rows with macros

#print(df_filtered)

this ærial thəng
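The test string also shows a side effect of plain substring replacement: `\aerial` is not in the mapping, but it begins with `\ae` and is therefore partly converted. This should be harmless as long as the data contains no such longer macro names. If it ever matters, one possible alternative is sketched below. It is an assumption-laden sketch, not the method used in this notebook: `build_macro_regex`, `convert` and the mapping `demo` are hypothetical, and the lookahead assumes that a macro name ending in a letter is never directly followed by another letter in the data.

```python
import re

def build_macro_regex(mapping):
    """Build one alternation, longest keys first; keys ending in a letter must not be followed by a letter."""
    parts = []
    for macro in sorted(mapping, key=len, reverse=True):
        escaped = re.escape(macro)
        if macro[-1].isalpha():
            escaped += r'(?![A-Za-z])'   # do not match inside a longer macro name
        parts.append(escaped)
    return re.compile('|'.join(parts))

# Hypothetical mini-mapping with and without empty braces
demo = {r'\ae{}': 'æ', r'\ae': 'æ', r'\textschwa{}': 'ə', r'\textschwa': 'ə'}
macro_regex = build_macro_regex(demo)

def convert(text):
    """Replace every match by its mapped character in a single pass."""
    return macro_regex.sub(lambda m: demo[m.group(0)], text)

print(convert(r'this \aerial th\textschwa{}ng r\ae d'))
```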


First illustrate the results of `tex_to_utf`, then create a copy of the dataframe and apply the function to its `Analyzed_Word` column.

In [31]:
pd.DataFrame({
    "Original": df["Analyzed_Word"].head(),
    "Processed": df["Analyzed_Word"].apply(tex_to_utf).head()
})

Unnamed: 0,Original,Processed
0,\textbf{m\={u}} Háus\textgravemacron{a}w\={a},\textbf{mū} Háusā̀wā
1,\textbf{sh\={\textsci}} wannàn m\={a}làm\={\textsci},\textbf{shī} wannàn mālàmī
2,\textbf{m\={u}} m\textgravemacron{a}làman-nàn,\textbf{mū} mā̀làman-nàn
3,\textbf{war} manaja n\textschwa,\textbf{war} manaja nə
4,\textbf{atén} oo hhawató,\textbf{atén} oo hhawató


In [32]:
df_new=df.copy()
df_new['Analyzed_Word']=df_new['Analyzed_Word'].apply(tex_to_utf)

In [33]:
df_new

Unnamed: 0,Language_Name,Language_ID,Primary_Text,Analyzed_Word,Gloss,Translated_Text,Source,Meta_Language_ID,LGR_Conformance,Comment
0,Hausa,haus1257,,\textbf{mū} Háusā̀wā,we Hausa,`we Hausa',newman2000[371],stan1293,MORPHEME_ALIGNED,
1,Hausa,haus1257,,\textbf{shī} wannàn mālàmī,he \Dem{}.1 teacher,`he (this) teacher',newman2000[371],stan1293,MORPHEME_ALIGNED,
2,Hausa,haus1257,,\textbf{mū} mā̀làman-nàn,we teacher-\Dem.\Prox{},`we these teachers',newman2000[155],stan1293,MORPHEME_ALIGNED,
3,Mupun,mwag1236,,\textbf{war} manaja nə,3\F{} manager \Def{},"`she, the manager'","frajzyngier1993[172, (154)]",stan1293,WORD_ALIGNED,
4,Gorwaa,goro1270,,\textbf{atén} oo hhawató,Pro1\Pl{} \Anaph.\M{} men.\Lnk.\M{},`we men',"harvey2018[163, (2.205)]",stan1293,WORD_ALIGNED,
...,...,...,...,...,...,...,...,...,...,...
153,Hua,huaa1250,,ademata,,of us women,haiman1980[240],stan1293,,person marked genitive forms
154,Hua,huaa1250,,vi'ita,,of you men,haiman1980[240],stan1293,,person marked genitive forms
155,Hua,huaa1250,,adita,,of you women,haiman1980[240],stan1293,,person marked genitive forms
156,Chitimacha,chit1248,\textglotstop{}u\v{s} pan\v{s}' ha hananki' namkinada',,,We people who live in this house.,swadesh1967[333],stan1293,,


#### Removing layout-related macros using TexSoup

Now we can get rid of all remaining LaTeX macros using the TexSoup package.
The function `process_latex_macros` recursively strips the LaTeX macros listed in `macros_to_process` from a string. By default it keeps their content intact, but the third argument (`keep_content`) acts as a switch: if it is set to `False`, any macros in the `macros_to_process` list are deleted together with their arguments (no recursive processing is necessary in that case).


In [34]:

def process_latex_macros(tex_string, macros_to_process=['textbf'],keep_content=True,capitalise_content=False,nospace=False):
    """
    Removes LaTeX macros provided in a list recursively, either retaining their content or completely removing them

    Args:
    - tex_string (str): input LaTeX string.
    - macros_to_process: list of macro names (without backslash) to remove (default: ['textbf'])
    - keep_content: bool marking whether the content of arguments is retained (default: True)
    - capitalise_content: the content of arguments is returned in all caps if retained (default: False)
    - nospace: avoid introducing extra spaces between content parts (relevant for glosses, default: False)

    Returns:
    - string: Cleaned LaTeX string without the specified macros.
    """
    if not isinstance(tex_string, str):  # Ensure input is a string
        return tex_string  

    soup = TexSoup(tex_string)

    if keep_content:
        for macro in macros_to_process:
            for tag in list(soup.find_all(macro) or []):
                if tag.contents:  # If the macro has content
                    content_parts = [process_latex_macros(str(c), macros_to_process) for c in tag.contents]
                    # Join the content parts with or without separating spaces
                    if nospace:
                        content_str = ''.join(content_parts)
                    else:
                        content_str = ' '.join(content_parts)

                    # Replace the macro with the processed content
                    if capitalise_content:
                        tag.replace_with(content_str.upper())
                    else:
                        tag.replace_with(content_str)
                else:  
                    tag.delete()  # If the macro is empty, remove it
    else:
        for macro in macros_to_process:
            for tag in list(soup.find_all(macro) or []):
                tag.delete()  

    cleaned_text = ' '.join(str(soup).split())  # normalise spacing between words 
    
    # Joining content parts with spaces can leave superfluous spaces inside square brackets
    cleaned_text = re.sub(r'\[\s*', '[', cleaned_text)  # Remove spaces after `[`
    cleaned_text = re.sub(r'\s*\]', ']', cleaned_text)  # Remove spaces before `]`

    return cleaned_text

# check if it's working
process_latex_macros(r'\Indf{}=\M=\textbf{2\Sg} and \textbf{this is some text with \emph{other stuff} stuffed inside}', 
                    ['textbf','emph'],True,False,True)

'\\Indf{}=\\M=2\\Sg and this is some text withother stuffstuffed inside'

This works fine, but retains curly brackets not associated with a LaTeX macro. That's actually good because I want to keep them when they are used for grouping words to ensure alignment with glossing (in an admittedly limited number of cases). Superfluous curly brackets around single words will be deleted separately slightly further below.

For now apply the `process_latex_macros` function to the `Analyzed_Word` column.

In [35]:
df_new['Analyzed_Word']=df_new['Analyzed_Word'].apply(process_latex_macros)

In [36]:
df_new

Unnamed: 0,Language_Name,Language_ID,Primary_Text,Analyzed_Word,Gloss,Translated_Text,Source,Meta_Language_ID,LGR_Conformance,Comment
0,Hausa,haus1257,,mū Háusā̀wā,we Hausa,`we Hausa',newman2000[371],stan1293,MORPHEME_ALIGNED,
1,Hausa,haus1257,,shī wannàn mālàmī,he \Dem{}.1 teacher,`he (this) teacher',newman2000[371],stan1293,MORPHEME_ALIGNED,
2,Hausa,haus1257,,mū mā̀làman-nàn,we teacher-\Dem.\Prox{},`we these teachers',newman2000[155],stan1293,MORPHEME_ALIGNED,
3,Mupun,mwag1236,,war manaja nə,3\F{} manager \Def{},"`she, the manager'","frajzyngier1993[172, (154)]",stan1293,WORD_ALIGNED,
4,Gorwaa,goro1270,,atén oo hhawató,Pro1\Pl{} \Anaph.\M{} men.\Lnk.\M{},`we men',"harvey2018[163, (2.205)]",stan1293,WORD_ALIGNED,
...,...,...,...,...,...,...,...,...,...,...
153,Hua,huaa1250,,ademata,,of us women,haiman1980[240],stan1293,,person marked genitive forms
154,Hua,huaa1250,,vi'ita,,of you men,haiman1980[240],stan1293,,person marked genitive forms
155,Hua,huaa1250,,adita,,of you women,haiman1980[240],stan1293,,person marked genitive forms
156,Chitimacha,chit1248,\textglotstop{}u\v{s} pan\v{s}' ha hananki' namkinada',,,We people who live in this house.,swadesh1967[333],stan1293,,


Delete the `\unt` macros including their argument (I decided above not to keep the labels) by calling the `process_latex_macros` function with the third argument (`keep_content`) set to `False`, which triggers full removal of the listed macros together with their arguments.


In [37]:

df_new['Analyzed_Word']=df_new['Analyzed_Word'].apply(lambda item:process_latex_macros(item,['unt'],False))


Check the result on a row known to contain labelled brackets. As intended, the subscript macro including the labels is removed, while the square brackets indicating structure are retained.

In [38]:
df_new['Analyzed_Word'].loc[118]

'{}[[Tabang alaku Duinni Maggangkala] [ging]] a-raung yattu ga-ung misingup.'
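As an independent cross-check, the same result can be reproduced for this row with two plain regular expressions. This is only a sketch under the assumption that the macro arguments contain no nested braces; it is not a replacement for the TexSoup-based function. The string `raw` is the original value of row 118 as listed in the `\unt` overview further above.

```python
import re

# Original value of row 118 (see the list of \unt instances above)
raw = r'{}[[Tabang alaku Duinni Maggangkala]\unt{NP} [\textbf{ging}]\unt{NP}]\unt{NP} a-raung yattu ga-ung misingup.'

unwrapped = re.sub(r'\\textbf\{([^{}]*)\}', r'\1', raw)   # keep the argument of \textbf
stripped = re.sub(r'\\unt\{[^{}]*\}', '', unwrapped)      # drop \unt together with its argument
print(stripped)
```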

#### Final clean-up

Find remaining curly brackets.

In [39]:
list_col_string(df_new,'Analyzed_Word',r'\{')


[(8, '{}[Intom il-ħaddiema] għandkom tingħaqdu'),
 (71, '{Hè é} tó Khwé-tò dì góέ à tó ò + ǁé Qúva-ǁè ǂxà-á-tè à.'),
 (75, 'ntə̂m. yε kó luzíŋ ↓é bï andzéé bǐ k yéè ntswé ninyá bɔ. atá{[…]}'),
 (87, 'aka {malav} e roa-ru kiu-la-m.'),
 (97, '{}[[No mapa=gha] [ave]]=na kula ata no-va nito=la.'),
 (108, '{}[{Dana} {ben} {eu} age] ho-ig-a.'),
 (116, '{}[aning du girra] [parra] laata'),
 (118,
  '{}[[Tabang alaku Duinni Maggangkala] [ging]] a-raung yattu ga-ung misingup.'),
 (146, 'tau⁵¹ hou²⁴ {ʔat⁵⁵ jen¹¹ kjã:u²⁴}'),
 (147, 'biz Türk-ler vatan-{ı}m{ı}z-{ı} sev-er-iz')]

Not many left, good. However, I want to keep the brackets for the texts at indices 71 and 146 because these indicate groups of elements treated as one "word" in the provided gloss and are therefore relevant to the appropriate interpretation of the `Gloss` column.

This means I cannot simply use a vectorised deletion of all instances of curly brackets in the column. One option is to exempt rows 71 and 146 from curly bracket deletion. The code below sketches this, applying a lambda function that deletes opening and closing curly brackets by substitution with an empty string unless the processed row is row 71 or 146.

In [40]:
#df_new['Analyzed_Word'] = df_new.apply(
#    lambda row: row['Analyzed_Word'].replace('{', '').replace('}', '') if row.name not in [71, 146] else row['Analyzed_Word'],
#    axis=1
#)

While this method is sufficient for this specific dataset, it is suboptimal in that it relies on explicitly listing the rows.

A more general (and probably preferable) alternative leverages the specific pattern of the brackets I want to keep: they are intended to group "words", i.e. substrings separated by spaces. Hence, I can use a regular expression replacement that only deletes brackets whose content contains no spaces. If applied too early, this would also capture LaTeX macro arguments, but since I have already taken care of those, this method should work fine, and it is a bit more general in case the data changes in the future (although one should still make sure that no LaTeX macros slip through to this point).

In [41]:
pattern_curly_space = re.compile(r'\{\s*([^}]*[^ }\t]?[^}]*)?\s*\}')    # matches any pair of curly brackets and captures their content; the callback below decides whether the brackets are kept

df_new['Analyzed_Word'] = df_new['Analyzed_Word'].apply(
    lambda text: pattern_curly_space.sub(
        lambda m: m.group(0) if m.group(1) and ' ' in m.group(1) else (m.group(1) if m.group(1) else ''), text
    )
)


Verify that the operation applied successfully and didn't touch the examples I want to keep.

In [42]:
print(df_new.loc[108, 'Analyzed_Word'])  # Curly brackets should be gone


print(df_new.loc[71, 'Analyzed_Word'])  # Should remain unchanged
print(df_new.loc[146, 'Analyzed_Word'])  # Should remain unchanged


[Dana ben eu age] ho-ig-a.
{Hè é} tó Khwé-tò dì góέ à tó ò + ǁé Qúva-ǁè ǂxà-á-tè à.
tau⁵¹ hou²⁴ {ʔat⁵⁵ jen¹¹ kjã:u²⁴}
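The rule itself (keep curly brackets only if they group several words) can also be tried out on hand-picked strings. The sketch below is a simplified re-statement, not the exact expression used above: `strip_single_word_braces` is a hypothetical helper, and it treats any whitespace inside the brackets as a sign of grouping.

```python
import re

braces = re.compile(r'\{([^{}]*)\}')   # any pair of curly brackets without nesting

def strip_single_word_braces(text):
    """Drop curly brackets unless their content contains whitespace, i.e. groups several words."""
    def keep_or_unwrap(match):
        inner = match.group(1).strip()
        # Whitespace inside the brackets means several words are grouped: keep the match as is
        return match.group(0) if re.search(r'\s', inner) else inner
    return braces.sub(keep_or_unwrap, text)

# Strings taken from the list of remaining curly brackets above
for sample in ['{}[{Dana} {ben} {eu} age] ho-ig-a.', '{Hè é} tó Khwé-tò', 'vatan-{ı}m{ı}z-{ı}']:
    print(strip_single_word_braces(sample))
```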


Check for plus signs, which are used in the examples to mark manual line breaks.

Found some instances.

In [43]:
list_col_string(df_new,'Analyzed_Word',r'\+')

[(50,
  'Wenn noch nicht einmal [du Linguist] die + neue Rechtschreibung beherrschst…'),
 (71, '{Hè é} tó Khwé-tò dì góέ à tó ò + ǁé Qúva-ǁè ǂxà-á-tè à.'),
 (80, 'Nyi tä=ŋga=ŋi Matiu i=qu=k=i kukŋuä + hn=i yatŋqä k-i-m=ŋqä=i.'),
 (88, 'mi-osnok mi-en-ah-miy, mi-en-ot jig miyes + mi-er tofi.'),
 (139, 'mɨnayarɨ horɨ amna ntono. nɨmno + hokono rma amna')]

Delete those superfluous strings using the vectorised `str.replace` method. The argument `regex=True` is required here because the pattern `' \+'` is a regular expression in which the plus sign is escaped; with `regex=False` it would be searched for literally, backslash included, and nothing would match.

(Since the earlier processing was improved, `todelete` no longer needs to be a list and the for-loop is not strictly necessary.)
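The difference between literal and regular-expression matching can be seen in plain Python, without pandas. This is only an illustration on a made-up string: the built-in `str.replace` stands in for `regex=False` and `re.sub` for `regex=True`.

```python
import re

sample = 'die + neue Rechtschreibung'   # shortened from the German example above
pattern = r' \+'

literal = sample.replace(pattern, '')   # searches for space, backslash, plus: no match
as_regex = re.sub(pattern, '', sample)  # searches for space followed by a plus sign

print(literal)
print(as_regex)
```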

In [44]:
todelete=[r' \+']

for p in todelete:
    df_new['Analyzed_Word']=df_new['Analyzed_Word'].str.replace(p,'',regex=True)

`Analyzed_Word` should now be in good shape.

### `Primary_Text`

First, check for tex macros in `Primary_Text`.

In [45]:
primary_macros = listmacros([macro_pattern]+special_patterns,df_new,'Primary_Text') # supply regular main pattern and special patterns as one list 
print(primary_macros)

['\\textglotstop', '\\v', '\\v{s}']


Some found, let's remove them.

In [46]:
list_col_string(df_new,'Primary_Text',r'\\textglotstop')

[(156, "\\textglotstop{}u\\v{s} pan\\v{s}' ha hananki' namkinada'")]

In [47]:
df_new['Primary_Text'] = df_new['Primary_Text'].apply(tex_to_utf)
df_new.loc[156,'Primary_Text']

"ʔuš panš' ha hananki' namkinada'"

Most of `Primary_Text` was left empty when reading the data in at the outset. We can now fix this (at least somewhat mechanically).

If `Primary_Text` is empty, it can be derived from `Analyzed_Word` by deleting any square or curly brackets used for grouping as well as any `-`, `=`, `~` or `·` used to indicate morpheme or clitic boundaries.

(Note that there is a certain chance of removing even elements that are graphemically employed in the language. I cannot completely avoid this for lack of structured access to "clean" primary text data for all instances. While I believe that the issue is marginal for the current dataset, users might want to be aware that, where available, `Analyzed_Word` is the best approximation to the source data.)
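The derivation rule can be previewed on two strings from the dataset before it is applied to the column. This is a plain-Python sketch using the same character class as the pandas call below; `boundary_chars` and `derive_primary` are names chosen for this illustration only.

```python
import re

# Grouping brackets and morpheme/clitic boundary symbols to be removed
boundary_chars = re.compile(r'[\[\]{}\-=~·]')

def derive_primary(analyzed):
    """Strip grouping brackets and boundary symbols from an analysed string."""
    return boundary_chars.sub('', analyzed)

print(derive_primary('E kaha rawa atu [maatou ngaa kaiako naa] ki te pata·patai'))
print(derive_primary('Tahati naikno ḡre e kmana pui~puhi=da'))
```

The caveat from the note above applies here as well: a hyphen that belongs to the orthography of a word would be removed in exactly the same way.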

In [48]:
print(df_new.columns)

Index(['Language_Name', 'Language_ID', 'Primary_Text', 'Analyzed_Word',
       'Gloss', 'Translated_Text', 'Source', 'Meta_Language_ID',
       'LGR_Conformance', 'Comment'],
      dtype='object')


In [49]:
df_new.loc[df_new['Primary_Text'].isna()|(df_new['Primary_Text'] == ''), ['Primary_Text','Analyzed_Word']] 

Unnamed: 0,Primary_Text,Analyzed_Word
0,,mū Háusā̀wā
1,,shī wannàn mālàmī
2,,mū mā̀làman-nàn
3,,war manaja nə
4,,atén oo hhawató
...,...,...
150,,forapi-ga
153,,ademata
154,,vi'ita
155,,adita


In [50]:
df_new.loc[df_new['Primary_Text'].isna()|(df_new['Primary_Text'] == ''), 'Primary_Text'] = df_new.loc[df_new['Primary_Text'].isna()|(df_new['Primary_Text'] == ''), 'Analyzed_Word'].str.replace(r'[\[\]\{\}\-\=\~·]','',regex=True)

With this, `Primary_Text` should be in good shape. The code below illustrates the result for some symbols.

In [51]:
df_new.loc[df_new['Analyzed_Word'].str.contains(r'\{|\[|\='), ['Language_Name','Primary_Text','Analyzed_Word']]

Unnamed: 0,Language_Name,Primary_Text,Analyzed_Word
6,Cairene Egyptian Colloquial Arabic,ʔintu ʔittalamza tiħibbu ʔilliʕb,ʔintu ʔit[-]talamza tiħibbu ʔilliʕb
8,Maltese,Intom ilħaddiema għandkom tingħaqdu,[Intom il-ħaddiema] għandkom tingħaqdu
17,Guugu Yimidhirr,Nyulu nhayun waarigan gaday waangguwunaarnay.,Nyulu nhayun waarigan gada-y waanggu=wunaarna-y.
29,Windesi Wamesa,sinitupatata,sinitu=pa-tata
30,Maori,E kaha rawa atu maatou ngaa kaiako naa ki te patapatai,E kaha rawa atu [maatou ngaa kaiako naa] ki te pata·patai
36,Cheke Holo,Tahati naikno ḡre e kmana puipuhida,Tahati naikno ḡre e kmana pui~puhi=da
40,Kokota,ka gai ira nakoni zuzufra tana nogoi naito tahi ke aḡeuniu,ka gai ira nakoni zuzufra tana nogoi naito tahi ke aḡe=u=ni=u
50,German,Wenn noch nicht einmal du Linguist die neue Rechtschreibung beherrschst…,Wenn noch nicht einmal [du Linguist] die neue Rechtschreibung beherrschst…
71,Khwe/Kxoe,Hè é tó Khwétò dì góέ à tó ò ǁé Qúvaǁè ǂxàátè à.,{Hè é} tó Khwé-tò dì góέ à tó ò ǁé Qúva-ǁè ǂxà-á-tè à.
75,Nzadi,ntə̂m. yε kó luzíŋ ↓é bï andzéé bǐ k yéè ntswé ninyá bɔ. atá…,ntə̂m. yε kó luzíŋ ↓é bï andzéé bǐ k yéè ntswé ninyá bɔ. atá[…]


### `Source`



In [52]:
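def viewport_id(lst,windowsize=1):
    """Return every index in lst together with the windowsize indices that follow it (duplicates are possible)."""
    return [val for i in lst for val in range(i, i + windowsize + 1)]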

In [53]:
nullsource_lst=df_new[df_new['Source'] == ''].index                # rows without a source
nullsource_lst=list(sorted(set(viewport_id(nullsource_lst))))      # add the following row for context, without duplicates

In [54]:
df_new.loc[df_new['Language_Name']=='Katu',['Language_Name','Analyzed_Word','Source']]

Unnamed: 0,Language_Name,Analyzed_Word,Source
127,Katu,manuih yi,
128,Katu,yi manuih,
129,Katu,yi adi anó yi,"costello1969[28, (35--37)]"


In [55]:
df_new.loc[nullsource_lst, ['Language_Name','Analyzed_Word','Source']]   # the list holds index labels, hence .loc

Unnamed: 0,Language_Name,Analyzed_Word,Source
31,Tuvaluan,Au ttino poto koo leva ne iloa nee au mea kolaa faatoaa iloa nee koe ttagata valea.,
32,Tuvaluan,Taatou tino Tuuvalu e see tau ki meakkai kolaa.,"besnier2000[393, (2018/2019)]"
33,Kwaio,'a-gauru-a ta'a i 'Ai'eda,
34,Kwaio,fa-meru-a ta'a geni,keesing1985[104]
44,Welsh,ni fyfyrwyr,
45,Danish,lad os voksne snakke i fred,"schroeter2021[29, (32b)]"
57,Aromanian,noi pikurar-li adrem pini.,
58,Romanian,Voi avocații vă apărați clienții.,"cornilescunicolae2014[10, (20a)]"
73,Luganda,Ffe abantu abaavu ffe tubonaabona.,
74,Nkore-Kiga,itwe abanyankore ni-tu-hinga ebinyoobwa,"taylor1985[131, (368)]"


First, I'm manually setting the appropriate source values where they correspond to or are derivable from the values of other examples from the same language. The reason for these gaps is that the LaTeX code presents such examples as blocks and only provides one reference (typically at the end of the block).

In [56]:
# Tuvaluan
df_new.loc[31,'Source'] = 'besnier2000[393, (2018)]'
df_new.loc[32,'Source'] = 'besnier2000[393, (2019)]'

# Kwaio
df_new.loc[33,'Source'] = 'keesing1985[104]'

# Khwe/Kxoe
df_new.loc[71,'Source'] = 'kilianhatz2008[41, (1) quoting Köhler 1989:514f.]'

# Menya
df_new.loc[80,'Source'] = 'whitehead2006[30, (58)]'
df_new.loc[81,'Source'] = 'whitehead2006[30, (59)]'

# Urim
df_new.loc[95,'Source'] = 'hemmilaeluoma1987[125]'

# Alamblak
df_new.loc[102:103,'Source'] = 'bruce1984[96]'

# Yagaria
df_new.loc[106,'Source'] = 'renck1975[19]'

# Katu
for i,k in zip([127,128,129],[35,36,37]):
    df_new.loc[i,'Source'] = f'costello1969[28, ({k})]'

# Classical Nahuatl
df_new.loc[135,'Source'] = 'andrews1975[193]'

# Korean
df_new.loc[142,'Source'] = 'choi2014phd[151, (15)]'





List the remaining rows with empty source values.

In [57]:
nullsource_lst=df_new[df_new['Source']== ''].index
df_new.loc[nullsource_lst, ['Language_Name','Analyzed_Word','Source']]   # the index holds labels, hence .loc

Unnamed: 0,Language_Name,Analyzed_Word,Source
44,Welsh,ni fyfyrwyr,
57,Aromanian,noi pikurar-li adrem pini.,
73,Luganda,Ffe abantu abaavu ffe tubonaabona.,
76,Swahili,Nyinyi wa-nafunzi m-me-cheka.,
148,English,you linguist-s,


The remaining five instances were elicited for Höhn 2024, so I will set the source to that. English is somewhat of an oddball: it seems doubtful to cite Höhn 2024 as the first source for this, although I'm not certain where this particular example first appears in the literature. It turns out not to occur in Postal 1969 or Sommerstein 1972, and Abney 1987 only has "we linguists". Effectively, the value for source probably doesn't make a major difference for the English example, so I'll use Höhn 2024 after all.

Manually adding some additional context, such as personal communication sources, to `Comment`.

In [58]:
for i in nullsource_lst:
    df_new.loc[i,'Source'] = 'hoehn2024[Supplementary Material S3]'

df_new.loc[44,'Comment'] = 'provided by David Willis (personal communication)'
df_new.loc[57,'Comment'] = 'elicited, see also Höhn (2016:546)'
df_new.loc[73,'Comment'] = 'elicited with Jenneke van der Wal from Saudah Namyalo (pers. comm.)'
df_new.loc[76,'Comment'] = 'elicited from Vital Kazimoto (pers. comm.), cf. also Höhn (2016:546)'






In [59]:
df_new

Unnamed: 0,Language_Name,Language_ID,Primary_Text,Analyzed_Word,Gloss,Translated_Text,Source,Meta_Language_ID,LGR_Conformance,Comment
0,Hausa,haus1257,mū Háusā̀wā,mū Háusā̀wā,we Hausa,`we Hausa',newman2000[371],stan1293,MORPHEME_ALIGNED,
1,Hausa,haus1257,shī wannàn mālàmī,shī wannàn mālàmī,he \Dem{}.1 teacher,`he (this) teacher',newman2000[371],stan1293,MORPHEME_ALIGNED,
2,Hausa,haus1257,mū mā̀làmannàn,mū mā̀làman-nàn,we teacher-\Dem.\Prox{},`we these teachers',newman2000[155],stan1293,MORPHEME_ALIGNED,
3,Mupun,mwag1236,war manaja nə,war manaja nə,3\F{} manager \Def{},"`she, the manager'","frajzyngier1993[172, (154)]",stan1293,WORD_ALIGNED,
4,Gorwaa,goro1270,atén oo hhawató,atén oo hhawató,Pro1\Pl{} \Anaph.\M{} men.\Lnk.\M{},`we men',"harvey2018[163, (2.205)]",stan1293,WORD_ALIGNED,
...,...,...,...,...,...,...,...,...,...,...
153,Hua,huaa1250,ademata,ademata,,of us women,haiman1980[240],stan1293,,person marked genitive forms
154,Hua,huaa1250,vi'ita,vi'ita,,of you men,haiman1980[240],stan1293,,person marked genitive forms
155,Hua,huaa1250,adita,adita,,of you women,haiman1980[240],stan1293,,person marked genitive forms
156,Chitimacha,chit1248,ʔuš panš' ha hananki' namkinada',,,We people who live in this house.,swadesh1967[333],stan1293,,


### `Translated_Text`

Now, let's check which macros we have in the translation and gloss columns in order to decide whether any or all of them have to go.

In [60]:
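# NB: `[^_]` is a negated character class (any character except `_`), unlike the `[_\^\\]` used in `pattern_nonvacargs`
pattern_args = re.compile(r'(?:\\[a-zA-Z~\'\"^=]+(?:\{[^}]*\})?)|(?:\$[^_]\{?[A-Za-z0-9]+\}?\$)')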


transl_macros = listmacros([pattern_args]+special_patterns,df_new,'Translated_Text')
print(len(transl_macros))
transl_macros


6


['\\ldots',
 "\\ldots'",
 "\\ldots''",
 '\\textbf',
 '\\textbf{the}',
 '\\textbf{you uncle there}']

Apply the cleaning mechanisms from `Analyzed_Word`.

In [61]:
df_new['Translated_Text']=df_new['Translated_Text'].apply(tex_to_utf)              # replaces macros by direct characters, here \ldots
df_new['Translated_Text']=df_new['Translated_Text'].apply(process_latex_macros)      # removes \textbf{...}

Some examples contain direct speech using the LaTeX method of double quotes. Convert these to simple double quotes.

In [62]:
todelete=[r'\`\`',r'\'\'']

for p in todelete:
    df_new['Translated_Text']=df_new['Translated_Text'].str.replace(p,'\"',regex=True)

Remove any single quotes around translations.

In [63]:
todelete=[r'\`',r'\'']

for p in todelete:
    df_new['Translated_Text']=df_new['Translated_Text'].str.replace(p,'',regex=True)
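Note that the two replacements above remove every backtick and apostrophe in the column, including apostrophes inside words. If that ever becomes a problem, a more conservative variant could strip only the outermost quotes. The following is a sketch of that idea, not the procedure used in this notebook; `convert_double_quotes`, `strip_outer_quotes` and the sample strings are hypothetical.

```python
import re

def convert_double_quotes(text):
    """Turn LaTeX-style double quotes (`` and '') into plain double quotes."""
    return text.replace('``', '"').replace("''", '"')

def strip_outer_quotes(text):
    """Remove one leading backtick and one trailing apostrophe, leaving inner apostrophes alone."""
    return re.sub(r"^\s*`|'\s*$", '', text)

# Invented translation with direct speech and a word-internal apostrophe
sample = "`He said ``don't go'' to us'"
print(strip_outer_quotes(convert_double_quotes(sample)))
```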

Verify that the column is cleaned:

In [64]:
list_col_string(df_new,'Translated_Text',r'\`')

[]

In [65]:
df_new

Unnamed: 0,Language_Name,Language_ID,Primary_Text,Analyzed_Word,Gloss,Translated_Text,Source,Meta_Language_ID,LGR_Conformance,Comment
0,Hausa,haus1257,mū Háusā̀wā,mū Háusā̀wā,we Hausa,we Hausa,newman2000[371],stan1293,MORPHEME_ALIGNED,
1,Hausa,haus1257,shī wannàn mālàmī,shī wannàn mālàmī,he \Dem{}.1 teacher,he (this) teacher,newman2000[371],stan1293,MORPHEME_ALIGNED,
2,Hausa,haus1257,mū mā̀làmannàn,mū mā̀làman-nàn,we teacher-\Dem.\Prox{},we these teachers,newman2000[155],stan1293,MORPHEME_ALIGNED,
3,Mupun,mwag1236,war manaja nə,war manaja nə,3\F{} manager \Def{},"she, the manager","frajzyngier1993[172, (154)]",stan1293,WORD_ALIGNED,
4,Gorwaa,goro1270,atén oo hhawató,atén oo hhawató,Pro1\Pl{} \Anaph.\M{} men.\Lnk.\M{},we men,"harvey2018[163, (2.205)]",stan1293,WORD_ALIGNED,
...,...,...,...,...,...,...,...,...,...,...
153,Hua,huaa1250,ademata,ademata,,of us women,haiman1980[240],stan1293,,person marked genitive forms
154,Hua,huaa1250,vi'ita,vi'ita,,of you men,haiman1980[240],stan1293,,person marked genitive forms
155,Hua,huaa1250,adita,adita,,of you women,haiman1980[240],stan1293,,person marked genitive forms
156,Chitimacha,chit1248,ʔuš panš' ha hananki' namkinada',,,We people who live in this house.,swadesh1967[333],stan1293,,


Check for other special characters.

In [66]:
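# NB: `[|]` is a character class matching a literal pipe, so this pattern does not test for square brackets
df_new[df_new['Translated_Text'].str.contains('[|]|{|}|-|=|~|·',regex=True)][['Language_Name','Translated_Text']]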

#df_new[df_new['Translated_Text'].str.contains('{|}',regex=True)]['Translated_Text']

Unnamed: 0,Language_Name,Translated_Text
88,Moskona,"we people bathe, wear clothes, wear hats… (stand in clothes = wear clothes)"
115,Kamang,the \{specific group of\} people
133,Basque,"{You, father and son}, have spoiled my whole appetite for dinner."
134,Basque,{We Basques} have a new debt to Orixe.


Inspect 88 in more detail. The equal sign here can remain, since it provides a clarification to the translation.

In [67]:
df_new.iloc[88]['Translated_Text']

'we people bathe, wear clothes, wear hats… (stand in clothes = wear clothes)'

The escaped curly brackets in the Kamang example (115) can be replaced by parentheses for consistency.

In [68]:
list_col_string(df_new,'Translated_Text',r'\\\{')

[(115, 'the \\{specific group of\\} people')]

In [69]:
df_new['Translated_Text']=df_new['Translated_Text'].str.replace(r'\\\{','(',regex=True)
df_new['Translated_Text']=df_new['Translated_Text'].str.replace(r'\\\}',')',regex=True)

Finally, remove the remaining curly brackets.

In [70]:
todelete=[r'\{',r'\}']

for p in todelete:
    df_new['Translated_Text']=df_new['Translated_Text'].str.replace(p,'',regex=True)
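The order of the two replacements matters: the escaped brackets have to be converted first, since removing the bare curly brackets first would leave stray backslashes behind. A minimal sketch on two of the strings listed above, using plain `re` instead of the pandas `.str.replace` calls (the function name `clean_translation` is hypothetical and not part of the notebook):

```python
import re

def clean_translation(text):
    # First turn escaped brackets \{ ... \} into parentheses
    text = re.sub(r'\\\{', '(', text)
    text = re.sub(r'\\\}', ')', text)
    # Then drop any remaining bare curly brackets
    return re.sub(r'[{}]', '', text)

print(clean_translation(r'the \{specific group of\} people'))
print(clean_translation('{We Basques} have a new debt to Orixe.'))
```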

The `Translated_Text` column is now in good shape.

### `Gloss`

List the macros found in `Gloss`.

In [71]:
gloss_macros = listmacros([pattern_args]+special_patterns,df_new,'Gloss')
print(len(gloss_macros))
gloss_macros

159


['\\Aarg{}',
 '\\Abs{}',
 '\\Acc',
 '\\Acc{}',
 '\\Act{}',
 '\\Addr{}',
 '\\Anaph',
 '\\Aor',
 '\\Appl',
 '\\Art',
 '\\Art{}',
 '\\Ass',
 '\\Aux{}',
 '\\Caus',
 '\\Cert{}',
 '\\Cf',
 '\\Char{}',
 '\\Cl',
 '\\Cnt',
 '\\Common{}',
 '\\Compel{}',
 '\\Com{}',
 '\\Contr{}',
 '\\Dat',
 '\\Dat{}',
 '\\Def',
 '\\Def{}',
 '\\Dem',
 '\\Dem=',
 '\\Dem{}',
 '\\Det',
 '\\Det{}',
 '\\Dso=',
 '\\Ds{}',
 '\\Dur{}',
 '\\Dur~die',
 '\\Dur~way=',
 '\\Du{}',
 '\\Emph',
 '\\Emph{}',
 '\\Erg',
 '\\Erg{}',
 '\\Excl',
 '\\Excl=',
 '\\Exclam{}',
 '\\Excl{}',
 '\\F',
 '\\Fin{}',
 '\\Foc{}',
 '\\Fpron',
 '\\Fv{}',
 '\\F{}',
 '\\Gen',
 '\\Generic=',
 '\\Generic{}',
 '\\Gvn{}',
 '\\Habit',
 '\\Ill',
 '\\Imp{}',
 '\\Inan',
 '\\Incl{}',
 '\\Indf',
 '\\Indf=',
 '\\Indf{}',
 '\\Ind{}',
 '\\Inf',
 '\\Infl',
 '\\Inf{}',
 '\\Inv',
 '\\Ipfv',
 '\\Ipfv{}',
 '\\Irr',
 '\\Irr=',
 '\\Lig=',
 '\\Lig{}',
 '\\Lnk',
 '\\Lnk{}',
 '\\Loc',
 '\\Loc{}',
 '\\M',
 '\\M=',
 '\\M={1\\Pl}',
 '\\M={2\\Sg}',
 '\\Mod{}',
 '\\M{}',
 '\\N',
 '

In [72]:
# Store lists of (idx,example) tuples for some macros for reference
gloss_textbf=list_col_string(df_new,'Gloss','textbf')
gloss_textsc=list_col_string(df_new,'Gloss','textsc')
gloss_diverse=list_col_string(df_new,'Gloss','Incl|Sg|Pl')
gloss_textperiod=list_col_string(df_new,'Gloss','textperiodcentered|texttildelow')


In [73]:
def get_diff_df(dafra,col,tuplelist):
    '''
    Create a dataframe illustrating changes in a column of a dataframe.
    Works by mapping a list of tuples generated by `list_col_string` (tuplelist; each tuple holds an index in the original dataframe and the value of the column at that index) onto the current value of that column at those indices.

    dafra: a dataframe (normally the same one used in original call of `list_col_string`)
    col: string for a valid column name in dafra (normally the same one used in original call of `list_col_string`)
    tuplelist: a list of 2-tuples generated by `list_col_string`
    '''
    table_df = pd.DataFrame(tuplelist, columns=['Index', 'Before'])
    table_df['After'] = table_df['Index'].map(dafra[col])
    return table_df
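The snapshot-and-compare idea behind `get_diff_df` can be illustrated without pandas. The sketch below is only an analogy: `snapshot` is a hypothetical stand-in for the notebook's `list_col_string`, `diff_rows` mirrors the Index/Before/After layout, and the three gloss strings are a small sample chosen for illustration.

```python
import re

def snapshot(values, pattern):
    # (index, value) pairs for all entries matching the pattern
    return [(i, v) for i, v in enumerate(values) if re.search(pattern, v)]

def diff_rows(values, snap):
    # Pair each saved value with the current value at the same index
    return [(i, before, values[i]) for i, before in snap]

glosses = [r'we \Def-students', 'we Hausa', r'1\Pl{} two sister']
before = snapshot(glosses, 'Def|Pl')   # taken before the change

# Modify the data (here: upper-case the gloss macros)
glosses = [re.sub(r'\\([A-Z][a-z]*)(?:\{\})?', lambda m: m.group(1).upper(), g)
           for g in glosses]

for row in diff_rows(glosses, before):
    print(row)
```

Only the entries captured in the snapshot appear in the comparison, so the snapshot has to be taken before the column is modified.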



Steps:
1. Save a copy of `Gloss` as a distinct column to retain the LaTeX code if needed.
2. Remove `\textbf` as above.
3. Replace the contents of `\textsc` macros with all caps and remove the macro.
4. Replace other macros for special symbols by the corresponding UTF-8 characters (relevant here: `\textperiodcentered` and `\texttildelow`).
5. Replace all remaining macros with their counterpart in all caps. A macro here is a string of alphabetic characters (a capital followed by optional lowercase characters) that follows `\` and precedes either a pair of curly brackets or any one non-alphabetic character (including line end). Drop the initial backslash and a following pair of empty curly brackets, but retain any other following material.
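Steps 2 to 5 can be sketched on a single string. This is only an approximation under stated assumptions: the plain `re` substitutions stand in for the notebook's own helpers `process_latex_macros` and `tex_to_utf`, they do not handle nested curly brackets, and the function name `sketch_clean_gloss` as well as the sample string (assembled from fragments of glosses in the dataset) are made up for illustration.

```python
import re

def sketch_clean_gloss(gloss):
    # Step 2: drop \textbf{...} but keep its content (assumes no nested braces)
    gloss = re.sub(r'\\textbf\{([^{}]*)\}', r'\1', gloss)
    # Step 3: replace \textsc{...} by its content in all caps
    gloss = re.sub(r'\\textsc\{([^{}]*)\}', lambda m: m.group(1).upper(), gloss)
    # Step 4: replace a special-symbol macro by its character
    gloss = gloss.replace(r'\textperiodcentered{}', '·')
    # Step 5: upper-case remaining gloss macros, dropping \ and a trailing {}
    return re.sub(r'\\([A-Z][a-z]*)(?:\{\})?', lambda m: m.group(1).upper(), gloss)

sample = r'\Indf{}=\M=\textbf{2\Sg} this=\textsc{time}=\Gvn{} \Redup{}\textperiodcentered{}ask'
print(sketch_clean_gloss(sample))
# INDF=M=2SG this=TIME=GVN REDUP·ask
```

Because the pattern in step 5 requires an initial capital, lowercase macros such as `\textbf` or `\textsc` would be left untouched by it, which is why they are handled in separate steps beforehand.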

In [74]:
# Step 1
# Copy Gloss
df_new['Gloss_LaTeX'] =df_new['Gloss']

In [75]:
# Step 2
# Remove \textbf macros but keep their content (the original LaTeX code is preserved in Gloss_LaTeX)
df_new['Gloss'] = df_new['Gloss'].apply(lambda text:process_latex_macros(text,['textbf'],True,nospace=True))
get_diff_df(df_new,'Gloss',gloss_textbf)

Unnamed: 0,Index,Before,After
0,82,\Indf{}=\M=\textbf{2\Sg} steal \Ass-get-2\Sg/\Irr-\Generic=\Def{},\Indf{}=\M=2\Sg steal \Ass-get-2\Sg/\Irr-\Generic=\Def{}
1,133,spoil 3\Sg{}.\Abs{}.\Aux{}.1\Sg{}.\Dat{}.\textbf{2\Pl{}.\Erg} father-son-\Proxart.\Pl{}.\Erg{} dinner-\Loc{}-\Lnk{} appetite all-\Det{}.\Abs{},spoil 3\Sg{}.\Abs{}.\Aux{}.1\Sg{}.\Dat{}.2\Pl{}.\Erg father-son-\Proxart.\Pl{}.\Erg{} dinner-\Loc{}-\Lnk{} appetite all-\Det{}.\Abs{}
2,134,debt new-\Det{}.\Abs{}{} 3\Sg{}.\Abs{}.\Aux{}.\textbf{1\Pl{}.\Erg} Basque-\Proxart.\Pl{}{} Orixe-\Com{},debt new-\Det{}.\Abs{}{} 3\Sg{}.\Abs{}.\Aux{}.1\Pl{}.\Erg Basque-\Proxart.\Pl{}{} Orixe-\Com{}


In [76]:
# Step 3
# Replace textsc macros by capitalised content
df_new['Gloss'] = df_new['Gloss'].apply(lambda text:process_latex_macros(text,['textsc'],True,True))
get_diff_df(df_new,'Gloss',gloss_textsc)

Unnamed: 0,Index,Before,After
0,80,1\Sg{} this=\textsc{time}=\Gvn{} Matthew \Dem=\M={2\Sg}=\Obj{} talk \Indf=\F{} ask 2\Sg-do-1\Sg/\Irr=\textsc{goal}=\Ind{},1\Sg{} this=TIME=\Gvn{} Matthew \Dem=\M={2\Sg}=\Obj{} talk \Indf=\F{} ask 2\Sg-do-1\Sg/\Irr=GOAL=\Ind{}


In [77]:
# Step 4
# Replace all other special symbol macros by UTF characters
df_new['Gloss']=df_new['Gloss'].apply(tex_to_utf)              # replaces macros by direct characters, here \textperiodcentered and \texttildelow
get_diff_df(df_new,'Gloss',gloss_textperiod)


Unnamed: 0,Index,Before,After
0,30,\Tam{} strong very away 1\Pl.\Excl{} the.\Pl{} teacher \Dem.2 to the \Redup{}\textperiodcentered{}ask,\Tam{} strong very away 1\Pl.\Excl{} the.\Pl{} teacher \Dem.2 to the \Redup{}·ask
1,114,Ilwang 3\Sg{} \Redup\texttildelow{}quickly run open 3\Sg{}-inside \Loc{} exit-\Fin{},Ilwang 3\Sg{} \Redup·quickly run open 3\Sg{}-inside \Loc{} exit-\Fin{}


In [78]:
# Step 5
# Replace any remaining macros with their capitalised version, removing preceding \ and following {}

gloss_macro=re.compile(r'\\([A-Z][a-z]*)(?:\{\})?')

def capitalize_macro(match):
    return match.group(1).upper()

df_new['Gloss']=df_new['Gloss'].apply(lambda text:gloss_macro.sub(capitalize_macro,text))
get_diff_df(df_new,'Gloss',gloss_diverse)


Unnamed: 0,Index,Before,After
0,4,Pro1\Pl{} \Anaph.\M{} men.\Lnk.\M{},Pro1PL ANAPH.M men.LNK.M
1,5,1\Pl.\Nom{} Kambaata-\M.\Nom{},1PL.NOM Kambaata-M.NOM
2,6,you.\Pl{} [\Def-]students 2\Pl{}.like playing,you.PL [DEF-]students 2PL.like playing
3,7,we \Def-students not 1\Pl.be.able 1\Pl.accept this \Def-decision,we DEF-students not 1PL.be.able 1PL.accept this DEF-decision
4,8,you \Def-workers have.2\Pl{} unite.2\Pl{},you DEF-workers have.2PL unite.2PL
...,...,...,...
113,146,1\Pl{} two sister,1PL two sister
114,147,we Turk-\Pl{} mother.country-1\Pl-\Acc{} love-\Aor-1\Pl{},we Turk-PL mother.country-1PL-ACC love-AOR-1PL
115,148,2\Sg.\Pl{} linguist-\Pl{},2SG.PL linguist-PL
116,149,Forapi 1\Sg{},Forapi 1SG


Now check for any remaining curly brackets.

In [79]:
gloss_curlybrackets=list_col_string(df_new,'Gloss','{|}')
gloss_curlybrackets

[(54, 'we women always-EMPH REFL families for {work.hard-1PL}'),
 (80,
  '1SG this=TIME=GVN Matthew DEM=M={2SG}=OBJ talk INDF=F ask 2SG-do-1SG/IRR=GOAL=IND'),
 (81,
  '1PL person man DEM=M={1PL} food other=FOC CERT ASS-eat-PST/PFV-1PL/DSO=IND'),
 (86, '{} where=LIG=3PL=2PL{} FOC.NF 2PL'),
 (107, 'woman-they.DU{} come-PST-3.DU-IND'),
 (109, 'Juab Minöp OBJ.3DU{} give-PRF.3SG'),
 (112, '{} we again you child unmarried custom like thus be-1PL.PRS'),
 (134, 'debt new-DET.ABS{} 3SG.ABS.AUX.1PL.ERG Basque-PROXART.PL{} Orixe-COM'),
 (139,
  'species.of.leaf seeking we.EXCL went house {one.occupied.with} same-ref we.EXCL')]

Check the initial {} in 86 and 112. As suspected, they ensure proper alignment of the gloss by accounting for the initial ellipsis (…) in the example text.

In [80]:
print(df_new.iloc[86])
print(df_new.iloc[112])

Language_Name                                                   Bilua
Language_ID                                                  bilu1245
Primary_Text                                   … laizamumela inio me.
Analyzed_Word                               … lai=za=mu=mela inio me.
Gloss                               {} where=LIG=3PL=2PL{} FOC.NF 2PL
Translated_Text                          … you are people from where?
Source                                          obata2003[88, (7.49)]
Meta_Language_ID                                             stan1293
LGR_Conformance                                      MORPHEME_ALIGNED
Comment                                                              
Gloss_LaTeX         {} where=\Lig=3\Pl{}=2\Pl{}{} \Foc{}.\Nf{} 2\Pl{}
Name: 86, dtype: object
Language_Name                                                                  Usan
Language_ID                                                                usan1239
Primary_Text                          

I want to remove empty curly brackets that follow a non-space character and replace curly brackets surrounding material by the content they surround.

In [81]:
# Pattern to remove empty curly brackets {} after a non-space character
pattern_curly_empty = re.compile(r'(\S)\{\}')  

# Pattern to retain content inside curly brackets
pattern_curly_content = re.compile(r'\{([^{}]+)\}')  

text1=r'species.of.leaf seeking we.EXCL went house {one.occupied.with} same-ref we.EXCL'
text2=r'I the+person intelligent PFV{}{} know {} ERG I thing those just know ERG you the+man stupid'

print(pattern_curly_empty.sub(r'\1',text2))
print(pattern_curly_content.sub(r'\1',text1))

I the+person intelligent PFV{} know {} ERG I thing those just know ERG you the+man stupid
species.of.leaf seeking we.EXCL went house one.occupied.with same-ref we.EXCL


Apply the patterns to the dataset (apply `pattern_curly_empty` twice because of double {} observed in 31).

In [82]:
df_new['Gloss']=df_new['Gloss'].apply(lambda text:(pattern_curly_empty.sub(r'\1',text))).apply(lambda text:(pattern_curly_empty.sub(r'\1',text))).apply(lambda text:(pattern_curly_content.sub(r'\1',text)))
get_diff_df(df_new,'Gloss',gloss_curlybrackets)

Unnamed: 0,Index,Before,After
0,54,we women always-EMPH REFL families for {work.hard-1PL},we women always-EMPH REFL families for work.hard-1PL
1,80,1SG this=TIME=GVN Matthew DEM=M={2SG}=OBJ talk INDF=F ask 2SG-do-1SG/IRR=GOAL=IND,1SG this=TIME=GVN Matthew DEM=M=2SG=OBJ talk INDF=F ask 2SG-do-1SG/IRR=GOAL=IND
2,81,1PL person man DEM=M={1PL} food other=FOC CERT ASS-eat-PST/PFV-1PL/DSO=IND,1PL person man DEM=M=1PL food other=FOC CERT ASS-eat-PST/PFV-1PL/DSO=IND
3,86,{} where=LIG=3PL=2PL{} FOC.NF 2PL,{} where=LIG=3PL=2PL FOC.NF 2PL
4,107,woman-they.DU{} come-PST-3.DU-IND,woman-they.DU come-PST-3.DU-IND
5,109,Juab Minöp OBJ.3DU{} give-PRF.3SG,Juab Minöp OBJ.3DU give-PRF.3SG
6,112,{} we again you child unmarried custom like thus be-1PL.PRS,{} we again you child unmarried custom like thus be-1PL.PRS
7,134,debt new-DET.ABS{} 3SG.ABS.AUX.1PL.ERG Basque-PROXART.PL{} Orixe-COM,debt new-DET.ABS 3SG.ABS.AUX.1PL.ERG Basque-PROXART.PL Orixe-COM
8,139,species.of.leaf seeking we.EXCL went house {one.occupied.with} same-ref we.EXCL,species.of.leaf seeking we.EXCL went house one.occupied.with same-ref we.EXCL
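As an alternative to applying `pattern_curly_empty` twice, a quantifier can remove a whole run of empty curly brackets in one pass. A minimal sketch, assuming the same intent (the name `pattern_curly_empty_run` is hypothetical and the test string is a shortened version of `text2`):

```python
import re

# One or more empty {} pairs directly after a non-space character
pattern_curly_empty_run = re.compile(r'(\S)(?:\{\})+')

text = 'I the+person intelligent PFV{}{} know {} ERG'
print(pattern_curly_empty_run.sub(r'\1', text))
# I the+person intelligent PFV know {} ERG
```

A free-standing `{}` preceded by a space is still retained, so alignment placeholders such as the initial `{}` in 86 and 112 would be unaffected.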


The column `Gloss` is now in good shape. 

*optional for future reference*: check LGR_Conformance manually if deemed necessary 

## Create examples.csv file

In [83]:
df_new.to_csv('examples.csv',index=True,index_label='ID')

# Testing ground


Anything from here on can and should be ignored.

In [None]:
cdict = {
    'apc_order':{
        'descr': 'This describes to order of APCs.',
        'wals_feat': '22A'
    },
    'constituent_order':{
        'descr': 'This describes to constituent order.',
        'comment': 'Mainly based on WALS data',
        'values':{
            'OV': 'Direct objects generally precede the verb (in base structure).',
            'VO': 'Direct objects generally follow the verb (in base structure).'},
        'person-rel': 'PPDC'
    },
    'another_feature': {
        'wals_feat': '56A',
        'person-rel': 'True'
    }
}

In [14]:
for i in cdict:
    print(i)
    if 'wals_feat' in cdict[i]:
        print(cdict[i]['wals_feat'])

apc_order
22A
constituent_order
another_feature
56A


In [3]:
cdict['apc_order']['values']['pre']

'The adnominal pronoun precedes the nominal.'

In [7]:
cdict['apc_order']['descr']

'This describes to order of APCs.'

In [15]:
testlist=[cdict[d]['descr'] for d in cdict if 'descr' in cdict[d]] 
print(testlist)

['This describes to order of APCs.', 'This describes to constituent order.']


In [93]:
import json

with open("sample.json", "w") as outfile: 
    json.dump(cdict, outfile)

In [94]:
# Opening JSON file
with open('sample.json') as json_file:
    imported = json.load(json_file)

In [96]:
print(imported)

{'apc_order': {'descr': 'This describes to order of APCs.', 'values': {'pre': 'The adnominal pronoun precedes the nominal.', 'post': 'The adnominal pronoun follows the nominal.'}}, 'constituent_order': {'descr': 'This describes to constituent order.', 'comment': 'Mainly based on WALS data', 'values': {'OV': 'Direct objects generally precede the verb (in base structure).', 'VO': 'Direct objects generally follow the verb (in base structure).'}}}


In [19]:
for e in cdict:
    if cdict[e].get('person-rel') not in ['PPDC',None]:
        print(cdict[e])

{'wals_feat': '56A', 'person-rel': 'True'}
