# Investigation of an MQL file with errors

Saulo de Oliveira Cantanhêde is
[working on a TF dataset for the Greek New Testament](https://github.com/saulocantanhede/tfgreek2).
When he exports it to MQL with the TF converter, the MQL parser reports errors.

# Replication of the problem

First we run the conversion in order to reproduce the error.

## Prerequisites

1. Have the [saulocantanhede/tfgreek2](https://github.com/saulocantanhede/tfgreek2) repo cloned under `~/github/saulocantanhede/tfgreek2`
2. Have the [Emdros](https://emdros.org) software installed in such a way that the `mql` command works on the command line

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use

## Load the TF data

In [3]:
A = use("saulocantanhede/tfgreek2:clone", checkout="clone", version="0.5.9", hoist=globals())

**Locating corpus resources ...**

   |     0.28s T otype                from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     3.49s T oslots               from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.61s T translit             from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.61s T lemma                from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.52s T punctuation          from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.23s T chapter              from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.52s T after                from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.24s T verse                from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.64s T text                 from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.64s T unaccent             from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.00s T before               from ~/github/saulocantanhede/tfgreek2/tf/0.5.9
   |     0.51s T book                 from ~/github/sa

Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [5]:
MQL_DIR = "~/github/annotation/text-fabric/_temp/mql"

In [5]:
A.exportMQL("tfgreek2", MQL_DIR)

  0.00s Checking features of dataset tfgreek2
  0.00s 57 features to export to MQL ...
  0.01s Loading 57 features
  0.01s Writing enumerations
	after          :   19 values, 19 not a name, e.g. « »
	before         :    4 values, 4 not a name, e.g. «(»
	bookshort      :   27 values, 11 not a name, e.g. «1CO»
	criticalsign   :    7 values, 7 not a name, e.g. «(»
	function       :  429 values, 418 not a name, e.g. «Cmpl-Cmpl»
	note           :    1 values, 1 not a name, e.g. «discontinuous discourse»
	punctuation    :    5 values, 5 not a name, e.g. « »
	role           :   14 values, 1 not a name, e.g. «err_clause-complex-met-no-conditionsClCl2»
	rule           :  817 values, 668 not a name, e.g. «12Np»
	typems         :   16 values, 6 not a name, e.g. «apposition-group»
	variant        :    2 values, 2 not a name, e.g. «1»
   |     0.19s Writing an all-in-one enum with   98 values
  0.20s Mapping 57 features onto 10 object types
  0.67s Writing 57 features as data in 10 object types
   

In [6]:
!ls -lh $MQL_DIR

total 393232
-rw-r--r--  1 me  staff   177M Jul  5 09:01 tfgreek2.mql


## Import it into an MQL database

We change to the right directory, remove the results of previous runs, and do the MQL import.

MQL produces extremely long error messages, sometimes as long as the input file, so we redirect them
into the file `errors.txt`, which we can then inspect comfortably in Vim.

In [7]:
!cd $MQL_DIR; rm -rf tfgreek2 errors.txt; mql -b 3 < tfgreek2.mql >& errors.txt

In [8]:
!ls -lh $MQL_DIR

total 613432
-rw-r--r--  1 me  staff    27M Jul  5 09:05 errors.txt
-rw-r--r--  1 me  staff    68M Jul  5 09:05 tfgreek2
-rw-r--r--  1 me  staff   177M Jul  5 09:01 tfgreek2.mql


There it is, an error file of 27 MB. By looking at it in Vim, we eventually find something suspect.
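As an alternative to paging through the file by hand, here is a minimal sketch of a scan that reports the line numbers of likely error lines. The marker strings are taken from the MQL output seen in this notebook; the helper name `find_markers` and the `limit` parameter are hypothetical, not part of Emdros or Text-Fabric.

```python
from pathlib import Path

# Marker strings as they occur in the MQL output of this notebook.
MARKERS = ("ERROR:", "Parsing failed", "syntax error")


def find_markers(lines, markers=MARKERS, limit=10):
    """Return up to `limit` pairs (line number, line) that contain a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.rstrip("\n")))
            if len(hits) >= limit:
                break
    return hits


# Location of the error file, as used in this notebook.
error_path = Path("~/github/annotation/text-fabric/_temp/mql/errors.txt").expanduser()

if error_path.exists():
    # Read line by line, so the whole 27 MB file is never held in memory.
    with error_path.open(encoding="utf8", errors="replace") as fh:
        for number, line in find_markers(fh):
            print(number, line)
```

The reported line numbers can then be fed to `head` and `tail` to look at the context of each hit.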

The first error is here:

In [14]:
!cd $MQL_DIR; head -55 errors.txt | tail -n +45

+------------------------+
Creating indices on phrase_objects...!
Dropping indices on wg_objects...!
ERROR: Compile error while executing query:
--------------------------------------------

CREATE OBJECTS
WITH OBJECT TYPE[wg]

CREATE OBJECT
FROM MONADS= { 1-8 }


But the actual message is here:

In [15]:
!cd $MQL_DIR; head -661961 errors.txt | tail -n +661955

]

 GO
--------------------------------------------
Parsing failed
syntax error near the integer 10490



When I look for `10490` I find this:

In [16]:
!cd $MQL_DIR; head -3029 errors.txt | tail -n +3018

CREATE OBJECT
FROM MONADS= { 318,320-367 }
WITH ID_D=390881 [
book:=Matthew;
bookshort:='MAT';
cls:=cl;
crule:=ClCl2;
nodeid:=400010200010490;
num:=67;
parent:=(390880);

]


Aha, there are large integers here: `nodeid:=400010200010490;`. That is probably too much for MQL, which
is very likely 32-bit software: the largest unsigned 32-bit integer is `2^32 - 1`, which is `4294967295`,
and the largest signed one is `2^31 - 1`, which is `2147483647`. Both are much less.
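The size comparison can be checked with a few lines of arithmetic. This is only a sketch of the reasoning: that MQL stores integers in 32 bits is the assumption made above, not something verified here.

```python
# The value that the MQL parser stumbled on.
node_id = 400010200010490

# Largest values that fit in 32 bits.
MAX_UNSIGNED_32 = 2**32 - 1  # 4294967295
MAX_SIGNED_32 = 2**31 - 1  # 2147483647

print(node_id > MAX_UNSIGNED_32)  # True
print(node_id > MAX_SIGNED_32)  # True

# Number of bits needed to represent the value.
print(node_id.bit_length())  # 49
```

So the value needs 49 bits, well beyond what a 32-bit integer can hold, whether signed or unsigned.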

We can check whether this is the root cause by changing the value type of the feature `nodeid` to `str`.

I did it manually, directly in the feature file `~/github/saulocantanhede/tfgreek2/tf/0.5.9/nodeid.tf`:

```
@node
@author=Evangelists and apostles
@converterSourceLocation=https://github.com/saulocantanhede/tfgreek2/tree/main/programs
@converterVersion=0.5.9 (July 3, 2024)
@converters=Saulo de Oliveira Cantanhêde, Tony Jurg, Dirk Roorda
@dataSource=MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat
@dataSourceFormat=XML lowfat tree data
@dataSourceLocation=https://github.com/saulocantanhede/tfgreek2/tree/main/xml/2022-11-01/gnt
@description=node id (as in the XML source data)
@editors=Eberhart Nestle (1904)
@institute=ETCBC (Eep Talstra Centre for Bible and Computer) at Vrije Universiteit Amsterdam, CBLC (Center of Biblical Languages and Computing) at Andrews University
@title=Greek New Testament (Nestle 1904)
@valueType=str
@version=0.5.9
@xmlSourceDate=February 9, 2023
@xmlVersion=2022-11-01
@writtenBy=Text-Fabric
@dateWritten=2024-07-03T15:41:44Z

138125	400010200010490
138127	400010200120390
138130	400010200130100
138140	400010210050150
```

...

See `@valueType=str` above, where the original value was `int`.
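The same manual edit could be scripted. Below is a minimal sketch that works on the text of a feature file; it assumes that the metadata header ends at the first blank line, as in the listing above. The helper name `set_value_type` is hypothetical and not part of the Text-Fabric API.

```python
import re


def set_value_type(tf_text, new_type):
    """Return the text of a .tf file with its @valueType metadata line replaced.

    Assumption: the metadata header is separated from the data by the first blank line.
    """
    header, separator, data = tf_text.partition("\n\n")
    # Only touch the header, so that data lines can never be modified by accident.
    header = re.sub(
        r"^@valueType=.*$",
        f"@valueType={new_type}",
        header,
        count=1,
        flags=re.MULTILINE,
    )
    return header + separator + data


# A shortened stand-in for nodeid.tf.
sample = "@node\n@valueType=int\n@version=0.5.9\n\n138125\t400010200010490\n"
print(set_value_type(sample, "str"))
```

To apply it to a real file, read the file into a string, pass it through the function, and write the result back.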

Now we load the dataset again, and note that TF picks up the change:

In [18]:
A = use("saulocantanhede/tfgreek2:clone", checkout="clone", version="0.5.9", hoist=globals())

**Locating corpus resources ...**

   |     0.03s T nodeid               from ~/github/saulocantanhede/tfgreek2/tf/0.5.9


Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

We run the MQL export again:

In [19]:
A.exportMQL("tfgreek2", MQL_DIR)

  0.00s Checking features of dataset tfgreek2
  0.00s 57 features to export to MQL ...
  0.01s Loading 57 features
  0.01s Writing enumerations
	after          :   19 values, 19 not a name, e.g. « »
	before         :    4 values, 4 not a name, e.g. «(»
	bookshort      :   27 values, 11 not a name, e.g. «1CO»
	criticalsign   :    7 values, 7 not a name, e.g. «(»
	function       :  429 values, 418 not a name, e.g. «Cmpl-Cmpl»
	note           :    1 values, 1 not a name, e.g. «discontinuous discourse»
	punctuation    :    5 values, 5 not a name, e.g. « »
	role           :   14 values, 1 not a name, e.g. «err_clause-complex-met-no-conditionsClCl2»
	rule           :  817 values, 668 not a name, e.g. «12Np»
	typems         :   16 values, 6 not a name, e.g. «apposition-group»
	variant        :    2 values, 2 not a name, e.g. «1»
   |     0.19s Writing an all-in-one enum with   98 values
  0.20s Mapping 57 features onto 10 object types
  0.67s Writing 57 features as data in 10 object types
   

In [20]:
!ls -lh $MQL_DIR

total 613432
-rw-r--r--  1 me  staff    27M Jul  5 09:05 errors.txt
-rw-r--r--  1 me  staff    68M Jul  5 09:05 tfgreek2
-rw-r--r--  1 me  staff   177M Jul  5 09:21 tfgreek2.mql


Now the proof of the pudding: we let MQL eat it!

In [21]:
!cd $MQL_DIR; rm -rf tfgreek2 errors.txt; mql -b 3 < tfgreek2.mql >& errors.txt

In [22]:
!ls -lh $MQL_DIR

total 589936
-rw-r--r--  1 me  staff   3.0K Jul  5 09:22 errors.txt
-rw-r--r--  1 me  staff    81M Jul  5 09:22 tfgreek2
-rw-r--r--  1 me  staff   177M Jul  5 09:21 tfgreek2.mql


A very small "error" file:

In [23]:
!cd $MQL_DIR; cat errors.txt

Dropping indices on word_objects...!
+------------------------+
| object_count : integer |
+------------------------+
| 50000                  |
+------------------------+
+------------------------+
| object_count : integer |
+------------------------+
| 50000                  |
+------------------------+
+------------------------+
| object_count : integer |
+------------------------+
| 37779                  |
+------------------------+
Creating indices on word_objects...!
Dropping indices on subphrase_objects...!
+------------------------+
| object_count : integer |
+------------------------+
| 50000                  |
+------------------------+
+------------------------+
| object_count : integer |
+------------------------+
| 50000                  |
+------------------------+
+------------------------+
| object_count : integer |
+------------------------+
| 16178                  |
+------------------------+
Creating indices on subphrase_objects...!
Dropping indices on phrase_objec

All went well!

# Concluding remarks

I will put warnings into the MQL exporting code for integers that are out of bounds.
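As an illustration of what such a check might look like, here is a minimal sketch that counts the values of an integer feature that fall outside a symmetric signed 32-bit range. The bounds and the helper name `count_out_of_bounds` are assumptions for this sketch; it is not the code that went into Text-Fabric.

```python
# Assumed bounds: a symmetric signed 32-bit range.
MQL_INT_MIN = -(2**31 - 1)  # -2147483647
MQL_INT_MAX = 2**31 - 1  # 2147483647


def count_out_of_bounds(values, low=MQL_INT_MIN, high=MQL_INT_MAX):
    """Count the integers in `values` that are less than `low` or larger than `high`."""
    return sum(1 for value in values if value < low or value > high)


# A few made-up feature values, one of which is the offending nodeid.
feature_values = [67, 390880, 400010200010490, 2147483647]
n_bad = count_out_of_bounds(feature_values)

if n_bad:
    print(f"integer feature has {n_bad} values out of bounds")
```

An exporter can run such a check per integer feature before writing any MQL, and abort early instead of producing a file that the MQL parser will reject.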

The values of `nodeid` do not seem to be intended as numbers but rather as codes: there is some structure in them.
Therefore it is probably better to treat them as strings anyway.

If your application needs to compute with them, the application code should unravel those strings into tuples of smaller integers.
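A minimal sketch of such unravelling is given below. The field widths are purely hypothetical: the actual structure of the `nodeid` codes is not documented here, so check the widths against the XML source before relying on them. The helper name `split_code` is also an assumption.

```python
def split_code(code, widths):
    """Split a fixed-width numeric code into a tuple of integers.

    `widths` gives the number of digits of each field, from left to right.
    """
    if sum(widths) != len(code):
        raise ValueError(f"code {code!r} does not have {sum(widths)} digits")
    parts = []
    position = 0
    for width in widths:
        parts.append(int(code[position : position + width]))
        position += width
    return tuple(parts)


# HYPOTHETICAL field widths, chosen only so that they add up to 15 digits.
HYPOTHETICAL_WIDTHS = (2, 3, 3, 3, 4)

print(split_code("400010200010490", HYPOTHETICAL_WIDTHS))  # (40, 1, 20, 1, 490)
```

Each component is now a small integer that fits comfortably in 32 bits, so the components can be compared or sorted without handling the full code as one large number.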

# Test the improved MQL export function

We revert the offending feature `nodeid` back to an integer valued one, and see what the MQL export
has to tell us.

In [3]:
A = use("saulocantanhede/tfgreek2:clone", checkout="clone", version="0.5.9", hoist=globals())

**Locating corpus resources ...**

   |     0.03s T nodeid               from ~/github/saulocantanhede/tfgreek2/tf/0.5.9


Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [7]:
A.exportMQL("tfgreek2", MQL_DIR)

  0.00s Checking features of dataset tfgreek2


   |    1m 50s integer feature "nodeid" has 5558 values less than -2147483647 or larger than 2147483647
  0.04s Export to MQL aborted


Now we change the value type of `nodeid` back to `str`:

In [8]:
A = use("saulocantanhede/tfgreek2:clone", checkout="clone", version="0.5.9", hoist=globals())

**Locating corpus resources ...**

   |     0.04s T nodeid               from ~/github/saulocantanhede/tfgreek2/tf/0.5.9


Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [9]:
A.exportMQL("tfgreek2", MQL_DIR)

  0.00s Checking features of dataset tfgreek2
  0.04s 57 features to export to MQL ...
  0.04s Loading 57 features
  0.05s Writing enumerations
	after          :   19 values, 19 not a name, e.g. « »
	before         :    4 values, 4 not a name, e.g. «(»
	bookshort      :   27 values, 11 not a name, e.g. «1CO»
	criticalsign   :    7 values, 7 not a name, e.g. «(»
	function       :  429 values, 418 not a name, e.g. «Cmpl-Cmpl»
	note           :    1 values, 1 not a name, e.g. «discontinuous discourse»
	punctuation    :    5 values, 5 not a name, e.g. « »
	role           :   14 values, 1 not a name, e.g. «err_clause-complex-met-no-conditionsClCl2»
	rule           :  817 values, 668 not a name, e.g. «12Np»
	typems         :   16 values, 6 not a name, e.g. «apposition-group»
	variant        :    2 values, 2 not a name, e.g. «1»
   |     0.23s Writing an all-in-one enum with   98 values
  0.27s Mapping 57 features onto 10 object types
  0.84s Writing 57 features as data in 10 object types
   