
# Expose math lexemes in final output #947

Merged
merged 12 commits into from Jul 27, 2018

## Conversation

This is a llamapun-motivated pull request. I have been creating some token models over arXiv, and would like to take them to the next level by incorporating an ASCII, linguistics-friendly representation of all formulas. To that end, I would like to dispense with any remnants of TeX, and instead use the closest possible form to the one used by latexml's grammar input.

This was simultaneously easier than I expected, in terms of introducing the feature in MathParser, as $lexemes already existed in a very near ideal form. But it was also harder than expected from the usability side, as I just couldn't find a way around introducing a new option called --mathlex, as well as a dedicated llamapun.sty.ltxml binding to avoid introducing a new option in the core latexml processing as well. If it wasn't for the MathParser changes, one could imagine the rest of the toggling UX as a separate dedicated plugin. But making new options from a plugin isn't that direct either, as it is inevitable to have to patch Config.pm.

But back to the fun part. With the current PR one can run for instance:

`latexmlc --whatsin=math --whatsout=math --pmml --preload=llamapun.sty --mathlex 'literal:P(E) = {n \choose k} p^k (1-p)^{ n-k}'`

and get back:

$P(E) = \binom{n}{k} p^{k} (1-p)^{n-k}$

UNKNOWN:P OPEN:( UNKNOWN:E CLOSE:) RELOP:equals OPEN:( FRACOP:binomial ARG:start UNKNOWN:n ARG:end ARG:start UNKNOWN:k ARG:end CLOSE:) UNKNOWN:p POSTSUPERSCRIPT:start UNKNOWN:k POSTSUPERSCRIPT:end OPEN:( NUMBER:1 ADDOP:minus UNKNOWN:p CLOSE:) POSTSUPERSCRIPT:start UNKNOWN:n ADDOP:minus UNKNOWN:k POSTSUPERSCRIPT:end

I think the lexeme string itself is also interesting to look at for complex examples (already here with \choose actually), so I will leave more comments.

### dginev commented Feb 7, 2018 • edited

Ok, I think I have this example nailed down. Deleted some of my intermediate comments to reduce clutter.

Again, for the TeX formula:

$ P(E) = {n \choose k} p^k (1-p)^{ n-k} $

The resulting lexemes are now:

UNKNOWN:P OPEN:( UNKNOWN:E CLOSE:) RELOP:equals OPEN:( FRACOP:binomial ARG:start UNKNOWN:n ARG:end ARG:start UNKNOWN:k ARG:end CLOSE:) UNKNOWN:p POSTSUPERSCRIPT:start UNKNOWN:k POSTSUPERSCRIPT:end OPEN:( NUMBER:1 ADDOP:minus UNKNOWN:p CLOSE:) POSTSUPERSCRIPT:start UNKNOWN:n ADDOP:minus UNKNOWN:k POSTSUPERSCRIPT:end

which feels ideal. Hence requesting feedback from @brucemiller:

1. Do you think lowercasing the roles for this export makes sense? (minor...)
2. More major: I decided to use the second child of the XMDual, so that I preserve the presentational cues from LaTeX. Do you think that makes sense, given that the idea is to expose the forms to linguistic tools (e.g. external grammars, token models)? The alternative would be the first, "content", child of the dual.
3. My start/end "fence tokens" are named a bit ad hoc, slightly abusing the role:meaning convention, as I am using start/end as the meaning entry. Any intuitions about what would be better naming? I kind of like this approach as it feels somewhat self-explanatory.
4. Any high-level feedback about merging the PR?
5. Would you like tests added? I may want to try some more complex formulas, e.g. all the formulas in my equations example at http://latexml.mathweb.org/editor, and possibly turn them into tests... but I will wait for feedback before I jump on that.
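To make the `ROLE:meaning` convention and the start/end fence tokens concrete, here is a minimal Python sketch of how a downstream tool might read the space-separated stream. It is not part of the PR (LaTeXML itself is Perl); the names `Lexeme`, `parse_lexemes` and `fences_balanced` are hypothetical, and the sketch assumes only what the comment above shows: tokens are separated by whitespace, and the first colon separates the role from the meaning.

```python
from typing import List, NamedTuple, Optional


class Lexeme(NamedTuple):
    """One token of the stream; role is None when no 'ROLE:' prefix is present."""
    role: Optional[str]
    meaning: str


def parse_lexemes(stream: str) -> List[Lexeme]:
    """Split a space-separated lexeme stream into (role, meaning) pairs."""
    lexemes = []
    for token in stream.split():
        # Split on the first colon only, so a meaning such as ',' or '|' survives.
        role, sep, meaning = token.partition(":")
        lexemes.append(Lexeme(role, meaning) if sep else Lexeme(None, token))
    return lexemes


def fences_balanced(lexemes: List[Lexeme]) -> bool:
    """Check that every ROLE:start fence is closed by a matching ROLE:end."""
    stack = []
    for lex in lexemes:
        if lex.meaning == "start":
            stack.append(lex.role)
        elif lex.meaning == "end":
            if not stack or stack.pop() != lex.role:
                return False
    return not stack


if __name__ == "__main__":
    stream = "UNKNOWN:p POSTSUPERSCRIPT:start UNKNOWN:k POSTSUPERSCRIPT:end"
    for lexeme in parse_lexemes(stream):
        print(lexeme.role, lexeme.meaning)
```

A balance check like this is one cheap way to sanity-test longer streams, since the fences are the only structure the otherwise flat serialization carries.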
changed the title from "[demo] Expose math lexemes in final output" to "Expose math lexemes in final output" Feb 7, 2018

### dginev commented Feb 7, 2018 • edited

The Cauchy-Schwarz Inequality also looks decently lexematized with the current state:

OPEN:( SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k RELOP:equals NUMBER:1 POSTSUBSCRIPT:end POSTSUPERSCRIPT:start UNKNOWN:n POSTSUPERSCRIPT:end UNKNOWN:a POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end UNKNOWN:b POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end CLOSE:) POSTSUPERSCRIPT:start NUMBER:2 POSTSUPERSCRIPT:end RELOP:less-than-or-equals OPEN:( SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k RELOP:equals NUMBER:1 POSTSUBSCRIPT:end POSTSUPERSCRIPT:start UNKNOWN:n POSTSUPERSCRIPT:end UNKNOWN:a POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end POSTSUPERSCRIPT:start NUMBER:2 POSTSUPERSCRIPT:end CLOSE:) OPEN:( SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k RELOP:equals NUMBER:1 POSTSUBSCRIPT:end POSTSUPERSCRIPT:start UNKNOWN:n POSTSUPERSCRIPT:end UNKNOWN:b POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end POSTSUPERSCRIPT:start NUMBER:2 POSTSUPERSCRIPT:end CLOSE:)

### dginev commented Feb 7, 2018 • edited

After getting a coffee ☕ it seems a bit more obvious to rename things like POSTSUBSCRIPT:start / POSTSUBSCRIPT:end to OPEN:postsubscript / CLOSE:postsubscript, which is more consistent with the internal latexml convention, and similarly for OPEN:arg / CLOSE:arg and the rest. Does that sound better?

### dginev commented Feb 7, 2018 • edited

Got another fix in by inspecting the Ramanujan identity in the examples I mentioned. Here is the result (still using the start/end lexemes, until we discuss). Again, I indented by hand for easier reading; this all comes out as a space-separated stream of tokens in the annotation element. It's completely unparsed in the actual output.
FRACOP:divide ARG:start NUMBER:1 ARG:end ARG:start OPEN:( UNKNOWN:square-root ARG:start UNKNOWN:phi UNKNOWN:square-root ARG:start NUMBER:5 ARG:end ARG:end ADDOP:minus UNKNOWN:phi CLOSE:) UNKNOWN:e POSTSUPERSCRIPT:start FRACOP:divide ARG:start NUMBER:2 ARG:end ARG:start NUMBER:5 ARG:end UNKNOWN:pi POSTSUPERSCRIPT:end ARG:end RELOP:equals NUMBER:1 ADDOP:plus FRACOP:divide ARG:start UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:2 UNKNOWN:pi POSTSUPERSCRIPT:end ARG:end ARG:start NUMBER:1 ADDOP:plus FRACOP:divide ARG:start UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:4 UNKNOWN:pi POSTSUPERSCRIPT:end ARG:end ARG:start NUMBER:1 ADDOP:plus FRACOP:divide ARG:start UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:6 UNKNOWN:pi POSTSUPERSCRIPT:end ARG:end ARG:start NUMBER:1 ADDOP:plus FRACOP:divide ARG:start UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:8 UNKNOWN:pi POSTSUPERSCRIPT:end ARG:end ARG:start NUMBER:1 ADDOP:plus ID:ldots ARG:end ARG:end ARG:end ARG:end

### dginev commented Feb 7, 2018

More special cases handled, example:

$ x,\quad\quad \text{for} |q|<1. $

lexemes in output:

UNKNOWN:x PUNCT:, ATOM:for VERTBAR:| UNKNOWN:q VERTBAR:| RELOP:less-than NUMBER:1 PERIOD:.

### brucemiller commented Feb 8, 2018

Sounds cool! Could really drive and enhance some upcoming development... or clash with it? Perhaps some offlist discussion would be good.

### dginev commented Feb 8, 2018

Definitely a more exploratory piece of work here, happy to discuss at length in a more casual setting than infinitely long github comments 😂 I have high hopes it is in full synergy with other upcoming upgrades, or adaptable to such a point.

changed the title from "Expose math lexemes in final output" to "[Demo] Expose math lexemes in final output" Mar 1, 2018

### brucemiller commented Apr 3, 2018

So, I'm inclined to go ahead and accept this PR on the grounds that it "can't hurt", even though it's clearly not finished. It'd be good to be able to play with it!
Object soon if you don't agree :>

### dginev commented Apr 3, 2018

I am in favor! Will rebase it quickly.

force-pushed the dginev:math-lexemes branch from 021593a to b513754 Apr 3, 2018

### dginev commented Apr 3, 2018

Rebase completed, it was a single space indent...

### dginev commented Apr 3, 2018

That said, experimenting with a branch can often be easier than doing so on master, as one ends up landing changes to the feature somewhat chaotically in between main development. Wrapping up at least the initial specification for the lexemes before merging here has its merits. You can always do:

`git checkout math-lexemes`
`cpanm . --notest`

to quickly test with this branch -- or make and run blib/script...

### brucemiller commented Apr 4, 2018

Hmm.. XMDual. Not sure what the appropriate representation should be for pre-parsed output. For post-parsed, I'd thought that it should give rise to 2 separate lexeme sequences: the presentational & semantic ones.

### dginev commented Apr 4, 2018

Yes, XMDual is tricky. This PR takes the approach that the lexeme is what "the author wrote", with the intent of preserving any semantic hints, but not incorporating any guesses, be they about lexical semantics, or parse trees. Would need to look into specific examples of duals, but preserving the token sequence coming from the gullet would be the ideal solution from my current perspective.

### dginev commented Apr 4, 2018

I remembered my favourite dual example: piecewise functions. So here is the code for a simple indicator function for example:

\mathbf {1} _{A}(x) := \begin{cases} 1 & \text{if } x \in A,\\ 0 & \text{if } x \notin A. \end{cases}

The XML is nothing to sniff at, so attaching it as a gist here or the comment will get huge.
When generating the lexemes with the current PR, we get:

NUMBER:1 POSTSUBSCRIPT:start UNKNOWN:A POSTSUBSCRIPT:end OPEN:( UNKNOWN:x CLOSE:) RELOP:assign OPEN:{ NUMBER:1 ATOM:if UNKNOWN:x RELOP:element-of UNKNOWN:A PUNCT:, NUMBER:0 ATOM:if UNKNOWN:x RELOP:not-element-of UNKNOWN:A PERIOD:.

As usual, line-breaks and indentation for the lexemes are manually done by me, for easier reading. It's just a space-separated line in the output.

### dginev commented Apr 4, 2018

While I am unsure if this representation of the dual is the best we can do, it certainly isn't terrible: all visual elements are preserved. One question is whether the {cases} delimiters need to be wrapped more explicitly (with ARG:start and ARG:end as done for e.g. fraction arguments), especially given that the line-break isn't preserved as a token and you need to rely on the comma punctuation. Another minor bit is whether I need to escape all spaces inside the text content of lexemes, so that I get ATOM:if\ and preserve the space given in the \text macro. I think when I wrote the XMDual handler, I explicitly made the choice to use the second, presentation, branch in order to stay true to the authored input.

### brucemiller commented Apr 9, 2018

Ugh! My local copy of dlmf stuff is already updated to master... does this need some sort of rebase?

force-pushed the dginev:math-lexemes branch from b513754 to 3122973 Apr 9, 2018

### dginev commented Apr 9, 2018

Sure, I rebased now for convenience. (still working on the moderncv branch btw, will try to wrap up tomorrow, but this one is pending feedback)

### brucemiller commented Apr 9, 2018

Curious: how do you do that? I did a git rebase which rebased it to your master, I assume. How do I rebase it to the original master (something about "upstream", I assume)?

### dginev commented Apr 9, 2018 • edited

Steps I take from the checkout of my fork (my fork is the origin remote, and your master is my upstream remote).
First I update my fork's master to yours:

`git fetch upstream master`
`git checkout master`
`git rebase upstream/master`
`git push`

Then I rebase the branch in question:

`git checkout math-lexemes`
`git rebase master`
`git push origin math-lexemes --force`

It tends to require force, as the rebase isn't reconcilable with the github-hosted branch when pushed. (which seems to be a weird limitation...)

### dginev commented Apr 9, 2018

I tend to be painfully explicit with the rebase parameters, to be certain I know what git will do before I run the command, since wrong rebases can be a pain to rewind back (bunch of resets...)

### brucemiller commented Apr 9, 2018

Ah, I see... OK, now I've got a whole mess of lexemes embedded in an xml file. Hmmm....

### dginev commented Apr 9, 2018 • edited

Sounds about right. I am indenting them by hand to proof them, as they're in this unparsed linearized form. But as it happens the space-separated stream is perfect for transferring the lexemes over to linguistic applications, grammars, etc.

### brucemiller commented Apr 9, 2018

for immediate purposes, I may just hack MathParser.pm to print the lexeme streams for each numbered eqn to a file...

### dginev commented Apr 9, 2018 • edited

> for immediate purposes, I may just hack MathParser.pm to print the lexeme streams for each numbered eqn to a file...

I take it you're in a hurry :> It may be easier to debug that way, but you can also write a tiny Perl script using XML::LibXML that snipes the attributes from the result XML file from LaTeXML.

### brucemiller commented Apr 10, 2018

Sure, OK, the refactoring can wait till there's something to go between. But I'd still like to make it slightly more complicated: namely, separate presentation & content traversals to build lexeme sequences after parsing, and the ability to store both of those as well for later annotation.

### dginev commented Apr 10, 2018

Cool, sure.
I'm surprised at the interest, given how little love my "semantic tex attribute" issue got back in the day (see #432)

### brucemiller commented Apr 10, 2018

? Had nothing to do with lack of love for the attributes, but rather lack of love for the limitations of MathML annotations.

### dginev commented Apr 11, 2018

Sure, it just never took off, that's all I meant. If we start adding multiple lexeme serializations, we can consider adding multiple tex serializations too while we're at it.

mentioned this pull request Apr 22, 2018

### dginev commented May 30, 2018 • edited

Unwinding back to the top of the discussion, just bookkeeping here that the PR is considered blocked until we also transfer any relevant font+style information to the lexeme serialization. So that, as a basic example, bold x is serialized as a distinct lexeme from a regular mathematical x. The key observation is that, while font+style differences in regular language mainly imply emphasis or a structural role (e.g. headings) for the same word, in mathematical expressions the underlying symbol is different; neither emphasis nor structural information is employed in formulas. (counter-examples are welcome)

### dginev commented Jul 6, 2018 • edited

Let me add another interesting example I hadn't thought of when preparing the PR: over-accents, which are not scripts. In particular, the conical function syntax in the left-hand side of equation 14.20.2 of the DLMF.

Here is the current lexeme string produced with the PR, with my manual indent as usual:

OVERACCENT:widehat ARG:start UNKNOWN:Q ARG:end POSTSUPERSCRIPT:start ADDOP:minus UNKNOWN:mu POSTSUPERSCRIPT:end POSTSUBSCRIPT:start ADDOP:minus FRACOP:divide ARG:start NUMBER:1 ARG:end ARG:start NUMBER:2 ARG:end ADDOP:plus UNKNOWN:i UNKNOWN:tau POSTSUBSCRIPT:end

This seems reasonable I think?
Just jotting it down as an extra example (and archival)

added 6 commits Feb 6, 2018 … works

force-pushed the dginev:math-lexemes branch from 3122973 to a5e9c0f Jul 9, 2018

added 2 commits Jul 9, 2018

changed the title from "[Demo] Expose math lexemes in final output" to "Expose math lexemes in final output" Jul 9, 2018

### dginev commented Jul 9, 2018 • edited

Greetings! I have just added a piece of code that I would like to claim could move this PR from "demo" to "experimental", and allow it to enter a release, even in this imperfect form. I am of course also happy to reimplement if I get actionable feedback!

The additions from today:

1. I added a crude approximation to "obtaining the Font information" for the leaf node lexeme serialization. I am sure Bruce knows a better way, but this was a quick stab that seems to work decently well for basic examples.
2. I dropped the UNKNOWN role from the serialization, seeking some compromise ground after I got feedback that the grammatical role feels unnecessary. I still consider it useful to have the full assumed grammatical information that latexml's MathParser has, in order to enable reasonable comparisons between math grammars. Dropping unknown doesn't lose anything in this regard, as one can assume any lexeme without a role is unknown.
3. Similarly, I concatenated the font information via a dash, to contrast between it and the colon separator for the role.

Perfect? No. But good enough for meaningful experiments? I think so, and can name two separate classes that I would be personally interested in.

Revealing example:

`latexmlc --preload=bm.sty --preload=llamapun.sty --pmml --mathlex 'literal:$x+\boldsymbol{y}=0$'`

produces the lexemes:

italic-x ADDOP:plus bold-italic-y RELOP:equals NUMBER:0

One funny quirk of the current code is that if you replace the plus with a \bigoplus, you get the lexeme SUMOP:160%-direct-sum. Which feels a bit too precise!
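For readers consuming this revised syntax, the following Python sketch shows one way to split a token into its optional role, optional font prefix, and symbol. It is an illustration only, not code from the PR. Because the dash is also used inside meanings such as `direct-sum` or `less-than`, the sketch relies on a hypothetical `FONT_WORDS` vocabulary to decide where the font prefix ends; the real set of font names that LaTeXML can emit is not given in this thread, so treat that set as an assumption.

```python
from typing import Optional, Tuple

# Hypothetical vocabulary of font components; the actual set emitted by LaTeXML
# is not listed in this discussion.
FONT_WORDS = {"bold", "italic", "caligraphic", "fraktur", "blackboard", "sansserif", "typewriter"}


def split_lexeme(token: str) -> Tuple[Optional[str], str, str]:
    """Split 'ROLE:font-...-symbol' into (role, font, symbol).

    A token without a colon has no role (to be read as unknown), and a token
    without recognized leading font words has an empty font string.
    """
    role, sep, rest = token.partition(":")
    if not sep:
        role, rest = None, token
    parts = rest.split("-")
    font = []
    # Consume leading font words, but always leave at least one part as the symbol.
    while len(parts) > 1 and parts[0] in FONT_WORDS:
        font.append(parts.pop(0))
    return role, "-".join(font), "-".join(parts)


if __name__ == "__main__":
    stream = "italic-x ADDOP:plus bold-italic-y RELOP:equals NUMBER:0"
    for token in stream.split():
        print(split_lexeme(token))
```

The need for a fixed vocabulary is a side effect of reusing the dash as a separator; a distinct separator for fonts would make the split unambiguous without one.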
added 2 commits Jul 9, 2018
reviewed
    $text = 'Unknown' unless defined $text;
    my $lexeme = $role . ":" . $text . ":" . ++$i;
    $lexeme =~ s/\s//g;
    my $lexeme = $self->node_to_lexeme($node) . ":" . ++$i;

#### dginev Jul 9, 2018 Author Collaborator

ah, I forgot I was reusing the method in the internal parse, I need to be a little more careful here - my font changes broke a test I'm afraid.

### dginev commented Jul 9, 2018 • edited

 Removed a small hiccup, now fully separating the experimental lexeme syntax serialization from the internal parser lexemes. Could be cleaned further, but I prefer waiting for feedback. Tests should pass again.

### brucemiller commented Jul 11, 2018

 Yeah, the font size stuff is silly! :> I'd thought there was the beginnings of a notion of math-meaningful font attributes, but can't quite find it. I do find the inverse: there's a couple of methods in Common::Font regarding "purestyle", which give the parts of a font which are (presumably) not meaningful. Perhaps that's a good place to start with a new method to get the meaningful parts, which would be only family, series, shape, I guess.

### dginev commented Jul 18, 2018

 Quick update here, as discussed, now explicitly only using the family, series and shape of the font, as relative to the default text font. The example from my previous comments with the direct sum now produces: italic-x SUMOP:direct-sum bold-italic-y RELOP:equals NUMBER:0 which seems quite reasonable.
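The rule described here (keep only family, series and shape, and only where they differ from the default font) can be sketched in a few lines of Python. This is not the Perl code merged in the PR; `DEFAULT_FONT`, `meaningful_font_prefix` and `serialize` are hypothetical names, and the default values chosen below are assumptions made so the example reproduces the lexemes quoted in the thread.

```python
from typing import Dict, Optional

# Assumed default text font; the actual defaults used by LaTeXML may differ.
DEFAULT_FONT = {"family": "serif", "series": "medium", "shape": "upright"}

# Only these components are treated as meaningful for math; size, color, etc. are ignored.
MEANINGFUL_KEYS = ("family", "series", "shape")


def meaningful_font_prefix(font: Dict[str, str], default: Dict[str, str] = DEFAULT_FONT) -> str:
    """Dash-join the family/series/shape values that differ from the default font."""
    parts = [font[key] for key in MEANINGFUL_KEYS
             if key in font and font[key] != default.get(key)]
    return "-".join(parts)


def serialize(symbol: str, font: Dict[str, str], role: Optional[str] = None) -> str:
    """Build 'ROLE:font-symbol', omitting the role and font prefix when absent."""
    prefix = meaningful_font_prefix(font)
    text = f"{prefix}-{symbol}" if prefix else symbol
    return f"{role}:{text}" if role else text


if __name__ == "__main__":
    italic = {"family": "serif", "series": "medium", "shape": "italic"}
    big = {"family": "serif", "series": "medium", "shape": "upright", "size": "160%"}
    print(serialize("x", italic))                 # font differs in shape only
    print(serialize("direct-sum", big, "SUMOP"))  # size is ignored
```

Filtering against a default keeps ordinary upright operators free of any font prefix, which is why the oversized `\bigoplus` no longer carries its size into the lexeme.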
merged commit 44ecafb into brucemiller:master Jul 27, 2018
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
deleted the dginev:math-lexemes branch Apr 15, 2019
mentioned this pull request Oct 12, 2020