Expose math lexemes in final output #947

Merged
merged 12 commits into brucemiller:master from dginev:math-lexemes Jul 27, 2018

2 participants
@dginev
Collaborator

dginev commented Feb 6, 2018

This is a llamapun-motivated pull request. I have been creating some token models over arXiv, and would like to take them to the next level by incorporating an ASCII, linguistics-friendly representation of all formulas.

To that end, I would like to dispense with any remnants of TeX, and instead use the closest possible form to the one used by latexml's grammar input.

This was simultaneously easier than I expected, in terms of introducing the feature in MathParser, as $lexemes already existed in a very nearly ideal form, and harder than expected from the usability side: I just couldn't find a way around introducing a new option called --mathlex, as well as a dedicated llamapun.sty.ltxml binding to avoid introducing a new option in the core latexml processing as well.

If it wasn't for the MathParser changes, one could imagine the rest of the toggling UX as a separate dedicated plugin. But making new options from a plugin isn't that direct either... as it is inevitable to have to patch Config.pm.

But back to the fun part. With the current PR one can run for instance:

latexmlc \
   --whatsin=math --whatsout=math --pmml \
   --preload=llamapun.sty  --mathlex \
   'literal:P(E) = {n \choose k} p^k (1-p)^{ n-k}'

and get back:

<Math xmlns="http://dlmf.nist.gov/LaTeXML" xmlns:m="http://www.w3.org/1998/Math/MathML" mode="inline" xml:id="p1.m1" tex="P(E)={n\choose k}p^{k}(1-p)^{n-k}" text="P * E = (n binomial k) * p ^ k * (1 - p) ^ (n - k)" lexemes="UNKNOWN:P OPEN:( UNKNOWN:E CLOSE:) RELOP:equals OPEN:( FRACOP:binomial ARG:start UNKNOWN:n ARG:end ARG:start UNKNOWN:k ARG:end  CLOSE:)  UNKNOWN:p POSTSUPERSCRIPT:start UNKNOWN:k POSTSUPERSCRIPT:end OPEN:( NUMBER:1 ADDOP:minus UNKNOWN:p CLOSE:) POSTSUPERSCRIPT:start UNKNOWN:n ADDOP:minus UNKNOWN:k POSTSUPERSCRIPT:end" fragid="p1.m1">
  <m:math alttext="P(E)={n\choose k}p^{k}(1-p)^{n-k}" display="inline">
    <m:semantics>
      <m:mrow>
        <m:mrow>
          <m:mi>P</m:mi>
          <m:mo>⁢</m:mo>
          <m:mrow>
            <m:mo stretchy="false">(</m:mo>
            <m:mi>E</m:mi>
            <m:mo stretchy="false">)</m:mo>
          </m:mrow>
        </m:mrow>
        <m:mo>=</m:mo>
        <m:mrow>
          <m:mrow>
            <m:mo>(</m:mo>
            <m:mfrac linethickness="0pt">
              <m:mi>n</m:mi>
              <m:mi>k</m:mi>
            </m:mfrac>
            <m:mo>)</m:mo>
          </m:mrow>
          <m:mo>⁢</m:mo>
          <m:msup>
            <m:mi>p</m:mi>
            <m:mi>k</m:mi>
          </m:msup>
          <m:mo>⁢</m:mo>
          <m:msup>
            <m:mrow>
              <m:mo stretchy="false">(</m:mo>
              <m:mrow>
                <m:mn>1</m:mn>
                <m:mo>-</m:mo>
                <m:mi>p</m:mi>
              </m:mrow>
              <m:mo stretchy="false">)</m:mo>
            </m:mrow>
            <m:mrow>
              <m:mi>n</m:mi>
              <m:mo>-</m:mo>
              <m:mi>k</m:mi>
            </m:mrow>
          </m:msup>
        </m:mrow>
      </m:mrow>
      <m:annotation encoding="application/x-llamapun">UNKNOWN:P OPEN:( UNKNOWN:E CLOSE:) RELOP:equals OPEN:( FRACOP:binomial ARG:start UNKNOWN:n ARG:end ARG:start UNKNOWN:k ARG:end  CLOSE:)  UNKNOWN:p POSTSUPERSCRIPT:start UNKNOWN:k POSTSUPERSCRIPT:end OPEN:( NUMBER:1 ADDOP:minus UNKNOWN:p CLOSE:) POSTSUPERSCRIPT:start UNKNOWN:n ADDOP:minus UNKNOWN:k POSTSUPERSCRIPT:end</m:annotation>
    </m:semantics>
  </m:math>
</Math>

I think the lexeme string itself is also interesting to look at for complex examples (already here with \choose actually), so I will leave more comments.
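
For readers who want to process the lexeme string downstream, here is a minimal Python sketch of splitting it into (role, meaning) pairs. It assumes only what the example above shows: items are separated by whitespace and each item has the form ROLE:meaning. The name `split_lexemes` is hypothetical and is not part of LaTeXML or llamapun.

```python
def split_lexemes(stream):
    """Split a space-separated lexeme stream into (role, meaning) pairs."""
    pairs = []
    for item in stream.split():
        # Split on the first colon only, so meanings such as "(" survive intact.
        role, _, meaning = item.partition(":")
        pairs.append((role, meaning))
    return pairs


example = "UNKNOWN:P OPEN:( UNKNOWN:E CLOSE:) RELOP:equals"
for role, meaning in split_lexemes(example):
    print(role, meaning)
```

Using `str.split()` without arguments also absorbs the occasional double space that appears in the stream.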

@dginev

Collaborator

dginev commented Feb 7, 2018

Ok, I think I have this example nailed down. Deleted some of my intermediate comments to reduce clutter.

Again, for the tex formula:

$ P(E) = {n \choose k} p^k (1-p)^{ n-k} $

The resulting lexemes are now:

UNKNOWN:P OPEN:( UNKNOWN:E CLOSE:) RELOP:equals

OPEN:( FRACOP:binomial ARG:start UNKNOWN:n ARG:end ARG:start UNKNOWN:k ARG:end  CLOSE:)

UNKNOWN:p POSTSUPERSCRIPT:start UNKNOWN:k POSTSUPERSCRIPT:end 

OPEN:( NUMBER:1 ADDOP:minus UNKNOWN:p CLOSE:) 

POSTSUPERSCRIPT:start UNKNOWN:n ADDOP:minus UNKNOWN:k POSTSUPERSCRIPT:end

which feels ideal. Hence requesting feedback from @brucemiller :

  • Do you think lowercasing the roles for this export makes sense? (minor...)

  • More major: I decided to use the second child of the XMDual, so that I preserve the presentational cues from latex, as opposed to using the first "content" child of the dual. Do you think that makes sense, given that the idea is to expose the forms to linguistic tools (e.g. external grammars, token models)?

  • My start/end "fence tokens" are named a bit ad-hoc, abusing a bit the role:meaning convention, as I am using start/end as the meaning entry. Any intuitions what would be better naming? I kind of like this approach as it feels somewhat self-explanatory.

  • Any high level feedback about merging the PR? Would you like tests added?

I may want to try some more complex formulas, e.g. all the formulas in my equations example at http://latexml.mathweb.org/editor, possibly turn them into tests... but will wait for feedback before I jump on that.

@dginev dginev changed the title from [demo] Expose math lexemes in final output to Expose math lexemes in final output Feb 7, 2018

@dginev

Collaborator

dginev commented Feb 7, 2018

The Cauchy-Schwarz Inequality also looks decently lexematized with the current state:
[image: csineq]

OPEN:( 
  SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k RELOP:equals NUMBER:1 POSTSUBSCRIPT:end
   POSTSUPERSCRIPT:start UNKNOWN:n POSTSUPERSCRIPT:end
  UNKNOWN:a POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end 
  UNKNOWN:b POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end 
CLOSE:) POSTSUPERSCRIPT:start NUMBER:2 POSTSUPERSCRIPT:end 

RELOP:less-than-or-equals 

OPEN:( 
  SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k RELOP:equals NUMBER:1 POSTSUBSCRIPT:end 
  POSTSUPERSCRIPT:start UNKNOWN:n POSTSUPERSCRIPT:end
  UNKNOWN:a POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end 
POSTSUPERSCRIPT:start NUMBER:2 POSTSUPERSCRIPT:end 
CLOSE:) 

OPEN:( 
  SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k RELOP:equals NUMBER:1 POSTSUBSCRIPT:end 
  POSTSUPERSCRIPT:start UNKNOWN:n POSTSUPERSCRIPT:end
  UNKNOWN:b POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end 
  POSTSUPERSCRIPT:start NUMBER:2 POSTSUPERSCRIPT:end
CLOSE:)
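
A quick sanity check one could run on streams like the one above is whether every X:start has a matching X:end. The following Python sketch assumes the start/end naming used so far in this thread; `fences_balanced` is a hypothetical helper, not something the PR provides.

```python
def fences_balanced(stream):
    """Return True if every ROLE:start is closed by a matching ROLE:end."""
    stack = []
    for item in stream.split():
        role, _, meaning = item.partition(":")
        if meaning == "start":
            stack.append(role)
        elif meaning == "end":
            # The most recently opened fence must have the same role.
            if not stack or stack.pop() != role:
                return False
    return not stack


print(fences_balanced("SUMOP:sum POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end"))
```

Ordinary OPEN:( and CLOSE:) tokens are deliberately ignored here, since authors can legitimately write unbalanced parentheses.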
@dginev

Collaborator

dginev commented Feb 7, 2018

After getting a coffee ☕️ it seems a bit more obvious to rename things like

  • POSTSUBSCRIPT:start / POSTSUBSCRIPT:end

into a form more consistent with the internal latexml convention

  • OPEN:postsubscript / CLOSE:postsubscript

and similarly for OPEN:arg / CLOSE:arg and the rest.

Does that sound better?
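
To make the proposed renaming concrete, here is a small Python sketch that rewrites the start/end items into the OPEN:/CLOSE: form suggested above. It is only an illustration of the mapping; `rename_fence` is a hypothetical name and the PR itself does not contain this code.

```python
def rename_fence(item):
    """Map ROLE:start / ROLE:end to OPEN:role / CLOSE:role; leave other items alone."""
    role, _, meaning = item.partition(":")
    if meaning == "start":
        return "OPEN:" + role.lower()
    if meaning == "end":
        return "CLOSE:" + role.lower()
    return item


stream = "UNKNOWN:a POSTSUBSCRIPT:start UNKNOWN:k POSTSUBSCRIPT:end"
print(" ".join(rename_fence(item) for item in stream.split()))
```

One consequence worth noting: after the rename, structural fences such as OPEN:arg share the OPEN role with authored delimiters such as OPEN:(, so consumers would have to tell them apart by the meaning part.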

@dginev

Collaborator

dginev commented Feb 7, 2018

Got another fix in by inspecting the Ramanujan identity in the examples I mentioned. Here is the result (still using the start/end lexemes, until we discuss):

[image: ramanujan]

Again, I indented by hand for easier reading; this all comes out as a space-separated stream of tokens in the annotation element. It's completely unparsed in the actual output.

FRACOP:divide 
ARG:start NUMBER:1 ARG:end 
ARG:start
    OPEN:( 
        UNKNOWN:square-root ARG:start
            UNKNOWN:phi UNKNOWN:square-root ARG:start NUMBER:5 ARG:end 
        ARG:end 
        ADDOP:minus UNKNOWN:phi 
    CLOSE:)
    UNKNOWN:e 
    POSTSUPERSCRIPT:start 
        FRACOP:divide ARG:start NUMBER:2 ARG:end ARG:start NUMBER:5 ARG:end 
        UNKNOWN:pi 
    POSTSUPERSCRIPT:end 
ARG:end

RELOP:equals

NUMBER:1 ADDOP:plus 
FRACOP:divide 
ARG:start
    UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:2 UNKNOWN:pi POSTSUPERSCRIPT:end 
ARG:end
ARG:start
    NUMBER:1 ADDOP:plus 
    FRACOP:divide 
    ARG:start
        UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:4 UNKNOWN:pi POSTSUPERSCRIPT:end 
    ARG:end 
    ARG:start
        NUMBER:1 ADDOP:plus 
        FRACOP:divide 
        ARG:start 
            UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:6 UNKNOWN:pi POSTSUPERSCRIPT:end 
        ARG:end 
        ARG:start 
            NUMBER:1 ADDOP:plus 
            FRACOP:divide 
            ARG:start
                UNKNOWN:e POSTSUPERSCRIPT:start ADDOP:minus NUMBER:8 UNKNOWN:pi POSTSUPERSCRIPT:end 
            ARG:end 
            ARG:start NUMBER:1 ADDOP:plus ID:ldots ARG:end
        ARG:end
    ARG:end
ARG:end
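
The hand indentation above can be approximated mechanically. This Python sketch puts one lexeme per line and indents by nesting depth, assuming the start/end fence naming shown in this comment; `indent_lexemes` is a hypothetical helper and its layout is plainer than the hand-made one.

```python
def indent_lexemes(stream, width=4):
    """Pretty-print a lexeme stream, one item per line, indented by fence depth."""
    lines = []
    depth = 0
    for item in stream.split():
        meaning = item.partition(":")[2]
        if meaning == "end":
            # Closing fences line up with their opening fence.
            depth = max(depth - 1, 0)
        lines.append(" " * (width * depth) + item)
        if meaning == "start":
            depth += 1
    return "\n".join(lines)


print(indent_lexemes("FRACOP:divide ARG:start NUMBER:1 ARG:end ARG:start NUMBER:2 ARG:end"))
```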
@dginev

Collaborator

dginev commented Feb 7, 2018

More special cases handled, example:

$ x,\quad\quad \text{for} |q|<1. $

lexemes in output:

UNKNOWN:x PUNCT:,
ATOM:for VERTBAR:| UNKNOWN:q VERTBAR:| RELOP:less-than NUMBER:1 PERIOD:.
@brucemiller

Owner

brucemiller commented Feb 8, 2018

Sounds cool! Could really drive and enhance some upcoming development... or clash with it? Perhaps some offlist discussion would be good.

@dginev

Collaborator

dginev commented Feb 8, 2018

Definitely a more exploratory piece of work here, happy to discuss at length in a more casual setting than infinitely long github comments 😂 I have high hopes it is in full synergy with other upcoming upgrades, or adaptable to such a point.

@dginev dginev changed the title from Expose math lexemes in final output to [Demo] Expose math lexemes in final output Mar 1, 2018

@brucemiller

Owner

brucemiller commented Apr 3, 2018

So, I'm inclined to go ahead and accept this PR on the grounds that it "can't hurt", even though it's clearly not finished. It'd be good to be able to play with it! Object soon if you don't agree :>

@dginev

Collaborator

dginev commented Apr 3, 2018

I am in favor! Will rebase it quickly.

@dginev dginev force-pushed the dginev:math-lexemes branch from 021593a to b513754 Apr 3, 2018

@dginev

Collaborator

dginev commented Apr 3, 2018

Rebase completed, it was a single space indent...

@dginev

Collaborator

dginev commented Apr 3, 2018

That said, experimenting with a branch can often be easier than doing so on master, as one ends up landing changes to the feature somewhat chaotically in between main development. Wrapping up at least the initial specification for the lexemes before merging here has its merits.

You can always do:

git checkout math-lexemes
cpanm . --notest

to quickly test with this branch -- or make and run blib/script...

@brucemiller

Owner

brucemiller commented Apr 4, 2018

Hmm.. XMDual. Not sure what the appropriate representation should be for pre-parsed output. For post-parsed, I'd thought that it should give rise to 2 separate lexeme sequences: the presentational & semantic ones.

@dginev

Collaborator

dginev commented Apr 4, 2018

Yes, XMDual is tricky. This PR takes the approach that the lexeme is what "the author wrote", with the intent of preserving any semantic hints, but not incorporating any guesses, be they about lexical semantics, or parse trees.

Would need to look into specific examples of duals, but preserving the token sequence coming from the gullet would be the ideal solution from my current perspective.

@dginev

Collaborator

dginev commented Apr 4, 2018

I remembered my favourite dual example - piecewise functions. So here is the code for a simple indicator function for example:

\mathbf {1} _{A}(x) :=
  \begin{cases}
   1 & \text{if } x \in A,\\
   0 & \text{if } x \notin A.
  \end{cases}

[image]

The XML is nothing to sniff at, so I am attaching it as a gist here; otherwise the comment would get huge.

When generating the lexemes with the current PR, we get:

NUMBER:1 
POSTSUBSCRIPT:start UNKNOWN:A POSTSUBSCRIPT:end
OPEN:( UNKNOWN:x CLOSE:) 
RELOP:assign 
OPEN:{ 
  NUMBER:1 ATOM:if UNKNOWN:x RELOP:element-of UNKNOWN:A PUNCT:,
  NUMBER:0 ATOM:if UNKNOWN:x RELOP:not-element-of UNKNOWN:A 
PERIOD:.

As usual, line-breaks and indentation for the lexemes are manually done by me, for easier reading - it's just a space-separated line in the output.

@dginev

Collaborator

dginev commented Apr 4, 2018

While I am unsure if this representation of the dual is the best we can do, it certainly isn't terrible - all visual elements are preserved.

One question is whether the {cases} delimiters need to be wrapped more explicitly (with ARG:start and ARG:end as done for e.g. fraction arguments), especially given that the line-break isn't preserved as a token and you need to rely on the comma punctuation.

Another minor bit is whether I need to escape all spaces inside the text content of lexemes, so that I get ATOM:if\ and preserve the space given in the \text macro.

I think when I wrote the XMDual handler, I explicitly made the choice to use the second, presentation, branch in order to stay true to the authored input.
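
On the question of escaping spaces inside lexeme text, here is one possible scheme as a Python sketch: spaces inside the text are written as a backslash followed by a space, and the consumer splits only on unescaped spaces. Both helper names are hypothetical, and the sketch assumes the text itself contains no backslashes, which would need their own escaping.

```python
import re


def escape_text(role, text):
    """Build a lexeme whose text keeps its spaces, each escaped with a backslash."""
    return role + ":" + text.replace(" ", "\\ ")


def split_escaped(stream):
    """Split a stream on runs of spaces that are not preceded by a backslash."""
    return [item for item in re.split(r"(?<!\\) +", stream) if item]


stream = " ".join(["NUMBER:1", escape_text("ATOM", "if "), "UNKNOWN:x"])
print(split_escaped(stream))
```

The price of this scheme is that a plain whitespace split no longer works on the stream, which is worth weighing against the benefit of preserving the space from \text.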

@brucemiller

Owner

brucemiller commented Apr 9, 2018

Ugh! My local copy of dlmf stuff is already updated to master... does this need some sort of rebase?

@dginev dginev force-pushed the dginev:math-lexemes branch from b513754 to 3122973 Apr 9, 2018

@dginev

Collaborator

dginev commented Apr 9, 2018

Sure, I rebased now for convenience. (still working on the moderncv branch btw, will try to wrap up tomorrow, but this one is pending feedback)

@brucemiller

Owner

brucemiller commented Apr 9, 2018

Curious: how do you do that? I did a git rebase which rebased it to your master, I assume. How do I rebase it to the original master (something about "upstream", I assume)?

@dginev

Collaborator

dginev commented Apr 9, 2018

Here are the steps I take from the checkout of my fork (which is origin; your master is my upstream remote).

  1. First I update my fork's master to yours:
git fetch upstream master
git checkout master
git rebase upstream/master
git push
  2. Then I rebase the branch in question:
git checkout math-lexemes
git rebase master
git push origin math-lexemes --force

It tends to require force, as the rebase isn't reconcilable with the github-hosted branch when pushed. (which seems to be a weird limitation...)

@dginev

Collaborator

dginev commented Apr 9, 2018

I tend to be painfully explicit with the rebase parameters, to be certain I know what git will do before I run the command, since wrong rebases can be a pain to rewind back (bunch of resets...)

@brucemiller

Owner

brucemiller commented Apr 9, 2018

Ah, I see... OK, now I've got a whole mess of lexemes embedded in an xml file. Hmmm....

@dginev

Collaborator

dginev commented Apr 9, 2018

Sounds about right - I am indenting them by hand to proof them, as they're in this unparsed linearized form. But as it happens the space separated stream is perfect for transferring the lexemes over to linguistic applications, grammars, etc.
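
As an illustration of why the space-separated form is convenient for token models, the following Python sketch counts lexeme bigrams over a list of streams using only the standard library. The function name `bigram_counts` and the two sample streams are made up for the example.

```python
from collections import Counter


def bigram_counts(streams):
    """Count adjacent lexeme pairs across several space-separated streams."""
    counts = Counter()
    for stream in streams:
        tokens = stream.split()
        # zip pairs each token with its successor within the same formula.
        counts.update(zip(tokens, tokens[1:]))
    return counts


samples = [
    "UNKNOWN:a ADDOP:plus UNKNOWN:b",
    "UNKNOWN:a ADDOP:plus NUMBER:1",
]
for pair, count in bigram_counts(samples).most_common(3):
    print(pair, count)
```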

@brucemiller

Owner

brucemiller commented Apr 9, 2018

for immediate purposes, I may just hack MathParser.pm to print the lexeme streams for each numbered eqn to a file...

@dginev

Collaborator

dginev commented Apr 9, 2018

> for immediate purposes, I may just hack MathParser.pm to print the lexeme streams for each numbered eqn to a file...

I take it you're in a hurry :> It may be easier to debug that way, but you can also write a tiny Perl script using XML::LibXML that snipes the attributes from the result XML file from LaTeXML.
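
The same idea can be sketched in Python with the standard library instead of Perl and XML::LibXML. The sketch relies on two things visible in the example output at the top of this thread: the LaTeXML namespace URI and the `lexemes` attribute on the Math element. The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

LTX_NS = "http://dlmf.nist.gov/LaTeXML"


def lexeme_streams(xml_text):
    """Collect the lexemes attribute of every Math element that has one."""
    root = ET.fromstring(xml_text)
    streams = []
    for math in root.iter("{%s}Math" % LTX_NS):
        lexemes = math.get("lexemes")
        if lexemes is not None:
            streams.append(lexemes)
    return streams


sample = (
    '<document xmlns="http://dlmf.nist.gov/LaTeXML">'
    '<Math lexemes="UNKNOWN:x ADDOP:plus NUMBER:1"/><Math/>'
    "</document>"
)
print(lexeme_streams(sample))
```

For a file on disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.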

@brucemiller

Owner

brucemiller commented Apr 10, 2018

So, what is it that we're disagreeing on?

@dginev

Collaborator

dginev commented Apr 10, 2018

Well we were mostly wrestling due to me requesting it :>

But I guess my point leans towards 1) "this PR can be simple", and 2) a MathParser refactor may be needed to make the serialization a utility, and to make a nicely pluggable "analysis" module that executes roughly around MathParser.

I was going to bring up the MathParser refactors I have been thinking of when we got to discussing grammars again at some later point.

@brucemiller

Owner

brucemiller commented Apr 10, 2018

Sure, OK, the refactoring can wait till there's something to go between. But I'd still like to make it slightly more complicated: Namely separate presentation & content traversals to build lexeme sequences after parsing, and the ability to store both of those as well for later annotation.
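
A rough Python sketch of what two separate traversals could look like, on a toy tree. The element names XMDual, XMTok and XMWrap follow LaTeXML, and the content-first, presentation-second child order is the one described earlier in this thread, but the toy input, the FUNCTION role and the helper `dual_lexemes` are invented for illustration; real LaTeXML output is namespaced and more involved.

```python
import xml.etree.ElementTree as ET

# Toy input, invented for illustration only.
TOY_XML = """
<XMath>
  <XMTok role="UNKNOWN">f</XMTok>
  <XMDual>
    <XMTok role="FUNCTION" meaning="cases"/>
    <XMWrap>
      <XMTok role="OPEN">{</XMTok>
      <XMTok role="NUMBER">1</XMTok>
    </XMWrap>
  </XMDual>
</XMath>
"""

CONTENT, PRESENTATION = 0, 1  # first and second child of an XMDual


def dual_lexemes(node, branch):
    """Serialize a tree to lexemes, following only one branch of each XMDual."""
    if node.tag == "XMDual":
        return dual_lexemes(list(node)[branch], branch)
    if node.tag == "XMTok":
        value = node.get("meaning") or (node.text or "").strip()
        return [node.get("role", "UNKNOWN") + ":" + value]
    lexemes = []
    for child in node:
        lexemes.extend(dual_lexemes(child, branch))
    return lexemes


root = ET.fromstring(TOY_XML)
print(dual_lexemes(root, CONTENT))
print(dual_lexemes(root, PRESENTATION))
```

Storing both results side by side would give the two sequences discussed here, with the presentation one matching what the PR currently emits.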

@dginev

Collaborator

dginev commented Apr 10, 2018

Cool, sure. I'm surprised at the interest, given how little love my "semantic tex attribute" issue got back in the day (see #432 )

@brucemiller

Owner

brucemiller commented Apr 10, 2018

? Had nothing to do with lack of love for the attributes, but rather lack of love for the limitations of MathML annotations.

@dginev

Collaborator

dginev commented Apr 11, 2018

Sure, it just never took off - that's all I meant. If we start adding multiple lexeme serializations, we can consider adding multiple tex serializations too while we're at it.

@dginev

Collaborator

dginev commented May 30, 2018


Unwinding back to the top of the discussion, just bookkeeping here that the PR is considered blocked until we also transfer any relevant font+style information to the lexeme serialization. So that, as a basic example, bold x is serialized as a distinct lexeme from a regular mathematical x.

The key observation is that font+style differences in regular language mainly imply emphasis or a structural role (e.g. headings) for the same word, whereas in mathematical expressions the underlying symbol is different: neither emphasis nor structural information is employed in formulas. (counter-examples are welcome)

@dginev

Collaborator

dginev commented Jul 6, 2018

Let me add another interesting example I hadn't thought of when preparing the PR - over-accents, which are not scripts. In particular, the conical function syntax in the left-hand side of equation 14.20.2 of the DLMF:

[image]

Here is the current lexeme string produced with the PR, with my manual indent as usual:

OVERACCENT:widehat ARG:start UNKNOWN:Q ARG:end
POSTSUPERSCRIPT:start ADDOP:minus UNKNOWN:mu POSTSUPERSCRIPT:end 
POSTSUBSCRIPT:start
  ADDOP:minus FRACOP:divide
    ARG:start NUMBER:1 ARG:end 
    ARG:start NUMBER:2 ARG:end  
  ADDOP:plus UNKNOWN:i UNKNOWN:tau 
POSTSUBSCRIPT:end

This seems reasonable I think? Just jotting it down as an extra example (and archival)

@dginev dginev force-pushed the dginev:math-lexemes branch from 3122973 to a5e9c0f Jul 9, 2018

@dginev dginev changed the title from [Demo] Expose math lexemes in final output to Expose math lexemes in final output Jul 9, 2018

@dginev

Collaborator

dginev commented Jul 9, 2018

Greetings! I have just added a piece of code that I would like to claim could move this PR from "demo" to "experimental", and allow it to enter a release, even in this imperfect form. I am of course also happy to reimplement if I get actionable feedback!

The additions from today:

  • I added a crude approximation to "obtaining the Font information" for the leaf node lexeme serialization. I am sure Bruce knows a better way, but this was a quick stab that seems to work decently well for basic examples.
  • I dropped the UNKNOWN role from the serialization, seeking some compromise ground after I got feedback that the grammatical role feels unnecessary. I still consider it useful to have the full assumed grammatical information that latexml's MathParser has, in order to enable reasonable comparisons between math grammars. Dropping UNKNOWN doesn't lose anything in this regard, as one can assume any lexeme without a role is unknown.
  • Similarly, I concatenated the font information via a dash, to contrast between it and the colon separator for the role.

Perfect? No. But good enough for meaningful experiments? I think so, and can name two separate classes that I would be personally interested in.

Revealing example:

latexmlc --preload=bm.sty --preload=llamapun.sty --pmml --mathlex 'literal:$x+\boldsymbol{y}=0$'

produces the lexemes:

italic-x ADDOP:plus bold-italic-y RELOP:equals NUMBER:0

One funny quirk of the current code is that if you replace the plus with a \bigoplus, you get the lexeme SUMOP:160%-direct-sum. Which feels a bit too precise!
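
A consumer of this newer syntax has to separate an optional role, optional font words and the symbol itself. The Python sketch below does that under one loud assumption: the set of font words is known in advance, because the dash is also used inside meanings such as direct-sum. The `FONT_WORDS` set only contains the two words seen in this thread, and `parse_lexeme` is a hypothetical helper.

```python
FONT_WORDS = {"bold", "italic"}  # assumption: only the font words seen in this thread


def parse_lexeme(item):
    """Split a lexeme into (role, font words, symbol)."""
    role, sep, rest = item.partition(":")
    if not sep:
        # No colon means the role was dropped, which is read as UNKNOWN.
        role, rest = "UNKNOWN", item
    parts = rest.split("-")
    fonts = []
    # Peel off leading font words, always leaving at least one part as the symbol.
    while len(parts) > 1 and parts[0] in FONT_WORDS:
        fonts.append(parts.pop(0))
    return role, fonts, "-".join(parts)


for item in "italic-x ADDOP:plus bold-italic-y RELOP:equals NUMBER:0".split():
    print(parse_lexeme(item))
```

A symbol that is itself a colon or a dash would confuse this sketch, so a real consumer would need a more careful grammar.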

dginev added some commits Jul 9, 2018

$text = 'Unknown' unless defined $text;
my $lexeme = $role . ":" . $text . ":" . ++$i;
$lexeme =~ s/\s//g;
my $lexeme = $self->node_to_lexeme($node) . ":" . ++$i;

@dginev

dginev Jul 9, 2018

Collaborator

ah, I forgot I was reusing the method in the internal parse, I need to be a little more careful here - my font changes broke a test I'm afraid.

@dginev

Collaborator

dginev commented Jul 9, 2018

Removed a small hiccup, now fully separating the experimental lexeme syntax serialization from the internal parser lexemes. Could be cleaned further, but I prefer waiting for feedback. Tests should pass again.

@brucemiller

Owner

brucemiller commented Jul 11, 2018

Yeah, the font size stuff is silly! :> I'd thought there was the beginnings of a notion of math-meaningful font attributes, but can't quite find it. I do find the inverse: there's a couple of methods in Common::Font regarding "purestyle", which give the parts of a font which are (presumably) not meaningful. Perhaps that's a good place to start with a new method to get the meaningful parts, which would be only family, series, shape, I guess.

@dginev

Collaborator

dginev commented Jul 18, 2018

Quick update here, as discussed, now explicitly only using the family, series and shape of the font, as relative to the default text font. The example from my previous comments with the direct sum now produces:

italic-x SUMOP:direct-sum bold-italic-y RELOP:equals NUMBER:0

which seems quite reasonable.
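
The filtering described here can be pictured with a small Python sketch: keep only family, series and shape, drop whatever equals the default font, and join the rest with dashes. The dictionary representation, the attribute values and the name `font_prefix` are assumptions made for the example and do not mirror the actual Perl implementation.

```python
MEANINGFUL = ("family", "series", "shape")


def font_prefix(font, default):
    """Join the family/series/shape values that differ from the default font."""
    parts = [
        font[key]
        for key in MEANINGFUL
        if key in font and font[key] != default.get(key)
    ]
    return "-".join(parts)


default_font = {"family": "serif", "series": "medium", "shape": "upright"}
big_bold = {"family": "serif", "series": "bold", "shape": "italic", "size": "160%"}
print(font_prefix(big_bold, default_font))
```

Because size is not in the list of meaningful keys, the 160% from the earlier quirk never reaches the lexeme.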

@brucemiller brucemiller merged commit 44ecafb into brucemiller:master Jul 27, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed