Expose math lexemes in final output #947
Ok, I think I have this example nailed down. Deleted some of my intermediate comments to reduce clutter. Again, for the tex formula: $ P(E) = {n \choose k} p^k (1-p)^{ n-k} $ The resulting lexemes are now:
which feels ideal. Hence requesting feedback from @brucemiller :
I may want to try some more complex formulas, e.g. all the formulas in my equations example at http://latexml.mathweb.org/editor, possibly turn them into tests... but will wait for feedback before I jump on that. |
After getting a coffee ☕ it seems a bit more obvious to rename things like
into the more consistent with the internal latexml convention
and similarly for Does that sound better? |
Got another fix in by inspecting the Ramanujan identity in the examples I mentioned. Here is the result (still using the start/end lexemes, until we discuss): Again, I indented by hand for easier reading; this all comes out as a space-separated stream of tokens in the annotation element. It's completely unparsed in the actual output.
|
More special cases handled, example: $ x,\quad\quad \text{for} |q|<1. $ lexemes in output:
|
Sounds cool! Could really drive and enhance some upcoming development... or clash with it? Perhaps some offlist discussion would be good. |
Definitely a more exploratory piece of work here, happy to discuss at length in a more casual setting than infinitely long github comments 😂 I have high hopes it is in full synergy with other upcoming upgrades, or adaptable to such a point. |
So, I'm inclined to go ahead and accept this PR on the grounds that it "can't hurt", even though it's clearly not finished. It'd be good to be able to play with it! Object soon if you don't agree :> |
I am in favor! Will rebase it quickly. |
Rebase completed, it was a single space indent... |
That said, experimenting with a branch can often be easier than doing so on You can always do:
to quickly test with this branch -- or |
Hmm.. XMDual. Not sure what the appropriate representation should be for pre-parsed output. For post-parsed, I'd thought that it should give rise to 2 separate lexeme sequences: the presentational & semantic ones. |
Yes, XMDual is tricky. This PR takes the approach that the lexeme is what "the author wrote", with the intent of preserving any semantic hints, but not incorporating any guesses, be they about lexical semantics, or parse trees. Would need to look into specific examples of duals, but preserving the token sequence coming from the gullet would be the ideal solution from my current perspective. |
I remembered my favourite dual example - piecewise functions. So here is the code for a simple indicator function, for example:
\mathbf{1}_{A}(x) :=
\begin{cases}
1 & \text{if } x \in A,\\
0 & \text{if } x \notin A.
\end{cases}
The XML is nothing to sniff at, so I am attaching it as a gist here, or the comment will get huge. When generating the lexemes with the current PR, we get:
As usual, line-breaks and indentation for the lexemes are manually done by me, for easier reading - it's just a space-separated line in the output. |
While I am unsure if this representation of the dual is the best we can do, it certainly isn't terrible - all visual elements are preserved. One question is whether the Another minor bit is whether I need to escape all spaces inside the text content of lexemes, so that I get I think when I wrote the XMDual handler, I explicitly made the choice to use the second, presentation, branch in order to stay true to the authored input. |
Ugh! My local copy of dlmf stuff is already updated to master... does this need some sort of rebase? |
Sure, I rebased now for convenience. (still working on the moderncv branch btw, will try to wrap up tomorrow, but this one is pending feedback) |
Curious: how do you do that? I did a |
Steps I take from the checkout of my fork (which is
It tends to require force, as the rebase isn't reconcilable with the github-hosted branch when pushed. (which seems to be a weird limitation...) |
I tend to be painfully explicit with the rebase parameters, to be certain I know what git will do before I run the command, since wrong rebases can be a pain to rewind back (bunch of resets...) |
Ah, I see... OK, now I've got a whole mess of lexemes embedded in an xml file. Hmmm.... |
Sounds about right - I am indenting them by hand to proof them, as they're in this unparsed linearized form. But as it happens the space separated stream is perfect for transferring the lexemes over to linguistic applications, grammars, etc. |
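To illustrate why a space-separated stream is convenient for linguistic tooling, here is a minimal sketch in Python. The lexeme names in the sample stream are made-up placeholders and not the actual output of this PR; the only property taken from the discussion is that lexemes are separated by single spaces.

```python
from collections import Counter


def tokenize_lexemes(stream):
    """Split a space-separated lexeme stream into a list of tokens."""
    return stream.split()


def bigrams(tokens):
    """Pair each token with its successor, e.g. for a simple n-gram model."""
    return list(zip(tokens, tokens[1:]))


# Made-up stream: these lexeme names are placeholders, not real LaTeXML output.
stream = "ID:x RELOP:equals ID:y ADDOP:plus ID:x"

tokens = tokenize_lexemes(stream)
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

print(unigram_counts.most_common(1))
```

Because no parsing is needed beyond a whitespace split, the same stream can be fed directly to most token-based models.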
For immediate purposes, I may just hack MathParser.pm to print the lexeme streams for each numbered eqn to a file... |
I take it you're in a hurry :> It may be easier to debug that way, but you can also write a tiny Perl script using XML::LibXML that snipes the attributes from the result XML file from LaTeXML. |
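The suggestion above is a Perl script with XML::LibXML; the same idea can be sketched with Python's standard library, shown here only as an illustration. The attribute name `lexemes`, the element names, and the sample document are all assumptions made for the example, since the thread does not spell out where exactly the serialization is stored.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumptions,
# not the actual LaTeXML output schema.
SAMPLE_XML = """
<document>
  <Math tex="a+b" lexemes="TOKEN_A TOKEN_PLUS TOKEN_B"><XMath/></Math>
  <Math tex="c"><XMath/></Math>
</document>
""".strip()


def collect_attribute(xml_text, attr_name):
    """Return the value of attr_name from every element that carries it."""
    root = ET.fromstring(xml_text)
    return [el.attrib[attr_name] for el in root.iter() if attr_name in el.attrib]


for lexeme_stream in collect_attribute(SAMPLE_XML, "lexemes"):
    print(lexeme_stream)
```

For a real result file one would read the text from disk and pass whatever attribute name the output actually uses.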
Well we were mostly wrestling due to me requesting it :> But I guess my point leans towards 1) "this PR can be simple", and 2) a MathParser refactor may be needed to make the serialization a utility, and to make a nicely pluggable "analysis" module that executes roughly around MathParser. I was going to bring up the MathParser refactors I have been thinking of when we got to discussing grammars again at some later point. |
Sure, OK, the refactoring can wait till there's something to go between. But I'd still like to make it slightly more complicated: Namely separate presentation & content traversals to build lexeme sequences after parsing, and the ability to store both of those as well for later annotation. |
Cool, sure. I'm surprised at the interest, given how little love my "semantic tex attribute" issue got back in the day (see #432 ) |
? Had nothing to do with lack of love for the attributes, but rather lack of love for the limitations of MathML annotations. |
Sure, it just never took off - that's all I meant. If we start adding multiple lexeme serializations, we can consider adding multiple tex serializations too while we're at it. |
Unwinding back to the top of the discussion, just bookkeeping here that the PR is considered blocked until we also transfer any relevant font+style information to the lexeme serialization. So that, as a basic example, The key observation is that font+style differences in regular language mainly imply emphasis or a structural role (e.g. headings) for the same word, whereas in mathematical expressions the underlying symbol is different - neither emphasis nor structural information is employed in formulas. (counter-examples are welcome) |
Let me add another interesting example I hadn't thought of when preparing the PR - over-accents, which are not scripts. In particular, the conical function syntax in the left-hand side of equation 14.20.2 of the DLMF: Here is the current lexeme string produced with the PR, with my manual indent as usual:
This seems reasonable I think? Just jotting it down as an extra example (and archival) |
Greetings! I have just added a piece of code that I would like to claim could move this PR from "demo" to "experimental", and allow it to enter a release, even in this imperfect form. I am of course also happy to reimplement if I get actionable feedback! The additions from today:
Perfect? No. But good enough for meaningful experiments? I think so, and can name two separate classes that I would be personally interested in. Revealing example:
produces the lexemes:
One funny quirk of the current code is that if you replace the plus with a |
lib/LaTeXML/MathParser.pm
$text = 'Unknown' unless defined $text;
my $lexeme = $role . ":" . $text . ":" . ++$i;
$lexeme =~ s/\s//g;
my $lexeme = $self->node_to_lexeme($node) . ":" . ++$i;
Ah, I forgot I was reusing the method in the internal parse; I need to be a little more careful here - my font changes broke a test, I'm afraid.
Removed a small hiccup, now fully separating the experimental lexeme syntax serialization from the internal parser lexemes. Could be cleaned further, but I prefer waiting for feedback. Tests should pass again. |
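For readers who do not speak Perl, the first three lines quoted in the review comment above (default text of 'Unknown', a role:text:index join, whitespace stripped) can be mirrored in Python as follows. This is a sketch of that snippet only, not of the current MathParser code; the function name and the sample roles and texts are illustrative assumptions.

```python
import re


def internal_lexemes(nodes):
    """Build role:text:index lexemes from (role, text) pairs.

    Mirrors the quoted Perl: missing text becomes 'Unknown', the index
    starts at 1 (pre-increment), and all whitespace is removed.
    """
    lexemes = []
    for index, (role, text) in enumerate(nodes, start=1):
        if text is None:
            text = "Unknown"
        lexeme = f"{role}:{text}:{index}"
        lexemes.append(re.sub(r"\s", "", lexeme))
    return lexemes


# Illustrative input; the roles and texts are placeholders.
sample = [("ID", "x"), ("ADDOP", "+"), ("ID", None), ("OPFUNCTION", "arc sin")]
print(" ".join(internal_lexemes(sample)))
```

Stripping whitespace inside each lexeme is what keeps the final stream splittable on spaces.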
Yeah, the font size stuff is silly! :> I'd thought there was the beginnings of a notion of math-meaningful font attributes, but can't quite find it. I do find the inverse: there are a couple of methods in Common::Font regarding "purestyle", which give the parts of a font which are (presumably) not meaningful. Perhaps that's a good place to start with a new method to get the meaningful parts, which would be only family, series, shape, I guess. |
Quick update here, as discussed, now explicitly only using the family, series and shape of the font, as relative to the default text font. The example from my previous comments with the direct sum now produces:
which seems quite reasonable. |
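The "family, series and shape, relative to the default text font" rule can be pictured with a small Python sketch. The default values, the dictionary representation of a font, and the function name are assumptions for illustration; the actual logic lives in the Perl code of the PR and may differ in detail.

```python
# Assumed default text font; these concrete values are an illustration only.
DEFAULT_TEXT_FONT = {"family": "serif", "series": "medium", "shape": "upright"}

# Only these properties are treated as math-meaningful, per the discussion.
MEANINGFUL_KEYS = ("family", "series", "shape")


def meaningful_font_parts(font, default=DEFAULT_TEXT_FONT):
    """Keep family/series/shape values that differ from the default font.

    Properties such as size or color are ignored entirely.
    """
    return {
        key: font[key]
        for key in MEANINGFUL_KEYS
        if key in font and font[key] != default[key]
    }


bold_symbol = {"family": "serif", "series": "bold", "shape": "upright", "size": 12}
print(meaningful_font_parts(bold_symbol))
```

A symbol set in the default font yields an empty result under this scheme, so only genuinely distinguishing font information would need to be attached to its lexeme.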
This is a llamapun-motivated pull request. I have been creating some token models over arXiv, and would like to take them to the next level by incorporating an ascii, linguistics-friendly, representation of all formulas.
To that end, I would like to dispense with any remnants of TeX, and instead use the closest possible form to the one used by latexml's grammar input.
This was simultaneously easier than I expected - in terms of introducing the feature in MathParser, as $lexemes already existed in a very near ideal form. But also harder than expected from the usability side, as I just couldn't find a way around introducing a new option called --mathlex, as well as a dedicated llamapun.sty.ltxml binding to avoid introducing a new option in the core latexml processing as well.
If it wasn't for the MathParser changes, one could imagine the rest of the toggling UX as a separate dedicated plugin. But making new options from a plugin isn't that direct either... as it is inevitable to have to patch Config.pm.
But back to the fun part. With the current PR one can run for instance:
and get back:
I think the lexeme string itself is also interesting to look at for complex examples (already here with \choose actually), so I will leave more comments.