I am afraid I have trouble understanding the training data and I cannot find any explanation in the README or paper. #23
Comments
@jparsert I don't suppose you figured this out?
No, I did not. Do you have any further information?
After some digging, it seems that the tab separates the (x, y) pairs, and that each expression is written in Polish (prefix) notation, which flattens a tree (like the one given in section 2.1 of the paper) into a sequence: https://en.wikipedia.org/wiki/Polish_notation There's a small note in section 2.2. I'm not sure what "sub Y" is, but I think it can be ignored.
Ryan is spot on, this is how to parse these things.
The trick is to start reading the expression from the right (which corresponds to the bottom of the tree). Here are some further expressions, you get the hang of them after parsing a few:
pow x INT+ 3 mul div INT+ 1 INT+ 4 pow x INT+ 4: according to Ryan, we can write this as the pair (pow x INT+ 3, mul div INT+ 1 INT+ 4 pow x INT+ 4), which in normal LaTeX notation is (x^{3}, \frac{1}{4}\cdot x^{4})
add pow x INT+ 2 exp x add mul div INT+ 1 INT+ 3 pow x INT+ 3 exp x: we can write this as the pair (add pow x INT+ 2 exp x, add mul div INT+ 1 INT+ 3 pow x INT+ 3 exp x), which in normal LaTeX notation is (x^{2}+\exp(x), \frac{1}{3}\cdot x^{3}+\exp(x)). You can check that these conversions are correct by noticing that integrating the left-hand side always gives the right-hand side (which is what this dataset is about). With this information it becomes clear that you only need to understand one side of the pair (X, Y), since you can integrate X to get Y, or differentiate Y to get X. (@jparsert greetings from another place from Oxford ;) )
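For anyone who wants to check these readings mechanically, here is a minimal Python sketch of a prefix-notation reader. It is not taken from the repository: the names `parse` and `to_infix` are hypothetical, the operator tables only cover the tokens that appear in this thread, and it assumes that an integer is a sign token (`INT+` or `INT-`) followed by one or more digit tokens.

```python
# Operator tables limited to the tokens seen in this thread (an assumption,
# not the dataset's full vocabulary).
BINARY = {"add": "+", "sub": "-", "mul": "*", "div": "/", "pow": "^"}
UNARY = {"exp", "ln", "sqrt"}


def parse(tokens, pos=0):
    """Parse one prefix expression starting at tokens[pos].

    Returns (tree, next_pos), where tree is a nested tuple, an int, or a
    symbol string such as "x".
    """
    tok = tokens[pos]
    if tok in BINARY:
        left, pos = parse(tokens, pos + 1)
        right, pos = parse(tokens, pos)
        return (tok, left, right), pos
    if tok in UNARY:
        arg, pos = parse(tokens, pos + 1)
        return (tok, arg), pos
    if tok in ("INT+", "INT-"):
        # Assumption: sign token followed by one or more digit tokens.
        pos += 1
        digits = ""
        while pos < len(tokens) and tokens[pos].isdigit():
            digits += tokens[pos]
            pos += 1
        if not digits:
            raise ValueError(f"no digits after {tok}")
        value = int(digits)
        return (-value if tok == "INT-" else value), pos
    # Anything else is treated as a leaf symbol (for example x).
    return tok, pos + 1


def to_infix(tree):
    """Render a parsed tree as a fully parenthesised infix string."""
    if isinstance(tree, tuple):
        if len(tree) == 3:
            op, a, b = tree
            return f"({to_infix(a)} {BINARY[op]} {to_infix(b)})"
        op, a = tree
        return f"{op}({to_infix(a)})"
    return str(tree)


tree, _ = parse("add pow x INT+ 2 exp x".split())
print(to_infix(tree))  # ((x ^ 2) + exp(x))
```

The parser reads left to right and recurses once per operand, so the nested tuples mirror the expression tree directly; a tuple of length three is a binary node and a tuple of length two is a unary node.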
I did already figure out the Polish notation bit and also wrote a parser that returns a nice AST. If you are interested, I can give you that, although it's really not much, only a couple of lines of code. Do you intend to extend or improve this work?
I don't intend to improve this particular line of work; we were mainly interested in this data to evaluate ChatGPT (see https://arxiv.org/abs/2301.13867) because it is unlikely to have been directly in the training data of an LLM that has been trained on publicly scraped mathematical data. But it would be worth a thought to perhaps automate this, for which an AST would then be useful. I don't really have a specific interest in going along this route at the moment, though if you are interested in having a chat, please do let me know.
The As for the line
sub, add and mul are the binary operations -, + and *.
Thanks! After concentrating a bit we did figure out the puzzle that was Having the |
For example, what does a line like this mean:
2|sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1 add mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x
What is the "sub Y'"? What is the operator "INT-" or "INT"; is it just integer addition/subtraction? And why do you differentiate between that and sub/add? Also, where do I get the training pairs? There is only one expression per line, not two.
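One way to look for the pair in such a line is to count operands: in prefix notation a binary operator opens one extra operand slot, a unary operator opens none, and a leaf closes one, so the first expression ends where the count reaches zero. The Python sketch below does this. It is only an illustration under assumptions: `first_expr_end` and `split_pair` are hypothetical names, the operator sets cover only tokens seen in this issue, an integer is assumed to be `INT+` or `INT-` followed by digit tokens, and whatever precedes `|` is assumed not to be part of the expression.

```python
# Operator sets limited to tokens seen in this issue (an assumption).
BINARY = {"add", "sub", "mul", "div", "pow"}
UNARY = {"exp", "ln", "sqrt"}
SIGNS = {"INT+", "INT-"}


def first_expr_end(tokens):
    """Return the index just past the first complete prefix expression."""
    need = 1  # operands still missing
    i = 0
    while need > 0:
        if i >= len(tokens):
            raise ValueError("incomplete prefix expression")
        tok = tokens[i]
        i += 1
        if tok in BINARY:
            need += 1  # fills one slot, opens two
        elif tok in UNARY:
            pass  # fills one slot, opens one
        else:
            if tok in SIGNS:
                # Assumption: the digit tokens belong to this integer.
                while i < len(tokens) and tokens[i].isdigit():
                    i += 1
            need -= 1  # a leaf fills one slot
    return i


def split_pair(line):
    """Split a data line into two token lists (first and second expression)."""
    body = line.split("|", 1)[-1]  # drop the assumed "N|" prefix, if any
    if "\t" in body:
        x, y = body.split("\t", 1)
        return x.split(), y.split()
    # Fallback when the tab was lost: cut after the first full expression.
    tokens = body.split()
    k = first_expr_end(tokens)
    return tokens[:k], tokens[k:]


line = ("2|sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1 "
        "add mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x")
x, y = split_pair(line)
print(" ".join(x))
print(" ".join(y))
```

Under these assumptions the example line splits into `sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1`, which reads as Y' - (4 + 2(x + sqrt(3))^(-1)), and `add mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x`, which reads as 2 ln(x + sqrt(3)) + 4x. Differentiating the second gives 4 + 2/(x + sqrt(3)), so the second expression is consistent with being the solution Y of the first set to zero, and `INT-` reads as the sign of the integer -1 rather than a subtraction.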