This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

I am afraid I have trouble understanding the training data and I cannot find any explanation in the README or paper. #23

Open
jparsert opened this issue Mar 22, 2022 · 9 comments

Comments

@jparsert

jparsert commented Mar 22, 2022

What, for example, does a line like this mean:
2|sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1 add mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x
What is "sub Y'"? What is the operator "INT-" or "INT+": is it just integer addition/subtraction? If so, why do you differentiate between that and sub/add? Also, where do I get the training pairs? There is only one expression per line, not two.

@Ryan-Rhys

@jparsert I don't suppose you figured this out?

@jparsert
Author

No, I did not. Do you have any further information?

@Ryan-Rhys

After some digging it seems as though the tab separates the (x, y) pairs and each line is given in the "Polish notation" for converting a tree (like the one given in section 2.1 of the paper) into a sequence:

https://en.wikipedia.org/wiki/Polish_notation

There's a small note in section 2.2.

I'm not sure what "sub Y'" is, but I think it can be ignored.
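The reading described above (split each line on the tab, then treat each half as a prefix expression) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operator tables cover just the tokens that appear in this thread, integers are assumed to be a sign token followed by base-10 digit tokens, and `to_infix` and `split_pair` are hypothetical names, not functions from the repository.

```python
# Assumption: only the tokens seen in this thread are supported.
BINARY = {"add": "+", "sub": "-", "mul": "*", "div": "/", "pow": "^"}
UNARY = {"sqrt", "exp", "ln"}


def _parse(tokens, i):
    """Parse one prefix expression starting at tokens[i].

    Returns the infix string and the index of the first unread token.
    """
    tok = tokens[i]
    if tok in BINARY:
        left, i = _parse(tokens, i + 1)
        right, i = _parse(tokens, i)
        return f"({left} {BINARY[tok]} {right})", i
    if tok in UNARY:
        arg, i = _parse(tokens, i + 1)
        return f"{tok}({arg})", i
    if tok in ("INT+", "INT-"):
        # Sign token followed by one or more digit tokens (assumed base 10).
        j = i + 1
        while j < len(tokens) and tokens[j].isdigit():
            j += 1
        digits = "".join(tokens[i + 1:j])
        return ("-" if tok == "INT-" else "") + digits, j
    return tok, i + 1  # a variable such as x or Y'


def to_infix(prefix):
    """Convert a whitespace-separated prefix string to a parenthesised infix string."""
    tokens = prefix.split()
    text, end = _parse(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return text


def split_pair(line):
    """Split one tab-separated data line into an (x, y) pair of infix strings."""
    x, y = line.rstrip("\n").split("\t")
    return to_infix(x), to_infix(y)


print(split_pair("pow x INT+ 3\tmul div INT+ 1 INT+ 4 pow x INT+ 4"))
```

Every binary node is parenthesised, so the output is unambiguous but more verbose than hand-written notation.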

@Ryan-Rhys

[Image: 20230120_174455]

This is my interpretation of the first line in prim_fwd_test.

@frieders

Ryan is spot on, this is how to parse these things.

sub Y' and everything that comes before it can be ignored. I guess the number is just the count of identical trees they generated (it seems their setup could not exclude generating the same tree multiple times), which I assume has implications for the probability of occurrence, but it can be ignored for this question. INT+ 5, of course, just means the integer 5.

The trick is to start reading the expression from the right (which corresponds to the bottom of the tree). Here are some further expressions, you get the hang of them after parsing a few:

  1. Line 2 from the file prim_fwd.test:
pow x INT+ 3         mul div INT+ 1 INT+ 4 pow x INT+ 4

which, according to Ryan, we can write as

(pow x INT+ 3, mul div INT+ 1 INT+ 4 pow x INT+ 4)

which is in normal LaTeX notation

(x^{3}, \frac{1}{4}\cdot x^{4})

  2. Line 8 from the file prim_fwd.test:
add pow x INT+ 2 exp x	     add mul div INT+ 1 INT+ 3 pow x INT+ 3 exp x

we can write as

(add pow x INT+ 2 exp x, add mul div INT+ 1 INT+ 3 pow x INT+ 3 exp x)

which is in normal LaTeX notation

(x^{2}+\exp(x), \frac{1}{3}\cdot x^{3}+\exp(x))

You can check that these conversions are correct by noticing that integrating the left-hand side always gives the right-hand side (which is what this dataset is about). With this information it becomes clear that you only need to understand one side of the pair (X, Y), since you can integrate X to get Y, or differentiate Y to get X.
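That check can also be done numerically, without a computer algebra system: evaluate both prefix expressions at a few points and compare a central finite difference of Y with X. The sketch below is only an illustration under stated assumptions: it supports just the tokens seen in this thread, assumes base-10 digit tokens after INT+ / INT-, and `evaluate` and `is_antiderivative` are hypothetical helper names. The sample points and tolerance are arbitrary choices.

```python
import math

# Assumption: only the tokens seen in this thread are supported.
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "pow": lambda a, b: a ** b,
}
UNARY = {"exp": math.exp, "ln": math.log, "sqrt": math.sqrt}


def evaluate(tokens, x):
    """Evaluate a prefix token list at the point x."""

    def walk(i):
        tok = tokens[i]
        if tok in BINARY:
            a, i = walk(i + 1)
            b, i = walk(i)
            return BINARY[tok](a, b), i
        if tok in UNARY:
            a, i = walk(i + 1)
            return UNARY[tok](a), i
        if tok in ("INT+", "INT-"):
            # Sign token followed by digit tokens (assumed base 10).
            j = i + 1
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            value = int("".join(tokens[i + 1:j]))
            return (-value if tok == "INT-" else value), j
        if tok == "x":
            return x, i + 1
        raise ValueError(f"unsupported token: {tok}")

    value, end = walk(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value


def is_antiderivative(f_prefix, big_f_prefix, points=(0.5, 1.0, 2.0), h=1e-5, tol=1e-6):
    """Return True if d/dx of big_f matches f at every sample point."""
    f, big_f = f_prefix.split(), big_f_prefix.split()
    for x in points:
        slope = (evaluate(big_f, x + h) - evaluate(big_f, x - h)) / (2 * h)
        target = evaluate(f, x)
        if abs(slope - target) > tol * max(1.0, abs(target)):
            return False
    return True


print(is_antiderivative("pow x INT+ 3", "mul div INT+ 1 INT+ 4 pow x INT+ 4"))
```

A numerical match at a few points is evidence, not a proof, but it is enough to catch a mis-parsed expression or a wrong constant factor.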

(@jparsert greetings from another place from Oxford ;) )

@jparsert
Author

I had already figured out the Polish notation bit and also wrote a parser that returns a nice AST. If you are interested, I can give you that, although it's really not much, only a couple of lines of code. Do you intend to extend or improve this work?

@friederrr

friederrr commented Mar 8, 2023

I don't intend to improve this particular line of work; we were mainly interested in this data to evaluate ChatGPT (see https://arxiv.org/abs/2301.13867) because it is unlikely to have been directly in the training data of an LLM that has been trained on publicly scraped mathematical data. But it would be worth a thought to perhaps automate this, for which an AST would then be useful. I don't really have a specific interest in going along this route at the moment, though if you are interested in having a chat, please do let me know.

@f-charton

f-charton commented Mar 8, 2023

The prefix_to_infix() function in src/char_sp.py should do this for you. It takes a prefix (i.e. Polish) notation sequence and rewrites it "as written by humans", with parentheses. To use it, you need to load char_sp.py, instantiate the class CharSPEnvironment, and then call prefix_to_infix() on the prefix notation list.

As for the line
2|sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1 add mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x
The 2| is just a line number and a separator.
sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1 is the input sequence. It reads
sub(Y', add(INT+ 4, mul(INT+ 2, pow(add(x, sqrt(INT+ 3)), INT- 1))))
which corresponds (in infix notation) to
Y' - (4 + (2 * (x + sqrt(3))^-1)), with =0 assumed at the end of the expression (I think we mention this in the paper),
i.e. Y' = 4 + 2 / (x + sqrt(3)). We want to integrate the function 4 + 2/(x + sqrt(3)).

The output sequence is add mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x, which means 2 * ln(x + sqrt(3)) + 4*x.

sub, add and mul are the binary operations -, + and *.
INT+ and INT- are the positive and negative signs used when writing integers.
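The line layout described above can be taken apart with plain string operations. This is a minimal sketch, not code from the repository: it assumes the input and output sequences are separated by a tab (as noted earlier in the thread), and `parse_line` and `integrand` are hypothetical helper names.

```python
def parse_line(line):
    """Split 'N|input<TAB>output' into (N, input_tokens, output_tokens).

    Assumption: the input and output sequences are separated by a single tab.
    """
    number, _, rest = line.partition("|")
    source, target = rest.rstrip("\n").split("\t")
    return int(number), source.split(), target.split()


def integrand(source_tokens):
    """Drop a leading `sub Y'` so only f remains from the equation Y' - f = 0."""
    if source_tokens[:2] == ["sub", "Y'"]:
        return source_tokens[2:]
    return source_tokens


line = (
    "2|sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1"
    "\tadd mul INT+ 2 ln add x sqrt INT+ 3 mul INT+ 4 x"
)
number, source, target = parse_line(line)
print(number)                       # the leading line number
print(" ".join(integrand(source)))  # prefix form of 4 + 2 / (x + sqrt(3))
print(" ".join(target))             # prefix form of 2 * ln(x + sqrt(3)) + 4*x
```

Because Y' is a single token and sub is binary, everything after the first two tokens is exactly the second operand, so slicing is enough and no full parse is needed for this step.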

@friederrr

Thanks!

After concentrating a bit we did figure out the puzzle that was 2|sub Y' add INT+ 4 mul INT+ 2 pow add x sqrt INT+ 3 INT- 1 for us. ;)

Having the prefix_to_infix() function already coded, maybe we will proceed after all to do the automatic evaluation ...
