Feedback on the Springer Parser #20
Comments
On #5: when LaTeX markup is embedded in HTML/XML text, it is usually accompanied by a tag that holds a plain-text version. So the easiest fix is to replace each LaTeX span with the string under the plain-text tag.
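A minimal sketch of that substitution, using only the Python standard library. The raw Springer markup is not shown in this thread, so the `class="equation"` wrapper and the `data-format="tex"` / `data-format="plain"` attributes below are hypothetical stand-ins, as is the function name; the real tag names would need to be checked against the downloaded files.

```python
import xml.etree.ElementTree as ET


def replace_equations_with_plain_text(fragment: str) -> str:
    """Swap each equation wrapper for the text of its plain-text child.

    Assumes (hypothetically) that an equation looks like:
      <span class="equation">
        <span data-format="tex">...</span>
        <span data-format="plain">...</span>
      </span>
    """
    root = ET.fromstring(fragment)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") != "equation":  # hypothetical class name
                continue
            plain = ""
            for variant in child:
                if variant.get("data-format") == "plain":  # hypothetical attribute
                    plain = "".join(variant.itertext())
                    break
            # Keep the text that followed the equation element.
            replacement = plain + (child.tail or "")
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + replacement
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + replacement
            parent.remove(child)
    return "".join(root.itertext())
```

If an equation has no plain-text child, this sketch drops it silently; a real fix might prefer to keep the LaTeX source in that case.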
Would it be possible to provide a couple of examples of the raw HTML/XML for the inserted blanks? That might make it easier and quicker to narrow down that issue.
Do we still have DOIs for issues 1, 3, and 5, so that proper unit tests can be developed for the fixes? @zhugeyicixin @vtshitoyan
Yes. Here are some example DOIs. For issue 1: "10.1007/s00339-013-8138-9" and "10.1007/bf02663182".
@zjensen262 please refer to these DOIs for the unit tests in #26.
@zhugeyicixin thanks for the list of DOIs; they were very helpful for isolating and resolving the issues. I think I have fixed the first four issues mentioned above, and I am going through some additional papers to verify. For some reason we cannot download the HTML version of 10.1007/bf02663182. Could you send us the HTML file for this DOI if you have it on hand?
@IAmGrootel Hi Alex, here is the HTML file for 10.1007/bf02663182.
Thanks! I just submitted a pull request. Let me know if there are further issues.
@zhugeyicixin please close this issue if you think the problems have been solved. Thanks!
Here is some feedback from Tanjin, who analyzed the Springer Parser results for a few papers.
1. Many blanks are inserted, especially around subscripts and superscripts. This makes it difficult to parse chemical formulas correctly.
E.g.:
Pb(Zr x Ti 1− x )O 3
Pb 0.97 Nd 0.02 (Zr 0.55 Ti 0.45 )O 3 (PNZT)
ScTaO 4
Ar + ion
Mg 2 Ni
7.49 × 10 3 kg/m 3
1.5 J/cm 2
CuK α
k -space
2. Paragraphs in the same section are not separated.
E.g.: the Introduction section of the paper 10.1007/s00339-013-8138-9.
3. References are not removed.
4. Some text is missing from sections that contain sub-sections.
E.g.: the Methods section is missing for the paper 10.1007/bf01142064.
5. I am not sure whether we need to keep formulas in a consistent format.
E.g.: some formulas start and end with "$$", while others use "\(" and "\)" as boundaries.
Formula 1: $$ \sigma_{\text{wh}} = \sqrt { \sigma_{\text{sat}}^{2} - \left( {\sigma_{\text{sat}}^{2} - \sigma_{0}^{2} } \right)\exp ( - r(\varepsilon - \varepsilon_{0} ))} $$
Formula 2: \( {\dot{{\varepsilon }}} \)?
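To judge how much the two delimiter styles matter in practice, it may help to first locate them. The sketch below is one possible way to do that with regular expressions; the function name `find_formulas` and the "display"/"inline" labels are illustrative choices, not part of the parser.

```python
import re

# "$$ ... $$" blocks and "\( ... \)" spans, as seen in the two examples above.
DISPLAY = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def find_formulas(text: str) -> list[tuple[str, str]]:
    """Return (kind, body) pairs for every formula, in order of appearance."""
    found = []
    for match in DISPLAY.finditer(text):
        found.append((match.start(), "display", match.group(1).strip()))
    for match in INLINE.finditer(text):
        found.append((match.start(), "inline", match.group(1).strip()))
    found.sort()  # order by position in the text
    return [(kind, body) for _, kind, body in found]
```

With the formulas located, rewriting them to a single delimiter style or replacing them with plain text becomes a small follow-up step.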
I think we should at least address the first 4 points. Happy to discuss this further.
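For the inserted blanks around subscripts and superscripts, one possible cause (an assumption, since the parser code is not shown here) is that text nodes are joined with a space when the HTML is flattened. The sketch below uses the standard-library `html.parser` to join text nodes with no separator, so inline tags such as `<sub>` and `<sup>` do not introduce blanks. The class and function names are illustrative only.

```python
from html.parser import HTMLParser


class InlineTextExtractor(HTMLParser):
    """Collect text nodes exactly as they appear, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html: str) -> str:
    parser = InlineTextExtractor()
    parser.feed(html)
    parser.close()
    # Join with no separator, then collapse runs of whitespace that
    # were already present in the source.
    return " ".join("".join(parser.parts).split())
```

For example, `extract_text("<p>Mg<sub>2</sub>Ni</p>")` yields `Mg2Ni` rather than `Mg 2 Ni`. Block-level tags such as `<p>` would still need their own handling so that separate paragraphs are not run together.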