Feedback on the Springer Parser #20

Closed
vtshitoyan opened this issue Sep 26, 2018 · 9 comments

Comments

@vtshitoyan
Contributor

Here is some feedback from Tanjin who analyzed the results for the Springer Parser based on a few papers.

  1. Many spurious blanks are inserted, especially around subscripts/superscripts. This makes it difficult to parse chemical formulas correctly.
    E.g.:
    Pb(Zr x Ti 1− x )O 3
    Pb 0.97 Nd 0.02 (Zr 0.55 Ti 0.45 )O 3 (PNZT)
    ScTaO 4
    Ar + ion
    Mg 2 Ni
    7.49 × 10 3 kg/m 3
    1.5 J/cm 2
    CuK α
    k -space

  2. Paragraphs in the same section are not separated
    E.g.: Introduction section of the paper 10.1007/s00339-013-8138-9.

  3. References are not removed

  4. Some text is missing in sections that have sub-sections.
    E.g.: The Methods section is missing for the paper 10.1007/bf01142064.

  5. I am not sure whether we need to keep all formulas in the same format.
    E.g.: Some formulas start and end with "$$", while others use "\(" and "\)" as the boundary.
    Formula 1: $$ \sigma_{\text{wh}} = \sqrt { \sigma_{\text{sat}}^{2} - \left( {\sigma_{\text{sat}}^{2} - \sigma_{0}^{2} } \right)\exp ( - r(\varepsilon - \varepsilon_{0} ))} $$
    Formula 2: \( {\dot{{\varepsilon }}} \)?

I think we should at least address the first 4 points. Happy to discuss this further.
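
For point 1, a minimal sketch of one way the blanks can be avoided, assuming they come from joining every text node with a space during HTML-to-text extraction (the thread does not confirm the actual cause). The class name `InlineAwareTextExtractor`, the helper `extract_text`, and the `INLINE_TAGS` set are hypothetical and are not part of the Springer parser; only Python's standard library is used.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags are inline and must not introduce whitespace.
INLINE_TAGS = {"sub", "sup", "i", "em", "b", "strong", "span"}


class InlineAwareTextExtractor(HTMLParser):
    """Collect text, adding a separator only at non-inline tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A block-level tag (p, div, ...) separates words; inline tags do not.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return the text of an HTML fragment with whitespace runs collapsed."""
    parser = InlineAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


print(extract_text("<p>Mg<sub>2</sub>Ni and ScTaO<sub>4</sub></p>"))
```

Note that dropping the tags entirely also drops the distinction between subscripts and superscripts, so an exponent such as 10<sup>3</sup> would come out as "103". A real fix may prefer to emit explicit markers for `sub` and `sup` instead.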

@OlgaGKononova
Collaborator

On #5: when LaTeX markup is embedded in HTML/XML text, it usually comes with a tag holding a plain-text version. So the easiest way is to substitute each LaTeX span with the string under the plain-text tag.
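
Where no plain-text alternative is available, one fallback for point 5 is to rewrite all formulas to a single boundary style. The sketch below is only an illustration under that assumption: the function name `normalize_math_delimiters` is hypothetical, and choosing `\(...\)` as the target (which drops the display/inline distinction) is an arbitrary choice, not something decided in this thread.

```python
import re

# Formulas delimited by $$ ... $$ (may span several lines).
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Formulas delimited by \( ... \).
INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def normalize_math_delimiters(text):
    """Rewrite every formula so that it is bounded by \\( and \\)."""

    def wrap(match):
        # Trim padding inside the delimiters so both styles end up identical.
        return "\\(" + match.group(1).strip() + "\\)"

    text = DISPLAY_MATH.sub(wrap, text)
    return INLINE_MATH.sub(wrap, text)


sample = r"Formula 1: $$ \sigma_{0}^{2} $$ Formula 2: \( {\dot{{\varepsilon }}} \)"
print(normalize_math_delimiters(sample))
```

The function is idempotent, so running it on text that was already normalized leaves it unchanged.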

@eddotman
Contributor

Would it be possible to provide a couple examples of the raw HTML / XML for the inserted blanks? That might make it easier / quicker to narrow down that issue.

@shaunrong
Contributor

Do we still have DOIs for issues 1, 3 and 5, so that proper unit tests can be developed for the fix? @zhugeyicixin @vtshitoyan

@zhugeyicixin

> Do we still have DOIs for issues 1, 3 and 5, so that proper unit tests can be developed for the fix? @zhugeyicixin @vtshitoyan

Yes. Here are some example DOIs.

For 1, "10.1007/s00339-013-8138-9" and "10.1007/bf02663182"
For 3, "10.1007/s00339-013-8138-9" and "10.1007/s10853-011-5258-5"
For 5, "10.1007/s10853-015-9171-1"

@shaunrong
Contributor

@zjensen262 please refer to these DOIs for the unit tests in #26

@IAmGrootel

@zhugeyicixin thanks for the list of DOIs; they were very helpful for isolating and resolving the issues. I think I have fixed the first four issues mentioned above, and I am going through some additional papers to verify.

For some reason we cannot download the HTML version of 10.1007/bf02663182. Could you send us the HTML file for this DOI if you have it on hand?

@zhugeyicixin

@IAmGrootel Hi Alex, here is the HTML file for 10.1007/bf02663182.

paper_10.1007_bf02663182.txt

@IAmGrootel

Thanks! I just submitted a pull request. Let me know if there are further issues.

@hhaoyan
Contributor

hhaoyan commented May 13, 2019

@zhugeyicixin Please close this issue if you think the problems have been solved. Thanks!

@hhaoyan hhaoyan closed this as completed May 14, 2019
7 participants