Missing title_* #14

Open
cverluise opened this issue Nov 9, 2019 · 0 comments
Labels: beta, enhancement, parsing

Around 10% of the npl_publn entries in the beta version have none of title_j, title_m, or title_main_a. In most of these cases, part of the missing content has been wrongly parsed into title_main_m.

How to reproduce the behaviour

-- Beta rows with no parsed title, joined to the raw citation string (npl_biblio) for inspection
SELECT
  *
FROM (
  SELECT
    *
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    title_j IS NULL
    AND title_m IS NULL
    AND title_main_a IS NULL) AS parsing
JOIN (
  SELECT
    npl_publn_id AS id,
    npl_biblio
  FROM
    `usptobias.patstat.tls214`) AS tls214
ON
  tls214.id = parsing.npl_publn_id

Ideas/ solution

These citations seem to share a common pattern: they are already highly structured (e.g. NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, 1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961).
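To make the pattern concrete, here is a minimal Python sketch that pulls the main fields out of a citation shaped like the example above. It only illustrates how regular these strings are and does not replace retraining Grobid. The regular expression, the function name `parse_structured_citation`, and the group names are all assumptions made for illustration, and real citations will likely need more variants (missing issue number, different quote characters, no ISO date).

```python
import re
from typing import Dict, Optional

# Hypothetical pattern, written against the single example quoted in this issue.
STRUCTURED_NPL = re.compile(
    r"""
    ^(?P<authors>[^:]+?):\s*                 # authors, up to the first colon
    '(?P<title>.+?)',\s*                     # article title in single quotes
    (?P<journal>[^,]+),                      # first comma-separated field after the title
    .*?\bvol\.\s*(?P<volume>\d+)             # volume number
    (?:,\s*no\.\s*(?P<issue>\d+))?           # optional issue number
    .*?\((?P<date>\d{4}-\d{2}-\d{2})\)       # ISO date in parentheses
    .*?\bpages?\s*(?P<first_page>\d+)        # first page
    (?:\s*-\s*(?P<last_page>\d+))?           # optional last page
    """,
    re.VERBOSE,
)


def parse_structured_citation(npl_biblio: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the captured fields, or None if the string does not fit the pattern."""
    match = STRUCTURED_NPL.match(npl_biblio.strip())
    return match.groupdict() if match else None


example = (
    "NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', "
    "CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, "
    "1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961"
)
fields = parse_structured_citation(example)
print(fields["title"])    # HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE
print(fields["journal"])  # CHEMIETECHNIK
```

How these groups would map onto the beta schema (for instance `title` to title_main_a and `journal` to title_j) is also an assumption. A pattern like this could at most help select or pre-annotate training examples.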

At this stage, training the Grobid model on these examples seems to be the best option. The citations affected by this issue would then be processed again.
