Adds author, date, copyright extraction #49

Merged: 9 commits, Feb 20, 2016

philgooch
Contributor

  • Uses metadata fields and classes to extract publication date
  • Uses metadata fields to extract authors
  • Uses class names to attempt to extract copyright line
  • Widens search scope for fallback title extraction (h1, h2)
  • Adds softTitle field with less aggressive truncation
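As an illustration of the metadata-based approach in the list above, here is a minimal Python sketch that reads a publication date and author names from `<meta>` tags. It is not the code from this PR: the key names in `DATE_KEYS` and `AUTHOR_KEYS`, the `MetaCollector` class, and the `extract_date_and_authors` function are all assumptions made for the example, since the PR description does not list the exact fields it checks.

```python
from html.parser import HTMLParser

# Hypothetical key lists: the PR does not say which metadata fields it reads.
DATE_KEYS = ("article:published_time", "datePublished", "date")
AUTHOR_KEYS = ("author", "article:author")


class MetaCollector(HTMLParser):
    """Collects <meta> tags as key -> list of non-empty content values."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # A meta tag may identify itself via property, name, or itemprop.
        key = a.get("property") or a.get("name") or a.get("itemprop")
        content = (a.get("content") or "").strip()
        if key and content:
            self.meta.setdefault(key, []).append(content)


def extract_date_and_authors(html):
    """Return the first matching date (or None) and all author values."""
    parser = MetaCollector()
    parser.feed(html)
    date = next((parser.meta[k][0] for k in DATE_KEYS if k in parser.meta), None)
    authors = [v for k in AUTHOR_KEYS for v in parser.meta.get(k, [])]
    return {"date": date, "author": authors}
```

Trying the keys in priority order and returning `None` when none match keeps a missing date distinct from an empty string.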

@ageitgey
Owner

Thanks for the PR! These features look really useful.

Can you also update the README.md file to reflect the new data elements returned? The Extracted data elements section and the Module Interface sections need to be updated to reflect the new data you extracted:

  • softTitle
  • date
  • copyright
  • author

Thanks!

* Updates Shovel Knight review JSON example with new data elements
* Moves og:site_name into separate publisher field (semantically more correct)
* Correctly falls back to named author in the body text
* Adds unit test for fallback author
* Byline class is used by a number of news sources to identify the author
* Removes wildcard 'date' class matchers
* Adds itemprop*='datePublished' matchers
* Adds additional '|' copyright chunk delimiter
* Don't create author list containing an empty string
* Output null field values rather than empty strings
* Don't assume that copyright information is present!
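The last few commit messages above describe cleanup rules: split the copyright line on a `|` delimiter, skip empty author names, emit null instead of empty strings, and tolerate missing copyright text. The Python sketch below shows one way such rules might look. The function names, the copyright marker pattern, and the choice to split only on `|` are assumptions for illustration and are not taken from the PR's code.

```python
import re

# Assumed markers that identify a chunk as a copyright statement.
COPYRIGHT_MARK = re.compile(r"(©|\(c\)|copyright)", re.IGNORECASE)


def clean_copyright(text):
    """Return the first '|'-delimited chunk that looks like a copyright line."""
    if not text:
        # Copyright information may simply be absent.
        return None
    for chunk in text.split("|"):
        chunk = chunk.strip()
        if COPYRIGHT_MARK.search(chunk):
            return chunk
    return None


def clean_authors(raw):
    """Drop None and blank entries so the list never contains an empty string."""
    return [a.strip() for a in raw if a and a.strip()]


def null_if_empty(value):
    """Map blank strings to None; leave every other value untouched."""
    if isinstance(value, str) and not value.strip():
        return None
    return value
```

For example, `clean_copyright("© 2016 Example Media | All rights reserved")` keeps only the first chunk, and `clean_authors(["", " Jane Doe "])` yields a one-element list.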
@snellingio

Looking forward to this being merged in!

@snellingio mentioned this pull request Feb 19, 2016
@ageitgey
Owner

Thanks! I didn't see this earlier because GitHub doesn't alert you when new commits are merged.

ageitgey added a commit that referenced this pull request Feb 20, 2016
Adds author, date, copyright extraction
@ageitgey merged commit 52cb470 into ageitgey:master Feb 20, 2016