Make webpages() consistent across aut and ARCH. #539
- Filter HTTP headers and HTML from content on webpages() so that it is consistent with the app implementation and the ARCH implementation
- Update PlainTextExtractor to use .all(), since HTML is removed from content
- Add domain to all()
- Update CSV exports on the app so that they are RFC 4180 compliant
- Apply GitHub workflows to the main branch
- Consistent formatting in DataFrameLoader.scala
- Update tests as needed
- Resolves #538
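To illustrate two of the items above, here is a minimal Python sketch. It is not the aut or ARCH implementation. The function names `strip_headers_and_html` and `to_rfc4180` are hypothetical, and the sketch assumes a raw record is an HTTP response whose headers end at the first blank line. It uses only the Python standard library.

```python
import csv
import io
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags (this includes script/style bodies).
        self.parts.append(data)


def strip_headers_and_html(raw_record: str) -> str:
    """Drop HTTP headers, then HTML tags, from a raw response string."""
    # Assumption: headers end at the first CRLF CRLF. If no blank line
    # is found, partition() yields an empty body.
    _, _, body = raw_record.partition("\r\n\r\n")
    parser = _TextOnly()
    parser.feed(body)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.parts).split())


def to_rfc4180(rows) -> str:
    """Serialize rows as CSV with CRLF line endings and doubled quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerows(rows)
    return buf.getvalue()
```

For example, passing a response such as `"HTTP/1.1 200 OK\r\n\r\n<p>Hi</p>"` to `strip_headers_and_html` leaves only the page text, and `to_rfc4180` quotes any field that contains a comma, a double quote, or a line break.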
Codecov Report
@@ Coverage Diff @@
## main #539 +/- ##
=======================================
Coverage ? 92.93%
Complexity ? 42
=======================================
Files ? 39
Lines ? 835
Branches ? 52
=======================================
Hits ? 776
Misses ? 35
Partials ? 24
I think we should consider renaming the Does that make sense? If so, I can get it updated here and in archivesunleashed/aut-docs#117. I'm also working on getting the PySpark example notebooks updated to reflect this. In the process there, I realized we never made an equivalent for
Makes a lot of sense – I like
@ianmilligan1 ready for testing when you have time. Documentation PR: archivesunleashed/aut-docs#117
Tested and all works! (Except for the discovery of Spark 3.0.0 having a bug, as discussed on Slack!)
#117)
* Documentation updates for archivesunleashed/aut#539 and archivesunleashed/aut#534.
* Change all content to raw_content.
GitHub issue(s):
What does this Pull Request do?
Make webpages() consistent across aut and ARCH.
How should this be tested?