Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make webpages() consistent across aut and ARCH. #539

Merged
merged 7 commits into from
May 30, 2022
Merged

Make webpages() consistent across aut and ARCH. #539

merged 7 commits into from
May 30, 2022

Conversation

ruebot
Copy link
Member

@ruebot ruebot commented May 27, 2022

GitHub issue(s):

What does this Pull Request do?

Make webpages() consistent across aut and ARCH.

  • Filter HTTP headers, and HTML from content on webpages so that it is consistent with the app implementation, and the ARCH implementation
  • Update PlainTextExtractor to use .all() since HTML is removed from content
  • Add domain to all()
  • Update csv exports on app so that they are rfc4180 compliant
  • Apply GitHub workflows to main branch
  • Consistent formating on DataFrameLoader.scala
  • Update tests as needed
  • Resolves Remove http headers, and html on webpages() #538

How should this be tested?

- Filter HTTP headers, and HTML from content on webpages so that it is consistent with the app implementation, and the ARCH implementation
- Update PlainTextExtractor to use .all() since HTML is removed from content
- Add domain to all()
- Update csv exports on app so that they are rfc4180 compliant
- Apply GitHub workflows to main branch
- Consistent formating on DataFrameLoader.scala
- Update tests as needed
- Resolves #538
@codecov
Copy link

codecov bot commented May 27, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@988f70f). Click here to learn what that means.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #539   +/-   ##
=======================================
  Coverage        ?   92.93%           
  Complexity      ?       42           
=======================================
  Files           ?       39           
  Lines           ?      835           
  Branches        ?       52           
=======================================
  Hits            ?      776           
  Misses          ?       35           
  Partials        ?       24           

@ruebot
Copy link
Member Author

ruebot commented May 27, 2022

I think we should consider renaming the content column produced by .all() to raw_content or something along those lines. So, it is clearly differentiated from the content column produced by .webpages(). Otherwise, it's easy to presume you should be able to apply particular matchbox utilities or udfs to the column, when in actuality you can't be cause the correct data isn't there. I'm linking of extractBoilerPipe, extractLink, and extractImageLinks, all of which require HTML to still be present in the content column.

Does that make sense? If so, I can get it updated here, and in archivesunleashed/aut-docs#117 as well.

I'm also working on getting the PySpark example notebooks updated as well to reflect this. In the process there, I realized we never made an equivalent for keepValidPagesDF in the Python implementation. I'm struggling with how to reimplement that other than verbosely doing in the documentation examples like I here 🤷‍♂️

ruebot added a commit to archivesunleashed/notebooks that referenced this pull request May 29, 2022
@ruebot ruebot added this to In Progress in 1.0.0 Release of AUT May 29, 2022
@ianmilligan1
Copy link
Member

Does that make sense?

Makes a lot of sense – I like raw_content. I can test the PR when it's updated and ready to merge!

@ruebot
Copy link
Member Author

ruebot commented May 30, 2022

@ianmilligan1 ready for testing when you have time.

Documentation PR: archivesunleashed/aut-docs#117
Updated PySpark notebooks: https://github.com/archivesunleashed/notebooks/tree/main/PySpark%20Examples

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and all works! (Except for the discovery of Spark 3.0.0 having a bug, as discussed on Slack!)

ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request May 30, 2022
@ruebot ruebot merged commit 9011c92 into main May 30, 2022
1.0.0 Release of AUT automation moved this from In Progress to Done May 30, 2022
@ruebot ruebot deleted the issue-538 branch May 30, 2022 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Remove http headers, and html on webpages()
2 participants