Make webpages() consistent across aut and ARCH. #539

ruebot · 2022-05-27T20:10:44Z

GitHub issue(s):

Remove http headers, and html on webpages() #538

What does this Pull Request do?

Make webpages() consistent across aut and ARCH.

- Filter HTTP headers and HTML from content on webpages() so that it is consistent with the app implementation and the ARCH implementation
- Update PlainTextExtractor to use .all() since HTML is removed from content
- Add domain to all()
- Update CSV exports on app so that they are RFC 4180 compliant
- Apply GitHub workflows to main branch
- Consistent formatting on DataFrameLoader.scala
- Update tests as needed
- Resolves Remove http headers, and html on webpages() #538
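
For readers unfamiliar with what RFC 4180 compliance implies for exported rows, here is a rough standalone sketch using Python's standard csv module. It is not the Scala code changed in this PR, and the column names and sample row are made up for illustration. Fields containing a comma, a double quote, or a line break are wrapped in double quotes, embedded double quotes are doubled, and records end with CRLF.

```python
import csv
import io


def to_rfc4180(rows):
    """Serialize rows to CSV text using RFC 4180 style quoting."""
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # an embedded " is written as ""
        lineterminator="\r\n",      # RFC 4180 records end with CRLF
    )
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical export rows; the content field has a comma, quotes, and a newline.
rows = [
    ["crawl_date", "domain", "content"],
    ["20220527", "example.com", 'He said "hi", then left.\nSecond line.'],
]
text = to_rfc4180(rows)

# Parsing the text back should recover the original rows unchanged.
round_trip = list(csv.reader(io.StringIO(text, newline="")))
assert round_trip == rows
```

The round trip is the useful property here: any RFC 4180 aware reader can recover multi-line page text without the rows breaking apart.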

How should this be tested?

GitHub actions build
Testing using documentation updates: Documentation updates for https://github.com/archivesunleashed/aut/pu… aut-docs#117

codecov · 2022-05-27T20:26:42Z

Codecov Report

❗ No coverage uploaded for pull request base (main@988f70f).
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #539   +/-   ##
=======================================
  Coverage        ?   92.93%           
  Complexity      ?       42           
=======================================
  Files           ?       39           
  Lines           ?      835           
  Branches        ?       52           
=======================================
  Hits            ?      776           
  Misses          ?       35           
  Partials        ?       24

ruebot · 2022-05-27T23:08:08Z

I think we should consider renaming the content column produced by .all() to raw_content, or something along those lines, so that it is clearly differentiated from the content column produced by .webpages(). Otherwise, it's easy to presume you should be able to apply particular matchbox utilities or UDFs to the column, when in actuality you can't because the correct data isn't there. I'm thinking of extractBoilerPipe, extractLink, and extractImageLinks, all of which require HTML to still be present in the content column.

Does that make sense? If so, I can get it updated here, and in archivesunleashed/aut-docs#117 as well.
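
To make the distinction concrete, here is a minimal Python sketch of the difference between a raw record and filtered page text. It is not aut's implementation: the helper names (strip_http_headers, strip_html) and the sample record are hypothetical, and the HTML stripping is a simple stand-in built on the standard library's HTMLParser.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_http_headers(raw_content):
    # HTTP separates the header block from the body with an empty line.
    _head, sep, body = raw_content.partition("\r\n\r\n")
    return body if sep else raw_content


def strip_html(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


# Hypothetical raw record: HTTP headers followed by an HTML body.
raw_content = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body><h1>Title</h1>"
    "<a href='http://example.com/'>A link</a></body></html>"
)
content = strip_html(strip_http_headers(raw_content))
```

After filtering, the href is gone from content, which is why link or image extraction would need the raw column instead.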

I'm also working on getting the PySpark example notebooks updated to reflect this. In the process, I realized we never made an equivalent for keepValidPagesDF in the Python implementation. I'm struggling with how to reimplement that other than doing it verbosely in the documentation examples, like I did here 🤷‍♂️
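
One possible shape for such a helper, sketched in plain Python rather than PySpark. Everything here is an assumption for illustration: the criteria (a crawl date is present, the record looks like HTML, it is not robots.txt, and the status is 200) are a guess at what "valid page" means, and the field names url, mime_type, http_status, and crawl_date are hypothetical rather than taken from this thread.

```python
def is_valid_page(record):
    """Hypothetical 'valid page' check over a dict-like record."""
    url = (record.get("url") or "").lower()
    mime = record.get("mime_type")
    # Treat a record as HTML if its MIME type says so or its URL ends in htm/html.
    looks_like_html = (
        mime in ("text/html", "application/xhtml+xml")
        or url.endswith(("htm", "html"))
    )
    return (
        record.get("crawl_date") is not None
        and looks_like_html
        and not url.endswith("robots.txt")
        and record.get("http_status") == "200"
    )


# Made-up sample records.
records = [
    {"url": "http://example.com/index.html", "mime_type": "text/html",
     "http_status": "200", "crawl_date": "20220527"},
    {"url": "http://example.com/robots.txt", "mime_type": "text/html",
     "http_status": "200", "crawl_date": "20220527"},
    {"url": "http://example.com/missing", "mime_type": "text/html",
     "http_status": "404", "crawl_date": "20220527"},
    {"url": "http://example.com/logo.png", "mime_type": "image/png",
     "http_status": "200", "crawl_date": "20220527"},
]
valid_urls = [r["url"] for r in records if is_valid_page(r)]
```

Wrapping the conditions in one named predicate would keep the documentation examples short; the same conditions could presumably be expressed as a chain of DataFrame filters.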

ianmilligan1 · 2022-05-29T18:39:39Z

Does that make sense?

Makes a lot of sense – I like raw_content. I can test the PR when it's updated and ready to merge!

ruebot · 2022-05-30T13:14:56Z

@ianmilligan1 ready for testing when you have time.

Documentation PR: archivesunleashed/aut-docs#117
Updated PySpark notebooks: https://github.com/archivesunleashed/notebooks/tree/main/PySpark%20Examples

ianmilligan1

Tested and all works! (Except for the discovery of Spark 3.0.0 having a bug, as discussed on Slack!)

ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request May 27, 2022

Documentation updates for archivesunleashed/aut#539 and archivesunleashed/aut#534.

1c7d5f1

ruebot mentioned this pull request May 27, 2022

Documentation updates for https://github.com/archivesunleashed/aut/pu… archivesunleashed/aut-docs#117

Merged

ruebot requested a review from ianmilligan1 May 27, 2022 20:12

Merge branch 'main' into issue-538

bc14e1e

ruebot commented May 28, 2022

src/main/scala/io/archivesunleashed/app/CommandLineApp.scala (outdated, resolved)

Remove header option for non-coalesced option.

6d322f7

ruebot added a commit to archivesunleashed/notebooks that referenced this pull request May 29, 2022

Updates for archivesunleashed/aut#539

6963a26

Merge branch 'issue-538' of github.com:archivesunleashed/aut into issue-538

8bc1e66

ruebot added this to In Progress in 1.0.0 Release of AUT May 29, 2022

ruebot added 2 commits May 29, 2022 15:21

Change all content to raw_content.

40e290e

scalafmt

1e71c67

ianmilligan1 approved these changes May 30, 2022

Update Apache Spark version in README.

0da136c

ruebot merged commit 9011c92 into main May 30, 2022

1.0.0 Release of AUT automation moved this from In Progress to Done May 30, 2022

ruebot deleted the issue-538 branch May 30, 2022 20:17
