Add a number of additional app extractors. #451
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor, PDFInformationExtractor, PresentationProgramInformationExtractor, SpreadsheetInformationExtractor, TextFilesInformationExtractor, VideoInformationExtractor, WebGraphExtractor, WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
I'll get an associated documentation PR opened up later today.
Codecov Report
@@ Coverage Diff @@
## master #451 +/- ##
==========================================
+ Coverage 74.55% 76.72% +2.17%
==========================================
Files 40 49 +9
Lines 1285 1422 +137
Branches 246 264 +18
==========================================
+ Hits 958 1091 +133
- Misses 211 215 +4
Partials 116 116
Documentation PR: archivesunleashed/aut-docs#57
ianmilligan1 left a comment
Worked nicely!
Note that the example command bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor should have been WebGraphExtractor but I don't think that affects anything. Just in case the PR text is used in the future for any testing or copy-and-pasting.
Oh, sorry. That was copypasta on my part.
Heh no worries @ruebot - it was actually good to see robust error messages.
GitHub issue(s): #447
What does this Pull Request do?
Add a number of additional app extractors.
AudioInformationExtractor, ImageInformationExtractor,
PDFInformationExtractor, PresentationProgramInformationExtractor,
SpreadsheetInformationExtractor, TextFilesInformationExtractor,
VideoInformationExtractor, WebGraphExtractor,
WordProcessorInformationExtractor
How should this be tested?
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/AudioInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor ImageInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/ImageInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PDFInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PDFInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PresentationProgramInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/PresentationProgramInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor SpreadsheetInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/SpreadsheetInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor TextFilesInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/TextFilesInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor VideoInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/VideoInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WordProcessorInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WordProcessorInformationExtractor
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor WebGraphInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/447-test/WebGraphInformationExtractor
Additional Notes:
- WebGraphExtractor as an additional option, since it is slightly different than the csv output of DomainGraphExtractor
- WebPagesExtractor to produce similar, and more enhanced output than PlainTextExtractor. We might want to consider removing PlainTextExtractor in the future
- csv output.
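The test commands under "How should this be tested?" differ only in the extractor name and the matching output directory, so they could be generated in a loop. The following is a minimal dry-run sketch, not part of the PR: it only prints the commands. The AUT_JAR, INPUT and OUTPUT_DIR variables and their default values are placeholders assumed for illustration, and the list uses WebGraphExtractor, the name the review comment says the last command should have used.

```shell
#!/bin/sh
# Dry run: print one spark-submit command per extractor added in this PR.
# AUT_JAR, INPUT and OUTPUT_DIR are hypothetical placeholders; override them
# in the environment to match a local checkout and sample data.
AUT_JAR="${AUT_JAR:-/path/to/aut-fatjar.jar}"
INPUT="${INPUT:-/path/to/sample-data/*}"
OUTPUT_DIR="${OUTPUT_DIR:-/path/to/output}"

# Extractor names as listed in the PR description.
EXTRACTORS="AudioInformationExtractor ImageInformationExtractor
PDFInformationExtractor PresentationProgramInformationExtractor
SpreadsheetInformationExtractor TextFilesInformationExtractor
VideoInformationExtractor WordProcessorInformationExtractor
WebGraphExtractor"

build_commands() {
  # Word splitting on $EXTRACTORS is intentional: one iteration per name.
  for extractor in $EXTRACTORS; do
    printf 'bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner %s --extractor %s --input %s --output %s/%s\n' \
      "$AUT_JAR" "$extractor" "$INPUT" "$OUTPUT_DIR" "$extractor"
  done
}

build_commands
```

Printing rather than executing keeps the sketch safe to run anywhere. Once the output has been reviewed, it could be run from a Spark installation directory, for example by piping it to sh.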