@ruebot released this Jun 3, 2020

Documentation

Release Notes

Full Changelog

Closed issues:

  • Broken link in documentation #476
  • Improve udfs/package.scala test coverage #473
  • Remove tabDelimit #471
  • Remove Extract Entities #469
  • PEP8 Naming - UDFs, App method names, DataFrame names, and filters. #468
  • Python UDFs - class or not? #467
  • Remove ExtractImageDetailsDF.scala #464
  • github-site-deploy uses password-based authentication, which is being deprecated by GitHub #461
  • Implement Python versions of Serializable APIs #410
  • Implement Python versions of App utilities #409
  • Implement Python versions of Matchbox utilities #408
  • Improve TupleFormatter.scala test coverage #59
  • Create tests for NERCombinedJson.scala #53
  • Create tests for NER3Classifier.scala #52
  • Create tests for ExtractEntities.scala #48

Merged pull requests:

  • Remove RDD suffixes on file, class, and object names. #479 (ruebot)
  • PEP8 Python app method names. #477 (ruebot)
  • Move Python UDF methods out of their own class. #475 (ruebot)
  • Add DataFrame udf tests. #474 (ruebot)
  • Remove tabDelimit. #472 (ruebot)
  • Remove NER functionality. #470 (ruebot)
  • Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466 (ruebot)
  • Remove ExtractImageDetailsDF; resolves #464. #465 (ruebot)
  • Implement Scala Matchbox UDFs in Python. #463 (ruebot)
  • Import clean-up for df package. #462 (ruebot)

@ruebot released this May 5, 2020

Documentation

Release Notes

Full Changelog

Implemented enhancements:

  • Update PlainTextExtractor to just extract text #452
  • Migration of all RDD functionality over to DataFrames #223

Fixed bugs:

  • DomainFrequencyExtractor should remove WWW prefix #456
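
To illustrate why the WWW prefix matters for domain counts, here is a minimal plain-Python sketch. It is not the toolkit's RemovePrefixWWWDF implementation; the function names `remove_www_prefix` and `domain_frequency` are hypothetical and exist only for this example. It shows that without prefix removal, www.example.com and example.com would be counted as two separate domains.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def remove_www_prefix(host: str) -> str:
    """Strip a single leading 'www.' label (case-insensitive). Hypothetical helper."""
    return re.sub(r"^www\.", "", host, flags=re.IGNORECASE)


def domain_frequency(urls):
    """Count URLs per host after removing the 'www.' prefix. Hypothetical helper."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # lower-cased host, or None if absent
        if host:
            counts[remove_www_prefix(host)] += 1
    return counts


urls = [
    "http://www.example.com/a",
    "https://example.com/b",
    "http://WWW.archive.org/",
]
frequencies = domain_frequency(urls)
```

With the prefix removed, both example.com URLs fall under a single key, which is the behaviour requested in #456.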

Closed issues:

  • For extractor (spark-submit) job, set Spark app name to be the extractor job name. #458
  • Remove RDD options from app #449
  • Add parquet as an app format option #448
  • Add datathon derivatives to app (binary info, web pages, web graph) #447
  • Update Java 8 instructions for MacOS #445
  • Add spark-submit to README #444

Merged pull requests:

  • [skip travis] README updates #460 (ruebot)
  • Set spark-submit app name to be "aut - extractorName". #459 (ruebot)
  • Add RemovePrefixWWWDF to DomainFrequencyExtractor. #457 (ruebot)
  • Updating Java install instructions for MacOS, resolves #445 #455 (ianmilligan1)
  • Add option to save to Parquet for app. #454 (ruebot)
  • Update PlainTextExtractor to output a single column; text. #453 (ruebot)
  • Add a number of additional app extractors. #451 (ruebot)
  • Remove RDD option in app; DataFrame only now. #450 (ruebot)
  • [skip-travis] Add spark-submit option to README; resolves #444. #446 (ruebot)

@ruebot released this Apr 15, 2020

Documentation

Release Notes

Full Changelog

Implemented enhancements:

  • Discussion: Restyle UDFs in the context of DataFrames #425
  • Add alt text column to imageGraph (imageLinks) #420
  • UDFs that filter on url should also filter on src #418

Fixed bugs:

  • CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
  • DomainGraphExtractor produces different output in RDD vs DF #436
  • Command line app fails because of missing log4j configuration #433

Closed issues:

  • Remove GraphXML and ExtractGraphX #442
  • Use Monochromatic Ids instead of hash to produce network identifiers. #440
  • Add graphml output to DomainGraphExtractor #435
  • Add webgraph, imagegraph, webpages, etc. to command line app #431
  • Rename imageLinks to imageGraph #419

Merged pull requests:

@ruebot released this Feb 6, 2020

Documentation

Release Notes

Full Changelog

Implemented enhancements:

  • Enhance keepValidPages #359
  • Add discardLanguage filter #352
  • Add crawl_date to binary DataFrames and imageLinks #413

Fixed bugs:

  • textFiles does not filter properly #390
  • DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed issues:

  • .webpages() additional tokenized columns? #402
  • Test and documentation inventory #372
  • Missing doc comments #392
  • Bug in ArcTest? Why run RemoveHTML? #369
  • UDF CaMeL cASe consistency issues #368
  • ExtractDomain or ExtractBaseDomain? #367
  • Align DataFrame boilerplate in Python and Scala #366
  • Create a ComputeSHA1 method #363
  • Discussion: Should we align our Named Entity Recognition output with WANE format? #297
  • DataFrame discussion: open thread #190

Merged pull requests:

@ruebot released this Aug 21, 2019

aut-0.18.0 (2019-08-21)

Full Changelog

Implemented enhancements:

  • Add method for unknown extensions in binary extractions #343
  • Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
  • Add filter/keep by http status to RecordLoader class #315
  • Audio binary object extraction #307
  • Video binary object extraction #306
  • Powerpoint binary object extraction #305
  • Doc binary object extraction #304
  • Spreadsheet binary object extraction #303
  • PDF binary object extraction #302
  • Test aut with Apache Spark 2.4.0 #295
  • Replace hashing of unique ids with .zipWithUniqueId() #243
  • Integration of neural network models for image analysis #240
  • More complete Twitter Ingestion #194
  • Image Search Functionality #165
  • feature request: log when loadArchives opens and closes warc files in a dir #156
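
The enhancement in #243 replaces hashed identifiers with Spark's `.zipWithUniqueId()`. As a rough illustration of the id scheme Spark documents for that method (items in the k-th of n partitions receive ids k, n+k, 2n+k, and so on), here is a plain-Python sketch that models partitions as nested lists. The function name `zip_with_unique_id` is hypothetical, and this is not Spark or toolkit code.

```python
def zip_with_unique_id(partitions):
    """Pair each item with a unique id, mimicking the documented scheme.

    The item at position i of partition k (out of n partitions) gets
    id i * n + k, so ids never collide across partitions and no global
    count of the data is needed.
    """
    n = len(partitions)
    return [
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]


# Three hypothetical partitions of domain names.
partitions = [["a.org", "b.org"], ["c.org"], ["d.org", "e.org", "f.org"]]
with_ids = zip_with_unique_id(partitions)
all_ids = [uid for part in with_ids for _, uid in part]
```

Unlike hashing, this scheme cannot produce collisions, although the resulting ids are not necessarily consecutive.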

Fixed bugs:

  • DataFrame commands throwing java.lang.NullPointerException on example data #320
  • Class issues when using aut-0.17.0-fatjar.jar #313
  • Image extraction does not scale with number of WARCs #298
  • ExtractDomain mistakenly checks source first then url #277
  • Improve ExtractDomain to Better Isolate Domains #269

Closed issues:

  • Inconsistency in ArchiveRecord.getContentBytes #334
  • Rationalize computeHash and ComputeMD5 #333
  • Test additional Java versions with TravisCI #324
  • Remove Twitter/tweet analysis #322
  • Trouble testing s3 connectivity #319
  • Depfu Error: No dependency files found #309
  • Strategy to deal with conflict between application and Spark distribution dependencies #308
  • SaveImageTest.scala should delete saved image file #299
  • Remove Deprecated ExtractGraph.scala file for next release. #291
  • DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
  • CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279
  • Maven build warning during release #273
  • Improve DataFrameLoader.scala test coverage #265
  • Improve package.scala test coverage #263
  • Discussion: Idiom for loading DataFrames #231
  • DataFrame field names: open thread #229
  • DataFrame performance comparison: Scala vs. Python #215
  • TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
  • feature request: ArchiveRecord.archiveFile #164
  • feature request: possibility to query about the progress #162
  • Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
  • Create tests for ExtractGraph.scala #49
  • Setup Victims #5

Merged pull requests:

@ruebot released this Oct 4, 2018

Change Log

aut-0.17.0 (2018-10-04)

Full Changelog

Implemented enhancements:

  • Add EscapeHTML Function for ExtractLinks #266
  • PySpark support #12

Fixed bugs:

  • AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
  • AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
  • Improve ExtractDomain Normalization #239
  • Twitter analysis is broken; see also: json4s/json4s#496 #197
  • Prevent encoding errors in PySpark #122

Closed issues:

  • Cannot skip bad record while reading warc file #267
  • Why did Scalastyle not reject null values in TweetUtilTest #255
  • Create UDF to combine basic text filtering features #253
  • spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
  • CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
  • Extract images out of images DataFrame and store to disk #232
  • Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
  • DataFrames for image analysis #220
  • The attempt to upgrade Spark version to 2.3.0 is not successful #218
  • Convert nulls to Option(T) #212
  • Bringing Scala DataFrames into PySpark #209
  • What is AUT? #208
  • Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
  • Codify creation of standard derivatives into apps #195
  • TweetUtils - support fulltext #192
  • Combine UDFs into appropriate objects #187
  • Register Scala functions for use in Pyspark #148
  • PySpark performance bottlenecks: counting values #130
  • Redesign of PySpark DataFrame interface for filtering #120
  • Improve RecordLoader.scala test coverage #60

Merged pull requests:

@ruebot released this Apr 26, 2018

Full Changelog

Implemented enhancements:

  • Revisit approach to .keepValidPages() #177

Closed issues:

  • keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199
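
The bug in #199 arises when a MIME type is compared against the full Content-Type value, which may carry parameters such as a charset. The following minimal Python sketch shows one way to compare only the media type; the function name `is_html_mime_type` is hypothetical, and the check against text/html alone is an assumption for illustration, not the toolkit's actual keepValidPages logic.

```python
def is_html_mime_type(content_type: str) -> bool:
    """Return True when the media type is text/html, ignoring parameters.

    'text/html; charset=UTF-8' and 'text/html' are treated the same,
    because everything after the first ';' is a parameter, not part of
    the media type itself.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


results = [
    is_html_mime_type("text/html"),
    is_html_mime_type("text/html; charset=UTF-8"),
    is_html_mime_type("TEXT/HTML;charset=utf-8"),
    is_html_mime_type("image/png"),
]
```

An exact string comparison against "text/html" would reject the second and third values, which is the symptom the issue describes.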

Merged pull requests:

@ruebot released this Apr 11, 2018

aut-0.15.0 (2018-04-11)

Full Changelog

Implemented enhancements:

  • Clean-up scaladoc comments #184

Closed issues:

  • Rename package io.archivesunleashed.io #188
  • Major Refactoring: RecordRDD #180
  • Major refactoring: matchbox cleanup #179
  • Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178

Merged pull requests:

This Change Log was automatically generated by github_changelog_generator

@ruebot released this Mar 20, 2018

Full Changelog

Closed issues:

  • Incorporate Scala UDFs into Auto-documentation #176

Merged pull requests:
