@ruebot released this Oct 4, 2018

Change Log

aut-0.17.0 (2018-10-04)

Implemented enhancements:

  • Add EscapeHTML Function for ExtractLinks #266
  • PySpark support #12

Fixed bugs:

  • AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
  • AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
  • Improve ExtractDomain Normalization #239
  • Twitter analysis is broken; see also: json4s/json4s#496 #197
  • Prevent encoding errors in PySpark #122

Closed issues:

  • Cannot skip bad record while reading warc file #267
  • Why did Scalastyle not reject null values in TweetUtilTest #255
  • Create UDF to combine basic text filtering features #253
  • spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
  • CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
  • Extract images out of images DataFrame and store to disk #232
  • Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
  • DataFrames for image analysis #220
  • The attempt to upgrade Spark version to 2.3.0 is not successful #218
  • Convert nulls to Option(T) #212
  • Bringing Scala DataFrames into PySpark #209
  • What is AUT? #208
  • Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
  • Codify creation of standard derivatives into apps #195
  • TweetUtils - support fulltext #192
  • Combine UDFs into appropriate objects #187
  • Register Scala functions for use in Pyspark #148
  • PySpark performance bottlenecks: counting values #130
  • Redesign of PySpark DataFrame interface for filtering #120
  • Improve RecordLoader.scala test coverage #60

Merged pull requests:

@ruebot released this Apr 11, 2018

aut-0.15.0 (2018-04-11)

Implemented enhancements:

  • Clean-up scaladoc comments #184

Closed issues:

  • Rename package io.archivesunleashed.io #188
  • Major Refactoring: RecordRDD #180
  • Major refactoring: matchbox cleanup #179
  • Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178

Merged pull requests:

This Change Log was automatically generated by github_changelog_generator

@ruebot released this Feb 28, 2018

aut-0.12.2 (2018-02-28)

Implemented enhancements:

  • ArchiveRecord.warcFile #171
  • Better approach to ids in WriteGraphML & WriteGEXF #168
  • Build pre-filtered networks #109
  • KeepDate UDF should support date range #108
  • Changing keepDate to allow multiple dates, would close #108 #161 (ianmilligan1)

Fixed bugs:

  • Broken GEXF Files Due to < and > characters in node id fields #172
  • There is insufficient memory for the Java Runtime Environment to continue #159
  • AUT Fails on Extracting Text from WARCs #158

Closed issues:

  • RecordLoader.loadArchives fails with nested dirs #169
  • Unparseable date error #163
  • remove angle brackets from ArchiveRecord.getUrl #157
  • Benchmarking Scala vs Python #121
  • Improve WacArcInputFormat.java test coverage #80
  • Improve WacWarcInputFormat.java test coverage #78
  • Improve WarcRecordWritable.java test coverage #77
  • Improve ArcRecordWritable.java test coverage #75
  • Improve ArcRecord.scala test coverage #69
  • Improve RemoveHttpHeader.scala test coverage #57
  • Investigate Jupyter notebooks on Altiscale #37

Merged pull requests:

@ruebot released this Dec 12, 2017

aut-0.12.0 (2017-12-11)

Implemented enhancements:

Fixed bugs:

Closed issues:

  • Create tests for WriteGEXF.scala #138
  • ERROR ArcRecordUtils - Read 1224 bytes but expected 1300 bytes #128
  • WarcRecordUtils.java uses or overrides a deprecated API #127
  • class LanguageIdentifier in package language is deprecated #126
  • multiple versions of scala #125
  • ExtractLinks running slowly #123
  • com.cloudera.cdh:hadoop-ant:pom:0.20.2-cdh3u4 -- errors #118
  • Improve ExtractDate.scala test coverage #64

Merged pull requests:

@ruebot released this Nov 23, 2017

aut-0.11.0 (2017-11-22)

Implemented enhancements:

  • GetCrawlYear to accompany GetCrawlMonth #104
  • Refactor RecordLoader classes #102
  • Adding getCrawlYear in ArchiveRecords, resolves #104 #105 (ianmilligan1)

Closed issues:

  • spark-shell --packages "io.archivesunleashed:aut:0.10.0" fails with not_found dependencies #113
  • update the version of the dependencies not available on the central maven repository #111
  • Bake keepValidPages() into RecordLoader #101
  • Create tests for JsonUtil.scala #66
  • Improve ExtractDomain.scala test coverage #63
  • Improve ExtractImageLinks.scala test coverage #62
  • Improve ExtractLinks.scala test coverage #61
  • Improve StringUtils.scala test coverage #58
  • Improve RemoveHTML.scala test coverage #56
  • Create tests for TweetUtils.scala #54
  • Create tests for ExtractTextFromPDFs.scala #51
  • Create tests for ExtractPopularImages.scala #50
  • Create tests for ExtractBoilerpipeText.scala #47
  • Create tests for ComputeMD5.scala #46
  • Create tests for ComputeImageSize.scala #45

Merged pull requests:

@ruebot released this Oct 2, 2017

aut-0.10.0 (2017-10-02)

Fixed bugs:

  • NER breaks for WARC files? #41

Closed issues:

  • Do we need pythonconverters/ArcRecordConverter.scala? If so, tests. If not, delete it. #65
  • Upgrade to Spark 2 on Altiscale #43
  • Investigate our test coverage according to codecov.io #36
  • Update Scala version #35
  • Update to use Java 8 #32
  • Migrate warcbase-resources to aut-resources #30
  • mvn site-deploy -DskipTests is still failing #27
  • Retarget Hadoop #9

Merged pull requests: