Various refactorings and performance improvements by ibbem · Pull Request #34 · VariantSync/DiffDetective

ibbem · 2022-05-31T11:55:45Z

These are all the refactorings I have done so far.

Most changes are small refactorings, but there are notable changes and goals:

Performance improvements (especially for mining)
- 5cecdeb Do not store before and after children in DiffNodes
- 724393b Use BufferedReader to parse full diffs
- 7df8ac6 Only parse the code type when needed
- 34300ba Do not repeatedly concatenate lines when parsing diffs
- 41b2969 Compile regular expressions only once
- fa09f4b Optimize regular expressions for identifying code types
Getting rid of inefficient IO and weird new line handling
- 3fd83ed Refactor CSV exporting using streams
- cf92178 Get rid of normalizedLineEndings and readAsString
- 6856e05 Use a BufferedReader for importing line graphs
- 979c2ce Use a BufferedReader to parse diff trees
- 5573665 Use a BufferedReader to parse import analysis metadata
- 3f591eb Use streams to aggregate commit times
- 2625b6b Use streams to parse the dataset description
- 516d72a Use Paths instead of Strings
- d4cad3b Use java.nio.Files instead of java.io.File
- 2f6b2a7 Do not trim the lines of line graphs when testing
- 7ee72f5 Count lines without building an intermediate array
Explicitly specifying invariants and complying to them
- 8e16bfc Document the contract breach of DiffNode.hashCode
- e6f94c1 Add documentation for DiffNode.getID
- b6d03c7 Refactor Starfold to adhere to DiffNode's invariants

These functions are auto generated by the record construct of java.

The new contract doesn't handle the special case of a missing dot in the given file extension. This is fine, because this method is only called with constant strings, so no user input has to be changed. It may have been a better choice to not include the dot in the expected file extension, but in that case all the call sites would have to be updated (which results in all file extensions handling functions and constants being modified), so including the dot is the saner choice in this code base. Also the name `hasExtension` doesn't imply any additional file type checks. So it shouldn't check if a file is a directory (the current use sites do check this for themselves).

One line break regex didn't match a standalone carriage return character, but old Macintosh versions did use it, so the version which handles this case including the two other standard ones (carriage return followed by newline, used by windows, and a standalone newline, used by everyone else) is used.

This test case only failed because of new line encoding issues (mismatch between "\r\n" and "\n").

By using string templates with brace replacements the actual log message in the code gets easier to read and is more maintainable than string concatenation. Notable mistakes made when using the Logger where: - Exceptions where passed as the last argument, but have to be passed as first argument for tinylog's special exception printing to activate. - The escape sequence "%n" is not interpreted by tinylog. (`PrintStream.printf` of `System.out` interprets "%n" as platform independent new line.) - Less "{}" placeholders in the template than the number of argument provided. Not all log messages are preserved exactly. Most changes are - Changing the order of the log message and an exception. By using tinylog's first exception argument the exception is printed after the message. - Adding or removing colons and exclamation marks. Mostly to have correct punctuation when an exception is printed by tinylog, which adds a colon by itself. - Replacing double quotes by single quotes. These are easier to read in the code itself.

Fixes a maven warning about platform dependent encoding by stating the actual encoding of the source files. UTF-8 should already be the default on most platforms, so nothing should actually change.

This improves readability and clutter (and doesn't degrade performance, because the compiler should optimise this the same way in both cases).

Also makes the intend clearer.

The line graph export tests get more accurate and are ready for idiomatic line termination handling.

The `java.io.File` API is deprecated in favor of `java.nio.Files` and in most cases using `java.nio.Files` results in code which is easier to read. This might not be the best usage example, but I do one bit at a time.

Streams make the code easier to understand and maintain. Simultaneously this code will use less memory because the file is read lazily. By using `Files.lines` all the platform dependent line endings are also handled automatically.

Streams make the code easier to understand and maintain. By using `Files.lines` all the platform dependent line endings are also handled automatically. Because streams do not handle `IOException`s it is necessary to wrap them in `UncheckedIOException`s and unpack them outside the stream. The Apache version `org.apache.commons.lang3.stream.Streams.FailableStream` would have been a great substitute, but it doesn't provide a `flatMap` method. The order of `hasExtension` and `isRegularFile` changed to limit (comparably) slow IO actions by performing pure filters first.

By using `BufferedReader`s all the platform dependent line endings are handled automatically. No weird splitting, joining and replacing needed. This also reduces space usage, because not all lines are kept in memory at the same time.

By using `BufferedReader`s all the platform dependent line endings are handled automatically. This also reduces space usage if the input is read from a file, because not all lines are duplicated in memory at the same time. Call sites which generate a `String` and have to use `StringReader` are subject to future refactoring.

Closing the `BufferedReader` is conventionally handled by the function which opens it (unless otherwise noted), making the exception handling less error prone. Because streams do not handle `IOException`s it is necessary to wrap them in `UncheckedIOException`s and unpack them outside the stream. The Apache version `org.apache.commons.lang3.stream.Streams.FailableStream` would have been a great substitute, but it doesn't provide a `flatMap` method. Because a `BufferedReader` can't be read twice, a utility method `assertEqualToFile` is created which uses a new `BufferedReader` to parse line endings correctly and efficiently (instead of using `normalizedLineEndings` like the old `assertEqualFileContent` did).

By reusing `assertEqualToFile` it is possible to handle line endings nicer and more efficient.

By only matching one regex with a little bit more complexity and differentiating the results by a capture group the performance of checking for matches gets way better. Because this regex has to be matched for each line it is compiled only once to save this overhead for most matches.

For performance reasons regexes should be compiled only once, because it takes considerable time to compile them. The simplest solution is to compile these regexes in a static context, so that they are only compiled when the class is loaded. A further optimisation might be to compile them lazily (only compile regexes when they are actually needed) to prevent slow startup times. Because regexes are mostly implementation details they are made private unless they clearly belong to a public API (e.g. `LINEBREAK_REGEX`).

Because string concatenation has to allocate a new string for each allocation the time required for concatenating `n` lines is `O(n^2)` (with a constant depending on the line length). This can be optimised to `O(n)` (this is the purpose of `StringBuilder` after all). In this case a list of lines is used because this avoids problems with line ending encoding (which is currently handled by always using "\r\n" internally). This also makes it easier to insert symbols at the start of the line (for example plus or minus for diff types). The use sites of the `DiffNode` label are refactored to prevent duplicate work (for example joining lines and immediately splitting them again).

The code type is not needed in all code paths. It needs some time, which is not negligible (regular expression match), to parse the code type, so it should not be parsed in all code paths. Consequently, the variable can be declared in a smaller scope to make that fact obvious to the reader (and a bit more to the compiler).

This is already handled by the following elimination of all whitespace and avoids an unnecessary allocation.

By using `BufferedReader`s all the platform dependent line endings are handled automatically. This also reduces space usage if the input is read from a file, because not all lines are duplicated in memory at the same time. Call sites which generate a `String` and have to use `StringReader` are subject to future refactoring.

The time specific children of `DiffNode` do not guarantee any ordering, but the Starfold transformation requires these nodes to be ordered: - When `respectNodeOrder` is active. - For the label of the merged node, which is composed sequentially from the merged nodes in the order of the given children. For that reason Starfold cannot use `DiffNode.getChildren` and has to use `DiffNode.getAllChildren` instead. Because this list contains both /before/ and /after/ children the actual computation has to filter the children to a specific time. It would also be possible to run the two cases (/before/ and /after/) in parallel, but this a consumer with the semantic of a coroutine and a finalisation step to merge the remaining nodes. But this introduces complexity which doesn't seem adequate.

Because the invariant `node.getBeforeParent().getBeforeChildren().contains(node)` holds for each `DiffNode node` the `beforeChildren` and `afterChildren` lists can be computed and do not have to be stored. By using this invariant, the check if a node is a before/after child can be implemented in `O(1)` in contrast to the `O(n)` version which uses the `beforeChildren` and `afterChildren` lists. This speedup is especially important for parsing `DiffTree`s. It is also possible to store `Set`s instead of `List`s for `DiffNode` children to get a similar time complexity, but as demonstrated by this commit that memory can be saved. If there is a need to compute a `beforeChildren` or `afterChildren` set (these lists didn't actually guarantee any ordering) more efficiently than by traversing `childOrdering` (which requires about the double time) this would be a valid approach.

Computing the diff type is not expensive, but it's always a good thing to avoid unnecessary work.

This saves memory by shortening the list of possible candidates.

pmbittner · 2022-06-09T15:56:02Z

Thank you very much for this clean pull-request with many useful improvements in code style and performance. I will merge now.

ibbem added 4 commits May 21, 2022 12:05

Remove duplicate constants

159c479

Remove unnecessary record functions

95a9f27

These functions are auto generated by the record construct of java.

Eliminate duplicate computation

3a1a9b8

Reuse utility methods to clarify intend

65064f6

pmbittner self-requested a review May 31, 2022 15:25

pmbittner assigned ibbem May 31, 2022

ibbem force-pushed the benjamin/refactoring branch from 62f6172 to 3aa7579 Compare May 31, 2022 15:26

pmbittner requested changes May 31, 2022

View reviewed changes

ibbem added 22 commits June 1, 2022 16:57

Fix a test case with a broken assert

e08e843

This test case only failed because of new line encoding issues (mismatch between "\r\n" and "\n").

Let git ignore the python cache

40a4c78

Specify UTF-8 as default encoding for all source files

88b37ad

Fixes a maven warning about platform dependent encoding by stating the actual encoding of the source files. UTF-8 should already be the default on most platforms, so nothing should actually change.

Declare variables inside the smallest possible scope

fe840b8

This improves readability and clutter (and doesn't degrade performance, because the compiler should optimise this the same way in both cases).

Use a specialized function to check for blank lines

bbf6b38

Count lines without building an intermediate array

1444fb1

Also makes the intend clearer.

Do not trim the lines of line graphs when testing

5a82b66

The line graph export tests get more accurate and are ready for idiomatic line termination handling.

Use java.nio.Files instead of java.io.File

fc0c087

The `java.io.File` API is deprecated in favor of `java.nio.Files` and in most cases using `java.nio.Files` results in code which is easier to read. This might not be the best usage example, but I do one bit at a time.

Use streams to parse the dataset description

b27d445

Streams make the code easier to understand and maintain. Simultaneously this code will use less memory because the file is read lazily. By using `Files.lines` all the platform dependent line endings are also handled automatically.

Get rid of normalizedLineEndings and readAsString

bba72cb

By reusing `assertEqualToFile` it is possible to handle line endings nicer and more efficient.

Remove an unnecessary String.trim

5e956b7

This is already handled by the following elimination of all whitespace and avoids an unnecessary allocation.

ibbem added 7 commits June 1, 2022 18:04

Add documentation for DiffNode.getID

f7bd043

Document the contract breach of DiffNode.hashCode

33194cd

Do not recompute the diff type if possible

6975bb6

Computing the diff type is not expensive, but it's always a good thing to avoid unnecessary work.

Filter as many DiffNodes in Starfold as possible

8ba61f0

This saves memory by shortening the list of possible candidates.

ibbem force-pushed the benjamin/refactoring branch from 3aa7579 to 8ba61f0 Compare June 2, 2022 07:32

ibbem added a commit that referenced this pull request Jun 2, 2022

Merge pull request #34 from VariantSync/benjamin/refactoring

97c5a20

pmbittner approved these changes Jun 9, 2022

View reviewed changes

pmbittner merged commit 5119f4c into main Jun 9, 2022

ibbem deleted the benjamin/refactoring branch June 10, 2022 08:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Various refactorings and performance improvements#34

Various refactorings and performance improvements#34
pmbittner merged 33 commits intomainfrom
benjamin/refactoring

ibbem commented May 31, 2022 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pmbittner commented Jun 9, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ibbem commented May 31, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pmbittner commented Jun 9, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ibbem commented May 31, 2022 •

edited

Loading