Skip to content

Various refactorings and performance improvements#34

Merged
pmbittner merged 33 commits intomainfrom
benjamin/refactoring
Jun 9, 2022
Merged

Various refactorings and performance improvements#34
pmbittner merged 33 commits intomainfrom
benjamin/refactoring

Conversation

@ibbem
Copy link
Copy Markdown
Collaborator

@ibbem ibbem commented May 31, 2022

These are all the refactorings I have done so far.

Most changes are small refactorings, but there are notable changes and goals:

  • Performance improvements (especially for mining)
    • 5cecdeb Do not store before and after children in DiffNodes
    • 724393b Use BufferedReader to parse full diffs
    • 7df8ac6 Only parse the code type when needed
    • 34300ba Do not repeatedly concatenate lines when parsing diffs
    • 41b2969 Compile regular expressions only once
    • fa09f4b Optimize regular expressions for identifying code types
  • Getting rid of inefficient IO and weird new line handling
    • 3fd83ed Refactor CSV exporting using streams
    • cf92178 Get rid of normalizedLineEndings and readAsString
    • 6856e05 Use a BufferedReader for importing line graphs
    • 979c2ce Use a BufferedReader to parse diff trees
    • 5573665 Use a BufferedReader to parse import analysis metadata
    • 3f591eb Use streams to aggregate commit times
    • 2625b6b Use streams to parse the dataset description
    • 516d72a Use Paths instead of Strings
    • d4cad3b Use java.nio.Files instead of java.io.File
    • 2f6b2a7 Do not trim the lines of line graphs when testing
    • 7ee72f5 Count lines without building an intermediate array
  • Explicitly specifying invariants and complying to them
    • 8e16bfc Document the contract breach of DiffNode.hashCode
    • e6f94c1 Add documentation for DiffNode.getID
    • b6d03c7 Refactor Starfold to adhere to DiffNode's invariants

@pmbittner pmbittner self-requested a review May 31, 2022 15:25
@ibbem ibbem force-pushed the benjamin/refactoring branch from 62f6172 to 3aa7579 Compare May 31, 2022 15:26
Comment thread src/main/java/org/variantsync/diffdetective/diff/GitDiffer.java Outdated
Comment thread src/main/java/org/variantsync/diffdetective/diff/GitDiffer.java Outdated
Comment thread src/main/java/org/variantsync/diffdetective/diff/difftree/DiffNode.java Outdated
Comment thread src/main/java/org/variantsync/diffdetective/diff/difftree/DiffNode.java Outdated
ibbem added 22 commits June 1, 2022 16:57
The new contract doesn't handle the special case of a missing dot in the
given file extension. This is fine, because this method is only called
with constant strings, so no user input has to be changed.

It may have been a better choice to not include the dot in the expected
file extension, but in that case all the call sites would have to be
updated (which results in all file extensions handling functions and
constants being modified), so including the dot is the saner choice in
this code base.

Also the name `hasExtension` doesn't imply any additional file type
checks. So it shouldn't check if a file is a directory (the current use
sites do check this for themselves).
One line break regex didn't match a standalone carriage return
character, but old Macintosh versions did use it, so the version which
handles this case including the two other standard ones (carriage return
followed by newline, used by windows, and a standalone newline, used by
everyone else) is used.
This test case only failed because of new line encoding issues (mismatch
between "\r\n" and "\n").
By using string templates with brace replacements the actual log message
in the code gets easier to read and is more maintainable than string
concatenation.

Notable mistakes made when using the Logger where:
- Exceptions where passed as the last argument, but have to be passed as
  first argument for tinylog's special exception printing to activate.
- The escape sequence "%n" is not interpreted by tinylog.
  (`PrintStream.printf` of `System.out` interprets "%n" as platform
  independent new line.)
- Less "{}" placeholders in the template than the number of argument
  provided.

Not all log messages are preserved exactly. Most changes are
- Changing the order of the log message and an exception. By using
  tinylog's first exception argument the exception is printed after the
  message.
- Adding or removing colons and exclamation marks. Mostly to have
  correct punctuation when an exception is printed by tinylog, which
  adds a colon by itself.
- Replacing double quotes by single quotes. These are easier to read in
  the code itself.
Fixes a maven warning about platform dependent encoding by stating the
actual encoding of the source files. UTF-8 should already be the default
on most platforms, so nothing should actually change.
This improves readability and clutter (and doesn't degrade performance,
because the compiler should optimise this the same way in both cases).
The line graph export tests get more accurate and are ready for
idiomatic line termination handling.
The `java.io.File` API is deprecated in favor of `java.nio.Files` and in
most cases using `java.nio.Files` results in code which is easier to
read.

This might not be the best usage example, but I do one bit at a time.
Streams make the code easier to understand and maintain. Simultaneously
this code will use less memory because the file is read lazily. By using
`Files.lines` all the platform dependent line endings are also handled
automatically.
Streams make the code easier to understand and maintain. By using
`Files.lines` all the platform dependent line endings are also handled
automatically.

Because streams do not handle `IOException`s it is necessary to wrap
them in `UncheckedIOException`s and unpack them outside the stream. The
Apache version `org.apache.commons.lang3.stream.Streams.FailableStream`
would have been a great substitute, but it doesn't provide a `flatMap`
method.

The order of `hasExtension` and `isRegularFile` changed to limit
(comparably) slow IO actions by performing pure filters first.
By using `BufferedReader`s all the platform dependent line endings are
handled automatically. No weird splitting, joining and replacing needed.
This also reduces space usage, because not all lines are kept in memory
at the same time.
By using `BufferedReader`s all the platform dependent line endings are
handled automatically. This also reduces space usage if the input is
read from a file, because not all lines are duplicated in memory at the
same time.

Call sites which generate a `String` and have to use `StringReader` are
subject to future refactoring.
Closing the `BufferedReader` is conventionally handled by the function
which opens it (unless otherwise noted), making the exception
handling less error prone.

Because streams do not handle `IOException`s it is necessary to wrap
them in `UncheckedIOException`s and unpack them outside the stream. The
Apache version `org.apache.commons.lang3.stream.Streams.FailableStream`
would have been a great substitute, but it doesn't provide a `flatMap`
method.

Because a `BufferedReader` can't be read twice, a utility method
`assertEqualToFile` is created which uses a new `BufferedReader` to
parse line endings correctly and efficiently (instead of using
`normalizedLineEndings` like the old `assertEqualFileContent` did).
By reusing `assertEqualToFile` it is possible to handle line endings
nicer and more efficient.
By only matching one regex with a little bit more complexity and
differentiating the results by a capture group the performance of
checking for matches gets way better. Because this regex has to be
matched for each line it is compiled only once  to save this overhead
for most matches.
For performance reasons regexes should be compiled only once, because it
takes considerable time to compile them.

The simplest solution is to compile these regexes in a static context,
so that they are only compiled when the class is loaded. A further
optimisation might be to compile them lazily (only compile regexes when
they are actually needed) to prevent slow startup times.

Because regexes are mostly implementation details they are made private
unless they clearly belong to a public API (e.g. `LINEBREAK_REGEX`).
Because string concatenation has to allocate a new string for each
allocation the time required for concatenating `n` lines is `O(n^2)`
(with a constant depending on the line length). This can be optimised to
`O(n)` (this is the purpose of `StringBuilder` after all).

In this case a list of lines is used because this avoids problems with
line ending encoding (which is currently handled by always using "\r\n"
internally). This also makes it easier to insert symbols at the start of
the line (for example plus or minus for diff types).

The use sites of the `DiffNode` label are refactored to prevent
duplicate work (for example joining lines and immediately splitting them again).
The code type is not needed in all code paths. It needs some time, which
is not negligible (regular expression match), to parse the code type, so
it should not be parsed in all code paths.

Consequently, the variable can be declared in a smaller scope to make
that fact obvious to the reader (and a bit more to the compiler).
This is already handled by the following elimination of all whitespace
and avoids an unnecessary allocation.
ibbem added 7 commits June 1, 2022 18:04
By using `BufferedReader`s all the platform dependent line endings are
handled automatically. This also reduces space usage if the input is
read from a file, because not all lines are duplicated in memory at the
same time.

Call sites which generate a `String` and have to use `StringReader` are
subject to future refactoring.
The time specific children of `DiffNode` do not guarantee any ordering,
but the Starfold transformation requires these nodes to be ordered:
- When `respectNodeOrder` is active.
- For the label of the merged node, which is composed sequentially from
  the merged nodes in the order of the given children.

For that reason Starfold cannot use `DiffNode.getChildren` and has to
use `DiffNode.getAllChildren` instead. Because this list contains both
/before/ and /after/ children the actual computation has to filter the
children to a specific time.

It would also be possible to run the two cases (/before/ and /after/) in
parallel, but this a consumer with the semantic of a coroutine and a
finalisation step to merge the remaining nodes. But this introduces
complexity which doesn't seem adequate.
Because the invariant
  `node.getBeforeParent().getBeforeChildren().contains(node)`
holds for each `DiffNode node` the `beforeChildren` and `afterChildren`
lists can be computed and do not have to be stored.

By using this invariant, the check if a node is a before/after child can
be implemented in `O(1)` in contrast to the `O(n)` version which uses
the `beforeChildren` and `afterChildren` lists. This speedup is
especially important for parsing `DiffTree`s.

It is also possible to store `Set`s instead of `List`s for `DiffNode`
children to get a similar time complexity, but as demonstrated by this
commit that memory can be saved. If there is a need to compute a
`beforeChildren` or `afterChildren` set (these lists didn't actually
guarantee any ordering) more efficiently than by traversing
`childOrdering` (which requires about the double time) this would be a
valid approach.
Computing the diff type is not expensive, but it's always a good thing
to avoid unnecessary work.
This saves memory by shortening the list of possible candidates.
@ibbem ibbem force-pushed the benjamin/refactoring branch from 3aa7579 to 8ba61f0 Compare June 2, 2022 07:32
@pmbittner
Copy link
Copy Markdown
Member

Thank you very much for this clean pull-request with many useful improvements in code style and performance. I will merge now.

@pmbittner pmbittner merged commit 5119f4c into main Jun 9, 2022
@ibbem ibbem deleted the benjamin/refactoring branch June 10, 2022 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants