Various refactorings and performance improvements#34
Merged
Conversation
These functions are auto generated by the record construct of java.
62f6172 to
3aa7579
Compare
pmbittner
requested changes
May 31, 2022
The new contract doesn't handle the special case of a missing dot in the given file extension. This is fine, because this method is only called with constant strings, so no user input has to be changed. It may have been a better choice to not include the dot in the expected file extension, but in that case all the call sites would have to be updated (which results in all file extensions handling functions and constants being modified), so including the dot is the saner choice in this code base. Also the name `hasExtension` doesn't imply any additional file type checks. So it shouldn't check if a file is a directory (the current use sites do check this for themselves).
One line break regex didn't match a standalone carriage return character, but old Macintosh versions did use it, so the version which handles this case including the two other standard ones (carriage return followed by newline, used by windows, and a standalone newline, used by everyone else) is used.
This test case only failed because of new line encoding issues (mismatch between "\r\n" and "\n").
By using string templates with brace replacements the actual log message
in the code gets easier to read and is more maintainable than string
concatenation.
Notable mistakes made when using the Logger where:
- Exceptions where passed as the last argument, but have to be passed as
first argument for tinylog's special exception printing to activate.
- The escape sequence "%n" is not interpreted by tinylog.
(`PrintStream.printf` of `System.out` interprets "%n" as platform
independent new line.)
- Less "{}" placeholders in the template than the number of argument
provided.
Not all log messages are preserved exactly. Most changes are
- Changing the order of the log message and an exception. By using
tinylog's first exception argument the exception is printed after the
message.
- Adding or removing colons and exclamation marks. Mostly to have
correct punctuation when an exception is printed by tinylog, which
adds a colon by itself.
- Replacing double quotes by single quotes. These are easier to read in
the code itself.
Fixes a maven warning about platform dependent encoding by stating the actual encoding of the source files. UTF-8 should already be the default on most platforms, so nothing should actually change.
This improves readability and clutter (and doesn't degrade performance, because the compiler should optimise this the same way in both cases).
Also makes the intend clearer.
The line graph export tests get more accurate and are ready for idiomatic line termination handling.
The `java.io.File` API is deprecated in favor of `java.nio.Files` and in most cases using `java.nio.Files` results in code which is easier to read. This might not be the best usage example, but I do one bit at a time.
Streams make the code easier to understand and maintain. Simultaneously this code will use less memory because the file is read lazily. By using `Files.lines` all the platform dependent line endings are also handled automatically.
Streams make the code easier to understand and maintain. By using `Files.lines` all the platform dependent line endings are also handled automatically. Because streams do not handle `IOException`s it is necessary to wrap them in `UncheckedIOException`s and unpack them outside the stream. The Apache version `org.apache.commons.lang3.stream.Streams.FailableStream` would have been a great substitute, but it doesn't provide a `flatMap` method. The order of `hasExtension` and `isRegularFile` changed to limit (comparably) slow IO actions by performing pure filters first.
By using `BufferedReader`s all the platform dependent line endings are handled automatically. No weird splitting, joining and replacing needed. This also reduces space usage, because not all lines are kept in memory at the same time.
By using `BufferedReader`s all the platform dependent line endings are handled automatically. This also reduces space usage if the input is read from a file, because not all lines are duplicated in memory at the same time. Call sites which generate a `String` and have to use `StringReader` are subject to future refactoring.
Closing the `BufferedReader` is conventionally handled by the function which opens it (unless otherwise noted), making the exception handling less error prone. Because streams do not handle `IOException`s it is necessary to wrap them in `UncheckedIOException`s and unpack them outside the stream. The Apache version `org.apache.commons.lang3.stream.Streams.FailableStream` would have been a great substitute, but it doesn't provide a `flatMap` method. Because a `BufferedReader` can't be read twice, a utility method `assertEqualToFile` is created which uses a new `BufferedReader` to parse line endings correctly and efficiently (instead of using `normalizedLineEndings` like the old `assertEqualFileContent` did).
By reusing `assertEqualToFile` it is possible to handle line endings nicer and more efficient.
By only matching one regex with a little bit more complexity and differentiating the results by a capture group the performance of checking for matches gets way better. Because this regex has to be matched for each line it is compiled only once to save this overhead for most matches.
For performance reasons regexes should be compiled only once, because it takes considerable time to compile them. The simplest solution is to compile these regexes in a static context, so that they are only compiled when the class is loaded. A further optimisation might be to compile them lazily (only compile regexes when they are actually needed) to prevent slow startup times. Because regexes are mostly implementation details they are made private unless they clearly belong to a public API (e.g. `LINEBREAK_REGEX`).
Because string concatenation has to allocate a new string for each allocation the time required for concatenating `n` lines is `O(n^2)` (with a constant depending on the line length). This can be optimised to `O(n)` (this is the purpose of `StringBuilder` after all). In this case a list of lines is used because this avoids problems with line ending encoding (which is currently handled by always using "\r\n" internally). This also makes it easier to insert symbols at the start of the line (for example plus or minus for diff types). The use sites of the `DiffNode` label are refactored to prevent duplicate work (for example joining lines and immediately splitting them again).
The code type is not needed in all code paths. It needs some time, which is not negligible (regular expression match), to parse the code type, so it should not be parsed in all code paths. Consequently, the variable can be declared in a smaller scope to make that fact obvious to the reader (and a bit more to the compiler).
This is already handled by the following elimination of all whitespace and avoids an unnecessary allocation.
By using `BufferedReader`s all the platform dependent line endings are handled automatically. This also reduces space usage if the input is read from a file, because not all lines are duplicated in memory at the same time. Call sites which generate a `String` and have to use `StringReader` are subject to future refactoring.
The time specific children of `DiffNode` do not guarantee any ordering, but the Starfold transformation requires these nodes to be ordered: - When `respectNodeOrder` is active. - For the label of the merged node, which is composed sequentially from the merged nodes in the order of the given children. For that reason Starfold cannot use `DiffNode.getChildren` and has to use `DiffNode.getAllChildren` instead. Because this list contains both /before/ and /after/ children the actual computation has to filter the children to a specific time. It would also be possible to run the two cases (/before/ and /after/) in parallel, but this a consumer with the semantic of a coroutine and a finalisation step to merge the remaining nodes. But this introduces complexity which doesn't seem adequate.
Because the invariant `node.getBeforeParent().getBeforeChildren().contains(node)` holds for each `DiffNode node` the `beforeChildren` and `afterChildren` lists can be computed and do not have to be stored. By using this invariant, the check if a node is a before/after child can be implemented in `O(1)` in contrast to the `O(n)` version which uses the `beforeChildren` and `afterChildren` lists. This speedup is especially important for parsing `DiffTree`s. It is also possible to store `Set`s instead of `List`s for `DiffNode` children to get a similar time complexity, but as demonstrated by this commit that memory can be saved. If there is a need to compute a `beforeChildren` or `afterChildren` set (these lists didn't actually guarantee any ordering) more efficiently than by traversing `childOrdering` (which requires about the double time) this would be a valid approach.
Computing the diff type is not expensive, but it's always a good thing to avoid unnecessary work.
This saves memory by shortening the list of possible candidates.
3aa7579 to
8ba61f0
Compare
ibbem
added a commit
that referenced
this pull request
Jun 2, 2022
pmbittner
approved these changes
Jun 9, 2022
Member
|
Thank you very much for this clean pull-request with many useful improvements in code style and performance. I will merge now. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
These are all the refactorings I have done so far.
Most changes are small refactorings, but there are notable changes and goals:
mining)DiffNodesnormalizedLineEndingsandreadAsStringBufferedReaderfor importing line graphsBufferedReaderto parse diff treesBufferedReaderto parse import analysis metadataPaths instead ofStringsjava.nio.Filesinstead ofjava.io.FileDiffNode.hashCodeDiffNode.getID