Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for asserting on DataSet, DataTable, DataColumn, DataRow #1419

Merged
merged 3 commits into from
Jan 6, 2021

Conversation

logiclrd
Copy link
Contributor

@logiclrd logiclrd commented Oct 30, 2020

I tried asserting equivalency on DataSet objects and the results were .. dissatisfying. :-)

This PR:

  • Adds equivalency steps for DataSet and related types.
  • Registers these steps in GetDefaultSteps in EquivalencyStepCollection.cs.
  • Adds new overloads of .Should() in AssertionExtensions.cs drive by new subclasses of ReferenceTypeAssertions<>.
  • Adds comprehensive unit testing of the new functionality.
  • Adds documentation of the new functionality for the web site.
  • Updates the ApprovedApi files to include the newly-added API surface.
  • Adds mention of the new feature to the release notes on the Releases page.

IMPORTANT

  • The code complies with the Coding Guidelines for C#.
  • The changes are covered by a new or existing set of unit tests which follow the Arrange-Act-Assert syntax such as is used in this example.
  • If the contribution adds a feature or fixes a bug, please update the release notes, which are published on the website.
  • If the contribution changes the public API the changes needs to be included by running AcceptApiChanges.ps1/AcceptApiChanges.sh.
  • If the contribution affects the documentation, please include your changes in this pull request so the documentation will appear on the website.

Copy link
Member

@dennisdoomen dennisdoomen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Impressive work. I didn't even know people are still using data sets.
❌ Please properly document all types and all public, protected members, especially everything that is exposed to consumers of this library.
🔧 I suggest to rebase this PR on top of #1379. It will be merged soon and contains some breaking changes.
🔧 Most of the review comments apply on many places on this PR. I only mention an issue once, and hope you'll fix similar cases everywhere.

Copy link
Member

@dennisdoomen dennisdoomen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, wasn't finished with the review...

❌ Please also update the release notes and documentation as mentioned in the PR template.

@logiclrd logiclrd force-pushed the JDG_DataSets branch 2 times, most recently from 85850f9 to 3d7e0a2 Compare October 31, 2020 19:58
@logiclrd
Copy link
Contributor Author

I found a bug in #1379 while trying to rebase onto it. MemberToMemberInfoAdapter reports member.Type instead of member.DeclaringType for its own DeclaringType property.

@logiclrd
Copy link
Contributor Author

Rebased, all tests pass :-)

If you foresee merging these changes in, just not before EOD today, would you consider labelling this PR with hacktoberfest-accepted? :-)

@dennisdoomen dennisdoomen added the hacktoberfest-accepted Accepted for hacktoberfest. Will be merged later. label Nov 1, 2020
@logiclrd
Copy link
Contributor Author

logiclrd commented Nov 1, 2020

I see #1379 merged, and have rebased onto develop. :-)

@dennisdoomen
Copy link
Member

Please properly document all types and all public, protected members, especially everything that is exposed to consumers of this library.

You're still busy with this, right?

@logiclrd
Copy link
Contributor Author

logiclrd commented Nov 3, 2020

Please properly document all types and all public, protected members, especially everything that is exposed to consumers of this library.

You're still busy with this, right?

I thought I was finished, did I miss something? docs/_pages/data.md

@logiclrd
Copy link
Contributor Author

logiclrd commented Nov 3, 2020

Ohh, you mean XML docs in the code? I'll go fill them in. :-)

Src/FluentAssertions/FluentAssertions.csproj Outdated Show resolved Hide resolved
Src/FluentAssertions/FluentAssertions.csproj Outdated Show resolved Hide resolved
@logiclrd logiclrd force-pushed the JDG_DataSets branch 2 times, most recently from 0f0ef37 to 185fb73 Compare November 7, 2020 19:48
@logiclrd
Copy link
Contributor Author

@dennisdoomen @jnyrup Can you give me any suggestions regarding performance? I'm trying to use this with DataTable objects with lots of rows, and my code is ... not fast :-)

  • 1000 rows: 2.3 seconds
  • 10000 rows: 21 seconds
  • 100000 rows: 242 seconds

Is it super expensive to set up a nested equivalency context?

@logiclrd
Copy link
Contributor Author

I ran a profiler on it and came up with:

  • The DataRowCollection equivalency step goes back through the "dispatcher" via parent.AssertEquivalencyUsing for DataRow objects. It looks like this dispatching process is expensive, and maybe it could go directly to the DataRow equivalency step?

  • Within each DataRow, it uses parent.AssertEquivalencyUsing on each field of the row as its own nested collection item context, which has the same cost problem.

  • When it creates the nested scope, it captures the stack trace in order to identify the caller (StackTrace.CaptureStackTrace into CallerIdentifier.ExtractVariableNameFrom).

This last one looks like it is possibly by far the most expensive thing in the process. It seems to be triggered by AssertEquivalencyUsing calling UpdateScopeWithReportableContext, which in turn accesses context.CurrentNode.Description. It actually reads it twice, the first time just to check its length, and as far as I can tell it recomputes it each time. Computing it calls GetSubjectId(), which in turn calls CallerIdentifier.DetermineCallerIdentity.

So, I'm thinking:

  1. UpdateScopeWithReportableContext should be careful to trigger the calculation of the property only once (should halve the time).

and

  1. When the DataRowCollection equivalency step is comparing DataRows and the DataRow equivalency step is comparing fields, I can replace GetSubjectId with a much cheaper implementation (should reduce the time by >99%).

Does that seem reasonable? Did I miss anything obvious?

I'll try this later and see what happens. :-)

@logiclrd
Copy link
Contributor Author

I can confirm that the first change did indeed just about halve the running time.

@logiclrd
Copy link
Contributor Author

With UpdateScopeWithReportableContext only accessing the value once and with the getSubjectById method wired up to a Lazy<string>, the execution of the test that generates a DataTable with 10,000 rows, serializes it, deserializes and then does an equivalence comparison dropped from 21 seconds to 1.6 seconds. But, there's still somewhere to go, I think; at 100,000 rows that test takes on the order of 38 seconds, of which at least 37 is taken by the equivalency check.

@logiclrd
Copy link
Contributor Author

Further improvement: By eliminating linear scans from CyclicReferenceDetector, that same test has dropped from 38 seconds to 9.3 seconds. I mean, heck, it's fast enough now that I'm actually averaging multiple runs :-P It's still close to 90% of the entire test run time, though. I want to use it to test DataTables with a million records, so let's see what else can be eked out...

@logiclrd
Copy link
Contributor Author

logiclrd commented Nov 11, 2020

A sample test run with 1,000,000 rows timed as follows:

  • Creating and populating the initial DataTable: 6 seconds
  • Serializing the data to JSON: 2 seconds
  • Deserializing the data from JSON into another DataTable object: 6 seconds
  • dataTable1.Should().BeEquivalentTo(dataTable2): 92 seconds

@logiclrd
Copy link
Contributor Author

With these performance changes, the performance of testing two large DataTable values for equivalency increases approximately 60x at 100,000 rows. The difference is nonlinear, because the old cyclic reference check implementation uses a linear scan. Doubling the number of rows quadruples the time it takes.

I'm not sure what other low-hanging opportunities there are for improvement. When I profile it now, the profiler tells me that 80% of the time is spent doing I/O apparently, but I can't seem to ever catch it doing I/O. Not sure what that's all about :-P

@logiclrd
Copy link
Contributor Author

I've added a profiling fixture for equivalency tests of large DataTable objects, and using it have observed close to linear performance with respect to the number of rows:

|         Method | RowCount |          Mean |       Error |      StdDev |        Gen 0 |        Gen 1 | Gen 2 |   Allocated |
|--------------- |--------- |--------------:|------------:|------------:|-------------:|-------------:|------:|------------:|
| BeEquivalentTo |       10 |      2.445 ms |   0.0322 ms |   0.0269 ms |     289.0625 |       3.9063 |     - |     1.76 MB |
| BeEquivalentTo |      100 |      7.733 ms |   0.1338 ms |   0.2412 ms |    1007.8125 |      23.4375 |     - |     6.08 MB |
| BeEquivalentTo |     1000 |     55.819 ms |   1.0056 ms |   0.8915 ms |    8111.1111 |     222.2222 |     - |       49 MB |
| BeEquivalentTo |    10000 |    539.920 ms |   5.4939 ms |   4.5877 ms |   79000.0000 |   11000.0000 |     - |   478.04 MB |
| BeEquivalentTo |   100000 |  5,586.120 ms | 105.3597 ms | 112.7337 ms |  793000.0000 |  114000.0000 |     - |  4767.03 MB |
| BeEquivalentTo |  1000000 | 54,524.512 ms | 447.0123 ms | 418.1356 ms | 7953000.0000 | 1137000.0000 |     - | 47793.88 MB |

This is a good performance characteristic, but the coefficient on it still feels high. I was hoping to set up a test in our codebase using DataTable objects with 1,000,000 rows in them, but even with the latest performance improvements, 80% of the test time, over 50 seconds, is spent inside the dataTable.Should().BeEquivalentTo(expectedDataTable) call.

Do you have any ideas off the top of your head about how to further improve the performance?

Copy link
Member

@jnyrup jnyrup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new commits looks good.
I had an extra look at the code and found a few more things.

I haven't reviewed the tests thoroughly, but just skimmed them and it looked good.

@logiclrd
Copy link
Contributor Author

logiclrd commented Jan 5, 2021

Should I collapse these follow-up commits into the main commits, then?

Copy link
Member

@jnyrup jnyrup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two small nits, then I'll be happy to merge this magnum opus into master.

Yes, please collapse the follow-up commits.
Not long ago I had to look at commit from 2012/2013 trying to understand some design decision.
It really helped that we had a clean git history.

Dennis wrote a blog post that elaborates more on the topic.
https://www.continuousimprover.com/2020/03/keep-source-control-history-clean.html

Tests/Approval.Tests/Approval.Tests.csproj Outdated Show resolved Hide resolved
@logiclrd logiclrd force-pushed the JDG_DataSets branch 11 times, most recently from 7e31403 to 39f7c6b Compare January 5, 2021 21:24
…ataTable, DataRow, and typed data tables).

Preparatory work:
- Refactored JoinUsingWritingStyle out of Formatting/TimeSpanValueFormatter.cs into an extension method in new file Formatting/EnumerableExtensions.cs.
- Added unit tests of JoinUsingWritingStyle in EnumerableExtensionSpecs.cs.
- Added a performance optimization to GenericDictionaryEquivalencyStep.cs and GenericEnumerableEquivalencyStep.cs to avoid having TypeExtensions.GetClosedGenericInterfaces call .GetInterfaces() if the TypeCode indicates that the value couldn't possibly implement the specified interface to begin with.
Added equivalency steps for dealing with DataSet-related types:
- DataSetEquivalencyStep.cs
- DataRelationEquivalencyStep.cs
- DataTableEquivalencyStep.cs
- DataColumnEquivalencyStep.cs
- ConstraintCollectionEquivalencyStep.cs
- ConstraintEquivalencyStep.cs
- DataRowCollectionEquivalencyStep.cs
- DataRowEquivalencyStep.cs
Updated references to system assemblies as needed:
- Added reference to System.Data and System.Data.DataSetExtensions from FluentAssertions.csproj for the net47 target framework.
- Added NuGet references to System.Data.DataSetExtensions for the netcoreapp2.1 and netstandard2.0 targets in FluentAssertions.csproj.
- Added a package reference to System.Data.DataSetExtensions from Approval.Tests.csproj.
- Updated Benchmarks.csproj and NSpec3.Net47.Specs.csproj to reference System.Data.DataSetExtensions in the target framework.
Registered these new IEquivalencyStep implementations in GetDefaultSteps in EquivalencyStepCollection.cs.
Added new type DataEquivalencyAssertionOptions<T>. Added a new overload of CloneDefaults to AssertionOptions.cs to support cloning to specific types.
Added new subclasses of ReferenceTypeAssertions<TSubject, TAssertions> to support tailored Should() options for DataSet and related types:
- DataRowAssertions.cs
- DataTableAssertions.cs
- DataColumnAssertions.cs
- DataRowAssertions.cs
Added supporting type RowMatchMode.cs.
Added a new overload of .Should() to AssertionExtensions.cs for DataColumn, and added DataSetAssertionExtensions.cs, DataTableAssertionExtensions.cs and DataRowAssertionExtensions.cs to provide Should<T> overloads specializing on DataSet, DataTable and DataRow (allowing the compile-time type to be captured).
Updated TypeExtensions.cs to short-circuit GetClosedGenericInterfaces if the type isn't an object type.
Added unit tests of all the different paths through the code in the new equivalency steps to FluentAssertions.Specs.csproj as Equivalency/DataEquivalencySpecs.*.cs. Added a reference to System.Data.DataSetExtensions in order to use TypedTableBase<T>.
Ran AcceptApiChanges.ps1.
- Added LargeDataTableEquivalency.cs to the Benchmarks project with code to predictably populate two identical DataTables then measure the time taken to ensure that they are equivalent.
- Added benchmark class UsersOfGetClosedGenericInterfaces.cs to the Benchmarks project.
- Added references to the Bogus library and to System.Collections from Benchmarks.csproj.
- Updated Program.cs to run the LargeDataTableEquivalency benchmark.
- Added new file docs/_pages/data.md with documentation of System.Data assertions.
- Updated docs/_data/navigation.yml to link the System.Data documentation into the navigation tree.
- Added mention of the System.Data functionality in the Release Notes (docs/_pages/releases.md).
@logiclrd
Copy link
Contributor Author

logiclrd commented Jan 5, 2021

[C:\code\fluentassertions]git diff fa9cbcc..a5a5550

[C:\code\fluentassertions]git cherry -v develop JDG_DataSets
+ 2fa723b7db333191b9f311a45232430f22dd2305 Added support for assertions related to System.Data types (DataSet, DataTable, DataRow, and typed data tables).
+ 56837fcfcec5649cd3d8d1f512c00c310b7e782a Added benchmarks related to the new System.Data functionality: - Added LargeDataTableEquivalency.cs to the Benchmarks project with code to predictably populate two identical DataTables then measure the time taken to ensure that they are equivalent. - Added benchmark class UsersOfGetClosedGenericInterfaces.cs to the Benchmarks project. - Added references to the Bogus library and to System.Collections from Benchmarks.csproj. - Updated Program.cs to run the LargeDataTableEquivalency benchmark.
+ a5a5550be0b6ad0ec222bfd3de3103e1066a9f4c Added documentation of the new System.Data functionality: - Added new file docs/_pages/data.md with documentation of System.Data assertions. - Updated docs/_data/navigation.yml to link the System.Data documentation into the navigation tree. - Added mention of the System.Data functionality in the Release Notes (docs/_pages/releases.md).

[C:\code\fluentassertions]

@logiclrd
Copy link
Contributor Author

logiclrd commented Jan 5, 2021

Oh dear :-) Well, I didn't really intend to trigger so many automated builds, I just wanted the collapsing of the commits to be preserved in my fork. But, it is really highlighting some flaky tests:

image

Every one of those builds is a functionally identical codebase.

@dennisdoomen
Copy link
Member

Oh dear :-) Well, I didn't really intend to trigger so many automated builds,

Don't worry about that. We don't care ;-)

@jnyrup jnyrup merged commit 463497f into fluentassertions:develop Jan 6, 2021
@jnyrup
Copy link
Member

jnyrup commented Jan 6, 2021

@logiclrd Thanks for taking the time to implement this feature and having the patience to listen to our concerns during the review process.

@dennisdoomen
Copy link
Member

Awesome work @logiclrd 👌👌

@logiclrd
Copy link
Contributor Author

logiclrd commented Jan 6, 2021

Thanks so much, it's exciting to be part of such a widely-used library :-) Your detailed reviews definitely made a big difference to the quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Accepted for hacktoberfest. Will be merged later.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants