Fix Sweeper truncating entire layer (#147)
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #147       +/-  ##
===========================================
+ Coverage   22.38%   36.43%   +14.04%
===========================================
  Files          13       13
  Lines         746      785       +39
  Branches      132      139        +7
===========================================
+ Hits          167      286      +119
+ Misses        574      492       -82
- Partials        5        7        +2
Right now the chunk_size is hardcoded in DuplicateTest._try_fix(). I'm not sure of the best way to give it a default value, given that the config.json config file is optional.
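One possible shape for this, sketched here as an illustration only: `get_chunk_size`, `DEFAULT_CHUNK_SIZE`, and the value 500 are all hypothetical names and values, not sweeper's actual API or the hardcoded value in `_try_fix()`. The idea is that a missing file, malformed JSON, or absent key all fall back to the default, so config.json stays optional.

```python
import json
from pathlib import Path

# Hypothetical default; the real value is currently hardcoded in DuplicateTest._try_fix()
DEFAULT_CHUNK_SIZE = 500


def get_chunk_size(config_path="config.json"):
    """Read chunk_size from an optional config.json, falling back to a default.

    A missing file, malformed JSON, or an absent "chunk_size" key all yield
    the default, so the config file remains optional.
    """
    path = Path(config_path)
    if not path.exists():
        return DEFAULT_CHUNK_SIZE
    try:
        config = json.loads(path.read_text())
    except json.JSONDecodeError:
        return DEFAULT_CHUNK_SIZE
    return config.get("chunk_size", DEFAULT_CHUNK_SIZE)
```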
Pull request overview
This PR addresses an issue where the duplicates sweeper could delete/truncate more rows than intended (notably in the pre1978 layer) by chunking large OID-based deletes and switching to an ArcPy “select then delete” workflow that should respect layer/table-view selections.
Changes:
- Chunk duplicate OID deletions into smaller batches and delete via SelectLayerByAttribute + DeleteFeatures/DeleteRows.
- Add unit tests covering chunking and try_fix() behavior (and some sweep() behaviors).
- Update VS Code debug launch config and add a repository Copilot instructions document.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/sweeper/sweepers/duplicates.py | Implements batched deletion + selection-based delete helpers to avoid overly-large SQL clauses and reduce the risk of unintended full deletes. |
| tests/test_duplicates.py | Adds tests for batching and deletion behavior using mocked ArcPy/xxhash. |
| .vscode/launch.json | Updates the debug configuration to run the duplicates sweeper with specific args. |
| .github/copilot-instructions.md | Adds project contribution/style guidance for Copilot usage. |
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@jacobdadams I've opened a new pull request, #148, to work on those changes. Once the pull request is ready, I'll request review from you.
…ates.py (#148)

* Initial plan
* fix(tests): resolve ruff F821 errors by importing DuplicateTest at module level

Co-authored-by: jacobdadams <38168030+jacobdadams@users.noreply.github.com>
Agent-Logs-Url: https://github.com/agrc/sweeper/sessions/ba6f55df-5b0c-43fd-a7dd-25596cf24014
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jacobdadams <38168030+jacobdadams@users.noreply.github.com>
Michael confirmed one of the errors showed up in the SQL Server logs:
The log doesn't specify the layer, but I think the time lines up with one of our manual tests.
And I've run this against a local copy of the 1978 data with duplicates, and it works as intended. Haven't tested it against anything in SDE yet.
stdavis left a comment
Thanks for taking this on! I think your hypothesis about the cause of the issue seems very plausible, especially given the database log from Michael.
> Haven't tested it against anything in SDE yet.
I think that it would be good to test this against an SDE database before merging. I'm sure that Michael could help you find a test database to work with.
Co-authored-by: Scott Davis <stdavis@utah.gov>
I've tested this on a test layer in internal, and it found the expected number of duplicates and deleted them appropriately. The layer had the expected number of remaining features afterwards. It did take a little longer (I assume from the overhead of setting and clearing the selection for each batch, along with the overhead of multiple delete calls), but I think this is much more robust than before.
As per #146, sweeper was nuking all the rows in the pre1978 layer. This may have been caused by the extremely long SQL IN clause in the where statement (https://learn.microsoft.com/en-us/sql/t-sql/language-elements/in-transact-sql?view=sql-server-ver17&redirectedfrom=MSDN#remarks).
This PR fixes #146
This PR chunks the list of OIDs to delete into smaller batches to hopefully work around this issue.
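The chunking itself can be sketched as follows; `chunk_oids`, `batch_where_clauses`, and the batch size of 500 are illustrative names and values, not necessarily the PR's actual ones. Each batch becomes its own IN (...) clause, keeping every generated where clause well under the length that appeared to trip up SQL Server.

```python
def chunk_oids(oids, chunk_size=500):
    """Split a flat list of OIDs into batches of at most chunk_size."""
    return [oids[i:i + chunk_size] for i in range(0, len(oids), chunk_size)]


def batch_where_clauses(oids, chunk_size=500):
    """Build one OBJECTID IN (...) where clause per batch of OIDs."""
    return [
        f"OBJECTID IN ({', '.join(str(oid) for oid in batch)})"
        for batch in chunk_oids(oids, chunk_size)
    ]
```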
It also changes the arcpy delete behavior. Previously, it created a layer or table view containing only the records to be deleted and then nuked everything.
Now, it creates a layer or table view of the entire layer, runs a selection using the where clause, and then calls delete, which honors that selection. This is how the docs do deletions (https://pro.arcgis.com/en/pro-app/latest/tool-reference/data-management/delete-features.htm).
The change of deletion methods may fix the full-truncate issue because the where clause is now evaluated against the created layer. If the layer creation makes an in-memory copy, the where clause would be applied locally without touching SQL Server. However, I don't know whether layer creation pulls over a full copy or just creates a shallow reference.
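The select-then-delete loop described above can be sketched with the arcpy calls abstracted behind injected callables, so the shape is testable without ArcGIS. This is an illustrative sketch, not the PR's code: in sweeper, `select_by_where` would correspond to arcpy's SelectLayerByAttribute (a NEW_SELECTION on a view of the entire layer) and `delete_selected` to DeleteFeatures/DeleteRows, which honor the current selection.

```python
def delete_oids_in_batches(oids, select_by_where, delete_selected, chunk_size=500):
    """Delete rows batch by batch using a select-then-delete workflow.

    select_by_where(where) stands in for selecting records on a view of the
    *entire* layer/table, and delete_selected() stands in for a delete call
    that honors that selection, rather than nuking the whole view.
    """
    for start in range(0, len(oids), chunk_size):
        batch = oids[start:start + chunk_size]
        where = f"OBJECTID IN ({', '.join(str(oid) for oid in batch)})"
        select_by_where(where)   # e.g. SelectLayerByAttribute with NEW_SELECTION
        delete_selected()        # e.g. DeleteFeatures/DeleteRows on the selection
```

With fakes injected, one can verify that each batch gets its own selection and delete call:

```python
clauses, deletes = [], []
delete_oids_in_batches([1, 2, 3, 4, 5], clauses.append, lambda: deletes.append(True), chunk_size=2)
# clauses: ["OBJECTID IN (1, 2)", "OBJECTID IN (3, 4)", "OBJECTID IN (5)"], three deletes
```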