Make column type inference optional #3050
@kgodey @ghislaineguerin @pavish @dmos62 I've assigned you all to this PR to comment with your opinion on the direction I'm heading with changes to the import process. (Don't review it because it's still a draft.)
Motivation for this PR
Current (draft) state of this PR
Why I'm asking for your input
A "pie in the sky" import UX to consider later
I want you all to understand some additional context of the discussions that @dmos62 and I have been having, as it relates to the UX choices here. I proposed an import process to Dom that would work like this:
Dom said this is doable, but would probably take at least one full cycle of focused work. We're interested in exploring this option at some point in the future, but we thought it was important to deliver something in this release which would address user concerns. This PR does not represent the ideal UX that I would want for the user, but it represents a balance between user concerns and implementation feasibility. That's why it's a bit of a "band-aid". |
@seancolsen: @ghislaineguerin and I discussed this in our call today. @ghislaineguerin is going to spend a little bit of time thinking about this and maybe come up with an alternate proposal. At minimum, I'd like some changes to the copy of the "Column type inference" section to make it more newbie-friendly. I don't think most users will know what "inference" means; we might want to make it "Guess column data types" or something like that. Also, we'll need some changes to the next page in the flow, since we state there that column types will be detected automatically: |
Thanks for weighing in, @kgodey. When you say "@ghislaineguerin is going to spend a little bit of time thinking about this and maybe come up with an alternate proposal," do you expect that her proposal will still follow the overall structure in this PR? Or would it potentially be something wildly different? I ask because, with this being a draft PR, I still have some work left to do to make this change functional. I've only implemented the most basic of the UI so far, hoping it would be enough to communicate my intended direction. If you and @ghislaineguerin are satisfied with the direction this is heading at a high level, then I'll continue my work and be happy to wordsmith things later in the process. But if, for example, you two are unsure if it even makes sense to ask the user a question like this on the import page, then I'll hold off on continuing my work here. |
The input I had about the type inference section is that "slower" and "faster" is not enough commentary to inform the user (and it's misleading). E.g. inference is "slower", but in many cases it will take a negligible amount of time and save a lot of time in not having to manually choose types for every column after import. Or, even worse, a user might opt for the "faster" option out of simple caution, not knowing that this means he should change column types himself after the import. I don't have an idea for how to fix this. I would consider a paragraph or two of text explaining the tradeoffs and the risks. As a sidenote, I really dislike when tools give you multiple options, some "faster" or "more stable" or whatever, but don't provide the insights necessary to actually make the choice. |
Hi @seancolsen We're not concerned with the direction of this PR; we believe that allowing users to skip a potentially slow process is a good idea. Our aim is to reduce the number of choices a user has to make, so we think having data type inference as the default makes sense. It is also ok to ask them that question before they proceed to the preview. We also discussed the idea of allowing users to interrupt the data type inference at any point, but we concluded that it might not be the best approach. Interrupting a process could be significantly more stressful and unpredictable than simply making a decision at the start. Perhaps once the preview is rendered, we could offer the user an option to check the data types at that point? This might enhance the user experience, as it could be interrupted without affecting other processes if it's taking too long. As for the questions, I suggest focusing the labels more on automatic versus manual data type assignment. Instead of asking the user to choose between 'the best types' and 'using text', which might not meet their expectations, we could consider 'Guess column data types' and 'Set data types manually'. I think this wording implies that the user will still have control over the output. |
@seancolsen does @ghislaineguerin's comment address the questions you asked me? |
@kgodey and @ghislaineguerin. Yep. I'm good here. Thanks for the clarification. I'll continue working on this. |
- Utilize `getGloballyUniqueId` function.
- Clean up imports.
- Refactor `state` into simpler `isDraggingOver` boolean.
- Some minor readability improvements.
- Don't process drop if `fileList` is empty.
- Use `{@const}` for `percentage`.
- Use distinct CSS for focus, hover, and dragging-over.
- Ignore `:active` CSS state.
Also clean up ImportPreviewPage component a bit
@pavish I'd like you to give this a full review now. There is a lot of code to look over in here, much of which touches code you have worked on. Hard to describe everything in the PR description, since this was a tricky one. Happy to hop on a call to discuss if needed. @kgodey, @ghislaineguerin, and @dmos62 I'd like you to take another look at the language of the question, as we initially discussed. Look for the "After initial import page loads" section in the PR description. |
Yes. UI looks good. I don't have significant feedback. |
I like the new copy better. Some thoughts:
I don't have a strong opinion about any of this. |
@seancolsen This looks good to me. Thanks |
The newest version of pgTAP (the SQL testing framework we're using) seems to have introduced a bug: I'm going to disable the affected tests for now. |
pgTAP 1.3.0 introduced a bug that affects our tests, but not the actual functionality: theory/pgtap#315
This reverts commit b7420a0.
@seancolsen The code looks good.
However, I'm not able to import a file. I think this might be due to a backend issue or a change in contract (!?), I'm not sure.
The import fails at the preview step.
Here is the error:
[
{
"code": 4999,
"message": "",
"field": null,
"detail": null,
"stacktrace": [
"1. File \"/code/db/columns/operations/alter.py\", line 161, in batch_update_columns",
"2. db_conn.execute_msar_func_with_engine(",
"3. File \"/code/db/connection.py\", line 19, in execute_msar_func_with_engine",
"4. return conn.execute(",
"5. File \"/usr/local/lib/python3.9/site-packages/psycopg/connection.py\", line 879, in execute",
"6. raise ex.with_traceback(None)",
"7. psycopg.errors.SyntaxError: column \"id\" of relation \"patents 1\" is an identity column",
"8. HINT: Use ALTER TABLE ... ALTER COLUMN ... DROP IDENTITY instead.",
"9. CONTEXT: SQL statement \"ALTER TABLE \"patents 1\" ALTER COLUMN id DROP DEFAULT, ALTER COLUMN id TYPE integer USING mathesar_types.cast_to_integer(id), ALTER COLUMN \"Center\" DROP DEFAULT, ALTER COLUMN \"Center\" TYPE text USING mathesar_types.cast_to_text(\"Center\"), ALTER COLUMN \"Status\" DROP DEFAULT, ALTER COLUMN \"Status\" TYPE text USING mathesar_types.cast_to_text(\"Status\"), ALTER COLUMN \"Case Number\" DROP DEFAULT, ALTER COLUMN \"Case Number\" TYPE text USING mathesar_types.cast_to_text(\"Case Number\"), ALTER COLUMN \"Patent Number\" DROP DEFAULT, ALTER COLUMN \"Patent Number\" TYPE text USING mathesar_types.cast_to_text(\"Patent Number\"), ALTER COLUMN \"Application SN\" DROP DEFAULT, ALTER COLUMN \"Application SN\" TYPE text USING mathesar_types.cast_to_text(\"Application SN\"), ALTER COLUMN \"Title\" DROP DEFAULT, ALTER COLUMN \"Title\" TYPE text USING mathesar_types.cast_to_text(\"Title\"), ALTER COLUMN \"Patent Expiration Date\" DROP DEFAULT, ALTER COLUMN \"Patent Expiration Date\" TYPE date USING mathesar_types.cast_to_date(\"Patent Expiration Date\")\"",
"10. PL/pgSQL function __msar.exec_ddl(text) line 10 at EXECUTE",
"11. PL/pgSQL function __msar.exec_ddl(text,anyarray) line 14 at RETURN",
"12. SQL statement \"SELECT __msar.exec_ddl(",
"13. 'ALTER TABLE %s %s',",
"14. __msar.get_relation_name(tab_id),",
"15. msar.process_col_alter_jsonb(tab_id, col_alters)",
"16. )\"",
"17. PL/pgSQL function msar.alter_columns(oid,jsonb) line 22 at PERFORM",
"18. ",
"19. During handling of the above exception, another exception occurred:",
"20. ",
"21. Traceback (most recent call last):",
"22. File \"/code/mathesar/utils/models.py\", line 36, in update_sa_table",
"23. alter_table(table.name, table.oid, table.schema.name, table.schema._sa_engine, data)",
"24. File \"/code/db/tables/operations/alter.py\", line 50, in alter_table",
"25. batch_update_columns(table_oid, engine, update_data['columns'])",
"26. File \"/code/db/columns/operations/alter.py\", line 175, in batch_update_columns",
"27. raise InvalidTypeOptionError",
"28. db.columns.exceptions.InvalidTypeOptionError",
"29. ",
"30. During handling of the above exception, another exception occurred:",
"31. ",
"32. Traceback (most recent call last):",
"33. File \"/usr/local/lib/python3.9/site-packages/rest_framework/views.py\", line 506, in dispatch",
"34. response = handler(request, *args, **kwargs)",
"35. File \"/code/mathesar/api/db/viewsets/tables.py\", line 57, in partial_update",
"36. serializer.save()",
"37. File \"/usr/local/lib/python3.9/site-packages/rest_framework/serializers.py\", line 200, in save",
"38. self.instance = self.update(self.instance, validated_data)",
"39. File \"/code/mathesar/api/serializers/tables.py\", line 170, in update",
"40. instance.update_sa_table(validated_data)",
"41. File \"/code/mathesar/models/base.py\", line 478, in update_sa_table",
"42. result = model_utils.update_sa_table(self, update_params)",
"43. File \"/code/mathesar/utils/models.py\", line 42, in update_sa_table",
"44. raise base_api_exceptions.MathesarAPIException(e, status_code=status.HTTP_400_BAD_REQUEST)",
"45. mathesar.api.exceptions.generic_exceptions.base_exceptions.MathesarAPIException: [{'code': 4999, 'message': '', 'field': None, 'detail': None}]"
]
}
]
cc: @mathemancer
Update: I cleaned my DB state and started fresh, and it works as expected. The failure was probably due to the SQL functions in my local environment not being updated. I did restart the server after pulling in this branch, but that doesn't seem to have updated the functions automatically. @mathemancer Is anything additional required? |
Fixes #2358
Summary
Before
Mathesar always performs column type inference for imports.
tall.csv is an example of an import for which this inference is problematic.
Prior to this PR, when I attempted to import it, I would usually experience the following error (though it's worth mentioning that I have also seen the import succeed).
The root cause of this error is not the focus of this PR. I have spent a little time trying to trace it down, but it's been very tricky. I can add additional comments if needed.
After
The first step of the import process prompts the user to decide whether to enable type inference.
With inference enabled, tall.csv seems to succeed at importing somewhat more reliably (though my sample size is small here). When the import succeeds, the preview page takes about 40 seconds to load on my machine. And errors do still seem to happen with inference. Here is a demo of tall.csv showing successfully inferred column types.
With inference disabled, tall.csv imports more straightforwardly. The preview page loads in about 3 seconds. Notice all columns are text.
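To make the trade-off between the two options concrete, here is a minimal TypeScript sketch of what "guessing" a column type involves. It is purely illustrative and does not mirror Mathesar's implementation: the function name `guessColumnType`, the candidate types, and the regular expressions are all assumptions made up for this example. The point it shows is that inference has to check the column's values against each candidate type, whereas skipping inference returns text immediately.

```typescript
// Illustrative only: candidate types and matching rules are assumptions,
// not the rules Mathesar uses.
type GuessedType = "boolean" | "integer" | "numeric" | "date" | "text";

// Candidates are tried from most specific to least specific.
const matchers: [GuessedType, (v: string) => boolean][] = [
  ["boolean", (v) => /^(true|false)$/i.test(v)],
  ["integer", (v) => /^[+-]?\d+$/.test(v)],
  ["numeric", (v) => /^[+-]?(\d+\.?\d*|\.\d+)$/.test(v)],
  // Shape check only (YYYY-MM-DD); does not validate the calendar date.
  ["date", (v) => /^\d{4}-\d{2}-\d{2}$/.test(v)],
];

function guessColumnType(
  values: string[],
  inferenceEnabled: boolean,
): GuessedType {
  // With inference disabled, no values are examined at all.
  if (!inferenceEnabled) return "text";

  // Empty cells are ignored when guessing.
  const nonEmpty = values.map((v) => v.trim()).filter((v) => v !== "");
  if (nonEmpty.length === 0) return "text";

  // Every value is checked against each candidate until one fits all rows.
  for (const [type, matches] of matchers) {
    if (nonEmpty.every((v) => matches(v))) return type;
  }
  return "text";
}
```

In this sketch the cost of inference grows with the number of rows times the number of candidate types, which is one plausible reason a very tall file would be slow to preview with inference enabled.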
Refactoring
When I started this PR, it seemed like a relatively simple change. But the deeper I got into it, the more I struggled to find a clean path through the existing code. I ended up doing quite a bit of refactoring, with some user-facing improvements for other bugs and UX issues I noticed along the way.
Code changes
I re-organized `pages/import-preview` and `pages/import-upload` into `pages/import` because there are now some common components used across both of these pages which live alongside them.
I did a small amount of refactoring in `ImportUploadPage.svelte`, but it was mostly straightforward.
I also made some improvements to `FileUpload.svelte` to handle some small bugs and UX issues.
`ImportPreviewPage.svelte` is what gave me the most trouble.
- `ImportPreviewPage.svelte`
- `ImportPreviewLayout.svelte`
- `ImportPreviewContent.svelte` (this is the component which most closely resembles the code prior to my refactoring)
- `ImportPreviewSheet.svelte`
- `ImportPreviewPageUtils.ts`
The preview page is making so many different API requests, that I decided to create a new abstraction to handle this complexity: `AsyncStore`. Look in `ImportPreviewContent.svelte` and see that we are doing:
At this point, `previewRequest` has not been run yet. Then later we do:
Then, we can use `$previewRequest` to see the `AsyncStoreValue` associated with the request, and that tells us a bunch of stuff about the state of the request.
The `previewRequest` can also be canceled, re-run, and reset.
I don't think we necessarily need to use this new `AsyncStore` abstraction everywhere, but so far it seems to be really handy!
User-facing changes
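As an aside for readers unfamiliar with the pattern behind the `AsyncStore` described under "Code changes": the following is a minimal, hypothetical TypeScript sketch of a store that wraps an async function and can be run, canceled, re-run, and reset. It is not the code from this PR. The class name `AsyncStoreSketch`, the `AsyncState` shape, and the method signatures are assumptions for illustration; the only thing taken as given is Svelte's store contract (a `subscribe` method that immediately calls the subscriber and returns an unsubscribe function), which is what lets a component read the value with the `$` prefix.

```typescript
// Hypothetical sketch of an AsyncStore-like wrapper. All names and shapes
// here are assumptions for illustration, not the code added in this PR.
type AsyncState<T> =
  | { status: "idle" }
  | { status: "processing" }
  | { status: "ok"; value: T }
  | { status: "error"; message: string };

type Subscriber<T> = (state: AsyncState<T>) => void;

class AsyncStoreSketch<Props, T> {
  private fn: (props: Props) => Promise<T>;
  private state: AsyncState<T> = { status: "idle" };
  private subscribers = new Set<Subscriber<T>>();
  // Bumped on every run/cancel/reset so that stale results are ignored.
  private generation = 0;

  constructor(fn: (props: Props) => Promise<T>) {
    this.fn = fn;
  }

  // Svelte store contract: call the subscriber right away with the current
  // value and return an unsubscribe function.
  subscribe(subscriber: Subscriber<T>): () => void {
    this.subscribers.add(subscriber);
    subscriber(this.state);
    return () => {
      this.subscribers.delete(subscriber);
    };
  }

  private set(state: AsyncState<T>): void {
    this.state = state;
    this.subscribers.forEach((notify) => notify(state));
  }

  // Nothing is requested until run() is called. Calling it again re-runs.
  async run(props: Props): Promise<AsyncState<T>> {
    const generation = ++this.generation;
    this.set({ status: "processing" });
    const settled = await this.fn(props).then(
      (value): AsyncState<T> => ({ status: "ok", value }),
      (e: unknown): AsyncState<T> => ({
        status: "error",
        message: e instanceof Error ? e.message : String(e),
      }),
    );
    // Publish only if no newer run, cancel, or reset happened meanwhile.
    if (generation === this.generation) this.set(settled);
    return this.state;
  }

  // In this sketch, cancel only discards the pending result and returns to
  // "idle"; it does not abort the underlying request.
  cancel(): void {
    this.generation++;
    this.set({ status: "idle" });
  }

  // Identical to cancel() here; kept separate to mirror the described API.
  reset(): void {
    this.cancel();
  }
}
```

With something like this, a component could construct the store once, call `run(...)` when it is ready to fetch, and render based on the `status` field, instead of tracking loading and error flags by hand for each request.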
After initial import page loads
Since my previous comment on this PR, I made some minor tweaks to the language for the new field.
@kgodey I've removed the word "inference". Does this address your concern? I kept the field label as "Column types" instead of adopting your suggestion of "Guess column data types" because it felt simpler and more declarative. I don't feel too strongly about this though.
@dmos62 I added the text "for large imports". Does this address your concern? You said you'd consider "a paragraph or two of text," but I'd like to avoid overwhelming the user with that much information.
During file upload
After file upload completes
While table is being created
While preview page is loading
After preview page loads
`inference` query param to the URL for this page. The form syncs with this param bidirectionally.
After canceling import from preview screen