Make column type inference optional #3050

seancolsen · 2023-07-17T13:12:31Z

Summary

Before

Mathesar mandatorily performs column type inference for imports.
tall.csv is an example of an import for which this inference is problematic.

Prior to this PR, when I would attempt to import it, I would usually experience the following error (though it's worth mentioning that I have also observed it to succeed too).

The root cause of this error is not the focus of this PR. I have spent a little time trying to trace it down, but it's been very tricky. I can add additional comments if needed.

After

The first step of the import process prompts the user to decide whether to enable type inference.
With inference enabled, tall.csv seems to succeed at importing somewhat more reliably (though my sample size is small here). When the import succeeds, the preview page takes about 40 seconds to load on my machine. And errors do still seem to happen with inference. Here is a demo of tall.csv showing successfully inferred column types.
With inference disabled tall.csv imports more straightforwardly. The preview page loads in about 3 seconds. Notice all columns are text.

Refactoring

When I started this PR, it seemed like a relatively simple change. But the deeper I got into it, the more I struggled to find a clean path through the existing code. I ended up doing quite a bit of refactoring along the way, with some user-facing improvements for other bugs and UX issues I noticed along the way.

Code changes

I re-organized pages/import-preview and pages/import-upload into pages/import because there are now some common components used across both of these pages which live alongside them.
I did a small amount of refactoring in ImportUploadPage.svelte, but it was mostly straightforward.
I also made some improvements to FileUpload.svelte to handle some small bugs and UX issues.
ImportPreviewPage.svelte is what gave me the most trouble.
- This component was quite large before this PR! It had lots of local state, lots of imperative logic, lots of API requests, lots of tricky loading states and error states. Ooof. Much of the complexity here is intrinsic to the import process, as defined by the API. But the component had grown to the point where it was quite difficult for me to modify without already having a solid understanding of its inner workings.
- This PR explodes that component into the following separate files which I think should be much easier to work with:
  - ImportPreviewPage.svelte
  - ImportPreviewLayout.svelte
  - ImportPreviewContent.svelte (this is the component which most closely resembles the code prior to my refactoring)
  - ImportPreviewSheet.svelte
  - ImportPreviewPageUtils.ts
The preview page is making so many different API requests, that I decided to create a new abstraction to handle this complexity: AsyncStore. Look in ImportPreviewContent.svelte and see that we are doing:
```
const previewRequest = new AsyncStore(generateTablePreview);
```
At this point, previewRequest has not been run yet. Then later we do:
```
previewRequest.run({ table, columns })
```
Then, we can use $previewRequest to see the AsyncStoreValue associated with the request, and that tells us a bunch of stuff about the state of the request.

The previewRequest can also be canceled, re-run, and reset.

I don't think we necessarily need to use this new AsyncStore abstraction everywhere, but so far it seems to be really handy!

User-facing changes

After initial import page loads

Before	After

I changed some font sizes to be more consistent.
I adjusted some language slightly.
I added the "Column Types" field.

Since my previous comment on this PR, I made some minor tweaks to the language for the new field.

@kgodey I've removed the word "inference". Does this address your concern? I kept the field label as "Column types" instead of adopting your suggestion of "Guess column data types" because it felt simpler and more declarative. I don't feel too strongly about this though.

@dmos62 I added the text "for large imports". Does this address your concern. You said you'd consider "a paragraph or two of text," but I'd like to avoid overwhelming the user with that much information.

During file upload

Before	After

Don't show "processing" warning during upload phase because we are not actually processing the data yet. The progress bar should be enough here, I think.
I added a "Cancel upload" button. Which cancels the upload and resets the file input.

After file upload completes

Before	After
(This state did not exist previously because the flow moved on to the next step automatically.)

User must press "Proceed". This is one extra click, which is a downside to this PR, but I don't think we should auto-proceed anymore because the form has an additional question after the file input.

While table is being created

Before	After

While preview page is loading

Before	After

After preview page loads

Before	After

I improved aesthetics for tall viewports. Previously, there was white and tan horizontal space below the preview area. Now, that space retains the same background color as behind the preview sheet.
I moved the truncation warning from inside the form area to below the preview sheet to improve contextual placement. It now also specifies 20 exactly instead of saying "the first few rows". Also, it is hidden when the preview has fewer than 20 rows.
The form has the same "Column Types" question from the first page, allowing the user to make changes to the inference here as well.
I reframed the "Use first row as header" question as radio choices to better describe its behavior improve aesthetics when juxtaposed with the "Column Types" question.
I made some font sizes more consistent.
I made the "Table Name" field horizontal instead of vertical so as to conserve vertical space.
I moved the help text into a blue box and re-worded it a bit.
I added an inference query param to the URL for this page. The form syncs with this param bidirectionally.

After canceling import from preview screen

Before	After

Previously, Mathesar would immediately redirect to the schema page while concurrently deleting the unconfirmed import table. This lead to a momentary flash of the table being listed on the schema page before the deletion request resolved. This PR replaces the cancel button with a "Cleaning up" loading spinner.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the develop branch of the repository
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

seancolsen · 2023-07-17T14:39:59Z

@kgodey @ghislaineguerin @pavish @dmos62 I've assigned you all to this PR to comment with your opinion on the direction I'm heading with changes to the import process. (Don't review it because it's still a draft.)

Motivation for this PR

We've had some reports of users being unable to import data into Mathesar due to problems with our column type inference. (See Cannot import large csv tables #2995 and Importing a CSV with a large number of columns times out at confirming data step #2250).
Ideally we'd improve column type inference, but that work may take multiple months.
As a stop-gap solution, @dmos62 and I have identified the following band-aid fixes that we think can be completed within this cycle:
- Make column type inference optional (to be addressed via this PR)
- Gracefully recover from failed type inference during import (to be addressed later in this cycle hopefully)

Current (draft) state of this PR

Before	After

Upon uploading a file, the column type inference begins immediately, without user interaction.	The user will need to click "Continue". This is one additional click, but I think that's okay.

Why I'm asking for your input

@kgodey You've commented on a few of the related tickets, and I think you might want to weigh in on this, especially since we're working to identify the best way to address user feedback while balancing implementation feasibility.
@ghislaineguerin you did the UX design and graphic for the import page. This PR makes some changes to that design. I've reduced some font sizes to present a more uniform "form" page with multiple inputs. And I've simplified the field label for the first question to be more declarative.
@pavish you implemented most of the import page (if I recall correctly). I'd like to note that the code changes in this PR are not yet ready for review. Please review from a UX perspective.
@dmos62 you had some additional thoughts about how best to present the inference options to the user

A "pie in the sky" import UX to consider later

I want you all to understand some additional context of the discussions that @dmos62 and I have been having, as it relates to the UX choices here.

I proposed an import process to Dom that would work like this:

Mathesar automatically starts performing type inference after data is uploaded.
Type inference is performed per-column.
Type inference is batched in 1k row "chunks". For Text columns, performance would be improved because Postgres would only need to analyze the first chunk to finish inference. For stricter column types (e.g. Number), the service layer would have knowledge of the worst-case progress of inference process.
The front end uses web sockets or polling to see the progress of inference on a per-column basis.
The front end can cancel inference on a per-column basis, falling back to Text for the column type.
The user can then wait for inference to complete. If they get impatient, they can manually choose the correct type for columns, thereby canceling the inference process.

Dom said this is doable, but would probably take a least one full cycle of focused work. We're interested in exploring this option at some point in the future, but we though it was important to deliver something in this release which would address user concerns.

This PR does not represent the ideal UX that I would want for the user, but it represents a balance between user concerns and implementation feasibility. That's why it's a bit of a "band-aid".

kgodey · 2023-07-17T20:32:38Z

@seancolsen: @ghislaineguerin and I discussed this in our call today. @ghislaineguerin is going to spend a little bit of time thinking about this and maybe come up with an alternate proposal.

At minimum, I'd like some changes to the copy of the "Column type inference" section to make it more newbie-friendly. I don't think most users will know what "inference" means, we might want to make it "Guess column data types" or something like that.

Also, we'll need some changes to the next page in the flow since we state there that column types will be detected automatically:

seancolsen · 2023-07-18T00:25:21Z

Thanks for weighing in, @kgodey. When you say "@ghislaineguerin is going to spend a little bit of time thinking about this and maybe come up with an alternate proposal," do you expect that her proposal will still follow the overall structure in this PR? Or would it potentially be something wildly different? I ask because, with this being a draft PR, I still have some work left to do to make this change functional. I've only implemented the most basic of the UI so far, hoping it would be enough to communicate my intended direction. If you and @ghislaineguerin are satisfied with the direction this is heading at a high level, then I'll continue my work and be happy to wordsmith things later in the process. But if, for example, you two are unsure if it even makes sense to ask the user a question like this on the import page, then I'll hold off on continuing my work here.

dmos62 · 2023-07-18T09:26:42Z

The input I had about the type inference section is that "slower" and "faster" is not enough commentary to inform the user (and it's misleading). E.g. inference is "slower", but in many cases it will take a neglibile amount of time and save a lot of time in not having to manually choose types for every column after import. Or, even worse, a user might opt for the "faster" option out of simple conservacy, not knowing that this means he should change column types himself after the import.

I don't have an idea for how to fix this. I would consider a paragraph or two of text explaining the tradeoffs and the risks.

As a sidenote, I really dislike when tools give you multiple options, some "faster" or "more stable" or whatever, but don't provide the insights necessary to actually make the choice.

ghislaineguerin · 2023-07-18T09:36:29Z

Hi @seancolsen We're not concerned with the direction of this PR; we believe that allowing users to skip a potentially slow process is a good idea. Our aim is to reduce the number of choices a user has to make, so we think having data type inference as the default makes sense. It is also ok to ask them that question before they proceed to the preview.

We also discussed the idea of allowing users to interrupt the data type inference at any point, but we concluded that it might not be the best approach. Interrupting a process could be significantly more stressful and unpredictable than simply making a decision at the start. Perhaps once the preview is rendered, we could offer the user an option to check the data types at that point? This might enhance the user experience, as it could be interrupted without affecting other processes if it's taking too long.

As for the questions, I suggest focusing the labels more on automatic versus manual data type assignment. Instead of asking the user to choose between 'the best types' and 'using text', which might not meet their expectations, we could consider 'Guess column data types' and 'Set data types manually'. I think this wording implies that the user will still have control over the output.

kgodey · 2023-07-18T17:26:19Z

@seancolsen does @ghislaineguerin's comment address the questions you asked me?

seancolsen · 2023-07-18T17:28:44Z

@kgodey and @ghislaineguerin. Yep. I'm good here. Thanks for the clarification. I'll continue working on this.

@const

- Utilize `getGloballyUniqueId` function. - Clean up imports. - Refactor `state` into simpler `isDraggingOver` boolean. - Some minor readability improvements. - Don't process drop if `fileList` is empty. - Use `{@const}` for `percentage`. - Use distinct CSS for focus, hover, and dragging-over. - Ignore `:active` CSS state.

Also clean up ImportPreviewPage component a bit

seancolsen · 2023-08-08T02:30:21Z

@pavish I'd like you to give this a full review now. There is a lot of code to look over in here, much of which touches code you have worked on. Hard to describe everything in the PR description, since this was a tricky one. Happy to hop on a call to discuss if needed.

@kgodey, @ghislaineguerin, and @dmos62 I'd like you to take another look at the language of the question, as we initially discussed. Look for the "After initial import page loads" section in the PR description.

dmos62 · 2023-08-08T14:58:54Z

I added the text "for large imports". Does this address your concern?

Yes. UI looks good. I don't have significant feedback.

kgodey · 2023-08-08T20:01:02Z

I like the new copy better. Some thoughts:

I would make the header "Column Data Types" instead of "Column Types" since I think that's a little less ambiguous.
I am not sure we need the "Analyze data and" in the first option, I think it works fine if it's just "Pick the best data type for each column". (note that I changed "type" to "data type").

I don't have a strong opinion about any of this.

seancolsen · 2023-08-22T10:50:08Z

@kgodey I like your proposed copy and I've pushed f873d70 to implement it.

ghislaineguerin · 2023-08-22T12:33:58Z

@seancolsen This looks good to me. Thanks

mathemancer · 2023-08-22T13:54:25Z

The newest version of pgTAP (the SQL testing framework we're using) seems to have introduced a bug:

theory/pgtap#315 .

I'm going to disable the affected tests for now.

pgTAP 1.3.0 introduced a bug that affects our tests, but not the actual functionality: theory/pgtap#315

This reverts commit b7420a0.

pavish

@seancolsen The code looks good.

However, I'm not able to import a file. I think this might be due to a backend issue or a change in contract (!?), I'm not sure.

The import fails at the preview step.

Here is the error:

[
    {
        "code": 4999,
        "message": "",
        "field": null,
        "detail": null,
        "stacktrace": [
            "1. File \"/code/db/columns/operations/alter.py\", line 161, in batch_update_columns",
            "2. db_conn.execute_msar_func_with_engine(",
            "3. File \"/code/db/connection.py\", line 19, in execute_msar_func_with_engine",
            "4. return conn.execute(",
            "5. File \"/usr/local/lib/python3.9/site-packages/psycopg/connection.py\", line 879, in execute",
            "6. raise ex.with_traceback(None)",
            "7. psycopg.errors.SyntaxError: column \"id\" of relation \"patents 1\" is an identity column",
            "8. HINT:  Use ALTER TABLE ... ALTER COLUMN ... DROP IDENTITY instead.",
            "9. CONTEXT:  SQL statement \"ALTER TABLE \"patents 1\" ALTER COLUMN id DROP DEFAULT, ALTER COLUMN id TYPE integer USING mathesar_types.cast_to_integer(id), ALTER COLUMN \"Center\" DROP DEFAULT, ALTER COLUMN \"Center\" TYPE text USING mathesar_types.cast_to_text(\"Center\"), ALTER COLUMN \"Status\" DROP DEFAULT, ALTER COLUMN \"Status\" TYPE text USING mathesar_types.cast_to_text(\"Status\"), ALTER COLUMN \"Case Number\" DROP DEFAULT, ALTER COLUMN \"Case Number\" TYPE text USING mathesar_types.cast_to_text(\"Case Number\"), ALTER COLUMN \"Patent Number\" DROP DEFAULT, ALTER COLUMN \"Patent Number\" TYPE text USING mathesar_types.cast_to_text(\"Patent Number\"), ALTER COLUMN \"Application SN\" DROP DEFAULT, ALTER COLUMN \"Application SN\" TYPE text USING mathesar_types.cast_to_text(\"Application SN\"), ALTER COLUMN \"Title\" DROP DEFAULT, ALTER COLUMN \"Title\" TYPE text USING mathesar_types.cast_to_text(\"Title\"), ALTER COLUMN \"Patent Expiration Date\" DROP DEFAULT, ALTER COLUMN \"Patent Expiration Date\" TYPE date USING mathesar_types.cast_to_date(\"Patent Expiration Date\")\"",
            "10. PL/pgSQL function __msar.exec_ddl(text) line 10 at EXECUTE",
            "11. PL/pgSQL function __msar.exec_ddl(text,anyarray) line 14 at RETURN",
            "12. SQL statement \"SELECT __msar.exec_ddl(",
            "13. 'ALTER TABLE %s %s',",
            "14. __msar.get_relation_name(tab_id),",
            "15. msar.process_col_alter_jsonb(tab_id, col_alters)",
            "16. )\"",
            "17. PL/pgSQL function msar.alter_columns(oid,jsonb) line 22 at PERFORM",
            "18. ",
            "19. During handling of the above exception, another exception occurred:",
            "20. ",
            "21. Traceback (most recent call last):",
            "22. File \"/code/mathesar/utils/models.py\", line 36, in update_sa_table",
            "23. alter_table(table.name, table.oid, table.schema.name, table.schema._sa_engine, data)",
            "24. File \"/code/db/tables/operations/alter.py\", line 50, in alter_table",
            "25. batch_update_columns(table_oid, engine, update_data['columns'])",
            "26. File \"/code/db/columns/operations/alter.py\", line 175, in batch_update_columns",
            "27. raise InvalidTypeOptionError",
            "28. db.columns.exceptions.InvalidTypeOptionError",
            "29. ",
            "30. During handling of the above exception, another exception occurred:",
            "31. ",
            "32. Traceback (most recent call last):",
            "33. File \"/usr/local/lib/python3.9/site-packages/rest_framework/views.py\", line 506, in dispatch",
            "34. response = handler(request, *args, **kwargs)",
            "35. File \"/code/mathesar/api/db/viewsets/tables.py\", line 57, in partial_update",
            "36. serializer.save()",
            "37. File \"/usr/local/lib/python3.9/site-packages/rest_framework/serializers.py\", line 200, in save",
            "38. self.instance = self.update(self.instance, validated_data)",
            "39. File \"/code/mathesar/api/serializers/tables.py\", line 170, in update",
            "40. instance.update_sa_table(validated_data)",
            "41. File \"/code/mathesar/models/base.py\", line 478, in update_sa_table",
            "42. result = model_utils.update_sa_table(self, update_params)",
            "43. File \"/code/mathesar/utils/models.py\", line 42, in update_sa_table",
            "44. raise base_api_exceptions.MathesarAPIException(e, status_code=status.HTTP_400_BAD_REQUEST)",
            "45. mathesar.api.exceptions.generic_exceptions.base_exceptions.MathesarAPIException: [{'code': 4999, 'message': '', 'field': None, 'detail': None}]"
        ]
    }
]

cc: @mathemancer

pavish · 2023-08-23T18:47:25Z

Update:

I cleaned my DB state and started fresh and it works as expected.

It's probably due to the SQL functions in my local environment not being updated. I did restart the server first, after I pulled in this branch but that doesn't seem to have updated the functions automatically. @mathemancer Is anything additional required?

seancolsen added 2 commits July 14, 2023 13:02

Move duplicated help message to common component

b34ffc8

Use DocsLink component

30ff61e

seancolsen added this to the v0.1.3 milestone Jul 17, 2023

seancolsen changed the base branch from develop to import_help_text July 17, 2023 13:15

Begin reworking import page for optional inference

a66d9f8

seancolsen force-pushed the optional_inference branch from cf8fbbb to a66d9f8 Compare July 17, 2023 14:36

seancolsen assigned kgodey, dmos62, pavish and ghislaineguerin Jul 17, 2023

seancolsen added 2 commits July 21, 2023 12:49

Merge branch 'develop' into optional_inference

9e42fb7

Continue improving import page

fb7590a

seancolsen changed the base branch from import_help_text to develop July 21, 2023 18:26

seancolsen mentioned this pull request Jul 21, 2023

Small clean up to import help text code #3041

Closed

7 tasks

seancolsen added 9 commits July 21, 2023 14:27

Merge branch 'develop' into optional_inference

751937d

Improve error message

2b38095

Organize some imports

13cf19c

Fix state initialization in DataFileInput

df3444e

Merge branch 'develop' into optional_inference

e623f38

Accept "inference" param on import preview page

8f4e04c

Also clean up ImportPreviewPage component a bit

Re-arrange some UI on import preview page

fb7cd7b

Re-organize some files

d30df64

seancolsen marked this pull request as ready for review August 8, 2023 02:25

Merge branch 'develop' into optional_inference

62b5977

seancolsen assigned kgodey Aug 8, 2023

Display errors from columns fetch

818233f

seancolsen unassigned dmos62 Aug 8, 2023

kgodey removed their assignment Aug 8, 2023

rajatvijay added the pr-status: review A PR awaiting review label Aug 9, 2023

seancolsen modified the milestones: v0.1.3, v0.1.4 Aug 17, 2023

seancolsen added 2 commits August 22, 2023 06:42

Merge branch 'develop' into optional_inference

0736af3

Adjust UI copy

f873d70

seancolsen unassigned ghislaineguerin Aug 22, 2023

mathemancer and others added 4 commits August 22, 2023 22:03

disable tests broken by upstream bug

b7420a0

pgTAP 1.3.0 introduced a bug that affects our tests, but not the actual functionality: theory/pgtap#315

Revert "disable tests broken by upstream bug"

3905ab0

This reverts commit b7420a0.

Merge branch 'develop' into optional_inference

80d1f46

Merge branch 'develop' into optional_inference

78fcdf7

pavish requested changes Aug 23, 2023

View reviewed changes

pavish approved these changes Aug 23, 2023

View reviewed changes

pavish added this pull request to the merge queue Aug 23, 2023

Merged via the queue into develop with commit 8018a44 Aug 23, 2023
9 checks passed

pavish deleted the optional_inference branch August 23, 2023 18:51

seancolsen mentioned this pull request Oct 9, 2023

Cannot import large csv tables #2995

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make column type inference optional #3050

Make column type inference optional #3050

seancolsen commented Jul 17, 2023 •

edited

seancolsen commented Jul 17, 2023 •

edited

kgodey commented Jul 17, 2023 •

edited

seancolsen commented Jul 18, 2023 •

edited

dmos62 commented Jul 18, 2023

ghislaineguerin commented Jul 18, 2023

kgodey commented Jul 18, 2023

seancolsen commented Jul 18, 2023

seancolsen commented Aug 8, 2023 •

edited

dmos62 commented Aug 8, 2023

kgodey commented Aug 8, 2023

seancolsen commented Aug 22, 2023 •

edited

ghislaineguerin commented Aug 22, 2023

mathemancer commented Aug 22, 2023

pavish left a comment •

edited

pavish commented Aug 23, 2023

Make column type inference optional #3050

Make column type inference optional #3050

Conversation

seancolsen commented Jul 17, 2023 • edited

Summary

Before

After

Refactoring

Code changes

User-facing changes

After initial import page loads

During file upload

After file upload completes

While table is being created

While preview page is loading

After preview page loads

After canceling import from preview screen

Checklist

Developer Certificate of Origin

seancolsen commented Jul 17, 2023 • edited

Motivation for this PR

Current (draft) state of this PR

Why I'm asking for your input

A "pie in the sky" import UX to consider later

kgodey commented Jul 17, 2023 • edited

seancolsen commented Jul 18, 2023 • edited

dmos62 commented Jul 18, 2023

ghislaineguerin commented Jul 18, 2023

kgodey commented Jul 18, 2023

seancolsen commented Jul 18, 2023

seancolsen commented Aug 8, 2023 • edited

dmos62 commented Aug 8, 2023

kgodey commented Aug 8, 2023

seancolsen commented Aug 22, 2023 • edited

ghislaineguerin commented Aug 22, 2023

mathemancer commented Aug 22, 2023

pavish left a comment • edited

Choose a reason for hiding this comment

pavish commented Aug 23, 2023

seancolsen commented Jul 17, 2023 •

edited

seancolsen commented Jul 17, 2023 •

edited

kgodey commented Jul 17, 2023 •

edited

seancolsen commented Jul 18, 2023 •

edited

seancolsen commented Aug 8, 2023 •

edited

seancolsen commented Aug 22, 2023 •

edited

pavish left a comment •

edited