Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make column type inference optional #3050

Merged
merged 37 commits into from Aug 23, 2023
Merged

Make column type inference optional #3050

merged 37 commits into from Aug 23, 2023

Conversation

seancolsen
Copy link
Contributor

@seancolsen seancolsen commented Jul 17, 2023

Fixes #2358

Summary

Before

  • Mathesar mandatorily performs column type inference for imports.

  • tall.csv is an example of an import for which this inference is problematic.

    Prior to this PR, when I would attempt to import it, I would usually experience the following error (though it's worth mentioning that I have also observed it to succeed too).

    The root cause of this error is not the focus of this PR. I have spent a little time trying to trace it down, but it's been very tricky. I can add additional comments if needed.

After

  • The first step of the import process prompts the user to decide whether to enable type inference.

  • With inference enabled, tall.csv seems to succeed at importing somewhat more reliably (though my sample size is small here). When the import succeeds, the preview page takes about 40 seconds to load on my machine. And errors do still seem to happen with inference. Here is a demo of tall.csv showing successfully inferred column types.

  • With inference disabled tall.csv imports more straightforwardly. The preview page loads in about 3 seconds. Notice all columns are text.

Refactoring

When I started this PR, it seemed like a relatively simple change. But the deeper I got into it, the more I struggled to find a clean path through the existing code. I ended up doing quite a bit of refactoring along the way, with some user-facing improvements for other bugs and UX issues I noticed along the way.

Code changes

  • I re-organized pages/import-preview and pages/import-upload into pages/import because there are now some common components used across both of these pages which live alongside them.

  • I did a small amount of refactoring in ImportUploadPage.svelte, but it was mostly straightforward.

  • I also made some improvements to FileUpload.svelte to handle some small bugs and UX issues.

  • ImportPreviewPage.svelte is what gave me the most trouble.

    • This component was quite large before this PR! It had lots of local state, lots of imperative logic, lots of API requests, lots of tricky loading states and error states. Ooof. Much of the complexity here is intrinsic to the import process, as defined by the API. But the component had grown to the point where it was quite difficult for me to modify without already having a solid understanding of its inner workings.
    • This PR explodes that component into the following separate files which I think should be much easier to work with:
      • ImportPreviewPage.svelte
      • ImportPreviewLayout.svelte
      • ImportPreviewContent.svelte (this is the component which most closely resembles the code prior to my refactoring)
      • ImportPreviewSheet.svelte
      • ImportPreviewPageUtils.ts
  • The preview page is making so many different API requests, that I decided to create a new abstraction to handle this complexity: AsyncStore. Look in ImportPreviewContent.svelte and see that we are doing:

    const previewRequest = new AsyncStore(generateTablePreview);

    At this point, previewRequest has not been run yet. Then later we do:

    previewRequest.run({ table, columns })

    Then, we can use $previewRequest to see the AsyncStoreValue associated with the request, and that tells us a bunch of stuff about the state of the request.

    The previewRequest can also be canceled, re-run, and reset.

    I don't think we necessarily need to use this new AsyncStore abstraction everywhere, but so far it seems to be really handy!

User-facing changes

After initial import page loads

Before After
image image
  • I changed some font sizes to be more consistent.
  • I adjusted some language slightly.
  • I added the "Column Types" field.

Since my previous comment on this PR, I made some minor tweaks to the language for the new field.

@kgodey I've removed the word "inference". Does this address your concern? I kept the field label as "Column types" instead of adopting your suggestion of "Guess column data types" because it felt simpler and more declarative. I don't feel too strongly about this though.

@dmos62 I added the text "for large imports". Does this address your concern. You said you'd consider "a paragraph or two of text," but I'd like to avoid overwhelming the user with that much information.

During file upload

Before After
image image
  • Don't show "processing" warning during upload phase because we are not actually processing the data yet. The progress bar should be enough here, I think.
  • I added a "Cancel upload" button. Which cancels the upload and resets the file input.

After file upload completes

Before After
(This state did not exist previously because the flow moved on to the next step automatically.) image
  • User must press "Proceed". This is one extra click, which is a downside to this PR, but I don't think we should auto-proceed anymore because the form has an additional question after the file input.

While table is being created

Before After
image image

While preview page is loading

Before After
image image

After preview page loads

Before After
image image
  • I improved aesthetics for tall viewports. Previously, there was white and tan horizontal space below the preview area. Now, that space retains the same background color as behind the preview sheet.
  • I moved the truncation warning from inside the form area to below the preview sheet to improve contextual placement. It now also specifies 20 exactly instead of saying "the first few rows". Also, it is hidden when the preview has fewer than 20 rows.
  • The form has the same "Column Types" question from the first page, allowing the user to make changes to the inference here as well.
  • I reframed the "Use first row as header" question as radio choices to better describe its behavior improve aesthetics when juxtaposed with the "Column Types" question.
  • I made some font sizes more consistent.
  • I made the "Table Name" field horizontal instead of vertical so as to conserve vertical space.
  • I moved the help text into a blue box and re-worded it a bit.
  • I added an inference query param to the URL for this page. The form syncs with this param bidirectionally.

After canceling import from preview screen

Before After
image image
  • Previously, Mathesar would immediately redirect to the schema page while concurrently deleting the unconfirmed import table. This lead to a momentary flash of the table being listed on the schema page before the deletion request resolved. This PR replaces the cancel button with a "Cleaning up" loading spinner.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the develop branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@seancolsen seancolsen added this to the v0.1.3 milestone Jul 17, 2023
@seancolsen seancolsen changed the base branch from develop to import_help_text July 17, 2023 13:15
@seancolsen
Copy link
Contributor Author

seancolsen commented Jul 17, 2023

@kgodey @ghislaineguerin @pavish @dmos62 I've assigned you all to this PR to comment with your opinion on the direction I'm heading with changes to the import process. (Don't review it because it's still a draft.)

Motivation for this PR

  1. We've had some reports of users being unable to import data into Mathesar due to problems with our column type inference. (See Cannot import large csv tables #2995 and Importing a CSV with a large number of columns times out at confirming data step #2250).
  2. Ideally we'd improve column type inference, but that work may take multiple months.
  3. As a stop-gap solution, @dmos62 and I have identified the following band-aid fixes that we think can be completed within this cycle:

Current (draft) state of this PR

Before After
image image
Upon uploading a file, the column type inference begins immediately, without user interaction. The user will need to click "Continue". This is one additional click, but I think that's okay.

Why I'm asking for your input

  • @kgodey You've commented on a few of the related tickets, and I think you might want to weigh in on this, especially since we're working to identify the best way to address user feedback while balancing implementation feasibility.

  • @ghislaineguerin you did the UX design and graphic for the import page. This PR makes some changes to that design. I've reduced some font sizes to present a more uniform "form" page with multiple inputs. And I've simplified the field label for the first question to be more declarative.

  • @pavish you implemented most of the import page (if I recall correctly). I'd like to note that the code changes in this PR are not yet ready for review. Please review from a UX perspective.

  • @dmos62 you had some additional thoughts about how best to present the inference options to the user

A "pie in the sky" import UX to consider later

I want you all to understand some additional context of the discussions that @dmos62 and I have been having, as it relates to the UX choices here.

I proposed an import process to Dom that would work like this:

  • Mathesar automatically starts performing type inference after data is uploaded.
  • Type inference is performed per-column.
  • Type inference is batched in 1k row "chunks". For Text columns, performance would be improved because Postgres would only need to analyze the first chunk to finish inference. For stricter column types (e.g. Number), the service layer would have knowledge of the worst-case progress of inference process.
  • The front end uses web sockets or polling to see the progress of inference on a per-column basis.
  • The front end can cancel inference on a per-column basis, falling back to Text for the column type.
  • The user can then wait for inference to complete. If they get impatient, they can manually choose the correct type for columns, thereby canceling the inference process.

Dom said this is doable, but would probably take a least one full cycle of focused work. We're interested in exploring this option at some point in the future, but we though it was important to deliver something in this release which would address user concerns.

This PR does not represent the ideal UX that I would want for the user, but it represents a balance between user concerns and implementation feasibility. That's why it's a bit of a "band-aid".

@kgodey
Copy link
Contributor

kgodey commented Jul 17, 2023

@seancolsen: @ghislaineguerin and I discussed this in our call today. @ghislaineguerin is going to spend a little bit of time thinking about this and maybe come up with an alternate proposal.

At minimum, I'd like some changes to the copy of the "Column type inference" section to make it more newbie-friendly. I don't think most users will know what "inference" means, we might want to make it "Guess column data types" or something like that.

Also, we'll need some changes to the next page in the flow since we state there that column types will be detected automatically:

Screenshot 2023-07-17 at 16-25-50 Import public Mathesar

@seancolsen
Copy link
Contributor Author

seancolsen commented Jul 18, 2023

Thanks for weighing in, @kgodey. When you say "@ghislaineguerin is going to spend a little bit of time thinking about this and maybe come up with an alternate proposal," do you expect that her proposal will still follow the overall structure in this PR? Or would it potentially be something wildly different? I ask because, with this being a draft PR, I still have some work left to do to make this change functional. I've only implemented the most basic of the UI so far, hoping it would be enough to communicate my intended direction. If you and @ghislaineguerin are satisfied with the direction this is heading at a high level, then I'll continue my work and be happy to wordsmith things later in the process. But if, for example, you two are unsure if it even makes sense to ask the user a question like this on the import page, then I'll hold off on continuing my work here.

@dmos62
Copy link
Contributor

dmos62 commented Jul 18, 2023

The input I had about the type inference section is that "slower" and "faster" is not enough commentary to inform the user (and it's misleading). E.g. inference is "slower", but in many cases it will take a neglibile amount of time and save a lot of time in not having to manually choose types for every column after import. Or, even worse, a user might opt for the "faster" option out of simple conservacy, not knowing that this means he should change column types himself after the import.

I don't have an idea for how to fix this. I would consider a paragraph or two of text explaining the tradeoffs and the risks.

As a sidenote, I really dislike when tools give you multiple options, some "faster" or "more stable" or whatever, but don't provide the insights necessary to actually make the choice.

@ghislaineguerin
Copy link
Contributor

Hi @seancolsen We're not concerned with the direction of this PR; we believe that allowing users to skip a potentially slow process is a good idea. Our aim is to reduce the number of choices a user has to make, so we think having data type inference as the default makes sense. It is also ok to ask them that question before they proceed to the preview.

We also discussed the idea of allowing users to interrupt the data type inference at any point, but we concluded that it might not be the best approach. Interrupting a process could be significantly more stressful and unpredictable than simply making a decision at the start. Perhaps once the preview is rendered, we could offer the user an option to check the data types at that point? This might enhance the user experience, as it could be interrupted without affecting other processes if it's taking too long.

As for the questions, I suggest focusing the labels more on automatic versus manual data type assignment. Instead of asking the user to choose between 'the best types' and 'using text', which might not meet their expectations, we could consider 'Guess column data types' and 'Set data types manually'. I think this wording implies that the user will still have control over the output.

@kgodey
Copy link
Contributor

kgodey commented Jul 18, 2023

@seancolsen does @ghislaineguerin's comment address the questions you asked me?

@seancolsen
Copy link
Contributor Author

@kgodey and @ghislaineguerin. Yep. I'm good here. Thanks for the clarification. I'll continue working on this.

@seancolsen seancolsen changed the base branch from import_help_text to develop July 21, 2023 18:26
- Utilize `getGloballyUniqueId` function.
- Clean up imports.
- Refactor `state` into simpler `isDraggingOver` boolean.
- Some minor readability improvements.
- Don't process drop if `fileList` is empty.
- Use `{@const}` for `percentage`.
- Use distinct CSS for focus, hover, and dragging-over.
- Ignore `:active` CSS state.
Also clean up ImportPreviewPage component a bit
@seancolsen seancolsen marked this pull request as ready for review August 8, 2023 02:25
@seancolsen
Copy link
Contributor Author

seancolsen commented Aug 8, 2023

@pavish I'd like you to give this a full review now. There is a lot of code to look over in here, much of which touches code you have worked on. Hard to describe everything in the PR description, since this was a tricky one. Happy to hop on a call to discuss if needed.

@kgodey, @ghislaineguerin, and @dmos62 I'd like you to take another look at the language of the question, as we initially discussed. Look for the "After initial import page loads" section in the PR description.

@dmos62
Copy link
Contributor

dmos62 commented Aug 8, 2023

I added the text "for large imports". Does this address your concern?

Yes. UI looks good. I don't have significant feedback.

@kgodey
Copy link
Contributor

kgodey commented Aug 8, 2023

I like the new copy better. Some thoughts:

  • I would make the header "Column Data Types" instead of "Column Types" since I think that's a little less ambiguous.
  • I am not sure we need the "Analyze data and" in the first option, I think it works fine if it's just "Pick the best data type for each column". (note that I changed "type" to "data type").

I don't have a strong opinion about any of this.

@kgodey kgodey removed their assignment Aug 8, 2023
@rajatvijay rajatvijay added the pr-status: review A PR awaiting review label Aug 9, 2023
@seancolsen seancolsen modified the milestones: v0.1.3, v0.1.4 Aug 17, 2023
@seancolsen
Copy link
Contributor Author

seancolsen commented Aug 22, 2023

@kgodey I like your proposed copy and I've pushed f873d70 to implement it.

image

@ghislaineguerin
Copy link
Contributor

@seancolsen This looks good to me. Thanks

@mathemancer
Copy link
Contributor

The newest version of pgTAP (the SQL testing framework we're using) seems to have introduced a bug:

theory/pgtap#315 .

I'm going to disable the affected tests for now.

Copy link
Member

@pavish pavish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seancolsen The code looks good.

However, I'm not able to import a file. I think this might be due to a backend issue or a change in contract (!?), I'm not sure.

The import fails at the preview step.

Here is the error:

[
    {
        "code": 4999,
        "message": "",
        "field": null,
        "detail": null,
        "stacktrace": [
            "1. File \"/code/db/columns/operations/alter.py\", line 161, in batch_update_columns",
            "2. db_conn.execute_msar_func_with_engine(",
            "3. File \"/code/db/connection.py\", line 19, in execute_msar_func_with_engine",
            "4. return conn.execute(",
            "5. File \"/usr/local/lib/python3.9/site-packages/psycopg/connection.py\", line 879, in execute",
            "6. raise ex.with_traceback(None)",
            "7. psycopg.errors.SyntaxError: column \"id\" of relation \"patents 1\" is an identity column",
            "8. HINT:  Use ALTER TABLE ... ALTER COLUMN ... DROP IDENTITY instead.",
            "9. CONTEXT:  SQL statement \"ALTER TABLE \"patents 1\" ALTER COLUMN id DROP DEFAULT, ALTER COLUMN id TYPE integer USING mathesar_types.cast_to_integer(id), ALTER COLUMN \"Center\" DROP DEFAULT, ALTER COLUMN \"Center\" TYPE text USING mathesar_types.cast_to_text(\"Center\"), ALTER COLUMN \"Status\" DROP DEFAULT, ALTER COLUMN \"Status\" TYPE text USING mathesar_types.cast_to_text(\"Status\"), ALTER COLUMN \"Case Number\" DROP DEFAULT, ALTER COLUMN \"Case Number\" TYPE text USING mathesar_types.cast_to_text(\"Case Number\"), ALTER COLUMN \"Patent Number\" DROP DEFAULT, ALTER COLUMN \"Patent Number\" TYPE text USING mathesar_types.cast_to_text(\"Patent Number\"), ALTER COLUMN \"Application SN\" DROP DEFAULT, ALTER COLUMN \"Application SN\" TYPE text USING mathesar_types.cast_to_text(\"Application SN\"), ALTER COLUMN \"Title\" DROP DEFAULT, ALTER COLUMN \"Title\" TYPE text USING mathesar_types.cast_to_text(\"Title\"), ALTER COLUMN \"Patent Expiration Date\" DROP DEFAULT, ALTER COLUMN \"Patent Expiration Date\" TYPE date USING mathesar_types.cast_to_date(\"Patent Expiration Date\")\"",
            "10. PL/pgSQL function __msar.exec_ddl(text) line 10 at EXECUTE",
            "11. PL/pgSQL function __msar.exec_ddl(text,anyarray) line 14 at RETURN",
            "12. SQL statement \"SELECT __msar.exec_ddl(",
            "13. 'ALTER TABLE %s %s',",
            "14. __msar.get_relation_name(tab_id),",
            "15. msar.process_col_alter_jsonb(tab_id, col_alters)",
            "16. )\"",
            "17. PL/pgSQL function msar.alter_columns(oid,jsonb) line 22 at PERFORM",
            "18. ",
            "19. During handling of the above exception, another exception occurred:",
            "20. ",
            "21. Traceback (most recent call last):",
            "22. File \"/code/mathesar/utils/models.py\", line 36, in update_sa_table",
            "23. alter_table(table.name, table.oid, table.schema.name, table.schema._sa_engine, data)",
            "24. File \"/code/db/tables/operations/alter.py\", line 50, in alter_table",
            "25. batch_update_columns(table_oid, engine, update_data['columns'])",
            "26. File \"/code/db/columns/operations/alter.py\", line 175, in batch_update_columns",
            "27. raise InvalidTypeOptionError",
            "28. db.columns.exceptions.InvalidTypeOptionError",
            "29. ",
            "30. During handling of the above exception, another exception occurred:",
            "31. ",
            "32. Traceback (most recent call last):",
            "33. File \"/usr/local/lib/python3.9/site-packages/rest_framework/views.py\", line 506, in dispatch",
            "34. response = handler(request, *args, **kwargs)",
            "35. File \"/code/mathesar/api/db/viewsets/tables.py\", line 57, in partial_update",
            "36. serializer.save()",
            "37. File \"/usr/local/lib/python3.9/site-packages/rest_framework/serializers.py\", line 200, in save",
            "38. self.instance = self.update(self.instance, validated_data)",
            "39. File \"/code/mathesar/api/serializers/tables.py\", line 170, in update",
            "40. instance.update_sa_table(validated_data)",
            "41. File \"/code/mathesar/models/base.py\", line 478, in update_sa_table",
            "42. result = model_utils.update_sa_table(self, update_params)",
            "43. File \"/code/mathesar/utils/models.py\", line 42, in update_sa_table",
            "44. raise base_api_exceptions.MathesarAPIException(e, status_code=status.HTTP_400_BAD_REQUEST)",
            "45. mathesar.api.exceptions.generic_exceptions.base_exceptions.MathesarAPIException: [{'code': 4999, 'message': '', 'field': None, 'detail': None}]"
        ]
    }
]

cc: @mathemancer

@pavish
Copy link
Member

pavish commented Aug 23, 2023

Update:

I cleaned my DB state and started fresh and it works as expected.

It's probably due to the SQL functions in my local environment not being updated. I did restart the server first, after I pulled in this branch but that doesn't seem to have updated the functions automatically. @mathemancer Is anything additional required?

@pavish pavish added this pull request to the merge queue Aug 23, 2023
Merged via the queue into develop with commit 8018a44 Aug 23, 2023
9 checks passed
@pavish pavish deleted the optional_inference branch August 23, 2023 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-status: review A PR awaiting review
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Make column type inference optional
7 participants