Added pandas and JSON normalization code #2968

Merged
merged 13 commits into mathesar-foundation:develop on Jul 6, 2023

@IamEzio IamEzio commented Jun 24, 2023

Fixes part of #2895

This PR does the following:

  • Adds pandas as a dependency and uses the pandas.json_normalize() function to normalize JSON file inputs.
  • Adds custom exceptions for the JSON import feature.
  • Adds a missing_keys.json file to test the JSON normalization code.

Screenshots

NA

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the develop branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@IamEzio IamEzio changed the title from "added json normalization code" to "Added pandas and JSON normalization code" Jun 24, 2023
@IamEzio IamEzio added the "pr-status: review" label (a PR awaiting review) Jun 24, 2023
@IamEzio IamEzio marked this pull request as draft June 25, 2023 06:07
@IamEzio IamEzio marked this pull request as ready for review June 25, 2023 17:07
@rajatvijay rajatvijay added this to the GSoC 2023 milestone Jun 26, 2023
@rajatvijay
Contributor

@IamEzio I'm assigning this back to you since the Python test cases are failing.

@rajatvijay rajatvijay assigned IamEzio and unassigned dmos62 Jun 26, 2023
@rajatvijay rajatvijay added the "pr-status: revision" label (a PR awaiting follow-up work from its author after review) and removed the "pr-status: review" label (a PR awaiting review) Jun 26, 2023
@IamEzio
Contributor Author

IamEzio commented Jun 26, 2023

Thanks @rajatvijay, I have fixed them now. PTAL

@@ -1,4 +1,5 @@
import json
+import pandas as pd
Contributor

Why are we renaming the import here?

Contributor

I'm used to seeing pandas renamed to pd, but most people's readability would probably be improved without this shorthand.

Contributor

Agreed re: readability.

Contributor Author

Updated it now. Thanks!

@dmos62 dmos62 left a comment

Also, please provide API tests for JSON validation and normalization.

except (JSONDecodeError, ValueError) as e:
    raise database_api_exceptions.InvalidJSONFormat(e)

if isinstance(data, list) and all(isinstance(val, dict) for val in data):
Contributor

A variable name to communicate intent of this boolean composition would be nice.

Suggested change
-if isinstance(data, list) and all(isinstance(val, dict) for val in data):
+is_list_of_dicts = isinstance(data, list) and all(isinstance(val, dict) for val in data)
+if is_list_of_dicts:

Contributor Author

Thanks Dom, added it now.
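
The validation step discussed in this thread can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from the PR: the two exception classes and the `load_records` helper are hypothetical stand-ins (the PR raises `database_api_exceptions.InvalidJSONFormat`), and rejecting anything that is not a list of objects is an assumption about the else branch, which the excerpt above does not show.

```python
import json


class InvalidJSONFormat(Exception):
    """Hypothetical stand-in for the PR's custom exception."""


class UnsupportedJSONFormat(Exception):
    """Hypothetical: valid JSON, but not a list of objects (assumed behavior)."""


def load_records(json_text):
    """Parse JSON text and return it only if it is a list of dicts."""
    try:
        data = json.loads(json_text)
    except ValueError as e:
        # json.JSONDecodeError is a subclass of ValueError.
        raise InvalidJSONFormat(e) from e

    # The named boolean communicates the intent of the compound check.
    is_list_of_dicts = isinstance(data, list) and all(
        isinstance(val, dict) for val in data
    )
    if is_list_of_dicts:
        return data
    raise UnsupportedJSONFormat("expected a JSON array of objects")
```

Naming the condition keeps the `if` readable and gives the reader one place to look when the accepted input shapes change.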

@@ -32,9 +33,14 @@ def insert_record_or_records(table, engine, record_data):
return None


-def insert_records_from_json(table, engine, json_filepath):
+def insert_records_from_json(table, engine, json_filepath, column_names):
Contributor

Add a docstring. Include explanation for why column_names is necessary. Include explanation for why and how json_normalize is being invoked (what's max_level and why is it 0). Include explanation of the general algorithm and intent.

Contributor Author

Done. Thanks!

Contributor

Include a summary and explanation of the algorithm. You normalize into a dataframe, then convert it to a JSON string, then to Python objects, ending up with a sequence of rows where each row is dict-like and every key-value pair is a column. Then you take those column values and, if they are a dict or a list, serialize them back into JSON. That takes some time to figure out, and the reason for having to do some of these things is not immediately obvious. Help the reader out with a summary and an explanation.

Contributor

Presume that the reader is not me or anyone that you've discussed JSON importing with.

Contributor Author

Hi Dom, I have added the explanation now. PTAL. Thanks!

Contributor

Awesome!
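
For readers who have not seen the PR's code, the algorithm summarized in this thread might look roughly like the sketch below. It follows the steps the reviewer lists (normalize into a dataframe, convert to a JSON string, load back into Python, re-serialize nested values), but `normalize_records` and the sample data are hypothetical, and the real `insert_records_from_json` also takes a table, an engine, and `column_names`, which are left out here.

```python
import json

import pandas


def normalize_records(records):
    """Turn a list of JSON objects into rows that all share the same keys."""
    # max_level=0 keeps nested dicts intact in a single column instead of
    # expanding them into dotted column names such as "address.city".
    dataframe = pandas.json_normalize(records, max_level=0)

    # Round-trip through a JSON string: keys missing from a record show up
    # as NaN in the dataframe and come back as None, and every row becomes
    # a plain dict with the full set of columns.
    rows = json.loads(dataframe.to_json(orient="records"))

    # Nested dicts and lists are serialized back into JSON strings so that
    # each value fits in one column.
    return [
        {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in row.items()
        }
        for row in rows
    ]


# Hypothetical input with missing keys and nested values.
records = [
    {"id": 1, "name": "Ada", "address": {"city": "London"}},
    {"id": 2, "name": "Bob", "tags": ["a", "b"]},
]
rows = normalize_records(records)
```

In this sketch every row ends up with the keys `id`, `name`, `address`, and `tags`; the first row's `tags` and the second row's `address` are `None`, and the nested values are JSON strings.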

@IamEzio
Contributor Author

IamEzio commented Jul 3, 2023

Hi @kgodey @dmos62, I have made the suggested changes. I also have PR #2977 as a follow-up, which has the required API tests. PTAL. Thanks!

@dmos62 dmos62 enabled auto-merge July 6, 2023 12:21
@dmos62 dmos62 added this pull request to the merge queue Jul 6, 2023
Merged via the queue into mathesar-foundation:develop with commit 23d7b24 Jul 6, 2023
10 checks passed
@IamEzio IamEzio deleted the json-normalize branch July 7, 2023 12:44
Labels
pr-status: revision A PR awaiting follow-up work from its author after review
4 participants