Added pandas and JSON normalization code #2968

IamEzio · 2023-06-24T16:53:00Z

Fixes part of #2895

This PR does the following:

Adds pandas as a dependency and uses the pandas.json_normalize() method to normalize JSON file inputs.
Adds custom exceptions for the JSON import feature.
Adds a missing_keys.json file to test the JSON normalization code.
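
As a rough illustration of the normalization mentioned above, the sketch below shows how pandas.json_normalize() handles records that do not all share the same keys. The records are hypothetical and are not the contents of missing_keys.json.

```python
import pandas

# Hypothetical records: the second one lacks the "email" key.
records = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob"},
]

# json_normalize builds one column per key seen in any record;
# a record that lacks a key gets a missing value (NaN) in that column.
frame = pandas.json_normalize(records)

columns = sorted(frame.columns)
missing_email = bool(frame["email"].isna().iloc[1])
```

Every record ends up with the same set of columns, which is the property a tabular import depends on.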

Screenshots

NA

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the develop branch of the repository.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

rajatvijay · 2023-06-26T09:12:46Z

@IamEzio assigning this back to you since the Python test cases are failing.

IamEzio · 2023-06-26T17:59:03Z

Thanks @rajatvijay, I have fixed them now. PTAL

kgodey · 2023-06-26T19:18:42Z

db/records/operations/insert.py

@@ -1,4 +1,5 @@
 import json
+import pandas as pd


Why are we renaming the import here?

I'm used to seeing pandas renamed to pd, but most people's readability would probably be improved without this shorthand.

Agreed re: readability.

Updated it now. Thanks!

dmos62

Also, please provide API tests for JSON validation and normalization.

dmos62 · 2023-06-27T10:19:12Z

mathesar/imports/json.py

+    except (JSONDecodeError, ValueError) as e:
+        raise database_api_exceptions.InvalidJSONFormat(e)
+
+    if isinstance(data, list) and all(isinstance(val, dict) for val in data):


A variable name to communicate intent of this boolean composition would be nice.

Suggested change

-    if isinstance(data, list) and all(isinstance(val, dict) for val in data):
+    is_list_of_dicts = isinstance(data, list) and all(isinstance(val, dict) for val in data)
+    if is_list_of_dicts:

Thanks Dom, added it now.
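
A minimal sketch of the validation step discussed in this thread, for readers without the full diff. The InvalidJSONFormat class here is a local stand-in for the PR's custom exception, load_records is a hypothetical helper name, and rejecting every other shape is an assumption, since the diff excerpt does not show that branch.

```python
import json


class InvalidJSONFormat(Exception):
    """Local stand-in for the PR's custom exception (assumption)."""


def load_records(text):
    """Parse JSON text and return it only if it is a list of objects."""
    try:
        data = json.loads(text)
    except ValueError as e:
        # json.JSONDecodeError is a subclass of ValueError,
        # so this clause covers both.
        raise InvalidJSONFormat(str(e)) from e

    # The named boolean documents the intent of the compound check.
    is_list_of_dicts = isinstance(data, list) and all(
        isinstance(val, dict) for val in data
    )
    if is_list_of_dicts:
        return data
    # Assumption: anything else is rejected in this sketch.
    raise InvalidJSONFormat("expected a list of JSON objects")
```

Naming the condition keeps the branch readable and gives the check a single place to change if more shapes are supported later.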

dmos62 · 2023-06-27T10:21:53Z

db/records/operations/insert.py

@@ -32,9 +33,14 @@ def insert_record_or_records(table, engine, record_data):
    return None


-def insert_records_from_json(table, engine, json_filepath):
+def insert_records_from_json(table, engine, json_filepath, column_names):


Add a docstring. Include explanation for why column_names is necessary. Include explanation for why and how json_normalize is being invoked (what's max_level and why is it 0). Include explanation of the general algorithm and intent.

Done. Thanks!

Include a summary and explanation of the algorithm. You normalize into a dataframe, then convert to a JSON string, then to Python, ending up with a sequence of rows where each row is dict-like and every key-value pair is a column. Then you take those columns and, if they are a dict or a list, serialize them back into JSON. That takes some time to figure out, and the reason for some of these steps is not immediately obvious. Help the reader out with a summary and an explanation.

Presume that the reader is not me or anyone that you've discussed JSON importing with.

Hi Dom, I have added the explanation now. PTAL. Thanks!
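
The steps described in this thread can be sketched roughly as follows. This is not the merged implementation: normalize_records is a hypothetical name, and the sketch leaves out file reading, the column_names argument, and the database insert, so it covers only the normalize, round-trip, and re-serialize steps.

```python
import json

import pandas


def normalize_records(records):
    """Turn a list of JSON objects into uniform, insertable rows."""
    # max_level=0 stops json_normalize from flattening nested dicts into
    # dotted column names, so each top-level key stays a single column.
    frame = pandas.json_normalize(records, max_level=0)

    # Round-trip through a JSON string so that missing values (NaN)
    # come back as None and every value is a plain Python object.
    rows = json.loads(frame.to_json(orient="records"))

    # Nested dicts and lists are serialized back to JSON text so that
    # each one fits into a single column.
    for row in rows:
        for key, value in row.items():
            if isinstance(value, (dict, list)):
                row[key] = json.dumps(value)
    return rows


# Hypothetical input: the second record is missing two keys.
rows = normalize_records([
    {"id": 1, "tags": ["a", "b"], "address": {"city": "Oslo"}},
    {"id": 2},
])
```

After the call, every row has the same keys, nested values are JSON strings, and keys absent from a record are None.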

dmos62 · 2023-06-27T10:23:47Z

db/records/operations/insert.py

@@ -1,4 +1,5 @@
 import json
+import pandas as pd


I'm used to seeing pandas renamed to pd, but most people's readability would probably be improved without this shorthand.

IamEzio · 2023-07-03T18:41:47Z

Hi @kgodey @dmos62, I have made the suggested changes. PR #2977 is a follow-up that contains the required API tests. PTAL. Thanks!

dmos62 · 2023-07-04T12:22:13Z

db/records/operations/insert.py

@@ -32,9 +33,14 @@ def insert_record_or_records(table, engine, record_data):
    return None


-def insert_records_from_json(table, engine, json_filepath):
+def insert_records_from_json(table, engine, json_filepath, column_names):


Include a summary and explanation of the algorithm. You normalize into a dataframe, then convert to a JSON string, then to Python, ending up with a sequence of rows where each row is dict-like and every key-value pair is a column. Then you take those columns and, if they are a dict or a list, serialize them back into JSON. That takes some time to figure out, and the reason for some of these steps is not immediately obvious. Help the reader out with a summary and an explanation.

added json normalization code

6713950

IamEzio changed the title from "added json normalization code" to "Added pandas and JSON normalization code" Jun 24, 2023

IamEzio mentioned this pull request Jun 24, 2023

Support for importing JSON #2895

Closed

IamEzio assigned dmos62 Jun 24, 2023

IamEzio added the "pr-status: review" (A PR awaiting review) label Jun 24, 2023

IamEzio and others added 4 commits June 25, 2023 08:58

Merge branch 'centerofci:develop' into json-normalize

86b4c96

created json_parsing dir

2566d90

added validation while creating datafile

62bf8cd

nit

27386ea

IamEzio marked this pull request as draft June 25, 2023 06:07

nit

c991229

IamEzio marked this pull request as ready for review June 25, 2023 17:07

added default msg in exceptions

39b378a

rajatvijay added this to the GSoC 2023 milestone Jun 26, 2023

rajatvijay assigned IamEzio and unassigned dmos62 Jun 26, 2023

rajatvijay added the "pr-status: revision" (A PR awaiting follow-up work from its author after review) label and removed the "pr-status: review" (A PR awaiting review) label Jun 26, 2023

fixed failing tests

2cf344f

IamEzio mentioned this pull request Jun 26, 2023

Added api tests for importing JSON feature #2977

Merged

7 tasks

kgodey reviewed Jun 26, 2023

dmos62 requested changes Jun 27, 2023

added docstring

45d5218

Merge branch 'develop' into json-normalize

f19fbd1

dmos62 requested changes Jul 4, 2023

IamEzio added 2 commits July 4, 2023 23:35

explained json normalization algo

ca6d5ee

merged from master

8266039

Merge branch 'develop' into json-normalize

8a3d900

dmos62 approved these changes Jul 6, 2023

dmos62 enabled auto-merge July 6, 2023 12:21

dmos62 added this pull request to the merge queue Jul 6, 2023

Merged via the queue into mathesar-foundation:develop with commit 23d7b24 Jul 6, 2023
10 checks passed

IamEzio deleted the json-normalize branch July 7, 2023 12:44
