New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added pandas and JSON normalization code #2968
Conversation
@IamEzio assigning this back to you since python test cases are failing. |
Thanks @rajatvijay, I have fixed them now. PTAL |
db/records/operations/insert.py
Outdated
@@ -1,4 +1,5 @@ | |||
import json | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we renaming the import here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm used to seeing pandas
renamed to pd
, but most people's readability would probably be improved without this shorthand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed re: readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it now. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please provide API tests for JSON validation and normalization.
mathesar/imports/json.py
Outdated
except (JSONDecodeError, ValueError) as e: | ||
raise database_api_exceptions.InvalidJSONFormat(e) | ||
|
||
if isinstance(data, list) and all(isinstance(val, dict) for val in data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A variable name to communicate intent of this boolean composition would be nice.
if isinstance(data, list) and all(isinstance(val, dict) for val in data): | |
is_list_of_dicts = isinstance(data, list) and all(isinstance(val, dict) for val in data) | |
if is_list_of_dicts: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Dom, added it now.
@@ -32,9 +33,14 @@ def insert_record_or_records(table, engine, record_data): | |||
return None | |||
|
|||
|
|||
def insert_records_from_json(table, engine, json_filepath): | |||
def insert_records_from_json(table, engine, json_filepath, column_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring. Include explanation for why column_names
is necessary. Include explanation for why and how json_normalize
is being invoked (what's max_level
and why is it 0
). Include explanation of the general algorithm and intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include a summary and explation of the algorithm. You normalize into a dataframe, then convert to a JSON string, then to Python, ending up with a sequence of rows where each row is dict-like, in which every key-value pair is a column, then you take those columns and if they are a dict or a list, you serialize them back into JSON. That takes some time to figure out and the reason for having to do some of these things is not necessarily immediately obvious. Help the reader out with a summary and an explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presume that the reader is not me or anyone that you've discussed JSON importing with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Dom, I have added the explanation now. PTAL. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
db/records/operations/insert.py
Outdated
@@ -1,4 +1,5 @@ | |||
import json | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm used to seeing pandas
renamed to pd
, but most people's readability would probably be improved without this shorthand.
@@ -32,9 +33,14 @@ def insert_record_or_records(table, engine, record_data): | |||
return None | |||
|
|||
|
|||
def insert_records_from_json(table, engine, json_filepath): | |||
def insert_records_from_json(table, engine, json_filepath, column_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include a summary and explation of the algorithm. You normalize into a dataframe, then convert to a JSON string, then to Python, ending up with a sequence of rows where each row is dict-like, in which every key-value pair is a column, then you take those columns and if they are a dict or a list, you serialize them back into JSON. That takes some time to figure out and the reason for having to do some of these things is not necessarily immediately obvious. Help the reader out with a summary and an explanation.
23d7b24
Fixes part of #2895
This PR does the following:
pandas
as a dependency and usespandas.json_normalize()
method to normalize JSON file inputs.missing_keys.json
file to test json normalization code.Screenshots
NA
Checklist
Update index.md
).develop
branch of the repositoryvisible errors.
Developer Certificate of Origin
Developer Certificate of Origin