
add experimental import of JSONL files, with optional multiprocessing #4855 #4858 #4918

Conversation

mradamcox
Member

@mradamcox mradamcox commented Jun 18, 2019

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Description of Change

This pull request adds support for importing JSONL (JSON Lines) files through the existing import commands.

python manage.py packages -o import_business_data -s resources.jsonl

It also adds support for the use of multiprocessing during JSONL import.

python manage.py packages -o import_business_data -s resources.jsonl --use_multiprocessing
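For background, a JSONL (JSON Lines) file holds one complete JSON object per line, which is what allows an importer to process resources one at a time rather than parsing a single large document. A minimal sketch of reading such a file line by line (the key names here are hypothetical stand-ins, not taken from the actual Arches resource model):

```python
import json

# Hypothetical contents of a resources.jsonl file: one resource per line.
lines = [
    '{"resourceinstance": {"resourceinstanceid": "a1"}, "tiles": []}',
    '{"resourceinstance": {"resourceinstanceid": "a2"}, "tiles": []}',
]

# Each line is a standalone JSON document, so the file can be streamed
# and each record parsed independently.
resources = [json.loads(line) for line in lines]
for res in resources:
    print(res["resourceinstance"]["resourceinstanceid"])
```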

Both of these new features should be considered "experimental", chiefly because the print statements baked into the existing import code (as well as its error reporting) have not yet been updated to account for them. The console output during a JSONL import looks like this:

1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 11, Tiles Saved: 11, Relations for Import: 0, Relations Saved: 0
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 58, Tiles Saved: 58, Relations for Import: 0, Relations Saved: 0
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 34, Tiles Saved: 34, Relations for Import: 0, Relations Saved: 0

and while using multiprocessing, the output from the separate processes is interleaved:

1 of 1 resources saved
1 of 1 resources saved
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 11, Tiles Saved: 11, Relations for Import: 0, Relations Saved: 0
1 of 1 resources saved
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 34, Tiles Saved: 34, Relations for Import: 0, Relations Saved: 0
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 33, Tiles Saved: 33, Relations for Import: 0, Relations Saved: 0
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 54, Tiles Saved: 54, Relations for Import: 0, Relations Saved: 0
1 of 1 resources saved
Resources for Import: 1, Resources Saved: 1, Tiles for Import: 58, Tiles Saved: 58, Relations for Import: 0, Relations Saved: 0

Another important caveat: while multiple processes are running, ctrl+c does not stop the operation, because the processes keep spawning.
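This ctrl+c behavior is a known pitfall of `multiprocessing.Pool`. A common workaround (not part of this PR, and simplified here with a stand-in worker function) is to have worker processes ignore SIGINT so the interrupt is delivered only to the parent, which can then terminate the pool instead of letting it keep spawning workers:

```python
import multiprocessing
import signal


def _init_worker():
    # Workers ignore SIGINT, so ctrl+c reaches only the parent process,
    # which is then free to shut the pool down cleanly.
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def _square(n):
    # Stand-in for the real per-resource import work.
    return n * n


def run_pool(values):
    pool = multiprocessing.Pool(2, initializer=_init_worker)
    try:
        result = pool.map(_square, values)
        pool.close()
        return result
    except KeyboardInterrupt:
        pool.terminate()  # stop the workers rather than respawning them
        raise
    finally:
        pool.join()


if __name__ == "__main__":
    print(run_pool([1, 2, 3]))
```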

To address these shortcomings I have added some warnings.

For JSONL:

 WARNING: Support for loading JSONL files is still experimental. Be aware that
 the format of logging and console messages has not been updated.

For multiprocessing:

WARNING: Support for multiprocessing is still experimental. While using
multiprocessing to import resources, you will not be able to use ctrl+c (etc.)
to cancel the operation. You will need to manually kill all of the processes
or just close the terminal. Also, be aware that print statements
will be very jumbled.

and that warning is followed by a confirmation prompt (which you can bypass by adding -y/--yes to the initial command).

Issues Solved

#4858
#4855

Checklist

  • Unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Ticket Background

Further comments

@mradamcox mradamcox requested a review from ryan86 June 18, 2019 16:39
# during an import that uses multiprocessing.
# see https://stackoverflow.com/a/49461944/3873885
django.setup()
from django.db import connection, connections, transaction
Contributor

E402 module level import not at top of file

Member Author

In fact this line must be here before the models are imported, see code comment and stack overflow link.
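When an import genuinely must come after setup code, the usual way to keep the linter quiet is an inline `# noqa: E402` comment, which is a general flake8 convention rather than anything this PR adds. A sketch, using an environment variable as a stand-in for `django.setup()`:

```python
# Standard-library imports that have no setup dependency stay at the top.
import os

# Some setup must run before the dependent import (here, a stand-in for
# calling django.setup() before importing Django models).
os.environ.setdefault("EXAMPLE_SETUP_DONE", "1")

import json  # noqa: E402  -- intentionally placed after the setup call

print(json.dumps({"ok": True}))
```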

@coveralls

coveralls commented Jun 18, 2019

Coverage Status

Coverage decreased (-0.05%) to 42.401% when pulling db01542 on legiongis:4855_4858_jsonl_multiprocessing_4_4_x into 8eb6fc6 on archesproject:stable/4.4.x.

@@ -828,7 +855,9 @@ def import_business_data(self, data_source, config_file=None, overwrite=None, bu
path = utils.get_valid_path(source)
if path is not None:
print 'Importing {0}. . .'.format(path)
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite, bulk=bulk_load, create_concepts=create_concepts, create_collections=create_collections)
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite,
bulk=bulk_load, create_concepts=create_concepts,
Contributor

E128 continuation line under-indented for visual indent

Member Author

I would prefer to ignore this rule in this context. The linter would have me indent the continuation of the arguments very far to the right which in my mind is more difficult to read.

BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite, bulk=bulk_load, create_concepts=create_concepts, create_collections=create_collections)
BusinessDataImporter(path, config_file).import_business_data(overwrite=overwrite,
bulk=bulk_load, create_concepts=create_concepts,
create_collections=create_collections, use_multiprocessing=use_multiprocessing)
Contributor

E128 continuation line under-indented for visual indent

Member Author

I would prefer to ignore this rule in this context. The linter would have me indent the continuation of the arguments very far to the right which in my mind is more difficult to read.
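For reference, PEP 8 offers a hanging indent as an alternative to the far-right visual indent: break immediately after the opening parenthesis and indent every argument one level. That style also satisfies the linter, since E128 only fires on under-indented visual continuation lines. A sketch with a stub in place of `BusinessDataImporter(...).import_business_data(...)` (the stub and its return value are illustrative, not the real API):

```python
def stub_importer(overwrite=None, bulk=None, create_concepts=None,
                  create_collections=None, use_multiprocessing=False):
    # Stand-in for the real import call; just echoes a couple of args.
    return {"bulk": bulk, "use_multiprocessing": use_multiprocessing}


# Hanging indent: nothing follows the opening parenthesis, and each
# argument is indented one level. This avoids both the E128 warning and
# the far-right alignment the author finds hard to read.
result = stub_importer(
    overwrite=True,
    bulk=False,
    create_concepts=True,
    create_collections=True,
    use_multiprocessing=True,
)
```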

@@ -503,6 +509,7 @@ def load_business_data(package_dir):
business_data.append(os.path.join(package_dir, 'business_data', f))
else:
business_data += glob.glob(os.path.join(package_dir, 'business_data','*.json'))
business_data += glob.glob(os.path.join(package_dir, 'business_data','*.jsonl'))
Contributor

E231 missing whitespace after ','
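The fix is just a space after each comma. As a sketch of what the two combined glob patterns collect (the directory layout here is a throwaway fixture, not the real package structure):

```python
import glob
import os
import tempfile

# Build a throwaway business_data directory with one file per format.
tmp = tempfile.mkdtemp()
bd = os.path.join(tmp, "business_data")
os.makedirs(bd)
for name in ("a.json", "b.jsonl"):
    open(os.path.join(bd, name), "w").close()

# Both patterns, with PEP 8 spacing after the commas (fixes E231).
# Note "*.json" does not match "b.jsonl", so both globs are needed.
business_data = glob.glob(os.path.join(bd, "*.json"))
business_data += glob.glob(os.path.join(bd, "*.jsonl"))
print(sorted(os.path.basename(p) for p in business_data))
```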

@mradamcox
Member Author

This can now be tested by loading the demo HPLA package, using the branch I set up that has JSONL files in the business data directory:

python manage.py packages -o load_package -s https://github.com/legiongis/hpla_v4-pkg/archive/jsonl-business-data.zip -db true

To test multiprocessing you'll need to import those business data files individually and use the flag as described in the main body of this pull request.

@chiatt chiatt merged commit 4159d48 into archesproject:stable/4.4.x Jun 25, 2019
@mradamcox mradamcox mentioned this pull request Jul 11, 2019
6 tasks
@mradamcox mradamcox deleted the 4855_4858_jsonl_multiprocessing_4_4_x branch August 12, 2019 18:08