Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SO-2127 snomed rf2 importer issues #122

Merged
merged 30 commits into from Jan 6, 2017

Conversation

cmark
Copy link
Member

@cmark cmark commented Dec 12, 2016

SNOMED CT RF2 importer enhancements, bugfixes:

  • Batch processing (execute batch index queries instead of single item queries to improve performance)
  • Reduce memory requirement of FULL RF2 imports (SG dataset successfully completes with 6-8 GB)
  • FULL import allows unpublished components to be imported
  • Skip cycle detection
  • Embedded SCT ID service is now properly execute bulk operations (using properly configured index queries to load all available document with a single query)
  • Component caching now supports bulk loading/lookup of SCT components
  • Run purge/optimize operations after versioning to remove garbage from index
  • Load partial documents where possible

Clear unnecessary allocated data when after validator.doValidate().
Validate Stated Relationships file instead of validating Relationships
file twice.
Report errors to a files instead of console.
Report taxonomy errors by effective time.
Allow unpublished components to be imported from FULL RF2 files.
...instead of Strings

Add writeTo(Writer) method to SnomedValidationDefect.
Bulk ID operations will load documents in bulk instead of one by one
before executing the operation.
...instead of referencing variable from anonym class.

This will ensure that only relevant information from the RF2 rows will
be kept in memory and not the entire row.
Component importers will read RF2 rows until they reach the specified
threshold (50000 atm) and then start processing them instead of
processing the rows one by one. This way the importer will execute way
more less index lookups than before (single index query per 50000 rows
to get the current revision).
...for the last imported RF2 effective time
Fetch Description objects by storageKey and attach new members to lang
members list directly instead of hacking with low-level revision
objects.
@cmark cmark requested review from apeteri and nagyo December 12, 2016 09:35
Copy link
Member

@apeteri apeteri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the changes!

Please take a look at the review comments below; there doesn't seem to be any big stuff to move around. Some of them are already outdated because I reviewed commits in chronological order. 😄

try {
for (SnomedValidationDefect defect : defects) {
final String filePath = defect.getFileName();
final String defectsFile = String.format("d:/%s_%s_defects.txt", configuration.getArchiveFile().getName(), filePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded file location above 👓

}
for (String lineDefect : defect.getDefects()) {
defectWriters.get(defectsFile).write(defect.getDefectType().name() + "\t" + lineDefect);
defectWriters.get(defectsFile).write(lineSeparator);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BufferedWriter exposes a newLine method, which uses the line.separator system property as well, so lineSeparator could be removed if it's not used elsewhere.

@@ -38,6 +38,8 @@
*/
public class Rf2FileModifier {

private static final Splitter ON = Splitter.on('\t');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename this constant so that it indicates that it's a Splitter.

@@ -160,6 +160,7 @@ protected void doValidate(String effectiveTime, IProgressMonitor monitor) {
addDefect(DefectType.REFSET_MEMBER_COMPONENT_NOT_EXIST, referencedComponentNotExist);
uuidInvalid.clear();
referencedComponentNotExist.clear();
memberDataByUuid.clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuidNotUnique is not cleared here. Do all fields need to be cleared in all validators after checking an effectiveTime layer? If not, please revisit doValidate methods and check if the appropriate fields are cleared.

Copy link
Member Author

@cmark cmark Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the field whether it needs to be cleared or not after doValidate(). uuidNotUnique seems to be used to report ID uniqueness issues independent of the effective time.

final Collection<String> defects = newHashSet();
final Collection<String> conceptIdsToInactivate = newHashSet();

final Collection<TaxonomyDefect> defects = newArrayListWithExpectedSize(invalidRelationships.size());
for (final String effectiveTime : invalidRelationships.keySet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that ArrayListMultimap's key set might not preserve the ordering used in processTaxonomy. (Not sure if we are relying on it here.)

editedConcept.setEffectiveTime(currentRow.getEffectiveTime());
editedConcept.setReleased(true);
protected void applyRow(Concept component, ConceptRow row, Collection<Concept> componentsToAttach) {
component.setId(row.getId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as in reference set importers, core component importers could also avoid setting the SCTID twice.


return description;
protected void attach(Collection<Description> componentsToAttach) {
// nothing to do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thought: it might be quicker to store the concept IDs instead of resolving them in applyRow, retrieve the actual concepts with a single call here, then use concept.getDescriptions().add(description) on the appropriate concept. But we'd have to test this somehow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my original intention with the attach method, use it everywhere, but I found the current impl. good enough (in terms of performance) so I did not change these. (takes ~1 sec to process 50.000 descriptions/relationships at the moment)

component.setActive(row.isActive());
component.setModule(getConceptSafe(row.getModuleId(), SnomedRf2Headers.FIELD_MODULE_ID, row.getId()));

if (component.getSource() == null || !component.getSource().getId().equals(row.getSourceId())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to accept import files where the source ID gets changed? Is the SCT change processor prepared to properly index these changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully, no. 😄


return relationship;
protected void attach(Collection<Relationship> componentsToAttach) {
// nothing to do, we already registered the relationships to the source concepts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same as with descriptions.)

@@ -406,8 +402,7 @@ private ImportAction checkHeaders(final String[] expectedHeader, final String[]
}

private ImportAction handlePreImportException(final SuperCSVException e) {
final String reason = null != e.getMessage() ? " Reason: '" + e.getMessage() + "'" : "";
log("Exception caught while reading release file. Continuing with next row." + reason);
log("Exception caught while reading release file. Continuing with next row. {}", e.getMessage());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the vararg Objects land in an SLF4J Logger, we could propagate exceptions as an additional parameter now, to get a stack trace along with the exception message in the log output. There are other opportunities to do so below.

@@ -633,13 +634,13 @@ protected void preCommit(final InternalCDOTransaction transaction) throws Snowow
}

private ImportAction checkCommitException(final SnowowlServiceException e) {
log("SNOMED CT import failed. Caught exception while import, aborting. {}", e.getMessage());
log("SNOMED CT import failed. Caught exception while import, aborting. {}", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The placeholder should be removed from this format string.

importContext.getLogger().warn("Caught exception while import, aborting.", e);
return ImportAction.BREAK;
}

private ImportAction checkCommitException(final CommitException e) {
log("SNOMED CT import failed. Caught exception while import, aborting. {}", e.getMessage());
log("SNOMED CT import failed. Caught exception while import, aborting. {}", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also here)

@apeteri
Copy link
Member

apeteri commented Jan 6, 2017

Good to merge otherwise. ✔️

@cmark
Copy link
Member Author

cmark commented Jan 6, 2017

Thank you!

@cmark cmark merged commit ca6c944 into develop Jan 6, 2017
@cmark cmark deleted the issue/SO-2127-snomed-rf2-importer-issues branch June 19, 2017 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants