Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Somatic CNV tagging germline events, bringing AnnotatedInterval in line with tribble reading (and gearing up for writing as well), and region merging. #4276

Merged
merged 7 commits into from
Apr 10, 2018

Conversation

LeeTL1220
Copy link
Contributor

@LeeTL1220 LeeTL1220 commented Jan 26, 2018

Most of these changes are to support automated evaluation of GATK CNV.

  • Updates AnnotatedIntervals (formerly SimpleAnnotatedGenomicRegion) to use the tribble framework for reading. Writing is done in a way that should be concordant with a future tribble writing framework, as per discussion with @droazen.
  • Changes to XsvLocatableTableCodec to support usage of arbitrary config files. This cannot be done when using tribble features in the CLI. Already reviewed with @jonn-smith . Support for SAM File headers and comments is included.
  • Note: The reading of AnnotatedIntervals cannot be done automatically on the command line, unless the config file is a sibling. The tools below do not even attempt this, since the use cases involved will never have a sibling config file.
  • Created a default config file in the jar file resources to read tsvs with locatable fields from the CNV collection files. This is much less strict than the framework used by the CNV tools. The reader will accept any columns (or subset of the columns).

CLIs (both experimental quality):

  • TagGermlineEvents is a simple tool that attempts to identify events in a tumor seg file that correspond to a germline events.

    • This is done purely with concordance on the breakpoints of the events (within some padding).
    • Input germline segments must have calls.
    • If a germline call is broken into multiple segments, this tool will handle that appropriately (ditto if there are multiple tumor segments overlapping the germline call).
  • MergeAnnotatedRegions will merge all overlapping regions and resolve annotation value conflicts.

Closes #3995

@codecov-io
Copy link

codecov-io commented Jan 29, 2018

Codecov Report

Merging #4276 into master will increase coverage by 0.039%.
The diff coverage is 82.671%.

@@               Coverage Diff               @@
##              master     #4276       +/-   ##
===============================================
+ Coverage     79.807%   79.846%   +0.039%     
- Complexity     17163     17321      +158     
===============================================
  Files           1067      1075        +8     
  Lines          62439     62891      +452     
  Branches       10138     10176       +38     
===============================================
+ Hits           49831     50216      +385     
- Misses          8661      8699       +38     
- Partials        3947      3976       +29
Impacted Files Coverage Δ Complexity Δ
...broadinstitute/hellbender/utils/IntervalUtils.java 91.532% <ø> (ø) 181 <0> (ø) ⬇️
...tils/codecs/xsvLocatableTable/XsvTableFeature.java 55.385% <ø> (ø) 16 <0> (ø) ⬇️
.../broadinstitute/hellbender/utils/tsv/DataLine.java 86.325% <100%> (ø) 59 <0> (ø) ⬇️
...tools/funcotator/dataSources/TableFuncotation.java 60% <100%> (ø) 20 <0> (ø) ⬇️
.../tools/copynumber/utils/MergeAnnotatedRegions.java 100% <100%> (ø) 3 <3> (?)
...ils/annotatedinterval/AnnotatedIntervalHeader.java 100% <100%> (ø) 6 <6> (?)
...tmutpileup/ValidateBasicSomaticShortMutations.java 85.965% <100%> (ø) 7 <0> (ø) ⬇️
...nder/tools/copynumber/utils/TagGermlineEvents.java 100% <100%> (ø) 3 <3> (?)
...ataSources/xsv/LocatableXsvFuncotationFactory.java 85.185% <61.538%> (+0.491%) 24 <0> (-4) ⬇️
...g/broadinstitute/hellbender/utils/io/Resource.java 55.556% <66.667%> (+3.175%) 6 <1> (+1) ⬆️
... and 21 more

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LeeTL1220 Can you address some of the design issues with SimpleAnnotatedGenomicRegion and the associated classes, as well as change the test files to more accurately reflect the expected inputs (the legacy CNV test files are currently a mishmash of conventions from the new CNV collections, altered locatable column headers, etc.)? I think it might be easier to proceed with the review once some of this refactoring is done.

As we discussed, I was also unable to run the tools from the command line, even though tests are passing. Not sure if you figured out what was going on there, but I also caught another issue that would not have caused tests to fail (i.e., you only allow writing to pre-existing files). There may be other issues lurking, which is why I'd like to be able to run from the command line before reviewing further.

Thanks for being accommodating!

@@ -1,4 +1,6 @@
#SAMPLE_NAME=SAMPLE2
@HD VN:1.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the format of many of these test files. I assume that the only way we are allowed to modify incoming files is to change the relevant column headers to CONTIG, START, and END (since this is specified by the tool documentation), correct? (Although see comment above about just using the column numbers to specify the locatable columns.)

Then why does this one have a CNV-collections header along with with legacy (GATK CNV 1.0) column headers? Under what circumstances would we ever see an input like this?

I don't think any of these test files should contain a CNV-collections header unless they are files that would have been produced by the pipeline without further modification. (Although if you want to be really thorough, you can use a test file that might have been created by CombineSegmentBreakpoints to check that you can further combine it with other segment files. Such a file could conceivably be a mishmash like this one. However, you should name it clearly and document the test accordingly.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the -with-legacy-headers from the test file names.

Currently, since this is being driven by a config file (that contains CONTIG, START, and END), the input files are meant to represent files either generated by the CNV pipeline.

These files could already conceivably have been generated by CombineSegmentBreakpoints.

Are you asking me to test input files with different config file formats?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that your test files should primarily reflect actual inputs expected by the tool.

Now that you've changed over to config files that identify the column numbers of the locatable fields, I would not expect to see this file as a typical input---it's a mishmash of CNV 1.0/ReCapSeg-style columns, modified CONTIG\tSTART\tEND column headers, and a CNV-collections-style SAM header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just using Segment_Mean and Segment_Call (CNV 1.0 style columns) as two example annotations. While not likely an exact input (unless I was comparing CNV 1.0 to CNV ModeledSegments), the header can really be CONTIG, START, END + anything. The only requirement for the columns for this tool is to have CONTIG, START, and END.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is somewhat confusing to use this as test input if we'd never really see it in the real world. As a developer who might have to maintain the tests that use this input, it's nice to be able to understand what the origin of the input might be (in case I have to modify it or come up with new inputs). The name of the test file (*-different-annotation-headers-with-legacy-header.tsv) doesn't really provide enough clues in that regard, either.

Why not just use an unmodified ReCapSeg or CNV 1.0 seg file as test input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using fake data, I have included test files from JaBbA, UCSC, ReCapSeg (which is like CNV 1.0), PCAWG consensus files, and an Oncotator annotated CNV output.

Done.

#SAMPLE_NAME=SAMPLE1
@HD VN:1.5
@SQ SN:1 LN:2000000
@RG ID:GATKCopyNumber SM:sample1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this filename include "-with-legacy-header"? Isn't this a file produced by CallCopyRatioSegments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of all -with-legacy-header ...
Done

@@ -1,3 +1,6 @@
@HD VN:1.5
@SQ SN:1 LN:2000000
@RG ID:GATKCopyNumber SM:test-sample
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file is only used in SimpleAnnotatedGenomicRegionUnitTest, so it should be renamed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import java.util.function.Function;
import java.util.stream.Collectors;

@CommandLineProgramProperties(
oneLineSummary = "Combine the breakpoints of two segment files and annotate the resulting intervals with chosen columns from each file.",
summary = "Combine the breakpoints of two segment files while preserving annotations.\n" +
"This tool will load all segments into RAM.\n"+
"Expected interval columns are: " + SimpleAnnotatedGenomicRegion.CONTIG_HEADER + ", " +
SimpleAnnotatedGenomicRegion.START_HEADER + ", " + SimpleAnnotatedGenomicRegion.END_HEADER,
"Column headers for locatable information are taken from the first segment file.\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably more useful as part of the javadoc for the --segments argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, why do we even need the column headers to be CONTIG, START, and END if you can use the XSV config files to specify these by column number? I would instead require that the first three columns give the locatable columns with arbitrary column headers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can specify the column headers or column index now, via config file. By default these are CONTIG, START, and END.

Deleted comment.

Done.

Arrays.asList(input1ToOutputHeaderMap, input2ToOutputHeaderMap), getBestAvailableSequenceDictionary(),
l -> progressMeter.update(l));

final SamFileHeaderMerger samFileHeaderMerger = new SamFileHeaderMerger(SAMFileHeader.SortOrder.coordinate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK to merge sequence dictionaries if they are not identical? Perhaps you should at least emit a warning?

There is not much documentation about how sequence dictionaries or comment lines/headers are handled, in general. If this is functionality that you think is worth having, perhaps it's also worth documenting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs and this tool will throw an exception if the dictionaries cannot be merged properly. For example, identical contig names with different lengths will cause an exception.

}

/**
* Same as {@link #create(Path, Path, Set)} , but uses the default annotation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no scenario in which we would not want to use the default config file, then let's just eliminate the corresponding constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above... made private
Done

public static SimpleAnnotatedGenomicRegionCollection create(final List<SimpleAnnotatedGenomicRegion> regions,
final SAMFileHeader samFileHeader,
final List<String> annotations,
final String contigColumnName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The telescoping constructors makes it difficult to see how these column names as used. As noted elsewhere, they are passed to a write method at one point but are never actually used there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced to two public create methods.

Done

codec.getComments(), regions, codec.getFinalContigColumn(), codec.getFinalStartColumn(), codec.getFinalEndColumn());

}
catch ( final FileNotFoundException ex ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to throw an exception for broken tests---in that case, we should just fix the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a typo. It meant that the input file was not found.

Done.

/** Does not include the locatable fields. */
private List<String> annotations;
private List<String> comments;
private List<SimpleAnnotatedGenomicRegion> records;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in the last round of comments, because you allow SimpleAnnotatedGenomicRegion to contain an arbitrary (and mutable!) map, this collection class is quite vulnerable to shenanigans. For example, I can create a collection that contains records with different annotations.

I think some sort of parametrization of records and collections that restricts them to a certain set of specified annotations would be best. However, barring this, you should at least check that all records have consistent annotations upon construction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer mutable.

Effectively, there is enforcement at construction now. And there is an attribute on the collection that will store the annotations that are in each AnnotatedInterval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @param input readable path to use for the xsv file. Must be readable. Never {@code null}.
* @param inputConfigFile config file for specifying the format of the xsv file. Must be readable. Never {@code null}.
* @param headersOfInterest Only preserve these headers. These must be present in the input file. This parameter should not include the locatable columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why headersOfInterest here and columnsOfInterest elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@LeeTL1220 LeeTL1220 force-pushed the ll_issue3995_tribble branch 2 times, most recently from 9e98d84 to 4970a3a Compare February 16, 2018 20:17
*/
public final class AnnotatedInterval implements Locatable {

private SimpleInterval interval;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this can be final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pshapiro4broad is here! Now it's a party!

Done.

// By initializing writer to be based on fileWriter, writer.close will close the fileWriter as well.
writer = new SimpleTableWriter(fileWriter, new TableColumnCollection(finalColumnList));
} catch (final IOException ioe) {
throw new UserException.CouldNotCreateOutputFile(outputFile, "Could not create: " + outputFile.getAbsolutePath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this new UserException.CouldNotCreateOutputFile(outputFile, ioe);? It seems like the user would really want the exception in the output, as it could be a few different things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it a GATKException. Now user will get runtime error details.

Done.

// TODO: Test for optional SAM File Header
// TODO: Test for other column names
/**
* {@inheritDoc}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an overridden method with no doc changes, it is sufficient to omit the javadoc completely, as javadoc automatically inherits docs for overridden methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That answers that. Deleted.

Done.


return new Object[][] {
{
Arrays.asList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could avoid some code duplication here by moving the Arrays.asList() call into your test method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if I am missing something, but it seems like I would have to do the following, which does not reduce code duplication:

    @DataProvider(name = "mergeTests")
    public Object [][] createMergeTests() {

        return new Object[][] {
            {
                    new AnnotatedInterval[]{
                            new AnnotatedInterval(new SimpleInterval("1", 100, 200),
                                    ImmutableSortedMap.of("Foo", "bar", "Foo1", "bar1")),
                            new AnnotatedInterval(new SimpleInterval("1", 100, 200),
                                    ImmutableSortedMap.of("Foo", "bar", "Foo1", "bar1"))
                    },
// ..........

No action

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pshapiro4broad Comment above...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid that by calling new Object[][][] { ... } E.g.

    @Test(dataProvider = "simpleTests")
    public void testSimpleTagging(AnnotatedInterval[] tumorSegments, AnnotatedInterval[] normalSegments,
                                  AnnotatedInterval[] gt) {

        final List<AnnotatedInterval> testResult = SimpleGermlineTagger.tagTumorSegmentsWithGermlineActivity(Arrays.asList(tumorSegments),
                Arrays.asList(normalSegments), "call",
                ReferenceUtils.loadFastaDictionary(new File(ReferenceUtils.getFastaDictionaryFileName(REF))), TEST_GERMLINE_TAGGING_ANNOTATION, 10);

        Assert.assertEquals(testResult, Arrays.asList(gt));
    }

    @DataProvider(name = "simpleTests")
    public Object[][] createSimpleTests() {
        return new Object[][][]{{
                {
                        // Tumor segments are assumed to be mutable.
                        new AnnotatedInterval(new SimpleInterval("1", 100, 200), Maps.newTreeMap(ImmutableSortedMap.of("call", "+"))),
                        new AnnotatedInterval(new SimpleInterval("1", 201, 300), Maps.newTreeMap(ImmutableSortedMap.of("call", "0")))
                },
                {
                        new AnnotatedInterval(new SimpleInterval("1", 100, 200), ImmutableSortedMap.of("call", "+")),
                        new AnnotatedInterval(new SimpleInterval("1", 201, 500), ImmutableSortedMap.of("call", "0"))
                },
                {
                        new AnnotatedInterval(new SimpleInterval("1", 100, 200), ImmutableSortedMap.of("call", "+", TEST_GERMLINE_TAGGING_ANNOTATION, "+")),
                        new AnnotatedInterval(new SimpleInterval("1", 201, 300), ImmutableSortedMap.of("call", "0", TEST_GERMLINE_TAGGING_ANNOTATION, "0"))
                }
        },
       ...
        };
    }

Although now that I've written it out it doesn't really format that much shorter. I'll leave it up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(no action)

return new Object[][] {
{
// Trivial case
Lists.newArrayList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Arrays.asList()? Do these need to be modifiable? Also, you could shorten the data provider by putting the array -> list conversion into the test method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to Arrays.asList(). Done.

No action on the array -> list conversion in the test.

@droazen
Copy link
Collaborator

droazen commented Mar 1, 2018

@samuelklee: @LeeTL1220 and I just had a discussion about the writer aspect of this branch, and we agreed on the following:

  1. Lee will introduce a new header type to encapsulate the information that's currently passed in individually to the writeHeader() method in AnnotatedIntervalWriter. This makes the interface cleaner and more future-proof, since the signature will just become writeHeader(AnnotatedIntervalHeader)

  2. Lee will start writing out 3 additional structured header lines (as comment lines) to every header, declaring the names of the chrom, start, and stop columns. These will not be respected on input yet (he will still be relying on a config file to get the names of these 3 columns), but it's the first step in the direction of storing all necessary schema information in the header of each file, rather than separately from each file.

  3. Lee will file a github issue to eventually use these 3 header lines on input, when they are present, to get the names of the chrom/start/stop columns (possibly still with a fallback to a separate config file if they aren't, but that is a point we can debate in a future PR).

@samuelklee
Copy link
Contributor

samuelklee commented Mar 1, 2018

Sorry @droazen @LeeTL1220, can you give me a bit more context? @LeeTL1220 is no longer using any of the CNV-specific collections classes that I had hoped might be Tribble-ized in the future, so I'm OK with any decisions you guys make that are specific to his classes (does @jonn-smith have an opinion?) I think that moving towards storing the config in the header is a good thing, in general.

If we need to make corresponding changes to the CNV-specific collections classes, then we should talk more. Not all of those collections describe locatables, so I'm not sure how we could fit them in the Tribble framework.

@LeeTL1220
Copy link
Contributor Author

LeeTL1220 commented Mar 1, 2018 via email

@LeeTL1220
Copy link
Contributor Author

@samuelklee Let's discuss tribble-ifying the CNV classes, but for this PR, leaving those alone.

@LeeTL1220
Copy link
Contributor Author

@droazen number 3 in the list above: #4489

@LeeTL1220
Copy link
Contributor Author

@samuelklee There have been a lot of changes since I first opened this PR. I thought you might want to check that I addressed the comments you had that are still applicable. However, since this is now totally decoupled from the CNV Collection code, there should be less to look at.

@jonn-smith Can you look at the XsvLocatableCodec changes and the changes to the FuncotationFactory code? And that I have honored what I said I would do in conversations.

@HD VN:1.5
@SQ SN:1 LN:2000000
@RG ID:GATKCopyNumber SM:sample_gt
CONTIG START END Num_Probes Segment_Mean Segment_Call
Copy link
Contributor

@samuelklee samuelklee Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file still seems unnatural to me (see original comment directly below). How would this be generated in the real world? It has a CNV-collections-style header and a mix of "CONTIG/START/END" and CNV 1.0 column headers, so it seems like it would have to be formatted manually.

Is the tool be able to consume truth files in their original format? If so, then the test files should reflect that.

Also, note the filename extension of many of the test files is .tsv.seg---probably this can be just .seg.

Copy link
Contributor

@samuelklee samuelklee Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for most of the test files below.

What is the spec for these tools? I think they should be able to consume strictly unmodified CNV-collections files (namely, the output of CallCopyRatioSegments) and truth files (CNV 1.0-style seg files, etc.) along with the appropriate config files. Requiring that the user go in and add additional headers or modify column names is something we should avoid.

Perhaps this is what the tools already do, but if so, it's not clear to me from looking at these test files.

@@ -1,4 +1,5 @@
#SAMPLE_NAME=SAMPLE1
# This is another comment
# This is yet another comment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file is also not up to date. The CNV-collections column headers are correct, but the SAM-style header is missing. (I think you probably generated this file when the header consisted only of #SAMPLE_NAME=... and didn't update it when the CNV-collections format was updated?)

Copy link
Contributor Author

@LeeTL1220 LeeTL1220 Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per discussions with @jonn-smith and @droazen , the AnnotatedInterval codec will support files that either contain comments (#) or a SAM File Header ('@'). Not both. Then the locatable column headers is whatever is specified by the user config file or the default config file. The default config file will match what the CNV collection classes use.

This test is of a file that is comments only and uses the default configuration file. Therefore, the test should not have a SAM File Header and the locatable column headers should be CONTIG, START, and END.

No action.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this file appear naturally in the wild? My point is that it wouldn't (it would require that a user take a CNV-collections file, strip the SAM header, and add comments), so it's extremely confusing to base a test on it. Test files should reflect the most typical use cases; if a file is atypical or unusual (e.g., an improperly or unexpectedly formatted file meant to stress test the code), then it should be clearly named as such.

Can you list here exhaustively the typical formats expected for each of the tools? Ideally, I would be able to easily discern this just from glancing at the test files, but my main objection is that I cannot---precisely because most of them are a mishmash of conventions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it might be more convenient to make the default config file specify the locatable column headers by indices 0-2. For seg files that have Sample as the first column, I think it's a bit cleaner to just use cut externally than it is to change column headers with e.g. sed.

/**
* Simple class that just has an interval and sorted name-value pairs.
*/
public final class AnnotatedInterval implements Locatable, Feature {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package name should be changed to reflect the name change to AnnotatedInterval. You should probably also edit the commit message for this branch.

Also note #3884. I don't mind the redundancy for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the package.

Done.

@samuelklee
Copy link
Contributor

@LeeTL1220 Apologies, I don't think I have the bandwidth for a detailed re-review, but I'm OK with the XSV code if @jonn-smith approves.

However, I still don't understand the test files and left some comments there.

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

throw new UserException.MalformedFile("Could not decode from data file: " + dataPath.toUri().toString());
}

supportedFieldNames.addAll(codec.getHeaderWithoutLocationColumns());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well just use header here, instead of grabbing the info from the codec again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// Initialize our field name lists:
initializeFieldNameLists();

// Adjust the manual annotations to make sure we don't try to annotate any fields we aren't
// responsible for:
//TODO: Isn't this the default map not the override map as the name would imply?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - this sets the values in the overrides map initially. These values then get set in the createFuncotations method via a call to setOverrideValuesInFuncotations.

This has been updated in another branch to be much more elegant, but the underlying data structure name is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the TODO then...
Done

@@ -78,29 +112,79 @@
/** The XSV Table Header */
private List<String> header;

/** The XSV Table Header */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment not quite right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -135,35 +219,88 @@ else if ( split.size() > header.size() ) {
split.remove( split.size() - 1 );
}
}

return new XsvTableFeature(contigColumn, startColumn, endColumn, header, split, dataSourceName);
// HACK: For tribble to work properly, we need to know the position (in bytes) of the file pointer. Currently, that cannot be ascertained.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need this hack?

At least part of it can go away - If the header is null here you would have already gotten a NullPointerException in the above if statements anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the null check.

Done

}

/**
* Read until we get to the header of this xsv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the @ineritDoc annotation here to append these notes to the default javadoc for this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


locatableColumns = Lists.newArrayList(finalContigColumn, finalStartColumn, finalEndColumn);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this isn't Arrays.asList(finalContigColumn, finalStartColumn, finalEndColumn) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(At one time I was modifying these and I forgot to adjust this line when that was no longer true)

* Throw an exception if the given column name cannot be used for one of the locatable columns.
* @param columnName candidate column name for one of the locatable fields (contig, start, or end)
*/
public static void validateLocatableColumnName(String columnName) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*/
public SAMFileHeader renderSamFileHeader() {
assertHeaderInitialized();
if (isPreambleSamFileHeader(getPreamble())) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a micro-optimization, but each call to getPreamble is creating a new immutable list. Maybe you can cache it? Save some cycles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that the optimizer would catch this, but it is explicit now.

final SAMFileHeader result = codec.decode(reader, null);

final List<String> finalComments = new ArrayList<>();
finalComments.addAll(result.getComments());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little confusing to me. Does SamTextHeaderCodec create default comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a comment to that effect.

Done

…eading and writing an annotated interval.

Just made the CLI, but not much else.  Still experimental.

Simple CLI to tag germline events when you have tumor-normal matched seg files

Refactoring out a control class to decouple from the CLI.

Fixing output string vs toString error

Exposing the padding parameter.

Creating simple merging utility for messy segment data

Little utility to merge segments

fix for separator

Rebased to master and other changes to fix that rebasing on this branch.

Beginnings of supporting optional sam file header in simple annotated genomic region

First round of refactoring seems to make the CombineSegmentBreakpoints work.

Second round of refactoring now seems to work.

Easing SimpleAnnotatedGenomicRegion parsing restrictions for MergeAnnotatedRegions

Added another example for building a docker image.

Added a few docs

Doc changes

Addressing TODOs

Addressing merge issues.

More docs and testing.

Added testing and cleaned up the code some.

Better testing of the germline tagging.

Minor changes

Moved region merging to a utility class.

More updates for docs and adherence to CNV conventions.

Initializing simple annotated regions from the tribble framework.

Making columns in config file index or column name.

Removed tie-in to the CNV collection framework.  Added more tests and updated tests.

Some more updates to workaround the fact that the codec has no write capability.

Updating writer to include addAll.

Added writer that looks similar to what will be a writing framework in tribble.

Adding documentation.

Making AbstractLocatableCollection constructors package private again.

Added docs for writer interface.

Added one more doc for writer interface.

Added even one more doc for writer interface.

Fixed compilation issue that was missed.

Fixed TagGermlineEventsIntegrationTest and renamed a couple of methods to make clearer.

Fixed issue with bad datasource names, internally.

Still have outstanding PR comments and writing is incomplete.

Addressing more PR comments.

Addressing more PR comments.

Addressing more PR comments with some heavy refactoring.

Streamlining the AnnotatedIntervalWriter to more in spirit with requested interface.  Removed some `create` methods.

More documentation.

Testing the SAMFileHeader generation.

More SAM FileHeader testing and validation for CombineSegmentBreakpoints.

More tests.

Answering PR comments on docs.

Fixed `TagGermlineEventsIntegrationTest`

More PR comments.  Made more defensive copies when creating collections.

More PR comments.  Adding functionality to create an annotated interval header

Addressing more PR comments.

Refactoring the header.  Meant to make it AnnotatedIntervalHeader.  Removed the header fields from the annotated interval collection class.  Added comments for the header fields being used.

Added replacement of structured comments.

Fixing some broken tests.

Making the config file `create` method available again.

Making options for alternate config files.

Fixing silly error.

Updating tests.  Making code a bit simpler.

Addressing some more TODOs.

Fixed issue with tagging germline events.

Making the format more strict for AnnotatedInterval writing.

More TODOs and other changes like getting rid of the comments field in AnnoatatedInterval.

Beginning the notion of a preamble.

Auto-detecting type of xsv and adjusting tests.

Renamed some test files to be in more in line with conventions.

More testing and changes to make the code look more like tribble-standard.

More testing and changes to make the code look more like tribble-standard.

More testing changes.

Addressing TODOs

Addressing TODOs and fixed a few tests.

Testing inconsistent region list on input.

Fixed file path.

Some doc improvements.

Some doc improvements.

Reverting a change.

Added test of resource determination.

Fixed test of resource.  Added a small speed increase at cost to code readability.

Fixed test.  Some fixes of variable rename.

Fixed tests and made writing interface a bit easier.

Fixed tests and made writing interface a bit easier.

Fixed tests and supporting the seg extension.

Added issues for TODOs and references in the comments.

Fixed assertion issue where file could be created for writing, but it would still cause a failure.

Fixed integration test failure.

Fixed warning

Minor changes.  Right before adding the abililty to specify multiple columns.

Multiple column support.

Another round of PR comments.

More testing with real-life files.

More testing with real-life files.
@LeeTL1220 LeeTL1220 changed the title Somatic CNV tagging germline events, bringing SimpleAnnotatedGenomicRegion in line with tribble reading (and gearing up for writing as well), and region merging. Somatic CNV tagging germline events, bringing AnnotatedInterval in line with tribble reading (and gearing up for writing as well), and region merging. Apr 10, 2018
Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any issues in this latest commit.

@LeeTL1220
Copy link
Contributor Author

Victory!!

@LeeTL1220 LeeTL1220 merged commit cad0c36 into master Apr 10, 2018
@LeeTL1220 LeeTL1220 deleted the ll_issue3995_tribble branch April 10, 2018 20:58
cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
…ne with tribble reading (and gearing up for writing as well), and region merging. (broadinstitute#4276)

* Added several CLI and the start of a tribble-supported xsv spec for reading and writing an annotated interval.

Just made the CLI, but not much else.  Still experimental.

Simple CLI to tag germline events when you have tumor-normal matched seg files

Refactoring out a control class to decouple from the CLI.

Fixing output string vs toString error

Exposing the padding parameter.

Creating simple merging utility for messy segment data

Little utility to merge segments

fix for separator

Rebased to master and other changes to fix that rebasing on this branch.

Beginnings of supporting optional sam file header in simple annotated genomic region

First round of refactoring seems to make the CombineSegmentBreakpoints work.

Second round of refactoring now seems to work.

Easing SimpleAnnotatedGenomicRegion parsing restrictions for MergeAnnotatedRegions

Added another example for building a docker image.

Added a few docs

Doc changes

Addressing TODOs

Addressing merge issues.

More docs and testing.

Added testing and cleaned up the code some.

Better testing of the germline tagging.

Minor changes

Moved region merging to a utility class.

More updates for docs and adherence to CNV conventions.

Initializing simple annotated regions from the tribble framework.

Making columns in config file index or column name.

Removed tie-in to the CNV collection framework.  Added more tests and updated tests.

Some more updates to workaround the fact that the codec has no write capability.

Updating writer to include addAll.

Added writer that looks similar to what will be a writing framework in tribble.

Adding documentation.

Making AbstractLocatableCollection constructors package private again.

Added docs for writer interface.

Added one more doc for writer interface.

Added even one more doc for writer interface.

Fixed compilation issue that was missed.

Fixed TagGermlineEventsIntegrationTest and renamed a couple of methods to make clearer.

Fixed issue with bad datasource names, internally.

Still have outstanding PR comments and writing is incomplete.

Addressing more PR comments.

Addressing more PR comments.

Addressing more PR comments with some heavy refactoring.

Streamlining the AnnotatedIntervalWriter to more in spirit with requested interface.  Removed some `create` methods.

More documentation.

Testing the SAMFileHeader generation.

More SAM FileHeader testing and validation for CombineSegmentBreakpoints.

More tests.

Answering PR comments on docs.

Fixed `TagGermlineEventsIntegrationTest`

More PR comments.  Made more defensive copies when creating collections.

More PR comments.  Adding functionality to create an annotated interval header

Addressing more PR comments.

Refactoring the header.  Meant to make it AnnotatedIntervalHeader.  Removed the header fields from the annotated interval collection class.  Added comments for the header fields being used.

Added replacement of structured comments.

Fixing some broken tests.

Making the config file `create` method available again.

Making options for alternate config files.

Fixing silly error.

Updating tests.  Making code a bit simpler.

Addressing some more TODOs.

Fixed issue with tagging germline events.

Making the format more strict for AnnotatedInterval writing.

More TODOs and other changes like getting rid of the comments field in AnnoatatedInterval.

Beginning the notion of a preamble.

Auto-detecting type of xsv and adjusting tests.

Renamed some test files to be in more in line with conventions.

More testing and changes to make the code look more like tribble-standard.

More testing and changes to make the code look more like tribble-standard.

More testing changes.

Addressing TODOs

Addressing TODOs and fixed a few tests.

Testing inconsistent region list on input.

Fixed file path.

Some doc improvements.

Some doc improvements.

Reverting a change.

Added test of resource determination.

Fixed test of resource.  Added a small speed increase at cost to code readability.

Fixed test.  Some fixes of variable rename.

Fixed tests and made writing interface a bit easier.

Fixed tests and made writing interface a bit easier.

Fixed tests and supporting the seg extension.

Added issues for TODOs and references in the comments.

Fixed assertion issue where file could be created for writing, but it would still cause a failure.

Fixed integration test failure.

Fixed warning

Minor changes.  Right before adding the abililty to specify multiple columns.

Multiple column support.

Another round of PR comments.

More testing with real-life files.

More testing with real-life files.

* Fixed typo in data provider.

* Doc changes.

* Renamed annotated region package to annotated interval.

* Fixing default location of config file.

* Improved error message.

* Fixed unit test file paths that were causing failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make SimpleAnnotatedGenomicRegion able to produce and consume the SequenceDictionary
6 participants