
Loadmatrix #2

Closed · wants to merge 7 commits

Conversation

catoverdrive

No description provided.

@cseed (Owner) left a comment:

Sorry for the delay on this.

}

// this assumes that col IDs are in the last line of the header
// FIXME: Is the toString.split call too slow?
cseed (Owner):
Seems fine here. I'm mostly worried about per-cell performance and, to a lesser extent, per-row (or per-column) performance.

}

def apply(hc: HailContext,
file1: String,
cseed (Owner):
You picked up a few things from import_vcf (totally natural) that add complexity that isn't necessary here. I don't think you need file1; that exists to support the header_file option to import_vcf.

@typecheck_method(path=oneof(strlike, listof(strlike)),
                  min_partitions=nullable(integral),
                  drop_samples=bool)
def import_matrix(self, path, min_partitions=None, drop_samples=False):
cseed (Owner):
This is a good start.

I think we should also support at least 5 options for the cell type: Int{32, 64}, Float{32, 64}, and String (so users can use annotate_genotypes to parse the cells themselves). Second, you should support missingness (see HailContext.import_table's missing option).

val lines = sc.textFilesLines(files, nPartitions.getOrElse(sc.defaultMinPartitions))

val fileByPartition = lines.partitions.map(p => partitionPath(p))
val firstPartitions = fileByPartition.zipWithIndex.filter((name) => name._2 == 0 || fileByPartition(name._2-1) != name._1).map((name) => name._2)
cseed (Owner):
Pattern matching is much more readable than the _N business. Something like: fileByPartition.zipWithIndex.filter { case (file, index) => index == 0 || ... }.
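The suggestion above might look roughly like this (a sketch; the sample `fileByPartition` contents are made up for illustration):

```scala
// Hypothetical input: fileByPartition(i) is the file backing partition i.
val fileByPartition = Array("a.txt", "a.txt", "b.txt", "b.txt", "b.txt")

// Keep the index of the first partition of each file. The pattern match
// names the tuple fields instead of the opaque _1/_2 accessors.
val firstPartitions = fileByPartition.zipWithIndex
  .filter { case (file, index) => index == 0 || fileByPartition(index - 1) != file }
  .map { case (_, index) => index }
// firstPartitions: Array(0, 2)
```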

val keyType = matrixType.kType
val rowKeys: RDD[RegionValue] = lines.mapPartitionsWithIndex { (i,it) =>

if (firstPartitions.contains(i)) {
cseed (Owner):
firstPartitions is either an Array or Seq, so contains is O(N). Make it a set.
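A minimal sketch of the change (the sample values are made up; in the PR, firstPartitions comes from the zipWithIndex/filter above):

```scala
// O(N) per query: Array.contains scans linearly.
val firstPartitionsArr: Array[Int] = Array(0, 2, 5)

// O(1) expected per query: hash-set lookup.
val firstPartitionsSet: Set[Int] = firstPartitionsArr.toSet

def isFirstPartition(i: Int): Boolean = firstPartitionsSet.contains(i)
```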

val rvb = new RegionValueBuilder(region)
val rv = RegionValue(region)

new Iterator[RegionValue] {
cseed (Owner):
So all this complicated iterator business in LoadVCF was there to handle some complex filtering conditions that involved partially parsing the input. I don't think you need any of that here; you can just use it.map { ... }.
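The simpler shape suggested here, schematically (with plain strings and ints standing in for the PR's region-value types):

```scala
// When each input line yields exactly one output value, mapping the
// iterator directly replaces the hand-rolled Iterator subclass.
val it: Iterator[String] = Iterator("1\t2", "3\t4")
val parsed: Iterator[Int] = it.map { line =>
  line.split('\t').map(_.toInt).sum
}
// parsed.toList == List(3, 7)
```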

val firstsep = line.indexOf(sep)

region.clear()
rvb.start(matrixType.orderedRDD2Type.rowType.fundamentalType)
cseed (Owner):
You don't need .fundamentalType. If that's still in LoadVCF, it should get cleaned up. (I'll check.)

rvb.addLong(v)
v = 0L
ii += 1
} else if (line(off) <= 57 && line(off) >= 48) {
cseed (Owner):
Characters are way more readable: line(off) <= '9' && ..., etc.
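The two spellings are equivalent ('0' is code point 48 and '9' is 57), so the readable form costs nothing:

```scala
// Character literals say what the comparison means; the raw code points
// 48 and 57 force the reader to consult an ASCII table.
def isDigit(c: Char): Boolean = c >= '0' && c <= '9'
```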

class ImportMatrixSuite extends SparkSuite {

@Test def testHeadersNotIdentical() {
val files =hc.hadoopConf.globAll(List("src/test/resources/sampleheader*.txt"))
cseed (Owner):
nitpick: space after =.

)
}
if (off == line.length || line(off) == sep(0)) {
rvb.addLong(v)
cseed (Owner):
This treats empty elements as 0?

catoverdrive (Author):

Should I be using setMissing() to designate empty elements? What about a line that doesn't have the right number of \t separators? (Currently it throws an error.)

cseed (Owner):

Empty elements should be an error (unless the missing string is itself empty, although it normally defaults to NA). Missing elements should use setMissing, yes. I think throwing an error for the wrong number of columns is good.

catoverdrive (Author):

How do I differentiate between a missing element and an empty element? Like if I'm parsing a line that goes ... 0\t\t1\t0\t ... that should throw an error? But what does a missing element look like?

cseed (Owner):

There should be a missing='NA' option (like HailContext.import_table's, see my other comment) to import_matrix that says what the missing-element text is. (missing=None should mean no missingness.)

So if missing='NA', then yes, ...0\t\t1\t0\t... should throw an error. And ...0\tNA\t1\t0\t... should be OK (with a missing element).
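Those semantics can be sketched with a hypothetical cell handler (not the PR's code; names and the Option[Long] return are illustrative):

```scala
// Hypothetical per-cell logic: empty cells are an error (unless the
// missing string is itself empty), cells equal to the missing string
// are missing (None; a real version would call rvb.setMissing()),
// everything else is parsed.
def parseCell(cell: String, missing: Option[String]): Option[Long] = {
  if (cell.isEmpty && !missing.contains(""))
    throw new IllegalArgumentException("empty cell")
  else if (missing.contains(cell))
    None
  else
    Some(cell.toLong)
}
```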

catoverdrive (Author):

(Never mind, I just saw your comment below. I'll take a look at the import_table missingness handling first.)

@cseed cseed mentioned this pull request Oct 2, 2017
@catoverdrive (Author):
I think I addressed all the comments. I also added an attempt at parsing more datatypes: TInt32 and TInt64 should work without creating new objects on the heap, but since doing anything with strings generally involves creating new strings, I wasn't sure how to avoid that problem for TString. TFloat32 and TFloat64 also work the same way as TString for now, because inline parsing of floating-point numbers is hard due to rounding.
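The allocation-free integer parsing described here can be sketched roughly as follows (a simplified stand-in, not the PR's code: it ignores signs and overflow and doesn't thread the offset back to the caller):

```scala
// Parse a non-negative integer directly out of the line buffer between
// start (inclusive) and end (exclusive), accumulating into a Long so no
// substring is ever materialized on the heap.
def parseLongInPlace(line: String, start: Int, end: Int): Long = {
  var v = 0L
  var off = start
  while (off < end) {
    val c = line(off)
    require(c >= '0' && c <= '9', s"invalid digit: $c")
    v = v * 10 + (c - '0')
    off += 1
  }
  v
}
```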

@cseed cseed closed this Nov 14, 2017
cseed pushed a commit that referenced this pull request May 7, 2018
* # This is a combination of 22 commits.
# This is the 1st commit message:

apply resettable context

forgot to fix one use of AutoCloseable

fix

add setup iterator

more sensible method ordering

make TrivialContext Resettable

a few more missing resettablecontexts

address comments

apply resettable context

forgot to fix one use of AutoCloseable

fix

add setup iterator

more sensible method ordering

remove rogue element type type

make TrivialContext Resettable

wip

wip

wip

wip

use safe row in join suite

pull over hailcontext

remove Region.clear(newEnd)

add selectRegionValue

# This is the commit message #2:

convert relational.scala
;

# This is the commit message #3:

scope the extract aggregators constfb call

# This is the commit message #4:

scope interpret

# This is the commit message #5:

typeAfterSelect used by selectRegionValue

# This is the commit message #6:

load matrix

# This is the commit message #7:

imports

# This is the commit message #8:

loadbgen converted

# This is the commit message hail-is#9:

convert loadplink

# This is the commit message hail-is#10:

convert loadgdb

# This is the commit message hail-is#11:

convert loadvcf

# This is the commit message hail-is#12:

convert blockmatrix

# This is the commit message hail-is#13:

convert filterintervals

# This is the commit message hail-is#14:

convert ibd

# This is the commit message hail-is#15:

convert a few methods

# This is the commit message hail-is#16:

convert split multi

# This is the commit message hail-is#17:

convert VEP

# This is the commit message hail-is#18:

formatting fix

# This is the commit message hail-is#19:

add partitionBy and values

# This is the commit message hail-is#20:

fix bug in localkeysort

# This is the commit message hail-is#21:

fixup HailContext.readRowsPartition use

# This is the commit message hail-is#22:

port balding nichols model


port over table.scala

couple fixes

convert matrix table

remove necessary use of rdd

variety of fixups

wip

add a clear

* Remove direct Region allocation from FilterColsIR

When regions are off-heap, we can allow the globals to live
in a separate, longer-lived Region that is not cleared until
the whole partition is finished. For now, we pay the
memory cost.

* Use RVDContext in MatrixRead zip

This Region will get cleared by consumers.

I introduced the zip primitive which is a safer way to
zip two RVDs because it does not rely on the user correctly
clearing the regions used by the left and right hand sides
of the zip.

* Control the Regions in LoadGDB

I do not fully understand how LoadGDB is working, but a simple
solution to the use-case is to serialize to arrays of bytes
and parallelize those.

I realize there is a proliferation of `coerce` methods. I plan
to trim this down once we do not have RDD and ContextRDD coexisting

* wip

* unify RVD.run

* reset in write

* fixes

* use context region when allocating

* also read RVDs using RVDContext

* formatting

* address comments

* remove unused val

* abstract over boundary

* little fixes

* whoops forgot to clear before persisting

This fixes LDPrune: if you don't clear the region, things go wrong.
Not sure what causes that bug. Maybe it's something about encoders?

* serialize for shuffles, region.scoped in matrixmapglobals, fix joins

* clear more!

* wip

* wip

* rework GeneralRDD to ease ContextRDD transition

* formatting

* final fixes

* formatting

* merge failures

* more bad merge stuff

* formatting

* remove unnecessary stuff

* remove fixme

* boom!

* variety of merge mistakes

* fix destabilize bug

* add missing newline

* remember to clear the producer region in localkeysort

* switch def to val

* cleanup filteralleles and exporbidbimfam

* fix clearing and serialization issue

* fix BitPackedVectorView

Previously it always assumed the variant struct started at offset
zero, which is not true

* address comments, remove a comment

* remove direct use of Region

* oops

* werrrks, mebbe

* needs cleanup

* fix filter intervals

* fixes

* fixes

* fix filterintervals

* remove unnecessary copy in TableJoin

* and finally fix the last test

* re-use existing CodecSpec definition

* remove unnecessary boundaries

* use RVD abstraction when possible

* formatting

* bugfix: RegionValue must know its region

* remove unnecessary val and comment

* remove unused methods

* eliminate unused constructors

* undo debug change

* formatting

* remove unused imports

* fix bug in tablejoin

* fix RichRDDSuite test

If you have no data, then you have no partitions, not 1 partition
cseed pushed a commit that referenced this pull request Feb 3, 2019
* Prototype for batch Python interface

* wip

* Attempt #2

* change how temp dir is formatted

* change directory structure

* Add tests and actually execute pipeline

* added project

* better error msg

* added docker support

* Add infrastructure for testing, conda, etc.

* support for new pipeline env

* comment out docker test
cseed pushed a commit that referenced this pull request Nov 3, 2020
* [batch] add close billing project functionality

* address comments

* rewrite subquery to take advantage of time_completed index

* fix check

* undo accidental change to query

* tweak table format

* add 'deleted' state for billing projects

* oops

* update estimated-current

* fix

* define custom errors

* address comments

* fix check

* address comments

* fix check

* fix check #2

* fix check #3

* fix check-sql
cseed pushed a commit that referenced this pull request Aug 30, 2023
…0814)

* [batch] replace GCS with GoogleStorageAsyncFS in LogStore

* delint

* addr comments

* fix s3 exception handling

* fix catching nonexistent blob in azurefs

* flip version number of instance