MAHOUT-1490 Data frame R-like bindings #3

Closed
wants to merge 8 commits into from

3 participants

@dlyubimov

At this point, this is exploratory WIP on columnar data frame support: speed and possible approaches.
MAHOUT-1490

@dlyubimov dlyubimov changed the title from Mahout-1490 Data frame R-like bindings to MAHOUT-1490 Data frame R-like bindings
@dlyubimov dlyubimov closed this
@dlyubimov dlyubimov reopened this
@sscdotopen sscdotopen commented on the diff
...hout/math/scalaframes/AbstractColumnarDataFrame.scala
@@ -0,0 +1,9 @@
+package org.apache.mahout.math.scalaframes
+
+/**
+ *
+ * @author dmitriy

no @author tags allowed in ASF code

@sscdotopen sscdotopen commented on the diff
...a/org/apache/mahout/math/scalaframes/UnsafeUtil.scala
@@ -0,0 +1,45 @@
+package org.apache.mahout.math.scalaframes
+
+
+/**
+ *
+ * @author dmitriy

remove @author again

@dlyubimov

OK, this still has too little substance for review, and it did not seem to generate many new ideas either. Plus, I probably will not be spending much time on it any time soon.

Withdrawing the request for the time being.

@dlyubimov dlyubimov closed this
@skanjila

Ok I'll be doing a more substantial commit that might help us see how the APIs around a DataFrame

Commits on May 28, 2014
  1. @dlyubimov

    Tests for RHS and mutate.

    dlyubimov authored
  2. @dlyubimov

    +select

    dlyubimov authored
  3. @dlyubimov
  4. @dlyubimov

    should return DFrameLike

    dlyubimov authored
  5. @dlyubimov

    Playing with messy things

    dlyubimov authored
  6. @dlyubimov

    speed tests using Unsafe..

    dlyubimov authored
  7. @dlyubimov
Commits on May 29, 2014
  1. @dlyubimov

    WIP

    dlyubimov authored
Showing with 412 additions and 0 deletions.
  1. +9 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/AbstractColumnarDataFrame.scala
  2. +5 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/BaseDFrame.scala
  3. +17 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/CellOps.scala
  4. +5 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DataFrameLike.scala
  5. +14 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DataFrameSchema.scala
  6. +10 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DataFrameVectorLike.scala
  7. +20 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DoubleCellOps.scala
  8. +10 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DplyrLikeOps.scala
  9. +20 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/IntCellOps.scala
  10. +16 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/LHS.scala
  11. +21 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/LongCellOps.scala
  12. +32 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/Subscripted.scala
  13. +45 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/UnsafeUtil.scala
  14. +26 −0 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/package.scala
  15. +162 −0 math-scala/src/test/scala/org/apache/mahout/math/scalaframes/ScalaframesSuite.scala
9 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/AbstractColumnarDataFrame.scala
@@ -0,0 +1,9 @@
+package org.apache.mahout.math.scalaframes
+
+/**
+ *
+ * @author dmitriy
+ */
+abstract class AbstractColumnarDataFrame(val schema:DataFrameSchema) extends DataFrameLike {
+
+}
5 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/BaseDFrame.scala
@@ -0,0 +1,5 @@
+package org.apache.mahout.math.scalaframes
+
+class BaseDFrame extends DataFrameLike {
+
+}
17 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/CellOps.scala
@@ -0,0 +1,17 @@
+package org.apache.mahout.math.scalaframes
+
+trait CellOps extends Any {
+
+ def +(that:CellOps):CellOps
+ def -(that:CellOps):CellOps
+ def unary_-():CellOps
+ def *(that:CellOps):CellOps
+ def /(that:CellOps):CellOps
+
+ def toLong:Long
+ def toInt:Int
+ def toDouble:Double
+ def toString:String
+
+
+}
5 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DataFrameLike.scala
@@ -0,0 +1,5 @@
+package org.apache.mahout.math.scalaframes
+
+trait DataFrameLike {
+
+}
14 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DataFrameSchema.scala
@@ -0,0 +1,14 @@
+package org.apache.mahout.math.scalaframes
+
+class DataFrameSchema(val cols: Iterable[(String, DFType.DFType)]) extends Serializable
+
+object DataFrameSchema {
+
+ def fromTSVLine(line: String): DataFrameSchema = new DataFrameSchema(
+ line.split("\t").view.filter(_.length > 0).map(_ -> DFType.string))
+
+ def toTSVLine(dfs: DataFrameSchema): String = {
+ val names = dfs.cols.map(_._1)
+ names.mkString("\t") // join all names; yields "" for an empty schema
+ }
+}
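A minimal standalone sketch of the TSV header round trip that `fromTSVLine` and `toTSVLine` aim for. `ColType` and `Schema` are hypothetical stand-ins for `DFType` and `DataFrameSchema`, and the sketch assumes every header cell maps to a string-typed column and that empty cells are dropped.

```scala
// Hypothetical stand-ins for DFType and DataFrameSchema from the diff.
object ColType extends Enumeration {
  val int, long, double, string, bytes = Value
}

final case class Schema(cols: Seq[(String, ColType.Value)])

object Schema {
  // Every non-empty header cell becomes a string-typed column.
  def fromTSVLine(line: String): Schema =
    Schema(line.split("\t").toSeq.filter(_.nonEmpty).map(_ -> ColType.string))

  // Join the column names back into a single tab-separated line.
  def toTSVLine(schema: Schema): String =
    schema.cols.map(_._1).mkString("\t")
}

object SchemaDemo extends App {
  // The empty cell between "name" and "score" is dropped on parsing.
  val schema = Schema.fromTSVLine("id\tname\t\tscore")
  println(Schema.toTSVLine(schema))
}
```

Because empty cells are filtered out, the round trip is not guaranteed to reproduce the original line byte for byte; it only preserves the non-empty column names in order.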
10 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DataFrameVectorLike.scala
@@ -0,0 +1,10 @@
+package org.apache.mahout.math.scalaframes
+
+import scala.collection.GenSeq
+
+
+/**
+ *
+ * @author dmitriy
+ */
+trait DataFrameVectorLike[T] extends GenSeq[T]
20 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DoubleCellOps.scala
@@ -0,0 +1,20 @@
+package org.apache.mahout.math.scalaframes
+
+class DoubleCellOps(val x:Double) extends AnyVal with CellOps {
+
+ def +(that: CellOps): CellOps = x + that.toDouble
+
+ def -(that: CellOps): CellOps = x - that.toDouble
+
+ def unary_-(): CellOps = -x
+
+ def *(that: CellOps): CellOps = x * that.toDouble
+
+ def /(that: CellOps): CellOps = x / that.toDouble
+
+ def toLong: Long = x.toLong
+
+ def toInt: Int = x.toInt
+
+ def toDouble: Double = x
+}
10 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/DplyrLikeOps.scala
@@ -0,0 +1,10 @@
+package org.apache.mahout.math.scalaframes
+
+class DplyrLikeOps(private[scalaframes] val frame:DataFrameLike) {
+
+ // TODO
+ def select(selections: Subscripted*): DataFrameLike = this
+
+ def mutate(mutations: LHS*): DataFrameLike = this
+
+}
20 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/IntCellOps.scala
@@ -0,0 +1,20 @@
+package org.apache.mahout.math.scalaframes
+
+class IntCellOps(val x:Int) extends AnyVal with CellOps {
+
+ def +(that: CellOps): CellOps = x + that.toInt
+
+ def -(that: CellOps): CellOps = x - that.toInt
+
+ def unary_-(): CellOps = -x
+
+ def *(that: CellOps): CellOps = x * that.toInt
+
+ def /(that: CellOps): CellOps = x / that.toInt
+
+ def toLong: Long = x.toLong
+
+ def toInt: Int = x
+
+ def toDouble: Double = x.toDouble
+}
16 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/LHS.scala
@@ -0,0 +1,16 @@
+package org.apache.mahout.math.scalaframes
+
+/**
+ * "left-hand side" fragment
+ */
+class LHS(val name: String) {
+
+ private[scalaframes] var assignmentFunc: () => Any = _
+
+ def :=(af: => Any): this.type = {
+ assignmentFunc = () => af
+ this
+ }
+
+
+}
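The `:=` operator on `LHS` takes its argument by name, so the right-hand side is captured as a thunk rather than evaluated at the call site. The sketch below illustrates that deferral in isolation. `Lhs` and `evaluate` are hypothetical names introduced here and are not part of the PR.

```scala
// Hypothetical re-creation of the LHS fragment, for illustration only.
class Lhs(val name: String) {
  // Deferred right-hand side; None until `:=` has been called.
  private var rhs: Option[() => Any] = None

  // By-name parameter: `value` is not evaluated here, only wrapped.
  def :=(value: => Any): this.type = {
    rhs = Some(() => value)
    this
  }

  // Evaluate the captured expression on demand (re-evaluated on every call).
  def evaluate(): Option[Any] = rhs.map(f => f())
}

object LhsDemo extends App {
  var evaluations = 0
  val lhs = new Lhs("ACol")
  lhs := { evaluations += 1; 42 }
  println(evaluations)     // 0: nothing has been evaluated yet
  println(lhs.evaluate())  // Some(42)
  println(evaluations)     // 1
}
```

Deferring evaluation this way is what would let a `mutate` implementation run the expression later, for example once per row, instead of once when the frame expression is built.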
21 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/LongCellOps.scala
@@ -0,0 +1,21 @@
+package org.apache.mahout.math.scalaframes
+
+class LongCellOps(val x:Long) extends AnyVal with CellOps {
+
+ def +(that: CellOps): CellOps = x + that.toLong
+
+ def -(that: CellOps): CellOps = x - that.toLong
+
+ def unary_-(): CellOps = -x
+
+ def *(that: CellOps): CellOps = x * that.toLong
+
+ def /(that: CellOps): CellOps = x / that.toLong
+
+ def toLong: Long = x
+
+ def toInt: Int = x.toInt
+
+ def toDouble: Double = x.toDouble
+}
+
32 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/Subscripted.scala
@@ -0,0 +1,32 @@
+package org.apache.mahout.math.scalaframes
+
+/** Indexed something (row, column) */
+class Subscripted {
+
+ var name: Option[String] = None
+ var ordinal:Option[Int] = None
+
+ /** Implies column removal in select(). */
+ var del:Boolean = false
+
+ def this(name:String) {
+ this()
+ this.name = Some(name)
+ }
+
+ def this(ordinal: Int) {
+ this()
+ if (ordinal < 0) {
+ this.ordinal = Some(-ordinal)
+ this.del = true
+ } else {
+ this.ordinal = Some(ordinal)
+ }
+ }
+
+ def unary_-():Subscripted = {
+ del = !del
+ this
+ }
+
+}
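`Subscripted` encodes "drop this column" either through a negative ordinal or through unary minus, and it does so with mutable fields. As a possible alternative, here is a small immutable sketch of the same idea. `Ref`, `ByName`, `ByOrdinal`, `Selection`, `named` and `at` are hypothetical names, not part of the PR.

```scala
// Hypothetical immutable take on the Subscripted idea from the diff.
sealed trait Ref
final case class ByName(name: String) extends Ref
final case class ByOrdinal(ordinal: Int) extends Ref

// A selection is a column reference plus a flag: keep the column or drop it.
final case class Selection(ref: Ref, drop: Boolean = false) {
  // Unary minus toggles the drop flag without mutating the receiver.
  def unary_- : Selection = copy(drop = !drop)
}

object Selection {
  def named(name: String): Selection = Selection(ByName(name))

  // A negative ordinal means "drop column |ordinal|".
  def at(ordinal: Int): Selection =
    if (ordinal < 0) Selection(ByOrdinal(-ordinal), drop = true)
    else Selection(ByOrdinal(ordinal))
}

object SelectionDemo extends App {
  // Mirrors select("ACol", 5, "BCol", -"CCol", -4) from the test suite.
  val picks = Seq(
    Selection.named("ACol"), Selection.at(5), Selection.named("BCol"),
    -Selection.named("CCol"), Selection.at(-4))
  picks.foreach(println)
}
```

One consequence of the sign convention, in this sketch and in `Subscripted` alike, is that ordinal 0 cannot be marked for removal through its sign, since `-0 == 0`. That fits 1-based ordinals as in R, although the PR does not state which base is intended.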
45 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/UnsafeUtil.scala
@@ -0,0 +1,45 @@
+package org.apache.mahout.math.scalaframes
+
+
+/**
+ *
+ * @author dmitriy
+ */
+object UnsafeUtil {
+ lazy val arrayOffset = unsafe.arrayBaseOffset(classOf[Array[Byte]])
+ lazy val unsafe = concurrent.util.Unsafe.instance
+
+ // lazy val unsafe: Unsafe = {
+ //
+ // if (this.getClass.getClassLoader() == null)
+ // Unsafe.getUnsafe();
+ //
+ // try {
+ // val fld: Field = classOf[Unsafe].getDeclaredField("theUnsafe");
+ // fld.setAccessible(true);
+ // fld.get(classOf[Unsafe]).asInstanceOf[Unsafe];
+ // } catch {
+ // case e: Throwable => throw new RuntimeException("no sun.misc.Unsafe", e);
+ // }
+ // }
+
+ def setUnsafeDouble(arr: Array[Byte], x: Double, offset: Long): this.type = {
+ unsafe.putDouble(arr, arrayOffset + offset, x)
+ this
+ }
+
+ def getUnsafeDouble(arr: Array[Byte], offset: Long): Double = {
+ unsafe.getDouble(arr, arrayOffset + offset)
+ }
+
+ def setUnsafeLong(arr: Array[Byte], x: Long, offset: Long): this.type = {
+ unsafe.putLong(arr, arrayOffset + offset, x)
+ this
+ }
+
+ def getUnsafeLong(arr: Array[Byte], offset: Long): Long = {
+ unsafe.getLong(arr, arrayOffset + offset)
+ }
+
+
+}
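For comparison with the `Unsafe`-based accessors, the following sketch does the same reads and writes through `java.nio.ByteBuffer`, which is bounds-checked. `SafeUtil` is a hypothetical name, not part of the PR. The sketch assumes native byte order is wanted, since that is the order in which `Unsafe` stores multi-byte values on the host.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical bounds-checked counterpart to UnsafeUtil, built on ByteBuffer.
object SafeUtil {
  // Wrap the array without copying and use the host CPU's byte order.
  private def view(arr: Array[Byte]): ByteBuffer =
    ByteBuffer.wrap(arr).order(ByteOrder.nativeOrder())

  def setDouble(arr: Array[Byte], x: Double, offset: Int): Unit =
    view(arr).putDouble(offset, x)

  def getDouble(arr: Array[Byte], offset: Int): Double =
    view(arr).getDouble(offset)

  def setLong(arr: Array[Byte], x: Long, offset: Int): Unit =
    view(arr).putLong(offset, x)

  def getLong(arr: Array[Byte], offset: Int): Long =
    view(arr).getLong(offset)
}

object SafeUtilDemo extends App {
  val buff = new Array[Byte](16)
  SafeUtil.setDouble(buff, math.Pi, offset = 8)
  println(SafeUtil.getDouble(buff, offset = 8))
}
```

An out-of-range offset raises `IndexOutOfBoundsException` here, whereas the `Unsafe` calls perform no such check. Whether the extra check costs measurable time is the question the speed tests in this PR set out to explore; this sketch makes no claim about the answer.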
26 math-scala/src/main/scala/org/apache/mahout/math/scalaframes/package.scala
@@ -0,0 +1,26 @@
+package org.apache.mahout.math
+
+package object scalaframes {
+
+ lazy val NA = new DoubleCellOps(Double.NaN)
+
+ implicit def frame2Dplyr(f:DataFrameLike):DplyrLikeOps = new DplyrLikeOps(f)
+ implicit def dplyr2Frame(dplyr:DplyrLikeOps):DataFrameLike = dplyr.frame
+
+ implicit def lhs(name:String):LHS = new LHS(name)
+ implicit def nf(name:String):Subscripted = new Subscripted(name)
+ implicit def nf(ordinal:Int):Subscripted = new Subscripted(ordinal)
+
+ implicit def i2ops(x:Int) = new IntCellOps(x)
+ implicit def d2ops(x:Double) = new DoubleCellOps(x)
+ implicit def l2ops(x:Long) = new LongCellOps(x)
+
+ def col(nf:Subscripted):CellOps = null
+
+ /** Data Frame object type */
+ object DFType extends Enumeration {
+ val int, long, double, string, bytes = Value
+ type DFType = Value
+ }
+
+}
162 math-scala/src/test/scala/org/apache/mahout/math/scalaframes/ScalaframesSuite.scala
@@ -0,0 +1,162 @@
+package org.apache.mahout.math.scalaframes
+
+import org.scalatest.FunSuite
+import org.apache.mahout.test.MahoutSuite
+import java.nio.ByteBuffer
+import scala.util.Random
+import concurrent._
+import scala.concurrent.duration.Duration
+
+class ScalaframesSuite extends FunSuite with MahoutSuite {
+
+ test("mutate") {
+ val testFrame = new BaseDFrame()
+
+ val mutatedFrame = testFrame.mutate(
+ "ACol" := col("5") + col(4),
+ "BCol" := col("AAA") + 3,
+ "CCol" := 1e-10
+ )
+ }
+
+ test("select") {
+ val testFrame = new BaseDFrame()
+
+ // Mixing integral and named subscripts
+ val selectedFrame = testFrame.select(
+ "ACol", 5, "BCol", -"CCol", -4
+ )
+
+ }
+
+
+
+ test("memory access speed") {
+
+ import ExecutionContext.Implicits.global
+
+ val rnd = new Random(1234L)
+
+ val s = 1 << 30
+ val blockSize = 1 << 6
+ val numBlocks = s / blockSize
+ val reads = 10 * (s / blockSize)
+
+ val memchunk = ByteBuffer.allocate(s)
+
+ // fill the chunk with random stuff
+ while (memchunk.remaining() > 0) memchunk.put(rnd.nextInt().toByte)
+
+ val arr = memchunk.array()
+
+ var ms = System.currentTimeMillis()
+
+ val futures = (0 until 4).map((s) => future {
+ var sum = 0.0
+ var i = 0
+ var blockBase = 0
+ while (i < reads) {
+ var j = 0
+ var l = 0L
+ while (j < blockSize) {
+ l = arr(blockBase + j) & 0xff
+ l <<= 8 // shift by one byte before OR-ing in the next
+ l |= arr(blockBase + j + 1) & 0xff
+ l <<= 8
+ l |= arr(blockBase + j + 2) & 0xff
+ l <<= 8
+ l |= arr(blockBase + j + 3) & 0xff
+ l <<= 8
+ l |= arr(blockBase + j + 4) & 0xff
+ l <<= 8
+ l |= arr(blockBase + j + 5) & 0xff
+ l <<= 8
+ l |= arr(blockBase + j + 6) & 0xff
+ l <<= 8
+ l |= arr(blockBase + j + 7) & 0xff
+
+ sum += java.lang.Double.longBitsToDouble(l)
+ j += 8
+ }
+ blockBase = blockSize * math.abs(((l ^ (l >> s + 1)).toInt) % numBlocks)
+ i += 1
+ }
+ })
+ futures.foreach(Await.result(_, atMost = Duration.Inf))
+ ms = System.currentTimeMillis() - ms
+
+ printf("N random %d ms.\n", ms)
+
+ }
+ test("memory unsafe access speed") {
+
+ import ExecutionContext.Implicits.global
+
+ val rnd = new Random(1234L)
+
+ val s = 1 << 30
+ val blockSize = 1 << 6
+ val numBlocks = s / blockSize
+ val reads = 10 * (s / blockSize)
+
+ val memchunk = ByteBuffer.allocate(s)
+
+ // fill the chunk with random stuff
+ while (memchunk.remaining() > 0) memchunk.put(rnd.nextInt().toByte)
+
+ val arr = memchunk.array()
+
+ var ms = System.currentTimeMillis()
+
+ val futures = (0 until 4).map((s) => future {
+ var sum = 0.0
+ var i = 0
+ var blockBase = 0
+ while (i < reads) {
+ var j = 0
+ var l = 0d
+ while (j < blockSize) {
+ l = UnsafeUtil.getUnsafeDouble(arr = arr, offset = blockBase + j)
+ sum += l
+ j += 8
+ }
+ val k = java.lang.Double.doubleToLongBits(l)
+ blockBase = blockSize * math.abs(((k ^ (k >> s + 1)).toInt) % numBlocks)
+ i += 1
+ }
+ })
+ futures.foreach(Await.result(_, atMost = Duration.Inf))
+ ms = System.currentTimeMillis() - ms
+
+ printf("N random %d ms.\n", ms)
+
+ }
+
+ test("Unsafe put, get Double") {
+
+ val rnd = new Random(124)
+ val control = rnd.nextDouble()
+
+ val buff = new Array[Byte](8)
+
+ UnsafeUtil.setUnsafeDouble(arr = buff, x = control, offset = 0l)
+ val x = UnsafeUtil.getUnsafeDouble(arr = buff, offset = 0l)
+
+ control should equal(x)
+ }
+
+ test("Unsafe put, get long") {
+
+ val rnd = new Random(124)
+ val control = rnd.nextLong()
+
+ val buff = new Array[Byte](8)
+
+ UnsafeUtil.setUnsafeLong(arr = buff, x = control, offset = 0l)
+ val x = UnsafeUtil.getUnsafeLong(arr = buff, offset = 0l)
+
+ control should equal(x)
+ }
+
+}
+
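The "memory access speed" test assembles a `Long` from eight consecutive array bytes before reinterpreting it as a `Double`. Below is a minimal standalone sketch of that assembly, assuming big-endian order and a shift of one full byte (8 bits) per step. `BytesToLong` and `readLongBE` are hypothetical names, not part of the PR.

```scala
import java.nio.ByteBuffer

object BytesToLong {
  // Assemble 8 consecutive bytes into a Long, most significant byte first.
  def readLongBE(arr: Array[Byte], offset: Int): Long = {
    var l = 0L
    var k = 0
    while (k < 8) {
      // Shift by a whole byte and mask to avoid sign extension of the Byte.
      l = (l << 8) | (arr(offset + k) & 0xffL)
      k += 1
    }
    l
  }
}

object BytesToLongDemo extends App {
  val arr = Array[Byte](0, 0, 0, 0, 0, 0, 1, 2)
  println(BytesToLong.readLongBE(arr, 0)) // 258, i.e. 0x0102
  // ByteBuffer defaults to big-endian, so it reads the same value.
  println(ByteBuffer.wrap(arr).getLong(0))
}
```

Note that `Unsafe.getLong` reads in native byte order, so on a little-endian host it would return a different value for the same bytes than this big-endian loop. That does not matter for a timing comparison over random data, but it would for any test that compares the two results.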