
Conversation

Contributor

@yupbank yupbank commented Mar 20, 2018

Add a function that appends a shape to a column, which saves memory and time when analyzing the DataFrame.

@yupbank yupbank force-pushed the mannualy-meta-data branch 2 times, most recently from 776096a to e2ee2c1 on March 21, 2018 at 18:44
Contributor

@thunterdb thunterdb left a comment

@yupbank thanks a lot for doing this, this is a much-requested feature. It looks like some of the tests are erroring out (this is a known travis bug), but others are failing on the test that you added.

"""Append extra metadata for a dataframe that
describes the numerical shape of the content.
This method is useful when a dataframe contains non-scalar tensors, for which the shape must be checked beforehand.
Contributor

Can you also add a comment indicating that the user is responsible for providing the right shape, and that a mismatch will eventually trigger an exception in Spark?

:param dframe: a Spark DataFrame
:param col: a Column expression
:param size: a shape corresponding to the tensor
Contributor

You can link to the documentation in the Python docstring: https://www.tensorflow.org/programmers_guide/tensors#shape
This is important for people to understand the order of the elements.
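A possible shape for the final docstring, folding in both review comments; the wording below is only a sketch, not the merged text:

def append_shape(dframe, col, size):
    """Append extra metadata to a DataFrame column that describes the
    numerical shape of its content.

    This is useful when the DataFrame contains non-scalar tensors, for which
    the shape must be known beforehand. The caller is responsible for
    providing the correct shape: a mismatch will eventually trigger an
    exception in Spark.

    Dimensions are ordered as in TensorFlow, see
    https://www.tensorflow.org/programmers_guide/tensors#shape

    :param dframe: a Spark DataFrame
    :param col: a Column expression
    :param size: the shape of the tensor, e.g. [-1, 1], with -1 for an
        unknown leading dimension
    """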

res = tfs.reduce_rows(x, ddf)
assert res == sum([r.x for r in data])

# This test fails
Contributor

indent?

data = [Row(x=float(x)) for x in range(5)]
df = self.sql.createDataFrame(data)
ddf = tfs.append_shape(df, col('x'), [-1, 1])
import ipdb; ipdb.set_trace()
Contributor

I believe you don't need that anymore

def test_append_shape(self):
data = [Row(x=float(x)) for x in range(5)]
df = self.sql.createDataFrame(data)
ddf = tfs.append_shape(df, col('x'), [-1, 1])
Contributor

Do you know what happens if you replace -1 by None?

Contributor Author

py4j would be unhappy with that, so I can convert None to -1 on the Python side.
I miss the type annotations Python 3 has, so that I wouldn't need to worry about this.
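A minimal sketch of that conversion (the helper name is hypothetical, not part of this PR):

def _normalize_shape(shape):
    """Map None dimensions to -1 so py4j only has to ship a list of ints."""
    return [-1 if dim is None else int(dim) for dim in shape]

# _normalize_shape([None, 1]) == [-1, 1]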


val meta = new MetadataBuilder
val colDtypes = df.select(col).schema.fields.head.dataType
val basicDatatype = {
Contributor

You don't need the extra {}?
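A sketch of what dropping the braces could look like; the pattern match on the data type is an assumption about the omitted block body:

import org.apache.spark.sql.types.{ArrayType, DataType}

// No wrapping block needed: bind the value to the match expression directly.
val basicDatatype: DataType = colDtypes match {
  case ArrayType(elementType, _) => elementType
  case dt => dt
}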

Contributor Author

yupbank commented Mar 21, 2018

Thanks for the review... I should add a "work in progress" marker to the title.

It's just that I'm trying hard to get py4j to work and keep hitting this error:

  File "/home/travis/.cache/spark-versions/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/travis/.cache/spark-versions/spark-2.1.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/travis/.cache/spark-versions/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o46.appendShape.
: java.lang.StackOverflowError
	at org.tensorframes.ExperimentalOperations$class.appendShape(ExperimentalOperations.scala:66)
	at org.tensorframes.impl.DebugRowOps.appendShape(DebugRowOps.scala:281)
	at org.tensorframes.ExperimentalOperations$class.appendShape(ExperimentalOperations.scala:66)
	at org.tensorframes.impl.DebugRowOps.appendShape(DebugRowOps.scala:281)
	at org.tensorframes.ExperimentalOperations$class.appendShape(ExperimentalOperations.scala:66)

The Scala function signature is appendShape(df: DataFrame, col: Column, shape: util.ArrayList[Int]): DataFrame,
while the Python side calls it as DataFrame(_java_api().appendShape(dframe._jdf, pyspark.sql.functions.col('x')._jc, [-1, 1]), _sql).

Can you help me correct this?

@yupbank yupbank changed the title Manually append meta data [WIP]Manually append meta data Mar 21, 2018
}

def appendShape(df: DataFrame, col:Column, shape: util.ArrayList[Int]): DataFrame =
appendShape(df, col, shape)
Contributor

@yupbank you want to convert shape to an array, otherwise you get into a loop
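A minimal sketch of that fix inside the ArrayList overload, assuming the pre-existing appendShape accepts a Scala Seq[Int] (swap in .toArray if it expects an Array):

import java.util
import scala.collection.JavaConverters._
import org.apache.spark.sql.{Column, DataFrame}

// Converting the java.util.ArrayList coming from py4j into a Scala collection
// makes the call resolve to the other overload instead of recursing into this one.
def appendShape(df: DataFrame, col: Column, shape: util.ArrayList[Int]): DataFrame =
  appendShape(df, col, shape.asScala.toSeq)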

Contributor Author

Ah... that's why the stack overflowed! Thanks a lot, you saved my day!

@yupbank yupbank changed the title [WIP]Manually append meta data Manually append meta data Mar 21, 2018
Contributor Author

yupbank commented Mar 21, 2018

@thunterdb can I have another round of 👁?

@thunterdb
Contributor

@yupbank some travis-related changes got committed. Can you please rebase your pull request so that travis can take them into account?

@yupbank yupbank force-pushed the mannualy-meta-data branch from 3d43f4d to 068fa74 on April 23, 2018 at 20:51
@thunterdb
Contributor

Merging, thank you very much!

@thunterdb thunterdb merged commit 4531eb6 into databricks:master Apr 24, 2018