-
Notifications
You must be signed in to change notification settings - Fork 459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYSTEMML-1116] Make SystemML Python DSL NumPy-friendly #290
Conversation
This is quite interesting! Integrating the DSL more tightly with NumPy in this manner is great. I haven't tried it yet, but +1 to the example I see. As for the depth of evaluation, what it we changed it to something simpler such as |
Thanks @dusenberrymw ... My intent is that if we can even scale subset of external libraries that uses numpy/scipy with minimal change (for eg: scikit-learn), it will be a great selling point for SystemML. Not sure if you noticed, Python SystemML package is much closer to a traditional python package, i.e. no jar needs to be specified in --driver-class-path of pyspark :)
|
Completed tasks in this PR:
Remaining tasks:
@bertholdreinwald @dusenberrymw @deroneriksson @lresende @asurve If it's OK with you all, I can merge this in. |
Also, updated the issue https://issues.apache.org/jira/browse/SYSTEMML-1043 |
- Allow SystemML matrix to be invoked by numpy methods: import systemml as sml import numpy as np m1 = sml.matrix(np.ones((3,3)) + 2) m2 = sml.matrix(np.ones((3,3)) + 3) np.add(m1, m2).toNumPyArray() - Allow users to control the depth of lazy evaluation
eliminating unnecessary java to python conversion if possible
136e436
to
29b3d7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@niketanpansare Okay I tried this out locally, and overall it's great! 👍 I'm really excited for this deeper integration with Python and NumPy specifically. Also +1 for the ability to pip install and then start up Spark normally without the --driver_class_path
& --jars
stuff.
A few thoughts:
- Can you test with Python 3 as well (there are some
print
statements, for example, that have Python 2 syntax)? - Can you implement
np.dot(...)
?sml.matrix.dot
is implemented, but the ufunc is missing fornp.dot(m1,m2)
, wherem1
andm2
are thesystemml.defmatrix
matrices:TypeError: __numpy_ufunc__ not implemented for this type
.
Other things:
- Can we update
toNumPyArray()
totoNumPy()
for simplicity? - Can we update
toDataFrame()
totoDF()
for simplicity and to be the same as Spark?
java_dir = os.path.join(imp.find_module("systemml")[1], "systemml-java") | ||
for file in os.listdir(java_dir): | ||
if fnmatch.fnmatch(file, 'systemml-*-incubating-SNAPSHOT.jar'): | ||
jar_file_name = os.path.join(java_dir, file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we tested if this JAR will be distributed to the worker nodes on a cluster as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Please try following script on your spark cluster:
import systemml as sml
from systemml.random import uniform
m1 = uniform(size=(10000,10000))
m2 = m1.dot(m1).sum().toNumPy()
print(m2)
|
||
# To run: | ||
# - Python 2: `PYSPARK_PYTHON=python2 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py` | ||
# - Python 3: `PYSPARK_PYTHON=python3 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update these two test invocation lines with this filename?
|
||
# To run: | ||
# - Python 2: `PYSPARK_PYTHON=python2 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py` | ||
# - Python 3: `PYSPARK_PYTHON=python3 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update these two test invocation lines with this filename?
|
||
# To run: | ||
# - Python 2: `PYSPARK_PYTHON=python2 spark-submit --master local[*] --driver-class-path SystemML.jar test_mllearn.py` | ||
# - Python 3: `PYSPARK_PYTHON=python3 spark-submit --master local[*] --driver-class-path SystemML.jar test_mllearn.py` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update these two test invocation lines with this filename?
|
||
/** | ||
* To run Python tests, please: | ||
* 1. Set the RUN_PYTHON_TEST flag to true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's aim to run this test by default on Jenkins.
# Don't use this method instead use matrix's printAST() | ||
def printAST(self, numSpaces): | ||
# Don't use this method instead use matrix's print_ast() | ||
def print_ast(self, numSpaces): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this to _print_ast
to indicate that it is a private method not for use by the user.
""" | ||
Creates a single column vector with values starting from <start>, to <stop>, in increments of <step>. | ||
Note: Unlike Numpy's arange which returns a row-vector, this returns a column vector. | ||
Also, Unlike Numpy's arange which doesnot include stop, this method includes stop in the interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to only support the equivalent of np.arange
in this Python DSL, rather than introduce our R-like seq
function. The users of the Python DSL will be Python users, so they will expect every aspect of our Python DSL to be Pythonic.
|
||
def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs): | ||
""" | ||
This function enables systemml matrix to be compatible with NumPy's ufuncs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a bit of documentation to the parameters here?
self.assertTrue(np.allclose(sml.matrix(m1).sum(axis=0), m1.sum(axis=0))) | ||
|
||
def test_sum3(self): | ||
self.assertTrue(np.allclose(sml.matrix(m1).sum(axis=1), m1.sum(axis=1).reshape(dim, 1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a quick note to the reference guide that explains that we always return a 2d matrix (ex. (3,1)), while NumPy can return a 1d vector (i.e. (3,))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, ditto for the other functions below with this behavior.
# ~/spark-1.6.1-scala-2.11/bin/spark-submit --master local[*] --driver-class-path SystemML.jar test.py | ||
class TestMLLearn(unittest.TestCase): | ||
|
||
def testLogisticSK2(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: To be in line with the Pythonic style of the rest of the DSL codebase, can we update these functions from CamelCase to underscore_case?
Thanks @dusenberrymw for your review. I will incorporate your suggestion in the next commit :) |
simplicity and to be the same as Spark
Addressed in the commit afac2b2 As an FYI, I am addressing similar tasks in seperate commit to simplify your review process :) |
Addressed remaining comments in the commit: b4b8394 @dusenberrymw can you please review ? ... Let's try to push this PR in today if possible as #305 is dependent on this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay just a few comments regarding print_ast
and errors in Python 3 due to Python 2 print
statements.
@@ -413,7 +417,7 @@ class matrix(object): | |||
Then the left-indexed matrix is set to be backed by DMLOp consisting of following pydml: | |||
left-indexed-matrix = new-deep-copied-matrix | |||
left-indexed-matrix[index] = value | |||
8. Please use m.print_ast() and/or type `m` for debugging. Here is a sample session: | |||
8. Please use m._print_ast() and/or type `m` for debugging. Here is a sample session: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function name change for this internal function looks good, but I think the documentation is wrong now -- the user should still use the matrix print_ast()
function, right?
# Don't use this method instead use matrix's printAST() | ||
def printAST(self, numSpaces): | ||
# Don't use this method instead use matrix's _print_ast() | ||
def _print_ast(self, numSpaces): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay the name change for this function def _print_ast(self, numSpaces)
is good since it shouldn't be called by the user. However, the other function that is supposed to be used instead should still be called print_ast()
without an underscore. Otherwise, they both seem like internal functions.
return self | ||
|
||
def printAST(self, numSpaces = 0): | ||
def _print_ast(self, numSpaces = 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah let's make this function named print_ast()
since it is intended to be called by the user, and the other variant that is not supposed to be called by the user can still be named _print_ast(numSpaces)
.
if matrix.THROW_ARRAY_CONVERSION_ERROR: | ||
raise Exception('[ERROR]:' + msg) | ||
else: | ||
print '[WARN]:' + msg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this to Python 3 compatible syntax, as this currently throws an error as soon as the systemml
package is imported.
out = matrix(None, op=dmlOp) | ||
dmlOp.dml = [out.ID, ' = ', self.ID ] + getIndexingDML(index) + [ '\n' ] | ||
return out | ||
print '__getitem__' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this to Python 3 compatible syntax, as this currently throws an error as soon as the systemml
package is imported.
Python 3 convention
@dusenberrymw addressed above comments in the commit ecc242b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This is an awesome update!
Follow up items:
Also, we should closely follow the NumPy ufunc stuff. Looks like they are revisiting the idea, but may be renaming the function to array_ufunc [1]. The ideal case would be to not have to rely on a custom version of NumPy. [1]: numpy/numpy#8247 |
Once numpy/numpy#8247 is merged and numpy is released, we can update our setup to have that version as dependency :) |
Thank you @niketanpansare for all your hard work on this PR! |
1. Added python test cases for matrix. 2. Added web documentation for all the Python APIs. 3. Added set_lazy method to enable and disable lazy evaluation. 4. matrix class itself has almost all basic linear algebra operators supported by DML. 4. Updated SystemML.jar to *-incubating.jar 5. Added maven cleanup logic for python artifacts. 6. Integrated python testcases with maven (See org.apache.sysml.test.integration.functions.python.PythonTestRunner). This requires SPARK_HOME to be set. Closes #290.
1. Added python test cases for matrix. 2. Added web documentation for all the Python APIs. 3. Added set_lazy method to enable and disable lazy evaluation. 4. matrix class itself has almost all basic linear algebra operators supported by DML. 4. Updated SystemML.jar to *-incubating.jar 5. Added maven cleanup logic for python artifacts. 6. Integrated python testcases with maven (See org.apache.sysml.test.integration.functions.python.PythonTestRunner). This requires SPARK_HOME to be set. Closes apache#290.
Bumped up the Python package version number @deroneriksson
Allow SystemML matrix to be invoked by numpy methods ( @dusenberrymw ):
Few caveats:
ufunc
, but this should be enabled in next release. Until then to test above code, please use:git clone https://github.com/niketanpansare/numpy.git cd numpy python setup.py install