[SYSTEMML-1116] Make SystemML Python DSL NumPy-friendly #290

Closed
wants to merge 26 commits into from


niketanpansare
Contributor

@niketanpansare niketanpansare commented Nov 18, 2016

import systemml as sml
import numpy as np
m1 = sml.matrix(np.ones((3,3)) + 2)
m2 = sml.matrix(np.ones((3,3)) + 3)
np.add(m1, m2).toNumPyArray()
  • Allow users to control the depth of lazy evaluation (as per @frreiss suggestion).
>>> import systemml as sml
>>> import numpy as np
>>> m1 = sml.matrix(np.ones((3,3)) + 2)

Welcome to Apache SystemML!

>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> np.add(m1, m2) + m1
# This matrix (mVar4) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.
mVar2 = load(" ", format="csv")
mVar1 = load(" ", format="csv")
mVar3 = mVar1 + mVar2
mVar4 = mVar3 + mVar1
save(mVar4, " ")


>>> sml.set_max_depth(1)
>>> m1 = sml.matrix(np.ones((3,3)) + 2)
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> np.add(m1, m2) + m1
# This matrix (mVar8) is backed by NumPy array. To fetch the NumPy array, invoke toNumPyArray() method.
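
The two sessions above can be modelled with a small, library-free sketch of depth-limited lazy evaluation. `LazyValue`, the module-level `set_max_depth`, and the depth bookkeeping are hypothetical names chosen for illustration; this is not SystemML's implementation, which builds a PyDML script rather than calling Python functions.

```python
# Toy model of depth-limited lazy evaluation (all names are hypothetical).
_MAX_DEPTH = None  # None means unbounded: nothing is evaluated early


def set_max_depth(depth):
    """Limit how many pending operations may stack up before evaluation."""
    global _MAX_DEPTH
    _MAX_DEPTH = depth


class LazyValue:
    def __init__(self, data=None, op=None, inputs=()):
        self.data = data      # concrete value, or None while unevaluated
        self.op = op          # function combining the evaluated inputs
        self.inputs = inputs  # upstream LazyValue objects
        # Depth is the length of the longest chain of pending operations.
        self.depth = 0 if op is None else 1 + max(i.depth for i in inputs)
        if _MAX_DEPTH is not None and self.depth >= _MAX_DEPTH:
            self.eval()

    def eval(self):
        """Run pending operations once and cache the concrete result."""
        if self.data is None:
            self.data = self.op(*(i.eval() for i in self.inputs))
            self.op, self.inputs, self.depth = None, (), 0
        return self.data

    def is_evaluated(self):
        return self.data is not None

    def __add__(self, other):
        return LazyValue(op=lambda a, b: a + b, inputs=(self, other))


# Unbounded depth: the sum stays pending until eval() is called.
pending = (LazyValue(3) + LazyValue(4)) + LazyValue(3)
print(pending.is_evaluated(), pending.eval())

# Depth 1: every operation is evaluated as soon as it is created.
set_max_depth(1)
eager = (LazyValue(3) + LazyValue(4)) + LazyValue(3)
print(eager.is_evaluated())
```

With the limit set to 1, each intermediate result is materialised immediately, which mirrors the second session where the result is reported as backed by a NumPy array.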

A few caveats:

  • The current version of NumPy explicitly disables ufunc overriding, but this should be enabled in the next release. Until then, to test the above code, please use:
git clone https://github.com/niketanpansare/numpy.git
cd numpy
python setup.py install
  • The following NumPy ufuncs are not yet implemented in SystemML's Python DSL:
conj, hyperbolic/inverse-hyperbolic functions(i.e. sinh, arcsinh, cosh, ...), bitwise operator, xor operator, isreal, iscomplex, isfinite, isinf, isnan, copysign, nextafter, modf, frexp, trunc

@dusenberrymw
Contributor

This is quite interesting! Integrating the DSL more tightly with NumPy in this manner is great. I haven't tried it yet, but +1 to the example I see.

As for the depth of evaluation, what if we changed it to something simpler such as sml.set_lazy(False) to disable lazy evaluation?

@niketanpansare
Contributor Author

niketanpansare commented Nov 18, 2016

Thanks @dusenberrymw ... My intent is that if we can scale even a subset of the external libraries that use numpy/scipy (e.g., scikit-learn) with minimal changes, it will be a great selling point for SystemML. Not sure if you noticed, but the Python SystemML package is now much closer to a traditional Python package, i.e. no jar needs to be specified in --driver-class-path of pyspark :)

sml.set_lazy(False) is a much cleaner interface. I can update it in my next commit 👍

@niketanpansare
Contributor Author

niketanpansare commented Nov 19, 2016

Completed tasks in this PR:

  1. Added python test cases for matrix (as per @asurve suggestion).
  2. Added web documentation for all the Python APIs: http://niketanpansare.github.io/incubator-systemml/python-reference (as per @asurve suggestion).
  3. Much tighter integration with NumPy (and several additional operators supported) via https://docs.scipy.org/doc/numpy-dev/reference/ufuncs.html
  4. Facility to enable and disable lazy evaluation via set_lazy method (as per @dusenberrymw suggestion).
  5. matrix class itself has almost all basic linear algebra operators supported by DML.
  6. Update SystemML.jar to *-incubating.jar (as per @lresende suggestion).
  7. Added maven cleanup logic (as per @deroneriksson suggestion).
  8. Integrated Python test cases with Maven and Jenkins: @dusenberrymw @akchinSTC @asurve To run the tests on Jenkins:
  • Set the RUN_PYTHON_TEST flag in the org.apache.sysml.test.integration.functions.python.PythonTestRunner class to true. It is currently set to false until the two steps below are in place.
  • Set up Spark and the SPARK_HOME environment variable.
  • Make sure SystemML.jar is created in the target/ folder before the tests are run (this should happen automatically).
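
On the Python side, one way to keep such tests from failing on machines without Spark is to skip them when `SPARK_HOME` is unset. The sketch below uses only the standard `unittest` module. The helper `spark_available` and the test class are hypothetical placeholders, and the PR itself gates the run through the Java `RUN_PYTHON_TEST` flag rather than this mechanism.

```python
import os
import unittest


def spark_available(environ=None):
    """Return True when SPARK_HOME points at an existing directory."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME", "")
    return bool(spark_home) and os.path.isdir(spark_home)


@unittest.skipUnless(spark_available(), "SPARK_HOME is not set")
class TestNeedsSpark(unittest.TestCase):
    def test_placeholder(self):
        # A real test would start a SparkContext and run a script here.
        self.assertTrue(spark_available())
```

Skipped tests are reported as skipped rather than failed, so a CI job without Spark stays green while still showing that coverage is missing.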

Remaining tasks:

  1. Large-scale testing: The current matrix has a huge performance issue (due to py4j conversion), especially in the non-lazy setting or when evaluated too often. This issue was partly addressed in this PR by reducing the explicit conversion during eval.
  2. Pushdown-able loop structures.
  3. Explore integration of non-ufuncs in NumPy (such as concatenate).
  4. Refactor scikit-learn to make it scalable (via point 5 above and also by not redundantly forcing NumPy array conversion).

@bertholdreinwald @dusenberrymw @deroneriksson @lresende @asurve If it's OK with you all, I can merge this in.

@niketanpansare
Contributor Author

Also, updated the issue https://issues.apache.org/jira/browse/SYSTEMML-1043

Contributor

@dusenberrymw dusenberrymw left a comment

@niketanpansare Okay, I tried this out locally, and overall it's great! 👍 I'm really excited for this deeper integration with Python and NumPy specifically. Also +1 for the ability to pip install and then start up Spark normally without the --driver-class-path & --jars stuff.

A few thoughts:

  • Can you test with Python 3 as well (there are some print statements, for example, that have Python 2 syntax)?
  • Can you implement np.dot(...)? sml.matrix.dot is implemented, but the ufunc is missing for np.dot(m1,m2), where m1 and m2 are the systemml.defmatrix matrices: TypeError: __numpy_ufunc__ not implemented for this type.

Other things:

  • Can we update toNumPyArray() to toNumPy() for simplicity?
  • Can we update toDataFrame() to toDF() for simplicity and to be the same as Spark?

java_dir = os.path.join(imp.find_module("systemml")[1], "systemml-java")
for file in os.listdir(java_dir):
    if fnmatch.fnmatch(file, 'systemml-*-incubating-SNAPSHOT.jar'):
        jar_file_name = os.path.join(java_dir, file)
Contributor

Have we tested if this JAR will be distributed to the worker nodes on a cluster as well?

Contributor Author

Yes. Please try the following script on your Spark cluster:

import systemml as sml
from systemml.random import uniform
m1 = uniform(size=(10000,10000))
m2 = m1.dot(m1).sum().toNumPy()
print(m2)
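
The driver-side JAR lookup from the snippet under review can be exercised in isolation. This sketch assumes a hypothetical helper name, `find_systemml_jar`, and takes the directory as a parameter instead of locating the installed `systemml` package; the file pattern is the one from the diff. It only covers finding the JAR on the driver and says nothing about how Spark ships it to worker nodes.

```python
import fnmatch
import os
import tempfile


def find_systemml_jar(java_dir, pattern="systemml-*-incubating-SNAPSHOT.jar"):
    """Return the path of the first file in java_dir matching pattern, or None."""
    for name in sorted(os.listdir(java_dir)):
        if fnmatch.fnmatch(name, pattern):
            return os.path.join(java_dir, name)
    return None


# Usage: build a throwaway directory that mimics the packaged layout.
with tempfile.TemporaryDirectory() as java_dir:
    # The version in this file name is a placeholder, not a real release.
    jar_name = "systemml-0.0.0-incubating-SNAPSHOT.jar"
    open(os.path.join(java_dir, jar_name), "w").close()
    open(os.path.join(java_dir, "README.txt"), "w").close()
    found = find_systemml_jar(java_dir)
    print(os.path.basename(found))
```

Returning `None` when nothing matches makes the missing-JAR case explicit, whereas the loop in the diff would leave `jar_file_name` unassigned.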


# To run:
# - Python 2: `PYSPARK_PYTHON=python2 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py`
# - Python 3: `PYSPARK_PYTHON=python3 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py`
Contributor

Can you update these two test invocation lines with this filename?


# To run:
# - Python 2: `PYSPARK_PYTHON=python2 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py`
# - Python 3: `PYSPARK_PYTHON=python3 spark-submit --master local[*] --driver-class-path SystemML.jar test_mlcontext.py`
Contributor

Can you update these two test invocation lines with this filename?


# To run:
# - Python 2: `PYSPARK_PYTHON=python2 spark-submit --master local[*] --driver-class-path SystemML.jar test_mllearn.py`
# - Python 3: `PYSPARK_PYTHON=python3 spark-submit --master local[*] --driver-class-path SystemML.jar test_mllearn.py`
Contributor

Can you update these two test invocation lines with this filename?


/**
* To run Python tests, please:
* 1. Set the RUN_PYTHON_TEST flag to true.
Contributor

Let's aim to run this test by default on Jenkins.

# Don't use this method instead use matrix's printAST()
def printAST(self, numSpaces):
# Don't use this method instead use matrix's print_ast()
def print_ast(self, numSpaces):
Contributor

Let's rename this to _print_ast to indicate that it is a private method not for use by the user.

"""
Creates a single column vector with values starting from <start>, to <stop>, in increments of <step>.
Note: Unlike NumPy's arange which returns a row-vector, this returns a column vector.
Also, unlike NumPy's arange which does not include stop, this method includes stop in the interval.
Contributor

I would prefer to only support the equivalent of np.arange in this Python DSL, rather than introduce our R-like seq function. The users of the Python DSL will be Python users, so they will expect every aspect of our Python DSL to be Pythonic.
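
The semantic difference at stake can be shown with plain Python, using nested lists to stand in for matrices. `arange_exclusive` and `seq_inclusive` are hypothetical helpers written for this comparison; they model only the two behaviours described in the docstring above (exclusive vs. inclusive stop, flat vs. column layout) and assume integer arguments with a positive step.

```python
def arange_exclusive(start, stop, step=1):
    """NumPy-style: stop is excluded and the result is a flat sequence."""
    return list(range(start, stop, step))


def seq_inclusive(start, stop, step=1):
    """DML/R-style: stop is included and the result is an n x 1 column."""
    return [[v] for v in range(start, stop + 1, step)]


print(arange_exclusive(1, 5))
print(seq_inclusive(1, 5))
```

A Python user who writes `arange(1, 5)` expects four values, so an inclusive variant under the same name would silently produce one extra element.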


def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):
"""
This function enables systemml matrix to be compatible with NumPy's ufuncs.
Contributor

Can you add a bit of documentation to the parameters here?

    self.assertTrue(np.allclose(sml.matrix(m1).sum(axis=0), m1.sum(axis=0)))

def test_sum3(self):
    self.assertTrue(np.allclose(sml.matrix(m1).sum(axis=1), m1.sum(axis=1).reshape(dim, 1)))
Contributor

Can we add a quick note to the reference guide that explains that we always return a 2d matrix (ex. (3,1)), while NumPy can return a 1d vector (i.e. (3,))?

Contributor

Also, ditto for the other functions below with this behavior.
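
The shape difference behind this request can be reproduced with NumPy alone. The sketch below uses only standard NumPy calls and does not touch the SystemML API; the 3x3 input is an arbitrary example.

```python
import numpy as np

m = np.ones((3, 3))

# NumPy drops the reduced axis by default, giving a 1-D vector.
row_sums_1d = m.sum(axis=1)

# keepdims=True retains the reduced axis, matching an always-2-D convention.
row_sums_2d = m.sum(axis=1, keepdims=True)

print(row_sums_1d.shape)
print(row_sums_2d.shape)

# The reshape used in test_sum3 produces the same 2-D array.
assert np.array_equal(row_sums_1d.reshape(3, 1), row_sums_2d)
```

This is why the test above reshapes the NumPy result to `(dim, 1)` before comparing it with the SystemML result.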

# ~/spark-1.6.1-scala-2.11/bin/spark-submit --master local[*] --driver-class-path SystemML.jar test.py
class TestMLLearn(unittest.TestCase):

def testLogisticSK2(self):
Contributor

Nitpick: To be in line with the Pythonic style of the rest of the DSL codebase, can we update these functions from CamelCase to underscore_case?

@niketanpansare
Contributor Author

Thanks @dusenberrymw for your review. I will incorporate your suggestions in the next commit :)

@niketanpansare
Contributor Author

niketanpansare commented Dec 2, 2016

Can you implement np.dot(...)?
Can we update toNumPyArray() to toNumPy() for simplicity?
Can we update toDataFrame() to toDF() for simplicity and to be the same as Spark?

Addressed in the commit afac2b2

As an FYI, I am addressing similar tasks in separate commits to simplify your review process :)

@niketanpansare
Contributor Author

Addressed remaining comments in the commit: b4b8394

@dusenberrymw can you please review? Let's try to push this PR in today if possible, as #305 depends on it.

Contributor

@dusenberrymw dusenberrymw left a comment

Okay just a few comments regarding print_ast and errors in Python 3 due to Python 2 print statements.

@@ -413,7 +417,7 @@ class matrix(object):
Then the left-indexed matrix is set to be backed by DMLOp consisting of following pydml:
left-indexed-matrix = new-deep-copied-matrix
left-indexed-matrix[index] = value
8. Please use m.print_ast() and/or type `m` for debugging. Here is a sample session:
8. Please use m._print_ast() and/or type `m` for debugging. Here is a sample session:
Contributor

The function name change for this internal function looks good, but I think the documentation is wrong now -- the user should still use the matrix print_ast() function, right?

# Don't use this method instead use matrix's printAST()
def printAST(self, numSpaces):
# Don't use this method instead use matrix's _print_ast()
def _print_ast(self, numSpaces):
Contributor

Okay the name change for this function def _print_ast(self, numSpaces) is good since it shouldn't be called by the user. However, the other function that is supposed to be used instead should still be called print_ast() without an underscore. Otherwise, they both seem like internal functions.

return self

def printAST(self, numSpaces = 0):
def _print_ast(self, numSpaces = 0):
Contributor

Yeah let's make this function named print_ast() since it is intended to be called by the user, and the other variant that is not supposed to be called by the user can still be named _print_ast(numSpaces).
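
The naming convention requested in these comments can be sketched as follows. `Node` and `_format_ast` are hypothetical stand-ins: `Node` plays the role of the DSL's matrix and operator classes, and `_format_ast` plays the role of the internal `_print_ast(numSpaces)` helper. It returns text rather than printing so that the example is easy to check.

```python
class Node:
    """Toy expression node standing in for the DSL's internal AST classes."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = children

    def print_ast(self):
        """Public entry point: users call this with no arguments."""
        print(self._format_ast(0))

    def _format_ast(self, num_spaces):
        """Internal helper: the leading underscore marks it as private."""
        lines = [" " * num_spaces + self.label]
        for child in self.children:
            lines.append(child._format_ast(num_spaces + 2))
        return "\n".join(lines)


tree = Node("+", (Node("mVar1"), Node("mVar2")))
tree.print_ast()
```

Only the public method appears in user-facing documentation; the indentation argument stays an implementation detail of the recursive helper.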

if matrix.THROW_ARRAY_CONVERSION_ERROR:
    raise Exception('[ERROR]:' + msg)
else:
    print '[WARN]:' + msg
Contributor

Can you update this to Python 3 compatible syntax, as this currently throws an error as soon as the systemml package is imported.

out = matrix(None, op=dmlOp)
dmlOp.dml = [out.ID, ' = ', self.ID ] + getIndexingDML(index) + [ '\n' ]
return out
print '__getitem__'
Contributor

Can you update this to Python 3 compatible syntax, as this currently throws an error as soon as the systemml package is imported.
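
The import-time failure comes from the Python 2 `print` statement, which is a syntax error on Python 3. Below is a minimal sketch of a version-neutral rewrite of the warning branch. The function name `report_conversion` and its `throw_error` and `stream` parameters are hypothetical; the message prefixes are taken from the diff.

```python
import io
import sys


def report_conversion(msg, throw_error=False, stream=None):
    """Raise on error, otherwise emit a warning line on Python 2 and 3 alike."""
    if throw_error:
        raise Exception('[ERROR]:' + msg)
    out = sys.stdout if stream is None else stream
    # print('[WARN]:' + msg), with one parenthesized argument, also parses on
    # both versions; writing to a stream keeps this sketch easy to test.
    out.write('[WARN]:' + msg + '\n')


buffer = io.StringIO()
report_conversion('array conversion forced', stream=buffer)
print(buffer.getvalue(), end='')
```

Adding `from __future__ import print_function` at the top of each module is another common way to get identical `print(...)` behaviour on both interpreters.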

@niketanpansare
Contributor Author

@dusenberrymw addressed above comments in the commit ecc242b

Contributor

@dusenberrymw dusenberrymw left a comment

LGTM. This is an awesome update!

@dusenberrymw
Contributor

dusenberrymw commented Dec 3, 2016

Follow up items:

  • Automatic Python 2 & 3 testing.
  • np.arange vs. DML seq (different semantics, Python people will prefer the former).

Also, we should closely follow the NumPy ufunc stuff. Looks like they are revisiting the idea, but may be renaming the function to __array_ufunc__ [1]. The ideal case would be to not have to rely on a custom version of NumPy.

[1]: numpy/numpy#8247
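
For reference, the override protocol was released in NumPy 1.13 under the name `__array_ufunc__`, with a signature that differs from the `__numpy_ufunc__` hook used in this PR. The sketch below shows the released protocol on a hypothetical wrapper class, `Tagged`; it is not the SystemML `matrix` class and handles only plain ufunc calls.

```python
import numpy as np


class Tagged:
    """Minimal wrapper that intercepts NumPy ufuncs (needs NumPy >= 1.13)."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls such as np.add(a, b) are handled in this sketch;
        # reductions, accumulations, and out= are left unsupported.
        if method != "__call__" or "out" in kwargs:
            return NotImplemented
        # Unwrap Tagged operands, apply the ufunc to raw arrays, rewrap.
        raw = [x.data if isinstance(x, Tagged) else x for x in inputs]
        return Tagged(ufunc(*raw, **kwargs))


m1 = Tagged(np.ones((3, 3)) + 2)
m2 = Tagged(np.ones((3, 3)) + 3)
result = np.add(m1, m2)
print(type(result).__name__, result.data[0, 0])
```

Because stock NumPy dispatches `np.add` to the wrapper, no patched NumPy build is needed for this style of integration.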

@asfgit asfgit closed this in 23ccab8 Dec 3, 2016
@niketanpansare
Contributor Author

Once numpy/numpy#8247 is merged and numpy is released, we can update our setup to have that version as a dependency :)

@deroneriksson
Member

Thank you @niketanpansare for all your hard work on this PR!

asfgit pushed a commit that referenced this pull request Dec 7, 2016
1. Added python test cases for matrix.
2. Added web documentation for all the Python APIs.
3. Added set_lazy method to enable and disable lazy evaluation.
4. matrix class itself has almost all basic linear algebra operators
supported by DML.
4. Updated SystemML.jar to *-incubating.jar
5. Added maven cleanup logic for python artifacts.
6. Integrated python testcases with maven (See
org.apache.sysml.test.integration.functions.python.PythonTestRunner). This
requires SPARK_HOME to be set.

Closes #290.
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017