Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter #1839

Closed
wants to merge 2 commits into from

Conversation

zjffdu
Copy link
Contributor

@zjffdu zjffdu commented Jan 4, 2017

What is this PR for?

I copy some code from PythonInterpreter to PySparkInterpreter to enable display pandas DataFrame in PySparkInterpreter. Ideally IMO all the features in PythonInterpreter should be available in PySparkInterpeter. PySparkInterpreter should be an extension of PythonInterpreter. After refactoring of PythonInterpreter is done, we can consider about it.

What type of PR is it?

[Improvement]

Todos

  • - Task

What is the Jira issue?

How should this be tested?

Unit test is added and also manually tested.

Screenshots (if appropriate)

image

Questions:

  • Does the licenses files need update? No
  • Is there breaking changes for older versions? No
  • Does this needs documentation? No

@zjffdu
Copy link
Contributor Author

zjffdu commented Jan 4, 2017

@bzz @Leemoonsoo @felixcheung Please help review

@zjffdu zjffdu changed the title ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter [WIP] ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter Jan 5, 2017
@zjffdu
Copy link
Contributor Author

zjffdu commented Jan 5, 2017

CI is failed, working on it.

@zjffdu zjffdu force-pushed the ZEPPELIN-1903 branch 2 times, most recently from 84ec048 to 1ee3c60 Compare January 6, 2017 03:27
@zjffdu zjffdu closed this Jan 6, 2017
@zjffdu zjffdu reopened this Jan 6, 2017
@zjffdu zjffdu changed the title [WIP] ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter Jan 6, 2017
self._displayhook = lambda *args: None

def show(self, obj):
from pyspark.sql import DataFrame
if isinstance(obj, DataFrame):
print(gateway.jvm.org.apache.zeppelin.spark.ZeppelinContext.showDF(self.z, obj._jdf))
elif type(obj).__name__ == "DataFrame": # does not play well with sub-classes
# `isinstance(obj, DataFrame)` would req `import pandas.core.frame.DataFrame`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, why is that the case, if pandas is not imported here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think I figure out what you mean.
How about adding the check in a method, something like

def isPandas(obj):
  try:
    from pandas.core.frame import DataFrame
    return isinstance(obj, DataFrame)
  except ImportError:
    return false

@zjffdu zjffdu changed the title ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter [WIP] ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter Apr 28, 2017
@zjffdu zjffdu force-pushed the ZEPPELIN-1903 branch 2 times, most recently from 6a922e2 to f7977c0 Compare April 28, 2017 05:06
@zjffdu zjffdu changed the title [WIP] ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter ZEPPELIN-1903. ZeppelinContext can not display pandas DataFrame in PySparkInterpreter Apr 28, 2017
@zjffdu
Copy link
Contributor Author

zjffdu commented Apr 28, 2017

Sorry for late update, @felixcheung Please help review.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@@ -44,15 +49,59 @@ def flush(self):
class PyZeppelinContext(dict):
def __init__(self, zc):
self.z = zc
self.max_result = 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this from a interpreter property? I think there is a PR on something like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be done in #2282, I will update this PR after #2282 is merged.

body_buf.write(str(cell))
body_buf.write("\n")
body_buf.seek(0); header_buf.seek(0)
#TODO(bzz): fix it, so it shows red notice, as in Spark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we going to do with this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually all the code of method show_dataframe are copied from python interpreter, I don't have much context about this piece of code. Ideally we should have all the logic in python interpreter, and PySparkInterprete rshould just reuse or extend python interpreter. But this would be a large follow up ticket.

else:
print(str(obj))

def show_dataframe(self, df, show_index=False):
"""Pretty prints DF using Table Display System
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much of this overlap with z.showData? should we abstract out a method in zeppelin context to do pretty print

body_buf.write("\t")
body_buf.write(str(cell))
body_buf.write("\n")
body_buf.seek(0); header_buf.seek(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: style - new line, don't use ;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants