
Add doc for exchanging data frames #1677

Closed
wants to merge 1 commit into from

Conversation

@m30m (Contributor) commented Nov 24, 2016

What is this PR for?

ZeppelinContext can be used to exchange DataFrames but there are some nasty tricks and typecasts.
It's good to provide some examples.

What type of PR is it?

Documentation

Questions:

  • Do the license files need an update? no
  • Are there breaking changes for older versions? no
  • Does this need documentation? no

@Leemoonsoo (Member)

@m30m Awesome!

LGTM. I'll merge to master if there are no more comments.

@zjffdu (Contributor) commented Nov 25, 2016

Should we do this implicitly for the user in ZeppelinContext? The syntax is hard to understand unless users know the internal implementation of pyspark, and I think we should not expose such internals to users.

z.put("myPythonDataFrame", postsDf._jdf)

@m30m (Contributor, Author) commented Nov 25, 2016

It's not possible to put the DataFrame directly because of this error:

  File "/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1124, in __call__
    args_command, temp_args = self._build_args(*args)

  File "/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1094, in _build_args
    [get_command_part(arg, self.pool) for arg in new_args])

  File "/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 289, in get_command_part
    command_part = REFERENCE_TYPE + parameter._get_object_id()

  File "/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 841, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))

AttributeError: 'DataFrame' object has no attribute '_get_object_id'
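For context on the failure above: py4j can only pass objects across the bridge if they expose `_get_object_id()`, which pyspark's `DataFrame` wrapper does not; only the Java handle stored in its `_jdf` attribute does. A minimal stand-in (stub classes, no Spark or py4j required; all names here are illustrative, not the real py4j API) sketching why `z.put(key, df)` fails while `z.put(key, df._jdf)` works:

```python
class JavaObject:
    """Stand-in for a py4j JavaObject: carries a gateway object id."""
    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        return self._object_id


class DataFrame:
    """Stand-in for pyspark's DataFrame: a thin wrapper around a Java handle."""
    def __init__(self, jdf):
        self._jdf = jdf  # the underlying Java DataFrame


def get_command_part(arg):
    """Mimics py4j's protocol step: only objects with _get_object_id() can cross."""
    return "r" + arg._get_object_id()


df = DataFrame(JavaObject("o123"))

print(get_command_part(df._jdf))  # works: the Java handle has an object id
try:
    get_command_part(df)          # fails, like the traceback above
except AttributeError as e:
    print(e)  # 'DataFrame' object has no attribute '_get_object_id'
```

This is why the documented examples pass `postsDf._jdf` rather than the DataFrame itself.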

@zjffdu (Contributor) commented Nov 25, 2016

I mean we can do this internally in PyZeppelinContext, as follows:

def __setitem__(self, key, item):
    if isinstance(item, DataFrame):
        self.z.put(key, item._jdf)
    else:
        self.z.put(key, item)

@m30m (Contributor, Author) commented Nov 25, 2016

Yes, that's a good idea. Shall I add a commit to this branch?

@zjffdu (Contributor) commented Nov 25, 2016

Yes, and you also need to update the __getitem__ method so that users don't have to construct the DataFrame themselves as below; z.get("myScalaDataFrame") should return a DataFrame directly.

myScalaDataFrame = DataFrame(z.get("myScalaDataFrame"), sqlContext)
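Taken together, the two suggestions amount to unwrapping on put and rewrapping on get inside PyZeppelinContext. A hedged sketch of that dispatch (the `JavaDataFrame`, `DataFrame`, and `PyZeppelinContext` classes here are stand-ins for the real pyspark/Zeppelin ones, with a plain dict in place of the JVM-side context):

```python
class JavaDataFrame:
    """Stand-in for the JVM-side DataFrame object."""
    pass


class DataFrame:
    """Stand-in for pyspark's DataFrame wrapper."""
    def __init__(self, jdf, sql_ctx=None):
        self._jdf = jdf
        self.sql_ctx = sql_ctx


class PyZeppelinContext:
    """Sketch of the proposed implicit conversion in __setitem__/__getitem__."""
    def __init__(self, sql_ctx=None):
        self._store = {}  # stands in for the JVM-side ZeppelinContext
        self.sql_ctx = sql_ctx

    def __setitem__(self, key, item):
        # Unwrap pyspark DataFrames to their Java handle before storing.
        if isinstance(item, DataFrame):
            self._store[key] = item._jdf
        else:
            self._store[key] = item

    def __getitem__(self, key):
        # Rewrap Java DataFrames so callers get a pyspark DataFrame back.
        value = self._store[key]
        if isinstance(value, JavaDataFrame):
            return DataFrame(value, self.sql_ctx)
        return value


z = PyZeppelinContext()
z["myPythonDataFrame"] = DataFrame(JavaDataFrame())  # stored as the bare Java handle
round_tripped = z["myPythonDataFrame"]               # comes back as a DataFrame
print(type(round_tripped).__name__)  # DataFrame
```

With this in place, users would write `z["myScalaDataFrame"]` and never see `_jdf` or the explicit `DataFrame(...)` constructor call.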

@felixcheung (Member)

Let's keep this as documentation only, and open a JIRA (and another PR) for the DataFrame support?

@zjffdu (Contributor) commented Nov 26, 2016

If we support the feature I mentioned above in another PR, the document here becomes obsolete because we'd have to update it later. So IMHO it would be better to do it in this PR.

@felixcheung (Member)

Well, it's a lot quicker to get a doc-only PR in :)
Besides, we should have a JIRA for changes like this. It's your call, @m30m

@m30m (Contributor, Author) commented Nov 26, 2016

I'm not sure it's a good idea to hide this complexity behind special-case handling, and I'd need to check whether these changes are backward compatible. So I think a doc-only PR, followed by a JIRA issue to handle the special Spark types, is the better solution.

@Leemoonsoo (Member)

Merging to master if there is no further discussion.

@asfgit asfgit closed this in 7d878f7 Dec 1, 2016