
Add toDataFrame to PySpark SQL #4421

Closed · wants to merge 14 commits

Conversation

@dwmclary (Contributor) commented Feb 6, 2015

This seemed like a reasonably useful function to add to Spark SQL. Unlike the JIRA proposal, however, this implementation does not parse type characters (e.g., brackets and braces). Instead, it creates a DataFrame with column names mapped onto the RDD's existing types. In general, this seems far more useful, since users most likely want a quick way to apply names to existing collections.

@AmplabJenkins

Can one of the admins verify this patch?

In python/pyspark/sql.py:

    [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
    """
    fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
    reservedWords = set(map(string.lower, ["ABS", "ALL", "AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
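As an aside, the name-splitting expression in this excerpt can be exercised outside Spark. A minimal standalone sketch (`field_names` is a hypothetical wrapper around the patch's `re.split` call):

```python
import re

# The pattern from the patch: split on spaces, but keep single- or
# double-quoted names together as one token (the capturing group makes
# re.split return the quoted delimiters as list items).
PATTERN = r'''( |".*?"|'.*?')'''

def field_names(name_string):
    # Discard empty fragments and bare separators, as the patch does.
    return [f for f in re.split(PATTERN, name_string) if f.strip()]

print(field_names('a b c'))      # → ['a', 'b', 'c']
print(field_names('a b "c d"'))  # → ['a', 'b', '"c d"']
```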
Josh Rosen (Contributor):

I can't really speak to this patch in general, since I don't know much about this part of Spark SQL, but to avoid duplication it probably makes sense to keep the list of reserved words in the JVM and fetch it into Python from there.

dwmclary (Contributor Author):

Seems like a reasonable request to me. I couldn't decide whether it was better to pickle and ship a list of words or to instantiate it in both places.

The comment was left on this doctest in python/pyspark/sql.py:

    >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
    >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
    >>> df1 = sqlCtx.applyNames("a b c", input)
    >>> df1.registerTempTable("df1")
    >>> sqlCtx.sql("select a from df1").collect()
    [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
    >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k": int(x[0]), "v": 2*int(x[0])}, x])
    >>> df2 = sqlCtx.applyNames("a b c d e", input2)
    >>> df2.registerTempTable("df2")
    >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
    [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
    >>> sqlCtx.sql("select b, e[1] from df2").collect()
    [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]

@dwmclary (Contributor Author) commented Feb 6, 2015

Updated to keep reserved words in the JVM.

@rxin (Contributor) commented Feb 10, 2015

@dwmclary thanks for submitting this. I think this is similar to the toDataFrame method that supports renaming, isn't it?

@rxin (Contributor) commented Feb 10, 2015

In particular, I'm talking about

def toDataFrame(colNames: String*): DataFrame

@dwmclary (Contributor Author):

Reynold,

It is similar, but I think the distinction here is that toDataFrame appears to require that old names (and a schema) already exist; at least, that's what DataFrameImpl.scala suggests (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala, line 93).

I think there's a benefit to a quick way of getting a DataFrame from a "plain" RDD. If we don't want to do @davies' applyNames idea, then maybe we can change the behavior of toDataFrame.

Cheers,
Dan


@rxin (Contributor) commented Feb 10, 2015

Types need to exist, but names don't. They can just be random column names like _1, _2, _3.

In Scala, if you import sqlContext.implicits._, then any RDD of Product (which includes RDD of case classes and RDD of tuples) can be implicitly turned into a DataFrame.

In Python, I think we can add an explicit method that turns an RDD of tuples into a DataFrame, if one doesn't exist yet.
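The heart of that explicit method can be sketched without a SparkContext: pair positional tuples with supplied column names. Here `with_names` is a hypothetical helper and a plain list stands in for the RDD; in PySpark the same mapping would run inside `rdd.map(...)`:

```python
from collections import namedtuple

def with_names(rows, col_names):
    """Attach column names to positional tuples, yielding Row-like records."""
    Row = namedtuple("Row", col_names)
    return [Row(*r) for r in rows]

named = with_names([(1, "A1", True), (2, "B2", False)], ["a", "b", "c"])
print(named[0].a, named[0].b)  # → 1 A1
```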

@dwmclary (Contributor Author):

Ah, yes, I see that now.

Python doesn't seem to have a toDataFrame, so maybe the logical thing to do here is to open a new PR with a Python implementation of toDataFrame -- it'd take a little bit from my current PR and then call into the Scala method.

What do you think?

On Mon, Feb 9, 2015 at 5:12 PM, Reynold Xin notifications@github.com
wrote:

Types need to exist, but names don't. They can just be random column names
like _1, _2, _3.

In Scala, if you import sqlContext.implicits._, then any RDDProduct
http://which%20includes%20RDD%20of%20case%20classes%20and%20RDD%20of%20tuples
can be implicitly turned into a DataFrame.

In Python, I think we can add an explicit method that turns a RDD of tuple
into a DataFrame, if that doesn't exist yet.


Reply to this email directly or view it on GitHub
#4421 (comment).

@dwmclary (Contributor Author):

Or, I guess I can just do it in this PR, if you don't mind it changing a bunch.


@rxin (Contributor) commented Feb 10, 2015

Adding toDataFrame to Python DataFrame is a great idea. You can do it in this PR if you want (make sure you update the title).

Also, you might want to do it on top of #4479; otherwise it will conflict.

@rxin (Contributor) commented Feb 10, 2015

I just talked to @davies offline. He is going to submit a PR that adds createDataFrame with named columns. I think we can roll this work into that one and close this PR. It would be great, @dwmclary, if you could take a look once it's submitted.

@dwmclary (Contributor Author):

Sounds like a plan -- I'll do it on top of #4479.

Thought: I've added a getReservedWords private method to SQLContext.scala. I feel like leaving it there isn't a bad idea: other methods may need to check reserved words in the future.


@dwmclary dwmclary changed the title Spark-2789: Apply names to RDD to create DataFrame Add toDataFrame to PySpark SQL Feb 10, 2015
@dwmclary (Contributor Author):

OK, I've updated this to use as a reference. One thing we may want to take from this PR is that toDataFrame and createDataFrame absolutely need to check reserved words in column names. I've added that behavior in Scala and in the DataFrame suite.

Perhaps I should just open a new PR with the reserved-word checking?

I'll take a look at @davies' PR when it shows up.

@marmbrus (Contributor):

Why do you need to check reserved words? In SQL you can use backticks to reference columns that are named after reserved words.
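Michael's point can be demonstrated in any dialect that supports quoted identifiers. SQLite, for instance, accepts MySQL-style backticks, so reserved words become legal column names once quoted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Backticks make the reserved words SELECT and FROM legal column names.
conn.execute('CREATE TABLE t (`select` TEXT, `from` TEXT)')
conn.execute('INSERT INTO t VALUES (?, ?)', ("s", "f"))
row = conn.execute('SELECT `select`, `from` FROM t').fetchone()
print(row)  # → ('s', 'f')
```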

@davies (Contributor) commented Feb 10, 2015

@dwmclary It's almost ready: #4498

@dwmclary (Contributor Author):

So, we'll allow a column named SELECT regardless of whether it's been called out as SELECT? It just seems to me that this invites a lot of potentially erroneous user behavior at DDL time.


@rxin (Contributor) commented Feb 10, 2015

Believe it or not, that is valid SQL... (with the proper backticks)

@dwmclary (Contributor Author):

I've been thinking of it as equivalent to a CREATE TABLE, in which case I think it's dialect-specific. Perhaps ANSI and pgSQL allow it, but Oracle, for example, disallows it:

    SQL> create table dumb_name (select varchar2(10), from varchar2(10));
    create table dumb_name (select varchar2(10), from varchar2(10))
                            *
    ERROR at line 1:
    ORA-00904: : invalid identifier

    SQL> create table dumb_name ("select" varchar2(10), "from" varchar2(10));

    Table created.

Either way, I'm fine with just closing out this PR. We should close SPARK-2789 too.

Cheers,
Dan


@dwmclary dwmclary closed this Feb 10, 2015