
Add toDataFrame to PySpark SQL #4421

Closed · wants to merge 14 commits

Conversation

@dwmclary (Contributor) commented Feb 6, 2015

This seemed like a reasonably useful function to add to Spark SQL. Unlike the JIRA proposal, however, this implementation does not parse type characters (e.g., brackets and braces). Instead, it creates a DataFrame with column names mapped onto the RDD's existing types. In general, this seems far more useful, since users most likely want a quick way to apply names to existing collections.

@AmplabJenkins

Can one of the admins verify this patch?

In python/pyspark/sql.py:

    [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
    """
    fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
    reservedWords = set(map(string.lower, ["ABS", "ALL", "AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
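As an aside, the name-splitting expression in this excerpt can be exercised outside Spark. A minimal standalone sketch (`field_names` is a hypothetical wrapper around the patch's `re.split` call):

```python
import re

# The pattern from the patch: split on spaces, but keep single- or
# double-quoted names together as one token (the capturing group makes
# re.split return the quoted delimiters as list items).
PATTERN = r'''( |".*?"|'.*?')'''

def field_names(name_string):
    # Discard empty fragments and bare separators, as the patch does.
    return [f for f in re.split(PATTERN, name_string) if f.strip()]

print(field_names('a b c'))      # → ['a', 'b', 'c']
print(field_names('a b "c d"'))  # → ['a', 'b', '"c d"']
```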
Josh Rosen (Contributor):

I can't really speak to this patch in general, since I don't know much about this part of Spark SQL, but to avoid duplication it probably makes sense to keep the list of reserved words in the JVM and fetch it into Python from there.

dwmclary (Contributor Author):

Seems like a reasonable request to me. I couldn't decide whether it was better to pickle and ship a list of words or to instantiate it in both places.

The comment was left on this doctest in python/pyspark/sql.py:

    >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
    >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
    >>> df1 = sqlCtx.applyNames("a b c", input)
    >>> df1.registerTempTable("df1")
    >>> sqlCtx.sql("select a from df1").collect()
    [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
    >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k": int(x[0]), "v": 2*int(x[0])}, x])
    >>> df2 = sqlCtx.applyNames("a b c d e", input2)
    >>> df2.registerTempTable("df2")
    >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
    [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
    >>> sqlCtx.sql("select b, e[1] from df2").collect()
    [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]

@dwmclary (Contributor Author) commented Feb 6, 2015

Updated to keep reserved words in the JVM.

@rxin (Contributor) commented Feb 10, 2015

@dwmclary thanks for submitting this. I think this is similar to the toDataFrame method that supports renaming, isn't it?

@rxin (Contributor) commented Feb 10, 2015

In particular, I'm talking about

def toDataFrame(colNames: String*): DataFrame

@dwmclary (Contributor Author):

Reynold,

It is similar, but I think the distinction here is that toDataFrame appears to require that old names (and a schema) already exist; at least, that's what DataFrameImpl.scala suggests (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala, line 93).

I think there's a benefit to a quick way of getting a DataFrame from a "plain" RDD. If we don't want to do @davies' applyNames idea, then maybe we can change the behavior of toDataFrame.

Cheers,
Dan


@rxin (Contributor) commented Feb 10, 2015

Types need to exist, but names don't. They can just be random column names like _1, _2, _3.

In Scala, if you import sqlContext.implicits._, then any RDD of Product (which includes RDD of case classes and RDD of tuples) can be implicitly turned into a DataFrame.

In Python, I think we can add an explicit method that turns an RDD of tuples into a DataFrame, if one doesn't exist yet.
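The heart of that explicit method can be sketched without a SparkContext: pair positional tuples with supplied column names. Here `with_names` is a hypothetical helper and a plain list stands in for the RDD; in PySpark the same mapping would run inside `rdd.map(...)`:

```python
from collections import namedtuple

def with_names(rows, col_names):
    """Attach column names to positional tuples, yielding Row-like records."""
    Row = namedtuple("Row", col_names)
    return [Row(*r) for r in rows]

named = with_names([(1, "A1", True), (2, "B2", False)], ["a", "b", "c"])
print(named[0].a, named[0].b)  # → 1 A1
```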

@dwmclary (Contributor Author):

Ah, yes, I see that now.

Python doesn't seem to have a toDataFrame, so maybe the logical thing to do here is to open a new PR with a Python implementation of toDataFrame -- it'd take a little bit from my current PR and then call into the Scala method.

What do you think?

On Mon, Feb 9, 2015 at 5:12 PM, Reynold Xin notifications@github.com
wrote:

Types need to exist, but names don't. They can just be random column names
like _1, _2, _3.

In Scala, if you import sqlContext.implicits._, then any RDDProduct
http://which%20includes%20RDD%20of%20case%20classes%20and%20RDD%20of%20tuples
can be implicitly turned into a DataFrame.

In Python, I think we can add an explicit method that turns a RDD of tuple
into a DataFrame, if that doesn't exist yet.


Reply to this email directly or view it on GitHub
#4421 (comment).

@dwmclary (Contributor Author):

Or, I guess I can just do it in this PR, if you don't mind it changing a bunch.


@rxin (Contributor) commented Feb 10, 2015

Adding toDataFrame to Python DataFrame is a great idea. You can do it in this PR if you want (make sure you update the title).

Also, you might want to do it on top of #4479; otherwise it will conflict.

@rxin (Contributor) commented Feb 10, 2015

I just talked to @davies offline. He is going to submit a PR that adds createDataFrame with named columns. I think we can roll this work into that one and close this PR. It would be great, @dwmclary, if you could take a look once it's submitted.

@dwmclary (Contributor Author):

Sounds like a plan -- I'll do it on top of #4479.

Thought: I've added a getReservedWords private method to SQLContext.scala. I feel like leaving it there isn't a bad idea: other methods may need to check reserved words in the future.


@dwmclary dwmclary changed the title Spark-2789: Apply names to RDD to create DataFrame Add toDataFrame to PySpark SQL Feb 10, 2015
@dwmclary (Contributor Author):

OK, I've updated this to use as a reference. One thing we may want to take from this PR is that toDataFrame and createDataFrame absolutely need to check reserved words in column names. I've added that behavior in Scala and in the DataFrame suite.

Perhaps I should just open a new PR with the reserved-word checking?

I'll take a look at @davies' PR when it shows up.

@marmbrus (Contributor):

Why do you need to check reserved words? In SQL you can use backticks to reference columns that are named after reserved words.
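Michael's point can be demonstrated in any dialect that supports quoted identifiers. SQLite, for instance, accepts MySQL-style backticks, so reserved words become legal column names once quoted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Backticks make the reserved words SELECT and FROM legal column names.
conn.execute('CREATE TABLE t (`select` TEXT, `from` TEXT)')
conn.execute('INSERT INTO t VALUES (?, ?)', ("s", "f"))
row = conn.execute('SELECT `select`, `from` FROM t').fetchone()
print(row)  # → ('s', 'f')
```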

@davies (Contributor) commented Feb 10, 2015

@dwmclary It's almost ready: #4498

@dwmclary (Contributor Author):

So, we'll allow a column named SELECT regardless of whether it's been called out as SELECT? It just seems to me that this invites a lot of potentially erroneous user behavior at DDL time.


@rxin (Contributor) commented Feb 10, 2015

Believe it or not, that is valid SQL... (with the proper backticks)

@dwmclary (Contributor Author):

I've been thinking of it as equivalent to a CREATE TABLE, in which case I think it's dialect-specific. Perhaps ANSI and pgSQL allow it, but Oracle, for example, disallows it:

    SQL> create table dumb_name (select varchar2(10), from varchar2(10));
    create table dumb_name (select varchar2(10), from varchar2(10))
                            *
    ERROR at line 1:
    ORA-00904: : invalid identifier

    SQL> create table dumb_name ("select" varchar2(10), "from" varchar2(10));

    Table created.

Either way, I'm fine with just closing out this PR. We should close SPARK-2789 too.

Cheers,
Dan


@dwmclary dwmclary closed this Feb 10, 2015