Add toDataFrame to PySpark SQL #4421
Conversation
Can one of the admins verify this patch?
[Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
"""
fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
reservedWords = set(map(string.lower,["ABS","ALL","AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
I can't really speak to this patch in general, since I don't know much about this part of Spark SQL, but to avoid duplication it probably makes sense to keep the list of reserved words in the JVM and fetch it into Python from there.
Seems like a reasonable request to me. I couldn't decide if it was better to have to pickle and ship a list of words or just to have it instantiated in both places.
On Fri, Feb 6, 2015 at 7:31 AM, Josh Rosen wrote, in python/pyspark/sql.py (#4421 (comment)):
>>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
>>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
>>> df1 = sqlCtx.applyNames("a b c", input)
>>> df1.registerTempTable("df1")
>>> sqlCtx.sql("select a from df1").collect()
[Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
>>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k":int(x[0]), "v":2*int(x[0])}, x])
>>> df2 = sqlCtx.applyNames("a b c d e", input2)
>>> df2.registerTempTable("df2")
>>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
[Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
>>> sqlCtx.sql("select b, e[1] from df2").collect()
[Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
"""
fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
reservedWords = set(map(string.lower,["ABS","ALL","AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
Updated to keep reserved words in the JVM.
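Purely as an illustration of what fetching the reserved-word list from the JVM might look like on the Python side, here is a minimal sketch. It assumes the JVM SQLContext exposes a getReservedWords method callable over Py4J; the method name and return type are assumptions based on the method mentioned later in this thread, not the merged API.

# Sketch only: pull the reserved-word list from the JVM instead of
# duplicating it in Python. Assumes the JVM SQLContext exposes a
# getReservedWords() method returning a Java collection of strings
# (name and return type are assumptions).
def _get_reserved_words(sqlCtx):
    jwords = sqlCtx._ssql_ctx.getReservedWords()  # Py4J call into the JVM
    return set(w.lower() for w in jwords)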
@dwmclary thanks for submitting this. I think this is similar to the toDataFrame method that supports renaming, isn't it?
In particular, I'm talking about
Reynold, it is similar, but I think the distinction here is that toDataFrame expects the names and types to already exist. I think there's a benefit for a quick way to get a DataFrame from an RDD. Cheers, Dan
Types need to exist, but names don't. They can just be random column names like _1, _2, _3. In Scala, if you import sqlContext.implicits._, then any RDD of Product (which includes RDD of case classes and RDD of tuples) can be implicitly turned into a DataFrame. In Python, I think we can add an explicit method that turns an RDD of tuples into a DataFrame, if that doesn't exist yet.
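To make that concrete, here is a rough Python sketch of an explicit helper that turns an RDD of tuples into a DataFrame with generated column names _1, _2, and so on. The helper name rddToDataFrame is hypothetical, and the use of Row and inferSchema is only one possible way to wire it up.

# Illustrative sketch: mirror the Scala RDD[Product] implicit by generating
# default column names _1, _2, ... for an RDD of tuples. The helper name is
# hypothetical; it reuses the existing Row and inferSchema machinery.
from pyspark.sql import Row

def rddToDataFrame(sqlCtx, rdd):
    width = len(rdd.first())
    names = ["_%d" % (i + 1) for i in range(width)]
    RowClass = Row(*names)                      # Row used as a record factory
    return sqlCtx.inferSchema(rdd.map(lambda t: RowClass(*t)))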
Ah, yes, I see that now. Python doesn't seem to have a toDataFrame, so maybe the logical thing to do is to add one there. What do you think?
Or, I guess I can just do it in this PR if you don't mind it changing a bit.
Adding toDataFrame to Python DataFrame is a great idea. You can do it in this PR if you want (make sure you update the title). Also - you might want to do it on top of #4479, otherwise it will conflict.
Sounds like a plan -- I'll do it on top of #4479. Thought: I've added a getReservedWords private method to SQLContext.scala.
OK, I've updated this to use as a reference. One thing we may want to take from this PR is that toDataFrame and createDataFrame absolutely need to check reserved words in column names. I've added the behavior in Scala and in the DataFrameSuite. Perhaps I should just open a new PR with the reserved-words checking? I'll take a look at @davies' PR when it shows up.
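For illustration only, this is roughly the kind of reserved-word check on column names being described. The check added in this PR lives on the Scala side; the truncated word list, the function name, and the choice of ValueError here are assumptions.

# Sketch of a reserved-word check for column names. The word list is
# truncated and the error type is an assumption; the real check in this PR
# was added in Scala.
RESERVED_WORDS = set(["select", "from", "where", "group", "by", "as", "and", "or"])

def checkColumnNames(names):
    clashes = [n for n in names if n.lower() in RESERVED_WORDS]
    if clashes:
        raise ValueError("Column names collide with reserved words: %s" % ", ".join(clashes))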
Why do you need to check reserved words? In SQL you can use backticks to access columns that are named after reserved words.
So, we'll allow a column named SELECT regardless of whether it's been escaped?
Believe it or not, that is valid SQL ... (with proper ``)
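For example (a hypothetical snippet, not taken from the PR), a column literally named select can still be queried by escaping it with backticks:

# Hypothetical: "t" is assumed to be a registered temp table whose column is
# literally named "select"; the backticks escape the reserved word.
sqlCtx.sql("SELECT `select` FROM t").collect()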
I've been thinking of it as equivalent to a CREATE TABLE, in which case I wouldn't expect the unescaped reserved words to work:

SQL> create table dumb_name (select varchar2(10), from varchar2(10));

SQL> create table dumb_name ("select" varchar2(10), "from" varchar2(10));

Table created.

Either way, I'm fine to just close out this PR. We should close SPARK-2789. Cheers, Dan
This seemed like a reasonably useful function to add to Spark SQL. However, unlike the JIRA, this implementation does not parse type characters (e.g., brackets and braces). This method creates a DataFrame with column names that map to the existing types in the RDD. In general, this seems far more useful, as users likely wish to quickly apply names to existing collections.
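For reference, the intended usage as it appears in the doctest quoted earlier in this thread: names are applied to an RDD whose element types already exist.

# Usage from the doctest above: apply the names "a b c" to an RDD of
# already-typed values and query the result.
unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false"])
rows = unparsedStrings.map(lambda x: x.split(",")) \
                      .map(lambda x: [int(x[0]), x[1], bool(x[2])])
df1 = sqlCtx.applyNames("a b c", rows)
df1.registerTempTable("df1")
sqlCtx.sql("select a from df1").collect()   # [Row(a=1), Row(a=2)]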