
Commit 9ddcf61

bowenliang123 authored and pan3793 committed
[KYUUBI #3406] [FOLLOWUP] Add create datasource table DDL usage to Pyspark docs
### _Why are the changes needed?_

Following #3406, fixing spelling mistakes and adding new DDL usage for jdbc source in PySpark client docs.

### _How was this patch tested?_

- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #3552 from bowenliang123/pyspark-docs-improve.

Closes #3406

eb05a30 [Bowen Liang] add docs for using as JDBC Datasource table with DDL. and minor spelling fix.

Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit eb04c7f)
Signed-off-by: Cheng Pan <chengpan@apache.org>
1 parent c065b88 commit 9ddcf61

File tree

1 file changed: +35 -9 lines changed

docs/client/python/pyspark.md

Lines changed: 35 additions & 9 deletions
@@ -23,15 +23,15 @@
 ## Requirements
 PySpark works with Python 3.7 and above.
 
-Install PySpark with Spark SQL and optional pandas on Spark using PyPI as follows:
+Install PySpark with Spark SQL and optional pandas support on Spark using PyPI as follows:
 
 ```shell
 pip install pyspark 'pyspark[sql]' 'pyspark[pandas_on_spark]'
 ```
 
 For installation using Conda or manually downloading, please refer to [PySpark installation](https://spark.apache.org/docs/latest/api/python/getting_started/install.html).
 
-## Preperation
+## Preparation
 
 
 ### Prepare JDBC driver
@@ -46,15 +46,15 @@ Refer to docs of the driver and prepare the JDBC driver jar file.
 
 ### Prepare JDBC Hive Dialect extension
 
-Hive Dialect support is requried by Spark for wraping SQL correctly and sending to JDBC driver. Kyuubi provides a JDBC dialect extension with auto regiested Hive Daliect support for Spark. Follow the instrunctions in [Hive Dialect Support](../../engines/spark/jdbc-dialect.html) to prepare the plugin jar file `kyuubi-extension-spark-jdbc-dialect_-*.jar`.
+Hive Dialect support is required by Spark for wrapping SQL correctly and sending it to the JDBC driver. Kyuubi provides a JDBC dialect extension with auto-registered Hive Dialect support for Spark. Follow the instructions in [Hive Dialect Support](../../engines/spark/jdbc-dialect.html) to prepare the plugin jar file `kyuubi-extension-spark-jdbc-dialect_-*.jar`.
 
-### Including jars of JDBC driver and Hive Dialect extention
+### Including jars of JDBC driver and Hive Dialect extension
 
-Choose one of following ways to include jar files to Spark.
+Choose one of the following ways to include jar files in Spark.
 
 - Put the jar file of JDBC driver and Hive Dialect to `$SPARK_HOME/jars` directory to make it visible for the classpath of PySpark. And adding `spark.sql.extensions = org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension` to `$SPARK_HOME/conf/spark_defaults.conf.`
 
-- With spark's start shell, include JDBC driver when you submit the application with `--packages`, and the Hive Dialect plugins with `--jars`
+- With Spark's start shell, include the JDBC driver when submitting the application with `--packages`, and the Hive Dialect plugins with `--jars`
 
 ```
 $SPARK_HOME/bin/pyspark --py-files PY_FILES \
@@ -79,10 +79,10 @@ spark = SparkSession.builder \
 
 For further information about PySpark JDBC usage and options, please refer to Spark's [JDBC To Other Databases](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
 
-### Reading and Writing via JDBC data source
+### Using as JDBC Datasource programmatically
 
 ```python
-# Loading data from Kyuubi via HiveDriver as JDBC source
+# Loading data from Kyuubi via HiveDriver as JDBC datasource
 jdbcDF = spark.read \
   .format("jdbc") \
   .options(driver="org.apache.hive.jdbc.HiveDriver",
@@ -94,7 +94,7 @@ jdbcDF = spark.read \
   .load()
 
 
-# Saving data to Kyuubi via HiveDriver as JDBC source
+# Saving data to Kyuubi via HiveDriver as JDBC datasource
 jdbcDF.write \
   .format("jdbc") \
   .options(driver="org.apache.hive.jdbc.HiveDriver",
@@ -106,6 +106,32 @@ jdbcDF.write \
   .save()
 ```
 
+### Using as JDBC Datasource table with SQL
+
+From Spark 3.2.0, [`CREATE DATASOURCE TABLE`](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html) is supported for creating a JDBC source table with SQL.
+
+
+```python
+# create JDBC Datasource table with DDL
+spark.sql("""CREATE TABLE kyuubi_table USING JDBC
+OPTIONS (
+  driver='org.apache.hive.jdbc.HiveDriver',
+  url='jdbc:hive2://kyuubi_server_ip:port',
+  user='user',
+  password='password',
+  dbtable='testdb.some_table'
+)""")
+
+# read data to dataframe
+jdbcDF = spark.sql("SELECT * FROM kyuubi_table")
+
+# write data from dataframe in overwrite mode
+jdbcDF.writeTo("kyuubi_table").overwritePartitions()
+
+# write data from query
+spark.sql("INSERT INTO kyuubi_table SELECT * FROM some_table")
+```
+
 
 ### Use PySpark with Pandas
 From PySpark 3.2.0, PySpark supports pandas API on Spark which allows you to scale your pandas workload out.
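As a side note on the DDL usage documented above: the `CREATE TABLE ... USING JDBC` statement hard-codes its connection options. A minimal sketch, not part of this commit, of rendering that statement from a plain options dict so the same settings can be reused across tables; the helper name `jdbc_table_ddl` and all option values are illustrative placeholders.

```python
def jdbc_table_ddl(table: str, options: dict) -> str:
    """Render a Spark `CREATE TABLE ... USING JDBC` statement from an options dict."""
    # one "  key='value'" entry per option, comma-separated across lines
    opts = ",\n".join(f"  {key}='{value}'" for key, value in options.items())
    return f"CREATE TABLE {table} USING JDBC\nOPTIONS (\n{opts}\n)"

ddl = jdbc_table_ddl("kyuubi_table", {
    "driver": "org.apache.hive.jdbc.HiveDriver",
    "url": "jdbc:hive2://kyuubi_server_ip:port",  # placeholder endpoint, as in the docs
    "user": "user",
    "password": "password",
    "dbtable": "testdb.some_table",
})
print(ddl)
```

On a live SparkSession the rendered statement would then be executed with `spark.sql(ddl)`; note this naive quoting assumes option values contain no single quotes.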

0 commit comments
