add quick start steps (#31)
LOVEGISER committed May 12, 2022 (commit a29495b1a709b22b82b3bc75edb198481af4f20d)

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### QuickStart

1. Download and compile the Spark Doris Connector from https://github.com/apache/incubator-doris-spark-connector. We suggest compiling it with the official Doris build image.

```bash
$ docker pull apache/incubator-doris:build-env-ldb-toolchain-latest
```

2. The compiled jar is named like `spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar`.
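The name encodes which Spark and Scala versions the build targets. A small illustrative parse (the `sed` patterns below are my own, not part of the connector):

```shell
# illustrative: extract the Spark and Scala versions from the jar name
jar=spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar
spark_version=$(echo "$jar" | sed -E 's/spark-doris-connector-([0-9.]+)_([0-9.]+)-.*/\1/')
scala_version=$(echo "$jar" | sed -E 's/spark-doris-connector-([0-9.]+)_([0-9.]+)-.*/\2/')
echo "Spark $spark_version, Scala $scala_version"
```

Pick the jar whose Spark and Scala versions match your cluster.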

3. Download Spark from https://spark.apache.org/downloads.html. In China, the Tencent mirror is a good choice: https://mirrors.cloud.tencent.com/apache/spark/spark-3.1.2/

```bash
# download
wget https://mirrors.cloud.tencent.com/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
# extract
tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz
```

4. Configure the Spark environment.

```shell
vim /etc/profile
# add the following lines:
export SPARK_HOME=/your_path/spark-3.1.2-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
# then reload the profile:
source /etc/profile
```
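A quick self-contained sanity check that the exports compose `PATH` as intended (the install path is a placeholder):

```shell
# illustrative: after the exports, the Spark bin directory is the last PATH entry
SPARK_HOME=/your_path/spark-3.1.2-bin-hadoop3.2
PATH=$PATH:$SPARK_HOME/bin
echo "$PATH" | tr ':' '\n' | tail -n 1
```

If this prints the Spark `bin` directory, commands like `spark-shell` will resolve.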

5. Copy `spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar` into the Spark `jars` directory.

```shell
cp /your_path/spark-doris-connector/target/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar $SPARK_HOME/jars
```

6. Create a Doris database and table.

```sql
CREATE DATABASE mongo_doris;
USE mongo_doris;
CREATE TABLE data_sync_test_simple
(
    _id VARCHAR(32) DEFAULT '',
    id VARCHAR(32) DEFAULT '',
    user_name VARCHAR(32) DEFAULT '',
    member_list VARCHAR(32) DEFAULT ''
)
DUPLICATE KEY(_id)
DISTRIBUTED BY HASH(_id) BUCKETS 10
PROPERTIES("replication_num" = "1");
INSERT INTO data_sync_test_simple VALUES ('1','1','alex','123');
```

7. Run the following code in `spark-shell`.

```scala
import org.apache.doris.spark._
val dorisSparkRDD = sc.dorisRDD(
tableIdentifier = Some("mongo_doris.data_sync_test"),
cfg = Some(Map(
"doris.fenodes" -> "127.0.0.1:8030",
"doris.request.auth.user" -> "root",
"doris.request.auth.password" -> ""
))
)
dorisSparkRDD.collect()
```

- `mongo_doris`: Doris database name.
- `data_sync_test`: Doris table name.
- `doris.fenodes`: Doris FE IP and http_port.
- `doris.request.auth.user`: Doris user name.
- `doris.request.auth.password`: Doris password.

8. If Spark runs in cluster mode, upload the jar to HDFS and add the HDFS URL of the doris-spark-connector jar to `spark.yarn.jars`.

```properties
spark.yarn.jars=hdfs:///spark-jars/doris-spark-connector-3.1.2-2.12-1.0.0.jar
```

Link: https://github.com/apache/incubator-doris/discussions/9486
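In cluster mode every executor needs to fetch the connector, so the jar is uploaded to HDFS before the config takes effect. A hedged sketch, with placeholder paths that you should adapt to your cluster:

```shell
# upload the connector jar to a shared HDFS directory (placeholder paths)
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put /your_path/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar /spark-jars/
# then reference it in spark-defaults.conf, for example:
# spark.yarn.jars=hdfs:///spark-jars/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar
```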

9. Run the following code in the `pyspark` shell.

```python
dorisSparkDF = spark.read.format("doris") \
    .option("doris.table.identifier", "mongo_doris.data_sync_test") \
    .option("doris.fenodes", "127.0.0.1:8030") \
    .option("user", "root") \
    .option("password", "") \
    .load()
# show 5 rows of data
dorisSparkDF.show(5)
```

## Report issues or submit pull request

If you find any bugs, feel free to file a [GitHub issue](https://github.com/apache/incubator-doris/issues) or fix it by submitting a [pull request](https://github.com/apache/incubator-doris/pulls).
