[CARBONDATA-3232] Add example and doc for alluxio integration
Optimize CarbonData usage with Alluxio:
1. Add documentation
2. Optimize the example

This closes #3054
xubo245 authored and sraghunandan committed Jan 24, 2019
1 parent 06977de commit 028eb25
Showing 8 changed files with 264 additions and 38 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -70,6 +70,7 @@ CarbonData is built using Apache Maven, to [build CarbonData](https://github.com
## Integration
* [Hive](https://github.com/apache/carbondata/blob/master/docs/hive-guide.md)
* [Presto](https://github.com/apache/carbondata/blob/master/docs/presto-guide.md)
* [Alluxio](https://github.com/apache/carbondata/blob/master/docs/alluxio-guide.md)

## Other Technical Material
* [Apache CarbonData meetup material](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=66850609)
136 changes: 136 additions & 0 deletions docs/alluxio-guide.md
@@ -0,0 +1,136 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->


# Alluxio guide
This tutorial provides a brief introduction to using Alluxio with CarbonData.
- How to use Alluxio in CarbonData?
    - [Running Alluxio example in CarbonData project by IDEA](#running-alluxio-example-in-carbondata-project-by-idea)
    - [CarbonData supports Alluxio by spark-shell](#carbondata-supports-alluxio-by-spark-shell)
    - [CarbonData supports Alluxio by spark-submit](#carbondata-supports-alluxio-by-spark-submit)

## Running Alluxio example in CarbonData project by IDEA

### [Building CarbonData](https://github.com/apache/carbondata/tree/master/build)
- Please refer to [Building CarbonData](https://github.com/apache/carbondata/tree/master/build).
- Users need to install IDEA with the Scala plugin, and import the CarbonData project.

### Installing and starting Alluxio
- Please refer to [https://www.alluxio.org/docs/1.8/en/Getting-Started.html#starting-alluxio](https://www.alluxio.org/docs/1.8/en/Getting-Started.html#starting-alluxio)
- Access the Alluxio web UI: [http://localhost:19999/home](http://localhost:19999/home)

### Running Example
- Please refer to [AlluxioExample](https://github.com/apache/carbondata/blob/master/examples/spark2/src/main/scala/org/apache/carbondata/examples/AlluxioExample.scala)
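
For a quick orientation before opening the full example, here is a condensed sketch of its core flow: create a CarbonSession whose store path points at Alluxio, register the `alluxio://` filesystem implementation, and run SQL against a table stored there. The local Spark master, the default Alluxio port 19998 and the table name are illustrative assumptions for a local setup, not part of the example itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// A minimal sketch, not the full example: assumes a local Spark master and a
// local Alluxio master on the default port 19998; the table name is illustrative.
object AlluxioSketch {
  def main(args: Array[String]): Unit = {
    val carbon = SparkSession.builder()
      .master("local")
      .appName("AlluxioSketch")
      .getOrCreateCarbonSession("alluxio://localhost:19998/carbondata")

    // Make the alluxio:// scheme resolvable for Spark/Hadoop file access
    carbon.sparkContext.hadoopConfiguration
      .set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

    carbon.sql("CREATE TABLE IF NOT EXISTS alluxio_sketch(id STRING, name STRING) STORED AS carbondata")
    carbon.sql("SELECT * FROM alluxio_sketch").show()
    carbon.close()
  }
}
```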

## CarbonData supports Alluxio by spark-shell

### [Building CarbonData](https://github.com/apache/carbondata/tree/master/build)
- Please refer to [Building CarbonData](https://github.com/apache/carbondata/tree/master/build).

### Preparing Spark
- Please refer to [http://spark.apache.org/docs/latest/](http://spark.apache.org/docs/latest/)

### Downloading Alluxio and uncompressing it
- Please refer to [https://www.alluxio.org/download](https://www.alluxio.org/download)

### Running spark-shell
- Run the following command from the Spark installation directory:
```$command
./bin/spark-shell --jars ${CARBONDATA_PATH}/assembly/target/scala-2.11/apache-carbondata-1.6.0-SNAPSHOT-bin-spark2.2.1-hadoop2.7.2.jar,${ALLUXIO_PATH}/client/alluxio-1.8.1-client.jar
```
- Test using Alluxio through CarbonSession:
```$scala
import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.SparkSession
val carbon = SparkSession.builder().master("local").appName("test").getOrCreateCarbonSession("alluxio://localhost:19998/carbondata");
carbon.sql("CREATE TABLE carbon_alluxio(id String,name String, city String,age Int) STORED as carbondata");
carbon.sql(s"LOAD DATA LOCAL INPATH '${CARBONDATA_PATH}/integration/spark-common-test/src/test/resources/sample.csv' into table carbon_alluxio");
carbon.sql("select * from carbon_alluxio").show
```
- Result
```$scala
scala> carbon.sql("select * from carbon_alluxio").show
+---+------+---------+---+
| id| name| city|age|
+---+------+---------+---+
| 1| david| shenzhen| 31|
| 2| eason| shenzhen| 27|
| 3| jarry| wuhan| 35|
| 3| jarry|Bangalore| 35|
| 4| kunal| Delhi| 26|
| 4|vishal|Bangalore| 29|
+---+------+---------+---+
```
## CarbonData supports Alluxio by spark-submit

### [Building CarbonData](https://github.com/apache/carbondata/tree/master/build)
- Please refer to [Building CarbonData](https://github.com/apache/carbondata/tree/master/build).

### Preparing Spark
- Please refer to [http://spark.apache.org/docs/latest/](http://spark.apache.org/docs/latest/)

### Downloading Alluxio and uncompressing it
- Please refer to [https://www.alluxio.org/download](https://www.alluxio.org/download)

### Running spark-submit
#### Upload data to Alluxio
```$command
./bin/alluxio fs copyFromLocal ${CARBONDATA_PATH}/hadoop/src/test/resources/data.csv /
```
#### Command
```$command
./bin/spark-submit \
--master local \
--jars ${ALLUXIO_PATH}/client/alluxio-1.8.1-client.jar,${CARBONDATA_PATH}/examples/spark2/target/carbondata-examples-1.6.0-SNAPSHOT.jar \
--class org.apache.carbondata.examples.AlluxioExample \
${CARBONDATA_PATH}/assembly/target/scala-2.11/apache-carbondata-1.6.0-SNAPSHOT-bin-spark2.2.1-hadoop2.7.2.jar \
false
```
**NOTE**: Please set runShell to false, which avoids a dependency on the Alluxio shell module.
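
For context, the sketch below shows the kind of argument handling the example uses for this flag (condensed; the object and method names here are hypothetical): `false` means the Alluxio `FileSystemShell` is never created, so the CSV must already have been copied into Alluxio, as done in the upload step above.

```scala
// Hypothetical standalone sketch of the runShell flag handling only.
object RunShellFlagSketch {
  // Default is true (the example copies the CSV via the Alluxio shell);
  // pass "false" with spark-submit so the alluxio-shell module is not needed.
  def parseRunShell(args: Array[String]): Boolean =
    if (args != null && args.length > 0) args(0).toBoolean else true

  def main(args: Array[String]): Unit =
    println(s"runShell = ${parseRunShell(args)}")
}
```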

#### Result
```$command
+-----------------+-------+--------------------+--------------------+---------+-----------+---------+----------+
|SegmentSequenceId| Status| Load Start Time| Load End Time|Merged To|File Format|Data Size|Index Size|
+-----------------+-------+--------------------+--------------------+---------+-----------+---------+----------+
| 1|Success|2019-01-09 15:10:...|2019-01-09 15:10:...| NA|COLUMNAR_V3| 23.92KB| 1.07KB|
| 0|Success|2019-01-09 15:10:...|2019-01-09 15:10:...| NA|COLUMNAR_V3| 23.92KB| 1.07KB|
+-----------------+-------+--------------------+--------------------+---------+-----------+---------+----------+
+-------+------+
|country|amount|
+-------+------+
| france| 202|
| china| 1698|
+-------+------+
+-----------------+---------+--------------------+--------------------+---------+-----------+---------+----------+
|SegmentSequenceId| Status| Load Start Time| Load End Time|Merged To|File Format|Data Size|Index Size|
+-----------------+---------+--------------------+--------------------+---------+-----------+---------+----------+
| 3|Compacted|2019-01-09 15:10:...|2019-01-09 15:10:...| 0.1|COLUMNAR_V3| 23.92KB| 1.03KB|
| 2|Compacted|2019-01-09 15:10:...|2019-01-09 15:10:...| 0.1|COLUMNAR_V3| 23.92KB| 1.07KB|
| 1|Compacted|2019-01-09 15:10:...|2019-01-09 15:10:...| 0.1|COLUMNAR_V3| 23.92KB| 1.07KB|
| 0.1| Success|2019-01-09 15:10:...|2019-01-09 15:10:...| NA|COLUMNAR_V3| 37.65KB| 1.08KB|
| 0|Compacted|2019-01-09 15:10:...|2019-01-09 15:10:...| 0.1|COLUMNAR_V3| 23.92KB| 1.07KB|
+-----------------+---------+--------------------+--------------------+---------+-----------+---------+----------+
```

## Reference
- [1] https://www.alluxio.org/docs/1.8/en/Getting-Started.html
- [2] https://www.alluxio.org/docs/1.8/en/compute/Spark.html
6 changes: 4 additions & 2 deletions docs/documentation.md
@@ -29,15 +29,17 @@ Apache CarbonData is a new big data file format for faster interactive query usi

**Quick Start:** [Run an example program](./quick-start-guide.md#installing-and-configuring-carbondata-to-run-locally-with-spark-shell) on your local machine or [study some examples](https://github.com/apache/carbondata/tree/master/examples/spark2/src/main/scala/org/apache/carbondata/examples).

**CarbonData SQL Language Reference:** CarbonData extends the Spark SQL language and adds several [DDL](./ddl-of-carbondata.md) and [DML](./dml-of-carbondata.md) statements to support operations on it.Refer to the [Reference Manual](./language-manual.md) to understand the supported features and functions.
**CarbonData SQL Language Reference:** CarbonData extends the Spark SQL language and adds several [DDL](./ddl-of-carbondata.md) and [DML](./dml-of-carbondata.md) statements to support operations on it. Refer to the [Reference Manual](./language-manual.md) to understand the supported features and functions.

**Programming Guides:** You can read our guides about [Java APIs supported](./sdk-guide.md) or [C++ APIs supported](./csdk-guide.md) to learn how to integrate CarbonData with your applications.



## Integration

CarbonData can be integrated with popular Execution engines like [Spark](./quick-start-guide.md#spark) , [Presto](./quick-start-guide.md#presto) and [Hive](./quick-start-guide.md#hive).Refer to the [Installation and Configuration](./quick-start-guide.md#integration) section to understand all modes of Integrating CarbonData.
- CarbonData can be integrated with popular execution engines like [Spark](./quick-start-guide.md#spark), [Presto](./quick-start-guide.md#presto) and [Hive](./quick-start-guide.md#hive).
- CarbonData can be integrated with popular storage engines like HDFS, Huawei Cloud (OBS) and [Alluxio](./quick-start-guide.md#alluxio).
Refer to the [Installation and Configuration](./quick-start-guide.md#integration) section to understand all modes of integrating CarbonData.



4 changes: 3 additions & 1 deletion docs/introduction.md
@@ -115,8 +115,10 @@ CarbonData has rich set of features to support various use cases in Big Data ana

- ##### HDFS

CarbonData uses HDFS api to write and read data from HDFS.CarbonData can take advantage of the locality information to efficiently suggest spark to run tasks near to the data.
CarbonData uses the HDFS API to write and read data from HDFS. CarbonData can take advantage of locality information to suggest that Spark run tasks close to the data.

- ##### Alluxio
CarbonData also supports reading and writing data with [Alluxio](./quick-start-guide.md#alluxio).


## Integration with Big Data ecosystem
17 changes: 13 additions & 4 deletions docs/quick-start-guide.md
@@ -35,9 +35,10 @@ This tutorial provides a quick introduction to using CarbonData. To follow along

## Integration

CarbonData can be integrated with Spark,Presto and Hive Execution Engines. The below documentation guides on Installing and Configuring with these execution engines.
### Integration with Execution Engines
CarbonData can be integrated with the Spark, Presto and Hive execution engines. The documentation below explains how to install and configure CarbonData with these execution engines.

### Spark
#### Spark

[Installing and Configuring CarbonData to run locally with Spark Shell](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell)

@@ -48,13 +49,21 @@ CarbonData can be integrated with Spark,Presto and Hive Execution Engines. The b
[Installing and Configuring CarbonData Thrift Server for Query Execution](#query-execution-using-carbondata-thrift-server)


### Presto
#### Presto
[Installing and Configuring CarbonData on Presto](#installing-and-configuring-carbondata-on-presto)

### Hive
#### Hive
[Installing and Configuring CarbonData on Hive](https://github.com/apache/carbondata/blob/master/docs/hive-guide.md)

### Integration with Storage Engines
#### HDFS
[CarbonData supports read and write with HDFS](https://github.com/apache/carbondata/blob/master/docs/quick-start-guide.md#installing-and-configuring-carbondata-on-standalone-spark-cluster)

#### S3
[CarbonData supports read and write with S3](https://github.com/apache/carbondata/blob/master/docs/s3-guide.md)

#### Alluxio
[CarbonData supports read and write with Alluxio](https://github.com/apache/carbondata/blob/master/docs/alluxio-guide.md)

## Installing and Configuring CarbonData to run locally with Spark Shell

10 changes: 10 additions & 0 deletions examples/spark2/pom.xml
@@ -105,6 +105,16 @@
<artifactId>carbondata-core</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.alluxio</groupId>
<artifactId>alluxio-core-client-hdfs</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.alluxio</groupId>
<artifactId>alluxio-shell</artifactId>
<version>1.8.1</version>
</dependency>
</dependencies>

<build>
@@ -17,57 +17,116 @@

package org.apache.carbondata.examples

import java.io.File
import java.text.SimpleDateFormat
import java.util.Date

import alluxio.cli.fs.FileSystemShell
import org.apache.spark.sql.SparkSession

import org.apache.carbondata.core.constants.CarbonCommonConstants
import org.apache.carbondata.core.datastore.impl.FileFactory
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.examples.util.ExampleUtils


/**
* configure alluxio:
* 1.start alluxio
* 2.upload the jar :"/alluxio_path/core/client/target/
* alluxio-core-client-YOUR-VERSION-jar-with-dependencies.jar"
* 3.Get more detail at:http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html
* 2. Please upload the data to Alluxio if you set runShell to false:
* ./bin/alluxio fs copyFromLocal /carbondata_path/hadoop/src/test/resources/data.csv /
* 3. Get more details at: https://www.alluxio.org/docs/1.8/en/compute/Spark.html
*/

object AlluxioExample {
def main(args: Array[String]) {
val spark = ExampleUtils.createCarbonSession("AlluxioExample")
exampleBody(spark)
spark.close()
def main (args: Array[String]) {
val carbon = ExampleUtils.createCarbonSession("AlluxioExample",
storePath = "alluxio://localhost:19998/carbondata")
val runShell: Boolean = if (null != args && args.length > 0) {
args(0).toBoolean
} else {
true
}
exampleBody(carbon, runShell)
carbon.close()
}

def exampleBody(spark : SparkSession): Unit = {
def exampleBody (spark: SparkSession, runShell: Boolean = true): Unit = {
val rootPath = new File(this.getClass.getResource("/").getPath
+ "../../../..").getCanonicalPath
spark.sparkContext.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")
FileFactory.getConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

// Specify date format based on raw data
CarbonProperties.getInstance()
.addProperty(CarbonCommonConstants.CARBON_DATE_FORMAT, "yyyy/MM/dd")
.addProperty(CarbonCommonConstants.CARBON_DATE_FORMAT, "yyyy/MM/dd")
val time = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
val alluxioPath = "alluxio://localhost:19998"
var alluxioFile = alluxioPath + "/data.csv"

val remoteFile = "/carbon_alluxio" + time + ".csv"

var mFsShell: FileSystemShell = null

// avoid a dependency on the Alluxio shell module when running with spark-submit
if (runShell) {
mFsShell = new FileSystemShell()
alluxioFile = alluxioPath + remoteFile
val localFile = rootPath + "/hadoop/src/test/resources/data.csv"
mFsShell.run("copyFromLocal", localFile, remoteFile)
}

import spark._

sql("DROP TABLE IF EXISTS alluxio_table")

spark.sql("DROP TABLE IF EXISTS alluxio_table")
sql(
s"""
| CREATE TABLE IF NOT EXISTS alluxio_table(
| ID Int,
| date Date,
| country String,
| name String,
| phonetype String,
| serialname String,
| salary Int)
| STORED BY 'carbondata'
| TBLPROPERTIES(
| 'SORT_COLUMNS' = 'phonetype,name',
| 'DICTIONARY_INCLUDE'='phonetype',
| 'TABLE_BLOCKSIZE'='32',
| 'AUTO_LOAD_MERGE'='true')
""".stripMargin)

spark.sql("""
CREATE TABLE IF NOT EXISTS alluxio_table
(ID Int, date Date, country String,
name String, phonetype String, serialname String, salary Int)
STORED BY 'carbondata'
""")
for (i <- 0 until 2) {
sql(
s"""
| LOAD DATA LOCAL INPATH '$alluxioFile'
| into table alluxio_table
""".stripMargin)
}

spark.sql(s"""
LOAD DATA LOCAL INPATH 'alluxio://localhost:19998/data.csv' into table alluxio_table
""")
sql("SELECT * FROM alluxio_table").show()

spark.sql("""
SELECT country, count(salary) AS amount
FROM alluxio_table
WHERE country IN ('china','france')
GROUP BY country
""").show()
sql("SHOW SEGMENTS FOR TABLE alluxio_table").show()
sql(
"""
| SELECT country, count(salary) AS amount
| FROM alluxio_table
| WHERE country IN ('china','france')
| GROUP BY country
""".stripMargin).show()

spark.sql("DROP TABLE IF EXISTS alluxio_table")
for (i <- 0 until 2) {
sql(
s"""
| LOAD DATA LOCAL INPATH '$alluxioFile'
| into table alluxio_table
""".stripMargin)
}
sql("SHOW SEGMENTS FOR TABLE alluxio_table").show()
if (runShell) {
mFsShell.run("rm", remoteFile)
mFsShell.close()
}
sql("DROP TABLE IF EXISTS alluxio_table")
}
}
