# Aerospike Spark Connector Tutorial for Scala

## Tested with Java 8, Spark 2.4.0, Python 3.7, Scala 2.11.12, and Spylon (https://pypi.org/project/spylon-kernel/)


In [None]:
%%init_spark
launcher.master = "local[*]"

By default, the aerospike-spark-assembly JAR must be placed in the "jars" directory of the Spark installation.
If the JAR is in a non-default location, the variable below must be set to the appropriate value.

> `launcher.jars = ["/Users/kmatty/Documents/Jupyter/SparkConnector/aerospike-spark-assembly-2.4.0.jar"]` 

In [None]:
//Specify the Seed Host of the Aerospike Server
val AS_HOST ="127.0.0.1:3000"

In [None]:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SaveMode

## Create sample data and write it into Aerospike Database

In [None]:
//Create test data

val num_records=1000
val rand = scala.util.Random

//schema of input data

val schema: StructType = new StructType(
    Array(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false),
    StructField("salary",IntegerType, nullable = false)
  ))

val inputDF = {
    val inputBuf=  new ArrayBuffer[Row]()
    for ( i <- 1 to num_records){
        val name = "name"  + i
        val age = i%100
        val salary = 50000 + rand.nextInt(50000)
        val id = i 
        val r = Row(id, name, age,salary)
        inputBuf.append(r)
    }
    val inputRDD = spark.sparkContext.parallelize(inputBuf.toSeq)
    spark.createDataFrame(inputRDD,schema)
}

inputDF.show(10)

//Write the Sample Data to Aerospike
inputDF.write.mode(SaveMode.Overwrite) 
.format("com.aerospike.spark.sql") //aerospike specific format
.option("aerospike.seedhost", AS_HOST) //DB seed host as host:port; multiple hosts can be supplied (see the connector reference for the list format)
.option("aerospike.namespace", "test") //use this namespace 
.option("aerospike.writeset", "input_data") //write to this set
.option("aerospike.updateByKey", "id") //indicates which columns should be used for construction of primary key
.save()

## Schema in the Spark Connector

- Aerospike is schemaless, whereas Spark adheres to a schema. Once the schema is decided (either inferred or provided), the data within the bins must honor its types.

- To infer the schema, the connector samples a set of records (configurable through `aerospike.schema.scan`) to decide the names of the bins/columns and their types. This implies that the derived schema depends entirely upon the sampled records.

- Note that `__key` was not part of the schema provided above. To query by `__key`, add it to the provided schema with the appropriate type. Metadata columns such as `__gen` or `__ttl` can be added in the same way.
         
      val schemaWithPK: StructType = new StructType(Array(
                StructField("__key",IntegerType, nullable = false),    
                StructField("id", IntegerType, nullable = false),
                StructField("name", StringType, nullable = false),
                StructField("age", IntegerType, nullable = false),
                StructField("salary",IntegerType, nullable = false)))
                
- We recommend that you provide a schema for queries that involve complex data types such as lists, maps, and mixed types.
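
As a sketch of the `__key` point above, the cell below reads the `input_data` set with `schemaWithPK` and filters on the key. It assumes the `spark` session, the `AS_HOST` value, and the sample data written earlier in this notebook; the name `loadedDFWithPK` is only illustrative.

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Provided schema that also exposes the primary key as `__key`
val schemaWithPK: StructType = new StructType(Array(
  StructField("__key", IntegerType, nullable = false),
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false),
  StructField("salary", IntegerType, nullable = false)))

// Assumes `spark` and `AS_HOST` from the earlier cells of this notebook
val loadedDFWithPK = spark.read
  .format("com.aerospike.spark.sql")
  .schema(schemaWithPK)
  .option("aerospike.seedhost", AS_HOST)
  .option("aerospike.keyPath", "/etc/aerospike/features.conf")
  .option("aerospike.namespace", "test")
  .option("aerospike.set", "input_data")
  .load()

// With `__key` in the schema, it can be used in a predicate like any other column
loadedDFWithPK.where(col("__key") === 829).show()
```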
          

## Load data into a DataFrame without specifying any schema (using connector schema inference)

In [None]:
// Create a Spark DataFrame by using the Connector Schema inference mechanism

val loadedDFWithoutSchema=spark
.sqlContext
.read
.format("com.aerospike.spark.sql")
.option("aerospike.seedhost", AS_HOST)
.option("aerospike.keyPath", "/etc/aerospike/features.conf") //Path to the feature key file. When running in a cluster, this file needs to be present on all drivers. Consult the documentation on how to read it from HDFS or pass it as a string.
.option("aerospike.namespace", "test")
.option("aerospike.set", "input_data") //read the data from this set
.load
loadedDFWithoutSchema.printSchema()
//Notice that schema of loaded data has some additional fields. 
// When connector infers schema, it also adds internal metadata.
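
Because the inferred schema depends on the sampled records, it can help to set the sample size explicitly. The sketch below passes `aerospike.schema.scan`, the option named in the schema section. The value `100` and its reading as a record count are assumptions, so check the connector reference for the accepted values and the default. The name `inferredDF` is illustrative.

```scala
// Assumes `spark` and `AS_HOST` from the earlier cells of this notebook
val inferredDF = spark.read
  .format("com.aerospike.spark.sql")
  .option("aerospike.seedhost", AS_HOST)
  .option("aerospike.keyPath", "/etc/aerospike/features.conf")
  .option("aerospike.namespace", "test")
  .option("aerospike.set", "input_data")
  .option("aerospike.schema.scan", "100") // assumed value: number of records sampled for inference
  .load()

// The printed schema reflects only what was seen in the sampled records
inferredDF.printSchema()
```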

## Load data into a DataFrame with user specified schema 

In [None]:
//Data can be loaded with known schema as well.
val loadedDFWithSchema=spark
.sqlContext
.read
.format("com.aerospike.spark.sql")
.schema(schema)
.option("aerospike.seedhost",AS_HOST)
.option("aerospike.featurekey", "/etc/aerospike/features.conf") 
.option ("aerospike.namespace", "test")
.option("aerospike.set", "input_data").load
loadedDFWithSchema.show(5)

## Writing Sample Complex Data Types (CDT) data into Aerospike

In [None]:
val complex_data_json = "resources/nested_data.json"

val alias = StructType(List(
    StructField("first_name", StringType, false),
    StructField("last_name", StringType, false)))

val name = StructType(List(
    StructField("first_name", StringType, false),
    StructField("aliases", ArrayType(alias), false)))

val street_address = StructType(List(
    StructField("street_name", StringType, false),
    StructField("apt_number", IntegerType, false)))

val address = StructType(List(
    StructField("zip", LongType, false),
    StructField("street", street_address, false),
    StructField("city", StringType, false)))

val workHistory = StructType(List(
    StructField("company_name", StringType, false),
    StructField("company_address", address, false),
    StructField("worked_from", StringType, false)))

val person = StructType(List(
    StructField("name", name, false, Metadata.empty),
    StructField("SSN", StringType, false, Metadata.empty),
    StructField("home_address", ArrayType(address), false),
    StructField("work_history", ArrayType(workHistory), false)))

val cmplx_data_with_schema=spark.read.schema(person).json(complex_data_json)

cmplx_data_with_schema.printSchema()
cmplx_data_with_schema.write.mode(SaveMode.Overwrite) 
.format("com.aerospike.spark.sql") //aerospike specific format
.option("aerospike.seedhost", AS_HOST) //DB seed host as host:port; multiple hosts can be supplied (see the connector reference for the list format)
.option("aerospike.namespace", "test") //use this namespace 
.option("aerospike.writeset", "scala_complex_input_data") //write to this set
.option("aerospike.updateByKey", "name.first_name") //indicates which columns should be used for construction of primary key
.save()

## Load Complex Data Types (CDT) into a DataFrame without specifying any schema (using connector schema inference)

In [None]:
val loadedComplexDFWithoutSchema=spark
.sqlContext
.read
.format("com.aerospike.spark.sql")
.option("aerospike.seedhost", AS_HOST)
.option("aerospike.keyPath", "/etc/aerospike/features.conf") //Path to the feature key file. When running in a cluster, this file needs to be present on all drivers. Consult the documentation on how to read it from HDFS or pass it as a string.
.option("aerospike.namespace", "test")
.option("aerospike.set", "scala_complex_input_data") //read the data from this set
.load
loadedComplexDFWithoutSchema.printSchema()

## Load Complex Data Types (CDT) into a DataFrame with user specified schema

In [None]:
val loadedComplexDFWithSchema=spark
.sqlContext
.read
.format("com.aerospike.spark.sql")
.option("aerospike.seedhost", AS_HOST)
.option("aerospike.keyPath", "/etc/aerospike/features.conf") //Path to the feature key file. When running in a cluster, this file needs to be present on all drivers. Consult the documentation on how to read it from HDFS or pass it as a string.
.option("aerospike.namespace", "test")
.option("aerospike.set", "scala_complex_input_data") //read the data from this set
.schema(person)
.load
loadedComplexDFWithSchema.printSchema()
//Note the difference in the types of the loaded data in the two cases. With a provided schema, the complex types are loaded exactly as specified.

# Querying Aerospike Data using SparkSQL

### Things to keep in mind
   1. Queries that involve the primary key in the predicate trigger aerospike_batch_get() (https://www.aerospike.com/docs/client/c/usage/kvs/batch.html) and run extremely fast. An example is a query that filters on `__key`, with no `OR` that combines the key filter with a bin filter.
   2. All other queries may entail a full scan of the Aerospike DB if they can’t be converted to an Aerospike batch get.
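
The two rules above can be summarized in a small toy model. This is only an illustration of the rule of thumb, not the connector's actual query planner: every type and function name below is hypothetical, and the treatment of `__key > k` follows the note later in this notebook that only equality tests on the key qualify.

```scala
// Toy model of the rule of thumb; NOT the connector's real planning logic.
sealed trait Predicate
case class KeyEquals(key: Long) extends Predicate       // __key = k
case class KeyIn(keys: Seq[Long]) extends Predicate     // __key isin (...)
case class KeyGreaterThan(key: Long) extends Predicate  // __key > k, not an equality test
case class BinFilter(bin: String) extends Predicate     // any filter on a regular bin
case class Or(left: Predicate, right: Predicate) extends Predicate

// True when the predicate could be served by a batch get under the rules above
def usesBatchGet(p: Predicate): Boolean = p match {
  case KeyEquals(_) | KeyIn(_) => true
  case KeyGreaterThan(_)       => false
  case BinFilter(_)            => false
  case Or(l, r)                => usesBatchGet(l) && usesBatchGet(r)
}

println(usesBatchGet(KeyEquals(829)))                                          // true
println(usesBatchGet(Or(KeyIn(Seq(1L, 2L, 3L)), KeyIn(Seq(12L, 13L, 14L)))))   // true
println(usesBatchGet(Or(KeyIn(Seq(1L, 2L, 3L)), BinFilter("age"))))            // false
println(usesBatchGet(KeyGreaterThan(10)))                                      // false
```

The three kinds of query shown in the following sections correspond to the first three calls.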

## Queries that include Primary Key in the Predicate

For batch get queries, filters can also be applied to metadata columns such as `__gen` or `__ttl`. To do so, these columns must be exposed through the schema, if a schema is provided.

In [None]:
val batchGet1= spark.sqlContext
.read
.format("com.aerospike.spark.sql")
.option("aerospike.seedhost", AS_HOST)
.option("aerospike.featurekey", "/etc/aerospike/features.conf") 
.option ("aerospike.namespace", "test")
.option("aerospike.set", "input_data")
.option("aerospike.keyType", "int") //used to hint primary key(PK) type when schema is not provided.
.load.where("__key = 829")
batchGet1.show()
//Note that the Aerospike database supports only equality tests on PKs in a primary key query.
//So a where clause such as "__key > 10" results in a scan query.

In [None]:
//In this query we are doing *OR* between PK subqueries 

val somePrimaryKeys= 1.to(10).toSeq
val someMoreKeys= 12.to(14).toSeq
val batchGet2= spark.sqlContext
.read
.format("com.aerospike.spark.sql")
.option("aerospike.seedhost",AS_HOST)
.option("aerospike.featurekey", "/etc/aerospike/features.conf") 
.option ("aerospike.namespace", "test")
.option("aerospike.set", "input_data")
.option("aerospike.keyType", "int") //used to hint primary key(PK) type when inferred without schema.
.load.where((col("__key") isin (somePrimaryKeys:_*)) || ( col("__key") isin (someMoreKeys:_*) ))
batchGet2.show(5)
//We should get 13 records in total; show(5) displays only the first 5.

## Queries that do not include Primary Key in the Predicate

In [None]:

val somePrimaryKeys= 1.to(10).toSeq
val scanQuery1= spark.sqlContext
.read
.format("com.aerospike.spark.sql")
.option("aerospike.seedhost", AS_HOST)
.option ("aerospike.namespace", "test")
.option("aerospike.featurekey", "/etc/aerospike/features.conf") 
.option("aerospike.set", "input_data")
.option("aerospike.keyType", "int") //used to hint primary key(PK) type when inferred without schema.
.load.where((col("__key") isin (somePrimaryKeys:_*)) || ( col("age") >50 ))

scanQuery1.show()

//Since there is an OR between the PK filter and a bin filter, this is treated as a scan query.
//Primary keys are not stored in bins (by default), hence only the filters on bins are honored.

## Query with CDT

In [None]:
//Find all people who have had at least 5 jobs in the past.
loadedComplexDFWithSchema
.withColumn("past_jobs", col("work_history.company_name"))
.withColumn("num_jobs", size(col("past_jobs")))
.where(col("num_jobs")  >4).show()
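
To see how the query above works without an Aerospike server, here is a minimal sketch on a small local DataFrame. The data and the names `localSpark`, `jobType`, `personType`, `people`, and `withJobs` are made up for illustration. Selecting a field through an array of structs (`work_history.company_name`) yields an array of that field's values, and `size` counts its elements.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, size}
import org.apache.spark.sql.types._

// Reuses the notebook's session if one already exists
val localSpark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical, simplified version of the `person` schema
val jobType = StructType(List(
  StructField("company_name", StringType, false),
  StructField("worked_from", StringType, false)))
val personType = StructType(List(
  StructField("SSN", StringType, false),
  StructField("work_history", ArrayType(jobType), false)))

// Made-up rows: the first person has 2 jobs, the second has 5
val rows = Seq(
  Row("111", Seq(Row("A", "2001"), Row("B", "2005"))),
  Row("222", Seq(Row("C", "1999"), Row("D", "2003"), Row("E", "2008"),
                 Row("F", "2012"), Row("G", "2016"))))

val people = localSpark.createDataFrame(
  localSpark.sparkContext.parallelize(rows), personType)

val withJobs = people
  .withColumn("past_jobs", col("work_history.company_name")) // array of company names
  .withColumn("num_jobs", size(col("past_jobs")))            // number of past jobs

// Only the person with five jobs passes the filter
withJobs.where(col("num_jobs") > 4).select("SSN", "num_jobs").show()
```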

## Use Aerospike Spark Connector Configuration properties in the Spark API to improve performance

- `aerospike.partition.factor`: number of logical Aerospike partitions [0-15]
- `aerospike.maxthreadcount`: maximum number of threads to use for writing data into Aerospike
- `aerospike.compression`: compression of Java client-server communication
- `aerospike.batchMax`: maximum number of records per read request (default 5000)
- `aerospike.recordspersecond`: same as the Java client

#### Other
- `aerospike.keyType`: primary key type hint for schema inference. Always set it properly if the primary key type is not string.

See https://www.aerospike.com/docs/connect/processing/spark/reference.html for a detailed description of the above properties.
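
As a hedged sketch, these properties are passed like any other option on the reader or writer. The value chosen for `aerospike.partition.factor` is an illustrative assumption within the listed `[0-15]` range, `5000` is the listed default for `aerospike.batchMax`, and `tunedDF` is a made-up name. The cell assumes the `spark` session and `AS_HOST` defined earlier in this notebook.

```scala
// Assumes `spark` and `AS_HOST` from the earlier cells of this notebook
val tunedDF = spark.read
  .format("com.aerospike.spark.sql")
  .option("aerospike.seedhost", AS_HOST)
  .option("aerospike.keyPath", "/etc/aerospike/features.conf")
  .option("aerospike.namespace", "test")
  .option("aerospike.set", "input_data")
  .option("aerospike.keyType", "int")         // the primary key is an integer in this tutorial
  .option("aerospike.partition.factor", "8")  // assumed value within the listed [0-15] range
  .option("aerospike.batchMax", "5000")       // listed default, stated explicitly here
  .load()

tunedDF.show(5)
```

Suitable values depend on cluster size and workload, so treat these numbers as starting points to measure against rather than recommendations.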