## Programmatically Specifying the Schema

Ref: https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#programmatically-specifying-the-schema

In [1]:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
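
The cells below rely on `sc` and `spark` being predefined by the notebook kernel. Outside such a kernel, a session could be created along the following lines. This is only a sketch: the local master, the application name, and the loopback bind address are assumptions for a single-machine run, and `spark.driver.bindAddress` is, as far as I know, only honoured from Spark 2.1 onwards.

```scala
import org.apache.spark.sql.SparkSession

// Assumed settings for a single-machine run; adjust them for a real cluster.
val spark = SparkSession.builder()
  .master("local[*]")                               // assumption: run locally on all cores
  .appName("programmatic-schema")                   // hypothetical application name
  .config("spark.driver.bindAddress", "127.0.0.1")  // assumption: bind the driver to loopback
  .getOrCreate()

// The SparkContext used by the RDD examples.
val sc = spark.sparkContext
```

`getOrCreate()` returns an already running session if there is one, so the builder settings only take effect when a new session is started.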

### Step 1: Create an RDD of `Row`s from the original RDD

In [2]:
// Simulate an RDD of Rows (stands in for rows derived from the original RDD)
val rowRDD = sc.parallelize(Array(
    Row("dat_pc1", "10.2.3.5", "asda4:1241:4124"),
    Row("dat_mac1", "10.2.3.4", "bas4:1241:4125")))


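The cell above builds the `Row`s by hand. When the original RDD holds raw text, the rows would typically be derived by splitting each record. A minimal sketch, assuming a comma-separated input with the same three fields (the `lines` RDD and the `session` value are hypothetical and not part of the notebook):

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("rows-from-text").getOrCreate()

// Hypothetical raw input: one comma-separated record per line.
val lines = session.sparkContext.parallelize(Seq(
  "dat_pc1,10.2.3.5,asda4:1241:4124",
  "dat_mac1,10.2.3.4,bas4:1241:4125"))

// Split each line and wrap the trimmed fields in a Row.
val rowsFromText = lines
  .map(_.split(","))
  .map(fields => Row(fields(0).trim, fields(1).trim, fields(2).trim))
```
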
### Step 2: Create the schema, represented by a `StructType`, matching the structure of the `Row`s in the RDD

In [3]:
val schema = StructType(Array(
    StructField("host_name", StringType, true),
    StructField("ip_address", StringType, true),
    StructField("mac_address", StringType, true)))

StructType(StructField(host_name,StringType,true), StructField(ip_address,StringType,true), StructField(mac_address,StringType,true))
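
When the column names are only known at run time, the same schema can be generated rather than written out field by field. The sketch below assumes the names arrive as a space-separated string (`schemaString` is a hypothetical name) and that every column is a nullable string. It also shows the builder-style `add` calls as an equivalent alternative.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Field names encoded in a single space-separated string.
val schemaString = "host_name ip_address mac_address"

// One nullable string field per name.
val generatedSchema = StructType(
  schemaString.split(" ").map(name => StructField(name, StringType, nullable = true)))

// Equivalent builder-style construction.
val builtSchema = new StructType()
  .add("host_name", StringType, nullable = true)
  .add("ip_address", StringType, nullable = true)
  .add("mac_address", StringType, nullable = true)
```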

### Step 3: Apply the schema to the RDD of `Row`s via `createDataFrame`

In [4]:
// Construct DataFrame from RDD
val df = spark.createDataFrame(rowRDD, schema)

[host_name: string, ip_address: string ... 1 more field]
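
The schema does not have to consist of strings only. The sketch below uses a hypothetical `port` column, with made-up values, purely to illustrate a non-string type. Each value in a `Row` has to match the declared type of its field. To my understanding a mismatch only surfaces once an action runs, not at the `createDataFrame` call itself.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("typed-schema").getOrCreate()

// Hypothetical extra column `port`, used only to illustrate a non-string type.
val typedSchema = StructType(Array(
  StructField("host_name", StringType, nullable = false),
  StructField("port", IntegerType, nullable = true)))

// Each Row value matches the declared type: a String, then an Int (or null).
val typedRows = session.sparkContext.parallelize(Seq(
  Row("dat_pc1", 22),
  Row("dat_mac1", null)))

val typedDf = session.createDataFrame(typedRows, typedSchema)
typedDf.printSchema()
```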

In [5]:
// Creates a temporary view `HostInfo` using the DataFrame
df.createOrReplaceTempView("HostInfo")
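
A temporary view is scoped to the session that created it. The sketch below shows one way to check what the session can see and to drop a view again. It registers its own hypothetical view, `HostInfoDemo`, so that the `HostInfo` view used in the following cells is left untouched.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("temp-view").getOrCreate()

val demoSchema = StructType(Array(StructField("host_name", StringType, nullable = true)))
val demoDf = session.createDataFrame(
  session.sparkContext.parallelize(Seq(Row("dat_pc1"), Row("dat_mac1"))), demoSchema)

// Hypothetical view name, chosen so that it does not clash with `HostInfo`.
demoDf.createOrReplaceTempView("HostInfoDemo")

// List what the session catalog can see; temporary views are flagged as such.
val visibleTables = session.catalog.listTables().collect()
visibleTables.foreach(t => println(s"${t.name} temporary=${t.isTemporary}"))

// Remove the view once it is no longer needed.
session.catalog.dropTempView("HostInfoDemo")
```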

### Step 4: Query the DataFrame using Spark SQL

In [21]:
val results = spark.sql("SELECT * FROM HostInfo LIMIT 10")
results.show()

+---------+----------+---------------+
|host_name|ip_address|    mac_address|
+---------+----------+---------------+
|  dat_pc1|  10.2.3.5|asda4:1241:4124|
| dat_mac1|  10.2.3.4| bas4:1241:4125|
+---------+----------+---------------+



[host_name: string, ip_address: string ... 1 more field]
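
A SQL query against the view and the corresponding DataFrame calls should yield the same rows. The sketch below compares the two for a simple filter. The view name `HostInfoFilterDemo` and the `session` and `hosts` values are hypothetical, and the data is a two-column subset of the rows used above.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("sql-vs-api").getOrCreate()

val hostSchema = StructType(Array(
  StructField("host_name", StringType, nullable = true),
  StructField("ip_address", StringType, nullable = true)))
val hosts = session.createDataFrame(
  session.sparkContext.parallelize(Seq(
    Row("dat_pc1", "10.2.3.5"),
    Row("dat_mac1", "10.2.3.4"))),
  hostSchema)
hosts.createOrReplaceTempView("HostInfoFilterDemo")

// The same filter, once as SQL text and once through the DataFrame API.
val viaSql = session.sql(
  "SELECT ip_address FROM HostInfoFilterDemo WHERE host_name = 'dat_pc1'")
val viaApi = hosts.filter(col("host_name") === "dat_pc1").select("ip_address")

viaSql.show()
viaApi.show()
```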

In [8]:
df.printSchema()

root
 |-- host_name: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- mac_address: string (nullable = true)



In [22]:
df.select("host_name").limit(10).show()

+---------+
|host_name|
+---------+
|  dat_pc1|
| dat_mac1|
+---------+
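
`show()` only prints the result. To work with the values in Scala, the fields of each returned `Row` can be read by name or by position. A minimal sketch, assuming the result is small enough to `collect()` to the driver (`session` and `hosts` are hypothetical names):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Reuses the running session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("row-access").getOrCreate()

val hostSchema = StructType(Array(
  StructField("host_name", StringType, nullable = true),
  StructField("ip_address", StringType, nullable = true)))
val hosts = session.createDataFrame(
  session.sparkContext.parallelize(Seq(
    Row("dat_pc1", "10.2.3.5"),
    Row("dat_mac1", "10.2.3.4"))),
  hostSchema)

// Read a field by name; this relies on the schema attached to the rows.
val hostNames: Array[String] = hosts.collect().map(row => row.getAs[String]("host_name"))

// Read a field by position (0-based) from the first row.
val firstIp: String = hosts.first().getString(1)
```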

