---
layout: global
title: Interactive Data Analytics in SparkR
categories: [module]
navigation:
  weight: 75
  show: true
---

In this chapter, we will use the SparkR shell for interactive data exploration. The code snippets below assume Mac OS X, but Linux and Windows should work as well.

Preparing the Environment

Simply type in the following command from the root directory of your USB:

usb/$ spark/bin/sparkR

This starts a SparkR shell, which by default comes with a SparkContext variable sc and a SQLContext variable sqlContext.

> sc
Java ref type org.apache.spark.api.java.JavaSparkContext id 0
> sqlContext
Java ref type org.apache.spark.sql.SQLContext id 1

We are going to use the iris dataset that comes with R for the rest of this chapter. Load this into a DataFrame object as follows:

> irisDF <- createDataFrame(sqlContext, iris)

Let's take a look at the schema of irisDF before we start exploring it.

> printSchema(irisDF)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)

At any point, you can look at the Web UI at http://localhost:4040/ to see what's going on behind the scenes.

Interactive Analysis

Basic Operations

Let's take a peek at irisDF. You can use the take operation to get the first K records. Here, K = 2.

> take(irisDF, 2)

You should see a table of two rows printed out, with the columns from the schema we saw earlier. Each column has its own datatype. For example, the Species column (the 5th one) is of type character.

> typeof(take(irisDF, 2)[[5]])
[1] "character"

We can refer to a specific column using $:

> take(select(irisDF,irisDF$Species), 2)

We can also refer to multiple columns:

> take(irisDF[, c("Petal_Length", "Sepal_Length")], 2)
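These operations compose: a filter can be chained with a select before taking results. As a sketch, using the same columns as above:

```r
# Species of the first two flowers with petals longer than 6 cm
# (a sketch; assumes the same SparkR API used throughout this chapter)
> take(select(filter(irisDF, irisDF$Petal_Length > 6), irisDF$Species), 2)
```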

Now, let's see how many records in total are in this dataset.

> count(irisDF)

You can visit the Web UI for further exploration.

Find the count job that just ran, and click on it under the description column. Feel free to look at the Event Timeline and DAG Visualization tabs in this job's details page.

Advanced Operations

We can filter on patterns using like.

For example, let's count the number of rows for the Species "setosa".

> count(filter(irisDF, like(irisDF$Species, "setosa")))

We can also use where.

For example, the number of rows with Petal_Length less than 1.2 is:

> count(where(irisDF, irisDF$Petal_Length < 1.2))

Let's find out how many unique Species there are in this dataset, along with their individual counts.

We can do this by using group_by followed by agg:

> species <- agg(group_by(irisDF, irisDF$Species), count = n(irisDF$Species))
> head(species)
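agg is not limited to counting; other aggregate column functions work the same way. A sketch, assuming SparkR's mean function (not run here):

```r
# Average petal length per species (a sketch; mean is assumed
# to be available as a SparkR aggregate column function)
> head(agg(group_by(irisDF, irisDF$Species), avg_petal = mean(irisDF$Petal_Length)))
```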

We can sort the results by Species name:

> top_species <- head(arrange(species, asc(species$Species)))

We can use R's plotting capabilities from SparkR too:

> barplot(top_species$count, names.arg=top_species$Species)
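This works because head, like collect, brings the results back as a local R data.frame, which plain R functions such as barplot can consume directly. A minimal sketch:

```r
# collect() returns the full (small) result as a local R data.frame
> local_species <- collect(species)
> class(local_species)
[1] "data.frame"
```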

Mutate DataFrame

You can create new columns on the dataframe. For example, create a Petal_Area column as follows:

> irisDF$Petal_Area <- irisDF$Petal_Length * irisDF$Petal_Width

See the change in action:

> printSchema(irisDF)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)
 |-- Petal_Area: double (nullable = true)

Dropping a column is simple as well!

> irisDF$Petal_Area <- NULL
> printSchema(irisDF)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)

Running SQL Queries

First, we need to register a table corresponding to the irisDF DataFrame.

> registerTempTable(irisDF, "irisTable")

Now we can perform any SparkSQL operations on the irisTable table.

For example, let's count the number of rows for the Species "setosa" one more time, this time using SQL through SparkR.

> count(sql(sqlContext, "SELECT Species FROM irisTable WHERE Species LIKE 'setosa'"))
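The group_by/agg query from earlier can be expressed in SQL as well. A sketch against the same irisTable:

```r
# Per-species counts via SQL; head() brings the result back locally
> head(sql(sqlContext, "SELECT Species, COUNT(*) AS count FROM irisTable GROUP BY Species"))
```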

This brings us to the end of the SparkR chapter of the tutorial. You can explore the full API by using the command help(package=SparkR).