---
layout: global
title: Interactive Data Analytics in SparkR
categories:
navigation:
---
In this chapter, we will use the SparkR shell for interactive data exploration. The code snippets below assume Mac OS X, but Linux and Windows should work as well.
Simply type in the following command from the root directory of your USB:

```
usb/$ spark/bin/sparkR
```
This starts a SparkR shell, which by default comes with a `SparkContext` variable `sc` and a `SQLContext` variable `sqlContext`.

```
> sc
Java ref type org.apache.spark.api.java.JavaSparkContext id 0
> sqlContext
Java ref type org.apache.spark.sql.SQLContext id 1
```
We are going to use the `iris` dataset that comes with R for the rest of this chapter. Load it into a `DataFrame` object as follows:

```
> irisDF <- createDataFrame(sqlContext, iris)
```
Let's take a look at the schema of `irisDF` before we start exploring it.

```
> printSchema(irisDF)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)
```
At any point, you can look at the Web UI at http://localhost:4040/ to see what's going on behind the scenes.
Let's take a peek at `irisDF`. You can use the `take` operation to get the first K records; here, K = 2.

```
> take(irisDF, 2)
```
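Since `iris` ships with R, the two rows printed should look roughly like the following (the exact formatting can vary across SparkR versions):

```
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
```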
The table's columns come from the schema we saw earlier, and each column has its own data type. For example, the `Species` column (the 5th one) is of type `character`.

```
> typeof(take(irisDF, 2)[[5]])
[1] "character"
```
We can refer to a specific column using `$`:

```
> take(select(irisDF, irisDF$Species), 2)
```
We can also refer to multiple columns:

```
> take(irisDF[, c("Petal_Length", "Sepal_Length")], 2)
```
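If you find `select` more readable, it also accepts column names directly, so the same projection can be written as the sketch below and should print the same two rows:

```
> take(select(irisDF, "Petal_Length", "Sepal_Length"), 2)
```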
Now, let's see how many records in total are in this dataset.

```
> count(irisDF)
```
You can visit the Web UI for further exploration. Find the `count` job that just ran, and click on it under the Description column. Feel free to look at the `Event Timeline` and `DAG Visualization` tabs on this job's details page.
We can filter on patterns using `like`. For example, let's count the number of rows where `Species` is "setosa".

```
> count(filter(irisDF, like(irisDF$Species, "setosa")))
```
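Since we are matching a literal string here rather than a pattern, a plain equality comparison should give the same count; a minimal sketch:

```
> count(filter(irisDF, irisDF$Species == "setosa"))
```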
We can also filter using `where`. For example, the number of rows with `Petal_Length` less than 1.2 is:

```
> count(where(irisDF, irisDF$Petal_Length < 1.2))
```
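Column expressions can be combined with `&` (and `|`), so several conditions can be checked in one query. A sketch, assuming the usual SparkR column operators:

```
> count(where(irisDF, irisDF$Petal_Length < 1.2 & irisDF$Petal_Width < 0.3))
```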
Let's find out how many unique `Species` there are in this dataset, along with their individual counts. We can do this with a `group_by` followed by an aggregation using `agg`:

```
> species <- agg(group_by(irisDF, irisDF$Species), count = n(irisDF$Species))
> head(species)
```
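`agg` is not limited to counts; other aggregate functions such as `avg` work the same way. For example, a sketch computing the mean petal length per species (the `Mean_Petal_Length` name is just for illustration):

```
> head(agg(group_by(irisDF, irisDF$Species), Mean_Petal_Length = avg(irisDF$Petal_Length)))
```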
We can sort the results by `Species` name:

```
> top_species <- head(arrange(species, asc(species$Species)))
```
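To sort the other way, `desc` works analogously; for example, to put the most frequent species first:

```
> head(arrange(species, desc(species$count)))
```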
We can use R's plotting capabilities from SparkR too:

```
> barplot(top_species$count, names.arg = top_species$Species)
```
You can create new columns on the DataFrame. For example, create a `Petal_Area` column as follows:

```
> irisDF$Petal_Area <- irisDF$Petal_Length * irisDF$Petal_Width
```
See the change in action:

```
> printSchema(irisDF)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)
 |-- Petal_Area: double (nullable = true)
```
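As an aside, if you prefer not to modify `irisDF` in place, `withColumn` returns a new DataFrame with the extra column instead; a sketch (the `irisDF2` and `Sepal_Area` names are just for illustration):

```
> irisDF2 <- withColumn(irisDF, "Sepal_Area", irisDF$Sepal_Length * irisDF$Sepal_Width)
```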
Dropping a column is simple as well!

```
> irisDF$Petal_Area <- NULL
> printSchema(irisDF)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)
```
SparkR also lets us query DataFrames with SQL. First, we need to register a temporary table corresponding to the `irisDF` DataFrame.

```
> registerTempTable(irisDF, "irisTable")
```
Now we can perform SparkSQL operations on the `irisTable` table. For example, let's count the number of rows where `Species` is "setosa" one more time, this time using SQL through SparkR.

```
> count(sql(sqlContext, "SELECT Species FROM irisTable WHERE Species LIKE 'setosa'"))
```
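Any query supported by Spark SQL should work here; for instance, the earlier group-by can also be expressed entirely in SQL (a sketch):

```
> head(sql(sqlContext, "SELECT Species, COUNT(*) AS count FROM irisTable GROUP BY Species"))
```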
This brings us to the end of the SparkR chapter of the tutorial.
You can explore the full API by using the command `help(package=SparkR)`.