In [2]:
val nums = List(1 to 20: _*)
val rdd = sc.parallelize(nums,5)

// SQL analogue: SELECT * FROM table ORDER BY col4 DESC LIMIT 5;
rdd.top(5)

Array(20, 19, 18, 17, 16)

### take
Extracts the first n items of the RDD and returns them as an array. (Note: This sounds
very easy, but it is actually quite a tricky problem for the implementors of Spark because
the items in question can be in many different partitions.)

### takeOrdered
Orders the data items of the RDD using their implicit ordering and returns the first
n items (the n smallest under that ordering) as an array.
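
The ordering does not have to be the implicit one. As a minimal sketch, assuming the notebook's SparkContext is available as sc, an explicit Ordering can be passed in the second parameter list of takeOrdered. The data and the names scores, firstByName and lowest are made up for illustration.

```scala
// Assumes the notebook's SparkContext is available as `sc`.
// Hypothetical sample data: (name, score) pairs.
val scores = sc.parallelize(List(("dog", 3), ("cat", 1), ("ape", 2)), 2)

// Default: the implicit tuple ordering compares the name first.
val firstByName = scores.takeOrdered(1)

// Explicit ordering: compare by the score (second tuple element) only.
val lowest = scores.takeOrdered(2)(Ordering.by[(String, Int), Int](_._2))

println(firstByName.mkString(","))  // (ape,2)
println(lowest.mkString(","))       // (cat,1),(ape,2)
```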

### takeSample
Behaves differently from sample in the following respects:
• It will return an exact number of samples (Hint: 2nd parameter).
• It returns an Array instead of RDD.
• It internally randomizes the order of the items returned.
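
The differences listed above can be sketched side by side. This assumes the notebook's SparkContext is available as sc; the seed value 42 and the variable names are arbitrary choices for illustration.

```scala
// Assumes the notebook's SparkContext is available as `sc`.
val data = sc.parallelize(1 to 1000, 3)

// sample: the second argument is a fraction, the result is an RDD,
// and the number of sampled items is only approximately fraction * size.
val sampledRdd = data.sample(false, 0.1, 42L)

// takeSample: the second argument is an exact count and the result is a local Array.
val sampledArray = data.takeSample(false, 100, 42L)

println(sampledRdd.count())            // close to 100, but not guaranteed
println(sampledArray.length)           // exactly 100
println(sampledArray.distinct.length)  // also 100: no replacement, and the inputs are unique
```

With withReplacement set to true, as in the notebook cell, the same item can appear more than once in the result.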

### top
Utilizes the implicit ordering of T to determine the top k values and returns them as an
array.
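
One way to think about top is as takeOrdered with the ordering reversed, which corresponds to the SQL pattern ORDER BY ... DESC LIMIT k. The sketch below, assuming the notebook's SparkContext is available as sc, compares the two calls on the same data; the variable names are illustrative.

```scala
// Assumes the notebook's SparkContext is available as `sc`.
val values = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)

// The k largest values, largest first.
val viaTop = values.top(2)

// The same result expressed through takeOrdered with a reversed ordering.
val viaTakeOrdered = values.takeOrdered(2)(Ordering[Int].reverse)

println(viaTop.mkString(","))          // 9,8
println(viaTakeOrdered.mkString(","))  // 9,8
```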

In [12]:
val b = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"), 2)
b.take(2).foreach(println)

val b1 = sc.parallelize(List("dog", "cat", "ape", "salmon", "gnu"),2)
b1.takeOrdered(2).foreach(println)

val x = sc.parallelize(1 to 1000,3)
println()
println(x.takeSample(true,100,1).mkString(","))

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8) ,2)
println()
println(c.top(2).mkString(","))

dog
cat
ape
cat
764,815,274,452,39,538,238,544,475,480,416,868,517,363,39,316,37,90,210,202,335,773,572,243,354,305,584,820,528,749,188,366,913,667,214,540,807,738,204,968,39,863,541,703,397,489,172,29,211,542,600,977,941,923,900,485,575,650,258,31,737,155,685,562,223,675,330,864,291,536,392,108,188,408,475,565,873,504,34,343,79,493,868,974,973,110,587,457,739,745,977,800,783,59,276,987,160,351,515,901
9,8


### filter

Evaluates a boolean function for each data item of the RDD and puts the items for
which the function returned true into the resulting RDD.

When you provide a filter function, it must be able to handle all data items contained
in the RDD. For mixed data types, Scala provides so-called partial functions.
(Tip: Partial functions are very useful when some of your data may be bad and you do
not want to handle it, but you still want to apply some kind of map function to the
good (matching) data. The following article explains partial functions well, including
why case has to be used to define them:
http://blog.bruchez.name/2011/10/scala-partial-functions-without-phd.html)
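
To make the idea concrete without Spark, here is a small sketch on a plain Scala List. It shows that collect with a partial function behaves like a filter on isDefinedAt followed by a map. The names describe, mixed, described and spelledOut are made up for this example.

```scala
// A partial function defined only for Int and String inputs.
val describe: PartialFunction[Any, String] = {
  case i: Int    => s"$i is an integer"
  case s: String => s"$s is a string"
}

val mixed: List[Any] = List("cat", 4.0, 2, "dog", 3.5)

// collect applies the function only where isDefinedAt is true,
// so the Double values are silently skipped.
val described = mixed.collect(describe)

// The same effect, spelled out as filter followed by map.
val spelledOut = mixed.filter(describe.isDefinedAt).map(describe)

println(described.mkString("; "))
// cat is a string; 2 is an integer; dog is a string
```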

### filterWith
Deprecated: "use mapPartitionsWithIndex and filter".

This is an extended version of filter. It takes two function arguments. The first argument
must conform to Int ⇒ T and is executed once per partition. It transforms the
partition index to type T. The second function looks like (U, T) ⇒ Boolean, where T is
the transformed partition index and U is the type of the data items in the RDD. This
function has to return either true or false (i.e. apply the filter).
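
Following the deprecation hint, a filter that depends on the partition index might be written with mapPartitionsWithIndex as sketched below. This assumes the notebook's SparkContext is available as sc and that 1 to 9 is split evenly into three partitions; the predicate and the names numbers and kept are only illustrative.

```scala
// Assumes the notebook's SparkContext is available as `sc`.
val numbers = sc.parallelize(1 to 9, 3)

// Keep only the even items that live in partitions with index >= 1.
val kept = numbers.mapPartitionsWithIndex { (index, iter) =>
  // `index` is the raw partition index; filterWith would first have
  // transformed it with its first function argument.
  iter.filter(x => index >= 1 && x % 2 == 0)
}

println(kept.collect.mkString(","))  // 4,6,8 when the partitions are 1-3, 4-6, 7-9
```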



In [20]:
val a = sc.parallelize(1 to 10, 3)
println(a.filter(_ % 2 == 0).collect.mkString(","))
println(a.collect.mkString(","))

val b = sc.parallelize(1 to 8)
println(b.filter(_ < 4).collect.mkString(","))


// Error:
// val a1 = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
// a1.filter(_ < 4).collect
/* This fails because some elements of a1 are not implicitly comparable against integers.
   The alternative below uses collect with a partial function. collect uses the isDefinedAt
   method of the function object to determine whether the function is compatible with each
   data item. Only data items that pass this test (= filter) are then mapped using the
   function object. */

val a2 = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a2.collect({
  case a: Int    => "is integer"
  case b: String => "is string"
}).collect

val myfunc: PartialFunction[Any, Any] = {
  case a: Int    => "is integer"
  case b: String => "is string"
}

myfunc.isDefinedAt("")   // true
myfunc.isDefinedAt(1)    // true
myfunc.isDefinedAt(1.5)  // false

// Be careful! The above code works because it only checks the type itself. If you use
// operations on the value, you have to declare the concrete input type instead of Any,
// because Any has no method such as <, so the commented-out definition does not compile:

// val myfunc2: PartialFunction[Any, Any] = { case x if (x < 4) => "x" } // error
val myfunc3: PartialFunction[Int, Any] = { case x if (x < 4) => "x" }

2,4,6,8,10
1,2,3,4,5,6,7,8,9,10
1,2,3


In [15]:
val rdd = sc.parallelize(List("A","B","C","D"))
val str1 = "A"

val rslt1 = rdd.filter(x => {x != "A"}).count
val rslt2 = rdd.filter(x => { str1 != null && x != str1}).count

println("Demo closure: rslt1: " + rslt1 + " rslt2 : " + rslt2)

Demo closure: rslt1: 3 rslt2 : 3


### distinct
Returns a new RDD that contains each unique value only once.

SQL analogue: SELECT DISTINCT * FROM table;
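
Conceptually, removing duplicates requires bringing equal values together, which means a shuffle by value. The sketch below expresses that idea with reduceByKey, assuming the notebook's SparkContext is available as sc. It is meant to illustrate the mechanism and why distinct accepts a partition count, not to reproduce Spark's exact implementation; the dummy value 1 and the name manualDistinct are arbitrary.

```scala
// Assumes the notebook's SparkContext is available as `sc`.
val animals = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)

// Pair each item with a dummy value, merge duplicates by key into
// 2 output partitions, then drop the dummy value again.
val manualDistinct = animals
  .map(x => (x, 1))
  .reduceByKey((v, _) => v, 2)
  .map(_._1)

println(manualDistinct.collect.sorted.mkString(","))  // Cat,Dog,Gnu,Rat
println(manualDistinct.partitions.length)             // 2
```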

In [24]:
println(this.getClass.getSimpleName)
val c = sc.parallelize(List("Gnu","Cat","Rat","Dog","Gnu","Rat") ,2)
c.distinct.collect.foreach(println)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5))
println(a.distinct(2).partitions.length) //2
println(a.distinct(3).partitions.length) //3
println("-------------------------------------------------------")

a.distinct(3).foreachPartition(p => {
    p.foreach(println)
    println("-------------------------------------------------------")
})

$iw
Dog
Cat
Gnu
Rat
2
3
-------------------------------------------------------


#### Bloom Filter

In [26]:
import breeze.util.BloomFilter
val nums = List(1 to 20: _*).map(_.toString)
val rdd = sc.parallelize(nums,5)

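// Build one Bloom filter per partition, then merge them with a bitwise OR.
// A Bloom filter can report false positives but never false negatives.
val bf = rdd.mapPartitions { iter =>
  val partitionBf = BloomFilter.optimallySized[String](10000, 0.0001)
  iter.foreach(i => partitionBf += i)
  Iterator(partitionBf)
}.reduce(_ | _)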

println(bf.contains("5"))
println(bf.contains("31"))

true
false
