# Just Enough Spark - a Jupyter Notebook

You can execute each statement to generate the output, or choose any option in the "Cell" menu. Modify the statements to try new things. 


## An RDD is an immutable collection
* Most methods (transformations) only set up the execution graph for Spark
* Action methods execute the graph
* Partial results can be cached for reuse

*RDDs are constructed with methods on the SparkContext (sc) object*

### RDDs can be created from files, Cassandra tables, Scala collections, and many other sources
Let's first look at creating an RDD from a Scala collection. We use the parallelize function.

In [None]:
val myrdd = sc.parallelize(Seq(4,5,6))

In [None]:
myrdd

Now filter the RDD to keep only the even numbers

In [None]:
val evenNumbers = myrdd.filter( x => x % 2 == 0)
evenNumbers

Note that nothing has executed yet; we have only set up the execution graph. We'll use the *action* method *collect* to execute it and return all of the results as an array.

In [None]:
evenNumbers.collect
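The same "set up now, run later" idea can be seen without Spark. The cell below is only a local analogy using a plain Scala collection view, not an RDD: the counter `calls` and the names `lazyEvens`, `callsBefore`, and `callsAfter` are made up for illustration.

```scala
// Local analogy only: a Scala view defers work in a similar spirit to an RDD transformation.
var calls = 0

// Building the filtered view does not run the predicate yet.
val lazyEvens = Seq(4, 5, 6).view.filter { x => calls += 1; x % 2 == 0 }
val callsBefore = calls          // still 0: nothing has been evaluated

// toList forces evaluation, much like an action such as collect does for an RDD.
val evens = lazyEvens.toList
val callsAfter = calls           // now greater than 0: the predicate has run

println(s"before: $callsBefore, after: $callsAfter, result: $evens")
```

Spark works on the same principle, except that the deferred work is distributed across the cluster when the action runs.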

## Examine a table using CQL
(Jupyter notebook feature)
Use the %%Cql magic to prefix your CQL. The %%showschema magic displays a table's schema.

In [None]:
%%showschema music.tracks_by_album

In [None]:
%%Cql select * from music.tracks_by_album limit 5

## Creating RDDs from Cassandra Tables
* Can add a where clause to push the filter down to Cassandra
* Creates an RDD of CassandraRow objects
* .as will map it to a case class or tuples for ease of use


In [None]:
val tracks = sc.cassandraTable("music","tracks_by_album")
tracks

In [None]:
tracks.first


### Get the album and track in a tuple. This is the new syntax:

In [None]:
val albumTracks = sc.cassandraTable[(String,String)]("music",
"tracks_by_album").select("album_title","track_title")
albumTracks

The first 10 rows as tuples ....

In [None]:
albumTracks.take(10) foreach println

### Create RDDs from Cassandra Tables and return an RDD of case class objects
A type parameter on cassandraTable (or .as()) will map each row of the RDD to a case class


In [None]:
case class Track(album_title: String,
                 album_year: Int,
                 track_number: Int,
                 album_genre: Option[String],
                 performer: Option[String],
                 track_title: String)

In [None]:
val tracks = sc.cassandraTable[Track]("music","tracks_by_album")
tracks

In [None]:
tracks take 5 foreach println

## Some other useful actions ...
* first – same as take(1)(0)
* collect – bring everything back to the caller as a Scala array
* saveToCassandra
* count
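To try these actions without touching Cassandra, here is a minimal sketch on a small in-memory RDD. It assumes the notebook's `sc` is available; the name `sample` and its values are made up for illustration.

```scala
// Hypothetical sample data; assumes the notebook's SparkContext `sc`.
val sample = sc.parallelize(Seq(4, 5, 6))

// first returns the first element, the same value as take(1)(0).
val firstElem = sample.first
val viaTake   = sample.take(1)(0)

// count returns the number of elements as a Long.
val howMany = sample.count

// collect brings every element back to the driver as an Array.
val everything = sample.collect

println(s"first=$firstElem take(1)(0)=$viaTake count=$howMany collect=${everything.mkString(",")}")
```

Keep in mind that collect pulls the whole RDD into the driver's memory, so on a large table take(n) is usually the safer way to peek at data.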


In [None]:
tracks.first

In [None]:
tracks.count

## Some Typical Transformations
filter, map, distinct
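The cells that follow show filter and map on the tracks table. As a small sketch of distinct, here it is on in-memory data, assuming the notebook's `sc`; the name `years` and its values are made up for illustration.

```scala
// Hypothetical sample data; assumes the notebook's SparkContext `sc`.
val years = sc.parallelize(Seq(1989, 1990, 1989, 1991, 1990))

// distinct is a transformation: it describes an RDD with duplicates removed.
val distinctYears = years.distinct()

// collect is the action that runs it; the result order is not guaranteed, so sort locally.
val sortedYears = distinctYears.collect().sorted

println(sortedYears.mkString(", "))
```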

Show tracks from 1989

In [None]:
tracks.filter(x => x.album_year == 1989).take(10).foreach(println)


**This can also be accomplished with a .where function on the cassandraTable to push the work into Cassandra**

Map the Cassandra table to 2-tuples

In [None]:
tracks.map(x =>(x.album_title, x.track_title)).
   take(5).foreach(println)

Combine operations into a single graph or even a single statement

In [None]:
tracks.filter(x => x.album_year == 1990).
map(x => (x.album_title, x.track_title)).
take(5).foreach(println)


## Pair RDDs – Special operations on RDDs of 2-tuples
* Think of each tuple as (Key,Value)
* countByKey
* groupByKey
* reduceByKey
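Here is a minimal sketch of the three operations listed above on a tiny in-memory pair RDD, assuming the notebook's `sc`. The album and track names are placeholders standing in for (album_title, track_title) pairs, not data from the music keyspace.

```scala
// Hypothetical sample pairs; assumes the notebook's SparkContext `sc`.
val samplePairs = sc.parallelize(Seq(
  ("Album A", "Track 1"),
  ("Album A", "Track 2"),
  ("Album B", "Track 1")))

// countByKey is an action: it returns a local Map of key -> count to the driver.
val counts = samplePairs.countByKey()

// groupByKey is a transformation: one (key, Iterable of values) pair per key.
val grouped = samplePairs.groupByKey().mapValues(_.toList.sorted).collect().toMap

// reduceByKey is a transformation: values for each key are merged with the given function.
// Here each track becomes a 1 and the 1s are summed, giving a count per album.
val reduced = samplePairs.mapValues(_ => 1).reduceByKey(_ + _).collect().toMap

println(counts)
println(grouped)
println(reduced)
```

countByKey returns its result straight to the driver, while groupByKey and reduceByKey produce new RDDs that stay distributed until an action such as collect is called.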


In [None]:
val albumTracks = tracks.map(t => (t.album_title, t.track_title))

How many tracks in each album?

In [None]:
val trackCounts = albumTracks.countByKey
trackCounts

Why not sort the results in descending order? toList turns the map into a list of tuples, and sortBy on the negative of the count orders them from largest to smallest.

## Top 10 List


In [None]:
albumTracks.countByKey.toList.sortBy( t => -t._2 ) take 10 foreach println

In [None]:
tracks.filter(_.album_title == "Greatest Hits").collect foreach println

In [None]:
var x:Option[Int] = Some(5)

In [None]:
x

In [None]:
x = None

In [None]:
x.orElse(Some(0))
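The Option cells above are presumably a warm-up for the optional fields in the Track case class (album_genre and performer). Here is a plain-Scala sketch of common ways to handle such fields; `TrackInfo` and its values are hypothetical stand-ins, not rows from the table.

```scala
// Hypothetical stand-in for a row with an optional column.
case class TrackInfo(track_title: String, performer: Option[String])

val rows = Seq(
  TrackInfo("Track 1", Some("Performer A")),
  TrackInfo("Track 2", None))

// getOrElse unwraps the value or substitutes a default.
val performers = rows.map(r => r.performer.getOrElse("unknown"))

// orElse supplies a fallback but keeps the result wrapped in an Option.
val fallback: Option[Int] = (None: Option[Int]).orElse(Some(0))

// flatMap over an Option field keeps only the rows where a value is present.
val knownPerformers = rows.flatMap(_.performer)

println(performers)
println(fallback)
println(knownPerformers)
```

The same calls work inside RDD operations such as map and filter, since the functions passed to them are ordinary Scala.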

In [None]:
tracks.filter(_.album_title == "Greatest Hits").saveAsTextFile("cfs:///tmp/tracks2")