# Just Enough Spark - a Jupyter Notebook

You can execute each statement to generate the output, or choose any option in the "Cell" menu. Modify the statements to try new things. 


## An RDD is an immutable collection
* Most methods are *transformations*: they only set up the execution graph for Spark
* *Action* methods execute the graph
* Partial results can be cached for reuse

*RDDs are constructed with methods on the SparkContext (sc) object*

### RDDs can be created from files, Cassandra tables, Scala collections, and many other sources
Let's first look at creating an RDD from a Scala collection. We use the parallelize method.

In [2]:
val myrdd = sc.parallelize(Seq(4,5,6))

In [3]:
myrdd

Now keep only the even numbers

In [4]:
val evenNumbers = myrdd.filter( x => x % 2 == 0)
evenNumbers

Note that nothing has executed yet; we have only set up the execution graph. We'll use the *action* method *collect* to execute it and return all of the results to the driver as an array.

In [5]:
evenNumbers.collect
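
The list at the top of this section mentions that partial results can be cached for reuse. A minimal sketch of that idea, assuming the notebook's predefined `sc`; the names `numbers`, `evens`, `howMany`, and `evenSum` are illustrative and not part of the original notebook:

```scala
// `sc` is the SparkContext the notebook already provides.
val numbers = sc.parallelize(1 to 10)

// Transformation only: this line builds the graph, nothing runs yet.
val evens = numbers.filter(_ % 2 == 0)

// Mark the filtered RDD to be kept in memory once it has been computed.
evens.cache()

// First action: runs the graph and fills the cache.
val howMany = evens.count()

// Second action: can reuse the cached partitions instead of filtering again.
val evenSum = evens.reduce(_ + _)

println(s"count = $howMany, sum = $evenSum")   // count = 5, sum = 30
```

Without the call to `cache()`, each action would recompute the filter from the source collection.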

## Examine a table using CQL
(Jupyter notebook feature)
Use the %%Cql Magic to prefix your CQL.

In [6]:
%%Cql select * from music.tracks_by_album limit 5

## Creating RDDs from Cassandra Tables
* Can add a where clause to push the filter down to Cassandra
* Creates an RDD of CassandraRow objects
* .as will map each row to a case class or tuple for ease of use


In [7]:
val tracks = sc.cassandraTable("music","tracks_by_album")
tracks

In [8]:
tracks.take(2)

### Get the album and track in a tuple. This is the new syntax:

In [9]:
val albumTracks = sc.cassandraTable[(String,String)]("music","tracks_by_album").select("album_title","track_title")

The first 10 rows as tuples ....

In [10]:
albumTracks.take(10) foreach println

### Create RDDs from Cassandra Tables and return an RDD of case class objects
.as() will map the RDD's rows to a case class


In [11]:
case class Track(album_title: String,
                 year: Int,
                 number: Int,
                 album_genre: String,
                 performer: String,
                 track_title: String)

In [12]:
val tracks = sc.cassandraTable("music","tracks_by_album").as(Track)
tracks

In [13]:
tracks take 5 foreach println

## Some other useful actions ...
* first – same as take(1)(0)
* collect – bring everything back to the caller as a Scala array
* saveToCassandra – write the RDD to a Cassandra table
* count – return the number of elements in the RDD


In [14]:
tracks.first

In [15]:
tracks.count

## Some Typical Transformations
filter, map, distinct

Show tracks from 1989

In [16]:
tracks.filter(x => x.year == 1989).take(10).foreach(println)


**This can also be accomplished with a .where function on the cassandraTable to push the work into Cassandra**
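
A sketch of the pushed-down version. It assumes `year` is a column that Cassandra will accept in a WHERE clause for this table; Cassandra restricts which columns can be filtered server-side, and if it rejects the predicate the Spark-side `filter` shown earlier remains the fallback. The name `tracks1989` is illustrative:

```scala
import com.datastax.spark.connector._

// ASSUMPTION: `year` can be restricted server-side for music.tracks_by_album.
// `where` takes a CQL fragment; each `?` is bound to the following arguments.
val tracks1989 = sc.cassandraTable("music", "tracks_by_album")
  .where("year = ?", 1989)

// Only the matching rows leave Cassandra; Spark never sees the rest.
tracks1989.take(10).foreach(println)
```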

Map the tracks to 2-tuples

In [17]:
tracks.map(x =>(x.album_title, x.track_title)).
   take(5).foreach(println)

Combine operations into a single graph, or even a single statement

In [18]:
tracks.filter(x => x.year == 1990).map(x => (x.album_title, x.track_title)).take(5).foreach(println)
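
`distinct` is listed among the typical transformations but is not exercised in the cells of this section. A small sketch on made-up in-memory data (the album names and years are invented for illustration; `sc` is the notebook's SparkContext):

```scala
// Sample (album, year) pairs; these are not rows from the music keyspace.
val albumYears = sc.parallelize(Seq(
  ("Album A", 1989),
  ("Album B", 1990),
  ("Album C", 1989),
  ("Album D", 1995)))

// map keeps only the year, distinct drops the repeats; both are lazy.
val uniqueYears = albumYears.map(_._2).distinct()

// collect is the action. Sort locally, since distinct does not guarantee order.
val sortedYears = uniqueYears.collect().sorted
println(sortedYears.mkString(", "))   // 1989, 1990, 1995
```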


## Pair RDDs – Special operations on RDDs of 2-tuples
* Think of each tuple as (Key,Value)
* countByKey
* groupByKey
* reduceByKey
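
The three operations listed above can be compared side by side on a tiny pair RDD. This is a sketch on made-up data, assuming the notebook's `sc`; the names `pairs`, `counts`, `countsRdd`, and `grouped` are illustrative:

```scala
// Each tuple is (key, value): here (album, track).
val pairs = sc.parallelize(Seq(
  ("Album A", "Track 1"),
  ("Album A", "Track 2"),
  ("Album B", "Track 1")))

// countByKey is an action: it returns a local Map of key -> count to the driver.
val counts = pairs.countByKey()

// reduceByKey is a transformation: the same counts, but still a distributed RDD.
val countsRdd = pairs.mapValues(_ => 1).reduceByKey(_ + _)

// groupByKey is a transformation that gathers all values for each key.
val grouped = pairs.groupByKey().mapValues(_.toList.sorted)

println(counts)
countsRdd.collect().sortBy(_._1).foreach(println)   // (Album A,2) then (Album B,1)
grouped.collect().sortBy(_._1).foreach(println)
```

When only an aggregate per key is needed, `reduceByKey` is usually preferable to `groupByKey`, because it combines values within each partition before shuffling them.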


In [19]:
val albumTracks = tracks.map(t => (t.album_title, t.track_title))

How many tracks in each album?

In [20]:
val trackCounts = albumTracks.countByKey

Why not sort the results in descending order? toList turns the map into a list of (album, count) tuples, and sortBy on the negated count puts the largest counts first.

In [21]:
albumTracks.countByKey.toList.sortBy( t => -t._2 ) take 10 foreach println
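
`countByKey` brings the whole map of counts back to the driver, which is fine for a modest number of albums. If there were very many keys, one alternative is to count and sort inside Spark and `take` only the top rows. A sketch on made-up data, assuming the notebook's `sc`; `samplePairs` and `topAlbums` are illustrative names:

```scala
// Made-up (album, track) pairs standing in for albumTracks.
val samplePairs = sc.parallelize(Seq(
  ("Album A", "Track 1"), ("Album A", "Track 2"), ("Album A", "Track 3"),
  ("Album B", "Track 1"),
  ("Album C", "Track 1"), ("Album C", "Track 2")))

val topAlbums = samplePairs
  .mapValues(_ => 1)                  // (album, 1) for every track
  .reduceByKey(_ + _)                 // (album, trackCount), still distributed
  .sortBy(_._2, ascending = false)    // largest counts first, sorted by Spark
  .take(2)                            // only the top rows reach the driver

topAlbums.foreach(println)            // (Album A,3) then (Album C,2)
```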

In [22]:
tracks.filter(_.album_title == "Greatest Hits").collect foreach println

Track(Greatest Hits,1995,1,Unknown,Wesley Willis,Rock n Roll McDonald's)
Track(Greatest Hits,1995,2,Unknown,Wesley Willis,Larry Nevers/ Walter Budzyn)
Track(Greatest Hits,1995,3,Unknown,Wesley Willis,Rick Sims)
Track(Greatest Hits,1995,4,Unknown,Wesley Willis,Outburst)
Track(Greatest Hits,1995,5,Unknown,Wesley Willis,Chronic Schizophrenia)
Track(Greatest Hits,1995,6,Unknown,Wesley Willis,Urge Overkill)
Track(Greatest Hits,1995,7,Unknown,Wesley Willis,Skrew)
Track(Greatest Hits,1995,8,Unknown,Wesley Willis,Tammy Smith)
Track(Greatest Hits,1995,9,Unknown,Wesley Willis,Vampire Bat)
Track(Greatest Hits,1995,10,Unknown,Wesley Willis,Elvis Presley)
Track(Greatest Hits,1995,11,Unknown,Wesley Willis,The Chicken Cow)
Track(Greatest Hits,1995,12,Unknown,Wesley Willis,Kris Kringle Was A Cat Thief)
Track(Greatest Hits,1995,13,Unknown,Wesley Willis,Eazy-E)
Track(Greatest Hits,1995,14,Unknown,Wesley Willis,Jesus Is the Answer)
Track(Greatest Hits,1995,15,Unknown,Wesley Willis,He's Doing Time In Jail