Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
701 lines (501 sloc) 23.7 KB

FunProg week 6: Collections

In September 2012, I enrolled in the course Functional Programming Principles in Scala given by Martin Odersky at Coursera. In that course I learnt about functional programming and how it can be effectively combined with object-oriented programming. These are all my compiled notes from that course.

Vectors

We have seen that lists are linear: access to the first element is much faster than access to the middle or end of a list. Vectors, on the other hand, have more evenly balanced access patterns.

A vector is constructed using an array of maximum 32 elements. If it needs to contain more than 32 elements, it becomes an array of 32 pointers to 32 arrays of 32 elements. And so on -adding levels- until there is space for all the elements. So a depth of six can handle around a billion elements.

Access to random elements in a vector is related to its depth, so obviously it's very efficient. Lists are a better choice when we use a pattern where we use a list's head and a tail. Otherwise vectors are generally a better choice.

We create vectors in a similar way to lists:

val nums = Vector(1, 2, 3, -4)
val people = Vector("Bob", "James", "Peter")

And they support the same operations as lists, with the exception of :: (but we can use x +: xs or xs :+ x instead to create a new vector with a new element at the beginning or the end of previous vector). Please note that vectors are immutable, so when we say we modify them we are actually creating a new one.

Ranges

Ranges are another simple kind of sequence that represent a sequence of evenly spaced integers using to (inclusive), until (exclusive) and by (to determine step value). Ranges are represented as single objects with three fields: lower bound, upper bound and step value.

val r: Range = 1 until 5 // 1, 2, 3, 4
val s: Range = 1 to 5    // 1, 2, 3, 4, 5
1 to 10 by 3             // 1, 4, 7, 10
6 to 1 by -2             // 6, 4, 2

Operations on sequences

Sequences have a lot of available methods that can be found at Scala Standard Library, being these some of the most common:

  • xs exists p returns true if the sequence contains any element satisfying the predicate,
  • xs forall p returns true if all elements in the sequence satisfy the predicate,
  • xs zip ys returns a sequence of pairs taking one element from each sequence for each pair (the sequence with less elements will set the length of the sequence sequence of pairs),
  • xs.unzip returns two sequences from a sequence of pairs,
  • xs.flatMap f applies function f to all elements in the sequence and concatenates the results,
  • xs.sum returns the sum of all elements in the sequence,
  • xs.product returns the product of all elements in the sequence,
  • xs.max returns the maximum of all elements in the sequence (an Ordering is required),
  • xs.min returns the maximum of all elements in the sequence (an Ordering is required).

Some examples

Let's see how we could get all combinations of numbers x (from 1 to M) and y (from 1 to N):

val M = 3
val N = 5   
(1 to M) flatMap (x => (1 to N) map (y => (x, y)))
//> Vector((1,1), (1,2), (1,3), (1,4), (1,5), 
//>        (2,1), (2,2), (2,3), (2,4), (2,5), 
//>        (3,1), (3,2), (3,3), (3,4), (3,5))

And now how we could compute the scalar product of two vectors:

def scalarProduct1(xs: Vector[Double], ys: Vector[Double]): Double =
  (xs zip ys).map(xy => xy._1 * xy._2).sum

Or same thing using patter matching function value:

def scalarProduct2(xs: Vector[Double], ys: Vector[Double]): Double =
  (xs zip ys).map{ case (x, y) => x * y }.sum

And how we could say if a number is prime (that is, its only divisors are 1 and the number itself):

def isPrime(n: Int): Boolean = (2 until n) forall (n % _ != 0)

isPrime(7)  // true
isPrime(14) // false

Collection hierarchy

List and Vector have a common base class, Seq, which is actually the class of all sequences. And Seq itself is a subclass of Iterable same way as Set and Map are.

Array and String support the same operations as Seq and can implicitly be converted to sequences where needed, but are not direct subclasses of Seq because they come directly from Java. Anyway we can do things like these:

val xs: Array[Int] = Array(1, 2, 3)
xs map (x => 2 * x) // Array(2, 4, 6)

val ys: String = "Hello World!"
ys filter (_.isUpper) // "HW"

This is a simplification of the hierarchy we've seen so far:

Iterable

Map -> Iterable
Set -> Iterable
Seq -> Iterable

List   -> Seq
Vector -> Seq
Range  -> Seq

String ~> Seq
Array  ~> Seq

-> means "it's subclass of", ~> means "it's not subclass of, but very works as if it was".

Combinatorial search

Higher order functions on collections often replace loops on imperative languages.

Let's see that with an example. Say we have a positive integer n and want to find all pairs of positive integers i and j such that i + j is prime and `1 <= j < i < n'. For example, for 7 the sought pairs are these ones:

i: 2  3  4  4  5  6  6
j: 1  2  1  3  2  1  5

In imperative languages we can use two nested loops, one for i and one for j and check their sum on each iteration to see if we need to store that combination in any kind of collection.

In a functional language we look for a more natural way to do the same: first we generate the sequence of all pairs of integers fulfilling the 1 <= j < i < n' condition; after that we filter the pairs for whichi + j` is prime.

For the first part we can use some tools on sequences we've seen before and write this:

(1 until n) map (i =>
  (1 until i) map (j => (i, j)))

If we try this (i.e. in a Scala Worksheet) we'll see we get a vector of vectors of pairs, but we actually just want a single collection of pairs.

To achieve this, we can use the foldRight method we saw in previous sessions and the concatenation operation:

(xss foldRight Seq[Int]())(_ ++ _)

We use the xss notation to show its a collection of collections. There's actually a built-in method to do exactly this same operation:

xss.flatten

So we can rewrite previous code as follows:

((1 until n) map (i =>
  (1 until i) map (j => (i, j)))).flatten

And we'll get a vector of pairs (so a simple collection of pairs, which is what we wanted).

There's another way to achieve the same, taking into account that:

xs flatMap f = (xs map f).flatten

So our code can be written this way:

(1 until n) flatMap (i =>
  (1 until i) map (j => (i, j)))

The first part of our problem is solved: generating the sequence of all paris of integers fulfilling the 1 <= j < i < n condition. Now we want to filter the pairs for which i + j is prime. To do that we can simply use the filter method:

(1 until n) flatMap (i =>
  (1 until i) map (j => (i, j))) filter (pair =>
	isPrime(pair._1 + pair._2))

And we finally get a collection with all pairs which fulfil the condition.

OK, this already works, but it's not very easy to follow and understand, it's probably less clear than the imperative language approach in fact. That's why we have For expressions.

For expressions

Higher order functions such as map, flatMap or filter provide powerful constructs for manipulating lists. But sometimes the level of abstraction required by these functions make the program difficult to understand. In these situations, the for expression can help.

Imagine we have a persons list of elements of a class as follows:

case class Person(name: String, age: Int)

To get the names of all persons over 20 years old we could write:

persons filter (p => p.age > 20) map (p => p.name)

But we can instead use a for expression:

for (p <- persons if p.age > 20) yield p.name

The for expression is similar to loops in imperative languages, except that it builds a list of the results of all iterations.

A for expressions is always of the form:

for (s) yield e

Where s is a sequence of generators and filters and e is an expression whose value is returned by an iteration (yield sets what we want to produce with the expression):

  • a generator is of the form p <- e where p is a pattern and e an expression whose value is a collection,
  • a filter is of the form if f where f is a boolean expression,
  • the sequence must start with a generator,
  • if there are several generators in the sequence, the last generators vary faster than the first.

Instead of (s), we can write {s} and then the sequence of generators and filters can be written on multiple lines without requiring semicolons.

Let's see now how we can solve our previous problem (pairs where the sum of both numbers is prime) using a for expression:

for {
  i <- 1 until n
  j <- 1 until i
  if isPrime(i + j)
} yield (i, j)

This is obviously a much cleaner and easy to understand implementation.

Finally, this is a for expression for the scalarProduct we saw in previous session:

def scalarProduct(xs: Vector[Double], ys: Vector[Double]): Double =
  (for ((x, y) <- xs zip ys) yield x * y).sum

Sets

In previous sessions we showed this basic collection hierarchy:

Iterable

Map -> Iterable
Set -> Iterable
Seq -> Iterable

List   -> Seq
Vector -> Seq
Range  -> Seq

String ~> Seq
Array  ~> Seq

So we're now taking a look at sets. We can construct them easily:

val fruit = Set("apple", "banana", "pear")
val s = (1 to 6).toSet

And most operations on sequences are also available on sets. Nevertheless, there are some principal differences between sets and sequences:

  • sets are unordered (the elements of a set do not have a predefined order in which they appear in the set),
  • sets do not have duplicate elements,
  • the fundamental operation on sets is contains.

N queens problem

A classical problem in computing is the eight queens problem (or, more in general, N queens problem) which consists in placing queens on a chessboard so that no one is threatened by another. For example, for N = 4:

: X : :    row 0 col 1
: : : X    row 1 col 3
X : : :    row 2 col 0
: : X :    row 3 col 2

This is just a valid solution, but our goal is to find a set with all the possible solutions.

We can solve this problem with a recursive algorithm:

  1. Suppose that we have already generated all the solutions consisting of placing k-1 queens on a board of size n.
  2. Each solution is represented by a list (of length k-1) containing the numbers of columns (between 0 and n-1).
  3. The column number of the queen in k-1 row comes first in the list, followed by the column number of the queen in row k-2, etc. So, in our previous example, List(0, 3, 1).
  4. The solution set is thus represented as a set of lists, with one element for each solution.
  5. Now, to place the k-th queen, we generate all possible extensions of each solution preceded by a new queen.

So, in code:

def queens(n: Int): Set[List[Int]] = {
  def placeQueens(k: Int): Set[List[Int]] =
	if (k == 0) Set(List())
	else 
	  for {
		queens <- placeQueens(k - 1)
		col <- 0 until n
		if isSafe(col, queens)
	  } yield col :: queens
  placeQueens(n)
}

We need a function to know if a queen is safe in a concrete column with a current queens distribution:

def isSafe(col: Int, queens: List[Int]): Boolean = {
  val row = queens.length // (1)
  val queensWithRow = (row - 1 to 0 by -1) zip queens // (2)
  queensWithRow forall {
	case (r, c) => col != c && math.abs(col - c) != row - r // (3)  
  }
}

In (1) we give a row value to the new queen.

In (2) we convert a list like List(0, 3, 1) into a new list of pairs with rows included (row, column) List((2,0), (1,3), (0,1)).

In (3) we check that the new column is not the same as any other's queen's column and that there is no diagonal between the new queen and any other.

Finally we just need a function to show the results in a graphic representation:

def show(queens: List[Int]) = {
  val lines =
	for (col <- queens.reverse)
	yield Vector.fill(queens.length)(": ").updated(col,"X ").mkString 
  "\n" + (lines mkString "\n")
}

And for N = 4 we get:

: : X : 
X : : : 
: : : X 
: X : : 

: X : : 
: : : X 
X : : : 
: : X :

The second solution is the one we showed at the beginning of this problem. The first one is a new solution that our algorithm has found. If we increase the value of N we'll find more and more different solutions.

Queries with for

If we analyze for expressions we've seen so far, they look a little bit like database queries. For instance, imagine we have a class Book as follows:

case class Book(title: String, authors: List[String])

We could create a little database with a simple list:

val books: List[Book] = List(
  Book(title = "book1 cs", authors = List("author11", "author12")),
  Book(title = "book2 maths", authors = List("author21")),
  Book(title = "book3 cs", authors = List("author31", "author32", "author33")),
  Book(title = "book4", authors = List("author11")))

Which is exactly the same as this other expression, just clearer:

val books: List[Book] = List(
  Book("book1 cs", List("author11", "author12")), ...)

We can now ask for the titles whose author is "author32":

for (b <- books; a <- b.authors if a == "author32") 
yield b.title

Note that there is an iteration on the books and another on the authors of any book.

Or the books with the word "cs" in the title:

for (b <- books if b.title indexOf "cs" >= 0) 
yield b.title

A more complicated query could be this one, to get the names of all authors that have written at least two books:

for {
  b1 <- books
  b2 <- books
  if b1 != b2
  a1 <- b1.authors
  a2 <- b2.authors
  if a1 == a2
} yield a1

This gets the correct result, but duplicated, because it iterates all books both in b1 and in b2, so it creates all combination of books twice (in reversed order).

To avoid this, a first way would be to require that b1.title < b2.title instead of just b1 != b2. We still will have all combinations duplicated, but just produce results for one of them. But this is not a good enough solution, as if an author writes three books, let's say A, B and C, the condition will be satisfied for A < B, A < C and B < C, so again it will be shown more than once.

Another way, better, to solve that little problem is to use distinct method:

( for {
	b1 <- books
	b2 <- books
	if b1 != b2
	a1 <- b1.authors
	a2 <- b2.authors
	if a1 == a2
  } yield a1
).distinct

And the probably best option would be to define books from the very beginning as a Set instead of a List, so when we use for expressions we would get sets as well, and as sets can't have duplicate values we'll solve the problem.

For expressions and higher order functions

For expressions are very related to the higher order functions map, flatMap and filter. These functions can actually be defined in terms of for. But in realitiy, the Scala compiler does the opposite: it expresses for expressions in terms of higher order functions.

So when we write these expressions, the Scala compiler translates them into their equivalences:

(1) for (x <- e1) yield e2 
(1) e1.map(x => e2)

(2) for (x <- e1 if f; s) yield e2 
(2) for (x <- e1.withFilter(x => f); s) yield e2

(3) for (x <- e1; y <- e2; s) yield e3
(3) e1.flatMap(x => for (y <- e2; s) yield e3)

In (1) we translate a for expression into a map expression. In (2) and (3) we translate for expressions into new for expressions with less elements.

We actually saw this with a previous example where we wanted to get pairs of numbers whose sum was prime and got these two solutions:

for {
  i <- 1 until n
  j <- 1 until i
  if isPrime(i +j)
} yield (i, j)

(1 until n) flatMap (i =>
  (1 until i) map (j => (i, j))) filter (pair =>
	isPrime(pair._1 + pair._2))

Taking a look at them, we do check that for expressions can be expressed in terms of higher order functions map, flatMap and filter.

Or, another example, our previous query to get books written by an author:

for (b <- books; a <- b.authors if a == "author32") 
yield b.title

Can also be expressed using higher order functions as follows (process shows step by step substitution):

// Original query
for (b <- books; a <- b.authors if a == "author32")
yield b.title

// (3) for (x <- e1; y <- e2; s) yield e3
// (3) e1.flatMap(x => for (y <- e2; s) yield e3)
books flatMap (b =>
  for(a <- b.authors if a == "author32") yield b.title)

// (2) for (x <- e1 if f; s) yield e2
// (2) for (x <- e1.withFilter(x => f); s) yield e2
books flatMap (b =>
  for(a <- b.authors.withFilter(a => a == "author32")) yield b.title)

// (1) for (x <- e1) yield e2
// (1) e1.map(x => e2)
books flatMap (b =>
  b.authors.withFilter(a => a == "author32").map(a => b.title))

The translation of for is not limited to lists or sequences, but is based solely on the presence of the methods map, flatMap and withFilter. So we can use the for syntax with our own types as long as we define these three functions for them. These are similar ideas to those used in Microsoft's LINQ.

Maps

A map of type Map[Key, Value] is an immutable data structure that associates keys of type Key with values of type Value. For example:

val romanLetters = Map("I" -> 1, "V" -> 5, "X" -> 10)
val romanNumbers = romanLetters map {
  case (x, y) => (y, x)
}

But maps can also be used everywhere functions can, as class Map[Key, Value] extends the function type Key => Value. So everywhere we need one, we can use the other.

We can retrieve a value using this syntax:

romanNumbers(5) // "V"

But if key doesn't exist it will thrown an exception. So it's probably a better idea to use this other syntax that returns the value or a None if there is no match:

romanNumbers(5)     //> res0: java.lang.String = V
romanNumbers get 5  //> res1: Option[java.lang.String] = Some(V)
romanNumbers get 6  //> res2: Option[java.lang.String] = None

The Option type

Please note that previous code returns values of type Option. The Option type is defined as:

trait Option[+A]
case class Some[+A](value: A) extends Option[A]
object None extends Option[Nothing]

So our previous map get key expressions return None if there is no value for the given key or Some(x) if there is a value.

Since options are defined as case classes, they can be decomposed using pattern matching:

def showRoman(number: Int) = romanNumbers get(number) match {
  case Some(letter) => letter
  case None => "no Roman letter for this number"
}
showRoman(10)  //> res3: java.lang.String = X
showRoman(13)  //> res4: java.lang.String = no Roman letter for this number

OrderBy and GroupBy

In SQL queries we often use "order by" and "group by" operations. In Scala we can get the same on collections.

For instance for "order by":

val fruit = List("apple", "pear", "orange", "pineapple")
fruit sortWith(_.length < _.length)
//> res5: List[java.lang.String] = List(pear, apple, orange, pineapple)
fruit.sorted
//> res6: List[java.lang.String] = List(apple, orange, pear, pineapple)

And for "group by":

fruit groupBy(_.head)
//> res7: scala.collection.immutable.Map[Char,List[java.lang.String]] = 
//> Map(a -> List(apple), o -> List(orange), p -> List(pear, pineapple))

Please note that as a result of a groupBy operation we get a Map of collections.

Map example: Polynom

A polynomial can be seen as a map from exponents to coefficients:

// x^3 - 2x + 5
Map (0 -> 5, 1 -> -2, 3 -> 1)

So we can create a Poly class to represent polynomials using maps:

class Poly(val terms: Map[Int, Double]) {
  def + (other: Poly) = new Poly(this.terms ++ other.terms)
  override def toString =
	(for ((exp, coeff) <- terms.toList.sorted.reverse) yield coeff + "x^" + exp) mkString " + "
}

But if we now test it we see that additions are not calculated well:

val p1 = new Poly(Map(1 -> 2.0, 3 -> 4.0, 5 -> 6.2))
val p2 = new Poly(Map(0 -> 3.0, 3 -> 7.0))
p1 + p2
//> res0: Poly = 6.2x^5 + 7.0x^3 + 2.0x^1 + 3.0x^0

Coefficient for x^3 should be 11.0 instead of 7.0. So we obviously have to improve something:

class Poly(val terms: Map[Int, Double]) {
  def + (other: Poly) = new Poly(this.terms ++ (other.terms map adjust))
  def adjust(term: (Int, Double)): (Int, Double) = {
	val (exp, coeff) = term
	terms get exp match {
	  case Some(coeff1) => exp -> (coeff + coeff1)
	  case None => exp -> coeff
	}
  }
  override def toString =
	(for ((exp, coeff) <- terms.toList.sorted.reverse) yield coeff + "x^" + exp) mkString " + "
}

And now it works as expected:

p1 + p2
//> res0: Poly = 6.2x^5 + 11.0x^3 + 2.0x^1 + 3.0x^0

But it looks too much work and verbosity to compute just an addition. Remember we said that maps are actually functions. But they are "partial functions" in the sense that applying a map to a key could lead to an exception if that key isn't included in the map. To avoid this, there is a withDefaultValue operation that turns a map into a total function:

romanNumbers(6) // exception thrown
val rn = romanNumbers withDefaultValue "-"
rn(5)  //> res3: java.lang.String = V
rn(6)  //> res3: java.lang.String = -

We can apply this technique to our Poly class:

class Poly(val terms0: Map[Int, Double]) {
  val terms = terms0 withDefaultValue 0.0
  def + (other: Poly) = new Poly(this.terms ++ (other.terms map adjust))
  def adjust(term: (Int, Double)): (Int, Double) = {
	val (exp, coeff) = term
	exp -> (coeff + terms(exp))
  }
  override def toString =
	(for ((exp, coeff) <- terms.toList.sorted.reverse) yield coeff + "x^" + exp) mkString " + "
}

Finally, we want to improve the way we create our polynomials, so instead of the first syntax we can use the second:

val p1 = new Poly(Map(1 -> 2.0, 3 -> 4.0, 5 -> 6.2))
val p1 = new Poly(1 -> 2.0, 3 -> 4.0, 5 -> 6.2)

This is challenging if we take into account that the number of parameters (in this example, three) could be any number. To achieve this, we need an auxiliary constructor in our class:

def this(bindings: (Int, Double)*) = this(bindings.toMap)

Where the * notation means that this is a repeated parameter.

Example: Translating phone numbers into sentences

Just to recapitulate concepts seen in previous sessions, we are going to work in a more complex example.

Let's imagine this assignation between numbers in a phone (i.e. an iPhone) and letters:

val mnemonics = Map('2' -> "ABC", '3' -> "DEF", '4' -> "GHI", 
                    '5' -> "JKL", '6' -> "MNO", '7' -> "PQRS", 
                    '8' -> "TUV", '9' -> "WXYZ")

And let's assume we have a dictionary with a list of words.

Our goal is to design a method translate such as this call produces this result (and others):

translate("7225247386")
// Scala is fun

Please note that "S" is included in phone key for number 7, "c-a" in key for number 2, "l" in key for number 5, "a" in key for number 2, and so on until "u" and "n" that are in keys for number 8 and 6. So, for each number we have some different letters that we can use to create existing words in our dictionary of words.

Final consideration on collections

Now that we've seen some Scala immutable collections it's a good time to sumarize some of their characteristics:

  • easy to use: few steps to do the job,
  • concise: one word replaces a whole loop,
  • safe: type checker is really good at catching errors,
  • fast: collection operations are tuned, can be parallelized,
  • universal: one vocabulary to work on all kinds of collections.

This makes them a very attractive tool for software development.