# Getting Started with Saul

We will look at a Spam Classification task where we try to classify email documents as SPAM or HAM. This notebook will work through the steps in creating and running the Spam Classifier.

### Step -1 - Jupyter-Scala integration for Saul

In [1]:
classpath.addRepository("http://cogcomp.cs.illinois.edu/m2repo")
classpath.add("edu.illinois.cs.cogcomp" %% "saul" % "0.5.5") 

34 new artifact(s)


34 new artifacts in macro
34 new artifacts in runtime
34 new artifacts in compile




### Step 0 - Spam Data

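The data lives under a base directory that is split into `train/` and `test/`. Each split has a `spam` and a `ham` subdirectory, and every file in them holds one email.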

In [2]:
import scala.io.Source
import java.io.File

val spamDataBasePath = "../../data/EmailSpam/"
val trainDataPath = spamDataBasePath  + "train/"
val testDataPath = spamDataBasePath  + "test/"

val dir = new File(trainDataPath + "ham")
require(dir.exists() && dir.isDirectory())

// val sampleDoc = dir.listFiles.filter(_.isFile).head
// println("Sample Document:")
// Source.fromFile(sampleDoc).getLines.foreach(println(_))

import scala.io.Source
import java.io.File
spamDataBasePath: String = "../../data/EmailSpam/"
trainDataPath: String = "../../data/EmailSpam/train/"
testDataPath: String = "../../data/EmailSpam/test/"
dir: java.io.File = ../../data/EmailSpam/train/ham
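
Before building the reader, it can help to check how many documents each label directory holds. The following is a minimal sketch, not part of Saul: `countDocs` and `labelCounts` are hypothetical helpers, and the `spam`/`ham` subdirectory names are taken from the paths used later in this notebook.

```scala
import java.io.File

// Hypothetical helper: number of regular files directly inside `dir`
// (0 if `dir` does not exist or is not a directory).
def countDocs(dir: File): Int =
  Option(dir.listFiles).map(_.count(_.isFile)).getOrElse(0)

// Hypothetical helper: document counts for the "spam" and "ham"
// subdirectories of one data split (for example the train directory).
def labelCounts(splitPath: String): Map[String, Int] =
  Seq("spam", "ham")
    .map(label => label -> countDocs(new File(splitPath, label)))
    .toMap
```

Calling `labelCounts(trainDataPath)` would then give a quick view of the class balance in the training split.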

### Step 1 - Data Reader

We create a reader that reads each document in a directory and parses it into an instance of the class defined below.

We define an Email as a collection of words and its label.

In [3]:
// An email is a sequence of whitespace-separated tokens plus its label ("spam" or "ham").
case class Email(words: Seq[String], label: String)

object DataReader {
    // Reads every regular file in dirName and tags each parsed email with the given label.
    def apply(dirName: String, label: String): Iterable[Email] = {
        val dir = new File(dirName)
        require(dir.exists() && dir.isDirectory)
        
        dir.listFiles
           .filter(_.isFile)
           .flatMap(file => parseEmail(file.getAbsolutePath, label))
    }
    
    // Returns None for an empty file; otherwise splits every line on whitespace.
    private def parseEmail(fileName: String, label: String): Option[Email] = {
        val source = Source.fromFile(fileName)
        if (source.hasNext) {
            val words = source.getLines
                              .flatMap(_.split("\\s+"))
                              .toSeq
            Some(Email(words, label))
        } else {
            None
        }
    }
}


defined class Email
defined object DataReader
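
The reader splits each line on runs of whitespace. One detail worth knowing is that `split("\\s+")` keeps a leading empty string when a line starts with whitespace, and returns a single empty string for a blank line, so empty tokens can end up in `words`. The sketch below illustrates this on an in-memory string; `tokenize` is a hypothetical variant that drops empty tokens, and the sample line (with its leading space) is made up for illustration.

```scala
// Hypothetical helper: whitespace tokenization that drops empty tokens.
def tokenize(line: String): List[String] =
  line.split("\\s+").filter(_.nonEmpty).toList

// Made-up sample line with a leading space.
val sampleLine = " Subject: double your life"

// Plain split, as used by the reader: the first token is an empty string.
val rawTokens = sampleLine.split("\\s+").toList

// Filtered variant: only non-empty tokens remain.
val cleanTokens = tokenize(sampleLine)
```

Whether to filter empty tokens is a design choice; the reader above does not, so the comparison here is only meant to make the behavior visible.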

### Step 2 - DataModel (Entities, Features)

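We define the DataModel: a node that holds the Email instances, and the properties computed from each email (the word and bigram features, and the label).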

In [4]:
import edu.illinois.cs.cogcomp.saul.datamodel.DataModel

object SpamDataModel extends DataModel {
    val email = node[Email]
    
    // Features
    val words = property(email) { 
        doc: Email => doc.words.toList
    }
    
    val bigrams = property(email) {
        doc: Email => doc.words.sliding(2).map(_.mkString("-")).toList
    }
    
    val spamLabel = property(email) {
        doc: Email => doc.label
    }
}

import edu.illinois.cs.cogcomp.saul.datamodel.DataModel
defined object SpamDataModel
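
To make the `bigrams` property concrete, here is a small sketch that applies the same `sliding(2)` expression to a plain word list, outside of Saul. `bigramsOf` is a hypothetical helper name, and the sample words are taken from the first training email shown later in the output.

```scala
// Hypothetical helper: same expression as the bigrams property,
// applied to a plain sequence of words.
def bigramsOf(words: Seq[String]): List[String] =
  words.sliding(2).map(_.mkString("-")).toList

val sampleWords = Seq("double", "your", "life", "insurance")

// Each adjacent pair of words is joined with "-".
val sampleBigrams = bigramsOf(sampleWords)

// A one-word input yields a single window holding just that word,
// so the "bigram" feature is the word itself.
val singleWord = bigramsOf(Seq("only"))
```

The one-word case is a property of `sliding`, which emits a final partial window rather than nothing when the input is shorter than the window size.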

### Step 3 - Classifier

We define the classifier as a Learnable over the email node: spamLabel is the label to predict, words and bigrams are the features, and a support vector machine is the underlying learner.

In [8]:
import edu.illinois.cs.cogcomp.lbjava.learn.SupportVectorMachine
import edu.illinois.cs.cogcomp.saul.classifier.Learnable
import SpamDataModel._

object SpamClassifier extends Learnable(email) {
    def label = spamLabel
    override lazy val classifier = new SupportVectorMachine()
    override def feature = using(words, bigrams)
}

import edu.illinois.cs.cogcomp.lbjava.learn.SupportVectorMachine
import edu.illinois.cs.cogcomp.saul.classifier.Learnable
import SpamDataModel._
defined object SpamClassifier

### Step 4 - App (Train, Test)

In [7]:
val trainData = DataReader(trainDataPath + "spam", "spam") ++ DataReader(trainDataPath + "ham", "ham")
val testData = DataReader(testDataPath + "spam", "spam") ++ DataReader(testDataPath + "ham", "ham")

SpamDataModel.email.populate(trainData)
SpamDataModel.email.populate(testData, train = false)

SpamClassifier.learn(30)
SpamClassifier.test()

INFO  [2016-11-06 18:30:11,121] cmd4$$user$SpamClassifier$: Learnable: Learn with data of size 9
INFO  [2016-11-06 18:30:11,122] cmd4$$user$SpamClassifier$: Training: 30 iterations remain.
INFO  [2016-11-06 18:30:11,123] cmd4$$user$SpamClassifier$: Training: 30 iterations remain.
INFO  [2016-11-06 18:30:11,248] cmd4$$user$SpamClassifier$: Training: 20 iterations remain.
INFO  [2016-11-06 18:30:11,348] cmd4$$user$SpamClassifier$: Training: 10 iterations remain.
 Label   Precision Recall    F1   LCount PCount
-----------------------------------------------
ham        100.000  60.000 75.000      5      3
spam        71.429 100.000 83.333      5      7
-----------------------------------------------
Accuracy    80.000    -      -      -        10
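
The numbers in this table can be reproduced from the counts, which is a useful sanity check on how to read it. The sketch below assumes the usual definitions (precision = correct / PCount, recall = correct / LCount, F1 = harmonic mean of the two). The table does not print the number of correct predictions per label; the values used here (3 for ham, 5 for spam) are inferred from precision times PCount. `Counts` is a hypothetical helper class, not part of Saul.

```scala
// Hypothetical helper: per-label counts and the metrics derived from them.
// gold = LCount, predicted = PCount, correct = correctly predicted items.
case class Counts(gold: Int, predicted: Int, correct: Int) {
  def precision: Double = 100.0 * correct / predicted
  def recall: Double    = 100.0 * correct / gold
  def f1: Double        = 2 * precision * recall / (precision + recall)
}

// `correct` is inferred from the table, not printed in it.
val hamCounts  = Counts(gold = 5, predicted = 3, correct = 3)
val spamCounts = Counts(gold = 5, predicted = 7, correct = 5)

// Overall accuracy: all correct predictions over all test emails.
val accuracy =
  100.0 * (hamCounts.correct + spamCounts.correct) / (hamCounts.gold + spamCounts.gold)
```

Read this way, the classifier labels every spam email correctly but also marks two ham emails as spam, which is why ham recall drops to 60 while spam precision is about 71.4.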


trainData: Iterable[Email] = ArraySeq(
  Email(
    Stream(
      "Subject:",
      "double",
      "your",
      "life",
      "insurance",
      "at",
      "no",
      "extra",
      "cost",
      "!",
      "29155",
      "the",
      "lowest",
      "life",
      "insurance",
      "quotes",
      "without",
...
testData: Iterable[Email] = ArraySeq(
  Email(
    Stream(
      "Subject:",
      "slotting",
      "order",
      "confirmation",
      "may",
      "18",
      ",",
      "2004",
      "etacitne",
      "{",
      "%",
      "begin",
      "_",
      "split",
      "76",
 