
Commit

conjecture, initial commit
jattenberg committed Jun 17, 2014
0 parents commit dcfd90d
Showing 84 changed files with 9,175 additions and 0 deletions.
28 changes: 28 additions & 0 deletions .gitignore
@@ -0,0 +1,28 @@
*.class
*.log
*.swp
*.swo

# sbt specific
dist/*
target/
lib_managed/
src_managed/
project/boot/
project/plugins/project/

# Scala-IDE specific
.scala_dependencies

#java

*.class

# Package Files #
*.jar
*.war
*.ear

*~
*\#
.history
22 changes: 22 additions & 0 deletions LICENSE.md
@@ -0,0 +1,22 @@
The MIT License
===============

Copyright (c) 2009 Anton Grigoryev

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
90 changes: 90 additions & 0 deletions README.md
@@ -0,0 +1,90 @@
# Conjecture

Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL.
The goal of this project is to enable the development of statistical models as viable components
in a wide range of product settings. Applications include classification and categorization,
recommender systems, ranking, filtering, and regression (predicting real-valued numbers).
Conjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs.
Integration with Hadoop and Scalding enables seamless handling of extremely large data volumes
and integration with established ETL processes. Predicted labels can either be consumed directly
by the web stack using the dataset loader, or models can be deployed and consumed by live web code.
Currently, binary classification (assigning one of two possible labels to input data points)
is the most mature component of the Conjecture package.

# Tutorial
There are a few stages involved in training a machine learning model using Conjecture.

## Create Training Data
We represent the training data as "feature vectors" which are just mappings of feature names to real values.
In this case we represent them as a Java map of strings to doubles
(although we have a class StringKeyedVector which provides convenience methods for feature vector construction).
We also need the true label of each instance, which we represent as 0 and 1
(the mapping of these binary labels to e.g., "male" and "female" is up to the user).
We construct BinaryLabeledInstances, which are just wrappers for a feature vector and a label.

    val bl = new BinaryLabeledInstance(0.0)
    bl.addTerm("bias", 1.0)
    bl.addTerm("some_feature", 0.5)

## Training a Classifier
Classifiers are trained by presenting labeled instances to them, one at a time. We implement several kinds
of linear classifiers, among them:

* Logistic regression,
* Perceptron,
* MIRA (a large margin perceptron model),
* Passive aggressive.

These models all have several options, such as learning rate, regularization parameters and so on. We supply
reasonable defaults for these parameters, although they can be changed readily. To train a linear model
simply call the update function with the labeled instance:

    val p = new LogisticRegression()
    p.update(bl)
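
As a minimal end-to-end sketch (in-memory, using only the constructor and the addTerm and update methods shown above; the feature names and labels are purely illustrative):

    // Build a tiny hand-made training set.
    val positive = new BinaryLabeledInstance(1.0)
    positive.addTerm("bias", 1.0)
    positive.addTerm("some_feature", 0.5)

    val negative = new BinaryLabeledInstance(0.0)
    negative.addTerm("bias", 1.0)
    negative.addTerm("other_feature", 2.0)

    // Present each labeled instance to the model; loop again for additional passes.
    val model = new LogisticRegression()
    Seq(positive, negative).foreach(bl => model.update(bl))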

In order to make this procedure tractable for large datasets, we provide Scalding wrappers for the training.
These operate by training several small models on mappers, then aggregating them into a final complete model
on the reducers. This wrapper is called like so:

    new BinaryModelTrainer(args)
      .train(instances, 'instance, 'model)
      .write(SequenceFile("model"))
      .map('model -> 'model){ x : UpdateableBinaryModel => new com.google.gson.Gson().toJson(x) }
      .write(Tsv("model_json"))

This code segment trains a model using a pipe called instances, whose field instance contains
the BinaryLabeledInstance objects. It produces a pipe with a single field containing the completed model, which can
then be written to disk.
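
For orientation, here is a minimal sketch of a complete Scalding job wrapping that snippet. The job name, the input/output arguments, and the location of BinaryModelTrainer are assumptions for illustration, not taken from this README:

    import com.twitter.scalding._
    // (import for BinaryModelTrainer omitted; use whichever package it lives in)

    // Hypothetical job; run through scalding's Tool, e.g.
    //   java -cp conjecture-assembly.jar com.twitter.scalding.Tool \
    //     com.example.TrainBinaryModelJob --hdfs --input instances --output model \
    //     --model logistic_regression --iters 3
    class TrainBinaryModelJob(args: Args) extends Job(args) {

      // Pipe of BinaryLabeledInstance objects carried in a field named 'instance.
      val instances = SequenceFile(args("input"), 'instance).read

      new BinaryModelTrainer(args)
        .train(instances, 'instance, 'model)
        .write(SequenceFile(args("output")))
    }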

BinaryModelTrainer uses the command line Args object from Scalding to let you set some options on the command line.
Some useful options are:

| Argument | Possible values | Default | Meaning |
|-------------------------------------|-----------------------------------------------|--------------------|--------------------------------------------------|
| --model | mira, logistic_regression, passive_aggressive | passive_aggressive | The type of model to use. |
| --iters | 1, 2, 3... | 1 | The number of iterations of training to perform. |
| --zero_class_prob, --one_class_prob | [0, 1] | 1 | |

To see all the command line options, see the BinaryModelTrainer class.

## Evaluating a Classifier
It is important to get a sense of the performance you can expect out of your classifier on unseen data.
In order to do this we recommend using cross-validation.
In essence, your input set of instances is split up into testing and training portions (multiple different ways),
then a classifier is trained on each training portion, and evaluated (against the true labels which are present)
using the testing portion.
This is all wrapped up in a class called BinaryCrossValidator, which is used like so:

    new BinaryCrossValidator(args, 5)
      .crossValidate(instances, 'instance)
      .write(Tsv("model_xval"))

This class also takes the command line arguments, which it passes to a model trainer for each fold.
This allows the specification of options to the cross-validated models on the command line.
The output contains statistics about the performance of the model as well as the confusion matrices
for each fold.
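
To make those numbers concrete, here is a small generic helper (not part of Conjecture) showing how the usual metrics fall out of one fold's 2x2 confusion matrix:

    // Generic metric computations from binary-classification confusion counts.
    case class ConfusionMatrix(tp: Long, fp: Long, fn: Long, tn: Long) {
      def accuracy: Double  = (tp + tn).toDouble / (tp + fp + fn + tn)
      def precision: Double = tp.toDouble / (tp + fp)
      def recall: Double    = tp.toDouble / (tp + fn)
      def f1: Double        = 2 * precision * recall / (precision + recall)
    }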

A script is included which cross-validates a logistic regression model on the iris dataset.



8 changes: 8 additions & 0 deletions bin/demo.sh
@@ -0,0 +1,8 @@
#!/bin/bash

# - make monolithic conjecture jar.
sbt clean assembly
# - make the instances.
java -cp target/conjecture-assembly-0.0.7-SNAPSHOT.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.IrisDataToMulticlassLabeledInstances --input_file data/iris.tsv --output_file iris_model/instances --local
# - construct the classifier.
java -cp target/conjecture-assembly-0.0.7-SNAPSHOT.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.LearnMulticlassClassifier --input iris_model/instances --output iris_model --class_names Iris-versicolor,Iris-virginica,Iris-setosa --iters 5 --folds 3 --local
19 changes: 19 additions & 0 deletions bin/model_diff.py
@@ -0,0 +1,19 @@
import json
import sys
import math
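
# Reads two JSON-serialized model files, diffs their parameter vectors, and prints every
# feature whose weight differs by more than 0.01 as "feature <tab> weight_a <tab> weight_b
# <tab> (difference)", sorted by absolute difference.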

if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit("Usage: python " + sys.argv[0] + " [model file] [model file]")
    a = json.load(open(sys.argv[1]))['param']['vector']
    b = json.load(open(sys.argv[2]))['param']['vector']
    features = set(a.keys()) | set(b.keys())
    diff = []
    for f in features:
        dv = a.get(f, 0.0) - b.get(f, 0.0)
        if math.fabs(dv) > 0.01:
            diff.append((f, dv, a.get(f), b.get(f)))
    diff.sort(key = lambda tup: -math.fabs(tup[1]))
    for t in diff:
        print t[0] + "\t" + str(t[2]) + "\t" + str(t[3]) + "\t(" + str(t[1]) + ")"

11 changes: 11 additions & 0 deletions bin/model_param.py
@@ -0,0 +1,11 @@
import json
import sys
import math
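
# Prints the parameter vector of a JSON-serialized model as "feature <tab> weight" lines,
# sorted by descending absolute weight.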

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit("Usage: python " + sys.argv[0] + " [model file]")
    vec = json.load(open(sys.argv[1]))['param']['vector'].items()
    vec.sort(key = lambda tup: -math.fabs(tup[1]))
    for v in vec:
        print v[0] + "\t" + str(v[1])
75 changes: 75 additions & 0 deletions bin/prediction_inspection.py
@@ -0,0 +1,75 @@
import json
import sys
from optparse import OptionParser
from math import floor

colors = ["FF0000", "FF1000", "FF2000", "FF3000", "FF4000", "FF5000", "FF6000",
          "FF7000", "FF8000", "FF9000", "FFA000", "FFB000", "FFC000", "FFD000",
          "FFE000", "FFF000", "FFFF00", "F0FF00", "E0FF00", "D0FF00", "C0FF00",
          "B0FF00", "A0FF00", "90FF00", "80FF00", "70FF00", "60FF00", "50FF00",
          "40FF00", "30FF00", "20FF00", "10FF00"]
bins = len(colors)


parser = OptionParser(usage="""builds a simple web page providing introspection on predictions made by conjecture models.
Depends on the supporting data provided in the instance itself, currently only supporting binary
classification problems
Usage: %prog [options]
""")

parser.add_option('-o', '--out', dest='out', default=False, action='store',
                  help="[optional] destination of the generated html. Defaults to standard out")
parser.add_option('-f', '--file', dest='file', default=False, action='store',
                  help="[optional] file storing input predictions and instances. Defaults to standard in")
parser.add_option('-l', '--label', dest='label', default=False, action='store',
                  help="[optional] only keep examples with this label")
parser.add_option('-L', '--limit', dest='limit', default=1000, action='store',
                  help="maximum number of prediction examples to display. Default: 1000")


(options, args) = parser.parse_args()

output = open(options.out, 'w') if (options.out) else sys.stdout
input = open(options.file, 'r') if(options.file) else sys.stdin

limit = int(options.limit)

output.write("<html>")
ct = 0
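
# Each input line is tab-separated: the JSON-serialized instance (with its true label and
# 'supporting_data') in the first column and the model's prediction in the third. Each kept
# example becomes a colored block, with color reflecting the prediction error.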

for line in input:
    parts = line.strip().split("\t")
    content = json.loads(parts[0])
    label = int(content['label']['value'])
    pred = float(parts[2])

    if (options.label and str(label) != options.label):
        continue

    error = min(1.0, abs(pred-label))
    bin = bins - int(floor(error*bins)) - 1

    color = "#" + colors[bin]
    out = ""

    support = json.loads(content['supporting_data'])

    for key in support.keys():
        out = out + "<b>" + key + "</b></br>" + support[key] + "<br/>"

    if (len(out) < 10000 and ct < limit):
        try:
            output.write("<div style='background-color: " + color + "; width: 700px;'>");
            output.write("%d (%f)<br/>" %( label, pred))
            output.write(out)
            output.write("</div><p>")
            ct = ct + 1
        except:
            pass

    if (ct >= limit):
        break

output.write("</html>");
output.flush()
output.close()
106 changes: 106 additions & 0 deletions build.sbt
@@ -0,0 +1,106 @@
import sbt._
import AssemblyKeys._
import aether.Aether._

name := "conjecture"

version := "0.0.7-SNAPSHOT"

organization := "com.etsy"

scalaVersion := "2.9.3"

sbtVersion := "0.12.1"

scalacOptions ++= Seq("-unchecked", "-deprecation")

compileOrder := CompileOrder.JavaThenScala

javaHome := Some(file("/usr/java/latest"))

publishArtifact in packageDoc := false

resolvers ++= {
Seq(
"snapshots" at "http://oss.sonatype.org/content/repositories/snapshots",
"releases" at "http://oss.sonatype.org/content/repositories/releases",
"Concurrent Maven Repo" at "http://conjars.org/repo",
"cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
)
}

libraryDependencies += "cascading" % "cascading-core" % "2.0.0"

libraryDependencies += "cascading" % "cascading-local" % "2.0.0" exclude("com.google.guava", "guava")

libraryDependencies += "cascading" % "cascading-hadoop" % "2.0.0"

libraryDependencies += "cascading.kryo" % "cascading.kryo" % "0.4.6"

libraryDependencies += "com.google.code.gson" % "gson" % "2.2.2"

libraryDependencies += "com.twitter" % "maple" % "0.2.4"

libraryDependencies += "com.twitter" % "algebird-core_2.9.2" % "0.1.12"

libraryDependencies += "com.twitter" % "scalding-core_2.9.2" % "0.8.5"

libraryDependencies += "commons-lang" % "commons-lang" % "2.4"

libraryDependencies += "com.joestelmach" % "natty" % "0.7"

libraryDependencies += "io.backchat.jerkson" % "jerkson_2.9.2" % "0.7.0"

libraryDependencies += "com.google.guava" % "guava" % "13.0.1"

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.0"

libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.0.0-cdh4.1.1" exclude("commons-daemon", "commons-daemon")

libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.0.0-cdh4.1.1" exclude("commons-daemon", "commons-daemon")

libraryDependencies += "org.apache.hadoop" % "hadoop-tools" % "2.0.0-mr1-cdh4.1.1" exclude("commons-daemon", "commons-daemon")

libraryDependencies += "net.sf.trove4j" % "trove4j" % "3.0.3"

libraryDependencies += "com.esotericsoftware.kryo" % "kryo" % "2.21"

libraryDependencies += "com.novocode" % "junit-interface" % "0.10" % "test"

parallelExecution in Test := false

seq(assemblySettings: _*)

publishTo <<= version { (v: String) =>
val archivaURL = "http://ivy.etsycorp.com/repository"
if (v.trim.endsWith("SNAPSHOT")) {
Some("publish-snapshots" at (archivaURL + "/snapshots"))
} else {
Some("publish-releases" at (archivaURL + "/internal"))
}
}

seq(aetherPublishSettings: _*)

pomIncludeRepository := { _ => false }

// Uncomment if you don't want to run all the tests before building assembly
// test in assembly := {}

// Janino includes a broken signature, and is not needed:
excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
val excludes = Set("jsp-api-2.1-6.1.14.jar", "jsp-2.1-6.1.14.jar",
"jasper-compiler-5.5.12.jar", "janino-2.5.16.jar")
cp filter { jar => excludes(jar.data.getName)}
}

// Some of these files have duplicates, let's ignore:
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case s if s.endsWith(".class") => MergeStrategy.last
case s if s.endsWith("project.clj") => MergeStrategy.concat
case s if s.endsWith(".html") => MergeStrategy.last
case s if s.contains("servlet") => MergeStrategy.last
case x => old(x)
}
}
