
Commit

conjecture, initial commit
jattenberg committed Jun 17, 2014
0 parents commit dcfd90d
Showing 84 changed files with 9,175 additions and 0 deletions.
28 changes: 28 additions & 0 deletions .gitignore
@@ -0,0 +1,28 @@
*.class
*.log
*.swp
*.swo

# sbt specific
dist/*
target/
lib_managed/
src_managed/
project/boot/
project/plugins/project/

# Scala-IDE specific
.scala_dependencies

#java

*.class

# Package Files #
*.jar
*.war
*.ear

*~
*\#
.history
22 changes: 22 additions & 0 deletions LICENSE.md
@@ -0,0 +1,22 @@
The MIT License
===============

Copyright (c) 2009 Anton Grigoryev

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
90 changes: 90 additions & 0 deletions README.md
@@ -0,0 +1,90 @@
# Conjecture

Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL.
The goal of this project is to enable the development of statistical models as viable components
in a wide range of product settings. Applications include classification and categorization,
recommender systems, ranking, filtering, and regression (predicting real-valued numbers).
Conjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs.
Integration with Hadoop and Scalding enables seamless handling of extremely large data volumes
and integration with established ETL processes. Predicted labels can either be consumed directly
by the web stack using the dataset loader, or models can be deployed and consumed by live web code.
Currently, binary classification (assigning one of two possible labels to input data points)
is the most mature component of the Conjecture package.

# Tutorial
There are a few stages involved in training a machine learning model using Conjecture.

## Create Training Data
We represent the training data as "feature vectors" which are just mappings of feature names to real values.
In this case we represent them as a Java map of strings to doubles
(although we have a class StringKeyedVector which provides convenience methods for feature vector construction).
We also need the true label of each instance, which we represent as 0 and 1
(the mapping of these binary labels to e.g., "male" and "female" is up to the user).
We construct BinaryLabeledInstances, which are just wrappers for a feature vector and a label.

    val bl = new BinaryLabeledInstance(0.0)
    bl.addTerm("bias", 1.0)
    bl.addTerm("some_feature", 0.5)

## Training a Classifier
Classifiers are trained by presenting labeled instances to them, one at a time. We implement several kinds
of linear classifiers, among them:

* Logistic regression,
* Perceptron,
* MIRA (a large margin perceptron model),
* Passive aggressive.

These models all have several options, such as learning rate, regularization parameters and so on. We supply
reasonable defaults for these parameters, although they can be changed readily. To train a linear model
simply call the update function with the labeled instance:

    val p = new LogisticRegression()
    p.update(bl)
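
As a minimal end-to-end sketch (in-memory, using only the constructor and the addTerm and update methods shown above; the feature names and labels are purely illustrative):

    // Build a tiny hand-made training set.
    val positive = new BinaryLabeledInstance(1.0)
    positive.addTerm("bias", 1.0)
    positive.addTerm("some_feature", 0.5)

    val negative = new BinaryLabeledInstance(0.0)
    negative.addTerm("bias", 1.0)
    negative.addTerm("other_feature", 2.0)

    // Present each labeled instance to the model; loop again for additional passes.
    val model = new LogisticRegression()
    Seq(positive, negative).foreach(bl => model.update(bl))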

In order to make this procedure tractable for large datasets, we provide Scalding wrappers for the training.
These operate by training several small models on mappers, then aggregating them into a final complete model
on the reducers. This wrapper is called like so:

    new BinaryModelTrainer(args)
      .train(instances, 'instance, 'model)
      .write(SequenceFile("model"))
      .map('model -> 'model){ x : UpdateableBinaryModel => new com.google.gson.Gson().toJson(x) }
      .write(Tsv("model_json"))

This code segment trains a model using a pipe called instances, whose field instance contains
the BinaryLabeledInstance objects. It produces a pipe with a single field containing the completed model, which can
then be written to disk.
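
For orientation, here is a minimal sketch of a complete Scalding job wrapping that snippet. The job name, the input/output arguments, and the location of BinaryModelTrainer are assumptions for illustration, not taken from this README:

    import com.twitter.scalding._
    // (import for BinaryModelTrainer omitted; use whichever package it lives in)

    // Hypothetical job; run through scalding's Tool, e.g.
    //   java -cp conjecture-assembly.jar com.twitter.scalding.Tool \
    //     com.example.TrainBinaryModelJob --hdfs --input instances --output model \
    //     --model logistic_regression --iters 3
    class TrainBinaryModelJob(args: Args) extends Job(args) {

      // Pipe of BinaryLabeledInstance objects carried in a field named 'instance.
      val instances = SequenceFile(args("input"), 'instance).read

      new BinaryModelTrainer(args)
        .train(instances, 'instance, 'model)
        .write(SequenceFile(args("output")))
    }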

BinaryModelTrainer uses the command line Args object from Scalding to let you set some options on the command line.
Some useful options are:

| Argument | Possible values | Default | Meaning |
|-------------------------------------|-----------------------------------------------|--------------------|--------------------------------------------------|
| --model | mira, logistic_regression, passive_aggressive | passive_aggressive | The type of model to use. |
| --iters | 1, 2, 3... | 1 | The number of iterations of training to perform. |
| --zero_class_prob, --one_class_prob | [0, 1] | 1 | |

To see all the command line options, see the BinaryModelTrainer class.

## Evaluating a Classifier
It is important to get a sense of the performance you can expect out of your classifier on unseen data.
In order to do this we recommend using cross-validation.
In essence, your input set of instances is split up into testing and training portions (multiple different ways),
then a classifier is trained on each training portion, and evaluated (against the true labels which are present)
using the testing portion.
This is all wrapped up in a class called BinaryCrossValidator, which is used like so:

    new BinaryCrossValidator(args, 5)
      .crossValidate(instances, 'instance)
      .write(Tsv("model_xval"))

This class also takes the command line arguments, which it passes to a model trainer for each fold.
This allows the specification of options to the cross-validated models on the command line.
The output contains statistics about the performance of the model as well as the confusion matrices
for each fold.
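
To make those numbers concrete, here is a small generic helper (not part of Conjecture) showing how the usual metrics fall out of one fold's 2x2 confusion matrix:

    // Generic metric computations from binary-classification confusion counts.
    case class ConfusionMatrix(tp: Long, fp: Long, fn: Long, tn: Long) {
      def accuracy: Double  = (tp + tn).toDouble / (tp + fp + fn + tn)
      def precision: Double = tp.toDouble / (tp + fp)
      def recall: Double    = tp.toDouble / (tp + fn)
      def f1: Double        = 2 * precision * recall / (precision + recall)
    }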

A script is included which cross-validates a logistic regression model on the iris dataset.



8 changes: 8 additions & 0 deletions bin/demo.sh
@@ -0,0 +1,8 @@
#!/bin/bash

# - make monolithic conjecture jar.
sbt clean assembly
# - make the instances.
java -cp target/conjecture-assembly-0.0.7-SNAPSHOT.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.IrisDataToMulticlassLabeledInstances --input_file data/iris.tsv --output_file iris_model/instances --local
# - construct the classifier.
java -cp target/conjecture-assembly-0.0.7-SNAPSHOT.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.LearnMulticlassClassifier --input iris_model/instances --output iris_model --class_names Iris-versicolor,Iris-virginica,Iris-setosa --iters 5 --folds 3 --local
19 changes: 19 additions & 0 deletions bin/model_diff.py
@@ -0,0 +1,19 @@
import json
import sys
import math
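
# Reads two JSON-serialized model files, diffs their parameter vectors, and prints every
# feature whose weight differs by more than 0.01 as "feature <tab> weight_a <tab> weight_b
# <tab> (difference)", sorted by absolute difference.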

if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit("Usage: python " + sys.argv[0] + " [model file] [model file]")
    a = json.load(open(sys.argv[1]))['param']['vector']
    b = json.load(open(sys.argv[2]))['param']['vector']
    features = set(a.keys()) | set(b.keys())
    diff = []
    for f in features:
        dv = a.get(f, 0.0) - b.get(f, 0.0)
        if math.fabs(dv) > 0.01:
            diff.append((f, dv, a.get(f), b.get(f)))
    diff.sort(key = lambda tup: -math.fabs(tup[1]))
    for t in diff:
        print t[0] + "\t" + str(t[2]) + "\t" + str(t[3]) + "\t(" + str(t[1]) + ")"

11 changes: 11 additions & 0 deletions bin/model_param.py
@@ -0,0 +1,11 @@
import json
import sys
import math
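
# Prints the parameter vector of a JSON-serialized model as "feature <tab> weight" lines,
# sorted by descending absolute weight.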

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit("Usage: python " + sys.argv[0] + " [model file]")
    vec = json.load(open(sys.argv[1]))['param']['vector'].items()
    vec.sort(key = lambda tup: -math.fabs(tup[1]))
    for v in vec:
        print v[0] + "\t" + str(v[1])
75 changes: 75 additions & 0 deletions bin/prediction_inspection.py
@@ -0,0 +1,75 @@
import json
import sys
from optparse import OptionParser
from math import floor

colors = ["FF0000", "FF1000", "FF2000", "FF3000", "FF4000", "FF5000", "FF6000",
          "FF7000", "FF8000", "FF9000", "FFA000", "FFB000", "FFC000", "FFD000",
          "FFE000", "FFF000", "FFFF00", "F0FF00", "E0FF00", "D0FF00", "C0FF00",
          "B0FF00", "A0FF00", "90FF00", "80FF00", "70FF00", "60FF00", "50FF00",
          "40FF00", "30FF00", "20FF00", "10FF00"]
bins = len(colors)


parser = OptionParser(usage="""builds a simple web page providing introspection on predictions made by conjecture models.
Depends on the supporting data provided in the instance itself, currently only supporting binary
classification problems
Usage: %prog [options]
""")

parser.add_option('-o', '--out', dest='out', default=False, action='store',
                  help="[optional] destination of the generated html. Defaults to standard out")
parser.add_option('-f', '--file', dest='file', default=False, action='store',
                  help="[optional] file storing input predictions and instances. Defaults to standard in")
parser.add_option('-l', '--label', dest='label', default=False, action='store',
                  help="[optional] only keep examples with this label")
parser.add_option('-L', '--limit', dest='limit', default=1000, action='store',
                  help="maximum number of prediction examples to display. Default: 1000")


(options, args) = parser.parse_args()

output = open(options.out, 'w') if (options.out) else sys.stdout
input = open(options.file, 'r') if(options.file) else sys.stdin

limit = int(options.limit)

output.write("<html>")
ct = 0
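
# Each input line is tab-separated: the JSON-serialized instance (with its true label and
# 'supporting_data') in the first column and the model's prediction in the third. Each kept
# example becomes a colored block, with color reflecting the prediction error.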

for line in input:
    parts = line.strip().split("\t")
    content = json.loads(parts[0])
    label = int(content['label']['value'])
    pred = float(parts[2])

    if (options.label and str(label) != options.label):
        continue

    error = min(1.0, abs(pred-label))
    bin = bins - int(floor(error*bins)) - 1

    color = "#" + colors[bin]
    out = ""

    support = json.loads(content['supporting_data'])

    for key in support.keys():
        out = out + "<b>" + key + "</b></br>" + support[key] + "<br/>"

    if (len(out) < 10000 and ct < limit):
        try:
            output.write("<div style='background-color: " + color + "; width: 700px;'>");
            output.write("%d (%f)<br/>" %( label, pred))
            output.write(out)
            output.write("</div><p>")
            ct = ct + 1
        except:
            pass

    if (ct >= limit):
        break

output.write("</html>");
output.flush()
output.close()
106 changes: 106 additions & 0 deletions build.sbt
@@ -0,0 +1,106 @@
import sbt._
import AssemblyKeys._
import aether.Aether._

name := "conjecture"

version := "0.0.7-SNAPSHOT"

organization := "com.etsy"

scalaVersion := "2.9.3"

sbtVersion := "0.12.1"

scalacOptions ++= Seq("-unchecked", "-deprecation")

compileOrder := CompileOrder.JavaThenScala

javaHome := Some(file("/usr/java/latest"))

publishArtifact in packageDoc := false

resolvers ++= {
Seq(
"snapshots" at "http://oss.sonatype.org/content/repositories/snapshots",
"releases" at "http://oss.sonatype.org/content/repositories/releases",
"Concurrent Maven Repo" at "http://conjars.org/repo",
"cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
)
}

libraryDependencies += "cascading" % "cascading-core" % "2.0.0"

libraryDependencies += "cascading" % "cascading-local" % "2.0.0" exclude("com.google.guava", "guava")

libraryDependencies += "cascading" % "cascading-hadoop" % "2.0.0"

libraryDependencies += "cascading.kryo" % "cascading.kryo" % "0.4.6"

libraryDependencies += "com.google.code.gson" % "gson" % "2.2.2"

libraryDependencies += "com.twitter" % "maple" % "0.2.4"

libraryDependencies += "com.twitter" % "algebird-core_2.9.2" % "0.1.12"

libraryDependencies += "com.twitter" % "scalding-core_2.9.2" % "0.8.5"

libraryDependencies += "commons-lang" % "commons-lang" % "2.4"

libraryDependencies += "com.joestelmach" % "natty" % "0.7"

libraryDependencies += "io.backchat.jerkson" % "jerkson_2.9.2" % "0.7.0"

libraryDependencies += "com.google.guava" % "guava" % "13.0.1"

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.0"

libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.0.0-cdh4.1.1" exclude("commons-daemon", "commons-daemon")

libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.0.0-cdh4.1.1" exclude("commons-daemon", "commons-daemon")

libraryDependencies += "org.apache.hadoop" % "hadoop-tools" % "2.0.0-mr1-cdh4.1.1" exclude("commons-daemon", "commons-daemon")

libraryDependencies += "net.sf.trove4j" % "trove4j" % "3.0.3"

libraryDependencies += "com.esotericsoftware.kryo" % "kryo" % "2.21"

libraryDependencies += "com.novocode" % "junit-interface" % "0.10" % "test"

parallelExecution in Test := false

seq(assemblySettings: _*)

publishTo <<= version { (v: String) =>
val archivaURL = "http://ivy.etsycorp.com/repository"
if (v.trim.endsWith("SNAPSHOT")) {
Some("publish-snapshots" at (archivaURL + "/snapshots"))
} else {
Some("publish-releases" at (archivaURL + "/internal"))
}
}

seq(aetherPublishSettings: _*)

pomIncludeRepository := { _ => false }

// Uncomment if you don't want to run all the tests before building assembly
// test in assembly := {}

// Janino includes a broken signature, and is not needed:
excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
val excludes = Set("jsp-api-2.1-6.1.14.jar", "jsp-2.1-6.1.14.jar",
"jasper-compiler-5.5.12.jar", "janino-2.5.16.jar")
cp filter { jar => excludes(jar.data.getName)}
}

// Some of these files have duplicates, let's ignore:
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case s if s.endsWith(".class") => MergeStrategy.last
case s if s.endsWith("project.clj") => MergeStrategy.concat
case s if s.endsWith(".html") => MergeStrategy.last
case s if s.contains("servlet") => MergeStrategy.last
case x => old(x)
}
}
