This repository has been archived by the owner on Dec 18, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 61
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
jattenberg
committed
Jun 17, 2014
0 parents
commit dcfd90d
Showing
84 changed files
with
9,175 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
*.class | ||
*.log | ||
*.swp | ||
*.swo | ||
|
||
# sbt specific | ||
dist/* | ||
target/ | ||
lib_managed/ | ||
src_managed/ | ||
project/boot/ | ||
project/plugins/project/ | ||
|
||
# Scala-IDE specific | ||
.scala_dependencies | ||
|
||
#java | ||
|
||
*.class | ||
|
||
# Package Files # | ||
*.jar | ||
*.war | ||
*.ear | ||
|
||
*~ | ||
*\# | ||
.history |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
The MIT License | ||
=============== | ||
|
||
Copyright (c) 2009 Anton Grigoryev | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in | ||
all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
THE SOFTWARE. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
#Conjecture | ||
|
||
Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL. | ||
The goal of this project is to enable the development of statistical models as viable components | ||
in a wide range of product settings. Applications include classification and categorization, | ||
recommender systems, ranking, filtering, and regression (predicting real-valued numbers). | ||
Conjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs. | ||
Integration with Hadoop and scalding enable seamless handling of extremely large data volumes, | ||
and integration with established ETL processes. Predicted labels can either be consumed directly | ||
by the web stack using the dataset loader, or models can be deployed and consumed by live web code. | ||
Currently, binary classification (assigning one of two possible labels to input data points) | ||
is the most mature component of the Conjecture package. | ||
|
||
#Tutorial | ||
There are a few stages involved in training a machine learning model using Conjecture. | ||
|
||
## Create Training Data | ||
We represent the training data as "feature vectors" which are just mappings of feature names to real values. | ||
In this case we represent them as a java map of strings to doubles | ||
(although we have a class StringKeyedVector which provides convenience methods for feature vector construction). | ||
We also need the true label of each instance, which we represent as 0 and 1 | ||
(the mapping of these binary labels to e.g., "male" and "female" is up to the user). | ||
We construct BinaryLabeledInstances, which are just wrappers for a feature vector and a label. | ||
|
||
val bl = new BinaryLabeledInstance(0.0) | ||
bl.addTerm("bias", 1,0") | ||
bl.addTerm("some_feature", 0.5) | ||
|
||
## Training a Classifier | ||
Classifiers are essentially trained by presenting the labeled instances to them. There are several kinds | ||
of linear classifiers we implement, among them: | ||
|
||
* Logistic regression, | ||
* Perceptron, | ||
* MIRA (a large margin perceptron model), | ||
* Passive aggressive. | ||
|
||
These models all have several options, such as learning rate, regularization parameters and so on. We supply | ||
reasonable defaults for these paramteres although they can be changed readily. To train a linear model | ||
simply call the update function with the labeled instance: | ||
|
||
val p = new LogisticRegression() | ||
p.update(bl) | ||
|
||
In order to make this procedure tractable for large datasets, we provied scalding wrappers for the training. | ||
These operate by training several small models on mappers, then aggregating them into a final complete model | ||
on the reducers. This wrapper is called like so: | ||
|
||
new BinaryModelTrainer(args) | ||
.train(instances, 'instance, 'model) | ||
.write(SequenceFile("model")) | ||
.map('model -> 'model){ x : UpdateableBinaryModel => new com.google.gson.Gson.toJson(x) } | ||
.write(Tsv("model_json")) | ||
|
||
This code segment will train a model using a pipe called instances which has a field called instance which contains | ||
the BinaryLabeledInstance objects. It produces a pipe with a single field containing the completed model, which can | ||
then be written to disk. | ||
|
||
This class uses the command line args object from scalding, in order to let you set some options on the command line. | ||
Some useful options are: | ||
|
||
| Argument | Possible values | Default | Meaning | | ||
|-------------------------------------|-----------------------------------------------|--------------------|--------------------------------------------------| | ||
| --model | mira, logistic_regression, passive_aggressive | passive_aggressive | The type of model to use. | | ||
| --iters | 1, 2, 3... | 1 | The number of iterations of training to perform. | | ||
| --zero_class_prob, --one_class_prob | [0, 1] | 1 | | | ||
|
||
To see all the commandline options, see the BinaryModelTrainer class. | ||
|
||
## Evaluating a Classifier | ||
It is important to get a sense of the performance you can expect out of your classifier on unseen data. | ||
In order to do this we recommend to use cross validation. | ||
In essence, your input set of instances is split up into testing and training portions (multiple different ways), | ||
then a classifier is trained on each training portion, and evaluated (against the true labels which are present) | ||
using the testing portion. | ||
This is all wrapped up in a class called BinaryCrossValidator, it is used like so: | ||
|
||
new BinaryCrossValidator(args, 5) | ||
.crossValidate(instances, 'instance) | ||
.write(Tsv("model_xval")) | ||
|
||
This class also takes the command line arguments, which it passes to a model trainer for each fold. | ||
This allows the specification of options to the cross validated models on the command line. | ||
The output contains statistics about the performance of the model as well as teh confusion matrices | ||
for each fold. | ||
|
||
A script is included which cross validates a logistic regression model on the iris dataset. | ||
|
||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
#!/bin/bash | ||
|
||
# - make monolithic conjecture jar. | ||
sbt clean assembly | ||
# - make the instances. | ||
java -cp target/conjecture-assembly-0.0.7-SNAPSHOT.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.IrisDataToMulticlassLabeledInstances --input_file data/iris.tsv --output_file iris_model/instances --local | ||
# - construct the classifier. | ||
java -cp target/conjecture-assembly-0.0.7-SNAPSHOT.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.LearnMulticlassClassifier --input iris_model/instances --output iris_model --class_names Iris-versicolor,Iris-virginica,Iris-setosa --iters 5 --folds 3 --local |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
import json | ||
import sys | ||
import math | ||
|
||
if __name__ == '__main__': | ||
if len(sys.argv) != 3: | ||
sys.exit("Usage: python " + sys.argv[0] + " [model file] [model file]") | ||
a = json.load(open(sys.argv[1]))['param']['vector'] | ||
b = json.load(open(sys.argv[2]))['param']['vector'] | ||
features = set(a.keys()) | set(b.keys()) | ||
diff = [] | ||
for f in features: | ||
dv = a.get(f, 0.0) - b.get(f, 0.0) | ||
if math.fabs(dv) > 0.01: | ||
diff.append((f, dv, a.get(f), b.get(f))) | ||
diff.sort( key = lambda tup: -math.fabs(tup[1])) | ||
for t in diff: | ||
print t[0] + "\t" + str(t[2]) + "\t" + str(t[3]) + "\t(" + str(t[1]) + ")" | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
import json | ||
import sys | ||
import math | ||
|
||
if __name__ == '__main__': | ||
if len(sys.argv) != 2: | ||
sys.exit("Usage: python " + sys.argv[0] + " [model file]") | ||
vec = json.load(open(sys.argv[1]))['param']['vector'].items() | ||
vec.sort(key = lambda tup: -math.fabs(tup[1])) | ||
for v in vec: | ||
print v[0] + "\t" + str(v[1]) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
import json | ||
import sys | ||
from optparse import OptionParser | ||
from math import floor | ||
|
||
colors = ["FF0000", "FF1000", "FF2000", "FF3000", "FF4000", "FF5000", "FF6000", | ||
"FF7000", "FF8000", "FF9000", "FFA000", "FFB000", "FFC000", "FFD000", | ||
"FFE000", "FFF000", "FFFF00", "F0FF00", "E0FF00", "D0FF00", "C0FF00", | ||
"B0FF00", "A0FF00", "90FF00", "80FF00", "70FF00", "60FF00", "50FF00", | ||
"40FF00", "30FF00", "20FF00", "10FF00"] | ||
bins = len(colors) | ||
|
||
|
||
parser = OptionParser(usage="""builds a simple web page providing introspection on predictions made by conjecture models. | ||
Depends on the supporting data provided in the instance itself, currently only supporting binary | ||
classification problems | ||
Usage: %prog [options] | ||
""") | ||
|
||
parser.add_option('-o', '--out', dest='out', default=False, action='store', | ||
help="[optional] destination of the generated html. Defaults to standard out") | ||
parser.add_option('-f', '--file', dest='file', default=False, action='store', | ||
help="[optional] file storing input predictions and instances. Defaults to standard in") | ||
parser.add_option('-l', '--label', dest='label', default=False, action='store', | ||
help="[optional] only keep examples with this label") | ||
parser.add_option('-L', '--limit', dest='limit', default=1000, action='store', | ||
help="maximum number of prediction examples to display. Default: 1000") | ||
|
||
|
||
(options, args) = parser.parse_args() | ||
|
||
output = open(options.out, 'w') if (options.out) else sys.stdout | ||
input = open(options.file, 'r') if(options.file) else sys.stdin | ||
|
||
limit = int(options.limit) | ||
|
||
output.write("<html>") | ||
ct = 0 | ||
|
||
for line in input: | ||
parts = line.strip().split("\t") | ||
content = json.loads(parts[0]) | ||
label = int(content['label']['value']) | ||
pred = float(parts[2]) | ||
|
||
if (options.label and str(label) != options.label): | ||
continue | ||
|
||
error = min(1.0, abs(pred-label)) | ||
bin = bins - int(floor(error*bins)) - 1 | ||
|
||
color = "#" + colors[bin] | ||
out = "" | ||
|
||
support = json.loads(content['supporting_data']) | ||
|
||
for key in support.keys(): | ||
out = out + "<b>" + key + "</b></br>" + support[key] + "<br/>" | ||
|
||
if (len(out) < 10000 and ct < limit): | ||
try: | ||
output.write("<div style='background-color: " + color + "; width: 700px;'>"); | ||
output.write("%d (%f)<br/>" %( label, pred)) | ||
output.write(out) | ||
output.write("</div><p>") | ||
ct = ct + 1 | ||
except: | ||
pass | ||
|
||
if (ct >= limit): | ||
break | ||
|
||
output.write("</html>"); | ||
output.flush() | ||
output.close() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
import sbt._ | ||
import AssemblyKeys._ | ||
import aether.Aether._ | ||
|
||
name := "conjecture" | ||
|
||
version := "0.0.7-SNAPSHOT" | ||
|
||
organization := "com.etsy" | ||
|
||
scalaVersion := "2.9.3" | ||
|
||
sbtVersion := "0.12.1" | ||
|
||
scalacOptions ++= Seq("-unchecked", "-deprecation") | ||
|
||
compileOrder := CompileOrder.JavaThenScala | ||
|
||
javaHome := Some(file("/usr/java/latest")) | ||
|
||
publishArtifact in packageDoc := false | ||
|
||
resolvers ++= { | ||
Seq( | ||
"snapshots" at "http://oss.sonatype.org/content/repositories/snapshots", | ||
"releases" at "http://oss.sonatype.org/content/repositories/releases", | ||
"Concurrent Maven Repo" at "http://conjars.org/repo", | ||
"cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/" | ||
) | ||
} | ||
|
||
libraryDependencies += "cascading" % "cascading-core" % "2.0.0" | ||
|
||
libraryDependencies += "cascading" % "cascading-local" % "2.0.0" exclude("com.google.guava", "guava") | ||
|
||
libraryDependencies += "cascading" % "cascading-hadoop" % "2.0.0" | ||
|
||
libraryDependencies += "cascading.kryo" % "cascading.kryo" % "0.4.6" | ||
|
||
libraryDependencies += "com.google.code.gson" % "gson" % "2.2.2" | ||
|
||
libraryDependencies += "com.twitter" % "maple" % "0.2.4" | ||
|
||
libraryDependencies += "com.twitter" % "algebird-core_2.9.2" % "0.1.12" | ||
|
||
libraryDependencies += "com.twitter" % "scalding-core_2.9.2" % "0.8.5" | ||
|
||
libraryDependencies += "commons-lang" % "commons-lang" % "2.4" | ||
|
||
libraryDependencies += "com.joestelmach" % "natty" % "0.7" | ||
|
||
libraryDependencies += "io.backchat.jerkson" % "jerkson_2.9.2" % "0.7.0" | ||
|
||
libraryDependencies += "com.google.guava" % "guava" % "13.0.1" | ||
|
||
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.0" | ||
|
||
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.0.0-cdh4.1.1" exclude("commons-daemon", "commons-daemon") | ||
|
||
libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.0.0-cdh4.1.1" exclude("commons-daemon", "commons-daemon") | ||
|
||
libraryDependencies += "org.apache.hadoop" % "hadoop-tools" % "2.0.0-mr1-cdh4.1.1" exclude("commons-daemon", "commons-daemon") | ||
|
||
libraryDependencies += "net.sf.trove4j" % "trove4j" % "3.0.3" | ||
|
||
libraryDependencies += "com.esotericsoftware.kryo" % "kryo" % "2.21" | ||
|
||
libraryDependencies += "com.novocode" % "junit-interface" % "0.10" % "test" | ||
|
||
parallelExecution in Test := false | ||
|
||
seq(assemblySettings: _*) | ||
|
||
publishTo <<= version { (v: String) => | ||
val archivaURL = "http://ivy.etsycorp.com/repository" | ||
if (v.trim.endsWith("SNAPSHOT")) { | ||
Some("publish-snapshots" at (archivaURL + "/snapshots")) | ||
} else { | ||
Some("publish-releases" at (archivaURL + "/internal")) | ||
} | ||
} | ||
|
||
seq(aetherPublishSettings: _*) | ||
|
||
pomIncludeRepository := { _ => false } | ||
|
||
// Uncomment if you don't want to run all the tests before building assembly | ||
// test in assembly := {} | ||
|
||
// Janino includes a broken signature, and is not needed: | ||
excludedJars in assembly <<= (fullClasspath in assembly) map { cp => | ||
val excludes = Set("jsp-api-2.1-6.1.14.jar", "jsp-2.1-6.1.14.jar", | ||
"jasper-compiler-5.5.12.jar", "janino-2.5.16.jar") | ||
cp filter { jar => excludes(jar.data.getName)} | ||
} | ||
|
||
// Some of these files have duplicates, let's ignore: | ||
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) => | ||
{ | ||
case s if s.endsWith(".class") => MergeStrategy.last | ||
case s if s.endsWith("project.clj") => MergeStrategy.concat | ||
case s if s.endsWith(".html") => MergeStrategy.last | ||
case s if s.contains("servlet") => MergeStrategy.last | ||
case x => old(x) | ||
} | ||
} |
Oops, something went wrong.