Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Results from IntelliJ inspections #3

Merged
merged 1 commit into from

2 participants

@srowen
Owner

I'm not sure if this is welcome or not, but as part of warming up to the code, I ran IntelliJ's static analysis on it. Here are a number of refinements it suggests. Most is code clarification and general Java convention stuff. Feel free to disregard or modify.

@jwills jwills merged commit 445bb64 into cloudera:master
@jwills
Owner

Any contribution from you would be welcome, Sean. Maybe it's time for me to give up Eclipse for Intellij. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Mar 30, 2013
  1. Result of IntelliJ inspections

    srowen authored
This page is out of date. Refresh to see the latest.
Showing with 267 additions and 319 deletions.
  1. +2 −5 client/src/main/java/com/cloudera/science/ml/client/Help.java
  2. +3 −7 client/src/main/java/com/cloudera/science/ml/client/Main.java
  3. +1 −4 client/src/main/java/com/cloudera/science/ml/client/cmd/Command.java
  4. +2 −2 client/src/main/java/com/cloudera/science/ml/client/cmd/KMeansAssignmentCommand.java
  5. +4 −4 client/src/main/java/com/cloudera/science/ml/client/cmd/KMeansCommand.java
  6. +1 −1  client/src/main/java/com/cloudera/science/ml/client/cmd/KMeansSketchCommand.java
  7. +3 −1 client/src/main/java/com/cloudera/science/ml/client/cmd/NormalizeCommand.java
  8. +4 −4 client/src/main/java/com/cloudera/science/ml/client/cmd/SampleCommand.java
  9. +2 −1  client/src/main/java/com/cloudera/science/ml/client/cmd/SummaryCommand.java
  10. +12 −10 client/src/main/java/com/cloudera/science/ml/client/params/InputParameters.java
  11. +6 −4 client/src/main/java/com/cloudera/science/ml/client/params/OutputParameters.java
  12. +0 −3  client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java
  13. +8 −5 client/src/main/java/com/cloudera/science/ml/client/params/Specs.java
  14. +0 −3  client/src/main/java/com/cloudera/science/ml/client/params/SummaryParameters.java
  15. +6 −6 client/src/main/java/com/cloudera/science/ml/client/util/AvroIO.java
  16. +0 −3  core/src/main/java/com/cloudera/science/ml/core/records/BasicSpec.java
  17. +1 −1  core/src/main/java/com/cloudera/science/ml/core/records/DataType.java
  18. +8 −8 core/src/main/java/com/cloudera/science/ml/core/records/RecordSpec.java
  19. +0 −3  core/src/main/java/com/cloudera/science/ml/core/records/avro/AvroFieldSpec.java
  20. +2 −4 core/src/main/java/com/cloudera/science/ml/core/records/avro/AvroRecord.java
  21. +2 −2 core/src/main/java/com/cloudera/science/ml/core/records/avro/AvroSpec.java
  22. +12 −18 core/src/main/java/com/cloudera/science/ml/core/records/avro/Spec2Schema.java
  23. +4 −4 core/src/main/java/com/cloudera/science/ml/core/records/csv/CSVRecord.java
  24. +0 −3  core/src/main/java/com/cloudera/science/ml/core/records/csv/CSVSpec.java
  25. +1 −0  core/src/main/java/com/cloudera/science/ml/core/records/vectors/VectorRecord.java
  26. +6 −4 core/src/main/java/com/cloudera/science/ml/core/vectors/Centers.java
  27. +7 −7 core/src/main/java/com/cloudera/science/ml/core/vectors/VectorConvert.java
  28. +1 −1  core/src/main/java/com/cloudera/science/ml/core/vectors/Vectors.java
  29. +2 −6 core/src/main/java/com/cloudera/science/ml/core/vectors/Weighted.java
  30. +4 −7 core/src/test/java/com/cloudera/science/ml/core/vectors/CentersTest.java
  31. +15 −10 kmeans-parallel/src/main/java/com/cloudera/science/ml/kmeans/parallel/CentersIndex.java
  32. +18 −17 kmeans-parallel/src/main/java/com/cloudera/science/ml/kmeans/parallel/KMeansParallel.java
  33. +5 −7 kmeans-parallel/src/test/java/com/cloudera/science/ml/kmeans/parallel/KMeansParallelTest.java
  34. +5 −7 kmeans/src/main/java/com/cloudera/science/ml/kmeans/core/KMeans.java
  35. +2 −1  kmeans/src/main/java/com/cloudera/science/ml/kmeans/core/KMeansEvaluation.java
  36. +5 −8 kmeans/src/main/java/com/cloudera/science/ml/kmeans/core/StoppingCriteria.java
  37. +8 −9 kmeans/src/test/java/com/cloudera/science/ml/kmeans/core/KMeansTest.java
  38. +6 −5 mahout/src/main/java/com/cloudera/science/ml/mahout/types/MLWritables.java
  39. +2 −2 parallel/src/main/java/com/cloudera/science/ml/parallel/crossfold/Crossfold.java
  40. +2 −2 parallel/src/main/java/com/cloudera/science/ml/parallel/fn/ShuffleFns.java
  41. +3 −5 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/Normalizer.java
  42. +1 −6 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/StringSplitFn.java
  43. +6 −5 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/Transform.java
  44. +5 −2 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/VectorScaling.java
  45. +1 −4 parallel/src/main/java/com/cloudera/science/ml/parallel/pivot/MapAggregator.java
  46. +15 −19 parallel/src/main/java/com/cloudera/science/ml/parallel/pivot/Pivot.java
  47. +2 −5 parallel/src/main/java/com/cloudera/science/ml/parallel/pivot/Stat.java
  48. +2 −2 parallel/src/main/java/com/cloudera/science/ml/parallel/pobject/ListOfListsPObject.java
  49. +2 −2 parallel/src/main/java/com/cloudera/science/ml/parallel/pobject/ListPObject.java
  50. +0 −3  parallel/src/main/java/com/cloudera/science/ml/parallel/records/Records.java
  51. +0 −3  parallel/src/main/java/com/cloudera/science/ml/parallel/records/SummarizedRecords.java
  52. +13 −9 parallel/src/main/java/com/cloudera/science/ml/parallel/sample/ReservoirSampling.java
  53. +8 −5 parallel/src/main/java/com/cloudera/science/ml/parallel/serialize/Serializables.java
  54. +9 −3 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/Entry.java
  55. +0 −3  parallel/src/main/java/com/cloudera/science/ml/parallel/summary/InternalNumeric.java
  56. +0 −6 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/InternalStats.java
  57. +5 −11 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/Summarizer.java
  58. +1 −4 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/Summary.java
  59. +3 −3 parallel/src/main/java/com/cloudera/science/ml/parallel/types/MLAvros.java
  60. +11 −11 parallel/src/main/java/com/cloudera/science/ml/parallel/types/MLRecords.java
  61. +2 −2 parallel/src/test/java/com/cloudera/science/ml/parallel/normalize/SummaryTest.java
  62. +1 −1  parallel/src/test/java/com/cloudera/science/ml/parallel/normalize/VectorScalingTest.java
  63. +5 −6 parallel/src/test/java/com/cloudera/science/ml/parallel/sample/ReservoirSamplingTest.java
View
7 client/src/main/java/com/cloudera/science/ml/client/Help.java
@@ -23,14 +23,11 @@
import com.cloudera.science.ml.client.cmd.Command;
import com.google.common.collect.Lists;
-/**
- *
- */
@Parameters(commandDescription = "Retrieves details on the functions of other commands")
public class Help {
@Parameter(description = "Commands")
- List<String> helpCommands = Lists.newArrayList();;
-
+ List<String> helpCommands = Lists.newArrayList();
+
public int usage(JCommander jc, Map<String, Command> cmds) {
if (helpCommands.isEmpty()) {
System.out.println("Commands:\n");
View
10 client/src/main/java/com/cloudera/science/ml/client/Main.java
@@ -74,13 +74,9 @@ public int run(String[] args) throws Exception {
}
try {
return cmd.execute(getConf());
- } catch (Exception e) {
- if (e instanceof CommandException) {
- System.err.println("Error: " + e.getMessage());
- } else {
- throw e;
- }
- return 1;
+ } catch (CommandException ce) {
+ System.err.println("Error: " + ce.getMessage());
+ return 1;
}
}
View
5 client/src/main/java/com/cloudera/science/ml/client/cmd/Command.java
@@ -16,12 +16,9 @@
import org.apache.hadoop.conf.Configuration;
-/**
- *
- */
public interface Command {
int execute(Configuration conf) throws Exception;
- public String getDescription();
+ String getDescription();
}
View
4 client/src/main/java/com/cloudera/science/ml/client/cmd/KMeansAssignmentCommand.java
@@ -76,8 +76,8 @@ public int execute(Configuration conf) throws Exception {
List<MLCenters> centers = AvroIO.read(MLCenters.class, new File(centersFile));
if (!centerIds.isEmpty()) {
List<MLCenters> filter = Lists.newArrayListWithExpectedSize(centerIds.size());
- for (int i = 0; i < centerIds.size(); i++) {
- filter.add(centers.get(centerIds.get(i)));
+ for (Integer centerId : centerIds) {
+ filter.add(centers.get(centerId));
}
centers = filter;
}
View
8 client/src/main/java/com/cloudera/science/ml/client/cmd/KMeansCommand.java
@@ -65,7 +65,7 @@
@Parameter(names = "--stopping-threshold",
description = "Stop the Lloyd's iterations if the delta between centers falls below this")
- private double stoppingThreshold = 1e-4;
+ private double stoppingThreshold = 1.0e-4;
@Parameter(names = "--centers-file",
description = "A local file to store the centers that were created into")
@@ -87,8 +87,8 @@ public int execute(Configuration conf) throws Exception {
List<MLWeightedCenters> mlwc = AvroIO.read(MLWeightedCenters.class, new File(sketchFile));
List<List<Weighted<Vector>>> sketches = toSketches(mlwc);
List<Weighted<Vector>> allPoints = Lists.newArrayList();
- for (int i = 0; i < sketches.size(); i++) {
- allPoints.addAll(sketches.get(i));
+ for (List<Weighted<Vector>> sketch : sketches) {
+ allPoints.addAll(sketch);
}
List<Centers> centers = getClusters(allPoints, kmeans);
AvroIO.write(Lists.transform(centers, VectorConvert.FROM_CENTERS),
@@ -129,7 +129,7 @@ public int execute(Configuration conf) throws Exception {
return centers;
}
- private List<List<Weighted<Vector>>> toSketches(List<MLWeightedCenters> mlwc) {
+ private static List<List<Weighted<Vector>>> toSketches(List<MLWeightedCenters> mlwc) {
List<List<Weighted<Vector>>> base = Lists.newArrayList();
for (MLWeightedCenters wc : mlwc) {
base.add(Lists.transform(wc.getCenters(), VectorConvert.TO_WEIGHTED_VEC));
View
2  client/src/main/java/com/cloudera/science/ml/client/cmd/KMeansSketchCommand.java
@@ -101,7 +101,7 @@ public int execute(Configuration conf) throws Exception {
return 0;
}
- private List<MLWeightedCenters> toWeightedCenters(List<List<Weighted<Vector>>> in) {
+ private static List<MLWeightedCenters> toWeightedCenters(List<List<Weighted<Vector>>> in) {
List<MLWeightedCenters> out = Lists.newArrayList();
for (List<Weighted<Vector>> e : in) {
MLWeightedCenters mlwc = MLWeightedCenters.newBuilder()
View
4 client/src/main/java/com/cloudera/science/ml/client/cmd/NormalizeCommand.java
@@ -14,6 +14,8 @@
*/
package com.cloudera.science.ml.client.cmd;
+import java.util.Locale;
+
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
@@ -100,7 +102,7 @@ public int execute(Configuration conf) throws Exception {
}
private Transform getDefaultTransform() {
- String t = transform.toLowerCase();
+ String t = transform.toLowerCase(Locale.ENGLISH);
if ("none".equals(t)) {
return Transform.NONE;
} else if ("linear".equals(t)) {
View
8 client/src/main/java/com/cloudera/science/ml/client/cmd/SampleCommand.java
@@ -19,7 +19,6 @@
import org.apache.crunch.PipelineResult;
import org.apache.crunch.lib.Sample;
import org.apache.hadoop.conf.Configuration;
-import org.apache.mahout.math.Vector;
import com.beust.jcommander.Parameter;
import com.beust.jcommander.Parameters;
@@ -58,11 +57,12 @@
public int execute(Configuration conf) throws Exception {
Pipeline p = pipelineParams.create(SampleCommand.class, conf);
PCollection<Record> elements = inputParams.getRecords(p);
-
- PCollection<Record> sample = null;
+
if (sampleSize > 0 && samplingProbability > 0.0) {
throw new IllegalArgumentException("--size and --prob are mutually exclusive options.");
- } else if (sampleSize > 0) {
+ }
+ PCollection<Record> sample;
+ if (sampleSize > 0) {
sample = ReservoirSampling.sample(elements, sampleSize);
} else if (samplingProbability > 0.0 && samplingProbability < 1.0) {
sample = Sample.sample(elements, samplingProbability);
View
3  client/src/main/java/com/cloudera/science/ml/client/cmd/SummaryCommand.java
@@ -16,6 +16,7 @@
import java.io.File;
import java.util.List;
+import java.util.Locale;
import java.util.Set;
import org.apache.crunch.PCollection;
@@ -85,7 +86,7 @@ public int execute(Configuration conf) throws Exception {
throw new CommandException("Invalid header file row: " + line);
}
String name = pieces[0];
- String meta = pieces[1].toLowerCase().trim();
+ String meta = pieces[1].toLowerCase(Locale.ENGLISH).trim();
if (meta.startsWith("ignore") || meta.startsWith("id")) {
ignoredColumns.add(i);
rsb.add(name, DataType.STRING);
View
22 client/src/main/java/com/cloudera/science/ml/client/params/InputParameters.java
@@ -14,8 +14,9 @@
*/
package com.cloudera.science.ml.client.params;
-import java.util.Arrays;
+import java.util.Collections;
import java.util.List;
+import java.util.Locale;
import java.util.regex.Pattern;
import org.apache.crunch.PCollection;
@@ -61,20 +62,21 @@ public String getDelimiter() {
return delim;
}
- public PCollection<Vector> getVectorsFromPath(final Pipeline pipeline, String path) {
- return getVectors(pipeline, Arrays.asList(path));
+ public PCollection<Vector> getVectorsFromPath(Pipeline pipeline, String path) {
+ return getVectors(pipeline, Collections.singletonList(path));
}
- public PCollection<Vector> getVectors(final Pipeline pipeline) {
+ public PCollection<Vector> getVectors(Pipeline pipeline) {
return getVectors(pipeline, inputPaths);
}
private PCollection<Vector> getVectors(final Pipeline pipeline, List<String> paths) {
- format = format.toLowerCase();
- PCollection<Vector> ret = null;
+ format = format.toLowerCase(Locale.ENGLISH);
if ("text".equals(format)) {
throw new IllegalArgumentException("Vectors must be in 'seq' or 'avro' format");
- } else if ("seq".equals(format)) {
+ }
+ PCollection<Vector> ret;
+ if ("seq".equals(format)) {
ret = from(paths, new Function<String, PCollection<Vector>>() {
@Override
public PCollection<Vector> apply(String input) {
@@ -95,8 +97,8 @@ public String getDelimiter() {
}
public PCollection<Record> getRecords(final Pipeline pipeline) {
- format = format.toLowerCase();
- PCollection<Record> ret = null;
+ format = format.toLowerCase(Locale.ENGLISH);
+ PCollection<Record> ret;
if ("text".equals(format)) {
PCollection<String> text = fromInputs(new Function<String, PCollection<String>>() {
@Override
@@ -132,7 +134,7 @@ public String getDelimiter() {
return from(inputPaths, f);
}
- private <T> PCollection<T> from(List<String> paths, Function<String, PCollection<T>> f) {
+ private static <T> PCollection<T> from(List<String> paths, Function<String, PCollection<T>> f) {
PCollection<T> ret = null;
for (PCollection<T> p : Lists.transform(paths, f)) {
if (ret == null) {
View
10 client/src/main/java/com/cloudera/science/ml/client/params/OutputParameters.java
@@ -14,6 +14,8 @@
*/
package com.cloudera.science.ml.client.params;
+import java.util.Locale;
+
import org.apache.crunch.PCollection;
import org.apache.crunch.Target;
import org.apache.crunch.Target.WriteMode;
@@ -39,7 +41,7 @@
private String outputType;
public PType<Vector> getVectorPType() {
- outputType = outputType.toLowerCase();
+ outputType = outputType.toLowerCase(Locale.ENGLISH);
if ("avro".equals(outputType)) {
return MLAvros.vector();
} else if ("seq".equals(outputType)) {
@@ -50,10 +52,10 @@
}
public <T> void write(PCollection<T> collect, String output) {
- outputType = outputType.toLowerCase();
+ outputType = outputType.toLowerCase(Locale.ENGLISH);
PTypeFamily ptf = collect.getTypeFamily();
PType<T> ptype = collect.getPType();
- Target target = null;
+ Target target;
if ("text".equals(outputType)) {
target = To.textFile(output);
} else if ("avro".equals(outputType)) {
@@ -81,7 +83,7 @@
collect.write(target, WriteMode.OVERWRITE);
}
- private void forceConversionException(String outputFile, PType<?> ptype, String type) {
+ private static void forceConversionException(String outputFile, PType<?> ptype, String type) {
String msg = String.format(
"Could not convert type %s into %s format for output: %s",
ptype.getTypeClass().getCanonicalName(), type, outputFile);
View
3  client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java
@@ -21,9 +21,6 @@
import com.beust.jcommander.Parameter;
-/**
- *
- */
public class PipelineParameters {
@Parameter(names = "--local",
View
13 client/src/main/java/com/cloudera/science/ml/client/params/Specs.java
@@ -23,8 +23,11 @@
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Lists;
-public class Specs {
-
+public final class Specs {
+
+ private Specs() {
+ }
+
public static Integer getFieldId(Spec spec, String value) {
List<Integer> fieldIds = getFieldIds(spec, ImmutableList.of(value));
if (fieldIds.isEmpty()) {
@@ -38,7 +41,7 @@ public static Integer getFieldId(Spec spec, String value) {
return ImmutableList.of();
}
- List<Integer> fieldIds = null;
+ List<Integer> fieldIds;
if (spec == null || spec.getField(values.get(0)) == null) {
fieldIds = Lists.transform(values, new Function<String, Integer>() {
@Override
@@ -48,8 +51,8 @@ public Integer apply(String input) {
});
} else {
fieldIds = Lists.newArrayListWithExpectedSize(values.size());
- for (int i = 0; i < values.size(); i++) {
- FieldSpec f = spec.getField(values.get(i));
+ for (String value : values) {
+ FieldSpec f = spec.getField(value);
if (f != null) {
fieldIds.add(f.position());
}
View
3  client/src/main/java/com/cloudera/science/ml/client/params/SummaryParameters.java
@@ -19,9 +19,6 @@
import com.beust.jcommander.ParametersDelegate;
import com.cloudera.science.ml.parallel.summary.Summary;
-/**
- *
- */
public class SummaryParameters {
@ParametersDelegate
View
12 client/src/main/java/com/cloudera/science/ml/client/util/AvroIO.java
@@ -28,15 +28,15 @@
import com.google.common.collect.Lists;
-/**
- *
- */
-public class AvroIO {
-
+public final class AvroIO {
+
+ private AvroIO() {
+ }
+
public static <T extends SpecificRecord> void write(List<T> values, File file)
throws IOException {
Schema s = values.get(0).getSchema();
- Class clazz = values.get(0).getClass();
+ Class<T> clazz = (Class<T>) values.get(0).getClass();
DataFileWriter<T> dfw = new DataFileWriter<T>(new SpecificDatumWriter<T>(clazz));
dfw.create(s, file);
for (T t : values) {
View
3  core/src/main/java/com/cloudera/science/ml/core/records/BasicSpec.java
@@ -19,9 +19,6 @@
import com.google.common.collect.ImmutableList;
-/**
- *
- */
public class BasicSpec implements Spec {
private final DataType dataType;
View
2  core/src/main/java/com/cloudera/science/ml/core/records/DataType.java
@@ -18,5 +18,5 @@
* A list of currently supported types for {@code Record} instances.
*/
public enum DataType {
- BOOLEAN, INT, LONG, DOUBLE, STRING, RECORD;
+ BOOLEAN, INT, LONG, DOUBLE, STRING, RECORD
}
View
16 core/src/main/java/com/cloudera/science/ml/core/records/RecordSpec.java
@@ -57,9 +57,9 @@ public FieldSpec getField(int index) {
@Override
public FieldSpec getField(String fieldName) {
- for (int i = 0; i < fields.size(); i++) {
- if (fields.get(i).name().equals(fieldName)) {
- return fields.get(i);
+ for (FieldSpec field : fields) {
+ if (field.name().equals(fieldName)) {
+ return field;
}
}
return null;
@@ -74,7 +74,7 @@ public static Builder builder(Spec base) {
}
public static class Builder {
- List<FieldSpec> fields = Lists.newArrayList();
+ private final List<FieldSpec> fields = Lists.newArrayList();
public Builder() { }
@@ -120,11 +120,11 @@ public Spec build() {
}
private static class FieldSpecImpl implements FieldSpec {
- private String name;
- private int position;
- private Spec spec;
+ private final String name;
+ private final int position;
+ private final Spec spec;
- public FieldSpecImpl(String name, int position, Spec spec) {
+ private FieldSpecImpl(String name, int position, Spec spec) {
this.name = name;
this.position = position;
this.spec = spec;
View
3  core/src/main/java/com/cloudera/science/ml/core/records/avro/AvroFieldSpec.java
@@ -19,9 +19,6 @@
import com.cloudera.science.ml.core.records.FieldSpec;
import com.cloudera.science.ml.core.records.Spec;
-/**
- *
- */
public class AvroFieldSpec implements FieldSpec {
private final String name;
View
6 core/src/main/java/com/cloudera/science/ml/core/records/avro/AvroRecord.java
@@ -19,12 +19,9 @@
import com.cloudera.science.ml.core.records.Record;
import com.cloudera.science.ml.core.records.Spec;
-/**
- *
- */
public class AvroRecord implements Record {
- private GenericData.Record impl;
+ private final GenericData.Record impl;
public AvroRecord(GenericData.Record impl) {
this.impl = impl;
@@ -34,6 +31,7 @@ public AvroRecord(GenericData.Record impl) {
return impl;
}
+ @Override
public Record copy(boolean deep) {
if (deep) {
return new AvroRecord(new GenericData.Record(impl, true));
View
4 core/src/main/java/com/cloudera/science/ml/core/records/avro/AvroSpec.java
@@ -69,7 +69,7 @@ public Schema getSchema() {
return schema;
}
- private List<Schema.Field> getFields() {
+ private List<Field> getFields() {
return getSchema().getFields();
}
@@ -80,7 +80,7 @@ public int size() {
@Override
public List<String> getFieldNames() {
- return Lists.transform(getFields(), new Function<Schema.Field, String>() {
+ return Lists.transform(getFields(), new Function<Field, String>() {
@Override
public String apply(Field input) {
return input.name();
View
30 core/src/main/java/com/cloudera/science/ml/core/records/avro/Spec2Schema.java
@@ -18,28 +18,14 @@
import org.apache.avro.Schema;
-import com.cloudera.science.ml.core.records.DataType;
import com.cloudera.science.ml.core.records.FieldSpec;
import com.cloudera.science.ml.core.records.Spec;
import com.google.common.collect.Lists;
-/**
- *
- */
-public class Spec2Schema {
+public final class Spec2Schema {
public static Schema create(Spec spec) {
- if (DataType.RECORD == spec.getDataType()) {
- List<Schema.Field> fields = Lists.newArrayList();
- for (int i = 0; i < spec.size(); i++) {
- FieldSpec f = spec.getField(i);
- fields.add(new Schema.Field(f.name(), create(f.spec()), "", null));
- }
- Schema s = Schema.createRecord("R" + spec.hashCode(), "", "", false);
- s.setFields(fields);
- return s;
- } else {
- switch (spec.getDataType()) {
+ switch (spec.getDataType()) {
case INT:
return Schema.create(Schema.Type.INT);
case LONG:
@@ -50,9 +36,17 @@ public static Schema create(Spec spec) {
return Schema.create(Schema.Type.BOOLEAN);
case STRING:
return Schema.create(Schema.Type.STRING);
- }
- return null;
+ case RECORD:
+ List<Schema.Field> fields = Lists.newArrayList();
+ for (int i = 0; i < spec.size(); i++) {
+ FieldSpec f = spec.getField(i);
+ fields.add(new Schema.Field(f.name(), create(f.spec()), "", null));
+ }
+ Schema s = Schema.createRecord("R" + spec.hashCode(), "", "", false);
+ s.setFields(fields);
+ return s;
}
+ return null;
}
private Spec2Schema() {}
View
8 core/src/main/java/com/cloudera/science/ml/core/records/csv/CSVRecord.java
@@ -45,11 +45,11 @@ public Spec getSpec() {
@Override
public Record copy(boolean deep) {
- if (!deep) {
- List<String> v = Arrays.asList(new String[values.size()]);
+ if (deep) {
+ List<String> v = Lists.newArrayList(values);
return new CSVRecord(v);
} else {
- List<String> v = Lists.newArrayList(values);
+ List<String> v = Arrays.asList(new String[values.size()]);
return new CSVRecord(v);
}
}
@@ -130,7 +130,7 @@ public String getAsString(int index) {
public double getAsDouble(int index) {
try {
return Double.valueOf(values.get(index));
- } catch (NumberFormatException e) {
+ } catch (NumberFormatException ignored) {
return Double.NaN;
}
}
View
3  core/src/main/java/com/cloudera/science/ml/core/records/csv/CSVSpec.java
@@ -24,9 +24,6 @@
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
-/**
- *
- */
public class CSVSpec implements Spec {
private final DataType dataType;
View
1  core/src/main/java/com/cloudera/science/ml/core/records/vectors/VectorRecord.java
@@ -50,6 +50,7 @@ public Vector getVector() {
return vector;
}
+ @Override
public Object get(int index) {
return vector.getQuick(index);
}
View
10 core/src/main/java/com/cloudera/science/ml/core/vectors/Centers.java
@@ -53,12 +53,13 @@ public Centers(Vector... points) {
*/
public Centers(Iterable<Vector> points) {
this.centers = ImmutableList.copyOf(Sets.newLinkedHashSet(points));
- Preconditions.checkArgument(this.centers.size() > 0);
+ Preconditions.checkArgument(!this.centers.isEmpty());
}
/**
* Returns the number of points in this instance.
*/
+ @Override
public int size() {
return centers.size();
}
@@ -66,6 +67,7 @@ public int size() {
/**
* Returns the {@code Vector} at the given index.
*/
+ @Override
public Vector get(int index) {
return centers.get(index);
}
@@ -148,7 +150,7 @@ public double getSumOfSquaredDistances(Centers other) {
@Override
public boolean equals(Object other) {
- if (other == null || !(other instanceof Centers)) {
+ if (!(other instanceof Centers)) {
return false;
}
Centers c = (Centers) other;
@@ -158,8 +160,8 @@ public boolean equals(Object other) {
@Override
public int hashCode() {
int hc = 0;
- for (int i = 0; i < centers.size(); i++) {
- hc += centers.get(i).hashCode();
+ for (Vector center : centers) {
+ hc += center.hashCode();
}
return hc;
}
View
14 core/src/main/java/com/cloudera/science/ml/core/vectors/VectorConvert.java
@@ -59,10 +59,10 @@ public static MLCenters fromCenters(Centers input) {
return FROM_CENTERS.apply(input);
}
- public static Function<MLVector, Vector> TO_VECTOR = new Function<MLVector, Vector>() {
+ public static final Function<MLVector, Vector> TO_VECTOR = new Function<MLVector, Vector>() {
@Override
public Vector apply(MLVector input) {
- Vector base = null;
+ Vector base;
if (input.getIndices().isEmpty()) {
double[] d = new double[input.getSize()];
for (int i = 0; i < d.length; i++) {
@@ -83,7 +83,7 @@ public Vector apply(MLVector input) {
}
};
- public static Function<Vector, MLVector> FROM_VECTOR = new Function<Vector, MLVector>() {
+ public static final Function<Vector, MLVector> FROM_VECTOR = new Function<Vector, MLVector>() {
@Override
public MLVector apply(Vector input) {
List<Double> values = Lists.newArrayList();
@@ -114,14 +114,14 @@ public MLVector apply(Vector input) {
}
};
- public static Function<MLWeightedVector, Weighted<Vector>> TO_WEIGHTED_VEC = new Function<MLWeightedVector, Weighted<Vector>>() {
+ public static final Function<MLWeightedVector, Weighted<Vector>> TO_WEIGHTED_VEC = new Function<MLWeightedVector, Weighted<Vector>>() {
@Override
public Weighted<Vector> apply(MLWeightedVector input) {
return new Weighted<Vector>(TO_VECTOR.apply(input.getVec()), input.getWeight());
}
};
- public static Function<Weighted<Vector>, MLWeightedVector> FROM_WEIGHTED_VEC = new Function<Weighted<Vector>, MLWeightedVector>() {
+ public static final Function<Weighted<Vector>, MLWeightedVector> FROM_WEIGHTED_VEC = new Function<Weighted<Vector>, MLWeightedVector>() {
@Override
public MLWeightedVector apply(Weighted<Vector> input) {
MLWeightedVector.Builder b = MLWeightedVector.newBuilder();
@@ -130,14 +130,14 @@ public MLWeightedVector apply(Weighted<Vector> input) {
}
};
- public static Function<MLCenters, Centers> TO_CENTERS = new Function<MLCenters, Centers>() {
+ public static final Function<MLCenters, Centers> TO_CENTERS = new Function<MLCenters, Centers>() {
@Override
public Centers apply(MLCenters input) {
return new Centers(Lists.transform(input.getCenters(), TO_VECTOR));
}
};
- public static Function<Centers, MLCenters> FROM_CENTERS = new Function<Centers, MLCenters>() {
+ public static final Function<Centers, MLCenters> FROM_CENTERS = new Function<Centers, MLCenters>() {
@Override
public MLCenters apply(Centers input) {
MLCenters.Builder b = MLCenters.newBuilder();
View
2  core/src/main/java/com/cloudera/science/ml/core/vectors/Vectors.java
@@ -22,7 +22,7 @@
/**
* Factory methods for working with {@code Vector} objects.
*/
-public class Vectors {
+public final class Vectors {
/**
* Converts the given {@code Vector} into a {@code double[]}.
View
8 core/src/main/java/com/cloudera/science/ml/core/vectors/Weighted.java
@@ -93,7 +93,7 @@ public double weight() {
@Override
public boolean equals(Object other) {
- if (other == null || !(other instanceof Weighted)) {
+ if (!(other instanceof Weighted)) {
return false;
}
Weighted<T> wv = (Weighted<T>) other;
@@ -107,10 +107,6 @@ public int hashCode() {
@Override
public String toString() {
- return new StringBuilder()
- .append(thing)
- .append(";")
- .append(weight)
- .toString();
+ return String.valueOf(thing) + ';' + weight;
}
}
View
11 core/src/test/java/com/cloudera/science/ml/core/vectors/CentersTest.java
@@ -19,15 +19,12 @@
import org.apache.mahout.math.Vector;
import org.junit.Test;
-import com.cloudera.science.ml.core.vectors.Centers;
-import com.cloudera.science.ml.core.vectors.Vectors;
-
public class CentersTest {
- private static double THRESH = 0.001;
+ private static final double THRESH = 0.001;
- Vector a = Vectors.of(17.0, 29.0);
- Vector b = Vectors.of(18.0, 27.0);
- Vector c = Vectors.of(16.0, 25.0);
+ private final Vector a = Vectors.of(17.0, 29.0);
+ private final Vector b = Vectors.of(18.0, 27.0);
+ private final Vector c = Vectors.of(16.0, 25.0);
@Test
public void testSingleton() throws Exception {
View
25 kmeans-parallel/src/main/java/com/cloudera/science/ml/kmeans/parallel/CentersIndex.java
@@ -56,11 +56,11 @@ public Distances(double[] clusterDistances, int[] closestPoints) {
}
}
- public CentersIndex(int numClusterings, int dimensions) {
+ CentersIndex(int numClusterings, int dimensions) {
this(numClusterings, dimensions, 128, 10, 1729L);
}
- public CentersIndex(int numClusterings, int dimensions, int projectionBits, int projectionSamples,
+ CentersIndex(int numClusterings, int dimensions, int projectionBits, int projectionSamples,
long seed) {
this.pointsPerCenter = new int[numClusterings];
this.indices = Lists.newArrayList();
@@ -76,11 +76,11 @@ public CentersIndex(int numClusterings, int dimensions, int projectionBits, int
this.seed = seed;
}
- public CentersIndex(List<Centers> centers) {
+ CentersIndex(List<Centers> centers) {
this(centers, 128, 10, 1729L);
}
- public CentersIndex(List<Centers> centers, int projectionBits, int projectionSamples, long seed) {
+ CentersIndex(List<Centers> centers, int projectionBits, int projectionSamples, long seed) {
this(centers.size(), centers.get(0).get(0).size(), projectionBits, projectionSamples, seed);
for (int centerId = 0; centerId < centers.size(); centerId++) {
for (Vector v : centers.get(centerId)) {
@@ -106,11 +106,10 @@ private void buildIndices() {
}
}
indices.clear();
- for (int i = 0; i < points.size(); i++) {
- List<double[]> px = points.get(i);
+ for (List<double[]> px : points) {
List<BitSet> indx = Lists.newArrayList();
- for (int j = 0; j < px.size(); j++) {
- indx.add(index(Vectors.of(px.get(j))));
+ for (double[] aPx : px) {
+ indx.add(index(Vectors.of(aPx)));
}
indices.add(indx);
}
@@ -211,14 +210,20 @@ public Distances getDistances(Vector vec, boolean approx) {
int distance;
int index;
- public Idx(int distance, int index) {
+ Idx(int distance, int index) {
this.distance = distance;
this.index = index;
}
@Override
public int compareTo(Idx idx) {
- return distance - idx.distance;
+ if (distance < idx.distance) {
+ return -1;
+ }
+ if (distance > idx.distance) {
+ return 1;
+ }
+ return 0;
}
}
View
35 kmeans-parallel/src/main/java/com/cloudera/science/ml/kmeans/parallel/KMeansParallel.java
@@ -55,8 +55,9 @@
* <p>An implementation of the k-means|| algorithm, as described in
* <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">Bahmani et al. (2012)</a>
*
- * <p>The main algorithm is executed by the {@code compute} method, which takes a number of
- * configured {@link KMPConfig} instances and runs over a given dataset of points for a fixed number
+ * <p>The main algorithm is executed by the {@link #computeClusterAssignments(PCollection, List, PType)}
+ * method, which takes a number of
+ * configured instances and runs over a given dataset of points for a fixed number
* of iterations in order to find a candidate set of points to stream into the client and
* cluster using the in-memory k-means algorithms defined in the {@code kmeans} package.
*/
@@ -103,7 +104,7 @@ public KMeansParallel(Random random, int projectionBits, int projectionSamples)
* @return A reference to the Crunch job that calculates the cost for each centers instance
*/
public <V extends Vector> PObject<List<Double>> getCosts(PCollection<V> vecs, List<Centers> centers) {
- Preconditions.checkArgument(centers.size() > 0, "No centers specified");
+ Preconditions.checkArgument(!centers.isEmpty(), "No centers specified");
return getCosts(vecs, createIndex(centers));
}
@@ -119,7 +120,7 @@ private CentersIndex createIndex(List<Centers> centers) {
return getCosts(vecs, Arrays.asList(centers));
}
- private <V extends Vector> PObject<List<Double>> getCosts(PCollection<V> vecs, CentersIndex centers) {
+ private static <V extends Vector> PObject<List<Double>> getCosts(PCollection<V> vecs, CentersIndex centers) {
return new ListPObject<Double>(vecs
.parallelDo("center-costs", new CenterCostFn<V>(centers), tableOf(ints(), doubles()))
.groupByKey(1)
@@ -153,12 +154,12 @@ private CentersIndex createIndex(List<Centers> centers) {
*/
public <V extends Vector> PObject<List<List<Long>>> getCountsOfClosest(
PCollection<V> vecs, List<Centers> centers) {
- Preconditions.checkArgument(centers.size() > 0, "No centers specified");
+ Preconditions.checkArgument(!centers.isEmpty(), "No centers specified");
Crossfold cf = new Crossfold(1); //TODO
return getCountsOfClosest(cf.apply(vecs), createIndex(centers));
}
- private <V extends Vector> PObject<List<List<Long>>> getCountsOfClosest(
+ private static <V extends Vector> PObject<List<List<Long>>> getCountsOfClosest(
PCollection<Pair<Integer, V>> vecs, CentersIndex centers) {
return new ListOfListsPObject<Long>(
vecs
@@ -190,10 +191,10 @@ private CentersIndex createIndex(List<Centers> centers) {
CentersIndex centers = new CentersIndex(crossfold.getNumFolds(),
initialPoints.get(0).size(), projectionBits, projectionSamples,
random == null ? System.currentTimeMillis() : random.nextLong());
-
- for (int i = 0; i < initialPoints.size(); i++) {
+
+ for (Vector initialPoint : initialPoints) {
for (int j = 0; j < lValues.length; j++) {
- centers.add(initialPoints.get(i), j);
+ centers.add(initialPoint, j);
}
}
@@ -212,13 +213,13 @@ private CentersIndex createIndex(List<Centers> centers) {
return getWeightedVectors(folds, centers);
}
- private <V extends Vector> List<List<Weighted<Vector>>> getWeightedVectors(
+ private static <V extends Vector> List<List<Weighted<Vector>>> getWeightedVectors(
PCollection<Pair<Integer, V>> folds, CentersIndex centers) {
List<List<Long>> indexWeights = getCountsOfClosest(folds, centers).getValue();
return centers.getWeightedVectors(indexWeights);
}
- private <V extends Vector> void updateCenters(
+ private static <V extends Vector> void updateCenters(
Iterable<Pair<Integer, V>> vecs,
CentersIndex centers) {
for (Pair<Integer, V> p : vecs) {
@@ -227,9 +228,9 @@ private CentersIndex createIndex(List<Centers> centers) {
}
private static class ScoringFn<V extends Vector> extends DoFn<Pair<Integer, V>, Pair<Integer, Pair<V, Double>>> {
- private CentersIndex centers;
+ private final CentersIndex centers;
- public ScoringFn(CentersIndex centers) {
+ private ScoringFn(CentersIndex centers) {
this.centers = centers;
}
@@ -246,7 +247,7 @@ public void process(Pair<Integer, V> in, Emitter<Pair<Integer, Pair<V, Double>>>
private static class ClosestCenterFn<V extends Vector> extends DoFn<Pair<Integer, V>, Pair<Integer, Integer>> {
private final CentersIndex centers;
- public ClosestCenterFn(CentersIndex centers) {
+ private ClosestCenterFn(CentersIndex centers) {
this.centers = centers;
}
@@ -260,7 +261,7 @@ public void process(Pair<Integer, V> in, Emitter<Pair<Integer, Integer>> emitter
private static class AssignedCenterFn<V extends Vector> extends DoFn<V, Record> {
private final CentersIndex centers;
- public AssignedCenterFn(CentersIndex centers) {
+ private AssignedCenterFn(CentersIndex centers) {
this.centers = centers;
}
@@ -282,9 +283,9 @@ public void process(V vec, Emitter<Record> emitter) {
private static class CenterCostFn<V extends Vector> extends DoFn<V, Pair<Integer, Double>> {
private final CentersIndex centers;
- private double[] currentCosts;
+ private final double[] currentCosts;
- public CenterCostFn(CentersIndex centers) {
+ private CenterCostFn(CentersIndex centers) {
this.centers = centers;
this.currentCosts = new double[centers.getNumCenters()];
}
View
12 ...ns-parallel/src/test/java/com/cloudera/science/ml/kmeans/parallel/KMeansParallelTest.java
@@ -30,15 +30,13 @@
import com.cloudera.science.ml.core.vectors.Weighted;
import com.cloudera.science.ml.kmeans.core.KMeans;
import com.cloudera.science.ml.parallel.crossfold.Crossfold;
-import com.cloudera.science.ml.parallel.records.Records;
import com.cloudera.science.ml.parallel.types.MLAvros;
-import com.cloudera.science.ml.parallel.types.MLRecords;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Lists;
public class KMeansParallelTest {
- private PCollection<Vector> vecs = MemPipeline.typedCollectionOf(
+ private final PCollection<Vector> vecs = MemPipeline.typedCollectionOf(
MLAvros.vector(),
Vectors.of(2.0, 1.0),
Vectors.of(1.0, 1.0),
@@ -58,23 +56,23 @@
Vectors.of(4.0, 3.0));
private KMeansParallel kmp;
- private Random r = new Random(29L);
+ private final Random r = new Random(29L);
@Before
- public void setUp() throws Exception {
+ public void setUp() {
kmp = new KMeansParallel(r, 128, 32);
}
@Test
public void testBasic() throws Exception {
List<Vector> initialPoints = ImmutableList.of(Vectors.of(1.0, 1.0));
- KMeans km = new KMeans();
-
+
List<List<Weighted<Vector>>> points = kmp.initialization(vecs, 5, 4, initialPoints,
new Crossfold(2, 1729L));
List<Centers> centers = Lists.newArrayList();
List<Weighted<Vector>> allPoints = Lists.newArrayList(points.get(0));
allPoints.addAll(points.get(1));
+ KMeans km = new KMeans();
centers.add(km.compute(allPoints, 1, new Random(17)));
centers.add(km.compute(allPoints, 2, new Random(17)));
centers.add(km.compute(allPoints, 3, new Random(17)));
View
12 kmeans/src/main/java/com/cloudera/science/ml/kmeans/core/KMeans.java
@@ -43,9 +43,6 @@
/**
* Constructor that uses the k-means++ initialization strategy and
* a 1000-iteration stopping criteria.
- *
- * @param numClusters The number of clusters to create
- * @param stoppingCriteria The stopping criteria to use for Lloyd's algorithm
*/
public KMeans() {
this(KMeansInitStrategy.PLUS_PLUS, StoppingCriteria.threshold(1000));
@@ -98,7 +95,8 @@ public KMeans(
* @return The centers that the algorithm converged toward
*/
public <V extends Vector> Centers lloydsAlgorithm(Collection<Weighted<V>> points, Centers centers) {
- Centers current = centers, last = null;
+ Centers current = centers;
+ Centers last = null;
int iteration = 0;
while (!stoppingCriteria.stop(iteration, current, last)) {
last = current;
@@ -125,10 +123,10 @@ public KMeans(
}
List<Vector> centroids = Lists.newArrayList();
for (Map.Entry<Integer, List<Weighted<V>>> e : assignments.entrySet()) {
- if (e.getValue().size() > 0) {
- centroids.add(centroid(e.getValue()));
- } else {
+ if (e.getValue().isEmpty()) {
centroids.add(centers.get(e.getKey())); // fix the no-op center
+ } else {
+ centroids.add(centroid(e.getValue()));
}
}
return new Centers(centroids);
View
3  kmeans/src/main/java/com/cloudera/science/ml/kmeans/core/KMeansEvaluation.java
@@ -76,7 +76,8 @@ private void init() {
for (int i = 0; i < testCenters.size(); i++) {
Centers test = testCenters.get(i);
Centers train = trainCenters.get(i);
- double trainCost = 0.0, testCost = 0.0;
+ double trainCost = 0.0;
+ double testCost = 0.0;
double[][] assignments = new double[test.size()][train.size()];
int totalPoints = 0;
for (Weighted<Vector> wv : testPoints) {
View
13 kmeans/src/main/java/com/cloudera/science/ml/kmeans/core/StoppingCriteria.java
@@ -75,24 +75,21 @@ public static StoppingCriteria or(StoppingCriteria... criteria) {
private static class ThresholdStoppingCriteria extends StoppingCriteria {
private final double threshold;
- public ThresholdStoppingCriteria(double threshold) {
+ private ThresholdStoppingCriteria(double threshold) {
Preconditions.checkArgument(threshold > 0);
this.threshold = threshold;
}
@Override
public boolean stop(int iteration, Centers current, Centers last) {
- if (last == null) {
- return false;
- }
- return last.getSumOfSquaredDistances(current) < threshold;
+ return last != null && last.getSumOfSquaredDistances(current) < threshold;
}
}
private static class MaxIterationStoppingCriteria extends StoppingCriteria {
private final int maxIterations;
- public MaxIterationStoppingCriteria(int maxIterations) {
+ private MaxIterationStoppingCriteria(int maxIterations) {
Preconditions.checkArgument(maxIterations > 0);
this.maxIterations = maxIterations;
}
@@ -104,9 +101,9 @@ public boolean stop(int iteration, Centers current, Centers last) {
}
private static class OrStoppingCriteria extends StoppingCriteria {
- private StoppingCriteria[] criteria;
+ private final StoppingCriteria[] criteria;
- public OrStoppingCriteria(StoppingCriteria[] criteria) {
+ private OrStoppingCriteria(StoppingCriteria[] criteria) {
Preconditions.checkArgument(criteria.length > 0);
this.criteria = criteria;
}
View
17 kmeans/src/test/java/com/cloudera/science/ml/kmeans/core/KMeansTest.java
@@ -30,21 +30,20 @@
public class KMeansTest {
- Weighted<Vector> a = wpoint(1.0, 1.0);
- Weighted<Vector> b = wpoint(5.0, 4.0);
- Weighted<Vector> c = wpoint(4.0, 3.0);
- Weighted<Vector> d = wpoint(2.0, 1.0);
- List<Weighted<Vector>> points = ImmutableList.of(a, b, c, d);
-
- KMeans kmeans = new KMeans();
+ private final Weighted<Vector> a = wpoint(1.0, 1.0);
+ private final Weighted<Vector> b = wpoint(5.0, 4.0);
+ private final Weighted<Vector> c = wpoint(4.0, 3.0);
+ private final Weighted<Vector> d = wpoint(2.0, 1.0);
+ private final List<Weighted<Vector>> points = ImmutableList.of(a, b, c, d);
+ private final KMeans kmeans = new KMeans();
private Random rand;
@Before
- public void setUp() throws Exception {
+ public void setUp() {
rand = new Random(1729L);
}
- public Vector vec(double... values) {
+ public static Vector vec(double... values) {
return Vectors.of(values);
}
View
11 mahout/src/main/java/com/cloudera/science/ml/mahout/types/MLWritables.java
@@ -18,6 +18,7 @@
import org.apache.crunch.types.PType;
import org.apache.crunch.types.writable.WritableType;
import org.apache.crunch.types.writable.Writables;
+import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
@@ -25,11 +26,11 @@
/**
* Factory methods for creating {@code PType} instances for use with the ML Parallel libraries.
*/
-public class MLWritables {
+public final class MLWritables {
static {
- Writables.register(Vector.class, (WritableType) vector());
- Writables.register(NamedVector.class, (WritableType) vector(NamedVector.class));
+ Writables.register(Vector.class, (WritableType<Vector,? extends Writable>) vector());
+ Writables.register(NamedVector.class, (WritableType<NamedVector,? extends Writable>) vector(NamedVector.class));
}
/**
@@ -46,14 +47,14 @@
Writables.writables(VectorWritable.class));
}
- private static MapFn<VectorWritable, Vector> VEC_IN = new MapFn<VectorWritable, Vector>() {
+ private static final MapFn<VectorWritable, Vector> VEC_IN = new MapFn<VectorWritable, Vector>() {
@Override
public Vector map(VectorWritable input) {
return input.get();
}
};
- private static MapFn<Vector, VectorWritable> VEC_OUT = new MapFn<Vector, VectorWritable>() {
+ private static final MapFn<Vector, VectorWritable> VEC_OUT = new MapFn<Vector, VectorWritable>() {
@Override
public VectorWritable map(Vector input) {
return new VectorWritable(input);
View
4 parallel/src/main/java/com/cloudera/science/ml/parallel/crossfold/Crossfold.java
@@ -36,8 +36,8 @@
*/
public static final long DEFAULT_SEED = 1729L;
- private int numFolds;
- private long seed;
+ private final int numFolds;
+ private final long seed;
public Crossfold(int numFolds) {
this(numFolds, DEFAULT_SEED);
View
4 parallel/src/main/java/com/cloudera/science/ml/parallel/fn/ShuffleFns.java
@@ -38,7 +38,7 @@
private transient Random rand;
- public ShuffleFn(Long seed, int numSplits) {
+ private ShuffleFn(Long seed, int numSplits) {
this.seed = seed;
this.numSplits = numSplits;
}
@@ -63,7 +63,7 @@ public void initialize() {
private transient Random random;
- public CVShuffleFn(int numSplits, Long seed, int numFolds) {
+ private CVShuffleFn(int numSplits, Long seed, int numFolds) {
this.numSplits = numSplits;
this.seed = seed;
this.numFolds = numFolds;
View
8 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/Normalizer.java
@@ -22,7 +22,6 @@
import org.apache.commons.logging.LogFactory;
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
-import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.types.PType;
import org.apache.mahout.math.NamedVector;
@@ -60,7 +59,7 @@ public static Builder builder() {
private Boolean sparse = null;
private int idColumn = -1;
private Transform defaultTransform = Transform.NONE;
- private Map<Integer, Transform> transforms = Maps.newHashMap();
+ private final Map<Integer, Transform> transforms = Maps.newHashMap();
public Builder summary(Summary s) {
if (s != null) {
@@ -118,7 +117,7 @@ private Normalizer(Summary summary, Boolean sparse, int idColumn,
@Override
public void process(Record record, Emitter<V> emitter) {
int len = record.getSpec().size() + expansion;
- Vector v = null;
+ Vector v;
if (record instanceof VectorRecord) {
v = ((VectorRecord) record).getVector().like();
} else if (sparse) {
@@ -154,9 +153,8 @@ public void process(Record record, Emitter<V> emitter) {
LOG.warn(String.format("Unknown categorical value encountered for field %d: '%s', skipping...",
i, record.getAsString(i)));
return;
- } else {
- v.setQuick(offset + index, ss.getScale());
}
+ v.setQuick(offset + index, ss.getScale());
offset += ss.numLevels();
}
}
View
7 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/StringSplitFn.java
@@ -24,9 +24,6 @@
import com.cloudera.science.ml.core.records.csv.CSVRecord;
import com.cloudera.science.ml.parallel.types.MLRecords;
-/**
- *
- */
public class StringSplitFn extends DoFn<String, Record> {
private final String delim;
@@ -50,9 +47,7 @@ public StringSplitFn(String delim, Pattern ignoredLines) {
@Override
public void process(String line, Emitter<Record> emitter) {
- if (line == null || line.isEmpty()) {
- return;
- } else if (ignoredLines != null && ignoredLines.matcher(line).find()) {
+ if (line == null || line.isEmpty() || ignoredLines != null && ignoredLines.matcher(line).find()) {
return;
}
emitter.emit(new CSVRecord(line.split(delim)));
View
11 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/Transform.java
@@ -18,9 +18,6 @@
import com.cloudera.science.ml.parallel.summary.SummaryStats;
-/**
- *
- */
public abstract class Transform implements Serializable {
public abstract double apply(double value, SummaryStats stats);
@@ -28,19 +25,22 @@
public static Transform forName(String name) {
if ("z".equalsIgnoreCase(name)) {
return Z;
- } else if ("linear".equalsIgnoreCase(name)) {
+ }
+ if ("linear".equalsIgnoreCase(name)) {
return LINEAR;
}
return NONE;
}
public static final Transform NONE = new Transform() {
+ @Override
public double apply(double value, SummaryStats stats) {
return value;
}
};
public static final Transform Z = new Transform() {
+ @Override
public double apply(double value, SummaryStats stats) {
if (stats.stdDev() == 0.0) {
return value;
@@ -50,11 +50,12 @@ public double apply(double value, SummaryStats stats) {
};
public static final Transform LINEAR = new Transform() {
+ @Override
public double apply(double value, SummaryStats stats) {
if (stats.range() == 0.0) {
return value;
}
- return (value - stats.min()) / (stats.range());
+ return (value - stats.min()) / stats.range();
}
};
}
View
7 parallel/src/main/java/com/cloudera/science/ml/parallel/normalize/VectorScaling.java
@@ -24,7 +24,10 @@
* Functions for applying scale factors to a {@code PCollection<Vector>} instance, which is
* useful if we want to weight some features more than others for clustering.
*/
-public class VectorScaling {
+public final class VectorScaling {
+
+ private VectorScaling() {
+ }
public static <V extends Vector> PCollection<V> scale(PCollection<V> vecs, double[] scaleFactors) {
return vecs.parallelDo("scale", new ScaleFn<V>(scaleFactors), vecs.getPType());
@@ -34,7 +37,7 @@
private final double[] scaleFactors;
private transient Vector scaleVec;
- public ScaleFn(double[] scaleFactors) {
+ private ScaleFn(double[] scaleFactors) {
this.scaleFactors = scaleFactors;
}
View
5 parallel/src/main/java/com/cloudera/science/ml/parallel/pivot/MapAggregator.java
@@ -21,12 +21,9 @@
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Maps;
-/**
- *
- */
public class MapAggregator extends SimpleAggregator<Map<String, Stat>> {
- private Map<String, Stat> values = Maps.newHashMap();
+ private final Map<String, Stat> values = Maps.newHashMap();
@Override
public void reset() {
View
34 parallel/src/main/java/com/cloudera/science/ml/parallel/pivot/Pivot.java
@@ -36,14 +36,11 @@
import com.cloudera.science.ml.parallel.types.MLRecords;
import com.google.common.collect.Maps;
-/**
- *
- */
public class Pivot {
- public static enum Agg { SUM, MEAN };
-
- private Spec createSpec(Spec spec, List<Integer> groupColumns) {
+ public enum Agg { SUM, MEAN }
+
+ private static Spec createSpec(Spec spec, List<Integer> groupColumns) {
RecordSpec.Builder b = RecordSpec.builder();
for (Integer c : groupColumns) {
FieldSpec f = spec.getField(c);
@@ -66,18 +63,17 @@ public Records pivot(SummarizedRecords records,
RecordSpec.Builder b = RecordSpec.builder(keySpec);
SummaryStats attrStats = summary.getStats(attributeColumn);
SummaryStats valueStats = summary.getStats(valueColumn);
- List<String> levels = null;
if (!valueStats.isNumeric()) {
throw new IllegalArgumentException("Non-numeric value column in pivot op");
- } else if (attrStats.isNumeric() || attrStats.numLevels() == 1) {
+ }
+ if (attrStats.isNumeric() || attrStats.numLevels() == 1) {
throw new IllegalArgumentException("Non-categorical attribute column in pivot op");
- } else {
- levels = attrStats.getLevels();
- for (String level : levels) {
- b.addDouble(level);
- }
}
-
+ List<String> levels = attrStats.getLevels();
+ for (String level : levels) {
+ b.addDouble(level);
+ }
+
Spec outSpec = b.build();
return new Records(records.get().parallelDo("pivotmap",
new PivotMapperFn(keySpec, groupColumns, attributeColumn, valueColumn),
@@ -98,7 +94,7 @@ public Records pivot(SummarizedRecords records,
private final Map<Record, Map<String, Stat>> cache;
private int cacheAdds = 0;
- public PivotMapperFn(Spec spec, List<Integer> groupColumns, int attributeColumn, int valueColumn) {
+ private PivotMapperFn(Spec spec, List<Integer> groupColumns, int attributeColumn, int valueColumn) {
this.spec = spec;
this.groupColumns = groupColumns;
this.attributeColumn = attributeColumn;
@@ -145,11 +141,11 @@ public void cleanup(Emitter<Pair<Record, Map<String, Stat>>> emitter) {
}
private static class PivotFinishFn extends MapFn<Pair<Record, Map<String, Stat>>, Record> {
- private Spec spec;
- private List<String> levels;
- private Agg agg;
+ private final Spec spec;
+ private final List<String> levels;
+ private final Agg agg;
- public PivotFinishFn(Spec spec, List<String> levels, Agg agg) {
+ private PivotFinishFn(Spec spec, List<String> levels, Agg agg) {
this.spec = spec;
this.levels = levels;
this.agg = agg;
View
7 parallel/src/main/java/com/cloudera/science/ml/parallel/pivot/Stat.java
@@ -14,16 +14,13 @@
*/
package com.cloudera.science.ml.parallel.pivot;
-/**
- *
- */
class Stat {
public long count = 0L;
public double sum = 0.0;
- public Stat() { this(0L, 0.0); }
+ Stat() { this(0L, 0.0); }
- public Stat(long count, double sum) {
+ Stat(long count, double sum) {
this.count = count;
this.sum = sum;
}
View
4 parallel/src/main/java/com/cloudera/science/ml/parallel/pobject/ListOfListsPObject.java
@@ -25,8 +25,8 @@
import com.google.common.collect.Maps;
public class ListOfListsPObject<V> extends PObjectImpl<Pair<Pair<Integer, Integer>, V>, List<List<V>>> {
- private V emptyValue;
- private int[] expected;
+ private final V emptyValue;
+ private final int[] expected;
public ListOfListsPObject(PCollection<Pair<Pair<Integer, Integer>, V>> collect, int[] expected,
V empty) {
View
4 parallel/src/main/java/com/cloudera/science/ml/parallel/pobject/ListPObject.java
@@ -34,8 +34,8 @@ public ListPObject(PCollection<Pair<Integer, V>> collect) {
List<Pair<Integer, V>> list = Lists.newArrayList(iterable);
Collections.sort(list);
List<V> ret = Lists.newArrayList();
- for (int i = 0; i < list.size(); i++) {
- ret.add(list.get(i).second());
+ for (Pair<Integer, V> aList : list) {
+ ret.add(aList.second());
}
return ret;
}
View
3  parallel/src/main/java/com/cloudera/science/ml/parallel/records/Records.java
@@ -19,9 +19,6 @@
import com.cloudera.science.ml.core.records.Record;
import com.cloudera.science.ml.core.records.Spec;
-/**
- *
- */
public class Records {
private final Spec spec;
private final PCollection<Record> records;
View
3  parallel/src/main/java/com/cloudera/science/ml/parallel/records/SummarizedRecords.java
@@ -19,9 +19,6 @@
import com.cloudera.science.ml.core.records.Record;
import com.cloudera.science.ml.parallel.summary.Summary;
-/**
- *
- */
public class SummarizedRecords extends Records {
private final Summary summary;
View
22 parallel/src/main/java/com/cloudera/science/ml/parallel/sample/ReservoirSampling.java
@@ -39,8 +39,11 @@
* and Spirakis (2005)</a>.
*
*/
-public class ReservoirSampling {
-
+public final class ReservoirSampling {
+
+ private ReservoirSampling() {
+ }
+
public static <T> PCollection<T> sample(
PCollection<T> input,
int sampleSize) {
@@ -55,6 +58,7 @@
PType<Pair<T, Integer>> ptype = ptf.pairs(input.getPType(), ptf.ints());
return weightedSample(
input.parallelDo(new MapFn<T, Pair<T, Integer>>() {
+ @Override
public Pair<T, Integer> map(T t) { return Pair.of(t, 1); }
}, ptype),
sampleSize,
@@ -79,7 +83,7 @@
return Pair.of(0, p);
}
}, ptf.tableOf(ptf.ints(), input.getPType()));
- int[] ss = new int[] { sampleSize };
+ int[] ss = { sampleSize };
return groupedWeightedSample(groupedIn, ss, random)
.parallelDo(new MapFn<Pair<Integer, T>, T>() {
@Override
@@ -112,12 +116,12 @@ public T map(Pair<Integer, T> p) {
private static class SampleFn<T, N extends Number>
extends DoFn<Pair<Integer, Pair<T, N>>, Pair<Integer, Pair<Double, T>>> {
- private int[] sampleSizes;
+ private final int[] sampleSizes;
private transient List<SortedMap<Double, T>> archives;
private transient List<SortedMap<Double, T>> current;
private Random random;
- public SampleFn(int[] sampleSizes, Random random) {
+ private SampleFn(int[] sampleSizes, Random random) {
this.sampleSizes = sampleSizes;
this.random = random;
}
@@ -141,7 +145,7 @@ public void initialize() {
private List<SortedMap<Double, T>> createReservoirs() {
List<SortedMap<Double, T>> ret = Lists.newArrayList();
- for (int i = 0; i < sampleSizes.length; i++) {
+ for (int sampleSize : sampleSizes) {
ret.add(Maps.<Double, T>newTreeMap());
}
return ret;
@@ -185,17 +189,17 @@ public void cleanup(Emitter<Pair<Integer, Pair<Double, T>>> emitter) {
private static class WRSCombineFn<T> extends CombineFn<Integer, Pair<Double, T>> {
- private int[] sampleSizes;
+ private final int[] sampleSizes;
private List<SortedMap<Double, T>> reservoirs;
- public WRSCombineFn(int[] sampleSizes) {
+ private WRSCombineFn(int[] sampleSizes) {
this.sampleSizes = sampleSizes;
}
@Override
public void initialize() {
this.reservoirs = Lists.newArrayList();
- for (int i = 0; i < sampleSizes.length; i++) {
+ for (int dummy : sampleSizes) {
reservoirs.add(Maps.<Double, T>newTreeMap());
}
}
View
13 parallel/src/main/java/com/cloudera/science/ml/parallel/serialize/Serializables.java
@@ -30,16 +30,19 @@
/**
* Utilities for working with {@code Serializable} object types in Crunch.
*/
-public class Serializables {
-
- public static final <T extends Serializable> PType<T> ptype(Class<T> clazz, PTypeFamily ptf) {
+public final class Serializables {
+
+ private Serializables() {
+ }
+
+ public static <T extends Serializable> PType<T> ptype(Class<T> clazz, PTypeFamily ptf) {
return ptf.derived(clazz, new InFn<T>(clazz), new OutFn<T>(), ptf.bytes());
}
private static class InFn<T> extends MapFn<ByteBuffer, T> {
private final Class<T> clazz;
- public InFn(Class<T> clazz) {
+ InFn(Class<T> clazz) {
this.clazz = clazz;
}
@@ -57,7 +60,7 @@ public T map(ByteBuffer input) {
}
}
- private static class OutFn<T> extends MapFn<T, ByteBuffer> {
+ private static class OutFn<T extends Serializable> extends MapFn<T, ByteBuffer> {
@Override
public ByteBuffer map(T input) {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
View
12 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/Entry.java
@@ -20,9 +20,9 @@
public int id;
public long count;
- public Entry() { }
+ Entry() { }
- public Entry(int id) {
+ Entry(int id) {
this.id = id;
this.count = 0;
}
@@ -38,6 +38,12 @@ public Entry inc(long count) {
@Override
public int compareTo(Entry other) {
- return id - other.id;
+ if (id < other.id) {
+ return -1;
+ }
+ if (id > other.id) {
+ return 1;
+ }
+ return 0;
}
}
View
3  parallel/src/main/java/com/cloudera/science/ml/parallel/summary/InternalNumeric.java
@@ -20,9 +20,6 @@
private double sum;
private double sumSq;
private long missing;
-
- public InternalNumeric() {
- }
public Numeric toNumeric(long recordCount) {
if (missing == recordCount) {
View
6 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/InternalStats.java
@@ -24,9 +24,6 @@
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
-/**
- *
- */
class InternalStats {
public static final Aggregator<InternalStats> AGGREGATOR = new SimpleAggregator<InternalStats>() {
@@ -50,9 +47,6 @@ public void update(InternalStats other) {
private InternalNumeric internalNumeric;
private Map<String, Entry> histogram;
- public InternalStats() {
- }
-
public SummaryStats toSummaryStats(String name, long recordCount) {
if (internalNumeric == null) {
if (histogram == null) {
View
16 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/Summarizer.java
@@ -27,24 +27,18 @@
import org.apache.crunch.materialize.pobject.PObjectImpl;
import org.apache.crunch.types.avro.Avros;
-import com.cloudera.science.ml.core.records.FieldSpec;
import com.cloudera.science.ml.core.records.Record;
import com.cloudera.science.ml.core.records.Spec;
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
-/**
- *
- */
public class Summarizer {
- private Set<Integer> ignoredColumns = Sets.newHashSet();
+ private final Set<Integer> ignoredColumns = Sets.newHashSet();
private boolean defaultToSymbolic = false;
- private Set<Integer> exceptionColumns = Sets.newHashSet();
+ private final Set<Integer> exceptionColumns = Sets.newHashSet();
private Spec spec = null;
- public Summarizer() { }
-
public Summarizer spec(Spec spec) {
this.spec = spec;
return this;
@@ -88,7 +82,7 @@ public Summarizer exceptionColumns(Iterable<Integer> columns) {
private static class SummaryPObject extends PObjectImpl<Pair<Integer, Pair<Long, InternalStats>>, Summary> {
private final Spec spec;
- public SummaryPObject(Spec spec, PCollection<Pair<Integer, Pair<Long, InternalStats>>> pc) {
+ private SummaryPObject(Spec spec, PCollection<Pair<Integer, Pair<Long, InternalStats>>> pc) {
super(pc);
this.spec = spec;
}
@@ -130,8 +124,8 @@ protected Summary process(Iterable<Pair<Integer, Pair<Long, InternalStats>>> ite
private final Map<Integer, InternalStats> stats;
private long count;
- public SummarizeFn(Set<Integer> ignoreColumns,
- boolean defaultToSymbolic, Set<Integer> exceptionColumns) {
+ private SummarizeFn(Set<Integer> ignoreColumns,
+ boolean defaultToSymbolic, Set<Integer> exceptionColumns) {
this.ignoredColumns = ignoreColumns;
this.defaultToSymbolic = defaultToSymbolic;
this.exceptionColumns = exceptionColumns;
View
5 parallel/src/main/java/com/cloudera/science/ml/parallel/summary/Summary.java