From 92d05bd573407595541f07177c9a826420aaa5d1 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Fri, 5 Jun 2015 11:09:11 +0200 Subject: [PATCH 1/8] Initial version of quickstart guide --- docs/libs/ml/contribution_guide.md | 10 +- docs/libs/ml/quickstart.md | 191 ++++++++++++++++++++++++++++- 2 files changed, 195 insertions(+), 6 deletions(-) diff --git a/docs/libs/ml/contribution_guide.md b/docs/libs/ml/contribution_guide.md index 89f05c0251e29..f0754cbd91b88 100644 --- a/docs/libs/ml/contribution_guide.md +++ b/docs/libs/ml/contribution_guide.md @@ -36,7 +36,7 @@ Everything from this guide also applies to FlinkML. ## Pick a Topic -If you are looking for some new ideas, then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC). +If you are looking for some new ideas you should first look into our [roadmap](vision_roadmap.html#Roadmap), then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC). Once you decide to contribute to one of these issues, you should take ownership of it and track your progress with this issue. That way, the other contributors know the state of the different issues and redundant work is avoided. @@ -61,7 +61,7 @@ Thus, an integration test could look the following: {% highlight scala %} class ExampleITSuite extends FlatSpec with FlinkTestBase { behavior of "An example algorithm" - + it should "do something" in { ... } @@ -81,12 +81,12 @@ Every new algorithm is described by a single markdown file. This file should contain at least the following points: 1. What does the algorithm do -2. How does the algorithm work (or reference to description) +2. How does the algorithm work (or reference to description) 3. Parameter description with default values 4. Code snippet showing how the algorithm is used In order to use latex syntax in the markdown file, you have to include `mathjax: include` in the YAML front matter. - + {% highlight java %} --- mathjax: include @@ -103,4 +103,4 @@ See `docs/_include/latex_commands.html` for the complete list of predefined late ## Contributing Once you have implemented the algorithm with adequate test coverage and added documentation, you are ready to open a pull request. -Details of how to open a pull request can be found [here](http://flink.apache.org/how-to-contribute.html#contributing-code--documentation). +Details of how to open a pull request can be found [here](http://flink.apache.org/how-to-contribute.html#contributing-code--documentation). diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index b8501f8c47bec..e8c2b2ee6ddb6 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -24,4 +24,193 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can open up a Flink shell and load the data as a `DataSet[String]` at first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains("?")) + .map(_.foreach())//??? + +{% endhighlight %} + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM("path/to/a8a") +val adultTest = MLUtils.readLibSVM("path/to/a8a.t") + +{% endhighlight %} + +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier. + +Due to an error in the test dataset we have to adjust the test data using the following code, to +ensure that the dimensionality of all test examples is 123, as with the training set: + +{% highlight scala %} + +val adjustedTest = adultTest.map{lv => + val vec = lv.vector.asBreeze + val padded = vec.padTo(123, 0.0).toDenseVector + val fvec = padded.fromBreeze + LabeledVector(lv.label, fvec) + } + +{% endhighlight %} + +## Classification + +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier. + +{% highlight scala %} + +val svm = SVM() + .setBlocks(env.getParallelism) + .setIterations(100) + .setRegularization(0.001) + .setStepsize(0.1) + .setSeed(42) + +svm.fit(adultTrain) + +{% endhighlight %} + +Let's now make predictions on the test set and see how well we do in terms of absolute error +We will also create a function that thresholds the predictions to the {-1, 1} scale that the +dataset uses. + +{% highlight scala %} + +def thresholdPredictions(predictions: DataSet[(Double, Double)]) +: DataSet[(Double, Double)] = { + predictions.map { + truthPrediction => + val truth = truthPrediction._1 + val prediction = truthPrediction._2 + val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0 + (truth, thresholdedPrediction) + } +} + +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest)) + +val absoluteErrorSum = predictionPairs.collect().map{ + case (truth, prediction) => Math.abs(truth - prediction)}.sum + +println(s"Absolute error: $absoluteErrorSum") + +{% endhighlight %} + +Next we will see if we can improve the performance by pre-processing our data. + +## Data pre-processing and pipelines + +A pre-processing step that is often encouraged when using SVM classification is scaling +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest. +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML, +[here](pipelines.html). + +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier. + +{% highlight scala %} + +import org.apache.flink.ml.preprocessing.StandardScaler + +val scaler = StandardScaler() +scaler.fit(adultTrain) + +val scaledSVM = scaler.chainPredictor(svm) + +{% endhighlight %} + +We can now use our newly created pipeline to make predictions on the test set. +First we call fit again, to train the scaler and the SVM classifier. +The data of the test set will then be automatically scaled before being passed on to the SVM to +make predictions. + +{% highlight scala %} + +scaledSVM.fit(adultTrain) + +val predictionPairsScaled= thresholdPredictions(scaledSVM.predict(predictionsScaled)) + +val absoluteErrorSumScaled = predictionPairs.collect().map{ + case (truth, prediction) => Math.abs(truth - prediction)}.sum + +println(s"Absolute error with scaled features: $absoluteErrorSumScaled") + +{% endhighlight %} + +The effect that the transformation has on the rror for this dataset is a bit unpredictable. +In reality the scaling transformation does +not fit the dataset we are using, since the features are translated categorical features and as +such, operations like normalization and standard scaling do not make much sense. + +## Where to go from here + +This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot more you can do. +We recommend going through the [FlinkML documentation](index.html), and trying out the different algorithms. +A very good way to get started is to play around with interesting datasets from the UCI ML repository and the LibSVM datasets. +Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or [DrivenData](http://www.drivendata.org/) +is also a great way to learn by competing with other data scientists. +If you would like to contribute some new algorithms take a look at our [contribution guide](contribution_guide.html). From 52009fe8b8f9f09dac185bef5b311ed326376a27 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Fri, 5 Jun 2015 11:37:44 +0200 Subject: [PATCH 2/8] Small fix for first example --- docs/libs/ml/quickstart.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index e8c2b2ee6ddb6..932613f8c35ee 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -62,7 +62,7 @@ variable for a regression problem. As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can [download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). -We can open up a Flink shell and load the data as a `DataSet[String]` at first: +We can load the data as a `DataSet[String]` first: {% highlight scala %} @@ -79,10 +79,15 @@ dataset with the FlinkML classification algorithms. val cancerLV = cancer .map(_.productIterator.toList) .filter(!_.contains("?")) - .map(_.foreach())//??? + .map{list => + val numList = list.map(_.asInstanceOf[String].toDouble) + LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) + } {% endhighlight %} +We can then use this data to train a learner. + A common format for ML datasets is the LibSVM format and a number of datasets using that format can be found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. From 055362fc568063414772bd665096c4f09f83519b Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Mon, 8 Jun 2015 16:19:23 +0200 Subject: [PATCH 3/8] Changed datasets, reworked examples --- docs/libs/ml/quickstart.md | 153 ++++++++++++++++--------------------- 1 file changed, 67 insertions(+), 86 deletions(-) diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index 932613f8c35ee..5965623e54ae3 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -1,4 +1,5 @@ --- +mathjax: include htmlTitle: FlinkML - Quickstart Guide title: FlinkML - Quickstart Guide --- @@ -30,22 +31,22 @@ FlinkML is designed to make learning from your data a straight-forward process, the complexities that usually come with having to deal with big data learning tasks. In this quick-start guide we will show just how easy it is to solve a simple supervised learning problem using FlinkML. But first some basics, feel free to skip the next few lines if you're already -familiar with Machine Learning (ML) +familiar with Machine Learning (ML). -As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +As defined by Murphy [1] ML deals with detecting patterns in data, and using those learned patterns to make predictions about the future. We can categorize most ML algorithms into two major categories: Supervised and Unsupervised Learning. * Supervised Learning deals with learning a function (mapping) from a set of inputs -(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +(features) to a set of outputs. The learning is done using a __training set__ of (input, output) pairs that we use to approximate the mapping function. Supervised learning problems are further divided into classification and regression problems. In classification problems we try to predict the __class__ that an example belongs to, for example whether a user is going to click on -an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent -variable, for example what the temperature will be tomorrow. +an ad or not. Regression problems one the other hand, are about predicting (real) numerical +values, often called the dependent variable, for example what the temperature will be tomorrow. * Unsupervised learning deals with discovering patterns and regularities in the data. An example -of this would be __clustering__, where we try to discover groupings of the data from the +of this would be *clustering*, where we try to discover groupings of the data from the descriptive features. Unsupervised learning can also be used for feature selection, for example through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). @@ -58,72 +59,65 @@ object will have a FlinkML `Vector` member representing the features of the exam member which represents the label, which could be the class in a classification problem, or the dependent variable for a regression problem. -# TODO: Get dataset that has separate train and test sets -As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can -[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). +As an example, we can use Haberman's Survival Data Set , which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data. We can load the data as a `DataSet[String]` first: {% highlight scala %} -val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data") +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data") {% endhighlight %} -The dataset has some missing values indicated by `?`. We can filter those rows out and -then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the -dataset with the FlinkML classification algorithms. +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset +is the class label, and the rest are features, wo we can build `LabeledVector` elements like this: {% highlight scala %} -val cancerLV = cancer - .map(_.productIterator.toList) - .filter(!_.contains("?")) - .map{list => +val survivalLV = survival + .map{tuple => + val list = tuple.productIterator.toList val numList = list.map(_.asInstanceOf[String].toDouble) - LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) - } + LabeledVector(numList(3), DenseVector(numList.take(3).toArray)) + } {% endhighlight %} -We can then use this data to train a learner. +We can then use this data to train a learner. We will use another dataset to exemplify building a +learner, that will allow us to show how we can import other dataset formats. + +**LibSVM files** A common format for ML datasets is the LibSVM format and a number of datasets using that format can be found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. You can also save datasets in the LibSVM format using the `writeLibSVM` function. -Let's import the Adult (a9a) dataset. You can download the -[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) -and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). +Let's import the svmguide1 dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t). We can simply import the dataset then using: {% highlight scala %} -val adultTrain = MLUtils.readLibSVM("path/to/a8a") -val adultTest = MLUtils.readLibSVM("path/to/a8a.t") +val astroTrain = MLUtils.readLibSVM("path/to/svmguide1") +val astroTest = MLUtils.readLibSVM("path/to/svmguide1.t") {% endhighlight %} -This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier. - -Due to an error in the test dataset we have to adjust the test data using the following code, to -ensure that the dimensionality of all test examples is 123, as with the training set: - -{% highlight scala %} - -val adjustedTest = adultTest.map{lv => - val vec = lv.vector.asBreeze - val padded = vec.padTo(123, 0.0).toDenseVector - val fvec = padded.fromBreeze - LabeledVector(lv.label, fvec) - } - -{% endhighlight %} +This gives us two `DataSet[LabeledVector]` that we will use in the following section to create a +classifier. ## Classification Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier. +We can set a number of parameters for the classifier. Here we set the `Blocks` parameter, +which is used to split the input by the underlying CoCoA algorithm [2] uses. The regularization +parameter determines the amount of $\lambda_2$ regularization applied, which is used +to avoid overfitting. The step size determines the contribution of the weight vector updates to +the next weight vector value. The parameter sets the initial step size. {% highlight scala %} @@ -134,33 +128,15 @@ val svm = SVM() .setStepsize(0.1) .setSeed(42) -svm.fit(adultTrain) +svm.fit(astroTrain) {% endhighlight %} -Let's now make predictions on the test set and see how well we do in terms of absolute error -We will also create a function that thresholds the predictions to the {-1, 1} scale that the -dataset uses. +We can now make predictions on the test set. {% highlight scala %} -def thresholdPredictions(predictions: DataSet[(Double, Double)]) -: DataSet[(Double, Double)] = { - predictions.map { - truthPrediction => - val truth = truthPrediction._1 - val prediction = truthPrediction._2 - val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0 - (truth, thresholdedPrediction) - } -} - -val predictionPairs = thresholdPredictions(svm.predict(adjustedTest)) - -val absoluteErrorSum = predictionPairs.collect().map{ - case (truth, prediction) => Math.abs(truth - prediction)}.sum - -println(s"Absolute error: $absoluteErrorSum") +val predictionPairs = svm.predict(astroTest) {% endhighlight %} @@ -170,19 +146,20 @@ Next we will see if we can improve the performance by pre-processing our data. A pre-processing step that is often encouraged when using SVM classification is scaling the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest. -FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to -chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions -on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML, -[here](pipelines.html). +FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to pre-process data, +and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows +us to run the same pipeline of transformations and make predictions on the train and test data in +a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML +[in the pipelines documentation](pipelines.html). -Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier. +Let us first create a normalizing transformer for the features in our dataset, and chain it to a +new SVM classifier. {% highlight scala %} -import org.apache.flink.ml.preprocessing.StandardScaler +import org.apache.flink.ml.preprocessing.MinMaxScaler -val scaler = StandardScaler() -scaler.fit(adultTrain) +val scaler = MinMaxScaler() val scaledSVM = scaler.chainPredictor(svm) @@ -195,27 +172,31 @@ make predictions. {% highlight scala %} -scaledSVM.fit(adultTrain) - -val predictionPairsScaled= thresholdPredictions(scaledSVM.predict(predictionsScaled)) +scaledSVM.fit(astroTrain) -val absoluteErrorSumScaled = predictionPairs.collect().map{ - case (truth, prediction) => Math.abs(truth - prediction)}.sum - -println(s"Absolute error with scaled features: $absoluteErrorSumScaled") +val predictionPairsScaled= scaledSVM.predict(predictionsScaled) {% endhighlight %} -The effect that the transformation has on the rror for this dataset is a bit unpredictable. -In reality the scaling transformation does -not fit the dataset we are using, since the features are translated categorical features and as -such, operations like normalization and standard scaling do not make much sense. +The scaled inputs should give us better prediction performance. ## Where to go from here -This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot more you can do. -We recommend going through the [FlinkML documentation](index.html), and trying out the different algorithms. -A very good way to get started is to play around with interesting datasets from the UCI ML repository and the LibSVM datasets. -Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or [DrivenData](http://www.drivendata.org/) -is also a great way to learn by competing with other data scientists. -If you would like to contribute some new algorithms take a look at our [contribution guide](contribution_guide.html). +This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot +more you can do. +We recommend going through the [FlinkML documentation](index.html), and trying out the different +algorithms. +A very good way to get started is to play around with interesting datasets from the UCI ML +repository and the LibSVM datasets. +Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or +[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other +data scientists. +If you would like to contribute some new algorithms take a look at our +[contribution guide](contribution_guide.html). + +**References** + +[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT press, 2012. + +[2] Jaggi, Martin, et al. *Communication-efficient distributed dual coordinate ascent.* +Advances in Neural Information Processing Systems. 2014. From 5237252de00fede7fa71f83768f203e572047505 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Mon, 8 Jun 2015 16:22:32 +0200 Subject: [PATCH 4/8] Added missing import statements --- docs/libs/ml/quickstart.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index 5965623e54ae3..d5322804ab641 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -66,6 +66,10 @@ We can load the data as a `DataSet[String]` first: {% highlight scala %} +import org.apache.flink.api.scala.ExecutionEnvironment + +val env = ExecutionEnvironment.createLocalEnvironment(2) + val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data") {% endhighlight %} @@ -76,6 +80,9 @@ is the class label, and the rest are features, wo we can build `LabeledVector` e {% highlight scala %} +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.DenseVector + val survivalLV = survival .map{tuple => val list = tuple.productIterator.toList @@ -102,6 +109,8 @@ We can simply import the dataset then using: {% highlight scala %} +import org.apache.flink.ml.MLUtils + val astroTrain = MLUtils.readLibSVM("path/to/svmguide1") val astroTest = MLUtils.readLibSVM("path/to/svmguide1.t") @@ -121,6 +130,8 @@ the next weight vector value. The parameter sets the initial step size. {% highlight scala %} +import org.apache.flink.ml.classification.SVM + val svm = SVM() .setBlocks(env.getParallelism) .setIterations(100) From 547133a0658ab26a1e6f91a18bd135951390fcab Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Mon, 8 Jun 2015 16:56:28 +0200 Subject: [PATCH 5/8] Fixed small error in the code examples --- docs/libs/ml/quickstart.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index d5322804ab641..09e4cd22a9728 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -185,7 +185,7 @@ make predictions. scaledSVM.fit(astroTrain) -val predictionPairsScaled= scaledSVM.predict(predictionsScaled) +val predictionPairsScaled= scaledSVM.predict(astroTest) {% endhighlight %} From 906e524cb189fd846eabfcfe3e5a0cdd3fb31318 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Tue, 9 Jun 2015 14:43:59 +0200 Subject: [PATCH 6/8] Small spelling error --- docs/libs/ml/quickstart.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index 09e4cd22a9728..bee69cb78b9d9 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -76,7 +76,7 @@ val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haber We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset -is the class label, and the rest are features, wo we can build `LabeledVector` elements like this: +is the class label, and the rest are features, so we can build `LabeledVector` elements like this: {% highlight scala %} From 1393466525a5d5b174acf8b55f8d4ad426b6c781 Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Wed, 10 Jun 2015 10:45:36 +0200 Subject: [PATCH 7/8] More info on datasets, addressed PR comments --- docs/libs/ml/index.md | 27 ++++++------ docs/libs/ml/quickstart.md | 85 ++++++++++++++++++++++++-------------- 2 files changed, 70 insertions(+), 42 deletions(-) diff --git a/docs/libs/ml/index.md b/docs/libs/ml/index.md index de9137d581532..9ff7a4b3f9bed 100644 --- a/docs/libs/ml/index.md +++ b/docs/libs/ml/index.md @@ -21,9 +21,9 @@ under the License. --> FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community, -with a growing list of algorithms and contributors. With FlinkML we aim to provide -scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML -systems. You can see more details about our goals and where the library is headed in our [vision +with a growing list of algorithms and contributors. With FlinkML we aim to provide +scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML +systems. You can see more details about our goals and where the library is headed in our [vision and roadmap here](vision_roadmap.html). * This will be replaced by the TOC @@ -55,10 +55,13 @@ FlinkML currently supports the following algorithms: ## Getting Started -First, you have to [set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink). -Next, you have to add the FlinkML dependency to the `pom.xml` of your project. +You can check out our [quickstart guide](quickstart.html) for a comprehensive getting started +example. -{% highlight bash %} +If you want to jump right in, you have to [set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink). +Next, you have to add the FlinkML dependency to the `pom.xml` of your project. + +{% highlight xml %} org.apache.flink flink-ml @@ -85,12 +88,11 @@ mlr.fit(trainingData, parameters) val predictions: DataSet[LabeledVector] = mlr.predict(testingData) {% endhighlight %} -For a more comprehensive guide, please check out our [quickstart guide](quickstart.html) - ## Pipelines A key concept of FlinkML is its [scikit-learn](http://scikit-learn.org) inspired pipelining mechanism. It allows you to quickly build complex data analysis pipelines how they appear in every data scientist's daily work. +An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html). The following example code shows how easy it is to set up an analysis pipeline with FlinkML. @@ -110,13 +112,14 @@ pipeline.fit(trainingData) // Calculate predictions val predictions: DataSet[LabeledVector] = pipeline.predict(testingData) -{% endhighlight %} +{% endhighlight %} One can chain a `Transformer` to another `Transformer` or a set of chained `Transformers` by calling the method `chainTransformer`. -If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Transformers`, one has to call the method `chainPredictor`. -An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html). +If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Transformers`, one has to call the method `chainPredictor`. + ## How to contribute The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries. -In order to get quickly started with contributing to FlinkML, please read first the official [contribution guide]({{site.baseurl}}/libs/ml/contribution_guide.html). \ No newline at end of file +In order to get quickly started with contributing to FlinkML, please read our official +[contribution guide]({{site.baseurl}}/libs/ml/contribution_guide.html). diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index bee69cb78b9d9..651d73e4a1552 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -37,22 +37,36 @@ As defined by Murphy [1] ML deals with detecting patterns in data, and using tho learned patterns to make predictions about the future. We can categorize most ML algorithms into two major categories: Supervised and Unsupervised Learning. -* Supervised Learning deals with learning a function (mapping) from a set of inputs -(features) to a set of outputs. The learning is done using a __training set__ of (input, +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs +(features) to a set of outputs. The learning is done using a *training set* of (input, output) pairs that we use to approximate the mapping function. Supervised learning problems are further divided into classification and regression problems. In classification problems we try to -predict the __class__ that an example belongs to, for example whether a user is going to click on -an ad or not. Regression problems one the other hand, are about predicting (real) numerical +predict the *class* that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems one the other hand, are about predicting (real) numerical values, often called the dependent variable, for example what the temperature will be tomorrow. -* Unsupervised learning deals with discovering patterns and regularities in the data. An example +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example of this would be *clustering*, where we try to discover groupings of the data from the descriptive features. Unsupervised learning can also be used for feature selection, for example through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). +## Linking with FlinkML + +In order to use FlinkML in you project, first you have to +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink). +Next, you have to add the FlinkML dependency to the `pom.xml` of your project: + +{% highlight xml %} + + org.apache.flink + flink-ml + {{site.version }} + +{% endhighlight %} + ## Loading data -For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized functions for formatted data, such as the LibSVM format. For supervised learning problems it is common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` object will have a FlinkML `Vector` member representing the features of the example and a `Double` @@ -61,6 +75,11 @@ variable for a regression problem. As an example, we can use Haberman's Survival Data Set , which you can [download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data. +This dataset *"contains cases from study conducted on the survival of patients who had undergone +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns +are the features and last column is the class, and the 4th column indicates whether the patient +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data. We can load the data as a `DataSet[String]` first: @@ -92,8 +111,8 @@ val survivalLV = survival {% endhighlight %} -We can then use this data to train a learner. We will use another dataset to exemplify building a -learner, that will allow us to show how we can import other dataset formats. +We can then use this data to train a learner. We will however use another dataset to exemplify +building a learner; that will allow us to show how we can import other dataset formats. **LibSVM files** @@ -101,9 +120,11 @@ A common format for ML datasets is the LibSVM format and a number of datasets us found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. You can also save datasets in the LibSVM format using the `writeLibSVM` function. -Let's import the svmguide1 dataset. You can download the +Let's import the svmguide1 dataset. You can download the [training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1) and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t). +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical +Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label. We can simply import the dataset then using: @@ -111,22 +132,22 @@ We can simply import the dataset then using: import org.apache.flink.ml.MLUtils -val astroTrain = MLUtils.readLibSVM("path/to/svmguide1") -val astroTest = MLUtils.readLibSVM("path/to/svmguide1.t") +val astroTrain = MLUtils.readLibSVM("/path/to/svmguide1") +val astroTest = MLUtils.readLibSVM("/path/to/svmguide1.t") {% endhighlight %} -This gives us two `DataSet[LabeledVector]` that we will use in the following section to create a -classifier. +This gives us two `DataSet[LabeledVector]` objects that we will use in the following section to +create a classifier. ## Classification Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier. We can set a number of parameters for the classifier. Here we set the `Blocks` parameter, which is used to split the input by the underlying CoCoA algorithm [2] uses. The regularization -parameter determines the amount of $\lambda_2$ regularization applied, which is used -to avoid overfitting. The step size determines the contribution of the weight vector updates to -the next weight vector value. The parameter sets the initial step size. +parameter determines the amount of $l_2$ regularization applied, which is used +to avoid overfitting. The step size determines the contribution of the weight vector updates to +the next weight vector value. This parameter sets the initial step size. {% highlight scala %} @@ -151,19 +172,20 @@ val predictionPairs = svm.predict(astroTest) {% endhighlight %} -Next we will see if we can improve the performance by pre-processing our data. +Next we will see how we can pre-process our data, and use the ML pipelines capabilities of FlinkML. ## Data pre-processing and pipelines -A pre-processing step that is often encouraged when using SVM classification is scaling -the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest. -FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to pre-process data, -and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows +A pre-processing step that is often encouraged [3] when using SVM classification is scaling +the input features to the [0, 1] range, in order to avoid features with extreme values +dominating the rest. +FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to pre-process data, +and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML [in the pipelines documentation](pipelines.html). -Let us first create a normalizing transformer for the features in our dataset, and chain it to a +Let us first create a normalizing transformer for the features in our dataset, and chain it to a new SVM classifier. {% highlight scala %} @@ -176,16 +198,16 @@ val scaledSVM = scaler.chainPredictor(svm) {% endhighlight %} -We can now use our newly created pipeline to make predictions on the test set. +We can now use our newly created pipeline to make predictions on the test set. First we call fit again, to train the scaler and the SVM classifier. -The data of the test set will then be automatically scaled before being passed on to the SVM to +The data of the test set will then be automatically scaled before being passed on to the SVM to make predictions. {% highlight scala %} scaledSVM.fit(astroTrain) -val predictionPairsScaled= scaledSVM.predict(astroTest) +val predictionPairsScaled = scaledSVM.predict(astroTest) {% endhighlight %} @@ -197,17 +219,20 @@ This quickstart guide can act as an introduction to the basic concepts of FlinkM more you can do. We recommend going through the [FlinkML documentation](index.html), and trying out the different algorithms. -A very good way to get started is to play around with interesting datasets from the UCI ML +A very good way to get started is to play around with interesting datasets from the UCI ML repository and the LibSVM datasets. -Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or -[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other +Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or +[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other data scientists. -If you would like to contribute some new algorithms take a look at our +If you would like to contribute some new algorithms take a look at our [contribution guide](contribution_guide.html). **References** [1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT press, 2012. -[2] Jaggi, Martin, et al. *Communication-efficient distributed dual coordinate ascent.* +[2] Jaggi, Martin, et al. *Communication-efficient distributed dual coordinate ascent.* Advances in Neural Information Processing Systems. 2014. + +[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. + *A practical guide to support vector classification.* (2003). From 9ad958fc67824b237058249ac4f01264f4f636cb Mon Sep 17 00:00:00 2001 From: Theodore Vasiloudis Date: Thu, 11 Jun 2015 10:25:25 +0200 Subject: [PATCH 8/8] Anchor links for references and other fixes --- docs/libs/ml/quickstart.md | 38 ++++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 18 deletions(-) diff --git a/docs/libs/ml/quickstart.md b/docs/libs/ml/quickstart.md index 651d73e4a1552..9a7b1fb8d091c 100644 --- a/docs/libs/ml/quickstart.md +++ b/docs/libs/ml/quickstart.md @@ -28,12 +28,12 @@ under the License. ## Introduction FlinkML is designed to make learning from your data a straight-forward process, abstracting away -the complexities that usually come with having to deal with big data learning tasks. In this +the complexities that usually come with big data learning tasks. In this quick-start guide we will show just how easy it is to solve a simple supervised learning problem using FlinkML. But first some basics, feel free to skip the next few lines if you're already familiar with Machine Learning (ML). -As defined by Murphy [1] ML deals with detecting patterns in data, and using those +As defined by Murphy [[1]](#murphy) ML deals with detecting patterns in data, and using those learned patterns to make predictions about the future. We can categorize most ML algorithms into two major categories: Supervised and Unsupervised Learning. @@ -52,7 +52,7 @@ through [principal components analysis](https://en.wikipedia.org/wiki/Principal_ ## Linking with FlinkML -In order to use FlinkML in you project, first you have to +In order to use FlinkML in your project, first you have to [set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink). Next, you have to add the FlinkML dependency to the `pom.xml` of your project: @@ -68,14 +68,14 @@ Next, you have to add the FlinkML dependency to the `pom.xml` of your project: To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized functions for formatted data, such as the LibSVM format. For supervised learning problems it is -common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +common to use the `LabeledVector` class to represent the `(label, features)` examples. A `LabeledVector` object will have a FlinkML `Vector` member representing the features of the example and a `Double` member which represents the label, which could be the class in a classification problem, or the dependent variable for a regression problem. As an example, we can use Haberman's Survival Data Set , which you can -[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data. -This dataset *"contains cases from study conducted on the survival of patients who had undergone +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data). +This dataset *"contains cases from a study conducted on the survival of patients who had undergone surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns are the features and last column is the class, and the 4th column indicates whether the patient survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI @@ -87,7 +87,7 @@ We can load the data as a `DataSet[String]` first: import org.apache.flink.api.scala.ExecutionEnvironment -val env = ExecutionEnvironment.createLocalEnvironment(2) +val env = ExecutionEnvironment.getExecutionEnvironment val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data") @@ -118,13 +118,14 @@ building a learner; that will allow us to show how we can import other dataset f A common format for ML datasets is the LibSVM format and a number of datasets using that format can be found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading -datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +datasets using the LibSVM format through the `readLibSVM` function available through the `MLUtils` +object. You can also save datasets in the LibSVM format using the `writeLibSVM` function. Let's import the svmguide1 dataset. You can download the [training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1) and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t). -This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical -Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label. +This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their +practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label. We can simply import the dataset then using: @@ -144,8 +145,8 @@ create a classifier. Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier. We can set a number of parameters for the classifier. Here we set the `Blocks` parameter, -which is used to split the input by the underlying CoCoA algorithm [2] uses. The regularization -parameter determines the amount of $l_2$ regularization applied, which is used +which is used to split the input by the underlying CoCoA algorithm [[2]](#jaggi) uses. The +regularization parameter determines the amount of $l_2$ regularization applied, which is used to avoid overfitting. The step size determines the contribution of the weight vector updates to the next weight vector value. This parameter sets the initial step size. @@ -176,7 +177,7 @@ Next we will see how we can pre-process our data, and use the ML pipelines capab ## Data pre-processing and pipelines -A pre-processing step that is often encouraged [3] when using SVM classification is scaling +A pre-processing step that is often encouraged [[3]](#hsu) when using SVM classification is scaling the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest. FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to pre-process data, @@ -229,10 +230,11 @@ If you would like to contribute some new algorithms take a look at our **References** -[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT press, 2012. +[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT +press, 2012. -[2] Jaggi, Martin, et al. *Communication-efficient distributed dual coordinate ascent.* -Advances in Neural Information Processing Systems. 2014. +[2] Jaggi, Martin, et al. *Communication-efficient distributed dual +coordinate ascent.* Advances in Neural Information Processing Systems. 2014. -[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. - *A practical guide to support vector classification.* (2003). +[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. + *A practical guide to support vector classification.* 2003.