[SPARK-25584][ML][DOC] datasource for libsvm user guide #25286

Closed
2 changes: 1 addition & 1 deletion docs/_data/menu-ml.yaml
@@ -1,7 +1,7 @@
- text: Basic statistics
url: ml-statistics.html
- text: Data sources
url: ml-datasource
url: ml-datasource.html
Contributor Author

The original `url: ml-datasource` is incorrect: the generated URL loses the `.html` suffix.
This does not matter on the official website, since modern web browsers seem to add the suffix automatically.
However, in the locally built docs, we cannot open the ml-datasource link file:///Users/xxx/Dev/OpenSource/spark/docs/_site/ml-datasource from page file:///Users/xxx/Dev/OpenSource/spark/docs/_site/ml-guide.html.

- text: Pipelines
url: ml-pipeline.html
- text: Extracting, transforming and selecting features
114 changes: 114 additions & 0 deletions docs/ml-datasource.md
@@ -120,4 +120,118 @@ In SparkR we provide Spark SQL data source API for loading image data as DataFra
</div>


</div>


## LIBSVM data source

The `LIBSVM` data source is used to load 'libsvm' type files from a directory.
The loaded DataFrame has two columns: `label` containing labels stored as doubles, and `features` containing feature vectors stored as Vectors.
The schemas of the columns are:
- label: `DoubleType` (represents the instance label)
- features: `VectorUDT` (represents the feature vector)
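
For reference, a 'libsvm' file stores one labeled sparse vector per line in the form `label index1:value1 index2:value2 ...`, with one-based indices in ascending order. Two illustrative lines (made-up values, not copied from the sample file) might look like:

{% highlight text %}
0 128:51 129:159 130:253
1 100:145 101:255 102:211
{% endhighlight %}

After loading, the indices are converted to zero-based positions in the feature vectors.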

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`LibSVMDataSource`](api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource)
implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.show(10)
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(780,[127,128,129...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[124,125,126...|
| 1.0|(780,[152,153,154...|
| 1.0|(780,[151,152,153...|
| 0.0|(780,[129,130,131...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[99,100,101,...|
| 0.0|(780,[154,155,156...|
| 0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
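
The `numFeatures` option above fixes the feature vector dimension up front. As a minimal sketch (assuming the same sample file), the option can also be omitted, in which case Spark determines the dimension automatically at the cost of one extra pass over the data:

{% highlight scala %}
scala> val inferred = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
inferred: org.apache.spark.sql.DataFrame = [label: double, features: vector]
{% endhighlight %}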
</div>

<div data-lang="java" markdown="1">
[`LibSVMDataSource`](api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight java %}
Dataset<Row> df = spark.read().format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt");
df.show(10);
/*
Will output:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(780,[127,128,129...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[124,125,126...|
| 1.0|(780,[152,153,154...|
| 1.0|(780,[151,152,153...|
| 0.0|(780,[129,130,131...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[99,100,101,...|
| 0.0|(780,[154,155,156...|
| 0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
In PySpark we provide Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
>>> df.show(10)
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|(780,[127,128,129...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[124,125,126...|
| 1.0|(780,[152,153,154...|
| 1.0|(780,[151,152,153...|
| 0.0|(780,[129,130,131...|
| 1.0|(780,[158,159,160...|
| 1.0|(780,[99,100,101,...|
| 0.0|(780,[154,155,156...|
| 0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
In SparkR we provide Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight r %}
> df <- read.df("data/mllib/sample_libsvm_data.txt", "libsvm")
> head(select(df, df$label, df$features), 10)

label features
1 0 <environment: 0x7fe6d35366e8>
2 1 <environment: 0x7fe6d353bf78>
3 1 <environment: 0x7fe6d3541840>
4 1 <environment: 0x7fe6d3545108>
5 1 <environment: 0x7fe6d354c8e0>
6 0 <environment: 0x7fe6d35501a8>
7 1 <environment: 0x7fe6d3555a70>
8 1 <environment: 0x7fe6d3559338>
9 0 <environment: 0x7fe6d355cc00>
10 0 <environment: 0x7fe6d35643d8>

{% endhighlight %}
</div>


</div>