Commit 16a2855

m2key1 authored and JoachimDunkel committed
[SYSTEMDS-439] Builtins for set operations on vectors
Illustration of the implementation:

X = matrix("1 2 3.1 4", rows=2, cols=2)
Y = matrix("3.1 4 5 6", rows=2, cols=2)

union(X, Y): Union of the sets X and Y
1 2 3.1 4 5 6

setdiff(X, Y): Set difference between X, and Y, with elements in X but not in Y.
1 2

symmetricDifference(X,Y): Set difference between X, and Y, with elements in X and Y but not in both.
1 2 5 6

unique(X): Unique elements of the set X
1 2 3.1 4

Future work: also to support string elements.

These operations are helpful for bridging the gap between the relational and linear algebra.

Resolves SYSTEMDS-440, SYSTEMDS-441, SYSTEMDS-442, SYSTEMDS-3183.

Closes apache#1479.

Co-authored-by: David Fleischhacker <david.fleischhacker@student.tugraz.at>
Co-authored-by: Joachim Dunkel <dunkel@student.tugraz.at>
1 parent b065677 commit 16a2855
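
As a quick cross-check of the illustration in the commit message, the following sketch recomputes the four results with Python's built-in `set` type. It is not the DML implementation from this commit: the flat lists stand in for the 2x2 matrices, and all variable names are chosen here for illustration only.

```python
# Values of the two example matrices, flattened; element order is irrelevant for sets.
X = [1, 2, 3.1, 4]
Y = [3.1, 4, 5, 6]

set_x, set_y = set(X), set(Y)

union_xy = sorted(set_x | set_y)     # elements in X or in Y
setdiff_xy = sorted(set_x - set_y)   # elements in X but not in Y
symdiff_xy = sorted(set_x ^ set_y)   # elements in exactly one of X and Y
unique_x = sorted(set_x)             # distinct elements of X

print(union_xy)    # [1, 2, 3.1, 4, 5, 6]
print(setdiff_xy)  # [1, 2]
print(symdiff_xy)  # [1, 2, 5, 6]
print(unique_x)    # [1, 2, 3.1, 4]
```

The printed lists match the values listed in the commit message for `union`, `setdiff`, `symmetricDifference`, and `unique`.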

24 files changed (+1011 -111 lines changed)

docs/site/builtins-reference.md

Lines changed: 125 additions & 0 deletions
@@ -71,15 +71,19 @@ limitations under the License.
 * [`outlier`-Function](#outlier-function)
 * [`pnmf`-Function](#pnmf-function)
 * [`scale`-Function](#scale-function)
+* [`setdiff`-Function](#setdiff-function)
 * [`sherlock`-Function](#sherlock-function)
 * [`sherlockPredict`-Function](#sherlockPredict-function)
 * [`sigmoid`-Function](#sigmoid-function)
 * [`slicefinder`-Function](#slicefinder-function)
 * [`smote`-Function](#smote-function)
 * [`steplm`-Function](#steplm-function)
+* [`symmetricDifference`-Function](#symmetricdifference-function)
 * [`tomekLink`-Function](#tomekLink-function)
 * [`toOneHot`-Function](#toOneHOt-function)
 * [`tSNE`-Function](#tSNE-function)
+* [`union`-Function](#union-function)
+* [`unique`-Function](#unique-function)
 * [`winsorize`-Function](#winsorize-function)
 * [`xgboost`-Function](#xgboost-function)

@@ -1823,6 +1827,36 @@ scale=TRUE;
 Y= scale(X,center,scale)
 ```
 
+## `setdiff`-Function
+
+The `setdiff`-function returns the values of X that are not in Y.
+
+### Usage
+
+```r
+setdiff(X, Y)
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :--- | :----- | -------- | :---------- |
+| X | Matrix[Double] | required | input vector|
+| Y | Matrix[Double] | required | input vector|
+
+### Returns
+
+| Type | Description |
+| :----- | :---------- |
+| Matrix[Double] | values of X that are not in Y.|
+
+### Example
+
+```r
+X = matrix("1 2 3 4", rows = 4, cols = 1)
+Y = matrix("2 3", rows = 2, cols = 1)
+R = setdiff(X = X, Y = Y)
+```
 
 ## `sherlock`-Function

@@ -2107,6 +2141,37 @@ y = X %*% rand(rows = ncol(X), cols = 1)
 [C, S] = steplm(X = X, y = y, icpt = 1);
 ```
 
+## `symmetricDifference`-Function
+
+The `symmetricDifference`-function returns the symmetric difference of the two input vectors.
+This is done by calculating the `setdiff` (nonsymmetric) between `union` and `intersect` of the two input vectors.
+
+### Usage
+
+```r
+symmetricDifference(X, Y)
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :--- | :----- | -------- | :---------- |
+| X | Matrix[Double] | required | input vector|
+| Y | Matrix[Double] | required | input vector|
+
+### Returns
+
+| Type | Description |
+| :----- | :---------- |
+| Matrix[Double] | symmetric difference of the input vectors |
+
+### Example
+
+```r
+X = matrix("1 2 3.1", rows = 3, cols = 1)
+Y = matrix("3.1 4", rows = 2, cols = 1)
+R = symmetricDifference(X = X, Y = Y)
+```
 
 ## `tomekLink`-Function

@@ -2212,6 +2277,66 @@ X = rand(rows = 100, cols = 10, min = -10, max = 10))
 Y = tSNE(X)
 ```
 
+## `union`-Function
+
+The `union`-function combines all rows from both input vectors and removes all duplicate rows by calling `unique` on the resulting vector.
+
+### Usage
+
+```r
+union(X, Y)
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :--- | :----- | -------- | :---------- |
+| X | Matrix[Double] | required | input vector|
+| Y | Matrix[Double] | required | input vector|
+
+### Returns
+
+| Type | Description |
+| :----- | :---------- |
+| Matrix[Double] | the union of both input vectors.|
+
+### Example
+
+```r
+X = matrix("1 2 3 4", rows = 4, cols = 1)
+Y = matrix("3 4 5 6", rows = 4, cols = 1)
+R = union(X = X, Y = Y)
+```
+
+## `unique`-Function
+
+The `unique`-function returns a set of unique rows from a given input vector.
+
+### Usage
+
+```r
+unique(X)
+```
+
+### Arguments
+
+| Name | Type | Default | Description |
+| :--- | :----- | -------- | :---------- |
+| X | Matrix[Double] | required | input vector|
+
+### Returns
+
+| Type | Description |
+| :----- | :---------- |
+| Matrix[Double] | a set of unique values from the input vector |
+
+### Example
+
+```r
+X = matrix("1 3.4 7 3.4 -0.9 8 1", rows = 7, cols = 1)
+R = unique(X = X)
+```
+
 ## `winsorize`-Function
 
 The `winsorize`-function removes outliers from the data. It does so by computing upper and lower quartile range

scripts/builtin/intersect.dml

Lines changed: 9 additions & 7 deletions
@@ -39,12 +39,14 @@
 m_intersect = function(Matrix[Double] X, Matrix[Double] Y)
   return(Matrix[Double] R)
 {
-  # compute indicator vector of intersection output
-  X = (table(X, 1) != 0)
-  Y = (table(Y, 1) != 0)
-  n = min(nrow(X), nrow(Y))
-  I = X[1:n,] * Y[1:n,]
+  X = unique(X);
+  Y = unique(Y);
 
-  # reconstruct integer values and create output
-  R = removeEmpty(target=seq(1,n), margin="rows", select=I)
+  combined = rbind(X, Y);
+
+  combined = order(target=combined, by=1, decreasing=FALSE, index.return=FALSE);
+  temp = combined[1:nrow(combined)-1,] != combined[2:nrow(combined),];
+  mask = rbind(matrix(1, rows = 1, cols = 1), rowSums(temp));
+
+  R = removeEmpty(target = combined, margin = "rows", select = !mask);
 }
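
The removed lines rebuilt the result from `seq(1,n)`, that is, as integer values, while the added lines sort the concatenated de-duplicated inputs and keep every row that equals its predecessor. The following is a minimal Python model of that added logic, not the DML itself. It assumes single-column vectors represented as plain lists; `intersect_sorted` is a hypothetical name, and `set` stands in for the two `unique` calls.

```python
def intersect_sorted(x, y):
    """Model of the new m_intersect body for single-column inputs."""
    ux = sorted(set(x))          # stands in for X = unique(X)
    uy = sorted(set(y))          # stands in for Y = unique(Y)
    combined = sorted(ux + uy)   # rbind followed by order(..., decreasing=FALSE)

    # mask: True for the first row, then True wherever a row differs from its predecessor
    mask = [True] + [combined[i] != combined[i - 1] for i in range(1, len(combined))]

    # select = !mask keeps exactly the rows that repeat their predecessor,
    # i.e. values that occur in both de-duplicated inputs
    return [value for value, differs in zip(combined, mask) if not differs]


print(intersect_sorted([1, 2, 3.1, 4], [3.1, 4, 5, 6]))  # [3.1, 4]
```

Because each input is de-duplicated first, a value can appear at most twice in `combined`, so a repeated row can only come from a value present in both inputs.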

scripts/builtin/setdiff.dml

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Builtin function that implements difference operation on vectors
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# X Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+# Y Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+
+# Output(s)
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# R Matrix --- vector with all elements that are present in X but not in Y
+
+
+setdiff = function(Matrix[double] X, Matrix[double] Y)
+  return (matrix[double] R)
+{
+  common = intersect(X, Y);
+  X = unique(X);
+  combined = rbind(X, common);
+  combined = order(target=combined, by=1, decreasing=FALSE, index.return=FALSE);
+  temp = combined[1:nrow(combined)-1,] != combined[2:nrow(combined),];
+  mask1 = rbind(rowSums(temp), matrix(1, rows=1, cols=1));
+  mask2 = rbind(matrix(1, rows = 1, cols = 1), rowSums(temp));
+
+  mask = mask1 & mask2;
+  R = removeEmpty(target = combined, margin = "rows", select = mask);
+}
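
The two masks in `setdiff` are easiest to follow on a small input: `mask1` marks rows that differ from their successor, `mask2` marks rows that differ from their predecessor, and only rows that satisfy both survive. The sketch below models this in Python under the assumption of single-column inputs given as lists; `setdiff_sorted` is a hypothetical name, and set operations stand in for the `intersect` and `unique` calls.

```python
def setdiff_sorted(x, y):
    """Model of the setdiff body for single-column inputs."""
    common = sorted(set(x) & set(y))   # stands in for intersect(X, Y)
    ux = sorted(set(x))                # stands in for unique(X)
    combined = sorted(ux + common)     # rbind followed by ascending order

    n = len(combined)
    # temp: compares each row with the row directly after it
    differs = [combined[i] != combined[i + 1] for i in range(n - 1)]

    mask1 = differs + [True]   # differs from successor; the last row has none
    mask2 = [True] + differs   # differs from predecessor; the first row has none

    # A value shared with Y appears twice in combined, so one of its two
    # neighbour comparisons fails and both copies are dropped.
    return [v for v, a, b in zip(combined, mask1, mask2) if a and b]


print(setdiff_sorted([1, 2, 3, 4], [2, 3]))  # [1, 4]
```

For the documentation example, `combined` is `[1, 2, 2, 3, 3, 4]`, and only the first and last rows differ from both neighbours.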
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Builtin function that implements symmetric difference set-operation on vectors
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# X Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+# Y Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+
+# Output(s)
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# R Matrix --- vector with all elements in X and Y but not in both
+
+
+symmetricDifference = function(Matrix[Double] X, Matrix[Double] Y)
+  return (matrix[double] R)
+{
+  R = setdiff(union(X,Y), intersect(X,Y))
+}
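
The one-line body relies on the identity that the symmetric difference equals the union minus the intersection. A small Python sketch of that composition follows; the helper names are hypothetical and use Python sets in place of the DML builtins, so this only illustrates the identity, not the DML execution.

```python
def union(x, y):
    return sorted(set(x) | set(y))

def intersect(x, y):
    return sorted(set(x) & set(y))

def setdiff(x, y):
    return sorted(set(x) - set(y))

def symmetric_difference(x, y):
    # Same composition as the DML body: setdiff(union(X, Y), intersect(X, Y))
    return setdiff(union(x, y), intersect(x, y))


X = [1, 2, 3.1]
Y = [3.1, 4]
print(symmetric_difference(X, Y))  # [1, 2, 4]
```

The inputs are the ones used in the documentation example for `symmetricDifference`; the shared value 3.1 is the only one removed.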

scripts/builtin/union.dml

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Builtin function that implements union operation on vectors
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# X Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+# Y Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+
+# Output(s)
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# R Matrix --- matrix with all unique rows existing in X and Y
+
+
+union = function(Matrix[Double] X, Matrix[Double] Y)
+  return (matrix[double] R)
+{
+  combined = rbind(X,Y);
+  R = unique(combined);
+}

scripts/builtin/unique.dml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Builtin function that implements unique operation on vectors
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# X Matrix --- input vector
+# ---------------------------------------------------------------------------------------------
+
+# Output(s)
+# ---------------------------------------------------------------------------------------------
+# NAME TYPE DEFAULT MEANING
+# ---------------------------------------------------------------------------------------------
+# R Matrix --- matrix with only unique rows
+
+unique = function(matrix[double] X)
+  return (matrix[double] R) {
+  if(nrow(X) > 1) {
+    X_sorted = order(target=X, by=1, decreasing=FALSE, index.return=FALSE);
+    temp = X_sorted[1:nrow(X_sorted)-1,] != X_sorted[2:nrow(X_sorted),];
+    mask = rbind(matrix(1, rows = 1, cols = 1), rowSums(temp));
+    R = removeEmpty(target = X_sorted, margin = "rows", select = mask);
+  }
+  else {
+    R = X
+  }
+}
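
The `unique` builtin above is the building block for the other set operations: it sorts the vector, compares each row with the one before it, and keeps the first row plus every row that differs from its predecessor. A minimal Python model of that sort-and-compare step is sketched here, assuming a single-column vector passed as a list; `unique_sorted` is a hypothetical name and the code is not the DML implementation.

```python
def unique_sorted(values):
    """Model of the unique body for a single-column input."""
    if len(values) <= 1:
        # Mirrors the else-branch: zero or one row is returned unchanged
        return list(values)

    xs = sorted(values)  # order(target=X, by=1, decreasing=FALSE)

    # mask: the first row is always kept; later rows are kept only when
    # they differ from the row directly before them
    mask = [True] + [xs[i] != xs[i - 1] for i in range(1, len(xs))]

    return [value for value, keep in zip(xs, mask) if keep]


print(unique_sorted([1, 3.4, 7, 3.4, -0.9, 8, 1]))  # [-0.9, 1, 3.4, 7, 8]
```

One consequence visible in the model is that the result comes back in ascending order rather than in order of first appearance, since the filtering is applied to the sorted vector.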
