[DIA] Sherlock by tuchris · Pull Request #1183 · apache/systemds

tuchris · 2021-02-15T19:09:25Z

Hi,
this is our initial version of the implementation of the Sherlock project.
http://sherlock.media.mit.edu/assets/2019-Sherlock-KDD.pdf
The neural network part is implemented as described in the paper. Is it possible, or necessary, to write test cases for a neural network?
Feature extraction is currently missing, but a workflow for how we plan to extract the features is set up in UtilFunctions.class. Could you please confirm whether we are on the right track?
Thanks for the review and feedback. :)

tuchris · 2021-02-25T20:24:12Z

Thank you for the feedback. The hardcoded dependencies will be removed.
Regarding testing, we have to test three parts: 1) the preprocessing UtilFunction, 2) training, and 3) testing of the network.
Part 1 is deterministic and testable.
Parts 2 and 3, however, need a large amount of input data to produce reasonable, stable results in training and validation. How should we proceed here without committing overly large training/validation files to the repository?
We would be very happy about some tips regarding this issue. @mboehm7 or @Shafaq-Siddiqi, could you perhaps provide us with some?

Many thanks and BR,
Christian, Jasmin, Mauro

corepointer · 2021-03-02T23:56:17Z

Testing training and validation would be out of scope of a unit test imho. And you are right, we don't want a lot of extra data files in the repository for testing. I'd merge the PR to staging with minor formatting changes. A few questions came to mind while looking over it:

What is sherlockPreprocessing.dml for and where is it used? That one also contains a path reference to a nonexistent directory.
How did you test the implementation? And where did you get the test data from? What about a simple shell script that downloads the needed data (if it exists somewhere online) and fires off the algorithm?

Regards,
Mark
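
A minimal sketch of the kind of download helper suggested above, written in Python rather than shell. It is an illustration only, not the script that was later added to this PR: the function name `download_and_extract`, the cache file name `data.zip`, and the assumption that the data is published as a single zip archive are all hypothetical.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path
from typing import List


def download_and_extract(url: str, dest_dir: str) -> List[str]:
    """Download a zip archive from `url` and unpack it into `dest_dir`.

    Hypothetical helper: assumes the data set is a single zip archive.
    Returns the sorted names of the archive members.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "data.zip"  # assumed cache file name

    # Skip the download when the archive is already cached locally,
    # so repeated runs do not fetch the data again.
    if not archive.exists():
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            shutil.copyfileobj(response, out)

    # Unpack next to the archive and report what was extracted.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())
```

A caller would pass the real location of the data set and a local target directory, then start the training run on the extracted files. Keeping the archive on disk means the large training and validation files never have to be committed to the repository.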

tuchris · 2021-03-03T11:05:02Z

Thank you for the information.

I will remove the preprocessing script, as it is going to be contributed via a separate PR.
You are right. A script and documentation listing the files used for training/testing will be added.

tuchris · 2021-03-10T09:35:27Z

I added the download script right in the scripts/builtin/ folder - please move it to the correct location on merging.

Thanks for the good review!
BR Christian

Baunsgaard · 2021-03-10T09:38:14Z

> I added the download script right in the scripts/builtin/ folder - please move it to the correct location on merging.
>
> Thanks for the good review!
> BR Christian

If the dataset is useful in general, it would be nice to integrate it into our Python tutorials.
See: src/main/python/systemds/examples/tutorials

tuchris · 2021-03-16T11:11:40Z

It is a dataset intended to do semantic data type detection. Yes, it could be useful to other projects too.
I moved (+updated) it to the suggested location.

corepointer · 2021-03-16T12:40:21Z

Thank you for following up on this! We appreciate the extra effort you take 👍
I started merging this morning but was luckily interrupted by a meeting ;-) I'll test it and merge it in if I don't run into any issues.

corepointer · 2021-03-17T12:29:52Z

Thanks again for the PR. I merged it in now (not in staging - that wouldn't work for a builtin function). I only made formatting modifications and did a test run of the JUnit test you provided. I hope I gave proper GitHub credit to the two coauthors I found in the history of this branch.

tuchris and others added 30 commits December 1, 2020 21:12

initial setup of the builtin sherlock function

15abf9f

added method to load a predifined file

ede501b

* no return value * input file not changeable (must be a static string)

added load_raw_train_values() which returns the csv-data as frame

95e244a

loaded existing processed x and y data

65817ab

added functions for all 6 input files

de0f7be

loaded all files to do "demo training" of fm-regession

85847a3

added saving of training results

a27212b

train network

5bda1b7

improve training of sherlock

bf7d45d

predict and eval

5e8c8bf

moved sherlock network functions to seperate file

7e39ebf

got compiliation error otherwise

added validation phase in training

bdf7351

small updates in sherlock NN

cf57642

updated calculation of lr and played with values

2c01c73

updated weights and number of neurons

2c8fc70

values from paper look good, just use huge amount of samples

added own file to do preprocessing stuff

d48ae0d

implemented structure to preprocess input strings

d504a27

reformatted code and updated debug messages

eb7dc0f

reformatted code

c76adff

removed unnecessary methods in UtilFunctions.java

028466d

removed personal test-script

b7a8f68

Delete hello.dml

e6d0113

Merge branch 'master' into fb_sherlock

365e86e

addLicense

ced56d9

splitted neural network of sherlock like in the paper

572c862

added separate train and predict functions for the whole sherlock model

9ea2e74

bugfix: predict + train: wrong calculation of batches

5b38117

cleaned sherlock.dml

3a55f4a

no test functions should be implemented within this file

removed dropout layer in prediction and added f1 score

84d1116

removed static/dependent components

7dcf23b

removed input file dependencies

dirjavi17 and others added 16 commits February 26, 2021 10:27

map function test preprocess sherlock

c5d33e2

added global statistics in training

db7c895

WIP - sherlock train testing

30fd395

input files not valid

fixed sherlock test

88c9621

separated sherlockPredict and added test

eb14ddc

doku

c61f264

doku network sherlock

57ad7f2

document sherlock predict

e50d916

document sherlock in builtins reference file

e1b4a64

renamed *r*est values to *s*tatistic values

5c2cf77

Merge branch 'testing' into fb_sherlock

409d41f

# Conflicts: # scripts/builtin/sherlock.dml # scripts/builtin/sherlockNet.dml

added and merged global statistics into sherlock

ce9942b

reset log4j.properties

7edf1a7

added file extension to test-input weights

82d95dc

updated builtins-reference.md for global statistics

9a20142

trigger failed pipeline

299cc16

tuchris changed the title ~~[DIA-WIP] Sherlock~~ [DIA] Sherlock Feb 28, 2021

tuchris added 2 commits March 9, 2021 20:03

removed sherlockPreprocessing.dml

1e6e978

clarified UtilFunctions

added python script to download&extract data for sherlock

5d3fe38

moved and updated sherlockData.py

ab0d22d

it is now implemented as python class, to make it easier to use.

corepointer closed this in b081a25 Mar 17, 2021

tuchris wants to merge 49 commits into apache:master from tuchris:fb_sherlock