##### Copyright &copy; 2020 The Apache Software Foundation.

In [1]:
# @title Apache Version 2.0 (The "License");
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
#-------------------------------------------------------------

### Developer notebook for Apache SystemDS

<div class=""><table class="" align="left">
<td><a target="_blank" href="https://colab.research.google.com/github/apache/systemds/blob/master/notebooks/systemds_dev.ipynb">
<img src="https://colab.research.google.com/img/colab_favicon_256px.png" width= "32px">Run in Google Colab</a></td>
<td><a target="_blank" href="https://github.com/apache/systemds/blob/master/notebooks/systemds_dev.ipynb">
<img width=32px src="https://github.githubassets.com/images/modules/open_graph/github-mark.png">View source on GitHub</a></td>
</table></div>




This Jupyter/Colab-based tutorial interactively walks through the development setup and shows how to run SystemDS in two ways:

A. in standalone mode \
B. with Apache Spark.

Flow of the notebook:
1. Download and Install the dependencies
2. Go to section **A** or **B**

#### Download and Install the dependencies

1. **Runtime:** Java (OpenJDK 8 is preferred)
2. **Build:** Apache Maven
3. **Backend:** Apache Spark (optional)

##### Setup

A custom function to run OS commands.

In [2]:
# Run and print a shell command.
def run(command):
  print('>> {}'.format(command))
  !{command}
  print('')
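
The `!{command}` syntax above only works inside IPython/Jupyter. Outside a notebook, a roughly equivalent helper can be sketched with the standard `subprocess` module. The name `run_plain` is hypothetical and not part of this notebook.

```python
import subprocess

def run_plain(command):
    """Echo a shell command, run it, and return its exit code."""
    print('>> {}'.format(command))
    # shell=True mirrors the notebook's `!` behaviour; only pass trusted strings.
    completed = subprocess.run(command, shell=True)
    print('')
    return completed.returncode
```

Returning the exit code makes it possible to stop early when a download or build step fails, which the notebook version does not do.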

##### Install Java
Let us install OpenJDK 8. More about [OpenJDK ↗](https://openjdk.java.net/install/).

In [3]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Run the command below to make Java 8 the default `java` on this machine.
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

!java -version

update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
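
Since this notebook prefers Java 8, it can help to check the reported version programmatically. The following is a minimal sketch, assuming the version line has the quoted form shown above; `java_major_version` is a hypothetical helper name.

```python
import re

def java_major_version(version_output):
    """Return the major Java version from `java -version` text, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_252" means Java 8; newer scheme: "11.0.7" means Java 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)

print(java_major_version('openjdk version "1.8.0_252"'))  # prints 8
```

Note that `java -version` writes to stderr, so capture `stderr` rather than `stdout` when calling it through `subprocess`.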


##### Install Apache Maven

SystemDS uses Apache Maven to build and manage the project. More about [Apache Maven ↗](http://maven.apache.org/).

Maven builds SystemDS using its project object model (POM) and a set of plugins. You will find the `pom.xml` file at the root of the codebase.

In [4]:
# Download the Maven binary distribution.
maven_version = 'apache-maven-3.6.3'
maven_path = f"/opt/{maven_version}"

if not os.path.exists(maven_path):
  run(f"wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/{maven_version}-bin.zip")
  run('unzip -q -d /opt apache-maven.zip')
  run('rm -f apache-maven.zip')

# Let's call Maven by its absolute path instead of relying on the $PATH environment variable.
def maven(args):
  run(f"{maven_path}/bin/mvn {args}")

maven('-v')

>> wget -q -nc -O apache-maven.zip https://downloads.apache.org/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip

>> unzip -q -d /opt apache-maven.zip

>> rm -f apache-maven.zip

>> /opt/apache-maven-3.6.3/bin/mvn -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apache-maven-3.6.3
Java version: 1.8.0_252, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.104+", arch: "amd64", family: "unix"



##### Install Apache Spark (optional, only needed for the Spark backend)


NOTE: If Spark fails to download, make sure the version we are trying to download is officially supported at
https://spark.apache.org/downloads.html

In [5]:
# Spark and Hadoop version
spark_version = 'spark-2.4.6'
hadoop_version = 'hadoop2.7'
spark_path = f"/opt/{spark_version}-bin-{hadoop_version}"
if not os.path.exists(spark_path):
  run(f"wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/{spark_version}/{spark_version}-bin-{hadoop_version}.tgz")
  run('tar zxf apache-spark.tgz -C /opt')
  run('rm -f apache-spark.tgz')

os.environ["SPARK_HOME"] = spark_path
# Python does not expand $SPARK_HOME inside a string, so append the real path.
os.environ["PATH"] += f":{spark_path}/bin"


>> wget -q -nc -O apache-spark.tgz https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz

>> tar zxf apache-spark.tgz -C /opt

>> rm -f apache-spark.tgz



#### Get Apache SystemDS

Apache SystemDS development happens on GitHub at [apache/systemds ↗](https://github.com/apache/systemds)

In [6]:
!git clone https://github.com/apache/systemds systemds --depth=1
%cd systemds

Cloning into 'systemds'...
remote: Enumerating objects: 7557, done.
remote: Counting objects: 100% (7557/7557), done.
remote: Compressing objects: 100% (4456/4456), done.
remote: Total 7557 (delta 5559), reused 3710 (delta 3016), pack-reused 0
Receiving objects: 100% (7557/7557), 14.73 MiB | 9.80 MiB/s, done.
Resolving deltas: 100% (5559/5559), done.
/content/systemds


##### Build the project

In [7]:
# Logging flags: -q shows only errors; -X enables debug output; -e shows full error stack traces
# Option 1: Build only the java codebase
maven('clean package -q')

# Option 2: For building along with python distribution
# maven('clean package -P distribution')

>> /opt/apache-maven-3.6.3/bin/mvn clean package -q



### A. Working with SystemDS in **standalone** mode

NOTE: Pay attention to *directories* and *relative paths*. :)



##### 1. Set SystemDS environment variables

These are useful for the `./bin/systemds` script.

In [8]:
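# `!export` runs in a throwaway subshell, so set the variables on the notebook process instead.
os.environ["SYSTEMDS_ROOT"] = os.getcwd()
os.environ["PATH"] = os.environ["SYSTEMDS_ROOT"] + "/bin:" + os.environ["PATH"]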

In [9]:
!echo 'export SYSTEMDS_ROOT='$(pwd) >> ~/.bashrc
!echo 'export PATH=$SYSTEMDS_ROOT/bin:$PATH' >> ~/.bashrc

##### 2. Download Haberman data

Data source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival

About: The survival of patients who had undergone surgery for breast cancer.

Data Attributes:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
    - 1 = the patient survived 5 years or longer
    - 2 = the patient died within 5 years

In [10]:
!mkdir -p ../data

In [11]:
!wget -P ../data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data

--2020-07-24 18:44:02--  http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3103 (3.0K) [application/x-httpd-php]
Saving to: ‘../data/haberman.data’


2020-07-24 18:44:02 (348 MB/s) - ‘../data/haberman.data’ saved [3103/3103]



In [33]:
# Display first 10 lines of the dataset
# Notice that the text is plain CSV with no header row!
!sed -n 1,10p ../data/haberman.data

30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
33,58,10,1
33,60,0,1
34,59,0,2
34,66,9,2
34,58,30,1
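
To make the column layout concrete, the rows printed above can be parsed with Python's standard `csv` module. This is only an illustrative sketch: the field names in `FIELDS` and the helper `load_rows` are hypothetical labels taken from the attribute list, since the file itself has no header.

```python
import csv
import io
from collections import Counter

# The ten rows printed above, inlined so the sketch runs without the download.
SAMPLE = """30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
33,58,10,1
33,60,0,1
34,59,0,2
34,66,9,2
34,58,30,1
"""

# Column names are illustrative; the file itself has no header row.
FIELDS = ("age", "year", "nodes", "status")

def load_rows(text):
    """Parse header-less CSV text into a list of dicts of ints."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(FIELDS, map(int, row))) for row in reader if row]

rows = load_rows(SAMPLE)
# Count how often each survival status (1 or 2) occurs in the sample.
status_counts = Counter(r["status"] for r in rows)
print(status_counts)
```

The same function should work on the full file by passing `open("../data/haberman.data").read()` instead of `SAMPLE`.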


##### 2.1 Set `metadata` for the data

The data file carries no information about its own size, format, or value types.
SystemDS reads the size and format of the matrix from a `metadata` file: a `.mtd`
file with the same name and location as the `.data` file.

In [12]:
# generate metadata file for the dataset
!echo '{"rows": 306, "cols": 4, "format": "csv"}' > ../data/haberman.data.mtd

# generate type description for the data
!echo '1,1,1,2' > ../data/types.csv
!echo '{"rows": 1, "cols": 4, "format": "csv"}' > ../data/types.csv.mtd
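
Rather than hardcoding the row and column counts, the metadata could also be derived from the CSV itself. The sketch below assumes the three keys used above (`rows`, `cols`, `format`) are all that is needed; `write_mtd` is a hypothetical helper, not a SystemDS utility.

```python
import csv
import json

def write_mtd(data_path):
    """Write `<data_path>.mtd` with row/column counts read from a CSV file."""
    with open(data_path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    # Column count is taken from the first row; an empty file yields 0 columns.
    meta = {"rows": len(rows), "cols": len(rows[0]) if rows else 0, "format": "csv"}
    with open(data_path + ".mtd", "w") as handle:
        json.dump(meta, handle)
    return meta
```

If the download completed, calling `write_mtd("../data/haberman.data")` should reproduce the values written by hand above.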

##### 3. Find the algorithm to run with `systemds`

In [13]:
# Inspect the directory structure of systemds code base
!ls

bin   CONTRIBUTING.md  docker  LICENSE	pom.xml    scripts  target
conf  dev	       docs    NOTICE	README.md  src


In [14]:
# List all the scripts (also called top level algorithms!)
!ls scripts/algorithms

ALS-CG.dml		   GLM.dml	       naive-bayes.dml
ALS-DS.dml		   GLM-predict.dml     naive-bayes-predict.dml
ALS_predict.dml		   KM.dml	       obsolete
ALS_topk_predict.dml	   Kmeans.dml	       PCA.dml
apply-transform.dml	   Kmeans-predict.dml  random-forest.dml
bivar-stats.dml		   l2-svm.dml	       random-forest-predict.dml
Cox.dml			   l2-svm-predict.dml  StepGLM.dml
Cox-predict.dml		   LinearRegCG.dml     StepLinearRegDS.dml
CsplineCG.dml		   LinearRegDS.dml     stratstats.dml
CsplineDS.dml		   m-svm.dml	       transform.dml
decision-tree.dml	   m-svm-predict.dml   Univar-Stats.dml
decision-tree-predict.dml  MultiLogReg.dml


In [34]:
# Print the algorithm documentation,
# which spans lines 22 through 35 of the script.
!sed -n 22,35p ./scripts/algorithms/Univar-Stats.dml

#
# DML Script to compute univariate statistics for all attributes in a given data set
#
# INPUT PARAMETERS:
# -------------------------------------------------------------------------------------------------
# NAME           TYPE     DEFAULT  MEANING
# -------------------------------------------------------------------------------------------------
# X              String   ---      Location of INPUT data matrix
# TYPES          String   ---      Location of INPUT matrix that lists the types of the features:
#                                     1 for scale, 2 for nominal, 3 for ordinal
# CONSOLE_OUTPUT Boolean  FALSE    If TRUE, print summary statistics to console
# STATS          String   ---      Location of OUTPUT matrix with summary statistics computed for
#                                  all features (17 statistics - 14 scale, 3 categorical)
# -------------------------------------------------------------------------------------------------


In [30]:
!./bin/systemds ./scripts/algorithms/Univar-Stats.dml -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE

###############################################################################
#  SYSTEMDS_ROOT= .
#  SYSTEMDS_JAR_FILE= target/SystemDS.jar
#  CONFIG_FILE= 
#  LOG4JPROP= -Dlog4j.configuration=file:conf/log4j-silent.properties
#  CLASSPATH= target/SystemDS.jar:./lib/*:./target/lib/*
#  HADOOP_HOME= /content/systemds/target/hadoop-test/org/apache/hadoop
#
#  Running script ./scripts/algorithms/Univar-Stats.dml locally with opts: -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE
###############################################################################
Executing command:     java       -Xmx4g      -Xms4g      -Xmn400m   -cp target/SystemDS.jar:./lib/*:./target/lib/*   -Dlog4j.configuration=file:conf/log4j-silent.properties   org.apache.sysds.api.DMLScript   -f ./scripts/algorithms/Univar-Stats.dml   -exec singlenode      -nvargs X=../data/haberman.data TYPES=../data/types.csv STATS=../data/univarOut.mtx CONSOLE_OUTPUT=TRUE

20/

##### 3.1 Let us inspect the output data

In [31]:
# output first 10 lines only.
!sed -n 1,10p ../data/univarOut.mtx

1 1 30.0
1 2 58.0
2 1 83.0
2 2 69.0
2 3 52.0
3 1 53.0
3 2 11.0
3 3 52.0
4 1 52.45751633986928
4 2 62.85294117647059


### B. Run SystemDS with Apache Spark

#### Playground for DML scripts

DML is a custom language with R-like syntax, designed for SystemDS.

##### A test `dml` script to prototype algorithms

Modify the code in the cell below and run it to develop data science tasks
in a high-level language.

In [17]:
%%writefile ../test.dml

# This code acts as a playground for DML code
X = rand (rows = 20, cols = 10)
y = X %*% rand(rows = ncol(X), cols = 1)
lm(X = X, y = y)

Writing ../test.dml


Submit the `dml` script to Spark with `spark-submit`.
More about [Spark Submit ↗](https://spark.apache.org/docs/latest/submitting-applications.html)

In [18]:
!$SPARK_HOME/bin/spark-submit \
    ./target/SystemDS.jar -f ../test.dml

20/07/24 18:44:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.sysds.api.DMLScript).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7
ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y, 2.477375013209132
STDEV_TOT_Y, 0.46788728972902527
AVG_RES_Y, 4.8647128436662966E-9
STDEV_RES_Y, 4.656120422710408E-8
DISPERSION, 1.998482027272949E-15
R2, 0.99999999999

##### Run a binary classification example with sample data

Note that this example is written entirely in plain DML; no other scripting language is involved.

In [19]:
# Example binary classification task with sample data.
# !$SPARK_HOME/bin/spark-submit ./target/SystemDS.jar -f ./scripts/nn/examples/fm-binclass-dummy-data.dml