## Start Master and Slave

Run the following commands in a terminal as user `hdfs` on every machine in the cluster: the master machine starts the master process, and each slave machine starts a worker that registers with it. The value of `SPARK_HOME` can be looked up in `env.py`.

```bash
su hdfs
export MASTER=spark://master0.datascience.com:7077
export CORES_PER_WORKER=1
# On the master machine, run:
${SPARK_HOME}/sbin/start-master.sh
# On each slave machine, run:
${SPARK_HOME}/sbin/start-slave.sh -c ${CORES_PER_WORKER} -m 3G ${MASTER}
```

## Set up the Environment

The code in `env.py` does not need to be changed for a different training process.

In [1]:
%run env.py
# %env # print env

## Import Libraries

Import and initialize `findspark` before importing any other pyspark libraries.

In [2]:
import findspark
findspark.init()
from pyspark.context import SparkContext
from pyspark.conf import SparkConf

from argparse import Namespace

from mnist_write import writeMNIST, readMNIST
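
`findspark.init()` has to come first because the Python bindings of a Spark installation are usually not on `sys.path` by default. The sketch below approximates what the call does, assuming the common layout in which the bindings live in `$SPARK_HOME/python` and py4j ships as a zip archive under `python/lib`. `pyspark_paths` is a hypothetical helper for illustration, not part of findspark.

```python
import glob
import os


def pyspark_paths(spark_home):
    """Return the sys.path entries needed before `import pyspark` can succeed.

    Assumes the usual Spark layout: $SPARK_HOME/python and
    $SPARK_HOME/python/lib/py4j-*.zip.
    """
    spark_python = os.path.join(spark_home, "python")
    # Spark bundles py4j as a zip archive; pick the first match.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    if not py4j_zips:
        raise FileNotFoundError("no py4j archive found under " + spark_python)
    return [spark_python, py4j_zips[0]]


# Usage (assumes SPARK_HOME is set, e.g. by env.py):
# import sys
# sys.path[:0] = pyspark_paths(os.environ["SPARK_HOME"])
```

Prepending these entries to `sys.path` is roughly what lets the later `from pyspark...` imports resolve.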

## Set up Spark Config

Our cluster runs Spark 2.3.
Spark configuration properties are listed [here](https://spark.apache.org/docs/2.3.0/configuration.html).
Properties for running on YARN are listed [here](https://spark.apache.org/docs/2.3.0/running-on-yarn.html).

Some notes:

**spark.master**: set this to the master URL, `spark://<host>:<port>`, where the port is the one the standalone master listens on (7077 here).

In [3]:
conf = SparkConf()

conf.setAll([("spark.app.name", "mnist-standalone-write"), # your app name
             ("spark.master", "spark://master0.datascience.com:7077")]) # standalone master URL, please leave this unchanged

sc = SparkContext(conf=conf)
sc.addPyFile("mnist_write.py")
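
The `spark.master` value used above follows the `spark://host:port` pattern of a standalone master. As a minimal sketch, the URL can be assembled and sanity-checked before it is passed to `SparkConf`. `standalone_master_url` is a hypothetical helper, and the default port 7077 is taken from the `MASTER` variable exported earlier.

```python
def standalone_master_url(host, port=7077):
    """Build a Spark standalone master URL of the form spark://host:port."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port out of range: {}".format(port))
    # Reject values that already contain a scheme, a path, or a port.
    if not host or "/" in host or ":" in host:
        raise ValueError("host must be a bare hostname: {!r}".format(host))
    return "spark://{}:{}".format(host, port)


master = standalone_master_url("master0.datascience.com")
print(master)  # spark://master0.datascience.com:7077
```

The result can then be used as the value of `("spark.master", master)` in `conf.setAll`.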

## Set up Arguments

In [4]:
args = Namespace(
  format="csv", # output format: ["csv", "csv2", "pickle", "tf", "tfr"]
  num_partitions=10, # number of output partitions
  output="hdfs://master0.datascience.com:8020/data/mnist/csv", # HDFS directory to save examples in parallelized format
  read=False, # read previously saved examples
  verify=True # verify saved examples after writing
)

## Main

In [5]:
if not args.read:
  # Note: these files are inside the mnist.zip file
  writeMNIST(sc, "data/train-images-idx3-ubyte.gz", "data/train-labels-idx1-ubyte.gz", args.output + "/train", args.format, args.num_partitions)
  writeMNIST(sc, "data/t10k-images-idx3-ubyte.gz", "data/t10k-labels-idx1-ubyte.gz", args.output + "/test", args.format, args.num_partitions)

if args.read or args.verify:
  readMNIST(sc, args.output + "/train", args.format)

sc.stop()

Instructions for updating:
Please use tf.data to implement this functionality.
Extracting data/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting data/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
images.shape: (60000, 28, 28, 1)
labels.shape: (60000, 10)


Py4JJavaError: An error occurred while calling o61.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master0.datascience.com:8020/data/mnist/csv/train/images already exists
	at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
	at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:287)
	at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
	at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
	at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
	at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)
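
The `FileAlreadyExistsException` above comes from Hadoop's output check: `saveAsTextFile` refuses to write into a directory that already exists, so rerunning the write cell against the same `args.output` fails. One possible way to make reruns work is to remove the old output first with `hdfs dfs -rm`. The sketch below only builds the commands. `hdfs_rm_command` is a hypothetical helper, and it assumes the `hdfs` CLI is on the `PATH` of the notebook user.

```python
import subprocess


def hdfs_rm_command(path):
    """Build the CLI call that recursively deletes an HDFS directory.

    -r removes the directory tree; -f keeps the command from failing
    when the path does not exist yet (for example on the first run).
    """
    if not path.startswith("hdfs://"):
        raise ValueError("expected an hdfs:// URI, got {!r}".format(path))
    return ["hdfs", "dfs", "-rm", "-r", "-f", path]


output = "hdfs://master0.datascience.com:8020/data/mnist/csv"
commands = [hdfs_rm_command(output + "/" + split) for split in ("train", "test")]

for cmd in commands:
    print(" ".join(cmd))
    # Uncomment to actually delete the directory:
    # subprocess.run(cmd, check=True)
```

Running these commands before `writeMNIST`, or pointing `args.output` at a fresh directory, should avoid the exception.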
