Regardless of which attributes are used for modeling, the overall flow should be the same: read in the data, process it into modeling format, split it into training and testing sets, convert it to LabeledPoint objects, and pass it to the SVM (see the sketch at the end of this section).
- Apache Spark
- NumPy (required for Spark's SVM to run)
- YARN (optional)
- Make sure all the CSV files are in the same directory on HDFS
- Change the code that reads the CSV files so it points to the correct directory on HDFS
- Each code chunk operates at a different level of data granularity. Edit the code to use whichever granularity/feature set you consider appropriate
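The sketch below is a minimal, illustrative version of that flow using PySpark's MLlib SVM. The HDFS path, the assumption that the label sits in the last CSV column, and the split ratio and iteration count are placeholders; adjust them to match the actual data and granularity you choose.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="svm-modeling")

# Placeholder path -- point this at the HDFS directory that holds the CSV files.
raw = sc.textFile("hdfs:///path/to/csv/dir/*.csv")

def parse_line(line):
    # Assumes the label is the last column and all other columns are numeric
    # features; drop header rows and adjust indexing to match the real layout.
    fields = [float(x) for x in line.split(",")]
    return LabeledPoint(fields[-1], fields[:-1])

data = raw.map(parse_line)

# Split into training and testing sets (70/30 is an example split).
train, test = data.randomSplit([0.7, 0.3], seed=42)
train.cache()

# Train the SVM (iteration count is an example value).
model = SVMWithSGD.train(train, iterations=100)

# Evaluate: fraction of test points misclassified.
predictions = test.map(lambda p: (p.label, float(model.predict(p.features))))
error = predictions.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("Test error: %.4f" % error)

sc.stop()
```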