This repository contains the Scala code for Spark classification methods based on the MLlib (DataFrame) API.
This repository was built to meet the requirement of predicting the fraud class probability for a particular set of customers, given a group of significant variables. The result, the confusion matrix of each method, can be written to HDFS by adding a few extra lines of code. The data loaded into the Spark session comes from HDFS, and its design came from a Cloudera Impala ETL; the Impala SQL is not currently available in this repository. The user can treat the elements of this repository as a functional API that allows the estimation of 5 different types of statistical methods: decision trees, tree-based adaptive boosting, random forest, logistic regression, and naive Bayes for binary classification (the multi-class case can also be considered). The calculations are organized around Spark's ML pipeline stages.
The correct order for understanding this development is as follows:
Import the Scala classes needed to transform the data and estimate the different methods. Model selection is performed through cross-validation using the AUPR metric (area under the precision-recall curve); the metric can be switched.
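As a minimal sketch of this step, the cross-validation setup with the AUPR metric could look like the following. The column names (`label`, `scaledFeatures`), the choice of logistic regression as the estimator, and the grid values are illustrative assumptions, not the repository's actual configuration:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Evaluator configured with the area under the precision-recall curve;
// set "areaUnderROC" instead to switch the selection metric.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderPR")

// One of the 5 methods, used here only as an example estimator.
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("scaledFeatures")

// Small illustrative grid over the regularization strength.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// Cross-validation picks the parameter combination with the best AUPR.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(5)
```

Calling `cv.fit(trainingData)` then returns the best model found under the chosen metric.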
Import the data set, stored in HDFS, into the Spark session. The data structure is org.apache.spark.sql.DataFrame. After the user's call, the data set is persisted with the cache() instruction. This step is extremely important, since the calculations run in parallel on Bancolombia's Hadoop cluster resources and caching avoids re-reading the data from HDFS on every pass.
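The loading step above can be sketched as follows. The function name `loadData` and the Parquet storage format are assumptions for illustration; the actual HDFS path and format depend on the output of the Impala ETL:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of the data-import step; the path argument would point at the
// HDFS location produced by the Impala ETL.
def loadData(spark: SparkSession, path: String): DataFrame = {
  // The schema is read from the stored files; the result is a DataFrame.
  val df = spark.read.parquet(path)
  // cache() persists the DataFrame in cluster memory, so the repeated
  // passes made by the ML methods do not re-read the data from HDFS.
  df.cache()
  df
}
```

Note that cache() only marks the DataFrame for persistence; the data is materialized in memory the first time an action (for example count()) is executed on it.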
The final implementation, where the API's kernel dwells. After the modules and the data set have been loaded, the user can call any of the 5 methods in the usual object-oriented form. Inside every method there is a routine performed by the pipeline model; the stages are: a vector assembler, a min-max scaler (others can be invoked), the machine learning method, and the pipeline itself.
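The stages listed above can be sketched like this. The input column names and the choice of random forest as the example method are assumptions; the repository's methods would substitute their own estimator in the same slot:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Stage 1: assemble the significant variables into a single feature vector.
// The input column names here are hypothetical.
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "txnCount", "avgBalance"))
  .setOutputCol("features")

// Stage 2: rescale each feature to [0, 1]; another scaler such as
// StandardScaler could be invoked here instead.
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Stage 3: the machine learning method (random forest as an example).
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("scaledFeatures")

// Stage 4: the pipeline itself chains the stages; fit() on the training
// DataFrame runs them in order and returns a fitted PipelineModel.
val pipeline = new Pipeline().setStages(Array(assembler, scaler, rf))
```

Because every method shares this structure, swapping the estimator (decision tree, boosting, logistic regression, naive Bayes) only changes the third stage.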
The last file, Execution.scala, contains just the code lines to execute in the Spark shell.
Finally, the Spark version used was 2.2.0.