
Scala routines to estimate classification methods based on the DataFrame API machine learning classes.




Binary classification models.

Customer fraud prediction: machine learning classes.

This repository contains the code for Spark's Scala classification methods, based on the MLlib (DataFrame-based) API.

This repository was built to meet the requirement of predicting the fraud-class probability from a group of significant variables for a particular set of customers. The result, the confusion matrix of each method, can be written to HDFS by adding a few extra lines of code. The data loaded into the Spark session comes from HDFS, and its design originates from a Cloudera Impala ETL; the Impala SQL is not currently available in this repository. The user can treat the elements of this repository as a functional API that estimates 5 different types of statistical methods: decision trees, adaptive boosting based on trees, random forest, logistic regression, and naive Bayes for binary classification (the multi-level classification case can also be considered). The calculations are performed through Spark's pipeline-stages model.
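For orientation, a hedged sketch of how those five methods map onto classes in Spark 2.2.0's DataFrame-based API follows (MLlib's tree-boosting class is the gradient-boosted-tree classifier; the exact configuration used in the repository may differ):

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, GBTClassifier,
  LogisticRegression, NaiveBayes, RandomForestClassifier}

val tree     = new DecisionTreeClassifier()  // decision trees
val boosting = new GBTClassifier()           // boosting based on trees (gradient-boosted trees)
val forest   = new RandomForestClassifier()  // random forest
val logit    = new LogisticRegression()      // logistic regression
val bayes    = new NaiveBayes()              // naive Bayes
```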

The correct order for understanding this development is as follows:

1. Modules (modules_fraud_model.scala): 📚

Imports the Scala classes needed to transform the data and estimate the different methods. Estimation is defined through a cross-validation model using the AUPR metric (area under the precision-recall curve); the metric can be switched.
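A minimal sketch of that cross-validation setup in the DataFrame API; the estimator, parameter grid, and fold count are illustrative placeholders, not the repository's actual values:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Model selection is driven by the area under the precision-recall curve;
// setMetricName("areaUnderROC") would switch the criterion.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")

val crossValidator = new CrossValidator()
  .setEstimator(new LogisticRegression())                // stands in for any of the five methods
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // empty grid: evaluate default parameters only
  .setNumFolds(5)                                        // hypothetical number of folds
```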

2. Data set (data_set_fraud_model.scala): 💾

Imports the data set, stored in HDFS, into the Spark session. The data structure is org.apache.spark.sql.DataFrame. After the user's call, the data set is cached with the cache() instruction. This step is extremely important, since the calculations run in parallel on Bancolombia's Hadoop cluster resources.
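A minimal sketch of that loading-and-caching step, assuming a hypothetical HDFS path, file format, and column layout (the real table comes from the Impala ETL mentioned above):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("fraud_model").getOrCreate()

// Read the ETL output from HDFS; the path and format are placeholders.
val dataSet: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///user/fraud/customer_features")
  .cache()            // keep the data in memory for the repeated pipeline computations

dataSet.count()       // an action that materializes the cache before fitting the models
```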

3. Fraud model (fraud_model.scala): 👾

The final implementation, where the API's kernel dwells. After the Modules and Data set calls, the user can invoke any of the 5 methods in the usual object-oriented form. Inside every method there is a routine performed by the pipeline model; the stages are: vector assembler, min-max scaler (others can be invoked), the machine learning method, and the pipeline itself.
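A hedged sketch of that pipeline routine, reusing the dataSet DataFrame from the previous sketch; the feature columns, label column, and the chosen classifier are hypothetical placeholders:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Stage 1: assemble the significant variables into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "transaction_count", "tenure_months"))  // hypothetical variables
  .setOutputCol("raw_features")

// Stage 2: rescale the assembled vector to [0, 1]; other scalers can be swapped in.
val scaler = new MinMaxScaler()
  .setInputCol("raw_features")
  .setOutputCol("features")

// Stage 3: the machine learning method (any of the five classifiers fits here).
val classifier = new RandomForestClassifier()
  .setLabelCol("fraud_label")   // hypothetical label column
  .setFeaturesCol("features")

// Stage 4: the pipeline ties the stages together and is fitted on the cached data set.
val pipeline = new Pipeline().setStages(Array(assembler, scaler, classifier))
val fittedModel = pipeline.fit(dataSet)
```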

The last file, Execution.scala, contains just the code lines to execute on the Spark shell.
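From the Spark shell, that execution could look roughly like the following; the edge-node file paths are hypothetical:

```
// Launched from the node, e.g.:  spark-shell --master yarn
// Then load the three source files in order before calling the chosen method.
:load /home/user/fraud/modules_fraud_model.scala
:load /home/user/fraud/data_set_fraud_model.scala
:load /home/user/fraud/fraud_model.scala
```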

Finally, I must say the Spark version used was 2.2.0. ✅ ✅ ✅

Considerations:
1. Without an acceptable theoretical background in all the methods, this API loses much of its value. The web is full of references; I encourage you to seek out those that best fit your way of understanding these topics.
2. The current way to access Spark is through an SFTP connection. MobaXterm is an alternative for doing so; however, it has no support and, indeed, is subject to IT restrictions.
3. The source code in this repository cannot be executed inside the GitHub platform.
4. Updates published here are for the sake of version control. The new versions themselves do not migrate directly to the Landing Zone (LZ); the user has to copy them onto the node using, e.g., WinSCP or FileZilla.
