Scala API to estimate probabilistic clustering methods based on the Dataframe API machine learning classes.

daasampa/Clustering_models_Scala_Routine

Clustering models.

Customers fraud predictions: Machine Learning Classes.

This repository contains the code for probabilistic clustering methods written with Spark's Scala API, based on the DataFrame-based API of the MLlib library.

In the supervised learning branch of statistical modeling there is always a label (the response variable) which, in the fraud setting, splits the observations into two different groups; this fact is the building block of every classification methodology. The main objective of this repository is to propose an additional way to answer the need for predicting a customer's fraud class probability, this time with unsupervised methods, specifically probabilistic clustering methods. The key concepts needed to grasp the essence of this API are:

  • Calculate the "optimal" number of groups into which to distribute the customers who have previously been affected by fraud. These groups are interpreted as the fraud clusters.
    Depending on the nature of the features involved, the question "how many groups should I use?" may have a straightforward answer. The less variability the features have (more zeros or null values), the harder it is for the method to fit well.
    Unlike supervised learning methods, the procedure here uses the reported fraud cases as the only data from which to build the cluster structure. There is no need to include non-fraud cases.

  • Once the number of groups has been analyzed carefully, the estimation procedure yields an object, called the model, whose role is to transform the data set containing the features of non-affected customers. The transformation adds two new fields, probability and prediction. They are interpreted as the cluster label and the probability that each customer belongs to that cluster. The calculation of this probability is based on the Gaussian mixture model. Understanding the mathematical background of this method is mandatory to judge how well this approach can solve the fraud prediction problem.
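
To make the meaning of the probability and prediction fields concrete, here is a minimal sketch of how a Gaussian mixture assigns cluster membership. It is not code from this repository: it uses a one-dimensional mixture with made-up weights, means and standard deviations, and the names `Component` and `MixtureSketch` are hypothetical. Spark's model does the same kind of computation on multivariate feature vectors.

```scala
// Hypothetical one-dimensional mixture component; parameters are illustrative only.
final case class Component(weight: Double, mean: Double, std: Double)

object MixtureSketch {
  // Density of a univariate normal distribution at x.
  def normalPdf(x: Double, mean: Double, std: Double): Double = {
    val z = (x - mean) / std
    math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.Pi))
  }

  // Posterior probability of each component given x (Bayes' rule):
  // weight * density, normalized so the values sum to one.
  def probability(x: Double, components: Seq[Component]): Seq[Double] = {
    val weighted = components.map(c => c.weight * normalPdf(x, c.mean, c.std))
    val total = weighted.sum
    weighted.map(_ / total)
  }

  // Index of the most probable component, i.e. the cluster label.
  def prediction(x: Double, components: Seq[Component]): Int = {
    val p = probability(x, components)
    p.indexOf(p.max)
  }
}

// Made-up mixture with two clusters, evaluated for one customer value.
val components = Seq(Component(0.6, 0.0, 1.0), Component(0.4, 5.0, 1.5))
val probs = MixtureSketch.probability(4.0, components)
val label = MixtureSketch.prediction(4.0, components)
```

The value 4.0 lies much closer to the second component, so almost all of the probability mass goes to index 1 even though that component has the smaller weight.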

Be sure you understand these two points before moving on. I encourage you to make the effort to fully understand how this approach differs from the supervised learning one.
The rest of this document follows the same layout used to present Classification_models_Scala_API.

1. Modules (modules_fraud_model.scala): 📚

Imports the Scala classes needed to transform the data and estimate the probabilistic clustering method, including the Gaussian mixture model class. The estimation procedure uses the time-honored Expectation-Maximization algorithm.
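
As an illustration, a module file of this kind might look like the sketch below. The imports are the standard `spark.ml` classes for the stages named in this README; the exact list in modules_fraud_model.scala may differ, and the parameter values (number of clusters, iteration cap, tolerance, seed) are assumptions chosen only to show where the Expectation-Maximization settings live.

```scala
// Standard spark.ml classes for the stages described in this README.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Illustrative configuration of the estimator; all values are assumptions.
val gmm = new GaussianMixture()
  .setK(3)          // number of fraud clusters
  .setMaxIter(100)  // upper bound on EM iterations
  .setTol(0.01)     // convergence tolerance
  .setSeed(42L)     // fixed seed for reproducible initialization
```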

2. Data set (data_set_fraud_model.scala): 💾

Contains the SQL statement that reads the data stored in HDFS. The data holds the features for the two types of customers and is prepared by an ETL written in Cloudera Impala.
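
A minimal sketch of this step follows. The helper `loadCustomers`, the table name `customer_features` and the column `fraud_flag` are all hypothetical, since the real query is not shown here. A small temporary view stands in for the Impala-built table so the sketch runs without a cluster; in the Spark shell, `getOrCreate()` simply reuses the existing session.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: splits a customer table into fraud and non-fraud rows.
// `fraud_flag` is an assumed column name.
def loadCustomers(spark: SparkSession, table: String): (DataFrame, DataFrame) = {
  val fraud = spark.sql(s"SELECT * FROM $table WHERE fraud_flag = 1")
  val nonFraud = spark.sql(s"SELECT * FROM $table WHERE fraud_flag = 0")
  (fraud, nonFraud)
}

// Local stand-in data so the sketch runs without access to the cluster.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("data-set-sketch")
  .getOrCreate()
import spark.implicits._

Seq((1L, 120.0, 3.0, 1), (2L, 80.0, 1.0, 1), (3L, 15.0, 0.0, 0))
  .toDF("customer_id", "amount", "n_claims", "fraud_flag")
  .createOrReplaceTempView("customer_features")

val (fraudCustomers, otherCustomers) = loadCustomers(spark, "customer_features")
```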

3. Fraud model (fraud_model.scala): 👾

The final implementation, where the API's kernel dwells. Once the Modules and Data set files have been loaded, the user can call the Gaussian mixture method in the usual object-oriented form. Inside the method, a routine is carried out by the pipeline model; its components are the vector assembler, the min-max scaler (other scalers can be used instead), the machine learning method, and the pipeline itself. The results are also written to a Hive table.
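
The routine described above can be sketched as follows. This is an assumption-laden illustration rather than the contents of fraud_model.scala: the function `fitAndScore`, the column names, the random stand-in data and the output table name are all hypothetical. It fits the pipeline on fraud cases only and then transforms the remaining customers.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical function: fits on fraud cases, scores the remaining customers.
def fitAndScore(fraud: DataFrame, others: DataFrame,
                featureCols: Array[String], k: Int): DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(featureCols).setOutputCol("raw_features")
  val scaler = new MinMaxScaler()
    .setInputCol("raw_features").setOutputCol("features")
  val gmm = new GaussianMixture()
    .setK(k).setFeaturesCol("features").setSeed(42L)

  val pipeline = new Pipeline().setStages(Array(assembler, scaler, gmm))
  val model: PipelineModel = pipeline.fit(fraud) // cluster structure from fraud cases only
  model.transform(others)                        // adds "prediction" and "probability"
}

// Random stand-in data (not real customer data) so the sketch runs locally.
val spark = SparkSession.builder()
  .master("local[*]").appName("fraud-model-sketch").getOrCreate()
import spark.implicits._

val rng = new scala.util.Random(7L)
def sample(n: Int, centre: Double): Seq[(Double, Double)] =
  Seq.fill(n)((centre + rng.nextGaussian(), centre + rng.nextGaussian()))

val fraud = (sample(100, 0.0) ++ sample(100, 6.0)).toDF("amount", "n_claims")
val others = sample(50, 3.0).toDF("amount", "n_claims")

val scored = fitAndScore(fraud, others, Array("amount", "n_claims"), k = 2)

// On the cluster the result could then be persisted as a Hive table, e.g.
// scored.write.mode("overwrite").saveAsTable("sandbox.fraud_clusters") // hypothetical name
```

Because the scaler is fitted on the fraud cases, customers whose features fall outside the fraud range are scaled to values outside [0, 1]; whether that is acceptable depends on the data.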

The last file, Execution.scala, contains just the lines of code to execute in the Spark shell.
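
One possible shape for such a session, assuming the three files sit in the directory from which the Spark shell was started, uses the shell's `:load` command. The actual contents of Execution.scala are not reproduced here, so treat the order and the paths as an assumption.

```scala
// Typed at the spark-shell prompt; each :load runs the file in the current session.
:load modules_fraud_model.scala
:load data_set_fraud_model.scala
:load fraud_model.scala
```

Loading the files in this order makes the imports and the data set available before the model code that depends on them.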

Considerations:
1. Spark is currently accessed through an SFTP connection. MobaXterm is one way to do so; however, it is not supported and is subject to IT restrictions.
2. The source code in this repository cannot be executed inside the GitHub platform.
3. Updates are published here for the sake of version control. New versions do not migrate to the Landing Zone automatically; the user has to copy them to the node using, e.g., WinSCP or FileZilla.
