This repository was published to share a simple method for verifying, based on a few measurements, how a group of binary categorical features are related to each other. In particular, the routine answers the question: is there an association between a set of features and a main feature (the response variable)?
The use case for this implementation is the ability to create fraud rules, for mitigation or prevention, that can eventually be migrated to the SPSS Modeler routines that control the customers' use of digital channels.
The procedure's foundations are easy to understand; details can be found here.
This time the Spark resources are consumed through the Python API: PySpark. The reason is the ease with which Python lets us write the program, as a class, that performs the calculations over data represented as pandas data frames using lambda expressions. The calculations inside the class involve the combinatorial operator, which is called from a Python module (itertools).
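For instance, the feature subsets that the rules are built from can be enumerated with itertools.combinations; the column names below are only illustrative, not the repository's actual feature names:

```python
# Illustrative only: enumerating feature subsets with itertools.combinations.
import itertools

features = ["feat_a", "feat_b", "feat_c"]  # hypothetical binary feature names
subsets = [combo
           for r in range(1, len(features) + 1)
           for combo in itertools.combinations(features, r)]
print(subsets)  # ('feat_a',), ('feat_b',), ..., ('feat_a', 'feat_b', 'feat_c')
```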
The measurements mentioned earlier are support and confidence. Their interpretation is related to the concepts of probability and conditional probability.
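Concretely, for a rule X → Y evaluated on N records, support(X → Y) = count(X and Y) / N, which estimates P(X ∧ Y), and confidence(X → Y) = count(X and Y) / count(X), which estimates P(Y | X). As a worked example: if 1,000 records contain 200 with the feature set X, of which 50 also carry the fraud flag Y, then support = 50/1000 = 0.05 and confidence = 50/200 = 0.25.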
The routine is divided into 4 parts:
Importing the PySpark classes needed to handle the data through the data frame structure: pandas; SQL functions to explore results; tab completion; the combinatorics function; and turning off the log level while the routine executes.
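A hedged sketch of what this setup block might look like; the exact imports, application name, and log level are assumptions, not the repository's literal code:

```python
# Sketch of the setup block; names and settings are assumptions.
import itertools                      # combinations over the feature columns
import pandas as pd                   # local data-frame manipulation

import rlcompleter, readline          # tab completion in the interactive shell
readline.parse_and_bind("tab: complete")

from pyspark.sql import SparkSession
from pyspark.sql import functions as F            # SQL functions to explore results
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-arl").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")            # silence Spark logging during execution
```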
A Python class that calculates the support and confidence measures used to analyze the association between the set of fraud business features and the fraud class variable. Reading through the code makes it clear how the calculations are performed. This class is the building block for the ones that follow.
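A hedged sketch of what such a class could look like; the class name, column labels, and output format are assumptions rather than the repository's actual code:

```python
# Sketch of an association-rule-style class over binary features, assuming a
# pandas DataFrame with hypothetical 0/1 feature columns and a 0/1 fraud flag.
import itertools
import pandas as pd

class ARL:
    """Computes support and confidence of feature subsets against a target flag."""

    def __init__(self, data: pd.DataFrame, features: list, target: str = "fraud"):
        self.data = data
        self.features = features
        self.target = target

    def measures(self, max_size: int = None) -> pd.DataFrame:
        """Support and confidence for every non-empty feature combination."""
        max_size = max_size or len(self.features)
        n = len(self.data)
        y = self.data[self.target] == 1
        rows = []
        for r in range(1, max_size + 1):
            for combo in itertools.combinations(self.features, r):
                x = (self.data[list(combo)] == 1).all(axis=1)
                n_x, n_xy = int(x.sum()), int((x & y).sum())
                rows.append({
                    "rule": " & ".join(combo) + " -> " + self.target,
                    "support": n_xy / n,                       # ~ P(X and Y)
                    "confidence": n_xy / n_x if n_x else 0.0,  # ~ P(Y | X)
                })
        return pd.DataFrame(rows)
```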
The parameters function loads the data, stored in HDFS thanks to an Impala ETL, and its return statement is built on an instance of the ARL class.
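One possible shape for that function; the database and table names are placeholders, not the repository's real ones:

```python
# Sketch of the parameters function; database/table names are placeholders.
def parameters(spark, table="fraud_db.design_matrix", features=None, target="fraud"):
    """Read the design matrix (loaded into HDFS by the Impala ETL) and
    return an ARL instance ready to compute support and confidence."""
    pdf = spark.sql("SELECT * FROM " + table).toPandas()
    features = features or [c for c in pdf.columns if c != target]
    return ARL(pdf, features, target)
```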
Transforming the pandas data frame into a PySpark SQL data frame using the StructField and StructType methods, this function writes the results to HDFS, where they can be checked with Impala or Hive.
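One way this step could be written; the schema fields and table name are assumptions:

```python
# Sketch of the write step; schema fields and table name are assumptions.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

def write_results(spark, results_pdf, table="fraud_db.arl_results"):
    schema = StructType([
        StructField("rule", StringType(), True),
        StructField("support", DoubleType(), True),
        StructField("confidence", DoubleType(), True),
    ])
    sdf = spark.createDataFrame(results_pdf, schema=schema)
    # Persist to HDFS as a table that can later be queried from Impala or Hive.
    sdf.write.mode("overwrite").saveAsTable(table)
```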
A Python class that performs, once all the external data is loaded into HDFS, the ETL needed to build the design matrix (as an Impala table), which is the input to the whole methodology.
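A hedged outline of such a class; the SQL, column names, and table names are placeholders for the repository's actual ETL:

```python
# Outline only: the real ETL SQL lives in the repository; names are placeholders.
class DesignMatrixETL:
    """Builds the design matrix table consumed by the ARL class."""

    def __init__(self, spark, source="fraud_db.raw_events",
                 target="fraud_db.design_matrix"):
        self.spark = spark
        self.source = source
        self.target = target

    def run(self):
        # Derive the binary business features and the fraud flag from the raw
        # events, materialising the result as a table visible to Impala/Hive.
        self.spark.sql("""
            CREATE TABLE IF NOT EXISTS {target} STORED AS PARQUET AS
            SELECT customer_id,
                   CAST(feat_a AS INT) AS feat_a,
                   CAST(feat_b AS INT) AS feat_b,
                   CAST(fraud  AS INT) AS fraud
            FROM {source}
        """.format(target=self.target, source=self.source))
```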