This repository was published to share a simple method for verifying, based on a few measurements, how a group of binary categorical features are related to each other. In particular, the routine answers the question: is there an association between a set of features and a main feature (the response variable)?
The use case for this implementation is the ability to create fraud rules, for mitigation or prevention, that can eventually be migrated to the SPSS Modeler routines that control the customers' use of digital channels.
The procedure's foundations are easy to understand; details can be found here.
This time the Spark resources are consumed through the Python API: PySpark. The reason is the ease with which Python lets us write the program, as a class, that performs the calculations over data represented as pandas data frames using lambda expressions. The calculations inside the class involve the combinatorial operator, which is called from a Python module (itertools).
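For instance, the feature subsets that the rules are built from can be enumerated with itertools.combinations; the column names below are only illustrative, not the repository's actual feature names:

```python
# Illustrative only: enumerating feature subsets with itertools.combinations.
import itertools

features = ["feat_a", "feat_b", "feat_c"]  # hypothetical binary feature names
subsets = [combo
           for r in range(1, len(features) + 1)
           for combo in itertools.combinations(features, r)]
print(subsets)  # ('feat_a',), ('feat_b',), ..., ('feat_a', 'feat_b', 'feat_c')
```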
The measurements mentioned earlier are support and confidence. Their interpretation is related to the concepts of probability and conditional probability.
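Concretely, for a rule X → Y evaluated on N records, support(X → Y) = count(X and Y) / N, which estimates P(X ∧ Y), and confidence(X → Y) = count(X and Y) / count(X), which estimates P(Y | X). As a worked example: if 1,000 records contain 200 with the feature set X, of which 50 also carry the fraud flag Y, then support = 50/1000 = 0.05 and confidence = 50/200 = 0.25.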
The routine is divided into 4 parts:
Importing the PySpark classes needed to handle the data through the data frame structure: pandas; SQL functions to explore results; tab completion; the combinatorics function; and turning off the log level while the routine executes.
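A hedged sketch of what this setup block might look like; the exact imports, application name, and log level are assumptions, not the repository's literal code:

```python
# Sketch of the setup block; names and settings are assumptions.
import itertools                      # combinations over the feature columns
import pandas as pd                   # local data-frame manipulation

import rlcompleter, readline          # tab completion in the interactive shell
readline.parse_and_bind("tab: complete")

from pyspark.sql import SparkSession
from pyspark.sql import functions as F            # SQL functions to explore results
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-arl").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")            # silence Spark logging during execution
```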
A Python class that calculates the support and confidence measures used to analyze the association between the set of fraud business features and the fraud class variable. Reading through the code makes it clear how the calculations are performed. This class is the building block for the ones that follow.
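A hedged sketch of what such a class could look like; the class name, column labels, and output format are assumptions rather than the repository's actual code:

```python
# Sketch of an association-rule-style class over binary features, assuming a
# pandas DataFrame with hypothetical 0/1 feature columns and a 0/1 fraud flag.
import itertools
import pandas as pd

class ARL:
    """Computes support and confidence of feature subsets against a target flag."""

    def __init__(self, data: pd.DataFrame, features: list, target: str = "fraud"):
        self.data = data
        self.features = features
        self.target = target

    def measures(self, max_size: int = None) -> pd.DataFrame:
        """Support and confidence for every non-empty feature combination."""
        max_size = max_size or len(self.features)
        n = len(self.data)
        y = self.data[self.target] == 1
        rows = []
        for r in range(1, max_size + 1):
            for combo in itertools.combinations(self.features, r):
                x = (self.data[list(combo)] == 1).all(axis=1)
                n_x, n_xy = int(x.sum()), int((x & y).sum())
                rows.append({
                    "rule": " & ".join(combo) + " -> " + self.target,
                    "support": n_xy / n,                       # ~ P(X and Y)
                    "confidence": n_xy / n_x if n_x else 0.0,  # ~ P(Y | X)
                })
        return pd.DataFrame(rows)
```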
The parameters function loads the data, stored in HDFS thanks to an Impala ETL, and its return statement is built on an instance of the ARL class.
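One possible shape for that function; the database and table names are placeholders, not the repository's real ones:

```python
# Sketch of the parameters function; database/table names are placeholders.
def parameters(spark, table="fraud_db.design_matrix", features=None, target="fraud"):
    """Read the design matrix (loaded into HDFS by the Impala ETL) and
    return an ARL instance ready to compute support and confidence."""
    pdf = spark.sql("SELECT * FROM " + table).toPandas()
    features = features or [c for c in pdf.columns if c != target]
    return ARL(pdf, features, target)
```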
Transforming the pandas data frame into a PySpark SQL data frame using the StructField and StructType methods, this function writes the results to HDFS, where they can be checked with Impala or Hive.
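One way this step could be written; the schema fields and table name are assumptions:

```python
# Sketch of the write step; schema fields and table name are assumptions.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

def write_results(spark, results_pdf, table="fraud_db.arl_results"):
    schema = StructType([
        StructField("rule", StringType(), True),
        StructField("support", DoubleType(), True),
        StructField("confidence", DoubleType(), True),
    ])
    sdf = spark.createDataFrame(results_pdf, schema=schema)
    # Persist to HDFS as a table that can later be queried from Impala or Hive.
    sdf.write.mode("overwrite").saveAsTable(table)
```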
A Python class that performs, once all the external data is loaded into HDFS, the ETL needed to build the design matrix (as an Impala table), which is the input to the whole methodology.
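A hedged outline of such a class; the SQL, column names, and table names are placeholders for the repository's actual ETL:

```python
# Outline only: the real ETL SQL lives in the repository; names are placeholders.
class DesignMatrixETL:
    """Builds the design matrix table consumed by the ARL class."""

    def __init__(self, spark, source="fraud_db.raw_events",
                 target="fraud_db.design_matrix"):
        self.spark = spark
        self.source = source
        self.target = target

    def run(self):
        # Derive the binary business features and the fraud flag from the raw
        # events, materialising the result as a table visible to Impala/Hive.
        self.spark.sql("""
            CREATE TABLE IF NOT EXISTS {target} STORED AS PARQUET AS
            SELECT customer_id,
                   CAST(feat_a AS INT) AS feat_a,
                   CAST(feat_b AS INT) AS feat_b,
                   CAST(fraud  AS INT) AS fraud
            FROM {source}
        """.format(target=self.target, source=self.source))
```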