FraudHacker is an anomaly detection system for Medicare insurance claims data. I built FraudHacker using Python 3 along with various scientific computing and machine learning packages (numpy, scikit-learn, and many others). For more background on why I built FraudHacker, please see my blog post on the subject. I will focus on the technical details here.
data/: Contains a CSV file with the outlier count data generated by the anomaly labeling engine.
notebooks/: Jupyter notebooks demonstrating various aspects of FraudHacker's workflow, including the outlier detection, physician ranking, and hyperparameter sweeping.
src/: The actual source code for FraudHacker and the Flask app that displays its results to users.
Each directory has its own README file with more information.
A reader class, PandasDBReader (implemented in database_tools.py), reads the data from the PostgreSQL database (whose connection info is specified in an external YAML file) and loads it into a Pandas DataFrame. This DataFrame is then ingested by an AnomalyDetector subclass, chosen according to the desired algorithm (the subclasses are implemented in anomaly_tools.py). The AnomalyDetector performs the actual clustering and outlier labeling: it produces an outlier score for each record, then applies a threshold to those scores to formally label certain records as outliers. The AnomalyDetector class also adds up the outlier counts for each physician.
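The score, threshold, and count steps can be illustrated with a minimal sketch. This is not FraudHacker's actual code: it uses plain Python lists instead of a Pandas DataFrame so that it runs without dependencies, and a simple absolute z-score as a stand-in for the scores produced by the real AnomalyDetector subclasses. The record layout, the function names, and the threshold value of 2.0 are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical claim records: (physician_id, billed_amount).
claims = [
    ("dr_a", 100.0),
    ("dr_a", 110.0),
    ("dr_b", 95.0),
    ("dr_b", 105.0),
    ("dr_c", 100.0),
    ("dr_c", 900.0),  # unusually large claim
]


def outlier_scores(values):
    """Absolute z-score of each value (a stand-in for a real detector's score)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [abs(v - mu) / sigma for v in values]


def count_outliers(records, threshold=2.0):
    """Label records scoring above `threshold` and tally them per physician."""
    scores = outlier_scores([amount for _, amount in records])
    # Start every physician at zero so those without outliers still appear.
    counts = Counter({physician: 0 for physician, _ in records})
    for (physician, _), score in zip(records, scores):
        if score > threshold:
            counts[physician] += 1
    return dict(counts)


outlier_counts = count_outliers(claims)
```

Keeping the scoring function separate from the thresholding and counting mirrors the idea of swapping in different detection algorithms while leaving the per-physician tally unchanged.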
The next step is currently done semi-manually (this could be improved in the future). I export the outlier counts for each physician to a CSV file (an example of what this data looks like can be found in the data folder). The outlier count data is then imported into another PostgreSQL database, which is read directly by the Flask app. This is accomplished via another class, the OutlierCountDBReader (also implemented in database_tools.py). The OutlierCountDBReader produces the values that are ultimately displayed in the Flask app.
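As a rough sketch of how this export, import, and read cycle could be automated, the example below writes per-physician counts to CSV, loads them into a database table, and queries them back. It rests on several assumptions: SQLite (from the standard library) stands in for PostgreSQL, an in-memory buffer stands in for the CSV file on disk, and the table name, column names, and `top_physicians` helper are hypothetical rather than taken from OutlierCountDBReader.

```python
import csv
import io
import sqlite3

# Hypothetical per-physician outlier counts, as the detector might produce them.
outlier_counts = {"dr_a": 0, "dr_b": 2, "dr_c": 5}

# 1. Export the counts to CSV (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["physician_id", "outlier_count"])
for physician_id, count in outlier_counts.items():
    writer.writerow([physician_id, count])

# 2. Import the CSV into a database table (SQLite stands in for PostgreSQL).
buffer.seek(0)
rows = [
    (record["physician_id"], int(record["outlier_count"]))
    for record in csv.DictReader(buffer)
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outlier_counts ("
    "physician_id TEXT PRIMARY KEY, outlier_count INTEGER)"
)
conn.executemany("INSERT INTO outlier_counts VALUES (?, ?)", rows)
conn.commit()


# 3. Read the counts back, as a web app's reader class might do.
def top_physicians(connection, limit=2):
    """Return the `limit` physicians with the most outliers, highest first."""
    cursor = connection.execute(
        "SELECT physician_id, outlier_count FROM outlier_counts "
        "ORDER BY outlier_count DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


top = top_physicians(conn)
```

Chaining the three steps in one script like this would remove the manual export and import, at the cost of coupling the detection run to the database that serves the app.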