- Install requirements: `pip install -r requirements.txt`
- Download the datasets into their respective folders under `/data/`
- Run the preprocessing script on the downloaded datasets: `python preprocess.py`
- Run the main program: `python main.py`
- Clean the datasets
- Save cleaned versions of the datasets (see the cleaning sketch below)
- Look into Compute Canada
- Download the UCI data
- Scale features if their value ranges are too large (see the scaling sketch below)
- Create configs for the datasets, starting with Huseyin's 18 datasets (see the config sketch below)
- Run the two-level RF analysis on the datasets (see the poisoning-metrics sketch below)
- Record entropy at the major poisoning levels
- Record the performance (AUC, Bias, LogLoss) of the first-level RF, evaluated on the test data
- Use the fewest neurons and the fewest layers needed to reach the RF's performance (or use the same number of neurons and layers for all datasets and compare the results?) (see the architecture-search sketch below)
- Based on the RF breaking point and NN simplicity, explain the data either in global terms (global explanations) or in terms of salient data points (local explanations); both are open research problems.
- Global explanations might be handled by applying functional data depth to the entropy lines, or by reporting breaking points in performance with respect to the poisoning rate (see the band-depth sketch below)
- Local explanations: identify the data points whose removal causes the biggest performance drop (see the leave-one-out sketch below)
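
A minimal cleaning/saving sketch, assuming raw CSVs sit under `data/raw/` and cleaned copies go to `data/clean/` (both subfolder names are hypothetical; the actual layout under `/data/` may differ):

```python
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw")      # hypothetical location of the downloaded CSVs
CLEAN_DIR = Path("data/clean")  # hypothetical destination for cleaned copies
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

for csv_path in RAW_DIR.glob("*.csv"):
    df = pd.read_csv(csv_path)
    df = df.drop_duplicates()  # remove exact duplicate rows
    df = df.dropna()           # remove rows with missing values
    df.to_csv(CLEAN_DIR / csv_path.name, index=False)
```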
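
A scaling sketch; the range threshold and the choice of `StandardScaler` are assumptions, not a rule taken from the repo:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_if_large(X: np.ndarray, max_range: float = 1000.0) -> np.ndarray:
    """Standardize X when any feature's value range exceeds max_range."""
    ranges = X.max(axis=0) - X.min(axis=0)
    if np.any(ranges > max_range):
        return StandardScaler().fit_transform(X)
    return X
```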
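
A hypothetical shape for the per-dataset configs; every field name below is an assumption:

```python
# One entry per dataset; start by filling this in for Huseyin's 18 datasets.
DATASET_CONFIGS = {
    "example_dataset": {
        "path": "data/clean/example_dataset.csv",  # cleaned CSV location
        "target": "label",                         # name of the label column
        "poisoning_levels": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    },
}
```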
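
A measurement sketch for one poisoning sweep: flip a fraction of the training labels, fit an RF, and record the mean prediction entropy plus AUC and log loss on the test split. The two-level RF procedure itself and the "Bias" metric are project-specific and not reproduced here; `make_classification` stands in for a real dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

def poison_labels(y, rate, rng):
    """Flip `rate` of the binary training labels at random."""
    y = y.copy()
    flip = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y[flip] = 1 - y[flip]
    return y

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

for rate in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    rf = RandomForestClassifier(random_state=0).fit(X_tr, poison_labels(y_tr, rate, rng))
    p = rf.predict_proba(X_te)
    # mean Shannon entropy of the predicted class distributions
    entropy = -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=1).mean()
    print(rate, entropy, roc_auc_score(y_te, p[:, 1]), log_loss(y_te, p))
```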
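
An architecture-search sketch for the fewest-neurons idea: grow an MLP until its test AUC matches the RF's. The candidate grid and the tolerance are assumptions:

```python
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

def smallest_matching_mlp(X_tr, y_tr, X_te, y_te, rf_auc, tol=0.01):
    """Return the first (smallest) architecture whose test AUC reaches rf_auc - tol."""
    for hidden in [(2,), (4,), (8,), (4, 4), (8, 8), (16, 16)]:
        mlp = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000,
                            random_state=0).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
        if auc >= rf_auc - tol:
            return hidden, auc
    return None, None  # no candidate matched the RF
```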
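
A band-depth sketch for the global-explanation idea: one concrete reading of "functional data depth on entropy lines" is modified band depth (López-Pintado and Romo, J = 2) applied to the entropy-vs-poisoning curves, one curve per dataset; choosing MBD as the depth notion is an assumption:

```python
from itertools import combinations

import numpy as np

def modified_band_depth(curves: np.ndarray) -> np.ndarray:
    """curves: (n_curves, n_grid) array; returns one depth value per curve.

    A curve is deep if it often lies inside the band spanned by pairs of
    other curves; the deepest curve is the most central entropy line.
    """
    n = len(curves)
    depth = np.zeros(n)
    for i, k in combinations(range(n), 2):
        lo = np.minimum(curves[i], curves[k])
        hi = np.maximum(curves[i], curves[k])
        # fraction of grid points where each curve lies inside the band
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)
```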
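
A brute-force leave-one-out sketch for the local-explanation idea: retrain without each training point and record the AUC drop. This costs one retrain per point, so it is only feasible for small datasets; using AUC as the impact metric is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def removal_impacts(X_tr, y_tr, X_te, y_te):
    """AUC drop caused by removing each training point (larger = more salient)."""
    base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    base_auc = roc_auc_score(y_te, base.predict_proba(X_te)[:, 1])
    drops = np.empty(len(X_tr))
    for i in range(len(X_tr)):
        keep = np.arange(len(X_tr)) != i
        m = RandomForestClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])
        drops[i] = base_auc - roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    return drops
```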
Excluded datasets:

| Dataset | Reason for exclusion |
|---|---|
| haberman | too few features |
| temp of major cities | not a classification task |
| wisconsin cancer | duplicate |