## **Motivation**

Digital transformation is driven primarily by data, which is why companies are investing in *maximizing value from data*. Machine learning algorithms are trained using statistical methods to imitate the way humans learn and make decisions.

Current applications rely on human ML experts to perform many related tasks:
*   Preprocessing
*   Selecting and constructing appropriate features for model training
*   Selecting the optimal model
*   Optimizing model hyperparameters
*   Checking design of neural network topology (deep learning)
*   Postprocessing of model
*   Evaluating metrics

Since the complexity of tasks is often beyond the scope of non-ML experts, the rapid growth of ML use-cases has created a demand for **off-the-shelf ML methods** that *can be used easily and without expert knowledge*. **Think "plug and play."**


## **What can it do?**

AutoML has many applications:

*   *Time-series forecasting*: normally complicated due to very large datasets
*   *Classification problems*: predicting a label, including handwriting interpretation and fraud detection
*   *Feature selection*: selectively choose strongest predictors
*   *Algorithm selection*: automatically iterates through different models
*   *Model evaluation*: determining whether the model is underfitting or overfitting

AutoML is especially powerful for **Deep Learning, Neural Networks, and Meta Learning**, which all have high costs in both time and resources.

Popular examples of AutoML tools are:

*   Auto-PyTorch
*   Auto-sklearn
*   MLBox
*   Amazon AutoGluon and Amazon Sagemaker
*   TPOT
*   AutoKeras, developed by DATA lab at Texas A&M!
*   H2O AutoML
*   Google Cloud AutoML






## **Benefits**

In general, **AutoML makes it easier to incorporate machine learning in diverse contexts without the need for an experienced engineer**.

1.   Saves time by automating the modelling process, including capabilities such as Neural Architecture Search (NAS)
2.   Fills the skill gap by providing a way for all types of enterprises to apply ML, mitigating the impact of shortages of ML professionals
3.   Increases productivity by removing the complexity of testing, developing, tuning, and deploying ML frameworks
4.   Enhances scalability by reducing human error



## **Limitations**

The blackboxing of AutoML is powerful in business contexts, but this **abstraction is a double-edged sword that poses problems for deeper data problems**.

1.   Difficult to reconcile interactions between data, models, and humans
2.   Interpreting unstructured and semi-structured data can be ambiguous
3.   Optimization goals are changing constantly (tug of war in many directions)
4.   Informed decisions cannot be made until final results are disclosed
5.   AutoML apps can only run on ML model programs, like PyTorch, which means they are not fully accessible yet
6.   Black box means the model is not explainable

Even though AutoML is powerful, an engineer still needs to supervise it and intervene when necessary. **After our data science and ML workshops, that can be you!**

## **Code Walkthrough: H2O AutoML**

There are many AutoML frameworks in the industry. We will go over **H2O**, which has many strengths:


*   Data preprocessing, including categorical encoding
*   Data cleaning and data imputation
*   Handles model selection and hyperparameter tuning
*   Provides nice leaderboard with metrics for various models
*   Supports GPUs for XGBoost (ensemble learning model)
*   Best thing: generates production-ready deployable that can be plugged into any application!  


First, install the necessary components of H2O, which needs a Java runtime in addition to the Python package. **It's much easier to use Colab here!**

In [None]:
!apt-get install default-jre
!java -version

In [None]:
!pip install h2o

Let's start the H2O cluster to begin our AutoML experience. The output below shows cluster details such as the H2O version, free memory, and number of cores.

In [30]:
import h2o
from h2o.automl import H2OAutoML

In [7]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.20.1" 2023-08-24; OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04); OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmph2bw1m8p
  JVM stdout: /tmp/tmph2bw1m8p/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmph2bw1m8p/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.42.0.3
H2O_cluster_version_age:,"28 days, 4 hours and 21 minutes"
H2O_cluster_name:,H2O_from_python_unknownUser_19qwf2
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.170 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


This dataset contains telecom customer churn data. We are using it because it contains categorical variables and missing values, which normally require feature engineering before they can be used for ML. **However, H2O will handle this step for us!**

*Note: In the real world, feature engineering should not be left completely to the model, since an engineer will always have more domain knowledge and business context than a model.*

*Note: Notice that we are not using the Pandas library in this notebook. H2O has its own data manipulation framework.*

In [8]:
churn_df = h2o.import_file('https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv')

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [26]:
type(churn_df)

h2o.frame.H2OFrame

In [23]:
churn_df.head(5)

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [46]:
churn_df.isna().sum()   # number of missing values

11.0
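
One common way to deal with missing numeric values is mean imputation. The sketch below is plain Python for illustration only; it is not H2O's implementation, and the `impute_mean` helper and the sample values are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Toy column with one missing entry (illustrative values only).
total_charges = [29.85, 1889.5, None, 108.15]
filled = impute_mean(total_charges)
print(filled)
```

Mean imputation is simple but flattens variance, so it is worth checking whether the rows with missing values share a pattern before relying on it.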

As shown below, most attributes are categorical enumerations, which would usually require a decent amount of encoding by an engineer.

In [9]:
churn_df.types

{'customerID': 'string',
 'gender': 'enum',
 'SeniorCitizen': 'int',
 'Partner': 'enum',
 'Dependents': 'enum',
 'tenure': 'int',
 'PhoneService': 'enum',
 'MultipleLines': 'enum',
 'InternetService': 'enum',
 'OnlineSecurity': 'enum',
 'OnlineBackup': 'enum',
 'DeviceProtection': 'enum',
 'TechSupport': 'enum',
 'StreamingTV': 'enum',
 'StreamingMovies': 'enum',
 'Contract': 'enum',
 'PaperlessBilling': 'enum',
 'PaymentMethod': 'enum',
 'MonthlyCharges': 'real',
 'TotalCharges': 'real',
 'Churn': 'enum'}
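
To make "encoding" concrete, here is a minimal one-hot encoding sketch in plain Python. It is an illustration of the general technique, not H2O's internal code; the `one_hot` helper is hypothetical, and the `Column.level` naming simply mirrors names such as `Contract.Two year` that appear in the variable-importance output later in this notebook.

```python
def one_hot(column_name, values):
    """Expand one categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    return [
        {f"{column_name}.{level}": int(value == level) for level in levels}
        for value in values
    ]


# Contract values from the first five rows shown earlier.
contract = ["Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month"]
encoded = one_hot("Contract", contract)
for row in encoded:
    print(row)
```

Doing this by hand for sixteen `enum` columns is tedious and error-prone, which is the work AutoML takes over here.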

The dataset contains information about customers, their products, and whether or not they "**churned**," or ended their subscription.

**Objective: predict whether a customer will churn in the future based on this dataset**

In [21]:
churn_df.describe()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
type,string,enum,int,enum,enum,int,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,real,real,enum
mins,,,0.0,,,0.0,,,,,,,,,,,,,18.25,18.8,
mean,,,0.1621468124378816,,,32.37114865824223,,,,,,,,,,,,,64.76169246059916,2283.300440841865,
maxs,,,1.0,,,72.0,,,,,,,,,,,,,118.75,8684.8,
sigma,,,0.3686116056100131,,,24.559481023094456,,,,,,,,,,,,,30.090047097678482,2266.771361883145,
zeros,0,,5901,,,11,,,,,,,,,,,,,0,0,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,0
0,7590-VHVEG,Female,0.0,Yes,No,1.0,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0.0,No,No,34.0,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0.0,No,No,2.0,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


To make the data usable for the model, **we need to split the dataset into training, validation, and testing datasets**. The training set is used to *fit the model to the patterns in the data*, the validation set is used to compare candidate models, and the test set estimates performance on unseen data, which helps us *detect overfitting*. We will discuss this further in a future Regression ML workshop.

Here, we are setting aside 70% of all rows for training, 15% for testing, and 15% for validation.

In [24]:
churn_train, churn_test, churn_valid = churn_df.split_frame(ratios=[.7, .15])

In [28]:
churn_train.shape, churn_test.shape, churn_valid.shape

((4913, 21), (1062, 21), (1068, 21))
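
The three sizes are close to, but not exactly, 70/15/15 of the 7043 rows. As far as the H2O documentation describes it, `split_frame` uses a probabilistic split rather than an exact one, and it accepts a `seed` argument if you want the split to be reproducible. The plain-Python sketch below illustrates the idea of a probabilistic split; the `probabilistic_split` helper is hypothetical and is not H2O's implementation.

```python
import random


def probabilistic_split(n_rows, ratios, seed=0):
    """Assign each row index to a split by drawing one uniform number per row."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.7, 0.85] for ratios [0.7, 0.15].
    # Whatever probability mass remains (here 15%) goes to the last split.
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)

    splits = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        for k, bound in enumerate(bounds):
            if u < bound:
                splits[k].append(i)
                break
        else:
            splits[-1].append(i)
    return splits


train, test, valid = probabilistic_split(7043, [0.7, 0.15], seed=10)
print(len(train), len(test), len(valid))
```

Because every row is assigned independently, the split sizes fluctuate around the requested ratios instead of matching them exactly.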

Now, we need to specify the **predictors and response variables**. The former refers to the independent variables that we believe impact the latter, which is the dependent variable.

In this situation, we are interested in predicting whether the customer will churn, so the response is "Churn" and the predictors are all other columns. Notice that we remove the customer ID, which should not have any impact on response.

In [33]:
y = "Churn"
x = churn_df.columns
x.remove(y)
x.remove("customerID")
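
The same selection can also be written as a list comprehension, which keeps all excluded columns in one place. This is a plain-Python sketch with the column names copied from the frame above; the `excluded` set is just an illustrative name.

```python
# Column names as shown in the head() output above.
columns = [
    "customerID", "gender", "SeniorCitizen", "Partner", "Dependents", "tenure",
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "Churn",
]

y = "Churn"
excluded = {y, "customerID"}  # response plus the identifier column

# Keep every column that is neither the response nor an identifier.
x = [c for c in columns if c not in excluded]
print(len(x), "predictors")
```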

Finally, we will fire up the AutoML pipeline!

*   `max_models`: Train at most 10 models; more models means a longer runtime (individual model training can also be time-boxed).
*   `seed`: Enables reproducibility.
*   `exclude_algos`: Excludes StackedEnsemble and DeepLearning, which are accurate but very complex. In the real world, we want to tend towards simplicity if possible (Occam's razor).
*   `verbosity`: Controls how much of the backend log is printed during training; `"info"` produces the log lines shown in the training output.
*   `nfolds`: Number of folds for cross-validation. Setting it to 0 disables cross-validation, so models are scored on the validation frame instead.

In [34]:
aml = H2OAutoML(max_models = 10, seed = 10, exclude_algos = ["StackedEnsemble", "DeepLearning"], verbosity="info", nfolds=0)

With the model parameters set, we can now train on the training dataset. This command trains up to `max_models` models, ranging from XGBoost to generalized linear models (GLM, which for a binary response like this one is effectively logistic regression). For each model, AutoML performs the following steps:

*   Train the model
*   Evaluate based on the validation set
*   Determine whether this model is the new leader

The results for each model are logged, and the default evaluation metric is *AUC*, or area under ROC (receiver operating characteristic) curve.
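
AUC can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it directly from that definition in plain Python. The `roc_auc` helper and the toy labels and scores are hypothetical, and H2O computes AUC differently internally (from binned thresholds), so treat this only as an illustration of what the number means.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = churned, 0 = stayed (illustrative values only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. In the toy data, 8 of the 9 positive/negative pairs are ordered correctly.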


In [35]:
aml.train(x = x, y = y, training_frame = churn_train, validation_frame = churn_valid)

AutoML progress: |
22:00:33.221: Project: AutoML_1_20230919_220033
22:00:33.223: Cross-validation disabled by user: no fold column nor nfolds > 1.
22:00:33.223: Setting stopping tolerance adaptively based on the training frame: 0.01426680147272547
22:00:33.223: Build control seed: 10
22:00:33.224: training frame: Frame key: AutoML_1_20230919_220033_training_py_5_sid_9a17    cols: 21    rows: 4913  chunks: 8    size: 265730  checksum: 5476589401582456489
22:00:33.224: validation frame: Frame key: py_7_sid_9a17    cols: 21    rows: 1068  chunks: 8    size: 139323  checksum: 6496272224660117118
22:00:33.225: leaderboard frame: Frame key: py_7_sid_9a17    cols: 21    rows: 1068  chunks: 8    size: 139323  checksum: 6496272224660117118
22:00:33.225: blending frame: NULL
22:00:33.225: response column: Churn
22:00:33.225: fold column: null
22:00:33.226: weights column: null
22:00:33.260: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w),

Unnamed: 0,family,link,regularization,lambda_search,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,binomial,logit,Ridge ( lambda = 6.568E-5 ),"nlambda = 30, lambda.max = 15.731, lambda.min = 6.568E-5, lambda.1se = -1.0",45,45,48,AutoML_1_20230919_220033_training_py_5_sid_9a17

Unnamed: 0,No,Yes,Error,Rate
No,2755.0,860.0,0.2379,(860.0/3615.0)
Yes,275.0,1023.0,0.2119,(275.0/1298.0)
Total,3030.0,1883.0,0.231,(1135.0/4913.0)

metric,threshold,value,idx
max f1,0.2952096,0.643194,228.0
max f2,0.1530539,0.7529129,297.0
max f0point5,0.5344039,0.6534866,120.0
max accuracy,0.5283926,0.813963,123.0
max precision,0.8451457,1.0,0.0
max recall,0.0029151,1.0,397.0
max specificity,0.8451457,1.0,0.0
max absolute_mcc,0.2952096,0.4989943,228.0
max min_per_class_accuracy,0.310676,0.770416,221.0
max mean_per_class_accuracy,0.2829433,0.7755818,234.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0101771,0.7858155,3.4065485,3.4065485,0.9,0.8081687,0.9,0.8081687,0.0346687,0.0346687,240.6548536,240.6548536,0.0332856
2,0.0201506,0.7626658,3.3215779,3.3644924,0.877551,0.7736957,0.8888889,0.7911063,0.0331279,0.0677966,232.1577938,236.4492381,0.0647537
3,0.0301242,0.7468189,3.0898399,3.2735602,0.8163265,0.7543409,0.8648649,0.778934,0.0308166,0.0986133,208.9839942,227.3560155,0.0930807
4,0.0400977,0.7368349,2.9353479,3.1894363,0.7755102,0.7421574,0.8426396,0.7697865,0.0292758,0.1278891,193.5347945,218.9436306,0.1193137
5,0.0500712,0.7205741,3.0125939,3.1542116,0.7959184,0.7302955,0.8333333,0.7619204,0.0300462,0.1579353,201.2593944,215.4211608,0.1465937
6,0.1001425,0.6525048,2.6618469,2.9080292,0.703252,0.6886749,0.7682927,0.7252977,0.133282,0.2912173,166.1846869,190.8029238,0.259682
7,0.1500102,0.5959164,2.3482784,2.7219519,0.6204082,0.6217221,0.7191316,0.6908661,0.1171032,0.4083205,134.8278356,172.1951944,0.3510591
8,0.2000814,0.528274,2.3233461,2.6221991,0.6138211,0.56339,0.6927772,0.6589647,0.1163328,0.5246533,132.3346111,162.2199111,0.4411125
9,0.3000204,0.4011847,1.5494824,2.2648694,0.4093686,0.4640956,0.5983718,0.5940524,0.1548536,0.6795069,54.9482362,126.4869447,0.5157448
10,0.3999593,0.2773202,1.2411277,2.0090642,0.3279022,0.3341441,0.5307888,0.5291084,0.124037,0.8035439,24.1127663,100.9064248,0.5484955

Unnamed: 0,No,Yes,Error,Rate
No,586.0,211.0,0.2647,(211.0/797.0)
Yes,57.0,214.0,0.2103,(57.0/271.0)
Total,643.0,425.0,0.2509,(268.0/1068.0)

metric,threshold,value,idx
max f1,0.2528302,0.6149425,232.0
max f2,0.1655695,0.7457213,280.0
max f0point5,0.5709037,0.6202394,91.0
max accuracy,0.5709037,0.8080524,91.0
max precision,0.8171393,1.0,0.0
max recall,0.0076914,1.0,390.0
max specificity,0.8171393,1.0,0.0
max absolute_mcc,0.3800232,0.4683188,169.0
max min_per_class_accuracy,0.2764334,0.7515684,220.0
max mean_per_class_accuracy,0.2528302,0.7624626,232.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0102996,0.7763348,3.2244213,3.2244213,0.8181818,0.7927305,0.8181818,0.7927305,0.0332103,0.0332103,222.4421335,222.4421335,0.0307009
2,0.0205993,0.7509662,3.5826904,3.4035559,0.9090909,0.7653583,0.8636364,0.7790444,0.0369004,0.0701107,258.2690372,240.3555854,0.0663466
3,0.0308989,0.7328657,2.8661523,3.2244213,0.7272727,0.740314,0.8181818,0.7661342,0.0295203,0.099631,186.6152298,222.4421335,0.0921028
4,0.0402622,0.7121351,3.5468635,3.2994079,0.9,0.720238,0.8372093,0.7554607,0.0332103,0.1328413,254.6863469,229.9407878,0.1240584
5,0.0505618,0.7053958,2.8661523,3.2111521,0.7272727,0.7083157,0.8148148,0.7458571,0.0295203,0.1623616,186.6152298,221.1152112,0.1498146
6,0.1001873,0.6477842,2.6768781,2.9465117,0.6792453,0.6763794,0.7476636,0.7114429,0.1328413,0.295203,167.687809,194.6511708,0.2613259
7,0.1507491,0.5710161,2.4083641,2.766015,0.6111111,0.6075417,0.7018634,0.6765941,0.1217712,0.4169742,140.8364084,176.6014989,0.3567483
8,0.2003745,0.5010437,1.5615122,2.4677036,0.3962264,0.5402991,0.6261682,0.6428388,0.0774908,0.4944649,56.1512219,146.7703556,0.3940885
9,0.3005618,0.3601798,1.58375,2.1730524,0.4018692,0.4256904,0.5514019,0.570456,0.1586716,0.6531365,58.3750043,117.3052385,0.472459
10,0.3998127,0.2494519,1.3756179,1.9750944,0.3490566,0.3048175,0.501171,0.5045129,0.1365314,0.7896679,37.5617907,97.5094411,0.5224157

Unnamed: 0,timestamp,duration,iteration,lambda,predictors,deviance_train,deviance_test,alpha,iterations,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_r2,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2023-09-19 22:00:37,0.000 sec,2,.16E2,46,1.1412092,1.1207416,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.075 sec,4,.98E1,46,1.1334264,1.1134991,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,2.242 sec,5,,,,,,5,0.3639717,0.4098980,0.3185317,0.8514692,0.6687510,3.4065485,0.2310197,0.3682910,0.4131056,0.2836966,0.8403260,0.6373805,3.2244213,0.2509363
,2023-09-19 22:00:37,0.141 sec,6,.61E1,46,1.1216392,1.1025302,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.214 sec,8,.38E1,46,1.1043773,1.0864680,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.276 sec,10,.23E1,46,1.0803371,1.0641021,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.322 sec,12,.15E1,46,1.0491155,1.0350644,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.352 sec,14,.9E0,46,1.0120553,1.0006225,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.380 sec,16,.56E0,46,0.9724702,0.9639033,0.0,,,,,,,,,,,,,,,
,2023-09-19 22:00:37,0.411 sec,18,.35E0,46,0.9346106,0.9289116,0.0,,,,,,,,,,,,,,,

variable,relative_importance,scaled_importance,percentage
tenure,1.3130877,1.0,0.1442158
Contract.Two year,0.8388457,0.6388345,0.0921301
Contract.Month-to-month,0.7258201,0.5527583,0.0797165
InternetService.Fiber optic,0.5549385,0.4226210,0.0609487
TotalCharges,0.5296302,0.4033472,0.0581691
InternetService.DSL,0.4474827,0.3407866,0.0491468
MonthlyCharges,0.2959678,0.2253984,0.0325060
PaymentMethod.Electronic check,0.2750798,0.2094908,0.0302119
OnlineSecurity.No,0.2392349,0.1821926,0.0262751
PaperlessBilling.No,0.2226953,0.1695966,0.0244585
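
The `lift` column in the gains tables above compares the churn rate inside a group of top-scored customers with the overall churn rate. A minimal sketch, using only numbers from the training output (1298 churners among 4913 training rows, and a response rate of 0.9 in the top group); the `lift` helper is a hypothetical name.

```python
def lift(group_response_rate, positives, total):
    """Lift = response rate inside the group / overall response rate."""
    base_rate = positives / total
    return group_response_rate / base_rate


# Training set: 1298 churners out of 4913 rows; top group response rate is 0.9.
top_lift = lift(0.9, 1298, 4913)
gain_pct = (top_lift - 1) * 100  # the "gain" column is lift expressed as a percentage uplift
print(round(top_lift, 4), round(gain_pct, 2))
```

A lift of about 3.4 means that the highest-scored 1% of customers churn about 3.4 times as often as the average customer, which is what makes the ranking useful for targeting retention offers.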


After training all the models, we can analyze the leaderboard to see specific evaluation metrics for each attempted model.

In [36]:
lb = aml.leaderboard

In [37]:
lb.head()

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
GLM_1_AutoML_1_20230919_220033,0.840326,0.413106,0.637381,0.237537,0.368291,0.135638
GBM_1_AutoML_1_20230919_220033,0.837266,0.416453,0.634865,0.253622,0.368089,0.135489
GBM_2_AutoML_1_20230919_220033,0.832636,0.423216,0.633965,0.260708,0.371299,0.137863
XGBoost_3_AutoML_1_20230919_220033,0.832624,0.427888,0.61429,0.242813,0.37154,0.138042
GBM_3_AutoML_1_20230919_220033,0.831226,0.429212,0.61776,0.266278,0.372738,0.138934
XRT_1_AutoML_1_20230919_220033,0.826679,0.429051,0.610404,0.248865,0.373397,0.139425
XGBoost_2_AutoML_1_20230919_220033,0.82592,0.441222,0.611954,0.258643,0.37663,0.14185
GBM_4_AutoML_1_20230919_220033,0.822494,0.44198,0.605512,0.249235,0.379049,0.143678
XGBoost_1_AutoML_1_20230919_220033,0.821941,0.443606,0.591332,0.254106,0.381415,0.145478
DRF_1_AutoML_1_20230919_220033,0.818031,0.467772,0.600443,0.258016,0.378643,0.14337
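
Which model counts as the "leader" depends on the metric used for ranking. The sketch below re-ranks the leaderboard rows above in plain Python (model names shortened, values copied from the table) to show that the best model by AUC is not the best by RMSE.

```python
# (model, auc, logloss, rmse) copied from the leaderboard above.
leaderboard = [
    ("GLM_1", 0.840326, 0.413106, 0.368291),
    ("GBM_1", 0.837266, 0.416453, 0.368089),
    ("GBM_2", 0.832636, 0.423216, 0.371299),
    ("XGBoost_3", 0.832624, 0.427888, 0.37154),
    ("GBM_3", 0.831226, 0.429212, 0.372738),
    ("XRT_1", 0.826679, 0.429051, 0.373397),
    ("XGBoost_2", 0.82592, 0.441222, 0.37663),
    ("GBM_4", 0.822494, 0.44198, 0.379049),
    ("XGBoost_1", 0.821941, 0.443606, 0.381415),
    ("DRF_1", 0.818031, 0.467772, 0.378643),
]

# Higher AUC is better; lower logloss and RMSE are better.
best_by_auc = max(leaderboard, key=lambda row: row[1])[0]
best_by_logloss = min(leaderboard, key=lambda row: row[2])[0]
best_by_rmse = min(leaderboard, key=lambda row: row[3])[0]
print(best_by_auc, best_by_logloss, best_by_rmse)
```

The differences between the top models are small, so the choice of ranking metric should follow the business question rather than the default.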


The leader (or any other) model can be used to predict the response ("Churn") for all rows in the test dataset.

In [38]:
churn_pred = aml.leader.predict(churn_test)

glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


In [39]:
churn_pred.head()

predict,No,Yes
No,0.960996,0.0390044
Yes,0.499263,0.500737
Yes,0.576849,0.423151
No,0.840301,0.159699
No,0.865159,0.134841
Yes,0.290884,0.709116
No,0.883515,0.116485
No,0.990311,0.00968938
No,0.98825,0.0117501
No,0.994832,0.00516828
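
Notice that the third row is labeled `Yes` even though its `Yes` probability is only 0.423. For binary classifiers, H2O does not cut at 0.5 by default; it labels a row positive when its probability exceeds a threshold chosen to maximize F1. The sketch below assumes the validation max-F1 threshold (0.2528302, from the training output) is the one in use. Which metrics H2O takes the threshold from depends on the frames supplied, so treat the exact value as an assumption.

```python
# P(Yes) for the ten rows shown above.
p_yes = [0.0390044, 0.500737, 0.423151, 0.159699, 0.134841,
         0.709116, 0.116485, 0.00968938, 0.0117501, 0.00516828]

# Assumed decision threshold: the validation-set max-F1 threshold.
threshold = 0.2528302

labels = ["Yes" if p >= threshold else "No" for p in p_yes]
print(labels)
```

With this threshold the labels match the `predict` column above. A cut at 0.5 would have labeled the third row `No`.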


For a given model, we can also calculate an extended collection of evaluation statistics to demonstrate the **model performance.**

In [40]:
aml.leader.model_performance(churn_test)

Unnamed: 0,No,Yes,Error,Rate
No,592.0,170.0,0.2231,(170.0/762.0)
Yes,78.0,222.0,0.26,(78.0/300.0)
Total,670.0,392.0,0.2335,(248.0/1062.0)

metric,threshold,value,idx
max f1,0.300488,0.6416185,210.0
max f2,0.1543833,0.7731092,286.0
max f0point5,0.4929525,0.6472492,127.0
max accuracy,0.4929525,0.7984934,127.0
max precision,0.7803883,0.8666667,12.0
max recall,0.0027197,1.0,397.0
max specificity,0.8369947,0.9986877,0.0
max absolute_mcc,0.4153422,0.4889764,162.0
max min_per_class_accuracy,0.2721037,0.7506562,223.0
max mean_per_class_accuracy,0.15968,0.7591601,283.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0103578,0.7889168,2.8963636,2.8963636,0.8181818,0.8074953,0.8181818,0.8074953,0.03,0.03,189.6363636,189.6363636,0.0273753
2,0.0207156,0.7684419,2.5745455,2.7354545,0.7272727,0.7773066,0.7727273,0.7924009,0.0266667,0.0566667,157.4545455,173.5454545,0.050105
3,0.0301318,0.7493038,3.54,2.986875,1.0,0.7597058,0.84375,0.7821837,0.0333333,0.09,254.0,198.6875,0.0834383
4,0.0404896,0.7394642,2.8963636,2.9637209,0.8181818,0.7448138,0.8372093,0.7726239,0.03,0.12,189.6363636,196.372093,0.1108136
5,0.0508475,0.7253453,2.8963636,2.95,0.8181818,0.7333833,0.8333333,0.7646305,0.03,0.15,189.6363636,195.0,0.138189
6,0.1007533,0.6513398,2.4045283,2.6798131,0.6792453,0.6869155,0.7570093,0.7261361,0.12,0.27,140.4528302,167.9813084,0.2358793
7,0.1506591,0.5821984,2.1373585,2.500125,0.6037736,0.6172595,0.70625,0.6900707,0.1066667,0.3766667,113.7358491,150.0125,0.3149869
8,0.200565,0.5223165,1.8701887,2.3433803,0.5283019,0.5491126,0.6619718,0.6549966,0.0933333,0.47,87.0188679,134.3380282,0.3755118
9,0.3003766,0.3985868,1.8367925,2.175047,0.5188679,0.45115,0.6144201,0.5872608,0.1833333,0.6533333,83.6792453,117.5047022,0.491916
10,0.4001883,0.2620371,1.1354717,1.9157647,0.3207547,0.3234992,0.5411765,0.5214755,0.1133333,0.7666667,13.5471698,91.5764706,0.5107612
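
The headline metrics can be recomputed by hand from the test-set confusion matrix above, which is a useful sanity check. The sketch below uses only the four counts from that table and treats `Yes` (churn) as the positive class.

```python
# Test-set confusion matrix from the output above (rows = actual, columns = predicted).
tn, fp = 592, 170   # actual "No":  predicted No / predicted Yes
fn, tp = 78, 222    # actual "Yes": predicted No / predicted Yes

precision = tp / (tp + fp)          # of predicted churners, how many really churned
recall = tp / (tp + fn)             # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The F1 value agrees with the `max f1` row of the metrics table, which indicates the confusion matrix is reported at the F1-optimal threshold. That also explains why the accuracy here is lower than the `max accuracy` value: the threshold was chosen for F1, not for accuracy.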
