![](https://www.brainome.ai/wp-content/uploads/2020/08/brainome_logo.png)
# 201 Capacity Progression (CP)


1. What is Capacity Progression
2. Measuring CP
3. Measuring while adjusting data splits
4. Finding your training data's content sweet spot


## Prerequisites
This notebook assumes brainome is installed as described in the notebook [brainome_101_Quick_Start](brainome_101_Quick_Start.ipynb).

The data sets are:
* [data/titanic_train.csv](data/titanic_train.csv) for training data
* [data/titanic_validate.csv](data/titanic_validate.csv) for validation
* [data/titanic_predict.csv](data/titanic_predict.csv) for predictions

## 1. What is Capacity Progression
*Capacity progression measures the learnability of a dataset, by plotting the number of decisions needed to memorize the function presented by the training data relative to the number of instances presented to the predictor (for an ideal model).*

From the [Brainome Glossary](https://www.brainome.ai/documentation/glossary/#Capacity%20Progression)

## 2. Measuring Capacity Progression
Brainome outputs the CP measurements of an ideal machine learner.

In [1]:
!brainome data/titanic_train.csv -y -measureonly -json report_201_measureonly.json | grep -A 1 Capacity -

Capacity Progression:             at [ 5%, 10%, 20%, 40%, 80%, 100% ]
    Ideal Machine Learner:              6,   7,   8,   8,   9,   9


## 3. Measuring while adjusting data splits
The **-split** parameter tells brainome what percentage of the data to use for training; the rest is used for validation.

By measuring the predictor at various split points, you can see how much data is really necessary to train your model.

In [2]:
splits = range(10, 100, 10)
for s in splits:
    !brainome data/titanic_train.csv -y -f DT -split {s} -json report_201_split_{s}.json -modelonly -q
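
The `!` prefix in the cell above only works inside IPython or Jupyter. As a rough sketch, the same runs could be launched from a plain Python script with `subprocess`, assuming the `brainome` executable is on the `PATH`. The helper names `brainome_split_command` and `run_all_splits` are hypothetical; the flags are exactly those used in the cell above.

```python
import subprocess


def brainome_split_command(split_pct, data="data/titanic_train.csv"):
    """Build the command line the notebook cell runs for one split percentage."""
    return [
        "brainome", data, "-y", "-f", "DT",
        "-split", str(split_pct),
        "-json", f"report_201_split_{split_pct}.json",
        "-modelonly", "-q",
    ]


def run_all_splits(splits=range(10, 100, 10)):
    """Run brainome once per split; raises CalledProcessError if a run fails."""
    for s in splits:
        subprocess.run(brainome_split_command(s), check=True)


# Uncomment to launch the runs (requires brainome to be installed):
# run_all_splits()
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops the loop at the first failed run instead of silently leaving a report file missing.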

## Processing the measurements
Each of the previous runs created a JSON report, from which we extract the MEC, training accuracy, and validation accuracy.

In [3]:
from glob import glob
from pathlib import Path
import json
# extract CP from measureonly run
with open('report_201_measureonly.json', 'r') as measures_file:
    measures = json.load(measures_file)
capacity = measures['session']['datameter']['capacity_progression']['value']

# extract MEC and accuracies from split runs
samples = []
for split_file in [Path(p) for p in glob('report_201_split_*.json')]:
    # split pct is the number after the last underscore in the filename
    split_pct = int(split_file.stem.rsplit('_', 1)[1])
    with open(split_file, 'r') as r_file:
        # load split run results
        split_report = json.load(r_file)
        # extract specific measurements from results
        samples.append((split_pct, 
                        split_report['session']['system_meter'].get('model_capacity'),
                        split_report['session']['system_meter'].get('train_accuracy'),
                        split_report['session']['system_meter'].get('validation_accuracy')
                       ))
# sort samples by integer split percentage
samples.sort(key = lambda x: x[0])


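## 4. Finding your training data's content sweet spot
Print the capacity progression next to the MEC, training accuracy, and validation accuracy measured at each split, then compare the values across splits.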

In [4]:
print("Capacity Progression:")
print("[ 5%, 10%, 20%, 40%, 80%, 100% ]")
print(capacity)
print('\n')
print("Split %\t\tMEC\t\tTrain Acc\tValidate Acc")
print("-------\t\t---\t\t---------\t------------")
for sample in samples:
    print(f"{sample[0]}\t\t{sample[1]}\t\t{sample[2]}\t\t{sample[3]}")

Capacity Progression:
[ 5%, 10%, 20%, 40%, 80%, 100% ]
[6, 7, 8, 8, 9, 9]


Split %		MEC		Train Acc	Validate Acc
-------		---		---------	------------
10		36		100.0		54.5
20		78		100.0		55.07
30		114		100.0		54.9
40		160		100.0		54.88
50		198		100.0		54.25
60		236		100.0		54.82
70		266		100.0		53.52
80		304		100.0		54.65
90		342		100.0		56.79
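
The notebook does not say which quantity defines the sweet spot, so the following is only a sketch of one way to compare the splits. For each split it computes the gap between training and validation accuracy and the increase in MEC over the previous split. The `samples` list is copied by hand from the table above; `rows`, `gap`, and `mec_growth` are illustrative names.

```python
# (split %, MEC, training accuracy, validation accuracy), copied from the table above.
samples = [
    (10, 36, 100.0, 54.5),
    (20, 78, 100.0, 55.07),
    (30, 114, 100.0, 54.9),
    (40, 160, 100.0, 54.88),
    (50, 198, 100.0, 54.25),
    (60, 236, 100.0, 54.82),
    (70, 266, 100.0, 53.52),
    (80, 304, 100.0, 54.65),
    (90, 342, 100.0, 56.79),
]

rows = []
prev_mec = None
for split_pct, mec, train_acc, val_acc in samples:
    # Difference between training and validation accuracy, in percentage points.
    gap = round(train_acc - val_acc, 2)
    # Increase in MEC relative to the previous (smaller) split; None for the first row.
    mec_growth = None if prev_mec is None else mec - prev_mec
    rows.append((split_pct, mec, gap, mec_growth))
    prev_mec = mec

print("Split %\tMEC\tAcc gap\tMEC growth")
for split_pct, mec, gap, mec_growth in rows:
    print(f"{split_pct}\t{mec}\t{gap}\t{mec_growth}")
```

In this table the training accuracy is 100% at every split, validation accuracy stays between 53.52% and 56.79%, and the MEC increases by 30 to 46 with each additional 10% of training data.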


## Next Steps
- Check out [brainome_202_MEC](./brainome_202_MEC.ipynb)