Commit 2489c41: revise fold documentation
layne-sadler committed Aug 27, 2022 · 1 parent e10c6dc
Showing 13 changed files with 42 additions and 41 deletions.
aiqc/orm.py (4 changes: 2 additions & 2 deletions)
@@ -3493,7 +3493,7 @@ def run_jobs(id:int):
 try:
     for rj in tqdm(
         repeated_jobs
-        , desc = "🔮 Training Models 🔮"
+        , desc = f"🔮 Queue #{id} 🔮"
         , ncols = 85
     ):
         # See if this job has already completed. Keeps the tqdm intact.
@@ -3521,7 +3521,7 @@ def run_jobs(id:int):
 try:
     for rj in tqdm(
         repeated_jobs
-        , desc = f"🔮 Training Models - Fold #{idx+1} 🔮"
+        , desc = f"🔮 Queue {id} // Fold #{idx+1} 🔮"
         , ncols = 85
     ):
         # See if this job has already completed. Keeps the tqdm intact.
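For readers skimming the hunks above: the loop being relabeled is a plain `tqdm` progress bar. A minimal runnable sketch of the pattern, using a hypothetical queue id and a stand-in job list rather than AIQC's real `Queue`/`Job` objects:

```python
from tqdm import tqdm

id = 7                     # hypothetical Queue id (stand-in for the ORM's id)
repeated_jobs = range(10)  # stand-in for the queue's repeated Jobs

# Interpolating the Queue id into `desc` makes the bar for each
# queue distinguishable when several queues run back to back.
for rj in tqdm(repeated_jobs, desc=f"🔮 Queue #{id} 🔮", ncols=85):
    pass  # each iteration would run one training Job
```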
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
docs/_build/doctrees/nbsphinx/notebooks/api_low_level.ipynb (13 changes: 6 additions & 7 deletions)
@@ -2333,7 +2333,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![cross fold objects](../_static/images/api/cross_fold_objects.png)"
+"![CrossFoldBins](../_static/images/api/cross_fold_bins.png)"
 ]
 },
 {
@@ -2350,28 +2350,27 @@
 "metadata": {},
 "source": [
 "Let's say we defined `fold_count=5`. What are the implications?\n",
+"\n",
 "- Creates 5 `Folds` related to a `Splitset`.\n",
+"- 5x more preprocessing and caching; each `fold_validation` is excluded from the `fit` on `folds_train_combined`. Fits are saved to the `orm.Fold` object as opposed to the `orm.Feature/Label` objects.\n",
 "- 5x more models will be trained for each experiment.\n",
-"- 5x more preprocessing and caching; the backend must preprocess each Fold separately to prevent data leakage by excluding `fold_validation` from the `fit`. Fits are saved to the Fold object as opposed to the Feature/Label objects.\n",
 "- 5x more evaluation."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"*Disclaimer*"
+"*Disclaimer about inherent limitations & challenges*"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"> DO NOT use cross-validation unless your *(total sample count / fold_count)* still gives you an accurate representation of your entire sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that *(total sample count / fold_count)* is evenly divisible. Folds naturally have fewer samples, so a handful of incorrect predictions have the potential to offset your aggregate metrics. Both of these tips help avoid poorly stratified/ undersized folds that seem to perform either too well (only most common label class present) or poorly (handful of samples and a few inaccurate prediction on an otherwise good model).\n",
-"> \n",
-"> Candidly, if you've ever performed cross-validation manually, let alone systematically, you'll know that, barring stratification of continuous labels, it's easy enough to construct the folds, but then it's a pain to generate performance metrics (e.g. `zero_division`, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class-citizens alongside splits. That being said, if you try to do something undersized like \"150 samples in their dataset and a `fold_count` > 3 with `unique_classes` > 4,\" then you may run into edge cases.\n",
+"> Do not use cross-validation unless the distribution of each resulting fold (*total sample count divided by fold_count*) is representative of your broader sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that the *total sample count is evenly divisible by fold_count*. Both of these tips help avoid poorly stratified/undersized folds that seem to perform either unjustifiably well (100% accuracy when only the most common label class is present) or poorly (1 incorrect prediction in a small fold negatively skews an otherwise good model).\n",
 ">\n",
-"> Cross validation is only included in AIQC to allow best practices to be upheld and to show off the power of systematic preprocessing."
+"> If you've ever performed cross-validation manually with too few samples, then you'll know that it's easy enough to construct the folds, but then it's a pain to calculate performance metrics (e.g. `zero_division`, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class citizens alongside splits. That being said, if you try to do something undersized like multi-label classification using 150 samples then you may run into errors during evaluation."
 ]
 },
 {
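The revised bullet above describes a leakage rule: preprocessing is fit on `folds_train_combined` only and then applied to `fold_validation`. This is not AIQC's internal code, but the rule can be sketched with plain scikit-learn on stand-in data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Stand-in data: 100 samples, 4 features, binary labels.
rng = np.random.default_rng(seed=0)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    scaler = StandardScaler()
    # Fit preprocessing on the training bins only (folds_train_combined)...
    X_train = scaler.fit_transform(X[train_idx])
    # ...then reuse that frozen fit on the held-out bin (fold_validation),
    # so no validation statistics leak into the transform.
    X_val = scaler.transform(X[val_idx])
```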
Binary file modified docs/_build/doctrees/notebooks/api_low_level.doctree
Binary file not shown.
Binary file added docs/_build/html/_images/cross_fold_bins.png
docs/_build/html/_sources/notebooks/api_low_level.ipynb.txt (13 changes: 6 additions & 7 deletions)
@@ -2333,7 +2333,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![cross fold objects](../_static/images/api/cross_fold_objects.png)"
+"![CrossFoldBins](../_static/images/api/cross_fold_bins.png)"
 ]
 },
 {
@@ -2350,28 +2350,27 @@
 "metadata": {},
 "source": [
 "Let's say we defined `fold_count=5`. What are the implications?\n",
+"\n",
 "- Creates 5 `Folds` related to a `Splitset`.\n",
+"- 5x more preprocessing and caching; each `fold_validation` is excluded from the `fit` on `folds_train_combined`. Fits are saved to the `orm.Fold` object as opposed to the `orm.Feature/Label` objects.\n",
 "- 5x more models will be trained for each experiment.\n",
-"- 5x more preprocessing and caching; the backend must preprocess each Fold separately to prevent data leakage by excluding `fold_validation` from the `fit`. Fits are saved to the Fold object as opposed to the Feature/Label objects.\n",
 "- 5x more evaluation."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"*Disclaimer*"
+"*Disclaimer about inherent limitations & challenges*"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"> DO NOT use cross-validation unless your *(total sample count / fold_count)* still gives you an accurate representation of your entire sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that *(total sample count / fold_count)* is evenly divisible. Folds naturally have fewer samples, so a handful of incorrect predictions have the potential to offset your aggregate metrics. Both of these tips help avoid poorly stratified/ undersized folds that seem to perform either too well (only most common label class present) or poorly (handful of samples and a few inaccurate prediction on an otherwise good model).\n",
-"> \n",
-"> Candidly, if you've ever performed cross-validation manually, let alone systematically, you'll know that, barring stratification of continuous labels, it's easy enough to construct the folds, but then it's a pain to generate performance metrics (e.g. `zero_division`, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class-citizens alongside splits. That being said, if you try to do something undersized like \"150 samples in their dataset and a `fold_count` > 3 with `unique_classes` > 4,\" then you may run into edge cases.\n",
+"> Do not use cross-validation unless the distribution of each resulting fold (*total sample count divided by fold_count*) is representative of your broader sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that the *total sample count is evenly divisible by fold_count*. Both of these tips help avoid poorly stratified/undersized folds that seem to perform either unjustifiably well (100% accuracy when only the most common label class is present) or poorly (1 incorrect prediction in a small fold negatively skews an otherwise good model).\n",
 ">\n",
-"> Cross validation is only included in AIQC to allow best practices to be upheld and to show off the power of systematic preprocessing."
+"> If you've ever performed cross-validation manually with too few samples, then you'll know that it's easy enough to construct the folds, but then it's a pain to calculate performance metrics (e.g. `zero_division`, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class citizens alongside splits. That being said, if you try to do something undersized like multi-label classification using 150 samples then you may run into errors during evaluation."
 ]
 },
 {
docs/_build/html/notebooks/api_low_level.html (25 changes: 15 additions & 10 deletions)
@@ -7,7 +7,7 @@
 <meta property="og:type" content="website" />
 <meta property="og:url" content="https://docs.aiqc.io/notebooks/api_low_level.html" />
 <meta property="og:site_name" content="AIQC" />
-<meta property="og:description" content="0f38e6c6393944f9b2e3bb52deddf178 Object-Relational Model: The Low-Level API is an object-relational model for machine learning. Each class in the ORM maps to a table in a SQLite database that serve..." />
+<meta property="og:description" content="6e67347e4bea469eb6168df222e4165c Object-Relational Model: The Low-Level API is an object-relational model for machine learning. Each class in the ORM maps to a table in a SQLite database that serve..." />
 <meta property="og:image" content="https://raw.githubusercontent.com/aiqc/aiqc/main/docs/_static/images/web/meta_image_tall_rect.png" />
 <meta property="og:image:alt" content="Artificial Intelligence Quality Control" />
 <meta property="twitter:image" content="https://raw.githubusercontent.com/aiqc/aiqc/main/docs/_static/images/web/meta_image_tall_rect.png" />
@@ -330,7 +330,7 @@
 </style>
 <section id="ORM">
 <h1>ORM<a class="headerlink" href="#ORM" title="Permalink to this headline"></a></h1>
-<p><img alt="0f38e6c6393944f9b2e3bb52deddf178" class="banner-photo" src="../_images/sql.png" /></p>
+<p><img alt="6e67347e4bea469eb6168df222e4165c" class="banner-photo" src="../_images/sql.png" /></p>
 <section id="Object-Relational-Model">
 <h2>Object-Relational Model<a class="headerlink" href="#Object-Relational-Model" title="Permalink to this headline"></a></h2>
 <p>The Low-Level API is an <em>object-relational model</em> for machine learning. Each class in the <a class="reference external" href="http://docs.peewee-orm.com/en/latest/peewee/models.html">ORM</a> maps to a table in a SQLite database that serves as a machine learning <em>metastore</em>.</p>
@@ -2235,17 +2235,22 @@ <h2>8. Splitset<a class="headerlink" href="#8.-Splitset" title="Permalink to thi
 <hr class="docutils" />
 <p><strong>Cross-Validation</strong></p>
 <p>Cross-validation is triggered by <code class="docutils literal notranslate"><span class="pre">fold_count:int</span></code> during Splitset creation. Reference the <a class="reference external" href="https://scikit-learn.org/stable/modules/cross_validation.html">scikit-learn documentation</a> to learn more about cross-validation.</p>
-<img alt="cross fold objects" src="../_images/cross_fold_objects.png" />
+<img alt="CrossFoldBins" src="../_images/cross_fold_bins.png" />
 <p>Each row in the diagram above is a <code class="docutils literal notranslate"><span class="pre">Fold</span></code> object.</p>
 <p>Each green/blue box represents a bin of stratified samples. During preprocessing and training, we rotate which blue bin serves as the validation samples (<code class="docutils literal notranslate"><span class="pre">fold_validation</span></code>). The remaining green bins in the row serve as the training samples (<code class="docutils literal notranslate"><span class="pre">folds_train_combined</span></code>).</p>
-<p>Let’s say we defined <code class="docutils literal notranslate"><span class="pre">fold_count=5</span></code>. What are the implications? - Creates 5 <code class="docutils literal notranslate"><span class="pre">Folds</span></code> related to a <code class="docutils literal notranslate"><span class="pre">Splitset</span></code>. - 5x more models will be trained for each experiment. - 5x more preprocessing and caching; the backend must preprocess each Fold separately to prevent data leakage by excluding <code class="docutils literal notranslate"><span class="pre">fold_validation</span></code> from the <code class="docutils literal notranslate"><span class="pre">fit</span></code>. Fits are saved to the Fold object as opposed to the Feature/Label objects. - 5x more evaluation.</p>
-<p><em>Disclaimer</em></p>
+<p>Let’s say we defined <code class="docutils literal notranslate"><span class="pre">fold_count=5</span></code>. What are the implications?</p>
+<ul class="simple">
+<li><p>Creates 5 <code class="docutils literal notranslate"><span class="pre">Folds</span></code> related to a <code class="docutils literal notranslate"><span class="pre">Splitset</span></code>.</p></li>
+<li><p>5x more preprocessing and caching; each <code class="docutils literal notranslate"><span class="pre">fold_validation</span></code> is excluded from the <code class="docutils literal notranslate"><span class="pre">fit</span></code> on <code class="docutils literal notranslate"><span class="pre">folds_train_combined</span></code>. Fits are saved to the <code class="docutils literal notranslate"><span class="pre">orm.Fold</span></code> object as opposed to the <code class="docutils literal notranslate"><span class="pre">orm.Feature/Label</span></code> objects.</p></li>
+<li><p>5x more models will be trained for each experiment.</p></li>
+<li><p>5x more evaluation.</p></li>
+</ul>
+<p><em>Disclaimer about inherent limitations &amp; challenges</em></p>
 <blockquote>
-<div><p>DO NOT use cross-validation unless your <em>(total sample count / fold_count)</em> still gives you an accurate representation of your entire sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that <em>(total sample count / fold_count)</em> is evenly divisible. Folds naturally have fewer samples, so a handful of incorrect predictions have the potential to offset your aggregate metrics. Both of these tips help avoid poorly stratified/
-undersized folds that seem to perform either too well (only most common label class present) or poorly (handful of samples and a few inaccurate prediction on an otherwise good model).</p>
-<p>Candidly, if you’ve ever performed cross-validation manually, let alone systematically, you’ll know that, barring stratification of continuous labels, it’s easy enough to construct the folds, but then it’s a pain to generate performance metrics (e.g. <code class="docutils literal notranslate"><span class="pre">zero_division</span></code>, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class-citizens alongside splits. That being said, if you try
-to do something undersized like “150 samples in their dataset and a <code class="docutils literal notranslate"><span class="pre">fold_count</span></code> &gt; 3 with <code class="docutils literal notranslate"><span class="pre">unique_classes</span></code> &gt; 4,” then you may run into edge cases.</p>
-<p>Cross validation is only included in AIQC to allow best practices to be upheld and to show off the power of systematic preprocessing.</p>
+<div><p>Do not use cross-validation unless the distribution of each resulting fold (<em>total sample count divided by fold_count</em>) is representative of your broader sample population. If you are ignoring that advice and stretching to perform cross-validation, then at least ensure that the <em>total sample count is evenly divisible by fold_count</em>. Both of these tips help avoid poorly stratified/undersized folds that seem to perform either unjustifiably well (100% accuracy when only the most common label
+class is present) or poorly (1 incorrect prediction in a small fold negatively skews an otherwise good model).</p>
+<p>If you’ve ever performed cross-validation manually with too few samples, then you’ll know that it’s easy enough to construct the folds, but then it’s a pain to calculate performance metrics (e.g. <code class="docutils literal notranslate"><span class="pre">zero_division</span></code>, absent OHE classes) due to the absence of outlying classes and bins. Time has been invested to handle these scenarios elegantly so that folds can be treated as first-class citizens alongside splits. That being said, if you try to do something undersized like multi-label
+classification using 150 samples then you may run into errors during evaluation.</p>
 </div></blockquote>
 <hr class="docutils" />
 <p><strong>Samples Cache</strong></p>
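The disclaimer's metric edge cases (`zero_division`, label classes absent from an undersized fold) correspond to parameters that scikit-learn's metric functions expose. A hedged sketch with made-up labels, not AIQC's actual evaluation code:

```python
from sklearn.metrics import f1_score

# Hypothetical undersized fold: class 2 never appears in it.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0]

# Passing the full label set keeps every class in the calculation, and
# zero_division=0 keeps the score defined for the absent class.
score = f1_score(y_true, y_pred, labels=[0, 1, 2],
                 average="macro", zero_division=0)
print(f"macro F1: {score:.3f}")
```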
