# Introduction 

## Objective

In [1]:
# @todo: introduce "head counts" and "sweeps" for narrative purposes

- Answer the question:
  * What is the relationship between observations recorded as "head counts" and those recorded as "sweeps"?
- Document the process of finding the answer so that the exact steps are replicable and clearly demonstrated here.

You can follow along in your own Jupyter notebook, or open [this notebook on mybinder.org] and replay cells while tweaking the code if you want to experiment. You'll find [this notebook project on GitHub].

[this notebook on mybinder.org]: https://mybinder.org/v2/gh/devvyn/aafc-field-data/master?filepath=notebook%2Fprojects%2F2016-sweep-vs-tiller%2F2016-sweep-vs-tiller.ipynb
[this notebook project on GitHub]: https://github.com/devvyn/aafc-field-data/tree/master/notebook/projects/2016-sweep-vs-tiller

## About the Data

!["Perfectly straight line in graph comparing supposedly independent variables in Head Count and Sweeps workbook, in Excel Online"](head-count-sweeps-graph-excel-online.png)

I'm troubleshooting a spreadsheet document that was prepared by someone else. Whoever prepared it probably handed it off to someone else before I came into contact. The information in the document looks like it's been copied from other documents which I'm not certain I have access to.

In this document, there's a graph that demonstrates the goal of the document: to compare the relationship between two observation methods. Unfortunately, the graph shows a suspicious degree of idealness: a perfect one-to-one ratio across the entire domain.

My assignment is to trace the error and correct the graph so that it displays the precise ratios calculated from appropriate samples.

### Explore Worksheets

Before getting hands on with _pandas_, I'll open the document in Excel and record what my senses tell me.

#### Unbelievable Graph

- There's a graph in a worksheet called "head counts vs sweeps graphs" which demonstrates the analytical problem encountered/developed by someone else.
- The data being compared in the graph cannot be the data that was intended for comparison, because the ratio depicted is perfectly linear even though it supposedly compares real-world samples.

#### Work Not Shown

It's clear that the Excel workbook has the results of many calculations, yet there are no spreadsheet formulas in any cells. Spreadsheet formulas would have greatly expedited the verification of the calculations; without any reference to how the calculations were performed and on what values, I'll need to replicate them from scratch.

#### Metadata and Document History

- Data may have been copied from multiple, unidentified sources.
- It's not clear which data is "original" and which is duplicated, amongst the worksheets in the workbook.
- Some data appears to have been summed and then mixed back in with the rest of the data.
- There are no notes from editors of the workbook, and the editors are unidentifiable.

### Summary of Issues

- No metadata
- No convention for categorical label values
- Mixed aggregation levels (some sums mixed with non-summed data)
- Graph plot is based on incorrect values

## Analytical Goals

### Tidy & Plot Ratios

- Data is clean and indexed well enough to align quantitative samples from the two field collection methods.
- Upon plotting the ratios of dependent variables corresponding with each collection method, the graph should not depict an absolutely perfect fit to a line. That is to say, there should be some variation.
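
To make the expectation of "some variation" concrete, here is a minimal sketch of a check for a constant ratio between paired measurements. The function name `has_constant_ratio` and all of the numbers are hypothetical; none of them come from the workbook.

```python
import math


def has_constant_ratio(xs, ys, rel_tol=1e-9):
    """Return True if every y/x pair shares the same ratio (zero x values are skipped)."""
    ratios = [y / x for x, y in zip(xs, ys) if x != 0]
    if len(ratios) < 2:
        return False  # too few points to say anything
    first = ratios[0]
    return all(math.isclose(r, first, rel_tol=rel_tol) for r in ratios)


# Hypothetical values, invented for illustration only.
sweeps = [10, 20, 40, 80]
copied = [10, 20, 40, 80]    # the same numbers plotted against themselves
observed = [3, 9, 10, 31]    # independent counts would be expected to vary

print(has_constant_ratio(sweeps, copied))    # True
print(has_constant_ratio(sweeps, observed))  # False
```

A `True` result for field data would suggest that one column was derived from the other, or that the same column was plotted on both axes.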

### Retain for Reference

- For posterity, the data should be easy to align with any similar data. This makes it possible to identify whether the same data exists in any other file.

## Document Conventions

I'll use certain formatting conventions for written terms throughout this document.

### Technical Terms

In cases where a technical term also has a common, non-technical meaning, the text will be shown in *italics*.

>*pandas* is a data processing framework for Python which uses readable yet expressive syntax to perform calculations upon many values or many sets of values, such as combining spreadsheets or even plotting visualizations.

### Worksheet and DataFrame

This document deals with data sets from Excel spreadsheet workbooks, each of which is handled as a *pandas* `DataFrame`. These will be shown in **bold**, including Python variables referring to a `DataFrame`.

>I've imported **Sheet2** into *pandas* and temporarily assigned it to the variable **s2** for convenience.

### Column and Series

Each column from the workbooks becomes a `Series` in *pandas*. These will be shown in ***bold italics***, including Python variables referring to a `Series`.

>Whenever ***number of samples*** is not a number, ***distance(m)*** is always equal to "`Combined`".

### Code Text and Verbatim Expression

Whenever a word is meant to refer to an expression used by *pandas* or Python, it will be shown in `monospace`. The given expression must always be typed exactly as shown when using it in code execution. This applies equally whether the text is used as an executable expression (a Python command) or just a quoted snippet of text used as a non-executable value (e.g. an index label).

>The `concat` command in *pandas* combines two or more `DataFrame` objects.

# Procedure

## Setup Computing Environment

### Interactive Document

Document handling and analysis will be conducted in [Jupyter] Notebook/Lab, using [Python] 3. To set up a live workspace, consult the [README] for the parent project comprising this document and its associated files.

[Jupyter]: https://jupyter.org/
[Python]: https://www.python.org/about/
[README]: https://github.com/devvyn/aafc-field-data/blob/master/README.md

### Source Data

To follow along, you'll require direct access to [`2016-sweep-vs-tiller.xlsx`][file] within your workspace. Download the [file] from GitHub if you're not cloning the entire source [repository].

If you're interacting on [mybinder.org], the file should already be present.

[repository]: https://github.com/devvyn/aafc-field-data/
[file]: https://github.com/devvyn/aafc-field-data/blob/master/notebook/projects/2016-sweep-vs-tiller/2016-sweep-vs-tiller.xlsx
[mybinder.org]: https://mybinder.org/v2/gh/devvyn/aafc-field-data/master?filepath=notebook%2Fprojects%2F2016-sweep-vs-tiller%2F2016-sweep-vs-tiller.ipynb

### Install Third Party Packages

**NOTE FOR CLOUD USERS**: Keep in mind, some notebooks in the cloud run in temporary "containers" or otherwise ephemeral environments where installed packages may vanish after a certain amount of time, so you may have to install again in subsequent sessions.


#### _pandas_

* If your live workspace doesn't include it, [install _pandas_] before continuing.
* If you're using [Anaconda], you probably already have _pandas_.

If you're not sure if it's installed, you may run this command in a code cell within your notebook, prefixed by an exclamation mark as shown. The Python package installer will not install anything you already have.

```
!pip install pandas
```

[Anaconda]: http://docs.continuum.io/anaconda/
[install _pandas_]: https://pandas.pydata.org/pandas-docs/stable/install.html

#### Markdown (Optional) 

If you're following along, I recommend you install the `markdown` package for Python. It makes formatted output from code much simpler to express. You can omit this as long as you don't use the `markdown` function in your code. I'll be using it, however.

```
!pip install markdown
```

### Import Python Packages

From the standard Python library:

* `re` helps with pattern matching in text.

Third party libraries: 

* *pandas* does the heavy lifting of working with spreadsheet-like data.
* For outputting formatted text from the code and displaying in the notebook: `markdown.markdown` and `IPython.display.HTML`.

In [2]:
import re

import pandas
from markdown import markdown
from IPython.display import HTML

## Load Worksheets into _pandas_

I'll make a dictionary of names and data frames, so I can examine the file overall.

In [3]:
data_file = pandas.ExcelFile('2016-sweep-vs-tiller.xlsx')
sheets = {
    sheet_name: data_file.parse(sheet_name)
    for sheet_name in data_file.sheet_names
}
sheet_names = sheets.keys()
sheet_frames = sheets.values()

### Compare Columns

To begin to discern which data is original, I'll list the first nine columns of each worksheet, which will offer some descriptive terms for the data in each:

In [4]:
pandas.DataFrame(
    data=[frame.columns for frame in sheet_frames],
    index=pandas.Index(data=sheet_names),
).loc[:, :8]

Unnamed: 0,0,1,2,3,4,5,6,7,8
Head Counts,Site,Crop,Date,Field,Zadoks_stage,Tiller,EGA_head,EGA_leaf,BCO_head
Sheet2,ID,Province,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name
Sweep Samples Cereals,ID,Province,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name
Head Counts Edited,Date,Site,Crop,Field,Sample Type,Unnamed: 5,Zadoks_stage,Tiller,EGA_alate
Sweep Samples Cereals Edited,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Unnamed: 6,Unnamed: 7,EGA_alate
Data Sheets Combined,ID,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Zadoks_stage,Tiller
Pivot chart LH phen,,,,,,,,,
leafhoppers 2016 cereal sweeps,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,Distance(m)
Sheet3,,,,,,,,,
aphid sweep vs head count,ID,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Zadoks_stage,Tiller


### Classify Worksheets

Because I can't trust the accuracy of the data used by the plot in Excel, I need to look at all the sheets and determine the most complete and unadulterated data sets. I'll determine which data belongs to each category, and compare the sets.

Based on sheet names, column names, and similarities between columns sets, I can probably group the sheets like so:

- Head counts:
  - **Head Counts**
  - **Head Counts Edited**
- Sweep:
  - **Sheet2**
  - **Sweep Samples Cereals**
  - **Sweep Samples Cereals Edited**
  - **leafhoppers 2016 cereal sweeps**
- United:
  - **Data Sheets Combined**
- Analytical experiments:
  - **Pivot chart LH phen**
  - **Sheet3**
  - **aphid sweep vs head count**
  - **head counts vs sweeps graphs**

## Tidy Up

### Normalize Column Name Letter Case

To avoid confusion due to variations in capitalization, I'll convert all column names to lower case:

In [5]:
for sheet in sheets.values():
    if sheet.columns.size > 0:
        sheet.columns = sheet.columns.str.lower()

### Convert Date & Time Format

Before I can easily examine dates from the worksheets, I must convert them to proper date values for Python and *pandas*. I'll check the date formats:

In [6]:
(
    pandas.concat(
        {
            name: sheet.loc[:, sheet.columns.str.contains('date')].reset_index(drop=True)
            for (name, sheet) in sheets.items() if sheet.columns.size > 0
        },
        axis='columns',
    )
    .loc[0]
    .unstack()
    .fillna('')
)

Unnamed: 0,collection_date,date,date_by_week,julian_date
Data Sheets Combined,,2016-08-12 00:00:00,,
Head Counts,,04/08/2016,,
Head Counts Edited,,2016-08-04 00:00:00,,
Sheet2,12_08_2016,Aug_12,0.0,0.0
Sweep Samples Cereals,12_08_2016,Aug_12,0.0,0.0
Sweep Samples Cereals Edited,,2016-08-12 00:00:00,,
aphid sweep vs head count,,2016-08-04 00:00:00,,
leafhoppers 2016 cereal sweeps,12_08_2016,Aug_12,0.0,0.0


According to the documentation for [`pandas.to_datetime`], I can convert a string into a true datetime value. The function uses [Python datetime format strings]. Here are the date columns, the format strings, and the worksheet names I see:

***collection_date*** (`%d_%m_%Y`):

- **leafhoppers 2016 cereal sweeps**
- **Sheet2**
- **Sweep Samples Cereals**

***date*** (`%d/%m/%Y`):

- **Head Counts**

***date*** (no conversion):

- **aphid sweep vs head count**
- **Sweep Samples Cereals Edited**
- **Head Counts Edited**
- **Data Sheets Combined**

In some cases, the ***date*** column is present even though a better column (***collection_date***) exists for the date. I'll drop the improper date column and rename the superior column so as to replace the inferior.

After conversion, I'll output a sample date from the column to prove that the column ***date*** is in fact a datetime.

[`pandas.to_datetime`]: http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.to_datetime.html
[Python datetime format strings]: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
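
The order of day and month in the format string matters, because many of these values parse without error either way. The sketch below uses only the standard library's `datetime.strptime`. The first value is the one shown for **Head Counts** above, and `25_07_2016` is a hypothetical value invented to show an unambiguous case.

```python
from datetime import datetime

raw = '04/08/2016'  # first "date" value shown for Head Counts

# Both format strings accept the text, but they disagree about its meaning.
day_first = datetime.strptime(raw, '%d/%m/%Y')
month_first = datetime.strptime(raw, '%m/%d/%Y')
print(day_first.date())    # 2016-08-04
print(month_first.date())  # 2016-04-08

# A hypothetical value with a day above 12 only fits the day-first format.
try:
    datetime.strptime('25_07_2016', '%m_%d_%Y')
    month_first_ok = True
except ValueError:
    month_first_ok = False
print(month_first_ok)  # False
```

The converted sample for **Head Counts** in this notebook reads `2016-08-04`, which corresponds to the day-first interpretation.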

In [7]:
for name, sheet in ((name, sheets[name]) for name in (
    'Head Counts',
    'Sheet2',
    'Sweep Samples Cereals',
    'leafhoppers 2016 cereal sweeps',
)):
    display(HTML(markdown(f'#### {name}')))
    if name == 'Head Counts':
        sheet['date'] = pandas.to_datetime(
            sheet['date'],
            format='%d/%m/%Y',
        )
    else:
        sheet.drop(columns='date', inplace=True)
        sheet.rename(
            columns={'collection_date': 'date'},
            inplace=True,
        )
        sheet['date'] = pandas.to_datetime(
            sheet['date'],
            format='%d_%m_%Y',
        )
    display(sheet.date.head(1))

0   2016-08-04
Name: date, dtype: datetime64[ns]

0   2016-08-12
Name: date, dtype: datetime64[ns]

0   2016-08-12
Name: date, dtype: datetime64[ns]

0   2016-08-12
Name: date, dtype: datetime64[ns]

## Select Primary Data Set: Sweep Samples

### Candidates

- Sheet2
- Sweep Samples Cereals
- Sweep Samples Cereals Edited
- leafhoppers 2016 cereal sweeps

From my hands-on exploration in Excel, I got the feeling that **Sheet2** is the most complete. I'll compare it to the other candidates, one at a time.

### Sweep Samples Cereals

I'll refer to **Sweep Samples Cereals** as **ssc** and to **Sheet2** as **s2**. I'll put them both in **compare_sheets** for quick reference.

In [8]:
sheet_names = [
    'Sweep Samples Cereals',
    'Sheet2',
]
compare_sheets = ssc, s2 = [
    sheets[sheet_name]
    for sheet_name in sheet_names
]

I'd like to confirm that all data in **Sweep Samples Cereals** is included in the vaguely named **Sheet2**.

#### Columns

In [9]:
pandas.DataFrame(
    data=[sheet.columns for sheet in compare_sheets],
    index=sheet_names
).T

Unnamed: 0,Sweep Samples Cereals,Sheet2
0,id,id
1,province,province
2,date,date
3,sample_by_week,sample_by_week
4,date_by_week,date_by_week
5,julian_date,julian_date
6,site,site
7,field_name,field_name
8,crop,crop
9,distance(m),distance(m)


That seems to confirm that **Sheet2** has the same columns as **Sweep Samples Cereals** except for four columns that were removed.

I'll compare lists of column names, looking for asymmetry:

In [10]:
ssc.columns.symmetric_difference(s2.columns).tolist()

['ega/20 sweeps',
 'ega/sweep',
 'total ega',
 'total sweeps',
 'unnamed: 129',
 'unnamed: 133']

I believe the unnamed columns are empty. The other columns have calculated values that I don't trust. **Sweep Samples Cereals** is probably useless for my purposes.
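
A symmetric difference lists the columns that appear in only one of the two sheets, but it doesn't say which sheet each one belongs to. As a small sketch with hypothetical column sets (not the real worksheets), `Index.difference` can be used in each direction to attribute them:

```python
import pandas

# Hypothetical column sets: one sheet carries an extra calculated column.
raw_columns = pandas.Index(['id', 'date', 'site', 'crop'])
summed_columns = pandas.Index(['id', 'date', 'site', 'crop', 'total sweeps'])

# Columns found in exactly one of the two sets.
print(raw_columns.symmetric_difference(summed_columns).tolist())  # ['total sweeps']

# One-directional differences show which side owns the extra column.
print(summed_columns.difference(raw_columns).tolist())  # ['total sweeps']
print(raw_columns.difference(summed_columns).tolist())  # []
```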

#### Rows

In [11]:
len(ssc), len(s2)

(92, 668)

Shown here is the number of rows of each frame (worksheet). It's pretty clear there is a lot more data in **Sheet2**. I suspect that **Sheet2** has additional data added to it. I'll have to take a closer look at the values, especially dates.

##### Analyze Dates

I need to check if *all* the dates in **Sweep Samples Cereals** are in **Sheet2**.

In [12]:
ssc.date.isin(s2.date).all()

True

I'm curious if the converse is also true:

In [13]:
s2.date.isin(ssc.date).all()

False

Since **s2** has some dates which **ssc** does not, I'd like to know how big the difference is. I'll view all date values in **s2** which fail to match with those in **ssc**, ignoring duplicates.

In [14]:
(
    s2.date.loc[~s2.date.isin(ssc.date)]
    .drop_duplicates()
)

486   2017-06-17
568   2016-06-06
Name: date, dtype: datetime64[ns]

So, this isn't a lot, but it's worth noting.

#### Aggregation

I see a ***total sweeps*** column only in **Sweep Samples Cereals**, which leads me to suspect that **Sweep Samples Cereals** comprises aggregates (sums), possibly based on values present in **Sheet2**. 

However, **Sheet2** has two additional dates, which means that it has data that the other worksheet doesn't. Now, it's a matter of determining whether **Sheet2** is comprehensive enough to preclude the use of **Sweep Samples Cereals**.

#### Value Counts on ***distance(m)***

I suspect that the actual, physical sweeps were at different distances from some physical origin, and later reduced to sums. If I peek at the ***distance(m)*** column, I should see a clear difference between these varied distances and the "distance" value used by aggregates. Ideally, there wouldn't be a distance value for aggregates at all, but this touches on the problem of mixing raw data with partially analysed data.

In [15]:
display(
    pandas.concat(
        dict(zip(sheet_names, compare_sheets)),
        axis='columns',
        sort=True,
    )
    .loc[:, pandas.IndexSlice[:, ['distance(m)']]]
    .apply(pandas.value_counts)
    .stack()
    .reorder_levels((1, 0))
    .T
    .fillna('')
)

distance(m),5,10,25,50,100,Combined,0
Sheet2,105.0,107.0,106.0,109.0,93.0,92.0,56.0
Sweep Samples Cereals,,,,,,92.0,


Indeed, **Sheet2** has observations at various "distances", while **Sweep Samples Cereals** has only the label `Combined`.

**Sheet2** has more rows because it isn't totalling up the sweeps from various distances. That makes **Sheet2** less reduced, and more "raw".
Since I don't want reduced (aggregated) data, I don't want **Sweep Samples Cereals**.
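
To illustrate how raw rows and pre-summed rows can sit side by side in one column, here is a minimal sketch on invented data. The `aphids` column and every value in it are hypothetical; only the `distance(m)` column name and the `Combined` label come from the workbook.

```python
import pandas

# Hypothetical sweep records: two raw distances plus a pre-summed "Combined" row per site.
frame = pandas.DataFrame({
    'site': ['A', 'A', 'A', 'B', 'B', 'B'],
    'distance(m)': [5, 10, 'Combined', 5, 10, 'Combined'],
    'aphids': [2, 3, 5, 0, 4, 4],
})

is_combined = frame['distance(m)'] == 'Combined'
raw = frame.loc[~is_combined]
combined = frame.loc[is_combined]

# Sum the raw rows per site and compare with the stored aggregate rows.
raw_totals = raw.groupby('site')['aphids'].sum()
stored_totals = combined.set_index('site')['aphids']

print(raw_totals.tolist())     # [5, 4]
print(stored_totals.tolist())  # [5, 4]
```

If the stored totals can be rebuilt from the raw rows like this, the aggregate rows add no information and can safely be set aside.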

#### Conclusion

Optimal candidate:

- **Sheet2**

### Sweep Samples Cereals Edited vs Sheet2

In [20]:
ssce = sheets['Sweep Samples Cereals Edited']
compare_sheets = ssce, s2

#### Columns

In [21]:
pandas.options.display.max_rows = 140
pandas.DataFrame(
    data=[sheet.columns for sheet in compare_sheets],
    index=['Sweep Samples Cereals Edited', 'Sheet2']
).T

Unnamed: 0,Sweep Samples Cereals Edited,Sheet2
0,date,id
1,site,province
2,crop,date
3,field_name,sample_by_week
4,sample type,date_by_week
5,total sweeps,julian_date
6,unnamed: 6,site
7,unnamed: 7,field_name
8,ega_alate,crop
9,ega_apt,distance(m)


**Sweep Samples Cereals Edited** seems to have left out some columns that would be expected to carry finely categorized subjects, such as various instars of aphids and leafhoppers. This makes it less likely to have the information I'll need, because I'll be comparing aphid numbers. Furthermore, **Sweep Samples Cereals Edited** has a column, ***total sweeps***, that's probably an artefact of earlier aggregation. It's also missing the ***distance(m)*** column, which is another sign of aggregation and therefore of lost information.

Unless further analysis reveals that **Sweep Samples Cereals Edited** has dates that are missing from **Sheet2**, I'll assume it's not worth closer examination.

#### Rows

In [22]:
f'{ssce.index.size / s2.index.size:.0%}'

'14%'

**Sweep Samples Cereals Edited** has only 14% as many rows as **Sheet2**, so it's not likely to be useful, unless the dates don't fully overlap.

In [23]:
def len_unique(pandas_object):
    return len(pandas_object.unique())


descriptors = [pandas.Series.max, pandas.Series.min, len_unique]
for frame in compare_sheets:
    # A bare expression inside a loop shows nothing, so display each summary explicitly.
    display(frame.date.agg(descriptors))

It appears that **Sweep Samples Cereals Edited**, like **Sweep Samples Cereals**, has a much shorter date range, so we won't be missing anything if we ignore it. To be sure, I need to check if all the dates in **Sweep Samples Cereals Edited** are in **Sheet2**.

In [24]:
ssce.date.isin(s2.date).all()

True

Excellent. I see no reason to pay attention to **Sweep Samples Cereals Edited** or **Sweep Samples Cereals** anymore. 

If I have time after fixing the graph, I may trace the cause of the error, which may lead me back to one of those worksheets.

#### Conclusion

Optimal candidate:

- **Sheet2** (again)

### leafhoppers 2016 cereal sweeps vs Sheet2

In [25]:
sheet_names = ['leafhoppers 2016 cereal sweeps', 'Sheet2']
compare_sheets = lh2016, s2 = [
    sheets[sheet_name]
    for sheet_name in sheet_names
]

#### Columns

In [26]:
columns = [frame.columns for frame in compare_sheets]
pandas.DataFrame(
    data=columns,
    index=sheet_names
).T.head(lh2016.columns.size)

Unnamed: 0,leafhoppers 2016 cereal sweeps,Sheet2
0,date,id
1,sample_by_week,province
2,date_by_week,date
3,julian_date,sample_by_week
4,site,date_by_week
5,field_name,julian_date
6,crop,site
7,distance(m),field_name
8,number of samples,crop
9,total sweeps,distance(m)


Clearly, **leafhoppers 2016 cereal sweeps** is focused on leafhoppers. Because the object of our analysis is to compare aphid numbers, I don't see the relevance of this leafhopper data.

In [27]:
lh2016['total sweeps'].unique()

array([120,  60,  80,  20, 100,  40])

In [28]:
lh2016['distance(m)'].unique()

array(['Combined'], dtype=object)

In [29]:
lh2016['number of samples'].unique()

array([6, 3, 4, 1, 5, 2])

#### Rows

In [30]:
pandas.DataFrame(data={sheet_name: sheet.index.size for sheet_name, sheet in zip(sheet_names, compare_sheets)}, index=['number of rows']).rename_axis(['worksheet'], axis='columns').T.sort_values(by='number of rows')

worksheet,number of rows
leafhoppers 2016 cereal sweeps,92
Sheet2,668


**Sheet2** has more rows. But do the dates align?

First, what date format is used in lh2016?

In [31]:
lh2016.date.head()

0   2016-08-12
1   2016-08-05
2   2017-07-08
3   2016-07-15
4   2016-07-15
Name: date, dtype: datetime64[ns]

Now, to compare dimensions:

In [32]:
pandas.DataFrame(
    data=[
        len_unique(frame.date.index)
        for frame in compare_sheets
    ],
    index=sheet_names,
    columns=[
        'unique datetimes',
    ],
)

Unnamed: 0,unique datetimes
leafhoppers 2016 cereal sweeps,92
Sheet2,668


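These counts match the row counts because `frame.date.index` holds the row labels, not the date values. Seeing that `leafhoppers 2016 cereal sweeps` is smaller, check that its index is a subset of that of `Sheet2`: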

In [33]:
lh2016.date.index.isin(s2.index).all()

True

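Every row label in the leafhopper index is also present in the index of `Sheet2`. Note that this compares the default integer row labels, not the dates themselves, so it doesn't yet establish that all of the leafhopper dates occur in `Sheet2`. A value-level check would be `lh2016.date.isin(s2.date).all()`.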
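
The difference between comparing row labels and comparing values is easy to miss, so here is a small sketch on hypothetical data. The series `small` and `large` and their dates are invented for illustration and are not taken from the worksheets.

```python
import pandas

# Hypothetical date columns with default integer row labels (0, 1, 2, ...).
small = pandas.Series(pandas.to_datetime(['2016-08-12', '2016-09-01']))
large = pandas.Series(pandas.to_datetime(['2016-08-12', '2016-08-05', '2016-07-15']))

# Row labels 0 and 1 exist in both, regardless of the dates stored there.
print(small.index.isin(large.index).all())  # True

# Comparing the values shows that one date is missing from `large`.
print(small.isin(large).all())  # False
missing = small[~small.isin(large)].dt.strftime('%Y-%m-%d').tolist()
print(missing)  # ['2016-09-01']
```

With default integer indexes, the label comparison is true whenever the smaller frame has no more rows than the larger one, so it says nothing about the dates.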

#### Conclusion

Given that the leafhopper data isn't pertinent, `Sheet2` remains the best candidate for the primary source of data points about aphids collected and counted according to the "sweep" method.

If it's ultimately determined that the data *is* relevant, it may still be useful, provided its dates (and not just its row labels) turn out to overlap with those of `Sheet2`.

### Conclusion

After all analysis, `Sheet2` appears to be the purest, most relevant base for comparison of "sweep" sample data to that of "tiller" count data.

## Select Primary Data Set: Tiller Head

### Candidates

- Head Counts
- Head Counts Edited

### Head Counts vs Head Counts Edited

I presume the relationship between these two worksheets is the same as that between the equivalent "sweep" worksheets. Therefore, I expect the "edited" version to be less useful.

In [34]:
sheet_names = [
    'Head Counts',
    'Head Counts Edited'
]
compare_sheets = hc, hce = [sheets[sheet_name] for sheet_name in sheet_names]

#### Columns

In [35]:
pandas.DataFrame(
    index=sheet_names,
    data=[sheet.columns for sheet in compare_sheets],
).T

Unnamed: 0,Head Counts,Head Counts Edited
0,site,date
1,crop,site
2,date,crop
3,field,field
4,zadoks_stage,sample type
5,tiller,unnamed: 5
6,ega_head,zadoks_stage
7,ega_leaf,tiller
8,bco_head,ega_alate
9,bco_leaf,ega_apt


The main difference here seems to be the aggregation in the "edited" sheet, indicated by the columns named "EGA/head", "EGA_total", etc. I expect the "unedited" data to be more complete and reliable.

#### Rows

Now, compare for completeness:

In [36]:
[len_unique(column) for column in (
    hc.date,
    hce.date
)]

[20, 20]

In [37]:
hc.date.index.isin(hce.date.index).all()

True

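Both sheets contain 20 distinct dates, and every row label in **Head Counts** also appears in **Head Counts Edited**, so coverage gives no basis for choosing one over the other. (The second check compares index labels, not date values; `hc.date.isin(hce.date).all()` would compare the dates themselves.)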

### Conclusion

Based on the columns, the best candidate for pure, reliable data is:

- `Head Counts`

## Align Primary Data Sets

The names and corresponding data frames, from the worksheets in the source workbook document (Excel):

In [38]:
sheet_names, compare_sheets = zip(
    ('Head Counts', hc),
    ('Sheet2', s2),
)

#### Visualize Whitespace

For the sake of visualization, I'll write a function that wraps any value in square brackets. This makes trailing or leading whitespace obvious.

In [39]:
def wrap_brackets(x):
    return x if pandas.isna(x) else f'[{str(x)}]'

In [40]:
# Example:
pandas.Series([' a ', 'b ', 'c', '', ' ', float('nan')]).apply(wrap_brackets)

0    [ a ]
1     [b ]
2      [c]
3       []
4      [ ]
5      NaN
dtype: object

### Compare Columns

I've already noticed at least one leading space on a column name ("spiders"), and variations in capitalization. To compensate for this, I'll strip all leading and trailing whitespace from the column names.

In [41]:
for sheet in compare_sheets:
    sheet.columns = sheet.columns.str.strip()
    sheet.sort_index(axis='columns', inplace=True)
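
The combined effect of lower-casing, stripping, and sorting column names can be seen on a tiny example. This is only a sketch: the three labels are hypothetical stand-ins for the kinds of inconsistency found in the workbook.

```python
import pandas

# Hypothetical column labels: a leading space, a trailing space, and mixed case.
frame = pandas.DataFrame(columns=[' Spiders', 'Crop ', 'EGA_head'])

# Lower-case, strip surrounding whitespace, then sort the columns by name.
frame.columns = frame.columns.str.lower().str.strip()
frame = frame.sort_index(axis='columns')

print(frame.columns.tolist())  # ['crop', 'ega_head', 'spiders']
```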

Comparing alphabetically sorted column names between our two primary data sets:

In [42]:
pandas.DataFrame(
    index=sheet_names,
    data=[sheet.columns for sheet in compare_sheets]
).fillna('').T

Unnamed: 0,Head Counts,Sheet2
0,aphid_mummies_blk,( damsel bug)nabis_americoferus_adult
1,aphid_mummies_brown,1st_instar_ega
2,aphids_total,1st_instar_macrosteles
3,bco_alate,2nd_instar_ega
4,bco_apt,2nd_instar_macrosteles
5,bco_head,3rd_instar_ega
6,bco_leaf,3rd_instar_ega_pre-alate
7,bco_total,3rd_instar_macrosteles
8,comments,4th_instar_macrosteles
9,crop,4th_instar_pre-alate


Wow, that's quite a large difference in columns for these sets. Since the object of the comparison is aphids only, we can ignore most of these columns from `Sheet2`. For better comparison, let's filter for columns referring to aphids.

#### Aphid Columns

Aphid related terms:

* aphid
* ega
* bco
* greenbug

In [43]:
aphid_terms = (
    r'aphids?',
    r'ega',
    r'bco',
    r'greenbug',
)
aphid_term_pattern = '|'.join(aphid_terms)

hc_aphid_columns, s2_aphid_columns = [
    sheet.columns[
        sheet.columns.str.contains(aphid_term_pattern)
    ] for sheet in compare_sheets
]

sorted(hc_aphid_columns.tolist() + s2_aphid_columns.tolist())

['1st_instar_ega',
 '2nd_instar_ega',
 '3rd_instar_ega',
 '3rd_instar_ega_pre-alate',
 'aphid_mummies',
 'aphid_mummies_aphelinus_black',
 'aphid_mummies_aphidius_brown',
 'aphid_mummies_blk',
 'aphid_mummies_brown',
 'aphidencyrtus_sp',
 'aphidiius_sp.',
 'aphids_total',
 'bco_alate',
 'bco_apt',
 'bco_head',
 'bco_leaf',
 'bco_total',
 'bird_cherry_oat_aphid',
 'ega alate',
 'ega_alate',
 'ega_apt',
 'ega_grn',
 'ega_head',
 'ega_leaf',
 'ega_red',
 'ega_total',
 'greenbug_alate',
 'greenbug_aphid',
 'greenbug_apt',
 'pea aphids',
 'sitobion_avenae_ega_green (wingless)',
 'sitobion_avenae_ega_red',
 'total_alate_aphids',
 'total_apterous_aphids']

I see some problems with this list.

Not aphid related:

- aphidencyrtus_sp
- aphidiius_sp.
- aphid_mummies
- aphid_mummies_aphelinus_black
- aphid_mummies_aphidius_brown
- aphid_mummies_blk
- aphid_mummies_brown

Not primary data:

- aphids_total
- bco_total
- ega_total
- total_alate_aphids
- total_apterous_aphids

I can prevent the matching of longer words that merely begin with "aphid" (such as "aphidencyrtus") by requiring a boundary on both sides of each term:

In [44]:
boundary = r'(?:_|^|$|\b)'
aphid_term_pattern = ''.join((
    boundary,
    r'(?:', '|'.join(aphid_terms), r')',
    boundary,
))

hc_aphid_columns, s2_aphid_columns = [
    sheet.columns[
        sheet.columns.str.contains(aphid_term_pattern)
    ] for sheet in compare_sheets
]

sorted(hc_aphid_columns.tolist() + s2_aphid_columns.tolist())

['1st_instar_ega',
 '2nd_instar_ega',
 '3rd_instar_ega',
 '3rd_instar_ega_pre-alate',
 'aphid_mummies',
 'aphid_mummies_aphelinus_black',
 'aphid_mummies_aphidius_brown',
 'aphid_mummies_blk',
 'aphid_mummies_brown',
 'aphids_total',
 'bco_alate',
 'bco_apt',
 'bco_head',
 'bco_leaf',
 'bco_total',
 'bird_cherry_oat_aphid',
 'ega alate',
 'ega_alate',
 'ega_apt',
 'ega_grn',
 'ega_head',
 'ega_leaf',
 'ega_red',
 'ega_total',
 'greenbug_alate',
 'greenbug_aphid',
 'greenbug_apt',
 'pea aphids',
 'sitobion_avenae_ega_green (wingless)',
 'sitobion_avenae_ega_red',
 'total_alate_aphids',
 'total_apterous_aphids']

Better. Still need to exclude "total" and "mummies".
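
The custom boundary is needed because the regular expression token `\b` treats an underscore as a word character, so it never fires between `_` and a letter. The sketch below rebuilds the same pattern with the standard `re` module and tries it on four column names from the lists above.

```python
import re

aphid_terms = (r'aphids?', r'ega', r'bco', r'greenbug')
boundary = r'(?:_|^|$|\b)'
pattern = re.compile(boundary + r'(?:' + '|'.join(aphid_terms) + r')' + boundary)

names = ['aphidencyrtus_sp', 'aphid_mummies', 'pea aphids', '1st_instar_ega']
results = {name: bool(pattern.search(name)) for name in names}
for name, matched in results.items():
    print(name, matched)
# aphidencyrtus_sp False
# aphid_mummies True
# pea aphids True
# 1st_instar_ega True

# With plain \b the underscore-separated name is missed entirely.
print(re.search(r'\bega\b', '1st_instar_ega'))  # None
```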

In [45]:
excluded_terms = (
    r'mumm(?:y|ies)',
    r'total',
)
aphid_term_pattern, excluded_term_pattern = (
    r''.join((
        boundary,
        r'(?:', '|'.join(pattern), r')',
        boundary,
    )) for pattern in (aphid_terms, excluded_terms)
)

hc_aphid_columns, s2_aphid_columns = [
    sheet.columns[
        sheet.columns.str.contains(aphid_term_pattern) & ~ sheet.columns.str.contains(excluded_term_pattern)
    ] for sheet in compare_sheets
]

sorted(hc_aphid_columns.tolist() + s2_aphid_columns.tolist())

['1st_instar_ega',
 '2nd_instar_ega',
 '3rd_instar_ega',
 '3rd_instar_ega_pre-alate',
 'bco_alate',
 'bco_apt',
 'bco_head',
 'bco_leaf',
 'bird_cherry_oat_aphid',
 'ega alate',
 'ega_alate',
 'ega_apt',
 'ega_grn',
 'ega_head',
 'ega_leaf',
 'ega_red',
 'greenbug_alate',
 'greenbug_aphid',
 'greenbug_apt',
 'pea aphids',
 'sitobion_avenae_ega_green (wingless)',
 'sitobion_avenae_ega_red']

Great! Aphid columns identified.

Before combining the data sets for mathematical processing, I'll need to normalize those column names.

##### @todo: normalize aphid names

#### Non-aphid Related Columns

What non-aphid columns remain?

In [46]:
hc_remainder, s2_remainder = (
    frame.columns[~frame.columns.isin(frame_aphid)]
    for frame, frame_aphid in zip((hc, s2), (hc_aphid_columns, s2_aphid_columns))
)

sorted(hc_remainder.tolist() + s2_remainder.tolist())

['( damsel bug)nabis_americoferus_adult',
 '1st_instar_macrosteles',
 '2nd_instar_macrosteles',
 '3rd_instar_macrosteles',
 '4th_instar_macrosteles',
 '4th_instar_pre-alate',
 'ambush_bugs',
 'anthocoridae',
 'anthomyiidae-delia',
 'any parasitoid_adults',
 'aphelinus_albipodus',
 'aphelinus_asychis',
 'aphelinus_varipes',
 'aphid_mummies',
 'aphid_mummies_aphelinus_black',
 'aphid_mummies_aphidius_brown',
 'aphid_mummies_blk',
 'aphid_mummies_brown',
 'aphidencyrtus_sp',
 'aphidiius_sp.',
 'aphids_total',
 'asaphes_suspensus',
 'assassin_bug (reduviid bugs)',
 'athysanus_argentarius',
 'bco_total',
 'bees',
 'beetles',
 'bertha_armyworms',
 'braconid_wasps',
 'cabbage_butterfly',
 'capsus_simulans',
 'caterpillar',
 'chalcid_wasps',
 'chinch_bug',
 'chrysopa_oculata_adult',
 'chrysopa_oculata_larvae',
 'chrysoperla_carnea_adult',
 'chrysoperla_carnea_larva',
 'chrysopidae_adults',
 'cicindela',
 'coccinella_septempunctata_c7',
 'comments',
 'crop',
 'crop',
 'date',
 'date',
 'date_by

#### Lookup Columns (Record Index)

Reading through the list of remaining, non-aphid related columns, I see some that don't mention any organism by name. These columns may be useful for indexing, which is crucial to aligning the two data sources.

In [47]:
non_organism_column_names = pandas.Series(data=(
    'collection_date',
    'comments',
    'crop',
    'date',
    'date_by_week',
    'distance(m)',
    'field',
    'field_name',
    'id',
    'julian_date',
    'number of samples',
    'province',
    'sample_by_week',
    'site',
    'zadoks_stage',
))

Here are the names of the matching columns from each frame:

In [48]:
non_organism_columns_common = pandas.DataFrame(
    {
        name: dict(zip(non_organism_column_names,
                       non_organism_column_names.isin(frame.columns)))
        for name, frame
        in zip(sheet_names, compare_sheets)
    }
).replace(to_replace={False: '', True: '✅'})
non_organism_columns_common

Unnamed: 0,Head Counts,Sheet2
collection_date,,
comments,✅,
crop,✅,✅
date,✅,✅
date_by_week,,✅
distance(m),,✅
field,✅,
field_name,,✅
id,,✅
julian_date,,✅


### Align Columns

Our next goal is to determine which columns are in common and ensure they're of the same data type, so we can concatenate the frames.

In [49]:
non_organism_columns_common.index[
    non_organism_columns_common.all(axis='columns')
].tolist()

['crop', 'date', 'site']

Those columns alone are probably enough to align the data.
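The data-type half of that goal can be checked in a few lines. This is only a sketch on two made-up frames (`left` and `right` are hypothetical stand-ins for the real worksheets), comparing the dtype of each shared column:

```python
import pandas

# Two tiny, made-up frames standing in for the real worksheets.
left = pandas.DataFrame({
    "crop": ["Wheat"],
    "date": pandas.to_datetime(["2016-07-11"]),
    "site": ["SEF"],
})
right = pandas.DataFrame({
    "crop": ["Barley"],
    "date": ["2016-07-14"],  # deliberately left as text
    "site": ["Kernan"],
})

# Columns present in both frames.
shared = left.columns.intersection(right.columns)

# True where both frames store the column with the same dtype.
same_dtype = {
    name: bool(left[name].dtype == right[name].dtype)
    for name in shared
}
print(same_dtype)
```

Any column reported as `False` would need converting before the frames are concatenated.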

I'll need to clean up `crop` and `site`, but `date` and `collection_date` are already well formed.

These columns warrant examination as well:

- collection_date
- date
- distance(m)
- field
- field_name
- number of samples

#### Normalize Index Columns

##### Date

The columns containing date information were normalized in an [earlier stage] of analysis. Those columns are ready to align in combination with other unique index labels.

[earlier stage]: #Convert-Date-&-Time-Format

##### Crop

Here are the `crop` field values for both data frames. (Since I've noticed some sneaky whitespace, I'll wrap the `crop` values in brackets for visualization.)

In [50]:
crops = pandas.concat(
    (
        frame.crop.apply(str)
        for frame in compare_sheets
    ),
    keys=sheet_names,
    sort=True,
).drop_duplicates().reset_index(drop=True).sort_values()
crops.apply(wrap_brackets)

3                 [0]
2            [Barley]
9           [Barley ]
16              [Oat]
15             [Oat ]
6              [Oats]
1             [Tanzy]
0             [Wheat]
8            [Wheat ]
10     [Winter Wheat]
14    [Winter Wheat ]
13      [WinterWheat]
12     [Winter_wheat]
11           [barley]
5               [nan]
7          [unlisted]
4             [wheat]
Name: crop, dtype: object

Clearly, there are some variations that should be corrected, in both data frames.

- whitespace
- letter case
- word separation

I'll write a function that transforms any given "crop" value into a uniform representation of the crop it's intended to represent.

In [51]:
def normalize_str(value, separator=' '):
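    # Leave missing values untouched; pandas.isna also catches NaN objects
    # that are not the numpy.nan singleton.
    if pandas.isna(value):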
        return value

    str_value = str(value)

    # Add separators between words, title case
    is_mixed_case = str_value.upper() != str_value.lower() and not (str_value.islower() or str_value.isupper())
    if is_mixed_case:
        word_index = [
            index for index, char in enumerate(str_value) 
            if char.isupper()
        ] + [None]
        if word_index:
            words = [
                str_value[word_index[i]:word_index[i + 1]].strip()
                for i in range(len(word_index) - 1)
            ]
            str_value = separator.join(words)
    
    transformed = re.compile(r'[^a-zA-Z0-9]').sub(separator, str(str_value).title())
    
    # De-pluralize
    if transformed.endswith('Oats'):
        transformed = transformed[:-1]
    
    return transformed

Previewing the results, with the square brackets added again:

In [52]:
pandas.concat(
    (
        crops,
        crops.apply(normalize_str),
    ),
    keys=("Before", "After"),
    axis='columns',
).sort_values('After').applymap(wrap_brackets)

Unnamed: 0,Before,After
3,[0],[0]
2,[Barley],[Barley]
9,[Barley ],[Barley]
11,[barley],[Barley]
5,[nan],[Nan]
16,[Oat],[Oat]
15,[Oat ],[Oat]
6,[Oats],[Oat]
1,[Tanzy],[Tanzy]
7,[unlisted],[Unlisted]


There are some concerning names, such as "nan", "unlisted", and "0", but the renaming result looks good. I'll apply the changes:

In [53]:
for frame in compare_sheets:
    frame.crop = frame.crop.apply(normalize_str)
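
As an aside, the word-separation step can also be expressed with regular expressions. The sketch below is an alternative, not the function used in this notebook: `split_and_title` is a hypothetical helper, and it skips the de-pluralizing step.

```python
import re


def split_and_title(value, separator=" "):
    """Hypothetical helper: normalize a crop-like label (not the notebook's function)."""
    text = str(value)
    # Insert a separator wherever a lowercase letter is followed by an uppercase one.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", separator, text)
    # Treat every run of non-alphanumeric characters as a single word break.
    words = re.split(r"[^a-zA-Z0-9]+", text)
    # Drop empty fragments (from leading/trailing breaks) and capitalize each word.
    return separator.join(word.capitalize() for word in words if word)


for raw in ["WinterWheat", "Winter_wheat", "Winter Wheat ", "barley"]:
    print(repr(raw), "->", repr(split_and_title(raw)))
```

All four spellings in the loop come out as either `Winter Wheat` or `Barley`, with no trailing whitespace.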

##### Site

In [54]:
site_values = (
    pandas.concat(
        (frame[['site']] for frame in compare_sheets),
        keys=sheet_names,
        names=['Sheet Name', 'index',],
    )
    .drop_duplicates()
    .sort_values('site')
)
site_values.site.apply(wrap_brackets)

Sheet Name   index
Head Counts  8             [Alberta]
             142            [Alvena]
Sheet2       174            [Clavet]
Head Counts  141       [Indian Head]
             128            [Kernan]
Sheet2       193           [Kernan ]
Head Counts  109         [Llewellyn]
             53           [Manitoba]
             103      [Meadow Lake ]
             92        [Meadow_Lake]
             96            [Melfort]
Sheet2       363          [Melfort ]
Head Counts  54            [Outlook]
             75           [Outlook ]
             10                [SEF]
             27               [SEF ]
             4               [Wakaw]
             0        [Yellow Creek]
Sheet2       626      [Yellow creek]
Head Counts  2        [Yellow_Creek]
Sheet2       632       [Yellowcreek]
Name: site, dtype: object

This is going to need some normalization. I'll write a string-reducing function that strips insignificant characters and letter case, so that every variant of a label reduces to the same key. From that, I can build a hash table that maps each key to its preferred representation, and apply that mapping to the values in a renaming operation.

###### Index by Reduced Label Value

In order to match badly formed labels with their normal representation, I need a string function similar to a hash, so the values in the data frame can be used to look up the preferred, normal form.

- convert numeric values to text
- strip non-alphanumeric characters
- convert alphabetic characters to lower case

In [55]:
def hash_like(value):
    return re.compile(r"[^a-z0-9]").sub('', str(value).lower())
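
To illustrate what the reduction does, here is a self-contained sketch that repeats the function and feeds it the "Yellow Creek" spellings seen in the listing above:

```python
import re


def hash_like(value):
    # Same reduction as above: lower-case, then drop everything except a-z and 0-9.
    return re.compile(r"[^a-z0-9]").sub('', str(value).lower())


variants = ['Yellow Creek', 'Yellow creek', 'Yellow_Creek', 'Yellowcreek']

# Every spelling collapses to one key, so the set has a single member.
keys = {hash_like(label) for label in variants}
print(keys)  # {'yellowcreek'}
```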

Applying this to all frames:

In [56]:
for frame in compare_sheets + (site_values, ):
    frame['site_index'] = frame.site.apply(hash_like)
    frame.set_index('site_index', append=True, inplace=True)

###### Choosing Normal Form

If I had a larger data set, I could automatically find the most frequently used representation for any given reduced value ("hash"); I would use the `mode` function. Unfortunately, the groups are far too small and the values too varied:

In [57]:
(
    site_values.site
    .apply(wrap_brackets)
    .unstack()
    .fillna('')
    .reset_index('index', drop=True)
)

site_index,alberta,alvena,clavet,indianhead,kernan,llewellyn,manitoba,meadowlake,melfort,outlook,sef,wakaw,yellowcreek
Sheet Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Head Counts,,,,,,,,,,,,,[Yellow Creek]
Head Counts,,,,,,,,,,,,,[Yellow_Creek]
Head Counts,,,,,,,,,,,,[Wakaw],
Head Counts,[Alberta],,,,,,,,,,,,
Head Counts,,,,,,,,,,,[SEF],,
Head Counts,,,,,,,,,,,[SEF ],,
Head Counts,,,,,,,[Manitoba],,,,,,
Head Counts,,,,,,,,,,[Outlook],,,
Head Counts,,,,,,,,,,[Outlook ],,,
Head Counts,,,,,,,,[Meadow_Lake],,,,,


Even if I strip the leading and trailing whitespace, I would still end up with ambiguous candidate selections. The work required to make a function that knows how and when to convert labels like `Yellowcreek` to `Yellow Creek` would be unreasonable given the scope of this project, so automation isn't the best way to choose normal forms.
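
For reference, the frequency-based selection I ruled out might look like the sketch below on a larger data set. The labels are made up so that one spelling clearly dominates each group, and `collections.Counter` stands in for the `mode` function:

```python
import re
from collections import Counter, defaultdict


def hash_like(value):
    return re.sub(r"[^a-z0-9]", "", str(value).lower())


# Made-up observations; the real sheets have far fewer repeats per spelling.
labels = (
    ["Yellow Creek"] * 5
    + ["Yellowcreek", "Yellow_Creek"]
    + ["Outlook", "Outlook", "Outlook "]
)

# Count how often each spelling occurs under its reduced key.
counts_by_key = defaultdict(Counter)
for label in labels:
    counts_by_key[hash_like(label)][label] += 1

# Pick the most frequent spelling for each key.
most_common_form = {
    key: counts.most_common(1)[0][0]
    for key, counts in counts_by_key.items()
}
print(most_common_form)
```

With only one or two occurrences of each spelling, as in the real data, ties would make this choice arbitrary.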

I'll make the preferred identifier list manually, then apply the changes automatically. Here it is as a data frame with the appropriately reduced version of each label as the index:

In [58]:
preferred_site_id = pandas.Series(
    name='site',
    data={hash_like(item): item
          for item in [
              'Alvena',
              'Clavet',
              'Indian Head',
              'Kernan',
              'Llewellyn',
              'Meadow Lake',
              'Melfort',
              'Outlook',
              'SEF',
              'Wakaw',
              'Yellow Creek',
          ]},
)
preferred_site_id.index.set_names(['site_index'], inplace=True)
preferred_site_id.to_frame().T

site_index,alvena,clavet,indianhead,kernan,llewellyn,meadowlake,melfort,outlook,sef,wakaw,yellowcreek
site,Alvena,Clavet,Indian Head,Kernan,Llewellyn,Meadow Lake,Melfort,Outlook,SEF,Wakaw,Yellow Creek


With this list, I can compare the reduced ("hashed") values to the ones in the actual data and apply the preferred name where it matches.

Everything that isn't in the preferred site name list:

In [59]:
site_values[
    ~ site_values.index.get_level_values('site_index')
    .isin(preferred_site_id.index)
]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,site
Sheet Name,index,site_index,Unnamed: 3_level_1
Head Counts,8,alberta,Alberta
Head Counts,53,manitoba,Manitoba


These outliers are from the unmatchable records I mentioned upon my first look at the [unique site] labels. I can move on without doing anything further on these.

[unique site]: #Outliers

Report on which records would be changed if I relabel `site`:

In [60]:
pandas.concat(
    (
        site_values,
        preferred_site_id.to_frame().combine_first(site_values),
    ),
    keys=['Before', 'After'],
    axis='columns',
).reset_index('site_index', drop=True).applymap(wrap_brackets)

Unnamed: 0_level_0,Unnamed: 1_level_0,Before,After
Unnamed: 0_level_1,Unnamed: 1_level_1,site,site
Sheet Name,index,Unnamed: 2_level_2,Unnamed: 3_level_2
Head Counts,8,[Alberta],[Alberta]
Head Counts,142,[Alvena],[Alvena]
Sheet2,174,[Clavet],[Clavet]
Head Counts,141,[Indian Head],[Indian Head]
Head Counts,128,[Kernan],[Kernan]
Sheet2,193,[Kernan ],[Kernan]
Head Counts,109,[Llewellyn],[Llewellyn]
Head Counts,53,[Manitoba],[Manitoba]
Head Counts,103,[Meadow Lake ],[Meadow Lake]
Head Counts,92,[Meadow_Lake],[Meadow Lake]


This looks good to me. I'll apply the names:

In [61]:
for frame in compare_sheets:
    frame.loc[:, 'site'] = (
        preferred_site_id.to_frame()
        .combine_first(frame)
        .loc[:, 'site']
    )
    frame.reset_index(
        level='site_index',
        drop=True,
        inplace=True,
    )
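
The same relabelling idea can be sketched without touching the index, by mapping each value through its reduced key and falling back to the original wherever no preferred name exists. This is an illustration on made-up values, not the code applied above; `preferred` and `sites` are hypothetical names:

```python
import re

import pandas


def hash_like(value):
    return re.sub(r"[^a-z0-9]", "", str(value).lower())


# Reduced key -> preferred spelling (a shortened, hypothetical list).
preferred = {hash_like(name): name for name in ["Meadow Lake", "Yellow Creek"]}

sites = pandas.Series(["Meadow_Lake", "Yellowcreek", "Alberta"])

# Keys missing from the mapping become NaN, which fillna replaces
# with the original label at the same position.
relabelled = sites.map(hash_like).map(preferred).fillna(sites)
print(relabelled.tolist())
```

Labels such as `Alberta`, which have no preferred form, pass through unchanged.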

I'll review the label values for `site` in both data frames, to see the result of those changes:

In [62]:
(
    pandas.concat(
        (frame[['site']] for frame in compare_sheets),
    )
    .loc[:, 'site']
    .reset_index(drop=True)
    .sort_values()
    .drop_duplicates()
    .apply(wrap_brackets)
)

82          [Alberta]
228          [Alvena]
392          [Clavet]
141     [Indian Head]
474          [Kernan]
561       [Llewellyn]
84         [Manitoba]
105     [Meadow Lake]
627         [Melfort]
59          [Outlook]
791             [SEF]
824           [Wakaw]
863    [Yellow Creek]
Name: site, dtype: object

##### Field

The columns **hc**.***field*** and **s2**.***field_name*** seem to relate to ***site*** and ***crop***. Before I address the values of the columns, I'll rename **s2**.***field_name*** for consistency.

In [63]:
for frame in compare_sheets:
    frame.rename(
        columns={'field_name': 'field'},
        inplace=True,
    )

Now, regarding the values, I suspect redundancy. Here's why I feel that way; look at some of the values from both data sets:

In [64]:
index_column_names = ['site', 'crop', 'field']
pandas.concat(
    (
        hc[index_column_names],
        s2[index_column_names],
    ),
    keys=sheet_names,
#     names=['worksheet', 'index',],
).applymap(wrap_brackets).head(15)

Unnamed: 0,Unnamed: 1,site,crop,field
Head Counts,0,[Yellow Creek],[Wheat],[Yellow_Creek_Wheat]
Head Counts,1,[Yellow Creek],[Wheat],[Yellow_Creek_Wheat]
Head Counts,2,[Yellow Creek],[Wheat],[Yellow_Creek_Wheat]
Head Counts,3,[Yellow Creek],[Tanzy],[Yellow_Creek_Tanzy]
Head Counts,4,[Wakaw],[Barley],[Wakaw_Barley]
Head Counts,5,[Wakaw],[Barley],[Wakaw_Barley]
Head Counts,6,[Wakaw],[Barley],[Wakaw_Barley]
Head Counts,7,[Wakaw],[Barley],[Wakaw_Barley]
Head Counts,8,[Alberta],[0],[SW25-11-11-W4 CM11687]
Head Counts,9,[Alberta],[Wheat],[SW25-11-11-W4]


It's clear to my eyes that in most cases, `site` and `crop` have been concatenated together to produce the value in `field` or `field_name`.

- site + crop = field(_name)

Primary data:

- site
- crop

Aggregated data:

- field
- field_name

Therefore, in those cases, I can safely disregard those derived columns and rely on the more normalized forms for indexing. That is to say, I think it's most beneficial to use `crop` and `site` whenever possible.
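
The "site + crop = field" observation can be tested mechanically rather than by eye. A minimal sketch, assuming the reduced form of `field` should begin with the reduced `site` followed by the reduced `crop`; the rows are copied from listings in this section:

```python
import re


def hash_like(value):
    return re.sub(r"[^a-z0-9]", "", str(value).lower())


# (site, crop, field) rows copied from the listings in this section.
rows = [
    ("Yellow Creek", "Wheat", "Yellow_Creek_Wheat"),
    ("Wakaw", "Barley", "Wakaw_Barley"),
    ("SEF", "Barley", "SEF Barley 1"),
    ("Alberta", "0", "SW25-11-11-W4 CM11687"),
]

# True where the field label looks like a concatenation of site and crop.
is_derived = [
    hash_like(field).startswith(hash_like(site) + hash_like(crop))
    for site, crop, field in rows
]
print(is_derived)  # [True, True, True, False]
```

Rows that come back `False`, like the Alberta record, are the cases where `field` carries information that `site` and `crop` don't.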

In [65]:
(
    pandas.merge(
        s2.reset_index(), hc.reset_index(),
        suffixes=('_s2', '_hc'),
        how='inner',
        on=['date', 'crop', 'site', ],
    )
    .set_index(['date', 'crop', 'site',])
    .sort_index(level=['date', 'crop', 'site',])
    .loc[:, ['field_s2', 'field_hc']]
    .stack()
    .reset_index(level=-1, drop=True)
    .reset_index()
    .drop_duplicates()
    .rename(columns={0: 'field'})
#     .set_index(['date', 'crop', 'site',])
)

Unnamed: 0,date,crop,site,field
0,2016-07-11,Barley,SEF,SEFBarley
1,2016-07-11,Barley,SEF,SEF Barley 1
2,2016-07-14,Barley,Kernan,KernanBarley
3,2016-07-14,Barley,Kernan,Kernan Barley 1
16,2016-07-14,Wheat,Kernan,KernanWheat
17,2016-07-14,Wheat,Kernan,Kernan wheat 1
30,2016-07-14,Wheat,Outlook,OutlookWheat1
31,2016-07-14,Wheat,Outlook,Outlook wheat1
32,2016-07-14,Wheat,Outlook,OutlookWheat
38,2016-07-14,Wheat,Outlook,OutlookWheat2


#### Number of Samples & Distance

The ***number of samples*** field is only present in **Sheet2**. I want to know a little bit about the values:

In [67]:
s2['number of samples'].value_counts(dropna=False)

NaN     576
 5.0     36
 6.0     31
 4.0      9
 3.0      8
 1.0      5
 2.0      3
Name: number of samples, dtype: int64

There are many missing values in this column. I presume the number of samples relates to the "combined" values, but I need to confirm:

In [68]:
s2.loc[
    (
        s2['number of samples'].isna()
    ) | (
        ~ s2['distance(m)'].apply(
            isinstance,
            args=((int, float,),)
        )
    ),
    ['distance(m)', 'number of samples']
].drop_duplicates()

Unnamed: 0,distance(m),number of samples
0,0,
1,5,
2,10,
3,25,
4,50,
5,100,
6,Combined,6.0
22,Combined,3.0
43,Combined,4.0
65,Combined,1.0


This confirms that where the ***number of samples*** is not indicated, the corresponding ***distance(m)*** value is numeric. The converse is also true: non-numeric ***distance(m)*** corresponds with positive values in ***number of samples***. That's consistent with my expectations for the appearance of labels on previously aggregated data.
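
The same check can be phrased as a single boolean test. This sketch uses a made-up frame shaped like **Sheet2** (the `aphids` column is a hypothetical observation column) and `pandas.to_numeric` with `errors="coerce"` to flag the non-numeric distances:

```python
import pandas

# Made-up rows shaped like Sheet2: three discrete samples and one aggregate.
sheet = pandas.DataFrame({
    "distance(m)": [0, 5, 10, "Combined"],
    "number of samples": [None, None, None, 3.0],
    "aphids": [1, 2, 0, 3],
})

# Non-numeric distances such as "Combined" become NaN.
numeric_distance = pandas.to_numeric(sheet["distance(m)"], errors="coerce")
is_discrete = numeric_distance.notna()

# Discrete rows should be exactly the rows with no sample count.
is_consistent = bool((is_discrete == sheet["number of samples"].isna()).all())
print(is_consistent)
```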

### Observation Data

Now that the data frame indices I built from the `date`, `site`, `field` and `crop` columns align, I need to determine which observations were recorded in both data frames.

##### @todo: move code block to appropriate place, annotate

In [69]:
for frame in compare_sheets:
    frame.set_index(index_column_names, inplace=True)

##### @todo: select aphid columns, check if others exist in common

### Compare Group Sums to Pre-existing "Combined"

I didn't expect that the data set from `Sheet2` would have sums (presumably from groupings of `distance(m)` values), yet also some non-aggregated values. Normally, these wouldn't be mixed, because they represent different dimensional orders. Because this data set's dimensional order is inconsistent, there is probably redundancy in the total information available, and possibly contradictions.

Any redundancy, whether contradictory or not, would affect the calculation for the intended [objective] unless there's exactly one record for each combination of space and time. That is to say, if there are discrete values as well as previously "combined" values for the same point along the index, I'll have to leave the pre-calculated sum out when aggregating my own sums; otherwise, the resulting totals will be doubled.

[objective]: #Objective

I'll separate discrete and aggregated sets, then compare my sums of discrete sample values to pre-existing aggregations.
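
In outline, that comparison might look like the following sketch. The frame is made up (two hypothetical sites, one of which has a "Combined" value that disagrees with its discrete samples), and it assumes aggregated rows are marked by the text `Combined` in `distance(m)`:

```python
import pandas

# Made-up records: discrete samples plus one pre-existing "Combined" row per site.
records = pandas.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "distance(m)": [0, 5, "Combined", 0, 5, "Combined"],
    "aphids": [2, 3, 5, 1, 1, 4],
})

is_combined = records["distance(m)"] == "Combined"

# My own sums, built from discrete rows only to avoid double counting.
discrete_sums = records[~is_combined].groupby("site")["aphids"].sum()

# The sums that were already present in the sheet.
combined = records[is_combined].set_index("site")["aphids"]

comparison = pandas.DataFrame({"my_sum": discrete_sums, "combined": combined})
comparison["matches"] = comparison["my_sum"] == comparison["combined"]
print(comparison)
```

Site `B` is the kind of contradiction to look for: its discrete samples add up to 2, but the recorded aggregate is 4.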

Before I can compare