# Introduction

## Objective

* Answer the question:
  * For any given day and place, what is the relationship between observations
  recorded as "head counts" and those recorded as "sweeps"?
* Document the process of finding the answer so that the exact steps are
replicable and clearly demonstrated here.

## About the Data

!["Perfectly straight line in graph comparing supposedly independent variables
in Head Count and Sweeps workbook, in Excel Online"
](head-count-sweeps-graph-excel-online.png)

I'm troubleshooting a spreadsheet document that was prepared by someone else.
Whoever prepared it probably handed it off to someone else before I came into
contact. The information in the document looks like it's been copied from other
documents which I'm not certain I have access to.

In this document, there's a graph that demonstrates the goal of the document:
to compare the relationship between two observation methods. Unfortunately, the
graph looks suspiciously ideal: a perfect one-to-one ratio across the entire
domain.

My assignment is to trace the error and correct the graph so that it displays
the precise ratios calculated from appropriate samples. The samples available
for this purpose are counts of aphids collected by the two methods described in
the [Objective].

[Objective]: #Objective

### Explore Worksheets

Before getting hands on with _pandas_, I'll open the document in Excel and
record what my senses tell me.

#### Unbelievable Graph

* There's a graph in a worksheet called "head counts vs sweeps graphs" which
demonstrates the analytical problem encountered/developed by someone else.
* The data supposedly being compared in the graph cannot be the data that was
intended for comparison, because the ratio depicted is perfectly linear even
though it's comparing real-world samples.

#### Work Not Shown

It's clear that the Excel workbook has the results of many calculations, yet
there are no spreadsheet formulas in any cells. Spreadsheet formulas would have
greatly expedited the verification of the calculations; without any reference
to how the calculations were performed and on what values, I'll need to
replicate them from scratch.

#### Metadata and Document History

* Data may have been copied from multiple, unidentified sources.
* It's not clear which data is "original" and which is duplicated, amongst the
worksheets in the workbook.
* Some data appears to have been summed and then mixed back in with the rest of
the data.
* There are no notes from editors of the workbook, and the editors are
unidentifiable.

### Summary of Issues

* No metadata
* No convention for categorical label values
* Mixed aggregation levels (some sums mixed with non-summed data)
* Graph plot is based on incorrect values

## Analytical Goals

### Tidy & Plot Ratios

* Data is clean and indexed well enough to align quantitative samples from the
two field collection methods.
* Upon plotting the ratios of dependent variables corresponding with each
collection method, the graph should not depict an absolutely perfect fit to a
line. That is to say, there should be some variation.

### Retain for Reference

* For posterity, the data should be easy to align with any similar data. This
makes it possible to identify whether the same data exists in any other file.

# Pre-analysis

## Setup Computing Environment

### Import Python Packages

In [2]:
import re
from os import getcwd

import pandas
from markdown import markdown
from IPython.display import HTML, display

## Load Worksheets into _pandas_

I'll make a dictionary of names and data frames, so I can examine the file
overall.

In [3]:
data_file = pandas.ExcelFile(
    '../../data/2016-sweep-vs-tiller/2016 combination.xlsx')
sheets = {
    sheet_name: data_file.parse(sheet_name)
    for sheet_name in data_file.sheet_names
}

For convenience, I'll put sheet names and sheet cells in shorthand variables:

In [4]:
sheet_names = sheets.keys()
sheet_frames = sheets.values()

### Classify Worksheets

Because I can't trust the accuracy of the data used by the plot in Excel, I
need to look at all the sheets and determine the most complete and
unadulterated data sets. I'll determine which data belongs to each category,
and compare the sets.

#### Compare Columns

To begin to discern which data is original, I'll preview some column names of
each worksheet. Knowing which sheets have dates and places will give me a sense
of the differences in data structures.

In [5]:
pandas.DataFrame(
    data=[frame.columns for frame in sheet_frames],
    index=pandas.Index(data=sheet_names),
).loc[:, :8]

Based on sheet names, column names, and similarities between column sets, I
can probably group the sheets like so:

#### Tiller Head Count

* **Head Counts**
* **Head Counts Edited**

#### Net Sweep

* **Sheet2**
* **Sweep Samples Cereals**
* **Sweep Samples Cereals Edited**
* **leafhoppers 2016 cereal sweeps**

#### Extraneous

* Analytical experiments:
  * **Data Sheets Combined**
  * **Pivot chart LH phen**
  * **Sheet3**
  * **aphid sweep vs head count**
  * **head counts vs sweeps graphs**

I don't want any of the sheets grouped under "extraneous". The **Data Sheets
Combined** sheet is uninteresting to me because I want to combine the data in
the sheets in my own way, rather than trust analytical artifacts.

## Tidy Up

### Normalize Column Names

I've already noticed at least one leading space on a column name ("spiders"),
and variations in capitalization. To compensate for this, I'll strip all
leading and trailing whitespace from the column names.

To avoid confusion due to variations in capitalization, I'll convert all column
names to lower case.

In [6]:
for sheet in sheets.values():
    if sheet.columns.size > 0:
        sheet.columns = sheet.columns.str.strip().str.lower()
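
As a small illustration of the effect, here's the same normalization applied to
a toy frame. The column names are made up for the example (only the leading
space on "spiders" comes from the workbook), so treat this as a sketch rather
than a view of the real data.

```python
import pandas

# Hypothetical headers that imitate the problems seen in the workbook:
# a leading space, mixed capitalization, and a trailing space.
toy = pandas.DataFrame(columns=[' spiders', 'Date', 'Field_Name '])

# Same two-step cleanup as above: trim whitespace, then lower-case.
toy.columns = toy.columns.str.strip().str.lower()

normalized = toy.columns.tolist()
print(normalized)
```

After this step, a lookup such as `toy['spiders']` no longer depends on
remembering the stray space.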

### Convert Date & Time Format

Before I can easily examine dates from the worksheets, I must convert them to
proper date values for Python and *pandas*. I'll check the date formats from
the first row of all sheets:

In [7]:
(
    pandas.concat(
        {
            name: sheet.loc[
                :,
                sheet.columns.str.contains('date')
            ].reset_index(drop=True)
            for (name, sheet) in sheets.items()
                if sheet.columns.size > 0
        },
        axis='columns',
    )
    .loc[0]
    .unstack()
    .fillna('')
)

According to the documentation for [`pandas.to_datetime`], I can convert a
string to a true datetime value. The function uses [Python datetime format
strings]. Here are the date columns, the format strings, and the worksheet
names I see:

***collection_date*** (`%d_%m_%Y`):

* **leafhoppers 2016 cereal sweeps**
* **Sheet2**
* **Sweep Samples Cereals**

***date*** (`%d/%m/%Y`):

* **Head Counts**

***date*** (no conversion):

* **aphid sweep vs head count**
* **Sweep Samples Cereals Edited**
* **Head Counts Edited**
* **Data Sheets Combined**

In some worksheets, the ***date*** column is present even though there's a
better column for the date (***collection_date***). I'll drop the ***date***
column and rename the ***collection_date*** column to fill its place.

After conversion, I'll output a sample date from the column to prove that the
column ***date*** is in fact a datetime.

[`pandas.to_datetime`]: http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.to_datetime.html
[Python datetime format strings]: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [8]:
for name, sheet in ((name, sheets[name]) for name in (
    'Head Counts',
    'Sheet2',
    'Sweep Samples Cereals',
    'leafhoppers 2016 cereal sweeps',
)):
    display(HTML(markdown(f'#### {name}')))
    if name == 'Head Counts':
        sheet['date'] = pandas.to_datetime(
            sheet['date'],
            format='%d/%m/%Y',
        )
    else:
        sheet.drop(columns='date', inplace=True)
        sheet.rename(
            columns={'collection_date': 'date'},
            inplace=True,
        )
        sheet['date'] = pandas.to_datetime(
            sheet['date'],
            format='%d_%m_%Y',
        )
    display(sheet.date.head(1))
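
The order of `%d` and `%m` in a format string matters whenever both numbers are
12 or less, because the same text then parses without error either way. Here's
a minimal sketch using the standard library's `datetime.strptime`, which
follows the same format codes. The date strings are invented for the example
and are not taken from the workbook.

```python
from datetime import datetime

# Invented value that is ambiguous on purpose: 6 July or 7 June?
text = '06_07_2016'

day_first = datetime.strptime(text, '%d_%m_%Y')
month_first = datetime.strptime(text, '%m_%d_%Y')
print(day_first.date(), month_first.date())

# A day greater than 12 exposes the wrong format immediately.
try:
    datetime.strptime('25_07_2016', '%m_%d_%Y')
    month_first_accepts_day_25 = True
except ValueError:
    month_first_accepts_day_25 = False
print(month_first_accepts_day_25)
```

A sanity check along these lines is worth running on a few rows with a day
greater than 12 before trusting any converted column.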

# Select Primary Data Sets

## Sweep Samples

### Candidates

* **Sheet2**
* **Sweep Samples Cereals**
* **Sweep Samples Cereals Edited**
* **leafhoppers 2016 cereal sweeps**

From my hands-on exploration in Excel, I got the feeling that **Sheet2** is the
most complete. It generally had more rows and looked tidiest. I'll compare it
to the other candidates, one at a time.

### **Sheet2** vs **Sweep Samples Cereals**

I'd like to confirm my hope that all data in **Sweep Samples Cereals** is
included in the vaguely named **Sheet2**. I'll put them both in a dictionary
called `sheets_to_compare` for quick reference.

In [9]:
sheet_names = (
    'Sweep Samples Cereals',
    'Sheet2',
)
sheets_to_compare = {
    name: sheets[name]
    for name in sheet_names
}

I'll refer to **Sweep Samples Cereals** as `ssc` and to **Sheet2** as `s2`.

In [10]:
ssc, s2 = sheets_to_compare.values()

#### Columns

I'll convert each column list to a `Series` and display them side-by-side in a
`DataFrame`.

In [11]:
from typing import Dict


def emojify_boolean(frame: pandas.DataFrame) -> pandas.DataFrame:
    return (
        frame.replace({
            True: '✅',
            False: '✖️',
        })
    )


def compare_columns(frames: Dict[str, pandas.DataFrame]) -> pandas.DataFrame:
    return emojify_boolean(
        pandas.DataFrame(
            data={
                name: sheet.columns.to_series(name=name)
                for name, sheet in frames.items()
            }
        ).notna()
    )

In [12]:
compare_columns(sheets_to_compare)

I see only a few differences. I'll determine the exact asymmetry using
*pandas*'s `Index.symmetric_difference`:

In [13]:
ssc.columns.symmetric_difference(s2.columns).tolist()

I believe the unnamed columns are empty. The other columns have calculated
values that I don't trust. **Sweep Samples Cereals** is probably useless for my
purposes.
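
For reference, this is how `Index.symmetric_difference` behaves on two small,
made-up column sets: it returns the labels found in exactly one of the two
indexes. The labels below are chosen for illustration and don't reproduce the
full column lists of either sheet.

```python
import pandas

# Illustrative column sets; two labels are shared, one is unique to each side.
left = pandas.Index(['date', 'site', 'total sweeps'])
right = pandas.Index(['date', 'site', 'distance(m)'])

# Labels present in exactly one index.
only_in_one = left.symmetric_difference(right).tolist()

# One-sided differences show which frame each label came from.
only_left = left.difference(right).tolist()
only_right = right.difference(left).tolist()
print(only_in_one, only_left, only_right)
```

The one-sided `difference` calls are handy when I need to know which sheet
owns each odd column, which the symmetric version doesn't report.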

#### Date Range

In [14]:
def len_unique(pandas_object):
    return len(pandas_object.unique())


# noinspection PyTypeChecker
pandas.DataFrame(
    {
        name: frame.date
        for name, frame in sheets_to_compare.items()
    }
).apply(
    [
        pandas.Series.max,
        pandas.Series.min,
        len_unique,
    ]
)

I need to check if *all* the dates in **Sweep Samples Cereals** are in
**Sheet2**.

In [15]:
ssc.date.isin(s2.date).all()

I'm curious if the converse is also true:

In [16]:
s2.date.isin(ssc.date).all()

Since **s2** has some dates which **ssc** does not, I'd like to know how big
the difference is. I'll view all date values in **s2** which fail to match with
those in **ssc**, ignoring duplicates.

In [17]:
(
    s2.date.loc[~s2.date.isin(ssc.date)]
    .drop_duplicates()
)

So, this isn't a lot, but it's worth noting. These dates are only in **s2**.
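
The containment checks above follow a simple pattern that can be sketched on
toy data. The dates here are invented; only the shape of the comparison
mirrors what I did with **ssc** and **s2**.

```python
import pandas

# Invented dates: `small` is fully contained in `large`, but not vice versa.
small = pandas.Series(pandas.to_datetime(['2016-06-01', '2016-06-08']))
large = pandas.Series(pandas.to_datetime(
    ['2016-06-01', '2016-06-08', '2016-06-15', '2016-06-15']
))

small_in_large = bool(small.isin(large).all())
large_in_small = bool(large.isin(small).all())

# Values of `large` with no match in `small`, ignoring duplicates.
extra = large.loc[~large.isin(small)].drop_duplicates()
print(small_in_large, large_in_small, extra.tolist())
```

Checking both directions matters: a matching date range, or even a matching
count of unique dates, doesn't guarantee that the sets are the same.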

#### Avoiding reduced data (totals)

I see a "**_total sweeps_**" column only in **Sweep Samples Cereals**, which
leads me to suspect that **Sweep Samples Cereals** comprises aggregates (sums),
possibly based on values present in **Sheet2**. However, **Sheet2** has two
additional dates, which means that it has data that the other worksheet doesn't.
Now, it's a matter of determining whether **Sheet2** is comprehensive enough to
preclude the use of **Sweep Samples Cereals**.

##### Value Counts on ***distance(m)***

It's my impression that the actual, physical sweeps were at different distances
from some physical origin, and later reduced to sums. If I peek at the
***distance(m)*** column, I should see a clear difference between these varied
distances and the "distance" value used by records which have multiple sweeps
combined. (Ideally, there wouldn't be a distance value for aggregates at all,
but this illustrates the problem of mixing raw data with partially analysed
data.)

In [18]:
display(
    pandas.concat(
        {
            name: frame['distance(m)']
            for name, frame in sheets_to_compare.items()
        },
        axis='columns',
    )
    .apply(pandas.value_counts)
    .fillna('-')
)

Indeed, **Sheet2** has data from observations at various "distances", while
**Sweep Samples Cereals** has only the label, ***Combined***.

**Sheet2** has more rows because it isn't totalling up the sweeps from various
distances. That makes **Sheet2** less processed, and therefore more "raw" (less
likely to be corrupted by human error during prior analysis). Since I don't want
reduced (summed) data, I don't want **Sweep Samples Cereals**.

#### Conclusion

Better candidate of these two:

* **Sheet2**

### **Sheet2** vs **Sweep Samples Cereals Edited**

In [19]:
sheets_to_compare = {
    name: sheets[name]
    for name in (
        'Sweep Samples Cereals Edited',
        'Sheet2',
    )
}
ssce, s2 = sheets_to_compare.values()
sheet_names = sheets_to_compare.keys()

#### Columns

In [20]:
pandas.options.display.max_rows = 140

compare_columns(sheets_to_compare)

**Sweep Samples Cereals Edited** seems to have left out some columns that would
be expected to carry finely categorized subjects, such as various instars of
aphids. Furthermore, it appears that **Sweep Samples Cereals Edited** has a
column, ***Total Sweeps***, which is probably an artefact from previous analysis
efforts. It's also missing the ***Distance(m)*** column; another sign of
analysis, and therefore loss of some information.

Unless analysis of date labels reveals that **Sweep Samples Cereals Edited** has
dates that are missing from **Sheet2**, I'll assume it's not worth closer
examination.

#### Date Range

In [21]:
def len_unique(pandas_object):
    return len(pandas_object.unique())


# noinspection PyTypeChecker
pandas.DataFrame(
    {
        name: frame.date
        for name, frame in sheets_to_compare.items()
    }
).apply([
    pandas.Series.max,
    pandas.Series.min,
    len_unique,
    len,
])

It appears that **Sweep Samples Cereals Edited**, like **Sweep Samples
Cereals**, has slightly fewer unique dates than **Sheet2** but the same range,
so I probably won't miss anything if I ignore it. To be sure, I need to check if
all those specific dates in **Sweep Samples Cereals Edited** are actually in
**Sheet2**, and not just the range.

In [22]:
ssce.date.isin(s2.date).all()

Excellent. I see no reason to pay attention to **Sweep Samples Cereals Edited**
or **Sweep Samples Cereals** anymore, except for the possibility that the actual
sample data differs. For the time being, I'll ignore that possibility and move
ahead. 

#### Conclusion

Optimal candidate regarding these two sheets:

* **Sheet2**

### **Sheet2** vs **leafhoppers 2016 cereal sweeps**

In [23]:
sheets_to_compare = {
    name: sheets[name]
    for name in (
        'leafhoppers 2016 cereal sweeps',
        'Sheet2',
    )
}
lh2016, s2 = sheets_to_compare.values()
sheet_names = sheets_to_compare.keys()

#### Columns

In [24]:
compare_columns(sheets_to_compare)

Clearly, **leafhoppers 2016 cereal sweeps** is focused on leafhoppers. Because
the goal of this analysis is to compare aphid numbers, I don't see the
relevance of this leafhopper data.

#### Conclusion

Given that the leafhopper data isn't pertinent, **Sheet2** remains the best
candidate.

### Sweeps: Overall Best Data Source

After all analysis, **Sheet2** appears to be the purest, most relevant base for
comparison of "sweep" sample data to that of "tiller" count data.

## Tiller Head Counts

### Candidates

Only two worksheets in the source Excel workbook seem relevant to my need for
"tiller head" samples:

* **Head Counts**
* **Head Counts Edited**

### **Head Counts** vs **Head Counts Edited**

In [25]:
sheets_to_compare = {
    name: sheets[name]
    for name in (
        'Head Counts',
        'Head Counts Edited',
    )
}
hc, hce = sheets_to_compare.values()
sheet_names = sheets_to_compare.keys()

#### Columns

In [26]:
compare_columns(sheets_to_compare)

The main difference here seems to be the aggregation in the "edited" sheet,
indicated by the columns named ***EGA/head***, ***EGA_total***, etc. For the
purpose of achieving the project objective, the columns representing sums of
values from other columns would likely be redundant because I intend to use the
original data only.

#### Date Range

Now, compare for completeness:

In [27]:
def len_unique(pandas_object):
    return len(pandas_object.unique())

In [28]:
# noinspection PyTypeChecker
pandas.DataFrame(
    {
        name: frame.date
        for name, frame in sheets_to_compare.items()
    }
).apply([
    pandas.Series.max,
    pandas.Series.min,
    len_unique,
    len,
])

These may be using the exact same set of dates. I'll confirm:

In [29]:
# Compare the date values themselves (not the row index), in both directions.
hc.date.isin(hce.date).all() and hce.date.isin(hc.date).all()

The dates are identical in both sheets, so the ***date*** column shows no
basis for choosing one over the other.

### Head Counts: Overall Best Data Source

Since the only difference between these two frames is the unwanted extra
columns, the simplest sheet wins this.

* **Head Counts**

# Align Selected Data Sets

The names and corresponding data frames, from the worksheets in the source
workbook document (Excel):

In [30]:
sheets_to_compare = {
    'Head Counts': hc,
    'Sheet2': s2,
}
sheet_names = sheets_to_compare.keys()

## Compare Columns

Comparing alphabetically sorted column names between our two primary data sets:

In [31]:
pandas.concat(
    (
        pandas.DataFrame(
            data={
                'Sheet2': s2.columns.difference(hc.columns),
            }
        ),
        pandas.DataFrame(
            data={
                'Head Counts': hc.columns.difference(s2.columns),
            }
        ),
        pandas.DataFrame(
            data={
                'common': s2.columns.intersection(hc.columns),
            }
        ),
    ),
    sort=False,
    axis='columns',
).fillna('--')

That's quite a large difference in columns for these sets. Because my target
sample set is only aphids, I can first drop the unimportant columns, then
normalize all remaining labels.

Before dropping any columns, I need to identify any remaining columns with
correlative values, such as time and place labels which can serve as indices for
alignment.

### Lookup Columns (Record Indices)

Reading through the list of remaining, non-aphid related columns, I see some
that don't mention any organism by name. These columns may be useful for
indexing, which is crucial to aligning the two data sources.

In [32]:
non_organism_column_names = pandas.Series(data=(
    'crop',               # location
    'date',               # time
    'date_by_week',       # redundant date coding
    'distance(m)',        # location
    'field',              # location
    'field_name',         # location
    'id',                 # not relevant for this analysis
    'julian_date',        # redundant date coding
    'number of samples',  # used for sums, which are redundant
    'province',           # location
    'site',               # location
    'zadoks_stage',       # not relevant for this analysis
))

Here are the names of the matching columns from each frame:

In [33]:
emojify_boolean(pandas.DataFrame(
    {
        name: dict(zip(non_organism_column_names,
                       non_organism_column_names.isin(frame.columns)))
        for name, frame
        in zip(sheet_names, (hc, s2))
    }
))

## Align Columns

My next goal is to ensure the record label columns are in a common format so
that I can align the frames. I'll need to clean up ***crop***, ***field*** and
***site***, but ***date*** is already well formed.

### ***field*** vs ***field_name***

The columns **hc**.***field*** and **s2**.***field_name*** seem to relate to
***site*** and ***crop***. Before I address the values of the columns, I'll
rename **s2**.***field_name*** for consistency.

In [34]:
for frame in (hc, s2):
    frame.rename(
        columns={'field_name': 'field'},
        inplace=True,
    )

### Time (Date)

The columns containing time information were normalized in an [earlier stage] of
my analysis. Those columns are ready to align in combination with other unique
index labels. Here's how the rest of the index labels shape up when the frames
are aligned by date:

[earlier stage]: #Convert-Date-&-Time-Format

In [35]:
hhc = hc.set_index('date')
ss2 = s2.set_index('date')
pandas.concat(
    (
        pandas.concat(  # convenient way to add sheet name to column hierarchy
            {name: sheet},
            axis='columns',
        ).reorder_levels([1, 0], axis='columns')
        # Sheet names paired with date-indexed frames.
        for name, sheet in zip(sheet_names, (hhc, ss2))
    ),
    axis='rows',
).loc[
    # Overlapping dates in date-index frames:
    hhc.index.intersection(ss2.index),
    # Non-date indices:
    (
        [
            'site',
            'field',
            'field_name',
            'crop',
        ],
    )
]

Not bad at all. I just need to normalize the string values so I can code them as
labels and align the data sets optimally.

### Visualize Leading & Trailing Whitespace

I've noticed irregular use of space characters in some of the non-numeric values
within the `DataFrame` objects. I could just strip all leading and trailing
spaces, but it's not a good habit to attempt to clean text too early in the
course of analysis because I might overlook something significant and lose
important metadata before I notice the problem.

To visualize these invisible characters, I'll compose a function that replaces
the space character with something more visible (an emoji). Displaying values
this way makes trailing or leading whitespace obvious, without overwriting any
data.

In [36]:
def show_spaces(x):
    return x.str.replace(
        ' ', '⬜️'
    )

Here's an example:

In [37]:
show_spaces(
    pandas.Series(
        [
            ' hello, friend ',
            'I love    space    !',
        ]
    )
)

### Crop

Here are the ***crop*** values for both data frames:

In [38]:
crops = pandas.concat(
    (
        frame.crop.apply(str)
        for frame in sheets_to_compare.values()
    ),
    keys=sheet_names,
    sort=True,
).drop_duplicates().reset_index(drop=True).sort_values()

show_spaces(crops)

Clearly, there are some variations that should be corrected, in both data
frames.

* whitespace
* letter case
* word separation

I'll write a function that transforms any given ***crop*** value into a uniform
representation of the crop it's intended to represent.

In [39]:
def normalize_str(value:(str, float), separator:str=' ')->(str, float):
    # NaN never compares equal to itself, so test for it with isna().
    if pandas.isna(value):
        return value
    str_value = str(value)

    # Convert camel case strings. E.g.: "WinterWheat" -> "winter wheat"
    is_mixed_case = (
        not (str_value.islower() or str_value.isupper())
        and (str_value.upper() != str_value.lower())
    )
    if is_mixed_case:
        word_index = [
            # Remember starting positions of all uppercase characters.
            index for index, char in enumerate(str_value)
            if char.isupper()
        ] + [None]  # Final null for string slicing with […:word_index[i + 1]].
        if word_index:
            words = [
                # Strip leading and trailing whitespace from each word.
                str_value[word_index[i]:word_index[i + 1]].strip()
                for i in range(len(word_index) - 1)
            ]
            str_value = separator.join(words)  # Rejoin words.

    # Replace non-alphanumeric characters with separator (default: space).
    transformed = re.compile(r'[^a-zA-Z0-9]').sub(separator, str_value.lower())

    # De-pluralize special case.
    if transformed.endswith('oats'):
        transformed = transformed[:-1]

    return transformed

Previewing the results, with spaces shown as white squares again:

In [40]:
pandas.concat(
    (
        crops,
        crops.apply(normalize_str),
    ),
    keys=("Before", "After"),
    axis='columns',
).sort_values('After').apply(show_spaces)

There are some concerning names, such as "nan", "unlisted", and "0", but the
renaming result looks good. I'll apply the changes:

In [41]:
for frame in (hc, s2):
    frame.crop = frame.crop.apply(normalize_str)
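
The camel-case step inside `normalize_str` walks through the positions of
uppercase characters. As a cross-check, the same idea can be sketched with a
regular expression that inserts a space wherever a lowercase letter is directly
followed by an uppercase one. This is an alternative technique, not the
function used above, and `split_camel_case` is a name I've made up for it.

```python
import re


def split_camel_case(text: str) -> str:
    """Insert a space between a lowercase letter and a following uppercase one."""
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)


# "WinterWheat" is the example from the comment in normalize_str.
split_example = split_camel_case('WinterWheat').lower()
unchanged_example = split_camel_case('winter wheat')
print(split_example, '|', unchanged_example)
```

Unlike the index-based version, this sketch keeps any lowercase text that
appears before the first capital letter, and it leaves whitespace untouched.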

### Site

In [42]:
site_values = (
    pandas.concat(
        (frame[['site']] for frame in (hc, s2)),
        keys=sheet_names,
        names=['Sheet Name', 'index',],
    )
    .drop_duplicates()
    .sort_values('site')
)
show_spaces(site_values.site)

This is also going to need some normalization. I'll write a string reducer
function to transform any given value into a uniform representation by stripping
insignificant characters and letter case. From that, I can map the values to
preferred representations.

#### Reduce Label Value

In order to match badly formed labels with their normal representation, I need a
string transformation function so that the values in the data frame can be used
to look up the preferred, normal form.

* convert numeric values to text
* strip non alphanumeric characters
* convert alphabetic characters to lower case

In [43]:
def alphanumeric_lower(value):
    return re.compile(r"[^a-z0-9]").sub('', str(value).lower())
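
Here's a quick demonstration of what the reducer does to a few spellings of the
same site. The function is repeated so the snippet runs on its own; "Yellow
Creek" and "Yellowcreek" appear in the data, while the other spelling is
invented to show the effect of punctuation.

```python
import re


def alphanumeric_lower(value):
    # Same reducer as above, repeated so this snippet is self-contained.
    return re.compile(r"[^a-z0-9]").sub('', str(value).lower())


variants = [' Yellow Creek', 'Yellowcreek', 'yellow_creek ']
reduced = {alphanumeric_lower(variant) for variant in variants}
print(reduced)

# Caveat: numeric values are converted with str() first, so the decimal
# point of a float is stripped along with other punctuation.
float_key = alphanumeric_lower(12.0)
print(float_key)
```

All three spellings collapse to one key, which is what makes the lookup
possible. The float caveat doesn't matter for site names, but it would for any
column holding numeric labels.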

#### Nominal Values

If I had a larger data set, I would use the `mode` function to automatically
find the most frequently used representation for any given reduced value.
Unfortunately, the groups are far too small and the values too varied:

In [44]:
(
    site_values
    .reset_index(level='index', drop=True)
    .loc[:, ['site']]
    .apply(show_spaces)
)

Even if I strip the leading and trailing whitespace, I would still end up with
ambiguous candidate selections. The work required to make a function that would
know how and when to convert labels like "`Yellowcreek`" to "`Yellow Creek`"
would be unreasonable given the scope of this project, so automation might not
be the best choice for choosing normal forms.

I'll make the preferred identifier list manually, then apply the changes
automatically. Here is that list along with the equivalent normal value for
indexing:

In [45]:
preferred_site_id = pandas.Series(
    name='site',
    data={
        alphanumeric_lower(item): item
        for item in [
          'Alvena',
          'Clavet',
          'Indian Head',
          'Kernan',
          'Llewellyn',
          'Meadow Lake',
          'Melfort',
          'Outlook',
          'SEF',
          'Wakaw',
          'Yellow Creek',
        ]
    },
)
preferred_site_id.index.set_names(['site_index'], inplace=True)
preferred_site_id.to_frame()

With this list, I can compare the normalized values to the ones in the actual
data and apply the preferred name where it matches. I'll add this index to both
data frames:

In [46]:
for frame in (hc, s2) + (site_values, ):
    frame['site_index'] = frame.site.apply(alphanumeric_lower)
    frame.set_index('site_index', append=True, inplace=True)

Everything that isn't in the preferred site name list:

In [47]:
site_values[
    ~ (  # ~ is the 'not' operator
        site_values.index.get_level_values('site_index')
        .isin(preferred_site_id.index)
    )
].reset_index('index', drop=True)

These outliers are only in one of the sheets/frames, so I can move on without
doing anything further on these.

[unique site]: #Site

Which records would be changed if I relabel ***site***?

In [48]:
(
    pandas.concat(
        (
            site_values,
            preferred_site_id.to_frame().combine_first(site_values),
        ),
        keys=['Before', 'After'],
        axis='columns',
    )
    .reorder_levels([1, 0], axis='columns')
    .reset_index(['site_index', 'index'], drop=True)
    .apply(show_spaces)
)

This looks good to me. I'll apply the names:

In [49]:
for frame in (hc, s2):
    frame.loc[:, 'site'] = (
        preferred_site_id.to_frame()
        .combine_first(frame)
        .loc[:, 'site']
    )
    frame.reset_index(
        level='site_index',
        drop=True,
        inplace=True,
    )
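
The relabelling above relies on index alignment through `combine_first`. The
same lookup can be sketched more directly with `Series.map` on a toy series.
This is an alternative formulation for illustration, not the code I ran; the
messy spellings are invented, and unmatched values (such as `Alberta`) are left
as they were.

```python
import re

import pandas


def alphanumeric_lower(value):
    # Reducer repeated here so the snippet is self-contained.
    return re.sub(r'[^a-z0-9]', '', str(value).lower())


# Reduced key -> preferred spelling, for two of the preferred names.
preferred = {
    alphanumeric_lower(name): name
    for name in ('Yellow Creek', 'Meadow Lake')
}

# Invented messy labels, plus one value with no preferred form.
sites = pandas.Series([' yellowcreek', 'Meadow  lake ', 'Alberta'])

relabelled = (
    sites
    .map(alphanumeric_lower)  # reduce each label to its lookup key
    .map(preferred)           # look up the preferred form (NaN if missing)
    .fillna(sites)            # keep the original where nothing matched
)
print(relabelled.tolist())
```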

I'll review the label values for ***site*** in both data frames, to see the
result of those changes:

In [50]:
show_spaces(
    pandas.concat(
        (frame[['site']] for frame in (hc, s2)),
    )
    .loc[:, 'site']
    .reset_index(drop=True)
    .sort_values()
    .drop_duplicates()
)

### Field

Now, regarding the values, I suspect redundancy in ***field***. Here's why I
feel that way—look at some of the values from both data sets:

In [51]:
pandas.options.display.max_rows = 20

index_column_names = ['crop', 'site', 'date', 'field',]
(
    pandas.concat(
        (
            hc[index_column_names],
            s2[index_column_names],
        ),
        keys=sheet_names,
        names=['worksheet', 'index',],
    )
    .sort_values(
        by=index_column_names,
    )
    .set_index(
        keys=['crop', 'site'],
    )
    .drop_duplicates()
    .apply(
        {
            'field': show_spaces,
            'date': lambda x: x,
        }
    )
)

It's clear to my eyes that in some cases, ***site*** and ***crop*** have been
concatenated together to produce the value in ***field***.

* ***site*** + ***crop*** = ***field***

Therefore, I can think of the columns as falling into two categories:

**Primary data:**

* ***site***
* ***crop***

**Mixed data:**

* ***field***

I think it's most beneficial to use ***crop*** and ***site*** rather than
***field*** whenever possible. Therefore, I can safely disregard that derived
column and rely on the more normalized ***crop*** and ***site*** for indexing.

The only complication is the numeric suffix present in some values of
***field***, which delineates areas of the site at which the samples were
observed. The numeric values are unique per date-crop-place combination but not
unique per site-crop combination, so I'll have to treat it as a separate column
that represents an additional dimension to our independent variables.

Extracting the unique pairs of dates and the corresponding numerals from the
processed ***field*** value for that date:

In [52]:
pandas.options.display.max_rows = 40

(
    _  # Previous cell output.
    .reset_index()
    .apply(
        dict(
            tuple(
                dict.fromkeys(
                    (
                        'date',
                        'site',
                        'crop',
                    ),
                    lambda x: x,
                ).items() 
            ) + tuple(
                {
                    'field': lambda x: x.str.extract(
                        pat=r'(?P<text>\D*)(?P<number>\d*)',
                    ),
                }.items(),
            ),
        )
    )
    .drop_duplicates(
        subset=[
            ('date', 'date'),
            ('field', 'number'),
        ],
    )
    .dropna(
        subset=[
            ('date', 'date'),
            ('field', 'number'),  # @todo: needed?
        ],
    )
)


Based on this look at the information, I'd like to throw away the remaining text
portion after extracting the numbers when I apply this to my data set.

Just a reminder: the samples corresponding to ***site*** values `Alberta` and
`Manitoba` aren't in the other `DataFrame`, so the fact that the ***field***
values for these records have a different pattern and are therefore mangled is
completely fine for this project—they'll be entirely dropped when the frames are
aligned.

I'll replace the messy values of the ***field*** column with just the number
portion:

In [53]:
for sheet in (hc, s2):
    sheet.field = (
        sheet.field
        .str.extract(pat=r'(?P<text>\D*)(?P<number>\d*)')
        .loc[:, 'number']
        .apply(pandas.to_numeric, downcast='integer')
    )
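
To make the behaviour of that pattern concrete, here's a small sketch with the
standard `re` module, using the same regular expression on invented
***field*** values. `field_number` is a hypothetical helper for this
illustration only.

```python
import re

# Same pattern as above: a run of non-digits, then a run of digits.
FIELD_PATTERN = re.compile(r'(?P<text>\D*)(?P<number>\d*)')


def field_number(value: str) -> str:
    """Return the first run of digits that follows the leading non-digits."""
    return FIELD_PATTERN.match(value).group('number')


trailing = field_number('Alvena wheat 2')  # digits at the end
missing = field_number('Outlook')          # no digits at all
leading = field_number('2016 plot 3')      # digits at the start win
print(repr(trailing), repr(missing), repr(leading))
```

The last case shows why values with a different layout come out mangled: the
pattern captures only the first run of digits and ignores everything after it.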

### Number of Samples & Distance

The ***number of samples*** field is only present in **Sheet2**. I want to know
a little bit about the values:

In [54]:
s2['number of samples'].value_counts(dropna=False)

There are many missing values in this column. I presume the number of samples
only relates to the "combined" values, but I need to confirm. Here's **Sheet2**
with only the two relevant columns, summarized:

In [55]:
s2[
    ['distance(m)', 'number of samples']
].dropna(how='all').drop_duplicates()

This confirms that where the ***number of samples*** is not indicated, the
corresponding ***distance(m)*** value is numeric. The converse is also true:
non-numeric ***distance(m)*** corresponds with positive values in ***number of
samples***. That's consistent with my expectations for the appearance of labels
on previously aggregated data. I'll be throwing those "Combined" records away.
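
The same check can be expressed as a single boolean instead of being read off a
table. This is only a sketch on an invented stand-in frame (`toy` and its values
are hypothetical); it assumes that "numeric" means "convertible by
`pandas.to_numeric`":

```python
import pandas

# Toy stand-in for the two Sheet2 columns; the values are invented.
toy = pandas.DataFrame({
    'distance(m)': [5, 10, 'Combined', 25, 'Combined'],
    'number of samples': [None, None, 3, None, 4],
})

distance_is_numeric = pandas.to_numeric(
    toy['distance(m)'], errors='coerce').notna()
has_sample_count = toy['number of samples'].notna()

# True when every row has exactly one of the two:
# a numeric distance XOR a recorded number of samples.
mutually_exclusive = bool((distance_is_numeric ^ has_sample_count).all())
```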

Drop records with **_distance(m)_** equal to "`Combined`":

In [56]:
s2 = s2.loc[s2['distance(m)'] != 'Combined']

### Identify Columns With Aphid Sample Data

The first step in normalizing column names for aphid samples is identifying the
set of columns that mention aphids by name. For better comparison, I'll filter
for columns referring to aphids.

Here are some terms I know to be indicative of a relationship to aphid samples:

* aphid
* ega
* bco
* greenbug

Not aphid related:

* aphidencyrtus
* aphidiius

Categorized as natural enemy:

* aphid_mummies

Analytical artifacts:

* total
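
The guidelines above could be sketched as a simple filter. This is an
illustration only: the helper `mentions_aphid_sample` is hypothetical, it relies
on plain substring matching (an assumption), and the `candidates` list is a
hand-picked sample of column names rather than the full set:

```python
# Terms copied from the lists above.
include_terms = ('aphid', 'ega', 'bco', 'greenbug')
exclude_terms = ('aphidencyrtus', 'aphidiius', 'aphid_mummies', 'total')


def mentions_aphid_sample(column_name):
    """True if the name has an include term and no exclude term."""
    name = column_name.lower()
    if any(term in name for term in exclude_terms):
        return False
    return any(term in name for term in include_terms)


# A hand-picked sample of column names, not the full set.
candidates = [
    'ega_alate', 'bco_leaf', 'greenbug_apt', 'pea aphids',
    'aphid_mummies', 'aphidiius_sp.', 'aphidencyrtus_sp',
    'total_alate_aphids', 'spiders',
]
selected = [name for name in candidates if mentions_aphid_sample(name)]
```

A filter like this can't catch a name such as ***4th_instar_pre-alate***, which
mentions no aphid term at all, so the result still needs a manual review.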

Following these guidelines, the relevant columns are:

**English grain aphid**:

* ***1st_instar_ega***
* ***2nd_instar_ega***
* ***3rd_instar_ega***
* ***3rd_instar_ega_pre-alate***
* ***ega alate***
* ***ega_alate***
* ***ega_apt***
* ***ega_grn***
* ***ega_head***
* ***ega_leaf***
* ***ega_red***
* ***sitobion_avenae_ega_green (wingless)***
* ***sitobion_avenae_ega_red***

This one is ambiguous, but it's probably referring to EGA:

* ***4th_instar_pre-alate***

**Bird cherry oat aphid**:

* ***bco_alate***
* ***bco_apt***
* ***bco_head***
* ***bco_leaf***
* ***bird_cherry_oat_aphid***

**Greenbug aphid**:

* ***greenbug_alate***
* ***greenbug_apt***
* ***greenbug_aphid***

**Pea aphid**:

* ***pea aphids***

The pea aphid data is only in **Sheet2**, so it can be disregarded when
comparing the sheets. All of the column names, tagged by aphid type, can go in a
dictionary for quick reference:

In [57]:
aphid_column_names = {
    'ega': (
        '1st_instar_ega',
        '2nd_instar_ega',
        '3rd_instar_ega',
        '3rd_instar_ega_pre-alate',
        'ega alate',
        'ega_alate',
        'ega_apt',
        'ega_grn',
        'ega_head',
        'ega_leaf',
        'ega_red',
        'sitobion_avenae_ega_green (wingless)',
        'sitobion_avenae_ega_red',
        '4th_instar_pre-alate',
    ),
    'bco': (
        'bco_alate',
        'bco_apt',
        'bco_head',
        'bco_leaf',
        'bird_cherry_oat_aphid',
    ),
    'greenbug': (
        'greenbug_alate',
        'greenbug_apt',
        'greenbug_aphid',
    ),
    'pea': (
        'pea aphids',
    ),
}
aphid_column_names_level_names = ['aphid_type', 'column_name']

In [58]:
emojify_boolean(
    pandas.DataFrame(
        data={
            name: sheet.columns[
                sheet.columns.isin(
                    sum(
                        aphid_column_names.values(),
                        ()
                    )
                )
            ].to_series(name=name)
            for name, sheet in sheets_to_compare.items()
        },
    ).notna()
)

In [59]:
index_columns = ['date', 'site', 'crop', 'field']
pandas.concat(
    (frame.set_index(index_columns).loc[hhc.index.intersection(ss2.index)]
     for frame in sheets_to_compare.values()),
    sort=False,
).reindex(
    columns=pandas.MultiIndex.from_tuples(
        tuple(((aphid_type, column_name)
               for aphid_type, column_names in aphid_column_names.items()
               for column_name in column_names)),
        names=aphid_column_names_level_names,
    ),
    level=1,
)

In [60]:
aphid_column_names = {
    'ega': {
        'alate': (
            'ega alate',
            'ega_alate',
        ),
        'apterous': (
            'ega_apt',
            'sitobion_avenae_ega_green (wingless)',
        ),
        'pre-alate': (
            '3rd_instar_ega_pre-alate',
            '4th_instar_pre-alate',
        ),
        'red': (
            'ega_red',
            'sitobion_avenae_ega_red',
        ),
        'green': (
            'ega_grn',
            'sitobion_avenae_ega_green (wingless)',
        ),
        'head': (
            'ega_head',
        ),
        'leaf': (
            'ega_leaf',
        ),
        'uncategorized': (
            '1st_instar_ega',
            '2nd_instar_ega',
            '3rd_instar_ega',
        )
    },
    'bco': {
        'alate': (
            'bco_alate',
        ),
        'apterous': (
            'bco_apt',
        ),
        'head': (
            'bco_head',
        ),
        'leaf': (
            'bco_leaf',
        ),
        'uncategorized': (
            'bird_cherry_oat_aphid',
        ),
    },
    'greenbug': {
        'alate': (
            'greenbug_alate',
        ),
        'apterous': (
            'greenbug_apt',
        ),
        'uncategorized': (
            'greenbug_aphid',
        ),
    },
    'pea': {
        'uncategorized': (
            'pea aphids',
        ),
    },
}
aphid_column_names_level_names = ['aphid_type', 'category', 'column_name']

### Observation Data

Now that the data frames are properly indexed by **_date_**, **_site_**,
**_field_** and **_crop_**, here's a preview of the frames aligned by place and
time:

In [61]:
pandas.options.display.max_columns = 100

indices = [
    'date',
    'site',
    'crop',
    'field',
]
hhc, ss2 = (frame.set_index(indices) for frame in (hc, s2))
ss2 = (
    ss2.set_index(['distance(m)', 'id'], append=True)
    # `sum(level=...)` was removed in pandas 2.0; group by index level instead.
    .groupby(level=[0, 1, 2, 3]).sum()
)
ss2

In [62]:
hc.columns.to_list()

In [63]:
s2_labels = {
    'id': (
        'index', 'unique',),
    'province': (
        'index', 'place',),
    'date': (
        'index', 'time',),
    'sample_by_week': (
        'not applicable',),
    'date_by_week': (
        'not applicable',),
    'julian_date': (
        'not applicable',),
    'site': (
        'index', 'place',),
    'field': (
        'index', 'place',),
    'crop': (
        'observation', 'crop', 'type',),
    'distance(m)': ('index', 'place',),
    'number of samples': (
        'not applicable',),
    'sitobion_avenae_ega_green (wingless)': (
        'observation', 'aphid', 'count', 'ega', 'apterous',),
    'sitobion_avenae_ega_red': (
        'observation', 'aphid', 'count', 'ega', 'uncategorized',),
    'ega alate': (
        'observation', 'aphid', 'count', 'ega', 'alate',),
    'bird_cherry_oat_aphid': (
        'observation', 'aphid', 'count', 'bco', 'uncategorized',),
    'greenbug_aphid': (
        'observation', 'aphid', 'count', 'greenbug', 'uncategorized',),
    'pea aphids': (
        'observation', 'aphid', 'count', 'pea', 'uncategorized',),
    'total_apterous_aphids': (
        'observation', 'aphid', 'count', 'uncategorized', 'apterous',),
    'total_alate_aphids': (
        'observation', 'aphid', 'count', 'uncategorized', 'alate',),
    '4th_instar_pre-alate': (
        'observation', 'aphid', 'count', 'uncategorized', 'pre-alate',),
    '3rd_instar_ega': (
        'observation', 'aphid', 'count', 'ega', 'uncategorized',),
    '3rd_instar_ega_pre-alate': (
        'observation', 'aphid', 'count', 'ega', 'pre-alate',),
    '2nd_instar_ega': (
        'observation', 'aphid', 'count', 'ega', 'uncategorized',),
    '1st_instar_ega': (
        'observation', 'aphid', 'count', 'ega', 'uncategorized',),
    'aphid_mummies_aphelinus_black': (
        'not applicable',),
    'aphid_mummies_aphidius_brown': (
        'not applicable',),
    'aphid_mummies': (
        'not applicable',),
    'female_macrosteles_quadrilineatus': (
        'not applicable',),
    'male_macrosteles_quadrilineatus': (
        'not applicable',),
    'macrosteles_quadrilineatus nymphs': (
        'not applicable',),
    '1st_instar_macrosteles': (
        'not applicable',),
    '2nd_instar_macrosteles': (
        'not applicable',),
    '3rd_instar_macrosteles': (
        'not applicable',),
    '4th_instar_macrosteles': (
        'not applicable',),
    'athysanus_argentarius': (
        'not applicable',),
    'doratura_sp.': (
        'not applicable',),
    'errastunus_ocellaris_lh': (
        'not applicable',),
    'other_leafhoppers': (
        'not applicable',),
    'other coccinellid_adults': (
        'not applicable',),
    'hippodamia_tredecimpunctata_c13': (
        'not applicable',),
    'coccinella_septempunctata_c7': (
        'not applicable',),
    'ladybugs- larvae': (
        'not applicable',),
    'chrysopidae_adults': (
        'not applicable',),
    'chrysoperla_carnea_adult': (
        'not applicable',),
    'chrysopa_oculata_adult': (
        'not applicable',),
    'chrysoperla_carnea_larva': (
        'not applicable',),
    'chrysopa_oculata_larvae': (
        'not applicable',),
    'g_lacewing_larvae': (
        'not applicable',),
    'orius_tristicolor': (
        'not applicable',),
    'anthocoridae': (
        'not applicable',),
    '(damsel bug)nabis_americoferus_adult': ('not applicable',),
    'nabis_americoferus_nymph': (
        'not applicable',),
    'nabicula': (
        'not applicable',),
    'nabis_alternatus': (
        'not applicable',),
    'chalcid_wasps': (
        'not applicable',),
    'aphelinus_varipes': (
        'not applicable',),
    'aphelinus_asychis': (
        'not applicable',),
    'aphelinus_albipodus': (
        'not applicable',),
    'braconid_wasps': (
        'not applicable',),
    'aphidiius_sp.': (
        'not applicable',),
    'any parasitoid_adults': (
        'not applicable',),
    'hyperparasitoids ???': (
        'not applicable',),
    'aphidencyrtus_sp': (
        'not applicable',),
    'asaphes_suspensus': (
        'not applicable',),
    'flies': (
        'not applicable',),
    'lauxaniidae': (
        'not applicable',),
    'dolichopodidae': (
        'not applicable',),
    'syrphid_flies': (
        'not applicable',),
    'hoverflies': (
        'not applicable',),
    'female_delia_sp_1': (
        'not applicable',),
    'male_delia_sp_1': (
        'not applicable',),
    'female_delia_sp_2': (
        'not applicable',),
    'male_delia_sp_2': (
        'not applicable',),
    'anthomyiidae-delia': (
        'not applicable',),
    'midge': (
        'not applicable',),
    'lygus_punctatus': (
        'not applicable',),
    'lygus_elisus': (
        'not applicable',),
    'miridae_lygus lineolaris': (
        'not applicable',),
    'lygus_nymph': (
        'not applicable',),
    'green_grass_bugs_trigonotylus_coelestialium miridae': (
        'not applicable',),
    'green_grass nymphs': (
        'not applicable',),
    'capsus_simulans': (
        'not applicable',),
    'katydids': (
        'not applicable',),
    'thrips': (
        'not applicable',),
    'grasshoppers': (
        'not applicable',),
    'spiders': (
        'not applicable',),
    'spider_tetragnathidae': (
        'not applicable',),
    'mosquitoes': (
        'not applicable',),
    'dragonflies+damsel fly': (
        'not applicable',),
    'flea_beetles hop': (
        'not applicable',),
    'flea_beetles striped': (
        'not applicable',),
    'flea_beetles crucifer': (
        'not applicable',),
    'cicindela': (
        'not applicable',),
    'tychius_picirostris (weevil)': ('not applicable',),
    'bertha_armyworms': (
        'not applicable',),
    'shield_bugs': (
        'not applicable',),
    'worms': (
        'not applicable',),
    'beetles': (
        'not applicable',),
    'maggots': (
        'not applicable',),
    'stink_bugs (adult and nymph)': ('not applicable',),
    'red_mite': (
        'not applicable',),
    'moths': (
        'not applicable',),
    'plant_bugs': (
        'not applicable',),
    'pirate_bugs': (
        'not applicable',),
    'assassin_bug (reduviid bugs)': ('not applicable',),
    'bees': (
        'not applicable',),
    'harvestman': (
        'not applicable',),
    'treehoppers': (
        'not applicable',),
    'cabbage_butterfly': (
        'not applicable',),
    'caterpillar': (
        'not applicable',),
    'legume_bug': (
        'not applicable',),
    'chinch_bug': (
        'not applicable',),
    'ambush_bugs': (
        'not applicable',),
    'ichneumonidae': (
        'not applicable',),
    'pumace_flies (drosophilidae)': ('not applicable',),
    'scorpion_flies': (
        'not applicable',),
    'seed bugs (lygaeidea)': ('not applicable',),
    'seed_corn_beetles': (
        'not applicable',),
    'ufi_bugs': (
        'not applicable',),
    'wasps_other': (
        'not applicable',),
    'eulophid_wasp': (
        'not applicable',),
    'oribatid': (
        'not applicable',),
    'spider_mites': (
        'not applicable',),
    'springtails': (
        'not applicable',),
    'mollusks': (
        'not applicable',),
    'formicidae': (
        'not applicable',),
    'weevil': (
        'not applicable',),
    'lepidopteran_pupa': (
        'not applicable',),
    'unnamed: 129': (
        'not applicable',),
    'hymenoptera_proctotrupidae': (
        'not applicable',),
    'hymenoptera_pteromalidae': (
        'not applicable',),
    'hymenoptera_apidae': (
        'not applicable',),
    'hymenoptera_diplazontinae': (
        'not applicable',),
    'hymenoptera_figitidae': (
        'not applicable',),
    'hymenoptera_aphelinidae': (
        'not applicable',),
    'hymenoptera_perilampidae': (
        'not applicable',),
    'hymenoptera_chalcidoidea': (
        'not applicable',),
    'hymenoptera_ichneumondoidea': (
        'not applicable',),
    'hymenoptera_proctotrupoidea': (
        'not applicable',),
}

In [64]:
[(key, *values) for key, values in s2_labels.items()]

In [65]:
pandas.MultiIndex.from_tuples(
    (key, *values) for key, values in s2_labels.items()).to_frame().T

In [66]:
sheets_to_compare = dict(zip(
    sheet_names, (hhc, ss2)))  # Sheet name paired with date-indexed frame.

pandas.concat(
    (
        pandas.concat(  # Convenient way to add sheet name to column hierarchy.
            {
                name: sheet.loc[
                    hhc.index.intersection(ss2.index),  # only dates in common
                ].reindex(
                    columns=pandas.MultiIndex.from_tuples(
                        (
                            (
                                aphid_type,
                                aphid_categorization,
                                column_name,
                            )
                            for aphid_type, inner_hierarchy in aphid_column_names.items()
                            for aphid_categorization, column_names in inner_hierarchy.items()
                            for column_name in column_names
                        ),
                        names=aphid_column_names_level_names
                    ),
                    level=2,
                )
            },
            axis='columns',
        ).reorder_levels(
            [1, 2, 3, 0,],
            axis='columns',
        )  # So sheet names don't ruin alignment.
        for name, sheet in sheets_to_compare.items()
    ),
    axis='rows',
)

Some column names are similar to ones in the opposite `DataFrame`; if I
normalize the names, I can expect to align the corresponding sample values.

#### Correlate Aphid Columns by Aphid Type

I'll be looking for data about aphid types that were measured in both sources,
and I'll ultimately want only one column from each `DataFrame` for each type of
aphid that is present in both sources, because I'll be computing the ratios of
corresponding counts across frames.

**Bird cherry oat:**

* s2:
  * bird_cherry_oat_aphid
* hc:
  * bco_alate
  * bco_apt
  * bco_head
  * bco_leaf

**EGA:**

* alate:
  * hc:
    - ega_alate
  * s2:
    - ega alate

**Greenbug:**

* hc:
  * sum of greenbug_.+
* s2:
  * greenbug_aphid

**All:**

* s2:
  * sum all
* hc:
  * sum of any of [apt, alate], [grn, red], [head, leaf]

The column labels for aphids indicate multiple variations and probably multiple
representations of the same aphid species count in the same `DataFrame`. For
example, English grain aphids from **Head Counts** are counted in opposing
pairs: "winged" versus "wingless", and "red" versus "green"—but there are also
"head" and "leaf" counts. Compare that with the data labels in **Sheet2**, where
there are various instars (life stages), pre-alate versus alate, red versus
green, wingless (green) versus wingless (no colour specified)—but no
corresponding **_ega_leaf_** or **_ega_head_**.

I'll have to figure out which columns to combine and which to ignore. I'll begin
with the presumption that ***ega_head*** plus ***ega_leaf*** should produce the
same total as the pairings of ***ega_apt*** plus ***ega_alate*** or
***ega_red*** plus ***ega_grn***. If I can make sense of those columns in **Head
Counts**, I'll proceed to examine the value totals in **Sheet2** with the same
strategy.

Greenbugs will be much simpler, given that I'll be cross-referencing one simple
column, **_greenbug_aphid_**, against only two mutually exclusive columns from
the other worksheet, **_greenbug_alate_** and **_greenbug_apt_**. The sum of
counts for the two greenbug phenotypes in **Head Counts** should correspond
fully with the solitary counterpart in **Sheet2**.
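
Here's a rough sketch of that greenbug cross-reference on invented numbers. The
frames `toy_hc` and `toy_s2`, the dates, and the counts are all hypothetical, and
the direction of the ratio (Sheet2 count per Head Counts count) is chosen
arbitrarily for the example:

```python
import pandas

# Invented dates and counts, for illustration only.
index = pandas.Index(['2015-07-01', '2015-07-08'], name='date')

toy_hc = pandas.DataFrame(
    {'greenbug_alate': [1, 0], 'greenbug_apt': [4, 2]},
    index=index,
)
toy_s2 = pandas.DataFrame({'greenbug_aphid': [10, 4]}, index=index)

# Winged plus wingless gives one greenbug total per record.
hc_greenbug = toy_hc[['greenbug_alate', 'greenbug_apt']].sum(axis='columns')

# The division aligns on the shared index before computing the ratio.
ratio = toy_s2['greenbug_aphid'] / hc_greenbug
```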

### @todo

* [ ] **match (between sheets) columns like 'alate' rather than overall sums!**

##### English Grain Aphid from **Head Counts**

I'm considering which columns to use, and I'm thinking of using the sum of
***ega_head*** and ***ega_leaf***. Given all complementary pairings of EGA
columns in **Head Counts**, I expect the sums to be identical amongst all three
pairs for each of the sampling dates.

In [67]:
hc_ega_sums = pandas.concat(
    {
        'head + leaf': (hc['ega_head'] + hc['ega_leaf']),
        'apt + alate': (hc['ega_apt']  + hc['ega_alate']),
        'red + grn':   (hc['ega_red']  + hc['ega_grn']),
    },
    axis='columns',
)

If I add a column showing the standard deviation for each row, I should easily
see where there's significant inconsistency. Showing the 25 worst records:

In [68]:
pandas.options.display.max_rows = 25
pandas.concat(
    (
        hc_ega_sums,
        pandas.Series(
            data=hc_ega_sums.std(axis='columns'),
            name='std',
        ),
    ),
    axis='columns',
).sort_values(
    by='std',
    ascending=False
).head(25)

Looks like there are many invalid counts, but **_34_**, **_80_**, **_33_** and
**_81_** show the biggest clue about which column is the most realistic:

In [69]:
# figure
_.loc[[34, 80, 33, 81]]

It's very unusual to find no aphids at all, and if the same plant was supposedly
bearing 160 aphids, but none of them were either red or green, I'd have to make
various assumptions in order to validate that portion of the data. However,
since nothing can both have and not have wings, I don't need to make *as many*
assumptions about this portion of data. With those two points in mind, the fact
that zeros coincide with each other only in the **_apt + alate_** and
**_red + grn_** columns suggests these zeros represent missing observation data
rather than an observed count of 0 aphids of those types. With high confidence,
I could disregard the supposed counts of 0 in those instances.
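
One possible way to act on that reasoning is to mask the suspect zeros as
missing. The sketch below uses an invented frame (`toy_sums`) and one particular
rule, which is my assumption: a zero is suspect only when both pair sums are
zero while **_head + leaf_** is positive.

```python
import numpy
import pandas

# Invented sums; row 0 mimics the "160 aphids, none winged or coloured" case.
toy_sums = pandas.DataFrame({
    'head + leaf': [160, 12, 0],
    'apt + alate': [0, 12, 0],
    'red + grn': [0, 12, 0],
})

# Suspect rows: both pair sums are zero although head + leaf is positive.
suspect = (
    toy_sums['apt + alate'].eq(0)
    & toy_sums['red + grn'].eq(0)
    & toy_sums['head + leaf'].gt(0)
)

# Use floats so the masked cells can hold NaN.
cleaned = toy_sums.astype('float')
cleaned.loc[suspect, ['apt + alate', 'red + grn']] = numpy.nan
```

A row where all three sums are zero is left alone, since that could be a genuine
observation of no aphids.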

A separate issue is what's happening in records numbered **_14_**, **_75_** and
**_155_**, where all three columns disagree slightly:

In [70]:
# figure
__.loc[[14, 75, 155]]

##### @todo:

* [ ] apply normalized aphid names

### Remove Unnecessary Columns

Aligning the records will be simplest if there are no extraneous columns.
Ideally, I should be able to align the rows and start using the joined data set,
without selecting columns. What non-aphid, non-lookup columns remain?

##### @todo:

* [ ] identify remaining columns
* [ ] drop other columns

### Compare Group Sums to Pre-existing "Combined"

I didn't expect that the data set from **Sheet2** would have sums (presumably
from groupings of ***distance(m)*** values), yet also some non-aggregated
values. Normally, these wouldn't be mixed, because they represent different
dimensional orders. Because this data set's dimensional order is inconsistent,
there is probably redundancy in the total information available, and possibly
contradictions.

Any redundancy, whether contradictory or not, would affect calculation for the
intended [objective], unless there's exactly one record for each space and time
combination—that is to say, if there are discrete values as well as previously
"combined" values for the same point along the index, I'll have to avoid
including the pre-calculated sum when aggregating my own sums, otherwise the
resulting totals will be doubled.

[objective]: #Objective

I'll separate discrete and aggregated sets, then compare my sums of discrete
sample values to pre-existing aggregations.
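
A minimal sketch of that comparison, on invented data, might look like the
following. The frame `toy_s2` and its values are hypothetical, and the grouping
keys are reduced to ***date*** and ***field*** for brevity. On the real data this
would need a copy of **Sheet2** from before the "Combined" records were dropped.

```python
import pandas

# Invented records: discrete samples plus a pre-existing "Combined" row per date.
toy_s2 = pandas.DataFrame({
    'date': ['2015-07-01'] * 4 + ['2015-07-08'] * 3,
    'field': [1, 1, 1, 1, 1, 1, 1],
    'distance(m)': [5, 10, 25, 'Combined', 5, 10, 'Combined'],
    'ega alate': [1, 2, 3, 6, 0, 4, 5],
})

keys = ['date', 'field']
is_combined = toy_s2['distance(m)'] == 'Combined'

# My own sums of the discrete samples, kept apart from the given aggregates.
my_sums = toy_s2.loc[~is_combined].groupby(keys)['ega alate'].sum()
given_sums = toy_s2.loc[is_combined].set_index(keys)['ega alate']

comparison = pandas.concat(
    {'my sum': my_sums, 'combined': given_sums},
    axis='columns',
)
comparison['agrees'] = comparison['my sum'] == comparison['combined']
```

Keeping the two sets separate before summing is what avoids the doubled totals
described above.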

Before I can compare

## Align Rows

##### @todo:

* [ ] mark bad dates, drop rows when aligning rows