# Objective

- Answer the question:
  * What is the relationship between observations recorded as "head counts" and those recorded as "sweeps"?
- Document the process of finding the answer so that the exact steps are replicable and clearly demonstrated here.

## Background

!["Perfectly straight line in graph comparing supposedly independent variables in Head Count and Sweeps workbook, in Excel Online"](head-count-sweeps-graph-excel-online.png)

I'm troubleshooting a spreadsheet document that was prepared by someone else, and it probably passed through at least one other person's hands before it reached me. The information in the document looks like it was copied from other documents, which I'm not certain I have access to.

In this document, there's a graph that demonstrates the goal of the document: to compare the relationship between two observation methods. Unfortunately, the graph shows a suspicious degree of idealness: a perfect one-to-one ratio across the entire domain.

My assignment is to trace the error and correct the graph so that it displays the precise ratios calculated from appropriate samples.

Other issues:

- No metadata
- No naming convention for categorical labels
- Mixed aggregation levels

## Expected Results

- Upon plotting the ratio of numbers of specimens by each of two collection methods, the graph should suggest a trend that is not perfectly linear.
- Category labels have no redundancy.
- Sample numeric values, when summed by groups based on common indices, suffer no effects of redundancy.
- Uniform level of aggregation across all sample numeric values.
- Data is clean and indexed well enough to align with similar data in other documents (allowing detection of redundancy amongst many sample sets).

# Procedure

## Setup

* Document handling and analysis will be conducted in [Jupyter] Notebook/Lab, using [Python] 3.
* Microsoft [Office] Excel Online will be used to create and embed a graph. The offline version would also suffice.
* To set up a live workspace, consult the [README] for the home project comprising this document and its associated files.
* If your live workspace doesn't include it, [install _pandas_] before continuing. The commonly recommended way to do this is with `pip install pandas`. (If you're using [Anaconda], you probably already have _pandas_.)

[Anaconda]: http://docs.continuum.io/anaconda/
[Jupyter]: https://jupyter.org/
[Office]: https://www.office.com/
[Python]: https://www.python.org/about/
[README]: https://github.com/devvyn/aafc-field-data/blob/master/README.md
[install _pandas_]: https://pandas.pydata.org/pandas-docs/stable/install.html

### Install Required Python Packages

Within Jupyter notebook:

```
!pip install pandas markdown
```

From a terminal to the notebook host:

```
pip install pandas markdown
```

### Import Python's Regular Expression Library

From time to time during this project, I'll be matching patterns of text in order to correct for variations in identifiers.

In [2]:
import re

### Import the `pandas` Library

In [3]:
import pandas

### Import markdown and HTML

In [4]:
from markdown import markdown
from IPython.display import HTML

## Input Data

Direct access to [`2016-sweep-vs-tiller.xlsx`][file] is required in the live workspace. [Download][file] from GitHub if you're following along without cloning the source [repository].

[repository]: https://github.com/devvyn/aafc-field-data/
[file]: https://github.com/devvyn/aafc-field-data/blob/master/notebook/projects/2016-sweep-vs-tiller/2016-sweep-vs-tiller.xlsx

I'll make a dictionary of names and data frames, so I can examine the file overall.

In [5]:
data_file = pandas.ExcelFile('2016-sweep-vs-tiller.xlsx')
sheets = {
    sheet_name: data_file.parse(sheet_name)
    for sheet_name in data_file.sheet_names
}

### Explore Worksheets

#### Two Sources of Observational Data

Data sets to compare:

* "cereal sweeps" or just "sweeps"
* "head counts" or "tillers"

#### Unbelievable Graph

- There's a graph in a worksheet called "head counts vs sweeps graphs" which demonstrates the analytical problem encountered/developed by someone else.
- The data being compared in the graph cannot be the data that was intended for comparison, because the ratio depicted is perfectly linear even though it compares real-world samples.

#### Lack of Spreadsheet Formulas

It's clear that the Excel workbook has the results of many calculations, yet there are no formula cells. In order to check the accuracy of the calculations, I need to replicate them from scratch.

#### Ambiguously Duplicated Data

When I opened the workbook in Excel Online, what I saw gave me these impressions:

- Data may have been copied from multiple, unidentified sources.
- It's not clear which data is "original" and which is duplicated, amongst the worksheets in the workbook.
- Multiple editors have made changes or additions to the workbook, and nobody left notes.

#### Compare Columns

To discern which data is original, I'll begin by listing the columns of all sheets, which will offer some descriptive terms for the data in each:

In [6]:
pandas.DataFrame(
    data=[frame.columns for frame in sheets.values()],
    index=pandas.Index(data=sheets.keys(), name='Worksheet Name',),
)

| Worksheet Name | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Head Counts | Site | Crop | Date | Field | Zadoks_stage | Tiller | EGA_head | EGA_leaf | BCO_head | BCO_leaf | ... | | | | | | | | | | |
| Sheet2 | ID | Province | Collection_Date | Sample_by_week | Date_by_week | Date | Julian_date | Site | Field_name | Crop | ... | Hymenoptera_Figitidae | Hymenoptera_Aphelinidae | Hymenoptera_Perilampidae | Hymenoptera_Chalcidoidea | Hymenoptera_Ichneumondoidea | Hymenoptera_Proctotrupoidea | | | | |
| Sweep Samples Cereals | ID | Province | Collection_Date | Sample_by_week | Date_by_week | Date | Julian_date | Site | Field_name | Crop | ... | Hymenoptera_Proctotrupidae | Hymenoptera_Pteromalidae | Hymenoptera_Apidae | Hymenoptera_Diplazontinae | Hymenoptera_Figitidae | Hymenoptera_Aphelinidae | Hymenoptera_Perilampidae | Hymenoptera_Chalcidoidea | Hymenoptera_Ichneumondoidea | Hymenoptera_Proctotrupoidea |
| Head Counts Edited | Date | Site | Crop | Field | Sample Type | Unnamed: 5 | Zadoks_stage | Tiller | EGA_alate | EGA_apt | ... | | | | | | | | | | |
| Sweep Samples Cereals Edited | Date | Site | Crop | Field_name | Sample Type | Total Sweeps | Unnamed: 6 | Unnamed: 7 | EGA_alate | EGA_apt | ... | | | | | | | | | | |
| Data Sheets Combined | ID | Date | Site | Crop | Field_name | Sample Type | Total Sweeps | Zadoks_stage | Tiller | EGA_alate | ... | | | | | | | | | | |
| Pivot chart LH phen | | | | | | | | | | | ... | | | | | | | | | | |
| leafhoppers 2016 cereal sweeps | Collection_Date | Sample_by_week | Date_by_week | Date | Julian_date | Site | Field_name | Crop | Distance(m) | Number of Samples | ... | | | | | | | | | | |
| Sheet3 | | | | | | | | | | | ... | | | | | | | | | | |
| aphid sweep vs head count | ID | Date | Site | Crop | Field_name | Sample Type | Total Sweeps | Zadoks_stage | Tiller | EGA_alate | ... | | | | | | | | | | |


### Initial Grouping

Based on sheet names, column names, and similarities between column sets, I can probably group the sheets like so:

- Head counts:
  - Head Counts
  - Head Counts Edited
- Sweep:
  - Sheet2
  - Sweep Samples Cereals
  - Sweep Samples Cereals Edited
  - leafhoppers 2016 cereal sweeps
- United:
  - Data Sheets Combined
- Analytical experiments:
  - Pivot chart LH phen
  - Sheet3
  - aphid sweep vs head count
  - head counts vs sweeps graphs

Because I can't trust the accuracy of the data used in the graph, I need to look at all the sheets and determine the most complete and unadulterated data sets. I'll determine which data belongs to each category, and compare the sets.

## Select Primary Data Set: Sweep

- Sheet2
- Sweep Samples Cereals
- Sweep Samples Cereals Edited
- leafhoppers 2016 cereal sweeps

### Convert Date & Time Format

Before I can easily examine dates from the worksheets, I must convert them to proper date values for Python & *pandas*:

In [22]:
for name, sheet in sheets.items():
    output_lines = (HTML(markdown(f'#### {name}')),)
    if sheet.columns.size != 0:
        indexer = (
            sheet.columns
            .map(str)
            .str.lower()
            .str.contains('date')
        )
        output_lines += (sheet.columns[indexer],)
#         sheet.rename(
#             columns={'Collection_Date': 'date'},
#             inplace=True,
#         )
#         sheet['date'] = pandas.to_datetime(
#             sheet['date'],
#             format='%d_%m_%Y',
#         )
    output_lines += (sheet.loc[0, sheet.columns[indexer]],)
    for line_out in output_lines:
        display(line_out)

Index(['Date'], dtype='object')

Date    04/08/2016
Name: 0, dtype: object

Index(['Collection_Date', 'Date_by_week', 'Date', 'Julian_date'], dtype='object')

Collection_Date    12_08_2016
Date_by_week                0
Date                   Aug_12
Julian_date                 0
Name: 0, dtype: object

Index(['Collection_Date', 'Date_by_week', 'Date', 'Julian_date'], dtype='object')

Collection_Date    12_08_2016
Date_by_week                0
Date                   Aug_12
Julian_date                 0
Name: 0, dtype: object

Index(['Date'], dtype='object')

Date    2016-08-04 00:00:00
Name: 0, dtype: object

Index(['Date'], dtype='object')

Date    2016-08-12 00:00:00
Name: 0, dtype: object

Index(['Date'], dtype='object')

Date    2016-08-12 00:00:00
Name: 0, dtype: object

IndexError: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 39
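
The `IndexError` looks like a side effect of the final `output_lines +=` statement sitting outside the `if` block: for a worksheet with no columns, `indexer` would still hold the mask built for the previous sheet. Here's a minimal sketch of a guarded version, run on made-up frames rather than the real worksheets. The names `toy_sheets` and `first_date_values` are my own stand-ins and don't appear elsewhere in this notebook.

```python
import pandas

# Hypothetical stand-ins for the parsed worksheets.
toy_sheets = {
    'with dates': pandas.DataFrame(
        {'Collection_Date': ['12_08_2016'], 'Site': ['Alvena']}
    ),
    'empty': pandas.DataFrame(),
}


def first_date_values(sheet):
    """Return first-row values of columns whose name mentions 'date'."""
    if sheet.empty:
        # No rows or no columns: there is nothing to preview.
        return None
    mask = sheet.columns.map(str).str.lower().str.contains('date')
    return sheet.loc[sheet.index[0], sheet.columns[mask]]


date_previews = {
    name: first_date_values(sheet)
    for name, sheet in toy_sheets.items()
}
```

Because the mask is built and used inside the same function call, an empty worksheet can never inherit a mask from another sheet.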

### Sheet2

From my hands-on exploration in Excel, I got the feeling that `Sheet2` is the most complete. I'll compare it to the others.

### Sweep Samples Cereals

In [None]:
sheet_names = [
    'Sweep Samples Cereals',
    'Sheet2',
]
compare_sheets = ssc, s2 = [
    sheets[sheet_name]
    for sheet_name in sheet_names
]

I'd like to confirm that the vaguely named **Sheet2** is what it seems to be: a combination of worksheets that includes **Sweep Samples Cereals**.

In [None]:
display(HTML(markdown('#### Data Frame Shape')))
display(
    pandas.DataFrame(
        data=[
            sheets[sheet_name].shape
            for sheet_name in sheet_names
        ],
        index=sheet_names,
        columns=('rows', 'columns'),
    )
)

The overall shape doesn't preclude that possibility.

I'll take a closer look.

#### Columns

In [None]:
pandas.DataFrame(
    data=[sheet.columns for sheet in compare_sheets],
    index=sheet_names
).T

That seems to confirm that **Sheet2** has the same columns as **Sweep Samples Cereals** except for four columns that were removed.

Compare lists of column names:

In [None]:
set.symmetric_difference(*(
    set(sheet.columns.tolist())
    for sheet in compare_sheets
))

The unnamed columns are inconsequential to our analysis. In fact, I believe they're empty. The rest have aggregate values that I don't trust. **Sweep Samples Cereals** is probably useless for my purposes.

#### Rows

It's pretty clear there is a lot more data in **Sheet2**. I suspect that **Sheet2** has additional data added to it. I'll have to take a closer look at the values, especially dates.

##### Analyze Dates

How closely do the dates overlap?

In [None]:
pandas.concat(
    dict(zip(sheet_names, compare_sheets)),
    sort=True,
    axis='columns'
).loc[:, pandas.IndexSlice[:, 'date']].describe()

The range of dates is identical, but the distribution is not. I need to check if *all* the dates in **Sweep Samples Cereals** are in **Sheet2**.

In [None]:
ssc.date.isin(s2.date).all()

I presume the converse is untrue. I'll verify.

In [None]:
s2.date.isin(ssc.date).all()

I presume **Sweep Samples Cereals** is most likely either a subset of **Sheet2**, or a reduced version of the same source data. I'll keep my eye out for signs of reduction if I decide to use the smaller data set — but I doubt I will.

#### Aggregation

**Sheet2** has two additional dates. I'm especially suspicious that one is aggregated from the other because I see a ***total sweeps*** column in **Sweep Samples Cereals**.

In [None]:
ssc['Total Sweeps'].head()

I suspect that the sweeps were at different distances, and later reduced to sums. If I peek at the `Distance(m)` column, I should see a clear difference.

In [None]:
display(HTML(markdown('#### Value Counts')))
display(
    pandas.concat(
        dict(zip(sheet_names, compare_sheets)),
        axis='columns',
        sort=True,
    )
    .loc[:, pandas.IndexSlice[:, ['Distance(m)']]]
    .apply(pandas.value_counts)
    .stack()
    .reorder_levels((1, 0))
    .T
    .fillna('')
)

In [None]:
s2['Distance(m)'].unique()

Indeed, **Sheet2** has observations at various "distances", while **Sweep Samples Cereals** has only the label, ***Combined***.

**Sheet2** has more rows because it isn't totalling up the sweeps from various distances. That makes **Sheet2** less reduced, and more "raw".
Since I don't want reduced (aggregated) data, I don't want **Sweep Samples Cereals**.
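
If I wanted to test the aggregation hunch numerically, one approach would be to sum the per-distance rows and compare the result to the "Combined" rows. This is only a sketch on made-up numbers: the column names `site`, `distance`, and `ega`, and every value below, are hypothetical.

```python
import pandas

# Hypothetical "raw" rows: one row per site and sweep distance.
raw = pandas.DataFrame({
    'site': ['A', 'A', 'B', 'B'],
    'distance': [5, 25, 5, 25],
    'ega': [1, 2, 0, 4],
})
# Hypothetical "combined" rows: one row per site.
combined = pandas.DataFrame({'site': ['A', 'B'], 'ega': [3, 4]})

# Recompute the per-site totals from the raw rows.
recomputed = raw.groupby('site')['ega'].sum()
reported = combined.set_index('site')['ega']

# True wherever the reported total equals the recomputed one.
totals_match = recomputed.eq(reported)
```

Any `False` in `totals_match` would suggest the reduced sheet was not derived from the raw one by simple summation.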

#### Conclusion

Optimal candidate:

- **Sheet2**

### Sweep Samples Cereals Edited vs Sheet2

In [None]:
ssce = sheets['Sweep Samples Cereals Edited']
compare_sheets = ssce, s2

#### Columns

In [None]:
pandas.options.display.max_rows = 140
pandas.DataFrame(
    data=[sheet.columns for sheet in compare_sheets],
    index=['Sweep Samples Cereals Edited', 'Sheet2']
).T

**Sweep Samples Cereals Edited** seems to have left out some columns that would be expected to carry finely categorized subjects, such as various instars of aphids and leafhoppers. This makes it less likely to have information I'll need, because I'll be comparing aphid numbers. Furthermore, it appears that **Sweep Samples Cereals Edited** has a column, ***Total Sweeps***, that's probably an artefact from pre-existing aggregation. It's also missing the ***Distance(m)*** column; another sign of aggregation, and therefore loss of some information.

Unless further analysis reveals that **Sweep Samples Cereals Edited** has dates that are missing from **Sheet2**, I'll assume it's not worth closer examination.

#### Rows

In [None]:
f'{ssce.index.size / s2.index.size:.0%}'

**Sweep Samples Cereals Edited** has 14% as many rows as **Sheet2**, so it's not likely to be useful unless the dates don't fully overlap.

In [None]:
def len_unique(pandas_object):
    return len(pandas_object.unique())


descriptors = [pandas.Series.max, pandas.Series.min, len_unique]
for frame in compare_sheets:
    # Display each summary; a bare expression inside a loop shows nothing.
    display(frame.date.agg(descriptors))

It appears that **Sweep Samples Cereals Edited**, like **Sweep Samples Cereals**, has a much shorter date range, so we won't be missing anything if we ignore it. To be sure, I need to check if all the dates in **Sweep Samples Cereals Edited** are in **Sheet2**.

In [None]:
ssce.Date.isin(s2.Collection_Date).all()

Excellent. I see no reason to pay attention to **Sweep Samples Cereals Edited** or **Sweep Samples Cereals** anymore. 

If I have time after fixing the graph, I may trace the cause of the error, which may lead me back to one of those worksheets.

#### Conclusion

Optimal candidate:

- **Sheet2** (again)

### leafhoppers 2016 cereal sweeps vs Sheet2

In [None]:
sheet_names = ['leafhoppers 2016 cereal sweeps', 'Sheet2']
compare_sheets = lh2016, s2 = [
    sheets[sheet_name]
    for sheet_name in sheet_names
]

#### Columns

In [None]:
columns = [frame.columns for frame in compare_sheets]
pandas.DataFrame(
    data=columns,
    index=sheet_names
).T.head(lh2016.columns.size)

Clearly, **leafhoppers 2016 cereal sweeps** is focused on leafhoppers. Because the object of our analysis is to compare aphid numbers, I don't see the relevance of this leafhopper data.

In [None]:
lh2016['Total Sweeps'].unique()

In [None]:
lh2016['Distance(m)'].unique()

In [None]:
lh2016['Number of Samples'].unique()

#### Rows

In [None]:
pandas.DataFrame(
    data={
        sheet_name: sheet.index.size
        for sheet_name, sheet in zip(sheet_names, compare_sheets)
    },
    index=['number of rows'],
).rename_axis(['worksheet'], axis='columns').T.sort_values(by='number of rows')

Sheet2 has the most rows. But do the times align?

First, what date format is used in lh2016?

In [None]:
lh2016.Collection_Date.head()

Seems like a text expression. Day, month, year; separated by underscores. The function `pandas.to_datetime` solves this.

In [None]:
lh2016.Collection_Date = pandas.to_datetime(lh2016.Collection_Date, format='%d_%m_%Y')
lh2016.Collection_Date.head()
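
Since the worksheets mix several date notations, it may help to parse with `errors='coerce'` so that values not matching the format become `NaT` instead of raising. A small sketch on sample strings (the series `raw_dates` is made up for illustration):

```python
import pandas

# Sample strings: two in day_month_year form, one in another notation.
raw_dates = pandas.Series(['12_08_2016', '04_08_2016', 'Aug_12'])

# %d = day, %m = month, %Y = four-digit year, separated by underscores.
parsed = pandas.to_datetime(raw_dates, format='%d_%m_%Y', errors='coerce')

# Values that did not match the format are left as NaT.
unparsed = raw_dates[parsed.isna()]
```

Listing `unparsed` afterwards shows exactly which cells still need attention.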

Now, to compare dimensions:

In [None]:
pandas.DataFrame(
    data=[
        len_unique(frame.Collection_Date)
        for frame in compare_sheets
    ],
    index=sheet_names,
    columns=[
        'unique datetimes',
    ],
)

Seeing that `leafhoppers 2016 cereal sweeps` is smaller, check that its index is a subset of `Sheet2`:

In [None]:
lh2016.Collection_Date.index.isin(s2.index).all()

All of the dates in the leafhopper counts are present in `Sheet2`.
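
One caveat: `.index` holds row labels, which are only dates if the date column has been set as the index. A value-level check may be a useful complement. The sketch below uses two made-up frames, `small` and `large`, to show how the two checks can disagree.

```python
import pandas

# Hypothetical frames with default integer row labels.
small = pandas.DataFrame({
    'Collection_Date': pandas.to_datetime(['2016-08-12', '2016-09-30']),
})
large = pandas.DataFrame({
    'Collection_Date': pandas.to_datetime(
        ['2016-08-12', '2016-08-19', '2016-08-26']
    ),
})

# Row labels 0 and 1 both exist in `large`, so this check passes.
labels_are_subset = small.index.isin(large.index).all()

# 2016-09-30 is absent from `large`, so the value-level check fails.
dates_are_subset = small['Collection_Date'].isin(
    large['Collection_Date']
).all()
```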

#### Conclusion

Given that the leafhopper data isn't pertinent, `Sheet2` remains the best candidate for the primary source of data points about aphids collected and counted according to the "sweep" method.

If it's ultimately determined that the data *is* relevant, it may be useful in that case because the datetime index overlaps with that of `Sheet2`.

### Conclusion

After all analysis, `Sheet2` appears to be the purest, most relevant base for comparison of "sweep" sample data to that of "tiller" count data.

## Select Primary Data Set: Tiller Head

- Head Counts
- Head Counts Edited

### Head Counts vs Head Counts Edited

I presume the relationship between these two worksheets is the same as that between the equivalent "sweep" worksheets. Therefore, I expect the "edited" version to be less useful.

In [None]:
sheet_names = [
    'Head Counts',
    'Head Counts Edited'
]
compare_sheets = hc, hce = [sheets[sheet_name] for sheet_name in sheet_names]

#### Columns

In [None]:
pandas.DataFrame(
    index=sheet_names,
    data=[sheet.columns for sheet in compare_sheets],
).T

The main difference here seems to be the aggregation in the "edited" sheet, indicated by the columns named "EGA/head", "EGA_total", etc. I expect the "unedited" data to be more complete and reliable.

#### Rows

In [None]:
hce.Date.head()

In [None]:
hc.Date.head()

I'll fix the date format with `pandas.to_datetime` again.

In [None]:
hc.Date = pandas.to_datetime(hc.Date,
                             format='%d/%m/%Y')
hc.Date.head()

Now, compare for completeness:

In [None]:
[len_unique(column) for column in (
    hc.Date,
    hce.Date
)]

In [None]:
hc.Date.index.isin(hce.Date.index).all()

Identical date & time for the index of each, so no basis for choosing one over the other.

### Conclusion

Based on the columns, the best candidate for pure, reliable data is:

- `Head Counts`

## Align Primary Data Sets

The names and corresponding data frames, from the worksheets in the source workbook document (Excel):

In [None]:
sheet_names, compare_sheets = zip(
    ('Head Counts', hc),
    ('Sheet2', s2),
)

In [None]:
hc is compare_sheets[0]

In [None]:
s2 is compare_sheets[1]

#### Visualize Whitespace

For the sake of visualization, I'll write a function that wraps any value in square brackets. This makes trailing or leading whitespace obvious.

In [None]:
def wrap_brackets(x):
    return x if pandas.isna(x) else f'[{str(x)}]'

In [None]:
# Example:
pandas.Series([' a ', 'b ', 'c', '', ' ', float('nan')]).apply(wrap_brackets)

### Compare Columns

I've already noticed at least one leading space on a column name ("spiders"), and variations in capitalization. To compensate for this, I'll strip all leading and trailing whitespace from the lower case column names.

Comparing alphabetically sorted column names between our two primary data sets:

In [None]:
for sheet in compare_sheets:
    sheet.columns = sheet.columns.str.strip().str.lower()
    sheet.sort_index(axis='columns', inplace=True)

In [None]:
pandas.DataFrame(
    index=sheet_names,
    data=[sheet.columns for sheet in compare_sheets]
).fillna('').T

Wow, that's quite a large difference in columns for these sets. Since the object of the comparison is aphids only, we can ignore most of these columns from `Sheet2`. For better comparison, let's filter for columns referring to aphids.

#### Aphid Columns

Aphid related terms:

* aphid
* ega
* bco
* greenbug

In [None]:
aphid_terms = (
    r'aphids?',
    r'ega',
    r'bco',
    r'greenbug',
)
aphid_term_pattern = '|'.join(aphid_terms)

hc_aphid_columns, s2_aphid_columns = [
    sheet.columns[
        sheet.columns.str.contains(aphid_term_pattern)
    ] for sheet in compare_sheets
]

sorted(hc_aphid_columns.tolist() + s2_aphid_columns.tolist())

I see some problems with this list.

Not aphid related:

- aphidencyrtus_sp
- aphidiius_sp
- aphid_mummies
- aphid_mummies_aphelinus_black
- aphid_mummies_aphidius_brown
- aphid_mummies_blk
- aphid_mummies_brown

Not primary data:

- aphids_total
- bco_total
- ega_total
- total_alate_aphids
- total_apterous_aphids

I can prevent the matching of words containing "aphid" by adding a word boundary definition:

In [None]:
boundary = r'(?:_|^|$|\b)'
aphid_term_pattern = ''.join((
    boundary,
    r'(?:', '|'.join(aphid_terms), r')',
    boundary,
))

hc_aphid_columns, s2_aphid_columns = [
    sheet.columns[
        sheet.columns.str.contains(aphid_term_pattern)
    ] for sheet in compare_sheets
]

sorted(hc_aphid_columns.tolist() + s2_aphid_columns.tolist())

Better. Still need to exclude "total" and "mummies".

In [None]:
excluded_terms = (
    r'mumm(?:y|ies)',
    r'total',
)
aphid_term_pattern, excluded_term_pattern = (
    r''.join((
        boundary,
        r'(?:', '|'.join(pattern), r')',
        boundary,
    )) for pattern in (aphid_terms, excluded_terms)
)

hc_aphid_columns, s2_aphid_columns = [
    sheet.columns[
        sheet.columns.str.contains(aphid_term_pattern) & ~ sheet.columns.str.contains(excluded_term_pattern)
    ] for sheet in compare_sheets
]

sorted(hc_aphid_columns.tolist() + s2_aphid_columns.tolist())

Great! Aphid columns identified.
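
To show why the custom boundary is needed, here's a standalone sketch that applies the same two patterns to a handful of lower-cased column names taken from the lists above. The variable names `include`, `exclude`, and `kept` are mine.

```python
import re

# Same boundary as above: underscore, string start/end, or \b.
boundary = r'(?:_|^|$|\b)'
include = re.compile(boundary + r'(?:aphids?|ega|bco|greenbug)' + boundary)
exclude = re.compile(boundary + r'(?:mumm(?:y|ies)|total)' + boundary)

sample_names = [
    'ega_alate',
    'bco_head',
    'aphidencyrtus_sp',
    'aphid_mummies_blk',
    'ega_total',
    'total_alate_aphids',
]

# Keep names that mention an aphid term and no excluded term.
kept = [
    name for name in sample_names
    if include.search(name) and not exclude.search(name)
]

# `\b` alone is not enough, because `_` counts as a word character.
plain_word_boundary_hit = re.search(r'\bega\b', 'ega_alate')
```

Only `ega_alate` and `bco_head` survive both filters, and `plain_word_boundary_hit` is `None`, which is why the underscore has to be listed as a boundary explicitly.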

Before combining the data sets for mathematical processing, I'll need to normalize those column names.

##### @todo: normalize aphid names

#### Non-aphid Related Columns

What non-aphid columns remain?

In [None]:
hc_remainder, s2_remainder = (
    frame.columns[~frame.columns.isin(frame_aphid)]
    for frame, frame_aphid in zip((hc, s2), (hc_aphid_columns, s2_aphid_columns))
)

sorted(hc_remainder.tolist() + s2_remainder.tolist())

#### Lookup Columns (Record Index)

Reading through the list of remaining, non-aphid related columns, I see some that don't mention any organism by name. These columns may be useful for indexing, which is crucial to aligning the two data sources.

In [None]:
non_organism_column_names = pandas.Series(data=(
    'collection_date',
    'comments',
    'crop',
    'date',
    'date_by_week',
    'distance(m)',
    'field',
    'field_name',
    'id',
    'julian_date',
    'number of samples',
    'province',
    'sample_by_week',
    'site',
    'zadoks_stage',
))

Here are the names of the matching columns from each frame:

In [None]:
non_organism_columns_common = pandas.DataFrame(
    {
        name: dict(zip(non_organism_column_names,
                       non_organism_column_names.isin(frame.columns)))
        for name, frame
        in zip(sheet_names, compare_sheets)
    }
).replace(to_replace={False: '', True: '✅'})
non_organism_columns_common

### Align Columns

Our next goal is to determine which columns are in common and ensure they're of the same data type, so we can concatenate the frames.

In [None]:
non_organism_columns_common.index[
    non_organism_columns_common.all(axis='columns')
].tolist()

Those columns alone are probably enough to align the data.

I'll need to clean up `crop` and `site`, but `date` and `collection_date` are already well formed.

These columns warrant examination as well:

- collection_date
- date
- distance(m)
- field
- field_name
- number of samples

#### Date

The columns containing date information were normalized in an [earlier stage] of analysis. Those columns are ready to align in combination with other unique index labels.

[earlier stage]: #Convert-Date-&-Time-Format

#### Index Columns: Site, Field, Crop

The columns `hc.field` and `s2.field_name` seem to relate to `site` and `crop`. Before I address the values of the columns, I'll rename `field_name` for consistency.

In [None]:
for frame in compare_sheets:
    frame.rename(
        columns={'field_name': 'field'},
        inplace=True,
    )

Now, regarding the values, I suspect redundancy. Here's why I feel that way; look at some of the values from both data sets:

In [None]:
index_location_column_names = ['site', 'crop', 'field',]
site_and_field_name = pandas.concat(
    (
        hc[index_location_column_names],
        s2[index_location_column_names],
    ),
    keys=sheet_names,
    sort=True,
)
site_and_field_name.applymap(wrap_brackets).head(15)

It's clear to my eyes that in most cases, `site` and `crop` have been concatenated together to produce the value in `field` or `field_name`.

- site + crop = field(_name)

Primary data:

- site
- crop

Aggregated data:

- field
- field_name

Therefore, in those cases, I can safely disregard those derived columns and rely on the more normalized forms for indexing. That is to say, I think it's most beneficial to use `crop` and `site` whenever possible.
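
The "site + crop = field" observation could also be checked programmatically rather than by eye. A rough sketch, assuming that punctuation, spacing, and letter case are the only differences; the records and the helper `reduce_label` are invented for illustration.

```python
import re


def reduce_label(value):
    """Lower-case a label and drop everything except letters and digits."""
    return re.sub(r'[^a-z0-9]', '', str(value).lower())


# Hypothetical records; the last one deviates from the pattern.
records = [
    {'site': 'Alvena', 'crop': 'Wheat', 'field': 'Alvena Wheat'},
    {'site': 'Melfort ', 'crop': 'oats', 'field': 'MelfortOats'},
    {'site': 'Outlook', 'crop': 'Barley', 'field': 'Outlook-2'},
]

# True where `field` is just `site` and `crop` run together.
field_is_derived = [
    reduce_label(record['site'] + record['crop'])
    == reduce_label(record['field'])
    for record in records
]
```

Rows where the flag is `False` are the ones that deviate from the pattern.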

##### Outliers

There are outliers that deviate from the pattern. So, the next question is: do they have plausible matches in the complementary data set? If the outliers are only in one frame, they won't be usable for calculating ratios of counts.

In [None]:
site_and_field_name.loc[site_and_field_name.field.str.contains('-')]

All are from the `Head Counts` frame. I can safely ignore these records because they have no counterpart in `Sheet2`. When the time comes to do the comparison of numbers, they'll be discarded automatically. I can move on to another task.
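
The automatic discard I'm counting on is what an inner join on a shared index does. A small sketch with made-up counts and index labels (none of these numbers come from the workbook):

```python
import pandas

names = ['site', 'crop']
# Hypothetical head counts, including one record with no sweep counterpart.
head = pandas.Series(
    [10, 4, 7],
    index=pandas.MultiIndex.from_tuples(
        [('Alvena', 'Wheat'), ('Melfort', 'Oat'), ('Unmatched', 'Barley')],
        names=names,
    ),
    name='head_count',
)
# Hypothetical sweep counts.
sweep = pandas.Series(
    [5, 1],
    index=pandas.MultiIndex.from_tuples(
        [('Alvena', 'Wheat'), ('Melfort', 'Oat')],
        names=names,
    ),
    name='sweep',
)

# join='inner' keeps only the index labels present in both series.
paired = pandas.concat([head, sweep], axis='columns', join='inner')
ratio = paired['head_count'] / paired['sweep']
```

The unmatched record never reaches `ratio`, so it can't distort the comparison.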

#### Normalize Site

In [None]:
site_values = (
    pandas.concat(
        (frame[['site']] for frame in compare_sheets),
        keys=sheet_names,
        names=['Sheet Name', 'index',],
    )
    .drop_duplicates()
    .sort_values('site')
)
site_values.site.apply(wrap_brackets)

This is going to need some normalization. I'll write a string reducer function to transform any given value into a uniformly reducible representation by stripping insignificant characters and letter case. From that, I can build a hash table that maps to preferred representations, and apply that mapping to the values in a renaming operation.

##### Index by Reduced Label Value

In order to match badly formed labels with their normal representation, I need a string function similar to a hash, so the values in the data frame can be used to look up the preferred, normal form.

- convert numeric values to text
- strip non alphanumeric characters
- convert alphabetic characters to lower case

In [None]:
def hash_like(value):
    return re.compile(r"[^a-z0-9]").sub('', str(value).lower())

Applying this to all frames:

In [None]:
for frame in compare_sheets + (site_values, ):
    frame['site_index'] = frame.site.apply(hash_like)
    frame.set_index('site_index', append=True, inplace=True)

##### Choosing Normal Form

If I had a larger data set, I could automatically find the most frequently used representation for any given reduced value ("hash"); I would use the `mode` function. Unfortunately, the groups are far too small and the values too varied:

In [None]:
(
    site_values.site
    .apply(wrap_brackets)
    .unstack()
    .fillna('')
)

Even if I strip the leading and trailing whitespace, I would still end up with ambiguous candidate selections (e.g. the "yellowcreek" values). The work required to make a function that would know to convert "Yellowcreek" to "Yellow Creek" would be unreasonable, so automation might not be the best choice for choosing normal forms.
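
To make the ambiguity concrete, here's a sketch of the frequency-based approach using only the standard library. The observed labels are invented, and `reduce_label` is a stand-in with the same behaviour as `hash_like`.

```python
import re
from collections import Counter, defaultdict


def reduce_label(value):
    """Lower-case a label and drop everything except letters and digits."""
    return re.sub(r'[^a-z0-9]', '', str(value).lower())


# Hypothetical observed spellings.
observed = ['Yellow Creek', 'Yellowcreek', 'Alvena', 'Alvena', 'alvena ']

# Count how often each spelling occurs under each reduced key.
groups = defaultdict(Counter)
for label in observed:
    groups[reduce_label(label)][label] += 1

# A key is ambiguous when its two most frequent spellings tie.
ambiguous = []
for key, counts in groups.items():
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        ambiguous.append(key)
```

With so few observations per site, a tie like the one under `yellowcreek` leaves no principled way to pick a winner.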

I'll make the preferred identifier list manually. Here it is as a data frame with the appropriately reduced version of each label as the index:

In [None]:
preferred_site_id = pandas.Series(
    name='site',
    data={hash_like(item): item
          for item in [
              'Alvena',
              'Clavet',
              'Indian Head',
              'Kernan',
              'Llewellyn',
              'Meadow Lake',
              'Melfort',
              'Outlook',
              'SEF',
              'Wakaw',
              'Yellow Creek',
          ]},
)
preferred_site_id.index.set_names(['site_index'], inplace=True)
preferred_site_id.to_frame().T

With this list, I can compare the reduced ("hashed") values to the ones in the actual data and apply the preferred name where it matches.
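
As an illustration of that lookup, here's a sketch that uses `Series.map` with a fallback to the original label. It is an alternative to the `combine_first` approach used in this notebook, not a description of it, and the observed labels are made up.

```python
import re

import pandas


def hash_like(value):
    """Same reduction as above: lower-case, letters and digits only."""
    return re.sub(r'[^a-z0-9]', '', str(value).lower())


# A shortened preferred-name table, keyed by reduced label.
preferred = {hash_like(name): name for name in ['Alvena', 'Yellow Creek']}

# Hypothetical observed labels; the last has no preferred form.
observed = pandas.Series(['Yellowcreek', ' yellow creek', 'alvena', 'Site-9'])

# Look up each reduced label; keep the original where nothing matches.
relabelled = observed.map(hash_like).map(preferred).fillna(observed)
```

Unmatched labels pass through untouched, so outliers stay visible for later review.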

##### Confirm Expected Outliers

Everything that isn't in the preferred site name list:

In [None]:
site_values[
    ~ site_values.index.get_level_values('site_index')
    .isin(preferred_site_id.index)
]

These outliers are from the unmatchable records I mentioned upon my first look at the [unique site] labels. I can move on without doing anything further on these.

[unique site]: #Outliers

Report on which records would be changed if I relabel `site`:

In [None]:
pandas.concat(
    (
        site_values,
        preferred_site_id.to_frame().combine_first(site_values),
    ),
    keys=['Before', 'After'],
    axis='columns',
).reset_index('site_index', drop=True).applymap(wrap_brackets)

This looks good to me. I'll apply the names:

In [None]:
for frame in compare_sheets:
    frame.site = preferred_site_id.to_frame().combine_first(frame).site

I'll review the label values for `site` in both data frames, to see the result of those changes:

In [None]:
(
    pandas.concat(
        (frame[['site']] for frame in compare_sheets),
        keys=sheet_names,
        names=['Sheet Name', 'index', 'site_index'],
    )
    .site
    .drop_duplicates()
    .reset_index('site_index', drop=True)
    .sort_values()
    .apply(wrap_brackets)
)

#### Normalize Crop

Here are the `crop` field values for both data frames. (Since I've noticed some sneaky whitespace, I'll wrap the `crop` values in brackets for visualization.)

In [None]:
crops = pandas.concat(
    (
        frame.crop.apply(str)
        for frame in compare_sheets
    ),
    keys=sheet_names,
    sort=True,
).drop_duplicates().reset_index(drop=True).sort_values()
crops.apply(wrap_brackets)

Clearly, there are some variations that should be corrected, in both data frames.

- whitespace
- letter case
- word separation

I'll write a function that transforms any given "crop" value into a uniform representation of the crop it's intended to represent.

In [None]:
def normalize_str(value, separator=' '):
    if pandas.isna(value):
        return value

    str_value = str(value)

    # Add separators between words, title case
    is_mixed_case = str_value.upper() != str_value.lower() and not (str_value.islower() or str_value.isupper())
    if is_mixed_case:
        word_index = [
            index for index, char in enumerate(str_value) 
            if char.isupper()
        ] + [None]
        if word_index:
            words = [
                str_value[word_index[i]:word_index[i + 1]].strip()
                for i in range(len(word_index) - 1)
            ]
            str_value = separator.join(words)
    
    transformed = re.compile(r'[^a-zA-Z0-9]').sub(separator, str(str_value).title())
    
    # De-pluralize
    if transformed.endswith('Oats'):
        transformed = transformed[:-1]
    
    return transformed
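
For comparison, the word-splitting step can also be written with a zero-width regular expression. This is only a sketch of an alternative: it is not a drop-in replacement for `normalize_str`, the function name `tidy_label` is hypothetical, and the sample crop names are made up.

```python
import re


def tidy_label(value, separator=' '):
    """Split camelCase, replace non-alphanumerics, and title-case."""
    text = str(value).strip()
    # Insert a separator wherever a lower-case letter meets an upper-case one.
    text = re.sub(r'(?<=[a-z])(?=[A-Z])', separator, text)
    # Collapse each run of non-alphanumeric characters into one separator.
    text = re.sub(r'[^A-Za-z0-9]+', separator, text)
    return text.title()


examples = {
    label: tidy_label(label)
    for label in ('springWheat', ' durum_wheat ', 'OATS')
}
```

Unlike the index-based split, this keeps any lower-case prefix that precedes the first capital letter.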

Previewing the results, with the square brackets added again:

In [None]:
pandas.concat(
    (
        crops,
        crops.apply(normalize_str),
    ),
    keys=("Before", "After"),
    axis='columns',
).sort_values('After').applymap(wrap_brackets)

There are some concerning names, such as "nan", "unlisted", and "0", but the renaming result looks good. I'll apply the changes:

In [None]:
for frame in compare_sheets:
    frame.crop = frame.crop.apply(normalize_str)
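
One way to double-check that no near-duplicate labels survive a renaming pass is to group the distinct spellings under a "loose" key and look for keys with more than one spelling. This is only a sketch: the labels are made up rather than taken from the workbook, and `loose_key` is a hypothetical helper, not part of this notebook.

```python
import re

import pandas

# Hypothetical crop labels, invented for illustration
labels = pandas.Series(['Sweet Corn', 'sweetcorn', 'SWEET CORN ', 'Oat', 'Barley'])


def loose_key(label):
    """Reduce a label to its lowercase letters and digits only."""
    return re.sub(r'[^a-z0-9]', '', str(label).lower())


# Collect the distinct spellings found under each loose key
unique_labels = labels.drop_duplicates()
variants = unique_labels.groupby(unique_labels.map(loose_key)).agg(list)

# Keys that still have more than one spelling need another look
near_duplicates = variants[variants.map(len) > 1]
print(near_duplicates)
```

An empty result would suggest the labels are uniform, at least with respect to whitespace, letter case, and word separation.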

#### Index

In [None]:
index_column_names = index_location_column_names + ['date',]
for frame in compare_sheets:
    frame.set_index(index_column_names, inplace=True)

In [None]:
s2

#### Number of Samples & Distance

In order to get basic index information when viewing single columns during analysis, I'll set index columns:

In [None]:
# @todo: rename collection_date
s2_indexed = s2.set_index(['site', 'crop', 'collection_date'])

##### Number of Samples

This field is only present in `Sheet2`. I want to know a little bit about the values:

In [None]:
s2_indexed['number of samples'].value_counts(dropna=False)

There are many missing values in this column, which (like the "Combined" value in the `distance(m)` column) suggests a mixture of data from multiple sources — some aggregated and some not. I presume the number of samples relates to the "combined" values, but I need to confirm:

In [None]:
s2_indexed.loc[
    s2_indexed['number of samples'].isna() | (s2_indexed['distance(m)'] == 'Combined'),
    ['distance(m)', 'number of samples']
].drop_duplicates()

This confirms that where the number of samples is not indicated, the corresponding distance value is numeric. That's consistent with aggregation of the data, so I feel I understand what I'll be dealing with in preparation for the value comparison phase of my analysis.
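
The same relationship can be summarized as a small contingency table. The sketch below uses invented rows shaped like `Sheet2` (the values are assumptions, not workbook data) and counts how often each kind of distance value coincides with a missing sample count.

```python
import pandas

# Hypothetical rows shaped like Sheet2; the values are invented
toy = pandas.DataFrame({
    'distance(m)': [0, 25, 50, 'Combined', 'Combined'],
    'number of samples': [None, None, None, 3, 4],
})

# Label each row by distance kind and by whether a sample count is given
kind = (toy['distance(m)'] == 'Combined').map({True: 'combined', False: 'discrete'})
samples = toy['number of samples'].isna().map({True: 'missing', False: 'given'})

summary = pandas.crosstab(kind, samples)
print(summary)
```

If the two columns really do mirror each other, only the "combined / given" and "discrete / missing" cells are non-zero.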

##### Distance

In [None]:
s2_indexed['distance(m)'].drop_duplicates()

I've been concerned about redundancy and potential loss of precision due to the presence of a non-numerical value for `distance(m)`. Some records have a distance value of "Combined". I must decide whether to include both types of records.

For values at the same physical location and date, and of the same crop type:
- Can I be assured that the presence of "Combined" in the distance field always indicates the sum of all corresponding values?
- Conversely, if the non-combined (discrete) values do not account for the "Combined" value, can I be assured that there is no relationship — and therefore I should use both as separate samples?

To answer these questions, I'll separate discrete and aggregated sets, then examine:

- Overlap of independent variables (site, crop, date).
- Sum of discrete sample values, as compared with values in the "Combined" set.

In [None]:
s2_indexed_special_distance, s2_special_distance = (
    series[series['distance(m)'] != 'Combined'].sort_index()
    for series in (s2_indexed, s2))
s2_indexed_special_distance.index.size / s2_special_distance.index.size

In [None]:
s2_distance_combined = s2_indexed[s2_indexed['distance(m)'] == 'Combined'].sort_index()
s2_distance_discrete = s2_indexed[s2_indexed['distance(m)'] != 'Combined'].sort_index()

Quick sanity check by comparing index size:

In [None]:
s2_distance_discrete.index.drop_duplicates().size / s2_distance_combined.index.drop_duplicates().size

That's to be expected, since many values tend to aggregate down to fewer values.

The symmetric difference of the two indices (entries that appear in one set but not in the other):

In [None]:
# The ^ operator no longer performs set operations on an Index in recent pandas
s2_distance_combined.index.symmetric_difference(s2_distance_discrete.index).size

Oh? Nine unexpected index entries? Are they in "discrete"? Maybe some rows were overlooked during aggregation, or maybe multiple sources were merged into one worksheet—some being aggregated and some not.

In [None]:
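# Count the members found in "discrete"; .size would only report the total number of entries
s2_distance_combined.index.symmetric_difference(s2_distance_discrete.index).isin(s2_distance_discrete.index).sum()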

All nine are accounted for in "discrete". 
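
To show how the one-sided differences relate to the symmetric difference, here is a minimal sketch on two made-up indices. The sites, crops, and dates are assumptions for illustration only.

```python
import pandas

names = ['site', 'crop', 'date']

# Hypothetical, already de-duplicated index entries
combined_index = pandas.MultiIndex.from_tuples(
    [('A', 'Oat', '2019-06-01'), ('B', 'Wheat', '2019-06-01')],
    names=names,
)
discrete_index = pandas.MultiIndex.from_tuples(
    [('A', 'Oat', '2019-06-01'), ('C', 'Barley', '2019-06-08')],
    names=names,
)

# Entries on exactly one side, and the union of those two sets
only_in_combined = combined_index.difference(discrete_index)
only_in_discrete = discrete_index.difference(combined_index)
either = combined_index.symmetric_difference(discrete_index)

print(len(only_in_combined), len(only_in_discrete), len(either))
```

The two one-sided counts always add up to the size of the symmetric difference, so checking one side is enough to know where the unexpected entries live.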

##### Compare Group Sums to Pre-existing "Combined"

I didn't expect the data set from `Sheet2` to contain sums (presumably from groupings of `distance(m)` values) alongside non-aggregated values. Normally these wouldn't be mixed, because they represent different levels of aggregation. Because this data set's level of aggregation is inconsistent, the information available could contain redundancy or contradictions.

Any redundancy, whether contradictory or not, would affect the calculation of sums for the intended [objective], unless there's exactly one record for each combination of space and time. That is to say, if there are discrete values as well as previously "combined" values for the same point along the index, I'll have to avoid including the pre-calculated sum when aggregating my own sums; otherwise the resulting totals will be doubled.
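
A tiny made-up example of the doubling effect, assuming a single site and date with three discrete rows plus one pre-calculated "Combined" row. The `aphids` column name and all values are hypothetical.

```python
import pandas

# Hypothetical records: three discrete distances and their pre-calculated sum
toy = pandas.DataFrame({
    'site': ['A', 'A', 'A', 'A'],
    'date': ['2019-06-01'] * 4,
    'distance(m)': [0, 25, 50, 'Combined'],
    'aphids': [2, 3, 5, 10],
})

is_combined = toy['distance(m)'] == 'Combined'

# Summing every row counts the specimens twice
naive_total = toy.groupby(['site', 'date'])['aphids'].sum()

# Summing only the discrete rows reproduces the pre-calculated value
discrete_total = toy[~is_combined].groupby(['site', 'date'])['aphids'].sum()

print(naive_total.iloc[0], discrete_total.iloc[0])
```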

[objective]: #Objective

In [None]:
s2_grouped = s2_indexed.groupby(s2_indexed.index.names + [s2_indexed['distance(m)'] == 'Combined'])

In [None]:
s2_grouped.size().describe()

When grouping by all the indices, I hoped to see `max = 1` for group size. Since the size of some groups is greater than one, there are multiple values in some places. Therefore, it's worth checking the consistency of values between discrete and aggregated records for each combination of place, time, and specimen. If there are discrepancies, they'll need to be reconciled.
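
To see which groups hold more than one record, the group sizes can be filtered directly. This is a sketch on invented rows, with a hypothetical `is_combined` flag standing in for the comparison against "Combined".

```python
import pandas

# Hypothetical rows: site A has two discrete records and one combined record
toy = pandas.DataFrame({
    'site': ['A', 'A', 'A', 'B'],
    'date': ['2019-06-01'] * 4,
    'distance(m)': [0, 25, 'Combined', 'Combined'],
})

is_combined = (toy['distance(m)'] == 'Combined').rename('is_combined')

# Number of records per (site, date, discrete-or-combined) group
sizes = toy.groupby(['site', 'date', is_combined]).size()

# Groups that would have to be summed before comparing them
crowded = sizes[sizes > 1]
print(crowded)
```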

My sums and the source file's sums, side-by-side:

In [None]:
s2_distance_check = (
    s2_grouped
    .sum()
    .sum(axis='columns')
    .rename(index={False: 'Mine', True: 'Theirs'}, level='distance(m)')
    .unstack()
)
s2_distance_check#.head()
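
Once the sums sit side by side, disagreements can be flagged separately from one-sided entries. A minimal sketch, assuming a table with `Mine` and `Theirs` columns like the one above; the numbers are invented.

```python
import pandas

# Hypothetical side-by-side sums; NaN marks a side with no records
check = pandas.DataFrame(
    {'Mine': [10.0, 7.0, None, 4.0], 'Theirs': [10.0, 9.0, 6.0, None]},
    index=pandas.Index(['A', 'B', 'C', 'D'], name='site'),
)

# Rows where both sums exist but disagree
both = check.dropna()
mismatched = both[both['Mine'] != both['Theirs']]

# Rows where only one side has a value
only_theirs = check[check['Mine'].isna()]
only_mine = check[check['Theirs'].isna()]

print(mismatched)
```

Rows in `mismatched` are contradictions to reconcile, while the one-sided rows are candidates for separate samples.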

In [None]:
s2_distance_combined.eq(s2_distance_discrete).head()

#### Observation Data

Now that the data frame indices I built from the `date`, `site`, `field` and `crop` columns align, I need to determine which phenotype observations were recorded in both data frames.
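
One possible approach is to compare the column labels of the two frames as sets. The frames and column names below are placeholders, not the real sheets.

```python
import pandas

# Hypothetical observation columns for each collection method
head_counts = pandas.DataFrame(columns=['aphids', 'ladybugs', 'lacewings'])
sweeps = pandas.DataFrame(columns=['aphids', 'lacewings', 'thrips'])

# Columns recorded by both methods, and those unique to each
common = head_counts.columns.intersection(sweeps.columns)
only_head_counts = head_counts.columns.difference(sweeps.columns)
only_sweeps = sweeps.columns.difference(head_counts.columns)

print(list(common), list(only_head_counts), list(only_sweeps))
```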

##### @todo: select aphid columns, check if others exist in common

# WIP 

In [None]:
pandas.concat(
    dict(zip(sheet_names, compare_sheets)),
    sort=True,
)