## Purpose

Cleaning data, comparing numbers in two time series.

Main problem to solve:

- What is the relationship between observations recorded as "head counts" versus "sweeps"?

## Import Data

In [1]:
import pandas

data_file = pandas.ExcelFile('2016 sweep vs tiller.xlsx')
sheets = {sheet_name: data_file.parse(sheet_name)
          for sheet_name in data_file.sheet_names}

## Explore Worksheets

!["Perfectly straight line in graph comparing supposedly independent variables in Head Count and Sweeps workbook, in Excel Online"](head-count-sweeps-graph-excel-online.png)

When I opened the workbook in Excel Online, I saw many sheets with rather unhelpful names, and what looked like a lot of data that had been copied from other worksheets.

- There are two sources of actual observational data:
  * "cereal sweeps" or just "sweeps"
  * "head counts" or "tillers"
- There's a graph in a worksheet called "head counts vs sweeps graphs" which demonstrates the analytical problem found by someone else.
- Multiple editors have made changes or additions to the workbook, and nobody left notes.
- It's not clear which data is "original" and which is duplicated.
- The data supposedly being compared in the graph cannot be the data that was intended for comparison.

Because I can't trust that the data used in the graph was prepared accurately, I need to look at all the sheets and determine the most complete and unadulterated data sets. I'll determine which data belongs to each category, and compare the sets.

### Compare Columns

I'll begin by listing the columns of all sheets, which will offer some descriptive terms for the data in each:

In [2]:
pandas.DataFrame(index=sheets.keys(), data=[frame.columns for frame in sheets.values()])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,134,135,136,137,138,139,140,141,142,143
Head Counts,Site,Crop,Date,Field,Zadoks_stage,Tiller,EGA_head,EGA_leaf,BCO_head,BCO_leaf,...,,,,,,,,,,
Sheet2,ID,Province,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,...,Hymenoptera_Figitidae,Hymenoptera_Aphelinidae,Hymenoptera_Perilampidae,Hymenoptera_Chalcidoidea,Hymenoptera_Ichneumondoidea,Hymenoptera_Proctotrupoidea,,,,
Sweep Samples Cereals,ID,Province,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,...,Hymenoptera_Proctotrupidae,Hymenoptera_Pteromalidae,Hymenoptera_Apidae,Hymenoptera_Diplazontinae,Hymenoptera_Figitidae,Hymenoptera_Aphelinidae,Hymenoptera_Perilampidae,Hymenoptera_Chalcidoidea,Hymenoptera_Ichneumondoidea,Hymenoptera_Proctotrupoidea
Head Counts Edited,Date,Site,Crop,Field,Sample Type,Unnamed: 5,Zadoks_stage,Tiller,EGA_alate,EGA_apt,...,,,,,,,,,,
Sweep Samples Cereals Edited,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Unnamed: 6,Unnamed: 7,EGA_alate,EGA_apt,...,,,,,,,,,,
Data Sheets Combined,ID,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Zadoks_stage,Tiller,EGA_alate,...,,,,,,,,,,
Pivot chart LH phen,,,,,,,,,,,...,,,,,,,,,,
leafhoppers 2016 cereal sweeps,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,Distance(m),Number of Samples,...,,,,,,,,,,
Sheet3,,,,,,,,,,,...,,,,,,,,,,
aphid sweep vs head count,ID,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Zadoks_stage,Tiller,EGA_alate,...,,,,,,,,,,


### Initial Grouping

Based on sheet names and similarities between column sets, I can probably group the sheets like so:

- Head counts:
  - Head Counts
  - Head Counts Edited
- Sweep:
  - Sheet2
  - Sweep Samples Cereals
  - Sweep Samples Cereals Edited
  - leafhoppers 2016 cereal sweeps
- United:
  - Data Sheets Combined
- Analytical experiments:
  - Pivot chart LH phen
  - Sheet3
  - aphid sweep vs head count
  - head counts vs sweeps graphs
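
The grouping above was done by eye. A rough way to back it up would be to score how much each pair of sheets overlaps in column names. This is only a sketch: `toy_sheets` and its column names are hypothetical stand-ins for the real `sheets` dictionary, and `column_overlap` is a helper I'm making up for illustration.

```python
import pandas

# Hypothetical stand-ins for the parsed worksheets; only the column names matter here.
toy_sheets = {
    'raw sweeps': pandas.DataFrame(columns=['ID', 'Date', 'Site', 'Distance(m)']),
    'edited sweeps': pandas.DataFrame(columns=['Date', 'Site', 'Total Sweeps']),
    'head counts': pandas.DataFrame(columns=['Date', 'Site', 'Tiller']),
}


def column_overlap(frames):
    """Return a square table of Jaccard similarity between column-name sets."""
    names = list(frames)
    column_sets = {name: set(frames[name].columns) for name in names}
    table = pandas.DataFrame(index=names, columns=names, dtype=float)
    for a in names:
        for b in names:
            union = column_sets[a] | column_sets[b]
            shared = column_sets[a] & column_sets[b]
            # Two sheets with no columns at all share nothing meaningful: score 0.
            table.loc[a, b] = len(shared) / len(union) if union else 0.0
    return table


overlap = column_overlap(toy_sheets)
```

Scores near 1 would suggest copies or light edits of the same sheet; scores near 0 would suggest unrelated layouts. Column names alone can mislead, so this would only be a first pass.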

## Sweep Samples Cereals vs Sheet2

I'd like to confirm that the vaguely named **Sheet2** is what it seems to be: a slightly edited copy of **Sweep Samples Cereals**.

### Columns

In [3]:
sheet_names = ['Sweep Samples Cereals', 'Sheet2']
compare_sheets = ssc, s2 = [sheets[sheet_name] for sheet_name in sheet_names]
pandas.DataFrame(data=[sheet.columns for sheet in compare_sheets], index=sheet_names).T

Unnamed: 0,Sweep Samples Cereals,Sheet2
0,ID,ID
1,Province,Province
2,Collection_Date,Collection_Date
3,Sample_by_week,Sample_by_week
4,Date_by_week,Date_by_week
5,Date,Date
6,Julian_date,Julian_date
7,Site,Site
8,Field_name,Field_name
9,Crop,Crop


That seems to confirm that **Sheet2** has the same columns as **Sweep Samples Cereals** except for four columns that were removed.

In [4]:
set.symmetric_difference(*(set(sheet.columns.tolist()) for sheet in compare_sheets))

{'EGA/20 Sweeps',
 'EGA/Sweep',
 'Total EGA',
 'Total Sweeps',
 'Unnamed: 129',
 'Unnamed: 133'}

The unnamed columns are inconsequential to our analysis. In fact, I believe they're empty. The rest have aggregate values that I don't trust. **Sweep Samples Cereals** is probably useless for my purposes.
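
The symmetric difference doesn't say which sheet each column came from, and my belief that the unnamed columns are empty is unverified. A sketch of how both could be checked, using hypothetical miniature frames (`left`, `right`) rather than the real worksheets:

```python
import pandas

# Hypothetical miniature frames standing in for the two worksheets.
left = pandas.DataFrame({
    'ID': [1, 2],
    'Total Sweeps': [120, 60],
    'Unnamed: 3': [None, None],
})
right = pandas.DataFrame({
    'ID': [1, 2],
    'Unnamed: 2': [None, None],
})

# Columns present on one side only.
only_left = left.columns.difference(right.columns)
only_right = right.columns.difference(left.columns)

# An "Unnamed" column counts as empty when every cell in it is missing.
unnamed = [column for column in left.columns if str(column).startswith('Unnamed')]
all_empty = left[unnamed].isna().all().all()
```

With the real sheets, `left` and `right` would be `ssc` and `s2`, and the emptiness check would need to run on each sheet separately.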

### Regarding Spreadsheet Formulas

Side note about saving aggregated data in an Excel workbook: use spreadsheet formulas, and leave the formulas in place. Replacing easily evaluated expressions with their calculated values is rarely necessary, and it's counterproductive if anyone else might need to check your work.

### Rows

What about the rows? I'll check sizes:

In [5]:
pandas.DataFrame(data=[sheets[sheet_name].index.size for sheet_name in sheet_names], index=sheet_names)

Unnamed: 0,0
Sweep Samples Cereals,92
Sheet2,668


It's pretty clear there is a lot more data in **Sheet2**. I suspect that **Sheet2** has additional data added to it. I'll have to take a closer look at the values, especially dates.

### Dates

Before I can easily examine dates in the worksheets, I should convert them to proper datetime values:

In [6]:
for sheet in compare_sheets:
    sheet['Collection_Date'] = pandas.to_datetime(sheet['Collection_Date'], format='%d_%m_%Y')
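
The conversion above raises an error on the first value that doesn't match the format. If the column turned out to contain stray entries, a more forgiving variant could coerce them to `NaT` and list them for inspection. This sketch assumes the raw values are strings in day_month_year order, as the format string implies; `raw_dates` is made-up data.

```python
import pandas

# Hypothetical raw values in the same day_month_year layout, plus one bad entry.
raw_dates = pandas.Series(['05_06_2016', '15_08_2016', 'unknown'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pandas.to_datetime(raw_dates, format='%d_%m_%Y', errors='coerce')

# The original strings that failed to parse, so they can be inspected by hand.
failed = raw_dates[parsed.isna()]
```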

Which sheet covers the earliest dates?

In [7]:
[sheet['Collection_Date'].min() for sheet in compare_sheets]

[Timestamp('2016-06-05 00:00:00'), Timestamp('2016-06-05 00:00:00')]

Neither.

Which sheet covers the latest dates?

In [8]:
[sheet['Collection_Date'].max() for sheet in compare_sheets]

[Timestamp('2017-07-08 00:00:00'), Timestamp('2017-07-08 00:00:00')]

Neither, again.

Are all the dates used in both?

In [9]:
[len(sheet['Collection_Date'].unique()) for sheet in compare_sheets]

[32, 34]

No, not exactly.

I need to check if all the dates in **Sweep Samples Cereals** are in **Sheet2**.

In [10]:
ssc.Collection_Date.isin(s2.Collection_Date).all()

True

**Sweep Samples Cereals** is most likely either a subset of **Sheet2**, or a reduced version of the same source data. **Sheet2** actually has more unique dates. Therefore, some dates in **Sheet2** must be absent from **Sweep Samples Cereals**.

In [11]:
s2.Collection_Date.isin(ssc.Collection_Date).all()

False
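
Knowing that some dates are missing is less useful than knowing which ones. A minimal sketch of listing them, with two made-up date columns (`reduced`, `fuller`) in place of `ssc.Collection_Date` and `s2.Collection_Date`:

```python
import pandas

# Hypothetical date columns: the second has two dates the first lacks.
reduced = pandas.Series(pandas.to_datetime(['2016-06-05', '2016-06-12']))
fuller = pandas.Series(pandas.to_datetime(
    ['2016-06-05', '2016-06-12', '2016-06-19', '2016-06-19', '2016-06-26']))

# Keep the dates of the fuller series that never occur in the reduced one,
# drop repeats, and sort them chronologically.
missing = fuller[~fuller.isin(reduced)].drop_duplicates().sort_values()
```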

### Dimensions

So, both sheets cover the same date range. **Sheet2** has two additional dates. I'm curious if one is aggregated from the other, especially since I see a "total sweeps" column in **Sweep Samples Cereals**.

In [12]:
ssc['Total Sweeps'].head()

0    120
1    120
2     60
3    120
4     80
Name: Total Sweeps, dtype: int64

I bet the sweeps were at different distances, and later reduced to sums. If I peek at the `Distance(m)` column, I should see a clear difference.

In [13]:
ssc['Distance(m)'].unique()

array(['Combined'], dtype=object)

In [14]:
s2['Distance(m)'].unique()

array([0, 5, 10, 25, 50, 100, 'Combined'], dtype=object)

Indeed, **Sheet2** has observations at various "distances", while **Sweep Samples Cereals** has only the constant value "Combined". Since I don't want reduced (aggregated) data, I don't want **Sweep Samples Cereals**.

**Sheet2** has more rows because it isn't totalling up the sweeps from various distances. That makes **Sheet2** less reduced, and more "raw".
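
My guess that the "Combined" rows are sums of the per-distance rows is still a guess. Since **Sheet2** contains both kinds of row, it could be tested within that one sheet. The sketch below uses made-up data; `count` is a placeholder for any tally column, and treating `Collection_Date` plus `Field_name` as the grouping key is my assumption.

```python
import pandas

# Hypothetical rows: one field and date sampled at three distances,
# plus the matching 'Combined' row.
toy = pandas.DataFrame({
    'Collection_Date': ['2016-06-05'] * 4,
    'Field_name': ['A'] * 4,
    'Distance(m)': [0, 5, 10, 'Combined'],
    'count': [3, 1, 2, 6],  # placeholder for any insect tally column
})

keys = ['Collection_Date', 'Field_name']  # assumed to identify one sampling event
is_combined = toy['Distance(m)'] == 'Combined'

# Sum the per-distance rows and line them up against the stored 'Combined' rows.
summed = toy[~is_combined].groupby(keys)['count'].sum()
stored = toy[is_combined].set_index(keys)['count']
matches = summed.eq(stored)
```

Any `False` in `matches` would mean a stored total disagrees with the rows it supposedly summarizes.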

### Conclusion – Optimal Candidate

Winner of the prize for completeness in data about sweep samples:

- **Sheet2**

I'll consider this my best source of truth for "sweep" data.

## Sweep Samples Cereals Edited vs Sheet2

In [15]:
ssce = sheets['Sweep Samples Cereals Edited']
compare_sheets = ssce, s2

### Columns

In [16]:
pandas.options.display.max_rows = 140
pandas.DataFrame(data=[sheet.columns for sheet in compare_sheets],
                 index=['Sweep Samples Cereals Edited', 'Sheet2']).T

Unnamed: 0,Sweep Samples Cereals Edited,Sheet2
0,Date,ID
1,Site,Province
2,Crop,Collection_Date
3,Field_name,Sample_by_week
4,Sample Type,Date_by_week
5,Total Sweeps,Date
6,Unnamed: 6,Julian_date
7,Unnamed: 7,Site
8,EGA_alate,Field_name
9,EGA_apt,Crop


**Sweep Samples Cereals Edited** seems to have left out some columns that would be expected to carry finely categorized subjects, such as various instars of aphids and leafhoppers. This makes it less likely to have information I'll need. 

Furthermore, it appears that **Sweep Samples Cereals Edited** has a column that is probably an artefact from aggregation: `Total Sweeps`. It's also missing the `Distance(m)` column; another sign of aggregation, and therefore loss of some information.

I hope I can ignore the whole worksheet, just as I'll be doing with **Sweep Samples Cereals** ("unedited"). Unless further analysis reveals that **Sweep Samples Cereals Edited** has dates that are missing from **Sheet2**, I'll assume it's not worth closer examination.

### Rows

In [17]:
print(f'{ssce.index.size / s2.index.size:.0%}')

14%


**Sweep Samples Cereals Edited** has 14% as many rows as **Sheet2**, so it's not likely to be useful, unless the dates don't fully overlap.

### Dates

In [18]:
def index_size(pandas_object):
    return len(pandas_object.unique())


descriptors = [pandas.Series.max, pandas.Series.min, index_size]
ssce.Date.apply(descriptors)

max           2016-08-15 00:00:00
min           2016-06-05 00:00:00
index_size                     31
Name: Date, dtype: object

In [19]:
s2.Collection_Date.apply(descriptors)

max           2017-07-08 00:00:00
min           2016-06-05 00:00:00
index_size                     34
Name: Collection_Date, dtype: object

It appears that **Sweep Samples Cereals Edited** covers a shorter date range than **Sheet2**: its latest date is 2016-08-15, while **Sheet2** runs to 2017-07-08. It also has fewer unique dates (31 versus 34), so I probably won't be missing anything if I ignore it. To be sure, I need to check if all the dates in **Sweep Samples Cereals Edited** are in **Sheet2**.

In [20]:
ssce.Date.isin(s2.Collection_Date).all()

True

Excellent. I see no reason to pay attention to **Sweep Samples Cereals Edited** or **Sweep Samples Cereals** anymore. Perhaps if I have time after fixing the graph, I will trace the cause of the error, and it may lead me back to one of those worksheets.
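
As an aside, the date summaries in this section could probably be produced without a custom helper, by passing pandas's built-in reduction names to `Series.agg`. A small sketch on made-up dates:

```python
import pandas

# Hypothetical date column with one repeated value.
dates = pandas.Series(pandas.to_datetime(['2016-06-05', '2016-06-05', '2016-08-15']))

# String names select the built-in reductions; 'nunique' counts distinct values.
summary = dates.agg(['min', 'max', 'nunique'])
```

The label `nunique` also describes the number more accurately than `index_size` does.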

### Conclusion – Optimal Candidate

Winner of the prize for completeness in data about sweep samples:

- **Sheet2** (again)

I'll consider this my best source of truth for "sweep" data.

## leafhoppers 2016 cereal sweeps vs Sheet2

In [21]:
sheet_names = ['leafhoppers 2016 cereal sweeps', 'Sheet2']
lh2016, s2 = compare_sheets = [sheets[sheet_name] for sheet_name in sheet_names]

### Columns

In [22]:
columns = [frame.columns for frame in compare_sheets]
pandas.DataFrame(data=columns, index=sheet_names).T.head(lh2016.columns.size)

Unnamed: 0,leafhoppers 2016 cereal sweeps,Sheet2
0,Collection_Date,ID
1,Sample_by_week,Province
2,Date_by_week,Collection_Date
3,Date,Sample_by_week
4,Julian_date,Date_by_week
5,Site,Date
6,Field_name,Julian_date
7,Crop,Site
8,Distance(m),Field_name
9,Number of Samples,Crop


Clearly, **leafhoppers 2016 cereal sweeps** is focused on leafhoppers.

In [23]:
lh2016['Total Sweeps'].unique()

array([120,  60,  80,  20, 100,  40])

In [24]:
lh2016['Distance(m)'].unique()

array(['Combined'], dtype=object)

In [25]:
lh2016['Number of Samples'].unique()

array([6, 3, 4, 1, 5, 2])
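
The unique values of `Total Sweeps` and `Number of Samples` appear in matching order (120 with 6, 60 with 3, and so on), which hints at a fixed 20 sweeps per sample. Lists of unique values can't prove a row-by-row relationship, so it would need a check along these lines. The frame below is made up to mirror those values, and the ratio of 20 is my inference, not something stated in the workbook.

```python
import pandas

# Hypothetical rows mirroring the unique values shown above.
toy = pandas.DataFrame({
    'Number of Samples': [6, 3, 4, 1, 5, 2],
    'Total Sweeps': [120, 60, 80, 20, 100, 40],
})

# Assumed ratio: 20 sweeps per sample (an inference, not a documented fact).
SWEEPS_PER_SAMPLE = 20
consistent = toy['Total Sweeps'].eq(toy['Number of Samples'] * SWEEPS_PER_SAMPLE)

# Any rows breaking the assumed ratio would show up here.
violations = toy[~consistent]
```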

### Rows

In [26]:
row_counts = {sheet_name: sheet.index.size
              for sheet_name, sheet in zip(sheet_names, compare_sheets)}
(pandas.DataFrame(data=row_counts, index=['number of rows'])
    .rename_axis(['worksheet'], axis='columns')
    .T
    .sort_values(by='number of rows'))

worksheet,number of rows
leafhoppers 2016 cereal sweeps,92
Sheet2,668
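
This sheet has 92 rows, the same count as **Sweep Samples Cereals**. Equal row counts don't show that the two sheets describe the same sampling events, but an outer merge on key columns with `indicator=True` could. In this sketch the frames `a` and `b` are made up, and using `Collection_Date` plus `Field_name` as the key is my assumption.

```python
import pandas

keys = ['Collection_Date', 'Field_name']  # assumed to identify one sampling event

# Hypothetical key columns from two sheets that agree on only one event.
a = pandas.DataFrame({'Collection_Date': ['2016-06-05', '2016-06-12'],
                      'Field_name': ['A', 'B']})
b = pandas.DataFrame({'Collection_Date': ['2016-06-05', '2016-06-19'],
                      'Field_name': ['A', 'B']})

# The '_merge' column reports whether each key pair was found in
# the left frame only, the right frame only, or both.
merged = a[keys].drop_duplicates().merge(
    b[keys].drop_duplicates(), on=keys, how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
```

An empty `unmatched` would support treating one sheet as a column-wise subset of the other.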
