## Purpose

Cleaning data, comparing numbers in two time series.

Main problem to solve:

- What is the relationship between observations recorded as "head counts" versus "sweeps"?

## Import Data

In [1]:
import pandas

data_file = pandas.ExcelFile('2016 combination.xlsx')
sheets = {sheet_name: data_file.parse(sheet_name)
          for sheet_name in data_file.sheet_names}

## Explore Worksheets

!["Perfectly straight line in graph comparing supposedly independent variables in Head Count and Sweeps workbook, in Excel Online"](head-count-sweeps-graph-excel-online.png)

When I opened the workbook in Excel Online, I saw many sheets with rather unhelpful names, and what looked like a lot of data that had been copied from other worksheets.

- There are two sources of actual observational data:
  * "cereal sweeps" or just "sweeps"
  * "head counts" or "tillers"
- There's a graph in a worksheet called "head counts vs sweeps graphs" which demonstrates the analytical problem found by someone else.
- Multiple editors have made changes or additions to the workbook, and nobody left notes.
- It's not clear which data is "original" and which is duplicated.
- The data supposedly being compared in the graph cannot be the data that was intended for comparison.

Because I can't trust that the data used in the graph was prepared accurately, I need to look at all the sheets and determine the most complete and unadulterated data sets. I'll determine which data belongs to each category, and compare the sets.

### Compare Columns

I'll begin by listing the columns of all sheets, which will offer some descriptive terms for the data in each:

In [2]:
pandas.DataFrame(index=sheets.keys(), data=[frame.columns for frame in sheets.values()])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,134,135,136,137,138,139,140,141,142,143
Head Counts,Site,Crop,Date,Field,Zadoks_stage,Tiller,EGA_head,EGA_leaf,BCO_head,BCO_leaf,...,,,,,,,,,,
Sheet2,ID,Province,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,...,Hymenoptera_Figitidae,Hymenoptera_Aphelinidae,Hymenoptera_Perilampidae,Hymenoptera_Chalcidoidea,Hymenoptera_Ichneumondoidea,Hymenoptera_Proctotrupoidea,,,,
Sweep Samples Cereals,ID,Province,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,...,Hymenoptera_Proctotrupidae,Hymenoptera_Pteromalidae,Hymenoptera_Apidae,Hymenoptera_Diplazontinae,Hymenoptera_Figitidae,Hymenoptera_Aphelinidae,Hymenoptera_Perilampidae,Hymenoptera_Chalcidoidea,Hymenoptera_Ichneumondoidea,Hymenoptera_Proctotrupoidea
Head Counts Edited,Date,Site,Crop,Field,Sample Type,Unnamed: 5,Zadoks_stage,Tiller,EGA_alate,EGA_apt,...,,,,,,,,,,
Sweep Samples Cereals Edited,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Unnamed: 6,Unnamed: 7,EGA_alate,EGA_apt,...,,,,,,,,,,
Data Sheets Combined,ID,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Zadoks_stage,Tiller,EGA_alate,...,,,,,,,,,,
Pivot chart LH phen,,,,,,,,,,,...,,,,,,,,,,
leafhoppers 2016 cereal sweeps,Collection_Date,Sample_by_week,Date_by_week,Date,Julian_date,Site,Field_name,Crop,Distance(m),Number of Samples,...,,,,,,,,,,
Sheet3,,,,,,,,,,,...,,,,,,,,,,
aphid sweep vs head count,ID,Date,Site,Crop,Field_name,Sample Type,Total Sweeps,Zadoks_stage,Tiller,EGA_alate,...,,,,,,,,,,


Based on sheet names and similarities between column sets, I can probably group the sheets like so:

- Head counts:
  - Head Counts
  - Head Counts Edited
- Sweep:
  - Sheet2
  - Sweep Samples Cereals
  - Sweep Samples Cereals Edited
  - leafhoppers 2016 cereal sweeps
- United:
  - Data Sheets Combined
- Analytical experiments:
  - Pivot chart LH phen
  - Sheet3
  - aphid sweep vs head count
  - head counts vs sweeps graphs
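
The grouping above was done by eye. One rough way to sanity-check it is to score how much each pair of sheets overlaps in column names. This is only a minimal sketch: `columns_by_sheet` is a hand-made mapping holding a few column names per sheet copied from the listing above, and `jaccard` is a helper introduced here for illustration, not something from the workbook or from pandas.

```python
from itertools import combinations

# Toy subset of column names per sheet, copied from the listing above.
# The real sheets have many more columns.
columns_by_sheet = {
    'Head Counts': ['Site', 'Crop', 'Date', 'Field', 'Zadoks_stage', 'Tiller'],
    'Head Counts Edited': ['Date', 'Site', 'Crop', 'Field', 'Sample Type',
                           'Zadoks_stage', 'Tiller'],
    'Sheet2': ['ID', 'Province', 'Collection_Date', 'Sample_by_week',
               'Date', 'Site', 'Field_name', 'Crop'],
    'Sweep Samples Cereals': ['ID', 'Province', 'Collection_Date', 'Sample_by_week',
                              'Date', 'Site', 'Field_name', 'Crop'],
}


def jaccard(a, b):
    """Fraction of column names shared by both sheets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Score every unordered pair of sheets.
scores = {
    (left, right): jaccard(columns_by_sheet[left], columns_by_sheet[right])
    for left, right in combinations(columns_by_sheet, 2)
}

# Print the most similar pairs first.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In the notebook itself, the mapping could presumably be built from the loaded frames with `{name: frame.columns for name, frame in sheets.items()}`. High scores only suggest shared ancestry between sheets; they say nothing about whether the rows agree.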

#### Sheet2 vs Sweep Samples

I'd like to confirm that the vaguely named **Sheet2** is what it seems to be: a slightly edited copy of **Sweep Samples Cereals**.

In [3]:
sheet_names = ['Sweep Samples Cereals', 'Sheet2']
pandas.DataFrame(data=[sheets[sheet_name].columns for sheet_name in sheet_names], index=sheet_names).T

Unnamed: 0,Sweep Samples Cereals,Sheet2
0,ID,ID
1,Province,Province
2,Collection_Date,Collection_Date
3,Sample_by_week,Sample_by_week
4,Date_by_week,Date_by_week
5,Date,Date
6,Julian_date,Julian_date
7,Site,Site
8,Field_name,Field_name
9,Crop,Crop


That seems to confirm that **Sheet2** has the same columns as **Sweep Samples Cereals** except for four columns that were removed.
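
A side-by-side listing does not name the missing columns. A set difference on the column indexes would. Here is a minimal sketch using `pandas.Index.difference` on two toy frames; the frames `full` and `trimmed` and the choice of which column is "missing" are made up for illustration and do not reflect which columns were actually removed from **Sheet2**.

```python
import pandas

# Toy stand-ins for the two sheets; the real frames come from the workbook.
full = pandas.DataFrame(columns=['ID', 'Province', 'Site', 'Crop', 'Hymenoptera_Apidae'])
trimmed = pandas.DataFrame(columns=['ID', 'Province', 'Site', 'Crop'])

# Columns present in one frame but absent from the other.
only_in_full = full.columns.difference(trimmed.columns)
only_in_trimmed = trimmed.columns.difference(full.columns)

print(list(only_in_full))
print(list(only_in_trimmed))
```

Applied to the real data, the same two calls on `sheets['Sweep Samples Cereals'].columns` and `sheets['Sheet2'].columns` should list the removed columns by name, and show whether **Sheet2** gained any columns of its own.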

What about the rows? I'll check sizes:

In [4]:
sheet_names = ['Sweep Samples Cereals', 'Sheet2']
pandas.DataFrame(data=[sheets[sheet_name].index.size for sheet_name in sheet_names], index=sheet_names)

Unnamed: 0,0
Sweep Samples Cereals,92
Sheet2,668


It's pretty clear there is a lot more data in **Sheet2**. I suspect rows were added to it, though exactly which ones is hard to say. I'll have to take a closer look at the column values, especially the ones which can serve as indices for the series of observations.

Before I can easily examine dates in the worksheets, I should convert them to proper datetime values:

In [5]:
compare_sheets = ssc, s2 = sheets['Sweep Samples Cereals'], sheets['Sheet2']

In [6]:
for sheet in compare_sheets:
    sheet['Collection_Date'] = pandas.to_datetime(sheet['Collection_Date'], format='%d_%m_%Y')

Which sheet covers the earliest dates?

In [7]:
[sheet['Collection_Date'].min() for sheet in compare_sheets]

[Timestamp('2016-06-05 00:00:00'), Timestamp('2016-06-05 00:00:00')]

Neither.

Which sheet covers the latest dates?

In [8]:
[sheet['Collection_Date'].max() for sheet in compare_sheets]

[Timestamp('2017-07-08 00:00:00'), Timestamp('2017-07-08 00:00:00')]

Neither, again.

Are all the dates used in both?

In [9]:
[len(sheet['Collection_Date'].unique()) for sheet in compare_sheets]

[32, 34]

No, not exactly.

So, both sheets cover the same date range, but **Sheet2** has two additional dates. I'm curious if one is aggregated from the other, especially since I see a _Total Sweeps_ column in **Sweep Samples Cereals**.
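
Counting unique dates shows that the sheets differ, but not where. A small sketch of how the extra dates could be pulled out, using `Series.isin` on two toy date columns; `dates_a` and `dates_b` and their values are invented for illustration.

```python
import pandas

# Toy stand-ins for the Collection_Date column of each sheet.
dates_a = pandas.Series(pandas.to_datetime(
    ['2016-06-05', '2016-06-12', '2016-06-12']))
dates_b = pandas.Series(pandas.to_datetime(
    ['2016-06-05', '2016-06-08', '2016-06-12', '2016-06-19']))

# Keep the dates from b that never occur in a, once each, in date order.
extra_dates = dates_b[~dates_b.isin(dates_a)].drop_duplicates().sort_values()

print(extra_dates.tolist())
```

With the real frames, `dates_a` would be `ssc['Collection_Date']` and `dates_b` would be `s2['Collection_Date']`, after the datetime conversion above.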

In [10]:
ssc['Total Sweeps'].head()

0    120
1    120
2     60
3    120
4     80
Name: Total Sweeps, dtype: int64

I bet the sweeps were made at different distances. If I peek at the _Distance(m)_ column, I should see a clear difference.

In [11]:
ssc['Distance(m)'].unique()[:10]

array(['Combined'], dtype=object)

In [12]:
s2['Distance(m)'].unique()[:10]

array([0, 5, 10, 25, 50, 100, 'Combined'], dtype=object)

Winner of the prize for completeness:

- **Sheet2**

**Sheet2** has more rows because it isn't totalling up the sweeps from various distances. That makes **Sheet2** less derivative, and more "original". I'll consider this my best source of truth for "sweep" data.
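
Whether the 'Combined' rows really are the sums of the per-distance rows has not been checked here. The following sketch shows one way such a check might look on toy data. The `Aphids` column and all values are hypothetical, and the real data would likely need more grouping keys (for example `Collection_Date`, `Site` and `Field_name`) than the single `Site` key used below.

```python
import pandas

# Toy data: `Aphids` is a hypothetical stand-in for any count column.
toy = pandas.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Distance(m)': [0, 5, 'Combined', 0, 5, 'Combined'],
    'Aphids': [3, 4, 7, 1, 2, 3],
})

# Rows that already hold a total, as opposed to a single distance.
is_combined = toy['Distance(m)'] == 'Combined'

# Sum the per-distance rows for each site...
per_distance_totals = toy[~is_combined].groupby('Site')['Aphids'].sum()
# ...and line them up against the rows labelled 'Combined'.
combined_rows = toy[is_combined].set_index('Site')['Aphids']

# True wherever the labelled total equals the recomputed sum.
matches = per_distance_totals.eq(combined_rows)
print(matches)
```

Any `False` in `matches` would point at a group where the recorded total and the individual sweeps disagree, which is worth knowing before trusting either sheet.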

#### Sweep Samples Cereals Edited vs Sheet2

In [13]:
ssce = sheets['Sweep Samples Cereals Edited']
compare_sheets = ssce, s2

In [15]:
pandas.DataFrame(data=[sheet.columns for sheet in compare_sheets],
                 index=['Sweep Samples Cereals Edited', 'Sheet2']).T

Unnamed: 0,Sweep Samples Cereals Edited,Sheet2
0,Date,ID
1,Site,Province
2,Crop,Collection_Date
3,Field_name,Sample_by_week
4,Sample Type,Date_by_week
5,Total Sweeps,Date
6,Unnamed: 6,Julian_date
7,Unnamed: 7,Site
8,EGA_alate,Field_name
9,EGA_apt,Crop
