# Purpose

With Python and _pandas_, read two Excel worksheets, merge the data, and make the data available for analysis.

- Final data should be exported to Excel format, and should be easy to download.
- Results must be reproducible.

## Setup

To prepare, we need to be able to open the Excel files, display results of intermediate processing in the notebook, and avoid repeating large blocks of code.

Some functions can be imported, as they're already available publicly. Others will be made here, in this notebook.

In [53]:
# The display function isn't always imported by default in some Jupyter implementations.
# We'll probably use it a lot.
from IPython.display import display, HTML

Jupyter can output HTML if we want. Here are simple helpers to make headings
and other embellishments easy:

In [54]:
def html_wrap(content, element="span", attributes=None):
    """ Convenience method for wrapping any string in an HTML tag. """
    element_with_attributes = ' '.join(item for item in (element, attributes) if item is not None)
    tag_open = f"<{element_with_attributes}>"
    tag_close = f"</{element}>"
    return f"{tag_open}{content}{tag_close}"


def make_heading(content, level=5):
    """ Convenience method for wrapping any string in an HTML header tag. """
    element = f"h{level}"
    return html_wrap(content, element)


def heading(*args, **kwargs):
    """ Call IPython.display.HTML on the results of the make_heading function,
    so users don't have to repeatedly do `HTML(make_heading(…))` """
    return HTML(make_heading(*args, **kwargs))
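
To see what these helpers produce before the result is handed to `HTML`, here is a standalone sketch. It restates `html_wrap` and `make_heading` in compact form so it can run on its own, and the sample strings are made up for illustration:

```python
def html_wrap(content, element="span", attributes=None):
    """Wrap a string in an HTML tag, optionally with an attribute string."""
    opening = element if attributes is None else f"{element} {attributes}"
    return f"<{opening}>{content}</{element}>"


def make_heading(content, level=5):
    """Wrap a string in an HTML heading tag of the given level."""
    return html_wrap(content, f"h{level}")


plain = html_wrap("aphids")
bold_item = html_wrap("aphids", "li", "style='font-weight: bold'")
title = make_heading("Results", level=3)

print(plain)      # <span>aphids</span>
print(bold_item)  # <li style='font-weight: bold'>aphids</li>
print(title)      # <h3>Results</h3>
```

These helpers do no HTML escaping, so content that may contain `<` or `&` should be escaped first (for example with `html.escape` from the standard library).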

[_pandas_] is really good with columnar data, like Excel files:
[_pandas_]: https://pandas.pydata.org

In [55]:
import pandas

# The Excel workbook: `2017 CAM data from iPads.xlsx`

In [56]:
# The file I'm interested in parsing for cleanup:
file_path = "./src/real data/2017 CAM data from iPads/2017 CAM data from iPads.xlsx"

## Worksheets in the file

In [57]:
data_file = pandas.ExcelFile(file_path)
sorted(data_file.sheet_names)

['2017 CAM data Erl',
 '2017 CAM iPad data Tyler',
 'Combined iPad 2017 CAM data',
 'schema (WIP reverse engineer)']

I'm only interested in the first two, for now:

In [58]:
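# Same result as `_[:2]` straight after the previous cell, but safe to re-run out of order.
cam_sheet_names = sorted(data_file.sheet_names)[:2]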

Let's make a dictionary of DataFrames from those two sheets, using the last word of each sheet name as the key. In Python, an index of -1 means the last item of a sequence; negative indices count backward from the end.

In [59]:
def last_word(string, word_separator=' '):
    """ Split string into words (by separator), return last word. """
    return string.split(word_separator)[-1]


sheets = {last_word(sheet_name): data_file.parse(sheet_name)
          for sheet_name in cam_sheet_names}

In [60]:
# The keys are sheet names. Let's see what we've got:
sheets.keys()

dict_keys(['Erl', 'Tyler'])
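
As a quick illustration of the negative indexing used in `last_word`, here is a small sketch using one of the sheet names listed above:

```python
sheet_name = "2017 CAM iPad data Tyler"
words = sheet_name.split(" ")

print(words)                  # ['2017', 'CAM', 'iPad', 'data', 'Tyler']
print(words[-1])              # Tyler  (last item)
print(words[len(words) - 1])  # Tyler  (same item, counted from the front)
print(words[-2])              # data   (second from the end)
```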

Now that we have a convenient dictionary of sheets loaded as _pandas_ DataFrames, we can work toward merging them into one. Once they're merged, we can process the data more easily from a single DataFrame.

### Python dictionaries

Dictionaries in Python are just a collection of named things. The things can be another dictionary, a string, a number, or whatever. Even the names of the things don't have to be words; they can be numbers, for example.

In [61]:
# define a dictionary:
my_dictionary = {
    'one': 1,
    2: 'two',
    'green': 'I like colour green.',
    'another dictionary': {'more stuff': 1024,
                           'even more stuff': 2048},
    'a list': [1, 2, 3, 4]
}

In [62]:
# recall something from that dictionary:
my_dictionary["one"]

1

In [63]:
my_dictionary[2]

'two'

In [64]:
my_dictionary["another dictionary"]["even more stuff"]

2048

Note that an attempt to use a non-existent key results in a Python exception called a `KeyError`. This is helpful if you accidentally lose track of which keys are in the dictionary.

In [65]:
my_dictionary["some key that doesn't exist"]

KeyError: "some key that doesn't exist"

If you don't want this to happen, there are ways to get around it. Later on, we'll use something called a [`defaultdict`], which creates the missing key on the spot instead of stopping dead.

[`defaultdict`]: https://docs.python.org/3/library/collections.html?highlight=defaultdict#defaultdict-objects
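
Here is a minimal sketch of three common ways around a `KeyError`. The dictionary contents are made up for illustration:

```python
from collections import defaultdict

colours = {"green": "I like green."}

# dict.get returns a fallback value instead of raising KeyError.
known = colours.get("green")
fallback = colours.get("purple", "no opinion")
print(known)     # I like green.
print(fallback)  # no opinion

# `in` checks whether a key exists without looking up its value.
print("purple" in colours)  # False

# defaultdict calls its factory (here, `list`) to create missing values.
sightings = defaultdict(list)
sightings["Erl"].append("aphid")
print(dict(sightings))  # {'Erl': ['aphid']}
```

Note that `get` leaves the dictionary unchanged, while merely reading a missing key from a `defaultdict` adds that key.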

### Using `for` with dictionaries in Python

Python's `for` statements are handy for doing something to each item in a collection. Dictionaries are a type of collection, but since they have keys and values, you need to specify which you want. To address this, there are three handy methods that all dictionaries have:

- `keys()` - Only the key names
- `values()` - Only the values
- `items()` - Pairs of keys and values

In [None]:
for key in my_dictionary.keys():
    display(key)

In [None]:
for value in my_dictionary.values():
    display(value)

In [None]:
for item in my_dictionary.items():
    display(item)
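
Since `items()` yields pairs, the loop can unpack each pair into two names. A short sketch follows, with made-up counts:

```python
counts = {"a1": 3, "a2": 5, "a3": 0}  # illustrative values only

# Unpack each (key, value) pair directly in the `for` statement.
lines = []
for name, count in counts.items():
    lines.append(f"{name}: {count}")
print(lines)  # ['a1: 3', 'a2: 5', 'a3: 0']

# Looping over the dictionary itself is the same as looping over keys().
keys_default = [key for key in counts]
print(keys_default)  # ['a1', 'a2', 'a3']

# values() works with anything that accepts an iterable, such as sum().
total = sum(counts.values())
print(total)  # 8
```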

### Learning more about lists, dictionaries, and other data structures in Python

If you want to learn more about using dictionaries, see the official [Python tutorial on dictionaries], which is part of the page demonstrating other Python collections, such as [lists], which we'll be using extensively.

[Python tutorial on dictionaries]: https://docs.python.org/3/tutorial/datastructures.html#dictionaries
[lists]: https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

## Uniting the DataFrames (worksheets)

### _pandas_ functions for merging DataFrames

_pandas_ has multiple methods for combining datasets:

- [`concat`]: The concatenate method has an option to ignore row numbers, which has the effect of gluing each dataframe to the bottom of the previous one.
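- [`append`]: Append is nearly equivalent to `concat` for stacking rows. It returns a new object rather than changing the original, and `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so `concat` is the safer choice.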
- [`merge`]: Merge is more for database-style relational merges.

[`concat`]: https://pandas.pydata.org/pandas-docs/stable/merging.html
[`append`]: https://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
[`merge`]: https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

From reading up on all three methods, it looks to me like `concat` will suffice here, as long as the column names are identical. I prefer creating a new DataFrame rather than overwriting anything, because the results are easier to reproduce.
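
To make the behaviour concrete, here is a small sketch with two toy DataFrames. The frames and their values are invented, and only the column-name mismatch mirrors the kind of variation real sheets may have:

```python
import pandas

# Two toy "sheets" whose column names only partly match.
erl = pandas.DataFrame({"growthStage Zadoks": [7.0, 8.0], "desc": ["a", "b"]})
tyler = pandas.DataFrame({"growthStage": [6.0], "desc": ["c"]})

# ignore_index=True discards the original row labels and renumbers from 0.
combined = pandas.concat({"Erl": erl, "Tyler": tyler}, ignore_index=True)

print(len(combined))                    # 3
print(list(combined.index))             # [0, 1, 2]
print(sorted(combined.columns))         # ['desc', 'growthStage', 'growthStage Zadoks']
print(combined["growthStage"].count())  # 1 (the other rows are NaN)
print(len(erl), len(tyler))             # 2 1 (the inputs are unchanged)
```

Columns with matching names are stacked, while a column that exists in only one input is kept and filled with `NaN` for the rows from the other input.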

Let's see how close that concatenation method gets us:

In [66]:
pandas.concat(sheets, ignore_index=True).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4690 entries, 0 to 4689
Data columns (total 56 columns):
clients__company                                                    1 non-null object
clients__displayText                                                1 non-null object
clients__fname                                                      1 non-null object
clients__lname                                                      1 non-null object
clients__name                                                       1 non-null object
fields__client__company                                             21 non-null object
fields__client__displayText                                         21 non-null object
fields__client__fname                                               21 non-null object
fields__client__lname                                               21 non-null object
fields__client__name                                                21 non-null object
fields__crop                

We can see the RangeIndex size in the info: 

>4690 entries, 0 to 4689

That means 4690 "rows", if this were a spreadsheet.

Compare that to the total row count for all sheets:

In [67]:
sum(len(sheet) for sheet in sheets.values())

4690

Pretty good, so far, except for some variations in the column names—I see `growthStage` and `growthStage Zadoks`, as well as some aphid names amongst the `a1__number`, `a2__number`, `a3__number` group:

```
fields__oSets__growthStage                                  45 non-null float64
fields__oSets__growthStage Zadoks                           9 non-null float64
```

```
fields__oSets__oPoints__observations__a1__number            324 non-null float64
fields__oSets__oPoints__observations__a1__number EGA        124 non-null float64
fields__oSets__oPoints__observations__a2__number            112 non-null float64
fields__oSets__oPoints__observations__a2__number BCO        30 non-null float64
fields__oSets__oPoints__observations__a3__number            25 non-null float64
fields__oSets__oPoints__observations__a3__number Greenbug   0 non-null float64
```

The total number of rows is correct, and the other columns line up when the names match.

Column names (which become Series names in _pandas_) are the only problem standing in the way of a successful unification of DataFrames.

## Preparing to merge: column names

It's already pretty obvious what the naming problem is. But just to demonstrate how to use _pandas_ to calculate differences, let's quickly take a look again at our DataFrames:

In [68]:
sheets.keys()

dict_keys(['Erl', 'Tyler'])

The difference in column names, using 'Erl' as the base:

In [69]:
display(set.difference(*[set(sheet.columns) for sheet in sheets.values()]))

{'fields__oSets__growthStage Zadoks',
 'fields__oSets__oPoints__observations__a1__number EGA',
 'fields__oSets__oPoints__observations__a2__number BCO',
 'fields__oSets__oPoints__observations__a3__number Greenbug',
 'fields__oSets__oPoints__observations__anum TotalAPhids',
 'fields__oSets__oPoints__observations__eVnum Natural enemy totals'}

Or, if we want to list the names of columns from *all* sheets that don't have a match (the symmetric difference):

In [70]:
column_list = sorted(set.symmetric_difference(*[set(sheet.columns) for sheet in sheets.values()]))
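
The two set operations used here behave as follows on a pair of small, made-up sets of column names:

```python
erl_columns = {"date", "desc", "growthStage Zadoks"}
tyler_columns = {"date", "desc", "growthStage"}

# Difference: names in the first set that are missing from the second.
only_in_erl = erl_columns - tyler_columns
print(only_in_erl)  # {'growthStage Zadoks'}

# Symmetric difference: names that appear in exactly one of the two sets.
unmatched = erl_columns ^ tyler_columns
print(sorted(unmatched))  # ['growthStage', 'growthStage Zadoks']

# The function-call forms give the same results as the operators.
same_difference = set.difference(erl_columns, tyler_columns)
same_symmetric = set.symmetric_difference(erl_columns, tyler_columns)
```

Difference depends on which set comes first; symmetric difference does not. `set.symmetric_difference` accepts only two sets, so the unpacking form above would need adjusting for three or more sheets.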

Before we display the symmetric difference, let's pop it into a two column layout for _pandas_ to display:

In [98]:
# Display entire column, even if cell data is long
pandas.set_option('display.max_colwidth', 0)  # Zero means no limit

# Show the differing column names side-by-side
display(pandas.DataFrame([*zip(*[iter(column_list)] * 2)]))

# Revert the setting that we changed
pandas.reset_option('display.max_colwidth')


|   | 0 | 1 |
|---|---|---|
| 0 | `fields__oSets__growthStage` | `fields__oSets__growthStage Zadoks` |
| 1 | `fields__oSets__oPoints__observations__a1__number` | `fields__oSets__oPoints__observations__a1__number EGA` |
| 2 | `fields__oSets__oPoints__observations__a2__number` | `fields__oSets__oPoints__observations__a2__number BCO` |
| 3 | `fields__oSets__oPoints__observations__a3__number` | `fields__oSets__oPoints__observations__a3__number Greenbug` |
| 4 | `fields__oSets__oPoints__observations__anum` | `fields__oSets__oPoints__observations__anum TotalAPhids` |
| 5 | `fields__oSets__oPoints__observations__eVnum` | `fields__oSets__oPoints__observations__eVnum Natural enemy totals` |
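
The expression `zip(*[iter(column_list)] * 2)` is compact but cryptic. The sketch below unpacks it on a short, made-up list:

```python
names = ["a1", "a1 EGA", "a2", "a2 BCO", "a3", "a3 Greenbug"]

# The SAME iterator is passed to zip twice, so each tuple takes two
# consecutive items from the list.
shared_iterator = iter(names)
pairs = list(zip(shared_iterator, shared_iterator))
print(pairs)  # [('a1', 'a1 EGA'), ('a2', 'a2 BCO'), ('a3', 'a3 Greenbug')]

# The one-line form used in the notebook is equivalent.
compact = list(zip(*[iter(names)] * 2))

# A slicing version that some readers may find easier to follow.
explicit = list(zip(names[0::2], names[1::2]))
```

This pairing only lines up because sorting places each short name directly before its longer variant. If the list had an odd number of items, `zip` would silently drop the last one.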


When we concatenate, we want every sheet to use the shorter, more regular names. Which sheet needs fixing, that is, which one has `fields__oSets__growthStage Zadoks` instead of `fields__oSets__growthStage`? I believe "Erl" was the base for our first comparison, which would make it the sheet with the extra names, but let's confirm before changing anything:

In [72]:
# make a filtered dictionary, for later reference
sheets_with_bad_column_names = {sheet_name: sheet for sheet_name, sheet in sheets.items() if 'fields__oSets__growthStage Zadoks' in sheet.columns}
display(heading('Bad:'),
        set(sheets_with_bad_column_names.keys()))

{'Erl'}

In [73]:
display(heading('Good:'),
        set(sheets.keys()) - set(sheets_with_bad_column_names.keys()))

{'Tyler'}

Okay, "Tyler" has the column names we prefer, "Erl" does not. Duly noted.

There's now a `sheets_with_bad_column_names` variable we can use when we fix that. (It's only got one DataFrame, but it's still good practice to be ready for batch processing in case this code is reused in the future.)

### How shall we solve the name mismatch?

Our options:

- rename the columns
- try to merge the sheets in a way that ignores the column names of the bad sheet

If we try to do a special merge, ignoring the column names, we have to worry about the _precise order_ of those columns instead of relying on the names. Since we'd have to spend time checking—and possibly rearranging—the column sequence, we might as well spend that time creating a _reusable_, documented solution for quick and painless name fixing.

### How (where) should we rename the columns/indices?

Possibilities:

- fix the spreadsheet document in Excel format, and move ahead as if this had never happened
- keep the names in the file, but correct the names in memory once they become indices

Since this notebook you're reading has already mentioned the problem, let's go ahead and solve it here. That way our notebook will detail the fix and how to use it in the future, if there are more Excel spreadsheets with similarly altered names.

### Renaming indices (columns or rows) with _pandas_ functions

_pandas_ has a [`rename`] method which lets us apply a transform function to the indices named after our worksheet columns (or explicitly map each column through a dictionary).

[`rename`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

Since the bad names merely have extra words tacked onto the end, let's just split the name and use the first "word". In Python, we get the first item of a sequence by using index zero:

In [74]:
def first_word(string, word_separator=' '):
    """ Split string into words (by space character), return first word. """
    return string.split(word_separator)[0]

We just need to apply this function to every column name in the DataFrame, through the _pandas_ rename function. For the sake of visualizing the changes, let's also report on them in a dry run before using the _pandas_ rename function.

In [75]:
report = []  # For the lines of the report which we'll display later.

for sheet_name, sheet in sheets_with_bad_column_names.items():
    report.append(make_heading(sheet_name, level=3))  # heading
    report_lines = []  # the body of the report for each DataFrame
    for column_name in sheet.columns:
        after = first_word(column_name)
        if after == column_name:
            content = column_name
            attributes = None
        else:
            content = f"{column_name} &rarr; {after}"  # old --> new
            attributes = "style='font-weight: bold'"
        report_lines.append(html_wrap(content, 'li', attributes))  # add an HTML list item
    report.append(html_wrap(''.join(report_lines), 'ol')) # add an ordered list of items to the report
    
# join all the strings together and display as HTML
display(HTML(''.join(report)))

Actually change the names:

In [76]:
for sheet in sheets_with_bad_column_names.values():
    sheet.rename(mapper=first_word, axis='columns', inplace=True)

As mentioned before, we know there's only one sheet, so the `for` loop isn't absolutely necessary, but it's still a good habit when dealing with reusable scripts that can work on batches of datasets.
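
For comparison, here is a sketch of `rename` on a toy DataFrame without `inplace=True`, which returns a renamed copy and leaves the original alone. The frame is invented, and `first_word` is restated so the block runs on its own:

```python
import pandas


def first_word(string, word_separator=" "):
    """Split string into words and return the first one."""
    return string.split(word_separator)[0]


frame = pandas.DataFrame({"growthStage Zadoks": [7.0], "desc": ["x"]})

# Passing a function: it is called once per column name.
renamed = frame.rename(columns=first_word)
print(list(renamed.columns))  # ['growthStage', 'desc']
print(list(frame.columns))    # ['growthStage Zadoks', 'desc'] (unchanged)

# Passing a dictionary: only the listed names are changed.
mapped = frame.rename(columns={"growthStage Zadoks": "growthStage"})
print(list(mapped.columns))   # ['growthStage', 'desc']
```

The function form is convenient for batches, but it would also shorten any legitimate column name that contains a space. The dictionary form is more explicit when that is a risk.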

## Making sense of really long, repetitive column names

_pandas_ has a concept of [indexing hierarchically], which may help group columns together. However, I'm new to _pandas_, though not to Python. Furthermore, from what I've read, if we use a multi-level labelling system for grouping, we have to pay attention to which level we're operating on as we work with the spreadsheet as a DataFrame.

To avoid complications due to my own ignorance, let's use a nested dictionary in Python to accomplish the same thing without _pandas_. We can probably use the dictionary to help us add more indices to the DataFrame later, if we need.

[indexing hierarchically]: https://pandas.pydata.org/pandas-docs/stable/advanced.html

### Nested dictionaries, for a hierarchy

Instead of a regular dictionary in Python, let's use something called a [`defaultdict`], which will simplify the creation of our hierarchy. [`defaultdict`] is like a regular dictionary, except it doesn't complain if you try to access a key that doesn't exist yet—it just adds it.

[`defaultdict`]: https://docs.python.org/3/library/collections.html#collections.defaultdict
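
Here is a minimal sketch of a self-nesting `defaultdict`, independent of any spreadsheet data. The keys and the placeholder string are illustrative:

```python
from collections import defaultdict


def nested_dict():
    """Factory that builds a defaultdict whose missing values are more of the same."""
    return defaultdict(nested_dict)


tree = nested_dict()

# Every intermediate level is created automatically on first access.
tree["fields"]["oSets"]["date"]["__"] = "data goes here"
print(list(tree.keys()))                     # ['fields']
print(list(tree["fields"]["oSets"].keys()))  # ['date']

# Setting default_factory to None restores KeyError, but only on that one object.
tree.default_factory = None
try:
    tree["typo"]
    top_level_raised = False
except KeyError:
    top_level_raised = True
print(top_level_raised)  # True

# Nested levels keep their own factory and still create keys on access.
tree["fields"]["oops"]
print("oops" in tree["fields"])  # True
```

The convenience cuts both ways: a mistyped key is created silently instead of being reported, which is why turning the factory off after the tree is built can be worthwhile.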

The following function looks long, but it's mostly comments for your benefit.

In [77]:
from collections import defaultdict

def split_column_to_dict(sheet, column_name, column_dictionary=None, separator='__'):
    """ Split the column names like "fields__oSets__oPoints__observations" into groupings of keys
    so that related keys are easy to find, ie columns['fields']['oSets']['oPoints']['observations'].
    This produces a tree of column name segments, with references to actual data at the ends."""

    # If a dictionary is not provided, make an empty one.
    if column_dictionary is None:
        def nested_dict():
            """ This function will be called by defaultdict
            whenever a non-existent key is used. """
            return defaultdict(nested_dict)
        
        column_dictionary = nested_dict()

    # Set a pointer to the root of the tree, for starters.
    pointer = column_dictionary
    
    # Now, walk through the segments in order from left to right,
    # touching the tree for each one.
    for segment in str(column_name).split(separator):
        # Just update the pointer to the deeper location in the tree.
        pointer = pointer[segment]
    # Now that the loop is done, the pointer is pointing at the deepest
    # level of the branch, which either already existed or it was created.
    
    # At the end of the branch, put the data.
    # NOTE: To avoid naming conflicts with pandas magic attributes
    # (such as "number"), the reference to data is in a node with a name that
    # can't possibly be part of naming levels: the separator ('__') 
    pointer[separator] = sheet[column_name]  # the actual pandas "column"

    # Since `pointer` was actually just pointing to parts of the column dictionary,
    # the column dictionary has been filled out with nodes because of how defaultdict
    # was setup with our `nested_dict` constructor.
    return column_dictionary


In [78]:
# Now, make a dictionary of column trees, grouped by sheet name.
column_dictionary = {}
separator = '__'
for sheet_name, sheet in sheets.items():
    # Build a new nested column dictionary for this sheet
    new_dict = None
    for column in sheet.columns:
        new_dict = split_column_to_dict(sheet, column, new_dict, separator)
    # Turn off the defaultdict behaviour of creating a key instead of raising an exception.
    # NOTE: this only affects the top level of the tree; nested levels still create keys on access.
    new_dict.default_factory = None
    column_dictionary[sheet_name] = new_dict


If this worked, there should now be a list of sheets at the top:

In [79]:
column_dictionary.keys()

dict_keys(['Erl', 'Tyler'])

Then there should be a list of the first segments of all the column names in that sheet:

In [80]:
column_dictionary['Erl'].keys()

dict_keys(['fields', 'clients', 'observers'])

Continuing deeper, more segments that share a common prefix:

In [81]:
column_dictionary['Erl']['fields'].keys()

dict_keys(['client', 'name', 'crop', 'desc', 'image', 'date', 'oSets'])

In [82]:
column_dictionary['Erl']['fields']['oSets'].keys()

dict_keys(['date', 'dateCompare', 'growthStage', 'desc', 'obsName', 'totalSets', 'completeSets', 'results', 'oPoints', 'totalA1', 'totalA2', 'totalA3', 'totalA4'])

At the end of each branch in the tree should be a '\_\_' key for the actual data. For example, the `date` of the set:

In [83]:
column_dictionary['Erl']['fields']['oSets']['date'].keys()

dict_keys(['__'])

Furthermore, if we want to skip over blanks as we scan down a column, _pandas_ has a [`dropna()`] method for that:

[`dropna()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

In [84]:
column_dictionary['Erl']['fields']['oSets']['date'][separator].dropna()

0      2017-08-02T13:12:09.542
70     2017-08-09T09:25:11.710
140    2017-08-09T10:06:25.480
210    2017-08-09T11:21:01.555
350    2017-08-09T11:37:20.862
490    2017-08-22T15:42:05.751
560    2017-08-17T11:12:02.820
700    2017-08-17T13:06:30.183
840    2017-08-22T16:02:50.682
Name: fields__oSets__date, dtype: object

We now have enough information to find the beginning of chunks in the sheet by scanning for non-blank cells.

As a side note, the number of non-null (not empty) records in any given column was displayed when we called [`info()`] on the DataFrame. We can also get information like that from the column (which is a Series in _pandas_) by using [`describe()`] or [`count()`]:

[`info()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
[`describe()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html
[`count()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.count.html

In [85]:
column_dictionary['Erl']['fields']['oSets']['date'][separator].describe()

count                           9
unique                          9
top       2017-08-09T09:25:11.710
freq                            1
Name: fields__oSets__date, dtype: object
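
A small sketch of how `dropna()`, `count()`, and `len()` relate, using a made-up Series with blanks:

```python
import pandas

dates = pandas.Series(
    ["2017-08-02", None, None, "2017-08-09", None],
    name="fields__oSets__date",
)

# dropna() keeps the non-blank entries along with their original row labels.
present = dates.dropna()
print(list(present.index))  # [0, 3]

# count() ignores blanks; len() does not.
print(dates.count())  # 2
print(len(dates))     # 5
```

Because the surviving index labels are the original row numbers, they mark where each non-blank entry sits in the sheet.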

### Helper functions, for browsing the data

Once we start actually reading groups of columns based on which type of chunk we're reading from, we'll have to start typing lists of column names for each section.

It's a bit of extra work to pay attention to whether a given node in the hierarchy is holding a reference to a data Series (column) or whether it's only an intermediate step on the way to the end of the branch.

To save time and mental energy, we can filter columns by whether they have data or not, if we make some simple filter functions. For detecting data that we've placed in the hierarchy of columns, we can look for the special key that we chose earlier (the separator, which is two underscore characters `__`).

In [86]:
def has_data(node):
    """ Filter children that have data. We know a child item has data
    if it has a key that's just the separator string. """
    return {parent_key: child for parent_key, child in node.items()
            if separator in child.keys()}


def has_children(node):
    """ Filter children that have children. We know an item has children
    if it has at least one key that isn't just the separator string, the
    special key for data references. """
    return {parent_key: child for parent_key, child in node.items()
            if len([key for key in child.keys() if key != separator]) >= 1}

In [87]:
# Name the sheet explicitly instead of relying on `sheet_name` left over from an earlier loop.
example_node = column_dictionary['Tyler']['fields']['oSets']['oPoints']['observations']
display(heading('children with children:'),
        has_children(example_node).keys())

dict_keys(['a1', 'a2', 'a3', '|'])

In [88]:
display(heading('children with data:'),
        has_data(example_node).keys())

dict_keys(['id', 'name', 'enum', 'eVnum', 'anum', 'disabled', 'complete', '|'])

In [89]:
# Set of keys for nodes that have children but also data:
display(heading('children with children & data:'),
        set(has_children(example_node).keys()) & set(has_data(example_node).keys()))

{'|'}
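
The filter logic is easier to follow on a tiny hand-built tree. In this sketch, plain strings stand in for the _pandas_ Series, the `"extra"` key is hypothetical, and the two functions are restated with an explicit `separator` parameter so the block runs on its own:

```python
SEPARATOR = "__"  # the special key that marks "data lives here"


def has_data(node, separator=SEPARATOR):
    """Keep the children that hold a data reference."""
    return {key: child for key, child in node.items() if separator in child}


def has_children(node, separator=SEPARATOR):
    """Keep the children that have at least one non-data key."""
    return {key: child for key, child in node.items()
            if any(child_key != separator for child_key in child)}


toy_node = {
    "id": {SEPARATOR: "id column"},                       # data only
    "a1": {"number": {SEPARATOR: "a1 counts"}},           # children only
    "|": {SEPARATOR: "pipe column",
          "extra": {SEPARATOR: "nested column"}},         # both
}

with_data = sorted(has_data(toy_node))
with_children = sorted(has_children(toy_node))
with_both = set(with_data) & set(with_children)

print(with_data)      # ['id', '|']
print(with_children)  # ['a1', '|']
print(with_both)      # {'|'}
```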

#### Personal thought:

I have to wonder why the developers of the app that output this data chose to use an unpronounceable column name, and put something important there.

Regardless, we can effortlessly handle it now.

## Repeated elements: fields, sets, points, observations

Let's take a look at the columns containing data about observation sets:

In [90]:
for sheet_name, column_tree in column_dictionary.items():
    node = column_tree['fields']['oSets']
    date_column = node['date'][separator]
    columns = [child[separator].name for child in has_data(node).values()]
    display(heading(f"{sheet_name}:"),
            sheets[sheet_name].loc[date_column.notna(), columns].head(3))

|   | `fields__oSets__date` | `fields__oSets__dateCompare` | `fields__oSets__growthStage` | `fields__oSets__desc` | `fields__oSets__obsName` | `fields__oSets__totalSets` | `fields__oSets__completeSets` | `fields__oSets__results` | `fields__oSets__totalA1` | `fields__oSets__totalA2` | `fields__oSets__totalA3` | `fields__oSets__totalA4` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-02T13:12:09.542 | 2017-08-02 | 7.0 |  | Tyler | 1.0 | 0.0 |  |  |  |  |  |
| 70 | 2017-08-09T09:25:11.710 | 2017-08-09 | 8.0 |  | Tyler | 1.0 | 1.0 | RESULTS.5 | 164.0 | 0.0 | 0.0 | 0.0 |
| 140 | 2017-08-09T10:06:25.480 | 2017-08-09 | 7.0 |  | Tyler | 1.0 | 1.0 | RESULTS.5 | 66.0 | 0.0 | 0.0 | 0.0 |


|   | `fields__oSets__date` | `fields__oSets__dateCompare` | `fields__oSets__growthStage` | `fields__oSets__desc` | `fields__oSets__obsName` | `fields__oSets__totalSets` | `fields__oSets__completeSets` | `fields__oSets__results` | `fields__oSets__totalA1` | `fields__oSets__totalA2` | `fields__oSets__totalA3` | `fields__oSets__totalA4` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-07-14T12:31:24.194 | 2017-07-14 | 6.0 |  | Tyler | 1.0 | 0.0 |  |  |  |  |  |
| 70 | 2017-07-18T10:31:22.263 | 2017-07-18 | 6.0 |  | Tyler | 1.0 | 1.0 | RESULTS.5 | 8.0 | 0.0 | 0.0 | 0.0 |
| 140 | 2017-07-28T13:05:44.673 | 2017-07-28 | 8.0 |  | Mikki | 1.0 | 1.0 | RESULTS.5 | 37.0 | 0.0 | 0.0 | 0.0 |
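
The row selection in that cell combines a boolean mask with a list of column names. Here is a sketch of the same pattern on an invented DataFrame (the column names `set_date`, `point_id`, and `note` are hypothetical):

```python
import pandas

sheet = pandas.DataFrame({
    "set_date": ["2017-08-02", None, None, "2017-08-09", None],
    "point_id": [0, 1, 2, 0, 1],
    "note": ["start", "", "", "start", ""],
})

# True wherever the date cell is filled in, i.e. where a new set begins.
mask = sheet["set_date"].notna()

# .loc takes the row selector first, then the column selector.
starts = sheet.loc[mask, ["set_date", "note"]]

print(list(starts.index))    # [0, 3]
print(list(starts.columns))  # ['set_date', 'note']
```

The mask must have the same index as the DataFrame it filters, which holds here because it is derived from one of that DataFrame's own columns.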


We can ask _pandas_ to drop all the blanks and see which rows remain for the column here (`fields__oSets__date`):

In [91]:
column_dictionary['Erl']['fields']['oSets']['date'][separator].dropna().index

Int64Index([0, 70, 140, 210, 350, 490, 560, 700, 840], dtype='int64')

How many 'oSets' sections of the file are there?

In [92]:
len(_)  # Based on the length of that list we just output

9

Good to know. What about the other sheet?

In [93]:
column_dictionary['Tyler']['fields']['oSets']['date'][separator].count()

45

In [94]:
_ + __  # add the last two numbers

54

So, we'll be processing 54 sets of observations.

How many observation _points_ will we be processing?

In [95]:
pointer = column_dictionary['Tyler']['fields']['oSets']
has_children(pointer)

{'oPoints': defaultdict(<function __main__.split_column_to_dict.<locals>.nested_dict>,
             {'id': defaultdict(<function __main__.split_column_to_dict.<locals>.nested_dict>,
                          {'__': 0       0.0
                           1       NaN
                           2       NaN
                           3       NaN
                           4       NaN
                           5       NaN
                           6       NaN
                           7       NaN
                           8       NaN
                           9       NaN
                           10      NaN
                           11      NaN
                           12      NaN
                           13      NaN
                           14      1.0
                           15      NaN
                           16      NaN
                           17      NaN
                           18      NaN
                           19      NaN
                           20   