# Purpose

With Python and _pandas_, read two Excel worksheets, merge the data, and make the data available for analysis.

- Final data should be exported to Excel format, and should be easy to download.
- Results must be reproducible.

## Setup

To prepare, we need to be able to open the Excel files, display results of intermediate processing in the notebook, and avoid repeating large blocks of code.

Some functions can be imported, as they're already available publicly. Others will be made here, in this notebook.

In [1]:
# The display function isn't always imported by default in some Jupyter implementations.
# We'll probably use it a lot.
from IPython.display import display, HTML

Jupyter can output HTML if we want. Here are simple helpers to make headings
and other embellishments easy:

In [2]:
def html_wrap(content, element="span", attributes=None):
    """ Convenience method for wrapping any string in an HTML tag. """
    element_with_attributes = ' '.join(item for item in (element, attributes) if item is not None)
    tag_open = f"<{element_with_attributes}>"
    tag_close = f"</{element}>"
    return f"{tag_open}{content}{tag_close}"


def make_heading(content, level=5):
    """ Convenience method for wrapping any string in an HTML header tag. """
    element = f"h{level}"
    return html_wrap(content, element)


def heading(*args, **kwargs):
    """ Call IPython.core.display.HTML on the results of the make_heading function,
    so users don't have to repeatedly do `HTML(heading(…))` """
    return HTML(make_heading(*args, **kwargs))

[_pandas_] is really good with columnar data, like Excel files:

[_pandas_]: https://pandas.pydata.org

In [3]:
import pandas

# The Excel workbook: `2017 CAM data from iPads.xlsx`

In [4]:
# The file I'm interested in parsing for cleanup:
file_path = "./src/real data/2017 CAM data from iPads/2017 CAM data from iPads.xlsx"

## Worksheets in the file

In [5]:
data_file = pandas.ExcelFile(file_path)
sorted(data_file.sheet_names)

['2017 CAM data Erl',
 '2017 CAM iPad data Tyler',
 'Combined iPad 2017 CAM data',
 'schema (WIP reverse engineer)']

I'm only interested in the first two, for now:

In [6]:
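# Re-sort here instead of relying on `_` (the previous output), so re-runs give the same result.
cam_sheet_names = sorted(data_file.sheet_names)[:2]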

Let's make a dictionary of DataFrames from those sheets, using the last word of each sheet name as the key. In Python, an index of -1 means the last item, which is the same item you'd get from the largest index number.

In [7]:
def last_word(string, word_separator=' '):
    """ Split string into words (by space character), return last word. """
    return string.split(word_separator)[-1]


sheets = {last_word(sheet_name): data_file.parse(sheet_name)
          for sheet_name in cam_sheet_names}
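
To make the negative index concrete, here is a small sketch that doesn't need the workbook. The `sheet_name` value is just a sample string shaped like the worksheet names above:

```python
# Sample value for illustration, shaped like the worksheet names above.
sheet_name = "2017 CAM iPad data Tyler"

words = sheet_name.split(" ")       # split on single spaces

last = words[-1]                    # count backward from the end
same_last = words[len(words) - 1]   # the equivalent positive index
second_last = words[-2]             # -2 is the item before the last

print(last, same_last, second_last)
```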

In [8]:
# The keys are sheet names. Let's see what we've got:
sheets.keys()

dict_keys(['Erl', 'Tyler'])

Now that we have a convenient list of sheets that are loaded as _pandas_ DataFrames, we can work toward merging them into one. Once they're merged, we can process the data more easily from a single DataFrame.

### Python dictionaries

Dictionaries in Python are just a collection of named things. The things can be another dictionary, a string, a number, or whatever. Even the names of the things don't necessarily have to be words; they can be numbers, for example.

In [9]:
# define a dictionary:
my_dictionary = {
    'one': 1,
    2: 'two',
    'green': 'I like colour green.',
    'another dictionary': {'more stuff': 1024,
                           'even more stuff': 2048},
    'a list': [1, 2, 3, 4]
}

In [10]:
# recall something from that dictionary:
my_dictionary["one"]

1

In [11]:
my_dictionary[2]

'two'

In [12]:
my_dictionary["another dictionary"]["even more stuff"]

2048

Note that the attempt to use a nonexistent key results in a Python exception called a `KeyError`. This is helpful if you accidentally lose track of which keys are in the dictionary.

In [13]:
my_dictionary["some key that doesn't exist"]

KeyError: "some key that doesn't exist"

If you don't want this to happen, there are ways to get around it. Later on, we'll use something called a [`defaultdict`], which just makes up the new key on the spot, instead of stopping dead.

[`defaultdict`]: https://docs.python.org/3/library/collections.html?highlight=defaultdict#defaultdict-objects
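
As a minimal sketch of a few ways around a `KeyError`: the `inventory` data here is made up for illustration, and `dict.get` and the `in` operator are standard Python alternatives that this notebook doesn't otherwise use.

```python
from collections import defaultdict

inventory = {"apples": 3}  # made-up data for illustration

# 1. dict.get returns a fallback value instead of raising KeyError.
pears = inventory.get("pears", 0)

# 2. The `in` operator checks for a key without looking up its value.
has_pears = "pears" in inventory

# 3. defaultdict calls its factory (here `int`, which returns 0) the
#    first time a missing key is looked up, and stores the result.
counts = defaultdict(int)
counts["pears"] += 1

print(pears, has_pears, dict(counts))
```

Note the difference: `get` leaves the dictionary unchanged, while a lookup on a `defaultdict` adds the missing key as a side effect.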

### Using `for` with dictionaries in Python

Python's `for` statements are handy for doing something to each item in a collection. Dictionaries are a type of collection, but since they have keys and values, you need to specify which you want. To address this, there are three handy methods that all dictionaries have:

- `keys()` - Only the key names
- `values()` - Only the values
- `items()` - Pairs of keys and values

In [14]:
for key in my_dictionary.keys():
    display(key)

'one'

2

'green'

'another dictionary'

'a list'

In [15]:
for value in my_dictionary.values():
    display(value)

1

'two'

'I like colour green.'

{'even more stuff': 2048, 'more stuff': 1024}

[1, 2, 3, 4]

In [16]:
for item in my_dictionary.items():
    display(item)

('one', 1)

(2, 'two')

('green', 'I like colour green.')

('another dictionary', {'even more stuff': 2048, 'more stuff': 1024})

('a list', [1, 2, 3, 4])
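
Since `items()` yields pairs, the loop can unpack each pair into two names, which is the form used later in this notebook. A small sketch, with made-up `prices` data:

```python
prices = {"tea": 2.5, "coffee": 3.0, "water": 1.0}  # made-up data

# Tuple unpacking: each (key, value) pair is split into two names.
lines = []
for name, price in prices.items():
    lines.append(f"{name}: {price}")

# The same unpacking inside a dictionary comprehension, used as a filter.
cheap = {name: price for name, price in prices.items() if price < 3}

print(lines)
print(cheap)
```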

### More about lists, dictionaries, and other data structures in Python

If you want to learn more about using dictionaries, see the official [Python tutorial on dictionaries], which is part of the page demonstrating other Python collections, such as [lists], which we'll be using extensively.

[Python tutorial on dictionaries]: https://docs.python.org/3/tutorial/datastructures.html#dictionaries
[lists]: https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

## Unifying the DataFrames (worksheets)

### _pandas_ functions for merging DataFrames

_pandas_ has multiple methods for combining datasets:

- [`concat`]: The concatenate method has an option to ignore row numbers, which has the effect of gluing each dataframe to the bottom of the previous one.
- [`append`]: Append was almost equivalent to `concat`, returning a new object rather than changing the DataFrame or Series it was called on. It was deprecated in pandas 1.4 and removed in pandas 2.0.
- [`merge`]: Merge is more for database-style relational merges.

[`concat`]: https://pandas.pydata.org/pandas-docs/stable/merging.html
[`append`]: https://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
[`merge`]: https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

From reading up on all three methods, it looks to me like `concat` will suffice here, as long as the column names are identical. It also creates a new DataFrame rather than overwriting anything, which I like because the results are easier to repeat.
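
Here is a small sketch of how `concat` behaves, using tiny made-up frames rather than the real worksheets. It shows the `ignore_index` option, the effect of passing a dictionary (as we do with `sheets`), and what happens when one column name differs:

```python
import pandas

# Two tiny made-up frames with identical column names.
first = pandas.DataFrame({"crop": ["wheat", "oats"], "count": [1, 2]})
second = pandas.DataFrame({"crop": ["barley"], "count": [3]})

# A list plus ignore_index=True renumbers the rows from zero.
renumbered = pandas.concat([first, second], ignore_index=True)

# A dictionary keeps each frame's own row numbers and adds the
# dictionary key as an outer index level (a MultiIndex).
labelled = pandas.concat({"A": first, "B": second})

# If a column name differs, concat keeps both names and fills the gaps
# with missing values, so the result has extra columns.
renamed = second.rename(columns={"count": "count total"})
mismatched = pandas.concat([first, renamed], ignore_index=True)

print(list(renumbered.index))
print(list(labelled.index))
print(sorted(mismatched.columns))
```

In every case the source frames are left untouched; `concat` returns a new object.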

Let's see how close that concatenation method gets us:

In [17]:
pandas.concat(sheets).info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4690 entries, (Erl, 0) to (Tyler, 3779)
Data columns (total 56 columns):
clients__company                                                    1 non-null object
clients__displayText                                                1 non-null object
clients__fname                                                      1 non-null object
clients__lname                                                      1 non-null object
clients__name                                                       1 non-null object
fields__client__company                                             21 non-null object
fields__client__displayText                                         21 non-null object
fields__client__fname                                               21 non-null object
fields__client__lname                                               21 non-null object
fields__client__name                                                21 non-null object
fields__crop

We can see the MultiIndex size in the info:

>MultiIndex: 4690 entries, (Erl, 0) to (Tyler, 3779)

Which means 4690 "rows", if this were a spreadsheet. Compare to the total row count for all sheets, by using Python's function `len` (an abbreviation for "length"), which counts the number of items (in this case, rows):

In [18]:
sum(len(sheet) for sheet in sheets.values())

4690

Also in the `info()` readout:

>Data columns (total 56 columns):

Compare to the column count for the source worksheets:

In [19]:
[len(sheet.columns) for sheet in sheets.values()]

[50, 50]

So, 50 columns in each source sheet.

Pretty good, so far, except for some extra columns due to variations in the column names—I see `growthStage` and `growthStage Zadoks`, as well as some aphid names amongst the `a1__number`, `a2__number`, `a3__number` group:

```
fields__oSets__growthStage                                  45 non-null float64
fields__oSets__growthStage Zadoks                           9 non-null float64
```

```
fields__oSets__oPoints__observations__a1__number            324 non-null float64
fields__oSets__oPoints__observations__a1__number EGA        124 non-null float64
fields__oSets__oPoints__observations__a2__number            112 non-null float64
fields__oSets__oPoints__observations__a2__number BCO        30 non-null float64
fields__oSets__oPoints__observations__a3__number            25 non-null float64
fields__oSets__oPoints__observations__a3__number Greenbug   0 non-null float64
```

The total number of rows is correct, and the other columns line up when the names match.

Column names (which become Series names in _pandas_) are the only problem standing in the way of a successful unification of DataFrames.

## Column names differ

It's already pretty obvious what the naming problem is. But just to demonstrate how to use _pandas_ to calculate differences, let's look again at our DataFrames column names.

For our frames:

In [20]:
sheets.keys()

dict_keys(['Erl', 'Tyler'])

Calculate the difference in column names with Python's set operations, using "Erl" as the base (the result lists names found in "Erl" but not in "Tyler"):

In [21]:
display(set.difference(*[set(sheet.columns) for sheet in sheets.values()]))

{'fields__oSets__growthStage Zadoks',
 'fields__oSets__oPoints__observations__a1__number EGA',
 'fields__oSets__oPoints__observations__a2__number BCO',
 'fields__oSets__oPoints__observations__a3__number Greenbug',
 'fields__oSets__oPoints__observations__anum TotalAPhids',
 'fields__oSets__oPoints__observations__eVnum Natural enemy totals'}

Or, if we want to list the names of columns from *all* sheets that don't have a match (symmetric difference):

In [22]:
column_list = sorted(set.symmetric_difference(*[set(sheet.columns) for sheet in sheets.values()]))
display(column_list)

['fields__oSets__growthStage',
 'fields__oSets__growthStage Zadoks',
 'fields__oSets__oPoints__observations__a1__number',
 'fields__oSets__oPoints__observations__a1__number EGA',
 'fields__oSets__oPoints__observations__a2__number',
 'fields__oSets__oPoints__observations__a2__number BCO',
 'fields__oSets__oPoints__observations__a3__number',
 'fields__oSets__oPoints__observations__a3__number Greenbug',
 'fields__oSets__oPoints__observations__anum',
 'fields__oSets__oPoints__observations__anum TotalAPhids',
 'fields__oSets__oPoints__observations__eVnum',
 'fields__oSets__oPoints__observations__eVnum Natural enemy totals']
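
For comparison, here is a sketch of the same set operations on two small made-up sets of names. The operators `-`, `^`, and `&` are shorthand for `difference`, `symmetric_difference`, and `intersection`:

```python
# Made-up column names, loosely modelled on the two worksheets.
erl_like = {"date", "growthStage Zadoks", "a1__number EGA"}
tyler_like = {"date", "growthStage", "a1__number"}

only_in_first = erl_like - tyler_like    # difference: order matters
in_exactly_one = erl_like ^ tyler_like   # symmetric difference
in_both = erl_like & tyler_like          # intersection

print(sorted(only_in_first))
print(sorted(in_exactly_one))
print(sorted(in_both))
```

The plain difference depends on which set comes first, which is why it matters which sheet is used as the base. The symmetric difference gives the same answer either way.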

Let's pop that list of names into a two-column layout for _pandas_ to display. I found a handy list-to-grid [recipe] in the official documentation for Python's `itertools` library. I based my function on their `grouper` example.

[recipe]: https://docs.python.org/3/library/itertools.html#itertools-recipes

In [23]:
def grid(items, width=2):
    """ Layout items in a grid, from left to right, top to bottom. """
    return [*zip(*[iter(items)] * width)]
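
The `zip(*[iter(items)] * width)` idiom works because the list holds the same iterator `width` times, so each row pulls consecutive items from it. One consequence is that leftover items are dropped when the count isn't a multiple of `width`. The sketch below shows that, along with a padded variant; `grid_padded` is a hypothetical helper, not part of this notebook, built on the standard `itertools.zip_longest`:

```python
from itertools import zip_longest


def grid_padded(items, width=2, fill=None):
    """ Hypothetical variant of grid: pad the last row instead of dropping it. """
    shared = iter(items)  # one iterator, referenced `width` times below
    return list(zip_longest(*[shared] * width, fillvalue=fill))


names = ["a", "a X", "b", "b Y", "c"]  # five made-up items, an odd count

strict = list(zip(*[iter(names)] * 2))  # same idiom as grid above
padded = grid_padded(names)

print(strict)
print(padded)
```

Our list of differing column names has an even number of entries, so the simpler version is enough here.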

In [24]:
column_grid = pandas.DataFrame(grid(column_list))

In [25]:
# Display entire column, even if cell data is long
pandas.set_option('display.max_colwidth', None)  # None means no limit (pandas 1.0 or later)

# Show the differing column names side-by-side
display(column_grid)

| | 0 | 1 |
|---|---|---|
| 0 | fields__oSets__growthStage | fields__oSets__growthStage Zadoks |
| 1 | fields__oSets__oPoints__observations__a1__number | fields__oSets__oPoints__observations__a1__number EGA |
| 2 | fields__oSets__oPoints__observations__a2__number | fields__oSets__oPoints__observations__a2__number BCO |
| 3 | fields__oSets__oPoints__observations__a3__number | fields__oSets__oPoints__observations__a3__number Greenbug |
| 4 | fields__oSets__oPoints__observations__anum | fields__oSets__oPoints__observations__anum TotalAPhids |
| 5 | fields__oSets__oPoints__observations__eVnum | fields__oSets__oPoints__observations__eVnum Natural enemy totals |


Transpose the frame by viewing the `T` attribute of our DataFrame, so differences are vertically adjacent:

In [26]:
display(column_grid.T)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | fields__oSets__growthStage | fields__oSets__oPoints__observations__a1__number | fields__oSets__oPoints__observations__a2__number | fields__oSets__oPoints__observations__a3__number | fields__oSets__oPoints__observations__anum | fields__oSets__oPoints__observations__eVnum |
| 1 | fields__oSets__growthStage Zadoks | fields__oSets__oPoints__observations__a1__number EGA | fields__oSets__oPoints__observations__a2__number BCO | fields__oSets__oPoints__observations__a3__number Greenbug | fields__oSets__oPoints__observations__anum TotalAPhids | fields__oSets__oPoints__observations__eVnum Natural enemy totals |


In [27]:
# Reset the column width, since we're done looking at the names this way
pandas.reset_option('display.max_colwidth')

Before we concatenate, we'll standardize on the shorter names, since they're more regular. But which sheet is the one to fix? Which sheet has `fields__oSets__growthStage Zadoks` instead of `fields__oSets__growthStage`? I believe "Erl" was our base for comparison initially, but let's make sure of what we're about to do:

In [28]:
# make a filtered dictionary, for later reference
sheets_with_bad_column_names = {sheet_name: sheet for sheet_name, sheet in sheets.items() 
                                if 'fields__oSets__growthStage Zadoks' in sheet.columns}
display(heading('Bad:'),
        set(sheets_with_bad_column_names.keys()))

{'Erl'}

In [29]:
display(heading('Good:'),
        set(sheets.keys()) - set(sheets_with_bad_column_names.keys()))

{'Tyler'}

Okay, "Tyler" has the column names we prefer, "Erl" does not. Duly noted.

There's now a `sheets_with_bad_column_names` variable we can use when we fix that. (It's only got one DataFrame, but it's still good practice to be ready for batch processing in case this code is reused in the future.)

### How shall we solve the name mismatch?

Our options:

- rename the columns
- try to merge the sheets in a way that ignores the column names of the bad sheet

If we try to do a special merge, ignoring the column names, we have to worry about the _precise order_ of those columns instead of relying on the names. Since we'd have to spend time checking—and possibly rearranging—the column sequence, we might as well spend that time creating a _reusable_, documented solution for quick and painless name fixing.

### How (where) should we rename the columns/indices?

Possibilities:

- fix the spreadsheet document in Excel format, and move ahead as if this had never happened
- keep the names in the file, but correct the names in memory once they become indices

Since this notebook you're reading has already mentioned the problem, let's go ahead and solve it here. That way our notebook will detail the fix and how to use it in the future, if there are more Excel spreadsheets with similarly altered names.

### Renaming axis labels (columns or rows) with _pandas_

_pandas_ has a [`rename`] method which lets us apply a transform function to the indices named after our worksheet columns (or explicitly map each column through a dictionary).

[`rename`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

Since the bad names merely have extra words tacked onto the end, let's just split the name and use the first "word". In Python, we get the first item of a sequence by using index zero:

In [30]:
def first_word(string, word_separator=' '):
    """ Split string into words (by space character), return first word. """
    return string.split(word_separator)[0]

We just need to apply this function to every column label in the DataFrame, through the _pandas_ `rename` method. For the sake of visualizing the changes, let's also report on changes in a dry run before calling `rename`.

In [31]:
report = []  # For the lines of the report which we'll display later.

for sheet_name, sheet in sheets_with_bad_column_names.items():
    report.append(make_heading(sheet_name, level=3))  # heading
    report_lines = []  # the body of the report for each DataFrame
    for column_name in sheet.columns:
        after = first_word(column_name)
        if after == column_name:
            content = column_name
            attributes = None
        else:
            content = f"{column_name} &rarr; {after}"  # old --> new
            attributes = "style='font-weight: bold'"
        report_lines.append(html_wrap(content, 'li', attributes))  # add an HTML list item
    report.append(html_wrap(''.join(report_lines), 'ol')) # add an ordered list of items to the report
    
# join all the strings together and display as HTML
display(HTML(''.join(report)))

Actually change the names:

In [32]:
for sheet in sheets_with_bad_column_names.values():
    sheet.rename(mapper=first_word, axis='columns', inplace=True)

As mentioned before, we know there's only one sheet, so the `for` loop isn't absolutely necessary, but it's still a good habit when dealing with reusable scripts that can work on batches of datasets. (Sometimes other people find your work useful, and sometimes you'll revisit your work to copy something that succeeded in the past.)
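
As a small illustration of the two ways `rename` can be driven (a function or a dictionary), here is a sketch on a made-up one-row frame. It uses the `columns=` keyword, which is equivalent to `mapper=` with `axis='columns'`, and leaves out `inplace=True` so the original frame stays as it was:

```python
import pandas


def first_word(string, word_separator=" "):
    """ Split string into words (by space character), return first word. """
    return string.split(word_separator)[0]


# Made-up frame whose column names mimic the "bad" sheet.
frame = pandas.DataFrame({
    "growthStage Zadoks": [7.0],
    "a1__number EGA": [12.0],
    "date": ["2017-08-02"],
})

# A function is applied to every column label.
by_function = frame.rename(columns=first_word)

# A dictionary renames only the labels it lists.
by_mapping = frame.rename(columns={"growthStage Zadoks": "growthStage"})

print(list(by_function.columns))
print(list(by_mapping.columns))
print(list(frame.columns))  # unchanged, because inplace was not used
```

The function form suits this dataset because every bad name follows the same pattern. A dictionary would be the safer choice if only some names with spaces needed changing.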

## Finally uniting our dataset

In [33]:
data = pandas.concat(sheets)

In [34]:
data.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4690 entries, (Erl, 0) to (Tyler, 3779)
Columns: 50 entries, clients__company to observers
dtypes: datetime64[ns](1), float64(27), object(22)
memory usage: 1.8+ MB


Splendid! That's what we expected. 😁

## Making sense of really long, repetitive column names

_pandas_ has a concept of [indexing hierarchically], which may help group columns together. However, I'm new to _pandas_ and I'm not new to Python. Furthermore, from what I've read, if we use a multi-level labelling system for grouping, we have to pay attention to which level we're operating on as we work with the spreadsheet as DataFrame. 

To avoid complications due to my own ignorance, let's use a nested dictionary in Python to accomplish the same thing without _pandas_. We can probably use the dictionary to help us add more indices to the DataFrame later, if we need.

[indexing hierarchically]: https://pandas.pydata.org/pandas-docs/stable/advanced.html

### Nested dictionaries, for a hierarchy

Instead of a regular dictionary in Python, let's use something called a [`defaultdict`], which will simplify the creation of our hierarchy. [`defaultdict`] is like a regular dictionary, except it doesn't complain if you try to access a key that doesn't exist yet—it just adds it.

[`defaultdict`]: https://docs.python.org/3/library/collections.html#collections.defaultdict

Because there's always the chance that a node might have data as well as more nodes under it, we'll store the reference to the data under a key that can't possibly be a segment of the name: the separator itself (which in this case is `'__'`).

The following function looks long, but it's mostly comments for your benefit.

In [35]:
from collections import defaultdict

def split_column_to_dict(column, column_name=None, column_dictionary=None, separator='__'):
    """ Split the column names like "fields__oSets__oPoints__observations" into groupings of keys
    so that related keys are easy to find, ie columns['fields']['oSets']['oPoints']['observations'].
    This produces a tree of column name segments, with references to actual data at the ends.
    
    If `column_name` is provided, it's used instead of the actual column name.
    
    If `column_dictionary` is provided, attempt to add to it as if it were already initialized as
    a nested defaultdict, from an earlier call to this function. """

    # If a dictionary is not provided, make an empty one.
    if column_dictionary is None:
        def nested_dict():
            """ This function will be called by defaultdict
            whenever a non-existent key is used. """
            return defaultdict(nested_dict)
        
        column_dictionary = nested_dict()

    # Set a pointer to the root of the tree, for starters.
    pointer = column_dictionary
    
    # Now, walk through the segments in order from left to right,
    # touching the tree for each one.
    for segment in str(column_name or column.name).split(separator):
        pointer = pointer[segment]
        
    # Now that the loop is done, the pointer is pointing at the deepest
    # level of the branch, which either already existed or else it was created.
    
    # At the end of the branch, put the data under a special key.
    pointer[separator] = column

    # Since `pointer` was actually just pointing to parts of `column_dictionary`,
    # the column dictionary has been filled out with nodes because of how defaultdict
    # was setup with our `nested_dict` constructor.
    return column_dictionary


Just to demonstrate that the functional code is actually quite slim and simple, here it is without any comments:

```python
def split_column_to_dict(column, column_name=None, column_dictionary=None, separator='__'):
    if column_dictionary is None:
        def nested_dict():
            return defaultdict(nested_dict)
        column_dictionary = nested_dict()
    pointer = column_dictionary
    for segment in str(column_name or column.name).split(separator):
        pointer = pointer[segment]
    pointer[separator] = column
    return column_dictionary
```
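
To see the tree take shape without any _pandas_ objects, here is a sketch of the same walk using plain strings as stand-ins for the column data. `add_to_tree` is a hypothetical, simplified cousin of `split_column_to_dict`, and the four names are just samples:

```python
from collections import defaultdict


def nested_dict():
    """ A defaultdict whose missing keys become more nested defaultdicts. """
    return defaultdict(nested_dict)


def add_to_tree(tree, name, payload, separator="__"):
    """ Hypothetical helper: walk (or create) one branch per name segment,
    then store the payload under the separator key at the end. """
    pointer = tree
    for segment in name.split(separator):
        pointer = pointer[segment]  # created on first access
    pointer[separator] = payload
    return tree


tree = nested_dict()
for name in ["fields__crop", "fields__oSets__date", "fields__oSets__desc", "observers"]:
    add_to_tree(tree, name, payload=f"<data for {name}>")

top = sorted(tree.keys())
osets = sorted(tree["fields"]["oSets"].keys())
leaf = tree["fields"]["oSets"]["date"]["__"]

print(top)
print(osets)
print(leaf)
```

Because the tree is still a live `defaultdict` here, a mistyped key would quietly create an empty branch instead of raising an error.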

Now, make a dictionary of column trees:

In [36]:
column_dictionary = None  # initialize the variable, otherwise we can't reference it
separator = '__'
for name, column in data.items():
    column_dictionary = split_column_to_dict(column, name, column_dictionary, separator)
# Turn off the defaultdict behaviour, so errors are easier to detect later on
column_dictionary.default_factory = None

If that worked as planned, there should be a list of the first segments of all the column names in that sheet:

In [37]:
column_dictionary.keys()

dict_keys(['clients', 'fields', 'observers'])

Continuing deeper, more segments that share a common prefix:

In [38]:
fields_node = column_dictionary['fields']
fields_node.keys()

dict_keys(['client', 'crop', 'date', 'desc', 'image', 'name', 'oSets'])

In [39]:
sets_node = fields_node['oSets']
sets_node.keys()

dict_keys(['completeSets', 'date', 'dateCompare', 'desc', 'growthStage', 'oPoints', 'obsName', 'results', 'totalA1', 'totalA2', 'totalA3', 'totalA4', 'totalSets'])

At the end of each branch in the tree should be a '\_\_' key for the actual data. For example, the `date` of the set:

In [40]:
sets_node['date'].keys()

dict_keys(['__'])

We can look at that data right now!

In [41]:
sets_node['date'][separator].head(20)

Erl  0     2017-08-02T13:12:09.542
     1                         NaN
     2                         NaN
     3                         NaN
     4                         NaN
     5                         NaN
     6                         NaN
     7                         NaN
     8                         NaN
     9                         NaN
     10                        NaN
     11                        NaN
     12                        NaN
     13                        NaN
     14                        NaN
     15                        NaN
     16                        NaN
     17                        NaN
     18                        NaN
     19                        NaN
Name: fields__oSets__date, dtype: object

So, this is the problem with which we launched this whole endeavour. Many, many empty cells. If we want to skip over blanks as we scan down a column, _pandas_ has a [`dropna()`] method for that. Let's use `head` to peek at the first bit of that:

[`dropna()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

In [42]:
sets_node['date'][separator].dropna().head(10)

Erl    0      2017-08-02T13:12:09.542
       70     2017-08-09T09:25:11.710
       140    2017-08-09T10:06:25.480
       210    2017-08-09T11:21:01.555
       350    2017-08-09T11:37:20.862
       490    2017-08-22T15:42:05.751
       560    2017-08-17T11:12:02.820
       700    2017-08-17T13:06:30.183
       840    2017-08-22T16:02:50.682
Tyler  0      2017-07-14T12:31:24.194
Name: fields__oSets__date, dtype: object

So much better!

We now have enough information to find the beginning of each chunk in the sheet by scanning for non-blank cells. The index values above are the row numbers of the non-null records in that column, and the related columns hold the data we need for that section on those same rows.

As a side note, the number of non-null (not empty) records in any given column was displayed when we called [`info()`] on the DataFrame. We can also get information like that from the column (which is a Series in _pandas_) by using [`describe()`] or [`count()`]:

[`info()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
[`describe()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html
[`count()`]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.count.html

In [43]:
sets_node['date'][separator].describe()

count                          54
unique                         54
top       2017-08-02T13:12:09.542
freq                            1
Name: fields__oSets__date, dtype: object

### Helper functions, for browsing the data

Once we start actually reading groups of columns based on which type of chunk we're reading from, we'll have to start typing lists of column names for each section.

It's a bit of extra work to pay attention to whether a given node in the hierarchy is holding a reference to a data Series (column) or whether it's only an intermediate step on the way to the end of the branch.

To save time and mental energy, we can filter columns by whether they have data or not, if we make some simple filter functions. For detecting data that we've placed in the hierarchy of columns, we can look for the special key that we chose earlier: the separator, which is two underscore characters (`__`).

In [44]:
def has_data(node):
    """ Filter children that have data. We know a child item has data if
    it has a key that's just the separator string. """
    return {parent_key: child for parent_key, child in node.items()
            if separator in child.keys()}


def has_children(node):
    """ Filter children that have children. We know an item has children if
    it has at least one key that isn't just the separator string, which is the
    special key for data references. """
    return {parent_key: child for parent_key, child in node.items()
            if len([key for key in child.keys() if key != separator]) >= 1}
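
The sketch below tries the same two filters on a small hand-built tree of plain dictionaries. These are hypothetical variants, not the notebook's functions: they take the separator as a parameter instead of reading the global, and they add an `isinstance` check (an assumption on my part) so that a data entry stored directly on the node is not mistaken for a branch:

```python
SEPARATOR = "__"  # the same special key used for data references


def children_with_data(node, separator=SEPARATOR):
    """ Children that are branches holding a data reference. """
    return {key: child for key, child in node.items()
            if isinstance(child, dict) and separator in child}


def children_with_children(node, separator=SEPARATOR):
    """ Children that are branches with at least one sub-branch. """
    return {key: child for key, child in node.items()
            if isinstance(child, dict) and any(k != separator for k in child)}


# Hand-built stand-in for part of the column tree; strings replace Series.
node = {
    "__": "data stored on this node itself",
    "date": {"__": "date data"},
    "oPoints": {"name": {"__": "name data"}},
    "both": {"__": "data", "more": {"__": "more data"}},
}

with_data = set(children_with_data(node))
with_children = set(children_with_children(node))

print(sorted(with_data))
print(sorted(with_children))
print(sorted(with_data & with_children))
```

Since `defaultdict` is a subclass of `dict`, the same check would also accept the nodes built earlier.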

In [45]:
points_node = sets_node['oPoints']
observations_node = points_node['observations']
display(heading('children with children:'),
        has_children(observations_node).keys())

dict_keys(['a1', 'a2', 'a3', '|'])

In [46]:
display(heading('children with data:'),
        has_data(observations_node).keys())

dict_keys(['anum', 'complete', 'disabled', 'eVnum', 'enum', 'id', 'name', '|'])

In [47]:
# Set of keys for nodes that have children but also data:
display(heading('children with children & data:'),
        set(has_children(observations_node).keys()) & set(has_data(observations_node).keys()))

{'|'}

#### Personal thought:

I have to wonder why the developers of the app that output this data chose to use an unpronounceable column name, and put something important there.

Regardless, we can effortlessly handle it now.

## Putting it all together: accurately reading sections at will

Let's take a look at the columns containing data about observation sets.

We'll use `has_data` to filter our `sets_node` so we can express a list of names of data-containing columns about `fields__oSets`.

In [49]:
sets_columns_names = [column[separator].name for column in has_data(sets_node).values()]
display(sets_columns_names)

['fields__oSets__completeSets',
 'fields__oSets__date',
 'fields__oSets__dateCompare',
 'fields__oSets__desc',
 'fields__oSets__growthStage',
 'fields__oSets__obsName',
 'fields__oSets__results',
 'fields__oSets__totalA1',
 'fields__oSets__totalA2',
 'fields__oSets__totalA3',
 'fields__oSets__totalA4',
 'fields__oSets__totalSets']

Passing that list to _pandas_, we should get exactly the columns of data we need:

In [50]:
display(data[sets_columns_names].head())

| | | fields__oSets__completeSets | fields__oSets__date | fields__oSets__dateCompare | fields__oSets__desc | fields__oSets__growthStage | fields__oSets__obsName | fields__oSets__results | fields__oSets__totalA1 | fields__oSets__totalA2 | fields__oSets__totalA3 | fields__oSets__totalA4 | fields__oSets__totalSets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Erl | 0 | 0.0 | 2017-08-02T13:12:09.542 | 2017-08-02 | | 7.0 | Tyler | | | | | | 1.0 |
| Erl | 1 | | | NaT | | | | | | | | | |
| Erl | 2 | | | NaT | | | | | | | | | |
| Erl | 3 | | | NaT | | | | | | | | | |
| Erl | 4 | | | NaT | | | | | | | | | |


Once again, the problem we faced at the outset. How about skipping irrelevant rows? We fixed this earlier with `dropna()`, but this time we're operating on some columns that _may_ be null, plus certain ones that _must not_ be null.

Let's use the date column as the crucial record upon which we'll predicate our filter, because it should never be null. 

_pandas_ has a way to filter on conditions, which is sometimes called [boolean indexing] because the condition is either `True` or `False`. In this case, _pandas_ expects the actual column data object itself (Series), rather than just the name of the column to check. We already have that in the hierarchy from earlier:

[boolean indexing]: https://pandas.pydata.org/pandas-docs/stable/10min.html#boolean-indexing

In [51]:
date_column = sets_node['date'][separator]
type(date_column)

pandas.core.series.Series

Now we pass _pandas_ a check for whether each date is non-null, and specify the column names we want to get:

In [52]:
sets = data.loc[date_column.notna(), sets_columns_names]
display(sets.head(15))

| | | fields__oSets__completeSets | fields__oSets__date | fields__oSets__dateCompare | fields__oSets__desc | fields__oSets__growthStage | fields__oSets__obsName | fields__oSets__results | fields__oSets__totalA1 | fields__oSets__totalA2 | fields__oSets__totalA3 | fields__oSets__totalA4 | fields__oSets__totalSets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Erl | 0 | 0.0 | 2017-08-02T13:12:09.542 | 2017-08-02 | | 7.0 | Tyler | | | | | | 1.0 |
| Erl | 70 | 1.0 | 2017-08-09T09:25:11.710 | 2017-08-09 | | 8.0 | Tyler | RESULTS.5 | 164.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Erl | 140 | 1.0 | 2017-08-09T10:06:25.480 | 2017-08-09 | | 7.0 | Tyler | RESULTS.5 | 66.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Erl | 210 | 2.0 | 2017-08-09T11:21:01.555 | 2017-08-09 | | 9.0 | Stean | RESULTS.1 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| Erl | 350 | 2.0 | 2017-08-09T11:37:20.862 | 2017-08-09 | | 8.0 | Stean | RESULTS.1 | 5.0 | 5.0 | 0.0 | 0.0 | 2.0 |
| Erl | 490 | 1.0 | 2017-08-22T15:42:05.751 | 2017-08-22 | | 8.0 | Mikki | RESULTS.5 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Erl | 560 | 2.0 | 2017-08-17T11:12:02.820 | 2017-08-17 | | 8.0 | Gabrielle | RESULTS.1 | 169.0 | 96.0 | 0.0 | 0.0 | 2.0 |
| Erl | 700 | 2.0 | 2017-08-17T13:06:30.183 | 2017-08-17 | | 9.0 | Stean | RESULTS.1 | 78.0 | 102.0 | 0.0 | 0.0 | 2.0 |
| Erl | 840 | 1.0 | 2017-08-22T16:02:50.682 | 2017-08-22 | | 8.0 | Mikki | RESULTS.5 | 187.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Tyler | 0 | 0.0 | 2017-07-14T12:31:24.194 | 2017-07-14 | | 6.0 | Tyler | | | | | | 1.0 |
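
The same filtering pattern, reduced to a made-up five-row frame so each step is visible. The column names and values here are invented for illustration:

```python
import pandas

# Made-up frame: a value in "date" marks the first row of each section.
frame = pandas.DataFrame({
    "date": ["2017-08-02", None, None, "2017-08-09", None],
    "stage": [7.0, None, None, 8.0, None],
    "note": ["start", "x", "y", None, "z"],
})

# A boolean Series: True where the date is present.
mask = frame["date"].notna()

# Select the rows where the mask is True, and only the listed columns.
section_starts = frame.loc[mask, ["date", "stage"]]

print(mask.tolist())
print(section_starts)
```

The surviving rows keep their original index labels, which is what lets us use them as the row numbers where each section begins.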


Now, we can repeat this technique to get all the other section data!