# Validating changes introduced in branch `10-density`

    
At the time of writing, `outputvalidation.py` returned `NARP` for the changes introduced in branch `10-density`. This notebook is a deep dive into why that happened.

## Overview

Branch `10-density` made the following changes:
1. Introduced new density variables to the report output
    * This is the main change introduced in `10-density`
2. Solidified the user's option to drop `_raw_*` features from the output
    * This secondary change affects our work in this notebook


## A preliminary look at the data
We observe that there are 368 columns in the `old` dataset that *do not* appear in `new`, and three columns in `new` that do not appear in `old`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
old = pd.read_csv("../OriginalTestCase.csv") # data before the change
new = pd.read_csv("../debugtestcase.csv")    # data after the change

In [3]:
np.isin(old.columns.values, new.columns.values, invert=True).sum()

368

In [4]:
np.isin(new.columns.values, old.columns.values, invert=True).sum()

3

## Columns missing from `new`

Here we take a closer look at the columns that appear in `old` but not in `new`.

We expected these to be only the `_raw_*` columns, because of change 2 described in the **Overview**.
However, as shown below, 17 `_feat_*` columns also appear in this set. I'm not sure why.

In [5]:
dropped_cols = list(set(old.columns.values) - set(new.columns.values))

In [6]:
# keep only the dropped columns that are not `_raw*` columns
dropped_feats = [col for col in dropped_cols if not col.startswith('_raw')]

In [7]:
dropped_feats

['_feat_sol_ph12',
 '_feat_sol_ph0',
 '_feat_sol_ph3',
 '_feat_sol_ph11',
 '_feat_sol_ph6',
 '_feat_sol_ph9',
 '_feat_sol_ph5',
 '_feat_bpKa1',
 '_feat_sol_ph1',
 '_feat_sol_ph7',
 '_feat_sol_ph13',
 '_feat_sol_ph14',
 '_feat_sol_ph2',
 '_feat_sol_ph10',
 '_feat_fsp3',
 '_feat_sol_ph4',
 '_feat_sol_ph8']

In [8]:
len(dropped_cols)

368

In [9]:
len(dropped_feats)

17
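To see at a glance which families of columns were dropped, the names can be grouped by their leading prefix. The following is a minimal sketch, not part of the original analysis: `column_prefix` is a hypothetical helper, and `example_dropped` is a made-up stand-in for `dropped_cols` (the `_raw_example_*` names are invented for illustration).

```python
from collections import Counter


def column_prefix(name):
    """Return the leading '_xxx' token of a column name, e.g. '_feat' for '_feat_fsp3'."""
    parts = name.split("_")
    # '_feat_fsp3' splits into ['', 'feat', 'fsp3']; names without a leading
    # underscore are returned unchanged.
    if len(parts) > 1 and parts[0] == "":
        return "_" + parts[1]
    return name


# Hypothetical stand-in for `dropped_cols`; the '_raw_example_*' names are made up.
example_dropped = [
    "_raw_example_a",
    "_raw_example_b",
    "_feat_fsp3",
    "_feat_bpKa1",
    "_feat_sol_ph7",
]

# Count how many dropped columns fall under each prefix.
prefix_counts = Counter(column_prefix(col) for col in example_dropped)
print(prefix_counts)
```

Running the same count over the real `dropped_cols` would show how the 368 dropped columns split between `_raw` and `_feat`.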

## Columns added to `new`
These are the new columns added to the report output by branch `10-density`.

In [10]:
new_cols = list(set(new.columns.values) - set(old.columns.values))

In [11]:
new_cols

['_rxn_v1-M_organic', '_rxn_v1-M_acid', '_rxn_v1-M_inorganic']

## Dropping mismatched columns

Once we drop the mismatched columns, so that both dataframes contain only the set intersection of their columns, we expect the dataframes to be equal. However, that's not *quite* what happens.

In [12]:
old.drop(dropped_cols, axis=1, inplace=True)

In [13]:
new.drop(list(new_cols), axis=1, inplace=True)

In [14]:
old.equals(new)

False

In [15]:
# A different way of comparing all values of the dataframe
(old == new).all().all()

False
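The column-dropping step above can also be written without mutating the inputs. This is a sketch under assumptions: `restrict_to_shared_columns` is a hypothetical helper name, and the two small frames are toy data, not the real `old` and `new`.

```python
import pandas as pd


def restrict_to_shared_columns(left, right):
    """Return copies of both frames limited to the columns they have in common.

    The shared columns are kept in the order they appear in `left`, so both
    results end up with the same column order.
    """
    shared = [col for col in left.columns if col in right.columns]
    return left[shared], right[shared]


# Toy data standing in for `old` and `new`.
left = pd.DataFrame({"a": [1, 2], "b": [3, 4], "only_left": [5, 6]})
right = pd.DataFrame({"a": [1, 2], "b": [3, 4], "only_right": [7, 8]})

left_shared, right_shared = restrict_to_shared_columns(left, right)
print(list(left_shared.columns))
print(left_shared.equals(right_shared))
```

Because the original frames are left intact, the cells can be re-run in any order without reloading the CSV files.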

## Dealing with `NaN`s
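If we dig a little deeper into the differences between the two dataframes, it appears that `NaN`s are the culprit: under elementwise comparison `NaN` is never equal to `NaN`, so any cell that is `NaN` in both dataframes is reported as a mismatch.

If we drop the few rows that contain `NaN`s, everything looks good.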

In [16]:
# gives us the indices of the rows and columns where mismatches are present 

mismatch_rows, mismatch_columns = list(map(lambda x: list(np.unique(x)), np.where(old != new)))
mismatch_rows, mismatch_columns

([432, 433, 434, 435, 436, 437, 438, 439, 440, 441], [1, 10])

In [17]:
old.iloc[mismatch_rows, mismatch_columns]

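| | _out_crystalscore | _rxn_temperatureC_actual_bulk |
|---|---|---|
| 432 | NaN | NaN |
| 433 | NaN | NaN |
| 434 | NaN | NaN |
| 435 | NaN | NaN |
| 436 | NaN | NaN |
| 437 | NaN | NaN |
| 438 | NaN | NaN |
| 439 | NaN | NaN |
| 440 | NaN | NaN |
| 441 | NaN | NaN |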


In [18]:
new.iloc[mismatch_rows, mismatch_columns]

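| | _out_crystalscore | _rxn_temperatureC_actual_bulk |
|---|---|---|
| 432 | NaN | NaN |
| 433 | NaN | NaN |
| 434 | NaN | NaN |
| 435 | NaN | NaN |
| 436 | NaN | NaN |
| 437 | NaN | NaN |
| 438 | NaN | NaN |
| 439 | NaN | NaN |
| 440 | NaN | NaN |
| 441 | NaN | NaN |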


In [19]:
old_dropnans = old.drop(mismatch_rows)
new_dropnans = new.drop(mismatch_rows)
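As an alternative to dropping rows, the elementwise check can be made NaN-aware by treating a cell as matching when both frames hold `NaN` there. The sketch below uses toy data, and `equal_treating_nan_as_equal` is a hypothetical helper name, not something from the original notebook.

```python
import numpy as np
import pandas as pd


def equal_treating_nan_as_equal(left, right):
    """Elementwise equality where a NaN in the same cell of both frames counts as a match."""
    both_nan = left.isna() & right.isna()
    return bool(((left == right) | both_nan).all().all())


# Toy frames with a NaN in the same position.
a = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})
b = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})

print((a == b).all().all())               # False, because NaN != NaN
print(equal_treating_nan_as_equal(a, b))  # True
```

This keeps every row in the comparison, so a genuine difference in one of the rows that also contains a `NaN` would still be caught.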

Note that [`DataFrame.equals`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html) still returns `False`, but checking equality 'by hand' returns `True`. I am inclined to trust the 'by hand' check, since pandas can be quite finicky.

In [20]:
old_dropnans.equals(new_dropnans)

False

In [21]:
(old_dropnans == new_dropnans).all().all()

True
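One possible explanation for the disagreement, not verified against the real data here, is a dtype difference: `DataFrame.equals` requires corresponding columns to have the same dtype, while `==` compares values only. The sketch below reproduces that behaviour on toy data; `dtype_mismatches` is a hypothetical helper for listing the columns whose dtypes differ.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Map each shared column name to its (left dtype, right dtype) pair when the two differ."""
    return {
        col: (left[col].dtype, right[col].dtype)
        for col in left.columns
        if col in right.columns and left[col].dtype != right[col].dtype
    }


# Toy frames with identical values but different numeric dtypes.
ints = pd.DataFrame({"n": [1, 2, 3], "s": ["a", "b", "c"]})
floats = pd.DataFrame({"n": [1.0, 2.0, 3.0], "s": ["a", "b", "c"]})

print(ints.equals(floats))            # False: column 'n' is int64 vs float64
print((ints == floats).all().all())   # True: the values themselves match
print(dtype_mismatches(ints, floats))

# Raises AssertionError on a value mismatch; dtype differences are ignored.
pd.testing.assert_frame_equal(ints, floats, check_dtype=False)
```

Calling `dtype_mismatches(old_dropnans, new_dropnans)` would show whether this is what happens here. If it returns an empty dict, the cause lies elsewhere.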