# Validating changes introduced in branch `10-density`

    
The version of `outputvalidation.py` at the time of writing returned `NARP` based on changes introduced in branch `10-density`. This notebook is a deep dive into why that happened. 

## Overview

Branch `10-density` made the following changes: 
1. introduced new density variables to the report output
    * This is the main change introduced in `10-density`
2. solidified user's option to drop `_raw_*` features from output
    * This secondary change impacts our work in this notebook


## A preliminary look at the data
We observe that 368 columns in the `old` dataset *do not* appear in `new`, and three columns in `new` do not appear in `old`.

In [None]:
import pandas as pd
import numpy as np

In [None]:
old = pd.read_csv("../oldcomparetest.csv")  # data before the change
new = pd.read_csv("../debugtest.csv")       # data after the change
new.head()

In [None]:
# Count the columns in the old dataset that do not appear in the new dataset
np.isin(old.columns.values, new.columns.values, invert=True).sum()

In [None]:
# Count the columns in the new dataset that do not appear in the old dataset
np.isin(new.columns.values, old.columns.values, invert=True).sum()
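The same counts can also be obtained with `Index.difference`, which returns the column names themselves rather than only their number. A minimal sketch on toy frames; `old_demo`, `new_demo` and their column names are made up for illustration and are not the real report columns:

```python
import pandas as pd

# Hypothetical stand-ins for `old` and `new` (illustrative column names only)
old_demo = pd.DataFrame(columns=["_raw_a", "_feat_b", "shared"])
new_demo = pd.DataFrame(columns=["shared", "density_x"])

# Columns present in one frame but missing from the other
only_in_old = old_demo.columns.difference(new_demo.columns)
only_in_new = new_demo.columns.difference(old_demo.columns)

print(len(only_in_old), list(only_in_old))
print(len(only_in_new), list(only_in_new))
```

Taking `len()` of each result should agree with the `np.isin(..., invert=True).sum()` counts above.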

## Columns missing from `new`

Here we can take a closer look at the columns that appeared in `old` but not `new`.

(Mike): We expected these to be the `_raw_*` columns, because of change 2 discussed in the **Overview**.
However, observe below that some `_feat_*` columns also appear in this set. I'm not sure why.

(Ian): The `_feat_*` columns were removed from the perovskite features because they were poor fits for the current dataset. The target dataset may need to be updated to reflect the new feature set once validation is complete and the branch is pushed to master. I will make this change, after which the `_feat_*` columns should no longer appear (unless, of course, a new branch under test intentionally changes how features are reported).

In [None]:
dropped_cols = list(set(old.columns.values) - set(new.columns.values))
dropped_cols

In [None]:
dropped_feats = [col for col in dropped_cols if not col.startswith('_raw')]

In [None]:
dropped_feats

In [None]:
len(dropped_cols)

In [None]:
len(dropped_feats)
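To see at a glance which families of columns were dropped, the names can be tallied by their leading tag. This is only a sketch: the list below is invented for illustration (the real names come from `dropped_cols`), and `prefix_of` is a hypothetical helper, not part of the project code.

```python
from collections import Counter

# Hypothetical column names; in the notebook these would come from `dropped_cols`
dropped_demo = ["_raw_temp", "_raw_mass", "_feat_radius", "_rxn_time"]

def prefix_of(col):
    """Return the leading '_xxx' tag of a column name, e.g. '_raw'."""
    parts = col.split("_")
    if col.startswith("_") and len(parts) > 1:
        return "_" + parts[1]
    return col

# Number of dropped columns per tag
prefix_counts = Counter(prefix_of(c) for c in dropped_demo)

# Same filter as `dropped_feats`: everything that is not a `_raw_*` column
non_raw = [c for c in dropped_demo if not c.startswith("_raw")]

print(prefix_counts)
print(non_raw)
```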



## Columns added to `new`
These are the new columns added to the report output by branch `10-density`.

In [None]:
new_cols = list(set(new.columns.values) - set(old.columns.values))

In [None]:
new_cols

## Dropping Mismatched Columns

Once we drop the mismatched columns, so that both dataframes contain only the intersection of their column sets, we expect the dataframes to be equal. However, that's not *quite* what happens.

In [None]:
old.drop(dropped_cols, axis=1, inplace=True)
old

In [None]:
new.drop(list(new_cols), axis=1, inplace=True)
new
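As an alternative to dropping columns in place, both frames can be restricted to their shared columns without mutating the originals. A minimal sketch on toy data; the frames and column names here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for `old` and `new`
old_demo = pd.DataFrame({"shared": [1, 2], "_raw_a": [0.1, 0.2]})
new_demo = pd.DataFrame({"density_x": [5.0, 6.0], "shared": [1, 2]})

# Shared columns, listed in the order they appear in `old_demo`
common = [c for c in old_demo.columns if c in new_demo.columns]

# Selecting with the same list puts both frames in the same column order
old_common = old_demo[common]
new_common = new_demo[common]

print(common)
print(old_common.equals(new_common))
```

Because nothing is modified in place, the cell can be re-run safely, and `old_demo` and `new_demo` keep all of their columns for later inspection.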

In [None]:
old.equals(new)

In [None]:
# A different way of comparing all values of the dataframe
(old == new).all().all()

## Dealing with `NaN`s
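If we dig a little deeper into the differences between the two dataframes, it appears that `NaN`s are the culprit: `NaN == NaN` evaluates to `False`, so the elementwise comparison flags every `NaN` cell as a mismatch.

If we drop the few rows that contain `NaN`s, everything looks good.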
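The effect of `NaN`s on the two comparison styles can be reproduced on a tiny example. The frames `a` and `b` below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Two frames with identical contents, including a NaN in the same cell
a = pd.DataFrame({"x": [1.0, np.nan]})
b = pd.DataFrame({"x": [1.0, np.nan]})

# Elementwise comparison: NaN == NaN is False, so the overall result is False
elementwise_equal = bool((a == b).all().all())

# DataFrame.equals treats NaNs in the same location as equal
equals_result = a.equals(b)

print(elementwise_equal, equals_result)
```

So a `False` from the elementwise check can be caused by `NaN`s alone, whereas a `False` from `equals` points to some other difference.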

In [None]:
# Row and column positions where the two dataframes disagree
# (NaN != NaN evaluates to True, so NaN cells are flagged as well)
mismatch_rows, mismatch_columns = (list(np.unique(idx)) for idx in np.where(old != new))
mismatch_columns

In [None]:
old.iloc[mismatch_rows, mismatch_columns]

In [None]:
new.iloc[mismatch_rows, mismatch_columns]
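To check that every flagged cell really is a `NaN` in both frames, the positions returned by `np.where` can be walked one by one. A sketch on toy data; `a` and `b` are hypothetical frames, not the report data:

```python
import numpy as np
import pandas as pd

# Identical toy frames with NaNs in two different cells
a = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [1.0, 2.0, np.nan]})
b = a.copy()

# Positional (row, column) pairs where the elementwise comparison disagrees
rows, cols = np.where((a != b).to_numpy())
mismatch_cells = [(a.index[r], a.columns[c]) for r, c in zip(rows, cols)]

# True only if each mismatching cell is NaN in both frames
all_nan = all(pd.isna(a.iat[r, c]) and pd.isna(b.iat[r, c]) for r, c in zip(rows, cols))

print(mismatch_cells, all_nan)
```

If `all_nan` comes out `True`, the mismatches are purely an artifact of `NaN` comparison semantics rather than a change in values.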

In [None]:
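# mismatch_rows holds positions, so translate them to index labels before dropping
old_dropnans = old.drop(old.index[mismatch_rows])
new_dropnans = new.drop(new.index[mismatch_rows])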

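Note that [`DataFrame.equals`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html) still returns `False`, but checking equality 'by hand' returns `True`. One possible explanation is that `equals` also requires corresponding columns to have the same dtype, while `==` compares values only. I am inclined to trust the 'by hand' check, since pandas can be quite finicky.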

In [None]:
old_dropnans.equals(new_dropnans)

In [None]:
(old_dropnans == new_dropnans).all().all()
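One way to narrow down a disagreement between `equals` and the 'by hand' check is to compare dtypes column by column, and to use a comparison that treats `NaN`s in the same cell as equal. This is a sketch under the assumption that a dtype difference is involved, which has not been confirmed for the real data; `left` and `right` are made-up frames:

```python
import numpy as np
import pandas as pd

# Same values, but column "n" is int64 on the left and float64 on the right
left = pd.DataFrame({"n": [1, 2], "v": [0.5, np.nan]})
right = pd.DataFrame({"n": [1.0, 2.0], "v": [0.5, np.nan]})

# Strict check: False here because the dtypes of "n" differ
strict = left.equals(right)

# Value check that counts NaNs in the same position as equal
nan_aware = bool(((left == right) | (left.isna() & right.isna())).all().all())

# Columns whose dtypes differ between the two frames
dtype_diff = left.dtypes[left.dtypes != right.dtypes]

print(strict, nan_aware, list(dtype_diff.index))
```

If `dtype_diff` is empty for `old_dropnans` and `new_dropnans`, the cause lies elsewhere and would need further digging.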