# Example: Comparison

Technically, only one dependency is needed: `tensorflow-data-validation`.

In [None]:
%pip install -U -r requirements.txt

In [None]:
import pandas as pd
import tensorflow_data_validation as dv

In [None]:
pd.set_option('display.max_columns', 10)

## Data

First, let us load some data. In this example, we use data extracted from the
[Global Health Data Exchange][source].

[source]: http://ghdx.healthdata.org/gbd-results-tool?params=gbd-api-2017-permalink/361bd5bf7237ace4d279fcc9e1daf5bd

In [None]:
data_narrow = pd.read_csv('data/data.csv')
data_narrow.head()

Second, we briefly explore what the data look like using the same tool that will
be used for the comparison shortly:

In [None]:
statistics = dv.generate_statistics_from_dataframe(data_narrow, n_jobs=-1)
dv.visualize_statistics(statistics)

It can be seen that `age`, `measure`, and `sex` can be dropped, as they are
constant, and that there are 195 locations and 21 causes of death, which are
quantified using 3 metrics. The data are given in a narrow format. To make the
comparison more interesting, let us convert the data into a wide format:

In [None]:
data_wide = pd.pivot_table(data_narrow,
                           index=['location', 'year'],
                           columns=['cause', 'metric'],
                           values='val')
data_wide.columns = data_wide.columns.to_flat_index()
data_wide.columns = map(': '.join, data_wide.columns.values)
data_wide.reset_index(inplace=True)
data_wide.head()
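To make the reshaping step easier to follow, here is a minimal sketch of the same narrow-to-wide conversion on a tiny, made-up data frame. The names `toy_narrow` and `toy_wide` and all the values are hypothetical and exist only for illustration. The column names mirror the ones used in the pivot above.

```python
import pandas as pd

# Hypothetical toy data in the narrow format: one row per
# (location, year, cause, metric) combination. The values are made up.
toy_narrow = pd.DataFrame({
    'location': ['A', 'A', 'A', 'A'],
    'year': [2000, 2000, 2001, 2001],
    'cause': ['X', 'X', 'X', 'X'],
    'metric': ['Number', 'Rate', 'Number', 'Rate'],
    'val': [10.0, 0.1, 12.0, 0.2],
})

# Pivot so that each (cause, metric) pair becomes its own column.
toy_wide = pd.pivot_table(toy_narrow,
                          index=['location', 'year'],
                          columns=['cause', 'metric'],
                          values='val')

# Flatten the two-level column index into plain strings such as 'X: Rate'.
toy_wide.columns = [': '.join(column)
                    for column in toy_wide.columns.to_flat_index()]
toy_wide = toy_wide.reset_index()

print(toy_wide)
```

Four narrow rows collapse into two wide rows, one per location and year, with one column for each combination of cause and metric.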

Lastly, let us split the data into two so there is something to compare:

In [None]:
data_1 = data_wide.query('year >= 2000 and year < 2010')
data_2 = data_wide.query('year >= 2010 and year < 2018')

We will thus compare the data for 2000–2009 with the data for 2010–2017.
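The two queries use half-open year ranges, so no row can land in both subsets, and rows outside 2000–2017 land in neither. A minimal sketch on a hypothetical `toy` frame with made-up years illustrates this behavior:

```python
import pandas as pd

# Hypothetical years chosen to sit on and around the range boundaries.
toy = pd.DataFrame({'year': [1999, 2000, 2009, 2010, 2017, 2018]})

# The same query expressions as used for the split above.
part_1 = toy.query('year >= 2000 and year < 2010')
part_2 = toy.query('year >= 2010 and year < 2018')

print(part_1['year'].tolist())  # [2000, 2009]
print(part_2['year'].tolist())  # [2010, 2017]

# The subsets share no rows; 1999 and 2018 belong to neither.
shared = set(part_1.index) & set(part_2.index)
print(len(shared))  # 0
```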

## Comparison

Once the data are ready, the comparison itself takes three lines:

In [None]:
statistics_1 = dv.generate_statistics_from_dataframe(data_1, n_jobs=-1)
statistics_2 = dv.generate_statistics_from_dataframe(data_2, n_jobs=-1)

In [None]:
dv.visualize_statistics(lhs_statistics=statistics_1,
                        rhs_statistics=statistics_2)