In [1]:
from ipprl_tools import metrics
from ipprl_tools.utils.data import get_data
import pandas as pd

The `get_data` function downloads a demo dataset CSV, which we use to demonstrate the metrics calculation.

In [2]:
demo_data_path = get_data()

**Note:** For the metrics calculation to work correctly, `pd.read_csv` must be called with the following arguments:

- `dtype=str` - Ensures that all columns are read in as strings (rather than as numeric, date, or other types). This matters because the metrics calculations assume they are working with strings.
- `keep_default_na=False` - Ensures that missing values are coded as `''` instead of `np.nan` or some other value. The metrics calculations also assume that the empty string `''` is the missing-value placeholder.
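
To make the effect of these two arguments concrete, here is a minimal sketch that parses a small in-memory CSV both ways. The CSV content and the variable names are made up for illustration and are not part of the demo dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV: a ZIP code with a leading zero,
# an empty city field, and a city field containing the literal text "NA".
csv_text = "zip,city\n02134,Boston\n10001,\n60601,NA\n"

# Default parsing: `zip` is inferred as an integer column (the leading zero
# is lost), and both the empty field and "NA" are converted to NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# Parsing with the arguments recommended above: every value stays a string,
# and the empty field is kept as ''.
string_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default_df["zip"].tolist())
print(string_df["zip"].tolist())
print(string_df["city"].tolist())
```

One consequence worth noting: with `keep_default_na=False`, a field containing the text `NA` is kept as the string `'NA'`, so only truly empty fields end up as `''`.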

In [3]:
demo_data = pd.read_csv(demo_data_path, dtype=str, keep_default_na=False)

We can now calculate metrics on the demo data, using the `run_metrics` function.

In [4]:
result_metrics = metrics.run_metrics(demo_data)

We can view the metrics within this notebook.

In [5]:
result_metrics

| | Mean Group Size | Median Group Size | Stdev Group Size | Min Group Size | Max Group Size | Missing Data Ratio | Distinct Values Ratio | Shannon Entropy | Theoretical Max Entropy | % Theoretical Max Entropy | Row Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 1.0 | 1.0 | 0.0 | 1 | 1 | 0.0 | 1.0 | 16.94327 | 16.94327 | 100.0 | 126018 |
| address | 1.934928 | 1.0 | 1.282508 | 1 | 13 | 0.0 | 0.516815 | 15.727277 | 15.99099 | 98.350864 | 126018 |
| cell_phone_number | 1.063475 | 1.0 | 0.304271 | 1 | 6 | 0.000603 | 0.939755 | 16.809274 | 16.853626 | 99.73684 | 126018 |
| city | 8.341696 | 5.0 | 15.852526 | 1 | 666 | 0.0 | 0.11988 | 12.979011 | 13.88293 | 93.488989 | 126018 |
| county | 68.862295 | 29.0 | 128.212054 | 1 | 2424 | 0.0 | 0.014522 | 9.674918 | 10.837628 | 89.271546 | 126018 |
| dob | 4.109908 | 4.0 | 2.835599 | 1 | 32 | 0.0 | 0.243314 | 14.611513 | 14.904164 | 98.036445 | 126018 |
| father_first_name | 63.709808 | 10.0 | 200.663789 | 1 | 3050 | 0.0 | 0.015696 | 8.681256 | 10.949827 | 79.282129 | 126018 |
| father_last_name | 4.304776 | 1.0 | 19.389642 | 1 | 1208 | 0.0 | 0.2323 | 12.915957 | 14.837332 | 87.050398 | 126018 |
| first_name | 34.077339 | 6.0 | 97.923743 | 1 | 1813 | 0.0 | 0.029345 | 9.848584 | 11.85253 | 83.092674 | 126018 |
| home_phone_number | 1.902618 | 1.0 | 1.246427 | 1 | 13 | 0.0 | 0.525592 | 15.757274 | 16.015284 | 98.388974 | 126018 |
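
To help read these columns, the sketch below computes a few of them for a single column using the textbook definitions. It is not the `ipprl_tools` implementation: `column_summary` is a hypothetical helper, it assumes a non-empty column, and it ignores any special handling of missing values. The definitions do appear consistent with the output above: for `row_id`, log2(126018) ≈ 16.943 matches the theoretical max entropy, and for `address` the mean group size (1.934928) is the reciprocal of the distinct values ratio (0.516815).

```python
import math
from collections import Counter
from statistics import mean

def column_summary(values):
    """Summarise one column of string values (illustrative sketch only)."""
    counts = Counter(values)      # group size for each distinct value
    n_rows = len(values)
    n_distinct = len(counts)
    # Shannon entropy in bits: -sum(p * log2(p)) over the distinct values.
    entropy = -sum((c / n_rows) * math.log2(c / n_rows) for c in counts.values())
    # Entropy is largest when all distinct values are equally frequent.
    max_entropy = math.log2(n_distinct)
    # The percentage is undefined when there is only one distinct value.
    pct = 100 * entropy / max_entropy if max_entropy > 0 else float("nan")
    return {
        "mean_group_size": mean(counts.values()),
        "distinct_values_ratio": n_distinct / n_rows,
        "shannon_entropy": entropy,
        "theoretical_max_entropy": max_entropy,
        "pct_theoretical_max_entropy": pct,
    }

# Toy column: "a" appears twice, "b" and "c" once each.
summary = column_summary(["a", "a", "b", "c"])
print(summary)
```

A percentage close to 100 means the values in a column are spread almost evenly, while a lower percentage (as for `father_first_name`) means a few values dominate.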


Finally, we write the computed metrics out to a file.

In [6]:
result_metrics.to_csv("demo_metrics_out.csv", index_label="data_column")
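
The `index_label` argument names the index column in the output file. As a small sketch of why that is useful, the example below round-trips a made-up two-row table (a stand-in for `result_metrics`, not real output) through an in-memory CSV with and without the label.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the metrics table.
toy_metrics = pd.DataFrame({"Row Count": [3, 3]}, index=["city", "dob"])

# Without index_label, the index column is written with an empty header,
# which read_csv then names "Unnamed: 0".
unlabeled = io.StringIO()
toy_metrics.to_csv(unlabeled)
unlabeled.seek(0)
unlabeled_columns = pd.read_csv(unlabeled).columns.tolist()

# With index_label, the column has a name that can be used to restore the index.
labeled = io.StringIO()
toy_metrics.to_csv(labeled, index_label="data_column")
labeled.seek(0)
restored = pd.read_csv(labeled, index_col="data_column")

print(unlabeled_columns)
print(restored)
```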