## ipprl_tools Tutorial Notebook
This notebook is a walk-through of the following topics:
1. Reading data using Pandas.
2. Using Synthetic Data Generation Methods.
3. Calculating Linkability Metrics on Generated Data.
4. Writing data and metric information to file.

In [None]:
import pandas as pd
import numpy as np
from ipprl_tools import synthetic,metrics
from ipprl_tools.utils import tutorial,data

## 1. Reading Data Using Pandas

The module comes with a link to some pre-made synthetic data to demonstrate the corruption methods. To download it, we can use the `get_data()` function from the `utils.tutorial` submodule.

In [None]:
path = tutorial.get_data()

This gets us the path to the data that has been pre-downloaded.
To read in the data, we use the `read_pickle()` method from `pandas`. We use this method because it can handle reading compressed ZIP files. If your data is in CSV format, you can also use `pandas.read_csv()` to read your data in.

In either case, the `raw_data` variable will contain a Pandas DataFrame object after the call.

**Important Note:** For the corruption methods to work correctly, every column of the DataFrame you use must hold strings. The corruption methods expect to operate on strings, and many will break on non-string data. An easy way to ensure this is to call `.astype(str)` when reading your data, which casts all columns of the DataFrame to strings. (Older code often wrote `np.str`, an alias for the built-in `str` that was removed in NumPy 1.24.)

We can also print out a sample of the data using `<DataFrame>.head(<num_rows>)`.

In [None]:
raw_data = pd.read_pickle(path).astype(str)
# Drop some of the unnecessary columns in our dataset.
raw_data = raw_data.drop(["first_name","first_name2","last_name","last_name2","email","address","city2","zip2","state2"],axis=1)
# Rename the columns of our dataset.
raw_data.columns = ["first_name","last_name","email","address","ssn","sex","city","zip","state","dob","phone","phone2","phone3","race","pcp_npi","suffix","title"]
# Split the data into a dataset, and a swap set. We do this so that we can utilize the swap set in section 4.1.1
dataset = raw_data.iloc[:400000]
swap_set = raw_data.iloc[400000:]

dataset.head(5)
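
The string-type requirement from the note above can be checked directly. The following is a minimal sketch on a made-up two-row frame (the `toy` data is hypothetical and not part of the tutorial dataset): it casts every column with `.astype(str)` and then confirms that each cell is a Python string.

```python
import pandas as pd

# Hypothetical toy frame with a numeric column, standing in for freshly loaded data.
toy = pd.DataFrame({"first_name": ["Ann", "Bob"], "zip": [80045, 80204]})

# Cast every column to strings, as recommended before running corruption methods.
toy_str = toy.astype(str)

# Verify that every individual cell is now a Python string.
all_strings = all(isinstance(value, str) for value in toy_str.to_numpy().ravel())
print(toy_str.dtypes)
print(all_strings)
```

Running the same check on your own DataFrame before calling the corruption methods can catch stray numeric columns early.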

## 2. Using Synthetic Data Generation Methods

Once the data is read in, we want to apply some corruption methods on it.

In [None]:
# We make a copy of the first few rows of data here so that we can compare it to the non-corrupted version.
data_to_corrupt = dataset.iloc[:5].copy()
# The indicators dictionary will hold some information about the corruptions as they are performed.
indicators = {}

In this example, we call the `drop_per_column()` method on our small amount of sample data. We pass the function:
1. `data` - The DataFrame holding our data.
2. `indicators` - A dictionary to hold some metadata about the corruptions.
3. `columns` - We pass `columns = None` to signify that we want this operation to run on *all* columns in the DataFrame.
4. `drop_pct` - This parameter tells the function what percentage of the rows should be dropped. In our case, we want to drop 50%.

In [None]:
synthetic.drop_per_column(data=data_to_corrupt,indicators=indicators,columns=None,drop_pct=0.5)

If we compare the original data to our corrupted version, we can see that the function has randomly deleted some elements of each column. (With only 5 rows, the function rounded 50% down to 2 rows per column.)

In [None]:
comparison = data_to_corrupt.join(dataset.iloc[:5],lsuffix="_corrupt")
comparison[["first_name","first_name_corrupt","last_name","last_name_corrupt","address","address_corrupt"]]
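
The rounding mentioned above can be reproduced with simple arithmetic. This sketch assumes the number of dropped values is the floor of `n_rows * drop_pct`, which matches the 2-of-5 result seen here but is not taken from the library's source; `rows_to_drop` is a hypothetical helper.

```python
import math

def rows_to_drop(n_rows, drop_pct):
    """Number of values removed per column, assuming the product is rounded down."""
    return math.floor(n_rows * drop_pct)

# 5 rows at 50% gives 2.5, which rounds down to 2.
print(rows_to_drop(5, 0.5))
```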

The indicators dictionary also contains information about which elements specifically were removed.

In [None]:
def get_metrics_row(metadata, row,num_columns):
    return [None if metadata.get((i,row)) is None else metadata.get((i,row)).keys() for i in range(num_columns)]

def make_df_from_metadata(metadata,data):
    num_columns = len(data.columns)
    
    metrics_df = pd.DataFrame.from_dict({idx : get_metrics_row(metadata,idx,num_columns) for idx in range(len(data))},orient="index",columns=data.columns)
    metrics_df["type"] = "metadata"
    
    tmp_data = data.copy()
    tmp_data["type"] = "data"
    
    
    visual_df = pd.concat([tmp_data,metrics_df]).set_index("type",append=True).sort_index()
    return visual_df

Using the helper functions above, we can view the corrupted data and the indicator metadata side by side. The indicator metadata records which corruptions were applied and, for more complex corruption methods, details about the corruption performed on each element of the synthetic dataset.

In [None]:
meta_df = make_df_from_metadata(indicators,data_to_corrupt)
meta_df
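
The helper functions above look up entries with `metadata.get((i,row))` and then call `.keys()`, which suggests that the indicators dictionary is keyed by `(column_index, row_index)` tuples whose values are dictionaries keyed by corruption name. Under that assumption, a per-corruption tally can be sketched as follows. The entries in `indicators_example` are made up, and `True` is only a placeholder for whatever details the library stores.

```python
from collections import Counter

# Hypothetical indicators content; the key layout is inferred from get_metrics_row().
indicators_example = {
    (0, 1): {"drop_per_column": True},
    (0, 3): {"drop_per_column": True},
    (2, 1): {"string_insert_alpha": True, "drop_per_column": True},
}

def count_corruptions(metadata):
    """Count how many cells each corruption type touched."""
    counts = Counter()
    for per_cell in metadata.values():
        counts.update(per_cell.keys())
    return counts

corruption_counts = count_corruptions(indicators_example)
print(corruption_counts)
```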

## 2.1 Chaining Synthetic Methods

To generate a synthetic dataset suitable for linkage, we can call multiple synthetic data methods, one after another, on the same data. The end result of this chain is a dataset where multiple corruptions have been performed.

In [None]:
data_to_corrupt_large = dataset.iloc[:50].copy()
indicators_large = {}

In the code below, we chain together multiple calls to synthetic methods, passing the same data and indicator variables to each method. After calling the methods, we can print out the metadata DataFrame to see which corruptions were performed for each variable value.

In [None]:
insrt_columns = ["first_name","last_name","email"]
insrt_freqs = [0.2,0.2,0.5]
insrt_nums = [2,2,4]
synthetic.string_insert_alpha(data=data_to_corrupt_large,
                              indicators=indicators_large,
                              insrt_num=insrt_nums,
                              insrt_freq=insrt_freqs,
                              columns=insrt_columns)

n_insrt_columns = ["phone","ssn"]
n_insrt_freqs = [0.1,0.2]
n_insrt_nums = [2,2]
synthetic.string_insert_numeric(data=data_to_corrupt_large,
                                indicators=indicators_large,
                                insrt_num=n_insrt_nums,
                                insrt_freq=n_insrt_freqs,
                                columns=n_insrt_columns)

drop_cols = ["first_name","last_name","email","phone","ssn"]
drop_freqs = [0.2,0.1,0.5,0.4,0.1]
synthetic.drop_per_column(data=data_to_corrupt_large,
                          indicators=indicators_large,
                          columns=drop_cols,
                          drop_pct=drop_freqs)

In [None]:
large_meta_df = make_df_from_metadata(indicators_large,data_to_corrupt_large)
large_meta_df.head(10)
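
To make the effect of one link in this chain concrete, here is a sketch of what an alphabetic insertion might do to a single value. It is not the implementation of `string_insert_alpha()`; `insert_random_letters` is a hypothetical helper that inserts a fixed number of random lowercase letters at random positions.

```python
import random
import string

def insert_random_letters(value, num_inserts, rng):
    """Insert num_inserts random lowercase letters at random positions in value."""
    chars = list(value)
    for _ in range(num_inserts):
        position = rng.randint(0, len(chars))  # inclusive bounds, so appending is possible
        chars.insert(position, rng.choice(string.ascii_lowercase))
    return "".join(chars)

rng = random.Random(42)  # seeded so the example is repeatable
corrupted = insert_random_letters("smith", 2, rng)
print(corrupted)
```

The original characters keep their relative order, so the corrupted value is two characters longer and still contains `smith` as a subsequence.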

We can save this information by writing it to an Excel file using the following command.

In [None]:
large_meta_df.to_excel("test_excel.xlsx")

## 3.1 Calculating Linkability Metrics

Once we have a dataset that has been sufficiently corrupted, we may want to calculate linkability measures on the data, to determine which columns we should use for linkage.

We can calculate metrics on the data using the `metrics` submodule.

In [None]:
metrics.run_metrics(data_to_corrupt_large)

Each row in the above DataFrame represents a column from the original dataset. The columns in the DataFrame are various linkability measures, which are calculated directly from the data. For more information about what these linkability measures mean, visit [this page](https://github.com/cu-recordlinkage/iPPRL/blob/master/linkability/Metrics_Table.md).
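
To give a feel for what a per-column measure captures, here is a sketch of two simple statistics: the fraction of missing values and the Shannon entropy of the remaining values. These are illustrative stand-ins written in plain Python, not the code behind `metrics.run_metrics()`; the function names and the choice to treat empty strings and `None` as missing are assumptions.

```python
import math
from collections import Counter

def missing_ratio(values, missing=("", None)):
    """Fraction of entries that count as missing."""
    return sum(value in missing for value in values) / len(values)

def shannon_entropy(values):
    """Shannon entropy (in bits) of the value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical column of state codes with two missing entries.
state_column = ["CO", "CO", "NY", "", "CA", "CO", "", "NY"]
present = [value for value in state_column if value != ""]

print(missing_ratio(state_column))
print(shannon_entropy(present))
```

A column with few missing values and high entropy distinguishes records well, which is the intuition behind choosing linkage fields from such measures.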

## 4.1 Preparing Files for Linkage

To generate a dataset that we can use for linkage testing, we can use another function from the `utils.data` submodule.

In this example, we operate on `dataset`, the first 400,000 rows of the tutorial data we read in at the start of the notebook.

In [None]:
left_ds,right_ds,gt_labels = data.split_dataset(dataset,overlap_pct=0.2)

In the above line of code, we used the `split_dataset` function from `ipprl_tools.utils.data` to split the dataset for us. This function accepts a set of data and splits it into two datasets, each of which has some unique rows, and some rows that overlap with the other dataset. The exact amount of overlap is configurable with the `overlap_pct` parameter.

In this case, we chose to have 20% of the rows from `dataset` appear in both `left_ds` and `right_ds`. 
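
The idea behind an overlapping split can be sketched in a few lines. This is not the implementation of `split_dataset()`; `split_with_overlap` is a hypothetical helper that works on a plain list of row IDs and assumes `overlap_pct` is a fraction of the total row count.

```python
import random

def split_with_overlap(ids, overlap_pct, seed=0):
    """Split ids into two lists that share round(len(ids) * overlap_pct) entries."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_overlap = round(len(shuffled) * overlap_pct)
    shared, rest = shuffled[:n_overlap], shuffled[n_overlap:]
    half = len(rest) // 2
    left = shared + rest[:half]    # shared rows plus the first half of the unique rows
    right = shared + rest[half:]   # shared rows plus the second half of the unique rows
    return left, right, shared

left_ids, right_ids, shared_ids = split_with_overlap(range(100), overlap_pct=0.2)
print(len(left_ids), len(right_ids), len(shared_ids))
```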

In addition to returning the two dataset variables, the function also returns a set of ground-truth labels, `gt_labels`, which provide the `ID`s of the overlapping rows in `left_ds` and `right_ds`. If desired, you can evaluate the performance of your linkage using these known ground-truth labels.
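
One common way to score a linkage against ground-truth pairs is precision and recall. The sketch below assumes that both the ground truth and the linkage output are lists of ID pairs; the IDs and the `precision_recall` helper are hypothetical, and pair order is ignored.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for predicted ID pairs against ground-truth pairs."""
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical ground truth and linkage output.
gt_example = [(1, 101), (2, 102), (3, 103), (4, 104)]
predicted_example = [(101, 1), (2, 102), (3, 999)]

precision, recall = precision_recall(predicted_example, gt_example)
print(precision, recall)
```

Here two of the three predicted pairs are correct, and two of the four true pairs were found.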

### 4.1.1 Applying Corruption Methods
As in Section 2, we will now apply corruption methods to the synthetic data.

This time, we must operate on two datasets, `left_ds` and `right_ds`.

**Note:** These methods may take a long time to run because they operate on a large dataset. To speed them up, you can pass a subset of the data (selected with the DataFrame `.iloc` indexer) to the `split_dataset()` function above.

In [None]:
left_meta = {}
synthetic.string_transpose(left_ds,left_meta,4,0.05)
print("Transpose Complete.")
synthetic.string_delete(left_ds,left_meta,3,0.05)
print("Delete Complete.")
synthetic.string_insert_alpha(left_ds,left_meta,3,0.05,columns=["first_name","last_name","email","address","city","title"])
print("Insert Alpha Complete.")
synthetic.string_insert_numeric(left_ds,left_meta,3,0.05,columns=["phone","phone2","phone3","zip"])
print("Insert Numeric Complete.")
synthetic.edit_values(left_ds,swap_set,left_meta,0.1)
print("Edit Values Complete.")

columns = ["first_name",
            "last_name",
            "email",
            "address",
            "ssn",
            "sex",
            "city",
            "zip",
            "state",
            "dob",
            "phone",
            "phone2",
            "phone3",
            "race",
            "pcp_npi",
            "suffix",
            "title"]

drop_pcts = [0.03,
             0.03,
             0.75,
             0.06,
             0.25,
             0.07,
             0.07,
             0.07,
             0.02,
             0.02,
             0.85,
             0.85,
             0.2,
             0.2,
             0.99,
             0.2]

synthetic.drop_per_column(left_ds,left_meta,drop_pct=drop_pcts,columns=columns)
print("Per-Column Drop Complete.")
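
Because the per-column arguments are given as parallel lists, it seems reasonable to assume they are matched to `columns` by position, so a length mismatch is worth catching before a long-running call. The sketch below uses a hypothetical helper, `check_aligned` (not part of `ipprl_tools`), to compare the lengths; with the 17-name `columns` list above, a per-column `drop_pct` list would be expected to carry 17 values as well.

```python
def check_aligned(columns, per_column_values, name="values"):
    """Raise ValueError if the lists differ in length; otherwise pair them up."""
    if len(columns) != len(per_column_values):
        raise ValueError(
            f"{name} has {len(per_column_values)} entries for {len(columns)} columns"
        )
    return dict(zip(columns, per_column_values))

# Hypothetical short lists for illustration.
example_columns = ["first_name", "last_name", "email"]
example_pcts = [0.03, 0.03, 0.75]

paired = check_aligned(example_columns, example_pcts, name="drop_pct")
print(paired)
```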

In [None]:
right_meta = {}
synthetic.string_transpose(right_ds,right_meta,4,0.05)
print("Transpose Complete.")
synthetic.string_delete(right_ds,right_meta,3,0.05)
print("Delete Complete.")
synthetic.string_insert_alpha(right_ds,right_meta,3,0.05,columns=["first_name","last_name","email","address","city","title"])
print("Insert Alpha Complete.")
synthetic.string_insert_numeric(right_ds,right_meta,3,0.05,columns=["phone","phone2","phone3","zip"])
print("Insert Numeric Complete.")
synthetic.edit_values(right_ds,swap_set,right_meta,0.1)
print("Edit Values Complete.")

columns = ["first_name",
            "last_name",
            "email",
            "address",
            "ssn",
            "sex",
            "city",
            "zip",
            "state",
            "dob",
            "phone",
            "phone2",
            "phone3",
            "race",
            "pcp_npi",
            "suffix",
            "title"]

r_drop_pcts = [0.05,
             0.03,
             0.75,
             0.06,
             0.25,
             0.07,
             0.07,
             0.07,
             0.02,
             0.02,
             0.80,
             0.80,
             0.2,
             0.2,
             0.99,
             0.2]

synthetic.drop_per_column(right_ds,right_meta,drop_pct=r_drop_pcts,columns=columns)
print("Per-Column Drop Complete.")

To verify that the corruption ran on both datasets, we can run the linkability metrics on both.

In [None]:
metrics.run_metrics(left_ds)

In [None]:
metrics.run_metrics(right_ds)

We can also look at the first few rows of the data.

In [None]:
left_ds.head()

In [None]:
right_ds.head()

We can now combine these two datasets into a single dataset in order to use it as input for linkage.

In [None]:
full_ds = pd.concat([left_ds,right_ds])

In [None]:
full_ds

The `concat()` function concatenates the two DataFrames into a single DataFrame along the row axis. In our case, the `split_dataset()` utility function arranged it so that the values of our index column, `id`, are unique. If you did not use `split_dataset()`, make sure you keep references to the original IDs of your data so that you can evaluate linkage performance later.

`full_ds` is now a DataFrame which contains `left_ds` and `right_ds` stacked on top of each other (concatenated along the row dimension).

In [None]:
full_ds
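
The stacking behavior and the uniqueness of the index can be checked on a small example. The two frames below are made up; they stand in for `left_ds` and `right_ds` and assume an index named `id`.

```python
import pandas as pd

# Hypothetical stand-ins for left_ds and right_ds with non-overlapping id values.
left_example = pd.DataFrame({"first_name": ["Ann", "Bob"]},
                            index=pd.Index([0, 1], name="id"))
right_example = pd.DataFrame({"first_name": ["Anm", "Cal"]},
                             index=pd.Index([2, 3], name="id"))

# Stack the frames row-wise; the original index values are preserved.
stacked = pd.concat([left_example, right_example])

print(len(stacked))
print(stacked.index.is_unique)
```

If `is_unique` were `False`, label-based lookups such as `.loc[some_id]` could return more than one row, which would make ground-truth checks ambiguous.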

We can verify that the ground-truth IDs from `split_dataset()` are still valid.

In [None]:
pair_num = 1
full_ds.loc[[gt_labels[pair_num][0],gt_labels[pair_num][1]]]

Now we simply call `.to_csv()` to save our new dataset.

In [None]:
full_ds.to_csv("test_dataset.csv")
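
When the CSV is read back, pandas infers types by default, which can turn ZIP codes into integers and empty strings into `NaN`. Below is a sketch of a round trip that keeps the columns as strings, using an in-memory buffer and a made-up two-row frame in place of the real file.

```python
import io
import pandas as pd

# Hypothetical frame with a leading-zero ZIP code and an empty suffix.
example = pd.DataFrame({"zip": ["02134", "80045"], "suffix": ["", "Jr"]},
                       index=pd.Index([10, 11], name="id"))

buffer = io.StringIO()
example.to_csv(buffer)
buffer.seek(0)

# Read the data columns as strings and keep empty fields as empty strings.
restored = pd.read_csv(buffer, index_col="id",
                       dtype={"zip": str, "suffix": str},
                       keep_default_na=False)
print(restored)
```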

In order to evaluate performance later, it is also a good idea to save the individual meta objects as well as the ground-truth labels.

In [None]:
import pickle
# We can save the metadata objects as .pkl files, a common binary format for Python objects.
# Using `with` ensures each file is closed once it has been written.
with open("left_meta.pkl","wb") as f:
    pickle.dump(left_meta,f)
with open("right_meta.pkl","wb") as f:
    pickle.dump(right_meta,f)
# We'll save the ground-truth labels into a pickle as well.
with open("gt_labels.pkl","wb") as f:
    pickle.dump(gt_labels,f)

To open these files again later, we can use the `pickle.load()` function in the same way we just used `pickle.dump()`.

In [None]:
with open("gt_labels.pkl","rb") as f:
    test_read = pickle.load(f)
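
A quick way to confirm that nothing was lost is to compare the reloaded object with the original. This sketch runs the round trip in a temporary directory on a made-up list of ID pairs, so it does not touch the files written above.

```python
import os
import pickle
import tempfile

# Hypothetical ground-truth pairs standing in for gt_labels.
gt_example = [(0, 400), (1, 401)]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "gt_labels.pkl")
    with open(pkl_path, "wb") as f:
        pickle.dump(gt_example, f)
    with open(pkl_path, "rb") as f:
        restored_labels = pickle.load(f)

print(restored_labels == gt_example)
```

Only unpickle files that you created yourself or that come from a source you trust, since loading a pickle can execute arbitrary code.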