## ipprl_tools Tutorial Notebook
This notebook is a walk-through of the following topics:

1. Reading data using Pandas.
2. Using Synthetic Data Generation Methods.
3. Calculating Linkability Metrics on Generated Data.
4. Writing data and metric information to file.

In [1]:
import pandas as pd
import numpy as np
from ipprl_tools import synthetic, metrics, utils

## 1. Reading Data Using Pandas

The module comes with a link to some pre-made synthetic data to demonstrate the corruption methods. To download it, we can use the `get_data()` function from the `utils` module.

In [4]:
path = utils.get_data()

This gives us the path to the downloaded data file.
To read in the data, we use the `read_pickle()` function from `pandas`, because it can read compressed ZIP files directly. If your data is in CSV format, you can use `pandas.read_csv()` instead.

In either case, the `data` variable will hold a Pandas DataFrame after the call.
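
As a minimal sketch of the two reading paths described above, the example below writes a small toy DataFrame to a zipped pickle and to a CSV file, then reads both back. The toy values and file names are hypothetical and are not part of the tutorial dataset.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the tutorial's dataset (hypothetical values).
toy = pd.DataFrame({"first_name": ["Ann", "Bob"], "last_name": ["Lee", "Ray"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "toy.pkl.zip")
    csv_path = os.path.join(tmp_dir, "toy.csv")

    # pandas infers ZIP compression from the ".zip" file extension.
    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)

    # Both readers return a DataFrame.
    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.head(2))
print(from_csv.head(2))
```

Either way the result is an ordinary DataFrame, so the rest of the notebook does not depend on which format the data was stored in.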

We can also print a sample of the data using `<DataFrame>.head(<num_rows>)`.

In [9]:
data = pd.read_pickle(path)
data.head(5)

| | first_name | last_name | first_name2 | last_name2 | first_name3 | last_name3 | email | email2 | address | address2 | ... | state | state2 | dob | phone | phone2 | phone3 | race | pcp_npi | suffix | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tara | Mechan | Collete | Charle | Isabelita | Dommersen | idommersen0@google.it | idommersen0@webs.com | 30438 Sutteridge Park | 48 Grover Way | ... | Texas | Minnesota | 2017/10/24 | 713-816-8206 | 651-608-1749 | 561-717-5270 | Sri Lankan | 76-5006664 | Jr | Honorable |
| 1 | Witty | Doick | Jordan | Moyers | Byrom | Le Moucheux | blemoucheux1@fda.gov | blemoucheux1@cornell.edu | 894 Coolidge Drive | 158 Marquette Hill | ... | Georgia | Kentucky | 2017/09/10 | 404-582-9658 | 502-478-1240 | 540-141-9416 | Colville | 49-7957492 | Sr | Honorable |
| 2 | Duffy | Kinastan | Araldo | Slott | Garwin | Ismirnioglou | gismirnioglou2@lulu.com | gismirnioglou2@army.mil | 197 Barby Hill | 9538 Lighthouse Bay Circle | ... | Indiana | California | 2017/07/22 | 574-885-2620 | 626-605-9078 | 406-221-1811 | Asian Indian | 68-4856593 | Jr | Honorable |
| 3 | Winfred | Holbarrow | Jedediah | Jewkes | Ewan | Paquet | epaquet3@unc.edu | epaquet3@baidu.com | 03 Park Meadow Junction | 0123 Dawn Park | ... | Georgia | New York | 2017/05/26 | 706-761-4259 | 212-881-3527 | 502-205-2203 | Honduran | 78-9072361 | Jr | Mr |
| 4 | Faydra | Quinet | Arlyn | Battershall | Kamila | Tailour | ktailour4@seesaa.net | ktailour4@rediff.com | 9 Evergreen Junction | 8 Linden Terrace | ... | Florida | South Dakota | 2017/09/22 | 850-315-6220 | 605-784-3270 | 704-410-3803 | Eskimo | 95-6884148 | II | Mr |


## 2. Using Synthetic Data Generation Methods

Once the data is read in, we want to apply some corruption methods to it.

In [13]:
# We make a copy of the first few rows of data here so that we can compare it to the non-corrupted version.
data_to_corrupt = data.iloc[:5].copy()
# The indicators dictionary will hold some information about the corruptions as they are performed.
indicators = {}

In this example, we call the `drop_per_column()` function on our small sample of data. We pass the function:
1. `data` - The DataFrame holding the data to corrupt (here, `data_to_corrupt`).
2. `indicators` - A dictionary to hold some metadata about the corruptions.
3. `columns` - We pass `columns = None` to signify that we want this operation to run on *all* columns in the DataFrame.
4. `drop_pct` - This parameter tells the function what fraction of the values in each column should be dropped. In our case, we pass `0.5` to drop 50%.
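
To make the parameters above concrete, here is a minimal sketch of what a per-column drop could look like. It is not the `ipprl_tools` implementation: the function name `drop_per_column_sketch`, the `seed` parameter, the use of `None` as the missing value, and the indicator layout (a dictionary keyed by column position and row position) are all assumptions made for illustration.

```python
import math

import numpy as np
import pandas as pd


def drop_per_column_sketch(data, indicators, columns=None, drop_pct=0.5, seed=0):
    """Blank out floor(drop_pct * n_rows) randomly chosen values in each column."""
    rng = np.random.default_rng(seed)
    # columns=None means "operate on every column".
    columns = list(data.columns) if columns is None else list(columns)
    n_drop = math.floor(drop_pct * len(data))
    for col in columns:
        col_idx = data.columns.get_loc(col)
        # Pick n_drop distinct row positions for this column.
        rows = rng.choice(len(data), size=n_drop, replace=False)
        for row in rows:
            data.iat[int(row), col_idx] = None
            # Assumed layout: (column position, row position) -> {corruption name: details}
            indicators.setdefault((col_idx, int(row)), {})["drop_per_column"] = True


toy = pd.DataFrame({"first_name": list("ABCDE"), "last_name": list("VWXYZ")})
toy_indicators = {}
drop_per_column_sketch(toy, toy_indicators, columns=None, drop_pct=0.5)

# Number of missing values per column after the drop.
print(toy.isna().sum())
```

With 5 rows and `drop_pct=0.5`, the sketch removes `floor(2.5) = 2` values from each column and records one indicator entry per removed value.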

In [14]:
synthetic.drop_per_column(data=data_to_corrupt, indicators=indicators, columns=None, drop_pct=0.5)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


If we compare the original data to our corrupted version, we can see that the function has randomly deleted some elements of each column (50% of 5 rows is 2.5, which the function rounded down to 2 values per column).

In [22]:
comparison = data_to_corrupt.join(data.iloc[:5], lsuffix="_corrupt")
comparison[["first_name", "first_name_corrupt", "last_name", "last_name_corrupt", "address", "address_corrupt"]]

| | first_name | first_name_corrupt | last_name | last_name_corrupt | address | address_corrupt |
|---|---|---|---|---|---|---|
| 0 | Tara | Tara | Mechan | Mechan | 30438 Sutteridge Park | 30438 Sutteridge Park |
| 1 | Witty | Witty | Doick | Doick | 894 Coolidge Drive |  |
| 2 | Duffy | Duffy | Kinastan | Kinastan | 197 Barby Hill | 197 Barby Hill |
| 3 | Winfred |  | Holbarrow |  | 03 Park Meadow Junction | 03 Park Meadow Junction |
| 4 | Faydra |  | Quinet |  | 9 Evergreen Junction |  |
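
The comparison above relies on how `DataFrame.join` treats overlapping column names. The toy example below illustrates this on a two-row frame: the left (corrupted) frame receives the `_corrupt` suffix, while the right (original) frame keeps its column names. The single simulated drop is a hypothetical stand-in for the output of a corruption method.

```python
import pandas as pd

original = pd.DataFrame({"first_name": ["Tara", "Witty"]})
corrupted = original.copy()
corrupted.loc[1, "first_name"] = None  # simulate one dropped value

# Both frames share the column name, so the left (corrupted) copy gets the suffix.
side_by_side = corrupted.join(original, lsuffix="_corrupt")

# Flag the rows where the corrupted value differs from the original value.
changed = side_by_side["first_name"].ne(side_by_side["first_name_corrupt"])
print(side_by_side[changed])
```

Because the join aligns on the index, this comparison only works when both frames still share the same row labels.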


The `indicators` dictionary also records exactly which elements were removed.

In [44]:
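def get_metrics_row(metadata, row, num_columns):
    # Look up the entry stored under the key (column position, row position) for every column.
    # Cells with no entry yield None; otherwise we keep the keys of the recorded entry.
    entries = [metadata.get((i, row)) for i in range(num_columns)]
    return [None if entry is None else entry.keys() for entry in entries]

def make_df_from_metadata(metadata, data):
    num_columns = len(data.columns)

    # Build one metadata row per data row, using the same columns as the data.
    metrics_df = pd.DataFrame.from_dict(
        {idx: get_metrics_row(metadata, idx, num_columns) for idx in range(len(data))},
        orient="index",
        columns=data.columns,
    )
    metrics_df["type"] = "metadata"

    tmp_data = data.copy()
    tmp_data["type"] = "data"

    # Stack both frames, then sort so each data row sits directly above its metadata row.
    visual_df = pd.concat([tmp_data, metrics_df]).set_index("type", append=True).sort_index()
    return visual_df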

Using the helper functions above, we can view the corrupted data and the indicator metadata side by side. The indicator metadata records which corruptions were applied and, for more complex corruption methods, details about the corruption performed on each element of the synthetic dataset.
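
The indicator metadata can also be summarized rather than displayed cell by cell. The sketch below counts corruptions per column for a hand-written dictionary. The layout is inferred from the way the helper functions above look up entries (a key of column position and row position, mapping to a dictionary keyed by corruption name); the stored values (`True`) and the column names are assumptions for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical indicators in the layout the helper functions above read:
# (column position, row position) -> {corruption name: details}
toy_columns = pd.Index(["first_name", "last_name", "address"])
toy_indicators = {
    (0, 3): {"drop_per_column": True},
    (0, 4): {"drop_per_column": True},
    (2, 1): {"drop_per_column": True},
}

# Count how often each corruption was applied to each column.
counts = Counter()
for (col_idx, _row), corruptions in toy_indicators.items():
    for name in corruptions:
        counts[(toy_columns[col_idx], name)] += 1

for (column, name), n in sorted(counts.items()):
    print(column, name, n)
```

A summary like this is a quick way to check that a corruption method touched the expected number of elements in each column.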

In [47]:
meta_df = make_df_from_metadata(indicators,data_to_corrupt)
meta_df

| | type | first_name | last_name | first_name2 | last_name2 | first_name3 | last_name3 | email | email2 | address | address2 | ... | state | state2 | dob | phone | phone2 | phone3 | race | pcp_npi | suffix | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | data | Tara | Mechan |  | Charle | Isabelita | Dommersen | idommersen0@google.it | idommersen0@webs.com | 30438 Sutteridge Park |  | ... | Texas |  |  | 713-816-8206 | 651-608-1749 |  | Sri Lankan |  | Jr | Honorable |
| 0 | metadata |  |  | (drop_per_column) |  |  |  |  |  |  | (drop_per_column) | ... |  | (drop_per_column) | (drop_per_column) |  |  | (drop_per_column) |  | (drop_per_column) |  |  |
| 1 | data | Witty | Doick | Jordan | Moyers | Byrom | Le Moucheux | blemoucheux1@fda.gov |  |  | 158 Marquette Hill | ... | Georgia | Kentucky | 2017/09/10 | 404-582-9658 | 502-478-1240 | 540-141-9416 |  | 49-7957492 | Sr | Honorable |
| 1 | metadata |  |  |  |  |  |  |  | (drop_per_column) | (drop_per_column) |  | ... |  |  |  |  |  |  | (drop_per_column) |  |  |  |
| 2 | data | Duffy | Kinastan | Araldo |  |  |  | gismirnioglou2@lulu.com | gismirnioglou2@army.mil | 197 Barby Hill | 9538 Lighthouse Bay Circle | ... |  | California | 2017/07/22 |  | 626-605-9078 | 406-221-1811 | Asian Indian |  | Jr |  |
| 2 | metadata |  |  |  | (drop_per_column) | (drop_per_column) | (drop_per_column) |  |  |  |  | ... | (drop_per_column) |  |  | (drop_per_column) |  |  |  | (drop_per_column) |  | (drop_per_column) |
| 3 | data |  |  |  | Jewkes | Ewan |  |  |  | 03 Park Meadow Junction |  | ... |  |  | 2017/05/26 |  |  |  | Honduran | 78-9072361 |  |  |
| 3 | metadata | (drop_per_column) | (drop_per_column) | (drop_per_column) |  |  | (drop_per_column) | (drop_per_column) | (drop_per_column) |  | (drop_per_column) | ... | (drop_per_column) | (drop_per_column) |  | (drop_per_column) | (drop_per_column) | (drop_per_column) |  |  | (drop_per_column) | (drop_per_column) |
| 4 | data |  |  | Arlyn |  |  | Tailour |  | ktailour4@rediff.com |  | 8 Linden Terrace | ... | Florida | South Dakota |  | 850-315-6220 |  | 704-410-3803 |  | 95-6884148 |  | Mr |
| 4 | metadata | (drop_per_column) | (drop_per_column) |  | (drop_per_column) | (drop_per_column) |  | (drop_per_column) |  | (drop_per_column) |  | ... |  |  | (drop_per_column) |  | (drop_per_column) |  | (drop_per_column) |  | (drop_per_column) |  |
