# Data Profiler

## Table of Contents:
* [Importing libraries, loading & transforming data](#first-bullet)
* [Using the Existing Labeler](#second-bullet)
* [Training Labeler](#third-bullet)
* [Creating Function to Train Labeler](#fourth-bullet)

### Importing libraries, loading & transforming data <a class="anchor" id="first-bullet"></a>



In [1]:
!pip install dataprofiler

Collecting dataprofiler
  Downloading DataProfiler-0.7.7-py3-none-any.whl (7.5 MB)
Collecting requests==2.27.1
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting future>=0.18.2
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting python-snappy>=0.5.4
  Downloading python_snappy-0.6.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (56 kB)
Collecting fastavro>=1.0.0.post1
  Downloading fastavro-1.4.12-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)
Building wheels for collected packages: future
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491070 sha25

In [2]:
import os
import sys
import json
import pandas as pd

try:
    sys.path.insert(0, '..')
    import dataprofiler as dp
except ImportError:
    import dataprofiler as dp

# remove extra tf logging
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

We will now read in the record linkage dataset.

In [3]:
data = dp.Data('/content/drive/MyDrive/Capstone/Client Work/Data/recordlinkage1.csv')
df_data = data.data
df_data.head()

| | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| 1 | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| 2 | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
| 3 | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
| 4 | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |


### Using the Existing Labeler <a class="anchor" id="second-bullet"></a>


#### We will run the pre-existing structured-data labeler on our dataset

In [4]:
labeler = dp.DataLabeler(labeler_type='structured')

# print out the labels and label mapping
print("Labels: {}".format(labeler.labels)) 
print("\n")
print("Label Mapping: {}".format(labeler.label_mapping))
print("\n")

# make predictions and get labels for each cell going row by row
# predict options are model dependent and the default model can show prediction confidences
predictions = labeler.predict(data, predict_options={"show_confidences": True})

# display prediction results
print("Predictions: {}".format(predictions['pred']))
print("\n")

# display confidence results
print("Confidences: {}".format(predictions['conf']))

Labels: ['PAD', 'UNKNOWN', 'ADDRESS', 'BAN', 'CREDIT_CARD', 'DATE', 'TIME', 'DATETIME', 'DRIVERS_LICENSE', 'EMAIL_ADDRESS', 'UUID', 'HASH_OR_KEY', 'IPV4', 'IPV6', 'MAC_ADDRESS', 'PERSON', 'PHONE_NUMBER', 'SSN', 'URL', 'US_STATE', 'INTEGER', 'FLOAT', 'QUANTITY', 'ORDINAL']


Label Mapping: {'PAD': 0, 'UNKNOWN': 1, 'ADDRESS': 2, 'BAN': 3, 'CREDIT_CARD': 4, 'DATE': 5, 'TIME': 6, 'DATETIME': 7, 'DRIVERS_LICENSE': 8, 'EMAIL_ADDRESS': 9, 'UUID': 10, 'HASH_OR_KEY': 11, 'IPV4': 12, 'IPV6': 13, 'MAC_ADDRESS': 14, 'PERSON': 15, 'PHONE_NUMBER': 16, 'SSN': 17, 'URL': 18, 'US_STATE': 19, 'INTEGER': 20, 'FLOAT': 21, 'QUANTITY': 22, 'ORDINAL': 23}


Predictions: ['UNKNOWN' 'UNKNOWN' 'INTEGER' ... 'UNKNOWN' 'DATE' 'DRIVERS_LICENSE']


Confidences: [[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
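The confidence array has one row per cell and, judging by the output above, one column per label. As a minimal sketch, assuming the column order follows the indices in `labeler.label_mapping` (an assumption, not something this notebook states), the most likely label for each row can be recovered with an argmax. The `conf` values below are hand-made for illustration, not real labeler output, and `confidences_to_labels` is a hypothetical helper name.

```python
import numpy as np

# Label mapping copied from the labeler output above.
label_mapping = {
    'PAD': 0, 'UNKNOWN': 1, 'ADDRESS': 2, 'BAN': 3, 'CREDIT_CARD': 4,
    'DATE': 5, 'TIME': 6, 'DATETIME': 7, 'DRIVERS_LICENSE': 8,
    'EMAIL_ADDRESS': 9, 'UUID': 10, 'HASH_OR_KEY': 11, 'IPV4': 12,
    'IPV6': 13, 'MAC_ADDRESS': 14, 'PERSON': 15, 'PHONE_NUMBER': 16,
    'SSN': 17, 'URL': 18, 'US_STATE': 19, 'INTEGER': 20, 'FLOAT': 21,
    'QUANTITY': 22, 'ORDINAL': 23,
}


def confidences_to_labels(conf, label_mapping):
    """Return the highest-confidence label name for each row of conf.

    Assumes column i of conf corresponds to the label whose index is i.
    """
    index_to_label = {index: label for label, index in label_mapping.items()}
    best = np.argmax(conf, axis=1)
    return [index_to_label[int(i)] for i in best]


# Hand-made confidences for two cells (illustrative values only).
conf = np.zeros((2, len(label_mapping)))
conf[0, label_mapping['UNKNOWN']] = 1.0
conf[1, label_mapping['DATE']] = 0.9
conf[1, label_mapping['INTEGER']] = 0.1

print(confidences_to_labels(conf, label_mapping))
```

Keeping the full confidence rows, rather than only the argmax, makes it possible to flag cells where two labels score almost equally.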


In [5]:
# helper functions for printing results

def get_structured_results(results):
    """Helper function to get data labels for each column."""
    columns = []
    predictions = []
    samples = []
    for col in results['data_stats']:
        columns.append(col['column_name'])
        predictions.append(col['data_label'])
        samples.append(col['samples'])

    df_results = pd.DataFrame({'Column': columns, 'Prediction': predictions, 'Sample': samples})
    return df_results

def get_unstructured_results(data, results):
    """Helper function to get data labels for each labeled piece of text."""
    labeled_data = []
    for pred in results['pred'][0]:
        labeled_data.append([data[0][pred[0]:pred[1]], pred[2]])
    label_df = pd.DataFrame(labeled_data, columns=['Text', 'Labels'])
    return label_df
    

pd.set_option('display.width', 100)
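`get_unstructured_results` is not called elsewhere in this notebook. Its indexing implies that each prediction is a `(start, end, label)` triple over the first text sample. The sketch below repeats the helper and runs it on hand-made values to show that shape; the text, spans, and labels are illustrative assumptions, not real labeler output.

```python
import pandas as pd


def get_unstructured_results(data, results):
    """Helper function to get data labels for each labeled piece of text."""
    labeled_data = []
    for pred in results['pred'][0]:
        # pred is assumed to be (start, end, label); slice the first sample
        labeled_data.append([data[0][pred[0]:pred[1]], pred[2]])
    return pd.DataFrame(labeled_data, columns=['Text', 'Labels'])


# Hand-made input: one text sample and two character spans with labels.
text = ["michaela neumann, 8 stanley street"]
mock_results = {'pred': [[(0, 16, 'PERSON'), (18, 34, 'ADDRESS')]]}

label_df = get_unstructured_results(text, mock_results)
print(label_df)
```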

In [6]:
# set options to only run the labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"structured_options.text.is_enabled": False, 
                     "int.is_enabled": False, 
                     "float.is_enabled": False, 
                     "order.is_enabled": False, 
                     "category.is_enabled": False, 
                     "chi2_homogeneity.is_enabled": False,
                     "datetime.is_enabled": False,})

profile = dp.Profiler(data, options=profile_options)

results = profile.report()    
print(get_structured_results(results))

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████| 10/10 [00:00<00:00, 163.28it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 10/10 [00:00<00:00, 11.55it/s]


          Column       Prediction                                             Sample
0     given_name          UNKNOWN          [pakita, kayla, lachlan, indiana, connor]
1        surname          UNKNOWN        [wasley, boileau, roff, campbell, littlely]
2  street_number          INTEGER                               [67, 303, 9, 14, 16]
3      address_1          UNKNOWN  [hone place, shirlow place, geelong street, ga...
4      address_2          UNKNOWN  [uarah, the lakes retirement village, northwoo...
5         suburb          UNKNOWN  [ryde, balaklava, burwood east, belmont, norwood]
6       postcode          INTEGER                     [5063, 0820, 3943, 3155, 5086]
7          state          UNKNOWN                           [vic, sa, act, qld, vic]
8  date_of_birth             DATE  [19960111, 19240510, 19140203, 19021008, 19280...
9     soc_sec_id  DRIVERS_LICENSE      [3138813, 5659846, 5875656, 4738674, 4578360]


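In this example, the default labeler detects integers (`street_number`, `postcode`) and dates (`date_of_birth`), and it labels `soc_sec_id` as `DRIVERS_LICENSE`. `UNKNOWN` typically marks free-text strings, which is appropriate for the name, address, suburb, and state columns. The label set printed above does not distinguish fields such as given name, surname, or suburb.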

### Training Labeler <a class="anchor" id="third-bullet"></a>

In [7]:
data.columns

Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2', 'suburb', 'postcode',
       'state', 'date_of_birth', 'soc_sec_id'],
      dtype='object')

#### Here we train the labeler on the record linkage dataset

In [8]:
data = dp.Data('/content/drive/MyDrive/Capstone/Client Work/Data/recordlinkage1.csv')
df = data.data[['given_name', 'surname', 'street_number', 'address_1', 'address_2', 'suburb', 'postcode',
                'state', 'date_of_birth', 'soc_sec_id']]
df.head()

# split data into training and test sets
split_ratio = 0.2
df = df.sample(frac=1).reset_index(drop=True)
data_train = df[:int((1 - split_ratio) * len(df))]
data_test = df[int((1 - split_ratio) * len(df)):]
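The split above reshuffles differently on every run because `sample` is called without a seed. Below is a minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable for this experiment. `split_train_test` and the toy frame are illustrative names and data, not part of the notebook.

```python
import pandas as pd


def split_train_test(df, split_ratio=0.2, seed=0):
    """Shuffle df with a fixed seed and hold out split_ratio of the rows."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int((1 - split_ratio) * len(shuffled))
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]


# Toy frame standing in for the record linkage data.
toy = pd.DataFrame({
    'given_name': [f"name_{i}" for i in range(10)],
    'postcode': [4000 + i for i in range(10)],
})

train, test = split_train_test(toy)
train_again, test_again = split_train_test(toy)
print(len(train), len(test))
```

With the same seed, repeated calls return the same rows, so results from different training runs stay comparable.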

In [10]:
# train a new labeler with column names as labels
if not os.path.exists('data_labeler_saved'):
    os.makedirs('data_labeler_saved')

labeler = dp.train_structured_labeler(
    data=data_train,
    save_dirpath="data_labeler_saved",
    epochs=50,
    default_label="given_name"
)

EPOCH 0, batch_id 1: loss: 3.810373 - acc: 0.272151 - f1_score 0.272151
EPOCH 0 (3s), loss: 2.989107 - acc: 0.387145 - f1_score 0.387145 -- val_f1: 0.065428 - val_precision: 0.079006 - val_recall 0.136769
EPOCH 1 (0s), loss: 1.741082 - acc: 0.520450 - f1_score 0.520450 -- val_f1: 0.088839 - val_precision: 0.285110 - val_recall 0.202605
EPOCH 2 (0s), loss: 1.284778 - acc: 0.598235 - f1_score 0.598235 -- val_f1: 0.135467 - val_precision: 0.478243 - val_recall 0.222202
EPOCH 3 (0s), loss: 1.083605 - acc: 0.655121 - f1_score 0.655121 -- val_f1: 0.209598 - val_precision: 0.536079 - val_recall 0.265324
EPOCH 4 (0s), loss: 0.903651 - acc: 0.692076 - f1_score 0.692076 -- val_f1: 0.343902 - val_precision: 0.618413 - val_recall 0.354812
EPOCH 5 (0s), loss: 0.786716 - acc: 0.719619 - f1_score 0.719619 -- val_f1: 0.455977 - val_precision: 0.658502 - val_recall 0.447480
EPOCH 6 (0s), loss: 0.695297 - acc: 0.740484 - f1_score 0.740484 -- val_f1: 0.520843 - val_precision: 0.680633 - val_recall 0.507155
EPOCH 7 (0s), loss: 0.642216 - acc: 0.761090 - f1_score 0.761090 -- val_f1: 0.571156 - val_precision: 0.693207 - val_recall 0.560766
EPOCH 8 (0s), loss: 0.601715 - acc: 0.772249 - f1_score 0.772249 -- val_f1: 0.614555 - val_precision: 0.703835 - val_recall 0.610406
EPOCH 9 (0s), loss: 0.551461 - acc: 0.794654 - f1_score 0.794654 -- val_f1: 0.647506 - val_precision: 0.717917 - val_recall 0.649116
EPOCH 10 (0s), loss: 0.514511 - acc: 0.811176 - f1_score 0.811176 -- val_f1: 0.675951 - val_precision: 0.731459 - val_recall 0.682203
EPOCH 11 (0s), loss: 0.482730 - acc: 0.821920 - f1_score 0.821920 -- val_f1: 0.700628 - val_precision: 0.744549 - val_recall 0.710898
EPOCH 12 (0s), loss: 0.452670 - acc: 0.831851 - f1_score 0.831851 -- val_f1: 0.723001 - val_precision: 0.760000 - val_recall 0.736813
EPOCH 13 (0s), loss: 0.424706 - acc: 0.841384 - f1_score 0.841384 -- val_f1: 0.743737 - val_precision: 0.775202 - val_recall 0.759222
EPOCH 14 (0s), loss: 0.397602 - acc: 0.846557 - f1_score 0.846557 -- val_f1: 0.764038 - val_precision: 0.796239 - val_recall 0.778514
EPOCH 15 (0s), loss: 0.389249 - acc: 0.850087 - f1_score 0.850087 -- val_f1: 0.784069 - val_precision: 0.817148 - val_recall 0.797311
EPOCH 16 (0s), loss: 0.374471 - acc: 0.855121 - f1_score 0.855121 -- val_f1: 0.802297 - val_precision: 0.835862 - val_recall 0.813306
EPOCH 17 (0s), loss: 0.365146 - acc: 0.857768 - f1_score 0.857768 -- val_f1: 0.815031 - val_precision: 0.851951 - val_recall 0.825311
EPOCH 18 (0s), loss: 0.359486 - acc: 0.856159 - f1_score 0.856159 -- val_f1: 0.826515 - val_precision: 0.863206 - val_recall 0.835378
EPOCH 19 (0s), loss: 0.345259 - acc: 0.863633 - f1_score 0.863633 -- val_f1: 0.836103 - val_precision: 0.869822 - val_recall 0.843434
EPOCH 20 (0s), loss: 0.333361 - acc: 0.868754 - f1_score 0.868754 -- val_f1: 0.844984 - val_precision: 0.875995 - val_recall 0.851279
EPOCH 21 (0s), loss: 0.319339 - acc: 0.873772 - f1_score 0.873772 -- val_f1: 0.851318 - val_precision: 0.881418 - val_recall 0.857208
EPOCH 22 (0s), loss: 0.324854 - acc: 0.869221 - f1_score 0.869221 -- val_f1: 0.856648 - val_precision: 0.885956 - val_recall 0.862494
EPOCH 23 (0s), loss: 0.294841 - acc: 0.882889 - f1_score 0.882889 -- val_f1: 0.860818 - val_precision: 0.888432 - val_recall 0.866201
EPOCH 24 (0s), loss: 0.306606 - acc: 0.878408 - f1_score 0.878408 -- val_f1: 0.865552 - val_precision: 0.890850 - val_recall 0.870960
EPOCH 25 (0s), loss: 0.292208 - acc: 0.882855 - f1_score 0.882855 -- val_f1: 0.869042 - val_precision: 0.893921 - val_recall 0.874151
EPOCH 26 (0s), loss: 0.287362 - acc: 0.885830 - f1_score 0.885830 -- val_f1: 0.872975 - val_precision: 0.896575 - val_recall 0.877479
EPOCH 27 (0s), loss: 0.285377 - acc: 0.885848 - f1_score 0.885848 -- val_f1: 0.875586 - val_precision: 0.897918 - val_recall 0.879595
EPOCH 28 (0s), loss: 0.278654 - acc: 0.888443 - f1_score 0.888443 -- val_f1: 0.877834 - val_precision: 0.899852 - val_recall 0.881564
EPOCH 29 (0s), loss: 0.269336 - acc: 0.891592 - f1_score 0.891592 -- val_f1: 0.879696 - val_precision: 0.900837 - val_recall 0.883649
EPOCH 30 (0s), loss: 0.270798 - acc: 0.890104 - f1_score 0.890104 -- val_f1: 0.883108 - val_precision: 0.903071 - val_recall 0.886271
EPOCH 31 (0s), loss: 0.262606 - acc: 0.894204 - f1_score 0.894204 -- val_f1: 0.886958 - val_precision: 0.904297 - val_recall 0.889483
EPOCH 32 (0s), loss: 0.260707 - acc: 0.897163 - f1_score 0.897163 -- val_f1: 0.888476 - val_precision: 0.905886 - val_recall 0.890873
EPOCH 33 (0s), loss: 0.262811 - acc: 0.895986 - f1_score 0.895986 -- val_f1: 0.889066 - val_precision: 0.906388 - val_recall 0.891810
EPOCH 34 (0s), loss: 0.240287 - acc: 0.904706 - f1_score 0.904706 -- val_f1: 0.890435 - val_precision: 0.906521 - val_recall 0.893464
EPOCH 35 (0s), loss: 0.254986 - acc: 0.897716 - f1_score 0.897716 -- val_f1: 0.893965 - val_precision: 0.908406 - val_recall 0.896486
EPOCH 36 (0s), loss: 0.244407 - acc: 0.904343 - f1_score 0.904343 -- val_f1: 0.896083 - val_precision: 0.910617 - val_recall 0.897718
EPOCH 37 (0s), loss: 0.245936 - acc: 0.902266 - f1_score 0.902266 -- val_f1: 0.898538 - val_precision: 0.912109 - val_recall 0.899824
EPOCH 38 (0s), loss: 0.234315 - acc: 0.906799 - f1_score 0.906799 -- val_f1: 0.901218 - val_precision: 0.912545 - val_recall 0.902594
EPOCH 39 (0s), loss: 0.250538 - acc: 0.900606 - f1_score 0.900606 -- val_f1: 0.904693 - val_precision: 0.914032 - val_recall 0.905637
EPOCH 40 (1s), loss: 0.227407 - acc: 0.910519 - f1_score 0.910519 -- val_f1: 0.907824 - val_precision: 0.916262 - val_recall 0.907964
EPOCH 41 (0s), loss: 0.244500 - acc: 0.903875 - f1_score 0.903875 -- val_f1: 0.908707 - val_precision: 0.916686 - val_recall 0.909291
EPOCH 42 (0s), loss: 0.228190 - acc: 0.910779 - f1_score 0.910779 -- val_f1: 0.909345 - val_precision: 0.917380 - val_recall 0.910260
EPOCH 43 (0s), loss: 0.239474 - acc: 0.903581 - f1_score 0.903581 -- val_f1: 0.911706 - val_precision: 0.918747 - val_recall 0.912218
EPOCH 44 (0s), loss: 0.220831 - acc: 0.914792 - f1_score 0.914792 -- val_f1: 0.912923 - val_precision: 0.919396 - val_recall 0.913545
EPOCH 45 (1s), loss: 0.222838 - acc: 0.913616 - f1_score 0.913616 -- val_f1: 0.913712 - val_precision: 0.919679 - val_recall 0.914609
EPOCH 46 (1s), loss: 0.222635 - acc: 0.914481 - f1_score 0.914481 -- val_f1: 0.915359 - val_precision: 0.920675 - val_recall 0.915925
EPOCH 47 (0s), loss: 0.211540 - acc: 0.915519 - f1_score 0.915519 -- val_f1: 0.916550 - val_precision: 0.921293 - val_recall 0.917020
EPOCH 48 (0s), loss: 0.214847 - acc: 0.916644 - f1_score 0.916644 -- val_f1: 0.916960 - val_precision: 0.920340 - val_recall 0.917757
EPOCH 49 (0s), loss: 0.206754 - acc: 0.918166 - f1_score 0.918166 -- val_f1: 0.917930 - val_precision: 0.920696 - val_recall 0.918495


In [11]:
faker_data = dp.Data('/content/drive/MyDrive/Capstone/Client Work/Data/Fake_data.csv')

In [12]:
faker_data.columns

Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2', 'suburb', 'postcode',
       'state', 'date_of_birth', 'soc_sec_id'],
      dtype='object')

#### Here we train the labeler on the dataset that we have created using faker

In [13]:
df1 = faker_data.data[['given_name', 'surname', 'street_number', 'address_1', 'address_2', 'suburb', 'postcode',
                'state', 'date_of_birth', 'soc_sec_id']]
df1.head()

# split data into training and test sets
split_ratio = 0.2
df1 = df1.sample(frac=1).reset_index(drop=True)
data_train1 = df1[:int((1 - split_ratio) * len(df1))]
data_test1 = df1[int((1 - split_ratio) * len(df1)):]

In [14]:
# train a new labeler with column names as labels
if not os.path.exists('data_labeler_saved'):
    os.makedirs('data_labeler_saved')

labeler_faker = dp.train_structured_labeler(
    data=data_train1,
    save_dirpath="data_labeler_saved",
    epochs=20,
    default_label="given_name"
)

EPOCH 0, batch_id 0: loss: 4.052550 - acc: 0.145919 - f1_score 0.145919
EPOCH 0 (5s), loss: 1.514228 - acc: 0.616490 - f1_score 0.616490 -- val_f1: 0.362958 - val_precision: 0.428006 - val_recall 0.468975
EPOCH 1 (1s), loss: 0.879483 - acc: 0.717529 - f1_score 0.717529 -- val_f1: 0.427533 - val_precision: 0.473789 - val_recall 0.517597
EPOCH 2 (1s), loss: 0.592684 - acc: 0.792392 - f1_score 0.792392 -- val_f1: 0.520870 - val_precision: 0.606370 - val_recall 0.592769
EPOCH 3 (1s), loss: 0.445528 - acc: 0.828333 - f1_score 0.828333 -- val_f1: 0.586639 - val_precision: 0.619746 - val_recall 0.659520
EPOCH 4 (1s), loss: 0.366071 - acc: 0.858961 - f1_score 0.858961 -- val_f1: 0.602959 - val_precision: 0.737730 - val_recall 0.677324
EPOCH 5 (1s), loss: 0.314709 - acc: 0.879588 - f1_score 0.879588 -- val_f1: 0.608164 - val_precision: 0.724328 - val_recall 0.677280
EPOCH 6 (1s), loss: 0.290893 - acc: 0.885412 - f1_score 0.885412 -- val_f1: 0.646213 - val_precision: 0.739396 - val_recall 0.705024
EPOCH 7 (2s), loss: 0.246954 - acc: 0.902569 - f1_score 0.902569 -- val_f1: 0.714147 - val_precision: 0.774276 - val_recall 0.754499
EPOCH 8 (1s), loss: 0.228402 - acc: 0.908216 - f1_score 0.908216 -- val_f1: 0.783614 - val_precision: 0.815197 - val_recall 0.806396
EPOCH 9 (1s), loss: 0.225753 - acc: 0.911569 - f1_score 0.911569 -- val_f1: 0.855133 - val_precision: 0.868996 - val_recall 0.862467
EPOCH 10 (1s), loss: 0.203902 - acc: 0.921569 - f1_score 0.921569 -- val_f1: 0.894714 - val_precision: 0.901509 - val_recall 0.895954
EPOCH 11 (1s), loss: 0.186327 - acc: 0.926333 - f1_score 0.926333 -- val_f1: 0.920950 - val_precision: 0.926487 - val_recall 0.919441
EPOCH 12 (1s), loss: 0.185913 - acc: 0.927529 - f1_score 0.927529 -- val_f1: 0.931880 - val_precision: 0.939042 - val_recall 0.929411
EPOCH 13 (1s), loss: 0.169857 - acc: 0.936333 - f1_score 0.936333 -- val_f1: 0.937111 - val_precision: 0.944281 - val_recall 0.935020
EPOCH 14 (1s), loss: 0.161954 - acc: 0.937863 - f1_score 0.937863 -- val_f1: 0.941098 - val_precision: 0.949220 - val_recall 0.939021
EPOCH 15 (1s), loss: 0.164025 - acc: 0.937549 - f1_score 0.937549 -- val_f1: 0.944055 - val_precision: 0.952441 - val_recall 0.942040
EPOCH 16 (1s), loss: 0.150754 - acc: 0.944314 - f1_score 0.944314 -- val_f1: 0.946752 - val_precision: 0.954442 - val_recall 0.944896
EPOCH 17 (1s), loss: 0.147885 - acc: 0.944804 - f1_score 0.944804 -- val_f1: 0.949420 - val_precision: 0.955999 - val_recall 0.948315
EPOCH 18 (2s), loss: 0.185889 - acc: 0.928039 - f1_score 0.928039 -- val_f1: 0.955145 - val_precision: 0.958481 - val_recall 0.954595
EPOCH 19 (2s), loss: 0.144778 - acc: 0.944255 - f1_score 0.944255 -- val_f1: 0.955584 - val_precision: 0.957835 - val_recall 0.955872
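Both training cells above pass `save_dirpath="data_labeler_saved"`, so the files saved for the second labeler likely replace those of the first on disk (the in-memory `labeler` and `labeler_faker` objects are unaffected). Below is a minimal sketch that gives each labeler its own subdirectory. `labeler_save_dir` is a hypothetical helper, and the demonstration uses a temporary root so that it does not touch the real directory.

```python
import os
import tempfile


def labeler_save_dir(name, root="data_labeler_saved"):
    """Create (if needed) and return a per-labeler directory under root."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


# Demonstration under a temporary root instead of the notebook's directory.
demo_root = tempfile.mkdtemp()
record_linkage_dir = labeler_save_dir("record_linkage", root=demo_root)
faker_dir = labeler_save_dir("faker", root=demo_root)
print(record_linkage_dir)
print(faker_dir)
```

The returned path could then be passed as `save_dirpath` when training each labeler.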


#### Now we will read in the second record linkage dataset and rename all of its columns

In [15]:
data1 = dp.Data('/content/drive/MyDrive/Capstone/Client Work/Data/recordlinkage2.csv')
# data1= dp.Data('/content/fake.csv')
# replace the descriptive column names with generic ones
data1.rename(columns={'given_name': 'col_10', 'surname': 'col_9', 'street_number': 'col_8',
                      'address_1': 'col_7', 'address_2': 'col_6', 'suburb': 'col_5',
                      'postcode': 'col_4', 'state': 'col_3', 'date_of_birth': 'col_2',
                      'soc_sec_id': 'col_1'}, inplace=True)
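The rename mapping above counts down from `col_10` to `col_1` in column order. As a small sketch, the same mapping can be generated from the column list instead of being typed by hand. `generic_column_names` is a hypothetical helper name, and the column list is copied from the `data.columns` output earlier.

```python
def generic_column_names(columns, prefix="col_"):
    """Map each column name to prefix + a number counting down to 1."""
    n = len(columns)
    return {name: f"{prefix}{n - i}" for i, name in enumerate(columns)}


columns = ['given_name', 'surname', 'street_number', 'address_1', 'address_2',
           'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id']

rename_mapping = generic_column_names(columns)
print(rename_mapping)
```

Inverting this dictionary gives the expected label for each generic column, which is useful when checking the labeler's predictions.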

In [16]:
data1.head()

| | col_10 | col_9 | col_8 | col_7 | col_6 | col_5 | col_4 | col_3 | col_2 | col_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | elton | | 3.0 | light setreet | pinehill | windermere | 3212 | vic | 19651013 | 1551941 |
| 1 | mitchell | maxon | 47.0 | edkins street | lochaoair | north ryde | 3355 | nsw | 19390212 | 8859999 |
| 2 | | white | 72.0 | lambrigg street | kelgoola | broadbeach waters | 3159 | vic | 19620216 | 9731855 |
| 3 | elk i | menzies | 1.0 | lyster place | | northwood | 2585 | vic | 19980624 | 4970481 |
| 4 | | garanggar | | may maxwell crescent | springettst arcade | forest hill | 2342 | vic | 19921016 | 1366884 |


#### Using the labeler trained on the record linkage dataset to get predictions for the other record linkage dataset

In [17]:
# predict with the labeler object
profile_options.set({'structured_options.data_labeler.data_labeler_object': labeler})
profile = dp.Profiler(data1, options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████| 10/10 [00:00<00:00, 139.06it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 10/10 [00:00<00:00, 17.95it/s]


   Column     Prediction                                             Sample
0  col_10     given_name          [elijah, noah, cain, samantha, zachariah]
1   col_9        surname          [beatton, ryah, goode, tiahnee, rawlings]
2   col_8  street_number                              [16, 246, 17, 27, 15]
3   col_7      address_1  [gurney kplace, throsselel srreet, marsden str...
4   col_6      address_2  [summer hill, coolsdie, withern house, sundown...
5   col_5         suburb  [macquarie park, glenorf, cantexbury, arana ih...
6   col_4       postcode                     [6149, 3910, 2902, 2090, 2213]
7   col_3          state                          [vic, nsw, nsw, nsw, qld]
8   col_2  date_of_birth  [19391117, 19330313, 19760516, 19890119, 19531...
9   col_1     soc_sec_id      [7533159, 3981694, 3359595, 3938828, 9865461]


#### Using the labeler trained on the dataset generated via faker to get predictions for the other record linkage dataset

In [18]:
# predict with the labeler object
profile_options.set({'structured_options.data_labeler.data_labeler_object': labeler_faker})
profile = dp.Profiler(data1, options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████| 10/10 [00:00<00:00, 128.55it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 10/10 [00:00<00:00, 15.23it/s]


   Column                  Prediction                                             Sample
0  col_10        address_1|given_name                [jesscia, elle, evan, holl, callum]
1   col_9                   address_1         [theodore, walch, ryan, lackher, desantis]
2   col_8               street_number                                [46, 8, 16, 11, 31]
3   col_7                   address_2  [ordell street, holmes c rescent, solandef pla...
4   col_6  suburb|address_2|address_1  [mylandra, kassingbrook, rye park, kiaora, cli...
5   col_5            suburb|address_1  [bittern, magnetic island, thornbury, st peter...
6   col_4               street_number                     [4506, 4350, 2068, 3844, 4306]
7   col_3                   address_1                           [sa, tas, nsw, nsw, vic]
8   col_2               date_of_birth  [19891212, 19280129, 19790719, 19330911, 19941...
9   col_1                  soc_sec_id      [1783041, 9472794, 8822107, 5824638, 7757358]
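Because the renamed columns have known original names, the report above can be scored against them. The sketch below does so, assuming that a prediction such as `suburb|address_1` lists several candidate labels separated by `|` (an interpretation of the output, not something the notebook states). The `predicted` values are copied from the report above; `score_predictions` is a hypothetical helper name.

```python
# Expected label per renamed column, taken from the rename mapping used above.
expected = {
    'col_10': 'given_name', 'col_9': 'surname', 'col_8': 'street_number',
    'col_7': 'address_1', 'col_6': 'address_2', 'col_5': 'suburb',
    'col_4': 'postcode', 'col_3': 'state', 'col_2': 'date_of_birth',
    'col_1': 'soc_sec_id',
}

# Predictions copied from the faker-trained labeler's report above.
predicted = {
    'col_10': 'address_1|given_name', 'col_9': 'address_1',
    'col_8': 'street_number', 'col_7': 'address_2',
    'col_6': 'suburb|address_2|address_1', 'col_5': 'suburb|address_1',
    'col_4': 'street_number', 'col_3': 'address_1',
    'col_2': 'date_of_birth', 'col_1': 'soc_sec_id',
}


def score_predictions(predicted, expected):
    """Count exact matches and matches among '|'-separated candidates."""
    exact = sum(predicted[col] == label for col, label in expected.items())
    partial = sum(label in predicted[col].split('|')
                  for col, label in expected.items())
    return exact, partial


exact, partial = score_predictions(predicted, expected)
print(f"exact matches: {exact}/{len(expected)}")
print(f"expected label among candidates: {partial}/{len(expected)}")
```

For this report the sketch counts 3 exact matches and 6 columns whose candidates include the expected label, whereas the labeler trained on the record linkage data matched all 10 columns in the previous report.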


#### Using the labeler trained on the record linkage dataset to get predictions on the dataset generated via faker

In [19]:
# predict with the labeler object
profile_options.set({'structured_options.data_labeler.data_labeler_object': labeler})
profile = dp.Profiler(faker_data, options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████| 10/10 [00:00<00:00, 110.45it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 10/10 [00:00<00:00, 16.67it/s]


          Column                Prediction                                             Sample
0     given_name                given_name        [Mark, Melanie, Virginia, Samantha, Nicole]
1        surname                given_name         [Cobb, Calderon, Graham, Contreras, Young]
2  street_number                  postcode                      [240, 29041, 691, 67707, 287]
3      address_1        surname|given_name               [Islands, Oval, Course, Hill, Grove]
4      address_2                 address_2  [Ellis Skyway, Darren Station, Tammy Gardens, ...
5         suburb                 address_2  [Hillview, Brownbury, New Beth, Melissaburgh, ...
6       postcode                  postcode                [77029, 15743, 30853, 60817, 70607]
7          state                 address_2  [New Mexico, New York, New Mexico, Oklahoma, A...
8  date_of_birth             date_of_birth  [19880210, 20141223, 20061028, 19490327, 19440...
9     soc_sec_id  date_of_birth|soc_sec_id  [696239898, 7462

### Creating Function to Train Labeler <a class="anchor" id="fourth-bullet"></a>

In [22]:
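def create_labeler(dataframe, n):
    """Train a structured labeler for n epochs with the column names as labels."""
    # dataframe is a dp.Data object; select its columns as a pandas DataFrame
    columns = list(dataframe.columns)
    df = dataframe.data[columns]
    labeler = dp.train_structured_labeler(
        data=df,
        save_dirpath="data_labeler_saved",
        epochs=n,
        # the first column name serves as the default label
        default_label=columns[0]
    )
    return labeler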

In [23]:
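def check_data(dataframe, labeler):
    """Profile the data with the given labeler and return the label predicted per column."""
    # reuses the global profile_options defined earlier, swapping in the given labeler
    profile_options.set({'structured_options.data_labeler.data_labeler_object': labeler})
    profile = dp.Profiler(dataframe, options=profile_options)
    # get the prediction from the data profiler
    results = profile.report()
    return get_structured_results(results)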



In [24]:
record_linkage_labeler = create_labeler(data, 50)

EPOCH 0, batch_id 1: loss: 2.962923 - acc: 0.303244 - f1_score 0.303244
EPOCH 0 (3s), loss: 2.212177 - acc: 0.466086 - f1_score 0.466086 -- val_f1: 0.351539 - val_precision: 0.328745 - val_recall 0.504940
EPOCH 1 (0s), loss: 1.464921 - acc: 0.597353 - f1_score 0.597353 -- val_f1: 0.334715 - val_precision: 0.308416 - val_recall 0.485869
EPOCH 2 (0s), loss: 1.175571 - acc: 0.643122 - f1_score 0.643122 -- val_f1: 0.363083 - val_precision: 0.484402 - val_recall 0.508816
EPOCH 3 (0s), loss: 0.967250 - acc: 0.688145 - f1_score 0.688145 -- val_f1: 0.391882 - val_precision: 0.482738 - val_recall 0.522690
EPOCH 4 (0s), loss: 0.829537 - acc: 0.716629 - f1_score 0.716629 -- val_f1: 0.464572 - val_precision: 0.558491 - val_recall 0.570779
EPOCH 5 (1s), loss: 0.757691 - acc: 0.726267 - f1_score 0.726267 -- val_f1: 0.520749 - val_precision: 0.610842 - val_recall 0.618085
EPOCH 6 (1s), loss: 0.647964 - acc: 0.754344 - f1_score 0.754344 -- val_f1: 0.553994 - val_precision: 0.632546 - val_recall 0.637912
EPOCH 7 (1s), loss: 0.580274 - acc: 0.770204 - f1_score 0.770204 -- val_f1: 0.588653 - val_precision: 0.656476 - val_recall 0.659724
EPOCH 8 (1s), loss: 0.540134 - acc: 0.782851 - f1_score 0.782851 -- val_f1: 0.616498 - val_precision: 0.674366 - val_recall 0.678290
EPOCH 9 (1s), loss: 0.478276 - acc: 0.808462 - f1_score 0.808462 -- val_f1: 0.643970 - val_precision: 0.691218 - val_recall 0.696679
EPOCH 10 (1s), loss: 0.450705 - acc: 0.824910 - f1_score 0.824910 -- val_f1: 0.675739 - val_precision: 0.709963 - val_recall 0.718735
EPOCH 11 (1s), loss: 0.403029 - acc: 0.847466 - f1_score 0.847466 -- val_f1: 0.703207 - val_precision: 0.731126 - val_recall 0.738520
EPOCH 12 (0s), loss: 0.382682 - acc: 0.857308 - f1_score 0.857308 -- val_f1: 0.730594 - val_precision: 0.752559 - val_recall 0.759407
EPOCH 13 (0s), loss: 0.345738 - acc: 0.872557 - f1_score 0.872557 -- val_f1: 0.755880 - val_precision: 0.777328 - val_recall 0.778183
EPOCH 14 (0s), loss: 0.347157 - acc: 0.871222 - f1_score 0.871222 -- val_f1: 0.789251 - val_precision: 0.803208 - val_recall 0.803981
EPOCH 15 (1s), loss: 0.322319 - acc: 0.879457 - f1_score 0.879457 -- val_f1: 0.813836 - val_precision: 0.821888 - val_recall 0.824489
EPOCH 16 (0s), loss: 0.309425 - acc: 0.883077 - f1_score 0.883077 -- val_f1: 0.832078 - val_precision: 0.836835 - val_recall 0.840339
EPOCH 17 (0s), loss: 0.302805 - acc: 0.885249 - f1_score 0.885249 -- val_f1: 0.851950 - val_precision: 0.853994 - val_recall 0.858258
EPOCH 18 (0s), loss: 0.292590 - acc: 0.891109 - f1_score 0.891109 -- val_f1: 0.863564 - val_precision: 0.865024 - val_recall 0.868928
EPOCH 19 (1s), loss: 0.283239 - acc: 0.891561 - f1_score 0.891561 -- val_f1: 0.874508 - val_precision: 0.877152 - val_recall 0.878598
EPOCH 20 (1s), loss: 0.273744 - acc: 0.893597 - f1_score 0.893597 -- val_f1: 0.880612 - val_precision: 0.883937 - val_recall 0.883828
EPOCH 21 (0s), loss: 0.270863 - acc: 0.896199 - f1_score 0.896199 -- val_f1: 0.886946 - val_precision: 0.890696 - val_recall 0.888881
EPOCH 22 (0s), loss: 0.251031 - acc: 0.903665 - f1_score 0.903665 -- val_f1: 0.892481 - val_precision: 0.897182 - val_recall 0.893826
EPOCH 23 (0s), loss: 0.255062 - acc: 0.902579 - f1_score 0.902579 -- val_f1: 0.895171 - val_precision: 0.900549 - val_recall 0.896987
EPOCH 24 (1s), loss: 0.262083 - acc: 0.896991 - f1_score 0.896991 -- val_f1: 0.898579 - val_precision: 0.904910 - val_recall 0.899863
EPOCH 25 (0s), loss: 0.243183 - acc: 0.907489 - f1_score 0.907489 -- val_f1: 0.902181 - val_precision: 0.908914 - val_recall 0.903050
EPOCH 26 (0s), loss: 0.241802 - acc: 0.907036 - f1_score 0.907036 -- val_f1: 0.905185 - val_precision: 0.911873 - val_recall 0.905799
EPOCH 27 (0s), loss: 0.218844 - acc: 0.917217 - f1_score 0.917217 -- val_f1: 0.907880 - val_precision: 0.914691 - val_recall 0.908280
EPOCH 28 (0s), loss: 0.224175 - acc: 0.917081 - f1_score 0.917081 -- val_f1: 0.910142 - val_precision: 0.917708 - val_recall 0.910315
EPOCH 29 (0s), loss: 0.220028 - acc: 0.914231 - f1_score 0.914231 -- val_f1: 0.912235 - val_precision: 0.919519 - val_recall 0.912257
EPOCH 30 (0s), loss: 0.226930 - acc: 0.915136 - f1_score 0.915136 -- val_f1: 0.914701 - val_precision: 0.921910 - val_recall 0.914401
EPOCH 31 (0s), loss: 0.209653 - acc: 0.920656 - f1_score 0.920656 -- val_f1: 0.915159 - val_precision: 0.923709 - val_recall 0.915309
EPOCH 32 (0s), loss: 0.210397 - acc: 0.919118 - f1_score 0.919118 -- val_f1: 0.917483 - val_precision: 0.923815 - val_recall 0.917227
EPOCH 33 (0s), loss: 0.205177 - acc: 0.921357 - f1_score 0.921357 -- val_f1: 0.917964 - val_precision: 0.926394 - val_recall 0.917151
EPOCH 34 (0s), loss: 0.206895 - acc: 0.919321 - f1_score 0.919321 -- val_f1: 0.918726 - val_precision: 0.926792 - val_recall 0.918656
EPOCH 35 (0s), loss: 0.205701 - acc: 0.922240 - f1_score 0.922240 -- val_f1: 0.920973 - val_precision: 0.928016 - val_recall 0.920582
EPOCH 36 (0s), loss: 0.188764 - acc: 0.929367 - f1_score 0.929367 -- val_f1: 0.922548 - val_precision: 0.930298 - val_recall 0.922263
EPOCH 37 (0s), loss: 0.192261 - acc: 0.929072 - f1_score 0.929072 -- val_f1: 0.923440 - val_precision: 0.930546 - val_recall 0.923297
EPOCH 38 (0s), loss: 0.196773 - acc: 0.925181 - f1_score 0.925181 -- val_f1: 0.924057 - val_precision: 0.931623 - val_recall 0.923735
EPOCH 39 (0s), loss: 0.200796 - acc: 0.925588 - f1_score 0.925588 -- val_f1: 0.925206 - val_precision: 0.933090 - val_recall 0.925088
EPOCH 40 (0s), loss: 0.180192 - acc: 0.931810 - f1_score 0.931810 -- val_f1: 0.927145 - val_precision: 0.933807 - val_recall 0.927241
EPOCH 41 (0s), loss: 0.185161 - acc: 0.929457 - f1_score 0.929457 -- val_f1: 0.928041 - val_precision: 0.934743 - val_recall 0.927611
EPOCH 42 (0s), loss: 0.174982 - acc: 0.936154 - f1_score 0.936154 -- val_f1: 0.927751 - val_precision: 0.936151 - val_recall 0.927762
EPOCH 43 (0s), loss: 0.170481 - acc: 0.936063 - f1_score 0.936063 -- val_f1: 0.929682 - val_precision: 0.936850 - val_recall 0.929402
EPOCH 44 (0s), loss: 0.166645 - acc: 0.936742 - f1_score 0.936742 -- val_f1: 0.929949 - val_precision: 0.937787 - val_recall 0.929822


  m.reset_state()


EPOCH 45 (0s), loss: 0.164478 - acc: 0.939683 - f1_score 0.939683 -- val_f1: 0.930572 - val_precision: 0.938553 - val_recall 0.930520


  m.reset_state()


EPOCH 46 (0s), loss: 0.167944 - acc: 0.936561 - f1_score 0.936561 -- val_f1: 0.932050 - val_precision: 0.938393 - val_recall 0.931815


  m.reset_state()


EPOCH 47 (0s), loss: 0.149212 - acc: 0.945045 - f1_score 0.945045 -- val_f1: 0.931765 - val_precision: 0.938969 - val_recall 0.931815


  m.reset_state()


EPOCH 48 (0s), loss: 0.159448 - acc: 0.942149 - f1_score 0.942149 -- val_f1: 0.931009 - val_precision: 0.940320 - val_recall 0.930857


  m.reset_state()


EPOCH 49 (0s), loss: 0.161740 - acc: 0.939118 - f1_score 0.939118 -- val_f1: 0.931166 - val_precision: 0.939821 - val_recall 0.931328
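
The epoch lines above are easier to compare once they are parsed into numbers. The following is a minimal sketch, not part of the original notebook: it assumes the log text has been copied into a string (here only the last five epochs), and the helper name `parse_epoch_line` is hypothetical. The regular expression allows for the missing colon after `f1_score` and `val_recall` in the printed output.

```python
import re

# Last five epoch lines copied from the training output above.
log = """\
EPOCH 45 (0s), loss: 0.164478 - acc: 0.939683 - f1_score 0.939683 -- val_f1: 0.930572 - val_precision: 0.938553 - val_recall 0.930520
EPOCH 46 (0s), loss: 0.167944 - acc: 0.936561 - f1_score 0.936561 -- val_f1: 0.932050 - val_precision: 0.938393 - val_recall 0.931815
EPOCH 47 (0s), loss: 0.149212 - acc: 0.945045 - f1_score 0.945045 -- val_f1: 0.931765 - val_precision: 0.938969 - val_recall 0.931815
EPOCH 48 (0s), loss: 0.159448 - acc: 0.942149 - f1_score 0.942149 -- val_f1: 0.931009 - val_precision: 0.940320 - val_recall 0.930857
EPOCH 49 (0s), loss: 0.161740 - acc: 0.939118 - f1_score 0.939118 -- val_f1: 0.931166 - val_precision: 0.939821 - val_recall 0.931328
"""

EPOCH = re.compile(r"^EPOCH (\d+)")
# A metric name, an optional colon, one space, then the value.
METRIC = re.compile(
    r"\b(loss|acc|f1_score|val_f1|val_precision|val_recall):? ([0-9.]+)"
)


def parse_epoch_line(line):
    """Hypothetical helper: turn one 'EPOCH n ...' line into a dict, else None."""
    match = EPOCH.match(line)
    if match is None:
        return None
    row = {"epoch": int(match.group(1))}
    for name, value in METRIC.findall(line):
        row[name] = float(value)
    return row


rows = [r for r in map(parse_epoch_line, log.splitlines()) if r is not None]

# Epoch with the highest validation F1 among the parsed lines.
best = max(rows, key=lambda r: r["val_f1"])
print(best["epoch"], best["val_f1"])
```

Among epochs 19 to 49 shown above, `val_f1` is highest at epoch 46 and moves by less than 0.002 afterwards, which suggests the labeler has largely converged by the end of this run.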


In [25]:
check_data(data1, record_linkage_labeler)

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████| 10/10 [00:00<00:00, 157.71it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 10/10 [00:00<00:00, 17.99it/s]


Unnamed: 0,Column,Prediction,Sample
0,col_10,given_name,"[renee, gabrielle, julinaa, jakc, jayed]"
1,col_9,surname,"[dunnicjliff, petersen, lowe, lamprey, reinhard]"
2,col_8,street_number,"[42, 108, 23, 8, 136]"
3,col_7,address_1,"[tanj il loop, britten-jones drive, elkedru cl..."
4,col_6,address_2,"[strasus, rockview, top end, tullatoola, roseh..."
5,col_5,suburb,"[aspley, east mitland, airds, upwey, taylors b..."
6,col_4,postcode,"[7009, 4621, 2047, 3058, 6172]"
7,col_3,state,"[nsw, nsw, nsw, nsw, nsw]"
8,col_2,date_of_birth,"[19820129, 19449015, 19151217, 19210202, 19880..."
9,col_1,soc_sec_id,"[7886527, 1148897, 5595371, 1847058, 1378454]"


In [26]:
check_data(faker_data, record_linkage_labeler)

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns... 


100%|██████████| 10/10 [00:00<00:00, 140.88it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 10/10 [00:00<00:00, 18.41it/s]


Unnamed: 0,Column,Prediction,Sample
0,given_name,given_name,"[Christopher, Rebecca, Erica, Perry, Kimberly]"
1,surname,surname|given_name,"[Wiley, Chen, Wilson, Stewart, Bright]"
2,street_number,postcode,"[92200, 59, 777, 422, 514]"
3,address_1,surname,"[Trafficway, Point, Tunnel, Squares, Circles]"
4,address_2,suburb,"[Hughes Shores, James Mills, Franklin Brook, A..."
5,suburb,suburb,"[Port Erinfurt, East Matthewhaven, New Jamesto..."
6,postcode,postcode,"[25227, 46709, 96057, 98179, 85409]"
7,state,could not determine,"[Mississippi, Alabama, South Dakota, Nevada, O..."
8,date_of_birth,date_of_birth,"[19880213, 19260318, 19690206, 19440905, 19990..."
9,soc_sec_id,soc_sec_id|date_of_birth,"[877701303, 9260063, 65021838, 899010129, 8394..."
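
Because the Faker columns keep their real names, the `Column` and `Prediction` fields above can be compared directly. The sketch below is not from the original notebook: the pairs are copied by hand from the table, the function name `score` is hypothetical, and it assumes that `|` in a prediction separates tied candidate labels.

```python
# (true column name, predicted label) pairs copied from the output above.
results = [
    ("given_name", "given_name"),
    ("surname", "surname|given_name"),
    ("street_number", "postcode"),
    ("address_1", "surname"),
    ("address_2", "suburb"),
    ("suburb", "suburb"),
    ("postcode", "postcode"),
    ("state", "could not determine"),
    ("date_of_birth", "date_of_birth"),
    ("soc_sec_id", "soc_sec_id|date_of_birth"),
]


def score(pairs):
    """Hypothetical helper: count exact matches and matches among tied labels.

    Assumes '|' separates tied candidate labels in a prediction.
    """
    exact = sum(pred == true for true, pred in pairs)
    among_ties = sum(true in pred.split("|") for true, pred in pairs)
    return exact, among_ties


exact, among_ties = score(results)
print(f"exact: {exact}/{len(results)}, including ties: {among_ties}/{len(results)}")
```

On these ten columns the labeler is exactly right for 4 and lists the true label among tied candidates for 6. The misses (`street_number`, `address_1`, `address_2`, `state`) are columns whose Faker values look different from the record linkage training data, for example full US state names instead of abbreviations such as `nsw`.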
