# Match Zeta Identities
This notebook attempts to link records across different datasets to assess the current state of identity resolution. Thoughts: 20%

## Import and Initialize

In [39]:
import networkx as nx
import numpy as np
import pandas as pd

## Load Data
Load a 10,000-row sample of each dataset.

In [14]:
dsp = pd.read_csv("data/dsp_cookies_export_20210625_10k.csv")
print(f"DSP data shape: {dsp.shape}")

sizmek = pd.read_csv("data/sizmek_bidstream_raw_20210625_10k.csv")
print(f"Sizmek data shape: {sizmek.shape}")

# Fix column headers
dsp.columns = [i.split(".")[1] for i in dsp.columns]
sizmek.columns = [i.split(".")[1] for i in sizmek.columns]

DSP data shape: (10000, 3)
Sizmek data shape: (10000, 16)
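
The header fix above takes the second piece of each dotted name, which raises an `IndexError` if a column has no dot. A slightly more forgiving variant is sketched below. It assumes headers look like `table.column`; the frame and column names here are made up for illustration.

```python
import pandas as pd

# Toy frame with hypothetical prefixed headers; one column has no prefix.
toy = pd.DataFrame(columns=["dsp.user_id", "dsp.dt", "zeta_user_id"])

# Keep only the part after the last dot; names without a dot pass through unchanged.
toy = toy.rename(columns=lambda name: name.rsplit(".", 1)[-1])

print(list(toy.columns))
```

Using `rsplit` with a limit of one split also keeps names intact when a prefix itself contains a dot.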


Any of the three columns shared by both datasets can be used to create graph connections.

In [16]:
print("Common columns:", set(dsp.columns).intersection(set(sizmek.columns)))

Common columns: {'zeta_user_id', 'user_id', 'dt'}
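
As a rough illustration of what "graph connections" could mean here, the sketch below links `user_id` values to `zeta_user_id` values and reads each connected component as one candidate identity. The toy rows are invented, and treating a shared id as proof of the same person is an assumption, not something this notebook has verified.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for rows that carry both identifiers (values are made up).
toy = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u3"],
        "zeta_user_id": ["z1", "z2", "z2", None],
    }
)

graph = nx.Graph()
for row in toy.dropna(subset=["zeta_user_id"]).itertuples(index=False):
    # Prefix each node with its column name so equal strings from
    # different id spaces do not collapse into a single node.
    graph.add_edge(("user_id", row.user_id), ("zeta_user_id", row.zeta_user_id))

# Each connected component is one candidate resolved identity.
identities = [sorted(component) for component in nx.connected_components(graph)]
print(identities)
```

Rows with a missing `zeta_user_id` are dropped before adding edges, so `u3` never enters the graph in this toy example.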


## Match Data
A direct merge on `zeta_user_id` returns no matches between the datasets.

In [21]:
dsp[["zeta_user_id"]].merge(sizmek[["zeta_user_id"]], on="zeta_user_id")

Empty DataFrame
Columns: [zeta_user_id]
Index: []


This is because the DSP dataset stores the ids as a list-formatted string in each row.

In [23]:
dsp["zeta_user_id"].head(3)

0    ["4bbbcada-3976-44c8-8962-e5a99d2eda97:1591197...
1    ["9ad67b74-2e30-423d-8532-9e6a56ff6176:1610871...
2    ["b8f83981-fb82-429a-bf53-74c17001d341:1599091...
Name: zeta_user_id, dtype: object

Sizmek, by contrast, stores a single id per row, with some NaNs.

In [31]:
sizmek["zeta_user_id"].sample(3)

6724    fa3e335a-6b70-45b0-a9fb-ac71d098a924:161377748...
5743                                                  NaN
7587    fa3e335a-6b70-45b0-a9fb-ac71d098a924:161377748...
Name: zeta_user_id, dtype: object
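
One way to make the two formats comparable is to decode the DSP strings into lists and give every id its own row before merging. The sketch below uses toy frames with invented ids, and it assumes the DSP strings are valid JSON, which the truncated output above suggests but does not confirm.

```python
import json

import pandas as pd

# Toy frames mimicking the two formats shown above (ids are made up).
dsp_toy = pd.DataFrame({"zeta_user_id": ['["a:1", "b:2"]', '["c:3"]']})
sizmek_toy = pd.DataFrame({"zeta_user_id": ["b:2", None, "d:4"]})

# Decode each JSON-style string into a Python list, then put every id on its own row.
dsp_long = (
    dsp_toy.assign(zeta_user_id=dsp_toy["zeta_user_id"].map(json.loads))
    .explode("zeta_user_id")
    .reset_index(drop=True)
)

# Drop missing ids first so a NaN can never pair with another NaN in the join.
matches = dsp_long.merge(sizmek_toy.dropna(subset=["zeta_user_id"]), on="zeta_user_id")
print(matches)
```

If some real rows turn out not to be valid JSON, `json.loads` will raise, which is a useful signal that the format assumption needs revisiting.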

In [44]:
# Note: values read from CSV are strings, so len() counts characters, not ids
print(f"Avg length of DSP user ids: {np.mean([len(i) for i in dsp['zeta_user_id']])}")

Avg length of DSP user ids: 54.1386
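
Because the column comes from a CSV, its values are most likely strings, in which case `len` measures characters rather than ids. The sketch below contrasts the two measurements on invented values. It again assumes the strings are valid JSON.

```python
import json

import pandas as pd

# Toy values in the same list-as-string format (ids are made up).
raw = pd.Series(['["a:1", "b:2"]', '["c:3"]'], name="zeta_user_id")

# Number of characters in each raw string.
char_lengths = raw.map(len)

# Number of ids in each row after decoding the string into a list.
ids_per_row = raw.map(json.loads).map(len)

print(pd.DataFrame({"chars": char_lengths, "ids": ids_per_row}))
```

Running the same two lines on the real `dsp["zeta_user_id"]` column would show whether the average above reflects string length or id count.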


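At first glance this looks like around 54 ids in each row, but the column was read from CSV, so each value is a string and `len` counts characters. A 36-character UUID, a colon, a timestamp of roughly 13 digits, and the enclosing `["` and `"]` add up to about 54 characters, which suggests that most rows hold a single id.
Next, I would like to see the proportion of NaNs in the Sizmek dataset.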

In [49]:
print(f"Percent of NaNs: {sizmek['zeta_user_id'].isna().sum() / len(sizmek):.2%}")

Percent of NaNs: 29.68%
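
The same share can be computed with `isna().mean()`, and it may be worth checking how many distinct ids remain once the NaNs are set aside, since the sample above shows the same id on two rows. A minimal sketch on invented values:

```python
import pandas as pd

# Toy column with one missing value and one repeated id (values are made up).
ids = pd.Series(["a:1", None, "a:1", "b:2"], name="zeta_user_id")

# The mean of a boolean mask is the fraction of True values.
nan_share = ids.isna().mean()

# nunique() ignores NaN by default, so this counts distinct known ids.
unique_known = ids.nunique()

print(f"Percent of NaNs: {nan_share:.2%}")
print(f"Distinct non-null ids: {unique_known}")
```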
