# Introducing DataTracer
In this short tutorial, we will demonstrate some of the core functionality of the **DataTracer** library.

In [1]:
import pandas as pd
from datatracer import DataTracer
from datatracer import get_demo_data, load_dataset

## Dataset
Suppose you have a dataset that you want to understand. For this tutorial, we'll download and load the `posts` dataset.

In [2]:
get_demo_data(force=True)
metadata, tables = load_dataset('datatracer_demo/posts')

Generating a demo folder at `./datatracer_demo`


This dataset consists of two tables, `users` and `posts`, which look something like this:

In [3]:
tables["users"].head()

| | id | age | birthyear | height | nb_posts |
|---|---|---|---|---|---|
| 0 | 0 | 42 | 1978 | 154 | 8 |
| 1 | 1 | 90 | 1930 | 188 | 6 |
| 2 | 2 | 47 | 1973 | 190 | 1 |
| 3 | 3 | 83 | 1937 | 161 | 3 |
| 4 | 4 | 88 | 1932 | 164 | 5 |


In [4]:
tables["posts"].head()

| | id | uid | text |
|---|---|---|---|
| 0 | 0 | 0 | ca057d76-a744-11ea-a275-149d997bb0cb |
| 1 | 1 | 0 | ca057f7e-a744-11ea-a275-149d997bb0cb |
| 2 | 2 | 0 | ca057fba-a744-11ea-a275-149d997bb0cb |
| 3 | 3 | 0 | ca057fe2-a744-11ea-a275-149d997bb0cb |
| 4 | 4 | 0 | ca05800a-a744-11ea-a275-149d997bb0cb |


Just by looking at the first few rows of these two tables, we can guess at some of the relationships that may be present in the data. For example, we would expect a strong relationship between `age` and `birthyear`. Furthermore, the `posts` table is probably related to the `users` table through a foreign key, most likely the `uid` (i.e. user id) column.

All of these relationships are pure supposition based on our own intuition at this point; however, we can discover and verify them using the tools provided by the **DataTracer** library.

## DataTracer
The **DataTracer** library is powered by the [MLBlocks](https://hdi-project.github.io/MLBlocks/index.html) framework. Various data lineage tracing tools, from primary key discovery to column mapping discovery, are implemented as primitives that can be chained together to form pipelines. In this section, we apply a few of these pre-defined pipelines to the above dataset.

### Primary Key Tracer
Suppose you want to identify the primary key. This can be accomplished using the primary key discovery pipeline.

In [5]:
solver = DataTracer.load('datatracer.primary_key.basic')
solver.solve(tables)

{'users': ['id'], 'posts': ['id', 'uid']}

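This indicates that `id` is reported as a primary key column for both tables. The entry for `posts` also lists `uid`; since `uid` repeats across the rows previewed above, it cannot identify a post on its own, so `id` remains the natural primary key there.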
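To build some intuition for this result, here is a minimal sketch of the most basic check a primary key heuristic could start from: whether a column's values are all distinct and non-null. This is not DataTracer's implementation; the helper `unique_columns` and the toy rows (copied from the previews above) are assumptions made for illustration.

```python
import pandas as pd

# Toy rows copied from the table previews above (not the full dataset).
users = pd.DataFrame({"id": [0, 1, 2, 3, 4], "age": [42, 90, 47, 83, 88]})
posts = pd.DataFrame({"id": [0, 1, 2, 3, 4], "uid": [0, 0, 0, 0, 0]})


def unique_columns(df):
    """Hypothetical helper: columns whose values are distinct and non-null."""
    return [
        col for col in df.columns
        if df[col].is_unique and df[col].notna().all()
    ]


users_candidates = unique_columns(users)
posts_candidates = unique_columns(posts)
print(users_candidates)
print(posts_candidates)
```

On these five rows `age` also happens to be unique, which shows why uniqueness alone is not enough to pick a primary key and why a dedicated pipeline is useful.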

### Foreign Key Tracer
Now, suppose you want to understand the relationship between the two tables. This corresponds to the foreign key discovery pipeline.

In [6]:
solver = DataTracer.load('datatracer.foreign_key.standard')
solver.solve(tables)

[{'table': 'posts', 'field': 'uid', 'ref_table': 'users', 'ref_field': 'id'}]

The pipeline returns a single foreign key, which indicates that `posts.uid` references `users.id`, confirming our earlier hypothesis.
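A foreign key implies that every value in the referencing column also appears in the referenced column. The sketch below checks that containment property by hand with pandas. It is only an illustration of the idea, not how the DataTracer pipeline works internally; the helper `references` and the toy rows are assumptions.

```python
import pandas as pd

# Toy rows copied from the table previews above (not the full dataset).
users = pd.DataFrame({"id": [0, 1, 2, 3, 4]})
posts = pd.DataFrame({"id": [0, 1, 2, 3, 4], "uid": [0, 0, 0, 0, 0]})


def references(child, parent):
    """Hypothetical helper: True if every non-null child value is in parent."""
    return bool(child.dropna().isin(parent).all())


# posts.uid only contains values that exist in users.id.
uid_refs_users = references(posts["uid"], users["id"])

# A column holding an unknown user id (7) fails the check.
dangling = references(pd.Series([0, 7]), users["id"])

print(uid_refs_users, dangling)
```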

### Column Map Tracer
#### User Age
Next, suppose you want to understand the lineage of the "age" field. Earlier, we hypothesized that the "age" and "birthyear" fields may be related. We can apply the "column mapping" discovery pipeline to trace the lineage of the "age" field and verify this.

In [7]:
solver = DataTracer.load('datatracer.column_map.basic')
solver.solve(tables, target_table="users", target_field="age")

{('users', 'birthyear'): 0.9999985726528566}

This confirms our hypothesis that `users.birthyear` is strongly related to `users.age`, since the column mapping discovery primitive assigned it a high score.
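The same relationship can be sanity-checked by hand on the rows previewed earlier. The sketch below is an independent check with pandas, not the scoring method used by the pipeline: it tests whether `age + birthyear` is constant and computes the Pearson correlation between the two columns.

```python
import pandas as pd

# Values copied from the users preview above (first five rows only).
users = pd.DataFrame({
    "age": [42, 90, 47, 83, 88],
    "birthyear": [1978, 1930, 1973, 1937, 1932],
})

# If age is derived from birthyear, their sum should be a single constant.
reference_years = (users["age"] + users["birthyear"]).unique().tolist()

# A perfect linear relationship gives a correlation of -1 here,
# because age decreases as birthyear increases.
correlation = users["age"].corr(users["birthyear"])

print(reference_years, correlation)
```

On these rows the sum is the same for every user, which is consistent with `age` being computed as a fixed reference year minus `birthyear`.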

#### Number of Posts
Finally, suppose you want to understand the lineage of the "nb_posts" field. Based on its name, we may hypothesize that it indicates the number of posts associated with each user.

In [8]:
solver = DataTracer.load('datatracer.column_map.basic')
solver.solve(tables, target_table="users", target_field="nb_posts")

{('posts', 'uid'): 1.0}

This relationship is less direct than the previous one, but we can see that `posts.uid` is assigned a high score; this suggests that `users.nb_posts` is related to the `uid` field in the `posts` table. Logically, this makes sense, since the `uid` field in the `posts` table is how we determine which post belongs to which user.
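If `nb_posts` really is a per-user post count, it should be reproducible by grouping `posts` on `uid`. The sketch below shows that check on small hypothetical tables (these rows are invented for illustration and are not the demo data); it is a manual verification, not part of the DataTracer API.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration (not the demo data).
users = pd.DataFrame({"id": [0, 1, 2], "nb_posts": [2, 0, 1]})
posts = pd.DataFrame({"id": [0, 1, 2], "uid": [0, 0, 2]})

# Count the posts written by each user id.
counts = posts.groupby("uid").size()

# Align the counts with the users table; users without posts get 0.
expected = users["id"].map(counts).fillna(0).astype(int)

# True when the stored nb_posts matches the recomputed count for every user.
matches = bool((expected == users["nb_posts"]).all())

print(expected.tolist(), matches)
```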