# Legend on Delta Lake

Make sure the jar file of `org.finos.legend-community:legend-delta:X.Y.Z` and all its dependencies are available on your Spark classpath, together with a Legend data model (version controlled on GitLab) that was previously compiled to disk or packaged as a jar file. For Python support, please add the corresponding library from the PyPI repository. The example below shows a Spark cluster configured on a Databricks environment, although the same can be achieved on native Spark / Delta.

[![FINOS - Incubating](https://cdn.jsdelivr.net/gh/finos/contrib-toolbox@master/images/badge-incubating.svg)](https://finosfoundation.atlassian.net/wiki/display/FINOS/Incubating)
[![Build CI](https://github.com/finos/legend-engine/workflows/Build%20CI/badge.svg)]()
[![Maven Central](https://img.shields.io/maven-central/v/org.finos.legend-community/legend-delta.svg)](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22legend-delta)

<img src='https://raw.githubusercontent.com/finos/legend-community-delta/main/images/legend-cluster.png'>

In [0]:
%sql
DROP DATABASE IF EXISTS legend CASCADE;
CREATE DATABASE legend;

## Legend model
A Legend project can be loaded from the classpath or from a directory as follows:

In [0]:
from legend.delta import LegendClasspathLoader
legend = LegendClasspathLoader().loadResources()

## Legend schema
We can create the Spark schema for any Legend entity of type `Class`.
This process recursively loops through each of its underlying fields, enums and possibly nested properties and supertypes.

In [0]:
schema = legend.get_schema("databricks::entity::employee")

In [0]:
import pandas as pd
display(pd.DataFrame(
  [[f.name, str(f.dataType), f.nullable, f.metadata['comment']] for f in schema.fields], 
  columns=['field', 'type', 'optional', 'description']
))

field,type,optional,description
firstName,StringType,False,Person first name
lastName,StringType,False,Person last name
birthDate,DateType,False,Person birth date
gender,StringType,True,Person gender
id,IntegerType,False,Unique ID of a databricks employee
sme,StringType,True,Programming skill that person truly masters
joinedDate,DateType,False,When did that person join Databricks
highFives,IntegerType,True,How many high fives did that person get


## Legend transformations
We can transform raw entities into their desired target tables. Note that relational transformations only support direct mappings and can therefore easily be enforced through the `.withColumnRenamed` syntax.

In [0]:
transformations = legend.get_transformations("databricks::mapping::employee_delta")

In [0]:
import pandas as pd
display(pd.DataFrame(
  [[e, transformations[e]] for e in transformations], 
  columns=['from_column', 'to_column']
))

from_column,to_column
highFives,high_fives
joinedDate,joined_date
lastName,last_name
firstName,first_name
birthDate,birth_date
id,id
sme,sme
gender,gender
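
To make the renaming step concrete without a Spark session, here is a minimal sketch that applies the same kind of mapping to a plain dictionary record. The helper `rename_fields` is hypothetical and not part of the library; on a dataframe the equivalent would be a chain of `.withColumnRenamed` calls.

```python
def rename_fields(record, mapping):
    """Return a copy of `record` whose keys are renamed according to `mapping`.

    Keys that do not appear in `mapping` are kept unchanged.
    `rename_fields` is a hypothetical helper used for illustration only.
    """
    return {mapping.get(key, key): value for key, value in record.items()}


# Subset of the mapping shown above.
transformations = {
    "firstName": "first_name",
    "lastName": "last_name",
    "highFives": "high_fives",
    "id": "id",
}

raw = {"firstName": "Maria", "lastName": "O'Gorman", "highFives": 299, "id": 2}
renamed = rename_fields(raw, transformations)
print(renamed)
```

Because the mapping is one-to-one, the operation changes only field names and never touches the values.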


## Legend expectations
Given the `multiplicity` properties, we can detect whether a field is optional and whether a list has the right number of elements. Given an `enumeration`, we check for value consistency. These are considered **technical expectations** and are converted into SQL constraints. In addition to the rules derived from the schema itself, we also support the conversion of **business expectations** from the PURE language to SQL expressions. To do so, we generate a Legend execution plan against a Databricks runtime, hence operating against a relational Legend `mapping` rather than against pure entities of type `Class`.

In [0]:
expectations = legend.get_expectations("databricks::mapping::employee_delta")

In [0]:
import pandas as pd
display(pd.DataFrame(
  [[e, expectations[e]] for e in expectations], 
  columns=['expectation', 'constraint']
))

expectation,constraint
[birthDate] is mandatory,birth_date IS NOT NULL
[sme] not allowed value,"(sme IS NULL OR sme IN ('Scala', 'Python', 'Java', 'R', 'SQL'))"
[id] is mandatory,id IS NOT NULL
[joinedDate] is mandatory,joined_date IS NOT NULL
[firstName] is mandatory,first_name IS NOT NULL
[high five] should be positive,(high_fives IS NOT NULL AND high_fives > 0)
[lastName] is mandatory,last_name IS NOT NULL
[hiringAge] should be > 18,year(joined_date) - year(birth_date) > 18
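
Since each expectation is a plain SQL predicate, it can be checked against any SQL engine. The sketch below is not the library's implementation: it swaps Spark for Python's built-in `sqlite3` module so that it runs without a cluster, and it uses only the three constraints above that are portable SQL (those relying on `year(...)` are left out). The table layout and the `violations` helper are assumptions made for illustration; the two sample rows come from the employee data used in this notebook.

```python
import sqlite3

# Subset of the expectations shown above (portable SQL only).
expectations = {
    "[id] is mandatory": "id IS NOT NULL",
    "[sme] not allowed value": "(sme IS NULL OR sme IN ('Scala', 'Python', 'Java', 'R', 'SQL'))",
    "[high five] should be positive": "(high_fives IS NOT NULL AND high_fives > 0)",
}

# Two sample employees: (first_name, id, sme, high_fives).
rows = [
    ("Levey", None, "C", 282),
    ("Maria", 2, "Python", 299),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (first_name TEXT, id INTEGER, sme TEXT, high_fives INTEGER)"
)
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)


def violations(conn, expectations):
    """Map each first name to the list of expectations its row violates."""
    result = {}
    for name, constraint in expectations.items():
        # A row violates an expectation when the predicate evaluates to false.
        query = f"SELECT first_name FROM employee WHERE NOT ({constraint})"
        for (first_name,) in conn.execute(query):
            result.setdefault(first_name, []).append(name)
    return result


violated = violations(conn, expectations)
print(violated)
```

On a Spark dataframe, the same constraint strings could presumably be passed to `pyspark.sql.functions.expr` or to `DataFrame.filter`, which both accept SQL expressions.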


## Legend derivations
We can convert Legend derived properties into SQL expressions. In the example model, the field `age` is not physically stored but can be computed at runtime.

In [0]:
derivations = legend.get_derivations("databricks::mapping::employee_delta")

In [0]:
import pandas as pd
display(pd.DataFrame(
  [[e, derivations[e]] for e in derivations], 
  columns=['column', 'expression']
))

column,expression
hiringAge,year(joined_date) - year(birth_date) AS `hiringAge`
age,year(current_date) - year(birth_date) AS `age`
initials,"concat(substring(first_name, 0, 1), substring(last_name, 0, 1)) AS `initials`"
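
As a reading aid, the three expressions above can be mirrored in plain Python. This is only a sketch of what the SQL computes, not how the library evaluates it; the `derive` helper and the `today` parameter (a stand-in for `current_date`) are hypothetical, and the evaluation date used in the example is an arbitrary assumption.

```python
from datetime import date


def derive(row, today=None):
    """Compute the derived properties of one employee record.

    `today` stands in for SQL `current_date`; it defaults to the real
    current date when not supplied.
    """
    today = today or date.today()
    return {
        # year(joined_date) - year(birth_date)
        "hiringAge": row["joined_date"].year - row["birth_date"].year,
        # year(current_date) - year(birth_date)
        "age": today.year - row["birth_date"].year,
        # first character of each name, concatenated
        "initials": row["first_name"][:1] + row["last_name"][:1],
    }


levey = {
    "first_name": "Levey",
    "last_name": "Storck",
    "birth_date": date(1989, 2, 19),
    "joined_date": date(2015, 12, 5),
}

# Hypothetical evaluation date, chosen only to make the example deterministic.
derived = derive(levey, today=date(2022, 1, 1))
print(derived)
```

Note that both age computations are differences between calendar years, so they may differ by one from an age computed to the exact day.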


## Legend tables
In order to query our validated entity from the Legend interface, we can easily create the target state table. This table contains a placeholder for our violated constraints (the `legend` field).

In [0]:
table_name = legend.create_table("databricks::mapping::employee_delta")

In [0]:
display(sql("DESCRIBE EXTENDED {}".format(table_name)))

col_name,data_type,comment
first_name,string,Person first name
last_name,string,Person last name
birth_date,date,Person birth date
gender,string,Person gender
id,int,Unique ID of a databricks employee
sme,string,Programming skill that person truly masters
joined_date,date,When did that person join Databricks
high_fives,int,How many high fives did that person get


# Example - write
In this scenario, we read raw JSON files that we schematize, transform and persist to our target state Delta table.

In [0]:
%sh
head /dbfs/FileStore/antoine.amend@databricks.com/legend/employee.json

In [0]:
schema = legend.get_schema("databricks::entity::employee")
schema_df = spark.read.format("json").schema(schema).load("/FileStore/antoine.amend@databricks.com/legend")
display(schema_df.limit(10))

firstName,lastName,birthDate,gender,id,sme,joinedDate,highFives
Levey,Storck,1989-02-19,M,,C,2015-12-05,282
Maria,O'Gorman,1987-08-14,M,2.0,Python,2017-03-03,299
Evvy,Lepoidevin,1970-10-04,M,3.0,C,2020-11-02,182
Georges,Jotcham,1973-11-26,F,4.0,Scala,2020-09-14,229
Doroteya,Wadhams,1987-03-11,N,5.0,Scala,2019-02-11,78
Mia,Millgate,1988-08-01,F,6.0,Python,2017-04-13,146
Celene,Calverley,1979-07-15,N,7.0,Python,2021-06-03,69
Richie,Di Matteo,1980-05-18,F,8.0,Python,2014-08-23,167
Ignaz,Kurth,1987-01-10,F,,Python,2014-02-01,199
Anthia,Duck,1998-02-08,F,10.0,Python,2015-01-14,277


In [0]:
transformations = legend.get_transformations("databricks::mapping::employee_delta")
for from_column in transformations.keys():
  schema_df = schema_df.withColumnRenamed(from_column, transformations[from_column])

display(schema_df.limit(10))

first_name,last_name,birth_date,gender,id,sme,joined_date,high_fives
Levey,Storck,1989-02-19,M,,C,2015-12-05,282
Maria,O'Gorman,1987-08-14,M,2.0,Python,2017-03-03,299
Evvy,Lepoidevin,1970-10-04,M,3.0,C,2020-11-02,182
Georges,Jotcham,1973-11-26,F,4.0,Scala,2020-09-14,229
Doroteya,Wadhams,1987-03-11,N,5.0,Scala,2019-02-11,78
Mia,Millgate,1988-08-01,F,6.0,Python,2017-04-13,146
Celene,Calverley,1979-07-15,N,7.0,Python,2021-06-03,69
Richie,Di Matteo,1980-05-18,F,8.0,Python,2014-08-23,167
Ignaz,Kurth,1987-01-10,F,,Python,2014-02-01,199
Anthia,Duck,1998-02-08,F,10.0,Python,2015-01-14,277


In [0]:
table_name = legend.get_table("databricks::mapping::employee_delta")
schema_df.write.format("delta").mode("append").saveAsTable(table_name)

# Example - read
From Delta, we read objects that we transform back into a PURE entity with derived properties and violated constraints. New derivations can be added from Legend Studio and seamlessly computed here without the need for an engineering team to write code. The generated dataframe would comply with the business expectations and data quality rules defined in Legend Studio.

In [0]:
df = legend.query('databricks::mapping::employee_delta')
display(df.limit(10))

highFives,joinedDate,lastName,firstName,birthDate,id,sme,gender,hiringAge,age,initials
282,2015-12-05,Storck,Levey,1989-02-19,,C,M,26,33,LS
299,2017-03-03,O'Gorman,Maria,1987-08-14,2.0,Python,M,30,35,MO
182,2020-11-02,Lepoidevin,Evvy,1970-10-04,3.0,C,M,50,52,EL
229,2020-09-14,Jotcham,Georges,1973-11-26,4.0,Scala,F,47,49,GJ
78,2019-02-11,Wadhams,Doroteya,1987-03-11,5.0,Scala,N,32,35,DW
146,2017-04-13,Millgate,Mia,1988-08-01,6.0,Python,F,29,34,MM
69,2021-06-03,Calverley,Celene,1979-07-15,7.0,Python,N,42,43,CC
167,2014-08-23,Di Matteo,Richie,1980-05-18,8.0,Python,F,34,42,RD
199,2014-02-01,Kurth,Ignaz,1987-01-10,,Python,F,27,35,IK
277,2015-01-14,Duck,Anthia,1998-02-08,10.0,Python,F,17,24,AD


Given the following service defined in Legend Studio, we generate the corresponding Spark execution plan and return a dataframe with all requested attributes and calculations:

```
|databricks::entity::employee.all()->filter(
  x|$x.firstName->startsWith('G')
)->project(
  [
    x|$x.firstName,
    x|$x.lastName,
    x|$x.highFives,
    x|$x.age,
    x|$x.hiringAge,
    x|$x.sme,
    x|$x.initials
  ],
  [
    'FirstName',
    'LastName',
    'HighFives',
    'Age',
    'Hiring Age',
    'Sme',
    'Initials'
  ]
)->sort(
  [
    desc('HighFives')
  ]
)->take(10)
```

In [0]:
df = legend.query('databricks::service::employee')
display(df.limit(10))

FirstName,LastName,HighFives,Age,Hiring Age,Sme,Initials
Giustina,Pullen,300,45,37,Python,GP
Garth,Pucker,294,33,28,Python,GP
Garv,Rulf,287,43,37,C,GR
Gonzales,Mewton,284,47,41,Python,GM
Gib,Thorius,282,51,49,SAS,GT
Gregg,Dunstall,278,25,21,Python,GD
Gerianne,Pitkin,277,29,28,Python,GP
Griffy,O'Regan,276,32,31,Python,GO
Gardner,Vlasenko,275,38,30,R,GV
Gerianne,Chessun,273,34,30,SAS,GC
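
For readers less familiar with PURE, the filter / project / sort / take pipeline of this service can be sketched in plain Python over a handful of records taken from the tables in this notebook. The function `top_g_employees` is hypothetical and projects only a subset of the columns; the actual service is compiled to a Spark execution plan, not to code like this.

```python
# A few sample employees taken from the tables in this notebook.
employees = [
    {"firstName": "Maria", "lastName": "O'Gorman", "highFives": 299, "sme": "Python"},
    {"firstName": "Georges", "lastName": "Jotcham", "highFives": 229, "sme": "Scala"},
    {"firstName": "Giustina", "lastName": "Pullen", "highFives": 300, "sme": "Python"},
    {"firstName": "Garth", "lastName": "Pucker", "highFives": 294, "sme": "Python"},
]


def top_g_employees(rows, limit=10):
    """Mimic filter -> project -> sort(desc) -> take on a list of dicts."""
    # filter: keep first names starting with 'G'
    filtered = [r for r in rows if r["firstName"].startswith("G")]
    # project: select and rename the requested attributes
    projected = [
        {
            "FirstName": r["firstName"],
            "LastName": r["lastName"],
            "HighFives": r["highFives"],
            "Sme": r["sme"],
        }
        for r in filtered
    ]
    # sort: descending number of high fives
    ordered = sorted(projected, key=lambda r: r["HighFives"], reverse=True)
    # take: keep at most `limit` rows
    return ordered[:limit]


result = top_g_employees(employees)
for row in result:
    print(row)
```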


The same works for aggregations such as `groupBy`:

```
|databricks::entity::employee.all()->filter(
  x|!($x.gender->isEmpty())
)->groupBy(
  [
    x|$x.gender
  ],
  [
    agg(
      x|$x.highFives,
      x|$x->average()
    ),
    agg(
      x|$x.id,
      x|$x->count()
    )
  ],
  [
    'Gender',
    'HighFives',
    'Employees'
  ]
)->sort(
  [
    desc('HighFives')
  ]
)->take(10)
```

In [0]:
df = legend.query('databricks::service::skills')
display(df.limit(10))

Gender,HighFives,Employees
N,167.09091,44
M,152.0962,394
F,150.23941,542
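
The aggregation performed by this service can likewise be sketched in plain Python. The helper `high_fives_by_gender` is hypothetical, and the sample rows are a few employees from the tables in this notebook. Every sample row has an `id`, so counting rows and counting ids coincide here; how the generated plan treats missing ids is not something this sketch tries to reproduce.

```python
from collections import defaultdict

# A few sample employees taken from the tables in this notebook.
employees = [
    {"gender": "M", "highFives": 299, "id": 2},
    {"gender": "M", "highFives": 182, "id": 3},
    {"gender": "F", "highFives": 229, "id": 4},
    {"gender": "N", "highFives": 78, "id": 5},
    {"gender": "F", "highFives": 146, "id": 6},
]


def high_fives_by_gender(rows, limit=10):
    """Mimic filter -> groupBy(average, count) -> sort(desc) -> take."""
    groups = defaultdict(list)
    for r in rows:
        # filter: skip records without a gender
        if r["gender"] is not None:
            groups[r["gender"]].append(r)
    summary = [
        {
            "Gender": gender,
            "HighFives": sum(m["highFives"] for m in members) / len(members),
            "Employees": len(members),
        }
        for gender, members in groups.items()
    ]
    # sort: descending average number of high fives
    summary.sort(key=lambda r: r["HighFives"], reverse=True)
    return summary[:limit]


result = high_fives_by_gender(employees)
for row in result:
    print(row)
```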
