## Creating a Lilac dataset


In [2]:
from IPython.display import display

import lilac as ll

### From HuggingFace


In [3]:
source_config = ll.HuggingFaceDataset(dataset_name='glue', config_name='ax')
dataset = ll.create_dataset('local', 'glue', source_config)

Found cached dataset glue (/Users/dsmilkov/.cache/huggingface/datasets/glue/ax/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 1/1 [00:00<00:00, 518.65it/s]
Reading from source huggingface...: 100%|██████████| 1104/1104 [00:00<00:00, 37553.62it/s]

Dataset "glue" written to ./data/datasets/local/glue





### From CSV


In [15]:
url = 'https://storage.googleapis.com/lilac-data-us-east1/datasets/csv_datasets/the_movies_dataset/the_movies_dataset.csv'
source_config = ll.CSVDataset(filepaths=[url])
dataset = ll.create_dataset('local', 'the_movies_dataset', source_config)

Downloading from url https://storage.googleapis.com/lilac-data-us-east1/datasets/csv_datasets/the_movies_dataset/the_movies_dataset.csv to /tmp/./data/local_cache/932b6730f8094be7865a04380ae92c4b


Reading from source csv...: 100%|██████████| 45460/45460 [00:01<00:00, 43457.19it/s]

Dataset "the_movies_dataset" written to ./data/datasets/local/the_movies_dataset





### From JSON


In [5]:
source_config = ll.JSONDataset(filepaths=[
  'https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl'
])
dataset = ll.create_dataset('local', 'news_headlines', source_config)

Downloading from url https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl to /tmp/./data/local_cache/452137e2b28c444095efaba00674e4e5


Reading from source json...: 100%|██████████| 200/200 [00:00<00:00, 126907.84it/s]

Dataset "news_headlines" written to ./data/datasets/local/news_headlines





## Visualize the data

Now that we have imported a few datasets, let's visualize them to see what they look like.


In [4]:
ll.start_server()

INFO:     Started server process [94705]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5432 (Press CTRL+C to quit)


### Stopping the server


In [6]:
await ll.stop_server()

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.


## Query a dataset


In [4]:
dataset = ll.get_dataset('local', 'the_movies_dataset')
r = dataset.select_rows(['title', 'budget', 'overview'], limit=5)
print('Total number of rows', r.total_num_rows)
display(r.df())

Total number of rows 45460


| | title | budget | overview | `__rowid__` |
|---|---|---|---|---|
| 0 | Jumanji | 65000000 | When siblings Judy and Peter discover an encha... | 4623e3ce29f34483be6419bc7e03f007 |
| 1 | Grumpier Old Men | 0 | A family wedding reignites the ancient feud be... | f602496a210441b2bdff05191379b6cf |
| 2 | Waiting to Exhale | 16000000 | Cheated on, mistreated and stepped on, the wom... | 69e5a63530cd4882a86cffba20ea14a7 |
| 3 | Heat | 60000000 | Obsessive master thief, Neil McCauley leads a ... | e5be08e15a6247fd9759738fdd4f7291 |
| 4 | Sudden Death | 35000000 | International action superstar Jean Claude Van... | 9b92e098b40e4d689547d3e6d03a4b8a |


## Compute embedding

Let's compute the `SBERT` embedding for the `overview` field. The embedding is computed on-device, that is, locally on this machine.


In [8]:
dataset.compute_embedding('sbert', 'overview')

Computing sbert: 100%|██████████| 45460/45460 [01:47<00:00, 422.10it/s]


Computing signal "sbert" took 107.781s.
Wrote signal output to ./data/datasets/local/the_movies_dataset/overview/sbert


## Enriching an unstructured field with metadata


In [5]:
dataset.compute_signal(ll.PIISignal(), 'overview')

Computing pii: 100%|██████████| 45460/45460 [00:45<00:00, 995.15it/s] 


Computing signal "pii" took 45.727s.
Wrote signal output to ./data/datasets/local/the_movies_dataset/overview/pii


In [6]:
dataset.compute_signal(ll.LangDetectionSignal(), 'overview')

Computing lang_detection: 100%|██████████| 45460/45460 [01:31<00:00, 494.44it/s]


Computing signal "lang_detection" took 91.983s.
Wrote signal output to ./data/datasets/local/the_movies_dataset/overview/lang_detection


In [7]:
dataset.compute_signal(ll.NearDuplicateSignal(), 'overview')

Fingerprinting...: 44506it [00:06, 7055.80it/s]
Computing hash collisions...: 100%|██████████| 5/5 [00:01<00:00,  4.31it/s]
Clustering...: 100%|██████████| 21/21 [00:00<00:00, 154.89it/s]
Computing near_dup: 100%|██████████| 45460/45460 [00:07<00:00, 5837.09it/s]


Computing signal "near_dup" took 7.824s.
Wrote signal output to ./data/datasets/local/the_movies_dataset/overview/near_dup
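
The log lines above (fingerprinting, hash collisions, clustering) suggest a fingerprint-and-cluster pipeline, but the notebook does not show how `NearDuplicateSignal` works internally. To build some intuition, here is a deliberately simplified stand-in: it compares word shingles with exact Jaccard similarity and groups texts with a small union-find. The function names, the shingle size and the `0.7` threshold are all assumptions made for this sketch, not Lilac's actual parameters.

```python
from itertools import combinations
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Split text into overlapping n-word shingles (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {' '.join(words)} if words else set()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(texts: List[str], threshold: float = 0.7) -> List[int]:
    """Return one cluster id per text; near-duplicates share the same id."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(t) for t in texts]
    # Brute-force pairwise comparison; fine for a toy example only.
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)
    return [find(i) for i in range(len(texts))]


# Hypothetical toy texts, not taken from the dataset.
texts = [
    'a family wedding reignites the ancient feud between neighbors',
    'a family wedding reignites the ancient feud between neighbors again',
    'an obsessive master thief leads a crew on a final heist',
]
labels = cluster_near_duplicates(texts)
print(labels)
```

Comparing every pair is quadratic in the number of rows, which is why real systems hash fingerprints into buckets first and only compare texts that collide.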


## Searching


### By keywords


In [4]:
query = ll.KeywordQuery(search='Aliens')
r = dataset.select_rows(['overview'], searches=[ll.Search(path='overview', query=query)], limit=5)
display(r.df())

Computing signal "substring_search" took 0.000s.


| | overview | `__rowid__` | `substring_search(query=Aliens)(overview)` |
|---|---|---|---|
| 0 | When Environmental Protection Agency inspector... | 153b4f31b4374e1c8cdb165b8591beb5 | `[{'__value__': {'start': 422, 'end': 428}}]` |
| 1 | With enormous cone-shaped heads, robotlike wal... | 2e1b7bb114fd4406ac2284f19a218ff7 | `[{'__value__': {'start': 83, 'end': 89}}]` |
| 2 | Aliens who've come to earth to spawn deep bene... | 3113fc2b60c44197ae0fc3f0a9ad0b85 | `[{'__value__': {'start': 0, 'end': 6}}]` |
| 3 | A team from the intergalactic fast food chain ... | 99fc5d00ee7b488fb7cd300f57b18eb6 | `[{'__value__': {'start': 435, 'end': 441}}, {'...` |
| 4 | Marcus is a kid on Manhattan's mean streets. H... | b0e4d97deaa14f459309e42a1dcfde66 | `[{'__value__': {'start': 133, 'end': 139}}]` |
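
Each result above carries a list of `start`/`end` character offsets marking where the keyword was found. As a rough illustration of what such spans mean, the sketch below computes them with the standard library. The helper name `find_keyword_spans` is hypothetical, and matching case-insensitively is an assumption; the notebook does not say how `substring_search` treats case.

```python
import re
from typing import Dict, List


def find_keyword_spans(text: str, keyword: str) -> List[Dict[str, int]]:
    """Return the character offsets of every match of `keyword` in `text`.

    Assumption: matching is case-insensitive. `end` is exclusive, so
    text[start:end] yields the matched substring.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [{'start': m.start(), 'end': m.end()} for m in pattern.finditer(text)]


# Hypothetical sentence, not a row from the dataset.
overview = "Aliens who've come to earth meet other aliens."
spans = find_keyword_spans(overview, 'Aliens')
for span in spans:
    print(span, repr(overview[span['start']:span['end']]))
```

Because `end` is exclusive, a six-letter keyword at the very start of a field gives `{'start': 0, 'end': 6}`, which matches the shape of the third row in the output above.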


### Semantic search


In [7]:
query = ll.SemanticQuery(search='Aliens have invaded the earth', embedding='sbert')
r = dataset.select_rows(['overview'], searches=[ll.Search(path='overview', query=query)], limit=5)
display(r.df())

Computing signal "semantic_similarity" took 0.002s.


| | overview | `__rowid__` | `semantic_similarity(query=Aliens have invaded the earth)(overview.sbert.*.embedding)` |
|---|---|---|---|
| 0 | The Earth is invaded by alien parasites aka "s... | 2f322658c8b240709ed6350731c977cd | [0.7876087948679924] |
| 1 | Aliens invade, this time delivering a clear ul... | a6bc9ca99dfe47668268695614e74292 | [0.7808258235454559] |
| 2 | Aliens pretending to be friendly come to Earth... | 63b6d34b48194b1eb355cb257859b54e | [0.7718495875597] |
| 3 | The nations of the Earth unite in a common cau... | 5071f369239848ca9daa9b2c92907782 | [0.7678595408797264] |
| 4 | Aliens have landed and are hiding on Earth, bu... | b09339247bd946ea8b0e081c2218dc9e | [0.7546965628862381] |
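
Semantic search ranks rows by how close their embedding is to the embedding of the query. The notebook does not state which metric `semantic_similarity` uses, so the sketch below assumes cosine similarity, a common choice for sentence embeddings. The three-dimensional vectors and the document labels are made up for illustration; real `sbert` vectors have hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int = 5) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy embeddings (hypothetical values).
query_vec = [1.0, 0.0, 0.0]
doc_vecs = {
    'invasion': [0.9, 0.1, 0.0],
    'romance': [0.0, 1.0, 0.0],
    'heist': [0.5, 0.5, 0.7],
}
results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```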


### Conceptual search


In [11]:
query = ll.ConceptQuery(concept_namespace='lilac', concept_name='profanity', embedding='sbert')
r = dataset.select_rows(['overview'], searches=[ll.Search(path='overview', query=query)], limit=5)
display(r.df())

Computing signal "concept_labels" took 0.011s.
Computing signal "concept_score" took 0.022s.


| | overview | `__rowid__` | `lilac/profanity/labels(overview)` | `lilac/profanity(overview.sbert.*.embedding)` |
|---|---|---|---|---|
| 0 | A traumatized young man abducts Korean leaders... | c924a9408c6547e9b65706740d3e4925 | | [0.1425706569142622, 0.9767540489817912] |
| 1 | The story centers around a graduating class of... | bbdf5894d8c74e1db9f1759e975358b0 | | [0.0009874361053396775, 0.9702729196295821] |
| 2 | What happens when a generation's ultimate anti... | 40d80c411fbb4e959a8be7233eab1300 | | [0.46421371760421426, 0.9675712519471154] |
| 3 | Welcome to T & A High, where the entire st... | 1e2e233f68674e1498576442f64443ed | | [0.9675146942396857] |
| 4 | Baby Bink couldn't ask for more; he has adorin... | 06680c9be7d74c92b01f0a975ac862b4 | | [0.23914685418353973, 0.9597516982156834] |
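
Note that each row carries a list of concept scores, apparently one per embedded chunk of the `overview` text. In the output above the rows appear to be ordered by their highest chunk score. That is an observation from this one result, not documented behavior. The sketch below reproduces that ordering with the scores shown; the `row0` to `row4` labels stand in for the index column.

```python
from typing import Dict, List


def rank_by_max_score(scores_by_row: Dict[str, List[float]]) -> List[str]:
    """Order row labels by their highest per-chunk score, best first."""
    return sorted(scores_by_row,
                  key=lambda row: max(scores_by_row[row]),
                  reverse=True)


# Scores copied from the output above, deliberately listed out of order.
scores_by_row = {
    'row3': [0.9675146942396857],
    'row0': [0.1425706569142622, 0.9767540489817912],
    'row4': [0.23914685418353973, 0.9597516982156834],
    'row1': [0.0009874361053396775, 0.9702729196295821],
    'row2': [0.46421371760421426, 0.9675712519471154],
}
ranking = rank_by_max_score(scores_by_row)
print(ranking)
```

Taking the maximum means a single strongly matching chunk is enough to surface a long text, even when its other chunks score close to zero.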


## Downloading a dataset


In [None]:
dataset.to_parquet(path=..., fields=...)
dataset.to_csv(path=..., fields=...)
dataset.to_jsonl(path=..., fields=...)
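
Once a file has been exported, it can be read back with ordinary tools. As a minimal sketch, the helper below loads a JSONL file using only the standard library, assuming the export follows the usual convention of one JSON object per line. The `read_jsonl` helper and the temporary file are hypothetical; the sample records reuse two rows from the query output earlier in this notebook.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSONL file (one JSON object per line), skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a small stand-in file so the example is self-contained.
sample_rows = [
    {'title': 'Jumanji', 'budget': 65000000},
    {'title': 'Heat', 'budget': 60000000},
]
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'export.jsonl')
with open(tmp_path, 'w', encoding='utf-8') as f:
    for row in sample_rows:
        f.write(json.dumps(row) + '\n')

records = read_jsonl(tmp_path)
print(len(records), records[0]['title'])
```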

## End to end example


1. Start with a CSV dataset.
2. Compute toxicity scores for the `text` field.
3. Download the result.
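
The three steps above could be wired together roughly as follows, reusing only the calls shown earlier in this notebook. Several things here are assumptions: that a `toxicity` concept is available under the `lilac` namespace, that `r.df()` returns a pandas `DataFrame` (so its `to_csv` method can write the file), and the dataset name `my_dataset`. The `ll` module is passed in as an argument so the function does not import anything at definition time.

```python
def export_with_toxicity(ll, csv_path, out_path, limit=100):
    """Sketch: import a CSV, score the `text` field for toxicity, save to CSV.

    `ll` is the imported lilac module (`import lilac as ll`).
    """
    # 1. Import the CSV as a Lilac dataset ('my_dataset' is a placeholder name).
    source_config = ll.CSVDataset(filepaths=[csv_path])
    dataset = ll.create_dataset('local', 'my_dataset', source_config)

    # 2. Concept search needs an embedding on the field, so compute it first.
    dataset.compute_embedding('sbert', 'text')
    # Assumption: a 'toxicity' concept exists in the 'lilac' namespace.
    query = ll.ConceptQuery(
        concept_namespace='lilac', concept_name='toxicity', embedding='sbert')
    r = dataset.select_rows(
        ['text'], searches=[ll.Search(path='text', query=query)], limit=limit)

    # 3. Save the scored rows. Assumption: r.df() is a pandas DataFrame.
    df = r.df()
    df.to_csv(out_path, index=False)
    return df
```

With Lilac installed, this would be called as `export_with_toxicity(ll, 'my_data.csv', 'toxicity.csv')` after `import lilac as ll`, where both file names are placeholders. Only the first `limit` rows are saved; the `to_csv`, `to_parquet` and `to_jsonl` methods from the previous section are the way to download a whole dataset.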
