## Creating a Lilac dataset


In [4]:
import lilac as ll

### From HuggingFace


In [2]:
source_config = ll.HuggingFaceDataset(dataset_name='glue', config_name='ax')
dataset = ll.create_dataset('local', 'glue', source_config)

Found cached dataset glue (/Users/dsmilkov/.cache/huggingface/datasets/glue/ax/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 1/1 [00:00<00:00, 452.90it/s]
Reading from source huggingface...: 100%|██████████| 1104/1104 [00:01<00:00, 734.11it/s]


Manifest for dataset "glue" written to ./data/datasets/local/glue


### From CSV


In [3]:
source_config = ll.CSVDataset(filepaths=[
  'https://storage.googleapis.com/lilac-data-us-east1/datasets/csv_datasets/the_movies_dataset/the_movies_dataset.csv'
])
dataset = ll.create_dataset('local', 'the_movies_dataset', source_config)

Downloading from url https://storage.googleapis.com/lilac-data-us-east1/datasets/csv_datasets/the_movies_dataset/the_movies_dataset.csv to /tmp/./data/local_cache/ec84c08e61c5414dbe7fac18fee4313b


Reading from source csv...: 100%|██████████| 45460/45460 [00:00<00:00, 61511.98it/s]


Manifest for dataset "the_movies_dataset" written to ./data/datasets/local/the_movies_dataset


### From JSON


In [2]:
source_config = ll.JSONDataset(filepaths=[
  'https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl'
])
dataset = ll.create_dataset('local', 'news_headlines', source_config)

Downloading from url https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl to /tmp/./data/local_cache/e490387fdc444315b21d2c526b89518b
before write items to parquet


Reading from source json...: 100%|██████████| 200/200 [00:00<00:00, 224714.92it/s]

Manifest for dataset "news_headlines" written to ./data/datasets/local/news_headlines
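The `.jsonl` extension in the URL above indicates JSON Lines: one JSON object per line. The sketch below writes and reads back a tiny file in that layout using only the standard library. The field names and file name are hypothetical, and whether `ll.JSONDataset` accepts a local path the same way it accepts a URL is an assumption, not something this notebook confirms.

```python
import json
import os
import tempfile

# Hypothetical records; the field names are purely illustrative.
records = [
    {'text': 'Markets rally after rate decision', 'label': 'finance'},
    {'text': 'New telescope captures distant galaxy', 'label': 'science'},
]

# JSON Lines layout: one JSON object per line, no enclosing array.
jsonl_path = os.path.join(tempfile.mkdtemp(), 'sample.jsonl')
with open(jsonl_path, 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# Read the file back to confirm that every line parses on its own.
with open(jsonl_path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), 'records written to', jsonl_path)
```

If local paths are supported, `jsonl_path` would take the place of the URL in `filepaths=[...]`.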





## Visualize the data

Now that we have imported a few datasets, let's visualize them to see what they look like.


In [4]:
ll.start_server()

INFO:     Started server process [88282]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5432 (Press CTRL+C to quit)


### Stopping the server


In [None]:
await ll.stop_server()

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.


## Query a dataset


In [6]:
from IPython.display import display

dataset = ll.get_dataset('local', 'the_movies_dataset')
r = dataset.select_rows(['title', 'budget', 'overview'], limit=5)
print('Total number of rows', r.total_num_rows)
display(r.df())

Total number of rows 45460


| | title | budget | overview | `__rowid__` |
|---|---|---|---|---|
| 0 | Toy Story | 30000000 | Led by Woody, Andy's toys live happily in his ... | f3c7149554f8445281c4025427befbee |
| 1 | Jumanji | 65000000 | When siblings Judy and Peter discover an encha... | 5309636e74da4787a94a46aedfb0e83a |
| 2 | Grumpier Old Men | 0 | A family wedding reignites the ancient feud be... | 69bce978698148aea11eeb2b7713db35 |
| 3 | Waiting to Exhale | 16000000 | Cheated on, mistreated and stepped on, the wom... | 8107aee22c4a4566906da82b33623c59 |
| 4 | Father of the Bride Part II | 0 | Just when George Banks has recovered from his ... | 500a3f5da6c741be8eb9dc7edb7c5ba3 |


## Compute embedding

Let's compute the `sbert` embedding for the `overview` field. SBERT runs on device, so the computation happens locally.


In [3]:
dataset.compute_embedding('sbert', 'overview')

Computing sbert...: 100%|██████████| 45460/45460 [01:45<00:00, 432.51it/s]


Computing signal "signal_name='sbert'" took 105.196s.
Wrote signal manifest to ./data/datasets/local/the_movies_dataset/overview/sbert/signal_manifest.json


## Search by keyword and concepts


dataset.select_rows(search=)...


## Computing a signal


In [1]:
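import lilac as ll
from lilac.signals.concept_scorer import ConceptScoreSignal

# Load an existing dataset.
dataset = ll.get_dataset('local', 'legal-clauses')

# Score `clause_text` against the `lilac/legal-termination` concept using `sbert` embeddings.
dataset.compute_signal(
  ConceptScoreSignal(namespace='lilac', concept_name='legal-termination', embedding='sbert'),
  'clause_text')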

Computing signal "signal_name='concept_score' embedding='sbert' namespace='lilac' concept_name='legal-termination' draft='main' num_negative_examples=100" took 0.339s.
Wrote signal manifest to ./data/datasets/local/legal-clauses/clause_text/sbert/embedding/lilac/legal-termination/v33/signal_manifest.json


## Downloading a dataset


In [None]:
dataset.to_parquet(path=..., fields=...)
dataset.to_csv(path=..., fields=...)
dataset.to_jsonl(path=..., fields=...)

## End to end example


1. Start with a CSV dataset.
2. Compute a toxicity score on the `text` field.
3. Download the enriched dataset.
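
The three steps above can be sketched by chaining the calls already shown in this notebook. This is a minimal sketch, not a confirmed recipe. The function name `toxicity_pipeline` and its parameters are hypothetical. The `lilac` module and the signal are passed in as arguments so the flow can be read, and dry-run, without committing to a particular toxicity signal. The `path=` and `fields=` keywords follow the export stub in the previous section.

```python
def toxicity_pipeline(ll, signal, csv_path, out_path, fields,
                      dataset_name='my_csv_dataset', text_field='text'):
    """Import a CSV, score one text field, and export the result.

    `ll` is the imported `lilac` module and `signal` is a signal instance.
    `dataset_name` and `text_field` are illustrative defaults.
    """
    # 1. Import the CSV file as a dataset in the 'local' namespace.
    source_config = ll.CSVDataset(filepaths=[csv_path])
    dataset = ll.create_dataset('local', dataset_name, source_config)

    # 2. Embed the text field, then score it with the provided signal.
    dataset.compute_embedding('sbert', text_field)
    dataset.compute_signal(signal, text_field)

    # 3. Export the requested fields to a CSV file.
    dataset.to_csv(path=out_path, fields=fields)
    return dataset


# Possible usage, assuming a concept named 'toxicity' exists in the
# 'lilac' namespace (an assumption, not confirmed by this notebook):
#   import lilac as ll
#   from lilac.signals.concept_scorer import ConceptScoreSignal
#   signal = ConceptScoreSignal(
#       namespace='lilac', concept_name='toxicity', embedding='sbert')
#   toxicity_pipeline(ll, signal, 'reviews.csv', 'scored.csv', fields=['text'])
```

Which entries to list in `fields` so that the computed score is included in the export is left open here; that depends on how the signal output is named in the dataset.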
