# Use a LlamaIndex loader with Lilac

This notebook shows how to use any LlamaIndex loader to load data into Lilac.

LlamaIndex loaders [can be found on LlamaHub](https://llamahub.ai/).

In this example, we'll use the [ArxivReader loader from LlamaHub](https://llamahub.ai/l/papers-arxiv) to load arXiv papers into Lilac.


In [1]:
!pip install pypdf



In [2]:
from llama_index import download_loader

# See: https://llamahub.ai/l/papers-arxiv
ArxivReader = download_loader("ArxivReader")

loader = ArxivReader()
documents = loader.load_data(search_query='au:Karpathy')
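
The `search_query` string above uses arXiv's field-prefix syntax, where `au:` restricts the match to the author field. Assuming the loader forwards this string to the arXiv search API unchanged (not verified here), a small helper can compose such strings. `build_arxiv_query` and its keyword names are hypothetical and not part of LlamaIndex or Lilac; this is only a sketch.

```python
# Hypothetical helper (not part of LlamaIndex or Lilac): composes an arXiv
# search string from field prefixes such as `au:` (author) and `ti:` (title).
ARXIV_PREFIXES = {'author': 'au', 'title': 'ti', 'abstract': 'abs', 'category': 'cat'}


def build_arxiv_query(**fields: str) -> str:
  """Joins `prefix:value` terms with AND, in the order they were passed."""
  terms = []
  for name, value in fields.items():
    if name not in ARXIV_PREFIXES:
      raise ValueError(f'Unsupported field: {name}')
    # Quote multi-word values so they are matched as a phrase.
    if ' ' in value:
      value = f'"{value}"'
    terms.append(f'{ARXIV_PREFIXES[name]}:{value}')
  return ' AND '.join(terms)


query = build_arxiv_query(author='Karpathy')
print(query)  # au:Karpathy
```

The resulting string could then be passed as `loader.load_data(search_query=query)`.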

In [3]:
import lilac as ll

# Set the project directory for Lilac.
ll.set_project_dir('./data')



In [4]:
# This assumes you already have a Lilac project set up in ./data.
# If you don't, create one first with ll.init(project_dir='./data').
ll.create_dataset(
  config=ll.DatasetConfig(
    namespace='local',
    name='arxiv-karpathy',
    source=ll.LlamaIndexDocsSource(
      # `documents` comes from the loader.load_data call in an earlier cell.
      documents=documents,
    ),
  )
)

Reading from source llama_index_docs...: 100%|██████████| 107/107 [00:00<00:00, 10133.46it/s]


Executing:
SELECT COUNT() as count FROM t
Query took 0.001s.
Executing:

        SELECT avg(length(val))
        FROM (SELECT "doc_id" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "text" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "page_label" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "file_name" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "Title of this paper" AS val FROM t) USING SAMPLE 1000;
      
Executing:

        SELECT avg(length(val))
        FROM (SELECT "Authors" AS val FROM t) USING SAMPLE 1000;
      
Query took 0.001s.
Executing:

        SELECT avg(length(val))
        FROM (SELECT "Date published" AS val FROM t) USING SAMPLE 1000;
      
Executing:
SELECT count(val) FROM (SELECT "doc_id" as val FROM t)
Query took 0.001s.

<lilac.data.dataset_duckdb.DatasetDuckDB at 0x2a843ff10>
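
The column names in the log above (`doc_id`, `text`, `page_label`, `file_name`, and so on) suggest that each document becomes one row holding its id, its text, and one column per metadata key. Below is a minimal sketch of that flattening, assuming this is roughly what the source does. `FakeDocument` and `to_row` are hypothetical stand-ins rather than Lilac or LlamaIndex APIs, and the values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeDocument:
  """Stand-in for a LlamaIndex document; only the fields used in this sketch."""
  doc_id: str
  text: str
  metadata: Dict[str, Any] = field(default_factory=dict)


def to_row(doc: FakeDocument) -> Dict[str, Any]:
  """Flattens one document into a single row: id, text, then metadata keys."""
  return {'doc_id': doc.doc_id, 'text': doc.text, **doc.metadata}


# Illustrative values only; real documents come from loader.load_data(...).
doc = FakeDocument(
  doc_id='example-id',
  text='Example page text.',
  metadata={'page_label': '1', 'file_name': 'example.pdf'},
)
row = to_row(doc)
print(list(row))  # ['doc_id', 'text', 'page_label', 'file_name']
```

In this sketch, a metadata key named `doc_id` or `text` would overwrite the corresponding top-level value, so loaders with such keys would need extra care.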

In [5]:
# Print the first few rows of the dataset created above:
dataset = ll.get_dataset('local', 'arxiv-karpathy')
for row in dataset.select_rows(['*'], limit=5):
  print(row)

In [None]:
# Start the Lilac web server for this project:
ll.start_server(project_dir='./data')

INFO:     Started server process [2276]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)
