# Ingest LAION-2B Parquet file into Deep Lake

1. Ingest a Parquet file into Deep Lake
2. Add links to the image URLs so the images can be read as NumPy arrays
3. Run a TQL query and save the result as a view
4. Feed the view into a PyTorch dataloader

### 1. Download and load a parquet file

In [None]:
!wget https://huggingface.co/datasets/laion/laion2B-en/resolve/main/part-00000-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
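
The Parquet shard is large, so re-running the cell above downloads it again each time. A minimal sketch of a skip-if-present download using only the standard library is shown below; `download_if_missing` is a hypothetical helper written for this notebook, not part of Deep Lake or pandas.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless `dest` already exists (hypothetical helper)."""
    target = Path(dest)
    if target.exists():
        # Reuse the file from a previous run instead of downloading again.
        return target
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

Calling it with the URL from the `wget` cell and the local file name passed to `pd.read_parquet` would then be a no-op on later runs.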

In [1]:
import pandas as pd

df = pd.read_parquet('part-00000-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet')

### 2. Ingest into Deep Lake and Linkify images

In [12]:
rows = 100_000
path = './dataset/laion'

In [3]:
import deeplake

ds = deeplake.ingest_dataframe(df[:rows], path, overwrite=True, progressbar=False)
ds.commit()

'firstdbf9474d461a19e9333c2fd19b46115348f'

In [None]:
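# Create a tensor that stores links to images instead of the image bytes.
# verify=False adds the links without checking them first.
ds.create_tensor(
    'images',
    htype='link[image]',
    sample_compression='jpeg',
    verify=False,
    create_shape_tensor=False,
    create_sample_info_tensor=False,
)
# Fill the tensor with placeholders so it has one entry per row.
ds.images.extend([None] * rows)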

#### We are using deeplake.compute transforms that can scale to a cluster

In [4]:
@deeplake.compute
def linkify(sample_in, sample_out):
    # Wrap the row's URL string in a Deep Lake link and store it in 'images'
    sample_out['images'].append(deeplake.link(sample_in['URL'].text()))

linkify().eval(ds, scheduler="processed", num_workers=12, progressbar=True, skip_ok=True)
ds.commit()

Evaluating linkify: 100%|██████████| 100000/100000 [00:08<00:00, 11519.77it/s]


'781624fed6da34dbcfd1d9bccc4a4dd668f43abf'
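
Because the `images` tensor is created with `verify=False`, malformed URLs are stored as-is and only surface as errors when a sample is read. A small syntactic pre-check can catch the obvious cases without any network access. The sketch below assumes that only `http` and `https` URLs are wanted; `looks_like_http_url` is a hypothetical helper, not a Deep Lake function.

```python
from urllib.parse import urlparse


def looks_like_http_url(url) -> bool:
    """Return True if `url` is a syntactically plausible http(s) URL.

    This is a string-level check only: it does not contact the server,
    so a URL that passes may still be dead or may not point to an image.
    """
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError on some malformed hosts (e.g. bad IPv6 brackets)
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

One possible use is to filter the DataFrame before ingestion, for example `df[df['URL'].map(looks_like_http_url)]`.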

In [6]:
ds['images'][0].numpy().mean()

204.7248532948533

### 3. Run a TQL query and save a view

In [7]:
ds_noNSFW = ds.query("select * where NSFW == 'UNLIKELY'")
ds_noNSFW.save_view(id="noNSFW")


In [8]:
len(ds_noNSFW)

93082
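
To make the semantics of the TQL filter above concrete, here is a plain-Python sketch of the same selection on a few made-up rows. The rows and label values are illustrative only and are not taken from the dataset; the point is that the filter keeps every row whose `NSFW` field equals `'UNLIKELY'` and can be described by the indices of the kept rows.

```python
# Toy rows standing in for dataset samples (illustrative values, not real data)
rows_demo = [
    {"URL": "https://example.com/a.jpg", "NSFW": "UNLIKELY"},
    {"URL": "https://example.com/b.jpg", "NSFW": "UNSURE"},
    {"URL": "https://example.com/c.jpg", "NSFW": "UNLIKELY"},
    {"URL": "https://example.com/d.jpg", "NSFW": "NSFW"},
]

# Equivalent of: select * where NSFW == 'UNLIKELY'
kept = [i for i, row in enumerate(rows_demo) if row["NSFW"] == "UNLIKELY"]
print(kept)  # → [0, 2]
```

In the run above, the same filter keeps 93082 of the 100000 ingested rows.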

### 4. Load from another machine and feed into a dataloader

In [14]:
ds = deeplake.load(path, read_only=True)
ds_view = ds.load_view("noNSFW", tensors=["images", "TEXT"], num_workers=8, scheduler='processed')

./dataset/laion loaded successfully.
This dataset can be visualized in Jupyter Notebook by ds.visualize().


In [17]:
dataloader = ds_view.pytorch(
    num_workers=8,
    shuffle=False,
    use_progress_bar=True,
    tensors=['TEXT', 'URL'],
    batch_size=32,
)

# Iterate over every batch once; the progress bar reports the throughput
for el in dataloader:
    pass

./dataset/laion: 2912it [00:19, 152.26it/s]                          
