# Getting data into LSDB

The most practical way to load data into LSDB is from catalogs in HATS format, hosted locally or on a remote source. We recommend visiting our own cloud repository, [data.lsdb.io](https://data.lsdb.io), where you can find large, publicly available surveys.
If you're looking for how to get external data into LSDB, see the topic [Import Catalogs](import_catalogs.html) instead.


In [None]:
import lsdb

### Example: Loading Gaia DR3

As an example, let's load Gaia DR3 into our workflow. It is as simple as calling `read_hats` with the catalog's URL, which you can copy directly from our website.

In [None]:
gaia_dr3 = lsdb.read_hats("https://data.lsdb.io/hats/gaia_dr3/gaia/")
gaia_dr3

The Gaia catalog is very wide, so loading it this way requests its whole set of more than 150 columns.

In [None]:
gaia_dr3.columns

We highly recommend that you:

- **Pre-select a small subset of columns** that satisfies your scientific needs. Loading an unnecessarily large amount of data leads to computationally expensive and inefficient workflows. To see which columns are available before even having to invoke `read_hats`, please refer to the column descriptions in each catalog's section on [data.lsdb.io](https://data.lsdb.io).

- **Load catalogs with their respective margin caches**, when available. These margins are necessary to obtain accurate results in several operations, such as joining and crossmatching. For more information about margins, please visit our [Margins](margins.ipynb) topic notebook.

Let's define the set of columns we need and add the margin catalog's path to our `read_hats` call.

In [None]:
gaia_dr3 = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia/",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs/",
    columns=[
        "source_id",
        "ra",
        "dec",
        "phot_g_mean_mag",
        "phot_proc_mode",
        "azero_gspphot",
        "classprob_dsc_combmod_star",
    ],
)
gaia_dr3

### Data loading is lazy

When you invoke `read_hats`, only the catalog's metadata (e.g. sky coverage, total number of rows, and column schema) is loaded into memory. Notice that the ellipses in the previous catalog representation are just placeholders for data that has not been read yet.

You will find that most use cases start with **LAZY** loading and planning operations, followed by more expensive **COMPUTE** operations. The data is only loaded into memory when we trigger the workflow computations, usually with a `compute` call.

![Lazy workflow diagram](../_static/lazy_diagram.svg)
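The lazy-then-compute pattern can be illustrated with a toy sketch in plain Python. This is not how LSDB is implemented: `LazyRows`, `load_rows`, and the three sample rows are hypothetical names and values made up for illustration. It only shows that planning steps are recorded without touching the data, and that the data is read once `compute` is called.

```python
class LazyRows:
    """Toy stand-in for a lazily evaluated catalog (hypothetical, not an LSDB class)."""

    def __init__(self, loader, steps=()):
        self._loader = loader        # function that reads the data; called only in compute()
        self._steps = tuple(steps)   # planned filters, recorded but not yet executed

    def query(self, predicate):
        # LAZY: return a new object with one more planned step; no data is read.
        return LazyRows(self._loader, self._steps + (predicate,))

    def compute(self):
        # COMPUTE: read the rows now and apply every planned step in order.
        rows = list(self._loader())
        for predicate in self._steps:
            rows = [row for row in rows if predicate(row)]
        return rows


load_calls = []  # records each time the (pretend) expensive load runs


def load_rows():
    # Made-up sample rows standing in for catalog data.
    load_calls.append(1)
    return [
        {"source_id": 1, "phot_g_mean_mag": 12.5},
        {"source_id": 2, "phot_g_mean_mag": 18.1},
        {"source_id": 3, "phot_g_mean_mag": 15.0},
    ]


catalog = LazyRows(load_rows)
bright = catalog.query(lambda row: row["phot_g_mean_mag"] < 16)
loads_after_planning = len(load_calls)  # still 0: nothing has been read

result = bright.compute()
loads_after_compute = len(load_calls)   # 1: the data was read during compute()
print(loads_after_planning, loads_after_compute, result)
```

In a real LSDB workflow the catalog returned by `read_hats` plays the role of the lazy object, and the final `compute` call is what triggers the actual data loading.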

### Visualizing catalog metadata

Even without loading any data, you can still get a glimpse of the catalog's structure.

#### HEALPix map

You can use `plot_pixels` to view the catalog's sky coverage map and obtain information about its HEALPix distribution. Areas with a higher density of points are represented by higher-order (smaller) pixels.

In [None]:
gaia_dr3.plot_pixels(plot_title="Gaia DR3 Pixel Map")
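To get a feel for what the pixel orders in the map mean, the following sketch computes the number and area of HEALPix pixels at a few orders. It relies only on the standard HEALPix definition (12 base pixels, each split into 4 at every additional order) and does not use LSDB; the helper names `healpix_npix` and `healpix_pixel_area_sq_deg` are made up for illustration.

```python
import math

# Area of the full sky in square degrees: 4*pi steradians converted to deg^2.
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2


def healpix_npix(order):
    """Number of equal-area HEALPix pixels covering the sky at a given order."""
    return 12 * 4**order


def healpix_pixel_area_sq_deg(order):
    """Area of a single HEALPix pixel at a given order, in square degrees."""
    return FULL_SKY_SQ_DEG / healpix_npix(order)


for order in range(8):
    print(
        f"order {order}: {healpix_npix(order):>7} pixels, "
        f"{healpix_pixel_area_sq_deg(order):10.4f} sq. deg each"
    )
```

Each step up in order divides the pixel area by four, which is why densely populated regions of the sky end up split into many small, high-order pixels.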

#### Column schema

It is also straightforward to have a look at column names and their respective types.

In [None]:
gaia_dr3.dtypes