<img src="../images/intake-logo.svg" align="right" width="30%">

# Introduction

## The 80/20 Data Science Dilemma

**Do data scientists spend 80% of their time gathering & cleaning data?**

- The 80/20 split varies depending on who you ask
- For some use cases, the data gathering and cleaning steps are ongoing, integral parts of the workflow


### Typical Workflow: Without a Data Catalog

<img src="https://raw.githubusercontent.com/andersy005/talks/gh-pages/images/intake-esm/no-data-catalog.drawio.svg">

- **Scientists spend a considerable portion of their research time ⏰** writing in-house, customized code solutions for finding and understanding data


- Several challenges make this kind of workflow hard to sustain:

    - Data are big
    - Data are heterogeneous
    - Data are stored in remote storage systems (cloud bucket objects, HTTP servers, FTP servers)
    - Data analysis is getting more complex
    
    
- **Can we spend more time analyzing data and less time writing (and re-writing) code for retrieving, cleaning, and disseminating data?**
    - Answer: A resounding yes, and data catalogs are among the solutions


### Refined Workflow: With a Data Catalog


<img src="https://raw.githubusercontent.com/andersy005/talks/gh-pages/images/intake-esm/with-data-catalog.drawio.svg">



## Intake

- Intake is a lightweight package that streamlines the retrieval, investigation, loading, and dissemination of data
- Intake's functionality appeals to a variety of departments and roles within an organization/team (developers, data engineers, data scientists)

### What problems does intake solve for us?

- When we talk about data, we tend to focus on **data scientists**, but actually there is a whole data team
    - Data scientist
    - Software Developer
    - Data Engineer / Curator 
    - IT admins (folks who care about the integrity, security, and deployment of data access systems)
    - Other stakeholders (folks who just want the output of the analysis, but are not interested in how the analysis was done)
    
- Avoid "copy and paste" of blocks of code for accessing data 
- Version control data sources
    - Use conda to package the catalog; when new data are available, the user runs `conda update ...`
    
- Update data specifications in-place / in real time
    - Allows us to just update the catalog and to leave the analysis code unchanged
    - Leave data users alone by abstracting the concept of a filesystem path
    - Data can reside in different, remote locations
    
- Clear distinction between data provider / data curator / software developer / data analyst, etc... 
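
The idea of updating the catalog while leaving the analysis code unchanged can be sketched in plain Python. This is a toy illustration using only the standard library, not Intake's actual API: the `catalog` dict, the `load` helper, the `sea_surface_temp` entry name, and the sample values are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Two made-up versions of the same dataset, written to a temp directory
# so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
old_path = workdir / "sst_v1.csv"
new_path = workdir / "sst_v2.csv"
old_path.write_text("year,sst\n2000,14.1\n2001,14.3\n")
new_path.write_text("year,sst\n2000,14.1\n2001,14.3\n2002,14.8\n")

# Hypothetical toy catalog: maps a dataset name to where the data live.
catalog = {"sea_surface_temp": {"path": old_path}}


def load(name):
    """Load a catalog entry by name into a list of dicts."""
    with open(catalog[name]["path"], newline="") as f:
        return list(csv.DictReader(f))


def analysis():
    """Analysis code refers only to the entry name, never to a path."""
    rows = load("sea_surface_temp")
    return sum(float(r["sst"]) for r in rows) / len(rows)


mean_v1 = analysis()

# The curator points the entry at a new file; analysis() is untouched.
catalog["sea_surface_temp"]["path"] = new_path
mean_v2 = analysis()
```

Only the catalog entry changes between the two calls; the analysis function never sees a filesystem path.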


### Main components

- **Catalogs**:
    - **Something that points to specific datasets and tells you how to load them**
    - A collection of entries (assets), each of which corresponds to a specific dataset
    - A catalog is commonly defined in a YAML file, but there are other possibilities such as a SQL database, CSV, etc.
    - Catalogs form a hierarchy i.e. any catalog can reference (contain) another

- **Drivers**
    - A Python object (class) responsible for loading the data for a catalog entry into a computation-ready container (list of dicts, NumPy ndarray, pandas DataFrame, xarray.Dataset, xarray.DataArray, etc.)
    
    
    
    ![](../images/intake-plugins.png)
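
How catalogs and drivers fit together can be sketched with a small standard-library mock-up. This is not Intake's implementation or API: the `Catalog`, `CSVDriver`, and `JSONDriver` classes, the `DRIVERS` registry, and the sample data are hypothetical names invented for illustration. The sketch shows entries that name a driver, drivers that load data into a container, and a catalog that contains another catalog.

```python
import csv
import json
import tempfile
from pathlib import Path


class CSVDriver:
    """Toy driver: loads a CSV file into a list of dicts."""

    def __init__(self, path):
        self.path = path

    def read(self):
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))


class JSONDriver:
    """Toy driver: loads a JSON file into native Python containers."""

    def __init__(self, path):
        self.path = path

    def read(self):
        with open(self.path) as f:
            return json.load(f)


# Registry mapping a driver name (as an entry would reference it) to a class.
DRIVERS = {"csv": CSVDriver, "json": JSONDriver}


class Catalog:
    """Toy catalog: named entries that are datasets or nested catalogs."""

    def __init__(self, entries):
        self.entries = entries

    def __getitem__(self, name):
        entry = self.entries[name]
        if isinstance(entry, Catalog):  # catalogs can contain catalogs
            return entry
        return DRIVERS[entry["driver"]](**entry["args"])


# Write two small made-up files so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
(workdir / "stations.csv").write_text("id,name\n1,Boulder\n2,Mauna Loa\n")
(workdir / "meta.json").write_text(json.dumps({"units": "degC"}))

root = Catalog({
    "stations": {"driver": "csv", "args": {"path": workdir / "stations.csv"}},
    "ancillary": Catalog({
        "meta": {"driver": "json", "args": {"path": workdir / "meta.json"}},
    }),
})

stations = root["stations"].read()       # list of dicts
meta = root["ancillary"]["meta"].read()  # dict
```

The user asks for an entry by name and calls `read()`; which driver runs and where the file lives are details held by the catalog.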