# `ibis` Single Table Verbs

We will focus on how to use the `ibis` package, a successor to the popular `pandas` package, for manipulating tabular data. We begin by importing the `ibis` package. We also include two additional imports that are commonly referred to by short names: the table placeholder `_` instead of `ibis._`, and the selector methods as `s` instead of the verbose `ibis.selectors`. We will see these in action later.

### Learning Goals

- establish a connection with `ibis.duckdb.connect()`
- use `head()` and `execute()` to preview large data
- use `select()`, `distinct()`, `filter()` and `count()` to explore data.

### Getting started

To use `ibis`, we must also select a backend. We will use the relatively new and very powerful `duckdb` backend for all of our tasks. We select a backend by creating a "connection". The details are not important for us; we can treat this first block as "boilerplate" starting code.

In [4]:
import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

We are now ready to read in our data. We will begin by reading the metrics table from the direct access link, as indicated in the URL below. `con.read_csv()` is quite similar to the `pandas.read_csv()` we saw in module 1, though the optional arguments have somewhat different names and are not quite as flexible. One important option for our purposes is how to indicate missing values. In the past, we've seen negative values like `-99` used to indicate missing values. That convention reflects limitations of early software, which had no natural concept of "missing". More modern conventions indicate missing values as "NULL" or "NA". Our data uses the latter, which we indicate with `nullstr="NA"`:

In [2]:
metrics_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/tsmetrics.csv"
tsmetrics = con.read_csv(metrics_url, nullstr="NA")


### Previewing data: `head()` and `execute()`

Let's take a look at our new table:

In [3]:
tsmetrics

This doesn't look like a pretty pandas table! Where are the values? Actually, as we become more familiar with `ibis`, we learn to appreciate the display choice here. `ibis` is designed for working with very big data. An important part of this is something called _lazy evaluation_. Even downloading a very large file might take a long time, and trying to load a large dataset into Python all at once can exceed available RAM and crash the kernel. Instead, `ibis` merely "peeks" at the data over the remote connection -- without even downloading it! It tells us the name of each column and the data type (e.g. string or numeric) that the `read_csv` method has 'guessed' for the data. As we will see, this is often the most useful information anyway.

If we do want to see a few example rows, we can use the method `head()` on the table, `tsmetrics.head()`, to say we only want to see the top of the data frame. Optionally, we can specify how many rows we want to preview, e.g. `tsmetrics.head(10)` to see 10 (the default is 5). Let's try it:

In [6]:
tsmetrics.head()

That's not the top 5 rows! Once again, `ibis` is being lazy. We see the same definition of the table as before, only this time it has a name, `r0`, and we see a "plan of execution": `ibis` will return the first 5 rows, `Limit[r0, 5]`. We can force it to execute this plan with `execute()`:

In [7]:
tsmetrics.head().execute()

| | tscategory | tsshort | tslong | tsunitsshort | tsunitslong | tsunique |
|---|---|---|---|---|---|---|
| 0 | OTHER TIME SERIES DATA | AQ | Aquaculture | MT | metric tons | AQ-MT |
| 1 | OTHER TIME SERIES DATA | ASP | Annual surplus production | MT | Metric tons | ASP-MT |
| 2 | TOTAL BIOMASS | BdivBmgtpref | General biomass time series preferentially rel... | dimensionless | dimensionless | BdivBmgtpref-dimensionless |
| 3 | TOTAL BIOMASS | BdivBmgttouse | General biomass time series relative to manage... | dimensionless | dimensionless | BdivBmgttouse-dimensionless |
| 4 | TOTAL BIOMASS | BdivBmsypref | General biomass time series preferentially rel... | dimensionless | dimensionless | BdivBmsypref-dimensionless |


At last, we are starting to see what the data really looks like. Data tables can quickly become much too large to explore by simply trying to eyeball every row. For instance, we notice the first column, `tscategory`, shows a few different possible categories for the various metrics in the database. So, how many distinct categories are there?

### `select()` and `distinct()`

To answer this, we will introduce a few more methods of data table manipulation. `select()` selects one or more _columns_ of a given table, while `distinct()` returns only distinct (unique) rows of the table. Note that both of these methods share a common pattern: they both apply to a table (not some piece of a table, like a row or column or cell), and they both return a new table that is some subset of the old table. Table in, table out. This design is very intentional: by having methods designed specifically to operate on tables and return tables, we can easily stack or chain these together. (The same is true of `head()`; `execute()` comes last in a chain because it returns the computed results.) So let's try to see the distinct categories:

In [9]:
(tsmetrics
 .select("tscategory")
 .distinct()
 .head(10)
 .execute()
)
 

| | tscategory |
|---|---|
| 0 | CATCH or LANDINGS |
| 1 | FISHING MORTALITY |
| 2 | OTHER TIME SERIES DATA |
| 3 | SPAWNING STOCK BIOMASS or CPUE |
| 4 | PRODUCTION |
| 5 | TOTAL BIOMASS |
| 6 | RECRUITS (NOTE: RECRUITS ARE OFFSET IN TIME SE... |
| 7 | TIME UNITS |


Note that we have stacked these methods together with each step on its own line by wrapping the whole thing inside `()` parentheses. This can make a long "chain" of commands easier to read. While we asked for no more than 10 values, we got back only 8, so we now know there are only 8 categories. What if we had selected two columns instead before using `distinct()`?

In [10]:
(tsmetrics
 .select("tscategory", "tsshort")
 .distinct()
 .head(10)
 .execute()
)

| | tscategory | tsshort |
|---|---|---|
| 0 | SPAWNING STOCK BIOMASS or CPUE | CPUE-A8 |
| 1 | SPAWNING STOCK BIOMASS or CPUE | CPUEraw-5 |
| 2 | SPAWNING STOCK BIOMASS or CPUE | CPUEraw |
| 3 | SPAWNING STOCK BIOMASS or CPUE | CPUEstand-4 |
| 4 | SPAWNING STOCK BIOMASS or CPUE | CPUEstand |
| 5 | CATCH or LANDINGS | CUSTC |
| 6 | OTHER TIME SERIES DATA | DIS |
| 7 | OTHER TIME SERIES DATA | EFFORT |
| 8 | OTHER TIME SERIES DATA | EFFORT-A2 |
| 9 | FISHING MORTALITY | ER-3 |
