

# `ibis` mutates and aggregates

### Learning Goals

- use `group_by()` and `aggregate()` patterns to summarize data.
- use `mutate()` to create a new column that is a function of data from existing columns in a table.
- use `order_by()` to arrange rows by one or more columns.

In [1]:
import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

In [2]:
base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")


### Stock "assessments"

Next, we need to understand what a stock assessment is. Let's look at the `assessment` table:

In [3]:
assessment

Are some `stockid`s assessed multiple times? One intuitive idea is to `filter()` for a single stockid, and see if we get back multiple rows (multiple assessments).  Let's take a look at "COD2J3KL":

In [4]:
(assessment
 .filter(_.stockid == "COD2J3KL")
 .execute()
)

Unnamed: 0,assessid,assessorid,stockid,stocklong,recorder,daterecorded,dateloaded,assessyear,assesssource,contacts,notes,pdffile,assess,refpoints,assessmethod,assesscomments,xlsfilename,mostrecent
0,DFO-COD2J3KL-1959-2014-WATSON,DFO,COD2J3KL,Atlantic cod Southern Labrador-Eastern Newfoun...,WATSON,2016-02-25,2016-02-18,1959-2014,http://www.dfo-mpo.gc.ca/csas-sccs/publication...,,,,1,1,SURBA,,WATSON_COD2J3KL_2014-DH-edited.xlsx,0
1,DFO-NFLD-COD2J3KL-1850-2011-CHING,DFO-NFLD,COD2J3KL,Atlantic cod Southern Labrador-Eastern Newfoun...,CHING,2013-10-22,2001-12-31,1850-2011,,,,Stock Assesment of Northern (2J3KL) Cod in 2013,1,1,SURBA,,/home/srdbadmin/srdb/spreadsheets/CHING-COD2J3...,0
2,DFO-NFLD-COD2J3KL-1959-2018-ASHBROOK,DFO-NFLD,COD2J3KL,Atlantic cod Southern Labrador-Eastern Newfoun...,ASHBROOK,2019-06-20,2019-08-22,1959-2018,http://www.dfo-mpo.gc.ca/csas-sccs/Publication...,,,AtlanticCod.2J3KL.pdf,1,1,ICA,,ASHBROOK-COD2J3KL-2018-DH-edited.xlsx,0
3,DFO-NFLD-COD2J3KL-1959-2021-HIVELY,DFO-NFLD,COD2J3KL,Atlantic cod Southern Labrador-Eastern Newfoun...,HIVELY,2023-11-08,2023-11-10,1959-2021,,,Science Advisory Report 2022/041,,1,1,ICA,,CND_vers4.62_updates.xlsx,999


Indeed, it looks like there are four assessments of this stock, each conducted in a different year and spanning a different period of time! `filter()`ing for each possible `stockid` would be tedious, though. Instead, we can group the rows by `stockid` and count the assessments in each group:

In [5]:
(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .execute()
)

Unnamed: 0,stockid,n
0,CSALMAKPSWUD,1
1,CSALMANDREAR,1
2,CSALMCHIGCD,1
3,CSALMFISHC,1
4,CSALMKANEKTOKR,1
...,...,...
1507,NPOUTVIa,2
1508,PLAIC7d,10
1509,PLAICIIIa,1
1510,POLLNS-VI-IIIa,11
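
To make the grouping step less mysterious, here is a minimal plain-Python sketch of what a group-and-count computes. It does not use `ibis` at all; the rows and the `STOCK_*` ids are hypothetical toy values, and `Counter` merely stands in for the idea behind `group_by()` followed by `agg(n=_.count())`.

```python
from collections import Counter

# Hypothetical toy rows standing in for the assessment table:
# each dict is one assessment record (ids are made up for illustration).
rows = [
    {"assessid": "A1", "stockid": "STOCK_A"},
    {"assessid": "A2", "stockid": "STOCK_A"},
    {"assessid": "A3", "stockid": "STOCK_B"},
    {"assessid": "A4", "stockid": "STOCK_A"},
]

# Grouping by stockid and counting: one output row per distinct stockid,
# with n equal to the number of input rows that share that stockid.
counts = Counter(row["stockid"] for row in rows)
summary = [{"stockid": stockid, "n": n} for stockid, n in counts.items()]

print(summary)
```

The key point is that the result has one row per group rather than one row per assessment, which is why the summary table is shorter than the original.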


### `order_by()`

Which `stockid`s have the most assessments? We can re-order the rows by one or more columns using `order_by()`. (Changing the row order does not alter any individual row; each row is moved as a unit, so the data stay intact.) By default, `order_by()` sorts in _increasing_ order: smallest to largest, A to Z. While that might be intuitive for dates or names, if we want to see which stocks have the _most_ assessments, we need `n` to be in descending order. We indicate this by calling the `.desc()` method on the column:

In [None]:
(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .order_by(_.n.desc())
 .execute()
)
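
As a rough illustration of ascending versus descending order, the sketch below sorts a tiny hypothetical summary table in plain Python. The `STOCK_*` ids and counts are made up, and `sorted()` is only an analogy for what `order_by()` and `.desc()` do, not how `ibis` works internally.

```python
# Hypothetical toy summary rows (stockid, n); not real results.
summary = [
    {"stockid": "STOCK_A", "n": 2},
    {"stockid": "STOCK_B", "n": 11},
    {"stockid": "STOCK_C", "n": 1},
]

# Default ordering: smallest n first.
ascending = sorted(summary, key=lambda row: row["n"])

# Descending ordering: largest n first, analogous to ordering by n descending.
descending = sorted(summary, key=lambda row: row["n"], reverse=True)

# Each row moves as a unit: stockid and n stay paired.
print([row["stockid"] for row in ascending])
print([row["stockid"] for row in descending])
```

Note that sorting only changes where each row appears; every `stockid` keeps its own `n`.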