# Ibis for dplyr Users

[R](https://www.r-project.org/) users familiar with [dplyr](https://dplyr.tidyverse.org/) are likely to find some parts of Ibis familiar.
In fact, some Ibis verbs have been named to match their corresponding dplyr verbs.

However, due to constraints of the Python programming language and the design and goals of Ibis itself, analysts familiar with dplyr may notice some big differences between the two right away:

TODO: Fill this list in more.

- No pipe syntax, but you can chain method calls
- `mutate` can't refer to columns created in the same call; you have to chain `mutate` calls
- Ibis is more similar to dplyr + dbplyr (TODO: expand on this)
- `_` as a helper
- NULL ordering. Ibis sorts NULLs differently than dplyr/R. TODO:
    - Figure out why this is and explain it well
- There are multiple ways to reference columns (by string or by "reference")
- Complex expressions need to be wrapped in parentheses to make them evaluate correctly
- TODO: check whether `group_by` can be used for things other than aggregates. For instance:
    ```r
    starwars |>
      filter(!is.na(height)) |>
      group_by(species) |> 
      slice_max(height, n = 3)
    ```

## Comparison



### Loading Ibis

In [2]:
import ibis
import ibis.examples as ex
import ibis.selectors as s
from ibis import _
ibis.options.interactive = True

### Loading example data

In R, datasets are typically lazily loaded with packages. For instance, the `starwars` dataset is packaged with dplyr, but is not loaded into memory until you start using it. Ibis provides many datasets in the `examples` module. To use the `starwars` dataset, fetch it with:

In [3]:
starwars = ex.starwars.fetch()

### Inspecting the dataset with `head()`

Just like in R, you can use `head()` to inspect the beginning of a dataset. You can also specify the number of rows you want to get back with the parameter `n`. Note that Ibis defaults to `n = 5`, whereas R's `head()` returns 6 rows by default.

In R:

```r
head(starwars) # or starwars |> head()
```

With Ibis:

In [8]:
starwars.head(6)

There is no `tail()` in Ibis because most databases do not support this operation.

Another method you can use to limit the number of rows returned by a query is `limit()`, which also takes the `n` parameter.

In [10]:
starwars.limit(3)

### filter()

Ibis, like dplyr, has the `filter` method to select rows based on conditions.

With dplyr:

```r
starwars |>
  filter(skin_color == "light")
```

In Ibis:

In [11]:
starwars.filter(_.skin_color == "light")

In dplyr, you can specify multiple conditions separated with `,` that are then combined with the `&` operator:

```r
starwars |>
  filter(skin_color == "light", eye_color == "brown")
```

In Ibis, you can do the same by putting multiple conditions in a list:

In [12]:
starwars.filter([_.skin_color == "light", _.eye_color == "brown"])

If you want to combine conditions using both `&` and `|`, in dplyr, you could do:

```r
starwars |>
  filter(
      (skin_color == "light" & eye_color == "brown") |
       species == "Droid"
  )
```

In Ibis:

In [18]:
starwars.filter(
    ((_.skin_color == "light") & (_.eye_color == "brown")) |
    (_.species == "Droid")
)
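The parentheses around each comparison matter because of Python's operator precedence: `&` and `|` bind more tightly than `==`. The sketch below uses plain Python values rather than Ibis expressions (the variable names are made up for illustration) to show what happens when the parentheses are left out.

```python
# Python parses `x == 1 & y == 2` as the chained comparison
# `x == (1 & y) == 2`, because `&` has higher precedence than `==`.
x, y = 1, 2

without_parens = x == 1 & y == 2    # 1 & 2 is 0, so this is `1 == 0 == 2`
with_parens = (x == 1) & (y == 2)   # True & True

print(without_parens)  # False
print(with_parens)     # True

# With strings, the missing parentheses raise an error instead of
# silently giving the wrong answer: "light" & eye_color is evaluated first.
skin_color, eye_color = "light", "brown"
try:
    skin_color == "light" & eye_color == "brown"
    error = None
except TypeError as exc:
    error = str(exc)

print(error is not None)  # True
```

The same precedence rules apply when the operands are Ibis column expressions, which is why each condition is wrapped in its own pair of parentheses before being combined.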

### arrange()

To sort rows by the values of a column, dplyr has the verb `arrange`. For instance, to sort by the column `height` using dplyr:

```r
starwars |>
   arrange(height)
```

With Ibis, the equivalent method is `order_by`:

In [8]:
starwars.order_by(_.height)

You might notice that while dplyr puts missing values at the end, Ibis places them at the top.

If you want to order using multiple variables, you can pass them as a list:

In [5]:
starwars.order_by([_.height, _.mass])

To sort by a column in descending order, there are two ways to do it. Note that missing values remain at the top.

In [10]:
starwars.order_by(_.height.desc()) # or: starwars.order_by(ibis.desc("height"))
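The two placements of missing values can be pictured with plain Python lists. This is only a sketch of the behaviours described in this section, not how Ibis or dplyr implement sorting; `sort_with_missing` and the sample values are hypothetical.

```python
heights = [172, None, 96, 202, None, 66]

def sort_with_missing(values, missing_first, descending=False):
    """Sort values, placing None entries at the start or at the end."""
    present = sorted((v for v in values if v is not None), reverse=descending)
    missing = [v for v in values if v is None]
    return missing + present if missing_first else present + missing

# Missing values at the top, as seen in the Ibis output in this section.
missing_at_top = sort_with_missing(heights, missing_first=True)
missing_at_top_desc = sort_with_missing(heights, missing_first=True, descending=True)

# Missing values at the end, as dplyr's arrange() does.
missing_at_end = sort_with_missing(heights, missing_first=False)

print(missing_at_top)       # [None, None, 66, 96, 172, 202]
print(missing_at_top_desc)  # [None, None, 202, 172, 96, 66]
print(missing_at_end)       # [66, 96, 172, 202, None, None]
```

In Ibis itself, one way to try to reproduce dplyr's placement might be to sort on `_.height.isnull()` before `_.height`. That is an assumption and has not been checked here.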

### slice()

dplyr provides several functions in the `slice` family to select some rows from the dataset. They are not directly implemented in Ibis but can be emulated with other functions.

For instance, in dplyr, you can use `slice` to select rows 5 to 10:

```r
starwars |>
   slice(5:10)
```

In Ibis, you can use `limit` and specify an offset. Since `slice(5:10)` returns six rows (5 through 10 inclusive), the limit is 6 and the offset is 4:

In [12]:
starwars.limit(6, offset=4)
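Because `slice(first:last)` in dplyr is 1-based and inclusive at both ends, translating it to `limit` takes a little arithmetic: the number of rows is `last - first + 1` and the offset is `first - 1`. The helper below is a hypothetical illustration of that conversion in plain Python; `slice_to_limit` is not part of Ibis.

```python
def slice_to_limit(first, last):
    """Translate a 1-based, inclusive slice(first:last) into (n, offset)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    n = last - first + 1   # number of rows to return
    offset = first - 1     # number of rows to skip
    return n, offset

n, offset = slice_to_limit(5, 10)
print(n, offset)  # 6 4

# Check against a stand-in for row numbers 1..20.
rows = list(range(1, 21))
print(rows[offset:offset + n])  # [5, 6, 7, 8, 9, 10]
```

The resulting pair would then be passed along as `starwars.limit(n, offset=offset)`.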

`slice_sample` is not implemented. (add example on how to do it?)

`slice_max` and `slice_min` are not implemented, but similar results can be obtained by combining `order_by` and `limit`:

In [19]:
(
    starwars
    .filter(_.height.notnull())
    .order_by(_.height.desc())
    .limit(3)
)

### select()

In [25]:
starwars.select(["hair_color", "skin_color", "eye_color"])

In [16]:
starwars.select(s.endswith("color"))

### rename()

In [17]:
starwars.relabel({"homeworld": "home_world"})

### mutate()

In [19]:
starwars.mutate(height_m = _.height / 100).select("height_m", "height", ~s.contains("height"))

In [20]:
(
    starwars
    .mutate(height_m = _.height / 100)
    .mutate(BMI = _.mass / (_.height_m**2))
    .select("BMI", ~s.matches("BMI"))
)

### summarize() / summarise()

In [21]:
starwars.aggregate(height = _.height.mean())