# Iceberg and Friends - Modern Big Data in a Box

# The Mentor

![Ibis logo](images/ibis_logo.svg)

Ibis is a Python library that provides a backend-agnostic dataframe API. The same code can run against relational databases (Postgres, MS SQL Server, DuckDB) as well as dataframe libraries such as Polars and pandas.

Ibis evaluates operations lazily: it builds up a query graph and only passes it to the backend when execution is requested.

In [109]:
import ibis
import pathlib
import pandas as pd

Ibis uses DuckDB as the default backend

In [71]:
t = ibis.read_csv('../data/csv/10.csv', table_name='reviews')

Ibis (via DuckDB) has sniffed the CSV to infer the schema

In [44]:
t

In [45]:
t.head()

Remember, everything is lazy, so we need to call `.execute()` to actually get results

In [46]:
t.head().execute()

Unnamed: 0,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,...,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
0,147937429,english,1696875102,1717510986,1,3,0,0.5268,0,1,...,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
1,166664841,russian,1717510100,1717510100,1,0,0,0.0,0,1,...,0,1,,76561199161536896,24,11,436,71,385,1717512997
2,166664763,russian,1717510009,1717510009,1,0,0,0.0,0,0,...,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
3,166663001,turkish,1717508182,1717508182,0,0,0,0.0,0,0,...,0,1,,76561199374468448,32,4,361,19,356,1717508513
4,166658743,brazilian,1717503385,1717503385,1,1,0,0.52381,0,1,...,0,1,,76561198018922960,9,1,1497,0,1497,1478272196


By default, `.execute()` returns a pandas DataFrame, but other output formats are available. Let's get a Polars dataframe instead

In [48]:
t.head().to_polars()

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196


## Alternate backends

Ibis supports [many backends](https://ibis-project.org/backends/support/matrix) with varying coverage. By default, Ibis uses DuckDB, so the `ibis.read_csv` call above roughly corresponds to the following:

In [60]:
duckdb_conn = ibis.duckdb.connect(':memory:')
duckdb_t = duckdb_conn.read_csv('../data/csv/10.csv')
duckdb_t.head().to_polars()

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196


We can also explicitly use a different backend - let's try Polars

In [None]:
polars_conn = ibis.polars.connect()

polars_t = polars_conn.read_csv('../data/csv/10.csv')
polars_t.head().execute()

For the rustaceans - datafusion...

In [69]:
datafusion_conn = ibis.datafusion.connect()
datafusion_t = datafusion_conn.read_csv('../data/csv/10.csv')
datafusion_t.head().execute()

Unnamed: 0,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,...,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
0,103862802,russian,1637857773,1637857773,1,0,0,0.0,0,1,...,0,0,,76561198210898784,286,4,3264,0,2402,1710552207
1,103862313,russian,1637857686,1637857686,1,0,0,0.0,0,1,...,0,0,,76561199187926192,51,3,1783,0,1152,1704141343
2,103861974,russian,1637857621,1637857621,1,0,0,0.0,0,1,...,0,0,,76561198865460048,10,4,106068,1,19873,1717089070
3,91390399,russian,1620096075,1637857519,1,0,0,0.0,0,1,...,0,0,,76561198876292192,145,102,1447,0,472,1620195994
4,103861319,romanian,1637857508,1637857508,1,0,0,0.0,0,0,...,0,0,,76561197981719888,0,1,520610,8953,395967,1717455442


As mentioned previously, Ibis compiles the query graph to the target backend

In [88]:
ibis.duckdb.compile(t.select(t.recommendationid, t.language, t.timestamp_created).head())

'SELECT "t0"."recommendationid", "t0"."language", "t0"."timestamp_created" FROM "reviews" AS "t0" LIMIT 5'
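To make "lazy" and "compiles the query graph" a bit more concrete, here is a toy sketch in plain Python. It is not how Ibis is implemented: the `Table`, `Select` and `Limit` classes and the `compile_sql` function are hypothetical names invented for this illustration. The point is only that method calls record what should happen, and nothing is turned into SQL until we ask for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """A named table; it holds no data, only a name."""
    name: str

    def select(self, *columns):
        return Select(self, columns)


@dataclass(frozen=True)
class Select:
    """Records a projection over a parent node."""
    parent: Table
    columns: tuple

    def head(self, n=5):
        return Limit(self, n)


@dataclass(frozen=True)
class Limit:
    """Records a row limit over a parent node."""
    parent: Select
    n: int


def compile_sql(node):
    """Walk the recorded graph and emit a single SQL string."""
    if isinstance(node, Limit):
        return f"{compile_sql(node.parent)} LIMIT {node.n}"
    if isinstance(node, Select):
        cols = ", ".join(f'"{c}"' for c in node.columns)
        return f'SELECT {cols} FROM "{node.parent.name}"'
    raise TypeError(f"cannot compile {type(node).__name__}")


# Building the expression does no work; it only nests the nodes.
expr = Table("reviews").select("recommendationid", "language").head()

# Only this step produces something a backend could run.
sql = compile_sql(expr)
print(sql)
```

A real backend compiler handles far more node types and dialect differences, but the shape is the same: build a graph first, translate it once at the end.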

These are all in-memory backends. How about Postgres? Luckily, we have one running in Docker

In [None]:
postgres_conn = ibis.postgres.connect(host='localhost', user='postgres', password='postgres')

We can't read the CSV directly through the Postgres backend, so we first convert the data to an in-memory format (here, a PyArrow table) and then write it to the table

In [None]:
postgres_conn.create_table('reviews', schema=t.schema()) # Create an empty table with the schema of `t`
postgres_conn.insert('reviews', t.to_pyarrow()) # Load the rows via an in-memory PyArrow table

Now we're ready to read data from the Postgres backend

In [83]:
postgres_t = postgres_conn.table('reviews') # Grab the table `reviews` registered on the backend
postgres_t.head().execute()

Unnamed: 0,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,...,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
0,147937429,english,1696875102,1717510986,1,3,0,0.5268,0,1,...,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
1,166664841,russian,1717510100,1717510100,1,0,0,0.0,0,1,...,0,1,,76561199161536896,24,11,436,71,385,1717512997
2,166664763,russian,1717510009,1717510009,1,0,0,0.0,0,0,...,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
3,166663001,turkish,1717508182,1717508182,0,0,0,0.0,0,0,...,0,1,,76561199374468448,32,4,361,19,356,1717508513
4,166658743,brazilian,1717503385,1717503385,1,1,0,0.52381,0,1,...,0,1,,76561198018922960,9,1,1497,0,1497,1478272196


# Exploring the Ibis API

Ibis defines its own dataframe API to transform data. It's similar to pandas, but draws more inspiration from R's dplyr.

To demonstrate: you may have noticed that the timestamps are in Unix epoch format. How would we convert those to dates?
`.mutate` adds or replaces columns. It does not modify the table in place but returns a new table expression, so we assign the result back to `t`

In [95]:
t = t.mutate(timestamp=t.timestamp_created.to_timestamp())
t.schema()

ibis.Schema {
  recommendationid                int64
  language                        string
  timestamp_created               int64
  timestamp_updated               int64
  voted_up                        int64
  votes_up                        int64
  votes_funny                     int64
  weighted_vote_score             float64
  comment_count                   int64
  steam_purchase                  int64
  received_for_free               int64
  written_during_early_access     int64
  hidden_in_steam_china           int64
  steam_china_location            string
  author_steamid                  int64
  author_num_games_owned          int64
  author_num_reviews              int64
  author_playtime_forever         int64
  author_playtime_last_two_weeks  int64
  author_playtime_at_review       int64
  author_last_played              int64
  timestamp                       timestamp
}
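As a sanity check on what the conversion does, the same arithmetic can be sketched with the standard library alone. This assumes the integers are seconds since the Unix epoch and that we want them interpreted as UTC; the value used is the `timestamp_created` of the first row shown earlier.

```python
from datetime import datetime, timezone

# timestamp_created of the first review in 10.csv
created = 1696875102

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
ts = datetime.fromtimestamp(created, tz=timezone.utc)

print(ts)                 # the full timestamp
print(ts.year, ts.month)  # the parts we may later want as columns
```

Passing `tz=timezone.utc` matters here: without it, `datetime.fromtimestamp` uses the local timezone of the machine, so the result would differ between laptops.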

Ibis also has a `.select` method to select expressions

In [105]:
t.select(t.timestamp_created.to_timestamp().name('timestamp')).head().execute()

Unnamed: 0,timestamp
0,2023-10-09 18:11:42
1,2024-06-04 14:08:20
2,2024-06-04 14:06:49
3,2024-06-04 13:36:22
4,2024-06-04 12:16:25


and a `.filter` method to filter rows

In [136]:
t.filter(~t.recommendationid.isnull()).head()

Method chaining is a popular technique in pandas, but it usually involves writing lambdas so that later steps can refer to the intermediate dataframe.

We want a year column derived from `timestamp_created`. This involves converting to a timestamp and then extracting the year from it

In [116]:
df = pd.read_csv('../data/csv/10.csv')

In [113]:
(df
 .assign(timestamp=pd.to_datetime(df.timestamp_created, unit='s'))
 .assign(year_created=lambda x: x.timestamp.dt.year)
 ['year_created']
 .head()
)

0    2023
1    2024
2    2024
3    2024
4    2024
Name: year_created, dtype: int32

Ibis has a better solution: the `_` API. `_` is a deferred expression that stands in for the table a method is called on (here, the result of the previous step in the chain), and Ibis resolves it for us.

In [119]:
from ibis import _

In [117]:
t = ibis.read_csv('../data/csv/10.csv')

In [122]:
(t.mutate(timestamp=t.timestamp_created.to_timestamp())
 .mutate(year_created=_.timestamp.year())
 .select(_.year_created)
 .head()
 .execute()
)

Unnamed: 0,year_created
0,2023
1,2024
2,2024
3,2024
4,2024


## Bigger than memory

Ibis can also work with folders of files and larger-than-memory data, depending on the backend.

In the Steam Reviews dataset, each file is named after the game ID, so game ID 10 is stored in `10.csv`. To keep this information, we use the DuckDB-specific `filename=True` argument, which adds the name of the source file as a column

In [None]:
conn = ibis.duckdb.connect('demo.ddb')

In [138]:
t = conn.read_csv("../data/csv/*.csv", table_name='reviews', filename=True)

In [125]:
t.head().execute()

Unnamed: 0,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,...,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played,filename
0,147937429,english,1696875102,1717510986,1,3,0,0.5268,0,1,...,1,,76561199550893216,35,23,59161,4738,58753,1717541057,/Users/anders/projects/tutorials/pydata_iceber...
1,166664841,russian,1717510100,1717510100,1,0,0,0.0,0,1,...,1,,76561199161536896,24,11,436,71,385,1717512997,/Users/anders/projects/tutorials/pydata_iceber...
2,166664763,russian,1717510009,1717510009,1,0,0,0.0,0,0,...,1,,76561198046827632,0,7,23750,7,23743,1717510490,/Users/anders/projects/tutorials/pydata_iceber...
3,166663001,turkish,1717508182,1717508182,0,0,0,0.0,0,0,...,1,,76561199374468448,32,4,361,19,356,1717508513,/Users/anders/projects/tutorials/pydata_iceber...
4,166658743,brazilian,1717503385,1717503385,1,1,0,0.52381,0,1,...,1,,76561198018922960,9,1,1497,0,1497,1478272196,/Users/anders/projects/tutorials/pydata_iceber...
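Conceptually, reading a glob with `filename=True` tags every row with the file it came from. The sketch below shows that idea with the standard library only; it says nothing about how DuckDB does this internally. `iter_rows_with_filename` is a hypothetical helper, and the two tiny CSV files are throwaway examples, not the Steam data. Because it is a generator, rows are produced one at a time rather than loaded all at once.

```python
import csv
import glob
import os
import tempfile


def iter_rows_with_filename(pattern):
    """Yield one dict per CSV row, tagged with the path it was read from."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                row["filename"] = path
                yield row


# Demo on two throwaway files (made-up contents, named like the dataset).
with tempfile.TemporaryDirectory() as tmp:
    for game_id, language in [("10", "english"), ("20", "turkish")]:
        with open(os.path.join(tmp, f"{game_id}.csv"), "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["recommendationid", "language"])
            writer.writerow([1, language])

    # Materialise inside the `with` block, while the files still exist.
    rows = list(iter_rows_with_filename(os.path.join(tmp, "*.csv")))

for row in rows:
    print(os.path.basename(row["filename"]), row["language"])
```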


In [139]:
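t = (t
     # Convert epoch seconds to timestamps and 0/1 flags to booleans
     .mutate(timestamp_created=t.timestamp_created.to_timestamp(),
             timestamp_updated=t.timestamp_updated.to_timestamp(),
             author_last_played=t.author_last_played.to_timestamp(),
             steam_purchase=t.steam_purchase.cast(bool),
             voted_up=t.voted_up.cast(bool)
            )
     # Derive date parts and the game ID (file name without directory or extension)
     .mutate(year_created=_.timestamp_created.year(),
             month_created=_.timestamp_created.month(),
             game_id=_.filename.split("/")[-1].split(".")[0]
            )
     # The filename column is no longer needed
     .drop(_.filename)
     # Keep only rows that have a recommendation ID
     .filter(~t.recommendationid.isnull())
)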

In [134]:
t.head().execute()

Unnamed: 0,recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,...,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played,year_created,month_created,game_id
0,147937429,english,2023-10-09 18:11:42,2024-06-04 14:23:06,True,3,0,0.5268,0,True,...,76561199550893216,35,23,59161,4738,58753,2024-06-04 22:44:17,2023,10,10
1,166664841,russian,2024-06-04 14:08:20,2024-06-04 14:08:20,True,0,0,0.0,0,True,...,76561199161536896,24,11,436,71,385,2024-06-04 14:56:37,2024,6,10
2,166664763,russian,2024-06-04 14:06:49,2024-06-04 14:06:49,True,0,0,0.0,0,False,...,76561198046827632,0,7,23750,7,23743,2024-06-04 14:14:50,2024,6,10
3,166663001,turkish,2024-06-04 13:36:22,2024-06-04 13:36:22,False,0,0,0.0,0,False,...,76561199374468448,32,4,361,19,356,2024-06-04 13:41:53,2024,6,10
4,166658743,brazilian,2024-06-04 12:16:25,2024-06-04 12:16:25,True,1,0,0.52381,0,True,...,76561198018922960,9,1,1497,0,1497,2016-11-04 15:09:56,2024,6,10
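The `game_id` expression is ordinary string splitting, so its logic is easy to check in plain Python. The sketch below mirrors it for POSIX-style paths. `game_id_from_path` is a hypothetical helper, and the example path is made up, since the real paths in the output are truncated.

```python
from pathlib import PurePosixPath


def game_id_from_path(path):
    """Mirror `filename.split("/")[-1].split(".")[0]` for a POSIX-style path."""
    file_name = path.split("/")[-1]   # drop the directories
    return file_name.split(".")[0]    # drop the extension


# Hypothetical path, used only for illustration
example = "/some/dir/data/csv/10.csv"

print(game_id_from_path(example))

# pathlib expresses the same intent for simple `<id>.csv` names
print(PurePosixPath(example).stem)
```

Note that the result is a string, not an integer. If a numeric game ID is needed, an extra `.cast` like the ones used for the boolean columns would be one way to get it.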


In [142]:
!du -hs ../data/csv

 13G	../data/csv
