This repository contains the code for my May 3, 2023 workshop at Greenville's Tech Slam 'N Eggs. If you enjoyed the talk, please consider starring Meerschaum on GitHub!
Disclaimer
This project demonstrates how to build a Meerschaum Compose project in a Docker container; whether to use Docker is up to personal or company preference.
You can follow all of these steps without Docker: just omit root_dir from mrsm-compose.yaml (and add root/ to your .gitignore).
For today's project we want to retrieve and aggregate these data sources from FRED:
- Average Price: Eggs, Grade A, Large (Cost per Dozen) in U.S. City Average
- Average Price: Chicken, Fresh, Whole (Cost per Pound/453.6 Grams) in U.S. City Average
Set the secret URI for MRSM_SQL_ETL in .env at the root of this project. I'm using the default value for the database that comes with mrsm stack up -d db:
# .env
### Set to a URI
export MRSM_SQL_ETL='postgresql://mrsm:mrsm@localhost:5432/meerschaum'
### or set to JSON
export MRSM_SQL_ETL='{
"flavor": "timescaledb",
"username": "mrsm",
"password": "mrsm",
"host": "localhost",
"port": 5432,
"database": "meerschaum"
}'
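The URI and JSON forms above carry the same connection attributes. As a rough illustration only (this is not Meerschaum's own parsing logic), the sketch below normalizes either form into a dictionary using the standard library. The helper name parse_connector_env is hypothetical, and taking "flavor" from the URI scheme is an assumption made for this example.

```python
import json
from urllib.parse import urlsplit


def parse_connector_env(value: str) -> dict:
    """Normalize a URI string or a JSON object string into a dict (hypothetical helper)."""
    value = value.strip()
    if value.startswith("{"):
        # JSON form: the attributes are already spelled out.
        return json.loads(value)
    # URI form: split the string into its components.
    parts = urlsplit(value)
    return {
        "flavor": parts.scheme,  # ASSUMPTION: the scheme stands in for the flavor
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


uri_attrs = parse_connector_env("postgresql://mrsm:mrsm@localhost:5432/meerschaum")
json_attrs = parse_connector_env(
    '{"flavor": "timescaledb", "username": "mrsm", "password": "mrsm",'
    ' "host": "localhost", "port": 5432, "database": "meerschaum"}'
)

# Compare only the attributes that both forms spell out identically.
shared_keys = ("username", "password", "host", "port", "database")
forms_agree = all(uri_attrs[key] == json_attrs[key] for key in shared_keys)
print(forms_agree)
```

Note that the two examples differ in flavor (postgresql in the URI, timescaledb in the JSON), which is why the comparison skips that key.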
Build and start the development container:
docker compose up -d --build
Once it's up, hop into the container (we've named the service mrsm-compose and the container techslamneggs):
docker compose exec mrsm-compose bash
or
docker exec -it techslamneggs bash
NOTE: If you aren't using Docker, remove root_dir from mrsm-compose.yaml and run these commands:
pip install meerschaum
mrsm install plugin compose
mrsm compose show plugins
If all went well, you should now be able to use mrsm and mrsm compose. Some quick clarification:
- mrsm
  Uses the default environment and is not bound to any specific project.
- mrsm compose
  Assumes the environment defined in mrsm-compose.yaml to isolate pipes to a single project.
From the name you might see how mrsm compose is modeled after docker compose. This philosophy makes collaboration simpler by defining the expected environment and pipes in one single manifest (akin to docker-compose.yaml).
In mrsm-compose.yaml, define a list of pipes under the keys sync:pipes. Let's take a look at the first one:
sync:
  pipes:
    - connector: "plugin:fred"
      metric: "price"
      location: "eggs"
      target: "price_eggs"
      columns:
        datetime: "DATE"
      dtypes:
        "PRICE": "float64"
      parameters:
        fred:
          series_id: "APU0000708111"
Here's a breakdown of these keys:
- connector: "plugin:fred"
  Fetch new data from the fred plugin (plugins/fred.py).
- metric: "price"
  What kind of data this pipe holds.
- location: "eggs"
  Qualifying tag to the label price.
- instance (omitted)
  We can choose to explicitly state which database hosts this pipe. The default is this project's default instance (sql:etl, more info below).
- target: "price_eggs"
  The target table's name (default would be plugin_fred_price_eggs).
- columns
  Immutable indices for the table.
  - datetime: "DATE"
    Range axis for query bounding. May be omitted but is strongly encouraged.
- dtypes
  Optional; explicitly state desired data types. The datetime index is parsed as datetime64[ns] unless set as Int64 here.
- parameters
  All other pipe parameters may be specified here (e.g. fetch:backtrack_minutes is often used in sql pipes).
  - fred
    Custom parameters for this project! In this case, series_id is used to tell the fred plugin which dataset to fetch.
That's a lot of keys! Don't worry if you don't know what to include: the only required keys are connector and metric. Start with those and work your way up (like most things, it's an iterative process).
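To make the "required keys" and "default target" points concrete, here is a small dependency-free sketch. Both helper names are hypothetical, and the naming rule in default_target is inferred from the single example in the breakdown (plugin_fred_price_eggs); it is not taken from Meerschaum's source.

```python
from typing import List, Optional

# Per the text, only these two keys must be present for a pipe.
REQUIRED_KEYS = ("connector", "metric")


def missing_required_keys(pipe_keys: dict) -> List[str]:
    """Return the required keys that are absent or empty (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not pipe_keys.get(key)]


def default_target(connector: str, metric: str, location: Optional[str] = None) -> str:
    """Guess the default table name (ASSUMPTION: keys joined by '_', ':' replaced by '_')."""
    parts = [connector.replace(":", "_"), metric]
    if location:
        parts.append(location)
    return "_".join(parts)


eggs_keys = {"connector": "plugin:fred", "metric": "price", "location": "eggs"}
print(missing_required_keys(eggs_keys))
print(default_target(**eggs_keys))
```

A pipe that sets target explicitly, as the eggs pipe does with price_eggs, bypasses the default name entirely.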
NOTE: The keys you defined above are the keyword arguments for mrsm.Pipe(). If you want to try something out, just run the command
$ mrsm python
and pass them into a Pipe object:
>>> import meerschaum as mrsm
>>> pipe = mrsm.Pipe(
...     connector='foo',
...     metric='bar',
...     columns={'datetime': 'date'}
... )
>>>
Now it's time to do the fun part: writing the code to fetch the data. Create plugins/fred.py and define a function fetch() with this signature:
from typing import Any
import meerschaum as mrsm

def fetch(
    pipe: mrsm.Pipe,
    **kwargs: Any
) -> 'pd.DataFrame':
    ...
Whatever is returned from this function will be passed into Pipe.sync(), so it's safe to return duplicate data.
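As a minimal sketch of what the body of fetch() could look like, the example below reads series_id from the pipe's parameters and parses a two-column CSV. Several things here are assumptions rather than this project's actual plugin: the FRED CSV download URL, the two-column layout with "." for missing values, the injectable download argument (added only so the example runs offline), and the sample prices, which are made up. It returns a list of dictionaries to stay dependency-free; wrap the result in pandas.DataFrame(...) if you want to match the annotated return type.

```python
import csv
import io
from datetime import datetime
from types import SimpleNamespace
from typing import Any, Callable, Dict, List
from urllib.request import urlopen

# ASSUMPTION: FRED's public CSV download endpoint; the real plugin may differ.
FRED_CSV_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"


def download_csv(series_id: str) -> str:
    """Download the raw CSV text for one series (requires network access)."""
    with urlopen(FRED_CSV_URL.format(series_id=series_id)) as response:
        return response.read().decode("utf-8")


def parse_fred_csv(text: str) -> List[Dict[str, Any]]:
    """Turn 'date,value' CSV text into rows with DATE and PRICE keys."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    rows = []
    for record in reader:
        # ASSUMPTION: missing observations appear as '.' or an empty string.
        if len(record) < 2 or record[1] in ("", "."):
            continue
        rows.append({
            "DATE": datetime.strptime(record[0], "%Y-%m-%d"),
            "PRICE": float(record[1]),
        })
    return rows


def fetch(
    pipe: Any,  # an mrsm.Pipe in a real plugin
    download: Callable[[str], str] = download_csv,  # hypothetical hook for offline runs
    **kwargs: Any
) -> List[Dict[str, Any]]:
    """Fetch the series named in the pipe's parameters."""
    series_id = pipe.parameters["fred"]["series_id"]
    return parse_fred_csv(download(series_id))


# Offline demonstration: a stand-in pipe and canned CSV text (illustrative values only).
sample_csv = "DATE,APU0000708111\n2023-01-01,1.50\n2023-02-01,.\n2023-03-01,2.25\n"
demo_pipe = SimpleNamespace(parameters={"fred": {"series_id": "APU0000708111"}})
demo_rows = fetch(demo_pipe, download=lambda series_id: sample_csv)
print(demo_rows)
```

The DATE and PRICE keys mirror the columns and dtypes declared for the pipe in mrsm-compose.yaml.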
NOTE: It's not used in this project, but the best way to improve the performance of your plugins is to implement the --begin and --end flags. To do so, your function would look like this:
from typing import Any, Optional
from datetime import datetime
import meerschaum as mrsm

def fetch(
    pipe: mrsm.Pipe,
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
    **kwargs: Any
) -> 'pd.DataFrame':
    ...
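One simple way to honor begin and end is sketched below: a hypothetical within_bounds helper that trims already-fetched rows. Treating begin as inclusive and end as exclusive is an assumption for this example, and the sample rows are made up. The larger performance win usually comes from passing the bounds to the data source so that less data is downloaded in the first place; filtering afterwards only reduces what gets synced.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def within_bounds(
    rows: List[Dict[str, Any]],
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
    datetime_column: str = "DATE",
) -> List[Dict[str, Any]]:
    """Keep rows where begin <= timestamp < end (ASSUMPTION: end is exclusive)."""
    return [
        row for row in rows
        if (begin is None or row[datetime_column] >= begin)
        and (end is None or row[datetime_column] < end)
    ]


# Illustrative rows only; the prices are not real observations.
all_rows = [
    {"DATE": datetime(2009, 12, 1), "PRICE": 1.0},
    {"DATE": datetime(2010, 1, 1), "PRICE": 2.0},
    {"DATE": datetime(2014, 12, 1), "PRICE": 3.0},
    {"DATE": datetime(2015, 1, 1), "PRICE": 4.0},
]
bounded_rows = within_bounds(all_rows, begin=datetime(2010, 1, 1), end=datetime(2015, 1, 1))
print([row["DATE"].date().isoformat() for row in bounded_rows])
```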
You may have noticed a function register() within fred.py. This too is optional but serves as a good reference for other developers as to what parameters you expect to be present. This function is called when you run the command register pipe -c plugin:fred -m foo but is overridden by your compose file.
Now we're ready to begin syncing. To begin, run this command:
mrsm compose run
The run command updates the pipes' registrations and performs a sync one-by-one. You can manually refresh the registration with up --dry:
mrsm compose up --dry
Conversely, you can delete the pipes from your project with down -v:
mrsm compose down -v
A useful command for comparing the state of your compose file against the registered pipes is explain:
mrsm compose explain
Apart from the compose-specific commands (i.e. run, up, down, explain, logs, ps), all other commands are executed as standard Meerschaum actions but with the flag --tags {project_name} appended.
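The pass-through rule above can be pictured with a toy model. This is only a conceptual sketch, not how the compose plugin is implemented: the function expand_compose_args is hypothetical, and the project name my_project is a placeholder.

```python
from typing import List

# The compose-specific commands listed in the text.
COMPOSE_COMMANDS = {"run", "up", "down", "explain", "logs", "ps"}


def expand_compose_args(args: List[str], project_name: str) -> List[str]:
    """Model how 'mrsm compose <args>' maps onto a standard action (hypothetical)."""
    if args and args[0] in COMPOSE_COMMANDS:
        # Compose-specific commands are handled by the compose plugin itself.
        return list(args)
    # Everything else runs as a standard action scoped to the project's tag.
    return list(args) + ["--tags", project_name]


passthrough = expand_compose_args(["sync", "pipes", "-c", "plugin:fred"], "my_project")
compose_only = expand_compose_args(["explain"], "my_project")
print(passthrough)
print(compose_only)
```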
You can investigate **kwargs with the command show arguments:
mrsm compose show arguments --begin 2023-01-01
A command you'll likely be running often is sync pipes: this will select the pipes from your flags and sync them in parallel. For example, to sync all pipes with the connector plugin:fred, we would run this command:
mrsm compose sync pipes -c plugin:fred
Consider this command to sync our SQL pipes within a certain date range:
mrsm compose sync pipes -c sql:etl --begin 2010-01-01 --end 2015-01-01
There's a lot more I could have included, but this should be enough to get you started! If you'd like to see what else Meerschaum can do, you can always play around in a Python REPL by running the command mrsm compose python.
The code in this project is released under the Apache License 2.0. In a nutshell, feel free to use this code how you wish, just give the proper credits and don't claim my branding. Have fun!