In this example, we will use Snakemake to analyze Airbnb data from New York City in 2019. Our pipeline includes preprocessing data, performing basic analysis, and generating visualizations. 

You can find the dataset in the GitHub repository, or you can download it directly from [Kaggle here](https://www.kaggle.com/datasets/ptoscano230382/air-bnb-ny-2019).

We'll start by creating a conda environment with the dependencies and then activating it.

In [None]:
!conda env create -f environment.yml
!conda activate snakemake-tutorial
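
Each `!` command in a notebook runs in its own subshell, so an activation performed there may not carry over to the running kernel. As a minimal sketch, assuming the environment name `snakemake-tutorial` from the cell above, the hypothetical helper `environment_ready` checks the `CONDA_DEFAULT_ENV` variable that conda sets on activation and looks for the `snakemake` executable on the `PATH`:

```python
import os
import shutil


def environment_ready(expected_env="snakemake-tutorial", environ=None, which=shutil.which):
    """Return True if the expected conda env is active and snakemake is on PATH.

    `environment_ready` is a hypothetical helper, not part of conda or Snakemake.
    `environ` and `which` are injectable so the check can be exercised without
    touching the real environment.
    """
    environ = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    active_env = environ.get("CONDA_DEFAULT_ENV")
    # shutil.which returns None when the executable is not found on PATH.
    snakemake_path = which("snakemake")
    return active_env == expected_env and snakemake_path is not None


if __name__ == "__main__":
    print("Environment ready:", environment_ready())
```

If this prints `False`, one option is to activate the environment in a terminal and launch the notebook from there.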

The workflow in our `Snakefile` includes the following steps:
- `preprocess`: clean the raw data
- `analyze`: generate summary statistics
- `visualize`: create visualizations from the analysis

You can view the steps (or the rules) contained in the `Snakefile` as follows:

In [3]:
!snakemake --list

all
analyze
preprocess
visualize


Now we can either run the whole pipeline from start to finish or run individual steps. Note that the workflow is sequential: `visualize` needs the output of `analyze`, and `analyze` needs the output of `preprocess`. If you request a step whose upstream outputs are missing, Snakemake detects this and runs the required steps first.
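
The idea of working backwards from missing outputs can be illustrated with a small Python sketch. This is a simplified model, not Snakemake's actual implementation: it only checks whether output files exist, whereas Snakemake also considers things like file modification times. The `preprocess` and `analyze` paths match the ones this workflow uses; the `visualize` output path `output/plot.png` and the function `plan` are hypothetical placeholders.

```python
# Rule table for the three steps. The visualize output path is a
# hypothetical placeholder used only for illustration.
RULES = {
    "preprocess": {
        "input": ["data/AB_NYC_2019.csv"],
        "output": ["output/cleaned_data.csv"],
    },
    "analyze": {
        "input": ["output/cleaned_data.csv"],
        "output": ["output/summary.csv"],
    },
    "visualize": {
        "input": ["output/summary.csv"],
        "output": ["output/plot.png"],  # hypothetical
    },
}


def plan(target, existing_files, rules=RULES):
    """Return the rule names to run, in order, so that `target` can be produced.

    A rule is scheduled if one of its outputs is missing or if a rule that
    produces one of its inputs is scheduled. Inputs that no rule produces
    (the raw data) are assumed to exist.
    """
    # Map each output file to the rule that produces it.
    producers = {out: name for name, rule in rules.items() for out in rule["output"]}
    order = []

    def visit(rule_name):
        rule = rules[rule_name]
        upstream_scheduled = False
        for path in rule["input"]:
            producer = producers.get(path)
            if producer is not None:
                # Recurse first so upstream rules land earlier in the order.
                upstream_scheduled = visit(producer) or upstream_scheduled
        missing = any(out not in existing_files for out in rule["output"])
        if missing or upstream_scheduled:
            if rule_name not in order:
                order.append(rule_name)
            return True
        return False

    visit(target)
    return order


if __name__ == "__main__":
    # Only the raw data exists, so both steps are needed.
    print(plan("analyze", {"data/AB_NYC_2019.csv"}))
```

With only the raw data present, requesting `analyze` yields `preprocess` followed by `analyze`; once `output/cleaned_data.csv` exists, only `analyze` is scheduled.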

For example, try running the code below to `analyze` the data before `preprocess` has been run. Snakemake will take care of it and run both steps for you.

In [6]:
!snakemake --cores 1 analyze

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job           count
----------  -------
analyze           1
preprocess        1
total             2

Select jobs to execute...

[Thu Jan  9 12:30:28 2025]
rule preprocess:
    input: data/AB_NYC_2019.csv
    output: output/cleaned_data.csv
    jobid: 1
    reason: Missing output files: output/cleaned_data.csv
    resources: tmpdir=/tmp

[Thu Jan  9 12:30:29 2025]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...

[Thu Jan  9 12:30:29 2025]
rule analyze:
    input: output/cleaned_data.csv
    output: output/summary.csv
    jobid: 0
    reason: Missing output files: output/summary.csv; Input files updated by another job: output/cleaned_data.csv
    resources: tmpd