In this example, we will use Snakemake to analyze Airbnb data from New York City in 2019. Our pipeline includes preprocessing data, performing basic analysis, and generating visualizations. 

You can find the dataset in the GitHub repository, or you can download it directly from [Kaggle](https://www.kaggle.com/datasets/ptoscano230382/air-bnb-ny-2019).

The dataset contains Airbnb listings in New York City for 2019. Each listing is described by its host name, neighborhood, geographical location, price, reviews, and other attributes.

In [1]:
!head -5 data/AB_NYC_2019.csv

id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194


We'll start by creating a conda environment with the dependencies and then activating it.

In [None]:
!conda env create -f environment.yml
!conda activate snakemake-tutorial

Here is the workflow in our `Snakefile`. It includes the following steps:
- `preprocess`: clean the raw data
- `analyze`: generate summary statistics
- `visualize`: create visualizations from the analysis

You can view the steps (or the rules) contained in the `Snakefile` as follows:

In [3]:
!snakemake --list

all
analyze
preprocess
visualize


As we discussed in the slides on target files and target rules, the first rule in our `Snakefile` is `rule all`, which defines the three output files we want to obtain at the end of our analysis:
```python
rule all:
    input:
        "output/visualizations/price_distribution.png",
        "output/visualizations/availability_by_neighborhood.png",
        "output/visualizations/reviews_per_month.png"
```

Let's take a look at the `preprocess` step to understand exactly what it does and how we can use Python scripts in a Snakemake workflow. In this step we run `scripts/preprocess.py` on the input file `data/AB_NYC_2019.csv` and create `output/cleaned_data.csv` as output. The corresponding rule in our `Snakefile` is:
```python
rule preprocess:
    input:
        "data/AB_NYC_2019.csv"
    output:
        "output/cleaned_data.csv"
    script:
        "scripts/preprocess.py"
```

The input file `data/AB_NYC_2019.csv` should already be available in your working directory. The Python script `scripts/preprocess.py` is as follows:
```python
import pandas as pd

# Load raw data
data = pd.read_csv(snakemake.input[0])

# Drop rows with missing values in critical columns
data = data.dropna(subset=["name", "host_name", "neighbourhood_group", "price"])

# Filter out listings with unrealistic prices (e.g., over $1,000)
data = data[data["price"] <= 1000]

# Normalize column names
data.columns = [col.strip().lower().replace(" ", "_") for col in data.columns]

# Save cleaned data
data.to_csv(snakemake.output[0], index=False)
```
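To make the column-normalization step concrete, here is a small standalone sketch that applies the same expression to a few made-up header strings. The headers are hypothetical and chosen only for illustration; the columns of `AB_NYC_2019.csv` shown earlier are already lowercase with underscores, so this step leaves them unchanged.

```python
# Hypothetical raw headers, chosen only to illustrate the transformation
raw_columns = [" Host Name", "Room Type ", "price", "neighbourhood_group"]

# Same expression as in scripts/preprocess.py:
# trim surrounding whitespace, lowercase, replace inner spaces with underscores
normalized = [col.strip().lower().replace(" ", "_") for col in raw_columns]

print(normalized)
# ['host_name', 'room_type', 'price', 'neighbourhood_group']
```

Normalizing names once in the preprocessing step means later scripts can refer to columns without worrying about stray whitespace or capitalization.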

Next, the `analyze` rule takes the cleaned data as input and calculates some basic summary statistics. The script it runs is shown below:
```python
import pandas as pd

# Load cleaned data
data = pd.read_csv(snakemake.input[0])

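# Aggregate per neighbourhood group: mean price, mean availability, and the
# sum of reviews_per_month (stored under the name total_reviews)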
summary = data.groupby("neighbourhood_group").agg(
    avg_price=("price", "mean"),
    avg_availability=("availability_365", "mean"),
    total_reviews=("reviews_per_month", "sum")
).reset_index()

# Save summary statistics
summary.to_csv(snakemake.output[0], index=False)
```
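The `analyze` rule itself is not listed here. Judging by the input and output paths that Snakemake reports for it in the dry run later in this example, it probably looks something like the sketch below. The script path `scripts/analyze.py` is an assumption that mirrors the naming of `scripts/preprocess.py`; check the actual `Snakefile` for the real value.

```python
rule analyze:
    input:
        "output/cleaned_data.csv"   # produced by the preprocess rule
    output:
        "output/summary.csv"
    script:
        "scripts/analyze.py"        # assumed path, not confirmed by the text
```

Because the input of `analyze` is the output of `preprocess`, Snakemake can infer the order of the two steps without being told explicitly.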

Now we can either run the whole pipeline from start to finish or run individual steps. Keep in mind that our workflow is sequential: the `visualize` step needs the output of `analyze`, which in turn needs the output of `preprocess`. Snakemake handles this for you. If you request a step whose inputs are missing, it detects that the outputs of the earlier steps do not exist and runs those steps first.
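The idea of working backwards from a requested file to the steps that produce its inputs can be sketched in a few lines of plain Python. This is a toy illustration only, not Snakemake's actual implementation: the `RULES` dictionary and the `plan` function are hypothetical, and the sketch looks only at whether files exist, ignoring details such as timestamps.

```python
# Toy model of the workflow: each rule maps input files to one output file.
# File names come from this tutorial; the resolver itself is illustrative only.
RULES = {
    "preprocess": {"input": ["data/AB_NYC_2019.csv"], "output": "output/cleaned_data.csv"},
    "analyze": {"input": ["output/cleaned_data.csv"], "output": "output/summary.csv"},
}


def plan(target, existing_files):
    """Return the rules to run, in order, to produce `target`."""
    if target in existing_files:
        return []  # nothing to do, the file is already there
    for name, rule in RULES.items():
        if rule["output"] == target:
            steps = []
            for dependency in rule["input"]:
                steps += plan(dependency, existing_files)  # resolve inputs first
            return steps + [name]
    raise FileNotFoundError(f"No rule produces {target}")


# Only the raw data exists, so asking for the summary schedules both steps.
print(plan("output/summary.csv", {"data/AB_NYC_2019.csv"}))
# ['preprocess', 'analyze']
```

If `output/cleaned_data.csv` were already present, the same call would return only `['analyze']`.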

Before we get started, we can do a "dry run" to check the scheduling plan and see if the workflow is defined properly.


In [2]:
!snakemake -n

Building DAG of jobs...
Job stats:
job           count
----------  -------
all               1
analyze           1
preprocess        1
visualize         1
total             4

[Fri Jan 10 11:20:57 2025]
rule preprocess:
    input: data/AB_NYC_2019.csv
    output: output/cleaned_data.csv
    jobid: 3
    reason: Missing output files: output/cleaned_data.csv
    resources: tmpdir=/tmp

[Fri Jan 10 11:20:57 2025]
rule analyze:
    input: output/cleaned_data.csv
    output: output/summary.csv
    jobid: 2
    reason: Missing output files: output/summary.csv; Input files updated by another job: output/cleaned_data.csv
    resources: tmpdir=/tmp

[Fri Jan 10 11:20:57 2025]
rule visualize:
    input: output/summary.csv
    output: output/visualizations/price_distribution.png, output/visualizations/availability_by_neighborhood.png, output/visualizations/reviews_per_month.png
    jobid: 1
    reason:


For example, try running the code below to `analyze` the data before `preprocess`. Snakemake will take care of it and run both steps for you.

In [6]:
!snakemake --cores 1 analyze

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job           count
----------  -------
analyze           1
preprocess        1
total             2

Select jobs to execute...

[Thu Jan  9 12:30:28 2025]
rule preprocess:
    input: data/AB_NYC_2019.csv
    output: output/cleaned_data.csv
    jobid: 1
    reason: Missing output files: output/cleaned_data.csv
    resources: tmpdir=/tmp

[Thu Jan  9 12:30:29 2025]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...

[Thu Jan  9 12:30:29 2025]
rule analyze:
    input: output/cleaned_data.csv
    output: output/summary.csv
    jobid: 0
    reason: Missing output files: output/summary.csv; Input files updated by another job: output/cleaned_data.csv
    resources: tmpd

We can also render the workflow as a directed acyclic graph (DAG) by piping the output of `--dag` to Graphviz's `dot`:

In [1]:
!snakemake --dag --cores 1 | dot -Tpng > airbnb_dag.png

Building DAG of jobs...


![](airbnb_dag.png)

To understand the status of our workflow as seen by Snakemake, we can use the `--summary` option. It reports the status of each output file, the rule that produces it, and whether Snakemake plans to update it.

In [1]:
!snakemake --summary

Building DAG of jobs...
output_file	date	rule	version	log-file(s)	status	plan
output/visualizations/price_distribution.png	-	visualize	-	-	missing	update pending
output/visualizations/availability_by_neighborhood.png	-	visualize	-	-	missing	update pending
output/visualizations/reviews_per_month.png	-	visualize	-	-	missing	update pending
output/summary.csv	-	analyze	-	-	missing	update pending
output/cleaned_data.csv	-	preprocess	-	-	missing	update pending
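
Since the summary is a tab-separated table, it is easy to post-process. The sketch below assumes the table has been captured as text (here two rows from the output above are embedded directly) and uses Python's standard `csv` module to list the files that are still pending. The variable names are illustrative.

```python
import csv
import io

# Header and two rows copied from the summary above; columns are tab-separated.
summary_text = "\n".join([
    "\t".join(["output_file", "date", "rule", "version", "log-file(s)", "status", "plan"]),
    "\t".join(["output/summary.csv", "-", "analyze", "-", "-", "missing", "update pending"]),
    "\t".join(["output/cleaned_data.csv", "-", "preprocess", "-", "-", "missing", "update pending"]),
])

# Parse the table and keep the files that Snakemake still plans to update
reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
pending = [row["output_file"] for row in reader if row["plan"] == "update pending"]

print(pending)
# ['output/summary.csv', 'output/cleaned_data.csv']
```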


![](dag.svg)