# Preamble: Slides via RISE

This notebook is a set of slides made using `jupyterlab-rise`, installed via   
 
```bash
$ uv pip install jupyterlab_rise
```

When opened in `jupyterlab`, press `Ctrl+R` (or `Option+R` on Mac) to render.

P.S.- Should have just used Google Slides. Spent way too much time mucking about with CSS!

# flocoder: Project Overview / Recruitment Drive 

### Scott H. Hawley 
- Professor of Physics, Belmont University, Nashville TN USA
- Head of Research, Hyperstate Music AI
- Former Technical Fellow, Stability AI (co-author on Stable Audio paper)

ICLR 2025 "Best Blog Post": "**Flow With What You Know**: basic physics provides a 'straight, fast' way to get up to speed on flow-based generative models"

<center>
<a href="https://iclr.cc/media/PosterPDFs/ICLR%202025/31364.png?t=1745186162.1069727"><img src="https://iclr.cc/media/PosterPDFs/ICLR%202025/31364.png?t=1745186162.1069727" width="40%"></a></center>

*Apr 23, 2025, SUTD*


# (My) Prior Work: Experimenting with Controllable MIDI Generation

My experience with LLM-like, Transformer-based MIDI generation models was underwhelming:

* Compositions didn't "sound great"
* They quickly devolved
* Even with lots of work & tweaking
* Limited/no options for controlling outputs

# (My) Prior Work: Pictures of MIDI

Idea: GUI for controllable gen: users draw an inpainting mask of *roughly* where they want notes to go. Have the system generate notes that fit. (Not a new idea, just new-to-me)

["Pictures of MIDI"](https://arxiv.org/abs/2407.01499):
<center>
  <img src="pom_mot_idea.png" width="35%" style="margin: 0; padding: 0;">
</center>

Operate on piano-roll *image representations* of MIDI. Based on ["Polyffusion" (Min et al., 2023)](https://arxiv.org/abs/2307.10304), using code from [HDiT (Crowson et al., 2024)](https://arxiv.org/abs/2401.11605)

Worked great!<sup>1</sup> *Amazingly easy* to do: melody, accompaniment, extension,... 
* HF Spaces Demo: https://huggingface.co/spaces/drscotthawley/PicturesOfMIDI
* Demo page: https://picturesofmidi.github.io/PicturesOfMIDI/
 


<div class="footnote"><sup>1</sup>...But big and slow</div>  

# Idea: Small (Interpretable?) Latent Flow Model


<center>
<img src="https://raw.githubusercontent.com/drscotthawley/flocoder/refs/heads/main/images/flow_schematic.jpg" width="30%"><br>
</center>

Piano roll (PR) images are mostly empty space. So compress to some small latent space. 
* Pretrained VQGAN/VQVAE<sup>1</sup> from Stable Diffusion yielded "janky" results. Probably needlessly "general".
* So train a custom VQGAN for MIDI PRs to get good compression.

Wish list: Get VQ latents to compress via repeated musical phrases, use encoder for *Motif Analysis*? (Not tried yet)

<div class="footnote"><sup>1</sup> Terminology: VQGAN is a VQVAE that has self-attention and is trained with adversarial loss.</div>   

# MIDI-VQGAN

Takes 128x128x3 RGB images, compresses them to 16x16x4: 4 codebooks of residual vector quantization (RVQ), 32 vectors per codebook

Would like better compression; this just worked for now.
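
A minimal sketch of what RVQ does with those numbers, not the flocoder code itself. Assumptions: the latent dimension `DIM` is made up, the codebooks are random stand-ins (real ones are learned), and the input images are taken to be 8 bits per channel for the size estimate.

```python
import math
import random

def nearest(codebook, v):
    """Index of the codebook vector closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)))

def rvq_encode(v, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual = list(v)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def rvq_decode(indices, codebooks):
    """The reconstruction is the sum of the chosen vector from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

rng = random.Random(0)
DIM = 8            # hypothetical latent dimension, not from the slides
N_CODEBOOKS = 4    # from the slide
N_VECTORS = 32     # from the slide

# Random stand-in codebooks; later stages get smaller vectors, as residuals shrink.
codebooks = [[[rng.gauss(0, 1.0 / 2 ** s) for _ in range(DIM)]
              for _ in range(N_VECTORS)] for s in range(N_CODEBOOKS)]

v = [rng.gauss(0, 1) for _ in range(DIM)]
idx = rvq_encode(v, codebooks)      # 4 integers, one per codebook
recon = rvq_decode(idx, codebooks)

# Rough size comparison: raw image bits vs. bits needed to store the indices.
raw_bits = 128 * 128 * 3 * 8
latent_bits = 16 * 16 * N_CODEBOOKS * math.log2(N_VECTORS)
ratio = raw_bits / latent_bits
print(f"{ratio:.1f}x")  # 76.8x
```

Each 16x16 position stores 4 indices of 5 bits each, so under these assumptions the whole latent is 5120 bits.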

98% reconstruction accuracy (F1 score) using POP909 dataset. (Tan line:)

<center>
<img src="https://cdn.discordapp.com/attachments/1336763902175744143/1338943564528357387/472747754_453113941185849_3823178967569815371_n.png?ex=680934de&is=6807e35e&hm=b773b727e8fa17189c34de26d39dc0d1b2cd5992626fad920fe0aaaa73e61d07&" width="40%"></center>
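
Why F1 rather than plain accuracy: piano rolls are mostly empty, so a model that predicts "nothing" everywhere already scores high on accuracy. A minimal sketch, assuming F1 is computed per pixel on binarized (note on/off) rolls. How the 98% figure was actually computed is not stated here, and the tiny rolls below are made up.

```python
def f1_and_accuracy(original, reconstructed):
    """Per-pixel F1 and accuracy for two binary piano rolls of equal shape."""
    tp = fp = fn = tn = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            if o and r:
                tp += 1
            elif r:
                fp += 1
            elif o:
                fn += 1
            else:
                tn += 1
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0   # two empty rolls count as a match
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return f1, accuracy

# Toy 3x4 rolls: 3 note pixels recovered, 1 missed, 1 spurious.
original      = [[0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [1, 1, 0, 0]]
reconstructed = [[0, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 0, 0, 0]]

f1, acc = f1_and_accuracy(original, reconstructed)
print(f1)  # 0.75
```

Accuracy on the same toy pair is 10/12, which looks better than the F1 only because of the empty pixels.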


# Flow Model *Seems* Good, Decoding seems *Meh* 

Pretty simple. Works ok, but the problem is that the decoded outputs look a bit "off": horizontal-only green lines get "steps"

<center>
  <div style="width: 70%; overflow: hidden; display: inline-block; position: relative;">
    <div style="width: 300%; margin-left: -100%; position: relative;">
      <img src="https://cdn.discordapp.com/attachments/1336763902175744143/1339058461274800230/image.png?ex=6808f71f&is=6807a59f&hm=1f6911aae81558fa2199e0844d8a92615029cfa180d0b942ca2bb613600b96dd&" 
           style="width: 100%; display: block;">
    </div>
    <div style="position: absolute; top: 10px; left: 0; right: 0; text-align: center; color: white; font-weight: bold; font-size:2em; text-shadow: 1px 1px 3px black;">Original</div>
    <div style="position: absolute; bottom: 10px; left: 0; right: 0; text-align: center; color: white; font-weight: bold; font-size:2em; text-shadow: 1px 1px 3px black;">Generated</div>
  </div>
</center>

# Distribution of Flow Gen vs. Original

Despite slight "mangling", distributions look very close

<center><img src="https://cdn.discordapp.com/attachments/1336763902175744143/1339037434498912286/image.png?ex=6808e38a&is=6807920a&hm=2d6dea3c256ab3ce8c19fb1f9462c24805c5298af9357069725372b354b2a6b7&" width="40%"></center>

# Q: Why *Decode* to Images at All?

* If the goal is to generate MIDI, then why bother decoding to images? 
* Why not just decode directly to MIDI? 

*A: Because I was just porting working image-diffusion code to a custom VQ-flow setup, and didn't think of it until recently.*

Current workflow seems maybe *absurdly* wasteful:

1. Get MIDI
2. Convert to PR Image
3. User draws mask
4. Encode
5. Flow
6. Quantize
7. Decode to PR Image (slower than I'd like)
8. Convert to MIDI

Only did it because the MIDI-image-inpainting method was so powerful and easy!
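
For concreteness, steps 2 and 8 of the list above can be sketched as below. Assumptions: notes are plain `(pitch, start_step, end_step)` tuples rather than a real MIDI file, and the roll is a single binary channel, not the RGB encoding the project uses.

```python
def notes_to_roll(notes, n_steps, n_pitches=128):
    """Step 2: rasterize notes into a binary pitch-by-time grid."""
    roll = [[0] * n_steps for _ in range(n_pitches)]
    for pitch, start, end in notes:
        for t in range(start, min(end, n_steps)):
            roll[pitch][t] = 1
    return roll

def roll_to_notes(roll):
    """Step 8: read each horizontal run of 'on' pixels back out as one note."""
    notes = []
    for pitch, row in enumerate(roll):
        start = None
        for t, on in enumerate(row + [0]):   # trailing 0 closes a note at the edge
            if on and start is None:
                start = t
            elif not on and start is not None:
                notes.append((pitch, start, t))
                start = None
    return notes

notes = [(60, 0, 4), (64, 4, 8), (67, 2, 6)]   # toy C-E-G fragment
roll = notes_to_roll(notes, n_steps=8)
recovered = roll_to_notes(roll)

on_pixels = sum(map(sum, roll))
empty_fraction = 1 - on_pixels / (128 * 8)
print(recovered == notes, round(empty_fraction, 3))
```

One caveat of this simplified encoding: two back-to-back notes on the same pitch merge into one run, so a practical encoding needs some way to mark onsets.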

# Open Q: How to Use Latents for Motifs?

Two interests (possibly opposing): 

## 1. Interpretable Representation?

* MIDI *itself* is the compressed, interpretable representation
* Maybe some kind of CLIP-like contrastive loss / VICReg mapping between latents and (projected) MIDI?
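
A minimal sketch of the CLIP-like option: a symmetric InfoNCE-style loss that pulls each latent toward its own MIDI embedding and away from the others in the batch. Nothing here is tried or taken from flocoder. The embeddings are toy vectors, the temperature value is arbitrary, and the projection of MIDI into the same space is assumed to have happened already.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def mean_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(latents, midi_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    a = [normalize(v) for v in latents]
    b = [normalize(v) for v in midi_embs]
    # Cosine similarity of every latent with every MIDI embedding.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b]
              for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))

latents   = [[1.0, 0.0], [0.0, 1.0]]
matched   = [[2.0, 0.1], [0.1, 2.0]]   # pair i points the same way as latent i
scrambled = [[0.1, 2.0], [2.0, 0.1]]   # pairs swapped

loss_matched = contrastive_loss(latents, matched)
loss_scrambled = contrastive_loss(latents, scrambled)
print(loss_matched < loss_scrambled)  # True
```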

## 2. Compressing via Repeated Motifs? 
* BPE tends to multiply too many tokens
* (factorizable) n-grams are ok
* kernel-grams?
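
As a baseline for the n-gram option, a minimal sketch that counts repeated plain n-grams in a token sequence. The tokens here are bare MIDI pitch numbers for illustration; in the latent setting they would be VQ codebook indices. The "factorizable" and kernel-gram variants are not attempted.

```python
from collections import Counter

def repeated_ngrams(tokens, n, min_count=2):
    """Return every length-n window that occurs at least min_count times."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy melody in which the figure C-D-E (60, 62, 64) occurs three times.
pitches = [60, 62, 64, 60, 62, 64, 67, 65, 60, 62, 64]
motifs = repeated_ngrams(pitches, n=3)
print(motifs)  # {(60, 62, 64): 3}
```

Counting pitch differences instead of raw pitches would make the same function treat transposed copies of a phrase as one motif.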



...Any ideas? (Save for Discussion at end)



# Issue: Education vs. Performance

This also started as a *teaching* project, writing all code from scratch so "we" understand it all

But maybe best to repurpose others' codes (e.g. SD, Meta,...) for performance? 

# Thanks, Collab & Discussion
Thanks to you! And to Raymond Fan for helpful discussions. 

**Collaborators?** This *has been* largely a solo project but would *love* to make it a collaborative effort! Gladly share credits, can MIT-license the code,...  Grad students? ;-)

Discussion...