# Text-to-Music Generation Model
* Source: https://huggingface.co/sander-wood/text-to-music


[Types of AI music generation models]

1. RNN-based: Generates music with recurrent neural networks such as LSTM and GRU. The model learns patterns from existing music data and uses them to generate new music.
   
   ex) Magenta, BachBot, DeepJ, GrooVAE, etc.

2. GAN-based: Generates music with a generative adversarial network. It consists of a Generator network and a Discriminator network: the Generator produces a piece of music, and the Discriminator judges whether that piece is real or generated.

   ex) MidiNet, MuseGAN, Amadeus Code, etc.

3. Transformer-based: Generates music with a Transformer model. Transformers are very effective at processing sequence data, so they can be applied to music generation as well.

   ex) Music Transformer, Pop Music Transformer, etc.

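[AutoModelForSeq2SeqLM]
- One of the model classes that the Hugging Face Transformers library provides for natural language processing (NLP). It instantiates a sequence-to-sequence (Seq2Seq) model with an encoder-decoder architecture.

- It reads the architecture and hyperparameters from the checkpoint's configuration and builds the matching model automatically, acting as a wrapper over a range of pretrained models (e.g. BART, T5, Marian). This removes most of the boilerplate of building a Seq2Seq model, so a model can be created with a few lines of code.

* This text-to-music generation model is BART-base fine-tuned on 282,870 English text-music pairs, and its output is music written in ABC notation.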
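
An ABC tune starts with header fields, one per line, followed by the note body. In the ABC standard, `X:` is the reference number, `L:` the unit note length, `M:` the meter, `Q:` the tempo, and `K:` the key. The following is a minimal sketch that splits a tune into those parts, assuming every header line is a single letter followed by a colon. `parse_abc_header` is a hypothetical helper, not part of any library, and the sample string is shortened from a tune generated later in this notebook.

```python
def parse_abc_header(tune):
    """Split an ABC tune into a dict of header fields and a list of bars.

    Assumption: header lines look like `K:C` (one letter, a colon, a value),
    and the body starts at the first line that does not match that pattern.
    """
    header, body_lines = {}, []
    in_header = True
    for line in tune.splitlines():
        if in_header and len(line) > 1 and line[0].isalpha() and line[1] == ":":
            header[line[0]] = line[2:].strip()
        else:
            in_header = False
            body_lines.append(line)
    # Rough bar split: repeat signs and endings are not handled here
    bars = [b.strip() for b in " ".join(body_lines).split("|") if b.strip()]
    return header, bars


tune = "X:1\nL:1/4\nM:4/4\nK:C\n C/D/ | E G G c | B A G2 |"
header, bars = parse_abc_header(tune)
print(header)  # header fields keyed by their letter
print(bars)    # one string per bar
```

For this sample, `header["M"]` is the meter `4/4` and `header["K"]` is the key `C`, which is enough to sanity-check a generated tune before pasting it into a player.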

In [None]:
!pip install samplings
!pip install transformers

import torch
from samplings import top_p_sampling, temperature_sampling
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('sander-wood/text-to-music')
model = AutoModelForSeq2SeqLM.from_pretrained('sander-wood/text-to-music')

max_length = 1024  # limit for both the input tokens and the generated tokens
top_p = 0.9        # nucleus (top-p) sampling threshold
temperature = 1.0  # sampling temperature

text = "This is a traditional Irish dance music."
input_ids = tokenizer(text,
                      return_tensors='pt',
                      truncation=True,
                      max_length=max_length)['input_ids']

decoder_start_token_id = model.config.decoder_start_token_id
eos_token_id = model.config.eos_token_id

# Decoding starts from the decoder start token
decoder_input_ids = torch.tensor([[decoder_start_token_id]])

for t_idx in range(max_length):
    # Re-run the model on everything generated so far (no gradients needed)
    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        decoder_input_ids=decoder_input_ids)
    # Next-token distribution, taken from the last decoder position
    probs = torch.softmax(outputs.logits[0][-1], dim=-1).numpy()
    # Top-p filtering followed by temperature sampling
    sampled_id = temperature_sampling(probs=top_p_sampling(probs,
                                                           top_p=top_p,
                                                           return_probs=True),
                                      temperature=temperature)
    decoder_input_ids = torch.cat((decoder_input_ids, torch.tensor([[sampled_id]])), 1)
    # Stop at end-of-sequence and print the tune, prefixed with its ABC reference number
    if sampled_id == eos_token_id:
        tune = "X:1\n"
        tune += tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(tune)
        break


X:1
L:1/8
Q:1/4=180
M:6/8
K:D
 A |:"D" dAF DFA |"G" Bed cBA |"D" dcd"A7" efg |"D" fdf"A7" ecA |"D" dAF DFA |"G" Bed cBA |
"D" dcd"A7" efg |1"D" fd"A7"c"D" d2 A :|2"D" fd"A7"c"D" d2 c |:"G" BcB Bcd |"A" ecA Ace |
"Bm" fgf"E7" fga |"A7" ecA"D" A2 =c |"G" BcB Bcd |"A" ecA Ace |"Bm" fgf"E7" fga |1"A" ec"D"A"D" d2 c :|2
"A" ec"D"A"D" d2 |]



In [None]:
text = "This is a traditional korean music"
input_ids = tokenizer(text, 
                      return_tensors='pt', 
                      truncation=True, 
                      max_length=max_length)['input_ids']

decoder_start_token_id = model.config.decoder_start_token_id
eos_token_id = model.config.eos_token_id

decoder_input_ids = torch.tensor([[decoder_start_token_id]])

for t_idx in range(max_length):
    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        decoder_input_ids=decoder_input_ids)
    probs = torch.softmax(outputs.logits[0][-1], dim=-1).numpy()
    sampled_id = temperature_sampling(probs=top_p_sampling(probs,
                                                           top_p=top_p,
                                                           return_probs=True),
                                      temperature=temperature)
    decoder_input_ids = torch.cat((decoder_input_ids, torch.tensor([[sampled_id]])), 1)
    # Stop at end-of-sequence and print the tune
    if sampled_id == eos_token_id:
        tune = "X:1\n"
        tune += tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(tune)
        break

X:1
L:1/4
M:4/4
K:C
 C/D/ | E G G c | B A G2 | c G/G/ E G | G3 C/D/ | E G G c | B A G c | c G/G/ E D/D/ | C3 C/D/ |
 E G G A | G E D2 | c G/G/ E G | G3 C/D/ | E G G c | B A G c | c G/G/ E D/D/ | C3 z |]
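
The generation cells in this notebook pick each next token with `top_p_sampling` followed by `temperature_sampling` from the `samplings` package. The sketch below shows, with the standard library only, what those two steps usually mean. It is not the package's actual implementation: the function names `top_p_filter`, `apply_temperature`, and `sample_token` and the toy distribution are made up for illustration, and the `p ** (1 / T)` form of temperature scaling is an assumption about how temperature is applied to probabilities.

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose total probability
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def apply_temperature(probs, temperature=1.0):
    """Rescale probabilities as p ** (1 / T) and renormalize.
    T < 1 sharpens the distribution, T > 1 flattens it, T = 1 keeps it as is."""
    scaled = [p ** (1.0 / temperature) if p > 0 else 0.0 for p in probs]
    norm = sum(scaled)
    return [s / norm for s in scaled]


def sample_token(probs, top_p=0.9, temperature=1.0, rng=random):
    """Draw one token id: top-p filter first, then temperature, then sample."""
    final = apply_temperature(top_p_filter(probs, top_p), temperature)
    return rng.choices(range(len(final)), weights=final, k=1)[0]


toy_probs = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution over 4 ids
filtered = top_p_filter(toy_probs, top_p=0.9)
token_id = sample_token(toy_probs, top_p=0.9, temperature=1.0,
                        rng=random.Random(0))
print(filtered, token_id)
```

With `top_p = 0.9`, the least likely token in the toy distribution is dropped and can never be sampled; lowering `top_p` or `temperature` makes the output more conservative, raising them makes it more varied.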



In [None]:
text = "Create fast and rhythmical jazz music"
input_ids = tokenizer(text, 
                      return_tensors='pt', 
                      truncation=True, 
                      max_length=max_length)['input_ids']

decoder_start_token_id = model.config.decoder_start_token_id
eos_token_id = model.config.eos_token_id

decoder_input_ids = torch.tensor([[decoder_start_token_id]])

for t_idx in range(max_length):
    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        decoder_input_ids=decoder_input_ids)
    probs = torch.softmax(outputs.logits[0][-1], dim=-1).numpy()
    sampled_id = temperature_sampling(probs=top_p_sampling(probs,
                                                           top_p=top_p,
                                                           return_probs=True),
                                      temperature=temperature)
    decoder_input_ids = torch.cat((decoder_input_ids, torch.tensor([[sampled_id]])), 1)
    # Stop at end-of-sequence and print the tune
    if sampled_id == eos_token_id:
        tune = "X:1\n"
        tune += tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(tune)
        break

X:1
L:1/8
M:3/4
K:C
 B2 [ce]4 | [ce]2 [Bd]4 | [Gd]2 [Be]4 | [Gd]2 [Fc]2 [GB]2 | [Dc]6 | [DGBdfa]6 | [CEG]2 [CFA]4 |
 [FGB]6 | [FAce]6 | [DEG]6 | [EGc]6 | [Bdf]6 | [EGBdfa]6 | [EGc]6 | [Bdf]6 | [EGBdfa]6 |
 [GABdfa]6 | [EGBdfa]6 | [EGBdfa]6 | [EGce]6 | [B,DFAce]6 | [B,DFAce]6 | [A,CEGBdfa]6 | [F,B,DGBdf]6 |
 [A,CEGBdfa]6 | [B,EGce]6 | [DFAce]6 | [A,CEGBdfa]6 | [B,Bdfa]6 | [EGc]6 | [DFAce]6 |
 [FAce]6 | [B,DFAce]6 | [CEGBdfa]6 | [B,Bdfa]6 | [G,B,DGBdfa]6 | [A,CEGBdfa]6 | [B,DFAce]6 |
 [G,B,DGBdfa]6 | [A,CEGBdfa]6 | [A,CEGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [B,Bdfa]6 | [A,B,DGBdfa]6 |
 [B,Bdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [B,Bdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 |
 [A,B,DGBdfa]6 | [B,Bdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [B,Bdfa]6 |
 [B,Bdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 |
 [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 |
 [A,B,DGBdfa]6 | [A,B,
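
The tune above collapses into a handful of chords repeated bar after bar. One rough way to quantify that kind of degeneration is the share of distinct bars in the tune body. This is an ad hoc sketch, not an established evaluation metric: `bar_repetition` is a hypothetical helper, and splitting on `|` ignores repeat signs and endings such as `|:`, `:|`, or `|1`. The sample body is a few bars picked from the tune above, not a contiguous excerpt.

```python
from collections import Counter


def bar_repetition(body):
    """Return (unique_ratio, most_common_bar, count) for an ABC tune body.

    unique_ratio is the number of distinct bars divided by the number of bars;
    values close to 0 mean the tune keeps repeating the same bars.
    """
    bars = [b.strip() for b in body.replace("\n", " ").split("|") if b.strip()]
    if not bars:
        return 1.0, "", 0
    counts = Counter(bars)
    bar, count = counts.most_common(1)[0]
    return len(counts) / len(bars), bar, count


body = ("B2 [ce]4 | [ce]2 [Bd]4 | [A,B,DGBdfa]6 | [A,B,DGBdfa]6 | "
        "[A,B,DGBdfa]6 | [B,Bdfa]6 |")
ratio, bar, count = bar_repetition(body)
print(f"{ratio:.2f} of bars are distinct; '{bar}' appears {count} times")
```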

In [None]:
!pip install pyautogui


In [None]:
text = "this is a Mozart-style classical music that calms and heals"
input_ids = tokenizer(text, 
                      return_tensors='pt', 
                      truncation=True, 
                      max_length=max_length)['input_ids']

decoder_start_token_id = model.config.decoder_start_token_id
eos_token_id = model.config.eos_token_id

decoder_input_ids = torch.tensor([[decoder_start_token_id]])

for t_idx in range(max_length):
    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        decoder_input_ids=decoder_input_ids)
    probs = torch.softmax(outputs.logits[0][-1], dim=-1).numpy()
    sampled_id = temperature_sampling(probs=top_p_sampling(probs,
                                                           top_p=top_p,
                                                           return_probs=True),
                                      temperature=temperature)
    decoder_input_ids = torch.cat((decoder_input_ids, torch.tensor([[sampled_id]])), 1)
    # Stop at end-of-sequence and print the tune
    if sampled_id == eos_token_id:
        tune = "X:1\n"
        tune += tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(tune)
        break

X:1
L:1/4
M:4/4
K:D
 [Fd] [Af] [Fd] [Af] | A e e [Ae] | d a a [Ag] | f e e2 | [Fd] d [Af] d | [Ae] c [Ae] c |
 [Fd] [Af] [Fd] [Af] | [Ae] c [Ae] c | [Fd] d [Fd] d | [Fd] f d f | A e e [Ae] | d a a [Ag] | f e e2 |
 [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] [Af] [Fd] [Af] | [Fd] f d f | A e e [Ae] | d a a [Ag] |
 f e e2 | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] [Af] [Fd] [Af] | [Ae] c [Ae] c | [Fd] [Af] [Fd] [Af] |
 A e e [Ae] | d a a [Ag] | f e e2 | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] d [Af] d | [Fd] f d f |
 A e e [Ae] | d a a [Ag] | f e e2 | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] f d f | [Fd] f d f |
 [Ae] c [Ae] c | [Fd] [Af] [Fd] [Af] | A e e [Ae] | d a a [Ag] | f e e2 | [Fd] d [Af] d |
 [Ae] c [Ae] c | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] d [Af] d |
 [Ae] c [Ae] c | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd] d [Af] d | [Ae] c [Ae] c | [Fd]4 |]



In [None]:
text = "This is a main soundtrack of an arcade game."
input_ids = tokenizer(text, 
                      return_tensors='pt', 
                      truncation=True, 
                      max_length=max_length)['input_ids']

decoder_start_token_id = model.config.decoder_start_token_id
eos_token_id = model.config.eos_token_id

decoder_input_ids = torch.tensor([[decoder_start_token_id]])

for t_idx in range(max_length):
    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        decoder_input_ids=decoder_input_ids)
    probs = torch.softmax(outputs.logits[0][-1], dim=-1).numpy()
    sampled_id = temperature_sampling(probs=top_p_sampling(probs,
                                                           top_p=top_p,
                                                           return_probs=True),
                                      temperature=temperature)
    decoder_input_ids = torch.cat((decoder_input_ids, torch.tensor([[sampled_id]])), 1)
    # Stop at end-of-sequence and print the tune
    if sampled_id == eos_token_id:
        tune = "X:1\n"
        tune += tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(tune)
        break

X:1
L:1/16
Q:1/4=70
M:3/4
K:G
 A6 FG A2>F2 | G6 EF G2>E2 | F6 DF A2>F2 | G6 EF G2>E2 | F4- FGA2 F2G2 | E8 D4 | G6 FG A2>F2 |
 G6 EF G2>E2 | F6 DF A2>F2 | G6 EF G2>E2 | F4- FG A4 F2 | G6 EF G2>E2 | F6 DA A2F2 | G6 EF G2>E2 |
 F6 DA A2F2 | G6 EF G2>E2 | F6 DA A2F2 | G8- G2 z2 |]



In [10]:
text = "This is a Chinese folk song from the Jiangnan region. It was created during the Qianlong era (1735-1796) of the Qing dynasty. Over time, many regional variations were created, and the song gained popularity both in China and abroad. One version of the song describes a custom of giving jasmine flowers, popular in the southern Yangtze delta region of China."
input_ids = tokenizer(text, 
                      return_tensors='pt', 
                      truncation=True, 
                      max_length=max_length)['input_ids']

decoder_start_token_id = model.config.decoder_start_token_id
eos_token_id = model.config.eos_token_id

decoder_input_ids = torch.tensor([[decoder_start_token_id]])

for t_idx in range(max_length):
    with torch.no_grad():
        outputs = model(input_ids=input_ids,
                        decoder_input_ids=decoder_input_ids)
    probs = torch.softmax(outputs.logits[0][-1], dim=-1).numpy()
    sampled_id = temperature_sampling(probs=top_p_sampling(probs,
                                                           top_p=top_p,
                                                           return_probs=True),
                                      temperature=temperature)
    decoder_input_ids = torch.cat((decoder_input_ids, torch.tensor([[sampled_id]])), 1)
    # Stop at end-of-sequence and print the tune
    if sampled_id == eos_token_id:
        tune = "X:1\n"
        tune += tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(tune)
        break

X:1
L:1/8
Q:1/4=60
M:4/4
K:F
 z4 DE F2 | F2 FD D2 z2 | d2 de f2 df | a2 g2 d4 | D4 DE F2 | F2 FD D2 z2 | d2 de f2 df |
 a2 g2 d4 | D4 DE F2 | F2 FD D2 z2 | c2 cc d2 cA | G2 G2 A2 GE | D2 DC D4- | D4 z4 | D2 DD C2 D2 |
 D2 GA c2 de | f2 dc A2 GA | c2 A2 G4- | G4 G2 A2 | c2 de d2 cd | c2 Ac AG F2- | F4 FG A2 |
 G2 FD F2 GF | D8 |]



In [14]:
!git clone https://github.com/sander-wood/text-to-music.git

Cloning into 'text-to-music'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 48 (delta 19), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (48/48), 2.85 MiB | 7.32 MiB/s, done.


In [18]:
!pip install unidecode
!python /content/text-to-music/run_inference.py -num_tunes 3 -max_length 1024 -top_p 0.9 -temperature 1.0 -seed 0

There are 1 GPU(s) available.
We will use the GPU: NVIDIA A100-SXM4-40GB 

##################### HYPERPARAMETERS ###################### 

num_tunes: 3
max_length: 1024
top_p: 0.9
temperature: 1.0
seed: 0

######################## INPUT TEXT ########################

this is a Mozart-style classical music that calms and heals

####################### OUTPUT TUNES #######################

X:1

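* Paste the printed output as-is into the website below to listen to the tune the model generated.

  https://ldzhangyx.github.io/abc/


* Which metrics can be used to evaluate the performance of the generation model?

* Are there models that output music in a format other than ABC notation?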