In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../..")
from spiritlm.model.spiritlm_model import Spiritlm, OutputModality, GenerationInput, ContentType

from transformers import GenerationConfig
import IPython.display as ipd

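def display_outputs(outputs):
    """Print text outputs and render speech outputs as audio players."""
    for output in outputs:
        if output.content_type == ContentType.TEXT:
            print(output.content)
        else:
            # Speech content is played back at 16 kHz.
            ipd.display(ipd.Audio(output.content, rate=16_000))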



SPIRIT-LM comes in two variants, `SPIRIT-LM-BASE` and `SPIRIT-LM-EXPRESSIVE`. Both are fine-tuned from the 7B Llama 2 model on text-only, speech-only, and aligned speech+text datasets.

Compared to `SPIRIT-LM-BASE`, `SPIRIT-LM-EXPRESSIVE` captures not only the semantics of the speech but also its **expressivity**.

## `SPIRIT-LM-BASE`

In [3]:
spirit_lm = Spiritlm("spirit-lm-base-7b")

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.03it/s]


Loaded InferenceHubertModel from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/hubert_25hz/mhubert_base_25hz.pt'
Loaded RobustQuantizer from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/hubert_25hz/km500_L11_robust.pt'




Removing weight norm...
Loaded CodeHiFiGAN checkpoint from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/hifigan_spirit_base/generator.pt'


### Generation

The `interleaved_inputs` argument of the `generate` function is a list whose elements are either
- a `GenerationInput` composed of `content_type` and `content`, or
- a tuple of (`'speech'`/`'text'`, `content`).

The inputs are interleaved in the order in which they appear in the list.
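As a minimal sketch of interleaving, the list below places a speech segment before a text segment, using the tuple form. The helper name `build_interleaved_inputs` and the text fragment are illustrative assumptions, not part of the library; the audio path is the example file used later in this notebook.

```python
def build_interleaved_inputs(audio_path, text):
    """Return a speech segment followed by a text segment, in prompt order.

    Hypothetical helper for illustration: it only builds the list and
    neither touches the model nor reads the audio file.
    """
    return [
        ("speech", audio_path),  # first segment of the prompt
        ("text", text),          # appended after the speech segment
    ]


inputs = build_interleaved_inputs(
    "../audio/7143-88743-0029.flac",
    "The speaker seems to be",
)

# With a loaded model, the list would be passed as:
# spirit_lm.generate(interleaved_inputs=inputs, output_modality="text",
#                    max_new_tokens=20, do_sample=False)
```

Swapping the two tuples would put the text first and the speech second, since the list order is the prompt order.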

`output_modality` controls the output modality:
- To generate only text, set it to `OutputModality.TEXT` or `'text'`.
- To generate only speech, set it to `OutputModality.SPEECH` or `'speech'`.
- To leave the modality of the generation unconstrained, use `OutputModality.ARBITRARY` or `'arbitrary'`.

The output of generation is also a list (of `GenerationOuput`). When `output_modality` is `OutputModality.TEXT` or `OutputModality.SPEECH`, the list should have only one element.
When `output_modality` is `OutputModality.ARBITRARY`, the list can contain multiple elements of different types (`ContentType.TEXT` or `ContentType.SPEECH`).
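Because an arbitrary-modality result may mix text and speech elements, it can help to separate them before further processing. The sketch below is one way to do that; `DemoContentType` and `DemoOutput` are stand-ins defined here so the snippet runs without the model, and `split_by_modality` is a hypothetical helper. It relies only on each element exposing `content` and a `content_type` enum member named `TEXT` or `SPEECH`, as described above.

```python
from collections import namedtuple
from enum import Enum


class DemoContentType(Enum):
    """Stand-in for the library's ContentType (assumption for this demo)."""
    TEXT = "TEXT"
    SPEECH = "SPEECH"


# Stand-in for a generation output: just content plus its type.
DemoOutput = namedtuple("DemoOutput", ["content", "content_type"])


def split_by_modality(outputs):
    """Return (text_contents, speech_contents), each in generation order."""
    texts, speeches = [], []
    for output in outputs:
        # Compare on the enum member name so any enum with TEXT/SPEECH works.
        if output.content_type.name == "TEXT":
            texts.append(output.content)
        else:
            speeches.append(output.content)
    return texts, speeches


demo = [
    DemoOutput("hello", DemoContentType.TEXT),
    DemoOutput([0.0, 0.1], DemoContentType.SPEECH),  # placeholder samples
    DemoOutput("world", DemoContentType.TEXT),
]
texts, speeches = split_by_modality(demo)
```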

The generation arguments can either be passed through `generation_config=GenerationConfig(args)` or directly in `generate(args)`.

For a full list of generation arguments, see:
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
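Since generation arguments can be passed directly to `generate`, the sampling settings that recur throughout this notebook can be kept in one dictionary and unpacked at each call. This is only a sketch: `SAMPLING_KWARGS` and `sample_from` are hypothetical names, and `model` is assumed to be any object whose `generate` accepts the keyword arguments shown in this notebook.

```python
# Sampling settings reused by most cells in this notebook.
SAMPLING_KWARGS = {
    "temperature": 0.8,
    "top_p": 0.95,
    "do_sample": True,
}


def sample_from(model, interleaved_inputs, output_modality, max_new_tokens, **overrides):
    """Call model.generate with the shared sampling settings.

    Keyword arguments in `overrides` take precedence over SAMPLING_KWARGS.
    """
    generation_kwargs = {**SAMPLING_KWARGS, **overrides}
    return model.generate(
        interleaved_inputs=interleaved_inputs,
        output_modality=output_modality,
        max_new_tokens=max_new_tokens,
        **generation_kwargs,
    )


# Example call, assuming `spirit_lm` has been loaded as above:
# outputs = sample_from(spirit_lm, [("text", "Hello")], "text", max_new_tokens=30)
```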

Note that the following two commands give the same outputs:

In [4]:
spirit_lm.generate(
    interleaved_inputs=[
        GenerationInput(
            content="The largest country in the world is",
            content_type=ContentType.TEXT,
        )
    ],
    output_modality=OutputModality.TEXT,
    generation_config=GenerationConfig(
        max_new_tokens=20,
        do_sample=False,
    ),
)

[GenerationOuput(content='Russia. Russia is a country that is located in the northern part of the Eurasian continent.', content_type=<ContentType.TEXT: 'TEXT'>)]

In [5]:
spirit_lm.generate(
    interleaved_inputs=[('text', "The largest country in the world is")],
    output_modality='text',
    max_new_tokens=20,
    do_sample=False,
)

[GenerationOuput(content='Russia. Russia is a country that is located in the northern part of the Eurasian continent.', content_type=<ContentType.TEXT: 'TEXT'>)]

#### T -> T generation

In [6]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "The largest country in the world is")],
    output_modality='text',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=30,
        do_sample=True,
    ),
)
display_outputs(outputs)

India and a country with diverse landscapes, cultures and languages. The official language of India is English and the currency is Rupee.


#### T -> S generation

In [7]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "Tell me a story about the model spirit-lm")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)

#### S -> T generation

When the `content` is speech, we accept several types:
1) The audio `Path`: e.g., `"examples/audio/7143-88743-0029.flac"` or `Path("examples/audio/7143-88743-0029.flac")`
2) The audio `bytes`: e.g., `open("examples/audio/7143-88743-0029.flac", "rb").read()`
3) The audio `Tensor`: e.g., `torchaudio.load("examples/audio/7143-88743-0029.flac")[0].squeeze(0)`
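The first two forms can be produced with the standard library alone, as in the sketch below. `speech_content_variants` is a hypothetical helper written for this illustration; the tensor form is omitted because it needs `torchaudio`, as shown in item 3.

```python
from pathlib import Path


def speech_content_variants(audio_file):
    """Return the same audio file as a string path, a Path, and raw bytes.

    Hypothetical helper: any one of the returned values could be used as
    the `content` of a speech input.
    """
    path = Path(audio_file)
    return {
        "str_path": str(path),       # form 1, as a string
        "path": path,                # form 1, as a pathlib.Path
        "bytes": path.read_bytes(),  # form 2, the encoded file contents
    }


# Example, assuming the audio file exists relative to the notebook:
# variants = speech_content_variants("../audio/7143-88743-0029.flac")
# inputs = [("speech", variants["bytes"])]
```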

In [8]:
ipd.Audio("../audio/7143-88743-0029.flac")

In [9]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('speech', "../audio/7143-88743-0029.flac")],
    output_modality='text',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=30,
        do_sample=True,
    ),
)
display_outputs(outputs)

if you are not in a hurry for i want to see how it is done and you'd better make it a proper one a proper one


#### S -> S generation

In [10]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('speech', "../audio/7143-88743-0029.flac")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)

#### Arbitrary generation

In [11]:
interleaved_outputs = spirit_lm.generate(
    interleaved_inputs=[('speech', "../audio/7143-88743-0029.flac")],
    output_modality='arbitrary',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(interleaved_outputs)

 but as to getting it down you'll have to hold me up the hedge we could do that said stephen we could hold you up on the other side and reach it to you and then you could step down into the road and do it while we


 i'll get you on your feet if you open them before the time


#### Specify the prompt by a string of tokens

This can be useful when constructing few-shot prompts.

Note that when `prompt` is given, `interleaved_inputs` is not used.

In [12]:
outputs = spirit_lm.generate(
    prompt="[St71][Pi39][Hu99][Hu49][Pi57][Hu38][Hu149][Pi48][Hu71][Hu423][Hu427][Pi56][Hu492][Hu288][Pi44][Hu315][Hu153][Pi42][Hu389][Pi59][Hu497][Hu412][Pi51][Hu247][Hu354][Pi44][Hu7][Hu96][Pi43][Hu452][Pi0][Hu176][Hu266][Pi54][St71][Hu77][Pi13][Hu248][Hu336][Pi39][Hu211][Pi25][Hu166][Hu65][Pi58][Hu94][Hu224][Pi26][Hu148][Pi44][Hu492][Hu191][Pi26][Hu440][Pi13][Hu41][Pi20][Hu457][Hu79][Pi46][Hu382][Hu451][Pi26][Hu332][Hu216][Hu114][Hu340][St71][Pi40][Hu478][Hu74][Pi26][Hu79][Hu370][Pi56][Hu272][Hu370][Pi51][Hu53][Pi14][Hu477][Hu65][Pi46][Hu171][Hu60][Pi41][Hu258][Hu111][Pi40][Hu338][Hu23][Pi39][Hu338][Hu23][Hu338][St71][Pi57][Hu7][Hu338][Hu149][Pi59][Hu406][Hu7][Hu361][Hu99][Pi20][Hu209][Hu479][Pi35][Hu50][St71][Hu7][Hu149][Pi55][Hu35][Pi13][Hu130][Pi3][Hu169][Pi52][Hu72][Pi9][Hu434][Hu119][Hu272][Hu4][Pi20][Hu249][Hu245][Pi57][Hu433][Pi56][Hu159][Hu294][Hu139][Hu359][Hu343][Hu269][Hu302][St71][Hu226][Pi32][Hu370][Hu216][Pi39][Hu459][Hu424][Pi57][Hu226][Pi46][Hu382][Hu7][Pi27][Hu58][Hu138][Pi20][Hu428][Hu397][Pi44][Hu350][Pi32][Hu306][Pi59][Hu84][Hu11][Hu171][Pi42][Hu60][Pi48][Hu314][Hu227][St71][Hu355][Pi56][Hu9][Hu58][Pi44][Hu138][Hu226][Pi25][Hu370][Hu272][Pi56][Hu382][Hu334][Pi26][Hu330][Hu176][Pi56][Hu307][Pi46][Hu145][Hu248][Pi56][Hu493][Hu64][Pi40][Hu44][Hu388][Pi39][Hu7][Hu111][Pi59][St71][Hu23][Hu481][Pi13][Hu149][Pi15][Hu80][Hu70][Pi47][Hu431][Hu457][Pi13][Hu79][Pi27][Hu249][Pi55][Hu245][Pi54][Hu433][Pi36][Hu316][Pi53][Hu180][Pi3][Hu458][Pi26][Hu86][St71][Pi43][Hu225][Pi49][Hu103][Hu60][Pi3][Hu96][Hu119][Pi39][Hu129][Pi41][Hu356][Hu218][Pi14][Hu4][Hu259][Pi56][Hu392][Pi46][Hu490][Hu75][Pi14][Hu488][Hu166][Pi46][Hu65][Hu171][Pi40][Hu60][Hu7][Hu54][Pi39][Hu85][St83][Pi40][Hu361]",
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)
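When assembling token-string prompts by hand, it can help to inspect them first. The sketch below splits a string of bracketed tokens into (prefix, index) pairs and counts tokens per prefix. The function names are hypothetical, and the reading of the `Hu`, `Pi`, and `St` prefixes as HuBERT, pitch, and style tokens is an assumption that this notebook does not state.

```python
import re
from collections import Counter

# Matches tokens of the form [Hu99], [Pi39], [St71]: letters, then digits.
TOKEN_PATTERN = re.compile(r"\[([A-Za-z]+)(\d+)\]")


def parse_token_string(prompt):
    """Return a list of (prefix, index) pairs in the order they appear."""
    return [(prefix, int(index)) for prefix, index in TOKEN_PATTERN.findall(prompt)]


def count_by_prefix(prompt):
    """Count how many tokens carry each prefix."""
    return Counter(prefix for prefix, _ in parse_token_string(prompt))


# The first six tokens of the prompt used above.
sample = "[St71][Pi39][Hu99][Hu49][Pi57][Hu38]"
print(parse_token_string(sample)[:3])  # [('St', 71), ('Pi', 39), ('Hu', 99)]
print(count_by_prefix(sample))         # Counter({'Hu': 3, 'Pi': 2, 'St': 1})
```

Text that does not match the bracketed pattern is ignored by this parser, so it is only suitable for prompts made up entirely of such tokens.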

## `SPIRIT-LM-EXPRESSIVE`

In [3]:
spirit_lm = Spiritlm("spirit-lm-expressive-7b")

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.10it/s]


Loaded InferenceHubertModel from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/hubert_25hz/mhubert_base_25hz.pt'
Loaded RobustQuantizer from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/hubert_25hz/km500_L11_robust.pt'




Using 'fcpe' f0 extractor method (choices are: ['pyaapt', 'fcpe'])
VQVAE model loaded from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/f0/f0interp_outfeats_norm_vuv2_l1loss_64bins_ds16/checkpoint_best.pt'!


Some weights of Wav2Vec2StyleEncoder were not initialized from the model checkpoint at /private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/style_encoder and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Removing weight norm...
Loaded CodeHiFiGAN checkpoint from '/private/home/ntuanh/Projects/parrot/packages/spiritlm/examples/speech_generation/../../checkpoints/speech_tokenizer/hifigan_spirit_expressive/generator.pt'


In [6]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "I am so deeply saddened, it feels as if my heart is shattering into a million pieces and I can't hold back the tears that are streaming down my face.")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)

In [7]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "Wow!!! Congratulations!!! I'm so excited that")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)