##  Efficiently consuming key points of lengthy documents through concise audio summaries.
Description: Develop a proof-of-concept solution to generate concise audio summaries of given documents. The objective is to help professionals quickly grasp essential details and implications. Break down the task into smaller steps and outline the actions needed to achieve this. 

-	Implement TTS – Text-to-Speech for below summary into using python.
-	Research different TTS - (Text-to-Speech) technologies and implement all your researched technologies for given summary.
-	TTS conversion should be contextual and concise enough to understand easily, instead of just word to word conversion.
-	Present your findings for implemented technologies for evaluation.


# Working on Text To Speech 

#### Experiment Texts 

In [3]:
introduction = '''
Introduction
This text discusses a judgment from the Supreme Court of India regarding a complaint filed under Section 138 of the Negotiable Instruments Act. The case involves a dispute over a cheque issued by the respondent, which was returned due to insufficient funds. The Trial Court initially dismissed the complaint, but the Supreme Court upheld it, finding that the cheque was indeed issued by the respondent.
'''
key_points= '''Key Points: 
1.The complaint was dismissed initially due to contradictions in the evidence regarding the number of apple cartons and the amount owed.
2.The High Court established that a cheque carries a presumption of consideration unless proven otherwise.
3.The burden of proof is on the accused to rebut the presumption of consideration by providing evidence or circumstances to show that no debt existed.
4.The court discusses the presumption of debt or liability under Section 139 of the Act and states that it may fail if the accused raises a probable defense.
5.The court emphasizes that the presumption under Section 139 is a device to prevent undue delay in litigation and that dishonoring a check is largely a civil wrong.
6.The respondent in this case failed to provide any evidence to rebut the presumption of consideration in issuing the cheque.
7.The courts below were criticized for dismissing the complaint based on discrepancies in the determination of the amount due.
8.The respondent is held guilty of dishonoring the cheque and is ordered to pay a fine and costs.
'''
conclution='''
Conclusion:In conclusion, the Supreme Court of India upheld a complaint filed under Section 138 of the Negotiable Instruments Act. The court found that the cheque was issued by the respondent and criticized the lower courts for dismissing the complaint based on discrepancies in the evidence. The court emphasized the presumption of consideration under Section 139 and held the respondent guilty of dishonoring the cheque. The respondent was ordered to pay a fine and costs.
'''

### Coqui TTS

In [2]:
!tts --list_models


 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/xtts_v2
 2: tts_models/multilingual/multi-dataset/xtts_v1.1
 3: tts_models/multilingual/multi-dataset/your_tts
 4: tts_models/multilingual/multi-dataset/bark
 5: tts_models/bg/cv/vits
 6: tts_models/cs/cv/vits
 7: tts_models/da/cv/vits
 8: tts_models/et/cv/vits
 9: tts_models/ga/cv/vits
 10: tts_models/en/ek1/tacotron2
 11: tts_models/en/ljspeech/tacotron2-DDC
 12: tts_models/en/ljspeech/tacotron2-DDC_ph
 13: tts_models/en/ljspeech/glow-tts
 14: tts_models/en/ljspeech/speedy-speech
 15: tts_models/en/ljspeech/tacotron2-DCA
 16: tts_models/en/ljspeech/vits
 17: tts_models/en/ljspeech/vits--neon
 18: tts_models/en/ljspeech/fast_pitch
 19: tts_models/en/ljspeech/overflow
 20: tts_models/en/ljspeech/neural_hmm
 21: tts_models/en/vctk/vits
 22: tts_models/en/vctk/fast_pitch
 23: tts_models/en/sam/tacotron-DDC
 24: tts_models/en/blizzard2013/capacitron-t2-c50
 25: tts_models/en/blizzard2013/capacitron-t2-c15

In [3]:
from TTS.utils.downloaders import download_ljspeech
download_ljspeech("../recipes/ljspeech/")


KeyboardInterrupt



In [2]:
import torch
from TTS.api import TTS

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:

# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(text=introduction, speaker_wav="My_voice.wav", language="en", file_path="output.wav")

 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.


  from .autonotebook import tqdm as notebook_tqdm


 > Using model: xtts
 > Text splitted to sentences.
['Introduction', 'This text discusses a judgment from the Supreme Court of India regarding a complaint filed under Section 138 of the Negotiable Instruments Act.', 'The case involves a dispute over a cheque issued by the respondent, which was returned due to insufficient funds.', 'The Trial Court initially dismissed the complaint, but the Supreme Court upheld it, finding that the cheque was indeed issued by the respondent.']
 > Processing time: 195.3288049697876
 > Real-time factor: 6.209917628240946


'output.wav'

In [4]:
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False).to(device)

# Run TTS
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path="./AudioFiles/coqui_TTS/tacotron2-DDC_output.wav")

 > tts_models/de/thorsten/tacotron2-DDC is already downloaded.
 > vocoder_models/de/thorsten/hifigan_v1 is already downloaded.
 > Using model: tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:None
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024


Exception:  [!] No espeak backend found. Install espeak-ng or espeak to your system.

### TensorFlow TTS

In [None]:
git clone https://github.com/TensorSpeech/TensorFlowTTS.git
cd TensorFlowTTS
pip install .

In [17]:
model_name = 'tensorspeech/tts-tacotron2-ljspeech-en'
# model_name = 'tensorspeech/tts-melgan-ljspeech-en'
# model_name = 'tensorspeech/tts-mb_melgan-ljspeech-en'
# model_name = 'tensorspeech/tts-fastspeech2-ljspeech-en'
# model_name = 'tensorspeech/tts-fastspeech-ljspeech-en'

In [8]:
import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")


# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")


# inference
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

input_ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
# fastspeech inference

mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios =tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios =tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

2024-02-12 13:17:27.755197: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-12 13:17:27.755273: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-12 13:17:27.863146: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-12 13:17:28.071439: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf

RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf

RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf

SystemError: initialization of _pywrap_checkpoint_reader raised unreported exception

### ESPNet 

In [7]:
import espnet
import espnet2

In [9]:
from espnet2.tts.espnet_model import ESPnetTTSModel

In [None]:
'espnet/kan-bayashi_ljspeech_vits'

In [10]:
import soundfile
from espnet2.bin.tts_inference import Text2Speech

text2speech = Text2Speech.from_pretrained("model_name")
speech = text2speech("foobar")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")

Help on class ESPnetTTSModel in module espnet2.tts.espnet_model:

class ESPnetTTSModel(espnet2.train.abs_espnet_model.AbsESPnetModel)
 |  ESPnetTTSModel(feats_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], pitch_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], energy_extract: Optional[espnet2.tts.feats_extract.abs_feats_extract.AbsFeatsExtract], normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], pitch_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], energy_normalize: Optional[espnet2.layers.inversible_interface.InversibleInterface], tts: espnet2.tts.abs_tts.AbsTTS)
 |  
 |  ESPnet model for text-to-speech task.
 |  
 |  Method resolution order:
 |      ESPnetTTSModel
 |      espnet2.train.abs_espnet_model.AbsESPnetModel
 |      torch.nn.modules.module.Module
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, feats_extract: O

### Hugging face Opensource models 

In [19]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

##### Approach 1
**Model** : "microsoft/speecht5_tts"            <br>
**Datasets** : "Matthijs/cmu-arctic-xvectors"   <br>
Resource : Hugging Face 


In [13]:
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
speech = synthesiser(introduction,forward_params={"speaker_embeddings": speaker_embedding})
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])



In [4]:
import numpy.random as rnd
random_array = rnd.choice(list(range(2800)),600)
len(random_array)

600

In [24]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset, Audio
import torch
import soundfile as sf
from datasets import load_dataset
# Total Sample = 7931 on `Matthijs/cmu-arctic-xvectors` dataset
OF_SET_SAMPLE = 2000

# Initiate Model Text Processor 
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
# Initiate Speech Generator
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
# Initiate Vocoder
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
inputs = processor(text=introduction, return_tensors="pt")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Creating a Random sample from 7931 dataset of sample size 600
import numpy.random as rnd
random_array = rnd.choice(range(len(embeddings_dataset)-OF_SET_SAMPLE),600)


for i in random_array:  
    try:
        speaker_embeddings = torch.tensor(embeddings_dataset[OF_SET_SAMPLE+int(i)]["xvector"]).unsqueeze(0)
    except Exception as e:
        print(e)
        continue
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    filename=f"AudioFiles/Microsoft_speechT5/cmu-arctic-xvectors_introduction_{OF_SET_SAMPLE+i}.wav"
    sf.write(filename, speech.numpy(), samplerate=16000)




KeyboardInterrupt: 

In [25]:
len(embeddings_dataset)

7931

In [25]:
print(type(processor))
print(type(model))
print(type(vocoder))
print(type(inputs))
print(type(embeddings_dataset))
print(type(speaker_embeddings))
print(type(speech))

<class 'transformers.models.speecht5.processing_speecht5.SpeechT5Processor'>
<class 'transformers.models.speecht5.modeling_speecht5.SpeechT5ForTextToSpeech'>
<class 'transformers.models.speecht5.modeling_speecht5.SpeechT5HifiGan'>
<class 'transformers.tokenization_utils_base.BatchEncoding'>
<class 'datasets.arrow_dataset.Dataset'>
<class 'torch.Tensor'>
<class 'torch.Tensor'>


In [24]:
len(embeddings_dataset)

7931

#####  Approach 2
**Model** : "suno/bark"    <br>
Resource : Hugging Face 

In [26]:
from transformers import pipeline
import scipy

synthesiser = pipeline("text-to-speech", "suno/bark")

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})

scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])


config.json:   0%|          | 0.00/8.81k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


pytorch_model.bin:   0%|          | 0.00/4.49G [00:00<?, ?B/s]



generation_config.json:   0%|          | 0.00/4.91k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/353 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.92M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


error: argument out of range

In [8]:
from scipy import io
from torch import cuda 
from transformers import AutoProcessor, BarkModel    

device = "cuda" if cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained('suno/bark')
model = BarkModel.from_pretrained('suno/bark')
model.to(device)

# voice_preset = "v2/en_speaker_6"
voice_preset = "v2/en_speaker_9"

inputs=processor (introduction, voice_preset=voice_preset)
audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()


sample_rate= model.generation_config.sample_rate
io.wavfile.write("bark_out.wav", rate=sample_rate, data=audio_array)


en_speaker_9_semantic_prompt.npy:   0%|          | 0.00/3.06k [00:00<?, ?B/s]

en_speaker_9_coarse_prompt.npy:   0%|          | 0.00/8.94k [00:00<?, ?B/s]

en_speaker_9_fine_prompt.npy:   0%|          | 0.00/17.8k [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


### Google Text To Speech (gTTS)

In [39]:
from gtts import gTTS 
tts = gTTS(introduction)
tts.save('g_introduction.wav')

In [41]:
from gtts import lang 
print(list(lang.tts_langs())[:15])

from gtts import  accents
print(accents.accents)

from gtts import tokenizer
dir(tokenizer)

from gtts.tokenizer import tokenizer_cases
dir(tokenizer_cases)

# tts = gTTS(introduction, lang='en')
# tts.save('g_introduction.wav')

['af', 'ar', 'bg', 'bn', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fi', 'fr']
['com', 'ad', 'ae', 'com.af', 'com.ag', 'com.ai', 'com.ar', 'as', 'at', 'com.au', 'az', 'ba', 'com.bd', 'be', 'bf', 'bg', 'bj', 'br', 'bs', 'bt', 'co.bw', 'by', 'com.bz', 'ca', 'cd', 'ch', 'ci', 'co.ck', 'cl', 'cm', 'cn', 'com.co', 'co.cr', 'cv', 'dj', 'dm', 'com.do', 'dz', 'com.ec', 'ee', 'com.eg', 'es', 'et', 'fi', 'com.fj', 'fm', 'fr', 'ga', 'ge', 'gg', 'com.gh', 'com.gi', 'gl', 'gm', 'gr', 'com.gt', 'gy', 'com.hk', 'hn', 'ht', 'hr', 'hu', 'co.id', 'ie', 'co.il', 'im', 'co.in', 'iq', 'is', 'it', 'iw', 'je', 'com.je', 'jo', 'co.jp', 'co.ke', 'com.kh', 'ki', 'kg', 'co.kr', 'com.kw', 'kz', 'la', 'com.lb', 'li', 'lk', 'co.ls', 'lt', 'lu', 'lv', 'com.ly', 'com.ma', 'md', 'me', 'mg', 'mk', 'ml', 'mm', 'mn', 'ms', 'com.mt', 'mu', 'mv', 'mw', 'com.mx', 'com.my', 'co.mz', 'na', 'ng', 'ni', 'ne', 'nl', 'no', 'com.np', 'nr', 'nu', 'co.nz', 'com.om', 'pa', 'pe', 'pg', 'ph', 'pk', 'pl', 'pn', 'com.pr', 'ps

['RegexBuilder',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'colon',
 'legacy_all_punctuation',
 'other_punctuation',
 'period_comma',
 'symbols',
 'tone_marks']

### Amazon Polly Text to Speech

### IBM Whatsonx Text to Speech

# Working With Summarization 

### Hugging Face Opensource models