## Efficiently consuming the key points of lengthy documents through concise audio summaries
Description: Develop a proof-of-concept solution that generates concise audio summaries of given documents. The objective is to help professionals quickly grasp the essential details and implications. Break the task down into smaller steps and outline the actions needed to achieve it.

-	Implement text-to-speech (TTS) for the summary below using Python.
-	Research different TTS technologies and implement each of them for the given summary.
-	The TTS output should be contextual and concise enough to understand easily, rather than a word-for-word conversion.
-	Present findings on the implemented technologies for evaluation.
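The evaluation step in the list above could be supported by a small comparison harness. The following is a minimal sketch, not part of the original experiments: `compare_engines`, `fake_engine`, and `broken_engine` are hypothetical names, and each engine is assumed to be wrapped as a function that takes text and returns audio bytes.

```python
import time
from typing import Callable, Dict, List


def compare_engines(engines: Dict[str, Callable[[str], bytes]], text: str) -> List[dict]:
    """Run each engine once on `text` and record time, output size, and errors."""
    rows = []
    for name, synthesize in engines.items():
        start = time.perf_counter()
        try:
            audio = synthesize(text)
            row = {"engine": name, "ok": True, "bytes": len(audio), "error": ""}
        except Exception as exc:  # keep going so one failing engine does not hide the rest
            row = {"engine": name, "ok": False, "bytes": 0,
                   "error": f"{type(exc).__name__}: {exc}"}
        row["seconds"] = round(time.perf_counter() - start, 3)
        rows.append(row)
    return rows


# Hypothetical stand-ins; real entries would wrap gTTS, SpeechT5, Bark, and so on.
def fake_engine(text: str) -> bytes:
    return text.encode("utf-8")


def broken_engine(text: str) -> bytes:
    raise OSError("No space left on device")


results = compare_engines({"fake": fake_engine, "broken": broken_engine}, "Key Points")
for row in results:
    print(row)
```

Recording failures alongside successes keeps setup problems (disk space, dependency conflicts) visible in the same table as the engines that produced audio.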


# Working on Text To Speech 

#### Experiment Texts 

In [1]:
introduction = '''
Introduction: This text discusses a judgment from the Supreme Court of India regarding a complaint filed under Section 138 of the Negotiable Instruments Act. The case involves a dispute over a cheque issued by the respondent, which was returned due to insufficient funds. The Trial Court initially dismissed the complaint, but the Supreme Court upheld it, finding that the cheque was indeed issued by the respondent.
'''
key_points = '''Key Points:
1. The complaint was dismissed initially due to contradictions in the evidence regarding the number of apple cartons and the amount owed.
2. The High Court established that a cheque carries a presumption of consideration unless proven otherwise.
3. The burden of proof is on the accused to rebut the presumption of consideration by providing evidence or circumstances to show that no debt existed.
4. The court discusses the presumption of debt or liability under Section 139 of the Act and states that it may fail if the accused raises a probable defense.
5. The court emphasizes that the presumption under Section 139 is a device to prevent undue delay in litigation and that dishonoring a cheque is largely a civil wrong.
6. The respondent in this case failed to provide any evidence to rebut the presumption of consideration in issuing the cheque.
7. The courts below were criticized for dismissing the complaint based on discrepancies in the determination of the amount due.
8. The respondent is held guilty of dishonoring the cheque and is ordered to pay a fine and costs.
'''
conclusion = '''
Conclusion: The Supreme Court of India upheld a complaint filed under Section 138 of the Negotiable Instruments Act. The court found that the cheque was issued by the respondent and criticized the lower courts for dismissing the complaint based on discrepancies in the evidence. The court emphasized the presumption of consideration under Section 139 and held the respondent guilty of dishonoring the cheque. The respondent was ordered to pay a fine and costs.
'''
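Since the goal is speech that is easy to follow rather than a word-for-word reading, the numbered key points may benefit from a light rewrite before synthesis. The sketch below is one possible approach, assuming list items start with a number and a period; `to_spoken_text` and `ORDINALS` are illustrative names, not part of any TTS library.

```python
import re

ORDINALS = ["First", "Second", "Third", "Fourth", "Fifth",
            "Sixth", "Seventh", "Eighth", "Ninth", "Tenth"]


def to_spoken_text(text: str) -> str:
    """Turn a numbered list into sentences that read naturally aloud."""
    spoken = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r"^(\d+)\.\s*(.+)$", line)
        if match and 1 <= int(match.group(1)) <= len(ORDINALS):
            ordinal = ORDINALS[int(match.group(1)) - 1]
            body = match.group(2)
            # Lower-case a leading "The" so "First, the complaint ..." reads smoothly.
            if body.startswith("The "):
                body = "the " + body[4:]
            spoken.append(f"{ordinal}, {body}")
        else:
            # Lines without a list number (such as the heading) pass through unchanged.
            spoken.append(line)
    return " ".join(spoken)


sample = """Key Points:
1.The complaint was dismissed initially.
2.The respondent failed to rebut the presumption."""
print(to_spoken_text(sample))
```

The same function could be applied to `key_points` before it is handed to any of the engines below.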

### Coqui TTS

In [15]:
!tts --list_models


 Name format: type/language/dataset/model
 1: tts_models/multilingual/multi-dataset/xtts_v2 [already downloaded]
 2: tts_models/multilingual/multi-dataset/xtts_v1.1
 3: tts_models/multilingual/multi-dataset/your_tts
 4: tts_models/multilingual/multi-dataset/bark
 5: tts_models/bg/cv/vits
 6: tts_models/cs/cv/vits
 7: tts_models/da/cv/vits
 8: tts_models/et/cv/vits
 9: tts_models/ga/cv/vits
 10: tts_models/en/ek1/tacotron2
 11: tts_models/en/ljspeech/tacotron2-DDC
 12: tts_models/en/ljspeech/tacotron2-DDC_ph
 13: tts_models/en/ljspeech/glow-tts
 14: tts_models/en/ljspeech/speedy-speech
 15: tts_models/en/ljspeech/tacotron2-DCA
 16: tts_models/en/ljspeech/vits
 17: tts_models/en/ljspeech/vits--neon
 18: tts_models/en/ljspeech/fast_pitch
 19: tts_models/en/ljspeech/overflow
 20: tts_models/en/ljspeech/neural_hmm
 21: tts_models/en/vctk/vits
 22: tts_models/en/vctk/fast_pitch
 23: tts_models/en/sam/tacotron-DDC
 24: tts_models/en/blizzard2013/capacitron-t2-c50
 25: tts_models/en/blizzard2

In [1]:
from TTS.utils.downloaders import download_ljspeech
download_ljspeech("../recipes/ljspeech/")

OSError: [Errno 28] No space left on device: '../recipes'
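The failed download above fetches the LJSpeech dataset, which the training recipes use; synthesizing with one of the pretrained models listed earlier should not need it. The sketch below filters the listing for English models and wraps a synthesis call. It assumes the `TTS.api.TTS` class and its `tts_to_file` method from the Coqui package; the helper names and the output file name are hypothetical.

```python
from typing import Dict, List


def parse_model_name(name: str) -> Dict[str, str]:
    """Split a Coqui model id of the form type/language/dataset/model."""
    model_type, language, dataset, model = name.split("/")
    return {"type": model_type, "language": language, "dataset": dataset, "model": model}


def english_models(names: List[str]) -> List[str]:
    """Keep only the single-language English entries from a model listing."""
    return [n for n in names if parse_model_name(n)["language"] == "en"]


def synthesize_with_coqui(text: str, model_name: str, out_path: str) -> None:
    """Synthesize `text` to a WAV file with a pretrained Coqui model."""
    # Imported lazily so the helpers above work without the TTS package installed.
    from TTS.api import TTS
    tts = TTS(model_name=model_name)
    tts.tts_to_file(text=text, file_path=out_path)


listed = [
    "tts_models/multilingual/multi-dataset/xtts_v2",
    "tts_models/en/ljspeech/tacotron2-DDC",
    "tts_models/en/ljspeech/vits",
    "tts_models/bg/cv/vits",
]
candidates = english_models(listed)
print(candidates)
# Example call (downloads the model on first use, so free disk space is still needed):
# synthesize_with_coqui(introduction, candidates[0], "coqui_introduction.wav")
```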

### TensorFlow TTS

In [None]:
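# Shell commands need the "!" prefix in a notebook cell; %cd changes the notebook's working directory.
!git clone https://github.com/TensorSpeech/TensorFlowTTS.git
%cd TensorFlowTTS
!pip install .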

In [17]:
model_name = 'tensorspeech/tts-tacotron2-ljspeech-en'
# model_name = 'tensorspeech/tts-melgan-ljspeech-en'
# model_name = 'tensorspeech/tts-mb_melgan-ljspeech-en'
# model_name = 'tensorspeech/tts-fastspeech2-ljspeech-en'
# model_name = 'tensorspeech/tts-fastspeech-ljspeech-en'

In [3]:
import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")


# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")


# inference
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

input_ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
# fastspeech inference

mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios =tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios =tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
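Workaround 2 from the message above can be applied from Python, as sketched here. The variable has to be set before anything imports protobuf, so this would go in the first cell of a fresh kernel, ahead of `import tensorflow`. As the message notes, pure-Python parsing is slower; downgrading protobuf to 3.20.x (workaround 1) is the alternative.

```python
import os

# Must run before any module that loads protobuf (for example, tensorflow) is imported.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

print(os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"])
```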

### ESPnet

### Hugging Face open-source models

In [19]:
from huggingface_hub import notebook_login
notebook_login()


##### Approach 1
**Model** : "microsoft/speecht5_tts"            <br>
**Datasets** : "Matthijs/cmu-arctic-xvectors"   <br>
Resource : Hugging Face 


In [2]:
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch

# SpeechT5 needs a speaker embedding (x-vector) to choose the voice.
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Synthesize the key points and save them at the sampling rate the pipeline reports.
speech = synthesiser(key_points, forward_params={"speaker_embeddings": speaker_embedding})
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
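Passing all eight key points in a single call may run into the model's input length limit or degrade the audio towards the end. One option is to split the text into sentence groups and synthesize each group separately. The sketch below shows only the splitting step; `split_into_chunks` is a hypothetical helper, and the character budget is an arbitrary placeholder rather than a documented limit.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo = ("The cheque was returned. The Trial Court dismissed the complaint. "
        "The Supreme Court upheld it.")
for chunk in split_into_chunks(demo, max_chars=70):
    print(chunk)
```

Each chunk could then be passed to `synthesiser` in turn and the resulting arrays concatenated before writing the file.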


In [30]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio
import torch

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
inputs = processor(text=introduction, return_tensors="pt")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Pick one speaker x-vector to set the voice.
speaker_embeddings = torch.tensor(embeddings_dataset[7006]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# SpeechT5 generates 16 kHz audio. Reading the rate from
# model.generation_config.sample_rate raised:
# AttributeError: 'GenerationConfig' object has no attribute 'sample_rate'
sampling_rate = 16000
Audio(speech.cpu().numpy().squeeze(), rate=sampling_rate)


In [32]:
import os

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
inputs = processor(text=introduction, return_tensors="pt")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

# Create the output folder up front so sf.write does not fail on a missing path.
os.makedirs("./speechT5_outputs", exist_ok=True)

# Sample one speaker every 647 entries. Stepping through range() keeps the index
# inside the dataset, whereas 647*i for i in range(100) runs past its end.
for i, index in enumerate(range(0, len(embeddings_dataset), 647)):
    speaker_embeddings = torch.tensor(embeddings_dataset[index]["xvector"]).unsqueeze(0)

    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    filename = f"./speechT5_outputs/speech{i}.wav"
    sf.write(filename, speech.numpy(), samplerate=16000)


KeyboardInterrupt: 

In [25]:
print(type(processor))
print(type(model))
print(type(vocoder))
print(type(inputs))
print(type(embeddings_dataset))
print(type(speaker_embeddings))
print(type(speech))

<class 'transformers.models.speecht5.processing_speecht5.SpeechT5Processor'>
<class 'transformers.models.speecht5.modeling_speecht5.SpeechT5ForTextToSpeech'>
<class 'transformers.models.speecht5.modeling_speecht5.SpeechT5HifiGan'>
<class 'transformers.tokenization_utils_base.BatchEncoding'>
<class 'datasets.arrow_dataset.Dataset'>
<class 'torch.Tensor'>
<class 'torch.Tensor'>


In [24]:
len(embeddings_dataset)

7931
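With 7931 embeddings available, any scheme for sampling voices has to stay below that index. The helper below is a small sketch for choosing a fixed number of evenly spaced indices; `evenly_spaced_indices` is a hypothetical name and is not part of the `datasets` library.

```python
from typing import List


def evenly_spaced_indices(total: int, count: int) -> List[int]:
    """Return `count` indices spread across range(total), all strictly below `total`."""
    if total <= 0 or count <= 0:
        return []
    count = min(count, total)
    return [(i * total) // count for i in range(count)]


indices = evenly_spaced_indices(7931, 100)
print(indices[:3], indices[-1])
```

Each index can then be used as `embeddings_dataset[index]["xvector"]` to compare how different speakers render the same summary.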

#####  Approach 2
**Model** : "suno/bark"    <br>
Resource : Hugging Face 

In [34]:
from transformers import pipeline
from scipy.io import wavfile

synthesiser = pipeline("text-to-speech", "suno/bark")

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})

# The pipeline appears to return audio shaped (1, n_samples). Written as-is, scipy
# treats every sample as a separate channel, and the original call raised:
# error: ushort format requires 0 <= number <= 65535
wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"].squeeze())


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
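The `ushort format` error reported for this cell is consistent with how WAV headers work: the channel count is stored in an unsigned 16-bit field, so a value above 65535 cannot be packed. The sketch below reproduces that constraint with the standard library alone. It assumes the writer packs the header with `struct` and that the audio array had shape (1, n_samples), so the sample count ended up in the channel field.

```python
import struct


def fits_wav_channel_field(n_channels: int) -> bool:
    """Check whether a channel count fits the unsigned 16-bit field of a WAV header."""
    try:
        struct.pack("<H", n_channels)
        return True
    except struct.error:
        return False


print(fits_wav_channel_field(1))       # mono audio
print(fits_wav_channel_field(70000))   # a sample count mistaken for a channel count
```

Flattening the array to one dimension before writing, for example with `.squeeze()`, avoids the problem.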

In [40]:
from transformers import AutoProcessor, AutoModel
from scipy.io import wavfile

processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)

# Bark keeps the sample rate on the generation config. Reading model.config.sample_rate raised:
# AttributeError: 'BarkConfig' object has no attribute 'sample_rate'
sampling_rate = model.generation_config.sample_rate
wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
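When the sample rate or array layout of a model's output is in doubt, writing a known signal and reading the header back is a quick sanity check. The sketch below uses only the standard library; `write_mono_wav` is a hypothetical helper, and the 24000 Hz rate and the test tone are assumptions chosen for illustration, not values taken from the Bark run.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Sequence


def write_mono_wav(path: str, samples: Sequence[float], sample_rate: int) -> None:
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, value))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# One second of a 440 Hz tone as stand-in audio.
rate = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
wav_path = os.path.join(tempfile.gettempdir(), "tone_check.wav")
write_mono_wav(wav_path, tone, rate)

# Read the header back to confirm channel count, rate, and length.
with wave.open(wav_path, "rb") as check:
    print(check.getnchannels(), check.getframerate(), check.getnframes())
```

For model output, a flattened array converted with `.tolist()` could be passed as `samples`, with the rate taken from the model's configuration.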

### Google Text To Speech (gTTS)

In [39]:
from gtts import gTTS

# gTTS produces MP3 data, so the file is saved with an .mp3 extension.
tts = gTTS(introduction, lang='en')
tts.save('g_introduction.mp3')

### Amazon Polly Text to Speech

### IBM watsonx Text to Speech

# Working With Summarization 

### Hugging Face open-source models