# SpeakSense - Language Detection System (Machine Learning CSCI 6364)

**Abde Manaaf Ghadiali (G29583342), Gehna Ahuja (G00000000), Venkatesh Shanmugam (G00000000)**

The objective of this project is to develop a robust and accurate system capable of detecting the language spoken in audio recordings. By leveraging advanced machine learning algorithms and signal processing techniques, the system aims to accurately identify the language spoken in various audio inputs, spanning diverse accents, dialects, and environmental conditions. This language detection solution seeks to provide practical applications in speech recognition, transcription, translation, and other fields requiring language-specific processing, thereby enhancing accessibility and usability across linguistic boundaries.

This code sets up an environment for working with audio data, particularly focusing on Indian languages. Here's a breakdown of what each part does:

1. **Importing Libraries**: Imports necessary libraries for data manipulation, visualization, machine learning, and audio processing.

2. **Setting Display Options and Suppressing Warnings**: Configures display options for Pandas and suppresses warnings.

3. **Setting Random Seed**: Sets a random seed for reproducibility.

4. **Downloading Datasets**: Checks if the necessary datasets are downloaded, and if not, downloads them from Kaggle using the OpenDatasets library and organizes them into appropriate directories.

5. **Audio Data Processing**: Prepares the audio data for further analysis. This might include feature extraction, preprocessing, and organizing the data for training machine learning models.

6. **Machine Learning**: Utilizes machine learning techniques for tasks such as spoken language identification. This involves splitting the data into training and testing sets, building machine learning models (such as Random Forest or Gradient Boosting), evaluating the models, and generating classification reports and confusion matrices.

7. **Deep Learning**: Utilizes deep learning techniques, specifically convolutional neural networks (CNNs), for tasks such as spoken language identification. This involves building and training deep learning models using the TensorFlow and Keras libraries.

8. **Datasets Used**: The two Kaggle datasets are `audio-dataset-with-10-indian-languages` and `spoken-language-identification`. After download, their directories are renamed to `audio_dataset_indian_languages` and `spoken_language_identification`.



In [2]:
import numpy as np
import pandas as pd
import opendatasets as od
import matplotlib.pyplot as plt
import IPython.display as ipd

import librosa
import pywt
import os
import glob
import warnings

from tqdm import tqdm
from scipy.fft import fft, fftfreq
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
np.random.seed(42)

In [None]:
data_path_dict = {
    'data_path': '../data',
    'eda_results': '../data/eda_results',
    'model_data': '../data/model_data',
    'chunked_audio': '../data/model_data/chunked_audio',
    'data_subset': '../data/model_data/data_subset',
    'mfcc_dataframes': '../data/model_data/mfcc_dataframes',
    'mfcc_mean_dataframes': '../data/model_data/mfcc_mean_dataframes',
    'spectrogram_images': '../data/model_data/spectrogram_images',
    'models': '../data/models'
}

# Create every project directory that does not exist yet
for dir_path in data_path_dict.values():
    if not os.path.exists(dir_path):
        print(f'Path does not exist, creating: {dir_path}')
        os.makedirs(dir_path, exist_ok=True)

In [None]:
if not (os.path.exists('../data/audio-dataset-with-10-indian-languages') or os.path.exists('../data/audio_dataset_indian_languages')):
    od.download(dataset_id_or_url="https://www.kaggle.com/datasets/hbchaitanyabharadwaj/audio-dataset-with-10-indian-languages", data_dir='../data/')
    os.rename('../data/audio-dataset-with-10-indian-languages/', '../data/audio_dataset_indian_languages/')

if not (os.path.exists('../data/spoken-language-identification') or os.path.exists('../data/spoken_language_identification')):
    od.download(dataset_id_or_url="https://www.kaggle.com/datasets/toponowicz/spoken-language-identification", data_dir='../data/')
    os.rename('../data/spoken-language-identification/', '../data/spoken_language_identification/')

## Exploratory Data Analysis (EDA)

The provided code snippet defines variables related to file paths and sample filenames for audio datasets. Here's what each part does:

1. `spoken_languages_train_path_dataset`: This variable holds the path to the training dataset for spoken language identification. It points to the directory where the audio files for training are stored.

2. `indian_languages_train_path_dataset`: This variable holds the path to the training dataset for Indian languages. It points to the directory where the audio files for Indian languages are stored. The `*/*.mp3` part suggests that it's looking for all MP3 files within subdirectories.

3. `filename_de`: This list contains sample filenames for the German language. Each filename refers to one fragment of a longer recording. These files are used below as samples for the exploratory plots.

4. `filename_en`: This list contains sample filenames for the English language.

5. `filename_es`: This list contains sample filenames for the Spanish language.

In [None]:
spoken_languages_train_path_dataset = '../data/spoken_language_identification/train/train/'
indian_languages_train_path_dataset = '../data/audio_dataset_indian_languages/Language Detection Dataset/*/*.mp3'

filename_de = ['de_f_0809fd0642232f8c85b0b3d545dc2b5a.fragment1.flac', 'de_f_5d2e7f30d69f2d1d86fd05f3bbe120c2.fragment1.flac']
filename_en = ['en_f_058b70233667e1b64506dddf9f9d6b46.fragment1.flac', 'en_f_386ee651f6f1539ff5622c55e234e5a4.fragment3.flac']
filename_es = ['es_f_47bd2e6178465cd745c86c9db5ffe447.fragment1.flac', 'es_f_ea5fee5b16a663c988fbddb2137cf573.fragment15.flac']

#### Display Audio Data, Sample Rate and Language of an Audio File and Play Audio

This code snippet performs the following tasks:

1. **Loading Audio Data**:
    ```python
    data_de, sample_rate_de = librosa.load(spoken_languages_train_path_dataset + filename_de[0])
    ```
    It uses the `librosa.load()` function to load the audio file corresponding to the first filename in the `filename_de` list. The `spoken_languages_train_path_dataset` variable provides the directory path, and `filename_de[0]` provides the filename. The function returns two values: `data_de`, which contains the audio waveform data, and `sample_rate_de`, the sampling rate of the returned waveform. Because `sr` is not specified here, `librosa.load()` resamples the audio to its default rate of 22,050 Hz and mixes it down to mono, so `sample_rate_de` is not necessarily the file's native rate.

2. **Displaying Audio Information**:
    ```python
    print(f'Audio for Language: German')
    print(f'Audio Data Sample Rate: {sample_rate_de}')
    print(f'Audio Data: {data_de}')
    ```
    It prints information about the loaded audio file. Specifically, it prints the language of the audio (German), the sample rate of the audio (`sample_rate_de`), and the audio data itself (`data_de`). This provides insight into the content and properties of the audio file.

3. **Playing the Audio**:
    ```python
    ipd.Audio(data=data_de, rate=sample_rate_de)
    ```
    It uses `IPython.display.Audio()` to create an audio widget for playing the audio. It takes `data_de` as the audio data and `sample_rate_de` as the sample rate. When executed in a Jupyter Notebook environment, running this line will display an audio player widget that allows you to listen to the loaded audio file.

Overall, this code snippet loads, displays information about, and plays an audio file for the German language.


In [None]:
data_de, sample_rate_de = librosa.load(spoken_languages_train_path_dataset + filename_de[0])

print(f'Audio for Language: German')
print(f'Audio Data Sample Rate: {sample_rate_de}')
print(f'Audio Data: {data_de}')

ipd.Audio(data=data_de, rate=sample_rate_de)

In [None]:
data_en, sample_rate_en = librosa.load(spoken_languages_train_path_dataset + filename_en[0])

print(f'Audio for Language: English')
print(f'Audio Data Sample Rate: {sample_rate_en}')
print(f'Audio Data: {data_en}')

ipd.Audio(data=data_en, rate=sample_rate_en)

In [None]:
data_es, sample_rate_es = librosa.load(spoken_languages_train_path_dataset + filename_es[0])

print(f'Audio for Language: Spanish')
print(f'Audio Data Sample Rate: {sample_rate_es}')
print(f'Audio Data: {data_es}')

ipd.Audio(data=data_es, rate=sample_rate_es)

In [None]:
data_bengali, sample_rate_bengali = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Bengali/0.mp3')

print(f'Audio for Language: Bengali')
print(f'Audio Data Sample Rate: {sample_rate_bengali}')
print(f'Audio Data: {data_bengali}')

ipd.Audio(data=data_bengali, rate=sample_rate_bengali)

In [None]:
data_gujarati, sample_rate_gujarati = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Gujarati/214.mp3')

print(f'Audio for Language: Gujarati')
print(f'Audio Data Sample Rate: {sample_rate_gujarati}')
print(f'Audio Data: {data_gujarati}')

ipd.Audio(data=data_gujarati, rate=sample_rate_gujarati)

In [None]:
data_hindi, sample_rate_hindi = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Hindi/0.mp3')

print(f'Audio for Language: Hindi')
print(f'Audio Data Sample Rate: {sample_rate_hindi}')
print(f'Audio Data: {data_hindi}')

ipd.Audio(data=data_hindi, rate=sample_rate_hindi)

In [None]:
data_kannada, sample_rate_kannada = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Kannada/214.mp3')

print(f'Audio for Language: Kannada')
print(f'Audio Data Sample Rate: {sample_rate_kannada}')
print(f'Audio Data: {data_kannada}')

ipd.Audio(data=data_kannada, rate=sample_rate_kannada)

In [None]:
data_malayalam, sample_rate_malayalam = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Malayalam/214.mp3')

print(f'Audio for Language: Malayalam')
print(f'Audio Data Sample Rate: {sample_rate_malayalam}')
print(f'Audio Data: {data_malayalam}')

ipd.Audio(data=data_malayalam, rate=sample_rate_malayalam)

In [None]:
data_marathi, sample_rate_marathi = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Marathi/214.mp3')

print(f'Audio for Language: Marathi')
print(f'Audio Data Sample Rate: {sample_rate_marathi}')
print(f'Audio Data: {data_marathi}')

ipd.Audio(data=data_marathi, rate=sample_rate_marathi)

In [None]:
data_punjabi, sample_rate_punjabi = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Punjabi/214.mp3')

print(f'Audio for Language: Punjabi')
print(f'Audio Data Sample Rate: {sample_rate_punjabi}')
print(f'Audio Data: {data_punjabi}')

ipd.Audio(data=data_punjabi, rate=sample_rate_punjabi)

In [None]:
data_tamil, sample_rate_tamil = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Tamil/214.mp3')

print(f'Audio for Language: Tamil')
print(f'Audio Data Sample Rate: {sample_rate_tamil}')
print(f'Audio Data: {data_tamil}')

ipd.Audio(data=data_tamil, rate=sample_rate_tamil)

In [None]:
data_telugu, sample_rate_telugu = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Telugu/214.mp3')

print(f'Audio for Language: Telugu')
print(f'Audio Data Sample Rate: {sample_rate_telugu}')
print(f'Audio Data: {data_telugu}')

ipd.Audio(data=data_telugu, rate=sample_rate_telugu)

In [None]:
data_urdu, sample_rate_urdu = librosa.load('../data/audio_dataset_indian_languages/Language Detection Dataset/Urdu/214.mp3')

print(f'Audio for Language: Urdu')
print(f'Audio Data Sample Rate: {sample_rate_urdu}')
print(f'Audio Data: {data_urdu}')

ipd.Audio(data=data_urdu, rate=sample_rate_urdu)

#### Display Amplitude Plots of Audio Files

This code defines a function `amplitude_plot_audio` for plotting the amplitude of audio data. Here's a breakdown of what it does:

1. **Function Definition**:
    ```python
    def amplitude_plot_audio(data_dict: dict, n_rows: int = 1, n_cols: int = 3, figsize: tuple = (20, 5), file_name: str = 'amplitude_plot'):
    ```
    - The function `amplitude_plot_audio` takes several parameters:
        - `data_dict`: A dictionary containing audio data where the keys are formatted as `<index>_<language>` where `<index>` represents the index of the audio data and `<language>` represents the language label.
        - `n_rows`: Number of rows for subplots (default is 1).
        - `n_cols`: Number of columns for subplots (default is 3).
        - `figsize`: Size of the figure (default is (20, 5)).
        - `file_name`: Name of the file to save the plot (default is 'amplitude_plot').

2. **Plotting Amplitude**:
    ```python
    _, ax = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=figsize)
    ```
    - This line creates subplots based on the specified number of rows (`n_rows`) and columns (`n_cols`). It returns a figure and an array of Axes objects (`ax`).

3. **Iterating Over Data Dictionary**:
    ```python
    for key in data_dict.keys():
        idx, lang = key.split('_')
        idx = int(idx)

        ax[idx].plot(data_dict[key])

        ax[idx].set_ylabel('Amplitude')
        ax[idx].set_xlabel('Time in samples')
        ax[idx].set_title(f'Audio Amplitude vs Time ({lang})')
    ```
    - This loop iterates over each key in the `data_dict`, which represents the index and language label of each audio data.
    - It splits each key to extract the index and language information.
    - It plots the amplitude data (`data_dict[key]`) on the corresponding subplot (`ax[idx]`) and sets labels and titles accordingly.

4. **Saving the Plot**:
    ```python
    plt.savefig(f'../data/eda_results/{file_name}.png')
    ```
    - After plotting all the amplitudes, the plot is saved as a PNG file in the directory `../data/eda_results/` with the specified `file_name`.

Overall, this function generates a subplot for each audio sample, plotting its amplitude over time and labeling it with the corresponding language.
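
The key convention and the x-axis units described above can be illustrated without any plotting. The following is a minimal sketch: the helpers `parse_plot_key` and `samples_to_seconds` are hypothetical (they are not part of the notebook), and the sample rate of 22,050 Hz is an assumption based on the default that `librosa.load()` resamples to.

```python
import numpy as np


def parse_plot_key(key: str) -> tuple:
    """Split a '<index>_<language>' key into (subplot index, language label)."""
    # maxsplit=1 keeps labels that themselves contain an underscore intact
    idx, lang = key.split('_', 1)
    return int(idx), lang


def samples_to_seconds(n_samples: int, sample_rate: int) -> np.ndarray:
    """Return the time in seconds of each sample index."""
    return np.arange(n_samples) / sample_rate


idx, lang = parse_plot_key('2_es')
time_axis = samples_to_seconds(n_samples=44100, sample_rate=22050)

print(idx, lang)       # 2 es
print(time_axis[-1])   # just under 2.0 seconds
```

Passing `time_axis` as the x-values of the plot would label the horizontal axis in seconds rather than in sample indices.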


In [None]:
def amplitude_plot_audio(data_dict: dict, n_rows: int = 1, n_cols: int = 3, figsize: tuple = (20, 5), file_name: str = 'amplitude_plot') -> None:
    """Plots amplitude vs. time for audio data.

    Parameters:
        data_dict (dict): Dictionary containing audio data. Keys are in the format 'idx_lang'.
        n_rows (int, optional): Number of rows in the subplot grid. Default is 1.
        n_cols (int, optional): Number of columns in the subplot grid. Default is 3.
        figsize (tuple, optional): Figure size (width, height) in inches. Default is (20, 5).
        file_name (str, optional): Name of the output file to save the plot. Default is 'amplitude_plot'.
    """

    # Create subplots with specified number of rows and columns
    _, ax = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=figsize)

    # Iterate over keys in the data dictionary
    for key in data_dict.keys():

        # Split the key into index and language
        idx, lang = key.split('_')
        idx = int(idx)  # Convert index to integer

        # Plot audio data on corresponding subplot
        ax[idx].plot(data_dict[key])

        # Set labels and title for the subplot
        # librosa returns a normalized waveform (not dB) indexed by sample number (not ms)
        ax[idx].set_ylabel('Amplitude')
        ax[idx].set_xlabel('Time in samples')
        ax[idx].set_title(f'Audio Amplitude vs Time ({lang})')

    # Save the plot as a PNG file
    plt.savefig(f'../data/eda_results/{file_name}.png')

In [None]:
for i in range(len(filename_de)):
    data_de, _ = librosa.load(spoken_languages_train_path_dataset + filename_de[i])
    data_en, _ = librosa.load(spoken_languages_train_path_dataset + filename_en[i])
    data_es, _ = librosa.load(spoken_languages_train_path_dataset + filename_es[i])

    data_dict = {'0_de': data_de, '1_en': data_en, '2_es': data_es}

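    # Include the loop index in the file name so each iteration does not overwrite the previous plot
    amplitude_plot_audio(data_dict=data_dict, file_name=f'amplitude_plot_de_en_es_{i}')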

In [None]:
indian_languages_list = ['Hindi', 'Bengali', 'Gujarati']
data_dict = {}

for idx, lang in enumerate(indian_languages_list):
    data_dict[f'{idx}_{lang.lower()}'] = librosa.load(f'../data/audio_dataset_indian_languages/Language Detection Dataset/{lang}/214.mp3')[0]

amplitude_plot_audio(data_dict=data_dict, file_name='amplitude_plot_hbg')

indian_languages_list = ['Kannada', 'Malayalam', 'Marathi']
data_dict = {}

for idx, lang in enumerate(indian_languages_list):
    data_dict[f'{idx}_{lang.lower()}'] = librosa.load(f'../data/audio_dataset_indian_languages/Language Detection Dataset/{lang}/214.mp3')[0]

amplitude_plot_audio(data_dict=data_dict, file_name='amplitude_plot_kmm')

indian_languages_list = ['Punjabi', 'Tamil', 'Telugu', 'Urdu']
data_dict = {}

for idx, lang in enumerate(indian_languages_list):
    data_dict[f'{idx}_{lang.lower()}'] = librosa.load(f'../data/audio_dataset_indian_languages/Language Detection Dataset/{lang}/214.mp3')[0]

amplitude_plot_audio(data_dict=data_dict, n_rows=1, n_cols=4, figsize=(25, 5), file_name='amplitude_plot_pttu')

#### Display Spectrogram Plots of Audio Files

This code defines a function `spectrogram_plot_audio` for plotting the spectrogram of audio data. Here's a breakdown of what it does:

1. **Function Definition**:
    ```python
    def spectrogram_plot_audio(data_dict: dict, sample_rate_dict: dict, n_rows: int = 1, n_cols: int = 3, figsize: tuple = (20, 5), file_name: str = 'spectrogram_plot'):
    ```
    - The function `spectrogram_plot_audio` takes several parameters:
        - `data_dict`: A dictionary containing audio data where the keys are formatted as `<index>_<language>` where `<index>` represents the index of the audio data and `<language>` represents the language label.
        - `sample_rate_dict`: A dictionary containing sample rates corresponding to each audio data in `data_dict`.
        - `n_rows`: Number of rows for subplots (default is 1).
        - `n_cols`: Number of columns for subplots (default is 3).
        - `figsize`: Size of the figure (default is (20, 5)).
        - `file_name`: Name of the file to save the plot (default is 'spectrogram_plot').

2. **Plotting Spectrogram**:
    ```python
    _, ax = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=figsize)
    ```
    - This line creates subplots based on the specified number of rows (`n_rows`) and columns (`n_cols`). It returns a figure and an array of Axes objects (`ax`).

3. **Iterating Over Data Dictionary**:
    ```python
    for key in data_dict.keys():
        idx, lang = key.split('_')
        idx = int(idx)

        ax[idx].specgram(data_dict[key], Fs=sample_rate_dict[key])

        ax[idx].set_ylabel('Frequency [Hz]')
        ax[idx].set_xlabel('Time [sec]')
        ax[idx].set_title(f'Audio Frequency vs Time ({lang})')
    ```
    - This loop iterates over each key in the `data_dict`, which represents the index and language label of each audio data.
    - It splits each key to extract the index and language information.
    - It plots the spectrogram of the audio data (`data_dict[key]`) on the corresponding subplot (`ax[idx]`) with the corresponding sample rate (`sample_rate_dict[key]`). It uses the `specgram` function from Matplotlib to generate the spectrogram.

4. **Saving the Plot**:
    ```python
    plt.savefig(f'../data/eda_results/{file_name}.png')
    ```
    - After plotting all the spectrograms, the plot is saved as a PNG file in the directory `../data/eda_results/` with the specified `file_name`.

Overall, this function generates a subplot for each audio sample, plotting its spectrogram and labeling it with the corresponding language.
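
The role of the `Fs` argument can be checked numerically. The sketch below assumes Matplotlib's default `NFFT=256` for `specgram` and a sample rate of 22,050 Hz (the `librosa.load()` default). It only computes the frequency bin centres with NumPy and does not draw anything.

```python
import numpy as np

sample_rate = 22050   # assumed: librosa.load() default sample rate
n_fft = 256           # assumed: Matplotlib's default NFFT for specgram

# Centre frequency (Hz) of each bin of a one-sided spectrum
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

print(len(freqs))   # n_fft // 2 + 1 = 129 bins
print(freqs[-1])    # about 11025 Hz, the Nyquist frequency (sample_rate / 2)
print(freqs[1])     # spacing between adjacent bins, sample_rate / n_fft
```

The frequency axis therefore runs from 0 Hz to half the sample rate, which is why each spectrogram must be given the sample rate of its own waveform: a wrong `Fs` rescales both the frequency and the time axes.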


In [None]:
def spectrogram_plot_audio(data_dict: dict, sample_rate_dict: dict, n_rows: int = 1, n_cols: int = 3, figsize: tuple = (20, 5), file_name: str = 'spectrogram_plot') -> None:
    """Plots spectrogram for audio data.

    Parameters:
        data_dict (dict): Dictionary containing audio data. Keys are in the format 'idx_lang'.
        sample_rate_dict (dict): Dictionary containing sample rates corresponding to audio data keys.
        n_rows (int, optional): Number of rows in the subplot grid. Default is 1.
        n_cols (int, optional): Number of columns in the subplot grid. Default is 3.
        figsize (tuple, optional): Figure size (width, height) in inches. Default is (20, 5).
        file_name (str, optional): Name of the output file to save the plot. Default is 'spectrogram_plot'.
    """

    # Create subplots with specified number of rows and columns
    _, ax = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=figsize)

    # Iterate over keys in the data dictionary
    for key in data_dict.keys():

        # Split the key into index and language
        idx, lang = key.split('_')
        idx = int(idx)  # Convert index to integer

        # Plot spectrogram of audio data on corresponding subplot
        ax[idx].specgram(data_dict[key], Fs=sample_rate_dict[key])

        # Set labels and title for the subplot
        ax[idx].set_ylabel('Frequency [Hz]')
        ax[idx].set_xlabel('Time [sec]')
        ax[idx].set_title(f'Audio Frequency vs Time ({lang})')

    # Save the plot as a PNG file
    plt.savefig(f'../data/eda_results/{file_name}.png')


In [None]:
for i in range(len(filename_de)):
    data_de, samplerate_de = librosa.load(spoken_languages_train_path_dataset + filename_de[i])
    data_en, samplerate_en = librosa.load(spoken_languages_train_path_dataset + filename_en[i])
    data_es, samplerate_es = librosa.load(spoken_languages_train_path_dataset + filename_es[i])

    data_dict = {'0_de': data_de, '1_en': data_en, '2_es': data_es}
    sample_rate_dict = {'0_de': samplerate_de, '1_en': samplerate_en, '2_es': samplerate_es}

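    # Include the loop index in the file name so each iteration does not overwrite the previous plot
    spectrogram_plot_audio(data_dict=data_dict, sample_rate_dict=sample_rate_dict, file_name=f'spectogram_plot_de_en_es_{i}')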

In [None]:
indian_languages_list = ['Hindi', 'Bengali', 'Gujarati']
data_dict = {}
sample_rate_dict = {}

for idx, lang in enumerate(indian_languages_list):
    data, sample_rate = librosa.load(f'../data/audio_dataset_indian_languages/Language Detection Dataset/{lang}/214.mp3')

    data_dict[f'{idx}_{lang.lower()}'] = data
    sample_rate_dict[f'{idx}_{lang.lower()}'] = sample_rate

spectrogram_plot_audio(data_dict=data_dict, sample_rate_dict=sample_rate_dict, file_name='spectogram_plot_hbg')

indian_languages_list = ['Kannada', 'Malayalam', 'Marathi']
audio_file_name = '214'
data_dict = {}
sample_rate_dict = {}

for idx, lang in enumerate(indian_languages_list):
    data, sample_rate = librosa.load(f'../data/audio_dataset_indian_languages/Language Detection Dataset/{lang}/{audio_file_name}.mp3')

    data_dict[f'{idx}_{lang.lower()}'] = data
    sample_rate_dict[f'{idx}_{lang.lower()}'] = sample_rate

spectrogram_plot_audio(data_dict=data_dict, sample_rate_dict=sample_rate_dict, file_name='spectogram_plot_kmm')

indian_languages_list = ['Punjabi', 'Tamil', 'Telugu', 'Urdu']
audio_file_name = '214'
data_dict = {}
sample_rate_dict = {}

for idx, lang in enumerate(indian_languages_list):
    data, sample_rate = librosa.load(f'../data/audio_dataset_indian_languages/Language Detection Dataset/{lang}/{audio_file_name}.mp3')

    data_dict[f'{idx}_{lang.lower()}'] = data
    sample_rate_dict[f'{idx}_{lang.lower()}'] = sample_rate

spectrogram_plot_audio(data_dict=data_dict, sample_rate_dict=sample_rate_dict, n_rows=1, n_cols=4, figsize=(25, 5), file_name='spectogram_plot_pttu')

#### Loading and Pre-Processing Data

This code defines a function `load_data` to load audio data from a file and extract information about the sample rate and duration of the audio. Here's what it does:

1. **Function Definition**:
    ```python
    def load_data(file_name: str) -> tuple:
    ```
    - The function `load_data` takes a single parameter `file_name`, which represents the path to the audio file to be loaded. It returns a tuple containing information about the sample rate and audio duration.

2. **Loading Audio Data**:
    ```python
    audio_data, sample_rate = librosa.load(file_name, sr=None)
    ```
    - This line uses `librosa.load()` to load the audio file specified by `file_name`. The parameter `sr=None` indicates that the original sample rate of the audio file should be preserved. The function returns two values: `audio_data`, which contains the audio waveform data, and `sample_rate`, which represents the sampling rate of the audio file.

3. **Calculating Audio Duration**:
    ```python
    audio_duration_sec = int(librosa.get_duration(y=audio_data, sr=sample_rate))
    ```
    - This line calculates the duration of the audio in seconds using `librosa.get_duration()`. It takes `audio_data` and `sample_rate` as parameters and returns the duration as a float, which `int()` then truncates to whole seconds (a 9.9-second clip is recorded as 9).

4. **Returning Results**:
    ```python
    return (sample_rate, audio_duration_sec)
    ```
    - The function returns a tuple containing the sample rate and audio duration.

5. **Error Handling**:
    ```python
    except Exception as e:
        print(f"Error processing {file_name}: {str(e)}")
        return (np.nan, np.nan)
    ```
    - If an exception occurs during the loading or processing of the audio file, it prints an error message indicating the file name and the specific error. It then returns `(np.nan, np.nan)` to indicate that the sample rate and audio duration could not be determined.

Overall, this function loads audio data from a file, calculates its sample rate and duration, and returns this information as a tuple. If an error occurs during the process, it prints an error message and returns NaN values.
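
The same pattern (read the sample rate, derive the duration, fall back to NaN on failure) can be tried without any audio dataset. The sketch below swaps `librosa` for Python's standard-library `wave` module, so it handles only WAV files. The function name `wav_info` and the temporary file are hypothetical and exist only for this illustration.

```python
import math
import os
import tempfile
import wave


def wav_info(file_name: str) -> tuple:
    """Return (sample_rate, whole seconds) for a WAV file, or (nan, nan) on failure."""
    try:
        with wave.open(file_name, 'rb') as wav_file:
            sample_rate = wav_file.getframerate()
            n_frames = wav_file.getnframes()
        # Duration is the number of frames divided by frames per second;
        # int() truncates, so 2.5 s is reported as 2
        return (sample_rate, int(n_frames / sample_rate))
    except Exception as e:
        print(f"Error processing {file_name}: {e}")
        return (math.nan, math.nan)


# Write 2.5 seconds of 16-bit mono silence at 8 kHz to a temporary file
tmp_dir = tempfile.mkdtemp()
wav_path = os.path.join(tmp_dir, 'silence.wav')
with wave.open(wav_path, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(8000)
    wav_file.writeframes(b'\x00\x00' * 20000)

print(wav_info(wav_path))                              # (8000, 2)
print(wav_info(os.path.join(tmp_dir, 'missing.wav')))  # prints an error, then (nan, nan)
```

Returning NaN instead of raising lets a whole directory be processed in one pass. Rows for unreadable files can then be removed afterwards with `dropna()`.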


In [None]:
def load_data(file_name: str) -> tuple:
    """Loads audio data from a file using librosa.

    Parameters:
        file_name (str): Path to the audio file.

    Returns:
        tuple: A tuple containing the sample rate and duration of the audio in seconds.
               If an error occurs during processing, returns (np.nan, np.nan).
    """

    try:
        # Load audio data and sample rate
        audio_data, sample_rate = librosa.load(file_name, sr=None)

        # Calculate duration of audio in seconds
        audio_duration_sec = int(librosa.get_duration(y=audio_data, sr=sample_rate))

        return (sample_rate, audio_duration_sec)

    except Exception as e:
        # Print error message if an exception occurs during processing
        print(f"Error processing {file_name}: {str(e)}")

        # Return NaN values for sample rate and duration
        return (np.nan, np.nan)

This code snippet performs several operations to create and preprocess a DataFrame containing information about audio files. Here's a breakdown:

1. **Creating DataFrame for Spoken Languages**:
    - It creates a DataFrame `spoken_language_dataframe` to store file names and language labels for audio files in the `spoken_languages_train_path_dataset` directory.
    - It constructs the file paths by concatenating the directory path with each file name obtained using `os.listdir()`.
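    - It extracts the language label from the file name by splitting the path on `/`, taking the file's base name, and keeping the prefix before the first underscore (for example, `de` in `de_f_<hash>.fragment1.flac`).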
    - It replaces language abbreviations ('de' for German, 'en' for English, 'es' for Spanish) with full language names.

2. **Creating DataFrame for Indian Languages**:
    - It creates a DataFrame `indian_language_dataframe` to store file names and language labels for audio files in the `indian_languages_train_path_dataset` directory.
    - It uses `glob.glob()` to get all file names matching the pattern.
    - It extracts the language label from the name of the directory that contains each file (for example, `Hindi`) and converts it to lowercase.
    - It filters out files labeled as 'punjabi'.

3. **Combining DataFrames and Adding Additional Columns**:
    - It concatenates the two DataFrames (`spoken_language_dataframe` and `indian_language_dataframe`) into a single DataFrame `language_dataframe`.
    - It calculates the file size in kilobytes for each file and adds it as a new column.

4. **Saving DataFrames to CSV Files**:
    - It saves the `language_dataframe` to a CSV file named 'language_dataframe_v1.csv' without including the index column.

5. **Processing Audio Data**:
    - It iterates over unique language labels in the DataFrame.
    - For each language, it creates a subset of the DataFrame containing only files for that language.
    - It applies the `load_data` function to each file to extract sample rate and audio duration information and adds these as new columns in the subset DataFrame.
    - It saves each subset DataFrame to a separate CSV file.

6. **Combining Processed DataFrames**:
    - It combines all processed subset DataFrames into a single DataFrame `language_dataframe`.
    - It reads each CSV file for each language subset and concatenates them.
    - It drops any rows with NaN values.

7. **Saving Combined DataFrame to CSV File**:
    - It saves the combined DataFrame to 'language_dataframe_v1.csv' without including the index column, overwriting the version written in step 4 so that the file now also contains the sample rate and duration columns.

Overall, this code prepares and preprocesses audio data and metadata, and stores it in CSV format for further analysis or use in machine learning models.
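
The two label-extraction rules in steps 1 and 2 can be sketched with `pathlib`, which avoids depending on a particular path separator. This is an illustration rather than the notebook's implementation: the helper names `label_from_prefix` and `label_from_parent` are hypothetical, and the example paths are assembled from the directory and file names that appear earlier in this notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Mapping taken from the notebook's replace() call
LANGUAGE_NAMES = {'de': 'german', 'en': 'english', 'es': 'spanish'}


def label_from_prefix(path) -> str:
    """Label from the file-name prefix, e.g. 'de_f_<hash>.fragment1.flac' -> 'german'."""
    code = path.name.split('_')[0]
    return LANGUAGE_NAMES.get(code, code)


def label_from_parent(path) -> str:
    """Label from the name of the directory that contains the file."""
    return path.parent.name.lower()


spoken = PurePosixPath('../data/spoken_language_identification/train/train/'
                       'de_f_0809fd0642232f8c85b0b3d545dc2b5a.fragment1.flac')
indian_posix = PurePosixPath('../data/audio_dataset_indian_languages/Language Detection Dataset/Hindi/0.mp3')
indian_windows = PureWindowsPath('..\\data\\audio_dataset_indian_languages\\Language Detection Dataset\\Hindi\\0.mp3')

print(label_from_prefix(spoken))          # german
print(label_from_parent(indian_posix))    # hindi
print(label_from_parent(indian_windows))  # hindi
```

Working from path components rather than a fixed split position keeps the labels correct if the data directory is moved or the notebook is run on a different operating system.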


In [None]:
spoken_language_dataframe = pd.DataFrame({'file_name': [spoken_languages_train_path_dataset + file_name for file_name in os.listdir(spoken_languages_train_path_dataset)]})
spoken_language_dataframe['language_label'] = spoken_language_dataframe['file_name'].str.split('/', expand=True).iloc[:, 5].str.split('_', expand=True).iloc[:, 0].replace(to_replace={
    'de': 'german', 'en': 'english', 'es': 'spanish'})

indian_language_dataframe = pd.DataFrame({'file_name': glob.glob(indian_languages_train_path_dataset)})
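# Use the parent directory name as the label; unlike splitting on '\\', this also works on Linux and macOS
indian_language_dataframe['language_label'] = indian_language_dataframe['file_name'].apply(lambda path: os.path.basename(os.path.dirname(path))).str.lower()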

indian_language_dataframe = indian_language_dataframe[indian_language_dataframe['language_label'] != 'punjabi']

language_dataframe = pd.concat([spoken_language_dataframe, indian_language_dataframe], ignore_index=True)
language_dataframe['file_size_kb'] = (language_dataframe['file_name'].apply(lambda x: os.path.getsize(x)) / 1024).round(3)

language_dataframe.to_csv('../data/model_data/language_dataframe_v1.csv', index=False)

for lang in tqdm(language_dataframe['language_label'].unique(), desc="Languages"):
    lang_data = language_dataframe[language_dataframe['language_label'] == lang].copy()
    lang_data[['sample_rate', 'audio_duration_sec']] = lang_data['file_name'].apply(lambda file_name: pd.Series(load_data(file_name=file_name)))

    lang_data.to_csv(f'../data/model_data/data_subset/language_dataframe_{lang}_v1.csv', index=False)

language_dataframe = pd.concat([pd.read_csv(f'../data/model_data/data_subset/language_dataframe_{lang}_v1.csv') for lang in language_dataframe['language_label'].unique()], ignore_index=True).dropna()
language_dataframe.to_csv('../data/model_data/language_dataframe_v1.csv', index=False)

language_dataframe

#### Descriptive Statistics of Raw Audio Data

In [None]:
language_dataframe = pd.read_csv('../data/model_data/language_dataframe_v1.csv')
language_dataframe

In [None]:
language_dataframe['sample_rate'].unique()

In [None]:
language_dataframe['file_size_kb'].mean(), language_dataframe['file_size_kb'].median()

In [None]:
language_dataframe['audio_duration_sec'].value_counts()

In [None]:
language_dataframe.describe()

#### Waveform Visualization

Waveform visualization plots the amplitude of an audio signal against time, showing how the amplitude varies over the course of the recording. Each point on the waveform is the signal's amplitude at one sample, that is, at one instant in time.

Waveform visualization is useful for gaining insights into the overall structure and dynamics of the audio. By examining the waveform, we can identify patterns, such as changes in amplitude, duration of sounds, and presence of silence or noise. This can help in understanding the rhythm, tempo, and intensity of the audio.

In addition, waveform visualization is commonly used for tasks such as audio editing, quality assessment, and transcription. It provides a simple yet informative representation of the audio signal, making it easier to analyze and interpret the characteristics of the audio.

Overall, waveform visualization is a fundamental tool in audio processing and analysis, offering valuable insights into the temporal dynamics of the audio signal.


In [None]:
file_path = language_dataframe['file_name'].iloc[0]
language = language_dataframe['language_label'].iloc[0]

audio_data, sample_rate = librosa.load(file_path, sr=None)

plt.figure(figsize=(10, 4))
librosa.display.waveshow(y=audio_data, sr=sample_rate)
plt.xlabel('Time [sec]')
plt.ylabel('Amplitude')
plt.title(f'Waveform Plot ({language.capitalize()})')
plt.show()

#### Spectrogram Visualization

A spectrogram is a visual representation of the frequency content of an audio signal over time. It is created by computing the Short-Time Fourier Transform (STFT) of the audio signal, which divides the signal into short overlapping segments and computes the Fourier Transform for each segment. The resulting spectrogram displays the magnitude of the frequency components as a function of time.

Spectrogram visualization is a powerful tool for analyzing audio signals, as it provides insights into various aspects of the signal's characteristics. By examining the spectrogram, one can identify patterns such as frequency components, changes in frequency content over time, and the presence of specific sounds or events.

The spectrogram is typically displayed with time on the x-axis, frequency on the y-axis, and color representing the magnitude of the frequency components. Brighter colors indicate higher magnitudes, while darker colors indicate lower magnitudes.

Spectrogram visualization is widely used in various audio processing tasks, including speech recognition, music analysis, sound classification, and environmental monitoring. It enables researchers and practitioners to explore the frequency dynamics of audio signals and extract meaningful information from them.

Overall, spectrogram visualization is an essential technique for gaining insights into the frequency content and temporal dynamics of audio signals, making it a valuable tool in audio analysis and processing.
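
To make the STFT steps described above concrete, the following is a minimal NumPy-only sketch of a magnitude spectrogram. The helper `stft_magnitude`, the Hann window, the frame length of 2048, the hop of 512, and the synthetic 1 kHz test tone are all assumptions chosen for illustration; unlike `librosa.stft`, this sketch does not pad or center the frames, so its output will not match librosa's exactly.

```python
import numpy as np

def stft_magnitude(signal, n_fft=2048, hop_length=512):
    """Magnitude spectrogram of shape (1 + n_fft // 2, n_frames); no padding or centering."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Slice the signal into overlapping frames and taper each one with the window
    frames = np.stack([signal[i * hop_length:i * hop_length + n_fft] * window for i in range(n_frames)])
    # One FFT per frame; transpose so rows are frequency bins and columns are frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic check (illustrative values): one second of a 1 kHz tone sampled at 16 kHz
demo_sr = 16000
demo_t = np.arange(demo_sr) / demo_sr
demo_tone = np.sin(2 * np.pi * 1000 * demo_t)

demo_spec = stft_magnitude(demo_tone)
demo_peak_bin = demo_spec.mean(axis=1).argmax()
demo_peak_hz = demo_peak_bin * demo_sr / 2048  # convert bin index to Hz
print(demo_spec.shape, demo_peak_hz)
```

Each column of the result is the spectrum of one frame, which is the quantity the spectrogram plot colors after conversion to decibels.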


In [None]:
D = librosa.stft(audio_data)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.show()

#### Mel-frequency Cepstral Coefficients (MFCCs)

Mel-frequency Cepstral Coefficients (MFCCs) are a set of features widely used in speech and audio processing tasks. They represent the short-term power spectrum of a sound in a compact and perceptually relevant manner. MFCCs are derived from the Mel-frequency cepstrum, which is a representation of the short-term power spectrum of a sound after it has been passed through a series of filters designed to mimic the human auditory system's response.

The process of extracting MFCCs involves several steps:

1. **Frame the Signal**: The audio signal is divided into short overlapping frames, typically ranging from 20 to 40 milliseconds in duration.

2. **Apply a Window Function**: A window function, such as the Hamming window, is applied to each frame to reduce spectral leakage.

3. **Compute the Fourier Transform**: The Fourier Transform is computed for each framed signal to obtain its frequency spectrum.

4. **Apply the Mel Filterbank**: The Mel filterbank is a series of triangular filters whose center frequencies are evenly spaced on the Mel scale, a perceptually motivated frequency scale that is roughly linear at low frequencies and logarithmic at high frequencies. The energy in each filter's frequency band is computed.

5. **Take the Logarithm**: The logarithm of the energy in each filterbank is computed to better approximate the human auditory system's response to sound.

6. **Compute the Discrete Cosine Transform (DCT)**: The DCT is applied to the log-filterbank energies to decorrelate the features and obtain the MFCCs.

The resulting MFCCs capture the spectral characteristics of the audio signal in a compact form, making them suitable for tasks such as speech recognition, speaker identification, and audio classification. They are particularly effective in capturing phonetic and speaker-related information from speech signals.

Overall, MFCCs are a powerful feature representation for audio signals, providing a robust and efficient way to characterize the spectral content of sound.
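
The six steps above can be sketched end to end with NumPy alone. This is an illustrative approximation, not the implementation behind `librosa.feature.mfcc`: the helper names (`hz_to_mel`, `mel_to_hz`, `mel_filterbank`, `mfcc_sketch`), the `2595 * log10(1 + f / 700)` Mel formula, the 512-sample frame, the 26 filters, and the unnormalized DCT-II are all assumptions, so the values will differ from librosa's output.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters with centers evenly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc_sketch(signal, sr, n_fft=512, hop_length=256, n_mels=26, n_mfcc=13):
    # Steps 1-2: frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    window = np.hamming(n_fft)
    frames = np.stack([signal[i * hop_length:i * hop_length + n_fft] * window for i in range(n_frames)])
    # Step 3: power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 4: energy in each Mel filter
    mel_energy = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Step 5: logarithm (the small constant avoids log(0))
    log_mel = np.log(mel_energy + 1e-10)
    # Step 6: DCT-II across the filterbank axis (unnormalized)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), np.arange(n_mels) + 0.5) / n_mels)
    return (log_mel @ basis.T).T  # shape: (n_mfcc, n_frames)

# Synthetic input (illustrative): one second of white noise at 16 kHz
demo_rng = np.random.default_rng(0)
demo_mfccs = mfcc_sketch(demo_rng.normal(size=16000), sr=16000)
print(demo_mfccs.shape)
```

At 16 kHz, a 512-sample frame lasts 32 ms, which falls inside the 20 to 40 ms range mentioned in step 1.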


In [None]:
mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time')
plt.colorbar()
plt.title('Mel-frequency Cepstral Coefficients')
plt.ylabel('MFCC Coefficients')
plt.xlabel('Time [sec]')
plt.show()

#### Energy Distribution Analysis

Energy distribution analysis involves examining how the energy of an audio signal is distributed across different frequency bands. This analysis can provide insights into the dominant frequencies or energy patterns present in the audio signal.

The process of energy distribution analysis typically involves the following steps:

1. **Compute the Short-Time Fourier Transform (STFT)**: The audio signal is divided into short overlapping segments, and the Fourier Transform is computed for each segment. This results in a time-frequency representation of the signal, where the magnitude of the frequency components is represented as a function of time.

2. **Partition the Frequency Spectrum**: The frequency spectrum obtained from the STFT is divided into multiple frequency bands. These bands can be equally spaced or based on perceptual scales such as the Mel-frequency scale.

3. **Calculate Energy in Each Band**: The energy of the signal within each frequency band is computed by summing the squared magnitudes (the power) of the frequency components within that band. This provides a measure of the signal's power within each frequency range.

4. **Visualize the Energy Distribution**: The energy distribution across different frequency bands is visualized using a histogram, bar plot, or heat map. This visualization allows for the identification of dominant frequencies or energy patterns present in the audio signal.

Energy distribution analysis is commonly used in various audio processing tasks, including music analysis, speech recognition, and sound classification. It can help in identifying key characteristics of the audio signal and extracting relevant features for further analysis.

Overall, energy distribution analysis provides valuable insights into the frequency content and power distribution of audio signals, facilitating a better understanding of their underlying properties and behaviors.
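
One detail worth spelling out: the rows of an STFT matrix are FFT bins, not frequencies in Hz, so band edges given in Hz need to be converted to bin indices before slicing. The sketch below shows one way to do that; the helper `band_energies` and the synthetic single-frame input are assumptions made for illustration.

```python
import numpy as np

def band_energies(power_spec, sr, n_fft, band_edges_hz):
    """Sum power over the FFT bins that fall in each [low, high) band given in Hz."""
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)        # center frequency of each bin
    edges = np.searchsorted(bin_freqs, band_edges_hz)     # Hz edges -> bin indices
    # Note: with an upper edge of exactly sr / 2, the Nyquist bin itself is excluded
    return np.stack([power_spec[lo:hi].sum(axis=0) for lo, hi in zip(edges[:-1], edges[1:])])

# Synthetic single-frame power spectrum (illustrative): a 1 kHz tone at 16 kHz
demo_sr, demo_n_fft = 16000, 2048
demo_t = np.arange(demo_n_fft) / demo_sr
demo_frame = np.sin(2 * np.pi * 1000 * demo_t) * np.hanning(demo_n_fft)
demo_power = (np.abs(np.fft.rfft(demo_frame)) ** 2)[:, None]  # shape (1025, 1)

demo_edges_hz = np.linspace(0, demo_sr / 2, 11)  # ten bands, 800 Hz wide
demo_energy = band_energies(demo_power, demo_sr, demo_n_fft, demo_edges_hz)
print(demo_energy[:, 0].argmax())  # index of the band holding most of the energy
```

With 800 Hz bands, the 1 kHz tone falls in the second band (800 to 1600 Hz), which is where the energy concentrates.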


In [None]:
n_fft = 2048
hop_length = 512

stft = librosa.stft(audio_data, n_fft=n_fft, hop_length=hop_length)
power_spec = np.abs(stft) ** 2

num_bands = 10
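# Band edges must be FFT-bin indices (rows of power_spec), not frequencies in Hz;
# edges in Hz would run past the last row and leave most bands empty
freq_bands = np.linspace(0, power_spec.shape[0], num_bands + 1).astype(int)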
energy_distribution = np.zeros((num_bands, power_spec.shape[1]))

for i in range(num_bands):
    energy_distribution[i, :] = np.sum(power_spec[freq_bands[i]:freq_bands[i+1], :], axis=0)

plt.figure(figsize=(10, 6))
plt.imshow(energy_distribution, aspect='auto', origin='lower', cmap='jet')
plt.colorbar(label='Energy')
plt.xlabel('Time Frame')
plt.ylabel('Frequency Band')
plt.title('Energy Distribution Across Frequency Bands')
plt.show()

#### Pitch Analysis

Pitch analysis estimates the fundamental frequency (F0) of an audio signal over time. F0 is the physical rate at which a periodic sound repeats, while pitch is how listeners perceive that rate; for speech, the two are closely related.

The process of pitch analysis typically involves the following steps:

1. **Preprocessing**: The audio signal may undergo preprocessing steps such as filtering, windowing, and normalization to improve the accuracy of pitch estimation.

2. **Pitch Estimation**: Various algorithms can be used to estimate the pitch of the audio signal. One common method is the autocorrelation-based technique, which computes the autocorrelation function of the signal and takes the lag of its strongest peak as the fundamental period. The code cell in this section uses a related method, probabilistic YIN (`librosa.pyin`), which also reports whether each frame is voiced.

3. **Post-processing**: Post-processing steps may be applied to refine the estimated pitch values and eliminate spurious or erroneous detections.

Pitch analysis is useful in a wide range of applications, including music transcription, speech processing, voice analysis, and emotion recognition. In music transcription, pitch analysis is used to convert audio recordings into symbolic representations of music, such as sheet music or MIDI files. In speech processing, pitch analysis can provide insights into the prosodic features of speech, such as intonation and emphasis. In voice analysis, pitch analysis can help identify characteristics such as gender and age.

Overall, pitch analysis plays a crucial role in understanding and interpreting the frequency characteristics of audio signals, making it a valuable tool in various audio processing tasks.
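
As a rough illustration of the autocorrelation-based technique mentioned in step 2, the sketch below estimates F0 for a single frame. The helper `autocorr_pitch`, the 50 to 500 Hz search range (a loose assumption for speech), and the synthetic 200 Hz tone are not part of the notebook; a practical estimator would add voicing detection and the post-processing described in step 3.

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame from its autocorrelation peak."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only
    acf = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Restrict the search to lags that correspond to the allowed pitch range
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return sr / best_lag

# Synthetic check (illustrative): a 2048-sample frame of a 200 Hz tone at 16 kHz
demo_sr = 16000
demo_t = np.arange(2048) / demo_sr
demo_f0 = autocorr_pitch(np.sin(2 * np.pi * 200 * demo_t), demo_sr)
print(demo_f0)  # expected to be close to 200 Hz
```

Limiting the lag range matters: without it, the zero-lag peak (the signal's energy) would always win, and harmonically rich signals could lock onto a multiple of the true period.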


In [None]:
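# pyin returns (f0, voiced_flag, voiced_prob); f0 is NaN for unvoiced frames.
# Pass sr explicitly because the audio was loaded at its native sample rate.
pitch, voiced_flag, voiced_prob = librosa.pyin(audio_data, fmin=librosa.note_to_hz('C1'), fmax=librosa.note_to_hz('C8'), sr=sample_rate)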

plt.figure(figsize=(10, 4))
plt.plot(librosa.frames_to_time(range(len(pitch)), sr=sample_rate), pitch, color='b')
plt.xlabel('Time (s)')
plt.ylabel('Pitch (Hz)')
plt.title('Pitch Analysis')
plt.show()

#### Temporal Features

Temporal features are statistical measures computed over the temporal domain of an audio signal. These features provide insights into the temporal characteristics of the audio, capturing various aspects of its amplitude and distribution over time.

Some commonly computed temporal features include:

- **Mean**: The mean represents the average amplitude of the audio signal over time. It provides a measure of the central tendency of the signal's amplitude distribution.

- **Variance**: The variance measures the spread or dispersion of the amplitude values around the mean. It indicates the degree of variability or fluctuation in the signal's amplitude over time.

- **Skewness**: Skewness quantifies the asymmetry of the amplitude distribution. A positive skewness value indicates that the tail of the distribution is skewed towards higher values, while a negative skewness value indicates skewness towards lower values.

- **Kurtosis**: Kurtosis measures the "tailedness" of the amplitude distribution, that is, how much of its spread comes from rare, extreme values. Under the non-excess definition (fourth central moment divided by the squared variance), a normal distribution has a kurtosis of 3. Higher values indicate a sharper peak and heavier tails, while lower values indicate a flatter distribution with lighter tails.

Computing these temporal features provides valuable insights into the overall shape, variability, and distribution of the audio signal over time. These features are commonly used in various audio processing tasks, including audio classification, event detection, and anomaly detection.

By analyzing temporal features, researchers and practitioners can gain a better understanding of the temporal characteristics of the audio signal and extract meaningful information for further analysis and interpretation.
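
For reference, the moment-based definitions of these statistics for a signal $x_1, \dots, x_N$ can be written as follows. These are the population forms (no bias correction), and the kurtosis is the non-excess form; other conventions exist, so this is one reasonable choice rather than the only one.

$$
\mu = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^2
$$

$$
\text{skewness} = \frac{\frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^3}{\sigma^3}, \qquad
\text{kurtosis} = \frac{\frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)^4}{\sigma^4}
$$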


In [None]:
mean = np.mean(audio_data)
variance = np.var(audio_data)

# Standardized third and fourth central moments (np.var is the second central moment)
centered = audio_data - mean
skewness = np.mean(centered ** 3) / variance ** (3/2)
kurtosis = np.mean(centered ** 4) / variance ** 2

print("Mean:", mean)
print("Variance:", variance)
print("Skewness:", skewness)
print("Kurtosis:", kurtosis)

#### Frequency Domain Analysis

Frequency domain analysis involves analyzing the frequency spectrum of an audio signal using techniques such as the Fourier Transform or Short-Time Fourier Transform (STFT). This analysis provides insights into the frequency components present in the audio signal and how they vary over time.

The Fourier Transform is a mathematical technique that decomposes a signal into its constituent frequency components. By applying the Fourier Transform to an audio signal, we can obtain a representation of its frequency spectrum, which shows the magnitude and phase of each frequency component.

The Short-Time Fourier Transform (STFT) extends the Fourier Transform to analyze the frequency content of a signal over short time intervals. It divides the signal into overlapping segments, computes the Fourier Transform for each segment, and then combines the results to create a time-frequency representation of the signal. This allows us to observe how the frequency components of the signal change over time.

Frequency domain analysis can reveal important information about the characteristics of the audio signal, such as:

- Dominant frequencies: The main frequency components present in the signal.
- Harmonic structure: The relationship between the fundamental frequency and its harmonics.
- Transient events: Short-duration changes in the frequency content of the signal.
- Spectral patterns: Repeating patterns or motifs in the frequency spectrum.

Frequency domain analysis is widely used in various audio processing tasks, including audio compression, noise reduction, and speech enhancement. It provides a powerful tool for understanding the spectral properties of audio signals and extracting meaningful information for further analysis and processing.

Overall, frequency domain analysis is an essential technique in audio signal processing, enabling researchers and practitioners to explore the frequency characteristics of audio signals and uncover valuable insights about their underlying structure and dynamics.
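
As a small worked example of finding a dominant frequency, the sketch below locates the largest peak in the one-sided FFT spectrum of a synthetic signal. The helper `dominant_frequency` and the test signal (a 440 Hz tone plus a weaker harmonic at 880 Hz, sampled at 8 kHz) are assumptions for illustration only.

```python
import numpy as np

def dominant_frequency(signal, sr):
    """Return the frequency (Hz) of the largest-magnitude bin in the one-sided spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[0] = 0.0  # ignore the DC component
    return freqs[np.argmax(spectrum)]

# Synthetic signal (illustrative): fundamental at 440 Hz with a weaker harmonic at 880 Hz
demo_sr = 8000
demo_t = np.arange(demo_sr) / demo_sr
demo_signal = np.sin(2 * np.pi * 440 * demo_t) + 0.3 * np.sin(2 * np.pi * 880 * demo_t)

demo_peak_hz = dominant_frequency(demo_signal, demo_sr)
print(demo_peak_hz)
```

The frequency resolution is `sr / len(signal)`, which is 1 Hz for this one-second signal; shorter signals resolve peaks more coarsely.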


In [None]:
n_fft = 2048
hop_length = 512
stft = librosa.stft(audio_data, n_fft=n_fft, hop_length=hop_length)

stft_db = librosa.amplitude_to_db(np.abs(stft))
freqs = librosa.fft_frequencies(sr=sample_rate, n_fft=n_fft)

plt.figure(figsize=(10, 4))
librosa.display.specshow(stft_db, sr=sample_rate, hop_length=hop_length, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.title('STFT Spectrogram (Linear Frequency)')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.show()

In [None]:
N = len(audio_data)
xf = fft(audio_data)
xf_freqs = fftfreq(N, (1.0 / sample_rate))[:N//2]

plt.figure(figsize=(10, 4))
plt.plot(xf_freqs, 2.0 / N * np.abs(xf[:N//2]))
plt.title('Frequency Spectrum')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude')
plt.grid()
plt.show()

#### Audio Segmentation

Audio segmentation is the process of dividing an audio signal into meaningful segments or parts based on various features such as silence detection, changes in energy, or spectral characteristics. This technique helps in identifying different sections or events within the audio, making it easier to analyze and process the signal.

There are several methods for audio segmentation, including:

1. **Silence Detection**: Silence detection involves identifying periods of silence or low energy in the audio signal. This can be achieved by setting a threshold on the signal's energy level and detecting segments with energy below the threshold.

2. **Energy-Based Segmentation**: Energy-based segmentation focuses on detecting changes in the energy level of the audio signal. Segments with significant changes in energy levels may correspond to different events or sections within the audio.

3. **Spectral Characteristics**: Spectral-based segmentation involves analyzing the frequency content of the audio signal to identify segments with distinct spectral characteristics. This can be done using techniques such as clustering or pattern recognition based on features extracted from the frequency domain.

4. **Machine Learning Approaches**: Machine learning techniques can also be used for audio segmentation, where models are trained to classify segments based on various features extracted from the audio signal.

Audio segmentation is a crucial step in many audio processing tasks, including speech recognition, music transcription, and sound event detection. It allows for the identification and isolation of relevant sections within the audio signal, enabling more focused analysis and processing.

By segmenting the audio signal into meaningful parts, researchers and practitioners can better understand its structure and content, leading to more accurate and effective processing outcomes.

Overall, audio segmentation plays a vital role in audio processing applications, facilitating the extraction of valuable information and insights from audio signals.


In [None]:
window_size = 2048
hop_length = 512
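# Short-time energy of each frame
energy = np.array([np.sum(audio_data[i:i + window_size] ** 2) for i in range(0, len(audio_data), hop_length)])

# Frames with energy below 10% of the mean are treated as silence
threshold = np.mean(energy) * 0.1
is_silent = energy < threshold

# Group consecutive non-silent frames into (start, end) sample ranges
segments = []
start = None

for frame, silent in enumerate(is_silent):
    if not silent and start is None:
        start = frame * hop_length
    elif silent and start is not None:
        segments.append((start, frame * hop_length))
        start = None

# Close the final segment if the audio does not end in silence
if start is not None:
    segments.append((start, len(audio_data)))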

for i, (start, end) in enumerate(segments):
    print(f"Segment {i + 1}: Start = {start / sample_rate:.3f} seconds, End = {end / sample_rate:.3f} seconds")

#### Clustering and Dimensionality Reduction

Clustering techniques and dimensionality reduction methods are powerful tools used to explore the structure of audio data and discover similarities or patterns among audio samples. These methods help in organizing large audio datasets, identifying clusters of similar audio samples, and reducing the dimensionality of the data while preserving its essential characteristics.

##### Clustering Techniques

Clustering techniques group similar audio samples together based on their feature representations. Some commonly used clustering algorithms include K-means clustering, hierarchical clustering, and density-based clustering. These algorithms partition the data into clusters such that audio samples within the same cluster are more similar to each other than to those in other clusters.

##### Dimensionality Reduction Methods

Dimensionality reduction methods aim to reduce the number of features in the audio data while preserving as much information as possible. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are popular dimensionality reduction techniques used in audio processing. PCA identifies the principal components (linear combinations of the original features) that capture the most variance in the data. t-SNE, on the other hand, is a non-linear technique that projects high-dimensional data into a lower-dimensional space, emphasizing the local structure of the data.

##### Application in Audio Processing

Clustering and dimensionality reduction techniques are widely used in various audio processing tasks, including:

- **Audio Classification**: Clustering can help in grouping audio samples with similar characteristics, making it easier to classify them into different categories or classes.
- **Content-Based Retrieval**: Dimensionality reduction techniques can be used to create compact representations of audio data, enabling efficient retrieval of similar audio samples from large databases.
- **Audio Visualization**: Clustering and dimensionality reduction methods can aid in visualizing high-dimensional audio data in lower-dimensional spaces, allowing for intuitive exploration and interpretation of the data.

By applying clustering and dimensionality reduction techniques to audio data, researchers and practitioners can gain valuable insights into the underlying structure and patterns of the data, facilitating various audio processing tasks and applications.

Overall, clustering and dimensionality reduction are essential tools in audio processing, enabling the exploration and analysis of large audio datasets and the discovery of meaningful patterns and similarities among audio samples.
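
To illustrate what PCA computes, the sketch below projects a feature matrix onto its top principal components using a singular value decomposition. The helper `pca_svd` and the synthetic feature matrix are assumptions for illustration; the sign of each component is arbitrary, so projections can be mirrored relative to other PCA implementations such as scikit-learn's.

```python
import numpy as np

def pca_svd(features, n_components=2):
    """Project rows of `features` (samples x features) onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained_variance_ratio[:n_components]

# Synthetic stand-in for per-frame feature vectors: 200 frames x 5 features,
# with most of the variance deliberately placed along the first feature
demo_rng = np.random.default_rng(0)
demo_features = demo_rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])

demo_projected, demo_ratio = pca_svd(demo_features)
print(demo_projected.shape, demo_ratio)
```

The explained variance ratio indicates how much of the data's spread survives the projection, which is a useful check before clustering in the reduced space.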


In [None]:
mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)

pca = PCA(n_components=2)
mfccs_pca = pca.fit_transform(mfccs.T)

kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(mfccs_pca)

plt.figure(figsize=(10, 6))

for cluster in range(kmeans.n_clusters):
    plt.scatter(mfccs_pca[clusters == cluster, 0], mfccs_pca[clusters == cluster, 1], label=f'Cluster {cluster + 1}')

plt.title('Clustering of Audio Segments (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()

#### Time-Frequency Decomposition

Time-frequency decomposition is a technique used to decompose an audio signal into representations that simultaneously capture both time and frequency characteristics. This allows for a more detailed analysis of how the frequency content of the signal evolves over time. Two common techniques for time-frequency decomposition are the wavelet transform and the Wigner-Ville distribution.

##### Wavelet Transform

The wavelet transform is a mathematical tool that decomposes a signal into a set of wavelets, which are small wave-like functions that are localized in both time and frequency. By applying the wavelet transform to an audio signal, we can obtain a time-frequency representation that highlights both short-term and long-term frequency components of the signal. This representation is particularly useful for analyzing transient signals or signals with rapidly changing frequency content.

##### Wigner-Ville Distribution

The Wigner-Ville distribution is a quadratic time-frequency representation, computed as the Fourier transform of the signal's instantaneous autocorrelation. It offers high resolution in both time and frequency. However, for signals with several components it suffers from cross-term interference, which introduces artifacts between the true components in the representation.

##### Application in Audio Processing

Time-frequency decomposition techniques are widely used in various audio processing tasks, including:

- **Audio Analysis**: Time-frequency representations allow for a detailed analysis of the frequency content of audio signals over time, facilitating tasks such as event detection, pitch tracking, and sound classification.
- **Audio Compression**: Time-frequency representations can be used to identify and remove redundant or irrelevant information from audio signals, leading to more efficient compression algorithms.
- **Audio Enhancement**: Time-frequency decomposition techniques can help in separating and isolating specific frequency components of audio signals, enabling the enhancement of desired features or suppression of unwanted noise.

By decomposing audio signals into time-frequency representations, researchers and practitioners can gain valuable insights into the temporal and spectral characteristics of the signals, facilitating various audio processing tasks and applications.

Overall, time-frequency decomposition is an essential technique in audio processing, enabling a more detailed and comprehensive analysis of audio signals in both the time and frequency domains.
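
When working with the continuous wavelet transform, it helps to know which frequency each scale corresponds to. The sketch below converts between scales and pseudo-frequencies using the relation `frequency = center_freq * sample_rate / scale`. The value 0.8125 for the Morlet center frequency is an assumption based on what PyWavelets reports for `'morl'` (it can be checked with `pywt.central_frequency('morl')`), and the helper names and the 16 kHz sample rate are hypothetical.

```python
import numpy as np

# Assumption: center frequency of the 'morl' wavelet in cycles per sample at scale 1;
# verify with pywt.central_frequency('morl') before relying on it
MORLET_CENTER_FREQ = 0.8125

def scales_for_frequencies(target_hz, sr, center_freq=MORLET_CENTER_FREQ):
    """CWT scales whose pseudo-frequency equals each target frequency in Hz."""
    return center_freq * sr / np.asarray(target_hz, dtype=float)

def frequencies_for_scales(scales, sr, center_freq=MORLET_CENTER_FREQ):
    """Pseudo-frequency in Hz for each CWT scale."""
    return center_freq * sr / np.asarray(scales, dtype=float)

# Illustrative choice: six log-spaced frequencies between 100 Hz and 4 kHz at 16 kHz
demo_sr = 16000
demo_target_hz = np.geomspace(100, 4000, num=6)
demo_scales = scales_for_frequencies(demo_target_hz, demo_sr)
print(demo_scales)
```

Under this assumption, small integer scales map to very high frequencies (scale 1 lies above the Nyquist frequency), so covering the speech range generally calls for larger scales.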


In [None]:
levels = 5
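# sampling_period converts the returned pseudo-frequencies from cycles per sample to Hz,
# so they match the 'Frequency (Hz)' axis label below
coeffs, freqs = pywt.cwt(audio_data, np.arange(1, levels + 1), 'morl', sampling_period=1 / sample_rate)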

plt.figure(figsize=(10, 6))
plt.imshow(np.abs(coeffs), extent=[0, len(audio_data) / sample_rate, freqs[-1], freqs[0]], aspect='auto', cmap='jet')
plt.colorbar(label='Magnitude')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Time-Frequency Decomposition (CWT)')
plt.show()