# SpeakSense - Language Detection System (Machine Learning CSCI 6364)

**Abde Manaaf Ghadiali (G29583342), Gehna Ahuja (G00000000), Venkatesh Shanmugam (G00000000)**

The objective of this project is to build a system that detects the language spoken in an audio recording. Using machine learning and signal processing techniques, the system aims to identify the language across diverse accents, dialects, and environmental conditions. Language detection of this kind has practical applications in speech recognition, transcription, translation, and other fields that require language-specific processing.

This notebook sets up an environment for working with audio data, with a particular focus on Indian languages. The project pipeline has the following parts; the first code cell below covers steps 1 to 3:

1. **Importing Libraries**: Imports necessary libraries for data manipulation, visualization, machine learning, and audio processing.

2. **Setting Display Options and Suppressing Warnings**: Configures display options for Pandas and suppresses warnings.

3. **Setting Random Seed**: Sets a random seed for reproducibility.

4. **Downloading Datasets**: Checks if the necessary datasets are downloaded, and if not, downloads them from Kaggle using the OpenDatasets library and organizes them into appropriate directories.

5. **Audio Data Processing**: Prepares the audio data for modeling. In this notebook that means chunking recordings into fixed-length segments, recomputing descriptive statistics, and extracting MFCC features and spectrogram images.

6. **Machine Learning**: Utilizes machine learning techniques for tasks such as spoken language identification. This involves splitting the data into training and testing sets, building machine learning models (such as Random Forest or Gradient Boosting), evaluating the models, and generating classification reports and confusion matrices.

7. **Deep Learning**: Utilizes deep learning techniques, specifically convolutional neural networks (CNNs), for tasks such as spoken language identification. This involves building and training deep learning models using the TensorFlow and Keras libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import IPython.display as ipd
import tensorflow as tf
import soundfile as sf

import librosa
import os
import glob
import warnings

from tqdm import tqdm

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
np.random.seed(42)

# Feature Extraction

### Audio Chunking

In [2]:
language_dataframe = pd.read_csv('../data/model_data/language_dataframe_v1.csv')

spoken_languages_train_path_dataset = '../data/spoken_language_identification/train/train/'
indian_languages_train_path_dataset = '../data/model_data/chunked_audio/*.flac'

language_dataframe['language_label'].unique()

array(['german', 'english', 'spanish', 'bengali', 'gujarati', 'hindi',
       'kannada', 'malayalam', 'marathi', 'tamil', 'telugu', 'urdu'],
      dtype=object)

This code processes audio data for Indian languages by chunking the audio files into segments of fixed duration and resampling them to a target sample rate. Here's a breakdown:

1. **Defining Parameters**:
    - `indian_languages`: A list containing the names of Indian languages.
    - `target_sample_rate`: The target sample rate to which the audio will be resampled.
    - `audio_duration`: The duration (in seconds) of each audio chunk.

2. **Processing Audio Files**:
    - It iterates over each Indian language in the `indian_languages` list.
    - For each language, it creates a subset DataFrame `lang_dataframe` containing only files for that language.
    - It initializes an empty list `audio_data_list` to buffer audio data for chunking and a counter `chunk_count` that tracks how many batches have been written (not how many individual chunks).

3. **Chunking Audio Files**:
    - It iterates over each file path in `lang_dataframe['file_name']`.
    - It loads each audio file using `librosa.load()` with the specified target sample rate.
    - It appends the loaded audio data to the `audio_data_list`.
    - Once `audio_data_list` holds 1500 files (a chosen batch size), or when the last file of the language is reached, it chunks the buffered audio.
    - It concatenates the buffered audio and uses TensorFlow's `tf.signal.frame()` to cut it into non-overlapping 10-second frames (220,500 samples each at 22,050 Hz). With `pad_end=True`, the final partial frame is zero-padded to full length.
    - It writes each chunk to a FLAC file named `{chunk_count}_{chunk_idx}_{lang}.flac`: the batch number, the chunk index within that batch, and the language name.
    - It increments `chunk_count` once per batch and resets `audio_data_list` for the next batch or language.

The result is a set of 10-second, 22,050 Hz FLAC files for each Indian language, saved to disk for feature extraction and model training.
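
The framing step described above can be sketched without TensorFlow. The snippet below is a minimal NumPy illustration, not the notebook's implementation: `frame_with_padding` is a hypothetical helper, and the 4 Hz sample rate and 2-second chunks are toy values chosen so the arrays stay readable. It is meant to mirror what non-overlapping framing with end padding does: the number of chunks is the ceiling of the total length divided by the frame length, and the last chunk is filled with zeros.

```python
import numpy as np


def frame_with_padding(signal: np.ndarray, frame_length: int) -> np.ndarray:
    """Split a 1-D signal into non-overlapping frames, zero-padding the last one."""
    n_frames = -(-len(signal) // frame_length)  # ceiling division
    padded = np.zeros(n_frames * frame_length, dtype=signal.dtype)
    padded[:len(signal)] = signal
    return padded.reshape(n_frames, frame_length)


# Toy values (assumptions for illustration only): 4 Hz "audio", 2-second chunks.
sample_rate = 4
chunk_seconds = 2
clips = [np.ones(5), np.ones(6), np.ones(3)]  # three short clips, 14 samples in total

frames = frame_with_padding(np.concatenate(clips), chunk_seconds * sample_rate)
print(frames.shape)  # (2, 8)
print(frames[-1])    # six real samples followed by two padded zeros
```

Because clips are concatenated before framing, a single chunk can contain the end of one recording and the start of the next, and the last chunk of every batch may end in silence.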


In [3]:
indian_languages = ['bengali', 'gujarati', 'hindi', 'kannada', 'malayalam', 'marathi', 'tamil', 'telugu', 'urdu']
target_sample_rate = 22050
audio_duration = 10

for lang in indian_languages:
    lang_dataframe = language_dataframe[language_dataframe['language_label'] == lang].copy()
    audio_data_list = []
    chunk_count = 0  # number of batches written so far (first field of the file name)

    for idx, file_path in tqdm(enumerate(lang_dataframe['file_name'].values), total=len(lang_dataframe), desc=f"Processing {lang}"):
        audio_data_list.append(librosa.load(file_path, sr=target_sample_rate)[0])

        # Flush the buffer after 1500 files, or at the last file of the language
        if (len(audio_data_list) >= 1500) or (idx == len(lang_dataframe) - 1):
            # Non-overlapping 10-second frames; pad_end=True zero-pads the final partial frame
            frame_length = int(audio_duration * target_sample_rate)
            audio_chunks = tf.signal.frame(np.concatenate(audio_data_list), frame_length, frame_length, pad_end=True)

            for chunk_idx, audio_chunk in enumerate(audio_chunks):
                sf.write(file=f'../data/model_data/chunked_audio/{chunk_count}_{chunk_idx}_{lang}.flac', data=audio_chunk, samplerate=target_sample_rate, format='flac')

            chunk_count += 1
            audio_data_list = []


Processing bengali: 100%|██████████| 27258/27258 [04:42<00:00, 96.65it/s] 
Processing gujarati: 100%|██████████| 26436/26436 [04:26<00:00, 99.09it/s] 
Processing hindi: 100%|██████████| 25462/25462 [04:19<00:00, 97.98it/s] 
Processing kannada: 100%|██████████| 22208/22208 [03:21<00:00, 110.37it/s]
Processing malayalam: 100%|██████████| 24044/24044 [03:48<00:00, 105.41it/s]
Processing marathi: 100%|██████████| 25378/25378 [04:23<00:00, 96.38it/s] 
Processing tamil: 100%|██████████| 24195/24195 [03:56<00:00, 102.17it/s]
Processing telugu: 100%|██████████| 23655/23655 [04:01<00:00, 98.00it/s] 
Processing urdu: 100%|██████████| 31958/31958 [05:34<00:00, 95.66it/s]  


In [4]:
data_bengali, sample_rate_bengali = librosa.load('../data/model_data/chunked_audio/0_1_bengali.flac')
print(f'Audio Data Sample Rate: {sample_rate_bengali}')
print(f'Audio Data: {data_bengali}')

ipd.Audio(data=data_bengali, rate=sample_rate_bengali)

Audio Data Sample Rate: 22050
Audio Data: [0. 0. 0. ... 0. 0. 0.]


### Re-Calculate Descriptive Statistics

In [5]:
def load_data(file_name: str) -> tuple:
    """Reads an audio file with librosa and returns its sample rate and duration.

    Parameters:
        file_name (str): Path to the audio file.

    Returns:
        tuple: (sample_rate, audio_duration_sec), where the duration is truncated
               to whole seconds. If an error occurs during processing,
               returns (np.nan, np.nan).
    """

    try:
        # Load audio data and sample rate
        audio_data, sample_rate = librosa.load(file_name, sr=None)

        # Calculate duration of audio in seconds
        audio_duration_sec = int(librosa.get_duration(y=audio_data, sr=sample_rate))

        return (sample_rate, audio_duration_sec)

    except Exception as e:
        # Print error message if an exception occurs during processing
        print(f"Error processing {file_name}: {str(e)}")

        # Return NaN values for sample rate and duration
        return (np.nan, np.nan)
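
A note on the `int(...)` call in `load_data`: for an in-memory signal the duration is the number of samples divided by the sample rate, and casting to `int` truncates rather than rounds. The sketch below uses made-up numbers (a hypothetical clip of 108,045 samples at 22,050 Hz) and a hypothetical helper `duration_seconds` to show the effect. This is why the duration statistics further down contain only whole numbers.

```python
def duration_seconds(n_samples: int, sample_rate: int) -> float:
    """Duration of a signal in seconds: sample count divided by sample rate."""
    return n_samples / sample_rate


# Hypothetical clip: 108,045 samples at 22,050 Hz.
exact = duration_seconds(n_samples=108_045, sample_rate=22_050)

print(exact)       # 4.9
print(int(exact))  # 4 (truncated, not rounded)
```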

In [6]:
spoken_language_dataframe = pd.DataFrame({'file_name': [spoken_languages_train_path_dataset + file_name for file_name in os.listdir(spoken_languages_train_path_dataset)]})
spoken_language_dataframe['language_label'] = spoken_language_dataframe['file_name'].str.split('/', expand=True).iloc[:, 5].str.split('_', expand=True).iloc[:, 0].replace(to_replace={
    'de': 'german', 'en': 'english', 'es': 'spanish'})

indian_language_dataframe = pd.DataFrame({'file_name': glob.glob(indian_languages_train_path_dataset)})
indian_language_dataframe['language_label'] = indian_language_dataframe['file_name'].str.split('_', expand=True).iloc[:, 4].str.split('.', expand=True).iloc[:, 0].str.lower()

language_dataframe = pd.concat([spoken_language_dataframe, indian_language_dataframe], ignore_index=True)
language_dataframe['file_size_kb'] = (language_dataframe['file_name'].apply(lambda x: os.path.getsize(x)) / 1024).round(3)

language_dataframe.to_csv('../data/model_data/language_dataframe_v2.csv', index=False)

for lang in tqdm(language_dataframe['language_label'].unique(), desc="Languages"):
    lang_data = language_dataframe[language_dataframe['language_label'] == lang].copy()
    lang_data[['sample_rate', 'audio_duration_sec']] = lang_data['file_name'].apply(lambda file_name: pd.Series(load_data(file_name=file_name)))

    lang_data.to_csv(f'../data/model_data/data_subset/language_dataframe_{lang}_v2.csv', index=False)

language_dataframe = pd.concat([pd.read_csv(f'../data/model_data/data_subset/language_dataframe_{lang}_v2.csv') for lang in language_dataframe['language_label'].unique()], ignore_index=True).dropna()
language_dataframe.to_csv('../data/model_data/language_dataframe_v2.csv', index=False)

language_dataframe

Languages: 100%|██████████| 12/12 [07:25<00:00, 37.12s/it]


| | file_name | language_label | file_size_kb | sample_rate | audio_duration_sec |
|---|---|---|---|---|---|
| 0 | ../data/spoken_language_identification/train/t... | german | 241.913 | 22050 | 10 |
| 1 | ../data/spoken_language_identification/train/t... | german | 247.070 | 22050 | 10 |
| 2 | ../data/spoken_language_identification/train/t... | german | 235.787 | 22050 | 10 |
| 3 | ../data/spoken_language_identification/train/t... | german | 236.962 | 22050 | 10 |
| 4 | ../data/spoken_language_identification/train/t... | german | 239.393 | 22050 | 10 |
| ... | ... | ... | ... | ... | ... |
| 188045 | ../data/model_data/chunked_audio\9_96_urdu.flac | urdu | 231.222 | 22050 | 10 |
| 188046 | ../data/model_data/chunked_audio\9_97_urdu.flac | urdu | 269.815 | 22050 | 10 |
| 188047 | ../data/model_data/chunked_audio\9_98_urdu.flac | urdu | 215.311 | 22050 | 10 |
| 188048 | ../data/model_data/chunked_audio\9_99_urdu.flac | urdu | 210.095 | 22050 | 10 |
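
The chunk paths in the output above mix `/` and `\` separators, and the label extraction in cell 6 depends on the exact number of `_` characters in the full path. As a hedged alternative (not the notebook's method), the label can be read from the file name alone. `label_from_chunk_path` is a hypothetical helper that assumes the `<batch>_<chunk>_<language>.flac` naming used by the chunking cell.

```python
from pathlib import PureWindowsPath


def label_from_chunk_path(file_name: str) -> str:
    """Return the language label from a '<batch>_<chunk>_<language>.flac' path."""
    # PureWindowsPath treats both '/' and '\' as separators, so mixed paths work
    # regardless of the operating system the code runs on.
    stem = PureWindowsPath(file_name).stem
    # The language is the last underscore-separated field of the file name.
    return stem.rsplit('_', 1)[-1].lower()


print(label_from_chunk_path(r'../data/model_data/chunked_audio\9_96_urdu.flac'))   # urdu
print(label_from_chunk_path('../data/model_data/chunked_audio/0_1_bengali.flac'))  # bengali
```

Working from the file name rather than the full path means the parsing does not change if the data directory is moved or renamed.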


In [7]:
language_dataframe['language_label'].value_counts()

language_label
german       24360
english      24360
spanish      24360
urdu         15939
bengali      13585
gujarati     13168
hindi        12712
marathi      12665
tamil        12040
malayalam    11989
telugu       11796
kannada      11076
Name: count, dtype: int64

In [8]:
language_dataframe['audio_duration_sec'].unique()

array([10], dtype=int64)

In [9]:
language_dataframe_v1 = pd.read_csv('../data/model_data/language_dataframe_v1.csv')
language_dataframe_v2 = pd.read_csv('../data/model_data/language_dataframe_v2.csv')

In [10]:
language_dataframe_v1.describe().round(2)

| | file_size_kb | sample_rate | audio_duration_sec |
|---|---|---|---|
| count | 303674.0 | 303674.0 | 303674.0 |
| mean | 108.21 | 39540.44 | 5.72 |
| std | 60.91 | 10887.05 | 2.45 |
| min | 0.53 | 22050.0 | 0.0 |
| 25% | 78.42 | 22050.0 | 4.0 |
| 50% | 78.51 | 44100.0 | 5.0 |
| 75% | 78.92 | 48000.0 | 5.0 |
| max | 857.51 | 48000.0 | 10.0 |


In [11]:
language_dataframe_v2.describe().round(2)

| | file_size_kb | sample_rate | audio_duration_sec |
|---|---|---|---|
| count | 188050.0 | 188050.0 | 188050.0 |
| mean | 214.5 | 22050.0 | 10.0 |
| std | 29.01 | 0.0 | 0.0 |
| min | 0.67 | 22050.0 | 10.0 |
| 25% | 194.68 | 22050.0 | 10.0 |
| 50% | 215.66 | 22050.0 | 10.0 |
| 75% | 234.76 | 22050.0 | 10.0 |
| max | 323.31 | 22050.0 | 10.0 |


In [12]:
language_dataframe_v1['audio_duration_sec'].value_counts()

audio_duration_sec
4.0     144833
5.0      85293
10.0     73080
0.0        154
1.0        108
2.0        107
3.0         99
Name: count, dtype: int64

In [13]:
language_dataframe_v2['audio_duration_sec'].value_counts()

audio_duration_sec
10    188050
Name: count, dtype: int64

In [14]:
lang_value_counts = pd.merge(
    left=pd.DataFrame(language_dataframe_v1['language_label'].value_counts()).reset_index(),
    right=pd.DataFrame(language_dataframe_v2['language_label'].value_counts()).reset_index(),
    on='language_label').rename(columns={'language_label': 'Language Labels', 'count_x': 'Before Chunking', 'count_y': 'After Chunking'})

lang_value_counts

| | Language Labels | Before Chunking | After Chunking |
|---|---|---|---|
| 0 | urdu | 31958 | 15939 |
| 1 | bengali | 27258 | 13585 |
| 2 | gujarati | 26436 | 13168 |
| 3 | hindi | 25462 | 12712 |
| 4 | marathi | 25378 | 12665 |
| 5 | german | 24360 | 24360 |
| 6 | english | 24360 | 24360 |
| 7 | spanish | 24360 | 24360 |
| 8 | tamil | 24195 | 12040 |
| 9 | malayalam | 24044 | 11989 |
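
The counts above allow a quick sanity check on the chunking. Every chunk holds 10 seconds of audio, so the ratio of chunks to source files, multiplied by 10, gives an upper bound on the mean length of the source clips. It is only a bound because the last chunk of each batch is zero-padded, and it assumes resampling leaves durations unchanged. The sketch below does this arithmetic for the Indian-language rows shown in the table.

```python
# Counts copied from the table above (Indian-language rows only).
before = {'urdu': 31958, 'bengali': 27258, 'gujarati': 26436, 'hindi': 25462,
          'marathi': 25378, 'tamil': 24195, 'malayalam': 24044}
after = {'urdu': 15939, 'bengali': 13585, 'gujarati': 13168, 'hindi': 12712,
         'marathi': 12665, 'tamil': 12040, 'malayalam': 11989}

chunk_seconds = 10

ratios = {lang: after[lang] / before[lang] for lang in before}

for lang, ratio in ratios.items():
    # chunks * 10 s >= total audio seconds, so ratio * 10 bounds the mean clip length
    print(f'{lang:<10} chunks per file = {ratio:.3f}, mean clip <= {ratio * chunk_seconds:.2f} s')
```

All seven ratios come out just under 0.5, which is consistent with source clips of roughly five seconds each being merged two at a time into 10-second chunks.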


### MFCC Mean Data Creation

In [15]:
language_dataframe = pd.read_csv('../data/model_data/language_dataframe_v2.csv')

This code calculates the mean Mel-frequency cepstral coefficients (MFCC) features for audio files and saves them in CSV format. Here's the breakdown:

1. **Defining Function to Calculate MFCC Features**:
    - The function `get_mfcc_features_mean` takes a file name as input and returns the mean MFCC features.
    - It loads the audio data using `librosa.load()` with a sample rate of 22050 Hz.
    - It calculates MFCC features using `librosa.feature.mfcc()` with parameters specifying 40 coefficients (`n_mfcc=40`).
    - It calculates the mean of MFCC features over time and returns the result.

2. **Processing Audio Files and Calculating MFCC Features**:
    - It iterates over each unique language label in the `language_dataframe`.
    - For each language, it creates a subset DataFrame `lang_data` containing only files for that language.
    - It calculates the MFCC features mean for each audio file in the subset using the `get_mfcc_features_mean` function and applies it to the `'file_name'` column using `apply()`.
    - It saves the resulting DataFrame to a CSV file named `mfcc_data_{lang}_v1.csv` in a specified directory.

3. **Combining Processed DataFrames and Saving to CSV**:
    - It reads each CSV file for each language subset and concatenates them into a single DataFrame `mfcc_feature_mean_dataframe`.
    - It drops any rows with NaN values.
    - It saves the combined DataFrame to a CSV file named `mfcc_feature_mean_dataframe_v1.csv`.
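
The time-averaging step can be illustrated on its own. In the sketch below a random matrix stands in for the output of `librosa.feature.mfcc()`. The shape of 40 coefficients by 431 frames is an assumption (it corresponds to a 10-second clip at 22,050 Hz if librosa's default hop length of 512 samples is used). The point is that transposing and averaging over axis 0, as the notebook does, is the same as averaging over the time axis directly, and it reduces each file to a single 40-value vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an MFCC matrix: 40 coefficients x 431 frames (assumed shape).
mfcc = rng.normal(size=(40, 431))

mean_as_in_notebook = np.mean(mfcc.T, axis=0)  # transpose, then average over frames
mean_direct = mfcc.mean(axis=1)                # average over the time axis directly

print(mean_as_in_notebook.shape)                      # (40,)
print(np.allclose(mean_as_in_notebook, mean_direct))  # True
```

Averaging discards how the coefficients change over time, which is the information the full MFCC matrices in the next section retain.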


In [16]:
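def get_mfcc_features_mean(file_name: str) -> list:
    """Return the 40 MFCC coefficients of an audio file, each averaged over time."""
    audio_data, sample_rate = librosa.load(file_name, sr=22050)

    # The MFCC matrix has shape (40, n_frames); transpose and average over the frames
    return list(np.mean(librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40).T, axis=0))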


for lang in language_dataframe['language_label'].unique():
    print(f'Currently Running: {lang}')

    lang_data = language_dataframe[language_dataframe['language_label'] == lang].copy()
    lang_data['mfcc_features_mean'] = lang_data['file_name'].apply(lambda file_name: get_mfcc_features_mean(file_name=file_name))

    lang_data.to_csv(f'../data/model_data/mfcc_mean_dataframes/mfcc_data_{lang}_v1.csv', index=False)

mfcc_feature_mean_dataframe = pd.concat([pd.read_csv(f'../data/model_data/mfcc_mean_dataframes/mfcc_data_{lang}_v1.csv') for lang in language_dataframe['language_label'].unique()], ignore_index=True).dropna()
mfcc_feature_mean_dataframe.to_csv('../data/model_data/mfcc_feature_mean_dataframe_v1.csv', index=False)

Currently Running: german
Currently Running: english
Currently Running: spanish
Currently Running: bengali
Currently Running: gujarati
Currently Running: hindi
Currently Running: kannada
Currently Running: malayalam
Currently Running: marathi
Currently Running: tamil
Currently Running: telugu
Currently Running: urdu


### MFCC Matrix Data Creation

In [17]:
language_dataframe = pd.read_csv('../data/model_data/language_dataframe_v2.csv')

This code calculates the Mel-frequency cepstral coefficients (MFCC) features for audio files and saves them in CSV format. Here's the breakdown:

1. **Defining Function to Calculate MFCC Features**:
    - The function `get_mfcc_features` takes a file name as input and returns the MFCC features as a list.
    - It loads the audio data using `librosa.load()` with a sample rate of 22050 Hz.
    - It calculates MFCC features using `librosa.feature.mfcc()` with parameters specifying 40 coefficients (`n_mfcc=40`).
    - It converts the MFCC features to a list and returns them.

2. **Sampling Data**:
    - It subsamples the `language_dataframe` by randomly selecting 500 files for each language.
    - This subsampling reduces the computational cost and gives every language the same number of examples.

3. **Processing Audio Files and Calculating MFCC Features**:
    - It iterates over each unique language label in the downsampled `language_dataframe`.
    - For each language, it creates a subset DataFrame `lang_data` containing the 500 sampled files for that language.
    - It calculates the MFCC features for each audio file in the subset using the `get_mfcc_features` function and applies it to the `'file_name'` column using `apply()`.
    - It saves the resulting DataFrame to a CSV file named `mfcc_data_{lang}_v1.csv` in the `mfcc_dataframes` directory. Because each cell holds a nested list, the CSV stores it as text.

4. **Combining Processed DataFrames and Saving to CSV**:
    - It reads each CSV file for each language subset and concatenates them into a single DataFrame `mfcc_feature_dataframe`.
    - It drops any rows with NaN values.
    - It saves the combined DataFrame to a CSV file named `mfcc_feature_dataframe_v1.csv`.
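
One consequence of saving list-valued columns to CSV is that they come back as strings when the file is read again. The sketch below shows the round trip on a tiny made-up matrix (2 coefficients by 3 frames, standing in for the real 40-row matrices) and one way to recover a NumPy array with `ast.literal_eval`. It is an illustration of the pandas behavior, not code taken from the notebook.

```python
import ast
import io

import numpy as np
import pandas as pd

# Tiny stand-in for an MFCC matrix (2 coefficients x 3 frames), stored as nested lists.
frame = pd.DataFrame({'language_label': ['hindi'],
                      'mfcc_features': [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]})

# Write to an in-memory CSV and read it back, as the notebook does with files on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(type(restored.loc[0, 'mfcc_features']))  # <class 'str'>

# Parse the string back into nested lists, then into a 2-D array.
matrix = np.array(ast.literal_eval(restored.loc[0, 'mfcc_features']))
print(matrix.shape)  # (2, 3)
```

For larger datasets a binary format such as `.npy` avoids the parsing step, at the cost of keeping features and labels in separate files.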


In [18]:
def get_mfcc_features(file_name: str) -> list:
    """Return the full MFCC matrix (40 coefficients x n_frames) as nested lists."""
    audio_data, sample_rate = librosa.load(file_name, sr=22050)

    return librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40).tolist()

language_dataframe = language_dataframe.groupby('language_label').apply(lambda row: row.sample(n=500)).reset_index(drop=True)

for lang in language_dataframe['language_label'].unique():
    print(f'Currently Running: {lang}')

    lang_data = language_dataframe[language_dataframe['language_label'] == lang].copy()
    lang_data['mfcc_features'] = lang_data['file_name'].apply(lambda file_name: get_mfcc_features(file_name=file_name))

    lang_data.to_csv(f'../data/model_data/mfcc_dataframes/mfcc_data_{lang}_v1.csv', index=False)

mfcc_feature_dataframe = pd.concat([pd.read_csv(f'../data/model_data/mfcc_dataframes/mfcc_data_{lang}_v1.csv') for lang in language_dataframe['language_label'].unique()], ignore_index=True).dropna()
mfcc_feature_dataframe.to_csv('../data/model_data/mfcc_feature_dataframe_v1.csv', index=False)

Currently Running: bengali
Currently Running: english
Currently Running: german
Currently Running: gujarati
Currently Running: hindi
Currently Running: kannada
Currently Running: malayalam
Currently Running: marathi
Currently Running: spanish
Currently Running: tamil
Currently Running: telugu
Currently Running: urdu
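
The languages are processed in alphabetical order here because `groupby` sorts its keys by default. The per-language sampling itself can also be written with `DataFrameGroupBy.sample`, which accepts a `random_state` so the same files are drawn on every run. The sketch below is an alternative formulation on a toy DataFrame with hypothetical file names, not the notebook's code.

```python
import pandas as pd

# Toy data with hypothetical file names: 5 urdu, 4 hindi, and 3 tamil clips.
toy = pd.DataFrame({
    'file_name': [f'clip_{i}.flac' for i in range(12)],
    'language_label': ['urdu'] * 5 + ['hindi'] * 4 + ['tamil'] * 3,
})

# Draw n rows per language; random_state makes the draw repeatable.
subset = toy.groupby('language_label').sample(n=2, random_state=42).reset_index(drop=True)

print(subset['language_label'].value_counts().to_dict())
```

In the notebook the equivalent call would use `n=500`; any language with fewer than `n` files would raise an error unless `replace=True` is passed.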


### Spectrogram Images

In [19]:
language_dataframe = pd.read_csv('../data/model_data/language_dataframe_v2.csv')

This code snippet defines a function `save_audio_spectrogram_plot` and uses it to generate and save spectrogram images for audio files. Here's a breakdown:

1. **Defining Function to Save Spectrogram**:
    - The function `save_audio_spectrogram_plot` takes four parameters: `file_name` (path to the audio file), `idx` (index of the file), `label` (language label of the file), and `figsize` (optional tuple specifying the size of the figure, default is (5, 5)).
    - It loads the audio data using `librosa.load()` with the original sample rate.
    - It creates a new figure with a specified size and no frame.
    - It generates the spectrogram plot using `plt.specgram()`, which computes and plots the spectrogram of the audio data.
    - It saves the spectrogram as a PNG image at 64 dpi (320 × 320 pixels for the default 5 × 5 inch figure) in a directory named after the language label, with the file index as the filename.

2. **Processing Audio Files and Generating Spectrograms**:
    - It first subsamples the `language_dataframe` by randomly selecting 500 files for each language. This reduces the computational load and gives every language the same number of images.
    - It iterates over each unique language label in the subsampled `language_dataframe`.
    - For each language, it creates a subset DataFrame `lang_data` containing the 500 sampled files for that language.
    - It iterates over each file in `lang_data['file_name']`.
    - If the directory corresponding to the language label does not exist, it creates one.
    - It calls the `save_audio_spectrogram_plot` function to generate and save the spectrogram image for each audio file in the language subset.
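
The frameless-figure approach can be tried without any audio files. The sketch below follows the same steps on a synthetic one-second 440 Hz tone (a stand-in for speech), writes the PNG to memory instead of disk, and reads the image size from the PNG header. It uses the non-interactive `Agg` backend, which is an assumption made so the snippet runs without a display.

```python
import io
import struct

import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)  # 1 s synthetic tone as a stand-in for speech

fig = plt.figure(figsize=(5, 5), frameon=False)
ax = plt.Axes(fig, [0., 0., 1., 1.])  # axes fill the whole figure, leaving no margins
ax.set_axis_off()
fig.add_axes(ax)
ax.specgram(audio, Fs=sample_rate)

buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=64)
plt.close(fig)

# A PNG stores width and height as big-endian 32-bit integers at bytes 16 to 23.
width, height = struct.unpack('>II', buffer.getvalue()[16:24])
print(width, height)  # 320 320
```

With no axes, ticks, or margins, every pixel of the saved image carries spectrogram data, and the fixed 320 × 320 size gives all images the same dimensions.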


In [21]:
def save_audio_spectrogram_plot(file_name: str, idx: int, label: str, figsize: tuple = (5, 5)):
    audio_data, sample_rate = librosa.load(file_name, sr=None)

    fig = plt.figure(figsize=figsize, frameon=False)
    ax = plt.Axes(fig, [0., 0., 1., 1.])
    ax.set_axis_off()
    fig.add_axes(ax)

    plt.specgram(audio_data, Fs=sample_rate)

    fig.savefig(f'../data/model_data/spectrogram_images/{label}/{idx}.png', dpi=64)
    plt.close()

language_dataframe = language_dataframe.groupby('language_label').apply(lambda row: row.sample(n=500)).reset_index(drop=True)

for lang in language_dataframe['language_label'].unique():
    lang_data = language_dataframe[language_dataframe['language_label'] == lang].copy()

    # Create the output directory for this language once, if it does not exist yet
    os.makedirs(f'../data/model_data/spectrogram_images/{lang}', exist_ok=True)

    for idx, file_path in tqdm(enumerate(lang_data['file_name'].values), total=len(lang_data), desc=f"Processing {lang}"):
        save_audio_spectrogram_plot(file_path, idx, lang)

Processing bengali: 100%|██████████| 500/500 [00:31<00:00, 15.93it/s]
Processing english: 100%|██████████| 500/500 [00:30<00:00, 16.32it/s]
Processing german: 100%|██████████| 500/500 [00:30<00:00, 16.30it/s]
Processing gujarati: 100%|██████████| 500/500 [00:31<00:00, 15.77it/s]
Processing hindi: 100%|██████████| 500/500 [00:32<00:00, 15.52it/s]
Processing kannada: 100%|██████████| 500/500 [00:29<00:00, 17.23it/s]
Processing malayalam: 100%|██████████| 500/500 [00:33<00:00, 15.12it/s]
Processing marathi: 100%|██████████| 500/500 [00:33<00:00, 15.04it/s]
Processing spanish: 100%|██████████| 500/500 [00:29<00:00, 17.04it/s]
Processing tamil: 100%|██████████| 500/500 [00:33<00:00, 14.72it/s]
Processing telugu: 100%|██████████| 500/500 [00:30<00:00, 16.35it/s]
Processing urdu: 100%|██████████| 500/500 [00:31<00:00, 15.98it/s]
