# __Kenyan Swahili Speech Emotion Recognition System__ 

### Researchers
1. Ian Korir - Group Leader
2. Hellen Samuel 
3. Gregory Mikuro 
4. Doreen Wanjiru
5. Esther Francis - Scrum Master

### Facilitator
Nikita Njoroge - Technical Mentor

## **1.0 Business Understanding**

### **1.1 Introduction**
In today’s customer service landscape, understanding and responding to customer emotions effectively can be a key differentiator for businesses. By integrating Speech Emotion Recognition (SER) technology into customer service call centers, companies can enhance the quality of interactions, tailoring responses to meet the emotional needs of customers. This project focuses on developing an SER model specifically designed to recognize emotions in Kenyan Swahili speech, thereby addressing a significant gap in existing emotion AI technologies and offering a tailored solution for the Kenyan market.

### **1.2 Background**
Customer service call centers often serve as the first point of contact between a company and its customers. In these interactions, understanding the emotional state of the customer can be crucial for delivering effective and empathetic service. However, existing SER systems are predominantly designed for languages like English and may not perform well with Swahili, the most widely spoken language in Kenya. Given the unique phonetic and prosodic characteristics of Swahili, there is a clear need for a localized SER model. This project aims to address this need by developing an SER model that can accurately classify emotions in Swahili speech, specifically within the context of customer service call centers.

### **1.3 Research Problem**
Customer service call centers in Kenya face challenges in effectively understanding and responding to customer emotions due to the lack of SER systems tailored to Swahili. This gap limits the ability of service agents to provide personalized and emotionally intelligent responses, which can negatively impact customer satisfaction and overall service quality. By focusing on the development of a Swahili-specific SER model, this project seeks to enhance the capability of call centers to manage customer interactions more effectively, ultimately improving the customer experience.

### **1.4 Problem Statement**
The absence of a robust, Swahili-specific Speech Emotion Recognition (SER) model hinders the ability of customer service call centers in Kenya to accurately detect and respond to customer emotions. This limitation can lead to suboptimal customer interactions and reduced satisfaction levels. Therefore, this project aims to develop an emotion recognition model tailored to the linguistic and cultural nuances of Kenyan Swahili, enabling more effective and empathetic customer service interactions in call centers.

### **1.5 Objectives**
1. **Data Collection and Preprocessing**: Collect, preprocess, and annotate a diverse dataset of Kenyan Swahili audio samples for five target emotions: Anger, Happiness, Sadness, Calmness, and Surprise.
2. **Acoustic Feature Analysis**: Extract and analyze acoustic features from the collected Swahili samples, examining relationships between these features and the corresponding emotions.
3. **Feature Selection and Data Augmentation**: Develop a feature selection methodology tailored to Swahili speech patterns and implement data augmentation techniques to enhance the diversity and robustness of the dataset.
4. **Model Development and Deployment**: Create and deploy a deep neural network-based classification model with a target accuracy of at least 80% for emotion recognition in Kenyan Swahili speech.


The key business driver is to enhance customer service quality by enabling real-time emotion detection, which can be used to optimize customer interactions, improve agent performance, and increase overall customer satisfaction.


### **1.6 Stakeholder**

## **2.0 Data Understanding**

### **2.1 Methodology**
The dataset for this project was collected from 240 participants, ensuring a balanced representation of gender (male and female) and the five target emotions (Anger, Happiness, Sadness, Calmness, and Surprise). Participants were volunteers who contributed Swahili speech samples reflecting these emotions in controlled settings to ensure consistency in the data collection process. The methodology also considered variations in speaker accents and dialects to enhance the model’s generalization capability.

### **2.2 Structure**
The dataset was structured to ensure a balanced distribution of speech samples across the five emotions and the two gender categories. This balanced structure is crucial for developing a model that performs well across different demographic groups and emotional states. Each participant provided multiple samples for each emotion, resulting in a comprehensive dataset that captures the nuances of Swahili speech across different emotional contexts. The collected data was systematically organized and labeled, facilitating efficient preprocessing and feature extraction in subsequent phases of the project.

## __3.0 Modeling__ 

### __3.1 Data Preprocessing__

In our approach to processing and managing audio data for emotion recognition, we implemented a multi-step pipeline involving data loading, cleaning, preprocessing, feature extraction, and finally, data saving and splitting. Here’s a breakdown of the strategy and the reasoning behind each step:

#### 3.1.1 **Data Loading**

**Process**: We first load the audio data from a specified directory in which the files are organized by emotion label. For each emotion, we read the audio files and store the raw audio and its corresponding label in lists.

**Reasoning**: 
- **Organization and Access**: We structured the data based on emotion categories to facilitate easy retrieval and management. This allows us to handle data more systematically and apply operations specific to each emotion.
- **Storage**: We use lists to store the raw audio data and corresponding labels, which makes it straightforward to process and access later.
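
The loading step described above can be sketched roughly as follows. This is a minimal illustration, not the `DataLoader` class from `main.py`: the function names, the `data/<emotion>/*.wav` layout, and the `.wav` extension are assumptions made for the example. The decoder is passed in as a parameter so that the directory walk can be exercised without audio files; the default decoder uses `librosa.load`.

```python
from pathlib import Path
from typing import Callable, List, Tuple

import numpy as np


def load_with_librosa(path: Path, sample_rate: int) -> np.ndarray:
    """Default decoder: read one file with librosa, resampled to sample_rate."""
    import librosa  # imported lazily so the directory walk works without it

    signal, _ = librosa.load(str(path), sr=sample_rate)
    return signal


def load_emotion_dataset(
    data_dir: str,
    emotions: List[str],
    sample_rate: int = 16000,
    loader: Callable[[Path, int], np.ndarray] = load_with_librosa,
) -> Tuple[List[np.ndarray], List[str]]:
    """Walk data_dir/<emotion>/*.wav (assumed layout) and return parallel lists."""
    signals: List[np.ndarray] = []
    labels: List[str] = []
    for emotion in emotions:
        emotion_dir = Path(data_dir) / emotion
        if not emotion_dir.is_dir():
            continue  # no folder for this emotion: skip it
        for wav_path in sorted(emotion_dir.glob("*.wav")):
            signals.append(loader(wav_path, sample_rate))
            labels.append(emotion)
    return signals, labels
```

Keeping the signals and labels in two parallel lists means index `i` in one always corresponds to index `i` in the other, which is what the later cleaning and feature-extraction steps rely on.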

#### 3.1.2 **Data Cleaning**

**Process**: After loading the audio files, we clean the data by applying noise reduction and trimming silence from the audio clips.

**Reasoning**:
- **Noise Reduction**: By reducing noise, we improve the quality of the audio, which can enhance the accuracy of subsequent feature extraction and emotion classification. This step helps in mitigating any distortions that might interfere with the analysis.
- **Silence Trimming**: Removing silence keeps the focus on the spoken content of each clip, so the model learns from the speech itself rather than from clip length or non-informative segments.
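
As a rough illustration of the cleaning step, the sketch below trims leading and trailing silence with a simple frame-energy threshold. It is a NumPy-only stand-in for a library routine such as `librosa.effects.trim`, not the `DataCleaner` implementation; the function names, the 30 dB threshold, and the 512-sample frame are assumptions. The noise-reduction wrapper assumes the `reduce_noise(y=..., sr=...)` interface of `noisereduce` version 2 or later.

```python
import numpy as np


def reduce_noise(signal: np.ndarray, sample_rate: int) -> np.ndarray:
    """Spectral-gating noise reduction (assumes the noisereduce >= 2.0 interface)."""
    import noisereduce as nr  # imported lazily; only needed for this step

    return nr.reduce_noise(y=signal, sr=sample_rate)


def trim_silence(signal: np.ndarray, top_db: float = 30.0,
                 frame_length: int = 512) -> np.ndarray:
    """Drop leading and trailing frames that are more than top_db dB below the loudest frame.

    Samples beyond the last full frame are discarded.
    """
    n_frames = len(signal) // frame_length
    if n_frames == 0:
        return signal  # clip shorter than one frame: leave it alone
    frames = signal[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))  # energy per frame
    peak = rms.max()
    if peak == 0:
        return signal  # all-silent clip: nothing to anchor the threshold
    threshold = peak * 10.0 ** (-top_db / 20.0)
    loud = np.flatnonzero(rms >= threshold)  # indices of non-silent frames
    start = loud[0] * frame_length
    end = (loud[-1] + 1) * frame_length
    return signal[start:end]
```

Only the edges are trimmed here; pauses inside the utterance are kept, since they may themselves carry emotional information.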

#### 3.1.3 **Audio Preprocessing**

**Process**: We pad or truncate the audio data to a fixed length to ensure consistency across all samples.

**Reasoning**:
- **Consistency**: Uniform length for all audio samples is essential for training machine learning models, as it ensures that the input features are consistent. Padding short samples and truncating long ones standardizes the input size and simplifies further processing.
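
A minimal sketch of the pad-or-truncate step is shown below. The function name and the choice to zero-pad at the end of the clip are assumptions for illustration; the target length used in the project is not stated here. `librosa.util.fix_length` offers similar behaviour.

```python
import numpy as np


def fix_length(signal: np.ndarray, target_length: int) -> np.ndarray:
    """Return a copy of signal that is exactly target_length samples long."""
    if len(signal) >= target_length:
        return signal[:target_length]  # truncate long clips
    pad_width = target_length - len(signal)
    return np.pad(signal, (0, pad_width), mode="constant")  # zero-pad short clips at the end


# Example: a hypothetical 3-second window at 16 kHz would be 48000 samples
short_clip = np.ones(5)
print(fix_length(short_clip, 8))     # five ones followed by three zeros
print(fix_length(np.arange(10), 4))  # first four samples only
```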

#### 3.1.4 **Feature Extraction**

**Process**: We extract Mel-Frequency Cepstral Coefficients (MFCCs) from the audio data. MFCCs are used to represent the audio signals in a way that captures the essential characteristics for emotion recognition.

**Reasoning**:
- **Feature Representation**: MFCCs are commonly used features in audio processing because they effectively capture the spectral properties of the audio signal, which are crucial for distinguishing between different emotions.
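- **Dimensionality Reduction**: Averaging the MFCCs across time collapses each clip into a single fixed-length vector with one value per coefficient. This keeps the average spectral shape of the clip and discards frame-by-frame dynamics, which shrinks the input and makes the models cheaper to train.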
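
The time-averaging described above can be sketched as follows. The function names and the choice of 13 coefficients are assumptions for illustration, and this is not necessarily how `FeatureExtractor` in `main.py` is written. `librosa.feature.mfcc` returns a matrix of shape `(n_mfcc, n_frames)`, so averaging over axis 1 leaves one value per coefficient.

```python
import numpy as np


def mean_pool(mfcc: np.ndarray) -> np.ndarray:
    """Collapse an (n_mfcc, n_frames) matrix into one (n_mfcc,) vector by averaging over time."""
    return mfcc.mean(axis=1)


def extract_mean_mfcc(signal: np.ndarray, sample_rate: int = 16000,
                      n_mfcc: int = 13) -> np.ndarray:
    """Compute MFCCs for one clip and average them across frames."""
    import librosa  # imported lazily; only needed when real audio is processed

    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mean_pool(mfcc)


# Toy matrix with 3 coefficients and 4 frames to show the shape change
toy = np.arange(12, dtype=float).reshape(3, 4)
print(mean_pool(toy).shape)  # (3,)
```

Because every clip ends up as a vector of the same length, the features can be stacked into a single 2-D array regardless of how many frames each clip had.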

#### 3.1.5 **Data Saving**

**Process**: We save the processed features and labels into CSV and NumPy files. This step also involves splitting the data into training, validation, and test sets.

**Reasoning**:
- **Persistence**: Saving the processed data ensures that we can reuse it without having to repeat the preprocessing steps. This is time-efficient and allows us to maintain consistency across different runs.
- **Data Splitting**: Splitting the data into training, validation, and test sets is crucial for evaluating the performance of our models. It ensures that the models are trained on one subset, validated on another, and tested on a separate, unseen subset.
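
One way to produce the three subsets is to call scikit-learn's `train_test_split` twice, as sketched below. The 70/15/15 proportions, the stratification by label, the fixed seed, and the function name are assumptions for illustration; the ratios actually used by `DataSaver.split_data()` are not stated in this section.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_features(X, y, val_size=0.15, test_size=0.15, seed=42):
    """Split into train/validation/test, keeping class proportions similar in each subset."""
    # First carve off the test set...
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # ...then split the remainder; val_size is rescaled because X_rest is smaller than X.
    val_fraction = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_fraction, stratify=y_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in for the real feature matrix: 100 clips, 13 features, 5 emotions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 13))
y_demo = np.repeat(np.array(["sad", "happy", "angry", "calm", "surprised"]), 20)
parts = split_features(X_demo, y_demo)
print([len(p) for p in parts[:3]])  # sizes of the train, validation, and test sets
```

Stratifying on the labels keeps all five emotions represented in every subset, which matters when the per-class sample count is small.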



```python
import os

import librosa
import matplotlib.pyplot as plt
import noisereduce as nr
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn.metrics import (auc, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_curve)
from sklearn.model_selection import train_test_split
# Keras is imported through tensorflow.keras throughout, rather than mixing it with standalone keras
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import LSTM, BatchNormalization, Dense, Dropout, Input
from tensorflow.keras.models import Sequential

# Project-specific pipeline classes defined in main.py
from main import (DataLoader, DataCleaner, AudioPreprocessor, FeatureExtractor,
                  EmotionLabeler, DataSaver, Modeling, TrainingWithCallbacks,
                  Evaluation, ModelSaver)
```

```python
data_dir = "data"
emotions = ["sad", "happy", "angry", "calm", "surprised"]
sample_rate = 16000

# Load, clean, and pad/truncate the audio
audio_preprocessor = AudioPreprocessor(data_dir, emotions, sample_rate, verbose=False)
X_processed, y_processed = audio_preprocessor.get_data()

# Initialize DataSaver class
data_saver = DataSaver(data_dir, emotions, sample_rate, verbose=False)
# Process, save, and split the data
data_saver.save_to_csv()  # Save to CSV
data_saver.save_to_npy()  # Save as .npy files
X_train, X_val, X_test, y_train, y_val, y_test = data_saver.split_data()  # Split the data
```
