<h1 style="color:orange;font-size:40px;font-weight:bold">Deep learning model</h1>

<p style="color:orange;font-size:14;font-style:italic">Finally, we build our deep learning model.</p>

<p style="color:orange;font-size:20;font-weight:bold">Imports of libraries</p>

In [39]:
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Activation, BatchNormalization, LSTM, Dropout, Bidirectional
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer  # Same tensorflow.keras namespace as the other Keras imports
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

<p style="color:orange;font-size:20;font-weight:bold">Data Loading : Feature and Target definitions</p>

In [2]:
filepath = "data/lyrics-data.csv" # Path to the data in the CSV file
data = preprocessing.load_preprocessed_data(filepath, ["drake", "kanye west", "50 cent", "taylor swift", "celine dion", "rihanna"]) # Load the data with the preprocessing pipeline

X = data["Lyric"] # Feature
y = data["Artist"] # Target
tfidf = TfidfVectorizer()  # TF-IDF vectorizer
X_vectorized = tfidf.fit_transform(X)  # Apply TF-IDF vectorization to the text data

<p style="color:orange;font-size:20;font-weight:bold">Dimensions Exploration</p>

In [3]:
print(f"Estimated number of unique words (vocabulary size): {len(tfidf.vocabulary_)}")
print(f"Number of documents: {X_vectorized.shape[0]}")
print(f"Dimensionality of TF-IDF vectors: {X_vectorized.shape[1]}")
print(f"Estimated maximum number of characters for a Lyric : { data['Lyric_Length'].max()}")

Estimated number of unique words (vocabulary size): 23306
Number of documents: 2100
Dimensionality of TF-IDF vectors: 23306
Estimated maximum number of characters for a Lyric : 13415
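
The longest lyric has 13415 characters, but the sequence models further down need a length in tokens, not characters. Below is a minimal sketch for picking a `maxlen` that covers most lyrics. It assumes whitespace splitting is a fair approximation of the Tokenizer's word count; `length_percentile` and `toy_lyrics` are hypothetical names used only for illustration.

```python
import math

def length_percentile(texts, pct):
    """Return the token count that covers roughly `pct` percent of the texts.

    Tokens are approximated by whitespace splitting (an assumption: the Keras
    Tokenizer also strips punctuation, so its counts can differ slightly).
    """
    lengths = sorted(len(t.split()) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank percentile: smallest length covering at least pct% of texts
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]

# Toy stand-in for data["Lyric"] (illustrative only)
toy_lyrics = ["la la la", "one two", "a b c d e f g h", "hello", "we will rock you now"]
print(length_percentile(toy_lyrics, 50))   # median token count
print(length_percentile(toy_lyrics, 100))  # longest text
```

On the real data, something like `length_percentile(X, 95)` would give a candidate `maxlen` that truncates only the longest 5% of lyrics.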


<p style="color:orange;font-size:20;font-weight:bold">Report Function: Classification Report and Confusion Matrix</p>

In [4]:
def report(y_true, y_pred, class_labels):
    """
    Generate and display a classification report along with a confusion matrix.

    Parameters:
        y_true (list): True labels.
        y_pred (list): Predicted labels.
        class_labels (list): List of class labels.

    Returns:
        None
    """
    # Print classification report
    print(classification_report(y_true, y_pred, labels=class_labels, zero_division=0))

    # Configure confusion matrix display options
    confusion_matrix_kwargs = dict(
        text_auto=True,
        title="Confusion Matrix",
        width=1000,
        height=800,
        labels=dict(x="Predicted", y="True Label"),
        x=class_labels,
        y=class_labels,
        color_continuous_scale='Blues'
    )

    # Generate and display confusion matrix using Plotly
    c_m = confusion_matrix(y_true, y_pred, labels=class_labels)
    fig = px.imshow(c_m, **confusion_matrix_kwargs)
    fig.show()
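
To make the plot easier to read: `confusion_matrix(y_true, y_pred, labels=class_labels)` returns a matrix whose rows are the true labels and whose columns are the predicted labels, both in the order of `class_labels`. Here is a minimal pure-Python sketch of that counting on toy labels. `count_confusions` is a hypothetical helper, not the scikit-learn implementation.

```python
def count_confusions(y_true, y_pred, labels):
    """Count (true, predicted) pairs: rows follow true labels, columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        # Pairs involving a label outside `labels` are skipped
        if true in index and pred in index:
            matrix[index[true]][index[pred]] += 1
    return matrix

labels = ["drake", "rihanna"]
y_true = ["drake", "drake", "rihanna", "rihanna"]
y_pred = ["drake", "rihanna", "rihanna", "rihanna"]
for row in count_confusions(y_true, y_pred, labels):
    print(row)
```

The first row reads: of the two true "drake" samples, one was predicted as "drake" and one as "rihanna". Off-diagonal cells are the misclassifications.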

<p style="color:orange;font-size:26px;font-weight:bold">LSTM Layer Implementation - No Other Modification Added</p>

In [15]:
# Tokenize the text data
tokenizer = Tokenizer(num_words=20000)  # Initialize a tokenizer with a specified vocabulary size
tokenizer.fit_on_texts(X)  # Use the preprocessed data to update the tokenizer's word index
X_sequences = tokenizer.texts_to_sequences(X)  # Convert text data to sequences of numeric indices
X_padded = pad_sequences(X_sequences, maxlen=700)  # Pad sequences to ensure uniform length

# Encode target labels numerically
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Build an LSTM model
model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=128, input_length=700))  # Embedding layer for word representation
model.add(LSTM(128))  # Long Short-Term Memory (LSTM) layer for sequence modeling
model.add(Dense(6, activation='softmax'))  # Output layer with softmax activation for multi-class classification

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# One-hot encode the target labels
y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes=6)  # Convert training labels to one-hot format
y_test_onehot = tf.keras.utils.to_categorical(y_test, num_classes=6)  # Convert testing labels to one-hot format

# Train the model
model.fit(X_train, y_train_onehot, epochs=10, batch_size=32, validation_split=0.15)  # Fit the model to the training data

# Evaluate the model
y_pred_prob = model.predict(X_test)  # Predict probabilities for the test set
y_pred = y_pred_prob.argmax(axis=1)  # Convert probabilities to class labels
class_labels = list(label_encoder.classes_)  # Class names, in the order assigned by the encoder

# Map numeric labels back to the artist names
y_pred = label_encoder.inverse_transform(y_pred)  # Predicted class names
y_test = label_encoder.inverse_transform(y_test)  # True class names

# Generate a classification report to evaluate model performance
report(y_test, y_pred, class_labels)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
              precision    recall  f1-score   support

     50 cent       0.76      0.76      0.76        93
 celine dion       0.71      0.66      0.68        80
       drake       0.57      0.27      0.37        63
  kanye west       0.39      0.55      0.45        58
     rihanna       0.40      0.33      0.36        49
taylor swift       0.59      0.75      0.66        77

    accuracy                           0.59       420
   macro avg       0.57      0.55      0.55       420
weighted avg       0.60      0.59      0.58       420
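
The two averaged rows of the report can be reproduced by hand, which shows what each one emphasises. This is a small arithmetic sketch using the rounded per-class F1 scores and supports from the report above, so it matches the report only up to rounding.

```python
# Per-class F1 scores and supports copied from the report above
f1 = {"50 cent": 0.76, "celine dion": 0.68, "drake": 0.37,
      "kanye west": 0.45, "rihanna": 0.36, "taylor swift": 0.66}
support = {"50 cent": 93, "celine dion": 80, "drake": 63,
           "kanye west": 58, "rihanna": 49, "taylor swift": 77}

total = sum(support.values())
macro_f1 = sum(f1.values()) / len(f1)                      # every class counts equally
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total  # classes weighted by support

print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 420 0.55 0.58
```

The macro average is lower than the weighted one because the weakest classes (drake, rihanna) are also among the smallest, so they pull down the unweighted mean more.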



<p style="color:orange;font-size:20;font-weight:bold">This model performs worse than the first baseline model. Training takes a long time, so I could not tune the model or test other parameters.</p>
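
One lever on training time is the padded sequence length, since a recurrent layer processes every timestep in turn. It also helps to know what `pad_sequences(X_sequences, maxlen=700)` does with its defaults (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long ones keep only their last `maxlen` tokens. Below is a minimal pure-Python sketch of that default behavior. `pad_pre` is a hypothetical stand-in, not the Keras implementation, and it assumes `maxlen` is positive.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # 'pre' truncation drops the beginning
        padded.append([value] * (maxlen - len(kept)) + kept)  # 'pre' padding adds zeros on the left
    return padded

print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 6], [3, 4, 5, 6]]
```

With the defaults, the opening lines of a lyric longer than 700 tokens are discarded. Passing `truncating='post'` to `pad_sequences` would keep the beginning instead.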

# --------------------------------------------------------------------------------------------
# --------------------------------------------------------------------------------------------

<p style="color:orange;font-size:26px;font-weight:bold">LSTM Layer Implementation - Early Stopping + Oversampling + RNN Layers</p>

In [34]:
# EarlyStopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)  # Set up early stopping to prevent overfitting

# Tokenize the text data
tokenizer = Tokenizer(num_words=23306)  # Vocabulary size taken from the TF-IDF exploration above
tokenizer.fit_on_texts(X)  # Fit tokenizer on the preprocessed data
X_sequences = tokenizer.texts_to_sequences(X)  # Convert text data to sequences
X_padded = pad_sequences(X_sequences, maxlen=2048)  # Pad sequences to a fixed length

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # Encode target labels

# Split the data into training and testing sets before applying SMOTE
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Apply SMOTE oversampling to the training set
smote = SMOTE(random_state=42)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)

# Build a stacked SimpleRNN + LSTM model
model = Sequential()
model.add(Embedding(input_dim=23306, output_dim=64, input_length=2048))  # Embedding layer for word representation
model.add(SimpleRNN(1024, activation='relu', return_sequences=True, name='rnn1'))  # First SimpleRNN layer
model.add(SimpleRNN(512, activation='relu', return_sequences=True, name='rnn2'))  # Second SimpleRNN layer; returns the full sequence because the LSTM below expects 3D input
model.add(LSTM(64))  # LSTM layer
model.add(Dense(6, activation='softmax'))  # Output layer with softmax activation

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # Compile the model with categorical crossentropy loss and Adam optimizer

# One-hot encode the target labels
y_train_onehot = tf.keras.utils.to_categorical(y_train_over, num_classes=6)  # One-hot encode training labels
y_test_onehot = tf.keras.utils.to_categorical(y_test, num_classes=6)  # One-hot encode testing labels

# Train the model
model.fit(X_train_over, y_train_onehot, epochs=10, batch_size=32, validation_split=0.15, callbacks=[early_stopping])  # Train the model with early stopping callback

# Evaluate the model
y_pred_prob = model.predict(X_test)
y_pred = y_pred_prob.argmax(axis=1)  # Convert probabilities to class labels
class_labels = list(label_encoder.classes_)  # Class names, in the order assigned by the encoder

# Map numeric labels back to the artist names
y_pred = label_encoder.inverse_transform(y_pred)  # Predicted class names
y_test = label_encoder.inverse_transform(y_test)  # True class names

report(y_test, y_pred, class_labels)  # Report evaluation metrics

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
              precision    recall  f1-score   support

     50 cent       0.82      0.82      0.82        93
 celine dion       0.72      0.70      0.71        80
       drake       0.47      0.52      0.50        63
  kanye west       0.49      0.36      0.42        58
     rihanna       0.57      0.51      0.54        49
taylor swift       0.65      0.78      0.71        77

    accuracy                           0.65       420
   macro avg       0.62      0.62      0.61       420
weighted avg       0.64      0.65      0.64       420
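
A note on the oversampling step in the cell above: SMOTE builds synthetic samples by interpolating between a sample and one of its nearest neighbours. On padded token-ID sequences, the interpolated values no longer correspond to real words. A simpler alternative (a swapped-in technique, not what this notebook uses) is random oversampling, which duplicates existing minority-class rows. Below is a minimal sketch using only the standard library. `random_oversample` and the toy arrays are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # reuse a real sequence instead of interpolating
            y_out.append(label)
    return X_out, y_out

# Toy padded sequences: three samples of class 0, one of class 1
X_toy = [[0, 4, 9], [0, 7, 2], [3, 3, 1], [0, 0, 8]]
y_toy = [0, 0, 0, 1]
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal))
```

imbalanced-learn ships a `RandomOverSampler` with the same `fit_resample` interface as `SMOTE`. Either way, Keras takes `validation_split` from the last rows of the training arrays, so shuffling after oversampling may be worthwhile to keep the validation slice from consisting mostly of resampled rows.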



<p style="color:orange;font-size:20;font-weight:bold">Even with oversampling and early stopping, performance improved only modestly (accuracy went from 0.59 to 0.65). I think this is due to the aggressive compression when going from a very large input to a much smaller layer (128 units in the first model, 64 in this one). This reduction may be causing a loss of information.</p>
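
To put numbers on the layer sizes involved, the trainable parameters of each layer can be worked out by hand. This is a rough sketch, assuming the default Keras layouts (one bias vector per gate, no extra connections). The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + bias
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    # four gates, each shaped like a SimpleRNN cell
    return 4 * simple_rnn_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

# Second model: Embedding(23306, 64) -> SimpleRNN(1024) -> SimpleRNN(512) -> LSTM(64) -> Dense(6)
layers = {
    "embedding": embedding_params(23306, 64),
    "rnn1": simple_rnn_params(64, 1024),
    "rnn2": simple_rnn_params(1024, 512),
    "lstm": lstm_params(512, 64),
    "dense": dense_params(64, 6),
}
for name, count in layers.items():
    print(f"{name:>9}: {count:,}")
```

Most of the weights sit in the embedding and the two SimpleRNN layers. The final LSTM and Dense layers are comparatively small, so the bottleneck the text describes sits at the end of the stack.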

<p style="color:orange;font-size:20;font-weight:bold">Here I combined SimpleRNN layers with an LSTM layer because stacking additional LSTM layers takes too long to train. If you have any feedback on how I could improve this deep learning model, I would be grateful!</p>