<h1 style="color:orange;font-size:40px;font-weight:bold">Improving the baseline model</h1>

<p style="color:orange;font-size:14;font-style:italic">After choosing Gradient Boosting as our baseline model, we now try to improve its performance by balancing the classes and tuning some of its parameters.</p>

<p style="color:orange;font-size:20;font-weight:bold">Library imports</p>

In [2]:
import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

<p style="color:orange;font-size:20;font-weight:bold">Data loading: feature and target definitions</p>

In [3]:
filepath = "data/lyrics-data.csv" # Path to the data in the CSV file
data = preprocessing.load_preprocessed_data(filepath, ["drake", "kanye west", "50 cent", "taylor swift", "celine dion", "rihanna"]) # Load the data with the preprocessing pipeline

In [4]:
X = data["Lyric"] # Feature
y = data["Artist"] # Target
tfidf = TfidfVectorizer() # TF-IDF vectorizer
X_vectorized = tfidf.fit_transform(X) # Fit the vectorizer and transform the lyrics into a TF-IDF matrix
class_labels = ['50 cent', 'celine dion', 'drake', 'kanye west', 'rihanna', 'taylor swift'] # Class labels in sorted order, as used by the classification report

<p style="color:orange;font-size:28px;font-weight:bold">Oversampling</p>

<p style="color:orange;font-size:20;font-weight:bold">SMOTE (Synthetic Minority Over-sampling Technique) addresses class imbalance. It creates synthetic samples for the minority classes by interpolating between existing samples and their nearest neighbours, which reduces the model's bias toward the majority class.</p>
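
<p style="color:orange;font-size:20;font-weight:bold">To make the interpolation idea concrete, here is a minimal sketch of how a single synthetic sample could be built. It is not the imblearn implementation: the helper name smote_like_sample and the toy points are assumptions for illustration, and the neighbour search is skipped entirely.</p>

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and a neighbor.

    Hypothetical helper: real SMOTE first finds the nearest minority-class
    neighbours of x; here the neighbor is simply passed in.
    """
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
minority_point = [1.0, 2.0]     # toy minority-class sample
minority_neighbor = [3.0, 6.0]  # toy neighbour from the same class
synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)  # a point lying between the two toy samples
```

<p style="color:orange;font-size:20;font-weight:bold">Because each synthetic sample lies between two real samples of the same class, the minority class grows without exact duplicates.</p>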

In [5]:
smote = SMOTE(random_state=42) # SMOTE instance
X_oversampled, y_oversampled = smote.fit_resample(X_vectorized, y) # Apply SMOTE oversampling to the full vectorized dataset

model_over = preprocessing.Model(X_oversampled, y_oversampled, GradientBoostingClassifier()) # Instantiate the Model class 
model_over.fit() # Fit the model
y_pred = model_over.predict() # Make predictions
model_over.report(model_over.y_test, y_pred, class_labels) # Generate classification report

<p style="color:orange;font-size:20;font-weight:bold">Oversampling is a natural fit for our problem. For example, Rihanna's discography is almost half the size of 50 Cent's. Oversampling her songs puts both artists on an equal footing when the model is trained.</p>

<p style="color:orange;font-size:28px;font-weight:bold">Undersampling</p>

<p style="color:orange;font-size:20;font-weight:bold">RandomUnderSampler also addresses class imbalance, but from the other direction. It randomly removes samples from the majority classes until every class has the same size, which reduces bias toward the majority class at the cost of discarding data.</p>
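
<p style="color:orange;font-size:20;font-weight:bold">The idea can be sketched in a few lines of plain Python. This is an illustration only, not the imblearn implementation: the function random_undersample and the toy labels "a" and "b" are assumptions made for the example.</p>

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=42):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Size of the smallest class: every class is cut down to this size
    target = min(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        out_samples.extend(rng.sample(items, target))  # sample without replacement
        out_labels.extend([label] * target)
    return out_samples, out_labels

toy_samples = list(range(10))          # 10 toy samples
toy_labels = ["a"] * 7 + ["b"] * 3     # imbalanced: 7 vs 3
X_small, y_small = random_undersample(toy_samples, toy_labels)
counts = Counter(y_small)
print(counts)  # both classes now have 3 samples
```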

In [4]:
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42) # RandomUnderSampler instance
X_undersampled, y_undersampled = undersampler.fit_resample(X_vectorized, y) # Apply random undersampling to the full vectorized dataset

model_under = preprocessing.Model(X_undersampled, y_undersampled, GradientBoostingClassifier()) # Instantiate the Model class 
model_under.fit() # Fit the model
y_pred = model_under.predict() # Make predictions
model_under.report(model_under.y_test, y_pred, class_labels) # Generate classification report

              precision    recall  f1-score   support

     50 cent       1.00      0.86      0.93        36
 celine dion       0.75      0.75      0.75        36
       drake       0.78      0.86      0.82        37
  kanye west       0.93      0.68      0.78        37
     rihanna       0.65      0.76      0.70        37
taylor swift       0.64      0.73      0.68        37

    accuracy                           0.77       220
   macro avg       0.79      0.77      0.78       220
weighted avg       0.79      0.77      0.78       220



<p style="color:orange;font-size:20;font-weight:bold">Since the score is better with oversampling, we keep this technique to balance our classes.</p>
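
<p style="color:orange;font-size:20;font-weight:bold">As a reminder of how the scores in these reports relate to each other, the short sketch below recomputes an F1 score from precision and recall, and the macro average from the per-class F1 values printed in the undersampling report. The helper name f1 is an assumption for illustration. Since the inputs are the rounded values from the report, a recomputed score can differ from the printed one in the last digit.</p>

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall for '50 cent' in the undersampling report
f1_50_cent = f1(1.00, 0.86)

# Per-class f1-scores as printed in the undersampling report
reported_f1 = [0.93, 0.75, 0.82, 0.78, 0.70, 0.68]

# Macro average: unweighted mean over the classes
macro_f1 = sum(reported_f1) / len(reported_f1)

print(f1_50_cent)
print(round(macro_f1, 2))
```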

# --------------------------------------------------------------------------------------------
# --------------------------------------------------------------------------------------------

<p style="color:orange;font-size:28px;font-weight:bold">Hyperparameter Tuning</p>

In [6]:
# Define the parameters you want to search over
learning_rates = [0.2, 0.3]
n_estimators_values = [50, 100, 150]

# List to store results
results = []

# Loop over parameter combinations
for learning_rate in learning_rates:
    for n_estimators in n_estimators_values:

        # Instantiate the Model class with different parameters
        model_over = preprocessing.Model(X_oversampled, y_oversampled, GradientBoostingClassifier(learning_rate=learning_rate, n_estimators=n_estimators))

        # Fit the model
        model_over.fit()

        # Make predictions
        y_pred = model_over.predict()

        # Generate classification report
        report = classification_report(model_over.y_test, y_pred, target_names=class_labels)

        # Append results
        results.append({
            'learning_rate': learning_rate,
            'n_estimators': n_estimators,
            'classification_report': report
        })

# Print or analyze the results as needed
for result in results:
    print(f"Learning Rate: {result['learning_rate']}, N Estimators: {result['n_estimators']}")
    print(result['classification_report'])


Learning Rate: 0.2, N Estimators: 50
              precision    recall  f1-score   support

     50 cent       0.94      0.90      0.92        70
 celine dion       0.74      0.79      0.76        70
       drake       0.94      0.90      0.92        70
  kanye west       0.88      0.75      0.81        71
     rihanna       0.74      0.71      0.72        70
taylor swift       0.65      0.79      0.71        70

    accuracy                           0.81       421
   macro avg       0.81      0.81      0.81       421
weighted avg       0.82      0.81      0.81       421

Learning Rate: 0.2, N Estimators: 100
              precision    recall  f1-score   support

     50 cent       0.95      0.90      0.93        70
 celine dion       0.78      0.80      0.79        70
       drake       0.93      0.90      0.91        70
  kanye west       0.95      0.76      0.84        71
     rihanna       0.75      0.77      0.76        70
taylor swift       0.69      0.84      0.76        70

  

<p style="color:orange;font-size:20;font-weight:bold">We will save the model with a learning rate of 0.2 and n_estimators of 100, which has the best F1 score. 83% is a good accuracy score, but we could probably have obtained better results by trying other values for these parameters and by tuning additional ones. Unfortunately, the model takes a long time to train, so we will stop here.</p>
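
<p style="color:orange;font-size:20;font-weight:bold">Picking the best combination by reading the printed reports works for a small grid, but the selection can also be automated by storing one numeric score per combination. The sketch below shows one possible way to do this. The names grid_search and toy_score are assumptions for illustration, and toy_score is only a placeholder: it returns made-up numbers, not results from our model. In practice it would fit the model and return a metric such as the macro F1 score.</p>

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_score, best_params)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(**params), params))
    # Keep the combination with the highest score
    return max(results, key=lambda item: item[0])

def toy_score(learning_rate, n_estimators):
    # Placeholder scoring function (hypothetical, NOT real model results).
    # Replace with code that trains the model and returns a numeric metric.
    return -abs(learning_rate - 0.2) - abs(n_estimators - 100) / 1000

param_grid = {
    "learning_rate": [0.2, 0.3],
    "n_estimators": [50, 100, 150],
}
best_score, best_params = grid_search(param_grid, toy_score)
print(best_params)
```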