<h1 style="color:orange;font-size:40px;font-weight:bold">Select a baseline model</h1>

<p style="color:orange;font-size:14;font-style:italic">Now that we have a script that loads our data correctly, we can try several ML models and keep the one with the best results as our baseline.</p>

<p style="color:orange;font-size:20;font-weight:bold">Imports of libraries</p>

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

<p style="color:orange;font-size:20;font-weight:bold">Data Loading</p>

In [17]:
filepath = "data/lyrics-data.csv" # Path to the data in the CSV file
data = preprocessing.load_preprocessed_data(filepath, ["drake", "kanye west", "50 cent", "taylor swift", "celine dion", "rihanna"]) # Load the data with the preprocessing pipeline

In [24]:
data # Display our data

Unnamed: 0,Artist,Title,Lyric,Lyric_Length,Artist_Encoded
0,rihanna,love brain,get like oh want want try buy pretty heart pri...,1597,4
1,rihanna,umbrella feat jay z,jayz aham aham yeah rihanna aham aham good gir...,2565,4
2,rihanna,diamond,shine bright like diamond shine bright like di...,2041,4
3,rihanna,stay feat mikky ekko,along fever cold sweat hotheaded believer thro...,1140,4
4,rihanna,girl world,want love like im hot ride keep thinkin doin l...,1883,4
...,...,...,...,...,...
2095,celine dion,shake night long,fast machine keep motor clean best damn woman ...,1256,1
2096,celine dion,light,hey man come theres secret wan na whisper ear ...,2032,1
2097,celine dion,ziggy english version,oh ziggy call ziggy im hot hes like rest hold ...,995,1
2098,celine dion,ziggy garcon pa comme autres,ziggy sappelle ziggy folle garcon pa comme aut...,1015,1


<p style="color:orange;font-size:20;font-weight:bold">Model Class (also added to preprocessing.py for use in other notebooks)</p>

In [25]:
class Model:
    def __init__(self, X, y, model_architecture, vectorizer, random_seed=42, test_size=0.15) -> None:
        """
        Initialize the Model class.

        Parameters:
            X (list or array-like): Input features.
            y (list or array-like): Target labels.
            model_architecture (sklearn estimator): Classifier model architecture.
            vectorizer (sklearn transformer): Text vectorization method.
            random_seed (int): Random seed for reproducibility.
            test_size (float): Fraction of the data to be used as test set.

        Returns:
            None
        """
        self.X = X
        self.y = y
        self.model_instance = model_architecture
        self.vectorizer = vectorizer
        self.random_seed = random_seed
        self.test_size = test_size  

        # Define the pipeline
        self.pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', model_architecture)
        ])

        # Split the data into training and testing sets
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_seed, stratify=y
        )

    def fit(self):
        """
        Fit the pipeline to the training data.

        Returns:
            None
        """
        self.pipeline.fit(self.X_train, self.y_train)

    def predict(self):
        """
        Predict the target labels on the test data.

        Returns:
            array: Predicted labels.
        """
        return self.pipeline.predict(self.X_test)
    
    def predict_proba(self):
        """
        Predict class probabilities on the test data.

        Returns:
            array: Predicted class probabilities.
        """
        return self.pipeline.predict_proba(self.X_test)

    def report(self, y_true, y_pred, class_labels):
        """
        Display a classification report and confusion matrix.

        Parameters:
            y_true (list or array-like): True labels.
            y_pred (list or array-like): Predicted labels.
            class_labels (list): List of class labels.

        Returns:
            None
        """
        # Print classification report
        print(classification_report(y_true, y_pred, target_names=class_labels))
        
        # Create the confusion matrix
        cm = confusion_matrix(y_true, y_pred)
        
        # Display the confusion matrix using Plotly
        confusion_matrix_kwargs = dict(
            text_auto=True, 
            title="Confusion Matrix", width=1000, height=800,
            labels=dict(x="Predicted", y="True Label"),
            color_continuous_scale='Blues'
        )
        fig = px.imshow(
            cm,
            **confusion_matrix_kwargs,
            x=class_labels,
            y=class_labels
        )
        fig.show()
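
<p style="color:orange;font-size:14;font-style:italic">The Model class passes stratify=y to train_test_split so that each artist keeps roughly the same share of songs in the training and test sets. The sketch below illustrates this on made-up labels; the toy labels, their 60/30/10 counts and the 0.2 test size are assumptions chosen for the example, not values from the lyrics data.</p>

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced toy labels (not the lyrics dataset): 60 / 30 / 10
y_toy = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
X_toy = [f"doc {i}" for i in range(len(y_toy))]

# Same call shape as in Model.__init__, with stratify set to the labels
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The 20-sample test set keeps the original 60/30/10 class proportions
test_counts = Counter(y_test)
print(test_counts)
```

<p style="color:orange;font-size:14;font-style:italic">Without stratification, a small class could end up under-represented in the test set, which would make its per-class scores less reliable.</p>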


<p style="color:orange;font-size:20;font-weight:bold">Definition of the target, the feature and encoding</p>

In [22]:
X = data["Lyric"] # Feature
y = data["Artist"] # Target

label_encoder = LabelEncoder() # Instantiate a label encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
data['Artist_Encoded'] = label_encoder.fit_transform(data['Artist']) # Fit the encoder on the labels and encode them
y_encoded = data['Artist_Encoded'] # Encoded target

class_labels = label_encoder.classes_ # Class names, ordered by their encoded value (no hard-coded class count)
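
<p style="color:orange;font-size:14;font-style:italic">LabelEncoder assigns integer codes in sorted label order, which is why rihanna maps to 4 and celine dion to 1 in the table above. The sketch below shows this on a short, made-up list of artists; the classes_ attribute exposes the same ordering without needing to know the number of classes in advance.</p>

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short list of artists, for illustration only
artists = ["rihanna", "drake", "50 cent", "drake", "taylor swift"]

encoder = LabelEncoder()
codes = encoder.fit_transform(artists)

# classes_ holds the sorted unique labels; position i is the label for code i
class_names = list(encoder.classes_)
print(class_names)
print(list(codes))
```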

<p style="color:orange;font-size:20;font-weight:bold">Predictions made by different models</p>

In [9]:
# Using Multinomial Naive Bayes
print("\n---------------------------------------------------------------------------------------\n---------------------------------------------------------------------------------------\n\nMultinomial Naive Bayes\n")
model_mn = Model(X, y_encoded, MultinomialNB(), TfidfVectorizer()) # Create a model
model_mn.fit() # Fit the model
y_pred = model_mn.predict() # Predict results
model_mn.report(model_mn.y_test, y_pred, class_labels) # Generate classification report

# Using Gradient Boosting Classifier
print("\n---------------------------------------------------------------------------------------\n---------------------------------------------------------------------------------------\n\nGradient Boosting Classifier\n")
model_gb = Model(X, y_encoded, GradientBoostingClassifier(), TfidfVectorizer()) # Create a model (encoded target, as for the first model)
model_gb.fit() # Fit the model
y_pred_gb = model_gb.predict() # Predict results
model_gb.report(model_gb.y_test, y_pred_gb, class_labels) # Generate classification report

# Using Random Forest Classifier
print("\n---------------------------------------------------------------------------------------\n---------------------------------------------------------------------------------------\n\nRandom Forest Classifier\n")
model_rf = Model(X, y_encoded, RandomForestClassifier(), TfidfVectorizer()) # Create a model (encoded target, as for the first model)
model_rf.fit() # Fit the model
y_pred_rf = model_rf.predict() # Predict results
model_rf.report(model_rf.y_test, y_pred_rf, class_labels) # Generate classification report

# Using Logistic Regression
print("\n---------------------------------------------------------------------------------------\n---------------------------------------------------------------------------------------\n\nLogistic Regression\n")
model_lr = Model(X, y_encoded, LogisticRegression(), TfidfVectorizer()) # Create a model (encoded target, as for the first model)
model_lr.fit() # Fit the model
y_pred_lr = model_lr.predict() # Predict results
model_lr.report(model_lr.y_test, y_pred_lr, class_labels) # Generate classification report


---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------

Multinomial Naive Bayes





              precision    recall  f1-score   support

     50 cent       0.39      0.99      0.56        70
 celine dion       1.00      0.50      0.67        60
       drake       1.00      0.11      0.19        47
  kanye west       0.33      0.02      0.04        44
     rihanna       0.00      0.00      0.00        36
taylor swift       0.52      0.90      0.66        58

    accuracy                           0.50       315
   macro avg       0.54      0.42      0.35       315
weighted avg       0.57      0.50      0.41       315




---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------

Gradient Boosting Classifier

              precision    recall  f1-score   support

     50 cent       0.94      0.94      0.94        70
 celine dion       0.77      0.82      0.79        60
       drake       0.93      0.83      0.88        47
  kanye west       0.83      0.68      0.75        44
     rihanna       0.66      0.58      0.62        36
taylor swift       0.63      0.78      0.70        58

    accuracy                           0.79       315
   macro avg       0.79      0.77      0.78       315
weighted avg       0.80      0.79      0.79       315




---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------

Random Forest Classifier

              precision    recall  f1-score   support

     50 cent       0.79      0.97      0.87        70
 celine dion       0.86      0.72      0.78        60
       drake       0.84      0.57      0.68        47
  kanye west       0.77      0.55      0.64        44
     rihanna       0.68      0.58      0.63        36
taylor swift       0.64      0.93      0.76        58

    accuracy                           0.75       315
   macro avg       0.76      0.72      0.73       315
weighted avg       0.77      0.75      0.74       315




---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------

Logistic Regression

              precision    recall  f1-score   support

     50 cent       0.89      0.94      0.92        70
 celine dion       0.77      0.78      0.78        60
       drake       0.79      0.79      0.79        47
  kanye west       0.77      0.52      0.62        44
     rihanna       0.64      0.50      0.56        36
taylor swift       0.67      0.86      0.75        58

    accuracy                           0.77       315
   macro avg       0.75      0.73      0.74       315
weighted avg       0.77      0.77      0.76       315
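
<p style="color:orange;font-size:14;font-style:italic">The all-zero row for rihanna in the Multinomial Naive Bayes report most likely means that the model never predicted that class on the test set, so its precision is undefined and is reported as 0. The sketch below reproduces this situation on made-up labels and uses the zero_division parameter of classification_report to make the handling explicit; the labels and class names are assumptions for the example.</p>

```python
from sklearn.metrics import classification_report

# Hypothetical labels: class 2 appears in y_true but is never predicted
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 0, 1]

# zero_division=0 reports 0.0 for the undefined precision without a warning
report = classification_report(
    y_true,
    y_pred,
    target_names=["a", "b", "c"],
    zero_division=0,
    output_dict=True,
)
print(report["c"])
```

<p style="color:orange;font-size:14;font-style:italic">A class that is never predicted pulls the macro average down sharply, which is consistent with the low macro F1 of the Naive Bayes model.</p>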



<p style="color:orange;font-size:20;font-weight:bold">Although it was time-consuming to train, Gradient Boosting gives the best results here (0.79 accuracy and 0.78 macro F1, against 0.77 and 0.74 for Logistic Regression). We keep this model and improve it in the next notebook.</p>
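
<p style="color:orange;font-size:14;font-style:italic">To make the trade-off between training time and score explicit, the candidates can be fitted in a loop that records both. This is a minimal sketch on a made-up two-class corpus, not the lyrics data; the texts, the label names, the choice of candidates and max_iter=1000 are assumptions for the example.</p>

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the lyrics data
texts = ["money club street hustle", "love heart rain tears"] * 20
labels = ["rap", "pop"] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

candidates = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, estimator in candidates.items():
    # Same vectorizer + classifier pipeline shape as in the Model class
    pipeline = Pipeline([("vectorizer", TfidfVectorizer()), ("classifier", estimator)])

    # Time only the fit step
    start = time.perf_counter()
    pipeline.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    macro_f1 = f1_score(y_test, pipeline.predict(X_test), average="macro")
    results[name] = {"fit_seconds": fit_seconds, "macro_f1": macro_f1}

for name, scores in results.items():
    print(f"{name}: macro F1 = {scores['macro_f1']:.2f}, fit time = {scores['fit_seconds']:.3f}s")
```

<p style="color:orange;font-size:14;font-style:italic">Macro F1 is used here because it weights every artist equally, whatever the number of songs per artist.</p>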