# Creating the Model
In this Jupyter Notebook, we create a Naive Bayes model, train it on a stratified split of the processed dataset, and fine-tune it via `GridSearchCV`.

Using the methods in this notebook, we arrived at an accuracy of `0.750514050719671`.

## Part 1: Preliminaries

In [1]:
# Import necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import stopwords
import string

## Part 2: Reading the `processed_lyrics.csv` dataset
In this section, we load `processed_lyrics.csv` and split it into training and test data for the model.

In [2]:
# Load the CSV file
data = pd.read_csv('processed_lyrics.csv')

In [3]:
# Extract the column names
column_names = data.columns.tolist()

# Print the column names
print("Column Names:", column_names)

Column Names: ['title', 'tag', 'artist', 'year', 'views', 'features', 'lyrics', 'id', 'language_cld3', 'language_ft', 'language', 'lyric_length']


As the value counts for each genre below show, song entries with `tag = "pop"` *dominate* the dataset. Hence, to improve the accuracy of the model, we stratify the train/test split so that both sets keep the same genre proportions.

In [4]:
data.tag.value_counts()

tag
pop        3827
rap         884
rock        495
rb          181
misc         52
country      20
Name: count, dtype: int64

We also numerically label each entry using the mapping below for later processing, and drop the `views`, `language_cld3`, and `language_ft` columns, which the model does not use.

In [5]:
# Map each genre tag to an integer label
data['label_num'] = data.tag.map({'country':0, 'misc':1, 'pop':2, 'rap':3, 'rb':4, 'rock':5})
# Drop the columns that are not needed for modelling
data.drop(columns=['views', 'language_cld3', 'language_ft'], inplace=True)

In [6]:
data.head(10)

Unnamed: 0,title,tag,artist,year,features,lyrics,id,language,lyric_length,label_num
0,All star,rap,Enchant boyz,2012,{Lil_john},nang unang bibira mabomba bara mag ihaw gamit ...,64812,fil,112,3
1,Balak ni Syke,rap,Gloc-9,2012,{},alak balak lasing kasalukuyan ngunit malaman a...,85710,fil,158,3
2,Apatnapungbara,rap,Gloc-9,2012,"{""Ian Tayao""}",hook ian tayao akoy tutula mahaba nako umupo p...,85711,fil,205,3
3,Silup,rap,Gloc-9,2012,"{""Denise Barcena""}",hook denise mamang pulis pwede ba akong huming...,85713,fil,221,3
4,New Life Song Part Two,rap,R1 one 6 souljhaz,2012,{},choros dios buhay ikay pasasalamtan awitin nil...,102239,fil,40,3
5,New Life Song Part Two,rap,R.1 One Six Souljhaz,2012,{Van.rey},choros dios buhay ikay pasasalamtan awitin nil...,102242,fil,40,3
6,NutriJingle,rap,Kamikzee,2013,{},kinukumpleto mo araw tuwing hinahain mo pagkai...,198397,fil,45,3
7,Yeah,rock,Kamikazee,2013,{},1 kinukumpleto mo araw tuwing hinahain mo pagk...,198400,fil,47,5
8,The Bobo Song,rap,Loonie,2013,{},pinilakang tabing sariling natabunan telenovel...,217852,fil,397,3
9,BLKD vs Spade,rap,BLKD,2013,{},round 1 ginoong gio de leon angas mo wow pagbu...,217966,fil,654,3


At this point we separate the dataset into the feature `lyrics` (X) and the target `tag` (y), which are then split into training and testing sets.


In [7]:
lyrics = data['lyrics']
genre = data['tag']

As stated earlier, we stratify the splitting of the dataset for better performance of the model.

In [8]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(lyrics, genre, test_size=0.2, stratify=genre, random_state=42)
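
To illustrate what `stratify` does, here is a minimal sketch on made-up labels. The 80/20 toy class mix and every `toy_*` name are assumptions for illustration only, not the notebook's data. With stratification, each split keeps the class proportions of the full label list.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 "pop" songs and 20 "rap" songs
toy_labels = ["pop"] * 80 + ["rap"] * 20
toy_lyrics = [f"song {i}" for i in range(len(toy_labels))]

# Same call pattern as the notebook: 20% test split, stratified on the labels
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_lyrics, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

# Count how many songs of each class ended up in each split
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print("train:", train_counts)
print("test: ", test_counts)
```

Without `stratify`, a rare class such as `country` (20 songs in the notebook's dataset) could by chance end up with very few or no examples in one of the splits.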

## Part 3: Vectorizing our dataset

In [9]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the training data to learn the vocabulary
vectorizer.fit(X_train)


In [10]:
# Examine the fitted vocabulary
vectorizer.get_feature_names_out()

array(['01', '02', '03', ..., 'zoom', 'zsa', 'zuriel'], dtype=object)

In [11]:
# fit and transform training data into a 'document-term matrix'
X_train_dtm = vectorizer.fit_transform(X_train)

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(X_train_dtm.toarray(), columns = vectorizer.get_feature_names_out())

Unnamed: 0,01,02,03,04,05,06,07,08,09,0917,...,zimzalabim,zio,zjay,zo,zombie,zone,zoo,zoom,zsa,zuriel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4362,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4363,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4364,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# transform testing data into a document-term matrix (using existing vocabulary)
X_test_dtm = vectorizer.transform(X_test)
X_test_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [14]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(X_test_dtm.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,01,02,03,04,05,06,07,08,09,0917,...,zimzalabim,zio,zjay,zo,zombie,zone,zoo,zoom,zsa,zuriel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1087,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1088,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1089,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1090,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Part 4: Building and evaluating a model

In [15]:
# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier
nb_classifier.fit(X_train_dtm, y_train)


In [16]:
# Make predictions on the test data
y_pred_class = nb_classifier.predict(X_test_dtm)


In [17]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.7509157509157509

As seen in the confusion matrix below, most of the model's *confusion* falls in the third column, the `pop` tag: songs from every other genre are most often misclassified as `pop`.

In [18]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[  0,   0,   2,   1,   1,   0],
       [  0,   0,   9,   1,   0,   0],
       [  1,   0, 683,  68,   7,   7],
       [  0,   0,  40, 132,   4,   1],
       [  0,   0,  25,   9,   2,   0],
       [  0,   0,  84,  12,   0,   3]], dtype=int64)
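
The raw array is easier to read with labels attached. The sketch below copies the matrix printed above and assumes the class order is alphabetical (`country`, `misc`, `pop`, `rap`, `rb`, `rock`), which is how scikit-learn sorts string labels; rows are true genres and columns are predicted genres. The names `cm`, `cm_df`, and `recall_per_genre` are illustrative.

```python
import numpy as np
import pandas as pd

genre_labels = ["country", "misc", "pop", "rap", "rb", "rock"]

# Confusion matrix values copied from the notebook output above
cm = np.array([
    [0, 0,   2,   1, 1, 0],
    [0, 0,   9,   1, 0, 0],
    [1, 0, 683,  68, 7, 7],
    [0, 0,  40, 132, 4, 1],
    [0, 0,  25,   9, 2, 0],
    [0, 0,  84,  12, 0, 3],
])

# Rows are true genres, columns are predicted genres
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{g}" for g in genre_labels],
    columns=[f"pred_{g}" for g in genre_labels],
)
print(cm_df)

# Recall per genre: correct predictions divided by the number of true songs
recall_per_genre = np.diag(cm) / cm.sum(axis=1)
# Overall accuracy: sum of the diagonal divided by all test songs
overall_accuracy = np.trace(cm) / cm.sum()
print(dict(zip(genre_labels, recall_per_genre.round(2))))
print(overall_accuracy)
```

Read this way, the `pred_pop` column collects most of the off-diagonal counts, which is why the overall accuracy is dominated by how well `pop` songs are recognised.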

In [19]:
# Print the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_class))

              precision    recall  f1-score   support

     country       0.00      0.00      0.00         4
        misc       0.00      0.00      0.00        10
         pop       0.81      0.89      0.85       766
         rap       0.59      0.75      0.66       177
          rb       0.14      0.06      0.08        36
        rock       0.27      0.03      0.05        99

    accuracy                           0.75      1092
   macro avg       0.30      0.29      0.27      1092
weighted avg       0.69      0.75      0.71      1092
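
In the report above, `misc` is never predicted (its column in the confusion matrix is all zeros), so its precision is a 0/0 case that scikit-learn reports as 0.00. The sketch below reproduces that situation on made-up labels and sets the value explicitly through the `zero_division` parameter; the `toy_*` names and label lists are assumptions for illustration.

```python
from sklearn.metrics import classification_report, precision_score

# Hypothetical labels: the single "misc" song is predicted as "pop",
# so "misc" never appears among the predictions
toy_true = ["pop", "pop", "rap", "misc"]
toy_pred = ["pop", "pop", "rap", "pop"]
toy_genres = ["misc", "pop", "rap"]

# zero_division=0 makes the undefined precision an explicit 0
toy_precision = precision_score(
    toy_true, toy_pred, labels=toy_genres, average=None, zero_division=0
)
print(dict(zip(toy_genres, toy_precision)))

# The same parameter is accepted by classification_report
print(classification_report(toy_true, toy_pred, labels=toy_genres, zero_division=0))
```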





## Part 5: Examining a model for further insight

In [22]:
# store the vocabulary of X_train
X_train_tokens = vectorizer.get_feature_names_out()
len(X_train_tokens)

36158

In [23]:
# examine the first 50 tokens
print(X_train_tokens[0:50])

['01' '02' '03' '04' '05' '06' '07' '08' '09' '0917' '10' '100' '1008'
 '1096' '11' '110' '12' '12gauge' '13th' '13x' '143' '14344' '16' '16x'
 '17' '1861' '1896' '19' '1997' '1999' '1fritz' '1igalaw' '1loonie'
 '1shortone' '1st' '1strap' '1x' '20' '2000' '2004' '2007' '2013' '2017'
 '2018' '2020' '2021' '2021y' '2022' '22' '2202']


In [24]:
# examine the last 50 tokens
print(X_train_tokens[-50:])

['yumi' 'yumoko' 'yumugyog' 'yumuko' 'yumuyugyog' 'yumuyuko' 'yun' 'yuna'
 'yung' 'yunis' 'yunmasaya' 'yup' 'yupi' 'yupiyupi' 'yuri' 'yuridope'
 'yusuke' 'yuta' 'yutang' 'yuuh' 'yuuki' 'yuyuko' 'yvan' 'yzkk' 'zack'
 'zaint' 'zean' 'zel' 'zelijah' 'zeno' 'zephanie' 'zeppelin' 'zero'
 'zesto' 'zeus' 'zev' 'zhayt' 'zhen' 'zia' 'zild' 'zimzalabim' 'zio'
 'zjay' 'zo' 'zombie' 'zone' 'zoo' 'zoom' 'zsa' 'zuriel']


In [25]:
# Naive Bayes counts the number of times each token appears in each class
nb_classifier.feature_count_

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  8.,  0.],
       [ 1.,  1.,  3., ..., 14.,  0.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [26]:
# rows represent classes, columns represent tokens
nb_classifier.feature_count_.shape

(6, 36158)

In [27]:
country_token_count = nb_classifier.feature_count_[0, :]
misc_token_count = nb_classifier.feature_count_[1, :]
pop_token_count = nb_classifier.feature_count_[2, :]
rap_token_count = nb_classifier.feature_count_[3, :]
rb_token_count = nb_classifier.feature_count_[4, :]
rock_token_count = nb_classifier.feature_count_[5, :]
# create a DataFrame of tokens with their separate per-genre counts
tokens = pd.DataFrame({'token':X_train_tokens, 'country':country_token_count, 'misc':misc_token_count, 'pop':pop_token_count, 'rap':rap_token_count, 'rb':rb_token_count, 'rock':rock_token_count}).set_index('token')
tokens.head()

token,country,misc,pop,rap,rb,rock
01,0.0,0.0,0.0,1.0,0.0,0.0
02,0.0,0.0,0.0,1.0,0.0,0.0
03,0.0,0.0,0.0,3.0,0.0,0.0
04,0.0,0.0,0.0,2.0,0.0,0.0
05,0.0,0.0,0.0,1.0,0.0,0.0


In [28]:
# examine 5 random DataFrame rows
tokens.sample(5, random_state=427)

token,country,misc,pop,rap,rb,rock
amipingan,0.0,0.0,2.0,0.0,0.0,0.0
magpapatayo,0.0,0.0,0.0,1.0,0.0,0.0
maatem,0.0,0.0,0.0,1.0,0.0,0.0
naingatan,0.0,0.0,0.0,2.0,0.0,0.0
pinanghahawakan,0.0,0.0,7.0,1.0,4.0,0.0


In [29]:
# add 1 to each genre's token count to avoid dividing by 0
tokens[['country', 'misc', 'pop', 'rap', 'rb', 'rock']] = tokens[['country', 'misc', 'pop', 'rap', 'rb', 'rock']] + 1

In [30]:
# convert the tag counts into frequencies
tokens['country'] = tokens['country'] / nb_classifier.class_count_[0]
tokens['misc'] = tokens['misc'] / nb_classifier.class_count_[1]
tokens['pop'] = tokens['pop'] / nb_classifier.class_count_[2]
tokens['rap'] = tokens['rap'] / nb_classifier.class_count_[3]
tokens['rb'] = tokens['rb'] / nb_classifier.class_count_[4]
tokens['rock'] = tokens['rock'] / nb_classifier.class_count_[5]

tokens.sample(5, random_state=427)

token,country,misc,pop,rap,rb,rock
amipingan,0.0625,0.02381,0.00098,0.001414,0.006897,0.002525
magpapatayo,0.0625,0.02381,0.000327,0.002829,0.006897,0.002525
maatem,0.0625,0.02381,0.000327,0.002829,0.006897,0.002525
naingatan,0.0625,0.02381,0.000327,0.004243,0.006897,0.002525
pinanghahawakan,0.0625,0.02381,0.002614,0.002829,0.034483,0.002525
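
As a worked check of the two steps above, the frequencies for `pinanghahawakan` can be reproduced by hand as `(count + 1) / number of training songs in the genre`. The per-genre training sizes below are not printed in the notebook; they are inferred here as the dataset counts minus the test-set support from the classification report, so treat them as an assumption.

```python
# Raw counts for 'pinanghahawakan' from the sampled rows shown earlier
raw_counts = {"country": 0, "misc": 0, "pop": 7, "rap": 1, "rb": 4, "rock": 0}

# Assumed training songs per genre: total songs minus test-set support
train_songs = {
    "country": 20 - 4,
    "misc": 52 - 10,
    "pop": 3827 - 766,
    "rap": 884 - 177,
    "rb": 181 - 36,
    "rock": 495 - 99,
}

# Add-one smoothing, then divide by the number of training songs in the genre
smoothed_freq = {
    genre: (raw_counts[genre] + 1) / train_songs[genre] for genre in raw_counts
}

for genre, value in smoothed_freq.items():
    print(f"{genre:8s} {value:.6f}")
```

The rounded values agree with the row in the table above. Note that this per-song normalisation is the notebook's own exploratory measure; `MultinomialNB` itself normalises by the total token count of each class when it estimates `feature_log_prob_`.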


In [31]:
# tokens holds one smoothed frequency column per genre, indexed by token

# Calculate the ratio of each genre's frequency to the sum of the other columns for each token
# Note: other_genres is rebuilt from tokens.columns on every pass, so ratio columns
# added by earlier passes are included in the denominators of later ones
for genre_name in ['country', 'misc', 'pop', 'rap', 'rb', 'rock']:
    other_genres = [col for col in tokens.columns if col != genre_name]
    tokens[f'{genre_name}_ratio'] = tokens[genre_name] / tokens[other_genres].sum(axis=1)

# Sample 5 rows from the DataFrame
sampled_tokens = tokens.sample(5, random_state=427)

# Print the sampled tokens with the calculated ratios
print(sampled_tokens)


                 country     misc       pop       rap        rb      rock  \
token                                                                       
amipingan         0.0625  0.02381  0.000980  0.001414  0.006897  0.002525   
magpapatayo       0.0625  0.02381  0.000327  0.002829  0.006897  0.002525   
maatem            0.0625  0.02381  0.000327  0.002829  0.006897  0.002525   
naingatan         0.0625  0.02381  0.000327  0.004243  0.006897  0.002525   
pinanghahawakan   0.0625  0.02381  0.002614  0.002829  0.034483  0.002525   

                 country_ratio  misc_ratio  pop_ratio  rap_ratio  rb_ratio  \
token                                                                        
amipingan             1.754345    0.013020   0.000526   0.000759  0.003708   
magpapatayo           1.717652    0.013281   0.000179   0.001548  0.003780   
maatem                1.717652    0.013281   0.000179   0.001548  0.003780   
naingatan             1.653382    0.013764   0.000185   0.002406  0.00
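
The loop above takes its denominator columns from `tokens.columns`, which grows as each `_ratio` column is added, so from `misc_ratio` onward the denominator also contains earlier ratio columns. One possible variant, sketched here on a single row copied from the sample output, restricts the denominator to the six genre columns. The names `toy_tokens` and `genre_cols` are illustrative, and the numbers this variant produces differ from the notebook's printed ratios for every genre except `country`.

```python
import pandas as pd

genre_cols = ["country", "misc", "pop", "rap", "rb", "rock"]

# One row of smoothed frequencies copied from the sample output above
toy_tokens = pd.DataFrame(
    {
        "country": [0.0625],
        "misc": [0.02381],
        "pop": [0.00098],
        "rap": [0.001414],
        "rb": [0.006897],
        "rock": [0.002525],
    },
    index=pd.Index(["amipingan"], name="token"),
)

# Use the fixed genre list, not the growing column index, for the denominator
for genre_name in genre_cols:
    other_genres = [col for col in genre_cols if col != genre_name]
    toy_tokens[f"{genre_name}_ratio"] = (
        toy_tokens[genre_name] / toy_tokens[other_genres].sum(axis=1)
    )

print(toy_tokens.filter(like="_ratio").T)
```

With this variant every ratio compares a genre only against the other five genres, so the values are comparable across columns.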

In [32]:
# examine the DataFrame sorted by pop_ratio, highest first
tokens.sort_values('pop_ratio', ascending=False)

token,country,misc,pop,rap,rb,rock,country_ratio,misc_ratio,pop_ratio,rap_ratio,rb_ratio,rock_ratio
yo,0.0625,0.023810,0.341718,0.169731,0.193103,0.255051,0.063554,0.021931,0.432730,0.121724,0.129361,0.163477
mahal,0.3125,0.214286,0.603724,0.451202,0.296552,0.310606,0.166545,0.100081,0.326025,0.193623,0.110712,0.111920
pagibig,0.7500,0.452381,0.832734,0.304102,0.365517,0.406566,0.317622,0.151982,0.303014,0.084949,0.101438,0.110970
pasko,0.0625,0.023810,0.135577,0.029703,0.006897,0.010101,0.303270,0.043444,0.282613,0.034212,0.007454,0.010867
la,0.0625,0.023810,0.426985,0.159830,0.965517,0.265152,0.033944,0.012440,0.280322,0.077188,0.719371,0.096003
...,...,...,...,...,...,...,...,...,...,...,...,...
adedele,0.3750,0.023810,0.000327,0.001414,0.006897,0.002525,10.722727,0.002143,0.000029,0.000127,0.000620,0.000227
adedelehiyo,0.3750,0.023810,0.000327,0.001414,0.006897,0.002525,10.722727,0.002143,0.000029,0.000127,0.000620,0.000227
doledidi,0.3750,0.023810,0.000327,0.001414,0.006897,0.002525,10.722727,0.002143,0.000029,0.000127,0.000620,0.000227
ulipon,0.6250,0.023810,0.000327,0.001414,0.006897,0.002525,17.871212,0.001286,0.000018,0.000076,0.000372,0.000136


## Part 6: Tuning the vectorizer

In [33]:
# show default parameters for CountVectorizer
vectorizer

In [34]:
metrics.accuracy_score(y_test, y_pred_class)

0.7509157509157509

In [35]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define the pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB())
])

# Define the parameters to search
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # Test unigrams and bigrams
    'vect__max_df': [0.5, 0.75, 1.0],       # Test different maximum document frequencies
    'vect__min_df': [1, 2, 5],               # Test different minimum document frequencies
    'vect__stop_words': [None, 'english'],    # Test with and without stopwords
    'clf__alpha': [0.1, 0.5, 1.0]            # Test different alpha values for additive (Laplace/Lidstone) smoothing
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Print the best parameters found
print("Best Parameters:", grid_search.best_params_)

# Print the best accuracy found
print("Best Accuracy:", grid_search.best_score_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'clf__alpha': 0.5, 'vect__max_df': 0.75, 'vect__min_df': 1, 'vect__ngram_range': (1, 2), 'vect__stop_words': 'english'}
Best Accuracy: 0.7634561901541542


In [36]:
# Extract the best parameters found by grid search
best_params = grid_search.best_params_
best_params

{'clf__alpha': 0.5,
 'vect__max_df': 0.75,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': 'english'}

In [37]:
# Initialize CountVectorizer with the best parameters
vectorizer = CountVectorizer(ngram_range=best_params['vect__ngram_range'],
                             max_df=best_params['vect__max_df'],
                             min_df=best_params['vect__min_df'],
                             stop_words=best_params['vect__stop_words'])

# Fit the vectorizer to the entire training data
X_train_vectorized = vectorizer.fit_transform(X_train)

# Initialize and train the Multinomial Naive Bayes classifier
clf = MultinomialNB(alpha=best_params["clf__alpha"])
clf.fit(X_train_vectorized, y_train)

# Predict on the test data
X_test_vectorized = vectorizer.transform(X_test)
y_pred_class = clf.predict(X_test_vectorized)

# Calculate accuracy
accuracy = metrics.accuracy_score(y_test, y_pred_class)
print("Accuracy:", accuracy)

Accuracy: 0.7683150183150184
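
Rebuilding the vectorizer and classifier by hand works, but `GridSearchCV` also refits the best pipeline on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below shows that route on a tiny made-up corpus; the documents, tags, reduced parameter grid, and `toy_*` names are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-genre corpus, four songs per genre
toy_docs = [
    "mahal kita", "pagibig mahal", "mahal pagibig kita", "kita pagibig",
    "yo rap bars", "rap yo flow", "bars flow yo", "flow rap bars",
]
toy_tags = ["pop"] * 4 + ["rap"] * 4

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# A reduced grid in the same style as the notebook's search
toy_grid = GridSearchCV(
    toy_pipeline,
    {"vect__ngram_range": [(1, 1), (1, 2)], "clf__alpha": [0.5, 1.0]},
    cv=2,
)
toy_grid.fit(toy_docs, toy_tags)

# best_estimator_ is the whole pipeline, already refit with the best parameters
tuned_model = toy_grid.best_estimator_
toy_prediction = tuned_model.predict(["mahal pagibig"])
print(toy_grid.best_params_)
print(toy_prediction)
```

In the notebook's terms, this would correspond to calling `grid_search.best_estimator_.predict(X_test)` or `grid_search.score(X_test, y_test)` on the fitted search object.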
