# Lyrics Feature Extraction using RoBERTa or TF-IDF
Idea: Starting from the preprocessed word-frequency table of the lyrics, we extract features using either TF-IDF or RoBERTa.
This yields __ features, which become the columns of the dataframe.

In [1]:
import sys
import os

# Add the parent directory to sys.path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

In [2]:
import sqlite3

import numpy as np
import pandas as pd

from preprocessing.util.lyrics_processor import LyricsProcessor
from lyrics_provider import LyricsProvider



In [3]:
parent_directory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
data_dir = os.path.join(parent_directory, 'data')
lyrics_dir = os.path.join(data_dir, 'mxm_dataset.db')

In [4]:
processor = LyricsProcessor(lyrics_dir)
processor.process_all()
lyrics_data = processor.get_lyrics_data()

Database connected.
Columns in 'lyrics' table: ['track_id', 'mxm_tid', 'word', 'count', 'is_test', 'song_id']
'song_id' column already exists.
Lyrics table processed and pivoted with song_id as index.
Database connection closed.
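
The log above mentions that the `lyrics` table is pivoted with `song_id` as the index. The internals of `LyricsProcessor` are not shown in this notebook, so the following is only a minimal sketch of what such a pivot could look like, using a toy in-memory table with three of the listed columns (`song_id`, `word`, `count`). The rows and variable names are made up for illustration.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the real database (assumption: only three of the columns are needed here).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE lyrics (song_id TEXT, word TEXT, "count" INTEGER)')
conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("song_a", "love", 2), ("song_a", "and", 1), ("song_b", "and", 3)],
)

# Long format: one row per (song, word) pair.
long_df = pd.read_sql_query('SELECT song_id, word, "count" FROM lyrics', conn)
conn.close()

# Wide format: one row per song, one column per word, 0 where a word is absent.
wide_df = long_df.pivot_table(
    index="song_id", columns="word", values="count", aggfunc="sum", fill_value=0
)
print(wide_df)
```

Each row of `wide_df` is then a bag-of-words count vector for one song, which is the kind of input a TF-IDF step can work from.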


Use **max_features** to set the number of features, i.e. the number of feature columns in the resulting dataframe.
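
To illustrate what a `max_features` cap does, here is a minimal sketch that computes TF-IDF directly from a song-by-word count table. It is not the code behind `LyricsProvider.get_tfidf_embeddings`, which this notebook does not show. The helper `tfidf_from_counts` and the toy data are hypothetical, and the weighting (smoothed IDF followed by L2 row normalization, keeping the words with the highest total count) is an assumption modeled on common defaults.

```python
import numpy as np
import pandas as pd


def tfidf_from_counts(counts: pd.DataFrame, max_features: int) -> pd.DataFrame:
    """Hypothetical helper: TF-IDF weights from a song-by-word count table."""
    # Keep only the max_features words with the highest total count over all songs.
    top_words = counts.sum(axis=0).nlargest(max_features).index
    kept = counts[sorted(top_words)]  # alphabetical column order

    tf = kept.to_numpy(dtype=float)
    n_songs = tf.shape[0]
    doc_freq = (tf > 0).sum(axis=0)  # number of songs containing each word
    idf = np.log((1 + n_songs) / (1 + doc_freq)) + 1.0  # smoothed IDF (assumption)

    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return pd.DataFrame(weighted / norms, index=kept.index, columns=kept.columns)


toy_counts = pd.DataFrame(
    {"and": [3, 1, 0], "is": [1, 0, 4], "love": [0, 2, 0], "zebra": [0, 0, 1]},
    index=["song_a", "song_b", "song_c"],
)
toy_tfidf = tfidf_from_counts(toy_counts, max_features=3)
print(toy_tfidf.round(3))
```

With `max_features=3`, the rarest word (`zebra`) is dropped and the result has exactly three feature columns.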

## Using TF-IDF

In [5]:
# Get and display the TF-IDF embeddings (feature outputs)
# computed from the processed lyrics
model = LyricsProvider(lyrics_data)
embeddings_df = model.get_tfidf_embeddings(lyrics_data, max_features=20)
# The dataframe contains as many feature columns as specified by max_features; these are the columns shown below.
print(embeddings_df.head())

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TF-IDF embeddings generated.
              song_id  all        am       and       are        do        in  \
0  TRAAAAV128F421A322  0.0  0.300988  0.487591  0.294758  0.000000  0.000000   
1  TRAAABD128F429CF47  0.0  0.000000  0.075210  0.340995  0.273745  0.131871   
2  TRAAAED128E0783FAB  0.0  0.264423  0.471193  0.064738  0.062364  0.000000   
3  TRAAAEF128F4273421  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
4  TRAAAEW128F42930C0  0.0  0.287620  0.434872  0.000000  0.000000  0.000000   

         is        it        me  ...       not        of        on      that  \
0  0.234611  0.000000  0.000000  ...  0.000000  0.372548  0.000000  0.000000   
1  0.135707  0.132564  0.000000  ...  0.090503  0.000000  0.000000  0.254985   
2  0.077291  0.100668  0.050630  ...  0.051546  0.000000  0.000000  0.203317   
3  0.788932  0.000000  0.000000  ...  0.350760  0.278394  0.000000  0.197648   
4  0.000000  0.000000  0.293716  ...  0.000000  0.000000  0.181298  0.000000   

        t

## Using RoBERTa

In [6]:
model = LyricsProvider(lyrics_data)
embeddings_rb = model.get_roberta_embeddings(lyrics_data)

print(embeddings_rb.head())

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Generating Embeddings:   4%|▍         | 1205/29708 [26:23<10:24:19,  1.31s/it] 


KeyboardInterrupt: 
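
The warning above says that the `roberta.pooler` weights are newly initialized, so the pooler output is not meaningful without fine-tuning. One common alternative is to average the per-token hidden states while ignoring padding. Whether `get_roberta_embeddings` does this is not shown in this notebook. The sketch below is an assumption, and it uses toy NumPy arrays in place of real model outputs (the helper name `masked_mean_pool` is hypothetical).

```python
import numpy as np


def masked_mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, skipping positions where the mask is 0."""
    mask = attention_mask[..., None].astype(float)  # (batch, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)     # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # avoid division by zero
    return summed / counts


# One sequence with three token positions; the last one is padding.
toy_vectors = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(toy_vectors, toy_mask)
print(pooled)
```

The padded position does not affect the result, so each sequence is reduced to one fixed-length vector regardless of its length.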

In [None]:
# K-means clustering on the RoBERTa embeddings; also returns the silhouette and Calinski-Harabasz scores
clustered_df, silhouette, calinski_harabasz = model.kmeans_cluster(embeddings_rb)
print(clustered_df[['song_id', 'cluster']].head())

model.visualize_cluster(clustered_df)
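
For reference, the clustering step can be approximated with scikit-learn. This is a sketch under stated assumptions, not the body of `model.kmeans_cluster`: the helper `kmeans_with_scores`, the number of clusters, and the synthetic data are all hypothetical. Only the returned triple (clustered dataframe, silhouette score, Calinski-Harabasz score) mirrors the call above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def kmeans_with_scores(embeddings: pd.DataFrame, n_clusters: int = 2, seed: int = 0):
    """Hypothetical helper: cluster the feature columns and score the result."""
    features = embeddings.drop(columns=["song_id"]).to_numpy()
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)

    clustered = embeddings.copy()
    clustered["cluster"] = labels
    return (
        clustered,
        silhouette_score(features, labels),
        calinski_harabasz_score(features, labels),
    )


# Two well-separated synthetic blobs standing in for real embeddings.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (10, 4)), rng.normal(5, 0.1, (10, 4))])
toy = pd.DataFrame(points, columns=[f"dim_{i}" for i in range(4)])
toy.insert(0, "song_id", [f"song_{i}" for i in range(20)])

clustered_toy, sil, ch = kmeans_with_scores(toy, n_clusters=2)
print(clustered_toy[["song_id", "cluster"]].head())
```

A silhouette score close to 1 and a large Calinski-Harabasz score both indicate compact, well-separated clusters; on real lyrics embeddings the values are usually much less clear-cut.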

## Saving to .pkl file

For RoBERTa

In [None]:
parent_directory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
file_path = os.path.join(parent_directory, 'data', 'embeddings', 'lyrics_embeddings', 'lyrics_roberta.pkl')

model.embeddings_to_pkl(file_path, embeddings=embeddings_rb)

For TF-IDF

In [None]:
parent_directory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
file_path = os.path.join(parent_directory, 'data', 'embeddings', 'lyrics_embeddings', 'lyrics_tfidf.pkl')

model.embeddings_to_pkl(file_path, embeddings=embeddings_df)
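
The implementation of `embeddings_to_pkl` is not shown here. As a rough sketch of the save-and-reload round trip, assuming the embeddings are a pandas dataframe, something like the following would work. The helper `save_embeddings` and the toy dataframe are hypothetical, and a temporary directory is used so nothing under `data/` is touched.

```python
import os
import tempfile

import pandas as pd


def save_embeddings(file_path: str, embeddings: pd.DataFrame) -> None:
    """Hypothetical helper: create the target folder if needed, then pickle the dataframe."""
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    embeddings.to_pickle(file_path)


toy = pd.DataFrame({"song_id": ["song_a", "song_b"], "dim_0": [0.1, 0.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings", "toy.pkl")
    save_embeddings(path, toy)
    restored = pd.read_pickle(path)  # reload to confirm the file is readable

round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.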