# BERT-Based Track Recommendations Using Cosine Similarity

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by researchers at Google.
BERT-base-uncased is one of the pre-trained BERT checkpoints: it has already been trained on a large amount of lowercased English text and can be fine-tuned for various natural language processing tasks.

### Advantages:

- **Powerful Pre-training:** BERT-base-uncased has been pre-trained on a large corpus of text, so its representations already encode a good deal of linguistic and contextual information. This pre-training helps the model perform well on a wide range of NLP tasks.

- **Bidirectionality:** BERT is trained bidirectionally: it learns the representation of each word from the context on both its left and its right. Traditional language models, in contrast, read context in one direction only, either left-to-right or right-to-left.

- **Flexible Fine-tuning:** The BERT-base-uncased model can be fine-tuned for various tasks by adding a task-specific output layer, which makes it adaptable to many NLP applications.

- **Computational Efficiency:** BERT-base-uncased is smaller than BERT-large (roughly 110 million parameters versus roughly 340 million), so it is cheaper to run and more practical to deploy in real-world applications.
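
The bidirectionality point can be pictured with attention masks. The sketch below is a simplified illustration, not BERT's actual implementation: it assumes a hypothetical four-token sequence and compares which positions a token may look at in a left-to-right model with those available to a bidirectional encoder.

```python
import numpy as np

seq_len = 4  # hypothetical sequence of four tokens

# Left-to-right model: the token at row i may attend only to columns 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Bidirectional encoder: every token may attend to every position.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

print(causal_mask)
print(bidirectional_mask)

# Context positions visible to the token at index 1 under each scheme
visible_causal = np.flatnonzero(causal_mask[1]).tolist()
visible_bidirectional = np.flatnonzero(bidirectional_mask[1]).tolist()
print(visible_causal, visible_bidirectional)
```

In the left-to-right case the second token sees only itself and the token before it, whereas in the bidirectional case it also sees the tokens that follow.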

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [10]:
path = 'spotify_final_dataset.csv' # Adding path to the final preprocessed dataset 

In [None]:
df = pd.read_csv(path)

In [4]:
df.shape

(100000, 15)

In [5]:
df.isnull().sum()

track_pos                    0
track_artist_name            0
track_track_name             0
track_duration_ms            0
track_album_name             0
playlist_name                0
playlist_num_artists         0
playlist_num_albums          0
playlist_num_tracks          0
playlist_num_followers       0
playlist_num_edits           0
playlist_duration_ms         0
playlist_collaborative       0
bag_of_words              7757
sentiment_bag_of_words       0
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.head()

Unnamed: 0,track_pos,track_artist_name,track_track_name,track_duration_ms,track_album_name,playlist_name,playlist_num_artists,playlist_num_albums,playlist_num_tracks,playlist_num_followers,playlist_num_edits,playlist_duration_ms,playlist_collaborative,bag_of_words,sentiment_bag_of_words
0,0,The Jackson 5,ABC,174866,ABC,party party,116,142,152,1,3,39413578,False,jackson c easy love b baby michael sing come s...,0.7964
1,1,Streetlight Manifesto,Point/Counterpoint,327920,Everything Goes Numb,party party,116,142,152,1,3,39413578,False,know dont never would ill ive like wont cant im,0.1316
2,2,Michael Jackson,Billie Jean,293826,Thriller 25 Super Deluxe Edition,party party,116,142,152,1,3,39413578,False,jean one billie lover uh son baby kid hoo girl,0.5859
3,3,Green Day,Basket Case,181533,Dookie,party party,116,142,152,1,3,39413578,False,sometimes chorus give creeps mind plays tricks...,0.128
4,4,The White Stripes,Seven Nation Army,231800,Elephant,party party,116,142,152,1,3,39413578,False,im na gon back comin prechorus instrumental bl...,0.0


In [8]:
df[df['playlist_name'] == 'party party'].head(5)

Unnamed: 0,track_pos,track_artist_name,track_track_name,track_duration_ms,track_album_name,playlist_name,playlist_num_artists,playlist_num_albums,playlist_num_tracks,playlist_num_followers,playlist_num_edits,playlist_duration_ms,playlist_collaborative,bag_of_words,sentiment_bag_of_words
0,0,The Jackson 5,ABC,174866,ABC,party party,116,142,152,1,3,39413578,False,jackson c easy love b baby michael sing come s...,0.7964
1,1,Streetlight Manifesto,Point/Counterpoint,327920,Everything Goes Numb,party party,116,142,152,1,3,39413578,False,know dont never would ill ive like wont cant im,0.1316
2,2,Michael Jackson,Billie Jean,293826,Thriller 25 Super Deluxe Edition,party party,116,142,152,1,3,39413578,False,jean one billie lover uh son baby kid hoo girl,0.5859
3,3,Green Day,Basket Case,181533,Dookie,party party,116,142,152,1,3,39413578,False,sometimes chorus give creeps mind plays tricks...,0.128
4,4,The White Stripes,Seven Nation Army,231800,Elephant,party party,116,142,152,1,3,39413578,False,im na gon back comin prechorus instrumental bl...,0.0


In [9]:
df.track_track_name.unique()

array(['ABC', 'Point/Counterpoint', 'Billie Jean', ..., 'Ruff To The Max',
       'Movin', 'Devistated - Edit'], dtype=object)

In [None]:
! pip install transformers



#### Model Implementation:

In [11]:
from transformers import BertTokenizer, BertModel
import torch
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Function to encode text using BERT in batches
def encode_text_with_bert_in_batches(text_list, batch_size=100):
    embeddings = []
    for i in range(0, len(text_list), batch_size):
        batch = text_list[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt", max_length=128)
        with torch.no_grad():
            outputs = model(**inputs)
        # Use the [CLS] token embedding as the sentence embedding
        batch_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)

# Example: Encode track names and playlist names in batches
batch_size = 100  # Adjust based on your system's memory capacity
df['track_name_embedding'] = list(encode_text_with_bert_in_batches(df['track_track_name'].tolist(), batch_size))
df['playlist_name_embedding'] = list(encode_text_with_bert_in_batches(df['playlist_name'].tolist(), batch_size))
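
To make the `[:, 0, :]` indexing in the function above concrete, here is a small sketch that uses a made-up NumPy array as a stand-in for `outputs.last_hidden_state`. The shape `(2, 4, 3)` and the values are assumptions chosen for readability; real `bert-base-uncased` outputs have a hidden size of 768.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state with shape (batch, seq_len, hidden).
# The numbers are arbitrary; only the shapes matter here.
last_hidden_state = np.arange(24, dtype=float).reshape(2, 4, 3)

# Position 0 of every sequence holds the [CLS] token vector.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (2, 3)

# Per-batch results are stacked row-wise, as np.vstack does in the function above.
batches = [cls_embeddings, cls_embeddings[:1]]  # a full batch and a smaller final batch
stacked = np.vstack(batches)
print(stacked.shape)  # (3, 3)
```

Each input text therefore ends up as one row of the stacked matrix, regardless of how the texts were split into batches.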

In [14]:
def recommend_tracks_bert(seed_tracks, num_recommendations=5):
    """Return the tracks whose title embeddings are most similar to the seed titles."""
    seed_embeddings = encode_text_with_bert_in_batches(seed_tracks)  # Encode seed tracks
    track_embeddings = np.vstack(df['track_name_embedding'].tolist())  # Shape: (n_tracks, hidden_size)

    # Compute cosine similarity between seed tracks and all tracks
    similarities = cosine_similarity(seed_embeddings, track_embeddings)
    mean_similarity = similarities.mean(axis=0)  # Average similarity across all seed tracks

    # Score a copy so the global DataFrame is left unchanged,
    # then exclude the seed tracks and rank by similarity
    scored = df.assign(similarity=mean_similarity)
    recommendations = scored[~scored['track_track_name'].isin(seed_tracks)] \
        .sort_values(by='similarity', ascending=False)

    # Drop duplicates based on track name and artist combination
    unique_recommendations = recommendations[['track_artist_name', 'track_track_name', 'similarity']] \
        .drop_duplicates(subset=['track_artist_name', 'track_track_name']) \
        .head(num_recommendations)

    return unique_recommendations
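
The averaging and ranking steps of the function above can be traced by hand on a toy example. The sketch below assumes a hypothetical four-track catalogue and a made-up similarity matrix (one row per seed, one column per track); no BERT model is involved.

```python
import numpy as np

track_names = ["Song A", "Song B", "Song C", "Song D"]  # hypothetical catalogue
seed_tracks = ["Song A", "Song D"]                      # hypothetical seeds

# Made-up similarities: row 0 is "Song A" vs. every track, row 1 is "Song D" vs. every track.
similarities = np.array([
    [1.0, 0.2, 0.9, 0.6],
    [0.6, 0.4, 0.5, 1.0],
])

# Average over the seeds, giving one score per catalogue track.
mean_similarity = similarities.mean(axis=0)

# Rank from most to least similar, then skip the seeds themselves.
order = np.argsort(-mean_similarity)
recommendations = [track_names[i] for i in order if track_names[i] not in seed_tracks]
print(recommendations)  # ['Song C', 'Song B']
```

The seeds receive the highest mean scores here because each one matches itself perfectly, which is why the function filters them out before ranking.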


**Cosine Similarity:**

$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$
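
As a quick worked check of the formula, the following sketch computes cosine similarity directly with NumPy for a few small vectors. The helper name `cosine` and the vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b (both assumed non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # 45 degrees away from a
c = np.array([0.0, 2.0])  # orthogonal to a
d = np.array([3.0, 0.0])  # same direction as a, different length

print(cosine(a, b))  # about 0.7071, i.e. 1 / sqrt(2)
print(cosine(a, c))  # 0.0
print(cosine(a, d))  # 1.0
```

The last case shows that only the direction of the vectors matters, not their magnitude.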

In [None]:
# Define seed_tracks as input to the model
seed_tracks = ["XO TOUR Llif3", "X Men","California Girls"] 

# Get 5 unique recommendations using BERT
bert_recommendations = recommend_tracks_bert(seed_tracks, num_recommendations=5)
print("BERT-Based Recommendations:")
print(bert_recommendations) 


BERT-Based Recommendations:
      track_artist_name track_track_name  similarity
66865               LFO     Summer Girls    0.910419
49761          Ludacris      Party Girls    0.906873
46004         blackbear     Valley Girls    0.903665
67600    Persona La Ave     Special Boys    0.903648
88722         FKA twigs       Video Girl    0.901454


In [None]:
# Feed the previous output back in as additional seeds to check the similarity and relevance of the tracks
seed_tracks = ["XO TOUR Llif3", "X Men","California Girls", "Summer Girls","Party Girls","Valley Girls","Special Boys","Video Girl"]  
# Get 5 unique recommendations using BERT
bert_recommendations = recommend_tracks_bert(seed_tracks, num_recommendations=5)
print("BERT-Based Recommendations:")
print(bert_recommendations) 


BERT-Based Recommendations:
              track_artist_name track_track_name  similarity
49981  Bob Marley & The Wailers             Kaya    0.932976
40341                   Beyoncé     Naughty Girl    0.931241
59256             Niykee Heaton       Dream Team    0.930380
8134        The Clancy Brothers     Mountain Dew    0.929805
55744               MisterWives         Vagabond    0.929183


### Conclusion:

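The BERT-based recommendation system demonstrates the potential of language models for personalized music recommendations.
Here the model compares BERT embeddings of track titles using cosine similarity, so the results reflect similarity between titles: for a seed list that includes "California Girls", several of the top recommendations also contain "Girls". By capturing semantic and syntactic information in this way, the approach can provide relevant and engaging recommendations.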