**Note:** The implementation below is taken from the book **`Hands-On Large Language Models`** by Jay Alammar and Maarten Grootendorst.

# Explanation

### 🎯 Why It Works (The Core Idea)

**Word2Vec doesn’t need explicit labels — it learns from patterns.**

---

#### 🧠 Just like with words:

You don’t tell the model:

> "king = royalty, male, human"

Instead, you just feed it sentences like:

- "The king ruled the land."
- "The queen ruled the land."

The model notices that **"king"** and **"queen"** appear in similar contexts  
→ so it maps them **close together** in vector space.

---

#### 🎵 Same with songs:

Let’s say you feed the model playlists like:

```python
['sad_song1', 'piano_ballad2', 'acoustic3']
['party_track1', 'club_beat2', 'dance_hit3']
['acoustic3', 'lofi4', 'rainy_day5']
```

Even though the model has no idea what **"genre"** or **"mood"** means, it sees patterns like:

- `'acoustic3'` co-occurs with `'sad_song1'` and `'lofi4'`
- `'party_track1'` co-occurs with `'club_beat2'` and `'dance_hit3'`

So it starts mapping:

- `'acoustic3'`, `'sad_song1'`, `'lofi4'` → close together  
- `'party_track1'`, `'club_beat2'`, `'dance_hit3'` → close together

These clusters reflect **latent properties** like *mood* or *genre* —  
even though you never explicitly provided that information.
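
To make "co-occurs" concrete, here is a minimal sketch that only counts which songs share a playlist in the three toy playlists above. It is not Word2Vec itself, just the raw signal Word2Vec learns from; the helper name `neighbours` is made up for this illustration.

```python
from collections import Counter
from itertools import combinations

# The three toy playlists from the example above
toy_playlists = [
    ['sad_song1', 'piano_ballad2', 'acoustic3'],
    ['party_track1', 'club_beat2', 'dance_hit3'],
    ['acoustic3', 'lofi4', 'rainy_day5'],
]

# Count how often each unordered pair of songs shares a playlist
pair_counts = Counter()
for playlist in toy_playlists:
    for a, b in combinations(sorted(set(playlist)), 2):
        pair_counts[(a, b)] += 1

def neighbours(song):
    """Return the set of songs that share at least one playlist with `song`."""
    return {b if a == song else a
            for (a, b) in pair_counts if song in (a, b)}

print(sorted(neighbours('acoustic3')))
print(sorted(neighbours('party_track1')))
```

The two neighbour sets do not overlap, which is the same separation the embeddings end up reflecting.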

---

### 💡 So how does it learn features like “genre” or “mood” without being told?

- It doesn’t know the feature names.
- But it **learns usage patterns** that align with real-world categories.

Just like a child might learn:

> *"Dogs and cats both have fur, walk on 4 legs, and are pets."*  
> — without needing to know the label **“mammal”**

---

### 🎯 The model learns:

> *“These songs often appear together in the same context — so they must be related somehow.”*

That **“somehow”** is what gets captured as **vector similarity**.



In [None]:
# Install numpy together with gensim so that a compatible numpy version is
# installed and no conflict arises (this error has already occurred once).
!pip install numpy gensim



In [None]:
import pandas as pd
from urllib import request

In [None]:
# Get playlist dataset files
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
# Each number is an id for a song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns=['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
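
The parsing steps can be tried offline on a tiny in-memory sample. The two strings below are hypothetical stand-ins for `train.txt` and `song_hash.txt` (their contents are invented); only the splitting and filtering logic mirrors the cell above.

```python
# Hypothetical miniature of train.txt: two metadata lines, then one playlist per line
sample_train = "meta line 1\nmeta line 2\n0 1 2 3 \n4 \n5 6 \n"
sample_lines = sample_train.split('\n')[2:]

# Keep only playlists with more than one song
sample_playlists = [s.rstrip().split() for s in sample_lines if len(s.split()) > 1]

# Hypothetical miniature of song_hash.txt: id, title, artist separated by tabs
sample_hash = "0\tGucci Time\tGucci Mane\n4\tWhip My Hair\tWillow\n"
# Unlike the cell above, this also skips the empty string left by the final newline
sample_songs = [s.rstrip().split('\t') for s in sample_hash.split('\n') if s]

print(sample_playlists)
print(sample_songs)
```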

In [None]:
playlists_ = [s.rstrip().split() for s in lines if len(s.split()) <= 1]

playlists_[0]

['68565']

In [None]:
songs_df.head()

id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow


In [None]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

# Train the model

Each playlist is a list of strings, and each string is the ID of a song in `songs_df`, for example `['1', '3', '56', ...]`.

Each playlist is treated as a sentence and each song ID as a word.

`playlists = [['1', '3', '56'], ['3', '7', '9', '12'], ...]`

This is analogous to natural language, where:

- Playlist = Sentence

- Song ID = Word

- Word2Vec = Learns which words (songs) appear in similar contexts (playlists)

### How Word2Vec Works Behind the Scenes
There are two types of Word2Vec:

- CBOW (Continuous Bag of Words): Predicts the center word from its context. This is gensim's default (`sg=0`), so the training call below uses it.

- Skip-gram: Predicts the context from a center word. Pass `sg=1` to `Word2Vec` to select it.
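
The difference between the two variants is easiest to see in the training examples each one builds from a playlist. The sketch below is a simplification written for this note: the function names are made up, and details of real implementations such as negative sampling are left out.

```python
def skipgram_pairs(sentence, window):
    """Yield (center, context) pairs: the center item predicts each neighbour."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

def cbow_examples(sentence, window):
    """Yield (context_list, center) examples: the neighbours predict the center."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center

playlist = ['123', '456', '789', '888']
pairs = list(skipgram_pairs(playlist, window=1))
examples = list(cbow_examples(playlist, window=1))

print(pairs)
print(examples)
```

With `window=1` each song only sees its direct neighbours; the notebook uses `window=20`, so most songs in a playlist act as context for each other.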

### What it learns:

- It assigns each song ID a vector (embedding) in a high-dimensional space (in your case, 32-dimensional).

- Songs that appear in similar playlists (contexts) will have similar embeddings, i.e., closer in vector space.

### Example
Let’s say:

Playlist A: [“123”, “456”, “789”]

Playlist B: [“123”, “888”, “999”]

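“123” appears with “456”, “789”, “888”, and “999”, so Word2Vec tends to place all of them near “123”. Because “456” and “888” share the context “123”, they also tend to end up close to each other, even though they never appear in the same playlist.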


In [None]:
from gensim.models import Word2Vec
# Train our Word2Vec model
model = Word2Vec(
    playlists,      # ✅ Input: list of lists of tokens (e.g., song IDs, track names, words) — each list is like a "sentence"
    vector_size=32, # ✅ Size of each word (or item) vector; higher = more expressive, but more compute/memory
    window=20,      # ✅ Context window size; how many items before/after the current word to consider as context
    negative=50,    # ✅ Number of "negative samples" used in negative sampling (helps distinguish between similar and dissimilar words)
    min_count=1,    # ✅ Ignores words/items that appear less than this many times; 1 means include *all* items
    workers=4       # ✅ Number of CPU threads to use for training (parallelization for speed)
)

In [None]:
song_id = 2172
# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

[('6641', 0.9965611100196838),
 ('1922', 0.9958073496818542),
 ('2849', 0.995021641254425),
 ('3119', 0.9949675798416138),
 ('6626', 0.9941149950027466),
 ('1954', 0.9934073686599731),
 ('2976', 0.9932992458343506),
 ('2068', 0.9929718971252441),
 ('1849', 0.9928407669067383),
 ('3116', 0.992684006690979)]

In [None]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [None]:
import numpy as np

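def print_recommendations(song_id):
    # most_similar returns (song_id, score) pairs; keep the ids and cast them
    # to int because .iloc expects integer positions
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0].astype(int)
    return songs_df.iloc[similar_songs]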

# Extract recommendations
print_recommendations(2172)

id,title,artist
6641,Shout At The Devil,Motley Crue
1922,One,Metallica
2849,Run To The Hills,Iron Maiden
3119,There's Only One Way To Rock,Sammy Hagar
6626,Blackout,Scorpions


### 🧠 What if a New Song Appears?

If a song ID wasn’t seen during training (e.g., `'999'`), Word2Vec won't have an embedding for it:

```python
model.wv['999']  # ❌ KeyError
```

This is called the **cold-start problem** — the model has no idea where to place it in vector space because it never saw it in any playlist.
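
Before looking up a vector, it helps to check whether the song is in the vocabulary at all. The sketch below uses a plain dict as a hypothetical stand-in for `model.wv`, so it runs without a trained model; the helper `get_vector` is not part of gensim.

```python
# Hypothetical stand-in for model.wv: a plain dict from song id to vector
embeddings = {'223': [0.1, 0.2], '121': [0.3, 0.1]}

def get_vector(song_id, vectors):
    """Return the embedding for song_id, or None if it was never seen in training."""
    key = str(song_id)
    return vectors[key] if key in vectors else None

print(get_vector('223', embeddings))
print(get_vector(999, embeddings))
```

The same `key in ...` membership test should also work on gensim's `model.wv`, so `get_vector(song_id, model.wv)` is a reasonable way to detect a cold-start song instead of catching the `KeyError`.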


# Solution


#### **1. Average Context Embedding**
If '999' appears in a new playlist with known songs:

```
['999', 'lofi4', 'sad_song1', 'acoustic3']
```

We can estimate '999'’s embedding by averaging the vectors of its known neighbors:

```python
vector_999 = np.mean([
    model.wv['lofi4'],
    model.wv['sad_song1'],
    model.wv['acoustic3']
], axis=0)
```

Now, use vector_999 to find similar songs from the trained vocabulary using cosine similarity:

```python
similar_songs = model.wv.similar_by_vector(vector_999, topn=10)
```
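
The same idea can be traced end to end without a trained model. In this sketch the 2-D vectors are invented (the real embeddings are 32-dimensional), and `mean_vector` and `cosine` are small helpers written for the illustration; ranking by cosine similarity is what `similar_by_vector` does conceptually.

```python
import math

# Hypothetical 2-D embeddings standing in for model.wv
vectors = {
    'lofi4':        [0.9, 0.1],
    'sad_song1':    [0.8, 0.3],
    'acoustic3':    [0.7, 0.2],
    'party_track1': [-0.6, 0.8],
}

def mean_vector(song_ids, vectors):
    """Average the embeddings of the songs that are in the vocabulary."""
    known = [vectors[s] for s in song_ids if s in vectors]
    if not known:
        raise ValueError("no known songs to average")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# '999' is unknown, so only its three known neighbours contribute
new_playlist = ['999', 'lofi4', 'sad_song1', 'acoustic3']
vector_999 = mean_vector(new_playlist, vectors)

# Rank every known song by similarity to the estimated vector, best first
ranked = sorted(vectors, key=lambda s: cosine(vector_999, vectors[s]), reverse=True)
print(vector_999, ranked)
```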

In [None]:
model.wv['223']  # returns the learned 32-dimensional embedding of song '223'

array([-2.26888359e-01, -1.24643111e+00,  9.82966244e-01,  9.95871902e-01,
        5.42421758e-01, -1.97363183e-01, -2.11377263e+00,  1.05393216e-01,
       -1.10254908e+00,  8.64706874e-01,  3.69725257e-01,  6.58751369e-01,
        5.08227825e-01, -2.18630004e+00,  3.87266427e-01, -1.02805376e+00,
        5.69717526e-01,  1.16657782e+00,  2.22644225e-01,  1.26935709e+00,
        4.71033245e-01,  7.61530221e-01,  3.89725715e-02,  4.88901824e-01,
       -1.03168678e+00,  1.78497255e-01,  9.44776577e-04, -2.85953552e-01,
        8.25823963e-01,  1.16677821e+00,  6.96452916e-01,  3.48710679e-02],
      dtype=float32)

In [None]:
# Suppose the new song is called 'lofi4'; it is not in our corpus
new_song = 'lofi4'

# Suppose it appears in a new playlist together with songs 223, 121, and 324
vector_lofi4 = np.mean([
    model.wv['223'],
    model.wv['121'],
    model.wv['324']
], axis=0)

In [None]:
similar_songs = model.wv.similar_by_vector(vector_lofi4, topn=10)

In [None]:
similar_songs

[('883', 0.9980020523071289),
 ('165', 0.9978117346763611),
 ('35678', 0.9977737069129944),
 ('19279', 0.9975717663764954),
 ('654', 0.9974988102912903),
 ('347', 0.9974021315574646),
 ('40673', 0.9972525835037231),
 ('19417', 0.997150182723999),
 ('1125', 0.997092068195343),
 ('1111', 0.9970541596412659)]

#### **2. Use Metadata - Hybrid System (Word2Vec + Metadata)**
If '999' has metadata like:

```
{ "genre": "lofi", "artist": "A.R. Chill", "tempo": "slow" }
```

We can find similar songs by:

- Matching genre or artist

- Computing cosine similarity between feature vectors (e.g., tempo, energy)

- Using precomputed metadata embeddings

📌 This helps recommend songs when playlist data is missing.
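
As a rough sketch of the simplest option, matching on genre or artist, the snippet below scores catalog songs by the fraction of metadata fields they share with the new song. The catalog entries and the `metadata_overlap` helper are hypothetical; the dataset used in this notebook only provides title and artist.

```python
new_song = {"genre": "lofi", "artist": "A.R. Chill", "tempo": "slow"}

# Hypothetical catalog entries, invented for this example
catalog = {
    "lofi4":      {"genre": "lofi",     "artist": "Night Owl",  "tempo": "slow"},
    "dance_hit3": {"genre": "dance",    "artist": "DJ Pulse",   "tempo": "fast"},
    "acoustic3":  {"genre": "acoustic", "artist": "A.R. Chill", "tempo": "slow"},
}

def metadata_overlap(a, b):
    """Fraction of metadata fields present in both songs whose values match."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)

# Rank the catalog by overlap with the new song, best first
ranked = sorted(catalog,
                key=lambda sid: metadata_overlap(new_song, catalog[sid]),
                reverse=True)
print(ranked)
```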

In [None]:
print(playlists[:1])

[['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43']]


Word2Vec is already trained above, so now let's work with the song information and use it to recommend songs.

In [None]:
songs_df.head()

id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow


In [None]:
songs_df.isna().sum()

title     1
artist    2
dtype: int64


In [None]:
songs_df.dropna(inplace=True)

## Feature Vectorization

**Artist:** a categorical variable, so one-hot (or label) encoding fits best.

**Title:** TF-IDF works better here, because a title may contain informative words such as "sad", "poetry", "reverbed", or "lofi". A **Sentence Transformer** is better still, because it captures the semantic meaning of the whole title.

In [None]:
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer


In [None]:

# Step 1: Fit encoders
artist_enc = OneHotEncoder(sparse_output=False)
artist_enc.fit(songs_df[['artist']])  # ✅ correct input shape


title_enc = TfidfVectorizer()
title_enc.fit(songs_df['title'])      # ✅ fit on whole column

sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
# Quick check: embed the first title in the dataset
title_vec = sentence_model.encode([songs_df['title'].iloc[0]])



In [None]:
def encode_song_meta(row, title_encoder, artist_encoder = artist_enc):
    artist_vec = artist_encoder.transform([[row['artist']]])

    # If using SentenceTransformer
    if hasattr(title_encoder, 'encode'):
        title_vec = title_encoder.encode([row['title']])
    else:
        title_vec = title_encoder.transform([row['title']]).toarray()

    return np.hstack([title_vec, artist_vec])

Now I have stacked the two vectors. To recommend songs for a new song, I need to know its title and artist. But what if the title or artist is not in the data that the TF-IDF vectorizer and the one-hot encoder were fitted on?
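
One common answer for the artist part is to encode an unseen category as an all-zero vector. The pure-Python sketch below shows that behaviour with two made-up helpers; scikit-learn's `OneHotEncoder` does the same when created with `handle_unknown='ignore'`, whereas its default setting raises an error on unseen categories. For titles, a fitted `TfidfVectorizer` simply ignores words it has not seen, and a sentence transformer can embed any text.

```python
def fit_one_hot(categories):
    """Map each distinct category to a column index (sorted for a stable order)."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}

def one_hot(value, index):
    """Encode value as a one-hot list; unseen values become an all-zero vector."""
    vec = [0.0] * len(index)
    if value in index:
        vec[index[value]] = 1.0
    return vec

# Fit on the artists we know, then encode a seen and an unseen artist
artist_index = fit_one_hot(['Usher', 'Willow', 'Usher'])
print(one_hot('Willow', artist_index))
print(one_hot('Unknown Artist', artist_index))
```

An all-zero artist vector means the artist contributes nothing to the cosine similarity, so the recommendation then rests on the title embedding alone.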

In [None]:
row = {
    'artist': 'Gucci Mane',
    'title': r'Gucci Time (w\/ Swizz Beatz)'  # raw string keeps the backslash without an escape warning
}

encoded_result = encode_song_meta(row, title_encoder = title_enc)
print(encoded_result.shape)

(1, 40100)




In [None]:
encoded_result

array([[0., 0., 0., ..., 0., 0., 0.]])

The TF-IDF vector is very sparse (almost every entry is zero), so use sentence embeddings instead.

In [None]:
encoded_result = encode_song_meta(row, title_encoder = sentence_model)
print(encoded_result.shape)

(1, 16360)




In [None]:
print(encoded_result)

[[-0.08929074 -0.03969651 -0.02199353 ...  0.          0.
   0.        ]]


In [None]:
songs_df.tail()

id,title,artist
75257,Dearest (I'm So Sorry),Picture Me Broken
75258,USA Today,Alan Jackson
75259,Superstar,Raul Malo
75260,Romancin' The Blues,Giacomo Gates
75261,Inner Change,The Jazzmasters


In [None]:
# Encode the first 1,000 songs
encoded_songs = np.array([
    encode_song_meta({'artist': row['artist'], 'title': row['title']}, title_encoder=sentence_model)
    for _, row in songs_df[:1000].iterrows()
])

## Recommend Songs

In [None]:
songs_df[224:230]

id,title,artist
224,Chutkaipan,Scrunter
225,Say Aah,Trey Songz
226,Woo Hah!! Got You All In Check,Busta Rhymes
227,Shimmy Shimmy Ya,Ol' Dirty Bastard
228,Get At Me Dog (w\/ Sheek Louch),DMX
229,Get Your Money Up (w\/ Keyshia Cole & Trina),Keri Hilson


In [None]:
new_data= {
  "artist": "Scrunter",
  "title": "Chutkaipan"
}

In [None]:
new_data_encoded = encode_song_meta(new_data, title_encoder = sentence_model)

In [None]:
# use cosine similarity

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def calculate_cosine_similarity_sklearn(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors using Scikit-learn.
    Note: Input vectors need to be reshaped to 2D arrays for this function.
    """
    return cosine_similarity(vec1.reshape(1, -1), vec2.reshape(1, -1))[0][0]

In [None]:
count = 0
for i, song_vector in enumerate(encoded_songs):
    similarity = calculate_cosine_similarity_sklearn(new_data_encoded, song_vector)
    if similarity > 0:
        print(f"Similarity with song {i}: {similarity}")
        print(songs_df.iloc[i])

        if count == 10:
          break
        count += 1

Similarity with song 0: 0.1364933641920116
title     Gucci Time (w\/ Swizz Beatz)
artist                      Gucci Mane
Name: 0 , dtype: object
Similarity with song 1: 0.02159306182524347
title     Aston Martin Music (w\/ Drake & Chrisette Mich...
artist                                            Rick Ross
Name: 1 , dtype: object
Similarity with song 2: 0.0655873705920384
title     Get Back Up (w\/ Chris Brown)
artist                             T.I.
Name: 2 , dtype: object
Similarity with song 3: 0.03372617239119289
title     Hot Toddy (w\/ Jay-Z & Ester Dean)
artist                                 Usher
Name: 3 , dtype: object
Similarity with song 4: 0.0777024574050724
title     Whip My Hair
artist          Willow
Name: 4 , dtype: object
Similarity with song 5: 0.10049771929184943
title     Down On Me (w\/ 50 Cent)
artist                     Jeremih
Name: 5 , dtype: object
Similarity with song 6: 0.10954676899360699
title     Black And Yellow
artist         Wiz Khalifa
Name: 6 , dty
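
Note that the loop above prints the first songs whose similarity is positive, in dataset order, rather than the most similar ones. A minimal sketch of ranking instead is shown below; it reuses the seven scores visible in the printed output, and `top_k` is a helper written for this example.

```python
import heapq

# Similarity scores copied from the printed output above, keyed by song position
similarities = {
    0: 0.1364933641920116,
    1: 0.02159306182524347,
    2: 0.0655873705920384,
    3: 0.03372617239119289,
    4: 0.0777024574050724,
    5: 0.10049771929184943,
    6: 0.10954676899360699,
}

def top_k(scores, k):
    """Return the k (position, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

best = top_k(similarities, 3)
print(best)
```

Applied to the scores of all encoded songs, the same call would surface the closest matches first instead of whichever songs happen to come first in the dataset.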

# Other Approaches

#### 3. Retrain or Fine-Tune the Model
If new playlists include '999', update the existing Word2Vec model:

```python
model.build_vocab(new_playlists, update=True)
model.train(new_playlists, total_examples=len(new_playlists), epochs=1)
```
This gives '999' a real embedding based on its new co-occurrence patterns.

#### 4. Hybrid System (Word2Vec + Metadata)
Large platforms (Spotify, YouTube Music) combine multiple data sources:

- Playlist co-occurrence (Word2Vec-style)

- Song metadata (genre, artist, mood, year)

- Audio features (valence, tempo, MFCCs)

- User interaction data (likes, skips, history)
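
How such platforms actually combine these signals is not covered here. One simple option, shown purely as an assumption, is a weighted sum of a Word2Vec similarity and a metadata similarity that falls back to metadata alone for cold-start songs. The weight `alpha`, the function `hybrid_score`, and the candidate scores are all hypothetical.

```python
def hybrid_score(w2v_sim, meta_sim, alpha=0.7):
    """Blend two similarity scores; use metadata alone when no embedding exists."""
    if w2v_sim is None:  # cold-start song: no Word2Vec vector available
        return meta_sim
    return alpha * w2v_sim + (1 - alpha) * meta_sim

# Hypothetical candidates: (Word2Vec similarity or None, metadata similarity)
candidates = {
    'acoustic3':   (0.90, 0.40),
    'dance_hit3':  (0.10, 0.20),
    'new_release': (None, 0.80),  # never seen in any playlist
}

# Rank candidates by their blended score, best first
ranking = sorted(candidates,
                 key=lambda s: hybrid_score(*candidates[s]),
                 reverse=True)
print(ranking)
```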