<h1>Class 2a - Tokens and Embeddings</h1>

### [OPTIONAL] - Installing Packages

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies.

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

In [1]:
%%capture
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

# Phi-3

The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately (although that isn't always necessary).

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Notice how generation stops after 20 new tokens (`max_new_tokens=20`)

In [3]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt,
                      return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20
    )

# Print the output
print("\n")
print(tokenizer.decode(generation_output[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.




Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Heartfelt Apologies for the Gardening Mishap


Dear
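
The warning printed above refers to the attention mask, which tells the model which positions hold real tokens and which hold padding. The sketch below is only an illustration of that idea in plain Python: `pad_batch`, `PAD_ID`, and the shorter second sequence are hypothetical and are not part of the `transformers` API.

```python
# Illustrative only: shows what an attention mask encodes for a padded batch.
# PAD_ID and the second sequence are made-up values for this sketch.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-ID lists to equal length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [14350, 385, 4876],  # first three token IDs of the prompt above
    [14350, 385],        # a shorter, hypothetical sequence
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[14350, 385, 4876], [14350, 385, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

In the cell above, `tokenizer(prompt, return_tensors="pt")` already returns an `attention_mask` next to `input_ids`; passing both to `model.generate` is the usual way to avoid the warning.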


In [4]:
print(input_ids)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='cuda:0')


In [None]:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>


In [5]:
print(generation_output)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901, 17778, 29888,  2152,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799]], device='cuda:0')


In [6]:
for token_id in generation_output[0]:
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>
Sub
ject
:
Heart
f
elt
Ap
ologies
for
the
Garden
ing
M
ish
ap






D
ear


In [7]:
print(tokenizer.decode(3323))
print(tokenizer.decode([3323, 622]))

Sub
Subject
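
The output above shows that one word can be stored as several subword tokens that join back together when decoded. A minimal sketch of that idea follows, assuming a tiny made-up vocabulary and a greedy longest-match rule. This is not how the Phi-3 tokenizer works internally (it relies on a learned vocabulary and merge rules); `TOY_VOCAB` and `greedy_subword_split` are hypothetical names used only for illustration.

```python
# Toy vocabulary: hypothetical, chosen to mirror the pieces printed above.
TOY_VOCAB = {"Sub", "ject", "apolog", "izing", "garden", "ing"}

def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit the single character
            pieces.append(word[start])
            start += 1
    return pieces

for word in ["Subject", "apologizing", "gardening"]:
    pieces = greedy_subword_split(word, TOY_VOCAB)
    # Joining the pieces recovers the original word, as decode() does above
    print(word, "->", pieces, "->", "".join(pieces))
```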


# Comparing Trained Tokenizers

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

In [9]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

In [None]:
text = """
this is LLM fundamentals class in santa Clara
"""

In [None]:
show_tokens(text, "bert-base-uncased")

[CLS] this is ll ##m fundamental ##s class in santa clara [SEP]

In [None]:
show_tokens(text, "bert-base-cased")

[CLS] this is LL ##M fundamental ##s class in sa ##nta Clara [SEP]

In [None]:
show_tokens(text, "gpt2")

this  is  LL M  fundamentals  class  in  s anta  Clara

In [None]:
show_tokens(text, "google/flan-t5-small")

this is L LM fundamental s class in  s ant a Clar a </s>

In [None]:
# The official tokenizer library is `tiktoken`, but this is the same tokenizer hosted on the HF platform
show_tokens(text, "Xenova/gpt-4")

this  is  L LM  fundamentals  class  in  santa  Clara

In [None]:
show_tokens(text, "bigcode/starcoder2-15b")

this  is  LL M  fund amentals  class  in  s anta  C lar a

In [None]:
show_tokens(text, "facebook/galactica-1.3b")

this  is  LL M  fundamentals  class  in  s anta  Clara

In [None]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")

this is L LM fund ament als class in s anta Clara

<h1>Contextualized Word Embeddings From a Language Model (Like BERT)</h1>

In [None]:
from transformers import AutoModel, AutoTokenizer

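# Load a tokenizer
# NOTE: this tokenizer comes from a different checkpoint than the model below.
# Normally both are loaded from the same checkpoint so that the token IDs
# match the model's vocabulary.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")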

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the sentence
tokens = tokenizer('I am in Santa Clara', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

DebertaV2Model LOAD REPORT from: microsoft/deberta-v3-xsmall
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
deberta.embeddings.word_embeddings._weight | UNEXPECTED |  | 
mask_predictions.classifier.weight         | UNEXPECTED |  | 
mask_predictions.dense.weight              | UNEXPECTED |  | 
lm_predictions.lm_head.LayerNorm.weight    | UNEXPECTED |  | 
lm_predictions.lm_head.dense.bias          | UNEXPECTED |  | 
mask_predictions.LayerNorm.bias            | UNEXPECTED |  | 
lm_predictions.lm_head.bias                | UNEXPECTED |  | 
mask_predictions.classifier.bias           | UNEXPECTED |  | 
mask_predictions.dense.bias                | UNEXPECTED |  | 
lm_predictions.lm_head.LayerNorm.bias      | UNEXPECTED |  | 
lm_predictions.lm_head.dense.weight        | UNEXPECTED |  | 
mask_predictions.LayerNorm.weight          | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/a

In [None]:
output.shape

torch.Size([1, 7, 384])

In [None]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
I
 am
 in
 Santa
 Clara
[SEP]


In [None]:
output

tensor([[[-3.4824,  0.0244, -0.0648,  ..., -0.0912, -0.1865,  0.2374],
         [-0.2571, -0.2795, -0.4868,  ...,  1.0391,  0.1179,  0.3313],
         [ 0.1346,  0.7334, -0.1298,  ..., -0.0797,  0.1033, -1.8350],
         ...,
         [ 0.0126,  0.0643,  0.0061,  ...,  1.9844,  0.0424, -1.9414],
         [-0.4792,  0.2117,  0.6123,  ...,  0.2390, -0.5811,  0.8389],
         [-3.3535, -0.0300, -0.0204,  ..., -0.0440, -0.3899,  0.0219]]],
       dtype=torch.float16, grad_fn=<NativeLayerNormBackward0>)
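
The tensor above holds one vector per token (shape `[1, 7, 384]`). One common way, though not the only one, to collapse per-token vectors into a single vector for the whole text is mean pooling. The sketch below uses random NumPy data as a stand-in for the model output; `mean_pool` is a hypothetical helper, not a library function.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring masked (padding) positions."""
    # Expand the mask to (batch, tokens, 1) so it broadcasts over the hidden size
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), avoids divide-by-zero
    return summed / counts

# Random stand-in with the same shape as the output above: 1 sentence, 7 tokens, 384 dims
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(1, 7, 384)).astype(np.float32)
attention_mask = np.ones((1, 7), dtype=np.int64)

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding.shape)  # (1, 384)
```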

<h1>Sentence Transformer - Text Embeddings (For Sentences and Whole Documents)</h1>

In [None]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to text embeddings
# vector = model.encode("I like Santa Clara!")
vector = model.encode("I am attending LLM Fundamentals class")

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [None]:
vector.shape

(768,)

In [None]:
# vector.shape is the same for both test sentences: the embedding size is fixed at 768 regardless of input length

In [None]:
print(len(vector))

768


In [None]:
print(vector[:4])

[-0.02966681 -0.13029607 -0.02572516  0.00771284]


In [None]:
print(vector[-4:])

[-0.00091652 -0.01786992  0.01824574 -0.00275091]
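
Text embeddings like the one above are typically compared with cosine similarity, which measures the angle between two vectors. A minimal NumPy sketch follows; the three short vectors are made-up stand-ins for real 768-dimensional embeddings, and `cosine_similarity` is a helper defined here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])    # same direction as v1, so similarity is close to 1
v3 = np.array([-3.0, 0.0, 1.0])   # orthogonal to v1, so similarity is close to 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With the model loaded above, the same helper could be applied to two `model.encode(...)` results to compare two sentences.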


<h1>Word Embeddings Beyond LLMs</h1>

# Review other

In [10]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
import gensim.downloader as api

# Download embeddings (66MB, GloVe, trained on Wikipedia and Gigaword, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")



# most_similar: Examples

In [None]:
model.most_similar([model['january']], topn=11)

[('january', 1.0),
 ('december', 0.9941421747207642),
 ('february', 0.9940434098243713),
 ('october', 0.9925727844238281),
 ('september', 0.9907277822494507),
 ('november', 0.9904569387435913),
 ('august', 0.9883098006248474),
 ('april', 0.9849991798400879),
 ('june', 0.9810335636138916),
 ('july', 0.9798346161842346),
 ('march', 0.9716477990150452)]

In [None]:
model.most_similar([model['university']], topn=11)

[('university', 1.0),
 ('college', 0.874463677406311),
 ('harvard', 0.8710561394691467),
 ('yale', 0.8566808104515076),
 ('graduate', 0.8552882671356201),
 ('institute', 0.8483645915985107),
 ('professor', 0.8417032361030579),
 ('school', 0.8261534571647644),
 ('faculty', 0.8257830142974854),
 ('graduated', 0.8143773078918457),
 ('academy', 0.8103663921356201)]

# Analogies: man is to woman, as king is to ...

In [None]:
model.most_similar(positive=['', 'chicken'], negative=['cat'], topn=1)

[('meat', 0.8934735655784607)]

# The most_similar function operates by:
1. Normalizing each input word vector to unit length.
2. Weighting every word in the positive list by +1 and every word in the negative list by -1.
3. Averaging the weighted vectors into a single query vector.
4. Returning the vocabulary words whose vectors have the highest cosine similarity to that query vector, excluding the input words themselves.

# Analogy Example: "King - Man + Woman = Queen"
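
The analogy arithmetic can be sketched with a handful of hand-made 2-D vectors. Everything in this block is an assumption for illustration: the vectors are invented (axis 0 loosely stands for "royalty", axis 1 for "gender"), and `toy_most_similar` is a simplified stand-in, not gensim's implementation.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.1]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def toy_most_similar(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to (positive words - negative words)."""
    inputs = set(positive) | set(negative)
    # Add unit-length positive vectors, subtract unit-length negative vectors
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query / (len(positive) + len(negative)))
    # Score every word except the inputs themselves
    scores = {w: float(unit(v) @ query) for w, v in vectors.items() if w not in inputs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

result = toy_most_similar(["king", "woman"], ["man"], toy_vectors)
print(result[0][0])  # queen
```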

<h1>Recommending songs by embeddings</h1>

# Dataset
# https://www.cs.cornell.edu/~shuochen/lme/data_page.html

In [None]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [None]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [None]:
print( 'Playlist #-1 (Last):\n ', playlists[-1])

Playlist #-1 (Last):
  ['21165', '8889', '7254', '27588', '8974', '7402', '6808', '12850']


In [None]:
print( 'Playlist #-2:\n ', playlists[-2])

Playlist #-2:
  ['8665', '25878', '9467', '5', '6', '12085', '8650', '8651', '829', '21146', '8666', '8648', '50', '8649', '5698', '5681', '126', '8652', '8653', '8658', '8657', '8655', '8660', '344', '20052', '50', '5822', '77', '8633', '6816', '8663', '5', '21114', '50', '20052', '20065', '20627', '20058', '20622', '26737', '77', '344', '19296', '20070', '20062', '50', '21114', '20052', '20065', '20053', '5', '19477', '20056', '9678', '21116', '5', '20062', '50', '21114', '20052', '21115', '77', '20046', '9678', '19477', '20056', '5', '21114', '50', '20052', '20065', '20062', '20046', '21116', '20627', '77', '344', '20058', '19296', '5', '50', '21114', '20052', '20621', '20065', '9678', '21116', '20622', '5', '20062', '26737', '50', '21114', '5822', '21115', '77', '20052', '20065', '344', '19296', '20053', '20070', '9678', '21116', '44365', '50', '20062', '5', '21114', '20052', '51489', '8674', '86', '5714', '21112', '8616', '8642', '38', '8645', '5943', '21110']


In [None]:
print(songs_df[:10])

                                                title       artist
id                                                                
0                        Gucci Time (w\/ Swizz Beatz)   Gucci Mane
1   Aston Martin Music (w\/ Drake & Chrisette Mich...    Rick Ross
2                       Get Back Up (w\/ Chris Brown)         T.I.
3                  Hot Toddy (w\/ Jay-Z & Ester Dean)        Usher
4                                        Whip My Hair       Willow
5                            Down On Me (w\/ 50 Cent)      Jeremih
6                                    Black And Yellow  Wiz Khalifa
7                                   Blowing Me Kisses   Soulja Boy
8                                         Lay It Down        Lloyd
9                       Good For My Money (w\/ Lloyd)    Baby Bash


In [None]:
print(songs_df[-10:])

                                    title             artist
id                                                          
75253   In Keeping Secrets Of Silent Eart   Coheed & Cambria
75254                             Charmer      Kings Of Leon
75255              For Your Entertainment       Adam Lambert
75256    Let's Make Love (w\/ Tim Mcgraw)         Faith Hill
75257              Dearest (I'm So Sorry)  Picture Me Broken
75258                           USA Today       Alan Jackson
75259                           Superstar          Raul Malo
75260                 Romancin' The Blues      Giacomo Gates
75261                        Inner Change    The Jazzmasters
                                     None               None


# Link to Word2Vec documentation
https://radimrehurek.com/gensim/models/word2vec.html

# Treat each song as a token and each playlist as a sentence.
# The goal is to find songs that appear together in playlists.
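
To make "appear together" concrete, the sketch below lists which songs count as neighbours of each song for a given window size. It is a simplified illustration only: `context_windows` is a hypothetical helper, and actual Word2Vec training in gensim does more than this (for example, it varies the effective window size and draws negative samples).

```python
def context_windows(playlist, window):
    """Yield (song, neighbours): the songs within `window` positions of each song."""
    for i, song in enumerate(playlist):
        left = playlist[max(0, i - window):i]
        right = playlist[i + 1:i + 1 + window]
        yield song, left + right

# First five song IDs of Playlist #2 shown earlier
toy_playlist = ["78", "79", "80", "3", "62"]

for song, neighbours in context_windows(toy_playlist, window=2):
    print(song, "<->", neighbours)
```

Songs that share many neighbours across playlists end up with similar vectors, which is what the recommendation step relies on.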

In [None]:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4
)

In [None]:
song_id = 3822

# Ask the model for songs similar to song #3822
model.wv.most_similar(positive=str(song_id))

[('15660', 0.9900148510932922),
 ('4157', 0.9895102977752686),
 ('4187', 0.9855514168739319),
 ('4181', 0.9847140908241272),
 ('3358', 0.9833555221557617),
 ('1506', 0.982265055179596),
 ('19162', 0.9811265468597412),
 ('12749', 0.9802616238594055),
 ('8542', 0.9795125722885132),
 ('3384', 0.9788711071014404)]

In [None]:
print(songs_df.iloc[3822])

title         Billie Jean
artist    Michael Jackson
Name: 3822 , dtype: object


In [None]:
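import numpy as np

def print_recommendations(song_id):
    # most_similar returns (song_id, similarity) pairs; keep only the IDs
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0]
    # IDs come back as strings; convert them to integers for positional lookup
    return songs_df.iloc[similar_songs.astype(int)]

# Extract recommendations
print_recommendations(3822)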

                                            title                   artist
id
15660                          Let The Music Play                  Shannon
4157                  P.Y.T. (Pretty Young Thing)          Michael Jackson
4187   I Wanna Dance With Somebody (Who Loves Me)          Whitney Houston
4181                                         Kiss  Prince & The Revolution
3358                                     Maneater  Daryl Hall & John Oates


In [None]:
print(songs_df.iloc[3000])

title     Instant Karma! (We All Shine On)
artist                         John Lennon
Name: 3000 , dtype: object


In [None]:
print_recommendations(3000)

                      title               artist
id
2888    For What It's Worth  Buffalo Springfield
2730   Reeling In The Years           Steely Dan
8557                Imagine          John Lennon
2645         After Midnight         Eric Clapton
16543      Yellow Submarine          The Beatles


References:
https://medium.com/analytics-vidhya/ideas-for-using-word2vec-in-human-learning-tasks-1c5dabbeb72e

# Product Recommendation using Word2Vec:
## https://www.kaggle.com/code/tawfikelmetwally/product-recommendation-system-using-word2vec