# Tokenization and Embeddings

This notebook demonstrates tokenization and embedding techniques for NLP tasks.
- **Tokenization**: The process of breaking text into smaller units (tokens), such as words or subwords.
- **Embeddings**: Transforming tokens into dense numerical vectors that capture semantic meaning.

In [1]:
# Import torch and check gpu
import torch

torch.cuda.is_available() # nvidia-gpu

True

# Downloading and Running an LLM

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [4]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=20 # only generates 20 tokens for output
)

print(tokenizer.decode(generation_output[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant>|
Subject: Sincere Apologies for the Gardening Mishap

Dear


In [5]:
print(input_ids)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 19423, 29989,   465, 22137, 29958]], device='cuda:0')


These `input_ids` are indices into the tokenizer's vocabulary, the table of all the tokens it knows.
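As a rough illustration of that lookup, the sketch below uses a tiny, made-up vocabulary (the tokens and ID values are hypothetical and unrelated to the real Phi-3 vocabulary). It only shows the idea that encoding maps tokens to integer IDs and decoding maps them back.

```python
# Toy vocabulary: maps each known token to an integer ID (hypothetical values).
vocab = {"<unk>": 0, "Write": 1, "an": 2, "email": 3, ".": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    """Look up the ID of each token; unknown tokens fall back to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

def decode(ids):
    """Map each ID back to its token string."""
    return [id_to_token[token_id] for token_id in ids]

ids = encode(["Write", "an", "email", "."])
print(ids)
print(decode(ids))
```

A real tokenizer also has to split the raw string into tokens before this lookup, which is where the methods discussed later come in.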

## Translating IDs to tokens

In [7]:
for token_id in input_ids[0]:  # avoid shadowing the built-in `id`
    print(tokenizer.decode(token_id, skip_special_tokens=False))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.<
|
ass
istant
>


## How Does the Tokenizer Break Down Text?

Three key factors determine how a tokenizer processes an input prompt.  

First, the model creator selects a tokenization method during the model’s design phase. Common methods include byte pair encoding (BPE), widely used in GPT models, and WordPiece, which powers BERT. While both methods aim to create an efficient token set for representing text, they achieve this through different approaches.  

Second, after selecting a method, various design choices must be made, such as setting the vocabulary size and defining special tokens. These decisions impact how the tokenizer functions.

Third, the tokenizer must be trained on a specific dataset to optimize its vocabulary. Even with the same method and parameters, a tokenizer trained on English text will differ from one trained on code or multilingual datasets.  

Beyond preprocessing input for a language model, tokenizers also handle output processing by converting token IDs back into words or tokens.
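To make the first and third factors more concrete, here is a minimal sketch of how BPE training could work on a toy corpus: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. The function names and the corpus are made up for illustration, and real implementations (byte-level handling, pre-tokenization, tie-breaking rules) differ in many details.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts out as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first one wins ties)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

Training the same procedure on a different corpus (code, for example) would produce different merges, which is why the dataset matters even when the method and vocabulary size are fixed.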

## Comparing Trained LLM Tokenizers

In [8]:
text = """

English and CAPITALIZATION

🎵鸟
show_tokens False None elif == >= else: two tabs:" " Three tabs: "   "

12.0*50=600

"""

In [9]:
colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

### BERT Base Model (uncased) (2018)
- **Tokenization method**: WordPiece
- **uncased** - all letters are converted into lowercase
- **Vocabulary Size**: 30,522

In [10]:
show_tokens(sentence=text, tokenizer_name="bert-base-uncased")

'[CLS]' 'english' 'and' 'capital' '##ization' '[UNK]' '[UNK]' 'show' '_' 'token' '##s' 'false' 'none' 'eli' '##f' '=' '=' '>' '=' 'else' ':' 'two' 'tab' '##s' ':' '"' '"' 'three' 'tab' '##s' ':' ...

- The newline breaks are gone, which makes the model blind to information encoded in newlines (e.g., a chat log when each turn is in a new line).

- All the text is in lowercase.

- The word “capitalization” is encoded as two subtokens: capital ##ization. The ## characters are used to indicate this token is a partial token connected to the token that precedes it. This is also a method to indicate where the spaces are, as it is assumed tokens without ## in front have a space before them.

- The emoji and Chinese characters are gone and replaced with the [UNK] special token indicating an “unknown token.”
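The `##` convention can be illustrated with a small sketch of greedy longest-match-first splitting, which is the general idea behind WordPiece at tokenization time. The vocabulary below is a hypothetical toy set, and the real BERT tokenizer also performs steps not shown here (lowercasing, punctuation splitting, whitespace cleanup).

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous token
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no match at all: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

def detokenize(tokens):
    """Glue ## tokens onto the previous token; put a space before all others."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]
        else:
            text += (" " if text else "") + token
    return text

toy_vocab = {"english", "and", "capital", "##ization", "[UNK]"}  # hypothetical
pieces = wordpiece_tokenize("capitalization", toy_vocab)
print(pieces)
print(wordpiece_tokenize("🎵", toy_vocab))
print(detokenize(["english", "and"] + pieces))
```

Because the emoji has no matching entry in the toy vocabulary, it collapses to the unknown token, mirroring the `[UNK]` behavior seen above.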

### BERT Base Model (cased) (2018)
- **Tokenization method**: WordPiece
- **Vocabulary Size**: 28,996

In [11]:
show_tokens(text, "bert-base-cased")

'[CLS]' 'English' 'and' 'CA' '##PI' '##TA' '##L' '##I' '##Z' '##AT' '##ION' '[UNK]' '[UNK]' 'show' '_' 'token' '##s' 'F' '##als' '##e' 'None' 'el' '##if' '=' '=' '>' '=' 'else' ':' 'two' 'ta' ...

The cased version of the BERT tokenizer differs mainly in including uppercase tokens.

- Notice how “CAPITALIZATION” is now represented as eight tokens: CA ##PI ##TA ##L ##I ##Z ##AT ##ION.

- Both BERT tokenizers wrap the input within a starting [CLS] token and a closing [SEP] token. [CLS] and [SEP] are utility tokens used to wrap the input text and they serve their own purposes. [CLS] stands for classification as it’s a token used at times for sentence classification. [SEP] stands for separator, as it’s used to separate sentences in some applications that require passing two sentences to a model.
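The wrapping described above can be sketched in a few lines. This is an illustrative helper (the name `build_bert_input` is made up here), not the tokenizer's actual code; it shows how one or two token sequences could be framed with `[CLS]` and `[SEP]`, along with segment IDs that mark which sentence each token belongs to.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)            # first sentence, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its closing [SEP]
    return tokens, segment_ids

single, single_segments = build_bert_input(["hello", "world"])
pair, pair_segments = build_bert_input(["how", "are", "you"], ["fine", "thanks"])
print(single, single_segments)
print(pair, pair_segments)
```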

### GPT-2 (2019)
- **Tokenization method**: Byte pair encoding (BPE)
- **Vocabulary Size**: 50,257
- **Special tokens**: <|endoftext|>

In [12]:
show_tokens(text, "openai-community/gpt2")

'\n' '\n' 'English' ' and' ' CAP' 'ITAL' 'IZ' 'ATION' '\n' '\n' '�' '�' '�' '�' '�' '�' '\n' 'show' '_' 't' 'ok' 'ens' ' False' ' None' ' el' 'if' ' ==' ' >=' ' else' ':' ' two' ' tabs' ...

With the GPT-2 tokenizer, we notice the following:

- The newline breaks are represented in the tokenizer.

- Capitalization is preserved, and the word “CAPITALIZATION” is represented in four tokens.

- The 🎵鸟 characters are now represented by multiple tokens each. While we see these tokens printed as the � character, they actually stand for different tokens. For example, the 🎵 emoji is broken down into the tokens with token IDs 8582, 236, and 113. The tokenizer is successful in reconstructing the original character from these tokens. We can see that by printing tokenizer.decode([8582, 236, 113]), which prints out 🎵.

- The two tabs are represented as two tokens (token number 197 in that vocabulary) and the four spaces are represented as three tokens (number 220) with the final space being a part of the token for the closing quote character.


In [13]:
gpt2_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
gpt2_tokenizer.decode([8582, 236, 113])

'🎵'
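The reason individual tokens print as � can be seen without any tokenizer at all. A byte-level tokenizer works on the UTF-8 bytes of the text, and a fragment of a multi-byte character is not valid UTF-8 on its own. The sketch below uses only Python's built-in encoding; the grouping of the four bytes into three pieces is an assumption for illustration and is not read from the GPT-2 vocabulary.

```python
emoji = "🎵"
raw = emoji.encode("utf-8")   # the four UTF-8 bytes behind the emoji
byte_values = list(raw)

# Hypothetical grouping of the bytes into three pieces (2 + 1 + 1), for illustration only.
pieces = [raw[:2], raw[2:3], raw[3:]]

# Decoded one at a time, no piece is valid UTF-8 on its own,
# so each one is rendered with the replacement character.
partial = [piece.decode("utf-8", errors="replace") for piece in pieces]

# Joining the bytes first and decoding once recovers the original character.
restored = b"".join(pieces).decode("utf-8")

print(byte_values)
print(partial)
print(restored)
```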

### Flan-T5 (2022)
- **Tokenization method**: SentencePiece (which supports BPE and unigram language model)
- **Vocabulary size**: 32,100
- **Special tokens**:
    - unk_token <unk>
    - pad_token <pad>

In [15]:
show_tokens(text, "google/flan-t5-base")

'English' 'and' 'CA' 'PI' 'TAL' 'IZ' 'ATION' '' '<unk>' 'show' '_' 'to' 'ken' 's' 'Fal' 's' 'e' 'None' '' 'e' 'l' 'if' '=' '=' '>' '=' 'else' ':' 'two' 'tab' 's' ':' ...

- No newline or whitespace tokens; this would make it challenging for the model to work with code.

- The emoji and Chinese characters are both replaced by the <unk> token, making the model completely blind to them.

### GPT-4 (2023)
- **Tokenization method**: BPE
- **Vocabulary size**: over 100,000
- **Special tokens**:
    - <|endoftext|>
    - Fill-in-the-middle tokens:
        - <|fim_prefix|>
        - <|fim_middle|>
        - <|fim_suffix|>

In [16]:
show_tokens(text, "Xenova/gpt-4")

'\n\n' 'English' ' and' ' CAPITAL' 'IZATION' '\n\n' '�' '�' '�' '�' '�' '�' '\n' 'show' '_tokens' ' False' ' None' ' elif' ' ==' ' >=' ' else' ':' ' two' ' tabs' ':"' ' "' ' Three' ' tabs' ':' ' "' '  ' ...

The GPT-4 tokenizer behaves similarly to its ancestor, the GPT-2 tokenizer. Some differences are:

- The GPT-4 tokenizer represents the four spaces as a single token. In fact, it has a dedicated token for every sequence of whitespace characters, up to a sequence of 83.

- The Python keyword elif has its own token in GPT-4. Both this and the previous point stem from the model’s focus on code in addition to natural language.

- The GPT-4 tokenizer uses fewer tokens to represent most words. Examples here include “CAPITALIZATION” (two tokens versus four) and “tokens” (one token versus three).

### Phi-3 (and Llama 2)

The Phi-3 model used here reuses the Llama 2 tokenizer and adds a number of special tokens.

- Tokenization method: BPE
- Vocabulary size: 32,000
- Special tokens:
    - <|endoftext|>
    - Chat tokens:
        - <|user|>
        - <|assistant|>
        - <|system|>

In [17]:
show_tokens(text, model_name)

[0;30;48;2;102;194;165m[0m [0;30;48;2;252;141;98m
[0m [0;30;48;2;141;160;203m
[0m [0;30;48;2;231;138;195mEnglish[0m [0;30;48;2;166;216;84mand[0m [0;30;48;2;255;217;47mC[0m [0;30;48;2;102;194;165mAP[0m [0;30;48;2;252;141;98mIT[0m [0;30;48;2;141;160;203mAL[0m [0;30;48;2;231;138;195mIZ[0m [0;30;48;2;166;216;84mATION[0m [0;30;48;2;255;217;47m
[0m [0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98m�[0m [0;30;48;2;141;160;203m�[0m [0;30;48;2;231;138;195m�[0m [0;30;48;2;166;216;84m�[0m [0;30;48;2;255;217;47m�[0m [0;30;48;2;102;194;165m�[0m [0;30;48;2;252;141;98m�[0m [0;30;48;2;141;160;203m
[0m [0;30;48;2;231;138;195mshow[0m [0;30;48;2;166;216;84m_[0m [0;30;48;2;255;217;47mto[0m [0;30;48;2;102;194;165mkens[0m [0;30;48;2;252;141;98mFalse[0m [0;30;48;2;141;160;203mNone[0m [0;30;48;2;231;138;195melif[0m [0;30;48;2;166;216;84m==[0m [0;30;48;2;255;217;47m>=[0m [0;30;48;2;102;194;165melse[0m [0;30;48;2;252;141;98m:[0m [0;30;48;2;141;160

## Important Tokenizer Parameters

- **Vocabulary size**: How many tokens to keep in the tokenizer’s vocabulary? (30K and 50K are often used as vocabulary size values, but more and more we’re seeing larger sizes like 100K.)

- **Special tokens**: What special tokens do we want the model to keep track of? We can add as many of these as we want, especially if we want to build an LLM for special use cases. Common choices include:
    - Beginning of text token
    - End of text token
    - Padding token
    - Unknown token
    - CLS token
    - Masking token

- **Capitalization**: In languages such as English, how do we want to deal with capitalization? Should we convert everything to lowercase? (Name capitalization often carries useful information, but do we want to waste token vocabulary space on all-caps versions of words?)
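The capitalization trade-off can be made tangible with a toy count. The sketch below compares how many distinct word-level entries a tiny, made-up corpus needs with and without lowercasing. Real tokenizers work on subwords rather than whole words, so this only illustrates the direction of the effect.

```python
# A tiny made-up corpus in which the same words appear in several casings.
corpus = "The cat saw THE dog . the Dog saw The Cat ."

cased_vocab = set(corpus.split())            # keeps "The", "THE", and "the" apart
uncased_vocab = set(corpus.lower().split())  # folds them into a single entry

print(len(cased_vocab), sorted(cased_vocab))
print(len(uncased_vocab), sorted(uncased_vocab))
```

The cased vocabulary spends extra entries on case variants, while the uncased one can no longer tell `Dog` (perhaps a name) from `dog`.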

# Embeddings

## Creating Contextualized Word Embeddings with Language Models

In [20]:
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/deberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
tokens = tokenizer("Hello world", return_tensors="pt")

# Process the tokens
output = model(**tokens)[0]

In [21]:
output.shape

torch.Size([1, 4, 768])

In [22]:
for token in tokens["input_ids"][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
[SEP]


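Each of the four tokens is represented by a vector of 768 values, which is why the output has shape (1, 4, 768): batch size, number of tokens, and embedding dimension.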

In [23]:
output

tensor([[[ 0.0473, -0.0435, -0.0812,  ...,  0.0121,  0.0395, -0.0462],
         [-1.1017, -0.7390, -0.7409,  ..., -0.4467,  0.3183, -0.4456],
         [ 1.0047,  0.6782, -0.4958,  ...,  0.2436, -0.3662,  0.5453],
         [ 0.2161,  0.0714, -0.1412,  ...,  0.0635,  0.1943,  0.0773]]],
       grad_fn=<AddBackward0>)

## Text Embeddings (for Sentences and Whole Documents)

While token embeddings are key to how LLMs operate, a number of LLM applications require operating on entire sentences, paragraphs, or even text documents. This has led to special language models that produce text embeddings—a single vector that represents a piece of text longer than just one token.

In [26]:
from sentence_transformers import SentenceTransformer

# Load model
embed_model = SentenceTransformer("all-mpnet-base-v2")

# Convert the text to text embeddings
vector = embed_model.encode("Best movie ever!")
vector.shape

(768,)
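One common way to turn per-token vectors into a single text vector is mean pooling: average the token vectors, ignoring padding positions. The sketch below shows the idea in plain Python with made-up two-dimensional vectors. Whether a particular embedding model pools this way depends on its configuration, so treat it as an illustration rather than a description of `all-mpnet-base-v2`.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    return [value / count for value in total]

def l2_normalize(vector):
    """Scale a vector to unit length, a frequent final step before similarity search."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

# Three hypothetical token vectors; the last position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

text_vector = mean_pool(token_vectors, attention_mask)
unit_vector = l2_normalize(text_vector)
print(text_vector)
print(unit_vector)
```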

## Using Pretrained Word Embeddings

In [27]:
import gensim.downloader as api

# Download 50-dimensional GloVe embeddings (trained on Wikipedia and Gigaword)
model = api.load("glove-wiki-gigaword-50")




In [29]:
# Nearest neighbors of the vector for `king`; because we query by vector, `king` itself is the top hit
model.most_similar(model['king'], topn=10)

[('king', 1.0000001192092896),
 ('prince', 0.8236179351806641),
 ('queen', 0.7839043140411377),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247777938843),
 ('son', 0.766719400882721),
 ('uncle', 0.7627150416374207),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492411136627197)]
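The scores in this list are cosine similarities between word vectors. A minimal sketch of that measure is shown below with made-up three-dimensional vectors (the real GloVe vectors used above have 50 dimensions); the values are hypothetical and chosen only so that related words point in similar directions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from GloVe.
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
banana = [0.1, 0.0, 0.9]

print(cosine_similarity(king, king))
print(cosine_similarity(king, queen))
print(cosine_similarity(king, banana))
```

A vector compared with itself scores 1 up to floating-point rounding, which is why `king` appears above with a score marginally over 1.0.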

# Embeddings for Recommendation Systems

## Recommending Songs by Embeddings

In [30]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as they only contain metadata
lines = data.read().decode("utf-8").split("\n")[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load the song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [32]:
songs_df.head()

| id | title | artist |
|---|---|---|
| 0 | Gucci Time (w\/ Swizz Beatz) | Gucci Mane |
| 1 | Aston Martin Music (w\/ Drake & Chrisette Mich... | Rick Ross |
| 2 | Get Back Up (w\/ Chris Brown) | T.I. |
| 3 | Hot Toddy (w\/ Jay-Z & Ester Dean) | Usher |
| 4 | Whip My Hair | Willow |


In [34]:
print(playlists[0])

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43']


In [36]:
print(playlists[1])

['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', '127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169'

In [37]:
# Let's train the model
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(playlists, vector_size=32, window=20, negative=50, min_count=1,
                 workers=4)
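Word2Vec treats each playlist like a sentence and each song ID like a word, so songs that appear near each other in playlists end up with similar vectors. The sketch below shows which (center, context) neighbors a fixed window yields for a toy playlist; `context_pairs` is a hypothetical helper written for this illustration. The actual gensim training loop adds details not shown here, such as negative sampling and randomly reduced window sizes, and by default it uses the CBOW variant, which predicts the center item from its combined context rather than from individual pairs.

```python
def context_pairs(playlist, window):
    """List every (center, neighbor) pair that falls inside the window."""
    pairs = []
    for i, center in enumerate(playlist):
        lo = max(0, i - window)
        hi = min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, playlist[j]))
    return pairs

toy_playlist = ["0", "1", "2", "3"]  # song IDs as strings, like the real playlists
pairs = context_pairs(toy_playlist, window=1)
print(pairs)
```

With `window=20`, as in the training call above, most songs in a short playlist count as context for each other.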

In [38]:
song_id = 2172

# Ask model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

[('2849', 0.9984061121940613),
 ('11473', 0.9972136616706848),
 ('10105', 0.9960300922393799),
 ('3116', 0.9959848523139954),
 ('5634', 0.99503093957901),
 ('3079', 0.9949156641960144),
 ('3126', 0.9947771430015564),
 ('5586', 0.9946444630622864),
 ('3167', 0.9944220185279846),
 ('6626', 0.9942136406898499)]

In [39]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [40]:
import numpy as np

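def print_recommendations(song_id):
    # most_similar returns (song_id_string, similarity) pairs; keep only the IDs
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0]
    # The IDs come back as strings; cast to int so .iloc receives positional indices
    return songs_df.iloc[similar_songs.astype(int)]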

# Extract recommendations
print_recommendations(2172)

| id | title | artist |
|---|---|---|
| 2849 | Run To The Hills | Iron Maiden |
| 11473 | Little Guitars | Van Halen |
| 10105 | Three Lock Box | Sammy Hagar |
| 3116 | Communication Breakdown | Led Zeppelin |
| 5634 | Mr. Brownstone | Guns N' Roses |
