### Word embeddings beyond LLMs

#### Using pretrained word embeddings

In [3]:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")

In [4]:
model.most_similar([model['king']], topn=11)

[('king', 1.0000001192092896),
 ('prince', 0.8236181139945984),
 ('queen', 0.7839042544364929),
 ('ii', 0.7746229767799377),
 ('emperor', 0.7736246585845947),
 ('son', 0.7667195200920105),
 ('uncle', 0.7627151012420654),
 ('kingdom', 0.7542161345481873),
 ('throne', 0.7539914846420288),
 ('brother', 0.7492412328720093),
 ('ruler', 0.7434254288673401)]
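
The scores in the output above are cosine similarities between the query vector and the vectors in the vocabulary, which is why `king` itself comes back with a score of about 1.0. As a minimal sketch of that ranking step, the snippet below uses plain Python and a few made-up 3-dimensional vectors. The names `toy_vectors`, `cosine_similarity`, and `most_similar` are hypothetical and only illustrate the idea; they are not gensim's implementation, and the numbers are not GloVe values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, topn=3):
    """Rank every entry in `vectors` by cosine similarity to `query`."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up toy vectors, chosen only so that related words point the same way
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "prince": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

ranking = most_similar(toy_vectors["king"], toy_vectors, topn=3)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Because the query is the vector for `king`, the word itself ranks first, just as in the gensim output.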

### Training the song embedding model

##### Let's assume a song is a word and a playlist is a sentence

In [13]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

In [18]:
# Parse the playlist dataset file. Skip the first two lines,
# which contain only metadata.
lines = data.read().decode("utf-8").split("\n")[2:]

In [29]:
# Remove playlists with fewer than two songs
playlists = [s.strip().split() for s in lines if len(s.split()) > 1]
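
To see what the parsing step does without downloading the dataset, here is a small sketch that runs on an inline string. The helper `parse_playlists` and the `sample_text` layout (two header lines, then one playlist of space-separated song IDs per line) are assumptions made for illustration; they mirror the code above rather than document the real file.

```python
def parse_playlists(raw_text, header_lines=2):
    """Turn raw text into a list of playlists, each a list of song-ID strings."""
    # Drop the metadata lines at the top of the file
    lines = raw_text.split("\n")[header_lines:]
    # Keep only playlists with at least two songs
    return [line.split() for line in lines if len(line.split()) > 1]

# Hypothetical sample: two header lines, then one playlist per line
sample_text = (
    "meta line 1\n"
    "meta line 2\n"
    "0 1 2 3 \n"
    "4\n"
    "\n"
    "5 6 \n"
)

playlists_demo = parse_playlists(sample_text)
print(playlists_demo)
```

The single-song playlist and the empty line are filtered out, so each remaining playlist can act as a "sentence" of song IDs.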


In [22]:
# Load song metadata (one tab-separated line per song: id, title, artist)
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split("\n")
songs = [s.rstrip().split("\t") for s in songs_file]

songs_df = pd.DataFrame(data=songs, columns=["id", "title", "artist"])
songs_df = songs_df.set_index("id")

In [30]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [31]:
# Train the model

from gensim.models import Word2Vec

model = Word2Vec(playlists, vector_size=32, window=20, negative=50, 
                 min_count=1, workers=4)

In [32]:
song_id = 2172
model.wv.most_similar(positive=str(song_id))

[('6658', 0.9981025457382202),
 ('11473', 0.9975206851959229),
 ('3126', 0.9971314668655396),
 ('3167', 0.9969710111618042),
 ('5586', 0.9967140555381775),
 ('6624', 0.9963540434837341),
 ('3094', 0.9963253736495972),
 ('10105', 0.9952881932258606),
 ('10084', 0.9951165914535522),
 ('2849', 0.9950700998306274)]

In [33]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [34]:
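def print_recommendations(song_id):
    # most_similar returns (song_id, score) pairs. Convert the IDs to
    # integers so they can be used as positional indices into songs_df.
    similar_songs = [int(song) for song, _ in
                     model.wv.most_similar(positive=str(song_id), topn=5)]
    return songs_df.iloc[similar_songs]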

In [36]:
print_recommendations(2172)

| id | title | artist |
|---|---|---|
| 6658 | (Bang Your Head) Metal Health | Quiet Riot |
| 11473 | Little Guitars | Van Halen |
| 3126 | Heavy Metal | Sammy Hagar |
| 3167 | Unchained | Van Halen |
| 5586 | The Last In Line | Dio |


### Summary
1. In this chapter, we have covered LLM tokens, tokenizers, and useful approaches to using token embeddings. This prepares us to start looking closer at language models in the next chapter, and also opens the door to learn about how embeddings are used beyond language models.
2. We explored how tokenizers are the first step in processing input to an LLM, transforming raw textual input into token IDs. Common tokenization schemes include breaking text down into words, subword tokens, characters, or bytes, depending on the specific requirements of a given application.
3. A tour of real-world pretrained tokenizers (from BERT to GPT-2, GPT-4, and other models) showed us areas where some tokenizers are better (e.g., preserving information like capitalization, newlines, or tokens in other languages) and other areas where tokenizers are just different from each other (e.g., how they break down certain words).
4. Three of the major tokenizer design decisions are the tokenizer algorithm (e.g., BPE, WordPiece, SentencePiece), tokenization parameters (including vocabulary size, special tokens, and the treatment of capitalization and different languages), and the dataset the tokenizer is trained on.
5. Language models are also creators of high-quality contextualized token embeddings that improve on raw static embeddings. Those contextualized token embeddings are what’s used for tasks including named-entity recognition (NER), extractive text summarization, and text classification. In addition to producing token embeddings, language models can produce text embeddings that cover entire sentences or even documents. This empowers plenty of applications that will be shown in Part II of this book covering language model applications.
6. Before LLMs, word embedding methods like word2vec, GloVe, and fastText were popular. In language processing, these have largely been replaced with contextualized word embeddings produced by language models. The word2vec algorithm relies on two main ideas: skip-gram and negative sampling. It also uses contrastive training similar to the type we’ll see in Chapter 10.
7. Embeddings are useful for creating and improving recommender systems as we discussed in the music recommender we built from curated song playlists.
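
The two word2vec ideas named in the summary, skip-gram and negative sampling, can be sketched on a toy playlist. The snippet below only builds the labeled training examples (1 for a real neighbor, 0 for a sampled non-neighbor); it does not train a model. The function names, the toy `playlist`, and the `vocabulary` are hypothetical. It also samples negatives uniformly and skips real neighbors, which is a simplification of what word2vec implementations typically do (they usually draw negatives from a smoothed frequency distribution).

```python
import random

def skipgram_pairs(sequence, window=2):
    """Collect (center, context) pairs for items within `window` positions."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        end = min(len(sequence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

def add_negative_samples(pairs, vocabulary, num_negative=2, seed=0):
    """Label real pairs 1 and add randomly drawn non-neighbor pairs labeled 0."""
    rng = random.Random(seed)
    positives = set(pairs)
    examples = []
    for center, context in pairs:
        examples.append((center, context, 1))
        # Simplification: uniform sampling that avoids the center's real neighbors
        candidates = [item for item in vocabulary
                      if item != center and (center, item) not in positives]
        for negative in rng.sample(candidates, min(num_negative, len(candidates))):
            examples.append((center, negative, 0))
    return examples

# Hypothetical playlist and vocabulary of song IDs
playlist = ["10", "11", "12", "13"]
vocabulary = ["10", "11", "12", "13", "90", "91", "92"]

pairs = skipgram_pairs(playlist, window=1)
examples = add_negative_samples(pairs, vocabulary, num_negative=2)
for example in examples[:6]:
    print(example)
```

A classifier trained to tell the 1-labeled pairs from the 0-labeled ones is what pushes songs that share playlists toward nearby vectors.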

