## Class 8: Vector representations

Solve all exercises in Python, in a copy of the Jupyter Notebook for the given class, in the designated places (cells with the `# Solution` comment).

Do not delete the cells that contain the task descriptions.

Display outputs using `print`.

## Optional! (may be useful for more ambitious final projects)

https://github.com/huggingface/smol-course - a course on fine-tuning LLMs for your own tasks

https://github.com/unslothai/unsloth - a library for efficient fine-tuning of LLMs (ready-made notebooks with code are available on Colab)

#### What is a vector?

Vector - a one-dimensional array of numbers

[0, 1, 0, 0, 0] - one hot encoding - only the values 0/1

[0, 2, 0, 5, 1, 100] - frequency encoding - integers >= 0

[-1.5, 0.0002, 5000.01] - a general vector of real numbers
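
As a rough illustration of the first two encodings, the sketch below builds both vectors for one sentence over a tiny vocabulary. The vocabulary, the sentence, and the helper names `one_hot_vector` and `frequency_vector` are made up for this example; real vocabularies are much larger.

```python
from collections import Counter

# Toy vocabulary (hypothetical), sorted so that every word has a stable index
vocabulary = ["a", "cat", "dog", "has", "the"]

def one_hot_vector(tokens, vocabulary):
    # 1 if the vocabulary word occurs in the sentence at least once, else 0
    present = set(tokens)
    return [int(word in present) for word in vocabulary]

def frequency_vector(tokens, vocabulary):
    # number of occurrences of each vocabulary word in the sentence
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = "the cat has a cat".split()
print(one_hot_vector(tokens, vocabulary))    # [1, 1, 0, 1, 1]
print(frequency_vector(tokens, vocabulary))  # [1, 2, 0, 1, 1]
```

The two vectors differ only at the index of "cat", which occurs twice: the one hot vector records presence, the frequency vector records the count.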

### Task 1

Preprocess the texts from https://git.wmi.amu.edu.pl/ryssta/spam-classification/src/branch/master/train/in.tsv (preprocessing is the initial processing of texts: lowercasing, tokenization, etc.) and perform:
* a) one hot encoding
* b) frequency encoding

for two example sentences that contain at least 2 occurrences of a word that is in the vocabulary (i.e. a word that occurs in the corpus from the in.tsv file). Because the corpus contains a large number of unique words, please do not print the whole vectors, only the indices and values.

In [None]:
# Solution
import pandas as pd
import re

def read_and_prepare_data(file):
    data=pd.read_csv(file, sep='\t', header=None)
    data.columns=['message']
    # make all messages lowercase and remove punctuation
    data['message']=data['message'].str.lower()
    data['message']=data['message'].apply(lambda x: re.sub(r'[^\w\s]',' ',x))
    return data

def create_vocabulary(data):
    vocabulary=set()
    for message in data['message']:
        words=message.split(" ")
        for word in words:
            vocabulary.add(word)
    vocabulary=list(vocabulary)
    if ('' in vocabulary):
        vocabulary.remove('') # remove empty string if it exists
    vocabulary.sort()
    return vocabulary

def one_hot_encode(sentence,vocabulary):
    sentence=sentence.lower()
    sentence=re.sub(r'[^\w\s]',' ',sentence)
    sentence=sentence.split(" ")
    one_hot_encoded=[]
    for word in vocabulary:
        one_hot_encoded.append(int(word in sentence))
    result=[]
    for i in range(len(one_hot_encoded)):
        if one_hot_encoded[i]==1:
            result.append([i,1])
    return result

def frequency_encode(sentence,vocabulary):
    sentence=sentence.lower()
    sentence=re.sub(r'[^\w\s]',' ',sentence)
    sentence=sentence.split(" ")
    frequency_encoded=[]
    for word in vocabulary:
        frequency_encoded.append(sentence.count(word))
    result=[]
    for i in range(len(frequency_encoded)):
        if frequency_encoded[i]!=0:
            result.append([i,frequency_encoded[i]])
    return result

def print_result(result):
    for record in result:
        print(record)
    print("Rest of the indexes have values 0\n")

data=read_and_prepare_data("in.tsv")
vocabulary=create_vocabulary(data)

print("First sentence\n")
first_sentence="I really appreciate good food. Appreciate mate."

print("One hot encoding: ")
print_result(one_hot_encode(first_sentence, vocabulary))

print("Frequency encoding: ")
print_result(frequency_encode(first_sentence, vocabulary))

print("Second sentence\n")
second_sentence="I'm making a webpage. My webpage is going to be the best webpage."

print("One hot encoding: ")
print_result(one_hot_encode(second_sentence, vocabulary))

print("Frequency encoding: ")
print_result(frequency_encode(second_sentence, vocabulary))



First sentence

One hot encoding: 
[930, 1]
[2692, 1]
[2922, 1]
[3264, 1]
[4018, 1]
[5150, 1]
Rest of the indexes have values 0

Frequency encoding: 
[930, 2]
[2692, 1]
[2922, 1]
[3264, 1]
[4018, 1]
[5150, 1]
Rest of the indexes have values 0

Second sentence

One hot encoding: 
[664, 1]
[1157, 1]
[1205, 1]
[2913, 1]
[3264, 1]
[3432, 1]
[3922, 1]
[3973, 1]
[4277, 1]
[6253, 1]
[6367, 1]
[6816, 1]
Rest of the indexes have values 0

Frequency encoding: 
[664, 1]
[1157, 1]
[1205, 1]
[2913, 1]
[3264, 1]
[3432, 1]
[3922, 1]
[3973, 1]
[4277, 1]
[6253, 1]
[6367, 1]
[6816, 3]
Rest of the indexes have values 0



### Task 2

Using the file https://git.wmi.amu.edu.pl/ryssta/spam-classification/src/branch/master/train/in.tsv and the file https://git.wmi.amu.edu.pl/ryssta/spam-classification/src/branch/master/train/expected.tsv, split the texts by class (spam / not spam). Compute the IDF value separately for the spam texts and for the non-spam texts, for the words:
* free
* send
* are
* the

and 2-3 words of your own choice.
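
For orientation, here is a minimal sketch of IDF on a toy corpus, assuming the common definition idf(w) = ln(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The corpus `docs` and the helper `idf` are hypothetical, and returning infinity for a word that never occurs is an assumption of this sketch, not a standard.

```python
import math

def idf(word, documents):
    """Natural-log IDF: ln(N / df), with df = number of documents containing the word."""
    df = sum(1 for doc in documents if word in doc.split())
    if df == 0:
        return math.inf  # assumption: treat a word that never occurs as infinitely rare
    return math.log(len(documents) / df)

# Toy corpus (made up for illustration)
docs = ["free prize call now", "call me later", "see you at the game", "free entry"]

print(idf("free", docs))  # ln(4 / 2): the word is in 2 of 4 documents
print(idf("the", docs))   # ln(4 / 1): the word is in 1 of 4 documents
```

A lower IDF means the word appears in a larger share of the documents, so comparing the IDF of the same word across the two classes shows in which class it is more common.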

In [None]:
# Solution

import pandas as pd
import re
import math


def read_and_prepare_data(messages_file, labels_file):
    messages=pd.read_csv(messages_file, sep='\t', header=None)
    labels=pd.read_csv(labels_file, sep='\t', header=None)
    
    # combine messages and labels into one dataframe
    alldata=pd.concat([messages, labels], axis=1)
    alldata.columns=['message', 'label']
    
    # make all messages lowercase and remove punctuation
    alldata['message']=alldata['message'].str.lower()
    alldata['message']=alldata['message'].apply(lambda x: re.sub(r'[^\w\s]',' ',x))
    
    # split data into classes not_spam and spam and remove the label column 
    not_spam = alldata[alldata['label'] == 0].drop('label', axis=1).reset_index(drop=True)
    spam = alldata[alldata['label'] == 1].drop('label', axis=1).reset_index(drop=True)
    not_spam.to_csv('not_spam.tsv', sep='\t', index=False, header=False)
    spam.to_csv('spam.tsv', sep='\t', index=False, header=False)
    return not_spam, spam

def get_total_document_count(data):
    return len(data)

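def get_total_occurrences_of_messages_with_word(word, data):
    # document frequency: number of messages that contain the word at least once
    count=0
    for message in data:
        words=message.split(" ")
        if word in words:
            count+=1
    return count

def calculate_idf_for_word(word, data, total_document_count):
    total_occurrences=get_total_occurrences_of_messages_with_word(word, data)
    if total_occurrences==0:
        # the word never occurs in this class: return infinity instead of dividing by zero
        return math.inf
    return math.log(total_document_count/total_occurrences)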


(not_spam,spam)=read_and_prepare_data("in.tsv", "expected.tsv")

words_list=["free", "send", "are", "the", "call", "reply", "won"]

idf_not_spam={}
idf_spam={}
total_document_count_not_spam=get_total_document_count(not_spam['message'])
total_document_count_spam=get_total_document_count(spam['message'])
for word in words_list:
    idf_not_spam[word]=calculate_idf_for_word(word, not_spam['message'],total_document_count_not_spam)
    idf_spam[word]=calculate_idf_for_word(word, spam['message'], total_document_count_spam)
    
print("IDF for not_spam:")
print(idf_not_spam)
print("IDF for spam:")
print(idf_spam)
    


IDF for not_spam:
{'free': 4.4622717510112, 'send': 3.5565631284675816, 'are': 2.539176279722058, 'the': 1.7231309628389586, 'call': 3.050001903488049, 'reply': 4.603350349271105, 'won': 5.534908553276049}
IDF for spam:
{'free': 1.4816045409242156, 'send': 2.334716371176839, 'are': 2.237552622723191, 'the': 1.49033822089297, 'call': 0.8746785358113991, 'reply': 2.099402284242374, 'won': 2.2562447557353438}


### Task 3

Using the embedding layer of the gpt2 model, print the 15 most similar tokens (in terms of cosine similarity) to the words:
* cat
* tree

and to 2 tokens of your own choice.
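
As a reminder, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The plain-Python sketch below is illustrative only; the helper `cosine_similarity` and the sample vectors are made up for this example.

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))           # opposite direction: -1.0
```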

In [4]:
from transformers import GPT2Tokenizer, GPT2Model
import torch


tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
embedding_layer = model.wte
cos_sim = torch.nn.CosineSimilarity()



In [3]:
# Schematic view of the embedding matrix: one row (vector) per token
[
    [0.1, 0.2, 0.3],    # Ala
    [-0.5, 0.5, 0.9],   # ma
    ...,                # 50254 more rows
    [0.1, -0.1, -0.2],  # in GPT-2 each vector has 768 values, not 3
]

In [5]:
print("The text 'cat' is converted to token 9246")
print("\nTokenization")
print(tokenizer("computer"))
print("\nDetokenization")
print(tokenizer.decode([33215]))
print("\nNumber of tokens in the vocabulary")
print(len(tokenizer))

The text 'cat' is converted to token 9246

Tokenization
{'input_ids': [33215], 'attention_mask': [1]}

Detokenization
computer

Number of tokens in the vocabulary
50257


In [6]:
print("Embedding of token: 9246")
cat_embedding = embedding_layer(torch.LongTensor([9246]))
print("\nEmbedding (vector) size")
print(cat_embedding.shape)
print("\nEmbedding values")
print(cat_embedding)

Embedding of token: 9246

Embedding (vector) size
torch.Size([1, 768])

Embedding values
tensor([[-0.0164, -0.0934,  0.2425,  0.1398,  0.0388, -0.2592, -0.2724, -0.1625,
          0.1683,  0.0829,  0.0136, -0.2788,  0.1493,  0.1408,  0.0557, -0.3691,
          0.2200, -0.0428,  0.2206,  0.0865,  0.1237, -0.1499,  0.1446, -0.1150,
         -0.1425, -0.0715, -0.0526,  0.1550, -0.0678, -0.2059,  0.2065, -0.0297,
          0.0834, -0.0483,  0.1207,  0.1975, -0.3193,  0.0124,  0.1067, -0.0473,
         -0.3037,  0.1139,  0.0949, -0.2175,  0.0796, -0.0941, -0.0394, -0.0704,
          0.2033, -0.1555,  0.2928, -0.0770,  0.0787,  0.1214,  0.1528, -0.1464,
          0.4247,  0.1921, -0.0415, -0.0850, -0.2787,  0.0656, -0.2026,  0.1856,
          0.1353, -0.0820, -0.0639,  0.0701,  0.1680,  0.0597,  0.3265, -0.1100,
          0.1056,  0.1845, -0.1156,  0.0054,  0.0663,  0.1842, -0.1069,  0.0491,
         -0.0853, -0.2519,  0.0031,  0.1805,  0.1505,  0.0442, -0.2427,  0.1104,
          0.09

In [7]:
print("Similarity of an embedding with itself (should be 1)")
print(cos_sim(cat_embedding, cat_embedding))

Similarity of an embedding with itself (should be 1)
tensor([1.0000], grad_fn=<SumBackward1>)
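
The check above compares a single pair of vectors. To rank one token against the whole vocabulary, the same similarity can be computed for all rows at once. The sketch below is one possible way to do that; it uses a small random matrix as a stand-in for the embedding matrix (an assumption made so that the example runs without downloading a model), and the helper name `top_k_similar` is hypothetical. With GPT-2 one would presumably pass `embedding_layer.weight.detach()` instead.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for the embedding matrix (assumption: random values, 100 tokens x 8 dimensions)
weights = torch.randn(100, 8)

def top_k_similar(token_id, weights, k=15):
    normalized = F.normalize(weights, dim=1)   # scale every row to unit length
    sims = normalized @ normalized[token_id]   # cosine similarity to every row
    sims[token_id] = float("-inf")             # exclude the query token itself
    values, indices = torch.topk(sims, k)      # k largest, sorted in descending order
    return list(zip(indices.tolist(), values.tolist()))

# (token id, similarity) pairs for the 5 rows closest to row 3
print(top_k_similar(3, weights, k=5))
```

Normalizing the rows first turns the cosine similarity into a plain dot product, so a single matrix-vector product replaces a loop over the vocabulary.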


In [None]:
from tabulate import tabulate

def find_similar_words(word):
    word_token=tokenizer(word)
    input_word_embedding = embedding_layer(torch.LongTensor([word_token['input_ids'][0]]))
    top_15_similar=[]
    for i in range(0,tokenizer.vocab_size):
        word_embedding = embedding_layer(torch.LongTensor([i]))
        similarity = cos_sim(input_word_embedding, word_embedding)
        top_15_similar.append((i, similarity.item()))
        # keep 16, because 1st element is the same word as input
        # it is simpler to remove it later
        if(len(top_15_similar)>16):
            top_15_similar.sort(key=lambda x: x[1], reverse=True)
            top_15_similar=top_15_similar[:16] 
    top_15_similar.pop(0) # remove 1st element
    for i in range(len(top_15_similar)):
        top_15_similar[i]=[tokenizer.decode([top_15_similar[i][0]]),top_15_similar[i][1]]
    return top_15_similar

# cat
cat_result=find_similar_words("cat")

# tree
tree_result=find_similar_words("tree")

# green
green_result=find_similar_words("green")

# city
city_result=find_similar_words("city")

data_to_print=[]
for i in range(0,15):
    data_to_print.append([i+1,cat_result[i][0],
                          cat_result[i][1], tree_result[i][0],
                          tree_result[i][1], green_result[i][0],
                          green_result[i][1], city_result[i][0],
                          city_result[i][1]])

column_names=["Word", "Similarity"]
print("TOP 15 words similar to the words 'cat', 'tree', 'green' and 'city'")
print(tabulate(data_to_print, column_names*4, tablefmt="grid"))

TOP 15 words similar to the words 'cat', 'tree', 'green' and 'city'
+----+----------+----------------+---------------+----------------+---------+----------------+-----------+----------------+
|    | Word     |     Similarity | Word          |     Similarity | Word    |     Similarity | Word      |     Similarity |
|  1 | cats     |       0.694833 | Tree          |       0.70179  | Green   |       0.713976 | City      |       0.756445 |
+----+----------+----------------+---------------+----------------+---------+----------------+-----------+----------------+
|  2 | Cat      |       0.691784 | Tree          |       0.658886 | green   |       0.701008 | city      |       0.69871  |
+----+----------+----------------+---------------+----------------+---------+----------------+-----------+----------------+
|  3 | cat      |       0.603214 | tree          |       0.651134 | Green   |       0.628727 | City      |       0.640064 |
+----+----------+----------------+---------------+----------------+--