## Deep learning – text processing – labs
# 1. TF–IDF

In [1]:
import numpy as np
import re

## Document collection

In [2]:
documents = ['Ala lubi zwierzęta i ma kota oraz psa!',
             'Ola lubi zwierzęta oraz ma kota a także chomika!',
             'I Jan jeździ na rowerze.',
             '2 wojna światowa była wielkim konfliktem zbrojnym',
             'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.',
            ]

What do we need?

- We want to turn each text into a collection of words.

### ❔ Questions

- Can we tokenize the text with `document.split(' ')`?
- What difficulties might we run into?
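
As a hint for the questions above, here is a small sketch of what a plain `split(' ')` does to one of the sample sentences. The regular-expression tokenizer shown next to it is only one possible alternative and is not the approach used later in this notebook.

```python
import re

text = 'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

# Naive split: punctuation stays glued to words ('psy,', 'rowerze.')
# and the double space produces an empty token ('').
naive = text.split(' ')
print(naive)

# One possible alternative: lowercase the text and keep runs of word characters.
tokens = re.findall(r'\w+', text.lower())
print(tokens)
```

The naive version also keeps the capital letter in `Tomek`, so the same word could later end up as two different vocabulary entries.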

## Preprocessing

In [3]:
def get_str_cleaned(str_dirty):
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    new_str = str_dirty.lower()
    for char in punctuation:
        new_str = new_str.replace(char,'')
    # Collapse spaces only after removing punctuation, which can itself leave double spaces.
    new_str = re.sub(' +', ' ', new_str).strip()
    return new_str

In [4]:
sample_document = get_str_cleaned(documents[0])

In [5]:
sample_document

'ala lubi zwierzęta i ma kota oraz psa'

## Tokenization

In [6]:
def tokenize_str(document):
    return document.split(' ')

In [7]:
tokenize_str(sample_document)

['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']

In [8]:
documents_cleaned = [get_str_cleaned(document) for document in documents]

In [9]:
documents_cleaned

['ala lubi zwierzęta i ma kota oraz psa',
 'ola lubi zwierzęta oraz ma kota a także chomika',
 'i jan jeździ na rowerze',
 '2 wojna światowa była wielkim konfliktem zbrojnym',
 'tomek lubi psy ma psa i jeździ na motorze i rowerze']

In [10]:
documents_tokenized = [tokenize_str(d) for d in documents_cleaned]

In [11]:
documents_tokenized

[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],
 ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],
 ['i', 'jan', 'jeździ', 'na', 'rowerze'],
 ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],
 ['tomek',
  'lubi',
  'psy',
  'ma',
  'psa',
  'i',
  'jeździ',
  'na',
  'motorze',
  'i',
  'rowerze']]

### ❔ Questions

- What is the next step towards building TF or TF–IDF vectors?
- What size will a TF or TF–IDF vector have?

## Building the vocabulary

In [12]:
vocabulary = []
for document in documents_tokenized:
    for word in document:
        vocabulary.append(word)
vocabulary = sorted(set(vocabulary))

In [13]:
vocabulary

['2',
 'a',
 'ala',
 'była',
 'chomika',
 'i',
 'jan',
 'jeździ',
 'konfliktem',
 'kota',
 'lubi',
 'ma',
 'motorze',
 'na',
 'ola',
 'oraz',
 'psa',
 'psy',
 'rowerze',
 'także',
 'tomek',
 'wielkim',
 'wojna',
 'zbrojnym',
 'zwierzęta',
 'światowa']

## 📝 Task **1.1** *(1 pt)*

Write a function `word_to_index(word: str)` that returns the one-hot vector for a given word as a `numpy.array`.

Assume the vocabulary is available as the global variable `vocabulary`.

In [23]:
def word_to_index(word: str) -> np.ndarray:
    return np.eye(len(vocabulary))[vocabulary.index(word)]

In [24]:
word_to_index('psa')

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.])
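
The solution above builds a full identity matrix and scans the vocabulary list on every call. A possible alternative, sketched here under the assumption of a small toy vocabulary (`toy_vocabulary`, `word_index` and `one_hot` are hypothetical names, not part of the notebook), precomputes a word-to-position dictionary once:

```python
import numpy as np

# Toy vocabulary, an assumption for this sketch; in the notebook it would be `vocabulary`.
toy_vocabulary = ['kota', 'lubi', 'ma', 'psa']

# Map each word to its position once, so lookups no longer scan the list.
word_index = {word: i for i, word in enumerate(toy_vocabulary)}

def one_hot(word: str) -> np.ndarray:
    vector = np.zeros(len(word_index))
    vector[word_index[word]] = 1.0  # raises KeyError for out-of-vocabulary words
    return vector

print(one_hot('psa'))
```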

## 📝 Task **1.2** *(1 pt)*

Write a function that converts a list of words into a TF vector.

In [41]:
def tf(document: list) -> np.ndarray:
    # Count occurrences of each vocabulary word; words outside the vocabulary are ignored.
    vector = np.zeros(len(vocabulary))
    for word in document:
        if word in vocabulary:
            vector[vocabulary.index(word)] += 1
    return vector


In [42]:
tf(documents_tokenized[0])

array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0.])

In [43]:
documents_vectorized = list()
for document in documents_tokenized:
    document_vector = tf(document)
    documents_vectorized.append(document_vector)

In [44]:
documents_vectorized

[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
        0., 0., 0., 0., 0., 0., 0., 1., 0.]),
 array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0.]),
 array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0.]),
 array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 1., 1., 0., 1.]),
 array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,
        1., 1., 0., 1., 0., 0., 0., 0., 0.])]

## IDF

In [45]:
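# Inverse document frequency without the usual logarithm: N / df(t),
# where df(t) is the number of documents that contain term t.
idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0, axis=0)
display(idf)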

array([5.        , 5.        , 5.        , 5.        , 5.        ,
       1.66666667, 5.        , 2.5       , 5.        , 2.5       ,
       1.66666667, 1.66666667, 5.        , 2.5       , 5.        ,
       2.5       , 2.5       , 5.        , 2.5       , 5.        ,
       5.        , 5.        , 5.        , 5.        , 2.5       ,
       5.        ])
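
The values above follow from the ratio computed in the cell, with $N = 5$ documents and $\mathrm{df}(t)$ the number of documents containing term $t$. As a worked check for two terms: `i` occurs in three documents and `chomika` in one.

$$
\mathrm{idf}(t) = \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{idf}(\text{i}) = \frac{5}{3} \approx 1.667, \qquad
\mathrm{idf}(\text{chomika}) = \frac{5}{1} = 5
$$

Note that many references define IDF with a logarithm, for example $\log\frac{N}{\mathrm{df}(t)}$; the cell above uses the plain ratio.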

## 📝 Task **1.3** *(1 pt)*

Write a function that returns the cosine similarity between two documents in vectorized form.

In [46]:
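def similarity(query: np.ndarray, document: np.ndarray) -> float:
    # An all-zero vector has no direction, so return 0.0 instead of dividing by zero.
    norms = np.linalg.norm(query) * np.linalg.norm(document)
    return float(np.dot(query, document) / norms) if norms else 0.0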

In [47]:
documents[0]

'Ala lubi zwierzęta i ma kota oraz psa!'

In [48]:
documents_vectorized[0]

array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0.])

In [49]:
documents[1]

'Ola lubi zwierzęta oraz ma kota a także chomika!'

In [50]:
documents_vectorized[1]

array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 1., 0.])

In [51]:
similarity(documents_vectorized[0], documents_vectorized[1])

0.5892556509887895
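
This value can be checked by hand. Both vectors contain only zeros and ones, the two documents share five words (`lubi`, `zwierzęta`, `ma`, `kota`, `oraz`), and they contain 8 and 9 distinct words respectively:

$$
\cos(d_0, d_1) = \frac{d_0 \cdot d_1}{\lVert d_0 \rVert \, \lVert d_1 \rVert}
= \frac{5}{\sqrt{8} \cdot \sqrt{9}} = \frac{5}{6\sqrt{2}} \approx 0.5893
$$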

## A simple search engine

In [52]:
def transform_query(query):
    """Clean and tokenize the query, then convert it to a TF vector."""
    query_vector = tf(tokenize_str(get_str_cleaned(query)))
    return query_vector

In [53]:
similarity(transform_query('psa kota'), documents_vectorized[0])

0.4999999999999999

In [54]:
query = 'psa kota'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

0.4999999999999999

'Ola lubi zwierzęta oraz ma kota a także chomika!'

0.2357022603955158

'I Jan jeździ na rowerze.'

0.0

'2 wojna światowa była wielkim konfliktem zbrojnym'

0.0

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

0.19611613513818402

In [55]:
# this is why we need the denominator in cosine similarity
query = 'rowerze'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

0.0

'Ola lubi zwierzęta oraz ma kota a także chomika!'

0.0

'I Jan jeździ na rowerze.'

0.4472135954999579

'2 wojna światowa była wielkim konfliktem zbrojnym'

0.0

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

0.2773500981126146

In [56]:
# this is why we need term frequency: more occurrences mean a better-matching document
query = 'i'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

0.35355339059327373

'Ola lubi zwierzęta oraz ma kota a także chomika!'

0.0

'I Jan jeździ na rowerze.'

0.4472135954999579

'2 wojna światowa była wielkim konfliktem zbrojnym'

0.0

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

0.5547001962252291

In [57]:
# this is why we need IDF: so that more important words get a larger weight
query = 'i chomika'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

0.24999999999999994

'Ola lubi zwierzęta oraz ma kota a także chomika!'

0.2357022603955158

'I Jan jeździ na rowerze.'

0.31622776601683794

'2 wojna światowa była wielkim konfliktem zbrojnym'

0.0

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

0.39223227027636803
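
The ranking above still uses raw TF vectors, so the frequent word `i` outweighs the rare word `chomika`, and the only document that mentions `chomika` is not ranked first. The following self-contained sketch shows how weighting by IDF could change this. It assumes the same five documents and the same log-free ratio N / df(t) as the IDF cell. The helper names (`tokens_of`, `tf_vector`, `cosine`) and the regex-based cleaning are simplifications introduced for this example, not the notebook's own functions.

```python
import re
import numpy as np

documents = ['Ala lubi zwierzęta i ma kota oraz psa!',
             'Ola lubi zwierzęta oraz ma kota a także chomika!',
             'I Jan jeździ na rowerze.',
             '2 wojna światowa była wielkim konfliktem zbrojnym',
             'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.']

def tokens_of(text):
    # Lowercase, drop punctuation, then split on runs of whitespace.
    return re.sub(r'[^\w\s]', '', text.lower()).split()

tokenized = [tokens_of(d) for d in documents]
vocab = sorted({w for doc in tokenized for w in doc})
position = {w: i for i, w in enumerate(vocab)}

def tf_vector(tokens):
    vector = np.zeros(len(vocab))
    for token in tokens:
        if token in position:  # ignore out-of-vocabulary words
            vector[position[token]] += 1
    return vector

def cosine(a, b):
    norms = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norms) if norms else 0.0

tf_matrix = np.array([tf_vector(doc) for doc in tokenized])
# N / df(t), without a logarithm, as in the IDF cell of this notebook.
idf_weights = len(documents) / np.count_nonzero(tf_matrix, axis=0)

query_tf = tf_vector(tokens_of('i chomika'))
tf_scores = [cosine(query_tf, row) for row in tf_matrix]
tfidf_scores = [cosine(query_tf * idf_weights, row * idf_weights) for row in tf_matrix]

print('best by TF:    ', documents[int(np.argmax(tf_scores))])
print('best by TF–IDF:', documents[int(np.argmax(tfidf_scores))])
```

With TF alone the best match is the sentence about Tomek, which contains `i` twice; with the IDF weights the sentence about Ola, the only one containing `chomika`, comes first.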

## Libraries

In [58]:
import numpy as np
import sklearn.metrics

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import TfidfVectorizer

In [59]:
newsgroups = fetch_20newsgroups()['data']

In [60]:
len(newsgroups)

11314

In [61]:
print(newsgroups[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### Naive search

In [62]:
all_documents = list() 
for document in newsgroups:
    if 'car' in document:
        all_documents.append(document)

In [63]:
print(all_documents[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [64]:
print(all_documents[1])

From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>



#### ❔ Question

What are the problems with this approach?
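
As a hint, the second match above does not mention a car at all: the substring `car` occurs inside `carson` and `cards`. The sketch below reproduces this on short excerpts and contrasts it with a whole-word, case-insensitive match. `mentions_word` is a hypothetical helper introduced for this example.

```python
import re

header = 'From: guykuo@carson.u.washington.edu (Guy Kuo)'
body = 'add on cards and adapters, heat sinks'
shouting = 'CAR FOR SALE'

# Substring test: matches inside longer words and is case-sensitive.
print('car' in header)    # True, because of "carson"
print('car' in body)      # True, because of "cards"
print('car' in shouting)  # False, because the case differs

def mentions_word(word, text):
    # Whole-word, case-insensitive match.
    return re.search(rf'\b{re.escape(word)}\b', text, flags=re.IGNORECASE) is not None

print(mentions_word('car', header))    # False
print(mentions_word('car', shouting))  # True
```

Even with whole-word matching, the result is an unordered set: there is no score that says which document matches best.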

### TF–IDF and cosine similarity

In [65]:
vectorizer = TfidfVectorizer()

In [66]:
document_vectors = vectorizer.fit_transform(newsgroups)

In [67]:
document_vectors

<11314x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 1787565 stored elements in Compressed Sparse Row format>

In [68]:
document_vectors[0]

<1x130107 sparse matrix of type '<class 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>

In [69]:
document_vectors[0].todense()

matrix([[0., 0., 0., ..., 0., 0., 0.]])

In [70]:
document_vectors[0:4].todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
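
With its default settings, `TfidfVectorizer` L2-normalizes every row (`norm='l2'`), so the cosine similarity between a query and a document reduces to a plain dot product of their sparse rows. The sketch below checks this on a made-up three-document corpus; the `toy_*` names are hypothetical and independent of the newsgroups data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus for illustration only.
toy_corpus = [
    'the car has a fast engine',
    'clock speed of the processor',
    'top speed of the car',
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_corpus)  # sparse matrix, one row per document
toy_query = toy_vectorizer.transform(['speed car'])

# Every row already has unit Euclidean length.
row_norms = np.asarray(np.sqrt(toy_vectors.multiply(toy_vectors).sum(axis=1))).ravel()
print(row_norms)

# Cosine similarity and the sparse dot product give the same scores.
cos = cosine_similarity(toy_query, toy_vectors)
dots = (toy_query @ toy_vectors.T).toarray()
print(cos)
print(dots)
```

The third document contains both query words, so it receives the highest score in this toy setup.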

In [71]:
query_str = 'speed'
#query_str = 'speed car'
#query_str = 'spider man'

In [72]:
query_vector = vectorizer.transform([query_str])
similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, document_vectors)
print(np.sort(similarities)[0][-4:])
print(similarities.argsort()[0][-4:])

# Sort once, then print the four best matches, best first.
ranking = similarities.argsort()[0]
for i in range(1, 5):
    print(newsgroups[ranking[-i]])
    print(similarities[0, ranking[-i]])
    print('-'*100)
    print('-'*100)
    print('-'*100)

[0.26949927 0.3491801  0.44292083 0.47784165]
[4517 5509 2116 9921]
From: ray@netcom.com (Ray Fischer)
Subject: Re: x86 ~= 680x0 ??  (How do they compare?)
Organization: Netcom. San Jose, California
Distribution: usa
Lines: 36

dhk@ubbpc.uucp (Dave Kitabjian) writes ...
>I'm sure Intel and Motorola are competing neck-and-neck for 
>crunch-power, but for a given clock speed, how do we rank the
>following (from 1st to 6th):
>  486		68040
>  386		68030
>  286		68020

040 486 030 386 020 286

>While you're at it, where will the following fit into the list:
>  68060
>  Pentium
>  PowerPC

060 fastest, then Pentium, with the first versions of the PowerPC
somewhere in the vicinity.

>And about clock speed:  Does doubling the clock speed double the
>overall processor speed?  And fill in the __'s below:
>  68030 @ __ MHz = 68040 @ __ MHz

No.  Computer speed is only partly dependent of processor/clock speed.
Memory system speed play a large role as does video system speed and
I/O speed.  As pro

## 📝 Task **1.4** *(4 pts)*

Choose a text collection with at least 10,000 documents (different from the one used in this example).
Based on it, build a search engine that uses TF–IDF and cosine similarity to assess document similarity. The search engine should return several of the best-matching documents, sorted, together with their scores.

In [121]:
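def search(corpus: list, corpus_vectors, query: str, n: int = 4):
    # corpus_vectors is the sparse TF–IDF matrix; relies on the global `vectorizer` fitted on `corpus`.
    query_vector = vectorizer.transform([query])
    similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, corpus_vectors)

    # Sort once, then print the n best matches, best first.
    ranking = similarities.argsort()[0]
    for i in range(1, n + 1):
        print(corpus[ranking[-i]])
        print(similarities[0, ranking[-i]])
        print('-'*100)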

In [122]:
# Data from https://opus.nlpl.eu/ELRC-3855-SWPS_University_Soci/pl&en/v1/ELRC-3855-SWPS_University_Soci
with open("en.txt", encoding="utf8") as text_file:
    corpus = [line for line in text_file]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)

search(corpus, corpus_vectors, 'University', 5)

The SWPS University has entered into cooperation with Marshall University (MU), University of Debrecen (DU), California State University Stanislaus (CSUS) and Bangor University (BU).

0.47042171958188805
----------------------------------------------------------------------------------------------------
 University Strategy,

0.41675248793039
----------------------------------------------------------------------------------------------------
 University Statute,

0.39494577452264923
----------------------------------------------------------------------------------------------------
Current practices of the SWPS University The University has established a Scientific Research Office (Biuro ds.

0.3106022718147999
----------------------------------------------------------------------------------------------------
Current practices of the SWPS University

0.2983536024169672
----------------------------------------------------------------------------------------------------
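
The `search` function above prints its results. If the ranked documents are needed programmatically, the ranking step could be separated into a helper that returns indices with scores. The sketch below is one possible way to do this with NumPy only; `top_k` and the example scores are made up for illustration.

```python
import numpy as np

def top_k(scores, k=4):
    """Return (index, score) pairs for the k highest scores, best first."""
    scores = np.asarray(scores, dtype=float).ravel()
    # Negate to sort in descending order; a stable sort keeps ties in original order.
    order = np.argsort(-scores, kind='stable')[:k]
    return [(int(i), float(scores[i])) for i in order]

# Made-up similarity scores for five documents, standing in for one row of cosine similarities.
example_scores = [0.10, 0.47, 0.00, 0.31, 0.42]
for index, score in top_k(example_scores, k=3):
    print(index, score)
```

The returned indices can then be used to look up the matching documents in the corpus list.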
