# Vector space with ML

This lab is devoted to using ML models for information retrieval and text classification.

**Searching in the curious facts database**

The facts dataset is available [here](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt); take a look. We want you to retrieve facts **relevant to the query** (whatever that means). For example, you type "good mood" and learn that Cherophobia is the fear of fun. The idea is to use document vectors. However, instead of building tf-idf vectors and reducing their dimensionality, this time we want to obtain fixed-size document vectors from an ML model.

## 1. Use neural networks to embed sentences

Use any model you like, from doc2vec up to Transformers. Provide all code, dependencies, and installation requirements.


- [USE (Universal Sentence Encoder) in spacy 2](https://spacy.io/universe/project/spacy-universal-sentence-encoder) (`!pip install spacy-universal-sentence-encoder`)
- [Sentence BERT in spacy 2](https://spacy.io/universe/project/spacy-sentence-bert) (`!pip install spacy-sentence-bert`)
- [Pretrained 🤗 Transformers](https://huggingface.co/transformers/pretrained_models.html)
- [Spacy 3 transformers](https://spacy.io/usage/embeddings-transformers#transformers-installation)
- [doc2vec pretrained](https://github.com/jhlau/doc2vec)
- [Some more sentence transformers](https://www.sbert.net/docs/quickstart.html)
- [Even fasttext can do a sentence embedding](https://fasttext.cc/docs/en/python-module.html#model-object)

Put dependency installation, download instructions, and so on here, with outputs.

In [8]:
# !pip install transformers datasets
# !pip install "tensorflow>=2.0.0"
# !pip install --upgrade tensorflow-hub
# !pip install spacy-universal-sentence-encoder

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Collecting spacy-universal-sentence-encoder
  Downloading spacy_universal_sentence_encoder-0.4.5.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: spacy-universal-sentence-encoder
  Building wheel for spacy-universal-sentence-encoder (setup.py) ... done
  Created wheel for spacy-universal-sentence-encoder: filename=spacy_universal_sentence_encoder-0.4.5-py3-none-any.whl size=15793 sha256=17b573fa6085a41904ee5c0ff173398fb16af6fc8052573a2fca05a113e5b489
  Stored in directory: /Users/artmurashko/Library/Caches/pip/wheels/78/d2/ff/4c091ddba84486e45a657748dc5596000aebbe4f6ede3aed67
Successfully built spacy-universal-sentence-encoder
Installing collected packages: spacy-universal-sentence-encoder

And then use the library to download (and load) the model.

NB: model downloading may take time (depending on the model hosting). If you think it may take a long time, ask your TA for assistance with binaries.

In [9]:
import spacy_universal_sentence_encoder
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

Downloaded https://tfhub.dev/google/universal-sentence-encoder-large/5, Total size: 577.10MB



2023-03-13 11:49:00.843587: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [11]:
import tensorflow_hub as hub
embedd = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

## 2. Write a function that prepares embedding of arbitrary queries

Write a function that returns a fixed-size embedding vector.

In [65]:
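def embed(text):
    # The model takes a batch of strings: wrap the text in a list
    # and take the single resulting row as a numpy vector.
    return embedd([text])[0].numpy()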

In [67]:
qwe = embed("Folks, here's a story about Minnie the Moocher. ")
print(qwe.shape)

(512,)


Here we check that embeddings of texts of different lengths have the same shape.

In [68]:
assert embed(
            "Some random text"
        ).shape == \
        embed(
            "Folks, here's a story about Minnie the Moocher. "
            "She was a lowdown hoochie coocher. "
            "She was the roughest, toughest frail, "
            "but Minnie had a heart as big as a whale"
        ).shape, "Shape should match"

NB: here we check DISTANCE, not similarity. Thus, similar texts should produce results close to 0.

In [69]:
from scipy.spatial.distance import cosine

assert abs(cosine(
            embed("some text for testing"), 
            embed("some text for testing")
        )) < 1e-4, "Embedding should match"

assert abs(cosine(
            embed("Cats eat mice."), 
            embed("Terminator is an autonomous cyborg, typically humanoid, originally conceived as a virtually indestructible soldier, infiltrator, and assassin.")
        )) > 0.2, "Embeddings should be far"
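
To make the distance/similarity distinction in the note above concrete, here is a minimal sketch on hand-made vectors (the vectors and helper names are illustrative, not part of the lab). It assumes the usual definition: cosine distance is one minus cosine similarity, which is also what `scipy.spatial.distance.cosine` returns.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)

# Illustrative vectors, not real embeddings
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, different length
c = np.array([3.0, -1.5, 0.0])  # orthogonal to a: 3 - 3 + 0 = 0

print("distance(a, b) =", round(cosine_distance(a, b), 6))
print("distance(a, c) =", round(cosine_distance(a, c), 6))
```

Vector length does not matter here: `b` is twice as long as `a`, yet the distance between them is (numerically) zero.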

## 3. Read the data

Now, let's read the facts dataset. Download it from the URL above and read it into a list of sentences.

In [70]:
import requests
url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
facts = requests.get(url).text.split('\n')

In [71]:
print(*facts[:5], sep='\n')

assert len(facts) == 159
assert ('our lovely little planet') in facts[0]

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.
2. McDonalds calls frequent buyers of their food "heavy users."
3. The average person spends 6 months of their lifetime waiting on a red light to turn green.
4. The largest recorded snowflake was in Keogh, MT during year 1887, and was 15 inches wide.
5. You burn more calories sleeping than you do watching television.
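
As the printed lines show, each fact carries its own `N. ` counter, and splitting on `'\n'` can leave an empty string if the file ends with a newline. A possible cleanup step is sketched below. The helper `parse_facts` is hypothetical, and whether stripping the counters improves retrieval is an assumption to verify. It could also change `len(facts)`, so the assertion above might need adjusting.

```python
import re

def parse_facts(raw_text):
    """Split raw text into non-empty lines and strip a leading 'N. ' counter."""
    facts = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. one produced by a trailing newline
        facts.append(re.sub(r"^\d+\.\s*", "", line))
    return facts

# Tiny in-memory sample instead of the downloaded file
sample = "1. Honey never spoils.\n2. Cherophobia is the fear of fun.\n\n"
clean = parse_facts(sample)
print(clean)
```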


## 4. Transform sentences to vectors

Transform the list of facts to `numpy.array` of vectors corresponding to each document (`sent_vecs`), inferring them from the model we just loaded.

In [72]:
import numpy as np
# Infer one embedding per fact and stack them into a 2-D array (one row per document)
sent_vecs = np.array([embed(text) for text in facts])

In [73]:
assert sent_vecs.shape[0] == len(facts)
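
Embedding one text per call works for 159 facts, but encoders usually accept a whole list at once. The sketch below shows a batched variant. `embed_batch` and `toy_encoder` are hypothetical names. With the model loaded above, one could presumably pass `lambda batch: embedd(batch).numpy()` as the encoder, though that is an assumption and was not run here.

```python
import numpy as np

def embed_batch(texts, encoder, batch_size=32):
    """Encode texts in chunks; encoder maps a list of strings to a 2-D array-like."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(encoder(batch)))
    # Stack per-batch results into one (n_texts, dim) array
    return np.vstack(chunks)

def toy_encoder(batch):
    # Stand-in for a real model: [character count, word count] per text
    return np.array([[len(t), len(t.split())] for t in batch], dtype=float)

vecs = embed_batch(["a b", "hello", "one two three"], toy_encoder, batch_size=2)
print(vecs.shape)  # (3, 2)
```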

## 5. Find closest to the query

Now find the 5 facts closest to the query using the cosine measure.

### 5.1. Closest search

In [74]:
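def find_k_closest(query, dataset, k=10):
    # Dot-product scores; these equal cosine similarity only if all vectors are unit-length.
    # argsort is ascending, so the k indices come back worst-first (the best match is last).
    return np.argsort(dataset @ query)[-k:]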
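
The function above ranks by raw dot product, which matches the cosine measure only for unit-length vectors. If the embeddings are not guaranteed to be normalized, an explicit cosine ranking could look like the sketch below. `top_k_cosine` is a hypothetical helper that returns indices best-first, and the toy vectors are chosen so that the two rankings disagree.

```python
import numpy as np

def top_k_cosine(query, dataset, k=5):
    """Return indices (best first) and scores of the k rows most similar to query."""
    # Assumes no zero vectors, otherwise the normalization divides by zero
    q = query / np.linalg.norm(query)
    d = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)[:k]  # negate to sort in descending order
    return order, sims[order]

docs = np.array([[1.0, 0.0], [10.0, 10.0], [0.0, 2.0], [-1.0, 0.0]])
query = np.array([3.0, 0.0])

idx, scores = top_k_cosine(query, docs, k=2)
print(idx, scores)

# Raw dot products favor the long vector [10, 10] over the perfectly aligned [1, 0]
print(docs @ query)
```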

### 5.2. Use your function

In [75]:
query = "good mood"
query_vec = embed(query)

print("Results for query:", query)
print()
# Indices come back in ascending order of score, so the best match is printed last
for k in find_k_closest(query_vec, sent_vecs, 5):
    print("\t", facts[k])

Results for query: good mood

	 44. Honey never spoils.
	 45. About half of all Americans are on a diet on any given day.
	 98. Blue-eyed people tend to have the highest tolerance of alcohol.
	 68. Cherophobia is the fear of fun.
	 57. Gorillas burp when they are happy


## 6. Measure DCG@5 for the following query bucket
```
good mood
gorilla
woman
earth
japan
people
math
```

Recommend 5 facts for each of the queries. Write your code below.

In [82]:
bucket = """good mood
gorilla
woman
earth
japan
people
math""".split('\n')

for term in bucket:
    print("\n")
    print(term)
    for k in find_k_closest(embed(term), sent_vecs, k=5)[::-1]:
        print("\t", facts[k])



good mood
	 57. Gorillas burp when they are happy
	 68. Cherophobia is the fear of fun.
	 98. Blue-eyed people tend to have the highest tolerance of alcohol.
	 45. About half of all Americans are on a diet on any given day.
	 44. Honey never spoils.


gorilla
	 55. The word "gorilla" is derived from a Greek word meaning, "A tribe of hairy women."
	 57. Gorillas burp when they are happy
	 137. Human birth control pills work on gorillas.
	 106. The male ostrich can roar just like a lion.
	 85. The elephant is the only mammal that can't jump!


woman
	 151. Women have twice as many pain receptors on their body than men. But a much higher pain tolerance.
	 16. Men are 6 times more likely to be struck by lightning than women.
	 116. Male dogs lift their legs when they are urinating for a reason. They are trying to leave their mark higher so that it gives off the message that they are tall and intimidating.
	 55. The word "gorilla" is derived from a Greek word meaning, "A tribe of hairy wo

## 7. Write your own relevance assessments and compute DCG@5

In [15]:
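from math import log

# One row per query, one 0/1 relevance label per returned fact (in ranked order).
# Replace the `...` placeholder with the rows for the remaining queries before running.
assessments = [
    [1, 0, 0, 0, 0], # good mood
    [1, 1, 1, 0, 0], # gorilla
    [0, 0, 0, 0, 0], # ...
    ...
]

def dcg(rels):
    # The result at 0-based position i is discounted by log2(i + 2)
    s = 0
    for i, rel in enumerate(rels):
        s += rel / log(1 + i + 1, 2)
    return s

# Assumption: dcg5 is the mean DCG over all queries,
# and idcg5 is the DCG of an ideal list of five relevant results.
dcg5 = sum(dcg(rels) for rels in assessments) / len(assessments)
idcg5 = dcg([1] * 5)

print(f"DCG@5 = {dcg5:.4f}")
print(f"IDCG@5 = {idcg5:.4f}")
print(f"nDCG@5 = {dcg5 / idcg5:.4f}")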

DCG@5 = 1.5029
IDCG@5 = 2.9485
nDCG@5 = 0.5097
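
For reference, a self-contained sketch of DCG@k and a per-query nDCG@k follows. The function names and the example labels are hypothetical. Note one difference in convention: this sketch builds the ideal ranking by re-sorting the labels of the retrieved list, whereas the IDCG@5 value printed above matches the DCG of five relevant results (about 2.9485).

```python
from math import log2

def dcg_at_k(rels, k=5):
    """Discounted cumulative gain: the result at rank r contributes rel / log2(r + 1)."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k=5):
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

example = [0, 1, 1, 0, 0]  # hypothetical labels: relevant results at ranks 2 and 3

print(f"DCG@5  = {dcg_at_k(example):.4f}")
print(f"nDCG@5 = {ndcg_at_k(example):.4f}")
print(f"DCG@5 of five relevant results = {dcg_at_k([1] * 5):.4f}")
```

Dividing by the ideal DCG keeps scores comparable across queries that have different numbers of relevant results.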
