Notes:
Calculate similarity between
- queries and docs
- docs and docs
- queries and queries
Notes:
- Use case for queries-docs?
- Use case for doc-doc?
- Use case for query-query?
- Queries / docs are vectors
- Term-document Matrix → document vector
Notes:
- How can a document be a vector?
| Term | #1 | #2 | #3 |
|---|---|---|---|
| Book | 10 | 3 | 1 |
| Information | 5 | 2 | 2 |
- $\vec{V}(\#1) = (10, 5)$
- $\vec{V}(\#2) = (3, 2)$
- $\vec{V}(\#3) = (1, 2)$
Notes:
- What are the document vectors for #1, #2, #3?
$\vec{V}(\#1) = (10, 5)$ $\vec{V}(\#2) = (3, 2)$ $\vec{V}(\#3) = (1, 2)$
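As a quick sketch (using numpy purely for illustration), the document vectors are simply the columns of the term-document matrix:

```python
import numpy as np

# Term-document matrix from the slide: rows = terms, columns = docs #1-#3
td_matrix = np.array([
    [10, 3, 1],   # book
    [ 5, 2, 2],   # information
])

# Each column is one document vector
for i in range(td_matrix.shape[1]):
    print(f"V(#{i + 1}) = {td_matrix[:, i].tolist()}")
# V(#1) = [10, 5]   V(#2) = [3, 2]   V(#3) = [1, 2]
```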
Notes:
- Are the docs similar?
Notes:
- The documents look similar, but their vector distance is large
- Vector distance does not account for document length
- We need a measure that is independent of document length
$\vec{V}(\#1) = (10, 5)$ $\vec{V}(\#2) = (3, 2)$ $\vec{V}(\#3) = (1, 2)$
Notes:
- 1.0 means equality (same vector direction)
- 0.0 means maximum difference (90° between vectors)
Notes:
- Why can't there be more than 90° difference?
$$\begin{aligned} \textrm{sim}(\#1, \#2) & = \frac{ \begin{pmatrix}10 \\ 5\end{pmatrix} \cdot \begin{pmatrix}3 \\ 2\end{pmatrix} }{ \left|\begin{pmatrix}10 \\ 5\end{pmatrix}\right| \left|\begin{pmatrix}3 \\ 2\end{pmatrix}\right| } \\ & = \frac{30 + 10}{\sqrt{10^2 + 5^2} \sqrt{3^2 + 2^2}} \\ & = \frac{40}{\sqrt{125} \sqrt{13}} = \frac{40}{40.31} \\ & \approx 0.99 \\ \\ \textrm{sim}(\#2, \#3) & = \frac{3 + 4}{\sqrt{13} \sqrt{5}} = \frac{7}{8.06} \approx 0.87 \end{aligned}$$
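The same computation as a runnable numpy sketch (the values match the derivation above):

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1, d2, d3 = np.array([10, 5]), np.array([3, 2]), np.array([1, 2])

print(round(cosine_sim(d1, d2), 2))  # 0.99
print(round(cosine_sim(d2, d3), 2))  # 0.87
```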
- Vocabulary:
[book, information]
- Query:
book
- $\vec{V}(q) = \begin{pmatrix}1 \\ 0\end{pmatrix}$
Notes:
- What does the query vector look like?
Notes:
- Where to draw the query vector?
Notes:
- Which doc is most similar to the query?
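A minimal sketch answering the note's question: score the one-hot query vector against each document vector and take the best match.

```python
import numpy as np

# Vocabulary [book, information]; query "book" -> one-hot vector (1, 0)
q = np.array([1, 0])
docs = {"#1": np.array([10, 5]), "#2": np.array([3, 2]), "#3": np.array([1, 2])}

sims = {name: float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        for name, d in docs.items()}
print(sims)                      # {'#1': 0.894..., '#2': 0.832..., '#3': 0.447...}
print(max(sims, key=sims.get))   # #1 -- its direction is closest to the 'book' axis
```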
Notes:
We should use that instead of keyword search!
Words are represented as one-hot vectors:
- One vector component is 1, all others are 0.
- This takes up a lot of space.
Semantically similar words have completely different vectors.
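A short illustration of this drawback (toy vocabulary, chosen for the example): related words like "book" and "novel" get orthogonal one-hot vectors, so their cosine similarity is exactly 0.

```python
import numpy as np

# Toy vocabulary; each word is a one-hot vector over the whole vocabulary
vocab = ["book", "novel", "information"]
one_hot = {w: np.eye(len(vocab), dtype=int)[i] for i, w in enumerate(vocab)}

# "book" and "novel" are semantically close, but their vectors never overlap
print(np.dot(one_hot["book"], one_hot["novel"]))  # 0 -> cosine similarity is 0
```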
Slow to compute:
- The query vector needs to be compared with every document vector.
Notes:
- Why is it slow to compute?
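A toy illustration of the cost (sizes invented for the example): every query is a full linear scan, i.e. O(N·d) work for N documents and vocabulary size d.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 1_000                  # toy sizes; real collections are far larger
docs = rng.random((N, d), dtype=np.float32)
q = rng.random(d, dtype=np.float32)

scores = docs @ q                     # one dot product per document, no index
print(int(np.argmax(scores)))         # best match only after touching all N docs
```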
- Compress one-hot vectors into dense vectors with fewer dimensions.
- Compute similar vectors for semantically related words.
- Build a vector index and search for approximate nearest neighbors (see sketch below).
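A minimal sketch of the dense-vector idea. The 3-dimensional embeddings below are invented for illustration; real systems learn them (e.g. word2vec or transformer encoders) and replace the exact comparison with an approximate nearest-neighbor index (e.g. FAISS, HNSW).

```python
import numpy as np

# Hypothetical dense embeddings; real ones are learned and have hundreds of dims
embeddings = {
    "book":        np.array([0.90, 0.10, 0.05]),
    "novel":       np.array([0.85, 0.15, 0.10]),  # close to "book" by construction
    "information": np.array([0.10, 0.90, 0.20]),
}

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words now have similar vectors -> high cosine similarity
print(round(cosine_sim(embeddings["book"], embeddings["novel"]), 3))        # 0.996
print(round(cosine_sim(embeddings["book"], embeddings["information"]), 3))  # 0.226
```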