# Introduction to Vectors

## What are vectors?

A vector is a quantity that has both magnitude and direction. A vector can be represented as an arrow (a directed line segment) whose length equals the magnitude and whose direction is given by the angle that the arrow makes with the positive x-axis.

![vector](2d-vector-2.png)

In the image above, vector AB or vector v may be represented as $\overrightarrow{AB}$ = $\vec v$ = $3\mathbf{i} + 3\mathbf{j}$ or in **component form** as $\overrightarrow{AB}$ = $\vec v$ = (3,3). In calculus textbooks, the component form of vectors is usually denoted as $\vec v = \langle 3,3 \rangle$.

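The magnitude of the vector $\vec v$ = (a,b) is $\|\vec v\|$ = $\sqrt {a^2 + b^2}$ and the direction of vector $\vec v$ is the angle $\theta$ that it makes with the positive x-axis, where $\tan\theta = \frac{b}{a}$ (for $a \neq 0$).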

So, in n-dimensional space, vector $\vec v$ = ($a_1,a_2,...,a_n$) will have magnitude $\|\vec v\|$ = $\sqrt {a_1^2 + a_2^2 + ... +a_n^2}$
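As a minimal sketch of the formulas above, the magnitude and direction of the vector (3,3) from the image can be computed with NumPy. The 4-dimensional vector `v_nd` is a made-up example, chosen only because its magnitude is a whole number.

```python
import numpy as np

# The 2-dimensional vector from the image above
v = np.array([3, 3])

# Magnitude: square root of the sum of squared components
magnitude = float(np.sqrt(np.sum(v ** 2)))

# Direction: angle with the positive x-axis, in degrees.
# arctan2 takes (y, x) and handles all four quadrants, including a = 0.
angle_deg = float(np.degrees(np.arctan2(v[1], v[0])))

print(f"Magnitude of {v} is {magnitude:.4f}, direction is {angle_deg:.1f} degrees")

# The same magnitude formula in n dimensions (hypothetical 4-dimensional vector)
v_nd = np.array([1, 2, 2, 4])
magnitude_nd = float(np.linalg.norm(v_nd))  # sqrt(1 + 4 + 4 + 16)
print(f"Magnitude of {v_nd} is {magnitude_nd}")
```

`np.linalg.norm` computes the same square root of the sum of squares, so either form may be used.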

**NOTE:** In machine learning, high-dimensional vector representations of data are called **embeddings**.

In [None]:
!pip install numpy

In [None]:
# NumPy is a Python library for numerical and scientific computing
import numpy as np

# 2-dimensional vectors
v_2d_1 = np.array([1,5])
v_2d_2 = np.array([2,4])
v_2d_3 = np.array([3,2])
print(f"Examples of 2-dimensional vectors are {v_2d_1}, {v_2d_2}, {v_2d_3}")

# 3-dimensional vectors
v_3d_1 = np.array([1,5,6])
v_3d_2 = np.array([2,4,8])
v_3d_3 = np.array([3,2,5])
print(f"Examples of 3-dimensional vectors are {v_3d_1}, {v_3d_2}, {v_3d_3}")

## Vector similarity

Vector similarity is a measure of the closeness of vectors. While there are several metrics for vector similarity, given below are some popular metrics.

### Euclidean distance

The Euclidean distance is the straight-line (shortest) distance between two points in n-dimensional space.

If **p** and **q** are two points in **n**-dimensional space, then the Euclidean distance **d** between them is given by $d(p,q) = \sqrt{\sum_{i=1}^n(q_i - p_i)^2}$.

In [None]:
# Calculating Euclidean distance between points
for v in v_2d_2, v_2d_3:
    # Using mathematical formula above
    print(f"Euclidean distance between {v_2d_1} and {v} using the formula is {np.sqrt(np.sum(np.square(v_2d_1 - v)))}")

    # using linalg.norm()
    print(f"Euclidean distance between {v_2d_1} and {v} using numpy.linalg.norm is {np.linalg.norm(v_2d_1 - v)}")

If we plot the Cartesian coordinates (1,5), (2,4) and (3,2), we'll see that point (1,5) is closer to point (2,4) than it is to (3,2). This is confirmed by the Euclidean distances between the points as calculated by the code above.
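Worked by hand with the distance formula, the two distances for these points come out as follows:

$$d\big((1,5),(2,4)\big) = \sqrt{(2-1)^2 + (4-5)^2} = \sqrt{2} \approx 1.414$$

$$d\big((1,5),(3,2)\big) = \sqrt{(3-1)^2 + (2-5)^2} = \sqrt{13} \approx 3.606$$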

<div style="font-size: 18px;color:blue">
     The <b>smaller</b> the Euclidean distance between vectors, the closer (more similar) the vectors.
</div>

### Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors. This metric captures the similarity of vectors based on their direction, regardless of their magnitude, and is typically used when measuring the similarity of text.

If $\vec p$ and $\vec q$ are two vectors in n-dimensional space, then the cosine similarity ($\cos\theta$) between them is given by $\cos\theta = \frac{\vec p \cdot \vec q}{\|\vec p\| \|\vec q\|}$.

In [None]:
# Calculating cosine similarity between vectors
for v in v_2d_2, v_2d_3:
    # Using mathematical formula above
    print(f"Cosine similarity between {v_2d_1} and {v} is {np.dot(v_2d_1,v)/(np.linalg.norm(v_2d_1)*np.linalg.norm(v))}")

If we plot the cartesian coordinates (1,5), (2,4) and (3,2), we'll see that the angle $\theta$ between the vectors represented by points (1,5) and (2,4) is smaller than that between the vectors represented by points (1,5) and (3,2). The smaller the angle, the larger the cosine of the angle.
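To make the angles concrete, here is a small sketch that recovers the angle itself from the cosine similarity. The helper name `angle_between_deg` is an illustrative choice, not part of NumPy.

```python
import numpy as np

def angle_between_deg(p, q):
    """Angle between vectors p and q in degrees, derived from cosine similarity."""
    cos_theta = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    # Clipping guards against tiny floating-point overshoot outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

p = np.array([1, 5])
angle_near = angle_between_deg(p, np.array([2, 4]))
angle_far = angle_between_deg(p, np.array([3, 2]))

print(f"Angle between (1,5) and (2,4): {angle_near:.2f} degrees")
print(f"Angle between (1,5) and (3,2): {angle_far:.2f} degrees")
```

The first angle is roughly 15 degrees and the second is 45 degrees, which matches the ordering of the cosine similarities.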

<div style="font-size: 18px;color:blue">
The <b>larger</b> the cosine similarity between vectors, the closer (more similar) the vectors.
</div>

### Dot product

The dot product or inner product of two vectors is the sum of the products of the vectors' corresponding components. For two vectors $\vec p$ and $\vec q$, the dot product is $\vec p \cdot \vec q$ = $\sum_{i=1}^n p_i q_i$.
The dot product of two vectors is also the product of their magnitudes and the cosine of the angle between them, and is given by $\vec p \cdot \vec q$ = $\|\vec p \| \|\vec q \| \cos \theta$.


In [None]:
# Calculating dot product of two vectors
for v in v_2d_2, v_2d_3:
    # Using NumPy library
    print(f"Dot product of {v_2d_1} and {v} using NumPy is {np.dot(v_2d_1,v)}")
    # Using sum formula
    print(f"Dot product of {v_2d_1} and {v} using sum formula is {np.sum(v_2d_1 * v)}")  

<div style="font-size: 18px;color:blue">
The <b>larger</b> the dot product of vectors of comparable magnitude, the closer (more similar) the vectors.
</div>
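Because the dot product includes the magnitudes of both vectors, it grows when a vector is scaled even though its direction is unchanged. The sketch below illustrates this with the vectors used earlier. The factor of 10 and the helper `cosine_similarity` are arbitrary choices made for illustration.

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between p and q."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

p = np.array([1.0, 5.0])
q = np.array([2.0, 4.0])
q_scaled = 10 * q  # same direction as q, ten times the magnitude

# The dot product scales with the magnitude of the vectors...
dot_pq = float(np.dot(p, q))
dot_scaled = float(np.dot(p, q_scaled))
print(f"Dot products: {dot_pq} vs {dot_scaled}")

# ...while the cosine similarity depends only on direction
cos_pq = cosine_similarity(p, q)
cos_scaled = cosine_similarity(p, q_scaled)
print(f"Cosine similarities: {cos_pq:.4f} vs {cos_scaled:.4f}")

# For unit-length vectors, the dot product equals the cosine similarity
p_unit = p / np.linalg.norm(p)
q_unit = q / np.linalg.norm(q)
dot_unit = float(np.dot(p_unit, q_unit))
print(f"Dot product of unit vectors: {dot_unit:.4f}")
```

When the vectors are normalized to unit length first, ranking by dot product and ranking by cosine similarity give the same order.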

## Vector databases

A vector database is a database that stores data as high-dimensional vectors. The vectors are mathematical representations of data and the number of their dimensions depends on the attributes or features in the data (granularity of data). By representing data as high-dimensional vectors, the *meaning* and *context* of data may be captured and data similarity can be determined based on vector similarity (how close the vectors are). 

**Why use vector databases?** The main advantage of vector databases is fast and accurate similarity search and retrieval of data based on vector similarity. So, instead of searching for data by *exact matches, regular expressions or other criteria*, data may be matched based on *semantics and context*.

**Examples of vector databases:** Pinecone, Chroma, pgvector (extension for PostgreSQL), sqlite-vss (extension for SQLite), Amazon OpenSearch Service

## Using vector databases for similarity searches

Here's a simple example to demonstrate a similarity search using a vector database. The steps in this example are as follows:
1.  Transform a few sentences into vectors (embeddings) using the sentence-transformers framework
2.  Create a vector database and insert the transformed data (vectors) into the database
3.  Use a query vector against the vector database to determine sentences that match the query
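Conceptually, step 3 is a nearest-neighbor search: score every stored vector against the query vector and keep the best matches. The sketch below shows that idea as an exhaustive in-memory scan over a few made-up 2-dimensional vectors. The function `top_k_cosine` and the sample data are hypothetical, and a real vector database typically relies on specialized index structures rather than scanning every vector.

```python
import numpy as np

def top_k_cosine(query, vectors, k):
    """Return (indices, scores) of the k stored vectors most similar to the query."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity of the query against every stored vector at once
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    # Sort by descending similarity and keep the first k positions
    order = np.argsort(-sims)[:k]
    return order.tolist(), sims[order].tolist()

# Hypothetical "database" of four stored vectors
stored = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 2.0], [5.0, 1.0]])

ids, scores = top_k_cosine([1.0, 4.0], stored, k=2)
for i, s in zip(ids, scores):
    print(f"Match: stored vector {stored[i]} with cosine similarity {s:.4f}")
```

The vector database used in the steps below plays the same role for the 384-dimensional sentence embeddings.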

#### 1. Transforming sentences into vectors

**Install the SentenceTransformers framework**

In [None]:
# This will take a while and consume about 6-8 GB of storage space.
%pip install -U sentence-transformers
%pip install ipywidgets

In [None]:
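# Work around a Hugging Face tokenizers warning.
# %env sets the variable in the notebook's own process; "!export" would only affect a temporary subshell.
%env TOKENIZERS_PARALLELISM=false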

**Select model and transform sentences into vectors**

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences to encode
sentences = ['London is the capital of the UK. It is a cosmopolitan city with a lot of diversity. The weather is unpredictable and you will not regret carrying a brolly!',
             'Toronto is the capital of Ontario, Canada. It has the famous CN tower. It snows a lot in winter.',
             'The EPL is my favorite football league. International stars play in the EPL.',
             'Go Gunners! Get past Man City next time! Arsenal and Man City are top clubs in the EPL.',
             'Lionel Messi now plays for Inter Miami in the MLS.',
             'The climate is warm in sunny California. Orange county in SoCal has nice beaches.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences).tolist()

# Create string IDs ("1", "2", ...) for the embeddings, one per sentence
embeddings_ids = [str(i) for i in range(1, len(sentences) + 1)]

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", f"{len(embedding)} dimensions")
    print(embedding)

#### 2. Create a vector database and insert vectors

For this example, the free tier of the fully-managed Pinecone vector database is used. Note that the desired vector similarity measure or metric is specified while creating the vector database (or index).

**Install the vector database**

In [None]:
# Pinecone client and other utilities. The code below uses the pinecone.init() API of pinecone-client 2.x.
!pip3 install ipython-secrets dbus-python keyrings.alt "pinecone-client<3"

In [None]:
from ipython_secrets import get_secret
PINECONE_API_KEY = get_secret("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = get_secret("PINECONE_ENVIRONMENT")

**Insert vectors into the vector database**

In [None]:
# Initialize Pinecone
import pinecone
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

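# Create the index (vector database) if it does not already exist.
# dimension=384 matches the embedding size of all-MiniLM-L6-v2.
if "my-test-vdb" not in pinecone.list_indexes():
    pinecone.create_index("my-test-vdb", dimension=384, metric="cosine")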

# Connect to the index
index = pinecone.Index("my-test-vdb")

# Upsert the transformed sentences (6 vectors, 384 dimensions each) as (id, vector) pairs
upserts = list(zip(embeddings_ids, embeddings))
index.upsert(upserts)

#### 3. Use a query vector for similarity search

In [None]:
# Transform the query into a vector (embedding) using the same model that was used for the sentences
query = "I want to visit England."
query_vector = model.encode(query).tolist()

# Query the index for similarity matches
results = index.query(
  vector=query_vector,
  top_k=3,
  include_values=True
).matches

# Map the IDs of the matched vectors back to the original sentences
id_to_sentence = dict(zip(embeddings_ids, sentences))
for match in results:
    print(id_to_sentence[match.id])

**Cleanup**

In [None]:
pinecone.delete_index("my-test-vdb")