<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Course Series</font></h1>
</center>

---

<center>
    <h1><font color="red">Understand Vector Embeddings</font></h1>
</center>

# <font color="red">Objective</font>

Vector embeddings are numerical representations of data points that express different types of data, including non-mathematical data such as words or images, as an array of numbers that Machine Learning (ML) models can process.

We provide general concepts that help us understand:
- What vector embeddings are.
- Why they are needed.
- How they keep the meaning of data.
- How they are created.
- How they are used.

# <font color="red">Introduction</font>

- ML algorithms, like most software algorithms, need numbers to work with.
   - They process information using mathematical algorithms that require numerical inputs to learn from data, detect patterns, and make predictions.
- We need to convert both raw numerical data (like temperature, pressure, elevation, etc.) and non-numerical data (like text, images, etc.) into numerical forms through processes like feature engineering.
- To identify patterns and relationships in data, ML algorithms process numerical features and data points, often in vector form.
-  __Vector embeddings__ are numerical representations of data (numerical or not) that keep the meaning of the original data and allow us to perform mathematical operations on it, such as comparison, similarity measurement, and transformation.
   - They allow us to take virtually any data type and represent it as numerical vectors.
      - A vector is an array of numbers in n-dimensional space, functioning like a mathematical bookkeeping device for data.
   - We can perform tasks on the transformed data without losing the data’s original meaning.
   - Expressing data points as vector embeddings enables the interoperability of different types of data, acting as a _lingua franca_ of sorts between different data formats by representing them in the same embedding space.
   - The more similar two items are, the closer the embeddings for those items are placed in the vector space.
      - Features or qualities shared by two data points should be reflected in both of their vector embeddings.
      - Dissimilar data points should have dissimilar vector embeddings.
- The same embeddings can be repurposed for search, database retrieval, and other features, creating a highly personalized user experience.
- Embeddings can enhance accuracy and generalization by allowing an algorithm to recognize similarities between concepts without being explicitly instructed to do so.

# <font color="red">What does a vector embedding look like?</font>

- The length (or dimensionality) of the vector depends on the specific embedding technique used and how we want the data to be represented.
   - Each embedding model has its own dimension length, allowed input types, similarity space, and other characteristics.
- The vector embedding itself is typically represented as a sequence of numbers, such as `[-0.5, -0.9, 0.3, 0.7, ...]`.
   - Each number in the sequence corresponds to a specific feature and contributes to the overall representation of the data point.
   - The values within the vector are not meaningful on their own. It is the relative values and relationships between the numbers that capture the semantic information and allow algorithms to process and analyze the data effectively.

#### Example

![fig_embedding](https://cdn.sanity.io/images/bbnkhnhl/production/9d9a653b2bb115c9ecae49532d8bbcd97e3e45ed-1920x1080.jpg?w=3840&q=75&fit=clip&auto=format)

In the above figure:

- Some words and their embeddings are presented.
- Some possible features are shown: whether the word represents a living being, a mammal, what gender it has, etc.
- The values in the matrix indicate how much each word has each feature:
   - `woman` and `man` are high on `alive` but are low on `rodent` and `plural`.
   - `hutches` is low on `bipedal` but high on `plural`.
- Each word can be placed on many different dimensions to end up in particular points in the conceptual space.
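
The idea behind the figure can be sketched in a few lines of Python. The feature scores below are made up for illustration (they are not read from the figure), and `nearest` is a hypothetical helper name. The sketch only shows that words sharing features end up close together in the feature space.

```python
import math

# Hypothetical feature scores (0 = feature absent, 1 = strongly present).
# These values are invented for illustration, not taken from the figure.
FEATURES = ["alive", "rodent", "plural", "bipedal"]
embeddings = {
    "woman":   [0.9, 0.0, 0.1, 0.9],
    "man":     [0.9, 0.1, 0.0, 0.9],
    "hutches": [0.1, 0.2, 0.9, 0.0],
}

def nearest(word):
    """Return the other word whose vector is closest (Euclidean distance) to `word`."""
    others = [w for w in embeddings if w != word]
    return min(others, key=lambda w: math.dist(embeddings[word], embeddings[w]))

d_woman_man = math.dist(embeddings["woman"], embeddings["man"])
d_woman_hutches = math.dist(embeddings["woman"], embeddings["hutches"])

print(f"woman-man distance:     {d_woman_man:.3f}")
print(f"woman-hutches distance: {d_woman_hutches:.3f}")
print("nearest to 'woman':", nearest("woman"))
```

With these scores, `woman` and `man` differ only slightly on two features, so they sit much closer to each other than either does to `hutches`.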

# <font color="red">How do embeddings work?</font>
- Embeddings are important for ML and data processing applications.
- A vector embedding transforms a data point, such as a word, sentence or image, into an n-dimensional array of numbers representing that data point’s characteristics—its features.
- Embeddings are created through neural networks.
   - The embedding model is basically a neural network with the last layer removed. Instead of getting a specific labeled value for an input, we get a vector embedding.
   - It takes raw input data, like images and texts, and represents them as numerical vectors.
   - The network learns to map high-dimensional data into lower-dimensional spaces while preserving important properties of the data.
   - During the training process, the neural network learns to transform these representations into meaningful embeddings.
      - Through iterative training, the neural network refines its parameters, including the weights in the embedding layer, to better represent the meaning of a particular input and how it relates to another piece of input (like how one word relates to another).
      - The training task is essential in shaping the learned embeddings. Optimizing the network for the task at hand forces it to learn embeddings that capture the underlying semantic relationships within the input data.

![fig_sample](https://qdrant.tech/articles_data/what-are-embeddings/How-Embeddings-Work.jpg)
Image source: Qdrant
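
As a minimal sketch of "a neural network with the last layer removed", consider a tiny two-layer network written in plain Python. The weights are made-up, untrained values and the names `dense`, `embed`, and `classify` are hypothetical. A real embedding model learns its weights during training and is far larger.

```python
import math

# Made-up, untrained weights for a tiny network:
# 4 inputs -> 3 hidden units -> 2 class scores.
W_HIDDEN = [
    [0.2, -0.5, 0.1, 0.7],
    [-0.3, 0.8, 0.5, -0.1],
    [0.6, 0.1, -0.4, 0.3],
]
W_OUTPUT = [
    [0.5, -0.2, 0.1],
    [-0.4, 0.3, 0.9],
]

def dense(weights, inputs):
    """Apply one fully connected layer (no bias): one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def embed(inputs):
    """Run the network up to, but not including, the output layer."""
    return [math.tanh(v) for v in dense(W_HIDDEN, inputs)]

def classify(inputs):
    """Run the full network: the embedding followed by the output layer."""
    return dense(W_OUTPUT, embed(inputs))

x = [1.0, 0.0, 0.5, 0.2]   # a 4-dimensional input
embedding = embed(x)       # 3 numbers: the vector embedding of x
scores = classify(x)       # 2 numbers: what the full network would output

print("embedding:", [round(v, 3) for v in embedding])
print("scores:   ", [round(v, 3) for v in scores])
```

Here `classify` plays the role of the original network and `embed` is the same network stopped one layer early, so the 4-dimensional input is mapped to a 3-dimensional vector instead of class scores.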

## <font color="blue">Comparing vector embeddings</font>

- Once we have embeddings of different inputs __from the same embedding model__, then we can compare the vectors using a distance metric, and determine the relative similarity of inputs. 
- n-dimensional embeddings of similar data points should be grouped closely together in n-dimensional space.
- We need to use an appropriate mathematical operation to best measure the similarity among embeddings.

### Metrics
- We can consider the following metrics:
   - __Euclidean distance__: Measures the straight-line distance between two vectors, treated as points in the embedding space. Because it is sensitive to magnitude, it’s useful for data that reflects things like size or counts.
   - __Cosine similarity__: Is the cosine of the angle between two vectors, that is, their dot product divided by the product of their lengths.
       - It ranges from -1 to 1, in which 1 represents vectors pointing in the same direction, 0 represents orthogonal (or unrelated) vectors, and -1 represents vectors pointing in opposite directions.
       - It is used widely in NLP tasks because it naturally normalizes vector magnitudes, which makes it less sensitive to the relative frequency of words in training data than Euclidean distance.
   - __Dot product__: Is the sum of the products of the corresponding components of the two vectors. Unlike cosine similarity, it is affected by both the direction and the magnitude of the vectors.

![fig_compare](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*glVlVp1Q9-vDmYNn.png)
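
The three metrics can be written directly from their definitions. The following is a minimal sketch using only the Python standard library. The function names and the sample vectors are our own choices for illustration; in practice a numerical library would normally be used.

```python
import math

def dot_product(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero vectors)."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]      # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]   # opposite direction to a

print("dot(a, b)       =", dot_product(a, b))                    # 28.0
print("euclidean(a, b) =", round(euclidean_distance(a, b), 3))   # sqrt(14), about 3.742
print("cosine(a, b)    =", round(cosine_similarity(a, b), 3))    # about 1.0
print("cosine(a, c)    =", round(cosine_similarity(a, c), 3))    # about -1.0
```

Note that `a` and `b` have a cosine similarity of about 1 even though their Euclidean distance is not zero: cosine similarity ignores the difference in length, while Euclidean distance and the dot product do not.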

### Vector search

- We can compute the similarity between an arbitrary input vector and the existing vectors in a database: that is known as __vector search__.
- Vector search maps data (text, images, etc.) into numerical vectors that capture the semantic essence of the content, in a high-dimensional space where similar items cluster together.
- Where traditional search relies on mentions of keywords, lexical similarity, and the frequency of word occurrences, vector search engines use distances in the embedding space to represent similarity.
   - Vector search enables searching both structured and unstructured data by semantics or meaning, and by values.

![fig_search](https://cdn.educba.com/academy/wp-content/uploads/2024/02/Vector-Search-Benefits.jpg)
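
A brute-force version of vector search can be sketched as follows: score every stored vector against the query and keep the best matches. The item names, the 3-dimensional vectors, and the `vector_search` helper are all hypothetical. Production vector databases typically rely on approximate nearest-neighbor indexes rather than an exhaustive scan like this one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query, database, k=2):
    """Return the k (similarity, item_id) pairs most similar to `query`."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in database.items()]
    return sorted(scored, reverse=True)[:k]

# Made-up embeddings, assumed to come from the same embedding model.
database = {
    "doc_rain":    [0.9, 0.1, 0.0],
    "doc_drought": [0.8, 0.3, 0.1],
    "doc_rocket":  [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]   # embedding of the user's query (also made up)

results = vector_search(query, database, k=2)
for similarity, item_id in results:
    print(f"{item_id}: {similarity:.3f}")
```

The query vector points in nearly the same direction as the two weather-related items, so they are returned ahead of `doc_rocket`, regardless of which keywords the underlying documents contain.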

# <font color="red">What types of objects can be embedded? </font>

![fig_types](https://partee.io/images/posts/vector-embeddings/embedding-creation.png)
Image source: Sam Partee

Many kinds of data types and objects can be represented as vector embeddings. Some of the common ones include: 

- __Text__: Documents, paragraphs, sentences, and words can be embedded into numerical vectors using techniques like Word2Vec (for word embeddings) and Doc2Vec (for document embeddings).
- __Images__: Images can be embedded into vectors using methods like CNNs (Convolutional Neural Networks) or pre-trained image embedding models like ResNet and VGG. 
- __Audio__: Audio signals like music or speech can be embedded into numerical representations using techniques like RNNs (Recurrent Neural Networks) or spectrogram embeddings. They capture auditory properties, making it possible for systems to interpret audio more effectively. 
- __Graphs__: Edges and nodes in a graph can be embedded using techniques like graph convolutional networks and node embeddings to capture relational and structural information. Nodes in a graph represent entities like a person, product, or web page, and each edge represents the connection or link between those entities.
- __3D model data__: 3D model embeddings represent different geometric aspects of 3-dimensional objects and are used for tasks like form matching, object detection, and 3D reconstruction.
- __Time-series data__: These embeddings capture temporal patterns in sequential data and are used for sensor data, financial data, and IoT applications. Their common use cases include pattern identification, anomaly detection, and time series forecasting.
- __Molecules__: Molecule embeddings that represent chemical compounds are used for molecular property prediction, drug discovery and development, and chemical similarity searching. 

# <font color="red">Applications using vector embeddings</font>

Vector embeddings are an incredibly versatile tool and can be applied in many domains.
Here are some NASA-related applications that rely on vector embeddings:

- __Cataloging remote sensing data__: By converting raw remote-sensing data into vector embeddings, a catalog can be created that indexes data by spatial, temporal, and feature similarity. This allows users to:
   - Search for complex phenomena, like areas with drought-stressed vegetation or signs of illegal mining, without needing specialized technical expertise.
   - Accelerate the process of finding and using relevant data, assisting scientists and researchers in their work.
- __Information retrieval__: We use word embeddings to improve information retrieval from large collections of abstracts and full-text documents.
   - Users can easily identify papers in their topics of interest, retrieve relevant NASA documents, etc.
- __Knowledge discovery__: Vector embeddings are used to integrate textual metadata with satellite imagery and other visual information and to capture the relationships between different data types. This facilitates robust analysis and helps scientists uncover deeper insights.
- __Anomaly detection__: Embeddings are created for anomaly detection using spacecraft telemetry and other complex datasets. By converting high-dimensional time-series data into dense, numerical vectors, embeddings enable advanced machine learning models to identify unusual patterns that would be missed by traditional methods. 
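
To make the anomaly-detection idea concrete, here is a deliberately simple sketch: embeddings of known-good data define a centroid, and a new embedding is flagged when it lies too far from that centroid. The vectors, the window names, and the `THRESHOLD` value are all made up, and this fixed-threshold rule is only one elementary technique, not a description of any specific NASA system.

```python
import math

# Made-up 3-dimensional embeddings of telemetry windows recorded during nominal operation.
nominal = [
    [0.10, 0.20, 0.10],
    [0.12, 0.18, 0.11],
    [0.09, 0.22, 0.10],
]

# Made-up embeddings of new windows to check.
new_windows = {
    "t3": [0.11, 0.19, 0.12],
    "t4": [0.90, 0.80, 0.70],   # deliberately far from the nominal cluster
}

THRESHOLD = 0.3   # hypothetical cut-off; a real one would be tuned on data

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

center = centroid(nominal)

# Flag every new window whose embedding is farther than THRESHOLD from the centroid.
anomalies = [
    name for name, vec in new_windows.items()
    if math.dist(vec, center) > THRESHOLD
]

print("anomalous windows:", anomalies)
```

Only `t4` is flagged here, because its embedding sits far from the cluster formed by the nominal windows.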


# <font color="red"> References</font>

- [What is vector embedding?](https://www.ibm.com/think/topics/vector-embedding) by Dave Bergmann and Cole Stryker of IBM.
- [A Beginner’s Guide to Vector Embeddings](https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings) from tigerdata.com
- [A visual introduction to vector embeddings](https://techcommunity.microsoft.com/blog/educatordeveloperblog/a-visual-introduction-to-vector-embeddings/4418793) by Pamela Fox of Microsoft.

[![Python + AI: Vector embeddings](http://img.youtube.com/vi/ABLeB7JMWk0/0.jpg)](http://www.youtube.com/watch?v=ABLeB7JMWk0 "Python + AI: Vector embeddings")