# Embeddings and Semantic Search Lab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/S24-CS143AI/blob/main/semantic_search_lab.ipynb)

This lab introduces the `sentence_transformers` library, which is built on top of the Hugging Face `transformers` library we worked with last time. You can find more information here: https://www.sbert.net/docs/

First, we need to install some packages.

In [1]:
import sys

!{sys.executable} -m pip install sentence_transformers
!{sys.executable} -m pip install requests

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


## Getting a SentenceTransformer model

We can specify a Hugging Face model just like with `transformers`. You can see the model card for this model here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

The purpose of this model is to return **sentence embeddings** for applications like **semantic search**.

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

## Trying it out

Call `model.encode()` to get the embedding for a sentence. Let's print one out to see what it looks like.

In [3]:
ai_definition = "Intelligent agents are computer programs that behave rationally in their environments."
text_embedding = model.encode(ai_definition)
text_embedding

array([ 2.47965213e-02, -2.50644255e-02, -2.64692921e-02, -4.84711584e-03,
        5.03179990e-02, -4.46560085e-02,  2.22164206e-02,  1.25201130e-02,
       -1.55535601e-02,  6.87073320e-02,  1.79628637e-02,  1.24619761e-03,
        6.07962869e-02, -1.31208794e-02,  2.50144061e-02,  5.01403864e-03,
        4.60640527e-02, -8.98199454e-02, -5.09008914e-02, -1.69333052e-02,
       -1.85755901e-02, -1.07291769e-02, -2.93348953e-02, -2.79941689e-02,
       -8.57128873e-02,  7.98192769e-02,  8.01137765e-04, -4.81104665e-02,
        6.52510999e-03, -1.31700477e-02,  5.78082502e-02,  4.39543277e-02,
        6.19660169e-02,  4.03581820e-02, -4.75638779e-03,  6.14248812e-02,
       -2.94525828e-02,  3.51185538e-02,  3.99242565e-02,  1.75489448e-02,
       -7.35502988e-02, -5.25383987e-02, -6.32574409e-02, -4.24986938e-03,
        7.00697489e-03, -2.75384318e-02, -1.10022202e-01, -4.32477370e-02,
       -4.94306125e-02, -4.41644229e-02, -2.16475606e-01, -5.35669085e-03,
        6.48972578e-03,  

### Exercise

Type in some different sentences of different lengths and see what their embeddings are.

## Calculating the similarity between sentences

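The `euclidean_distance` function below computes the straight-line distance between two points, the same measurement you learned in high school math, using NumPy's vector norm. A smaller distance means the model considers the two sentences more similar. Other metrics, such as cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity), are sometimes used instead.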

In [4]:
import numpy as np

def euclidean_distance(vec1, vec2):
    """Calculate the Euclidean distance between two vectors."""
    return np.linalg.norm(vec1 - vec2)

ai_statement = "Intelligent agents are computer programs that behave rationally in their environments."
nn_statement = "Neural networks is an AI technique inspired by brain structures."
ee_statement = "Elementary education programs prepare teachers for challenging classroom environments"

ai_statement_embedding = model.encode(ai_statement)
nn_statement_embedding = model.encode(nn_statement)
ee_statement_embedding = model.encode(ee_statement)

print("Distance between AI and NN statements:", euclidean_distance(ai_statement_embedding, nn_statement_embedding))
print("Distance between AI and EE statements:", euclidean_distance(ai_statement_embedding, ee_statement_embedding))

Distance between AI and NN statements: 1.1352925
Distance between AI and EE statements: 1.2642297


Notice that the AI and NN statements are closer together even though they have no words in common!

The AI and EE statements share words like "programs" and "environments", yet the model places them farther apart.
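
The cosine similarity mentioned earlier can be sketched in the same style as `euclidean_distance`. This is a minimal illustration, not part of the original lab: the `cosine_similarity` helper and the toy vectors are assumptions made for the example, and in practice you would pass in embeddings from `model.encode()`. Unlike distance, a higher value (closer to 1) means the vectors point in more similar directions.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two vectors (hypothetical helper)."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Toy 2-D vectors standing in for real sentence embeddings
a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # perpendicular to a

print("a vs b:", cosine_similarity(a, b))
print("a vs c:", cosine_similarity(a, c))
```

Because cosine similarity ignores vector length, `a` and `b` count as pointing the same way even though their Euclidean distance is 1.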

### Exercise

See if you can trick the model: can you write three sentences in which the unrelated sentences are scored closer than the related ones?

## Getting embeddings for a whole batch of text

If you have a lot of text to embed, for example when searching a large dataset, it is faster to encode it as a batch like this:

In [5]:
list_of_sentences = [
    "Intelligent agents are computer programs that behave rationally in their environments.",
    "Neural networks is an AI technique inspired by brain structures.",
    "Elementary education programs prepare teachers for challenging classroom environments",
]

list_of_embeddings = model.encode(list_of_sentences)

print("We have",len(list_of_embeddings),"embeddings")

We have 3 embeddings
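
By default, `model.encode()` returns a batch as a 2-D NumPy array with one row per sentence, so the distance from one vector to every row can be computed in a single call. The sketch below uses a small made-up array in place of real embeddings; `toy_embeddings` and `toy_query` are hypothetical names chosen for illustration.

```python
import numpy as np

# Made-up stand-in for the array returned by model.encode(list_of_sentences)
toy_embeddings = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [1.0, 1.0],
])
toy_query = np.array([0.0, 0.0])

# Subtraction broadcasts toy_query across every row; axis=1 takes one norm per row
distances = np.linalg.norm(toy_embeddings - toy_query, axis=1)
print(distances)
```

This avoids a Python-level loop, which tends to matter once there are thousands of rows.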


## Semantic Search

Here's some data on all of the courses in Drake's course catalog.

In [6]:
import requests
import json

course_descriptions_url = 'https://raw.githubusercontent.com/ericmanley/S24-CS143AI/main/data/course_descriptions.json'
course_descriptions = requests.get(course_descriptions_url).json()
course_descriptions

[{'course_code': 'ACCT 0--',
  'course_name': 'ACCT-LOWER DIVISION',
  'description': 'No course description is available.'},
 {'course_code': 'ACCT 041',
  'course_name': 'INTRODUCTION TO FINANCIAL ACCOUNTING',
  'description': 'The elements of the financial statements, accounting for deferrals, the double-entry accounting system, internal control and cash, receivables and payables, inventory, operational assets, long-term debt, equity transactions, income measurement, and comprehensive treatment of the balance sheet, the income statement and the statement of cash flows.  Financial statement analysis will be integrated throughout the course.  Prereq.:  None.'},
 {'course_code': 'ACCT 042',
  'course_name': 'INTRODUCTION TO MANAGERIAL ACCOUNTING',
  'description': 'Explaining manufacturing and nonmanufacturing costs and how they are reported in the financial statements, computing the cost of providing a service or manufacturing a product, determining cost behavior as activity levels ch

### Exercise

Loop through this data and find the entry that is most similar to this search query:

In [7]:
query = "Which courses cover neural networks?"
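
One way to approach a search like this is to keep track of the smallest distance seen so far while looping. The sketch below shows that pattern on made-up records with hand-written 2-D vectors. The `toy_courses` list, its `embedding` field, and `toy_query_embedding` are assumptions for illustration; in the lab you would obtain the vectors from `model.encode()` instead.

```python
import numpy as np

# Made-up records; the real entries have no 'embedding' field until you compute one
toy_courses = [
    {"course_code": "TOY 001", "embedding": np.array([0.9, 0.1])},
    {"course_code": "TOY 002", "embedding": np.array([0.2, 0.8])},
    {"course_code": "TOY 003", "embedding": np.array([0.5, 0.5])},
]
toy_query_embedding = np.array([0.1, 0.9])

best_course = None
best_distance = float("inf")
for course in toy_courses:
    distance = np.linalg.norm(course["embedding"] - toy_query_embedding)
    if distance < best_distance:  # closer than any entry seen so far
        best_course = course
        best_distance = distance

print(best_course["course_code"], best_distance)
```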

### Exercise

Revise your code from the previous exercise to return the top 10 matches.
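
For a top-k list, one option is to sort the indices by distance with `np.argsort` and keep the first k. This is a sketch on made-up distances rather than a full solution; `toy_codes`, `toy_distances`, and `k` are hypothetical.

```python
import numpy as np

toy_codes = ["TOY 001", "TOY 002", "TOY 003", "TOY 004"]
toy_distances = np.array([1.2, 0.3, 0.9, 0.5])  # made-up distances to a query

k = 2
# argsort returns the indices that would sort the array in ascending order
top_k_indices = np.argsort(toy_distances)[:k]
top_k = [(toy_codes[i], float(toy_distances[i])) for i in top_k_indices]
print(top_k)
```

Sorting by index rather than by value keeps each distance paired with the record it belongs to.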