# Course Semantic Search
Experiments in semantic search over course content taken from open LiaScript courses. The plan is to start simple, then move on to having GPT-3 answer questions based on the search results.



## Part 1 - Use the gtr-t5-large Transformer for Dense Retrieval to create embeddings and use FAISS for similarity search

Based on https://til.simonwillison.net/python/gtr-t5-large

### Installations


In [23]:

!pip install sentence-transformers




In [24]:
!pip install faiss-gpu


In [25]:
!pip install httpx



### Download a LiaScript Course

In [6]:
# Use an open LiaScript book as the course content.
# Source: https://raw.githubusercontent.com/LiaBooks/How-To-Code-in-Python-3/main/README.md

import httpx
import markdown
import re

def parse_markdown(markdown_str):
    """Split a markdown document into plain-text sections, one per heading."""
    # Convert markdown to HTML
    html = markdown.markdown(markdown_str)

    # Start offset of every heading tag (<h1> to <h6>); a bare '<h' would also match <hr>
    heading_indices = [m.start() for m in re.finditer(r'<h[1-6]', html)]

    content = []

    for index, start in enumerate(heading_indices):
        # A section runs up to the next heading, or to the end of the document
        if index < len(heading_indices) - 1:
            end = heading_indices[index + 1]
        else:
            end = len(html)
        section = html[start:end]
        section = re.sub('<[^<]+?>', '', section)  # strip HTML tags
        section = section.replace('\n', ' ').strip()  # replace new lines with spaces
        content.append(section)

    return content

url = "https://raw.githubusercontent.com/LiaBooks/How-To-Code-in-Python-3/main/README.md"
    
course_markdown = httpx.get(url, timeout=10)
course_markdown.raise_for_status()  # fail early if the download did not succeed
full_markdown = course_markdown.text

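# Remove fenced code blocks (count and flags passed by keyword; positional use is deprecated in newer Python)
full_markdown = re.sub(r"^```[^\S\r\n]*[a-z]*(?:\n(?!```$).*)*\n```", '', full_markdown, count=0, flags=re.MULTILINE)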

contents = parse_markdown(full_markdown)

print("Example Section content", contents[2])
print("Number of sections", len(contents))


How To Code in Python 3   ## About DigitalOcean  DigitalOcean is a cloud services platform delivering the simplicity developers love and businesses trust to run production applications at scale. It provides highly available, secure and scalable compute, storage and networking solutions that help developers build great software faster. Founded in 2012 with offices in New York and Cambridge, MA, DigitalOcean offers transparent and affordable pricing, an elegant user interface, and one of the largest libraries of open source resources available. For more information, please visit https://www.digitalocean.com or follow [\@digitalocean](https://twitter.com/digitalocean) on Twitter.  Read this book online and receive server credit via https://do.co/python-book      ## DigitalOcean Community Team  **Director of Community:** Etel Sverdlov  **Technical Writers:** Melissa Anderson, Brian Boucheron, Mark Drake, Justin Ellingwood, Katy Howard, Lisa Tagliaferri  **Technical Editors:** Brian Hogan, 
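
The effect of removing code blocks before splitting can be checked on a small sample without downloading the course. The sketch below is only an illustration: the `sample` text is made up, and `split_on_headings` is a hypothetical helper that splits directly on markdown headings rather than going through HTML as `parse_markdown` does. It suggests one likely reason for stripping code first: Python comments inside a code block look like markdown headings.

```python
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example

# Hypothetical sample standing in for the downloaded course markdown.
sample = "\n".join([
    "# Intro",
    "Welcome to the course.",
    "",
    "## Comments",
    "Comments start with a hash.",
    "",
    FENCE + "python",
    "# a comment",
    'print("hi")',
    FENCE,
    "",
    "## Loops",
    "Use for to iterate.",
])

# Same pattern as in the notebook cell: drop fenced code blocks first.
code_block = re.compile(
    rf"^{FENCE}[^\S\r\n]*[a-z]*(?:\n(?!{FENCE}$).*)*\n{FENCE}",
    flags=re.MULTILINE,
)
without_code = code_block.sub("", sample)


def split_on_headings(markdown_str):
    """Hypothetical helper: one flattened text section per markdown heading."""
    parts = re.split(r"^#{1,6}[ \t]+", markdown_str, flags=re.MULTILINE)
    # Collapse all whitespace runs to single spaces and drop empty parts
    return [" ".join(p.split()) for p in parts if p.strip()]


sections = split_on_headings(without_code)
naive_sections = split_on_headings(sample)  # code block left in place

for section in sections:
    print(section)
print(len(sections), "sections vs", len(naive_sections), "without stripping code")
```

With the code block left in, the `# a comment` line is treated as a heading and produces a spurious extra section.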

### Get an Embedding Vector for each course content item

In [7]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/gtr-t5-large")




In [8]:
import datetime

print("Start time", datetime.datetime.now().isoformat())
embeddings = model.encode(contents)
print("Finish time", datetime.datetime.now().isoformat())

2023-02-26T00:47:22.443638
2023-02-26T00:47:48.403013


In [10]:
import json

# Save the Embeddings

with open("embeddings.json", "w") as fp:
    json.dump(
        {
            "embeddings": [list(map(float, e)) for e in embeddings]
        },
        fp,
    )
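
The file written above holds only the vectors, so a fresh session still needs `contents` to map an index position back to its text. One option, which is not what this notebook does, is to store each section's text next to its vector. A minimal sketch using only the standard library; `save_records`, `load_records`, the `records.json` file name, and the toy data are all assumptions made for illustration:

```python
import json
import os
import tempfile


def save_records(path, texts, vectors):
    """Write each text together with its embedding as one JSON record."""
    records = [
        {"text": text, "embedding": [float(x) for x in vector]}
        for text, vector in zip(texts, vectors)
    ]
    with open(path, "w") as fp:
        json.dump({"records": records}, fp)


def load_records(path):
    """Read the records back as parallel lists of texts and vectors."""
    with open(path) as fp:
        data = json.load(fp)
    texts = [record["text"] for record in data["records"]]
    vectors = [record["embedding"] for record in data["records"]]
    return texts, vectors


# Toy data standing in for `contents` and `embeddings`
texts = ["How To Write Comments", "For loops"]
vectors = [[0.1, 0.2], [0.3, 0.4]]

path = os.path.join(tempfile.mkdtemp(), "records.json")
save_records(path, texts, vectors)
loaded_texts, loaded_vectors = load_records(path)
print(loaded_texts)
```

Position `i` in the loaded vectors then lines up with position `i` in the loaded texts, which is the mapping the search step relies on.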

### Use FAISS to match Course Content to a Search Query

In [15]:
import faiss
import json
import numpy as np

# Load the saved embeddings
with open("embeddings.json") as fp:
    data = json.load(fp)

# Exact (brute-force) L2 index sized to the embedding dimension
index = faiss.IndexFlatL2(len(data["embeddings"][0]))
index.add(np.array(data["embeddings"]).astype('float32'))  # FAISS requires float32 arrays, not float64

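def find_similar(query_embedding, k=10):
    # FAISS expects a 2D float32 array of shape (n_queries, dim)
    query = np.array([query_embedding], dtype='float32')
    # search returns (distances, indices); only the indices are needed here
    _, indices = index.search(query, k)
    return indices[0]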
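
`faiss.IndexFlatL2` does an exact, exhaustive search: it compares the query against every stored vector and ranks them by squared Euclidean distance. The pure-Python sketch below mirrors that idea on toy 2D vectors so the ranking can be checked by hand. `brute_force_l2_search` is a hypothetical name and this is not how FAISS is implemented internally.

```python
def brute_force_l2_search(vectors, query, k=10):
    """Return (distances, indices) of the k vectors closest to the query.

    Distances are squared Euclidean distances, smallest first.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # ascending distance, ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings" in two dimensions
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
distances, indices = brute_force_l2_search(vectors, [0.9, 0.1], k=2)
print(indices)
```

A flat index like this scales linearly with the number of stored vectors, which is fine for a single course's worth of sections.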




### Test a Few Search Queries

In [22]:
#query = "What are the main differences between Python 2 and 3?"

#query = "Give examples of differences in Python 2 and 3 syntax?"

query = "How do I write a comment?"

query_embedding = model.encode(query)

results = find_similar(query_embedding)

for idx in results:
  print(contents[idx])
  print("----")

How To Write Comments Comments are lines that exist in computer programs that are ignored by compilers and interpreters. Including comments in programs makes code more readable for humans as it provides some information or explanation about what each part of a program is doing. Depending on the purpose of your program, comments can serve as notes to yourself or reminders, or they can be written with the intention of other programmers being able to understand what your code is doing. In general, it is a good idea to write comments while you are writing or updating a program as it is easy to forget your thought process later on, and comments written later may be less useful in the long term.
----
For loop that iterates over sharks list and prints each string item for shark in sharks:    print(shark) Comments are made to help programmers, whether it is the original programmer or someone else using or collaborating on the project. If comments cannot be properly maintained and updated along

## Part 2 - Try using Lance for Vector Similarity Search
Lance is a new columnar data format with built-in vector search.

https://github.com/eto-ai/lance

https://pypi.org/project/pylance/

## Part 3 - Try using SQLite Vector Search Plugin for Similarity Search

## Part 4 - Question Answering with GPT-3
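
As a starting point for this part, the top search results could be packed into a prompt for the model to answer from. The sketch below only builds the prompt string and makes no API call. `build_prompt`, its `max_chars` budget, and the prompt wording are assumptions for illustration, not a tested recipe.

```python
def build_prompt(question, passages, max_chars=1500):
    """Assemble a question-answering prompt from retrieved passages.

    Passages are added in ranked order until the character budget is used up.
    """
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        snippet = passage.strip()
        if used + len(snippet) > max_chars:
            break  # stop before exceeding the budget
        context_parts.append(f"[{i}] {snippet}")
        used += len(snippet)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the course excerpts below. "
        "If the answer is not in the excerpts, say so.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy passages standing in for contents[idx] from the search results
passages = ["Comments start with #.", "x" * 2000]
prompt = build_prompt("How do I write a comment?", passages)
print(prompt)
```

Here the second passage is too long for the budget, so only the first excerpt ends up in the prompt.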

