We are implementing a RAG pipeline.

Our naive approach was to try to use semantic search with embeddings to retrieve relevant results

In [1]:
from embed.service_layer.services import embed
from embed import bootstrap as embed_bootstrap

from content_index.service_layer import services as content_index_services
from content_index import bootstrap as content_index_bootstrap

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\bbraden\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bbraden\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
import chromadb
def chromadb_client_factory() -> chromadb.HttpClient:
    host, port = "http://localhost:6333", 6333
    return chromadb.HttpClient(host, port)

In [3]:
import pymongo

def mongo_db_client_factory() -> pymongo.MongoClient:
    db_uri, port = "localhost", 27017
    return pymongo.MongoClient(db_uri, port)

In [4]:
from content_index.service_layer import unit_of_work

def user_query(user_query: str):

    embed_dependencies = embed_bootstrap.IdentityChunkerHuggingFaceMLE5Embedder()
    embedded_text = embed(chunker=embed_dependencies["chunker"], embedder=embed_dependencies["embedder"], text_data=[user_query])

    index_uow = unit_of_work.ChromaDBUnitOfWork(client_factory=chromadb_client_factory)

    results = content_index_services.get_similar_mgoblog_content(uow=index_uow, embeddings=[x.embedded_text for x in embedded_text], top_n_results=30)[0]
    full_results = []
    for result in results:
        full_results.append([result.url, result.text])
    
    return full_results

In [5]:
results = user_query("What is Makari Paige good at?")

In [6]:
# Testing results using only semantic search... Paige isn't even mentioned in any of the documents retrieved
[x[1] for x in results if 'Paige' in x[1]] 

[]

In [7]:
# Also, when looking at the text returned, it's almost all irrelevant information, new line heavy text, etc.
# Our search strategy needs to change but also our processing steps when uploading to the vector database
results

[['https://www.mgoblog.com/content/mgoradio-109', 'The Sponsors'],
 ['https://www.mgoblog.com/content/punt-counterpunt-northwestern-2024',
  '----------------------------\n\n\n\n\n\n    COUNTERPUNT\n\n\nBy Internet Raj\n@internetraj'],
 ['https://www.mgoblog.com/content/punt-counterpunt-indiana-2024',
  '----------------------------\n\n\n\n\n\n    COUNTERPUNT\n\n\nBy Internet Raj\n@internetraj'],
 ['https://www.mgoblog.com/content/punt-counterpunt-game-2024',
  '----------------------------\n\n\n\n\n\n    COUNTERPUNT\n\n\nBy Internet Raj\n@internetraj'],
 ['https://www.mgoblog.com/content/punt-counterpunt-oregon-2024',
  '----------------------------\n\n\n\n\n\n    COUNTERPUNT\n\n\nBy Internet Raj\n@internetraj'],
 ['https://www.mgoblog.com/content/benny-patterson-has-committed-michigan',
  'Hayes Fawcett (@Hayesfawcett3) November 1, 2024'],
 ['https://www.mgoblog.com/content/wtka-roundtable-11212024-they-don%27t-have-guy',
  'Things Discussed:'],
 ['https://www.mgoblog.com/content/wtk

### Adding a BM25 layer on to retrieval

In [5]:
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

bm25.get_top_n("London".split(" "),corpus, 1)

['It is quite windy in London']

In [19]:
# Trying it with our data
index_uow = unit_of_work.ChromaDBUnitOfWork(client_factory=chromadb_client_factory)
all_data = content_index_services.list_mgoblog_content(index_uow)
corpus = [x.text for x in all_data]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)


In [26]:
# All the results include Makari Paige, this is an obvious improvement
# These are all defensive recaps as well which is what we are looking for. 
query_results = bm25.get_top_n("How did Makari Paige play against Michigan State.".split(" "),corpus, 15)
for result in query_results:
    print(f"url:{[x.url for x in all_data if x.text == result][0]}")
    print(f"text:{result}")
    if "Paige" in result:
        print("Paige mentioned by name")


url:https://www.mgoblog.com/content/upon-further-review-2024-defense-vs-northwestern
text:the worse blocking gets to the quarterback first. In fact the comp to The Real Rashan Gary (not the one in your heads) that I made for Moore as a recruit has proven as exacting as it was controversial. How many ~270-pound guys are there playing in space like an outside linebacker?The above was also our What Happened to Makari Paige moment of the game—the only one. The comp here is so real it hurts:Also, I'm breaking up with Vlad Goldin.
Paige mentioned by name
url:https://www.mgoblog.com/content/upon-further-review-2024-defense-vs-michigan-state
text:the play and get Carter down short of the sticks. Contrast this with Berry, who's got the same job against the same play:#10 in the slotBerry isn't even relevant. So yes, you can dog Makari Paige for being a bad tackler. It's quite evident that he is, and at this late stage of his college career it appears that's what he will be. Paige is also the guy

In [27]:
# Let's keep testing 
query_results = bm25.get_top_n("What does Wink Martindale do well?".split(" "),corpus, 15)
for result in query_results:
    print(f"url:{[x.url for x in all_data if x.text == result][0]}")
    print(f"text:{result}")
    if "Wink" in result:
        print("Wink mentioned by name")

url:https://www.mgoblog.com/content/upon-further-review-2024-defense-vs-michigan-state
text:than what happens when he gets there. If you're looking for someone to blame, blame Wink Martindale for putting him in that position.Okay then fire Wink Martindale.I don't think he can be fired. He has a $9M contract and you need that money for players.Sideline Wink Martindale.Yeah, it's time.Dude, for real? I was trying to be talk radio, not calling for a Kerry Coombsening after 17 points.Well I do the charting. When I'm done doing the charting I'm supposed to tell you what it says. What it
Wink mentioned by name
url:https://www.mgoblog.com/content/upon-further-review-2024-defense-vs-michigan-state
text:Grant, Jaishawn Barham, Mason Graham when not used as a DE, Quinten Johnson, Jyaire Hill.Maybe not so heroic?There have been worse DCs but Wink Martindale is the worst play-caller of those I've seen. Aamir Hall got exposed. Makari Paige was in the right spot but couldn't tackle when he got there

In [30]:
# Let's keep testing 
query_results = bm25.get_top_n("What is the scouting report on Bryce Underwood?".split(" "),corpus, 15)
for result in query_results:
    print(f"url:{[x.url for x in all_data if x.text == result][0]}")
    print(f"text:{result}")
    if "Underwood" in result:
        print("Underwood mentioned by name")

url:https://www.mgoblog.com/content/fee-fi-foe-film-oregon-offense-2024
text:this year showcase. Here's a long one on a scramble from his time at Oklahoma against the Huskers: I don't think Gabriel is a top tier rushing QB in college football but he has a good feel for how to use his legs and his overall athletic attributes make him dangerous enough to rate around where Aidan Chiles did last week, a 7 or so. Different sorts of players, Gabriel is thicker and less elusive but the mobility/dual-threat ability is on the scouting report all the same. Dangerman: I was split
url:https://www.mgoblog.com/content/fee-fi-foe-film-ohio-state-offense-2024
text:competing in The Game a year ago. Today we will learn about Ohio State's reshuffled offense, its new QB, OL challenges, shiny new stars at WR/RB, and its new OC:  The Film: Multiple injuries on the OL have changed the scouting report on the OSU offense as the year has gone along, with the most recent injury to Seth McLaughlin striking just a

### Hybrid search looks to be viable path forward, but we still need to address our pre-processing stage. We'll tackle that first before going deeper in search.

In [6]:
# We are grabbing our data right from web pages which causes some wonky formatting and data to come through. Let's take a look at what we have now
import pymongo

def mongo_db_client_factory() -> pymongo.MongoClient:
    db_uri, port = "localhost", 27017
    return pymongo.MongoClient(db_uri, port)

In [7]:
from ingest_mgoblog_data.service_layer import unit_of_work as elt_unit_of_work
from ingest_mgoblog_data.service_layer import services as elt_services
elt_uow = elt_unit_of_work.PymongoUnitOfWork(client_factory=mongo_db_client_factory)

processed_data = elt_services.list_processed_mgoblog_content(uow=elt_uow)

In [47]:
print(f"Percent of documents with new lines: {len([x for x in processed_data if '\n' in x.body])/len(processed_data)}")
print(f"Percent of documents with multiple spacings: {len([x for x in processed_data if '      ' in x.body])/len(processed_data)}")

Percent of documents with new lines: 0.21
Percent of documents with multiple spacings: 0.06


In [57]:
# There's a bunch of table data in articles like this which will be tough to handle no matter what, but it shouldn't look like this.
print([x.body for x in processed_data if x.url == "https://www.mgoblog.com/content/upon-further-review-2024-defense-vs-northwestern"][0])

FORMATION NOTES: UFR Glossary is here. Go Go is the unbalanced formation with all the threats to one side. This variant I called Go Go 2TE.Both teams were pretty vanilla except Michigan liked to walk down to a Cover Zero look sometimes.[After THE JUMP: Forty-five snaps against Northwestern, fwiw.]


LnDnDstOFormDPackFrontHiTypeRushPlayPlayerYdsEPAO251st10Gun Str (Y)4-2-54-3 Split1 offPA4HitchHall50.09Don't know why they covered the TE. Room under Hall so they use it, immediate tackle. Push all around.
O302nd5Offset Str4-2-5Nk Wide2 fldRPOSplit Stretch/SpacingHausmann111.32[H-Fly] The pass option is open too but he's reading a scrape exchange which is the scissors to split zone rock (RPS-1). Hausmann(-1) compounds by getting to the wrong side of the C whom he allowed to work down to him.
O411st10Pistol TTE5-2-45-2 Over2 fldRunF InsertBarham0-1.14[F-In] Scrape exchange (RPS+1) delivers Barham outside. D-Mo(+0.5) popped the RT back while coming in and Graham(+1) put a double into the path

In [11]:
# Let's try using the beautifulsoup get_text method to see if this cleans things up better
from bs4 import BeautifulSoup

new_processed_data = []
for x in processed_data:
    soup = BeautifulSoup(x.raw_html, 'html.parser')

    body_div = soup.find("article").find("div", class_="field--name-body")
    text = body_div.get_text(separator=' ', strip=True)

    new_processed_data.append({"url": x.url, "body": text})

In [64]:
# Boom
print(f"Percent of documents with new lines: {len([x for x in new_processed_data if '\n' in x["body"]])/len(new_processed_data)}")
print(f"Percent of documents with multiple spacings: {len([x for x in new_processed_data if '      ' in x["body"]])/len(new_processed_data)}")

Percent of documents with new lines: 0.01
Percent of documents with multiple spacings: 0.0


In [72]:
# Looks pretty good. Table data is still somewhat problematic but I think we can leave it in for now. 
import textwrap
print(textwrap.fill([x["body"] for x in new_processed_data if x["url"] == "https://www.mgoblog.com/content/upon-further-review-2024-defense-vs-northwestern"][0],width=40))

FORMATION NOTES: UFR Glossary is here .
Go Go is the unbalanced formation with
all the threats to one side. This
variant I called Go Go 2TE. Both teams
were pretty vanilla except Michigan
liked to walk down to a Cover Zero look
sometimes. [After THE JUMP: Forty-five
snaps against Northwestern, fwiw.] Ln Dn
Dst OForm DPack Front Hi Type Rush Play
Player Yds EPA O25 1st 10 Gun Str (Y)
4-2-5 4-3 Split 1 off PA 4 Hitch Hall 5
0.09 Don't know why they covered the TE.
Room under Hall so they use it,
immediate tackle. Push all around. O30
2nd 5 Offset Str 4-2-5 Nk Wide 2 fld RPO
Split Stretch/Spacing Hausmann 11 1.32
[H-Fly] The pass option is open too but
he's reading a scrape exchange which is
the scissors to split zone rock (RPS-1).
Hausmann(-1) compounds by getting to the
wrong side of the C whom he allowed to
work down to him. O41 1st 10 Pistol TTE
5-2-4 5-2 Over 2 fld Run F Insert Barham
0 -1.14 [F-In] Scrape exchange (RPS+1)
delivers Barham outside. D-Mo(+0.5)
popped the RT back while 

In [75]:
for x in processed_data:
    soup = BeautifulSoup(x.raw_html, 'html.parser')

    body_div = soup.find("article").find("div", class_="field--name-body")
    text = body_div.get_text(separator=' ', strip=True)

    x.body = text

## Preprocessing looks better, let's go back to Embedding/Search

In [8]:
from embed.common import chunker, embedder
from embed.service_layer.services import embed
from content_index.service_layer import services as content_index_services
from content_index.service_layer import unit_of_work as content_index_uow

In [36]:
ci_uow = content_index_uow.ChromaDBUnitOfWork(client_factory=chromadb_client_factory, content_collection_name="dev_mgoblog_content_embeddings")
ch = chunker.RecursiveCharacterTextChunker()
em = embedder.HuggingFaceInferenceAPIEmbedder(embedder.HuggingFaceInferenceAPIInputs(access_token="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"),model_endpoint="models/intfloat/multilingual-e5-large-instruct")

In [90]:
# Let's only embed a sample of the processed data for development purposes
embed_results = embed(ch, em, text_data=[x.body for x in processed_data[16:40]])

In [91]:
content_to_index = []
for i in range(0,len(embed_results)):
    item = embed_results[i]
    url = processed_data[item.index].url
    content_to_index.append({"id": f"{i}:{url}","url": url, "embedding": item.embedded_text, "text": item.text})

content_index_services.add_mgoblog_content(uow=ci_uow, data=content_to_index)

In [92]:
def user_query(user_query):
    embedded_text = embed(chunker=ch, embedder=em, text_data=[user_query])

    results = content_index_services.get_similar_mgoblog_content(uow=ci_uow, embeddings=[x.embedded_text for x in embedded_text], top_n_results=30)[0]
    full_results = []
    for result in results:
        full_results.append([result.url, result.text])
    
    return full_results

In [93]:
query_results = user_query("What is Makari Paige good at?")

In [95]:
# Paige is in a lot of these now!
len([x for x in query_results if 'Paige' in x[1]])

12

In [96]:
# Results are clearly so much more relevant than before, these are actual articles about football (for the most part) now
query_results

[['https://www.mgoblog.com/content/michigan-72-tarleton-state-49',
  "have the same dudes around him? The darkest interpretation is Paige is making business decisions with his draftable body versus a season with nothing to play for, except if any NFL GM is watching this they're going to pull his name off the board. Is there an injury? Something mental? The fact that it's showing up every time he goes to make a tackle makes me think injury. He's still reading plays better than anyone else on defense. Hopefully whatever's up with him gets healed/fixed/worked out"],
 ['https://www.mgoblog.com/content/michigan-72-tarleton-state-49',
  "starts that play not lined up because IU is going tempo and Paige's job is getting everybody lined up. He knows where he's supposed to be however and gets there. I mean he is THERE. And then he stops.This was the issue against MSU and though I haven't charted the Oregon game I already pulled a few examples from it while doing my rewatch. It's not just a tren

In [58]:
# Let's see how a hybrid approach would work

def keyword_search(query: str):
    all_data = content_index_services.list_mgoblog_content(ci_uow)
    corpus = [x.text for x in all_data]

    tokenized_corpus = [doc.split(" ") for doc in corpus]

    bm25 = BM25Okapi(tokenized_corpus)

    # inital_results = bm25.get_top_n(query.split(" "), corpus, 20)
    inital_results = bm25.get_scores(query)
    final_results = []
    for i in range(0,len(inital_results)):
        final_results.append({"id": all_data[i].id, "text": all_data[i].text, "keyword_score":inital_results[i]})

    return final_results

In [51]:
def vector_search(query: str):
    embedded_text = embed(chunker=ch, embedder=em, text_data=[query])

    client = chromadb_client_factory()
    repo_collection = client.get_collection("dev_mgoblog_content_embeddings")
    results = repo_collection.query(query_embeddings=[x.embedded_text for x in embedded_text],  n_results=100000)
    
    return results

In [105]:
import pandas as pd

query = "Did Bryce Underwood commit to Michigan?"

keyword_results = keyword_search(query)

vector_similarities = vector_search(query)

vector_results = []
for i in range(0, len(vector_similarities["ids"][0])):
    vector_results.append({"id": vector_similarities['ids'][0][i], "distances": vector_similarities['distances'][0][i]})

for keyword_result in keyword_results:
    keyword_result["semantic_distance"] = [x["distances"] for x in vector_results if x["id"] == keyword_result["id"]][0]

df = pd.DataFrame(keyword_results)

In [106]:
# Pretty good!
for i in range(0,5):
    print(df[df['keyword_score'] < 2].sort_values("semantic_distance").iloc[i].text)

So, uh, this happened: Breaking: Bryce Underwood, ESPN’s No. 1 overall recruit in the Class of 2025, is flipping his commitment to the Michigan Wolverines, he told school officials today.It’s a big day for @umichfootball, HC Sherrone Moore and @champcircleuofm as they land the highest-rated recruit… pic.twitter.com/4afP0z2Tny— Adam Schefter (@AdamSchefter) November 21, 2024After weeks going on months of fervent speculation, 5* QB Bryce Underwood from Belleville HS has flipped his commitment from
class and this next era of college football. With Underwood now on board, I would look at WR Derek Meadows (borderline top 100 recruit committed to LSU) and ATH Bradley Gompers (borderline top 200 recruit committed to Duke) as players Michigan will try to flip in the aftermath of the Underwood commitment. A lot more will be written about Underwood on this site in the coming days. Until then, savor this shocking recruiting coup. There is no content after the jump.
Law Grad, Venue by 4M, Winewood

This obviously isn't a perfect setup right now, but it's something to start with. 

In future iterations, I'll look to use the title, author and date created to further filter the initial results.