# RAG exploration

![goal_post](goal_post.png)

## 🤖 Social assistant, with an **off-the-shelf model**

In [9]:
url='https://microsoftfabric.devpost.com/'

In [10]:
import requests
from bs4 import BeautifulSoup
import json

# Fetch the page at the URL; fail fast on network or HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Extract all text on the page (navigation and footer text included)
blog_post_content = soup.get_text()

# Wrap the text in a JSON string so it can be embedded in the prompt
content = json.dumps({"content": blog_post_content})

print(content)

{"content": "\n\n\n\n  \n\n\n\n\n\n\n\n\n\n\nMicrosoft Fabric and AI Learning Hackathon: Building the next wave of innovative AI powered data analytics applications with Microsoft Fabric - Devpost\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n      Log in\n \n\n\n\n        Sign up\n      \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDevpost\n\n\nHackathons\nProjects\nHost a public hackathon\n\n\n\n\n\n\nDevpost for Teams\n\n\nTeams login\nRequest a demo\n\n\n\n\n\n    Hackathons\n\n\n\n\n    Projects\n\n\n\nBlog\n\n\n\n\n    Host a hackathon\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProduct\n\n\n\n\n\n\n\nDevpost\nGrow your developer ecosystem and promote your platform.\n\n\n\nHackathons\nProjects\nHost a public hackathon\n\n\n\n\n\n\nDevpost for Teams\nDrive innovation, collaboration, and retention within your organization.\n\n\n\nTeams login\nRequest a demo\n\n\n\n\n\n\n\n\n\n    Hackathons\n\n\n\n\n    Projects\n
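
The printed text above is dominated by runs of newlines and navigation labels, which take up prompt space without adding information. A minimal sketch of one way to compact it before building the prompt, assuming a hypothetical helper named `compact_text` (not part of the original notebook) that would be applied to `blog_post_content`:

```python
def compact_text(text: str) -> str:
    """Strip each line and drop the empty ones (hypothetical helper)."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Toy input that mimics the whitespace pattern in the scraped page
sample = "\n\n  \n      Log in\n \n\n        Sign up\n      \n"
print(compact_text(sample))
```

This keeps one item per line, so headings and menu entries stay separated while the blank runs disappear.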

In [11]:
# Messages to give LLM, to create a short LinkedIn post based on a blog post

system_message = """
You are a social assistant who writes creative content. You will politely decline any other requests from the user not related to creating content. Don't talk about a single VS Code release and don't talk about release dates at all. Instead, only talk about the relevant features. Don't include made up links, but do provide real links to the VS Code release notes for specific features. You format all your responses as Markdown unless otherwise specified. Avoid wrapping your entire response in a markdown code element.
"""
messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": f"Create a very short LinkedIn post using the following: {content}"}
]

In [12]:
import os
from openai import OpenAI

token = os.environ["GITHUB_TOKEN"]
endpoint = "https://models.inference.ai.azure.com"
model_name = "gpt-4o-mini"

client = OpenAI(
    base_url=endpoint,
    api_key=token,
)

response = client.chat.completions.create(
    messages=messages,
    temperature=1.0,
    top_p=1.0,
    max_tokens=1000,
    model=model_name
)

print(response.choices[0].message.content)

🌟 Excited to announce the **Microsoft Fabric and AI Learning Hackathon**! Join us in building the future of AI-powered data analytics applications using Microsoft Fabric. 

With $10,000 in prizes and a chance to showcase your creativity, this is more than just a competition—it's an opportunity to learn and innovate alongside fellow tech enthusiasts. 

🗓️ Deadline: November 12, 2024, at 5 PM PST

Whether you're a database expert or an AI enthusiast, bring your skills to the table and help shape tomorrow's technology! 

👉 For more details, visit [Devpost](https://devpost.com). Let's create something amazing! 🚀 #MicrosoftFabric #AIHackathon #Innovation


## 📚 Text search

#### Extract key topics & features

In [13]:
# Messages to give LLM, to extract key topics & features

topic_system_message = """
You are an expert at conducting entity extraction. Generate top topics and functionality based on provided content. Focus on identifying key concepts, themes, and relevant terms related to specific developer tooling, with a particular emphasis on VS Code features. Make sure entities you extract are directly relevant to the developer environment described. Don't mention specific dates or years. Use advanced search techniques, including Boolean operators and keyword variations, to craft precise, optimized queries that yield the most relevant results. Aim for clarity, relevance, and depth to cover all aspects of the topic efficiently. Simply list the phrases without additional explanation or details. Do not list any bullet points or numbered lists or quotation marks.
"""

topic_user_message="Come up with a list of top 5 developer tooling topics, functionalities, and relevant terms, with a strong focus on VS Code features and integrations based on the following content: "

In [14]:
def extract_key_topics(content, model="gpt-4o-mini"):
    messages = [
        {"role": "system", "content": topic_system_message},
        {"role": "user", "content": topic_user_message+content}
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.3,
    )

    key_topics = response.choices[0].message.content.split('\n')
    return key_topics

key_topics = extract_key_topics(content)
print("\n".join([topic + "\n" for topic in key_topics]))

Microsoft Fabric integration  

AI-powered data analytics  

Real-Time Intelligence in Microsoft Fabric  

Azure OpenAI services  

SQL and AI integration in developer tools  
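
The reply above carries trailing spaces, and splitting on `\n` can leave empty strings in `key_topics`, which later turn into empty query tokens. A small sketch of a more defensive parser, assuming a hypothetical `parse_topics` helper that could stand in for the plain `split('\n')`:

```python
import re


def parse_topics(raw: str) -> list[str]:
    """Split an LLM reply into clean topic strings (hypothetical helper)."""
    topics = []
    for line in raw.splitlines():
        # Remove list markers such as "- ", "* " or "1. " plus surrounding spaces
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", line).strip()
        if cleaned:
            topics.append(cleaned)
    return topics


# Toy reply with trailing spaces, a blank line, and a stray bullet
reply = "Microsoft Fabric integration  \nAI-powered data analytics  \n\n- Azure OpenAI services\n"
print(parse_topics(reply))
```

The bullet stripping is a safeguard only; the system prompt already asks the model not to return lists.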



#### Load & filter data

In [15]:
# Load split_docs_contents.json as a DataFrame
import pandas as pd
df = pd.read_json('split_docs_contents.json')
df.head()

Unnamed: 0,content,url
0,See what is new in the Visual Studio Code July...,https://code.visualstudio.com/updates/July_201...
1,See what is new in the Visual Studio Code July...,https://code.visualstudio.com/updates/July_201...
2,See what is new in the Visual Studio Code July...,https://code.visualstudio.com/updates/July_201...
3,See what is new in the Visual Studio Code July...,https://code.visualstudio.com/updates/July_201...
4,See what is new in the Visual Studio Code July...,https://code.visualstudio.com/updates/July_201...


In [16]:
# Keep only rows whose 'content' mentions 2023 or 2024
df_clean = df[(df['content'].str.contains("2023", regex=False, na=False)) | (df['content'].str.contains("2024", regex=False, na=False))]

#### Perform text search

In [17]:
from rank_bm25 import BM25Okapi
import pandas as pd

def bm25_search(df_clean, key_topics, top_n=10):
    # Tokenize the content of each document
    tokenized_corpus = [doc.split(" ") for doc in df_clean['content']]
    
    # Initialize BM25
    bm25 = BM25Okapi(tokenized_corpus)
    
    # Combine key topics into a single query string
    query = " ".join(key_topics)
    tokenized_query = query.split(" ")
    
    # Get BM25 scores for the query
    scores = bm25.get_scores(tokenized_query)
    
    # Get the indices of the top_n documents
    top_n_indices = scores.argsort()[-top_n:][::-1]
    
    # Retrieve the top_n documents
    top_n_docs = df_clean.iloc[top_n_indices]
    
    return top_n_docs

# Perform the search and output the top 10 documents
top_10_docs = bm25_search(df_clean, key_topics, top_n=10)
print(top_10_docs)

                                                content  \
4088  Learn what is new in the Visual Studio Code Au...   
3803  Learn what is new in the Visual Studio Code Fe...   
4153  Learn what is new in the Visual Studio Code Se...   
4181  Learn what is new in the Visual Studio Code Oc...   
3378  Learn what is new in the Visual Studio Code Ap...   
3702  Learn what is new in the Visual Studio Code No...   
3372  Learn what is new in the Visual Studio Code Ap...   
3244  Learn what is new in the Visual Studio Code Ja...   
3607  Learn what is new in the Visual Studio Code Au...   
3787  Learn what is new in the Visual Studio Code Fe...   

                                                    url  
4088  https://code.visualstudio.com/updates/v1_93#re...  
3803  https://code.visualstudio.com/updates/v1_87#fo...  
4153  https://code.visualstudio.com/updates/v1_94#py...  
4181  https://code.visualstudio.com/updates/v1_95#co...  
3378  https://code.visualstudio.com/updates/v1_78#au...  
37
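
To make the ranking step less of a black box, here is a self-contained sketch of BM25 scoring in plain Python. It is not the `rank_bm25` implementation: the IDF variant, the `k1` and `b` values, and the helper names `tokenize` and `bm25_scores` are assumptions chosen for illustration. It also lowercases and strips punctuation, whereas the `split(" ")` tokenization used above is case-sensitive and keeps punctuation attached to words.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep word characters only, so "SQL," matches "sql"
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Non-negative IDF variant; rank_bm25 may compute IDF differently
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more hits to score the same
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus for illustration only
docs = [
    "Run tests with coverage in Python",
    "Renamed SQL to MS SQL",
    "Notebook diff viewer improvements",
]
scores = bm25_scores("sql integration", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the second document contains a query term, so it is the only one with a non-zero score.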

## 🔢 Semantic reranking

In [18]:
# Messages to give LLM, to re-rank the documents based on semantic relevance

rerank_system_message = """
You are tasked with re-ranking a set of documents based on their relevance to given search queries. The documents have already been retrieved based on initial search criteria, but your role is to refine the ranking by considering factors such as semantic similarity to the query, context relevance, and alignment with the user's intent. Focus on documents that provide concise, high-quality information, ensuring that the top-ranked documents answer the query as accurately and completely as possible. If you can't rank them based on semantic relevance, give higher rank to documents with VS Code features that were published most recently. Make sure to return the full name of the feature and URL of each release note document, and format your response as a Markdown list item, with the URL in parentheses. Do not include any additional information or commentary about the documents. List a variety of documents, and give more weight to documents that mention Python and/or notebook features. Only return the top 3 documents and reference them by the feature name, not the release version or date.
"""

rerank_user_message=f"Here are some documents: {top_10_docs.to_json(orient='records')}. Re-rank those documents based on these key VS Code functionalities: {key_topics}. Only return the top 3 documents."

In [19]:
def rerank_documents(model="gpt-4o-mini"):
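    # Truncate the user message to a character budget, a rough proxy for the token limit.
    # The instructions come after the document JSON, so they are cut off if the JSON is too long.
    max_length = 7500  # Adjust this value as needed to fit within the token limit
    truncated_user_message = rerank_user_message[:max_length]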

    messages = [
        {"role": "system", "content": rerank_system_message},
        {"role": "user", "content": truncated_user_message}
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )

    reranked_documents = response.choices[0].message.content.split('\n')
    return reranked_documents

reranked_documents = rerank_documents()
print("\n".join([doc + "\n" for doc in reranked_documents]))


- Run tests with coverage in Python (https://code.visualstudio.com/updates/v1_94#run-tests-with-coverage)

- Translate your strings using Azure AI Translator (https://code.visualstudio.com/updates/v1_87#for-extension-authors:-preview-of-`@vscode\/l10n-dev`-and-azure-ai-translator)

- Renamed SQL to MS SQL (https://code.visualstudio.com/updates/v1_93#renamed-\"sql\"-to-\"ms-sql\")
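
Because the user message places the document JSON before the instructions, slicing the whole string to `max_length` can cut off the instructions and leave incomplete JSON. One possible alternative, sketched here with a hypothetical `build_rerank_message` helper, is to add whole records until a character budget is reached and only then append the instructions:

```python
import json


def build_rerank_message(records, key_topics, max_chars=7500):
    """Fit as many whole records as the budget allows, then add the instructions."""
    prefix = "Here are some documents: "
    suffix = (
        ". Re-rank those documents based on these key VS Code functionalities: "
        f"{key_topics}. Only return the top 3 documents."
    )
    # Characters left for the JSON list once the fixed text is accounted for
    budget = max_chars - len(prefix) - len(suffix)

    kept = []
    for record in records:
        if len(json.dumps(kept + [record])) > budget:
            break
        kept.append(record)
    return prefix + json.dumps(kept) + suffix


# Toy records standing in for the retrieved documents
records = [{"content": "x" * 300, "url": f"https://example.com/{i}"} for i in range(10)]
message = build_rerank_message(records, ["a", "b"], max_chars=1000)
print(len(message), message.count("https://example.com/"))
```

In the notebook, `records` could plausibly come from `top_10_docs.to_dict(orient='records')`. Long `content` fields may still need shortening per record so that more than a couple of documents fit.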



## 🧠 Social assistant, with **relevant features**

In [21]:
def generate_llm_answer(content, context, completion_model):
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content":  f"Create a very short LinkedIn post using the following content: {content}. Also, include the following established VS Code features along with their URLs in your response, so folks seeing the post can try them out: {context}."}
    ]

    response = client.chat.completions.create(
        model=completion_model,
        messages=messages,
        temperature=0.3
    )

    answer = response.choices[0].message.content
    return answer

print(generate_llm_answer(content, reranked_documents, completion_model="gpt-4o-mini"))

🚀 Exciting news! Join the **Microsoft Fabric and AI Learning Hackathon** and be part of building the next wave of innovative AI-powered data analytics applications. This is a fantastic opportunity to leverage Microsoft Fabric and Azure OpenAI services. 

🗓️ **Deadline:** Nov 12, 2024  
💰 **Prizes:** $10,000 in total!

Whether you're an AI enthusiast or a cloud computing expert, this hackathon is your platform to showcase your skills and creativity. 

For more details and to join, check out the hackathon on [Devpost](https://devpost.com).

While you're at it, enhance your coding experience with these amazing VS Code features:
- Run tests with coverage in Python: [Learn more](https://code.visualstudio.com/updates/v1_94#run-tests-with-coverage)
- Translate your strings using Azure AI Translator: [Explore here](https://code.visualstudio.com/updates/v1_87#for-extension-authors:-preview-of-`@vscode\\/l10n-dev`-and-azure-ai-translator)
- Renamed SQL to MS SQL: [Find out more](https://code.vis

#### Compare responses between chat models

In [22]:
print(generate_llm_answer(content, reranked_documents, completion_model="Mistral-small"))

📣 **Microsoft Fabric and AI Learning Hackathon** 📣

Join us in building the next wave of innovative AI-powered data analytics applications with Microsoft Fabric! This is your chance to showcase your skills and creativity, whether you're an AI enthusiast, a cloud computing expert, or a database guru.

Microsoft Fabric is an integrated platform that combines data engineering, data warehousing, and data science in a seamless environment. With built-in AI capabilities, it allows users to automate processes, gain deeper insights, and accelerate decision-making.

To get started, we're offering free access to Microsoft Fabric and an Azure OpenAI Proxy service. Plus, we're hosting live informational sessions and workshops to help expand your knowledge.

**Hackathon Categories:**
1. Microsoft Fabric + AI Innovation
2. Real-Time Intelligence in Microsoft Fabric
3. Azure Database for PostgreSQL Integration
4. SQL And AI Integration
5. Azure Cosmos DB + Microsoft Fabric Integration

**Prizes:** $1

In [24]:
print(generate_llm_answer(content, reranked_documents, completion_model="meta-llama-3-8b-instruct"))

Here's a LinkedIn post based on the content you provided:

**Microsoft Fabric and AI Learning Hackathon: Building the next wave of innovative AI powered data analytics applications**

Are you ready to dive deep into the future of AI and cloud innovation? Join the Microsoft Fabric and AI Learning Hackathon, a unique opportunity to showcase your skills and creativity in building innovative AI powered data analytics applications!

**What to Build**

Complete the Microsoft Learn AI Skills Challenge (Microsoft Fabric) and build a new Fabric solution that leverages Azure OpenAI services and falls into one of the following hackathon categories:

* Microsoft Fabric + AI Innovation
* Real-Time Intelligence in Microsoft Fabric
* Azure Database for PostgreSQL Integration
* SQL And AI Integration
* Azure Cosmos DB + Microsoft Fabric Integration

**What to Submit**

* Provide a URL to your code repository for judging and testing
* Include a video (about 3-5 minutes) that demonstrates your submissio
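
The side-by-side comparison above can also be run in a loop. A minimal sketch, assuming a hypothetical `compare_models` helper that takes any callable mapping a model name to a reply; `fake_generate` is a stand-in so the example runs without network access:

```python
from typing import Callable


def compare_models(generate: Callable[[str], str], model_names: list[str]) -> dict[str, str]:
    """Call `generate` once per model; record errors instead of stopping."""
    results = {}
    for name in model_names:
        try:
            results[name] = generate(name)
        except Exception as exc:  # e.g. an unavailable model or a rate limit
            results[name] = f"ERROR: {exc}"
    return results


# Stand-in generator for demonstration; in the notebook this could be
# lambda m: generate_llm_answer(content, reranked_documents, completion_model=m)
def fake_generate(model_name: str) -> str:
    if model_name == "broken-model":
        raise RuntimeError("model not available")
    return f"post written by {model_name}"


results = compare_models(fake_generate, ["gpt-4o-mini", "Mistral-small", "broken-model"])
for name, text in results.items():
    print(f"--- {name} ---\n{text}\n")
```

Collecting the replies in a dictionary keeps one failing model from hiding the others and makes the outputs easy to diff.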

#### Try it out yourself! 😄

VS Code extensions used for this demo:

![extensions](vsc_extensions.png)