# End of week 1 exercise

To demonstrate your familiarity with the OpenAI API and with Ollama, build a tool that takes a technical question  
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [1]:
# imports
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from openai import OpenAI


In [2]:
# Define the models to use
MODEL_GPT = 'gpt-4o-mini'
MODEL_LLAMA = 'llama3.2'

# Load the API key from the environment (.env file)
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check that the API key looks plausible (format check only, no API call is made)
if api_key and api_key.startswith('sk-') and len(api_key) > 10:
    print("API key found")
else:
    print("No API key found, or it does not look valid")

# Create an instance of the OpenAI client
openai_client = OpenAI(api_key=api_key)
print("OpenAI client instance created")

# Create an instance of the Ollama client via its OpenAI-compatible endpoint
# (the api_key value is a placeholder that the client requires)
OLLAMA_BASE_URL = "http://localhost:11434/v1"
ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')
print("Ollama client instance created")


API key found
OpenAI client instance created
Ollama client instance created
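
The key check in the cell above can be pulled into a small helper so it can be reused or tested on its own. This is a minimal sketch: the name `key_looks_valid` and its parameters are assumptions for illustration, and the check only looks at the key's format, not whether the API accepts it.

```python
def key_looks_valid(key, prefix="sk-", min_length=10):
    """Return True if key is a string with the expected prefix and a plausible length.

    Format check only: it cannot tell whether the API will accept the key.
    """
    return isinstance(key, str) and key.startswith(prefix) and len(key) > min_length


print(key_looks_valid("sk-example-not-a-real-key"))  # True
print(key_looks_valid(None))                         # False
```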


In [None]:
# requests.get("http://localhost:11434").content

In [None]:
# !ollama pull llama3.2

In [3]:
# Define the system prompt and the user prompt (the question to explain)
question_system_prompt = """
You are a helpful assistant that can explain technical concepts in a way that is easy to understand even for non-technical people.
With the help of the user, you will be able to explain the concept in a way that is easy to understand.
"""
# Raw string, so the "\n" escapes inside the quoted code reach the model unchanged
question_user_prompt = r"""
Please explain what this code does and why:

def fetch_website_contents(url):
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    title = soup.title.string if soup.title else "No title found"
    if soup.body:
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        text = soup.body.get_text(separator="\n", strip=True)
    else:
        text = ""
    return (title + "\n\n" + text)[:2_000]
"""


# question_user_prompt = """
# Please explain what the solution to this math question and also the reasoning behind it:

# You toss 2 coins. One of them is heads. What's the probability the other is tails? Answer with the probability only
# """
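
One detail worth knowing when pasting code into a prompt string: in a regular triple-quoted string, Python interprets escape sequences such as `\n`, so the model receives a real line break inside the quoted code. The sketch below compares a regular and a raw string; the variable names are illustrative only.

```python
# Regular triple-quoted string: "\n" is interpreted as a real newline character.
plain_prompt = """
text = soup.body.get_text(separator="\n", strip=True)
"""

# Raw triple-quoted string: the backslash and the "n" are kept as two characters.
raw_prompt = r"""
text = soup.body.get_text(separator="\n", strip=True)
"""

print(len(plain_prompt.splitlines()))  # 3: the code line was split in two
print(len(raw_prompt.splitlines()))    # 2: the code line is intact
```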

In [4]:
# Get gpt-4o-mini to answer, with streaming
stream = openai_client.chat.completions.create(
    model = MODEL_GPT,
    messages = [
        {"role": "system", "content": question_system_prompt},
        {"role": "user", "content": question_user_prompt}
    ],
    stream=True
)

response = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    response += chunk.choices[0].delta.content or ''
    update_display(Markdown(response), display_id=display_handle.display_id)

This piece of code is a Python function named `fetch_website_contents`. Let's break down what it does step-by-step and explain the reasoning behind each part.

### What the Code Does:

1. **Function Definition**: 
   The function `fetch_website_contents` accepts a single argument called `url`, which is expected to be a string representing the web address of a website.

2. **Sending a Request**:
   ```python
   response = requests.get(url, headers=headers)
   ```
   Here, the code sends a GET request to the specified URL using the `requests` library. This means it asks the server hosting that webpage to send back the page's content. The `headers=headers` part allows you to specify additional information (like a user-agent) with the request, potentially helping you avoid being blocked or to mimic a regular browser.

3. **Parsing HTML**:
   ```python
   soup = BeautifulSoup(response.content, "html.parser")
   ```
   The response content (the HTML of the webpage) is then parsed using `BeautifulSoup` from the `bs4` library. BeautifulSoup makes it easier to work with HTML documents in Python.

4. **Extracting the Title**:
   ```python
   title = soup.title.string if soup.title else "No title found"
   ```
   The code attempts to get the title of the webpage from the `<title>` tag in the HTML. If the title is found, it takes the text inside it; otherwise, it returns "No title found." This is a common part of a webpage, making it easy for users to identify what the page is about in browser tabs.

5. **Cleaning Up the Text**:
   ```python
   if soup.body:
       for irrelevant in soup.body(["script", "style", "img", "input"]):
           irrelevant.decompose()
       text = soup.body.get_text(separator="\n", strip=True)
   else:
       text = ""
   ```
   In this part:
   - The function checks if the `<body>` of the webpage exists (which contains the main content).
   - It looks for certain types of tags that are often unnecessary for the basic reading of content (like `<script>`, `<style>`, `<img>`, and `<input>`) and removes them using `decompose()`. This helps in cleaning up the text to ensure only relevant content is kept.
   - After that, it extracts the remaining text from the body with `get_text()`, which gathers all the text within the body and can optionally format it (e.g., adding new lines).

6. **Returning the Combined Result**:
   ```python
   return (title + "\n\n" + text)[:2_000]
   ```
   Finally, the function combines the title and the cleaned text, adds a newline in between for separation, and returns the first 2,000 characters. This is useful for ensuring that the output is not excessively long and remains readable.

### Why This Code is Useful:

- **Website Scraping**: This function is useful for scraping or gathering the content from web pages, which can be valuable for data analysis, research, or creating summaries of web articles.
- **Content Cleaning**: By removing scripts, styles, images, and other irrelevant tags, it ensures that the output is focused on the actual readable content for better usability.
- **Basic Error Handling**: The code includes basic checks (like if a title exists or if the body is present), making it more robust against various types of web pages.

In summary, this function efficiently retrieves and cleans the textual content from a specified web page, making it suitable for further analysis or presentation.
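
The accumulation loop in the streaming cell can be factored into a small generator, which also makes it possible to try the logic without calling the API. This is a sketch under assumptions: `accumulate_stream` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the `choices[0].delta.content` shape used in the cell above.

```python
from types import SimpleNamespace


def accumulate_stream(chunks):
    """Yield the text received so far after each chunk of a streamed chat completion."""
    text = ""
    for chunk in chunks:
        # A chunk can arrive with no choices, or with delta.content set to None.
        if not chunk.choices:
            continue
        text += chunk.choices[0].delta.content or ""
        yield text


def fake_chunk(content):
    """Hypothetical stand-in with the same shape as a streamed chunk, for offline use."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("Hello"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk(", world"),
]
snapshots = list(accumulate_stream(fake_stream))
print(snapshots[-1])  # Hello, world
```

In the notebook, each yielded value could be passed straight to `update_display(Markdown(text), display_id=display_handle.display_id)`.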

In [5]:
# Get Llama 3.2 to answer, using the ollama client created above (no streaming)
response = ollama.chat.completions.create(
    model = MODEL_LLAMA,
    messages = [
        {"role": "system", "content": question_system_prompt},
        {"role": "user", "content": question_user_prompt}
    ]
)
result = response.choices[0].message.content
display(Markdown(result))

Let's break down what this code does.

**Functionality**

This is a Python function called `fetch_website_contents` that takes a single argument: the URL of a website. The function retrieves the content of the webpage, parses it, and then extracts specific information from the parsed HTML.

**Here's what happens step by step:**

1. **Send an HTTP request**: The function uses the `requests` library to send a GET request to the specified URL.
2. **Get the HTML response**: The response from the server is stored in the `response` variable.
3. **Parse the HTML content with BeautifulSoup**: The code uses the `BeautifulSoup` library to parse the HTML content of the webpage. This creates a tree-like structure that can be easily navigated and searched.
4. **Extract the title**: The function extracts the title element from the parsed HTML and assigns it to the `title` variable. If no title is found, an empty string is assigned instead.
5. **Remove unwanted elements**: The code removes any script, style, or image elements from the HTML content using the `decompose()` method. This is done to prevent these elements from affecting the content extraction process.
6. **Get the text content**: The function extracts the text content of the HTML content using the `get_text()` method. This method includes the separator between lines of text, which can be useful later.
7. **Remove extra spaces and return limited content**: Finally, the function combines the title and extracted text content into a single string and limits it to 2,000 characters.

**Why do these specific things?**

The code is designed to extract specific information from a webpage:

* The title element provides basic metadata about the page.
* By removing unwanted elements (script, style, image) and extracting only the text content, we can get rid of any non-intrusive elements that might make it harder to understand what's on the page.

The code is also designed to preserve a small portion of the webpage's text content. The limited return value (`[:2_000]`) allows us to capture a reasonable amount of text without getting caught up in too much detail or overhead.

**Security note**, this function doesn't include any check for the existence of the website, or if it's not accessible or running properly.

Let me know if you have any more questions about what code does, and especially how does it works
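
The two cells above send the same messages and differ only in the client and the model name, so the call can be wrapped in one function to make the tool reusable. This is a minimal sketch rather than a finished tool: `explain` is a hypothetical helper, and `FakeCompletions` is a stand-in used only so the example runs without an API key or a local Ollama server.

```python
from types import SimpleNamespace


def explain(client, model, system_prompt, question):
    """Send one question to an OpenAI-compatible client and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


class FakeCompletions:
    """Hypothetical stand-in for client.chat.completions, for running offline."""

    def create(self, model, messages):
        reply = f"[{model}] received {len(messages)} messages"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
print(explain(fake_client, "llama3.2", "You explain code.", "What does yield do?"))
# [llama3.2] received 2 messages
```

With the real clients from this notebook, the same function would be called as `explain(openai_client, MODEL_GPT, question_system_prompt, question_user_prompt)` or `explain(ollama, MODEL_LLAMA, question_system_prompt, question_user_prompt)`.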