# Day 1 - Exercise 1 

*Web Scraping and Summarization Exercise*

### Objective

This notebook scrapes the GitHub repositories listed in `repos.txt`, fetches their README.md files, and summarizes them with an LLM. System and user prompts guide the summaries. Rather than screen-scraping websites, the idea is to have an LLM summarize a code repository, currently limited to the repos defined in `repos.txt`. It could be extended into a search tool that finds Git repos on a topic (e.g. LLMs) and retrieves summarized information from the README, CONTRIBUTORS, and other metadata. This is only day 1, so I won't get too carried away.


In [None]:
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI()
clientModel = "gpt-4.1-mini"

### GitHub README Fetching and Display Utilities

This section defines helper functions to extract repository information, fetch README files from GitHub, and display summaries in Jupyter notebooks.

In [None]:
import requests
from urllib.parse import urlparse
from IPython.display import Markdown, display

# Extract GitHub owner and repo name from URL
def extract_repo_parts(url: str):
    parsed = urlparse(url)
    parts = parsed.path.strip('/').split('/')
    if len(parts) < 2:
        raise ValueError(f'Invalid GitHub repo URL: {url}')
    return parts[0], parts[1]

# Fetch README.md from main or master branch of GitHub repo
def fetch_readme(owner: str, repo: str):
    branches = ['main', 'master']
    for branch in branches:
        raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/README.md"
        # A timeout keeps the notebook from hanging on a stalled connection
        r = requests.get(raw_url, timeout=10)
        if r.status_code == 200:
            return r.text
    raise FileNotFoundError(f'README.md not found on main or master branch for {owner}/{repo}')

# Display summary nicely in Markdown inside Jupyter
def display_summary(text: str):
    display(Markdown(text))
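
The URL handling above can be tried without any network access. The following is a minimal sketch, not the notebook's own helper: `parse_repo_url` and `candidate_readme_urls` are hypothetical names, and stripping a trailing `.git` is an assumption about how clone-style URLs should be treated.

```python
from urllib.parse import urlparse


def parse_repo_url(url: str) -> tuple[str, str]:
    """Return (owner, repo) taken from the first two path segments of a GitHub URL."""
    parts = urlparse(url.strip()).path.strip("/").split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"Invalid GitHub repo URL: {url}")
    owner, repo = parts[0], parts[1]
    # Assumption: clone-style URLs ending in ".git" refer to the bare repo name
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


def candidate_readme_urls(owner: str, repo: str, branches=("main", "master")) -> list[str]:
    """Build the raw README.md URLs to try, in order, one per branch."""
    return [
        f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/README.md"
        for branch in branches
    ]


owner, repo = parse_repo_url("https://github.com/psf/requests.git")
for url in candidate_readme_urls(owner, repo):
    print(url)
```

Anything after the second path segment (for example `/tree/main/docs`) is ignored, so deep links into a repository still resolve to its owner and name.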



### Define the prompts for the LLM

- **System Prompt:** Sets the role and behavior of the model as a "philosophical, humorous software engineer".
- **User Prompt:** Contains the README content and requests a summary.

Using system + user prompts allows us to control tone and style.

In [None]:
# Define system prompt (role, tone, behavior)
system_prompt = """
You are a philosophical new age software engineer that analyzes the contents of GitHub repositories,
and provides a critical thinking, open-minded and humorous summary.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""

# Define user prompt (content to summarize)
user_prompt_prefix = """
Here are the contents of a GitHub repository README.
Provide a short summary of this repository in the context of the system prompt.
"""


### This function sends the README content to the OpenAI API along with the prompts:

1. Combines the user prompt with README text.
2. Calls `client.chat.completions.create` with system and user prompts.
3. Displays the summary in Markdown under the repository name.
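
The three steps above can be exercised offline by passing the model call in as a function. This is a rough sketch under that assumption: `build_messages`, `summarize`, and `fake_complete` are hypothetical names, the prompts are shortened placeholders, and the stub stands in for the real API call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_messages(system_prompt: str, user_prompt_prefix: str, readme_text: str) -> List[Message]:
    """Step 1: combine the user prompt prefix with the README text."""
    user_prompt = user_prompt_prefix + "\n\n" + readme_text
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def summarize(readme_text: str, repo_name: str, complete: Callable[[List[Message]], str]) -> str:
    """Steps 2 and 3: get a summary and format it under the repository name."""
    messages = build_messages("You summarize READMEs.", "Summarize this README.", readme_text)
    summary = complete(messages)
    return f"### {repo_name}\n{summary}"


def fake_complete(messages: List[Message]) -> str:
    # Hypothetical stand-in for the model call; returns a canned string
    return f"(stub summary of a {len(messages[-1]['content'])}-character prompt)"


print(summarize("# Demo\nA tiny project.", "owner/demo", fake_complete))
```

Injecting the completion function keeps the prompt assembly testable without an API key; in the notebook the real client call would take the place of `fake_complete`.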


In [None]:
def summarize_readme(readme_text: str, repo_name: str):
    
    # Combine user prompt prefix with the README content
    user_prompt = user_prompt_prefix + "\n\n" + readme_text
    
    # Call the OpenAI Chat Completions API
    response = client.chat.completions.create(
        model=clientModel,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    summary = response.choices[0].message.content
    display_summary(f"### {repo_name}\n{summary}")

### Load a list of GitHub repository URLs from `repos.txt`.

For each URL:
* Extract owner/repo.
* Fetch README.
* Generate and display summary.
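
As an illustration of the input this loop expects, here is a small sketch that parses the text of a `repos.txt`-style file. It assumes one URL per line, with blank lines and `#` comment lines skipped; `parse_repo_list` and the sample URLs are hypothetical.

```python
def parse_repo_list(text: str) -> list[str]:
    """Return the repository URLs in text, skipping blank and comment lines."""
    urls = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Strip first so that indented comments are skipped as well
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


sample = """
# Repositories to summarize
https://github.com/psf/requests

   # an indented comment
https://github.com/pallets/flask
"""

for url in parse_repo_list(sample):
    print(url)
```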

In [None]:
# Read repo URLs from repos.txt
with open("repos.txt", "r") as f:
    # Skip blank lines and comment lines, including comments with leading whitespace
    repo_urls = [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]

# Fetch and summarize each README
for repo_url in repo_urls:
    try:
        owner, repo = extract_repo_parts(repo_url)
        readme_text = fetch_readme(owner, repo)
        repo_name = f"{owner}/{repo}"
        summarize_readme(readme_text, repo_name)
    except Exception as e:
        display_summary(f"**Error processing {repo_url}: {e}**")
