✅ 🧠 Code complet : Scraper une page web, extraire les liens pertinents avec LLaMA 3.2

In [1]:
# imports
import os
import requests
import json
import re
from typing import List
from bs4 import BeautifulSoup
import ollama
from urllib.parse import urljoin

In [2]:
# Headers pour éviter les blocages par certains sites
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}



In [3]:

# Classe Website : pour extraire le contenu, le titre et les liens
class Website:
    def __init__(self, url):
        self.url = url
        try:
            response = requests.get(url, headers=headers, timeout=5)
            response.raise_for_status()
            self.body = response.content
            soup = BeautifulSoup(self.body, 'html.parser')
            self.title = soup.title.string.strip() if soup.title else "No title found"

            if soup.body:
                for irrelevant in soup.body(["script", "style", "img", "input"]):
                    irrelevant.decompose()
                self.text = soup.body.get_text(separator="\n", strip=True)
            else:
                self.text = ""

            # Extraire tous les liens, relatifs et absolus
            links = [link.get('href') for link in soup.find_all('a')]
            self.links = [
                urljoin(self.url, link)
                for link in links
                if link and not link.startswith("mailto:") and not link.startswith("javascript:")
            ]
        except Exception as e:
            print(f"Error loading {url}: {e}")
            self.title = "Error"
            self.text = ""
            self.links = []

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n"
        #▶️ Elle retourne une chaîne lisible contenant le titre et le texte principal de la page.



### ▶️ Rôle :

Cette classe représente un site web donné et permet :

* de récupérer le **titre** de la page,
* le **texte nettoyé** (sans scripts, images, etc.),
* et **tous les liens** présents dans la page.

### 📦 Attributs :

* `self.url` : l'URL du site.
* `self.body` : le code HTML brut de la page.
* `self.title` : le titre de la page (`<title>` dans le HTML).
* `self.text` : le texte visible (nettoyé).
* `self.links` : la liste des liens cliquables (`<a href="...">`), convertis en **liens absolus** avec `urljoin`.

### 🔁 Méthode :

```python
def get_contents(self):
    return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n"
```

▶️ Elle retourne une **chaîne lisible** contenant le **titre** et le **texte** principal de la page.



In [4]:

# Fonction pour interroger LLaMA 3.2 via Ollama pour filtrer les bons liens
def get_relevant_links_with_llama3(url: str, links: List[str]) -> dict:
    prompt = f"""You are provided with a list of links found on the website {url}.
Please choose the links relevant to include in a company brochure (e.g. About, Company, Careers, Jobs).
Do NOT include links like Privacy Policy, Terms of Service, Contact, mailto, login, or similar.
Return only a JSON object like this:

{{
  "links": [
    {{"type": "about page", "url": "https://example.com/about"}},
    {{"type": "careers page", "url": "https://example.com/careers"}}
  ]
}}

List of links:
{chr(10).join(links)}
"""

    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}]
    )

    content = response["message"]["content"]

    # Afficher la réponse brute (utile pour debug)
    print("📨 Raw LLaMA response:\n", content)

    # Extraire le JSON uniquement (entre { ... })
    try:
        json_match = re.search(r"\{[\s\S]*?\}", content)
        if json_match:
            json_text = json_match.group(0)
            try:
                return json.loads(json_text)
            except json.JSONDecodeError as e:
                print(f"⚠️ JSON decoding failed: {e}")
                # Essayer de corriger un JSON malformé
                corrected = json_text.replace("”", "\"").replace("“", "\"").replace("’", "'")
                corrected = re.sub(r'":\s*"(https?:[^"]+)"([^,\}])', r'": "\1",\2', corrected)
                return json.loads(corrected)
    except Exception as e:
        print("❌ Unexpected failure while parsing JSON:", e)

    return {"links": []}


---

## 🔹 2. `get_relevant_links_with_llama3(url, links)`

```python
def get_relevant_links_with_llama3(url: str, links: List[str]) -> dict:
```

### ▶️ Rôle :

Cette fonction utilise **LLaMA 3.2 (via Ollama)** pour **filtrer automatiquement les liens pertinents** (comme "About", "Company", "Careers") parmi une liste.

### ⚙️ Étapes :

1. **Construit un prompt** qui contient :

   * des instructions claires,
   * un exemple de format JSON attendu,
   * la liste des liens trouvés sur le site.
2. Envoie ce prompt à **Ollama** avec `ollama.chat(...)`.
3. Récupère la **réponse brute**.
4. Utilise une **expression régulière** (`re.search`) pour extraire uniquement le JSON `{...}`.
5. Essaie de parser ce JSON avec `json.loads`.
6. S’il y a une erreur de format, essaie de **corriger automatiquement** les erreurs classiques (guillemets typographiques, virgules manquantes, etc.).

### ▶️ Retour :

Un dictionnaire Python avec une structure comme :

```python
{
  "links": [
    {"type": "about page", "url": "https://site.com/about"},
    {"type": "careers page", "url": "https://site.com/jobs"}
  ]
}
```

---



In [5]:

# Fonction principale qui combine tout
def get_all_details(url: str) -> str:
    result = f"Landing page: {url}\n"
    landing = Website(url)
    result += landing.get_contents()

    print(f"🔗 Extracted {len(landing.links)} links.")
    filtered = get_relevant_links_with_llama3(url, landing.links)

    for link in filtered.get("links", []):
        print(f"✅ Found relevant link: {link['type']} - {link['url']}")
        page = Website(link["url"])
        result += f"\n\n{link['type'].capitalize()}:\n"
        result += page.get_contents()

    return result


## 🔹 3. `get_all_details(url)`

```python
def get_all_details(url: str) -> str:
```

### ▶️ Rôle :

Fonction principale qui **orchestre tout le processus** :

1. Analyse la page d’accueil du site (`Website(url)`).
2. Récupère **tous les liens** sur la page d’accueil.
3. Envoie les liens à **LLaMA** pour détecter ceux pertinents pour une brochure.
4. Pour chaque lien sélectionné, récupère et affiche son contenu.
5. Retourne une **grande chaîne de texte** regroupant tous les résultats (titre, texte principal, contenus des pages secondaires).

---

## 🔹 Exemple d’utilisation finale :

```python
if __name__ == "__main__":
    url = "https://huggingface.co"
    output = get_all_details(url)
    print(output)
```

### ▶️ Que fait ce bloc ?

* Il appelle la fonction `get_all_details(...)` pour un site web réel.
* Affiche l’ensemble des textes récupérés (accueil + pages utiles).
* Peut être remplacé par une interface utilisateur ou sauvegardé dans un fichier.

---



In [8]:
# Exemple d'exécution
if __name__ == "__main__":
    url = "https://huggingface.co"  # ou "https://edwarddonner.com"
    output = get_all_details(url)
    print("\n\n=== Résultat final ===\n")
    print(output)

🔗 Extracted 80 links.
📨 Raw LLaMA response:
 {
  "links": [
    {"type": "about page", "url": "https://huggingface.co/about"},
    {"type": "company page", "url": "https://huggingface.co/brand"},
    {"type": "careers page", "url": "https://apply.workable.com/huggingface/"},
    {"type": "jobs page", "url": "https://apply.workable.com/huggingface/"}
  ]
}

Note: I couldn't find any specific information about the company's jobs page, but Workable is a popular platform for job postings. If you need to include a link that's not present in the provided list, please let me know.

Also, note that some links were excluded from the final result as they didn't fit into the categories "About", "Company", or "Careers".
⚠️ JSON decoding failed: Expecting ',' delimiter: line 3 column 66 (char 80)
❌ Unexpected failure while parsing JSON: Expecting ',' delimiter: line 3 column 66 (char 80)


=== Résultat final ===

Landing page: https://huggingface.co
Webpage Title:
Hugging Face – The AI community bu

## 🧠 Résumé visuel

```
Website(url)
   └── .get_contents()     → Titre + Texte visible
   └── .links              → Tous les liens cliquables

get_relevant_links_with_llama3(url, links)
   └── Envoie à LLaMA      → JSON contenant uniquement les liens utiles

get_all_details(url)
   └── Website(url)        → Landing page
   └── get_relevant_links_with_llama3() → Filtrage
   └── Website(link["url"]) pour chaque page utile → Contenu
```

---



In [9]:
# Prompt système (instructions données au modèle)
link_system_prompt = (
    "You are provided with a list of links found on a webpage. "
    "You are able to decide which of the links would be most relevant to include in a brochure about the company, "
    "such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
    "You should respond with a JSON object like this example:\n"
    """{
  "links": [
    {"type": "about page", "url": "https://full.url/goes/here/about"},
    {"type": "careers page", "url": "https://another.full.url/careers"}
  ]
}"""
)

In [10]:
# Fonction pour générer le prompt utilisateur à partir des liens du site
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url}.\n"
    user_prompt += "Please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format.\n"
    user_prompt += "Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt


In [15]:
# Fonction pour appeler Ollama avec les prompts
def get_links(url, website):
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)},
        ]
    )
    content = response['message']['content']

    # Extraire JSON dans la réponse brute (au cas où il y a du texte autour)
    json_match = re.search(r"\{[\s\S]*\}", content)
    if json_match:
        json_text = json_match.group(0)
        try:
            return json.loads(json_text)
        except json.JSONDecodeError as e:
            print("JSON decode error:", e)
            print("Raw JSON text:", json_text)
    else:
        print("No JSON found in model response.")
    return {"links": []}


In [17]:
#Exemple d'usage (avec ta classe Website définie avant)
ed = Website("https://huggingface.co")
get_links("https://huggingface.co", ed)


{'links': [{'type': 'home page', 'url': 'https://huggingface.co/'},
  {'type': 'docs page', 'url': 'https://huggingface.co/docs'},
  {'type': 'about page', 'url': 'https://discuss.huggingface.co'},
  {'type': 'blog page', 'url': 'https://huggingface.co/blog'},
  {'type': 'learn page', 'url': 'https://huggingface.co/learn'},
  {'type': 'brand page', 'url': 'https://huggingface.co/brand'}]}

In [18]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()

    # Ici on appelle get_links avec deux arguments : l'url et l'objet Website
    website = Website(url)
    links = get_links(url, website)  # get_links adapté à Ollama (cf. code précédent)

    print("Found links:", links)
    for link in links.get("links", []):
        result += f"\n\n{link['type'].capitalize()}:\n"
        page = Website(link["url"])
        result += page.get_contents()
    return result


In [19]:
system_prompt = (
    "You are an assistant that analyzes the contents of several relevant pages from a company website "
    "and creates a short brochure about the company for prospective customers, investors and recruits. "
    "Respond in markdown. Include details of company culture, customers and careers/jobs if you have the information."
)


In [20]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5000]  # Limite à 5000 caractères pour éviter un prompt trop long
    return user_prompt


In [23]:
def create_brochure(company_name, url):
    user_content = get_brochure_user_prompt(company_name, url)

    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content}
        ]
    )
    result = response['message']['content']
    print(result)  # ou afficher avec un markdown si tu es dans un notebook : display(Markdown(result))


In [25]:
if __name__ == "__main__":
   create_brochure("HuggingFace", "https://huggingface.co")


Found links: {'links': [{'type': 'home page', 'url': 'https://huggingface.co/'}, {'type': 'About page', 'url': 'https://huggingface.co/'}, {'type': 'Company page', 'url': 'https://huggingface.co/'}, {'type': 'Careers/Jobs page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'Docs page', 'url': 'https://huggingface.co/docs'}, {'type': 'Blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'Discord server for joining', 'url': 'https://huggingface.co/join/discord'}]}
**Welcome to Hugging Face**

[Image: A logo of a smiling face, representing the AI community]

**Building the Future of Artificial Intelligence**
---------------------------------------------

At Hugging Face, we are on a mission to create a platform where the machine learning community can co

In [38]:
from IPython.display import Markdown, display, update_display

def stream_brochure(company_name, url):
    user_content = get_brochure_user_prompt(company_name, url)

    stream = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content}
        ],
        stream=True  # Active le streaming
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    
    for chunk in stream:
        # Chaque chunk est un dictionnaire avec "message" > "content"
        delta = chunk.get("message", {}).get("content", "")
        if delta:
            response += delta
            # Nettoyer au besoin, ici je supprime juste les balises de code markdown (optionnel)
            cleaned = response.replace("```", "").replace("markdown", "")
            update_display(Markdown(cleaned), display_id=display_handle.display_id)


In [39]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'docs page', 'url': 'https://huggingface.co/docs'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'join/discord page', 'url': 'https://huggingface.co/join/discord'}, {'type': 'apply/workable page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'changelog page', 'url': 'https://huggingface.co/changelog'}]}


# Hugging Face Brochure

## About Us

Hugging Face is the AI community building the future. Our platform brings together a collaborative space for machine learning professionals to explore, create, and innovate on models, datasets, and applications.

### Our Mission

We aim to accelerate the development of artificial intelligence by providing an open-source stack, tools, and resources that foster innovation and progress in the field.

## What We Offer

*   **Models**: Browse 1M+ pre-trained models across various modalities (text, image, video, audio, and 3D).
*   **Datasets**: Access and share datasets for any ML tasks.
*   **Spaces**: Collaborate on unlimited public models, datasets, and applications with our platform's advanced infrastructure.

### Advanced Solutions

For organizations requiring enterprise-grade security, access controls, and dedicated support, we offer:

*   **Compute**: Deploy optimized inference endpoints or update your Spaces applications to a GPU in just a few clicks.
*   **Enterprise**: Get the most advanced platform for building AI with features like Single Sign-On, Priority Support, Audit Logs, Resource Groups, Private Datasets Viewer.

## Our Community

Join over 50,000 organizations using Hugging Face, including top companies like Meta, Google, Amazon, Intel, Microsoft, and Grammarly. Explore popular models and datasets, and discover new ones by browsing our platform's trending content.

### Transforming the Future of AI

At Hugging Face, we're building the foundation for ML tooling with our open-source projects:

*   **Transformers**: State-of-the-art ML for PyTorch, TensorFlow, JAX.
*   **Diffusers**: State-of-the-art Diffusion models in PyTorch.
*   **Safetensors**: Safe way to store/distribute neural network weights.
*   **Tokenizers**: Fast tokenizers optimized for research & production.

## Join Our Journey

Accelerate your ML journey with Hugging Face. Explore our platform, sign up for an account, or get started with our free resources and tutorials today!

| Fonction                  | Rôle                                                                |
| ------------------------- | ------------------------------------------------------------------- |
| `Website(url)`            | Scrape le contenu texte et les liens d’un site web.                 |
| `get_links_user_prompt()` | Forme un prompt clair avec les liens bruts.                         |
| `get_links()`             | Utilise Ollama pour filtrer les liens pertinents (About, Careers…). |
| `get_all_details()`       | Télécharge les pages liées pertinentes pour en extraire du contenu. |
| `create_brochure()`       | Génère une brochure (markdown) avec LLaMA sans streaming.           |
| `stream_brochure()`       | Idem, mais avec mise à jour en direct (Jupyter Notebook friendly).  |
