<a href="https://colab.research.google.com/github/carlp1/AMDPiscitelli/blob/main/bozza_tesi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import requests
from bs4 import BeautifulSoup
import re

def scrape_and_preprocess_compendium_chapter(url):
    """
    Given a Vatican Compendium chapter URL, scrape and clean its text content.
    Returns a list of dicts, one per <p> element whose cleaned text is longer
    than 20 characters, each with the keys 'text' and 'source_url'.
    """
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9"
    }

    # A timeout keeps the cell from hanging if the server stops responding
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to fetch content from {url} (status {response.status_code})")

    soup = BeautifulSoup(response.content, 'lxml')

    cleaned_paragraphs = []
    for p in soup.find_all('p'):
        raw_text = p.get_text(separator=' ', strip=True)
        if not raw_text:
            continue

        # Collapse line breaks and runs of whitespace into single spaces
        text = re.sub(r'\s+', ' ', raw_text).strip()

        # Remove footnote numbers glued to sentence-ending punctuation (e.g. "life.171");
        # markers that follow a space or a closing quote are left untouched
        text = re.sub(r'([.?!])\d+', r'\1', text)

        if len(text) > 20:  # Filter out short/uninformative lines
            cleaned_paragraphs.append({
                'text': text,
                'source_url': url
            })

    return cleaned_paragraphs

# Example usage
chapter_url = "https://www.vatican.va/archive/ENG0015/__P1L.HTM"
chapter_content = scrape_and_preprocess_compendium_chapter(chapter_url)

# Preview result
for para in chapter_content[:3]:
    print(para['text'], "\n")


Paragraph 3. THE MYSTERIES OF CHRIST'S LIFE 

512 Concerning Christ's life the Creed speaks only about the mysteries of the Incarnation (conception and birth) and Paschal mystery (passion, crucifixion, death, burial, descent into hell, resurrection and ascension). It says nothing explicitly about the mysteries of Jesus' hidden or public life, but the articles of faith concerning his Incarnation and Passover do shed light on the whole of his earthly life. "All that Jesus did and taught, from the beginning until the day when he was taken up to heaven", 171 is to be seen in the light of the mysteries of Christmas and Easter. 

513 According to circumstances catechesis will make use of all the richness of the mysteries of Jesus. Here it is enough merely to indicate some elements common to all the mysteries of Christ's life (I), in order then to sketch the principal mysteries of Jesus' hidden (II) and public (III) life. 
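
The preview shows that each scraped paragraph still starts with its paragraph number (512, 513). Below is a minimal sketch of how that number could be moved into metadata instead of staying in the text. It assumes the leading digits are always the paragraph number, which may not hold for every page; `split_paragraph_number` is a hypothetical helper, not part of the notebook above.

```python
import re

# Assumption: a numbered paragraph starts with 1-4 digits followed by whitespace.
_PARA_NUMBER = re.compile(r'^(\d{1,4})\s+(.*)$', re.DOTALL)

def split_paragraph_number(text):
    """Return (number, body); number is None when the text has no leading number."""
    match = _PARA_NUMBER.match(text)
    if match is None:
        return None, text
    return int(match.group(1)), match.group(2)

number, body = split_paragraph_number(
    "513 According to circumstances catechesis will make use of all the richness."
)
print(number, body[:26])
```

The returned number could be stored next to `'text'` and `'source_url'` in each paragraph dict, so that matches can later be reported by paragraph.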



In [2]:
import requests
import time

# Define the Gospels and chapter count
gospels = {
    'Matthew': 28,
    'Mark': 16,
    'Luke': 24,
    'John': 21
}

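def fetch_gospel_verses_live():
    """Download every chapter of the four Gospels (WEB translation) from bible-api.com.

    Returns a list of dicts with the keys 'book', 'chapter', 'verse' and 'text'.
    Chapters whose request fails are reported and skipped, not retried.
    """
    all_verses = []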

    for book, chapters in gospels.items():
        for chapter in range(1, chapters + 1):
            url = f"https://bible-api.com/{book}+{chapter}?translation=web"
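            try:
                # A timeout keeps one stalled request from blocking the whole run
                response = requests.get(url, timeout=30)
            except requests.RequestException:
                response = None
            if response is not None and response.status_code == 200: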
                data = response.json()
                for verse in data.get("verses", []):
                    all_verses.append({
                        "book": verse["book_name"],
                        "chapter": verse["chapter"],
                        "verse": verse["verse"],
                        "text": verse["text"].strip()
                    })
                print(f"✓ {book} {chapter} done")
            else:
                print(f"✗ Failed: {book} {chapter}")
            time.sleep(1)  # Be respectful to the API

    return all_verses

# Fetch and store in memory
gospel_verses = fetch_gospel_verses_live()


✓ Matthew 1 done
✓ Matthew 2 done
✓ Matthew 3 done
✓ Matthew 4 done
✓ Matthew 5 done
✓ Matthew 6 done
✓ Matthew 7 done
✓ Matthew 8 done
✓ Matthew 9 done
✓ Matthew 10 done
✓ Matthew 11 done
✓ Matthew 12 done
✓ Matthew 13 done
✓ Matthew 14 done
✓ Matthew 15 done
✓ Matthew 16 done
✓ Matthew 17 done
✗ Failed: Matthew 18
✗ Failed: Matthew 19
✗ Failed: Matthew 20
✗ Failed: Matthew 21
✗ Failed: Matthew 22
✗ Failed: Matthew 23
✗ Failed: Matthew 24
✓ Matthew 25 done
✓ Matthew 26 done
✓ Matthew 27 done
✓ Matthew 28 done
✓ Mark 1 done
✓ Mark 2 done
✓ Mark 3 done
✓ Mark 4 done
✓ Mark 5 done
✓ Mark 6 done
✓ Mark 7 done
✓ Mark 8 done
✓ Mark 9 done
✓ Mark 10 done
✓ Mark 11 done
✗ Failed: Mark 12
✗ Failed: Mark 13
✗ Failed: Mark 14
✗ Failed: Mark 15
✗ Failed: Mark 16
✗ Failed: Luke 1
✗ Failed: Luke 2
✗ Failed: Luke 3
✓ Luke 4 done
✓ Luke 5 done
✓ Luke 6 done
✓ Luke 7 done
✓ Luke 8 done
✓ Luke 9 done
✓ Luke 10 done
✓ Luke 11 done
✓ Luke 12 done
✓ Luke 13 done
✓ Luke 14 done
✓ Luke 15 done
✓ Luke 16 don
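
The log above shows runs of consecutive failures (Matthew 18 to 24, Mark 12 to Luke 3), so those chapters are missing from `gospel_verses`. The pattern looks like rate limiting, but the status codes were not recorded, so that is an assumption. Below is a minimal sketch of retrying a failed request with exponential backoff. `fetch_with_retry` and its parameters are hypothetical names; the fetch function is passed in so the logic can be tried without network access.

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or the attempts run out.

    fetch: zero-argument callable returning parsed data, or None on failure.
    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None

# Stand-in fetcher that fails twice and then succeeds (no network involved).
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    return {"verses": []} if calls["n"] >= 3 else None

delays = []  # record the waits instead of actually sleeping
data = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(data, delays)
```

In the notebook, `fetch` could wrap the `requests.get(url)` call and return `response.json()` only when the status code is 200, so that a chapter is reported as failed only after all attempts are used up.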

In [6]:
import re
import numpy as np
from sentence_transformers import SentenceTransformer
import gc

def preprocess_for_embedding(text):
    """Lowercase the text, drop digits (paragraph and footnote numbers) and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def split_into_sentences(text):
    # Split on whitespace that follows '.', ';' or ':' (the lookbehind keeps the
    # punctuation attached), then drop fragments of 20 characters or fewer
    return [s.strip() for s in re.split(r'(?<=[.;:])\s+', text) if len(s.strip()) > 20]

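def match_compendium_to_gospel(compendium_paragraphs, gospel_verses, threshold=0.6):
    """
    Compare every compendium sentence with every gospel sub-sentence and
    return all pairs whose cosine similarity is at least `threshold`,
    sorted from highest to lowest score.
    """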
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device='cpu')

    # Expand gospel verses into sub-sentences
    gospel_subsentences = []
    for verse in gospel_verses:
        subsentences = split_into_sentences(verse['text'])
        for subsentence in subsentences:
            gospel_subsentences.append({
                'original_text': subsentence,
                'preprocessed_text': preprocess_for_embedding(subsentence),
                'reference': f"{verse['book']} {verse['chapter']}:{verse['verse']}"
            })

    # Pre-encode all gospel sub-sentences
    gospel_embeddings = model.encode(
        [g['preprocessed_text'] for g in gospel_subsentences],
        convert_to_numpy=True,
        show_progress_bar=True
    )

    matches = []

    for idx, para in enumerate(compendium_paragraphs):
        comp_sentences = split_into_sentences(para['text'])
        for comp_text in comp_sentences:
            comp_clean = preprocess_for_embedding(comp_text)
            comp_embedding = model.encode([comp_clean], convert_to_numpy=True)[0]

            # Cosine similarity between this sentence and every gospel sub-sentence
            dot = np.dot(gospel_embeddings, comp_embedding)
            norms = np.linalg.norm(gospel_embeddings, axis=1) * np.linalg.norm(comp_embedding)
            similarities = dot / norms

            for i, score in enumerate(similarities):
                if score >= threshold:
                    matches.append({
                        'compendium_text': comp_text,
                        'gospel_text': gospel_subsentences[i]['original_text'],
                        'reference': gospel_subsentences[i]['reference'],
                        'score': float(score)
                    })

        if (idx + 1) % 5 == 0:
            print(f"Processed {idx + 1}/{len(compendium_paragraphs)} compendium blocks...")

        gc.collect()

    # Sort matches from highest to lowest similarity
    matches.sort(key=lambda x: x['score'], reverse=True)
    return matches
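
The loop above recomputes the norms of all gospel embeddings for every compendium sentence. Below is a sketch of an equivalent formulation, assuming all embeddings fit in memory as NumPy arrays: each row is normalised once, after which cosine similarity reduces to a single matrix product. `normalize_rows` and `cosine_similarity_matrix` are hypothetical helpers, and the small random arrays only stand in for real embeddings.

```python
import numpy as np

def normalize_rows(matrix, eps=1e-12):
    """Scale each row to unit length; eps guards against division by zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

def cosine_similarity_matrix(queries, corpus):
    """Return an array of shape (len(queries), len(corpus)) of cosine similarities."""
    return normalize_rows(queries) @ normalize_rows(corpus).T

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 8))    # stand-in for gospel_embeddings
queries = rng.normal(size=(2, 8))   # stand-in for compendium sentence embeddings

sims = cosine_similarity_matrix(queries, corpus)
rows, cols = np.nonzero(sims >= 0.6)  # (query, corpus) index pairs above the threshold
print(sims.shape)
```

Encoding all compendium sentences in one `model.encode` call and passing the result as `queries` would also avoid the per-sentence encode, at the cost of holding the full similarity matrix in memory.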


In [7]:
results = match_compendium_to_gospel(chapter_content, gospel_verses, threshold=0.6)

Batches:   0%|          | 0/108 [00:00<?, ?it/s]

Processed 5/91 compendium blocks...
Processed 10/91 compendium blocks...
Processed 15/91 compendium blocks...
Processed 20/91 compendium blocks...
Processed 25/91 compendium blocks...
Processed 30/91 compendium blocks...
Processed 35/91 compendium blocks...
Processed 40/91 compendium blocks...
Processed 45/91 compendium blocks...
Processed 50/91 compendium blocks...
Processed 55/91 compendium blocks...
Processed 60/91 compendium blocks...
Processed 65/91 compendium blocks...
Processed 70/91 compendium blocks...
Processed 75/91 compendium blocks...
Processed 80/91 compendium blocks...
Processed 85/91 compendium blocks...
Processed 90/91 compendium blocks...
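
Because every pair above the threshold is kept, one compendium sentence can appear many times in `results`. Below is a minimal sketch of keeping only the best gospel match for each sentence, assuming the dict keys produced by `match_compendium_to_gospel`. `best_match_per_sentence` is a hypothetical helper, and the sample records are made up for illustration; they are not results from the run above.

```python
def best_match_per_sentence(matches):
    """Keep only the highest-scoring match for each compendium sentence."""
    best = {}
    for match in matches:
        key = match['compendium_text']
        if key not in best or match['score'] > best[key]['score']:
            best[key] = match
    # Return the survivors from highest to lowest score
    return sorted(best.values(), key=lambda m: m['score'], reverse=True)

# Made-up records with the same keys as the entries in `results`
sample = [
    {'compendium_text': 'A', 'gospel_text': 'x', 'reference': 'Mark 1:1', 'score': 0.70},
    {'compendium_text': 'A', 'gospel_text': 'y', 'reference': 'Mark 1:2', 'score': 0.90},
    {'compendium_text': 'B', 'gospel_text': 'z', 'reference': 'John 1:1', 'score': 0.80},
]
top = best_match_per_sentence(sample)
print([(m['compendium_text'], m['reference']) for m in top])
```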


In [8]:
for match in results[:50]:
    print(f"\nScore: {match['score']:.4f}")
    print(f"Vatican Text: {match['compendium_text']}")
    print(f"Gospel Match: {match['gospel_text']}")
    print(f"Reference: {match['reference']}")



Score: 0.8656
Vatican Text: 541 "Now after John was arrested, Jesus came into Galilee, preaching the gospel of God, and saying:
Gospel Match: Now after John was taken into custody, Jesus came into Galilee, preaching the Good News of God’s Kingdom,
Reference: Mark 1:14

Score: 0.8550
Vatican Text: "He must increase, but I must decrease." 201
Gospel Match: He must increase, but I must decrease.
Reference: John 3:30

Score: 0.8449
Vatican Text: 228 John preaches "a baptism of repentance for the forgiveness of sins".
Gospel Match: John came baptizing in the wilderness and preaching the baptism of repentance for forgiveness of sins.
Reference: Mark 1:4

Score: 0.8174
Vatican Text: Jesus' ascent to Jerusalem
Gospel Match: When they heard that Jesus was coming to Jerusalem,
Reference: John 12:12

Score: 0.8157
Vatican Text: Jesus' messianic entrance into Jerusalem
Gospel Match: Jesus entered into the temple in Jerusalem.
Reference: Mark 11:11

Score: 0.8064
Vatican Text: "I came not to call 