<a href="https://colab.research.google.com/github/dornercr/math_notebooks/blob/main/adam_prototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [18]:
# @title
# ================================================
# 🌍 ILR Speaking Level + Language ID + Arabic Dialect Detection + Original Text + English Translation + Topic Modeling
# ================================================

import ipywidgets as widgets
from IPython.display import display, HTML, Markdown, clear_output
from google.colab import ai
import json, re, pandas as pd

# -----------------------------------------------
# 🔹 Widget Setup
# -----------------------------------------------

dropdown = widgets.Dropdown(
    options=ai.list_models(),
    description='Model:',
    layout={'width': 'auto'}
)

text_input = widgets.Textarea(
    placeholder='Paste or type text (any language, any topic)...',
    layout={'width': 'auto', 'height': '160px'},
)

button_upload = widgets.Button(
    description='📁 Upload File',
    button_style='info'
)

button_analyze = widgets.Button(
    description='🔍 Analyze Text',
    button_style='primary'
)

output_summary = widgets.Output(layout={'border': '1px solid #ccc', 'padding': '10px'})
output_table = widgets.Output(layout={'border': '1px solid #ccc', 'padding': '10px'})

# -----------------------------------------------
# 🔹 Helper Functions
# -----------------------------------------------

def safe_json_parse(raw):
    """Attempts to clean and parse the AI's JSON output safely."""
    raw = re.sub(r'```(json)?', '', raw).strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'\{.*\}', raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
        raise ValueError("Invalid JSON returned by model")

def analyze_text(text):
    """Ask Gemini for language ID + Arabic dialect + translation + ILR proficiency + topic modeling."""
    prompt = f"""
You are a multilingual AI evaluator and translator.

Your tasks:
1. Identify the **primary language** of the text (ISO 639-1 code + English name).
2. If the language is **Arabic**, determine its **regional variety**:
   - Modern Standard Arabic (MSA)
   - Egyptian Arabic
   - Levantine Arabic
   - Maghrebi Arabic (Moroccan, Algerian, Tunisian)
   - Gulf Arabic (Saudi, Emirati, Kuwaiti, etc.)
   - Sudanese Arabic
   Provide this as `"dialect": "..."` inside the JSON.
3. Return the **original text** exactly as provided.
4. Translate it into **English** accurately and naturally.
5. Determine the **ILR Speaking Level** (0–5 or + levels).
6. Provide a **reasoning paragraph** justifying the classification.
7. Extract **3–5 key topics** with keywords and summaries.

Respond ONLY in valid JSON using this structure:
{{
  "language": {{"code": "ar", "name": "Arabic", "dialect": "Egyptian Arabic"}},
  "original_text": "النص الأصلي هنا...",
  "translation": "English translation here...",
  "level": "ILR-3",
  "reasoning": "Why this level fits according to ILR descriptors.",
  "topics": [
    {{"topic": "Cultural Change", "keywords": ["tradition", "modernity"], "summary": "Examines the social evolution in Egypt."}},
    {{"topic": "Technology and Society", "keywords": ["innovation", "digital life"], "summary": "Describes technology's role in daily living."}}
  ]
}}

Reference - ILR Speaking Levels:
(Level 0) No Proficiency – No practical speaking ability; only isolated words or rehearsed phrases. Communication is impossible beyond repeating memorized items or responding to greetings. The speaker cannot form original sentences or sustain interaction.

(Level 0+) Memorized Proficiency – Can use a few memorized expressions for immediate needs (e.g., “hello,” “thank you,” “water”). Speech is limited to rehearsed forms with little understanding of structure or grammar. Pronunciation and comprehension are weak, and communication breaks down outside set phrases.

(Level 1) Elementary Proficiency – Can handle very simple exchanges related to immediate needs, such as introductions, directions, or ordering food. Speech is slow and heavily patterned, with frequent pauses and major grammatical errors. Comprehension is limited to clear, repeated, and familiar speech.

(Level 1+) Elementary Plus – Can manage predictable daily exchanges and give short connected sentences. Understands simple questions on familiar topics and can express basic needs and preferences. Vocabulary remains limited, and errors in grammar and pronunciation are still frequent, but communication is sometimes sustained without major breakdowns.

(Level 2) Limited Working Proficiency – Can carry out routine social and work conversations on familiar and concrete topics. Speech is generally understandable, though not smooth. The speaker can give instructions, describe experiences, and handle most survival situations. Errors in grammar and vocabulary remain common, especially in complex sentences.

(Level 2+) Limited Working Plus – Speech is smoother, more confident, and better organized. The speaker can discuss familiar subjects at length and handle unexpected turns in conversation. Some ability to describe, compare, and narrate with moderate control of time frames. Pronunciation and grammar are stronger, though occasional hesitation and self-correction remain.

(Level 3) General Professional Proficiency – Communicates effectively and accurately on professional, social, and abstract topics. Speech is cohesive, well-organized, and uses a wide range of vocabulary and structures. The speaker can support opinions, hypothesize, and explain complex concepts, maintaining clarity and coherence even in extended discourse.

(Level 3+) General Professional Plus – Speaks fluently and naturally in nearly all formal and informal situations. Language is flexible and stylistically appropriate, with good command of idioms, humor, and cultural references. The speaker demonstrates near-native rhythm and spontaneity, though minor gaps may appear in idiomatic precision or nuanced style.

(Level 4) Advanced Professional Proficiency – Near-native fluency with precise and sophisticated vocabulary. Can perform complex tasks such as negotiating, persuading, debating, and presenting professionally. Speech demonstrates complete grammatical control, effective register shifting, and full awareness of sociolinguistic norms.

(Level 4+) Advanced Professional Plus – Functionally native in nearly all aspects except the most subtle cultural or idiomatic details. The speaker handles emotionally charged, literary, or rhetorical content with ease. Fully comfortable with humor, irony, and regional variation, displaying mastery of style and pragmatics.

(Level 5) Functionally Native Proficiency – Equivalent to an educated native speaker. Complete control of grammar, idiom, style, and cultural nuance across all registers and dialects. Speech is effortless, spontaneous, and flexible in every situation, from professional meetings to creative expression. The speaker is indistinguishable from a native in all respects.

Text:
{text}
"""
    raw = ""
    for chunk in ai.generate_text(prompt=prompt, model_name=dropdown.value, stream=False):
        if chunk:
            raw += chunk
    return raw.strip()

def display_results(result_json):
    """Parse JSON safely and display clean outputs."""
    try:
        data = safe_json_parse(result_json)
    except Exception as e:
        with output_summary:
            clear_output()
            display(Markdown(f"❌ **Error:** {e}. Model returned invalid JSON. Try again."))
        return

    with output_summary:
        clear_output()
        lang = data.get("language", {})
        dialect_info = ""
        if lang.get("code") == "ar" and "dialect" in lang:
            dialect_info = f" - Dialect: **{lang.get('dialect')}**"
        display(Markdown(f"### 🌍 Detected Language: **{lang.get('name','Unknown')}** ({lang.get('code','?')}){dialect_info}"))

        # Original text display
        if "original_text" in data:
            display(Markdown("### 🗣️ Original Text"))
            display(Markdown(f"> {data.get('original_text','(No original text returned.)')}"))

        # Translation display
        if "translation" in data:
            display(Markdown("### 🇬🇧 English Translation"))
            display(Markdown(f"> {data.get('translation','(No translation returned.)')}"))

        # ILR + reasoning
        display(Markdown(f"### 🧠 ILR Speaking Level: **{data.get('level','Unknown')}**"))
        display(Markdown(f"**Reasoning:** {data.get('reasoning','(No reasoning provided.)')}"))

    # Topics section
    with output_table:
        clear_output()
        topics = data.get('topics', [])
        if topics:
            df = pd.DataFrame(topics)
            display(Markdown("### 📂 Extracted Topics"))
            display(df)
        else:
            display(Markdown("No topics returned."))

def on_analyze_clicked(b):
    """Triggered when Analyze button is clicked."""
    with output_summary:
        clear_output()
        display(Markdown("⏳ Analyzing text... please wait..."))
    with output_table:
        clear_output()

    text = text_input.value.strip()
    if not text:
        with output_summary:
            clear_output()
            display(Markdown("⚠️ Please enter or upload text first."))
        return

    result_json = analyze_text(text)
    display_results(result_json)

def on_upload_clicked(b):
    """Allows uploading .txt files directly."""
    from google.colab import files
    uploaded = files.upload()
    if uploaded:
        file_name = list(uploaded.keys())[0]
        with open(file_name, 'r', encoding='utf-8') as f:
            text_input.value = f.read()

# -----------------------------------------------
# 🔹 Bind Actions
# -----------------------------------------------

button_analyze.on_click(on_analyze_clicked)
button_upload.on_click(on_upload_clicked)

# -----------------------------------------------
# 🔹 Display UI
# -----------------------------------------------

display(HTML("""
<style>
.widget-dropdown select, .widget-textarea textarea {
  font-size: 16px;
  font-family: "Arial", sans-serif;
}
blockquote {
  border-left: 3px solid #ccc;
  padding-left: 10px;
  color: #333;
}
</style>
"""))

ui = widgets.VBox([
    dropdown,
    text_input,
    widgets.HBox([button_upload, button_analyze]),
    output_summary,
    output_table
])
display(ui)


VBox(children=(Dropdown(description='Model:', layout=Layout(width='auto'), options=('google/gemini-2.0-flash',…
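
The prompt above asks the model for a `"level"` such as `"ILR-3"`, but nothing in the cell checks that the returned value is one of the eleven levels listed in the reference. A minimal sketch of such a check follows; `normalize_ilr_level` and `ILR_LEVELS` are hypothetical names that do not appear in the notebook, and the accepted spellings (`"ILR-3"`, `"level 2+"`, `"3"`) are assumptions about what a model might return.

```python
import re

# The eleven ILR speaking levels named in the prompt's reference section.
ILR_LEVELS = ["0", "0+", "1", "1+", "2", "2+", "3", "3+", "4", "4+", "5"]

def normalize_ilr_level(level):
    """Return a canonical 'ILR-<n>[+]' string, or None for anything else.

    Hypothetical helper (not part of the notebook).
    """
    if not isinstance(level, str):
        return None
    # Optional 'ILR' or 'Level' prefix, optional separators, then the level itself.
    match = re.fullmatch(
        r"\s*(?:ILR|Level)?[\s\-:]*([0-5]\+?)\s*", level, re.IGNORECASE
    )
    if not match or match.group(1) not in ILR_LEVELS:
        return None
    return f"ILR-{match.group(1)}"

print(normalize_ilr_level("ILR-3"))
print(normalize_ilr_level("level 2+"))
print(normalize_ilr_level("5+"))  # there is no 5+ level, so this is rejected
```

One possible use is to pass `data.get('level')` through this helper in `display_results` and fall back to `'Unknown'` when it returns `None`.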

In [9]:
# ✅ Setup cell: install all dependencies
#!pip install newspaper3k beautifulsoup4 requests pandas
#!pip install lxml_html_clean

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# ============================================================
# 🌍 Multilingual News Scraper + ILR Analyzer Integration (Colab)
# ============================================================

import requests, concurrent.futures, csv, os, json, re, pandas as pd
from bs4 import BeautifulSoup
from newspaper import Article
from google.colab import ai
from IPython.display import display, Markdown, clear_output

# --------------------------------------------
# 🔹 ILR ANALYSIS FUNCTION (from your widget logic)
# --------------------------------------------
def safe_json_parse(raw):
    raw = re.sub(r'```(json)?', '', raw).strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'\{.*\}', raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
        raise ValueError("Invalid JSON returned by model")

def analyze_text(text, model_name="gpt-5"):
    """Ask model for language ID + Arabic dialect + translation + ILR + topics."""
    prompt = f"""
You are a multilingual AI evaluator and translator. Perform:
1. Language identification (ISO code + English name).
2. If Arabic, detect dialect (MSA, Egyptian, Gulf, Maghrebi, etc.).
3. Translate into English.
4. Rate ILR Speaking Level (0–5 or +).
5. Explain reasoning.
6. Extract 3–5 key topics with keywords and summaries.

Respond in JSON with:
{{"language": {{"code":"xx","name":"...","dialect":"..."}},
  "translation":"...",
  "level":"ILR-3",
  "reasoning":"...",
  "topics":[{{"topic":"...","keywords":["..."],"summary":"..."}}]}}
Text:
{text}
"""
    raw = ""
    for chunk in ai.generate_text(prompt=prompt, model_name=model_name, stream=False):
        if chunk:
            raw += chunk
    return safe_json_parse(raw.strip())

# --------------------------------------------
# 🔹 ARTICLE SCRAPING UTILITIES
# --------------------------------------------
def get_article(url, language):
    article = Article(url, language=language)
    try:
        article.download()
        article.parse()
        article.nlp()
        if len(article.text) < 300:
            raise ValueError(f"Too short: {url}")
        return {
            'title': article.title,
            'summary': article.summary,
            'text': article.text,
            'language': language,
            'link': url
        }
    except Exception as e:
        print(f"[{language.upper()}] Failed: {e}")
        return None

def scrape_website(url, link_criteria, base_url=None):
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        res = requests.get(url, headers=headers, timeout=10)
        res.raise_for_status()
        soup = BeautifulSoup(res.content, 'html.parser')
        urls = []
        for a in soup.find_all('a', href=True):
            link = a['href']
            if link_criteria in link:
                if base_url and link.startswith('/'):
                    link = base_url + link
                urls.append(link)
        return list(set(urls))
    except Exception as e:
        print(f"Failed to retrieve {url}: {e}")
        return []

def save_articles_to_csv(articles, filename):
    # Create the parent directory only if the path has one; os.makedirs('') raises.
    out_dir = os.path.dirname(filename)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    keys = ['title', 'summary', 'text', 'language', 'link',
            'ilr_level', 'reasoning', 'translation', 'topics']
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(articles)
    print(f"✅ Saved {len(articles)} → {filename}")

# --------------------------------------------
# 🔹 MAIN MULTILINGUAL SCRAPE + ANALYZE
# --------------------------------------------
def fetch_and_analyze(url, lang, model_name):
    base_article = get_article(url, lang)
    if not base_article:
        return None
    try:
        ilr_data = analyze_text(base_article['text'], model_name=model_name)
        base_article['ilr_level'] = ilr_data.get('level', 'Unknown')
        base_article['reasoning'] = ilr_data.get('reasoning', '')
        base_article['translation'] = ilr_data.get('translation', '')
        base_article['topics'] = json.dumps(ilr_data.get('topics', []), ensure_ascii=False)
    except Exception as e:
        print(f"ILR analysis failed for {url}: {e}")
        base_article.update({'ilr_level':'Error','reasoning':'','translation':'','topics':'[]'})
    return base_article

def scrape_multilingual_ai(sources_config, model_name="gpt-5", output_dir="scraped_ilr"):
    all_articles = []
    for conf in sources_config:
        lang = conf['lang']
        sites = conf['sites']
        lang_articles = []

        print(f"\n🌍 Scraping + Analyzing {lang.upper()} ({len(sites)} sites)")
        article_urls = []
        for src in sites:
            article_urls.extend(scrape_website(src[0], src[1], src[2]))

        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as ex:
            futures = {ex.submit(fetch_and_analyze, u, lang, model_name): u for u in article_urls}
            for fut in concurrent.futures.as_completed(futures):
                art = fut.result()
                if art:
                    lang_articles.append(art)

        save_articles_to_csv(lang_articles, f"{output_dir}/{lang}_ilr.csv")
        all_articles.extend(lang_articles)

    save_articles_to_csv(all_articles, f"{output_dir}/all_languages_ilr.csv")
    return pd.DataFrame(all_articles)

# --------------------------------------------
# 🌐 CONFIG EXAMPLE (Spanish + Russian)
# --------------------------------------------
spanish_sources = [
    ('https://elpais.com', '/noticias/', 'https://elpais.com'),
    ('https://www.elmundo.es', '/noticias/', 'https://www.elmundo.es')
]
russian_sources = [
    ('https://ria.ru', '/20', 'https://ria.ru'),
    ('https://www.gazeta.ru', '/news/', 'https://www.gazeta.ru')
]
sources_config = [
    {'lang': 'es', 'sites': spanish_sources},
    {'lang': 'ru', 'sites': russian_sources}
]

# --------------------------------------------
# 🚀 RUN
# --------------------------------------------
if __name__ == "__main__":
    df = scrape_multilingual_ai(sources_config, model_name="gpt-5")
    display(Markdown("### ✅ Analysis Complete"))
    display(df.head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



🌍 Scraping + Analyzing ES (2 sites)
ILR analysis failed for https://elpais.com/noticias/latinoamerica/: Error code: 503 - {'message': 'The requested model is currently unavailable.', 'type': 'invalid_request_error'}
✅ Saved 1 → scraped_ilr/es_ilr.csv

🌍 Scraping + Analyzing RU (2 sites)
ILR analysis failed for https://ria.ru/20251112/vsu-2054591514.html: Error code: 503 - {'message': 'The requested model is currently unavailable.', 'type': 'invalid_request_error'}
ILR analysis failed for https://ria.ru/20251112/moskva-2054513461.html: Error code: 503 - {'message': 'The requested model is currently unavailable.', 'type': 'invalid_request_error'}
ILR analysis failed for https://rsport.ria.ru/20251112/kolesnikov-2054590008.html: Error code: 503 - {'message': 'The requested model is currently unavailable.', 'type': 'invalid_request_error'}
ILR analysis failed for https://ria.ru/20251112/nebenzya-2054386054.html: Error code: 503 - {'message': 'The requested model is currently una

### ✅ Analysis Complete

Unnamed: 0,title,summary,text,language,link,ilr_level,reasoning,translation,topics
0,Latinoamérica en EL PAÍS,El vehículo colisionó contra una camioneta en ...,El vehículo colisionó contra una camioneta en ...,es,https://elpais.com/noticias/latinoamerica/,Error,,,[]
1,На Западе раскрыли угрозу безопасности ЕС от с...,Выводы Скиннера подкрепляются результатами исс...,Выводы Скиннера подкрепляются результатами исс...,ru,https://ria.ru/20251112/vsu-2054591514.html,Error,,,[]
2,"Москва видит, что Европа готовится к войне с Р...",В последние годы Россия отмечает беспрецедентн...,В последние годы Россия отмечает беспрецедентн...,ru,https://ria.ru/20251112/moskva-2054513461.html,Error,,,[]
3,Колесников взял золото чемпионата России на ко...,"Если вы не согласны с блокировкой, воспользуйт...","∞ . Если вы не согласны с блокировкой, восполь...",ru,https://rsport.ria.ru/20251112/kolesnikov-2054...,Error,,,[]
4,Василий Небензя: прекращение огня на Украине н...,— Расскажите об итогах председательства России...,Россия не видит для ООН роли в урегулировании ...,ru,https://ria.ru/20251112/nebenzya-2054386054.html,Error,,,[]
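
`scrape_website` builds absolute links with `base_url + link`, which only covers hrefs that start with a single `/`. The run above also collected a link on `rsport.ria.ru`, so other href forms such as protocol-relative `//host/path` links are worth handling. A minimal sketch of that step using the standard library's `urljoin` is below. `resolve_links` is a hypothetical helper, not part of the notebook, and the raw href forms in the sample list are assumed for illustration (the article paths are taken from the log above).

```python
from urllib.parse import urljoin, urlparse

def resolve_links(hrefs, page_url, link_criteria):
    """Resolve raw href values against page_url; keep those containing link_criteria.

    Hypothetical helper: urljoin handles '/path', '//host/path', and
    'relative/path' hrefs, and leaves absolute URLs unchanged.
    """
    resolved = set()
    for href in hrefs:
        if link_criteria not in href:
            continue
        absolute = urljoin(page_url, href)
        # Drop non-web schemes such as mailto: or javascript:.
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.add(absolute)
    return sorted(resolved)

# Sample hrefs (assumed forms) as they might appear in <a href="..."> attributes.
hrefs = [
    "/20251112/vsu-2054591514.html",
    "//rsport.ria.ru/20251112/kolesnikov-2054590008.html",
    "https://ria.ru/20251112/moskva-2054513461.html",
    "mailto:news@example.com?subject=/20",
    "/about/",
]
links = resolve_links(hrefs, "https://ria.ru", "/20")
for link in links:
    print(link)
```

Inside `scrape_website`, the same idea would replace the `startswith('/')` branch with `urljoin(base_url or url, link)`; the set-based de-duplication mirrors the existing `list(set(urls))`.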


In [20]:
# ============================================================
# 🌍 Multilingual Scraper + Auto Gemini ILR Analyzer (Colab)
# ============================================================

!pip install newspaper3k lxml_html_clean beautifulsoup4 requests pandas nltk -q
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

# ============================================================
import requests, csv, os, json, re, time, pandas as pd
from bs4 import BeautifulSoup
from newspaper import Article
from google.colab import ai
from IPython.display import display, Markdown

# --------------------------------------------
# 🔹 Auto-select the first working Gemini model (with fallback)
# --------------------------------------------

available_models = ai.list_models()
fallback_default = "gemini-2.5-flash"

def get_working_model():
    """Tries available models in order until one responds successfully."""
    if not available_models:
        print(f"⚠️ No models returned by ai.list_models(). Using default: {fallback_default}")
        return fallback_default

    print(f"🔍 Available models: {available_models}")
    for m in available_models:
        print(f"⏳ Testing model: {m}")
        try:
            # Simple ping test: one short generation
            prompt = "Respond with the single word: READY"
            raw = ""
            for chunk in ai.generate_text(prompt=prompt, model_name=m, stream=False):
                if chunk:
                    raw += chunk
            if "READY" in raw.upper():
                print(f"✅ Using Gemini model: {m}")
                return m
            else:
                print(f"⚠️ Model {m} responded unexpectedly; skipping.")
        except Exception as e:
            print(f"❌ Model {m} failed: {e}")
            continue

    print(f"⚠️ All models failed. Falling back to {fallback_default}")
    return fallback_default

model_name = get_working_model()

# --------------------------------------------
# 🔹 JSON parsing utility
# --------------------------------------------
def safe_json_parse(raw):
    raw = re.sub(r'```(json)?', '', raw).strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r'\{.*\}', raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
        raise ValueError("Invalid JSON returned by model")

# --------------------------------------------
# 🔹 Gemini ILR analysis function (auto model)
# --------------------------------------------
def gemini_analyze_text(text, retries=3, delay=8):
    """Use Gemini for ILR, translation, and topic modeling."""
    prompt = f"""
You are a multilingual AI evaluator and translator.
1. Identify language (ISO code + name)
2. If Arabic, specify dialect (MSA, Egyptian, Levantine, Maghrebi, Gulf, etc.)
3. Provide English translation
4. Assign ILR Speaking Level (0–5 or +)
5. Explain reasoning
6. Extract 3–5 key topics with keywords and summaries.
Respond ONLY in JSON:
{{
  "language": {{"code": "xx", "name": "...", "dialect": "..."}},
  "translation": "...",
  "level": "ILR-3",
  "reasoning": "...",
  "topics": [{{"topic": "...", "keywords": ["..."], "summary": "..."}}]
}}
Text:
{text}
"""
    for attempt in range(1, retries + 1):
        try:
            raw = ""
            for chunk in ai.generate_text(prompt=prompt, model_name=model_name, stream=False):
                if chunk:
                    raw += chunk
            return safe_json_parse(raw)
        except Exception as e:
            if "unavailable" in str(e).lower() or "503" in str(e):
                print(f"⚠️ Gemini unavailable. Retrying in {delay}s (attempt {attempt}/{retries})")
                time.sleep(delay)
            else:
                print(f"❌ Permanent failure: {e}")
                return {"level": "Error", "reasoning": "", "translation": "", "topics": []}
    print("❌ Failed after all retries.")
    return {"level": "Error", "reasoning": "", "translation": "", "topics": []}

# --------------------------------------------
# 🔹 Scraping utilities
# --------------------------------------------
def get_article(url, language):
    article = Article(url, language=language)
    try:
        article.download()
        article.parse()
        article.nlp()
        if len(article.text) < 300:
            raise ValueError("Too short.")
        return {'title': article.title, 'summary': article.summary,
                'text': article.text, 'language': language, 'link': url}
    except Exception as e:
        print(f"[{language.upper()}] Failed: {e}")
        return None

def scrape_website(url, link_criteria, base_url=None):
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        res = requests.get(url, headers=headers, timeout=10)
        res.raise_for_status()
        soup = BeautifulSoup(res.content, 'html.parser')
        urls = []
        for a in soup.find_all('a', href=True):
            link = a['href']
            if link_criteria in link:
                if base_url and link.startswith('/'):
                    link = base_url + link
                urls.append(link)
        return list(set(urls))
    except Exception as e:
        print(f"Failed {url}: {e}")
        return []

def save_articles_to_csv(articles, filename):
    # Create the parent directory only if the path has one; os.makedirs('') raises.
    out_dir = os.path.dirname(filename)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    keys = ['title','summary','text','language','link','ilr_level','reasoning','translation','topics']
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(articles)
    print(f"✅ Saved {len(articles)} → {filename}")

# --------------------------------------------
# 🔹 Scrape + ILR analyze
# --------------------------------------------
def fetch_and_analyze(url, lang):
    art = get_article(url, lang)
    if not art:
        return None
    try:
        ilr = gemini_analyze_text(art['text'])
        art['ilr_level'] = ilr.get('level', 'Unknown')
        art['reasoning'] = ilr.get('reasoning', '')
        art['translation'] = ilr.get('translation', '')
        art['topics'] = json.dumps(ilr.get('topics', []), ensure_ascii=False)
    except Exception as e:
        print(f"ILR analysis failed for {url}: {e}")
        art.update({'ilr_level': 'Error', 'reasoning': '', 'translation': '', 'topics': '[]'})
    return art

def scrape_multilingual_gemini(sources_config, output_dir="scraped_ilr"):
    all_articles = []
    for conf in sources_config:
        lang = conf['lang']
        sites = conf['sites']
        lang_articles = []
        print(f"\n🌍 Scraping {lang.upper()}: {len(sites)} sources")

        urls = []
        for s in sites:
            urls.extend(scrape_website(s[0], s[1], s[2]))

        for u in urls:
            art = fetch_and_analyze(u, lang)
            if art:
                lang_articles.append(art)

        save_articles_to_csv(lang_articles, f"{output_dir}/{lang}_ilr.csv")
        all_articles.extend(lang_articles)

    save_articles_to_csv(all_articles, f"{output_dir}/all_languages_ilr.csv")
    display(Markdown("### ✅ Gemini ILR Analysis Complete"))
    return pd.DataFrame(all_articles)

# --------------------------------------------
# 🌐 Example: Spanish + Russian
# --------------------------------------------

spanish_news_sources = [
    # Spain
    ('https://elpais.com', '/noticias/', 'https://elpais.com'),
    ('https://www.elmundo.es', '/noticias/', 'https://www.elmundo.es'),
    ('https://www.abc.es', '/noticias/', 'https://www.abc.es'),
    ('https://www.lavanguardia.com', '/noticias/', 'https://www.lavanguardia.com'),
    ('https://www.elconfidencial.com', '/noticias/', 'https://www.elconfidencial.com'),
    ('https://www.eldiario.es', '/noticias/', 'https://www.eldiario.es'),
    ('https://www.publico.es', '/noticias/', 'https://www.publico.es'),
    ('https://www.rtve.es', '/noticias/', 'https://www.rtve.es'),
    ('https://www.elespanol.com', '/noticias/', 'https://www.elespanol.com'),
    ('https://www.larazon.es', '/noticias/', 'https://www.larazon.es'),
    ('https://www.cuartopoder.es', '/noticias/', 'https://www.cuartopoder.es'),
    ('https://www.infolibre.es', '/noticias/', 'https://www.infolibre.es'),
    ('https://www.elnortedecastilla.es', '/noticias/', 'https://www.elnortedecastilla.es'),
    ('https://www.diariovasco.com', '/noticias/', 'https://www.diariovasco.com'),
    ('https://www.elperiodico.com', '/noticias/', 'https://www.elperiodico.com'),

    # Argentina
    ('https://www.clarin.com', '/noticias/', 'https://www.clarin.com'),
    ('https://www.lanacion.com.ar', '/noticias/', 'https://www.lanacion.com.ar'),
    ('https://www.pagina12.com.ar', '/noticias/', 'https://www.pagina12.com.ar'),
    ('https://www.perfil.com', '/noticias/', 'https://www.perfil.com'),

    # Colombia
    ('https://www.eltiempo.com', '/noticias/', 'https://www.eltiempo.com'),
    ('https://www.semana.com', '/noticias/', 'https://www.semana.com'),
    ('https://www.elespectador.com', '/noticias/', 'https://www.elespectador.com'),
    ('https://www.elheraldo.co', '/noticias/', 'https://www.elheraldo.co'),
    ('https://www.larepublica.co', '/noticias/', 'https://www.larepublica.co'),

    # Mexico
    ('https://www.eluniversal.com.mx', '/noticias/', 'https://www.eluniversal.com.mx'),
    ('https://www.excelsior.com.mx', '/noticias/', 'https://www.excelsior.com.mx'),
    ('https://www.milenio.com', '/noticias/', 'https://www.milenio.com'),
    ('https://www.proceso.com.mx', '/noticias/', 'https://www.proceso.com.mx'),
    ('https://www.elsoldemexico.com.mx', '/noticias/', 'https://www.elsoldemexico.com.mx'),
    ('https://www.elfinanciero.com.mx', '/noticias/', 'https://www.elfinanciero.com.mx'),
    ('https://www.animalpolitico.com', '/noticias/', 'https://www.animalpolitico.com'),

    # Peru
    ('https://www.elcomercio.pe', '/noticias/', 'https://www.elcomercio.pe'),
    ('https://www.larepublica.pe', '/noticias/', 'https://www.larepublica.pe'),

    # Chile
    ('https://www.emol.com', '/noticias/', 'https://www.emol.com'),
    ('https://www.latercera.com', '/noticias/', 'https://www.latercera.com'),

    # Venezuela
    ('https://www.el-nacional.com', '/noticias/', 'https://www.el-nacional.com'),
    ('https://www.ultimasnoticias.com.ve', '/noticias/', 'https://www.ultimasnoticias.com.ve'),

    # Paraguay
    ('https://www.ultimahora.com', '/noticias/', 'https://www.ultimahora.com'),
    ('https://www.abc.com.py', '/noticias/', 'https://www.abc.com.py'),

    # Uruguay
    ('https://elpais.com.uy', '/noticias/', 'https://elpais.com.uy'),
    ('https://www.elobservador.com.uy', '/noticias/', 'https://www.elobservador.com.uy'),

    # Costa Rica
    ('https://www.elpais.cr', '/noticias/', 'https://www.elpais.cr'),
    ('https://www.nacion.com', '/noticias/', 'https://www.nacion.com'),

    # Guatemala
    ('https://www.prensalibre.com', '/noticias/', 'https://www.prensalibre.com'),
    ('https://www.soy502.com', '/noticias/', 'https://www.soy502.com'),

    # International Spanish-language
    ('https://cnnespanol.cnn.com', '/noticias/', 'https://cnnespanol.cnn.com'),
    ('https://www.bbc.com/mundo', '/noticias/', 'https://www.bbc.com/mundo'),
    ('https://es.euronews.com', '/noticias/', 'https://es.euronews.com'),
    ('https://www.dw.com/es', '/noticias/', 'https://www.dw.com/es'),
    ('https://www.nytimes.com/es', '/noticias/', 'https://www.nytimes.com/es'),

    # News for Spanish learners
    ('https://www.newsinslowspanish.com/latino', '/', 'https://www.newsinslowspanish.com/latino'),
    ('https://www.veintemundos.com', '/', 'https://www.veintemundos.com'),
    ('https://www.ver-taal.com', '/', 'https://www.ver-taal.com'),

    # Pop culture news
    ('https://www.telemundo.com/noticias', '/', 'https://www.telemundo.com/noticias'),
    ('https://www.univision.com', '/', 'https://www.univision.com'),
    ('https://www.revistacuore.com', '/', 'https://www.revistacuore.com'),
    ('https://www.lecturas.com', '/', 'https://www.lecturas.com'),
    ('https://www.caras.cl', '/', 'https://www.caras.cl'),

    # Magazines
    ('https://www.vogue.es', '/', 'https://www.vogue.es'),
    ('https://www.peopleenespanol.com', '/', 'https://www.peopleenespanol.com'),
    ('https://www.cosmohispano.com', '/', 'https://www.cosmohispano.com'),
    ('https://www.gq.com.mx', '/', 'https://www.gq.com.mx'),

    # Sports
    ('https://espndeportes.espn.com', '/', 'https://espndeportes.espn.com'),
    ('https://www.foxdeportes.com', '/', 'https://www.foxdeportes.com'),
    ('https://www.marca.com', '/noticias/', 'https://www.marca.com'),
    ('https://www.as.com', '/noticias/', 'https://www.as.com'),
    ('https://www.sport.es', '/noticias/', 'https://www.sport.es'),
    ('https://www.mundodeportivo.com', '/noticias/', 'https://www.mundodeportivo.com'),

    # Business & Economy
    ('https://www.expansion.com', '/noticias/', 'https://www.expansion.com'),
    ('https://www.eleconomista.es', '/noticias/', 'https://www.eleconomista.es'),
    ('https://www.cincodias.elpais.com', '/noticias/', 'https://www.cincodias.elpais.com'),
    ('https://www.portafolio.co', '/noticias/', 'https://www.portafolio.co'),
    ('https://www.larepublica.pe', '/noticias/', 'https://www.larepublica.pe'),
    ('https://www.elfinanciero.com.mx', '/noticias/', 'https://www.elfinanciero.com.mx'),
    ('https://www.iprofesional.com', '/noticias/', 'https://www.iprofesional.com'),
    ('https://www.gestion.pe', '/noticias/', 'https://www.gestion.pe'),
    ('https://www.mercado.com.ar', '/noticias/', 'https://www.mercado.com.ar'),
    ('https://www.finanzas.com', '/noticias/', 'https://www.finanzas.com'),
    ('https://www.americaeconomia.com', '/noticias/', 'https://www.americaeconomia.com'),
    ('https://www.dinero.com', '/noticias/', 'https://www.dinero.com'),
    ('https://www.infobae.com/economia', '/noticias/', 'https://www.infobae.com/economia'),

    # Travel & Transportation
    ('https://www.hosteltur.com', '/noticias/', 'https://www.hosteltur.com'),
    ('https://www.preferente.com', '/noticias/', 'https://www.preferente.com'),
    ('https://www.reportur.com', '/noticias/', 'https://www.reportur.com'),
    ('https://www.viajestic.com', '/noticias/', 'https://www.viajestic.com'),
    ('https://www.losviajeros.com', '/noticias/', 'https://www.losviajeros.com'),
    ('https://www.turismodeexperiencias.com', '/noticias/', 'https://www.turismodeexperiencias.com'),
    ('https://www.elviajero.elpais.com', '/noticias/', 'https://www.elviajero.elpais.com'),
    ('https://www.traveler.es', '/noticias/', 'https://www.traveler.es'),
    ('https://www.latitudperfecta.com', '/noticias/', 'https://www.latitudperfecta.com'),
    ('https://www.miviajeporelmundo.com', '/noticias/', 'https://www.miviajeporelmundo.com'),
    ('https://www.revistaviajar.es', '/noticias/', 'https://www.revistaviajar.es'),
    ('https://www.turiscom.org', '/noticias/', 'https://www.turiscom.org'),

    # Social & Diplomatic Affairs
    ('https://www.diplomaticouruguay.com', '/noticias/', 'https://www.diplomaticouruguay.com'),
    ('https://www.elperiodico.com', '/noticias/', 'https://www.elperiodico.com'),  # also in Spain
    ('https://www.elobservador.com.uy', '/noticias/', 'https://www.elobservador.com.uy'),
    ('https://www.revistadeoccidente.com', '/noticias/', 'https://www.revistadeoccidente.com'),
    ('https://www.prensalatina.com.br', '/noticias/', 'https://www.prensalatina.com.br'),
    ('https://www.revistainterforum.com', '/noticias/', 'https://www.revistainterforum.com'),
    ('https://www.diplomaciaenlinea.com', '/noticias/', 'https://www.diplomaciaenlinea.com'),
    ('https://www.eldiariodelaembajada.com', '/noticias/', 'https://www.eldiariodelaembajada.com'),
    ('https://www.embajadasyconsulados.com', '/noticias/', 'https://www.embajadasyconsulados.com'),
    ('https://www.internationaldiplomacy.com', '/noticias/', 'https://www.internationaldiplomacy.com'),

    # Emergency, Legal & Police News
    ('https://www.diariojudicial.com', '/noticias/', 'https://www.diariojudicial.com'),
    ('https://www.lajornadadeoriente.com.mx', '/noticias/', 'https://www.lajornadadeoriente.com.mx'),
    ('https://www.legis.pe', '/noticias/', 'https://www.legis.pe'),
    ('https://www.abogacia.es', '/noticias/', 'https://www.abogacia.es'),
    ('https://www.eljurista.eu', '/noticias/', 'https://www.eljurista.eu'),
    ('https://www.juristaweb.com', '/noticias/', 'https://www.juristaweb.com'),
    ('https://www.lexlatin.com', '/noticias/', 'https://www.lexlatin.com'),
    ('https://www.legaltoday.com', '/noticias/', 'https://www.legaltoday.com'),
    ('https://www.derechoaldia.com.ar', '/noticias/', 'https://www.derechoaldia.com.ar'),
    ('https://www.noticiasjuridicas.com', '/noticias/', 'https://www.noticiasjuridicas.com'),
    ('https://www.laverdadlegal.com', '/noticias/', 'https://www.laverdadlegal.com'),

    # Shopping & Bargaining
    ('https://www.compradiccion.com', '/noticias/', 'https://www.compradiccion.com'),
    ('https://www.soydecompras.com', '/noticias/', 'https://www.soydecompras.com'),
    ('https://www.ahorradoras.com', '/noticias/', 'https://www.ahorradoras.com'),
    ('https://www.ofertaman.com', '/noticias/', 'https://www.ofertaman.com'),
    ('https://www.chollometro.com', '/noticias/', 'https://www.chollometro.com'),
    ('https://www.promocionesdescuentos.com', '/noticias/', 'https://www.promocionesdescuentos.com'),
    ('https://www.gangasparahogar.com', '/noticias/', 'https://www.gangasparahogar.com'),
    ('https://www.descontalia.com', '/noticias/', 'https://www.descontalia.com'),
    ('https://www.ofertitas.es', '/noticias/', 'https://www.ofertitas.es'),
    ('https://www.cuponation.com.mx', '/noticias/', 'https://www.cuponation.com.mx'),
]
russian_sources = [
    ('https://ria.ru', '/20', 'https://ria.ru'),
    ('https://www.gazeta.ru', '/news/', 'https://www.gazeta.ru'),
    ('https://www.fontanka.ru', '/20', 'https://www.fontanka.ru'),
    ('https://iz.ru', '/news/', 'https://iz.ru'),
    ('https://tass.ru', '/proisshestviya/', 'https://tass.ru')
]
sources_config = [
    {'lang': 'es', 'sites': spanish_sources},
    {'lang': 'ru', 'sites': russian_sources}
]
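
Some entries in the source lists above appear under more than one category (for example, `https://www.larepublica.pe` is listed under both Peru and Business & Economy), so a scraper that walks the list as-is may fetch the same site twice. The following is a minimal sketch of an order-preserving de-duplication pass. The helper name `dedupe_sources` and the reading of each tuple as (start URL, link filter, base URL) are assumptions for illustration, not part of the original notebook.

```python
def dedupe_sources(sites):
    """Drop repeated source tuples, keeping the first occurrence of each.

    Assumes each entry is a hashable tuple such as
    (start_url, link_filter, base_url); that field naming is hypothetical.
    """
    seen = set()
    unique = []
    for site in sites:
        if site not in seen:
            seen.add(site)
            unique.append(site)
    return unique


# Small sample mirroring the tuple structure used in spanish_sources.
sample_sources = [
    ('https://www.larepublica.pe', '/noticias/', 'https://www.larepublica.pe'),
    ('https://www.gestion.pe', '/noticias/', 'https://www.gestion.pe'),
    ('https://www.larepublica.pe', '/noticias/', 'https://www.larepublica.pe'),
]

unique_sources = dedupe_sources(sample_sources)
print(len(sample_sources), "->", len(unique_sources))
```

If this approach fits, it could be applied to each `sites` list before building `sources_config`, so the category comments in the lists stay as documentation while each site is visited once.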
	ILR-3	The text contains a variety of topics with moderate complexity including news reports, opinion pieces, and political analysis. A speaker at ILR level 3 can handle general conversation and factual topics, and while they might need to pause or search for words occasionally, the overall comprehension and production are adequate. The vocabulary and sentence structure are varied and require some nuanced understanding.	Governor Samuel García reports that the State will disburse 500 million pesos to develop an artificial intelligence cluster. A bus accident leaves 37 dead and about twenty injured in Peru Renzo Gómez Vega | Lima | The vehicle collided with a van on the Pan-American Highway South. Among the survivors is an eight-month-old child. Morena Revocation: what the opposition doesn't learn Salvador Camarena | Underestimating Morena in their lack of scruples when mixing party and Government in an election with a recall could be costly for the opposition. Armero Wounds that do not heal (II): the Armero tragedy Guillermo Pérez Flórez | In Armero everything was a mistake. And, as in the Palace of Justice, the wounds remain open because there has been no truth, justice, or reparation. Elections in Colombia The internal disputes of the Colombian right hinder their path to the presidential elections Juan Miguel Hernández Bonilla | Bogotá | The internal battles in the Conservative and Democratic Center parties, the insults between Abelardo de la Espriella and Vicky Dávila, and the disagreement of the former governors make it difficult to elect a single candidate to face the left. Editorial CELAC-EU Summit, a wasted opportunity El País | Many countries in Europe and Latin America say they are looking for alternatives to Trump's imposing diplomacy, but they tarnish the meeting in Santa Marta. 
Pinochet Dictatorship "I still cry while sleeping" Elizabeth Subercaseaux | If former soldiers like Miguel Krassnoff are imprisoned, it is not because "they are not politically well-liked," as Johannes Kaiser says, but because they tortured and murdered men and women because they did not agree with their ideas. Seizure of the Palace of Justice The United States and the secret archives of the Palace of Justice E. Andrés Celis R | Part of the truth about what happened in Colombia on November 6 and 7, 1985, and which we still do not know, lies in documents that the United States keeps as reserved.	[{"topic": "Artificial Intelligence Development in Nuevo León, Mexico", "keywords": ["Samuel García", "Nuevo León", "artificial intelligence", "cluster", "investment"], "summary": "Governor Samuel García announces a 500
# --------------------------------------------
# 🚀 Run
# --------------------------------------------
df = scrape_multilingual_gemini(sources_config)
display(df.head())


✅ Using Gemini model: google/gemini-2.0-flash

🌍 Scraping ES - 5 sources
[ES] Failed: Too short.
[ES] Failed: Too short.
✅ Saved 12 → scraped_ilr/es_ilr.csv

🌍 Scraping RU - 5 sources
[RU] Failed: Too short.
[RU] Failed: Too short.
[RU] Failed: Too short.
[RU] Failed: Too short.
❌ Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
❌ Permanent failure: Invalid JSON returned by model
❌ Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
❌ Permanent failure: Invalid JSON returned by model
❌ Permanent failure: Invalid JSON returned by model
❌ Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
❌ Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
[RU] Failed: Too short.
[RU] Failed: Too short.
❌ Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
❌ Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
[RU] Failed: Too short.
[R
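
The run log above mixes two kinds of failure: per-article rejections tagged with a language code (`Too short.`) and untagged `Permanent failure` lines. A small tally can make it easier to see which language and which failure reason dominate. The sketch below is one way to do that; the helper name `tally_failures` is hypothetical, and it assumes that an untagged `Permanent failure` line belongs to the most recent `Scraping XX` header, which the log format suggests but does not guarantee.

```python
import re
from collections import Counter

# Matches lines such as "[RU] Failed: Too short."
TAGGED_FAILURE = re.compile(r'\[(?P<lang>[A-Z]{2})\] Failed: (?P<reason>.+)')
# Matches section headers such as "Scraping RU - 5 sources"
SECTION_HEADER = re.compile(r'Scraping (?P<lang>[A-Z]{2})\b')


def tally_failures(log_text):
    """Count failures per (language, reason) pair in a scraper log."""
    counts = Counter()
    current_lang = None  # language of the most recent "Scraping XX" header
    for line in log_text.splitlines():
        line = line.strip()
        header = SECTION_HEADER.search(line)
        if header:
            current_lang = header.group('lang')
            continue
        tagged = TAGGED_FAILURE.search(line)
        if tagged:
            reason = tagged.group('reason').rstrip('.')
            counts[(tagged.group('lang'), reason)] += 1
        elif 'Permanent failure:' in line:
            reason = line.split('Permanent failure:', 1)[1].strip()
            counts[(current_lang, reason)] += 1
    return counts


# Short excerpt in the same shape as the log above.
sample_log = """
Scraping RU - 5 sources
[RU] Failed: Too short.
Permanent failure: Invalid JSON returned by model
[RU] Failed: Too short.
"""

failure_counts = tally_failures(sample_log)
for (lang, reason), n in failure_counts.most_common():
    print(lang, reason, n)
```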

### ✅ Gemini ILR Analysis Complete

Unnamed: 0,title,summary,text,language,link,ilr_level,reasoning,translation,topics
0,Latinoamérica en EL PAÍS,El gobernador Samuel García informa que el Est...,El gobernador Samuel García informa que el Est...,es,https://elpais.com/noticias/latinoamerica/,ILR-3,The text contains a variety of topics with mod...,Governor Samuel García reports that the State ...,"[{""topic"": ""Artificial Intelligence Developmen..."
1,"Teresa: ""De pequeña siempre me hacían comentar...","""Tanto mi hermano como yo nos parecemos muchís...",La 'influencer' Teresa Sanz ha compartido una ...,es,https://www.cope.es/actualidad/salud-bienestar...,ILR-3,The text uses relatively complex vocabulary an...,The 'influencer' Teresa Sanz has shared a prof...,"[{""topic"": ""Self-Acceptance and Beauty Standar..."
2,Carles Porta: «Hay muchos parásitos en el 'tru...,"El dominio del arte del 'true crime', su insti...",Hay olfato periodístico y luego está el de Car...,es,https://www.abc.es/play/series/noticias/carles...,ILR-4,"The text uses complex vocabulary, nuanced argu...","There's journalistic instinct, and then there'...","[{""topic"": ""Carles Porta's True Crime Work"", ""..."
3,"Elena López, psicóloga: ""Si sientes que duerme...",Si sientes que duermes mejor cuando estás con ...,Si sientes que duermes mejor cuando estás con ...,es,https://www.cope.es/actualidad/salud-bienestar...,ILR-3,The text uses relatively complex sentence stru...,If you feel that you sleep better when you are...,"[{""topic"": ""Sleep and Relationships"", ""keyword..."
4,Arturo Pérez-Reverte recomienda esta serie de ...,"En esta ocasión, ha sido Arturo Pérez-Reverte ...",Netflix se ha consolidado como una de las plat...,es,https://www.abc.es/play/series/noticias/arturo...,ILR-3,The text demonstrates general professional pro...,Netflix has established itself as one of the m...,"[{""topic"": ""Netflix and Streaming"", ""keywords""..."
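
In the preview above, the `topics` column holds a JSON list serialized as a string, with the inner quotes doubled by CSV escaping. When the saved file is read back, that column arrives as plain text and needs to be decoded before the topics can be used. The sketch below shows one way to do this with the standard library only. The helper name `load_rows` and the sample row (including its keyword list) are made up for illustration; only the column names `ilr_level` and `topics` come from the output above.

```python
import csv
import io
import json


def load_rows(csv_text):
    """Parse CSV text and decode the JSON-encoded 'topics' column.

    Rows whose 'topics' value is missing or not valid JSON get an
    empty list instead of raising.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            row['topics'] = json.loads(row.get('topics'))
        except (json.JSONDecodeError, TypeError):
            row['topics'] = []
        rows.append(row)
    return rows


# Hypothetical two-row sample; the first row uses CSV-style doubled quotes.
sample_csv = (
    'title,language,ilr_level,topics\n'
    'Example title,es,ILR-3,'
    '"[{""topic"": ""Sleep and Relationships"", ""keywords"": [""sleep""]}]"\n'
    'Broken row,es,ILR-3,not json\n'
)

rows = load_rows(sample_csv)
print(rows[0]['ilr_level'], rows[0]['topics'][0]['topic'])
print(rows[1]['topics'])
```

Falling back to an empty list keeps one malformed row from aborting the whole load, at the cost of hiding the error; logging the offending row would be a reasonable alternative.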
