# wiki article

> Fetch a Wikipedia article and analyze it to produce ratings based on the reader's background.

In [None]:
#|default_exp wiki

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| export
import httpx
from bs4 import BeautifulSoup
import html2text
from IPython.display import Markdown

## WikiArticle

In [None]:
#| export
class WikiArticle:
    "Grab a wikipedia article to analyze."
    def __init__(self, url):
        if not url.startswith('https://en.wikipedia.org/wiki/'): 
            raise ValueError("Must be English Wikipedia URL")
        self.url = url
        self._soup = None
        
    @property
    def soup(self):
        if self._soup is None:
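            resp = httpx.get(self.url, follow_redirects=True)
            resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
            self._soup = BeautifulSoup(resp.text, 'lxml')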
        return self._soup
    
    @property
    def title(self): return self.soup.find('h1', id='firstHeading').text.strip()
    
    @property
    def introduction(self):
        "Select an introduction from the `WikiArticle`."
        content = self.soup.select_one("#mw-content-text > div.mw-content-ltr.mw-parser-output")
        paragraphs = []
        for p in content.find_all('p'):
            if p.find_previous('div', class_='mw-heading mw-heading2'): break
            if text := p.text.strip(): paragraphs.append(text)
        return '\n\n'.join(paragraphs)

In [None]:
show_doc(WikiArticle)

---

### WikiArticle

>      WikiArticle (url)

*Grab a wikipedia article to analyze.*

In [None]:
show_doc(WikiArticle.introduction)

---

### WikiArticle.introduction

>      WikiArticle.introduction ()

*Select an introduction from the `WikiArticle`.*

In [None]:
article = WikiArticle("https://en.wikipedia.org/wiki/Evolution_of_snake_venom")
print(article.title)
print(article.introduction)

Evolution of snake venom
Venom in snakes and some lizards is a form of saliva that has been modified into venom over its evolutionary history.[1] In snakes, venom has evolved to kill or subdue prey, as well as to perform other diet-related functions.[2] While snakes occasionally use their venom in self defense, this is not believed to have had a strong effect on venom evolution.[3] The evolution of venom is thought to be responsible for the enormous expansion of snakes across the globe.[4][5][6]

The evolutionary history of snake venom is a matter of debate. Historically, snake venom was believed to have evolved once, at the base of the Caenophidia, or derived snakes. Molecular studies published beginning in 2006 suggested that venom originated just once among a putative clade of reptiles, called Toxicofera, approximately 170 million years ago.[7] Under this hypothesis, the original toxicoferan venom was a very simple set of proteins that were assembled in a pair of glands. Subsequentl

In [None]:
intros = article.introduction
intros

'Venom in snakes and some lizards is a form of saliva that has been modified into venom over its evolutionary history.[1] In snakes, venom has evolved to kill or subdue prey, as well as to perform other diet-related functions.[2] While snakes occasionally use their venom in self defense, this is not believed to have had a strong effect on venom evolution.[3] The evolution of venom is thought to be responsible for the enormous expansion of snakes across the globe.[4][5][6]\n\nThe evolutionary history of snake venom is a matter of debate. Historically, snake venom was believed to have evolved once, at the base of the Caenophidia, or derived snakes. Molecular studies published beginning in 2006 suggested that venom originated just once among a putative clade of reptiles, called Toxicofera, approximately 170 million years ago.[7] Under this hypothesis, the original toxicoferan venom was a very simple set of proteins that were assembled in a pair of glands. Subsequently, this set of protein

## Using Claudette

Using `claudette`, we can analyze the article. The analysis returns the following fields:

- interest_rating: Rating 1-10 of how interesting the article is for this reader based on the background
- interest_reason: Markdown explanation for interest rating (max 50 words) for this reader based on the background
- difficulty_rating: Rating 1-10 of how difficult the article is for this reader based on the background
- difficulty_reason: Markdown explanation for difficulty rating (max 50 words) for this reader based on the background
- prerequisites: List of topics reader should know before reading for this reader based on the background
- prereq_reason: Markdown explanation for prerequisites (max 50 words) for this reader based on the background
  

In [None]:
#| export
from claudette import Chat, Client, models
from fastcore.utils import *

In [None]:
models

['claude-3-opus-20240229',
 'claude-3-5-sonnet-20241022',
 'claude-3-haiku-20240307',
 'claude-3-5-haiku-20241022']

Using a Haiku model is not recommended here, as it is not reliable for this task.

In [None]:
client = Client(models[1])

In [None]:
#| export
class ArticleAnalysis:
    "Analysis of a Wikipedia article for a reader based on the background."
    def __init__(self,
                interest_rating: int,        # Rating 1-10 of how interesting the article is for this reader based on the background
                interest_reason: str,        # Markdown explanation for interest rating (max 50 words) for this reader based on the background
                difficulty_rating: int,      # Rating 1-10 of how difficult the article is for this reader based on the background
                difficulty_reason: str,      # Markdown explanation for difficulty rating (max 50 words) for this reader based on the background
                prerequisites: list[str],    # List of topics reader should know before reading for this reader based on the background
                prereq_reason: str,          # Markdown explanation for prerequisites (max 50 words) for this reader based on the background
    ):
        assert 1 <= interest_rating <= 10, "Interest rating must be between 1 and 10"
        assert 1 <= difficulty_rating <= 10, "Difficulty rating must be between 1 and 10"
        store_attr()
        
    __repr__ = basic_repr('interest_rating, interest_reason, difficulty_rating, difficulty_reason, prerequisites, prereq_reason')

In [None]:
#| export
def analyze_article_for_reader(article_text: str, background: str) -> ArticleAnalysis:
    "Analyze a Wikipedia article for a specific reader background"
    prompt = f"""Here's a Wikipedia article introduction:

<problem>
Analyze this article introduction for the given reader background. Provide:
1. Interest rating (1-10) with brief reason
2. Difficulty rating (1-10) with brief reason
3. Prerequisites needed, with reason why they're important
Keep all explanations under 50 words.
</problem>

<article>
{article_text}
</article>

<reader_background>
{background}
</reader_background>
"""
    
    return client.structured(prompt, ArticleAnalysis)[0]

In [None]:
backgrounds = {
    'high_school': """Background of the reader:
- High school graduate
- Interested in science but no formal training beyond high school
- Enjoys nature documentaries
- Has basic understanding of how evolution works from school and documentaries
""",
    'college_bio': """Background of the reader:
- A college student
- Familiar with evolutionary biology, organic chemistry, statistics, immunology, genetics, molecular genetics, molecular biology, and linear algebra.
- Interested in science related to machine learning, statistics, immunology, organic chemistry, genetics, genomics, and bioinformatics.
""",
    'humanities': """Background of the reader:
- English Literature professor
- Interested in narrative and historical developments
- Reads Scientific American occasionally
- No formal science education beyond high school
- Hates science.
""",
    'tech_professional': """Background of the reader:
- Software engineer with computer science degree
- Familiar with complex systems and algorithms
- Reads tech blogs and popular science articles
- Basic understanding of scientific method
""",
    'medical_practitioner': """Background of the reader:
- Primary care physician
- Strong understanding of human anatomy and physiology
- Familiar with pharmacology and toxicology
- Limited exposure to evolutionary biology
"""
}

In [None]:
for reader_type, background in backgrounds.items():
    analysis = analyze_article_for_reader(intros, background)
    print(f"\nAnalysis for {reader_type}:")
    print(analysis)


Analysis for high_school:
ArticleAnalysis(interest_rating=8, interest_reason='Perfect match for someone who enjoys nature documentaries. The evolutionary arms race and how snakes developed venom connects well with their existing interest in evolution and natural history.', difficulty_rating=7, difficulty_reason='Contains complex scientific terminology (Caenophidia, Toxicofera) and molecular biology concepts. While main ideas are accessible, specific details may be challenging without advanced biology background.', prerequisites=['Basic evolution concepts', 'High school biology', 'Understanding of proteins and genes', 'Basic taxonomy/classification of reptiles'], prereq_reason='Understanding evolution and basic biology is crucial for following the venom development discussion. Knowledge of proteins and taxonomy helps grasp the molecular aspects and species relationships mentioned.')

Analysis for college_bio:
ArticleAnalysis(interest_rating=8, interest_reason='Combines molecular biolog

It is interesting to see that every reader is expected to enjoy reading about the evolution of snake venom.

## Running in parallel

Analyzing the articles one by one is quite slow. Multiple articles can be analyzed in parallel, but this is prone to `rate_limit_error`.
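One common way to cope with rate limit errors is to retry a failed call with exponential backoff. The sketch below is not part of this module: `with_retries` is a hypothetical helper, and it catches every `Exception` only to stay self-contained. In real use you would probably narrow the `except` clause to the rate limit exception raised by your client.

```python
import random
import time

def with_retries(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep, **kwargs):
    "Call `fn`, retrying with exponential backoff plus jitter when it raises."
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Out of attempts: let the caller see the last error.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus up to 10% jitter.
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay * 0.1))
```

Under these assumptions, a call such as `with_retries(analyze_article_for_reader, intros, background)` would wait about 1, 2 and 4 seconds between attempts before giving up. The `sleep` parameter exists only so the waiting can be replaced in tests.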

In [None]:
#| export
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import sys
from typing import List, Dict
import threading

In [None]:
#| export
def is_interactive() -> bool:
    "Check if we're running in an interactive environment (IPython/Jupyter)"
    return hasattr(sys, 'ps1') or bool(sys.flags.interactive) or 'ipykernel' in sys.modules

In [None]:
is_interactive()

True

We use `ThreadPoolExecutor` in interactive mode and switch to `ProcessPoolExecutor` when running as a script.
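A likely reason for this switch (the notebook does not state it) is that the Python documentation warns `ProcessPoolExecutor` will not work in the interactive interpreter, because worker processes must be able to import the `__main__` module. Threads avoid that restriction and are usually adequate when the work is mostly waiting on network responses. A minimal sketch of the thread-based path, where `slow_upper` and `run_in_threads` are hypothetical stand-ins rather than functions from this module:

```python
from concurrent.futures import ThreadPoolExecutor

def slow_upper(text):
    "Stand-in for a network-bound call such as an API request."
    return text.upper()

def run_in_threads(fn, items, max_workers=4):
    "Apply `fn` to each item in a thread pool, preserving input order."
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # `map` yields results in the order of `items`, not completion order.
        return list(pool.map(fn, items))

print(run_in_threads(slow_upper, ["iso 216", "iso 8601"]))
```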

In [None]:
#| export
def analyze_multiple_articles(articles: List[str], backgrounds: Dict[str, str], max_workers: int = None) -> Dict[str, Dict[str, ArticleAnalysis]]:
    "Analyze multiple articles for different reader backgrounds in parallel"
    Executor = ThreadPoolExecutor if is_interactive() else ProcessPoolExecutor
    results = {}
    with Executor(max_workers=max_workers) as executor:
        futures = {}
        for article_idx, article_text in enumerate(articles):
            results[article_idx] = {}
            for reader_type, background in backgrounds.items():
                future = executor.submit(analyze_article_for_reader, article_text, background)
                futures[(article_idx, reader_type)] = future
        for (article_idx, reader_type), future in futures.items():
            try:
                results[article_idx][reader_type] = future.result()
            except Exception:
                # Record failures (e.g. rate limit errors) as None so one bad call does not abort the batch
                results[article_idx][reader_type] = None
    return results

In [None]:
boring_articles = {
    'bureaucracy': WikiArticle("https://en.wikipedia.org/wiki/ISO_216").introduction,  # Paper size standards
    'statistics': WikiArticle("https://en.wikipedia.org/wiki/Analysis_of_variance").introduction,  # Dense statistical methods
    'obscure': WikiArticle("https://en.wikipedia.org/wiki/List_of_writing_systems").introduction,  # Dry list of writing systems
    'methodology': WikiArticle("https://en.wikipedia.org/wiki/ISO_8601").introduction,  # Date/time formatting standards
}

In [None]:
boring_articles

{'bureaucracy': 'ISO 216 is an international standard for paper sizes, used around the world except in North America and parts of Latin America. The standard defines the "A", "B" and "C" series of paper sizes, which includes the A4, the most commonly available paper size worldwide. Two supplementary standards, ISO 217 and ISO 269, define related paper sizes; the ISO 269 "C" series is commonly listed alongside the A and B sizes.\n\nAll ISO 216, ISO 217 and ISO 269 paper sizes (except some envelopes) have the same aspect ratio, √2:1, within rounding to millimetres. This ratio has the unique property that when cut or folded in half widthways, the halves also have the same aspect ratio. Each ISO paper size is one half of the area of the next larger size in the same series.[1]',
 'statistics': 'Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyze the differences betwee

In [None]:
test_results = analyze_multiple_articles(list(boring_articles.values()), backgrounds)
test_results

{0: {'high_school': ArticleAnalysis(interest_rating=6, interest_reason='The mathematical property of paper sizes maintaining their ratio when halved is an intriguing scientific concept that could appeal to someone interested in discovering patterns in everyday objects.', difficulty_rating=4, difficulty_reason='While the concept of aspect ratios and paper sizes is straightforward, the mathematical relationship (√2:1) might be slightly challenging but still accessible with basic high school math.', prerequisites=['Basic geometry knowledge', 'Understanding of ratios and proportions', 'Familiarity with square roots'], prereq_reason='These math concepts are essential to grasp the unique properties of the paper size system and why the √2:1 ratio is special when folding paper.'),
  'college_bio': ArticleAnalysis(interest_rating=4, interest_reason="While the mathematical property of √2:1 ratio may appeal to someone with math background, the topic of paper sizes is quite removed from the reader

A `None` value means the analysis raised an error, most likely a rate limit error.
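To see which analyses failed, the nested results can be scanned for `None` entries. This is a small sketch with a hypothetical helper, `failed_jobs`, run on toy data that mimics the nested shape returned by `analyze_multiple_articles`:

```python
def failed_jobs(results):
    "List the (article_idx, reader_type) pairs whose analysis is `None`."
    return [(idx, reader)
            for idx, per_reader in results.items()
            for reader, analysis in per_reader.items()
            if analysis is None]

# Toy results: strings stand in for `ArticleAnalysis` objects.
toy = {0: {'high_school': 'ok', 'humanities': None}, 1: {'high_school': None}}
print(failed_jobs(toy))  # [(0, 'humanities'), (1, 'high_school')]
```

The returned pairs could then be resubmitted one at a time, which is gentler on the rate limit than rerunning the whole batch.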

It's good to see that the interest and difficulty ratings differ depending on the reader's background.

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()