#### Web scraping for creating mind maps
- Let's see if we can find terms/keywords related to a topic, and possibly an order in which they should be learned
- Use these word lists as RAG context for the LLM, to serve as the roadmap

#### Ideas
1. Scraping Wikipedia
    - There is a related-pages API, but its results are usually too deep
    - Extract keywords and links from the wiki page
        - Possibly go one level down and see how many of the linked pages link back to our current page
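
The "go one level down and see how many link back" idea can be sketched without any network calls. This is only a sketch under assumptions: `pages_linking_back` and the hand-made `outgoing` dictionary are hypothetical, and in practice the link sets would come from the scraped pages.

```python
def pages_linking_back(root, outgoing):
    """Return the pages linked from `root` that also link back to `root`.

    `outgoing` maps a page title to the set of titles it links to.
    """
    children = outgoing.get(root, set())
    # A child links back when the root appears among its own outgoing links.
    return {child for child in children if root in outgoing.get(child, set())}


# Hypothetical, hand-made link data for illustration only
outgoing = {
    "Distributed computing": {"Parallel computing", "ARPANET", "Leader election"},
    "Parallel computing": {"Distributed computing", "Supercomputer"},
    "ARPANET": {"Internet"},
    "Leader election": {"Distributed computing"},
}

mutual = pages_linking_back("Distributed computing", outgoing)
print(sorted(mutual))
```

Pages that link back are arguably more central to the topic than pages that are only mentioned in passing, so this set could serve as a first filter for the keyword list.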

In [24]:
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

import spacy
import spacy_transformers

nlp = spacy.load("en_core_web_trf")

In [3]:
# API docs: https://en.wikipedia.org/api/rest_v1/
BASE_URL = "https://en.wikipedia.org/api/rest_v1"

In [21]:

res = requests.get(BASE_URL + "/page/html/Distributed computing")
html = res.text
print(len(html), html[:300])

293218 <!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikipedia.org/wiki/Special:Redirect/revision/1249313583"><head prefix="mwr: https://en.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="3f9ef8c0-9f14-11ef-8773-7f72


In [29]:
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

soup = BeautifulSoup(html, "html.parser")
# find_all(string=True) replaces the deprecated findAll(text=True)
texts = soup.find_all(string=True)
visible_texts = filter(tag_visible, texts)

text_only = " ".join(t.strip() for t in visible_texts)
print(len(text_only), text_only[:300])

41354 System with multiple networked computers  Not to be confused with Decentralized computing .  Distributed computing is a field of computer science that studies distributed systems , defined as computer systems whose inter-communicating components are located on different networked computers . [1] [2]




In [None]:
with open("test.html", "w") as f:
    f.write(html)

In [30]:
doc = nlp(text_only)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Three 452 457 CARDINAL
one 664 667 CARDINAL
SOA 761 764 ORG
Marc Brooker 1340 1352 PERSON
one 2090 2093 CARDINAL
only one 3817 3825 CARDINAL
Saga 3938 3942 ORG
one 5585 5588 CARDINAL
the 1960s 6349 6358 DATE
first 6369 6374 ORDINAL
the 1970s 6472 6481 DATE
ARPANET 6489 6496 PRODUCT
one 6499 6502 CARDINAL
the late 1960s 6559 6573 DATE
ARPANET 6579 6586 PRODUCT
the early 1970s 6610 6625 DATE
ARPANET 6788 6795 PRODUCT
the global Internet 6816 6835 PRODUCT
Usenet 6887 6893 PRODUCT
FidoNet 6898 6905 PRODUCT
the 1980s 6911 6920 DATE
the late 1970s 7075 7089 DATE
early 1980s 7094 7105 DATE
first 7111 7116 ORDINAL
Symposium on Principles of Distributed Computing 7142 7190 ORG
PODC 7192 7196 ORG
1982 7213 7217 DATE
International Symposium on Distributed Computing 7239 7287 EVENT
DISC 7289 7293 ORG
Ottawa 7313 7319 GPE
1985 7323 7327 DATE
the International Workshop on Distributed Algorithms on Graphs 7331 7393 EVENT
first 7889 7894 ORDINAL
three 7915 7920 CARDINAL
three 8088 8093 CARDINAL
Three 

In [16]:
soup = BeautifulSoup(html, "html.parser")

In [28]:
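for x in soup.find_all('a'):
    href = x.get('href', '')  # some anchors may have no href attribute
    rel = x.get('rel', [])
    text = x.text
    print(href, text, rel)
    # In this output, internal links carry rel "mw:WikiLink" and relative "./Title" hrefs
    if href.startswith("/wiki/") or 'mw:WikiLink' in rel:
        print(text, href)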

./Computer_science_(disambiguation) Computer science (disambiguation) ['mw:WikiLink']
Computer science (disambiguation) ./Computer_science_(disambiguation)
./File:Lambda_calculus-Church_numerals.png  []
./Programming_language_theory Programming language theory ['mw:WikiLink']
Programming language theory ./Programming_language_theory
./File:Sorting_quicksort_anim.gif  []
./Computational_complexity_theory Computational complexity theory ['mw:WikiLink']
Computational complexity theory ./Computational_complexity_theory
./File:Activemarker2.PNG  []
./Artificial_intelligence Artificial intelligence ['mw:WikiLink']
Artificial intelligence ./Artificial_intelligence
./File:Half_Adder.svg  []
./Computer_architecture Computer architecture ['mw:WikiLink']
Computer architecture ./Computer_architecture
./Computer_science Computer science ['mw:WikiLink']
Computer science ./Computer_science
./File:Flowchart_structured_programming.svg  []
./History_of_computer_science History ['mw:WikiLink']
History ./
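
The hrefs printed above are relative (`./Title`) and include `./File:` entries that are not topics. A minimal sketch of turning such an href into a page title, assuming only the patterns visible in this output; `href_to_title` is a hypothetical helper, not part of any library:

```python
from urllib.parse import unquote


def href_to_title(href):
    """Convert a relative './Title' href into a readable title, or None to skip it."""
    if not href.startswith("./"):
        return None  # external or otherwise non-relative link
    # Drop any '#section' fragment, then decode percent-escapes.
    target = unquote(href[2:].split("#", 1)[0])
    if not target or target.startswith("File:"):
        return None  # empty target or an image/file link
    return target.replace("_", " ")


for sample in ["./Computer_science_(disambiguation)",
               "./File:Half_Adder.svg",
               "./Computer_architecture"]:
    print(sample, "->", href_to_title(sample))
```

Only the `File:` prefix is filtered here because it is the only non-article prefix seen in the output; other namespaces would need their own checks.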

#### Wiki Related API

In [None]:
from urllib.parse import quote

RELATED_ENDPOINT = f"{BASE_URL}/page/related"

def get_related(query):
    # Encode the title so spaces and slashes cannot break the URL path
    res = requests.get(f"{RELATED_ENDPOINT}/{quote(query, safe='')}", timeout=10)
    if res.status_code != 200:
        return []
    parse = res.json()
    return [x['title'] for x in parse['pages']]

# print(get_related("Distributed computing"))

['Load_balancing_(computing)', 'Parallel_computing', 'Outline_of_computer_science', 'Scalability', 'Distributed_memory', 'Theoretical_computer_science', 'Graph_(abstract_data_type)', 'Concurrency_(computer_science)', 'MapReduce', 'Concurrent_computing', 'List_of_computer_science_conferences', 'Leslie_Valiant', 'Distributed_minimum_spanning_tree', 'Bulk_synchronous_parallel', 'Leader_election', 'Computer_cluster', 'Lateral_computing', 'Algorithmic_skeleton', 'Apache_Hama', 'Data-intensive_computing']


In [13]:
def get_topics(query):
    res = set()
    level1 = get_related(query)
    res.update(level1)
    for element in level1:
        level2 = get_related(element)
        res.update(level2)
    return list(res)

d = get_topics("Distributed computing")

In [16]:
# print(d)
# print(len(d))
for x in d:
    print(x)

Slurm_Workload_Manager
Apache_Accumulo
Message_passing_in_computer_clusters
ACM_SIGACT
SPQR_tree
Distributed_hash_table
List_of_computer_scientists
Consensus_(computer_science)
Session_Initiation_Protocol
Yuri_Gurevich
All-to-all_(parallel_pattern)
Multiprocessor_system_architecture
Quantum_computing
Michael_Kearns_(computer_scientist)
Mainframe_computer
Adiabatic_quantum_computation
Knuth_Prize
Kubernetes
Actor_model_later_history
Belief_propagation
Parallel_Virtual_File_System
Hardware_acceleration
Supercomputer_architecture
Supercomputer
Multi-core_processor
Transport_Layer_Security
Distributed_networking
Computational_learning_theory
Reduction_operator
Metaheuristic
D-Wave_Systems
Multithreading_(computer_architecture)
Many-task_computing
Computation_offloading
List_of_programming_languages_by_type
Proxy_server
Explicit_multi-threading
Application_delivery_network
Fast_flux
TUM_School_of_Computation,_Information_and_Technology
Breadth-first_search
Ronald_Fagin
Reachability
List_of_

In [None]:
# rank topics by number of backlinks?
# rank topics by frequency in other sources

# extract keywords from page itself rather than related API


# Idea
# get related topics for a list of basic system design keywords (10 or so)
# save related topics for each of those topics
# build a graph of related topics
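
The idea above (collect related topics for a handful of seed keywords, then build a graph) might look roughly like this. It is a sketch under assumptions: `build_related_graph`, `rank_by_frequency`, and the `FAKE_RELATED` data are all hypothetical, and `fake_related` stands in for `get_related` so the example runs offline.

```python
from collections import Counter


def build_related_graph(seeds, related_fn):
    """Map each seed topic to the list of topics related_fn returns for it."""
    return {seed: list(related_fn(seed)) for seed in seeds}


def rank_by_frequency(graph):
    """Count under how many seeds each related topic appears, most frequent first."""
    counts = Counter()
    for related in graph.values():
        counts.update(set(related))  # set() so each seed counts a topic once
    return counts.most_common()


# Hypothetical stand-in data so the sketch needs no network access
FAKE_RELATED = {
    "Load balancing": ["Scalability", "Proxy server"],
    "Caching": ["Scalability", "Proxy server", "Distributed hash table"],
    "Sharding": ["Scalability", "Distributed hash table"],
}


def fake_related(topic):
    return FAKE_RELATED.get(topic, [])


graph = build_related_graph(["Load balancing", "Caching", "Sharding"], fake_related)
ranking = rank_by_frequency(graph)
print(ranking[0])
```

Swapping `fake_related` for `get_related` would give the real graph; topics shared by many seeds rise to the top, which is one possible way to rank them.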

