## Web Scraping Example: Optimized Learning with Knowledge Graphs
- Finds and orders prerequisite knowledge for a given source
- Collects linked references from a given Wikipedia page with web scraping


### Note to class: this is a personal project I recently started. I thought it might be cool to show everyone some web scraping in context!

The goal: make sources of information more accessible by finding and ordering prerequisite knowledge.

Strategy:
0. **Scrape a given page on Simple Wikipedia for related items, keywords, links to other articles, etc. Make sure that this is a depth-limited search and that we're looking for prerequisite knowledge, so a candidate page counts as successful if it references the topic of the query page or one of its parent topics.** Here's a rough potential algo for this:
  a. For each linked candidate keyword/topic found, search the candidate's page for instances of our term being used (maybe narrow the scope to just links/tags for efficiency using bs4; don't just scan the plaintext of an entire webpage). If the candidate has at least one reference to the query term, naively put it in the "worth investigating, make it a node" pile.
1. **Clean/format the scraped data into a JSON file that neo4j or some other graph representation can interpret/use.** Make sure that nodes have the correct labels and that their lists of neighbors aren't super long. Maybe naively prioritize by number of instances, or cut out terms that don't relate back; I'm not sure.
2. Run an algo that checks if the graph is a DAG (most likely it won't be, because of pages that reference each other). Maybe just find cycles and then try to predict which item is upstream/parent to the other by some mechanism. NLP? idk man
3. Once we have a graph with generally clean edges, ideally a DAG, we can run a topological sorting algorithm on it to figure out which knowledge is upstream of our query.
4. Make a cool visualization with seaborn/neo4j/something that allows links to be attached to nodes. Provide an ordered list of prereq reading, and a list of potentially interesting follow-up reads that were discovered in the process.

The above algo is subject to change as this project develops further, but hopefully it's a decent starting point.
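
Steps 2 and 3 of the strategy (check for a DAG, then topologically sort) can be sketched with the standard library's `graphlib` module (Python 3.9+). This is only a minimal sketch under assumptions: the function name `order_prerequisites` and the toy graph are hypothetical, and the graph is assumed to be stored as `{topic: set of prerequisite topics}`. It does not attempt the harder part of deciding which edge of a cycle to drop.

```python
from graphlib import TopologicalSorter, CycleError


def order_prerequisites(prereqs):
    """Return (reading_order, None) for a DAG, or (None, cycle) if a cycle exists.

    prereqs maps each topic to the set of topics that should be read before it.
    """
    try:
        # static_order() yields every topic after all of its prerequisites
        return list(TopologicalSorter(prereqs).static_order()), None
    except CycleError as err:
        # the second element of err.args is the list of nodes forming the cycle
        return None, err.args[1]


# hypothetical toy graph, not scraped data: topic -> prerequisites
toy = {
    "Knowledge graph": {"Graph theory", "Database"},
    "Graph theory": {"Set theory"},
    "Database": set(),
    "Set theory": set(),
}

order, cycle = order_prerequisites(toy)
print(order)  # prerequisites always appear before the topics that need them
print(cycle)  # None here, because the toy graph has no cycle
```

When two pages reference each other, `cycle` holds the offending nodes, which is where the "predict which item is upstream" step would have to take over.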

In [1]:
# get libraries: requests for HTTP GET requests, bs4 for parsing HTML,
# pandas for DataFrames, and re for regexes (see below for more regex details)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [2]:
# choose any wiki page as the start node: let's use a topic related to the sort of thing
# we're trying to emulate, and get the data from that URL
gkg_page = requests.get('https://en.wikipedia.org/wiki/Google_Knowledge_Graph')
# create a soup object with the html parser
gkg_soup = BeautifulSoup(gkg_page.content, 'html.parser')

![after inspecting a wiki page, I found that the div with id mw-content-text is the main text area w/ links](wiki_main_scrape.png)

In [3]:
# from the soup, get the first div from the main body with id 'bodyContent'
# from bodyContent, get nested div with id 'mw-content-text'
gkg_body = gkg_soup.find('div', {'id': 'bodyContent'}).find('div', {'id': 'mw-content-text'})
# IMPORTANT FOR TASK: find all 'a' tags from this page, these are our links
gkg_fa = gkg_body.find_all('a', href=True)
# use a list comprehension (fancy one-liner loop syntax) to get the link (the 'href') from each a tag
gkg_links = [a['href'] for a in gkg_fa]
gkg_links[:15]

['/wiki/Knowledge_graph',
 '/wiki/File:Google_Knowledge_Panel.png',
 '/wiki/File:Google_Knowledge_Panel.png',
 '/wiki/Thomas_Jefferson',
 '/wiki/Google_Search',
 '/wiki/Knowledge_base',
 '/wiki/Google',
 '/wiki/Google_Search',
 '/wiki/Infobox',
 '/wiki/Search_engine_results_page',
 '#cite_note-:0-1',
 '#cite_note-2',
 '#cite_note-3',
 '#cite_note-4',
 '#cite_note-5']

In [4]:
# note that our list of links has a bunch of things that we don't want, like external links and others
# filter out the links that don't go to other wiki pages: keep only hrefs that start with /wiki/
gkg_labels_all = list(filter(re.compile('^/wiki/').match, gkg_links))
# exclude files, specials, templates, etc. (anything of the form Type:resource)
# note: going through set() also drops duplicates and does not preserve page order
gkg_labels = list(set(gkg_labels_all) - set(filter(re.compile('.*:.*').match, gkg_labels_all)))
gkg_labels

['/wiki/Computability',
 '/wiki/YouTube_Comedy_Week',
 '/wiki/Postini',
 '/wiki/A_Logic_Named_Joe',
 '/wiki/Protocol_Buffers',
 '/wiki/Jigsaw_(company)',
 '/wiki/Google_Photos',
 '/wiki/Google_PowerMeter',
 '/wiki/The_Washington_Post',
 '/wiki/Google_Maps_Navigation',
 '/wiki/Waze',
 '/wiki/Google_Sheets',
 '/wiki/Flutter_(American_company)',
 '/wiki/Google_Chat',
 '/wiki/Site_reliability_engineering',
 '/wiki/Gmail',
 '/wiki/Google_Programmable_Search_Engine',
 '/wiki/Accelerated_Mobile_Pages',
 '/wiki/ARCore',
 '/wiki/Skia_Graphics_Engine',
 '/wiki/Cortana_(virtual_assistant)',
 '/wiki/Green_Throttle_Games',
 '/wiki/OKR',
 '/wiki/Pimp_My_Search',
 '/wiki/Google_Account',
 '/wiki/Google_Pinyin',
 '/wiki/PageRank',
 '/wiki/Solve_for_X',
 '/wiki/YouTube_Next_Lab_and_Audience_Development_Group',
 '/wiki/Infobox',
 '/wiki/Google_China',
 '/wiki/Read_Along',
 '/wiki/Google_OnHub',
 '/wiki/Material_Design',
 '/wiki/Dunant_(submarine_communications_cable)',
 '/wiki/Google_Play_Music',
 '/wik
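
The two filtering steps above can also be folded into a single pattern. The following is a sketch, not the code used in this notebook: `ARTICLE_LINK` and `filter_article_links` are hypothetical names, and like the set-based version it assumes that any link containing a colon is a non-article page. Unlike the set-based version, it keeps links in the order they appear on the page.

```python
import re

# matches /wiki/Title where Title contains no colon
# (so no File:, Special:, Template:, ... pages)
ARTICLE_LINK = re.compile(r'^/wiki/[^:]+$')


def filter_article_links(hrefs):
    """Keep article links only, dropping duplicates but preserving page order."""
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(h for h in hrefs if ARTICLE_LINK.match(h)))


# a few hrefs in the style of the output above
sample = [
    '/wiki/Knowledge_graph',
    '/wiki/File:Google_Knowledge_Panel.png',
    '/wiki/Google_Search',
    '/wiki/Google_Search',
    '#cite_note-2',
]
print(filter_article_links(sample))
```

Keeping page order may matter later if early links turn out to be more relevant than links near the bottom of the page, though that is only a guess at this point.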

In [5]:
def get_wiki_links(partial_url, nodes_neighbors, search_depth):
    """
    Collects links from the given Wikipedia page, stores them as a list of outgoing edges for this page's node.
    TODO Recursively calls itself on each page found, decrementing search depth each round until depth is zero.
    
    Parameters:
    -----------
    partial_url (str) : string of the form /wiki/Name_of_article for the target article
    nodes_neighbors (dict) : existing dictionary of {page: [list, of, pages, linked]} to which we add
    search_depth (int) : number of search rounds remaining in recursion (not used yet, see TODO above)
    
    Returns:
    --------
    nodes_neighbors (dict) : dict including edges for this page, used for directed graph in neo4j
    
    """
    # get data from specified wiki page and use bs4 to parse it
    page = requests.get(f'https://en.wikipedia.org{partial_url}')
    soup = BeautifulSoup(page.content, 'html.parser')
    # get main body text content from nested mw-content-text div
    body = soup.find("div", {"id": "bodyContent"}).find("div", {"id": "mw-content-text"})
    # get all a tags and extract link from each
    fa = body.find_all('a', href=True)
    links = [a['href'] for a in fa]
    # filter out the links that don't go to other wiki pages, match regex for wiki endpoint 
    labels_all = list(filter(re.compile("^/wiki/").match, links))
    # exclude files, specials, templates, etc. subtracts strings of format Type:resource
    labels = list(set(labels_all) - set(filter(re.compile(".*:.*").match, labels_all)))
    # add labels to dict entry for this node
    nodes_neighbors[partial_url] = labels

    return nodes_neighbors
    

In [6]:
gkg_dict = get_wiki_links('/wiki/Agalychnis_callidryas', dict(), 1)
gkg_dict

{'/wiki/Agalychnis_callidryas': ['/wiki/Nocturnality',
  '/wiki/Defensive_adaptation',
  '/wiki/Clutch_(eggs)',
  '/wiki/Poisonous_amphibian',
  '/wiki/Copeia',
  '/wiki/Central_America',
  '/wiki/Animal_Diversity_Web',
  '/wiki/Neotropical',
  '/wiki/Amplexus',
  '/wiki/Phyllomedusinae',
  '/wiki/Epiphyte',
  '/wiki/Synonym_(taxonomy)',
  '/wiki/Colombia',
  '/wiki/AmphibiaWeb',
  '/wiki/Taxonomy_(biology)',
  '/wiki/Edward_Drinker_Cope',
  '/wiki/INaturalist',
  '/wiki/Barcode_of_Life_Data_System',
  '/wiki/CITES',
  '/wiki/Scientific_name',
  '/wiki/Amphibian_Species_of_the_World',
  '/wiki/Chordate',
  '/wiki/Encyclopedia_of_Life',
  '/wiki/Scale_(map)',
  '/wiki/S2CID_(identifier)',
  '/wiki/Agalychnis',
  '/wiki/Amphibian',
  '/wiki/Metamorphosis',
  '/wiki/Bromelia',
  '/wiki/Ranoidea_chloris',
  '/wiki/Mexico',
  '/wiki/National_Center_for_Biotechnology_Information',
  '/wiki/PMID_(identifier)',
  '/wiki/Arboreal',
  '/wiki/Global_Biodiversity_Information_Facility',
  '/wiki/IU
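
With a dict shaped like the one above, step 0a of the strategy (keep a candidate only if it references the query page) could be sketched as follows. This is a rough sketch under assumptions: `keep_candidates` is a hypothetical helper, the page names in the example are made up, and it only checks direct links back to the query, not links to a parent topic.

```python
def keep_candidates(query, nodes_neighbors):
    """Return the pages linked from `query` that link back to it at least once."""
    return [
        page
        for page in nodes_neighbors.get(query, [])
        # a candidate that has not been scraped yet has no entry, so it is skipped
        if query in nodes_neighbors.get(page, [])
    ]


# hypothetical link data, not real scrape results
links = {
    '/wiki/Query': ['/wiki/Related', '/wiki/Unrelated', '/wiki/Not_scraped'],
    '/wiki/Related': ['/wiki/Query', '/wiki/Other'],
    '/wiki/Unrelated': ['/wiki/Other'],
}
print(keep_candidates('/wiki/Query', links))
```

This needs each candidate's links to be in the dict already, so it only becomes useful once the scraper goes at least two rounds deep.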

In [7]:
# BeautifulSoup(requests.get(f'https://en.wikipedia.org{gkg_labels[0]}').content, 'html.parser')

In [8]:
# loop over valid wiki page links and call some function on them that collects information
# if and only if they aren't already in the dict of pages we've added

# make the above into a function, pass a /wiki/Page_name_from_url and a number of search depth remaining
# that gets decremented for each recursive call
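
The plan in the comments above could look roughly like the sketch below. Two things here are assumptions rather than the notebook's method: the name `crawl` is hypothetical, and it swaps the planned recursion for an iterative round-by-round (breadth-first) loop, so that a page first reached near the depth limit is never left half-explored. The page-fetching step is passed in as `fetch_links`, which lets the loop be tried on fake data without hitting Wikipedia.

```python
def crawl(start, fetch_links, search_depth):
    """Depth-limited crawl that returns {page: [linked pages]}.

    fetch_links is any callable that maps a partial URL to a list of partial URLs.
    """
    nodes_neighbors = {}
    frontier = [start]
    # each round scrapes every new page found in the previous round
    for _ in range(search_depth):
        next_frontier = []
        for page in frontier:
            if page in nodes_neighbors:
                continue  # already scraped, don't request it again
            nodes_neighbors[page] = fetch_links(page)
            next_frontier.extend(nodes_neighbors[page])
        frontier = next_frontier
    return nodes_neighbors


# hypothetical stand-in for the web: page -> links found on it
fake_web = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/C'],
    '/wiki/C': ['/wiki/D'],
    '/wiki/D': [],
}
graph = crawl('/wiki/A', lambda page: fake_web.get(page, []), 2)
print(sorted(graph))
```

To run it on real pages, one option is to pass `lambda url: get_wiki_links(url, {}, 1)[url]` as `fetch_links`. Keep `search_depth` small, since the number of requests grows very quickly with each round.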

### TODO in this project: shape our data so that we have a JSON dict of nodes and edges that is interpretable by neo4j!
- If I finish this, I'll share it with the class at some point :)
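
As a starting point for that TODO, here is one possible way to reshape the `{page: [links]}` dict into nodes and edges. It is a sketch under assumptions: `to_graph_json` is a hypothetical helper, and the `nodes`/`edges` layout with `id`, `label`, `source`, and `target` keys is a generic choice of mine, not a format that neo4j is confirmed to read directly. The import step would still need to be worked out.

```python
import json


def to_graph_json(nodes_neighbors):
    """Serialize {page: [linked pages]} as a JSON string of nodes and edges."""
    # every page seen as a source or as a target becomes a node
    names = sorted(
        set(nodes_neighbors)
        | {target for targets in nodes_neighbors.values() for target in targets}
    )
    nodes = [
        # label: the article name without the /wiki/ prefix or underscores
        {"id": name, "label": name.rsplit("/", 1)[-1].replace("_", " ")}
        for name in names
    ]
    edges = [
        {"source": source, "target": target}
        for source, targets in nodes_neighbors.items()
        for target in targets
    ]
    return json.dumps({"nodes": nodes, "edges": edges}, indent=2)


# hypothetical example data
example = {'/wiki/Knowledge_graph': ['/wiki/Graph_theory', '/wiki/Database']}
print(to_graph_json(example))
```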