# Scraping Wikipedia Pages
### Author: Sam Eure
### Last updated: May 14, 2021
#### [Code in github](https://github.com/euresa/statistics/blob/master/Python%20Projects/Wikipedia_Scraping/wiki_scraper.ipynb)

In this notebook, I write a short Python script for scraping Wikipedia pages. My `WikiScraper` class associates the natural-language paragraphs and hyperlinks on any given Wikipedia page with the section or subsection they appear in. In the first part of the notebook I develop the scraping class for gathering and processing the data. In the second part I perform some basic NLP on the scraped text.

## Scraping and Organizing Data

### Classes, Functions, and Imports

In [1]:
import re
import requests

import pandas as pd
from bs4 import BeautifulSoup

class WikiScraper:
    """A class used for scraping Wikipedia page data. Organizes natural language on
    a Wikipedia page and creates a list of the hyperlinks present on the page."""

    def __init__(self, subject: str="", link: str="", printing: bool=False) -> None:
        self.hyperlinks: list = []
        self.para_list: list = []
        self.section_headers: list = []
        if subject != "":
            self.HTML_LINK: str = _make_wiki_link(subject)
            self._get_wiki_data(self.HTML_LINK, printing=printing)
            self._get_wiki_df()
        elif link != "":
            self.HTML_LINK = link
            self._get_wiki_data(self.HTML_LINK, printing=printing)
            self._get_wiki_df()
        else:
            print("Please provide a 'subject' or 'link' argument.")

    def _get_wiki_data(self, HTML_LINK: str, printing: bool=False) -> None:
        """Takes in a Wikipedia link and stores the natural language paragraphs,
        section titles, and hyperlinks found in the document on the instance.
        """
        if self.section_headers == []:
            soup: BeautifulSoup = _get_soup_doc(HTML_LINK)
            headers: list = _get_headers(soup)
            header_strings = [str(h) for h in headers]
            h_indices: list = _get_indices(soup, header_strings)

            self.subject: str = _get_title(soup)
            printer_count = 5  # counter in the for loop below
            if printing:
                print("PAGE TITLE:", self.subject)
                printer_count = 0

            for i, head in enumerate(headers):
                self.section_headers.append(head.text)
                paragraphs: list = _get_raw_paragraphs(soup, h_indices, i)
                hrefs = [_get_hyperlink(p) for p in paragraphs]
                paras = [_remove_HTML(p) for p in paragraphs]
                self.hyperlinks.append(combine(hrefs))
                self.para_list.append(paras)
                if printer_count < 3:  # To limit printing
                    printer_count = printer_count + 1
                    print("\n", "#" * 100, "\n\tSECTION:", head.text)
                    for p in paras:
                        print("\n", _show_some_text(p))
                    print("\nLINKS:", _show_some_text(str(combine(hrefs))))
        else:
            print("Wiki data already scraped.")

    def _get_wiki_df(self):
        """Organizes Wikipedia page data into a Pandas dataframe."""
        wiki_df = pd.DataFrame(
            {
                "section": self.section_headers,
                "hyperlinks": self.hyperlinks,
                "paragraphs": self.para_list,
            }
        )
        wiki_df["words"] = wiki_df.apply(
            lambda row: _get_words_from_paragraphs(row.paragraphs), axis=1
        )
        wiki_df = _add_count_feature(wiki_df, "hyperlinks")
        wiki_df = _add_count_feature(wiki_df, "paragraphs")
        wiki_df = _add_count_feature(wiki_df, "words")
        wiki_df = wiki_df.set_index("section")
        # Copy so that adding columns to self.df later does not act on a slice of wiki_df
        self.df = wiki_df[wiki_df.paragraphs_count != 0].copy()


def _make_wiki_link(subject: str) -> str:
    """Takes in a string containing a subject (e.g. Python, Cats, Russia, etc.) and
    returns the related Wikipedia page link to the subject.
    
        example:
        >>> _make_wiki_link('Statistics')
        https://en.wikipedia.org/wiki/Statistics
    """
    link = "https://en.wikipedia.org/wiki/" + subject
    return link


def _get_soup_doc(HTML_LINK: str, parser: str="html.parser") -> BeautifulSoup:
    """Takes in an html link and returns a BeautifulSoup document."""
    response = requests.get(HTML_LINK)
    soup_doc = BeautifulSoup(response.content, parser)  # honor the parser argument
    return soup_doc


def _get_title(soup_doc: BeautifulSoup) -> str:
    """Returns the title of the web page."""
    title = soup_doc.find(id="firstHeading").text
    return title


def _get_headers(soup: BeautifulSoup) -> list:
    """Returns the header of each section in a Wikipedia page."""
    headers = soup.find_all("span", attrs="mw-headline")
    return headers


def _remove_footnotes(text: str) -> str:
    """Drop footnote superscripts in brackets"""
    text = re.sub(r"\[.*?\]+", "", text)
    return text


def _get_indices(soup: BeautifulSoup, string_elements: list) -> list:
    """Returns a list of the starting index for each element in a list of strings."""
    soup_string = str(soup)
    indices = [soup_string.index(string) for string in string_elements]
    return indices


def _get_index(text: str, element: str) -> int:
    """Attempts to get the index of an element. Returns -1 if not present

        example:
            >>> _get_index("hello world", "w")
            6
            >>> _get_index("hello world", "z")
            -1
    """
    try:
        idx = text.index(element)
        return idx
    except ValueError:  # str.index raises ValueError when the element is absent
        return -1


def combine(lists):
    """Combines a list of lists into one list to return.

        example:
        >>> combine([[1, 2, 3], [4, 5, 6], [7], [8, 9]])
        [1, 2, 3, 4, 5, 6, 7, 8, 9]
    """
    combo = []
    for list_ in lists:
        combo.extend(list_)
    return combo


def _collect_pattern_pairs(text: str, start_char: str, end_char: str) -> list:
    """Returns a list of all text between sets of start_char and end_char characters."""
    collection = []
    while _get_index(text, start_char) != -1:
        start: int = _get_index(text, start_char)
        offset: int = _get_index(text[start:], end_char)  # search only after start
        if offset == -1:  # unmatched start_char: stop instead of looping forever
            break
        end: int = start + offset
        collection.append(text[start : end + len(end_char)])
        text = text[end + len(end_char) :]
    return collection


def _get_raw_paragraphs(soup: BeautifulSoup, h_indices: list, i: int) -> list:
    """Returns unprocessed HTML paragraphs."""
    soup_string = str(soup)
    if i + 1 < len(h_indices):
        raw_text_i = soup_string[h_indices[i] : h_indices[i + 1]]
    else:  # last section: read to the end of the document
        raw_text_i = soup_string[h_indices[i] :]
    paragraphs: list = _collect_pattern_pairs(raw_text_i, "<p>", "</p>")
    return paragraphs


def _get_paragraphs(soup: BeautifulSoup, h_indices: list, i: int) -> list:
    """Returns the text paragraphs associated with the section header specified by an index."""
    paragraphs: list = _get_raw_paragraphs(soup, h_indices, i)
    clean_paragraphs = [_remove_HTML(p) for p in paragraphs]
    return clean_paragraphs


def _remove_pattern_pair(text: str, start_char: str, end_char: str) -> str:
    """Removes all text between the start_char and end_char and returns remaining text."""
    while _get_index(text, start_char) != -1:
        start: int = _get_index(text, start_char)
        offset: int = _get_index(text[start:], end_char)  # search only after start
        if offset == -1:  # unmatched start_char: stop instead of looping forever
            break
        end: int = start + offset
        text = text[:start] + text[end + len(end_char) :]
    return text


def _remove_substrings(text: str, substring_list: list) -> str:
    """Removes all occurrences of all substrings in a list from a text string. Returns new string."""
    for pattern in substring_list:
        text = re.sub(pattern, "", text)
    return text


def _remove_HTML(text: str) -> str:
    """Removes the HTML elements from a substring of an HTML document and returns resulting string."""
    to_remove = ["</a>", "</sup>", "<p>", "</p>"]
    text = _remove_substrings(text, to_remove)
    plain_text: str = _remove_pattern_pair(text, "<", ">")
    plain_text = _remove_footnotes(plain_text)
    return plain_text


def _clean_wiki_links(hlinks: list) -> list:
    """Completes hyperlinks to other Wikipedia pages."""
    for i, link in enumerate(hlinks):
        if "/wiki/" in link:
            end = link.index('"')
            hlinks[i] = re.sub("/wiki/", "https://en.wikipedia.org/wiki/", link[:end])
    return hlinks


def _remove_internal_links(hlinks: list) -> list:
    """Removes links that reference different parts of the Wikipedia page."""
    # Build a new list: removing items while iterating over the same list skips elements
    return [link for link in hlinks if not link.startswith("#")]


def _clean_cite_notes(hlinks: list) -> list:
    """Removes references to links cited at the bottom of the Wikipedia page."""
    cite_notes = []
    for i, link in enumerate(hlinks):
        if "#cite_note-" in link:
            cite_notes.append(hlinks[i])
    for c in cite_notes:
        hlinks.remove(c)
    return hlinks


def _clean_hyperlinks(html_hyperlinks: list) -> list:
    """Removes the HTML markup around hyperlinks and returns a list of hyperlinks"""
    to_remove = ['<a href="', '">']
    hlinks = [_remove_substrings(link, to_remove) for link in html_hyperlinks]
    hlinks = _clean_wiki_links(hlinks)
    hlinks = _clean_cite_notes(hlinks)
    hlinks = _remove_internal_links(hlinks)
    return hlinks


def _get_hyperlink(text: str) -> list:
    """Returns a list of all hyperlinks included in a subsection of an HTML document."""
    html_hyperlinks: list = _collect_pattern_pairs(text, "<a href=", ">")
    hyperlinks: list = _clean_hyperlinks(html_hyperlinks)
    return hyperlinks


def _show_some_text(text: str) -> str:
    """Returns the first 80 characters of a string, followed by an ellipsis."""
    return text[:80] + "..."


def _get_words_from_paragraphs(p_list: list) -> list:
    """Returns a list of words from a list of paragraphs."""
    paragraph: str = " \n ".join(p_list) #paragraphs are separated by ' \n '
    return paragraph.split(" ")


def _add_count_feature(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Returns df with new feature that is the length of the list of a different feature."""
    df[feature + "_count"] = df.apply(lambda row: len(row[feature]), axis=1)
    return df
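
The core idea behind `_get_indices` and `_get_raw_paragraphs` is to locate each section header in the raw HTML string and slice out the text between consecutive headers. Below is a minimal sketch of that idea, assuming a hand-written HTML snippet rather than real Wikipedia markup. `split_by_headers` is a hypothetical helper, not part of the class, and it uses regular expressions from the standard library where the class uses `str.index`.

```python
import re

# Hand-written toy markup (an assumption, not a real Wikipedia page).
html = (
    '<h2><span class="mw-headline">Methods</span></h2>'
    '<h3><span class="mw-headline">Sampling</span></h3>'
    '<p>A sample is a subset.</p><p>It should be representative.</p>'
    '<h2><span class="mw-headline">History</span></h2>'
    '<p>Early work.</p>'
)


def split_by_headers(html: str) -> dict:
    """Map each header's text to the <p>...</p> blocks that follow it."""
    headers = list(re.finditer(r'<span class="mw-headline">(.*?)</span>', html))
    sections = {}
    for i, match in enumerate(headers):
        start = match.start()
        # The slice ends where the next header starts, or at the end of the document.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(html)
        sections[match.group(1)] = re.findall(r"<p>.*?</p>", html[start:end], flags=re.S)
    return sections


sections = split_by_headers(html)
for name, paragraphs in sections.items():
    print(name, len(paragraphs))
```

In this toy input, "Methods" is followed immediately by the "Sampling" subsection header, so its slice contains no paragraphs; the paragraphs land under "Sampling" instead.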

### Scraping and processing the data

In [2]:
#Choose your subject of interest!
SUBJECT = "Statistics"
wiki_obj = WikiScraper(subject=SUBJECT, printing=True)

PAGE TITLE: Statistics

 #################################################################################################### 
	SECTION: Introduction

 Statistics is a mathematical body of science that pertains to the collection, an...

 In applying statistics to a problem, it is common practice to start with a popul...

 When a census is not feasible, a chosen subset of the population called a sample...

LINKS: ['https://en.wikipedia.org/wiki/Data', 'https://en.wikipedia.org/wiki/Mathematic...

 #################################################################################################### 
	SECTION: Mathematical statistics

 Mathematical statistics is the application of mathematics to statistics. Mathema...

LINKS: ['https://en.wikipedia.org/wiki/Mathematics', 'https://en.wikipedia.org/wiki/Mat...

 #################################################################################################### 
	SECTION: History

 The early writings on statistical inference date back to Ara

Some of the sections don't have any paragraphs associated with them. This is because each paragraph is assigned to the most specific subsection it falls under rather than to the enclosing section. Sections with no paragraphs are dropped from the dataframe.

## Data is organized as a Pandas dataframe


In [3]:
wiki_obj.df.head()

| section | hyperlinks | paragraphs | words | hyperlinks_count | paragraphs_count | words_count |
|---|---|---|---|---|---|---|
| Introduction | [https://en.wikipedia.org/wiki/Data, https://e... | [Statistics is a mathematical body of science ... | [Statistics, is, a, mathematical, body, of, sc... | 18 | 3 | 346 |
| Mathematical statistics | [https://en.wikipedia.org/wiki/Mathematics, ht... | [Mathematical statistics is the application of... | [Mathematical, statistics, is, the, applicatio... | 3 | 1 | 27 |
| History | [https://en.wikipedia.org/wiki/Mathematics_in_... | [The early writings on statistical inference d... | [The, early, writings, on, statistical, infere... | 51 | 8 | 758 |
| Sampling | [https://en.wikipedia.org/wiki/Design_of_exper... | [When full census data cannot be collected, st... | [When, full, census, data, cannot, be, collect... | 10 | 3 | 242 |
| Experimental and observational studies | [https://en.wikipedia.org/wiki/Causality, http... | [A common goal for a statistical research proj... | [A, common, goal, for, a, statistical, researc... | 10 | 1 | 201 |


Now that I have the plain text from each of the sections, I can do some natural language processing to find the most popular words from each section. 

## NLP

I'll do some basic NLP to find popular words within documents.

### Functions and Imports

In [4]:
"""
I'll use spaCy to help with some of the natural language processing,
such as identifying 'stop words'.
"""
from spacy import load


def _remove_punctuation(words: list) -> list:
    """Removes punctuation and special characters."""
    punc_list = [
        ".",
        ",",
        ")",
        "(",
        "/",
        "]",
        "[",
        "\n",
        " ",
        ";",
        ":",
        '"',
        "'",
        "\n \n ",
        "-",
    ]
    no_punc = [w for w in words if w not in punc_list]
    return no_punc


def _get_top_n_strings(str_list: list, n: int=3) -> list:
    """Finds most popular n strings in a string list. Returns a list of tuples (word, count)."""
    word_counts: pd.Series = pd.Series(str_list, dtype="object").value_counts()
    # head(n) avoids an IndexError when there are fewer than n distinct strings
    top_strings = [(word, int(count)) for word, count in word_counts.head(n).items()]
    return top_strings


def _remove_stop_words(doc, lemmas: bool=False) -> list:
    """Removes "stop words" (common words like "is", "but", "and") from list of words."""
    if lemmas:
        # Lemmas are the base form of a word. Ex: the lemma of swimming is swim.
        interesting_words = [token.lemma_ for token in doc if not token.is_stop]
    else:
        interesting_words = [token.text for token in doc if not token.is_stop]
    return interesting_words


def find_popular_words(nlp, words_list: list, n: int=3, lemmas: bool=False) -> list:
    """Finds the most popular words that aren't stop words in a list of words."""
    text = " ".join(words_list)
    doc = nlp(text.lower())
    nice_words: list = _remove_stop_words(doc, lemmas=lemmas)
    actual_nice_words = _remove_punctuation(nice_words)
    popular_words: list = _get_top_n_strings(actual_nice_words, n=n)
    return popular_words


def find_top_words(nlp, df, n: int=1, lemmas: bool=False) -> pd.Series:
    """Returns a pandas.Series object of tuples comprised of the top 'n' words
    from each row of a pandas.DataFrame already containing a 'words' feature.
    """
    top_words: pd.Series = df.apply(
        lambda row: find_popular_words(nlp, row.words, n=n, lemmas=lemmas), axis=1
    )
    return top_words
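
The pipeline above boils down to three steps: tokenize, drop stop words, and count. As a rough illustration of those steps without spaCy or pandas, here is a minimal standard-library sketch. The `STOP_WORDS` set is a tiny hand-picked stand-in (an assumption; spaCy's list is much longer), and `top_words` is a hypothetical helper, not one of the functions defined above.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop words; spaCy's real list is far larger.
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "from", "in", "is", "of", "the", "to"}


def top_words(text: str, n: int = 3) -> list:
    """Return the n most frequent non-stop words as (word, count) tuples."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keeps letters only, so punctuation is dropped
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


sample = "Data is collected from a sample. The sample data is analyzed, and the data is reported."
result = top_words(sample, n=2)
print(result)  # [('data', 3), ('sample', 2)]
```

Unlike the spaCy version, this sketch has no notion of lemmas, so "study" and "studies" would be counted separately.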

### Finding most common words in each section

In [5]:
#Load language model
nlp = load('en_core_web_sm')

wiki_obj.df['top words'] = find_top_words(nlp, wiki_obj.df, n=3)
print(wiki_obj.subject)
wiki_obj.df[['words_count','top words']].head()

Statistics


| section | words_count | top words |
|---|---|---|
| Introduction | 346 | [(data, 16), (population, 8), (statistics, 7)] |
| Mathematical statistics | 27 | [(mathematical, 3), (analysis, 2), (statistics... |
| History | 758 | [(statistics, 11), (statistical, 9), (fisher, 6)] |
| Sampling | 242 | [(population, 7), (sample, 6), (theory, 5)] |
| Experimental and observational studies | 201 | [(studies, 6), (variables, 4), (study, 4)] |


We can also determine the top (most frequent) words for the entire topic.

In [7]:
print("The most popular words for the 'statistics' Wikipedia page are:")
find_popular_words(nlp, combine(list(wiki_obj.df.words.values)), n=5)

The most popular words for the 'statistics' Wikipedia page are:


[('statistics', 69),
 ('data', 57),
 ('statistical', 56),
 ('sample', 32),
 ('probability', 30)]

Unsurprisingly, the most popular words are 'statistics' and 'data'!

The dataframe above is from the 'Statistics' page on Wikipedia. Let's check out some of the pages that are referenced on this page using the hyperlinks we found.

In [26]:
import random
allLinks = combine(wiki_obj.hyperlinks)
random.shuffle(allLinks)
allLinks[:5]

['https://en.wikipedia.org/wiki/Standard_deviation',
 'https://en.wikipedia.org/wiki/Longitude',
 'https://en.wikipedia.org/wiki/Fahrenheit',
 'https://en.wikipedia.org/wiki/Boolean_data_type',
 'https://en.wikipedia.org/wiki/Karl_Pearson']
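
The links above were recovered by slicing the raw HTML string. As a point of comparison, here is a minimal sketch of the same cleanup (skip in-page `#` anchors, complete relative `/wiki/` links) using only the standard library's `html.parser` and `urllib.parse.urljoin`. `LinkCollector` is a hypothetical class, and the HTML fed to it is a hand-written example rather than real page content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags, skipping in-page anchors."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Links starting with '#' point to another part of the same page.
        if href and not href.startswith("#"):
            self.links.append(urljoin(self.base_url, href))


collector = LinkCollector("https://en.wikipedia.org/wiki/Statistics")
collector.feed(
    '<p><a href="/wiki/Data" title="Data">data</a> '
    '<a href="#cite_note-1">[1]</a> '
    '<a href="https://example.org/x">external</a></p>'
)
links = collector.links
print(links)
```

Letting a parser handle attributes avoids depending on the exact order of `href` and `title` inside each tag.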

### Random walk on graph

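Each scraped page yields a fresh list of links, so we can take a random walk over Wikipedia's link graph: scrape a page, pick one of its links at random, and repeat.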

In [27]:
links = allLinks
for _ in range(10):
    new_page = WikiScraper(link=links[0])
    print(new_page.subject)
    edges = combine(new_page.hyperlinks)
    if not edges:  # dead end: the page has no outgoing links
        break
    random.shuffle(edges)
    links = edges

Standard deviation
Variance
Classical mechanics
Angular momentum
Closed system
Thermodynamic system
Quantum thermodynamics
Third law of thermodynamics
Spin glass
Phase transition
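
To make the walk logic easier to see without any network calls, here is a minimal sketch on a small in-memory graph. The `GRAPH` dictionary and the `random_walk` helper are hypothetical stand-ins for Wikipedia and `WikiScraper`; the seed is only there so the sketch is repeatable.

```python
import random

# Hypothetical toy graph: each page maps to the pages it links to.
GRAPH = {
    "Statistics": ["Data", "Probability"],
    "Data": ["Statistics"],
    "Probability": ["Statistics", "Measure theory"],
    "Measure theory": [],  # dead end: no outgoing links
}


def random_walk(graph: dict, start: str, steps: int, seed: int = 0) -> list:
    """Follow up to `steps` random links from `start`; stop early at a dead end."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path


path = random_walk(GRAPH, "Statistics", steps=10)
print(" -> ".join(path))
```

Choosing one neighbor with `rng.choice` is equivalent to shuffling the list and taking the first element, as the cell above does, but it leaves the link list untouched.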
