# Scraping Wikipedia Pages
### Author: Sam Eure
### Date: May 12, 2021

In this notebook, I wrote a short Python script that scrapes a Wikipedia page to obtain the hyperlinks listed in each of its sections. Given a link to a Wikipedia page, the script does the following:

1) Print the title of the page 

2) Print the header of each section

3) Print the plain text of each section

4) Print a list of the hyperlinks referenced in the paragraphs in each section (if present)

Ultimately, I'd like to develop some text-based ML models using Wikipedia page data. I think this script will work well for gathering natural language data on a variety of topics!
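
The four steps listed above can be sketched on a tiny, made-up HTML snippet before running the full script against a live page. This is only an illustration and not the script used in this notebook: it relies on the standard-library `html.parser` instead of `requests` and BeautifulSoup so that it runs offline, and the names `SAMPLE_HTML` and `SectionParser` are hypothetical. Real Wikipedia markup is more complex than this sample.

```python
from html.parser import HTMLParser

# Hypothetical miniature page that only mimics the tags the script looks for.
SAMPLE_HTML = """
<h1 id="firstHeading">Statistics</h1>
<h2><span class="mw-headline">Sampling</span></h2>
<p>See <a href="/wiki/Data_set">data sets</a> and <a href="#cite_note-1">[1]</a>.</p>
<h2><span class="mw-headline">Applications</span></h2>
<p>No links here.</p>
"""

class SectionParser(HTMLParser):
    '''Collects the page title and, for each section, its text and wiki links.'''

    # Tag that ends each collecting mode.
    _CLOSERS = {"title": "h1", "header": "span", "para": "p"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.sections = []   # one dict per section: header, text, links
        self._mode = None    # 'title', 'header', 'para', or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("id") == "firstHeading":
            self._mode = "title"
        elif tag == "span" and attrs.get("class") == "mw-headline":
            self.sections.append({"header": "", "text": "", "links": []})
            self._mode = "header"
        elif tag == "p" and self.sections:
            self._mode = "para"
        elif tag == "a" and self._mode == "para":
            href = attrs.get("href") or ""
            # Keep links to other articles; skip '#...' anchors and footnotes.
            if href.startswith("/wiki/"):
                self.sections[-1]["links"].append("https://en.wikipedia.org" + href)

    def handle_endtag(self, tag):
        if self._mode and tag == self._CLOSERS[self._mode]:
            self._mode = None

    def handle_data(self, data):
        if self._mode == "title":
            self.title += data
        elif self._mode == "header":
            self.sections[-1]["header"] += data
        elif self._mode == "para":
            self.sections[-1]["text"] += data

parser = SectionParser()
parser.feed(SAMPLE_HTML)
print("PAGE TITLE:", parser.title)
for section in parser.sections:
    print("SECTION:", section["header"])
    print(section["text"])
    print("LINKS ---> ", section["links"])
```

As in the script below, paragraphs that appear before the first section header are ignored in this sketch.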

In [530]:
import re

try:
    import requests
except ImportError:  # install only when the package is actually missing
    !python3 -m pip install requests
    import requests
    
try:
    from bs4 import BeautifulSoup
except ImportError:
    !python3 -m pip install beautifulsoup4
    from bs4 import BeautifulSoup

##########################################################
#                      USER INPUTS                       #
#    Please enter the html link you'd like to explore.   #
##########################################################

HTML_LINK = "https://en.wikipedia.org/wiki/Statistics"

##########################################################
#                   END OF USER INPUTS                   #
##########################################################

def get_soup_doc(html_link, parser = 'html.parser'):
    '''Takes in an html link and returns a BeautifulSoup document.'''
    response = requests.get(html_link)  # use the argument, not the global HTML_LINK
    soup_doc = BeautifulSoup(response.content, parser)
    return(soup_doc)

def get_title(soup_doc):
    '''Returns the title of the web page.'''
    title = soup_doc.find(id='firstHeading').text
    return(title)

def get_headers(soup):
    '''Returns the header of each section in a Wikipedia page.'''
    headers = soup.find_all('span', class_='mw-headline')
    return(headers)

def remove_footnotes(text):
    '''Drop footnote superscripts in brackets'''
    text = re.sub(r"\[.*?\]+", '', text)
    return(text)

def get_indices(soup, string_elements):
    '''Returns a list of the starting index for each element in a list of strings.'''
    soup_string = str(soup)
    # Iterate over the argument rather than the global header_strings.
    indices = [soup_string.index(string) for string in string_elements]
    return(indices)

def get_index(text, element):
    '''Attempts to get the index of an element. Returns None if not present.'''
    try:
        idx = text.index(element)
        return(idx)
    except ValueError:  # raised by str.index when element is absent
        return(None)
    
def combine(lists):
    '''Combines a list of lists into one list to return.'''
    combo = []
    for list_ in lists:
        combo.extend(list_)
    return(combo)

def collect_pattern_pairs(text, start_char, end_char):
    '''Returns a list of all substrings running from a start_char to the next end_char, delimiters included.'''
    collection = []
    while get_index(text, start_char) is not None:
        start = get_index(text, start_char)
        offset = get_index(text[start:], end_char)
        if offset is None: #unmatched start_char: stop instead of raising a TypeError
            break
        end = offset + start #make sure end is after start
        collection.append(text[start:end+len(end_char)])
        text = text[end+len(end_char):]
    return(collection)

def get_raw_paragraphs(soup, h_indices, i):
    '''Returns the unprocessed HTML paragraphs of the section specified by index i.'''
    soup_string = str(soup)
    try:
        raw_text_i = soup_string[h_indices[i]:h_indices[i+1]]
    except IndexError: #last section: read to the end of the document
        raw_text_i = soup_string[h_indices[i]:]
    paragraphs = collect_pattern_pairs(raw_text_i, "<p>", "</p>")
    return(paragraphs)

def get_paragraphs(soup, h_indices, i):
    '''Returns the text paragraphs associated with the section header specified by an index.'''
    paragraphs = get_raw_paragraphs(soup, h_indices, i)
    clean_paragraphs = [remove_HTML(p) for p in paragraphs]
    return(clean_paragraphs)

def remove_pattern_pair(text, start_char, end_char):
    '''Removes all text between the start_char and end_char and returns remaining text.'''
    while get_index(text, start_char) is not None:
        start = get_index(text, start_char)
        offset = get_index(text[start:], end_char)
        if offset is None: #unmatched start_char: stop instead of raising a TypeError
            break
        end = offset + start #make sure end is after start
        text = text[:start] + text[end+len(end_char):]
    return(text)

def remove_substrings(text, substring_list):
    '''Removes all occurrences of all substrings in a list from a text string. Returns new string.'''
    for substring in substring_list:
        # str.replace treats each entry literally, so regex metacharacters need no escaping.
        text = text.replace(substring, "")
    return(text)

def remove_HTML(text):
    '''Removes the HTML elements from a substring of an HTML document and returns resulting string.'''
    to_remove = ['</a>', '</sup>', '<p>', '</p>']
    text = remove_substrings(text, to_remove)
    plain_text = remove_pattern_pair(text, '<', ">")
    plain_text = remove_footnotes(plain_text)
    return(plain_text)

def clean_wiki_links(hlinks):
    '''Completes hyperlinks to other Wikipedia pages.'''
    for i, link in enumerate(hlinks):
        if '/wiki/' in link:
            end = link.index('"')
            hlinks[i] = re.sub('/wiki/', 'https://en.wikipedia.org/wiki/', link[:end])
    return(hlinks)

def remove_internal_links(hlinks):
    '''Removes links that reference different parts of the same Wikipedia page.'''
    # Build a new list: removing items from a list while looping over it skips elements.
    hlinks = [link for link in hlinks if not link.startswith("#")]
    return(hlinks)
    
def clean_cite_notes(hlinks):
    '''Removes references to links cited at the bottom of the Wikipedia page.'''
    cite_notes = []
    for i, link in enumerate(hlinks):
        if '#cite_note-' in link:
            cite_notes.append(hlinks[i])
    for c in cite_notes:
        hlinks.remove(c)
    return(hlinks)

def clean_hyperlinks(html_hyperlinks):
    '''Removes the HTML markup around hyperlinks and returns a list of hyperlinks'''
    to_remove =['<a href="', '">']
    hlinks = [remove_substrings(link, to_remove) for link in html_hyperlinks]
    hlinks = clean_wiki_links(hlinks)
    hlinks = clean_cite_notes(hlinks)
    hlinks = remove_internal_links(hlinks)
    return(hlinks)

def get_hyperlink(text):
    '''Returns a list of all hyperlinks included in a subsection of an HTML document.'''
    html_hyperlinks = collect_pattern_pairs(text, '<a href=', '>')
    hyperlinks = clean_hyperlinks(html_hyperlinks)
    return(hyperlinks)
    
###################
#      MAIN       #
###################
    
soup = get_soup_doc(HTML_LINK)
headers = get_headers(soup)
header_strings = [str(h) for h in headers]
h_indices = get_indices(soup, header_strings)

#Printing
print('PAGE TITLE:',get_title(soup))
for i, head in enumerate(headers):
    print("#"*100, '\n\tSECTION:', head.text)
    paragraphs = get_raw_paragraphs(soup, h_indices, i)
    hrefs = [get_hyperlink(p) for p in paragraphs]
    paras = [remove_HTML(p) for p in paragraphs]
    for p in paras:
        print('\n', p)
    
    print('LINKS ---> ', combine(hrefs))
    
    
###################
#      END        #
###################

PAGE TITLE: Statistics
#################################################################################################### 
	SECTION: Introduction

 Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data, or as a branch of mathematics. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty and decision making in the face of uncertainty.


 In applying statistics to a problem, it is common practice to start with a population or process to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Ideally, statisticians compile data about the entire population (an operation called census). This may be organized by governmental statistical institutes. Descriptive s

LINKS --->  []
#################################################################################################### 
	SECTION: Data collection
LINKS --->  []
#################################################################################################### 
	SECTION: Sampling

 When full census data cannot be collected, statisticians collect sample data by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting through statistical models.


 To use a sample as a guide to an entire population, it is important that it truly represents the overall population. Representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent that the sample chosen is actually representative. Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods o


 Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.  It is assumed that the observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.

LINKS --->  ['https://en.wikipedia.org/wiki/Data_analysis', 'https://en.wikipedia.org/wiki/Probability_distribution', 'https://en.wikipedia.org/wiki/Statistical_population', 'https://en.wikipedia.org/wiki/Sampling_(statistics)', 'https://en.wikipedia.org/wiki/Descriptive_statistics']
#################################################################################################### 
	SECTION: Terminology and theory of inferentia


 Most studies only sample part of a population, so results don't fully represent the whole population. Any estimates obtained from the sample only approximate the population value. Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population. Often they are expressed as 95% confidence intervals. Formally, a 95% confidence interval for a value is a range where, if the sampling and analysis were repeated under the same conditions (yielding a different dataset), the interval would include the true (population) value in 95% of all possible cases. This does not imply that the probability that the true value is in the confidence interval is 95%. From the frequentist perspective, such a claim does not even make sense, as the true value is not a random variable.  Either the true value is or is not within the given interval. However, it is true that, before any data are sampled and given a plan for how to construct the confi


 The concept of correlation is particularly noteworthy for the potential confusion it can cause. Statistical analysis of a data set often reveals that two variables (properties) of the population under consideration tend to vary together, as if they were connected. For example, a study of annual income that also looks at age of death might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated; however, they may or may not be the cause of one another. The correlation phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable. For this reason, there is no way to immediately infer the existence of a causal relationship between the two variables.

LINKS --->  ['https://en.wikipedia.org/wiki/Data_set']
#################################################################################################### 
	SECTION: Applications
LINKS --->  []
##########################


 Statistical techniques are used in a wide range of types of scientific and social research, including: biostatistics, computational biology, computational sociology, network biology, social science, sociology and social research. Some fields of inquiry use applied statistics so extensively that they have specialized terminology. These disciplines include:


 In addition, there are particular types of statistical analysis that have also developed their own specialised terminology and methodology:


 Statistics form a key basis tool in business and manufacturing as well. It is used to understand measurement systems variability, control processes (as in statistical process control or SPC), for summarizing data, and to make data-driven decisions. In these roles, it is a key tool, and perhaps the only reliable tool.

LINKS --->  ['https://en.wikipedia.org/wiki/Biostatistics', 'https://en.wikipedia.org/wiki/Computational_biology', 'https://en.wikipedia.org/wiki/Computational_sociology', 'h