# Scraping Wikipedia Pages
### Author: Sam Eure
### Date: May 12, 2021

In this notebook, I write a short Python script that scrapes a Wikipedia page and collects the hyperlinks listed in each of its sections. Given a link to a Wikipedia page, the script should do the following:

1) Print the title of the page 

2) Print the header of each section

3) Print the plain text of each section

4) Print a list of the hyperlinks referenced in the paragraphs in each section (if present)

Ultimately, I'd like to develop some text-based ML models using Wikipedia page data. I think this script will work well for gathering natural language data on a variety of topics!

### Functions and Imports

In [152]:
import re

import pandas as pd
try:
    import requests
except ImportError: #only install when the package is actually missing
    !python3 -m pip install requests
    import requests
    
try:
    from bs4 import BeautifulSoup
except ImportError:
    !python3 -m pip install beautifulsoup4
    from bs4 import BeautifulSoup

################### INPUTS ###########################

HTML_LINK = "https://en.wikipedia.org/wiki/Statistics"

################## END OF INPUTS #####################

#Scraping
def get_soup_doc(html_link, parser = 'html.parser'):
    '''Takes in an html link and returns a BeautifulSoup document.'''
    response = requests.get(html_link) #use the argument, not the global HTML_LINK
    soup_doc = BeautifulSoup(response.content, parser)
    return(soup_doc)

def get_title(soup_doc):
    '''Returns the title of the web page.'''
    title = soup_doc.find(id='firstHeading').text
    return(title)

def get_headers(soup):
    '''Returns the header of each section in a Wikipedia page.'''
    headers = soup.find_all('span', class_='mw-headline')
    return(headers)

def remove_footnotes(text):
    '''Drop footnote superscripts in brackets'''
    text = re.sub(r"\[.*?\]+", '', text)
    return(text)

def get_indices(soup, string_elements):
    '''Returns a list of the starting index for each element in a list of strings.'''
    soup_string = str(soup)
    indices = [soup_string.index(string) for string in string_elements]
    return(indices)

def get_index(text, element):
    '''Attempts to get the index of an element. Returns None if not present.'''
    try:
        idx = text.index(element)
        return(idx)
    except ValueError: #str.index raises ValueError when the element is absent
        return(None)
    
def combine(lists):
    '''Combines a list of lists into one list to return.'''
    combo = []
    for list_ in lists:
        combo.extend(list_)
    return(combo)

def collect_pattern_pairs(text, start_char, end_char):
    '''Returns a list of all text between sets of start_char and end_char characters.'''
    collection = []
    start = get_index(text, start_char)
    while start is not None:
        end = get_index(text[start:], end_char)
        if end is None: #unmatched start_char; stop instead of raising a TypeError
            break
        end += start #make end relative to the full text
        collection.append(text[start:end+len(end_char)])
        text = text[end+len(end_char):]
        start = get_index(text, start_char)
    return(collection)

def get_raw_paragraphs(soup, h_indices, i):
    '''Returns unprocessed HTML paragraphs.'''
    soup_string = str(soup)
    try:
        raw_text_i = soup_string[h_indices[i]:h_indices[i+1]]
    except IndexError: #the last section runs to the end of the document
        raw_text_i = soup_string[h_indices[i]:]
    paragraphs = collect_pattern_pairs(raw_text_i, "<p>", "</p>")
    return(paragraphs)

def get_paragraphs(soup, h_indices, i):
    '''Returns the text paragraphs associated with the section header specified by an index.'''
    paragraphs = get_raw_paragraphs(soup, h_indices, i)
    clean_paragraphs = [remove_HTML(p) for p in paragraphs]
    return(clean_paragraphs)

def remove_pattern_pair(text, start_char, end_char):
    '''Removes all text between the start_char and end_char and returns remaining text.'''
    start = get_index(text, start_char)
    while start is not None:
        end = get_index(text[start:], end_char)
        if end is None: #unmatched start_char; leave the rest of the text as is
            break
        end += start #make end relative to the full text
        text = text[:start] + text[end+len(end_char):]
        start = get_index(text, start_char)
    return(text)

def remove_substrings(text, substring_list):
    '''Removes all occurrences of all substrings in a list from a text string. Returns new string.'''
    for substring in substring_list:
        text = text.replace(substring, "") #literal match; re.sub would treat the substring as a regex
    return(text)

def remove_HTML(text):
    '''Removes the HTML elements from a substring of an HTML document and returns resulting string.'''
    to_remove = ['</a>', '</sup>', '<p>', '</p>']
    text = remove_substrings(text, to_remove)
    plain_text = remove_pattern_pair(text, '<', ">")
    plain_text = remove_footnotes(plain_text)
    return(plain_text)

def clean_wiki_links(hlinks):
    '''Completes hyperlinks to other Wikipedia pages.'''
    for i, link in enumerate(hlinks):
        if link.startswith('/wiki/'): #only relative links need the domain prepended
            link = link.split('"')[0] #drop any attributes after the href value
            hlinks[i] = 'https://en.wikipedia.org' + link
    return(hlinks)

def remove_internal_links(hlinks):
    '''Removes links that reference different parts of the Wikipedia page.'''
    #Build a new list; removing items from a list while iterating over it skips elements
    return([link for link in hlinks if not link.startswith("#")])
    
def clean_cite_notes(hlinks):
    '''Removes references to links cited at the bottom of the Wikipedia page.'''
    return([link for link in hlinks if '#cite_note-' not in link])

def clean_hyperlinks(html_hyperlinks):
    '''Removes the HTML markup around hyperlinks and returns a list of hyperlinks'''
    to_remove =['<a href="', '">']
    hlinks = [remove_substrings(link, to_remove) for link in html_hyperlinks]
    hlinks = clean_wiki_links(hlinks)
    hlinks = clean_cite_notes(hlinks)
    hlinks = remove_internal_links(hlinks)
    return(hlinks)

def get_hyperlink(text):
    '''Returns a list of all hyperlinks included in a subsection of an HTML document.'''
    html_hyperlinks = collect_pattern_pairs(text, '<a href=', '>')
    hyperlinks = clean_hyperlinks(html_hyperlinks)
    return(hyperlinks)

def show_some_text(text):
    '''Returns the first 100 characters of a string, followed by an ellipsis.'''
    return(text[:100]+"...")


#Processing
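def join_paragraphs(p_list, join_char = ' \n '):
    '''Joins a list of paragraphs into one string separated by join_char.'''
    return(join_char.join(p_list))

def get_words_from_paragraphs(p_list):
    '''Returns the list of space-separated tokens in a list of paragraphs.'''
    paragraph = join_paragraphs(p_list)
    return(paragraph.split(" "))

def add_count_feature(df, feature):
    '''Adds a "<feature>_count" column holding the length of each entry in the feature column.'''
    df[feature+"_count"] = df[feature].apply(len)
    return(df)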
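
As an alternative to prepending the domain by hand in `clean_wiki_links`, relative links can be resolved against the page URL with the standard library. The sketch below is only an illustration: `normalize_links` and `BASE_URL` are hypothetical names that do not appear in the notebook, and the input list is made up.

```python
from urllib.parse import urljoin

# Assumed base URL; in the notebook, HTML_LINK would play this role.
BASE_URL = "https://en.wikipedia.org/wiki/Statistics"

def normalize_links(hrefs, base_url=BASE_URL):
    '''Hypothetical helper: drops same-page anchors and makes the rest absolute.'''
    links = []
    for href in hrefs:
        if href.startswith("#"):  # same-page anchors such as "#cite_note-1"
            continue
        links.append(urljoin(base_url, href))  # absolute URLs pass through unchanged
    return links

example_links = normalize_links(["/wiki/Data", "#cite_note-1", "https://example.org/page"])
print(example_links)
```

This covers the same ground as `clean_wiki_links`, `clean_cite_notes`, and `remove_internal_links` combined, but it assumes the hrefs have already been stripped down to the bare attribute value.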

### Scraping the Data

In [153]:
soup = get_soup_doc(HTML_LINK)
headers = get_headers(soup)
header_strings = [str(h) for h in headers]
h_indices = get_indices(soup, header_strings)

#Printing
hyperlinks = []
para_list = []
section_headers = []
print('PAGE TITLE:',get_title(soup))
for i, head in enumerate(headers):
    section_headers.append(head.text)
    print("\n","#"*100, '\n\tSECTION:', head.text)
    paragraphs = get_raw_paragraphs(soup, h_indices, i)
    hrefs = [get_hyperlink(p) for p in paragraphs]
    paras = [remove_HTML(p) for p in paragraphs]
    hyperlinks.append(combine(hrefs))
    para_list.append(paras)
    for p in paras:
        print('\n', show_some_text(p))
    print('\nLINKS:', show_some_text(str(combine(hrefs))))

PAGE TITLE: Statistics

 #################################################################################################### 
	SECTION: Introduction

 Statistics is a mathematical body of science that pertains to the collection, analysis, interpretati...

 In applying statistics to a problem, it is common practice to start with a population or process to ...

 When a census is not feasible, a chosen subset of the population called a sample is studied. Once a ...

LINKS: ['https://en.wikipedia.org/wiki/Data', 'https://en.wikipedia.org/wiki/Mathematics', 'https://en.wiki...

 #################################################################################################### 
	SECTION: Mathematical statistics

 Mathematical statistics is the application of mathematics to statistics. Mathematical techniques use...

LINKS: ['https://en.wikipedia.org/wiki/Mathematics', 'https://en.wikipedia.org/wiki/Mathematical_analysis',...

 ##############################################################


 Some well-known statistical tests and procedures are:
...

LINKS: ['https://en.wikipedia.org/wiki/Statistical_hypothesis_testing']...

 #################################################################################################### 
	SECTION: Exploratory data analysis

 Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main charac...

LINKS: ['https://en.wikipedia.org/wiki/Data_analysis', 'https://en.wikipedia.org/wiki/Data_set', 'https://e...

 #################################################################################################### 
	SECTION: Misuse

 Misuse of statistics can produce subtle but serious errors in description and interpretation—subtle ...

 Even when statistical techniques are correctly applied, the results can be difficult to interpret fo...

 There is a general perception that statistical knowledge is all-too-frequently intentionally misused...

 Ways to avoid misuse of statistics include using proper diagrams 

Some of the sections don't have any paragraphs associated with them. This is because each paragraph is assigned to the nearest preceding header, so text under a subsection is attributed to that subsection rather than to its parent section. I'll organize this data into a dataframe now.
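
The empty sections are easiest to see on a toy example. The sketch below uses a made-up HTML string (not real Wikipedia markup) and plain regular expressions rather than the notebook's helper functions, but it slices the document between consecutive header positions in the same way.

```python
import re

# Toy HTML, assumed for illustration: a section header followed directly by a subsection header.
html = (
    '<h2><span class="mw-headline">Methods</span></h2>'
    '<h3><span class="mw-headline">Sampling</span></h3>'
    '<p>Sampling text.</p>'
    '<h3><span class="mw-headline">Experiments</span></h3>'
    '<p>Experiment text.</p>'
)

# Locate each header, then slice the document between consecutive header positions.
matches = list(re.finditer(r'<span class="mw-headline">(.*?)</span>', html))
starts = [m.start() for m in matches] + [len(html)]

paragraphs_by_header = {}
for m, start, end in zip(matches, starts, starts[1:]):
    paragraphs_by_header[m.group(1)] = re.findall(r'<p>(.*?)</p>', html[start:end])

print(paragraphs_by_header)
```

Here "Methods" ends up with an empty list because its slice stops at the "Sampling" header, before any paragraph appears.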

### Organizing the Data

In [156]:
wiki_df = pd.DataFrame({'section': section_headers, "hyperlinks": hyperlinks, 'paragraphs': para_list})
wiki_df['words'] = wiki_df.apply(lambda row : get_words_from_paragraphs(row.paragraphs), axis=1)
wiki_df = add_count_feature(wiki_df, 'hyperlinks')
wiki_df = add_count_feature(wiki_df, 'paragraphs')
wiki_df = add_count_feature(wiki_df, 'words')
wiki_df = wiki_df.set_index('section')
wiki_df = wiki_df[~(wiki_df.paragraphs_count==0)]
wiki_df

| section | hyperlinks | paragraphs | words | hyperlinks_count | paragraphs_count | words_count |
|---|---|---|---|---|---|---|
| Introduction | [https://en.wikipedia.org/wiki/Data, https://e... | [Statistics is a mathematical body of science ... | [Statistics, is, a, mathematical, body, of, sc... | 18 | 3 | 346 |
| Mathematical statistics | [https://en.wikipedia.org/wiki/Mathematics, ht... | [Mathematical statistics is the application of... | [Mathematical, statistics, is, the, applicatio... | 3 | 1 | 27 |
| History | [https://en.wikipedia.org/wiki/Mathematics_in_... | [The early writings on statistical interferenc... | [The, early, writings, on, statistical, interf... | 51 | 8 | 758 |
| Sampling | [https://en.wikipedia.org/wiki/Design_of_exper... | [When full census data cannot be collected, st... | [When, full, census, data, cannot, be, collect... | 10 | 3 | 242 |
| Experimental and observational studies | [https://en.wikipedia.org/wiki/Causality, http... | [A common goal for a statistical research proj... | [A, common, goal, for, a, statistical, researc... | 10 | 1 | 201 |
| Experiments | [https://en.wikipedia.org/wiki/Assembly_line, ... | [The basic steps of a statistical experiment a... | [The, basic, steps, of, a, statistical, experi... | 2 | 2 | 153 |
| Observational study | [https://en.wikipedia.org/wiki/Cohort_study] | [An example of an observational study is one t... | [An, example, of, an, observational, study, is... | 1 | 1 | 102 |
| Types of data | [https://en.wikipedia.org/wiki/Level_of_measur... | [Various attempts have been made to produce a ... | [Various, attempts, have, been, made, to, prod... | 15 | 4 | 365 |
| Descriptive statistics | [https://en.wikipedia.org/wiki/Count_noun, htt... | [A descriptive statistic (in the count noun se... | [A, descriptive, statistic, (in, the, count, n... | 6 | 1 | 78 |
| Inferential statistics | [https://en.wikipedia.org/wiki/Data_analysis, ... | [Statistical inference is the process of using... | [Statistical, inference, is, the, process, of,... | 5 | 1 | 83 |
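
One caveat about `words_count`: `get_words_from_paragraphs` joins paragraphs with `' \n '` and then splits on single spaces, so each paragraph break is counted as a token of its own. The sketch below uses two made-up paragraphs to compare that split with a plain whitespace split; whether the difference matters depends on how the counts are used later.

```python
# Two made-up paragraphs, for illustration only.
paragraphs = ["Statistics is a science.", "It studies data."]

joined = ' \n '.join(paragraphs)    # same join string as join_paragraphs
space_split = joined.split(" ")     # same split as get_words_from_paragraphs
whitespace_split = joined.split()   # splits on any run of whitespace

print(space_split)
print(len(space_split), len(whitespace_split))
```

With a single-space split the `"\n"` separator survives as a token, while `str.split()` with no argument discards it.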
