This post first shows an easy way to scrape text from Wikipedia in a handful of lines of code, then extracts the same text using a more general approach that can be applied to other websites.

We will scrape the text content of the Wikipedia page ‘Disappearance of Ben McDaniel’, an article about a scuba diver who went missing in 2010.

We will be using the following two methods:

1. Wikipedia package
2. urllib and BeautifulSoup


### 1. Wikipedia Package

The wikipedia package was designed to make extracting data from Wikipedia easy, and it delivers on that.

We can extract the text content from the Wikipedia page using just a few lines of code:

In [1]:
# Import package
import wikipedia

# Specify the title of the Wikipedia page
wiki = wikipedia.page('Disappearance of Ben McDaniel')

# Extract the plain text content of the page
text = wiki.content

print(text)

On August 20, 2010, employees in the dive shop at Vortex Spring, north of Ponce de Leon, Florida, United States, noticed that a pickup truck had remained in the shop's parking lot for the previous two days. It belonged to Ben McDaniel (born April 15, 1980), a Tennesseean who had been diving regularly at the spring while living in his parents' nearby beach house. He had last been seen by two of those employees on the evening of August 18, on a dive entering a cave 58 feet (18 m) below the water's surface. While he was initially believed to have drowned on that dive, and his parents still strongly believe his body is in an inaccessible reach of the extensive cave system, no trace of him has ever been found. The state of Florida issued his family a death certificate in 2013.McDaniel had been living at his parents' beach house on the Emerald Coast during what they called a "sabbatical" in the wake of a divorce, a business failure, and the death of his younger brother two years earlier. An 

We have just extracted the text content! 

Suppose we want to keep only the body of each paragraph and nothing else. Then we have to do some cleaning:

- Drop headers surrounded by ‘==’: re.sub(r'==.*?==+', '', text)

- Replace ‘\n’ (a new line) with ‘’ (an empty string): .replace('\n', '')

The output text is a string (you can check this with type(text)), so we can apply string methods or any other string operation to it.

In [2]:
# Import package
import re

# Clean text
text = re.sub(r'==.*?==+', '', text)
text = text.replace('\n', '')
text
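
To see what the two cleaning steps do without calling Wikipedia, here is a minimal sketch on a made-up sample string. The sample text and the function names are assumptions for illustration only. The third function is a variant that is not part of the original steps: deleting newlines outright glues the end of one paragraph to the start of the next, so it collapses whitespace into a single space instead.

```python
import re

# Hypothetical sample that mimics the layout of wiki.content
sample = (
    "Intro sentence.\n\n\n"
    "== Background ==\n"
    "First paragraph.\n\n\n"
    "=== Early life ===\n"
    "Second paragraph."
)

def strip_headers(text):
    """Remove '== Header ==' style section titles."""
    return re.sub(r'==.*?==+', '', text)

def clean_compact(text):
    """Same two steps as the cell above: drop headers, delete newlines."""
    return strip_headers(text).replace('\n', '')

def clean_spaced(text):
    """Variant: drop headers, then collapse whitespace runs into one space."""
    return re.sub(r'\s+', ' ', strip_headers(text)).strip()

print(clean_compact(sample))
print(clean_spaced(sample))
```

Which variant to prefer depends on what you do next: the compact form matches the output in this post, while the spaced form is probably easier to split into sentences later.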



That’s it, we have just scraped text from Wikipedia using the wikipedia package. You can find the documentation [here](https://wikipedia.readthedocs.io/en/latest/code.html#api).

### 2. General way with urllib & BeautifulSoup

Let’s extract the same data using a more general approach that you can use for other websites.

In [3]:
# Import packages
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/Disappearance_of_Ben_McDaniel').read()

# Make a soup 
soup = BeautifulSoup(source,'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Disappearance of Ben McDaniel - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"469bad0a-b6c2-477a-ad51-6622007d3e65","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Disappearance_of_Ben_McDaniel","wgTitle":"Disappearance of Ben McDaniel","wgCurRevisionId":973083112,"wgRevisionId":973083112,"wgArticleId":47303771,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description matches Wikidata","Coordinates not on Wikidata","Articles with hCard
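
When you apply this approach to other websites, some servers may reject or throttle requests that arrive with urllib's default User-Agent. Here is a minimal sketch of one way to send a descriptive header and a timeout. The `USER_AGENT` value and the helper names `build_request` and `fetch_html` are hypothetical, and nothing is downloaded until `fetch_html` is called.

```python
from urllib.request import Request, urlopen

URL = "https://en.wikipedia.org/wiki/Disappearance_of_Ben_McDaniel"

# Hypothetical identifier; replace with something that describes your script
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"

def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download the raw bytes of a page, giving up after `timeout` seconds."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network
request = build_request(URL)
print(request.full_url)
```

The bytes returned by `fetch_html(URL)` could be passed to `BeautifulSoup` in the same way as `source` in the cell above.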

Let’s try to make sense of the soup together.

The soup contains everything on the page and is made up of smaller elements, most of which we don’t need (e.g. tables, references). The script below lists the set of elements in the soup:

In [4]:
# Collect the parent tag name of every text node (string= replaces the deprecated text= argument)
print({node.parent.name for node in soup.find_all(string=True)})

{'i', 'span', 'p', 'footer', 'ol', 'label', 'th', 'h2', 'cite', 'form', 'body', 'a', 'head', 'b', 'q', 'h1', 'script', 'td', 'style', 'nav', 'ul', '[document]', 'h3', 'title', 'div', 'html', 'li'}
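
The set tells us which tags exist but not how much text each one holds. As a rough sketch, we could count text nodes per parent tag with `collections.Counter`. The tiny HTML string below is a made-up stand-in for the real page, and the built-in `html.parser` is used so the example does not depend on lxml.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical miniature page, standing in for the real Wikipedia HTML
html = (
    "<html><head><title>Demo</title></head><body>"
    "<h1>Demo</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "<ul><li>Item</li></ul>"
    "</body></html>"
)

demo_soup = BeautifulSoup(html, "html.parser")

# Count non-empty text nodes by the name of their parent tag
tag_counts = Counter(
    node.parent.name
    for node in demo_soup.find_all(string=True)
    if node.strip()
)

print(tag_counts.most_common())
```

On a real page, the tags with the highest counts are a reasonable first guess for where the readable content sits.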


In our case, we are interested in the elements tagged <p>, which stands for paragraph. The script below collects the text of every paragraph.

If you are curious about the other elements, try replacing ‘p’ with another element from the set above.

In [5]:
# Extract the plain text content from paragraphs
text = ''
for paragraph in soup.find_all('p'):
    text += paragraph.text
    
text
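
The loop above can be tried offline on a small hand-written HTML snippet. This is a minimal sketch under that assumption: the snippet is invented, and collecting the paragraphs in a list and joining them is just an alternative way to write the same loop.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two paragraphs with a table in between
html = (
    "<div><p>First paragraph.<sup>[1]</sup></p>"
    "<table><tr><td>Ignored cell</td></tr></table>"
    "<p>Second <i>paragraph</i>.</p></div>"
)

demo_soup = BeautifulSoup(html, "html.parser")

# Keep the text of each <p> element; nested tags such as <i> are flattened
paragraphs = [p.get_text() for p in demo_soup.find_all("p")]

# Join the pieces once instead of growing a string inside the loop
demo_text = "".join(paragraphs)

print(paragraphs)
print(demo_text)
```

The table cell never reaches the result because it is not inside a <p> element, while the footnote marker does, which is why some cleaning is still needed.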



Let’s do a little bit of cleaning to get the same output as in the previous section:

- Drop footnote superscripts in brackets: re.sub(r'\[.*?\]+', '', text)
- Replace ‘\n’ (a new line) with ‘’ (an empty string): .replace('\n', '')

In [6]:
# Import package
import re

# Clean text
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
text
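
One thing to keep in mind is that the pattern above removes anything in square brackets, not only footnote numbers. Here is a small sketch on an invented sentence that compares it with a narrower pattern. The pattern `\[\d+\]` is an assumption of what a numeric-only footnote marker looks like, and it would miss markers such as [a] or [note 1].

```python
import re

# Hypothetical sentence with two footnote markers and one editorial bracket
sample = "He was last seen on August 18.[1] The file [sic] was never found.[23]"

# Pattern used in this post: drops every bracketed span
broad = re.sub(r'\[.*?\]+', '', sample)

# Narrower variant: drops only brackets that contain digits
narrow = re.sub(r'\[\d+\]', '', sample)

print(broad)
print(narrow)
```

For Wikipedia paragraphs the broad pattern is usually fine. On other sites it may be worth checking whether bracketed text carries meaning before removing it.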



We have extracted the exact same text using a more general approach. 