# Webscraping Wikipedia

In this notebook we are going to see some code to extract the text from a Wikipedia page. For this we are going to use the Python library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a really nice library for extracting content from HTML/XML files (the markup that all websites are made from, to some extent).

This code will load the content from a Wikipedia page we specify, find the main text section, and extract the text. Later in the notebook we will see how to write this to a file in append mode, so we can repeatedly load new Wikipedia articles and add them to an ever-expanding file.

You may need to uncomment and run these installs if you don't have the libraries `requests` and `beautifulsoup4` already:

In [11]:
# %pip install requests
# %pip install beautifulsoup4



Now let's import them:

In [13]:
import requests
from bs4 import BeautifulSoup

This function will extract the text from the main section of any Wikipedia page. It has two parameters: `page_title`, which is just the title of the Wikipedia article, and `language`, so you can scrape Wikipedia in languages other than English if you like!
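To see how these two parameters end up in the page address, here is a small sketch. The helper `build_wikipedia_url` is hypothetical (it is not part of this notebook), and swapping spaces for underscores and percent-encoding the title are assumptions added for illustration; the notebook's own function simply drops the title into an f-string.

```python
from urllib.parse import quote

def build_wikipedia_url(page_title, language='en'):
    # Hypothetical helper for illustration only
    # Wikipedia article URLs use underscores in place of spaces
    safe_title = quote(page_title.replace(' ', '_'))
    # The language code becomes the subdomain, e.g. 'en' or 'fr'
    return f'https://{language}.wikipedia.org/wiki/{safe_title}'

english_url = build_wikipedia_url('Canary Wharf')
french_url = build_wikipedia_url('Londres', language='fr')
print(english_url)
print(french_url)
```

Changing only the `language` argument switches which edition of Wikipedia the request goes to, but the title must be the one used in that edition.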

The main page content is in a `div` element that we find with the CSS selector `div.mw-body-content.mw-content-ltr div.mw-parser-output`. Why is it called that? I have no idea. Ask the Wikipedia devs!

It is worth bearing in mind that every website uses different names for the sections and properties in its HTML, and not all web pages are as easy to scrape as Wikipedia's! If you want to change this code to scrape a different website, you will almost certainly have to change the selector that is passed into `soup.select()`. **Pro tip:** instead of trying to work out which `div` you want to extract data from by reading through the HTML manually, you can use the Chrome extension [simple scraper](https://simplescraper.io/docs/), which has a nice interactive way of finding the properties that you want to scrape.
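If the selector syntax is unfamiliar, this minimal sketch may help. It runs `soup.select()` on a tiny made-up page instead of a real website; the HTML and the class names `article-body` and `content` are invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up page standing in for a real website (hypothetical HTML)
sample_html = """
<div class="sidebar"><p>Navigation links</p></div>
<div class="article-body">
  <div class="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# 'div.article-body div.content' means: a div with class "content"
# somewhere inside a div with class "article-body"
matches = soup.select('div.article-body div.content')

# select() returns a list of matching elements, so we take the first one
paragraphs = [p.get_text() for p in matches[0].find_all('p')]
print(paragraphs)
```

The paragraph in the sidebar is skipped because it sits outside the selected `div`. That is the same reason the Wikipedia selector keeps menus and footers out of the scraped text.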

<a id='scrape-function'></a>

In [14]:
def extract_main_body_text(page_title, language='en'):
    # Let's construct our URL with our language and page title input parameters
    wikipedia_url = f'https://{language}.wikipedia.org/wiki/{page_title}'
    
    # Make a request to the Wikipedia server and check that we get a successful response
    response = requests.get(wikipedia_url)
    if response.status_code != 200:
        return "Failed to retrieve the page."

    # Use Beautiful Soup to parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Empty string to put our content
    main_body_text = ""
    # Get the main body of the page using this specific CSS selector
    main_content = soup.select('div.mw-body-content.mw-content-ltr div.mw-parser-output')
    
    # If we have retrieved some content
    if main_content:
        # Find each paragraph
        for paragraph in main_content[0].find_all('p'):
            # And add that paragraph to our main_body_text string
            main_body_text += paragraph.get_text()
            # Add a new line after each paragraph
            main_body_text += '\n'

    return main_body_text

In this cell we will define the Wikipedia page that we want to scrape (in the variable `page_title`). We will then call the function above to extract the text from the page and store it in the variable `main_body_text`:

<a id='set-page'></a>

In [19]:
page_title = "Canary Wharf" 
main_body_text = extract_main_body_text(page_title, language='en')
print(main_body_text)



Canary Wharf is an area of East London, England, located near the Isle of Dogs in the London Borough of Tower Hamlets. Canary Wharf is defined by the Greater London Authority as being part of London's central business district, alongside Central London.[1] Alongside the City of London, it constitutes one of the main financial centres in the United Kingdom and the world,[2] containing many high-rise buildings including the third-tallest in the UK, One Canada Square,[3] which opened on 26 August 1991.[4]

Developed on the site of the former West India Docks, Canary Wharf contains around 16,000,000 sq ft (1,500,000 m2) of office and retail space. It has many open areas, including Canada Square, Cabot Square and Westferry Circus. Together with Heron Quays and Wood Wharf, it forms the Canary Wharf Estate, around 97 acres (39 ha) in area.

Canary Wharf is located on the West India Docks on the Isle of Dogs.

From 1802 to the late 1980s, what would become the Canary Wharf Estate was a part 

In [None]:
page_title = "Camberwell" 
main_body_text = extract_main_body_text(page_title, language='en')
print(main_body_text)

Here we can save our scraped data to a file. The file lives in the folder `data/my-data`, a special folder where you can put your datasets as you work on these code projects (without them being tracked by git).

This cell will write whatever text is in the variable `main_body_text` to the end of the file `my-wikipedia-text.txt`. Here we are using the mode `a`, which means append. This won't delete or overwrite any of the data already in our file. That way we can keep running the cell above to load text from a new page and the cell below to append it to our ever-growing file!

<a id='write-text'></a>

In [16]:
with open("../data/my-data/my-wikipedia-text.txt", "a") as myfile:
    myfile.write(main_body_text)
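
To see the difference between append mode and write mode without touching the real dataset, here is a small sketch that uses a throwaway temporary file. The file name `append-demo.txt` and the sample strings are made up for illustration.

```python
import os
import tempfile

# Use a throwaway file so we don't touch the real dataset (hypothetical path)
demo_path = os.path.join(tempfile.mkdtemp(), 'append-demo.txt')

# Each open(..., 'a') adds to the end of the file (and creates it if missing)
for text in ['first article\n', 'second article\n']:
    with open(demo_path, 'a', encoding='utf-8') as demo_file:
        demo_file.write(text)

with open(demo_path, 'r', encoding='utf-8') as demo_file:
    appended = demo_file.read()
print(appended)

# By contrast, 'w' mode empties the file before writing
with open(demo_path, 'w', encoding='utf-8') as demo_file:
    demo_file.write('replaced\n')

with open(demo_path, 'r', encoding='utf-8') as demo_file:
    overwritten = demo_file.read()
print(overwritten)
```

After the two appends the file holds both pieces of text, while the single write in `w` mode leaves only the last one. Passing `encoding='utf-8'` is an optional extra here; it can help keep non-English text consistent across operating systems.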

## Tasks

Run through this notebook a couple of times, [loading in different Wikipedia articles](#set-page) and [saving them](#write-text) to the file of scraped text. Don't worry about the data getting overwritten: we are writing in append mode, so our file will just get bigger each time you load and save the text from another page.

Open up the text file in `data/my-data/my-wikipedia-text.txt` and see what is in there. Does it look sensible? Once you feel you have scraped enough text, go to the `text-generation-with-markov-chains` notebook and try generating some text with the data you have scraped!

If you want to add more data more quickly, and find running the cells individually for one page at a time tedious, why don't you try writing some code that goes through a list of Wikipedia page titles, gets the text from each page one by one, and saves it to our appended file?

### Bonus tasks

There are some bonus tasks here if you want to develop your web scraping skills further. Alternatively, if you are more interested in the generative text component of this session, you can spend your time on the bonus tasks there.

**Bonus task A:** Can you modify the [function that performs the web scraping](#scrape-function) to include the headings in the Wikipedia page text as well as the paragraphs?

**Bonus task B:** Can you write some code that finds and removes the citations (numbers in square brackets, e.g. \[1\], \[2\], \[12\]) from the text before writing it to the file? Tip: You may want to borrow some of the regex code from the stemmer we built in week 3 to do this.

**Bonus task C:** Can you adapt this code to extract data from another website? You will almost certainly have to change the selector being searched for in [the web scraping function](#scrape-function). Either use your browser to look at the HTML code for the site (in most browsers this will be under a menu option called developer tools), or use the Chrome extension [simple scraper](https://simplescraper.io/docs/) to help you. It will not be possible to scrape data from all websites using this code, as lots of websites these days try to prevent bots from accessing their content!
