# Wikipedia Web Crawl

> In this lesson we will apply the skills learned previously by writing a web crawler that explores Wikipedia.

## Case Introduction

**Automating the Wikipedia Crawl**

Go to a Wikipedia page you find interesting, or just a random one, and click the first link. Then, on that page, click the first link in the main body of the article text, and keep going. Try this from several different starting pages and see whether you get different behaviour. We're going to automate this process, ending up with a program that follows these links through Wikipedia for us.

## Laying the Groundwork

1. Source Code

We work with the HTML (HyperText Markup Language) source code of Wikipedia pages.

2. Target Tags

*Anchor tags* (denoted by `<a></a>`) are used to create links. The link's destination is specified in the *href attribute*, and the text between the opening and closing tags is the link's text.

`<a href="https://en.wikipedia.org/wiki/Cat">Learn more about cats!</a>`
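To make the two parts of an anchor tag concrete, here is a minimal sketch that pulls the href and the link text out of the example above. It uses Python's standard-library `html.parser` only so that it runs without extra installs; the class name `AnchorCollector` is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []         # finished (href, text) pairs
        self._href = None       # href of the <a> tag currently open, if any
        self._text_parts = []   # text seen inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text_parts)))
            self._href = None


collector = AnchorCollector()
collector.feed('<a href="https://en.wikipedia.org/wiki/Cat">Learn more about cats!</a>')
print(collector.links)
# [('https://en.wikipedia.org/wiki/Cat', 'Learn more about cats!')]
```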

3. Get HTML

Tool: the requests library (Python)

```
# Install requests with pip
# $ pip3 install requests

import requests

# Download the article's HTML with requests
url = 'https://en.wikipedia.org/wiki/Dead_Parrot_sketch'
response = requests.get(url)
html = response.text
```

4. Parse HTML

Tool: Beautiful Soup

```
# Install Beautiful Soup with pip
# $ pip3 install beautifulsoup4

import bs4

# Find the article's first link: the first <a> in the first <p> of the article body
soup = bs4.BeautifulSoup(html, "html.parser")
article_link = soup.find(id='mw-content-text').find(class_="mw-parser-output").p.a.get('href')
```
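Links inside a Wikipedia article are usually relative (they start with `/wiki/`), so the href has to be turned into a full URL before it can be requested. The sketch below shows one way to do that with the standard library's `urljoin`; the `/wiki/Monty_Python` value is an assumed example, not the actual first link of the page.

```python
from urllib.parse import urljoin

# URL of the page the link was found on.
page_url = "https://en.wikipedia.org/wiki/Dead_Parrot_sketch"

# Assumed example of a relative href as it might appear in an article.
article_link = "/wiki/Monty_Python"

# urljoin resolves the relative path against the page's scheme and host.
first_link = urljoin(page_url, article_link)
print(first_link)  # https://en.wikipedia.org/wiki/Monty_Python
```

An href that is already a full URL passes through `urljoin` unchanged, so the same call works for both kinds of link.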

## Designing the Program

**Design**

* Looping
* Data structures
* Steps to perform
* Specific extras (slowing down)

**The Sequence of Steps**

1. Open an article
2. Find the first link in the article
3. Follow the link
4. Record the link in the article_chain data structure.
5. Repeat this process until we reach the Philosophy article, or get stuck in an article cycle

**Loop**

Steps of the loop:
1. Download the HTML for the current article
2. Find the first link in the current article's HTML
3. Add the first link from the current article to article_chain
4. Pause for a couple of seconds so we don't flood Wikipedia with requests

The program should end the while loop when:
1. we reach Philosophy,
2. we reach a page we've already visited, and so find ourselves in a cycle of articles (like the case of Chair),
3. we go on for too long (we think that 25 steps is plenty, but you can adjust this if you like), or
4. we find a page that has no links on it - we simply can't keep going in this case.
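The four stopping conditions can be collected into one check that the loop calls on every pass. This is a minimal sketch under a few assumptions: the function name `continue_crawl`, its parameters, and the convention that a missing link is recorded as `None` are all illustrative choices, not something fixed by the lesson.

```python
def continue_crawl(article_chain, target_url, max_steps=25):
    """Return True if the crawl should go on, False if it should stop.

    article_chain is the list of URLs visited so far, with the current
    page last. A current page of None means no link was found.
    """
    current = article_chain[-1]
    if current is None:
        print("The last article had no links, so the crawl cannot continue.")
        return False
    if current == target_url:
        print("Reached the target article.")
        return False
    if current in article_chain[:-1]:
        print("Reached an article already visited; stopping to avoid a cycle.")
        return False
    if len(article_chain) > max_steps:
        print("The crawl has gone on too long; stopping.")
        return False
    return True


target = "https://en.wikipedia.org/wiki/Philosophy"
print(continue_crawl(["https://en.wikipedia.org/wiki/Cat"], target))  # True
```

Keeping the checks in a separate function leaves the loop body short and makes each condition easy to test on its own.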

**Pseudo Code**

```
page = a random starting page
article_chain = []
while title of page isn't 'Philosophy' and we have not discovered a cycle:
    append page to article_chain
    download the page content
    find the first link in the content
    page = that link
    pause for a couple of seconds
```
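The pseudocode translates to Python fairly directly. The sketch below is one possible shape, not the lesson's final program: `crawl` and its parameters are hypothetical names, and the function that downloads a page and finds its first link is passed in as `get_first_link` so the loop can be tried without touching the network. It also adds the step limit and the "no link found" case from the list of stopping conditions.

```python
import time


def crawl(start_url, target_url, get_first_link, pause_seconds=2, max_steps=25):
    """Follow first links from start_url and return (article_chain, last_page).

    get_first_link is a caller-supplied function (hypothetical here) that
    takes a page URL and returns the URL of its first link, or None.
    """
    article_chain = []
    page = start_url
    while page is not None and page != target_url and page not in article_chain:
        if len(article_chain) >= max_steps:
            break
        article_chain.append(page)
        page = get_first_link(page)   # download the page and find its first link
        time.sleep(pause_seconds)     # be polite: do not flood the server
    return article_chain, page


# A tiny stand-in for Wikipedia: each page maps to its first link.
fake_links = {"A": "B", "B": "C", "C": "Philosophy"}
chain, last = crawl("A", "Philosophy", fake_links.get, pause_seconds=0)
print(chain, last)  # ['A', 'B', 'C'] Philosophy
```

The second return value shows why the loop ended: it is the target, a page already in the chain, `None` when no link was found, or an unvisited page when the step limit was hit.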

## Implementing the Program

**Write code**

* Code to control loop
* Steps inside loop
* Plan find_first_link
* Write find_first_link
* Test
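As an illustration of the last three items, here is a rough sketch of a `find_first_link` that works on an HTML string, followed by two quick checks on hand-written snippets. It deliberately swaps Beautiful Soup for the standard-library `html.parser` so it can be tested offline; the helper class name is hypothetical, and real Wikipedia pages will likely need extra filtering (for example, skipping footnote links) that this sketch does not attempt.

```python
from html.parser import HTMLParser


class _FirstParagraphLink(HTMLParser):
    """Remember the href of the first <a> that appears inside a <p>."""

    def __init__(self):
        super().__init__()
        self.paragraph_depth = 0   # > 0 while we are inside a <p> element
        self.first_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraph_depth += 1
        elif tag == "a" and self.paragraph_depth > 0 and self.first_href is None:
            self.first_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "p" and self.paragraph_depth > 0:
            self.paragraph_depth -= 1


def find_first_link(html):
    """Return the href of the first link inside a paragraph, or None."""
    parser = _FirstParagraphLink()
    parser.feed(html)
    return parser.first_href


# Quick tests on small snippets: links outside paragraphs are ignored.
sample = ('<div><a href="/wiki/Main_Page">Home</a><p>No link here.</p>'
          '<p>See <a href="/wiki/Cat">cats</a>.</p></div>')
print(find_first_link(sample))             # /wiki/Cat
print(find_first_link("<p>No links</p>"))  # None
```

Returning `None` when no link is found gives the loop a clear signal for the "page has no links" stopping condition.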