# Exercise 2: Writing a Simple Web Crawler

Write a simple web crawler, that is, a program that recursively extracts all links from web pages. Running the crawler produces a dictionary in which each key is the URL of a web page and the corresponding value is the list of outgoing links found on that page.


If a page returns a status code other than `200`, we simply disregard that page. See https://en.wikipedia.org/wiki/List_of_HTTP_status_codes for more details on the various HTTP status codes.
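The status check itself can be illustrated in isolation. The following is only a minimal sketch: `should_process` is a hypothetical helper name that does not appear in the exercise, and the status codes in the loop are arbitrary examples. It relies on the standard library's `http.HTTPStatus`, whose members compare equal to plain integers.

```python
from http import HTTPStatus


def should_process(status_code):
    """Return True only for a 200 OK response (hypothetical helper)."""
    return status_code == HTTPStatus.OK


# Arbitrary example codes: only the first one would be kept by the crawler
for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, should_process(code))
```

In a crawler, the same comparison would be applied to the integer status code of each response before the page is parsed.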

In [36]:
import bs4
import requests

url = 'https://www.cphbusiness.dk/'

def do_it(url):
    """Return {url: [https links found on the page]}."""
    links_dict = {url: []}
    r = requests.get(url)
    # Disregard any page that does not answer with status code 200
    if r.status_code != 200:
        return links_dict
    soup = bs4.BeautifulSoup(r.text, 'html.parser')

    for link in soup.find_all('a'):
        href = link.get('href')
        # Keep only absolute https links; relative links are skipped
        if href is not None and href.startswith('https'):
            links_dict[url].append(href)
    return links_dict

links_dict = do_it(url)
new_links = []

for new_url in links_dict[url]:
    new_links.append(do_it(new_url))

print("Start page:", links_dict)
print("Pages under start page:", new_links)

Start page: {'https://www.cphbusiness.dk/': ['https://intra.cphbusiness.dk/', 'https://intra.cphbusiness.dk/', 'https://cphbusiness.mrooms.net/', 'https://selvbetjening.cphbusiness.dk/', 'https://wayf.survey-xact.dk/', 'https://europe.wiseflow.net/login/', 'https://efif-my.sharepoint.com/_layouts/15/MySite.aspx?MySiteRedirect=AllDocuments', 'https://www.cphbusiness.dk/om-cphbusiness/alumni#jobportaler', 'https://www.facebook.com/copenhagenbusinessacademy', 'https://www.linkedin.com/company/copenhagen-business-academy', 'https://twitter.com/cphbusiness', 'https://www.instagram.com/cphbusiness/']}
Pages under start page: [{'https://intra.cphbusiness.dk/': []}, {'https://intra.cphbusiness.dk/': []}, {'https://cphbusiness.mrooms.net/': ['https://cphbusiness.mrooms.net', 'https://cphbusiness.mrooms.net/login/index.php', 'https://cphbusiness.mrooms.net/mahara/auth/xmlrpc/jump.php?hostwwwroot=https%3A%2F%2Fcphbusiness.mrooms.net&wantsurl=%2F&remoteurl=1', 'https://cphbusiness.mrooms.net/login
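The cell above follows links only one level below the start page. One way to extend it to a bounded multi-level crawl is sketched here, under several assumptions: the names `crawl`, `get_links`, `max_depth` and `fake_web` are hypothetical, and an iterative breadth-first traversal is used in place of literal recursion so that every page is first reached at its shallowest depth. A set of already crawled URLs (the keys of the result dictionary) keeps cycles between pages from causing an endless loop. To stay runnable without network access, the sketch crawls a small in-memory "web" instead of real pages.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=2):
    """Collect outgoing links for every page within max_depth of start_url.

    get_links is any callable mapping a URL to its list of outgoing links
    (an empty list for pages that should be disregarded).
    """
    links_dict = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in links_dict:
            # Already crawled: skipping it also breaks link cycles
            continue
        links_dict[url] = get_links(url)
        if depth < max_depth:
            for link in links_dict[url]:
                queue.append((link, depth + 1))
    return links_dict


# Hypothetical in-memory "web" standing in for real pages
fake_web = {
    'https://a.example/': ['https://b.example/', 'https://c.example/'],
    'https://b.example/': ['https://a.example/'],  # links back: a cycle
    'https://c.example/': ['https://d.example/'],
    'https://d.example/': [],
}

result = crawl('https://a.example/', lambda u: fake_web.get(u, []), max_depth=1)
print(result)
```

To crawl real pages, `get_links` could be replaced by something like `lambda u: do_it(u)[u]`, which reuses the function from the cell above. A small `max_depth` is advisable, because the number of pages tends to grow quickly with every level.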

## Exercise with findall()
In the following text, find the family names of everyone whose first name is Peter:

"Peter Hansen was meeting up with Jacob Fransen for a quick lunch, but first he had to go by Peter Beier to pick up some chocolate for his wife. Meanwhile Pastor Peter Jensen was going to church to give his sermon for the same 3 people in his parish. Those were Peter Kold and Henrik Halberg plus a third person who had recently moved here from Norway called Peter Harold".

In [43]:
import re

text = "Peter Hansen was meeting up with Jacob Fransen for a quick lunch, but first he had to go by Peter Beier to pick up some chocolate for his wife. Meanwhile Pastor Peter Jensen was going to church to give his sermon for the same 3 people in his parish. Those were Peter Kold and Henrik Halberg plus a third person who had recently moved here from Norway called Peter Harold"
# With a single capture group, findall() returns the family names directly;
# \b keeps the pattern from matching "Peter" inside a longer word
reg = re.compile(r'\bPeter (\w+)')
family_names_for_peter = reg.findall(text)

family_names_for_peter

['Hansen', 'Beier', 'Jensen', 'Kold', 'Harold']
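What `findall()` returns depends on the number of capture groups in the pattern, which determines whether the matches need to be unpacked afterwards. The short sketch below shows the three cases on a shortened sample sentence; the variable names are chosen here for illustration only.

```python
import re

sample = "Peter Hansen met Jacob Fransen and Peter Beier"

# No group: each match is the whole matched string
no_group = re.findall(r'Peter \w+', sample)

# One group: each match is just the text of that group
one_group = re.findall(r'Peter (\w+)', sample)

# Two groups: each match is a tuple with one entry per group
two_groups = re.findall(r'(Peter) (\w+)', sample)

print(no_group)    # ['Peter Hansen', 'Peter Beier']
print(one_group)   # ['Hansen', 'Beier']
print(two_groups)  # [('Peter', 'Hansen'), ('Peter', 'Beier')]
```

With two groups, the family name is the second element of each tuple; with a single group around the family name, no further unpacking is needed.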