# Simple Web Crawler Implementation

The simple web crawler designed here consists of four main modules:
* <b>Scheduler</b>: maintains a queue of URLs to visit
* <b>Downloader</b>: downloads web pages
* <b>Analyzer</b>: analyzes content and extracts links
* <b>Storage</b>: stores content and metadata

## 1) Basic Downloader
Every web crawler should declare a <i>name</i> and identify its <i>owner</i> (i.e., the '`User-Agent`' and '`From`' fields of the request headers, respectively). A request may sometimes fail, for instance because of a connection timeout or a page that cannot be found. You can print '`response.status_code`' to track down the problem.
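As a small aid for reading status codes, the sketch below turns a numeric code into a readable label using only the standard library. The helper name `describe_status` is hypothetical and not part of the original notebook; it could be called with `response.status_code`. Note that a connection timeout produces no response at all, so there is no status code to inspect in that case.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return a label such as '404 Not Found (client error)' for an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = 'Unknown'          # code not defined in http.HTTPStatus
    if 200 <= status_code < 300:
        kind = 'success'
    elif 300 <= status_code < 400:
        kind = 'redirect'
    elif 400 <= status_code < 500:
        kind = 'client error'
    elif 500 <= status_code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    return f'{status_code} {phrase} ({kind})'

for code in (200, 404, 503):
    print(describe_status(code))
```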

In [17]:
import requests
from requests.exceptions import HTTPError

headers = {
    'User-Agent': '6210506348',
    'From': 'natthakit.n@ku.th'
}
seed_url = 'https://www.ku.ac.th/th/'

def get_page(url):
    global headers
    text = ''
    try:
        response = requests.get(url, headers=headers, timeout=2)
        # If the response was successful, no Exception will be raised
        response.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')  # Python 3.6
        # return False
    except Exception as err:
        print(f'Other error occurred: {err}')  # Python 3.6
        # return False
    else:
        # print('Success!')
        text = response.text
    return text.lower()

raw_html = get_page(seed_url)
print(raw_html)

<!doctype html>
<html lang="th">

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <meta http-equiv="x-ua-compatible" content="ie=edge">
  <title>มหาวิทยาลัยเกษตรศาสตร์</title>

  <meta name="keywords" content="ku,kasetsart university,มหาวิทยาลัยเกษตรศาสตร์"/>
<meta name="description" content="มหาวิทยาลัยเกษตรศาสตร์ สร้างสรรค์ศาสตร์แห่งแผ่นดินสู่สากลเพื่อพัฒนาประเทศอย่างยั่งยืน kasetsart university is a public research university in bangkok,..." />
<meta property="og:site_name" content="www.ku.ac.th"/>
<meta property="og:locale" content="th_th"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="ku | มหาวิทยาลัยเกษตรศาสตร์ รอบรั้วชาวนนทรี" />
<meta property="og:url" content="https://www.ku.ac.th/th/"/>
<meta property="og:image" content="https://www.ku.ac.th/assets/ku_logo.png" />
<meta property="og:description" content="มหาวิทยาลัยเกษตรศาสตร์ สร้างสรรค์ศาสตร์แห่งแผ่นดินสู่สากลเพื่อพั

## 2) Basic Analyzer
### 2.1 Link Parser
The following code is an example of a simple link parser. It extracts links from <i>anchor</i> tags only and stores them in a `urls` list, skipping duplicates.

In [2]:
def link_parser(raw_html):
    urls = []
    pattern_start = '<a href="'
    pattern_end = '"'
    index = 0
    length = len(raw_html)
    while index < length:
        start = raw_html.find(pattern_start, index)
        # find() returns -1 when the pattern is absent; 0 is a valid match position
        if start == -1:
            break
        start = start + len(pattern_start)
        end = raw_html.find(pattern_end, start)
        if end == -1:
            # Unterminated attribute value: stop scanning
            break
        link = raw_html[start:end]
        if len(link) > 0 and link not in urls:
            urls.append(link)
        index = end
    return urls

raw_html = '<html><body><a href="http://test1.com">test1</a><br><a href="http://test2.com">test2</a></body></html>'
print(link_parser(raw_html))

['http://test1.com', 'http://test2.com']
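The string search above only matches anchors written exactly as `<a href="...">`. As an alternative sketch (not the notebook's method), the standard-library `html.parser.HTMLParser` can be used instead, which also handles extra attributes, single quotes, and upper-case tags. The names `LinkExtractor` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect unique href values from <a> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value not in self.urls:
                self.urls.append(value)

def extract_links(raw_html):
    parser = LinkExtractor()
    parser.feed(raw_html)
    return parser.urls

sample = "<a class='nav' href='/about'>About</a><A HREF=\"http://test1.com\">t</A>"
print(extract_links(sample))  # -> ['/about', 'http://test1.com']
```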


### 2.2 URL Normalization
The following code uses the `urljoin()` function to convert a relative URL into an absolute one.

In [3]:
from urllib.parse import urljoin

# Define an absolute (base) URL of a web page
base_url = 'https://mike.cpe.ku.ac.th'

# An example of the extracted absolute link
link_1 = 'http://www.ku.ac.th'
# An example of the extracted relative link
link_2 = 'download/homework.html'

# Resolve links
abs_link_1 = urljoin(base_url, link_1)
abs_link_2 = urljoin(base_url, link_2)

print(abs_link_1)  # -> http://www.ku.ac.th
print(abs_link_2)  # -> https://mike.cpe.ku.ac.th/download/homework.html

http://www.ku.ac.th
https://mike.cpe.ku.ac.th/download/homework.html
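Beyond resolving relative links, a crawler usually wants to drop fragments and skip non-web schemes such as `mailto:`. The sketch below shows one way to do this with `urljoin()`, `urldefrag()`, and `urlparse()`. The helper name `normalize_link` and the example page URL are assumptions made for illustration.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_link(page_url, link):
    """Return an absolute http(s) URL without its fragment, or None if not crawlable."""
    absolute = urljoin(page_url, link)          # resolve against the containing page
    absolute, _fragment = urldefrag(absolute)   # '#...' is never sent to the server
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute

page_url = 'https://mike.cpe.ku.ac.th/course/index.html'  # hypothetical page
print(normalize_link(page_url, 'download/homework.html'))
# -> https://mike.cpe.ku.ac.th/course/download/homework.html
print(normalize_link(page_url, '/th/#top'))
# -> https://mike.cpe.ku.ac.th/th/
print(normalize_link(page_url, 'mailto:www@ku.ac.th'))
# -> None
```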


## 3) Basic Scheduler
The following code uses a FIFO queue to manage the extracted URLs that are waiting to be downloaded. The main crawling process invokes the two functions defined earlier, `get_page()` and `link_parser()`, to download a web page and extract its outgoing links, respectively. All extracted links are then stored in a queue. We define two lists here: `frontier_q` and `visited_q`. The former is the FIFO queue holding the URLs to download next, while the latter records which web pages have already been downloaded.

In [4]:
seed_url = 'https://www.ku.ac.th/th/'
frontier_q = [seed_url]
visited_q = []

# param 'links' is a list of extracted links to be stored in the queue
def enqueue(links):
    global frontier_q
    for link in links:
        if link not in frontier_q and link not in visited_q:
            frontier_q.append(urljoin(seed_url,link))

# FIFO queue
def dequeue():
    global frontier_q
    current_url = frontier_q[0]
    frontier_q = frontier_q[1:]
    return current_url

#--- main process ---#
current_url = dequeue()
visited_q.append(current_url)
raw_html = get_page(current_url)
extracted_links = link_parser(raw_html)
enqueue(extracted_links)
print(frontier_q)

Success!
['https://www.ku.ac.th/th/community-home', 'https://www.ku.ac.th/th/newcomer-home', 'https://www.ku.ac.th/th/partner-home', 'https://www.facebook.com/kasetsartuniversity', 'https://twitter.com/kasetsart_ku?s=09', 'https://www.instagram.com/kasetsart_ku/', 'https://www.ku.ac.th/th/', 'https://www.youtube.com/channel/uc1lx-ul4ln8jxedtdxep7ga', 'mailto:www@ku.ac.th']
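The recorded frontier still contains the seed URL, because the duplicate check runs on the raw link before it is normalised. A possible variant is sketched below. It is not the notebook's implementation: it normalises first, uses `collections.deque` for constant-time dequeuing, and uses a set for duplicate checks. The function `crawl`, its `fetch_links` parameter, and the toy link graph are all hypothetical stand-ins for `get_page()` plus `link_parser()`.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_url, fetch_links, max_pages=10):
    """Breadth-first crawl; fetch_links(url) returns the raw links found on that page."""
    frontier = deque([seed_url])
    seen = {seed_url}      # every URL ever enqueued, for O(1) duplicate checks
    visited = []           # URLs in the order they were processed
    while frontier and len(visited) < max_pages:
        current_url = frontier.popleft()
        visited.append(current_url)
        for link in fetch_links(current_url):
            absolute = urljoin(current_url, link)  # normalise before the duplicate check
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited

# Toy link graph standing in for real downloads
toy_web = {
    'http://a.test/': ['b.html', 'c.html'],
    'http://a.test/b.html': ['/', 'c.html'],
    'http://a.test/c.html': ['http://a.test/b.html'],
}
print(crawl('http://a.test/', lambda url: toy_web.get(url, [])))
# -> ['http://a.test/', 'http://a.test/b.html', 'http://a.test/c.html']
```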


## 4) Storing Text into a File
In the following, we first use the `os.makedirs()` function to create (sub)directories. Note that the `exist_ok=True` parameter prevents an exception if the target directory already exists. We then use `open()`, `write()`, and `close()` to open a file, write some text into it, and close it afterwards. In addition, we import the `codecs` module and use the '`utf-8`' encoding to handle non-English content.

In [5]:
import os, codecs

# Create (sub)directories with the 0o755 permission
# param 'exist_ok' is True for no exception if the target directory already exists
path = 'html/subdir1/subdir2'
os.makedirs(path, 0o755, exist_ok=True)

# Download a page and write its content into a file
raw_html = get_page('http://sis.ku.ac.th/')
abs_file = path + '/index2' + '.html'
f = codecs.open(abs_file, 'w', 'utf-8')
f.write(raw_html)
f.close()

Success!


In [6]:
from urllib.parse import urlparse

url = 'www.ku.ac.th/th/scholarships?category=120#kuyraisas'
result = urlparse(url)

print(result)
print(result.path)

filepath = 'html/' + result.netloc + result.path[:result.path.rfind('/')]
print(filepath)

filename = result.path[result.path.rfind('/')+1:] 
if result.query != '':
  filename = filename + '?' + result.query
if result.fragment != '':
  filename = filename + '#' + result.fragment
if filename == '':
  filename = 'dummy'

        
print(filename)

ParseResult(scheme='', netloc='', path='www.ku.ac.th/th/scholarships', params='', query='category=120', fragment='kuyraisas')
www.ku.ac.th/th/scholarships
html/www.ku.ac.th/th
scholarships?category=120#kuyraisas
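In the output above, `netloc` is empty and the host name ends up in `path`, because `urlparse()` only recognises a host when the URL has a scheme (or starts with `//`). The sketch below shows one possible mapping from a URL to a directory and file name. The helper `url_to_path`, the `http://` default for scheme-less input, the `index` fallback name, and the percent-encoding of the query are all assumptions, not the notebook's method.

```python
from urllib.parse import urlparse, quote

def url_to_path(url, root='html'):
    """Map a URL to a (directory, filename) pair under root."""
    if '://' not in url:
        url = 'http://' + url      # assumption: scheme-less input means http
    result = urlparse(url)
    directory, _, filename = result.path.rpartition('/')
    if result.query:
        filename += '?' + result.query
    # The fragment is dropped; '?' and '=' are percent-encoded to stay file-system safe
    filename = quote(filename, safe='') or 'index'
    if not filename.endswith(('.htm', '.html')):
        filename += '.html'
    return root + '/' + result.netloc + directory, filename

print(url_to_path('www.ku.ac.th/th/scholarships?category=120#kuyraisas'))
# -> ('html/www.ku.ac.th/th', 'scholarships%3Fcategory%3D120.html')
```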


# <font color="blue">Your Turn ...</font>
Write a web crawler to collect 10,000 web pages (including only '`.htm`' and '`.html`' files) within the '`ku.ac.th`' domain.
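One way to express the scope of this task is a filter that is applied to every URL before it enters the frontier. The sketch below is a strict reading of the requirement: the helper `in_scope` is hypothetical, and treating only paths that literally end in `.htm` or `.html` as valid is an assumption.

```python
from urllib.parse import urlparse

def in_scope(url, domain='ku.ac.th'):
    """True if url is an http(s) page on domain (or a subdomain) ending in .htm/.html."""
    result = urlparse(url)
    if result.scheme not in ('http', 'https'):
        return False
    host = (result.hostname or '').lower()
    # Match the domain itself or a real subdomain, not e.g. 'notku.ac.th'
    if host != domain and not host.endswith('.' + domain):
        return False
    return result.path.lower().endswith(('.htm', '.html'))

print(in_scope('http://www.fish.ku.ac.th/fishpro/index.html'))  # -> True
print(in_scope('https://www.facebook.com/kasetsartuniversity'))  # -> False
print(in_scope('mailto:www@ku.ac.th'))                           # -> False
```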

In [7]:
seed_url = 'https://cooking.kapook.com/'

In [39]:
i=0
# frontier_q = ['https://www.siammakro.co.th/','https://www.lotuss.com/th','https://cpfreshmartshop.com/','https://www.tops.co.th/th/']
frontier_q = ['https://nlovecooking.com/','https://krua.co/','https://cooking.kapook.com/','https://cookpad.com/th']
visited_q = []
downloaded = []
KEY_WORD = ('วัตถุดิบ','แคลอรี่','อาหาร','เมนู','ของกิน','กับข้าว','รสชาติ','อร่อย','เครื่องเคียง','ของว่าง','เครื่องดื่ม','ขนม')

In [43]:
from urllib.parse import urlparse

headers = {
    'User-Agent': '6210506348',
    'From': 'natthakit.n@ku.th'
}

def link_parser(raw_html):
    urls = [];
    pattern_start = '<a href="';  pattern_end = '"'
    index = 0;  length = len(raw_html)
    while index < length:
        start = raw_html.find(pattern_start, index)
        if start != -1:  # find() returns -1 when absent; 0 is a valid match position
            start = start + len(pattern_start)
            end = raw_html.find(pattern_end, start)
            link = raw_html[start:end]
            if len(link) > 0:
                if link not in urls:
                    urls.append(link)
            index = end
        else:
            break
    return urls

def enqueue(links):
    global frontier_q
    for link in links:
        # Resolve relative links against the page they were found on, not the seed URL
        link = urljoin(current_url, link)
        if link not in frontier_q and link not in visited_q:
            frontier_q.append(link)

def dequeue():
    global frontier_q
    current_url = frontier_q[0]
    frontier_q = frontier_q[1:]
    return current_url           

while frontier_q:  # stop cleanly once the frontier is empty
    current_url = dequeue()
    if 'download' in current_url or '.pdf' in current_url:
        continue

    # print(visited_q)
    # print(frontier_q)
    
    path = 'html/' + current_url.replace('https://','')
    result = urlparse(current_url)
    filepath = 'html/' + result.netloc + result.path[:result.path.rfind('/')]
    filename = result.path[result.path.rfind('/')+1:] 
    
    if result.query != '':
        # filename = filename + '' + result.query
        continue
    if result.fragment != '':
        # filename = filename + '' + result.fragment
        continue

    if filename == '':
        filename = 'dummy'
           
    if len(filename) > 50 :
        continue

    # Only keep pages whose file name has no extension or ends in .htm/.html
    if '.' in filename and not filename.endswith(('.htm', '.html')):
        continue

    visited_q.append(current_url)
    raw_html = get_page(current_url)
    extracted_links = link_parser(raw_html)

    enqueue(extracted_links)

    if(sum([1 for x in KEY_WORD if x in raw_html])<3):
        continue

    # Skip social-media pages before touching the disk
    # (note: 'line' also matches URLs that merely contain it, such as 'online')
    blocked = ('facebook', 'youtube', 'google', 'instagram', 'twitter', 'line')
    if any(site in current_url for site in blocked):
        continue

    abs_file = filepath + '/' + filename
    if not filename.endswith(('.htm', '.html')):
        abs_file = abs_file + '.html'

    try:
        os.makedirs(filepath, 0o755, exist_ok=True)
        with codecs.open(abs_file, 'w', 'utf-8') as f:
            f.write(raw_html)
    except (OSError, ValueError):
        # Invalid or overlong path: skip this page
        continue
    
    print('current_url =',current_url)
    print('filepath =',filepath)
    print('filename =',filename)
    print('abs_file =',abs_file)
    downloaded.append(current_url)
    
    print('#',i+1)
    
    i+=1
    if i==14000:
        break


current_url = https://nlovecooking.com/tag/%e0%b8%aa%e0%b8%b9%e0%b8%95%e0%b8%a3%e0%b8%99%e0%b9%89%e0%b8%b3%e0%b8%9e%e0%b8%a3%e0%b8%b4%e0%b8%81%e0%b8%9b%e0%b8%b1%e0%b8%81%e0%b8%a9%e0%b9%8c%e0%b9%83%e0%b8%95%e0%b9%89/
filepath = html/nlovecooking.com/tag/%e0%b8%aa%e0%b8%b9%e0%b8%95%e0%b8%a3%e0%b8%99%e0%b9%89%e0%b8%b3%e0%b8%9e%e0%b8%a3%e0%b8%b4%e0%b8%81%e0%b8%9b%e0%b8%b1%e0%b8%81%e0%b8%a9%e0%b9%8c%e0%b9%83%e0%b8%95%e0%b9%89
filename = dummy
abs_file = html/nlovecooking.com/tag/%e0%b8%aa%e0%b8%b9%e0%b8%95%e0%b8%a3%e0%b8%99%e0%b9%89%e0%b8%b3%e0%b8%9e%e0%b8%a3%e0%b8%b4%e0%b8%81%e0%b8%9b%e0%b8%b1%e0%b8%81%e0%b8%a9%e0%b9%8c%e0%b9%83%e0%b8%95%e0%b9%89/dummy.html
# 12001
current_url = https://nlovecooking.com/tag/%e0%b8%9c%e0%b8%b1%e0%b8%94%e0%b8%a1%e0%b8%b0%e0%b9%80%e0%b8%82%e0%b8%b7%e0%b8%ad%e0%b8%a2%e0%b8%b2%e0%b8%a7%e0%b9%80%e0%b8%88/
filepath = html/nlovecooking.com/tag/%e0%b8%9c%e0%b8%b1%e0%b8%94%e0%b8%a1%e0%b8%b0%e0%b9%80%e0%b8%82%e0%b8%b7%e0%b8%ad%e0%b8%a2%e0%b8%b2%e0%b8%a7%e0%b9%80%e0%b

In [37]:
frontier_q

['http://fishbio.fish.ku.ac.th/',
 'http://www.fish.ku.ac.th/fishpro/index.html',
 'http://www.fish.ku.ac.th/fishaquaculture/index.html',
 'http://www.fish.ku.ac.th/fishmarine/index.html',
 'http://www.fish.ku.ac.th/sriracha/',
 'http://www.fish.ku.ac.th/oldversion/klongwan/default.asp?dept=13',
 'http://www.fish.ku.ac.th/oldversion/samutsongkhram/default.asp?dept=14',
 'http://www.fish.ku.ac.th/oldversion/kps/default.asp?dept=15',
 'http://www2.rdi.ku.ac.th/andaman/',
 'http://www.fish.ku.ac.th/oldversion/kumf/default.asp?dept=18',
 'http://www.fish.ku.ac.th/oldversion/flib/default.asp?dept=16',
 'https://www.ku.ac.th/photo-album/view/53',
 'https://fish.ku.ac.th/',
 'https://www.facebook.com/%e0%b8%84%e0%b8%93%e0%b8%b0%e0%b8%9b%e0%b8%a3%e0%b8%b0%e0%b8%a1%e0%b8%87-%e0%b8%a1%e0%b8%ab%e0%b8%b2%e0%b8%a7%e0%b8%b4%e0%b8%97%e0%b8%a2%e0%b8%b2%e0%b8%a5%e0%b8%b1%e0%b8%a2%e0%b9%80%e0%b8%81%e0%b8%a9%e0%b8%95%e0%b8%a3%e0%b8%a8%e0%b8%b2%e0%b8%aa%e0%b8%95%e0%b8%a3%e0%b9%8c-882670935113568/',
 'http

In [None]:
with open('./downloaded.txt', 'w') as writefile:
    for d in downloaded:
        writefile.write(d+'\n')

In [None]:
# Read the saved URLs back, one per line, without the trailing newline
with open('./downloaded.txt', 'r') as my_file:
    content = [line.rstrip('\n') for line in my_file]
content

['https://www.ku.ac.th/th',
 'https://www.ku.ac.th/th/community-home',
 'https://www.ku.ac.th/th/newcomer-home',
 'https://www.ku.ac.th/th/partner-home',
 'https://www.ku.ac.th/th/login',
 'https://www.ku.ac.th/th/recruitment',
 'https://registrar.ku.ac.th/%e0%b8%9a%e0%b8%a3%e0%b8%b4%e0%b8%81%e0%b8%b2%e0%b8%a3%e0%b8%99%e0%b8%b4%e0%b8%aa%e0%b8%b4%e0%b8%95/%e0%b8%9b%e0%b8%8f%e0%b8%b4%e0%b8%97%e0%b8%b4%e0%b8%99%e0%b8%81%e0%b8%b2%e0%b8%a3%e0%b8%a8%e0%b8%b6%e0%b8%81%e0%b8%a9%e0%b8%b2-%e0%b8%a1%e0%b8%81/',
 'https://www.grad.ku.ac.th/%e0%b8%99%e0%b8%b4%e0%b8%aa%e0%b8%b4%e0%b8%95/%e0%b8%9b%e0%b8%8f%e0%b8%b4%e0%b8%97%e0%b8%b4%e0%b8%99%e0%b8%81%e0%b8%b2%e0%b8%a3%e0%b8%a8%e0%b8%b6%e0%b8%81%e0%b8%a9%e0%b8%b2/',
 'https://www.ku.ac.th/tuition-fees/',
 'https://www.ku.ac.th/faculty-bangkhen',
 'https://www.ku.ac.th/faculty-kamphaeng-saen-campus',
 'https://www.ku.ac.th/faculty-chalermphakiet-campus-sakon-nakhon',
 'https://www.ku.ac.th/faculty-sriracha-campus',
 'https://www.ku.ac.th/faculty-suphan

In [None]:
hostname = []
for url in content:
    result = urlparse(url)
    # Keep the scheme so that http and https hosts are listed separately
    h = result.scheme + '://' + result.netloc
    if h not in hostname:
        hostname.append(h)

with open('./hostname.txt', 'w') as writefile:
    for d in hostname:
        writefile.write(d+'\n')

In [None]:
hostname

['https://www.ku.ac.th',
 'https://registrar.ku.ac.th',
 'https://www.grad.ku.ac.th',
 'https://www3.rdi.ku.ac.th',
 'https://webmail.ku.ac.th',
 'https://ku.ac.th',
 'https://annualconference.ku.ac.th',
 'https://u2t.ku.ac.th',
 'https://www.sdgs.ku.ac.th',
 'https://greencampus.ku.ac.th',
 'https://login.ku.ac.th',
 'https://live.ku.ac.th',
 'https://ocs.ku.ac.th',
 'https://kps.ku.ac.th',
 'https://www.csc.ku.ac.th',
 'https://www.src.ku.ac.th',
 'https://www.sbc.ku.ac.th',
 'https://admission.ku.ac.th',
 'http://sis.ku.ac.th',
 'https://coed.ku.ac.th',
 'https://askme.registrar.ku.ac.th',
 'http://www.ku.ac.th',
 'http://www.kps.ku.ac.th',
 'http://www.src.ku.ac.th',
 'http://www.csc.ku.ac.th',
 'https://kuservice.ku.ac.th',
 'http://clgc.agri.kps.ku.ac.th',
 'http://www.hort.ku.ac.th',
 'http://kukr2.lib.ku.ac.th',
 'https://lib.ku.ac.th',
 'https://iad.intaff.ku.ac.th',
 'https://www.registrar.ku.ac.th',
 'http://ku.ac.th',
 'http://www.sbc.ku.ac.th',
 'http://kps.ku.ac.th',
 'ht

In [None]:
list_robot = []
list_sitemap = []
list_success_robot = []
for h in hostname:
    hb = h + '/robots.txt'
    raw_html = get_page(hb)
    
    if 'user-agent' in raw_html:
        list_robot.append(h)

    if raw_html != '':  # get_page() returns '' when the request fails
        list_success_robot.append(h)

    if 'sitemap' in raw_html:
        list_sitemap.append(h)

Success!
Success!
Other error occurred: HTTPSConnectionPool(host='www.grad.ku.ac.th', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
Other error occurred: HTTPSConnectionPool(host='www3.rdi.ku.ac.th', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
HTTP error occurred: 404 Client Error: Not Found for url: https://webmail.ku.ac.th/robots.txt
Success!
HTTP error occurred: 404 Client Error: Not Found for url: https://annualconference.ku.ac.th/robots.txt
Success!
Success!
Success!
Other error occurred: HTTPSConnectionPool(host='login.ku.ac.th', port=443): Max retries exceeded with url: /robots.txt (Caused by ConnectTimeoutError(<urllib3.connection.HTT
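The cell above only checks whether a `robots.txt` file exists. To actually honour its rules, the standard-library `urllib.robotparser` module can parse the downloaded text. The sketch below is a possible approach, not part of the original notebook: the helper `build_robot_rules` and the sample rules are hypothetical, while the user-agent string is the one used in the headers earlier. Because `get_page()` lower-cases its result, case-sensitive paths in a real `robots.txt` may be altered before parsing.

```python
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

# Hypothetical robots.txt content for illustration
sample_robots = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(sample_robots)
print(rules.can_fetch('6210506348', 'http://sis.ku.ac.th/index.html'))      # -> True
print(rules.can_fetch('6210506348', 'http://sis.ku.ac.th/private/a.html'))  # -> False
```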

In [None]:
list_robot

['https://registrar.ku.ac.th',
 'https://u2t.ku.ac.th',
 'https://www.sdgs.ku.ac.th',
 'https://greencampus.ku.ac.th',
 'http://sis.ku.ac.th',
 'http://kukr2.lib.ku.ac.th',
 'http://kupresident13.ku.ac.th',
 'http://sp.ku.ac.th',
 'http://www.sdgs.ku.ac.th',
 'https://qdrm.ku.ac.th',
 'http://registrar.ku.ac.th',
 'https://sis.ku.ac.th',
 'https://esdpsd.psd.kps.ku.ac.th',
 'https://kukr2.lib.ku.ac.th',
 'https://kukr.lib.ku.ac.th',
 'http://agkb.lib.ku.ac.th',
 'https://portal.lib.ku.ac.th',
 'http://eoffice.lib.ku.ac.th',
 'http://ag-ebook.lib.ku.ac.th',
 'http://intanin.lib.ku.ac.th',
 'http://jindamanee.lib.ku.ac.th',
 'http://kucc28.ku.ac.th',
 'https://kucc28.ku.ac.th',
 'http://test.sp.ku.ac.th',
 'https://login.portal.lib.ku.ac.th',
 'http://go.sa.ku.ac.th',
 'http://www.sa.ku.ac.th',
 'http://media.eto.ku.ac.th',
 'http://www.km.eto.ku.ac.th',
 'http://www.ku-ept.human.ku.ac.th',
 'https://sp.ku.ac.th',
 'https://sa.ku.ac.th',
 'http://oku.sa.ku.ac.th',
 'https://cloudbox.ku.a

In [None]:
for lr in list_robot:
    raw_html = get_page(lr + '/robots.txt')
    result = urlparse(lr + '/robots.txt')
    filepath = 'html/' + result.netloc + result.path[:result.path.rfind('/')]
    filename = result.path[result.path.rfind('/')+1:] 

    abs_file = filepath + '/' + filename
    print(abs_file)
    os.makedirs(filepath, 0o755, exist_ok=True)
    f = codecs.open(abs_file, 'w', 'utf-8')
    f.write(raw_html)
    f.close()

Success!
html/registrar.ku.ac.th/robots.txt
Success!
html/u2t.ku.ac.th/robots.txt
Success!
html/www.sdgs.ku.ac.th/robots.txt
Success!
html/greencampus.ku.ac.th/robots.txt
Success!
html/sis.ku.ac.th/robots.txt
Success!
html/kukr2.lib.ku.ac.th/robots.txt
Success!
html/kupresident13.ku.ac.th/robots.txt
Success!
html/sp.ku.ac.th/robots.txt
Success!
html/www.sdgs.ku.ac.th/robots.txt
Success!
html/qdrm.ku.ac.th/robots.txt
Success!
html/registrar.ku.ac.th/robots.txt
Success!
html/sis.ku.ac.th/robots.txt
Success!
html/esdpsd.psd.kps.ku.ac.th/robots.txt
Success!
html/kukr2.lib.ku.ac.th/robots.txt
Success!
html/kukr.lib.ku.ac.th/robots.txt
Success!
html/agkb.lib.ku.ac.th/robots.txt
Success!
html/portal.lib.ku.ac.th/robots.txt
Success!
html/eoffice.lib.ku.ac.th/robots.txt
Success!
html/ag-ebook.lib.ku.ac.th/robots.txt
Success!
html/intanin.lib.ku.ac.th/robots.txt
Success!
html/jindamanee.lib.ku.ac.th/robots.txt
Success!
html/kucc28.ku.ac.th/robots.txt
Success!
html/kucc28.ku.ac.th/robots.txt
Succ

In [None]:
with open('./list_robots.txt', 'w') as writefile:
    for d in list_robot:
        writefile.write(d+'\n')

with open('./list_sitemap.txt', 'w') as writefile:
    for d in list_sitemap:
        writefile.write(d+'\n')

In [None]:
!rm -rf html