# Web scraping intro

## Scraping rules
- Check a site's terms and conditions before you scrape it. The data belongs to the site, which likely has rules governing its use.
- Be nice - A computer sends web requests much faster than a human user can. Space out your requests so that you don't hammer the site's server.
- Scrapers break - Sites change their layout all the time. When that happens, be prepared to rewrite your code.
- Web pages are inconsistent - Some manual cleanup is often needed even after you've gotten your data.
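
The "be nice" rule can be sketched as a small helper that pauses between requests. This is a minimal sketch, not a definitive implementation: `fetch_politely`, the one-second default for `delay_seconds`, and the injectable `fetch` and `sleep` parameters are all assumptions made for illustration, and a suitable delay depends on the site.

```python
import time
from urllib.request import urlopen


def fetch_politely(urls, delay_seconds=1.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` defaults to reading the page with urlopen; it is injectable
    (an assumption of this sketch) so the helper can be tried offline.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url) as response:
                return response.read()

    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

Calling `fetch_politely(list_of_urls, delay_seconds=2)` would then leave roughly two seconds between requests.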

In [None]:
from urllib.request import urlopen
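from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import sys

In [None]:
def getTitle(url):
    """Return the first <h1> inside <body>, or None if it cannot be retrieved."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTPError: the server returned an error status (404, 500, ...).
        # URLError: the server could not be reached at all.
        print(e)
        return None
    try:
        soup = BeautifulSoup(html.read(), 'html.parser')  # or 'lxml'
        title = soup.body.h1
    except AttributeError:
        # soup.body is None when the page has no <body> tag.
        return None
    return title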

In [None]:
title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

### Select by class

In [None]:
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
soup = BeautifulSoup(html, "html.parser")
# find_all is the current name; findAll is a legacy alias
nameList = soup.find_all("span", {"class": "green"})

for name in nameList:
    print(name.get_text())
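
To try class selection without hitting the network, the same idea can be run against an inline snippet. This is a small sketch: `sample_html` is a made-up stand-in for the real page, and the `class_` keyword and `select` call are shown as alternative spellings of the same filter.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page.
sample_html = """
<p>
  <span class="green">Anna Pavlovna</span> spoke to
  <span class="red">"Well, Prince"</span> and
  <span class="green">Prince Vasili</span>.
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) avoids the reserved word `class`.
green_names = [span.get_text() for span in soup.find_all("span", class_="green")]

# The same selection written as a CSS selector.
green_names_css = [span.get_text() for span in soup.select("span.green")]

print(green_names)
```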

### Select by attribute

In [None]:
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
soup = BeautifulSoup(html, "html.parser")
allText = soup.find_all(id="text")  # find_all is the current name for findAll
print(allText[0].get_text())

### Find children

In [None]:
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children of the table, not every descendant
for child in soup.find("table", {"id": "giftList"}).children:
    print(child)
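
`.children` yields only direct children, whereas `.descendants` walks the whole subtree. The difference can be seen on a small inline table; `sample_html` here is an invented example, not the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table, written on one line so no whitespace text nodes appear.
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Vegetable <b>Basket</b></td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Keep only tags, skipping the text nodes that both iterators also yield.
child_names = [node.name for node in table.children if isinstance(node, Tag)]
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```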

### Find siblings

In [None]:
# Reuses the `soup` object built from page3.html in the previous cell
for sibling in soup.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
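
Note that `next_siblings` starts after the tag it is called on, so the first row (typically the header) is not included. A short offline sketch, using an invented `sample_html`:

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row followed by two data rows.
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Basket</td></tr>"
    "<tr><td>Parrot</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

first_row = soup.find("table", {"id": "giftList"}).tr
following_rows = [row.get_text() for row in first_row.next_siblings]

print(following_rows)
```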

### Find parents

In [None]:
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
# From the image, go up to its table cell, then back one cell to the price
image = soup.find("img", {"src": "../img/gifts/img1.jpg"})
print(image.parent.previous_sibling.get_text())
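
The navigation chain can be followed step by step on a small inline row. This is a sketch under the assumption that the image sits in the cell right after the price cell; `sample_html` is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name, price, then a cell holding the image.
sample_html = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

image = soup.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = image.parent                  # the <td> that wraps the image
price_cell = image_cell.previous_sibling   # the <td> immediately before it
price = price_cell.get_text()

print(price)
```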

### Regex

In [None]:
import re
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
# A raw string keeps the backslashes intact; "/" needs no escaping in a Python regex
images = soup.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
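
The pattern can be checked offline against a handful of invented `src` values to see which ones it keeps. `sample_html` is a made-up example, not the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical images: only the two gift pictures should match.
sample_html = """
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/other/img3.jpg"/>
"""
soup = BeautifulSoup(sample_html, "html.parser")

pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
matches = [img["src"] for img in soup.find_all("img", {"src": pattern})]

print(matches)
```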

### Lambda expressions

In [None]:
html = urlopen("http://www.pythonscraping.com/pages/page2.html")
soup = BeautifulSoup(html, "html.parser")
# Keep every tag that has exactly two attributes
tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
for tag in tags:
    print(tag)
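
The function passed to the search is called once per tag and should return `True` for tags to keep. A small offline sketch with an invented `sample_html`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <p> carries exactly two attributes.
sample_html = (
    "<div>"
    '<p id="a" class="x">two attributes</p>'
    '<p id="b">one attribute</p>'
    "<span>no attributes</span>"
    "</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

two_attribute_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
matched_ids = [tag["id"] for tag in two_attribute_tags]

print(matched_ids)
```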