# Web Scraping 101 | Introductory Tutorial

## Reference: [Web Scraping 101 with Python](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/) by [Greg Reda](https://twitter.com/gjreda)

*Wed Sep  7 15:34:17 IST 2016*

> For this example, we're going to use the Chicago Reader's Best of 2011 list. Why? Because I think it's a great example of terrible data presentation on the web. Go ahead and browse it for a bit.

> To start, we need to take a look at the HTML that displays these categories. If you're in Chrome or Firefox, highlight "Readers' Poll Winners", right-click, and select Inspect Element.

> This opens up the browser's Developer Tools (in Firefox, you might now have to click the HTML button on the right side of the developer pane to fully show it). Now we'll be able to see the page layout. The browser has brought us directly to the piece of HTML that's used to display the "Readers' Poll Winners" `<dt>` element.

> This seems to be the area of code where there's going to be some consistency in how the category links are displayed. See that `<dl class="boccat">` just above our "Readers' Poll Winners" line? If you mouse over that line in your browser's dev tools, you'll notice that it highlights the entire section of category links we want. And every category link is within a `<dd>` element. Perfect! Let's get all of them.

In [1]:
from bs4 import BeautifulSoup

In [2]:
from urllib2 import urlopen

In [3]:
BASE_URL = "https://www.chicagoreader.com"

In [4]:
def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    boccat = soup.find("dl", "boccat")
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links

> Hopefully this code is relatively easy to follow, but if not, here's what we're doing:

- Loading the urlopen function from the urllib2 library into our local namespace.
- Loading the BeautifulSoup class from the bs4 (BeautifulSoup4) library into our local namespace.
- Setting a variable named `BASE_URL` to `https://www.chicagoreader.com` (the author's original code uses `http://`). We do this because the links used throughout the site are relative, meaning they do not include the base domain. To store our links properly, we need to concatenate the base domain with each relative link.
- Defining a function named `get_category_links`.
  - The function requires a parameter, `section_url`. In this example, we are going to use the **Food and Drink** section of the BOC list, but we could use a different section URL, for instance the **City Life** section's URL. We're able to create just one generic function because each section page is structured the same way.
  - Open the `section_url` and read its contents into the `html` object.
  - Create an object called `soup`, an instance of the BeautifulSoup class. It is initialized with the `html` object and parsed with `lxml`.
  - In our BeautifulSoup instance (which we called `soup`), find the `<dl>` element with a class of "boccat" and store that section in a variable called boccat.
  - This is a list comprehension. For every `<dd>` element found within our boccat variable, we're getting the `href` of its `<a>` element (our category links) and concatenating on our `BASE_URL` to make it a complete link. All of these links are being stored in a list called `category_links`. You could also write this line with a for loop, but I prefer a list comprehension here because of its simplicity.
  - Finally, our function returns the `category_links` list that we created on the previous line.
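
The list comprehension described above can also be written as an explicit for loop. The sketch below compares the two on a hard-coded list of relative links; the `hrefs` values are made-up placeholders rather than real Chicago Reader paths, and no page is fetched. It also shows `urljoin` from the standard library as one possible alternative to plain string concatenation. The import path is the Python 3 one (`urllib.parse`); in the Python 2 used elsewhere in this notebook the function lives in the `urlparse` module.

```python
from urllib.parse import urljoin  # Python 3; in Python 2: from urlparse import urljoin

BASE_URL = "https://www.chicagoreader.com"

# Hypothetical relative links, standing in for the dd.a["href"] values.
hrefs = ["/chicago/example-category-one", "/chicago/example-category-two"]

# List comprehension, as in get_category_links.
links_comprehension = [BASE_URL + href for href in hrefs]

# Equivalent for loop: start with an empty list and append one link at a time.
links_loop = []
for href in hrefs:
    links_loop.append(BASE_URL + href)

# urljoin resolves each relative link against the base URL.
links_urljoin = [urljoin(BASE_URL, href) for href in hrefs]

# All three approaches build the same list for these inputs.
print(links_comprehension == links_loop == links_urljoin)
```

One difference worth knowing: if an `href` is already a full URL, `urljoin` returns it unchanged, while string concatenation would prepend the base a second time.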

> Now that we have our list of category links, we'd better start going through them to get our winners and runners-up. Let's figure out which elements contain the parts we care about.

> If we look at the Best Chef category, we can see that our category is in `<h1 class="headline">`. Shortly after that, we find our winner and runners-up stored in `<h2 class="boc1">` and `<h2 class="boc2">`, respectively.

> Let's write some code to get all of them.

In [5]:
def get_category_winner(category_url):
    html = urlopen(category_url).read()
    soup = BeautifulSoup(html, "lxml")
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.findAll("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.findAll("h2", "boc2")]
    return {"category": category, "category_url": category_url, "winner": winner, "runners_up": runners_up}
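
To see what `get_category_winner` returns without touching the network, the same lookups can be run on a small inline HTML string. This is only a sketch: the markup, the names in it, and the `parse_category` helper are invented to mimic the `headline`, `boc1`, and `boc2` classes described above. It uses the built-in `html.parser` in place of `lxml` so that no extra parser is needed, and `find_all`, which is the current spelling of `findAll` in bs4.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure described above.
SAMPLE_HTML = """
<h1 class="headline">Best Chef</h1>
<h2 class="boc1">Example Winner</h2>
<h2 class="boc2">Example Runner-Up A</h2>
<h2 class="boc2">Example Runner-Up B</h2>
"""


def parse_category(html, category_url):
    """Hypothetical variant of get_category_winner that takes HTML directly."""
    soup = BeautifulSoup(html, "html.parser")
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.find_all("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.find_all("h2", "boc2")]
    return {"category": category,
            "category_url": category_url,
            "winner": winner,
            "runners_up": runners_up}


result = parse_category(SAMPLE_HTML, "http://example.com/best-chef")
print(result["category"])
print(result["winner"])
print(result["runners_up"])
```

Note that `.string` returns `None` when a heading mixes text with nested tags; on pages like that, `get_text()` is one alternative to consider.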

## DRY - Don't Repeat Yourself
> All the text in this notebook, including the paragraphs below, is copied directly from the article linked in the Reference above.

Lines two and three of our second function (the `urlopen` call and the `BeautifulSoup` call) mirror the same lines in our first function.

Imagine a scenario where we want to change the parser we're passing into our BeautifulSoup instance (in this case, lxml). With the way we've currently written our code, we'd have to make that change in two places. Now imagine you've written many more functions to scrape this data - maybe one to get addresses and another to get paragraphs of text about the winner - you've likely repeated those same two lines of code in these functions and you now have to remember to make changes in four different places. That's not ideal.

**A good principle in writing code is DRY - Don't Repeat Yourself**. 

When you notice that you've written the same lines of code a couple times throughout your script, it's probably a good idea to step back and think if there's a better way to structure that piece.

In [1]:
def make_soup(url):
    """Process a URL and return a BeautifulSoup instance.

    We can call this function in our other functions
    instead of duplicating our logic.
    """
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

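Then, in each of the function definitions above, replace the duplicated `urlopen` and `BeautifulSoup` lines with a single call, passing that function's own URL parameter (`section_url` or `category_url`) as `url`:
<code>
soup = make_soup(url)
</code>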

In [24]:
# Therefore ... the final code is ...
from bs4 import BeautifulSoup
from urllib2 import urlopen
from time import sleep  # needed for the sleep(1) call below

BASE_URL = "https://www.chicagoreader.com"

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_category_links(section_url):
    soup = make_soup(section_url)
    boccat = soup.find("dl", "boccat")
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links

def get_category_winner(category_url):
    soup = make_soup(category_url)
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.findAll("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.findAll("h2", "boc2")]
    return {"category": category, "category_url": category_url, "winner": winner, "runners_up": runners_up}

if __name__ == '__main__':
    food_n_drink = ("http://www.chicagoreader.com/chicago/"
                    "best-of-chicago-2011-food-drink/BestOf?oid=4106228")
    
    categories = get_category_links(food_n_drink)

    data = [] # a list to store our dictionaries
    for category in categories:
        winner = get_category_winner(category)
        data.append(winner)
        sleep(1) # be nice

    print data

URLError: <urlopen error [Errno 111] Connection refused>
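
The `URLError` above is raised by `urlopen` when the connection cannot be made, and it stops the whole run. One possible way to handle it is sketched below, using Python 3's `urllib.request` and `urllib.error` rather than the Python 2 `urllib2` used in this notebook. The function name `fetch_html` and its `opener` parameter are hypothetical; the parameter exists only so the function can be exercised with a stand-in instead of a real network call.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails."""
    try:
        return opener(url).read()
    except URLError as err:
        # Report the failure and let the caller decide what to do next.
        print("Could not fetch %s: %s" % (url, err.reason))
        return None


# Stand-ins for urlopen so the sketch runs offline.
class FakeResponse(object):
    def read(self):
        return b"<html></html>"


def working_opener(url):
    return FakeResponse()


def refusing_opener(url):
    raise URLError("Connection refused")


print(fetch_html("http://example.com/ok", opener=working_opener))
print(fetch_html("http://example.com/down", opener=refusing_opener))
```

A caller such as `make_soup` could then check for `None` and skip that category instead of crashing partway through the loop.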

In [None]:
# This is the author's code
from bs4 import BeautifulSoup
from urllib2 import urlopen
from time import sleep # be nice

BASE_URL = "http://www.chicagoreader.com"

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_category_links(section_url):
    soup = make_soup(section_url)
    boccat = soup.find("dl", "boccat")
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links

def get_category_winner(category_url):
    soup = make_soup(category_url)
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.findAll("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.findAll("h2", "boc2")]
    return {"category": category,
            "category_url": category_url,
            "winner": winner,
            "runners_up": runners_up}

if __name__ == '__main__':
    food_n_drink = ("http://www.chicagoreader.com/chicago/"
                    "best-of-chicago-2011-food-drink/BestOf?oid=4106228")
    
    categories = get_category_links(food_n_drink)

    data = [] # a list to store our dictionaries
    for category in categories:
        winner = get_category_winner(category)
        data.append(winner)
        sleep(1) # be nice

    print data

## My Experiment

I saved the code above, unchanged, as a gist:

https://gist.github.com/asinode/16ee953f91df4817f6c73e461b80eb9a

Next, I will try to extract the code from this page.

In [4]:
from bs4 import BeautifulSoup
from urllib2 import urlopen

base_url = "https://gist.github.com/asinode/16ee953f91df4817f6c73e461b80eb9a"

html = urlopen(base_url).read()

In [5]:
type(html)

str

In [6]:
len(html)

40888

In [9]:
html[50:150]

'ead prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# object: http://ogp.me/ns/object# article:'

In [10]:
htmllist = html.split()

In [11]:
type(htmllist)

list

In [12]:
len(htmllist)

2575

In [13]:
htmllist[:10]

['<!DOCTYPE',
 'html>',
 '<html',
 'lang="en"',
 'class="">',
 '<head',
 'prefix="og:',
 'http://ogp.me/ns#',
 'fb:',
 'http://ogp.me/ns/fb#']

In [14]:
soup = BeautifulSoup(html, "lxml")

In [17]:
tdata = soup.find("div", "js-gist-file-update-container js-task-list-container file-box")

In [18]:
type(tdata)

bs4.element.Tag

In [21]:
# print tdata # looking at the output, this is not what I want.
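
Printing a `Tag` shows the element together with all of its markup, which may be why the output above was not useful. If the goal is only the text inside the element, `get_text()` is one option. The sketch below runs on invented markup with a hypothetical `file-box` class and a `pre` element; GitHub's real gist markup is different and may change, so the lookups would need adjusting against the live page. It uses the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Invented markup: a container div wrapping code lines in span elements.
SAMPLE_HTML = """
<div class="file-box">
  <pre><span>from bs4 import BeautifulSoup</span>
<span>BASE_URL = "http://www.chicagoreader.com"</span></pre>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find("div", "file-box")

# str(box) would include every tag; get_text() keeps only the text nodes.
code_text = box.find("pre").get_text()
print(code_text)
```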