# Web Scraping with BeautifulSoup

### What is HTML

**HyperText Markup Language**: it is NOT a programming language. As its name indicates, it is a *markup language*, used to tell the browser how to lay out content.

HTML is based on tags, which indicate what should be done with the content.

The most basic tag is `<html>`. Everything inside of it is HTML. **Important:** Tags delimit a scope, so we use opening and closing tags, as in the example:

```html
<html>
...
</html>
```

Inside an `html` tag, we can use other tags. Usually, an HTML page has two other scopes defined by tags: `head` and `body`. The content of the web page goes into the body. The head contains metadata about the page, like its title (it sometimes also holds JavaScript, CSS, etc.).

When scraping, we usually focus on what is inside `<body> </body>`.

```html
<html>
   <head>
        ...
   </head>
   <body>
       ...
   </body>
</html>
```

There are many possible tags with different roles. For example, `<p>` delimits a paragraph, `<br>` breaks a line, and `<a>` represents a link.

**THIS CODE:**

```html
<html>
   <head>
   </head>

   <body>
      <p>
         Paragraph
         <a href="https://www.github.com">Link to GitHub</a>
      </p>
      <p>
         See the link below:
         <a href="https://www.twitter.com">Twitter</a> </p>
   </body>
</html>
```


**BECOMES THIS:**


<html>
   <head>
   </head>

   <body>
      <p>
         Paragraph
         <a href="https://www.github.com">Link to GitHub</a>
      </p>
      <p>
         See the link below:
         <a href="https://www.twitter.com">Twitter</a> </p>
   </body>
</html>


In the above example, the `<a>` tag has an `href` attribute, which determines where the link goes.

Elements (tags) may have multiple attributes that define their layout or behavior. The attribute `class`, for example, names the CSS class(es) whose style rules apply to the element. The attribute `id` is sometimes used to uniquely identify a tag.

### Let's scrape

First, we need to import the modules we are using: `requests` and `BeautifulSoup`.

In [None]:
import requests
from bs4 import BeautifulSoup

The page we will use is an ugly page created for experimenting with BeautifulSoup.

https://pythonprogramming.net/parsememcparseface/ 

Let's take a look at the page.

And now we will retrieve the page using `requests` (the same way we did for accessing REST services).


In [43]:
result = requests.get("https://pythonprogramming.net/parsememcparseface/")
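Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one hedged way to do that; the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising if the request fails.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Usage (needs network access, so it is left commented out):
# content = fetch_html("https://pythonprogramming.net/parsememcparseface/")
```

Failing early on a bad status code avoids scraping an error page by mistake.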

We use the response content to get ready to scrape.

Then we instantiate our BeautifulSoup object from that content.

In [None]:
content = result.content
soup = BeautifulSoup(content, "html.parser")
soup
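The same constructor also accepts a plain string, which is handy for experimenting offline. A minimal sketch, assuming a made-up HTML snippet (`sample_html` is not part of the page above):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body><p class="intro">Hello</p></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.string)  # text inside <title>
print(sample_soup.p["class"])    # class is returned as a list
```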

If we want, we can make it easier to read with `prettify()`

In [None]:
print(soup.prettify())

We can use multiple attributes/methods depending on what we want to scrape/get!

First, let's get what is inside of the tag title:

```html
<html>
<head>
    ...
    <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
    <title>Python Programming Tutorials</title>
```

In [46]:
print(soup.title)

<title>Python Programming Tutorials</title>


We can deal with soup.title (which is a Tag object), getting the name of the tag, its content, parent, etc...

In [48]:
# returns the name of the tag
print(soup.title.name)

title


In [49]:
# returns the content of the Tag as a NavigableString object
print(soup.title.string)

Python Programming Tutorials


In [50]:
# who is the parent of the tag in the nested structure?
# <head> ... 
#     <title>...
print(soup.title.parent.name)

head


In [51]:
# with a different tag <p>
# getting the first <p>
print(soup.p)

<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>


In [52]:
# getting an attribute of the tag: 
# <p class="introduction">
soup.p['class']

['introduction']
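`class` comes back as a list because an element can carry several classes at once, while single-valued attributes come back as plain strings. A small sketch on a made-up snippet (assumed for illustration) also shows that `.get()` returns `None` for a missing attribute, whereas square brackets would raise a `KeyError`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken verbatim from the practice page
snippet = '<p class="grey-text right" id="motto">Programming is a superpower.</p>'
tag = BeautifulSoup(snippet, "html.parser").p

classes = tag["class"]       # multi-valued attribute -> list of strings
tag_id = tag["id"]           # single-valued attribute -> string
missing = tag.get("style")   # absent attribute -> None (no exception)

print(classes, tag_id, missing)
print(tag.attrs)             # all attributes as a dict
```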

We can also retrieve all the items with the same Tag, and get them as an iterable object

In [58]:
all_paragraphs = soup.find_all('p')
print(all_paragraphs[8])

<p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>


We can even iterate :-) 

BTW, let's look at an interesting difference: text vs. string
- string: returns the NavigableString object inside the tag. If the tag has more than one child (for example, text mixed with nested tags), it returns `None`.
- text: returns the text inside the tag and its subtags, concatenated.

In [59]:
for para in all_paragraphs:
    print("Original:\t", para)
    print(".string: \t", para.string)
    print(".text:   \t", (para.text))
    print("----")

Original:	 <p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>
.string: 	 None
.text:   	 Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
----
Original:	 <p>The following table gives some general information for the following <code>programming languages</code>:</p>
.string: 	 None
.text:   	 The following table gives some general information for the following programming languages:
----
Original:	 <p>I think it's clear that, on a scale of 1-10, python is:</p>
.string: 	 I think it's clear that, on a scale of 1-10, python is:
.text:   	 I think it's clear that, on a scale of 1-10, python is:
----
Orig
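The difference is easier to see on tiny inputs. A minimal sketch with three made-up paragraphs (assumptions for illustration, not content from the page):

```python
from bs4 import BeautifulSoup

# Three hypothetical paragraphs with different internal structure
markup = "<p>plain text</p><p>text with <b>bold</b> inside</p><p><b>only bold</b></p>"
plain, mixed, wrapped = BeautifulSoup(markup, "html.parser").find_all("p")

# One text child: .string and .text agree
print(plain.string, "|", plain.text)
# Several children (text + tag + text): .string is None, .text concatenates
print(mixed.string, "|", mixed.text)
# A single child tag: .string looks inside that child
print(wrapped.string, "|", wrapped.text)
```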

Playing with all the links. The following shows how to get the attributes of an `<a>` tag.

In [60]:
links = soup.find_all('a') # Returns a ResultSet, a list with BS4 flavor
for url in links:
    print(url)
    print(url.text)
    print(url.get('href'))
    print(url['href']) # two ways of getting the same attribute
    print(url.get('class'))
    print("---")

<a class="brand-logo" href="/"><img class="img-responsive" src="/static/images/mainlogowhitethick.jpg" style="width:50px; height;50px; margin-top:5px"/></a>

/
/
['brand-logo']
---
<a class="button-collapse" data-activates="navsidebar" href="#"><i class="mdi-navigation-menu"></i></a>

#
#
['button-collapse']
---
<a href="/">Home</a>
Home
/
/
None
---
<a class="tooltipped" data-delay="50" data-position="bottom" data-tooltip="sudo apt-get upgrade" href="/+=1/">+=1</a>
+=1
/+=1/
/+=1/
['tooltipped']
---
<a href="/support/">Support the Content</a>
Support the Content
/support/
/support/
None
---
<a href="https://goo.gl/7zgAVQ" target="blank"><!--<i class="material-icons">question_answer</i>-->Community</a>
Community
https://goo.gl/7zgAVQ
https://goo.gl/7zgAVQ
None
---
<a href="/login/">Log in</a>
Log in
/login/
/login/
None
---
<a href="/register/">Sign up</a>
Sign up
/register/
/register/
None
---
<a href="/">Home</a>
Home
/
/
None
---
<a class="tooltipped" data-delay="50" data-position="
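Many of the `href` values above are relative (`/`, `/support/`). To follow them, they have to be resolved against the URL of the page. A sketch using the standard library's `urljoin`; the links are a trimmed copy of the output above, and `base_url` is the page we requested:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://pythonprogramming.net/parsememcparseface/"

# Trimmed copy of a few links from the output above
links_html = """
<a href="/support/">Support the Content</a>
<a href="https://goo.gl/7zgAVQ" target="blank">Community</a>
<a href="/login/">Log in</a>
"""

link_soup = BeautifulSoup(links_html, "html.parser")

# urljoin leaves absolute URLs untouched and resolves relative ones
absolute_urls = [urljoin(base_url, a["href"]) for a in link_soup.find_all("a")]
for url in absolute_urls:
    print(url)
```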

### Retrieving Tags with specific attributes

I want to get all divs with specific attributes:
```html
<!-- main content -->
<div class="container" style="max-width:1500px; min-height:100%">
<!--Notification:-->
```

See the code below:



In [61]:
divs = soup.find_all('div', attrs={"class": "container", "style":"max-width:1500px; min-height:100%"})
len(divs)

1
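There are other ways to express the same kind of filter. The sketch below compares `attrs=` with the `class_` keyword and a CSS selector via `select()`. All three are part of Beautiful Soup 4, but the markup is a made-up snippet assumed for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup
page = """
<div class="container" style="max-width:1500px; min-height:100%">main</div>
<div class="container">other</div>
<div class="footer">footer</div>
"""
demo = BeautifulSoup(page, "html.parser")

by_attrs = demo.find_all("div", attrs={"class": "container"})
by_keyword = demo.find_all("div", class_="container")  # class_ because class is reserved
by_css = demo.select("div.container")                  # CSS selector syntax

# Adding the style attribute narrows the match to the first div only
exact = demo.find_all(
    "div",
    attrs={"class": "container", "style": "max-width:1500px; min-height:100%"},
)

print(len(by_attrs), len(by_keyword), len(by_css), len(exact))
```

Note that the `style` value is compared as an exact string, so a difference in spacing would prevent the match.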

Now we will get the div with `attrs={"class": "body"}` and navigate through its children.



In [None]:
body = soup.find('div', attrs={"class": "body"})

for child in body.children:
    print(child)
    print("----")

In [63]:
# This is the number of children of that div
len(list(body.children))

28

In [64]:
# And this is the number of descendants, going down to deeper levels of the 
# nested structure
len(list(body.descendants))

176
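The gap between the two counts comes from nesting and from text nodes, which are counted alongside tags. A minimal sketch on made-up markup (an assumption for illustration) makes both effects visible:

```python
from bs4 import BeautifulSoup, Tag

compact = BeautifulSoup("<div><p>One <b>two</b></p><p>Three</p></div>", "html.parser").div
spaced = BeautifulSoup("<div>\n<p>One</p>\n</div>", "html.parser").div

n_children = len(list(compact.children))        # only the two <p> tags
n_descendants = len(list(compact.descendants))  # tags and text nodes at every depth

# Whitespace between tags is a text node, so it counts as a child too
n_spaced_children = len(list(spaced.children))
tag_children = [c for c in spaced.children if isinstance(c, Tag)]

print(n_children, n_descendants, n_spaced_children, len(tag_children))
```

Filtering with `isinstance(c, Tag)` is one way to ignore the whitespace strings when only elements matter.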

Checking all the links that are inside of that div body, and highlighting the target of each of them

In [65]:
for item in body.find_all('a'):
    print(item.string + " is a link to " + item.get('href'))
    if item.has_attr('target'):
        print("target is: " + item.get("target") + "\n")

Beautiful Soup 4 is a link to https://www.crummy.com/software/BeautifulSoup/bs4/doc/
target is: blank

sitemap is a link to /sitemap.xml
target is: blank



Navigating a little bit more. We will retrieve all images under the div

In [66]:
print(body.find_all('img'))

[<img alt="omg batman" class="responsive-img" src="https://s-media-cache-ak0.pinimg.com/originals/e8/2a/ff/e82aff2876b080675449d0cef7685321.jpg">
</img>]


We can even create conditionals based on the presence of an attribute in an element

In [None]:
for a in soup.find_all("a"):
    if (a.has_attr("data-delay")):
        print("YES: " + a.text)
    else:
        print("NO: " + a.text)
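The same check can be pushed into the search itself: passing `True` as an attribute value matches any tag that has that attribute, whatever its value. A sketch on a trimmed, simplified copy of links from the page (the variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Simplified copy of a few navigation links
nav = """
<a href="/">Home</a>
<a class="tooltipped" data-delay="50" href="/+=1/">+=1</a>
<a href="/support/">Support the Content</a>
"""
nav_soup = BeautifulSoup(nav, "html.parser")

# True means "the attribute is present", regardless of its value
with_delay = nav_soup.find_all("a", attrs={"data-delay": True})
without_delay = [a for a in nav_soup.find_all("a") if not a.has_attr("data-delay")]

print([a.text for a in with_delay])
print([a.text for a in without_delay])
```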


## This is BeautifulSoup

**LAST BUT NOT LEAST!** Check the terms of use of the websites before you use a scraper!

BeautifulSoup documentation is rich and enlightening. Use as needed: https://beautiful-soup-4.readthedocs.io/