# Web Scraping

### But first... what is HTML?

**HyperText Markup Language**: it is NOT a programming language. As its name suggests, it is a *markup language*, used to tell the browser how to lay out content.

HTML is based on tags, which indicate what should be done with the content.

The most basic tag is `<html>`. Everything inside it is HTML. **Important:** tags delimit a scope, so we use an opening tag and a closing tag, as in this example:

``` 
<html>
...
</html>
```

Inside an `<html>` tag, we can use other tags. Usually, an HTML page has two other scopes defined by tags: `head` and `body`. The content of the web page goes into the body. The head contains metadata about the page, like its title (it sometimes also holds JavaScript, CSS, etc.).

When scraping, we usually focus on what is inside `<body> </body>`.


<html>
   <head>
   </head>

   <body>
   </body>
</html>


There are many possible tags with different roles. For example, `<p>` delimits a paragraph, `<br>` breaks a line, and `<a>` represents a link.

<html>
   <head>
   </head>

   <body>
      <p>
         Paragraph
         <a href="https://www.github.com">Link to GitHub</a>
      </p>
      <p>
         See the link below:
         <a href="https://www.twitter.com">Twitter</a>
      </p>
   </body>
</html>

In the above example, the `<a>` tag has an `href` attribute, which determines where the link goes.

Elements (tags) may have multiple attributes that define their layout/behavior. The `class` attribute, for example, names the CSS class(es) whose styles apply to the element. The `id` attribute is sometimes used to uniquely identify a tag.
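
To make the idea of tags and attributes concrete, here is a minimal sketch that lists every opening tag with its attributes. It uses Python's standard-library `html.parser` purely for illustration; the `TagLister` class and the `sample` markup (including the `class="intro"` and `id="first"` values) are made up for this example.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect a (tag name, attributes dict) pair for every opening tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


# Hypothetical sample page, modelled on the examples above
sample = """
<html>
  <body>
    <p class="intro" id="first">
      <a href="https://www.github.com">Link to GitHub</a>
    </p>
  </body>
</html>
"""

lister = TagLister()
lister.feed(sample)
for tag, attributes in lister.tags:
    print(tag, attributes)
```

Each attribute shows up as a name/value pair attached to its tag, which is the information we will be reading when scraping.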

### Let's scrape

First, we need to import the modules we are using: `requests` and `BeautifulSoup`.

In [110]:
import requests
from bs4 import BeautifulSoup

Let's get a page... using requests

In [111]:
result = requests.get("https://pythonprogramming.net/parsememcparseface/")
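
A request can fail or hang, so it may be worth checking the response before parsing it. The sketch below wraps the call in a hypothetical helper, `fetch_html`; the 10-second timeout is an arbitrary choice, not something the page requires.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Example call (needs network access):
# content = fetch_html("https://pythonprogramming.net/parsememcparseface/")
```

With this in place, a broken URL or a server error surfaces as an exception instead of silently handing an error page to the parser.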

We take the content of the response to get ready to scrape.

Then we instantiate our BeautifulSoup object from that content.

In [112]:
content = result.content
soup = BeautifulSoup(content, "html.parser")
soup

<html>
<head>
<!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Python Programming Tutorials</title>
<meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
<link href="/static/favicon.ico" rel="shortcut icon"/>
<link href="/static/css/materialize.min.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification">
<link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
<!-- Compiled and minified CSS -->
<!-- Compiled and minified JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
<script src="https://cdnjs.cloudflare.com/aj

If we want, we can make it easier to read...

In [113]:
print(soup.prettify())

<html>
 <head>
  <!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Python Programming Tutorials
  </title>
  <meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
  <link href="/static/favicon.ico" rel="shortcut icon"/>
  <link href="/static/css/materialize.min.css" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
  <meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification">
   <link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
   <!-- Compiled and minified CSS -->
   <!-- Compiled and minified JavaScript -->
   <script src="https://code.jquery.com/jquery-2.1.4.min.js">
   </script>
   <
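
BeautifulSoup does not need a live page: it also accepts a plain string. This small sketch, with a made-up `html` string and the name `soup_demo` chosen so it does not clash with `soup`, is a handy way to experiment offline.

```python
from bs4 import BeautifulSoup

# Made-up page, small enough to read in full
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup_demo = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag per line, indented by depth
print(soup_demo.prettify())
```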

We can use different attributes and methods depending on what we want to scrape!

In [115]:
print(soup.title)

<title>Python Programming Tutorials</title>


We can work with soup.title (which is a Tag object), getting its name, content, parent, etc...

In [36]:
print(soup.title.name)

title


In [16]:
print(soup.title.string)

Python Programming Tutorials


In [17]:
print(soup.title.parent.name)

head


In [28]:
print(soup.p)

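<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>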


In [19]:
soup.p['class']

['introduction']
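
Notice that `class` came back as a list. Here is a minimal sketch of the different ways to read attributes from a Tag, using a made-up one-line snippet (the `id="motto"` value is hypothetical).

```python
from bs4 import BeautifulSoup

# Made-up snippet; the id value is hypothetical
html = '<p class="grey-text right" id="motto">Programming is a superpower.</p>'
p = BeautifulSoup(html, "html.parser").p

print(p["class"])      # class can hold several values, so it comes back as a list
print(p["id"])         # most other attributes come back as a plain string
print(p.get("style"))  # .get() returns None when the attribute is missing
print(p.attrs)         # every attribute of the tag, as a dict
```

Indexing with `p["style"]` would raise a `KeyError` here, so `.get()` is the safer choice when an attribute may be absent.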

In [20]:
print(soup.find_all('p'))

[<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>, <p>The following table gives some general information for the following <code>programming languages</code>:</p>, <p>I think it's clear that, on a scale of 1-10, python is:</p>, <p>Javascript (dynamic data) test:</p>, <p class="jstest" id="yesnojs">y u bad tho?</p>, <p>Whᶐt hαppéns now¿</p>, <p><a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a></p>, <p class="grey-text text-lighten-4">Contact: Harrison@pythonprogramming.net.</p>, <p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>]


In [25]:
for para in soup.find_all('p'):
    print(para.string)  # None when the tag has more than one child
    print(para.text)    # text of the tag and all of its descendants
    print("----")

None
Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
----
None
The following table gives some general information for the following programming languages:
----
I think it's clear that, on a scale of 1-10, python is:
I think it's clear that, on a scale of 1-10, python is:
----
Javascript (dynamic data) test:
Javascript (dynamic data) test:
----
y u bad tho?
y u bad tho?
----
Whᶐt hαppéns now¿
Whᶐt hαppéns now¿
----
sitemap
sitemap
----
Contact: Harrison@pythonprogramming.net.
Contact: Harrison@pythonprogramming.net.
----
Programming is a superpower.
Programming is a superpower.
----
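
The `None` lines above come from `.string`, which only works when a tag has a single child. A small sketch on made-up markup, assuming nothing beyond bs4 itself:

```python
from bs4 import BeautifulSoup

# Made-up markup: one paragraph with mixed content, one plain, one nested
html = (
    "<p>Oh, hello! This is a <span>wonderful</span> page.</p>"
    "<p>Plain text only</p>"
    "<p><a><strong>sitemap</strong></a></p>"
)
mixed, plain, nested = BeautifulSoup(html, "html.parser").find_all("p")

print(mixed.string)      # None: text + <span> + text means several children
print(mixed.get_text())  # all the text inside the tag, joined together
print(plain.string)      # a single text child, so .string returns it
print(nested.string)     # a single chain of tags is followed down to the text
```

When in doubt, `.text` (or `get_text()`) is the one that always gives back a string.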


In [27]:
for url in soup.find_all('a'):
    print(url.text)
    print(url.get('href'))
    print("---")


/
---

#
---
Home
/
---
+=1
/+=1/
---
Support the Content
/support/
---
Community
https://goo.gl/7zgAVQ
---
Log in
/login/
---
Sign up
/register/
---
Home
/
---
+=1
/+=1/
---
Support the Content
/support/
---
Community
https://goo.gl/7zgAVQ
---
Log in
/login/
---
Sign up
/register/
---
Beautiful Soup 4
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
---
sitemap
/sitemap.xml
---
Support this Website!
/support-donate/
---
Consulting and Contracting
/consulting/
---
Facebook
https://www.facebook.com/pythonprogramming.net/
---
Twitter
https://twitter.com/sentdex
---
Instagram
https://instagram.com/sentdex
---
Terms and Conditions
/about/tos/
---
Privacy Policy
/about/privacy-policy/
---
Programming is a superpower.
https://xkcd.com/353/
---
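
Many of the links above are relative (`/support/`, `/login/`). To follow them we need absolute URLs. One possible approach, sketched here on a made-up snippet, is `urljoin` from the standard library, with the page address as the base.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://pythonprogramming.net/parsememcparseface/"

# Made-up snippet: a relative link, an absolute link and an <a> without href
html = (
    '<a href="/support/">Support the Content</a>'
    '<a href="https://xkcd.com/353/">Programming is a superpower.</a>'
    "<a>no href here</a>"
)
links = BeautifulSoup(html, "html.parser").find_all("a")

# Skip anchors without an href; urljoin leaves absolute URLs untouched
absolute = [urljoin(base_url, a["href"]) for a in links if a.has_attr("href")]
print(absolute)
```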


In [61]:
body = soup.find('div', attrs={"class": "body"})
print(body.prettify())


<div class="body">
 <p class="introduction">
  Oh, hello! This is a
  <span style="font-size:115%">
   wonderful
  </span>
  page meant to let you practice web scraping. This page was originally created to help people work with the
  <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank">
   <strong>
    Beautiful Soup 4
   </strong>
  </a>
  library.
 </p>
 <p>
  The following table gives some general information for the following
  <code>
   programming languages
  </code>
  :
 </p>
 <ul>
  <li>
   Python
  </li>
  <li>
   Pascal
  </li>
  <li>
   Lisp
  </li>
  <li>
   D#
  </li>
  <li>
   Cobol
  </li>
  <li>
   Fortran
  </li>
  <li>
   Haskell
  </li>
 </ul>
 <table style="width:100%">
  <tr>
   <th>
    Program Name
   </th>
   <th>
    Internet Points
   </th>
   <th>
    Kittens?
   </th>
  </tr>
  <tr>
   <td>
    Python
   </td>
   <td>
    932914021
   </td>
   <td>
    Definitely
   </td>
  </tr>
  <tr>
   <td>
    Pascal
   </td>
   <td>
    532
 

In [59]:
print(footer.prettify())

AttributeError: 'NoneType' object has no attribute 'prettify'
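
This error is what we get when calling a method on `None`. The cell that defined `footer` is not shown, so the cause is an assumption, but `find()` returns `None` when nothing matches, which would explain it. A defensive sketch on made-up markup (the `footer` class is hypothetical and deliberately absent):

```python
from bs4 import BeautifulSoup

# Made-up markup with a "body" div and no "footer" div
html = '<div class="body"><p>content</p></div>'
soup_demo = BeautifulSoup(html, "html.parser")

body_div = soup_demo.find("div", class_="body")
footer_div = soup_demo.find("div", class_="footer")  # no match, so None

for name, tag in [("body", body_div), ("footer", footer_div)]:
    if tag is None:
        print(name, "not found")
    else:
        print(tag.prettify())
```

The `class_` keyword (with the trailing underscore) is a shorthand for `attrs={"class": ...}`.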

In [62]:
for child in body.children:
    print(child)
    print("---")



---
<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>
---


---
<p>The following table gives some general information for the following <code>programming languages</code>:</p>
---


---
<ul>
<li>Python</li>
<li>Pascal</li>
<li>Lisp</li>
<li>D#</li>
<li>Cobol</li>
<li>Fortran</li>
<li>Haskell</li>
</ul>
---


---
<table style="width:100%">
<tr>
<th>Program Name</th>
<th>Internet Points</th>
<th>Kittens?</th>
</tr>
<tr>
<td>Python</td>
<td>932914021</td>
<td>Definitely</td>
</tr>
<tr>
<td>Pascal</td>
<td>532</td>
<td>Unlikely</td>
</tr>
<tr>
<td>Lisp</td>
<td>1522</td>
<td>Uncertain</td>
</tr>
<tr>
<td>D#</td>
<td>12</td>
<td>Possibly</td>
</tr>
<tr>
<td>Cobol</td>
<td>3</td>
<td>No.</td>
</tr>
<tr>
<td>Fortran<
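
The table in this output is a good candidate for structured extraction. Below is a rough sketch that turns rows into dictionaries; it runs on a shortened copy of the table (only the first two data rows), and names like `records` are chosen just for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table shown above
html = """
<table>
<tr><th>Program Name</th><th>Internet Points</th><th>Kittens?</th></tr>
<tr><td>Python</td><td>932914021</td><td>Definitely</td></tr>
<tr><td>Pascal</td><td>532</td><td>Unlikely</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table

rows = []
for tr in table.find_all("tr"):
    # Header cells are <th>, data cells are <td>; accept both
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, data = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in data]
print(records)
```

All values come back as strings, so numeric columns such as "Internet Points" still need an explicit `int()` conversion.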

In [64]:
len(list(body.children))

28

In [71]:
body.find_all("a")


[<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a>,
 <a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a>]

In [103]:
for item in body.find_all("a"):
    print(item.string + " is a link to " + item.get('href'))
    if item.has_attr('target'):
        print("target is: " + item.get("target"))

Beautiful Soup 4 is a link to https://www.crummy.com/software/BeautifulSoup/bs4/doc/
target is: blank
sitemap is a link to /sitemap.xml
target is: blank


In [78]:
body.find_all("div")

[<div class="card hoverable">
 <div class="card-content">
 <div class="card-title"></div>
 <img alt="omg batman" class="responsive-img" src="https://s-media-cache-ak0.pinimg.com/originals/e8/2a/ff/e82aff2876b080675449d0cef7685321.jpg">
 </img></div>
 </div>, <div class="card-content">
 <div class="card-title"></div>
 <img alt="omg batman" class="responsive-img" src="https://s-media-cache-ak0.pinimg.com/originals/e8/2a/ff/e82aff2876b080675449d0cef7685321.jpg">
 </img></div>, <div class="card-title"></div>]

In [77]:
body.find_all("img")

[<img alt="omg batman" class="responsive-img" src="https://s-media-cache-ak0.pinimg.com/originals/e8/2a/ff/e82aff2876b080675449d0cef7685321.jpg">
 </img>]

In [81]:
len(list(body.descendants))

176
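
The counts for `.children` and `.descendants` look large because whitespace between tags is counted too, and `.descendants` also walks into nested tags. A tiny sketch on made-up markup shows the difference:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup; the "\n" between tags become text nodes of their own
html = "<div>\n<p>One <b>bold</b></p>\n<p>Two</p>\n</div>"
div = BeautifulSoup(html, "html.parser").div

children = list(div.children)        # direct children only
descendants = list(div.descendants)  # children, grandchildren, and so on

# Keep only real tags, dropping the whitespace strings
tag_children = [child for child in children if isinstance(child, Tag)]

print(len(children), len(descendants), len(tag_children))
```

Filtering with `isinstance(child, Tag)` is one way to avoid the blank entries seen when looping over `body.children`.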

In [96]:
print(body.a)

<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a>


In [109]:
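for a in soup.find_all("a"):
    if a.has_attr("data-delay"):
        print("YES: " + a.text)
        print(a["data-delay"])  # value of the attribute
    else:
        print("NO: " + a.text)
        print(a.attrs)          # dict with all attributes of the tag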


NO: 
{'href': '/', 'class': ['brand-logo']}
NO: 
{'href': '#', 'data-activates': 'navsidebar', 'class': ['button-collapse']}
NO: Home
{'href': '/'}
YES: +=1
50
NO: Support the Content
{'href': '/support/'}
NO: Community
{'href': 'https://goo.gl/7zgAVQ', 'target': 'blank'}
NO: Log in
{'href': '/login/'}
NO: Sign up
{'href': '/register/'}
NO: Home
{'href': '/'}
YES: +=1
50
NO: Support the Content
{'href': '/support/'}
NO: Community
{'href': 'https://goo.gl/7zgAVQ', 'target': 'blank'}
NO: Log in
{'href': '/login/'}
NO: Sign up
{'href': '/register/'}
NO: Beautiful Soup 4
{'href': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/', 'target': 'blank'}
NO: sitemap
{'href': '/sitemap.xml', 'target': 'blank'}
NO: Support this Website!
{'class': ['grey-text', 'text-lighten-3'], 'href': '/support-donate/'}
NO: Consulting and Contracting
{'class': ['grey-text', 'text-lighten-3'], 'href': '/consulting/'}
NO: Facebook
{'class': ['grey-text', 'text-lighten-3'], 'href': 'https://www.facebook.com
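
Instead of looping over every link and testing `has_attr`, we can also ask `find_all` to do the filtering. In this sketch the markup is made up (the `data-delay="50"` value is assumed from the output above), and passing `True` as the attribute value means "has this attribute, whatever its value".

```python
from bs4 import BeautifulSoup

# Made-up snippet; the data-delay value is an assumption
html = (
    '<a href="/">Home</a>'
    '<a href="/+=1/" data-delay="50">+=1</a>'
    '<a href="/support/">Support the Content</a>'
)
soup_demo = BeautifulSoup(html, "html.parser")

# True matches any tag that carries the attribute
delayed = soup_demo.find_all("a", attrs={"data-delay": True})

for a in delayed:
    print(a.text, a["data-delay"])
```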