#### Objective: To learn web data scraping using BeautifulSoup, requests and lxml

This notebook follows the book "Web Scraping with Python: Collecting Data from the Modern Web" by Ryan Mitchell (1st edition).

In [None]:
# Check installed packages
# Reference: https://stackoverflow.com/questions/12939975/how-to-list-all-installed-packages-and-their-versions-in-python
# Type `conda list` in the command prompt

# Install the packages used for web scraping into the environment this notebook runs in
import sys
!conda install --yes --prefix {sys.prefix} requests beautifulsoup4 lxml

In [1]:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
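The output above starts with `b'`, which means `read()` returned raw bytes rather than text. A minimal sketch of turning such bytes into a string follows. It uses a shortened, hard-coded copy of the response so that it runs without a network connection, and it assumes the page is UTF-8 encoded (a real page may declare a different encoding).

```python
# Shortened copy of the bytes printed above (hard-coded so no network is needed)
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is UTF-8 encoded; adjust if the server declares otherwise
text = raw.decode("utf-8")

print(type(raw).__name__, "->", type(text).__name__)
print(text)
```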


In [4]:
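# Parse the page with Beautiful Soup and print the first <h1> tag
from bs4 import BeautifulSoup as bs
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly so the result does not depend on which parsers are installed
bsObj = bs(html.read(), "html.parser")
print(bsObj.h1)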

<h1>An Interesting Title</h1>
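`bsObj.h1` is attribute-style shorthand for the first `<h1>` tag found anywhere in the document. The sketch below parses a small hard-coded copy of the page rather than fetching it. It shows a few equivalent ways to reach the same tag and what happens when a tag is missing. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hard-coded, shortened copy of page1.html so the example runs offline
sample_html = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# All three expressions reach the same first <h1> tag
via_shortcut = sample.h1
via_path = sample.html.body.h1
via_find = sample.find("h1")

# get_text() returns the tag's text without the surrounding markup
heading_text = via_shortcut.get_text()

# Asking for a tag that does not exist gives None rather than raising an error
missing = sample.h2

print(via_shortcut)
print(heading_text)
print(missing)
```

Because a missing tag comes back as `None`, chaining further attribute access on it (for example `sample.h2.get_text()`) would raise an `AttributeError`, so it is worth checking for `None` first when a page's structure is uncertain.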


In [1]:
import requests

In [2]:
url = "https://github.com/duttashi/"

In [3]:
page = requests.get(url)

In [4]:
page

<Response [200]>

In [5]:
# Status code 200 (OK) means the request succeeded; page.text holds the response body as a string
page.text

'\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://assets-cdn.github.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-Z0JAar9+DkI1NjGVdZr3GivARUgJtA0o2eHlTv7Ou2gshR5awWVf8QGsq11Ns9ZxQLEs+G5/SuARmvpOLMzulw==" rel="stylesheet" href="https://assets-cdn.github.com/assets/frameworks-95aff0b550d3fe338b645a4deebdcb1b.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-Q9OXBa6POM9XEiCfd809So5nkqF5fF9C0x6r+ENhto31Esta6/hG0meSbrJpZ9GiJ/Q7KP

In [6]:
# Stepping Through a Page with Beautiful Soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, "html.parser")

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://assets-cdn.github.com" rel="dns-prefetch"/>
  <link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-95aff0b550d3fe338b645a4deebdcb1b.css" integrity="sha512-Z0JAar9+DkI1NjGVdZr3GivARUgJtA0o2eHlTv7Ou2gshR5awWVf8QGsq11Ns9ZxQLEs+G5/SuARmvpOLMzulw==" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-8539ff6efb69113c34c860c76c5bc3d0.css" integrity="sha512-Q9OXBa6POM9XE

#### Finding Instances of a Tag
Beautiful Soup’s `find_all` method returns every instance of a given tag in a document as a list. To get only the first instance, use `find` instead.


In [8]:
soup.find_all('p')

[<p class="col-8 mx-auto">Sign up for your own profile on GitHub, the best place to host code, manage projects, and build software alongside 28 million developers.</p>,
 <p>Hide content and notifications from this user.</p>,
 <p>Contact Support about this user’s behavior.</p>,
 <p class="pinned-repo-desc text-gray text-small d-block mt-2 mb-3">exploratory, inferential and predictive data analysis</p>,
 <p class="mb-0 f6 text-gray">
 <span class="repo-language-color pinned-repo-meta" style="background-color:#198CE7;"></span>
             R
             <a class="pinned-repo-meta muted-link" href="/duttashi/learnr/stargazers">
 <svg aria-label="stars" class="octicon octicon-star" height="16" role="img" version="1.1" viewbox="0 0 14 16" width="14"><path d="M14 6l-4.9-.64L7 1 4.9 5.36 0 6l3.6 3.26L2.67 14 7 11.67 11.33 14l-.93-4.74L14 6z" fill-rule="evenodd"></path></svg>
               10
             </a>
 <a class="pinned-repo-meta muted-link" href="/duttashi/learnr/network">
 <svg aria
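Beautiful Soup also has a `find` method, which returns only the first match. The sketch below contrasts it with `find_all` on a small hard-coded document (an assumption for illustration, not the GitHub page above), so the results are short enough to read in full.

```python
from bs4 import BeautifulSoup

# Small hard-coded document (an assumption for illustration, not the GitHub page)
sample_html = """
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<div>No paragraph here.</div>
</body>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet with every match
all_paragraphs = sample.find_all("p")

# find returns only the first match, or None if there is none
first_paragraph = sample.find("p")
no_table = sample.find("table")

# limit caps the number of results find_all returns
only_one = sample.find_all("p", limit=1)

# get_text() strips the markup and leaves the text content
paragraph_texts = [p.get_text() for p in all_paragraphs]

print(len(all_paragraphs))
print(first_paragraph)
print(no_table)
print(paragraph_texts)
```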

We can also target specific classes and IDs by passing them to `find_all()` as keyword arguments. Because `class` is a reserved word in Python, Beautiful Soup uses the keyword argument `class_` for the class string; an ID is passed as `id`.

In [12]:
soup.find_all(class_="repo js-repo")

[<span class="repo js-repo" title="learnr">learnr</span>,
 <span class="repo js-repo" title="visualizer">visualizer</span>,
 <span class="repo js-repo" title="statsmodelling">statsmodelling</span>,
 <span class="repo js-repo" title="scrapers">scrapers</span>,
 <span class="repo js-repo" title="pipeliner">pipeliner</span>,
 <span class="repo js-repo" title="sparklyr">sparklyr</span>]
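The result is a list of tags, so the usual next step is to pull out the values of interest. The sketch below does this on hard-coded markup modeled on the output above. The wrapping `<div>` and its `repo-list` ID are made up for the example, to show an ID lookup alongside the class lookup.

```python
from bs4 import BeautifulSoup

# Hard-coded markup modeled on the output above; the id value is made up
sample_html = """
<div id="repo-list">
<span class="repo js-repo" title="learnr">learnr</span>
<span class="repo js-repo" title="visualizer">visualizer</span>
</div>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# A single class name matches any tag that carries that class
repos = sample.find_all(class_="repo")

# Tag attributes are read like dictionary keys; get_text() returns the text
titles = [tag["title"] for tag in repos]
names = [tag.get_text() for tag in repos]

# IDs are matched with the id keyword argument
container = sample.find(id="repo-list")

print(titles)
print(container.name)
```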

In [13]:
soup.find_all("span", class_="repo js-repo")

[<span class="repo js-repo" title="learnr">learnr</span>,
 <span class="repo js-repo" title="visualizer">visualizer</span>,
 <span class="repo js-repo" title="statsmodelling">statsmodelling</span>,
 <span class="repo js-repo" title="scrapers">scrapers</span>,
 <span class="repo js-repo" title="pipeliner">pipeliner</span>,
 <span class="repo js-repo" title="sparklyr">sparklyr</span>]
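One caveat with the calls above: when `class_` is given a string containing several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the same classes written in a different order do not match. A CSS selector passed to `select` does not depend on the order. The sketch below illustrates the difference on hard-coded markup in which the second tag's class order is deliberately swapped.

```python
from bs4 import BeautifulSoup

# Hard-coded markup: same two classes, written in a different order on each tag
sample_html = """
<span class="repo js-repo" title="learnr">learnr</span>
<span class="js-repo repo" title="scrapers">scrapers</span>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# A multi-class string is compared with the exact class attribute value
exact = sample.find_all("span", class_="repo js-repo")

# A CSS selector matches both classes regardless of their order
any_order = sample.select("span.repo.js-repo")

print([tag["title"] for tag in exact])
print([tag["title"] for tag in any_order])
```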