[![AnalyticsDojo](https://s3.amazonaws.com/analyticsdojo/logo/final-logo.png)](http://rpi.analyticsdojo.com)
<center><h1>Introduction to Python - Web Mining</h1></center>
<center><h3><a href = 'http://rpi.analyticsdojo.com'>rpi.analyticsdojo.com</a></h3></center>


## This tutorial is taken directly from the BeautifulSoup documentation.
[https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Before you begin
If you are running locally, make sure that `beautifulsoup4` is installed (the later cells also import `requests`).
`conda install beautifulsoup4`

In [None]:
# All HTML documents have structure. Here is a basic HTML page.

In [2]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [3]:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


### A Retrieved Beautiful Soup Object
- Can be navigated with dot notation to traverse down the hierarchy by *tag name* (for example, `soup.title`). To search by *class name*, *id*, or other attributes, use `find()` and `find_all()`.
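
As a rough sketch of the difference between the two approaches, the snippet below uses an abbreviated stand-in for the sample document (the `html` string is an assumption made for illustration, not the full page). Dot notation returns only the first matching tag, while `find_all()` with the `class_` keyword returns every tag carrying that class.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the sample document (illustrative only).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Dot notation: navigate by tag name; returns the first <p> only.
first_p = soup.p

# find_all with class_: returns every <p> whose class includes "story".
story_ps = soup.find_all("p", class_="story")

print(first_p["class"])
print(len(story_ps))
```

Note that `class_` has a trailing underscore because `class` is a reserved word in Python.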



In [19]:
soup


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [20]:
# Select the title tag.
soup.title
 


<title>The Dormouse's story</title>

In [5]:
#Name of the tag.
soup.title.name




'title'

In [21]:
# String contents inside the tag.
soup.title.string




"The Dormouse's story"

In [22]:
#Parent in hierarchy.
soup.title.parent.name




'head'

In [23]:
#List the first p tag.
soup.p




<p class="title"><b>The Dormouse's story</b></p>

In [24]:
#List the class of the first p tag.
soup.p['class']




['title']

In [25]:
# List the first a tag.
soup.a




<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [12]:
#List all a tags.
soup.find_all('a')



[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [13]:

# Find the single tag whose id is "link3".
soup.find(id="link3")


<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
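
A common next step after finding tags is pulling out their attributes and text. The following is a minimal sketch, assuming a shortened copy of the three sister links from the sample document (the `html` string is trimmed for illustration). It uses `get("href")` for the attribute and `get_text()` for the visible text.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample links (illustrative only).
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Build (text, href) pairs for every <a> tag in document order.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(text, href)
```

Using `a.get("href")` rather than `a["href"]` returns `None` instead of raising an error when a link has no `href` attribute.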

In [14]:
# The robots.txt file lists which crawlers are allowed and which paths they may access.
response = requests.get("https://en.wikipedia.org/robots.txt")
txt = response.text
print(txt)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /
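
Rather than reading these rules by eye, they can be checked programmatically. The sketch below uses `urllib.robotparser` from the Python standard library on a small excerpt of the rules shown above. The excerpt and the example page URL are assumptions chosen for illustration, and the parser is fed the text directly so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Small excerpt of the robots.txt rules shown above (illustrative only).
robots_txt = """\
User-agent: IsraBot
Disallow:

User-agent: UbiCrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical page URL used only to exercise the rules.
page = "https://en.wikipedia.org/wiki/Python"

# An empty Disallow means everything is allowed for that agent.
isra_ok = parser.can_fetch("IsraBot", page)
# "Disallow: /" blocks the whole site for that agent.
ubi_ok = parser.can_fetch("UbiCrawler", page)

print(isra_ok, ubi_ok)
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called.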


In [15]:
response = requests.get("https://www.rpi.edu")
txt = response.text
soup = BeautifulSoup(txt, 'html.parser')

print(soup.prettify())

<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Rensselaer is America's oldest technological research university, offering bachelor's, master's, and doctoral degrees in engineering, the sciences, information technology and web science, architecture, management, and the humanities, arts, and social sciences." http-equiv="description"/>
  <meta content="rensselaer, polytechnic, institute, rpi, university, graduate, engineering, architecture, science, humanities, business, research, biotechnology, nanotechnology, information technology, electronic arts, empac, troy, new york, new polytechnic, web science, big data" name="keywords"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
   <title>
    Rensselaer Polytechnic Institute (RPI) :: Architecture, Business, Engineering, Humanities, IT &amp; Web Science, Science
   </title>
   <link href="Assets/css/boots

In [16]:
soup.find_all('link')

[<link href="Assets/css/bootstrap2.min.css" rel="stylesheet" type="text/css"/>,
 <link href="Assets/js/_flexslider/flexslider.css" rel="stylesheet" type="text/css"/>,
 <link href="Assets/css/refresh_spring_2016-v2.css" rel="stylesheet" type="text/css"/>,
 <link href="Assets/js/_layerslider/css/layerslider.css" rel="stylesheet" type="text/css"/>,
 <link href="/dept/cct/apps/web-branding/v1/css/rpi.css" rel="stylesheet" type="text/css"/>,
 <link href="Assets/images/favicon.png" rel="shortcut icon" type="image/png">
 <link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,300,200,700" rel="stylesheet" type="text/css"/>
 <!--<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">-->
 <link href="Assets/fontawesome/css/fontawesome.css" rel="stylesheet" type="text/css"/>
 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
 <script src="Assets/js/bootstrap.min.js"></script>
 <!--
 <script
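
Several of the `href` values above are relative paths, so they cannot be requested as-is. One way to resolve them, sketched here with the standard library's `urllib.parse.urljoin`, is to join each one against the URL of the page it came from. The `hrefs` list is a hand-copied subset of the output above, used for illustration.

```python
from urllib.parse import urljoin

# URL of the page that was fetched.
base_url = "https://www.rpi.edu"

# Hand-copied subset of the href values from the output above.
hrefs = [
    "Assets/css/bootstrap2.min.css",
    "/dept/cct/apps/web-branding/v1/css/rpi.css",
    "https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,300,200,700",
]

# Relative paths are resolved against base_url; absolute URLs pass through unchanged.
absolute = [urljoin(base_url, href) for href in hrefs]

for url in absolute:
    print(url)
```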

In [None]:
# Experiment with your own website: replace the placeholder below with a full URL (including the https:// prefix).

response = requests.get("enter url here")
txt = response.text
soup = BeautifulSoup(txt, 'html.parser')

print(soup.prettify())
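
When experimenting with arbitrary sites, it can help to fail early on slow or broken responses. The helper below is one possible sketch, not part of the original tutorial: the name `fetch_soup` and the 10-second default timeout are assumptions made for illustration. It relies on the `timeout` argument of `requests.get` and on `raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper for illustration; raises requests.HTTPError
    on 4xx/5xx responses and requests.Timeout if the server is too slow.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Example usage (requires network access):
# soup = fetch_soup("https://www.rpi.edu")
# print(soup.title.string)
```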

For more info, see [https://github.com/stanfordjournalism/search-script-scrape](https://github.com/stanfordjournalism/search-script-scrape).