

# Web Scraping with Beautiful Soup

## Working with objects

In [135]:
#! pip3 install BeautifulSoup4



In [136]:
from bs4 import BeautifulSoup

### The BeautifulSoup object

In [200]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>PYTHON DATA SCIENCE HANDBOOK</b></p>

<p class='description'>
    This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.

    <p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
    
    <p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
    
    
<br><br>
This book:
        <br>
 <ul>
  <li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
  <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
  <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
  </ul>
<br><br>
What to do next:
<br>
<p>Supplemental material (code examples, figures, etc.) is available for download at <a href="http://github.com/jakevdp/PythonDataScienceHandbook/">http://github.com/jakevdp/PythonDataScienceHandbook/</a>. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.</p>

<p>If you feel your use of code examples falls outside fair use or the per‐ mission given above, feel free to contact us at permissions@oreilly.com</p>

<p class='description'>Most of the text on this page is copied from <a href="https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html"> here</a> to create a sample html page for teaching purposes </p>

'''

In [201]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)


<html><head><title>Best Books</title></head>
<body>
<p class="title"><b>PYTHON DATA SCIENCE HANDBOOK</b></p>
<p class="description">
    This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.

    <p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
<p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
<br/><br/>
This book:
        <br/>
<ul>
<li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Pyt

In [202]:
print(soup.prettify()[0:350])

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    PYTHON DATA SCIENCE HANDBOOK
   </b>
  </p>
  <p class="description">
   This review is for
   <a href="http://shop.oreilly.com/product/0636920034919.do">
    Python Data Science Handbook
   </a>
   by Jake VanderPlas; the full text of the book and the
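
The `BeautifulSoup` object stands for the whole parsed document rather than any single tag. The following is a minimal sketch on a tiny stand-in page (the `sample` string and variable names are illustrative, not part of the notebook) showing the special name bs4 gives the document object and how tags hang off it.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, much shorter than the sample page).
sample = "<html><head><title>Best Books</title></head><body><p>Hi</p></body></html>"
doc = BeautifulSoup(sample, "html.parser")

# The document object has no real HTML tag of its own,
# so bs4 reports the special name "[document]".
doc_name = doc.name

# Tags are reachable as attributes of the document object.
title_text = doc.title.string

print(doc_name)    # [document]
print(title_text)  # Best Books
```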


### Tag objects


#### Working with names

In [213]:
soup = BeautifulSoup('<b body="description">Product Description</b>', 'html.parser')

tag=soup.b
type(tag)

bs4.element.Tag

In [214]:
print(tag)

<b body="description">Product Description</b>


In [215]:
tag.name

'b'

In [216]:
tag.name = 'bestbooks'
tag

<bestbooks body="description">Product Description</bestbooks>

In [217]:
tag.name

'bestbooks'

#### Working with attributes

In [218]:
tag['body']

'description'

In [219]:
tag.attrs

{'body': 'description'}

In [220]:
tag['id'] = 3
tag.attrs

{'body': 'description', 'id': 3}

In [221]:
tag

<bestbooks body="description" id="3">Product Description</bestbooks>

In [222]:
del tag['body']
del tag['id']
tag

<bestbooks>Product Description</bestbooks>

In [223]:
tag.attrs

{}
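
Indexing a tag with a missing attribute raises `KeyError`, so `Tag.get` is often the safer choice when an attribute may be absent. A small sketch, assuming a made-up `<p>` snippet (the `snippet` string and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a multi-valued class attribute.
snippet = '<p class="title lead" id="intro">Hello</p>'
p = BeautifulSoup(snippet, "html.parser").p

missing = p.get("href")        # no such attribute, so get() returns None
fallback = p.get("href", "#")  # a default can be supplied, as with dict.get
classes = p["class"]           # class is multi-valued, so bs4 returns a list
has_id = p.has_attr("id")      # membership test without risking KeyError

print(missing, fallback, classes, has_id)
```

Because `class` may hold several values, Beautiful Soup returns it as a list even when only one class is present.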

#### Using tags to navigate a tree


In [224]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>PYTHON DATA SCIENCE HANDBOOK</b></p>

<p class='description'>
    This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.

    <p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
    
    <p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
    
    
<br><br>
This book:
        <br>
 <ul>
  <li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
  <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
  <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
  </ul>
<br><br>
What to do next:
<br>
<p>Supplemental material (code examples, figures, etc.) is available for download at <a href="http://github.com/jakevdp/PythonDataScienceHandbook/">http://github.com/jakevdp/PythonDataScienceHandbook/</a>. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.</p>

<p>If you feel your use of code examples falls outside fair use or the per‐ mission given above, feel free to contact us at permissions@oreilly.com</p>

<p class='description'>Most of the text on this page is copied from <a href="https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html"> here</a> to create a sample html page for teaching purposes </p>


'''
soup = BeautifulSoup(html_doc, 'html.parser')

In [155]:
soup.head

<head><title>Best Books</title></head>

In [156]:
soup.title

<title>Best Books</title>

In [157]:
soup.body.b

<b>PYTHON DATA SCIENCE HANDBOOK</b>

In [158]:
soup.body

<body>
<p class="title"><b>PYTHON DATA SCIENCE HANDBOOK</b></p>
<p class="description">
    This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.

    <p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
<p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
<br/><br/>
This book:
        <br/>
<ul>
<li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, as

In [159]:
soup.ul

<ul>
<li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
<li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
<li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
</ul>

In [160]:
soup.a

<a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>
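
Attribute-style navigation such as `soup.a` returns only the first matching tag. A short sketch, on a made-up snippet (the `snippet` string and variable names are assumptions for illustration), contrasting it with `find_all`, which returns every match:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links and a short list.
snippet = (
    '<body><p><a href="/one">One</a> and <a href="/two">Two</a></p>'
    "<ul><li>first</li><li>second</li></ul></body>"
)
doc = BeautifulSoup(snippet, "html.parser")

first_link = doc.a                # only the first <a> in document order
all_links = doc.find_all("a")     # every <a> in the document
list_children = doc.ul.contents   # direct children of <ul> as a list

print(first_link["href"])               # /one
print([a["href"] for a in all_links])   # ['/one', '/two']
print(len(list_children))               # 2
```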

# Exploring NavigableString Objects

In [161]:
from bs4 import BeautifulSoup

In [162]:
soup = BeautifulSoup('<b body="description">Product description</b>', 'html.parser')

## NavigableString objects

In [163]:
tag= soup.b
type(tag)

bs4.element.Tag

In [164]:
tag.name

'b'

In [165]:
tag.string

'Product description'

In [166]:
type(tag.string)

bs4.element.NavigableString

In [167]:
nav_string = tag.string
nav_string

'Product description'

In [168]:
nav_string.replace_with('Null')
tag.string

'Null'
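
A `NavigableString` behaves like a Python string but stays attached to the parse tree. The sketch below, on a made-up snippet (the markup and variable names are illustrative), shows converting it to a plain `str` and what `.string` returns when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> holds a <b> tag plus a trailing text node.
doc = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")

bold_text = doc.b.string             # NavigableString inside <b>
is_str = isinstance(bold_text, str)  # NavigableString subclasses str
plain = str(bold_text)               # plain str, detached from the tree
mixed = doc.p.string                 # None: <p> has two children
full = doc.p.get_text()              # concatenates all text under <p>

print(is_str, repr(plain), mixed, repr(full))
```

When a tag contains several children, `get_text()` is usually the more reliable way to read its text.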

## Working with NavigableString objects

In [265]:
html_doc = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>PYTHON DATA SCIENCE HANDBOOK</b></p>

<p class='description'>
    This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.

    <p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
    
    <p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
    
    
<br><br>
This book:
        <br>
 <ul>
  <li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
  <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
  <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
  </ul>
<br><br>
What to do next:
<br>
<p>Supplemental material (code examples, figures, etc.) is available for download at <a href="http://github.com/jakevdp/PythonDataScienceHandbook/">http://github.com/jakevdp/PythonDataScienceHandbook/</a>. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.</p>

<p>If you feel your use of code examples falls outside fair use or the per‐ mission given above, feel free to contact us at permissions@oreilly.com</p>

<p class='description'>Most of the text on this page is copied from <a href="https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html"> here</a> to create a sample html page for teaching purposes </p>


'''
soup = BeautifulSoup(html_doc, 'html.parser')

In [279]:
print(soup.text)


Best Books

PYTHON DATA SCIENCE HANDBOOK

    This review is for  Python Data Science Handbook  by Jake VanderPlas; the full text of the book and the content is available on GitHub  in the form of Jupyter notebooks.

    The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.
If you find this content useful, please consider supporting the work by buying the book!

This book:
        

Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.
Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science
Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter


What to do next:

Supplemental material 

In [280]:
list(soup.strings)

['\n',
 'Best Books',
 '\n',
 '\n',
 'PYTHON DATA SCIENCE HANDBOOK',
 '\n',
 '\n    This review is for  ',
 'Python Data Science Handbook',
 '  by Jake VanderPlas; the full text of the book and the content is available ',
 'on GitHub',
 '  in the form of Jupyter notebooks.\n\n    ',
 'The text is released under the ',
 'CC-BY-NC-ND license',
 ', and code is released under the ',
 'MIT license',
 '.',
 '\n',
 'If you find this content useful, please consider supporting the work by ',
 'buying the book',
 '!',
 '\n',
 '\nThis book:\n        ',
 '\n',
 '\n',
 'Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.',
 '\n',
 'Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science',
 '\n',
 'Focuses on

In [281]:
for string in soup.stripped_strings: print(repr(string))

'Best Books'
'PYTHON DATA SCIENCE HANDBOOK'
'This review is for'
'Python Data Science Handbook'
'by Jake VanderPlas; the full text of the book and the content is available'
'on GitHub'
'in the form of Jupyter notebooks.'
'The text is released under the'
'CC-BY-NC-ND license'
', and code is released under the'
'MIT license'
'.'
'If you find this content useful, please consider supporting the work by'
'buying the book'
'!'
'This book:'
'Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.'
'Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science'
'Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter'
'What to do next:'
'Supplem

In [269]:
title_tag = soup.title
title_tag

<title>Best Books</title>

In [270]:
title_tag.parent

<head><title>Best Books</title></head>

In [271]:
title_tag.string

'Best Books'

In [272]:
title_tag.string.parent

<title>Best Books</title>
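
Besides `.parent`, every element exposes `.parents`, which walks all the way up to the document object. A minimal sketch on a stand-in page (the markup and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Stand-in document containing only the head of the sample page.
doc = BeautifulSoup(
    "<html><head><title>Best Books</title></head></html>", "html.parser"
)

# .parents yields each ancestor in turn, ending with the document itself.
ancestors = [parent.name for parent in doc.title.parents]

# The document object sits at the top of the tree and has no parent.
top_parent = doc.parent

print(ancestors)   # ['head', 'html', '[document]']
print(top_parent)  # None
```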

### Saving to a file

In [277]:
# Write one stripped string per line; the with-block closes the file automatically.
with open('bookreco.txt', 'w', encoding='utf-8') as file:
    for string in soup.stripped_strings:
        file.write(string + '\n')

# Data Parsing

In [103]:
import pandas as pd

from bs4 import BeautifulSoup

import re

In [240]:
r = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>PYTHON DATA SCIENCE HANDBOOK</b></p>

<p class='description'>
    This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.

    <p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
    
    <p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
    
    
<br><br>
This book:
        <br>
 <ul>
  <li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
  <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
  <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
  </ul>
<br><br>
What to do next:
<br>
<p>Supplemental material (code examples, figures, etc.) is available for download at <a href="http://github.com/jakevdp/PythonDataScienceHandbook/">http://github.com/jakevdp/PythonDataScienceHandbook/</a>. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.</p>

<p>If you feel your use of code examples falls outside fair use or the per‐ mission given above, feel free to contact us at permissions@oreilly.com</p>

<p class='description'>Most of the text on this page is copied from <a id='link 3' href="https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html"> here</a> to create a sample html page for teaching purposes </p>


'''

In [241]:
soup = BeautifulSoup(r, 'html')
type(soup)

bs4.BeautifulSoup

# Parsing your data

In [242]:
print(soup.prettify()[0:100])

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    PY


# Getting data from a parse tree

In [243]:
text_only = soup.get_text()
print(text_only)

Best Books

PYTHON DATA SCIENCE HANDBOOK

    This review is for  Python Data Science Handbook  by Jake VanderPlas; the full text of the book and the content is available on GitHub  in the form of Jupyter notebooks.

    The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.
If you find this content useful, please consider supporting the work by buying the book!

This book:
        

Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.
Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science
Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter


What to do next:

Supplemental material (
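
The raw `get_text()` output keeps the whitespace of the original markup. As a sketch, assuming a made-up two-paragraph snippet, the `separator` and `strip` arguments of `get_text` can produce a cleaner result:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with padding around the text.
doc = BeautifulSoup("<div><p> One </p>\n<p> Two </p></div>", "html.parser")

raw = doc.get_text()                              # whitespace preserved
clean = doc.get_text(separator="\n", strip=True)  # stripped, one piece per line

print(repr(raw))
print(repr(clean))  # 'One\nTwo'
```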


# Searching and retrieving data from a parse tree


Retrieving tags by filtering with name arguments

In [244]:
soup.find_all("li")

[<li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>,
 <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>,
 <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>]

Retrieving tags by filtering with keyword arguments

In [245]:
soup.find_all(id="link 3")

[<a href="https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html" id="link 3"> here</a>]
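
Keyword filters work for most attributes, but `class` is a reserved word in Python, so Beautiful Soup accepts `class_` instead; an `attrs` dictionary is another option. A small sketch on a made-up snippet (the markup and variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the class names used in the sample page.
html = (
    '<p class="title">T</p>'
    '<p class="description">A</p>'
    '<p class="description">B</p>'
    '<a id="link 3" href="#">here</a>'
)
doc = BeautifulSoup(html, "html.parser")

by_class = doc.find_all("p", class_="description")  # class_ avoids the keyword
by_attrs = doc.find_all(attrs={"id": "link 3"})     # same idea as id="link 3"
first_only = doc.find("p", class_="description")    # find() returns one tag

print([p.get_text() for p in by_class])  # ['A', 'B']
print(len(by_attrs))                     # 1
print(first_only.get_text())             # A
```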

Retrieving tags by filtering with string arguments

In [246]:
soup.find_all('ul')

[<ul>
 <li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
 <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
 <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
 </ul>]

Retrieving tags by filtering with list objects

In [247]:
soup.find_all(['ul', 'b'])

[<b>PYTHON DATA SCIENCE HANDBOOK</b>, <ul>
 <li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.</li>
 <li>Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science</li>
 <li>Focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story in each chapter</li>
 </ul>]

Retrieving tags by filtering with regular expressions

In [248]:
pattern = re.compile('l')  # matches any tag name that contains the letter "l"
for tag in soup.find_all(pattern): print(tag.name)

html
title
ul
li
li
li
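
Beautiful Soup applies a compiled pattern with `search()`, so the pattern matches anywhere in the tag name, which is why `html` and `title` appear in the output. A sketch on a stand-in page (markup and variable names are illustrative) comparing this with a pattern anchored by `^`:

```python
import re
from bs4 import BeautifulSoup

# Stand-in page with a handful of tags.
doc = BeautifulSoup(
    "<html><head><title>t</title></head>"
    "<body><ul><li>x</li></ul></body></html>",
    "html.parser",
)

# Unanchored: any tag whose name contains "l".
contains_l = [tag.name for tag in doc.find_all(re.compile("l"))]

# Anchored: only tags whose name starts with "l".
starts_with_l = [tag.name for tag in doc.find_all(re.compile("^l"))]

print(contains_l)     # ['html', 'title', 'ul', 'li']
print(starts_with_l)  # ['li']
```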


Retrieving tags by filtering with a Boolean value

In [249]:
for tag in soup.find_all(True): print(tag.name)

html
head
title
body
p
b
p
a
a
p
a
a
p
a
br
br
br
ul
li
li
li
br
br
br
p
a
p
p
a


Retrieving weblinks by filtering with string objects

In [250]:
for link in soup.find_all('a'): print(link.get('href'))

http://shop.oreilly.com/product/0636920034919.do
https://github.com/jakevdp/PythonDataScienceHandbook
https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode
https://opensource.org/licenses/MIT
http://shop.oreilly.com/product/0636920034919.do
http://github.com/jakevdp/PythonDataScienceHandbook/
https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html


Retrieving strings by filtering with regular expressions

In [251]:
soup.find_all(string=re.compile("data"))

['Is for individuals who want to learn the language with the aim of using it as a tool for data-intensive and computational science']
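
The pattern `"data"` is case-sensitive, so strings such as "Python Data Science Handbook" are skipped. A minimal sketch, on a made-up two-paragraph snippet, of the same search with `re.IGNORECASE`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with "Data" in two different casings.
doc = BeautifulSoup(
    "<p>Python Data Science Handbook</p><p>for data-intensive science</p>",
    "html.parser",
)

case_sensitive = doc.find_all(string=re.compile("data"))
any_case = doc.find_all(string=re.compile("data", re.IGNORECASE))

print(len(case_sensitive))  # 1
print(len(any_case))        # 2
```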

# Getting data from the Web

In [253]:
from bs4 import BeautifulSoup
import urllib.request  # importing urllib alone does not guarantee urllib.request is loaded
import re

In [254]:
r = urllib.request.urlopen('http://ygenc.github.io/lectures/cit201/bookrecomendation.html').read()
soup = BeautifulSoup(r, "html")

In [256]:
soup

<html><head><title>Best Books</title></head>
<body>
<p class="title"><b>PYTHON DATA SCIENCE HANDBOOK</b></p>
<p class="description">
	This review is for  <a href="http://shop.oreilly.com/product/0636920034919.do">Python Data Science Handbook</a>  by Jake VanderPlas; the full text of the book and the content is available <a href="https://github.com/jakevdp/PythonDataScienceHandbook">on GitHub</a>  in the form of Jupyter notebooks.
	
	</p><p>The text is released under the <a href="https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode">CC-BY-NC-ND license</a>, and code is released under the <a href="https://opensource.org/licenses/MIT">MIT license</a>.</p>
<p>If you find this content useful, please consider supporting the work by <a href="http://shop.oreilly.com/product/0636920034919.do">buying the book</a>!</p>
<br/><br/>
This book:
        <br/>
<ul>
<li>Is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Pytho
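
Network requests can fail or hang, so it may help to wrap the download in a small function. The following is only a sketch: `fetch_soup` and its `timeout` default are hypothetical names chosen here, and only `urllib.error.URLError` (which includes HTTP errors) is handled.

```python
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url and parse it; return None if the request fails.

    Hypothetical helper, not part of Beautiful Soup or urllib.
    """
    try:
        # The with-block closes the connection once the body is read.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            html = response.read()
    except urllib.error.URLError as err:
        print(f"Could not fetch {url}: {err}")
        return None
    return BeautifulSoup(html, "html.parser")


# Example call (requires network access):
# soup = fetch_soup('http://ygenc.github.io/lectures/cit201/bookrecomendation.html')
```

Naming the parser explicitly (`"html.parser"` here) keeps the result independent of which optional parsers happen to be installed.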

# Web scraping: another example

In [None]:
# urllib ships with the Python standard library, so there is nothing to install.

In [193]:
from bs4 import BeautifulSoup
import urllib.request  # importing urllib alone does not guarantee urllib.request is loaded
import re

In [257]:
r = urllib.request.urlopen('https://analytics.usa.gov').read()
soup = BeautifulSoup(r, "html")
type(soup)

bs4.BeautifulSoup

Scraping a webpage and saving your results

In [258]:
print(soup.prettify()[:100])

<!DOCTYPE html>
<html lang="en">
 <!-- Initalize title and data source variables -->
 <head>
  <!--



In [259]:
for link in soup.find_all('a'): print(link.get('href'))

/
#explanation
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
data/
#top-pages-realtime
#top-pages-7-days
#top-pages-30-days
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
mailto:DAP@support.digitalgov.gov
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
mailto:DAP@support.digitalgov.gov
https://github.com/GSA/analytics.usa.gov/issues
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
http://www.gsa.gov/
https://www.digitalgov.gov/services/dap/
https://cloud.gov/
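
Several of the `href` values above are relative (`/`, `data/`) or in-page anchors (`#explanation`). One way to turn them into absolute URLs is `urllib.parse.urljoin`; the sketch below assumes the page's base URL is `https://analytics.usa.gov/` and uses a hand-copied subset of the links.

```python
from urllib.parse import urljoin

# Assumed base URL of the scraped page.
base = "https://analytics.usa.gov/"

# A few href values copied from the output above.
hrefs = [
    "/",
    "#explanation",
    "data/",
    "https://open.gsa.gov/api/dap/",
    "mailto:DAP@support.digitalgov.gov",
]

# urljoin resolves relative references and leaves absolute ones untouched.
absolute = [urljoin(base, href) for href in hrefs]

# Keep only web links, dropping schemes such as mailto:.
web_links = [url for url in absolute if url.startswith("http")]

for url in web_links:
    print(url)
```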


In [260]:
for link in soup.find_all('a', attrs={'href': re.compile("^http")}): print(link)

<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://open.gsa.gov/api/dap/" target="_blank">API</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<

In [261]:
with open('parsed_data.txt', 'w', encoding='utf-8') as file:
    for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
        soup_link = str(link)  # serialize the Tag back to its HTML string
        print(soup_link)
        file.write(soup_link + '\n')

<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://open.gsa.gov/api/dap/" target="_blank">API</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<
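
Saving the raw `<a>` markup works, but a tabular file is often easier to analyze later. As a sketch only, the links could be written as CSV rows with the standard `csv` module; the snippet, the column names, and the in-memory `io.StringIO` buffer (standing in for a real file) are assumptions made for illustration.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Hypothetical snippet with two absolute links and one in-page anchor.
html = (
    '<a href="https://analytics.usa.gov/data/">Data</a>'
    '<a href="#top-pages-realtime">Top pages</a>'
    '<a class="external-link" href="https://cloud.gov/">cloud.gov</a>'
)
doc = BeautifulSoup(html, "html.parser")

# StringIO stands in for open('links.csv', 'w', newline='') in this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["text", "href"])

# Keep only links whose href starts with "http".
for link in doc.find_all("a", href=re.compile("^http")):
    writer.writerow([link.get_text(strip=True), link["href"]])

csv_text = buffer.getvalue()
print(csv_text)
```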

In [199]:
%pwd

'/Users/yegingenc/Dropbox/Pace/Courses/CIT201/2019 Fall/_week10'