<img src="http://i67.tinypic.com/2jcbwcw.png" align="left"></img><br><br><br><br>


## Breakout Lecture 8: Web scraping & crawling

**Author List**: Alexander Fred Ojala

**Original Sources**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

**License**: Feel free to do whatever you want to with this code

## Pre-Setup

In [82]:
# stretch Jupyter coding blocks to fit screen
from IPython.display import display, HTML # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:75% !important; }</style>")) # if 100% it would fit the screen

In [83]:
# make it run on py2 and py3
from __future__ import division, print_function

## Webscraping intro

To scrape content from a website, we first need to download the page's HTML. This can be done with the Python library **requests** (using its `.get` method).

To extract specific information from the downloaded HTML, we use the scraping tool **BeautifulSoup4** (`import bs4`). BeautifulSoup works on a soup object, which we create from the HTML source code of the page.

In [84]:
import requests # The requests library is an HTTP library for getting content and posting etc.
import bs4 as bs # BeautifulSoup4 is a Python library for pulling data out of HTML and XML code.

# Scraping a simple website

In [85]:
source = requests.get("https://alexanderfo.github.io") # a GET request will download the HTML webpage.
print(source) # If <Response [200]> then the website has been downloaded successfully

<Response [200]>


**Different types of responses:**
Status codes starting with 2 generally indicate success. Status codes starting with 4 indicate a client error (e.g. 404 Not Found), and codes starting with 5 indicate a server error.
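
The status-code ranges can also be checked in code. The following is a minimal sketch that uses only the standard library, so it runs without a network connection; the helper name `describe_status` is hypothetical and is not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a coarse category for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirect'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'other'

# HTTPStatus maps a numeric code to its standard reason phrase
for code in (200, 404, 503):
    print(code, HTTPStatus(code).phrase, '->', describe_status(code))
```

With requests, the numeric code of a response is available as `source.status_code`, so the same helper could be applied to it.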

In [86]:
print(source.content) # This is the HTML content of the website, as you can see it's quite hard to decipher
print('\nRequests get source type:',type(source.content)) # bytes: .content is the raw, undecoded response body (use .text for a decoded str)

b'<!DOCTYPE html>\n\n<head>\n\t<title>Data-X: Simple Git website</title>\n\t<meta name="author" content="afo" />\n</head>\n\n<!-- Website starts here" -->\n\n<body style="background-color: white;">\n<br><br><br><br>\n\t<div>\n\n\t\t<center>\n\n\t\t\t<h1>Data-X Lecture 8<br></h1>\n\n\t\t\t<h2>Record Attendance at: </h2>\n\n\t\t\t<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>\n\n\n\t\t\t<br>\n\t\t\t<p> Here is a paragraph of random text </p>\n\n\t\t\t<p> Second paragraph of random text </p>\n\n\t\t</center>\n\t\n\t</div>\n\n\n</body>\n</html>'

Requests get source type: <class 'bytes'>


In [87]:
# Read source.content into BeautifulSoup
# BeautifulSoup parses the HTML code so that we can extract specific information from it

soup = bs.BeautifulSoup(source.content, features='html.parser') # we pass in the source and choose a parser

# The parser specifies how the markup is parsed; 'html.parser' is Python's built-in HTML parser
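
To give a rough idea of what a parser does with the markup, here is a small sketch built directly on `html.parser.HTMLParser` from the standard library instead of BeautifulSoup. The class name `LinkCollector` is hypothetical; BeautifulSoup wraps this kind of event-based parsing in a much more convenient tree interface.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>')
print(collector.links)
```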

In [88]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [89]:
print(soup) # This is the HTML code of the website, decoded as a beautiful soup object

<!DOCTYPE html>

<head>
<title>Data-X: Simple Git website</title>
<meta content="afo" name="author"/>
</head>
<!-- Website starts here" -->
<body style="background-color: white;">
<br><br><br><br>
<div>
<center>
<h1>Data-X Lecture 8<br/></h1>
<h2>Record Attendance at: </h2>
<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>
<br>
<p> Here is a paragraph of random text </p>
<p> Second paragraph of random text </p>
</br></center>
</div>
</br></br></br></br></body>



In [90]:
# Suppose we want to extract content that is shown on the website

print(soup.body) # This is the main content of the website, located within the <body> tag

<body style="background-color: white;">
<br><br><br><br>
<div>
<center>
<h1>Data-X Lecture 8<br/></h1>
<h2>Record Attendance at: </h2>
<h3><a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a></h3>
<br>
<p> Here is a paragraph of random text </p>
<p> Second paragraph of random text </p>
</br></center>
</div>
</br></br></br></br></body>


In [91]:
print(soup.title) # Title of the website
print(soup.find('title')) # same result using the .find() method

<title>Data-X: Simple Git website</title>
<title>Data-X: Simple Git website</title>


In [96]:
# If we want to extract specific text
(soup.find('p')) # will only return first <p> tag

<p> Here is a paragraph of random text </p>

In [97]:
# If we want to extract all <p> tags
print(soup.find_all('p')) # returns list of all <p> tags

[<p> Here is a paragraph of random text </p>, <p> Second paragraph of random text </p>]


In [103]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">, where url is the link target

print(soup.a)
print(type(soup.a))


<a href="https://goo.gl/77iPL2">https://goo.gl/77iPL2</a>
<class 'bs4.element.Tag'>


In [104]:
# if we only want the link
attendance_link = soup.a.get('href')
print(attendance_link) # then we have extracted the link
print(type(attendance_link))

https://goo.gl/77iPL2
<class 'str'>


## Scrape the Data-X website for the current syllabus

In [106]:
source = requests.get('https://data-x.blog/').content # get the source content

In [107]:
soup = bs.BeautifulSoup(source,'html.parser')

In [109]:
print(soup.prettify()) # .prettify() method makes the HTML code more readable

# as you can see this code is more difficult to read

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8">
   <meta content="width=device-width" name="viewport">
    <link href="http://gmpg.org/xfn/11" rel="profile">
     <link href="https://data-x.blog/xmlrpc.php" rel="pingback">
      <title>
       Data-X – A Public and Open Website for the Data-X Course at UC Berkeley.
      </title>
      <script src="https://r-login.wordpress.com/remote-login.php?action=js&amp;host=data-x.blog&amp;id=120928364&amp;t=1489091011&amp;back=https%3A%2F%2Fdata-x.blog%2F" type="text/javascript">
      </script>
      <script type="text/javascript">
       /* <![CDATA[ */
			if ( 'function' === typeof WPRemoteLogin ) {
				document.cookie = "wordpress_test_cookie=test; path=/";
				if ( document.cookie.match( /(;|^)\s*wordpress_test_cookie\=/ ) ) {
					WPRemoteLogin();
				}
			}
		/* ]]> */
      </script>
      <link href="//s2.wp.com" rel="dns-prefetch"/>
      <link href="//s0.wp.com" rel="dns-prefetch"/>
      <link href="//datax911.wordp

In [111]:
print(soup.title) # we are at the correct website

<title>Data-X – A Public and Open Website for the Data-X Course at UC Berkeley.</title>


In [116]:
for p in soup.find_all('p'):
    print(p.text)

Instructor: Ikhlaq Sidhu, IEOR, UC Berkeley (contact)
You can find all the resources and code samples for Data-X on this page.  This content for this course is drawn from open source tools and publicly available materials.
At UC Berkeley, this course is 3 units, limited to 55 students in Spring 2017
Thursdays: 5:00 to 7:59 pm in 3108 Etcheverry Hall
In Spring, 2017, the course is run as an experimental section.
Suggestions for Data-X project may be submitted here:
https://goo.gl/forms/h6cAxZS3Il2F0k4F2
Data-X Breadth Perspectives:
Ref B01: Why you’re not getting value from your data science
Syllabus: Click Here
Getting Started:

Course Materials:
Lectures: 
Course Introduction (download)
Remaining Lectures, Homework, and Notebooks to be posted here
Cookbook Code Samples:
Follow this link
Coding Questions: Try Stack Overflow and/or simply ask Google
CS Tools Reference Materials:
Ref CS01: Python Quick Reference Guide, Python Review from Data 8
and Python Data Structures for 2.7.
Ref CS0

In [134]:
# We want to find the syllabus, but we are at the root web page, which does not display it
# Get links from the data-x website
for url in soup.find_all('a'):
    link = url.get('href')
    if link and 'data-x.blog' in link: # skip <a> tags without an href (get returns None)
        print(link) # we see that the syllabus is located at the url https://data-x.blog/syllabus-data-x/
        if 'syllabus' in link:
            syllabus_url = link

https://data-x.blog/
https://data-x.blog/
https://data-x.blog/
https://data-x.blog/syllabus-data-x/
https://data-x.blog/breakouts/
https://data-x.blog/contact/
https://data-x.blog/contact/
https://data-x.blog/syllabus-data-x/


In [135]:
print(syllabus_url)

https://data-x.blog/syllabus-data-x/


In [139]:
# Open new connection
source = requests.get(syllabus_url).content
soup = bs.BeautifulSoup(source, 'html.parser')

print(soup.body.prettify()) # we can see that the table is stored within <td> tags

<body class="page-template page-template-page-templates page-template-full-width-page page-template-page-templatesfull-width-page-php page page-id-94 custom-background mp6 customizer-styles-applied not-multi-author display-header-text highlander-enabled highlander-light">
 <div class="hfeed site" id="page">
  <header class="site-header" id="masthead" role="banner">
   <div class="site-branding">
    <div class="site-image">
     <a class="header-image-link" href="https://data-x.blog/" rel="home" title="Data-X">
      <img alt="" height="154" src="https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg" width="912"/>
     </a>
    </div>
    <!-- .header-image -->
    <h1 class="site-title">
     <a href="https://data-x.blog/" rel="home" title="Data-X">
      Data-X
     </a>
    </h1>
    <h2 class="site-description">
     A Public and Open Website for the Data-X Course at UC Berkeley.
    </h2>
   </div>
   <!-- .site-branding -->
   <nav class="main-navigation" id="si

### Example when there is a difference in child strings

In [148]:
print(soup.find_all('td'))

[<td width="36"><strong>Lec #</strong></td>, <td width="131"><strong>Topic</strong></td>, <td width="72"><strong>Tools</strong></td>, <td width="108"><strong>Cookbook Examples</strong></td>, <td width="68"><strong>HW DUE</strong></td>, <td width="104"><strong>Lab<br/>
</strong></td>, <td width="36"><strong>1</strong></td>, <td width="131">Introduction: Overview of Frameworks for obtaining insights from data (Slides)</td>, <td width="72">Anaconda, Python</td>, <td width="108">Setting up Anaconda Environment</td>, <td width="68">HW 1 Assigned</td>, <td width="104"></td>, <td width="36"><strong>2</strong></td>, <td width="131">Notebook: Python Numpy Notebook</td>, <td width="72">Python, Numpy, Pandas, JSON formatted files</td>, <td width="108">Earthquake Data live query</td>, <td width="68">Bring 3 ideas to next class</td>, <td width="104">Form Teams</td>, <td width="36"><strong>3</strong></td>, <td width="131">Data signals in Tables.  Slides: Pandas Overview</td>, <td width="72">Pandas, 

In [244]:
# Find first url

first_url = soup.find('a')
print('first_url:', first_url,'\n')

print('Type:',type(first_url))
print('Text: ',first_url.text)
print('Attributes:',first_url.attrs)

first_url: <a class="header-image-link" href="https://data-x.blog/" rel="home" title="Data-X">
<img alt="" height="154" src="https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg" width="912"/>
</a> 

Type: <class 'bs4.element.Tag'>
Text:  


Attributes: {'href': 'https://data-x.blog/', 'title': 'Data-X', 'rel': ['home'], 'class': ['header-image-link']}


In [248]:
for url in soup.find_all('a'):
    print(url.get('href')) # .get() returns the value of a specific attribute of the tag

https://data-x.blog/
https://data-x.blog/
#content
/
https://data-x.blog/
https://data-x.blog/syllabus-data-x/
https://data-x.blog/breakouts/
https://data-x.blog/contact/
https://piazza.com/class/iy6i5458rva2ud
https://wordpress.com/?ref=footer_blog
#
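
Some of the extracted values (`#content`, `/`) are relative links. As a small sketch, they can be turned into absolute URLs with `urllib.parse.urljoin` from the standard library, assuming the page they were scraped from is the syllabus URL used in this section.

```python
from urllib.parse import urljoin

base = 'https://data-x.blog/syllabus-data-x/'  # the page the links were scraped from
hrefs = ['#content', '/', 'https://data-x.blog/contact/']

# urljoin resolves each href relative to the base URL;
# hrefs that are already absolute are returned unchanged
absolute = [urljoin(base, href) for href in hrefs]
for href, url in zip(hrefs, absolute):
    print(href, '->', url)
```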


In [249]:
# imagine we only want to extract http links and write them, one per line, to a file called data-x-urls.txt

# find all url links on the page
links = list()
for url in soup.find_all('a'):
    link = url.get('href')
    if link and 'http' in link: # skip <a> tags without an href (get returns None)
        print(link)
        links.append(link+'\n')

# create/open a txt file in write mode; this overwrites any existing file called data-x-urls.txt
with open('data-x-urls.txt', 'w') as file:
    file.writelines(links) # write all links to the file

https://data-x.blog/
https://data-x.blog/
https://data-x.blog/
https://data-x.blog/syllabus-data-x/
https://data-x.blog/breakouts/
https://data-x.blog/contact/
https://piazza.com/class/iy6i5458rva2ud
https://wordpress.com/?ref=footer_blog
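
The substring test `'http' in link` works for this page, but it would also match a relative link that merely contains the text "http". A stricter sketch checks the URL scheme with `urllib.parse.urlparse`; the helper name `is_http_link` and the last sample value are made up for illustration.

```python
from urllib.parse import urlparse

def is_http_link(href):
    """True only if the link's scheme is http or https (hypothetical helper)."""
    return urlparse(href).scheme in ('http', 'https')

hrefs = [
    'https://data-x.blog/',
    '#content',
    '/',
    'https://piazza.com/class/iy6i5458rva2ud',
    '/about?ref=http',  # made-up example: contains "http" but is a relative link
]

http_links = [href for href in hrefs if is_http_link(href)]
print(http_links)
```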


In [250]:
# Only find URLs in the navigation bar (tag nav)
nav=soup.nav
nav

<nav class="main-navigation" id="site-navigation" role="navigation">
<h1 class="menu-toggle">Menu</h1>
<div class="screen-reader-text skip-link"><a href="#content" title="Skip to content">Skip to content</a></div>
<div class="menu-primary-container"><ul class="menu" id="menu-primary"><li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-8" id="menu-item-8"><a href="/">Home</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-9" id="menu-item-9"><a href="https://data-x.blog/">About</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-94 current_page_item menu-item-102" id="menu-item-102"><a href="https://data-x.blog/syllabus-data-x/">Syllabus: Data-X</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-129" id="menu-item-129"><a href="https://data-x.blog/breakouts/">Breakouts</a></li>
<li class="menu-item menu-item-type-p

In [188]:
type(nav)

bs4.element.Tag

In [189]:
for url in nav.find_all('a'):
    print(url.get('href')) # only links in navigation bar

#content
/
https://data-x.blog/
https://data-x.blog/syllabus-data-x/
https://data-x.blog/breakouts/
https://data-x.blog/contact/


In [190]:
body = soup.body #get content within the <body> tag of the HTML code

In [191]:
# Print all body text
for paragraph in body.find_all('p'):
    print(paragraph.text) #might be two body tags. 
    # Just text from the body
    # Scraping for content

Instructor: Ikhlaq Sidhu
Department of Industrial Engineering & Operations Research
Offered Spring 2017, 3 Units, Lecture and Lab:
Prerequisite: Interested students should have working knowledge of Python in advance of the class, and also should have completed a fundamental probability or statistics course.
New Location: Now at 3108 Etcheverry Hall, Time: 5:10 pm-7:59
Not Barrows 60
Teaching Team:
Office Hours:
Tue 1:30-2:30pm at Etcheverry 4176 (Breakout Room B)
This course surveys a variety of key of concepts that are useful for designing and building applications that process data signals.  The course also introduces modern open source, computer programming tools and libraries that can be used to implement these applications.  These concepts include filtering, prediction, classification, decision-making, Markov chains, LTI systems, spectral analysis, and frameworks for learning from data.    After reviewing each concept, we explore implementing it within sample applications using Py

In [192]:
# Find all text within div sections, also child tags - or specific div section
for div in soup.find_all('div'):
    print(div.text) # a lot









Data-X
A Public and Open Website for the Data-X Course at UC Berkeley.


Menu
Skip to content
Home
About
Syllabus: Data-X
Breakouts
Contact
 






Syllabus: Data-X


Data-X: Data, Signals, and Systems
IEOR 190D/ 290-003
Spring 2017
Instructor: Ikhlaq Sidhu
Department of Industrial Engineering & Operations Research
Offered Spring 2017, 3 Units, Lecture and Lab:

Undergraduate Section: 190D, Class Number 33036
Graduate Section: Class Number INDENG 290 – 003, 33258

Prerequisite: Interested students should have working knowledge of Python in advance of the class, and also should have completed a fundamental probability or statistics course.
New Location: Now at 3108 Etcheverry Hall, Time: 5:10 pm-7:59
Not Barrows 60
Teaching Team:

GSI: Kevin Bozhe Li, kbl4ew@berkeley.edu
Visiting Scholar: Alexander Fred-Ojala, afo@berkeley.edu
Tensor Flow Lead: Nathan Cheng, ncheng@berkeley.edu
NLTK Lead: Sam Choi, sam.choi@berkeley.edu

Office Hours:
Tue 1:30-2:30pm at Etcheverry 4176 (Breakout

In [193]:
# prints both mobile and html version

for div in soup.find_all('div', class_='site-content'):
    print(div.text)






Syllabus: Data-X


Data-X: Data, Signals, and Systems
IEOR 190D/ 290-003
Spring 2017
Instructor: Ikhlaq Sidhu
Department of Industrial Engineering & Operations Research
Offered Spring 2017, 3 Units, Lecture and Lab:

Undergraduate Section: 190D, Class Number 33036
Graduate Section: Class Number INDENG 290 – 003, 33258

Prerequisite: Interested students should have working knowledge of Python in advance of the class, and also should have completed a fundamental probability or statistics course.
New Location: Now at 3108 Etcheverry Hall, Time: 5:10 pm-7:59
Not Barrows 60
Teaching Team:

GSI: Kevin Bozhe Li, kbl4ew@berkeley.edu
Visiting Scholar: Alexander Fred-Ojala, afo@berkeley.edu
Tensor Flow Lead: Nathan Cheng, ncheng@berkeley.edu
NLTK Lead: Sam Choi, sam.choi@berkeley.edu

Office Hours:
Tue 1:30-2:30pm at Etcheverry 4176 (Breakout Room B)
Description
This course surveys a variety of key of concepts that are useful for designing and building applications that process data signals

In [194]:
# Scrape only the table

table = soup.table
table = soup.find('table') # equivalent to soup.table: both return the first <table> tag

In [195]:
table # shows the html code of the table

<table width="518">
<tbody>
<tr>
<td width="36"><strong>Lec #</strong></td>
<td width="131"><strong>Topic</strong></td>
<td width="72"><strong>Tools</strong></td>
<td width="108"><strong>Cookbook Examples</strong></td>
<td width="68"><strong>HW DUE</strong></td>
<td width="104"><strong>Lab<br/>
</strong></td>
</tr>
<tr>
<td width="36"><strong>1</strong>
<p>Jan 19</p></td>
<td width="131">Introduction: Overview of Frameworks for obtaining insights from data (Slides)
<p>Slides: Python and Math/Probability Pre-requisites</p></td>
<td width="72">Anaconda, Python</td>
<td width="108">Setting up Anaconda Environment</td>
<td width="68">HW 1 Assigned</td>
<td width="104"></td>
</tr>
<tr>
<td width="36"><strong>2</strong>
<p>Jan 26</p></td>
<td width="131">Notebook: Python Numpy Notebook
<p>Slides: Data Structure Outline</p>
<p>Slides: Numpy Review</p></td>
<td width="72">Python, Numpy, Pandas, JSON formatted files</td>
<td width="108">Earthquake Data live query
<p>Example with JSON file</p></

In [196]:
table_rows = table.find_all('tr') #table.tr or table.find('tr') would only find one

In [197]:
for tr in table_rows:
    td = tr.find_all('td') # find all table data
    row = [i.text for i in td]
    print(row) # get all the table data

['Lec #', 'Topic', 'Tools', 'Cookbook Examples', 'HW DUE', 'Lab\n']
['1\nJan 19', 'Introduction: Overview of Frameworks for obtaining insights from data (Slides)\nSlides: Python and Math/Probability Pre-requisites', 'Anaconda, Python', 'Setting up Anaconda Environment', 'HW 1 Assigned', '']
['2\nJan 26', 'Notebook: Python Numpy Notebook\nSlides: Data Structure Outline\nSlides: Numpy Review', 'Python, Numpy, Pandas, JSON formatted files', 'Earthquake Data live query\nExample with JSON file', 'Bring 3 ideas to next class\nHW 1 Due', 'Form Teams']
['3\nFeb 2', 'Data signals in Tables.\xa0 Slides: Pandas Overview\nNotebook: Pandas Intro\nNotebook: Pandas and Stock Market', 'Pandas, Numpy, SciPy, Matplotlib', 'Stock market live download to Pandas DataFrame. Quant trading algorithm', 'HW 2 Due', 'Form Teams']
['4\nFeb 9', 'Scoring, Linear Prediction and Max Likelihood Prediction. Extending to multiple variables', 'Numpy, SciPy, Matplotlib', 'Code samples: 2 variable and multi-variable Linear
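
The cell texts still contain newlines (`\n`) and non-breaking spaces (`\xa0`). A minimal sketch of normalising them is to split on any whitespace and re-join with single spaces; the function name `clean_cell` is hypothetical, and the sample strings are taken from the rows printed by the cell.

```python
def clean_cell(text):
    """Collapse all runs of whitespace (including \\n and \\xa0) into single spaces."""
    return ' '.join(text.split())

row = ['3\nFeb 2',
       'Data signals in Tables.\xa0 Slides: Pandas Overview\nNotebook: Pandas Intro',
       'Lab\n']

cleaned = [clean_cell(cell) for cell in row]
print(cleaned)
```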

In [198]:
# pandas can grab HTML tables directly, which is often more convenient

import pandas as pd

# requires an HTML parser backend such as lxml, or bs4 together with html5lib:
#!conda install --yes html5lib
dfs = pd.read_html('https://data-x.blog/syllabus-data-x/', header=0)
# header=0 indicates that the first row is the header
# read_html finds all tables on the page and parses each one into a DataFrame (it returns a list)



In [199]:
print(type(dfs))
print(len(dfs))
df = dfs[0]

<class 'list'>
1


In [200]:
# Looks great, but we might want the dates to be the index and in datetime format
df.head()

Unnamed: 0,Lec #,Topic,Tools,Cookbook Examples,HW DUE,Lab
0,1 Jan 19,Introduction: Overview of Frameworks for obtai...,"Anaconda, Python",Setting up Anaconda Environment,HW 1 Assigned,
1,2 Jan 26,Notebook: Python Numpy Notebook Slides: Data S...,"Python, Numpy, Pandas, JSON formatted files",Earthquake Data live query Example with JSON file,Bring 3 ideas to next class HW 1 Due,Form Teams
2,3 Feb 2,Data signals in Tables. Slides: Pandas Overvie...,"Pandas, Numpy, SciPy, Matplotlib",Stock market live download to Pandas DataFrame...,HW 2 Due,Form Teams
3,4 Feb 9,"Scoring, Linear Prediction and Max Likelihood ...","Numpy, SciPy, Matplotlib",Code samples: 2 variable and multi-variable Li...,HW 3 Due,Validate and Adjust
4,5 Feb 16,"Classification. Logistic Regression, SVM, Nonl...","Scikit Learn, Seaborn Visualization",Classification example with Iris Database: Log...,HW4 Due,Low Tech Demo and Validation Results


In [201]:
df.iloc[:,0] # dates are stored in the first column

0        1 Jan 19
1        2 Jan 26
2         3 Feb 2
3         4 Feb 9
4        5 Feb 16
5        6 Feb 23
6         7 Mar 2
7         8 Mar 9
8        9 Mar 16
9       10 Mar 23
10       11 Apr 6
11      12 Apr 13
12      12 Apr 20
13      13 Apr 27
14    Final May 4
Name: Lec #, dtype: object

In [202]:
dates = list() # list of better formatted dates
for date in df.iloc[:,0]:
    d = date.split()[1:] # split date strings and only extract Month plus Day, exclude lecture number
    d = '2017 ' + ' '.join(d)
    dates.append(d)

In [203]:
dates

['2017 Jan 19',
 '2017 Jan 26',
 '2017 Feb 2',
 '2017 Feb 9',
 '2017 Feb 16',
 '2017 Feb 23',
 '2017 Mar 2',
 '2017 Mar 9',
 '2017 Mar 16',
 '2017 Mar 23',
 '2017 Apr 6',
 '2017 Apr 13',
 '2017 Apr 20',
 '2017 Apr 27',
 '2017 May 4']
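
To illustrate what the conversion to datetime objects involves, here is a sketch of the same split-and-parse step using `datetime.strptime` from the standard library. The helper name `to_date` is hypothetical, the year 2017 is hard-coded as in the cell, and `%b` assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

raw = ['1 Jan 19', '3 Feb 2', 'Final May 4']  # sample values from the 'Lec #' column

def to_date(cell, year=2017):
    """Drop the leading lecture label and parse the remaining 'Mon D' text."""
    month_day = ' '.join(cell.split()[1:])
    return datetime.strptime('{} {}'.format(year, month_day), '%Y %b %d')

parsed = [to_date(cell) for cell in raw]
print(parsed)
```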

In [204]:
df.index=pd.to_datetime(dates) # convert dates to datetime objects and set them as the index
df.index.name='Date' #rename the index column to be "Date"

In [205]:
df.drop('Lec #',axis=1,inplace=True) # Drop the first column, with the old dates

In [226]:
df.head()

Date,Topic,Tools,Cookbook Examples,HW DUE,Lab
2017-01-19,Introduction: Overview of Frameworks for obtai...,"Anaconda, Python",Setting up Anaconda Environment,HW 1 Assigned,
2017-01-26,Notebook: Python Numpy Notebook Slides: Data S...,"Python, Numpy, Pandas, JSON formatted files",Earthquake Data live query Example with JSON file,Bring 3 ideas to next class HW 1 Due,Form Teams
2017-02-02,Data signals in Tables. Slides: Pandas Overvie...,"Pandas, Numpy, SciPy, Matplotlib",Stock market live download to Pandas DataFrame...,HW 2 Due,Form Teams
2017-02-09,"Scoring, Linear Prediction and Max Likelihood ...","Numpy, SciPy, Matplotlib",Code samples: 2 variable and multi-variable Li...,HW 3 Due,Validate and Adjust
2017-02-16,"Classification. Logistic Regression, SVM, Nonl...","Scikit Learn, Seaborn Visualization",Classification example with Iris Database: Log...,HW4 Due,Low Tech Demo and Validation Results


In [227]:
pd.set_option('display.max_colwidth', None) # show full cell contents instead of ... (older pandas versions used -1)

df.head()

Date,Topic,Tools,Cookbook Examples,HW DUE,Lab
2017-01-19,Introduction: Overview of Frameworks for obtaining insights from data (Slides) Slides: Python and Math/Probability Pre-requisites,"Anaconda, Python",Setting up Anaconda Environment,HW 1 Assigned,
2017-01-26,Notebook: Python Numpy Notebook Slides: Data Structure Outline Slides: Numpy Review,"Python, Numpy, Pandas, JSON formatted files",Earthquake Data live query Example with JSON file,Bring 3 ideas to next class HW 1 Due,Form Teams
2017-02-02,Data signals in Tables. Slides: Pandas Overview Notebook: Pandas Intro Notebook: Pandas and Stock Market,"Pandas, Numpy, SciPy, Matplotlib",Stock market live download to Pandas DataFrame. Quant trading algorithm,HW 2 Due,Form Teams
2017-02-09,"Scoring, Linear Prediction and Max Likelihood Prediction. Extending to multiple variables","Numpy, SciPy, Matplotlib",Code samples: 2 variable and multi-variable Linear Prediction,HW 3 Due,Validate and Adjust
2017-02-16,"Classification. Logistic Regression, SVM, Nonlinear mapping","Scikit Learn, Seaborn Visualization","Classification example with Iris Database: Logistic, SVM",HW4 Due,Low Tech Demo and Validation Results


In [228]:
df.to_html('data-x-sched.html')

In [229]:
pd.options.display.max_colwidth=50 #change back to default max col_width

# Scrape images

In [257]:
print(soup.find_all('img'))

[<img alt="" height="154" src="https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg" width="912"/>, <img alt="course-model" class="alignnone size-full wp-image-99" data-attachment-id="99" data-comments-opened="1" data-image-description="" data-image-meta='{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}' data-image-title="course-model" data-large-file="https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032?w=911" data-medium-file="https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032?w=300" data-orig-file="https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032" data-orig-size="911,379" data-permalink="https://data-x.blog/syllabus-data-x/course-model/#main" sizes="(max-width: 911px) 100vw, 911px" src="https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032" srcset="https://datax911.files.wordp

In [284]:
# IPython help syntax: a trailing ? shows the docstring of os.path.basename
import os
os.path.basename?

In [286]:
import os
import urllib.request # the request submodule must be imported explicitly in Python 3

for link in soup.find_all('img'):
    img_url = link.get('src')
    if img_url is None: # skip <img> tags without a src attribute
        continue

    if 'jpg' in img_url: # only download jpg images
        print(img_url)
        print(os.path.splitext(os.path.basename(img_url))) # (final path component without extension, extension) as a tuple
        filename = os.path.splitext(os.path.basename(img_url))[0] + '.jpg'
        urllib.request.urlretrieve(img_url, filename) # downloads the file and writes it to disk
    else:
        print('EXCLUDED:', img_url)

https://datax911.files.wordpress.com/2016/12/cropped-banner_matrix1.jpg
('cropped-banner_matrix1', '.jpg')
https://datax911.files.wordpress.com/2017/01/course-model.jpg?w=1032
('course-model', '.jpg?w=1032')
EXCLUDED: https://sb.scorecardresearch.com/p?c1=2&c2=7518284&c3=&c4=&c5=&c6=&c15=&cv=2.0&cj=1
EXCLUDED: https://pixel.wp.com/b.gif?v=noscript


In [None]:
# XML documents: sitemaps list all the URLs of a website between tags
# XML is both human and machine readable
# To find the newest links on a site, look for its sitemap
# News websites often have separate sitemaps per section (e.g. politics);
# a bot that constantly tracks news can poll those sitemaps

In [297]:
source = urllib.request.urlopen('https://data-x.blog/sitemap.xml').read()
soup = bs.BeautifulSoup(source,'xml') # interact with this object; it looks like the page source shown in a browser

In [299]:
soup.find_all('loc')

[<loc>https://data-x.blog/syllabus-data-x/</loc>,
 <image:loc>https://datax911.files.wordpress.com/2017/01/course-model.jpg</image:loc>,
 <loc>https://data-x.blog/breakouts/</loc>,
 <loc>https://data-x.blog/contact/</loc>,
 <image:loc>https://datax911.files.wordpress.com/2016/12/is-bio1.jpg</image:loc>,
 <loc>https://data-x.blog</loc>]
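
The result mixes page URLs (`<loc>`) with image URLs (`<image:loc>`). As an alternative sketch, the standard library's `xml.etree.ElementTree` can separate the two by namespace. The inline `SITEMAP_XML` string is a shortened stand-in for the real sitemap, built from the entries shown above, so that the example runs offline.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the downloaded sitemap (assumed structure)
SITEMAP_XML = '''
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://data-x.blog/syllabus-data-x/</loc>
    <image:image>
      <image:loc>https://datax911.files.wordpress.com/2017/01/course-model.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://data-x.blog/breakouts/</loc>
  </url>
</urlset>
'''

# Prefixes used in the search paths below, mapped to the namespace URIs
NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'image': 'http://www.google.com/schemas/sitemap-image/1.1',
}

root = ET.fromstring(SITEMAP_XML.strip())
page_urls = [loc.text for loc in root.findall('sm:url/sm:loc', NS)]
image_urls = [loc.text for loc in root.findall('.//image:loc', NS)]
print(page_urls)
print(image_urls)
```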

# Scrape Bloomberg for news

In [306]:
source = urllib.request.urlopen('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').read()
soup = bs.BeautifulSoup(source,'xml')

In [333]:
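# Note: without parentheses this displays the bound method's repr (which embeds the document);
# call print(soup.prettify()) to get the indented markup instead
soup.prettify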

<bound method Tag.prettify of <?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>https://www.bloomberg.com/politics/articles/2017-03-07/fbi-s-comey-asked-to-testify-in-house-panel-s-russia-trump-probe</loc>
<news:news>
<news:publication>
<news:name>Bloomberg</news:name>
<news:language>en</news:language>
</news:publication>
<news:title>Comey Asked to Testify in House Panel's Russia-Trump Probe</news:title>
<news:publication_date>2017-03-08T00:01:34.173Z</news:publication_date>
<news:keywords>Cybersecurity, National Security, California, Russia, Richard Mauze Burr, Sally Yates, Adam B Schiff, Barack H Obama, Devin G Nunes, Donald John Trump, James Brien Comey</news:keywords>
<news:stock_tickers/>
</news:news>
<image:image>
<image:loc>https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iSZah6P7k4uo/v0/1200x-1.jpg

In [346]:
for news in soup.find_all({'news'}):
    print(news.title.text)
    print(news.publication_date.text)
    print(news.keywords.text)
    print('\n')

Comey Asked to Testify in House Panel's Russia-Trump Probe
2017-03-08T00:01:34.173Z
Cybersecurity, National Security, California, Russia, Richard Mauze Burr, Sally Yates, Adam B Schiff, Barack H Obama, Devin G Nunes, Donald John Trump, James Brien Comey


Taxpayers May Be on the Hook for the Next SpaceX or Orbital Rocket Failure
2017-03-08T00:12:54.207Z
Congress, Engineering, Retirement, Transportation, Florida, Work, Washington, Science, White House, Cargo, Tech, NASA, Donald John Trump, Elon R Musk


Le Pen Promises to Resign If EU Exit Vote Fails, AFP Says
2017-03-07T16:31:41.051Z
France


House Speaker Paul Ryan Says GOP Health Bill Will Pass
2017-03-07T22:29:55.502Z
Law, Work, Washington, Health


Rep. Jordan Says Obamacare Repeal Plan Will Lower Costs
2017-03-07T22:22:37.303Z
Conservative, Jordan


London-Based Regulators in EU's Sights as Brexit Eviction Likely
2017-03-08T00:01:00.001Z
Luxembourg, Netherlands, Austria, France, Health, Banking, Sweden, Portugal, Ireland, U.K., Dr
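
The publication dates are ISO 8601 timestamps, so they can be parsed and used for sorting. The sketch below hard-codes three of the title/date pairs printed above and orders them from oldest to newest; the helper name `parse_timestamp` is hypothetical.

```python
from datetime import datetime

# (title, publication_date) pairs copied from the output above
news_items = [
    ("Comey Asked to Testify in House Panel's Russia-Trump Probe", '2017-03-08T00:01:34.173Z'),
    ('Le Pen Promises to Resign If EU Exit Vote Fails, AFP Says', '2017-03-07T16:31:41.051Z'),
    ('House Speaker Paul Ryan Says GOP Health Bill Will Pass', '2017-03-07T22:29:55.502Z'),
]

def parse_timestamp(stamp):
    """Parse a timestamp such as 2017-03-08T00:01:34.173Z into a datetime."""
    return datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%S.%fZ')

oldest_first = sorted(news_items, key=lambda item: parse_timestamp(item[1]))
for title, stamp in oldest_first:
    print(parse_timestamp(stamp), title)
```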

In [None]:
# example from https://www.ayima.com/guides/how-to-visualize-an-xml-sitemap-using-python.html

# Visualize XML sitemap with categories!
import requests
from bs4 import BeautifulSoup

# url = 'https://www.sportchek.ca/sitemap.xml' # alternative sitemap to try
url = 'https://www.bloomberg.com/feeds/bpol/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

In [None]:
urls = [element.text for element in sitemap_index.find_all('loc')]
print(urls)

In [None]:
def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in loc tags. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.find_all('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

In [None]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

In [None]:
'''
Categorize a list of URLs by site path.
The file containing the URLs should exist in the working directory and be
named sitemap_urls.dat. It should contain one URL per line.
Categorization depth can be specified by executing a call like this in the
terminal (where we set the granularity depth level to 5):
    python categorize_urls.py --depth 5
The same result can be achieved by setting the categorization_depth variable
manually at the head of this file and running the script with:
    python categorize_urls.py
'''
from __future__ import print_function


categorization_depth=3



# Main script functions


def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counts the number of sub-pages for each.
    Prints results to a CSV file.
    urls : list
        List of page URLs.
    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except IndexError:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)

    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers




# Read the URLs back in, closing the file when done
with open('sitemap_urls.dat', 'r') as url_file:
    sitemap_urls = url_file.read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

print('Categorizing up to a depth of %d' % categorization_depth)
sitemap_layers = peel_layers(urls=sitemap_urls,
                             layers=categorization_depth)
print('Printed {:,} rows of data to sitemap_layers.csv'.format(len(sitemap_layers)))
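
To make the layer splitting in `peel_layers` concrete, here is a minimal sketch in plain Python 3 without pandas. The helper `split_layers` and the sample URLs are hypothetical and not part of the original script. It also uses `urllib.parse.urlparse` rather than string splitting, so unlike `peel_layers` it skips empty path segments.

```python
from collections import Counter
from urllib.parse import urlparse


def split_layers(url, depth=3):
    ''' Return (base, layer_1, ..., layer_depth), padding with '' when the
    URL path is shallower than the requested depth. '''
    parsed = urlparse(url)
    # Drop empty segments produced by leading or trailing slashes
    segments = [segment for segment in parsed.path.split('/') if segment]
    segments = segments[:depth]
    segments += [''] * (depth - len(segments))
    return (parsed.netloc,) + tuple(segments)


# Hypothetical URLs standing in for the contents of sitemap_urls.dat
sample_urls = [
    'https://example.com/blog/2017/post-a',
    'https://example.com/blog/2017/post-b',
    'https://example.com/about',
]

# Count how many URLs fall under each (base, layer 1, layer 2) combination
layer_counts = Counter(split_layers(url, depth=2) for url in sample_urls)
for layers, count in layer_counts.most_common():
    print(layers, count)
```

Each distinct tuple plays the role of one row in `sitemap_layers`, and its count corresponds to the `counts` column.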


In [None]:
'''
Visualize a list of URLs by site path.
This cell reads in the sitemap_layers.csv file created by the
categorization cell above and builds a graph visualization using Graphviz.
It is adapted from a standalone script, visualize_urls.py, where the options
could be passed on the command line:
    python visualize_urls.py --depth 4 --limit 10 --title "My Sitemap" --style "dark" --size "40"
The command-line parsing is not included in this cell, so set the variables
below instead.
'''
from __future__ import print_function


# Set global variables

graph_depth = 3  # Number of layers deep to plot categorization
limit = 3       # Maximum number of nodes for a branch
title = ''       # Graph title
style = 'light'  # Graph style, can be "light" or "dark"
size = '8,5'     # Size of rendered PDF graph


# Import external library dependencies

import pandas as pd
import graphviz



# Main script functions

def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    df : DataFrame
        The dataframe created by the peel_layers function
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number of node edge connections. Good to set this
        low for visualizing deep into site maps.
    size : str
        Size of the rendered graph, passed to Graphviz as 'width,height'.
    '''


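    # Check to make sure we are not trying to plot too many layers.
    # The dataframe has one column per layer ('0', '1', ...) plus 'counts',
    # so the deepest layer available is the number of columns minus two.
    max_layers = len(df.columns) - 2
    if layers > max_layers:
        layers = max_layers
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))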


    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])


    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''

        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]

        # Only add a new branch if it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))


    f.attr('node', shape='rectangle') # Plot nodes as rectangles

    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))

    if layers == 0:
        return f

    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()

    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        nodes = df[cols].drop_duplicates().values
        for j, k in enumerate(nodes):

            # Compute the mask to select correct data
            mask = True
            for j_, ki in enumerate(k):
                mask &= df[str(j_)] == ki

            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)

            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))

            # j is zero-based, so report j + 1 to end at "N / N"
            print(('Built graph up to node %d / %d in layer %d' % (j + 1, len(nodes), i))\
                    .ljust(50), end='\r')

    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    '''

    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }

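    styles = {'light': light_style, 'dark': dark_style}
    if style not in styles:
        raise ValueError("style must be 'light' or 'dark', got %r" % style)

    # Use a local name that does not shadow the apply_style function itself
    selected_style = styles[style]

    f.graph_attr = selected_style['graph']
    f.node_attr = selected_style['nodes']
    f.edge_attr = selected_style['edges']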

    return f




# Read in categorized data
sitemap_layers = pd.read_csv('sitemap_layers.csv', dtype=str)
# Convert numerical column to integer
sitemap_layers['counts'] = sitemap_layers['counts'].astype(int)
print('Loaded {:,} rows of categorized data from sitemap_layers.csv'\
        .format(len(sitemap_layers)))

print('Building %d layer deep sitemap graph' % graph_depth)
f = make_sitemap_graph(sitemap_layers, layers=graph_depth,
                       limit=limit, size=size)
f = apply_style(f, style=style, title=title)

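# render() returns the path of the file it wrote, which reflects the layer
# count actually used even if it was reduced inside make_sitemap_graph
output_path = f.render(cleanup=True)
print('Exported graph to %s' % output_path)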
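
For intuition about what the Graphviz step produces, the sketch below builds DOT source text by hand, with no dependency on the `graphviz` package. The function `rows_to_dot` and its input rows are hypothetical. Only the `parent-child` node naming and the thousands-separated edge labels mirror `add_branch` above; the actual cell lets the graphviz library generate and render this text.

```python
def rows_to_dot(rows):
    ''' Build DOT source for a left-to-right graph.
    rows : iterable of (parent, child, count) tuples
    '''
    lines = ['digraph sitemap {', '    rankdir=LR;']
    for parent, child, count in rows:
        # Prefix the child with its parent so equal names on
        # different branches remain distinct nodes
        child_id = '%s-%s' % (parent, child)
        lines.append('    "%s" [label="%s"];' % (child_id, child))
        lines.append('    "%s" -> "%s" [label="%s"];'
                     % (parent, child_id, '{:,}'.format(count)))
    lines.append('}')
    return '\n'.join(lines)


# Hypothetical branch sizes for a single site
dot_source = rows_to_dot([
    ('example.com', 'blog', 1200),
    ('example.com', 'about', 3),
])
print(dot_source)
```

The printed text can be saved to a `.dot` file and rendered with the Graphviz command-line tools if they are installed.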


