<img src="http://i67.tinypic.com/2jcbwcw.png" align="left"></img><br><br><br><br>


## Notebook: Web scraping & web crawling

**Author List**: Alexander Fred Ojala

**Original Sources**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ & https://www.dataquest.io/blog/web-scraping-tutorial-python/

**License**: Feel free to do whatever you want with this code

**Compatibility:** Python 2.x and 3.x

# Table of Contents
(Clickable document links)
___

### [0: Pre-setup](#sec0)
Document setup and Python 2 and Python 3 compatibility

### [1: Simple webscraping intro](#sec1)

Simple example of webscraping on a premade HTML template

### [2: Scrape Data-X Schedule](#sec2)

Find and scrape the current Data-X schedule. 

### [3: Scrape Images and Files](#sec3)

Scrape a website for images, PDFs, CSV data or any other file type.

## [Breakout Problem: Scrape Weather Data](#secBK)

Scrape real time weather data in Berkeley.


### [Appendix](#sec5)

#### [Scrape Bloomberg sitemap for political news headlines](#sec6)

#### [Webcrawl Twitter, recursive URL link fetcher + depth](#sec7)

#### [SEO, visualize website categories as a tree](#sec8)

<a id='sec0'></a>
## Pre-Setup

In [6]:
# stretch Jupyter coding blocks to fit screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>")) # if 100% it would fit the screen

In [7]:
# make it run on py2 and py3
from __future__ import division, print_function

<a id='sec1'></a>
# Webscraping intro

To scrape content from a website we first need to download its HTML. This can be done with the Python library **requests** (using its `.get` method).

To extract specific information from the downloaded page we use the parsing library **BeautifulSoup4** (`import bs4`). BeautifulSoup works on a soup object, which we create from the HTML source code of the website.

In [8]:
import requests # The requests library is an 
# HTTP library for getting content and posting etc.

import bs4 as bs # BeautifulSoup4 is a Python library 
# for pulling data out of HTML and XML code.

# Scraping a simple website

In [9]:
source = requests.get("https://afo.github.io/data-x") 
# a GET request will download the HTML webpage.

print(source) # If <Response [200]> then 
# the website has been downloaded successfully

<Response [200]>


**Different types of responses:**
In general, a status code starting with 2 indicates success, while a status code starting with 4 (client error) or 5 (server error) indicates a failure.
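
As a rough illustration of the status-code ranges described above, the sketch below groups a code by its first digit. `status_family` is a hypothetical helper written for this note and is not part of requests. With a real response object, `response.raise_for_status()` raises an exception for 4xx and 5xx codes.

```python
def status_family(status_code):
    """Group an HTTP status code by its first digit (hypothetical helper)."""
    families = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    return families.get(status_code // 100, 'unknown')


for code in (200, 301, 404, 503):
    print('{}: {}'.format(code, status_family(code)))
```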

In [10]:
print(source.content) # This is the HTML content of the website,
# as you can see it's quite hard to decipher

b'<!DOCTYPE html>\n<html>\n<head>\n\n<title>Data-X Webscrape Tutorial</title>\n\n<style>\ndiv.container {\n    width: 100%;\n    border: 1px solid gray;\n}\n\n.header {\n    color:green;\n}\n\n#second {\n    font-style: italic;\n}\n\n</style>\n\n</head>\n\n<body style="background-color: pink">\n\n<h1 class="header">Simple Data-X site</h1>\n\n\n<h3 id="second">This site is only live to be scraped.</h3>\n\n\n<div class="container">\n<p>Some cool text in a container</p>\n</div>\n  \n\n  <h4> Random list </h4>\n<nav class="regular_list">\n  <ul>\n    <li><a href="https://en.wikipedia.org/wiki/London">London</a></li>\n    <li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>\n  </ul>\n</nav>\n\n\n\n\n  <h2>Random London Information within p tags</h2>\n\n  <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>\n  <p>Standing on the River Thames, London has been a major settlement f

In [11]:
print(type(source.content)) # type bytes in Python 3, 
# type str in Python 2. .content holds the raw, undecoded response body

<class 'bytes'>
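
A small sketch of the difference between bytes and text, assuming the page is UTF-8 encoded. The byte string below is a made-up stand-in for `source.content`; with requests, the already-decoded string is also available as `source.text`.

```python
raw = b'<title>Data-X Webscrape Tutorial</title>'  # bytes, like source.content
text = raw.decode('utf-8')                         # decoded string, like source.text

print(type(raw))
print(type(text))
print(text)
```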


In [12]:
# Read in source.content to beautifulsoup 
# beautifulsoup can parse (extract specific information) HTML code

soup = bs.BeautifulSoup(source.content ,features='lxml') 
# we pass in the source content and choose a parser

# features specifies what type of code we are parsing, 
# here 'lxml' specifies an HTML parser

In [13]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [14]:
print(soup) 

<!DOCTYPE html>
<html>
<head>
<title>Data-X Webscrape Tutorial</title>
<style>
div.container {
    width: 100%;
    border: 1px solid gray;
}

.header {
    color:green;
}

#second {
    font-style: italic;
}

</style>
</head>
<body style="background-color: pink">
<h1 class="header">Simple Data-X site</h1>
<h3 id="second">This site is only live to be scraped.</h3>
<div class="container">
<p>Some cool text in a container</p>
</div>
<h4> Random list </h4>
<nav class="regular_list">
<ul>
<li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
<li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
</ul>
</nav>
<h2>Random London Information within p tags</h2>
<p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londin

Above we printed the HTML code of the website, parsed into a BeautifulSoup object.

`<xxx> </xxx>`: these are the tags. For more info: 
https://www.w3schools.com/tags/ref_byfunc.asp

**class and id:** attributes used as hooks for styling and identifying elements in HTML. An id should be unique within a page, while a class can be shared by many elements.

Full list of HTML tags: https://developer.mozilla.org/en-US/docs/Web/HTML/Element

In [15]:
# Suppose we want to extract content that is shown on the website

print(soup.body) # This is the main content of the website, located within the <body> tag

<body style="background-color: pink">
<h1 class="header">Simple Data-X site</h1>
<h3 id="second">This site is only live to be scraped.</h3>
<div class="container">
<p>Some cool text in a container</p>
</div>
<h4> Random list </h4>
<nav class="regular_list">
<ul>
<li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
<li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
</ul>
</nav>
<h2>Random London Information within p tags</h2>
<p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
<p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
<footer>footer content</footer>
</body>


In [16]:
print(soup.title) # Title of the website
print(soup.find('title')) # same as .title

<title>Data-X Webscrape Tutorial</title>
<title>Data-X Webscrape Tutorial</title>


In [17]:
# If we want to extract specific text
print(soup.find('p')) # will only return first <p> tag

<p>Some cool text in a container</p>


In [18]:
print(soup.find('p').text) # extracts the string within the <p> tag

Some cool text in a container


In [19]:
# If we want to extract all <p> tags
print(soup.find_all('p')) # returns list of all <p> tags

[<p>Some cool text in a container</p>, <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>, <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>]


In [20]:
print(soup.find(class_='header')) 
# we can also search for classes within all tags, using class_
# note the trailing _ is needed because class is a reserved keyword in Python

print(soup.find(id='second'))

<h1 class="header">Simple Data-X site</h1>
<h3 id="second">This site is only live to be scraped.</h3>


In [21]:
print(soup.find_all(class_='regular_list'))

[<nav class="regular_list">
<ul>
<li><a href="https://en.wikipedia.org/wiki/London">London</a></li>
<li><a href="https://en.wikipedia.org/wiki/Tokyo">Tokyo</a></li>
</ul>
</nav>]


In [22]:
for p in soup.find_all('p'): # print all p tags in the list
    print(p.text)

Some cool text in a container
London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.
Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.


In [23]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url">
# where the link is url

print(soup.a)
print(type(soup.a))


<a href="https://en.wikipedia.org/wiki/London">London</a>
<class 'bs4.element.Tag'>


In [24]:
soup.a.get('href') 
# to get the link from href attribute

'https://en.wikipedia.org/wiki/London'

In [25]:
# if we want to list links and their text info

links = soup.find_all('a')

for l in links:
    
    print("\nInfo about {}: ".format(l.text), \
      l.get('href')) 
# then we have extracted the link


Info about London:  https://en.wikipedia.org/wiki/London

Info about Tokyo:  https://en.wikipedia.org/wiki/Tokyo


<a id='sec2'></a>

# Now let us scrape the current Syllabus Schedule from the Data-X website


In [26]:
source = requests.get('https://data-x.blog/').content 
# get the source content

In [27]:
soup = bs.BeautifulSoup(source,'lxml')

In [28]:
print(soup.prettify()) 
# .prettify() method makes the HTML code more readable

# as you can see this code is more difficult 
# to read than the simple example above

<!DOCTYPE html>
<html class="no-js no-svg" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <script>
   (function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);
  </script>
  <title>
   Data-X at Berkeley – A Framework for Digital Transformation
  </title>
  <link href="//s0.wp.com" rel="dns-prefetch"/>
  <link href="//fonts.googleapis.com" rel="dns-prefetch"/>
  <link href="//s.w.org" rel="dns-prefetch"/>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <link href="https://data-x.blog/feed/" rel="alternate" title="Data-X at Berkeley » Feed" type="application/rss+xml"/>
  <link href="https://data-x.blog/comments/feed/" rel="alternate" title="Data-X at Berkeley » Comments Feed" type="application/rss+xml"/>
  <script type="text/javascript">
   window._wpemojiSettings = {"baseUrl":"https:\/\

In [29]:
print(soup.find('title').text) 
# check that we are at the correct website

Data-X at Berkeley – A Framework for Digital Transformation


In [30]:
for p in soup.find_all('p'):
    print(p.text)

A Framework for Digital Transformation
McKinsey Dec 2017: A new survey finds that many companies are launching data-focused businesses. But few have achieved significant financial impact, which requires the right combination of strategy, culture, and organization. (link)
Ikhlaq Sidhu, IEOR, UC Berkeley (contact).
Today, the world is literally reinventing itself with Data and AI.  However, neither leading companies nor the world’s top students have the complete knowledge set or access to the full networks they need to participate in this newly developing world.  The Data-X project designed to fix this problem.
The approach is to bring together students, faculty, new ventures, and large firms so they can learn from each other in a manner that is both technically deep and yet broad in an application sense.  Each of these segments provides an important part of the understanding of data problems to the other.   And as a result, we have the opportunity to develop a large-scale, holistic, dat

In [31]:
navigation_bar = soup.find('nav')
print(navigation_bar)

<nav aria-label="Top Menu" class="main-navigation" id="site-navigation" role="navigation">
<button aria-controls="top-menu" aria-expanded="false" class="menu-toggle">
<svg aria-hidden="true" class="icon icon-bars" role="img"> <use href="#icon-bars" xlink:href="#icon-bars"></use> </svg><svg aria-hidden="true" class="icon icon-close" role="img"> <use href="#icon-close" xlink:href="#icon-close"></use> </svg>Menu	</button>
<div class="menu-primary-container"><ul class="menu" id="top-menu"><li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-item current_page_item menu-item-8" id="menu-item-8"><a href="/">Home</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-183" id="menu-item-183"><a href="https://data-x.blog/resources/">Resources</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-102" id="menu-item-102"><a href="https://data-x.blog/syllabus/">Syllabus</a></li>
<li class="menu-item menu-ite

In [32]:
nav_bar = navigation_bar.text
print(nav_bar)



    Menu	
Home
Resources
Syllabus
Posts
Labs
Projects
Advisors
Contact

  Scroll down to content



In [33]:
# Now we want to find the Syllabus, 
# however we are at the root web page, not displaying the syllabus
# Get links from the data-x website
for url in navigation_bar.find_all('a'): 
    # look for links in the navigation bar. Tag <nav>
    
    link = url.get('href')
    if 'data-x.blog' in link:
        print(link) 
        # syllabus is located at https://data-x.blog/syllabus/
        if 'syllabus' in link:
            syllabus_url = link

https://data-x.blog/resources/
https://data-x.blog/syllabus/
https://data-x.blog/posts/
https://data-x.blog/project/
http://data-x.blog/projects
https://data-x.blog/advisors/
https://data-x.blog/contact/
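
The filter above only keeps links that already contain the domain, so relative links such as `href="/"` are skipped. A minimal sketch of turning relative links into absolute ones with the standard library's `urljoin` follows; the `hrefs` list is made up for illustration (only `/` and the syllabus link appear in the navigation bar above).

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base_url = 'https://data-x.blog/'
# Illustrative href values: root-relative, absolute, relative and fragment
hrefs = ['/', 'https://data-x.blog/syllabus/', 'resources/', '#content']

absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```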


In [34]:
print(syllabus_url)

https://data-x.blog/syllabus/


In [35]:
# Open new connection to the syllabus url. Replace soup object.
source = requests.get(syllabus_url).content
soup = bs.BeautifulSoup(source, 'lxml') 
# 'lxml' parser better for tables, very similar to 'html.parser'

print(soup.body.prettify()) 
# we can see that the table is stored within <td> tags

<body class="page-template-default page page-id-94 has-header-image page-one-column colors-light cannot-edit">
 <div class="site" id="page">
  <a class="skip-link screen-reader-text" href="#content">
   Skip to content
  </a>
  <header class="site-header" id="masthead" role="banner">
   <div class="custom-header">
    <div class="custom-header-media">
     <div class="wp-custom-header" id="wp-custom-header">
      <img alt="Data-X at Berkeley" height="1200" src="https://datax911.files.wordpress.com/2017/09/cropped-adobestock_107620848.jpeg" width="2000"/>
     </div>
    </div>
    <div class="site-branding">
     <div class="wrap">
      <div class="site-branding-text">
       <p class="site-title">
        <a href="https://data-x.blog/" rel="home">
         Data-X at Berkeley
        </a>
       </p>
       <p class="site-description">
        A Framework for Digital Transformation
       </p>
      </div>
      <!-- .site-branding-text -->
     </div>
     <!-- .wrap -->
    </div>


### Find the course schedule table from the syllabus:  
Usually tabular data on a website is stored in `<table>` elements, with each cell in a `<td>` tag. Here we want to extract the information in the Data-X syllabus.

**NOTE:** To identify the element, class or id name of the object you are interested in, open the page in your browser and choose 'Developer tools' under the 'More tools' menu. This opens a view of the page's Document Object Model (DOM). Hover over the element you are interested in to see where it sits in the DOM. This helps you decide which parts of the soup content to parse.

More info at: https://developer.chrome.com/devtools
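
To make the table structure concrete, here is a small sketch that parses a made-up HTML table (not the real syllabus page). It uses the built-in `'html.parser'` so that no extra parser needs to be installed, and it also picks up header cells, which use `<th>` instead of `<td>`.

```python
import bs4 as bs

# Made-up table, for illustration only
html = """
<table>
  <tr><th>Part</th><th>Description</th></tr>
  <tr><td>Topic 1:</td><td>Introduction</td></tr>
  <tr><td>Code</td><td>Python review</td></tr>
</table>
"""

soup_table = bs.BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup_table.find('table').find_all('tr'):
    # header cells use <th>, data cells use <td>
    cells = tr.find_all(['th', 'td'])
    rows.append([cell.get_text(strip=True) for cell in cells])

header, data = rows[0], rows[1:]
print(header)
print(data)
```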

In [36]:
# We can see that course schedule is in <table></table> elements
# We can also get the table
full_table = soup.find_all('table')

In [37]:
# A new row in an HTML table starts with <tr> tag
# A new column entry is defined by <td> tag
table_result = list()
for table in full_table:
    for row in table.find_all('tr'):
        row_cells = row.find_all('td') # find all table data
        row_entries = [cell.text for cell in row_cells]
        print(row_entries) 
        table_result.append(row_entries)
        # get all the table data into a list

['Topic 1:', 'Introduction\nTheory: Overview of Frameworks for obtaining insights from data (Slides).\nTools: Python Review']
['Code', '1. Introduction to GitHub\n2. Setting up Anaconda Environment\n3. Coding with Python Review']
['DUE', 'Homework 1 assigned.']
['\xa0Project', 'Office Hours Session that week for Environment Set Up']
['Topic 2:', 'Tools: NumPy, Pandas, Matplotlib']
['Code', '\n\nCoding with Numpy\nCoding with Pandas\nCoding with Matplotlib\n\n']
['DUE', 'HW 1 Due']
['\xa0Project', 'Bring 3 ideas to class.']
['Topic 3:', 'Theory: Data as a Signal with Correlation\nTools: Webscraping\xa0– crawling and API use']
['Code', 'Coding with BeautifulSoup and other python scraping\xa0 libraries']
['DUE', 'Homework -2 Due']
['\xa0Project', 'Form Teams Part II,\xa0Mixer: Form teams for the final project.']
['Topic 4:', 'Theory: Prediction Algorithms Primer\nTools: Scikit Learn for Classification and Regression']
['Code', 'Coding with Scikit Learn']
['DUE', '\xa0Homework 3 Due']
['\x

In [38]:
# We can also read it in to a Pandas DataFrame
import pandas as pd
pd.set_option('display.max_colwidth', 10000)

df = pd.DataFrame(table_result)
df

Unnamed: 0,0,1
0,Topic 1:,Introduction\nTheory: Overview of Frameworks for obtaining insights from data (Slides).\nTools: Python Review
1,Code,1. Introduction to GitHub\n2. Setting up Anaconda Environment\n3. Coding with Python Review
2,DUE,Homework 1 assigned.
3,Project,Office Hours Session that week for Environment Set Up
4,Topic 2:,"Tools: NumPy, Pandas, Matplotlib"
5,Code,\n\nCoding with Numpy\nCoding with Pandas\nCoding with Matplotlib\n\n
6,DUE,HW 1 Due
7,Project,Bring 3 ideas to class.
8,Topic 3:,Theory: Data as a Signal with Correlation\nTools: Webscraping – crawling and API use
9,Code,Coding with BeautifulSoup and other python scraping libraries


In [39]:
# Pandas can also grab tables from a website automatically

import pandas as pd

import html5lib
# pd.read_html needs an HTML parser library such as html5lib: 
#!conda install --yes html5lib
dfs = pd.read_html('https://data-x.blog/syllabus/') 
# returns a list of all tables at url



In [40]:
print(type(dfs)) #list of tables
print(len(dfs)) # number of tables found on the page (12 here)
print(type(dfs[0])) # stored as DataFrame
df = pd.concat(dfs,ignore_index=True)

<class 'list'>
12
<class 'pandas.core.frame.DataFrame'>


In [41]:
# Looks so-so
df

Unnamed: 0,0,1
0,Topic 1:,Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review
1,Code,1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review
2,DUE,Homework 1 assigned.
3,Project,Office Hours Session that week for Environment Set Up
4,Topic 2:,"Tools: NumPy, Pandas, Matplotlib"
5,Code,Coding with Numpy Coding with Pandas Coding with Matplotlib
6,DUE,HW 1 Due
7,Project,Bring 3 ideas to class.
8,Topic 3:,Theory: Data as a Signal with Correlation Tools: Webscraping – crawling and API use
9,Code,Coding with BeautifulSoup and other python scraping libraries


In [42]:
# Make it nicer
df.columns=  ['Part','Detailed Description']

# each week spans 4 consecutive rows (Topic, Code, DUE, Project)
df['Week'] = ['Week{}'.format(i // 4 + 1) for i in range(len(df))]

In [43]:
df.head()

Unnamed: 0,Part,Detailed Description,Week
0,Topic 1:,Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review,Week1
1,Code,1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review,Week1
2,DUE,Homework 1 assigned.,Week1
3,Project,Office Hours Session that week for Environment Set Up,Week1
4,Topic 2:,"Tools: NumPy, Pandas, Matplotlib",Week2


In [44]:
df = df.set_index(['Week','Part'])

In [45]:
df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Detailed Description
Week,Part,Unnamed: 2_level_1
Week1,Topic 1:,Introduction Theory: Overview of Frameworks for obtaining insights from data (Slides). Tools: Python Review
Week1,Code,1. Introduction to GitHub 2. Setting up Anaconda Environment 3. Coding with Python Review
Week1,DUE,Homework 1 assigned.
Week1,Project,Office Hours Session that week for Environment Set Up
Week2,Topic 2:,"Tools: NumPy, Pandas, Matplotlib"
Week2,Code,Coding with Numpy Coding with Pandas Coding with Matplotlib
Week2,DUE,HW 1 Due
Week2,Project,Bring 3 ideas to class.
Week3,Topic 3:,Theory: Data as a Signal with Correlation Tools: Webscraping – crawling and API use
Week3,Code,Coding with BeautifulSoup and other python scraping libraries


<a id='sec3'></a>
# Scrape images and other files

In [46]:
# As we can see there are two images on the data-x.blog/resources
# say that we want to download them
# Images are displayed with the <img> tag in HTML

# open connection and create new soup

raw = requests.get('https://data-x.blog/resources/').content
soup = bs.BeautifulSoup(raw,features='lxml')

print(soup.find('img')) 
# as we can see below the image urls 
# are stored as the src inside the img tag

<img alt="Data-X at Berkeley" height="1200" src="https://datax911.files.wordpress.com/2017/09/cropped-adobestock_107620848.jpeg" width="2000"/>


In [47]:
# Collect the urls of all images
img_urls = list()
for img in soup.find_all('img'): 
    img_url = img.get('src') 
    if img_url and ('.jpeg' in img_url or '.jpg' in img_url):
        print(img_url)
        img_urls.append(img_url)
    

https://datax911.files.wordpress.com/2017/09/cropped-adobestock_107620848.jpeg
https://i2.wp.com/data-x.blog/wp-content/uploads/2017/05/unnamed-2.jpg?resize=740%2C416&ssl=1


In [48]:
print(img_urls)

['https://datax911.files.wordpress.com/2017/09/cropped-adobestock_107620848.jpeg', 'https://i2.wp.com/data-x.blog/wp-content/uploads/2017/05/unnamed-2.jpg?resize=740%2C416&ssl=1']
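
The second url above carries a query string (`?resize=...`), so checking for `'.jpg'` anywhere in the url is a loose test. One possible way to get a clean file name, sketched with the standard library only, is to look at the path part of the url. `filename_from_url` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import posixpath
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def filename_from_url(url):
    """Return the last path segment of a url, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


url = ('https://i2.wp.com/data-x.blog/wp-content/uploads/2017/05/'
       'unnamed-2.jpg?resize=740%2C416&ssl=1')
name = filename_from_url(url)
is_jpeg = name.lower().endswith(('.jpg', '.jpeg'))
print(name)
print(is_jpeg)
```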


In [49]:
!ls

notebook-webscraping.ipynb  placeholder  webscraper-breakout.ipynb


In [50]:
# To download and save files with Python we can use 
# the shutil library, which provides high-level file operations

import shutil

for idx, img_url in enumerate(img_urls): 
    # enumerate to create an integer file name for every image
    
    img_source = requests.get(img_url, stream=True) 
    # we set stream = True to download/ 
    # stream the content of the data
    
    with open('img'+str(idx)+'.jpg', 'wb') as file: 
        # open file connection, create file and write to it
        shutil.copyfileobj(img_source.raw, file) 
        # save the raw file object

    del img_source # to remove the file from memory

In [51]:
!ls

img0.jpg  notebook-webscraping.ipynb  webscraper-breakout.ipynb
img1.jpg  placeholder


## Scraping function to download files of any type from a website

In [52]:
# Extended scraping function of any file format
import os # To format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" 
    in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or 
    a for file links)
    
    source_tag = the source tag for the file url 
    (usually src for images or href for files)
    
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    
    max = integer (max number of files to scrape, 
    if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' 
    # for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')
    print('Loading content from the url...')
    source = requests.get(url).content
    print('Creating content soup...')
    soup = bs.BeautifulSoup(source,'lxml')
    
    i=0
    print('Finding tag:%s...'%html_tag)
    for n, link in enumerate(soup.find_all(html_tag)):
        file_url=link.get(source_tag)
        print (n+1,'. File url',file_url)
        
        
        if file_url and 'http' in file_url: # skip tags without the attribute and relative links
            print('It is a valid url..')
            
            
            if file_type in file_url: #only check for specific 
                # file type
                
                print('%s FILE TYPE FOUND IN THE URL...'%file_type)
                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
             
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and 
                    # write to it
                    
                    shutil.copyfileobj(file_source.raw, file) 
                    # save the raw file object
                    
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('%s file type NOT found in url:'%file_type)
                print('EXCLUDED:',file_url) 
                # urls not downloaded from
                
        if i == max:
            print('Max reached')
            break
            

    print('Done!')

In [53]:
py_file_scraper('https://funcatpictures.com/') 
# scrape cats

Loading content from the url...
Creating content soup...
Finding tag:img...
1 . File url http://www.funcatpictures.com/wp-content/uploads/2014/08/funnycatpictures.png
It is a valid url..
.jpg file type NOT found in url:
EXCLUDED: http://www.funcatpictures.com/wp-content/uploads/2014/08/funnycatpictures.png
2 . File url https://funcatpictures.com/wp-content/uploads/2017/11/funny-cat-pictures-when-you-arent-redy-to-wake-up.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: funny-cat-pictures-when-you-arent-redy-to-wake-up.jpg
3 . File url https://funcatpictures.com/wp-content/uploads/2017/11/funny-cat-pictures-so-long.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: funny-cat-pictures-so-long.jpg
4 . File url https://funcatpictures.com/wp-content/uploads/2017/10/funny-pictures-awwgasm-kitty.jpg
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: funny-pictures-awwgasm-kitty.jpg
5 . File url https://funcatpictures.com/wp-content/upl

In [54]:
!ls ./files

funny-cat-pictures-feels-like-a-pea-under-here.jpg
funny-cat-pictures-making-bed.jpg
funny-cat-pictures-my-face-when-youre-looking.jpg
funny-cat-pictures-so-long.jpg
funny-cat-pictures-when-you-arent-redy-to-wake-up.jpg
funny-pictures-awwgasm-kitty.jpg


In [55]:
# scrape PDFs from data-x site
py_file_scraper('https://data-x.blog/resources',
                html_tag='a',source_tag='href',file_type='.pdf', \
                max=5)

Loading content from the url...
Creating content soup...
Finding tag:a...
1 . File url #content
2 . File url https://data-x.blog/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/
3 . File url /
4 . File url https://data-x.blog/resources/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/resources/
5 . File url https://data-x.blog/syllabus/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/syllabus/
6 . File url https://data-x.blog/posts/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/posts/
7 . File url https://data-x.blog/project/
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: https://data-x.blog/project/
8 . File url http://data-x.blog/projects
It is a valid url..
.pdf file type NOT found in url:
EXCLUDED: http://data-x.blog/projects
9 . File url https://data-x.blog/advisors/
It is a valid url..
.pdf file type NOT found in url:
EXCLU

In [56]:
# scrape csv files from website
py_file_scraper('http://www-eio.upc.edu/~pau/cms/rdata/datasets.html',
                html_tag='a', # R data sets
                source_tag='href', file_type='.csv',max=5)

Loading content from the url...
Creating content soup...
Finding tag:a...
1 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/AirPassengers.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: AirPassengers.csv
2 . File url http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html
It is a valid url..
.csv file type NOT found in url:
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/AirPassengers.html
3 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/BJsales.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: BJsales.csv
4 . File url http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html
It is a valid url..
.csv file type NOT found in url:
EXCLUDED: http://www-eio.upc.edu/~pau/cms/rdata/doc/datasets/BJsales.html
5 . File url http://www-eio.upc.edu/~pau/cms/rdata/csv/datasets/BOD.csv
It is a valid url..
.csv FILE TYPE FOUND IN THE URL...
DOWNLOADED: BOD.csv
6 . File url http://www-eio.upc.edu/~

---
<a id='secBK'></a>
# Breakout problem


In this Breakout Problem you should extract live weather data in Berkeley from:

[http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971](http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971)

* Task: scrape
    * period / day (as Tonight, Friday, FridayNight etc.)
    * the temperature for the period (as Low, High)
    * the long weather description (e.g. Partly cloudy, with a low around 49..)
    
Store the scraped data strings in a Pandas DataFrame



**Hint:** The weather information is found in a div tag with `id='seven-day-forecast'`




# Appendix

<a id='sec6'></a>
# Scrape Bloomberg sitemap (XML) for current political news

In [None]:
# XML documents - site maps, all the urls. just between tags
# XML human and machine readable.
# Newest links: all the links for FIND SITE MAP!
# News websites will have sitemaps for politics, bot constantly
# tracking news track the sitemaps

# Before scraping a website look at robots.txt file
bs.BeautifulSoup(requests.get('https://www.bloomberg.com/robots.txt').content,'lxml')
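
A robots.txt file can also be checked programmatically. The sketch below uses the standard library's `RobotFileParser` on a made-up robots.txt (the rules and the `example.com` urls are assumptions for illustration, not Bloomberg's actual rules).

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser         # Python 2

# Made-up robots.txt content, for illustration only
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) tells us whether the rules allow the request
allowed = parser.can_fetch('*', 'https://example.com/news/')
blocked = parser.can_fetch('*', 'https://example.com/private/data.html')
print(allowed)
print(blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.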

In [None]:
source = requests.get('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').content
soup = bs.BeautifulSoup(source,'xml') # Note parser 'xml'

In [None]:
print(soup.prettify())

In [None]:
# Find political news headlines
for news in soup.find_all({'news'}):
    print(news.title.text)
    print(news.publication_date.text)
    #print(news.keywords.text)
    print('\n')

<a id='sec7'></a>
# Web crawl

Web crawling is closely related to web scraping, but instead of extracting content from a single page you follow the links on a website (and often its subsites) and collect meta information along the way. It can be seen as simple, recursive scraping. Crawling is used for web indexing, for example in order to build a web search engine.

## Web crawl Twitter account
**Authors:** Kunal Desai & Alexander Fred Ojala

In [None]:
import bs4
from bs4 import BeautifulSoup
import requests

In [None]:
# Helper function to maintain the urls and the number of times they appear

url_dict = dict()

def add_to_dict(url_d, key):
    if key in url_d:
        url_d[key] = url_d[key] + 1
    else:
        url_d[key] = 1
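
As an aside, the same counting pattern is available in the standard library as `collections.Counter`. A minimal sketch with made-up urls:

```python
from collections import Counter

url_counts = Counter()
for url in ['https://a.example', 'https://b.example', 'https://a.example']:
    url_counts[url] += 1  # missing keys start at 0

# most_common(n) returns the n most frequent urls with their counts
top = url_counts.most_common(1)
print(top)
```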

In [None]:
# Recursive function which extracts links from the given url up to a given 'depth'.

def get_urls(url, depth):
    if depth == 0:
        return
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        if link.has_attr('href') and "https://" in link['href']:
#             print(link['href'])
            add_to_dict(url_dict, link['href'])
            get_urls(link['href'], depth - 1)

In [None]:
# Iterative function which extracts links from the given url up to a given 'depth'.
# It walks the links level by level, like the recursive version above.

def get_urls_iterative(url, depth):
    frontier = [url]  # urls to fetch at the current depth level
    for _ in range(depth):
        next_frontier = []
        for page_url in frontier:
            r = requests.get(page_url)
            soup = BeautifulSoup(r.text, 'html.parser')
            for link in soup.find_all('a'):
                if link.has_attr('href') and "https://" in link['href']:
                    add_to_dict(url_dict, link['href'])
                    next_frontier.append(link['href'])
        frontier = next_frontier

In [None]:
get_urls("https://twitter.com/GolfWorld", 2)
for key in url_dict:
    print(str(key) + "  ----   " + str(url_dict[key]))

<a id='sec8'></a>
# SEO: Visualize sitemap and categories in a website

**Source:** https://www.ayima.com/guides/how-to-visualize-an-xml-sitemap-using-python.html

In [None]:
# Visualize XML sitemap with categories!
import requests
from bs4 import BeautifulSoup

# Alternative sitemap to try: 'https://www.sportchek.ca/sitemap.xml'
url = 'https://www.bloomberg.com/feeds/bpol/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

In [None]:
urls = [element.text for element in sitemap_index.find_all('loc')]
print(urls)

In [None]:
def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in loc tags. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.find_all('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))
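
As an alternative to parsing the sitemap with an HTML parser, the `loc` entries can also be read with the standard library's XML parser. This is a minimal sketch, assuming a made-up sitemap inlined as a string so that it runs offline. The namespace URI is the one defined by the sitemaps.org protocol, and `NS` and `locs` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap body, inlined so the example runs offline.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/news/politics/story-1</loc></url>
</urlset>"""

# Sitemap tags live in an XML namespace, so searches must name it.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

root = ET.fromstring(sitemap_xml.encode('utf-8'))
locs = [loc.text.strip() for loc in root.findall('.//sm:loc', NS)]
print(locs)
```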

In [None]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

In [None]:
'''
Categorize a list of URLs by site path.
The file containing the URLs should exist in the working directory and be
named sitemap_urls.dat. It should contain one URL per line.
In the original script (categorize_urls.py) the categorization depth could be
specified in the terminal (here with the granularity depth level set to 5):
    python categorize_urls.py --depth 5
This notebook cell does no argument parsing, so set the
categorization_depth variable below instead.
'''
from __future__ import print_function

import pandas as pd  # needed by peel_layers


categorization_depth=3



# Main script functions


def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counts the number of sub-pages for each.
    Prints results to a CSV file.
    urls : list
        List of page URLs.
    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except IndexError:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)

    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers




# Read the URLs back in; the with-block closes the file afterwards
with open('sitemap_urls.dat', 'r') as url_file:
    sitemap_urls = url_file.read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

print('Categorizing up to a depth of %d' % categorization_depth)
sitemap_layers = peel_layers(urls=sitemap_urls,
                             layers=categorization_depth)
print('Printed {:,} rows of data to sitemap_layers.csv'.format(len(sitemap_layers)))
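
To see what the categorization amounts to, here is a small standard-library sketch (Python 3) that counts made-up URLs by host and their first two path segments. It approximates the idea behind `peel_layers` rather than reproducing it exactly: empty path segments are dropped here, and `url_layers` and `sample_urls` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse


def url_layers(url, depth):
    """Return (host, segment_1, ..., segment_depth), padding with ''."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split('/') if s]
    padded = (segments + [''] * depth)[:depth]
    return (parts.netloc,) + tuple(padded)


sample_urls = [
    'https://example.com/news/politics/story-1',
    'https://example.com/news/politics/story-2',
    'https://example.com/news/sports/story-3',
    'https://example.com/about',
]

# Count how many URLs fall under each (host, layer 1, layer 2) category
layer_counts = Counter(url_layers(u, depth=2) for u in sample_urls)
for layers, count in sorted(layer_counts.items()):
    print(layers, count)
```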


In [None]:
'''
Visualize a list of URLs by site path.
This cell reads in the sitemap_layers.csv file created by the
categorization step above and builds a graph visualization using Graphviz.
In the original script (visualize_urls.py) the options could be specified
in the terminal:
    python visualize_urls.py --depth 4 --limit 10 --title "My Sitemap" --style "dark" --size "40"
This notebook cell does no argument parsing, so set the variables
manually below instead.
'''
from __future__ import print_function


# Set global variables

graph_depth = 3  # Number of layers deep to plot categorization
limit = 3       # Maximum number of nodes for a branch
title = ''       # Graph title
style = 'light'  # Graph style, can be "light" or "dark"
size = '8,5'     # Size of rendered PDF graph


# Import external library dependencies

import pandas as pd
import graphviz



# Main script functions

def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    df : DataFrame
        The dataframe created by the peel_layers function
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number of node edge connections. Good to set this
        low for visualizing deep into site maps.
    size : str
        Size of the rendered graph, passed to Graphviz (e.g. '8,5').
    '''


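    # Check to make sure we are not trying to plot too many layers.
    # df has one column per layer ('0', '1', ...) plus the 'counts' column.
    max_layers = len(df.columns) - 2
    if layers > max_layers:
        layers = max_layers
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))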


    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.attr(rankdir='LR', size=size)  # graph-level attributes: left-to-right layout and size


    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''

        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]

        # Only add a new branch if it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))


    f.attr('node', shape='rectangle') # Plot nodes as rectangles

    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))

    if layers == 0:
        return f

    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()

    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        nodes = df[cols].drop_duplicates().values
        for j, k in enumerate(nodes):

            # Compute the mask to select correct data
            mask = True
            for j_, ki in enumerate(k):
                mask &= df[str(j_)] == ki

            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)

            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))

            print(('Built graph up to node %d / %d in layer %d' % (j, len(nodes), i))\
                    .ljust(50), end='\r')

    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    '''

    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }

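    # Pick the style dictionary (named so it does not shadow this function)
    if style == 'light':
        chosen_style = light_style
    elif style == 'dark':
        chosen_style = dark_style
    else:
        raise ValueError("style must be 'light' or 'dark', got %r" % style)

    f.graph_attr = chosen_style['graph']
    f.node_attr = chosen_style['nodes']
    f.edge_attr = chosen_style['edges']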

    return f




# Read in categorized data
sitemap_layers = pd.read_csv('sitemap_layers.csv', dtype=str)
# Convert numerical column to integer
sitemap_layers.counts = sitemap_layers.counts.apply(int)
print('Loaded {:,} rows of categorized data from sitemap_layers.csv'\
        .format(len(sitemap_layers)))

print('Building %d layer deep sitemap graph' % graph_depth)
f = make_sitemap_graph(sitemap_layers, layers=graph_depth,
                       limit=limit, size=size)
f = apply_style(f, style=style, title=title)

f.render(cleanup=True)
print('Exported graph to sitemap_graph_%d_layer.pdf' % graph_depth)
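
For intuition, it may help to see roughly what kind of DOT text a category graph like this boils down to. The sketch below builds DOT source by hand from a few made-up edges. It is not the output of `make_sitemap_graph`, and the `to_dot` helper is hypothetical.

```python
def to_dot(edges):
    """Build DOT source for a left-to-right directed graph.

    edges: iterable of (parent, child, count) tuples.
    """
    lines = ['digraph sitemap {', '    rankdir=LR;']
    for parent, child, count in edges:
        # Each edge is labelled with the number of pages under the child
        lines.append('    "%s" -> "%s" [label="%d"];' % (parent, child, count))
    lines.append('}')
    return '\n'.join(lines)


# Hypothetical category counts
edges = [
    ('example.com', 'example.com-news', 3),
    ('example.com', 'example.com-about', 1),
]
dot_source = to_dot(edges)
print(dot_source)
```

If the Graphviz command line tools are installed, text like this can be saved to a file such as `sitemap.dot` and rendered with `dot -Tpdf sitemap.dot -o sitemap.pdf`.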


