<img src="https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png" align="left"></img><br><br><br><br>


## Breakout Lecture 8: Web scraping & web crawling

**Author List**: Alexander Fred Ojala

**Original Sources**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ & https://www.dataquest.io/blog/web-scraping-tutorial-python/

**License**: Feel free to do whatever you want to with this code

**Compatibility:** Python 2.x and 3.x

# Table of Contents
(Clickable document links)
___

### [0: Pre-setup](#sec0)
Document setup and Python 2 and Python 3 compatibility

### [1: Simple webscraping intro](#sec1)

Simple example of webscraping on a premade HTML template

### [2: Scrape Data-X Schedule](#sec2)

Find and scrape the current Data-X schedule. 

### [3: Scrape Images and Files](#sec3)

Scrape a website for images, PDFs, CSV data or any other file type.

## [Breakout Problem: Scrape Weather Data](#sec4)

Scrape real time weather data in Berkeley.


### [Appendix](#sec5)

#### [Scrape Bloomberg sitemap for political news headlines](#sec6)

#### [Webcrawl Twitter, recursive URL link fetcher + depth](#sec7)

#### [SEO, visualize website categories as a tree](#sec8)

<a id='sec0'></a>
## Pre-Setup

In [None]:
# stretch Jupyter coding blocks to fit screen
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>")) # if 100% it would fit the screen

In [None]:
# make it run on py2 and py3
from __future__ import division, print_function

<a id='sec1'></a>
# Webscraping intro

In order to scrape content from a website we first need to download the HTML contents of the website. This can be done with the Python library **requests** (with its `.get` method).

Then when we want to extract certain information from a website we use the scraping tool **BeautifulSoup4** (import bs4). In order to extract information with beautifulsoup we have to create a soup object from the HTML source code of a website.

In [None]:
import requests # The requests library is an HTTP library for getting content and posting etc.
import bs4 as bs # BeautifulSoup4 is a Python library for pulling data out of HTML and XML code.

# Scraping a simple website

In [None]:
source = requests.get("https://alexanderfo.github.io") # a GET request will download the HTML webpage.
print(source) # If <Response [200]> then the website has been downloaded successfully

**Different types of responses:**
Generally, a status code starting with 2 indicates success, while a status code starting with 4 or 5 indicates an error (4xx on the client side, 5xx on the server side).
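
To make the status code classes concrete, here is a minimal sketch that maps a code to its broad class using only the standard library. The helper name `describe_status` is made up for this example; with **requests** you would read the code from `source.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the broad class of an HTTP status code plus its standard phrase."""
    classes = {1: 'informational', 2: 'success', 3: 'redirection',
               4: 'client error', 5: 'server error'}
    category = classes.get(code // 100, 'unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not defined in the standard library enum
        phrase = 'Unknown'
    return category, phrase

for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

In a scraper it is usually worth checking the status before parsing, so that an error page is not mistaken for real content.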

In [None]:
print(source.content) # This is the HTML content of the website, as you can see it's quite hard to decipher

In [None]:
print(type(source.content)) # bytes in Python 3, str in Python 2: requests returns the raw, undecoded response body
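
The difference between raw bytes and decoded text can be shown without any network access. The sketch below (Python 3) uses a made-up string as a stand-in for a downloaded page; in **requests**, `source.text` gives the decoded counterpart of `source.content`.

```python
# A made-up payload standing in for a downloaded response body
raw = 'Smörgåsbord: 5 €'.encode('utf-8')  # bytes, as a server would send them
print(type(raw), len(raw))

text = raw.decode('utf-8')  # str, after choosing the right encoding
print(type(text), len(text))
print(text)
```

The byte string is longer than the text because non-ASCII characters take more than one byte in UTF-8.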

In [None]:
# Read in source.content to beautifulsoup 
# beautifulsoup can parse (extract specific information) HTML code

soup = bs.BeautifulSoup(source.content, features='lxml') # we pass in the source and choose a parser

# features specifies what type of code we are parsing, here 'lxml' specifies an HTML parser

In [None]:
print(type(soup))

In [None]:
print(soup) # This is the HTML code of the website, decoded as a beautiful soup object

In [None]:
# Suppose we want to extract content that is shown on the website

print(soup.body) # This is the main content of the website, located within the <body> tag

In [None]:
print(soup.title) # Title of the website
print(soup.find('title')) # same as .title

In [None]:
# If we want to extract specific text
print(soup.find('p')) # will only return first <p> tag

In [None]:
print(soup.find('p').text) # extracts the string within the <p> tag

In [None]:
# If we want to extract all <p> tags
print(soup.find_all('p')) # returns list of all <p> tags

In [None]:
print(soup.find(class_='header')) # we can also search for classes within all tags, using class_
print(soup.find(id='second'))
# note: the trailing _ in class_ is needed because class is a reserved keyword in Python

In [None]:
print(soup.find_all(class_='regular'))

In [None]:
for p in soup.find_all('p'): # print all p tags in the list
    print(p.text)

In [None]:
# Extract links / urls
# Links in HTML are usually coded as <a href="url"> where the link is url

print(soup.a)
print(type(soup.a))


In [None]:
# if we only want the link
attendance_link = soup.find('a').get('href') # we want to get the string specified by the 'href' attribute inside the <a> tag
print("To record attendance for today's lecture go to: ",attendance_link) # then we have extracted the link
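
An `href` is not always a full URL; pages often use relative links. A small sketch of how such links could be resolved against the page they came from, using `urljoin` from the standard library. The base URL and the `hrefs` list are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page address and the href values found on it
base = 'https://example.org/course/index.html'
hrefs = ['/syllabus/', 'slides/week1.pdf', 'https://example.com/form', '#top']

# urljoin leaves absolute URLs untouched and resolves relative ones
absolute = [urljoin(base, href) for href in hrefs]
for link in absolute:
    print(link)
```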

<a id='sec2'></a>

# Scrape the current Syllabus Schedule from the Data-X website


In [None]:
source = requests.get('https://data-x.blog/').content # get the source content

In [None]:
soup = bs.BeautifulSoup(source,'lxml')

In [None]:
print(soup.prettify()) # .prettify() method makes the HTML code more readable

# as you can see this code is more difficult to read than the simple example above

In [None]:
print(soup.find('title').text) # we are at the correct website

In [None]:
for p in soup.find_all('p'):
    print(p.text)

In [None]:
navigation_bar = soup.find('nav')
print(navigation_bar)

In [None]:
# Now we want to find the Syllabus, however we are at the root web page, not displaying the syllabus
# Get links from the data-x website
for url in navigation_bar.find_all('a'): # look for links in the navigation bar. Tag <nav>
    link = url.get('href')
    if link and 'data-x.blog' in link: # skip <a> tags that have no href
        print(link) # we see that the syllabus is located at the url https://data-x.blog/syllabus-data-x/
        if 'syllabus' in link:
            syllabus_url = link

In [None]:
print(syllabus_url)

In [None]:
# Open new connection to the syllabus url. Replace soup object.
source = requests.get(syllabus_url).content
soup = bs.BeautifulSoup(source, 'lxml') # 'lxml' parser better for tables, very similar to 'html.parser'

print(soup.body.prettify()) # we can see that the table is stored within <td> tags

### Finding the course schedule table
Tabular data on a website is usually stored in a `<table>` element, with each cell in a `<td>` tag. Here we want to extract the information in the Data-X syllabus.
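
To show the row and cell structure on its own, the following sketch parses a small hand-written table with Python's built-in `html.parser` rather than BeautifulSoup. The table contents and the class name `TableCollector` are assumptions made for this example. Header cells use `<th>`, so a loop that only looks for `<td>` would return an empty list for that row.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collects table cells row by row: <tr> starts a row, <td>/<th> a cell."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])      # new row
        elif tag in ('td', 'th') and self.rows:
            self.rows[-1].append('')  # new, still empty cell
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data  # text belongs to the current cell

html = """
<table>
  <tr><th>Week</th><th>Topic</th></tr>
  <tr><td>1</td><td>Introduction</td></tr>
  <tr><td>2</td><td>Web scraping</td></tr>
</table>
"""
parser = TableCollector()
parser.feed(html)
header, *body = parser.rows
print(header)
print(body)
```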

In [None]:
# We can also get the table
table = soup.find('table')
print(table.prettify()) #HTML code of the table

In [None]:
# A new row in an HTML table starts with <tr> tag
# A new column entry is defined by <td> tag

In [None]:
table_result = list()
for row in table.find_all('tr'):
    row_cells = row.find_all('td') # find all table data
    row_entries = [cell.text for cell in row_cells]
    print(row_entries) 
    table_result.append(row_entries) # get all the table data into a list

In [None]:
# We can also read it in to a Pandas DataFrame
import pandas as pd    
df = pd.DataFrame(table_result)
df.head()

In [None]:
# Pandas can also grab tables from a website automatically

import pandas as pd

# requires html5lib: 
#!conda install --yes html5lib
dfs = pd.read_html('https://data-x.blog/syllabus-data-x/',header=0) # returns a list of all tables at url
# header = 0, indicates that first row is header



In [None]:
print(type(dfs)) #list of tables
print(len(dfs)) # we only have one table
print(type(dfs[0])) # stored as DataFrame
df = dfs[0]

In [None]:
# Looks great
df.head(4)

<a id='sec3'></a>
# Scrape images and other files

In [None]:
# As we can see there are two images on the data-x syllabus site that we might want to download
# Images are displayed with the <img> tag in HTML

print(soup.find('img')) # as we can see below the image urls are stored as the src inside the img tag

In [None]:
# Parse all url to the images
img_urls = list()
for img in soup.find_all('img'): 
    img_url = img.get('src') 
    print(img_url)
    if img_url and '.jpg' in img_url: # we only want images with .jpg extension (and skip <img> tags without src)
        img_urls.append(img_url)
    

In [None]:
print(img_urls)

In [None]:
# To download and save files with Python we can use the shutil library
# which is a file operations library

import shutil

for idx, img_url in enumerate(img_urls): # enumerate to create an integer file name for every image
    
    img_source = requests.get(img_url, stream=True) 
    # we set stream = True to download/ stream the content of the data
    
    with open('img'+str(idx)+'.jpg', 'wb') as file: # open file connection, create file and write to it
        shutil.copyfileobj(img_source.raw, file) # save the raw file object

    del img_source # to remove the file from memory
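
What `shutil.copyfileobj` does can be tried offline. In this sketch an in-memory `io.BytesIO` object stands in for the `img_source.raw` stream, and the file is written to a temporary directory; the helper name `save_stream` and the fake payload are assumptions for illustration.

```python
import io
import os
import shutil
import tempfile

def save_stream(stream, path, chunk_size=8192):
    """Copy a binary file-like object to disk in fixed-size chunks."""
    with open(path, 'wb') as out_file:
        shutil.copyfileobj(stream, out_file, chunk_size)
    return os.path.getsize(path)

# Stand-in for response.raw: a JPEG-like marker followed by padding bytes
fake_download = io.BytesIO(b'\xff\xd8' + b'\x00' * 20000)
target = os.path.join(tempfile.mkdtemp(), 'img0.jpg')

bytes_written = save_stream(fake_download, target)
print(bytes_written)
```

Copying in chunks means the whole file never has to sit in memory at once, which is the reason for requesting the data with `stream=True`.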

## Scraping function to download files of any type from a website

In [None]:
# Extended scraping function of any file format
import os # To format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or a for file links)
    source_tag = the source tag for the file url (usually src for images or href for files)
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    max = integer (max number of files to scrape, if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')

    source = requests.get(url).content
    soup = bs.BeautifulSoup(source,'lxml')
    
    i=0
    for link in soup.find_all(html_tag):
        file_url=link.get(source_tag)
        
        
        if file_url and 'http' in file_url: # skip tags without the source attribute and check that it is a valid link

            if file_type in file_url: #only check for specific file type

                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and write to it
                    shutil.copyfileobj(file_source.raw, file) # save the raw file object
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('EXCLUDED:',file_url) # urls not downloaded from
                
        if i==max:
            print('Max reached')
            break
            

    print('Done!')
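
File URLs sometimes carry a query string or fragment, and parts of it can leak into a file name that is built from the raw URL. One possible way to get a clean name, sketched with the standard library only; `filename_from_url` is a hypothetical helper and the URLs are invented.

```python
import posixpath
from urllib.parse import urlparse, unquote

def filename_from_url(file_url):
    """Return the last path component of a URL, ignoring query string and fragment."""
    path = urlparse(file_url).path           # drops ?query and #fragment
    return unquote(posixpath.basename(path)) # decode %20 and similar escapes

print(filename_from_url('https://example.com/img/logo.jpg?v=1.2'))
print(filename_from_url('https://example.com/files/lecture%20notes.pdf#page=2'))
```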

In [None]:
py_file_scraper('https://data-x.blog/syllabus-data-x/') # scrape images from data-x syllabus

In [None]:
# scrape PDFs from data-x site
py_file_scraper('https://data-x.blog/',html_tag='a',source_tag='href',file_type='.pdf',max=3)

In [None]:
# scrape csv files from website
py_file_scraper('http://www-eio.upc.edu/~pau/cms/rdata/datasets.html',html_tag='a', # R data sets
                source_tag='href', file_type='.csv',max=5)

---
<a id='sec4'></a>
# Breakout problem


In this week's breakout you should extract live weather data in Berkeley from:

[http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971](http://forecast.weather.gov/MapClick.php?lat=37.87158815800046&lon=-122.27274583799971)

* Task: scrape
    * period / day (e.g. Tonight, Friday, FridayNight etc.)
    * the temperature for the period (as Low, High)
    * the long weather description (e.g. Partly cloudy, with a low around 49..)
    
Store the scraped data strings in a Pandas DataFrame



**Hint:** The weather information is found in a div tag with `id='seven-day-forecast'`



# Insert Breakout solution below

In [None]:
import requests
import bs4 as bs
import pandas as pd

<a id='sec5'></a>
# Appendix

<a id='sec6'></a>
# Scrape Bloomberg sitemap (XML) for current political news

In [None]:
# Sitemaps are XML documents that list a site's URLs between tags.
# XML is both human and machine readable.
# News websites usually publish sitemaps per section (e.g. politics),
# and bots that track the news poll those sitemaps for the newest links.

# Before scraping a website, look at its robots.txt file
print(requests.get('https://www.bloomberg.com/robots.txt').text) # plain text, so no HTML parser is needed
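
Reading robots.txt by eye works, but the rules can also be checked programmatically. A minimal sketch with `urllib.robotparser` from the standard library; the rules below are made up for illustration and are not Bloomberg's.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; real sites publish theirs at /robots.txt
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ('*') may fetch each URL
print(parser.can_fetch('*', 'https://example.com/news/today.html'))
print(parser.can_fetch('*', 'https://example.com/private/data.csv'))
```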

In [None]:
source = requests.get('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').content
soup = bs.BeautifulSoup(source,'xml') # Note parser 'xml'

In [None]:
print(soup.prettify())

In [None]:
# Find political news headlines
for news in soup.find_all('news'):
    print(news.title.text)
    print(news.publication_date.text)
    #print(news.keywords.text)
    print('\n')
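
The structure of a news sitemap can be explored offline as well. The sketch below parses a hand-written sample with `xml.etree.ElementTree` instead of BeautifulSoup; the URLs, titles and dates are invented, and the two namespace URIs are the ones commonly used by sitemap and news-sitemap files.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the news sitemap format; all entries are invented
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/politics/story-1</loc>
    <news:news>
      <news:title>First headline</news:title>
      <news:publication_date>2018-03-01T10:00:00Z</news:publication_date>
    </news:news>
  </url>
  <url>
    <loc>https://example.com/politics/story-2</loc>
    <news:news>
      <news:title>Second headline</news:title>
      <news:publication_date>2018-03-01T11:30:00Z</news:publication_date>
    </news:news>
  </url>
</urlset>"""

# Prefixes used in the search paths below, mapped to the namespace URIs
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
      'news': 'http://www.google.com/schemas/sitemap-news/0.9'}

root = ET.fromstring(sitemap_xml)
headlines = []
for url in root.findall('sm:url', ns):
    loc = url.findtext('sm:loc', namespaces=ns)
    title = url.findtext('news:news/news:title', namespaces=ns)
    date = url.findtext('news:news/news:publication_date', namespaces=ns)
    headlines.append((title, date, loc))

for entry in headlines:
    print(entry)
```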

<a id='sec7'></a>
# Web crawl

Web crawling is closely related to web scraping: instead of extracting data from a single page, a crawler follows the links on a website (and often its subpages) and collects meta information along the way. It can be seen as simple, recursive scraping, and it is the basis of web indexing (for example, when building a search engine).

## Web crawl Twitter account
**Authors:** Kunal Desai & Alexander Fred Ojala

In [None]:
import bs4
from bs4 import BeautifulSoup
import requests

In [None]:
# Helper function to maintain the urls and the number of times they appear

url_dict = dict()

def add_to_dict(url_d, key):
    if key in url_d:
        url_d[key] = url_d[key] + 1
    else:
        url_d[key] = 1
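
The same counting can be done with `collections.Counter` from the standard library, which treats missing keys as zero. A small sketch with invented links:

```python
from collections import Counter

# Invented links standing in for hrefs collected during a crawl
links = ['https://a.example', 'https://b.example', 'https://a.example']

url_counts = Counter()
for link in links:
    url_counts[link] += 1  # no need to check whether the key exists

print(url_counts.most_common(1))
```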

In [None]:
# Recursive function which extracts links from the given url up to a given 'depth'.

def get_urls(url, depth):
    if depth == 0:
        return
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        if link.has_attr('href') and "https://" in link['href']:
#             print(link['href'])
            add_to_dict(url_dict, link['href'])
            get_urls(link['href'], depth - 1)

In [None]:
# Iterative (breadth-first) function which extracts links from the given url up to a given 'depth'.

def get_urls_iterative(url, depth):
    current_level = [url] # pages to fetch at the current depth
    for _ in range(depth):
        next_level = []
        for page_url in current_level:
            r = requests.get(page_url)
            soup = BeautifulSoup(r.text, 'html.parser')
            for link in soup.find_all('a'):
                if link.has_attr('href') and "https://" in link['href']:
                    add_to_dict(url_dict, link['href'])
                    next_level.append(link['href'])
        current_level = next_level # go one level deeper

In [None]:
get_urls("https://twitter.com/GolfWorld", 2)
for key in url_dict:
    print(str(key) + "  ----   " + str(url_dict[key]))
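
Pages often link back to each other, so a crawler without bookkeeping fetches the same page many times. The sketch below shows one common remedy, a breadth-first crawl with a `visited` set. To keep it runnable offline, a hypothetical in-memory `LINK_GRAPH` and a stand-in `fetch_links` function replace the real `requests` and BeautifulSoup calls.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it
LINK_GRAPH = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/', 'https://example.com/c'],
    'https://example.com/b': ['https://example.com/c'],
    'https://example.com/c': ['https://example.com/d'],
}

def fetch_links(url):
    """Stand-in for requests.get + BeautifulSoup link extraction."""
    return LINK_GRAPH.get(url, [])

def crawl(start_url, max_depth):
    """Breadth-first crawl that visits each URL at most once."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

for url, depth in crawl('https://example.com/', 2):
    print(depth, url)
```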

<a id='sec8'></a>
# SEO: Visualize sitemap and categories in a website

**Source:** https://www.ayima.com/guides/how-to-visualize-an-xml-sitemap-using-python.html

In [None]:
# Visualize XML sitemap with categories!
import requests
from bs4 import BeautifulSoup

# url = 'https://www.sportchek.ca/sitemap.xml' # alternative sitemap to try
url = 'https://www.bloomberg.com/feeds/bpol/sitemap_index.xml'
page = requests.get(url)
print('Loaded page with: %s' % page)

sitemap_index = BeautifulSoup(page.content, 'html.parser')
print('Created %s object' % type(sitemap_index))

In [None]:
urls = [element.text for element in sitemap_index.find_all('loc')]
print(urls)

In [None]:
def extract_links(url):
    ''' Open an XML sitemap and find content wrapped in loc tags. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = [element.text for element in soup.find_all('loc')]

    return links

sitemap_urls = []
for url in urls:
    links = extract_links(url)
    sitemap_urls += links

print('Found {:,} URLs in the sitemap'.format(len(sitemap_urls)))

In [None]:
with open('sitemap_urls.dat', 'w') as f:
    for url in sitemap_urls:
        f.write(url + '\n')

In [None]:
'''
Categorize a list of URLs by site path.
The file containing the URLs should exist in the working directory and be
named sitemap_urls.dat. It should contain one URL per line.
Categorization depth can be specified by executing a call like this in the
terminal (where we set the granularity depth level to 5):
    python categorize_urls.py --depth 5
The same result can be achieved by setting the categorization_depth variable
manually at the head of this file and running the script with:
    python categorize_urls.py
'''
from __future__ import print_function

import pandas as pd # peel_layers below builds a DataFrame


categorization_depth=3



# Main script functions


def peel_layers(urls, layers=3):
    ''' Builds a dataframe containing all unique page identifiers up
    to a specified depth and counts the number of sub-pages for each.
    Prints results to a CSV file.
    urls : list
        List of page URLs.
    layers : int
        Depth of automated URL search. Large values for this parameter
        may cause long runtimes depending on the number of URLs.
    '''

    # Store results in a dataframe
    sitemap_layers = pd.DataFrame()

    # Get base levels
    bases = pd.Series([url.split('//')[-1].split('/')[0] for url in urls])
    sitemap_layers[0] = bases

    # Get specified number of layers
    for layer in range(1, layers+1):

        page_layer = []
        for url, base in zip(urls, bases):
            try:
                page_layer.append(url.split(base)[-1].split('/')[layer])
            except IndexError:
                # There is nothing that deep!
                page_layer.append('')

        sitemap_layers[layer] = page_layer

    # Count and drop duplicate rows + sort
    sitemap_layers = sitemap_layers.groupby(list(range(0, layers+1)))[0].count()\
                     .rename('counts').reset_index()\
                     .sort_values('counts', ascending=False)\
                     .sort_values(list(range(0, layers)), ascending=True)\
                     .reset_index(drop=True)

    # Convert column names to string types and export
    sitemap_layers.columns = [str(col) for col in sitemap_layers.columns]
    sitemap_layers.to_csv('sitemap_layers.csv', index=False)

    # Return the dataframe
    return sitemap_layers




sitemap_urls = open('sitemap_urls.dat', 'r').read().splitlines()
print('Loaded {:,} URLs'.format(len(sitemap_urls)))

print('Categorizing up to a depth of %d' % categorization_depth)
sitemap_layers = peel_layers(urls=sitemap_urls,
                             layers=categorization_depth)
print('Printed {:,} rows of data to sitemap_layers.csv'.format(len(sitemap_layers)))
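
The idea of peeling a URL into layers can be seen on a single URL. This is a simplified sketch using `urlparse`, not the `peel_layers` implementation above; the helper name `url_layers` and the URLs are made up for illustration.

```python
from urllib.parse import urlparse

def url_layers(url, depth=3):
    """Split a URL into its host plus the first `depth` path segments, padding with ''."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split('/') if s]  # drop empty pieces
    segments = (segments + [''] * depth)[:depth]         # pad or truncate to `depth`
    return [parsed.netloc] + segments

print(url_layers('https://www.example.com/news/politics/2018/story.html'))
print(url_layers('https://www.example.com/about'))
```

Each position in the returned list corresponds to one column of the layer table: column 0 is the host, and the following columns are successively deeper path segments.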


In [None]:
'''
Visualize a list of URLs by site path.
This script reads in the sitemap_layers.csv file created by the
categorize_urls.py script and builds a graph visualization using Graphviz.
Graph depth can be specified by executing a call like this in the
terminal:
    python visualize_urls.py --depth 4 --limit 10 --title "My Sitemap" --style "dark" --size "40"
The same result can be achieved by setting the variables manually at the head
of this file and running the script with:
    python visualize_urls.py
'''
from __future__ import print_function


# Set global variables

graph_depth = 3  # Number of layers deep to plot categorization
limit = 3       # Maximum number of nodes for a branch
title = ''       # Graph title
style = 'light'  # Graph style, can be "light" or "dark"
size = '8,5'     # Size of rendered PDF graph


# Import external library dependencies

import pandas as pd
import graphviz



# Main script functions

def make_sitemap_graph(df, layers=3, limit=50, size='8,5'):
    ''' Make a sitemap graph up to a specified layer depth.
    df : DataFrame
        The dataframe created by the peel_layers function
        containing sitemap information.
    layers : int
        Maximum depth to plot.
    limit : int
        The maximum number of node edge connections. Good to set this
        low for visualizing deep into site maps.
    '''


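    # Check to make sure we are not trying to plot too many layers.
    # Columns are '0'..'N' plus 'counts', so at most N = len(df.columns) - 2 layers exist.
    max_layers = len(df.columns) - 2
    if layers > max_layers:
        layers = max_layers
        print('There are only %d layers available to plot, setting layers=%d'
              % (layers, layers))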


    # Initialize graph
    f = graphviz.Digraph('sitemap', filename='sitemap_graph_%d_layer' % layers)
    f.body.extend(['rankdir=LR', 'size="%s"' % size])


    def add_branch(f, names, vals, limit, connect_to=''):
        ''' Adds a set of nodes and edges to nodes on the previous layer. '''

        # Get the currently existing node names
        node_names = [item.split('"')[1] for item in f.body if 'label' in item]

        # Only add a new branch if it will connect to a previously created node
        if connect_to:
            if connect_to in node_names:
                for name, val in list(zip(names, vals))[:limit]:
                    f.node(name='%s-%s' % (connect_to, name), label=name)
                    f.edge(connect_to, '%s-%s' % (connect_to, name), label='{:,}'.format(val))


    f.attr('node', shape='rectangle') # Plot nodes as rectangles

    # Add the first layer of nodes
    for name, counts in df.groupby(['0'])['counts'].sum().reset_index()\
                          .sort_values(['counts'], ascending=False).values:
        f.node(name=name, label='{} ({:,})'.format(name, counts))

    if layers == 0:
        return f

    f.attr('node', shape='oval') # Plot nodes as ovals
    f.graph_attr.update()

    # Loop over each layer adding nodes and edges to prior nodes
    for i in range(1, layers+1):
        cols = [str(i_) for i_ in range(i)]
        nodes = df[cols].drop_duplicates().values
        for j, k in enumerate(nodes):

            # Compute the mask to select correct data
            mask = True
            for j_, ki in enumerate(k):
                mask &= df[str(j_)] == ki

            # Select the data then count branch size, sort, and truncate
            data = df[mask].groupby([str(i)])['counts'].sum()\
                    .reset_index().sort_values(['counts'], ascending=False)

            # Add to the graph
            add_branch(f,
                       names=data[str(i)].values,
                       vals=data['counts'].values,
                       limit=limit,
                       connect_to='-'.join(['%s']*i) % tuple(k))

            print(('Built graph up to node %d / %d in layer %d' % (j, len(nodes), i))\
                    .ljust(50), end='\r')

    return f


def apply_style(f, style, title=''):
    ''' Apply the style and add a title if desired. More styling options are
    documented here: http://www.graphviz.org/doc/info/attrs.html#d:style
    f : graphviz.dot.Digraph
        The graph object as created by graphviz.
    style : str
        Available styles: 'light', 'dark'
    title : str
        Optional title placed at the bottom of the graph.
    '''

    dark_style = {
        'graph': {
            'label': title,
            'bgcolor': '#3a3a3a',
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'white',
        },
        'nodes': {
            'style': 'filled',
            'color': 'white',
            'fillcolor': 'black',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'white',
        },
        'edges': {
            'color': 'white',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'white',
        }
    }

    light_style = {
        'graph': {
            'label': title,
            'fontname': 'Helvetica',
            'fontsize': '18',
            'fontcolor': 'black',
        },
        'nodes': {
            'style': 'filled',
            'color': 'black',
            'fillcolor': '#dbdddd',
            'fontname': 'Helvetica',
            'fontsize': '14',
            'fontcolor': 'black',
        },
        'edges': {
            'color': 'black',
            'arrowhead': 'open',
            'fontname': 'Helvetica',
            'fontsize': '12',
            'fontcolor': 'black',
        }
    }

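    # Pick the style dictionary, failing clearly on an unknown style name
    styles = {'light': light_style, 'dark': dark_style}
    if style not in styles:
        raise ValueError("style must be 'light' or 'dark', got %r" % style)
    selected_style = styles[style]

    f.graph_attr = selected_style['graph']
    f.node_attr = selected_style['nodes']
    f.edge_attr = selected_style['edges']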

    return f




# Read in categorized data
sitemap_layers = pd.read_csv('sitemap_layers.csv', dtype=str)
# Convert numerical column to integer
sitemap_layers.counts = sitemap_layers.counts.apply(int)
print('Loaded {:,} rows of categorized data from sitemap_layers.csv'\
        .format(len(sitemap_layers)))

print('Building %d layer deep sitemap graph' % graph_depth)
f = make_sitemap_graph(sitemap_layers, layers=graph_depth,
                       limit=limit, size=size)
f = apply_style(f, style=style, title=title)

f.render(cleanup=True)
print('Exported graph to sitemap_graph_%d_layer.pdf' % graph_depth)


