# Web Mining

Web pages are another source of data for analysis. Web scraping is the practice of extracting data from websites.

## Web Scraping

### Components of a Website

Websites are created using HTML (Hypertext Markup Language), along with CSS (Cascading Style Sheets) and JavaScript. Here's what these things do ([source](https://blog.hubspot.com/marketing/web-design-html-css-javascript)):
* HTML provides the basic structure of sites, which is enhanced and modified by other technologies like CSS and JavaScript.
* CSS is used to control presentation, formatting, and layout.
* JavaScript is used to control the behavior of different elements.

In web scraping, we are mostly concerned with HTML. HTML contains components known as elements or "HTML tags," which control how the content within them is displayed in the browser. Types of tags include image tags, paragraph tags, header tags, and div tags. These tags have attributes (such as "class" and "id"), which we can reference to extract the data inside. You can view the source code of a web page by right-clicking on it in your browser and clicking "Inspect" (Chrome).
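
To make the idea of tags and attributes concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module. The HTML fragment, its class and id values, and the `TagLister` class are all made up for illustration; they are not taken from any real page.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment (not copied from a real website)
SAMPLE_HTML = """
<div class="trail-info" id="about">
  <h1 class="title">Example Trail</h1>
  <p>Mostly paved.</p>
</div>
"""

class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

lister = TagLister()
lister.feed(SAMPLE_HTML)
for tag, attrs in lister.tags:
    print(tag, attrs)
```

Each printed line shows a tag name and a dictionary of its attributes, which is the information we rely on when we search a page for a specific class or id.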

### Try It

Go to the URL: https://www.bikemap.net/en/r/98460/ and inspect the source code on the page. Specifically inspect the description of the route under "About this route." What is the class of the div that contains the route information?

## HTTP Requests

In order to fetch the HTML document for rendering, web browsers issue HTTP requests. We can simulate this action using the `requests` module. For your reference, the link to the `requests` documentation is below. 

https://requests.kennethreitz.org/en/master/ 

In [1]:
import requests

In [2]:
bike_url = 'https://www.bikemap.net/en/r/98460/'
# Issue a simple HTTP request to get the webpage text
bike_page = requests.get(bike_url)
# Response code is returned
bike_page

<Response [200]>
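
The number in the response is the HTTP status code. As a small sketch of how to interpret it, the helpers below (`describe_status` and `is_success` are hypothetical names) use the standard library's `http.HTTPStatus` to look up the reason phrase for a code. With `requests`, the integer is available as `bike_page.status_code`, and `bike_page.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

for code in (200, 404, 500):
    print(code, describe_status(code), is_success(code))
```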

In [3]:
# Use .text to view the page
# View the first 1000 characters of the HTML document
bike_page.text[:1000]

'\n\n\n\n\n<!doctype html>\n<html lang="en">\n    <head>\n        <meta charset="utf-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=edge" /><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"a29ddca0f7",applicationID:"48546301"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(f(arguments)),n?null:this,t),n?void 0:this}}var o=e("handle"),a=e(4),f=e(5),c=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],l="api-",d=l+"ixn-";a(p,function(e,

`requests` can also handle authentication and cookies. See the documentation for more information.

We can also issue HTTP requests via the `urllib.request` module. This is the simplest way to download images. 

https://docs.python.org/3/library/urllib.request.html#module-urllib.request

In [4]:
import urllib.request

In [5]:
# We will create an images folder in your current 
# working directory using the os module
import os

In [6]:
# Create an images folder in the current working directory
my_wd = os.getcwd()
print("My working directory:")
print(my_wd)

My working directory:
/Users/akshayd/Desktop/Sem_2/Python/Web Mining


In [7]:
# Path to images folder will be the current working directory + images
img_dir = os.path.join(my_wd,'images')
# If images folder does not exist, create it
if not os.path.exists(img_dir):
    os.makedirs(img_dir)
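
As an alternative sketch, the check-then-create step can be collapsed into one call with `pathlib`. The helper name `ensure_dir` is an assumption for illustration, and the example runs inside a temporary directory so it does not touch your working directory.

```python
import tempfile
from pathlib import Path

def ensure_dir(path):
    """Create the directory (and any missing parents) if needed; return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory

# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    images = ensure_dir(Path(tmp) / "images")
    ensure_dir(images)  # calling it a second time is harmless
    print(images.is_dir())
```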

In [8]:
# URL of image we want to retrieve
bike_img_url = 'https://media.bikemap.net/routes/98460/staticmaps/98460_1000x260.jpg'

# Use string functions to get the name of the image
bike_img_name = bike_img_url.split("/")[-1]
print(bike_img_name)

98460_1000x260.jpg
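
Splitting on "/" works for this URL, but it would keep any query string that follows the file name. A slightly more defensive sketch parses the URL first; `image_name_from_url` is a hypothetical helper and the second URL is invented for the example.

```python
import posixpath
from urllib.parse import urlsplit

def image_name_from_url(url):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    path = urlsplit(url).path
    return posixpath.basename(path)

print(image_name_from_url(
    'https://media.bikemap.net/routes/98460/staticmaps/98460_1000x260.jpg'
))
# Hypothetical URL with a query string
print(image_name_from_url('https://example.com/maps/route.jpg?size=large'))
```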


In [9]:
# Use urllib to fetch the image
# Save the image to the images folder in the working directory
# Note that os.path.join is concatenating the path to the
# images directory AND the name of the image e.g. my_image.png
urllib.request.urlretrieve(url=bike_img_url, filename=os.path.join(img_dir,bike_img_name))

('/Users/akshayd/Desktop/Sem_2/Python/Web Mining/images/98460_1000x260.jpg',
 <http.client.HTTPMessage at 0x106efb048>)

### BeautifulSoup

The HTML output we obtained from our HTTP request was not easy to read. We can use the `BeautifulSoup` module to make the HTML more readable and to search for specific content on the page. We will barely scratch the surface of what you can do with BeautifulSoup. The documentation is below for reference.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [10]:
# Often people forget that the package name for BeautifulSoup is bs4
from bs4 import BeautifulSoup

In [11]:
# Recall that our output from requests was not easy to read
bike_page.text[:1000]

'\n\n\n\n\n<!doctype html>\n<html lang="en">\n    <head>\n        <meta charset="utf-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=edge" /><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"a29ddca0f7",applicationID:"48546301"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(f(arguments)),n?null:this,t),n?void 0:this}}var o=e("handle"),a=e(4),f=e(5),c=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],l="api-",d=l+"ixn-";a(p,function(e,

In [12]:
# First pass the text from the http request to the parser 
bike_page_soup = BeautifulSoup(bike_page.text, 'html.parser')
bike_page_soup


<!DOCTYPE doctype html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"a29ddca0f7",applicationID:"48546301"};window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var i=n[t]={exports:{}};e[t][0].call(i.exports,function(n){var i=e[t][1][n];return r(i||n)},i,i.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(e,n,t){function r(){}function i(e,n,t){return function(){return o(e,[u.now()].concat(f(arguments)),n?null:this,t),n?void 0:this}}var o=e("handle"),a=e(4),f=e(5),c=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],l="api-",d=l+"ixn-";a(p,function(e,n){s[n]=i(l+n,!0,"api")}),s

In [13]:
# soup.find can search for specific kinds of tags (name),
# classes, and ids
# In the case of multiple matches, the first result will be returned
bike_page_soup.find(name="h1", class_="title")

<h1 class="title" itemprop="name" title="Anacostia Trail ">
         Anacostia Trail 
    </h1>

In [14]:
# .find_all returns all matches as a list
bike_page_soup.find_all(name="div", class_="flag-body")

[<div class="flag-body">
 <span class="item-value" id="route-distance">27 km</span>
 <span class="item-label">Distance</span>
 </div>, <div class="flag-body">
 <span class="item-value" id="route-ascent">
                             126 m
                     </span>
 <span class="item-label">Ascent</span>
 </div>, <div class="flag-body">
 <span class="item-value" id="route-descent">
                             123 m
                     </span>
 <span class="item-label">Descent</span>
 </div>]

In [15]:
# Looking at the output above, it looks like we don't need the 
# flag-body container divs - let's return the item-value spans instead
# .find_all returns all matches in a list
bike_values_list = bike_page_soup.find_all(name="span", class_="item-value")
print(bike_values_list)
print("Number of results: " + str(len(bike_values_list)))

[<span class="item-value" id="route-distance">27 km</span>, <span class="item-value" id="route-ascent">
                            126 m
                    </span>, <span class="item-value" id="route-descent">
                            123 m
                    </span>]
Number of results: 3


In [16]:
# Let's extract the text from inside route-distance
# route-distance is the first result
route_distance_str = bike_values_list[0].text
print(route_distance_str)

27 km


In [17]:
# Equivalently, we could have used .find for the id "route-distance"
print(bike_page_soup.find(name="span", id="route-distance").text)

27 km


In [18]:
# Let's use regular expressions to extract the number!
import re

In [19]:
# Use re.findall to match only the numbers
# Recall: "\d" is a digit - "+" specifies 1 or more times
# A raw string (r'...') keeps the backslash from being treated as a string escape
re.findall(r'\d+', route_distance_str)

['27']

In [20]:
# Store the first result of the regular expression find as a float
# Use "\d+" (one or more digits) so the whole number is captured
route_distance_num = float(re.findall(r'\d+', route_distance_str)[0])
print(route_distance_num)

27.0
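
Some values on the page include surrounding whitespace, decimals, or other units (for example "126 m"). The sketch below is one possible way to capture both the number and the unit in a single pattern; `parse_distance` and `DISTANCE_PATTERN` are hypothetical names, and the pattern assumes the unit is written with letters only.

```python
import re

# One or more digits, an optional decimal part, optional spaces, then a unit made of letters
DISTANCE_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*([A-Za-z]+)')

def parse_distance(text):
    """Return (value, unit) from a string like '27 km', or None if nothing matches."""
    match = DISTANCE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)

print(parse_distance('27 km'))
print(parse_distance('\n   126 m\n'))
print(parse_distance('18.41 mi'))
print(parse_distance('No distance'))
```

Returning `None` when nothing matches lets the caller decide how to handle a page without a distance instead of failing with an `IndexError`.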


## Using Web Scraped Content in Pandas

In [21]:
from numpy import nan as NA
import numpy as np
import pandas as pd
# Show up to 150 characters per column when displaying dataframes
pd.set_option('display.max_colwidth', 150)

### Putting It All Together - Define a Function to Process a Page

Once you've inspected and extracted all the relevant information you want on the page, you can create a function to process the information from that URL. At a minimum, the function should take in the URL or a part of the URL as an argument. (In our example, we can use the route ID because that is a unique identifier of the URL.)

Store the information in a pandas dataframe.

In [22]:
# In our example, I will extract the title, description, distance, and image. Note I am using the route ID to construct the URL.
def bike_url_scrape(route_id):
    # Concatenate base URL with route ID to get specific page
    bike_url = 'https://www.bikemap.net/en/r/' + str(route_id) + '/'

    # Request the page HTML, pass to BeautifulSoup parser
    bike_page = requests.get(bike_url)
    bike_page_soup = BeautifulSoup(bike_page.text, 'html.parser')

    # Title
    bike_route_title_raw = bike_page_soup.find(name="h1", class_="title").text
    # Strip leading/trailing whitespace from title
    bike_route_title = bike_route_title_raw.strip()

    # Extract the image - some pages do not have images!
    # First we need to get the image url - it is inside the header carousel
    try:
        carousel_div = bike_page_soup.find(name="div", id="header-carousel")
        bike_img_url = carousel_div.find('img')['src']
        bike_img_name = str(route_id) + "_" + bike_img_url.split("/")[-1]
        urllib.request.urlretrieve(url=bike_img_url, filename=os.path.join(img_dir,bike_img_name))
    # Catch Exception rather than using a bare except, so Ctrl+C still stops the code
    except Exception:
        print("Page does not have a header image.")
    
    # Description
    bike_route_desc = bike_page_soup.find(name="div", class_="route-description").text

    # Distance string
    bike_route_dist_str = bike_page_soup.find(name="span", id="route-distance").text

    # Distance numeric - use regex ("\d+" matches one or more digits)
    bike_route_dist_num = float(re.findall(r'\d+', bike_route_dist_str)[0])
    
    # Return result as a one-row dataframe
    # A plain list of lists keeps the numeric column as a float;
    # a NumPy array would convert every value to a string
    temp_df = pd.DataFrame(
        [[route_id, bike_route_title, bike_route_desc, bike_route_dist_str, bike_route_dist_num]],
        columns=['route_id', 'bike_route_title', 'bike_route_desc', 'bike_route_dist_str', 'bike_route_dist_num']
    )
    
    return temp_df

In [23]:
# Try out some routes: 4994814, 4891998, 3503377, 1903803
bike_scrape_df1 = bike_url_scrape(4994814)
bike_scrape_df1.head()

| | route_id | bike_route_title | bike_route_desc | bike_route_dist_str | bike_route_dist_num |
|---|---|---|---|---|---|
| 0 | 4994814 | West Hyattsville Metro to Roaming Rooster | \n\nNo description yet.\n\n | 4 km | 4.0 |


### Loop Over Pages and DataFrames

In [24]:
# Import the time module so we can pause between requests;
# this makes it less likely that we are identified as scrapers and blocked
import time

In [25]:
# List of route ids we want to scrape
route_ids = [98460, 4994814, 4891998, 3503377, 1903803]

In [26]:
# Loop over our route IDs and scrape each page
# Store the resulting dataframes in a list
dfs = []
for route_id in route_ids:
    print("Processing Route: " + str(route_id))
    # Scrape route
    bike_scrape_res = bike_url_scrape(route_id)
    # Wait half a second
    time.sleep(0.5)
    # Append df to list
    dfs.append(bike_scrape_res)
    
print("Finished")

Processing Route: 98460
Processing Route: 4994814
Processing Route: 4891998
Processing Route: 3503377
Page does not have a header image.
Processing Route: 1903803
Page does not have a header image.
Finished
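
If one page is missing an element that the scraping function expects, the loop above stops with an error and the remaining routes are never processed. One possible pattern, sketched below, is to catch the error per route and keep a list of the IDs that failed. `scrape_all` and `fake_scrape` are hypothetical names; `fake_scrape` is a stand-in so the example runs without network access.

```python
import time

def scrape_all(route_ids, scrape_fn, delay=0.5):
    """Call scrape_fn on each route ID, pausing between calls.

    Returns (results, failed_ids) so one bad page does not stop the loop.
    """
    results, failed_ids = [], []
    for route_id in route_ids:
        try:
            results.append(scrape_fn(route_id))
        except Exception as err:
            print("Skipping route " + str(route_id) + ": " + repr(err))
            failed_ids.append(route_id)
        # Pause after every attempt, successful or not
        time.sleep(delay)
    return results, failed_ids

# Stand-in scraper (hypothetical) that fails for negative IDs
def fake_scrape(route_id):
    if route_id < 0:
        raise ValueError("invalid route id")
    return {"route_id": route_id}

results, failed = scrape_all([98460, -1, 4994814], fake_scrape, delay=0)
print(results)
print(failed)
```

With the function defined earlier in this section, the call would look like `dfs, failed = scrape_all(route_ids, bike_url_scrape)`.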


In [27]:
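# Use pd.concat to combine the dataframes
# Each one-row dataframe keeps its own index of 0;
# pass ignore_index=True to pd.concat if you want the rows renumbered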
bike_scrape_df = pd.concat(dfs)
bike_scrape_df.head()

| | route_id | bike_route_title | bike_route_desc | bike_route_dist_str | bike_route_dist_num |
|---|---|---|---|---|---|
| 0 | 98460 | Anacostia Trail | \n\nRoute for 2008 Anacostia Trails Fall Foliage Bike Ride on October 25. Mostly on paved trails. Detour through Macgruder Park because of WSSC wo... | 27 km | 27.0 |
| 0 | 4994814 | West Hyattsville Metro to Roaming Rooster | \n\nNo description yet.\n\n | 4 km | 4.0 |
| 0 | 4891998 | College Park Bike Run | \n\nFirst ride on my new bike.\n\n | 22 km | 22.0 |
| 0 | 3503377 | To Work | \n\nNo description yet.\n\n | 9 km | 9.0 |
| 0 | 1903803 | 11/11/2012 Maryland trek | \n\n\n\n\n\n\nDistance:\n18.41 mi\n\n\n\n\n | 29 km | 29.0 |
