# Web Scraping

### Scraping Guidelines

Keep in mind that you should always have permission from the website you are scraping! Check a website's terms and conditions for more info. Also keep in mind that a computer can send requests to a website very quickly, so a website may block your computer's IP address if you send too many requests in a short time. Lastly, websites change all the time! You will most likely need to update your code often for long-term web scraping jobs.

**Always ask for permission!**
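
One way to act on these guidelines is to read a site's robots.txt rules and pause between requests. The sketch below is only an illustration under stated assumptions: the robots.txt text, the `my-scraper` user agent, and the helper names `allowed_urls` and `fetch_politely` are made up for this example, and checking robots.txt does not replace reading the terms and conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network call.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def allowed_urls(parser, urls, user_agent="my-scraper"):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]

def fetch_politely(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # slow down so we do not flood the server
        results.append(fetch(url))
    return results

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://www.example.com/",
    "https://www.example.com/private/data.html",
]
to_fetch = allowed_urls(parser, urls)

# Use the site's requested delay if it states one, otherwise fall back to 1 second.
delay = parser.crawl_delay("my-scraper") or 1

print(to_fetch)
print(delay)
```

In a real job you could pass `requests.get` as the `fetch` argument, for example `fetch_politely(to_fetch, requests.get, delay)`.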

### Example 1 - Grabbing the title of a page

Let's start very simple: we will grab the title of a page.
Remember that this is the HTML block with the **title** tag. For this task we will use **www.example.com**, which is a website specifically made to serve as an example domain. Let's go through the main steps:

In [1]:
#import requests library
import requests


In [11]:
# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter 
# Note sometimes you need to run this twice if it fails the first time

res = requests.get('https://www.example.com')

The result, **res**, is a requests.models.Response object that holds the information returned by the website, for example:

In [12]:
# check the type
type(res)


requests.models.Response

In [13]:
# see the text

res.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

____
As a result, you get a string. Now we use BeautifulSoup to analyze the extracted page. Technically we could write our own custom script to look for items in the string **res.text**, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from an HTML string.
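
To give a feel for what a "custom script" would involve, here is a minimal sketch that pulls the title out of an HTML string using only the standard library's `html.parser`. The `TitleParser` class and the shortened `sample_html` string are assumptions made for this illustration (the string stands in for **res.text**). Even for one tag it takes a fair amount of bookkeeping, which is the motivation for using BeautifulSoup.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text while we are inside the title tag.
        if self.in_title:
            self.title_parts.append(data)

# Shortened stand-in for res.text (an assumption, not the full page).
sample_html = (
    "<!doctype html>\n<html>\n<head>\n"
    "    <title>Example Domain</title>\n"
    "</head>\n<body><h1>Example Domain</h1></body>\n</html>"
)

parser = TitleParser()
parser.feed(sample_html)
title = "".join(parser.title_parts).strip()
print(title)
```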

In [24]:
#import beautiful soup lib (bs4)

from bs4 import BeautifulSoup


In [27]:
#use BeautifulSoup to analyze

soup = BeautifulSoup(res.text, 'html.parser')

In [28]:
#check your soup results
print(soup.prettify())


<!DOCTYPE html>
<html>
 <head>
  <title>
   Example Domain
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style type="text/css">
   body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
  </style>
 </head>
 <body>
  <div>
   <h1>
    Example Domain
   </h1>
   <p>
    This dom

Now let's use the **.select()** method to grab elements. We are looking for the 'title' tag, so we will pass in 'title'.


In [41]:
# check the type of the first title element
type(soup.select('title')[0])


bs4.element.Tag

Notice that **.select()** returns a **list** containing all the title elements (along with their tags), which is why we index with [0]. You can use indexing or even looping to grab the elements from the list. Since each element is still a specialized Tag object, we can use method calls to grab just the text.

In [42]:
# grab just the text
soup.select('title')[0].getText()


'Example Domain'
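
As a small illustration of indexing and looping over the results of **.select()**, the sketch below parses a short, made-up HTML string (an assumption, standing in for the downloaded page). It also uses `select_one`, which returns the first match directly, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used in place of res.text so the example needs no network access.
html = (
    "<html><head><title>Example Domain</title></head>"
    "<body><h1>Example Domain</h1><p>First</p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list-like collection of Tag objects.
titles = soup.select("title")
first_title_text = titles[0].get_text()

# Looping works the same way for any number of matches.
paragraphs = [p.get_text() for p in soup.select("p")]

# select_one() skips the list and returns a single Tag, or None if there is no match.
title_tag = soup.select_one("title")
missing = soup.select_one("table")

print(first_title_text)
print(paragraphs)
print(missing)
```

Checking for `None` (or for an empty list) before indexing avoids an error on pages that lack the element you expect.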



### Example 2 - Getting an Image from a Website

Let's attempt to grab an image of a lantern from the Wikipedia article:
https://en.wikipedia.org/wiki/Lantern_Festival

Wikipedia's content is openly licensed, so it is a reasonable site to practice on.

In [None]:
# import the libraries that you will need
import requests
from bs4 import BeautifulSoup


In [53]:
# send a request to get https://en.wikipedia.org/wiki/Lantern_Festival
# store the response in lan
lan = requests.get('https://en.wikipedia.org/wiki/Lantern_Festival')


In [59]:
# use BeautifulSoup to parse the HTML text of the page
soup = BeautifulSoup(lan.text, 'html.parser')


In [61]:
# select all img tags and count them
all_images = soup.select('img')
len(all_images)

27

Looks like we will need to be more specific in order to only grab the two main images. 
Let's inspect the website again.

In [65]:
# select all elements with the class "thumbimage", store in t
# (the leading dot in the selector means "class")

t = soup.select('.thumbimage')

In [66]:
# check the length of t
# how many images?
len(t)


4
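
The difference between selecting by tag and selecting by class can be seen on a small example. The HTML below is made up for this sketch (the class names and paths are assumptions, not Wikipedia's actual markup), and `img.thumbimage` is a standard CSS selector that combines a tag name with a class.

```python
from bs4 import BeautifulSoup

# Made-up HTML with three images and one link that shares the same class.
html = """
<div>
  <img class="logo" src="/static/logo.png">
  <img class="thumbimage" src="//upload.example.org/a.jpg">
  <img class="thumbimage" src="//upload.example.org/b.jpg">
  <a class="thumbimage" href="/other">not an image</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_images = soup.select("img")               # every <img> tag
by_class = soup.select(".thumbimage")         # any tag with class "thumbimage"
thumb_images = soup.select("img.thumbimage")  # only <img> tags with that class

print(len(all_images), len(by_class), len(thumb_images))
```

Combining the tag and the class is a little stricter than the class alone, which helps when other elements reuse the same class name.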

In [67]:
# get the first lantern image
lantern = t[0]

In [68]:
#check the type
type(lantern)


bs4.element.Tag

You can make dictionary-like lookups on a **Tag** to read its attributes. In this case we are interested in **src**, the "source" of the image, which should be its own .jpg or .png link:

In [69]:
#get the source only

lantern['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg/220px-Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg'

Now that you have the actual src link, you can grab the image with requests.get and read its bytes from the .content attribute. Note that the link needs https: in front of it (the src value starts with //); if you leave it off, requests will complain (but it gives you a pretty descriptive error message).
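
Rather than typing the prefix by hand, the missing scheme can be filled in with `urljoin` from the standard library, which resolves a link against the URL of the page it came from. This is a minimal sketch: the three sample links are hypothetical and only show the common shapes a src value can take.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
page_url = "https://en.wikipedia.org/wiki/Lantern_Festival"

# Hypothetical src values in the three shapes you are likely to meet.
protocol_relative = "//upload.wikimedia.org/example/220px-lantern.jpg"
root_relative = "/static/images/logo.png"
already_absolute = "https://example.org/picture.png"

# urljoin borrows whatever is missing (scheme, host) from page_url.
print(urljoin(page_url, protocol_relative))
print(urljoin(page_url, root_relative))
print(urljoin(page_url, already_absolute))
```

The resulting absolute URL can then be passed to `requests.get` without triggering a missing-scheme error.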

In [75]:
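# request the image directly with a full link (including https://), save to link
# note: this URL points at the full-size original, not the 220px thumbnail in src

link = requests.get("https://upload.wikimedia.org/wikipedia/commons/3/36/Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg")

# passing the raw src value fails because it has no scheme;
# uncommenting the next line raises the MissingSchema error shown below
# link = requests.get(lantern['src'])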

MissingSchema: Invalid URL '//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg/220px-Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg': No schema supplied. Perhaps you meant http:////upload.wikimedia.org/wikipedia/commons/thumb/3/36/Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg/220px-Statues_of_mother_and_daughter_celebrating_the_Lantern_Festival._Xi%27an.jpg?

In [71]:
# check the raw content (it's binary data, meaning we will need to use binary read/write methods for saving it)

link.content


b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00\xff\xe19\x94Exif\x00\x00MM\x00*\x00\x00\x00\x08\x00\x0e\x01\x0f\x00\x02\x00\x00\x00\n\x00\x00\x08\xc2\x01\x10\x00\x02\x00\x00\x00\t\x00\x00\x08\xcc\x01\x12\x00\x03\x00\x00\x00\x01\x00\x01\x00\x00\x011\x00\x02\x00\x00\x00&\x00\x00\x08\xd6\x012\x00\x02\x00\x00\x00\x14\x00\x00\x08\xfc\x02\x13\x00\x03\x00\x00\x00\x01\x00\x02\x00\x00\x87i\x00\x04\x00\x00\x00\x01\x00\x00\t\x10\x88%\x00\x04\x00\x00\x00\x01\x00\x007\xc4\x880\x00\x03\x00\x00\x00\x01\x00\x01\x00\x00\xc4\xa5\x00\x07\x00\x00\x00\xd0\x00\x007\xfc\xc6\xd2\x00\x07\x00\x00\x00@\x00\x008\xcc\xc6\xd3\x00\x07\x00\x00\x00\x80\x00\x009\x0c\xea\x1c\x00\x07\x00\x00\x08\x0c\x00\x00\x00\xb6\xea\x1d\x00\t\x00\x00\x00\x01\xff\xff\xfe\xa4\x00\x00\x00\x00\x1c\xea\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\

Let's write this to a file named "myNewImage.jpg". The 'wb' mode means "write binary", which is what we need for image data.

In [72]:
# open a file called "myNewImage.jpg" with the 'wb' option

f = open('myNewImage.jpg', 'wb')


In [73]:
# write the content
f.write(link.content)


332193

In [74]:
# close the file (note the parentheses), then check your folder.
f.close()
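
The open, write, and close steps can also be written with a `with` block, which closes the file automatically when the block ends. In this minimal sketch the few `image_bytes` are a stand-in for `link.content`, and the file goes into a temporary folder; both are assumptions made so the example runs on its own.

```python
import os
import tempfile

# Stand-in for link.content: the first few bytes of a JPEG header.
image_bytes = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Write into a fresh temporary folder so the example does not touch your project files.
out_path = os.path.join(tempfile.mkdtemp(), "myNewImage.jpg")

with open(out_path, "wb") as f:
    written = f.write(image_bytes)  # write() returns the number of bytes written

# The file is already closed here, with no explicit close() call needed.
print(written, f.closed)
```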