# Web Scraping

### Scraping Guidelines

Always make sure you have permission to scrape a website! Check the website's terms and conditions for more info. Also keep in mind that a computer can send requests very quickly, so a website may block your computer's IP address if you send too many requests in a short time. Lastly, websites change all the time! You will most likely need to update your code often for long-term web scraping jobs.

**Always ask for permission!**
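
One way to act on these guidelines in code is sketched below, under stated assumptions: the robots.txt rules are made up for illustration, and `allowed`, `visit_politely`, and the one-second delay are hypothetical choices. A robots.txt check complements reading the site's terms; it does not replace it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="*"):
    """Return True if the parsed robots.txt rules permit fetching url."""
    return parser.can_fetch(user_agent, url)

def visit_politely(urls, delay_seconds=1.0, fetch=print):
    """Call fetch(url) for each allowed URL, pausing between calls."""
    visited = []
    for url in urls:
        if not allowed(url):
            continue  # skip pages the site asks crawlers to avoid
        fetch(url)
        visited.append(url)
        time.sleep(delay_seconds)  # spread requests out over time
    return visited
```

In a real job, `fetch` would be a function that calls requests.get, and the delay would be tuned to what the site tolerates.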

### Example 1 - Grabbing the title of a page

Let's start very simple: we will grab the title of a page. 
Remember that this is the HTML block with the **title** tag. For this task we will use **www.example.com**, which is a website specifically made to serve as an example domain. Let's go through the main steps:

In [1]:
#import requests library
import requests


In [2]:
# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter 
# Note sometimes you need to run this twice if it fails the first time

# The URL must include the scheme (http:// or https://),
# otherwise requests raises a MissingSchema error
res = requests.get("http://www.example.com")

The **res** variable is a requests.models.Response object that contains the information returned by the website, for example:

In [3]:
# check the type
type(res)

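
A minimal sketch of fetching a page and getting at its contents. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; `raise_for_status()` and `.text` are standard parts of the Response object.

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch url and return the page body as a string."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raise an error for 4xx/5xx responses
    return res.text

# Usage (needs network access):
# html = fetch_html("http://www.example.com")

# A URL without a scheme is rejected before any request is sent
try:
    fetch_html("www.example.com")
except requests.exceptions.MissingSchema as err:
    print("Bad URL:", err)
```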

In [None]:
# see the text



____
As a result, you get a string. Now we use BeautifulSoup to analyze the extracted page. Technically we could use our own custom script to look for items in the string of **res.text**, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string. 
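
To illustrate why a parser helps, here is a small sketch that works on a hard-coded HTML string standing in for **res.text** (the string is made up, not the real page). It assumes Python's built-in "html.parser" backend; other backends could be used instead.

```python
from bs4 import BeautifulSoup

# Stand-in for res.text; hypothetical markup, not the real page
html = "<html><head><title>Example Domain</title></head><body><h1>Example Domain</h1></body></html>"

# Manual approach: slice the string between the opening and closing tags
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
manual_title = html[start:end]

# Parser approach: let BeautifulSoup build a searchable tree
soup = BeautifulSoup(html, "html.parser")
parsed_title = soup.title.string

print(manual_title)
print(parsed_title)
```

The manual slice only works when the markup has exactly the expected form, while the parsed tree can be searched by tag, class, or attribute.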

In [None]:
#import beautiful soup lib (bs4)



In [None]:
#use BeautifulSoup to analyze



In [None]:
#check your soup results



Now let's use the **.select()** method to grab elements. We are looking for the 'title' tag, so we will pass in 'title'.


In [None]:
# check the title



Notice what is returned here: it's actually a **list** containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each element is still a specialized Tag object, we can use method calls to grab just the text.
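
A short sketch of that behavior, run on a made-up HTML string rather than a live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page
html = "<html><head><title>Example Domain</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("title")   # a list of matching Tag objects
first = titles[0]               # index into the list to get one Tag
text = first.get_text()         # drop the tags, keep only the text

print(titles)
print(type(first))
print(text)
```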

In [None]:
#grab just the text



In [None]:
#check first element



In [None]:
# check the type



In [None]:
# get the title



### Example 2 - Getting an Image from a Website

Let's attempt to grab an image of a lantern from the Wikipedia article:
https://en.wikipedia.org/wiki/Lantern_Festival

Wikipedia content is, for the most part, openly licensed.

In [None]:
#import the library that you will need 



In [None]:
# send request to get https://en.wikipedia.org/wiki/Lantern_Festival
# store to object res



In [None]:
# use BeautifulSoup to parse the response text




In [None]:
# select all img tags



Looks like we will need to be more specific in order to only grab the two main images. 
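
The difference between selecting by tag and selecting by class can be sketched on a small made-up page. The markup and URLs below are hypothetical, and the class name thumbimage is an assumption about how the article marks its main images; inspect the live page to confirm the current class name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real article contains many more images
html = """
<body>
  <img class="logo" src="//static.example.org/logo.png">
  <img class="thumbimage" src="//upload.example.org/lantern1.jpg">
  <img class="thumbimage" src="//upload.example.org/lantern2.jpg">
  <img class="icon" src="//static.example.org/icon.png">
</body>
"""
soup = BeautifulSoup(html, "html.parser")

all_images = soup.select("img")           # every <img> tag on the page
main_images = soup.select(".thumbimage")  # only tags with class="thumbimage"

print(len(all_images), len(main_images))
```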
Let's inspect the website again.

In [None]:
# select all elements with the .thumbimage class, store in image_info
# look at image_info



In [None]:
#check the length of image_info
# how many images?



In [None]:
# get the first lantern image


In [None]:
#check the type



You can make dictionary-like calls for parts of the **Tag**. In this case, we are interested in the **src**, or "source", of the image, which should be its own .jpg or .png link:
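
A brief sketch of attribute access on a Tag, using a single made-up img element (the src value is a placeholder, not the real link):

```python
from bs4 import BeautifulSoup

# Hypothetical img tag with a placeholder src
html = '<img class="thumbimage" alt="A lantern" src="//upload.example.org/lantern.jpg">'
lantern = BeautifulSoup(html, "html.parser").select("img")[0]

src = lantern["src"]            # raises KeyError if the attribute is missing
title = lantern.get("title")    # returns None if the attribute is missing

print(src)
print(lantern.attrs)            # all attributes as a dictionary
```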

In [None]:
#get the source only



**Be more specific**
Now that you have the actual src link, you can grab the image with requests.get along with the .content attribute. Note how we had to add https:// before the link; if you don't do this, requests will complain (but it gives you a pretty descriptive error message).
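
One way to build the full link without editing it by hand is urljoin from the standard library, which fills in the missing scheme from the page URL. This is only a sketch: the src value is a made-up placeholder, and `to_absolute` and `fetch_image_bytes` are hypothetical helper names.

```python
from urllib.parse import urljoin

import requests

PAGE_URL = "https://en.wikipedia.org/wiki/Lantern_Festival"

def to_absolute(src, page_url=PAGE_URL):
    """Resolve a src value (relative or scheme-less) against the page URL."""
    return urljoin(page_url, src)

def fetch_image_bytes(url, timeout=10):
    """Download url and return the raw bytes of the response body."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()
    return res.content

# Placeholder src in scheme-less form (it starts with //)
print(to_absolute("//upload.example.org/lantern.jpg"))
```

A link that already includes its scheme passes through `to_absolute` unchanged, so the helper can be applied to every src value.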

In [None]:
#request to get the image with the link directly, save to image_link



In [None]:
# check the raw content (it's binary data, meaning we will need to use binary read/write methods for saving it)



Let's write this to a file named "myNewImage.jpg". The 'wb' mode means "write binary".
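
A minimal sketch of the write step using a with block, which closes the file automatically. The bytes here are placeholder data standing in for **image_link.content**, so the resulting file is not a real image; the temporary directory is an assumption made to keep the example self-contained.

```python
import os
import tempfile

# Placeholder bytes standing in for image_link.content
image_bytes = b"\xff\xd8\xff\xe0 not a real image"

path = os.path.join(tempfile.gettempdir(), "myNewImage.jpg")

# 'wb' opens the file for writing in binary mode;
# the with block closes the file when the block ends
with open(path, "wb") as f:
    f.write(image_bytes)

# Read it back in binary mode ('rb') to confirm the bytes round-trip
with open(path, "rb") as f:
    saved = f.read()

print(len(saved), "bytes written to", path)
```

Calling open, write, and close in separate steps does the same job; the with block simply guarantees the file is closed even if the write fails.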

In [None]:
# write to a file called "myNewImage.jpg" with 'wb' option




In [None]:
# write the content



In [None]:
# close the file, then check your folder.

