# Scraping websites with Python

* By one often-cited estimate, the "Big Four" (Google, Amazon, Microsoft, and Facebook) alone store about 1,200 petabytes of information.
* Not all information is available through APIs, for example:
    * Daily closing values of stock market indices around the world (Bloomberg!)
    * Real estate listings (or any other type of listing)
    * Email addresses (or any other kind of address)
    * etc.

## Getting Started

We will use a Python library called Beautiful Soup to scrape information.

In [1]:
!pip install BeautifulSoup4




## Basics of HTML
[HTML DOCUMENT OBJECT MODEL](https://www.digitalocean.com/community/tutorials/introduction-to-the-dom)

http://htmlcodeeditor.com/

    <!DOCTYPE html>  
    <html>  
        <head>
        
        </head>
        <body>
            <h1> First Scraping </h1>
            <p> Hello World </p>
        </body>
    </html>
    
This is the basic structure of an HTML page. Each `<tag>` marks a block of the page:

1. `<!DOCTYPE html>`: HTML documents must start with a type declaration.
2. The HTML document is contained between `<html>` and `</html>`.
3. Metadata and script declarations go between `<head>` and `</head>`.
4. The visible part of the HTML document is between the `<body>` and `</body>` tags.
5. Headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for table cells.
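To see how these tags map onto something we can query, here is a minimal sketch that parses the sample document above with Beautiful Soup. The variable names (`html_doc`, `sample_soup`) are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# The sample page from above, stored as a string so no network access is needed.
html_doc = """<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>"""

sample_soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching tag; get_text(strip=True) drops the
# surrounding whitespace from its text.
heading = sample_soup.find("h1").get_text(strip=True)
paragraph = sample_soup.find("p").get_text(strip=True)

print(heading)
print(paragraph)
```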

HTML tags often carry `id` or `class` attributes.

* The `id` attribute gives a tag an identifier whose value must be unique within the HTML document.
https://www.w3schools.com/tags/att_id.asp

* The `class` attribute groups tags so that they can share the same styles; many tags can have the same class. We can use these ids and classes to locate the data we want.
https://www.w3schools.com/html/html_classes.asp

https://css-tricks.com/the-difference-between-id-and-class/
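As a small illustration of the difference, the sketch below looks up one tag by `id` and several tags by `class`. The HTML snippet, the id `main`, and the class `note` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one unique id, one class shared by two paragraphs.
html_doc = """
<div id="main">
  <p class="note">First note</p>
  <p class="note">Second note</p>
  <p>Plain paragraph</p>
</div>
"""

demo_soup = BeautifulSoup(html_doc, "html.parser")

# An id is unique, so find() (first match only) is enough.
main_div = demo_soup.find(id="main")

# A class can be shared, so find_all() collects every match.
# The keyword is class_ because class is a reserved word in Python.
notes = demo_soup.find_all(class_="note")
note_texts = [note.get_text() for note in notes]

print(main_div.name)
print(note_texts)
```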


## Scraping 

### Scraping Rules

1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.

2. Do not request data from the website too aggressively with your program (also known as spamming). At most one request for one webpage per second is good practice.

3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
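Rules 1 and 2 can be partly automated. The sketch below is one possible approach, not a complete compliance check: it uses the standard library's `urllib.robotparser` to honour a site's `robots.txt` and sleeps between requests. The `robots.txt` rules, the `example.com` URLs, and the helper name `polite_urls` are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a list of lines so that the
# example runs offline. For a real site you would use set_url() and read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
robots = RobotFileParser()
robots.parse(robots_lines)


def polite_urls(urls, delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not robots.can_fetch("*", url):
            continue  # skip pages the site asks crawlers not to fetch
        if not first:
            time.sleep(delay_seconds)  # roughly one request per second
        first = False
        yield url


candidates = [
    "https://example.com/public/page1.html",
    "https://example.com/private/data.html",
    "https://example.com/public/page2.html",
]
allowed = list(polite_urls(candidates, delay_seconds=0))
print(allowed)
```

A `robots.txt` file is not a substitute for reading the Terms and Conditions; it only tells crawlers which paths the site prefers they avoid.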


### Inspect the code

We’ll be working with data from the official website of the National Gallery of Art in the United States. The National Gallery is an art museum located on the National Mall in Washington, D.C. It holds over 120,000 pieces dated from the Renaissance to the present day done by more than 13,000 artists.

We would like to search the Index of Artists, which, at the time of updating this tutorial, is available via the Internet Archive’s Wayback Machine at the following URL:

https://web.archive.org/web/20170131230332/https://www.nga.gov/collection/an.shtm

Since we’re doing this project to learn about web scraping with Beautiful Soup, we don’t need to pull too much data from the site, so let’s limit the scope of the artist data we scrape to a single letter: in our example, the letter Z.

On the Z page, we see that the first artist listed at the time of writing is Zabaglia, Niccola. This will help us check the results of the scraping process. We’ll start by working with this first page, with the following URL for the letter Z:

#### https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm

The last Z page, where the final artist is listed:

#### https://web.archive.org/web/20121010201041/http://www.nga.gov/collection/anZ4.htm

In [2]:
# Import necessary libraries

import requests
from bs4 import BeautifulSoup

In [4]:
# Collect first page of artists’ list
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

<Response [200]>

In [5]:
# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

With our page collected, parsed, and set up as a BeautifulSoup object, we can move on to collecting the data that we would like.

### Pulling Text From a Web Page

We’ll collect artists’ names and the relevant links available on the website. You may want to collect different data, such as the artists’ nationality and dates. Whatever data you would like to collect, you need to find out how it is described by the DOM of the web page.

To do this, in your web browser, right-click (or CTRL + click on macOS) on the first artist’s name, Zabaglia, Niccola. In the context menu that pops up, you should see a menu item similar to Inspect Element (Firefox) or Inspect (Chrome).

1. We’ll see first that the table of names is within `<div>` tags where `class="BodyText"`. This is important to note so that we only search for text within this section of the web page.

2. Notice that the name Zabaglia, Niccola is in a link tag, since the name references a web page that describes the artist. So we will want to reference the `<a>` tag for links. Each artist’s name is the text of a link.

To do this, we’ll use Beautiful Soup’s `find()` and `find_all()` methods in order to pull the text of the artists’ names from the BodyText div.


The `soup` variable now contains the parsed HTML of the page, so we can start writing the code that extracts the data.

Beautiful Soup lets us step into the nested layers of the page with `find()`. Since the class name `BodyText` is unique on this page, we can simply query `<div class="BodyText">`.
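Before running this on the real page, it can help to try the idea on a tiny document. The sketch below uses a made-up miniature of the artist page (the `href` values are placeholders) to show how searching inside the `BodyText` div narrows the results, and how a CSS selector via `select()` offers an alternative way to express the same query.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the artist page, so the example runs offline.
html_doc = """
<a href="/home">Home</a>
<div class="BodyText">
  <a href="/artist/1">Zabaglia, Niccola</a>
  <a href="/artist/2">Zaccone, Fabian</a>
</div>
"""

mini_soup = BeautifulSoup(html_doc, "html.parser")

# Searching the whole document also picks up the navigation link.
all_links = mini_soup.find_all("a")

# Searching inside the BodyText div keeps only the artist links.
body_div = mini_soup.find(class_="BodyText")
body_links = body_div.find_all("a")

# The same scoped query written as a CSS selector.
selected = mini_soup.select("div.BodyText a")

print(len(all_links), len(body_links))
print([link.get_text() for link in selected])
```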

In [7]:
# Get the BodyText div (find() returns the first element with this class)
artist_name_list = soup.find(class_='BodyText')

In [8]:
# Get every <a> tag inside the BodyText div
artist_name_list_items = artist_name_list.find_all('a')

In [9]:
artist_name_list_items

[<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">Zabaglia, Niccola</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">Zaccone, Fabian</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">Zadkine, Ossip</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">Zaech, Bernhard</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">Zagar, Jacob</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988">Zagroba, Idalia</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232">Zaidenberg, A.</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154">Zaidenberg, Arthur</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910">Zaisinger, Matthäus</a>,
 <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artisti

We need to create a for loop in order to iterate over all the artist names that we just put into the artist_name_list_items variable. 

We’ll print these names out with the prettify() method in order to turn the Beautiful Soup parse tree into a nicely formatted Unicode string.

In [10]:
# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">
 Zaccone, Fabian
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">
 Zadkine, Ossip
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">
 Zaech, Bernhard
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">
 Zagar, Jacob
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988">
 Zagroba, Idalia
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232">
 Zaidenberg, A.
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154">
 Zaidenberg, Arthur
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910">
 Zaisinger, Matthäus
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch

What we see in the output at this point is the full text and tags related to all of the artists’ names within the `<a>` tags found in the `<div class="BodyText">` tag on the first page, as well as some additional link text at the bottom. Since we don’t want this extra information, let’s work on removing this in the next section.



### Remove unwanted data

In order to remove the bottom links of the page, let’s again right-click and Inspect the DOM. We’ll see that the links at the bottom of the `<div class="BodyText">` section are contained in an HTML table, `<table class="AlphaNav">`.
    
We can therefore use Beautiful Soup to find the AlphaNav class and use the decompose() method to remove a tag from the parse tree and then destroy it along with its contents.
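Here is a small offline sketch of `decompose()` on a made-up fragment that mimics the page layout (the `href` values are placeholders). It also guards against `find()` returning `None`, which happens when no element matches and would otherwise raise an `AttributeError` on the `decompose()` call.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one artist link plus a navigation table to remove.
html_doc = """
<div class="BodyText">
  <a href="/artist/1">Zabaglia, Niccola</a>
  <table class="AlphaNav"><tr><td><a href="/a">A</a></td></tr></table>
</div>
"""

demo_soup = BeautifulSoup(html_doc, "html.parser")

nav = demo_soup.find(class_="AlphaNav")
if nav is not None:  # find() returns None when nothing matches
    nav.decompose()  # removes the table and everything inside it

remaining = [
    link.get_text()
    for link in demo_soup.find(class_="BodyText").find_all("a")
]
print(remaining)
```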

In [12]:
# Remove bottom links
last_links = soup.find(class_='AlphaNav')
last_links.decompose()

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    print(artist_name.prettify())

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">
 Zabaglia, Niccola
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">
 Zaccone, Fabian
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475">
 Zadkine, Ossip
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135">
 Zaech, Bernhard
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298">
 Zagar, Jacob
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988">
 Zagroba, Idalia
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232">
 Zaidenberg, A.
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154">
 Zaidenberg, Arthur
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910">
 Zaisinger, Matthäus
</a>
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch

### Pulling the Contents from a Tag

In order to access only the actual artists’ names, we’ll want to target the contents of the `<a>` tags rather than print out the entire link tag.

We can do this with Beautiful Soup’s `.contents`, which will return the tag’s children as a Python list data type.

Let’s revise the for loop so that instead of printing the entire link tag, we print only its first child (i.e. the artist’s full name).
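Taking `.contents[0]` works when the link holds nothing but text. The sketch below uses a made-up link with a nested `<b>` tag to show what `.contents` returns in that case, and how `get_text()` could be used instead to collect all of the text.

```python
from bs4 import BeautifulSoup

# Hypothetical link whose text is split by a nested <b> tag.
link = BeautifulSoup(
    '<a href="/artist/3">Zadkine, <b>Ossip</b></a>', "html.parser"
).a

children = link.contents      # list of children: a string and a <b> tag
first_child = link.contents[0]  # only the text before the <b> tag
full_text = link.get_text()     # all text inside the link, nested or not

print(len(children))
print(repr(first_child))
print(repr(full_text))
```

On the artist index each link contains a single string, so both approaches give the same name there.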

In [16]:
artists = []
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    
    artists.append(names)
    print(names)

Zabaglia, Niccola
Zaccone, Fabian
Zadkine, Ossip
Zaech, Bernhard
Zagar, Jacob
Zagroba, Idalia
Zaidenberg, A.
Zaidenberg, Arthur
Zaisinger, Matthäus
Zajac, Jack
Zak, Eugène
Zakharov, Gurii Fillipovich
Zakowortny, Igor
Zalce, Alfredo
Zalopany, Michele
Zammiello, Craig
Zammitt, Norman
Zampieri, Domenico
Zampieri, called Domenichino, Domenico
Zanartú, Enrique Antunez
Zanchi, Antonio
Zanetti, Anton Maria
Zanetti Borzino, Leopoldina
Zanetti I, Antonio Maria, conte
Zanguidi, Jacopo
Zanini, Giuseppe
Zanini-Viola, Giuseppe
Zanotti, Giampietro
Zao Wou-Ki


In [17]:
artists

['Zabaglia, Niccola',
 'Zaccone, Fabian',
 'Zadkine, Ossip',
 'Zaech, Bernhard',
 'Zagar, Jacob',
 'Zagroba, Idalia',
 'Zaidenberg, A.',
 'Zaidenberg, Arthur',
 'Zaisinger, Matthäus',
 'Zajac, Jack',
 'Zak, Eugène',
 'Zakharov, Gurii Fillipovich',
 'Zakowortny, Igor',
 'Zalce, Alfredo',
 'Zalopany, Michele',
 'Zammiello, Craig',
 'Zammitt, Norman',
 'Zampieri, Domenico',
 'Zampieri, called Domenichino, Domenico',
 'Zanartú, Enrique Antunez',
 'Zanchi, Antonio',
 'Zanetti, Anton Maria',
 'Zanetti Borzino, Leopoldina',
 'Zanetti I, Antonio Maria, conte',
 'Zanguidi, Jacopo',
 'Zanini, Giuseppe',
 'Zanini-Viola, Giuseppe',
 'Zanotti, Giampietro',
 'Zao Wou-Ki']

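What if we also want to capture the URLs associated with those artists?

We can extract the URLs found within a page’s `<a>` tags by using Beautiful Soup’s `get('href')` method. The `href` values on this archived page are relative (they start with `/web/`), so we prepend `https://web.archive.org` to build full URLs.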
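One detail worth knowing is that `get('href')` returns `None` when a tag has no `href` attribute, and concatenating a string with `None` raises a `TypeError`. A minimal sketch, using a made-up fragment with placeholder paths:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one normal link and one <a> tag without an href.
html_doc = '<p><a href="/web/123/page.htm">With link</a> <a>No link</a></p>'
demo_soup = BeautifulSoup(html_doc, "html.parser")

base = "https://web.archive.org"
links = []
for tag in demo_soup.find_all("a"):
    href = tag.get("href")  # None when the attribute is missing
    if href is None:
        continue            # skip anchors that do not point anywhere
    links.append(base + href)

print(links)
```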

In [18]:
artists = []
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
    artists.append([names,links])
    print(names)
    print(links)


Zabaglia, Niccola
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630
Zaccone, Fabian
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202
Zadkine, Ossip
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475
Zaech, Bernhard
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135
Zagar, Jacob
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298
Zagroba, Idalia
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988
Zaidenberg, A.
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232
Zaidenberg, Arthur
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154
Zaisinger, Matthäus
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=4910
Zajac, Jac

In [20]:
print(artists)

[['Zabaglia, Niccola', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630'], ['Zaccone, Fabian', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202'], ['Zadkine, Ossip', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3475'], ['Zaech, Bernhard', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=25135'], ['Zagar, Jacob', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=2298'], ['Zagroba, Idalia', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=23988'], ['Zaidenberg, A.', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=8232'], ['Zaidenberg, Arthur', 'https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34154'], ['Zaisinger, Matthäus', 'https://web.archive.org/web/20121

In [None]:
# Change the list to a pandas DataFrame and add column names


In [None]:
# Save the df to a CSV/xlsx file
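
One possible way to approach these two exercises is sketched below. It uses only the first two rows from the output above so that it runs offline; the column names `Name` and `Link` and the file name `z_artist_names.csv` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the scraped output above; in the notebook you would
# pass the full `artists` list instead.
artists = [
    ["Zabaglia, Niccola",
     "https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630"],
    ["Zaccone, Fabian",
     "https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202"],
]

# Each inner list becomes one row; `columns` supplies the header names.
df = pd.DataFrame(artists, columns=["Name", "Link"])

# index=False leaves out the row numbers; names containing commas are
# quoted automatically in the CSV output.
df.to_csv("z_artist_names.csv", index=False, encoding="utf-8")

print(df.shape)
```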

In [None]:
# Write a loop for scraping all pages 

# Import Libraries

# Run an outer for loop iterating through all 4 Z pages (anZ1 to anZ4)

# Run an inner for loop to gather the artists' names and links on each page

# Write contents to pandas DF and save as a CSV file
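
A hedged sketch of how the outline above might be filled in follows. It assumes the Z pages are numbered `anZ1` through `anZ4` (as the last-page URL earlier suggests) and that the Wayback Machine serves each of them under the same timestamp used for the first page. The helper names `parse_artists` and `scrape_all` are made up for this example. Nothing is downloaded until `scrape_all()` is called.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://web.archive.org"

# Assumed URL pattern: only the page number after "anZ" changes.
PAGE_URLS = [
    f"{BASE}/web/20121007172955/https://www.nga.gov/collection/anZ{i}.htm"
    for i in range(1, 5)
]


def parse_artists(html):
    """Return [name, link] pairs from one artist index page."""
    soup = BeautifulSoup(html, "html.parser")
    nav = soup.find(class_="AlphaNav")
    if nav is not None:
        nav.decompose()  # drop the bottom navigation links
    body = soup.find(class_="BodyText")
    if body is None:
        return []        # layout changed or wrong page
    rows = []
    for tag in body.find_all("a"):   # inner loop: links on one page
        href = tag.get("href")
        if href:
            rows.append([tag.get_text(strip=True), BASE + href])
    return rows


def scrape_all(urls=PAGE_URLS, delay_seconds=1.0):
    """Download every page politely and collect all artists."""
    artists = []
    for i, url in enumerate(urls):   # outer loop: one pass per page
        if i:
            time.sleep(delay_seconds)  # about one request per second
        page = requests.get(url, timeout=30)
        page.raise_for_status()        # stop on HTTP errors
        artists.extend(parse_artists(page.text))
    return artists
```

Calling `scrape_all()` returns a list of `[name, link]` pairs that can be passed to `pandas.DataFrame` and saved with `to_csv`, as in the earlier exercises.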
