## WEB SCRAPING

<img src="https://static.javatpoint.com/python/images/web-scraping-using-python.png" width=800>


### A Guide to Ethical Web Scraping
https://www.empiricaldata.org/dataladyblog/a-guide-to-ethical-web-scraping


### HTML Guide ###
Since we will be scraping from the web, you need to know HTML (HyperText Markup Language). See the [HTML introduction and tutorials on MDN](https://developer.mozilla.org/en-US/docs/Web/HTML).

### Use the appropriate web browser
Use Chrome or Firefox to inspect the HTML elements on the web page.

## Q1: Read the [Guide to Ethical Web Scraping](https://www.empiricaldata.org/dataladyblog/a-guide-to-ethical-web-scraping) article and write a short paragraph to summarize the key messages about ethical web scraping.

Answer here:
Data scraping can be used in many ways that benefit everyone. The difficulty is that the ethical ramifications of, for instance, scraping people's health records can be unclear. When scraping data, there are several points to keep in mind. First, an API is usually the best route: if one exists, use it instead of scraping. Next, respect robots.txt. Also read the terms and conditions, be gentle with the server, and identify yourself. Always ask for permission, because the data we are using does not belong to us. We should also value the content that we keep and treat the data with respect. We must also credit the owner, which brings quality traffic back to their website. Lastly, practice ethical web scraping. The demand for data sources grows over time, and many websites lack APIs that give developers access to the data they require. This implies that web scraping will become more common, so it is critical for developers to understand how to do it correctly.
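
Two of the points above, respecting robots.txt and being gentle, can be checked in code before any request is sent. The following is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.com` URLs, the bot name `cs122-homework-bot`, and the helper `is_allowed` are all made up for illustration; a real script would load the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical bot name; an honest User-Agent tells the site owner who is calling.
USER_AGENT = "cs122-homework-bot"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(SAMPLE_ROBOTS, USER_AGENT, "https://example.com/docs/table.html")
blocked = is_allowed(SAMPLE_ROBOTS, USER_AGENT, "https://example.com/private/data.html")

# The Crawl-delay value, if present, suggests how long to wait between requests.
parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file, and a `time.sleep(delay)` between requests is one simple way to stay gentle.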

## Q2: Scrape data from this [website](https://www.sjsu.edu/people/wendy.lee/docs/CS122/table.html), create a dataframe to store the data, and write the data to a CSV file.
- This website contains a table that lists the books, the authors, the quantity, and the prices. 
- The dataframe should contain three columns: `Title`, `Author`, and `Unit_Price`.
- The `Unit_Price` column should store the price for one book.
- Save the data in the dataframe in a csv file, `books.csv`. 
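
The `Unit_Price` requirement amounts to dividing a row's total by its quantity. As a rough illustration, the sketch below takes the Midnight Sun figures that appear in the scraped output further down (a total of `$26.64` for 2 copies). The helper name `price_per_copy` is hypothetical, and `Decimal` is used rather than `float` only to avoid rounding surprises with currency.

```python
from decimal import Decimal, ROUND_HALF_UP


def price_per_copy(total: str, quantity: int) -> Decimal:
    """Return the price of one book given a line total such as "$26.64"."""
    if quantity <= 0:
        raise ValueError("quantity must be a positive integer")
    amount = Decimal(total.strip().lstrip("$"))
    # Round to cents, the precision used on the page.
    return (amount / quantity).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


midnight_sun = price_per_copy("$26.64", 2)
print(f"${midnight_sun}")
```

When the page already prints the per-copy price next to the total, reading that value directly is simpler; the division is mainly useful as a cross-check.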

In [82]:
# Your code here . . .
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import random

url="https://www.sjsu.edu/people/wendy.lee/docs/CS122/table.html"

def get_UA():
    '''Return a random User-Agent string.

    This list is from https://www.jcchouinard.com/random-user-agent-with-python-and-beautifulsoup/
    '''
    # Line-continuation backslashes are not needed inside brackets
    uastrings = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.1.17 (KHTML, like Gecko) Version/7.1 Safari/537.85.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36",
    ]
    return random.choice(uastrings)

mockheaders = {'User-Agent': get_UA()}
print(mockheaders)
r = requests.get(url, headers=mockheaders, timeout=10)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page

# Input to BeautifulSoup
html_soup = soup(r.text, 'html.parser')

{'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0'}


In [84]:

# find_all is the current name for findAll; collect every <tr> row on the page
containers = html_soup.find_all("tr")

print(len(containers))

print(containers[4])


8
<tr>
<td>
<strong class="book-title">Little Fires Everywhere</strong>
<span class="text-offset">by Celeste Ng</span>
</td>
<td class="item-stock">In Stock</td>
<td class="item-qty">1</td>
<td class="item-price">$10.20</td>
</tr>
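
The row printed above shows how each field is tagged with its own `class`, which is what the later cells key on. As a small offline illustration of that mapping, the sketch below feeds the same row to the standard library's `HTMLParser` (swapped in for BeautifulSoup here only so the example runs without extra installs). The class `ClassTextCollector` is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser

# The row printed by the cell above, copied as a string.
ROW_HTML = """
<tr>
<td>
<strong class="book-title">Little Fires Everywhere</strong>
<span class="text-offset">by Celeste Ng</span>
</td>
<td class="item-stock">In Stock</td>
<td class="item-qty">1</td>
<td class="item-price">$10.20</td>
</tr>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text inside each element that carries a class attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # class name (or None) of every currently open tag

    def handle_starttag(self, tag, attrs):
        self._open.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        current = self._open[-1] if self._open else None
        if text and current:
            self.fields[current] = self.fields.get(current, "") + text


collector = ClassTextCollector()
collector.feed(ROW_HTML)
row = collector.fields

for name, value in row.items():
    print(f"{name}: {value}")
```

This simple stack approach assumes well-nested markup with no void elements such as `<br>`, which holds for this row but not for HTML in general.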


In [85]:

for container in containers:
    try:
        title = container.find_all("strong", {"class": "book-title"})[0].text
        author = container.find_all("span", {"class": "text-offset"})[0].text
        unit_price = container.find_all("td", {"class": "item-price"})[0].text
    except IndexError:
        # Skip rows that lack a title, author, or price cell
        continue

    print('Title:', title)
    print('Author:', author)
    print('Price:', unit_price)
    print("-------------")

Title: Where the Crawdads Sing
Author: by Delia Owens
Price: $11.00
-------------
Title: Midnight Sun
Author: by Stephenie Meyer
Price: $26.64$13.32 × 2
-------------
Title: Introducing HTML5
Author: by Bruce Lawson & Remy Sharp
Price: $22.23
-------------
Title: Little Fires Everywhere
Author: by Celeste Ng
Price: $10.20
-------------


In [87]:
import re

title_list = []
author_list = []
unit_price_list = []

for container in containers:
    try:
        title = container.find_all("strong", {"class": "book-title"})[0].text
        author = container.find_all("span", {"class": "text-offset"})[0].text
        price_text = container.find_all("td", {"class": "item-price"})[0].text
        # A multi-copy row reads "$26.64$13.32 × 2": total first, unit price last
        unit_price = re.findall(r"\$\d+(?:\.\d+)?", price_text)[-1]
    except IndexError:
        # Skip rows that lack a title, author, or price
        continue

    # Remove only the leading "by " so names that contain "by" stay intact
    author = re.sub(r"^\s*by\s+", "", author).strip()

    title_list.append(title.strip())
    author_list.append(author)
    unit_price_list.append(unit_price)

# The assignment asks for the columns Title, Author, and Unit_Price
data = {'Title': title_list, 'Author': author_list, 'Unit_Price': unit_price_list}
df = pd.DataFrame(data)

# Write to csv file; index=False keeps the row index out of books.csv
df.to_csv("books.csv", index=False)

df

,Title,Author,Unit_Price
0,Where the Crawdads Sing,Delia Owens,$11.00
1,Midnight Sun,Stephenie Meyer,$13.32
2,Introducing HTML5,Bruce Lawson & Remy Sharp,$22.23
3,Little Fires Everywhere,Celeste Ng,$10.20
