### Questions
* how do you scrape something with Beautiful Soup when the section has no class or id attributes?
* is regex important for web scraping? 
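
One possible answer to the first question, as a minimal sketch: when a section carries no class or id attributes, elements can still be reached by their position in the tree or by their text. The HTML below is made up for illustration and is not eBay's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no class or id attributes anywhere.
html = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Keyboard A</td><td>$25.99</td></tr>
  <tr><td>Keyboard B</td><td>$49.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Select by position in the tree with a CSS selector:
#    the second <td> inside each row.
prices = [td.get_text(strip=True) for td in soup.select("tr > td:nth-of-type(2)")]

# 2. Find a node by its text, then walk to its neighbour.
cell = soup.find("td", string="Keyboard B")
price_b = cell.find_next_sibling("td").get_text(strip=True)

print(prices)   # ['$25.99', '$49.00']
print(price_b)  # $49.00
```

Positional selectors break when the page layout changes, so anchoring on nearby text is often the sturdier of the two when a stable label exists.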

### Objectives
You will be able to:
* differentiate between making a request and parsing the file
* use Beautiful Soup to parse an HTML file
* use Selenium to load a page and parse its HTML

### Outline
* questions
* scrape eBay for pricing
* load that information into a list
* try to work with infinite scrolling
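
For the infinite scrolling item, a common approach with Selenium is to scroll to the bottom repeatedly until the page height stops growing. The helper below is a sketch under that assumption. `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names, and the function only relies on the driver having an `execute_script` method, which Selenium WebDriver objects provide.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger more loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded, so assume we reached the end
        last_height = new_height
    return rounds
```

With a real browser session this would be called after `driver.get(url)`, and `driver.page_source` could then be handed to `BeautifulSoup` for parsing. The `max_rounds` cap is there so a page that never stops loading cannot keep the loop running forever.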

In [14]:
import requests # used to get a webpage
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup # parse a webpage
from pprint import pprint

import matplotlib.pyplot as plt
import seaborn as sns

### Let's get an eBay URL to scrape

**set up the URL**

In [11]:
search = "mechanical keyboards"
url = "https://www.ebay.com/sch/i.html?_nkw={}".format(search.replace(" ", "+"))
url

'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboards'
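
Replacing spaces by hand works for this search, but it does not escape characters such as `&`. As an alternative sketch, the standard library's `urlencode` builds the same query string and escapes those characters too. The names `base`, `params`, `search_url`, and `tricky` are only for illustration.

```python
from urllib.parse import urlencode

base = "https://www.ebay.com/sch/i.html"
params = {"_nkw": "mechanical keyboards"}

# urlencode turns spaces into "+" and escapes reserved characters.
search_url = base + "?" + urlencode(params)
print(search_url)  # https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboards

# A search term containing "&" would otherwise be read as a new parameter.
tricky = urlencode({"_nkw": "cherry mx & rgb"})
print(tricky)  # _nkw=cherry+mx+%26+rgb
```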

**make request**

In [None]:
page = requests.get(url)

In [16]:
content = page.content

**make soup object for parsing**

In [17]:
soup = BeautifulSoup(content, 'html.parser')

**grab the item__wrappers**

In [22]:
item_wrappers = soup.find_all('div', attrs={"class":"s-item__wrapper clearfix"}) # a multi-word class string must match the tag's full class attribute exactly

**learning wrappers using the first item__wrapper**

In [53]:
first_item_wrapper = item_wrappers[0]
item_url = first_item_wrapper.find_all('a')[0].get('href') # named item_url so the search url is not overwritten
text = first_item_wrapper.find_all('h3')[0].text.strip()
description = first_item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[0].text
item_status = first_item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[1].text
price = float(first_item_wrapper.find_all("span", attrs={"class":"s-item__price"})[0].text.replace("$", ""))
shipping_info = first_item_wrapper.find_all("span", attrs={"class":"s-item__shipping s-item__logisticsCost"})[0].text.lower()
free_return = first_item_wrapper.find_all('span', attrs={"class":"s-item__free-returns s-item__freeReturnsNoFee"})[0].text.lower()

In [37]:
first_item_wrapper.find_all('h3')[0].text.strip()

'Mechanical Keyboard RGB Wired Backlit Ergonomic Gaming Keyboard  Blue Switches'

In [39]:
# item description
description = first_item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[0].text
description

'Cherry MX RGB blue key switches & 104 Key & Ombar Brand'

In [42]:
# item status
item_status = first_item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[1].text
item_status

'Brand New'

In [47]:
# finding price
price = float(first_item_wrapper.find_all("span", attrs={"class":"s-item__price"})[0].text.replace("$", ""))
price

25.99
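
On the regex question: it is not required for scraping, but it can make a step like this one more forgiving. Calling `float` directly fails on text with thousands separators or a price range. The sketch below pulls every number out of the price text instead. `PRICE_RE` and `parse_prices` are hypothetical names, and the sample strings are made up rather than taken from the scraped page.

```python
import re

# One or more digits (commas allowed), optionally followed by a decimal part.
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_prices(price_text):
    """Return every dollar amount in the text as a float, in order of appearance."""
    return [float(m.replace(",", "")) for m in PRICE_RE.findall(price_text)]

print(parse_prices("$25.99"))            # [25.99]
print(parse_prices("$1,299.00"))         # [1299.0]
print(parse_prices("$19.99 to $34.99"))  # [19.99, 34.99]
```

A single-element result can be stored as the price, and a two-element result as the low and high ends of a range.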

In [52]:
# finding shipping info
shipping_info = first_item_wrapper.find_all("span", attrs={"class":"s-item__shipping s-item__logisticsCost"})[0].text.lower()
shipping_info


'free shipping'

In [55]:
# check whether returns are free
free_return = first_item_wrapper.find_all('span', attrs={"class":"s-item__free-returns s-item__freeReturnsNoFee"})[0].text.lower()
free_return

'free returns'

In [60]:
# loop through all items and get the information desired

dlist = []

for item_wrapper in item_wrappers:
    # commenting this out, since we are iterating through our item wrappers
    # item_wrapper = item_wrappers[0]
    d = {}
    d["url"] = item_wrapper.find_all('a')[0].get('href')
    d["text"] = item_wrapper.find_all('h3')[0].text.strip()
    d["description"] = item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[0].text
    d['price'] = None  # default, so range listings still have a price key
    d['low_price'] = None
    d['high_price'] = None

    try:
        d["item_status"] = item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[1].text
    except IndexError:  # this listing has no second subtitle
        d["item_status"] = None

    # strip the dollar sign and thousands separators before converting
    price_text = item_wrapper.find_all("span", attrs={"class":"s-item__price"})[0].text.replace("$", "").replace(",", "")
    try:
        d["price"] = float(price_text)
    except ValueError:  # not a single number, e.g. a range written as "low to high"
        if 'to' in price_text:
            p1, p2 = price_text.split("to")
            d["low_price"] = float(p1.strip())
            d["high_price"] = float(p2.strip())

    d["shipping_info"] = item_wrapper.find_all("span", attrs={"class":"s-item__shipping s-item__logisticsCost"})[0].text.lower()

    try:
        d["free_return"] = item_wrapper.find_all('span', attrs={"class":"s-item__free-returns s-item__freeReturnsNoFee"})[0].text.lower()
    except IndexError:  # this listing has no free-returns badge
        d["free_return"] = None
    
    
#     for k, v in d.items():
#         print("{} : {}".format(k, v))
#     print("-"*50)
#     print("\n")

    dlist.append(d)

In [61]:
df = pd.DataFrame(dlist)
df.head()

| | description | free_return | high_price | item_status | low_price | price | shipping_info | text | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Cherry MX RGB blue key switches & 104 Key & Om... | free returns | | Brand New | | 25.99 | free shipping | Mechanical Keyboard RGB Wired Backlit Ergonomi... | https://www.ebay.com/itm/Mechanical-Keyboard-R... |
| 1 | Brand New | free returns | | | | 25.99 | free shipping | Ombar K676 RGB 104 Key Mechanical Gaming Keybo... | https://www.ebay.com/itm/Ombar-K676-RGB-104-Ke... |
| 2 | Pre-Owned | | | | | 49.0 | +$12.99 shipping | WASD Code 87-Key V2B Backlit Mechanical Keyboa... | https://www.ebay.com/itm/WASD-Code-87-Key-V2B-... |
| 3 | US Seller – Fast Shipping – 60 Day Returns – W... | free returns | | Brand New | | 28.99 | free shipping | SPONSOREDRosewill RGB Gaming Keyboard, Wired, ... | https://www.ebay.com/itm/Rosewill-RGB-Gaming-K... |
| 4 | US Seller – Fast Shipping – 60 Day Returns – W... | free returns | | Brand New | | 49.99 | free shipping | SPONSOREDRosewill RGB Mechanical Gaming Keyboa... | https://www.ebay.com/itm/Rosewill-RGB-Mechanic... |


### What did we learn?
* Sublime Text and Postman
* Sublime Text is great for shortcut keys and editing many things at once
* Postman is great for making request calls and inspecting headers
* try/except
* find_all, .get(), attrs in find_all, .text
* how to begin parsing through data
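
As a follow-up to the try/except and find_all points, here is a small sketch of an alternative: `find` returns `None` when nothing matches, so a helper can return `None` without raising. `first_text` is a hypothetical helper and the HTML is a made-up fragment that only borrows the class names used earlier. Unlike the multi-word string passed through `attrs`, the `class_` argument matches a tag that has that one class among several.

```python
from bs4 import BeautifulSoup

def first_text(tag, name, class_=None):
    """Return the stripped text of the first matching descendant, or None if absent."""
    found = tag.find(name, class_=class_) if class_ else tag.find(name)
    return found.get_text(strip=True) if found is not None else None

# Made-up fragment for illustration only.
html = """
<div class="s-item__wrapper clearfix">
  <h3>Example Keyboard</h3>
  <span class="s-item__price">$25.99</span>
</div>
"""
wrapper = BeautifulSoup(html, "html.parser").find("div", class_="s-item__wrapper")

print(first_text(wrapper, "h3"))                            # Example Keyboard
print(first_text(wrapper, "span", "s-item__price"))         # $25.99
print(first_text(wrapper, "span", "s-item__free-returns"))  # None
```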

In [65]:
# dealing with multiple pages
search = "mechanical keyboard".replace(" ", "+")
page_number = 1
url = "https://www.ebay.com/sch/i.html?_nkw={}&_pgn={}"
url

'https://www.ebay.com/sch/i.html?_nkw={}&_pgn={}'

In [67]:
urls = [url.format(search, page) for page in range(1, 11)]
urls

['https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=1',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=2',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=3',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=4',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=5',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=6',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=7',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=8',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=9',
 'https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&_pgn=10']

In [69]:
# loop through all items and get the information desired

dlist = []
for url in urls:
    page = requests.get(url)
    content = page.content
    soup = BeautifulSoup(content, 'html.parser')
    item_wrappers = soup.find_all('div', attrs={"class":"s-item__wrapper clearfix"}) # a multi-word class string must match the tag's full class attribute exactly
    
    for item_wrapper in item_wrappers:
        # commenting this out, since we are iterating through our item wrappers
        # item_wrapper = item_wrappers[0]
        d = {}
        d["url"] = item_wrapper.find_all('a')[0].get('href')
        d["text"] = item_wrapper.find_all('h3')[0].text.strip()
        d["description"] = item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[0].text
        d['low_price'] = None
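        d['high_price'] = None

        # the remaining fields are disabled for this multi-page run, so each row keeps only the five columns above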


#         try:
#             d["item_status"] = item_wrapper.find_all("div", attrs={"class":"s-item__subtitle"})[1].text
#         except:
#             d["item_status"] = None

#         price_text = item_wrapper.find_all("span", attrs={"class":"s-item__price"})[0].text.replace("$", "")
#         try:
#             d["price"] = float(price_text)
#         except:
#             if 'to' in price_text:
#                 p1, p2 = price_text.split("to")
#                 d["low_price"] = float(p1.strip())
#                 d["high_price"] = float(p2.strip())

#         d["shipping_info"] = item_wrapper.find_all("span", attrs={"class":"s-item__shipping s-item__logisticsCost"})[0].text.lower()

#         try:
#             d["free_return"] = item_wrapper.find_all('span', attrs={"class":"s-item__free-returns s-item__freeReturnsNoFee"})[0].text.lower()
#         except:
#             d["free_return"] = None


    #     for k, v in d.items():
    #         print("{} : {}".format(k, v))
    #     print("-"*50)
    #     print("\n")

        dlist.append(d)

In [72]:
df = pd.DataFrame(dlist)
print(df.shape)
df.head()

(600, 5)


| | description | high_price | low_price | text | url |
|---|---|---|---|---|---|
| 0 | Pre-Owned | | | WASD Code 87-Key V2B Backlit Mechanical Keyboa... | https://www.ebay.com/itm/WASD-Code-87-Key-V2B-... |
| 1 | Cherry MX RGB blue key switches & 104 Key & Om... | | | Mechanical Keyboard RGB Wired Backlit Ergonomi... | https://www.ebay.com/itm/Mechanical-Keyboard-R... |
| 2 | Pre-Owned | | | New ListingTenkeyless Mechanical Keyboard | https://www.ebay.com/itm/Tenkeyless-Mechanical... |
| 3 | US Seller – Fast Shipping – 60 Day Returns – W... | | | SPONSOREDRosewill RGB Gaming Keyboard, Wired, ... | https://www.ebay.com/itm/Rosewill-RGB-Gaming-K... |
| 4 | US Seller – Fast Shipping – 60 Day Returns – W... | | | SPONSOREDRosewill RGB Mechanical Gaming Keyboa... | https://www.ebay.com/itm/Rosewill-RGB-Mechanic... |


In [76]:
soup.find_all('div', attrs={"class":"s-item__wrapper clearfix"})

[]
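
The empty list means the last page fetched produced no item wrappers. The notebook does not show why. It could be a page past the end of the results, different markup, or a blocked request. A sketch of one way to handle it is to keep the request step and the parsing step as separate functions and stop at the first page that yields nothing. `scrape_pages`, `fetch`, `parse`, and `delay` are hypothetical names.

```python
import time

def scrape_pages(urls, fetch, parse, delay=1.0):
    """Fetch and parse each URL in turn; stop at the first page with no items."""
    rows = []
    for url in urls:
        html = fetch(url)    # request step, e.g. lambda u: requests.get(u).content
        items = parse(html)  # parsing step, e.g. BeautifulSoup plus find_all
        if not items:
            print("no items found at {}; stopping".format(url))
            break
        rows.extend(items)
        time.sleep(delay)    # pause between requests to be polite
    return rows
```

Passing `fetch` and `parse` in as arguments also makes the loop easy to try out on saved HTML before pointing it at the live site.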