# Web Scraping in Python for Beginners 

In this notebook, I will give an introduction to web scraping in Python. This is not an exhaustive overview of all the resources Python puts at your disposal, but it should be enough to get you started. The two tools I will cover are Beautiful Soup (together with Requests) and Selenium. Another major option is Scrapy, which is not covered here. First, we will go over Beautiful Soup, then Selenium, and in the end, we will bring them together! Along the way, I will discuss the pros and cons of both.

#### First, we need to import the required libraries:

In [2]:
import time
import requests
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import numpy as np 
import pandas as pd
import re
import urllib

### Beautiful Soup _Only_:

Beautiful Soup is a Python package that parses XML and HTML. It makes it very easy to get data out of the HTML available on a website. However, it does have its pitfalls. Chief among them is that it cannot grab data as easily from dynamic pages, where the content is filled in by JavaScript after the initial request. In those situations we will use Selenium.

So, in the cell below we can start with a quick example. Let's try a few pages to get a feel for how things change. Below, let's define three variables (url_1, url_2, and url_3) that hold three different URLs.

The first is Zara's dresses page, the second is Flatiron School's data science page, and the third is Peloton's Yelp page.

In [3]:
url_1 = 'https://www.zara.com/us/en/woman-dresses-l1066.html?v1=1180427'
url_2 = 'https://flatironschool.com/career-courses/data-science-bootcamp/'
url_3 = 'https://www.yelp.com/biz/peloton-new-york?osq=peloton'

So, first we need to use the Requests library to request the URL and assign the response to a variable (here, zara_page). We can then pass the response's content to the BeautifulSoup constructor, with 'lxml' as the second argument. This tells Beautiful Soup to parse the page with the lxml parser.

By convention we save the result of BeautifulSoup as soup (in this case zara_soup).

In [4]:
zara_page = requests.get(url_1)
zara_soup = BeautifulSoup(zara_page.content, 'lxml')

You can already view this. Simply run the cell below to view the page HTML:

In [5]:
zara_soup #this may time out due to the page being too big

zara_soup.find('div') #this will return the first div on the page

<div class="_global-loader active" id="global-loader"></div>
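If you want to experiment with the parser without requesting a live page, a small inline document works too. The sketch below is only an illustration: `sample_html` and `sample_soup` are made-up names and the markup is hypothetical, not taken from Zara's site. It uses Python's built-in 'html.parser' so that it runs even without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content; not taken from any real site.
sample_html = """
<html>
  <head><title>Sample shop</title></head>
  <body>
    <div id="global-loader" class="loader active"></div>
    <div class="product">Dress</div>
  </body>
</html>
"""

# The second argument names the parser. 'html.parser' ships with Python,
# while 'lxml' is a separate third-party package.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns only the first matching tag
first_div = sample_soup.find("div")
print(first_div["id"])      # attributes are read like dictionary keys
print(first_div["class"])   # class can hold several values, so it is a list

# find returns None when nothing matches, so check before using the result
missing = sample_soup.find("table")
print(missing is None)
```

Because `find` can return `None`, it is worth checking the result before reading attributes from it.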

So, now that we have the HTML saved as a variable, we can parse through the page to get the information we need.

If we go and inspect our page, we can see the class names and IDs of the items that we want. Below is a screenshot of what it looked like when I inspected each product item. In this exercise, we will grab each product's name and price and save them as a dictionary.

<img style="float: center;" src="images/inspect_zara.png"  width=800>

If you look at it, you can easily see the class names for the elements surrounding the product name and the price, **'name _item'** and **'price _product_price'** respectively.

#### Let's first see if we can grab all of the product names very quickly.

In Beautiful Soup the findAll method allows us to do just that (findAll is the older spelling of find_all). If we use the find method instead, it will only return the first item it finds that meets the requirements passed.

As seen above, the first thing we pass to the method is the HTML tag (e.g. 'div', 'p', 'h1', 'a'), and the second is a dictionary with a key-value pair. Typically the key is 'class' or 'id' and the value is the actual class name or ID. We will be using class.

After grabbing all of the links with this class name _'name'_ we will use a [list comprehension](https://hackernoon.com/list-comprehension-in-python-8895a785550b) to create a list of the names. 

In [11]:
names = zara_soup.findAll('a', {'class': 'name'})  # attrs is a dict: key 'class', value 'name'
product_names = [item.text for item in names]
product_names[:5]

['FLORAL PRINT BALLOON SLEEVE DRESS',
 'FLORAL PRINT DRESS',
 'LONG LINEN DRESS',
 'PLEATED DRESS',
 'FLORAL PRINT DRESS']
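To make the difference between find and findAll concrete, here is a minimal sketch on a hypothetical snippet of markup (the `sample_html` string and the `sample_`-prefixed names are assumptions for illustration, loosely modeled on the class names above). It uses the `find_all` spelling and the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the class names described above.
sample_html = """
<a class="nav" href="/home">Home</a>
<a class="name _item" href="/dress-1">FLORAL PRINT DRESS</a>
<a class="name _item" href="/dress-2">LONG LINEN DRESS</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns only the first tag that matches
first_link = sample_soup.find("a", {"class": "name"})

# find_all returns every match; a tag with class="name _item" matches
# because Beautiful Soup compares against each class individually
all_links = sample_soup.find_all("a", {"class": "name"})

# class_ is an equivalent keyword form (class is a reserved word in Python)
same_links = sample_soup.find_all("a", class_="name")

sample_names = [link.text for link in all_links]
print(first_link.text)
print(sample_names)
```

The link with class 'nav' is skipped, which is the filtering we rely on when picking out product names.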

The next part, for prices, gets a little complicated. Because of how the HTML is parsed, we need to use a little regex to grab the prices!

In [19]:
prices = zara_soup.findAll('div', {'class': '_product-price'})  # attrs is a dict, as above

Now for some regex. We will use the search function from Python's re module, together with the group method of the match it returns. If we look at all of the prices, we see that each one sits inside a span that carries a custom HTML attribute called 'data-price'. So, to grab this information, the pattern starts with the chunk of HTML that comes before the price <img style="display: inline-block;" src="images/span_img.png"  width=150>. None of the characters in that chunk is special in a regular expression, so no backslash escapes are needed; we write the pattern as a raw string (the r prefix) so that Python hands it to the regex parser unchanged. Next come parentheses with '.*' inside, which matches any length and combination of characters and captures it as group 1. We finish with ' USD"><', which is how our string should end.

In [37]:
prod_prices = [float(re.search(r'<span data-price="(.*) USD"><', str(item)).group(1)) for item in prices]
prod_prices[:5]

[69.9, 49.9, 89.9, 69.9, 49.9]
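The regex step can be tried in isolation. The sketch below uses made-up strings shaped like `str(item)` for each price div; the sample fragments, the `extract_price` helper, and the `sample_`-prefixed names are assumptions for illustration, not part of the scraped page. It also shows one way to pair names with prices in a dictionary while guarding against a div that has no price.

```python
import re

# Hypothetical fragments shaped like str(item) for each price div.
sample_price_html = [
    '<div class="price _product-price"><span data-price="69.90 USD"><span>69.90 USD</span></span></div>',
    '<div class="price _product-price"><span data-price="49.90 USD"><span>49.90 USD</span></span></div>',
    '<div class="price _product-price"></div>',  # a div with no price inside
]
sample_names = ["FLORAL PRINT BALLOON SLEEVE DRESS", "FLORAL PRINT DRESS", "LONG LINEN DRESS"]

# Raw string: the backslash-free pattern reaches the regex parser unchanged.
# The parentheses capture whatever sits between data-price=" and  USD"><
price_pattern = re.compile(r'<span data-price="(.*) USD"><')


def extract_price(html_fragment):
    """Return the captured price as a float, or None if nothing matches."""
    match = price_pattern.search(html_fragment)
    return float(match.group(1)) if match else None


sample_prices = [extract_price(fragment) for fragment in sample_price_html]

# Pair each name with its price, skipping items whose price was not found.
# Note: a dictionary keeps one entry per name, so repeated product names
# would overwrite each other.
sample_products = {
    name: price
    for name, price in zip(sample_names, sample_prices)
    if price is not None
}
print(sample_prices)
print(sample_products)
```

Calling `.group(1)` directly on the result of `re.search` raises an `AttributeError` when there is no match, which is why the helper checks the match first.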