# Web Scraping with BeautifulSoup4 and Pandas

This Jupyter notebook is a simple demo that scrapes a web page from Total Wine and showcases the powerful tools used in
web scraping to harvest data from the World Wide Web.

To start, we'll import the packages we'll be using in this demo and go over each one briefly.

In [None]:
import pandas as pd 
from bs4 import BeautifulSoup
import requests 

Let's start with Pandas, which we've abbreviated as 'pd'. It's an open source data analysis and manipulation library.
Its key feature is what is called a <b>DataFrame</b>. A DataFrame is an in-memory data structure with indexing and 
re-indexing, alignment, manipulation, pivoting, and statistical analysis capabilities (and many more!) 
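
To make that a little more concrete, here's a minimal sketch of a DataFrame in action. The product names and prices
below are made up for illustration (they aren't scraped from anywhere), and `df_demo` is just a name chosen for this
example.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
df_demo = pd.DataFrame(
    {
        "name": ["Sample Blanco", "Sample Reposado", "Sample Anejo"],
        "price": [25.0, 35.0, 60.0],
    }
)

# Re-index the rows by name so we can look products up by label.
by_name = df_demo.set_index("name")

# Label-based lookup and a simple statistic.
reposado_price = by_name.loc["Sample Reposado", "price"]
average_price = df_demo["price"].mean()

print(reposado_price)
print(average_price)
```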

BeautifulSoup4 is a simple-to-use library designed to scrape information from web pages. It works with HTML and 
XML parsers (we'll use Python's built-in 'html.parser'), which makes searching and parsing your desired web page easy. 

Requests bills itself as "an elegant and simple HTTP library for Python, built for human beings." And they 
have every right to make such a claim. This library allows us to make HTTP requests using Python, and it's how
we will be obtaining our information to scrape.

Now we'll start with the basics. First, we're going to make an HTTP request to our desired website and download a webpage.
We'll do this by creating a Session object and using it to make a request to our website!

In [None]:
destination_url = 'https://www.totalwine.com'
session = requests.Session()
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0"
                         ".3770.142 Safari/537.36"}
response = session.get(destination_url, headers=headers)

#side note, you can check the status code of the response from the web server.
print(response.status_code)

<i>Huzzah!</i> We have successfully made a request. Alright, pack your bags, time to go home. Just kidding, we've only 
just started. So what did we do? We:
 - Set the destination URL
 - Created a Session object, which will make the request for us
 - Initialized a dictionary with a single key-value pair: the header we'll be using to mimic the user-agent of 
 a regular browser. Sometimes websites don't like non-human traffic :( 
 - Called the session's *get* method, stored the response in a variable, and finally printed the status code to make
 sure everything was a-okay. 
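
As a small variation on the steps above, the headers can also be attached to the Session itself, so every request made
through it sends them. This is only a sketch: `demo_session` and `demo_headers` are names made up for this example, the
user-agent string is shortened, and no request is actually sent.

```python
import requests

# A shortened, illustrative user-agent string (not a complete browser string).
demo_headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

demo_session = requests.Session()

# Headers stored on the session are merged into every request it sends.
demo_session.headers.update(demo_headers)

# Header names are case-insensitive, so either spelling finds the value.
print(demo_session.headers["User-Agent"])
```

With the headers stored this way, a later `demo_session.get(url)` wouldn't need the `headers=` argument each time.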

Now that we have our webpage, let's begin scraping! 

First things first: our destination URL is set to the homepage of the website. Since we're interested in getting some 
details about the products they sell, let's change the destination URL to a page that contains information about a 
product. I'm a tequila kind of guy, so let's take a look at one of those! 

In [None]:
destination_url = 'https://www.totalwine.com/spirits/tequila/blanco-silver/don-julio-1942-tequila/p/38388750?s=919&igrules=true'
response = session.get(destination_url, headers=headers)
print(response.status_code)

<i>Note: Remember, we want to make sure that we received a valid response from the server. The status code helps with 
that. If you make a request and the status code is anything outside the 200-299 range, you will have to 
look at the specific status code and find out why your request didn't succeed with the server.</i>  
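
If it helps, here's a rough sketch of how that check could look in code. The `describe_status` helper is hypothetical
(it isn't part of Requests); it leans on Python's built-in `http.HTTPStatus` to turn a code into a readable phrase.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes that aren't in the standard library's table.
        phrase = "Unknown"
    # Codes from 200 through 299 mean the request succeeded.
    outcome = "success" if 200 <= status_code < 300 else "not successful"
    return f"{status_code} {phrase}: {outcome}"


print(describe_status(200))
print(describe_status(404))
```

In this notebook you would pass it `response.status_code`. Requests also offers `response.raise_for_status()`, which
raises an exception for 4xx and 5xx responses.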

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

# print(soup.prettify())

*If you want to see what the page's source code looks like, uncomment the print statement. Beware, it's long!*

Now that we have our page parsed, let's start scraping! So where do we begin? What kind of information can we get from
this web page? Well, basically just about anything that you can see. For this tutorial, I'm just interested in the name,
price, and spirits type on the page.

That leads us to the next question: how do we extract the information from the page? Let's first do a little digging
around and see where our information lives. Here, we'll take a look at how the name of the spirit is 
represented on the web page. 


This is a fairly simple HTML tag for the name of the tequila. We can use the BeautifulSoup API to find an element
based on its tag name.


In [None]:
name = soup.find("h1").text

print(name)
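
Since page markup changes over time, here's a self-contained sketch of the same lookup on a tiny, made-up HTML snippet.
The snippet, its class name, and the product name in it are invented for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for the real product page.
sample_html = """
<html>
  <body>
    <h1 class="productTitle">  Sample Tequila Blanco  </h1>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
heading = sample_soup.find("h1")

# get_text(strip=True) trims the surrounding whitespace for us.
sample_name = heading.get_text(strip=True) if heading is not None else None

print(sample_name)
```

Checking for `None` before reading the text avoids an AttributeError if the tag ever goes missing from the page.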


Now we're going to find the price in this document. Let's take a look at the markup that contains the price.


This one should be easy as well. Notice the **id** attribute in the HTML div tag. BeautifulSoup again makes this quite
simple for us. Using the same *find* method, we can use the **id** parameter to search the document by ID.

In [None]:
price = soup.find(id="edlpPrice").text

print(price)
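
The scraped price comes back as text, so here's a minimal sketch of turning it into a number. The markup and the price
are made up for this example (only the id matches the one used above), and the cleanup assumes a US-style price such as
"$1,234.56".

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id matches the one used in this tutorial.
sample_html = '<div id="edlpPrice"> $139.99 </div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

price_text = sample_soup.find(id="edlpPrice").get_text(strip=True)

# Drop the currency symbol and thousands separators before converting.
price_value = float(price_text.replace("$", "").replace(",", ""))

print(price_text)
print(price_value)
```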


And finally, for the spirits type, this should be quick since we know what we're looking for at this point. 


The HTML here says that this is a div with a class of "detailsTableText__1SvcRdYn".

In [None]:
spirit_type = soup.find("div", {"class" : "detailsTableText__1SvcRdYn"}).text
print(spirit_type)

Wait a second, that's the brand... Not the spirit type! What gives!? Well, class names aren't unique in HTML, so 
BeautifulSoup's *find* method returned only the first div that had a class with a matching name. What we need to do 
is find **all** of the divs that use that class, then pick the appropriate element by its index in the returned list
to get the spirit type.

In [None]:
divs = soup.find_all("div", {"class" : "detailsTableText__1SvcRdYn"})
for div in divs:
    print(div.text)


It appears that the spirit type is the third element in the list of divs with that class name.

In [None]:
spirit_type = divs[2].text
print(spirit_type)
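
To see the difference between *find* and *find_all* side by side, here's a sketch on a made-up snippet. The class name,
the values, and their order are assumptions chosen to mimic the situation above, not the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three divs sharing one class.
sample_html = """
<div class="detailsTableText">Sample Brand</div>
<div class="detailsTableText">Mexico</div>
<div class="detailsTableText">Tequila</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() stops at the first match...
first_match = sample_soup.find("div", class_="detailsTableText").text

# ...while find_all() returns every match, in document order.
all_matches = [div.text for div in sample_soup.find_all("div", class_="detailsTableText")]

# Guard the index so a changed page layout doesn't raise an IndexError.
sample_type = all_matches[2] if len(all_matches) > 2 else None

print(first_match)
print(all_matches)
print(sample_type)
```

Relying on a fixed index is fragile if the page layout changes, which is why the sketch checks the list length first.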


Success! We've found all the information we wanted for this tutorial. So now we're going to show you how to collect this
information into a neat table-like structure using a **DataFrame** and export that DataFrame to a CSV file, all using
pandas!

We have all of our information in 3 separate variables: *name*, *price*, and *spirit_type*. Next, we're going to 
merge those three into a collection (*namely a list*) and create a Pandas DataFrame out of that list.

In [None]:
details = [name, spirit_type, price]
df = pd.DataFrame([details], columns = ['Spirit Name', 'Spirit Type', 'Price'])
print(df)
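
If you end up scraping more than one product page, the same idea scales to several rows. Here's a sketch, assuming each
scrape is stored as a dictionary; the rows below are invented for illustration, and `scraped_rows` and `products` are
names made up for this example.

```python
import pandas as pd

# Hypothetical scrape results, one dictionary per product page.
scraped_rows = [
    {"Spirit Name": "Sample Blanco", "Spirit Type": "Tequila", "Price": "$25.00"},
    {"Spirit Name": "Sample Reposado", "Spirit Type": "Tequila", "Price": "$35.00"},
]

# Each dictionary becomes a row; the keys become the column names.
products = pd.DataFrame(scraped_rows)

print(products.shape)
print(list(products.columns))
```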


Pandas makes it incredibly easy to structure your data so that it's easy to digest. To conclude this 
tutorial, we'll use Pandas's neat *to_csv* method to write the results of our scrape to a CSV file! This is handy 
because it means we can export our results to a portable file, or even import that same file into a database! 

In [None]:
df.to_csv('total_wine_scrape.csv', index=False)
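
To peek at what ends up in the file without touching the disk, here's a small sketch. It uses a made-up one-row
DataFrame shaped like the one built above; `demo_df`, `csv_text`, and `round_trip` are names chosen for this example.

```python
import io

import pandas as pd

# Hypothetical one-row result, shaped like the DataFrame built above.
demo_df = pd.DataFrame(
    [["Sample Blanco", "Tequila", "$25.00"]],
    columns=["Spirit Name", "Spirit Type", "Price"],
)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = demo_df.to_csv(index=False)
print(csv_text)

# Reading the text back rebuilds the table from the CSV content.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

Passing `index=False` keeps the row numbers out of the output, so the CSV holds only the three columns.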


That concludes our tutorial using BeautifulSoup, Python, and Pandas to scrape websites. If you enjoyed this tutorial,
or would like to provide some constructive feedback, please comment! If you have a request for any other 
related tutorials you would like to see, feel free to comment as well.