<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Web Scraping
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 1: Topic 10</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
   

Previously:
    
- Accessed data via API

Sometimes no programmatic access to data!
- No API exists
- No SQL server to interact with.
- No csv files to download.

Many ecommerce sites: no APIs or databases to interact with.
<br>
<br>

<div>
<center><img src="Images/prod_page.png" width="600"/></center>
</div>
<center> Master of Malt</center>
   

<div>
<center><img src="Images/ardbeg_page.png" width="700"/></center>
</div>   

There is data on the page
<div>
<center><img src="Images/ardbeg_tast_nt.png" width="600"/></center>
</div>   

The data is in the website's source code...
<div>
<center><img src="Images/source_whiskex.png" width="1800"/></center>
</div>
    <center> Data embedded within a soup of HTML tags </center>   

Let's take a look at a very simple sample web site.

#### HyperText Markup Language (HTML)

Tells a browser how to lay out content.

- Consists of elements marked up with tags. 
- The most basic tag is the html tag: it specifies that everything between its opening and closing tags is HTML. 

Take a look at an example website.


| Tag | Function | 
| --- | --- |
| html | Denotes extent of HTML document |
| head | External style sheet definition, metadata, titles |
| title | Web page title |
| body | Specifies main web page content block |
| h1-h6 | Section heading (ordered by decreasing size)|
| p | Represents paragraph |
| div | Defines division or section of document |
| span | Meant for inline or small selection  |
| img | Signifies image and defines source |
| a | Linking to external sites or internal events  |
| ul | Declare unordered (bulleted) list |
| li | List item |

#### CSS (Cascading Style Sheets)

- Uses class and id modifiers on tag.
- Styling:
    - Color
    - Font
    - Spacing,
    - etc.
- Can use external sheet for styling
- Separate content and styling.

#### Structure of tag levels
- HTML document structured as tree structure:
<br>
<br>
<div>
    <center><img src="Images/html_tree.png" width="500"/></center>
</div>

#### Goal
Extract information structured by tags.

- Get HTML documents as text.
- Parse tags and extract data.

#### Web scraping frameworks

<div>
    <center><img src="Images/scrapy.png" width="180"/></center>
</div>
<div>
<center><img src="Images/selenium.png" width="300"/></center>
</div>
<div>
<center><img src="Images/bs4.png" width="300"/></center>
</div>

We will use:

<div>
<center><img src="Images/bs4.png" width="400"/></center>
</div>

<div>
<center><img src="Images/requests.png" width="300"/></center>
</div>

- **Requests**: grab the HTML content as text.
- **BeautifulSoup**: parse the content and extract data.

In [2]:
# import requests
import requests

Make requests on a simple webpage:

In [3]:
sample_url = "http://dataquestio.github.io/web-scraping-pages/simple.html"
r = requests.get(sample_url)
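
A request can fail (bad URL, server error, timeout), so it can help to wrap the call. The sketch below is one possible wrapper, not part of the original lesson: the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception if the request failed."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx codes
    response.raise_for_status()
    return response.text


# Example call (needs network access, so it is left commented out here):
# html = fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")
```

Any failure surfaces as a subclass of `requests.exceptions.RequestException`, which gives a single exception type to catch in a scraping loop.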

Let's get the content:
- .content returns the raw response body as bytes.
- .text returns the same body decoded to a string.

In [4]:
req_content = r.content
req_content 

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
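
The difference between bytes and text can be seen without making a request. This is a small offline sketch: `raw_bytes` is a hand-typed stand-in for `r.content`, and UTF-8 is assumed as the encoding.

```python
# Offline stand-in for r.content: part of the bytes the sample page returned
raw_bytes = b"<p>Here is some simple content for this page.</p>"

# Decoding turns bytes into a str, which is roughly what r.text gives you
# (requests chooses the encoding from the response headers or guesses it)
decoded = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__)
print(type(decoded).__name__)
print(decoded)
```

BeautifulSoup accepts either form as input.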

- Pretty ugly.
- Parse and get relevant data:
    - Want to use HTML tree structure.
    - Class and id structure.
    
BeautifulSoup helps us with this:

In [5]:
from bs4 import BeautifulSoup

Create Soup object with web site content as input.

In [6]:
soup = BeautifulSoup(req_content, 'html.parser') 

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


The Soup object captures the structure and hierarchy of tags and content in the HTML document.

We can traverse the tree hierarchy:

#### Descending through hierarchy

In [8]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [9]:
html_level = soup.html
html_level

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

Tag element contains:
- node tag
- node contents (children nodes, text, etc.)

In [10]:
type(html_level)

bs4.element.Tag

**.name** attribute 

Can get the name of the node that you are at:
- .name attribute of Soup/Tag objects    

In [11]:
html_level.name

'html'

**.contents** attribute

- gets list of tag's children

In [12]:
html_level.contents

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

**.children** attribute

Can also iterate over the children lazily:
- .children returns an iterator, whereas .contents returns the full list of children.
- useful when building list comprehensions from the tag's children.

In [13]:
html_level.children

<list_iterator at 0x1c0b94d8f40>

Get the name of the tags of html's direct children:
- need to exclude line breaks.

In [14]:
children_names = \
[child.name for child 
 in html_level.children if child != '\n']

In [15]:
children_names

['head', 'body']
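
Comparing against `'\n'` works for this page, but whitespace between tags is not always a single line break. A slightly more general sketch keeps only children that are `Tag` objects. The HTML string here is a hand-typed copy of the sample page so the example runs offline.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Whitespace between tags is stored as NavigableString objects, not Tags,
# so an isinstance check filters it out regardless of its exact content
tag_children = [child.name for child in soup.html.children
                if isinstance(child, Tag)]
print(tag_children)
```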

Let's access the body child and go down the branch:
- Can address body child as an attribute of previous level.

In [16]:
body_level = html_level.body
body_level

<body>
<p>Here is some simple content for this page.</p>
</body>

There's another level left down this branch:

In [17]:
body_level

<body>
<p>Here is some simple content for this page.</p>
</body>

Accessing the paragraph <p> child:

In [18]:
p_level = body_level.p
p_level

<p>Here is some simple content for this page.</p>

**.text** attribute

Get the text inside the tag:
- .text attribute

In [19]:
p_level.text

'Here is some simple content for this page.'

#### Going up levels:
**.parent** attribute:
- We can also go the other way:

In [20]:
p_level.parent

<body>
<p>Here is some simple content for this page.</p>
</body>

Not too shabby.

#### Going sideways
- Traversing through siblings

**.previous_siblings** attribute
- generator that creates previous siblings

In [21]:
html_level

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [22]:
body_level

<body>
<p>Here is some simple content for this page.</p>
</body>

In [23]:
prev_sibs = body_level.previous_siblings
prev_sibs

<generator object PageElement.previous_siblings at 0x000001C0BA802200>

Traversing the generator:
- next() steps backwards through the previous siblings
- raises StopIteration once the previous siblings are exhausted

In [24]:
next(prev_sibs)

'\n'

Can be used in a list comprehension as well:
- get tag names of previous siblings excluding line breaks

In [25]:
prevsib_names = [prev_sib.name for prev_sib in body_level.previous_siblings if prev_sib != "\n"]
prevsib_names

['head']

**.next_siblings** attribute: 
- does the same thing but for siblings following the current tag

In [26]:
head_level = html_level.head
list(head_level.next_siblings)

['\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']
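
When only the neighbouring tags matter, `find_next_sibling()` and `find_previous_sibling()` skip over the whitespace strings. A minimal sketch on a hand-typed copy of the sample page:

```python
from bs4 import BeautifulSoup

html = """<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
head = soup.html.head

# .next_sibling returns whatever comes next, including a whitespace string
print(repr(head.next_sibling))

# find_next_sibling("body") searches the following siblings for a body tag
body = head.find_next_sibling("body")
print(body.name)

# find_previous_sibling works the same way in the other direction
print(body.find_previous_sibling("head").name)
```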

The previous website was very simple. Websites usually have more complex tree structures:
- A given tag can have many children of the same type. Want all children of a given type.
- Dealing with nested structures: divs within divs 
- A set of children with a given class. 
- Specific tag with a unique id  

A more complex but still simple example might help:

In [27]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Going down to the body level:

In [28]:
body_level = soup.html.body
body_level

<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

There are many p tags.
- Want all p tags:

In [29]:
body_level.p

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

Only got the first.

We need:
    
.find_all() 
- finds all instances of specified tags contained in current node.
- returns a list

In [30]:
body_level.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
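
The text inside these tags is padded with line breaks and spaces. One way to get clean strings is `get_text(strip=True)`. The sketch below uses a hand-typed copy of the body so it runs offline.

```python
from bs4 import BeautifulSoup

html = """<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>"""

body = BeautifulSoup(html, "html.parser").body

# strip=True trims the whitespace around each piece of text in the tag
paragraph_texts = [p.get_text(strip=True) for p in body.find_all("p")]
print(paragraph_texts)
```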

But let's take a closer look at the body structure:
- .prettify() can sometimes be useful

In [31]:
print(body_level.prettify())

<body>
 <div>
  <p class="inner-text first-item" id="first">
   First paragraph.
  </p>
  <p class="inner-text">
   Second paragraph.
  </p>
 </div>
 <p class="outer-text first-item" id="second">
  <b>
   First outer paragraph.
  </b>
 </p>
 <p class="outer-text">
  <b>
   Second outer paragraph.
  </b>
 </p>
</body>



**Nested structures**
- One set of paragraphs p tags contained in a div
- Other set as direct children of body.

May want to access p tags that are direct children.

.find_all() has a recursive argument (True by default)
- recursive = False:
- gets only immediate children satisfying the requirement

In [32]:
body_level.find_all('p', recursive = False)

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

These are the paragraph tags on the outer level.
- direct children of the body

**Exercise**: get me the paragraph Tags nested in the div layer.

In [33]:
# do it!


**Solution**:

In [34]:
body_level.div.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

In [35]:
# .find() is useful when the div has to be selected
# by class or id arguments

body_level.find('div').find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

#### Class and id selectors

- a grouping of tags (class)
- naming a specific tag instance (id).

Used in CSS styling.

Can also use this for data selection / scraping.

**Class and id selectors with**:
- .find()
- .find_all()

Take additional arguments for class/id

In [36]:
print(body_level.prettify())

<body>
 <div>
  <p class="inner-text first-item" id="first">
   First paragraph.
  </p>
  <p class="inner-text">
   Second paragraph.
  </p>
 </div>
 <p class="outer-text first-item" id="second">
  <b>
   First outer paragraph.
  </b>
 </p>
 <p class="outer-text">
  <b>
   Second outer paragraph.
  </b>
 </p>
</body>



Get the group of paragraph tags in the inner-text class.

In [37]:
body_level.find_all('p', class_ = 'inner-text')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

In [38]:
body_level

<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

Extract by id:

In [40]:
body_level.find('p', id = 'second')

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
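
The same selections can also be written as CSS selectors with `.select()` and `.select_one()`, which BeautifulSoup provides alongside `.find()` and `.find_all()`. This is an alternative sketch on a hand-typed copy of the page body, not a replacement for the approach used in this lesson.

```python
from bs4 import BeautifulSoup

html = """<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# "p.inner-text" means: p tags whose class list includes inner-text
inner = soup.select("p.inner-text")

# "#second" means: the tag with id="second"; select_one returns the first match
second = soup.select_one("#second")

# "body > p" means: p tags that are direct children of body
outer = soup.select("body > p")

print(len(inner), second["class"], len(outer))
```

Note that `class` is a multi-valued attribute, so `second["class"]` is a list of class names rather than a single string.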

#### Going back to our whisky page

- Get bottling details (age, ABV, distillery, etc)

In [41]:
ardbeg_url = "https://www.thewhiskyexchange.com/p/114/ardbeg-uigeadail"

In [42]:
ardbeg_req = requests.get(ardbeg_url)
ardbeg_soup = BeautifulSoup(ardbeg_req.content, 'html.parser')

Let's get the whisky facts:
- Bottler
- Country
- Chill filtered
- etc.

In [43]:
# returns the match as a tag object
prod_fact = ardbeg_soup.find('ul', 
                 class_ = "product-facts" )

prod_fact

<ul class="product-facts">
<li class="product-facts__item">
<img alt="Bottler Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--bottler.svg" width="400"/>
<h3 class="product-facts__type">Bottler</h3>
<p class="product-facts__data">Distillery Bottling</p>
</li>
<li class="product-facts__item">
<img alt="Country Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--country.svg" width="400"/>
<h3 class="product-facts__type">Country</h3>
<p class="product-facts__data">Scotland</p>
</li>
<li class="product-facts__item">
<img alt="Region Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--region.svg" width="400"/>
<h3 class="product-facts__type">Region</h3>
<p class="product-facts__data">Islay</p>
</li>
<li class="product-facts__item">
<img alt="Chill Filtered Icon" class="product-facts__icon" height

We clearly have a list in which each element (li) contains:
- an icon image (img tag)
- key in h3 tag of class "product-facts__type"
- value in p tag of class "product-facts__data"


Let's get the first li item:
- prod_fact.find('li')
- prod_fact.li

In [44]:
first_li_elem = prod_fact.find('li')
first_li_elem

<li class="product-facts__item">
<img alt="Bottler Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--bottler.svg" width="400"/>
<h3 class="product-facts__type">Bottler</h3>
<p class="product-facts__data">Distillery Bottling</p>
</li>

Now we want the key-value pairs:

In [45]:
# using .find() because we know there is only one of these tags
detail_key = first_li_elem.find('h3', class_ = "product-facts__type").text
detail_key

'Bottler'

In [46]:
# using .find() because we know there is only one of these tags
detail_val = first_li_elem.find('p', class_ = "product-facts__data").text
detail_val

'Distillery Bottling'

We can get the key,value pairs for all of these list elements in the product fact list:
- iterate over .children
- or use .find_all('li')

In [47]:
factTag_list = prod_fact.find_all('li')
factTag_list

[<li class="product-facts__item">
 <img alt="Bottler Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--bottler.svg" width="400"/>
 <h3 class="product-facts__type">Bottler</h3>
 <p class="product-facts__data">Distillery Bottling</p>
 </li>,
 <li class="product-facts__item">
 <img alt="Country Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--country.svg" width="400"/>
 <h3 class="product-facts__type">Country</h3>
 <p class="product-facts__data">Scotland</p>
 </li>,
 <li class="product-facts__item">
 <img alt="Region Icon" class="product-facts__icon" height="400" loading="lazy" src="/media/rtwe/assets/application/images/facts/fact--region.svg" width="400"/>
 <h3 class="product-facts__type">Region</h3>
 <p class="product-facts__data">Islay</p>
 </li>,
 <li class="product-facts__item">
 <img alt="Chill Filtered Icon" class="product-facts__icon" height="400" 

Each element is a Tag object.

In [48]:
print(type(factTag_list[0]))

<class 'bs4.element.Tag'>


Let's iterate over the list and extract keys and values:

In [49]:
data_dict = {}

for elem in factTag_list:
    
    detail_key = elem.find('h3', class_ = "product-facts__type").text
    detail_val = elem.find('p', class_ = "product-facts__data").text
    
    data_dict.update({detail_key: detail_val})

Take a look at our data dictionary:

In [50]:
data_dict

{'Bottler': 'Distillery Bottling',
 'Country': 'Scotland',
 'Region': 'Islay',
 'Chill Filtered': 'No',
 'Colouring': 'Yes',
 'Certification': 'Vegan'}
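
The same key-value extraction can be written as a dict comprehension. The sketch below runs on a small hand-typed fragment that mimics the product-facts markup (two facts only, icons omitted), so it does not need the live page. The name `facts_dict` is just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul class="product-facts">
  <li class="product-facts__item">
    <h3 class="product-facts__type">Bottler</h3>
    <p class="product-facts__data">Distillery Bottling</p>
  </li>
  <li class="product-facts__item">
    <h3 class="product-facts__type">Country</h3>
    <p class="product-facts__data">Scotland</p>
  </li>
</ul>
"""

facts = BeautifulSoup(html, "html.parser").find("ul", class_="product-facts")

# One key-value pair per li: h3 text is the key, p text is the value
facts_dict = {
    li.find("h3", class_="product-facts__type").get_text(strip=True):
    li.find("p", class_="product-facts__data").get_text(strip=True)
    for li in facts.find_all("li")
}
print(facts_dict)
```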

Starting to look a lot like data that could be a row in a table or DataFrame.

Let's try and extract some other information about this whisky as well:
- Get the header for the Flavour Profile subsection
- Get the contents

- notice that the section header is an h2 tag with id = FlavourProfile
- let's access this h2

In [51]:
flavorheader = ardbeg_soup.find(
    'h2', id = "FlavourProfile")
flavorheader

<h2 class="product-title product-title--bravo" id="FlavourProfile">Flavour Profile</h2>

In [52]:
header_text = flavorheader.text
header_text

'Flavour Profile'

- the corresponding content is in a div with class = "flavour-profile"
- let's access this

In [56]:
flavor_content = ardbeg_soup.find(
    'div', class_ = "flavour-profile")

print(flavor_content.prettify())

<div class="flavour-profile">
 <div class="flavour-profile__group flavour-profile__group--style">
  <h3 class="flavour-profile__title">
   Style
  </h3>
  <ul class="flavour-profile__list flavour-profile__list--style">
   <li class="flavour-profile__item flavour-profile__item--style">
    <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5">
    </div>
    <span class="flavour-profile__label">
     Body
    </span>
   </li>
   <li class="flavour-profile__item flavour-profile__item--style">
    <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5">
    </div>
    <span class="flavour-profile__label">
     Richness
    </span>
   </li>
   <li class="flavour-profile__item flavour-profile__item--style">
    <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="5" data-text="5" data-total="5">
    </div>
    <span class="flavour-profile__label">
     Smoke
    </span>
  

The whisky has four flavor scores that we are interested in extracting:
- Body
- Richness
- Smoke
- Sweetness

These are contained in the first of the children div nodes.

A list with the strength of various taste characteristics:

In [60]:
flavor_style = flavor_content.div
flavor_style

<div class="flavour-profile__group flavour-profile__group--style">
<h3 class="flavour-profile__title">Style</h3>
<ul class="flavour-profile__list flavour-profile__list--style">
<li class="flavour-profile__item flavour-profile__item--style">
<div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5"></div>
<span class="flavour-profile__label">Body</span>
</li>
<li class="flavour-profile__item flavour-profile__item--style">
<div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5"></div>
<span class="flavour-profile__label">Richness</span>
</li>
<li class="flavour-profile__item flavour-profile__item--style">
<div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="5" data-text="5" data-total="5"></div>
<span class="flavour-profile__label">Smoke</span>
</li>
<li class="flavour-profile__item flavour-profile__item--style">
<div class="flavour-profile__gauge js-flavour-profile__gauge"

There is another sibling div that contains other information:
- A list with similar taste descriptors

In [65]:
# first sibling is a \n character
flavor_style.next_sibling.next_sibling

<div class="flavour-profile__group flavour-profile__group--character">
<h3 class="flavour-profile__title">Character</h3>
<ul class="flavour-profile__list flavour-profile__list--character">
<li class="flavour-profile__item flavour-profile__item--character">
<img alt="Honey " class="flavour-profile__image" height="180" loading="lazy" src="/media/rtwe/uploads/flavour/honey.png" width="180"/>
<span class="flavour-profile__label">Honey </span>
</li>
<li class="flavour-profile__item flavour-profile__item--character">
<img alt="Malt" class="flavour-profile__image" height="180" loading="lazy" src="/media/rtwe/uploads/flavour/malt.png" width="180"/>
<span class="flavour-profile__label">Malt</span>
</li>
<li class="flavour-profile__item flavour-profile__item--character">
<img alt="Fruit Cake" class="flavour-profile__image" height="180" loading="lazy" src="/media/rtwe/uploads/flavour/fruitcake.png" width="180"/>
<span class="flavour-profile__label">Fruit Cake</span>
</li>
<li class="flavour-profi

In [66]:
print(flavor_style.prettify())

<div class="flavour-profile__group flavour-profile__group--style">
 <h3 class="flavour-profile__title">
  Style
 </h3>
 <ul class="flavour-profile__list flavour-profile__list--style">
  <li class="flavour-profile__item flavour-profile__item--style">
   <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5">
   </div>
   <span class="flavour-profile__label">
    Body
   </span>
  </li>
  <li class="flavour-profile__item flavour-profile__item--style">
   <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5">
   </div>
   <span class="flavour-profile__label">
    Richness
   </span>
  </li>
  <li class="flavour-profile__item flavour-profile__item--style">
   <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="5" data-text="5" data-total="5">
   </div>
   <span class="flavour-profile__label">
    Smoke
   </span>
  </li>
  <li class="flavour-profile__item flavour-profil

Get the keys for the flavor profile:
- Contained as text in span of class="flavour-profile__label"

In [67]:
flav_key_spans  = \
flavor_style.find_all(
    'span', class_ = 'flavour-profile__label')

In [69]:
flav_profile_keys = \
[ span.text for span in flav_key_spans ]
flav_profile_keys

['Body', 'Richness', 'Smoke', 'Sweetness']

For getting the values:
- Value is inside the tag as the data-text attribute

**How can we extract it**?

In [76]:
flav_profile_gauges = \
flavor_style.find_all(
    'div', class_ = 'flavour-profile__gauge')

flav_profile_gauges

[<div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5"></div>,
 <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5"></div>,
 <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="5" data-text="5" data-total="5"></div>,
 <div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="2" data-text="2" data-total="5"></div>]

Tags are addressable as dictionaries:
- tag attribute name is key

In [78]:
flav_profile_gauges[0]

<div class="flavour-profile__gauge js-flavour-profile__gauge" data-part="4" data-text="4" data-total="5"></div>

In [80]:
flav_profile_gauges[0]['data-text']

'4'
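
Square-bracket access raises a `KeyError` when the attribute is absent. `Tag.get()` returns `None` (or a default) instead, and `.attrs` exposes all attributes as a dictionary. A short sketch on one hand-typed gauge tag (the attribute name `data-missing` is made up to show the fallback):

```python
from bs4 import BeautifulSoup

gauge_html = ('<div class="flavour-profile__gauge js-flavour-profile__gauge" '
              'data-part="4" data-text="4" data-total="5"></div>')
gauge = BeautifulSoup(gauge_html, "html.parser").div

# Dictionary-style access, as in the cell above
print(gauge["data-text"])

# .get() returns None, or the given default, when the attribute is missing
print(gauge.get("data-missing"))
print(gauge.get("data-missing", "0"))

# .attrs is the full attribute dictionary; class comes back as a list
print(gauge.attrs)
```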

Extracting the values for the flavor profile is straightforward:

In [84]:
value_list = [gauge['data-text'] for gauge in flav_profile_gauges]
value_list

['4', '4', '5', '2']

In [None]:
# Zipping this together and making a dictionary
flav_dict = \
dict(zip(flav_profile_keys, value_list))

flav_dict
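
The scraped values are strings. If the scores are going to be used numerically, they can be converted while zipping. A minimal sketch with the keys and values typed in by hand (`flav_scores` is an illustrative name):

```python
flav_profile_keys = ["Body", "Richness", "Smoke", "Sweetness"]
value_list = ["4", "4", "5", "2"]

# zip pairs each key with its value; int() converts the score strings
flav_scores = {key: int(value)
               for key, value in zip(flav_profile_keys, value_list)}
print(flav_scores)
```

Keep in mind that `zip` stops at the shorter of the two lists, so a missing gauge would silently drop the last label.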

And we can update our data dictionary:

In [91]:
# and we can update our data dictionary
data_dict.update(flav_dict)

data_dict

{'Bottler': 'Distillery Bottling',
 'Country': 'Scotland',
 'Region': 'Islay',
 'Chill Filtered': 'No',
 'Colouring': 'Yes',
 'Certification': 'Vegan',
 'Body': '4',
 'Richness': '4',
 'Smoke': '5',
 'Sweetness': '2'}

There are many places where we can get the name of the Whisky:
- A meta tag below the html level
- with attributes name = 'twitter:title' and content = "Ardbeg Uigeadail"

.find() and .find_all()
- have a way to select a tag by its attributes
- attrs takes a dictionary of attribute names and values

In [100]:
name_meta_tag = ardbeg_soup.html.find('meta', attrs = {'name': 'twitter:title'})
name_meta_tag

<meta content="Ardbeg Uigeadail" name="twitter:title"/>

Extract the content attribute from tag:
- update data dictionary

In [101]:
name_meta_tag['content']

'Ardbeg Uigeadail'

In [103]:
data_dict.update(
    {'name': name_meta_tag['content'] })

data_dict

{'Bottler': 'Distillery Bottling',
 'Country': 'Scotland',
 'Region': 'Islay',
 'Chill Filtered': 'No',
 'Colouring': 'Yes',
 'Certification': 'Vegan',
 'Body': '4',
 'Richness': '4',
 'Smoke': '5',
 'Sweetness': '2',
 'name': 'Ardbeg Uigeadail'}

A lot of:
- HTML tree traversal
- exploring the page structure
- searches to extract data

When each product site has same tagging structure:

- Build function that extracts data like we did.
- Can be used while looping through multiple products.

#### Build Function

In [108]:
def extract_data_dict(product_page):
    
    data_dict = {}
    product_req = requests.get(product_page)
    product_soup = BeautifulSoup(product_req.content, 'html.parser')
    
    # get name
    name_meta_tag = product_soup.html.find('meta', attrs = {'name': 'twitter:title'})
    data_dict.update({'name': name_meta_tag['content'] })
    
    # get product facts and extract information

    prod_fact = product_soup.find('ul', class_ = "product-facts" )
    
    # loops through to update data_dict with bottling information
    for elem in prod_fact.find_all('li'):
    
        detail_key = elem.find('h3', class_ = "product-facts__type").text
        detail_val = elem.find('p', class_ = "product-facts__data").text
        data_dict.update({detail_key: detail_val})
    
    # get flavor ratings
    flavor_style = product_soup.find('div', class_ = "flavour-profile").div
    flav_profile_keys = [ span.text for span in flavor_style.find_all('span', class_ = 'flavour-profile__label') ]
    value_list = [gauge['data-text'] for gauge in flavor_style.find_all('div', class_ = 'flavour-profile__gauge')]
    data_dict.update(dict(zip(flav_profile_keys, value_list)))

    return data_dict

Use function on a product page url:

In [110]:
extract_data_dict('https://www.thewhiskyexchange.com/p/114/ardbeg-uigeadail')

{'name': 'Ardbeg Uigeadail',
 'Bottler': 'Distillery Bottling',
 'Country': 'Scotland',
 'Region': 'Islay',
 'Chill Filtered': 'No',
 'Colouring': 'Yes',
 'Certification': 'Vegan',
 'Body': '4',
 'Richness': '4',
 'Smoke': '5',
 'Sweetness': '2'}

#### Crawling a page of products

Let's end this by applying our function to a page with a list of products:
- need to extract product list
- get link urls
- apply our function to the link urls.

Soupify the products page.

In [137]:
scotchproducts_page = "https://www.thewhiskyexchange.com/c/40/single-malt-scotch-whisky"
prodpage_req = requests.get(scotchproducts_page)
scotchproducts_soup = BeautifulSoup(prodpage_req.content, 'html.parser')

Parse the source code and get a list of the product urls:

In [144]:
li_items = scotchproducts_soup.find_all('li', class_="product-grid__item")
# hrefs on the page already start with '/', so the base URL has no trailing slash
prod_urls = [ 'https://www.thewhiskyexchange.com' + elem.a['href'] for elem in li_items][0:10]
prod_urls

['https://www.thewhiskyexchange.com/p/47802/deanston-18-year-old',
 'https://www.thewhiskyexchange.com/p/3121/lagavulin-16-year-old',
 'https://www.thewhiskyexchange.com/p/60567/balvenie-21-year-old-second-red-rose-stories',
 'https://www.thewhiskyexchange.com/p/3512/macallan-12-year-old-sherry-oak',
 'https://www.thewhiskyexchange.com/p/52345/mortlach-15-year-old-game-of-thrones-six-kingdoms',
 'https://www.thewhiskyexchange.com/p/63743/laphroaig-10-year-old-cask-strength-batch-014-bot2021',
 'https://www.thewhiskyexchange.com/p/232/arran-10-year-old',
 'https://www.thewhiskyexchange.com/p/34537/macallan-12-year-old-double-cask',
 'https://www.thewhiskyexchange.com/p/284/balvenie-12-year-old-doublewood',
 'https://www.thewhiskyexchange.com/p/9731/glendronach-15-year-old-revival-sherry-cask']
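
Site-relative links can also be resolved against a base URL with the standard library's `urllib.parse.urljoin`, which takes care of the slash between the two parts. A small sketch using two of the product paths from this page (`base_url` and `hrefs` are illustrative names):

```python
from urllib.parse import urljoin

base_url = "https://www.thewhiskyexchange.com/"
hrefs = ["/p/47802/deanston-18-year-old", "/p/3121/lagavulin-16-year-old"]

# urljoin resolves each href relative to the base, avoiding doubled slashes
full_urls = [urljoin(base_url, href) for href in hrefs]
print(full_urls)
```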

Now apply our function:
- Thread it through the list
- map() is useful here.

A list of dicts.

In [147]:
extracted_data = \
list(map(extract_data_dict, prod_urls))

extracted_data

[{'name': 'Deanston 18 Year Old',
  'Bottler': 'Distillery Bottling',
  'Age': '18 Year Old',
  'Country': 'Scotland',
  'Region': 'Highland',
  'Cask Type': 'First-Fill Bourbon',
  'Chill Filtered': 'No',
  'Body': '3',
  'Richness': '4',
  'Smoke': '0',
  'Sweetness': '3'},
 {'name': 'Lagavulin 16 Year Old',
  'Bottler': 'Distillery Bottling',
  'Age': '16 Year Old',
  'Country': 'Scotland',
  'Region': 'Islay',
  'Colouring': 'Yes',
  'Body': '4',
  'Richness': '3',
  'Smoke': '4',
  'Sweetness': '3'},
 {'name': 'Balvenie 21 Year Old - Second Red Rose - Stories',
  'Bottler': 'Distillery Bottling',
  'Age': '21 Year Old',
  'Country': 'Scotland',
  'Region': 'Speyside',
  'Cask Type': 'Australian Shiraz Wine Finish',
  'Body': '3',
  'Richness': '3',
  'Smoke': '0',
  'Sweetness': '3'},
 {'name': 'Macallan 12 Year Old - Sherry Oak',
  'Bottler': 'Distillery Bottling',
  'Age': '12 Year Old',
  'Country': 'Scotland',
  'Region': 'Speyside',
  'Cask Type': 'Sherry',
  'Colouring': 'No

This is a tabular format:
- put this in a dataframe.
- can save to csv, etc.

In [156]:
import pandas as pd
whisky_df = pd.DataFrame(extracted_data)
whisky_df

| | name | Bottler | Age | Country | Region | Cask Type | Chill Filtered | Body | Richness | Smoke | Sweetness | Colouring | Bottling Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Deanston 18 Year Old | Distillery Bottling | 18 Year Old | Scotland | Highland | First-Fill Bourbon | No | 3 | 4 | 0 | 3 | | |
| 1 | Lagavulin 16 Year Old | Distillery Bottling | 16 Year Old | Scotland | Islay | | | 4 | 3 | 4 | 3 | Yes | |
| 2 | Balvenie 21 Year Old - Second Red Rose - Stories | Distillery Bottling | 21 Year Old | Scotland | Speyside | Australian Shiraz Wine Finish | | 3 | 3 | 0 | 3 | | |
| 3 | Macallan 12 Year Old - Sherry Oak | Distillery Bottling | 12 Year Old | Scotland | Speyside | Sherry | | 3 | 3 | 1 | 2 | No | |
| 4 | Mortlach 15 Year Old - Game of Thrones Six Kin... | Distillery Bottling | 15 Year Old | Scotland | Speyside | Bourbon Barrel Finish | | 4 | 3 | 0 | 3 | | |
| 5 | Laphroaig 10 Year Old - Cask Strength - Batch ... | Distillery Bottling | 10 Year Old | Scotland | Islay | | | 4 | 2 | 4 | 3 | | June 2021 |
| 6 | Arran 10 Year Old | Distillery Bottling | 10 Year Old | Scotland | Island | | No | 2 | 2 | 0 | 2 | No | |
| 7 | Macallan 12 Year Old Double Cask | Distillery Bottling | 12 Year Old | Scotland | Speyside | Oloroso Sherry | | 3 | 3 | 0 | 2 | No | |
| 8 | Balvenie 12 Year Old DoubleWood | Distillery Bottling | 12 Year Old | Scotland | Speyside | Sherry Finish | | 3 | 4 | 0 | 4 | Yes | |
| 9 | Glendronach 15 Year Old Revival - Sherry Cask | Distillery Bottling | 15 Year Old | Scotland | Highland | Sherry | | 4 | 4 | 0 | 2 | No | |
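
The flavour scores arrive as strings, and facts missing from a page become missing values in the DataFrame. The sketch below shows one way to convert the score columns to numbers, using two hand-typed records taken from the output above (`records`, `score_cols` and `demo_df` are illustrative names).

```python
import pandas as pd

records = [
    {"name": "Deanston 18 Year Old", "Body": "3", "Richness": "4",
     "Smoke": "0", "Sweetness": "3"},
    {"name": "Lagavulin 16 Year Old", "Colouring": "Yes", "Body": "4",
     "Richness": "3", "Smoke": "4", "Sweetness": "3"},
]

demo_df = pd.DataFrame(records)

# Convert the score columns from strings to numbers, one column at a time
score_cols = ["Body", "Richness", "Smoke", "Sweetness"]
demo_df[score_cols] = demo_df[score_cols].apply(pd.to_numeric)

print(demo_df.dtypes)
# Keys absent from a record show up as missing values
print(demo_df["Colouring"].isna().tolist())
```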


If you want to scrape the entire set of product pages:
- need to loop through product list pages.
- extract all urls.
- apply function to url list.

**Not always easy**
- Some product pages have some tag elements missing
- slightly different page structure.
- Error handling required.
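
`.find()` returns `None` when nothing matches, so chaining `.text` onto a missing tag raises an `AttributeError`. One possible pattern is a small helper that tolerates missing pieces. The helper name `safe_text` and the HTML fragment (where the second item deliberately lacks its value) are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup


def safe_text(parent, name, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    if parent is None:
        return None
    found = parent.find(name, class_=class_name)
    return found.get_text(strip=True) if found is not None else None


html = """<ul class="product-facts">
<li class="product-facts__item">
  <h3 class="product-facts__type">Bottler</h3>
  <p class="product-facts__data">Distillery Bottling</p>
</li>
<li class="product-facts__item">
  <h3 class="product-facts__type">Region</h3>
</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

facts = {}
for li in soup.find_all("li", class_="product-facts__item"):
    key = safe_text(li, "h3", "product-facts__type")
    value = safe_text(li, "p", "product-facts__data")
    if key is not None:
        facts[key] = value  # value may be None when the p tag is missing

# A section that does not exist on the page comes back as None
flavour_section = soup.find("div", class_="flavour-profile")

print(facts)
print(flavour_section)
```

Checking for `None` before descending further lets the scrape continue on pages with a slightly different structure.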

#### A final important etiquette note

**Throttle requests**
- add a delay between successive requests/product page scrapes
- servers limit the scrape rate: 
    - they may cut off your access if you scrape too much, too fast

- time.sleep(t) pauses execution
- t is in seconds

Try one request per second.
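
A fixed pause can be combined with a small random extra so that requests are not perfectly regular. This is a sketch of that idea, not a rule from any particular site: the helper name `polite_pause` and its default values are assumptions, and the demo loop uses very short delays and placeholder page names so it finishes quickly.

```python
import random
import time


def polite_pause(base_delay=1.0, jitter=0.5):
    """Sleep for base_delay seconds plus a random extra of up to jitter seconds."""
    delay = base_delay + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


# Placeholder names standing in for real product URLs
pages = ["page-1", "page-2", "page-3"]

start = time.monotonic()
for page in pages:
    polite_pause(base_delay=0.1, jitter=0.05)  # shortened for the demo
    # a real scraper would request and parse the page here
elapsed = time.monotonic() - start

print(f"Visited {len(pages)} pages in {elapsed:.2f} seconds")
```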

In [161]:
import time

def extract_data_dict(product_page):
    
    # essential to avoid getting blocked by the server
    time.sleep(1)  # waits 1 second before making the request
    
    data_dict = {}
    product_req = requests.get(product_page)
    product_soup = BeautifulSoup(product_req.content, 'html.parser')
    
    # get name
    name_meta_tag = product_soup.html.find('meta', attrs = {'name': 'twitter:title'})
    data_dict.update({'name': name_meta_tag['content'] })
    
    # get product facts and extract information

    prod_fact = product_soup.find('ul', class_ = "product-facts" )
    
    # loops through to update data_dict with bottling information
    for elem in prod_fact.find_all('li'):
    
        detail_key = elem.find('h3', class_ = "product-facts__type").text
        detail_val = elem.find('p', class_ = "product-facts__data").text
        data_dict.update({detail_key: detail_val})
    
    # get flavor ratings
    flavor_style = product_soup.find('div', class_ = "flavour-profile").div
    flav_profile_keys = [ span.text for span in flavor_style.find_all('span', class_ = 'flavour-profile__label') ]
    value_list = [gauge['data-text'] for gauge in flavor_style.find_all('div', class_ = 'flavour-profile__gauge')]
    data_dict.update(dict(zip(flav_profile_keys, value_list)))

    return data_dict

Running the data extractor will take more time:

In [162]:
extracted_data = \
list(map(extract_data_dict, prod_urls))

extracted_data

[{'name': 'Deanston 18 Year Old',
  'Bottler': 'Distillery Bottling',
  'Age': '18 Year Old',
  'Country': 'Scotland',
  'Region': 'Highland',
  'Cask Type': 'First-Fill Bourbon',
  'Chill Filtered': 'No',
  'Body': '3',
  'Richness': '4',
  'Smoke': '0',
  'Sweetness': '3'},
 {'name': 'Lagavulin 16 Year Old',
  'Bottler': 'Distillery Bottling',
  'Age': '16 Year Old',
  'Country': 'Scotland',
  'Region': 'Islay',
  'Colouring': 'Yes',
  'Body': '4',
  'Richness': '3',
  'Smoke': '4',
  'Sweetness': '3'},
 {'name': 'Balvenie 21 Year Old - Second Red Rose - Stories',
  'Bottler': 'Distillery Bottling',
  'Age': '21 Year Old',
  'Country': 'Scotland',
  'Region': 'Speyside',
  'Cask Type': 'Australian Shiraz Wine Finish',
  'Body': '3',
  'Richness': '3',
  'Smoke': '0',
  'Sweetness': '3'},
 {'name': 'Macallan 12 Year Old - Sherry Oak',
  'Bottler': 'Distillery Bottling',
  'Age': '12 Year Old',
  'Country': 'Scotland',
  'Region': 'Speyside',
  'Cask Type': 'Sherry',
  'Colouring': 'No

Given:

- errors might pop up
- connection may be severed
- with wait time: takes a long time

Good idea to append requested data to a .json or .csv file as you scrape.


#### Challenge #1
- write a function that takes in a list of urls and a path name for a .json file
- creates empty .json file if it does not exist
- applies extract_data_dict to a product url
- gets data for one product and appends to json
- loops through all urls.

#### Challenge #2

- modify extract_data_dict:
    - include the character descriptors as a list and update dictionary with {'character': [...]} 
    - e.g., {'character': ['mint', 'pear', 'cherry', 'chocolate', 'coffee']} 
- Apply the function from Challenge #1 to prod_urls, using the modified extract_data_dict
- Send Praveen your resultant .json file