# Data Wrangling with Python - Basics of Web Scraping

## An intro to web scraping

Web scraping is the process of collecting structured web data in an automated fashion. It's also called <b>web data extraction</b>. People and businesses use it to draw on the large amount of publicly available web data and make better-informed decisions.

It can help in scenarios like:
* Competitor price monitoring
* Sentiment analysis through user reviews
* Academic research
* Extracting insights from news stories

**BeautifulSoup** is a Python library that parses HTML and XML and helps you pull data out of it: <a href="https://beautiful-soup-4.readthedocs.io/en/latest/#">Documentation</a>

Commonly used attributes and methods in BeautifulSoup:
* `.text`
* `.name`
* `.parent`
* `.find()`
* `.find_all()`
* `.select()`

#### Import Library

In [1]:
from bs4 import BeautifulSoup

## Finding tags and content in an HTML document

### `.find()`

You can find different parts of an HTML document by referencing the tag name. First, let's store an HTML string in a BeautifulSoup object called <b>soup</b>

In [2]:
html_text = """
<html>
    <head>
        <title class="sitetitle">Tony's Suits</title>
        <h1> Welcome to Tony's Suits </h1>
    </head>
    <body>
    This is the latest collection of Iron Man toys.
    </body>
</html>
"""

In [3]:
soup = BeautifulSoup(html_text)
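
The call above does not name a parser, so BeautifulSoup picks the best one installed and usually emits a warning about it. A minimal sketch of naming the parser explicitly, assuming Python's built-in `html.parser` is acceptable here (`sample_html` and `sample_soup` are illustrative names, not part of the notebook):

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the notebook's html_text
sample_html = (
    "<html><head><title class='sitetitle'>Tony's Suits</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Naming the parser keeps the result the same across environments
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_text = sample_soup.title.text
print(title_text)
```

Different parsers can build slightly different trees from invalid markup, so the outputs shown in this notebook may vary with the parser in use.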

Let's try accessing the different parts of the above HTML

In [4]:
soup.find('title')

<title class="sitetitle">Tony's Suits</title>

In [5]:
soup.find('title', class_='sitetitle')

<title class="sitetitle">Tony's Suits</title>

You can also reference known HTML tags directly using dot notation

In [6]:
soup.title

<title class="sitetitle">Tony's Suits</title>

In [7]:
soup.h1

<h1> Welcome to Tony's Suits </h1>

You can move one level up the hierarchy using the <b>parent</b> attribute

In [8]:
soup.title.parent

<head>
<title class="sitetitle">Tony's Suits</title>
</head>

In [9]:
soup.title.parent.name

'head'
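
When nothing matches, `.find()` returns `None`, so chaining `.text` onto the result would raise an `AttributeError`. A small sketch of guarding against that, using illustrative markup and names (`sample_soup`, `footer`) that are not part of the notebook:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<html><head><title>Tony's Suits</title></head></html>", "html.parser"
)

# There is no <footer> tag in this markup, so find() returns None
footer = sample_soup.find("footer")

# Check for None before reading attributes of the result
footer_text = footer.text if footer is not None else "(no footer found)"
print(footer_text)
```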

### `.find_all()`

You can use `.find_all()` to get all the matches in a single Python list. Let's use the HTML below to understand how it works:

In [10]:
html_text = """
<html>
    <body>
        <center>
            <p>Title of the Book </p>
            <p>Author</p>
            <p>Synopsis of the book</p>
        </center>
    </body>
</html>"""

In [11]:
soup = BeautifulSoup(html_text)

Let's find all the paragraph tags in this HTML

In [12]:
p_tags = soup.find_all('p')
p_tags

[<p>Title of the Book </p>, <p>Author</p>, <p>Synopsis of the book</p>]

In [13]:
# We can strip the leading and trailing spaces from the text using the .strip() method
p_tags = [x.text.strip() for x in soup.find_all('p')]
p_tags

['Title of the Book', 'Author', 'Synopsis of the book']
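
`.find_all()` also accepts filters. The sketch below shows two of them, `limit` and `class_`, together with `get_text(strip=True)` as an alternative to `.text.strip()`. The `class` attributes in this markup are added for illustration and are not in the HTML above:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the class attributes are hypothetical additions
sample_html = """
<center>
    <p class="title">Title of the Book </p>
    <p class="author">Author</p>
    <p>Synopsis of the book</p>
</center>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# limit caps the number of matches returned
first_two = sample_soup.find_all("p", limit=2)
first_two_text = [p.get_text(strip=True) for p in first_two]

# class_ keeps only the tags with the given CSS class
authors = [p.get_text(strip=True) for p in sample_soup.find_all("p", class_="author")]

print(first_two_text)
print(authors)
```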

### `.select()`
This takes a CSS selector, grabs all the matching tags, and returns them as a list

In [14]:
html_text = """
<li id="list_1">
    <div><span class="descriptor">Name: </span><span class="name">Stephanie Martinez</span></div>
    <div><span class="descriptor">DOB: </span><span class="dob">2002-07-18</span></div>
    <div><span class="descriptor">Email: </span><span class="email">pyauto0@example.com</span></div>
    <div><span class="descriptor">Phone: </span><span class="phone">735-070-3726x49154</span></div>
    <div><span class="descriptor">Payment: </span><span class="payment">CARD-AmEx</span></div>
</li>
<li id="list_2">
    <div><span class="descriptor">Name: </span><span class="name">Nathan King</span></div>
    <div><span class="descriptor">DOB: </span><span class="dob">2007-08-22</span></div>
    <div><span class="descriptor">Email: </span><span class="email">pyauto1@example.com</span></div>
    <div><span class="descriptor">Payment: </span><span class="payment">CARD-National</span></div>
</li>
"""

In [15]:
soup = BeautifulSoup(html_text)

In [16]:
li_tags = soup.select('li')

You can select each list item using the index

In [17]:
li_tags[0]

<li id="list_1">
<div><span class="descriptor">Name: </span><span class="name">Stephanie Martinez</span></div>
<div><span class="descriptor">DOB: </span><span class="dob">2002-07-18</span></div>
<div><span class="descriptor">Email: </span><span class="email">pyauto0@example.com</span></div>
<div><span class="descriptor">Phone: </span><span class="phone">735-070-3726x49154</span></div>
<div><span class="descriptor">Payment: </span><span class="payment">CARD-AmEx</span></div>
</li>

In [18]:
li_tags[1]

<li id="list_2">
<div><span class="descriptor">Name: </span><span class="name">Nathan King</span></div>
<div><span class="descriptor">DOB: </span><span class="dob">2007-08-22</span></div>
<div><span class="descriptor">Email: </span><span class="email">pyauto1@example.com</span></div>
<div><span class="descriptor">Payment: </span><span class="payment">CARD-National</span></div>
</li>
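
Because `.select()` takes CSS selectors, it is not limited to tag names. A brief sketch with an id selector and a descendant selector, using a shortened, illustrative version of the list above:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the list markup
sample_html = """
<ul>
<li id="list_1"><span class="name">Stephanie Martinez</span><span class="email">pyauto0@example.com</span></li>
<li id="list_2"><span class="name">Nathan King</span><span class="email">pyauto1@example.com</span></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# '#list_2' matches the tag whose id is 'list_2'
second_item = sample_soup.select("#list_2")

# 'li span.email' matches <span class="email"> tags nested inside any <li>
emails = [tag.text for tag in sample_soup.select("li span.email")]

print(second_item[0]["id"])
print(emails)
```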

#### Let's try getting some info in a readable format from this HTML

For example: "Stephanie Martinez was born on 2002-07-18"

In [19]:
# extract the value in the name class for the 1st list element
name = li_tags[0].select('.name')[0].text  # '.name' is a CSS class selector: it matches tags whose class is 'name'
name

'Stephanie Martinez'

In [20]:
# extract the value in the dob class for the 1st list element
dob = li_tags[0].select('.dob')[0].text
dob

'2002-07-18'

In [21]:
# now we can loop over the list items and get it for both
for li in li_tags:
    name = li.select('.name')[0].text
    dob = li.select('.dob')[0].text
    print(f"{name} was born on {dob}")

Stephanie Martinez was born on 2002-07-18
Nathan King was born on 2007-08-22
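
In the HTML above, the second list item has no Phone entry, so `li_tags[1].select('.phone')[0]` would raise an `IndexError`. One way to handle missing fields is `.select_one()`, which returns the first match or `None`. A minimal sketch on shortened, illustrative markup (the `phones` dictionary is a hypothetical name):

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup: the second item has no phone
sample_html = """
<li id="list_1"><span class="name">Stephanie Martinez</span><span class="phone">735-070-3726x49154</span></li>
<li id="list_2"><span class="name">Nathan King</span></li>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

phones = {}
for li in sample_soup.select("li"):
    name = li.select_one(".name").text
    phone_tag = li.select_one(".phone")  # None when nothing matches
    phones[name] = phone_tag.text if phone_tag is not None else None

print(phones)
```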


***
Next, let's store this data in a pandas DataFrame.


## Storing extracted data in a pandas DataFrame

In [22]:
html_text = """
<li id="list_1">
    <div><span class="descriptor">Name: </span><span class="name">Stephanie Martinez</span></div>
    <div><span class="descriptor">DOB: </span><span class="dob">2002-07-18</span></div>
    <div><span class="descriptor">Email: </span><span class="email">pyauto0@example.com</span></div>
    <div><span class="descriptor">Phone: </span><span class="phone">735-070-3726x49154</span></div>
    <div><span class="descriptor">Payment: </span><span class="payment">CARD-AmEx</span></div>
</li>
<li id="list_2">
    <div><span class="descriptor">Name: </span><span class="name">Nathan King</span></div>
    <div><span class="descriptor">DOB: </span><span class="dob">2007-08-22</span></div>
    <div><span class="descriptor">Email: </span><span class="email">pyauto1@example.com</span></div>
    <div><span class="descriptor">Payment: </span><span class="payment">CARD-National</span></div>
</li>
"""

In [23]:
soup = BeautifulSoup(html_text)

#### Import Library

In [24]:
import pandas as pd

Let's create an empty list to store the scraped results

In [25]:
scraped_data = []

Now let's iterate through each `li` tag, gather the values under each class, and store the result in a Python list

In [26]:
# select the list items from the soup created above
li_tags = soup.select('li')
for li in li_tags:
    name = li.select('.name')[0].text
    dob = li.select('.dob')[0].text
    email = li.select('.email')[0].text
    payment = li.select('.payment')[0].text

    # collect the values for this list item as one row
    row = [name, dob, email, payment]
    scraped_data.append(row)

In [27]:
scraped_data

[['Stephanie Martinez', '2002-07-18', 'pyauto0@example.com', 'CARD-AmEx'],
 ['Nathan King', '2007-08-22', 'pyauto1@example.com', 'CARD-National']]

Now we can create a dataframe called df and pass the scraped data in, along with column names

In [28]:
df = pd.DataFrame(data=scraped_data, columns=['Name','DOB','Email','PaymentMethod'])
df.head()

| | Name | DOB | Email | PaymentMethod |
|---|---|---|---|---|
| 0 | Stephanie Martinez | 2002-07-18 | pyauto0@example.com | CARD-AmEx |
| 1 | Nathan King | 2007-08-22 | pyauto1@example.com | CARD-National |
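
Scraped values arrive as strings, so the DOB column above holds text rather than dates. A possible follow-up, sketched here on a small hand-written list of dictionaries (`records` and `people` are illustrative names), is to build the DataFrame from dictionaries and convert the column with `pd.to_datetime`:

```python
import pandas as pd

# Illustrative records in the same shape as the scraped data
records = [
    {"Name": "Stephanie Martinez", "DOB": "2002-07-18"},
    {"Name": "Nathan King", "DOB": "2007-08-22"},
]

# A list of dicts supplies the column names through its keys
people = pd.DataFrame(records)

# Convert the DOB strings to datetime values
people["DOB"] = pd.to_datetime(people["DOB"])

birth_years = people["DOB"].dt.year.tolist()
print(birth_years)
```

With a datetime column, date operations such as extracting the year or sorting by birth date work directly.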


-------------------