<h1> 
    <img src="https://www.numberedart.com/ekmps/shops/numberedart/images/cool-andy-warhol-x22-campbells-soup-x22-pop-art-paint-by-number-kit-378-p.jpg" alt="" width="50"/>
    <b>Beautiful Soup For The Soul</b>
    <img src="https://www.numberedart.com/ekmps/shops/numberedart/images/cool-andy-warhol-x22-campbells-soup-x22-pop-art-paint-by-number-kit-378-p.jpg" alt="" width="50"/>
</h1> 
<hr>

<h2> Introduction </h2><hr>
So I heard you're looking for some data?  Well, you've come to the right place. 

But what do you do when the data you need doesn't come wrapped in a neat API?  Or that dataset doesn't have a "download as csv" button?  Inconceivable!  But never fear, web scraping is here!  

<br>
Web scraping is exactly what it sounds like: extracting data from websites.  But before we delve into Beautiful Soup, the web scraping library for Python, we have to take a step back and revisit the good ol' Myspace days of editing HTML to make that perfect layout.  So strap in, and let's take a look at some basic HTML.

![](https://github.com/brianlau336/BeautifulSoupForTheSoul/blob/master/pic1.jpg?raw=true)


<h2> 
    <img src="https://image.flaticon.com/icons/svg/136/136528.svg" alt="" width="50"/>
    HTML
</h2>
<hr>
HTML is the standard markup language for creating web pages and web applications.  HTML is <b>not</b> a programming language like Python; instead, it's a markup language that tells a browser how to display content. HTML is very similar to programs like Microsoft Word, where you can make text bold, create paragraphs, etc.

HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything between the opening `<html>` and the closing `</html>` can be expected to be HTML.

```html
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.chase.com" id="banking">Bank with Chase!</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
```

<h3>Tag Genealogy:</h3>
    
<b>child</b> - a child is a tag inside another tag<br>
<b>parent</b> - a parent is the tag another tag is inside<br>
<b>sibling</b> - a sibling is a tag that is nested inside the same parent as another tag

<h3>Tag Identifiers:</h3>

<b>class</b> - One element can have multiple classes, and a class can be shared between elements<br>
<b>id</b> - Each element can only have one id, and an id can only be used once on a page

![](https://upload.wikimedia.org/wikipedia/commons/5/55/HTML_element_structure.svg)

<h2> 
    <img src="https://image.flaticon.com/icons/svg/138/138807.svg" width=50/>
    Requests Library 
</h2>
<hr>
The first step in scraping a web page is retrieving the web page.  We can do this with the requests library.  The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.

In [None]:
import requests
page = requests.get("http://www.google.com")

After running our request, we get a `Response` object. This object has a `status_code` property, which indicates whether the page was downloaded successfully.



In [None]:
page.status_code

A few possible responses:<br>
200 <b>OK</b> - The resource has been fetched and is transmitted in the message body.<br>
400 <b>Bad Request</b> - The server could not understand the request due to invalid syntax.<br>
500 <b>Internal Server Error</b> - The server has encountered a situation it doesn't know how to handle.<br>
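
These reason phrases come from the HTTP standard, and Python's standard library exposes them through `http.HTTPStatus`. As a small sketch (the helper name `describe_status` is made up for illustration), a status code can be turned into a readable label plus a rough success flag:

```python
from http import HTTPStatus


def describe_status(code):
    """Return (reason phrase, is_success) for an HTTP status code."""
    status = HTTPStatus(code)        # raises ValueError for codes it does not know
    is_success = 200 <= code < 300   # 2xx codes mean the request succeeded
    return status.phrase, is_success


for code in (200, 400, 500):
    phrase, ok = describe_status(code)
    print(code, phrase, "success" if ok else "failure")
```

When working with a real `Response`, `page.ok` and `page.raise_for_status()` offer similar checks without any lookup of your own.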

We can print the HTML of the page:

In [None]:
page.content

<h2> 
    <img src="https://image.flaticon.com/icons/svg/889/889705.svg" width=50/>
    Beautiful Soup Library 
</h2>
<hr>
Now that we have the raw HTML, we can start parsing it with our Soup!

In [None]:
example_page = '''<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <p class="test1">Here's a paragraph of text!</p>
        <p class="test2">Here's a second paragraph of text!</p>
    </body>
</html>'''
#http://htmlpreview.github.io/?https://github.com/brianlau336/BeautifulSoupForTheSoul/blob/master/example_page1.html

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(example_page, 'html.parser')
print(soup.prettify())

In [None]:
soup.children

In [None]:
[x for x in soup.children] #list(soup.children)

In [None]:
[type(item) for item in list(soup.children)]


In [None]:
html = list(soup.children)[2]
html

In [None]:
list(html.children)

In [None]:
body = list(html.children)[3]
body

In [None]:
p = list(body.children)[1]
p

In [None]:
p.get_text()

<h3> Find All Tags </h3>
Okay, now that was somewhat painful.  Thankfully, we don't have to manually drill down like that every time.  If we want to extract every instance of a tag, we can use the `find_all` method, which finds all the instances of that tag on the page and returns them as a list.
<br>
<br>
![](https://github.com/brianlau336/BeautifulSoupForTheSoul/blob/master/pic2.jpg?raw=true)

In [None]:
soup.find_all('p')


In [None]:
for x in soup.find_all('p'):
    print(x.get_text().strip())

If you instead only want to find the first instance of a tag, you can use the `find` method, which returns a single `Tag` object, or `None` if nothing matches:

In [None]:
soup.find('p')
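
One detail worth knowing is how `find` and `find_all` behave when nothing matches. Here is a minimal sketch using a made-up one-paragraph document (`html_doc`, `demo_soup` and the other names are illustrative, not part of the examples above):

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p>Only paragraph</p></body></html>"
demo_soup = BeautifulSoup(html_doc, "html.parser")

first_para = demo_soup.find("p")            # a Tag object
missing = demo_soup.find("table")           # None, because there is no <table>
all_tables = demo_soup.find_all("table")    # an empty, list-like result

print(type(first_para).__name__)
print(missing)
print(len(all_tables))

# Guard against None before calling methods on the result of find()
missing_text = missing.get_text() if missing is not None else ""
```

Calling `.get_text()` directly on a `None` result raises an `AttributeError`, which is a common stumbling block when a page's layout changes.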

<h3> Searching By Genealogy/Relations </h3>
The Beautiful Soup API defines ten other methods for searching the tree, but don't be afraid! Five of these methods are basically the same as find_all(), and the other five are basically the same as find(). The only differences are in what parts of the tree they search.



In [None]:
first_p = soup.find('p')
first_p.find_parent()

In [None]:
first_p = soup.find('p')
first_p.find_next_sibling()
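
To see a few more of these relation-based searches side by side, here is a small sketch on a made-up snippet with three sibling paragraphs (`family_html`, `family_soup` and `middle` are illustrative names). The snippet has no whitespace between tags, so every sibling is a tag rather than a text node:

```python
from bs4 import BeautifulSoup

family_html = "<div><p id='a'>one</p><p id='b'>two</p><p id='c'>three</p></div>"
family_soup = BeautifulSoup(family_html, "html.parser")
middle = family_soup.find(id="b")

# Singular methods return one Tag (or None); plural methods return a list
parent_name = middle.find_parent().name
previous_id = middle.find_previous_sibling("p")["id"]
next_ids = [s["id"] for s in middle.find_next_siblings("p")]
div_ancestors = [p.name for p in middle.find_parents("div")]

print(parent_name, previous_id, next_ids, div_ancestors)
```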

<h3> Searching By Class/Id </h3>
Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to target the specific elements we want to scrape. 



In [None]:
example_page2 = '''<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
        <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>'''
#http://htmlpreview.github.io/?https://github.com/brianlau336/BeautifulSoupForTheSoul/blob/master/example_page2.html
from bs4 import BeautifulSoup
soup = BeautifulSoup(example_page2, 'html.parser')
print(soup.prettify())

Now, we can use the `find_all` method to search for items by class or by id. In the below example, we'll search for any p tag that has the class outer-text:



In [None]:
soup.find_all('p', class_='outer-text')


We can also search for elements by id:


In [None]:
soup.find_all(id="first")


You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here is an example that selects every p tag inside a div:

In [None]:
soup.select("div p")
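
A few more selector patterns, sketched on a trimmed-down, self-contained copy of the example page (the names `selector_html`, `selector_soup` and `texts` are made up for this illustration):

```python
from bs4 import BeautifulSoup

selector_html = """
<div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
"""
selector_soup = BeautifulSoup(selector_html, "html.parser")


def texts(tags):
    """Return the stripped text of each tag in a list of tags."""
    return [t.get_text(strip=True) for t in tags]


by_class = texts(selector_soup.select("p.outer-text"))    # p tags with class outer-text
by_id = texts(selector_soup.select("#first"))             # the element with id "first"
direct = texts(selector_soup.select("div > p"))           # p tags directly inside a div
shared = texts(selector_soup.select(".first-item"))       # any tag with class first-item
first_inner = selector_soup.select_one("p.inner-text")    # first match only, or None

print(by_class, by_id, direct, shared, first_inner.get_text(strip=True))
```

`select` always returns a list, while `select_one` returns the first matching tag or `None`.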

<h2> 
    <img src="https://image.flaticon.com/icons/svg/214/214363.svg" width=50>
    A Real Life Example - Houston Weather Data
</h2>
<hr>
Let's say you wanted to grab some weather data for Houston, and you couldn't find any viable APIs to use.  You stumble across the National Weather Service website and oh look! They have exactly what you want.  Time to flex those scraping skills.
<br>
<br>
![](https://github.com/brianlau336/BeautifulSoupForTheSoul/blob/master/pic3.jpeg?raw=true)
<b>Houston Weather:</b>
https://forecast.weather.gov/MapClick.php?lat=29.7606&lon=-95.3697

Step 1: Download the web page containing the forecast.


In [None]:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=29.7606&lon=-95.3697")
soup = BeautifulSoup(page.content, 'html.parser')

Step 2: Find the div with id "seven-day-forecast", and assign it to week_forecast

In [None]:
week_forecast = soup.find(id="seven-day-forecast")

Step 3: Inside week_forecast, find each individual forecast item 

In [None]:
forecast_items = week_forecast.find_all(class_="tombstone-container")

Step 4: Extract and print the first forecast item.

In [None]:
tonight = forecast_items[0]
print(tonight.prettify())

Okay cool, that forecast item (day) we drilled down into has everything we want!  
Looking at what we have, we can extract 4 major data points:

1. The name of the forecast item under "period-name" class<br>
2. The description of the conditions within the "title" attribute of the img tag. This one is extracted differently, so we'll come back to it later.<br>
3. A short description of the conditions under "short-desc" class<br>
4. The temperature low under "temp" class<br>

In [None]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Now, we can extract the "title" attribute from the img tag. To do this, we just treat the `Tag` object like a dictionary, and pass in the attribute we want as a key:

In [None]:
img = tonight.find("img")
desc = img['title']

print(desc)
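
Dictionary-style access raises a `KeyError` when the attribute is absent, so `.get` is often the safer choice. A quick sketch on a made-up img tag (the markup and the name `img_tag` are illustrative, not taken from the weather page):

```python
from bs4 import BeautifulSoup

img_html = '<img src="night.png" alt="Tonight: Clear" title="Tonight: Clear, with a low around 70.">'
img_tag = BeautifulSoup(img_html, "html.parser").find("img")

print(img_tag["title"])             # dictionary-style access; KeyError if the attribute is missing
print(img_tag.get("width"))         # .get returns None instead of raising
print(img_tag.get("width", "n/a"))  # .get also accepts a default value
print(sorted(img_tag.attrs))        # .attrs is a plain dict of every attribute on the tag
```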

Now that we know how to extract each individual piece of information, we can combine our knowledge with css selectors and lists to extract everything at once.

So let's select all items with the class period-name inside an item with the class tombstone-container in week_forecast.
We can use a list comprehension to call the get_text method on each `Tag` object.

In [None]:
period_tags = week_forecast.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

In [None]:
short_descs = [sd.get_text() for sd in week_forecast.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in week_forecast.select(".tombstone-container .temp")]
descs = [d["title"] for d in week_forecast.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)
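
Building four separate lists works as long as every forecast item contains all four pieces. If an item lacks one of them, the lists end up with different lengths and no longer line up. A more defensive sketch is to walk the containers one at a time. The markup below is hypothetical: it only mimics the class names used above and is not copied from the live page, and `text_or_none` is a made-up helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the steps above
sample_html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img src="a.png" title="Tonight: Mostly clear."></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 70 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Advisory</p>
    <p class="short-desc">Heat Advisory</p>
  </div>
</div>
"""
sample_forecast = BeautifulSoup(sample_html, "html.parser").find(id="seven-day-forecast")


def text_or_none(container, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = container.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for item in sample_forecast.select(".tombstone-container"):
    item_img = item.find("img")
    rows.append({
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
        "desc": item_img.get("title") if item_img is not None else None,
    })

for row in rows:
    print(row)
```

Each row stays internally consistent, with `None` marking whatever was missing from that item.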

<h2>
    <img src="https://image.flaticon.com/icons/svg/185/185816.svg" width=50>
    Pandas!
</h2>
<hr>
Okay, so we have a bunch of lists with data...hm...does this look <i>familiar</i>? 
<br>
<br>
<center>
    <img src="https://confluence.uk.jpmorgan.com/confluence/download/attachments/726254563/PandaMan2.jpeg?version=1&modificationDate=1529014241000&api=v2">
</center>

We can now combine the data into a Pandas `DataFrame` and analyze it.

In order to do this, we'll call the `DataFrame` class, and pass in each list of items that we have. We pass them in as part of a dictionary. Each dictionary key will become a column in the DataFrame, and each list will become the values in the column.

In [None]:
import pandas as pd
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })
weather

We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

In [None]:
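# Raw string keeps \d intact; expand=False returns a Series instead of a one-column DataFrame
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')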
temp_nums

We could then find the mean of all the high and low temperatures:



In [None]:
weather["temp_num"].mean()
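
A single mean blends highs and lows together. As a rough sketch on made-up sample strings (assuming each value starts with "High" or "Low", as in the scraped `temp` column), the same `str.extract` approach can split the two before averaging:

```python
import pandas as pd

# Made-up sample values in the same shape as the scraped "temp" column
sample_temps = pd.Series(["Low: 70 °F", "High: 92 °F", "Low: 71 °F"])

# expand=False returns a Series when the pattern has a single capture group
nums = sample_temps.str.extract(r"(\d+)", expand=False).astype(int)
kinds = sample_temps.str.extract(r"^(High|Low)", expand=False)

# Group the numeric values by their High/Low label and average each group
means = nums.groupby(kinds).mean().to_dict()

print(nums.tolist())
print(kinds.tolist())
print(means)
```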

We could also select only the rows that happen at night:


In [None]:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

In [None]:
weather[is_night]


<h2>
    <img src="https://image.flaticon.com/icons/svg/497/497738.svg" width=50>
    Conclusion
</h2>
<hr>

We've only scratched the surface with this walkthrough and real-world example.  Imagine all the cool things you can now scrape: airline tickets, stock market data, consumer product prices. The possibilities are endless!  

<b>Warning!</b>  
Web scraping has created a bit of a grey area around what is and is not allowed.

1. It's increasingly being used for business purposes to gain a competitive advantage. So there's often a financial motive behind it.
2. It's often done in complete disregard of copyright laws and of Terms of Service (ToS).
3. It's often done in an abusive manner. For example, web scrapers might send many more requests per second than a human would, thus causing an unexpected load on websites.
4. It can be used to perform prohibited operations on websites, like circumventing the security measures that are put in place to automatically download data, which would otherwise be inaccessible.
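
One way to stay on the polite side of points 2 and 3 is to honour a site's robots.txt and pause between requests. The sketch below uses only the standard library. The robots.txt content, the URLs, the `my-scraper` user agent, and the helper names `polite_delay` and `crawl` are all hypothetical; a real scraper would download the site's own robots.txt and read its Terms of Service.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would fetch the site's own file
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]
robots = RobotFileParser()
robots.parse(robots_lines)


def polite_delay(user_agent="my-scraper", default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = robots.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, sleep=time.sleep, user_agent="my-scraper"):
    """Call fetch(url) for each allowed URL, sleeping after every request."""
    fetched = []
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue                      # skip anything robots.txt disallows
        fetch(url)
        fetched.append(url)
        sleep(polite_delay(user_agent))
    return fetched


# Dry run: print instead of downloading or actually sleeping
crawl(
    ["https://example.com/data", "https://example.com/private/report"],
    fetch=lambda url: print("would fetch", url),
    sleep=lambda seconds: print("would sleep", seconds, "seconds"),
)
```

In a real run, `fetch` could wrap `requests.get` and the default `time.sleep` would enforce the pause.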

In 2009, Facebook won one of the first copyright suits against a web scraper.  This laid the groundwork for numerous lawsuits that tie web scraping to direct copyright violation and very clear monetary damages.

In 2017, Ticketmaster sued Prestige Entertainment, claiming it used computer programs to illegally buy as many as 40 percent of the available seats for performances of "Hamilton" in New York and the majority of the tickets Ticketmaster had available for the 2015 Mayweather v. Pacquiao fight in Las Vegas.

![](http://www.blacksheepproductions.com/wp-content/uploads/2016/07/with-great-power-comes-great-resposibility.jpg)