# Web Scraping

<div class="alert alert-info">
Web scrapers are programs that read and extract data from websites.
</div>

Websites generally consist of text and HTML tags. The tags define the structure and layout of the document. For instance, `<i>` italicizes text and `<table>` defines a table of data. In this example, we're going to extract a table from a page on skyscrapercenter.com.

The tool we'll use to find and extract the table is a Python library called BeautifulSoup.


In [None]:
from bs4 import BeautifulSoup
import requests

# we'll import pandas since we'll need it later
import pandas as pd

### Getting the Web Content
The first thing to do is to grab the contents of our website.   
Make a variable called `skycrapercenter_url` that holds the link to the tallest buildings page on skyscrapercenter.com. That URL is:  
[https://www.skyscrapercenter.com/buildings](https://www.skyscrapercenter.com/buildings)  

Use `requests.get()` with `skycrapercenter_url` as its input to fetch the page. That returns a response object holding the webpage; store it in a variable named `skyscraper_page`.

In [None]:
# skycrapercenter_url holds the URL for the skyscrapercenter.com buildings page
skycrapercenter_url = "https://www.skyscrapercenter.com/buildings"

# get the web page using the URL request
skyscraper_page = requests.get(skycrapercenter_url)
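
Before going further, it can help to confirm that the request actually succeeded. The sketch below is one way to do that and is not part of the original notebook: `fetch_page` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice. `raise_for_status()` raises an exception for 4xx and 5xx responses, so an error page does not flow silently into the parser.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and fail loudly if the server reports an error.

    fetch_page is a hypothetical helper, not something the notebook defines.
    """
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response
```

With this helper, `skyscraper_page = fetch_page(skycrapercenter_url)` would behave like the cell above, except that a failed request stops the notebook with a clear error.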

Let's print the first 1000 characters of the skyscrapercenter.com page. We do that by slicing the `.content` of `skyscraper_page`.

In [None]:
# print the first 1000 characters of the web page
print(skyscraper_page.content[0:1000])

You can see that the page mentions "html" in the first few words. Otherwise, there's a lot of text that doesn't appear to have much meaning. It's certainly not obvious how we'd find our data table in all of this HTML text.  

But the HTML tags are the key to finding the table that we'd like. We can use BeautifulSoup to examine a complex document like this and look for the information we're interested in.

In [None]:
# Read the webpage with BeautifulSoup's HTML parser
soup_page = BeautifulSoup(skyscraper_page.content, 'html.parser')
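
The output printed earlier starts with `b'` because `.content` is a `bytes` object rather than ordinary text. BeautifulSoup accepts either form. The sketch below uses a tiny made-up HTML snippet (an assumption for illustration, not the real page) to show that parsing the bytes and parsing the decoded string give the same title.

```python
from bs4 import BeautifulSoup

# A made-up page, stored as bytes the way requests' .content returns it
raw_bytes = b"<html><head><title>Demo page</title></head><body></body></html>"

# Decoding turns the bytes into an ordinary string
as_text = raw_bytes.decode("utf-8")

soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")
soup_from_text = BeautifulSoup(as_text, "html.parser")

print(type(raw_bytes).__name__, type(as_text).__name__)
print(soup_from_bytes.title.string == soup_from_text.title.string)
```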

BeautifulSoup has some simple commands to find features within a web page. For example, we can extract the page's title.

In [None]:
# find and print the web page title
print("Title: ")
print(soup_page.title)

Note that the text between the `<title>...</title>` tags is the title of the page. If we just want the text and not the surrounding tags, we can do this:

In [None]:
print("Title: ")
print(soup_page.title.string)

Notice we get the title, but it's surrounded by a **lot** of space. We can collapse that extra whitespace like this:

In [None]:
title_string = ' '.join(soup_page.title.string.split())
title_string
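
The one-liner above works because `str.split()` with no arguments splits on any run of whitespace (spaces, tabs, newlines) and discards the empty pieces, and `' '.join(...)` glues the words back together with single spaces. A small sketch with a made-up string (not the real page title):

```python
# A made-up title padded with newlines and extra spaces
messy_title = "\n      Tallest   Buildings \n    Example Site\n  "

words = messy_title.split()      # split on any run of whitespace
clean_title = " ".join(words)    # rejoin with single spaces

print(words)
print(repr(clean_title))
```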

We can use the `find` method on a BeautifulSoup page to find different tags. For example, let's find the first paragraph (`<p>`) tag in the document:

In [None]:
print("First <p> tag: ")
print(soup_page.find('p'))

There's a lot in that paragraph! You can pull out other tags in the document in similar ways.  
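
As a sketch of what "similar ways" can look like: `find` returns only the first match (or `None` when nothing matches), while `find_all` returns every match in a list-like result. The HTML below is a made-up snippet for illustration, not content from the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = """
<html><body>
  <p>First paragraph with a <a href="/one">link</a>.</p>
  <p>Second paragraph.</p>
</body></html>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first_p = demo_soup.find("p")        # first <p> only
all_p = demo_soup.find_all("p")      # every <p> in the document
missing = demo_soup.find("table")    # no <table> here, so this is None

print(first_p.get_text())
print(len(all_p))
print(first_p.find("a")["href"])     # attributes are read like dictionary keys
print(missing)
```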

Remember that we identified the HTML table in the skyscrapercenter.com page back in Canvas. That table starts with the following text:

```
<table id="buildingsTable" class="custom-table buildings-table bg-white pt-1">
```

What is all that stuff after `table`? Those are the tag's attributes (here, an `id` and a `class`), and it turns out we don't really care about the details. All that matters is that they help us differentiate this table from any others that might be in the same document. They help us know this is the right thing to extract. So let's do that using `find`.

In [None]:
data_table = soup_page.find('table', id="buildingsTable")

# Let's make sure we found the right table
print(data_table)

There's a lot there, but you should see the column titles in the first few lines of the table HTML.  
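
To see how the `id` argument picks one table out of several, here is a minimal sketch on made-up HTML. Only the `buildingsTable` id and the class names on the second table are taken from the tag shown above; the `navTable` id and the cell contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; navTable and the cell text are placeholders
html = """
<table id="navTable" class="custom-table"><tr><td>menu</td></tr></table>
<table id="buildingsTable" class="custom-table buildings-table">
  <tr><td>Example Tower</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first_table = demo_soup.find("table")                         # first table in the document
by_id = demo_soup.find("table", id="buildingsTable")          # the table with this id
by_class = demo_soup.find("table", class_="buildings-table")  # match on one CSS class

print(first_table["id"])
print(by_id["id"])
print(by_class["id"])
```

Without the `id` argument, `find` would have returned the first table it met, which is not necessarily the one holding the data.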

The next part is tricky. It involves the same idea: Using `find` to look through the table and extract the data elements. In the Canvas session we said we care about the name of the building, its height, the number of floors, the city and country, and the year it was built. We have to look through the table rows (`tr`) and pull the table data (`td`) in each column.  

Don't worry about how this works yet.

In [None]:
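# One list per column we want to keep
building_names = []
heights = []
heights_in_feet = []   # not filled in by the loop below
number_of_floors = []
cities = []
countries = []         # not filled in by the loop below
completion_dates = []

# for every row (<tr>)
for row in data_table.find_all('tr'):

    # get all of the table data (<td>) in each column
    cells = row.find_all('td')

    # Skip rows that aren't 9 columns long
    if len(cells) != 9:
        continue

    try:
        # Parse the whole row first so a bad cell can't leave the lists uneven
        name = cells[1].find('a').text.strip()
        city = cells[2].find('a').text.strip()
        year = int(cells[4].find('p').text.strip())
        height = cells[5].find('p').text.strip()
        floors = int(cells[6].find('p').text.strip())
    except (AttributeError, ValueError):
        # A missing tag or a non-numeric value: skip this row and keep going
        continue

    building_names.append(name)
    cities.append(city)
    completion_dates.append(year)
    heights.append(height)
    number_of_floors.append(floors)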
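
To see the row-and-cell idea in isolation, here is a minimal sketch on a made-up two-column table. The building names, floor counts, and column positions are invented for illustration and do not reflect the real page's layout. Header rows use `<th>` rather than `<td>`, so they yield no cells and are skipped by the length check. The sketch uses `find_all`, the current spelling of the older `findAll` alias.

```python
from bs4 import BeautifulSoup

# Made-up table: names and numbers are placeholders, not real data
html = """
<table id="demoTable">
  <tr><th>Name</th><th>Floors</th></tr>
  <tr><td><a href="#">Tower A</a></td><td><p> 100 </p></td></tr>
  <tr><td><a href="#">Tower B</a></td><td><p> 80 </p></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", id="demoTable")

names, floors = [], []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) != 2:        # the header row has <th> cells, so zero <td>
        continue
    names.append(cells[0].find("a").text.strip())
    floors.append(int(cells[1].find("p").text.strip()))

print(names)
print(floors)
```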


### Putting the data into a dataframe

Our next step is to take what we read from the skyscrapercenter.com page and convert it into a Pandas dataframe.

Let's make a Python dictionary from the lists we created above. The keys will have roughly the same titles as the columns of the table on the skyscraper page.

In [None]:
data = {'name': building_names,
        'city': cities,
        'year': completion_dates,
        'height': heights,
        'floors': number_of_floors}

Now we can turn the `data` dictionary into a Pandas dataframe: each key becomes a column name and each list becomes that column's values. Let's put the dataframe into a variable named `df`.

In [None]:
df = pd.DataFrame(data)
df.head()

And there you have it, a Pandas dataframe with the data we pulled from the skyscraper webpage! That was a lot of work, so let's save the dataframe so we don't need to scrape the web page again. We use `to_csv` to write the dataframe to a CSV file. We also add the argument `index=False`; this leaves the index values (0, 1, 2, ...) out of the CSV file. Pandas will create a fresh index when we reload the file into a dataframe.

In [None]:
df.to_csv("tallest_skyscrapers.csv", index=False)
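
To check that a file round-trips, a later session can reload it with `pd.read_csv`. The sketch below uses a small made-up dataframe and a temporary directory so that it does not touch the real `tallest_skyscrapers.csv`; the values are placeholders, not scraped data.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows, not real scraped values
demo_df = pd.DataFrame({"name": ["Tower A", "Tower B"],
                        "floors": [100, 80]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    demo_df.to_csv(csv_path, index=False)   # no index column is written
    reloaded = pd.read_csv(csv_path)        # a fresh 0, 1, ... index is created

print(list(reloaded.columns))
print(list(reloaded.index))
```

For the real file, the equivalent call would be `pd.read_csv("tallest_skyscrapers.csv")`.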