![b4s](img/beautiful_soup.png)

## [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup)

## Benefits of *not* scraping
![options](img/other_options.png)

### Use case

![use](img/use_case.png)

### Goal

![python](img/how_works.png)

#### Discuss
What's a website you'd like to scrape?

### Scenario

I want to analyze the Grammy Award for Song of the Year to see if I can find any patterns in country of origin, singer, song content, etc.

But where do I start finding that data? Not from an API.

Well, we can start [here](https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year).

### This is our target
![target](img/target.png)

### Learning goals:

- scrape a basic Wikipedia page using Beautiful Soup
- transform the HTML table we want into a pandas `DataFrame`
- scrape a more complex Wikipedia page
- transform the scraped data we want into a pandas `DataFrame`
- if time allows, go hunt a wild website and scrape it

## Basic Wikipedia

![vheck](img/basic.gif)

Task: Get one column from a table on Wikipedia

Let's import the libraries we need

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

First, a Wikipedia article where we only want one column of information: countries!

https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area

Pass that URL to `requests.get` and assign the text of the response to `website_url`

In [5]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text

Pass the HTML to `BeautifulSoup` to create a soup object that we can search

In [6]:
soup = BeautifulSoup(website_url,'lxml')
#print(soup.prettify())
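
To see what a soup object offers without hitting the network, here is a minimal sketch on a hand-written HTML string. The page content and the `demo_soup` name are made up for illustration, and the sketch uses Python's built-in `html.parser` instead of `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the text returned by requests.get(...).text
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Countries</h1>
    <p class="intro">A tiny page for practice.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object
page_title = demo_soup.title.get_text()
heading = demo_soup.h1.get_text()

# find() returns the first matching tag, or None if nothing matches
intro = demo_soup.find("p", class_="intro").get_text()
missing = demo_soup.find("table")

print(page_title, heading, intro, missing)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before calling methods on it.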

Find the table with the class of interest

In [7]:
table = soup.find('table',{'class':'wikitable sortable'})
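
Searching on a multi-word class string can be brittle, so here is a small sketch comparing three ways to match classes. The three tables below are invented for illustration. The point is that a string such as `'wikitable sortable'` is compared against the whole `class` attribute, so a table with an extra class would be missed.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three tables carrying different class attributes
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>second</td></tr></table>
<table class="infobox"><tr><td>third</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# A multi-word string is compared against the whole class attribute,
# so only the table whose class is exactly "wikitable sortable" matches.
exact = demo_soup.find_all("table", {"class": "wikitable sortable"})

# A single class name matches any table that carries that class.
any_wikitable = demo_soup.find_all("table", class_="wikitable")

# A CSS selector requires both classes but tolerates extra ones.
both_classes = demo_soup.select("table.wikitable.sortable")

print(len(exact), len(any_wikitable), len(both_classes))
```

If `soup.find` unexpectedly returns `None` on a real page, loosening the match in one of these ways is a reasonable first thing to try.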

Keep looking at the HTML to see if you can find any commonalities in what you want to scrape...

All the country names are links! We can use the `a` tag!

In [8]:
links = table.find_all('a')

We can now iterate over the links and collect each `title` attribute in a list

In [9]:
links = table.find_all('a')
Countries = []
for link in links:
    # .get returns None for links that have no title attribute
    Countries.append(link.get('title'))

print(Countries)

['Russia', None, 'China', 'Hong Kong', 'Macau', 'India', None, 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Gilgit-Baltistan', 'Azad Kashmir', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Golan Heights', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'Azerbaijan', 'United Arab Emirates', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Cyprus', 'Northern Cyprus', 'State of Palestine', 'Brunei', 'Bahrain', 'Singapore', 'Maldives']
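
The `None` entries in the output come from links that have no `title` attribute. As a sketch of one way to skip them, the example below runs on a made-up miniature table (the HTML and the `demo_table` name are assumptions for illustration) and asks `find_all` for only those `a` tags that carry a `title`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table: one link has no title attribute
html = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a><a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

# link.get('title') returns None when the attribute is absent
all_titles = [link.get("title") for link in demo_table.find_all("a")]

# title=True keeps only the <a> tags that have a title attribute
titled_only = [link["title"] for link in demo_table.find_all("a", title=True)]

print(all_titles)
print(titled_only)
```

Note that this only removes missing titles. Entries such as 'Hong Kong' or 'Golan Heights' in the real output would still need a separate decision about whether they belong in a list of countries.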


Now, let's convert that list to a `DataFrame`

In [10]:
df = pd.DataFrame()
df['Country'] = Countries

In [11]:
df.head()

|   | Country |
|---|---------|
| 0 | Russia |
| 1 | None |
| 2 | China |
| 3 | Hong Kong |
| 4 | Macau |
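
Row 1 is empty because the scraped list contained `None` at that position. A minimal sketch of cleaning this up on the pandas side, using a stand-in list with the same first five values (the `countries`, `demo_df`, and `clean_df` names are chosen here for illustration):

```python
import pandas as pd

# Stand-in for the scraped list, including a None entry like the one above
countries = ["Russia", None, "China", "Hong Kong", "Macau"]

demo_df = pd.DataFrame({"Country": countries})

# Drop rows where the title was missing, then renumber the index
clean_df = demo_df.dropna(subset=["Country"]).reset_index(drop=True)

print(clean_df)
```

Building the frame from a dict of columns does the same job as creating an empty `DataFrame` and assigning the column afterwards.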


## Less Basic - Get a whole table
Let's go inspect the website to find the right tag/heading/etc. for the table we want

What are the important tags here?<br>
What class is the important one?

`table`<br>
`wikitable sortable`

**Task**<br>
Work with a partner to comment the following code and figure out what it does

In [12]:
response = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text

soup = BeautifulSoup(response,'lxml')
#print(soup.prettify())

tab = soup.find("table",{"class":"wikitable sortable"})
# pd.read_html(tab.prettify())

rows = tab.find_all('tr')

data = []
for row in rows:
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])

df = pd.DataFrame(data)

new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
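
One way to read the cell above, sketched on a small hand-written table so it runs without a network connection. The table contents ("Song A", "Song B") and the `demo_` names are placeholders, not real Grammy data.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table in the same shape as a Wikipedia wikitable
html = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Song</th></tr>
  <tr><td>2001</td><td> Song A </td></tr>
  <tr><td>2002</td><td>Song B</td></tr>
</table>
"""
demo_tab = BeautifulSoup(html, "html.parser").find("table", {"class": "wikitable sortable"})

# One list per <tr>, holding the stripped text of each header or data cell
data = []
for row in demo_tab.find_all("tr"):
    data.append([cell.get_text().strip() for cell in row.find_all(["th", "td"])])

demo_df = pd.DataFrame(data)

# Promote the first row to column names and drop it from the body
demo_df.columns = demo_df.iloc[0]
demo_df = demo_df[1:].reset_index(drop=True)

print(demo_df)
```

Every value scraped this way is a string, so numeric columns such as the year would still need converting before analysis.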

### But this is hard. Is there an easier way to do this?

Another way, if you **know** there is a `table` in the `html` somewhere, is `pd.read_html`

In [30]:
indeed = pd.read_html('https://www.indeed.com/q-Data-Scientist-l-Washington,-DC-jobs.html')

`pd.read_html` returns a `list` of `DataFrame`s<br>
We still need to find the _correct_ one
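
Picking the right table out of that list can be sketched as follows. The `tables` list here is a hand-made stand-in for what `pd.read_html` would return, with invented contents, so that the example runs offline.

```python
import pandas as pd

# Stand-in for the list that pd.read_html would return (made-up data)
tables = [
    pd.DataFrame({"Menu": ["Home", "About"]}),
    pd.DataFrame({"Country": ["Russia", "China"], "Rank": [1, 2]}),
]

# Peek at the shape and columns of each candidate table
for i, t in enumerate(tables):
    print(i, t.shape, list(t.columns))

# Keep the first table that has the column we care about
wanted = next(t for t in tables if "Country" in t.columns)

print(wanted)
```

`pd.read_html` also takes a `match` argument that keeps only tables containing text matching a string or regex, which can narrow the list before you inspect it.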

In [31]:
links = table.find_all('a')  # `table` is still the Asian-countries table from the first example
jobs_list = []
for link in links:
    jobs_list.append(link.get('title'))

print(jobs_list)

Because `table` has not been reassigned, this prints the same list of country titles as before, not job listings.


In [16]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area')  # a list of DataFrames, one per table found

In [23]:
#df

Another way with the same concept: find the table with Beautiful Soup first, then hand only that table to `pd.read_html`

In [15]:
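from io import StringIO

response = requests.get('https://en.wikipedia.org/wiki/List_of_American_Grammy_Award_winners_and_nominees')
soup = BeautifulSoup(response.text, 'lxml')  # pass the response text, not `html.response`

tab = soup.find("table",{"class":"wikitable sortable"})
# read_html returns a list of DataFrames; newer pandas versions expect literal HTML wrapped in StringIO
df = pd.read_html(StringIO(tab.prettify()))[0]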

## Now find a free-range website

Get in groups of four and try to scrape a website into a pandas `DataFrame`.