# <center> Extra Courses : Web Scraping using BeautifulSoup </center>
Click to [Open in Colab](https://colab.research.google.com/github/ferdinand-winstein/py-dts/blob/master/2022/Extra%20Courses/Python%20Extra%20Courses%20-%20Pandas.ipynb?)

![bs.png](attachment:bs.png)
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name comes from "tag soup"). It builds a parse tree for each parsed page that can be used to extract data from HTML, which is useful for web scraping.
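
As a small illustration of the point about malformed markup, the sketch below parses a deliberately broken snippet. The HTML string is made up for this example, and `html.parser` (Python's built-in parser) is an assumed choice; other parsers may arrange the resulting tree differently.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: the <p> and <b> tags are never closed.
broken_html = "<html><body><p>First paragraph<p>Second <b>bold"

soup = BeautifulSoup(broken_html, "html.parser")

# Both paragraphs and the bold text are still reachable through the parse tree.
paragraphs = soup.find_all("p")
bold_text = soup.find("b").get_text()
print(len(paragraphs), bold_text)
```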

Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project, and is additionally supported by Tidelift, a paid subscription to open-source maintenance.

# Fetching Content

The example webpage we will use is [here](https://keithgalli.github.io/web-scraping/example.html).

In [None]:
import requests
from bs4 import BeautifulSoup as bs  # pip install beautifulsoup4

In [None]:
# load the webpage we want to scrape with requests
r = requests.get('https://keithgalli.github.io/web-scraping/example.html')

# convert it into a BeautifulSoup object (naming the parser avoids a bs4 warning)
soup = bs(r.content, 'html.parser')

print(soup)
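
A slightly more defensive variant of the fetch step might look like the sketch below. The helper name `fetch_soup` and the 10-second timeout are assumptions made for this example, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return it as a parsed BeautifulSoup tree."""
    # A timeout keeps the call from hanging forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup('https://keithgalli.github.io/web-scraping/example.html')` should then give an object that can be used like `soup` in the rest of this notebook.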

In [None]:
# indented output - easier to read
print(soup.prettify())

# `find` and `find_all`

In [None]:
soup.find('h2')

In [None]:
soup.find_all('h2')

In [None]:
soup.find_all(['h2', 'h1'])

In [None]:
soup.find_all('p')

In [None]:
# You can pass in attributes to the find/find_all function
soup.find('p', attrs={'id': 'paragraph-id'})
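
The attribute filter can also be written with keyword arguments, and `find` returns `None` when nothing matches. The sketch below shows both on a small made-up HTML string; the `intro` and `note` classes are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p id="paragraph-id" class="intro">Some <b>bold</b> text</p>
  <p class="note">Another paragraph</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs= and keyword arguments are two spellings of the same filter.
by_attrs = soup.find("p", attrs={"id": "paragraph-id"})
by_keyword = soup.find("p", id="paragraph-id")

# "class" is a reserved word in Python, so the keyword form is class_.
notes = soup.find_all("p", class_="note")

# find() returns None when nothing matches, so guard before using the result.
missing = soup.find("p", attrs={"id": "does-not-exist"})
missing_text = missing.get_text() if missing is not None else ""
```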

In [None]:
# You can nest find/find_all calls
body = soup.find('body')
body

In [None]:
div = body.find('div')
div

In [None]:
header = div.find('h1')
header

In [None]:
soup.body.div.h1
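
Dotted access such as `soup.body.div.h1` is shorthand for chained `find` calls, and each step only returns the first matching tag. A minimal sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

html = (
    "<body>"
    "<div><h1>Title</h1><h2>First</h2><h2>Second</h2></div>"
    "<div><h1>Other</h1></div>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Chained find() calls and dotted access reach the same element.
via_find = soup.find("body").find("div").find("h1")
via_dots = soup.body.div.h1
same_element = via_find is via_dots

# Dotted access stops at the first match; use find_all() to get every match.
first_h2 = soup.div.h2.get_text()
all_h2 = [h2.get_text() for h2 in soup.div.find_all("h2")]
```

If a step in the chain does not exist, that step evaluates to `None` and the next attribute access raises `AttributeError`, so this shortcut is best kept for structure you know is present.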

In [None]:
print(soup.prettify())

## Using regex

In [None]:
import re
paragraph = soup.find_all('p', string=re.compile('Some'))
paragraph

In [None]:
headers = soup.find_all('h2', string=re.compile('(H|h)eader'))
headers
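
One detail worth knowing: the `string` filter is compared against a tag's `.string`, which is `None` when the tag has more than one child. The sketch below, on made-up HTML, shows the effect and one possible fallback that searches the combined text instead.

```python
import re
from bs4 import BeautifulSoup

html = """
<h2>A Header</h2>
<h2>Another header</h2>
<h2>Unrelated</h2>
<p>Some plain text</p>
<p>Some <b>bold</b> text</p>
"""
soup = BeautifulSoup(html, "html.parser")

# [Hh]eader is an equivalent, slightly tidier spelling of (H|h)eader.
headers = soup.find_all("h2", string=re.compile(r"[Hh]eader"))
header_texts = [h.get_text() for h in headers]

# Only the first <p> matches: the second one mixes text with a <b> child,
# so its .string is None and the string filter skips it.
plain = soup.find_all("p", string=re.compile("Some"))

# Fallback: test the full text of each tag ourselves.
mixed = [p for p in soup.find_all("p") if re.search("Some", p.get_text())]
```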

# `select` with CSS Selectors
List of CSS selectors: https://www.w3schools.com/cssref/css_selectors.asp

In [None]:
content = soup.select("body div")
content

In [None]:
paragraphs = soup.select("h2 ~ p")
paragraphs

In [None]:
paragraphs = soup.select('body > p')
print(paragraphs)
for paragraph in paragraphs:
    print(paragraph.select('i'))

In [None]:
par = soup.select('p#paragraph-id b')
par

In [None]:
soup.select('[id=paragraph-id]')
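
To make the difference between the combinators used above concrete, here is a small sketch on made-up HTML. It also uses `select_one`, which returns the first match (or `None`) rather than a list.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <div><p>Inside div</p></div>
  <h2>Heading</h2>
  <p id="paragraph-id">After heading <b>bold</b></p>
  <p>Last <i>italic</i></p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# "body p": every <p> anywhere below <body>.
descendants = soup.select("body p")
# "body > p": only <p> tags that are direct children of <body>.
children = soup.select("body > p")
# "h2 ~ p": <p> tags that come after an <h2> and share its parent.
siblings = soup.select("h2 ~ p")

# select_one() returns a single tag, or None when nothing matches.
first_by_id = soup.select_one("p#paragraph-id")
missing = soup.select_one("p.no-such-class")
```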

# Get different properties of the HTML

In [None]:
soup.find('title').get_text()

In [None]:
div

In [None]:
div = soup.find('div')
print(div.get_text())

In [None]:
link = soup.find('a')
link['href']

In [None]:
par = soup.select('p#paragraph-id')[0]
par['id']
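
A few related details, sketched on made-up HTML: `get_text` accepts `separator` and `strip` arguments, `class` comes back as a list, and `.get()` is a safer way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="card big">'
    '<a href="/about">About <b>us</b></a>'
    "<a>No link</a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# Strip each text fragment and join the fragments with a single space.
text = div.get_text(separator=" ", strip=True)

# class is a multi-valued attribute, so bs4 returns it as a list.
classes = div["class"]

# tag["href"] raises KeyError when the attribute is missing; .get() returns None.
hrefs = [a.get("href") for a in soup.find_all("a")]
```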

# Code Navigation

In [None]:
print(soup.prettify())

In [None]:
soup.body.find('div')

## `find_parent()`

In [None]:
soup.body.find('div').find_parent()

## `find_next_siblings()` and `find_previous_siblings()`

In [None]:
soup.body.find("div").find_next_siblings()

In [None]:
soup.body.find("h2").find_previous_sibling()
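
The navigation calls above can be tried on a small made-up document, where the results are easy to check by eye. Note that `find_previous_siblings()` lists the nearest sibling first, that is, in reverse document order.

```python
from bs4 import BeautifulSoup

html = "<body><h1>Title</h1><div><p>Inner</p></div><h2>Sub</h2><p>Tail</p></body>"
soup = BeautifulSoup(html, "html.parser")

div = soup.body.find("div")
h2 = soup.body.find("h2")

# With no arguments, find_parent() returns the immediate parent.
parent_name = div.find_parent().name

# Tags after <div> at the same level, in document order.
next_names = [tag.name for tag in div.find_next_siblings()]

# The single closest tag before <h2>, then all of them (nearest first).
previous_one = h2.find_previous_sibling().name
previous_all = [tag.name for tag in h2.find_previous_siblings()]
```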

# Examples

The example webpage we will use is [here](https://keithgalli.github.io/web-scraping/webpage.html).

## Example 1: Getting the Social Media Links

In [None]:
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
webpage = bs(r.content, 'html.parser')

#print(webpage.prettify())

### Method 1: Using `find`

In [None]:
links = webpage.find('ul', attrs={'class':'socials'})
link_list = links.find_all('a')
actual_links = [link['href'] for link in link_list]
actual_links

### Method 2: Using a CSS Selector

In [None]:
links = webpage.select('ul.socials a')
actual_links = [link['href'] for link in links]
actual_links

#### We can also go through the parent element

In [None]:
links = webpage.select("li.social")
links

In [None]:
links = webpage.select("body ul li.social a")
links

## Example 2: Getting All the Text

In [None]:
header = webpage.body.find("h2", string="Photos")
previous_elements = header.find_previous_siblings()
previous_elements_sorted = previous_elements[::-1]
elements = [x.get_text() for x in previous_elements_sorted]
text = "\n".join(elements)
print(text)

## Example 3: Getting a Table

In [None]:
table = webpage.select('table.hockey-stats')[0]
columns = table.find('thead').find_all('th')
column_name = [column.string for column in columns]
column_name

In [None]:
import pandas as pd

table_rows = table.find('tbody').find_all('tr')

rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    rows.append([cell.get_text().strip() for cell in cells])

df = pd.DataFrame(rows, columns=column_name)
df
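
The same header-plus-rows pattern can be tried without pandas or a network call. In the sketch below the table contents are invented placeholder values, not data from the example page; only the `hockey-stats` class name is taken from the cell above.

```python
from bs4 import BeautifulSoup

# Made-up table that mimics the thead/tbody layout used above.
html = """
<table class="hockey-stats">
  <thead><tr><th>Season</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>Year 1</td><td> Team A </td><td>10</td></tr>
    <tr><td>Year 2</td><td>Team B</td><td>12</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.hockey-stats")

# get_text(strip=True) also works when a header cell contains nested tags,
# a case where .string would be None.
column_names = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# Build one dict per row, keyed by column name.
records = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(column_names, cells)))
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)` if a DataFrame is wanted.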

## Example 4: Getting Every Fact That Contains the Word "is"

In [None]:
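import re

facts = webpage.select('ul.fun-facts li')
# For each <li>, find the first text node that contains "is" (None if there is no match).
# Note that this pattern also matches "is" inside longer words such as "this".
fact_is = [fact.find(string=re.compile('is')) for fact in facts]
# Keep only the matches and read the full text of the element that contains each one
fact_is_new = [fact.find_parent().get_text() for fact in fact_is if fact]
fact_is_new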
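
If only the standalone word "is" should count, a word-boundary pattern is one option. The sketch below uses made-up list items and filters on each item's full text; it is an alternative to the approach above, not a replacement for it.

```python
import re
from bs4 import BeautifulSoup

# Made-up list items; only the class name mirrors the example page.
html = """
<ul class="fun-facts">
  <li>This has no standalone match</li>
  <li>The sky is blue</li>
  <li>Nothing here</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
facts = soup.select("ul.fun-facts li")

# Plain "is" also matches inside "This".
loose = [li.get_text() for li in facts if re.search("is", li.get_text())]

# \b marks a word boundary, so only the separate word "is" matches.
whole_word = re.compile(r"\bis\b")
strict = [li.get_text() for li in facts if whole_word.search(li.get_text())]
```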


## Example 5: Downloading an Image

In [None]:
img = webpage.select('div.row div.column img')
img

In [None]:
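url = "https://keithgalli.github.io/web-scraping/"

# src of the second image on the page, relative to the base URL above
image_path = img[1]['src']
full_url = url + image_path

# download the image bytes and write them to a local file
img_data = requests.get(full_url).content
with open('Pontevecchio.jpg', 'wb') as handler:
    handler.write(img_data)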
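
String concatenation works when every `src` is relative to the same folder. A more general option is `urllib.parse.urljoin`, which resolves a `src` against the page URL the way a browser would. The image paths in this sketch are hypothetical, and nothing is downloaded.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://keithgalli.github.io/web-scraping/webpage.html"

# Hypothetical markup: one src relative to the page, one relative to the site root.
html = (
    '<div class="row"><div class="column">'
    '<img src="images/a.jpg"><img src="/images/b.jpg">'
    "</div></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Resolve every src against the URL of the page it was found on.
full_urls = [urljoin(page_url, img["src"]) for img in soup.select("div.column img")]

# Use the last path segment as a local file name.
filenames = [full_url.rsplit("/", 1)[-1] for full_url in full_urls]
```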

Tutorial : https://www.youtube.com/watch?v=GjKQ6V_ViQE&t=1935s

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/