# BeautifulSoup Basics

In [1]:
from urllib.parse import urljoin
baseUrl = "http://test.com/html/a.html"

### urljoin

The `urljoin` function combines a base URL with a new part. It does not simply append the new part; it replaces the last segment of the base URL. Here are some examples.

In [6]:
print(">>", urljoin(baseUrl, "b.html"))

>> http://test.com/html/b.html


The new part does not have to be a single file name such as `b.html`; it can also include a directory, as in the example below.

In [7]:
print(">>", urljoin(baseUrl, "sub/b.html"))

>> http://test.com/html/sub/b.html


With `../`, you can move up to the parent directory.

In [8]:
print(">>", urljoin(baseUrl, "../index.html"))

>> http://test.com/index.html


Here is another example.

In [9]:
print(">>", urljoin(baseUrl, "../img/img.jpg"))

>> http://test.com/img/img.jpg
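
Two more cases are worth knowing when crawling: a path that starts with `/` and a full URL. The sketch below reuses the same base URL; the `logo.png` path and the `example.org` address are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "http://test.com/html/a.html"

# A path starting with "/" is resolved from the site root, not from /html/.
root_relative = urljoin(base_url, "/img/logo.png")

# A full URL replaces the base URL entirely.
absolute = urljoin(base_url, "http://example.org/page.html")

print(">>", root_relative)  # >> http://test.com/img/logo.png
print(">>", absolute)       # >> http://example.org/page.html
```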


### Beautiful Soup basics

[See the Beautiful Soup documentation for reference.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [21]:
from bs4 import BeautifulSoup
import sys
import os

html = """
<html>
<body>
<h1> Python BeautifulSoup study </h1>
<p> tag selector </p><p> css selector </p>
</body>
</html>
"""

You can parse HTML with the `BeautifulSoup` constructor, using the built-in "html.parser" parser.

In [22]:
soup = BeautifulSoup(html, "html.parser")

The `.prettify()` method returns the HTML as an indented, easy-to-read string.

In [23]:
print(soup.prettify())

<html>
 <body>
  <h1>
   Python BeautifulSoup study
  </h1>
  <p>
   tag selector
  </p>
  <p>
   css selector
  </p>
 </body>
</html>



Let's access the `<h1>`. To reach a nested part of the HTML, chain tag names with `.`, for example **soup.html.body.h1**.

In [28]:
h1 = soup.html.body.h1
print("h1 : ", h1)
print("h1 contents : ", h1.string)

h1 :  <h1> Python BeautifulSoup study </h1>
h1 contents :   Python BeautifulSoup study 


What about the `<p>` tags? First get the first `<p>`, then move to the next `<p>` with `.next_sibling`. <br>
※ If the HTML has a line break between the tags (`<p> </p>` \n `<p> </p>`), `.next_sibling` returns the '\n' text node instead. In that case, either remove the '\n' characters beforehand or call `.next_sibling` twice (`.next_sibling.next_sibling`).

In [30]:
p1 = soup.html.body.p
print("p1 : ", p1)
p2 = p1.next_sibling
print("p2 : ", p2)

p1 :  <p> tag selector </p>
p2 :  <p> css selector </p>


The `.previous_sibling` attribute accesses the previous `<p> </p>`.

In [31]:
p3 = p2.previous_sibling
print("p3 : ", p3)

p3 :  <p> tag selector </p>
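
The whitespace problem noted above can also be avoided without calling `.next_sibling` twice. The sketch below uses a small made-up snippet in which the two `<p>` tags sit on separate lines, and compares `.next_sibling` with `find_next_sibling("p")`, which skips text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a newline separates the two <p> tags.
html = "<body><p>first</p>\n<p>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

p1 = soup.body.p
raw_sibling = p1.next_sibling        # the "\n" text node between the tags
next_p = p1.find_next_sibling("p")   # the next <p> tag, skipping text nodes

print("raw sibling : ", repr(raw_sibling))
print("next p : ", next_p)
```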


### Beautiful Soup - tag selector

In [43]:
html = """
<html>
<body>
    <ul>
        <li><a href = "http://www.google.com">google</a></li>
        <li><a href = "httpL//www.yahoo.com">yahoo</a></li>
        <li><a href = "http://www.naver.com">naver</a></li>        
        <li><a href = "http://www.github.com">github</a></li>
        <li><a href = "http://www.github.com/gaussian37">github</a></li>
    </ul>
</body>
</html>

"""

soup = BeautifulSoup(html, "html.parser")

The `.find_all` method returns a list of `Tag` objects that match the given criteria. You can specify the tag name and any attributes the tag should have.

In [44]:
links = soup.find_all("a")

In [45]:
for link in links:
    print(">> link : ", type(link), link)
    print(">> text : ", link.string)
    print(">> href : ", link.attrs["href"])
    print()

>> link :  <class 'bs4.element.Tag'> <a href="http://www.google.com">google</a>
>> text :  google
>> href :  http://www.google.com

>> link :  <class 'bs4.element.Tag'> <a href="httpL//www.yahoo.com">yahoo</a>
>> text :  yahoo
>> href :  httpL//www.yahoo.com

>> link :  <class 'bs4.element.Tag'> <a href="http://www.naver.com">naver</a>
>> text :  naver
>> href :  http://www.naver.com

>> link :  <class 'bs4.element.Tag'> <a href="http://www.github.com">github</a>
>> text :  github
>> href :  http://www.github.com

>> link :  <class 'bs4.element.Tag'> <a href="http://www.github.com/gaussian37">github</a>
>> text :  github
>> href :  http://www.github.com/gaussian37
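
Real pages often contain `<a>` tags without an `href`, in which case `link.attrs["href"]` raises a `KeyError`. A minimal sketch of a safer lookup with `.get()`, using a made-up snippet where the second anchor has no `href`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href attribute.
html = '<a href="http://www.google.com">google</a><a id="top">anchor</a>'
soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None for a missing attribute instead of raising KeyError.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(">> hrefs : ", hrefs)
```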



You can also find tags that meet a specific condition.

In [48]:
github_tags = soup.find_all(name = "a", string = "github")
print(github_tags)

[<a href="http://www.github.com">github</a>, <a href="http://www.github.com/gaussian37">github</a>]


To limit the number of results, use the **limit** option.

In [53]:
github_tag = soup.find_all(name = "a", string = "github", limit= 1)
print(github_tag)

[<a href="http://www.github.com">github</a>]
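
When only the first match is needed, `find` is a common alternative to `find_all(..., limit=1)`. The sketch below, on a made-up two-link snippet, shows the difference in return type: `find` gives back a single tag, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links that share the same text.
html = '<a href="#1">github</a><a href="#2">github</a>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", string="github")    # a single Tag, not a list
missing = soup.find("a", string="gitlab")  # None, because nothing matches

print(first)
print(missing)
```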


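You can also search for strings alone, without a tag name. Passing a list matches any of the listed strings, and the result contains the strings rather than the tags. For more flexible matching, a regular expression is usually a better choice.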

In [54]:
txt_find = soup.find_all(string = ["google", "github"])
print(txt_find)

['google', 'github', 'github']
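
As a sketch of the regular-expression approach, the example below passes a compiled pattern to `string`. The snippet and the `gitlab` entry are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet; "gitlab" is added only to show pattern matching.
html = "<ul><li><a>google</a></li><li><a>github</a></li><li><a>gitlab</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Match every string that starts with "git".
matches = soup.find_all(string=re.compile(r"^git"))
print(matches)
```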


### Beautiful Soup - CSS selector

+ Reference
  - https://www.w3schools.com/cssref/css_selectors.asp <br>
  - https://www.w3schools.com/cssref/trysel.asp

In [57]:
html = """
<html>
<body>
<div id="main">
    <h1> lecture list </h1>
    <ul class = "lectures">
        <li> Python </li>
        <li> Crawling </li>
        <li> Machine Learning </li>
        <li> Deep Learning </li>
    </ul>
</div>
</body>
</html>
"""

In [61]:
soup = BeautifulSoup(html, "html.parser")
h1 = soup.select("div#main > h1")
print("h1 : ", h1)
print("h1 type : ", type(h1))

h1 :  [<h1> lecture list </h1>]
h1 type :  <class 'list'>


In [64]:
list_li = soup.select("div#main > ul.lectures > li")
for li in list_li:
    print("li : ", li.string)

li :   Python 
li :   Crawling 
li :   Machine Learning 
li :   Deep Learning 
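
The printed values above keep the spaces that surround the text inside each `<li>`. One way to drop them is `get_text(strip=True)`; here is a minimal sketch on a shortened copy of the same HTML.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
    <ul class="lectures">
        <li> Python </li>
        <li> Crawling </li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from the text.
titles = [li.get_text(strip=True) for li in soup.select("div#main > ul.lectures > li")]
print(titles)
```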


## BeautifulSoup selector examples

In [66]:
from bs4 import BeautifulSoup
import sys
import os

html = """
<html>
<body>
    <ul>
        <li><a id = "google" href = "http://www.google.com">google</a></li>
        <li><a id = "yahoo" href = "http://www.yahoo.com">yahoo</a></li>
        <li><a id = "naver" href = "http://www.naver.com">naver</a></li>        
        <li><a id = "github" href = "https://www.github.com">github</a></li>
        <li><a href = "https://www.github.com/gaussian37">github</a></li>
    </ul>
</body>
</html>

"""

### find a tag by name and string

In [71]:
soup = BeautifulSoup(html, 'html.parser')
google = soup.find_all(name = "a", string = "google")
print(google)

[<a href="http://www.google.com" id="google">google</a>]


### find a tag by ID

In [33]:
yahoo = soup.find_all(id = "yahoo")

In [34]:
print(yahoo)

[<a href="http://www.yahoo.com" id="yahoo">yahoo</a>]


### find tags with a regular expression

[regular expression reference](http://pythonstudy.xyz/python/article/401-%EC%A0%95%EA%B7%9C-%ED%91%9C%ED%98%84%EC%8B%9D-Regex)

In [39]:
import re

In [43]:
lis = soup.find_all(href = re.compile(r"^https://"))

In [46]:
for li in lis:
    print(li)
    print(li.attrs["href"])

<a href="https://www.github.com" id="github">github</a>
https://www.github.com
<a href="https://www.github.com/gaussian37">github</a>
https://www.github.com/gaussian37


In [80]:
lis = soup.find_all(href = re.compile(r"^http://\w{3}\.\w{6}"))  # "\." matches a literal dot

In [81]:
lis

[<a href="http://www.google.com" id="google">google</a>]

### find a tag with CSS selectors

In [92]:
from bs4 import BeautifulSoup
import sys
import os

food_list = """
<html>
<body>
<div id = "foods">
<h1> menu </h1>
    <ul id = "food-list">
        <li class = "food hot" data-lo = "ko">chicken</li>
        <li class = "food" data-lo = "jp">pork cutlet</li>
        <li class = "food hot" data-lo = "ko">bbq</li>
        <li class = "food" data-lo = "us">steak</li>    
    </ul>
    <ul id = "alcohol-list">
        <li class = "alcohol" data-lo = "ko">soju</li>
        <li class = "alcohol" data-lo = "us">beer</li>    
        <li class = "alcohol" data-lo = "ko">whiskey</li>   
        <li class = "alcohol high" data-lo = "ru">vodka</li>    
        <li class = "alcohol" data-lo = "jp">sake</li>            
    </ul>
</div>
</body>
</html>
"""

In [93]:
soup = BeautifulSoup(food_list, "html.parser")

Let's access **vodka**.

In [107]:
# Take the 8th <li> in document order (nth-of-type only counts siblings under one parent).
print(soup.select("li")[7].string)

vodka


In [108]:
print(soup.select_one("#alcohol-list > li:nth-of-type(4)").string)

vodka
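
A note on `:nth-of-type`: the sketch below assumes a Beautiful Soup release (4.7 or later) that uses the soupsieve package for CSS selectors. There the position is counted among the siblings under one parent, not across the whole document. The two small lists are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two lists with two items each.
html = """
<ul id="a"><li>a1</li><li>a2</li></ul>
<ul id="b"><li>b1</li><li>b2</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# The second <li> of each list matches.
second_items = [li.string for li in soup.select("li:nth-of-type(2)")]

# No list has a third <li>, so nothing matches.
no_match = soup.select_one("li:nth-of-type(3)")

# To count across the whole document, index into the full result instead.
third_overall = soup.select("li")[2].string

print(second_items, no_match, third_overall)
```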


Note: `soup.select` returns a list of tags, not a single tag, so index into the result before reading `.string`.

In [109]:
print(soup.select("#alcohol-list > li[data-lo='ru']")[0].string)

vodka


In [112]:
print(soup.select("#alcohol-list > li.alcohol.high")[0].string)

vodka


In [121]:
param = {"data-lo" : "ru"}
print(soup.find("li", param).string)

vodka


In [123]:
print(soup.find(id = "alcohol-list").find("li", param).string)

vodka


In [126]:
for e in soup.find_all("li"):
    if e["data-lo"] == "us":
        print(e.string)

steak
beer
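
The loop above works because every `<li>` has a `data-lo` attribute; a tag without it would raise a `KeyError`. A possible alternative is to let `find_all` do the filtering with an `attrs` dictionary, since `data-lo` is not a valid Python keyword argument name. The snippet, including the `water` item without the attribute, is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the last <li> has no data-lo attribute.
html = """
<ul>
    <li class="food" data-lo="us">steak</li>
    <li class="alcohol" data-lo="us">beer</li>
    <li class="alcohol" data-lo="ko">soju</li>
    <li>water</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Only tags whose data-lo equals "us" are returned; tags without it are skipped.
us_items = [li.string for li in soup.find_all("li", attrs={"data-lo": "us"})]
print(us_items)
```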


In [131]:
car_list = """
<ul id = "cars">
    <li id = "ge">Genesis</li>
    <li id = "av">Avante</li>
    <li id = "so">Sonata</li>
    <li id = "gr">Granduer</li>    
    <li id = "tu">Tucson</li>
</ul>
"""

soup = BeautifulSoup(car_list, "html.parser")

In [132]:
def car_list_function(selector):
    print("car list : ", soup.select_one(selector).string)

To select an element by its `id`, use the `#id` selector.

In [133]:
car_list_function("#gr")

car list :  Granduer


In [135]:
car_list_function("li#gr")

car list :  Granduer


In [136]:
car_list_function("ul > li#tu")

car list :  Tucson


In [140]:
car_list_function("#cars > #gr")

car list :  Granduer


In [141]:
car_list_function("li[id='gr']")

car list :  Granduer


In [143]:
print("with soup.select : ", soup.select("li")[3].string)

with soup.select :  Granduer


In [145]:
print("with soup.find_all : ", soup.find_all("li")[3].string)

with soup.find_all :  Granduer


We can also use a `lambda` function.

In [155]:
car_list_lambda = lambda selection : print("car_lambda : ", soup.select_one(selection).string)

In [156]:
car_list_lambda("#cars > #ge")

car_lambda :  Genesis
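
Both helpers above fail with an `AttributeError` when the selector matches nothing, because `select_one` returns `None`. A minimal sketch of a variant that returns the text and handles that case; the `car_name` helper and the `#xx` selector are hypothetical names introduced here.

```python
from bs4 import BeautifulSoup

car_list = """
<ul id="cars">
    <li id="ge">Genesis</li>
    <li id="so">Sonata</li>
</ul>
"""
soup = BeautifulSoup(car_list, "html.parser")

def car_name(selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None

print("car name : ", car_name("#cars > #ge"))
print("car name : ", car_name("#xx"))
```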
