# Beautiful Soup Basics

`res.text` gave us all the HTML that we needed, but it is just a string. Beautiful Soup (BS) lets us parse that string, i.e. convert it into an object that we can navigate and manipulate.

We can now use Beautiful Soup to parse the string into a tree of tags and access individual parts of it, e.g. `soup.body`:

In [3]:
# scrape.py

import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.body.contents)

[<center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>
</tr></table></td></tr>
<tr id="pagespace" style="height:10px" title=""></tr><tr><td><table border="0" cellpadding="0" cellspacing="0" class="itemlist">
<tr class="athing" id=
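To see the difference between the raw string and the parsed object without hitting the network, here is a minimal sketch. The `html` string is a hypothetical stand-in for `res.text`, not real Hacker News markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.text, so the example runs offline.
html = "<html><head><title>Demo</title></head><body><p>Hello</p><p>World</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# The raw HTML is a plain str; the parsed result is a navigable object.
print(type(html).__name__)   # str
print(type(soup).__name__)   # BeautifulSoup

# .contents is the list of a tag's direct children.
children = soup.body.contents
print(len(children))                              # 2
print(children[0].name, children[0].get_text())   # p Hello
```

On the real page `soup.body.contents` is much longer, but it is the same kind of list: each entry is a tag (or a piece of text) that can be inspected further.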

We can look for all the `div`s in the soup object:

In [4]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.find_all('div'))

[<div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upv
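Since `find_all` returns a list-like result, it can be counted, indexed, and narrowed down by attribute. The sketch below runs on a small hand-written fragment that imitates the `votearrow` divs above; the `other` class is made up for contrast.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the output above; "other" is a made-up class.
html = """
<table>
  <tr><td><div class="votearrow" title="upvote"></div></td></tr>
  <tr><td><div class="votearrow" title="upvote"></div></td></tr>
  <tr><td><div class="other"></div></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <div>, regardless of class.
all_divs = soup.find_all("div")

# Only the <div>s whose class is "votearrow" (class_ avoids the Python keyword).
vote_arrows = soup.find_all("div", class_="votearrow")

print(len(all_divs), len(vote_arrows))   # 3 2
print(vote_arrows[0]["title"])           # upvote
```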

Maybe we want to get all the `<a>` tags, which are the links:

In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.find_all('a'))
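In practice we usually want each link's text and target rather than the whole tag. A minimal sketch, using a hand-written fragment based on the navigation bar shown earlier (the `<a>` without an `href` is added here to show the missing-attribute case):

```python
from bs4 import BeautifulSoup

# Fragment modelled on the Hacker News navigation bar; the last <a> is made up.
html = """
<span class="pagetop">
  <a href="newest">new</a> | <a href="front">past</a> | <a>no link</a>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")        # .get returns None if the attribute is missing
    if href is not None:
        links.append((a.get_text(strip=True), href))

print(links)   # [('new', 'newest'), ('past', 'front')]
```

Using `a.get("href")` instead of `a["href"]` avoids a `KeyError` on anchors that have no `href`.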

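`find_all('a')` returns a list of every `<a>` tag on the page. A single tag can also be accessed directly as an attribute of the soup object, for example the page title: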

In [7]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.title)

<title>Hacker News</title>


In [8]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.a)

<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>


`soup.a` returns only the first `<a>` tag in the document. `soup.find('a')` does the same thing:

In [9]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.find('a'))

<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>
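The two spellings can be compared directly. The sketch below uses a small hand-written fragment (not the live page) and also shows what happens when no matching tag exists:

```python
from bs4 import BeautifulSoup

# Hand-written fragment with two links and no <img> tag.
html = '<p><a href="news">Hacker News</a> <a href="newest">new</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_by_attr = soup.a          # attribute-style access
first_by_find = soup.find("a")  # explicit method call

print(first_by_attr == first_by_find)   # True
print(first_by_attr["href"])            # news

# When nothing matches, find returns None instead of raising an error.
missing = soup.find("img")
print(missing is None)                  # True
```

Because a missing tag comes back as `None`, it is worth checking the result before reading attributes from it.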


This is useful because when we inspect a story on Hacker News in the browser's developer tools, we can see the HTML elements that hold all of its information. If we also want the votes, we need to keep drilling down to the table cell that contains the score:

In [None]:
<td align="right" valign="top" class="title"><span class="rank">3.</span></td>      <td valign="top" class="votelinks"><center><a id='up_33256446'href='vote?id=33256446&amp;how=up&amp;goto=news'><div class='votearrow' title='upvote'></div></a></center></td><td class="title"><span class="titleline"><a href="https://commoncog.com/focus-saying-no-to-good-ideas/">Focus is saying no to good ideas</a><span class="sitebit comhead"> (<a href="from?site=commoncog.com"><span class="sitestr">commoncog.com</span></a>)</span></span></td></tr><tr><td colspan="2"></td><td class="subtext"><span class="subline">
  <span class="score" id="score_33256446">232 points</span> by <a href="user?id=nsoonhui" class="hnuser">nsoonhui</a> <span class="age" title="2022-10-19T01:49:45"><a href="item?id=33256446">10 hours ago</a></span> <span id="unv_33256446"></span> | <a href="hide?id=33256446&amp;goto=news">hide</a> | <a href="item?id=33256446">67&nbsp;comments</a>        </span>
      </td>

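We can use the `id` attribute, `id="score_33256446"`, to look up the element that holds the score. Because the page is live, the points value printed below can differ from the one in the HTML copied above: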

In [10]:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://news.ycombinator.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.find(id='score_33256446'))

<span class="score" id="score_33256446">269 points</span>
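An `id` such as `score_33256446` belongs to a single story, so a lookup by `id` only works while that story is on the page. One possible alternative, sketched here on a hand-written fragment, is to search by the `score` class and read the number out of the text. The second story (`score_1`) is invented for illustration.

```python
from bs4 import BeautifulSoup

# First span copied from the inspected HTML above; the second one is made up.
html = """
<td class="subtext">
  <span class="score" id="score_33256446">232 points</span>
</td>
<td class="subtext">
  <span class="score" id="score_1">1 point</span>
</td>
"""
soup = BeautifulSoup(html, "html.parser")

scores = {}
for span in soup.find_all(class_="score"):
    item_id = span["id"].split("_", 1)[1]      # "score_33256446" -> "33256446"
    points = int(span.get_text().split()[0])   # "232 points" -> 232
    scores[item_id] = points

print(scores)   # {'33256446': 232, '1': 1}
```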


We can select whatever data we want!