# Beautiful Soup

#### There are two main ways to extract data from a website:

#### 1. Use the website's API, if one exists. For example, Facebook provides the Graph API, which allows retrieval of data posted on Facebook.

#### 2. Access the HTML of the web page and extract the useful data from it. This technique is called web scraping, web harvesting, or web data extraction.

#### The requests library downloads the HTML of a web page, and Beautiful Soup lets us work with that HTML.

#### Once we have the HTML content, the next task is to parse it.

#### Because HTML is nested, we cannot reliably extract data with plain string processing. We need a parser that builds a nested (tree) structure from the HTML.

#### After that, all we need to do is navigate and search the parse tree, i.e. tree traversal.
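The steps described above (get the HTML, parse it into a tree, traverse the tree) can be sketched on a small inline document, so that it runs without a network connection. The HTML string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, for illustration only).
html_doc = """
<html>
  <body>
    <div id="outer">
      <div id="inner"><p>nested text</p></div>
    </div>
  </body>
</html>
"""

# Parse the string into a tree of Tag and NavigableString objects.
soup = BeautifulSoup(html_doc, "html.parser")

# Traverse the tree: find the <p> tag, then walk up through its ancestors.
p = soup.find("p")
ancestors = [parent.name for parent in p.parents]

print(p.get_text())  # text inside the <p> tag
print(ancestors)     # tag names from the nearest parent up to the document root
```

Walking upward through `p.parents` shows the nesting that plain string processing would have to reconstruct by hand.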

In [1]:
# Let's extract data from HTML code of webpage "http://dataquestio.github.io/web-scraping-pages/simple.html"

In [2]:
import requests
from bs4 import BeautifulSoup   #for beautiful soup

In [3]:
r = requests.get('http://dataquestio.github.io/web-scraping-pages/simple.html')

In [4]:
r

<Response [200]>

In [5]:
r.text

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

#### The requests library returns the page as one raw, unstructured string, as shown above. The bs4 library converts it into a structured parse tree from which we can extract data.

In [6]:
# r.content (bytes) or r.text (str) both work here
# naming the parser explicitly avoids a warning from bs4 about guessing the parser
soup = BeautifulSoup(r.content, 'html.parser')

# we can also parse with 'lxml' instead of 'html.parser'; for that we need to install the lxml library
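As a small check of the r.content versus r.text remark, the sketch below parses the same markup once as str and once as bytes. The sample markup is an assumption chosen to mirror the page used here; no request is made.

```python
from bs4 import BeautifulSoup

# Same markup in both forms: str (like r.text) and bytes (like r.content).
markup_text = "<html><head><title>A simple example page</title></head></html>"
markup_bytes = markup_text.encode("utf-8")

soup_from_text = BeautifulSoup(markup_text, "html.parser")
soup_from_bytes = BeautifulSoup(markup_bytes, "html.parser")

print(soup_from_text.title.get_text())
print(soup_from_bytes.title.get_text())
```

When bytes are passed, Beautiful Soup works out the encoding itself, which is why both forms give the same tree for this simple input.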

In [7]:
soup  #got soup object

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [8]:
print(soup.prettify())  # indented output that mirrors the nesting of the HTML

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [9]:
soup.children  #this gives iterator

<list_iterator at 0x2aeb0756608>

#### Iterating over soup.children gives a list of 3 elements

In [10]:
[x for x in soup.children]

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [11]:
len([x for x in soup.children])

3

In [12]:
[type(x) for x in soup.children]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

#### In the list above, the first element is the Doctype declaration, the second is a NavigableString, which holds text that is not a tag (here just \n), and the last is the html Tag object. All three are siblings at the top level of the document; the html tag is not a child of the Doctype.

#### So the output is: doctype, navigable string, html tag

#### We are interested in the last element, at index -1 of the list, which is the html Tag object
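Because whitespace between tags shows up as NavigableString children, picking elements by position is fragile. One possible alternative, sketched here on an inline copy of the page, is to keep only the children that are Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the example page, so no request is needed.
page = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(page, "html.parser")

# Keep only real tags; the Doctype and the "\n" strings are skipped.
top_level_tags = [child for child in soup.children if isinstance(child, Tag)]
html = top_level_tags[0]

# The same filter one level down yields just head and body.
html_child_names = [child.name for child in html.children if isinstance(child, Tag)]
print(html_child_names)
```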

In [13]:
html = list(soup.children)[-1]

In [14]:
html #here we will get only html tag now

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

#### We can call .children again on this html tag to go deeper into the other tags, such as head and body. In this way we can walk through the HTML code

#### Iterating over html.children now gives \n, head tag, \n, body tag, \n

In [15]:
len([x for x in html.children])

5

In [16]:
[x for x in html.children]

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [17]:
[type(x) for x in html.children]

[bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

#### If we are interested in the head tag, we take index 1 of the list, where the head tag sits. Let's see

In [18]:
head = list(html.children)[1]

In [19]:
head

<head>
<title>A simple example page</title>
</head>

In [20]:
[x for x in head.children]

['\n', <title>A simple example page</title>, '\n']

#### If we are interested in the title tag, we again take index 1 of the list, where the title tag sits. Let's see

In [21]:
title = list(head.children)[1]

In [22]:
title

<title>A simple example page</title>

In [23]:
# the title tag has no child tags left; its only child is the text itself (a NavigableString),
# so instead of title.children we will use title.getText() to fetch the text

# let's see what title.children gives
[x for x in title.children]

['A simple example page']

In [24]:
title.getText()  # the usual way to fetch the text once a tag has no child tags left (getText is an alias of get_text)

'A simple example page'

#### If we need to fetch the content in body tag

In [25]:
body = list(html.children)[3]

In [26]:
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [27]:
[x for x in body.children] 

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [28]:
paragraph = list(body.children)[1]

In [29]:
paragraph

<p>Here is some simple content for this page.</p>

In [30]:
paragraph.getText()

'Here is some simple content for this page.'
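The step-by-step descent through .children can usually be shortened. Beautiful Soup lets a tag name be used as an attribute (for example soup.title or soup.body.p) and also offers find. A minimal sketch on an inline copy of the same page:

```python
from bs4 import BeautifulSoup

page = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(page, "html.parser")

# Attribute-style access returns the first tag with that name.
title_text = soup.title.get_text()
paragraph_text = soup.body.p.get_text()

# find("p") reaches the same tag without spelling out the path.
same_paragraph_text = soup.find("p").get_text()

print(title_text)
print(paragraph_text)
```

Both forms return only the first matching tag, so they suit pages like this one where the tag occurs once.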

### This works for web pages with simple HTML, but on large sites it is not practical to descend level by level as above, so we use the following approach

#### Instead of walking through the html, body and title tags, we can look elements up by the id and class attributes in the HTML. On dynamic pages these ids and classes usually stay the same while only the content changes, so the script keeps working until the site's HTML structure itself is changed
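Before moving to a real site, here is a small offline sketch of lookup by id and class. The snippet, the id "forecast" and the class names "item" and "name" are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one block with an id, plus a stray element of the same class.
snippet = """
<div id="forecast">
  <div class="item"><p class="name">Today</p></div>
  <div class="item"><p class="name">Tomorrow</p></div>
</div>
<div class="item"><p class="name">Outside the forecast block</p></div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# An id should be unique on a page, so find() is enough.
block = soup.find(id="forecast")

# Searching inside the block ignores the stray "item" outside it.
items = block.find_all(class_="item")
names = [item.find(class_="name").get_text() for item in items]

print(names)
```

Narrowing the search to one container first keeps unrelated elements with the same class out of the result.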

In [31]:
# Let's scrape the data from the site
# 'https://forecast.weather.gov/MapClick.php?lat=32.7157&lon=-117.1617#.YDypH2gzZPY'

In [32]:
data = requests.get('https://forecast.weather.gov/MapClick.php?lat=32.7157&lon=-117.1617#.YDypH2gzZPY')

In [33]:
soup = BeautifulSoup(data.text, 'html.parser')

In [34]:
# Now from page's HTML code, we will extract id="seven-day-forecast" to see data relevant to only that div tag
# as there is only one id named seven-day-forecast, we will use find method on soup object

seven_day = soup.find(id='seven-day-forecast')

In [35]:
# Each forecast period sits in an element with class="tombstone-container", so we search for that class next
# As several elements share the class tombstone-container, we use the find_all method

forecast_items = seven_day.find_all(class_="tombstone-container")  # class_ is used because class is a reserved keyword in Python

In [36]:
# if we are interested only in tonight forecast, we need to use zeroth index of forecast_items list

tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Overnight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Overnight: Mostly clear, with a low around 47. Calm wind. " class="forecast-icon" src="newimages/medium/nfew.png" title="Overnight: Mostly clear, with a low around 47. Calm wind. "/>
 </p>
 <p class="short-desc">
  Mostly Clear
 </p>
 <p class="temp temp-low">
  Low: 47 °F
 </p>
</div>


#### From the tombstone-container element for tonight shown above, we will grab the elements with class="period-name", class="short-desc" and class="temp temp-low", and from the img tag we will grab the title attribute

In [37]:
period = tonight.find(class_="period-name")

In [38]:
short_desc = tonight.find(class_="short-desc")

In [39]:
temperature = tonight.find(class_="temp temp-low")  # 'temp temp-low' matches the exact class string; 'temp' alone also works
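The difference between searching for 'temp temp-low' and for 'temp' can be seen on a small inline snippet. The markup below is an assumption modeled on the forecast output; the third paragraph with the classes in reverse order is added only to show the contrast.

```python
from bs4 import BeautifulSoup

snippet = """
<p class="temp temp-low">Low: 47 °F</p>
<p class="temp temp-high">High: 73 °F</p>
<p class="temp-low temp">Low: 48 °F</p>
"""

soup = BeautifulSoup(snippet, "html.parser")

# A single class name matches every tag that carries that class.
any_temp = soup.find_all(class_="temp")

# A string with two names is compared against the exact class attribute value,
# so the order of the names matters.
exact_string = soup.find_all(class_="temp temp-low")

# A CSS selector requires both classes but ignores their order.
css_both = soup.select("p.temp.temp-low")

print(len(any_temp), len(exact_string), len(css_both))
```

If the order of class names on a site is not guaranteed, the single class name or the CSS selector is the safer choice.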

In [40]:
print(period.get_text())
print(short_desc.get_text())
print(temperature.get_text())

Overnight
Mostly Clear
Low: 47 °F


In [41]:
image = tonight.find("img")

In [42]:
desc = image['title']  # a Tag behaves like a dictionary for its attributes, so we can look up a key such as 'title'

In [43]:
print(desc)

Overnight: Mostly clear, with a low around 47. Calm wind. 
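Dictionary-style access raises KeyError when the attribute is missing, which can stop a scraping loop. A short sketch of the safer .get() form, using a hand-written img tag that mirrors the one above (shortened, and without an alt attribute on purpose):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<img class="forecast-icon" src="newimages/medium/nfew.png" '
    'title="Overnight: Mostly clear">',
    "html.parser",
)
image = soup.find("img")

title = image["title"]                     # present, so indexing works
alt = image.get("alt")                     # missing, so .get() returns None
alt_or_default = image.get("alt", "n/a")   # or a default of our choosing
classes = image["class"]                   # class is multi-valued, so this is a list

try:
    image["alt"]
    raised = False
except KeyError:
    raised = True

print(title, alt, alt_or_default, classes, raised)
```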


In [44]:
# we can also run a for loop to grab the data for all periods
# iterating over forecast_items directly avoids hard-coding the count (there were 9 tombstone-container elements here)

for per in forecast_items:
    period = per.find(class_="period-name")
    short_desc = per.find(class_="short-desc")
    temperature = per.find(class_="temp")
    image = per.find("img")
    desc = image['title']
    print(period.get_text())
    print(short_desc.get_text())
    print(temperature.get_text())
    print(desc)
    print('\n')
    

Overnight
Mostly Clear
Low: 47 °F
Overnight: Mostly clear, with a low around 47. Calm wind. 


Monday
Sunny
High: 73 °F
Monday: Sunny, with a high near 73. Calm wind becoming northwest around 5 mph in the afternoon. 


MondayNight
Mostly Clear
Low: 48 °F
Monday Night: Mostly clear, with a low around 48. Northwest wind around 5 mph becoming calm. 


Tuesday
Sunny
High: 73 °F
Tuesday: Sunny, with a high near 73. Light and variable wind becoming west 5 to 10 mph in the afternoon. 


TuesdayNight
Mostly Clear
Low: 49 °F
Tuesday Night: Mostly clear, with a low around 49. Calm wind. 


Wednesday
Slight ChanceShowers
High: 68 °F
Wednesday: A slight chance of showers.  Partly sunny, with a high near 68. Chance of precipitation is 20%.


WednesdayNight
ChanceShowers
Low: 51 °F
Wednesday Night: A chance of showers, mainly before 4am.  Mostly cloudy, with a low around 51. Chance of precipitation is 50%.


Thursday
Slight ChanceShowers thenMostly Sunny
High: 66 °F
Thursday: A slight chance of show
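The printed output shows names such as "MondayNight", most likely because get_text() joins the text on both sides of a br tag without a space. The sketch below collects the same fields into a list of dictionaries, passes a separator to get_text, and tolerates a missing element. The sample markup and the helper text_or_none are hypothetical; the second container deliberately lacks a temperature and an image.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the forecast markup (not fetched from the site).
sample = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Monday<br/>Night</p>
    <p><img class="forecast-icon" title="Monday Night: Mostly clear."/></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp temp-low">Low: 48 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tuesday<br/><br/></p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""


def text_or_none(parent, class_name):
    """Return the text of the first match, or None if nothing matches."""
    tag = parent.find(class_=class_name)
    if tag is None:
        return None
    # separator=" " puts a space where tags such as <br/> split the text.
    return tag.get_text(separator=" ", strip=True)


soup = BeautifulSoup(sample, "html.parser")
records = []
for item in soup.find(id="seven-day-forecast").find_all(class_="tombstone-container"):
    image = item.find("img")
    records.append({
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temperature": text_or_none(item, "temp"),
        "description": image.get("title") if image is not None else None,
    })

for record in records:
    print(record)
```

A list of dictionaries like this is also easy to hand to other tools later, for example to write a CSV file.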

### 1. Grabbing the title of a web page (Udemy video) using the select method

In [51]:
rt = requests.get('https://www.example.com/')

In [52]:
rt

<Response [200]>

In [53]:
r.text  # note: this reuses r, the simple.html response from In [3]; rt fetched above is not used

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

In [54]:
soup = BeautifulSoup(r.text, 'html.parser')

In [55]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [59]:
# to grab the title of the page

a = soup.select('title')  # pass the tag name 'title'; for paragraphs pass 'p'

# select takes a CSS selector; a bare tag name is the simplest kind of selector

In [63]:
a  # returns a list of matching Tag objects, not a string

[<title>A simple example page</title>]

In [64]:
a[0]

<title>A simple example page</title>

In [65]:
a[0].getText()

'A simple example page'
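When only the first match is needed, select_one saves the indexing step: it returns a single Tag, or None when nothing matches. A minimal sketch on an inline copy of the page:

```python
from bs4 import BeautifulSoup

page = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(page, "html.parser")

# select() returns a list; select_one() returns the first match directly.
title_tag = soup.select_one("title")
title_text = title_tag.get_text()

# No <h1> exists in this page, so the result is None rather than an empty list.
missing = soup.select_one("h1")

print(title_text, missing)
```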

### 2. Grabbing elements by class (Udemy video) using the select method

Suppose that from https://en.wikipedia.org/wiki/Jonas_Salk we need to grab all the entries of the table of contents, such as

    Early life and education
    Polio research ... etc.

<table>

<thead >
<tr>
<th>
<p>Syntax to pass to the .select() method</p>
</th>
<th>
<p>Match Results</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p><code>soup.select('div')</code></p>
</td>
<td>
<p>All elements with the <code>&lt;div&gt;</code> tag</p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('#some_id')</code></p>
</td>
<td>
<p>The HTML element containing the <code>id</code> attribute of <code>some_id</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('.notice')</code></p>
</td>
<td>
<p>All the HTML elements with the CSS <code>class</code> named <code>notice</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div span')</code></p>
</td>
<td>
<p>Any elements named <code>&lt;span&gt;</code> that are within an element named <code>&lt;div&gt;</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div &gt; span')</code></p>
</td>
<td>
<p>Any elements named <code class="literal2">&lt;span&gt;</code> that are <span><em >directly</em></span> within an element named <code class="literal2">&lt;div&gt;</code>, with no other element in between</p>
</td>
</tr>
</tbody>
</table>
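Each selector form in the table can be tried on a small inline snippet. The markup below is made up for illustration; the id some_id and the class notice are taken from the table.

```python
from bs4 import BeautifulSoup

snippet = """
<div id="some_id">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span>outside any div</span>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Number of matches for each selector form from the table.
counts = {
    "div": len(soup.select("div")),
    "#some_id": len(soup.select("#some_id")),
    ".notice": len(soup.select(".notice")),
    "div span": len(soup.select("div span")),      # any depth inside a div
    "div > span": len(soup.select("div > span")),  # direct children only
}

for selector, count in counts.items():
    print(selector, count)
```

The last two rows differ because the second span sits inside a p tag, so it is a descendant of the div but not a direct child.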

In [66]:
res = requests.get('https://en.wikipedia.org/wiki/Jonas_Salk')

In [67]:
res

<Response [200]>

In [68]:
soup = BeautifulSoup(res.text, 'html.parser')

In [70]:
soup.select('.toctext')

[<span class="toctext">Early life and education</span>,
 <span class="toctext">Education</span>,
 <span class="toctext">Medical school</span>,
 <span class="toctext">Postgraduate research and early laboratory work</span>,
 <span class="toctext">Polio research</span>,
 <span class="toctext">Becoming a public figure</span>,
 <span class="toctext">Celebrity versus privacy</span>,
 <span class="toctext">Maintaining his individuality</span>,
 <span class="toctext">Establishing the Salk Institute</span>,
 <span class="toctext">AIDS vaccine work</span>,
 <span class="toctext">Salk's "biophilosophy"</span>,
 <span class="toctext">Personal life</span>,
 <span class="toctext">Honors and recognition</span>,
 <span class="toctext">Documentary films</span>,
 <span class="toctext">Salk's book publications</span>,
 <span class="toctext">See also</span>,
 <span class="toctext">References</span>,
 <span class="toctext">Further reading</span>,
 <span class="toctext">External links</span>]

In [72]:
first_line = soup.select('.toctext')[0]

In [73]:
first_line.text

'Early life and education'

In [71]:
# now to grab only text from all of the above

for i in soup.select('.toctext'):
    print(i.text)  #.text will print only text

Early life and education
Education
Medical school
Postgraduate research and early laboratory work
Polio research
Becoming a public figure
Celebrity versus privacy
Maintaining his individuality
Establishing the Salk Institute
AIDS vaccine work
Salk's "biophilosophy"
Personal life
Honors and recognition
Documentary films
Salk's book publications
See also
References
Further reading
External links
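The loop can also be written as a list comprehension, and a more specific selector can restrict the result to top-level entries. The markup below is a hand-written assumption loosely modeled on a nested table of contents; the toclevel-1 and toclevel-2 class names are not taken from the output above and may not match the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical table-of-contents markup with one nested sub-entry.
toc = """
<div id="toc">
  <ul>
    <li class="toclevel-1"><a href="#a"><span class="toctext">Early life and education</span></a>
      <ul>
        <li class="toclevel-2"><a href="#b"><span class="toctext">Education</span></a></li>
      </ul>
    </li>
    <li class="toclevel-1"><a href="#c"><span class="toctext">Polio research</span></a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(toc, "html.parser")

# Every entry, collected into a list instead of printed one by one.
all_entries = [span.get_text() for span in soup.select(".toctext")]

# Only entries whose link is a direct child of a top-level list item.
top_level = [span.get_text() for span in soup.select("li.toclevel-1 > a > .toctext")]

print(all_entries)
print(top_level)
```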
