# Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

The Internet provides abundant sources of information for professionals and enthusiasts from various industries. Extracting data from websites, however, can be tedious, especially if you need to retrieve data in the same format every day.

For example, if you follow the stock market, recording closing prices every day can be a pain, especially when you have to open several web pages to do so. You can make data extraction easier by building your own web scraper to retrieve stock indices automatically.

## Scraping Rules
* Check a website's Terms and Conditions before you scrape it. Read the statements about legal use of data carefully, as the data you scrape usually should not be used for commercial purposes.
* Do not request data from the website too aggressively (also known as spamming), as this may overload the website. Make sure your program behaves reasonably (i.e., like a human); one request for one web page per second is good practice.
* The layout of a website may change from time to time, so revisit the site and rewrite your code as needed.
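
The one-request-per-second rule can be sketched as a small helper that pauses between consecutive requests. This is only an illustration: `fetch_politely` and its parameters are hypothetical names, and the download function is passed in as an argument so the pacing logic can be tried without touching a real website.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL; `sleep` can be replaced in tests.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

In a real scraper, `requests.get` could be passed as `fetch`, for example `fetch_politely(list_of_urls, requests.get)`.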

In [3]:
from lxml import html
import requests

In [4]:
# A context manager closes the file automatically, even if reading fails
with open("test.html", "r") as file:
    pg = file.read()
pg

"<html>\n    <head>\n    </head>\n    <body>\n        <p>\n            A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.\n        </p>\n        <p>\n            We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.\n        </p>\n    </body>\n</html>"

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(pg, 'html.parser')
soup

<html>
<head>
</head>
<body>
<p>
            A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
        </p>
<p>
            We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.
        </p>
</body>
</html>

In [6]:
soup.find_all('p')

[<p>
             A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
         </p>, <p>
             We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.
         </p>]

In [7]:
# Collect the text of every <p> tag
pageText = [paragraph.get_text() for paragraph in soup.find_all('p')]
pageText

["\n            A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.\n        ",
 '\n            We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.\n        ']
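
The extracted paragraphs still carry the newlines and indentation of the HTML source. A minimal way to tidy them, assuming plain strings like the ones above (shortened here for readability), is to collapse every run of whitespace into a single space. `clean_text` is a hypothetical helper name.

```python
def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(raw.split())


# Shortened stand-ins for the paragraph strings extracted above
raw_paragraphs = [
    "\n            A status_code of 200 means that the page downloaded successfully.\n        ",
    "\n            We can use the BeautifulSoup library to parse this document.\n        ",
]

cleaned = [clean_text(text) for text in raw_paragraphs]
print(cleaned)
```

BeautifulSoup's `get_text(strip=True)` also trims leading and trailing whitespace at extraction time.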

**Select a tokenizer of your choice.** The available tokenizers are listed at https://www.nltk.org/_modules/nltk/tokenize.html

In [8]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

In [9]:
tknizer = PunktSentenceTokenizer()

In [10]:
tknizer.tokenize(pageText[0])

['\n            A status_code of 200 means that the page downloaded successfully.',
 "We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error."]

In [11]:
from nltk.tokenize.regexp import WhitespaceTokenizer

In [12]:
toknizer = WhitespaceTokenizer()

In [13]:
toknizer.tokenize(pageText[0])

['A',
 'status_code',
 'of',
 '200',
 'means',
 'that',
 'the',
 'page',
 'downloaded',
 'successfully.',
 'We',
 "won't",
 'fully',
 'dive',
 'into',
 'status',
 'codes',
 'here,',
 'but',
 'a',
 'status',
 'code',
 'starting',
 'with',
 'a',
 '2',
 'generally',
 'indicates',
 'success,',
 'and',
 'a',
 'code',
 'starting',
 'with',
 'a',
 '4',
 'or',
 'a',
 '5',
 'indicates',
 'an',
 'error.']

In [14]:
words = []
for txt in pageText:
    # Split each paragraph on whitespace and add its tokens to the list
    words.extend(toknizer.tokenize(txt))

words

['A',
 'status_code',
 'of',
 '200',
 'means',
 'that',
 'the',
 'page',
 'downloaded',
 'successfully.',
 'We',
 "won't",
 'fully',
 'dive',
 'into',
 'status',
 'codes',
 'here,',
 'but',
 'a',
 'status',
 'code',
 'starting',
 'with',
 'a',
 '2',
 'generally',
 'indicates',
 'success,',
 'and',
 'a',
 'code',
 'starting',
 'with',
 'a',
 '4',
 'or',
 'a',
 '5',
 'indicates',
 'an',
 'error.',
 'We',
 'can',
 'use',
 'the',
 'BeautifulSoup',
 'library',
 'to',
 'parse',
 'this',
 'document,',
 'and',
 'extract',
 'the',
 'text',
 'from',
 'the',
 'p',
 'tag.',
 'We',
 'first',
 'have',
 'to',
 'import',
 'the',
 'library,',
 'and',
 'create',
 'an',
 'instance',
 'of',
 'the',
 'BeautifulSoup',
 'class',
 'to',
 'parse',
 'our',
 'document.']
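
Whitespace tokenization leaves punctuation and capitalisation in place, so `'document,'` and `'document.'` would be counted as different words, as would `'A'` and `'a'`. One possible normalisation step, sketched here with a hypothetical `normalize` helper on a few of the tokens above, lower-cases each token and strips punctuation from its ends only.

```python
import string


def normalize(token):
    """Lower-case a token and strip punctuation from both ends."""
    return token.lower().strip(string.punctuation)


# A few tokens taken from the output above
tokens = ["A", "status_code", "successfully.", "here,", "won't", "a", "document."]

# Normalise, then drop anything that became empty (pure punctuation)
normalized = [normalize(token) for token in tokens]
normalized = [token for token in normalized if token]
print(normalized)
```

Stripping only the ends keeps inner characters intact, so `status_code` and `won't` survive unchanged.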

In [15]:
import pandas as pd
dfWords = pd.DataFrame(words,columns=["words"])
dfWords.head(10)

|   | words |
|---|---|
| 0 | A |
| 1 | status_code |
| 2 | of |
| 3 | 200 |
| 4 | means |
| 5 | that |
| 6 | the |
| 7 | page |
| 8 | downloaded |
| 9 | successfully. |


In [16]:
dfc = pd.DataFrame(dfWords.words.value_counts())
dfc

|   | words |
|---|---|
| the | 6 |
| a | 5 |
| to | 3 |
| and | 3 |
| We | 3 |
| indicates | 2 |
| of | 2 |
| parse | 2 |
| an | 2 |
| BeautifulSoup | 2 |


In [17]:
dfc.columns

Index(['words'], dtype='object')

In [18]:
dfc = dfc.rename(index=str, columns={"words": "Count"})
dfc

|   | Count |
|---|---|
| the | 6 |
| a | 5 |
| to | 3 |
| and | 3 |
| We | 3 |
| indicates | 2 |
| of | 2 |
| parse | 2 |
| an | 2 |
| BeautifulSoup | 2 |
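
The same kind of frequency table can be built without pandas. As a rough sketch, the standard library's `collections.Counter` counts list items directly; the short `sample_words` list is an illustrative stand-in, not the full token list from the page.

```python
from collections import Counter

# Illustrative stand-in for the token list
sample_words = ["the", "page", "the", "a", "code", "a", "the"]

counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs
print(counts.most_common(2))

# A Counter converts to a plain dict of word -> count
print(dict(counts))
```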


In [19]:
dfc.to_dict()

{'Count': {'2': 1,
  '200': 1,
  '4': 1,
  '5': 1,
  'A': 1,
  'BeautifulSoup': 2,
  'We': 3,
  'a': 5,
  'an': 2,
  'and': 3,
  'but': 1,
  'can': 1,
  'class': 1,
  'code': 2,
  'codes': 1,
  'create': 1,
  'dive': 1,
  'document,': 1,
  'document.': 1,
  'downloaded': 1,
  'error.': 1,
  'extract': 1,
  'first': 1,
  'from': 1,
  'fully': 1,
  'generally': 1,
  'have': 1,
  'here,': 1,
  'import': 1,
  'indicates': 2,
  'instance': 1,
  'into': 1,
  'library': 1,
  'library,': 1,
  'means': 1,
  'of': 2,
  'or': 1,
  'our': 1,
  'p': 1,
  'page': 1,
  'parse': 2,
  'starting': 2,
  'status': 2,
  'status_code': 1,
  'success,': 1,
  'successfully.': 1,
  'tag.': 1,
  'text': 1,
  'that': 1,
  'the': 6,
  'this': 1,
  'to': 3,
  'use': 1,
  'with': 2,
  "won't": 1}}

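The word counts above are an instance of the bag-of-words model (https://en.wikipedia.org/wiki/Bag-of-words_model), which represents a text by how often each word occurs and ignores word order. Take two example documents:

1. John likes to watch movies. Mary likes movies too.
2. John also likes to watch football games.

Counting the words in each document gives one bag of words per document:

1. BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
2. BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

With the combined vocabulary ordered as John, likes, to, watch, movies, Mary, too, also, football, games, each document becomes a vector of counts:

1. [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
2. [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]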
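
The two example documents can be turned into these count vectors with a few lines of standard-library Python. This is a minimal sketch: the regular-expression tokenizer and the first-appearance vocabulary order are choices made here for illustration, not part of the model itself.

```python
import re
from collections import Counter

documents = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


# One bag (word -> count) per document
bags = [Counter(tokenize(doc)) for doc in documents]

# Shared vocabulary, in order of first appearance across the documents
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# A Counter returns 0 for words it has not seen
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```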

In [20]:
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')

Signature: requests.get(url, params=None, **kwargs)
Docstring:
Sends a GET request.

:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
File:      c:\programdata\anaconda3\lib\site-packages\requests\api.py
Type:      function

In [21]:
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Items 1 to 20 -- Example Page 1</title>
<script type="text/javascript">
      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-23648880-1']);
      _gaq.push(['_trackPageview']);
      _gaq.push(['_setDomainName', 'econpy.org']);
    </script>
</head>
<body>
<div align="center">1, <a href="http://econpy.pythonanywhere.com/ex/002.html">[<font color="green">2</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/003.html">[<font color="green">3</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/004.html">[<font color="green">4</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/005.html">[<font color="green">5</font>]</a></div>
<div title="buyer-info">
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span><br/>
</div>
<div title="buyer-info">
<div title="buyer-name">Earl E. Byrd</div>
<span class="item-price">$8.37</span><br/>
</div>
<div title="buyer-info">
<div title="bu

**Signature:** `html.fromstring(html, base_url=None, parser=None, **kw)`

Docstring:

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it
is a fragment or a document.

base_url will set the document's base_url attribute (and the tree's docinfo.URL)

File:      c:\programdata\anaconda3\lib\site-packages\lxml\html\__init__.py

Type:      function

In [22]:
tree = html.fromstring(page.content)

<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>

In [23]:
# Build a list of buyer names
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# Build a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

In [24]:
print('Buyers: ', buyers)
print('Prices: ', prices)

Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']
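
The prices come back as strings such as `'$29.95'`, so they need converting before any arithmetic. A small sketch, using three of the buyer and price values above and a hypothetical `parse_price` helper, pairs each buyer with a numeric price. `Decimal` is used here to avoid floating-point rounding on currency amounts; that is a design choice for this sketch, not something the notebook itself does.

```python
from decimal import Decimal

# Three of the scraped values, copied from the output above
buyers = ["Carson Busses", "Earl E. Byrd", "Rose Tattoo"]
prices = ["$29.95", "$8.37", "$114.07"]


def parse_price(text):
    """Convert a string like '$1,234.50' into a Decimal."""
    return Decimal(text.strip().lstrip("$").replace(",", ""))


# zip pairs items by position, so both lists must be in the same order
price_by_buyer = {buyer: parse_price(price) for buyer, price in zip(buyers, prices)}
total = sum(price_by_buyer.values())

print(price_by_buyer)
print(total)
```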


In [25]:
df = pd.DataFrame({"Buyers": buyers,
     "Prices": prices
    })
df.head(10)

|   | Buyers | Prices |
|---|---|---|
| 0 | Carson Busses | $29.95 |
| 1 | Earl E. Byrd | $8.37 |
| 2 | Patty Cakes | $15.26 |
| 3 | Derri Anne Connecticut | $19.25 |
| 4 | Moe Dess | $19.25 |
| 5 | Leda Doggslife | $13.99 |
| 6 | Dan Druff | $31.57 |
| 7 | Al Fresco | $8.49 |
| 8 | Ido Hoe | $14.47 |
| 9 | Howie Kisses | $15.86 |


In [26]:
df.to_csv("outputCSV.csv", sep=',')
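
By default `to_csv` also writes the row index as an extra first column; passing `index=False` omits it. For comparison, a dependency-free sketch using the standard-library `csv` module writes just the two data columns. It writes to an in-memory buffer and uses two sample rows, both assumptions made here to keep the example self-contained.

```python
import csv
import io

# Two sample rows taken from the scraped data
rows = [("Carson Busses", "$29.95"), ("Earl E. Byrd", "$8.37")]

# Write to an in-memory buffer; a real script would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Buyers", "Prices"])  # header row
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```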

In [27]:
import requests

page = requests.get("http://www.bloomberg.com/quote/SPX:IND")
page

<Response [200]>

In [28]:
page.status_code

200
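
A status code starting with 2 generally indicates success, while one starting with 4 or 5 indicates an error. As a rough sketch, the standard library's `http.HTTPStatus` can supply the official phrase for a code. `describe_status` is a hypothetical helper, and the category labels are wording chosen here.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found: client error'."""
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not in the standard library's table
    return f"{code} {phrase}: {category}"


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which saves checking the code by hand.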

In [29]:
page.content

b'<!DOCTYPE html>\n<html xmlns:og="http://ogp.me/ns#" data-view-uid="0"><head>\n<base href=\'https://www.bloomberg.com/\'><script src="https://assets.bwbx.io/markets/public/javascripts/ab_test-2eba8ff0e9.js" data-view-uid="0_5"></script> <meta charset="utf-8"> <title>SPX Quote - S&amp;P 500 Index - Bloomberg Markets</title><meta http-equiv="X-UA-Compatible" content="IE=edge"><script type="text/javascript">!function(e,t){function u(){return t.getElementById("bb-nav")}function M(){return document.querySelector(".bb-unsupported-message__text")}function L(){var e=document.getElementById("bb_unsupported_custom_message");if(e)return JSON.parse(e.innerHTML).message}function o(){var o=\'body.bb-unsupported-browser .bb-nav-placeholder{height:auto}body.bb-unsupported-browser #bb-that,body.bb-unsupported-browser .bb-nav-root{display:none;height:auto}body.bb-unsupported-browser .bb-nav-root{display:block}.bb-unsupported-message{background-color:#000;padding:20px 0}@media screen and (max-width: 63.

In [30]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [31]:
print(soup.prettify())

<!DOCTYPE html>
<html data-view-uid="0" xmlns:og="http://ogp.me/ns#">
 <head>
  <base href="https://www.bloomberg.com/"/>
  <script data-view-uid="0_5" src="https://assets.bwbx.io/markets/public/javascripts/ab_test-2eba8ff0e9.js">
  </script>
  <meta charset="utf-8"/>
  <title>
   SPX Quote - S&amp;P 500 Index - Bloomberg Markets
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <script type="text/javascript">
   !function(e,t){function u(){return t.getElementById("bb-nav")}function M(){return document.querySelector(".bb-unsupported-message__text")}function L(){var e=document.getElementById("bb_unsupported_custom_message");if(e)return JSON.parse(e.innerHTML).message}function o(){var o='body.bb-unsupported-browser .bb-nav-placeholder{height:auto}body.bb-unsupported-browser #bb-that,body.bb-unsupported-browser .bb-nav-root{display:none;height:auto}body.bb-unsupported-browser .bb-nav-root{display:block}.bb-unsupported-message{background-color:#000;padding:20px 0}@media

In [32]:
list(soup.children)

['html', '\n', <html data-view-uid="0" xmlns:og="http://ogp.me/ns#"><head>
 <base href="https://www.bloomberg.com/"/><script data-view-uid="0_5" src="https://assets.bwbx.io/markets/public/javascripts/ab_test-2eba8ff0e9.js"></script> <meta charset="utf-8"/> <title>SPX Quote - S&amp;P 500 Index - Bloomberg Markets</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><script type="text/javascript">!function(e,t){function u(){return t.getElementById("bb-nav")}function M(){return document.querySelector(".bb-unsupported-message__text")}function L(){var e=document.getElementById("bb_unsupported_custom_message");if(e)return JSON.parse(e.innerHTML).message}function o(){var o='body.bb-unsupported-browser .bb-nav-placeholder{height:auto}body.bb-unsupported-browser #bb-that,body.bb-unsupported-browser .bb-nav-root{display:none;height:auto}body.bb-unsupported-browser .bb-nav-root{display:block}.bb-unsupported-message{background-color:#000;padding:20px 0}@media screen and (max-width: 63.6875

In [33]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [34]:
# Note: this name shadows the lxml `html` module imported at the top of the notebook
html = list(soup.children)[2]

In [35]:
list(html.children)

[<head>
 <base href="https://www.bloomberg.com/"/><script data-view-uid="0_5" src="https://assets.bwbx.io/markets/public/javascripts/ab_test-2eba8ff0e9.js"></script> <meta charset="utf-8"/> <title>SPX Quote - S&amp;P 500 Index - Bloomberg Markets</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><script type="text/javascript">!function(e,t){function u(){return t.getElementById("bb-nav")}function M(){return document.querySelector(".bb-unsupported-message__text")}function L(){var e=document.getElementById("bb_unsupported_custom_message");if(e)return JSON.parse(e.innerHTML).message}function o(){var o='body.bb-unsupported-browser .bb-nav-placeholder{height:auto}body.bb-unsupported-browser #bb-that,body.bb-unsupported-browser .bb-nav-root{display:none;height:auto}body.bb-unsupported-browser .bb-nav-root{display:block}.bb-unsupported-message{background-color:#000;padding:20px 0}@media screen and (max-width: 63.6875rem){.bb-unsupported-message__logo-box{padding-bottom:10px}.bb-unsu

In [36]:
body = list(html.children)[2]
body

<body class="default-layout markets-section-front" data-tracker-events="click,pageview"> <!-- Google Tag Manager (noscript) --> <noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-MNTH5N" style="display:none;visibility:hidden" width="0"></iframe></noscript> <!-- End Google Tag Manager (noscript) --><script src="https://assets.bwbx.io/markets/public/javascripts/critical-21892c9095.js" type="text/javascript"></script><div class="header-ad"><div class="leaderboard fixed-height" data-view-uid="0_3"><div class="ad on-large-desktop on-small-desktop on-tablet constrained-width" data-position="leaderboard1" id="0_3"> <script type="text/javascript"> (function(w, d) { var B = "bpop";  var S = "cmpid";  var T = "test";  var t = {"currentResource":"Company|SPX:IND","ticker":"SPX-IND","page":"market quote","url":"/quote/SPX:IND","position":"leaderboard1"}; var q = w.__bloomberg__.query; [B, S, T].forEach(function(k) { var v = q.getValue(k); t[k] = v ? v : null; }); var

In [37]:
list(body.children)

[' ',
 ' Google Tag Manager (noscript) ',
 ' ',
 <noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-MNTH5N" style="display:none;visibility:hidden" width="0"></iframe></noscript>,
 ' ',
 ' End Google Tag Manager (noscript) ',
 <script src="https://assets.bwbx.io/markets/public/javascripts/critical-21892c9095.js" type="text/javascript"></script>,
 <div class="header-ad"><div class="leaderboard fixed-height" data-view-uid="0_3"><div class="ad on-large-desktop on-small-desktop on-tablet constrained-width" data-position="leaderboard1" id="0_3"> <script type="text/javascript"> (function(w, d) { var B = "bpop";  var S = "cmpid";  var T = "test";  var t = {"currentResource":"Company|SPX:IND","ticker":"SPX-IND","page":"market quote","url":"/quote/SPX:IND","position":"leaderboard1"}; var q = w.__bloomberg__.query; [B, S, T].forEach(function(k) { var v = q.getValue(k); t[k] = v ? v : null; }); var el = d.getElementById("0_3"); var ad = w.__bloomberg__.ads.createAd({

In [38]:
p = list(body.children)[9]
p

<div class="container"><main class="page-content" id="content" lang="en"><div data-view-uid="1|0_6"><div class="quote-page module"> <div class="basic-quote"> <div data-view-uid="1|0_6_1"><div data-view-uid="1|0_6_1_1"><div class="watchlist-notification"> <span class="watchlist-notification_message">Error: Could not add to watchlist.</span> <span class="watchlist-notification_close">X</span> </div> <div class="watchlist"> + Watchlist </div></div><h1 class="name"> S&amp;P 500 Index </h1><div class="ticker-container"> <div class="ticker"> SPX:IND </div> <div class="exchange"> </div> </div><div class="market-status-container"> <div class="market-status "> </div> <div class="market-status-message "> </div> </div><div class="price-container down"> <div class="arrow"></div><!-- no spaces --><div class="price">2,656.30</div><!-- no spaces --> <div class="change-container"> <div> 7.69 </div> <div> 0.29% </div> </div> </div> <div class="price-datetime"> As of 4/13/2018 </div><div class="mobile-b

In [39]:
p1 = list(p.children)[0]
p1

<main class="page-content" id="content" lang="en"><div data-view-uid="1|0_6"><div class="quote-page module"> <div class="basic-quote"> <div data-view-uid="1|0_6_1"><div data-view-uid="1|0_6_1_1"><div class="watchlist-notification"> <span class="watchlist-notification_message">Error: Could not add to watchlist.</span> <span class="watchlist-notification_close">X</span> </div> <div class="watchlist"> + Watchlist </div></div><h1 class="name"> S&amp;P 500 Index </h1><div class="ticker-container"> <div class="ticker"> SPX:IND </div> <div class="exchange"> </div> </div><div class="market-status-container"> <div class="market-status "> </div> <div class="market-status-message "> </div> </div><div class="price-container down"> <div class="arrow"></div><!-- no spaces --><div class="price">2,656.30</div><!-- no spaces --> <div class="change-container"> <div> 7.69 </div> <div> 0.29% </div> </div> </div> <div class="price-datetime"> As of 4/13/2018 </div><div class="mobile-basic-data"> <div class=

In [40]:
p2 = list(p1.children)[0]
p2

<div data-view-uid="1|0_6"><div class="quote-page module"> <div class="basic-quote"> <div data-view-uid="1|0_6_1"><div data-view-uid="1|0_6_1_1"><div class="watchlist-notification"> <span class="watchlist-notification_message">Error: Could not add to watchlist.</span> <span class="watchlist-notification_close">X</span> </div> <div class="watchlist"> + Watchlist </div></div><h1 class="name"> S&amp;P 500 Index </h1><div class="ticker-container"> <div class="ticker"> SPX:IND </div> <div class="exchange"> </div> </div><div class="market-status-container"> <div class="market-status "> </div> <div class="market-status-message "> </div> </div><div class="price-container down"> <div class="arrow"></div><!-- no spaces --><div class="price">2,656.30</div><!-- no spaces --> <div class="change-container"> <div> 7.69 </div> <div> 0.29% </div> </div> </div> <div class="price-datetime"> As of 4/13/2018 </div><div class="mobile-basic-data"> <div class="data-table data-table_basic"> <!-- no spaces --><

In [41]:
p3 = list(p2.children)[0]
p3

<div class="quote-page module"> <div class="basic-quote"> <div data-view-uid="1|0_6_1"><div data-view-uid="1|0_6_1_1"><div class="watchlist-notification"> <span class="watchlist-notification_message">Error: Could not add to watchlist.</span> <span class="watchlist-notification_close">X</span> </div> <div class="watchlist"> + Watchlist </div></div><h1 class="name"> S&amp;P 500 Index </h1><div class="ticker-container"> <div class="ticker"> SPX:IND </div> <div class="exchange"> </div> </div><div class="market-status-container"> <div class="market-status "> </div> <div class="market-status-message "> </div> </div><div class="price-container down"> <div class="arrow"></div><!-- no spaces --><div class="price">2,656.30</div><!-- no spaces --> <div class="change-container"> <div> 7.69 </div> <div> 0.29% </div> </div> </div> <div class="price-datetime"> As of 4/13/2018 </div><div class="mobile-basic-data"> <div class="data-table data-table_basic"> <!-- no spaces --><div class="cell "> <div cla

In [42]:
p4 = list(p3.children)[18]
p4

<div class="news show" data-tracker-category="recirc" data-tracker-events="click"> <div class="index_news" data-view-uid="1|0_6_5"><div class="nav"> <h3 class="news__header active" data-nav="company-news"><a>Market News</a></h3> </div><div class="news__state active" data-group="company-news" data-tracker-label="company_news"> <article class="news-story"> <time class="news-story__published-at" datetime="2018-04-13T23:15:07.461Z"> 4/13/2018 </time> <div class="news-story__headline"> <a class="news-story__url" data-resource-id="P6PXJI6JIJW401" data-resource-type="article" data-tracker-action="click" data-tracker-label="bloomberg" href="https://www.bloomberg.com/news/articles/2018-04-13/zuckerberg-s-u-s-pilgrimage-highlights-perks-of-being-the-boss">Zuckerberg's $1.5 Million Worth of Private-Plane Trips—and Other Perks of Being the Boss</a> </div> </article> <article class="news-story"> <time class="news-story__published-at" datetime="2018-04-13T21:37:14.185Z"> 4/13/2018 </time> <div class

In [43]:
p5 = list(p4.children)[1]
p5

<div class="index_news" data-view-uid="1|0_6_5"><div class="nav"> <h3 class="news__header active" data-nav="company-news"><a>Market News</a></h3> </div><div class="news__state active" data-group="company-news" data-tracker-label="company_news"> <article class="news-story"> <time class="news-story__published-at" datetime="2018-04-13T23:15:07.461Z"> 4/13/2018 </time> <div class="news-story__headline"> <a class="news-story__url" data-resource-id="P6PXJI6JIJW401" data-resource-type="article" data-tracker-action="click" data-tracker-label="bloomberg" href="https://www.bloomberg.com/news/articles/2018-04-13/zuckerberg-s-u-s-pilgrimage-highlights-perks-of-being-the-boss">Zuckerberg's $1.5 Million Worth of Private-Plane Trips—and Other Perks of Being the Boss</a> </div> </article> <article class="news-story"> <time class="news-story__published-at" datetime="2018-04-13T21:37:14.185Z"> 4/13/2018 </time> <div class="news-story__headline"> <a class="news-story__url" data-resource-id="P758226K50XS0

In [44]:
p6 = list(p5.children)[1]
p6

<div class="news__state active" data-group="company-news" data-tracker-label="company_news"> <article class="news-story"> <time class="news-story__published-at" datetime="2018-04-13T23:15:07.461Z"> 4/13/2018 </time> <div class="news-story__headline"> <a class="news-story__url" data-resource-id="P6PXJI6JIJW401" data-resource-type="article" data-tracker-action="click" data-tracker-label="bloomberg" href="https://www.bloomberg.com/news/articles/2018-04-13/zuckerberg-s-u-s-pilgrimage-highlights-perks-of-being-the-boss">Zuckerberg's $1.5 Million Worth of Private-Plane Trips—and Other Perks of Being the Boss</a> </div> </article> <article class="news-story"> <time class="news-story__published-at" datetime="2018-04-13T21:37:14.185Z"> 4/13/2018 </time> <div class="news-story__headline"> <a class="news-story__url" data-resource-id="P758226K50XS01" data-resource-type="article" data-tracker-action="click" data-tracker-label="bloomberg" href="https://www.bloomberg.com/news/videos/2018-04-13/s-p-50

In [45]:
p7 = list(p6.children)[1]

print(p7.prettify())

<article class="news-story">
 <time class="news-story__published-at" datetime="2018-04-13T23:15:07.461Z">
  4/13/2018
 </time>
 <div class="news-story__headline">
  <a class="news-story__url" data-resource-id="P6PXJI6JIJW401" data-resource-type="article" data-tracker-action="click" data-tracker-label="bloomberg" href="https://www.bloomberg.com/news/articles/2018-04-13/zuckerberg-s-u-s-pilgrimage-highlights-perks-of-being-the-boss">
   Zuckerberg's $1.5 Million Worth of Private-Plane Trips—and Other Perks of Being the Boss
  </a>
 </div>
</article>



In [46]:
p8 = list(p7.children)[3]
p_8 = list(p7.children)[1]
p8

<div class="news-story__headline"> <a class="news-story__url" data-resource-id="P6PXJI6JIJW401" data-resource-type="article" data-tracker-action="click" data-tracker-label="bloomberg" href="https://www.bloomberg.com/news/articles/2018-04-13/zuckerberg-s-u-s-pilgrimage-highlights-perks-of-being-the-boss">Zuckerberg's $1.5 Million Worth of Private-Plane Trips—and Other Perks of Being the Boss</a> </div>
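
Walking the tree with positional indexes such as `list(p3.children)[18]` breaks as soon as the page layout shifts. A more robust alternative is to search by tag and class name. The sketch below runs on a simplified stand-in for the page, using the class names from the output above; the headline text and link are placeholders, and the live page may no longer match this structure.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the quote page, using the class names seen above
snippet = """
<div class="price-container down"><div class="price">2,656.30</div></div>
<article class="news-story">
 <time class="news-story__published-at" datetime="2018-04-13T23:15:07.461Z"> 4/13/2018 </time>
 <div class="news-story__headline">
  <a class="news-story__url" href="https://www.bloomberg.com/">Placeholder headline</a>
 </div>
</article>
"""

soup = BeautifulSoup(snippet, "html.parser")

# find() returns the first matching tag; class_ matches one CSS class exactly
price = soup.find("div", class_="price").get_text(strip=True)

story = soup.find("article", class_="news-story")
published = story.find("time", class_="news-story__published-at").get_text(strip=True)

# select_one() takes a CSS selector
link = story.select_one("div.news-story__headline a")
headline = link.get_text(strip=True)
url = link["href"]

print(price, published, headline, url)
```

Each lookup names the element it wants, so unrelated changes elsewhere on the page do not shift the result.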

In [47]:
print(p_8.get_text())
p8.get_text()

 4/13/2018 


" Zuckerberg's $1.5 Million Worth of Private-Plane Trips—and Other Perks of Being the Boss "