# Web Scraping Basics Using Python

In this tutorial we will look at the basic operations one can perform when scraping a web page with Beautiful Soup.

In [2]:
import bs4 as bs
import urllib.request

A common naming convention for the raw page source and the parsed document is the following

In [3]:
sauce = urllib.request.urlopen('https://dacatay.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)

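The request above can hang or fail when the host does not respond. As a minimal sketch, assuming that returning `None` on failure is acceptable for a tutorial, the fetch can be wrapped in a small helper with a timeout. The name `fetch_html` and the 10-second default are illustrative choices, not part of the original notebook.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw bytes behind `url`, or None if the request fails.

    `fetch_html` is a hypothetical helper; the timeout default is an assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        print(f"Could not fetch {url}: {err}")
        return None
```

A call such as `sauce = fetch_html('https://dacatay.com')` then yields either the page source or `None`, so the parsing step can be skipped when the site is unreachable.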

If you are not used to reading HTML, which is normally organized using tags and whitespace, the raw output of <code>print(soup)</code> might look a bit strange. We can, however, restore some order to this mess by using the <code>prettify</code> method

In [1]:
print(soup.prettify())
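To see what <code>prettify</code> does without depending on a live website, here is a small sketch that parses an HTML string directly. The HTML snippet is a shortened version of the example file used in this tutorial, and the built-in `html.parser` is used instead of `lxml` only so that the example needs no extra parser installed.

```python
import bs4 as bs

# One-line HTML, as a server might deliver it
html = "<html><body><h1>This is a header</h1><p>This is a paragraph.</p></body></html>"

soup = bs.BeautifulSoup(html, "html.parser")

# prettify() is a method of the soup object; it returns a string
# with one tag per line and nested tags indented.
pretty = soup.prettify()
print(pretty)
```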

In [None]:
# Search for specific tag e.g.
# soup.p
# soup.a
# soup.nav
# soup.body

# For example, search for all links that are present on the page
#for url in soup.find_all('a'):
#    print(url.get('href'))

The example below displays a very basic HTML file. 

In [None]:
<!DOCTYPE html>
<html>
<body>

<h1>This is a header</h1>

<p>This is a paragraph.</p>
<p>This is another paragraph.</p>

<a href="https://dacatay.com/uncategorized/web-scraping-with-python/">This is a link to this article.</a>

<p>This is a link to an Einstein picture</p>
<img src="https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg" width="104" height="142">

<p>This is an example table we will scrape</p>

</body>
</html>
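The file above can be parsed straight from a string, which makes it easy to try out tag access offline. The following is a minimal sketch; the variable names are illustrative and `html.parser` is an assumed parser choice to avoid the `lxml` dependency.

```python
import bs4 as bs

html = """<!DOCTYPE html>
<html>
<body>
<h1>This is a header</h1>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="https://dacatay.com/uncategorized/web-scraping-with-python/">This is a link to this article.</a>
<img src="https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg" width="104" height="142">
</body>
</html>"""

soup = bs.BeautifulSoup(html, "html.parser")

heading = soup.h1.text          # text of the first <h1>
first_paragraph = soup.p.text   # attribute access returns only the first <p>
image_url = soup.img["src"]     # attributes are read like dictionary keys
missing = soup.nav              # None, because this file has no <nav> tag

print(heading)
print(first_paragraph)
print(image_url)
print(missing)
```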

The "Twenty Sixteen" WordPress theme that I am using places the content under the home tab in a <code>div</code> tag with the attribute <code>class="entry-content"</code>.

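Attribute access such as <code>soup.a</code> returns only the first matching tag, so the following prints the first link on the page

In [None]:
first_link = soup.a
print(first_link)

To extract all links on the page, we use <code>find_all</code>, which returns a list of every matching tag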

In [5]:
urls = soup.find_all('a')
print(urls)

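The difference between attribute access and <code>find_all</code> is easy to check on a small document. In this sketch the second link (to `example.com`) is made up for illustration, and `html.parser` is an assumed parser choice.

```python
import bs4 as bs

html = """
<body>
<a href="https://dacatay.com/uncategorized/web-scraping-with-python/">This is a link to this article.</a>
<a href="https://example.com/second">A second, made-up link.</a>
</body>
"""

soup = bs.BeautifulSoup(html, "html.parser")

first_link = soup.a              # a single Tag: the first <a> only
all_links = soup.find_all("a")   # a list-like result with every <a>

print(first_link)
print(len(all_links))
```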

An <code>a</code> tag is usually accompanied by an <code>href</code> attribute. We can retrieve this attribute for every element in our <code>urls</code> list using the <code>get</code> method

In [None]:
for url in urls:
    print(url.get('href'))
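Not every <code>a</code> tag carries an <code>href</code>, in which case <code>get</code> returns `None`. The sketch below, on a made-up two-link snippet, shows that behaviour and one way to keep only tags that actually have the attribute.

```python
import bs4 as bs

# Made-up snippet: the second <a> is a named anchor without href
html = '<a href="https://dacatay.com">Home</a><a name="top">Anchor without href</a>'
soup = bs.BeautifulSoup(html, "html.parser")

# get() returns None when the attribute is missing
hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True restricts the search to tags that define the attribute
valid = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs)
print(valid)
```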

For another example, suppose that we are looking for all hyperlinks that are present in the navigation bar. In this case, we would make use of the <code>nav</code> tag and do the following

In [None]:
nav = soup.nav

for url in nav.find_all('a'):
    print(url.get('href'))
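If a page has no <code>nav</code> tag, <code>soup.nav</code> is `None` and calling <code>find_all</code> on it raises an `AttributeError`. A minimal guard could look like this; the helper name `nav_links` and both HTML snippets are hypothetical.

```python
import bs4 as bs


def nav_links(soup):
    """Return the href of every link inside the first <nav>, or [] if there is none."""
    nav = soup.nav
    if nav is None:
        return []
    return [a.get("href") for a in nav.find_all("a")]


with_nav = bs.BeautifulSoup(
    '<nav><a href="/home">Home</a><a href="/about">About</a></nav>'
    '<p><a href="/post">Post</a></p>',
    "html.parser",
)
without_nav = bs.BeautifulSoup('<p><a href="/post">Post</a></p>', "html.parser")

print(nav_links(with_nav))     # only the links inside <nav>
print(nav_links(without_nav))  # empty list instead of an error
```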

Similarly, to retrieve all paragraphs in the complete body of the HTML document, we would do this

In [None]:
body = soup.body
for p in body.find_all('p'):
    print(p.text)

This next one is a little more specific. Some web pages do not just enter their contents using the basic paragraph <code>p</code> tags. They define custom classes, which can make automated scraping harder. My webpage, for example, uses the "Twenty Sixteen" WordPress theme template, which comes with such special <code>div</code> tag classes defined for content entry.

In [None]:
for div in soup.find_all('div', class_='entry-content'):
    print(div.text)
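Filtering by class can be tried on a small snippet. The HTML here is invented to mimic the <code>entry-content</code> class mentioned above; the `sidebar` class is purely illustrative.

```python
import bs4 as bs

# Invented markup with one content div and one unrelated div
html = (
    '<div class="entry-content"><p>First post text.</p></div>'
    '<div class="sidebar"><p>Widget</p></div>'
)
soup = bs.BeautifulSoup(html, "html.parser")

# class_ (with trailing underscore) is used because `class` is a Python keyword
texts = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="entry-content")
]
print(texts)
```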

## Tables and XML

As before, we can retrieve table elements wrapped in <code>table</code> tags from a webpage like so

In [None]:
table = soup.table          # shorthand: the first <table> in the document
table = soup.find('table')  # equivalent, more explicit form
print(table)

In [None]:
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
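Note that a header row built from <code>th</code> cells produces an empty list in the loop above, since only <code>td</code> cells are collected. The following sketch collects both cell types and separates the header from the data; the table contents are made-up sample values.

```python
import bs4 as bs

# Made-up sample table
html = """<table>
<tr><th>Item</th><th>Price</th></tr>
<tr><td>Apple</td><td>1.20</td></tr>
<tr><td>Pear</td><td>0.90</td></tr>
</table>"""

soup = bs.BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):
    # Passing a list matches either tag name, so header cells are kept too
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

header, *data = rows
print(header)
print(data)
```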

## Alternative: Use pandas to read a datatable from a webpage


In [None]:
import pandas as pd

In [None]:
# read_html returns a list of DataFrames, one per <table> found on the page
dfs = pd.read_html('https://dacatay.com')
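Because <code>pd.read_html</code> returns a list, an individual table has to be picked out by index. This sketch runs on an in-memory HTML string with made-up values rather than a live page, and it assumes a parser backend for pandas (such as `lxml`) is installed.

```python
import io

import pandas as pd

# Made-up sample table
html = """<table>
<tr><th>Item</th><th>Price</th></tr>
<tr><td>Apple</td><td>1.20</td></tr>
<tr><td>Pear</td><td>0.90</td></tr>
</table>"""

# Wrapping the string in StringIO lets read_html treat it like a file
tables = pd.read_html(io.StringIO(html))

df = tables[0]  # the first (and here only) table
print(df)
```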