
#### One useful package for web scraping in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the urllib.request module provides a function called urlopen() that opens a URL from within a program.


In [2]:
from urllib.request import urlopen

#### The web page that we’ll open is at the following URL:

In [3]:
url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)

urlopen() returns an HTTPResponse object:

In [4]:
page

<http.client.HTTPResponse at 0x7f466c352510>

To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:

In [5]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")

print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
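
The read-and-decode step can also be wrapped in a small helper. The sketch below is one possible approach and is not part of the original notebook: `fetch_html` and `default_encoding` are hypothetical names. It closes the connection with a `with` block and uses the character set declared in the response headers, falling back to UTF-8 when none is declared (an assumption that suits this page).

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Open a URL and return its body decoded to a string.

    fetch_html and default_encoding are illustrative names, not part of urllib.
    """
    # The response object is a context manager, so the connection is
    # closed automatically when the block exits.
    with urlopen(url) as page:
        html_bytes = page.read()
        # Use the charset from the Content-Type header if the server sent one.
        encoding = page.headers.get_content_charset() or default_encoding
    return html_bytes.decode(encoding)
```

Calling `fetch_html(url)` with the `url` defined earlier should return the same string as `html` above, provided the page is reachable.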



#### Extracting the title

In [6]:
title_index = html.find("<title>")
title_index

14

In [7]:
start_index = title_index + len("<title>")
start_index

21

In [8]:
end_index = html.find("</title>")
end_index

39

In [9]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'
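
The four steps above can be collected into a single function. This is a minimal sketch, assuming the page contains an exact `<title>` tag. The name `extract_title` is hypothetical, and `sample` is a shortened stand-in for the downloaded page. Since `.find()` returns -1 when the substring is absent, the sketch checks for that value instead of slicing blindly.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if a tag is missing."""
    open_tag = "<title>"
    title_index = html.find(open_tag)
    if title_index == -1:
        # .find() returns -1 when the substring does not occur.
        return None
    start_index = title_index + len(open_tag)
    # Search for the closing tag only after the opening tag.
    end_index = html.find("</title>", start_index)
    if end_index == -1:
        return None
    return html[start_index:end_index]


# Shortened stand-in for the Aphrodite page (an assumption for illustration).
sample = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"
print(extract_title(sample))
```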

#### Another title

In [10]:
url2 = "http://olympus.realpython.org/profiles/poseidon"
page2 = urlopen(url2)

html2 = page2.read().decode("utf-8")
print(html2)

<html>
<head>
<title >Profile: Poseidon</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/poseidon.jpg" />
<h2>Name: Poseidon</h2>
<br><br>
Favorite animal: Dolphin
<br><br>
Favorite color: Blue
<br><br>
Hometown: Sea
</center>
</body>
</html>



In [11]:
start = html2.find("<title>") + len("<title>")
end = html2.find("</title>")

title2 = html2[start:end]
title2

'\n<head>\n<title >Profile: Poseidon'
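
The result is wrong because this page's opening tag is written `<title >`, with a space before the `>`, so the exact string `"<title>"` never occurs. In that case `.find()` returns -1, and adding `len("<title>")` gives a start index of 6. The sketch below reproduces this on a shortened, hard-coded copy of the page (`html2_sample` is a stand-in, not the downloaded text):

```python
# Shortened stand-in for the Poseidon page (an assumption for illustration).
html2_sample = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>\n</html>"

index = html2_sample.find("<title>")
print(index)  # -1, because "<title>" is not in the string

start = index + len("<title>")
print(start)  # 6, an unrelated position just after "<html>"

end = html2_sample.find("</title>")
print(repr(html2_sample[start:end]))
```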

### Regular expressions

In [12]:
import re
re.findall("ab*c", "abcd")

['abc']

The regular expression "ab*c" matches any part of the string that begins with an "a", ends with a "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches. The string "abcd" contains the substring "abc", which matches this pattern, so it’s returned in the list.

In [13]:
print(re.findall("ab*c", "abcd"))
print(re.findall("ab*c", "acc"))
print(re.findall("ab*c", "abdc"))
re.findall("ab*c", "abcac")


['abc']
['ac']
[]


['abc', 'ac']

In [14]:
re.findall("ab*c", "ABCD", re.IGNORECASE)

['ABC']

You can use a period (.) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

In [15]:
print(re.findall("a.c", "abc"))
print(re.findall("a.c", "abbc"))
print(re.findall("a.c", "ac"))

['abc']
[]
[]


In [16]:
print(re.findall("a.*c", "abc"))
print(re.findall("a.*c", "abbc"))
print(re.findall("a.*c", "ac"))

['abc']
['abbc']
['ac']
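
In the pattern "a.*c", the `*` applies to the period, so `.*` stands for zero or more of any character. One detail that matters for HTML, which usually spans several lines, is that the period does not match a newline unless the `re.DOTALL` flag is passed. A short sketch, using a made-up two-line string:

```python
import re

text = "a\nc"

# By default, "." does not match a newline, so nothing is found.
print(re.findall("a.*c", text))

# With re.DOTALL, "." also matches newlines.
print(re.findall("a.*c", text, re.DOTALL))
```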


There’s one more function in the re module that’s useful for parsing out text. re.sub(), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text. It behaves like the .replace() string method, except that the text to replace is described by a regular expression rather than a literal string.

The arguments passed to re.sub() are the regular expression, followed by the replacement text, followed by the string. Here’s an example:


In [17]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*>", "ELEPHANTS", string)
string

'Everything is ELEPHANTS.'

In [18]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*?>", "ELEPHANTS", string)
string

"Everything is ELEPHANTS if it's in ELEPHANTS."
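
The two results differ because `.*` is greedy: it matches as much text as it can, so "<.*>" runs from the first `<` to the last `>` in the string. Adding `?` makes it non-greedy, so "<.*?>" stops at the nearest `>` and each tag is replaced separately.

Non-greedy patterns offer one way to handle the Poseidon title that `.find()` could not extract. The sketch below is an illustration, not the notebook's own code. It uses `re.search()`, another function in the re module that returns the first match as a match object, and runs on a shortened, hard-coded stand-in for the page (`html2_sample`).

```python
import re

# Shortened stand-in for the Poseidon page (an assumption for illustration).
html2_sample = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>\n</html>"

# Non-greedy pattern that tolerates extra characters inside the tags.
pattern = "<title.*?>.*?</title.*?>"
match = re.search(pattern, html2_sample, re.IGNORECASE)

if match:
    title_with_tags = match.group()
    # Strip the tags, leaving only the text between them.
    title2 = re.sub("<.*?>", "", title_with_tags)
    print(title2)
else:
    title2 = None
```

Because `.*?` also allows zero characters, the same pattern works for the exact `<title>` tag on the Aphrodite page.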