### SCRAPING

BEAUTIFULSOUP

- ability to get data from HTML and XML
- the easiest way to get data
- no JavaScript support
- inefficient
- good for beginners
- can create issues when migrating parts of code between projects due to library dependencies

SELENIUM

- wasn't designed for web scraping but for test automation of web pages
- works with JavaScript
- easier to learn than Scrapy
- slow
- good for small projects

SCRAPY

- written entirely in Python
- the most complete library for scraping
- harder to learn
- fast
- can store data in databases and create crawlers
- great for large projects when speed is priority

WRITING TO A FILE

In [None]:
with open('test.txt', 'w') as file:
    file.write('Data successfully scraped!')

FROM DICT TO PANDAS

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(my_dict)  # from_dict is a DataFrame classmethod

FROM PANDAS TO CSV

In [None]:
df.to_csv('states.csv')
df.to_csv('states.csv', index=False)
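The dict, DataFrame and CSV steps above can be chained into one round trip. This is a minimal sketch that assumes pandas is installed; `my_dict` holds made-up illustration data, and the CSV is written to an in-memory buffer instead of `states.csv` so nothing touches the disk.

```python
import io

import pandas as pd

# Hypothetical scraped data: one key per column.
my_dict = {"state": ["Texas", "Ohio"], "population": [29, 12]}

df = pd.DataFrame.from_dict(my_dict)

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row-number column
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Without `index=False` every row would start with an extra, unnamed index column.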

### BEAUTIFUL SOUP


- Beautiful Soup converts every incoming document to Unicode (the encoding is detected automatically when the BeautifulSoup object is instantiated)
- when Beautiful Soup writes a document out as bytes, the result is UTF-8 by default, irrespective of the encoding of the document passed to BeautifulSoup


HTML TAGS

`<nav>` navigational

- container tag for a group of navigational links
- not all links of a document should be inside a `<nav>` element; it's only for major blocks of navigation links
- user agents, such as screen readers for disabled users, can use this element to determine whether to omit the initial rendering of this content

```html
<nav>
<a href="/html/">HTML</a> |
<a href="/css/">CSS</a> |
<a href="/js/">JavaScript</a> |
<a href="/python/">Python</a>
</nav>
```

`<article>` article

- container tag for a collection of diverse, independent, self-contained content

```html
<article>
<h2>Google Chrome</h2>
<p>Google Chrome is a web browser developed by Google, released in 2008. Chrome is the world's most popular web browser today!</p>
</article>
```

- `<div>` divider
- `<head>`
- `<body>`
- `<header>`
- `<p>` paragraph
- `<h1>`, `<h2>`, `<h3>` headers
- `<button>`
- `<table>`
- `<tr>` table row elements
- `<td>` each element of table data
- `<ul>` unordered list
- `<li>` list item
- `<a>` anchor (for links), `<a href="/html/">HTML</a>`
- `<iframe>` embeds another page within a page, nested browsing
- `<br/>` line break

ATTRIBUTE

```html
class="main-article"
id="354"
```

ORDER OF FINDING ELEMENTS

1. ID
2. Class name
3. Tag name, CSS Selector
4. Xpath

XPATH

XML Path Language (XPath) is designed to support the query or transformation of XML documents.

- can be used to compute values (e.g. strings, numbers, booleans) from the content of an XML document
- support for XPath exists in applications that support XML, such as web browsers
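To make the idea of a path query concrete, here is a small sketch using Python's standard-library `xml.etree.ElementTree`. Note that ElementTree supports only a limited subset of XPath (it cannot compute values, for example), and the `<library>` document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for this example.
xml_doc = """
<library>
  <book id="1"><title>Dune</title></book>
  <book id="2"><title>Emma</title></book>
</library>
"""

root = ET.fromstring(xml_doc)

# Path expression: every <title> that sits inside a <book> child of the root.
titles = [t.text for t in root.findall("./book/title")]

# Predicate: only the <book> whose id attribute equals "2".
second_title = root.find("./book[@id='2']/title").text

print(titles, second_title)
```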


LXML

- parser of XML and HTML

BEAUTIFUL SOUP

A BeautifulSoup object represents the whole parsed document; it is created when we parse the content of a web resource.

In [3]:
import requests
from bs4 import BeautifulSoup
import lxml

#>>> soup.name
# '[document]'

FETCHING PAGES

In [None]:
result = requests.get('https://www.google.com')  # the URL needs a scheme such as https://
response = requests.get('https://subslikescript.com/movie/Titanic-120338')

VIEWING PAGE CONTENT

In [None]:
content = response.text  # HTML of subslikescript.com/movie/Titanic-120338

PARSING WEBSITE - CREATE SOUP

In [None]:
soup = BeautifulSoup(content, 'lxml') # lxml is a parser

FINDING ELEMENTS WITH BEAUTIFUL SOUP

- if find() can't find anything it returns None

In [None]:
soup.find(id='specific_id')
soup.find('tag', class_='class_name')  # e.g. soup.find('article', class_='main-article')
soup.find('h1')  # returns a single element (or None)
soup.find_all('h2')  # returns a list
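Because `find()` returns None when nothing matches, calling a method on its result can fail. The sketch below shows one way to guard against that; the HTML snippet is made up, and `html.parser` (the parser bundled with Python) is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment.
html = '<article class="main-article"><h1 id="top">Title</h1><h2>A</h2><h2>B</h2></article>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.find(id="top")
missing = soup.find("table")        # no <table> in the fragment, so this is None
subheadings = soup.find_all("h2")   # a list with two tags
no_tables = soup.find_all("table")  # an empty list, never None

# Guard before calling methods on the result of find().
heading_text = heading.get_text() if heading is not None else ""
missing_text = missing.get_text() if missing is not None else ""
print(heading_text, len(subheadings))
```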

PRETTIFY

- makes html code more readable
- just for reading

In [None]:
print(soup.prettify())

- you can specify the encoding of a print-out

In [None]:
print(soup.prettify("latin-1"))

FINDING SPECIFIC ELEMENTS IN THE PARSED OBJECT

- the arguments accepted by find() can also be used with find_all(), find_parents(), find_next_siblings() and find_previous_siblings()

In [None]:
box = soup.find('article', class_='main-article')
transcript = soup.find('div', class_='full-script').get_text()

GETTING THE TEXT OUT THE ELEMENT


In [None]:
title = soup.find('h1').get_text()

STRIPPING WHITE SPACE FROM BEGINNING AND END OF EACH BIT OF TEXT

In [None]:
soup.get_text(strip=True)

REPLACE A NEW LINE WITH A BLANK SPACE

In [None]:
soup.get_text(strip=True, separator=' ')
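A short comparison of the two calls above, on a made-up fragment parsed with the bundled `html.parser`. With `strip=True` alone the pieces of text are glued together; adding `separator=' '` puts a space between them.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra white space around the text.
html = "<div>\n  <p> Hello </p>\n  <p> world </p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

stripped = soup.get_text(strip=True)               # pieces stripped, then joined directly
spaced = soup.get_text(strip=True, separator=" ")  # pieces stripped, joined with a space
print(repr(stripped), repr(spaced))
```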

FIND ALL

- if find_all() can't find anything it returns an empty list

In [None]:
markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:'
                       '</pre></p><p><b>Java, Python, Cplusplus</b></p>', 'lxml')

print(markup.find_all('p'))

In [None]:
[<p>Top Three</p>, <p></p>, <p><b>Java, Python, Cplusplus</b></p>]

FIND ALL LINKS WITHIN AN ARTICLE TAG

In [None]:
links = my_article.find_all('a', href=True)
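As a sketch of what `href=True` does: it keeps only the `<a>` tags that actually have an `href` attribute. The markup and the name `my_article` below are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical article: the middle <a> is a named anchor without href.
html = '<article><a href="/a">A</a><a name="anchor">no link</a><a href="/b">B</a></article>'
soup = BeautifulSoup(html, "html.parser")

my_article = soup.find("article")
links = my_article.find_all("a", href=True)

# Read the attribute value from each matching tag.
hrefs = [link["href"] for link in links]
print(hrefs)
```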

SEARCHING WITH REGULAR EXPRESSIONS

In [None]:
import re
markup2 = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:'
                       '</pre></p><p><b>Java, Python, Cplusplus</b></p>', 'lxml')

print(markup2.find_all(re.compile('^p')))

In [None]:
[<p>Top Three</p>, <p></p>, <pre>Programming Languages are:</pre>,
 <p><b>Java, Python, Cplusplus</b></p>]

FINDING MULTIPLE ELEMENTS

In [None]:
print(markup2.find_all(['pre', 'b']))


In [None]:
[<pre>Programming Languages are:</pre>, <b>Java, Python, Cplusplus</b>]

FINDING ONLY TAGS (NO STRINGS)

In [None]:
print(markup2.find_all(True))

In [None]:
[<html><body><p>Top Three</p><p></p><pre>Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p></body></html>,
<body><p>Top Three</p><p></p><pre>Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p></body>,
<p>Top Three</p>,
<p></p>,
<pre>Programming Languages are:</pre>,
<p><b>Java, Python, Cplusplus</b></p>,
<b>Java, Python, Cplusplus</b>]

FINDING ONLY TAGS AND DISPLAYING ONLY TAG NAMES

In [None]:
for tag in markup2.find_all(True):
    print(tag.name)

html
body
p
p
pre
p
b

STOP SEARCHING AFTER YOU FIND THE FIRST ELEMENT OF INTEREST

In [None]:
soup.find_all('title',limit=1)

TAG OBJECT

CHECK THE NAME OF A TAG

In [None]:
tag.name

- the name can be changed, which changes the tag in the generated HTML

In [None]:
tag.name = 'Strong'

```html
<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>
```

ATTRIBUTES

- a tag object can have any number of attributes
- the tag `<b class='boldest'>` has an attribute `class` whose value is 'boldest'
- anything inside the opening tag other than the tag name is an attribute, and it is stored together with its value

In [None]:
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html.b
print(tag.attrs)

In [None]:
{'class': ['boldest']}

ASSIGN AN ATTRIBUTE WITH VALUE

- works for both reassignment of existing attributes and for creating new ones

In [None]:
tag['class'] = 'Online-Learning'
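Attributes behave much like dictionary entries, which the following sketch illustrates. The `id` value `intro` is an invented example, and `html.parser` is used to keep the snippet independent of lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', "html.parser")
tag = soup.b

tag["class"] = "Online-Learning"  # reassign an existing attribute
tag["id"] = "intro"               # create a new attribute (hypothetical value)
title = tag.get("title")          # .get() returns None for a missing attribute
del tag["id"]                     # remove the attribute again

print(tag)
```

Using `tag["title"]` instead of `tag.get("title")` would raise a KeyError, just as with a dict.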

MULTI-VALUED ATTRIBUTES

- returns a list

In [None]:
css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print(css_soup.p['class'])

In [None]:
['body', 'bold']

However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone.

In [None]:
id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print(id_soup.p['id'])

body bold

In [None]:
print(type(id_soup.p['id']))

`<class 'str'>`

GETTING A LIST OF ATTRIBUTES

In [None]:
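# assumes rel_soup was parsed earlier from markup that contains an <a> tag
rel_soup.a['rel'] = ['Index', 'Online library, its all free']
# get_attribute_list() always returns a list, even for a single-valued attribute
print(id_soup.p.get_attribute_list('id'))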

In [None]:
['body bold']

MULTI-VALUE ATTRIBUTES DON'T WORK WITH XML, IT'S JUST A STRING DEFINED WITHIN AN ATTRIBUTE

In [None]:
xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
print(xml_soup.p['class'])

In [None]:
'body bold'
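The same plain-string behaviour can also be requested for HTML by passing `multi_valued_attributes=None` to the constructor (available in recent Beautiful Soup 4 releases). The sketch below contrasts the two modes and uses `html.parser`, so it does not need the lxml XML parser.

```python
from bs4 import BeautifulSoup

markup = '<p class="body bold"></p>'

# Default HTML behaviour: class is split into a list of values.
as_list = BeautifulSoup(markup, "html.parser").p["class"]

# Multi-valued handling switched off: the attribute stays a single string.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
as_string = plain_soup.p["class"]

# get_attribute_list() wraps the value in a list either way.
wrapped = plain_soup.p.get_attribute_list("class")
print(as_list, as_string, wrapped)
```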

NAVIGABLESTRING OBJECT

- NavigableString object is used to represent the contents of a tag
- Are used to represent text within tags, rather than tags themselves

In [None]:
soup2 = BeautifulSoup('<h2 id="message">Hello, Tuto</h2>', 'lxml')
print(soup2.string)

In [None]:
'Hello, Tuto'

REPLACING THE STRING

In [None]:
soup2 = BeautifulSoup('<h2 id="message">Hello, Tuto</h2>', 'lxml')
soup2.string.replace_with('Online Learning')
print(soup2.string)

COMMENT OBJECT


- special type of NavigableString

In [None]:
soup3 = BeautifulSoup('<p><!-- Everything inside is a comment. --></p>', 'lxml')
comment = soup3.p.string
print(type(comment))

`<class 'bs4.element.Comment'>`
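Since Comment is a subclass of NavigableString, `isinstance` is a convenient way to tell comments apart from ordinary text. A minimal sketch, parsed with the bundled `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup3 = BeautifulSoup("<p><!-- Everything inside is a comment. --></p>", "html.parser")
comment = soup3.p.string

is_comment = isinstance(comment, Comment)
is_navigable_string = isinstance(comment, NavigableString)  # Comment inherits from it
comment_text = str(comment)  # the text between <!-- and -->

print(is_comment, is_navigable_string, repr(comment_text))
```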

NAVIGATING BY USING TAG NAME

- using a tag name as an attribute will give you only the first tag by that name

In [None]:
html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="link1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="link2">C</a>,
<p class="prog">Programming Languages</p>
"""

soup4 = BeautifulSoup(html_doc, 'html.parser')

print(soup4.a)

`<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>`

GET ALL TAGS OF THE SAME TYPE, USE FIND_ALL()

- returns a list of all tags with the specified name in the document

In [None]:
print(soup4.find_all('a'))

CONTENTS

- returns a list of items within a block of html code, even if it's only one item

In [None]:
head_tag = soup4.head
print(head_tag)

`<head><title>Tutorials Point</title></head>`

In [None]:
print(soup4.head.contents)

`[<title>Tutorials Point</title>]`

In [None]:
print(soup4.head.contents[0])

`<title>Tutorials Point</title>`

CALCULATE NUMBER OF ITEMS OF CONTENTS

In [None]:
len(soup4.body.contents)

CHILDREN

- returns the same as .contents but as a generator

In [None]:
for c in soup4.body.children:
    print(c)

for c in soup4.body.p.children:
    print(c)

DESCENDANTS

- it allows you to reach children and children of children, as far down as you like
- it's also a generator

CALCULATE THE NUMBER OF CHILDREN AND DESCENDANTS

In [None]:
len(list(soup.children))
len(list(soup.descendants))
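The difference between the two counts is easier to see on a tiny tree. In this made-up fragment the `<div>` has two direct children, but six descendants once nested tags and text pieces are included.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment without white space between tags.
soup = BeautifulSoup("<div><p>One <b>two</b></p><p>Three</p></div>", "html.parser")
div = soup.div

# Direct children only: the two <p> tags.
n_children = len(list(div.children))

# Everything below <div>: <p>, 'One ', <b>, 'two', <p>, 'Three'.
n_descendants = len(list(div.descendants))

print(n_children, n_descendants)
```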

STRING

- if a tag has only one child and that child is a NavigableString, the child is made available as .string
- if a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child
- however, if a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None


In [None]:
print(soup4.body.p.string)
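The three rules for `.string` can be checked on three small, invented fragments (parsed with the bundled `html.parser`):

```python
from bs4 import BeautifulSoup


def p_string(markup):
    """Parse a fragment and return the .string of its first <p> tag."""
    return BeautifulSoup(markup, "html.parser").p.string


single = p_string("<p>text</p>")            # one child, a NavigableString
nested = p_string("<p><b>text</b></p>")     # only child is a tag that has a .string
mixed = p_string("<p>some <b>text</b></p>") # two children, so .string is None

print(single, nested, mixed)
```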

STRINGS

- returns a generator

In [None]:
for string in soup4.strings:
    print(string)

STRIPPED STRINGS

- removes extra white space

In [None]:
for string in soup4.stripped_strings:
    print(string)

PARENT

- top parent is a BeautifulSoup object

In [None]:
Ttag = soup4.title
print(Ttag)

`<title>Tutorials Point</title>`

In [None]:
print(Ttag.parent)

`<head><title>Tutorials Point</title></head>`

- the parent of a top-level tag like `<html>` is the BeautifulSoup object itself
- the parent of the BeautifulSoup object is None

In [None]:
htmlatag = soup4.html
print(type(htmlatag.parent))

`<class 'bs4.BeautifulSoup'>`

PARENTS

- to iterate over all parents, use .parents

In [None]:
link = soup4.a

for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
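As a quick illustration of the chain that `.parents` walks, the sketch below collects the names of all ancestors of a link in a made-up document. The last entry, `[document]`, is the name of the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = '<html><body><div><a href="#">x</a></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
ancestor_names = [parent.name for parent in link.parents]
print(ancestor_names)
```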

SIBLING

- next_sibling and previous_sibling navigate between page elements that are on the same level of the parse tree
- a newline '\n' between tags is also a sibling, of type NavigableString
- to leave newline strings out of the list of siblings, use the .find_previous_siblings() method (or .find_next_siblings())

In [None]:
print(sibling_soup.b.next_sibling)
print(sibling_soup.b.previous_sibling)


PREVIOUS SIBLINGS

In [None]:
p_bob = soup5.find(id="bob")
for s in p_bob.previous_siblings:
    print(repr(s))

```
'\n'
<p>Alex</p>
'\n'
```

In [None]:
for s in p_bob.previous_siblings:
    print(type(s))

`<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>`

NEXT SIBLINGS

In [None]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

FIND PREVIOUS SIBLINGS

- returns actual siblings (tags) only, drops '\n'

In [None]:
for s in p_bob.find_previous_siblings():
    print(s)


`<p>Alex</p>`
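The sibling examples above use a `soup5` object that is not defined in these notes. A self-contained sketch, assuming markup along the following lines, reproduces the same behaviour: `.previous_siblings` includes the newline strings, while `find_previous_siblings()` returns tags only.

```python
from bs4 import BeautifulSoup

# Assumed markup; the original soup5 document is not shown in the notes.
html = '<div>\n<p>Alex</p>\n<p id="bob">Bob</p>\n</div>'
soup5 = BeautifulSoup(html, "html.parser")

p_bob = soup5.find(id="bob")

# Every previous sibling, nearest first, newline strings included.
all_previous = [str(s) for s in p_bob.previous_siblings]

# Tags only.
tags_previous = [str(s) for s in p_bob.find_previous_siblings()]

print(all_previous, tags_previous)
```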

ELEMENTS (.next_element and .previous_element)

- .next_element and .previous_element attributes
- they step through every element in the parse tree in document order (not just elements of the same type), so they return Tags, NavigableStrings and Comments

In [None]:
my_html = """
<p>Alex</p>
<p>Bob</p>
"""
soup = BeautifulSoup(my_html, 'lxml')

p_alex = soup.find("p")
print(p_alex.next_element)


In [None]:
'Alex'

In [None]:
p_alex = soup.find("p")
print(repr(p_alex.next_element.next_element))

In [None]:
'\n'

In [None]:
p_alex = soup.find("p")
print(p_alex.next_element.next_element.next_element)

`<p>Bob</p>`

NEXT ELEMENTS AND PREVIOUS ELEMENTS

- works the same as .next_element/.previous_element but returns a generator
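A small sketch of the generator form: starting from the first `<p>`, `.next_elements` yields everything that comes after its opening tag in document order. The fragment has no white space between tags, so no newline strings appear.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Alex</p><p>Bob</p>", "html.parser")
p_alex = soup.find("p")

# The text of the first tag, then the second tag, then the second tag's text.
following = [str(element) for element in p_alex.next_elements]
print(following)
```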

CSS SELECTORS

- library supports the most commonly-used CSS selectors
- this API is useful for people who know CSS syntax, but all these operations can be done with the regular bs4 API

<https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors>

SELECT METHOD

- you can search for elements using CSS selectors with the .select() method
- select_one() does the same as .select() but returns only the first matching object


In [None]:
soup.select('title')

In [None]:
soup.select("div p")

In [None]:
[<p id="alex">Alex</p>, <p>Bob</p>, <p id="cathy">Cathy</p>]

In [None]:
soup.select("p:nth-of-type(1)")  # CSS pseudo-class: matches elements by their position among siblings of the same type (tag name); here, every p that is the first p within its parent

soup.select("li:nth-of-type(10)")  # every li that is the 10th li within its parent

soup.select("head > title")  # title tags that are direct children of the head tag
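The selectors above can be tried on a small, invented document. This sketch assumes a Beautiful Soup 4 installation with CSS selector support (the `soupsieve` package that normally ships with it).

```python
from bs4 import BeautifulSoup

# Hypothetical document.
html = (
    "<html><head><title>Names</title></head>"
    '<body><div><p id="alex">Alex</p><p>Bob</p></div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

in_div = [p.get_text() for p in soup.select("div p")]  # descendant selector
first_p = soup.select("p:nth-of-type(1)")              # first p within its parent
title = soup.select_one("head > title").get_text()     # direct child of head
nothing = soup.select_one("table")                     # None when nothing matches

print(in_div, first_p, title)
```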

ENCODING

In [None]:
soup.original_encoding

In [None]:
'ISO-8859-7'

APPLY ENCODING OF YOUR CHOICE

In [None]:
soup = BeautifulSoup(markup, 'lxml', from_encoding="iso-8859-8")

EXCLUDE ENCODING

In [None]:
soup = BeautifulSoup(markup, 'lxml', exclude_encodings=["ISO-8859-7"])

ENCODE A SPECIFIC DOCUMENT

In [None]:
soup.p.encode("latin-1")

`b'<p>0My first paragraph.</p>'`
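To see input and output encoding together, the sketch below feeds Latin-1 bytes to Beautiful Soup and then encodes one tag in two different encodings. The byte string is an invented example, and `html.parser` is used instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in Latin-1, as they might arrive from a legacy site.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                  # decoded to a Unicode string
as_utf8 = soup.p.encode("utf-8")      # bytes, é stored as two bytes
as_latin1 = soup.p.encode("latin-1")  # bytes, é stored as one byte

print(text, as_utf8, as_latin1)
```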

MARKUP LANGUAGE

- text-encoding system consisting of a set of symbols inserted in a text document to control its structure, formatting, or the relationship between its parts
- XML, HTML, LaTeX

XML MARKUP DELIMITER

- unique character which indicates the beginning or the end of an XML object
- examples: '<', '/>'

XML ENTITY REFERENCES

- a group of characters used in text as a substitute for a single specific character
- using entity reference prevents a literal character (ex. '&') from being mistaken for a markup delimiter
- e.g. if an attribute must contain a left angle bracket ('<'), you can substitute it with the entity reference "&lt;"
- XML entity references always begin with ampersand (&) and end with a semicolon (;)
- you can also substitute with a numeric or hexadecimal reference
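The substitutions described above can be tried with Python's standard-library `html` module (this is separate from Beautiful Soup). `html.escape` replaces the special characters with entity references, and `html.unescape` resolves named, numeric and hexadecimal references back to characters.

```python
import html

# Escaping: literal characters become entity references.
escaped = html.escape('Tom & "Jerry" <3')

# Unescaping: all three reference styles resolve to the same character.
ampersands = html.unescape("&amp; &#38; &#x26;")
less_than = html.unescape("&lt; &#60; &#x3C;")

print(escaped, ampersands, less_than)
```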

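| Character | Entity Reference | Numeric Reference | Hexadecimal Reference |
|-----------|------------------|-------------------|-----------------------|
| `&` | `&amp;` | `&#38;` | `&#x26;` |
| `<` | `&lt;` | `&#60;` | `&#x3C;` |
| `>` | `&gt;` | `&#62;` | `&#x3E;` |
| `"` | `&quot;` | `&#34;` | `&#x22;` |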