## Setting up

In [2]:
import requests
from bs4 import BeautifulSoup

## 1. Making a GET request 

- explore the page's HTML and get a simple response from the website
- it is recommended to explore the website first with the browser's developer tools

In [3]:
base_site = 'https://en.wikipedia.org/wiki/Music'

response = requests.get(base_site)
response

<Response [200]>
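A slightly more defensive version of the request above could look like the sketch below. The helper name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, or raise if the request fails."""
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

It would be called as `html = fetch_html(base_site)`, which replaces the manual check for `<Response [200]>`.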

In [4]:
html = response.content
html[:100]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

## 2. Constructing a Beautiful Soup object instance 

- this is the point where an HTML parser is specified
- Beautiful Soup supports several parsers, each with its own trade-offs:
    - html5lib (parses HTML the way a web browser does; the most lenient, but slow)
    - lxml (fast; used for both HTML and XML parsing; the Beautiful Soup documentation recommends it for speed)
    - html.parser (included in Python's standard library, so nothing extra to install)
    - https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=html%20parser#installing-a-parser
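Since lxml and html5lib are optional installs, one possible approach is to try the parsers in order of preference and fall back to the built-in one. This is only a sketch: the helper name `make_soup` and the preference order are assumptions, not something the notebook prescribes.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser from `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


sample_soup, used_parser = make_soup("<p>Hi</p>")
print(used_parser)
```

Because `html.parser` ships with Python, the last entry should always be available.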

In [6]:
soup = BeautifulSoup(html, 'html.parser')

## 3. Exporting HTML to a file

In [7]:
with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

------------------------------------------------------------------

## 4. Searching and navigating the HTML tree 

- the Beautiful Soup object represents the whole document
    - not just the text
    - all the elements of html
    
- we can use different **methods** to explore the object
    - find()
        - returns the FIRST TAG matching the search (or None if nothing matches)
    - find_all()
        - returns a list-like ResultSet of ALL matching TAGS (empty if nothing matches)
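The difference between the two methods is easier to see on a tiny document. The HTML string below is made up for illustration.

```python
from bs4 import BeautifulSoup

sample = "<ul><li>one</li><li>two</li></ul>"
doc = BeautifulSoup(sample, "html.parser")

first_item = doc.find("li")          # a single Tag
all_items = doc.find_all("li")       # a ResultSet holding both <li> tags

missing_one = doc.find("table")      # no match: None
missing_all = doc.find_all("table")  # no match: empty ResultSet

print(first_item.text, len(all_items), missing_one, len(missing_all))
```

Because `find()` can return None, it is worth checking the result before chaining further calls on it.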

In [8]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Music - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"1331537f-0095-44f5-adc9-58333059cd35","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Music","wgTitle":"Music","wgCurRevisionId":1039119909,"wgRevisionId":1039119909,"wgArticleId":18839,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with incomplete citations","Articles with incomplete citations from July 2019","CS1 maint: archived copy as title","CS1: Julian–Gregorian uncertainty","Webarchive template wayback

In [13]:
links = soup.find_all('a')
print("type(links)", type(links))
print(len(links))
links

type(links) <class 'bs4.element.ResultSet'>
2504


[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="" class="thumbimage" data-file-height="3189" data-f

In [7]:
soup.find('head')

<head>
<meta charset="utf-8"/>
<title>Music - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"74564fa7-31e0-49d4-9d0c-d7c94c6c5baa","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Music","wgTitle":"Music","wgCurRevisionId":984758011,"wgRevisionId":984758011,"wgArticleId":18839,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with incomplete citations","Articles with incomplete citations from July 2019","CS1 maint: archived copy as title","CS1: Julian–Gregorian uncertainty","Webarchive template wayback links","Pages containing links to subscription-only content","CS1

In [14]:
table = soup.find('tbody')
table

<tbody><tr><th class="sidebar-title" style="background:antiquewhite;;padding:0.5em; display:block;margin-bottom:0.4em;"><a href="/wiki/Performing_arts" title="Performing arts">Performing arts</a></th></tr><tr><td class="sidebar-content hlist" style="padding-bottom:0.5em;">
<ul><li><a href="/wiki/Acrobatics" title="Acrobatics">Acrobatics</a></li>
<li><a href="/wiki/Ballet" title="Ballet">Ballet</a></li>
<li><a href="/wiki/List_of_circus_skills" title="List of circus skills">Circus skills</a></li>
<li><a href="/wiki/Clown" title="Clown">Clown</a></li>
<li><a href="/wiki/Dance" title="Dance">Dance</a></li>
<li><a href="/wiki/Gymnastics" title="Gymnastics">Gymnastics</a></li>
<li><a href="/wiki/Magic_(illusion)" title="Magic (illusion)">Magic</a></li>
<li><a href="/wiki/Mime_artist" title="Mime artist">Mime</a></li>
<li><a class="mw-selflink selflink">Music</a></li>
<li><a href="/wiki/Opera" title="Opera">Opera</a></li>
<li><a href="/wiki/Professional_wrestling" title="Professional wrestli

In [15]:
type(table)

bs4.element.Tag

In [16]:
table.find_all('td')

[<td class="sidebar-content hlist" style="padding-bottom:0.5em;">
 <ul><li><a href="/wiki/Acrobatics" title="Acrobatics">Acrobatics</a></li>
 <li><a href="/wiki/Ballet" title="Ballet">Ballet</a></li>
 <li><a href="/wiki/List_of_circus_skills" title="List of circus skills">Circus skills</a></li>
 <li><a href="/wiki/Clown" title="Clown">Clown</a></li>
 <li><a href="/wiki/Dance" title="Dance">Dance</a></li>
 <li><a href="/wiki/Gymnastics" title="Gymnastics">Gymnastics</a></li>
 <li><a href="/wiki/Magic_(illusion)" title="Magic (illusion)">Magic</a></li>
 <li><a href="/wiki/Mime_artist" title="Mime artist">Mime</a></li>
 <li><a class="mw-selflink selflink">Music</a></li>
 <li><a href="/wiki/Opera" title="Opera">Opera</a></li>
 <li><a href="/wiki/Professional_wrestling" title="Professional wrestling">Professional wrestling</a></li>
 <li><a href="/wiki/Puppetry" title="Puppetry">Puppetry</a></li>
 <li><a href="/wiki/Public_speaking" title="Public speaking">Speech</a></li>
 <li><a href="/wiki

In [17]:
len(table.find_all('td'))

2

## Navigating the HTML tree 

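- moving around from the tbody tag
    - contents (a list of the tag's direct children)
    - parent (the tag that encloses it)
- these attributes can be chained (e.g. table.parent.parent), and we can also move sideways between siblings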
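Moving down, up and sideways can be sketched on a small, made-up table. The markup contains no whitespace between tags, so every child is a tag rather than a newline string.

```python
from bs4 import BeautifulSoup

sample = "<table><tbody><tr><th>head</th></tr><tr><td>cell</td></tr></tbody></table>"
doc = BeautifulSoup(sample, "html.parser")

tbody = doc.find("tbody")
rows = tbody.contents              # down: direct children of <tbody>
parent_name = tbody.parent.name    # up: the enclosing <table>

first_row = rows[0]
second_row = first_row.next_sibling        # sideways: the next <tr>
back_again = second_row.previous_sibling   # and back to the first <tr>

print(len(rows), parent_name, second_row.find("td").text)
```

On real pages the siblings are often newline strings rather than tags, so `next_sibling` may need to be applied more than once.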

In [18]:
table.contents

[<tr><th class="sidebar-title" style="background:antiquewhite;;padding:0.5em; display:block;margin-bottom:0.4em;"><a href="/wiki/Performing_arts" title="Performing arts">Performing arts</a></th></tr>,
 <tr><td class="sidebar-content hlist" style="padding-bottom:0.5em;">
 <ul><li><a href="/wiki/Acrobatics" title="Acrobatics">Acrobatics</a></li>
 <li><a href="/wiki/Ballet" title="Ballet">Ballet</a></li>
 <li><a href="/wiki/List_of_circus_skills" title="List of circus skills">Circus skills</a></li>
 <li><a href="/wiki/Clown" title="Clown">Clown</a></li>
 <li><a href="/wiki/Dance" title="Dance">Dance</a></li>
 <li><a href="/wiki/Gymnastics" title="Gymnastics">Gymnastics</a></li>
 <li><a href="/wiki/Magic_(illusion)" title="Magic (illusion)">Magic</a></li>
 <li><a href="/wiki/Mime_artist" title="Mime artist">Mime</a></li>
 <li><a class="mw-selflink selflink">Music</a></li>
 <li><a href="/wiki/Opera" title="Opera">Opera</a></li>
 <li><a href="/wiki/Professional_wrestling" title="Professional

In [19]:
len(table.contents)

3

In [20]:
table.parent

<table class="sidebar nomobile nowraplinks"><tbody><tr><th class="sidebar-title" style="background:antiquewhite;;padding:0.5em; display:block;margin-bottom:0.4em;"><a href="/wiki/Performing_arts" title="Performing arts">Performing arts</a></th></tr><tr><td class="sidebar-content hlist" style="padding-bottom:0.5em;">
<ul><li><a href="/wiki/Acrobatics" title="Acrobatics">Acrobatics</a></li>
<li><a href="/wiki/Ballet" title="Ballet">Ballet</a></li>
<li><a href="/wiki/List_of_circus_skills" title="List of circus skills">Circus skills</a></li>
<li><a href="/wiki/Clown" title="Clown">Clown</a></li>
<li><a href="/wiki/Dance" title="Dance">Dance</a></li>
<li><a href="/wiki/Gymnastics" title="Gymnastics">Gymnastics</a></li>
<li><a href="/wiki/Magic_(illusion)" title="Magic (illusion)">Magic</a></li>
<li><a href="/wiki/Mime_artist" title="Mime artist">Mime</a></li>
<li><a class="mw-selflink selflink">Music</a></li>
<li><a href="/wiki/Opera" title="Opera">Opera</a></li>
<li><a href="/wiki/Profess

In [21]:
table.parent.parent

<div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Form of art using sound and silence</div>
<style data-mw-deduplicate="TemplateStyles:r1033289096">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>
<p class="mw-empty-elt">
</p>
<div class="thumb tright"><div class="thumbinner" style="width:222px;"><a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="" class="thumbimage" data-file-height="3189" data-file-width="4000" decoding="async" height="175" src="//upload.wikimedia.org/wikipedia/commons/t

-------------------------------------------------

## Searching by attributes 

- using the ids of specific tags
    - attribute names can be user-defined
    - but we can explore the common ones
    
- passing attributes as keyword arguments
    - using the 'class' attribute of HTML tags (written class_, because class is a reserved word in Python)
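The three ways of filtering by attribute can be compared on a small, made-up fragment. The `data-role` attribute is used because its name contains a hyphen and therefore cannot be written as a keyword argument.

```python
from bs4 import BeautifulSoup

sample = (
    '<div id="main" class="box wide" data-role="note">x</div>'
    '<div class="box">y</div>'
)
doc = BeautifulSoup(sample, "html.parser")

main = doc.find("div", id="main")                        # keyword argument
boxes = doc.find_all("div", class_="box")                # class_ matches any one of a tag's classes
notes = doc.find_all("div", attrs={"data-role": "note"}) # dictionary form

print(main["class"], len(boxes), len(notes))
```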

In [22]:
soup.find('div', id='siteSub')

<div class="noprint" id="siteSub">From Wikipedia, the free encyclopedia</div>

In [23]:
soup.find_all('a', class_ = 'mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]

In [24]:
soup.find_all('a', class_ = 'mw-jump-link', href = '#mw-head')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>]

In [25]:
soup.find('a', class_ = 'mw-jump-link', href = '#mw-head')

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>

## Placing the attributes in a dictionary

In [26]:
soup.find('a', attrs={'class':'mw-jump-link', 'href':'#mw-head'})

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>

In [27]:
soup.find('div', {'id':'footer'})

In [28]:
soup.find('div', id='footer')

## Extracting data from the HTML tree

In [29]:
a = soup.find('a', class_='mw-jump-link')
a

<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>

In [30]:
a.name

'a'

## Getting the attribute value

In [31]:
a['href']

'#mw-head'

In [32]:
a['class']

['mw-jump-link']

##### multi-valued attributes such as class return a list of values, even if the particular tag has only one value

In [33]:
a.get('href')

'#mw-head'

In [34]:
a.get('class')

['mw-jump-link']

##### The difference between indexing (tag['key']) and the get method shows when the attribute is missing:
a) indexing raises a KeyError if the attribute doesn't exist
b) the get method returns None
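Both behaviours can be reproduced on a single tag. The sketch below rebuilds the jump link from a string, and the fallback value `'no-id'` is an arbitrary placeholder.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>',
    "html.parser",
)
link = doc.a

# Indexing raises KeyError when the attribute is missing.
try:
    link_id = link["id"]
except KeyError:
    link_id = None

safe_id = link.get("id")               # None instead of an exception
fallback_id = link.get("id", "no-id")  # a default of our choice
has_href = link.has_attr("href")       # explicit membership test

print(link_id, safe_id, fallback_id, has_href)
```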

In [36]:
a['a']

KeyError: 'a'

In [40]:
a.get('a')
repr(a.get('a'))

'None'

In [41]:
a.get('id')

In [42]:
repr(a.get('id'))

'None'

### the attrs attribute lets us explore all the attributes of a specific tag

In [44]:
a.attrs

{'class': ['mw-jump-link'], 'href': '#mw-head'}

-----------------------------------------------------


# How to extract the text from tags 

In [45]:
a.string

'Jump to navigation'

In [46]:
a.text

'Jump to navigation'

In [47]:
p = soup.find_all('p')[1]
p

<p><b>Music</b> is the <a href="/wiki/The_arts" title="The arts">art</a> of arranging <a href="/wiki/Sound" title="Sound">sounds</a> in time to produce a <a href="/wiki/Musical_composition" title="Musical composition">composition</a> through the <a href="/wiki/Elements_of_music" title="Elements of music">elements</a> of melody, harmony, rhythm, and timbre.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> It is one of the <a href="/wiki/Cultural_universal" title="Cultural universal">universal cultural</a> aspects of all human societies. General <a class="mw-redirect" href="/wiki/Definitions_of_music" title="Definitions of music">definitions of music</a> include common elements such as <a href="/wiki/Pitch_(music)" title="Pitch (music)">pitch</a> (which governs <a href="/wiki/Melody" title="Melody">melody</a> and <a href="/wiki/Harmony" title="Harmony">harmony</a>), <a href="/wiki/Rhythm" title="Rhythm">rhythm</a> (and its associated concepts <a href="/wiki/Temp

In [48]:
p.text

'Music is the art of arranging sounds in time to produce a composition through the elements of melody, harmony, rhythm, and timbre.[1] It is one of the universal cultural aspects of all human societies. General definitions of music include common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (which are sometimes termed the "color" of a musical sound). Different styles or types of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of instruments and vocal techniques ranging from singing to rapping; there are solely instrumental pieces, solely vocal pieces (such as songs without instrumental accompaniment) and pieces that combine singing and instruments. The word derives from Greek μουσική (mousike; "(art) of the Muses").[2]\n'

In [49]:
p.string

In [50]:
repr(p.string)

'None'

##### the string attribute returns a value only when the tag has exactly one child (a string, or a tag that itself holds a single string)
##### the p tag contains several strings and nested tags, so p.string is None
##### text returns all the text inside the element, including the text of nested tags
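A small, made-up fragment shows the three cases side by side. `get_text()` is the method behind `.text` and additionally accepts a separator and a `strip` flag.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    "<p><b>Music</b> is art</p><a><abbr>t</abbr></a>",
    "html.parser",
)
par = doc.p

only_child = par.b.string      # <b> holds exactly one string
several = par.string           # <p> has two children, so this is None
nested_single = doc.a.string   # <a> has one child tag with one string

full_text = par.text
tidy_text = par.get_text(" ", strip=True)

print(repr(only_child), repr(several), repr(nested_single), repr(full_text))
```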

In [51]:
p.parent

<div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Form of art using sound and silence</div>
<style data-mw-deduplicate="TemplateStyles:r1033289096">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>
<p class="mw-empty-elt">
</p>
<div class="thumb tright"><div class="thumbinner" style="width:222px;"><a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="" class="thumbimage" data-file-height="3189" data-file-width="4000" decoding="async" height="175" src="//upload.wikimedia.org/wikipedia/commons/t

In [52]:
p.parent.text

'Form of art using sound and silence\nFor other uses, see Music (disambiguation).\n\n\n Allegory of Music, by François Boucher, 1764\nPerforming arts\nAcrobatics\nBallet\nCircus skills\nClown\nDance\nGymnastics\nMagic\nMime\nMusic\nOpera\nProfessional wrestling\nPuppetry\nSpeech\nStand-up comedy\nTheatre\nVentriloquism\nvte\nMusic is the art of arranging sounds in time to produce a composition through the elements of melody, harmony, rhythm, and timbre.[1] It is one of the universal cultural aspects of all human societies. General definitions of music include common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (which are sometimes termed the "color" of a musical sound). Different styles or types of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of instruments and vocal t

In [53]:
print(soup.text)





Music - Wikipedia

































Music

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
Form of art using sound and silence
For other uses, see Music (disambiguation).


 Allegory of Music, by François Boucher, 1764
Performing arts
Acrobatics
Ballet
Circus skills
Clown
Dance
Gymnastics
Magic
Mime
Music
Opera
Professional wrestling
Puppetry
Speech
Stand-up comedy
Theatre
Ventriloquism
vte
Music is the art of arranging sounds in time to produce a composition through the elements of melody, harmony, rhythm, and timbre.[1] It is one of the universal cultural aspects of all human societies. General definitions of music include common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (which are sometimes termed the "color" of a musical sound). Different styles or types of music may emphas

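#### Beautiful Soup only parses the HTML source and does not run JavaScript, so the text can also contain the raw contents of script tags from the webpage (depending on the Beautiful Soup version)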
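One common way to keep script and style contents out of the extracted text, whatever the library version, is to remove those tags before reading the text. The sketch below uses a made-up page and is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup

sample = (
    "<html><head><script>var x = 1;</script><style>p {color: red}</style></head>"
    "<body><p>Hello</p></body></html>"
)
doc = BeautifulSoup(sample, "html.parser")

# decompose() removes a tag and everything inside it from the tree.
for tag in doc.find_all(["script", "style"]):
    tag.decompose()

visible_text = doc.get_text(strip=True)
print(visible_text)
```

Note that `decompose()` modifies the soup in place, so it should be applied to a copy if the original tree is still needed.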

## .strings and .stripped_strings

In [54]:
for s in p.strings:
    print(repr(s))

'Music'
' is the '
'art'
' of arranging '
'sounds'
' in time to produce a '
'composition'
' through the '
'elements'
' of melody, harmony, rhythm, and timbre.'
'[1]'
' It is one of the '
'universal cultural'
' aspects of all human societies. General '
'definitions of music'
' include common elements such as '
'pitch'
' (which governs '
'melody'
' and '
'harmony'
'), '
'rhythm'
' (and its associated concepts '
'tempo'
', '
'meter'
', and '
'articulation'
'), '
'dynamics'
' (loudness and softness), and the sonic qualities of '
'timbre'
' and '
'texture'
' (which are sometimes termed the "color" of a musical sound). Different '
'styles or types'
' of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of '
'instruments'
' and vocal techniques ranging from '
'singing'
' to '
'rapping'
'; there are solely '
'instrumental pieces'
', '
'solely vocal pieces'
' (such as songs without instrumental '
'accompaniment'
') and pieces that combine sin

In [55]:
for s in p.stripped_strings:
    print(repr(s))

'Music'
'is the'
'art'
'of arranging'
'sounds'
'in time to produce a'
'composition'
'through the'
'elements'
'of melody, harmony, rhythm, and timbre.'
'[1]'
'It is one of the'
'universal cultural'
'aspects of all human societies. General'
'definitions of music'
'include common elements such as'
'pitch'
'(which governs'
'melody'
'and'
'harmony'
'),'
'rhythm'
'(and its associated concepts'
'tempo'
','
'meter'
', and'
'articulation'
'),'
'dynamics'
'(loudness and softness), and the sonic qualities of'
'timbre'
'and'
'texture'
'(which are sometimes termed the "color" of a musical sound). Different'
'styles or types'
'of music may emphasize, de-emphasize or omit some of these elements. Music is performed with a vast range of'
'instruments'
'and vocal techniques ranging from'
'singing'
'to'
'rapping'
'; there are solely'
'instrumental pieces'
','
'solely vocal pieces'
'(such as songs without instrumental'
'accompaniment'
') and pieces that combine singing and instruments. The word derives fr
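The same comparison on a short, made-up paragraph makes the difference easier to see: `.strings` yields every string exactly as it appears, while `.stripped_strings` trims each one and skips those that are only whitespace.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<p> <b>Music</b> is the <a href="#">art</a>\n</p>',
    "html.parser",
)

raw = list(doc.p.strings)             # includes whitespace-only strings
clean = list(doc.p.stripped_strings)  # trimmed, empty strings dropped
sentence = " ".join(clean)

print(raw)
print(clean)
```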

## Links - building absolute URLs

In [56]:
links

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a class="image" href="/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg"><img alt="" class="thumbimage" data-file-height="3189" data-f

In [59]:
link = links[26]
link

<a href="/wiki/Template_talk:Performing_arts" title="Template talk:Performing arts"><abbr title="Discuss this template">t</abbr></a>

In [61]:
link.string

't'

In [62]:
link['href']

'/wiki/Template_talk:Performing_arts'

In [63]:
from urllib.parse import urljoin

In [64]:
base_site

'https://en.wikipedia.org/wiki/Music'

In [65]:
relative_url = link['href']

In [66]:
full_url = urljoin(base_site, relative_url)
full_url

'https://en.wikipedia.org/wiki/Template_talk:Performing_arts'
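`urljoin` handles more than root-relative paths. The sketch below runs it on four kinds of href that appear on the page. The `example.org` and image URLs are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Music"

root_relative = urljoin(base, "/wiki/Sound")
fragment_only = urljoin(base, "#mw-head")
already_absolute = urljoin(base, "https://example.org/page")
scheme_relative = urljoin(base, "//upload.wikimedia.org/a.png")

for url in (root_relative, fragment_only, already_absolute, scheme_relative):
    print(url)
```

An href that is already absolute is returned unchanged, so it is safe to pass every link through `urljoin`.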

## Processing multiple links at once with List Comprehension

In [67]:
[l.get('href') for l in links]

[None,
 '/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#searchInput',
 '/wiki/Music_(disambiguation)',
 '/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 '/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 '/wiki/Fran%C3%A7ois_Boucher',
 '/wiki/Performing_arts',
 '/wiki/Acrobatics',
 '/wiki/Ballet',
 '/wiki/List_of_circus_skills',
 '/wiki/Clown',
 '/wiki/Dance',
 '/wiki/Gymnastics',
 '/wiki/Magic_(illusion)',
 '/wiki/Mime_artist',
 None,
 '/wiki/Opera',
 '/wiki/Professional_wrestling',
 '/wiki/Puppetry',
 '/wiki/Public_speaking',
 '/wiki/Stand-up_comedy',
 '/wiki/Theatre',
 '/wiki/Ventriloquism',
 '/wiki/Template:Performing_arts',
 '/wiki/Template_talk:Performing_arts',
 'https://en.wikipedia.org/w/index.php?title=Template:Performing_arts&action=edit',
 '/wiki/The_arts',
 '/wiki/Sound',
 '/wiki/Musical_composition',
 '/wiki/Elements_of_music',
 '#cite_note-1',
 '/wiki/Cultural_universal',
 '/wiki/Definitions_of_music',
 '/wi

In [68]:
clean_links = [l for l in links if l.get('href') is not None]

In [69]:
relative_urls = [l.get('href') for l in clean_links]
relative_urls

['/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#searchInput',
 '/wiki/Music_(disambiguation)',
 '/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 '/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 '/wiki/Fran%C3%A7ois_Boucher',
 '/wiki/Performing_arts',
 '/wiki/Acrobatics',
 '/wiki/Ballet',
 '/wiki/List_of_circus_skills',
 '/wiki/Clown',
 '/wiki/Dance',
 '/wiki/Gymnastics',
 '/wiki/Magic_(illusion)',
 '/wiki/Mime_artist',
 '/wiki/Opera',
 '/wiki/Professional_wrestling',
 '/wiki/Puppetry',
 '/wiki/Public_speaking',
 '/wiki/Stand-up_comedy',
 '/wiki/Theatre',
 '/wiki/Ventriloquism',
 '/wiki/Template:Performing_arts',
 '/wiki/Template_talk:Performing_arts',
 'https://en.wikipedia.org/w/index.php?title=Template:Performing_arts&action=edit',
 '/wiki/The_arts',
 '/wiki/Sound',
 '/wiki/Musical_composition',
 '/wiki/Elements_of_music',
 '#cite_note-1',
 '/wiki/Cultural_universal',
 '/wiki/Definitions_of_music',
 '/wiki/Pitch_(musi

In [70]:
full_urls = [urljoin(base_site, url) for url in relative_urls]

In [71]:
full_urls

['https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi',
 'https://en.wikipedia.org/wiki/Music#mw-head',
 'https://en.wikipedia.org/wiki/Music#searchInput',
 'https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 'https://en.wikipedia.org/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 'https://en.wikipedia.org/wiki/Fran%C3%A7ois_Boucher',
 'https://en.wikipedia.org/wiki/Performing_arts',
 'https://en.wikipedia.org/wiki/Acrobatics',
 'https://en.wikipedia.org/wiki/Ballet',
 'https://en.wikipedia.org/wiki/List_of_circus_skills',
 'https://en.wikipedia.org/wiki/Clown',
 'https://en.wikipedia.org/wiki/Dance',
 'https://en.wikipedia.org/wiki/Gymnastics',
 'https://en.wikipedia.org/wiki/Magic_(illusion)',
 'https://en.wikipedia.org/wiki/Mime_artist',
 'https://en.wikipedia.org/wiki/Opera',
 'https://en.wikipedia.org/wiki/Professional_wrestling',
 'https

In [73]:
internal_links = [url for url in full_urls if 'wikipedia.org' in url]
internal_links

### internal_links are links that lead to a page on the same domain

['https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi',
 'https://en.wikipedia.org/wiki/Music#mw-head',
 'https://en.wikipedia.org/wiki/Music#searchInput',
 'https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 'https://en.wikipedia.org/wiki/File:Fran%C3%A7ois_Boucher,_Allegory_of_Music,_1764,_NGA_32680.jpg',
 'https://en.wikipedia.org/wiki/Fran%C3%A7ois_Boucher',
 'https://en.wikipedia.org/wiki/Performing_arts',
 'https://en.wikipedia.org/wiki/Acrobatics',
 'https://en.wikipedia.org/wiki/Ballet',
 'https://en.wikipedia.org/wiki/List_of_circus_skills',
 'https://en.wikipedia.org/wiki/Clown',
 'https://en.wikipedia.org/wiki/Dance',
 'https://en.wikipedia.org/wiki/Gymnastics',
 'https://en.wikipedia.org/wiki/Magic_(illusion)',
 'https://en.wikipedia.org/wiki/Mime_artist',
 'https://en.wikipedia.org/wiki/Opera',
 'https://en.wikipedia.org/wiki/Professional_wrestling',
 'https
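The substring test above also accepts any URL that merely mentions `wikipedia.org`, for example in a query string. A stricter alternative, sketched here under the assumption that "internal" means "same host as the base page", compares the parsed host names. The helper `is_internal` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

BASE = "https://en.wikipedia.org/wiki/Music"


def is_internal(url, base=BASE):
    """True if url points to the same host as base."""
    return urlparse(url).netloc == urlparse(base).netloc


sample_urls = [
    "https://en.wikipedia.org/wiki/Opera",
    "https://example.com/?ref=wikipedia.org",
    "https://commons.wikimedia.org/wiki/Music",
]
same_host = [u for u in sample_urls if is_internal(u)]
print(same_host)
```

This check is stricter than the original: links to other language editions such as `de.wikipedia.org` would no longer count as internal.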

## Extracting data from nested tags

In [74]:
div_notes = soup.find_all('div', {'role':'note'})
div_notes

[<div class="hatnote navigation-not-searchable" role="note">For other uses, see <a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>.</div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Music_theory" title="Music theory">Music theory</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Elements_of_music" title="Elements of music">Elements 

In [75]:
div_links = [div.find('a') for div in div_notes]
div_links

[<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>,
 <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a>,
 <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a>,
 <a href="/wiki/Music_theory" title="Music theory">Music theory</a>,
 <a href="/wiki/Elements_of_music" title="Elements of music">Elements of music</a>,
 <a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
 <a href="/wiki/History_of_music" title="History of music">History of music</a>,
 <a href="/wiki/Music_of_Egypt" title="Music of Egypt">Music of Egypt</a>,
 <a class="mw-redirect" href="/wiki/20th-century_music" title="20th-century music">20th-century music</a>,
 <a href="/wiki/Aesthetics_of_music" title="Aesthetics of music">Aesthetics of music</a>,
 <a href="/wiki/Neuroscience_of_musi

In [76]:
len(div_links)

22

In [77]:
div_notes[6].find_all('a')

[<a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
 <a href="/wiki/Binary_form" title="Binary form">Binary form</a>,
 <a href="/wiki/Ternary_form" title="Ternary form">Ternary form</a>,
 <a class="mw-redirect" href="/wiki/Rondo_form" title="Rondo form">Rondo form</a>,
 <a href="/wiki/Variation_(music)" title="Variation (music)">Variation (music)</a>,
 <a class="mw-redirect" href="/wiki/Musical_development" title="Musical development">Musical development</a>]

In [78]:
div_links = []

for div in div_notes:
    anchors = div.find_all('a')
    
    div_links.extend(anchors)
#     a different way would be to iterate again and append the list:
#     for a in anchors: 
#         div_links.append(a)

In [79]:
div_links

[<a class="mw-disambig" href="/wiki/Music_(disambiguation)" title="Music (disambiguation)">Music (disambiguation)</a>,
 <a href="/wiki/Musical_composition" title="Musical composition">Musical composition</a>,
 <a href="/wiki/Musical_notation" title="Musical notation">Musical notation</a>,
 <a href="/wiki/Musical_improvisation" title="Musical improvisation">Musical improvisation</a>,
 <a href="/wiki/Music_theory" title="Music theory">Music theory</a>,
 <a href="/wiki/Elements_of_music" title="Elements of music">Elements of music</a>,
 <a href="/wiki/Strophic_form" title="Strophic form">Strophic form</a>,
 <a href="/wiki/Binary_form" title="Binary form">Binary form</a>,
 <a href="/wiki/Ternary_form" title="Ternary form">Ternary form</a>,
 <a class="mw-redirect" href="/wiki/Rondo_form" title="Rondo form">Rondo form</a>,
 <a href="/wiki/Variation_(music)" title="Variation (music)">Variation (music)</a>,
 <a class="mw-redirect" href="/wiki/Musical_development" title="Musical development">Mu

In [80]:
len(div_links)

29
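The jump from 22 to 29 links comes from `find()` keeping only the first anchor of each note, while `find_all()` keeps all of them. The loop with `extend` can also be written as a nested list comprehension, sketched here on two made-up notes.

```python
from bs4 import BeautifulSoup

sample = (
    '<div role="note">See <a href="/wiki/A">A</a> and <a href="/wiki/B">B</a></div>'
    '<div role="note">Main article: <a href="/wiki/C">C</a></div>'
)
doc = BeautifulSoup(sample, "html.parser")
notes = doc.find_all("div", {"role": "note"})

# One anchor per note: later anchors inside the same note are lost.
first_only = [div.find("a") for div in notes]

# Every anchor of every note, flattened into a single list.
all_anchors = [a for div in notes for a in div.find_all("a")]
hrefs = [a["href"] for a in all_anchors]

print(len(first_only), len(all_anchors), hrefs)
```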

In [81]:
note_urls = [urljoin(base_site, l.get('href')) for l in div_links]
note_urls

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Music_theory',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Strophic_form',
 'https://en.wikipedia.org/wiki/Binary_form',
 'https://en.wikipedia.org/wiki/Ternary_form',
 'https://en.wikipedia.org/wiki/Rondo_form',
 'https://en.wikipedia.org/wiki/Variation_(music)',
 'https://en.wikipedia.org/wiki/Musical_development',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Music_of_Egypt',
 'https://en.wikipedia.org/wiki/20th-century_music',
 'https://en.wikipedia.org/wiki/Aesthetics_of_music',
 'https://en.wikipedia.org/wiki/Neuroscience_of_music',
 'https://en.wikipedia.org/wiki/Hearing',
 'https://en.wikipedia.org/wiki/Culture_in_music_cognition',
 'https://en.wikipedia.or

---------------------------------------------------

# Scraping multiple pages automatically

In [82]:
note_urls
# the goal is to have no None values in this list

['https://en.wikipedia.org/wiki/Music_(disambiguation)',
 'https://en.wikipedia.org/wiki/Musical_composition',
 'https://en.wikipedia.org/wiki/Musical_notation',
 'https://en.wikipedia.org/wiki/Musical_improvisation',
 'https://en.wikipedia.org/wiki/Music_theory',
 'https://en.wikipedia.org/wiki/Elements_of_music',
 'https://en.wikipedia.org/wiki/Strophic_form',
 'https://en.wikipedia.org/wiki/Binary_form',
 'https://en.wikipedia.org/wiki/Ternary_form',
 'https://en.wikipedia.org/wiki/Rondo_form',
 'https://en.wikipedia.org/wiki/Variation_(music)',
 'https://en.wikipedia.org/wiki/Musical_development',
 'https://en.wikipedia.org/wiki/History_of_music',
 'https://en.wikipedia.org/wiki/Music_of_Egypt',
 'https://en.wikipedia.org/wiki/20th-century_music',
 'https://en.wikipedia.org/wiki/Aesthetics_of_music',
 'https://en.wikipedia.org/wiki/Neuroscience_of_music',
 'https://en.wikipedia.org/wiki/Hearing',
 'https://en.wikipedia.org/wiki/Culture_in_music_cognition',
 'https://en.wikipedia.or

In [84]:
### create a list that stores the paragraph texts of every page ###

par_text = []

# enumerate supplies the 1-based page counter used in the log messages
for i, url in enumerate(note_urls, start=1):

    ### 1. send a request to each URL (page)
    note_resp = requests.get(url)

    ### 1b) skip any page that does not return status 200 (a missing page, for example, returns 404)
    if note_resp.status_code == 200:
        print('URL #{0}: {1}'.format(i, url))
    else:
        print('Status code {0}: Skipping URL #{1}: {2}'.format(note_resp.status_code, i, url))
        # store an empty list so par_text stays aligned with note_urls
        par_text.append([])
        continue

    ### 2. get the CONTENT of each page
    note_html = note_resp.content

    ### 3. create a BS object and name a parser explicitly (if none is given, BS picks the best one installed: 'lxml', then 'html5lib', then the built-in 'html.parser')
    note_soup = BeautifulSoup(note_html, 'html.parser')

    ### 4. DATA EXTRACTION - store all paragraph tags in a variable
    note_pars = note_soup.find_all('p')

    ### extract the text of each paragraph with a list comprehension
    text = [p.text for p in note_pars]

    ### append to the initialized list
    par_text.append(text)

URL #1: https://en.wikipedia.org/wiki/Music_(disambiguation)
URL #2: https://en.wikipedia.org/wiki/Musical_composition
URL #3: https://en.wikipedia.org/wiki/Musical_notation
URL #4: https://en.wikipedia.org/wiki/Musical_improvisation
URL #5: https://en.wikipedia.org/wiki/Music_theory
URL #6: https://en.wikipedia.org/wiki/Elements_of_music
URL #7: https://en.wikipedia.org/wiki/Strophic_form
URL #8: https://en.wikipedia.org/wiki/Binary_form
URL #9: https://en.wikipedia.org/wiki/Ternary_form
URL #10: https://en.wikipedia.org/wiki/Rondo_form
URL #11: https://en.wikipedia.org/wiki/Variation_(music)
URL #12: https://en.wikipedia.org/wiki/Musical_development
URL #13: https://en.wikipedia.org/wiki/History_of_music
URL #14: https://en.wikipedia.org/wiki/Music_of_Egypt
URL #15: https://en.wikipedia.org/wiki/20th-century_music
URL #16: https://en.wikipedia.org/wiki/Aesthetics_of_music
URL #17: https://en.wikipedia.org/wiki/Neuroscience_of_music
URL #18: https://en.wikipedia.org/wiki/Hearing
URL #
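
The loop above stops with an exception if a request fails outright (for example on a timeout), and it opens a new connection for every page. Below is a hedged sketch of one way to harden it. `scrape_paragraphs` is a hypothetical helper that is not part of the notebook, and the 10-second timeout is an arbitrary assumption. It keys the results by URL, so a skipped page cannot shift the pairing between URLs and text.

```python
import requests
from bs4 import BeautifulSoup


def scrape_paragraphs(urls, session=None, timeout=10):
    """Return {url: [paragraph text, ...]} for every page that loads with status 200."""
    # a Session reuses the underlying connection across requests to the same host
    session = session or requests.Session()
    results = {}

    for number, url in enumerate(urls, start=1):
        try:
            resp = session.get(url, timeout=timeout)
        except requests.RequestException as exc:
            # network-level failure (timeout, DNS error, refused connection, ...)
            print('Request failed: Skipping URL #{0}: {1} ({2})'.format(number, url, exc))
            continue

        if resp.status_code != 200:
            print('Status code {0}: Skipping URL #{1}: {2}'.format(resp.status_code, number, url))
            continue

        # same extraction as in the notebook: the text of every <p> tag
        soup = BeautifulSoup(resp.content, 'html.parser')
        results[url] = [p.text for p in soup.find_all('p')]

    return results
```

Calling `scrape_paragraphs(note_urls)` would then return a dictionary that holds only the pages that were actually downloaded.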

In [85]:
par_text[0]

['Music is an art form consisting of sound and silence, expressed through time.\n',
 'Music may also refer to:\n']

In [86]:
page_text = "".join(par_text[0])
page_text

'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n'

In [87]:
page_text = ["".join(text) for text in par_text]
page_text[0]

'Music is an art form consisting of sound and silence, expressed through time.\nMusic may also refer to:\n'
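
Joining with an empty string works here because each paragraph in the output above already ends in `\n`. Here is a small sketch of a variant that does not rely on that, using a hypothetical helper named `join_paragraphs`. It strips each paragraph, drops the empty ones, and inserts the newline itself. The whitespace-only middle entry in `sample` is added purely for illustration.

```python
def join_paragraphs(paragraphs):
    """Join paragraph strings into one text, one paragraph per line."""
    cleaned = [p.strip() for p in paragraphs]
    # skip paragraphs that were empty or contained only whitespace
    return "\n".join(p for p in cleaned if p)


sample = ['Music is an art form consisting of sound and silence, expressed through time.\n',
          '\n',
          'Music may also refer to:\n']
print(join_paragraphs(sample))
```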

In [88]:
print(page_text[17])

Hearing, or auditory perception, is the ability to perceive sounds through an organ, such as an ear, by detecting vibrations as periodic changes in the pressure of a surrounding medium.[1]. The academic field concerned with hearing is auditory science.
Sound may be heard through solid, liquid, or gaseous matter.[2] It is one of the traditional five senses. Partial or total inability to hear is called hearing loss.
In humans and other vertebrates, hearing is performed primarily by the auditory system: mechanical waves, known as vibrations, are detected by the ear and transduced into nerve impulses that are perceived by the brain (primarily in the temporal lobe). Like touch, audition requires sensitivity to the movement of molecules in the world outside the organism. Both hearing and touch are types of mechanosensation.[3][4]
There are three main components of the human auditory system: the outer ear, the middle ear, and the inner ear.
The outer ear includes the pinna, the visible part o

## Mapping each page's text to its URL

In [89]:
url_to_text = dict(zip(note_urls, page_text))

In [90]:
print(url_to_text['https://en.wikipedia.org/wiki/Musical_composition'])


Musical composition, music composition or simply composition, can refer to an original piece or work of music,[1] either vocal or instrumental, the structure of a musical piece or to the process of creating or writing a new piece of music. People who create new compositions are called composers. Composers of primarily songs are usually called songwriters;[2][3] with songs, the person who writes lyrics for a song is the lyricist. In many cultures, including Western classical music, the act of composing typically includes the creation of music notation, such as a sheet music "score," which is then performed by the composer or by other musicians. In popular music and traditional music, songwriting may involve the creation of a basic outline of the song, called the lead sheet, which sets out the melody, lyrics and chord progression. In classical music, orchestration (choosing the instruments of a large music ensemble such as an orchestra which will play the different parts of music, such 
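
`zip` stops silently at the shorter input, so the mapping above is only correct if `page_text` holds exactly one entry per URL, in the same order. The sketch below shows a guarded version, with `map_urls_to_text` as a hypothetical helper name and shortened, made-up sample texts.

```python
def map_urls_to_text(urls, texts):
    """Pair each URL with its page text, refusing inputs of different length."""
    if len(urls) != len(texts):
        raise ValueError('got {0} URLs but {1} texts'.format(len(urls), len(texts)))
    return dict(zip(urls, texts))


sample_urls = ['https://en.wikipedia.org/wiki/Binary_form',
               'https://en.wikipedia.org/wiki/Ternary_form']
sample_texts = ['text of the first page', 'text of the second page']

sample_mapping = map_urls_to_text(sample_urls, sample_texts)
print(sample_mapping['https://en.wikipedia.org/wiki/Ternary_form'])
```

On Python 3.10 or later, `zip(urls, texts, strict=True)` raises a `ValueError` on a length mismatch without the manual check.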