Summary of the goal of the notebook.

Adapt the following:

Use external data to help us answering our questions, boorst our model, for instances if we believe that the product or service have a close relation with weather conditions, adding data about weather can improve our model performance.

In addition, web scraping can be a good tool to keep us update about our partners and concurrency, helping to make informed business decision.


Web scraping allows us to load a Web page into Python and extract the information we want. We can then work with the data using standard analysis tools like pandas and numpy.

Before we can do Web scraping, we need to understand the structure of the Web page we're working with, then find a way to extract parts of that structure in a sensible way.

We'll use the requests library heavily as we learn about Web scraping. This library enables us to download a Web page. We'll also use the beautifulsoup library to extract the relevant parts of the Web page.



**WHY WEB SCRAPING IS IMPORTANT**

In this notebook we show how to extract text from a website as well as information within tables in those websites. To finish we show how to extract information within hyperlinks within a website.

For this we use basic knowledge of `HTML` which means its tree structure and that tags define the branches where the information we search are.

In addition we make use of two Python libraries:

* [`requests`](https://requests.readthedocs.io/en/master/) which we allow us to get the webpage we want; and 
* [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) that parses the content of the webpage and allows us extracting tags from an HTML document.

For our example, we go to a neutral camp where I believe everybody (or the great majority) loves: Movies and Music! In addition we pay our respect to the first Jams Bond to [Sir Thomas Sean Connery](https://www.imdb.com/name/nm0000125/bio) that left us last month.

The idea is to:

🎬 Extract information about all the movies from James Bond in a table at [List_of_James_Bond_films](https://en.wikipedia.org/wiki/List_of_James_Bond_films)

🎶 Extract information about all the James Bond's title songs in a table at [Lijst_van_titelsongs_uit_de_James_Bondfilms](https://nl.wikipedia.org/wiki/Lijst_van_titelsongs_uit_de_James_Bondfilms) 
("Yes! Dutch site the structure of the table was much easier. So if you have an option go to the easy one  😉 ")

🎶 Scrape all lyrics of the songs (Notice that some title songs are instrumental)

So let's start!


# Web Scraping Information about James Bond's Movies

## Step 1: Inspecting the website

Every time we scrape a website we need to have an idea of its structure and where to find what we need.

For this, no matter which browser we use, we can access its code by right clicking and choosing to access it source code, i.e., `view page` (Firefox) or `view page source` (Chrome and Microsoft Edge). If you need details of an specific element right click on it and choose `inspect element`(Firefox) or `inspect` (Chrome and Microsoft Edge).

Web pages use `HyperText Markup Language (HTML)` which is a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to you.

HTML consists of **tags**. Anything in between the opening and closing of a tag is the content of that tag. 

Some of elements that often encountered are:

`<p>`: Used for paragraphs. 

`head` : Contains metadata useful to the Web browser that's rendering the page and it is not visible to the user.

`body` : Contains represents the content of an HTML document with which the user interacts

`a` : Creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.

For more definitions of elements check this [link](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

While inspecting the website source code you will notice that Some tags contain attributes which provide special instructions for the contents contained with that tag. Specific html attributes names are followed by equal sign, followed by information which being passed to that attribute within that tag.

For example:

## Step 2: Access Content of Website

For this we :

1. Access website using `requests`
2. Parse content with `Beautiful Soup` so we can extract what we need within tags

In [3]:
# import necessary packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [1]:
# importing packages

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

TodaysDate = time.strftime("%Y-%m-%d")

pd.options.display.max_rows = 999

In [4]:
# start making request to our chose website

main_url = "https://en.wikipedia.org/wiki/List_of_James_Bond_films"

# Send request and catch response: r

response = requests.get(main_url)

# get the content of the response

content = response.content

# parse webpage
parser = BeautifulSoup(content, 'html.parser')

In [5]:
parser

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of James Bond films - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"787a57a0-6146-43b8-91a1-6869ed9be667","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_James_Bond_films","wgTitle":"List of James Bond films","wgCurRevisionId":988146648,"wgRevisionId":988146648,"wgArticleId":33190861,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description matches Wikidata","Use dmy dates from June 2020","EngvarB from June 2020","CS1 mai

We will need to perform the same process for our 2 next tasks, so let's build a function:

In [6]:
def parse_website(url):
    """ """
    
    # Send request and catch response
    response = requests.get(main_url)

    # get the content of the response
    content = response.content

    # parse webpage
    parser = BeautifulSoup(content, 'html.parser')
    
    return parser
    

## Extracting What We Need

This part will depend on the structure of the website source code and of what you need as information from it.

In [2]:
# start making request to our chose website

main_url = "https://nl.wikipedia.org/wiki/Lijst_van_titelsongs_uit_de_James_Bondfilms"
# Send request and catch response: r

response = requests.get(main_url)
response

<Response [200]>

In [307]:
content = response.content
content

b'<!DOCTYPE html>\n<html class="client-nojs" lang="nl" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Lijst van titelsongs uit de James Bondfilms - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\\t.",".\\t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","januari","februari","maart","april","mei","juni","juli","augustus","september","oktober","november","december"],"wgRequestId":"97c40ad9-8b08-42d1-b7d5-2875af1e3797","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Lijst_van_titelsongs_uit_de_James_Bondfilms","wgTitle":"Lijst van titelsongs uit de James Bondfilms","wgCurRevisionId":56107008,"wgRevisionId":56107008,"wgArticleId":701466,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["James Bond"],"wgPageContentLanguage":"nl","wgPageContentModel":

Now that we have the tree let's get the branches we want. To access the branches we use tags as attributes. Therefore, to obtain the title of the webpage:

In [9]:
title = parser.title
title = title.text
title

'List of James Bond films - Wikipedia'

Similarly, for the 1st and main paragraph:

In [11]:
# body is with html element
body = parser.body
body

<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-List_of_James_Bond_films rootpage-List_of_James_Bond_films skin-vector action-view skin-vector-legacy"><div class="noprint" id="mw-page-base"></div>
<div class="noprint" id="mw-head-base"></div>
<div class="mw-body" id="content" role="main">
<a id="top"></a>
<div class="mw-body-content" id="siteNotice"><!-- CentralNotice --></div>
<div class="mw-indicators mw-body-content">
<div class="mw-indicator" id="mw-indicator-featured-star"><a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.or

In [15]:
len(parser.find_all('body'))

1

How many paragraphs do we have in this page ?

In [37]:
parser.find_all('p')

[<p class="mw-empty-elt">
 </p>,
 <p><a href="/wiki/James_Bond_(literary_character)" title="James Bond (literary character)">James Bond</a> is a <a href="/wiki/Character_(arts)" title="Character (arts)">fictional character</a> created by the novelist <a href="/wiki/Ian_Fleming" title="Ian Fleming">Ian Fleming</a> in 1953. Bond is a British secret agent working for <a href="/wiki/Secret_Intelligence_Service" title="Secret Intelligence Service">MI6</a> who also answers to his codename, ”007“.  He has been <a class="mw-redirect" href="/wiki/James_Bond_filmography" title="James Bond filmography">portrayed on film</a> by the actors <a href="/wiki/Sean_Connery" title="Sean Connery">Sean Connery</a>, <a href="/wiki/David_Niven" title="David Niven">David Niven</a>, <a href="/wiki/George_Lazenby" title="George Lazenby">George Lazenby</a>, <a href="/wiki/Roger_Moore" title="Roger Moore">Roger Moore</a>, <a href="/wiki/Timothy_Dalton" title="Timothy Dalton">Timothy Dalton</a>, <a href="/wiki/Pier

The method `find_all` returns a list and as one we can access an item using an index.

In [36]:
print(parser.find_all('p')[1].text)

James Bond is a fictional character created by the novelist Ian Fleming in 1953. Bond is a British secret agent working for MI6 who also answers to his codename, ”007“.  He has been portrayed on film by the actors Sean Connery, David Niven, George Lazenby, Roger Moore, Timothy Dalton, Pierce Brosnan and Daniel Craig, in twenty-seven productions. All the films but two were made by Eon Productions. Eon now holds the full adaptation rights to all of Fleming's Bond novels.[1][2]



In [31]:
parser.find_all('div', class_="hatnote navigation-not-searchable")

[<div class="hatnote navigation-not-searchable" role="note">This article is about the Bond films themselves. For the production background of the films, see <a href="/wiki/Production_of_the_James_Bond_films" title="Production of the James Bond films">Production of the James Bond films</a>. For the various portrayals of the character, see <a href="/wiki/Portrayal_of_James_Bond_in_film" title="Portrayal of James Bond in film">Portrayal of James Bond in film</a>.</div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Dr._No_(film)" title="Dr. No (film)">Dr. No (film)</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/From_Russia_with_Love_(film)" title="From Russia with Love (film)">From Russia with Love (film)</a></div>,
 <div class="hatnote navigation-not-searchable" role="note">Main article: <a href="/wiki/Goldfinger_(film)" title="Goldfinger (film)">Goldfinger (film)</a></div>,
 <div class="hatno

In [23]:
parser.find_all("a")

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/Production_of_the_James_Bond_films" title="Production of the James Bond films">Production of the James Bond films</a>,
 <a href="/wiki/Portrayal_of_James_Bond_in_film" title="Portrayal of James Bond in film">Portrayal of James Bond in film</a>,
 <a href="/wiki/James

In [10]:
# p (paragraph) is with body
body = body.p
# then we have the proper text
body = body.text
# and we can apply strip to clean a bit
body = body.strip()
body

''

In [308]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass content to it.
parser = BeautifulSoup(content, 'html.parser')



In [309]:
title = parser.title
title

<title>Lijst van titelsongs uit de James Bondfilms - Wikipedia</title>

In [310]:
# use text that allow us to access the text with tags
title = title.text
title

'Lijst van titelsongs uit de James Bondfilms - Wikipedia'

What if I want to see the content of all paragraphs in body? 

Use `find_all`, as follows, for instances:

In [312]:
list_paragraphs = parser.body.find_all('p')
list_paragraphs

[p.text for p in list_paragraphs]

['De James Bondfilms van EON Producties hebben vele melodieën opgeleverd, waarvan de meeste tot de beste filmmuziek worden beschouwd. De titelsongs zijn vooraanstaand in de keuze voor onderscheidende melodieën, vaak gezongen door op dat moment populaire zangers of bands. De bekendste melodie is de zeer herkenbare James Bond Theme. Deze oorspronkelijke titelsong is afkomstig uit de eerste officiële 007-film. De jazz-achtige sound is gecomponeerd door John Barry, gebaseerd op eerder werk van Monty Norman. De eerste melodie voor de titelsong van Dr. No van Monty Norman viel niet in de smaak bij de filmproducenten, waarna John Barry in allerijl een nieuwe melodie moest componeren.\n',
 'Een opsomming van de overige titelsongs van alle James Bondfilms:\n',
 'Titelsongs van de onofficiële films:\n']

In [313]:
# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body
body

<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Lijst_van_titelsongs_uit_de_James_Bondfilms rootpage-Lijst_van_titelsongs_uit_de_James_Bondfilms skin-vector action-view skin-vector-legacy"><div class="noprint" id="mw-page-base"></div>
<div class="noprint" id="mw-head-base"></div>
<div class="mw-body" id="content" role="main">
<a id="top"></a>
<div class="mw-body-content" id="siteNotice"><!-- CentralNotice --></div>
<div class="mw-indicators mw-body-content">
</div>
<h1 class="firstHeading" id="firstHeading" lang="nl">Lijst van titelsongs uit de James Bondfilms</h1>
<div class="mw-body-content" id="bodyContent">
<div class="noprint" id="siteSub">Uit Wikipedia, de vrije encyclopedie</div>
<div id="contentSub"></div>
<div id="contentSub2"></div>
<div id="jump-to-nav"></div>
<a class="mw-jump-link" href="#mw-head">Naar navigatie springen</a>
<a class="mw-jump-link" href="#searchInput">Naar zoeken springen</a>
<div class="mw-content-ltr" dir="ltr" i

In [314]:
# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)

head = parser.head

title = head.title

title_text = title.text

title_text

De James Bondfilms van EON Producties hebben vele melodieën opgeleverd, waarvan de meeste tot de beste filmmuziek worden beschouwd. De titelsongs zijn vooraanstaand in de keuze voor onderscheidende melodieën, vaak gezongen door op dat moment populaire zangers of bands. De bekendste melodie is de zeer herkenbare James Bond Theme. Deze oorspronkelijke titelsong is afkomstig uit de eerste officiële 007-film. De jazz-achtige sound is gecomponeerd door John Barry, gebaseerd op eerder werk van Monty Norman. De eerste melodie voor de titelsong van Dr. No van Monty Norman viel niet in de smaak bij de filmproducenten, waarna John Barry in allerijl een nieuwe melodie moest componeren.



'Lijst van titelsongs uit de James Bondfilms - Wikipedia'

In [315]:
list_columns = parser.table.find_all('th')
list_columns = [item.text.strip() for item in list_columns]
list_columns

['Titelsong', 'Artiest', 'Film', 'Jaar', 'Componist']

We have now the names of our 5 columns. Following, we will built the content of our table. 

In [316]:
list_table_songs = parser.tbody.find_all('td')
list_table_songs = [item.text.strip() for item in list_table_songs]
list_table_songs

['James Bond Theme en  Kingston Calypso',
 'Orkest o.l.v. John Barry',
 'Dr. No',
 '1962',
 'Monty Norman & John Barry',
 'From Russia with Love',
 'Matt Monro',
 'From Russia with Love',
 '1963',
 'John Barry & Lionel Bart',
 'Goldfinger',
 'Shirley Bassey',
 'Goldfinger',
 '1964',
 'John Barry & Anthony Newley & Leslie Bricusse',
 'Thunderball',
 'Tom Jones',
 'Thunderball',
 '1965',
 'John Barry & Don Black',
 'You Only Live Twice',
 'Nancy Sinatra',
 'You Only Live Twice',
 '1967',
 'John Barry & Leslie Bricusse',
 "On Her Majesty's Secret Service",
 'Orkest o.l.v. John Barry',
 "On Her Majesty's Secret Service",
 '1969',
 'John Barry',
 'Diamonds Are Forever',
 'Shirley Bassey',
 'Diamonds Are Forever',
 '1971',
 'John Barry & Don Black',
 'Live and Let Die',
 'Paul McCartney & Wings',
 'Live and Let Die',
 '1973',
 'Paul McCartney & Linda McCartney',
 'The Man with the Golden Gun',
 'Lulu',
 'The Man with the Golden Gun',
 '1974',
 'John Barry & Don Black',
 'Nobody Does It Bette

`<td>` is a html element that defines a cell of a table that contains data. As we can notice above every 5 rows (cells of the table) contains respectively, `Titelsong`, `Artiest`, `Film`, `Jaar`, `Componist`. Let's use this to build our dataframe with all title song of the James Bond film series.

In [317]:
list_title_song = [list_table_songs[idx] for idx in range(len(list_table_songs)) if idx % 5 == 0 ]
list_performer = [list_table_songs[idx] for idx in range(len(list_table_songs)) if idx % 5 == 1 ]
list_film = [list_table_songs[idx] for idx in range(len(list_table_songs)) if idx % 5 == 2 ]
list_year = [list_table_songs[idx] for idx in range(len(list_table_songs)) if idx % 5 == 3 ]
list_composer = [list_table_songs[idx] for idx in range(len(list_table_songs)) if idx % 5 == 4 ]


In [318]:
list_title_song

['James Bond Theme en  Kingston Calypso',
 'From Russia with Love',
 'Goldfinger',
 'Thunderball',
 'You Only Live Twice',
 "On Her Majesty's Secret Service",
 'Diamonds Are Forever',
 'Live and Let Die',
 'The Man with the Golden Gun',
 'Nobody Does It Better',
 'Moonraker',
 'For Your Eyes Only',
 'All Time High',
 'A View to a Kill',
 'The Living Daylights',
 'Licence to Kill',
 'GoldenEye',
 'Tomorrow Never Dies',
 'The World Is Not Enough',
 'Die Another Day',
 'You Know My Name',
 'Another Way to Die',
 'Skyfall',
 "Writing's On The Wall",
 'No Time to Die']

In [319]:
dict_data = {list_columns[0]:list_title_song,list_columns[1]:list_performer,
             list_columns[2]:list_film, list_columns[3]:list_year,list_columns[4]: list_composer}

In [321]:
df_title_songs = pd.DataFrame(dict_data)
df_title_songs

Unnamed: 0,Titelsong,Artiest,Film,Jaar,Componist
0,James Bond Theme en Kingston Calypso,Orkest o.l.v. John Barry,Dr. No,1962,Monty Norman & John Barry
1,From Russia with Love,Matt Monro,From Russia with Love,1963,John Barry & Lionel Bart
2,Goldfinger,Shirley Bassey,Goldfinger,1964,John Barry & Anthony Newley & Leslie Bricusse
3,Thunderball,Tom Jones,Thunderball,1965,John Barry & Don Black
4,You Only Live Twice,Nancy Sinatra,You Only Live Twice,1967,John Barry & Leslie Bricusse
5,On Her Majesty's Secret Service,Orkest o.l.v. John Barry,On Her Majesty's Secret Service,1969,John Barry
6,Diamonds Are Forever,Shirley Bassey,Diamonds Are Forever,1971,John Barry & Don Black
7,Live and Let Die,Paul McCartney & Wings,Live and Let Die,1973,Paul McCartney & Linda McCartney
8,The Man with the Golden Gun,Lulu,The Man with the Golden Gun,1974,John Barry & Don Black
9,Nobody Does It Better,Carly Simon,The Spy Who Loved Me,1977,Marvin Hamlisch & Carole Bayer Sager


# List of James Bond Films

In [322]:
main_url = "https://en.wikipedia.org/wiki/List_of_James_Bond_films"

In [323]:
response = requests.get(main_url)
content = response.content
parser = BeautifulSoup(content, 'html.parser')



In [324]:
len(parser.find_all('tbody'))

6

There are 6 tables in the website, but we are interested in the 1st one.

In [326]:
parser.tbody

<tbody><tr>
<th rowspan="2" scope="col">Title
</th>
<th rowspan="2" scope="col">Year
</th>
<th rowspan="2" scope="col">Bond actor
</th>
<th rowspan="2" scope="col">Director
</th>
<th class="unsortable" colspan="2">Box office (millions)<sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-0"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup>
</th>
<th class="unsortable" colspan="2">Budget (millions)<sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-1"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup>
</th>
<th class="unsortable" rowspan="2" scope="col"><span class="nowrap"><abbr title="References">Ref(s)</abbr></span>
</th></tr>
<tr class="unsortable">
<th data-sort-type="number" scope="col">Actual $
</th>
<th data-sort-type="number" scope="col">Adjusted 2005 $
</th>
<th data-sort-type="number" scope="col">Actual $
</th>
<th data-sort-type="number" scope="col">Adjusted 2005 $
</th></tr>

In [327]:
parser.tbody.find_all('th', scope="col")

[<th rowspan="2" scope="col">Title
 </th>,
 <th rowspan="2" scope="col">Year
 </th>,
 <th rowspan="2" scope="col">Bond actor
 </th>,
 <th rowspan="2" scope="col">Director
 </th>,
 <th class="unsortable" rowspan="2" scope="col"><span class="nowrap"><abbr title="References">Ref(s)</abbr></span>
 </th>,
 <th data-sort-type="number" scope="col">Actual $
 </th>,
 <th data-sort-type="number" scope="col">Adjusted 2005 $
 </th>,
 <th data-sort-type="number" scope="col">Actual $
 </th>,
 <th data-sort-type="number" scope="col">Adjusted 2005 $
 </th>,
 <th colspan="4" scope="col"><b>Total of Eon-produced films</b>
 </th>]

In [328]:
list_col_01 = parser.tbody.find_all('th', scope="col")
list_col_01 = [item.text.strip() for item in list_col_01 if ('Ref' not in item.text) & ('Total' not in item.text)]
list_col_01

['Title',
 'Year',
 'Bond actor',
 'Director',
 'Actual $',
 'Adjusted 2005 $',
 'Actual $',
 'Adjusted 2005 $']

In [329]:
parser.tbody.find_all('th', class_="unsortable")

[<th class="unsortable" colspan="2">Box office (millions)<sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-0"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup>
 </th>,
 <th class="unsortable" colspan="2">Budget (millions)<sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-1"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup>
 </th>,
 <th class="unsortable" rowspan="2" scope="col"><span class="nowrap"><abbr title="References">Ref(s)</abbr></span>
 </th>]

In [330]:
list_col_02 = parser.tbody.find_all('th', class_="unsortable")
list_col_02 = [item.text.strip().replace('[14]',"") for item in list_col_02 if ('Ref' not in item.text) & ('Total' not in item.text)]
list_col_02

['Box office (millions)', 'Budget (millions)']

In [331]:
list_col_02=list_col_02*2
list_col_02.sort()
list_col_02

['Box office (millions)',
 'Box office (millions)',
 'Budget (millions)',
 'Budget (millions)']

In [349]:
columns = [list_col_01[idx] if idx in range(len(list_col_01[:4])) else list_col_02[idx-4] +' '+ list_col_01[idx] for idx in range(len(list_col_01)) ]
columns

['Title',
 'Year',
 'Bond actor',
 'Director',
 'Box office (millions) Actual $',
 'Box office (millions) Adjusted 2005 $',
 'Budget (millions) Actual $',
 'Budget (millions) Adjusted 2005 $']

## Film title

In [350]:
# List of film titles
list_films = [item.text.strip() for item in parser.tbody.find_all("th",scope="row")]

In [351]:
# how many james bond movies until November 2020
len(list_films)

25

In [228]:
df_title_songs['Film']==list_films

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
19    True
20    True
21    True
22    True
23    True
24    True
Name: Film, dtype: bool

In [229]:
df_title_songs['Titelsong'][0]

'James Bond Theme en  Kingston Calypso'

In [240]:
parser.tbody.find_all('td')

[<td>1962
 </td>,
 <td><a href="/wiki/Sean_Connery" title="Sean Connery">Sean Connery</a>
 </td>,
 <td><a href="/wiki/Terence_Young_(director)" title="Terence Young (director)">Terence Young</a>
 </td>,
 <td>59.5
 </td>,
 <td>448.8
 </td>,
 <td>1.1
 </td>,
 <td>7.0
 </td>,
 <td><sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-2"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup><sup class="reference" id="cite_ref-FOOTNOTECorkScivally2002300–303_16-0"><a href="#cite_note-FOOTNOTECorkScivally2002300–303-16">[15]</a></sup>
 </td>,
 <td>1963
 </td>,
 <td><a href="/wiki/Sean_Connery" title="Sean Connery">Sean Connery</a>
 </td>,
 <td><a href="/wiki/Terence_Young_(director)" title="Terence Young (director)">Terence Young</a>
 </td>,
 <td>78.9
 </td>,
 <td>543.8
 </td>,
 <td>2.0
 </td>,
 <td>12.6
 </td>,
 <td><sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-3"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson201042

In [254]:
list_info_films = parser.tbody.find_all('td')
list_info_films

[<td>1962
 </td>,
 <td><a href="/wiki/Sean_Connery" title="Sean Connery">Sean Connery</a>
 </td>,
 <td><a href="/wiki/Terence_Young_(director)" title="Terence Young (director)">Terence Young</a>
 </td>,
 <td>59.5
 </td>,
 <td>448.8
 </td>,
 <td>1.1
 </td>,
 <td>7.0
 </td>,
 <td><sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-2"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson2010428–429-15">[14]</a></sup><sup class="reference" id="cite_ref-FOOTNOTECorkScivally2002300–303_16-0"><a href="#cite_note-FOOTNOTECorkScivally2002300–303-16">[15]</a></sup>
 </td>,
 <td>1963
 </td>,
 <td><a href="/wiki/Sean_Connery" title="Sean Connery">Sean Connery</a>
 </td>,
 <td><a href="/wiki/Terence_Young_(director)" title="Terence Young (director)">Terence Young</a>
 </td>,
 <td>78.9
 </td>,
 <td>543.8
 </td>,
 <td>2.0
 </td>,
 <td>12.6
 </td>,
 <td><sup class="reference" id="cite_ref-FOOTNOTEBlockAutrey_Wilson2010428–429_15-3"><a href="#cite_note-FOOTNOTEBlockAutrey_Wilson201042

In [258]:
list_info_films[0]

<td>1962
</td>

In [261]:
list_info_films = [item.text.strip for item in parser.tbody.find_all('td')]
list_info_films = [item[idx] for idx in range(len(list_info_films)) if idx % 8 != 7]
list_info_films

IndexError: string index out of range

In [246]:
list_info_films = [item.text.strip() for item in parser.tbody.find_all('td')]

In [247]:
list_info_films

['1962',
 'Sean Connery',
 'Terence Young',
 '59.5',
 '448.8',
 '1.1',
 '7.0',
 '[14][15]',
 '1963',
 'Sean Connery',
 'Terence Young',
 '78.9',
 '543.8',
 '2.0',
 '12.6',
 '[14][15][16]',
 '1964',
 'Sean Connery',
 'Guy Hamilton',
 '124.9',
 '820.4',
 '3.0',
 '18.6',
 '[14][15][17]',
 '1965',
 'Sean Connery',
 'Terence Young',
 '141.2',
 '848.1',
 '6.8',
 '41.9',
 '[14][15][18]',
 '1967',
 'Sean Connery',
 'Lewis Gilbert',
 '111.6',
 '514.2',
 '10.3',
 '59.9',
 '[15][19]',
 '1969',
 'George Lazenby',
 'Peter R. Hunt',
 '64.6',
 '291.5',
 '7.0',
 '37.3',
 '[14][15]',
 '1971',
 'Sean Connery',
 'Guy Hamilton',
 '116.0',
 '442.5',
 '7.2',
 '34.7',
 '[14][15][20]',
 '1973',
 'Roger Moore',
 'Guy Hamilton',
 '126.4',
 '460.3',
 '7.0',
 '30.8',
 '[14][15]',
 '1974',
 'Roger Moore',
 'Guy Hamilton',
 '97.6',
 '334.0',
 '7.0',
 '27.7',
 '[15][21]',
 '1977',
 'Roger Moore',
 'Lewis Gilbert',
 '185.4',
 '533.0',
 '14.0',
 '45.1',
 '[14][15][22]',
 '1979',
 'Roger Moore',
 'Lewis Gilbert',
 '210

In [242]:
list_info_films = [item.text.strip() for item in parser.tbody.find_all('td') if item.index() % 8==0]
list_info_films

TypeError: unsupported operand type(s) for %: 'method' and 'int'

In [239]:
list_year_film = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 0 ]
list_actor = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 1 ]
list_director = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 2 ]
list_box_office_actual = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 3 ]
list_box_office_adj_2005 = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 4 ]
list_budget_actual = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 5 ]
list_budget_adj_2005 = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 7 == 6 ]

In [237]:
list_of_lists_films = [list_films, list_year_film, list_actor, list_director, list_box_office_actual, list_box_office_adj_2005, 
                 list_budget_actual, list_budget_adj_2005]


In [238]:
for l in list_of_lists_films:
    print(l)

['Dr. No', 'From Russia with Love', 'Goldfinger', 'Thunderball', 'You Only Live Twice', "On Her Majesty's Secret Service", 'Diamonds Are Forever', 'Live and Let Die', 'The Man with the Golden Gun', 'The Spy Who Loved Me', 'Moonraker', 'For Your Eyes Only', 'Octopussy', 'A View to a Kill', 'The Living Daylights', 'Licence to Kill', 'GoldenEye', 'Tomorrow Never Dies', 'The World Is Not Enough', 'Die Another Day', 'Casino Royale', 'Quantum of Solace', 'Skyfall', 'Spectre', 'No Time to Die']
['1962', '7.0', '2.0', '820.4', '141.2', 'Lewis Gilbert', 'George Lazenby', '1971', '34.7', '7.0', '334.0', '185.4', 'Lewis Gilbert', 'Roger Moore', '1983', '53.9', '30.0', '313.5', '156.2', 'Martin Campbell', 'Pierce Brosnan', '1999', '158.3', '142.0', '581.5', '586.1', 'Sam Mendes', 'Daniel Craig', 'Daniel Craig']
['Sean Connery', '1963', '12.6', '3.0', '848.1', '111.6', 'Peter R. Hunt', 'Sean Connery', '1973', '30.8', '7.0', '533.0', '210.3', 'John Glen', 'Roger Moore', '1985', '54.5', '40.0', '250.

In [234]:
list_budget_adj_2005

['7.0',
 '12.6',
 '18.6',
 '41.9',
 '59.9',
 '37.3',
 '34.7',
 '30.8',
 '27.7',
 '45.1',
 '91.5',
 '60.2',
 '53.9',
 '54.5',
 '68.8',
 '56.7',
 '76.9',
 '133.9',
 '158.3',
 '154.2',
 '145.3',
 '181.4',
 '127.7–170.2',
 '2021']

In [215]:
headers = ['Title','Year','Bond actor','Director','Box Office Actual $','Box Office Adjusted 2005 $','Budget Actual $',
           'Budget Adjusted 2005 $']

In [224]:
dict_films = {headers[idx]:list_of_lists_films[idx] for idx in range(len(headers))}
dict_films

{'Title': ['Dr. No',
  'From Russia with Love',
  'Goldfinger',
  'Thunderball',
  'You Only Live Twice',
  "On Her Majesty's Secret Service",
  'Diamonds Are Forever',
  'Live and Let Die',
  'The Man with the Golden Gun',
  'The Spy Who Loved Me',
  'Moonraker',
  'For Your Eyes Only',
  'Octopussy',
  'A View to a Kill',
  'The Living Daylights',
  'Licence to Kill',
  'GoldenEye',
  'Tomorrow Never Dies',
  'The World Is Not Enough',
  'Die Another Day',
  'Casino Royale',
  'Quantum of Solace',
  'Skyfall',
  'Spectre',
  'No Time to Die'],
 'Year': ['1962',
  '1963',
  '1964',
  '1965',
  '1967',
  '1969',
  '1971',
  '1973',
  '1974',
  '1977',
  '1979',
  '1981',
  '1983',
  '1985',
  '1987',
  '1989',
  '1995',
  '1997',
  '1999',
  '2002',
  '2006',
  '2008',
  '2012',
  '2015',
  'Daniel Craig'],
 'Bond actor': ['Sean Connery',
  'Sean Connery',
  'Sean Connery',
  'Sean Connery',
  'Sean Connery',
  'George Lazenby',
  'Sean Connery',
  'Roger Moore',
  'Roger Moore',
  'Ro

In [225]:
df_films = pd.DataFrame(dict_films)

ValueError: arrays must all be same length

In [None]:
dict_films = {}

for item in list_films:
    

## Lyrics

In [265]:
main_url = "https://www.stlyrics.com/b/bestofbondjamesbond.htm"

In [266]:
def retrieve_hyperlinks(main_url):
    """ 
    Extract all hyperlinks in 'main_url' and return a list with these hyperlinks 
    """
    
    # Send request and catch response: r

    r = requests.get(main_url)

    # Extracts response as html: html_doc
    html_doc = r.text

    # Create a BeautifulSoup object from the HTML: soup
    soup = BeautifulSoup(html_doc,"lxml")
    
    # Find all 'a' tags (which define hyperlinks): a_tags

    a_tags = soup.find_all('a')
    
    # Create a list with hyperlinks found

    list_links = [link.get('href') for link in a_tags]
    
    # Remove none values if there is some
    
    list_links = list(filter(None, list_links)) 
    
    return list_links

**Code also to obtain the title of the songs**

In [270]:
list_links = retrieve_hyperlinks(main_url)

In [271]:
list_links = list(set(list_links))

print('\n Number of links before filtering:', len(list_links))
list_links[:20]


 Number of links before filtering: 109


['/songs/p.html',
 '/dmca.htm',
 '/lyrics/bestofbondjamesbond/moonraker.htm',
 '/l/lovebirds.htm',
 '/lyrics/bestofbondjamesbond/fromrussiawithlove.htm',
 '/w.htm',
 '/songs/v.html',
 '/songs/h.html',
 '/o/oldguard.htm',
 '/19.htm',
 '/lyrics/bestofbondjamesbond/themanwiththegoldengun.htm',
 '/songs/m.html',
 'https://www.twitter.com/stlyricscom',
 '/songs/t.html',
 '/songs/y.html',
 '/songs/0-9.html',
 '/b/binge.htm',
 '/p/palmsprings.htm',
 '/k/kissingbooth2.htm',
 '/lyrics/bestofbondjamesbond/dieanotherday.htm']

In [272]:
list_links = [link for link in list_links if 'bestofbondjamesbond' in link]
print('\n Number of links after filtering:', len(list_links))
list_links


 Number of links after filtering: 23


['/lyrics/bestofbondjamesbond/moonraker.htm',
 '/lyrics/bestofbondjamesbond/fromrussiawithlove.htm',
 '/lyrics/bestofbondjamesbond/themanwiththegoldengun.htm',
 '/lyrics/bestofbondjamesbond/dieanotherday.htm',
 '/lyrics/bestofbondjamesbond/thelivingdaylights.htm',
 '/lyrics/bestofbondjamesbond/goldfinger.htm',
 '/lyrics/bestofbondjamesbond/goldeneye.htm',
 '/lyrics/bestofbondjamesbond/anotherwaytodie.htm',
 '/lyrics/bestofbondjamesbond/alltimehigh.htm',
 '/lyrics/bestofbondjamesbond/youknowmyname.htm',
 '/lyrics/bestofbondjamesbond/diamondsareforever.htm',
 '/lyrics/bestofbondjamesbond/nobodydoesitbetter.htm',
 '/lyrics/bestofbondjamesbond/liveandletdie.htm',
 '/lyrics/bestofbondjamesbond/foryoureyesonly.htm',
 '/lyrics/bestofbondjamesbond/wehaveallthetimeintheworld.htm',
 '/lyrics/bestofbondjamesbond/theworldisnotenough.htm',
 '/lyrics/bestofbondjamesbond/onhermajestyssecretservice.htm',
 '/lyrics/bestofbondjamesbond/jamesbondtheme.htm',
 '/lyrics/bestofbondjamesbond/licencetokill.htm

In [274]:
complete_urls = ["https://www.stlyrics.com"+link for link in list_links]
complete_urls

['https://www.stlyrics.com/lyrics/bestofbondjamesbond/moonraker.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/fromrussiawithlove.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/themanwiththegoldengun.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/dieanotherday.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/thelivingdaylights.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/goldfinger.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/goldeneye.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/anotherwaytodie.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/alltimehigh.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/youknowmyname.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/diamondsareforever.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/nobodydoesitbetter.htm',
 'https://www.stlyrics.com/lyrics/bestofbondjamesbond/liveandletdie.htm',
 'https://www.stlyri

In [None]:
def extract_lyric_from_url(url_lyric):
    """ 
    Extract lyrics after prettify beautiful soup from www.songteksten.nl 
    """
    
    
    # send a http request
    r_lyric = requests.get(url_lyric)
    
    # obtain text with html containt of the url
    html_doc_lyric = r_lyric.text
    
    # making html easier to read
    soup_lyric = BeautifulSoup(html_doc_lyric,"lxml")

    
    # prettifying it
    soup_lyric_pretty = soup_lyric.prettify()
    
    # Isolating deal that contains the lyric
    
    text = soup_lyric_pretty.split('</h1>\n')[1].split('<div class="buma-consent" role="alert">')[0]

    # Cleaning text and building a list with it
    list_lyrics = text.split('<br/>\n')
    list_lyrics = [item.replace('\n','') for item in list_lyrics]
    list_lyrics = [item.lstrip().rstrip() for item in list_lyrics]
    
    # removing empty elements from the list
    
    for item in list_lyrics:
        if str(item) == '':
            list_lyrics.remove(item)
            
    # this part was added after noticing that at least one lyric was not following the normal pattern
    
    if '<div' in list_lyrics[0]:
        list_lyrics = list_lyrics[1:]
        
        
    # Having the lyrics in string format
    
    lyrics = '. '.join(list_lyrics)
            
    
    # returning both list and string
    
    return list_lyrics, lyrics

In [276]:
url_lyric = complete_urls[0]

In [277]:
r_lyric = requests.get(url_lyric)
    
# obtain text with html containt of the url
html_doc_lyric = r_lyric.content
    
# making html easier to read
soup_lyric = BeautifulSoup(html_doc_lyric,"lxml")


In [278]:
soup_lyric

<!DOCTYPE HTML>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="#d4d4d4" name="theme-color"/>
<link href="/manifest.json" rel="manifest"/>
<link href="/images/desktop/favicon.ico" rel="shortcut icon"/>
<!--[if lt IE 9]>
        <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
        <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
        <![endif]-->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js" type="text/javascript"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width" name="viewport"/>
<meta content="https://www.stlyrics.com/lyrics/007jamesbondthebestsongs/moonraker.htm" itemprop="url"/>
<meta content="Moonraker lyrics by Shirley Bassey from Best of Bond... James Bond sou

In [282]:
lyric_list = soup_lyric.find_all('div', class_="highlight")
lyric_list

[<div class="highlight">Where are you</div>,
 <div class="highlight">Why do you hide</div>,
 <div class="highlight">Where is that moonlight trail that leads to your side</div>,
 <div class="highlight">Just like the moonraker goes</div>,
 <div class="highlight">in search of his dream of gold</div>,
 <div class="highlight">I search for love</div>,
 <div class="highlight">for someone to have and hold</div>,
 <div class="highlight">I've seen your smile</div>,
 <div class="highlight">in a thousand dreams</div>,
 <div class="highlight">felt your touch</div>,
 <div class="highlight">and it always seems</div>,
 <div class="highlight">You love me</div>,
 <div class="highlight">You love me</div>,
 <div class="highlight"></div>,
 <div class="highlight">Where are you</div>,
 <div class="highlight">When will we meet</div>,
 <div class="highlight">Take my unfinished life and make it complete</div>,
 <div class="highlight">Just like the moonraker knows</div>,
 <div class="highlight">his dream will co

In [287]:
lyric_list=[item.text.strip() for item in lyric_list ]
# Remove none values if there is some
    
lyric_list = list(filter(None, lyric_list)) 
lyric_list

['Where are you',
 'Why do you hide',
 'Where is that moonlight trail that leads to your side',
 'Just like the moonraker goes',
 'in search of his dream of gold',
 'I search for love',
 'for someone to have and hold',
 "I've seen your smile",
 'in a thousand dreams',
 'felt your touch',
 'and it always seems',
 'You love me',
 'You love me',
 'Where are you',
 'When will we meet',
 'Take my unfinished life and make it complete',
 'Just like the moonraker knows',
 'his dream will come true someday',
 'I know that you',
 'are only a kiss away',
 "I've seen your smile",
 'in a thousand dreams',
 'felt your touch',
 'and it always seems',
 'You love me',
 'You love me']

In [285]:
lyric_list[0].text

'Where are you'

In [303]:
def extract_lyric_from_url(url_lyric):
    """ 
    Extract lyrics after prettify beautiful soup from www.songteksten.nl 
    """
    
    
    # send a http request
    r_lyric = requests.get(url_lyric)
    
    # obtain text with html containt of the url
    html_doc_lyric = r_lyric.text
    
    # making html easier to read
    soup_lyric = BeautifulSoup(html_doc_lyric,"lxml")

    lyric_list = soup_lyric.find_all('div', class_="highlight")
    
    lyric_list=[item.text.strip() for item in lyric_list ]
    # Remove none values if there is some
    
    lyric_list = list(filter(None, lyric_list)) 

    return '\n'.join(lyric_list)
    
    


In [304]:
print(extract_lyric_from_url(url_lyric))

Where are you
Why do you hide
Where is that moonlight trail that leads to your side
Just like the moonraker goes
in search of his dream of gold
I search for love
for someone to have and hold
I've seen your smile
in a thousand dreams
felt your touch
and it always seems
You love me
You love me
Where are you
When will we meet
Take my unfinished life and make it complete
Just like the moonraker knows
his dream will come true someday
I know that you
are only a kiss away
I've seen your smile
in a thousand dreams
felt your touch
and it always seems
You love me
You love me
