<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#📽️Web-Scraping-Information-about-James-Bond's-Movies" data-toc-modified-id="📽️Web-Scraping-Information-about-James-Bond's-Movies-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>📽️Web Scraping Information about James Bond's Movies</a></span><ul class="toc-item"><li><span><a href="#Step-1:-Inspecting-website" data-toc-modified-id="Step-1:-Inspecting-website-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Step 1: Inspecting website</a></span></li><li><span><a href="#Step-2:-Access-Content-of-Website" data-toc-modified-id="Step-2:-Access-Content-of-Website-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Step 2: Access Content of Website</a></span><ul class="toc-item"><li><span><a href="#Extracting-Information-from-Website" data-toc-modified-id="Extracting-Information-from-Website-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Extracting Information from Website</a></span></li><li><span><a href="#Extracting-info-from-Table" data-toc-modified-id="Extracting-info-from-Table-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Extracting info from Table</a></span></li></ul></li></ul></li><li><span><a href="#🎶-Web-Scraping-Information-about-James-Bond's-Theme-Songs" data-toc-modified-id="🎶-Web-Scraping-Information-about-James-Bond's-Theme-Songs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>🎶 Web Scraping Information about James Bond's Theme Songs</a></span></li><li><span><a href="#🎶-Web-Scraping-Lyrics:-How-to-Access-Information-within-Hyperlinks" data-toc-modified-id="🎶-Web-Scraping-Lyrics:-How-to-Access-Information-within-Hyperlinks-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>🎶 Web Scraping Lyrics: How to Access Information within Hyperlinks</a></span></li></ul></div>

**TO DO**

- Review and add text in section on how to retrieve info from hyperlinks
- Add missing songs


# Introduction

As W. Edwards Deming said, "In God we trust; all others must bring data." So let's bring some data.

When data is not available through datasets or APIs, `web scraping` may be our last resort. It allows us to retrieve and parse data stored on web pages across the Internet. It not only lets us obtain data we don't have, but also gives us the opportunity to acquire additional data that might give that extra boost to our model. Therefore, obtaining data through `web scraping` is a valuable skill for any data scientist.

From a `business point of view`, web scraping helps us make informed business decisions. It provides an opportunity to:

* Get to know competitors better: their prices and services,
* Get to know customers: their behavior, their needs, and what they think of product(s)/service(s),
* Stay well informed about partners,
* Gather public opinion about a company in general, as well as about its or similar product(s)/service(s),
* Obtain contact or other information about potential clients via social media and forums, so resources can be meaningfully directed towards this group of possible customers.

and the list goes on...

Web scraping can also be very helpful for `public/governmental organizations`. It might help gather information from the websites of different cities within a region about an important subject such as health, security, or the environment. Data that is not easily collected across city agencies might be published by them online. This gives an opportunity to collect and analyze the data in order to extract insights that benefit society.

In addition, data obtained via web scraping can be used for personal purposes and for fun! For instance, it can help you find your new home, a new recipe, material for your hobby, or information about your favorite subject, artist, movie, music... again, imagination is the limit.

Once you have your data, it is time to analyze and manipulate it using tools such as `pandas` and `numpy`.

Here, to illustrate the use of web scraping, we've chosen a subject that will probably please everybody (or most of you): Movies and Music! On top of that, it is an opportunity to pay our respects to the first James Bond, [Sir Thomas Sean Connery](https://www.imdb.com/name/nm0000125/bio), who left us on October 31, 2020.

Basically the following steps are taken:

📽️ Extract information about all the movies from James Bond from a table at [List_of_James_Bond_films](https://en.wikipedia.org/wiki/List_of_James_Bond_films)

🎶 Extract information about all the James Bond theme songs from a table at [Lijst_van_titelsongs_uit_de_James_Bondfilms](https://nl.wikipedia.org/wiki/Lijst_van_titelsongs_uit_de_James_Bondfilms) 
("Yes, a Dutch site! The structure of its table was much easier to work with. So if you have the option, go for the easy one 😉")

🎶 Scrape lyrics of the theme songs that are not instrumental

To accomplish this we need a basic knowledge of `HTML`: its tree structure, and the fact that tags define the branches where the information we are looking for is located. Furthermore, we make use of two Python libraries:

* [`requests`](https://requests.readthedocs.io/en/master/) which allows us to get the webpage we want; and 
* [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) which parses the content of the webpage and allows us to extract tags from an HTML document.

So let's start!

# 📽️Web Scraping Information about James Bond's Movies

## Step 1: Inspecting website

Every time we scrape a website we need to have an idea of its structure and where to find what we need.

For this, no matter which browser we use, we can access the page's code by right-clicking and choosing to view its source code, i.e., `view page` (Firefox) or `view page source` (Chrome and Microsoft Edge). If you need details of a specific element, right-click on it and choose `inspect element` (Firefox) or `inspect` (Chrome and Microsoft Edge) instead.

Web pages use `HyperText Markup Language (HTML)` which is a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to you.

HTML consists of **tags**. Anything in between the opening and closing of a tag is the content of that tag. 

Some of the elements that are often encountered are:

`<head>` : Contains metadata useful to the Web browser that's rendering the page; it is not visible to the user.

`<body>` : Represents the content of an HTML document with which the user interacts.

`<div>`: Section of the body.

`<p>`: Used for paragraphs. 

`<a>` : Creates a hyperlink to web pages, files, email addresses, locations in the same page, or anything else a URL can address.

For more definitions of elements check these links: [dev_mozilla](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) or [w3s](https://www.w3schools.com/html/html_elements.asp)

While inspecting the website's source code you will notice that some tags contain attributes, which provide special instructions for the contents of that tag. An HTML attribute name is followed by an equals sign and then by the value being passed to that attribute within the tag.

For example:

`<div id="contentSub"></div>`
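
To see how an attribute can be read programmatically, here is a minimal sketch using `Beautiful Soup` (the library used in Step 2) on a hand-written HTML snippet. The snippet's `class` values and text are made up for illustration; only `id="contentSub"` comes from the example above.

```python
from bs4 import BeautifulSoup

# Hand-written snippet for illustration (not copied from the real page)
html = '<div id="contentSub" class="note small">From Wikipedia</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

print(div.name)      # tag name
print(div["id"])     # value of a single attribute
print(div["class"])  # class can hold several values, so it comes back as a list
print(div.attrs)     # all attributes as a dictionary
print(div.text)      # content between the opening and closing tag
```

Attributes such as `id` and `class` are what we will later pass to `find` and `find_all` to narrow down which tags we want.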

<img src="../images/webpage_code_ex01.JPG" width="800" />

## Step 2: Access Content of Website

For this we need to :

1. Access website using `requests`
2. Parse content with `Beautiful Soup` so we can extract what we need within tags

In [None]:
# importing packages

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
main_url = "https://en.wikipedia.org/wiki/List_of_James_Bond_films"

# Send request and catch response: r
response = requests.get(main_url)

# get the content of the response
content = response.content

# parse webpage
parser = BeautifulSoup(content, 'html.parser')

The parser is a `BeautifulSoup` object, which represents the document as a nested data structure.

In [None]:
parser;

We will need to perform the same process for our 2 next tasks, so let's build a function:

In [None]:
def parse_website(url):
    """ 
    Parse content of a website
    
    Args:
        url (str): url of the website whose content we want to access
        
    Return:
        parser: representation of the document as a nested data structure.
    """
    # Send request and catch response
    response = requests.get(url)

    # get the content of the response
    content = response.content

    # parse webpage
    parser = BeautifulSoup(content, "lxml")
    
    return parser  

### Extracting Information from Website

This part will depend on the structure of the website source code and of what you need as information from it.

Before going to our target (table with information about James Bond Films) let's see how we can access some text of the website.

In [None]:
main_url = "https://en.wikipedia.org/wiki/List_of_James_Bond_films"
parser = parse_website(main_url)

Now that we have the tree, let's get the branches we want. To access the branches we use tag names as attributes of `parser`. For example, to obtain the title of the webpage:

In [None]:
title = parser.title
title = title.text
title

`body` is the main branch of the HTML document, where all elements such as paragraphs and hyperlinks are located. To access paragraphs we use the tag `p`. If we use `find`, the 1st paragraph is returned; if we use `find_all`, we will have access to all paragraphs.

In [None]:
# the body element of the html document
body = parser.body
body;

In [None]:
# first paragraph
parser.body.find('p')

In [None]:
# all paragraphs

parser.body.find_all('p');

The method `find_all` returns a list-like `ResultSet`, so we can access an item using an index.

In [None]:
# find all paragraphs within the body of html
list_paragraphs = parser.body.find_all('p')
# extract the string within it
list_paragraphs = [p.text for p in list_paragraphs]
# show the first 2 paragraphs
list_paragraphs[:2]
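
The difference between `find` and `find_all` is easier to see on a tiny document. The sketch below uses a hand-written HTML string (an assumption for illustration, not the Wikipedia page) so that the results are easy to check by eye.

```python
from bs4 import BeautifulSoup

# Hand-written document for illustration
html = """
<body>
  <p></p>
  <p>First paragraph with text.</p>
  <p>Second paragraph.</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.body.find("p")        # a single Tag (or None if nothing matches)
all_p = soup.body.find_all("p")    # list-like ResultSet of Tags

texts = [p.text for p in all_p]
non_empty = [t for t in texts if t.strip()]

print(len(all_p))
print(repr(first.text))
print(non_empty)
```

Note that the first `<p>` of a page may be empty, which is why an index other than `0` can be needed to reach the first paragraph that actually has text.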

In [None]:
# text of the first non-empty paragraph
print(parser.find_all('p')[1].text)

Or if you want all the text...

In [None]:
text_films = ' '.join(list_paragraphs).strip()
# First 2000 characters
print(text_films[:2000])

### Extracting info from Table

You saw how to get some paragraphs, but what we really want, as we said at the beginning, is information about all the movies, and that is in the 1st table of the website.

The table information can be found under tag `tbody`.

In [None]:
len(parser.find_all('tbody'))

There are 6 tables in the website, but we are interested in the 1st one.

In [None]:
parser.tbody;

My goal is to build a dataframe so I'll get the header (name of the columns/features) and the data (values for each feature).

In [None]:
parser.tbody.find_all('th', scope="col")

Our result is a list so we can use a list comprehension and apply some filtering to obtain the desired result.

In [None]:
# Obtain column names within tag <th> with attribute col
list_col_01 = parser.tbody.find_all('th', scope="col")
list_col_01 = [item.text.strip() for item in list_col_01 if ('Ref' not in item.text) & ('Total' not in item.text)]
list_col_01

We need to add `Box office (millions)` and `Budget (millions)` before `Actual $` and `Adjusted 2005 $`.

In [None]:
parser.tbody.find_all('th', class_="unsortable")

In [None]:
# Obtain complement of column names at the attribute unsortable and some manipulation so we can have the correct names
list_col_02 = parser.tbody.find_all('th', class_="unsortable")
list_col_02 = [item.text.strip().replace('[14]',"") for item in list_col_02 if ('Ref' not in item.text) & ('Total' not in item.text)]
list_col_02=list_col_02*2
list_col_02.sort()
list_col_02


In [None]:
# Putting it all together: the first 4 names are kept as they are,
# the remaining ones get the prefix taken from list_col_02
list_columns = [list_col_01[idx] if idx < 4 else list_col_02[idx - 4] + ' ' + list_col_01[idx]
                for idx in range(len(list_col_01))]
list_columns

Now that we have the name of features to be used to build our dataframe, let's find the values for each feature. 

If we continue checking the content within `tbody`, we will notice that film titles are found under the tag `th` with attribute `scope="row"`, while the rest of the information is found under `td` tags.

In [None]:
# Obtain title of the movies
list_films = parser.tbody.find_all('th', scope = "row")
list_films = [film.text.strip() for film in list_films]
list_films

In [None]:
# Obtain all other information about those movies
list_info_films = [item.text.strip() for item in parser.tbody.find_all('td')]
# each row has 8 <td> cells; drop the last one of each row and keep the other 7
list_info_films = [list_info_films[idx] for idx in range(len(list_info_films)) if idx % 8 != 7]
# showing the first 10 elements of the list
list_info_films[:10]

In [None]:
# Organizing information in list_info_films by feature:
# 7 cells per film, so every 7th cell belongs to the same feature
list_year_film = list_info_films[0::7]
list_actor = list_info_films[1::7]
list_director = list_info_films[2::7]
list_box_office_actual = list_info_films[3::7]
list_box_office_adj_2005 = list_info_films[4::7]
list_budget_actual = list_info_films[5::7]
list_budget_adj_2005 = list_info_films[6::7]

In [None]:
list_of_lists_films = [list_films, list_year_film, list_actor, list_director, list_box_office_actual, list_box_office_adj_2005, 
                 list_budget_actual, list_budget_adj_2005]


In [None]:
# Build a dictionary for our dataframe
dict_films = {list_columns[idx]:list_of_lists_films[idx] for idx in range(len(list_columns))}
# showing 2 items of the dictionary
dict(list(dict_films.items())[0:2])

In [None]:
df_films = pd.DataFrame(dict_films)
df_films.head()

I'll rename the column `Title` to `Film Title` so we can use it when merging the dataframes with film and music information.

In [None]:
df_films.columns

In [None]:
df_films.rename(columns = {'Title': 'Film Title'}, inplace = True)

In [None]:
df_films.head()

# 🎶 Web Scraping Information about James Bond's Theme Songs

For this task I've chosen the Dutch Wikipedia page because the structure of its table makes it simpler to extract the information we want, and the information there is mostly in English.

Let's start by using our function to parse the content of the website.


In [None]:
# this I checked first: https://en.wikipedia.org/wiki/James_Bond_music

main_url = "https://nl.wikipedia.org/wiki/Lijst_van_titelsongs_uit_de_James_Bondfilms"

parser = parse_website(main_url)

Again, the information we are looking for is in the first table.

In [None]:
parser.find('tbody');

In [None]:
# Name of columns 
list_columns = parser.tbody.find_all('th')
list_columns = [item.text.strip() for item in list_columns]
list_columns

or in English:

In [None]:
list_columns = ['Theme Song', 'Performer', 'Film Title', 'Year', 'Composer']

This time obtaining the header of our data frame was pretty direct. Indeed, we could simply have typed the list, especially since we needed to translate it. However, it is good to show how different it was from the previous section: how you retrieve the information depends on the structure of the website.

We now have the names of our 5 columns. Next, we will build the content of our table.

In [None]:
# Extract information about James Bond's theme songs
list_table_songs = parser.tbody.find_all('td')
list_table_songs = [item.text.strip() for item in list_table_songs]
# showing the 1st 10 items of the list
list_table_songs[:10]

`<td>` is an HTML element that defines a cell of a table that contains data. As we can notice above, every 5 consecutive cells of the table contain, respectively, `Theme Song`, `Performer`, `Film Title`, `Year`, and `Composer`. Let's use this to build our data frame with all theme songs of the James Bond film series.

In [None]:
# Splitting information by feature: 5 cells per song,
# so every 5th cell belongs to the same feature
list_title_songs = list_table_songs[0::5]
list_performers = list_table_songs[1::5]
list_films = list_table_songs[2::5]
list_years = list_table_songs[3::5]
list_composers = list_table_songs[4::5]
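
The same split can also be expressed by first grouping the flat list of cells into rows. The sketch below is an alternative, not the approach used above, and it runs on a short hand-typed sample instead of the scraped list.

```python
# Hand-typed sample of flattened table cells (2 rows x 5 columns)
cells = [
    "Goldfinger", "Shirley Bassey", "Goldfinger", "1964", "John Barry",
    "Skyfall", "Adele", "Skyfall", "2012", "Adele & Paul Epworth",
]
columns = ["Theme Song", "Performer", "Film Title", "Year", "Composer"]
n_cols = len(columns)

# Group every n_cols consecutive cells into one row
rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]

# Transpose rows into columns: one list of values per column name
table = {name: list(values) for name, values in zip(columns, zip(*rows))}

print(rows[0])
print(table["Performer"])
```

A dictionary shaped like `table` can be passed directly to `pd.DataFrame`. If the number of cells is not a multiple of the number of columns, the last row comes out short, which is a quick sign that the table has merged or missing cells.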


In [None]:
list_of_lists_songs = [list_title_songs, list_performers, list_films, list_years, list_composers]


In [None]:
dict_songs = {list_columns[idx]:list_of_lists_songs[idx] for idx in range(len(list_columns))}

# showing 2 items of the dictionary
dict(list(dict_songs.items())[0:2])

In [None]:
df_songs = pd.DataFrame(dict_songs)
df_songs

Pretty good, right? As far as web scraping is concerned our job is done, but as data scientists we need to do our best to have clean data and the most complete and correct information. Garbage in, garbage out! So, there are just a few little things we need to fix.

First, the first movie of the James Bond franchise, `Dr. No`, has two themes. However, we only have information about the performer of the 1st theme. In addition, formally [Monty Norman](https://en.wikipedia.org/wiki/James_Bond_Theme) is the composer of both the James Bond Theme and Kingston Calypso.

Second, in some items we find `o.l.v.`, which stands for the Dutch `onder leiding van` and can be translated as `led by`.

Finally, we set the `Year` of the last film to 2021, as in the films table. The film was supposed to be released in 2020 but, due to COVID-19, its release was moved to 2021.

In [None]:
# use .loc[row, column] to assign in place and avoid chained assignment
df_songs.loc[0, 'Theme Song'] = "James Bond Theme / Kingston Calypso"
df_songs.loc[0, 'Composer'] = 'Monty Norman / Byron Lee and the Dragonaires'

# replace 'o.l.v.' by 'led by'
df_songs['Performer'] = df_songs['Performer'].apply(lambda x : x.replace('o.l.v.','led by'))

# correct year of last movie
df_songs.loc[24, 'Year'] = '2021'

In [None]:
df_songs

To put it all together, let's check whether the `Film Title` columns in the films and songs dataframes are equal. Remember that the 1st movie has 2 theme songs, which are combined into one entry in `df_songs`.

In [None]:
df_films['Film Title'].equals(df_songs['Film Title'])

In [None]:
df_films_songs = df_films.merge(df_songs, on = ['Film Title', 'Year'])

In [None]:
df_films_songs.head()

Now that you have your data all together, you can answer some questions. For instance:

❔ **Which actor played James Bond the most times?**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(y=df_films_songs['Bond actor'], order = df_films_songs['Bond actor'].value_counts().index)
plt.title("Actors Ordered by Number of Times They Played 'James Bond'")

In [None]:
df_films_songs['Bond actor'].value_counts()

As we can see, `Roger Moore` played agent 007 the most times, followed by `Sean Connery`. If `Daniel Craig` goes on for 2 more movies he will match Roger Moore.

❔ **What was the Box Office and Budget?**

For this we need some cleaning first.

In [None]:
df_films_songs.info()

Features `Box office (millions) Actual $`, `Box office (millions) Adjusted 2005 $`, `Budget (millions) Actual $`, and `Budget (millions) Adjusted 2005 $` are of type `object` when they should be `float`. This happened because of `TBD` (meaning To Be Determined) and the values given as intervals at indexes 22 and 23 of `Budget (millions) Actual $` and `Budget (millions) Adjusted 2005 $`.

First, let's replace `TBD` by 0.00, since the film will be released in 2021.

Second, let's replace each interval of values by its mean.

In [None]:
df_films_songs = df_films_songs.replace('TBD',0.00)

In [None]:
def calculate_mean(interval_str):
    """ Calculate the mean of a and b, where the input is a string of the form `a–b` (en dash) """
    
    # remove the footnote marker, if present
    interval_str = interval_str.replace('[b]','')
    
    a = float(interval_str.split('–')[0])
    b = float(interval_str.split('–')[1])
    
    # mean of the two bounds: (a + b) / 2
    return round((a + b) / 2, 2)
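
As a quick check of the arithmetic, the mean of an interval `a–b` is `(a + b) / 2`. The sketch below is a stand-alone variant (the name `interval_midpoint` and the sample strings are made up for illustration) that also strips any bracketed footnote marker and accepts either an en dash or a plain hyphen.

```python
import re

def interval_midpoint(interval_str):
    """Return the midpoint of a string such as '200–250[b]', rounded to 2 decimals."""
    # Drop bracketed footnote markers like [b] or [14]
    cleaned = re.sub(r"\[[^\]]*\]", "", interval_str).strip()
    # Split on an en dash or a plain hyphen
    low, high = (float(part) for part in re.split(r"[–-]", cleaned))
    return round((low + high) / 2, 2)

print(interval_midpoint("200–250[b]"))
print(interval_midpoint("100-150"))
```

This sketch assumes non-negative bounds; a leading minus sign would be mistaken for the separator.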

In [None]:
df_films_songs.columns

In [None]:
for idx in range(22,24):
    
    df_films_songs.loc[idx,'Budget (millions) Actual $'] = calculate_mean(df_films_songs.loc[idx,'Budget (millions) Actual $'])
    df_films_songs.loc[idx,'Budget (millions) Adjusted 2005 $'] = calculate_mean(df_films_songs.loc[idx,'Budget (millions) Adjusted 2005 $'])

In [None]:
df_films_songs.head()

In [None]:
for col in ['Box office (millions) Actual $','Box office (millions) Adjusted 2005 $', 'Budget (millions) Actual $',
       'Budget (millions) Adjusted 2005 $']:
    
    df_films_songs[col] = df_films_songs[col].astype('float')
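
An alternative worth considering, sketched below on made-up values, is `pd.to_numeric` with `errors='coerce'`: entries that cannot be parsed (such as `TBD`) become `NaN` instead of `0.00`, so they do not pull averages down. This is a different choice from the one made above, not a correction of it.

```python
import pandas as pd

# Made-up values for illustration only
raw = pd.Series(["59.5", "TBD", "1108.6"])

# Unparseable entries are turned into NaN instead of raising an error
as_float = pd.to_numeric(raw, errors="coerce")

print(as_float)
print(as_float.isna().sum())   # number of entries that could not be parsed
print(as_float.mean())         # NaN is skipped by default
```

Whether `0.00` or `NaN` is the better placeholder depends on what you plan to compute afterwards.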

In [None]:
df_films_songs.info()

In [None]:
fig, ax1 = plt.subplots(figsize=(15, 15))
tidy = df_films_songs.melt(id_vars='Film Title',  value_vars=['Box office (millions) Actual $',
       'Box office (millions) Adjusted 2005 $', 'Budget (millions) Actual $',
       'Budget (millions) Adjusted 2005 $']).rename(columns=str.title)
sns.barplot(y='Film Title', x='Value', hue='Variable', data=tidy, ax=ax1)
plt.title("Compare Box Office and Budget of Bond's films until 2008", size=16)
plt.legend(loc = "center right", title = "")


Checking only actual values...

In [None]:
fig, ax1 = plt.subplots(figsize=(15, 15))
tidy = df_films_songs.melt(id_vars='Film Title',  value_vars=['Box office (millions) Actual $', 'Budget (millions) Actual $']).rename(columns=str.title)
sns.barplot(y='Film Title', x='Value', hue='Variable', data=tidy, ax=ax1)
plt.title("Compare Box Office and Budget of Bond's films until 2008", size=16)
plt.legend(loc = "center right", title = "")


It seems pretty profitable, right? This video shows how the film industry makes money and why taking Box Office as a proxy for profit can be misleading: https://www.youtube.com/watch?v=jRuc7YgZ_n8&feature=emb_logo

❔ **Is there any performer who performed theme songs more than once?**

In [None]:
# count songs per performer and keep the first one with more than 1 song
performer_counts = df_films_songs['Performer'].value_counts()
performer_counts[performer_counts > 1].index[0]

**Which songs did she sing, and in which years?**

In [None]:
df_films_songs[['Theme Song','Year']][df_films_songs['Performer']=='Shirley Bassey']

# 🎶 Web Scraping Lyrics: How to Access Information within Hyperlinks

To show how to scrape webpages within a webpage let's obtain lyrics of James Bond's theme songs.

Here we will build a dataframe with song titles, performers, and lyrics. 

In [None]:
main_url = "https://www.stlyrics.com/b/bestofbondjamesbond.htm"

In [None]:
parser = parse_website(main_url)

In [None]:
def retrieve_hyperlinks(main_url):
    """ 
    Find hyperlinks in 'main_url' 
    
    Args:
        main_url (str): url of the main webpage containing hyperlinks
        
    Return:
        list of str: hyperlinks (href values) found in main_url
        
    """
    # Parse the page given as argument instead of relying on a global `parser`
    parser = parse_website(main_url)

    # Find all 'a' tags (which define hyperlinks): a_tags
    a_tags = parser.find_all('a')

    # Create a list with the hyperlinks found
    list_links = [link.get('href') for link in a_tags]

    # Remove None values (tags without href), if there are any
    list_links = list(filter(None, list_links)) 
    
    return list_links

In [None]:
list_links = retrieve_hyperlinks(main_url)

In [None]:
list_links = list(set(list_links))

print('\n Number of links before filtering:', len(list_links))
list_links[:20]

In [None]:
list_links = [link for link in list_links if 'bestofbondjamesbond' in link]
print('\n Number of links after filtering:', len(list_links))
list_links

In [None]:
complete_urls = ["https://www.stlyrics.com"+link for link in list_links]
complete_urls
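
String concatenation works when every link is a site-relative path starting with `/`. A slightly more general sketch uses `urllib.parse.urljoin` from the standard library, which resolves a link relative to the page it was found on. The `href` values below are hypothetical examples, not links copied from the site.

```python
from urllib.parse import urljoin

base_url = "https://www.stlyrics.com/b/bestofbondjamesbond.htm"

# Hypothetical href values, as they might appear in <a> tags
hrefs = [
    "/lyrics/bestofbondjamesbond/goldfinger.htm",   # site-relative path
    "goldfinger.htm",                               # relative to the current page
    "https://www.stlyrics.com/about.htm",           # already absolute
]

resolved = [urljoin(base_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

With `urljoin`, links that are already absolute are left untouched, so mixed lists of links do not end up with a doubled domain.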

In [None]:
lyric_url = complete_urls[0]

soup_lyric = parse_website(lyric_url)

lyric_list = soup_lyric.find_all('div', class_="highlight")
    
lyric_list=[item.text.strip() for item in lyric_list ]
    
# Remove empty strings, if there are any
lyric_list = list(filter(None, lyric_list)) 

print('\n'.join(lyric_list))

In [None]:
def extract_lyric_from_url(lyric_url):
    """ 
    Extract the lyrics from a song page at www.stlyrics.com
    
    Args: 
        lyric_url (str): url of the lyric webpage
        
    Return:
        str: text of the lyrics
    """
    
    # send an http request
    r_lyric = requests.get(lyric_url)
    
    # obtain the html content of the url as text
    html_doc_lyric = r_lyric.text
    
    # parse the html
    soup_lyric = BeautifulSoup(html_doc_lyric,"lxml")

    # the lyrics are inside <div class="highlight"> elements
    lyric_list = soup_lyric.find_all('div', class_="highlight")
    
    lyric_list=[item.text.strip() for item in lyric_list ]

    # Remove empty strings, if there are any
    lyric_list = list(filter(None, lyric_list)) 

    return ' '.join(lyric_list)

In [None]:
list_lyrics = []
list_links = []

for link in complete_urls:
    list_lyrics.append(extract_lyric_from_url(link))
    list_links.append(link.split('/')[-1].replace('.htm',''))

In [None]:
len(list_links)

In [None]:
df_links = pd.DataFrame({'links':list_links, 'lyrics': list_lyrics})
df_links

In [None]:
df_songs

Before merging we need to split the first row, which contains information about 2 theme songs. For this I'll append to the dataframe 2 new rows with information about each of the songs.

In [None]:
# Appending two new rows with updated information
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)

new_rows = pd.DataFrame([
    {'Theme Song': "James Bond Theme", 'Performer': 'Orkest led by John Barry',
     'Film Title': 'Dr. No', 'Year': '1962', 'Composer': 'Monty Norman'},
    {'Theme Song': "Kingston Calypso (a.k.a 'Three Blind Mice')", 'Performer': "Byron Lee and the Dragonaires",
     'Film Title': 'Dr. No', 'Year': '1962', 'Composer': 'Monty Norman'},
])
df_songs = pd.concat([df_songs, new_rows], ignore_index=True)

Then drop the old one and re-organize the dataframe.

In [None]:
# removing the incomplete info about Dr. No 
df_songs.drop_duplicates(subset=["Film Title","Performer"], keep='last', inplace=True)
# re-organize dataframe by Year
df_songs.sort_values('Year', inplace=True)
# reset index
df_songs.reset_index(drop=True, inplace=True)

In [None]:
df_lyrics = df_songs.copy()
df_lyrics

In [None]:
df_lyrics['links'] = df_lyrics['Theme Song'].apply(lambda x: x.lower().replace(' ','').replace("'",''))
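
The key built above removes spaces and apostrophes only. If titles contain other punctuation, a slightly stricter normalization may match more links. The sketch below (the helper name `title_to_key` is made up for illustration) keeps only lowercase ASCII letters and digits.

```python
import re

def title_to_key(title):
    """Lowercase a song title and drop everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

for title in ["Goldfinger", "Writing's On The Wall", "We Have All The Time in the World"]:
    print(title, "->", title_to_key(title))
```

Accented letters are dropped by this pattern as well, so it only suits titles written in plain ASCII.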

When merging we choose `how='outer'` so we can see more clearly whether any information is missing.

In [None]:

df_lyrics = df_lyrics.merge(df_links, on='links', how='outer')
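
To make unmatched rows easier to spot, `merge` also accepts `indicator=True`, which adds a `_merge` column saying whether each row came from the left table, the right table, or both. The sketch below uses two tiny made-up frames rather than `df_lyrics` and `df_links`.

```python
import pandas as pd

# Tiny made-up frames for illustration
songs = pd.DataFrame({"links": ["goldfinger", "skyfall"],
                      "Year": ["1964", "2012"]})
lyrics = pd.DataFrame({"links": ["goldfinger", "wehaveallthetimeintheworld"],
                       "lyrics": ["placeholder text", "placeholder text"]})

merged = songs.merge(lyrics, on="links", how="outer", indicator=True)
print(merged[["links", "_merge"]])

# Rows present in only one of the two tables
unmatched = merged[merged["_merge"] != "both"]
print(sorted(unmatched["links"]))
```

Rows marked `left_only` are songs without lyrics, and rows marked `right_only` are lyrics whose link did not match any song title.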

In [None]:
df_lyrics

We can observe that in column `lyrics` we have four `NaN` and one empty cell. The empty cell for the `James Bond Theme` is expected since this is an instrumental song.

The other four are lyrics that are missing. This is because the website we used is based on a collection of themes that goes until 2008.

We also notice that the last row has `NaN` for `Theme Song`, `Performer`, `Film Title`, `Year`, and `Composer`. But we know the title of the song from the link: `We have all the time in the world`.

A little googling shows that this [James Bond Theme](https://en.wikipedia.org/wiki/We_Have_All_the_Time_in_the_World) performed by Louis Armstrong was the second theme of `On Her Majesty's Secret Service (1969)` and was composed by Hal David and John Barry. In addition, it says that the other theme, `On Her Majesty's Secret Service`, is instrumental. Therefore, the lyrics listed for it on the website are a mistake. In fact, when checking the lyrics, they are from 1985: a song called `Secret` by a band called `Orchestral Manoeuvres in the Dark`.

Therefore, to make things right we:

1. Remove the lyrics from the theme song `On Her Majesty's Secret Service`
2. Add the missing information about the second (non-instrumental) theme of `On Her Majesty's Secret Service`
3. Add lyrics to:
    * Kingston Calypso a.k.a Three Blind Mice (1962)
    * Skyfall (2012)
    * Writing's On The Wall (2015)
    * No Time to Die (2021)

In [None]:
# remove lyrics of "On Her Majesty's Secret Service"
# (.loc with a boolean mask assigns in place and avoids chained assignment)

df_lyrics.loc[df_lyrics['Theme Song'] == "On Her Majesty's Secret Service", 'lyrics'] = ''

In [None]:
# Add info about theme song 'We have all the time in the world'
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)

new_row = pd.DataFrame([{'Theme Song': "We Have All The Time in the World", 
                         'Performer': "Louis Armstrong",
                         'Film Title': "On Her Majesty's Secret Service", 'Year': '1969', 
                         'Composer': 'John Barry & Hal David', 
                         'links': 'wehaveallthetimeintheworld',
                         'lyrics': df_lyrics.loc[df_lyrics['links'] == 'wehaveallthetimeintheworld', 'lyrics'].values[0]}])
df_lyrics = pd.concat([df_lyrics, new_row], ignore_index=True)

# keep only the completed row for this link
df_lyrics.drop_duplicates('links', keep='last', inplace = True)

Three out of the 4 lyrics can be found on the same website (https://www.songteksten.nl/). Let's start with `Kingston Calypso a.k.a Three Blind Mice`, which is found on a different website (https://www.flashlyrics.com/lyrics/monty-norman/kingston-calypso-75).

In [None]:
calypso_url = "https://www.flashlyrics.com/lyrics/monty-norman/kingston-calypso-75"
r_calypso = requests.get(calypso_url)

# obtain the html content of the url
html_doc_calypso = r_calypso.content

# parse the html
soup_calypso = BeautifulSoup(html_doc_calypso,"lxml")

# the lyrics are in <span> elements inside the first <div class="main-panel-content">
lyric_list = soup_calypso.find_all('div', class_="main-panel-content")[0].find_all('span')

lyric_list=[item.text.strip() for item in lyric_list ]

# Remove empty strings, if there are any
lyric_list = list(filter(None, lyric_list)) 

print('\n'.join(lyric_list))

In [None]:
songtekten_url = "https://songteksten.net/lyric/5056/94896/adele/skyfall.html"
r_songtekten = requests.get(songtekten_url)

# obtain the html content of the url
html_doc_songtekten = r_songtekten.content

# parse the html
soup_songteksten = BeautifulSoup(html_doc_songtekten,"lxml")

lyric_list = soup_songteksten.find_all('div', class_="col-sm-7 content-left")

lyric_list

As you can see, the text of the lyrics is between `line breaks`, i.e., `<br/>` tags. This [link](https://stackoverflow.com/questions/5275359/using-beautifulsoup-to-extract-text-between-line-breaks-e-g-br-tags) points to a nice solution using `childGenerator` from BeautifulSoup.

We combined this solution with some filtering in a list comprehension and voilà!

In [None]:
# keep only the text between the tags: skip the <h1>, <div> and <br/> children
lyrics_list = [str(a).strip() for a in lyric_list[0].childGenerator() if ('<h1' not in str(a)) and ('<div' not in str(a)) and ('<br/>' not in str(a))]

# Remove empty strings, if there are any
lyrics_list = list(filter(None, lyrics_list)) 

lyrics = '\n'.join(lyrics_list)
print(lyrics)
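
An alternative to checking the string form of each child is to keep only the children that are text nodes (`NavigableString`). The sketch below runs on a hand-written snippet with placeholder lines; the class names are assumptions and the real page's markup may differ.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hand-written snippet with placeholder text (not real lyrics)
html = """
<div class="content-left"><h1>Song title</h1>
First line<br/>
Second line<br/>
<div class="ad">Advertisement</div></div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="content-left")

# Direct children are either Tags (<h1>, <br/>, <div>) or text nodes
lines = [str(child).strip() for child in container.children
         if isinstance(child, NavigableString)]
lines = [line for line in lines if line]   # drop whitespace-only nodes

print("\n".join(lines))
```

Filtering by type does not depend on which tags happen to surround the text, so it keeps working if the page adds another kind of element next to the lyrics.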

In [None]:
def extract_lyrics_songtekstennl(songteksten_url):
    """ 
    Extract the lyrics from a song page at songteksten.nl
    
    Args:
        songteksten_url (str): url of the lyric webpage
        
    Return:
        str: text of the lyrics, one line of the song per line
    """
    r_songteksten = requests.get(songteksten_url)

    # obtain the html content of the url
    html_doc_songteksten = r_songteksten.content

    # parse the html
    soup_songteksten = BeautifulSoup(html_doc_songteksten,"lxml")
    
    lyric_list = soup_songteksten.find_all('div', class_="col-sm-7 content-left")

    # keep only the text between the tags: skip the <h1>, <div> and <br/> children
    lyrics_list = [str(a).strip() for a in lyric_list[0].childGenerator() if ('<h1' not in str(a)) and ('<div' not in str(a)) and ('<br/>' not in str(a))]

    # Remove empty strings, if there are any
    lyrics_list = list(filter(None, lyrics_list)) 

    lyrics = '\n'.join(lyrics_list)
    return lyrics
    

In [None]:
songteksten_url = "https://www.songteksten.nl/songteksten/362930/adele/skyfall.htm"
extract_lyrics_songtekstennl(songteksten_url)

In [None]:
urls = []

In [None]:
soup_songteksten = parse_website(songteksten_url)
lyric_list = soup_songteksten.find_all('div', class_="col-sm-7 content-left")

lyric_list

In [None]:

# drop column links

df_lyrics.drop('links', axis='columns', inplace=True)

In [None]:
df_lyrics

In [None]:
df_lyrics.columns

**TO CONTINUE**

Include information about https://en.wikipedia.org/wiki/We_Have_All_the_Time_in_the_World

1969 Bond film On Her Majesty's Secret Service,

https://www.songteksten.nl/songteksten/365987/louis-armstrong/we-have-all-the-time-in-the-world.htm

Adele (2012) : https://www.songteksten.nl/songteksten/362930/adele/skyfall.htm

Sam Smith (2015) - Spectre - https://www.songteksten.nl/artiest/131949/sam-smith.htm

Billie Eilish (2021) - No Time to Die - https://www.songteksten.nl/songteksten/1130189/billie-eilish/no-time-to-die.htm

https://www.flashlyrics.com/lyrics/monty-norman/kingston-calypso-75


Add extra songs.


In [None]:
df_lyrics['lyrics'].iloc[6]

In [None]:
# df_lyrics.dropna(inplace=True)

In [None]:
# df_lyrics