# Web Scraping with Beautiful Soup - Lab

## Introduction

Now that you've read and seen some documentation on using Beautiful Soup, it's time to practice and put that knowledge to work! In this lab, you'll formalize some of our example code into functions and scrape the lyrics from an artist of your choice.

## Objectives
You will be able to:
* Scrape static webpages
* Select specific elements from the DOM

## Link Scraping

Write a function to collect the links to each of the song pages from a given artist page.

In [1]:
#imports
from bs4 import BeautifulSoup
import requests

In [2]:
# test of getting the data

url = 'https://www.azlyrics.com/u/u2band.html' #Put the URL of your AZLyrics Artist Page here!

html_page = requests.get(url) #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing

In [3]:
# inspect the data
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta content='U2 lyrics - 230 song lyrics sorted by album, including "With Or Without You", "One", "Lights Of Home".' name="description"/>
<meta content="U2, U2 lyrics, discography, albums, songs" name="keywords"/>
<meta content="noarchive" name="robots"/>
<title>U2 Lyrics</title>
<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>
<link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js">

In [4]:
def grab_song_links(artist_page_url):
    #Retrieve and parse the artist page
    html_page = requests.get(artist_page_url) #Make a get request to retrieve the page
    soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing

    #Each album heading is a <div class="album">; its songs are the <a> siblings that follow it
    albums = soup.find_all("div", class_='album')

    data = [] #Storage container for (title, page, album) tuples

    for album_n, cur_album in enumerate(albums):
        album_songs = cur_album.find_next_siblings('a') #songs after current album
        #On the last album, we won't be able to look forward
        if album_n < len(albums) - 1:
            sbna = albums[album_n + 1].find_previous_siblings('a') #songs before next album
            #album songs are those listed after the current album but before the next one!
            album_songs = [song for song in album_songs if song in sbna]
        for song in album_songs:
            data.append((song.text, song.get('href'), cur_album.text))
    return data
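
The "after the current album but before the next one" logic can also be expressed as a single pass over the siblings, remembering the most recent album heading. The following is a minimal sketch on a hypothetical, hand-written HTML snippet; the snippet and the name `group_songs_by_album` are illustrative assumptions, and the real page layout should be checked in the browser inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of an artist page: album headings are
# <div class="album"> elements and songs are the <a> siblings after them.
sample_html = """
<div>
  <div class="album">album: <b>"Boy"</b> (1980)</div>
  <a href="../lyrics/u2band/iwillfollow.html">I Will Follow</a><br/>
  <a href="../lyrics/u2band/twilight.html">Twilight</a><br/>
  <div class="album">album: <b>"October"</b> (1981)</div>
  <a href="../lyrics/u2band/gloria.html">Gloria</a><br/>
</div>
"""

def group_songs_by_album(html):
    """Return (title, href, album) tuples by walking the siblings once."""
    soup = BeautifulSoup(html, "html.parser")
    data = []
    first_album = soup.find("div", class_="album")
    if first_album is None:
        return data
    current_album = first_album.text
    for node in first_album.find_next_siblings(["div", "a"]):
        if node.name == "div" and "album" in node.get("class", []):
            current_album = node.text  # a new album heading starts here
        elif node.name == "a":
            data.append((node.text, node.get("href"), current_album))
    return data

songs = group_songs_by_album(sample_html)
```

Each link is visited once here, whereas the membership test `song in sbna` compares every candidate against a second list, which may matter for artists with long discographies.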

In [5]:
grab_song_links(url)

[('I Will Follow', '../lyrics/u2band/iwillfollow.html', 'album: "Boy" (1980)'),
 ('Twilight', '../lyrics/u2band/twilight.html', 'album: "Boy" (1980)'),
 ('An Cat Dubh', '../lyrics/u2band/ancatdubh.html', 'album: "Boy" (1980)'),
 ('Into The Heart',
  '../lyrics/u2band/intotheheart.html',
  'album: "Boy" (1980)'),
 ('Out Of Control',
  '../lyrics/u2band/outofcontrol.html',
  'album: "Boy" (1980)'),
 ('Stories For Boys',
  '../lyrics/u2band/storiesforboys.html',
  'album: "Boy" (1980)'),
 ('The Ocean', '../lyrics/u2band/theocean.html', 'album: "Boy" (1980)'),
 ('A Day Without Me',
  '../lyrics/u2band/adaywithoutme.html',
  'album: "Boy" (1980)'),
 ('Another Time, Another Place',
  '../lyrics/u2band/anothertimeanotherplace.html',
  'album: "Boy" (1980)'),
 ('The Electric Co.',
  '../lyrics/u2band/theelectricco.html',
  'album: "Boy" (1980)'),
 ('Shadows And Tall Trees',
  '../lyrics/u2band/shadowsandtalltrees.html',
  'album: "Boy" (1980)'),
 ('', None, 'album: "Boy" (1980)'),
 ('Gloria', 
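
The links in this output are relative (`../lyrics/...`), and one entry has no `href` at all. Before requesting song pages, the rows can be cleaned up with the standard library's `urljoin`. This is a small sketch; the helper name `absolute_song_links` is an assumption for illustration, and the rows are copied from the output above.

```python
from urllib.parse import urljoin

artist_url = "https://www.azlyrics.com/u/u2band.html"

# A few rows in the shape returned by grab_song_links
rows = [
    ("I Will Follow", "../lyrics/u2band/iwillfollow.html", 'album: "Boy" (1980)'),
    ("Twilight", "../lyrics/u2band/twilight.html", 'album: "Boy" (1980)'),
    ("", None, 'album: "Boy" (1980)'),  # anchor without an href
]

def absolute_song_links(rows, base_url):
    """Drop rows without an href and resolve the rest against base_url."""
    return [(title, urljoin(base_url, page), album)
            for title, page, album in rows if page]

links = absolute_song_links(rows, artist_url)
```

Resolving against the artist page URL means the `..` segment is handled by `urljoin` rather than by manual string slicing.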

## Text Scraping
Write a second function that scrapes the lyrics from each song page.

In [6]:
#Remember to open up the webpage in a browser and control-click/right-click and go to inspect!

#Example page
# url = 'https://www.azlyrics.com/lyrics/lilyallen/sheezus.html'
url = 'https://www.azlyrics.com/u/u2band/withorwithoutyou.html'

html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content="noarchive" name="robots"/>\n  <meta content="AZLyrics" name="name"/>\n  <meta content="lyrics,music,song lyrics,songs,paroles" name="keywords"/>\n  <base href="//www.azlyrics.com"/>\n  <script src="//www.azlyrics.com/external.js" type="text/javascript">\n  </script>\n  <title>\n   AZLyrics - Song Lyrics from A to Z\n  </title>\n  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css" rel="stylesheet"/>\n  <link href="//www.azlyrics.com/bsaz.css" rel="stylesheet"/>\n  <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n  <!--[if lt IE 9]>\r\n      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\r\n      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script

In [7]:
divs = soup.findAll('div')
divs

[<div class="container">
 <div class="navbar-header">
 <button class="navbar-toggle collapsed" data-target="#search-collapse" data-toggle="collapse" type="button"><span class="glyphicon glyphicon-search"></span></button>
 <button class="navbar-toggle collapsed" data-target="#artists-collapse" data-toggle="collapse" type="button"><span class="glyphicon glyphicon-th-list"></span></button>
 <a class="navbar-brand" href="//www.azlyrics.com"><img alt="AZLyrics.com" class="pull-left" src="//www.azlyrics.com/az_logo_tr.png" style="max-height:40px; margin-top:-10px;"/></a>
 </div>
 <ul class="collapse navbar-collapse nav navbar-nav" id="artists-collapse">
 <li>
 <div class="btn-group text-center" role="group">
 <a class="btn btn-menu" href="//www.azlyrics.com/a.html">A</a>
 <a class="btn btn-menu" href="//www.azlyrics.com/b.html">B</a>
 <a class="btn btn-menu" href="//www.azlyrics.com/c.html">C</a>
 <a class="btn btn-menu" href="//www.azlyrics.com/d.html">D</a>
 <a class="btn btn-menu" href="/

In [8]:
div = divs[2]
div

<div class="btn-group text-center" role="group">
<a class="btn btn-menu" href="//www.azlyrics.com/a.html">A</a>
<a class="btn btn-menu" href="//www.azlyrics.com/b.html">B</a>
<a class="btn btn-menu" href="//www.azlyrics.com/c.html">C</a>
<a class="btn btn-menu" href="//www.azlyrics.com/d.html">D</a>
<a class="btn btn-menu" href="//www.azlyrics.com/e.html">E</a>
<a class="btn btn-menu" href="//www.azlyrics.com/f.html">F</a>
<a class="btn btn-menu" href="//www.azlyrics.com/g.html">G</a>
<a class="btn btn-menu" href="//www.azlyrics.com/h.html">H</a>
<a class="btn btn-menu" href="//www.azlyrics.com/i.html">I</a>
<a class="btn btn-menu" href="//www.azlyrics.com/j.html">J</a>
<a class="btn btn-menu" href="//www.azlyrics.com/k.html">K</a>
<a class="btn btn-menu" href="//www.azlyrics.com/l.html">L</a>
<a class="btn btn-menu" href="//www.azlyrics.com/m.html">M</a>
<a class="btn btn-menu" href="//www.azlyrics.com/n.html">N</a>
<a class="btn btn-menu" href="//www.azlyrics.com/o.html">O</a>
<a cla

In [9]:
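from bs4 import Comment
for n, div in enumerate(divs):
    #div.text leaves out HTML comments, so look at the Comment nodes directly
    comments = div.find_all(string=lambda s: isinstance(s, Comment))
    if any("Usage of azlyrics.com content by any " in c for c in comments):
        print(n)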

In [10]:
main_page = soup.find('div', {"class": "container main-page"})
main_page

<div class="container main-page">
<div class="row">
<div class="col-sm-6 col-lg-4 col-lg-offset-2 text-center artist-col">
<h1>Welcome to AZLyrics!</h1><br/>
              It's a place where all searches end!<br/><br/>
              We have a large, legal, every day growing universe of lyrics where stars of all genres and ages shine.<br/><br/>
<form action="//search.azlyrics.com/search.php" class="search" method="get" role="search">
<div class="input-group">
<input class="form-control" name="q" placeholder="" type="text"/>
<span class="input-group-btn">
<button class="btn btn-primary" type="submit"><span class="glyphicon glyphicon-search"></span></button>
</span>
</div>
</form>
<p class="help-block">Enter artist name or song title</p>
</div>
<div class="col-sm-6 col-lg-4 text-center artist-col">
<div class="hidden-xs rect-ad">
<span id="cf_medrec"></span>
</div>
</div>
</div>
<div class="row">
<div class="col-xs-12 col-lg-8 col-lg-offset-2 text-center artist-col">
<h1>WHAT'S HOT?</h1>


In [11]:
main_12 = main_page.find('div', {"class" : "row"})
main_12

<div class="row">
<div class="col-sm-6 col-lg-4 col-lg-offset-2 text-center artist-col">
<h1>Welcome to AZLyrics!</h1><br/>
              It's a place where all searches end!<br/><br/>
              We have a large, legal, every day growing universe of lyrics where stars of all genres and ages shine.<br/><br/>
<form action="//search.azlyrics.com/search.php" class="search" method="get" role="search">
<div class="input-group">
<input class="form-control" name="q" placeholder="" type="text"/>
<span class="input-group-btn">
<button class="btn btn-primary" type="submit"><span class="glyphicon glyphicon-search"></span></button>
</span>
</div>
</form>
<p class="help-block">Enter artist name or song title</p>
</div>
<div class="col-sm-6 col-lg-4 text-center artist-col">
<div class="hidden-xs rect-ad">
<span id="cf_medrec"></span>
</div>
</div>
</div>

In [12]:
main_13 = main_12.find('div', class_="col-sm-6 col-lg-4 text-center artist-col")
main_13

<div class="col-sm-6 col-lg-4 text-center artist-col">
<div class="hidden-xs rect-ad">
<span id="cf_medrec"></span>
</div>
</div>

In [13]:
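#main_13 is the column div selected in the previous cell
inner_divs = main_13.findAll('div')
lyrics = inner_divs[6].text if len(inner_divs) > 6 else None #Guard against pages with fewer nested divs
lyrics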
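
Indexing into a list of divs by position is fragile, because the position shifts whenever the page layout changes. One alternative is to anchor on the "Usage of azlyrics.com content" comment searched for above. The sketch below runs on a hypothetical, hand-written snippet; it assumes the lyrics sit in the same `<div>` as that comment, which should be verified in the browser inspector. The names `MARKER` and `extract_lyrics` are illustrative.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a song page (not copied from the real site).
sample_html = """
<div class="container main-page">
  <div class="ringtone">ad</div>
  <div>
<!-- Usage of azlyrics.com content by any (rest of notice omitted) -->
First line of the song<br/>
Second line of the song
  </div>
</div>
"""

MARKER = "Usage of azlyrics.com content by any"

def extract_lyrics(soup):
    """Return the text of the <div> holding the marker comment, or None."""
    comment = soup.find(string=lambda s: isinstance(s, Comment) and MARKER in s)
    if comment is None:
        return None
    # get_text() skips the comment itself; strip blank lines and indentation
    lines = (line.strip() for line in comment.parent.get_text().splitlines())
    return "\n".join(line for line in lines if line)

lyrics = extract_lyrics(BeautifulSoup(sample_html, "html.parser"))
```

Returning `None` when the marker is missing makes it easy to spot pages, such as a redirect to the home page, where no lyrics were found.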

## Synthesizing
Create a script using your two functions above to scrape all of the song lyrics for a given artist.


In [None]:
#Use this block for your code!
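
One possible shape for such a script is sketched below. It takes the two functions as parameters so the loop can be tried offline; `scrape_artist`, `fake_links`, and `fake_lyrics` are hypothetical names, and the one-second default delay is an assumption rather than a documented requirement of the site.

```python
import time
from urllib.parse import urljoin

def scrape_artist(artist_url, get_links, get_lyrics, delay=1.0):
    """Collect lyrics for every song linked from an artist page.

    get_links(artist_url) should return (title, page, album) tuples, like
    grab_song_links; get_lyrics(song_url) is your text-scraping function.
    """
    songs = []
    for title, page, album in get_links(artist_url):
        if not page:
            continue  # skip anchors that carry no href
        song_url = urljoin(artist_url, page)  # resolve the relative link
        songs.append({"title": title, "album": album, "url": song_url,
                      "lyrics": get_lyrics(song_url)})
        time.sleep(delay)  # pause between requests to be polite to the server
    return songs

# Offline dry run with stand-in functions (made-up data, no network access)
def fake_links(url):
    return [("I Will Follow", "../lyrics/u2band/iwillfollow.html", 'album: "Boy" (1980)'),
            ("", None, 'album: "Boy" (1980)')]

def fake_lyrics(url):
    return "lyrics for " + url

result = scrape_artist("https://www.azlyrics.com/u/u2band.html",
                       fake_links, fake_lyrics, delay=0)
```

For a real run, pass `grab_song_links` and your own lyrics function in place of the stand-ins.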

## Visualizing
Generate two bar graphs to compare lyrical changes for the artist of your choice. For example, the two bar charts could compare the lyrics for two different songs or two different albums.

In [None]:
#Use this block for your code!
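
As a starting point, the comparison could be based on the most frequent words in each set of lyrics. The sketch below assumes matplotlib is installed for the plotting step; `top_words` and `plot_comparison` are hypothetical helper names, and the sample string is placeholder text, not real lyrics.

```python
import re
from collections import Counter

def top_words(lyrics, n=10):
    """Return the n most common lowercase words in a lyrics string."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(words).most_common(n)

def plot_comparison(lyrics_a, lyrics_b, labels=("Song A", "Song B"), n=10):
    """Draw one bar chart per lyrics string, side by side."""
    import matplotlib.pyplot as plt  # imported here so top_words works without it
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, lyrics, label in zip(axes, (lyrics_a, lyrics_b), labels):
        counts = top_words(lyrics, n)
        ax.bar([word for word, _ in counts], [count for _, count in counts])
        ax.set_title(label)
        ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return fig

# Placeholder text, not real lyrics
sample = "la la la hey hey you"
top = top_words(sample, 2)
```

Calling `plot_comparison(lyrics_1, lyrics_2)` with two scraped lyric strings would produce the two bar graphs; removing common stop words first may make the differences easier to see.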

## Level Up

Think about how you structured the data from your web scraper. Did you scrape the entire song lyrics verbatim? Did you simply store the words and their frequency counts, or did you do something else entirely? List out a few different options for how you could have stored this data. What are advantages and disadvantages of each? Be specific and think about what sort of analyses each representation would lend itself to.

In [None]:
#Use this block for your code!
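
To make the trade-offs concrete, here is a small sketch of three possible representations of one song. The record is made up for illustration, and these three options are examples rather than an exhaustive list.

```python
from collections import Counter

# Placeholder record; the lyrics text is made up for illustration
song = {"title": "Example Song", "album": "Example Album",
        "lyrics": "one two two\nthree three three"}

# Option 1: keep the verbatim text. Largest, but every other form can be derived from it.
verbatim = song["lyrics"]

# Option 2: keep the lines as a list, which preserves order for structure analyses
# such as line lengths or repeated lines.
lines = verbatim.splitlines()

# Option 3: keep only word counts. Compact and ready for frequency charts,
# but word order is lost.
word_counts = Counter(verbatim.split())

# Verbatim text can be reduced to counts, but counts cannot be turned back into text.
assert Counter(" ".join(lines).split()) == word_counts
```

Storing the verbatim text and deriving the other forms on demand keeps the most options open, at the cost of more storage and repeated processing.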

## Summary

Congratulations! You've now practiced your Beautiful Soup knowledge!