# Web Scraping with Beautiful Soup - Lab

## Introduction

Now that you've read and seen some docmentation regarding the use of Beautiful Soup, its time to practice and put that to work! In this lab you'll formalize some of our example code into functions and scrape the lyrics from an artist of your choice.

## Objectives
You will be able to:
* Scrape Static webpages - Is there a difference between a dynamic and static website?
* Compare APIs vs WebScraping
* Select specific elements from the DOM 

# What we learned
* Beautiful soup is indeed Beautiful - Stephan
    * because of .attrs
* Learned about try and except - Emily
* Inspecting a webpage using 'Inspect' - Me
    

## Link Scraping

Write a function to collect the links to each of the song pages from a given artist page.

In [17]:
#Starter Code
from bs4 import BeautifulSoup
import requests
import xmltodict
import json
from pprint import pprint


# url = 'https://www.azlyrics.com/t/toto.html' #Put the URL of your AZLyrics Artist Page here!

In [9]:
html_page = requests.get(url) #Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') #Pass the page contents to beautiful soup for parsing


#The example from our lecture/reading
data = [] #Create a storage container
for album_n in range(len(albums)):
    #On the last album, we won't be able to look forward
    if album_n == len(albums)-1:
        cur_album = albums[album_n]
        album_songs = cur_album.findNextSiblings('a')
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
    else:
        cur_album = albums[album_n]
        next_album = albums[album_n+1]
        saca = cur_album.findNextSiblings('a') #songs after current album
        sbna = next_album.findPreviousSiblings('a') #songs before next album
        album_songs = [song for song in saca if song in sbna] #album songs are those listed after the current album but before the next one!
        for song in album_songs:
            page = song.get('href')
            title = song.text
            album = cur_album.text
            data.append((title, page, album))
data[:2]

[('Down On Life',
  '../lyrics/elliphant/downonlife.html',
  'EP: "Elliphant" (2012)'),
 ('Tekkno Scene',
  '../lyrics/elliphant/tekknoscene.html',
  'EP: "Elliphant" (2012)')]

## Text Scraping
Write a secondary function that scrapes the lyrics for each song page.

In [18]:
#Remember to open up the webpage in a browser and control-click/right-click and go to inspect!
from bs4 import BeautifulSoup
import requests

#Example page
url = 'https://www.ebay.com/deals?'


html_page = requests.get(url) # requests.get() -> makes a requests then gets the page content of the url
soup = BeautifulSoup(html_page.content, 'html.parser')
soup.prettify()[:1000]

'<!DOCTYPE doctype html>\n<html lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width" name="viewport"/>\n  <meta content="34E98E6F27109BE1A9DCF19658EEEE33" name="msvalidate.01">\n   <meta content="6e11485a66d91eff" name="yandex-verification">\n    <link href="https://ir.ebaystatic.com" rel="preconnect"/>\n    <link href="https://i.ebayimg.com" rel="preconnect"/>\n    <meta content="acf32e2a69cbc2b0" name="y_key">\n     <title>\n      Daily Deals on eBay | Best deals and Free Shipping\n     </title>\n     <meta content="Save money on the best Deals online with eBay Deals. We update our deals daily, so check back for the best deals - Plus Free Shipping" name="description"/>\n     <meta content="8kHr3jd3Z43q1ovwo0KVgo_NZKIEMjthBxti8m8fYTg" name="google-site-verification"/>\n     <link href="https://www.ebay.com/deals" rel="canonical"/>\n     <meta content="unsafe-url" name="referrer"/>\n     <meta con

In [60]:
cols = soup.find_all("h3", class_="dne-itemtile-title ellipse-2", attrs={"title":True})

In [62]:
# go through each h3 and make that a soup object, then call the title on it -> this plan sucks and is unncessary

In [71]:
# use the .attrs method on your find_all to get the attributes
all_my_titles = [col.attrs['title'] for col in cols]
all_my_titles

['Omega XL 60 ct by Great HealthWorks: Small, Potent, Joint Pain Relief - Omega-3',
 "adidas Trefoil Oversize Sweatshirt Women's",
 'Apple iPad Air 2 - WiFi Tablet 16GB 32GB 64GB 128GB 2nd Generation',
 'Crocs Mens Walu Loafer',
 'WG305.1 WORX 8 Amp 14" Electric Chain Saw',
 'Logitech Harmony Smart Control All In One Remote with Hub & Smartphone App Black',
 "Chaps Men's Fleece Flannel 1/4 Zip Jacket",
 'Oakley Thinlink Sunglasses Black Iridium OO9316-03 63mm 9316-03',
 "adidas Cloudfoam Advantage Shoes Women's",
 'DxO One 20.2MP Digital Camera with Wi-Fi - Designed for iOS Devices',
 'Egyptian Comfort 1800 Thread Count 4 Piece Bed Sheet Set Deep Pocket',
 'For Apple iPhone XS Max/XR/XS/X/8/7 Plus 6s Tough Shockproof Armor Hybrid Case',
 'Vornado Whole Room Vortex Space Heater with Remote Control Timer, Gray TVH500',
 '18K Gold Plated Cuban/Curb Link Chain Necklace or Bracelet - Lifetime Warranty',
 'Mario Kart 8 Deluxe for Nintendo Switch - Brand New',
 'Samsonite Pivot 3 Piece Set - 

In [77]:
cols[0].span.attrs['itemprop']

'name'

In [78]:
new_urls = [col.a.attrs['href'] for col in cols]

In [79]:
new_urls

['https://www.ebay.com/itm/Omega-XL-60-ct-by-Great-HealthWorks-Small-Potent-Joint-Pain-Relief-Omega-3/292483036552?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/adidas-Trefoil-Oversize-Sweatshirt-Womens/153207790819?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/Apple-iPad-Air-2-2nd-WiFi-Cellular-Unlocked-16GB-32GB-64GB-128GB/352252833744?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/Crocs-Mens-Walu-Loafer/142861768316?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/WG305-1-WORX-8-Amp-14-Electric-Chain-Saw/252642451443?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/Logitech-Harmony-Smart-Control-All-In-One-Remote-with-Hub-Smartphone-App-Black/392028404465?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/Chaps-Mens-Fleece-Flannel-1-4-Zip-Jacket/332862495007?_trkparms=5373%3A0%7C5374%3AFeatured',
 'https://www.ebay.com/itm/Oakley-Thinlink-Sunglasses-Black-Iridium-OO9316-03-63m

In [92]:
def get_url_response(url):
    resp_page = requests.get(url) # requests.get() -> makes a requests then gets the page content of the url
    soup_ = BeautifulSoup(resp_page.content, 'html.parser')
    try:
        rating = soup_.find('a', class_="reviews-star-rating")
        rating_string = rating.attrs["title"].split(" ")[0]
        return rating_string
    except AttributeError as e:
        print("{} \n has no ratings - {}".format(url, e))
        return None

In [93]:
ratings = [get_url_response(url=url_) for url_ in new_urls]

https://www.ebay.com/itm/adidas-Trefoil-Oversize-Sweatshirt-Womens/153207790819?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratings
https://www.ebay.com/itm/Apple-iPad-Air-2-2nd-WiFi-Cellular-Unlocked-16GB-32GB-64GB-128GB/352252833744?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratings
https://www.ebay.com/itm/Crocs-Mens-Walu-Loafer/142861768316?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratings
https://www.ebay.com/itm/Chaps-Mens-Fleece-Flannel-1-4-Zip-Jacket/332862495007?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratings
https://www.ebay.com/itm/adidas-Cloudfoam-Advantage-Shoes-Womens/153157702764?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratings
https://www.ebay.com/itm/Egyptian-Comfort-1800-Thread-Count-4-Piece-Bed-Sheet-Set-Deep-Pocket/181234756175?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratings
https://www.ebay.com/itm/For-Apple-iPhone-XS-Max-XR-XS-X-8-7-Plus-6s-Tough-Shockproof-Armor-Hybrid-Case/400918113149?_trkparms=5373%3A0%7C5374%3AFeatured 
 has no ratin

https://www.ebay.com/itm/Samsung-Galaxy-Note-9-128GB-SM-N9600-FACTORY-UNLOCKED-6-4-Snapdragon-845/132742920393?_trkparms=5373%3A0%7C5374%3AFeatured%7C5079%3A6000003253 
 has no ratings
https://www.ebay.com/itm/Apple-iPhone-7-a1778-32GB-GSM-Unlocked/252816011949?_trkparms=5373%3A0%7C5374%3AFeatured%7C5079%3A6000003253 
 has no ratings
https://www.ebay.com/itm/Fuelworx-Made-in-the-USA-Stackable-Easy-Pour-Gas-Can-CARB-Compliant-2-5-Gallon/173447172748?_trkparms=5373%3A0%7C5374%3AFeatured%7C5079%3A6000000126 
 has no ratings
https://www.ebay.com/itm/MSI-GeForce-RTX-2080-GAMING-X-TRIO-Video-Card-8GB-256-Bit-GDDR6/382579978539?_trkparms=5373%3A0%7C5374%3AFeatured%7C5079%3A6000004300 
 has no ratings
https://www.ebay.com/itm/SPECIAL-PRICE-BANK-WIRE-PAYMENT-1-oz-Gold-Buffalo-BU-Random-Year-Lot-of-10/142685356389?_trkparms=5373%3A0%7C5374%3AFeatured%7C5079%3A6000000123 
 has no ratings
https://www.ebay.com/itm/2018-Mexico-5-oz-Silver-Libertad-BU-SKU-162409/142741065805?_trkparms=5373%3A0%7C5374

In [94]:
ratings

['4.7',
 None,
 None,
 None,
 '4.9',
 '4.5',
 None,
 '4.9',
 None,
 '5.0',
 None,
 None,
 '4.5',
 None,
 '4.9',
 None,
 None,
 None,
 None,
 '4.9',
 '4.7',
 None,
 '5.0',
 None,
 None,
 '4.1',
 None,
 None,
 None,
 '4.7',
 None,
 '4.0',
 '5.0',
 None,
 '4.5',
 None,
 None,
 None,
 '4.5',
 None,
 '3.3',
 '4.7',
 None,
 '4.7',
 None,
 None,
 '4.5',
 None,
 None,
 '4.8',
 '5.0',
 '4.7',
 '5.0',
 '5.0',
 None,
 '4.6',
 None,
 None,
 None,
 '5.0',
 None,
 '4.8',
 None,
 None,
 None,
 '4.9',
 None,
 '4.7',
 None,
 '4.8',
 '5.0',
 '4.5',
 '4.6',
 '5.0',
 '4.7',
 None,
 None,
 '4.9',
 None,
 '4.0',
 '4.9',
 None,
 '4.0',
 None,
 '4.4',
 None,
 None,
 '5.0',
 None,
 '5.0',
 '4.9',
 None,
 '5.0',
 None,
 '4.8',
 '4.8',
 '4.7',
 None,
 '4.8',
 '4.7',
 None,
 None,
 '4.5',
 None,
 '4.4',
 '5.0',
 None,
 '4.9',
 '4.7',
 '5.0',
 None,
 None,
 None,
 '5.0',
 '5.0',
 '4.7',
 '4.6',
 '3.8',
 None,
 None,
 None,
 None,
 '4.3',
 '3.4',
 None,
 None,
 '4.8',
 None,
 None,
 '4.6',
 None,
 None,
 None,
 '5.

## Synthesizing
Create a script using your two functions above to scrape all of the song lyrics for a given artist.


In [23]:
#Use this block for your code!
song_urls = []
for link in data:
    try:
        url_song = 'https://www.azlyrics.com'+link[1][2:]
        song_urls.append(url_song)
    except:
        print(link)

('', None, 'EP: "Elliphant" (2012)')
('', None, 'album: "A Good Idea" (2013)')
('', None, 'EP: "Look Like You Love It" (2014)')
('', None, 'EP: "One More" (2014)')


In [61]:
import xml.etree.ElementTree as ET
for song_url in song_urls:
    print(song_url)
    r = requests.get(song_url)
    song_soup = BeautifulSoup(r.text, 'html.parser')
    print(str(song_soup.find_all("meta",attrs={"name":"description"})[0]))
    print("\n\n")

https://www.azlyrics.com/lyrics/elliphant/downonlife.html
<meta content='Lyrics to "Down On Life" song by Elliphant: We are waking up in a pile of shit The whole bay is full of it And eggs keep growing out of our ears...' name="description"/>



https://www.azlyrics.com/lyrics/elliphant/tekknoscene.html
<meta content='Lyrics to "Tekkno Scene" song by Elliphant: Said this could be a Color crusher Color rusher Touch this tune see See me sala Bim bim be Flush thi...' name="description"/>



https://www.azlyrics.com/lyrics/elliphant/makeitjuicy.html
<meta content='Lyrics to "Make It Juicy" song by Elliphant: Come here Lucifer the sober soul me offer Run down pop gonâ all cracky color Come, come, come here c...' name="description"/>



https://www.azlyrics.com/lyrics/elliphant/inthejungle.html
<meta content='Lyrics to "In The Jungle" song by Elliphant: Sick a in the jungle back to instinct back to Basic a runk a back to basic instinct cashing life in...' name="description"/>



https://ww

<meta content="Lyrics to &quot;Save The Grey&quot; song by Elliphant: Man is my dirty and each made clean, STG Look into the eyes of them we pee sincere, STG Mama's not f..." name="description"/>



https://www.azlyrics.com/lyrics/elliphant/youregone.html
<meta content="Lyrics to &quot;You're Gone&quot; song by Elliphant: Who's gonna make sure mind is not dirty then Who's gonna help me get my shit back when I'm shit face..." name="description"/>



https://www.azlyrics.com/lyrics/elliphant/onemore.html
<meta content="Lyrics to &quot;One More&quot; song by Elliphant: Come on na sugar, come I really don't wanna go home Stay with me, be a friend These streets so cold..." name="description"/>



https://www.azlyrics.com/lyrics/elliphant/stepdown.html
<meta content="Lyrics to &quot;Step Down&quot; song by Elliphant: I'm home, home alone I call, call again you not answering And I know You couldn't, you wouldn't dare..." name="description"/>



https://www.azlyrics.com/lyrics/elliphant/everyb

## Visualizing
Generate two bar graphs to compare lyrical changes for the artist of your chose. For example, the two bar charts could compare the lyrics for two different songs or two different albums.

In [None]:
#Use this block for your code!

## Level - Up

Think about how you structured the data from your web scraper. Did you scrape the entire song lyrics verbatim? Did you simply store the words and their frequency counts, or did you do something else entirely? List out a few different options for how you could have stored this data. What are advantages and disadvantages of each? Be specific and think about what sort of analyses each representation would lend itself to.

In [None]:
#Use this block for your code!

## Summary

Congratulations! You've now practiced your Beautiful Soup knowledge!