# Intro to web scraping

The first step in web scraping is to pick a website and download its HTML.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

- You can learn about HTML here: https://www.w3schools.com/html/
- You can use Code Beautify to make your HTML cleaner and more readable: https://codebeautify.org/htmlviewer

In [1]:
html_doc = """ <!DOCTYPE html><html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>"""

In [2]:
html_doc

' <!DOCTYPE html><html><head><title>The Dormouse\'s story</title></head><body><p class="title"><b>The Dormouse\'s story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>'

In [3]:
from bs4 import BeautifulSoup

#### Creating the soup

In [4]:
# parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser') 

In [5]:
soup

 <!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>

In [6]:
import pprint

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



#### accessing single elements

We can access an HTML tag by appending a dot `.` and the tag's name to `soup`. If the tag appears several times in the document, only the first instance is returned.

In [8]:
soup.title

<title>The Dormouse's story</title>

In [9]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [10]:
soup.html.body.p

<p class="title"><b>The Dormouse's story</b></p>

<b>Searching with the find() function</b>

In [11]:
soup.find("p").get_text()

"The Dormouse's story"

In [12]:
# this method only retrieves the first element of the specified tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>
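
One detail worth knowing about dot access and `find()`: when the tag does not exist, both return `None` instead of raising an error. The sketch below uses a shortened copy of the dummy document, and the `<h1>` lookup is only an illustration of a tag that is absent.

```python
from bs4 import BeautifulSoup

# Shortened copy of the dummy document used above
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Dot access and find() both return the first matching tag
first_p = soup.p
same_p = soup.find("p")

# A tag that is not in the document yields None, not an exception
missing = soup.h1

# Guard against None before calling methods on the result
heading_text = missing.get_text() if missing is not None else "(no <h1> found)"

print(first_p is same_p)
print(heading_text)
```

Checking for `None` first avoids an `AttributeError` when a page does not contain the tag you expected.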

#### finding all elements of a tag with the powerful find_all()

In [13]:
p_tags = soup.find_all("p")
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

To get the text out of the corresponding HTML, we can use the `get_text()` method.

In [14]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
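
In the output above, the names are glued to the surrounding words ("wereElsie") because the source HTML has no spaces around the `<a>` tags. As a small sketch, `get_text()` accepts a separator and a `strip` flag that can help here. The snippet uses a shortened copy of the story paragraph.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "story" paragraph from the dummy document
html = (
    '<p class="story">their names were'
    '<a href="http://example.com/elsie">Elsie</a>,'
    '<a href="http://example.com/lacie">Lacie</a> and'
    '<a href="http://example.com/tillie">Tillie</a>;'
    'and they lived at the bottom of a well.</p>'
)
p = BeautifulSoup(html, "html.parser").p

# Default: text pieces are concatenated exactly as they appear
glued = p.get_text()

# With a separator and strip=True, each piece is stripped and joined by a space
spaced = p.get_text(" ", strip=True)

print(glued)
print(spaced)
```

The separator is inserted between every text piece, so punctuation that sat right after a tag ends up with a space before it. Whether that trade-off is acceptable depends on what you do with the text next.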


## Return the 3 names of the sisters

In [15]:
a_tags = soup.find_all('a')

In [16]:
for a in a_tags:
    print(a.get_text())

Elsie
Lacie
Tillie
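
Besides the text, we often want an attribute of a tag, such as the `href` of a link. The sketch below shows two ways to read attributes. The third link has had its `href` removed on purpose (an assumption made for this example) to show the difference between them.

```python
from bs4 import BeautifulSoup

# Variation of the dummy document: the third link deliberately has no href
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# tag["href"] raises KeyError when the attribute is absent;
# tag.get("href") returns None instead
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}

# Attributes that can hold several values, like class, come back as a list
first_classes = soup.a["class"]

print(links)
print(first_classes)
```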


## Using css selectors
Another way to find content is the `select()` method, which takes a CSS selector.

Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

Everyone should reach level 6!

https://www.w3schools.com/css/css_howto.asp

In [17]:
soup.select("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [18]:
for a in soup.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [19]:
soup.select('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [20]:
soup.select("p")[0]

<p class="title"><b>The Dormouse's story</b></p>

With CSS selectors, you can search directly by CSS class!

In [21]:
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

<b>Comparing with find_all()</b>

In [22]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [23]:
soup.select("a.sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<b>You can search directly using id attributes</b>

In [24]:
soup.select("#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
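
The selector types seen so far can be put side by side. The following is a minimal sketch on a shortened copy of the dummy document. The descendant selector (`p.story a`) and the attribute selector (`a[href$="tillie"]`) are standard CSS syntax that was not used in the cells above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the dummy document used above
html_doc = """<html><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

by_tag = soup.select("a")                        # every <a> tag
by_class = soup.select(".sister")                # every tag with class="sister"
by_id = soup.select("#link2")                    # the tag with id="link2"
descendants = soup.select("p.story a")           # <a> tags inside <p class="story">
by_attribute = soup.select('a[href$="tillie"]')  # <a> whose href ends with "tillie"

# find_all() with class_ finds the same elements as the "a.sister" selector
same_as_find_all = soup.find_all("a", class_="sister") == soup.select("a.sister")

print([a.get_text() for a in descendants])
print(by_id[0].get_text(), by_attribute[0]["id"], same_as_find_all)
```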

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to a single element, while `select()` returns a list of elements, so it is common to iterate through the output of `select()`.

In [25]:
print(soup.select(".story"))

[<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>, <p class="story">...</p>]


In [26]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
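
To illustrate the point about single elements versus lists, here is a small sketch. It shows what happens when `get_text()` is called on the result of `select()` directly, and how `select_one()` can be used when only the first match is needed. The two-paragraph HTML is a simplified stand-in for the dummy document.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the two "story" paragraphs
html = '<p class="story">Once upon a time</p><p class="story">...</p>'
soup = BeautifulSoup(html, "html.parser")

stories = soup.select("p.story")

# select() returns a list-like object, which has no get_text() of its own
try:
    stories.get_text()
    failed = False
except AttributeError:
    failed = True

# Option 1: iterate over the results
texts = [p.get_text(strip=True) for p in stories]

# Option 2: ask for the first match only (returns a single tag or None)
first_story = soup.select_one("p.story")

print(failed)
print(texts)
print(first_story.get_text(strip=True))
```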




Write code to print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [27]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [28]:
soup = BeautifulSoup(geography, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [34]:
# Your code goes here
#Fun facts
fun_facts = soup.find_all("p")
fun_facts

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>,
 <p>Spain produces 43,8% of all the world's Olive Oil.</p>]

In [35]:
for p in fun_facts:
    print(p.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [37]:
#Names of all the places
all_places = soup.find_all("h2")
all_places

[<h2>London</h2>, <h2>Paris</h2>, <h2>Spain</h2>]

In [38]:
for p in all_places:
    print(p.get_text())

London
Paris
Spain


In [40]:
#The content (name and fact) of all the cities (only cities, not countries!)
soup.select(".city")

[<div class="city">
 <h2>London</h2>
 <p>London is the most popular tourist destination in the world.</p>
 </div>,
 <div class="city">
 <h2>Paris</h2>
 <p>Paris was originally a Roman City called Lutetia.</p>
 </div>]

In [55]:
for p in soup.select(".city"):
    print(p.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [57]:
#The names (not facts!) of all the cities (not countries!)
for city in soup.select("div.city h2"):
    print(city.get_text())

London
Paris
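
As a possible next step for this exercise, the name and the fact of each city can be kept together by first selecting the parent `div` and then searching inside it. This is only one way to do it. The HTML is a compact copy of the `geography` document, and the `cities` list of dictionaries is a structure chosen for this sketch.

```python
from bs4 import BeautifulSoup

# Compact copy of the geography document from the exercise
geography = """
<div class="city"><h2>London</h2><p>London is the most popular tourist destination in the world.</p></div>
<div class="city"><h2>Paris</h2><p>Paris was originally a Roman City called Lutetia.</p></div>
<div class="country"><h2>Spain</h2><p>Spain produces 43,8% of all the world's Olive Oil.</p></div>
"""

soup = BeautifulSoup(geography, "html.parser")

cities = []
for div in soup.select("div.city"):
    # select_one() searches only inside the current <div>
    cities.append({
        "name": div.select_one("h2").get_text(strip=True),
        "fact": div.select_one("p").get_text(strip=True),
    })

for city in cities:
    print(city["name"], "->", city["fact"])
```

Searching inside each parent keeps related pieces of information aligned, which becomes useful later when building a dataframe.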


## Use case: scraping Time Out's best movies of all time





In [58]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [65]:
# 2. find url and store it in a variable
url = "https://www.timeout.com/film/best-movies-of-all-time"
requests.get(url)

<Response [200]>

#### Using the requests package

In [66]:
# 3. download html with a get request 
response = requests.get(url)

In [67]:
response.status_code # 200 status code means OK!

200

In [68]:
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly to avoid a bs4 warning
soup

<!DOCTYPE html>
<html lang="en-GB"><head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/><link href="/static/images/favicon.ico" rel="icon" sizes="16x16" type="image/x-icon"/><link href="/static/images/favicon-32.png" rel="icon" sizes="32x32" type="image/png"/><link href="/static/images/favicon-48.png" rel="icon" sizes="48x48" type="image/png"/><link href="/static/images/favicon-180.png" rel="apple-touch-icon" type="image/png"/><title>100 Best Movies of All Time That You Should Watch Immediately</title><meta content="We have ranked the best movies of all time that our film editors say you need to watch. Which movie is your favourite?" name="description"/><link href="https://www.timeout.com/film/best-movies-of-all-time" rel="canonical"/><meta content="max-image-preview:large" name="robots"/><script>window.digitalData = {"pageInstanceID":"web-uk-worldwide.127741-prod","version":"1.0","timestamp":1707736828238,"page":{"pageInfo"

### HTTP Response status codes 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
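
To get a feel for status codes without making any request, Python's standard library includes `http.HTTPStatus`. The sketch below uses it to turn a numeric code into its standard phrase. The helper names `describe_status` and `is_success` are made up for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard phrase, e.g. '200 OK'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} (unknown status code)"


def is_success(code: int) -> bool:
    """Codes from 200 to 299 mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 999):
    print(describe_status(code), "| success:", is_success(code))
```

With `requests`, the same check is usually done on the response itself, for example with `response.status_code` as above or with `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.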

In [69]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

<!DOCTYPE html>
<html lang="en-GB"><head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/><link href="/static/images/favicon.ico" rel="icon" sizes="16x16" type="image/x-icon"/><link href="/static/images/favicon-32.png" rel="icon" sizes="32x32" type="image/png"/><link href="/static/images/favicon-48.png" rel="icon" sizes="48x48" type="image/png"/><link href="/static/images/favicon-180.png" rel="apple-touch-icon" type="image/png"/><title>100 Best Movies of All Time That You Should Watch Immediately</title><meta content="We have ranked the best movies of all time that our film editors say you need to watch. Which movie is your favourite?" name="description"/><link href="https://www.timeout.com/film/best-movies-of-all-time" rel="canonical"/><meta content="max-image-preview:large" name="robots"/><script>window.digitalData = {"pageInstanceID":"web-uk-worldwide.127741-prod","version":"1.0","timestamp":1707736828238,"page":{"pageInfo"

#### Building the dataframe

In [70]:
#your code here
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/>
  <link href="/static/images/favicon.ico" rel="icon" sizes="16x16" type="image/x-icon"/>
  <link href="/static/images/favicon-32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="/static/images/favicon-48.png" rel="icon" sizes="48x48" type="image/png"/>
  <link href="/static/images/favicon-180.png" rel="apple-touch-icon" type="image/png"/>
  <title>
   100 Best Movies of All Time That You Should Watch Immediately
  </title>
  <meta content="We have ranked the best movies of all time that our film editors say you need to watch. Which movie is your favourite?" name="description"/>
  <link href="https://www.timeout.com/film/best-movies-of-all-time" rel="canonical"/>
  <meta content="max-image-preview:large" name="robots"/>
  <script>
   window.digitalData = {"pageInstanceID":"web-uk-worldwide.127741-prod","version":"1.0

In [71]:
soup.select("h3")

[<h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>1.</span> 2001: A Space Odyssey (1968)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>2.</span> The Godfather (1972)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>3.</span> Citizen Kane (1941)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>4.</span> Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>5.</span> Raiders of the Lost Ark (1981)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>6.</span> La Dolce Vita (1960)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>7.</span> Seven Samurai (1954)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>8.</span> In the Mood for Love (2000)</h3>,
 <h3 class="_h3_cuogz_1" data-testid="tile-title_testID"><span>9.</span> There Will Be Blood (2007)</h3>,
 <h3 class="_h3_cuogz_1" data-t

In [72]:
top100m = []
for title in soup.select("h3"):
    top100m.append(title.get_text())

top100m


['1.\xa02001: A Space Odyssey (1968)',
 '2.\xa0The Godfather (1972)',
 '3.\xa0Citizen Kane (1941)',
 '4.\xa0Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)',
 '5.\xa0Raiders of the Lost Ark (1981)',
 '6.\xa0La Dolce Vita (1960)',
 '7.\xa0Seven Samurai (1954)',
 '8.\xa0In the Mood for Love (2000)',
 '9.\xa0There Will Be Blood (2007)',
 '10.\xa0Singin’ in the Rain (1952)',
 '11.\xa0Goodfellas (1990)',
 '12.\xa0North by Northwest (1959)',
 '13.\xa0Mulholland Drive (2001)',
 '14.\xa0Bicycle Thieves (1948)',
 '15.\xa0The Dark Knight (2008)',
 '16.\xa0City Lights (1931)',
 '17.\xa0Grand Illusion (1937)',
 '18.\xa0His Girl Friday (1940)',
 '19.\xa0The Red Shoes (1948)',
 '20.\xa0Vertigo (1958)',
 '21.\xa0Beau Travail (1999)',
 '22.\xa0The Searchers (1956)',
 '23.\xa0Persona (1966)',
 '24.\xa0Do the Right Thing (1989)',
 '25.\xa0Rashomon (1950)',
 '26.\xa0The Rules of the Game (1939)',
 '27.\xa0Jaws (1975)',
 '28.\xa0Double Indemnity (1944)',
 '29.\xa0The 400 Blows (1959)',
 '30.\x

In [73]:
# the last <h3> on the page is a promo headline, not a movie, so drop it
top100m.pop()

'Check out the best movies of all time as chosen by actors'

In [74]:
best_movies = pd.DataFrame({"movie_title":top100m})

In [75]:
best_movies

| | movie_title |
|---|---|
| 0 | 1. 2001: A Space Odyssey (1968) |
| 1 | 2. The Godfather (1972) |
| 2 | 3. Citizen Kane (1941) |
| 3 | 4. Jeanne Dielman, 23, Quai du Commerce, 1080 ... |
| 4 | 5. Raiders of the Lost Ark (1981) |
| ... | ... |
| 95 | 96. The Cabinet of Dr. Caligari (1920) |
| 96 | 97. Nashville (1975) |
| 97 | 98. Don’t Look Now (1973) |
| 98 | 99. Bonnie and Clyde (1967) |


### Scraping

In [108]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# 2. find url and store it in a variable
url = "https://gutenberg.org/ebooks/search/?sort_order=downloads"
response = requests.get(url)

In [109]:
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup

<!DOCTYPE html>

<!--

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.

--><html lang="en">
<head>
<style>
.icon   { background: transparent url(/pics/sprite.png) 0 0 no-repeat; }
.page_content a.subtle_link:link {color:currentColor; text-decoration: none;}
.page_content a.subtle_link:hover {color:#003366}
</style>
<link href="/gutenberg/pg-desktop-one.css" rel="stylesheet" type="text/css"/>
<link href="/gutenberg/new_nav.css" rel="stylesheet" type="text/css"/>
<link href="/gutenberg/style.css" rel="stylesheet" type="text/css"/>
<script>//
var canonical_url   = "http://www.gutenberg.org/ebooks/search/?sort_order=downloads";
var lang            = "en";
var msg_load_more   = "Load More Results…";
var page_mode       = "screen";
var dialog_title    = "";
var dialog_message  = "";
//</script>
<script src="/js/pg-two.js"></scr

In [110]:
#your code here
print(soup.prettify())

<!DOCTYPE html>
<!--

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.

-->
<html lang="en">
 <head>
  <style>
   .icon   { background: transparent url(/pics/sprite.png) 0 0 no-repeat; }
.page_content a.subtle_link:link {color:currentColor; text-decoration: none;}
.page_content a.subtle_link:hover {color:#003366}
  </style>
  <link href="/gutenberg/pg-desktop-one.css" rel="stylesheet" type="text/css"/>
  <link href="/gutenberg/new_nav.css" rel="stylesheet" type="text/css"/>
  <link href="/gutenberg/style.css" rel="stylesheet" type="text/css"/>
  <script>
   //
var canonical_url   = "http://www.gutenberg.org/ebooks/search/?sort_order=downloads";
var lang            = "en";
var msg_load_more   = "Load More Results…";
var page_mode       = "screen";
var dialog_title    = "";
var dialog_message  = "";
//
  </script>
  <script 

In [118]:
ebooks = []
for title in soup.select("li.booklink span.title"):
    ebooks.append(title.get_text())

ebooks

['Frankenstein; Or, The Modern Prometheus',
 'Moby Dick; Or, The Whale',
 'Pride and Prejudice',
 'Romeo and Juliet',
 'Middlemarch',
 'A Room with a View',
 'Little Women; Or, Meg, Jo, Beth, and Amy',
 'The Complete Works of William Shakespeare',
 'The Blue Castle: a novel',
 'The Enchanted April',
 'The Adventures of Ferdinand Count Fathom — Complete',
 'Cranford',
 'The Expedition of Humphry Clinker',
 'The Adventures of Roderick Random',
 'History of Tom Jones, a Foundling',
 'Twenty years after',
 'My Life — Volume 1',
 "Alice's Adventures in Wonderland",
 'The Great Gatsby',
 'The Picture of Dorian Gray',
 "A Doll's House : a play",
 'The Yellow Wallpaper',
 'The Importance of Being Earnest: A Trivial Comedy for Serious People',
 'Metamorphosis',
 'A Modest Proposal\r']

In [119]:
best_ebooks = pd.DataFrame({"ebook_title":ebooks})
best_ebooks

| | ebook_title |
|---|---|
| 0 | Frankenstein; Or, The Modern Prometheus |
| 1 | Moby Dick; Or, The Whale |
| 2 | Pride and Prejudice |
| 3 | Romeo and Juliet |
| 4 | Middlemarch |
| 5 | A Room with a View |
| 6 | Little Women; Or, Meg, Jo, Beth, and Amy |
| 7 | The Complete Works of William Shakespeare |
| 8 | The Blue Castle: a novel |
| 9 | The Enchanted April |


In [121]:
authors = []
for a in soup.select("li.booklink span.subtitle"):
    authors.append(a.get_text())

authors

['Mary Wollstonecraft Shelley',
 'Herman Melville',
 'Jane Austen',
 'William Shakespeare',
 'George Eliot',
 'E. M. Forster',
 'Louisa May Alcott',
 'William Shakespeare',
 'L. M. Montgomery',
 'Elizabeth Von Arnim',
 'T. Smollett',
 'Elizabeth Cleghorn Gaskell',
 'T. Smollett',
 'T. Smollett',
 'Henry Fielding',
 'Alexandre Dumas and Auguste Maquet',
 'Richard Wagner',
 'Lewis Carroll',
 'F. Scott Fitzgerald',
 'Oscar Wilde',
 'Henrik Ibsen',
 'Charlotte Perkins Gilman',
 'Oscar Wilde',
 'Franz Kafka',
 'Jonathan Swift']

In [122]:
books_authors = pd.DataFrame({"author":authors})
books_authors

| | author |
|---|---|
| 0 | Mary Wollstonecraft Shelley |
| 1 | Herman Melville |
| 2 | Jane Austen |
| 3 | William Shakespeare |
| 4 | George Eliot |
| 5 | E. M. Forster |
| 6 | Louisa May Alcott |
| 7 | William Shakespeare |
| 8 | L. M. Montgomery |
| 9 | Elizabeth Von Arnim |


In [123]:
online_books = pd.concat([best_ebooks, books_authors], axis=1)
online_books

| | ebook_title | author |
|---|---|---|
| 0 | Frankenstein; Or, The Modern Prometheus | Mary Wollstonecraft Shelley |
| 1 | Moby Dick; Or, The Whale | Herman Melville |
| 2 | Pride and Prejudice | Jane Austen |
| 3 | Romeo and Juliet | William Shakespeare |
| 4 | Middlemarch | George Eliot |
| 5 | A Room with a View | E. M. Forster |
| 6 | Little Women; Or, Meg, Jo, Beth, and Amy | Louisa May Alcott |
| 7 | The Complete Works of William Shakespeare | William Shakespeare |
| 8 | The Blue Castle: a novel | L. M. Montgomery |
| 9 | The Enchanted April | Elizabeth Von Arnim |


In [None]:
# Alternative: build the combined DataFrame in one step
# best_books_df = pd.DataFrame({"books": ebooks, "author": authors})
# best_books_df

In [124]:
online_books.shape

(25, 2)
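
Building two separate lists and joining them by position works when every book has both a title and an author. If one entry were missing its author, the lists would have different lengths and the rows would shift. A more defensive sketch is to loop over each `li.booklink` and read both fields from inside it. The HTML below is a hand-written imitation of the page structure (an assumption based on the selectors used above), and its third entry is made up to show the missing-author case.

```python
from bs4 import BeautifulSoup

# Hand-written imitation of the result list; the last entry is invented
# and has no subtitle on purpose
sample = """
<ul>
  <li class="booklink"><span class="title">Pride and Prejudice</span><span class="subtitle">Jane Austen</span></li>
  <li class="booklink"><span class="title">Middlemarch</span><span class="subtitle">George Eliot</span></li>
  <li class="booklink"><span class="title">An Example Without Author</span></li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")

rows = []
for item in soup.select("li.booklink"):
    title = item.select_one("span.title")
    author = item.select_one("span.subtitle")
    rows.append({
        "ebook_title": title.get_text(strip=True) if title is not None else None,
        "author": author.get_text(strip=True) if author is not None else None,
    })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which gives one row per book with the columns already aligned.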

### Billboard

In [166]:
url = "https://www.billboard.com/charts/hot-100/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")



In [167]:
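# `hiss` is not defined in any cell shown here; assuming it held the song-title headings
hiss = soup.select("h3.c-title.a-no-trucate")
hiss[0].get_text(strip=True)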

'Hiss'

In [168]:
songs = []

for s in soup.select("h3.c-title.a-no-trucate"):
    songs.append(s.get_text(strip=True))

songs

['Hiss',
 'Lovin On Me',
 'Cruel Summer',
 'Lose Control',
 'Greedy',
 'I Remember Everything',
 'Agora Hills',
 'Beautiful Things',
 'Redrum',
 'Snooze',
 'Water',
 'Stick Season',
 'Paint The Town Red',
 'Last Night',
 "Thinkin' Bout Me",
 'Facts',
 'Yes, And?',
 'Never Lose Me',
 'Selfish',
 'Fast Car',
 "Is It Over Now? (Taylor's Version) [From The Vault]",
 'La Diabla',
 'Big Foot',
 'Spin You Around (1/24)',
 'White Horse',
 'Everybody',
 'Pretty Little Poison',
 'Made For Me',
 'Rich Baby Daddy',
 'Houdini',
 'What Was I Made For?',
 'Flowers',
 'Where The Wild Things Are',
 'Igual Que Un Angel',
 'Wild Ones',
 'Feather',
 'Think U The Shit (Fart)',
 'Lil Boo Thang',
 'Fukumean',
 'FTCU',
 'Need A Favor',
 'Good Good',
 'Dance The Night',
 'The Painter',
 'First Person Shooter',
 'Vampire',
 'Surround Sound',
 'World On Fire',
 'Save Me',
 'On My Mama',
 'Truck Bed',
 'My Love Mine All Mine',
 'La Victima',
 'Exes',
 'Strangers',
 'Nee-nah',
 'Murder On The Dancefloor',
 'One Of

In [169]:
artist=[]

for a in soup.select("span.c-label.a-no-trucate"):
    artist.append(a.get_text(strip=True))

artist

['Megan Thee Stallion',
 'Jack Harlow',
 'Taylor Swift',
 'Teddy Swims',
 'Tate McRae',
 'Zach Bryan Featuring Kacey Musgraves',
 'Doja Cat',
 'Benson Boone',
 '21 Savage',
 'SZA',
 'Tyla',
 'Noah Kahan',
 'Doja Cat',
 'Morgan Wallen',
 'Morgan Wallen',
 'Tom MacDonald X Ben Shapiro',
 'Ariana Grande',
 'Flo Milli',
 'Justin Timberlake',
 'Luke Combs',
 'Taylor Swift',
 'Xavi',
 'Nicki Minaj',
 'Morgan Wallen',
 'Chris Stapleton',
 'Nicki Minaj Featuring Lil Uzi Vert',
 'Warren Zeiders',
 'Muni Long',
 'Drake Featuring Sexyy Red & SZA',
 'Dua Lipa',
 'Billie Eilish',
 'Miley Cyrus',
 'Luke Combs',
 'Kali Uchis & Peso Pluma',
 'Jessie Murph & Jelly Roll',
 'Sabrina Carpenter',
 'Ice Spice',
 'Paul Russell',
 'Gunna',
 'Nicki Minaj',
 'Jelly Roll',
 'Usher, Summer Walker & 21 Savage',
 'Dua Lipa',
 'Cody Johnson',
 'Drake Featuring J. Cole',
 'Olivia Rodrigo',
 'JID Featuring 21 Savage & Baby Tate',
 'Nate Smith',
 'Jelly Roll With Lainey Wilson',
 'Victoria Monet',
 'HARDY',
 'Mitski',


In [170]:
billboard_100 = pd.DataFrame({"song":songs, "artist":artist})
billboard_100

| | song | artist |
|---|---|---|
| 0 | Hiss | Megan Thee Stallion |
| 1 | Lovin On Me | Jack Harlow |
| 2 | Cruel Summer | Taylor Swift |
| 3 | Lose Control | Teddy Swims |
| 4 | Greedy | Tate McRae |
| ... | ... | ... |
| 95 | Wildflowers And Wild Horses | Lainey Wilson |
| 96 | Northern Attitude | Noah Kahan With Hozier |
| 97 | All I Need Is You | Chris Janson |
| 98 | My Eyes | Travis Scott |


### Cleaning the data

In [None]:
# your code here
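
One possible starting point for the cleaning step, offered as a sketch rather than the intended solution: the scraped movie titles mix the rank, a non-breaking space (`\xa0`), the title and the year in one string, and one ebook title ended with a stray `\r`. The regular expression and the helper name `split_title` below are choices made for this example. The sample strings are copied from the outputs above.

```python
import re

# A few raw strings copied from the scraped outputs above
raw_titles = [
    "1.\xa02001: A Space Odyssey (1968)",
    "2.\xa0The Godfather (1972)",
    "4.\xa0Jeanne Dielman, 23, Quai du Commerce, 1080 Bruxelles (1975)",
]

# rank, a dot, optional whitespace (\s also matches \xa0), title, "(year)"
pattern = re.compile(r"^(\d+)\.\s*(.+?)\s*\((\d{4})\)$")


def split_title(raw):
    """Split '1. Title (1968)' into rank, title and year; None if no match."""
    match = pattern.match(raw.strip())
    if match is None:
        return None
    rank, title, year = match.groups()
    return {"rank": int(rank), "title": title, "year": int(year)}


cleaned = [split_title(t) for t in raw_titles]
for row in cleaned:
    print(row)

# str.strip() also removes trailing control characters such as "\r"
print(repr("A Modest Proposal\r".strip()))
```

Strings that do not follow the "rank. title (year)" shape return `None`, so they can be inspected separately instead of silently producing wrong values.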