# <span style="color:rgb(255, 0, 255)"> Lab Web Scraping multiple pages
<span style="color:rgb(255, 0, 255)"> Ainara Guerra

#### <span style="color:rgb(255, 0, 255)"> Instructions from the previous lab
Your product will take a song as an input from the user and will output another song (the recommendation). 
In most cases, the recommended song will have to be similar to the inputted song, but the CTO thinks that if the song is on the top charts at the moment, the user will enjoy a recommendation of a song that's also popular at the moment even more.

You need to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [42]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [43]:
r = requests.get('https://www.billboard.com/charts/hot-100/')
soup = BeautifulSoup(r.content, 'html.parser')
result = soup.find_all('div', class_='o-chart-results-list-row-container')

data = []
for res in result:  # One container per chart entry
    title_tag = res.find('h3')
    songName = title_tag.text.strip()
    # The artist name sits in the first <span> after the song title
    artist = title_tag.find_next('span').text.strip()
    data.append({'Song': songName, 'Artist': artist})

# I used Stack Overflow for this

df = pd.DataFrame(data) # Converting into a DataFrame
df

|  | Song | Artist |
|---|---|---|
| 0 | Last Night | Morgan Wallen |
| 1 | Flowers | Miley Cyrus |
| 2 | Fast Car | Luke Combs |
| 3 | Calm Down | Rema & Selena Gomez |
| 4 | All My Life | Lil Durk Featuring J. Cole |
| ... | ... | ... |
| 95 | Save Me | Jelly Roll With Lainey Wilson |
| 96 | Yandel 150 | Yandel & Feid |
| 97 | Beso | Rosalia & Rauw Alejandro |
| 98 | I Wrote The Book | Morgan Wallen |
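
The cell above parses the response directly. As a sketch of a slightly more defensive variant, the parsing can be moved into a function that works on any HTML string, which makes it possible to try it without hitting the site. The function name `parse_hot_100` and the `sample_html` snippet are hypothetical; the snippet is hand-written to mimic the `o-chart-results-list-row-container` structure used above and is not real Billboard markup.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_hot_100(html):
    """Parse chart rows out of Billboard-style HTML into a DataFrame.

    Hypothetical helper: it assumes the same structure as the cell above
    (a row container holding an <h3> title followed by a <span> artist).
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("div", class_="o-chart-results-list-row-container")
    records = []
    for row in rows:
        title_tag = row.find("h3")
        if title_tag is None:
            continue  # Skip containers that have no title heading
        artist_tag = title_tag.find_next("span")
        records.append({
            "Song": title_tag.get_text(strip=True),
            "Artist": artist_tag.get_text(strip=True) if artist_tag else None,
        })
    return pd.DataFrame(records, columns=["Song", "Artist"])


# Hand-written sample that mimics the assumed structure (not real page markup)
sample_html = """
<div class="o-chart-results-list-row-container">
  <h3> Last Night </h3><span> Morgan Wallen </span>
</div>
<div class="o-chart-results-list-row-container">
  <h3> Flowers </h3><span> Miley Cyrus </span>
</div>
"""
sample_df = parse_hot_100(sample_html)
print(sample_df)
```

For the live page, the HTML string could come from `requests.get(url, timeout=10)`, with `raise_for_status()` called on the response before parsing so that an error page is not silently parsed as an empty chart.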


# <span style="color:rgb(255, 0, 255)"> Instructions for this lab

**Prioritize the MVP**

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

**Expand the project**

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

**Practice web scraping**

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url = 'https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the names on the FBI's top ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
- Display the info (date, time, latitude, longitude and region name) of the 20 latest earthquakes reported by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and the number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
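
As a starting point for the first challenge, here is a minimal sketch of collecting the links on a page. The helper name `extract_links` and the `sample_html` string are hypothetical; the sample is hand-written so the function can be tried without a network call, and real Wikipedia markup will contain many more kinds of links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith("#"):
            continue  # Skip in-page anchors such as "#top"
        # Resolve relative paths like "/wiki/..." against the page URL
        links.append(urljoin(base_url, href))
    return links


# Hand-written sample (not real Wikipedia markup)
sample_html = (
    '<p><a href="/wiki/Pythonidae">Pythonidae</a> '
    '<a href="#top">top</a> '
    '<a href="https://www.python.org/">Python</a> '
    '<a name="x">no href</a></p>'
)
links = extract_links(sample_html, "https://en.wikipedia.org/wiki/Python")
print(links)
```

On the real page, `html` would be the `content` or `text` of a `requests.get(url)` response.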

In [None]:
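# First try: the result is an empty DataFrame.
# Each h3.edTag holds either the artist or the quoted title, never both,
# so split(' "') never returns two parts and nothing is appended.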

In [7]:
base_url = "https://www.npr.org/2022/12/15/1135804266/100-best-songs-2022-page-"

# Number of pages to scrape
num_pages = 5

data_1 = []

for page_num in range(1, num_pages + 1):
    url = base_url + str(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    song_elements = soup.find_all('h3', class_='edTag')

    for element in song_elements:
        song_info = element.text.strip()
        song_parts = song_info.split(' "')
        if len(song_parts) == 2:
            singer = song_parts[0]
            song = song_parts[1].replace('"', '')
            data_1.append({'Singer': singer, 'Song': song})

df_1 = pd.DataFrame(data_1)
print(df_1)

Empty DataFrame
Columns: []
Index: []
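
A short sketch of why this attempt appends nothing. The raw `h3` elements printed further down in this notebook show that the artist and the quoted title sit in separate headings, so the two strings below stand in for the text of two consecutive headings. The variable names `heading_texts` and `part_counts` are only for illustration.

```python
# Text of two consecutive h3 elements, as they appear in the raw output
heading_texts = ['Molly Nilsson', '"Pompeii" ']

for text in heading_texts:
    parts = text.strip().split(' "')
    print(repr(text), '->', parts)

# Neither heading contains the separator ' "' (space followed by a quote),
# so each split returns a single part and the len(...) == 2 check fails.
part_counts = [len(text.strip().split(' "')) for text in heading_texts]
print(part_counts)
```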


In [None]:
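# Second try: only one page is kept.
# song_elements is reassigned on every iteration, so after the loop
# it holds just the last page (40 elements).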

In [16]:
base_url = "https://www.npr.org/2022/12/15/1135804266/100-best-songs-2022-page-"

# Number of pages to scrape
num_pages = 5

data_1 = []

for page_num in range(1, num_pages + 1):
    url = base_url + str(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    song_elements = soup.find_all('h3', attrs={"class":'edTag'})

len(song_elements)

40

In [None]:
# Third try: all the pages were scraped, but the result still has to be converted into a proper DataFrame

In [18]:
import requests
from bs4 import BeautifulSoup

base_url = "https://www.npr.org/2022/12/15/1135804266/100-best-songs-2022-page-"

# Number of pages to scrape
num_pages = 5

data_1 = []

for page_num in range(1, num_pages + 1):
    url = base_url + str(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    song_elements = soup.find_all('h3', attrs={"class": 'edTag'})
    
    data_1.extend(song_elements)
data_1

[<h3 class="edTag">Molly Nilsson</h3>,
 <h3 class="edTag">"Pompeii"<em> </em></h3>,
 <h3 class="edTag">Third Coast Percussion</h3>,
 <h3 class="edTag">"Derivative"</h3>,
 <h3 class="edTag">Plains</h3>,
 <h3 class="edTag">"Abilene"</h3>,
 <h3 class="edTag">Fireboy DML &amp; Asake</h3>,
 <h3 class="edTag">"Bandana"</h3>,
 <h3 class="edTag">Lizzo</h3>,
 <h3 class="edTag">"About Damn Time"<em> </em></h3>,
 <h3 class="edTag">SZA</h3>,
 <h3 class="edTag">"Shirt"<em> </em></h3>,
 <h3 class="edTag">Zach Bryan</h3>,
 <h3 class="edTag">"Something in the Orange"</h3>,
 <h3 class="edTag">Fontaines D.C.</h3>,
 <h3 class="edTag">"Jackie Down the Line"<em> </em></h3>,
 <h3 class="edTag">Harry Styles</h3>,
 <h3 class="edTag">"As It Was"<em> </em></h3>,
 <h3 class="edTag">Stromae</h3>,
 <h3 class="edTag">"L'enfer"<em> </em></h3>,
 <h3 class="edTag">Steve Lacy</h3>,
 <h3 class="edTag">"Bad Habit"</h3>,
 <h3 class="edTag">Joyce Wrice x KAYTRANADA</h3>,
 <h3 class="edTag">"Iced Tea"<em> </em></h3>,
 <h3 c

In [40]:
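# Extract singer and song from data_1 and build a DataFrame.
# The h3 elements alternate: artist, quoted title, artist, quoted title, ...

singers = []
songs = []

# Stop at len - 1 so an unpaired trailing element cannot raise an IndexError
for i in range(0, len(data_1) - 1, 2):
    singer = data_1[i].text.strip()
    # Trim whitespace first, then the surrounding double quotes
    song = data_1[i + 1].text.strip().strip('"').strip()
    songs.append(song)
    singers.append(singer)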

df_1 = pd.DataFrame({'Song': songs, 'Artist': singers})
df_1

|  | Song | Artist |
|---|---|---|
| 0 | Pompeii | Molly Nilsson |
| 1 | Derivative | Third Coast Percussion |
| 2 | Abilene | Plains |
| 3 | Bandana | Fireboy DML & Asake |
| 4 | About Damn Time | Lizzo |
| ... | ... | ... |
| 95 | SAOKO | ROSALÍA |
| 96 | Runner | Alex G |
| 97 | El Apagón | Bad Bunny |
| 98 | ALIEN SUPERSTAR | Beyoncé |
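
The pairing above relies on the headings strictly alternating between artist and quoted title. As a hedged sketch, that assumption can be checked explicitly so that a stray heading raises an error instead of silently shifting every later pair. The helper `pair_headings` is hypothetical and works on plain strings (for example `[tag.text for tag in data_1]`), not on the tags themselves.

```python
def pair_headings(texts):
    """Pair alternating artist / quoted-title strings into (song, artist).

    Raises ValueError if the alternating pattern is broken.
    """
    cleaned = [text.strip() for text in texts]
    if len(cleaned) % 2 != 0:
        raise ValueError("expected an even number of headings")
    pairs = []
    for artist, title in zip(cleaned[::2], cleaned[1::2]):
        if not title.startswith('"'):
            raise ValueError(
                f"expected a quoted title after {artist!r}, got {title!r}"
            )
        pairs.append((title.strip('"').strip(), artist))
    return pairs


# Sample strings taken from the raw h3 output shown earlier
sample = ['Molly Nilsson', '"Pompeii" ', 'Stromae', '"L\'enfer" ']
pairs = pair_headings(sample)
print(pairs)
```

The resulting list of tuples can be passed to `pd.DataFrame(pairs, columns=['Song', 'Artist'])`.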


In [44]:
final_list = pd.concat([df, df_1], ignore_index=True)
final_list

|  | Song | Artist |
|---|---|---|
| 0 | Last Night | Morgan Wallen |
| 1 | Flowers | Miley Cyrus |
| 2 | Fast Car | Luke Combs |
| 3 | Calm Down | Rema & Selena Gomez |
| 4 | All My Life | Lil Durk Featuring J. Cole |
| ... | ... | ... |
| 195 | SAOKO | ROSALÍA |
| 196 | Runner | Alex G |
| 197 | El Apagón | Bad Bunny |
| 198 | ALIEN SUPERSTAR | Beyoncé |
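
When two lists are concatenated, the same song can appear in both. A possible follow-up, sketched here on toy data rather than on the scraped frames, is to record where each row came from and drop repeats after normalizing case and whitespace. The `Source` column and the two small frames `billboard` and `npr` are assumptions made for this illustration only.

```python
import pandas as pd

# Toy frames standing in for df and df_1 (illustrative rows only)
billboard = pd.DataFrame({
    "Song": ["Flowers", "As It Was"],
    "Artist": ["Miley Cyrus", "Harry Styles"],
})
npr = pd.DataFrame({
    "Song": ["as it was ", "Shirt"],
    "Artist": ["Harry Styles", "SZA"],
})

# Tag each row with its origin before concatenating
combined = pd.concat(
    [billboard.assign(Source="billboard"), npr.assign(Source="npr")],
    ignore_index=True,
)

# Build a normalized key so "As It Was" and "as it was " count as the same song
key = (
    combined["Song"].str.strip().str.lower()
    + "|"
    + combined["Artist"].str.strip().str.lower()
)

# Keep the first occurrence of each key
deduped = combined.loc[~key.duplicated()].reset_index(drop=True)
print(deduped)
```

Keeping the first occurrence means the earlier list wins on ties; passing `keep="last"` to `duplicated` would favor the later one instead.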
