# Lab: not so hot songs
## Considerations

You want your dataset of not_hot_song to be:

* As heterogeneous in terms of (genre, length,...etc) as possible to create better groups of songs.
* Not too big and not too small (typically around 2-3K) songs

In a real-life scenario, you might want to have your dataset as biggest as possible and use specialized Big Data techniques like [PySpark](https://spark.apache.org/docs/latest/api/python/) to group similar songs together. However, you are going to work on your own laptop which has limited power. Therefore, you need to limit the size of your dataset of not_hot_songs otherwise the process of grouping similar songs will take forever.

## Deliverables

Your fork should contain a jupyter notebook with the code to:

* Gather the songs
* Remove songs already present in the hot_songs dataset

In [1]:
# import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
from time import sleep
from random import randint

## URLS
- We are gathering the information about the "not so hot" songs from Wikipedia.
- Wikipedia has lists of top songs for evey year.
- Get a list of all the URLs from Wikipedia with the following code:

In [2]:
pages = []

for i in range(1948, 2022, 3):

    # assemble the url:
    start_at= str(i)
    url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_" + start_at

    # download html with a get request:
    response = requests.get(url)

    # monitor the process by printing the status code
    print("Status code: " + str(response.status_code))

    # store response into "pages" list
    pages.append(response)

    # respectful nap:
    wait_time = randint(1,4)
    print("I will sleep for " + str(wait_time) + " second/s.") # print this to know if the code is progressing

    # scraping code scrape_website(url)
    sleep(wait_time)

Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 2 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 3 second/s.
Status code: 200
I will sleep for 2 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 1 second/s.
Status code: 200
I will sleep for 2 second/s.
Status code: 200
I will sleep for 4 second/s.
Status code: 200
I will sleep for 

## Extract information on titles and artists from Wikipedia

In [3]:
# create empty lists
soups = []
combined = [] # list to store tuples with title of the songs and artists

for url in pages:
    soup = BeautifulSoup(url.content, "html.parser")
    soups.append(soup)
    
    # select only the from table with ranking of songs
    for index, table in enumerate(soup.select("tbody")): 
        if ( index == 0):
            for row in table.select("tr"):
                cols = row.select("td") # Python list with 3 items
                
                #cols[1] -> title, cols[2] -> artist
                
                if ( len(cols) !=0): # table has 3 columns: 1)[0] rank, 2)[1] song, 3)[2] artist
                    song_titles = cols[1].select("a") # select what's inside the anchor ("a")
                    artists_names = cols[2].select("a")
                    
                    if (len(song_titles) > 1 and len(artists_names) == 1): # if there's more than one song
                        songs = []
                        for item in song_titles: # iterates through all the songs
                            songs.append(item.get_text().replace('"',''))
                        name = artists_names[0].get_text()
                        combined.append(list(zip(songs,[name]*len(songs)))) # for each of the songs, assigns the name of the artist

                    elif (len(song_titles) == 1 and len(artists_names) > 1): # if there's one song sang by several artists (collab)
                        artist = artists_names[0].get_text() # only take the first artist
                        song = song_titles[0].get_text().replace('"','') # get song name and remove ""
                        combined.append((song,artist))

                    elif (len(song_titles) == 0 and len(artists_names) == 1): # there's at list one song with no "a" (anchor), so len=0.
                        song = cols[1].get_text().replace('"','') # there's no "a" so we get the title name directly from the col
                        artist = artists_names[0].get_text()
                        combined.append((song,artist))

                    elif (len(song_titles) == 1 and len(artists_names) == 1): # normal case: 1 song and 1 artist
                        song = song_titles[0].get_text().replace('"','')
                        artist = artists_names[0].get_text()
                        combined.append((song,artist))
        


# Checking our output:
print(len(combined)) # output: 607

2242


In [4]:
# cast information of 'combined' list into pandas dataframe
combined_df = pd.DataFrame(combined, columns=["titles","artists"])
combined_df.tail(25) # display last 25 rows to check that info is stored correctly

Unnamed: 0,titles,artists
2217,The Bigger Picture,Lil Baby
2218,Only Human,Jonas Brothers
2219,The Woo,Pop Smoke
2220,Sum 2 Prove,Lil Baby
2221,Stuck with U,Ariana Grande
2222,Mood Swings,Pop Smoke
2223,You Should Be Sad,Halsey
2224,Dior,Pop Smoke
2225,Supalonely,Benee
2226,Even Though I'm Leaving,Luke Combs


Compare dfs and drop rows

In [20]:
# import dataframe with top 100 Billboard hot songs
data_hot = pd.read_csv('hot_songs.csv') # small dataframe (to compare with the dataframe of not-hot songs)
data_hot.head()

Unnamed: 0,titles,artists
0,Kill Bill,SZA
1,Last Night,Morgan Wallen
2,Flowers,Miley Cyrus
3,Princess Diana,Ice Spice & Nicki Minaj
4,Ella Baila Sola,Eslabon Armado X Peso Pluma


In [46]:
# create a copy of the 2not so hot" songs dataframe
data_nothot = combined_df.copy() # BIG (for reference)
display(data_nothot.head())
display(data_nothot.shape)

Unnamed: 0,titles,artists
0,Twelfth Street Rag,Pee Wee Hunt
1,Mañana (Is Soon Enough for Me),Peggy Lee
2,Now Is the Hour,Bing Crosby
3,A Tree in the Meadow,Margaret Whiting
4,"You Can't Be True, Dear",Ken Griffin


(2242, 2)

In [41]:
df_merge = pd.merge(data_nothot, data_hot, on= ['titles', 'artists'], how='outer', indicator=True)

print(df_merge.shape)
display(df_merge.head())

(2341, 3)


Unnamed: 0,titles,artists,_merge
0,Twelfth Street Rag,Pee Wee Hunt,left_only
1,Mañana (Is Soon Enough for Me),Peggy Lee,left_only
2,Now Is the Hour,Bing Crosby,left_only
3,A Tree in the Meadow,Margaret Whiting,left_only
4,"You Can't Be True, Dear",Ken Griffin,left_only


In [43]:
# check values of 'merge' column: 
# both = common rows, left_only = rows only in "not-hot-songs" dataframe, right_only =  rows only in "hot-songs" dataframe
# keep only left_only
df_merge['_merge'].value_counts()

left_only     2241
right_only      99
both             1
Name: _merge, dtype: int64

In [44]:
df_final = df_merge.loc[df_merge["_merge"] == "left_only"].drop("_merge", axis=1)
df_final.shape

(2241, 2)