![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Not hot songs

## Introduction

Now that you have scraped the [Billboard](https://www.billboard.com/charts/hot-100/) website to create a *hot_songs* dataset, it's time to prepare a new dataset of *not_hot_songs*. This dataset can contain songs of your choice, songs collected from the web, or any combination of the two. Some possible sources of songs are:

* [Wikipedia](https://en.wikipedia.org/wiki/Lists_of_songs)
* [Subset of the Million Song Dataset](http://millionsongdataset.com/pages/getting-dataset/#subset) *Note:* this dataset takes several GB of disk space!
* [Kaggle](https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify)
* Your favourite songs included in the Trello board as long as they are not hot songs :)

## Considerations

You want your dataset of not_hot_songs to be:

* As heterogeneous as possible (in genre, length, etc.), so that you can create better groups of songs.
* Not too big and not too small (typically around 2-3K songs).

In a real-life scenario, you might want your dataset to be as big as possible and use specialized Big Data tools like [PySpark](https://spark.apache.org/docs/latest/api/python/) to group similar songs together. However, you are going to work on your own laptop, which has limited power. Therefore, you need to limit the size of your not_hot_songs dataset; otherwise, the process of grouping similar songs will take forever.
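
One way to keep the dataset within that size range is to draw a random sample from a larger collection. The following is a minimal sketch, assuming the candidate songs are already loaded in a pandas DataFrame; the function name `limit_dataset_size`, the 3,000-row cap, the seed, and the toy data are illustrative assumptions, not part of the lab.

```python
import pandas as pd


def limit_dataset_size(df: pd.DataFrame, max_rows: int = 3000, seed: int = 42) -> pd.DataFrame:
    """Return at most `max_rows` randomly chosen rows of `df`."""
    if len(df) <= max_rows:
        # Already small enough: return an independent copy, unchanged.
        return df.copy()
    # Sample without replacement; a fixed seed makes the result reproducible.
    return df.sample(n=max_rows, random_state=seed).reset_index(drop=True)


# Toy data standing in for a large collection of candidate songs (hypothetical).
candidates = pd.DataFrame({
    "song": [f"song_{i}" for i in range(10_000)],
    "artist": [f"artist_{i % 500}" for i in range(10_000)],
})

sampled = limit_dataset_size(candidates)
print(len(sampled))
```

Random sampling keeps roughly the same mix of songs as the full collection, but it does not by itself make the dataset more heterogeneous.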

## Deliverables

Your fork should contain a Jupyter notebook with the code to:

* Gather the songs
* Remove songs already present in the hot_songs dataset

In [1]:
import pandas as pd

# Read the not-hot songs CSV file into a DataFrame
not_hot_df = pd.read_csv('not_hot.csv')

# Display the first row
not_hot_df.head(1)



| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48633449 | Måneskin | 3377762 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.8 | -4.808 | 0.0504 | 0.127 | 0.359 | 134.002 | 211560 | 0.589 | B |


In [2]:
import pandas as pd

# Load the hot songs CSV file into a DataFrame
hot_df = pd.read_csv('hot_df.csv')

# Display the first row
hot_df.head(1)

| | Song | Artist |
| --- | --- | --- |
| 0 | Vampire | Olivia Rodrigo |


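In [3]:
# Keep only the columns needed to compare against the hot songs
not_hot_df_new = not_hot_df[['Artist', 'Song Name']]

In [4]:
# Rename the columns so that both datasets share the same names
not_hot_df_cleaned = not_hot_df_new.rename(columns={'Artist': 'artist', 'Song Name': 'song'})

In [5]:
hot_df_cleaned = hot_df.rename(columns={'Song': 'song', 'Artist': 'artist'})

In [6]:
# Reorder the columns to match hot_df_cleaned
not_hot_df_cleaned = not_hot_df_cleaned[['song', 'artist']]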

In [7]:
not_hot_df_cleaned.head(2)

| | song | artist |
| --- | --- | --- |
| 0 | Beggin' | Måneskin |
| 1 | STAY (with Justin Bieber) | The Kid LAROI |


In [8]:
hot_df_cleaned.head(2)

| | song | artist |
| --- | --- | --- |
| 0 | Vampire | Olivia Rodrigo |
| 1 | Last Night | Morgan Wallen |


In [9]:
# Stack both datasets so that a song present in both would appear twice
concatenated_df = pd.concat([hot_df_cleaned, not_hot_df_cleaned])

In [10]:
# Count the distinct values in each column
unique_counts = concatenated_df.nunique()
unique_counts

song      1655
artist     772
dtype: int64

In [11]:
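# Flag rows whose (song, artist) pair repeats an earlier row.
# The comparison is exact, so differences in case or spacing are not detected.
duplicates = concatenated_df.duplicated()
has_duplicates = duplicates.any()
has_duplicates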

False
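
The check above only reports whether exact (song, artist) duplicates exist; it does not remove anything, and it would miss pairs that differ only in case or surrounding whitespace. The following is a minimal sketch of one way to do the removal step, using a left merge with `indicator=True` as an anti-join. The helper names `normalize` and `remove_hot_songs`, the normalization rule (trim and lower-case), and the toy rows (including the deliberately misformatted `'vampire '`) are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def normalize(text: pd.Series) -> pd.Series:
    """Trim and lower-case text so that 'Vampire ' and 'vampire' compare equal."""
    return text.astype(str).str.strip().str.lower()


def remove_hot_songs(not_hot: pd.DataFrame, hot: pd.DataFrame) -> pd.DataFrame:
    """Drop rows of `not_hot` whose (song, artist) pair also appears in `hot`."""
    keys = ["song_key", "artist_key"]
    # Add normalized comparison keys to the not-hot songs.
    left = not_hot.assign(
        song_key=normalize(not_hot["song"]),
        artist_key=normalize(not_hot["artist"]),
    )
    # Build the set of hot (song, artist) keys; deduplicate so the merge
    # cannot multiply rows on the left side.
    right = pd.DataFrame({
        "song_key": normalize(hot["song"]),
        "artist_key": normalize(hot["artist"]),
    }).drop_duplicates()
    # A left merge with indicator=True marks rows found only in `left`.
    merged = left.merge(right, on=keys, how="left", indicator=True)
    only_not_hot = merged[merged["_merge"] == "left_only"]
    return only_not_hot.drop(columns=keys + ["_merge"]).reset_index(drop=True)


# Toy data using the column names from the cleaned DataFrames above.
hot = pd.DataFrame({
    "song": ["Vampire", "Last Night"],
    "artist": ["Olivia Rodrigo", "Morgan Wallen"],
})
not_hot = pd.DataFrame({
    "song": ["Beggin'", "vampire ", "STAY (with Justin Bieber)"],
    "artist": ["Måneskin", "Olivia Rodrigo", "The Kid LAROI"],
})

result = remove_hot_songs(not_hot, hot)
print(result)
```

With the notebook's own data, the equivalent call would be `remove_hot_songs(not_hot_df_cleaned, hot_df_cleaned)`. Stricter or looser matching (for example, ignoring "(with ...)" suffixes) is a design choice left to you.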