# Introduction
Music is an important artifact of culture and society, so analyzing song data can reveal a lot about societal trends. Music consumption, specifically, tells us about society's tastes at a given point in time.

# The Dataset
To aid in our investigation, we use [Dhruvil Dave's dataset](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs/) on songs from the Billboard Hot 100, ranging from the year 1958 to 2021.

All datasets and other intermediate data are stored in the `/data` folder in the root directory of this project. Let's move up a directory so that we have access to the data we're using.

In [None]:
%cd ..

Now, we can load our dataset.

In [None]:
import pandas as pd
charts_df = pd.read_csv('data/charts.csv')

## Exploring our variables

Now, we have a DataFrame object we can work with! Let us explore the observations and variables in our dataset.

In [None]:
charts_df.info()

We have 330 087 observations and seven variables in our dataset. The variables and their descriptions are as follows:
- **date**: date of the chart
- **rank**: rank of the song
- **song**: song title
- **artist**: artist name
- **last-week**: rank of the song in the preceding week
- **peak-rank**: the highest rank that the song historically charted
- **weeks-on-board**: how many weeks the song has been on the chart up to the point of the record (the weeks do not have to be consecutive)

The `date`, `song` and `artist` variables are object types (or string types) while the rest are numeric types.

In [None]:
charts_df['date'] = pd.to_datetime(charts_df['date'])
charts_df

### Handling null and duplicate values
Out of all the variables, only `last-week` contains null values, as seen in the output of `charts_df.info()`. The cell below lists these rows (32 312 in total).

In [None]:
charts_df['last-week'][charts_df['last-week'].isnull()]

This is expected, however. A null value simply means that the song was not on the chart in the preceding week, most often because the entry is new, so there is no rank to record for last week.
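
As a minimal sketch of the null check, the cell below counts missing `last-week` values directly instead of reading the count off the displayed rows. The `toy_charts` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for charts_df (values are invented for illustration)
toy_charts = pd.DataFrame({
    'song': ['A', 'B', 'A', 'B'],
    'rank': [5, 9, 3, 12],
    'last-week': [np.nan, np.nan, 5, 9],
})

# Number of rows with no previous-week rank
null_count = toy_charts['last-week'].isnull().sum()

# Share of rows with no previous-week rank
null_share = toy_charts['last-week'].isnull().mean()

print(null_count, null_share)
```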

As for duplicates, values of `date`, `rank` and the other chart variables are expected to repeat. Artists may also have multiple charting songs throughout their career, and a song may chart for multiple weeks.

However, for the sake of analysis, we might need to create a single entry for each song instead and extract the most relevant features: the `peak-rank` and the maximum value for `weeks-on-board` of that song. 

In this way, we trim our dataset and remove extraneous fluff that may hinder our analysis.

But before we remove the duplicates, we must engineer a few more features. The reason is that we will collapse each song into a single entry per decade, so the `decade` feature must exist before we aggregate.

### Feature engineering
Since we're working with a dataset that pertains to cultural trends, one interesting feature we might explore is the decade. We can bin the chart dates into years and decades.

Here we extract the year of the charting period of a given entry.

In [None]:
year_series = charts_df['date'].dt.year.astype(int)  # year of each chart entry
year_series.name = 'year'
year_series

To make the analysis more helpful, we also bin the years into decades.

In [None]:
decade_series = (year_series // 10 * 10).astype(int)  # e.g. 1987 -> 1980
decade_series.name = 'decade'
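
The binning relies on integer (floor) division: dividing a year by 10 drops its last digit, and multiplying by 10 restores the scale. A small sketch on a few made-up dates, not taken from the dataset:

```python
import pandas as pd

# A few illustrative chart dates
dates = pd.to_datetime(pd.Series(['1958-08-04', '1969-12-27', '1970-01-03', '2021-11-06']))

years = dates.dt.year
# Floor-divide by 10, then scale back up to get the start of the decade
decades = years // 10 * 10

print(decades.tolist())
```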

We then concatenate the series of new features to our cleaned `DataFrame`.

In [None]:
cleaned_charts_df = pd.concat([charts_df, year_series, decade_series], axis='columns')
cleaned_charts_df

The 1950s and the 2020s are incomplete decades, since the dataset covers only two years of each (1958 to 1959 and 2020 to 2021). To make comparisons between decades meaningful, we limit our range to the complete ones.

In [None]:
cleaned_charts_df = cleaned_charts_df.query('decade < 2020 & decade > 1950')
cleaned_charts_df
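
For readers less familiar with `DataFrame.query`, the sketch below shows the same kind of filter on a toy frame next to an equivalent boolean mask built with `Series.between`. The frame and its values are illustrative assumptions.

```python
import pandas as pd

# Toy frame with one row per decade, including the two incomplete ones
toy = pd.DataFrame({
    'song': list('abcde'),
    'decade': [1950, 1960, 1990, 2010, 2020],
})

# Filter with a query string, as in the cell above
kept_query = toy.query('decade > 1950 & decade < 2020')

# Equivalent boolean mask; between() is inclusive on both ends by default
kept_between = toy[toy['decade'].between(1960, 2010)]

print(kept_query['decade'].tolist())
```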

Remember that a song has one entry for every week it appears on the charts. For our analysis, it is more useful to keep a single "summary" of each song within a decade: its peak rank (the best position it reached, which is the lowest `rank` value) and its total weeks on board (the maximum of `weeks-on-board`).

In [None]:
# One row per (song, artist, decade): longest run and best (lowest) rank
cleaned_charts_df = cleaned_charts_df.groupby(['song', 'artist', 'decade'], as_index=False).agg(
    **{'weeks-on-board': ('weeks-on-board', 'max'), 'rank': ('rank', 'min')}
)
cleaned_charts_df.sort_values(axis='rows', by='decade')
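
One consequence of grouping by `decade` is that a song whose chart run crosses a decade boundary yields one summary row per decade. The sketch below illustrates this on made-up weekly rows; the song and artist names are placeholders.

```python
import pandas as pd

# Made-up weekly rows: song 'A' charts in both the 1980s and the 1990s
weekly = pd.DataFrame({
    'song': ['A', 'A', 'A', 'B'],
    'artist': ['X', 'X', 'X', 'Y'],
    'decade': [1980, 1980, 1990, 1990],
    'rank': [40, 12, 20, 1],
    'weeks-on-board': [1, 2, 3, 1],
})

# Collapse to one row per (song, artist, decade)
summary = weekly.groupby(['song', 'artist', 'decade'], as_index=False).agg(
    **{'weeks-on-board': ('weeks-on-board', 'max'), 'rank': ('rank', 'min')}
)

print(summary)
```

Here song `A` appears twice in `summary`, once for each decade, so per-song counts taken from the summarized frame are really per-song-per-decade counts.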

However, the chart data alone might not have the most groundbreaking insights. We can augment this primary dataset with another to broaden our investigation.

In [None]:
tracks_df = pd.read_csv('data/tracks.csv')
tracks_df

[Yamac Eren's dataset](https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks) consists of Spotify data for around 600 000 tracks. Like the Billboard Hot 100 data, it is built from publicly available data. Specifically, it can be gathered through the Spotify API.

It consists of the following important features: 
- **name**: the title of the song
- **duration_ms**: the length of the song in milliseconds
- **artists**: an array of artists featured in the song
- **danceability**: "how suitable a track is for dancing based on a combination of musical elements"
- **energy**: "represents a perceptual measure of intensity and activity"
- **loudness**: "overall loudness of a track in decibels"
- **acousticness**: "confidence measure from 0.0 to 1.0 of whether the track is acoustic"
- **valence**: "describing the musical positiveness conveyed by a track"

In [None]:
tracks_df.head()

To cross-reference the songs from the charts to the Spotify data, we need the title and the artist to match. 

However, since the `artists` field is a string that resembles an array, we need to "evaluate" it and take the first artist to match it against the charts `DataFrame`.

In [None]:
import ast  # literal_eval parses list literals without executing arbitrary code
tracks_df['artist'] = tracks_df['artists'].apply(lambda x: ast.literal_eval(x)[0])
tracks_df['artist']
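
Because the built-in `eval` executes whatever expression it is given, a safer way to parse list-like strings is `ast.literal_eval`, which accepts only Python literals. The sketch below uses a toy frame with invented rows to show the parsing step and the rejection of a non-literal string.

```python
import ast
import pandas as pd

# Toy stand-in for tracks_df; 'artists' holds a list written as a string
toy_tracks = pd.DataFrame({'artists': ["['Queen']", "['Queen', 'David Bowie']"]})

# Parse each string into a list and keep the first artist
toy_tracks['artist'] = toy_tracks['artists'].apply(lambda s: ast.literal_eval(s)[0])

# A string that is not a plain literal is rejected instead of executed
try:
    ast.literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True

print(toy_tracks['artist'].tolist(), rejected)
```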

This methodology, however, has consequences that we should keep in mind when we interpret our data in the succeeding notebooks. Some artist fields in the charts dataset include featured artists (more than one artist on the same track).

Since we're only matching one artist in the Spotify data, we might not be able to merge them all. One way to address this is to separate the artists within a track. In other words, for every artist field that names more than one artist, we duplicate the entry so that each artist gets its own record. In this way, we can properly match all songs in the first `DataFrame` to the second.

However, there are plenty of ways to signify an artist feature (e.g. "Featuring.", "Feat.", "&", ",", ...). In the interest of time, we ignore the songs with artist features. We still have a large sample after this step, but note that the excluded songs are not a random subset, so the central limit theorem does not by itself justify the exclusion. The results are best read as describing tracks that match on a single artist.
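
For reference, the splitting approach described above could be sketched as follows. This is not what the notebook does. The `SEPARATORS` pattern is a hypothetical one that covers only a few of the markers mentioned, and the rows are invented.

```python
import re
import pandas as pd

# Hypothetical separator pattern (an assumption, not exhaustive):
# "featuring", "feat.", "ft.", "&" and ","
SEPARATORS = re.compile(
    r'\s+(?:featuring|feat\.?|ft\.?)\s+|\s*&\s*|\s*,\s*',
    flags=re.IGNORECASE,
)

toy = pd.DataFrame({
    'song': ['Song 1', 'Song 2'],
    'artist': ['Artist A Featuring Artist B', 'Artist C & Artist D'],
})

# Split each artist field into a list of names, dropping empty pieces
toy['artist'] = toy['artist'].apply(lambda s: [p for p in SEPARATORS.split(s) if p])

# One record per (song, artist) pair
exploded = toy.explode('artist', ignore_index=True)

print(exploded)
```

A pattern like this also splits band names that contain a separator, such as "Earth, Wind & Fire", which is one reason the approach needs more care than a single regular expression.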



Another caveat in our augmented dataset is duplicate tracks. Since Spotify lists every version of a track, the same song might have multiple records.

In [None]:
tracks_df[tracks_df.duplicated(subset=['name', 'artists'], keep=False)].sort_values('name').head()

To address this, we just drop the duplicates since they more or less refer to the same song with similar features.

In [None]:
tracks_df.drop_duplicates(subset=['name', 'artists'], inplace=True)
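
Note that `drop_duplicates` keeps the first occurrence by default, so the audio features that survive are those of whichever version happens to come first. A small sketch on invented rows:

```python
import pandas as pd

# Two versions of the same song with slightly different durations (made-up values)
toy_tracks = pd.DataFrame({
    'name': ['Song', 'Song', 'Other'],
    'artists': ["['A']", "['A']", "['B']"],
    'duration_ms': [200000, 215000, 180000],
})

# keep=False marks every member of a duplicate group, useful for inspection
dupes = toy_tracks[toy_tracks.duplicated(subset=['name', 'artists'], keep=False)]

# The default keep='first' retains only the first version
deduped = toy_tracks.drop_duplicates(subset=['name', 'artists'])

print(len(dupes), deduped['duration_ms'].tolist())
```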

We can combine all these features together in a single `DataFrame`.

In [None]:
audio_features = ['duration_ms', 'danceability', 'energy', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'valence']
final_df = cleaned_charts_df.merge(tracks_df[['name', 'artist'] + audio_features],
                                   left_on=['song', 'artist'], right_on=['name', 'artist'])
final_df
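
`merge` performs an inner join by default, so chart entries without a matching Spotify record are silently dropped. As a hedged sketch, a left join with `indicator=True` can be used to count how many entries fail to match. The frames below are toy data, not the real datasets.

```python
import pandas as pd

charts = pd.DataFrame({'song': ['S1', 'S2', 'S3'], 'artist': ['A', 'B', 'C'], 'rank': [1, 5, 9]})
tracks = pd.DataFrame({'name': ['S1', 'S3', 'S4'], 'artist': ['A', 'C', 'D'], 'energy': [0.8, 0.4, 0.6]})

# Default inner join: only songs present in both frames survive
inner = charts.merge(tracks, left_on=['song', 'artist'], right_on=['name', 'artist'])

# Left join with an indicator column to audit the unmatched chart entries
audit = charts.merge(
    tracks,
    how='left',
    left_on=['song', 'artist'],
    right_on=['name', 'artist'],
    indicator=True,
)
unmatched = (audit['_merge'] == 'left_only').sum()

print(len(inner), unmatched)
```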

To use the same `DataFrame` in the future, we can simply serialize our object.

In [None]:
final_df.to_pickle('./data/pkls/charts_with_audio_features_df.pkl')
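
`to_pickle` does not create missing folders, so the target directory has to exist before the call. The sketch below shows a round trip through a temporary directory. The folder layout and the file name `toy_df.pkl` are assumptions made for illustration.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'song': ['S1'], 'decade': [1980], 'energy': [0.8]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output folder mirroring a data/pkls layout
    out_dir = os.path.join(tmp, 'data', 'pkls')
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    path = os.path.join(out_dir, 'toy_df.pkl')
    toy.to_pickle(path)

    # Load the frame back; values and dtypes are preserved
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```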