# Data Science for Music

## Dataset Sources:

1. My Winamp's music library's Media Library Export - can be found under data/music_library_export.xml
2. My Last.fm account's scrobbles (time series of when each song was played) extracted from [here](https://lastfm.ghan.nl/export/) - can be found under data/lastfm-scrobbles-edchapa.csv
3. [GTZAN Dataset - Music Genre Classification](https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification)

## Process

First, I exported my song library from Winamp, which already contains some information like track name, artist, genre, etc.

However, the export is in iTunes XML format, so I will first have to convert it to CSV using Python's xmltodict and csv libraries. Let's also install altair since we will use it later.

In [517]:
%pip install xmltodict
%pip install altair

Note: you may need to restart the kernel to use updated packages.


First we read the xml file, and create the output csv file.

In [518]:
import xmltodict, csv

with open('data/music_library_export.xml') as xml_file:
    xml_file = xmltodict.parse(xml_file.read())

csv_file = open("data/music_library_export.csv", "w", encoding="utf-8", newline='')
csv_file_writer = csv.writer(csv_file)

We will use the xml file's keys as the column names for the csv file.

In [519]:
xml_file_keys = ["Track ID", "Name", "Artist", "Album Artist", "Album",
                 "Genre", "Comments", "Kind", "Size", "Total Time",
                 "Track Number", "Year", "Bit Rate", "Track Count",
                 "Composer", "Publisher", "Location", "File Folder Count",
                 "Library Folder Count", "Date Modified", "Date Added"]
csv_file_writer.writerow(xml_file_keys)

208

Now we will write one row in the csv file per song in the xml file. The export stores each song as a flat sequence of keys and values, so xmltodict groups the values into separate lists by type (integer, string, date) instead of pairing each value with its key. We therefore need to parse the values manually and handle each key separately. Since not all songs contain all fields, we will use two counters to keep track of the skipped integer and string values, and subtract them from the expected list position so that the remaining values still land in the right csv column. Otherwise, the values in the resulting csv file would be shifted whenever a song had missing fields in the original xml file.

In [520]:
for song in xml_file['plist']['dict']['dict']['dict']:
    song_info = ['' for i in range(len(xml_file_keys))]
    skipped_integers = 0
    skipped_strings = 0
    if "Track ID" in song['key']:
        song_info[0] = song['integer'][0]
    else:
        skipped_integers += 1
    if "Name" in song['key']:
        song_info[1] = song['string'][0]
    else:
        skipped_strings += 1
    if "Artist" in song['key']:
        song_info[2] = song['string'][1 - skipped_strings]
    else:
        skipped_strings += 1
    if "Album Artist" in song['key']:
        song_info[3] = song['string'][2 - skipped_strings]
    else:
        skipped_strings += 1
    if "Album" in song['key']:
        song_info[4] = song['string'][3 - skipped_strings]
    else:
        skipped_strings += 1
    if "Genre" in song['key']:
        song_info[5] = song['string'][4 - skipped_strings]
    else:
        skipped_strings += 1
    if "Comments" in song['key']:
        song_info[6] = song['string'][5 - skipped_strings]
    else:
        skipped_strings += 1
    if "Kind" in song['key']:
        song_info[7] = song['string'][6 - skipped_strings]
    else:
        skipped_strings += 1
    if "Size" in song['key']:
        song_info[8] = song['integer'][1 - skipped_integers]
    else:
        skipped_integers += 1
    if "Total Time" in song['key']:
        song_info[9] = song['integer'][2 - skipped_integers]
    else:
        skipped_integers += 1
    if "Track Number" in song['key']:
        song_info[10] = song['integer'][3 - skipped_integers]
    else:
        skipped_integers += 1
    if "Year" in song['key']:
        song_info[11] = song['integer'][4 - skipped_integers]
    else:
        skipped_integers += 1
    if "Bit Rate" in song['key']:
        song_info[12] = song['integer'][5 - skipped_integers]
    else:
        skipped_integers += 1
    if "Track Count" in song['key']:
        song_info[13] = song['integer'][6 - skipped_integers]
    else:
        skipped_integers += 1
    if "Composer" in song['key']:
        song_info[14] = song['string'][7 - skipped_strings]
    else:
        skipped_strings += 1
    if "Publisher" in song['key']:
        song_info[15] = song['string'][8 - skipped_strings]
    else:
        skipped_strings += 1
    if "Location" in song['key']:
        song_info[16] = song['string'][9 - skipped_strings]
    if "File Folder Count" in song['key']:
        song_info[17] = song['integer'][7 - skipped_integers]
    else:
        skipped_integers += 1
    if "Library Folder Count" in song['key']:
        song_info[18] = song['integer'][8 - skipped_integers]
    if "Date Modified" in song['key']:
        song_info[19] = song['date'][0]
    if "Date Added" in song['key']:
        song_info[20] = song['date'][1]
    csv_file_writer.writerow(song_info)
csv_file.close()
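As a possible alternative to the counters, the sketch below pairs each key with the value that follows it, using the standard library's xml.etree.ElementTree. It assumes each song's `<dict>` holds alternating key and value children, as in the iTunes XML layout; the sample XML and the `track_to_row` helper are made up for illustration and are not part of the original notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of one song entry; this song has no "Artist" field.
SAMPLE_TRACK = """
<dict>
    <key>Track ID</key><integer>1</integer>
    <key>Name</key><string>Us</string>
    <key>Genre</key><string>Soundtrack</string>
    <key>Total Time</key><integer>289000</integer>
    <key>Year</key><integer>2009</integer>
</dict>
"""

def track_to_row(track_element, columns):
    """Pair every <key> with the element right after it, then order by column."""
    children = list(track_element)
    fields = {}
    for key_el, value_el in zip(children[0::2], children[1::2]):
        # Elements without text (e.g. <true/>) become empty strings.
        fields[key_el.text] = value_el.text or ""
    # Missing fields become empty cells, so no column is ever shifted.
    return [fields.get(column, "") for column in columns]

columns = ["Track ID", "Name", "Artist", "Genre", "Total Time", "Year"]
row = track_to_row(ET.fromstring(SAMPLE_TRACK.strip()), columns)
print(row)
```

Because every value is looked up by its own key, a missing field simply leaves an empty cell and no offset bookkeeping is needed.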

Now we will read the csv file into a Polars DataFrame, and display the first rows.

In [521]:
import polars as pl
songs_df = pl.read_csv('data/music_library_export.csv')
songs_df.head()

Track ID,Name,Artist,Album Artist,Album,Genre,Comments,Kind,Size,Total Time,Track Number,Year,Bit Rate,Track Count,Composer,Publisher,Location,File Folder Count,Library Folder Count,Date Modified,Date Added
i64,str,str,str,str,str,str,str,i64,i64,i64,i64,i64,i64,str,str,str,i64,i64,str,str
0,"""A Story Of Boy…","""Mychael Danna …","""Mychael Danna …","""(500) Days Of …","""Soundtrack""",,"""MPEG audio fil…",1529041,95000,1,2009,128,16,,,"""file://localho…",-1,-1,"""2009-11-06T05:…","""2023-10-23T02:…"
1,"""Us""","""Regina Spektor…","""Regina Spektor…","""(500) Days Of …","""Soundtrack""",,"""MPEG audio fil…",4640293,289000,2,2009,128,16,,,"""file://localho…",-1,-1,"""2009-11-06T05:…","""2023-10-23T02:…"
2,"""There Is A Lig…","""The Smiths""","""The Smiths""","""(500) Days Of …","""Soundtrack""",,"""MPEG audio fil…",3904714,243000,3,2009,128,16,,,"""file://localho…",-1,-1,"""2009-11-06T05:…","""2023-10-23T02:…"
3,"""Bad Kids""","""Black Lips""","""Black Lips""","""(500) Days Of …","""Soundtrack""",,"""MPEG audio fil…",2056471,128000,4,2009,128,16,,,"""file://localho…",-1,-1,"""2009-11-06T05:…","""2023-10-23T02:…"
4,"""Please, Please…","""The Smiths""","""The Smiths""","""(500) Days Of …","""Soundtrack""",,"""MPEG audio fil…",1805315,112000,5,2009,128,16,,,"""file://localho…",-1,-1,"""2009-11-06T05:…","""2023-10-23T02:…"


Now I would like to see how many genres we have. For that, we will select the Genre column from the DataFrame and display its full contents by converting it to a NumPy array.

In [522]:
songs_df.select(
    pl.col('Genre')
).unique().sort("Genre").to_numpy()


array([[None],
       ['(255)'],
       ['.'],
       ['AOR Classic Rock'],
       ['Alt Rock'],
       ['Alt. Rock'],
       ['Alternative'],
       ['Alternative & Punk'],
       ['Alternative Rock'],
       ['Alternative, Rock'],
       ['Alternativo'],
       ['Ambient'],
       ['Ambient Alternative'],
       ['Ambient Techance'],
       ['Anime'],
       ['Anti-folk'],
       ['Arena/Power Metal'],
       ['Avantgarde'],
       ['Ballad'],
       ['Banda sonora'],
       ['Black Metal'],
       ['Blues'],
       ['Brit Pop'],
       ['Brit-pop'],
       ['Campfire Rock'],
       ['Chiptune'],
       ['Choral'],
       ['Classic Hard Rock'],
       ['Classic Pop Punk'],
       ['Classic Rock'],
       ['Classical'],
       ['Cosmic Tones for Mental Therapy'],
       ['Country'],
       ['Dance'],
       ['Dance & DJ'],
       ['Dance / Disco'],
       ['Death Metal'],
       ['Desert Rock'],
       ['Down-tempo / Pop / Alternativa'],
       ['Dubstep'],
       ['Duck Remixes'],
  

## Data Imputation

We can see that some genres are repeated with slight spelling or language differences, so we will rename them to match the rest. We will also turn values that are not actual genres (e.g. '.', '(255)' or 'Other') into null values so we can later remove them if necessary.

In [523]:
genres = songs_df.select(
    pl.col("Genre").map_elements(
        lambda x: "Alternative" if x == "Alternativo" else x)
    .map_elements(
        lambda x: "Electronic" if x == "Electronica" else x)
    .map_elements(
        lambda x: "Electronic Pop" if x in ["Pop Electronica", "Electronica / Pop"] else x)
    .map_elements(
        lambda x: "Indie" if x == "indie" else x)
    .map_elements(
        lambda x: "Indie Rock" if x in ["Rock/Indie", "Indie/Rock", "General Indie Rock"] else x)
    .map_elements(
        lambda x: "Miscellaneous" if x == "misc" else x)
    .map_elements(
        lambda x: "Soundtrack" if x in ["soundtrack", "Banda sonora"] else x)
    # NOTE: as written this mapping is a no-op (the source and target values are identical).
    .map_elements(
        lambda x: "Thrash Metal" if x == "Thrash Metal" else x)
    .map_elements(
        lambda x: "Alt Rock" if x in ["Alt. Rock", "Alternative Rock", "Rock alternativo",
                                      "Alternative, Rock", "General Alternative Rock"] else x)
    .map_elements(
        lambda x: "Brit Pop" if x == "Brit-pop" else x)
    .map_elements(
        lambda x: "Pop Rock" if x in ["Pop/Rock", "Pop/Rock 2000's"] else x)
    .map_elements(
        lambda x: "Pop" if x == "General Pop" else x)
    .map_elements(
        lambda x: "Folk" if x == "General Folk" else x)
    .map_elements(
        lambda x: "Rock" if x in ["General Rock", "Rock En General", "Rock en general", "Rock @",
                                  "rock"] else x)
    .map_elements(
        lambda x: "Heavy Metal" if x == "Rock Duro Y Heavy" else x)
    .map_elements(
        lambda x: "Hip Hop/Rap" if x == "General Rap/Hip-Hop" else x)
    .map_elements(
        lambda x: "Bitpop" if x == "bitpop" else x)
    .map_elements(
        lambda x: "Chillstep" if x == "chillstep" else x)
    .map_elements(
        lambda x: "Chiptune" if x == "chiptune" else x)
    .map_elements(
        lambda x: None if x in ["genre", "default", ".", "(255)", "Other"] else x)
    .map_elements(
        lambda x: "Unclassifiable" if x == "General Unclassifiable" else x)
    .map_elements(
        lambda x: "Soft Rock / Alternative Folk / Folk / Rock" if x == "soft rock/alternative folk/folk/rock" else x)
    .alias("Genre")
).to_series()

songs_df = songs_df.with_columns(genres.alias("Genre"))
songs_df.select(
    pl.col('Genre')
).unique().sort("Genre").to_numpy()


array([[None],
       ['AOR Classic Rock'],
       ['Alt Rock'],
       ['Alternative'],
       ['Alternative & Punk'],
       ['Ambient'],
       ['Ambient Alternative'],
       ['Ambient Techance'],
       ['Anime'],
       ['Anti-folk'],
       ['Arena/Power Metal'],
       ['Avantgarde'],
       ['Ballad'],
       ['Bitpop'],
       ['Black Metal'],
       ['Blues'],
       ['Brit Pop'],
       ['Campfire Rock'],
       ['Chillstep'],
       ['Chiptune'],
       ['Choral'],
       ['Classic Hard Rock'],
       ['Classic Pop Punk'],
       ['Classic Rock'],
       ['Classical'],
       ['Cosmic Tones for Mental Therapy'],
       ['Country'],
       ['Dance'],
       ['Dance & DJ'],
       ['Dance / Disco'],
       ['Death Metal'],
       ['Desert Rock'],
       ['Down-tempo / Pop / Alternativa'],
       ['Dubstep'],
       ['Duck Remixes'],
       ['Duet'],
       ['EDM: Dubstep'],
       ['EDM: Electro House'],
       ['EPM'],
       ['Easy Listening'],
       ['Electro'],
       [
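The long chain of renames above could also be expressed as a single lookup table. The following is a minimal sketch in plain Python, using only a few of the mappings from the cell above; `GENRE_FIXES`, `NOT_A_GENRE` and `normalize_genre` are hypothetical names chosen for illustration.

```python
# A subset of the renames used above, collected in one dictionary.
GENRE_FIXES = {
    "Alternativo": "Alternative",
    "Banda sonora": "Soundtrack",
    "soundtrack": "Soundtrack",
    "Brit-pop": "Brit Pop",
    "Alt. Rock": "Alt Rock",
    "Alternative Rock": "Alt Rock",
}

# Placeholder values that are not real genres and should become null.
NOT_A_GENRE = {"genre", "default", ".", "(255)", "Other"}

def normalize_genre(genre):
    """Return the canonical genre name, or None for placeholder values."""
    if genre is None or genre in NOT_A_GENRE:
        return None
    # Unknown genres pass through unchanged.
    return GENRE_FIXES.get(genre, genre)

raw = ["Alternativo", "Brit-pop", "(255)", "Blues", None]
cleaned = [normalize_genre(g) for g in raw]
print(cleaned)
```

A function like this could presumably be passed to one `map_elements` call instead of chaining one call per genre, which keeps all the renames in a single place.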

Since we don't want null values in the Genre and Year columns, we will remove them.

In [524]:
songs_df = songs_df.drop_nulls(["Genre"])
songs_df = songs_df.drop_nulls(["Year"])

Some songs have invalid years so let's filter those out as well.

In [525]:
songs_df = songs_df.filter(pl.col('Year') > 1000).filter(pl.col('Year') < 2024)

We will finally remove the Comments, File Folder Count, Library Folder Count, Kind, and Location columns since they don't contain useful/relevant data.

In [526]:
songs_df = songs_df.drop(
    ["Comments", "File Folder Count", "Library Folder Count", "Kind", "Location"]
)
songs_df.head()

Track ID,Name,Artist,Album Artist,Album,Genre,Size,Total Time,Track Number,Year,Bit Rate,Track Count,Composer,Publisher,Date Modified,Date Added
i64,str,str,str,str,str,i64,i64,i64,i64,i64,i64,str,str,str,str
0,"""A Story Of Boy…","""Mychael Danna …","""Mychael Danna …","""(500) Days Of …","""Soundtrack""",1529041,95000,1,2009,128,16,,,"""2009-11-06T05:…","""2023-10-23T02:…"
1,"""Us""","""Regina Spektor…","""Regina Spektor…","""(500) Days Of …","""Soundtrack""",4640293,289000,2,2009,128,16,,,"""2009-11-06T05:…","""2023-10-23T02:…"
2,"""There Is A Lig…","""The Smiths""","""The Smiths""","""(500) Days Of …","""Soundtrack""",3904714,243000,3,2009,128,16,,,"""2009-11-06T05:…","""2023-10-23T02:…"
3,"""Bad Kids""","""Black Lips""","""Black Lips""","""(500) Days Of …","""Soundtrack""",2056471,128000,4,2009,128,16,,,"""2009-11-06T05:…","""2023-10-23T02:…"
4,"""Please, Please…","""The Smiths""","""The Smiths""","""(500) Days Of …","""Soundtrack""",1805315,112000,5,2009,128,16,,,"""2009-11-06T05:…","""2023-10-23T02:…"


## EDA

First, let's see how many songs we have of each genre.

In [527]:
songs_by_genre = songs_df.select(
    pl.col('Genre')
).to_series().value_counts()
songs_by_genre

Genre,counts
str,u32
"""Miscellaneous""",8
"""Classic Rock""",26
"""Electronic""",394
"""Industrial""",19
"""Chiptune""",81
"""Grunge""",13
"""Anime""",19
"""Cosmic Tones f…",12
"""Hip Hop / Glam…",1
"""Punk""",143


Now let's get a chart of the top 10 genres with most songs using Altair, and highlight the one with most songs.

In [528]:
import altair as alt

top_10_genres = songs_by_genre.top_k(10, by="counts")
top_genre = top_10_genres.top_k(1, by="counts").to_numpy()[0][0]
alt.Chart(top_10_genres, title="Top 10 Genres").mark_bar().encode(
    x=alt.X('counts', title="Songs"),
    y='Genre',
    color=alt.condition(
        alt.datum.Genre == top_genre,
        alt.value('orange'),
        alt.value('steelblue')
    )
)

Let's now see the average song duration per genre. The duration is stored in the Total Time column in milliseconds, so we will divide it by 60000 to get minutes and round the result to one decimal place.

In [529]:
avg_duration_per_genre = songs_df.group_by('Genre').agg(
    (pl.mean('Total Time')/60000).round(1).alias('Minutes')
)
avg_duration_per_genre

Genre,Minutes
str,f64
"""Reggae""",3.5
"""Psychedelic""",2.6
"""Electro House""",5.5
"""Alternative""",3.7
"""Cosmic Tones f…",4.2
"""Neofolk""",4.3
"""Rock/Pop""",5.7
"""Melodic Black""",6.5
"""Progressive Ro…",9.1
"""Duet""",3.2


And now let's get the top 10 Genres with the longest average song duration.

In [530]:
top_10_avg_duration = avg_duration_per_genre.top_k(10, by="Minutes")
top_10_avg_duration

Genre,Minutes
str,f64
"""Hardstyle""",60.0
"""Progressive Ro…",9.1
"""Ballad""",9.0
"""Progressive Me…",8.2
"""Psychedelia""",7.6
"""Campfire Rock""",7.4
"""Ambient""",7.2
"""Indie/Post Roc…",7.0
"""Ambient Techan…",6.8
"""Arena/Power Me…",6.8


And finally let's plot them in Altair highlighting the one with the longest average duration.

In [531]:
top_duration = avg_duration_per_genre.top_k(1, by="Minutes").to_numpy()[0][1]
alt.Chart(top_10_avg_duration, title="Top 10 AVG Durations By Genre").mark_bar().encode(
    x=alt.X('Genre', axis=alt.Axis(labelAngle=-45)),
    y="Minutes",
    color=alt.condition(
        alt.datum.Minutes == top_duration,
        alt.value('orange'),
        alt.value('steelblue')
    )
).properties(width=400)

Now let's see how many songs we have per Year.

In [532]:
songs_per_year = songs_df.select(
    pl.col('Name').alias('Songs'),
    pl.col('Year')
).group_by('Year').agg(
    pl.count('Songs')
).sort(by='Year')
songs_per_year

Year,Songs
i64,u32
1950,1
1951,1
1964,2
1965,1
1966,2
1967,14
1968,12
1969,48
1970,38
1971,27


And finally let's plot them on an Altair chart, and highlight the year with most songs.

In [533]:
top_year = songs_per_year.top_k(1, by="Songs").to_numpy()[0][0]
alt.Chart(songs_per_year, title="Song Count per Year").mark_bar().encode(
    x=alt.X("Year:O", axis=alt.Axis(labelAngle=-45)),
    y="Songs:Q",
    color=alt.condition(
        alt.datum.Year == top_year,
        alt.value('orange'),
        alt.value('steelblue')
    )
)

## Last.fm Scrobbles

Now we will load the last.fm export and display the first rows. This dataset has a row for each time a song was played in my Spotify library, which is separate from my Winamp one.

In [534]:
scrobbles_df = pl.read_csv("data/lastfm-scrobbles-edchapa.csv")
scrobbles_df.head()

uts,utc_time,artist,artist_mbid,album,album_mbid,track,track_mbid
i64,str,str,str,str,str,str,str
1691282428,"""06 Aug 2023, 0…","""Alejandro Fern…","""""","""Hecho en Méxic…","""19337281-88cb-…","""Caballero""",""""""
1691282200,"""06 Aug 2023, 0…","""Los Acosta""","""ddcbd7c8-73da-…","""Intimidades""","""""","""Como Una Novel…",""""""
1691281988,"""06 Aug 2023, 0…","""Los Askis""","""7941c16f-c2cb-…","""Pasión Y Cumbi…","""""","""Amor Regresa""","""32cf21ce-274a-…"
1691281754,"""06 Aug 2023, 0…","""Los Ángeles Az…","""dcb5e5c6-5f21-…","""De Buenos Aire…","""a2811b27-95b5-…","""Te Necesito""",""""""
1691281575,"""06 Aug 2023, 0…","""Los Ángeles Az…","""dcb5e5c6-5f21-…","""De Buenos Aire…","""a2811b27-95b5-…","""Entrega De Amo…",""""""


Let's start by removing the uts, artist_mbid, album_mbid, and track_mbid columns since they don't contain relevant information.

In [535]:
scrobbles_df = scrobbles_df.drop(["uts", "artist_mbid", "album_mbid", "track_mbid"])
scrobbles_df.head()

utc_time,artist,album,track
str,str,str,str
"""06 Aug 2023, 0…","""Alejandro Fern…","""Hecho en Méxic…","""Caballero"""
"""06 Aug 2023, 0…","""Los Acosta""","""Intimidades""","""Como Una Novel…"
"""06 Aug 2023, 0…","""Los Askis""","""Pasión Y Cumbi…","""Amor Regresa"""
"""06 Aug 2023, 0…","""Los Ángeles Az…","""De Buenos Aire…","""Te Necesito"""
"""06 Aug 2023, 0…","""Los Ángeles Az…","""De Buenos Aire…","""Entrega De Amo…"


Now let's create a new column that combines the track and artist names, and display the DataFrame.

In [536]:
scrobbles_df = scrobbles_df.with_columns(
    (pl.col('track') + " - " + pl.col('artist')).alias("Song - Artist")
)
scrobbles_df.head()

utc_time,artist,album,track,Song - Artist
str,str,str,str,str
"""06 Aug 2023, 0…","""Alejandro Fern…","""Hecho en Méxic…","""Caballero""","""Caballero - Al…"
"""06 Aug 2023, 0…","""Los Acosta""","""Intimidades""","""Como Una Novel…","""Como Una Novel…"
"""06 Aug 2023, 0…","""Los Askis""","""Pasión Y Cumbi…","""Amor Regresa""","""Amor Regresa -…"
"""06 Aug 2023, 0…","""Los Ángeles Az…","""De Buenos Aire…","""Te Necesito""","""Te Necesito - …"
"""06 Aug 2023, 0…","""Los Ángeles Az…","""De Buenos Aire…","""Entrega De Amo…","""Entrega De Amo…"


Then let's count how many times each song was played, sort the counts in descending order, and add a row count column.

In [537]:
played_songs = scrobbles_df.group_by('Song - Artist').agg(
    pl.col('Song - Artist').count().alias('play_num')
).sort(by="play_num", descending=True).with_row_count()
played_songs.head()

row_nr,Song - Artist,play_num
u32,str,u32
0,"""Amor a primera…",117
1,"""Icy Skies - Fi…",112
2,"""Calm Down (wit…",109
3,"""On Eloquence -…",96
4,"""Caves - CLANN""",95
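For readers less familiar with Polars, the same group, count and sort steps can be sketched with the standard library's collections.Counter. The play list below is invented for illustration (only the "Caves - CLANN" label comes from the data above), so the counts are not real.

```python
from collections import Counter

# Hypothetical scrobbles: one "Song - Artist" label per play.
plays = [
    "Caves - CLANN",
    "Song A - Artist X",
    "Caves - CLANN",
    "Song B - Artist Y",
    "Caves - CLANN",
    "Song A - Artist X",
]

# most_common() returns (label, count) pairs sorted by count, descending.
ranked = Counter(plays).most_common()

# enumerate() plays the role of the row count column.
for row_nr, (song, play_num) in enumerate(ranked):
    print(row_nr, song, play_num)
```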


And now let's get the top 10 most played songs.

In [538]:
top_10_played_songs = played_songs.top_k(10, by="play_num")
top_10_played_songs

row_nr,Song - Artist,play_num
u32,str,u32
0,"""Amor a primera…",117
1,"""Icy Skies - Fi…",112
2,"""Calm Down (wit…",109
3,"""On Eloquence -…",96
4,"""Caves - CLANN""",95
5,"""Last Breath - …",93
6,"""Entrega De Amo…",87
7,"""Le Quattro Sta…",85
8,"""Equinox - Eric…",82
9,"""Ya acabó - Con…",73


Now let's chart them using Altair, highlighting the most played song.

In [539]:
# The third column (index 2) holds play_num, so this is the highest play count.
top_play_count = top_10_played_songs.top_k(1, by="play_num").to_numpy()[0][2]
alt.Chart(top_10_played_songs, title="Top 10 Most Played Songs").mark_bar().encode(
    x=alt.X('play_num', title="Number of times played"),
    y="Song - Artist",
    color=alt.condition(
        alt.datum.play_num == top_play_count,
        alt.value('orange'),
        alt.value('steelblue')
    )
)

Now let's see if we can find any correlation between the variables of our previous datasets. For that we will create ProfileReports using Python's ydata_profiling library and store them in a "reports" folder.

In [540]:
from ydata_profiling import ProfileReport
songs_profile_report = ProfileReport(songs_df.to_pandas(), title="All Songs Profiling Report")
scrobbles_profile_report = ProfileReport(scrobbles_df.to_pandas(), title="Scrobbles Profiling Report")
songs_profile_report.to_file("reports/All_Songs_Profile.html")
scrobbles_profile_report.to_file("reports/Scrobbles_Profile.html")

![Correlations heatmap from 'All Songs Profiling Report'](images/correlations.png)

No correlations were found in the "Scrobbles Profiling Report", but the correlations heatmap in the "All Songs Profiling Report" (built from songs_df) shows a (perhaps obvious) correlation between a song's size and both its duration (Total Time) and its bit rate. This suggests trying to predict a song's size from its duration and bit rate. For constant bit rate (CBR) songs the size can be computed almost directly, but for variable bit rate (VBR) songs that calculation is less accurate, so it will be an interesting exercise to predict the size with a model. Before generating this profile report, I was not confident that any correlation would be found in either dataset, so this is a great example of how useful these tools are when performing an EDA.
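To make the CBR case concrete, here is a rough sketch of the direct calculation, checked against the first three songs shown earlier. It assumes the Bit Rate column is in kilobits per second with 1 kbit = 1000 bits; `estimate_cbr_size_bytes` is a hypothetical helper, not part of the notebook.

```python
def estimate_cbr_size_bytes(total_time_ms, bit_rate_kbps):
    """Audio payload size in bytes for a constant bit rate file (assumed units)."""
    seconds = total_time_ms / 1000
    bits = seconds * bit_rate_kbps * 1000
    return bits / 8

# (Total Time in ms, Bit Rate, actual Size in bytes) from the first rows of songs_df.
rows = [
    (95000, 128, 1529041),
    (289000, 128, 4640293),
    (243000, 128, 3904714),
]

for total_time_ms, bit_rate, actual_size in rows:
    estimate = estimate_cbr_size_bytes(total_time_ms, bit_rate)
    relative_error = abs(actual_size - estimate) / actual_size
    print(f"estimate={estimate:.0f} actual={actual_size} error={relative_error:.2%}")
```

For these three rows the estimate lands within about 1% of the real size. The small remainder is presumably metadata and container overhead, which this simple formula ignores.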

## Visualization

First, let's plot songs_df's 'Total Time', 'Bit Rate' and 'Size' columns and see how they are distributed. Since the 'Total Time' value is in milliseconds, let's convert it to seconds and rename the column to 'Duration' for the chart.

In [587]:
import plotly.express as px

data = songs_df.select(
    (pl.col('Total Time') / 1000).alias('Duration'),
    pl.col('Bit Rate'),
    pl.col('Size'),
)

fig = px.scatter_3d(
    data, 
    x='Duration',
    y='Bit Rate',
    z='Size',
    color='Size'
)
fig

By rotating the previous 3D chart, we can see that a song's size usually increases in proportion to its duration and, to a lesser degree, its bit rate. The largest song (the yellow dot) also has the longest duration of all songs, and the blue dots are the smallest songs because they have short durations, low bit rates, or both. This hints that we can probably predict the size of a song with a linear regression on its duration and bit rate.

## Modeling

Now let's begin by separating the variable we want to predict, the size of the songs, also called our dependent variable. 

In [588]:
songs_size = songs_df.select(pl.col('Size'))
songs_size

Size
i64
1529041
4640293
3904714
2056471
1805315
6658626
2969306
3737517
2630756
3596648


We will also remove the rest of the columns from the dataset except for 'Total Time' and 'Bit Rate', which will be our independent variables.

In [596]:
songs_independent_vars = songs_df.select((pl.col('Total Time')/1000), pl.col('Bit Rate'))
songs_independent_vars

Total Time,Bit Rate
f64,i64
95.0,128
289.0,128
243.0,128
128.0,128
112.0,128
416.0,128
185.0,128
233.0,128
164.0,128
224.0,128


Now we are going to use scikit-learn's train_test_split function to split our dataset into three parts: training, testing, and validation. Since there's no fixed rule for choosing the size of each, let's go with 60% of the data for training, 20% for testing, and 20% for validating the model, and see how our model behaves.

In [597]:
from sklearn.model_selection import train_test_split

total_song_count = len(songs_size)
training_count = int(total_song_count * .60) + 1
test_count = int(total_song_count * .20)
validation_count = int(total_song_count * .20)
training_x, rest_x, training_y, rest_y = train_test_split(songs_independent_vars, songs_size, train_size=training_count)
testing_x, validation_x, testing_y, validation_y = train_test_split(rest_x, rest_y, train_size=test_count)
print("Training dataset size:", len(training_x))
print("Validation dataset size:", len(validation_x))
print("Testing dataset size:", len(testing_x))

Training dataset size: 4994
Validation dataset size: 1664
Testing dataset size: 1664
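The arithmetic behind those three sizes can be sketched without scikit-learn. The helper below is a hypothetical stand-in that shuffles with a fixed seed (an assumption added here for reproducibility; the cell above does not set one) and slices the result using the same counts.

```python
import random

def three_way_split(items, train_fraction=0.6, test_fraction=0.2, seed=42):
    """Shuffle items and slice them into training, testing and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Same counts as in the cell above: 60% (+1) for training, 20% for testing.
    train_count = int(len(shuffled) * train_fraction) + 1
    test_count = int(len(shuffled) * test_fraction)
    train = shuffled[:train_count]
    test = shuffled[train_count:train_count + test_count]
    # Whatever is left over becomes the validation set.
    validation = shuffled[train_count + test_count:]
    return train, test, validation

# 8322 is the total number of songs implied by the sizes printed above.
train, test, validation = three_way_split(range(8322))
print(len(train), len(test), len(validation))  # 4994 1664 1664
```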


Since the 'Total Time' and 'Bit Rate' columns have values on different scales, we will use scikit-learn's RobustScaler to scale them. For that, we will build a feature engineering pipeline for both columns.

In [598]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import RobustScaler

scaler = ColumnTransformer([
    ("scaler", RobustScaler(), ["Total Time", "Bit Rate"])
])
feature_engineering_pipeline = pipe = Pipeline(
    [
        (
            "features",
            FeatureUnion(
                [
                    ("scaled", scaler)
                ]
            ),
        )
    ]
)
feature_engineering_pipeline
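To give an intuition for what RobustScaler does with its default settings, the sketch below subtracts the median and divides by the interquartile range, using only the standard library. `robust_scale` is a hypothetical helper, and its quartiles may differ slightly from scikit-learn's on other inputs, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
import statistics

def robust_scale(values):
    """Center on the median and scale by the interquartile range (Q3 - Q1)."""
    median = statistics.median(values)
    # n=4 yields the three quartile cut points; "inclusive" interpolates linearly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Durations in seconds of the first five songs shown earlier.
durations = [95.0, 289.0, 243.0, 128.0, 112.0]
scaled = robust_scale(durations)
print(scaled)
```

Since the median and quartiles are barely affected by extreme values, a single very long track does not squash the rest of the data the way min-max scaling would.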

Now let's transform the training data using our pipeline. Since scikit-learn doesn't support Polars DataFrames directly (because of their lack of indexes/iloc methods), we'll convert the Polars DataFrame containing our training data into a pandas DataFrame on the fly using its .to_pandas() method.

In [599]:
features_training_x = feature_engineering_pipeline.fit_transform(training_x.to_pandas())
features_training_x

array([[ 0.11881188, -0.27848101],
       [-0.71287129,  0.59493671],
       [ 0.44554455,  0.35443038],
       ...,
       [-0.1980198 ,  1.24050633],
       [ 0.61386139,  0.4556962 ],
       [-0.44554455,  0.41772152]])

Since we saw that a song's size grows roughly in proportion to its duration and bit rate, let's build a linear regression model to predict a song's size from those two features.

In [600]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(features_training_x, training_y)

Now that we've trained our model, let's see if it is accurately predicting the songs' size as expected by using the validation and testing datasets, and let's also calculate the model's coefficient of determination (R²).

In [601]:
features_validation_x = feature_engineering_pipeline.transform(validation_x.to_pandas())
prediction_y = model.predict(features_validation_x)
prediction_score = model.score(features_validation_x, validation_y)
print("Model prediction score with validation data:", prediction_score)


Model prediction score with validation data: 0.9261284768955244


In [602]:
features_testing_x = feature_engineering_pipeline.transform(testing_x.to_pandas())
prediction_y = model.predict(features_testing_x)
prediction_score = model.score(features_testing_x, testing_y)
print("Model prediction score with testing data:", prediction_score)

Model prediction score with testing data: 0.9146574727019628
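For reference, the score printed above is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand; `r_squared` is a hypothetical helper, and the "predictions" are a naive duration × bit rate estimate for the first five songs (assuming 128 kbit/s with 1 kbit = 1000 bits), not the output of the trained model.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Size (bytes) and Total Time (ms) of the first five songs shown earlier.
sizes = [1529041, 4640293, 3904714, 2056471, 1805315]
durations_ms = [95000, 289000, 243000, 128000, 112000]

# Naive estimate: milliseconds * kbit/s / 8 gives bytes.
naive_estimates = [ms * 128 / 8 for ms in durations_ms]
print(r_squared(sizes, naive_estimates))
```

A value of 1 means the predictions match exactly, while a model that always predicts the mean size scores 0.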
