# Preprocessing
This notebook transforms the raw data collected from Spotify's API, adjusting its structure and data types so that it can be stored in a relational database.

In [1]:
# Import necessary packages

import os
import sqlite3
import pandas as pd
import ast
import json

In [2]:
# Load the dataset
df = pd.read_csv('../data/raw/spotify_playlists.csv')

## Data Extraction
We begin by extracting only the information we are interested in and creating separate data frames that will later be transformed into database tables.

In [3]:
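# The playlist ID is the last path segment of the playlist URL
df['playlist_id'] = df['playlist_href'].apply(lambda x: x.split('/')[-1])
# The songs column is read as a string; parse it back into a list of dicts
df['songs'] = df['songs'].apply(ast.literal_eval)
# chatGPT helped here
# Append the playlist ID as a trailing element so that it travels with each song list
df['songs'] = [songs + [{'playlist_id': playlist_id}] for songs, playlist_id in zip(df['songs'], df['playlist_id'])]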
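
To make the cell above concrete, here is a minimal sketch on a single made-up row. The href and the song entries are hypothetical placeholders, not values from the dataset; the sketch only assumes that `playlist_href` ends with the playlist ID and that `songs` is stored as the string form of a Python list.

```python
import ast

# Hypothetical raw values, shaped like one row of the CSV
playlist_href = "https://api.spotify.com/v1/playlists/abc123"
songs_raw = "[{'track': {'name': 'Song A'}}, {'track': {'name': 'Song B'}}]"

# The playlist ID is the last path segment of the href
playlist_id = playlist_href.split("/")[-1]

# literal_eval parses the string back into a list of dicts without executing code
songs = ast.literal_eval(songs_raw)

# Append a marker dict so that the ID travels with the song list
songs_with_id = songs + [{"playlist_id": playlist_id}]

print(playlist_id)
print(len(songs_with_id))
```

The trailing marker dict is what later allows each song to be paired with its playlist through `song_list[-1]['playlist_id']`.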

In [4]:
# Copy was made to avoid SettingWithCopyWarning
playlists = df[['playlist_id', 'name', 'description']].copy()
playlists['num_followers'] = df['followers']

The helper functions below create a ready-to-use data frame of song details. `artist_list_to_string` joins the artists' names into a single comma-separated string (SQLite has no list type), and `clean_song_data` extracts only the relevant information for each song.

In [5]:
def artist_list_to_string(artists):
    # Join the artists' names into one comma-separated string
    # (avoids shadowing the built-in `str`)
    return ", ".join(artist['name'] for artist in artists)
    
def clean_song_data(song: dict, playlist_id):
    # `song` is a single playlist item (a dict) from the API response
    row = {}

    row['song_id'] = song['track']['external_ids']['isrc']
    row['playlist_id'] = playlist_id
    row['album_id'] = song['track']['album']['id']
    row['title'] = song['track']['name']
    row['release_date'] = song['track']['album']['release_date']
    # Needed later when the album columns are split off
    row['release_date_precision'] = song['track']['album']['release_date_precision']
    row['is_explicit'] = song['track']['explicit']
    row['album_name'] = song['track']['album']['name']
    row['image_url'] = song['track']['album']['images'][2]['url']
    row['artists'] = artist_list_to_string(song['track']['artists'])
    row['popularity'] = song['track']['popularity']

    return row
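
`clean_song_data` indexes directly into the nested response, so a track without an ISRC or an album with fewer than three images would raise a `KeyError` or an `IndexError`. Whether such tracks occur in this dataset is not checked here; the following is only a sketch of a more tolerant lookup, and `get_nested` is a hypothetical helper that is not part of the notebook.

```python
def get_nested(data, path, default=None):
    """Walk `path` (dict keys or list indexes) through nested containers.

    Returns `default` as soon as one step is missing.
    """
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical song entry with no ISRC and only one album image
song = {
    "track": {
        "name": "Example",
        "external_ids": {},
        "album": {"images": [{"url": "https://example.com/cover.jpg"}]},
    }
}

print(get_nested(song, ["track", "name"]))
print(get_nested(song, ["track", "external_ids", "isrc"]))
print(get_nested(song, ["track", "album", "images", 2, "url"]))
```

Rows for which the ISRC comes back as `None` would still need to be dropped or handled separately, since `song_id` is meant to serve as a key.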

With these functions defined, we can apply them to transform the songs data into a more manageable format.

In [6]:
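# chatGPT helped here
# The last element of each list holds the playlist ID; all other elements are songs
songs = df['songs'].apply(lambda song_list: [clean_song_data(song, song_list[-1]['playlist_id']) for song in song_list[:-1]])
# One row per song, one column per extracted field
songs = songs.explode().apply(pd.Series)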
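
The flattening step can be illustrated on toy data. The records below are made up; the sketch only shows how a Series holding lists of dicts becomes a data frame with one row per dict. It also shows `pd.DataFrame(...)` applied to the list of dicts, an alternative construction that is not used in this notebook but yields the same rows.

```python
import pandas as pd

# Toy Series: one list of song dicts per playlist (made-up values)
nested = pd.Series([
    [{"song_id": "A", "playlist_id": "p1"}, {"song_id": "B", "playlist_id": "p1"}],
    [{"song_id": "A", "playlist_id": "p2"}],
])

# explode() yields one dict per row; the original index labels are repeated
exploded = nested.explode()

# apply(pd.Series) expands every dict into columns
flat = exploded.apply(pd.Series).reset_index(drop=True)

# Alternative: build the frame directly from the list of dicts
alt = pd.DataFrame(exploded.tolist())

print(flat)
```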

## Structuring the data
To remove redundant records and minimize the storage required, we further transform the data frames.

### Check for duplicates
First, we check for duplicates. The uniqueness of every playlist is ensured by the return statement of the `get_playlists_links` function in notebook NB01. The same song, however, can naturally appear in multiple playlists. To accommodate future exploratory data analysis, we create a new variable that counts how many times a song appears across all the playlists considered.

In [7]:
# The output of this cell motivates the creation of value counts variable
songs['num_occurrances'] = songs['song_id'].map(songs['song_id'].value_counts())
songs.loc[songs['num_occurrances'] > 2].set_index('num_occurrances').head()

| num_occurrances | song_id | playlist_id | album_id | title | release_date | release_date_precision | is_explicit | album_name | image_url | artists | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | USSM12402296 | 37i9dQZF1DWYrgs30Ir8ow | 17R63Sb8OrPikc2R4mlpdC | GIRLS | 2024-06-28 | day | True | GIRLS | https://i.scdn.co/image/ab67616d00004851c8af72... | The Kid LAROI | 85 |
| 3 | USUG12403632 | 37i9dQZF1DWYrgs30Ir8ow | 1SiWjniEb94wSFZ5YjZDHr | Did It First (with Central Cee) | 2024-07-11 | day | True | Did It First (with Central Cee) | https://i.scdn.co/image/ab67616d00004851dc9f83... | Ice Spice, Central Cee | 84 |
| 3 | USUG12401967 | 37i9dQZF1DWYrgs30Ir8ow | 1WAjjRMfZjEXtB0lQrAw6Q | Good Luck, Babe! | 2024-04-05 | day | False | Good Luck, Babe! | https://i.scdn.co/image/ab67616d0000485191b4bc... | Chappell Roan | 95 |
| 3 | USAT22401341 | 37i9dQZF1DWYrgs30Ir8ow | 2lIZef4lzdvZkiiCzvPKj7 | 360 | 2024-06-07 | day | True | BRAT | https://i.scdn.co/image/ab67616d0000485188e382... | Charli xcx | 86 |
| 3 | USUG12400968 | 37i9dQZF1DWYrgs30Ir8ow | 5ylbxH7EqpsmHZCRuiYewS | Si Antes Te Hubiera Conocido | 2024-06-21 | day | False | Si Antes Te Hubiera Conocido | https://i.scdn.co/image/ab67616d00004851491678... | KAROL G | 94 |
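
How the combination of `map` and `value_counts` builds the counter can be seen on a toy column. The IDs below are made up.

```python
import pandas as pd

toy = pd.DataFrame({"song_id": ["A", "B", "A", "C", "A", "B"]})

# value_counts() returns a Series indexed by song_id, with the counts as values
counts = toy["song_id"].value_counts()

# map() looks up every row's song_id in that Series
toy["num_occurrances"] = toy["song_id"].map(counts)

print(toy)
```

Every row of song `A` receives the value 3, so the count is repeated on each duplicate row and survives a later deduplication by `song_id`.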


Additionally, multiple songs can come from the same album. For that reason we split the `songs` table into two parts: variables that describe the song itself and variables that describe its album. Next, we create a table that represents the relation between songs and the albums they belong to. A similar relation holds between songs and playlists, which is reflected in the code below.

In [8]:
albums = songs[['album_id', 'album_name', 'release_date', 'release_date_precision', 'image_url']]
song_album_mapping = songs[['song_id', 'album_id']]
# The playlist table was created at the top of the notebook
song_playlist_mapping = songs[['song_id', 'playlist_id']]
songs = songs.drop(columns=['album_id', 'album_name', 'release_date', 'release_date_precision', 'image_url', 'playlist_id'])

Finally, we check whether any rows share the same values in the columns that will serve as keys in the database. Such rows need careful treatment, because duplicated key values would cause an error when the tables are populated.

In [9]:
print(albums.value_counts().head(1))
print(playlists.value_counts().head(1))
print(songs.value_counts().head(1))
print(song_album_mapping.value_counts().head(1))
print(song_playlist_mapping.value_counts().head(1))

album_id                album_name  release_date  release_date_precision  image_url                                                       
151w1FgRZfnKZA9FEcg9Z3  Midnights   2022-10-21    day                     https://i.scdn.co/image/ab67616d00004851bb54dde68cd23e2a268ae0f5    10
Name: count, dtype: int64
playlist_id             name         description                                                      num_followers
37i9dQZF1DWSBi5svWQ9Nk  Hot Hits NL  De 50 populairste hits van nu. Cover: Roxy Dekker & Ronnie Flex  922309           1
Name: count, dtype: int64
song_id       title                     is_explicit  artists   popularity  num_occurrances
GBAYE2400676  feelslikeimfallinginlove  False        Coldplay  85          5                  5
Name: count, dtype: int64
song_id       album_id              
GBAYE2400676  6RjTapeTvms8jSeIRGc5Ve    5
Name: count, dtype: int64
song_id       playlist_id           
ZA34K2401960  37i9dQZF1DXcRXFNfZr7Tp    1
Name: count, dtype: int64


As expected, `albums`, `songs` and `song_album_mapping` contain duplicates. To avoid problems with duplicated keys later, we drop them before proceeding.

In [10]:
albums = albums.drop_duplicates(subset='album_id')
songs = songs.drop_duplicates(subset='song_id')
song_album_mapping = song_album_mapping.drop_duplicates()
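
A more direct way to confirm that a column is safe to use as a primary key is to test the key columns themselves instead of counting whole rows. The sketch below uses made-up frames; `assert_unique_key` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd


def assert_unique_key(frame, key_columns, table_name):
    """Raise ValueError if any combination of `key_columns` repeats."""
    n_duplicates = int(frame.duplicated(subset=key_columns).sum())
    if n_duplicates:
        raise ValueError(
            f"{table_name}: {n_duplicates} duplicated key(s) in {key_columns}"
        )


# Made-up data: album 'x' appears twice before deduplication
toy_albums = pd.DataFrame(
    {"album_id": ["x", "x", "y"], "album_name": ["X", "X", "Y"]}
)

deduplicated = toy_albums.drop_duplicates(subset="album_id")
assert_unique_key(deduplicated, ["album_id"], "albums")  # passes silently
```

For the mapping tables, the same check could be run with both key columns, for example `["song_id", "playlist_id"]`.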

## Creation of relational database
Having cleaned and divided the data into meaningful tables, we can now create a database to store all the information. We will be using [SQLite](https://www.sqlite.org/index.html) because the database can be stored locally as a single file, and its limitation of one writer at a time is not an issue for this project.

In [11]:
%load_ext sql
%config SqlMagic.autocommit=True

### Modification of data types
Initially, to ensure efficient and reliable storage of information in the database, we wanted to convert the columns to the most compact types available. The [Datatypes page of SQLite](https://www.sqlite.org/datatype3.html), however, explains that every value is stored in one of five generic storage classes: NULL, INTEGER, REAL, TEXT or BLOB, and that a declared column type only sets a preferred affinity. For that reason, we refrain from optimizing the data types.
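
The point about SQLite's generic types can be checked directly. The following sketch uses an in-memory database and a throwaway table named `demo` (hypothetical); it shows that a declared `VARCHAR(5)` does not truncate text and that a `TINYINT` column stores a large number as a plain INTEGER.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (short_text VARCHAR(5), small_number TINYINT)")

# The text is longer than 5 characters and the number does not fit in one byte
conn.execute("INSERT INTO demo VALUES (?, ?)", ("much longer than five", 100000))

row = conn.execute(
    "SELECT short_text, typeof(short_text), small_number, typeof(small_number) "
    "FROM demo"
).fetchone()
conn.close()

print(row)
```

Both values are stored unchanged, which is why the declared lengths in a schema serve as documentation rather than as constraints.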

In [12]:
print(albums.head(1))
print(songs.head(1))
print(song_album_mapping.head(1))
print(song_playlist_mapping.head(1))
print(playlists.head(1))

                 album_id album_name release_date release_date_precision  \
0  17R63Sb8OrPikc2R4mlpdC      GIRLS   2024-06-28                    day   

                                           image_url  
0  https://i.scdn.co/image/ab67616d00004851c8af72...  
        song_id  title  is_explicit        artists  popularity  \
0  USSM12402296  GIRLS         True  The Kid LAROI          85   

   num_occurrances  
0                3  
        song_id                album_id
0  USSM12402296  17R63Sb8OrPikc2R4mlpdC
        song_id             playlist_id
0  USSM12402296  37i9dQZF1DWYrgs30Ir8ow
              playlist_id        name  \
0  37i9dQZF1DWYrgs30Ir8ow  Fresh Hits   

                                       description  num_followers  
0  Altijd fris in Fresh Hits. Cover: The Kid LAROI          27226  


Below, the schemas of all tables are created with suitable data types and keys.

In [13]:
%%sql sqlite:///../data/clean/spotify_playlists.db --alias db

CREATE TABLE albums
(
    album_id CHAR(22) PRIMARY KEY,
    album_name VARCHAR(50),
    release_date DATE,
    release_date_precision VARCHAR(5),
    image_url CHAR(64)
);

CREATE TABLE songs
(
    song_id CHAR(12) PRIMARY KEY,
    title VARCHAR(50),
    is_explicit BOOLEAN,
    artists VARCHAR(50),
    popularity TINYINT,
    num_occurrances TINYINT
);

CREATE TABLE song_album_map
(
    song_id CHAR(12),
    album_id CHAR(22),
    PRIMARY KEY (song_id, album_id)
);

CREATE TABLE song_playlist_map
(
    song_id CHAR(12),
    playlist_id CHAR(22),
    PRIMARY KEY (song_id, playlist_id)
);

CREATE TABLE playlists
(
    playlist_id CHAR(22) PRIMARY KEY,
    name VARCHAR(50),
    description VARCHAR(100),
    num_followers INT
);
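
The mapping tables above declare composite primary keys but no foreign keys. If referential integrity is wanted, one possible variant (an assumption, not the schema used in this notebook) adds `FOREIGN KEY` clauses. The sketch below runs on an in-memory database with reduced, made-up tables. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default

conn.executescript("""
CREATE TABLE songs (song_id CHAR(12) PRIMARY KEY, title VARCHAR(50));
CREATE TABLE albums (album_id CHAR(22) PRIMARY KEY, album_name VARCHAR(50));
CREATE TABLE song_album_map
(
    song_id CHAR(12),
    album_id CHAR(22),
    PRIMARY KEY (song_id, album_id),
    FOREIGN KEY (song_id) REFERENCES songs (song_id),
    FOREIGN KEY (album_id) REFERENCES albums (album_id)
);
""")

conn.execute("INSERT INTO songs VALUES ('S1', 'Example song')")
conn.execute("INSERT INTO albums VALUES ('A1', 'Example album')")
conn.execute("INSERT INTO song_album_map VALUES ('S1', 'A1')")  # accepted

try:
    # 'MISSING' is not present in albums, so the row should be refused
    conn.execute("INSERT INTO song_album_map VALUES ('S1', 'MISSING')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

n_rows = conn.execute("SELECT COUNT(*) FROM song_album_map").fetchone()[0]
conn.close()

print(rejected, n_rows)
```

With such constraints in place, the parent tables (`songs`, `albums`, `playlists`) would have to be populated before the mapping tables.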

Now, we can populate the tables with the cleaned data.

In [14]:
conn = sqlite3.connect('../data/clean/spotify_playlists.db')

albums.to_sql('albums', conn, if_exists='append', index=False)
songs.to_sql('songs', conn, if_exists='append', index=False)
playlists.to_sql('playlists', conn, if_exists='append', index=False)
song_album_mapping.to_sql('song_album_map', conn, if_exists='append', index=False)
song_playlist_mapping.to_sql('song_playlist_map', conn, if_exists='append', index=False)

1938
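
One way to confirm that the load succeeded is to read the row counts back from the database. The sketch below demonstrates the round trip on an in-memory database with made-up rows; against the real file, the same query could be run on a connection to `../data/clean/spotify_playlists.db`.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Made-up mapping rows standing in for song_playlist_mapping
toy_mapping = pd.DataFrame(
    {"song_id": ["S1", "S2", "S1"], "playlist_id": ["P1", "P1", "P2"]}
)
toy_mapping.to_sql("song_playlist_map", conn, if_exists="append", index=False)

# Count the stored rows and compare with the data frame
stored = pd.read_sql_query("SELECT COUNT(*) AS n FROM song_playlist_map", conn)
n_stored = int(stored.loc[0, "n"])
conn.close()

print(n_stored == len(toy_mapping))
```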