# Preprocessing
This notebook transforms the raw data collected from Spotify's API, adjusting its structure and data types for storage in a relational database.

In [26]:
# Import necessary packages

import os
import sqlite3
import pandas as pd
import ast
import json

In [2]:
# Load the dataset
df = pd.read_csv('../data/raw/spotify_playlists.csv')

## Data Extraction
We begin by extracting only the information we are interested in and creating separate data frames that will later become database tables.

In [3]:
df['playlist_id'] = df['playlist_href'].apply(lambda x: x.split('/')[-1])
df['songs'] = df['songs'].apply(lambda x: ast.literal_eval(x))
# chatGPT helped here
df['songs'] = [songs + [{'playlist_id': playlist_id}] for songs, playlist_id in zip(df['songs'], df['playlist_id'])]
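
The two transformations above can be illustrated on a single record. This is a minimal sketch: the `raw_songs` string is a made-up stand-in for one cell of the `songs` column, and the variable names are chosen for illustration only.

```python
import ast

# Illustrative values; the real ones come from spotify_playlists.csv.
href = "https://api.spotify.com/v1/playlists/37i9dQZF1DWSBi5svWQ9Nk"
raw_songs = "[{'track': {'name': 'Riptide', 'explicit': False}}]"

# The playlist ID is the last path segment of the API link.
playlist_id = href.split("/")[-1]

# literal_eval parses a string holding Python literals into real objects
# without executing arbitrary code (unlike eval).
parsed_songs = ast.literal_eval(raw_songs)

print(playlist_id)
print(parsed_songs[0]["track"]["name"])
```

Since the CSV stores each song list as the string form of a Python list, `ast.literal_eval` is used here rather than `json.loads`, which would reject the single quotes and the `False` literal.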

In [4]:
# Copy was made to avoid SettingWithCopyWarning
playlists = df[['playlist_id', 'name', 'description']].copy()
playlists['num_followers'] = df['followers']
playlists['link'] = df['playlist_href']

The helper functions below build a ready-to-use data frame of song details. `artist_list_to_string` joins the artists' names into a single comma-separated string (SQLite has no list type), and `clean_song_data` extracts only the relevant fields for each song.

In [5]:
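def artist_list_to_string(artists):
    # Join the artists' names with ", " (avoids shadowing the built-in `str`)
    return ", ".join(artist['name'] for artist in artists)

# `song` is a single playlist item, i.e. a dict with a nested 'track' entry
def clean_song_data(song: dict, playlist_id):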
    row = {}
    
    row['song_id'] = song['track']['external_ids']['isrc']
    row['playlist_id'] = playlist_id
    row['album_id'] = song['track']['album']['id']
    row['title'] = song['track']['name']
    row['release_date'] = song['track']['album']['release_date']
    row['release_date_precision'] = song['track']['album']['release_date_precision']
    row['explicit'] = song['track']['explicit']
    row['album_name'] = song['track']['album']['name']
    row['image'] = song['track']['album']['images'][2]['url']
    row['artists'] = artist_list_to_string(song['track']['artists'])
    row['popularity'] = song['track']['popularity']

    return row
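
The direct lookups in `clean_song_data` raise an error as soon as one nested field is absent, for example when an album has fewer than three images. Assuming such records may occur in the raw data, a more defensive lookup could be sketched as follows. `safe_get` and `sample_song` are hypothetical names introduced for this illustration and are not used elsewhere in the notebook.

```python
def safe_get(record, *keys, default=None):
    """Walk nested dicts/lists and return `default` if any step is missing."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical playlist item whose album carries no images.
sample_song = {
    "track": {
        "name": "Riptide",
        "external_ids": {"isrc": "AULI01385760"},
        "artists": [{"name": "Vance Joy"}],
        "album": {"id": "5S9b8euumqMhQbMk0zzQdH", "images": []},
    }
}

song_id = safe_get(sample_song, "track", "external_ids", "isrc")
image = safe_get(sample_song, "track", "album", "images", 2, "url")
artists = ", ".join(
    artist["name"] for artist in safe_get(sample_song, "track", "artists", default=[])
)

print(song_id, image, artists)
```

Missing values then show up as `None` in the resulting data frame instead of interrupting the whole transformation.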

After defining these functions, we apply them to transform the songs data into a more manageable format.

In [6]:
# chatGPT helped here
songs = df['songs'].apply(lambda song_list: [clean_song_data(song, song_list[-1]['playlist_id']) for song in song_list[:-1]])
songs = songs.explode().apply(pd.Series)
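
The flattening step can be seen on a small example. `explode` turns each list element into its own row, and the dictionaries are then expanded into columns. Because `apply(pd.Series)` builds one Series per row, it can be slow on large inputs; the sketch below uses the `DataFrame` constructor on made-up data as one possible alternative, not as the method used in this notebook.

```python
import pandas as pd

# Made-up nested data: one list of song dicts per playlist.
nested = pd.Series([
    [{"song_id": "A", "playlist_id": "p1"}, {"song_id": "B", "playlist_id": "p1"}],
    [{"song_id": "A", "playlist_id": "p2"}],
])

# One row per song; dropna() discards playlists with an empty song list.
exploded = nested.explode().dropna()

# Build all columns in one pass instead of one pd.Series per row.
flat = pd.DataFrame(exploded.tolist(), index=exploded.index)

print(flat)
```

The index of `flat` still points at the originating playlist row, so the first playlist contributes two rows and the second contributes one.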

## Structuring the data
To remove redundant records and minimize the storage required, we further transform the data frames.

### Check for duplicates
First, we check for duplicates. Each playlist is unique, which is guaranteed by the return statement of the `get_playlists_links` function in notebook NB01. The same song, however, can naturally appear in several playlists. To support later exploratory data analysis, we add a variable that counts how many times each song appears across all playlists considered.

In [7]:
# The output of this cell motivates the creation of value counts variable
songs['num_occurances'] = songs['song_id'].map(songs['song_id'].value_counts())
songs.loc[songs['num_occurances'] > 2].set_index('num_occurances').head()

| num_occurances | song_id | playlist_id | album_id | title | release_date | release_date_precision | explicit | album_name | image | artists | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | USHR10924519 | 37i9dQZF1DXd1MXcE8WTXq | 64aKkqxc3Ur2LYIKeS5osS | Party In The U.S.A. | 2009-01-01 | day | False | The Time Of Our Lives | https://i.scdn.co/image/ab67616d00004851d6c3ad... | Miley Cyrus | 79 |
| 4 | GBARL1800368 | 37i9dQZF1DXd1MXcE8WTXq | 7GEzhoTiqcPYkOprWQu581 | One Kiss (with Dua Lipa) | 2018-04-06 | day | False | One Kiss (with Dua Lipa) | https://i.scdn.co/image/ab67616d00004851d09f96... | Calvin Harris, Dua Lipa | 85 |
| 3 | USUM71514637 | 37i9dQZF1DXd1MXcE8WTXq | 1sfmaFRbpMEGHH7DGup1wT | Cake By The Ocean | 2015-10-23 | day | True | SWAAY | https://i.scdn.co/image/ab67616d00004851eed1c8... | DNCE | 0 |
| 3 | GBARL1401524 | 37i9dQZF1DXd1MXcE8WTXq | 3vLaOYCNCzngDf8QdBg2V1 | Uptown Funk (feat. Bruno Mars) | 2015-01-12 | day | True | Uptown Special | https://i.scdn.co/image/ab67616d00004851e419cc... | Mark Ronson, Bruno Mars | 83 |
| 3 | AULI01385760 | 37i9dQZF1DXd1MXcE8WTXq | 5S9b8euumqMhQbMk0zzQdH | Riptide | 2014-09-08 | day | False | Dream Your Life Away (Special Edition) | https://i.scdn.co/image/ab67616d00004851a9929d... | Vance Joy | 81 |
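
How the counting works can be checked on a toy frame. `value_counts` returns a Series indexed by `song_id`, and `map` looks up each row's ID in it. The data below is made up; the `groupby` variant is shown only as an equivalent alternative.

```python
import pandas as pd

# Made-up data: song "A" appears in three playlists.
toy = pd.DataFrame({
    "song_id": ["A", "B", "A", "C", "A"],
    "playlist_id": ["p1", "p1", "p2", "p2", "p3"],
})

# Series mapping each song_id to its number of rows.
counts = toy["song_id"].value_counts()
toy["num_occurances"] = toy["song_id"].map(counts)

# Equivalent result computed per group, without the intermediate Series.
via_groupby = toy.groupby("song_id")["song_id"].transform("count")

print(toy)
```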


Additionally, several songs can come from the same album. For that reason we split the `songs` table into two parts: variables that describe the songs themselves and variables that describe their albums. We then create a table that represents the relation between songs and the albums they belong to. A similar relation holds between songs and playlists, which the code below also captures.

In [8]:
albums = songs[['album_id', 'album_name', 'release_date', 'release_date_precision', 'image']]
song_album_mapping = songs[['song_id', 'album_id']]
# The playlist table was created at the top of the notebook
song_playlist_mapping = songs[['song_id', 'playlist_id']]
songs = songs.drop(columns=['album_id', 'album_name', 'release_date', 'release_date_precision', 'image', 'playlist_id'])

Finally, we check whether any observations share the same values in the variables that will serve as keys in the database. Such rows need careful treatment because they may cause errors at a later stage.

In [25]:
print(albums.value_counts().head(1))
print(playlists.value_counts().head(1))
print(songs.value_counts().head(1))
print(song_album_mapping.value_counts().head(1))
print(song_playlist_mapping.value_counts().head(1))

album_id                album_name  release_date  release_date_precision  image                                                           
151w1FgRZfnKZA9FEcg9Z3  Midnights   2022-10-21    day                     https://i.scdn.co/image/ab67616d00004851bb54dde68cd23e2a268ae0f5    10
Name: count, dtype: int64
playlist_id             name         description                                                      num_followers  link                                                       
37i9dQZF1DWSBi5svWQ9Nk  Hot Hits NL  De 50 populairste hits van nu. Cover: Roxy Dekker & Ronnie Flex  921925         https://api.spotify.com/v1/playlists/37i9dQZF1DWSBi5svWQ9Nk    1
Name: count, dtype: int64
song_id       title                                  explicit  artists                     popularity  num_occurances
USUM72404990  I Had Some Help (Feat. Morgan Wallen)  True      Post Malone, Morgan Wallen  95          4                 4
Name: count, dtype: int64
song_id       album_id              
U

As expected, `albums`, `songs` and `song_album_mapping` contain duplicates. To avoid problems later on, we drop them before proceeding.

In [None]:
albums = albums.drop_duplicates()
songs = songs.drop_duplicates()
song_album_mapping = song_album_mapping.drop_duplicates()
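
Note that `drop_duplicates()` without arguments removes only rows that are identical in every column. If two rows shared a key but differed elsewhere (for instance, the same `song_id` with different `popularity` values, which is an assumption about the data and not something observed above), the key would still not be unique. A minimal sketch of a key-based check on made-up data:

```python
import pandas as pd

# Made-up data: "S1" appears twice with different popularity values.
songs_toy = pd.DataFrame({
    "song_id": ["S1", "S1", "S2"],
    "title": ["Song One", "Song One", "Song Two"],
    "popularity": [80, 81, 60],
})

# Full-row deduplication keeps both "S1" rows because they differ in popularity.
deduped = songs_toy.drop_duplicates()

# Rows whose key value occurs more than once.
key_clashes = deduped[deduped.duplicated(subset=["song_id"], keep=False)]

# One possible resolution: keep the first row per key.
one_per_key = deduped.drop_duplicates(subset=["song_id"], keep="first")

print(key_clashes)
print(one_per_key["song_id"].is_unique)
```

Which row to keep per key is a modelling decision; `keep="first"` is used here only for illustration.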

## Creating the relational database
Having cleaned and divided the data into meaningful tables, we can now create a database to store all the information. We use [SQLite](https://www.sqlite.org/index.html) because the database can be stored locally as a single file, and its limit of one writer at a time is not an issue for this project.

In [None]:
%load_ext sql
%config SqlMagic.autocommit=True

In [None]:
%%sql sqlite:///../data/clean/spotify_playlists.db --alias db

CREATE TABLE spotify_playlists
(
    
)
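
As an alternative to the SQL magic, a table could also be created and filled with the `sqlite3` module and `DataFrame.to_sql`. The sketch below is an assumption about what such a schema might look like: the table name `playlists`, the column types, the primary key, and the placeholder description are not taken from this notebook, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the `playlists` data frame built earlier.
playlists_toy = pd.DataFrame({
    "playlist_id": ["37i9dQZF1DWSBi5svWQ9Nk"],
    "name": ["Hot Hits NL"],
    "description": ["placeholder description"],
    "num_followers": [921925],
    "link": ["https://api.spotify.com/v1/playlists/37i9dQZF1DWSBi5svWQ9Nk"],
})

# In-memory database; a file path would be passed here instead.
con = sqlite3.connect(":memory:")

# Assumed schema: the primary key makes SQLite reject duplicate playlist IDs.
con.execute(
    """
    CREATE TABLE playlists (
        playlist_id   TEXT PRIMARY KEY,
        name          TEXT,
        description   TEXT,
        num_followers INTEGER,
        link          TEXT
    )
    """
)

# Append the rows to the existing table instead of letting pandas recreate it.
playlists_toy.to_sql("playlists", con, if_exists="append", index=False)

rows = con.execute("SELECT playlist_id, num_followers FROM playlists").fetchall()
con.close()

print(rows)
```

Creating the table explicitly before calling `to_sql` keeps the key constraints, which pandas would not add if it created the table itself.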