## STEP 1: DATA COLLECTION

**Dataset column descriptions:**

- **valence**: Indicates the musical positiveness of a track, where high values represent happier and more positive sounds, and low values indicate sadder or negative tones.
- **year**: The release year of the track. It is useful for analyzing trends over time.
- **acousticness**: Measures how acoustic a track sounds. High values suggest the track is more acoustic, while low values imply more electronic or synthetic sounds.
- **artists**: Names of the artists involved in the track. It helps to group songs by the same artist.
- **danceability**: Reflects how suitable a track is for dancing. High values denote easier dance rhythms, while low values indicate more complex or irregular beats.
- **duration_ms**: Duration of the track in milliseconds. Longer durations are often associated with extended compositions or live recordings.
- **energy**: Represents the intensity and activity level of a track. Higher values suggest energetic tracks, while lower values indicate calm or mellow tracks.
- **explicit**: Indicates if the track has explicit lyrics (1 = yes, 0 = no).
- **id**: Unique Spotify ID for the track.
- **instrumentalness**: Predicts the likelihood of a track being instrumental. High values (close to 1.0) suggest little or no vocals, while low values indicate vocal presence.
- **key**: The musical key of the track, represented by numbers (0 to 11).
- **liveness**: Estimates the presence of a live audience. High values suggest a live performance, while low values indicate studio recordings.
- **loudness**: Overall loudness of the track in decibels. Higher values mean louder tracks, while lower values are quieter.
- **mode**: Indicates modality (1 = major, 0 = minor); major often sounds happier, while minor is generally more somber.
- **name**: The title of the track.
- **popularity**: Popularity score on Spotify (0 to 100). High values indicate popular tracks, while low values indicate less popular tracks.
- **release_date**: The date the track was released. It helps to analyze time-based trends.
- **speechiness**: Measures the presence of spoken words in the track. High values indicate more speech-like content (e.g., podcasts), while low values suggest minimal or no speech.
- **tempo**: Tempo of the track in beats per minute (BPM). High values indicate faster tracks, while low values indicate slower songs.

This dataset provides a range of audio features and popularity metrics for each track, which are essential inputs for building a recommendation system.
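
As a small illustration of how two of these columns can be made easier to read, the sketch below converts `key` to a pitch-class name and `duration_ms` to minutes. It assumes `key` follows standard pitch-class notation (0 = C, 1 = C#/Db, and so on up to 11 = B), which is how Spotify documents the field. The names `PITCH_CLASSES`, `key_to_name`, and `ms_to_minutes` are hypothetical helpers, not part of the dataset or of this notebook.

```python
# Hypothetical helpers for reading the `key` and `duration_ms` columns.
# Assumption: `key` uses standard pitch-class notation (0 = C ... 11 = B).
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def key_to_name(key: int) -> str:
    """Map a key number (0 to 11) to its pitch-class name."""
    if not 0 <= key <= 11:
        raise ValueError(f"key must be between 0 and 11, got {key}")
    return PITCH_CLASSES[key]


def ms_to_minutes(duration_ms: float) -> float:
    """Convert a duration in milliseconds to minutes."""
    return duration_ms / 60_000


# Example values: key=7 and duration_ms=180533.
print(key_to_name(7))                    # G
print(round(ms_to_minutes(180533), 2))   # 3.01
```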

### Import necessary libraries

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### 1.1 Load datasets from file paths

In [14]:
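# Build paths with os.path.join so they work on any OS and avoid invalid escape sequences such as '\D'
data_path = os.path.join('..', 'Dataset', 'data.csv')
genre_data_path = os.path.join('..', 'Dataset', 'data_by_genres.csv')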

# Check if files exist and load them
if os.path.exists(data_path) and os.path.exists(genre_data_path):
    data = pd.read_csv(data_path)
    genre_data = pd.read_csv(genre_data_path)
    print("Info: Data and genre data successfully loaded.")
else:
    print("Attention: One or both files are not found in the specified directory.")

Info: Data and genre data successfully loaded.
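
If a file is missing, the cell above only prints a message, so later cells would fail with a `NameError` on `data`. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to fail immediately with a clear error. The helper name `load_csv_or_fail` is hypothetical, and the folder layout simply mirrors the paths used above.

```python
from pathlib import Path

import pandas as pd


def load_csv_or_fail(path) -> pd.DataFrame:
    """Read a CSV file, raising FileNotFoundError with a clear message if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    return pd.read_csv(path)


# Path built from parts so it works on Windows, macOS, and Linux.
dataset_dir = Path("..") / "Dataset"

# Usage (uncomment in the notebook):
# data = load_csv_or_fail(dataset_dir / "data.csv")
# genre_data = load_csv_or_fail(dataset_dir / "data_by_genres.csv")
```

Raising early keeps the error close to its cause instead of surfacing several cells later.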


### 1.2 Overview of Dataset
This section gives an overview of the basic structure of both datasets: their dimensions, a preview of the first few rows, and the data type of each column.

In [16]:
# Display shapes of datasets
print("Data Shape (rows, columns):", data.shape)
print("Genre Data Shape (rows, columns):", genre_data.shape)

Data Shape (rows, columns): (170653, 19)
Genre Data Shape (rows, columns): (2973, 14)


In [18]:
# Display the first 10 rows of each dataframe
print("\n Data - First 10 Rows")
display(data.head(10))

print("\n Genre Data - First 10 Rows:")
display(genre_data.head(10))


 Data - First 10 Rows


|   | valence | year | acousticness | artists | danceability | duration_ms | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0594 | 1921 | 0.982 | ['Sergei Rachmaninoff', 'James Levine', 'Berli... | 0.279 | 831667 | 0.211 | 0 | 4BJqT0PrAfrxzMOxytFOIz | 0.878 | 10 | 0.665 | -20.096 | 1 | Piano Concerto No. 3 in D Minor, Op. 30: III. ... | 4 | 1921 | 0.0366 | 80.954 |
| 1 | 0.963 | 1921 | 0.732 | ['Dennis Day'] | 0.819 | 180533 | 0.341 | 0 | 7xPhfUan2yNtyFG0cUWkt8 | 0.0 | 7 | 0.16 | -12.441 | 1 | Clancy Lowered the Boom | 5 | 1921 | 0.415 | 60.936 |
| 2 | 0.0394 | 1921 | 0.961 | ['KHP Kridhamardawa Karaton Ngayogyakarta Hadi... | 0.328 | 500062 | 0.166 | 0 | 1o6I8BglA6ylDMrIELygv1 | 0.913 | 3 | 0.101 | -14.85 | 1 | Gati Bali | 5 | 1921 | 0.0339 | 110.339 |
| 3 | 0.165 | 1921 | 0.967 | ['Frank Parker'] | 0.275 | 210000 | 0.309 | 0 | 3ftBPsC5vPBKxYSee08FDH | 2.8e-05 | 5 | 0.381 | -9.316 | 1 | Danny Boy | 3 | 1921 | 0.0354 | 100.109 |
| 4 | 0.253 | 1921 | 0.957 | ['Phil Regan'] | 0.418 | 166693 | 0.193 | 0 | 4d6HGyGT8e121BsdKmw9v6 | 2e-06 | 3 | 0.229 | -10.096 | 1 | When Irish Eyes Are Smiling | 2 | 1921 | 0.038 | 101.665 |
| 5 | 0.196 | 1921 | 0.579 | ['KHP Kridhamardawa Karaton Ngayogyakarta Hadi... | 0.697 | 395076 | 0.346 | 0 | 4pyw9DVHGStUre4J6hPngr | 0.168 | 2 | 0.13 | -12.506 | 1 | Gati Mardika | 6 | 1921 | 0.07 | 119.824 |
| 6 | 0.406 | 1921 | 0.996 | ['John McCormack'] | 0.518 | 159507 | 0.203 | 0 | 5uNZnElqOS3W4fRmRYPk4T | 0.0 | 0 | 0.115 | -10.589 | 1 | The Wearing of the Green | 4 | 1921 | 0.0615 | 66.221 |
| 7 | 0.0731 | 1921 | 0.993 | ['Sergei Rachmaninoff'] | 0.389 | 218773 | 0.088 | 0 | 02GDntOXexBFUvSgaXLPkd | 0.527 | 1 | 0.363 | -21.091 | 0 | Morceaux de fantaisie, Op. 3: No. 2, Prélude i... | 2 | 1921 | 0.0456 | 92.867 |
| 8 | 0.721 | 1921 | 0.996 | ['Ignacio Corsini'] | 0.485 | 161520 | 0.13 | 0 | 05xDjWH9ub67nJJk82yfGf | 0.151 | 5 | 0.104 | -21.508 | 0 | La Mañanita - Remasterizado | 0 | 1921-03-20 | 0.0483 | 64.678 |
| 9 | 0.771 | 1921 | 0.982 | ['Fortugé'] | 0.684 | 196560 | 0.257 | 0 | 08zfJvRLp7pjAb94MA9JmF | 0.0 | 8 | 0.504 | -16.415 | 1 | Il Etait Syndiqué | 0 | 1921 | 0.399 | 109.378 |



 Genre Data - First 10 Rows:


|   | mode | genres | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | popularity | key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 21st century classical | 0.979333 | 0.162883 | 160297.7 | 0.071317 | 0.606834 | 0.3616 | -31.514333 | 0.040567 | 75.3365 | 0.103783 | 27.833333 | 6 |
| 1 | 1 | 432hz | 0.49478 | 0.299333 | 1048887.0 | 0.450678 | 0.477762 | 0.131 | -16.854 | 0.076817 | 120.285667 | 0.22175 | 52.5 | 5 |
| 2 | 1 | 8-bit | 0.762 | 0.712 | 115177.0 | 0.818 | 0.876 | 0.126 | -9.18 | 0.047 | 133.444 | 0.975 | 48.0 | 7 |
| 3 | 1 | [] | 0.651417 | 0.529093 | 232880.9 | 0.419146 | 0.205309 | 0.218696 | -12.288965 | 0.107872 | 112.857352 | 0.513604 | 20.859882 | 7 |
| 4 | 1 | a cappella | 0.676557 | 0.538961 | 190628.5 | 0.316434 | 0.003003 | 0.172254 | -12.479387 | 0.082851 | 112.110362 | 0.448249 | 45.820071 | 7 |
| 5 | 1 | abstract | 0.45921 | 0.516167 | 343196.5 | 0.442417 | 0.849667 | 0.118067 | -15.472083 | 0.046517 | 127.88575 | 0.307325 | 43.5 | 1 |
| 6 | 1 | abstract beats | 0.342147 | 0.623 | 229936.2 | 0.5278 | 0.333603 | 0.099653 | -7.918 | 0.116373 | 112.4138 | 0.493507 | 58.933333 | 10 |
| 7 | 1 | abstract hip hop | 0.243854 | 0.694571 | 231849.2 | 0.646235 | 0.024231 | 0.168543 | -7.349328 | 0.214258 | 108.244987 | 0.571391 | 39.790702 | 2 |
| 8 | 0 | accordeon | 0.323 | 0.588 | 164000.0 | 0.392 | 0.441 | 0.0794 | -14.899 | 0.0727 | 109.131 | 0.709 | 39.0 | 2 |
| 9 | 1 | accordion | 0.446125 | 0.624812 | 167061.6 | 0.373437 | 0.193738 | 0.1603 | -14.487063 | 0.078537 | 112.872438 | 0.658688 | 21.9375 | 2 |


In [20]:
# Display data types for dataframes

print("Data - Data Types:")
display(data.dtypes.to_frame(name='Data Type'))

print("Genre Data - Data Types:")
display(genre_data.dtypes.to_frame(name='Data Type'))

Data - Data Types:


| Column | Data Type |
|---|---|
| valence | float64 |
| year | int64 |
| acousticness | float64 |
| artists | object |
| danceability | float64 |
| duration_ms | int64 |
| energy | float64 |
| explicit | int64 |
| id | object |
| instrumentalness | float64 |


Genre Data - Data Types:


| Column | Data Type |
|---|---|
| mode | int64 |
| genres | object |
| acousticness | float64 |
| danceability | float64 |
| duration_ms | float64 |
| energy | float64 |
| instrumentalness | float64 |
| liveness | float64 |
| loudness | float64 |
| speechiness | float64 |
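
The dtype listings could be combined with a couple of other per-column facts in a single summary frame. The following is a minimal sketch on a toy frame; `summarize_columns` is a hypothetical helper, and the toy values are illustrative only.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, non-null count, and number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),
    })


# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "valence": [0.0594, 0.963, 0.0394],
    "year": [1921, 1921, 1921],
    "artists": ["['Dennis Day']", "['Frank Parker']", "['Phil Regan']"],
})
summary = summarize_columns(toy)
print(summary)
```

In the notebook, `summarize_columns(data)` and `summarize_columns(genre_data)` would give the same view for the real datasets.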


In [22]:
# Check which columns are shared between data and data_by_genres
common_columns_stats = set(data.columns).intersection(set(genre_data.columns))
print(f"Common Columns: {common_columns_stats}")

unique_genre_columns_stats = set(genre_data.columns) - set(data.columns)
print(f"\nUnique Columns in 'data_by_genres' dataset compared to 'data' dataset: {unique_genre_columns_stats}")

unique_data_columns_stats = set(data.columns) - set(genre_data.columns)
print(f"\nUnique Columns in 'data' dataset compared to 'data_by_genres' dataset: {unique_data_columns_stats}")

Common Columns: {'mode', 'tempo', 'valence', 'loudness', 'danceability', 'liveness', 'acousticness', 'speechiness', 'key', 'popularity', 'instrumentalness', 'energy', 'duration_ms'}

Unique Columns in 'data_by_genres' dataset compared to 'data' dataset: {'genres'}

Unique Columns in 'data' dataset compared to 'data_by_genres' dataset: {'name', 'artists', 'year', 'id', 'explicit', 'release_date'}
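
Python sets are unordered, so the printed order of the column names above can differ between sessions. If a stable order matters, one option is to use list comprehensions that preserve the original column order. This is a sketch on empty toy frames that reuse a few of the real column names; `compare_columns` is a hypothetical helper.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Return (common, only_left, only_right) column lists in their original order."""
    left_cols = set(left.columns)
    right_cols = set(right.columns)
    common = [c for c in left.columns if c in right_cols]
    only_left = [c for c in left.columns if c not in right_cols]
    only_right = [c for c in right.columns if c not in left_cols]
    return common, only_left, only_right


# Empty toy frames with a few of the real column names.
tracks = pd.DataFrame(columns=["valence", "year", "artists", "tempo"])
genres = pd.DataFrame(columns=["mode", "genres", "tempo", "valence"])

common, only_tracks, only_genres = compare_columns(tracks, genres)
print(common)       # ['valence', 'tempo']
print(only_tracks)  # ['year', 'artists']
print(only_genres)  # ['mode', 'genres']
```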


#### Conclusion
 - The datasets are well-prepared and properly formatted, and can be used for detailed music analysis.
 - For the 'release_date' column, some samples do not contain the complete date.
   As the 'year' feature already gives the release year, we can analyse this further and consider discarding 'release_date' during feature engineering.
 - The features common to data.csv and data_by_genres.csv are 'mode', 'key', 'loudness', 'energy', 'danceability', 'liveness', 'speechiness', 'valence', 'acousticness', 'tempo', 'instrumentalness', 'duration_ms', and 'popularity'.
 - The 'data_by_genres' dataset has one additional feature, 'genres'.
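
The observation about incomplete `release_date` values can be quantified before deciding whether to drop the column. The sketch below classifies each value by its string length and checks that its first four characters agree with `year`. It runs on a toy frame: '1921' and '1921-03-20' appear in the preview, while '1950-05' is a made-up year-month value, and whether that format occurs in the real data is an assumption.

```python
import pandas as pd

# Toy sample; '1950-05' is a made-up value for illustration.
toy = pd.DataFrame({
    "year": [1921, 1921, 1950],
    "release_date": ["1921", "1921-03-20", "1950-05"],
})

# Classify each value by its string length: YYYY, YYYY-MM, or YYYY-MM-DD.
precision = (
    toy["release_date"].astype(str).str.len()
    .map({4: "year", 7: "year-month", 10: "full date"})
    .fillna("other")
)
print(precision.value_counts())

# Check that the first four characters of release_date agree with `year`.
year_matches = toy["release_date"].astype(str).str[:4].astype(int) == toy["year"]
print(year_matches.all())  # True
```

If `year_matches.all()` also holds on the full dataset, `release_date` adds no year information beyond `year`, which would support discarding it.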

### 1.3 Assessing missing data and duplicate values

In [31]:
# Check missing values to ensure completeness of data
print("\nMissing Values in Data")
display(data.isnull().sum().to_frame(name='Missing Values'))

print("\nMissing Values in Genre Data")
display(genre_data.isnull().sum().to_frame(name='Missing Values'))


Missing Values in Data


| Column | Missing Values |
|---|---|
| valence | 0 |
| year | 0 |
| acousticness | 0 |
| artists | 0 |
| danceability | 0 |
| duration_ms | 0 |
| energy | 0 |
| explicit | 0 |
| id | 0 |
| instrumentalness | 0 |



Missing Values in Genre Data


| Column | Missing Values |
|---|---|
| mode | 0 |
| genres | 0 |
| acousticness | 0 |
| danceability | 0 |
| duration_ms | 0 |
| energy | 0 |
| instrumentalness | 0 |
| liveness | 0 |
| loudness | 0 |
| speechiness | 0 |


In [34]:
# Record and display old and new sizes for dataset after removing duplicates

# Data
old_size_data = data.shape
duplicates = data.duplicated().sum()
data = data.drop_duplicates()
new_size_data = data.shape

# Data with Genres
old_size_w_genres = genre_data.shape
duplicates_w_genres = genre_data.duplicated().sum()
genre_data = genre_data.drop_duplicates()
new_size_w_genres = genre_data.shape

print("======= Duplicate Removal Results =======\n")

print("Data")
print(f"  Duplicates Found : {duplicates}")
print(f"  Old Size         : {old_size_data}")
print(f"  New Size         : {new_size_data}\n")

print("Data with Genres")
print(f"  Duplicates Found : {duplicates_w_genres}")
print(f"  Old Size         : {old_size_w_genres}")
print(f"  New Size         : {new_size_w_genres}\n")



Data
  Duplicates Found : 0
  Old Size         : (170653, 19)
  New Size         : (170653, 19)

Data with Genres
  Duplicates Found : 0
  Old Size         : (2973, 14)
  New Size         : (2973, 14)



#### Conclusion
 - Neither dataset has missing values in any of its columns.
 - Neither dataset contains duplicated rows.
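
Note that `duplicated()` without arguments only flags rows that match in every column. Because each track has a unique `id`, the same recording listed under two ids would not be counted. As an optional extra check that the notebook itself does not perform, the sketch below compares full-row duplicates with duplicates on a subset of columns; the ids in the toy frame are made up.

```python
import pandas as pd

# Toy frame: rows 0 and 1 differ only in `id`, so they are not full-row duplicates.
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],  # made-up ids
    "name": ["Danny Boy", "Danny Boy", "Gati Bali"],
    "artists": ["['Frank Parker']", "['Frank Parker']", "['Artist B']"],
})

full_row_dupes = toy.duplicated().sum()
name_artist_dupes = toy.duplicated(subset=["name", "artists"]).sum()

print(full_row_dupes)     # 0
print(name_artist_dupes)  # 1
```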

### 1.4 Store the raw dataset

In [47]:
# Save the raw datasets as CSV files (paths built with os.path.join for portability)
data.to_csv(os.path.join('..', 'Dataset', 'data_raw.csv'), index=False)
genre_data.to_csv(os.path.join('..', 'Dataset', 'genre_data_raw.csv'), index=False)
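
After saving, a quick round-trip check can confirm that the file reads back with the same shape and columns. The sketch below demonstrates the idea on a toy frame written to a temporary directory, so it does not touch the real `Dataset` folder; the frame contents are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for `data`.
toy = pd.DataFrame({"id": ["a1", "b2"], "tempo": [80.954, 60.936]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_raw.csv"
    toy.to_csv(out_path, index=False)   # index=False avoids writing an extra unnamed column
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)  # True True
```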