# 7DaysOfCode - Machine Learning 

## Day-2: Data Pre-processing

Preprocessing is a crucial step in the Machine Learning workflow because it determines the quality of the data the model sees. It involves cleaning, organizing, and transforming raw data into data that can be used to train models. How well the data is preprocessed directly affects the ability of our models to learn and make accurate predictions.

Some of the data preprocessing steps, such as handling null values and duplicates, were already addressed earlier in this documentation. In this step, I will focus on enriching the dataset, standardization, and checking for multicollinearity. 

In [1]:
#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

In [2]:
spotify_df = pd.read_csv("C:/Users/SAMSUNG/OneDrive/Documentos/GitHub/7DaysOfCode_SpotifyML/data/day_one.csv")

In [3]:
spotify_df.shape

(89740, 20)

## Separate the dependent and independent variables

As I am dealing with a binary classification problem, where the goal is to predict whether a song will be popular or not, I need to encode the 'popularity' column as the target variable. I have chosen a threshold of 70: songs with a popularity score above 70 are considered popular and encoded as 1, while songs with a score of 70 or below are considered not popular and encoded as 0.

In [5]:
#target value: popularity (1 if the score is above 70, otherwise 0)
spotify_df['popularity'] = (spotify_df['popularity'] > 70).astype(int)
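
The encoding rule can be checked on a handful of made-up scores before applying it to the full dataset. The sketch below is only an illustration: the `toy` frame, the `is_popular` column, and the `THRESHOLD` name are assumptions, not part of the original notebook. It confirms that a score of exactly 70 falls into class 0 and shows one way to inspect the class balance.

```python
import pandas as pd

# Toy popularity scores (made up for illustration), including the boundary value 70
toy = pd.DataFrame({"popularity": [12, 69, 70, 71, 95]})

THRESHOLD = 70  # same cut-off as in the text

# Strictly greater than the threshold -> 1 (popular), otherwise -> 0
toy["is_popular"] = (toy["popularity"] > THRESHOLD).astype(int)

# Share of each class; a heavily skewed split would signal class imbalance
class_share = toy["is_popular"].value_counts(normalize=True).sort_index()

print(toy)
print(class_share)
```

Running the same `value_counts(normalize=True)` call on the real target column would show how many songs actually clear the threshold.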

## Feature engineering
Feature engineering is the process of creating new features or transforming existing features in a dataset to improve the performance of machine learning models. In this step, I will remove irrelevant columns and create new features.


In [6]:
spotify_df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,1,3.844433,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,0,2.4935,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,0,3.513767,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,1,3.36555,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,1,3.314217,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In this initial stage, I will work exclusively with the numeric data in this dataset. I have chosen to exclude the text-based categorical variables for two reasons:

i. Limited contribution to model performance: Variables such as track_id, album_name, and track_name are identifiers or free-text titles. They are unlikely to significantly improve the performance of the model, as they may not contain meaningful numerical information for the model to learn from.

ii. High cardinality: Variables like track_genre and artists may have a large number of unique categories, which can result in increased computational overhead and additional data cleaning steps. Weighed against the potential gains in model performance, dealing with these variables may be time-consuming and cumbersome.

By excluding these variables, I can streamline the data preparation process and focus on the numerical data, which may be more relevant for the modeling task at hand.

In [7]:
spotify_df = spotify_df.drop(columns = ['track_id', 'album_name', 'track_name', 'artists', 'track_genre'])
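
One way to put numbers behind the high-cardinality argument is to count the distinct values in each text column before dropping it. The following is a minimal sketch on made-up rows. The `toy` frame and its values are hypothetical and only mimic the column names of the dataset.

```python
import pandas as pd

# Made-up rows that mimic some text columns of the dataset
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "c3", "d4"],
    "artists": ["X", "Y", "X", "Z"],
    "track_genre": ["acoustic", "acoustic", "rock", "rock"],
    "energy": [0.46, 0.17, 0.36, 0.06],
})

# Columns that are not numeric (here: the text columns)
text_cols = toy.select_dtypes(exclude="number").columns

# Number of distinct values per text column, largest first
cardinality = toy[text_cols].nunique().sort_values(ascending=False)
print(cardinality)

# Keep only the remaining columns
numeric_only = toy.drop(columns=text_cols)
print(numeric_only.columns.tolist())
```

A column whose distinct-value count is close to the number of rows (like `track_id` here) behaves as an identifier and would add one dummy column per row if it were one-hot encoded.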

In [8]:
spotify_df.head(10)

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,1,3.844433,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4
1,0,2.4935,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4
2,0,3.513767,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4
3,1,3.36555,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3
4,1,3.314217,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4
5,0,3.570667,False,0.688,0.481,6,-8.807,1,0.105,0.289,0.0,0.189,0.666,98.017,4
6,1,3.823333,False,0.407,0.147,2,-8.822,1,0.0355,0.857,3e-06,0.0913,0.0765,141.284,3
7,1,4.0491,False,0.703,0.444,11,-9.331,1,0.0417,0.559,0.0,0.0973,0.712,150.96,4
8,1,3.160217,False,0.625,0.414,0,-8.7,1,0.0369,0.294,0.0,0.151,0.669,130.088,4
9,0,3.426567,False,0.442,0.632,1,-6.77,1,0.0295,0.426,0.00419,0.0735,0.196,78.899,4


### One-Hot Encoding

Handling categorical data is a critical step in the machine learning workflow. Even though I have excluded the categorical columns with high cardinality, some remaining variables still need to be addressed. Certain machine learning models cannot work with categorical data directly, and categorical variables stored as numbers must not be misread as continuous ones. For example, key is stored as an integer, but it is a label for a category, not a quantity. To address this, I will use One-Hot Encoding to represent these categorical variables in the dataset.

In [9]:
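def onehot_encode(df, column, prefix):
    """Return a copy of df with `column` replaced by its one-hot dummy columns."""
    df = df.copy()
    # drop_first=True drops the first category to avoid multicollinearity;
    # dtype=int keeps the dummies as 0/1 (pandas >= 2.0 defaults to booleans)
    dummies = pd.get_dummies(df[column], prefix=prefix, drop_first=True, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df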

In [10]:
#getting the dummies
spotify_df = onehot_encode(spotify_df, 'mode', 'mode')
spotify_df = onehot_encode(spotify_df, 'explicit', 'isExplicit')
spotify_df = onehot_encode(spotify_df, 'key', 'key')
spotify_df = onehot_encode(spotify_df, 'time_signature', 'time_signature')
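
To see what dropping the first category does, the sketch below encodes a tiny made-up `key` column with and without `drop_first=True`. The `toy` frame is an assumption for illustration. `dtype=int` is passed so the output shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up key values; 0 is the first (reference) category
toy = pd.DataFrame({"key": [0, 1, 2, 1]})

# Full encoding: one column per category
full = pd.get_dummies(toy["key"], prefix="key", dtype=int)

# Reduced encoding: the first category (key_0) is dropped
reduced = pd.get_dummies(toy["key"], prefix="key", drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# In the full encoding every row sums to 1, so any one column
# is fully determined by the others (perfect multicollinearity)
print(full.sum(axis=1).tolist())
```

In the reduced encoding, a row of all zeros stands for the dropped reference category, so no information is lost.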

In [11]:
spotify_df.shape

(89740, 28)

In [12]:
spotify_df.head()

Unnamed: 0,popularity,duration_ms,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,mode_1,isExplicit_True,key_1,key_2,key_3,key_4,key_5,key_6,key_7,key_8,key_9,key_10,key_11,time_signature_1,time_signature_3,time_signature_4,time_signature_5
0,1,3.844433,0.676,0.461,-6.746,0.143,0.0322,1e-06,0.358,0.715,87.917,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,0,2.4935,0.42,0.166,-17.235,0.0763,0.924,6e-06,0.101,0.267,77.489,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,0,3.513767,0.438,0.359,-9.734,0.0557,0.21,0.0,0.117,0.12,76.332,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,1,3.36555,0.266,0.0596,-18.515,0.0363,0.905,7.1e-05,0.132,0.143,181.74,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,3.314217,0.618,0.443,-9.681,0.0526,0.469,0.0,0.0829,0.167,119.949,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0



## Split and scaling

In this step, I will separate the features from the target and standardize the features. Standardization rescales each column so that its mean is 0 and its standard deviation is 1. This matters because many machine learning models, including those trained with gradient descent, are sensitive to the scale of the input features. Columns on very different scales can cause problems such as overshooting during optimization or slow convergence. Putting every feature on a common scale mitigates these problems and makes the data better suited to model training and inference.
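
Standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below applies that formula by hand with NumPy on two made-up features with very different scales. The array `X_toy` and its values are assumptions for illustration only.

```python
import numpy as np

# Two made-up features on very different scales
# (think of a tempo-like column and a danceability-like column)
X_toy = np.array([
    [80.0, 0.20],
    [120.0, 0.50],
    [160.0, 0.80],
])

# Column-wise statistics
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied column by column via broadcasting
X_std = (X_toy - mean) / std

print(X_std)
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

After the transformation, both columns hold the same values even though the raw scales differed by two orders of magnitude, which is the effect the paragraph above describes.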

In [13]:
spotify_df.columns

Index(['popularity', 'duration_ms', 'danceability', 'energy', 'loudness',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'mode_1', 'isExplicit_True', 'key_1', 'key_2',
       'key_3', 'key_4', 'key_5', 'key_6', 'key_7', 'key_8', 'key_9', 'key_10',
       'key_11', 'time_signature_1', 'time_signature_3', 'time_signature_4',
       'time_signature_5'],
      dtype='object')

In [14]:
#separating the target (y) from the features (X)
y = spotify_df['popularity']
X = spotify_df.drop('popularity', axis = 1)

In [15]:
#standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#fit_transform returns a NumPy array, so the column names are no longer attached to X
X = scaler.fit_transform(X)

In [16]:
X.shape

(89740, 27)
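
The cell above fits the scaler on the full feature matrix. If the data is later divided into training and test sets, a common alternative is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that no information from the test set influences the transformation. The sketch below shows that pattern on random made-up data. It is not the notebook's method, and the array names, split ratio, and seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 200 rows, 3 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Hold out 20% of the rows, keeping the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_std = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_std.mean(axis=0).round(6))  # approximately 0 by construction
print(X_test_std.mean(axis=0).round(6))   # close to 0, but not exactly
```

The test-set means are not exactly zero because the scaler never saw those rows. That is the expected behavior of this pattern.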