<h1>Pre-processing & Training Data Development</h1>

This step of the project concentrates on pre-processing and on developing the training data. The goal is to make every feature numeric, standardize the features, and split the data into training and test sets.

At the end of this stage we will have:
- dummy variables for our categorical ones
- standardized variables
- a split of train and test data

In [1]:
# Start by importing the necessary packages
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime


<h3> Loading the data</h3>
We need to start by importing the data we have been using for this project.

In [2]:
import os
cwd = os.getcwd()

In [3]:
print(cwd)

C:\Users\Alfredo


In [4]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
songs = pd.read_csv(r'Documents\GitHub\CAPSTONE2\data\songs_data_afterEDA.csv', index_col=0)
songs.head()

Unnamed: 0,daily_rank,daily_movement,weekly_movement,snapshot_date,popularity,is_explicit,duration_ms,album_release_date,danceability,energy,...,CTRY_SV,CTRY_TH,CTRY_TR,CTRY_TW,CTRY_UA,CTRY_US,CTRY_UY,CTRY_VE,CTRY_VN,CTRY_ZA
0,1,49,49,2024-04-25,90,False,228965,2024-04-18,0.675,0.397,...,0,0,0,0,0,0,0,0,0,0
1,2,48,-1,2024-04-25,100,False,142514,2024-03-19,0.599,0.946,...,0,0,0,0,0,0,0,0,0,0
2,3,47,4,2024-04-25,92,True,175459,2024-04-12,0.701,0.771,...,0,0,0,0,0,0,0,0,0,0
3,4,46,-1,2024-04-25,98,True,222000,2024-02-02,0.791,0.499,...,0,0,0,0,0,0,0,0,0,0
4,5,45,45,2024-04-25,84,True,218004,2024-04-18,0.7,0.763,...,0,0,0,0,0,0,0,0,0,0
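
The relative path above only resolves when the notebook runs from `C:\Users\Alfredo`, and it relies on Windows separators. A minimal sketch of a more portable way to build the same path, assuming the same folder layout; `data_path` is a name introduced here for illustration and is not part of the original notebook:

```python
from pathlib import Path

# Build the path from its parts so the separator is chosen by the operating system.
# The folder layout mirrors the one used in the read_csv call above (an assumption
# about where the repository lives relative to the working directory).
data_path = Path("Documents", "GitHub", "CAPSTONE2", "data", "songs_data_afterEDA.csv")

print(data_path.name)      # file name only
print(data_path.exists())  # True only if the file is reachable from the working directory
```

The resulting `data_path` can be passed directly to `pd.read_csv(data_path, index_col=0)`.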


In [5]:
songs.columns

Index(['daily_rank', 'daily_movement', 'weekly_movement', 'snapshot_date',
       'popularity', 'is_explicit', 'duration_ms', 'album_release_date',
       'danceability', 'energy', 'key', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'top_5', 'GDP', 'CPI', 'CTRY_AE', 'CTRY_AR', 'CTRY_AT', 'CTRY_AU',
       'CTRY_BE', 'CTRY_BG', 'CTRY_BO', 'CTRY_BR', 'CTRY_BY', 'CTRY_CA',
       'CTRY_CH', 'CTRY_CL', 'CTRY_CO', 'CTRY_CR', 'CTRY_CZ', 'CTRY_DE',
       'CTRY_DK', 'CTRY_DO', 'CTRY_EC', 'CTRY_EE', 'CTRY_EG', 'CTRY_ES',
       'CTRY_FI', 'CTRY_FR', 'CTRY_GB', 'CTRY_GR', 'CTRY_GT', 'CTRY_Global',
       'CTRY_HK', 'CTRY_HN', 'CTRY_HU', 'CTRY_ID', 'CTRY_IE', 'CTRY_IL',
       'CTRY_IN', 'CTRY_IS', 'CTRY_IT', 'CTRY_JP', 'CTRY_KR', 'CTRY_KZ',
       'CTRY_LT', 'CTRY_LU', 'CTRY_LV', 'CTRY_MA', 'CTRY_MX', 'CTRY_MY',
       'CTRY_NG', 'CTRY_NI', 'CTRY_NL', 'CTRY_NO', 'CTRY_NZ', 'CTRY_PA',
       'CTRY_PE', 'CTRY_PH', 'CTRY

In [6]:
col_info = pd.DataFrame(data = {'Type':songs.dtypes}, index = songs.columns )
col_info.head()

Unnamed: 0,Type
daily_rank,int64
daily_movement,int64
weekly_movement,int64
snapshot_date,object
popularity,int64


In [7]:
col_info.value_counts()

Type   
int64      81
float64    10
bool        2
object      2
Name: count, dtype: int64

In [8]:
col_info[col_info['Type']=='object']

Unnamed: 0,Type
snapshot_date,object
album_release_date,object
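
The same check can be done without building `col_info`. A small sketch on a toy frame (the `toy` data and the `object_cols` name are made up for illustration), using `select_dtypes` to list the columns that are still stored as generic Python objects:

```python
import pandas as pd

# Toy stand-in for `songs`; the values are illustrative only.
toy = pd.DataFrame({
    "daily_rank": [1, 2],
    "energy": [0.397, 0.946],
    "is_explicit": [False, True],
    # dtype=object mimics how the date strings are stored after read_csv
    "snapshot_date": pd.Series(["2024-04-25", "2024-04-25"], dtype=object),
})

# Count columns per dtype, then list the ones that are still `object`.
dtype_counts = toy.dtypes.value_counts()
object_cols = toy.select_dtypes(include="object").columns.tolist()

print(dtype_counts)
print(object_cols)
```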


As we can see, snapshot_date and album_release_date are stored as strings (object dtype). We first convert them to datetime and then derive a numeric feature, the number of days between the album release and the snapshot, that we can use in our analysis:

In [9]:
# First, convert the two date columns to datetime
songs['snapshot_date'] = pd.to_datetime(songs['snapshot_date'])
songs['album_release_date'] = pd.to_datetime(songs['album_release_date'])


In [19]:
songs['days_since_relase'] = (songs['snapshot_date']-songs['album_release_date'])/np.timedelta64(1, 'D')
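
To make the date arithmetic concrete, here is a small worked example on two rows whose dates are taken from the `songs.head()` output above. The `toy` frame and the `days` name are introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Two (snapshot, release) pairs copied from the first rows of the dataset.
toy = pd.DataFrame({
    "snapshot_date": ["2024-04-25", "2024-04-25"],
    "album_release_date": ["2024-04-18", "2024-03-19"],
})
toy["snapshot_date"] = pd.to_datetime(toy["snapshot_date"])
toy["album_release_date"] = pd.to_datetime(toy["album_release_date"])

# Subtracting two datetime columns gives a timedelta column;
# dividing by one day turns it into a float number of days.
days = (toy["snapshot_date"] - toy["album_release_date"]) / np.timedelta64(1, "D")

print(days.tolist())  # [7.0, 37.0]
```

Because the result is a float column, it shows up as one extra `float64` feature once the two date columns are dropped.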

In [21]:
songs.drop(columns = ['snapshot_date', 'album_release_date'], inplace=True)

In [22]:
col_info = pd.DataFrame(data = {'Type':songs.dtypes}, index = songs.columns )
col_info.value_counts()

Type   
int64      81
float64    11
bool        2
Name: count, dtype: int64

As we can see, we no longer have any object or date variables, but we do have two bool variables we need to investigate:

In [23]:
col_info[col_info['Type']=='bool']

Unnamed: 0,Type
is_explicit,bool
top_5,bool


The top_5 column holds the information we want to predict, and the is_explicit column is one of the features we want to analyze. To make all our columns numeric, we need to convert these two bool columns to int64:

In [24]:
# Cast the boolean columns to int64 so that every feature is numeric
songs['is_explicit'] = songs['is_explicit'].astype('int64')
songs['top_5'] = songs['top_5'].astype('int64')

In [25]:
col_info = pd.DataFrame(data = {'Type':songs.dtypes}, index = songs.columns )
col_info.value_counts()

Type   
int64      83
float64    11
Name: count, dtype: int64
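
If more boolean columns appear later, they do not have to be named one by one. A minimal sketch on a toy frame (the `toy` data and `bool_cols` name are assumptions for illustration) that finds every bool column and casts it in a single step:

```python
import pandas as pd

# Toy stand-in with the two boolean columns found above plus one numeric column.
toy = pd.DataFrame({
    "is_explicit": [False, True, True],
    "top_5": [True, False, False],
    "popularity": [90, 100, 92],
})

# Select the bool columns by dtype, then cast only those to int64.
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: "int64" for col in bool_cols})

print(toy.dtypes)
```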

Now that all our features are numeric, we can proceed to the next stage of our process: dividing the data into train and test sets.

<h3>Train/Test set split</h3>

In [26]:
songs.shape

(682853, 94)

In [27]:
X_train, X_test, y_train, y_test = train_test_split(songs.drop(columns = 'top_5'), 
                                                    songs.top_5, test_size=0.3, 
                                                    random_state=123)

In [28]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(477997, 93)
(204856, 93)
(477997,)
(204856,)
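
`top_5` is a binary target, and if only a small share of rows falls in the top 5, a purely random split can leave the two sets with slightly different class proportions. A hedged sketch of a stratified split on made-up data (the `toy_X` and `toy_y` objects are assumptions for illustration, not the project data); `stratify` is a standard `train_test_split` parameter:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and an imbalanced binary target (10% positives).
rng = np.random.default_rng(123)
toy_X = pd.DataFrame({"energy": rng.random(100), "tempo": rng.random(100)})
toy_y = pd.Series([1] * 10 + [0] * 90, name="top_5")

# stratify=toy_y keeps the share of positives the same in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=123, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())
```

On the real data this would correspond to passing `stratify=songs.top_5` to the split above; whether it matters depends on how imbalanced `top_5` actually is.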


In [30]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
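
The scaler is fitted on the training set only, so the test set never influences the means and standard deviations used for scaling. A small sanity check on synthetic data (the `toy_train`, `toy_test` and `toy_scaler` names are made up for illustration) shows what to expect from the two transformed sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test arrays drawn from the same distribution.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only, then apply the same transformation to both sets.
toy_scaler = StandardScaler().fit(toy_train)
train_scaled = toy_scaler.transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)

# Training columns end up with mean 0 and standard deviation 1 (up to rounding).
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
# Test columns are only approximately centered, because the test set
# was not used to compute the scaling statistics.
print(test_scaled.mean(axis=0).round(2))
```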

We now have training and test sets, with 30% of the data held out for testing, and both sets are scaled with a StandardScaler fitted on the training data only. We can move on to the next stage: testing some models.