<h1>Pre-processing & Training Data Development</h1>

This notebook covers pre-processing and training data development. The goal of this step is to make every feature in the dataset numeric, standardize the features, and split the data into training and test sets.

At the end of this stage we will have:
- dummy variables for our categorical features
- standardized variables
- a split of train and test data

In [1]:
# start by importing the necessary packages
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime


<h3> Loading the data</h3>
We start by loading the data set we have been using for this project, as saved at the end of the EDA stage.

In [56]:
import os
cwd = os.getcwd()

In [57]:
print(cwd)

C:\Users\Alfredo\Documents\GitHub\CAPSTONE2\Notebooks


In [58]:
# forward slashes avoid invalid escape sequences such as '\d' and '\s', and also work on Windows
songs = pd.read_csv('../data/songs_data_afterEDA.csv', index_col=0)
songs.head()

Unnamed: 0,spotify_id,name,daily_movement,weekly_movement,country,snapshot_date,popularity,is_explicit,duration_ms,album_release_date,...,hdi_Low,hdi_Medium,hdi_Very High,region_EAP,region_ECA,region_EU,region_GLB,region_LAC,region_SA,region_SSA
0,2OzhQlSqBEmt7hmkYxfT6m,Fortnight (feat. Post Malone),49,49,Global,2024-04-25,90,False,228965,2024-04-18,...,0,0,1,0,0,0,1,0,0,0
1,2GxrNKugF82CnoRFbQfzPf,i like the way you kiss me,48,-1,Global,2024-04-25,100,False,142514,2024-03-19,...,0,0,1,0,0,0,1,0,0,0
2,2qSkIjg1o9h3YT9RAgYN75,Espresso,47,4,Global,2024-04-25,92,True,175459,2024-04-12,...,0,0,1,0,0,0,1,0,0,0
3,6XjDF6nds4DE2BBbagZol6,Gata Only,46,-1,Global,2024-04-25,98,True,222000,2024-02-02,...,0,0,1,0,0,0,1,0,0,0
4,4q5YezDOIPcoLr8R81x9qy,I Can Do It With a Broken Heart,45,45,Global,2024-04-25,84,True,218004,2024-04-18,...,0,0,1,0,0,0,1,0,0,0


In [59]:
songs.columns

Index(['spotify_id', 'name', 'daily_movement', 'weekly_movement', 'country',
       'snapshot_date', 'popularity', 'is_explicit', 'duration_ms',
       'album_release_date', 'danceability', 'energy', 'key', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'top_5', 'code_standard',
       'child_mort', 'health', 'inflation', 'life_expec', 'total_fer', 'gdpp',
       'hap_score', 'pop2021', 'chistians_p', 'muslims_p', 'unaffiliated_p',
       'hindus_p', 'buddhists_p', 'folkReligions_p', 'other_p', 'jews_p',
       'gnipc_2021', 'imp_exp_rate', 'hdi_Low', 'hdi_Medium', 'hdi_Very High',
       'region_EAP', 'region_ECA', 'region_EU', 'region_GLB', 'region_LAC',
       'region_SA', 'region_SSA'],
      dtype='object')

In [60]:
col_info = pd.DataFrame(data = {'Type':songs.dtypes}, index = songs.columns )
col_info.head()

Unnamed: 0,Type
spotify_id,object
name,object
daily_movement,int64
weekly_movement,int64
country,object


In [61]:
col_info.value_counts()

Type   
float64    26
int64      17
object      6
bool        2
Name: count, dtype: int64

In [62]:
col_info[col_info['Type']=='object']

Unnamed: 0,Type
spotify_id,object
name,object
country,object
snapshot_date,object
album_release_date,object
code_standard,object
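As a side note, pandas can list columns by dtype directly with `select_dtypes`, which avoids building a helper frame such as `col_info`. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the project data).

```python
import pandas as pd

# Toy frame standing in for `songs`; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["a", "b"],
    "snapshot_date": ["2024-04-25", "2024-04-25"],
    "popularity": [90, 100],
    "top_5": [True, False],
})
# Force object dtype so the sketch behaves the same whatever
# default string dtype the installed pandas version uses.
toy = toy.astype({"name": "object", "snapshot_date": "object"})

# Names of the columns stored as generic Python objects and as booleans.
object_cols = toy.select_dtypes(include="object").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
print(object_cols)
print(bool_cols)
```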


As we can see, snapshot_date and album_release_date are stored as strings. We convert them to datetime objects and then use them to create a numeric feature (days since release) for our analysis. We also drop the code_standard column:

In [63]:
# first, convert the date columns to datetime
songs['snapshot_date'] = pd.to_datetime(songs['snapshot_date'])
songs['album_release_date'] = pd.to_datetime(songs['album_release_date'])
# drop the code_standard column as well
songs.drop(columns=['code_standard'], inplace=True)


In [64]:
# number of days between the chart snapshot and the album release, stored as a float
songs['days_since_relase'] = (songs['snapshot_date']-songs['album_release_date'])/np.timedelta64(1, 'D')

In [65]:
songs.drop(columns = ['snapshot_date', 'album_release_date'], inplace=True)
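
To make the date arithmetic above concrete, here is a minimal sketch on two rows whose dates are taken from the `songs.head()` output. The frame `toy` and the column name `days_since_release` are illustrative only. Dividing by `pd.Timedelta(days=1)` is used here as an equivalent of dividing by `np.timedelta64(1, 'D')`.

```python
import pandas as pd

toy = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2024-04-25", "2024-04-25"]),
    "album_release_date": pd.to_datetime(["2024-04-18", "2024-03-19"]),
})

# Subtracting two datetime columns gives a timedelta column;
# dividing by a one-day timedelta turns it into a float number of days.
toy["days_since_release"] = (
    toy["snapshot_date"] - toy["album_release_date"]
) / pd.Timedelta(days=1)

print(toy["days_since_release"].tolist())  # [7.0, 37.0]
```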

In [66]:
col_info = pd.DataFrame(data = {'Type':songs.dtypes}, index = songs.columns )
col_info.value_counts()

Type   
float64    27
int64      17
object      3
bool        2
Name: count, dtype: int64

As we can see, the only object columns left are the three identifier columns (spotify_id, name and country), which we will keep separate from the feature set, and there are no date columns left. We do still have 2 bool variables we need to investigate:

In [67]:
col_info[col_info['Type']=='bool']

Unnamed: 0,Type
is_explicit,bool
top_5,bool


The top_5 column holds the target we want to predict, and the 'is_explicit' column is one of the features we want to analyze. To make all our columns numeric, we need to convert these bool columns to int64:

In [68]:
# cast the bool columns to integers (True -> 1, False -> 0)
songs['is_explicit'] = songs['is_explicit'].astype('int64')
songs['top_5'] = songs['top_5'].astype('int64')

In [69]:
col_info = pd.DataFrame(data = {'Type':songs.dtypes}, index = songs.columns )
col_info.value_counts()

Type   
float64    27
int64      19
object      3
Name: count, dtype: int64

Now that all our features are numeric (apart from the three identifier columns, which we set aside after the split), we can proceed with the next stage of our process: dividing the data into train and test sets.

<h3>Train/Test set split</h3>

In [70]:
songs.shape

(635600, 49)

In [71]:
#calculate the positive rate first:
songs.top_5.sum()/len(songs)

0.10016991818753933

In [72]:
X_train, X_test, y_train, y_test = train_test_split(songs.drop(columns = 'top_5'), 
                                                    songs.top_5, test_size=0.3, stratify = songs.top_5,
                                                    random_state=123)

In [73]:
#verify sizes for test/train sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(444920, 48)
(190680, 48)
(444920,)
(190680,)


In [74]:
#verify rates are the same for test/train sets
print(y_train.sum()/len(y_train))
print(y_test.sum()/len(y_test))

0.10017081722556864
0.10016782043213761
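
The matching rates above come from the `stratify` argument. A minimal sketch on synthetic labels shows the effect. The arrays `X` and `y` below are made up, with a 10% positive rate chosen to resemble `top_5`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 10% positive rate (an assumption for illustration).
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two rates would only match approximately, which matters more when the positive class is rare.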


In [75]:
# First, let's save the remaining object (identifier) columns and drop them from the train and test sets
X_train_ids = X_train[['spotify_id', 'name', 'country']]
X_train = X_train.drop(columns=['spotify_id', 'name', 'country'])

X_test_ids = X_test[['spotify_id', 'name', 'country']]
X_test = X_test.drop(columns=['spotify_id', 'name', 'country'])

In [76]:
# let's confirm the data types left in the train set
col_info = pd.DataFrame(data = {'Type':X_train.dtypes}, index = X_train.columns )
col_info.value_counts()

Type   
float64    27
int64      18
Name: count, dtype: int64

In [77]:
# Now we can use the standard scaler, fitted on the training data only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

We now have our data split into training and test sets, with 30% held out for testing, and both sets scaled with a StandardScaler fitted on the training data only. We can move on to the next stage: testing some models.
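
Fitting the scaler on the training data only keeps information from the test set out of the pre-processing step. The sketch below illustrates the consequence on synthetic data (the arrays are random placeholders, not the project features): the scaled training set has mean 0 and standard deviation 1 per column, while the scaled test set is only close to those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test features drawn from the same distribution.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(1000, 2))
X_te = rng.normal(loc=50.0, scale=10.0, size=(300, 2))

# Learn the mean and standard deviation from the training data only...
scaler = StandardScaler().fit(X_tr)
# ...then apply the same transformation to both sets.
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_te_s.mean(axis=0), X_te_s.std(axis=0))
```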

In [78]:
# make sure the output folder exists, then save the id data to new csv files
os.makedirs('../data/train_test', exist_ok=True)
X_train_ids.to_csv('../data/train_test/train_ids.csv')
X_test_ids.to_csv('../data/train_test/test_ids.csv')

In [79]:
np.save('../data/train_test/xtrain.npy', X_train_scaled)
np.save('../data/train_test/xtest.npy', X_test_scaled)
np.save('../data/train_test/ytrain.npy', y_train)
np.save('../data/train_test/ytest.npy', y_test)

In [85]:
# the column names are saved as an object array, so loading them requires allow_pickle=True
np.save('../data/train_test/xcolumns.npy', np.array(X_train.columns))
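
For the modelling stage, the saved arrays can be read back with `np.load`. The sketch below is a round trip in a temporary folder with placeholder data, not the project files. It assumes the column names are stored as an object array, which is why `allow_pickle=True` is needed for that file.

```python
import os
import tempfile

import numpy as np

# Placeholder data standing in for the scaled features and column names.
cols = np.array(["danceability", "energy"], dtype=object)
X = np.zeros((3, 2))

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "xtrain.npy"), X)
    np.save(os.path.join(tmp, "xcolumns.npy"), cols)

    # Numeric arrays load with the defaults.
    X_loaded = np.load(os.path.join(tmp, "xtrain.npy"))
    # Object arrays are pickled on save, so loading must opt in explicitly.
    cols_loaded = np.load(os.path.join(tmp, "xcolumns.npy"), allow_pickle=True)

print(X_loaded.shape, list(cols_loaded))
```

Only use `allow_pickle=True` on files you created yourself, since unpickling untrusted data can execute arbitrary code.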