## Use Case

One of our customers strongly believes in technology and has recently backed its platform with Machine Learning and Artificial Intelligence. Based on data collected from multiple sources on different songs and various artist attributes, our customer is excited to challenge the MachineHack community.

By analyzing the chartbusters data to predict the Views of songs, MachineHackers would advance the state of the current platform. This can help our customer understand user behaviour and personalize the user experience.
In this hackathon, we challenge the MachineHackers to come up with a prediction algorithm that can predict the views for a given song.

Can you predict how popular a song will be in the future?

## Dataset Description

- Data_Train.csv – the training set, 78458 rows with 11 columns.
- Data_Test.csv – the test set, 19615 rows with 10 columns (every column except Views).
- Sample_Submission.csv – sample submission file format for reference.

## Data Dictionary

- **Unique_ID** : Unique Identifier.
- **Name** : Name of the Artist.
- **Genre** : Genre of the Song.
- **Country** : Origin Country of Artist.
- **Song_Name** : Name of the Song.
- **Timestamp** : Release Date and Time.
- **Views** : Number of times the song was played/viewed (*Target/Dependent Variable*).
- **Comments** : Count of comments for the song.
- **Likes** : Count of Likes.
- **Popularity** : Popularity score for the artist.
- **Followers** : Number of Followers.

## Load necessary packages

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model

In [2]:
from learningratefinder import LearningRateFinder
from clr_callback import CyclicLR

## Set file paths for train and test datasets

In [3]:
train_dataset = "Datasets/Data_Train.csv"
test_dataset = "Datasets/Data_Test.csv"

## Preprocess data

In [36]:
# Read train/predict data into pandas dataframes
train_df = pd.read_csv(train_dataset)
predict_df = pd.read_csv(test_dataset)

In [37]:
# Remove rows, from train_df, having any column value as NaN
train_df.dropna(inplace=True)

In [38]:
# Extract "Views" field from train_df into NumPy array
train_y = np.array([train_df['Views'].values]).T
train_df.drop(['Views'], inplace=True, axis=1)
print("train_y: {}".format(train_y.shape))

train_y: (78457, 1)


In [39]:
# Combine the train and predict dataframes
combined_df = pd.concat([train_df, predict_df], sort=False, ignore_index=True)
print(combined_df.shape)

(98072, 10)


In [40]:
combined_df.head(10)

| | Unique_ID | Name | Genre | Country | Song_Name | Timestamp | Comments | Likes | Popularity | Followers |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 413890 | Hardstyle | danceedm | AU | N-Vitral presents BOMBSQUAD - Poison Spitter (... | 2018-03-30 15:24:45.000000 | 4 | 499 | 97 | 119563 |
| 1 | 249453 | Dj Aladdin | danceedm | AU | Dj Aladdin - Old School Hip Hop Quick Mix | 2016-06-20 05:58:52.000000 | 17 | 49 | 17 | 2141 |
| 2 | 681116 | Maxximize On Air | danceedm | AU | Maxximize On Air - Mixed by Blasterjaxx - Epis... | 2015-05-08 17:45:59.000000 | 11 | 312 | 91 | 22248 |
| 3 | 387253 | GR6 EXPLODE | rbsoul | AU | MC Yago - Tenho Compromisso (DJ R7) | 2017-06-08 23:50:03.000000 | 2 | 2400 | 76 | 393655 |
| 4 | 1428029 | Tritonal | danceedm | AU | Escape (feat. Steph Jones) | 2016-09-17 20:50:19.000000 | 81 | 3031 | 699 | 201030 |
| 5 | 2839 | k$upreme | all-music | AU | Started Off Finessen' (Prod.Oscar100) | 2017-11-27 14:55:11.000000 | 6 | 4500 | 325 | 71038 |
| 6 | 414871 | Hardstyle | danceedm | AU | Coone - Universal Language (Cyber Remix) | 2016-01-22 17:23:26.000000 | 15 | 1017 | 226 | 119563 |
| 7 | 209496 | Diplo | danceedm | AU | Pick Your Poison (feat. Kay) (Figure Remix) | 2012-01-17 00:00:00.000000 | 5 | 88 | 12 | 7120051 |
| 8 | 967409 | Nick Vanelli | trap | AU | B l o o d s h e d | 2018-11-29 22:37:07.000000 | 0 | 28 | 7 | 1892 |
| 9 | 171948 | DeejayEcko(PNCS) | latin | AU | CHIHUAHUA MIXDOWN [ Instagram : @deejayeckoo ] | 2017-09-28 04:07:47.000000 | 0 | 622 | 47 | 2835 |


In [43]:
# One-hot encoding for "Name" field
dummy_val = pd.get_dummies(combined_df['Name'], prefix='Name')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Dataframe size after encoding: {}".format(combined_df.shape))

Dataframe size after encoding: (98072, 1229)


In [44]:
# One-hot encoding for "Genre" field
dummy_val = pd.get_dummies(combined_df['Genre'], prefix='Genre')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Dataframe size after encoding: {}".format(combined_df.shape))

Dataframe size after encoding: (98072, 1250)


In [None]:
# Sentence encoding for "Song_Name" field
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
song_name_embed = np.array(model(combined_df.Song_Name))
song_name_embed_df = pd.DataFrame(song_name_embed)
combined_df = pd.merge(combined_df, song_name_embed_df, left_index=True, right_index=True)
print("Dataframe size after encoding: {}".format(combined_df.shape))

In [12]:
# Extract new features from "Timestamp" field
combined_df['rel_year'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).year)
combined_df['rel_quarter'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).quarter)
combined_df['rel_month'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).month)
combined_df['rel_week'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).week)
combined_df['rel_day_year'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).dayofyear)
combined_df['rel_day_month'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).day)
combined_df['rel_day_week'] = combined_df['Timestamp'].map(lambda x: pd.to_datetime(x).dayofweek)
print("Dataframe size after encoding: {}".format(combined_df.shape))

Dataframe size after encoding: (98072, 1769)
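
The seven `rel_*` columns above are built by parsing every timestamp string once per feature. A vectorized alternative, sketched here on two timestamps copied from the preview, parses the column once and reads the same calendar fields through the pandas `.dt` accessor. The `ts` and `features` names are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Two timestamps taken from the preview of combined_df
ts = pd.to_datetime(pd.Series(["2018-03-30 15:24:45.000000",
                               "2012-01-17 00:00:00.000000"]))

# Parse once, then derive every calendar feature from the parsed column
features = pd.DataFrame({
    "rel_year": ts.dt.year,
    "rel_quarter": ts.dt.quarter,
    "rel_month": ts.dt.month,
    "rel_week": ts.dt.isocalendar().week.astype("int64"),  # ISO week number
    "rel_day_year": ts.dt.dayofyear,
    "rel_day_month": ts.dt.day,
    "rel_day_week": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(features)
```

On a column of roughly 98,000 rows this avoids seven separate `pd.to_datetime` calls per row.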


In [13]:
# Cleanse data in "Likes" field
combined_df['Likes'] = combined_df['Likes'].map(lambda x: x.replace(",", ""))
combined_df['Likes'] = (combined_df.Likes.replace(r'[KM]+$', '', regex=True).astype(float) * 
                        combined_df.Likes.str.extract(r'[\d\.]+([KM]+)', expand=False)
                        .fillna(1).replace(['K','M'], [10**3, 10**6]).astype(int))
print("Dataframe size after encoding: {}".format(combined_df.shape))

Dataframe size after encoding: (98072, 1769)
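
The "Likes" cleansing above strips thousands separators and expands `K`/`M` suffixes in a single chained expression. The same rule can be written as a small standalone helper, which is easier to test in isolation. This is only a sketch: `parse_count` is a hypothetical name, and it assumes the raw values look like `1,234`, `4.5K` or `2M`.

```python
import re

# Multipliers for the suffixes handled by the cleansing step
_SUFFIX = {"": 1, "K": 10**3, "M": 10**6}

def parse_count(text):
    """Convert strings such as '1,234', '4.5K' or '2M' to a float."""
    cleaned = str(text).replace(",", "").strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM]?)", cleaned)
    if match is None:
        # Fail loudly instead of silently producing NaN
        raise ValueError("unrecognised count: {!r}".format(text))
    number, suffix = match.groups()
    return float(number) * _SUFFIX[suffix]

for raw in ["499", "1,234", "4.5K", "2M"]:
    print(raw, "->", parse_count(raw))
```

If "Popularity" uses the same notation, the helper could be mapped over that column as well, for example with `combined_df['Popularity'].map(parse_count)`.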


In [14]:
# Cleanse data in "Popularity" field
combined_df['Popularity'] = combined_df['Popularity'].map(lambda x: x.replace(",", ""))
print("Dataframe size after encoding: {}".format(combined_df.shape))

Dataframe size after encoding: (98072, 1769)


In [42]:
# Convert "Comments" and "Followers" fields to 'float64' datatype
combined_df['Comments'] = combined_df['Comments'].astype('float64')
combined_df['Followers'] = combined_df['Followers'].astype('float64')

In [15]:
# Drop redundant columns
combined_df.drop(['Unique_ID', 'Name', 'Genre', 'Country', 'Song_Name', 'Timestamp'], inplace=True, axis=1)

# Check if any column has NaN value in dataframe
print("Column with NaN value: {}".format(combined_df.columns[combined_df.isnull().any()].tolist()))

Column with NaN value: []


In [16]:
# Segregate combined_df into train/predict datasets
train_x = combined_df[:78457]
predict_x = combined_df[78457:]
print("train_x: {}".format(train_x.shape))
print("predict_x: {}".format(predict_x.shape))

train_x: (78457, 1763)
predict_x: (19615, 1763)


## Create train, validation and test datasets

In [17]:
Xtrain, Xvalidation, Ytrain, Yvalidation = train_test_split(train_x, train_y, test_size=0.05, random_state=10)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(Xtrain, Ytrain, test_size=0.05, random_state=10)

In [18]:
print("Xtrain: {} \nYtrain: {}".format(Xtrain.shape, Ytrain.shape))
print("Xvalidation: {} \nYvalidation: {}".format(Xvalidation.shape, Yvalidation.shape))
print("Xtest: {} \nYtest: {}".format(Xtest.shape, Ytest.shape))

Xtrain: (70807, 1763) 
Ytrain: (70807, 1)
Xvalidation: (3923, 1763) 
Yvalidation: (3923, 1)
Xtest: (3727, 1763) 
Ytest: (3727, 1)
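
The shapes above follow from applying a 5% hold-out twice: first to all 78,457 rows, then to what remains. The arithmetic can be reproduced without scikit-learn, assuming the held-out size is rounded up (which matches the reported shapes). `split_sizes` is an illustrative helper, not a library function.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (remaining, held_out) sizes, rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

# First split: carve the validation set out of all training rows
n_rest, n_validation = split_sizes(78457, 0.05)
# Second split: carve the test set out of the remainder
n_train, n_test = split_sizes(n_rest, 0.05)

print(n_train, n_validation, n_test)  # 70807 3923 3727
```

Because the second split is taken from the remainder, the test set is about 4.75% of the original rows rather than 5%.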


## Save the datasets in NPZ file (for reusability)

In [19]:
np.savez_compressed('Datasets/Chartbusters_Songs_Popularity_Prediction_dataset.npz',
                    Xtrain=Xtrain, Ytrain=Ytrain,
                    Xvalidation=Xvalidation, Yvalidation=Yvalidation,
                    Xtest=Xtest, Ytest=Ytest,
                    Xpredict=predict_x)

## Load datasets from the NPZ file

In [21]:
# Read the training, holdout and test datasets from processed file
processed_dataset = np.load('Datasets/Chartbusters_Songs_Popularity_Prediction_dataset.npz', allow_pickle=True)
Xtrain, Ytrain = processed_dataset['Xtrain'], processed_dataset['Ytrain']
Xvalidation, Yvalidation = processed_dataset['Xvalidation'], processed_dataset['Yvalidation']
Xtest, Ytest = processed_dataset['Xtest'], processed_dataset['Ytest']
Xpredict = processed_dataset['Xpredict']

In [22]:
print("Xtrain: {} \nYtrain: {}".format(Xtrain.shape, Ytrain.shape))
print("Xvalidation: {} \nYvalidation: {}".format(Xvalidation.shape, Yvalidation.shape))
print("Xtest: {} \nYtest: {}".format(Xtest.shape, Ytest.shape))
print("Xpredict: {}".format(Xpredict.shape))

Xtrain: (70807, 1763) 
Ytrain: (70807, 1)
Xvalidation: (3923, 1763) 
Yvalidation: (3923, 1)
Xtest: (3727, 1763) 
Ytest: (3727, 1)
Xpredict: (19615, 1763)


## Build the model

In [29]:
def nn_model(num_of_features):
    """
        Description: Function to build the neural network model
        
        Parameters:
            num_of_features: Number of features in input data
        
        Return:
            model - Keras neural network Model
    """

    # Input Layer
    x_input = Input(shape=(num_of_features, ), name='INPUT')

    # Fully-connected Layer 1
    x = Dense(units=512, name='FC-1', activation='relu')(x_input)
    x = BatchNormalization(name='BN_FC-1')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-1')(x)

    # Fully-connected Layer 2
    x = Dense(units=128, name='FC-2', activation='relu')(x)
    x = BatchNormalization(name='BN_FC-2')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-2')(x)

    # Output Layer
    x = Dense(units=1, name='OUTPUT')(x)

    # Create Keras Model instance
    model = Model(inputs=x_input, outputs=x, name='Chartbusters_Songs_Popularity_Predictor')
    
    return model

In [30]:
# Define the model hyperparameters
max_iterations = 50
mini_batch_size = 128
min_lr = 1e-4
max_lr = 1e-2
step_size = 8 * (Xtrain.shape[0] // mini_batch_size)
clr_method = 'triangular2'
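
To make the role of `min_lr`, `max_lr` and `step_size` concrete, the sketch below computes the learning rate that a `triangular2` policy would produce at a given iteration. It assumes the imported `CyclicLR` callback follows the commonly published formula for this policy; `triangular2_lr` is an illustrative function, and the step size is derived from the 70,807 training rows reported earlier.

```python
import math

def triangular2_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate at `iteration` for a triangular2 cyclical schedule."""
    # A full cycle spans 2 * step_size iterations (ramp up, then ramp down)
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1]
    x = abs(iteration / step_size - 2 * cycle + 1)
    # triangular2 halves the amplitude after every completed cycle
    scale = 1 / (2 ** (cycle - 1))
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

step = 8 * (70807 // 128)  # 8 epochs' worth of mini-batches per half cycle
for it in (0, step, 2 * step, 3 * step):
    print(it, triangular2_lr(it, 1e-4, 1e-2, step))
```

The rate starts at `min_lr`, peaks at `max_lr` after `step_size` iterations, returns to `min_lr`, and the next peak is only halfway between the two bounds.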

In [31]:
# Create the model
model = nn_model(Xtrain.shape[1])

# Compile model to configure the learning process
model.compile(loss='mse',
              optimizer=Adam(learning_rate=min_lr),
              metrics=['mean_squared_logarithmic_error'])

# Triangular learning rate policy
clr = CyclicLR(base_lr=min_lr, max_lr=max_lr, mode=clr_method, step_size=step_size)
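
The model is optimized on MSE while mean squared logarithmic error (MSLE) is tracked as a metric. As a reference for what that metric measures, here is a plain-Python sketch. Clipping negative predictions to zero is a simplifying assumption (Keras clips to a small epsilon instead), and `msle` is an illustrative helper rather than the Keras implementation.

```python
import math

def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + y) values."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip at zero so log1p is always defined (assumption, see lead-in)
        t, p = max(t, 0.0), max(p, 0.0)
        total += (math.log1p(t) - math.log1p(p)) ** 2
    return total / len(y_true)

# The same relative miss costs about the same at very different scales
print(msle([1000.0], [2000.0]))
print(msle([1000000.0], [2000000.0]))
```

Since view counts can span several orders of magnitude, a log-scale metric keeps a few very popular songs from dominating the reported error.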

In [32]:
model.summary()

Model: "Chartbusters_Songs_Popularity_Predictor"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
INPUT (InputLayer)           [(None, 1763)]            0         
_________________________________________________________________
FC-1 (Dense)                 (None, 512)               903168    
_________________________________________________________________
BN_FC-1 (BatchNormalization) (None, 512)               2048      
_________________________________________________________________
DROPOUT_FC-1 (Dropout)       (None, 512)               0         
_________________________________________________________________
FC-2 (Dense)                 (None, 128)               65664     
_________________________________________________________________
BN_FC-2 (BatchNormalization) (None, 128)               512       
_________________________________________________________________
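DROPOUT_FC-2 (Dropout)       (None, 128)               0         
_________________________________________________________________
OUTPUT (Dense)               (None, 1)                 129       
=================================================================
Total params: 971,521
Trainable params: 970,241
Non-trainable params: 1,280
_________________________________________________________________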
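
The parameter counts in the summary can be checked by hand from the layer definitions in `nn_model`. A dense layer has one weight per input-output pair plus one bias per unit, and batch normalization keeps four values per feature (scale, shift, moving mean, moving variance). The helper names below are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean and moving variance per feature."""
    return 4 * n_features

counts = {
    "FC-1": dense_params(1763, 512),
    "BN_FC-1": batchnorm_params(512),
    "FC-2": dense_params(512, 128),
    "BN_FC-2": batchnorm_params(128),
    "OUTPUT": dense_params(128, 1),
}
total = sum(counts.values())
# Moving mean and variance are updated by statistics, not by gradients
non_trainable = 2 * 512 + 2 * 128

print(counts)
print(total, total - non_trainable, non_trainable)
```

Nearly all of the weights sit in the first dense layer, which is a direct consequence of the 1,763-wide one-hot and embedding input.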

In [33]:
# Learning Rate Finder
lrf = LearningRateFinder(model)
lrf.find((Xtrain, Ytrain),
         startLR=1e-10, endLR=1e-1,
         stepsPerEpoch=np.ceil((len(Xtrain) / float(mini_batch_size))),
         batchSize=mini_batch_size)
lrf.plot_loss()
plt.grid()
plt.show()

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
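
This error typically appears when the array passed to Keras has dtype `object` rather than a numeric dtype. One plausible cause here, not confirmed by the notebook output, is that the DataFrame held columns of mixed types when it was saved, so the NPZ file stores object arrays (hence the need for `allow_pickle=True` when loading). A minimal sketch of the cast, using a toy array as a stand-in for `Xtrain`:

```python
import numpy as np

# Toy stand-in for a feature matrix saved from a mixed-dtype DataFrame
X_obj = np.array([[4, 499.0, 97],
                  [17, 49.0, 17]], dtype=object)
print(X_obj.dtype)  # object

# Cast to a single numeric dtype before handing the data to TensorFlow
X_float = np.asarray(X_obj, dtype=np.float32)
print(X_float.dtype, X_float.shape)
```

Applying the same cast to `Xtrain`, `Xvalidation`, `Xtest` and `Xpredict` assumes every column is already numeric; any value that still carries a suffix or other text would need to be cleansed first.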

In [None]:
# Train the model
history = model.fit(x=Xtrain, y=Ytrain,
                    batch_size=mini_batch_size, epochs=max_iterations,
                    callbacks=[clr],
                    validation_data=(Xvalidation, Yvalidation))

In [None]:
# Test/evaluate the model
score = model.evaluate(x=Xtest, y=Ytest, verbose=0)
print('Test loss (MSE): {}'.format(score[0]))
print('Test MSLE: {}'.format(score[1]))

In [None]:
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.ylabel('Cost')
plt.xlabel('Epoch #')
plt.title("Model Loss Curve")
plt.legend()
plt.grid()
plt.show()

In [None]:
plt.plot(history.history['mean_squared_logarithmic_error'], label='train_msle')
plt.plot(history.history['val_mean_squared_logarithmic_error'], label='val_msle')
plt.ylabel('Mean Squared Logarithmic Error')
plt.xlabel('Epoch #')
plt.title("MSLE Curve")
plt.legend()
plt.grid()
plt.show()

In [None]:
plt.plot(clr.history["lr"])
plt.ylabel('Learning Rate')
plt.xlabel('Iteration #')
plt.title("Cyclical Learning Rate (CLR)")
plt.grid()
plt.show()