# Popularity of Music Records

In [1]:
import pandas as pd

import statsmodels.api as sm

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Problem 1.1 - Understanding the Data
How many observations (songs) are from the year 2010?

In [2]:
songs = pd.read_csv('../data/songs.csv')
songs[songs['year']==2010].shape[0]

373

## Problem 1.2 - Understanding the Data
How many songs does the dataset include for which the artist name is "Michael Jackson"?

In [3]:
songs[songs['artistname']=='Michael Jackson'].shape[0]

18

## Problem 1.3 - Understanding the Data
Which of these songs by Michael Jackson made it to the Top 10? *Select all that apply*.
- You Rock My World
- You Are Not Alone

In [4]:
songs[
    (songs['artistname']=='Michael Jackson')
    & (songs['Top10']==1)
]['songtitle']

4328    You Rock My World
6206    You Are Not Alone
6209       Black or White
6217    Remember the Time
6914        In The Closet
Name: songtitle, dtype: object

## Problem 1.4 - Understanding the Data
The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, ...). What are the values of this variable that occur in our dataset? *Select all that apply*.
- 0
- 1
- 3
- 4
- 5
- 7

Which timesignature value is the most frequent among songs in our dataset?
- 4

In [5]:
songs['timesignature'].value_counts().sort_index()

0      10
1     143
3     503
4    6787
5     112
7      19
Name: timesignature, dtype: int64

## Problem 1.5 - Understanding the Data
Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?

In [6]:
songs[songs['tempo']==songs['tempo'].max()]['songtitle']

6205    Wanna Be Startin' Somethin'
Name: songtitle, dtype: object

## Problem 2.1 - Creating Our Prediction Model
We wish to predict whether or not a song will make it to the Top 10. To do this, first split the data into a training set "SongsTrain" consisting of all the observations up to and including 2009 song releases, and a testing set "SongsTest", consisting of the 2010 song releases.

How many observations (songs) are in the training set?

In [7]:
# Train on releases through 2009; hold out the 2010 releases for testing.
SongsTrain = songs[songs['year'] < 2010].copy()
SongsTest = songs[songs['year'] >= 2010].copy()
SongsTrain.shape[0]

7201
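
As a quick sanity check, the training and test sets should together account for every song. The sketch below only reuses figures already printed in this notebook; it assumes the `timesignature` column has no missing values, so that its value counts add up to the full row count. The variable names are illustrative.

```python
# Figures copied from the notebook outputs above.
n_train = 7201   # songs released through 2009 (Problem 2.1)
n_test = 373     # songs released in 2010 (Problem 1.1)

# timesignature value counts from Problem 1.4; they cover every row
# only if the column has no missing values (an assumption here).
timesignature_counts = {0: 10, 1: 143, 3: 503, 4: 6787, 5: 112, 7: 19}
n_total = sum(timesignature_counts.values())

# The two numbers should agree if the split partitions the dataset.
print(n_train + n_test, n_total)
```

In the notebook itself, the equivalent check would be `len(SongsTrain) + len(SongsTest) == len(songs)`.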

## Problem 2.2 - Creating Our Prediction Model
In this problem, our outcome variable is "Top10" - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model. We'll start by using all song attributes as our independent variables, which we'll call Model 1.

We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won't use the variables "year", "songtitle", "artistname", "songID" or "artistID".

Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?
- 4827.154102388615

In [8]:
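# Assumes the first five columns are the identifiers excluded above
# (year, songtitle, artistname, songID, artistID) and that Top10 is the last column.
features = songs.columns[5:]

# features[:-1] drops the trailing Top10 label, leaving only the predictors.
X_train1 = SongsTrain[features[:-1]].copy()
y_train1 = SongsTrain['Top10'].copy()
X_test1 = SongsTest[features[:-1]].copy()
y_test1 = SongsTest['Top10'].copy()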

SongsLog1 = sm.Logit(y_train1, sm.add_constant(X_train1)).fit()
print(SongsLog1.aic)

Optimization terminated successfully.
         Current function value: 0.330451
         Iterations 8
4827.154102388615
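
The reported AIC can be cross-checked against the optimizer output. This is a rough sketch under two assumptions: that the "Current function value" printed by statsmodels is the average negative log-likelihood per observation, and that AIC is computed as 2k minus twice the log-likelihood, with k counting the intercept. The variable names are illustrative and not part of the notebook.

```python
# Figures copied from the notebook output above.
n_obs = 7201                  # training observations (Problem 2.1)
mean_neg_loglik = 0.330451    # "Current function value" (rounded in the printout)
aic_reported = 4827.154102388615

# Assumption: the printed function value is -log-likelihood / n_obs.
log_lik = -n_obs * mean_neg_loglik

# AIC = 2k - 2 * log-likelihood, so solve for k, the number of fitted parameters.
k_implied = (aic_reported + 2 * log_lik) / 2

print(f"log-likelihood ~ {log_lik:.2f}")
print(f"implied parameter count ~ {k_implied:.2f}")
```

The implied count comes out very close to 34, which would be consistent with 33 predictors plus the constant added by `sm.add_constant`.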


## Problem 2.3 - Creating Our Prediction Model
Let's now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?
- The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 

In [9]:
SongsLog1.params[:8]

const                       14.699988
timesignature                0.126395
timesignature_confidence     0.744992
loudness                     0.299879
tempo                        0.000363
tempo_confidence             0.473227
key                          0.015882
key_confidence               0.308675
dtype: float64
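
To make the direction and size of these effects easier to read, the coefficients can be converted to odds ratios. The sketch below uses only the values printed above. It says nothing about statistical significance; for that, the p-values would have to be read from `SongsLog1.summary()` or `SongsLog1.pvalues`, which this notebook does not print.

```python
import math

# Coefficients copied from the SongsLog1.params output above.
confidence_coefs = {
    "timesignature_confidence": 0.744992,
    "tempo_confidence": 0.473227,
    "key_confidence": 0.308675,
}

# In logistic regression, exp(coefficient) is the multiplicative change in the
# odds of Top10 = 1 for a one-unit increase in the predictor, other things equal.
odds_ratios = {name: math.exp(beta) for name, beta in confidence_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio ~ {ratio:.2f}")
```

All three ratios are above 1 (about 2.1 for `timesignature_confidence`), so higher confidence goes with higher estimated odds of reaching the Top 10.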

## Problem 2.4 - Creating Our Prediction Model
In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?
- Mainstream listeners tend to prefer less complex songs

## Problem 2.5 - Creating Our Prediction Model
Songs with heavier instrumentation tend to be louder (have higher values in the variable "loudness") and more energetic (have higher values in the variable "energy").

By inspecting the coefficient of the variable "loudness", what does Model 1 suggest?
- Mainstream listeners prefer songs with heavy instrumentation

By inspecting the coefficient of the variable "energy", do we draw the same conclusions as above?
- No

In [10]:
print("Loudness:", SongsLog1.params.loc['loudness'])
print("Energy:", SongsLog1.params.loc['energy'])

Loudness: 0.2998794034266897
Energy: -1.5021444680863525


## Problem 3.1 - Beware of Multicollinearity Issues!
What is the correlation between the variables "loudness" and "energy" in the training set?
- 0.7399067084558058

Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".

In [11]:
X_train1[['loudness', 'energy']].corr().iloc[0, 1]

0.7399067084558058
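
One way to gauge how much this correlation matters is to translate it into shared variance and a variance inflation factor (VIF). The sketch below is a pairwise approximation only. A full VIF regresses each predictor on all the others, so the value computed here is a lower bound for the VIF of "loudness" or "energy" in Model 1.

```python
# Correlation between loudness and energy, copied from the output above.
r = 0.7399067084558058

# Share of the variance in one variable that is linearly explained by the other.
r_squared = r ** 2

# Pairwise VIF: 1 / (1 - r^2). With more predictors in the model the true VIF
# can only be equal or larger, so treat this as a lower bound.
vif_lower_bound = 1 / (1 - r_squared)

print(f"r^2 ~ {r_squared:.3f}")
print(f"VIF lower bound ~ {vif_lower_bound:.2f}")
```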

## Problem 3.2 - Beware of Multicollinearity Issues!
Create Model 2, which is Model 1 without the independent variable "loudness".

Look at the summary of SongsLog2, and inspect the coefficient of the variable "energy". What do you observe?
- Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1. 

In [12]:
X_train2 = X_train1.drop('loudness', axis=1)
X_test2 = X_test1.drop('loudness', axis=1)

SongsLog2 = sm.Logit(y_train1, sm.add_constant(X_train2)).fit()
SongsLog2.params.loc['energy']

Optimization terminated successfully.
         Current function value: 0.338276
         Iterations 8


0.18126033700549826

## Problem 3.3 - Beware of Multicollinearity Issues!
Now, create Model 3, which should be exactly like Model 1, but without the variable "energy".

Look at the summary of Model 3 and inspect the coefficient of the variable "loudness". Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
- Yes

In [13]:
X_train3 = X_train1.drop('energy', axis=1)
X_test3 = X_test1.drop('energy', axis=1)

SongsLog3 = sm.Logit(y_train1, sm.add_constant(X_train3)).fit()
SongsLog3.params.loc['loudness']

Optimization terminated successfully.
         Current function value: 0.332087
         Iterations 8


0.23055651709919034
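
The three models can be compared roughly from the optimizer output alone. The sketch below assumes that the "Current function value" is the average negative log-likelihood per observation, that Model 1 has 34 fitted parameters (the count implied by its reported AIC), and that Models 2 and 3 each have one fewer. Exact values would come from `SongsLog2.aic` and `SongsLog3.aic`.

```python
n_obs = 7201  # training observations (Problem 2.1)

# "Current function value" printed when fitting each model (rounded to 6 decimals).
mean_neg_loglik = {"Model 1": 0.330451, "Model 2": 0.338276, "Model 3": 0.332087}

# Assumption: Model 1 has 34 fitted parameters, and Models 2 and 3
# each drop exactly one predictor.
n_params = {"Model 1": 34, "Model 2": 33, "Model 3": 33}

# AIC = 2k - 2 * log-likelihood, with log-likelihood = -n_obs * mean_neg_loglik.
aic = {
    name: 2 * n_params[name] + 2 * n_obs * mean_neg_loglik[name]
    for name in mean_neg_loglik
}

for name, value in aic.items():
    print(f"{name}: AIC ~ {value:.1f}")
```

Under these assumptions Model 3 has a lower approximate AIC than Model 2, which is consistent with carrying Model 3 forward into the validation problems.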

## Problem 4.1 - Validating Our Model
Make predictions on the test set using Model 3. What is the accuracy of Model 3 on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)

In [14]:
pred_3 = SongsLog3.predict(sm.add_constant(X_test3))
pred_3_bool = (pred_3 >= 0.45).astype(int)

(y_test1 == pred_3_bool).mean()

0.8793565683646113

## Problem 4.2 - Validating Our Model
Let's check if there's any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs. What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)

In [15]:
# Share of non-hits in the test set, i.e. the accuracy of always predicting "not Top 10".
1 - y_test1.mean()

0.8418230563002681

## Problem 4.3 - Validating Our Model
It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?

Let's view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company's objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.

A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.

How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?

In [16]:
test_pred = y_test1.to_frame()
test_pred['predicted'] = pred_3_bool

cfm = test_pred.value_counts().sort_index()
cfm

Top10  predicted
0      0            309
       1              5
1      0             40
       1             19
dtype: int64

## Problem 4.4 - Validating Our Model
What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?
- 0.3220338983050847

What is the specificity of Model 3 on the test set, using a threshold of 0.45?
- 0.9840764331210191

In [17]:
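# cfm is indexed by (actual Top10, predicted), so cfm.loc[1, 1] counts true positives.
sensitivity = cfm.loc[1, 1] / (cfm.loc[1, 1] + cfm.loc[1, 0])  # TP / (TP + FN)
specificity = cfm.loc[0, 0] / (cfm.loc[0, 0] + cfm.loc[0, 1])  # TN / (TN + FP)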
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)

Sensitivity: 0.3220338983050847
Specificity: 0.9840764331210191
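
The accuracy, baseline, sensitivity, and specificity figures all follow from the same four confusion-matrix counts. As a minimal sketch, the arithmetic can be reproduced without the dataset, using the counts printed in Problem 4.3. The names `tn`, `fp`, `fn`, and `tp` are illustrative.

```python
# Counts copied from the confusion matrix in Problem 4.3 (actual, predicted).
tn, fp = 309, 5    # actual non-hits: predicted non-hit / predicted hit
fn, tp = 40, 19    # actual hits:     predicted non-hit / predicted hit

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of correct predictions (Problem 4.1)
baseline = (tn + fp) / total      # always predicting "not a hit" (Problem 4.2)
sensitivity = tp / (tp + fn)      # true positive rate (Problem 4.4)
specificity = tn / (tn + fp)      # true negative rate (Problem 4.4)

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f}")
print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```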


## Problem 4.5 - Validating Our Model
What conclusions can you make about our model? *Select all that apply*.
- Model 3 favors specificity over sensitivity.
- Model 3 provides conservative predictions and rarely predicts that a song will make it to the Top 10. So while it detects fewer than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits.
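
The claim about confidence in predicted hits can be quantified with precision (positive predictive value), a metric the notebook does not compute. The sketch below derives it from the Problem 4.3 confusion-matrix counts.

```python
# Counts copied from the confusion matrix in Problem 4.3.
tp = 19   # predicted hit, actually a hit
fp = 5    # predicted hit, actually not a hit

# Precision: of the songs Model 3 flags as Top 10, the share that really are.
predicted_hits = tp + fp
precision = tp / predicted_hits

print(f"predicted hits: {predicted_hits}, precision ~ {precision:.2f}")
```

Model 3 flags only 24 of the 373 test songs as hits, and about 79% of those flagged songs are actual Top 10 hits.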