# Notes on Interpreting Linear Regression

To get a better understanding of how to interpret regression coefficients, we will run a few regressions on a new dataset of the most streamed Spotify songs.

The goal of this exercise is to practice interpreting coefficients, rather than to build a regression that "makes sense". In general, the examples below won't make much substantive sense; they are only meant to illustrate how each type of variable is interpreted.

For this exercise we will use a dataset of the most streamed songs on Spotify in 2024. Here is a [link to the source of the data](https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024/data).

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model

In [6]:
spotifydf = pd.read_csv('sample_data/spotify_2024.csv', encoding='ISO-8859-1')
spotifydf.head()

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,684,62.0,17598718,114.0,18004655,22931,4818457.0,2669262,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,3,67.0,10422430,111.0,7780028,28444,6623075.0,1118279,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,536,136.0,36321847,172.0,5022621,5639,7208651.0,5285340,,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,2182,264.0,24684248,210.0,190260277,203384,,11822942,,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,1,82.0,17660624,105.0,4493884,7006,207179.0,457017,,1


In [7]:
spotifydf.columns

Index(['Track', 'Album Name', 'Artist', 'Release Date', 'ISRC',
       'All Time Rank', 'Track Score', 'Spotify Streams',
       'Spotify Playlist Count', 'Spotify Playlist Reach',
       'Spotify Popularity', 'YouTube Views', 'YouTube Likes', 'TikTok Posts',
       'TikTok Likes', 'TikTok Views', 'YouTube Playlist Reach',
       'Apple Music Playlist Count', 'AirPlay Spins', 'SiriusXM Spins',
       'Deezer Playlist Count', 'Deezer Playlist Reach',
       'Amazon Playlist Count', 'Pandora Streams', 'Pandora Track Stations',
       'Soundcloud Streams', 'Shazam Counts', 'TIDAL Popularity',
       'Explicit Track'],
      dtype='object')

In [8]:
spotifydf = spotifydf[[
    'Track', 'Album Name', 'Artist', 'Release Date', 'All Time Rank',
    'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'Spotify Popularity', 'YouTube Views', 'YouTube Likes']].copy()
spotifydf.head()

Unnamed: 0,Track,Album Name,Artist,Release Date,All Time Rank,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,Spotify Popularity,YouTube Views,YouTube Likes
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,1,390470936,30716,196631588,92.0,84274754,1713126
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,2,323703884,28113,174597137,92.0,116347040,3486739
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,3,601309283,54331,211607669,92.0,122599116,2228730
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,4,2031280633,269802,136569078,85.0,1096100899,10629796
4,Houdini,Houdini,Eminem,5/31/2024,5,107034922,7223,151469874,88.0,77373957,3670188


## Regression on levels

We already saw how to run and interpret this type of regression at the beginning of this lecture. It is simply a regression of $y$ on $x$ as they are (without modifying their units). We will estimate the following model:

$$
\text{Spotify Streams}_i = \beta_0 + \beta_1 \, \text{YouTube Views}_i + \epsilon_i
$$


In [10]:
spotifydf['Spotify Streams'].head()

Unnamed: 0,Spotify Streams
0,390470936
1,323703884
2,601309283
3,2031280633
4,107034922


In [11]:
# Convert strings to numbers
numvars = [
    'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'YouTube Views', 'YouTube Likes'
    ]

for col in numvars:
  spotifydf[col] = spotifydf[col].str.replace(',', '').map(float)

# Remove missing values
spotifydf = spotifydf.dropna(how='any')

In [12]:
spotifydf.head()

Unnamed: 0,Track,Album Name,Artist,Release Date,All Time Rank,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,Spotify Popularity,YouTube Views,YouTube Likes
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,1,390470900.0,30716.0,196631588.0,92.0,84274750.0,1713126.0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,2,323703900.0,28113.0,174597137.0,92.0,116347000.0,3486739.0
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,3,601309300.0,54331.0,211607669.0,92.0,122599100.0,2228730.0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,4,2031281000.0,269802.0,136569078.0,85.0,1096101000.0,10629796.0
4,Houdini,Houdini,Eminem,5/31/2024,5,107034900.0,7223.0,151469874.0,88.0,77373960.0,3670188.0


In [13]:
# Define the data
X = spotifydf[['YouTube Views']].copy()
y = spotifydf[['Spotify Streams']].copy()

In [14]:
levels_reg = linear_model.LinearRegression()
levels_reg.fit(X, y)

In [23]:
beta0 = levels_reg.intercept_[0]
beta1 = levels_reg.coef_[0][0]

print(f"Fitted model: Spotify Streams = {beta0:.2f} + {beta1:.2f} YouTube Views")

Fitted model: Spotify Streams = 347180380.84 + 0.35 YouTube Views


**How would you interpret $\beta_0$ and $\beta_1$?**

- $\beta_1$: On average, one additional YouTube view is associated with 0.35 additional Spotify streams.
- $\beta_0$: On average, songs with 0 YouTube views are predicted to have about 347 million Spotify streams.
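
As a quick arithmetic check of this reading, the sketch below plugs the rounded coefficients from the printed fitted model into the regression line. The helper `predict_streams` is a hypothetical name introduced here for illustration; it is not part of the notebook.

```python
# Rounded coefficients copied from the printed fitted model above.
BETA0 = 347180380.84  # intercept: predicted streams at 0 YouTube views
BETA1 = 0.35          # slope: extra streams per extra YouTube view


def predict_streams(youtube_views: float) -> float:
    """Predicted Spotify streams from the levels regression (hypothetical helper)."""
    return BETA0 + BETA1 * youtube_views


# One additional view changes the prediction by beta1.
one_view_effect = predict_streams(1_000_001) - predict_streams(1_000_000)

# One million additional views change it by beta1 * 1,000,000.
million_view_effect = predict_streams(2_000_000) - predict_streams(1_000_000)

print(f"Effect of 1 extra view: {one_view_effect:.2f} streams")
print(f"Effect of 1,000,000 extra views: {million_view_effect:,.0f} streams")
print(f"Prediction at 0 views: {predict_streams(0):,.0f} streams")
```

Because the model is linear in levels, the effect of an extra view is the same no matter how many views a song already has.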

## Models involving logarithms

Here we look at models in which one or more variables are measured in logarithms. We briefly discussed them at the beginning of the lecture, but here we will put more emphasis on them. These models are usually presented in one of three forms:

**Lin-Log:** $y = \beta_0 + \beta_1 \log(x) + \epsilon$ \\
**Log-Lin:** $\log(y) = \beta_0 + \beta_1 x + \epsilon$ \\
**Log-Log:** $\log(y) = \beta_0 + \beta_1 \log(x) + \epsilon$
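
A brief sketch of where the usual readings of these three forms come from. It uses the same symbols as above and relies on the approximation $\Delta \log(x) \approx \Delta x / x$, which holds only for small changes. Here $\%\Delta x = 100 \, \Delta x / x$ and $\%\Delta y = 100 \, \Delta y / y$ are shorthand introduced for this note.

$$
\begin{aligned}
\text{Lin-Log:} \quad & \Delta y \approx \frac{\beta_1}{100} \, (\%\Delta x) \\
\text{Log-Lin:} \quad & \%\Delta y \approx 100 \, \beta_1 \, \Delta x \\
\text{Log-Log:} \quad & \%\Delta y \approx \beta_1 \, (\%\Delta x)
\end{aligned}
$$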

Let's run these regressions and see how to interpret the coefficients.


In [25]:
X.head()

Unnamed: 0,YouTube Views
0,84274750.0
1,116347000.0
2,122599100.0
3,1096101000.0
4,77373960.0


In [26]:
# First, let's run the log-log model
X['log_youtube'] = np.log(X['YouTube Views'])
logy = np.log(y['Spotify Streams'])

# Run the regression as usual
loglog_reg = linear_model.LinearRegression()
loglog_reg.fit(X[['log_youtube']], logy)

In [27]:
beta0 = loglog_reg.intercept_
beta1 = loglog_reg.coef_[0]

print(f"Fitted model: log(Spotify Streams) = {beta0:.2f} + {beta1:.2f} log(YouTube Views)")

Fitted model: log(Spotify Streams) = 11.02 + 0.43 log(YouTube Views)


**What is the interpretation of $\beta_1 = 0.43$?**

- On average, an increase of 1\% in the number of YouTube views is associated with 0.43\% increase in the number of Spotify streams
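
The "1% leads to 0.43%" reading is an approximation that works well for small changes. A minimal sketch of the exact calculation implied by a log-log model, using the rounded slope printed above, is shown below. The helper `pct_change_in_y` is a hypothetical name used only for illustration.

```python
BETA1 = 0.43  # rounded log-log slope from the fitted model above


def pct_change_in_y(pct_change_in_x: float, elasticity: float = BETA1) -> float:
    """Exact % change in y implied by a log-log model for a given % change in x."""
    return ((1 + pct_change_in_x / 100) ** elasticity - 1) * 100


# Compare a small, a medium, and a large change in YouTube views.
for pct in (1, 10, 100):
    print(f"{pct:>3}% more YouTube views -> {pct_change_in_y(pct):.2f}% more Spotify streams")
```

For a 1% change the exact figure is very close to 0.43%; for large changes, such as doubling the views, the simple rule of multiplying by $\beta_1$ overstates the effect.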

Now let's run a lin-log model.

In [34]:
# Initiate the model
linlog_reg = linear_model.LinearRegression()
linlog_reg.fit(X[['log_youtube']], y['Spotify Streams'])

beta0 = linlog_reg.intercept_
beta1 = linlog_reg.coef_[0]

print(f"Fitted model: Spotify Streams = {beta0:.2f} + {beta1:.2f} log(YouTube Views)")

Fitted model: Spotify Streams = -2298164271.87 + 148694312.90 log(YouTube Views)


**What is the interpretation of $\beta_1$ now?**

- On average, an increase of 1% in the number of YouTube views is associated with an increase of about 1,486,943 ($\beta_1 / 100$) in the number of Spotify streams
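
In a lin-log model the effect of a 1% increase in $x$ is roughly $\beta_1 / 100$. The sketch below compares that rule of thumb with the exact change $\beta_1 \log(1.01)$, using the slope printed above. The helper `change_in_y` is a hypothetical name used only for illustration.

```python
import math

BETA1 = 148694312.90  # lin-log slope copied from the fitted model above


def change_in_y(pct_change_in_x: float, slope: float = BETA1) -> float:
    """Exact change in y implied by a lin-log model for a given % change in x."""
    return slope * math.log(1 + pct_change_in_x / 100)


approx = BETA1 / 100    # rule of thumb for a 1% increase in x
exact = change_in_y(1)  # beta1 * log(1.01)

print(f"Rule of thumb (beta1 / 100): {approx:,.0f} streams")
print(f"Exact (beta1 * log(1.01)):   {exact:,.0f} streams")
```

The two numbers differ by less than 1%, which is why the rule of thumb is normally used when reporting results.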

## Percentages as independent variables

In [32]:
spotifydf['Spotify Popularity'].describe()

Unnamed: 0,Spotify Popularity
count,3667.0
mean,64.274611
std,15.127612
min,1.0
25%,61.0
50%,67.0
75%,73.0
max,96.0


In [33]:
Xpct = spotifydf[['Spotify Popularity']].copy()
Xpct['pctg'] = Xpct['Spotify Popularity'] / 100
Xpct.head()

Unnamed: 0,Spotify Popularity,pctg
0,92.0,0.92
1,92.0,0.92
2,92.0,0.92
3,85.0,0.85
4,88.0,0.88


In [44]:
# Initiate the model
pct_reg = linear_model.LinearRegression()
pct_reg.fit(Xpct[['pctg']], logy)

beta0 = pct_reg.intercept_
beta1 = pct_reg.coef_[0]

print(f"Fitted model: log(Spotify Streams) = {beta0:.2f} + {beta1:.2f} Spotify Popularity (pctg)")

Fitted model: log(Spotify Streams) = 14.37 + 7.50 Spotify Popularity (pctg)


### Interpretation

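- On average, an increase of 1 percentage point in Spotify Popularity is associated with an increase of approximately 7.5% in the number of Spotify streams. Because `pctg` is measured on a 0 to 1 scale, 1 percentage point is a change of 0.01, so the approximate effect is $100 \times 7.50 \times 0.01 = 7.5\%$.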
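
The 7.5% figure uses the small-change approximation for log-lin models. A minimal sketch of the exact calculation, $\exp(\beta_1 \Delta x) - 1$, with the rounded slope printed above follows. The helper `exact_pct_change` and the variable names are hypothetical and used only for illustration.

```python
import math

BETA1 = 7.50  # rounded slope on popularity measured as a share (0 to 1)


def exact_pct_change(delta_x: float, slope: float = BETA1) -> float:
    """Exact % change in y implied by a log-lin model when x changes by delta_x."""
    return (math.exp(slope * delta_x) - 1) * 100


one_point = 0.01  # 1 percentage point on the 0-to-1 scale
print(f"Approximation (100 * beta1 * 0.01): {100 * BETA1 * one_point:.2f}%")
print(f"Exact (exp(beta1 * 0.01) - 1):      {exact_pct_change(one_point):.2f}%")

# Rescaling the regressor rescales the slope: on the original 0-100 scale
# the same relationship has slope beta1 / 100, and a 1-point change is 1 unit.
slope_on_0_100_scale = BETA1 / 100
print(f"Exact, using the 0-100 scale:       {exact_pct_change(1, slope_on_0_100_scale):.2f}%")
```

The exact effect is slightly larger than the approximation (roughly 7.8% rather than 7.5%), and it does not depend on whether popularity is stored on a 0 to 1 or a 0 to 100 scale.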

## Categorical variables as predictors

Here we will learn how to interpret the coefficients when the predictors are categorical values.

In this example, we will create an indicator variable (a variable that can only take two possible values: 1 or 0) that takes the value 1 if a song was released in 2024 (on or after January 1, 2024) and 0 if it was released before 2024. We will then see how to interpret the coefficients of such a regression.

In [57]:
# Create the dummy variable
spotifydf['date'] = pd.to_datetime(spotifydf['Release Date'])
spotifydf['yr2024'] = 1*(spotifydf['date'] >= '2024-01-01')
spotifydf[['date', 'yr2024']].head(10)

Unnamed: 0,date,yr2024
0,2024-04-26,1
1,2024-05-04,1
2,2024-03-19,1
3,2023-01-12,0
4,2024-05-31,1
5,2023-11-10,0
6,2024-01-18,1
7,2024-02-02,1
9,2024-05-23,1
10,2024-05-10,1


In [58]:
Xdummy = spotifydf[['yr2024']].copy()

# Initiate the model
cat_reg = linear_model.LinearRegression()
cat_reg.fit(Xdummy, y['Spotify Streams'])

beta0 = cat_reg.intercept_
beta1 = cat_reg.coef_[0]

print(f"Fitted model: Spotify Streams = {beta0:.2f} + {beta1:.2f} Year2024")

Fitted model: Spotify Streams = 540638657.48 + -465773993.02 Year2024


- Songs released before 2024 (Year2024 == 0) have, on average, about 541 million Spotify streams.
- On average, songs released in 2024 have about 466 million fewer Spotify streams than songs released before 2024.

In [59]:
spotifydf.groupby('yr2024')['Spotify Streams'].mean()

yr2024,Spotify Streams
0,540638700.0
1,74864660.0
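
These group means line up with the regression: the intercept equals the mean for `yr2024 == 0`, and the intercept plus the slope (540,638,657.48 - 465,773,993.02 = 74,864,664.46) equals the mean for `yr2024 == 1`, up to display rounding. The sketch below illustrates why on a small, made-up dataset, using the closed-form OLS formulas in plain Python. The toy numbers and the helper `ols_one_regressor` are hypothetical and are not taken from the Spotify data.

```python
from statistics import mean


def ols_one_regressor(x, y):
    """Closed-form OLS intercept and slope for a single regressor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical toy data: streams in millions, dummy = 1 for 2024 releases.
dummy = [0, 0, 0, 1, 1]
streams = [500.0, 600.0, 520.0, 80.0, 70.0]

intercept, slope = ols_one_regressor(dummy, streams)
mean_0 = mean(s for d, s in zip(dummy, streams) if d == 0)
mean_1 = mean(s for d, s in zip(dummy, streams) if d == 1)

print(f"intercept = {intercept:.2f}, mean of group 0 = {mean_0:.2f}")
print(f"intercept + slope = {intercept + slope:.2f}, mean of group 1 = {mean_1:.2f}")
```

With a single 0/1 regressor, OLS reproduces the two group means, so the slope is the difference in means between the two groups.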
