# Notes on Interpreting Linear Regression

To get a better understanding of how to interpret regression coefficients, we will run a couple of regressions on a new dataset of the most streamed Spotify songs.

The idea of this exercise is for you to practice interpreting coefficients, rather than to build a regression that "makes sense". In general, the examples we run won't make much substantive sense; they are only meant to illustrate how to interpret the variables.

In [None]:
import pandas as pd
from sklearn import linear_model
import numpy as np

In [None]:
spotifydf = pd.read_csv('sample_data/spotify_2024.csv', encoding='latin')
spotifydf.head()

In [None]:
spotifydf.columns

In [None]:
spotifydf = spotifydf[[
    'Track', 'Album Name', 'Artist', 'Release Date', 'All Time Rank',
    'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'Spotify Popularity', 'YouTube Views', 'YouTube Likes']].copy()
spotifydf.head()

## Regression on levels

We already saw how to run and interpret this type of regression at the beginning of this lecture. It is simply a regression of $y$ on $x$ as they are (without modifying their units). We will run the following estimation:

$$
\text{Spotify Streams}_i = \beta_0 + \beta_1 \, \text{YouTube Views}_i + \epsilon_i
$$


In [None]:
# Convert strings to numbers
numvars = [
    'Spotify Streams', 'Spotify Playlist Count', 'Spotify Playlist Reach',
    'YouTube Views', 'YouTube Likes'
    ]

for col in numvars:
  spotifydf[col] = spotifydf[col].str.replace(',', '').map(float)

# Remove rows with missing values in any column
spotifydf = spotifydf.dropna(how='any')
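
The conversion above assumes that every non-missing entry is a number written with comma separators. As a minimal sketch of a more forgiving alternative, the cell below uses `pd.to_numeric` with `errors='coerce'`, so entries that cannot be parsed become `NaN` and are dropped together with the missing values. The toy DataFrame is invented for illustration and is not taken from the Spotify file.

```python
import pandas as pd

# Toy data invented for illustration; not from the Spotify file
toy = pd.DataFrame({
    'Spotify Streams': ['1,234,567', '89,000', None, 'n/a'],
    'YouTube Views': ['2,000', '3,500,000', '10', '7'],
})

for col in ['Spotify Streams', 'YouTube Views']:
    # Strip thousands separators, then turn anything unparseable into NaN
    no_commas = toy[col].str.replace(',', '', regex=False)
    toy[col] = pd.to_numeric(no_commas, errors='coerce')

# Rows 2 and 3 have a missing or unparseable stream count and are dropped
toy_clean = toy.dropna(how='any')
print(toy_clean)
```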

In [None]:
spotifydf.head()

In [None]:
# Define the data
X = spotifydf[['YouTube Views']].copy()
y = spotifydf[['Spotify Streams']].copy()

In [None]:
levels_reg = linear_model.LinearRegression()
levels_reg.fit(X, y)

In [None]:
beta0 = levels_reg.intercept_[0]
beta1 = levels_reg.coef_[0][0]

print(f"Fitted model: Spotify Streams = {beta0:.2f} + {beta1:.2f} YouTube Views")

**How would you interpret $\beta_0$ and $\beta_1$**?
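
One way to check your answer: in a levels regression, $\beta_0$ is the predicted value of $y$ when $x = 0$, and $\beta_1$ is the change in the predicted $y$ for a one-unit increase in $x$. The sketch below verifies both statements on synthetic data. The numbers are invented (true intercept 5, true slope 2) and have nothing to do with the Spotify dataset.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, invented for illustration: y = 5 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=500)
y = 5.0 + 2.0 * x + rng.normal(0, 1, size=500)

reg = linear_model.LinearRegression()
reg.fit(x.reshape(-1, 1), y)

b0 = reg.intercept_
b1 = reg.coef_[0]

# Predictions at x = 0, x = 10, and x = 11
pred_at_0 = reg.predict(np.array([[0.0]]))[0]
pred_at_10 = reg.predict(np.array([[10.0]]))[0]
pred_at_11 = reg.predict(np.array([[11.0]]))[0]

print(f"beta0 = {b0:.3f}, prediction at x = 0: {pred_at_0:.3f}")
print(f"beta1 = {b1:.3f}, change from x = 10 to x = 11: {pred_at_11 - pred_at_10:.3f}")
```

In the Spotify regression the same reading applies: $\beta_1$ is the change in predicted streams associated with one additional YouTube view.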

## Models involving logarithms

Here we are going to look at models in which one or both variables are measured in logarithms. We briefly discussed them at the beginning of the lecture, but here we will put more emphasis on them. These models are often presented in three possible forms:

**Lin-Log:** $y = \beta_0 + \beta_1 \log(x) + \epsilon$ \\
**Log-Lin:** $\log(y) = \beta_0 + \beta_1 x + \epsilon$ \\
**Log-Log:** $\log(y) = \beta_0 + \beta_1 \log(x) + \epsilon$
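
As a rough guide, the three forms can be read using the approximation $\Delta \log(z) \approx \Delta z / z$, which holds for small changes. Here $\%\Delta z$ denotes the percentage change in $z$. These are approximations, and they get worse as the changes get larger.

$$
\begin{aligned}
\text{Lin-Log:} \quad & \Delta y \approx \frac{\beta_1}{100} \cdot \%\Delta x \\
\text{Log-Lin:} \quad & \%\Delta y \approx 100 \, \beta_1 \cdot \Delta x \\
\text{Log-Log:} \quad & \%\Delta y \approx \beta_1 \cdot \%\Delta x
\end{aligned}
$$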

Let's run these regressions and see how to interpret the coefficients.


In [None]:
X

In [None]:
# First, let's run the log-log model
X['log_youtube'] = np.log(X['YouTube Views'])
logy = np.log(y['Spotify Streams'])

# Run the regression as usual
loglog_reg = linear_model.LinearRegression()
loglog_reg.fit(X[['log_youtube']], logy)

In [None]:
beta0 = loglog_reg.intercept_
beta1 = loglog_reg.coef_[0]

print(f"Fitted model: log(Spotify Streams) = {beta0:.2f} + {beta1:.2f} log(YouTube Views)")

**What is the interpretation of $\beta_1 = 0.43$?**
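
A hint for checking your answer: in a log-log model, $\beta_1$ is approximately the percentage change in $y$ associated with a 1% increase in $x$ (an elasticity). With $\beta_1 = 0.43$, that reading would be "a 1% increase in YouTube views is associated with roughly 0.43% more Spotify streams". The sketch below checks the approximation on synthetic data. The data are invented (true elasticity 0.5) and `x0` is an arbitrary evaluation point.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, invented for illustration: log(y) = 1 + 0.5 log(x) + noise
rng = np.random.default_rng(0)
x = rng.lognormal(mean=10, sigma=1, size=1000)
y = np.exp(1.0) * x ** 0.5 * np.exp(rng.normal(0, 0.1, size=1000))

reg = linear_model.LinearRegression()
reg.fit(np.log(x).reshape(-1, 1), np.log(y))
b1 = reg.coef_[0]

# Compare predictions at x0 and at x0 increased by 1%
x0 = 50_000.0  # arbitrary evaluation point
log_pred_base = reg.predict(np.array([[np.log(x0)]]))[0]
log_pred_up = reg.predict(np.array([[np.log(1.01 * x0)]]))[0]

# Convert the difference in log predictions into a percentage change in y
pct_change_y = (np.exp(log_pred_up - log_pred_base) - 1) * 100

print(f"beta1 = {b1:.4f}")
print(f"% change in predicted y for a 1% increase in x: {pct_change_y:.4f}")
```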

Now let's run a lin-log model.

In [None]:
# Initialize and fit the model
linlog_reg = linear_model.LinearRegression()
linlog_reg.fit(X[['log_youtube']], y['Spotify Streams'])

beta0 = linlog_reg.intercept_
beta1 = linlog_reg.coef_[0]

print(f"Fitted model: Spotify Streams = {beta0:.2f} + {beta1:.2f} log(YouTube Views)")

**What is the interpretation of $\beta_1$ now?**
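
A hint for checking your answer: in a lin-log model, a 1% increase in $x$ changes the predicted $y$ by $\beta_1 \log(1.01) \approx \beta_1 / 100$ units of $y$. The sketch below verifies this on synthetic data. The data are invented (true coefficient 300) and `x0` is an arbitrary evaluation point.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, invented for illustration: y = 10 + 300 log(x) + noise
rng = np.random.default_rng(0)
x = rng.lognormal(mean=10, sigma=1, size=1000)
y = 10.0 + 300.0 * np.log(x) + rng.normal(0, 5, size=1000)

reg = linear_model.LinearRegression()
reg.fit(np.log(x).reshape(-1, 1), y)
b1 = reg.coef_[0]

# Change in predicted y (in units of y) when x0 increases by 1%
x0 = 50_000.0  # arbitrary evaluation point
pred_base = reg.predict(np.array([[np.log(x0)]]))[0]
pred_up = reg.predict(np.array([[np.log(1.01 * x0)]]))[0]
delta_y = pred_up - pred_base

print(f"beta1 / 100 = {b1 / 100:.4f}")
print(f"change in predicted y for a 1% increase in x: {delta_y:.4f}")
```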

## Percentages as independent variables

## Categorical variables as predictors