# Performing Linear Regression
### Building a predictor of the release year

In this notebook, we use the scikit-learn library to fit a linear regression on the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from tueplots import bundles
plt.rcParams.update(bundles.neurips2021(usetex=False))

In [2]:
# Import the training and test datasets
train = pd.read_csv('../data/train_set.csv')
test = pd.read_csv('../data/test_set.csv')

This cell selects the predictor columns, standardizes them to zero mean and unit variance, and extracts the release year as the target, so that the regression can be fitted.

In [3]:
polynomial_degree = 3
predictors = ['explicit', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

X_train = train[predictors]
X_test = test[predictors]

scaler = preprocessing.StandardScaler().fit(pd.concat([train, test])[predictors])

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

y_train = train['year']
y_test = test['year']
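
The cell above fits the scaler on the training and test rows together, so test-set statistics influence the preprocessing. A common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below illustrates that variant on synthetic stand-in data; the array names and the random data are assumptions for illustration and are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the notebook's dataset).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Fit the scaler on the training rows only.
scaler_demo = StandardScaler().fit(X_train_demo)

# Apply the training statistics to both splits.
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# Training columns end up with mean 0 and unit variance;
# test columns are only approximately standardized.
print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

With this variant the reported test metrics may differ slightly from the ones shown in this notebook, since the scaling constants change.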

In [4]:
# Setting polynomial_degree to -1 skips the polynomial feature expansion
if polynomial_degree != -1:
    poly = preprocessing.PolynomialFeatures(degree=polynomial_degree, interaction_only=False, include_bias=False)
    X_train = poly.fit_transform(X_train)
    X_test = poly.transform(X_test)

In [5]:
reg = LinearRegression().fit(X_train, y_train)
reg.coef_.shape

(454,)
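
The 454 coefficients follow from the polynomial expansion: with 12 predictors, degree 3, and no bias column, there is one feature per monomial of total degree 1 to 3. As a quick check, the sketch below counts these monomials in two ways, with a closed-form binomial and by enumeration. The function names are hypothetical helpers introduced for illustration.

```python
from itertools import combinations_with_replacement
from math import comb


def n_polynomial_features(n_features, degree):
    """Closed form: monomials of total degree 1..degree, excluding the bias term."""
    return comb(n_features + degree, degree) - 1


def n_polynomial_features_enumerated(n_features, degree):
    """Same count, obtained by enumerating every monomial explicitly."""
    return sum(
        1
        for d in range(1, degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )


print(n_polynomial_features(12, 3))             # 454
print(n_polynomial_features_enumerated(12, 3))  # 454
```

The count breaks down as 12 linear terms, 78 terms of degree 2, and 364 terms of degree 3.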

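With the sklearn linear regression model fitted, the next cell plots one coefficient per predictor. The plot is only drawn when no polynomial expansion is used (polynomial_degree == -1).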

In [6]:
if polynomial_degree == -1:
    fig, ax = plt.subplots()
    ax.bar(predictors, reg.coef_)
    plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
    ax.hlines(0,-.2, 11.2, colors='grey', linestyle='dotted')
    plt.show()

In [9]:
predicted_values = reg.predict(X_test)
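# Random baseline: 10 draws of uniformly random years; randint excludes the upper bound, hence the + 1
random_values = np.random.randint(y_test.min(), y_test.max() + 1, size=(10,) + y_test.shape)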

print(f'L1 Error of random: {np.array([np.abs(random - y_test).mean() for random in random_values]).mean()}')
print(f'L1 Error of predictor: {np.abs(predicted_values - y_test).mean()}')
print(f'Accuracy of random: {np.array([np.sum(random == y_test)/len(y_test) for random in random_values]).mean()}')
# reg.score returns the coefficient of determination (R^2), not a classification accuracy
print(f'R^2 of predictor: {reg.score(X_test, y_test)}')

L1 Error of random: 10.121316774636012
L1 Error of predictor: 5.964536530912798
Accuracy of random: 0.03144376056577976
R^2 of predictor: 0.1273191712591384
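
Note that the last two numbers measure different things: the random baseline reports the fraction of exactly matching years, whereas the score method of a LinearRegression returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes both quantities by hand on a tiny made-up example so that they can be compared on equal terms. The helper names and the toy arrays are assumptions for illustration, not values from this notebook.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def exact_match_rate(y_true, y_pred):
    """Fraction of predictions that hit the true year after rounding."""
    return float(np.mean(np.rint(y_pred) == np.asarray(y_true)))


# Toy data (made up for illustration).
years = np.array([1990, 2000, 2010, 2020])
preds = np.array([1992.4, 2000.2, 2007.0, 2019.6])

print(f'R^2: {r2(years, preds):.5f}')
print(f'Exact-match rate: {exact_match_rate(years, preds)}')
```

Applying a rounded exact-match rate to predicted_values would give a number directly comparable to the random baseline's accuracy.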
