Task:

1. Load data
   - In all the models below, the explained (target) variable is sensor_measurement_4
   - select variables

2. Build regression model
   - build training model
   - build test model
   - show coefficient of determination (R^2)

3. Build regression models:
   - Ridge (L2)
   - Lasso (L1)
   - ElasticNet (L1 + L2)

4. Build dataframe with coefficients of each model

5. Build dataframe with coefficient of determination of each model



In [1]:
%pip install pandas
%pip install scikit-learn
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [3]:
# read data
variables = pd.read_csv('variables.txt', header=None)
variables = variables.T

dataset_test = pd.read_csv('dataset.test.txt', header=None, sep=r"\s+")
dataset_train = pd.read_csv('dataset.train.txt', header=None, sep=r"\s+")

dataset_test.columns = variables.values[0]
dataset_train.columns = variables.values[0]

In [4]:
# define variables to drop based on low standard deviation
low_std = dataset_train.columns[dataset_train.std() < 1e-4]

# drop variables with low standard deviation
dataset_train = dataset_train.drop(columns=low_std)
dataset_test = dataset_test.drop(columns=low_std)

In [5]:
# check correlation between variables and target variable 
# sensor_measurement_4
correlation_matrix = dataset_train.corr()
correlation_with_target = correlation_matrix["sensor_measurement_4"].sort_values(ascending=False)

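# keep only variables whose correlation with the target is at least 0.5
# (the comparison uses the signed correlation, so strongly negative
# correlations fall below the threshold and are dropped as well)
threshold = 0.5
low_cor = correlation_with_target[correlation_with_target < threshold].index.tolist()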

# drop variables with low correlation
dataset_train = dataset_train.drop(columns=low_cor)
dataset_test = dataset_test.drop(columns=low_cor)
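
The filter above compares signed correlations, so a feature with a strong negative correlation to the target (for example r = -0.9) would be removed too. If that is not intended, a variant based on the absolute value could look like the sketch below. It runs on a small synthetic frame; the helper `select_by_abs_correlation` and the columns `pos`, `neg`, `noise`, and `target` are hypothetical names used only for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd


def select_by_abs_correlation(df, target, threshold=0.5):
    """Return columns whose |Pearson correlation| with `target` is >= threshold."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()


# synthetic example data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "pos": x + 0.1 * rng.normal(size=500),    # strongly positively correlated
    "neg": -x + 0.1 * rng.normal(size=500),   # strongly negatively correlated
    "noise": rng.normal(size=500),            # unrelated to the target
    "target": x,
})

kept = select_by_abs_correlation(demo, "target", threshold=0.5)
print(kept)
```

With a signed comparison the `neg` column would be discarded even though it carries as much information about the target as `pos`.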

In [6]:
# split train set to train and validation

X_train, X_val, y_train, y_val = train_test_split(dataset_train.drop(columns=["sensor_measurement_4"]), 
                                                  dataset_train["sensor_measurement_4"], 
                                                  test_size=0.2, 
                                                  random_state=42)

# test set
X_test = dataset_test.drop(columns=["sensor_measurement_4"])
y_test = dataset_test["sensor_measurement_4"]

In [7]:
# linear regression

# train model
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)

# verify on validation set
# (despite its name, r2_train_score holds the R^2 on the validation split)
y_linear_val_pred = model_linear.predict(X_val)
r2_train_score = r2_score(y_val, y_linear_val_pred)

# evaluate on test set
y_linear_test_pred = model_linear.predict(X_test)
r2_test_score = r2_score(y_test, y_linear_test_pred)
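
The variable `r2_train_score` above is computed on the validation split (`X_val`, `y_val`), not on the data the model was fitted on. If a genuine training-set R^2 is also of interest, both scores could be computed side by side, roughly as in this sketch. The data is synthetic and all names (`X_demo`, `y_demo`, `demo_model`, `r2_fit`, `r2_holdout`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(X_tr, y_tr)

# R^2 on the data the model was fitted on
r2_fit = r2_score(y_tr, demo_model.predict(X_tr))
# R^2 on the held-out validation data
r2_holdout = r2_score(y_va, demo_model.predict(X_va))

print({"fit": r2_fit, "holdout": r2_holdout})
```

A large gap between the two numbers is a common sign of overfitting, which is the situation the regularized models below are meant to address.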

In [8]:
# RidgeCV

alphas = np.logspace(-6, 6, 13)

# train model
model_ridge = RidgeCV(alphas=alphas, cv=5)
model_ridge.fit(X_train, y_train)

# verify on validation set
y_ridge_val_pred = model_ridge.predict(X_val)
r2_ridge_train_score = r2_score(y_val, y_ridge_val_pred)

# evaluate on test set
y_ridge_test_pred = model_ridge.predict(X_test)
r2_ridge_test_score = r2_score(y_test, y_ridge_test_pred)
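
Ridge, Lasso, and ElasticNet penalties act on the raw coefficient sizes, so features measured on very different scales are penalized unevenly. One common remedy, not used in this notebook, is to standardize the features inside a pipeline. The sketch below shows the idea on synthetic data; the data and the names `X_demo`, `y_demo`, `scaled_ridge`, and `scaled_coefs` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# two features on very different scales (hypothetical data)
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=200),
    rng.normal(scale=1000.0, size=200),
])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# the scaler is fitted only on the data passed to fit(), then reused in predict()
scaled_ridge = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-6, 6, 13), cv=5),
)
scaled_ridge.fit(X_demo, y_demo)

# these coefficients refer to the standardized features, not the raw ones
scaled_coefs = scaled_ridge.named_steps["ridgecv"].coef_
print(scaled_coefs)
```

Coefficients obtained this way are expressed per standard deviation of each feature, so they are not directly comparable with the raw-scale coefficients reported further down.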

In [9]:
# LassoCV

# train model
model_lasso = LassoCV(alphas=alphas, cv=5)
model_lasso.fit(X_train, y_train)

# verify on validation set
y_lasso_val_pred = model_lasso.predict(X_val)
r2_lasso_train_score = r2_score(y_val, y_lasso_val_pred)

# evaluate on test set
y_lasso_test_pred = model_lasso.predict(X_test)
r2_lasso_test_score = r2_score(y_test, y_lasso_test_pred)

In [10]:
# ElasticNetCV

l1_ratios = np.linspace(0.01, 1, 100)

# train model
model_elastic = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5)
model_elastic.fit(X_train, y_train)

# verify on validation set
y_elastic_val_pred = model_elastic.predict(X_val)
r2_elastic_train_score = r2_score(y_val, y_elastic_val_pred)

# evaluate on test set
y_elastic_test_pred = model_elastic.predict(X_test)
r2_elastic_test_score = r2_score(y_test, y_elastic_test_pred)
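
After fitting, the cross-validated estimators expose the hyperparameters they selected (`alpha_`, and for `ElasticNetCV` also `l1_ratio_`). Looking at them can help explain why regularized coefficients end up close to the ordinary least-squares ones: a very small selected `alpha_` means almost no penalty is applied. A sketch on synthetic data follows; the data and the names `alphas_demo`, `ridge_demo`, `lasso_demo`, `elastic_demo`, and `chosen` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV

# synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.3, size=300)

alphas_demo = np.logspace(-4, 2, 7)
l1_ratios_demo = [0.1, 0.5, 0.9]

ridge_demo = RidgeCV(alphas=alphas_demo, cv=5).fit(X_demo, y_demo)
lasso_demo = LassoCV(alphas=alphas_demo, cv=5).fit(X_demo, y_demo)
elastic_demo = ElasticNetCV(
    alphas=alphas_demo, l1_ratio=l1_ratios_demo, cv=5
).fit(X_demo, y_demo)

# hyperparameters picked by cross-validation
chosen = {
    "Ridge": {"alpha": ridge_demo.alpha_},
    "Lasso": {"alpha": lasso_demo.alpha_},
    "ElasticNet": {
        "alpha": elastic_demo.alpha_,
        "l1_ratio": elastic_demo.l1_ratio_,
    },
}
print(chosen)
```

The same attributes are available on `model_ridge`, `model_lasso`, and `model_elastic` once the cells above have been run.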

In [11]:
# models coefficients
selected_variables = X_train.columns

factors = pd.DataFrame({
    'Factor': selected_variables,
    'LinearRegression': model_linear.coef_,
    'Ridge': model_ridge.coef_,
    'Lasso': model_lasso.coef_,
    'ElasticNet': model_elastic.coef_
})

factors

                  Factor  LinearRegression      Ridge      Lasso  ElasticNet
0         time_in_cycles           0.01386    0.01386    0.01386    0.013862
1   sensor_measurement_2          1.717713   1.717762   1.717741    1.718242
2   sensor_measurement_3          0.130788   0.130791    0.13079    0.130823
3   sensor_measurement_8         14.639063  14.638996  14.638884   14.638287
4  sensor_measurement_11          9.528193   9.528417   9.528321    9.530598
5  sensor_measurement_13         13.264436  13.264443  13.264256    13.26444
6  sensor_measurement_15         32.256267  32.252292  32.254466    32.21369
7  sensor_measurement_17          0.567559   0.567574   0.567569    0.567726


In [12]:
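# coefficients of determination
# note: the 'Train' row holds R^2 on the validation split (X_val, y_val),
# because the r2_*_train_score variables are computed on validation data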
r2_values = {
    'LinearRegression': [r2_train_score, r2_test_score],
    'Ridge': [r2_ridge_train_score, r2_ridge_test_score],
    'Lasso': [r2_lasso_train_score, r2_lasso_test_score],
    'ElasticNet': [r2_elastic_train_score, r2_elastic_test_score]
}
r2_df = pd.DataFrame(r2_values, index=['Train', 'Test'])

r2_df

       LinearRegression     Ridge     Lasso  ElasticNet
Train          0.756445  0.756445  0.756445    0.756445
Test           0.578755  0.578755  0.578755    0.578756
