###  Created by Luis A. Sanchez-Perez (alejand@umich.edu)

Performs a linear regression on the bodyfat dataset, then uses the weights of the fitted linear model to estimate the importance of each feature. Three linear regression models are fit:
* Features without normalization
* Features with standardization
* Features normalized to the 0-1 range

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [2]:
# Read dataset
dataset = pd.read_csv('../../datasets/regression/bodyfat.csv')
predictors = dataset.iloc[:,:-1].values
responses = dataset.iloc[:,-1].values
dataset.columns

Index(['Age (Years)', 'Weight (lbs)', 'Height (inches)',
       'Neck circumference (cm)', 'Chest circumference (cm)',
       'Abdomen 2 circumference (cm)', 'Hip circumference (cm)',
       'Thigh circumference (cm)', 'Knee circunference (cm)',
       'Ankle circunference (cm)', 'Biceps (extended) circunference (cm)',
       'Forearm circunference (cm)', 'Wrist circunference (cm)', 'Bodyfat %'],
      dtype='object')

### No standardization
Trains and evaluates the model without any feature scaling

In [3]:
# Splits into training/test sets
X_train,X_test,y_train,y_test = train_test_split(predictors,responses,test_size = 0.2,random_state = 0)

# Train and test model
mdl = LinearRegression()
mdl.fit(X_train,y_train)
y_pred = mdl.predict(X_test)
mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

print('MSE:', mse)
print('R2 Score:', r2)
weights = mdl.coef_
print('\nWeights: ', weights)
index = np.absolute(weights).argmax()
print('\nMost important feature is {} with index {} and value {}'.format(dataset.columns[index], index, weights[index]))
index = np.absolute(weights).argmin()
print('Least important feature is {} with index {} and value {}'.format(dataset.columns[index], index, weights[index]))

MSE: 15.91048910485444
R2 Score: 0.7800078200104427

Weights:  [ 0.03364413 -0.07900737 -0.09520786 -0.47551937  0.00447163  0.97205817
 -0.24588183  0.18656474 -0.03560916  0.09614109  0.1657716   0.44578392
 -1.66684647]

Most important feature is Wrist circunference (cm) with index 12 and value -1.6668464689945888
Least important feature is Chest circumference (cm) with index 4 and value 0.004471630041393969


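The previous ranking is misleading because the features were not scaled. Each raw weight is expressed in units of body fat percentage per unit of its own feature, so weights for features measured in different units and ranges (years, lbs, inches, cm) cannot be compared by magnitude alone.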
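To illustrate why raw weights are not comparable, the following is a minimal sketch on synthetic data (not the bodyfat dataset). The data, the names `raw`, `std_mdl`, and `rescaled`, and the chosen scales are all assumptions made for the example. It shows that a weight fitted on standardized features equals the raw weight multiplied by the standard deviation of that feature, so a feature with a small spread can have a large raw weight and still contribute little.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 100.0, size=200),  # feature 0: large spread
    rng.normal(0.0, 0.1, size=200),    # feature 1: small spread
])
y = 0.05 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

# Model fitted on the raw features
raw = LinearRegression().fit(X, y)

# Model fitted on standardized features
scaler = StandardScaler().fit(X)
std_mdl = LinearRegression().fit(scaler.transform(X), y)

# Standardized weight = raw weight * standard deviation of the feature
rescaled = raw.coef_ * scaler.scale_

print('Raw weights:         ', raw.coef_)
print('Standardized weights:', std_mdl.coef_)
print('Raw * feature std:   ', rescaled)
```

In this sketch the raw weights point to feature 1 as the largest, while the standardized weights point to feature 0. The predictions of both models are the same, since rescaling the inputs of an ordinary least squares model with an intercept only rescales its weights. The same relation appears to hold in this notebook: the abdomen weight goes from about 0.97 unscaled to about 10.47 standardized, which would correspond to a training standard deviation of roughly 10.8 cm.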

### Using standardization
Trains and evaluates the model using standardization
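As a reminder of what standardization does, here is a small sketch on a hypothetical toy matrix (the values and the names `manual` and `scaled` are assumptions for illustration). It checks that `StandardScaler` matches subtracting the per-feature mean and dividing by the per-feature standard deviation, using the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 features on different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Manual standardization: z = (x - mean) / std, computed per column
# np.std uses ddof=0 by default, which is what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After the transform, every column has mean 0 and standard deviation 1, so each weight measures the effect of a one standard deviation change in its feature.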

In [4]:
# Splits into training/test sets
X_train,X_test,y_train,y_test = train_test_split(predictors,responses,test_size = 0.2,random_state = 0)

# Scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train and test model
mdl = LinearRegression()
mdl.fit(X_train,y_train)
y_pred = mdl.predict(X_test)
mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

print('MSE:', mse)
print('R2 Score:', r2)
weights = mdl.coef_
print('\nWeights: ', weights)
index = np.absolute(weights).argmax()
print('\nMost important feature is {} with index {} and value {}'.format(dataset.columns[index], index, weights[index]))
index = np.absolute(weights).argmin()
print('Least important feature is {} with index {} and value {}'.format(dataset.columns[index], index, weights[index]))

# Mean of the training responses (body fat %)
y_train.mean()

MSE: 15.910489104854395
R2 Score: 0.7800078200104433

Weights:  [ 0.42593899 -2.34066308 -0.37186612 -1.15335127  0.03736982 10.46628993
 -1.80327295  0.98393734 -0.08616706  0.16936719  0.49720155  0.86488278
 -1.55484743]

Most important feature is Abdomen 2 circumference (cm) with index 5 and value 10.466289933738473
Least important feature is Chest circumference (cm) with index 4 and value 0.03736982190083437


19.15771144278607

### Using 0-1 Normalization
Trains and evaluates the model using 0-1 normalization
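For reference, a short sketch of 0-1 normalization on hypothetical toy arrays (the values and the names `manual_train` and `test_scaled` are assumptions for illustration). It checks that `MinMaxScaler` matches `(x - min) / (max - min)` per feature, and shows that test samples can fall outside the 0-1 range because the minimum and maximum are learned from the training set only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the scaler is fitted on the training set only
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [5.0, 600.0]])
X_test = np.array([[6.0, 100.0]])

sc = MinMaxScaler().fit(X_train)

# Manual 0-1 normalization using the training minimum and maximum
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual_train = (X_train - col_min) / (col_max - col_min)

train_scaled = sc.transform(X_train)
test_scaled = sc.transform(X_test)

print(np.allclose(manual_train, train_scaled))
print(test_scaled)  # values outside [0, 1] are possible for unseen data
```

With this scaling, each weight measures the effect of moving a feature across its full training range, from minimum to maximum.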

In [5]:
# Splits into training/test sets
X_train,X_test,y_train,y_test = train_test_split(predictors,responses,test_size = 0.2,random_state = 0)

# Scaler
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train and test model
mdl = LinearRegression()
mdl.fit(X_train,y_train)
y_pred = mdl.predict(X_test)
mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

print('MSE:', mse)
print('R2 Score:', r2)
weights = mdl.coef_
print('\nWeights: ', weights)
index = np.absolute(weights).argmax()
print('\nMost important feature is {} with index {} and value {}'.format(dataset.columns[index], index, weights[index]))
index = np.absolute(weights).argmin()
print('Least important feature is {} with index {} and value {}'.format(dataset.columns[index], index, weights[index]))

MSE: 15.910489104854467
R2 Score: 0.7800078200104423

Weights:  [  1.98500389 -18.69709511  -4.59377922  -9.55793925   0.23610207
  75.52892015 -15.34302589   7.08945995  -0.5733074    1.36520353
   3.26570052   5.70603414  -9.33434023]

Most important feature is Abdomen 2 circumference (cm) with index 5 and value 75.52892015320883
Least important feature is Chest circumference (cm) with index 4 and value 0.2361020661855693
