In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from random import shuffle
from numpy import rint

**Part A: Fit a Linear Regression model to the diabetes dataset**

In [2]:
diabetes_data = load_diabetes()
diabetes_features = diabetes_data.data
diabetes_targets = diabetes_data.target

In [3]:
regression_obj = LinearRegression().fit(diabetes_features, diabetes_targets)

In [4]:
predictions = regression_obj.predict(diabetes_features)

In [5]:
r_squared = regression_obj.score(diabetes_features, diabetes_targets)
r_squared

0.5177494254132934

In [6]:
# count the predictions that equal their target exactly (a boolean mask sums to a count)
num_exact_predictions = sum(predictions == diabetes_targets)
proportion_exact = num_exact_predictions / len(predictions)
proportion_exact

0.0

None of our predictions were exactly correct, whereas the r^2 value was about 0.52. This implies that while the regression model does not produce exact predictions, it still explains roughly half of the variance in the sample targets (which we hope also reflects performance on the population). This is expected: the model outputs continuous values chosen to minimize squared error across all points, so we should not expect the fit to pass exactly through many of the points, if any. Expecting one linear fit to hit many data points exactly while also modelling the rest of the data is unrealistic. Thus, we cannot call the model ineffective simply because it matches no target value exactly. It may come very close to some points, but by the nature of how the model is fit, the proportion of exact matches is not a meaningful measure of its quality.
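To make the contrast between the two measures concrete, here is a minimal sketch that recomputes r^2 from its definition (1 minus the residual sum of squares over the total sum of squares) and counts predictions that fall within a tolerance of the target rather than matching it exactly. The tolerance of 5 units and the names `r2_manual`, `num_exact` and `num_close` are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)
pred = model.predict(X)

# r^2 from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# exact matches between continuous predictions and the targets
num_exact = int(np.sum(pred == y))

# matches within an arbitrary tolerance (assumed value, chosen only for illustration)
tolerance = 5.0
num_close = int(np.sum(np.abs(pred - y) <= tolerance))

print(r2_manual, model.score(X, y))
print(num_exact, num_close, len(y))
```

The manual value should agree with `score`, which suggests that r^2 summarizes how far predictions are from the targets on average, while the exact-match count discards all information about how close each prediction came.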

**Part B: Randomly split row indices of instances of the data set**

In [7]:
num_data_samples = len(diabetes_features)
num_data_samples

442

In [8]:
# get a list of indices and randomly shuffle them to generate test/training sets
data_indices = list(range(0, num_data_samples))
shuffle(data_indices)

In [9]:
training_indices = data_indices[0:300]
testing_indices = data_indices[300:]

In [10]:
training_set = diabetes_features[training_indices]
testing_set = diabetes_features[testing_indices]

In [11]:
regression = LinearRegression().fit(training_set, diabetes_targets[training_indices])

In [12]:
r_squared_train = regression.score(training_set, diabetes_targets[training_indices])
r_squared_train

0.5258839489557672

In [13]:
r_squared_test = regression.score(testing_set, diabetes_targets[testing_indices])
r_squared_test

0.47848698142628354

The r^2 value for the training set is about 0.526, compared with a smaller value of about 0.478 for the testing set. The values are not the same, but the small difference suggests that the model performs only slightly worse on unseen data (the testing set) than on the training data. The better fit on the training set should not be surprising, since the model's coefficients were chosen to minimize error on exactly those rows, so we can expect the goodness of fit to be at least a little higher there. Because we selected the indices at random, however, the training and testing sets are both roughly representative samples of the same data, and so we do not see a large gap in goodness of fit between them.
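Since the shuffle above is unseeded, a single split gives only one draw of the two scores. As a rough sketch of how one might check that the gap is typical, the code below repeats the 300/142 split over several seeds. It swaps in scikit-learn's `train_test_split` for the manual index shuffle, and the choice of 20 seeds is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

train_scores, test_scores = [], []
for seed in range(20):  # 20 seeds is an arbitrary choice for this sketch
    # 300 training rows; the remaining 142 rows form the testing set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=300, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

print("mean train r^2:", np.mean(train_scores))
print("mean test r^2: ", np.mean(test_scores))
print("test r^2 spread (std):", np.std(test_scores))
```

Fixing `random_state` also makes each split reproducible, which the unseeded `shuffle` call does not.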