# Loading Data

We start by loading the data. Rather than reading it from a file, we will use one of the datasets that ships with scikit-learn.

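For this example, we will use the diabetes dataset. Its inputs are health metrics such as age, BMI, blood pressure, and blood serum measurements (including a blood sugar level), and its label is a quantitative measure of disease progression one year after baseline.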

Note that the `sklearn` datasets come pre-separated, meaning the inputs are stored separately from the labels. The input data is in the `.data` field and the labels are in the `.target` field. We save both in new variables.

In [1]:
from sklearn import datasets
diabetes = datasets.load_diabetes()

data = diabetes.data
target = diabetes.target
features = diabetes.feature_names
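As an aside, `load_diabetes` also accepts a `return_X_y=True` argument, which returns the inputs and labels directly as a tuple. A minimal sketch of that alternative (the variable names `X` and `y` are our own choice here, not something the rest of this notebook uses):

```python
from sklearn import datasets

# return_X_y=True returns (data, target) as a tuple instead of a Bunch object
X, y = datasets.load_diabetes(return_X_y=True)

print(X.shape)  # (442, 10)
print(y.shape)  # (442,)
```

This form is handy when you do not need the feature names or the dataset description.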

In [2]:
# Check the shape of the input data.
# 442 rows, 10 columns

data.shape

(442, 10)

In [3]:
# Check the shape of the target
# 442 rows (no columns, just an array)

target.shape

(442,)

In [4]:
# Names of the columns

features

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [5]:
target

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

# Train Model

To train a model, we will use scikit-learn's `LinearRegression` model. You can see the documentation for `LinearRegression` [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

We first import the class, then create an instance of it, and then call the `fit` method, passing in the training data and training labels to train the model.

In [6]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(data, target)

From the documentation, you can see that the model has many other methods and attributes. For example, to view the learned coefficients (one per input feature), access the `.coef_` attribute.

In [7]:
model.coef_

array([ -10.0098663 , -239.81564367,  519.84592005,  324.3846455 ,
       -792.17563855,  476.73902101,  101.04326794,  177.06323767,
        751.27369956,   67.62669218])
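The raw array is easier to read when each coefficient is paired with its feature name. The following is a small sketch of one way to do that; it refits the model so the block runs on its own, and the print formatting is just an illustrative choice:

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()
model = LinearRegression().fit(diabetes.data, diabetes.target)

# Pair each feature name with its learned coefficient
for name, coef in zip(diabetes.feature_names, model.coef_):
    print(f"{name:>4}: {coef:10.2f}")

# The intercept is stored separately from the coefficients
print("intercept:", model.intercept_)
```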

You can also use the `predict` method to make predictions. Here we predict on the same data the model was trained on.

In [8]:
predictions = model.predict(data)
predictions

array([206.11667725,  68.07103297, 176.88279035, 166.91445843,
       128.46225834, 106.35191443,  73.89134662, 118.85423042,
       158.80889721, 213.58462442,  97.07481511,  95.10108423,
       115.06915952, 164.67656842, 103.07814257, 177.17487964,
       211.7570922 , 182.84134823, 148.00326937, 124.01754066,
       120.33362197,  85.80068961, 113.1134589 , 252.45225837,
       165.48779206, 147.71997564,  97.12871541, 179.09358468,
       129.05345958, 184.7811403 , 158.71516713,  69.47575778,
       261.50385365, 112.82234716,  78.37318279,  87.66360785,
       207.92114668, 157.87641942, 240.84708073, 136.93257456,
       153.48044608,  74.15426666, 145.62742227,  77.82978811,
       221.07832768, 125.21957584, 142.6029986 , 109.49562511,
        73.14181818, 189.87117754, 157.9350104 , 169.55699526,
       134.1851441 , 157.72539008, 139.11104979,  72.73116856,
       207.82676612,  80.11171342, 104.08335958, 134.57871054,
       114.23552012, 180.67628279,  61.12935368,  98.72
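To see what `predict` does for a linear model, you can reproduce it by hand: each prediction is a weighted sum of the input features plus the intercept. A minimal sketch, refitting the model so the block is self-contained (the name `manual` is ours):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()
model = LinearRegression().fit(diabetes.data, diabetes.target)

# A linear model predicts with a weighted sum of the features plus an intercept
manual = diabetes.data @ model.coef_ + model.intercept_

# Should agree with model.predict up to floating-point error
print(np.allclose(manual, model.predict(diabetes.data)))
```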

You can compute the error of these predictions. You could write code to compute the mean squared error by hand, but scikit-learn provides a function for it. Because the predictions were made on the training data, the result is the training error.

In [9]:
from sklearn.metrics import mean_squared_error

# The documented argument order is (y_true, y_pred)
mse = mean_squared_error(target, predictions)
mse

2859.6963475867506
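For comparison, here is a sketch of the manual computation mentioned above: the mean squared error is the average of the squared differences between labels and predictions. The block refits the model so it runs on its own, and `manual_mse` is a name we introduce for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
model = LinearRegression().fit(diabetes.data, diabetes.target)
predictions = model.predict(diabetes.data)

# Mean squared error: average of the squared differences
manual_mse = np.mean((diabetes.target - predictions) ** 2)

print(manual_mse)
# Should match scikit-learn's result up to floating-point error
print(np.isclose(manual_mse, mean_squared_error(diabetes.target, predictions)))
```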