### Quantifying Regression

The focus here is to find a way to quantify the error in a model.

In [12]:
# Dependencies
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

In [13]:
# Generate some random data suitable for a linear regression model

X, y = make_regression(n_samples=20, n_features=1, random_state=0, noise=4, bias=100.0)
print(X.shape)
print(y.shape)

(20, 1)
(20,)


In [14]:
# create the model and fit it
model = LinearRegression()
model.fit(X, y)

LinearRegression()

### Quantifying our Model: Finding our 'Loss'

* Mean Squared Error (MSE)
* R2 Score

In [15]:
# We will quantify how well our model does using r2 and MSE
from sklearn.metrics import mean_squared_error, r2_score

# first use a model to predict
predicted = model.predict(X)

# Score the prediction with MSE and R2
mse = mean_squared_error(y, predicted)
r2 = r2_score(y, predicted)

print("Mean Squared Error: {}".format(mse))
print("R-Squared: {}".format(r2))

Mean Squared Error: 11.933040779746149
R-Squared: 0.903603363418708


### What is Mean Squared Error (technical)

NOTE: In "(y, predicted)", y holds our actual (provided) values and predicted is the result of model.predict(X).

Mean Squared Error - subtract each predicted value from the corresponding actual y value, square each difference, then find the mean of all those squares. The result is the average squared difference between the actual and predicted values.
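The calculation described above can be sketched in plain Python. This is a minimal illustration, not Sklearn's implementation; the function name `mean_squared_error_manual` and the three sample values are made up for the example.

```python
def mean_squared_error_manual(actual, predicted):
    """Return the mean of the squared differences between two sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Subtract, then square, each actual/predicted pair
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Average the squared differences
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values, chosen so the arithmetic is easy to follow
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Squared differences are 1, 0 and 4, so the result is 5 / 3
mse_example = mean_squared_error_manual(actual, predicted)
print("Mean Squared Error: {}".format(mse_example))
```

Because the differences are squared, large misses are penalized more heavily than small ones, and the result is in squared units of y.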

### What is R Squared?

R2 - the proportion of the variance in the actual values that the regression captures. A model that always predicts the mean of y scores 0, and a perfect fit scores 1.
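One common way to write R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below works through that formula in plain Python; the function name `r2_score_manual` and the sample values are assumptions made for illustration.

```python
def r2_score_manual(actual, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    # Variation the model failed to capture
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the actual values around their mean
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    if ss_total == 0:
        raise ValueError("R squared is undefined when all actual values are equal")
    return 1 - ss_residual / ss_total


# Hypothetical values, chosen so the arithmetic is easy to follow
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# ss_residual = 1 + 0 + 4 = 5 and ss_total = 4 + 0 + 4 = 8
r2_example = r2_score_manual(actual, predicted)
print("R-Squared: {}".format(r2_example))
```

Note that R squared can be negative when a model does worse than simply predicting the mean.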

### Bottom Line:

A "good" MSE score will be close to zero while a "good" R squared score will be close to 1.

R squared is the default scoring for the Sklearn regression models.

The "best" way to test your model is to input your own made-up values and check how accurate the predictions are, that is, how big of an error you receive.

In [16]:
# Continue...

# model.score gives R2
model.score(X, y)

0.903603363418708

## Validation

How well does the model perform on new data?  
One approach is to split the data into training data and test data.  
Train (fit) the model using the training data, then score and validate the model using the testing data.  
Sklearn provides a mechanism for doing this.

## Testing and Training Data
To quantify how our model performs on new data, we often split the data into training and testing sets. The model is then fit to the training data and scored on the testing data. Use Sklearn to split the data into training and testing sets.

In [17]:
# Dependency to split test and training data 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [18]:
print(X.shape)
print(X_train.shape, y_train.shape)

(20, 1)
(15, 1) (15,)


In [19]:
print(X_train) # 2d array "array of arrays"
print(y_train) # 1d array

[[ 0.3130677 ]
 [ 0.12167502]
 [-0.85409574]
 [ 0.44386323]
 [ 0.76103773]
 [ 1.86755799]
 [ 0.97873798]
 [ 1.49407907]
 [ 1.76405235]
 [-0.97727788]
 [ 1.45427351]
 [-0.20515826]
 [ 0.95008842]
 [ 0.14404357]
 [-0.10321885]]
[100.14472604 107.32614044  90.31520078 113.51286154 106.59661396
 125.80345967 107.77654399 122.11966081 125.42202601  92.04796546
 121.44454917  95.20896669 112.28760019 104.3306721  104.37128562]


In [20]:
# Then fit (train) the model using the training data X_train y_train
model.fit(X_train, y_train)

LinearRegression()

In [21]:
# Then score the data using testing data X_test y_test
model.score(X_test, y_test)

0.9252522435044104

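In other words, this model achieved an R squared score of about 0.925 on the test data, meaning it explains roughly 92.5% of the variance in the test values. R squared is not a classification-style accuracy.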
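Since `model.score` only reports R squared, it can help to look at MSE on the test set as well, and to compare the training and test scores side by side. The sketch below repeats the steps from this notebook in one self-contained cell; the variable names `train_r2`, `test_r2` and `test_mse` are introduced here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same data and split settings as used earlier in this notebook
X, y = make_regression(n_samples=20, n_features=1, random_state=0, noise=4, bias=100.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Score on data the model has seen (train) and has not seen (test)
train_r2 = model.score(X_train, y_train)
test_predicted = model.predict(X_test)
test_r2 = r2_score(y_test, test_predicted)
test_mse = mean_squared_error(y_test, test_predicted)

print("Train R-Squared: {}".format(train_r2))
print("Test R-Squared: {}".format(test_r2))
print("Test Mean Squared Error: {}".format(test_mse))
```

A training score far above the test score would suggest overfitting. With only 5 test samples, the test score here can shift noticeably with a different `random_state`.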