In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Goals with this Kernel
* Learn what linear models are, and their purpose.
* Use them and learn the pros and cons of each.
* Practice basic data visualization.

# What is a Linear Model?

* In algebra, we are introduced to the most basic linear equation: $y = mx + b$. Here there is a direct relationship between $y$ and $x$: when $x$ changes, $y$ changes in proportion, scaled by the coefficient (slope) $m$ and shifted by the "bias"/intercept $b$.
* In higher dimensions, we meet a new term, the **Linear Combination**. It is very similar to the linear equation above, except that it applies to more than one variable.
* Linear Combination $\rightarrow$ $y = \alpha_1x_1 + \alpha_2x_2 + \dots + \alpha_nx_n$, where $\alpha_i \in \mathbb{R}$, $n$ is the number of dimensions, and $x_i$ is the value along the $i$-th dimension.
* In regression, since we are predicting a target value, Linear Models are models where we expect the target value to be a linear combination of the features:
$$\hat{y}(w,x) = w_0 + w_1x_1 + ... + w_px_p$$
Where $w_j$ is the weight coefficient, $x_j$ is the feature value, and $p$ is the number of features. Note that here $x_0 = 1$, so $w_0$ acts as a constant term.
Keep in mind that scikit-learn stores $w = (w_1,w_2,...,w_p)$ as [coef_] and $w_0$, which is known as the bias, as [intercept_].
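
As a minimal sketch of how these two attributes look in practice, the cell below fits [LinearRegression] to a small made-up dataset generated from known weights, then rebuilds the prediction by hand from `intercept_` and `coef_`. The array `X` and the weights `true_w0` and `true_w` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data (assumption): 5 samples, p = 2 features
X = np.array([[0., 1.],
              [1., 0.],
              [2., 3.],
              [3., 1.],
              [4., 5.]])
true_w0 = 2.0                   # hypothetical bias w_0
true_w = np.array([0.5, -1.0])  # hypothetical weights (w_1, w_2)
y = true_w0 + X @ true_w        # noiseless targets built from the linear combination

model = LinearRegression().fit(X, y)

# The prediction is the intercept plus the linear combination of the features
manual_pred = model.intercept_ + X @ model.coef_

print("coef_      :", model.coef_)
print("intercept_ :", model.intercept_)
print("manual prediction matches predict():", np.allclose(manual_pred, model.predict(X)))
```

Because the targets contain no noise, the fitted values should recover the weights used to generate them, up to floating-point error.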

## - Example of a problem:
Let's say that we have a dataframe with only 1 feature. This yields:
$$\hat{y}(w,x) = w_0 + w_1x_1$$
Now let's say that: $$w_0 = 2, w_1 = 0.5$$
This gives us the line/model: $$\hat{y} = (0.5)x_1 + 2 $$
Which is drawn as:

In [None]:
w0 = 2. ; w1 = 0.5
weights = np.array([w0,w1])
x1 = np.arange(3)
yhat = w0 + w1*x1
plt.plot(x1,yhat)
plt.ylabel("Predicted value (yhat)")
plt.xlabel("x1")
plt.ylim((0,4))
plt.show()

Now let's say that we have a training example whose feature value is $x_1 = 0.75$ and whose true value is $y = 3.0$.

According to our model/estimator, we predict $\hat{y} = 0.5 \cdot 0.75 + 2 = 2.375$, so our model is off by $3 - 2.375 = 0.625$.

In [None]:
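example_x1 = 0.75 ; example_Value = 3.
# Build the one-row DataFrame from the variables above instead of repeating the literals
example = pd.DataFrame({"x1": [example_x1], "Value": [example_Value]})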
example

Here is our single example in a DataFrame which we will now plot.

In [None]:
plt.plot(x1,yhat)
plt.scatter(example.loc[:,"x1"],example.loc[:,"Value"])
plt.ylabel("y")
plt.xlabel("x1")
plt.ylim((0,4))
plt.show()
#I plan to make a drawing from the point to the line indicating that this is the distance.
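
The drawing mentioned in the comment above could look something like the following sketch. It redefines the weights and the example point so that it runs on its own. The dashed segment and the names `point_x`, `point_y` and `pred_at_point` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

w0, w1 = 2.0, 0.5                 # weights of the example model
x_line = np.arange(3)             # x values used to draw the line
point_x, point_y = 0.75, 3.0      # the single training example

pred_at_point = w0 + w1 * point_x  # model prediction at x1 = 0.75
residual = point_y - pred_at_point # vertical distance from the line to the point

fig, ax = plt.subplots()
ax.plot(x_line, w0 + w1 * x_line, label="model")
ax.scatter([point_x], [point_y], label="true value")
# Dashed vertical segment from the prediction up to the true value
ax.plot([point_x, point_x], [pred_at_point, point_y],
        linestyle="--", color="red", label=f"error = {residual:.3f}")
ax.set_xlabel("x1")
ax.set_ylabel("y")
ax.set_ylim(0, 4)
ax.legend()
plt.show()
```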

We can see that there is a bit of distance between our model and the true value. This distance is the error. From a human standpoint, we could fix this either by increasing the bias/$w_0$ to $2.625$:
$$\hat{y} = 0.5 \cdot 0.75 + 2.625 = 3$$
Or by increasing the slope/$w_1$ to $\frac{4}{3}$:
$$\hat{y} = \frac{4}{3} \cdot 0.75 + 2 = 3$$
Each Linear Model covered here treats error differently and adjusts its line according to that error.
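
The two manual corrections can be checked with a few lines of arithmetic. This is only a sketch: the helper `predict` and the variable names are hypothetical and exist just for this check.

```python
def predict(w0, w1, x1):
    """One-feature linear model: y_hat = w0 + w1 * x1."""
    return w0 + w1 * x1

x1_example, y_true = 0.75, 3.0

# Error of the original model (w0 = 2, w1 = 0.5) on the example
error = y_true - predict(2.0, 0.5, x1_example)

# Option 1: absorb the whole error into the bias
new_w0 = 2.0 + error
# Option 2: absorb the whole error into the slope
new_w1 = 0.5 + error / x1_example

print("error:", error)
print("new bias :", new_w0, "->", predict(new_w0, 0.5, x1_example))
print("new slope:", new_w1, "->", predict(2.0, new_w1, x1_example))
```

With a single point, either change removes the error completely. With several points there is usually no line that passes through all of them, which is why the error has to be summarized and minimized instead.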

# Linear Regression: Ordinary Least Squares

First we must know what the **Sum of Squares** is.
* The Sum of Squares is obtained by taking the distance of every data point from the mean of the data, squaring those distances, then summing the squared distances:
  $$SS = \sum_i{(y_i - \overline{y})^2}$$
  Where $y_i$ represents the $i$-th data point in the set, and $\overline{y}$ represents the mean of the data.
* Sum of Squares is also known as variation.
* Also note that the Sum of Squares has nothing to do with a Model/Line.
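
Because the Sum of Squares measures variation, it is closely tied to the variance: dividing it by the number of points gives the population variance, which is what `np.var` computes by default. A small sketch, using the same values as the example that follows:

```python
import numpy as np

values = np.array([0., 1., 4., 3., 8.])

# Sum of Squares: squared distances from the mean, added up
ss = np.sum((values - values.mean()) ** 2)

# Dividing by the number of points gives the population variance
pop_var = ss / values.size

print("Sum of Squares      :", ss)
print("SS / n              :", pop_var)
print("np.var (population) :", np.var(values))
```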

In [None]:
#Example of Sum of Squares
data = pd.DataFrame({"feature":[0,3,5,7,10],"Value": [0,1,4,3,8]})
data

In [None]:
data_mean = data.Value.mean() #Here we have the mean, (0+1+4+3+8)/5 = 3.2
data_distance = data.Value.values - data_mean #Here is the (y_i - mean)
data_dist_squared = np.power(data_distance,2) #Here is the (y_i - mean)^2
Sum_of_Squares = np.sum(data_dist_squared) #Here is the addition of all squared distances
print("Mean of data is: ", data_mean)
print("Distance of data from mean is: ", data_distance)
print("Squared distance of data from mean is: ", data_dist_squared)
print("Sum of Squares = ", np.around(Sum_of_Squares,2))

Notice that none of that had anything to do with our model. The distance we are calculating in the Sum of Squares is the vertical distance from each point to the mean line:

In [None]:
x1 = data.feature
# Horizontal line at the mean of Value; a separate name keeps the scalar data_mean intact
mean_line = np.full(x1.size, data.Value.mean())
plt.plot(x1, mean_line)
plt.scatter(data.feature, data.Value)
plt.show()

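* Why sum of squares? Why do we square the distance? Why not just take the absolute value to find the distance?
    1. Squaring emphasizes data points that are further from the mean.
    2. The algebra is easier: the squared distance is differentiable everywhere, while the absolute value is not differentiable at 0.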
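
To make the first point concrete, the sketch below compares how much each data point contributes to the total when absolute distances are used versus squared distances. It reuses the example values. The "share" arrays are just an illustrative way of looking at the emphasis, not a standard statistic.

```python
import numpy as np

values = np.array([0., 1., 4., 3., 8.])
deviations = values - values.mean()

abs_dev = np.abs(deviations)  # absolute distances from the mean
sq_dev = deviations ** 2      # squared distances from the mean

# Fraction of the total that each point is responsible for
abs_share = abs_dev / abs_dev.sum()
sq_share = sq_dev / sq_dev.sum()

furthest = int(np.argmax(abs_dev))
print("Furthest point:", values[furthest])
print("Share of total absolute distance:", np.around(abs_share[furthest], 3))
print("Share of total squared distance :", np.around(sq_share[furthest], 3))
```

The furthest point takes up a larger share of the total once the distances are squared, which is what "emphasizing" far-away points means here.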

Next, we need to know what the **Residual Sum of Squares** is.
* The difference between the plain and the Residual Sum of Squares is that the residual version takes a model and quantifies how much that model is off by. Or, in words found in documentation and other sources, "how much of the dependent variable’s variation your model did not explain."
* Residual Sum of Squares:
$$RSS = \sum_i{(y_i - \hat{y}_i)^2}$$

In [None]:
yhat = w0 + w1*x1
print(data)
plt.plot(x1,yhat)
plt.scatter(data.feature ,data.Value)
plt.xlabel("feature")
plt.ylabel("Value")
plt.show()

In [None]:
#Finding the Residual Sum of Squares of our model and data
model_distance = (data.Value - yhat).values #(y - predicted_y)
model_distance_squared = np.power(model_distance,2) #(y - predicted_y)^2
Residual_Sum_of_Squares = np.sum(model_distance_squared)
print("Distance of data from prediction: ", model_distance)
print("Squared Distance of data from prediction: ", model_distance_squared)
print("Residual Sum of Squares = ", Residual_Sum_of_Squares)

We now know the Residual Sum of Squares of our model. To get the best Linear Model, the goal is to find the weight coefficients that minimize this Residual Sum of Squares. How will we do this?

Linear Regression/Ordinary Least Squares.

The [LinearRegression] estimator from scikit-learn fits a linear model by minimizing the residual sum of squares between the true (observed) values and the predicted values. In other words: take the distance from each predicted value to its true value, square those distances, and add them up. Then adjust the weights until that sum is as small as possible.
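
A minimal sketch of that idea on the example data: the hand-picked line ($w_0 = 2$, $w_1 = 0.5$) is compared with the line that [LinearRegression] finds. The helper `rss` is a hypothetical convenience function, and the arrays repeat the example data so the cell runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

feature = np.array([0., 3., 5., 7., 10.])
value = np.array([0., 1., 4., 3., 8.])

def rss(y_true, y_pred):
    """Residual Sum of Squares between true and predicted values."""
    return float(np.sum((y_true - y_pred) ** 2))

# Hand-picked model from earlier: y_hat = 2 + 0.5 * x
rss_hand = rss(value, 2.0 + 0.5 * feature)

# Ordinary Least Squares fit (scikit-learn expects a 2-D feature array)
X = feature.reshape(-1, 1)
ols = LinearRegression().fit(X, value)
rss_ols = rss(value, ols.predict(X))

print("Hand-picked weights: w0 = 2.0, w1 = 0.5, RSS =", rss_hand)
print("OLS weights        : w0 =", np.around(ols.intercept_, 4),
      ", w1 =", np.around(ols.coef_[0], 4), ", RSS =", np.around(rss_ols, 4))
```

The fitted line should never have a larger Residual Sum of Squares on this data than a hand-picked one, since minimizing that quantity is exactly what the estimator does.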

According to Scikit-Learn, the [LinearRegression] estimator solves the mathematical problem:
$$\min_w{\Vert{Xw} - y\Vert^2_2}$$
Here, $X = (x_1,x_2,...,x_p)$ is the matrix of feature values, $w = (w_1,...,w_p)$ are the weights, and $y$ holds the true values. Don't forget that $Xw$ is the prediction, $\hat{y}$.
* The solution is found by using the **Singular Value Decomposition** of $X$.
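
To give a feel for how a Singular Value Decomposition can produce the least-squares weights, here is a NumPy-only sketch on the example data. It is an illustration under stated assumptions, not scikit-learn's actual code: the intercept is handled here by adding a column of ones to $X$, and scikit-learn's own implementation may differ in such details.

```python
import numpy as np

feature = np.array([0., 3., 5., 7., 10.])
value = np.array([0., 1., 4., 3., 8.])

# Design matrix with a leading column of ones, so w[0] plays the role of the intercept
X = np.column_stack([np.ones_like(feature), feature])

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Least-squares weights: w = V @ diag(1/s) @ U^T @ y
w_svd = Vt.T @ ((U.T @ value) / s)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, value, rcond=None)

print("Weights from SVD   :", np.around(w_svd, 4))
print("Weights from lstsq :", np.around(w_lstsq, 4))
```

Dividing by the singular values `s` is the step that becomes unstable when some of them are close to zero, which is worth keeping in mind for the caveats below.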

## Linear Regression Caveats

1. Linear Regression relies on the features being independent of one another. The more correlated they are, the closer the "design matrix"/dataframe $X$ comes to being singular: the determinant of $X^TX$ approaches 0, so its inverse becomes unstable and the estimated weights become very sensitive to small errors in the observed values.
2. The OLS estimator is accurate when the regressors (data) are 