### Review of Univariate Linear Modeling: Goals, Assumptions, and Hypothesis Testing
When fitting a linear model to data, researchers typically have one or both of the following goals in mind:
1. Predict future or unseen values for the target variable (y). This form of modeling is typically referred to as *predictive modeling*.
2. Infer if the trend observed in the model is statistically significant (i.e., the dependent variable 

#### Linear model equation
Y = mX + B

#### Hypotheses in linear modeling
* H_0 (Null hypothesis): m = 0 (i.e., slope is flat)
* H_A (Alternative hypothesis): m != 0 (i.e., slope is not completely flat)
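
As a minimal sketch of how these hypotheses can be tested, the cell below fits Y = mX + B to synthetic data and reads off the p-value for H_0: m = 0. The data, the seed, and the 0.05 threshold are assumptions chosen for illustration; `scipy.stats.linregress` is used because it reports the slope's two-sided p-value directly.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): true slope 2.0, intercept 1.0
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# linregress fits y = m*x + B and tests H_0: m = 0 with a two-sided t-test
fit = stats.linregress(x, y)
print(f"m = {fit.slope:.3f}, B = {fit.intercept:.3f}, p-value = {fit.pvalue:.3g}")

# Conventional significance level, chosen here only for illustration
alpha = 0.05
reject_null = fit.pvalue < alpha
print("Reject H_0" if reject_null else "Fail to reject H_0")
```

A small p-value only supports rejecting H_0 if the assumptions listed in the next section are reasonably met.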

#### The 4 Assumptions for Linear Regression Hypothesis Testing
1. Linearity: There is a linear relationship between Y and X. **Check: plot actual vs. predicted values.**
    a. **Note**: In practice, linear models are often used to model nonlinear relationships because nonlinear models are more complex (more model parameters/coefficients need to be estimated). When using a linear model to model a nonlinear relationship, it is usually best to use the resulting model for predictive purposes only.
2. Normality: The error terms (residuals) are normally distributed. **Check: plot the residual distribution and calculate its mean (should be near 0).**
    - calculate residuals and show their distribution
    - build an ad hoc plot to test normality using a qq-plot
    - Shapiro-Wilk Test
3. Homoscedasticity: The variance of the error terms is constant over all X values. **Graphical method: first fit the regression, then plot the residuals against the predicted values (ŷ_i). If the scatter plot shows a definite pattern (such as linear, quadratic, or funnel shaped), heteroscedasticity is present.**
4. Independence: The error terms are independent
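
The checks above can be sketched numerically as follows. This is a rough illustration on synthetic data, not a complete diagnostic suite. The Spearman correlation between absolute residuals and predictions is an informal stand-in for the graphical homoscedasticity check, and the Durbin-Watson statistic (computed by hand here) is one common way to probe independence. Both are choices made for this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative assumption): linear trend, constant-variance noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

# Fit Y = mX + B by least squares and compute residuals (actual - predicted)
m, b = np.polyfit(x, y, deg=1)
predicted = m * x + b
residuals = y - predicted

# Normality: Shapiro-Wilk (H_0: residuals come from a normal distribution)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homoscedasticity (rough heuristic): does the residual spread trend with predictions?
spread_corr, spread_p = stats.spearmanr(np.abs(residuals), predicted)

# Independence: Durbin-Watson statistic; values near 2 suggest
# no first-order autocorrelation in the residuals
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

print(f"Residual mean:        {residuals.mean():.2e} (should be near 0)")
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"|resid| vs predicted: rho = {spread_corr:.3f}, p = {spread_p:.3f}")
print(f"Durbin-Watson:        {durbin_watson:.3f}")
```

Plots (a residual histogram, a qq-plot, residuals vs. predicted) remain the primary tools. The numbers here are only a quick complement.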

### Links
* https://www.kaggle.com/code/shrutimechlearn/step-by-step-assumptions-linear-regression

In [2]:
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()

"""
Artificial linear data using the same number of features and observations as the
Boston housing prices dataset for assumption test comparison
"""
linear_X, linear_y = datasets.make_regression(n_samples=boston.data.shape[0],
                                              n_features=boston.data.shape[1],
                                              noise=75, random_state=46)

# Setting feature names to x1, x2, x3, etc. if they are not defined
linear_feature_names = ['X'+str(feature+1) for feature in range(linear_X.shape[1])]
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['HousePrice'] = boston.target

df.head()

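|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | HousePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |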


In [3]:
print(type(boston.data))
print(type(boston.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [4]:
from sklearn.linear_model import LinearRegression

# Fitting the model
boston_model = LinearRegression()
boston_model.fit(boston.data, boston.target)

# Returning the R^2 for the model
boston_r2 = boston_model.score(boston.data, boston.target)
print('R^2: {0}'.format(boston_r2))

R^2: 0.7406426641094095


In [5]:
# below function adapted from: https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/
def calculate_residuals(model, features, label):
    """
    Creates predictions on the features with the model and calculates residuals
    """
    predictions = model.predict(features)
    df_results = pd.DataFrame({'Actual': label, 'Predicted': predictions})
    # Residual = actual - predicted (signed, so over- and under-predictions are distinguishable)
    df_results['Residuals'] = df_results['Actual'] - df_results['Predicted']
    
    return df_results

### Linearity (from: https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/)
This assumes that there is a linear relationship between the predictors (i.e., independent variables or features) and the response variable (i.e., dependent variable or label). This also assumes that the predictors are additive.

Why it can happen: There may simply not be a linear relationship in the data. Modeling is about trying to estimate a function that explains a process, and linear regression would not be a fitting estimator (pun intended) if there is no linear relationship.

What it will affect: The predictions can be badly inaccurate because the model is underfitting. This is a serious violation that should not be ignored.

**How to detect it**: Plots! If there is only one predictor, this is pretty easy to test with a scatter plot. Most cases aren’t so simple, so we’ll have to modify this by using a scatter plot to see our predicted values versus the actual values (in other words, view the residuals). Ideally, the points should lie on or around a diagonal line on the scatter plot.

How to fix it: Either add polynomial terms to some of the predictors or apply nonlinear transformations. If those do not work, try adding additional variables to help capture the relationship between the predictors and the label.
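
A minimal sketch of the polynomial-term fix, assuming synthetic data with a known quadratic component (the data, seed, and variable names are illustrative and not taken from the Boston example). It compares the in-sample R^2 of a straight-line fit against a fit that also includes a squared term.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption): the target depends on both x and x^2
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=200)

# Plain linear model: cannot capture the curvature
straight_line = LinearRegression().fit(X, y)

# Same estimator, but the features are expanded to [x, x^2] first
with_square = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_line = straight_line.score(X, y)
r2_poly = with_square.score(X, y)
print(f"R^2 without polynomial term: {r2_line:.3f}")
print(f"R^2 with polynomial term:    {r2_poly:.3f}")
```

The model is still linear in its coefficients, so the same fitting machinery applies. Only the feature set changes.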


#### Option 1 - using sklearn model

In [None]:
# below function adapted from: https://jeffmacaluso.github.io/post/LinearRegressionAssumptions/
import seaborn as sns
def linear_assumption(model, features, label):
    """
    Linearity: Assumes that there is a linear relationship between the predictors and
               the response variable. If not, either a quadratic term or another
               algorithm should be used.
    """
    print('Assumption 1: Linear Relationship between the Target and the Feature', '\n')
        
    print('Checking with a scatter plot of actual vs. predicted.',
           'Predictions should follow the diagonal line.')
    
    # Calculating residuals for the plot
    df_results = calculate_residuals(model, features, label)
    
    # Plotting the actual vs predicted values
    # Note: seaborn renamed the `size` argument to `height` in version 0.9
    sns.lmplot(x='Actual', y='Predicted', data=df_results, fit_reg=False, height=7)
        
    # Plotting the diagonal line
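    # Use only the Actual/Predicted columns so the residuals do not stretch the diagonal
    actual_predicted = df_results[['Actual', 'Predicted']]
    line_coords = np.arange(actual_predicted.min().min(), actual_predicted.max().max())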
    plt.plot(line_coords, line_coords,  # X and y points
             color='darkorange', linestyle='--')
    plt.title('Actual vs. Predicted')
    plt.show()

#### Option 2 - using statsmodels package

### QQ plots - model residuals

In [None]:
import statsmodels.api as sm  # model_result below must be a fitted statsmodels OLS result
sm.qqplot(model_result.resid, line='s');
