# Linear Regression on US Housing Prices

## Linear regression primer

In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression ($L_2$-norm penalty) and lasso ($L_1$-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.
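
To make the difference between a plain least squares fit and a penalized fit concrete, the sketch below solves both in closed form with NumPy. The data, the "true" coefficients, and the penalty strength `alpha` are invented for illustration and have nothing to do with the housing data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                    # two synthetic explanatory variables
true_beta = np.array([3.0, -2.0])              # assumed "true" coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Ordinary least squares: minimize ||y - X b||_2^2
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression: minimize ||y - X b||_2^2 + alpha * ||b||_2^2 (closed-form solution)
alpha = 10.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print("OLS  :", beta_ols)
print("Ridge:", beta_ridge)   # shrunk toward zero relative to the OLS estimate
```

The ridge estimate always has a smaller norm than the OLS estimate for `alpha > 0`, which is the shrinkage effect of the $L_2$ penalty.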

### Import packages and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
df = pd.read_csv("./USA_Housing.csv")
df.head()

### Check basic info on the data set

**Use the 'info()' method to check the data types and the number of non-null entries in each column**

In [None]:
df.info(verbose=True)

**'describe()' method to get the statistical summary of the various features of the data set**

In [None]:
df.describe(percentiles=[0.1,0.25,0.5,0.75,0.9])

**'columns' attribute to get the names of the columns (features)**

In [None]:
df.columns

### Basic plotting and visualization on the data set

**Pairplots using seaborn**

In [None]:
sns.pairplot(df)

**Distribution of price (the predicted quantity)**

In [None]:
df['Price'].plot.hist(bins=25,figsize=(8,4))

In [None]:
df['Price'].plot.density()

**Correlation matrix and heatmap**

In [None]:
df.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=2)

### Feature and variable sets

**Make a list of data frame column names**

In [None]:
l_column = list(df.columns) # Making a list out of column names
len_feature = len(l_column) # Length of column vector list
l_column

**Put all the numerical features in X and Price in y; ignore Address, which is a string column and cannot be used directly in linear regression**

In [None]:
X = df[l_column[0:len_feature-2]]
y = df[l_column[len_feature-2]]

In [None]:
print("Feature set size:",X.shape)
print("Variable set size:",y.shape)

In [None]:
X.head()

In [None]:
y.head()

### Test-train split

In [None]:
from sklearn.model_selection import train_test_split

**Create training and test splits, holding out 30% of the data for testing**
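
One possible way to do the split is sketched below on a synthetic stand-in for the feature matrix and target. The names `X_demo` and `y_demo`, the column names, and the `random_state` value are assumptions made for this illustration; in the notebook the same call would be applied to `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and price (illustration only)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=100), name="target")

# test_size=0.3 holds out 30% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:    ", X_test.shape, y_test.shape)
```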

### Model fit and training

**Import linear regression model estimator from scikit-learn and instantiate**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics

**Create linear regression object**

**Fit the model on the training data (the estimator object is updated in place)**

**Check the intercept and coefficients**
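
The three steps above (create, fit, inspect) might look like the following sketch. It uses synthetic data with known coefficients so the fitted values can be sanity-checked; `X_demo`, `y_demo`, `lm`, and `coef_table` are names chosen for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear relationship (illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])
y_demo = 5.0 + 2.0 * X_demo["f1"] - 1.0 * X_demo["f2"] + rng.normal(scale=0.1, size=200)

lm = LinearRegression()          # create the estimator (it fits an intercept by default)
lm.fit(X_demo, y_demo)           # estimate the intercept and coefficients in place

print("Intercept:", lm.intercept_)
coef_table = pd.DataFrame(lm.coef_, index=X_demo.columns, columns=["Coefficients"])
print(coef_table)
```

With low noise, the printed intercept and coefficients should land close to the values used to generate the data (5, 2, and -1).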

### Calculation of standard errors and t-statistic for the coefficients

In [None]:
def metric_calculations(X_train, y_train, model):
    # Calculate an approximate standard error and t-statistic for each coefficient.
    # Returns a dataframe with the coefficients, standard errors and t-statistics.
    # Note: this per-feature formula ignores correlation between features; the exact
    # standard errors come from the diagonal of sigma^2 * (X'X)^-1.
    cdf = pd.DataFrame(model.coef_, X_train.columns, columns=['Coefficients'])
    n = X_train.shape[0]  # number of training samples
    k = X_train.shape[1]  # number of features
    dfN = n - k - 1  # residual degrees of freedom (the extra 1 is for the intercept)
    train_pred = model.predict(X_train)
    sum_error = np.sum(np.square(train_pred - y_train))  # residual sum of squares
    residual_var = sum_error / dfN
    se = []
    for col in X_train.columns:
        # Sum of squared deviations of this feature from its mean
        ss_x = np.sum(np.square(X_train[col] - X_train[col].mean()))
        se.append(np.sqrt(residual_var / ss_x))
    cdf['Standard Error'] = se
    cdf['t-statistic'] = cdf['Coefficients'] / cdf['Standard Error']
    return cdf
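
When features are correlated, a per-feature formula of the form $\sqrt{\hat\sigma^2 / \sum_i (x_{ij} - \bar{x}_j)^2}$ underestimates the standard errors. The sketch below shows the textbook matrix form, $\hat\sigma^2 (A^\top A)^{-1}$ with $A$ the design matrix including a column of ones. The helper name `ols_standard_errors` and the synthetic data are assumptions for illustration, and the helper assumes the model was fitted with an intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def ols_standard_errors(X_train, y_train, model):
    """Hypothetical helper: exact OLS standard errors and t-statistics (model fitted with an intercept)."""
    n, k = X_train.shape
    # Design matrix with a leading column of ones for the intercept
    A = np.column_stack([np.ones(n), X_train.to_numpy(dtype=float)])
    residuals = y_train.to_numpy(dtype=float) - model.predict(X_train)
    sigma2 = residuals @ residuals / (n - k - 1)      # unbiased residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)             # covariance of [intercept, coefficients]
    se = np.sqrt(np.diag(cov))[1:]                    # drop the intercept entry
    out = pd.DataFrame({"Coefficients": model.coef_, "Standard Error": se},
                       index=X_train.columns)
    out["t-statistic"] = out["Coefficients"] / out["Standard Error"]
    return out

# Synthetic, deliberately correlated features (illustration only)
rng = np.random.default_rng(1)
f1 = rng.normal(size=300)
f2 = 0.8 * f1 + 0.6 * rng.normal(size=300)
X_demo = pd.DataFrame({"f1": f1, "f2": f2})
y_demo = pd.Series(1.0 + 2.0 * f1 + 0.5 * f2 + rng.normal(size=300))

model = LinearRegression().fit(X_demo, y_demo)
print(ols_standard_errors(X_demo, y_demo, model))
```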

**Sort the features based on the t-statistic**
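
A minimal sketch of the sorting step, using a hand-made table with invented numbers in the same shape as the coefficient dataframe. Ranking by the absolute value is a design choice made here so that strongly negative effects are not pushed to the bottom; sorting on the signed value is equally easy.

```python
import pandas as pd

# Hypothetical coefficient table with made-up numbers (illustration only)
cdf_demo = pd.DataFrame(
    {"Coefficients": [0.1, -0.5, 2.0],
     "Standard Error": [0.2, 0.25, 0.1]},
    index=["f1", "f2", "f3"],
)
cdf_demo["t-statistic"] = cdf_demo["Coefficients"] / cdf_demo["Standard Error"]

# Rank features by the magnitude of the t-statistic, largest first
order = cdf_demo["t-statistic"].abs().sort_values(ascending=False).index
cdf_sorted = cdf_demo.loc[order]
print(cdf_sorted)
```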

### Prediction, error estimate, and regression evaluation metrics

**Do prediction with the linear model**

**Plot predictions against the ground truth values**
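
The prediction and plotting steps could be sketched as follows. The data are synthetic, and the variable names (`X_demo`, `y_demo`, `lm`, `predictions`) are chosen for this example; with the housing data the same calls would use the real training and test splits.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data from a known linear relationship plus noise (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 4.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)          # predict on rows the model has not seen

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, predictions, alpha=0.6)
lims = [min(y_test.min(), predictions.min()), max(y_test.max(), predictions.max())]
ax.plot(lims, lims, color="red", linestyle="--")   # y = x reference line
ax.set_xlabel("Ground truth")
ax.set_ylabel("Predicted")
ax.set_title("Predicted vs. true values (synthetic data)")
```

Points that fall on the dashed line are predicted exactly; the spread around it gives a visual impression of the error.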

**What can be determined from this plot?**

Answer here

**Calculate the mean absolute, the mean squared and the root mean square errors as well as the $R^2$-value of the predictions**
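
As a sketch, the four metrics can be computed with `sklearn.metrics` as shown below. The arrays `y_true` and `y_pred` hold made-up numbers; in the notebook they would be the test targets and the model's predictions.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions (illustration only)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = metrics.mean_squared_error(y_true, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                                 # same units as the target
r2 = metrics.r2_score(y_true, y_pred)               # 1 - RSS / total sum of squares

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("R^2 :", r2)
```

Taking the square root of the MSE explicitly keeps the sketch independent of which scikit-learn version is installed.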

## Optional plots

For good practice

**Plot a histogram of the residuals, i.e. the prediction errors (expect an approximately normal distribution)**

**Scatter plot of residuals vs. predicted values (check for homoscedasticity)**
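
Both residual plots can be drawn side by side, as in the sketch below. It uses synthetic data and, for brevity, computes residuals on the same rows the model was fitted on; with the housing data the residuals would normally come from the test set. All variable names are assumptions for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data with constant-variance noise (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 4.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

lm = LinearRegression().fit(X_demo, y_demo)
predictions = lm.predict(X_demo)
residuals = y_demo - predictions

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Histogram: roughly bell-shaped and centered at zero if the errors are close to normal
ax_hist.hist(residuals, bins=25)
ax_hist.set_xlabel("Residual")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Histogram of residuals")

# Residuals vs. predictions: an even band around zero suggests homoscedasticity
ax_scatter.scatter(predictions, residuals, alpha=0.6)
ax_scatter.axhline(0, color="red", linestyle="--")
ax_scatter.set_xlabel("Predicted value")
ax_scatter.set_ylabel("Residual")
ax_scatter.set_title("Residuals vs. predicted values")
```

A funnel shape in the right-hand plot, with the spread growing or shrinking along the x-axis, would indicate that the error variance is not constant.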