# Project 1: Linear Regression Model

This is the first project in our data science fundamentals. It is designed to solidify your understanding of the concepts we have learned in Regression and to test your knowledge of regression modelling. The project has four main objectives.

1\. Build Linear Regression Models 
* Use the closed-form solution to estimate parameters
* Use a package of your choice to estimate parameters<br>

2\. Model Performance Assessment
* Provide an analytical rationale for your choice of model
* Visualize the model performance
  * MSE, R-Squared, Train and Test Error <br>

3\. Model Interpretation

* Interpret the results of your model
* Interpret the model assessment <br>
    
4\. Model Diagnostics
* Does the model meet the regression assumptions?
    
#### About this Notebook

1\. This notebook should guide you through this project and provide starter code
2\. The dataset used is a housing dataset of Seattle homes
3\. Feel free to consult online resources when stuck or discuss with data science team members


Let's get started.

### Packages

Importing the necessary packages for the analysis

In [None]:
# Necessary Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Model and data preprocessing
from sklearn import linear_model
from sklearn.model_selection import train_test_split

%matplotlib inline

Now that you have imported your packages, let's read the data we will be using. The dataset is titled *housing_data.csv* and contains housing prices along with information about the features of the houses. Below, read the data into a variable and display the top 8 rows of the data.

In [None]:
# Initializing the random seed
np.random.seed(42)

# Read the housing data and display the top 8 rows
data = pd.read_csv('housing_data.csv')
data.head(8)

### Split data into train and test

In the code below, we need to split the data into train and test sets for modeling and validation of our models. We will cover the train/validation/test split as we go along in the project. Fill in the following code.

1\. Subset the features to the variable: features <br>
2\. Subset the target variable: target <br>
3\. Set the test size, as a proportion of the data, in the variable: test_size <br>
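
The three steps above can be sketched on a small synthetic table. This is only an illustration: the DataFrame `toy`, its values, and the target column name `price` are assumptions made for the example, not the actual contents of *housing_data.csv*.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column "price" is a hypothetical target name
rng = np.random.default_rng(42)
n = 100
toy = pd.DataFrame({
    "living_area": rng.uniform(500, 3000, n),
    "garage_area": rng.uniform(0, 800, n),
})
toy["price"] = 50_000 + 100 * toy["living_area"] + rng.normal(0, 10_000, n)

# 1. Subset the features, 2. subset the target, 3. set the test proportion
toy_features = toy[["living_area", "garage_area"]]
toy_target = toy["price"]
toy_test_size = 0.33

xtr, xte, ytr, yte = train_test_split(
    toy_features, toy_target, test_size=toy_test_size, random_state=42
)
print(xtr.shape, xte.shape)
```

Roughly a third of the rows land in the test set, and fixing `random_state` makes the split reproducible between runs.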


In [None]:
features = """ Enter your code here"""  
target = """ Enter your code here""" 
test_size = .33


x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=test_size, random_state=42)

### Data Visualization

The best way to explore the data is to build some plots that reveal the relationships between the variables. We can use a scatter matrix to explore all our variables at once. Below is some starter code to build the scatter matrix.

In [None]:
_ = pd.plotting.scatter_matrix(x_train, figsize=(14,8), alpha=1, diagonal='kde')

Based on the scatter matrix above, write a brief description of what you observe. As you write, consider the relationships between the variables and whether linear regression is an appropriate choice for modelling this data.

#### Correlation Matrix

In the code below, compute the correlation matrix and write a few thoughts about what you observe. In doing so, consider the interplay between the features and how their correlation may affect your modeling.
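
To get a feel for what strongly and weakly correlated features look like in a correlation matrix, here is a minimal sketch on synthetic data. The values are made up for illustration: `firstfloor_sqft` is deliberately constructed to track `living_area`, while `garage_area` is drawn independently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
living = rng.uniform(500, 3000, n)

# Synthetic features (assumed values, not the real housing data)
toy = pd.DataFrame({
    "living_area": living,
    "firstfloor_sqft": 0.6 * living + rng.normal(0, 100, n),  # built to follow living_area
    "garage_area": rng.uniform(0, 800, n),                    # independent of the others
})

corr = toy.corr()  # pairwise Pearson correlation
print(corr.round(2))

r_related = corr.loc["living_area", "firstfloor_sqft"]
r_unrelated = corr.loc["living_area", "garage_area"]
```

A pair like the first one carries largely overlapping information, which is worth keeping in mind when comparing the single-feature models later in the project.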

In [None]:
# Use pandas correlation function
x_train.corr()

## 1. Build Your Model

Now that we have explored the data at a high level, let's build our model. In our sessions, we have discussed the closed-form solution, gradient descent, and using packages. In this section you will create your own estimators. Starter code is provided to make this easier.


#### 1.1. Closed Form Solution
Recall: <br>
$$\beta_0 = \bar {y} - \beta_1 \bar{x}$$ <br>
$$\beta_1 = \frac {cov(x, y)} {var(x)}$$ <br>
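
As a rough sketch of how these two formulas translate to NumPy, the example below fits a line to synthetic data (generated with a true intercept of 3 and slope of 2, both chosen arbitrarily for the illustration) and cross-checks the result against `np.polyfit`. It is one possible way to organize the computation, not the required solution.

```python
import numpy as np

# Synthetic data: y = 3 + 2x + noise (assumed values for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)

x_bar, y_bar = x.mean(), y.mean()

# Covariance and variance, both normalized by n so the factor cancels
cov_xy = np.mean((x - x_bar) * (y - y_bar))
var_x = np.mean((x - x_bar) ** 2)

beta_1 = cov_xy / var_x          # slope
beta_0 = y_bar - beta_1 * x_bar  # intercept

# Same slope using library helpers; note the matching ddof=1 on both
beta_1_alt = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Cross-check against a degree-1 least squares fit (returns slope first)
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(beta_0, beta_1)
```

One detail to watch: `np.cov` normalizes by n - 1 by default while `np.var` normalizes by n, so mixing the two defaults gives a slightly wrong slope. Use the same normalization in the numerator and the denominator.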

Below, let's define a function that computes these parameters.

In [None]:
# Pass the necessary arguments in the function to calculate the coefficients
def compute_estimators(x_feature, y_outcome):
    """ Calculate the coefficients """
    
    # Compute the Intercept and Slope
    """ Enter your code here""" 
    
    return # Return the Intercept and Slope

Run the `compute_estimators` function above and display the estimated coefficients for any one of the predictors/input variables.

In [None]:
# Remember to pass the correct arguments
compute_estimators()

#### 1.2. sklearn solution

Now that we know how to compute the estimators, let's leverage the sklearn module to compute them for us. We have already imported the linear model, so let's initialize the model and compute the coefficients using the same input as above.
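
For orientation, a minimal sketch of the sklearn workflow on synthetic data is shown here. The data is made up for the example; in the project you would pass one of the housing features instead.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: y = 3 + 2x + noise (assumed values for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)

model = linear_model.LinearRegression()

# sklearn expects a 2D feature array of shape (n_samples, n_features)
model.fit(x.reshape(-1, 1), y)

sk_intercept = model.intercept_
sk_slope = model.coef_[0]
print(sk_intercept, sk_slope)
```

The `reshape(-1, 1)` step matters for single-feature models: passing a 1D array as the features raises an error. With a pandas DataFrame, selecting the column with double brackets (for example `x_train[['living_area']]`) keeps it two-dimensional.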

In [None]:
# Initialize the linear regression model here
""" Enter your code here""" 

# Pass in the correct inputs
""" Enter your code here""" 

# Print the coefficients
""" Enter your code here""" 

Do the results from the cell above match your implementation? They should be very close to each other.

### 2. Model Evaluation

Now that we have estimated a single model, we are going to compute the coefficients for all the inputs. We can use a for loop to estimate multiple models. However, we first need to create a few functions:

1\. Prediction function: function to compute the predictions <br>
2\. MSE: function to compute the mean squared error <br>
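
To make the two quantities concrete: a prediction is the intercept plus the slope times the feature value, and the MSE is the average of the squared differences between the observed outcomes and the predictions. The tiny worked example below uses hypothetical helper names (`predict_line`, `mse`) and hand-picked numbers so the arithmetic is easy to follow.

```python
import numpy as np

def predict_line(intercept, slope, x):
    """Return intercept + slope * x for each value in x."""
    return intercept + slope * np.asarray(x, dtype=float)

def mse(y_true, y_pred):
    """Return the mean of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

x_demo = [1.0, 2.0, 3.0]
y_demo = [3.0, 5.0, 8.0]

# With intercept 1 and slope 2 the predictions are 3, 5 and 7
y_hat_demo = predict_line(1.0, 2.0, x_demo)

# Squared errors are 0, 0 and 1, so the MSE is 1/3
demo_mse = mse(y_demo, y_hat_demo)
print(y_hat_demo, demo_mse)
```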

In [None]:
def model_predictions(intercept, slope, x_feature):
    """ Compute Model Predictions """
    
    """ Enter your code here""" 
    
    return y_hat
    

def mean_square_error(y_outcome, predictions):
    """ Compute the mean square error """
    
    """ Enter your code here""" 
    
    
    return mse

The last function we need is a plotting function to visualize our predictions relative to our data.
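
One possible shape for such a plot is a scatter of the observed points with the fitted line drawn over it. The sketch below uses a hypothetical helper `plot_fit` and synthetic data; axis labels and colors are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fit(x, y, y_hat, ax=None):
    """Scatter the observations and overlay the predictions as a line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    x = np.asarray(x)
    order = np.argsort(x)  # sort by x so the line is drawn left to right
    ax.scatter(x, y, alpha=0.6, label="observed")
    ax.plot(x[order], np.asarray(y_hat)[order], color="red", label="predicted")
    ax.set_xlabel("feature")
    ax.set_ylabel("outcome")
    ax.legend()
    return ax

# Synthetic data (assumed values for illustration)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 40)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5, 40)

slope, intercept = np.polyfit(x, y, deg=1)
ax = plot_fit(x, y, intercept + slope * x)
```

Accepting an optional `ax` argument makes it easy to place several model plots side by side later, for example one subplot per feature.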

In [None]:
def plotting_model(x_feature, y_outcome, predictions):
    """ Create a scatter and predictions  """
    
    # Enter your code here
    
    
    return ""

#### Run Model Assessment

Now that we have our functions ready, we can build individual models, compute predictions, plot our model results, and determine our MSE. Notice that we compute our MSE on the test set and not the train set.
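
The overall pattern (fit on the train set, score on the test set, repeat per feature) can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the data is generated so that the hypothetical `price` column depends on `living_area` but not on `garage_area`, and `np.polyfit` stands in for the estimator functions you wrote above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: price depends on living_area only (assumed for the example)
rng = np.random.default_rng(3)
n = 300
toy = pd.DataFrame({
    "living_area": rng.uniform(500, 3000, n),
    "garage_area": rng.uniform(0, 800, n),
})
toy["price"] = 50_000 + 100 * toy["living_area"] + rng.normal(0, 20_000, n)

train, test = train_test_split(toy, test_size=0.33, random_state=42)

results = {}
for col in ["living_area", "garage_area"]:
    # Fit on the training rows only
    slope, intercept = np.polyfit(train[col], train["price"], deg=1)

    # Score on both sets; the test MSE measures error on unseen rows
    train_mse = np.mean((train["price"] - (intercept + slope * train[col])) ** 2)
    test_mse = np.mean((test["price"] - (intercept + slope * test[col])) ** 2)
    results[col] = {"train_mse": train_mse, "test_mse": test_mse}

summary = pd.DataFrame(results).T
print(summary)
```

Collecting the metrics in one table makes the comparison across features straightforward. In this synthetic setup the `living_area` model should show a much lower test MSE than the `garage_area` model.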

In [None]:
features = ['lot_area', 'firstfloor_sqft', 'living_area', 'bath', 'garage_area']

for feature in features:
    
    # Compute the Coefficients
    
    # Print the Intercept and Slope
    # Enter your code here
    
    # Compute the Train and Test Predictions
    # Enter your code here
    
    # Plot the Model Scatter  
    # Enter your code here
    
    # Compute the MSE
    # Enter your code here
    pass  # Placeholder so the empty loop runs; remove once the steps above are filled in
    

### 3. Model Interpretation

Now that you have fit all the individual models, provide an analytical rationale for which model has performed best. To provide an additional assessment metric, let's create a function to compute the R-Squared.

#### Mathematically:

$$R^2 = \frac {SS_{Regression}}{SS_{Total}} = 1 - \frac {SS_{Error}}{SS_{Total}}$$<br>

where:<br>
$SS_{Regression} = \sum (\widehat {y_i} - \bar {y})^2$<br>
$SS_{Total} = \sum ({y_i} - \bar {y})^2$<br>
$SS_{Error} = \sum ({y_i} - \widehat {y_i})^2$
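
As a small numeric check of the $1 - SS_{Error}/SS_{Total}$ form, the sketch below takes $SS_{Error}$ as the sum of squared residuals and $SS_{Total}$ as the sum of squared deviations of the outcomes from their mean, then compares the result with `sklearn.metrics.r2_score`. The helper name `r2_from_sums` and the numbers are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_from_sums(y_true, y_pred):
    """Compute R^2 as 1 - SS_Error / SS_Total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_error = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_total = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_error / ss_total

# Hand-picked numbers: SS_Error = 1.5 and SS_Total = 22.75
y_demo = np.array([3.0, 5.0, 8.0, 9.0])
y_hat_demo = np.array([3.5, 5.0, 7.0, 9.5])

manual_r2 = r2_from_sums(y_demo, y_hat_demo)
print(manual_r2, r2_score(y_demo, y_hat_demo))
```

Note that the two expressions for $R^2$ agree only for a least squares fit with an intercept evaluated on the data it was fit to. On a test set the $1 - SS_{Error}/SS_{Total}$ form is the one to use, and it can even be negative when the model predicts worse than the mean.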

In [None]:
def r_squared(y_outcome, predictions):
    """ Compute the R Squared """
    
    """ Enter your code here""" 
    
    return ""

Now that we can compute the R-Squared, evaluate it on the test set for every model and determine which model explains the data best.

In [None]:
""" Enter the code here """

### 4. Model Diagnostics

Linear regression depends on its model assumptions being met. While we have not yet talked about the assumptions, your goal is to research them and develop an intuitive understanding of why they make sense. We will walk through this portion in the Multiple Linear Regression project.