# Advanced Linear Regression  

This repo contains advanced linear regression practice. The previous linear regression guided practice covered basic linear regression and some of its assumptions. Here, we look at some important concepts to consider when building models, including:

- identifying and dealing with categorical variables
- understanding the assumptions of linear regression
- understanding the various transformations
- understanding the interpretation of coefficients
- final model validation

We will be using the **Ames housing data** from the Kaggle House Prices: Advanced Regression Techniques competition. The dataset is already divided into train and test sets, and we will focus on the train data. It has a large number of variables; for simplicity, we will focus on a select few. The ```data_description.txt``` file describes all of the features provided.

The ```solution.ipynb``` notebook has the solution to this exercise. 

## Importing the data

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
import seaborn as sns

print('- Package Versions:')
print(f'\tMatplotlib = {mpl.__version__}')
print(f'\tPandas = {pd.__version__}')
print(f'\tSeaborn = {sns.__version__}')

- Package Versions:
	Matplotlib = 3.3.2
	Pandas = 1.1.3
	Seaborn = 0.11.0


In [4]:
cols_to_use = ['YrSold', 'MoSold', 'Fireplaces', 'TotRmsAbvGrd', 'GrLivArea',
          'FullBath', 'YearRemodAdd', 'YearBuilt', 'OverallCond', 
          'OverallQual', 'LotArea', 'SalePrice','BldgType']
df = pd.read_csv("train.csv")
df = df[cols_to_use]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   YrSold        1460 non-null   int64 
 1   MoSold        1460 non-null   int64 
 2   Fireplaces    1460 non-null   int64 
 3   TotRmsAbvGrd  1460 non-null   int64 
 4   GrLivArea     1460 non-null   int64 
 5   FullBath      1460 non-null   int64 
 6   YearRemodAdd  1460 non-null   int64 
 7   YearBuilt     1460 non-null   int64 
 8   OverallCond   1460 non-null   int64 
 9   OverallQual   1460 non-null   int64 
 10  LotArea       1460 non-null   int64 
 11  SalePrice     1460 non-null   int64 
 12  BldgType      1460 non-null   object
dtypes: int64(12), object(1)
memory usage: 148.4+ KB


You will notice that we are only using specific columns from the dataset. This is purely to save time; feel free to use all the columns if you want. When you work with the test data, ```test.csv```, be sure to apply the same changes to it. Remember that ```SalePrice``` is not provided for the test data. You can check out the Kaggle competition [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). If you submit your predictions, you will get to see the RMSE of your results.
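
One way to keep the train and test frames aligned is to derive both from the same column list and skip the target where it is missing. The following is a minimal sketch under assumptions: the two small frames are made-up stand-ins for ```train.csv``` and ```test.csv```, and the helper name `select_columns` is hypothetical.

```python
import pandas as pd


def select_columns(frame, cols, target="SalePrice"):
    """Keep the requested columns, skipping the target when the frame lacks it."""
    keep = [c for c in cols if c != target or target in frame.columns]
    return frame[keep]


# Made-up stand-ins for train.csv and test.csv (illustrative values only).
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "GrLivArea": [1500, 2000],
    "BldgType": ["1Fam", "Duplex"],
    "SalePrice": [200000, 250000],
})
toy_test = pd.DataFrame({
    "Id": [3],
    "GrLivArea": [1200],
    "BldgType": ["1Fam"],
})

cols = ["GrLivArea", "BldgType", "SalePrice"]
train_sel = select_columns(toy_train, cols)
test_sel = select_columns(toy_test, cols)

print(list(train_sel.columns))
print(list(test_sel.columns))
```

With the real files, the same helper could be applied to the frames returned by `pd.read_csv`.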

# EDA

What are some interesting things to explore here? Keep in mind that before building a model, we need to check whether the data meet some of the assumptions of regression. We also need to identify the continuous and categorical variables, and the correlations between variables.

In [6]:
# EDA plots

In [7]:
# heatmap
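
The correlation check mentioned above could be sketched as follows. This is not the solution notebook's code: the frame is synthetic (generated so that two predictors track each other), and the 0.8 cutoff is an arbitrary assumption for illustration. The resulting `corr` matrix is what one would typically pass to `sns.heatmap(corr, annot=True)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a few numeric columns (not the real Ames values).
area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.5, n),  # built to track GrLivArea
    "LotArea": rng.normal(10000, 3000, n),
})
toy["SalePrice"] = 100 * toy["GrLivArea"] + rng.normal(0, 20000, n)

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then list strong pairs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs.abs() > 0.8]  # 0.8 is an assumed, arbitrary cutoff
print(strong_pairs)
```

Strongly correlated predictor pairs are worth noting because they can make individual coefficients unstable.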

In [8]:
# Make lists of the variables we want to use. It's a good idea to make separate lists
# for the continuous and categorical variables
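
One possible way to draft these lists is to split on dtype and then flag numeric columns with only a few distinct values for a manual look. The rows below are made up, and the `max_levels` threshold is a hypothetical choice for this toy example rather than a rule.

```python
import pandas as pd

# Made-up rows standing in for a few of the selected columns.
toy = pd.DataFrame({
    "GrLivArea": [1500, 2000, 1200, 1750, 980, 2100],
    "Fireplaces": [0, 1, 0, 2, 0, 1],
    "BldgType": ["1Fam", "Duplex", "1Fam", "TwnhsE", "1Fam", "1Fam"],
    "SalePrice": [200000, 250000, 150000, 210000, 120000, 300000],
})
target = "SalePrice"

# Non-numeric columns are treated as categorical.
categorical = toy.select_dtypes(exclude="number").columns.tolist()
numeric = toy.select_dtypes(include="number").columns.drop(target).tolist()

# Numeric columns with few distinct values may really be counts or ratings.
max_levels = 4  # hypothetical threshold for this toy frame
low_cardinality = [c for c in numeric if toy[c].nunique() <= max_levels]
continuous = [c for c in numeric if c not in low_cardinality]

print("categorical:", categorical)
print("review manually:", low_cardinality)
print("continuous:", continuous)
```

The flagged columns still need a judgment call; the data description file is the better guide for whether a number is a measurement or a code.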

## Identifying the variables we will be using and their types

Let's make some quick plots to identify the variables we will use. To start with, we want a baseline model: a simple model that gives decent results. It is not necessarily the best model, but it is easy to build and interpret.

In [10]:
# identify the top 3 variables and make a new df with them
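
The exercise does not say how "top" is measured. One common assumption is to rank predictors by the absolute value of their correlation with ```SalePrice```, sketched here on synthetic data whose target is built from three of the four columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic predictors (not the real Ames values).
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
    "YearBuilt": rng.integers(1900, 2011, n),
    "MoSold": rng.integers(1, 13, n),
})
# Synthetic target: depends on three columns and not on MoSold.
toy["SalePrice"] = (
    80 * toy["GrLivArea"]
    + 20000 * toy["OverallQual"]
    + 1000 * toy["YearBuilt"]
    + rng.normal(0, 15000, n)
)

# Rank predictors by absolute correlation with the target and keep three.
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")
top3 = corr_with_target.abs().sort_values(ascending=False).head(3).index.tolist()

baseline_df = toy[top3 + ["SalePrice"]]
print(top3)
```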

# Modeling

The idea behind modeling when using linear regression is to iteratively address issues in the previous models and build better models. There are various techniques available to us to improve our results. We will cover them briefly here.

## Baseline Model

In [11]:
from statsmodels.formula.api import ols

Let's first build the baseline model using the 3 variables we selected above.

In [13]:
# build the model and print the summary
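
As a rough illustration of the formula interface imported above, the sketch below fits `ols` on synthetic data with two predictors (the exercise uses three) and reads one coefficient. The data-generating numbers are invented so that the fitted values can be sanity-checked.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 300

# Synthetic data with known coefficients (not the real Ames values).
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
})
toy["SalePrice"] = (
    50000
    + 80 * toy["GrLivArea"]
    + 20000 * toy["OverallQual"]
    + rng.normal(0, 15000, n)
)

baseline = ols("SalePrice ~ GrLivArea + OverallQual", data=toy).fit()
print(baseline.summary())

# Interpreting one coefficient: the expected change in SalePrice for a
# one-unit increase in GrLivArea, holding OverallQual fixed.
slope = baseline.params["GrLivArea"]
print(f"GrLivArea coefficient: {slope:.1f}")
```

In the summary, the coefficient table, R-squared, and the p-values are usually the first places to look.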

Let's identify interesting things from the summary. List three observations below, and make sure you interpret at least one coefficient.

### Model diagnostics

Determine how this model performs on the assumptions of normality of residuals and homoscedasticity by using the correct plots.

In [14]:
# model validation

List your observations from the plots.

## Model 2 - Including the categorical variables

Now, let's include the categorical variables and make sure we handle them correctly. Remember that there are two common ways of dealing with categorical variables: label encoding and one-hot encoding.

**Label encoding**- Label encoding converts each label into an integer so that it is machine-readable. One big issue with this technique is that it imposes an artificial ordering on the categories: a model may treat a label with a higher value as greater than, or more important than, a label with a lower value.

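**One hot encoding**- The categorical variable is removed and a new binary variable is added for each unique category. These binary variables are often called “dummy variables”. In a regression with an intercept, one category is typically dropped and serves as the reference level, which avoids perfect multicollinearity.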
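
The two encodings can be compared side by side on a tiny example. The rows are illustrative only, and the category labels shown are a sample rather than the full set in the data.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({"BldgType": ["1Fam", "Duplex", "TwnhsE", "1Fam"]})

# Label encoding: one integer per category (sorted order of the labels here).
toy["BldgType_code"] = toy["BldgType"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category. drop_first removes the
# first category, which then acts as the reference level.
dummies = pd.get_dummies(
    toy["BldgType"], prefix="BldgType", drop_first=True, dtype=int
)

print(toy)
print(dummies)
```

The integer codes imply an ordering between building types, while the dummy columns do not.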

Let's decide which technique works here and apply it to the ```BldgType``` variable.

In [15]:
# deal with the BldgType variable

Let's now include this variable in our model and see how things change.

In [16]:
# build the model
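
With the formula interface, one option is to wrap the column in `C(...)`, which asks the formula parser to treat it as categorical and build the dummy columns itself. A minimal sketch on synthetic data, where the discount for one building type is invented:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(4)
n = 300

# Synthetic data (not the real Ames values).
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "BldgType": rng.choice(["1Fam", "Duplex", "TwnhsE"], size=n),
})
# Made-up effect: Duplex rows sell for 30000 less, all else equal.
toy["SalePrice"] = (
    50000
    + 100 * toy["GrLivArea"]
    - 30000 * (toy["BldgType"] == "Duplex")
    + rng.normal(0, 10000, n)
)

model2 = ols("SalePrice ~ GrLivArea + C(BldgType)", data=toy).fit()
print(model2.params)
```

Each `C(BldgType)[T.<level>]` coefficient is read relative to the reference level, which is the first level in sorted order unless specified otherwise.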

In [17]:
# model diagnostics

## Model 3 - Transformations of data

A log transformation replaces each value x of a variable with log(x), which requires x to be positive. It compresses large values more than small ones, so it can make a highly right-skewed distribution less skewed.

In [18]:
# let's log transform the dependent variable
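
The effect of the transformation on skewness can be checked numerically. The sketch below uses synthetic log-normal prices as a stand-in for ```SalePrice```; the column name `log_SalePrice` is an assumption introduced here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Synthetic right-skewed prices (log-normal), standing in for SalePrice.
toy = pd.DataFrame({
    "SalePrice": rng.lognormal(mean=12, sigma=0.4, size=1000)
})

# Natural log of the target; valid because every price is positive.
toy["log_SalePrice"] = np.log(toy["SalePrice"])

# Sample skewness before and after the transformation.
print(toy[["SalePrice", "log_SalePrice"]].skew())
```

Predictions made on the log scale can be mapped back to prices with `np.exp`.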

In [19]:
# build a new model with the log transformed target
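
With a log-transformed target, a coefficient b is read as a multiplicative effect: a one-unit increase in the predictor is associated with a change of (exp(b) - 1) * 100 percent in the original target. A sketch on synthetic data, assuming the log column is named `log_SalePrice`:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(6)
n = 300

# Synthetic data: each quality point multiplies the price by exp(0.15).
toy = pd.DataFrame({"OverallQual": rng.integers(1, 11, n).astype(float)})
toy["SalePrice"] = np.exp(
    11 + 0.15 * toy["OverallQual"] + rng.normal(0, 0.1, n)
)
toy["log_SalePrice"] = np.log(toy["SalePrice"])

log_model = ols("log_SalePrice ~ OverallQual", data=toy).fit()

# Convert the log-scale coefficient into a percentage change in SalePrice.
b = log_model.params["OverallQual"]
pct = (np.exp(b) - 1) * 100
print(f"One more quality point: about {pct:.1f}% change in SalePrice")
```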

In [20]:
# model diagnostics

Comment on the new plots. Did they improve? Did they get worse?