# Task 1 - Exploratory Data Analysis
In this section, you will: 
- Load the necessary packages required for executing the code
- Load the data
- Summarize the features in the data set using descriptive statistics
- Study the features and their interrelationships using various visualizations

In [None]:
# Import 'numpy' and 'pandas' for working with numbers and dataframes
import numpy as np
import pandas as pd

# Import 'matplotlib.pyplot' and 'seaborn' for visualizations
from matplotlib import pyplot as plt
import seaborn as sns

# Import method for regression from 'statsmodels'
import statsmodels.formula.api as smf

# Import methods for regression diagnostic plots from 'statsmodels'
from statsmodels.api import ProbPlot, qqplot

In [None]:
# Load the data and take a look at the first few rows
# Note: Make sure that the data file is in the same folder as the Jupyter notebook, or specify the full path to the file
df = pd.read_csv('Ames_Housing_Subset_1.csv', index_col = 'PID')
df.head()

Feature description:
- PID: The unique identifier for a property
- LotArea: The area in square feet of the lot on which the property is built
- Age: The age of the property in years
- TotalBsmtSF: The area of the basement of the property in square feet
- SalePrice: The current selling price of the property in dollars
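
The two summary methods used in the next cells can be tried out first on a small stand-in table. The sketch below is only illustrative: `toy_df` and all of its values are invented, and only the column names are taken from the feature list above.

```python
import pandas as pd

# Hypothetical toy data: column names follow the feature list, values are invented
toy_df = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "Age": [5, 31, 7, 91],
        "TotalBsmtSF": [856, 1262, 920, 756],
        "SalePrice": [208500, 181500, 223500, 140000],
    },
    index=pd.Index([101, 102, 103, 104], name="PID"),
)

# Column names, dtypes and non-null counts
toy_df.info()

# Count, mean, standard deviation, min, quartiles and max for each numeric column
summary = toy_df.describe()
print(summary)
```

`.info()` is useful for spotting missing values and unexpected dtypes, while `.describe()` gives a first impression of the scale and spread of each feature.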

In [None]:
# Look at the column types and non-null counts of the data frame using the '.info()' method
##### CODE HERE #####

In [None]:
# Summarize the features in the data set with descriptive statistics using the '.describe()' method
##### CODE HERE #####

In [None]:
# Create histograms for the variables 'LotArea', 'Age', 'TotalBsmtSF' and 'SalePrice'
plt.figure(figsize = (12, 6))

colorname = ['lightblue', 'lightgreen', 'lightgray', 'lightcoral']
fignum = 0
for featurename in df.columns:
    fignum = fignum + 1
    plt.subplot(2, 2, fignum)
    sns.histplot(data = df, x = featurename, color = colorname[fignum - 1])

plt.tight_layout();

In [None]:
# Create box plots for the variables 'LotArea', 'Age', 'TotalBsmtSF' and 'SalePrice'
plt.figure(figsize = (12, 6))

colorname = ['lightblue', 'lightgreen', 'lightgray', 'lightcoral']
fignum = 0
for featurename in df.columns:
    fignum = fignum + 1
    plt.subplot(2, 2, fignum)
    sns.boxplot(data = df, x = featurename, color = colorname[fignum - 1])

plt.tight_layout();

In [None]:
# Create a pair plot for the data
sns.pairplot(df);

# Task 2 - Simple Linear Regression
In this section, you will train and evaluate the following simple linear regression models:
- SalePrice vs LotArea
- SalePrice vs Age
- SalePrice vs TotalBsmtSF
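
As a rough illustration of the two-step pattern used in the cells below (specify the model, then fit it), here is a minimal sketch on synthetic data. The data frame `toy`, its columns `x` and `y`, and the coefficients used to generate them are assumptions made up for this example; they are not part of the housing data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
toy["y"] = 3.0 + 2.0 * toy["x"] + rng.normal(0, 1, size=200)

# Step 1: specify the model with a formula of the form 'response ~ predictor'
toy_spec = smf.ols(formula="y ~ x", data=toy)

# Step 2: fit the model; the result holds the coefficients and summary statistics
toy_fit = toy_spec.fit()

print(toy_fit.params)    # estimated intercept and slope
print(toy_fit.rsquared)  # proportion of the variance in y explained by x
```

The estimated slope should land near the value used to generate the data, which is a quick sanity check that the formula was written the right way round.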

### Model 1

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'LotArea'
lr_model_1 = ##### CODE HERE #####
lr_model_1 = ##### CODE HERE #####
print(lr_model_1.summary())

### Model 2

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'Age'
lr_model_2 = ##### CODE HERE #####
lr_model_2 = ##### CODE HERE #####
print(lr_model_2.summary())

### Model 3

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'TotalBsmtSF'
lr_model_3 = ##### CODE HERE #####
lr_model_3 = ##### CODE HERE #####
print(lr_model_3.summary())

# Task 3 - Multiple Linear Regression
In this section, you will train and evaluate the following multiple linear regression models:
- SalePrice vs LotArea and Age
- SalePrice vs LotArea and TotalBsmtSF
- SalePrice vs Age and TotalBsmtSF
- SalePrice vs LotArea, Age and TotalBsmtSF
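
In a formula, several predictors are joined with `+`. The sketch below shows this on synthetic data and compares a one-predictor model with a two-predictor model; the names `toy`, `x1`, `x2` and `y` and the generating coefficients are invented for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): y depends on two predictors
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {
        "x1": rng.uniform(0, 10, size=300),
        "x2": rng.uniform(0, 5, size=300),
    }
)
toy["y"] = 1.0 + 2.0 * toy["x1"] - 3.0 * toy["x2"] + rng.normal(0, 1, size=300)

# One predictor versus two predictors joined with '+'
fit_single = smf.ols(formula="y ~ x1", data=toy).fit()
fit_full = smf.ols(formula="y ~ x1 + x2", data=toy).fit()

# R-squared never decreases when a predictor is added;
# adjusted R-squared penalizes the extra term
print(fit_single.rsquared, fit_single.rsquared_adj)
print(fit_full.rsquared, fit_full.rsquared_adj)
print(fit_full.params)
```

Because plain R-squared cannot go down when a predictor is added, adjusted R-squared is usually the fairer number to compare across models with different numbers of predictors.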

### Model 4

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'LotArea' and 'Age'
lr_model_4 = ##### CODE HERE #####
lr_model_4 = ##### CODE HERE #####
print(lr_model_4.summary())

### Model 5

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'LotArea' and 'TotalBsmtSF'
lr_model_5 = ##### CODE HERE #####
lr_model_5 = ##### CODE HERE #####
print(lr_model_5.summary())

### Model 6

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'Age' and 'TotalBsmtSF'
lr_model_6 = ##### CODE HERE #####
lr_model_6 = ##### CODE HERE #####
print(lr_model_6.summary())

### Model 7

In [None]:
# Create and train a linear regression model for the data using the 'smf.ols()' method and view its summary
# Note: The objective is to predict 'SalePrice' using 'LotArea', 'Age', and 'TotalBsmtSF'
lr_model_7 = ##### CODE HERE #####
lr_model_7 = ##### CODE HERE #####
print(lr_model_7.summary())

# Task 4 - Diagnostic Plots

In this section, you will create the following diagnostic plots for *lr_model_7*:
- Fitted vs Actual values
- Fitted values vs Residuals
- Histogram of the residuals
- QQ-plot

In [None]:
# Create a scatter plot between the fitted and actual values of 'SalePrice'
plt.figure(figsize = (8, 8))
sns.scatterplot(x = lr_model_7.fittedvalues, y = df['SalePrice'])
plt.axline((100,100), slope = 1, linestyle = '--', linewidth = 1, color = 'r')
plt.xlabel('Fitted Values of SalePrice')
plt.ylabel('Actual Values of SalePrice');

In [None]:
# Create a scatter plot between the fitted values of 'SalePrice' and the residuals
plt.figure(figsize = (8, 4))
sns.scatterplot(x = lr_model_7.fittedvalues, y = lr_model_7.resid)
plt.axhline(y = 0, xmin = 0, xmax = 1, linewidth = 1, color = 'k')
plt.xlabel('Fitted Values of SalePrice')
plt.ylabel('Residuals');

In [None]:
# Create a histogram of the residuals
plt.figure(figsize = (8, 4))
sns.histplot(x = lr_model_7.resid, color = 'lightgray')
plt.xlabel('Residual Value')
plt.ylabel('Frequency');

In [None]:
# Create a QQ plot of the studentized residuals against the normal distribution
QQ = ProbPlot(lr_model_7.get_influence().resid_studentized_internal)
fig = QQ.qqplot(line = '45', alpha = 0.5, lw = 1)
fig.set_size_inches(5, 5)
fig.gca().set_title('Normal Q-Q')
fig.gca().set_xlabel('Theoretical Quantiles')
fig.gca().set_ylabel('Standardized Residuals');