# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model by using statistical performance metrics pertaining to the overall model and specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log transform and scale the selected continuous features
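
One way to judge whether a log transformation is worthwhile is to compare the skewness of a feature before and after taking the log. The sketch below uses synthetic, right-skewed data as a stand-in for a feature such as `LotArea` (it does not read the Ames file), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a feature such as LotArea (not the Ames data)
rng = np.random.default_rng(0)
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000), name="LotArea")

# Skewness near 0 suggests a roughly symmetric distribution
skew_raw = lot_area.skew()
skew_log = np.log(lot_area).skew()

# Standardize the logged feature: subtract the mean, divide by the standard deviation
logged = np.log(lot_area)
standardized = (logged - logged.mean()) / logged.std()

print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
print(f"standardized mean: {standardized.mean():.2f}, std: {standardized.std():.2f}")
```

If the skewness drops noticeably after the log, the transformation is probably helping; if it barely changes, it may not be needed for that feature.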

In [None]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [None]:
# Log transform and normalize
# (ames, continuous, and categoricals are already defined in the cell above)
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Log transform continuous features
ames_log = ames.copy()
ames_log[continuous] = np.log1p(ames_log[continuous])

# Scale the log-transformed continuous features
scaler = StandardScaler()
ames_log[continuous] = scaler.fit_transform(ames_log[continuous])


## Categorical Features

In [None]:
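# One hot encode categoricals, dropping the first level of each as the reference category
encoder = OneHotEncoder(drop='first')
cat_transformed = encoder.fit_transform(ames_log[categoricals])
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is its replacement
cat_columns = encoder.get_feature_names_out(categoricals)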
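
To see what `drop='first'` does, here is a minimal sketch on a toy column (the values are made up, not taken from the Ames data). The encoder sorts the categories and drops the first one, which then acts as the reference level represented by all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, not taken from the Ames dataset
toy = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Gd"]})

encoder_demo = OneHotEncoder(drop="first")
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder_demo.fit_transform(toy).toarray()

# Categories are sorted, so "Ex" comes first and is the dropped reference level
print(encoder_demo.categories_[0])
# Column labels are written by hand here, following the order of the remaining categories
print(pd.DataFrame(encoded, columns=["KitchenQual_Gd", "KitchenQual_TA"]))
```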


## Combine Categorical and Continuous Features

In [None]:
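# Combine the scaled continuous features and the one-hot encoded categoricals
# into a single DataFrame. Keep only the columns of interest so that unrelated
# columns from the raw data do not end up in the model.
# The encoder returns a sparse matrix, so convert it to a dense array first.
cat_df = pd.DataFrame(cat_transformed.toarray(), columns=cat_columns, index=ames_log.index)
ames_preprocessed = pd.concat([ames_log[continuous], cat_df], axis=1)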


## Run a linear model with SalePrice as the target variable in statsmodels

In [None]:
# Your code here
import statsmodels.api as sm

# Prepare the target variable and predictors
X = ames_preprocessed.drop('SalePrice', axis=1)
y = ames_preprocessed['SalePrice']

# Add constant to the predictors matrix
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())


## Run the same model in scikit-learn

In [None]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression

# Fit the model in scikit-learn. LinearRegression fits its own intercept,
# so leave out the constant column that was added for statsmodels.
X_sklearn = X.drop('const', axis=1)
model_sklearn = LinearRegression()
model_sklearn.fit(X_sklearn, y)

# Print both intercepts
print("Intercept (Statsmodels):", model.params['const'])
print("Intercept (Scikit-Learn):", model_sklearn.intercept_)

# Compare coefficients side by side, excluding the statsmodels constant
coefficients = pd.DataFrame({
    'Variable': X_sklearn.columns,
    'Coefficient (Statsmodels)': model.params.drop('const').values,
    'Coefficient (Scikit-Learn)': model_sklearn.coef_
})
print(coefficients)


## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [None]:
# Your code here - predict the house price given the following characteristics
# Create a new data point for prediction
new_data = pd.DataFrame({
    'LotArea': [14977],
    '1stFlrSF': [1976],
    'GrLivArea': [1976],
    'BldgType': ['1Fam'],
    'KitchenQual': ['Gd'],
    'SaleType': ['New'],
    'MSZoning': ['RL'],
    'Street': ['Pave'],
    'Neighborhood': ['NridgHt']
})

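# Log transform and scale the continuous predictors.
# The scaler was fit on all four continuous columns (including SalePrice),
# so apply its stored means and scales to the three predictors by hand.
predictors = continuous[:-1]
new_cont = (np.log1p(new_data[predictors]) - scaler.mean_[:-1]) / scaler.scale_[:-1]

# One hot encode categorical features with the already-fitted encoder
new_cat = pd.DataFrame(encoder.transform(new_data[categoricals]).toarray(),
                       columns=encoder.get_feature_names_out(categoricals))
new_data_processed = pd.concat([new_cont, new_cat], axis=1)

# Add the intercept column explicitly: on a single row, sm.add_constant can skip it
# because every column of a one-row frame already looks constant
new_data_processed.insert(0, 'const', 1.0)
# Match the column order used to fit the model
new_data_processed = new_data_processed[model.params.index]

# Predict using the statsmodels model; the result is log-transformed, standardized SalePrice
predicted_scaled = model.predict(new_data_processed)

# Undo the standardization, then the log1p transform, to get back to dollars
predicted_log = predicted_scaled.iloc[0] * scaler.scale_[-1] + scaler.mean_[-1]
predicted_price = np.expm1(predicted_log)

print("Predicted SalePrice:", predicted_price)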
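
Because the target was log-transformed and standardized before fitting, a raw prediction from the model is in those transformed units and has to be mapped back to dollars. The round trip below is a small sketch with made-up prices (not Ames values) showing that unscaling first and then applying `np.expm1` recovers the original numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sale prices, for illustration only
prices = np.array([[120000.0], [185000.0], [250000.0], [410000.0]])

# Forward direction: log1p, then standardize
price_scaler = StandardScaler()
scaled_log_prices = price_scaler.fit_transform(np.log1p(prices))

# Reverse the two steps in the opposite order: unscale first, then undo log1p
recovered = np.expm1(price_scaler.inverse_transform(scaled_log_prices))
print(recovered.ravel())
```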


## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!