## Introduction

This in-class example demonstrates how to approach a new data set and conduct a simple data analysis.

What you need to know:  
- The statsmodels and pandas modules in Python
- Theoretical concepts of statistical moments
- Theoretical concepts of the multiple linear regression model

See the list of [references](#References) for details on the concepts and techniques used in this exercise.
***

## Content
- [Load the required modules](#Load-the-required-modules)
- [Data check and summary statistics](#Data-check-and-summary-statistics)
- [Multiple Linear Regression Model](#Multiple-Linear-Regression-Model)
- [References](#References)

***
## Data Description

The data set is contained in a comma-separated values (CSV) file named `hprice1.csv` with column headers.

The variables are described as follows:

| Name | Description |
| :--- | :--- |
| price    | house price, \$1000s |
| assess   | assessed value, \$1000s |
| bdrms    | number of bedrooms |
| lotsize  | size of lot in square feet |
| sqrft    | size of house in square feet |
| colonial | =1 if home is colonial style |
| lprice   | log(price) |
| lassess  | log(assess) |
| llotsize | log(lotsize) |
| lsqrft   | log(sqrft) |

***
## Load the required modules

In [1]:
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

***
## Data check and summary statistics

#### Load the data set

In [2]:
hp_data = pd.read_csv("hprice1.csv")

#### Check if the data is properly imported

In [3]:
hp_data.head()

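| | price | assess | bdrms | lotsize | sqrft | colonial | lprice | lassess | llotsize | lsqrft |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 300.0 | 349.1 | 4 | 6126 | 2438 | 1 | 5.703783 | 5.855359 | 8.720297 | 7.798934 |
| 1 | 370.0 | 351.5 | 3 | 9903 | 2076 | 1 | 5.913503 | 5.86221 | 9.200593 | 7.638198 |
| 2 | 191.0 | 217.7 | 3 | 5200 | 1374 | 0 | 5.252274 | 5.383118 | 8.556414 | 7.225482 |
| 3 | 195.0 | 231.8 | 3 | 4600 | 1448 | 1 | 5.273 | 5.445875 | 8.433811 | 7.277938 |
| 4 | 373.0 | 319.1 | 4 | 6095 | 2514 | 1 | 5.921578 | 5.765504 | 8.715224 | 7.82963 |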
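
Beyond eyeballing the first rows, a few programmatic checks can help confirm that the import worked. The sketch below is an illustration only: it rebuilds the five rows shown above from an inline string (a stand-in for `hprice1.csv`, so the block runs on its own), counts missing values, and checks that each log column matches the log of its raw counterpart. The helper name `check_log_columns` and the tolerance are assumptions, not part of the original exercise.

```python
import io

import numpy as np
import pandas as pd

# First five rows of the data set, copied from the head() output.
# In the notebook, hp_data would come from pd.read_csv("hprice1.csv") instead.
sample_csv = """price,assess,bdrms,lotsize,sqrft,colonial,lprice,lassess,llotsize,lsqrft
300.0,349.1,4,6126,2438,1,5.703783,5.855359,8.720297,7.798934
370.0,351.5,3,9903,2076,1,5.913503,5.86221,9.200593,7.638198
191.0,217.7,3,5200,1374,0,5.252274,5.383118,8.556414,7.225482
195.0,231.8,3,4600,1448,1,5.273,5.445875,8.433811,7.277938
373.0,319.1,4,6095,2514,1,5.921578,5.765504,8.715224,7.82963
"""
sample = pd.read_csv(io.StringIO(sample_csv))


def check_log_columns(df, pairs, atol=1e-4):
    """Return {log_column: bool}, True when the column equals log(raw column)."""
    return {
        log_col: bool(np.allclose(np.log(df[raw_col]), df[log_col], atol=atol))
        for raw_col, log_col in pairs
    }


# Total number of missing cells across all columns.
n_missing = int(sample.isna().sum().sum())

# Each derived column should be the natural log of its raw counterpart.
log_ok = check_log_columns(
    sample,
    [("price", "lprice"), ("assess", "lassess"),
     ("lotsize", "llotsize"), ("sqrft", "lsqrft")],
)

print(sample.dtypes)
print("missing values:", n_missing)
print(log_ok)
```

On the full data set, the same checks would be applied to `hp_data` directly.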


#### Get statistical moments

In [4]:
hp_data.describe()

| | price | assess | bdrms | lotsize | sqrft | colonial | lprice | lassess | llotsize | lsqrft |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 88.0 | 88.0 | 88.0 | 88.0 | 88.0 | 88.0 | 88.0 | 88.0 | 88.0 | 88.0 |
| mean | 293.546034 | 315.736364 | 3.568182 | 9019.863636 | 2013.693182 | 0.693182 | 5.63318 | 5.717994 | 8.905105 | 7.57261 |
| std | 102.713445 | 95.314437 | 0.841393 | 10174.150414 | 577.191583 | 0.463816 | 0.303573 | 0.262113 | 0.54406 | 0.258688 |
| min | 111.0 | 198.7 | 2.0 | 1000.0 | 1171.0 | 0.0 | 4.70953 | 5.291796 | 6.907755 | 7.065613 |
| 25% | 230.0 | 253.9 | 3.0 | 5732.75 | 1660.5 | 0.0 | 5.438079 | 5.53694 | 8.653908 | 7.414873 |
| 50% | 265.5 | 290.2 | 3.0 | 6430.0 | 1845.0 | 1.0 | 5.581613 | 5.670566 | 8.768719 | 7.520231 |
| 75% | 326.25 | 352.125 | 4.0 | 8583.25 | 2227.0 | 1.0 | 5.787642 | 5.863982 | 9.057567 | 7.708266 |
| max | 725.0 | 708.6 | 7.0 | 92681.0 | 3880.0 | 1.0 | 6.586172 | 6.563291 | 11.43692 | 8.263591 |
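
`describe()` reports the mean and standard deviation, which relate to the first two moments, but not skewness or kurtosis. A minimal sketch of how the higher moments could be collected with standard pandas methods follows. It uses only the five sample rows from the `head()` output so that it runs on its own, and the helper name `moment_table` is hypothetical.

```python
import pandas as pd

# Five sample rows taken from the head() output; with the full data set,
# pass hp_data[["price", "sqrft", "bdrms"]] instead.
sample = pd.DataFrame({
    "price": [300.0, 370.0, 191.0, 195.0, 373.0],
    "sqrft": [2438, 2076, 1374, 1448, 2514],
    "bdrms": [4, 3, 3, 3, 4],
})


def moment_table(df):
    """Mean, variance, skewness and excess kurtosis of each column."""
    return pd.DataFrame({
        "mean": df.mean(),
        "variance": df.var(),    # sample variance (divides by n - 1)
        "skewness": df.skew(),   # bias-adjusted sample skewness
        "kurtosis": df.kurt(),   # bias-adjusted excess kurtosis (normal = 0)
    })


moments = moment_table(sample)
print(moments)
```

With only five rows the skewness and kurtosis figures are not informative; the point is the mechanics, which carry over unchanged to the 88 observations.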


#### Create a scatter plot to visualize the data
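
The plotting cell is not shown here, so the following is only one possible sketch, assuming matplotlib is available (it is not among the modules loaded earlier). It plots price against a chosen regressor using the five sample rows from the `head()` output. The function name `scatter_price_vs` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five sample rows taken from the head() output; with the full data set,
# pass hp_data to the function instead.
sample = pd.DataFrame({
    "price": [300.0, 370.0, 191.0, 195.0, 373.0],
    "sqrft": [2438, 2076, 1374, 1448, 2514],
    "bdrms": [4, 3, 3, 3, 4],
})


def scatter_price_vs(df, x_col):
    """Scatter plot of price (vertical axis) against one regressor."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[x_col], df["price"])
    ax.set_xlabel(x_col)
    ax.set_ylabel("price ($1000s)")
    ax.set_title(f"price vs. {x_col}")
    return fig, ax


fig, ax = scatter_price_vs(sample, "sqrft")
```

Calling `scatter_price_vs(hp_data, "bdrms")` would give the corresponding plot for the number of bedrooms.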

***
## Multiple Linear Regression Model

#### Model Estimation by the Ordinary Least Squares (OLS) method

Estimate the model $$price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u,$$
where price is the house price measured in thousands of dollars.

In [8]:
model = smf.ols(formula='price ~ sqrft + bdrms', data = hp_data).fit()
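
For intuition about what the OLS fit computes, the sketch below solves the same least-squares problem directly with NumPy. It is an illustration under stated assumptions, not a description of how statsmodels works internally. It uses only the five sample rows from the `head()` output, so its estimates differ from the full-sample results.

```python
import numpy as np

# Five sample rows taken from the head() output (not the full 88 observations).
price = np.array([300.0, 370.0, 191.0, 195.0, 373.0])
sqrft = np.array([2438.0, 2076.0, 1374.0, 1448.0, 2514.0])
bdrms = np.array([4.0, 3.0, 3.0, 3.0, 4.0])

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones_like(price), sqrft, bdrms])

# OLS picks the coefficients that minimize the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
residuals = price - X @ beta_hat

print("beta_hat (intercept, sqrft, bdrms):", beta_hat)
# First-order conditions: residuals are orthogonal to every column of X,
# so these three numbers are zero up to floating-point error.
print("X'u_hat:", X.T @ residuals)
```

The orthogonality check at the end is the sample counterpart of the OLS first-order conditions: the residuals sum to zero and are uncorrelated with each regressor.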

In [10]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.632
Model:                            OLS   Adj. R-squared:                  0.623
Method:                 Least Squares   F-statistic:                     72.96
Date:                Thu, 02 Oct 2025   Prob (F-statistic):           3.57e-19
Time:                        18:31:13   Log-Likelihood:                -488.00
No. Observations:                  88   AIC:                             982.0
Df Residuals:                      85   BIC:                             989.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -19.3150     31.047     -0.622      0.5

#### Get the estimation results
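
Besides printing the summary table, individual results can be read from the fitted results object. The sketch below shows a few commonly used attributes. To stay self-contained it refits the model on the five sample rows from the `head()` output, so its numbers will not match the full-sample table. The name `sample_model` is hypothetical; in the notebook the same attributes are available on `model`.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Five sample rows taken from the head() output (not the full 88 observations).
sample = pd.DataFrame({
    "price": [300.0, 370.0, 191.0, 195.0, 373.0],
    "sqrft": [2438, 2076, 1374, 1448, 2514],
    "bdrms": [4, 3, 3, 3, 4],
})

sample_model = smf.ols(formula="price ~ sqrft + bdrms", data=sample).fit()

print(sample_model.params)      # coefficient estimates, indexed by term name
print(sample_model.bse)         # standard errors of the estimates
print(sample_model.conf_int())  # 95% confidence intervals by default
print(sample_model.rsquared)    # R-squared
print(sample_model.nobs)        # number of observations used in the fit
```

Reading coefficients from `params` avoids retyping rounded values by hand when they are needed in later calculations.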

#### How would you interpret the results?

1. Write out the results in equation form.

$$\widehat{price} = -19.315 + 0.1284\, sqrft + 15.1982\, bdrms$$

2. What is the estimated increase in price for a house with one more bedroom, holding square footage constant?

\$15,198.20. The coefficient on bdrms is 15.1982 and price is measured in \$1000s.

3. What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer above. 

In [11]:
0.1284*140 + 15.1982*1

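33.1742

The estimated increase is about \$33,174. This is well above the \$15,198 from question 2, because the extra bedroom now also adds 140 square feet to the house instead of holding square footage constant.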

4. What percentage of the variation in price is explained by square footage and number of bedrooms?

About 63.2%, the R-squared of 0.632 reported in the summary table.
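
The summary table also reports an adjusted R-squared of 0.623. As a cross-check, the sketch below applies the standard textbook formula, which penalizes R-squared for the number of regressors, to the values in the table. The function name `adjusted_r_squared` is hypothetical, and the formula is the usual textbook one rather than something stated in this exercise.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Values read from the summary table: R-squared 0.632, 88 observations,
# 2 regressors (sqrft and bdrms), hence 85 residual degrees of freedom.
adj_r2 = adjusted_r_squared(r_squared=0.632, n_obs=88, n_regressors=2)
print(round(adj_r2, 3))  # 0.623
```

The result agrees with the `Adj. R-squared` entry in the table.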

5. The first house in the sample has $sqrft = 2,438$ and $bdrms = 4$. Find the predicted selling price for this house from the OLS regression line.

In [12]:
# The prediction must include the intercept from the summary table
round(-19.315 + 0.1284*2438 + 15.1982*4, 3)

354.517

6. The actual selling price of the first house in the sample was \$300,000 (price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?
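
The residual is the actual price minus the predicted price. A minimal sketch of the calculation follows, using the rounded coefficients from question 1 and the intercept of -19.3150 from the summary table. The function name `predict_price` is hypothetical.

```python
# Rounded OLS estimates: intercept from the summary table,
# slopes as reported in question 1.
INTERCEPT, B_SQRFT, B_BDRMS = -19.315, 0.1284, 15.1982


def predict_price(sqrft, bdrms):
    """Predicted price in $1000s from the estimated regression line."""
    return INTERCEPT + B_SQRFT * sqrft + B_BDRMS * bdrms


actual = 300.0                      # first house in the sample, $1000s
predicted = predict_price(2438, 4)  # sqrft = 2438, bdrms = 4
residual = actual - predicted       # actual minus predicted

print(round(predicted, 3))  # 354.517
print(round(residual, 3))   # -54.517
```

The residual is negative, about -\$54,517, so the house sold for less than the model predicts. Taken at face value this suggests the buyer underpaid, although the model ignores many other features that affect the price.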

***
## References

- Wooldridge, Jeffrey M. (2019). "Introductory Econometrics: A Modern Approach," 7th ed., Chapter 3.

- The pandas development team (2020). "[pandas-dev/pandas: Pandas](https://pandas.pydata.org/)." Zenodo.
    
- Seabold, Skipper, and Josef Perktold (2010). "[statsmodels: Econometric and statistical modeling with python](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)." Proceedings of the 9th Python in Science Conference.