<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_12_LinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter: An Introduction to Linear Regression Using Boston Housing Data

Welcome to the world of **linear regression**, a fundamental concept in the realm of data science and statistics! Imagine you're a detective trying to unravel mysteries hidden within numbers. Linear regression is one of your most reliable tools in this quest.

In this chapter, we'll explore linear regression using a real-world dataset: the **Boston Housing Dataset**. This dataset, like a treasure map, contains information about neighborhoods (census tracts) in the Boston area, including the median home value, the average number of rooms, the age of the housing stock, and more. By analyzing this data, we can start to understand the relationship between these features and how they affect house prices.

At its core, **linear regression** is a way to predict a value based on one or more input values. Think of it like trying to draw a straight line through a set of points on a graph so that the line is as close as possible to all the points. This line helps us understand relationships between variables. For example, how does the size of a house (measured in rooms) relate to its price?

- The **Dependent Variable** is what we want to predict. In our case, it's the house price.
- **Independent Variables** are the factors we think might affect our dependent variable. For the Boston Housing data, this could be things like the number of rooms, age of the house, etc.

We'll be using a Python library called **statsmodels** to perform our linear regression analysis. It is a powerful tool that makes it easier to explore data, estimate statistical models, and perform tests.

After we've created our model, we'll want to know how good it is. For this, we use model metrics like:

- The **R-squared** tells us how well our line fits the data. A higher R-squared value means a better fit.
- The **Coefficients** tell us how much our dependent variable (house price) is expected to change with a one-unit change in an independent variable (like the number of rooms).

Some questions we'll be exploring include the following:
1. How does the number of rooms in a house relate to its price?
2. Are there any surprising factors that affect the price of a house in Boston?
3. How well does our linear regression model predict house prices?
4. What does the R-squared value tell us about our model?
5. How can we interpret the coefficients of our model in the context of house pricing?

By the end of this chapter, you'll not only understand the basics of linear regression but also how to apply it to real-world data. Let's dive in and uncover the stories hidden in the Boston Housing dataset!


## Loading and Exploring Boston Housing Data
To start, let's load the Boston housing data and take a look at it. A quick look at the data dictionary (available in many places online) reveals the following columns (in the copy we load below, the names are lowercase and `B` appears as `black`):

```
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million). Historically, this has often been used as a target variable.
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000's. This is our "target" variable.
```

Here, our task will be to predict house price based on some of these other variables.

In [1]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd



In [12]:
# Load the data and show the first five rows
boston_df = data('Boston')
boston_df.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [15]:
# Get basic info on the data set
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  black    506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB


We can see here that the dataset comprises 506 entries, indexed from 1 to 506. Each entry describes the housing and neighborhood characteristics of one area (a census tract) in the Boston region. A few other things stand out:
-   There are 14 columns in total, each representing a different attribute of the housing or area. These are the same columns shown in the first five rows above.
-   The data types (`Dtype`) of these columns are either `float64` or `int64`. The `float64` columns (like `crim`, `zn`, `indus`, etc.) hold numeric data with decimal points, whereas the `int64` columns (`chas`, `rad`, `tax`) hold integer values.
-   Importantly, each column has 506 non-null values, indicating there are no missing values in this dataset. This is crucial for analysis as it means the data is complete, and no imputation or handling of missing data is needed before proceeding with linear regression analysis.

This dataset is well-structured and complete, making it an excellent candidate for regression analysis without the need for preliminary cleaning or data imputation steps. The diverse range of variables, both in type and measurement scale, provides a rich context for exploring linear relationships and building a regression model.
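As a quick illustration of the missing-value check described above, the sketch below rebuilds a tiny DataFrame from three columns of the five rows printed earlier and counts the missing entries per column. The name `sample_df` is made up for this example; on the full data you would call the same methods on `boston_df`.

```python
import pandas as pd

# Three columns of the first five rows, copied from the head() output above.
# `sample_df` is a hypothetical name used only for this illustration.
sample_df = pd.DataFrame(
    {
        "rm": [6.575, 6.421, 7.185, 6.998, 7.147],
        "lstat": [4.98, 9.14, 4.03, 2.94, 5.33],
        "medv": [24.0, 21.6, 34.7, 33.4, 36.2],
    },
    index=[1, 2, 3, 4, 5],
)

# Count missing values in each column (all zeros means nothing to impute)
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Confirm that every column is numeric, which regression requires
print(sample_df.dtypes)
```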

## What is Linear Regression?

Linear Regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The "linear" in linear regression refers to the fact that this relationship takes the form of a straight line.

In its simplest form, with one independent variable, the linear regression model can be represented as:

$$ y = \beta_0 + \beta_1x + \epsilon $$

Here:
- $y$ is the dependent variable we're trying to predict or explain.
- $x$ is the independent variable we're using for prediction.
- $\beta_0$ is the y-intercept of the regression line.
- $\beta_1$ represents the slope of the regression line, indicating how much $y$ changes for a unit change in $x$.
- $\epsilon$ represents the error term, accounting for the variability in $y$ that cannot be explained by $x$.
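To make the pieces of this equation concrete, here is a minimal sketch that generates artificial data from the model and then estimates the line back from the data. The "true" values of $\beta_0$ and $\beta_1$, the noise level, and the sample size are all assumptions chosen for illustration; none of them come from the Boston data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed "true" parameters for this simulation (illustrative only)
beta_0 = 2.0   # intercept
beta_1 = 3.0   # slope

x = rng.uniform(0, 10, size=200)              # independent variable
epsilon = rng.normal(0.0, 1.0, size=200)      # error term
y = beta_0 + beta_1 * x + epsilon             # the model equation

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)

print(f"estimated intercept: {intercept_hat:.2f}")
print(f"estimated slope:     {slope_hat:.2f}")
```

Because of the error term, the estimates land close to the assumed parameters but not exactly on them.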

Linear regression can help us to do two important things:
1. We can **predict outcomes** when we have a continuous outcome variable (like house prices, temperatures, etc.) and one or more predictor variables.
2. We can use linear regression to **explain** and understand the strength and direction (positive or negative) of the relationship between variables.

Linear regression is often the first (and sometimes only) technique used for these tasks, despite the many variations and alternatives that are available. This is for a few reasons.

1. Linear regression models are straightforward to understand and interpret, making them a popular choice for many practical applications.
2. Linear regression forms the basis for understanding more complex models in statistics and machine learning. These often can't be understood without linear regression as a "baseline."

Linear regression is a powerful tool in statistics and data science, valued for its simplicity, interpretability, and versatility in various applications.



## Some Terminology: Models, Fitting, Metrics
A **model** is a mathematical representation that describes the relationship between variables. Data scientists use many models in their work, with linear regression being one of the most important. In linear regression, the model is a line that best fits the observed data. In its simplest form, a linear regression model with one independent variable is:

$$y = \beta_0 + \beta_1x$$

as explained above. When we create models we *expect* that they will be imperfect (that is, we will never perfectly predict the dependent variable).

**Fitting** a model means finding the line that best represents the relationship between the independent and dependent variables.

- Technically, it involves calculating the intercept and slope that minimize the sum of the squared differences (the *residuals*) between the predicted and actual values of the dependent variable.
- This procedure, which minimizes the sum of squared residuals, is known as **Ordinary Least Squares (OLS)**.
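For a single independent variable, the least-squares slope and intercept have a simple closed form: the slope is $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and the intercept is $\bar{y} - \text{slope} \cdot \bar{x}$. The sketch below applies these formulas to a small made-up dataset (the numbers and function names are hypothetical, not taken from the Boston data) and checks that nudging the line away from the fitted values increases the sum of squared residuals.

```python
import numpy as np

# Hypothetical toy data for illustration (not from the Boston dataset)
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([11.0, 18.0, 24.0, 33.0, 39.0])

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

def sum_squared_residuals(intercept, slope, x, y):
    """Sum of squared differences between actual and predicted values."""
    predictions = intercept + slope * x
    return np.sum((y - predictions) ** 2)

intercept, slope = fit_simple_ols(x, y)
best_ssr = sum_squared_residuals(intercept, slope, x, y)

# Any other line, such as this slightly nudged one, has a larger sum
nudged_ssr = sum_squared_residuals(intercept + 0.5, slope - 0.1, x, y)

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"SSR at fitted line: {best_ssr:.3f}, SSR at nudged line: {nudged_ssr:.3f}")
```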

Once we have fit the model, we can **evaluate it** by assessing its effectiveness and accuracy in describing the relationship between variables or in making predictions.

- Key metrics for evaluation include R-squared, the F-statistic, p-values, and others.
- The goal is to understand the model's strengths and limitations in representing the data and its predictive capabilities.
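As one example of an evaluation metric, R-squared can be computed by hand as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squared deviations of $y$ from its mean. The sketch below does this for a small made-up dataset (hypothetical numbers, not the Boston data), using `np.polyfit` to fit the line.

```python
import numpy as np

# Hypothetical toy data for illustration (not from the Boston dataset)
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([11.0, 18.0, 24.0, 33.0, 39.0])

# Fit a straight line; polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(x, y, deg=1)
predictions = intercept + slope * x

ss_res = np.sum((y - predictions) ** 2)   # variation the line does not explain
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
r_squared = 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared correlation of x and y
squared_correlation = np.corrcoef(x, y)[0, 1] ** 2

print(f"R-squared: {r_squared:.4f}")
print(f"Squared correlation: {squared_correlation:.4f}")
```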

### Syntax for Creating a Linear Regression Model Using Statsmodels

To create a linear regression model using `statsmodels`, you typically follow these steps:

```python
import statsmodels.api as sm

# Step 1: Prepare your data
# X = ... (independent variables)
# Y = ... (dependent variable)

# Step 2: Add a constant to the independent variables
X = sm.add_constant(X)

# Step 3: Create a model
model = sm.OLS(Y, X)

# Step 4: Fit the model
result = model.fit()

# Step 5: Get the summary of the model
print(result.summary())
```
The steps here are:

1. First, import `statsmodels.api`.
2. Define your independent variables (`X`) and dependent variable (`Y`).
3. `statsmodels` doesn't automatically include the intercept ($\beta_0$ in the equation), so you need to add a "constant" manually.
4. Create an Ordinary Least Squares (OLS) model. OLS is the standard method for fitting a linear regression.
5. Fit the model to your data.
6. The summary provides detailed results including coefficients, R-squared value, p-values, etc.
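To see why the constant in step 3 matters, the sketch below imitates what `sm.add_constant` does (it adds a column of ones to the independent variables) using plain NumPy rather than `statsmodels` itself. The toy data and variable names are hypothetical. Without the column of ones, the fitted line is forced through the origin, which changes the slope.

```python
import numpy as np

# Hypothetical toy data for illustration (not from the Boston dataset)
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([11.0, 18.0, 24.0, 33.0, 39.0])

# With a constant: a column of ones lets the model estimate an intercept
X_with_const = np.column_stack([np.ones_like(x), x])
coef_with_const, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

# Without a constant: the line must pass through the origin
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

print(f"with constant:    intercept={coef_with_const[0]:.3f}, slope={coef_with_const[1]:.3f}")
print(f"without constant: slope={coef_no_const[0]:.3f}")
```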

### Example: Using Number of Rooms to Predict Price
Let's start with a simple example: using the average number of rooms to predict price.

In [16]:
import statsmodels.api as sm

# Independent variable: Average number of rooms
X = boston_df['rm']

# Dependent variable: Median value of owner-occupied homes
Y = boston_df['medv']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Create an OLS model
model = sm.OLS(Y, X)

# Fit the model
result = model.fit()

# Print the summary of the model
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     471.8
Date:                Fri, 17 Nov 2023   Prob (F-statistic):           2.49e-74
Time:                        16:12:30   Log-Likelihood:                -1673.1
No. Observations:                 506   AIC:                             3350.
Df Residuals:                     504   BIC:                             3359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -34.6706      2.650    -13.084      0.0

### Making Sense of the Results
So, what does this all mean? While the summary above gives a LOT of information, we can pay attention to a few numbers in particular.

#### R-squared (0.484)
**R-squared** (and **Adjusted R-Squared**) is a statistical measure that represents the proportion of the variance for the dependent variable (medv) that's explained by the independent variable(s) in the model. An R-squared of 0.484 means that 48.4% of the variance in the median value of homes can be explained by the model.
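Adjusted R-squared applies a small penalty for the number of predictors: $1 - (1 - R^2)(n - 1)/(n - k - 1)$, where $n$ is the number of observations and $k$ the number of independent variables. As a rough check, the sketch below plugs in the values from the summary above (the function name is made up for this example).

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size, and predictor count."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values from the summary above: R-squared 0.484, 506 observations, 1 predictor
adj_r2 = adjusted_r_squared(0.484, n_obs=506, n_predictors=1)
print(f"Adjusted R-squared: {adj_r2:.3f}")
```

With only one predictor and 506 observations the penalty is tiny, which is why the two values in the summary are nearly identical.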

#### F-statistic (471.8)
The **F-statistic** tests the overall significance of the model: it compares your linear model against a model with no independent variables (an intercept only). A large value (471.8 in this case) indicates that the model explains far more of the variance in `medv` than the intercept-only model does.

#### Prob (F-statistic) (2.49e-74)
This **p-value** is the probability of observing an F-statistic at least as large as the one we got, assuming that the null hypothesis (that a model with no independent variables fits just as well) is true. A very small value (close to zero) indicates that the model as a whole is **statistically significant**.
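The F-statistic can be recovered from R-squared, the number of observations $n$, and the number of predictors $k$, using $F = (R^2 / k) / ((1 - R^2)/(n - k - 1))$. The sketch below plugs in the values for this model. It uses the more precise R-squared of 0.4835 that appears in the R output later in this chapter, since the rounded 0.484 would shift the result slightly. The function name is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of a regression, computed from its R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

# R-squared 0.4835, 506 observations, 1 predictor (rm)
f_value = f_statistic(0.4835, n_obs=506, n_predictors=1)
print(f"F-statistic: {f_value:.1f}")
```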

#### Coefficients for `rm` (9.1021) and `const` (-34.6706)
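The **coefficient** for `rm` tells us how much the dependent variable (`medv`) is expected to increase when `rm` increases by one unit. Because `medv` is measured in thousands of dollars, a coefficient of 9.1021 means that each additional room is associated with an increase of roughly $9,100 in median home value. Similarly, `const` is the y-intercept, the expected value of `medv` when all independent variables are 0. Here it is negative (-34.6706), which has no practical meaning on its own, since no area has homes with zero rooms. It simply positions the line.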
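To make this interpretation concrete, the sketch below plugs the two coefficients into the regression equation to produce predictions. The function and variable names are made up for this example; the coefficient values are copied from the summary above.

```python
# Coefficients copied from the regression summary above
INTERCEPT = -34.6706   # const
SLOPE_RM = 9.1021      # rm

def predict_medv(rooms):
    """Predicted median home value (in $1000s) for an average room count."""
    return INTERCEPT + SLOPE_RM * rooms

six_rooms = predict_medv(6)
seven_rooms = predict_medv(7)

print(f"6 rooms: {six_rooms:.3f}")
print(f"7 rooms: {seven_rooms:.3f}")
print(f"difference: {seven_rooms - six_rooms:.4f}")
```

Under this model, an area averaging six rooms per dwelling has a predicted median value of about 19.9 (roughly $19,900), and moving to seven rooms adds exactly the `rm` coefficient.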

#### P>|t| (for `rm`: < 0.001)
This is the p-value associated with the coefficient of `rm`. A p-value is the probability of observing a test statistic at least as extreme as the one we got, assuming that the null hypothesis (the coefficient is zero) is true. A small p-value (typically < 0.05) indicates that the variable is statistically significant in predicting `medv`.
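The t-statistic behind this p-value is just the coefficient divided by its standard error. The sketch below computes it for `rm` using the estimate and standard error reported in the R output later in this chapter. The p-value here uses a normal approximation, which is an assumption made to keep the example free of extra libraries; regression software uses the t-distribution with the residual degrees of freedom instead, though with 504 degrees of freedom the two are very close.

```python
import math

# Estimate and standard error for `rm`, as reported in the R output
coef_rm = 9.102
std_err_rm = 0.419

# t-statistic: how many standard errors the coefficient is from zero
t_stat = coef_rm / std_err_rm

# Two-sided p-value under a normal approximation (an assumption for this sketch)
p_value_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t-statistic: {t_stat:.2f}")
print(f"approximate p-value: {p_value_approx:.3g}")
```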

#### Summary
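The results above suggest that the average number of rooms is, in fact, a useful predictor of price: the relationship is statistically significant, and `rm` alone explains roughly 48% of the variance in `medv`. That still leaves about half of the variance unexplained, so other factors clearly matter too.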


### Linear Regression in R
In R, we can do linear regression in a very similar way.

```r
# Load the necessary library
library(MASS) # Contains the Boston dataset

# Load the Boston Housing dataset
data(Boston)

# Create the linear regression model
# Here, 'medv' is the dependent variable, and 'rm' is the independent variable
fit <- lm(medv ~ rm, data=Boston)

# Print the summary of the model
# This will provide detailed statistics about the model's performance
summary(fit)
```

Here's what the output looks like in R.

In [17]:
# Emulating R code in a Python environment

import rpy2.robjects as robjects
robjects.r('library(MASS)')
robjects.r('data(Boston)')
robjects.r('fit <- lm(medv ~ rm, data=Boston)')
print(robjects.r('summary(fit)'))



Call:
lm(formula = medv ~ rm, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.346  -2.547   0.090   2.986  39.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -34.671      2.650  -13.08   <2e-16 ***
rm             9.102      0.419   21.72   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared:  0.4835,	Adjusted R-squared:  0.4825 
F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16




### Exercise: You Try It
Run a linear regression using a variable OTHER than rooms.

In [None]:

# Independent variable: Something other than rooms!
# X = boston_df['rm']
# TODO: set X to a different column (otherwise X from the previous example is reused)


# Dependent variable: Median value of owner-occupied homes
Y = boston_df['medv']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Create an OLS model
model = sm.OLS(Y, X)

# Fit the model
result = model.fit()

# Print the summary of the model
print(result.summary())

### Question: Making Sense of Regression
Now, explain the results using the concepts we learned above. How does your model compare to the one that used rooms?

### My Answer: