<a href="https://colab.research.google.com/github/christophermalone/stat360/blob/main/Handout6_SimpleLinearRegression_PartB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Handout #6 - Part B : Understanding a Standard Error

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example 6.1

Consider data on home prices in La Crosse, WI and Winona, MN.  These data were collected from the Redfin website. 
<table>
  <tr>
    <td width='50%'>
      <ul>
        <li><strong>Response Variable</strong>: PRICE ($) </li><br>
        <li>Variables under investigation (i.e. independent variables)</li>
        <ul>
          <li>SQUAREFEET, the size of the home (ft^2)</li>
          <li>BEDS, number of bedrooms in home</li>
          <li>BATHS, number of bathrooms in home</li>
          <li>LOTSIZE, the size of the lot (ft^2)</li>
          <li>YEARBUILT, the year in which the home was built</li>
         </ul>
    </ul>
    </td>
    <td width='50%'>
<p align='center'><img src="https://drive.google.com/uc?export=view&id=1KiZ5CvmWwvDg4HSPX7FwRsQTUMtvv0gG" width='50%' height='50%'></img></p>
  </td>
</tr>
</table>

Data Folder: [OneDrive](https://mnscu-my.sharepoint.com/:f:/g/personal/aq7839yd_minnstate_edu/EmOQfrwxzzRBqq8PH_8qTmMBy-1qKgM11Hb8vzjs025EEA?e=wyShYs)

Redfin Data: <a href="https://www.redfin.com/city/10404/WI/La-Crosse">La Crosse WI</a> | <a href="https://www.redfin.com/city/18151/MN/Winona">Winona MN</a>

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>



## Load Data into R via URL

The tidyverse R package will be used to read the dataset into the current R session.

In [34]:
#load tidyverse package
library(tidyverse)

The **read_csv()** function is used to read in the dataset. 

In [None]:
# Read the data with read_csv() from the readr package (part of the tidyverse)
LaCrosseWinonaHomePrices <- read_csv("http://www.StatsClass.org/stat360/Datasets/LaCrosse_Winona_Redfin.csv")

In [None]:
# quick look at the data
head(LaCrosseWinonaHomePrices)

##  Importance of Understanding Variation

**Inferential methods** are an important component of many statistical investigations.  They permit us to draw conclusions about a population based on a sample.  Consider the following depictions, where interest lies in home prices. 

<table border="0" align="center">
<tr>
  <td width="50%" align="center" valign="bottom" bgcolor="white"><font size="+2">Population Side</font><br>True Home Price - All Homes<br><img src="https://drive.google.com/uc?export=view&id=1A08uXuaBX1rz3L3fNJ4VIOMEKF2idk_4"></img>
  </td>
  <td width="50%" align="center" valign="bottom" bgcolor="white"><font size="+2">Sample Side</font><br>An Estimate of the Home Price<br><img src="https://drive.google.com/uc?export=view&id=1L3_bHMKy3cuiRjNvuqpAuojhLhAKs0-C"></img>
  </td>
</table>

The goal is to use information from the Sample side to draw inferences about the Population side.  An inherent problem in doing this is that the information on the sample side varies from sample-to-sample.  Thus, an understanding of the inherent variation from sample-to-sample is essential to inferential methods. 

<table border="0" align="center">
<tr>
  <td align="center" valign="center" bgcolor="white"><font size="+2">Outcomes Vary Over Repeated Samples</font><br><br><br><br><img src="https://drive.google.com/uc?export=view&id=1Qs2nM_QTh_wezJX6_X8hrh0WqPl8-XNg" width="50%" height="50%"></img>
  </td>
</table>

The inherent variation from sample-to-sample can be measured in two ways.
1.   Inherent variation can be determined via statistical theory
2.   Inherent variation can be determined via simulation (bootstrap)  

#### Inherent Variation in the Mean via Statistical Theory

The Central Limit Theorem states the following:

>   If a random variable, say $Y$, follows a normal distribution with $Mean = \mu$ and $Variance = \sigma^2$, which is often expressed as $Y \sim N(\mu, \sigma^2)$

>  then the distribution of the average $Y$, say $\bar{Y}$, is known to:
1.  Follow a normal distribution
2.  Have the same mean, so $E(\bar{Y}) = \mu$, and 
3.  Have a reduced variance equal to $Var(\bar{Y}) = \frac{\sigma^2}{n}$

<u>Comments</u>:

*    In shorthand notation, if $Y \sim N(\mu, \sigma^2)$, then $\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})$
*    Normality of $Y$ is required for the distribution of $\bar{Y}$ to be exactly normal; however, when the normality condition on $Y$ is relaxed, the distribution of $\bar{Y}$ is at least *approximately* normal when the sample size $n$ is sufficiently large (assuming $Y$ has finite variance)
*    The **standard error** is the standard deviation of the distribution of $\bar{Y}$; thus, the standard error for an average is 

$$\begin{array}{rcl}
\mbox{Standard Error of } \bar{Y} & = & \sqrt{Var(\bar{Y})} \\
& = & \sqrt{\frac{\sigma^2}{n}} \\ 
& = & \frac{\sigma}{\sqrt{n}}
\end{array}
$$


The following code snippet computes the relevant quantities for the distribution of $\overline{PRICE}$.

In [None]:
(LaCrosseWinonaHomePrices
  %>% summarize(
                 Mean = mean(PRICE),
                 Var = var(PRICE),
                 StdDev = sd(PRICE),
                 Count = n(),
                 StdError = sd(PRICE)/sqrt(n())
  )
)
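
As a quick check on the formula $\sigma/\sqrt{n}$, the following sketch draws repeated samples from a simulated normal population and compares the standard deviation of the sample means to the theoretical standard error. The population values ($\mu = 250000$, $\sigma = 100000$), the sample size, and the object names are assumptions chosen only for illustration; they are not taken from the Redfin data.

```r
# Hypothetical population parameters (illustration only, not from the Redfin data)
set.seed(360)
mu    <- 250000
sigma <- 100000
n     <- 50

# Draw 5000 samples of size n and keep the mean of each sample
sample_means <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))

# Standard error from theory versus the standard deviation of the simulated means
theory_se    <- sigma / sqrt(n)
simulated_se <- sd(sample_means)

c(Theory = theory_se, Simulated = simulated_se)
```

The two values should be close, and the agreement tends to improve as the number of simulated samples grows.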

#### Inherent Variation in the Mean via Simulation

Efron (1979) developed the **bootstrap**, a simulation-based technique that allows for the investigation of the sampling distribution of almost any statistic using random sampling methods.  When using the bootstrap approach, the original sample is treated as a pseudo-population.  A random sample is taken with replacement from the original sample.  The statistic of interest is computed using this random sample and its value is retained.  This process is repeated a total of $b$ times.

Wiki Bootstrap: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Step 1: Take a random sample (with replacement) from the original sample.  Compute the statistic of interest and retain its value.

<table border="0" align="center">
<tr>
  <td align="center" valign="center" bgcolor="white"><img src="https://drive.google.com/uc?export=view&id=1giLIQDYoEOpyjFg5FdPOhRmEDZ7XpuYr"></img>
  </td>
</table>

Step 2:  Take another random sample (with replacement) from the original sample.  Compute the statistic of interest and retain its value.
<table border="0" align="center">
<tr>
  <td align="center" valign="center" bgcolor="white"><img src="https://drive.google.com/uc?export=view&id=1MzAkLX9U6Ww0dgcZWsVRyrFeRWPPG4CE"></img>
  </td>
</table>

Step 3:  Repeat Step 2 a total of $b = 100$ times.

<table border="0" align="center">
<tr>
  <td align="center" valign="center" bgcolor="white"><img src="https://drive.google.com/uc?export=view&id=14DfVwckWsNoCv4phm-NPR2AQAtTWzLwW"></img>
  </td>
</table>
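
The three steps above can be sketched in a few lines of R. The vector `toy_prices` and the choice of the median as the statistic are hypothetical, used only to show that the same recipe applies to statistics other than the mean.

```r
# Hypothetical sample of home prices (illustration only)
set.seed(360)
toy_prices <- c(120000, 135000, 150000, 162000, 175000,
                189000, 210000, 245000, 310000, 425000)

b <- 100  # number of bootstrap iterations

# Steps 1-3: resample with replacement, compute the statistic, repeat b times
boot_medians <- replicate(
  b,
  median(sample(toy_prices, size = length(toy_prices), replace = TRUE))
)

# The retained values form the bootstrap distribution of the median
length(boot_medians)
summary(boot_medians)
```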

In [None]:
#@title Plot of the Original Data
# Rug Plot of Original Data
ggplot(data=LaCrosseWinonaHomePrices, aes(x=PRICE)) + 
  geom_rug() + 
  xlim(0,600000) +
  ggtitle("Plot of Original Data") + 
  xlab("Home Price")+
  theme_classic()

In [None]:
#@title Plot of Single Resample
#Get a single resample and put the outcomes into a data.frame
Resample = sample(LaCrosseWinonaHomePrices$PRICE,size=length(LaCrosseWinonaHomePrices$PRICE),replace=TRUE)
HomePrice_Resample <- data.frame(Resample)

#Create a plot of the single resample
ggplot(data=HomePrice_Resample, aes(x=Resample)) + 
  geom_rug() + 
  xlim(0,600000) +
  ggtitle("Plot of Single Resample of Original Data") + 
  xlab("Home Price - Resample") + 
  theme_classic()

#Create a plot of the mean of the single resample
ggplot(data=(HomePrice_Resample %>% summarize(MeanResample = mean(Resample))), aes(x=MeanResample)) + 
  geom_rug() + 
  xlim(0,600000) +
  ggtitle("Plot of the Mean from the Resample Data") + 
  xlab("") + 
  theme_classic()


The following custom function allows us to bootstrap the average.

In [65]:
#@title Custom Bootstrap Average Function
# Custom function to bootstrap the mean
BootMean=function(y, b=100, plotit = FALSE){
  #Inputs
   #  y: the vector to which the bootstrap will be applied
   #  b: number of bootstrap iterations
   #  plotit:  logical for plotting bootstrap outcomes
  #Outputs
   #  Outcomes_DF: data.frame containing the bootstrap mean from each iteration
   #  If plotit = TRUE: rug plot of bootstrap distribution


   output.vec=rep(0,b)
   for(i in 1:b){
      ystar=sample(y,size=length(y),replace=TRUE)
      output.vec[i]=mean(ystar)
   }
   Outcomes_DF <- data.frame(Outcomes = output.vec)

   if(plotit == TRUE){
     myplot <- ggplot(data=Outcomes_DF, aes(x=Outcomes)) + 
               geom_rug() + 
               xlab("Outcomes") +
               theme_classic()
    print(myplot)
   }
   return(Outcomes_DF)
}


The following code snippet uses the **BootMean()** custom function to bootstrap the average PRICE.  A total of $b=20$ repeated bootstrap outcomes will be obtained. 

In [None]:
#Use the BootMean() function to obtain 20 repeated bootstrap outcomes
# plotit = TRUE will plot the bootstrap outcomes
Mean_RepeatedSampling <- BootMean(LaCrosseWinonaHomePrices$PRICE, b=20, plotit=TRUE)

Next, the outcomes from the bootstrap distribution are plotted.  An empirical density smoother is added to the rug plot.  In addition, the normal curve implied by statistical theory is included (green) for reference.

In [None]:
#Plotting the bootstrap resampling distribution for the average price
ggplot(data=Mean_RepeatedSampling, aes(x=Outcomes)) + 
  geom_rug() + 
  geom_density(adjust=2) +
  stat_function(fun = dnorm, args = list(mean = mean(LaCrosseWinonaHomePrices$PRICE), sd = (sd(LaCrosseWinonaHomePrices$PRICE)/sqrt(length(LaCrosseWinonaHomePrices$PRICE)))), color="darkgreen") + 
  xlim(0,600000) +
  #xlim(225000,350000) +
  ggtitle("Bootstrap Distribution of the Sample Mean") + 
  xlab("Mean PRICE") + 
  theme_classic()


The **standard error** of a statistic can be estimated by computing the standard deviation of the bootstrap distribution. The following code computes an estimate of the standard error using the bootstrap distribution. 

In [None]:
cat("\nBootstrap Standard Error Estimate:\n\n")
(Mean_RepeatedSampling
  %>% summarize(
                 StdError = sd(Outcomes)
  )
)
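
To see how the bootstrap estimate relates to the theory-based value, the sketch below computes both on a simulated sample. The data are generated with `rnorm()` and all object names are assumptions for illustration; with a larger $b$ than the 20 used above, the two values tend to agree closely.

```r
# Simulated sample standing in for PRICE (illustration only)
set.seed(360)
y <- rnorm(100, mean = 250000, sd = 100000)

# Bootstrap: standard deviation of b resampled means
b <- 2000
boot_means <- replicate(b, mean(sample(y, size = length(y), replace = TRUE)))
boot_se <- sd(boot_means)

# Theory: s / sqrt(n)
theory_se <- sd(y) / sqrt(length(y))

c(Bootstrap = boot_se, Theory = theory_se)
```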

The methodology for constructing a 95% Confidence Interval will be discussed in detail later in this course.  The 95% Confidence Intervals for the Population Mean are provided here simply for comparison purposes.

In [None]:
#@title Comparing the 95% Confidence Intervals

#Computing the theory based 95% CI
(LaCrosseWinonaHomePrices
  %>% summarize(
              '2.5%' = mean(PRICE) - qt(0.975,df=n()-1) * (sd(PRICE)/sqrt(n())),
              '97.5%' = mean(PRICE) + qt(0.975,df=n()-1) * (sd(PRICE)/sqrt(n())),
            )
  %>% mutate(Type = "Theory")
  %>% relocate(Type)
) -> Theory_CI

#Computing the 95% CI via Bootstrap
(Mean_RepeatedSampling
  %>% summarise(enframe(quantile(Outcomes, c(0.025, 0.975)), "Quantile", "Value"))
  %>% spread(Quantile,Value)
  %>% mutate(Type = "Bootstrap")
  %>% relocate(Type)
) -> Bootstrap_CI

#Putting the two data.frames together
bind_rows(Theory_CI,Bootstrap_CI)
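
The theory-based interval computed above has the form $\bar{y} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}$. As a sanity check, this sketch confirms on a simulated sample that the hand-computed interval matches the one reported by base R's `t.test()`. The simulated data and object names are assumptions for illustration.

```r
# Simulated sample standing in for PRICE (illustration only)
set.seed(360)
y <- rnorm(40, mean = 250000, sd = 100000)
n <- length(y)

# Hand-computed 95% t-based confidence interval for the mean
manual_ci <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)

# Same interval from t.test()
ttest_ci <- as.numeric(t.test(y, conf.level = 0.95)$conf.int)

rbind(Manual = manual_ci, TTest = ttest_ci)
```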




---



---



### Inherent Variation in a Regression Line

In modeling home prices using a sample, a different sample of homes will produce a slightly different regression line.  This raises the question: to what degree will the regression line change from sample to sample?  In particular, to what degree will the y-intercept and slope change over repeated sampling?

<table border="0" align="center">
<tr>
  <td width="50%" align="center" valign="center" bgcolor="white"><font size="+2">Population Side</font><br>True Relationship - All Homes<br><img src="https://drive.google.com/uc?export=view&id=1zS3DBjQGDWhDmF9L2p6gAY-smrRIHa_V"></img>
  </td>
  <td width="50%" align="center" valign="center" bgcolor="white"><font size="+2">Sample Side</font><br>An Estimate of the True Relationship<br><img src="https://drive.google.com/uc?export=view&id=10xsQT7WQTurnrkUdBlpeV5tfhStlNh8m"></img>
  </td>
</table>

The goal is to use information from the sample to draw inferences about the true relationship between Home Price and Square Feet. An inherent problem in doing this is that the information on the sample side varies from sample-to-sample. Thus, an understanding of the inherent variation from sample-to-sample is essential to understanding the true relationship.

<table border="0" align="center">
<tr>
  <td align="center" valign="center" bgcolor="white"><font size="+2">Outcomes Vary Over Repeated Samples</font><br><img src="https://drive.google.com/uc?export=view&id=1V-d4CdoSZUKB7UIAKF6Vliw4IqS1BZ2f" width="50%" height="50%"></img>
  </td>
</table>

#### Fitting a Simple Linear Regression Model

First, let's fit a simple linear regression model for $Price \sim SquareFeet$.

In [45]:
LModel_Price_Sqft <- lm(PRICE ~ SQUAREFEET, data=LaCrosseWinonaHomePrices)

Next, get a summary of this fit.

In [None]:
summary(LModel_Price_Sqft)

Some summaries from this model include

$$\begin{array}{rcl}
\hat{E}(Price|SquareFeet) & = & \hat{\beta}_{0} + \hat{\beta}_{1} * SquareFeet \\
 & = & \$76499.68 + \$98.81 * SquareFeet \\
\end{array}
$$

*    $R^2 = 0.6506 \approx 65\%$, the percent of variation in Price that can be explained by Square Feet using this estimated regression model
*    $RMSE = \$66,180$, roughly the typical size of the error in a prediction from this model
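
These quantities can also be pulled directly from the fitted model object rather than read off the printed summary. The sketch below uses R's built-in `cars` dataset (an assumption made so the example is self-contained); the same calls should apply to `LModel_Price_Sqft`.

```r
# Built-in dataset used only so the example runs on its own
fit <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

# Estimated y-intercept and slope
coef(fit)

# R-squared: proportion of variation in the response explained by the model
fit_summary$r.squared

# Residual standard error, the quantity summary() prints: sqrt(SSE / (n - 2))
fit_summary$sigma
rmse_by_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
rmse_by_hand
```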


The following is a custom function that can be used to obtain the bootstrap distribution of the estimated y-intercept and slope from the estimated regression line.

In [64]:
#@title Custom Bootstrap Regression Function
#######################################################
# Bootstrap Regression
# Note: bootstrapping residuals here
#######################################################

BootReg=function(slr_object,b=100,delay=0){

	y=slr_object$model[,1]
	x=slr_object$model[,2]
	resid=slr_object$residuals

	output.mat=matrix(0,b,4)
	
	plot(x,y,type="n",xlab="SquareFeet",ylab="CurrentPrice")
	#points(x,y)
	abline(slr_object)
	Sys.sleep(2+delay)
	points(x,y,col="white")


	for(i in 1:b){

		residstar = sample(resid,replace=TRUE)  # resample residuals with replacement (bootstrap)
		ystar=y+residstar
		lmtemp = lm(ystar~x)
		#points(x,ystar)
		abline(lmtemp,col="grey")
		xjitter1=min(x)+0.67*(max(x)-min(x))+runif(1,-0.2*(max(x)-min(x)),0.2*(max(x)-min(x)))
		xjitter2=xjitter1 + 0.1*(max(x)-min(x))
		segments(xjitter1,+lmtemp$coefficients[[1]]+xjitter1*lmtemp$coefficients[[2]],xjitter2,lmtemp$coefficients[[1]]+xjitter1*lmtemp$coefficients[[2]])
		segments(xjitter2,lmtemp$coefficients[[1]]+xjitter1*lmtemp$coefficients[[2]],xjitter2,lmtemp$coefficients[[1]]+xjitter2*lmtemp$coefficients[[2]])
		text(xjitter2+0.02*(max(x)-min(x)),lmtemp$coefficients[[1]]+xjitter1*lmtemp$coefficients[[2]],round(lmtemp$coefficients[[2]],2),cex=0.75)
		Sys.sleep(delay)
		points(x,ystar,col="white")
		
		output.mat[i,1]=lmtemp$coefficients[[1]]
		output.mat[i,2]=lmtemp$coefficients[[2]]
		
	}
	
	Intercept = output.mat[,1]
	Slope = output.mat[,2]
	return(data.frame(Intercept, Slope))
}
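
Stripped of the plotting and animation, the core of a residual bootstrap is short. The sketch below is a simplified illustration, not a replacement for `BootReg()`: it uses the built-in `cars` dataset and hypothetical object names, resamples the residuals with replacement, adds them to the fitted values, and refits the line.

```r
# Built-in dataset used only so the example runs on its own
set.seed(360)
fit <- lm(dist ~ speed, data = cars)
x <- cars$speed
fitted_y <- fitted(fit)
res <- residuals(fit)

b <- 1000
boot_coefs <- t(replicate(b, {
  # New response: fitted values plus residuals resampled with replacement
  ystar <- fitted_y + sample(res, size = length(res), replace = TRUE)
  coef(lm(ystar ~ x))
}))
colnames(boot_coefs) <- c("Intercept", "Slope")

# Bootstrap standard errors next to the theory-based ones from summary()
rbind(Bootstrap = apply(boot_coefs, 2, sd),
      Theory    = summary(fit)$coefficients[, "Std. Error"])
```

Each row of `boot_coefs` holds the intercept and slope from one bootstrap refit, so the column standard deviations serve as bootstrap standard error estimates.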


Use the BootReg() function to obtain the y-intercepts and slopes over repeated sampling.

In [None]:
LModel_RepeatedSampling <- BootReg(LModel_Price_Sqft, b=10)

The following code snippet computes the bootstrap standard error estimates for the y-intercept and slope.

In [None]:
(LModel_RepeatedSampling
  %>% summarize(
                 StdError_Intercept = sd(Intercept),
                 StdError_Slope = sd(Slope)
  )
)

The following code snippet creates an approximate 95% confidence interval for each estimate using percentiles of the bootstrap distribution.

In [None]:
#Computing the 95% CI via Percentiles from the bootstrap distribution.
( LModel_RepeatedSampling
  %>% summarise(enframe(quantile(Intercept, c(0.025, 0.975)), "Quantiles", "Intercept"))
  %>% spread(Quantiles,Intercept)
  %>% mutate(Estimate = "Intercept")
  %>% relocate(Estimate)
) -> Intercept_CI

( LModel_RepeatedSampling
  %>% summarise(enframe(quantile(Slope, c(0.025, 0.975)), "Quantiles", "Slope"))
  %>% spread(Quantiles,Slope)
  %>% mutate(Estimate = "Slope")
  %>% relocate(Estimate)
) -> Slope_CI

bind_rows(Intercept_CI,Slope_CI)

#### Getting the Theory Based 95% Confidence Intervals for Regression Estimates

The 95% confidence intervals for the model parameters can be computed using the confint() function.

In [None]:
confint(LModel_Price_Sqft)
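
For reference, `confint()` applied to an `lm` fit computes estimate $\pm\ t_{0.975,\,n-2} \times$ standard error for each coefficient. The sketch below reproduces this by hand on the built-in `cars` dataset, which is used here only as a stand-in so the example runs on its own.

```r
# Built-in dataset used only so the example runs on its own
fit <- lm(dist ~ speed, data = cars)
coef_table <- summary(fit)$coefficients

estimate  <- coef_table[, "Estimate"]
std_error <- coef_table[, "Std. Error"]
t_crit    <- qt(0.975, df = df.residual(fit))

# Hand-computed 95% confidence intervals for the intercept and slope
manual_ci <- cbind("2.5 %"  = estimate - t_crit * std_error,
                   "97.5 %" = estimate + t_crit * std_error)

manual_ci
confint(fit)
```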


---



---
End of Document

