<a href="https://colab.research.google.com/github/christophermalone/stat360/blob/main/Handout2_PartC_RSquared.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Handout #2 Part C: Computing $R^2$


<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example 2.1 (Cont'd)

Consider the following data that has been collected from students in my introductory statistics course over several past semesters.  

<table>
  <tr>
    <td width='30%' valign='top'>
      <ul>
        <li><strong>Response Variable</strong>: Hair Length (mm)</li><br>
        <li>Variables under investigation (i.e. independent variables)</li>
        <ul>
          <li>Gender</li>
          <li>Height (inches)</li>
         </ul>
    </ul>
    </td>
    <td width='70%'>
<p align='center'><img src="https://drive.google.com/uc?export=view&id=1h4lXsxXMRHVRtdg48vdbMbGahWfGz_oS" width='50%' height='50%'></img></p>
<p align='center'><img src="https://drive.google.com/uc?export=view&id=1W1F3yLFTI-AOUOg10Gnl7zsetSX-6lXz" width='50%' height='50%'></img></p>
  </td>
</tr>
</table>

Data Folder: [OneDrive](https://mnscu-my.sharepoint.com/:f:/g/personal/aq7839yd_minnstate_edu/EmOQfrwxzzRBqq8PH_8qTmMBy-1qKgM11Hb8vzjs025EEA?e=wyShYs)



<table width='100%' ><tr><td bgcolor='green'></td></tr></table>



## Proportion of Variance being Explained

The amount of variation that conditioning can explain is limited by the total amount of variation in the marginal distribution of the response. For the Hair Length example, conditioning on Gender gives:

$$\mbox{Unexplained Variation in Marginal Distribution} = 3836975$$

$$\mbox{Unexplained Variation in Conditional Distribution} = 1209170$$

The reduction in the unexplained variation by conditioning is

$$\mbox{Reduction in Unexplained Variation} = 3836975 - 1209170 = 2627805$$




This reduction is substantial relative to the total unexplained variation in the marginal distribution, 3836975.  For this reason, the proportion of unexplained variation removed by conditioning is commonly used as a measure of the overall usefulness of the conditioning variable(s).  This proportion is referred to as the **coefficient of determination**, or $\bf{R^2}$.


$$
R^{2} = \frac{\mbox{Total Unexplained Variation in Marginal}-\mbox{Total Unexplained Variation in Conditional}}{\mbox{Total Unexplained Variation in Marginal}}
$$


<u>Notation</u>:

*    Sum of Squares Total (i.e. $SS_{Total}$) is commonly used to identify the total unexplained variation in the marginal distribution of the response

*    Sum of Squares Error (i.e. $SS_{Error}$) is commonly used to identify the total amount of unexplained variation in the conditional distributions 


The coefficient of determination, $R^2$, is defined as follows when using this common notation

$$
R^{2} = \frac{(SS_{Total} - SS_{Error})}{SS_{Total}}
$$

or, more simply

$$
R^{2} = 1 - \frac{SS_{Error}}{SS_{Total}}
$$


The coefficient of determination, or $R^2$ value, for our situation is then computed as

$$ \begin{array}{rcl}
R^{2} & = & \frac{(SS_{Total} - SS_{Error})}{SS_{Total}} \\
 & = & \frac{3836975 - 1209170}{3836975} \\
 & = & \frac{2627805}{3836975} \\
 & = & 0.6849 \\
 & \approx & 68.5\% \\
\end{array}
$$
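The arithmetic above can be checked with a few lines of R. This is a minimal sketch that simply plugs in the two sums of squares reported for the Hair Length example; the variable names (`ss_total`, `ss_error`, and so on) are chosen here for illustration and are not part of the handout's code.

```r
# Sums of squares reported above for the Hair Length example
ss_total <- 3836975   # unexplained variation in the marginal distribution
ss_error <- 1209170   # unexplained variation after conditioning on Gender

# Both forms of the formula give the same proportion
r_squared_a <- (ss_total - ss_error) / ss_total
r_squared_b <- 1 - ss_error / ss_total

round(r_squared_a, 4)   # 0.6849
round(r_squared_b, 4)   # 0.6849
```

Either form can be used; the second makes it easier to see that $R^2$ approaches 1 as $SS_{Error}$ shrinks toward 0.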

<u>Question</u>:  What is the correct interpretation of this quantity, in context?

### Coefficient of Determination

Wiki Entry:  http://en.wikipedia.org/wiki/Coefficient_of_determination


<p align='center'><img src="https://drive.google.com/uc?export=view&id=1L9cGH8m_-h8SR0kMFyBu2exIH7Grdhp-" width='60%' height='60%'></img></p>


## Getting Summary Quantities for $R^2$

First, load the tidyverse package into this Colab session.

In [2]:
#load tidyverse package
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



Next, read in the data using read_csv() from the tidyverse.

In [3]:
# Reading data in using read_csv() from the tidyverse (readr)
HairLength <- read_csv("http://www.StatsClass.org/stat360/Datasets/HairLength.csv")

Rows: 131 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Gender, Name
dbl (3): RowID, Height, Length

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
#Getting sum of squared error in marginal distribution
(HairLength 
  %>% summarize(
                  Mean = mean(Length),
                  Variance = var(Length),
                  Count = n(),
                  SSE_MarginalDistribution = (Count-1)*Variance
               )
)

Mean,Variance,Count,SSE_MarginalDistribution
<dbl>,<dbl>,<int>,<dbl>
237.2595,29515.19,131,3836975
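
The `(Count-1)*Variance` expression works because the sample variance is the sum of squared deviations from the mean divided by $n - 1$, so multiplying by $n - 1$ recovers the sum of squares. A minimal base-R sketch of this identity follows; the values in `x` are hypothetical and are not taken from the HairLength data.

```r
# Hypothetical toy values, used only to illustrate the identity
x <- c(2, 4, 4, 6, 9)

# Direct definition: sum of squared deviations from the mean
ss_direct <- sum((x - mean(x))^2)

# Shortcut used in the summarize() call: (n - 1) * sample variance
ss_from_var <- (length(x) - 1) * var(x)

ss_direct     # 28
ss_from_var   # 28
```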


In [5]:
#Getting sum of squared error in conditional distribution
(HairLength
  %>% group_by(Gender) 
  %>% summarize(
                  Mean = mean(Length),
                  Variance = var(Length),
                  Count = n(),
                  SSE = (Count-1)*Variance
               )
) -> GenderSummaries

GenderSummaries

cat("\n\n")
(GenderSummaries
  %>% summarize(
                SSE_ConditionalDistribution = sum(SSE)
  )
)

Gender,Mean,Variance,Count,SSE
<chr>,<dbl>,<dbl>,<int>,<dbl>
F,346.7439,14171.181,82,1147865.62
M,54.04082,1277.165,49,61303.92






SSE_ConditionalDistribution
<dbl>
1209170
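
The same two quantities can also be computed directly from their definitions, without going through the group variances. The sketch below uses base R only and a small hypothetical data frame (`toy`, with made-up lengths); it is meant to show the mechanics, not to reproduce the HairLength results.

```r
# Hypothetical toy data: three "F" and three "M" observations
toy <- data.frame(
  Length = c(300, 350, 400, 50, 60, 40),
  Gender = c("F", "F", "F", "M", "M", "M")
)

# Marginal distribution: squared deviations from the overall mean
ss_total <- sum((toy$Length - mean(toy$Length))^2)

# Conditional distributions: squared deviations from each group's own mean
# ave() returns, for every row, the mean of that row's Gender group
group_mean <- ave(toy$Length, toy$Gender)
ss_error <- sum((toy$Length - group_mean)^2)

# Proportion of variation explained by conditioning on Gender
r_squared <- 1 - ss_error / ss_total

ss_total   # 140200
ss_error   # 5200
r_squared
```

Applying the same steps to `HairLength$Length` and `HairLength$Gender` should give the sums of squares shown in the output above.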


### Fitting a Model

The lm() function can be used to fit a linear statistical model to a set of data.

Syntax for lm():

*     The model structure is provided using $Response \sim Predictor$
*     The dataset must be specified

$$
lm(Response \sim Predictor, data = \space \space)
$$

In [6]:
# Computing R^2 via a model
LMModel_Length_Gender <- lm(Length ~ Gender, data = HairLength)
summary(LMModel_Length_Gender)


Call:
lm(formula = Length ~ Gender, data = HairLength)

Residuals:
     Min       1Q   Median       3Q      Max 
-313.744  -35.541   -1.744   40.256  231.256 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   346.74      10.69   32.43   <2e-16 ***
GenderM      -292.70      17.48  -16.74   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 96.82 on 129 degrees of freedom
Multiple R-squared:  0.6849,	Adjusted R-squared:  0.6824 
F-statistic: 280.3 on 1 and 129 DF,  p-value: < 2.2e-16


Consider the following components of the linear model output.

<p align='center'><img src="https://drive.google.com/uc?export=view&id=1YxVOvHhsQgDp-MXl2vkGn1RSTTCObl9i" width='60%' height='60%'></img></p>

<u>Comments</u>:

*  The Call: portion of the output provides information regarding the model specification
*  The Estimates provide the group means; (Intercept) is the estimated mean for Females, and (Intercept) + GenderM is the estimated mean for Males.
*   The Multiple R-squared value in this output is $R^2 = 0.6849 = 68.49\%$, which matches the value computed by hand from the sums of squares
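
The comments above can be checked programmatically by pulling values out of the fitted model object rather than reading them off the printed summary. The sketch below is illustrative only: it fits `lm()` to a small hypothetical data frame (`toy`, with made-up lengths) so that it runs on its own, and the object names are chosen for this example.

```r
# Hypothetical toy data: three "F" and three "M" observations
toy <- data.frame(
  Length = c(300, 350, 400, 50, 60, 40),
  Gender = c("F", "F", "F", "M", "M", "M")
)

fit <- lm(Length ~ Gender, data = toy)

# Group means recovered from the coefficients
mean_f <- coef(fit)[["(Intercept)"]]            # baseline group (F)
mean_m <- mean_f + coef(fit)[["GenderM"]]       # baseline plus the M offset

# R-squared stored in the summary object
r_squared <- summary(fit)$r.squared

mean_f   # 350
mean_m   # 50
r_squared
```

The same extraction, `summary(LMModel_Length_Gender)$r.squared`, can be applied to the model fitted to the HairLength data.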


---



---
End of Document
