# Exercise 3: Data objects

These exercises are designed to get you comfortable extracting information from data objects. 

We'll work with the `Credit` dataset, which comes with the `ISLR` package in R. This is a simulated dataset that provides credit and demographic information on 400 hypothetical customers.

---
## 1. Load packages, data, model (1 point)

Install and load `ISLR` below.

In [23]:
install.packages("ISLR")
library(ISLR)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Take a look at the first few rows of the `Credit` dataset.

In [24]:
# INSERT CODE HERE
head(Credit, 10)

ID,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<fct>,<int>
1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331
6,80.180,8047,569,4,77,10,Male,No,No,Caucasian,1151
7,20.996,3388,259,2,37,12,Female,No,No,African American,203
8,71.408,7114,512,2,87,9,Male,No,No,Asian,872
9,15.125,3300,266,5,66,13,Female,No,No,Caucasian,279
10,71.061,6819,491,3,41,19,Female,Yes,Yes,African American,1350


We can see that we have a nice **tidy** data frame here. Each column is a separate variable and each row is a different observation (in this case, simulated customers).

The code below fits a linear model to predict credit card balance from the card limit and the card owner's credit rating, age, gender, and student status. This model is saved as the `cred_lm` model object. The `summary()` function extracts important summary information from the model object so we can interpret the results.

In [25]:
cred_lm  <- lm(Balance ~ Limit + Rating + Age + Gender + Student, Credit)
summary(cred_lm)


Call:
lm(formula = Balance ~ Limit + Rating + Age + Gender + Student, 
    data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-682.49 -127.65    3.92  135.17  453.60 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -292.49382   48.88704  -5.983 4.92e-09 ***
Limit           0.05265    0.05392   0.976 0.329428    
Rating          1.80924    0.80462   2.249 0.025092 *  
Age            -2.12967    0.57212  -3.722 0.000226 ***
GenderFemale   -0.34809   19.63579  -0.018 0.985865    
StudentYes    397.16094   32.76189  12.123  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 195.9 on 394 degrees of freedom
Multiple R-squared:  0.8207,	Adjusted R-squared:  0.8184 
F-statistic: 360.6 on 5 and 394 DF,  p-value: < 2.2e-16


## 2. Replicating `summary` outputs (5 pts)

Let's see if we can replicate some of the values included in the `summary()` output.

Let's start with the residual standard error, aka sigma. We can see above that this is 195.9 for this model. You can directly extract sigma as follows:

In [72]:
sigma(cred_lm)

In `lm`, sigma is calculated as 

$$ \sqrt{\frac{SSE}{n-p}} $$

where SSE is the sum of squared errors, `n` is the number of observations, and `p` is the number of estimated parameters (hint: this includes the intercept). The denominator is therefore the residual degrees of freedom, which `summary()` reports as 394 for this model.
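As a quick sanity check of the formula, here is a minimal sketch on R's built-in `mtcars` data. The dataset, the model `fit`, and its predictors are stand-ins chosen so the snippet runs without `ISLR`; they are not part of the exercise. It uses `deviance()`, `nobs()`, and `df.residual()`, which for an `lm` fit return the SSE, `n`, and `n - p` respectively.

```r
# Stand-in model on a built-in dataset (illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

sse <- deviance(fit)      # residual sum of squares for an lm fit
n   <- nobs(fit)          # number of observations used in the fit
p   <- length(coef(fit))  # estimated parameters, including the intercept

# The denominator n - p is the residual degrees of freedom
stopifnot(n - p == df.residual(fit))

manual_sigma <- sqrt(sse / (n - p))
all.equal(manual_sigma, sigma(fit))  # TRUE when the two values agree
```

The exercise below asks you to get the same quantities from `residuals()` and `coefficients()` instead, which is good practice for digging into the model object directly.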

Below, use what you've learned about extracting information from model objects to calculate the SSE and extract `n` and `p`. 

Hint: remember that R is really good at *vectorized operations*, meaning it easily applies the same operation individually to each element of a given vector. 
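A minimal illustration of a vectorized operation, using a made-up numeric vector `x` (hypothetical values, not taken from the model):

```r
x <- c(1, -2, 3)   # made-up values for illustration

squared <- x^2     # the exponent is applied to each element in turn
squared            # 1 4 9

sum(squared)       # 14
```

No loop is needed: `^` operates on every element, and `sum()` then collapses the result to a single number.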

In [73]:
# INSERT CODE HERE
# Calculate SSE
errors_squared <- residuals(cred_lm) ^ 2
sse <- sum(errors_squared)

# Extract n
n <- length(residuals(cred_lm))

# Extract p
p <- length(coefficients(cred_lm))



Now, combine your work above to write a function that takes any fitted linear model and returns the residual standard error. Then test your function on the `cred_lm` model object. Compare your answer to sigma extracted directly from the model object.

In [74]:
# INSERT CODE HERE
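calculate_sigma <- function(model) {
  # Sum of squared errors from the model passed in (not the global cred_lm)
  sse <- sum(residuals(model) ^ 2)
  n <- length(residuals(model))     # number of observations
  p <- length(coefficients(model))  # number of estimated parameters, incl. intercept
  # Parentheses are required: divide SSE by the degrees of freedom (n - p)
  sqrt(sse / (n - p))
}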


In [75]:
# Test and compare results. 

calculate_sigma(cred_lm) #Replace with your own function name
sigma(cred_lm)

---
## 3. Summary table and indexing (4 pts)

Let's say we wanted to extract the entire coefficient table provided to us by the `summary()` function above, maybe for use in a publication. You might expect this to be pulled by:

In [None]:
cred_lm$coefficients

But as we saw in the tutorial, this pulls just the variable name and estimate, and not the standard error, t-statistic, or p-value. You could try to find where all this information is stored in the `cred_lm` object using the `str()` function.

In [None]:
str(cred_lm)

But you actually won't find it in there! That's because the information in the coefficient table is a component of `summary()`, not a component of the model object itself. That's right, `summary()` creates its own object that you can pull further information from.
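One way to see this for yourself is to inspect what `summary()` returns. The sketch below assumes a stand-in model `fit` on the built-in `mtcars` data (an illustrative choice so the snippet runs without `ISLR`); the same calls should work on any `lm` fit.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model (illustrative only)
fit_summary <- summary(fit)

class(fit_summary)   # "summary.lm"
names(fit_summary)   # lists the components, including "coefficients"

# The model object stores its coefficients as a named vector...
is.vector(fit$coefficients)

# ...whereas the summary object stores a matrix with one row per term
dim(fit_summary$coefficients)       # 3 rows (terms) by 4 columns
colnames(fit_summary$coefficients)
```

Both objects have a component called `coefficients`, which is why the same name returns different things depending on which object you index.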

Knowing this, pull the coefficient table from the `summary()` object.

In [76]:
# INSERT CODE HERE
coefficients(summary(cred_lm))


,Estimate,Std. Error,t value,Pr(>|t|)
(Intercept),-292.49382335,48.88704367,-5.98305402,4.922383e-09
Limit,0.05265499,0.05392354,0.97647505,0.3294283
Rating,1.80924401,0.80462003,2.24856944,0.02509189
Age,-2.12966556,0.57211502,-3.72244305,0.000226033
GenderFemale,-0.34809093,19.63578922,-0.01772737,0.9858653
StudentYes,397.16093911,32.76189104,12.12265002,5.800207e-29


Maybe we are not interested in including the t-statistic in our final table. Pull **just** the estimate, SE, and p-value from the `summary()` object.

In [98]:
# INSERT CODE HERE
summary(cred_lm)$coefficients[,c(1,2,4)]


,Estimate,Std. Error,Pr(>|t|)
(Intercept),-292.49382335,48.88704367,4.922383e-09
Limit,0.05265499,0.05392354,0.3294283
Rating,1.80924401,0.80462003,0.02509189
Age,-2.12966556,0.57211502,0.000226033
GenderFemale,-0.34809093,19.63578922,0.9858653
StudentYes,397.16093911,32.76189104,5.800207e-29


Now, pull the table again but drop the `(Intercept)` term. (Don't save and alter your table above -- practice pulling the same table, minus the intercept term, directly from the summary.)

In [101]:
# INSERT CODE HERE
summary(cred_lm)$coefficients[-1, c(1, 2, 4)]


,Estimate,Std. Error,Pr(>|t|)
Limit,0.05265499,0.05392354,0.3294283
Rating,1.80924401,0.80462003,0.02509189
Age,-2.12966556,0.57211502,0.000226033
GenderFemale,-0.34809093,19.63578922,0.9858653
StudentYes,397.16093911,32.76189104,5.800207e-29
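Positional indices work, but they depend on the order of the rows and columns. As a sketch of two alternatives, the block below builds a small made-up matrix `m` (hypothetical values and row names, shaped like a coefficient table), then selects columns by name and drops a row with a negative index or by name.

```r
# Made-up matrix with the same column names as a coefficient table
m <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("(Intercept)", "a", "b"),
                  c("Estimate", "Std. Error", "t value", "Pr(>|t|)"))
)

keep <- c("Estimate", "Std. Error", "Pr(>|t|)")

m[, keep]                              # select columns by name
m[-1, keep]                            # a negative index drops the first row
m[rownames(m) != "(Intercept)", keep]  # or drop the row by its name
```

Selecting by name is a little more typing, but the code keeps working if the table gains or reorders columns.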


That's all for Exercise 3! When you are finished, save the notebook as Exercise3.ipynb, push it to your class GitHub repository and send the instructors a link to your notebook via Canvas. You can send messages via Canvas by clicking "Inbox" on the left and then pressing the icon with a pencil inside a square.

**DUE:** 5pm EST, Feb 8, 2023

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> *Someone's Name*