In [None]:
# We will use the tidyverse for much of our data manipulation. Startup warnings are fine to ignore.
library(tidyverse)
source("https://github.com/brianlukoff/sta235-labs/raw/main/src/l1h.r")

# Data Exploration

In [None]:
# read_csv() reads a CSV file (here, from a URL) into a data frame
profs <- read_csv('https://emyucel.com/sta235/profs.csv')

In [None]:
# The head command gives the first few rows of a dataframe
profs %>% head()

Let's explore some of our variables. Here's a histogram of the `eval` variable.

In [None]:
ggplot(data=profs) +
  geom_histogram(aes(x=eval), bins=10, color='white')

Now let's try it for attractiveness. Fill in the blank with the appropriate variable name.

In [None]:
ggplot(data=profs) +
  geom_histogram(aes(x=____________), bins=10, color='white')

Is there a relationship between attractiveness and evaluation? Let's look at a scatterplot. Fill in `x` and `y` with the variables you want to look at.

In [None]:
ggplot(data=profs) +
  geom_point(aes(x=____________, y=_____________))

The plot indicates that there might be some association between the two. We can measure that association with the correlation coefficient. Recall that this measures the strength of a linear relationship between two variables. The `cor` function in R gives us this. The `$` tells R to look in the `profs` data for the variable name that follows.

In [None]:
cor(profs$beauty, profs$eval)
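To make the definition concrete, here is a minimal sketch of what `cor` computes: the covariance of the two variables divided by the product of their standard deviations. The data are simulated, and `x`, `y`, `r_manual`, and `r_builtin` are illustrative names, not part of the `profs` data.

```r
# Simulated data for illustration only (not the profs dataset)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Correlation from its definition: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The built-in function should agree
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```

Because of the scaling, the result always lies between -1 and 1 regardless of the units of either variable.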

# Linear Regression

Linear regression is a technique that finds the line of best fit that maps `x` onto `y`.  In other words, we're modeling the true relationship

$Y=\beta_0 + \beta_1 X + \varepsilon$

with our best estimates

$\hat Y = \hat\beta_0 + \hat\beta_1 X$ (the hats indicate estimates)

If we want to use `R`, we can use the `lm` function.

In [None]:
# Build a regression model
lm1 <- lm(eval ~ beauty, data=profs) # this saves the model into an object called lm1
summary(lm1) # this prints out the results of our regression
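For a single predictor, the least-squares estimates have a closed form: the slope is the covariance of $X$ and $Y$ divided by the variance of $X$, and the intercept makes the line pass through the point of means. The sketch below checks this against `lm` on simulated data; the variable names and the chosen true coefficients are assumptions made for illustration.

```r
# Simulated data for illustration only (not the profs dataset)
set.seed(2)
x <- runif(200, -2, 2)
y <- 4 + 0.3 * x + rnorm(200, sd = 0.5)

fit <- lm(y ~ x)

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)           # slope
b0 <- mean(y) - b1 * mean(x)       # intercept

# Compare the hand-computed estimates with the ones lm() reports
rbind(manual = c(b0, b1), lm = coef(fit))
```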

What does this fitted line look like when plotted on our data?

In [None]:
ggplot(data=profs) +
  geom_point(aes(x=beauty, y=eval))+
  geom_smooth(aes(x=beauty, y=eval), method='lm', se=FALSE)

The line goes through the points, sure, but there seems to be a lot of variation around it. How certain are we of our estimates of the parameters of this line?

## Confidence Intervals for Coefficients

We can create confidence intervals for the model coefficients using a built-in function called `confint()`, which reports 95% intervals by default.

In [None]:
confint(lm1) %>% round(4)

From the output above, what can we conclude about the coefficient for beauty?

A. We are 95% confident the impact of each additional point of beauty on evaluation score is between 3.95 and 4.05.

B. The impact of each additional point of beauty on evaluation score is between 3.95 and 4.05.

C. We are 95% confident the impact of each additional point of beauty on evaluation score is between 0.07 and 0.20.

D. The impact of each additional point of beauty on evaluation score is between 0.07 and 0.20.

In [None]:
check_conf_1("____") # replace the blank with "A", "B", "C", or "D"
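As a rough sketch of where these intervals come from, each one is the coefficient estimate plus or minus a critical $t$ value times its standard error. The example below rebuilds the intervals by hand on simulated data and compares them with `confint()`; all names here are illustrative and the data are not the `profs` data.

```r
# Simulated data for illustration only (not the profs dataset)
set.seed(3)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

# Pull the estimates and standard errors out of the summary table
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Critical value for a 95% interval, using the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

ci_manual
confint(fit)
```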

## Predictions
Linear regression models are easy to use for prediction. We take the coefficient estimates from our summary table, write out the fitted equation, and plug in the value for which we want a prediction. Our equation is $\hat Y = \hat\beta_0 + \hat\beta_1 X$, where $X$ is the `beauty` score.

In [None]:
# Fill in the correct coefficients below to obtain a prediction for `eval` when `beauty` is 0.5
______ + ______ * 0.5

In [None]:
# put in your answer here
check_pred_1(____)

If we want a bit more precision (and less copy-pasting), we can use R's built-in `predict()` function. The `predict()` function takes a model object as the first argument and the values of $X$ to be used by that model as the second argument. Below, try predicting the evaluation score for a very attractive professor.

In [None]:
# Plug in different values of beauty
predict(lm1, list(beauty=_____))
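The sketch below illustrates that `predict()` performs the same plug-in calculation as the hand-written equation, just with the full-precision coefficients. It uses simulated data, and `by_hand`, `by_predict`, and `new_x` are hypothetical names chosen for this example.

```r
# Simulated data for illustration only (not the profs dataset)
set.seed(4)
x <- rnorm(60)
y <- 4 + 0.1 * x + rnorm(60, sd = 0.5)
fit <- lm(y ~ x)

new_x <- 0.5

# Plug the new value into the fitted equation using the stored coefficients
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * new_x

# Let predict() do the same calculation
by_predict <- predict(fit, newdata = data.frame(x = new_x))

c(by_hand = by_hand, by_predict = unname(by_predict))
```

The `newdata` argument can be a data frame or a list, as long as its names match the predictor names used in the model formula.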

Sometimes we want to know how confident we are in our predictions. For this, we can extend the `predict()` call with a third argument. The `interval` argument accepts `"prediction"` for a prediction interval on an individual response and `"confidence"` for a confidence interval on the mean response. Compare the two below. Which one is wider?

In [None]:
predict(lm1, list(beauty=0.5), interval="prediction")

In [None]:
predict(lm1, list(beauty=0.5), interval="confidence")
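The two intervals differ in one term. A confidence interval uses only the standard error of the fitted mean, while a prediction interval also adds the residual variance to account for the scatter of individual responses around the line. Here is a minimal sketch on simulated data that rebuilds both by hand from `predict(..., se.fit = TRUE)`; the variable names are illustrative.

```r
# Simulated data for illustration only (not the profs dataset)
set.seed(5)
x <- rnorm(100)
y <- 4 + 0.1 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
new <- data.frame(x = 0.5)

# se.fit = TRUE also returns the standard error of the fitted mean,
# the residual standard deviation, and the residual degrees of freedom
p <- predict(fit, new, se.fit = TRUE)
t_crit <- qt(0.975, df = p$df)

# Confidence interval: uncertainty in the mean response only
conf_hand <- p$fit + c(-1, 1) * t_crit * p$se.fit

# Prediction interval: adds the residual variance for a single new response
pred_hand <- p$fit + c(-1, 1) * t_crit * sqrt(p$se.fit^2 + p$residual.scale^2)

rbind(confidence = conf_hand, prediction = pred_hand)
```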

# Exploration
Let's see the impact of sample size and error on our estimates.

In [None]:
explore_se(0.5) # change this number between 0 and 2
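The `explore_se()` helper comes from the sourced lab file and its internals are not shown here. As an independent sketch of the same idea, the simulation below fits a regression at several sample sizes and noise levels and records the standard error of the slope. The function `slope_se()` and the true coefficients are hypothetical, made up for this example.

```r
# Hypothetical helper for illustration; this is NOT the lab's explore_se()
slope_se <- function(n, sigma) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n, sd = sigma)   # sigma controls the noise level
  coef(summary(lm(y ~ x)))["x", "Std. Error"]
}

set.seed(6)

# Every combination of three sample sizes and two noise levels
grid <- expand.grid(n = c(50, 500, 5000), sigma = c(0.5, 2))
grid$se <- mapply(slope_se, grid$n, grid$sigma)

grid
```

Reading down the `se` column for a fixed `sigma` shows how the standard error changes with sample size; comparing rows with the same `n` shows the effect of noise.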