
In [35]:
# Load MASS, which provides stepAIC() (used in Exercise 4.5)
library(MASS)

### Simple Linear Regression

1. **Dependent Variable**:  
   The outcome being predicted. Synonyms: response, $y$, target, result.

2. **Independent Variable**:  
   The predictor or feature used to explain the dependent variable. Synonyms: $x$, attribute, characteristic, predictor.

3. **Intercept (Constant Term)**:  
   The expected value of the dependent variable when the independent variable is zero. Denoted as $b_0$ or $\beta_0$.

4. **Regression Coefficient**:  
   Represents the change in the dependent variable for a one-unit increase in the independent variable. Denoted as $b_1$ or $\beta_1$.

5. **Fitted (Predicted) Values**:  
   The estimated values of the dependent variable. Denoted as $\hat{y}$. Synonyms: forecasted values, approximated values.

6. **Residuals (Error)**:  
   The difference between observed and fitted values:  
   $$ \text{Residual} = y - \hat{y} $$

7. **Least Squares Method**:  
   A technique to minimize the sum of squared residuals and find the best-fitting line:  
   $$ \text{Minimize } \sum (y_i - (\beta_0 + \beta_1 x_i))^2 $$  
   where $\beta_0$ is the intercept, and $\beta_1$ is the regression coefficient.
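
The least squares definitions above can be checked by hand. The following is a minimal sketch, assuming R's built-in `cars` dataset as a stand-in (it is not one of the datasets used in these notes). It computes $b_1$ and $b_0$ from the closed-form least squares solution for a single predictor and compares them with `lm()`.

```r
# Illustrative data: the built-in 'cars' dataset (an assumption, not from these notes)
x <- cars$speed
y <- cars$dist

# Closed-form least squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # regression coefficient
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Fitted values and residuals, following the definitions above
y_hat <- b0 + b1 * x
res <- y - y_hat

# The same fit using lm(), for comparison
fit <- lm(dist ~ speed, data = cars)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two sets of coefficients should agree up to floating-point error, and the residuals of a least squares fit with an intercept sum to approximately zero.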


#### Exercise 4.1
##### Write a program that loads the `LungDisease.csv` dataset from the provided URL and performs a linear regression to predict PEFR based on Exposure. Then, display the results of the regression model.

1. **Load the LungDisease dataset** from the provided GitHub URL into a variable called `data_lung`.
2. **Fit a linear regression model** with PEFR as the dependent variable and Exposure as the independent variable.
3. **Display the first few rows** of the dataset using the `head()` function.
4. **Print the results of the regression model**, showing the coefficients for the intercept and Exposure.

In [1]:
# Define the URL of the CSV file on GitHub
url <- "https://raw.githubusercontent.com/davidofitaly/notes_02_50_key_stats_ds/main/04_chapter/files/LungDisease.csv"

# Load the data from the CSV file into the variable 'data_lung'
data_lung <- read.csv(url)

# Display the first few rows of the loaded dataset
head(data_lung)

|   | PEFR `<int>` | Exposure `<int>` |
|---|---|---|
| 1 | 390 | 0 |
| 2 | 410 | 0 |
| 3 | 430 | 0 |
| 4 | 460 | 0 |
| 5 | 420 | 1 |
| 6 | 280 | 2 |


In [2]:
# Fit linear regression model with PEFR as dependent and Exposure as independent variable
# lm() fits a linear model. PEFR is predicted by Exposure in data_lung.
model <- lm(PEFR ~ Exposure, data=data_lung)

# Print the results
print(model)


Call:
lm(formula = PEFR ~ Exposure, data = data_lung)

Coefficients:
(Intercept)     Exposure  
    424.583       -4.185  



#### Exercise 4.2

##### Based on the linear regression model you have created (`model`), follow these steps:

1. **Generate the Predicted Values:**  
   Use the model to calculate the predicted **PEFR** values based on the **Exposure** variable. These predicted values represent the expected **PEFR** for each data point according to the model.

2. **Calculate the Residuals:**  
   The residuals represent the difference between the observed **PEFR** values and the predicted values. Compute these residuals by subtracting the predicted values from the actual **PEFR** values.

3. **Display the Results:**  
   Print the first few predicted values and the first few residuals.

In [3]:
# Generate the predicted values (fitted values) based on the regression model
fitted <- predict(model)

# Calculate the residuals, i.e., the differences between the actual and predicted values
resid <- residuals(model)

# Display the first few predicted values
print(head(fitted))

# Display the first few residuals (errors)
print(head(resid))


       1        2        3        4        5        6 
424.5828 424.5828 424.5828 424.5828 420.3982 416.2137 
           1            2            3            4            5            6 
 -34.5828066  -14.5828066    5.4171934   35.4171934   -0.3982301 -136.2136536 
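
The values above can be reproduced by hand from the printed coefficients. This is a minimal sketch that hard-codes the six rows shown by `head(data_lung)` and the rounded coefficients from `print(model)`, so the results agree with the output above only up to rounding.

```r
# First six rows of the dataset, copied from the head() output
pefr     <- c(390, 410, 430, 460, 420, 280)
exposure <- c(0, 0, 0, 0, 1, 2)

# Rounded coefficients copied from the printed model
b0 <- 424.583   # intercept
b1 <- -4.185    # slope for Exposure

# Fitted values: y_hat = b0 + b1 * x
fitted_manual <- b0 + b1 * exposure

# Residuals: observed minus fitted
resid_manual <- pefr - fitted_manual

print(fitted_manual)
print(resid_manual)
```

For example, the sixth observation has Exposure 2, so its fitted value is 424.583 - 2 × 4.185 = 416.213 and its residual is 280 - 416.213 = -136.213.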


### Multiple Regression

1. **Root Mean Squared Error (RMSE)**  
   Measures the model’s prediction error:  
   $$ RMSE = \sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2} $$  
   - $n$ – number of observations (data points).  
   - Lower RMSE → better model accuracy.  
   - RMSE in the same unit as $y$.  

2. **Residual Standard Error (RSE)**  
   Estimates the standard deviation of residuals:  
   $$ RSE = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - p - 1}} $$  
   - $p$ – number of independent variables (predictors).  
   - Lower RSE → model fits data well.  
   - Used for confidence interval estimation.  

3. **R-Squared ($R^2$)**  
   Measures the proportion of variance in $y$ explained by the predictors:  
   $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$  
   - $SS_{res}$ – sum of squared residuals (errors).  
   - $SS_{tot}$ – total sum of squares, $\sum (y_i - \bar{y})^2$ (total variation in $y$).  
   - Closer to 1 → better model fit.  
   - $R^2 = 0.8$ → 80% of variance explained.  

4. **t-Statistic**  
   Assesses the significance of each regression coefficient:  
   $$ t = \frac{\beta_j}{SE_{\beta_j}} $$  
   - $\beta_j$ – estimated coefficient for predictor $j$.  
   - $SE_{\beta_j}$ – standard error of $\beta_j$.  
   - Higher absolute $t$-value → stronger evidence against $H_0$ (no effect).  
   - Used in hypothesis testing.  

5. **Weighted Regression**  
   Accounts for heteroscedasticity by assigning weights $w_i$:  
   $$ \min \sum w_i (y_i - \hat{y}_i)^2 $$  
   - Gives more influence to reliable observations.  
   - Useful when residual variance is not constant.  
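
The metrics above can be computed directly from a fitted model. The following is a minimal sketch, assuming R's built-in `mtcars` dataset and the predictors `wt` and `hp` purely for illustration. The weights `w` are hypothetical and only demonstrate the `weights` argument of `lm()`.

```r
# Illustrative model on built-in data (an assumption, not from these notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)
n     <- length(y)   # number of observations
p     <- 2           # number of predictors (wt, hp)

ss_res <- sum((y - y_hat)^2)     # sum of squared residuals
ss_tot <- sum((y - mean(y))^2)   # total sum of squares

rmse <- sqrt(ss_res / n)             # root mean squared error
rse  <- sqrt(ss_res / (n - p - 1))   # residual standard error
r2   <- 1 - ss_res / ss_tot          # R-squared

# t-statistic for each coefficient: estimate divided by its standard error
coef_table <- coef(summary(fit))
t_manual   <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

print(c(RMSE = rmse, RSE = rse, R2 = r2))
print(t_manual)

# Weighted regression: hypothetical weights, chosen only for illustration
w     <- 1 / mtcars$wt
fit_w <- lm(mpg ~ wt + hp, data = mtcars, weights = w)
print(coef(fit_w))
```

`rse` and `r2` should match `summary(fit)$sigma` and `summary(fit)$r.squared`. RMSE is always slightly smaller than RSE because it divides by $n$ rather than $n - p - 1$.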



#### Exercise 4.3
##### Write a program that loads the `house_sales.csv` dataset from the provided GitHub URL and performs a linear regression to predict `AdjSalePrice` based on several independent variables. Then, display the results of the regression model.

1. **Load the `house_sales` dataset** from the provided GitHub URL into a variable called `data_house`, ensuring the separator is set to tab (`\t`).
2. **Fit a linear regression model** with `AdjSalePrice` as the dependent variable and `SqFtTotLiving`, `SqFtLot`, `Bathrooms`, `Bedrooms`, and `BldgGrade` as independent variables.
3. **Display the first few rows** of the dataset, showing only the selected columns: `AdjSalePrice`, `SqFtTotLiving`, `SqFtLot`, `Bathrooms`, `Bedrooms`, and `BldgGrade`.
4. **Print the results of the regression model**, showing the coefficients for the intercept and each of the independent variables.


In [29]:
# Define the URL of the CSV file on GitHub
url <- "https://raw.githubusercontent.com/davidofitaly/notes_02_50_key_stats_ds/main/04_chapter/files/house_sales.csv"

# Load the data from the CSV file into the variable 'data_house'
# Specify the separator as tab ('\t') to correctly read the file
data_house <- read.csv(url, sep="\t")

# Display the first few rows of selected columns from the dataset
head(data_house[, c("AdjSalePrice", "SqFtTotLiving", "SqFtLot",
                          "Bathrooms", "Bedrooms", "BldgGrade")])


|   | AdjSalePrice `<dbl>` | SqFtTotLiving `<int>` | SqFtLot `<int>` | Bathrooms `<dbl>` | Bedrooms `<int>` | BldgGrade `<int>` |
|---|---|---|---|---|---|---|
| 1 | 300805 | 2400 | 9373 | 3.0 | 6 | 7 |
| 2 | 1076162 | 3764 | 20156 | 3.75 | 4 | 10 |
| 3 | 761805 | 2060 | 26036 | 1.75 | 4 | 8 |
| 4 | 442065 | 3200 | 8618 | 3.75 | 5 | 7 |
| 5 | 297065 | 1720 | 8620 | 1.75 | 4 | 7 |
| 6 | 411781 | 930 | 1012 | 1.5 | 2 | 8 |


In [31]:
# Fit a linear regression model to predict AdjSalePrice using several independent variables
house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data=data_house, na.action = na.omit)

# Print the fitted model (call and coefficients only; see summary() for full details)
house_lm


Call:
lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
    Bedrooms + BldgGrade, data = data_house, na.action = na.omit)

Coefficients:
  (Intercept)  SqFtTotLiving        SqFtLot      Bathrooms       Bedrooms  
   -5.219e+05      2.288e+02     -6.051e-02     -1.944e+04     -4.778e+04  
    BldgGrade  
    1.061e+05  
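
To see how the coefficients combine into a prediction, the following minimal sketch applies the rounded coefficients printed above to the first row shown by `head()` (2400 sq ft living area, 9373 sq ft lot, 3 bathrooms, 6 bedrooms, grade 7). Because the coefficients are rounded to four significant digits, the result only approximates what `predict(house_lm)` would return for that row.

```r
# Rounded coefficients copied from the printed model
b <- c(Intercept     = -5.219e+05,
       SqFtTotLiving =  2.288e+02,
       SqFtLot       = -6.051e-02,
       Bathrooms     = -1.944e+04,
       Bedrooms      = -4.778e+04,
       BldgGrade     =  1.061e+05)

# First row of the dataset, copied from the head() output (1 multiplies the intercept)
house <- c(Intercept = 1, SqFtTotLiving = 2400, SqFtLot = 9373,
           Bathrooms = 3, Bedrooms = 6, BldgGrade = 7)

# Prediction: intercept plus the sum of coefficient * predictor value
predicted_price <- sum(b * house)
observed_price  <- 300805

print(predicted_price)
print(observed_price - predicted_price)  # residual for this house
```

Each coefficient is the expected change in `AdjSalePrice` for a one-unit increase in that predictor while the other predictors are held fixed.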


#### Exercise 4.4
##### Based on the `house_lm` model, perform the following tasks:

1. **View the summary of the model** by running the `summary(house_lm)` function. This will provide detailed information about the regression results, including coefficients, R-squared value, p-values, and significance of the predictors.
2. **Interpret the coefficients** in the context of the model. Understand how each predictor (such as `SqFtTotLiving`, `Bathrooms`, `Bedrooms`, etc.) influences the predicted `AdjSalePrice`.
3. **Examine the R-squared value** to assess the goodness of fit of the model. Determine how well the independent variables explain the variance in the dependent variable (`AdjSalePrice`).



In [34]:
# Set the option to prevent scientific notation in the output
options(scipen = 999)

# Display the summary of the linear regression model 'house_lm'
summary(house_lm)



Call:
lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
    Bedrooms + BldgGrade, data = data_house, na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max 
-1199508  -118879   -20982    87414  9472982 

Coefficients:
                   Estimate    Std. Error t value             Pr(>|t|)    
(Intercept)   -521924.72204   15650.57362 -33.349 < 0.0000000000000002 ***
SqFtTotLiving     228.83211       3.89837  58.699 < 0.0000000000000002 ***
SqFtLot            -0.06051       0.06118  -0.989                0.323    
Bathrooms      -19438.09896    3625.21876  -5.362         0.0000000832 ***
Bedrooms       -47781.15338    2489.44263 -19.194 < 0.0000000000000002 ***
BldgGrade      106117.20956    2396.13627  44.287 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 261200 on 22683 degrees of freedom
Multiple R-squared:  0.5407,	Adjusted R-squared:  0.5406 
F-statistic:  5340 on 5 and 22
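
Two columns of the summary can be reproduced from the others. The following minimal sketch copies the estimates, standard errors, R-squared, and residual degrees of freedom from the output above, then recomputes the t values and the adjusted R-squared. It assumes $n$ equals the residual degrees of freedom plus the six estimated coefficients.

```r
# Estimates and standard errors copied from the summary(house_lm) output
est <- c(SqFtTotLiving = 228.83211, SqFtLot = -0.06051,
         Bathrooms = -19438.09896, Bedrooms = -47781.15338,
         BldgGrade = 106117.20956)
se  <- c(SqFtTotLiving = 3.89837, SqFtLot = 0.06118,
         Bathrooms = 3625.21876, Bedrooms = 2489.44263,
         BldgGrade = 2396.13627)

# t value = estimate / standard error
t_values <- est / se
print(round(t_values, 3))

# Adjusted R-squared from R-squared, n, and the number of predictors p
r2       <- 0.5407
df_resid <- 22683             # residual degrees of freedom, n - p - 1
p        <- 5                 # number of predictors
n        <- df_resid + p + 1  # assumed number of observations used in the fit
adj_r2   <- 1 - (1 - r2) * (n - 1) / df_resid
print(round(adj_r2, 4))
```

`SqFtLot` is the only predictor whose absolute t value is below 1, which is consistent with its large p-value (0.323) in the summary.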

#### Exercise 4.5

##### Based on the linear regression model `house_lm`, apply stepwise regression using AIC to select the best model. Use the `stepAIC()` function and set the `direction` argument to `"both"` to allow both forward and backward selection.


In [37]:
# Perform stepwise regression using AIC to select the best model
step_lm <- stepAIC(house_lm, direction = "both")

# Display the results of the stepwise regression
step_lm


Start:  AIC=566015.4
AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + 
    BldgGrade

                Df       Sum of Sq              RSS    AIC
- SqFtLot        1     66750141148 1548148631912860 566014
<none>                             1548081881771712 566015
- Bathrooms      1   1962151674312 1550044033446024 566042
- Bedrooms       1  25142151935270 1573224033706983 566379
- BldgGrade      1 133857297270080 1681939179041792 567895
- SqFtTotLiving  1 235159007954810 1783240889726523 569222

Step:  AIC=566014.3
AdjSalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms + BldgGrade

                Df       Sum of Sq              RSS    AIC
<none>                             1548148631912861 566014
+ SqFtLot        1     66750141148 1548081881771712 566015
- Bathrooms      1   1928307868978 1550076939781839 566041
- Bedrooms       1  25075461022645 1573224092935506 566377
- BldgGrade      1 133921425601500 1682070057514361 567895
- SqFtTotLiving  1 239771505547382 17879201374


Call:
lm(formula = AdjSalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms + 
    BldgGrade, data = data_house, na.action = na.omit)

Coefficients:
  (Intercept)  SqFtTotLiving      Bathrooms       Bedrooms      BldgGrade  
    -522415.7          228.2       -19240.4       -47650.5       106138.4  
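
The AIC values in the trace can be approximately reproduced from the RSS column. This is a minimal sketch, assuming the criterion used for linear models is $n \log(RSS/n) + 2k$, where $k$ is the number of estimated coefficients, and that $n = 22683 + 6 = 22689$ (residual degrees of freedom of the full model plus its six coefficients). The helper `aic_lm` is a hypothetical name introduced here for illustration.

```r
# Number of observations, inferred from the residual degrees of freedom (assumption)
n <- 22683 + 6

# RSS values copied from the stepAIC() trace above
rss_full    <- 1548081881771712   # model with all five predictors
rss_reduced <- 1548148631912860   # model without SqFtLot

# Hypothetical helper: AIC for a linear model, up to an additive constant
aic_lm <- function(rss, n, k) n * log(rss / n) + 2 * k

aic_full    <- aic_lm(rss_full, n, k = 6)     # intercept + 5 predictors
aic_reduced <- aic_lm(rss_reduced, n, k = 5)  # intercept + 4 predictors

print(c(full = aic_full, reduced = aic_reduced))
```

Dropping `SqFtLot` raises the RSS only slightly, so the penalty saved by estimating one fewer coefficient outweighs the loss of fit and the reduced model has the lower AIC. This matches the first step of the trace.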
