# Project Journal

Name: Erin Widener

Research Question: Considering three energy production sources (total fossil fuel production, nuclear electric power production, total renewable energy production) and one energy import variable (primary energy imports), can at least one of these energy "inputs" accurately predict total primary energy consumption?

Variables: Response = Total Primary Energy Consumption (Y)
           Predictor(s) = Total Fossil Fuel Production (B1, X1), Nuclear Electric Power Production (B2, X2), Total Renewable Energy Production (B3, X3), Primary Energy Imports (B4, X4)

Hypothesis: Is there at least one predictor useful in predicting total primary energy consumption? 

Ho: B1 = B2 = B3 = B4 = 0 
Ha: At least one Bj ≠ 0 (j= 1-4) 
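
These hypotheses correspond to the overall F-test for a regression with four predictors. As a reference sketch, writing $n$ for the number of observations (not stated in this journal, so treated as a symbol here) and $p = 5$ for the number of coefficients including the intercept, the test statistic is:

$$F^* = \frac{MSR}{MSE} = \frac{SSR/(p-1)}{SSE/(n-p)}, \qquad F^* \sim F(p-1,\ n-p) \text{ under } H_0$$

Ho is rejected at level $\alpha$ when $F^* > F(1-\alpha;\ 4,\ n-5)$. This is the F-statistic that `summary()` reports on its last line for a fitted `lm` object.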

## Data Prep & EDA
**Dates:** November 1 - November 7

**Meeting Date:** November 7

### Data Cleaning Summary

**Summary of data cleaning process:**
1. Set the working directory and install/load packages
2. Import the data
3. Assess the structure of the data
4. Build a data frame with the relevant variables

**Issues Encountered and Resolutions:**
Some minor issues loading certain packages (ALSM) and extracting the relevant variables from the dataset.


In [None]:
##Step 1
setwd("C:/Users/Hunte/OneDrive - purdue.edu/STAT 512/Final Project")
library(vroom)
library(tidyverse)
library(IRkernel)
library(ALSM)
library(car)
library(fmsb)
library(GGally)
library(lmridge)
library(lmtest)
library(MASS)
library(glmnet) 
library(boot) 
library(caret)

In [None]:

##Step 2
dataset <- read.csv("World Energy Overview.csv")
matrix.data <- as.matrix(dataset)


In [None]:
##Step 3
str(dataset)
str(matrix.data)

In [None]:
#Step 4 
Y <- dataset$Total.Primary.Energy.Consumption 
X1 <- dataset$Total.Fossil.Fuels.Production
X2 <- dataset$Nuclear.Electric.Power.Production
X3 <- dataset$Total.Renewable.Energy.Production 
X4 <- dataset$Primary.Energy.Imports

predictor.response.data <- data.frame("Consumption"= Y, "Fossil" = X1, 
                                                 "Nuclear" = X2, 
                                                 "Renewable" = X3, 
                                                 "Imports" = X4)
head(predictor.response.data)

### Exploratory Data Analysis Findings
**Key Visualizations:** 




### Summary Statistics

In [None]:
#summary statistics 
mean(Y)
sd(Y)
mean(X1)
sd(X1)
mean(X2)
sd(X2)
mean(X3)
sd(X3)
mean(X4)
sd(X4)
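
The ten separate calls above can also be collapsed into one table. The sketch below is a minimal illustration on simulated data: `demo` and its values are made up to stand in for `predictor.response.data`, which is assumed to hold only numeric columns.

```r
# Simulated stand-in for predictor.response.data (values are made up)
set.seed(1)
demo <- data.frame(Consumption = rnorm(50, mean = 7, sd = 1),
                   Fossil      = rnorm(50, mean = 5, sd = 1),
                   Imports     = rnorm(50, mean = 2, sd = 0.5))

# One row per statistic, one column per variable
stats <- sapply(demo, function(x) c(mean = mean(x), sd = sd(x)))
print(round(stats, 3))
```

Swapping `demo` for `predictor.response.data` should give the same means and standard deviations as the individual calls, in a form that is easier to paste into the report.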

***
## Model Building
**Dates:** November 8 - November 14

**Meeting Date:** November 14

### Model Equation

**Equation:** 
[Write out the model equation here based on your selected predictors]

Note: you can write equations as follows: 
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \epsilon$$

Where,
Y = Total.Primary.Energy.Consumption (response)
X1 = Total.Fossil.Fuels.Production, with coefficient B1
X2 = Nuclear.Electric.Power.Production, with coefficient B2
X3 = Total.Renewable.Energy.Production, with coefficient B3
X4 = Primary.Energy.Imports, with coefficient B4



### Model Fitting

In [None]:
##MLR model
mlr.mod <- lm(Consumption ~ Fossil + Nuclear + Renewable + Imports, predictor.response.data)

### Multicollinearity
**Explanation of Multicollinearity:**
[Briefly describe any collinearity included in the model] 


In [None]:
#Pairwise scatter plot & multicollinearity
ggpairs(predictor.response.data)
VIF(lm(X1 ~ X2 + X3 + X4))
VIF(lm(X2 ~ X1 + X3 + X4))
VIF(lm(X3 ~ X1 + X2 + X4))
VIF(lm(X4 ~ X1 + X2 + X3))
VIF(lm(X1 ~ X3))
VIF(lm(X3 ~ X1))
VIF(lm(X2 ~ X4))
VIF(lm(X4 ~ X2))
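
Each `VIF()` call above is passed an auxiliary regression of one predictor on the others, and the value it returns is 1 / (1 - R²) of that regression. The sketch below reproduces that arithmetic in base R on simulated predictors. The names `x1`, `x2`, `x3` and their values are hypothetical and only illustrate the calculation.

```r
# Simulated predictors; x2 is built to be correlated with x1 (illustrative only)
set.seed(2)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)

# VIF for x1: regress x1 on the other predictors, then take 1 / (1 - R^2)
aux    <- lm(x1 ~ x2 + x3)
r2     <- summary(aux)$r.squared
vif_x1 <- 1 / (1 - r2)
print(vif_x1)
```

A VIF near 1 means the predictor is nearly uncorrelated with the others; a common rule of thumb treats values above 10 as a sign of serious multicollinearity. For the pairwise calls with a single predictor, R² is just the squared correlation between the two variables.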

### Interaction Terms
**Explanation of Interaction Terms:**
[Briefly describe any interaction terms included in the model]


In [None]:
# Add any interaction plots here

### Model Summary and Diagnostics

In [None]:
# Summary & confidence intervals
summary(mlr.mod)
confint(mlr.mod, level=0.95)

#Anova (type I & II)
anova(mlr.mod)
Anova(mlr.mod, type = 2)

# Diagnostics: Residual Plots, Normality, etc.
plot(mlr.mod)

#Cooks distance 
cooksd <- cooks.distance(mlr.mod)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's Distance")
abline(h = 4 / length(cooksd), col = "red")  # threshold for influential points
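
The plot shows which bars cross the 4/n line, but it can help to list those observations by index. The sketch below does this on the built-in `mtcars` data, used here purely as a stand-in; `fit`, `cd`, `cutoff`, and `flagged` are illustrative names, and 4/n is a rule of thumb rather than a formal test.

```r
# Stand-in model on a built-in dataset (not the energy data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd     <- cooks.distance(fit)
cutoff <- 4 / length(cd)          # same rule-of-thumb threshold as the plot

# Observations whose Cook's distance exceeds the cutoff, largest first
flagged <- sort(cd[cd > cutoff], decreasing = TRUE)
print(flagged)
```

Applying the same lines to `cooksd` would give the row indices to inspect before deciding whether any point should be examined more closely.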

In [None]:
##Individual SLR models
#Fossil
fossil.mod <- lm(Y ~ X1)
summary(fossil.mod)
ggplot(predictor.response.data, aes(x= Fossil, y= Consumption)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "orange") + 
  theme_bw()

#Nuclear
nuclear.mod <- lm(Y ~ X2)
summary(nuclear.mod)
ggplot(predictor.response.data, aes(x= Nuclear, y= Consumption)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "blue") + 
  theme_bw()

#Renewable
renewable.mod <- lm(Y ~ X3)
summary(renewable.mod)
ggplot(predictor.response.data, aes(x= Renewable, y= Consumption)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "green") + 
  theme_bw()

#Imports
imports.mod <- lm(Y ~ X4)
summary(imports.mod)
ggplot(predictor.response.data, aes(x= Imports, y= Consumption)) +
  geom_point() + 
  geom_smooth(method = "lm", color = "purple") + 
  theme_bw()

In [None]:
#Partial residual (component + residual) plots
crPlots(mlr.mod)

### Feature Selection Plan
Describe strategies for reducing the model (if necessary) and rationale.

***
## Model Evaluation & Validation
**Dates:** November 15 - November 21

**Meeting Date:** November 21

### Documentation of Model Adjustments

In [None]:
##MLR model effects on renewables (the problem child)
fossil.renew <- lm(Y ~ Fossil + Renewable, predictor.response.data)
summary(fossil.renew)
confint(fossil.renew)
Anova(fossil.renew, type= 2)

nuclear.renew <- lm(Y ~ Nuclear + Renewable, predictor.response.data)
summary(nuclear.renew)
confint(nuclear.renew)
Anova(nuclear.renew, type= 2)

imports.renew <- lm(Y ~ Imports + Renewable, predictor.response.data)
summary(imports.renew)
confint(imports.renew)
Anova(imports.renew, type= 2)

In [None]:
##log transformation 
ylog.mod = lm(log(Consumption)~Fossil+Nuclear+Renewable+Imports, 
              data = predictor.response.data)
summary(ylog.mod)
plot(ylog.mod)

In [None]:
##WLS 
wts1 <- 1/fitted(lm(abs(residuals(mlr.mod)) ~ Fossil+Nuclear+Renewable+Imports,
                    predictor.response.data))^2
wls.mod = lm(Consumption ~ Fossil+Nuclear+Renewable+Imports, weights = wts1, 
             data = predictor.response.data)
summary(wls.mod)
confint(wls.mod)
plot(wls.mod)
bptest(mlr.mod)
bptest(wls.mod)
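
The weights above come from a two-step recipe: regress the absolute OLS residuals on the predictors to estimate how the error standard deviation changes, then weight each observation by 1 / (estimated sd)². The sketch below walks through the same steps on simulated data with one predictor. The variables `x`, `y`, and the error structure are assumptions made for illustration and are not drawn from the energy dataset.

```r
# Simulated heteroscedastic data: the error sd grows with x (illustrative only)
set.seed(4)
x <- runif(80, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(80, sd = 0.5 * x)

ols    <- lm(y ~ x)
sd.fit <- lm(abs(residuals(ols)) ~ x)   # step 1: model the residual spread
w      <- 1 / fitted(sd.fit)^2          # step 2: weights = 1 / (estimated sd)^2
wls    <- lm(y ~ x, weights = w)

# Coefficients are usually similar; the standard errors are what change
print(rbind(OLS = coef(ols), WLS = coef(wls)))
```

If the fitted values of the spread model get close to zero, the corresponding weights become very large, so it is worth checking `summary(w)` before trusting the weighted fit.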

In [None]:
##Bootstrapping
set.seed(400)
mlr = function(predictor.response.data, indices) {
  mlr_boot_data = predictor.response.data[indices, ]
  mlr_fit = lm(Consumption~Fossil+Nuclear+Renewable+Imports, data= mlr_boot_data)
  coefficients(mlr_fit)
}
mlr_boot = boot(data= predictor.response.data, statistic = mlr, R=100)
mlr_boot
boot.ci(mlr_boot, index= 2, type = "perc")
boot.ci(mlr_boot, index= 3, type = "perc")
boot.ci(mlr_boot, index= 4, type = "perc")
boot.ci(mlr_boot, index= 5, type = "perc")

Summary of iterative process:
1. First I did this
2. Then I did this because...
3. Then I did this because...

Final Model Equation: 

### Model Evaluation
#### Significance Tests

In [None]:
# Add your significance test code with outputs here

#### Model Performance Metrics

In [None]:
##Model performance with AIC & BIC
aic = AIC(mlr.mod)
aic.wls = AIC(wls.mod)
aic.log = AIC(ylog.mod)
bic = BIC(mlr.mod)
bic.wls = BIC(wls.mod)
bic.log = BIC(ylog.mod)
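# Note: AIC/BIC for the log model are on the log(Consumption) scale, so they are
# not directly comparable with the values for the other two models
cat("AIC (OLS, WLS, log):", aic, aic.wls, aic.log, "\n")
cat("BIC (OLS, WLS, log):", bic, bic.wls, bic.log, "\n")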
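
One caveat when reading these numbers: the log model's likelihood is computed for log(Consumption), so its AIC and BIC sit on a different scale from the other two models. A standard change-of-variables correction adds 2 × sum(log(y)) to put the log model back on the original response scale. The sketch below shows the adjustment on simulated data; `x`, `y`, `raw.fit`, and `log.fit` are hypothetical names and values.

```r
# Simulated positive response (illustrative only)
set.seed(3)
x <- runif(60, min = 1, max = 10)
y <- exp(0.5 + 0.2 * x + rnorm(60, sd = 0.3))

raw.fit <- lm(y ~ x)
log.fit <- lm(log(y) ~ x)

aic.raw     <- AIC(raw.fit)
# Jacobian adjustment: moves the log model's AIC onto the scale of y
aic.log.adj <- AIC(log.fit) + 2 * sum(log(y))
print(c(raw = aic.raw, log_adjusted = aic.log.adj))
```

The same 2 × sum(log(y)) term applies to BIC, since the two criteria differ only in the penalty on the number of parameters.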

### Validation Findings

In [None]:
##Cross-validation 
#method 1
# Define the training control
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV
# Train the model (seed set first so the CV fold assignment is reproducible)
set.seed(400)
trained.model <- train(Consumption~Fossil+Nuclear+Renewable+Imports, 
                       data = predictor.response.data, 
                       method = "lm", trControl = train_control)
#Print the model summary
print(trained.model)
summary(trained.model)

#method 2
set.seed(400)
folds = sample(rep(1:10, length.out= nrow(predictor.response.data)))
errors = c()
for(i in 1:10) {
  test_indices = which(folds== i)
  train_data = predictor.response.data[-test_indices, ]
  test_data = predictor.response.data[test_indices, ]
  mod = lm(Consumption~Fossil+Nuclear+Renewable+Imports, data= train_data)
  predictions = predict(mod, newdata= test_data)
  rmse = sqrt(mean((predictions - test_data$Consumption)^2))
  errors = c(errors, rmse)
}
mean_rmse = mean(errors)
print(mean_rmse)

### Summary of Findings

[Summarize your findings from the model evaluation and validation here. Don't forget to bring it back to your hypothesis and include your final model!]

***
Team Reminder: After this meeting, agree on a report/presentation format and make all of the needed documentation.

***
## Report and Presentation
**Dates:** November 22 - November 26

**Meeting Date:** November 26, 4:30

No code necessary here (yay)! Use the space below to brainstorm which graphs you want to include in the report and how you want to tell the story of your model!