In [None]:
# Use a CRAN mirror that serves packages for the running R version
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("igraph")
install.packages("ggdag")
install.packages("dplyr")
# "stats" ships with base R and does not need to be installed
install.packages("broom")


In [None]:
library(igraph)
library(dplyr)
library(stats)
library(broom)


Attaching package: ‘ggdag’


The following object is masked from ‘package:stats’:

    filter



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




# Multicollinearity

## Introduction
**Multicollinearity** occurs when two or more predictor variables in a multiple regression model are highly correlated with one another. It inflates the variance of the coefficient estimates, making them unstable and hard to interpret individually.

## Mathematical Explanation

Consider the linear regression model:
$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_kX_k + \epsilon $


where:
- $Y$ is the dependent variable
- $X_1, X_2, \ldots, X_k$ are the independent variables
- $\beta_0, \beta_1, \ldots, \beta_k$ are the coefficients to be estimated
- $\epsilon$ is the error term.

### Normal Equation
The coefficients $\beta$ are estimated by ordinary least squares using the normal equation:
$ \hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} $

To compute $\hat{\beta}$, the matrix $\mathbf{X}'\mathbf{X}$ must be invertible. Perfect multicollinearity makes this matrix singular, and high multicollinearity makes it nearly singular.
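As a minimal sketch of the normal equation, the snippet below estimates the coefficients by solving $(\mathbf{X}'\mathbf{X})\beta = \mathbf{X}'\mathbf{Y}$ directly and compares the result with `lm()`. The simulated data, the coefficient values, and the variable names (`x1`, `x2`, `beta_hat`) are assumptions made for illustration only.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Design matrix with an intercept column
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve (X'X) beta = X'y instead of forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model (internally through a QR decomposition)
fit <- lm(y ~ x1 + x2)

# Side-by-side comparison of the two sets of estimates
cbind(normal_equation = drop(beta_hat), lm = unname(coef(fit)))
```

Solving the linear system with `solve(A, b)` is usually preferred over computing `solve(A) %*% b`, because it avoids an explicit matrix inverse.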


### Why $\mathbf{X}'\mathbf{X}$ Becomes Non-invertible
- **Singular Matrix**: Under perfect multicollinearity, $\mathbf{X}'\mathbf{X}$ is singular, meaning its determinant is exactly zero. Under high but imperfect multicollinearity, the determinant is close to zero and the inverse is numerically unstable.
- **Linear Dependence**: This happens when one or more independent variables are exact linear combinations of the others.


### Matrix Representation

Assume $X_2 = cX_1$ and write $X_{1i}$ for the $i$-th observation of $X_1$. Then:
$ \mathbf{X} = \begin{bmatrix} 1 & X_{11} & cX_{11} \\ 1 & X_{12} & cX_{12} \\ \vdots & \vdots & \vdots \\ 1 & X_{1n} & cX_{1n} \end{bmatrix} $


### Gram Matrix ($\mathbf{X}'\mathbf{X}$)
Taking all sums over the $n$ observations:
$ \mathbf{X}'\mathbf{X} = \begin{bmatrix} n & \sum X_1 & c\sum X_1 \\ \sum X_1 & \sum X_1^2 & c\sum X_1^2 \\ c\sum X_1 & c\sum X_1^2 & c^2\sum X_1^2 \end{bmatrix} $


Here, the third column of $\mathbf{X}'\mathbf{X}$ is $c$ times the second column, so the columns are linearly dependent, which results in:
$ \text{det}(\mathbf{X}'\mathbf{X}) = 0 $
indicating that the matrix is not invertible because of perfect multicollinearity.
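The linear dependence can be checked numerically. The sketch below builds a design matrix with $X_2 = cX_1$ and inspects the rank of $\mathbf{X}'\mathbf{X}$. The sample size, the constant `c0`, and the variable names are hypothetical choices for illustration.

```r
set.seed(42)
n  <- 50
c0 <- 3                    # hypothetical value for the constant c
x1 <- rnorm(n)
x2 <- c0 * x1              # X2 is an exact multiple of X1

X   <- cbind(1, x1, x2)    # design matrix with an intercept column
XtX <- crossprod(X)        # computes X'X

# The third column of X'X equals c times the second column
third_is_multiple <- isTRUE(all.equal(XtX[, 3], c0 * XtX[, 2]))

# A 3 x 3 matrix with only two independent columns has rank 2
rank_XtX <- qr(XtX)$rank

third_is_multiple
rank_XtX
```

In floating-point arithmetic, `det(XtX)` may print a tiny non-zero number rather than exactly zero, which is why the rank is the more reliable check here.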

### Examples

- **Economic indicators:** Predicting a country's economic growth using both consumer spending and consumer income as predictors might lead to multicollinearity because these two are highly correlated; higher income generally leads to higher spending. Including both in the same regression model can make it difficult to estimate the impact of each predictor on economic growth accurately.

- **Real estate pricing:** In real estate, the size of a house and the number of rooms often exhibit multicollinearity. Both variables tend to increase together; a larger house typically has more rooms. If both are used as predictors in a regression model of house prices, their high correlation can distort the estimated individual effect of each variable, making it difficult to assess which feature (size or number of rooms) truly drives the price.
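To make the real estate example concrete, the sketch below simulates house size and number of rooms and computes the variance inflation factor (VIF) by hand as $1/(1 - R^2)$. The VIF is a common diagnostic that the text above does not mention, and all numbers in the simulation (means, slopes, noise levels) are invented for illustration.

```r
set.seed(123)
n     <- 500
size  <- rnorm(n, mean = 120, sd = 30)         # hypothetical floor area
rooms <- size / 25 + rnorm(n, sd = 0.5)        # rooms closely track size
price <- 1000 * size + 5000 * rooms + rnorm(n, sd = 20000)

# Correlation between the two predictors
cor_size_rooms <- cor(size, rooms)

# VIF for size: regress size on the other predictor and use 1 / (1 - R^2)
r2_size  <- summary(lm(size ~ rooms))$r.squared
vif_size <- 1 / (1 - r2_size)

cor_size_rooms
vif_size

# Coefficient table: the standard errors reflect the shared variation
summary(lm(price ~ size + rooms))$coefficients
```

A VIF near 1 means a predictor is almost unrelated to the others; larger values mean its coefficient is estimated with more variance.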





### Testing for invertibility


In [None]:
# seed for reproducibility
set.seed(3)

A <- matrix(rnorm(100),ncol = 10)

# Set the last column to a linear combination of columns 1, 8, and 9
A[, 10] <- A[, 1] * 2 + A[, 8] * 3 + A[, 9]



# Attempt to invert A; this fails because A is singular
B <- solve(A)
B

ERROR: Error in solve.default(A): system is computationally singular: reciprocal condition number = 1.77069e-18


Because column 10 was constructed as a linear combination of three other columns, the matrix is rank deficient (perfect multicollinearity), so `solve()` cannot invert it.
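Rather than letting the cell stop with an error, the singularity can be diagnosed before inverting. The following sketch rebuilds the same matrix and checks its rank and reciprocal condition number. The names `rank_A`, `rcond_A`, and `inverse` are introduced here for illustration.

```r
set.seed(3)
A <- matrix(rnorm(100), ncol = 10)
A[, 10] <- A[, 1] * 2 + A[, 8] * 3 + A[, 9]

# Numerical rank from a QR decomposition: fewer than 10 means rank deficient
rank_A <- qr(A)$rank

# Reciprocal condition number: values near zero signal a (near) singular matrix
rcond_A <- rcond(A)

# Attempt the inversion, returning NULL instead of stopping on failure
inverse <- tryCatch(solve(A), error = function(e) NULL)

rank_A
rcond_A
is.null(inverse)
```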

# A Crash Course in Good and Bad Controls

In this section, we will cover various models that illustrate "good" and "bad" controls in statistical analyses. These terms refer to the effect of adding a variable to a regression model: whether including it reduces bias in the estimated causal effect, introduces bias, or leaves the estimate unbiased while changing its precision. For a more detailed exploration of these concepts, you are encouraged to review the full text available in the technical report [here](https://ftp.cs.ucla.edu/pub/stat_ser/r493.pdf).


# Good Controls 1 (blocking back-door paths)
### Model 1
In this case, we want to measure the effect of hours of study ($X$) on students' academic performance as measured by final grades ($Y$). If we run the regression without taking into account nutrition ($Z$), which affects both $X$ and $Y$, the coefficient on $X$ will be biased because the back-door path $X \leftarrow Z \rightarrow Y$ remains open. Including $Z$ in the model blocks this path, and the estimated effect of hours of study on final grades approaches the true parameter, which is $1$.

In [None]:
# Create a directed graph with 3 vertices and edges Z -> X, Z -> Y, X -> Y
sprinkle <- graph(edges = c(1, 2, 1, 3, 2, 3), n = 3, directed = TRUE)

# Assign names to the vertices (vertex 1 is the confounder Z)
V(sprinkle)$name <- c("Z", "X", "Y")

# Plot the graph in a circular layout
plot(sprinkle, layout = layout_in_circle(sprinkle),
     vertex.size = 25,                  # Size of the vertices
     edge.arrow.size = 0.1,             # Size of the arrowheads
     vertex.color = "skyblue",          # Vertex fill color
     vertex.label.color = "red",        # Color of the vertex labels
     vertex.label.cex = 2,              # Size of the vertex labels
     vertex.label.family = "Helvetica", # Label font family
     edge.width = 0.3,                  # Edge thickness
     edge.color = "blue",               # Edge color
     vertex.shape = "sphere",
     main = "MODEL 1")                  # Graph title


In [None]:
# Set Seed
# To make the results replicable (generating random numbers)
set.seed(1234567)

# Generate data
n <- 1000
Z <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
X <- 2 * Z + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
Y <- X + Z + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)

# Create dataframe
data <- data.frame(Z = Z, X = X, Y = Y)

In [None]:
# Regressions
no_control <- lm(Y ~ X, data = data)
using_control <- lm(Y ~ X + Z, data = data)

# Summary results
summary(no_control)
summary(using_control)
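The size of the bias in the uncontrolled regression can be worked out from the simulation design. With $X = 2Z + e_X$ and $Y = X + Z + e_Y$ (all shocks with unit variance), the slope of $Y$ on $X$ alone converges to $1 + \text{Cov}(X, Z)/\text{Var}(X) = 1 + 2/5 = 1.4$. The sketch below compares that value with the estimates. It regenerates data with plain vectors, and the names `expected_no_control`, `b_no_control`, and `b_using_control` are introduced here for illustration.

```r
set.seed(1234567)
n <- 1000
Z <- rnorm(n)
X <- 2 * Z + rnorm(n)
Y <- X + Z + rnorm(n)

# Population moments implied by the design
var_X  <- 2^2 * 1 + 1     # Var(X) = 4 * Var(Z) + Var(e_X) = 5
cov_XZ <- 2 * 1           # Cov(X, Z) = 2 * Var(Z) = 2

# Omitted-variable formula: true effect + (effect of Z on Y) * Cov(X, Z) / Var(X)
expected_no_control <- 1 + 1 * cov_XZ / var_X

# Estimated slopes on X without and with the control
b_no_control    <- unname(coef(lm(Y ~ X))["X"])
b_using_control <- unname(coef(lm(Y ~ X + Z))["X"])

c(expected_no_control = expected_no_control,
  no_control          = b_no_control,
  using_control       = b_using_control)
```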

### Model 2
In this case, we want to see how much anti-corruption laws ($X$) affect the number of corruption cases ($Y$). There is an unobservable variable that affects both: the feeling of injustice, which drives criminality and increases the number of laws. We therefore control for the number of active judges ($Z$), which lies on the path from the unobservable to $X$, to remove the influence of the unobservable from the estimate. After controlling for $Z$, the estimated coefficient is close to the true parameter ($0.4321\approx 0.5$).



In [None]:
# Define nodes and edges including latent edges as different style.
edges <- c("Z", "X", "X", "Y", "Latent", "Z", "Latent", "Y")
edge_types <- c("observed", "observed", "latent", "latent")

# Create the graph
sprinkler <- graph(edges, directed = TRUE)

# Set vertex names
V(sprinkler)$name <- c("Z", "X", "Y", "Unobservable")

# Assign edge types (latent or observed)
E(sprinkler)$type <- edge_types

# Define edge style based on type
E(sprinkler)$color <- ifelse(E(sprinkler)$type == "observed", "black", "red")
E(sprinkler)$arrow.size <- 0.1
E(sprinkler)$arrow.width <- 2
E(sprinkler)$lty <- ifelse(E(sprinkler)$type == "observed", 1, 2) # Solid for observed, dashed for latent

# Plot graph with curves and customisations
plot(sprinkler, layout = layout_nicely(sprinkler),
     vertex.size = 30,                  # Size of the vertices
     vertex.color = "skyblue",          # Vertex fill color
     vertex.label.color = "black",      # Color of the vertex labels
     vertex.label.cex = 1.5,            # Font size of the vertex labels
     vertex.label.family = "Helvetica", # Font family of the labels
     edge.curved = 0.2,                 # Curvature of the edges
     main = "Model 2")                  # Graph title

In [None]:
# Set Seed
# To make the results replicable (generating random numbers)
set.seed(1234567)

# Generate data
n <- 1000
U <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
Z <- 3 * U + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
X <- Z + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
Y <- 0.5 * X + 2 * U + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)

# Create dataframe
data <- data.frame(U = U, Z = Z, X = X, Y = Y)




In [None]:
# Regressions
no_control <- lm(Y ~ X, data = data)
using_control <- lm(Y ~ X + Z, data = data)

# Summary results
summary(no_control)
summary(using_control)

# Bad Control (M-bias)
### Model 7
Now, if we want to study how education ($X$) affects diabetes ($Y$), we might be tempted to adjust for the mother's diabetes status ($Z$) and treat it as a confounder. However, $Z$ does not cause $X$ or $Y$ directly: it is a collider, a common effect of income during childhood ($U_1$) and the overall risk of diabetes ($U_2$). Adjusting for $Z$ opens a non-causal path between $X$ and $Y$ through $U_1$ and $U_2$, which biases the estimate.

In [None]:
# Define nodes and edges (U1 -> Z, U2 -> Z, U1 -> X, U2 -> Y, X -> Y), matching the simulated data
edges <- c("U1", "Z", "U2", "Z", "U1", "X", "U2", "Y", "X", "Y")

# Create the directed graph with nodes and edges defined
sprinkle <- graph(edges, directed = TRUE)

# Set names for the vertices if needed (optional since they are defined in edges)
V(sprinkle)$name <- c("U1", "Z", "U2", "X", "Y")

# Plot the graph with customizations
plot(sprinkle, layout = layout_nicely(sprinkle),
     vertex.size = 20,            # Adjusts the size of the vertices
     vertex.color = "skyblue",    # Sets the vertex color
     vertex.label.color = "black", # Sets the vertex label color
     vertex.label.cex = 1.2,      # Adjusts the font size of vertex labels
     edge.arrow.size = 0.2,       # Adjusts the size of the arrows to be smaller
     edge.width = 2,              # Adjusts the thickness of the edges
     main = "MODEL 7") # Graph title


In [None]:
# Set Seed
# To make the results replicable (generating random numbers)
set.seed(12345676)

# Generate data
n <- 1000
U_1 <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
U_2 <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)

Z <- U_1 + U_2 + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
X <- 1.5 * U_1 + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
Y <- X + 2.5 * U_2 + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)

# Create dataframe
data <- data.frame(U_1 = U_1, U_2 = U_2, Z = Z, X = X, Y = Y)



In [None]:
# Regressions
no_control <- lm(Y ~ X, data = data)
using_control <- lm(Y ~ X + Z, data = data)

# Summary results
summary(no_control)
summary(using_control)
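The mechanism behind M-bias can be seen in the two unobserved causes themselves. In the design above, $U_1$ and $U_2$ are independent, but conditioning on their common effect $Z$ makes them negatively associated (the population coefficient is $-0.5$ under unit variances). The sketch below illustrates this with freshly simulated vectors. The names `b_marginal` and `b_given_Z` are introduced here for illustration.

```r
set.seed(12345676)
n   <- 1000
U_1 <- rnorm(n)
U_2 <- rnorm(n)
Z   <- U_1 + U_2 + rnorm(n)   # Z is a collider: a common effect of U_1 and U_2

# Without conditioning on Z, U_1 carries no information about U_2
b_marginal <- unname(coef(lm(U_2 ~ U_1))["U_1"])

# Conditioning on Z induces a negative association between U_1 and U_2
b_given_Z <- unname(coef(lm(U_2 ~ U_1 + Z))["U_1"])

c(marginal = b_marginal, given_Z = b_given_Z)
```

Because $U_1$ drives $X$ and $U_2$ drives $Y$, this induced association is what contaminates the coefficient on $X$ once $Z$ enters the regression.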

# Neutral Control (possibly good for precision)
### Model 8
In this case, we want to estimate how the introduction of a new smart learning programme ($X$) affects the academic performance of students ($Y$). Regressing $Y$ on $X$ alone already gives an unbiased estimator. However, if we also include the attendance rate ($Z$), which affects performance but not $X$, the estimator remains unbiased and becomes more precise, because $Z$ absorbs part of the residual variation in $Y$.

In [None]:
# Define nodes and edges
edges <- c("X", "Y", "Z", "Y")

# Create the directed graph with nodes and edges defined
sprinkle <- graph(edges, directed = TRUE)

# Set names for the vertices if needed (optional since they are defined in edges)
V(sprinkle)$name <- c("X", "Y", "Z")

# Plot the graph with customizations
plot(sprinkle, layout = layout_nicely(sprinkle),
     vertex.size = 20,            # Adjusts the size of the vertices
     vertex.color = "skyblue",    # Sets the vertex color
     vertex.label.color = "black", # Sets the vertex label color
     vertex.label.cex = 1.2,      # Adjusts the font size of vertex labels
     edge.arrow.size = 0.2,       # Adjusts the size of the arrows to be smaller
     edge.width = 2,              # Adjusts the thickness of the edges
     main = "MODEL 8") # Graph title


In [None]:
# Set Seed
# To make the results replicable (generating random numbers)
set.seed(12345676)

# Generate data
n <- 1000
Z <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
X <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)
Y <- X + 1.5 * Z + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1)

# Create dataframe
data <- data.frame(Z = Z, X = X, Y = Y)



In [None]:
# Regressions
no_control <- lm(Y ~ X, data = data)
using_control <- lm(Y ~ X + Z, data = data)

# Summary results
summary(no_control)
summary(using_control)
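The gain in precision shows up in the standard error of the coefficient on $X$. The sketch below extracts it from both regressions, using freshly simulated vectors under the same design. The names `se_no_control` and `se_using_control` are introduced here for illustration.

```r
set.seed(12345676)
n <- 1000
Z <- rnorm(n)
X <- rnorm(n)
Y <- X + 1.5 * Z + rnorm(n)

# Standard error of the X coefficient without the control
se_no_control <- summary(lm(Y ~ X))$coefficients["X", "Std. Error"]

# Standard error of the X coefficient when Z is included
se_using_control <- summary(lm(Y ~ X + Z))$coefficients["X", "Std. Error"]

c(no_control = se_no_control, using_control = se_using_control)
```

Under this design the residual variance falls from roughly $1.5^2 + 1 = 3.25$ to $1$ once $Z$ is included, so the standard error should shrink accordingly.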


# Bad Controls (Overcontrol Bias)
### Model 11
In this case, we want to analyse how the growth of the city in terms of urban infrastructure (residential areas, shopping centres, etc.) ($X$) affects the level of vehicle congestion ($Y$). If we add a variable such as the number of cars ($Z$), which is affected by the infrastructure (but not vice versa) and which in turn causes more congestion, we block the very path through which $X$ acts on $Y$. When controlling for $Z$, the estimator of $X$ loses significance and moves away from the true total effect, which is $1.3 \times 1 = 1.3$ in the simulation below ($-0.022\neq 1.3$).


In [None]:
# Define nodes and edges
edges <- c("X", "Z", "Z", "Y")

# Create the directed graph with nodes and edges defined
sprinkle <- graph(edges, directed = TRUE)

# Set names for the vertices
V(sprinkle)$name <- c("X", "Z", "Y")

# Plot the graph with customizations
plot(sprinkle, layout = layout_nicely(sprinkle),
     vertex.size = 20,            # Adjusts the size of the vertices
     vertex.color = "skyblue",    # Sets the vertex color
     vertex.label.color = "black", # Sets the vertex label color
     vertex.label.cex = 1.2,      # Adjusts the font size of vertex labels
     edge.arrow.size = 0.2,       # Adjusts the size of the arrows to be smaller
     edge.width = 2,              # Adjusts the thickness of the edges
     main = "MODEL 11") # Graph title

In [None]:
# Set Seed
# To make the results replicable (generating random numbers)
set.seed(1234567)

# Generate data
n <- 1000
X <- matrix(rnorm(n, mean = 0, sd = 1), ncol = 1) # Generate X
Z <- 1.3 * X + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1) # Generate Z
Y <- Z + matrix(rnorm(n, mean = 0, sd = 1), ncol = 1) # Generate Y

# Create dataframe
data <- data.frame(Z = Z, X = X, Y = Y)

In [None]:
# Regressions
no_control <- lm(Y ~ X, data = data)
using_control <- lm(Y ~ X + Z, data = data)

# Summary results
summary(no_control)
summary(using_control)
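The overcontrol result can be read as a decomposition of the total effect. In OLS, the slope from `Y ~ X` equals the direct coefficient on $X$ from `Y ~ X + Z` plus the coefficient on $Z$ times the slope from `Z ~ X`. The sketch below verifies this identity on freshly simulated vectors. The names `total`, `direct`, `b_Z`, and `a_XZ` are introduced here for illustration.

```r
set.seed(1234567)
n <- 1000
X <- rnorm(n)
Z <- 1.3 * X + rnorm(n)   # X affects the mediator Z
Y <- Z + rnorm(n)         # Z affects Y; X has no direct effect on Y

total  <- unname(coef(lm(Y ~ X))["X"])      # total effect of X on Y
fit_xz <- lm(Y ~ X + Z)
direct <- unname(coef(fit_xz)["X"])         # what remains once Z is held fixed
b_Z    <- unname(coef(fit_xz)["Z"])         # effect of Z on Y
a_XZ   <- unname(coef(lm(Z ~ X))["X"])      # effect of X on Z

c(total = total, direct = direct, indirect = a_XZ * b_Z)
```

Here the indirect part carries essentially the whole effect, so controlling for $Z$ leaves a direct coefficient near zero even though $X$ does change $Y$.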