# R Basic Tutorial

## 1.0 Assignment
Use `<-`, not `=`, for assignment.
The symbol `=` is preferable for arguments in functions. Additionally, it can not be used in some syntax context. More information [here](https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators-in-r).

In [41]:
a = 1
b <- 1

In [42]:
a

In [43]:
b

In [39]:
c <- b = 11

ERROR: Error in c <- b = 5: no se pudo encontrar la función "<-<-"


In [40]:
c <- b <- 11
print(c)

[1] 11


## 2.0 Objects

### Clases

R is a functional programming language. It means that the programs are constructed by applying and composing functions. However, It can use Object Oriented Programming to construct tools for data analysis. Class is the blueprint that helps to create an object and contains its member variable along with the attributes. We have two main classes: S3 and S4. [info](https://www.datacamp.com/community/tutorials/r-objects-and-classes)

### Objects
Objects are the instance of a class. It means that it has some methods that can act upon its attributes. In R, everything is an object. We will cover four main objects: vector, matrix, list and dataframes. All of them comes from a class.

#### Data Types
There are four data types: `character`, `numeric`, `integer` and `boolean`.

In [48]:
a <- 'Hola'
print(class( a ))   # type of object
print(typeof( a )) # how object is stored in memory

[1] "character"
[1] "character"


In [49]:
b <- 20.5
print(class( b ))
b

[1] "numeric"


In [50]:
c <- as.integer(20.5)
c
print(class( c ))


[1] "integer"


In [53]:
# typeof helps us to understand how this object is store in memory.
typeof(b)
# Class helps us to understand the type of a object
class(b)

In [54]:
c1 <- "Machine Learning"
c2 <- "Causal inference"
cat(c1,"  ",c2)

Machine Learning    Causal inference

In [57]:
#install.packages("devtools")
library(glue)
glue('{c1} y {c2} course')

In [65]:
# Boolean variables

log_true <- TRUE
print(class( log_true ))

z1 <- (1==1)

z2 <- (10 > 20)

z3 <- (1==1)

cat(z1,'\n',z2,'\n',z3)

class(as.integer(z3))

typeof(as.integer(z3))

[1] "logical"
TRUE 
 FALSE 
 TRUE

# 3.0 List

A list in R can contain many different data types inside it. A list is a collection of data which is ordered and changeable.

#### Lists
This data structure does not require that all members be of the same data type.

In [None]:
#install.packages("rlist")
library(rlist)

In [126]:
list_1 <- list( "good", "bad", "ugly","good", "bad", 1)
list_1

In [127]:
list_1 <- append(list_1, "bad")
list_1

In [109]:
# factor grouped categorical variables

fac_2 <- factor( c( "good", "bad", "ugly","good", "bad", "ugly", 5 ) )
fac_2

In [115]:
fac <- factor( c( "good", "bad", "ugly","good", "bad", "ugly" ) )
print( fac )
# Type of class
typeof( fac )
class( fac )

[1] good bad  ugly good bad  ugly
Levels: bad good ugly


In [116]:
# Levels o categories 
levels( fac )
# Number of Levels
nlevels( fac )

In [76]:
# Check the variables that you defined
ls()

Basic data structures in R include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).

In [97]:
mylist <- list( num_UEFA = c( 13 , 7 , 6 ) , clubs = c( "Real Madrid" , "AC Milan" , "Liverpool FC" ) , 
               last_year = c( 2018 , 2007 , 2019 ) )

mylist

In [99]:
mylist$last_year

In [34]:
# Indexing vectors
mylist[[3]]

In [35]:
# Indexing group of vectors
mylist[[3]][2:3]

In [36]:
# Indexing group of vectors
mylist[3][1]

In [102]:
list.remove(mylist,"last_year")

In [105]:
Postal_code <- append(mylist, 4000)
Postal_code

#### Vectors
This data structure requires that all members be of the same data type.

In [77]:
vec_str <-  c( "good", "bad", "ugly","good", "bad")
print( class( vec_str ) )
print( is.vector( vec_str ) )
print( length( vec_str ) )

[1] "character"
[1] TRUE
[1] 5


In [78]:
vec <- c( 2 , 3 , 4 , 5 , 6 , 4 )
vec 

In [79]:
print( class( vec ) )
print( is.vector( vec ) )
print( length( vec ) )

[1] "numeric"
[1] TRUE
[1] 6


In [81]:
sec_1_20 <- seq(1,20,2)
sec_1_20

In [82]:
sec_1_9 <- seq(1,10)
sec_1_9

#### Indexing

In [87]:
index_vec  <- vec[ c( 3, 1, 6, 2 ) ]
class(index_vec)

In [84]:
print(index_vec)
print(index_vec[ -3 ])

[1] 4 2 4 3
[1] 4 2 3


In [88]:
print( index_vec[ index_vec < 3 ] )

[1] 2


In [89]:
print( index_vec[ 2:4 ] )

[1] 2 4 3


#### Matrix
This data structure requires that all members be of the same data type.

In [128]:
A = matrix( c(3, 5, 2, 6, 7, 1, 6, 3, 7 ) , nrow = 3, ncol = 3 , byrow = FALSE, 
            dimnames = list( c( "rowA" , "rowB" , "rowC" ), c( "colA" , "colB" , "colC" ) ) )
print( A )

     colA colB colC
rowA    3    6    6
rowB    5    7    3
rowC    2    1    7


In [129]:
A = matrix( c(3, 5, 2, 6, 7, 1, 6, 3, 7 ) , nrow = 3, ncol = 3 , byrow = TRUE, 
            dimnames = list( c( "rowA" , "rowB" , "rowC" ), c( "colA" , "colB" , "colC" ) ) )
print( A )

     colA colB colC
rowA    3    5    2
rowB    6    7    1
rowC    6    3    7


In [137]:
class(A)
typeof(A)

In [131]:
dim(A)

In [130]:
cat("rows: ", dim(A)[1], '\n', "Columns: ", dim(A)[2])

rows:  3 
 Columns:  3

In [93]:
B = A %*% solve( A )
B

Unnamed: 0,rowA,rowB,rowC
rowA,1.0,0,5.5511150000000004e-17
rowB,-3.885781e-16,1,2.775558e-17
rowC,-4.440892e-16,0,1.0


In [94]:
# Solve for getting inverse matrix
# %*% Matrix Multiplication
# round output of the matrix
B = round( A %*% solve( A ) , 2 )
B

Unnamed: 0,rowA,rowB,rowC
rowA,1,0,0
rowB,0,1,0
rowC,0,0,1


In [95]:
print( diag( B ) )  


rowA rowB rowC 
   1    1    1 


In [96]:
print( t( A ) )

     rowA rowB rowC
colA    3    6    6
colB    5    7    3
colC    2    1    7


In [133]:
A <- matrix(c(seq(0, 9), seq( 10, 19), seq( 30, 39), seq( -20, -11), seq( 2, 20,2)), nrow = 5, byrow =TRUE)
print(A)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    1    2    3    4    5    6    7    8     9
[2,]   10   11   12   13   14   15   16   17   18    19
[3,]   30   31   32   33   34   35   36   37   38    39
[4,]  -20  -19  -18  -17  -16  -15  -14  -13  -12   -11
[5,]    2    4    6    8   10   12   14   16   18    20


In [134]:
A[2:4,] # rows selecrtion

0,1,2,3,4,5,6,7,8,9
10,11,12,13,14,15,16,17,18,19
30,31,32,33,34,35,36,37,38,39
-20,-19,-18,-17,-16,-15,-14,-13,-12,-11


In [135]:
A[,1:6]  # columns selecrtion

0,1,2,3,4,5
0,1,2,3,4,5
10,11,12,13,14,15
30,31,32,33,34,35
-20,-19,-18,-17,-16,-15
2,4,6,8,10,12


In [138]:
M1 <- matrix(0,8,2)

print(M1)

     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0
[4,]    0    0
[5,]    0    0
[6,]    0    0
[7,]    0    0
[8,]    0    0


In [139]:
M2 <- matrix(1,8, 4) 
print(M2)

     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
[4,]    1    1    1    1
[5,]    1    1    1    1
[6,]    1    1    1    1
[7,]    1    1    1    1
[8,]    1    1    1    1


In [140]:
M3 <- cbind(M1,M2)
M3

0,1,2,3,4,5
0,0,1,1,1,1
0,0,1,1,1,1
0,0,1,1,1,1
0,0,1,1,1,1
0,0,1,1,1,1
0,0,1,1,1,1
0,0,1,1,1,1
0,0,1,1,1,1


In [141]:
M4 = matrix(c(2,2,3,4,5,1,1,5,5,9,8,2), nrow =2, byrow = TRUE)
print(M4)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    2    3    4    5    1
[2,]    1    5    5    9    8    2


In [142]:
M5 <- rbind(M3,M4)
print(M5)

      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    0    0    1    1    1    1
 [2,]    0    0    1    1    1    1
 [3,]    0    0    1    1    1    1
 [4,]    0    0    1    1    1    1
 [5,]    0    0    1    1    1    1
 [6,]    0    0    1    1    1    1
 [7,]    0    0    1    1    1    1
 [8,]    0    0    1    1    1    1
 [9,]    2    2    3    4    5    1
[10,]    1    5    5    9    8    2


#### DataFrame
This data structure does not require that all members be of the same data type.

In [143]:
Student_Name <- c("Amy", "Bob", "Chuck", "Daisy", "Ellie", "Frank", 
                  "George", "Helen")

Age <- c(27, 55, 34, 42, 20, 27, 34, 42)

Gender <- c("F", "M", "M", "F", "F", "M", "M", "F")

GPA <- c(3.26, 3.75, 2.98, 3.40, 2.75, 3.32, 3.68, 3.97)

nsc <- data.frame(Student_Name, Age, Gender, GPA)   # Naming the data frame
nsc # Generates the data frame

Student_Name,Age,Gender,GPA
<chr>,<dbl>,<chr>,<dbl>
Amy,27,F,3.26
Bob,55,M,3.75
Chuck,34,M,2.98
Daisy,42,F,3.4
Ellie,20,F,2.75
Frank,27,M,3.32
George,34,M,3.68
Helen,42,F,3.97


In [144]:
# Lists variables
names(nsc)   

In [145]:
select_col <- c(1,3)
select_row <- c(1,5)

In [146]:
nsc[select_row, select_col]

Unnamed: 0_level_0,Student_Name,Gender
Unnamed: 0_level_1,<chr>,<chr>
1,Amy,F
5,Ellie,F


In [147]:
# indexing dataframes
nsc[3:5 , 2:4]   

Unnamed: 0_level_0,Age,Gender,GPA
Unnamed: 0_level_1,<dbl>,<chr>,<dbl>
3,34,M,2.98
4,42,F,3.4
5,20,F,2.75


In [148]:
# indexing dataframes
nsc[ c( 1 , 2 , 4, 6 ) , c( 3 , 2 ) ]

Unnamed: 0_level_0,Gender,Age
Unnamed: 0_level_1,<chr>,<dbl>
1,F,27
2,M,55
4,F,42
6,M,27


In [149]:
nsc$Student_Name

## Load  files : CSV, Excel
## Working with DataFrames

In [2]:
#install.packages("readxl")
library(readxl)

In [9]:
base <- read_excel("../data/base1.xlsx", sheet = "data")
head(base)

New names:
* `` -> ...1



...1,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,...,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
10,9.615385,2.263364,1,0,0,0,1,0,0,...,0,1,7,0.49,0.343,0.2401,3600,11,8370,18
12,48.076923,3.872802,0,0,0,0,1,0,0,...,0,1,31,9.61,29.791,92.3521,3050,10,5070,9
15,11.057692,2.403126,0,0,1,0,0,0,0,...,0,1,18,3.24,5.832,10.4976,6260,19,770,4
18,13.942308,2.634928,1,0,0,0,0,1,0,...,0,1,25,6.25,15.625,39.0625,420,1,6990,12
19,28.846154,3.361977,1,0,0,0,1,0,0,...,0,1,22,4.84,10.648,23.4256,2015,6,9470,22
30,11.730769,2.462215,1,0,0,0,1,0,0,...,0,1,1,0.01,0.001,0.0001,1650,5,7460,14


In [6]:
dim(base)

In [28]:
base1 <- read.csv("../data/base0.csv")
head(base1)

Unnamed: 0_level_0,X,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,...,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
Unnamed: 0_level_1,<int>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>
1,10,9.615385,2.263364,1,0,0,0,1,0,0,...,0,1,7,0.49,0.343,0.2401,3600,11,8370,18
2,12,48.076923,3.872802,0,0,0,0,1,0,0,...,0,1,31,9.61,29.791,92.3521,3050,10,5070,9
3,15,11.057692,2.403126,0,0,1,0,0,0,0,...,0,1,18,3.24,5.832,10.4976,6260,19,770,4
4,18,13.942308,2.634928,1,0,0,0,0,1,0,...,0,1,25,6.25,15.625,39.0625,420,1,6990,12
5,19,28.846154,3.361977,1,0,0,0,1,0,0,...,0,1,22,4.84,10.648,23.4256,2015,6,9470,22
6,30,11.730769,2.462215,1,0,0,0,1,0,0,...,0,1,1,0.01,0.001,0.0001,1650,5,7460,14


In [29]:
sapply(base, typeof)

In [30]:
str(base)

tibble [5,150 x 21] (S3: tbl_df/tbl/data.frame)
 $ ...1 : num [1:5150] 10 12 15 18 19 30 43 44 47 71 ...
 $ wage : num [1:5150] 9.62 48.08 11.06 13.94 28.85 ...
 $ lwage: num [1:5150] 2.26 3.87 2.4 2.63 3.36 ...
 $ sex  : num [1:5150] 1 0 0 1 1 1 1 0 1 1 ...
 $ shs  : num [1:5150] 0 0 0 0 0 0 0 0 0 0 ...
 $ hsg  : num [1:5150] 0 0 1 0 0 0 1 1 1 0 ...
 $ scl  : num [1:5150] 0 0 0 0 0 0 0 0 0 0 ...
 $ clg  : num [1:5150] 1 1 0 0 1 1 0 0 0 1 ...
 $ ad   : num [1:5150] 0 0 0 1 0 0 0 0 0 0 ...
 $ mw   : num [1:5150] 0 0 0 0 0 0 0 0 0 0 ...
 $ so   : num [1:5150] 0 0 0 0 0 0 0 0 0 0 ...
 $ we   : num [1:5150] 0 0 0 0 0 0 0 0 0 0 ...
 $ ne   : num [1:5150] 1 1 1 1 1 1 1 1 1 1 ...
 $ exp1 : num [1:5150] 7 31 18 25 22 1 42 37 31 4 ...
 $ exp2 : num [1:5150] 0.49 9.61 3.24 6.25 4.84 ...
 $ exp3 : num [1:5150] 0.343 29.791 5.832 15.625 10.648 ...
 $ exp4 : num [1:5150] 0.24 92.35 10.5 39.06 23.43 ...
 $ occ  : num [1:5150] 3600 3050 6260 420 2015 ...
 $ occ2 : num [1:5150] 11 10 19 1 6 5 17 17 13

In [31]:
#install.packages("dplyr")
library(dplyr)

In [32]:
base1 <- base1 %>% rename(salario = wage, id = X)

head(base1)

Unnamed: 0_level_0,id,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,...,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
Unnamed: 0_level_1,<int>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>
1,10,9.615385,2.263364,1,0,0,0,1,0,0,...,0,1,7,0.49,0.343,0.2401,3600,11,8370,18
2,12,48.076923,3.872802,0,0,0,0,1,0,0,...,0,1,31,9.61,29.791,92.3521,3050,10,5070,9
3,15,11.057692,2.403126,0,0,1,0,0,0,0,...,0,1,18,3.24,5.832,10.4976,6260,19,770,4
4,18,13.942308,2.634928,1,0,0,0,0,1,0,...,0,1,25,6.25,15.625,39.0625,420,1,6990,12
5,19,28.846154,3.361977,1,0,0,0,1,0,0,...,0,1,22,4.84,10.648,23.4256,2015,6,9470,22
6,30,11.730769,2.462215,1,0,0,0,1,0,0,...,0,1,1,0.01,0.001,0.0001,1650,5,7460,14


In [33]:
base1 <- base1 %>% select(-c(id,ind2))

In [34]:
head(base1)

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7,0.49,0.343,0.2401,3600,11,8370
2,48.076923,3.872802,0,0,0,0,1,0,0,0,0,1,31,9.61,29.791,92.3521,3050,10,5070
3,11.057692,2.403126,0,0,1,0,0,0,0,0,0,1,18,3.24,5.832,10.4976,6260,19,770
4,13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25,6.25,15.625,39.0625,420,1,6990
5,28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22,4.84,10.648,23.4256,2015,6,9470
6,11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1,0.01,0.001,0.0001,1650,5,7460


In [48]:
head(base1 %>%   slice(-c(1,3)) )

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,48.07692,3.872802,0,0,0,0,1,0,0,0,0,1,31,9.61,29.791,92.3521,3050,10,5070
2,13.94231,2.634928,1,0,0,0,0,1,0,0,0,1,25,6.25,15.625,39.0625,420,1,6990
3,28.84615,3.361977,1,0,0,0,1,0,0,0,0,1,22,4.84,10.648,23.4256,2015,6,9470
4,11.73077,2.462215,1,0,0,0,1,0,0,0,0,1,1,0.01,0.001,0.0001,1650,5,7460
5,19.23077,2.956512,1,0,1,0,0,0,0,0,0,1,42,17.64,74.088,311.1696,5120,17,7280
6,19.23077,2.956512,0,0,1,0,0,0,0,0,0,1,37,13.69,50.653,187.4161,5240,17,5680


## Slicing  dataframe 

In [38]:
base1[1:10,]

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7,0.49,0.343,0.2401,3600,11,8370
2,48.076923,3.872802,0,0,0,0,1,0,0,0,0,1,31,9.61,29.791,92.3521,3050,10,5070
3,11.057692,2.403126,0,0,1,0,0,0,0,0,0,1,18,3.24,5.832,10.4976,6260,19,770
4,13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25,6.25,15.625,39.0625,420,1,6990
5,28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22,4.84,10.648,23.4256,2015,6,9470
6,11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1,0.01,0.001,0.0001,1650,5,7460
7,19.230769,2.956512,1,0,1,0,0,0,0,0,0,1,42,17.64,74.088,311.1696,5120,17,7280
8,19.230769,2.956512,0,0,1,0,0,0,0,0,0,1,37,13.69,50.653,187.4161,5240,17,5680
9,12.0,2.484907,1,0,1,0,0,0,0,0,0,1,31,9.61,29.791,92.3521,4040,13,8590
10,19.230769,2.956512,1,0,0,0,1,0,0,0,0,1,4,0.16,0.064,0.0256,3255,10,8190


In [39]:
base1[,1:5]

salario,lwage,sex,shs,hsg
<dbl>,<dbl>,<int>,<int>,<int>
9.615385,2.263364,1,0,0
48.076923,3.872802,0,0,0
11.057692,2.403126,0,0,1
13.942308,2.634928,1,0,0
28.846154,3.361977,1,0,0
11.730769,2.462215,1,0,0
19.230769,2.956512,1,0,1
19.230769,2.956512,0,0,1
12.000000,2.484907,1,0,1
19.230769,2.956512,1,0,0


In [42]:
base1[1:10,] %>% select(salario, lwage)


Unnamed: 0_level_0,salario,lwage
Unnamed: 0_level_1,<dbl>,<dbl>
1,9.615385,2.263364
2,48.076923,3.872802
3,11.057692,2.403126
4,13.942308,2.634928
5,28.846154,3.361977
6,11.730769,2.462215
7,19.230769,2.956512
8,19.230769,2.956512
9,12.0,2.484907
10,19.230769,2.956512


In [57]:
select(base1[1:10,], salario,lwage)

Unnamed: 0_level_0,salario,lwage
Unnamed: 0_level_1,<dbl>,<dbl>
1,9.615385,2.263364
2,48.076923,3.872802
3,11.057692,2.403126
4,13.942308,2.634928
5,28.846154,3.361977
6,11.730769,2.462215
7,19.230769,2.956512
8,19.230769,2.956512
9,12.0,2.484907
10,19.230769,2.956512


In [46]:
base1 %>% filter(sex == 1)

salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7.0,0.4900,0.343000,0.24010000,3600,11,8370
13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25.0,6.2500,15.625000,39.06250000,420,1,6990
28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22.0,4.8400,10.648000,23.42560000,2015,6,9470
11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1.0,0.0100,0.001000,0.00010000,1650,5,7460
19.230769,2.956512,1,0,1,0,0,0,0,0,0,1,42.0,17.6400,74.088000,311.16960000,5120,17,7280
12.000000,2.484907,1,0,1,0,0,0,0,0,0,1,31.0,9.6100,29.791000,92.35210000,4040,13,8590
19.230769,2.956512,1,0,0,0,1,0,0,0,0,1,4.0,0.1600,0.064000,0.02560000,3255,10,8190
17.307692,2.851151,1,0,1,0,0,0,0,0,0,1,7.0,0.4900,0.343000,0.24010000,4020,13,8270
12.019231,2.486508,1,0,0,1,0,0,0,0,0,1,5.5,0.3025,0.166375,0.09150625,3600,11,8270
13.461538,2.599837,1,0,0,1,0,0,0,0,0,1,20.5,4.2025,8.615125,17.66100625,3645,11,8190


In [47]:
head(base1 %>% filter(exp1 > 10))

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,48.07692,3.872802,0,0,0,0,1,0,0,0,0,1,31,9.61,29.791,92.3521,3050,10,5070
2,11.05769,2.403126,0,0,1,0,0,0,0,0,0,1,18,3.24,5.832,10.4976,6260,19,770
3,13.94231,2.634928,1,0,0,0,0,1,0,0,0,1,25,6.25,15.625,39.0625,420,1,6990
4,28.84615,3.361977,1,0,0,0,1,0,0,0,0,1,22,4.84,10.648,23.4256,2015,6,9470
5,19.23077,2.956512,1,0,1,0,0,0,0,0,0,1,42,17.64,74.088,311.1696,5120,17,7280
6,19.23077,2.956512,0,0,1,0,0,0,0,0,0,1,37,13.69,50.653,187.4161,5240,17,5680


In [51]:
head(base1 %>% filter(sex == 1 & exp1 > 10))

head(base1  %>% filter(sex == 1 | exp1 > 10))

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,13.94231,2.634928,1,0,0,0,0,1,0,0,0,1,25.0,6.25,15.625,39.0625,420,1,6990
2,28.84615,3.361977,1,0,0,0,1,0,0,0,0,1,22.0,4.84,10.648,23.4256,2015,6,9470
3,19.23077,2.956512,1,0,1,0,0,0,0,0,0,1,42.0,17.64,74.088,311.1696,5120,17,7280
4,12.0,2.484907,1,0,1,0,0,0,0,0,0,1,31.0,9.61,29.791,92.3521,4040,13,8590
5,13.46154,2.599837,1,0,0,1,0,0,0,0,0,1,20.5,4.2025,8.615125,17.66101,3645,11,8190
6,16.34615,2.793993,1,0,0,0,1,0,0,0,0,1,25.0,6.25,15.625,39.0625,110,1,7870


Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7,0.49,0.343,0.2401,3600,11,8370
2,48.076923,3.872802,0,0,0,0,1,0,0,0,0,1,31,9.61,29.791,92.3521,3050,10,5070
3,11.057692,2.403126,0,0,1,0,0,0,0,0,0,1,18,3.24,5.832,10.4976,6260,19,770
4,13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25,6.25,15.625,39.0625,420,1,6990
5,28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22,4.84,10.648,23.4256,2015,6,9470
6,11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1,0.01,0.001,0.0001,1650,5,7460


In [56]:
head(filter(base1, sex == 1 & exp1 > 10))
head(filter(base1, sex == 1 | exp1 > 10))

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,13.94231,2.634928,1,0,0,0,0,1,0,0,0,1,25.0,6.25,15.625,39.0625,420,1,6990
2,28.84615,3.361977,1,0,0,0,1,0,0,0,0,1,22.0,4.84,10.648,23.4256,2015,6,9470
3,19.23077,2.956512,1,0,1,0,0,0,0,0,0,1,42.0,17.64,74.088,311.1696,5120,17,7280
4,12.0,2.484907,1,0,1,0,0,0,0,0,0,1,31.0,9.61,29.791,92.3521,4040,13,8590
5,13.46154,2.599837,1,0,0,1,0,0,0,0,0,1,20.5,4.2025,8.615125,17.66101,3645,11,8190
6,16.34615,2.793993,1,0,0,0,1,0,0,0,0,1,25.0,6.25,15.625,39.0625,110,1,7870


Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7,0.49,0.343,0.2401,3600,11,8370
2,48.076923,3.872802,0,0,0,0,1,0,0,0,0,1,31,9.61,29.791,92.3521,3050,10,5070
3,11.057692,2.403126,0,0,1,0,0,0,0,0,0,1,18,3.24,5.832,10.4976,6260,19,770
4,13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25,6.25,15.625,39.0625,420,1,6990
5,28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22,4.84,10.648,23.4256,2015,6,9470
6,11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1,0.01,0.001,0.0001,1650,5,7460


In [53]:
head(arrange(base1, exp1, salario))

Unnamed: 0_level_0,salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
1,4.25,1.446919,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1860,5,7870
2,5.128205,1.634756,1,0,0,0,0,1,1,0,0,0,0,0,0,0,9620,22,4970
3,9.135769,2.212197,0,0,0,0,0,1,0,1,0,0,0,0,0,0,2310,8,7860
4,9.615385,2.263364,1,0,0,0,0,1,0,0,1,0,0,0,0,0,2430,8,7860
5,9.615385,2.263364,1,0,0,0,0,1,0,0,1,0,0,0,0,0,3160,10,8090
6,10.025,2.305082,1,0,0,0,0,1,0,0,0,1,0,0,0,0,2540,8,7860


In [62]:
base1$dummy = "NA"
base1

salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,dummy
<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<chr>
9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7.0,0.4900,0.343000,0.24010000,3600,11,8370,
48.076923,3.872802,0,0,0,0,1,0,0,0,0,1,31.0,9.6100,29.791000,92.35210000,3050,10,5070,
11.057692,2.403126,0,0,1,0,0,0,0,0,0,1,18.0,3.2400,5.832000,10.49760000,6260,19,770,
13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25.0,6.2500,15.625000,39.06250000,420,1,6990,
28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22.0,4.8400,10.648000,23.42560000,2015,6,9470,
11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1.0,0.0100,0.001000,0.00010000,1650,5,7460,
19.230769,2.956512,1,0,1,0,0,0,0,0,0,1,42.0,17.6400,74.088000,311.16960000,5120,17,7280,
19.230769,2.956512,0,0,1,0,0,0,0,0,0,1,37.0,13.6900,50.653000,187.41610000,5240,17,5680,
12.000000,2.484907,1,0,1,0,0,0,0,0,0,1,31.0,9.6100,29.791000,92.35210000,4040,13,8590,
19.230769,2.956512,1,0,0,0,1,0,0,0,0,1,4.0,0.1600,0.064000,0.02560000,3255,10,8190,


In [64]:
base1$dummy[base1$exp1 > 10] <- 1
base1

salario,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,dummy
<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<chr>
9.615385,2.263364,1,0,0,0,1,0,0,0,0,1,7.0,0.4900,0.343000,0.24010000,3600,11,8370,
48.076923,3.872802,0,0,0,0,1,0,0,0,0,1,31.0,9.6100,29.791000,92.35210000,3050,10,5070,1
11.057692,2.403126,0,0,1,0,0,0,0,0,0,1,18.0,3.2400,5.832000,10.49760000,6260,19,770,1
13.942308,2.634928,1,0,0,0,0,1,0,0,0,1,25.0,6.2500,15.625000,39.06250000,420,1,6990,1
28.846154,3.361977,1,0,0,0,1,0,0,0,0,1,22.0,4.8400,10.648000,23.42560000,2015,6,9470,1
11.730769,2.462215,1,0,0,0,1,0,0,0,0,1,1.0,0.0100,0.001000,0.00010000,1650,5,7460,
19.230769,2.956512,1,0,1,0,0,0,0,0,0,1,42.0,17.6400,74.088000,311.16960000,5120,17,7280,1
19.230769,2.956512,0,0,1,0,0,0,0,0,0,1,37.0,13.6900,50.653000,187.41610000,5240,17,5680,1
12.000000,2.484907,1,0,1,0,0,0,0,0,0,1,31.0,9.6100,29.791000,92.35210000,4040,13,8590,1
19.230769,2.956512,1,0,0,0,1,0,0,0,0,1,4.0,0.1600,0.064000,0.02560000,3255,10,8190,


## If condition
The body of the if condition is excuted if the `test_expression` is `TRUE`. The ouput of the test expression should be a `boolean`variable. 

<img src="if-statement.jpg" alt="image info" />


The structure of the code is the following: <br><br>

<font size="4">
if <font color='green'>(test expresion)</font>{<br>
&nbsp;&nbsp;&nbsp;&nbsp;Code to excute<br>
}</font>

The function **if** tests the veracity of a logic expression. The result of test statement should be a **<font color='red'>boolean</font>**. In other words, the output of the test statemen must be **<font color='red'>TRUE</font>** or **<font color='red'>FALSE</font>**. To sum, any function that its output is **boolean** can be used as a test expression in the **if** function. 

In [151]:
x <- -4

if(x > 0){
print("Non-negative number")
}

### Tests more than 1 expression

We will use **else if**. This function allows us to add more test expressions.  <br><br>

<font size="3">
if <font color='green'>(Test expression 1)</font>{ <br>
&nbsp;&nbsp;&nbsp;&nbsp;Code1<br><br>
} else if <font color='green'>(Test expression 2)</font>{<br>
&nbsp;&nbsp;&nbsp;&nbsp;Code2<br><br>
} else if <font color='green'>(Test expression 3)</font>{<br>
&nbsp;&nbsp;&nbsp;&nbsp;Code3<br><br>
} else if <font color='green'>(Test expression 4)</font>{<br>
&nbsp;&nbsp;&nbsp;&nbsp;Code4<br><br>
} else if <font color='green'>(Test expression 5)</font>{<br>
&nbsp;&nbsp;&nbsp;&nbsp;Code5<br><br>
}else{<br>&nbsp;&nbsp;&nbsp;&nbsp;Code6<br><br>
}</font>

R will read the conditions of the test in order. If **Test expression 1** is `TRUE`, the rest of the test expressions will not be evaluated. R execute **Code2** and will not test the next conditions. <br>

In case no Test expression is `TRUE`, the **Code6** will be excuted.

In [5]:
if (z = 3 > 9) {
    print('xsd')
}

ERROR: Error in parse(text = x, srcfile = src): <text>:1:7: inesperado '='
1: if (z =
          ^


In [7]:
if ((z <- 12) >= 9){
    print('xsd')
}

[1] "xsd"


In [155]:
x <- 5

if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else {
print("Zero")}

[1] "Positive number"


## Loops
A for loop is used for iterating over a sequence. It has the following structure:

<img src="for_loop.jpg" alt="image info" />



In [171]:
age = c( 20 , 27 , 31 , 25 , 28 )
years = c( 2023 , 2021 , 2022 , 2026 , 2027, 2028 )
length

In [172]:
age_finish <- numeric( length = length( age ) )
age_finish

In [173]:
for ( i in c( 1:length( age ) ) ) {
  print(i + 1)
}

[1] 2
[1] 3
[1] 4
[1] 5
[1] 6


In [174]:
for ( i in c( 1:length( age ) ) ) {
  age_finish[ i ] = age[ i ] + years[ i ] - 2021
}
print( age_finish )

[1] 22 27 32 30 34


## Functions in R

- `Arguments` − Elements that the function will use to make operations. Arguments are optional and can have default values.

- `Function Body` − This defines what your function does.
- `return` − Specifies the variable that will be the output of the function.

Return Value − The return value of a function is the last expression in the function body to be evaluated.
<br><br>
<font size="4">
function <font color='green'>(arg_1, arg_2, ...) </font>{<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;Function body <br>
    <br> &nbsp;&nbsp;&nbsp;&nbsp;return <br>
}</font>


In [191]:
demean<- function(x){ 
    new_var = x - mean(x)
    new_var_2 = new_var^4
    return( new_var_2 )
}

In [192]:
vector_2 = c(2 , 3, 4)

In [193]:
demean( vector_2 )

## Linear Regressions

## Packages

In [84]:
install.packages( "glmnet" )

Installing package into 'C:/Users/Roberto Carlos/Documents/R/win-library/4.1'
(as 'lib' is unspecified)



package 'glmnet' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Roberto Carlos\AppData\Local\Temp\Rtmp04Br6W\downloaded_packages


In [85]:
library(glmnet)

Loading required package: Matrix

Loaded glmnet 4.1-3



In [87]:

# Reading data and converting to dataframe
Penn <- as.data.frame(read.table("../data/penn_jae.dat", header=T ))
dim(Penn)

In [88]:
#Number of rows
n <- dim(Penn)[1]

# Number of columns
p_1 <- dim(Penn)[2]

In [89]:
# Filtering data to tg==4 | tg==0
Penn<- subset(Penn, tg==4 | tg==0)

# we are making the columns of the data frame as vectors available with its own name
attach(Penn)

In [90]:
T4 <- (tg==4)
typeof(T4)

In [91]:
summary(T4)

   Mode   FALSE    TRUE 
logical    3354    1745 

### Formula
For more information [link1](https://thomasleeper.com/Rcourse/Tutorials/formulae.html), [link2](https://www.datacamp.com/community/tutorials/r-formula-tutorial).

In [92]:
install.packages( "lmtest" )
install.packages( "sandwich" )
library(lmtest)
library(sandwich)
# Suggestion for update your R version

Installing package into 'C:/Users/Roberto Carlos/Documents/R/win-library/4.1'
(as 'lib' is unspecified)




  There is a binary version available but the source version is later:
       binary source needs_compilation
lmtest 0.9-39 0.9-40              TRUE

  Binaries will be installed
package 'lmtest' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Roberto Carlos\AppData\Local\Temp\Rtmp04Br6W\downloaded_packages


Installing package into 'C:/Users/Roberto Carlos/Documents/R/win-library/4.1'
(as 'lib' is unspecified)



package 'sandwich' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Roberto Carlos\AppData\Local\Temp\Rtmp04Br6W\downloaded_packages


Loading required package: zoo


Attaching package: 'zoo'


The following objects are masked from 'package:base':

    as.Date, as.Date.numeric




In [109]:
# Formula regression
# Generation of formulas for regressions
# The class formula

formula1 <- T4 ~ female

formula2 <-  T4 ~ female + black

# We include only the interaction

formula3 <-  T4 ~ female + black + female:black

# We include the two terms and their interaction

formula4 <-  T4 ~ female*black

# drop intercep

formula5 <-  T4 ~ -1 + female + black + female:black

formula5 <-  T4 ~ 0 + female + black + female:black

# polinomial independent varibles

formula6 <- T4 ~ -1 + female + black + female:black + lusd + lusd^2 

# interaction effects & categorical variables 

formula7 <-  T4~(female+black+othrace+factor(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)^2

In [94]:
all.vars(formula1)

In [95]:
# We can use multiple independent variables b simply separating them with the plus symbol(+)

formula2 <- (T4 ~ female + black )
# See the variables in the formula
all.vars( formula2 )

In [96]:
formula3 <- (T4 ~ female - black )
print( terms( formula3 ) )

T4 ~ female - black
attr(,"variables")
list(T4, female, black)
attr(,"factors")
       female
T4          0
female      1
black       0
attr(,"term.labels")
[1] "female"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>


In [97]:
# Interactions terms
# We include the two terms and their interaction
formula5  <- (T4 ~ female * black)
print( terms( formula5 ) )


T4 ~ female * black
attr(,"variables")
list(T4, female, black)
attr(,"factors")
       female black female:black
T4          0     0            0
female      1     0            1
black       0     1            1
attr(,"term.labels")
[1] "female"       "black"        "female:black"
attr(,"order")
[1] 1 1 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>


In [98]:

# We include only the interaction
formula6  <- (T4 ~ female:black)
print( terms( formula6 ) )

T4 ~ female:black
attr(,"variables")
list(T4, female, black)
attr(,"factors")
       female:black
T4                0
female            2
black             2
attr(,"term.labels")
[1] "female:black"
attr(,"order")
[1] 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>


In [99]:
# Factor in regression Analysis
# dep is a factor variable but as defaul it is a vector

#dep[ 4:20 ]
class(dep)

#factor(dep) == as.factor( dep )

# we need to specify that dep is a factor variable in a regression formula to not be treated as a numeric vector
## by default, the factor's first level treated as a baseline
formula6  <- (T4 ~  female * black + factor( dep ) )
print( terms( formula6 ) )

formula7  <- (T4 ~  female * black + dep )
print( terms( formula7 ) )

T4 ~ female * black + factor(dep)
attr(,"variables")
list(T4, female, black, factor(dep))
attr(,"factors")
            female black factor(dep) female:black
T4               0     0           0            0
female           1     0           0            1
black            0     1           0            1
factor(dep)      0     0           1            0
attr(,"term.labels")
[1] "female"       "black"        "factor(dep)"  "female:black"
attr(,"order")
[1] 1 1 1 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
T4 ~ female * black + dep
attr(,"variables")
list(T4, female, black, dep)
attr(,"factors")
       female black dep female:black
T4          0     0   0            0
female      1     0   0            1
black       0     1   0            1
dep         0     0   1            0
attr(,"term.labels")
[1] "female"       "black"        "dep"          "female:black"
attr(,"order")
[1] 1 1 1 2
attr(,"intercept")
[1] 1
attr(,"response")
[1

In [111]:
# regression

reg <- lm(formula1, Penn)
summary( reg1 )

summary( reg )$coefficient
summary( reg )$r.squared


Call:
lm(formula = formula3, data = Penn)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3736 -0.3460 -0.3354  0.6540  0.6751 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.346010   0.009163  37.761   <2e-16 ***
female       -0.010634   0.014471  -0.735    0.462    
black        -0.021080   0.026735  -0.789    0.430    
female:black  0.059289   0.041109   1.442    0.149    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4745 on 5095 degrees of freedom
Multiple R-squared:  0.0004269,	Adjusted R-squared:  -0.0001616 
F-statistic: 0.7254 on 3 and 5095 DF,  p-value: 0.5367


Unnamed: 0,Estimate,Std. Error,t value,Pr(>|t|)
(Intercept),0.343534057,0.008608186,39.9078351,2.5873329999999998e-303
female,-0.003242795,0.013543176,-0.2394413,0.8107731


In [112]:
reg <- lm(formula3, Penn)

summary( reg )$coefficient


Unnamed: 0,Estimate,Std. Error,t value,Pr(>|t|)
(Intercept),0.34601044,0.009163102,37.7612767,2.383047e-275
female,-0.0106344,0.014471177,-0.7348673,0.4624541
black,-0.02108047,0.026734605,-0.7885087,0.4304359
female:black,0.05928933,0.041109061,1.4422448,0.1492948


In [115]:
reg <- lm(formula7, Penn)

summary( reg )$coefficient


Unnamed: 0,Estimate,Std. Error,t value,Pr(>|t|)
(Intercept),0.321457255,0.16659101,1.92961945,0.05371056
female,0.104233279,0.13769037,0.75701213,0.44907829
black,0.071648027,0.08727931,0.82090508,0.41173950
othrace,0.028015166,0.30450928,0.09200103,0.92670091
factor(dep)1,-0.073633403,0.21834827,-0.33722916,0.73595833
factor(dep)2,-0.108540722,0.16531360,-0.65657469,0.51148467
q2,-0.026679372,0.16785312,-0.15894475,0.87371883
q3,-0.005673866,0.16748297,-0.03387727,0.97297637
q4,0.043344249,0.16758760,0.25863637,0.79592647
q5,0.093864577,0.16669941,0.56307684,0.57340783


In [116]:
# We can update formulas 
formula_modelA  <- (T4 ~  female + black + lusd + lusd ^ 2 )
print( terms( formula_modelA ) )
class( formula_modelA )

# formula for model B
formula_modelB  <- update( formula_modelA, ~ . + factor(dep))
print( terms( formula_modelB ) )

T4 ~ female + black + lusd + lusd^2
attr(,"variables")
list(T4, female, black, lusd)
attr(,"factors")
       female black lusd
T4          0     0    0
female      1     0    0
black       0     1    0
lusd        0     0    1
attr(,"term.labels")
[1] "female" "black"  "lusd"  
attr(,"order")
[1] 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>


T4 ~ female + black + lusd + factor(dep)
attr(,"variables")
list(T4, female, black, lusd, factor(dep))
attr(,"factors")
            female black lusd factor(dep)
T4               0     0    0           0
female           1     0    0           0
black            0     1    0           0
lusd             0     0    1           0
factor(dep)      0     0    0           1
attr(,"term.labels")
[1] "female"      "black"       "lusd"        "factor(dep)"
attr(,"order")
[1] 1 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>


## Regressions Objects
We will understand the output from a regression

In [117]:
key_columns  <-  c('female', 'black', 'lusd' , 'dep', 'T4')

In [118]:
data2  <-  data.frame( Penn[(names( Penn ) %in% key_columns ) ] )

In [121]:
# Regression
reg1 <- lm(formula_modelB , data2 )

# The output is a list of elements
typeof(reg1)
is.list( reg1 )
# All the elements in the list are detailed in the table bellow.
names(summary( reg1 ))

In [128]:
summary(reg1)$aliased

In [130]:
cat("Parameters: ")
summary(reg1)$coefficients

cat("R2:", summary(reg1)$r.squared,'\n')

print("Predictived values:")

predict(reg1)

Parameters: 

Unnamed: 0,Estimate,Std. Error,t value,Pr(>|t|)
(Intercept),0.334987054,0.01080336,31.0076644,1.578297e-193
female,-0.003384616,0.01359507,-0.2489591,0.8034024
black,0.007083473,0.02038814,0.347431,0.7282819
lusd,0.027561406,0.01518847,1.8146267,0.0696401
factor(dep)1,-0.002586024,0.02133241,-0.1212251,0.9035175
factor(dep)2,0.005104638,0.01825747,0.2795917,0.7798021


R2: 0.0006796109 
[1] "Predictived values:"


In [136]:
# Summary of regression is a list
is.list(summary( reg1 ))
is.data.frame(summary( reg1 ))
is.matrix(summary( reg1 )$coefficients)


| Elements 	| Definition 	|
|---	|---	|
| coefficients 	| Vector of coefficients. 	|
| residuals 	| Vector of residuals. 	|
| effects 	| Vector of the uncorrelated single-degree-of-freedom <br>values obtained by projecting the data onto the successive <br>orthogonal subspaces generated by the QR decomposition <br>during the fitting process. 	|
| rank 	| Number of independent columns. 	|
| fitted.values 	| Vector of fitted values. 	|
| qr 	| The QR decomposition. 	|
| df.residual 	| The degrees of freedom of the residuals. 	|
| contrasts 	| A contrast is a linear combination of variables <br>that allows comparison of different treatments. 	|
| xlevels 	| Levels of the factor variables. 	|
| call 	| The formula. 	|
| terms 	| Variables of the regression. 	|
| model 	| DataFrame of the variables in the model. 	|

#### Predictions
We can get the vector of predictions using the formula `predict`.

In [137]:
new_obs  <- data.frame(matrix( c(0, 0, 1 , 1), ncol= 4, dimnames = list( c() , c( "female" , "black" , "dep", "lusd"  ) ) ))

In [138]:
new_obs

female,black,dep,lusd
<dbl>,<dbl>,<dbl>,<dbl>
0,0,1,1


In [139]:
y_hat  <- predict(reg1,  new_obs )
y_hat