# The R environment and the notebooks
(Partly abridged from R Basics: quick (and partial) summary by Giuseppe Jurman)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has

* an effective data handling and storage facility,
* a suite of operators for calculations on arrays, in particular matrices,
* a large, coherent, integrated collection of intermediate tools for data analysis,
* graphical facilities for data analysis and display either directly at the computer or on hardcopy.


The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.


R consists of a core engine plus additional *packages*, available either from a [CRAN](https://cran.r-project.org/mirrors.html) mirror or from the respective developer's site.
Packages can be installed either through the RStudio interface or with the R command *install.packages()*.
Help can also be accessed via the RStudio interface or with the command *?*.
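
As a minimal sketch of these commands (the package name `ggplot2` is only an illustrative choice, and the lines that need network access or open a viewer are left commented out):

```r
# Installing a package downloads it from a CRAN mirror, so the call is
# left commented out here; "ggplot2" is just an illustrative package name.
# install.packages("ggplot2")

# Names of the packages already installed in the current library paths.
installed <- rownames(installed.packages())
head(installed)

# Open the help page of a function; ?mean is shorthand for help("mean").
# Left commented out because it opens a viewer in interactive sessions.
# ?mean
```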

In this course we will use R via Jupyter notebooks.

[RStudio](https://www.rstudio.com/), an integrated development environment (IDE), is also available. It lets you create [R notebooks](https://bookdown.org/yihui/rmarkdown/notebook.html) that mix *markdown* text and R code *chunks* that can be executed inline, producing a final report (*e.g.* in HTML or PDF, or even [slides](https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html)) that includes text, code and code output.

The text is written in a dialect of the [markdown](https://en.wikipedia.org/wiki/Markdown) language, a lightweight markup language with plain text formatting syntax that is easily converted into many output formats. In particular, R Markdown adopts the [Pandoc](https://pandoc.org/) version, which is one of the most comprehensive. The [quick guide](https://bookdown.org/yihui/rmarkdown/markdown-syntax.html) to the R markdown syntax can be of help, together with this practical [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).


# Workspace
The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.
For example, we can create an object called _test_ (the different types of object are discussed in more detail later).

In [None]:
test<-("Hello")
test

During an R session, objects are created and stored by name.
To display the names of (most of) the objects which are currently stored within R we can use one of the two following commands:

In [None]:
objects()
ls()

To remove objects the function rm is available:

In [None]:
rm(test)

All objects created during an R session can be stored permanently in a file for use in future R sessions. At the end of each R session you are given the opportunity to save all the currently available objects. If you indicate that you want to do this, the objects are written to a file called .RData in the current directory, and the command lines used in the session are saved to a file called .Rhistory.

When R is started at a later time from the same directory, it reloads the workspace from this file. At the same time the associated command history is reloaded.
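
The same mechanism can be used explicitly during a session. The sketch below is only an illustration: the object name `demo_value` and the temporary file are assumptions, chosen so that no real `.RData` file is overwritten. The function *save.image()* would instead write the whole workspace to `.RData`.

```r
# A throwaway object; the name demo_value is purely illustrative.
demo_value <- c(1, 2, 3)

# Write the object to a temporary file instead of the default .RData.
workspace_file <- tempfile(fileext = ".RData")
save(demo_value, file = workspace_file)

# Remove the object, then restore it from the file.
rm(demo_value)
load(workspace_file)
print(demo_value)
```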

*Red flag!* It is recommended that you use separate working directories for separate analyses conducted with R. It is quite common for objects with names x and y to be created during an analysis. Names like this are often meaningful in the context of a single analysis, but it can be quite hard to decide what they might be when several analyses have been conducted in the same directory.


# Objects
## Vectors
R operates on named data structures. The simplest structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
For example, suppose we want to set up a vector named x consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7. We can use any one of the following R commands:

In [None]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
print(x)

In [None]:
assign("a", c(10.4, 5.6, 3.1, 6.4, 21.7))

In [None]:
b = c(10.4, 5.6, 3.1, 6.4, 21.7)

In [None]:
print(x)
print(a)
print(b)

The first statement combines two parts:

* the function *c()*
* the assignment operator (‘<-’)

The function **c()** in this context can take an arbitrary number of vector arguments; its value is a vector obtained by concatenating its arguments end to end.

The assignment operator '**<-**' consists of the two characters ‘<’ (“less than”) and ‘-’ (“minus”) occurring strictly side by side, and it ‘points’ to the object receiving the value of the expression.

Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using:

In [None]:
c(10.4, 5.6, 3.1, 6.4, 21.7) -> d
print(d)

In general the usual operator, **<-**, can be thought of as a syntactic shortcut for the *assign()* call shown earlier.


If an expression is used as a complete command, the value is printed and **lost**. So now if we were to use the command

In [None]:
1/x

the reciprocals of the five values would be printed at the terminal (and the value of x would, of course, remain unchanged).

The further assignment:

In [None]:
y <- c(x, 0, x)
print(y)

would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place.

Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector; R issues a warning when the longest length is not a multiple of a shorter one. In particular a constant is simply repeated. So with the above assignments the command

In [None]:
v <- 2*x + y + 1
v

generates a new vector v of length 11 constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.  
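
The recycling rule can be seen in isolation with two short vectors. This is a minimal sketch; the names `long_vec` and `short_vec` are illustrative, and the warning in the last case is suppressed only to keep the output clean.

```r
# The shorter vector is recycled exactly twice: no warning.
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)
recycled <- long_vec + short_vec
print(recycled)       # 11 22 13 24

# A constant is a vector of length one and is simply repeated.
doubled <- long_vec * 2
print(doubled)        # 2 4 6 8

# Length 3 does not divide length 4, so R recycles fractionally and
# would normally issue a warning.
fractional <- suppressWarnings(long_vec + c(100, 200, 300))
print(fractional)     # 101 202 303 104
```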

The elementary arithmetic operators are the usual +, -, \*, / and ^ for raising to a power. In addition all of the common arithmetic functions are available. *log, exp, sin, cos, tan, sqrt*, and so on, all have their usual meaning. *max()* and *min()* select the largest and smallest elements of a vector respectively. *range()* is a function whose value is a vector of length two, namely *c(min(x), max(x)).*

In [None]:
min(x)
max(x)
range(x)

Note that *max* and *min* select the largest and smallest values in their arguments, even if they are given several vectors. The parallel maximum and minimum functions *pmax()* and *pmin()* return a vector (of length equal to their longest argument) that contains in each element the largest (smallest) element in that position in any of the input vectors.

In [None]:
max(x,y)
pmax(x,y)

*length(x)* is the number of elements in *x*.

In [None]:
length(x)

*sum(x)* gives the total of the elements in *x*, and *prod(x)* their product.

In [None]:
sum(x)
prod(x)

Two statistical functions are *mean(x)*, which calculates the sample mean and is the same as *sum(x)/length(x)*, and *var(x)*, which gives the sample variance, i.e. *sum((x-mean(x))^2)/(length(x)-1)*.
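
A quick check of these two identities, as a sketch on an illustrative vector named `vals` (the same five numbers used earlier):

```r
vals <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Built-in sample mean versus the explicit formula.
sample_mean <- mean(vals)
manual_mean <- sum(vals) / length(vals)

# Built-in sample variance versus the explicit formula
# (note the divisor length(vals) - 1).
sample_var <- var(vals)
manual_var <- sum((vals - mean(vals))^2) / (length(vals) - 1)

print(c(sample_mean, manual_mean))
print(c(sample_var, manual_var))
```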

*sort(x)* returns a vector of the same size as x with the elements arranged in increasing order; however there are other more flexible sorting facilities available.

In [None]:
sort(x)
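
Two of the more flexible facilities are the `decreasing` argument of *sort()* and the function *order()*. The sketch below uses an illustrative vector named `vals`.

```r
vals <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Sort in decreasing order.
descending <- sort(vals, decreasing = TRUE)
print(descending)   # 21.7 10.4 6.4 5.6 3.1

# order() returns the permutation of indices that sorts the vector,
# which is useful for rearranging other vectors in the same way.
idx <- order(vals)
print(idx)          # 3 2 4 1 5
print(vals[idx])    # same result as sort(vals)
```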

### Missing values
In some cases the components of a vector may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.
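
A minimal sketch of how NA propagates, using an illustrative vector named `obs`. The `na.rm` argument shown at the end is the usual way to ask a summary function to drop missing values first.

```r
obs <- c(1, NA, 3)

# Arithmetic with NA yields NA in the affected positions.
shifted <- obs + 1
print(shifted)                  # 2 NA 4

# A summary over a vector containing NA is itself NA ...
print(mean(obs))                # NA

# ... unless missing values are explicitly removed first.
clean_mean <- mean(obs, na.rm = TRUE)
print(clean_mean)               # 2
```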


The function *is.na(x)* gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.


In [None]:
x <- c(1,-2,3,NA,NA,6,7,8,9,10)


In [None]:
is.na(x)

In [None]:
y <- x[!is.na(x)]
print(y)

## Character vectors
Character quantities and character vectors are used frequently in R, for example as plot labels. Where needed they are denoted by a sequence of characters delimited by the double quote character, *e.g.,* "x-values", "New iteration results".

Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). 
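
As a small illustration (the variable names are arbitrary), both quoting styles produce the same string:

```r
single_quoted <- 'x-values'
double_quoted <- "x-values"

# Both forms create the same string, which is printed with double quotes.
print(single_quoted)
print(identical(single_quoted, double_quoted))   # TRUE

# One kind of quote can be embedded in a string delimited by the other.
print('He said "hello"')
```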

Character vectors may be concatenated into a vector by the *c()* function:

In [None]:
label <- c("X", "Y")
label

## Index vectors & slicing
**WARNING** *In R, indexing starts from 1 (not from 0 as in C/Python)*

Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets. More generally any expression that evaluates to a vector may have subsets of its elements similarly selected by appending an index vector in square brackets immediately after the expression.


Such index vectors can be any of four distinct types.

* **A logical vector**  Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted. For example



In [None]:
x <- c(-1,-2,3,4,NA,NA,7,8,9,10)
y <- x[!is.na(x)]
print(y)

creates (or re-creates) an object y which will contain the non-missing values of x, in the same order. Note that if x has missing values, y will be shorter than x. 

* **A vector of positive integral quantities** In this case the values in the index vector must lie in the set {1, 2, …, length(x)}. The corresponding elements of the vector are selected and concatenated, in that order, in the result. The index vector can be of any length and the result is of the same length as the index vector. Some examples:



In [None]:
x[6]

x[1:10]

The first command selects the sixth component of x.

The second command selects the first 10 elements of x.


* **A vector of negative integral quantities** Such an index vector specifies the values to be excluded rather than included. Thus the following command returns all but the first five elements of x:


In [None]:
x[-(1:5)]

* **A vector of character strings**. This possibility only applies where an object has a names attribute to identify its components. In this case a sub-vector of the names vector may be used in the same way as the positive integral labels in item 2 further above.

In [None]:
fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
lunch <- fruit[c("apple","orange")]
print(lunch)

The advantage is that alphanumeric names are often easier to remember than numeric indices. This option is particularly useful in connection with data frames, as we shall see later.

An indexed expression can also appear on the receiving end of an assignment, in which case the assignment operation is performed only on those elements of the vector. The expression must be of the form vector[index_vector], as having an arbitrary expression in place of the vector name does not make much sense here. For example, to replace any missing values in x by zeros we can write:

In [None]:
x[is.na(x)] <- 0
print(x)

## Factors
A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both ordered and unordered factors.

Suppose, for example, we have a sample of 30 tax accountants from all the states and territories of Australia and their individual state of origin is specified by a character vector, *state*, defined as follows:

In [None]:
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
             "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
             "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
             "sa",  "act", "nsw", "vic", "vic", "act")
print(state)

To transform this vector into a factor we can use the factor() function. The print() function handles factors slightly differently from other objects:

In [None]:
statef <- factor(state)
print(statef)

To find out the levels of a factor the function levels() can be used.

In [None]:
levels(statef)

The levels of factors are stored in alphabetical order, or in the order they were specified to factor if they were specified explicitly.

Sometimes the levels will have a natural ordering that we want to record and want our statistical analysis to make use of. The *ordered()* function creates such ordered factors but is otherwise identical to factor. 
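
A minimal sketch of an ordered factor follows. The data and the names `sizes`, `size_levels` and `sizef` are illustrative assumptions, not part of the accountants example.

```r
sizes <- c("small", "large", "medium", "small")

# State the natural ordering explicitly, from lowest to highest.
size_levels <- c("small", "medium", "large")
sizef <- ordered(sizes, levels = size_levels)
print(sizef)

# Comparisons respect the declared ordering rather than the alphabet.
print(sizef[1] < sizef[2])   # TRUE: "small" < "large"
print(max(sizef))
```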

Recall that a factor defines a partition into groups. Similarly a pair of factors defines a two way cross classification, and so on. 

The function table() allows frequency tables to be calculated from equal length factors. If there are k factor arguments, the result is a k-way array of frequencies.
Suppose, for example, that statef is a factor giving the state code for each entry in a data vector. The assignment below stores in *statefr* a table of frequencies of each state in the sample. The frequencies are ordered and labelled by the levels attribute of the factor.

In [None]:
statefr <- table(statef)
print(statefr)
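
With two factors of equal length, *table()* gives a two-way cross classification. The sketch below uses small made-up factors (`region` and `band` are illustrative names and data):

```r
region <- factor(c("north", "south", "north", "south", "north"))
band   <- factor(c("low",   "low",   "high",  "high",  "low"))

# Rows follow the levels of the first factor, columns those of the second.
counts <- table(region, band)
print(counts)

# Individual cells can be addressed by level names.
print(counts["north", "low"])   # 2
```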

## Array
An array can be considered as a multiply subscripted collection of data entries, for example numeric. R allows simple facilities for creating and handling arrays, and in particular the special case of matrices.

A dimension vector is a vector of non-negative integers. If its length is k then the array is k-dimensional, *e.g.*, a matrix is a 2-dimensional array. The dimensions are indexed from one up to the values given in the dimension vector.

A vector can be used by R as an array only if it has a dimension vector as its dim attribute. For any array, the dimension vector may be referenced explicitly as *dim(Z)*.

Suppose, for example, z is a vector of 60 elements. 

In [None]:
z <- seq(1,60)
z

The following command gives it the dim attribute that allows it to be treated as a 3 by 5 by 4 array.

In [None]:
dim(z) <- c(3,5,4)
print(z)


Individual elements of an array may be referenced by giving the name of the array followed by the subscripts in square brackets, separated by commas.

More generally, subsections of an array may be specified by giving a sequence of index vectors in place of subscripts; however if any index position is given an empty index vector, then the full range of that subscript is taken.

An easy example
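
The sketch below illustrates element and subsection indexing on a fresh 3 by 5 by 4 array. The name `arr` is illustrative, chosen to avoid overwriting the objects used elsewhere in this notebook.

```r
arr <- array(1:60, dim = c(3, 5, 4))

# Individual elements: one subscript per dimension.
print(arr[2, 1, 1])   # 2
print(arr[1, 2, 3])   # 34

# An empty index position takes the full range of that subscript:
# here, the whole first 3 by 5 layer.
first_layer <- arr[, , 1]
print(dim(first_layer))   # 3 5

# First row of the second layer.
row_slice <- arr[1, , 2]
print(row_slice)          # 16 19 22 25 28
```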

In the case of a doubly indexed array, an index matrix may be given consisting of two columns and as many rows as desired. The entries in the index matrix are the row and column indices for the doubly indexed array. Suppose for example we have a 4 by 5 array X and we wish to do the following:

* Extract elements X[1,3], X[2,2] and X[3,1] as a vector structure, and
* Replace these entries in the array X by zeroes.

First, let's generate a 4-by-5 array.

In [None]:
z <- array(1:20, dim=c(4,5))   
print(z)

Second, we generate a 3 by 2 array that contains the indices of the elements we want to extract.
Index matrices must be numerical: any other form of matrix (*e.g.*, a logical or character matrix) supplied as a matrix is treated as an indexing vector.


In [None]:
i <- array(c(1:3,3:1), dim=c(3,2))
print(i) 

In [None]:
z[i]

In [None]:
z[1,]

Finally, replace those elements by zeros.

In [None]:
z[i] <- 0                 
print(z)

As well as giving a vector structure a dim attribute, arrays can be constructed from vectors by the array function, which has the form *array(data_vector, dim_vector)*. For example, using the vector x from above:

In [None]:
x

In [None]:
dim_vector <- c(5,2)
dim_vector 

In [None]:
w <- array(x, dim=dim_vector)
print(w)

If x has fewer elements than the dimension vector requires, its values are recycled from the beginning to make it up to the right size.
Let's see:

In [None]:
dim_vector <- c(4,5)

In [None]:
w <- array(x, dim=dim_vector)
print(w)

Arrays may be used in arithmetic expressions and the result is an array formed by element-by-element operations on the data vector. The dim attributes of operands generally need to be the same, and this becomes the dimension vector of the result.

In [None]:
A<-w+z
A

The function *t()* returns the transpose of its argument, here A:

In [None]:
B <- t(A)
B

The meaning of *diag()* depends on its argument:

* *diag(v)*, where v is a vector, gives a diagonal matrix with elements of the vector as the diagonal entries
* *diag(M)*, where M is a matrix, gives the vector of main diagonal entries of M
* if k is a single numeric value then *diag(k)* is the k by k identity matrix

In [None]:
diag(x[1:4])

In [None]:
diag(B)

In [None]:
diag(5)

Matrices can also be built up from other vectors and matrices by the functions *cbind()* and *rbind()*. Roughly *cbind()* forms matrices by binding together matrices horizontally, or column-wise, and *rbind()* vertically, or row-wise.


The arguments to *cbind()* must be either vectors of any length, or matrices with the same column size, that is the same number of rows. The result is a matrix with the concatenated arguments arg_1, arg_2, … forming the columns.

If some of the arguments to *cbind()* are vectors they may be shorter than the column size of any matrices present, in which case they are cyclically extended to match the matrix column size (or the length of the longest vector if no matrices are given).

In [None]:
f <- c(3,7,5)
C <- cbind(x,1, f)
C

The function *rbind()* does the corresponding operation for rows. In this case any vector arguments, possibly cyclically extended, are of course taken as row vectors.

In [None]:
R <- rbind(1, B, f)
R

The result of *rbind()* or *cbind()* always has matrix status. Hence *cbind(x)* and *rbind(x)* are possibly the simplest ways explicitly to allow the vector x to be treated as a column or row matrix respectively.
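
A short sketch of this, using an illustrative vector `v`:

```r
v <- c(1, 2, 3)

# cbind() turns the vector into a 3 by 1 column matrix,
# rbind() into a 1 by 3 row matrix.
col_matrix <- cbind(v)
row_matrix <- rbind(v)

print(dim(col_matrix))        # 3 1
print(dim(row_matrix))        # 1 3
print(is.matrix(col_matrix))  # TRUE
```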

The official way to coerce an array back to a simple vector object is to use *as.vector()*:

In [None]:
vec <- as.vector(R)
vec

## Lists
An R list is an object consisting of an ordered collection of objects known as its components.

There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on. Here is a simple example of how to make a list:


In [None]:
Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(5,7,10))
print(Lst)

Components are always numbered and may always be referred to as such. Thus if Lst is the name of a list with four components, these may be individually referred to as *Lst[[1]], Lst[[2]], Lst[[3]]* and *Lst[[4]]*. If, further, *Lst[[4]]* is a vector subscripted array then *Lst[[4]][1]* is its first entry.

In [None]:
Lst[[1]]
Lst[[4]][1]

If Lst is a list, then the function *length(Lst)* gives the number of (top level) components it has.

In [None]:
print(length(Lst))

Components of lists may also be named, and in this case the component may be referred to either by giving the component name as a character string in place of the number in double square brackets, or, more conveniently, by giving an expression of the form *name$component_name* for the same thing.

This is a very useful convention as it makes it easier to get the right component if you forget the number.

In [None]:
Lst$name
Lst$child.ages[1]

Additionally, one can use the names of the list components in double square brackets, *i.e., Lst[["name"]]* is the same as *Lst$name*. This is especially useful when the name of the component to be extracted is stored in another variable, as in

In [None]:
n <- "name"
Lst[[n]]

It is very important to distinguish *Lst[[1]]* from *Lst[1]*. 

* *[[…]]* is the operator used to select a single element, thus *Lst[[1]]* is the first object in the list Lst, and if it is a named list the name is not included;
* *[…]* is a *general* subscripting operator, so *Lst[1]* is a sublist of the list Lst consisting of the first entry only and if it is a named list, the names are transferred to the sublist.

The names of components may be abbreviated down to the minimum number of letters needed to identify them uniquely. Thus *Lst\$coefficients* may be minimally specified as *Lst\$coe* and *Lst\$covariance* as *Lst\$cov*: the vector of names is in fact simply an attribute of the list like any other and may be handled as such. Other structures besides lists may, of course, similarly be given a names attribute also.

In [None]:
Lst[[1]]
Lst[1]
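
The abbreviation of component names mentioned above can be sketched as follows. The list `fit` and its contents are hypothetical, built only to have components named `coefficients` and `covariance`.

```r
# A hypothetical list with two components sharing the prefix "co".
fit <- list(coefficients = c(1.5, -0.5), covariance = diag(2))

# With $, a name may be abbreviated as long as it stays unambiguous.
print(fit$coe)   # same as fit$coefficients
print(fit$cov)   # same as fit$covariance

# [[ ]] with a character string requires an exact match by default,
# so the abbreviated name finds nothing here.
print(is.null(fit[["coe"]]))   # TRUE
```

Partial matching is convenient at the console, but spelling names out in full is usually safer in scripts, since adding a new component can make an abbreviation ambiguous.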

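New lists may be formed from existing objects by the function *list()*. An assignment of the following form sets up a list whose components are copies of the given objects, named as specified by the argument names: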

In [None]:
Lst2 <- list(name_1=x, name_2=z, name3=statef)
Lst2

When the concatenation function *c()* is given list arguments, the result is an object of mode list also, whose components are those of the argument lists joined together in sequence.

In [None]:
list.ABC <- c(Lst, Lst2)
list.ABC

Recall that with vector objects as arguments the concatenation function similarly joined together all arguments into a single vector structure. In this case all other attributes, such as dim attributes, are discarded.
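
A minimal sketch of this behaviour, with illustrative names `m` and `flat`:

```r
# A 2 by 2 array has a dim attribute ...
m <- array(1:4, dim = c(2, 2))
print(dim(m))      # 2 2

# ... which is dropped when c() concatenates it with other values.
flat <- c(m, 5L)
print(dim(flat))   # NULL
print(flat)        # 1 2 3 4 5
```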


## Data frames
A data frame is a list with class "data.frame". There are restrictions on lists that may be made into data frames, namely:

* the components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames;
* matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively;
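* numeric vectors, logicals and factors are included as is; character vectors are coerced to factors (whose levels are the unique values appearing in the vector) by default only in R versions before 4.0.0, while from R 4.0.0 onwards they are kept as character unless *stringsAsFactors = TRUE* is given;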
* vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.

A data frame may for many purposes be regarded as a matrix with columns possibly of differing modes and attributes. It may be displayed in matrix form, and its rows and columns extracted using matrix indexing conventions.

Objects satisfying the restrictions placed on the columns (components) of a data frame may be used to form one using the function data.frame:

In [None]:
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
               61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
               59, 46, 58, 43)
accountants <- data.frame(home=statef, loot=incomes)
print(accountants)
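
The matrix-style and list-style access described above can be sketched on a small, made-up data frame (the name `staff` and its values are illustrative):

```r
staff <- data.frame(home = factor(c("nsw", "vic", "nsw", "qld")),
                    loot = c(60, 49, 40, 61))

# Matrix-style indexing: second row, then the "loot" column.
print(staff[2, ])
print(staff[, "loot"])

# List-style access to the same column.
print(staff$loot)

# A logical index vector selects the rows satisfying a condition.
high <- staff[staff$loot > 50, ]
print(high)
```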

The simplest way to construct a data frame from scratch is to read an entire data frame (table) from an external file.

# Read data from files
Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. 

If variables are to be held mainly in data frames, as we strongly suggest they should be, an entire data frame can be read directly with the read.table() function. There is also a more primitive input function, scan(), that can be called directly.

To read an entire data frame directly, the external file will normally have a special form.

* The first line of the file should have a name for each variable in the data frame.
* Each additional line of the file has as its first item a row label and the values for each variable.


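By default numeric items (except row labels) are read as numeric variables. Non-numeric variables are read as character vectors in R 4.0.0 and later (as factors in earlier versions, or when *stringsAsFactors = TRUE* is passed). This can be changed if necessary.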
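
The file layout described above can be sketched with a small temporary file. The column names, row labels and values below are made up for illustration. Because the header line has one fewer field than the data lines, *read.table()* uses the first column as row names.

```r
# Header with three variable names; each data line starts with a row label.
file_lines <- c("Price Floor Area",
                "h1 52.00 111 830",
                "h2 54.75 128 710",
                "h3 57.50 101 1000")

# Write the lines to a temporary file and read them back as a data frame.
data_file <- tempfile(fileext = ".txt")
writeLines(file_lines, data_file)
houses <- read.table(data_file, header = TRUE)

print(houses)
print(rownames(houses))
```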

The functions *read.table()* and *read.csv()* can then be used to read the data frame directly:

In [None]:
HousePrice <- read.csv("channing.csv", header=TRUE)
HousePrice

In [None]:
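# Write the data frame to a space-separated text file, with a header row and without row names.
write.table(HousePrice, file="channing.txt", quote=TRUE, sep=" ", dec=".", na="NA", row.names=FALSE, col.names=TRUE)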

In [None]:
HousePrice_txt <- read.table("channing.txt", header=TRUE)
HousePrice_txt