# The R environment and the notebooks
(Partly abridged from R Basics: quick (and partial) summary by Giuseppe Jurman)

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has

* an effective data handling and storage facility,
* a suite of operators for calculations on arrays, in particular matrices,
* a large, coherent, integrated collection of intermediate tools for data analysis,
* graphical facilities for data analysis and display, either directly at the computer or on hardcopy.


The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.


R consists of a core engine plus additional *packages*, available either from a [CRAN](https://cran.r-project.org/mirrors.html) mirror or from the respective developer's site.
Packages can be installed either through the RStudio interface or with the R command *install.packages()*.
Help can likewise be accessed through the RStudio interface or with the command *?*.
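
As a minimal sketch of the two commands just mentioned: the package name `ggplot2` is only an illustrative choice, and the installation line is commented out because it needs network access.

```r
# Install a package from CRAN (commented out: requires network access).
# "ggplot2" is only an illustrative package name.
# install.packages("ggplot2")

# Two equivalent ways of opening the help page of the function mean()
?mean
help("mean")

# help() returns an object pointing at the matching help file(s)
mean_help <- help("mean")
length(mean_help) >= 1
```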

In this course we will use R through Jupyter notebooks.

There is also [RStudio](https://www.rstudio.com/), an integrated development environment (IDE). It lets you create [R notebooks](https://bookdown.org/yihui/rmarkdown/notebook.html) that mix *markdown* text and R code *chunks* that can be executed inline, producing a final report (*e.g.* in HTML or PDF, or even [slides](https://bookdown.org/yihui/rmarkdown/ioslides-presentation.html)) including text, code and code output.

The text is written in a dialect of the [markdown](https://en.wikipedia.org/wiki/Markdown) language, a lightweight markup language with a plain-text formatting syntax that is easily converted into many output formats. In particular, R Markdown adopts the [Pandoc](https://pandoc.org/) version, which is one of the most comprehensive. The [quick guide](https://bookdown.org/yihui/rmarkdown/markdown-syntax.html) to the R markdown syntax can be of help, together with this practical [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).


# Workspace
The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.
For example, we can create an object called _test_ (the different types of object are discussed in more detail later).

In [2]:
test <- "Hello"
test

During an R session, objects are created and stored by name.
To display the names of (most of) the objects which are currently stored within R we can use one of the two following commands:

In [6]:
objects()
ls()

To remove objects the function rm is available:

In [7]:
rm(test)

“object 'test' not found”


All objects created during an R session can be stored permanently in a file for use in future R sessions. At the end of each R session you are given the opportunity to save all the currently available objects. If you indicate that you want to do this, the objects are written to a file called .RData in the current directory, and the command lines used in the session are saved to a file called .Rhistory.

When R is started at a later time from the same directory, it reloads the workspace from this file. At the same time the associated command history is reloaded.
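
The save-and-reload cycle can also be tried by hand. The sketch below is only an illustration: it uses the base functions *save()* and *load()*, which are not discussed in the text above, and writes to a temporary file (the name `workspace_file` is hypothetical) so that no .RData file in the working directory is touched.

```r
test <- "Hello"

# Save the object to a temporary file instead of the default .RData
workspace_file <- tempfile(fileext = ".RData")
save(test, file = workspace_file)

# Remove the object and check that it is gone from the current environment
rm(test)
removed <- !exists("test", inherits = FALSE)
removed

# Restore the object from the file
load(workspace_file)
test
```

The base function *save.image()* does the same for the whole workspace, writing to .RData in the current directory by default.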

*Red flag!* It is recommended that you use separate working directories for the different analyses conducted with R. It is quite common for objects with names x and y to be created during an analysis. Names like this are often meaningful in the context of a single analysis, but it can be quite hard to decide what they might be when several analyses have been conducted in the same directory.


# Objects
## Vectors
R operates on named data structures. The simplest structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
For example, suppose we want to set up a vector named x consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7. We can use any one of the following R commands:

In [8]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
print(x)

[1] 10.4  5.6  3.1  6.4 21.7


In [11]:
assign("a", c(10.4, 5.6, 3.1, 6.4, 21.7))
print(a)

[1] 10.4  5.6  3.1  6.4 21.7


In [12]:
b = c(10.4, 5.6, 3.1, 6.4, 21.7)
print(b)

[1] 10.4  5.6  3.1  6.4 21.7


Each of these statements combines two parts:

* the function *c()*,
* an assignment operator (‘<-’ in the first example).

The function **c()** in this context can take an arbitrary number of vector arguments; its value is a vector obtained by concatenating its arguments end to end.

The assignment operator '**<-**' consists of the two characters ‘<’ (“less than”) and ‘-’ (“minus”) occurring strictly side by side, and it ‘points’ to the object receiving the value of the expression. In most contexts the ‘=’ operator can be used as an alternative, as in the third example.

Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using:

In [13]:
c(10.4, 5.6, 3.1, 6.4, 21.7) -> d
print(d)

[1] 10.4  5.6  3.1  6.4 21.7


In general the usual operator, **<-**, can be thought of as a syntactic shortcut for the *assign()* call shown earlier.


If an expression is used as a complete command, the value is printed and **lost**. So now if we were to use the command

In [14]:
1/x

the reciprocals of the five values would be printed at the terminal (and the value of x would, of course, remain unchanged).

The further assignment:

In [15]:
y <- c(x, 0, x)
print(y)

 [1] 10.4  5.6  3.1  6.4 21.7  0.0 10.4  5.6  3.1  6.4 21.7


would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place.

Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector; R issues a warning when the longest length is not a whole multiple of a shorter one. In particular a constant is simply repeated. So with the above assignments the command

In [17]:
v <- 2*x + y + 1
v

“longer object length is not a multiple of shorter object length”


generates a new vector v of length 11 constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.  
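
For comparison, here is a small sketch of recycling where the longer length is a whole multiple of the shorter one, so no warning is issued. The vector names `long_vec` and `short_vec` are made up for this illustration.

```r
long_vec  <- c(1, 2, 3, 4, 5, 6)
short_vec <- c(10, 20)

# short_vec is reused three times: 10 20 10 20 10 20
recycled_sum <- long_vec + short_vec
recycled_sum            # 11 22 13 24 15 26

# A constant is a vector of length one, repeated for every element
plus_one <- long_vec + 1
plus_one                # 2 3 4 5 6 7
```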

The elementary arithmetic operators are the usual +, -, \*, / and ^ for raising to a power. In addition all of the common arithmetic functions are available. *log, exp, sin, cos, tan, sqrt*, and so on, all have their usual meaning. *max()* and *min()* select the largest and smallest elements of a vector respectively. *range()* is a function whose value is a vector of length two, namely *c(min(x), max(x)).*

In [18]:
min(x)
max(x)
range(x)

Note that *max* and *min* select the largest and smallest values in their arguments, even if they are given several vectors. The parallel maximum and minimum functions *pmax()* and *pmin()* return a vector (of length equal to their longest argument) that contains in each element the largest (smallest) element in that position in any of the input vectors.

In [19]:
max(x,y)
pmax(x,y)

“an argument will be fractionally recycled”
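
The difference between the two kinds of function is easier to see on two vectors of the same length, where no recycling is involved. This is only a sketch; the names `first` and `second` are made up for the illustration.

```r
first  <- c(1, 5, 3)
second <- c(4, 2, 6)

# max() pools all arguments and returns a single number
overall_max <- max(first, second)      # 6

# pmax() and pmin() compare the vectors position by position
parallel_max <- pmax(first, second)    # 4 5 6
parallel_min <- pmin(first, second)    # 1 2 3
```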


*length(x)* is the number of elements in *x*.

In [20]:
length(x)

*sum(x)* gives the total of the elements in *x*, and *prod(x)* their product.

In [21]:
sum(x)
prod(x)

Two statistical functions are *mean(x)*, which calculates the sample mean and is the same as *sum(x)/length(x)*, and *var(x)*, which gives the sample variance, i.e.
*sum((x-mean(x))^2)/(length(x)-1)*.
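
These two identities can be checked numerically. The sketch below reuses the five values of x from the beginning of the section and compares with *all.equal()* (a base function not discussed above) to allow for floating-point rounding.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Sample mean and sample variance written out by hand
manual_mean <- sum(x) / length(x)
manual_var  <- sum((x - mean(x))^2) / (length(x) - 1)

# Compare with the built-in functions
all.equal(mean(x), manual_mean)
all.equal(var(x), manual_var)
```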

*sort(x)* returns a vector of the same size as x with the elements arranged in increasing order; however there are other more flexible sorting facilities available.

In [22]:
sort(x)
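
As a sketch of the more flexible facilities mentioned above, the example below uses the `decreasing` argument of *sort()* and the base function *order()*, which returns the permutation of indices that sorts the vector. It assumes the five values of x from the beginning of the section.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Sort from largest to smallest
sorted_desc <- sort(x, decreasing = TRUE)   # 21.7 10.4 6.4 5.6 3.1

# order() gives the positions of the elements in increasing order
ord <- order(x)                             # 3 2 4 1 5

# Indexing with that permutation reproduces sort(x)
x[ord]                                      # 3.1 5.6 6.4 10.4 21.7
```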

### Missing values
In some cases the components of a vector may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.


The function *is.na(x)* gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.
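
A short sketch of how NA propagates. The vector `obs` is made up for this illustration, and the `na.rm` argument of *mean()* is not discussed in the text above; it is shown only as one common way of skipping missing values.

```r
obs <- c(1, 2, NA, 4)

# Logical vector marking the missing positions
missing_flags <- is.na(obs)     # FALSE FALSE TRUE FALSE

# Operations on NA give NA
shifted <- obs + 1              # 2 3 NA 5
mean_with_na <- mean(obs)       # NA

# Drop the missing values before averaging
mean_without_na <- mean(obs, na.rm = TRUE)
mean_without_na
```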


## Character vectors
Character quantities and character vectors are used frequently in R, for example as plot labels. Where needed they are denoted by a sequence of characters delimited by the double quote character, *e.g.,* "x-values", "New iteration results".

Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). 

Character vectors may be concatenated into a vector by the *c()* function:

In [24]:
label <- c("X", "Y")
label

## Index vectors & slicing
**WARNING** *In R, indexing starts from 1 (not from 0 as in C/Python)*

Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets. More generally any expression that evaluates to a vector may have subsets of its elements similarly selected by appending an index vector in square brackets immediately after the expression.


Such index vectors can be any of four distinct types.

* **A logical vector**  Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted. For example



In [25]:
x <- c(-1,-2,3,4,NA,NA,7,8,9)
y <- x[!is.na(x)]
print(y)

[1] -1 -2  3  4  7  8  9


creates (or re-creates) an object y which will contain the non-missing values of x, in the same order. Note that if x has missing values, y will be shorter than x. 

* **A vector of positive integral quantities** In this case the values in the index vector must lie in the set {1, 2, …, length(x)}. The corresponding elements of the vector are selected and concatenated, in that order, in the result. The index vector can be of any length and the result is of the same length as the index vector. Some examples:



In [26]:
x[6]
x[1:10]

The first command selects the sixth component of x (here NA).

The second command asks for the first 10 elements of x; since x has only 9 elements, the tenth value returned is NA.


* **A vector of negative integral quantities** Such an index vector specifies the values to be excluded rather than included. For example, the following returns all but the first five elements of x:


In [28]:
x[-(1:5)]

* **A vector of character strings** This possibility only applies where an object has a names attribute to identify its components. In this case a sub-vector of the names vector may be used in the same way as the positive integral indices in the second item above.

In [29]:
fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
lunch <- fruit[c("apple","orange")]
print(lunch)

 apple orange 
     1      5 


The advantage is that alphanumeric names are often easier to remember than numeric indices. This option is particularly useful in connection with data frames, as we shall see later.

An indexed expression can also appear on the receiving end of an assignment, in which case the assignment operation is performed only on those elements of the vector. The expression must be of the form vector[index_vector], as having an arbitrary expression in place of the vector name does not make much sense here. For example, to replace any missing values in x by zeros we can write:

In [30]:
x[is.na(x)] <- 0
print(x)

[1] -1 -2  3  4  0  0  7  8  9
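
The same pattern works with any logical condition. As a sketch, the following replaces the negative elements of a small made-up vector `w` by their opposites, which has the same effect as *abs(w)*:

```r
w <- c(-1, -2, 3, 4)

# The logical index w < 0 selects the elements to overwrite
w[w < 0] <- -w[w < 0]
w    # 1 2 3 4
```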


## Factors
A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both ordered and unordered factors.

Suppose, for example, we have a sample of 30 tax accountants from all the states and territories of Australia and their individual state of origin is specified by a character vector of state abbreviations, saved as *state*:

In [31]:
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
             "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
             "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
             "sa",  "act", "nsw", "vic", "vic", "act")
print(state)

 [1] "tas" "sa"  "qld" "nsw" "nsw" "nt"  "wa"  "wa"  "qld" "vic" "nsw" "vic"
[13] "qld" "qld" "sa"  "tas" "sa"  "nt"  "wa"  "vic" "qld" "nsw" "nsw" "wa" 
[25] "sa"  "act" "nsw" "vic" "vic" "act"


To transform this vector into a factor we can use the *factor()* function. The *print()* function handles factors slightly differently from other objects:

In [32]:
statef <- factor(state)
print(statef)

 [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa  tas sa  nt  wa 
[20] vic qld nsw nsw wa  sa  act nsw vic vic act
Levels: act nsw nt qld sa tas vic wa


To find out the levels of a factor the function levels() can be used.

In [33]:
levels(statef)

The levels of factors are stored in alphabetical order, or in the order they were specified to factor if they were specified explicitly.

Sometimes the levels will have a natural ordering that we want to record and want our statistical analysis to make use of. The *ordered()* function creates such ordered factors but is otherwise identical to factor. 
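
A minimal sketch of both points, explicit level order and ordered factors, using a made-up vector of sizes (the data and the names `sizes`, `sizef` and `sizeo` are hypothetical):

```r
sizes <- c("small", "large", "medium", "small")

# Explicit levels are kept in the order given, not sorted alphabetically
sizef <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizef)                    # "small" "medium" "large"

# An ordered factor also records that small < medium < large,
# so its elements can be compared with a level
sizeo <- ordered(sizes, levels = c("small", "medium", "large"))
below_large <- sizeo < "large"   # TRUE FALSE TRUE TRUE
below_large
```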

Recall that a factor defines a partition into groups. Similarly a pair of factors defines a two way cross classification, and so on. 

The function *table()* allows frequency tables to be calculated from equal-length factors. If there are k factor arguments, the result is a k-way array of frequencies.
Suppose, for example, that statef is a factor giving the state code for each entry in a data vector. The following assignment stores in *statefr* a table of the frequencies of each state in the sample. The frequencies are ordered and labelled by the levels attribute of the factor.

In [34]:
statefr <- table(statef)
print(statefr)

statef
act nsw  nt qld  sa tas vic  wa 
  2   6   2   5   4   2   5   4 
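
With two factors of the same length, *table()* returns a two-way cross classification. The sketch below uses two small made-up factors, `region` and `kind`, which are not part of the accountants example.

```r
region <- factor(c("north", "south", "north", "south", "north"))
kind   <- factor(c("a", "a", "b", "b", "a"))

# Rows follow the levels of region, columns the levels of kind
two_way <- table(region, kind)
two_way

dim(two_way)             # 2 2
two_way["north", "a"]    # 2
```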


## Arrays
An array can be considered as a multiply subscripted collection of data entries, for example numeric. R provides simple facilities for creating and handling arrays, and in particular the special case of matrices.

A dimension vector is a vector of non-negative integers. If its length is k then the array is k-dimensional, *e.g.*, a matrix is a 2-dimensional array. The dimensions are indexed from one up to the values given in the dimension vector.

A vector can be used by R as an array only if it has a dimension vector as its dim attribute. For any array Z, the dimension vector may be referenced explicitly as *dim(Z)*.

Suppose, for example, z is a vector of 60 elements. 

In [42]:
z <- seq(1,60)
z

The following command gives it the dim attribute that allows it to be treated as a 3 by 5 by 4 array.

In [43]:
dim(z) <- c(3,5,4)
print(z)

, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   16   19   22   25   28
[2,]   17   20   23   26   29
[3,]   18   21   24   27   30

, , 3

     [,1] [,2] [,3] [,4] [,5]
[1,]   31   34   37   40   43
[2,]   32   35   38   41   44
[3,]   33   36   39   42   45

, , 4

     [,1] [,2] [,3] [,4] [,5]
[1,]   46   49   52   55   58
[2,]   47   50   53   56   59
[3,]   48   51   54   57   60
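
Individual elements and whole slices of such an array are selected by giving one index per dimension. The sketch below rebuilds the same 3 by 5 by 4 array with the base function *array()*, used here as an alternative to assigning the dim attribute, so that it can be run on its own.

```r
z <- array(1:60, dim = c(3, 5, 4))

dim(z)                   # 3 5 4

# Row 2, column 3 of the first slice
z[2, 3, 1]               # 8

# Leaving an index empty selects everything along that dimension:
# this is the whole second slice, a 3 by 5 matrix holding 16 to 30
second_slice <- z[, , 2]
dim(second_slice)        # 3 5

# The last element of the array
z[3, 5, 4]               # 60
```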



An easy example

In the case of a doubly indexed array, an index matrix may be given consisting of two columns and as many rows as desired. The entries in the index matrix are the row and column indices for the doubly indexed array. Suppose for example we have a 4 by 5 array X and we wish to do the following:

* Extract elements X[1,3], X[2,2] and X[3,1] as a vector structure, and
* Replace these entries in the array X by zeroes.
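
The two steps above can be sketched as follows. The contents of X (the numbers 1 to 20) and the name `i` for the index matrix are assumptions made for this illustration.

```r
# A 4 by 5 array filled column by column with 1..20
X <- array(1:20, dim = c(4, 5))

# Index matrix: one row per element, columns are (row, column)
# Rows of i are (1,3), (2,2) and (3,1)
i <- array(c(1:3, 3:1), dim = c(3, 2))
i

# Extract X[1,3], X[2,2] and X[3,1] as a vector
extracted <- X[i]
extracted      # 9 6 3

# Replace those entries by zeroes
X[i] <- 0
X
```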