# R console input and evaluation

## - Entering input

In [2]:
x <- 1
print(x)

[1] 1


In [3]:
x

In [8]:
msg <- "Hello"
msg

In [7]:
x <- ## Incomplete expression

ERROR: Error in parse(text = x, srcfile = src): <text>:2:0: unexpected end of input
1: x <- ## Incomplete expression
   ^


## - Printing

In [10]:
x <- 1:20
print(x) # integer vector

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


---
# R Data types - objects and attributes

## - Objects
> Everything in R is an object, and objects have five basic *atomic* classes:

1. Character
2. Numeric (real numbers)
3. Integer
4. Complex
5. Logical (T/F)

    * The most basic object is a vector (it can only contain objects of the same class);
    * There is one exception, called a **list**, which is represented as a vector but can contain objects of different classes;
    * Empty vectors can be created using the `vector()` function.

## - Number

* Numbers are generally treated as numeric objects (i.e., double-precision real numbers);
* To get an integer, you must specify the **L** suffix (e.g., integer 1 = 1**L**);
* The special number **Inf** or -**Inf** means infinity (e.g., 1/0=**Inf** and 1/**Inf**=0);
* The value **NaN** ("Not a Number") represents an undefined value and can also be thought of as a missing value (e.g., 0/0=**NaN**).
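
The rules above can be checked at the console. This is only a minimal sketch; the variable names (`num`, `int`, `pos_inf`, `undefined`) are illustrative and not part of the original notes.

```r
num <- 1              # plain numbers are stored as double-precision numerics
int <- 1L             # the L suffix requests an integer
class(num)            # "numeric"
class(int)            # "integer"

pos_inf <- 1 / 0      # Inf
1 / pos_inf           # 0
undefined <- 0 / 0    # NaN
is.nan(undefined)     # TRUE
is.finite(pos_inf)    # FALSE
```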

## - Attributes

* Names, dimnames
* Dimensions (e.g.: matrices, arrays)
* Class
* Length
* Other user-defined attributes/metadata

> Attributes of an object can be accessed using the `attributes()` function.
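
As a rough illustration of `attributes()`, the sketch below inspects a matrix, a named vector and a plain vector. The attribute name `"units"` is a made-up example of user-defined metadata.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
attributes(m)               # a list with one element: $dim

v <- c(a = 1, b = 2)
attributes(v)               # a list with one element: $names

attr(v, "units") <- "cm"    # "units" is an arbitrary, user-defined attribute
attributes(v)               # now $names and $units

attributes(1:3)             # NULL: a plain vector carries no attributes
```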

In [4]:
-1/0

---
# Data types - Vectors and lists

## - Creating vectors

* The `c()` function can be used to create vectors of objects


In [8]:
x <- c(0.5, 0.6)  # numeric
y <- c(TRUE, FALSE)  # logical
z <- c(T,F)  # logical
w <- c("a", "b", "c", "d")  # character
a <- 9:29  # integer
b <- c(1+0i, 2+4i)  # complex

print(x)
print(y)
print(z)
print(w)
print(a)
print(b)

[1] 0.5 0.6
[1]  TRUE FALSE
[1]  TRUE FALSE
[1] "a" "b" "c" "d"
 [1]  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[1] 1+0i 2+4i


* Or using the `vector()` function

In [10]:
x <- vector("numeric", length=10)
print(x)

 [1] 0 0 0 0 0 0 0 0 0 0


## - Mixing objects

* When different objects are mixed in a vector, **coercion occurs** so that every element is the same class.

In [15]:
a <- c(1.7, "a")  # character
b <- c(TRUE, 2)  # numeric
c <- c("a", TRUE)  # character

class(a)
class(b)
class(c)

## - Explicit coercion

* Objects can be explicitly coerced from one class to another using the `as.*` functions (if available).

In [20]:
x <- 0:6
class(x)
as.numeric(x)
as.logical(x)
as.character(x)

## - Explicit coercion (cont'd)

* Nonsensical coercion results in `NA`s

In [22]:
x <- c("a", "b", "c")
as.numeric(x)
as.logical(x)
as.complex(x)

“NAs introduced by coercion”

“NAs introduced by coercion”

## - Lists

* Lists are a special type of vector that can contain elements of different classes

In [24]:
x <- list(1, "a", T, 1+4i)
print(x)

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i



---
# Data types - Matrices

* Matrices are vectors with a **dimension** attribute;
* The dimension attribute is itself an integer vector of length 2 **(nrow, ncol)**.

In [26]:
m <- matrix(nrow=2, ncol=3)
print(m)
dim(m)
attributes(m)

     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA


## - Matrices (cont'd)

* Matrices are constructed *column-wise*, so entries can be thought of as starting in the 'upper left' corner and running down the columns;

In [30]:
m <- matrix(1:6, nrow=2, ncol=3)
print(m)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


* Matrices can also be created directly from vectors by adding a dimension attribute;

In [29]:
m <-1:10
print(m)
dim(m) <- c(2,5)
print(m)

 [1]  1  2  3  4  5  6  7  8  9 10
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10


* Matrices can also be created by *column-binding* or *row-binding* with the `cbind()` and `rbind()` functions;

In [34]:
x <- 1:3
y <- 10:12
cbind(x,y)
rbind(x,y)

     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12

  [,1] [,2] [,3]
x    1    2    3
y   10   11   12


---
# Data types - Factors
* Factors are used to represent categorical data;
* They can be unordered or ordered;
* One can think of a factor as an integer vector where each integer has a **label**;
* Factors are treated specially by modelling functions like `lm()` and `glm()`;
* Using factors with labels is better than using integers because factors are self-describing:
    * Having a variable that has values "Male" and "Female" is better than one that has values 1 and 2.
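
The distinction between unordered and ordered factors can be sketched as follows. The variable `sizes` and its levels are invented for illustration.

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
print(sizes)
is.ordered(sizes)     # TRUE
sizes < "high"        # comparisons are defined for ordered factors
as.integer(sizes)     # underlying integer codes: 1 3 2 1
levels(sizes)         # the labels attached to those codes
```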

In [37]:
x <- factor(c("yes", "yes", "no", "yes", "no"))
print(x)
table(x)
unclass(x)

[1] yes yes no  yes no 
Levels: no yes


x
 no yes 
  2   3 

* The order of the levels can be set using the `levels=` argument in the `factor()` function:
    * It can be important in linear modeling because the first level is used as the baseline level.

In [38]:
x <- factor(c("yes", "yes", "no", "yes", "no"), levels=c("yes", "no"))
print(x)

[1] yes yes no  yes no 
Levels: yes no


---
# Data types - Missing values

* Missing values are denoted by `NA`, or by `NaN` for undefined mathematical operations;
* The `is.na()` function tests whether objects are `NA`, and the `is.nan()` function tests for `NaN`;
* `NA` values have a class too, so there are integer `NA`, character `NA`, etc.;
* A `NaN` value is also `NA`, but the converse is not true.

In [39]:
x <- c(1,2,NaN,NA,4)
is.na(x)
is.nan(x)
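
The claim that `NA` values carry a class can be checked with R's typed `NA` constants. A minimal sketch:

```r
class(NA)               # "logical"
class(NA_integer_)      # "integer"
class(NA_real_)         # "numeric"
class(NA_character_)    # "character"

is.na(NaN)    # TRUE:  NaN counts as missing
is.nan(NA)    # FALSE: NA is not NaN
```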

---
# Data types - Data frames

* Data frames are used to store tabular data;
* They are represented as a special type of list where every element has the same length;
* Each element of the list can be thought of as a column, and the length of each element is the number of rows;
* Unlike matrices, data frames can store a different class of object in each column (just like lists);
* Data frames also have a special attribute called `row.names`;
* Data frames are usually created by calling `read.table()`, `read.csv()` or `data.frame()`;
* They can be converted to a matrix by calling `data.matrix()`.


In [41]:
df <- data.frame(foo=1:4, bar=c(T,T,F,F))
print(df)
nrow(df)
ncol(df)

  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
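
The list-like nature of a data frame, its `row.names`, and the conversion with `data.matrix()` can be explored with a short sketch. It rebuilds the same `df` so that it runs on its own; the name `m` is illustrative.

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

sapply(df, class)        # one class per column: integer and logical
length(df)               # 2: a data frame is a list of columns
row.names(df)            # "1" "2" "3" "4"

m <- data.matrix(df)     # the logical column is coerced to 1/0
is.matrix(m)             # TRUE
dim(m)                   # 4 2
```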


---
# Data types - Names

* Names are very useful for writing readable code and self-describing objects

In [44]:
x <- 1:3
print(x)
names(x)
names(x) <- c("foo", "bar", "norf")
print(x)
names(x)

[1] 1 2 3


NULL

 foo  bar norf 
   1    2    3 


* Lists can also have names:

In [49]:
x <- list(a=1, b=2, c=3)
print(x)
print(x[1])
print(x$a)

$a
[1] 1

$b
[1] 2

$c
[1] 3

$a
[1] 1

[1] 1


* And so can matrices (via `dimnames()`):

In [51]:
m <- matrix(1:4, nrow=2, ncol=2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
print(m)

  c d
a 1 3
b 2 4


---
# Summary

* Data types:
    * Atomic classes: numeric, logical, character, integer, complex;
    * Vectors, lists;
    * Factors;
    * Missing values;
    * Data frames;
    * Names

---
# Reading and writing tabular data
## - Read

* The `read.table()` and `read.csv()` functions for reading tabular data;
* The `readLines()` function for reading lines of a text file;
* `source()` for reading in R code files (inverse of `dump()`);
* `dget()` for reading in R code files (inverse of `dput()`);
* `load()` for reading in saved workspaces;
* `unserialize()` for reading single R objects in binary form.

## - Write

* `write.table()`
* `writeLines()`
* `dump()`
* `dput()`
* `save()`
* `serialize()`
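
Each writer pairs with a reader from the previous list. The sketch below round-trips one small object through three of those pairs; the object `obj` and the temporary file names are assumptions made for the example.

```r
obj <- list(id = 1L, label = "a")

# save() / load(): stores named objects; load() restores them under the same name
rdata <- tempfile(fileext = ".RData")
save(obj, file = rdata)
rm(obj)
load(rdata)

# serialize() / unserialize(): a single object as a raw vector
bytes <- serialize(obj, connection = NULL)
copy <- unserialize(bytes)
identical(copy, obj)     # TRUE

# writeLines() / readLines(): plain text, one element per line
txt <- tempfile(fileext = ".txt")
writeLines(c("first", "second"), txt)
readLines(txt)
```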

## - Reading data files with `read.table()`

* A few important arguments:
    * `file=` the name of a file or a connection;
    * `header=` a logical indicating whether the file has a header line;
    * `sep=` a string indicating how the columns are separated;
    * `colClasses=` a character vector indicating the class of each column in the dataset;
    * `nrows=` the number of rows in the dataset;
    * `comment.char=` a character string indicating the comment character (the default is `#`, but another character can be specified);
    * `skip=` the number of lines to skip from the beginning;
    * `stringsAsFactors=` should character variables be coded as factors?
    
   
* For small to moderately sized datasets, you can usually call `read.table()` without specifying any other arguments. R will automatically:
    * Skip lines that begin with **#**;
    * Figure out how many rows there are (and how much memory needs to be allocated);
    * Figure out what type of variable is in each column of the table.
* Telling R all these things directly makes it run faster and more efficiently;
* For `read.csv()` the default separator is a comma and the default is `header=TRUE`.

In [52]:
data <- read.table("example.txt")

“cannot open file 'example.txt': No such file or directory”

ERROR: Error in file(file, "rt"): cannot open the connection
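
The call above fails only because `example.txt` does not exist in the working directory. A self-contained sketch that first writes a small file to a temporary location might look like this; the file contents and column names are invented for illustration.

```r
path <- tempfile(fileext = ".txt")
writeLines(c("# a comment line, skipped by default",
             "id name score",
             "1 ann 9.5",
             "2 bob 7.0"), path)

data <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
print(data)
sapply(data, class)      # classes guessed by R for each column

# read.csv(): comma separator and header = TRUE by default
csv <- tempfile(fileext = ".csv")
write.csv(data, csv, row.names = FALSE)
from_csv <- read.csv(csv)
```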


---
# Reading large tables

## - Reading in larger datasets with `read.table()`

* Read the help page for `read.table()`;
* Make a rough calculation of the memory required to store your dataset;
* If the dataset is larger than the amount of RAM available, don't try to read it in;
* Set `comment.char=""` if there are no commented lines in your file;
* Use the `colClasses=` argument; it can make the function run much faster (often twice as fast). If you know the class of every column, set it directly; if you don't, you can let R work the classes out from the first rows and reuse them:

In [None]:
initial <- read.table("example.txt", nrows=100)
classes <- sapply(initial, class)
df <- read.table("example.txt", colClasses=classes)

* Set `nrows=`, which can help with memory usage. A mild overestimate is okay; you can use the Unix command `wc -l example.txt` to count the number of lines in a file.
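
Putting these tips together, a runnable sketch could look like the following. It generates its own 1,000-row file in a temporary location, and `readLines()` stands in for `wc -l` only because the demo file is small; the names `n_lines` and `big` are illustrative.

```r
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:1000, value = rnorm(1000)),
            path, row.names = FALSE)

n_lines <- length(readLines(path))     # portable stand-in for `wc -l`

# Guess the column classes from the first 100 rows only
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

big <- read.table(path, header = TRUE,
                  colClasses = classes,
                  nrows = n_lines,     # mild overestimate: counts the header too
                  comment.char = "")   # no commented lines in this file
dim(big)                               # 1000 2
```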

## - Know thy system

* It's useful to know a few things about your system:
    * How much memory is available?
    * What other applications are in use?
    * Are there other users logged into the same system?
    * What operating system is it running?
    * Is the OS 32-bit or 64-bit?
    
## - Calculating memory requirements

* Consider a data frame with 1,500,000 rows and 120 columns, all of which are numeric data;
* How much memory is required to store this data frame?

> 1,500,000 x 120 x 8 bytes/numeric
> = 1,440,000,000 bytes
> = 1,440,000,000 / 2²⁰ bytes/MB
> = 1,373.29 MB
> = 1.34 GB
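
The same arithmetic can be reproduced in R. The cross-check with `object.size()` on a smaller data frame is an addition for illustration; it reports slightly more than the raw estimate because R stores some overhead per object.

```r
rows <- 1500000
cols <- 120
bytes <- rows * cols * 8     # 8 bytes per numeric (double)
mb <- bytes / 2^20
gb <- mb / 2^10
round(c(bytes = bytes, MB = mb, GB = gb), 2)

# Cross-check on a smaller all-numeric data frame (1,000 rows x 120 columns)
small <- as.data.frame(matrix(0, nrow = 1000, ncol = 120))
object.size(small)           # a little above 1000 * 120 * 8 bytes
```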

---
# Textual data formats

* `dump()` and `dput()` are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable;
* Unlike writing out a table or CSV file, `dump()` and `dput()` preserve the metadata (sacrificing some readability), so that another user doesn't have to specify it all over again;
* Textual formats work much better with version control programs like Subversion or Git, which can only track changes meaningfully in text files;
* Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem;
* Textual formats adhere to the "Unix philosophy" (store all kinds of data in text);
* Downside: the format is not very space-efficient;
* A way to pass data around is by deparsing the R object with `dput()` and reading it back in using `dget()`:

In [54]:
y <- data.frame(a=1, b="a")
dput(y)
dput(y, file="y.R")
new.y <- dget("y.R")
new.y

structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", 
"b"), row.names = c(NA, -1L), class = "data.frame")


  a b
1 1 a


* Multiple objects can be deparsed using the `dump()` function and read back in using `source()`:

In [55]:
x <- "foo"
y <- data.frame(a=1, b="a")
dump(c("x", "y"), file="data.R")
rm(x,y)
source("data.R")
y

  a b
1 1 a


---
# Connections: Interfaces to the outside world

* Data are read in using connection interfaces:
    * `file()` opens a connection to a file;
    * `gzfile()` opens a connection to a file compressed with gzip;
    * `bzfile()` opens a connection to a file compressed with bzip2;
    * `url()` opens a connection to a webpage.

In [57]:
str(file)

function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), 
    raw = FALSE, method = getOption("url.method", "default"))  


* `description=` is the name of the file;
* `open=` is a code indicating the mode:
    * "r" read only;
    * "w" writing (and initializing a new file);
    * "a" appending;
    * "rb", "wb", "ab" reading, writing and appending in binary mode (Windows).
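
A small sketch of the `open=` modes, using a temporary file so that it can be run anywhere. The file contents are placeholders.

```r
path <- tempfile(fileext = ".txt")

con <- file(path, open = "w")     # create (or truncate) the file and write
writeLines("line 1", con)
close(con)

con <- file(path, open = "a")     # append to the existing file
writeLines("line 2", con)
close(con)

con <- file(path, open = "r")     # read only
lines <- readLines(con)           # "line 1" "line 2"
close(con)
```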
    
## - Reading lines of a text file

* Connections can be useful if you want to read only part of a file:

In [None]:
con <- gzfile("example.gz")
x <- readLines(con, 10)

* `readLines()` can be useful for reading in lines of webpages:

In [60]:
## This might take time
con <- url("https://www.jhsph.edu/index.html", "r")
x <- readLines(con)
head(x)

---
# Subsetting - Basics

* "**[]**" always returns an object of the same class as the original (it can be used to select more than one element);
* "**[[]]**" is used to extract elements of a list or a data frame (it extracts a single element, and the class of the returned object is not necessarily a list or data frame);
* "**$**" is used to extract elements of a list or a data frame by name (similar to "[[]]").

In [66]:
x <- c("a", "b", "c", "c", "d", "a")
print(x[1])
print(x[2])
print(x[1:4])
print(x[x>"a"])
u <- x>"a"
print(u)
print(x[u])

[1] "a"
[1] "b"
[1] "a" "b" "c" "c"
[1] "b" "c" "c" "d"
[1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[1] "b" "c" "c" "d"


---
# Subsetting - Lists

* "**[[]]**" and "**$**" both extract the element itself, whereas "**[]**" returns a list; "**$**" only works with literal names:

In [77]:
x <- list(foo=1:4, bar=0.6)
class(x[1])  # list
class(x[[1]])  # integer vector
class(x$bar)  # numeric element
class(x["bar"])  # list
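
One practical difference between "[[]]" and "$" is that "[[]]" accepts a computed index, while "$" only matches a literal name. A minimal sketch, where the variable `name` is introduced for illustration:

```r
x <- list(foo = 1:4, bar = 0.6)
name <- "foo"

x[[name]]     # computed index: the element called "foo"
x$name        # looks for an element literally called "name": NULL
x$foo         # literal name: the element called "foo"
```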

* To extract multiple elements use "**[]**"

In [79]:
x <- list(foo=1:4, bar=0.6, baz="hello")
x[c(1,3)]

* To subset nested elements of a list use "**[[]]**"

In [80]:
x <- list(a=list(10,12,14), b=c(3.14,2.81))
x[[c(1,3)]]
x[[1]][[3]]
x[[c(2,1)]]

---
# Subsetting - Matrices

* Matrices can be subsetted in the usual way with (i,j), i.e. (row, column), indices;

In [82]:
x <- matrix(1:6, 2, 3)
print(x)
x[1,2]
x[2,1]
x[1,]
x[,2]

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


* By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1x1 matrix. To change this, set `drop=FALSE`:

In [94]:
print(x[1,2])
print(x[1,2,drop=F])
print(x[1,])
print(x[1,,drop=F])

[1] 3
     [,1]
[1,]    3
[1] 1 3 5
     [,1] [,2] [,3]
[1,]    1    3    5


---
# Subsetting with Names

* Partial matching of names is allowed with "**$**" by default, but with "**[[]]**" you must set `exact=FALSE`:

In [95]:
x <- list(aardvark=1:5)
print(x$a)
print(x[["a"]])
print(x[["a", exact=F]])

[1] 1 2 3 4 5
NULL
[1] 1 2 3 4 5


---
# Subsetting - Removing NA values

* The idea is to construct a logical vector marking the locations of the `NA` elements and then exclude them:


In [96]:
x <- c(1,2,NA,4,NA,5)
bad <- is.na(x)
print(bad)
print(x[!bad])

[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
[1] 1 2 4 5


* What if there are multiple objects and you want to take the subset with no missing values?
> `complete.cases()` returns FALSE at every position where at least one of the objects has an `NA`

In [97]:
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x,y)
print(good)
print(x[good])
print(y[good])

[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
[1] 1 2 4 5
[1] "a" "b" "d" "f"


* `complete.cases()` can also be used with data frames, to remove rows that have an `NA` in any column.
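
For a data frame, a sketch along these lines keeps only the rows without missing values. The data frame `df` below is made up for the example; `na.omit()` is shown as an equivalent shortcut.

```r
df <- data.frame(x = c(1, 2, NA, 4, NA, 5),
                 y = c("a", "b", NA, "d", "e", "f"))

ok <- complete.cases(df)
print(ok)              # TRUE TRUE FALSE TRUE FALSE TRUE
clean <- df[ok, ]      # rows 1, 2, 4 and 6
na.omit(df)            # drops the same rows
```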

---

# Vectorized operations

* Vectorization makes the code more efficient, more concise and easier to read:

In [98]:
x <- 1:4; y <- 6:9
print(x)
print(y)
x+y
x>2
x>=2
x==8  # element-wise equality test (a single = would assign 8 to x)
x*y
x/y

[1] 1 2 3 4
[1] 6 7 8 9
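
To see what vectorization saves, the sketch below computes the same element-wise sum with an explicit loop and with a single vectorized expression. The name `total` is illustrative.

```r
x <- 1:4; y <- 6:9

# Explicit loop: one element at a time
total <- numeric(length(x))
for (i in seq_along(x)) {
  total[i] <- x[i] + y[i]
}

# Vectorized: the whole vectors at once
x + y                       # 7 9 11 13
all(total == x + y)         # TRUE
```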


In [99]:
x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
print(x)
print(y)
x*y  # element-wise multiplication
x/y
x%*%y  # true matrix multiplication

     [,1] [,2]
[1,]    1    3
[2,]    2    4
     [,1] [,2]
[1,]   10   10
[2,]   10   10


     [,1] [,2]
[1,]   10   30
[2,]   20   40

     [,1] [,2]
[1,]  0.1  0.3
[2,]  0.2  0.4

     [,1] [,2]
[1,]   40   40
[2,]   60   60
