<!--NAVIGATION-->
<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>< [R Syntax](02.01-R-Syntax.ipynb) | [Contents](00.00-Index.ipynb) | [Quiz 2](02.03-Quiz.ipynb) ></span>

<a href="https://colab.research.google.com/github/eurostat/e-learning/blob/main/r-official-statistics/02.02-Data-Types.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


<a id='top'></a>

# Data Types
## Content  
- [String](#string)  
- [Numbers](#numbers)  
- [Booleans](#booleans)  
- [Iterables](#sequences)  
 - [Vectors](#vectors)  
 - [Matrix](#matrix)  
 - [Factors](#factors)  
 - [Lists](#lists)  
 - [Dataframe](#dframe)

R represents "nothing" with an object called `NULL`. If a variable must be declared before it has a value or a type, you can assign `NULL` to it (`NULL` is also the default value of some function parameters):  

In [1]:
a <- NULL
typeof(a)
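
As a small sketch of the default-parameter use mentioned above, the function `describe()` below is a hypothetical helper (not part of R or of this course) that takes `label = NULL` and tests it with `is.null()`:

```r
# describe() is a hypothetical helper written only for this illustration
describe <- function(x, label = NULL) {
    if (is.null(label)) {
        label <- "unnamed"   # fall back when the caller gave no label
    }
    paste(label, "has", length(x), "element(s)")
}

describe(1:3)
describe(1:3, label = "v")

# NULL has length zero and disappears when combined with other values
length(NULL)
c(1, NULL, 2)
```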

Actually there are several special values:

In [2]:
# helper
show_nicely <- function(x) {
    cat(deparse(substitute(x)), x, typeof(x), '\n')
}

# infinity
x <- Inf
show_nicely(x)

# Not a Number
y <- NaN
show_nicely(y)

# Not Available
n <- NA
show_nicely(n)

# Not Available, with an explicit data type
n1 <- NA_integer_
show_nicely(n1)
n2 <- NA_real_
show_nicely(n2)
n3 <- NA_complex_
show_nicely(n3)
n4 <- NA_character_
show_nicely(n4)

x Inf double 
y NaN double 
n NA logical 
n1 NA integer 
n2 NA double 
n3 NA complex 
n4 NA character 
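
These values cannot be tested with `==`, because any comparison with `NA` is itself `NA`. A brief sketch using the base predicates `is.na()`, `is.nan()` and `is.finite()` (the vector `vals` is made up for the illustration):

```r
vals <- c(1, NA, NaN, Inf)

NA == NA          # NA, not TRUE
is.na(vals)       # TRUE for both NA and NaN
is.nan(vals)      # TRUE only for NaN
is.finite(vals)   # FALSE for NA, NaN and Inf

# many summary functions accept na.rm to skip missing values
sum(vals, na.rm = TRUE)
```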


<a id='string'></a>

## String
Strings can be written with single or double quotes.  
The `stringr` package offers many useful functions for working with them.  
### Single quotes and double quotes
Both kinds of quotes create the same simple strings. To include a single quote `'` in a string, declare the string with double quotes, and vice versa. Some examples:

In [3]:
print('a')
print('aa')
print("aaa")
print('Hello!')
print("don't worry")

[1] "a"
[1] "aa"
[1] "aaa"
[1] "Hello!"
[1] "don't worry"


It is also possible to use escape characters, such as `\'` for a single quote inside a single-quoted string.  
Example: 


In [4]:
print('don\'t')

[1] "don't"
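
`print()` shows a string as R source, so escape sequences stay visible, while `cat()` writes the characters themselves. A short sketch (the string `s` is made up for the illustration):

```r
s <- "tab:\there\nquote: \"ok\""

print(s)        # shows the escape sequences
cat(s, "\n")    # interprets them: a real tab and a line break
nchar("\n")     # an escape sequence counts as one character
```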


### Multiline Strings  
A string can span multiple lines:

In [5]:
haiku = '"The Old Pond" by Matsuo Bashō
An old silent pond
A frog jumps into the pond —
Splash! Silence again.'

writeLines(haiku)

"The Old Pond" by Matsuo Bashō
An old silent pond
A frog jumps into the pond —
Splash! Silence again.


### Printing and formatting

R provides a series of functions for printing strings.  
Here are some examples:

In [6]:
my_string <- "programming with data is fun"
# generic printing: print()
print(my_string)
# or no quotes
print(my_string, quote = FALSE)
# same print with no quotes: noquote()
noquote(my_string)

[1] "programming with data is fun"
[1] programming with data is fun


[1] programming with data is fun

In [7]:
# concatenation: cat()
cat(my_string, "with R\n")
# using a separator
cat(1:10, sep = ", ")

programming with data is fun with R
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

In [8]:
# special formats: format()
# use of 'nsmall'
format(13.7, nsmall = 3)
# use of 'digits'
format(c(6.0, 13.1), digits = 2)
# use of 'digits' and 'nsmall'
format(c(6.0, 13.1), digits = 2, nsmall = 1)
# justify options
format(c("A", "BB", "CCC"), width = 5, justify = "centre")

In [9]:
# convert to string: toString()
toString(3.12345)
toString(3i + 5)
toString(FALSE)

In [10]:
# C-style printing: sprintf() - usual suspects
sprintf("Second number: %2$d, first number: %1$d", 2, 1)
sprintf("%.*s", 3, "abcdef")
sprintf('%.3f', 1/6)
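
`sprintf()` is vectorised over its arguments, which is handy for building labels. A minimal sketch with made-up values (`country` and `share` are not from the course data):

```r
country <- c("BE", "FR", "IT")      # illustrative codes
share   <- c(0.1234, 0.5, 0.07)     # illustrative proportions

# %s inserts a string, %5.1f a number in width 5 with one decimal,
# and %% a literal percent sign
labels <- sprintf("%s: %5.1f%%", country, 100 * share)
labels
```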

### Some more functionality for strings
More can be done with the `base` package, as these examples show:

In [11]:
spam <- 'Hello world!'
# length
nchar(spam)
# convert to capital letters, or vice versa
print(toupper(spam))
print(tolower(spam))
# find a substring (works with regular expression too)
print(grep('He', spam))
# extract a substring
print(substr(spam, 7, 11))
# joining several strings into one, separated by a space
print(paste('My', 'name', 'is', 'Simon', sep=' '))
# strsplit: the opposite of paste
print(strsplit('My name is Simon', ' '))

[1] "HELLO WORLD!"
[1] "hello world!"
[1] 1
[1] "world"
[1] "My name is Simon"
[[1]]
[1] "My"    "name"  "is"    "Simon"
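
`grep()` returns positions, which is why it printed `1` above. When a TRUE/FALSE answer or a replacement is needed, base R also offers `grepl()`, `sub()` and `gsub()`. A short sketch reusing `spam`:

```r
spam <- 'Hello world!'

grepl('wor', spam)                          # TRUE/FALSE instead of a position
grepl('^hello', spam)                       # FALSE: matching is case sensitive
grepl('^hello', spam, ignore.case = TRUE)   # TRUE once case is ignored
sub('o', '0', spam)                         # replaces the first match only
gsub('o', '0', spam)                        # replaces every match
```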



### stringr Package
`stringr` is a very frequently used package for working with strings. As a rule, functions from this package start with `str_`. Some examples:

In [12]:
library(stringr)

#### Character manipulation

In [13]:
str_length("abc")

x <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(x, 3, 3)
# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)
# You can also use str_sub() to modify strings:
str_sub(x, 3, 4) <- "X"
x

#### Whitespace
Three functions add, remove, or modify whitespace:

In [14]:
x <- c("abc", "defghi")
# default pads on left
str_pad(x, 10)
str_pad(x, 10, "both")
str_pad(x, 4)
# now using a pipe and str_trunc
x <- c("Short", "This is a long string")
x %>% str_trunc(10) %>% str_pad(10, "right")


#### Locale sensitive

In [15]:
x <- "I like horses."
str_to_upper(x)
str_to_title(x)

x <- c("y", "i", "k")
str_order(x)
str_sort(x)

#### Pattern matching

In [16]:
strings <- c(
  "apple", 
  "219 733 8965", 
  "329-293-8753", 
  "Work: 579-499-7527; Home: 543.355.3679"
)
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Which strings contain phone numbers?
str_detect(strings, phone)
# Returning elements with phone numbers
str_subset(strings, phone)
# How many phone numbers in each string?
str_count(strings, phone)
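
For comparison, roughly the same checks can be sketched without `stringr`, using the base R functions `grepl()`, `gregexpr()` and `regmatches()`. This is an alternative illustration, not how `stringr` works internally:

```r
strings <- c("apple", "219 733 8965", "329-293-8753",
             "Work: 579-499-7527; Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

grepl(phone, strings)                  # similar to str_detect()
strings[grepl(phone, strings)]         # similar to str_subset()

# all matches per string, returned as a list of character vectors
matches <- regmatches(strings, gregexpr(phone, strings))
lengths(matches)                       # similar to str_count()
matches[[4]]                           # the numbers found in the last string
```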

<a id='numbers'></a>

## Numbers
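There are three numeric types: 
- integer: whole numbers, written with an `L` suffix (for example `6L`),
- numeric: real numbers, stored as `double`, which is also the type of literals such as `5` or `0xa`,
- complex: complex numbers, written with an `i` suffix (for example `4 + 6i`).  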

Some examples:

In [17]:
print(typeof(-69))
print(typeof(6L))
print(typeof(0xa))
print(typeof(4 + 6i))

y = 5
print(class(y))
print(typeof(y))
print(is.integer(y))

[1] "double"
[1] "integer"
[1] "double"
[1] "complex"
[1] "numeric"
[1] "double"
[1] FALSE
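
Because literals such as `5` are doubles, an explicit conversion is needed to obtain integers. The sketch below shows `as.integer()` and the limited range of the integer type (`.Machine$integer.max` is a base R constant):

```r
y <- 5
yi <- as.integer(y)
typeof(yi)                 # "integer"
is.numeric(yi)             # TRUE: integers and doubles are both numeric

as.integer(3.9)            # 3: truncates toward zero, does not round
.Machine$integer.max       # largest representable integer
typeof(5L / 2L)            # "double": division always returns a double
5L %/% 2L                  # integer division keeps the integer type
```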


### Element-wise Logical operators
Logical operators combine their operands element by element and return `TRUE` or `FALSE` for each pair. Numeric operands are converted first: any non-zero value, real or complex, counts as `TRUE`, and zero counts as `FALSE`. 

#### Element-wise Logical AND operator (&)
Returns `TRUE` where both operands are `TRUE`.

In [18]:
list1 <- c(TRUE, 0.1)
list2 <- c(0,4+3i)
print(list1 & list2)

[1] FALSE  TRUE


#### Element-wise Logical OR operator (|)
Returns `TRUE` where at least one of the operands is `TRUE`.

In [19]:
print(list1|list2)

[1] TRUE TRUE


#### NOT operator (!)
A unary operator that negates the status of the elements of the operand.

In [20]:
print(!list2)

[1]  TRUE FALSE


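#### Logical AND operator (&&)
Works on single values: returns `TRUE` if both operands are `TRUE`, and does not evaluate the second operand when the first is `FALSE`. Vectors longer than one element are not accepted. Earlier versions of R used only their first elements (R 4.2 with a warning), and since R 4.3.0 this is an error, so the first elements are selected explicitly here.

In [21]:
print(list1[1] && list2[1])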


[1] FALSE


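#### Logical OR operator (||)
Works on single values: returns `TRUE` if at least one operand is `TRUE`, and does not evaluate the second operand when the first is `TRUE`. As with `&&`, vectors longer than one element are an error since R 4.3.0, so the first elements are selected explicitly.

In [22]:
print(list1[1] || list2[1])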


[1] TRUE
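
Inside `if()` a single `TRUE` or `FALSE` is required, so element-wise results are usually collapsed with `any()` or `all()` before being combined with `&&` or `||`. A minimal sketch (the vector `ages` is made up for the illustration):

```r
ages <- c(21, 15, 34)

ages >= 18                      # element-wise: one result per element
all(ages >= 18)                 # FALSE: not every element qualifies
any(ages >= 18)                 # TRUE: at least one does

# && evaluates its right-hand side only when the left-hand side is TRUE
if (length(ages) > 0 && all(ages > 0)) {
    message("all ages are positive")
}
```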


<a id='booleans'></a>

## Booleans
R has a single boolean type, `logical`, with the values `TRUE` and `FALSE`.  
### Comparison operators
They work for most types, but the operands must be of the same type or be convertible to a common, wider type.  
 
| Operation | Symbol |
| :- | :-: |
| Equal | == |
| Not equal | != |
| Greater than | > |
| Less than | < |
| Greater than or equal to | >= |
| Less than or equal to | <= |  
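
Comparisons are vectorised, and operands of different types are converted as described above. Doubles deserve care: results of arithmetic may differ in the last bits, so `==` can be `FALSE` where equality is expected, while `all.equal()` compares within a tolerance. A short sketch:

```r
c(1, 5, 9) > 4              # element-wise: FALSE TRUE TRUE
1L == 1.0                   # TRUE: the integer is converted to double
"apple" < "banana"          # strings compare in the locale's sort order

0.1 + 0.2 == 0.3            # FALSE because of floating point rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```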


### Membership operator
The membership operator `%in%` checks whether a specific item is present in a vector or a list.

In [23]:
5 %in% c(1, 2, 3, 4, 7)
3 %in% c(1, 2, 3, 4, 7)
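
`%in%` is also vectorised over its left-hand side, which makes it useful for validating codes against a reference set. A small sketch with made-up values (`codes` and `valid` are illustrative names):

```r
codes <- c("BE", "XX", "FR")          # values to check (made up)
valid <- c("BE", "DE", "FR", "IT")    # reference set (made up)

codes %in% valid            # one TRUE/FALSE per element of codes
codes[!(codes %in% valid)]  # the values missing from the reference set
match(codes, valid)         # positions in valid, NA when absent
```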

<a id='sequences'></a>

## Iterables
In R, a sequence is a generic term for an ordered collection: items are accessed in the same order in which they were stored.  
R supports several types of sequences, for example `vector`, `matrix`, `list`, `factor` and `data.frame`.  

In [24]:
# vector iterating
v <- c(1, 5, 4, 9, 0)
for( a in v)
    print(a)

# matrix iterating
m <- matrix(1:9, nrow = 3, ncol = 3)
for( a in m)
    print(a)

# list iterating
l <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
for( a in l)
    print(a)

# factor iterating
f <- factor(c("single", "married", "married", "single"))
for( a in f)
    print(a)

# data frame
df <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
for( a in df)
    print(a)

[1] 1
[1] 5
[1] 4
[1] 9
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 2.5
[1] TRUE
[1] 1 2 3
[1] "single"
[1] "married"
[1] "married"
[1] "single"
[1] 1 2
[1] 21 15
[1] "John" "Dora"
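
The loops above visit values only; note that the matrix is traversed column by column and the data frame one column at a time. When the position or the name is needed as well, a common pattern is to loop over indices with `seq_along()` or `seq_len()`. A small sketch reusing `l` and `m`:

```r
l <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)

# iterate over positions to keep access to the index and the name
for (i in seq_along(l)) {
    cat(i, names(l)[i], class(l[[i]]), "\n")
}

# for a matrix, loop over row indices to visit it row by row
m <- matrix(1:9, nrow = 3, ncol = 3)
for (r in seq_len(nrow(m))) {
    print(m[r, ])
}
```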


<a id='lists'></a>

## Lists  
A list is a data structure whose components can have different data types. A list can be created using the list() function.

In [25]:
x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
str(x)
x$b
x[['a']]

List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3


In this example, `a`, `b` and `c` are called tags; they make it easier to reference the components of the list.  
  
However, tags are optional. The same list can be created without tags, in which case the components are accessed by their numeric indices.

In [26]:
x <- list(2.5,TRUE,1:3)
str(x)
x[3]
x[c(T, T, F)]

List of 3
 $ : num 2.5
 $ : logi TRUE
 $ : int [1:3] 1 2 3
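
Single and double brackets behave differently on lists: `[` returns a sub-list, while `[[` returns the component itself. A short sketch using the same untagged list:

```r
x <- list(2.5, TRUE, 1:3)

class(x[3])      # "list": single brackets return a sub-list
class(x[[3]])    # "integer": double brackets return the element itself
x[[3]][2]        # second value of the third element
length(x[c(TRUE, TRUE, FALSE)])   # a logical index keeps the first two elements
```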


Elements of a list can be modified, added and removed:

In [27]:
x$a = 3
x[["d"]] <- "Clair"
x
x$d <- NULL
x

<a id='vectors'></a>

## Vectors
A vector is a basic data structure in R. It contains elements of the same type; when values of different types are combined, they are all converted to a common type. Vectors are generally created using the c() function.

In [28]:
v1 <- c(1, 5, 4, 9, 0)
typeof(v1)
length(v1)

v2 <- c(1, 5.4, TRUE, "hello")
v2
typeof(v2)
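
`v2` ends up as a character vector because `c()` converts all elements to the most general type present. A short sketch of that ordering (logical, then integer, double and character):

```r
typeof(c(TRUE, 2L))              # "integer"
typeof(c(TRUE, 2L, 3.5))         # "double"
typeof(c(TRUE, 2L, 3.5, "a"))    # "character"

# explicit conversion goes the other way; values that cannot be parsed
# become NA and R raises a warning
as.numeric(c("1", "5.4", "hello"))
```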

- Creating a vector using : operator

In [29]:
v3 <- 1:7; v3
v4 <- 2:-2; v4

- Creating a vector using seq() function

In [30]:
# specify step size
v5 <- seq(1, 3, by=0.2); v5
# specify length of the vector
v6 <- seq(1, 5, length.out=4); v6
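
Two related base functions are worth knowing alongside `:` and `seq()`: `seq_len()` for index sequences and `rep()` for repeated values. A brief sketch:

```r
seq_len(5)                    # 1 2 3 4 5
seq_len(0)                    # empty vector, safer than 1:0 in loops
1:0                           # counts down: 1 0
rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```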

#### How to access Elements of a Vector?
Elements of a vector can be accessed using vector indexing. The vector used for indexing can be logical, integer or character vector.

- Using integer vector as index

In [31]:
v1
# access 3rd element
v1[3]
# access 2nd and 4th element
v1[c(2, 4)]
# access all but 1st element
v1[-1]   

- Using logical vector as index

In [32]:
v1[c(TRUE, FALSE, FALSE, TRUE)]
# filtering vectors based on conditions
v1[v1 < 5]
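
A condition such as `v1 < 5` is itself a logical vector, which is what makes this filtering work. The sketch below also shows `which()` and the recycling of a short logical index, the reason a four-element index works on the five-element `v1`:

```r
v1 <- c(1, 5, 4, 9, 0)

v1 < 5                     # the logical vector used as index
which(v1 < 5)              # positions of the TRUE values: 1 3 5
v1[v1 > 1 & v1 < 9]        # combine conditions with the element-wise &

# a logical index shorter than the vector is recycled:
# c(TRUE, FALSE) selects every other element
v1[c(TRUE, FALSE)]         # 1 4 0
```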

- Using character vector as index  
This type of indexing is useful when dealing with named vectors. We can name each element of a vector.

In [33]:
x <- c("first"=3, "second"=0, "third"=9)

names(x)
x["second"]
x[c("first", "third")]

#### How to modify a vector in R?

In [34]:
x <- c(-3, -2, -1,  0,  1,  2)
# modify 2nd element
x[2] <- 0; x
# modify elements less than 0
x[x<0] <- 5; x
# truncate x to first 4 elements
x <- x[1:4]; x
# extending the vector
x[7] <- 6; x
# assigning NULL discards the whole vector (not a single value); indexing NULL returns NULL
x <- NULL; x; x[4]

NULL

NULL

<a id='matrix'></a>

## Matrix
A matrix is a two-dimensional data structure in R. 
A matrix is similar to a vector but additionally has a dimension attribute. It can be created using the matrix() function.

In [35]:
m1 <- matrix(1:9, nrow = 3, ncol = 3); m1
# same result is obtained by providing only one dimension
matrix(1:9, nrow = 3)
# fill matrix row-wise
m2 <- matrix(1:9, nrow=3, byrow=TRUE); m2

attributes(m1)
dim(m1)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9


It is possible to name the rows and columns of matrix during creation by passing a 2 element list to the argument dimnames.

In [36]:
m3 <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))
m3

  A B C
X 1 4 7
Y 2 5 8
Z 3 6 9


Another way of creating a matrix is by using functions cbind() and rbind() as in column bind and row bind.

In [37]:
cbind(c(1,2,3),c(4,5,6))

rbind(c(1,2,3),c(4,5,6))

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


You can also create a matrix from a vector by setting its dimensions using dim().

In [38]:
x <- c(1,2,3,4,5,6)
x
class(x)
dim(x) <- c(2,3)
x

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


Modifying a matrix and accessing its elements work as for vectors, except that two indices are given: the first selects rows and the second selects columns.

In [39]:
m1
# select rows 1 & 2 and columns 2 & 3
m1[c(1,2),c(2,3)]
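# leaving the column field blank selects all columns of row 1; the result is a plain vector
v <- m1[1,]; v; class(v)
# use drop=FALSE to keep the result as a matrix
m1[1,,drop=FALSE]
# logical indices work for rows and columns too
m1[c(TRUE,FALSE,TRUE),c(TRUE,TRUE,FALSE)]
# a single index treats the matrix as one long vector, read column by column
m1[c(TRUE, FALSE)]
m1[m1 < 5]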
m3[,"A"]

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


     [,1] [,2]
[1,]    4    7
[2,]    5    8


     [,1] [,2] [,3]
[1,]    1    4    7


     [,1] [,2]
[1,]    1    4
[2,]    3    6
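
The cell above only reads from the matrix. Assignment uses the same index forms, as in this minimal sketch (the matrix `m` is a fresh copy so that `m1` stays unchanged):

```r
m <- matrix(1:9, nrow = 3)

m[2, 3] <- 0L                # one cell: row 2, column 3
m[, 1] <- c(10L, 20L, 30L)   # a whole column
m[m > 8] <- NA               # every cell matching a condition
m

t(m)                         # transpose
dim(m[, 2, drop = FALSE])    # 3 1: still a matrix with one column
```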


<a id='factors'></a>

## Factors
A factor is a data structure for fields that take only a predefined, finite set of values (categorical data). We can create a factor using the function factor().

In [40]:
f <- factor(c("single", "married", "married", "single")); str(f)

 Factor w/ 2 levels "married","single": 2 1 1 2
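
The `str()` output shows that a factor stores integer codes that point into its levels. A short sketch inspecting both parts, and declaring all expected levels at creation time with the `levels` argument of `factor()` (the factor `f2` is made up for the illustration):

```r
f <- factor(c("single", "married", "married", "single"))

levels(f)          # "married" "single" (sorted by default)
as.integer(f)      # the underlying codes: 2 1 1 2
table(f)           # counts per level

# levels declared up front can be assigned later, even if unused at first
f2 <- factor(c("single", "married"),
             levels = c("single", "married", "widowed"))
f2[3] <- "widowed"
f2
```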


Accessing components of a factor works much like accessing elements of a vector. Elements can only be set to values that are already among the levels; to use a new value, first add it to the levels:

In [41]:
f[3]
f[c(2,3)]
f[5] <- 'single'
# this does not work: 'widowed' is not one of the levels, so NA is stored and a warning is raised
f[6] <- 'widowed'; f
# workaround: add the level first
levels(f) <- c(levels(f), "widowed") 
f[6] <- 'widowed'; f

"invalid factor level, NA generated"


<a id='dframe'></a>

## Dataframe
A data frame is a two-dimensional data structure in R. It is a special case of a list in which all components have the same length.  
We can create a data frame using the data.frame() function.

In [42]:
df1 <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora")); df1

# some functions for data frames
str(df1)
names(df1)
ncol(df1)
nrow(df1)
length(df1)

SN    Age   Name
<int> <dbl> <chr>
    1    21 John
    2    15 Dora


'data.frame':	2 obs. of  3 variables:
 $ SN  : int  1 2
 $ Age : num  21 15
 $ Name: chr  "John" "Dora"


Components of data frame can be accessed like a list or like a matrix:
- Like a list

In [43]:
df1["Name"]
df1$Name
df1[["Name"]]
df1[[3]]

Name
<chr>
John
Dora


- Like a matrix  
  
_Note: To illustrate this, we use datasets already available in R (trees in our example). Datasets that are available can be listed with the command library(help = "datasets")._

In [44]:
str(trees)
head(trees,n=3)
# select 2nd and 3rd row
trees[2:3,]
# selects rows with Height greater than 82
trees[trees$Height > 82,]

trees[10:12,2, drop = FALSE]


'data.frame':	31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...


  Girth Height Volume
  <dbl>  <dbl>  <dbl>
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2


  Girth Height Volume
  <dbl>  <dbl>  <dbl>
2   8.6     65   10.3
3   8.8     63   10.2


   Girth Height Volume
   <dbl>  <dbl>  <dbl>
 6  10.8     83   19.7
17  12.9     85   33.8
18  13.3     86   27.4
31  20.6     87   77.0


   Height
    <dbl>
10     75
11     79
12     76
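
Row conditions can be combined with `&`, and the column index can be given at the same time. The sketch below also uses the base functions `subset()` and `order()`; it is one possible way to write such queries on the built-in `trees` data, not the only one:

```r
# rows meeting two conditions, and only two of the columns
tall_thin <- trees[trees$Height > 82 & trees$Girth < 14, c("Girth", "Height")]
tall_thin

# subset() expresses the same filter without repeating trees$
subset(trees, Height > 82 & Girth < 14, select = c(Girth, Height))

# sort by Height, tallest first, and show the top three rows
tallest <- head(trees[order(trees$Height, decreasing = TRUE), ], n = 3)
tallest
```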


#### How to modify a Data Frame in R?

In [45]:
df1
# changing a value
df1[1,"Age"] <- 20; df1
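# adding a row (rbind() returns a new data frame; df1 itself is unchanged)
rbind(df1,list(1,16,"Paul"))
# adding a column (again a new data frame; assign the result to keep it)
cbind(df1,State=c("NY","FL"))
# deleting a row
df1[-1,]
# deleting a column by assigning NULL (no effect here, as df1 has no State column)
df1$State <- NULL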

SN    Age   Name
<int> <dbl> <chr>
    1    21 John
    2    15 Dora


SN    Age   Name
<int> <dbl> <chr>
    1    20 John
    2    15 Dora


SN    Age   Name
<dbl> <dbl> <chr>
    1    20 John
    2    15 Dora
    1    16 Paul


SN    Age   Name  State
<int> <dbl> <chr> <chr>
    1    20 John  NY
    2    15 Dora  FL


  SN    Age   Name
  <int> <dbl> <chr>
2     2    15 Dora
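
In the cell above only the change to `Age` is stored in `df1`; the results of `rbind()`, `cbind()` and `df1[-1,]` are displayed and then discarded. A minimal sketch of making such changes permanent by assigning the result back (the added row and the `State` values are made up for the illustration):

```r
df1 <- data.frame("SN" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"))

df1 <- rbind(df1, list(3L, 16, "Paul"))   # keep the new row
df1$State <- c("NY", "FL", "TX")          # add a column by assignment
df1 <- df1[df1$Name != "John", ]          # drop a row by filtering
df1$State <- NULL                         # drop the column again
df1
```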



<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>This is the Jupyter notebook version of __R for Official Statistics__ produced by Eurostat; the content is available [on GitHub](https://github.com/eurostat/e-learning/tree/main/r-official-statistics).
<br>The text and code are released under the [EUPL-1.2 license](https://github.com/eurostat/e-learning/blob/main/LICENSE).</span>