<a href="https://colab.research.google.com/github/ZhenYuan2002/R/blob/main/Object%20and%20Functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Objects and Functions in R**

- Objects and their names
- Object types
- Data types
- Atomic vectors
- Lists
- Coercion
- S3 atomic vectors
- S3 lists
- Missing values
- Time-series-specific objects
- Functions in R
- Comparisons
- Conditions
- Conditional execution
- Iterations

In R, functions are designed to work with specific object types and may have strict input/output requirements. Functions from other packages are not necessarily optimized for your specific use case. You may need to take additional steps to prepare a dataset before passing it to a function. Similarly, the result from a function may require post-processing before you can analyze or report it. Hence, basic knowledge of R objects, data types, and how to write functions in R is required.

You can access R's help file for any function by using ? or the help() function. For example, ?typeof or help(typeof) opens the documentation of the typeof() function.


# **Objects and their names**



---
**Creating an object**

In R, we first create an object and then assign a name to that object.

A vector of numbers can be created in either of the following ways:
- c(), short for combine, which combines its arguments into a vector
- :, a compact, memory-efficient way to create a sequence of consecutive integers, such as 1:5

Assignment operators are used to bind an object to its name:
1) <- (MOST COMMON)
2) <<-
3) =

The assignment operators <- and <<- have rightward forms, -> and ->>, so the object and its name can change sides as long as the arrow points towards the name of the object.
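
As a hedged illustration of how the assignment operators differ, the sketch below uses hypothetical names (counter, increment, ten, y) that are not part of these notes. It assumes the common usage where <- creates a binding in the current environment and <<- rebinds an existing name in an enclosing environment.

```r
# `counter` and `increment()` are illustrative names only
counter <- 0

increment <- function() {
  local_copy <- counter + 1   # <- binds a name inside the function only
  counter <<- local_copy      # <<- rebinds `counter` in the enclosing environment
  invisible(local_copy)
}

increment()
increment()
counter            # the outer binding has been updated twice

10 -> ten          # rightward form of <-
y = 5              # = also assigns at the top level
ten + y
```

In practice <- covers almost every case; <<- is best reserved for situations where modifying an outer binding is truly intended.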

---
**Naming convention**

- A name can consist of letters, digits, . or _
- A name must not start with _ or a digit, nor with . followed by a digit
- A name cannot be a reserved keyword, such as if, TRUE or function
- A name cannot contain white space
- Avoid using double quotes in names and identifiers
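
One way to check these rules, offered as a sketch rather than an official test, is to compare each candidate name with the output of base R's make.names(), which returns its input unchanged only when the name is already syntactic. The candidate names below are made up for illustration.

```r
# Candidate names (illustrative); only the first three follow the rules above
candidates <- c("x1", ".x", "x_y", "1x", "_x", "my var", "if", ".2way")

# make.names() repairs invalid names, so an unchanged name is a valid one
is_syntactic <- candidates == make.names(candidates)
data.frame(name = candidates, syntactic = is_syntactic)

# A non-syntactic name can still be used when wrapped in backticks,
# although this is best avoided
`my var` <- 1:3
`my var`
```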



In [None]:
# Creating an object
x1 <- c(1,2,3,4,5)
c(1,2,3,4,5) -> x2
x3 <- 1:5
x1
x2
x3

In [None]:
# Naming convention
.x <- 1:5
x_y <- 1:5
.x
x_y

# **Object Types**

Objects in R can be loosely divided into two groups: base objects and objects used for Object-Oriented Programming (OOP).

There are various Object-Oriented (OO) systems in R, such as S3, R6 and S4, with the first one being the most common.

The metadata of an object is stored in attributes, which can be considered as name-value pairs of an object.

Names and dimensions are two common attributes. Unlike most other attributes, they are usually preserved when an object is modified.

Different types of objects and data structures are created by adding various attributes to a base object.

# **Data Types**

Vectors are foundational building blocks of nearly all data structures in R. The simplest form of a vector is a scalar or a one-element vector, which represents a single value.

Vector is an umbrella data type that has two different families of base objects underneath it: atomic vectors and lists. These two broad types of base vectors are further subdivided based on data types and their structure.

---
**Atomic Vectors**

An atomic vector is a fundamental data structure that contains elements of only one data type. Common data types are as follows:

- Numeric
  - Double
  - Integer
- Logical
- Character

---
**Lists**

A list can contain elements of different types and structures. It serves as the foundation for more complex objects, such as data frames.

By applying specific attributes to a base object, you can construct more complex data structures, such as matrices (two-dimensional rectangular structures) and arrays (multi-dimensional generalizations of matrices).

OO objects behave differently from regular base objects when passed to a generic function. An S3 object is built on top of a base object by assigning a class attribute:

- S3 atomic vectors
  - Factors
  - Dates
  - Date-times (POSIXct/POSIXlt)
- S3 lists
  - Data frames
  - Tibbles



# **Atomic Vectors**

In an atomic vector, all elements must be of the same data type. The four most common types of atomic vectors are integer, double, logical and character (complex and raw vectors also exist but are rarely used). The first two are commonly known as numeric vectors.

typeof() returns the type of an object

The following functions test whether a vector is of a particular type.
- is.double()
- is.integer()
- is.logical()
- is.character()

attr() sets an individual attribute to an object. It can also be used to retrieve an individual attribute of an object.

structure() sets multiple attributes of an object in one function call.

attributes() retrieves all attributes that have been set on an object.
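
The three attribute functions can be sketched together as follows. The object name v and the custom attribute note are invented for this example.

```r
# structure() sets several attributes in a single call
v <- structure(1:6, dim = c(2, 3), note = "toy example")

attributes(v)            # list with both attributes: dim and note
attr(v, "note")          # retrieve a single attribute

attr(v, "note") <- NULL  # assigning NULL removes an attribute
names(attributes(v))     # only dim is left
```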

---
**Double**

Double covers decimals, scientific, and hexadecimal numbers. The special values Inf (infinity), -Inf (negative infinity), and NaN (Not a Number) can be added in double atomic vectors.

**Integer**

Integer vectors only represent whole numbers. When a number is typed directly, the trailing L is a must; otherwise the number is stored as a double.
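
A small sketch of the difference between doubles and integers, using made-up values:

```r
# Decimal, scientific and hexadecimal notation all produce doubles
doubles <- c(0.5, 1e3, 0xFF, Inf, -Inf, NaN)
typeof(doubles)
is.finite(doubles)   # Inf, -Inf and NaN are not finite

typeof(5)            # without the L suffix the value is a double
typeof(5L)           # with the L suffix the value is an integer
typeof(1:5)          # the : operator creates integers

5 == 5L              # equal in value
identical(5, 5L)     # but not identical, because the types differ
```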

**Logical**

Logical vectors have TRUE or FALSE entries. The abbreviated forms, T and F, are accepted too. Most mathematical functions work on logical vectors, coercing TRUE to 1 and FALSE to 0 before applying that function.

Although the logical constants TRUE and FALSE are reserved keywords, T and F are not reserved, and can be reassigned to other values. When T and F are not explicitly defined in the workspace, R interprets them as logical values equivalent to TRUE and FALSE. However, if you assign a different value to T or F, their behaviour changes accordingly.
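
The risk of relying on T and F can be sketched as follows. This is an illustration only; the variable names before, during and after are invented.

```r
before <- isTRUE(T)   # T behaves like TRUE while it is not redefined

T <- 0                # legal, because T is an ordinary name
during <- isTRUE(T)   # T no longer means TRUE

rm(T)                 # remove the user-defined binding
after <- isTRUE(T)    # base R's T is visible again

c(before = before, during = during, after = after)

# TRUE <- 0           # would fail: TRUE is a reserved word
```

For this reason, spelling out TRUE and FALSE is the safer habit.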

**Character**

A string is a sequence of characters surrounded by either double quotes or single quotes. A vector of one or more strings is called a character vector.

When special characters such as \t (tab) and \n (newline) are used, the writeLines() function renders them, whereas print() shows the escape sequences as typed. Backslash-escaping is required when double quotes are used inside a double-quoted string.

The nchar() function counts the characters in each element of a character vector. Note that white spaces are also counted as characters.



In [None]:
# Double Vector
double_vector <- c(7, 10.0, 19.025)
double_vector # Returns the values of the vector
typeof(double_vector) # Returns the type of the vector
is.double(double_vector) # Check whether the vector is of double type
is.integer(double_vector) # Check whether the vector is of integer type

# Set the names attributes to the object double_vector
attr(double_vector, 'names') <- c('a', 'b', 'c')
double_vector # Returns names and values
attr(double_vector, 'names') # Return names attributes from double_vector object


In [None]:
# Integer Vector
integer_vector <- c(0L, 7L, 11L)
integer_vector # Returns the values of the vector
typeof(integer_vector) # Returns the type of the vector
is.integer(integer_vector) # Returns whether the vector is of integer type

In [None]:
# Logical Vector
logical_vector <- c(TRUE, FALSE, TRUE)
typeof(logical_vector) # Returns the type of the vector
is.logical(logical_vector) # Returns whether the vector is of logical type
which(logical_vector) # Returns indices of TRUE
abs(logical_vector) # Returns TRUE as 1 and FALSE as 0
sum(logical_vector) # Adds up all TRUE (1) values
mean(logical_vector) # Returns the average of the vector

In [None]:
# Character Vector
character_vector <- c('this is', 'a', 'section of a book')
typeof(character_vector)
is.character(character_vector)
nchar(character_vector)

character_vector2 <- c("This book\t discussed \n\"Time Series Forecasting\".")
print(character_vector2)
writeLines(character_vector2)
nchar(character_vector2)

[1] "This book\t discussed \n\"Time Series Forecasting\"."
This book	 discussed 
"Time Series Forecasting".


# **Lists**

Lists are generic vectors in R. In lists, each element can be of any data type. This is the main distinction compared to an atomic vector, in which all elements must be of the same type.

A list is created using the list() function.

The is.list() function checks whether an object is a list or not.

A common practice is to first create an empty list and then add items to it sequentially. This is a useful method to store multiple outputs from a function.
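
As a sketch of that pattern, the loop below stores one result per iteration in a named list element. The names results, n and the summary computed are illustrative choices.

```r
results <- list()                 # start with an empty list

for (n in c(5, 10, 20)) {
  x <- seq_len(n)
  # store each output under a descriptive name
  results[[paste0("n_", n)]] <- c(mean = mean(x), max = max(x))
}

str(results)                      # inspect what was collected
results$n_10["mean"]              # access one stored value
```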

The str() function displays the internal structure of an R object, and is used to inspect the structure of a list, especially if the list is a nested one.

The glimpse() function from the dplyr package also shows as much of the internal data as possible.

---
**Matrix**

A matrix represents vectors in a two-dimensional data structure. The dimension (dim) attribute transforms a vector to a matrix by passing a vector of size 2, specifying the number of rows and columns, respectively.

By default, the dim function fills the matrix column-wise using the elements from a vector.

The easiest way to create a matrix is using the matrix() function along with specifying nrow and ncol options.
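
The fill order can be seen by comparing the default with byrow = TRUE, an argument of matrix() that fills row by row instead. The object names are illustrative.

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: filled column by column
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled row by row

m_col
m_row

m_col[2, 1]   # second value of the input vector
m_row[2, 1]   # fourth value of the input vector
```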

The row names and column names can be specified either using rownames() and colnames(), or by passing a list to the dimnames() function.

To create an identity matrix, you can use the shorthand diag().

---
**Array**

An array is a multidimensional data structure and a more generalized version of a matrix. A three-dimensional array can be created by passing a vector specifying dimensions inside the array() function.

If you have an array object and want to know its dimensions, use dim(array).



In [None]:
# Creating a list of different data types
list1 <- list(double_vector, integer_vector, logical_vector, character_vector)
list1

In [None]:
# Create an empty list and assign list items sequentially
empty_list <- list()
empty_list[[1]] <- 1:5
empty_list[[2]] <- c(letters[1:4])
empty_list

In [None]:
# Display internal structure of an R object
str(list1)

List of 4
 $ : Named num [1:3] 7 10 19
  ..- attr(*, "names")= chr [1:3] "a" "b" "c"
 $ : int [1:3] 0 7 11
 $ : logi [1:3] TRUE FALSE TRUE
 $ : chr [1:3] "this is" "a" "section of a book"


In [None]:
install.packages('dplyr')
library(dplyr)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [None]:
# Display internal structure of an R object
glimpse(list1)

List of 4
 $ : Named num [1:3] 7 10 19
  ..- attr(*, "names")= chr [1:3] "a" "b" "c"
 $ : int [1:3] 0 7 11
 $ : logi [1:3] TRUE FALSE TRUE
 $ : chr [1:3] "this is" "a" "section of a book"


In [None]:
list2 <- list(list1, list(seq(0,20,5), c(TRUE, TRUE, FALSE)))
glimpse(list2)

List of 2
 $ :List of 4
  ..$ : Named num [1:3] 7 10 19
  .. ..- attr(*, "names")= chr [1:3] "a" "b" "c"
  ..$ : int [1:3] 0 7 11
  ..$ : logi [1:3] TRUE FALSE TRUE
  ..$ : chr [1:3] "this is" "a" "section of a book"
 $ :List of 2
  ..$ : num [1:5] 0 5 10 15 20
  ..$ : logi [1:3] TRUE TRUE FALSE


In [None]:
# Transform a vector to a matrix
a <- 1:9
dim(a) <- c(3,3) # Specify a matrix of 3x3
a

attributes(a) # Returns the attribute, i.e. the dimension
is.matrix(a) # Returns whether the object is a matrix

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


In [None]:
# Using matrix() function along with specifying nrow and ncol options

a_matrix <- matrix(c(rep(TRUE,3), rep(TRUE,3), rep(FALSE,3)), nrow=3, ncol=3)
a_matrix

attributes(a_matrix)

     [,1] [,2]  [,3]
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE FALSE


In [None]:
rownames(a) <- c('a', 'b', 'c')
colnames(a) <- paste0('col_', 1:3)
a

dimnames(a_matrix) <- list(paste0('row_', 1:3), paste0('col_', 1:3))
a_matrix

  col_1 col_2 col_3
a     1     4     7
b     2     5     8
c     3     6     9


      col_1 col_2 col_3
row_1  TRUE  TRUE FALSE
row_2  TRUE  TRUE FALSE
row_3  TRUE  TRUE FALSE


In [None]:
# diag() to create an identity matrix
diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1


In [None]:
# Create a 3-dimensional array by passing a vector specifying dimensions

an_array <- array(1:18, c(3,3,2))
an_array
is.array(an_array)
dim(an_array)

# **Coercion**

RECALL: The rule for a vector object in R is that all elements must be of the same type.

Coercion occurs when different data types are mixed in one vector. When atomic vectors of different types are combined, R coerces them implicitly to the most flexible type present, following a fixed order of precedence:

character > double > integer > logical

---
**Explicit Coercion**

Explicit coercion involves coercing one type of vector into another type. The as.character(), as.double(), as.integer() and as.logical() functions are used for explicit coercion.

When an element cannot be coerced to the requested type, the result is NA and R issues a warning.
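
The following sketch walks through implicit and explicit coercion with made-up values. suppressWarnings() is used only to keep the example quiet; without it, the last conversion prints the "NAs introduced by coercion" warning.

```r
# Implicit coercion: the most flexible type wins
typeof(c(TRUE, 2L))      # logical + integer   -> integer
typeof(c(2L, 3.5))       # integer + double    -> double
typeof(c(3.5, "a"))      # double  + character -> character

# Explicit coercion
as.integer(3.9)                                   # the decimal part is dropped
flags <- as.logical(c("TRUE", "false", "T", "yes"))
flags                                             # "yes" is not recognised and becomes NA

nums <- suppressWarnings(as.integer(c("1", "2", "three")))
nums                                              # "three" cannot be coerced and becomes NA
```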


In [None]:
# Proof that integer overwrites logical
mixed_vector1 <- c(1L, TRUE, FALSE, 5L)
typeof(mixed_vector1)

# Proof that character overwrites double
mixed_vector2 <- c("Pen", 5.30, TRUE)
typeof(mixed_vector2)

In [None]:
# Explicit Coercion
as.double(c(TRUE, FALSE, FALSE))
as.double(c(TRUE, FALSE, "Paper", 1.5))

“NAs introduced by coercion”


# **S3 Atomic Vectors**

An S3 object allows for flexible and extensible OOP in R. S3 is used to create custom data structures that are tailored to specific needs.

In S3, an object must have a class attribute that describes the type of data stored in that object. The class attribute is set and retrieved by the class() function.

In time series analysis, we need some commonly used S3 objects that are based on base atomic vectors. Factors, dates and date-times are three common S3 atomic vectors.

---
**Factors**

A factor is internally an integer atomic vector with two attributes: class and levels. It stores categorical data and contains predefined values.

Created via the factor() function.

Purposes:
- Evaluates how many categories there are
- Remembers which category a data point belongs to
- Produces a vector of strings associated with the names of each of the categories

An ordered factor has an internal hierarchy in its levels and is created using either the ordered() function or factor() with ordered = TRUE. More descriptive level names, for example for plotting, can be supplied via the labels argument of either function.
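
A short sketch of an ordered factor with relabelled levels, using invented data (sizes):

```r
sizes <- factor(c("s", "l", "m", "s"),
                levels  = c("s", "m", "l"),                # defines the order
                labels  = c("Small", "Medium", "Large"),   # descriptive names
                ordered = TRUE)

sizes
levels(sizes)        # the relabelled categories
as.integer(sizes)    # the underlying integer codes
sizes < "Large"      # comparisons are meaningful for ordered factors
table(sizes)         # how many observations fall in each category
```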

cut() function can be used to convert a numeric vector to a factor.

To validate that a factor is actually an S3 object, you can use the otype() function from the sloop package.

---
**Dates**

In R, a date is stored as the number of days since the epoch, which is January 1, 1970 at 00:00:00 UTC (Coordinated Universal Time). Using this double vector as a base and assigning the class attribute on top, we get the S3 Date vector.

as.Date() function coerces a value to an S3 date object

Epoch is a point in time and serves as a reference for measuring time in a computer software system. You can easily calculate how many days have passed since R's epoch using unclass() on Sys.Date(), or in general by stripping the class attribute from any date object.

unclass(Sys.Date())
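
The day-count representation can be checked with a few fixed dates. This is a sketch with arbitrary example dates; the format string shows how a non-ISO input can be parsed.

```r
# Parse a day/month/year string by describing its layout
d <- as.Date("15/03/2023", format = "%d/%m/%Y")
class(d)
typeof(d)

unclass(as.Date("1970-01-10"))           # nine days after the epoch
format(d + 30)                           # adding a number adds days

# Going the other way: a day count plus an origin gives a date
as.Date(19358, origin = "1970-01-01")
```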

---
**Date-times**

The POSIXt class is based on the POSIX (Portable Operating System Interface) standard for representing date and time values, which is used by many operating systems and programming languages.

POSIXt is a class of objects that store date and time information with a higher degree of precision, including year, month, day, hour, minute, second and time zone.

POSIXt objects in R are stored as the number of seconds since the epoch, with fractional seconds stored as a decimal fraction of a second.

You can create an S3 date-time object using the as.POSIXct() or as.POSIXlt() functions, which convert character strings or numeric values to POSIXt objects. Internally, a POSIXct object is a double atomic vector, whereas a POSIXlt object is a list of date-time components. The ct stands for calendar time, and lt stands for local time.

The tz argument in the as.POSIXct() function specifies the time zone to use for the object. One can calculate the difference between two dates or date-time objects, which becomes the duration data type.
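
To make the two representations concrete, the sketch below builds the same instant both ways. The timestamp is arbitrary and UTC is chosen so that the numbers do not depend on the local time zone.

```r
ct <- as.POSIXct("2023-01-01 22:41:00", tz = "UTC")
lt <- as.POSIXlt(ct)

typeof(ct)        # a double: seconds since the epoch
typeof(lt)        # a list of date-time components
as.numeric(ct)    # the raw number of seconds

lt$hour           # components can be extracted directly
lt$year + 1900    # years are stored relative to 1900

# The same instant displayed in another time zone
format(ct, tz = "Australia/Melbourne", usetz = TRUE)
```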

---
**Duration**

The difference between two dates or date-times is stored in the difftime object. The base object type is double.

Along with the 'difftime' class attribute, a duration vector has a units attribute, which determines the unit of the date/time difference.
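
A brief sketch of how the units attribute can be read and changed, using two arbitrary dates and an invented object name (gap):

```r
gap <- difftime(as.Date("2023-01-15"), as.Date("2023-01-01"), units = "days")

units(gap)                         # the unit currently in use
as.numeric(gap)                    # the underlying double

units(gap) <- "weeks"              # switch the unit; the value is rescaled
gap
as.numeric(gap)

as.numeric(gap, units = "hours")   # convert on the fly without changing gap
```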



In [4]:
# Factor Vector
factor_vector <- factor(rep(c('a','b','c'), times=2))
factor_vector
attributes(factor_vector)

In [6]:
# Ordered factor vector
customer_review <- ordered(c('bad', 'good', 'good', 'fair'), levels = c('bad', 'fair', 'good'))
customer_review

In [7]:
# Cut function to convert a numeric vector to a factor
score <- c(0, 10, 2, 6, 8, 8, 1, 9)
factor_score <- cut(score,
                     breaks=c(-Inf, 7, 9, Inf),
                     labels=c('Detractor', 'Neutral', 'Promoter'),
                     include.lowest=TRUE) # the argument is include.lowest (with a dot)

factor_score

In [10]:
install.packages('sloop')
library(sloop)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [11]:
# Validate that a factor is actually an S3 object
otype(factor_score)

In [12]:
# Date Vector
date_vector <- as.Date('2023-01-01') # Converts the string character to an S3 date object
typeof(date_vector)
attributes(date_vector)
otype(date_vector)

In [13]:
# Returns how many days have passed since R's epoch (Jan 01, 1970)
unclass(Sys.Date())

In [16]:
# Datetime Vector
datetime_vector <- as.POSIXct("2023-01-01 22:41:00", tz="Australia/Melbourne") # Creates S3 datetime object with specified timezone
datetime_vector
attributes(datetime_vector)
as.POSIXct(8^10, origin="1970-01-01")

[1] "2023-01-01 22:41:00 AEDT"

[1] "2004-01-10 13:37:04 UTC"

In [None]:
# Show the timezones available
OlsonNames(tzdir=NULL)

In [19]:
# Duration
duration <- as.Date("2022-10-30") - as.Date("2019-03-16")
duration
typeof(duration)
attributes(duration)

duration2 <- as.difftime(14, units="weeks")
duration2

Time difference of 1324 days

Time difference of 14 weeks

# **S3 Lists**

S3 lists are built on top of regular R lists, but they have specific structures. Their behaviour is defined by the set of methods or functions that can be applied to them. The two most important S3 lists are data frames and tibbles.

---
**Data Frames**

A data frame represents data in a tabular manner, where each column represents a variable and each row represents an observation. Within a column, all values must be of the same data type, but different columns can have different data types.

A data frame is built on a regular list object and is distinguished by three attributes:
- names: column names
- class: "data.frame"
- row.names: row names

data.frame() function
- creates a data frame
- can recycle shorter column vectors, but only when the length of the longest column is an integer multiple of their length

NOTE: Although it does not require you to specify the dimension structure explicitly, each column vector should be of equal length. rep() and c() can be used repeatedly to maintain this constraint of data frame structure. Otherwise, an error will be returned.
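
The recycling rule and the resulting error can be sketched as follows. tryCatch() is used here only so that the example keeps running after the error; the column names are invented.

```r
# Works: 6 is an integer multiple of 2, so the shorter column is recycled
ok <- data.frame(id = 1:6, group = c("a", "b"))
ok

# Fails: 6 is not an integer multiple of 4
msg <- tryCatch(
  data.frame(id = 1:6, score = 1:4),
  error = function(e) conditionMessage(e)   # capture the error message instead of stopping
)
msg
```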

Data frames automatically transform non-syntactic column names, both when a data frame is created inside R and when data is loaded from outside R.
This behaviour can be overridden with check.names = FALSE; the non-syntactic column names are then typed with backticks around them.

A syntactic name in R must:
- Start with letter or .
- contain only letters, numbers, _ and .
- not contain spaces
- not be a reserved keyword

The stringsAsFactors argument controls the coercion of character vectors to factors. By default, it is FALSE (since R 4.0.0).

is.data.frame() function checks whether an object is a data frame.

as.data.frame() coerces other classes of objects - character, list, matrix, or array - into a data frame.

row.names() allows custom row names to be allocated to data frames.
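
These helper functions can be sketched together on a small made-up matrix and list:

```r
# Coerce a matrix into a data frame
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y")))
df_from_matrix <- as.data.frame(m)
is.data.frame(df_from_matrix)

# Allocate custom row names, then use them for subsetting
row.names(df_from_matrix) <- c("r1", "r2", "r3")
df_from_matrix["r2", "y"]

# A list of equal-length vectors is coerced column by column
df_from_list <- as.data.frame(list(id = 1:2, label = c("a", "b")))
df_from_list
```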

NOTE: Row names are not easy to manipulate in a data frame, especially when the information they hold is required in analysis or modelling. Tibbles help with this.

---
**Tibbles**

A tibble is a modernized S3 object for storing rectangular data, built on top of regular R lists. It shares some properties of data frames while improving on some of their limitations.

- Improved printing method. Optimized for printing only a few columns and rows that fit on one screen. The dimensions and data types of each column are printed along with the summary of the remaining rows and columns. This makes it easier to inspect and understand the data without using any functions at all. Colors for positive, negative and missing values highlight the important features of the data at first glance. This improved printing method is beneficial for large rectangular data with many rows and many columns.

- Lazy but effective. Tibbles do less than data frames: they do not change variable names or types, and they do not do partial matching of column names. They also complain more, so problems surface earlier.

- Tibble never coerces input. Unlike data frames, tibbles do not coerce column types, so there is no need for the stringsAsFactors argument.

- Refer to variables during construction. Tibble allows you to refer to a variable in the computation of subsequent columns during construction.

- No messy row name manipulation. Tibbles do not support row names the way data frames do, as tibble considers row names to be metadata that should be stored in the same way as the rest of the data. This design avoids complicated string operations on row names during re-sampling or bootstrapping. Use the rownames argument in as_tibble() or tibble::rownames_to_column() to move the row names of a data frame into a column.
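
As a sketch of the row-name handling, assuming the tibble package is installed, the example below moves the row names of a small invented data frame into a column called person (a name chosen for illustration). The requireNamespace() guard simply skips the example when the package is missing.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  # A toy data frame that keeps information in its row names
  df <- data.frame(score = c(10, 7, 9), row.names = c("ann", "bob", "cy"))

  # Option 1: convert to a tibble and keep the row names as a column
  tb1 <- tibble::as_tibble(df, rownames = "person")
  print(tb1)

  # Option 2: add the row names as the first column, then convert if needed
  tb2 <- tibble::rownames_to_column(df, var = "person")
  print(tb2)
}
```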

- Easy operation with list columns. Both dataframes and tibbles can have a column that is actually a list. However, it is easy to manipulate list columns in a tibble. A list column can be anything from a simple vector to complex model output objects. The list column could itself be a tibble.

In [22]:
# Data Frames
dataframe <- data.frame(country = c(rep("USA", 2), rep("AUS", 2)),
                        date = as.Date(rep(c("2000-01-01", "2023-01-01"), times=2)),
                        population_mil = c(282.2, 334.3, 19.03, 26.6))
dataframe
typeof(dataframe)
attributes(dataframe)

country,date,population_mil
<chr>,<date>,<dbl>
USA,2000-01-01,282.2
USA,2023-01-01,334.3
AUS,2000-01-01,19.03
AUS,2023-01-01,26.6


In [23]:
# Recycling shorter column vectors in a Data Frame
data.frame(id = paste0("00", c(1:6)),
          score = c(10:12))

id,score
<chr>,<int>
001,10
002,11
003,12
004,10
005,11
006,12


In [29]:
# Data frames transform non-syntactic names automatically
dataframe2 <- data.frame(`1` = c("rock", "paper", "scissors"),
                        `2` = c(5,3,2))
names(dataframe2)

# check.names = FALSE prevents data.frame() from transforming non-syntactic names
dataframe3 <- data.frame(`1` = c("rock", "paper", "scissors"),
                        `2` = c(5,3,2),
                        check.names = FALSE)
names(dataframe3)

# stringsAsFactors = TRUE coerces character columns to factors
dataframe4 <- data.frame(option = c("rock", "paper", "scissors"),
                        n = c(5,3,2),
                        stringsAsFactors = TRUE)
attributes(dataframe4$option)

In [33]:
install.packages("tibble")
library(tibble)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [36]:
tibble1 <- tibble(country = c(rep("USA", 2), rep("AUS", 2)),
                  date = as.Date(rep(c("2000-01-01", "2023-01-01"), times=2)),
                  population_mil = c(282.2, 334.3, 19.03, 26.6))
tibble1
typeof(tibble1)
attributes(tibble1)

country,date,population_mil
<chr>,<date>,<dbl>
USA,2000-01-01,282.2
USA,2023-01-01,334.3
AUS,2000-01-01,19.03
AUS,2023-01-01,26.6


In [41]:
# Tibble allows reference to variables in computation of subsequent columns
tibble2 <- tibble(x = 0.4, y = x^2, z = 2*y)
tibble2

x,y,z
<dbl>,<dbl>,<dbl>
0.4,0.16,0.32


# **Missing Values**

In general, R represents missing values with NA (Not Available), although it has data-type-specific variants: NA_integer_ for integer, NA_real_ for double, and NA_character_ for character.

NOTE: The string "NA" is not the same as NA. The is.na() function checks the element-wise existence of missing values, and anyNA() checks for missing values over the entire object.

When checking missing values in a list, the is.na() should be applied on each element using the proper member from the apply family of functions.

You may encounter NaN (Not a Number), which is a byproduct of mathematical operations rather than an indication of missingness in the data. The is.nan() function checks whether a numeric value is NaN.
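
The distinctions described above can be sketched with a few made-up values. sapply() is used here as one suitable member of the apply family for lists.

```r
# NA variants differ only in their type
typeof(NA)             # plain NA is logical
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# The string "NA" is ordinary text, not a missing value
is.na(c("NA", NA))

# Checking each element of a list
lst <- list(a = c(1, NA), b = c("x", "y"))
na_per_element <- sapply(lst, anyNA)
na_per_element

# NaN counts as missing for is.na(), but NA is not NaN
is.na(c(1, NaN, NA))
is.nan(c(1, NaN, NA))
```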

When NULL is used as the default value of a function argument, it acts as a placeholder: by default the function computes the value from its other inputs, but it also accepts a user-supplied value.
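
A minimal sketch of a NULL default, using a hypothetical function center_values() that is not part of any package:

```r
# `center = NULL` means: compute the centre from the data unless one is supplied
center_values <- function(x, center = NULL) {
  if (is.null(center)) {
    center <- mean(x, na.rm = TRUE)   # fall back to a value derived from x
  }
  x - center
}

center_values(c(1, 2, 3))               # centre computed from the data
center_values(c(1, 2, 3), center = 1)   # user-supplied centre
```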

In [42]:
vector1 <- c(2, NA, 1, NA, 0)
is.na(vector1)
anyNA(vector1)

# **Time-series specific objects**

Having a date-time variable along with a numeric vector does not by itself make the observed data a time series. Specific information is required to treat a numeric vector as a time series, and this information can be derived from the date-time variable.

Time-series data is indexed or ordered by time. A time series object is a specific data structure designed to store and analyze time series data. These time series objects provide specialized methods and functions for analyzing, manipulating, and visualizing time series data. They often come with built-in support for time-based indexing, sub-setting and applying time-specific operations.

To create a time series object, you typically need a vector or matrix of data values along with an associated time index. Common time series objects are ts, zoo and xts.
- ts object is particularly useful for time-dependent analyses, forecasting, and modelling.
- tsibble object extends the principles of the tidyverse, making it easier to manipulate, analyze and visualize time series data in a consistent and intuitive manner. tsibble represents temporal data in a rectangular form and provides a data-and-model oriented framework.

---
**ts**

The ts object is an important time series object in R. It is used to represent observations at regular intervals over time. The ts object stores data values along with information about the time index, such as the start time, frequency, and number of observations.
- constructed using the ts() function

Additionally, it has built-in support to handle seasonal data, where a pattern repeats over a fixed interval of time. Almost all methods in the forecast package expect a ts object as input. Other packages, such as stats and TSA, offer functionality for advanced modelling and diagnostics using the ts object as well.

- is.ts() for checking
- as.ts() for coercion
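
The time index stored in a ts object can be inspected and used for sub-setting with base R functions. The sketch below uses a toy monthly series of the numbers 1 to 24 so that the results are easy to follow.

```r
# Two years of monthly data starting in January 2020
x <- ts(1:24, start = c(2020, 1), frequency = 12)

frequency(x)   # observations per year
start(x)       # first time point (year, period)
end(x)         # last time point (year, period)

# Time-based sub-setting: the first quarter of 2021
q1_2021 <- window(x, start = c(2021, 1), end = c(2021, 3))
q1_2021

# Coercing a plain numeric vector gives a ts with a default index
is.ts(as.ts(c(3, 1, 4)))
```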

---
**tsibble**

The tsibble object emphasizes the organization of temporal data into rectangular data frames with variables as columns and observations as rows.

tbl_ts is the class attribute introduced for tsibble, which makes a tsibble an S3 object. A tsibble can handle both regular and irregular time series data, providing flexibility for handling various types of time series data under the same framework.

You can create a tsibble object by passing either a vector, a data frame or a tibble along with a time index variable.

as_tsibble() coerces other objects into a tsibble.

Some key benefits and features of tsibble include time-aware data manipulation and time-based indexing, and seamless integration with other tidyverse packages.

tsibble integrates well with the fable package, which provides a unified framework for time series forecasting in R.





In [47]:
# Creating a ts object
set.seed(100)
ts1 <- ts(rnorm(36,25,7), start = 2020, frequency = 12) # start is given as a number
ts1
otype(ts1)
typeof(ts1)
attributes(ts1)
is.ts(ts1) # Checking whether it is a ts object

,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
2020,21.48465,25.92072,24.44758,31.20749,25.8188,27.23041,20.92747,30.00173,19.22318,22.48097,25.6292,25.67392
2021,23.58856,30.17888,25.86366,24.79478,22.27802,28.57599,18.6033,41.17208,21.93337,30.34842,26.83373,30.41383
2022,19.29935,21.93085,19.95845,26.61661,16.89589,26.72953,24.36221,37.30163,24.03449,24.22165,20.1699,23.44744


In [50]:
install.packages("tsibble")
library(tsibble)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr


Attaching package: ‘tsibble’


The following objects are masked from ‘package:base’:

    intersect, setdiff, union




In [52]:
# Creating a tsibble object
set.seed(101)
tsibble1 <- tsibble(
  year = 2000:2023,
  sales = rnorm(24, 500, 80),
  index = year
)

tsibble1 |> head() # inspect the object that was created, not the tsibble() function

# **Functions in R**

Functions are an integral part of interacting with any programming language. The inherent flexibility of R allows users to write their own functions alongside the functions from existing packages.

A function is generally built to perform one of four main tasks: comparison, evaluation of a condition, conditional execution, or iteration. A function is simple when it performs only one type of task, and it can become complex when multiple tasks of several types are combined.

In [None]:
my_func <- function(arg1, arg2) {
  # body: code that uses arg1 and arg2
  # ...
  # return output
}
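
As one possible way to fill in such a template, the sketch below defines a hypothetical function rescale() with two default arguments. The function name and its behaviour are chosen purely for illustration.

```r
# Rescale a numeric vector to the interval [lower, upper]
rescale <- function(x, lower = 0, upper = 1) {
  stopifnot(is.numeric(x))                      # check the input type
  rng <- range(x, na.rm = TRUE)                 # smallest and largest value
  scaled <- (x - rng[1]) / (rng[2] - rng[1])    # map to [0, 1]
  lower + scaled * (upper - lower)              # last expression is the return value
}

rescale(c(2, 4, 6))                             # uses the default arguments
rescale(c(2, 4, 6), lower = 10, upper = 20)     # overrides the defaults
```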

# **Comparisons**

For comparison, the full set of operators is >, >=, <, <=, != and ==. They are optimized for vector operations: a comparison returns a logical vector, and using that logical vector to subset an object returns the matching values.

filter() from the dplyr package allows you to subset a large dataset using comparison operations.

In [56]:
a <- c(1,3,5)
b <- c(1,6,10)
a[a < 3]
a >= b

In [57]:
install.packages("dplyr")
library(dplyr)





In [58]:
install.packages("gapminder")
library(gapminder)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [59]:
# Subset a large dataset using filter() and comparison operators
gapminder |> filter(year==1992)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1992,41.674,16317921,649.3414
Albania,Europe,1992,71.581,3326498,2497.4379
Algeria,Africa,1992,67.744,26298373,5023.2166
Angola,Africa,1992,40.647,8735988,2627.8457
Argentina,Americas,1992,71.868,33958947,9308.4187
Australia,Oceania,1992,77.560,17481977,23424.7668
Austria,Europe,1992,76.040,7914969,27042.0187
Bahrain,Asia,1992,72.601,529491,19035.5792
Bangladesh,Asia,1992,56.018,113704579,837.8102
Belgium,Europe,1992,76.460,10045622,25575.5707


# **Conditions**

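Conditions evaluate logical expressions and return TRUE or FALSE. They can be combined with the following operators:
- | means vectorized OR, compares each element of two logical vectors, and returns a logical vector of the same length
- & means vectorized AND, compares each element of two logical vectors, and returns a logical vector of the same length
- || means short-circuit OR, expects a single logical value on each side, evaluates the right-hand side only when needed, and returns a single TRUE/FALSE
- && means short-circuit AND, expects a single logical value on each side, evaluates the right-hand side only when needed, and returns a single TRUE/FALSE

Older versions of R used only the first element of each vector with || and &&; since R 4.3.0, passing a vector longer than one is an error.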
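
The difference between the vectorized and the short-circuit operators can be sketched with two small logical vectors (x and y are made up). The stop() calls are there only to show that the right-hand side is never evaluated.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

either <- x | y    # element-wise OR
both   <- x & y    # element-wise AND
either
both

# Short-circuit operators work on single values and stop as soon as
# the result is known, so the right-hand side below never runs
short_or  <- TRUE  || stop("never evaluated")
short_and <- FALSE && stop("never evaluated")
short_or
short_and
```

The vectorized forms suit filtering, while the short-circuit forms suit the single condition of an if statement.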

---
**Conditional Execution**

In conditional execution, a part of the code is executed after a logical expression is evaluated. There are multiple ways to use conditional execution in R in various scenarios:
- if statements: Execute a block of code only when the condition is TRUE
- if-else statements: Execute one block of code when the condition is TRUE and a different block when it is FALSE
- switch() function: Provides a way to choose from multiple options. It is recommended to use the function with character input only
- case_when() function: Used for conditional re-coding, or creating new variables based on multiple conditions. Provides a flexible and readable way to handle complex conditional logic. The function takes multiple condition ~ value pairs separated by commas, where each condition is followed by the value to assign if the condition is TRUE. The conditions are evaluated in the order specified, and the first condition that is TRUE determines the value that is assigned.

These statements decide to run part of the code based on whether the conditions are TRUE or FALSE
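
As a minimal runnable sketch of an if / else if / else chain, the function below sorts a single score into three categories. The function name classify_score() and the thresholds are hypothetical and chosen only for illustration.

```r
# classify_score() is a hypothetical helper used only for illustration
classify_score <- function(score) {
  if (score < 3) {
    "Low"            # runs when the first condition is TRUE
  } else if (score <= 4) {
    "Medium"         # runs when the first is FALSE and the second is TRUE
  } else {
    "High"           # runs when all previous conditions are FALSE
  }
}

print(classify_score(2))
print(classify_score(4))
print(classify_score(6))
```

Note that if expects a single TRUE/FALSE, so this function handles one score at a time; for a whole vector of scores, a vectorized tool such as case_when() is the better fit.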


In [63]:
# Using | and &
gapminder |> filter((lifeExp < 40) | (lifeExp >= 80))
gapminder |> filter((lifeExp < 40) & (lifeExp >= 80))


country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.8530
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.020,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134
Afghanistan,Asia,1982,39.854,12881816,978.0114
Angola,Africa,1952,30.015,4232095,3520.6103
Angola,Africa,1957,31.999,4561361,3827.9405
Angola,Africa,1962,34.000,4826015,4269.2767


country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>


In [None]:
# if statement
if (condition) {
  # code to execute if the condition is TRUE
}

# if-else statement
if (condition) {
  # code to execute if the condition is TRUE
} else {
  # code to execute if the condition is FALSE
}

# Multiple if-else statements
if (condition) {
  # code to execute if the condition is TRUE
} else if (another_condition) {
  # code to execute if the first condition is FALSE and another_condition is TRUE
} else {
  # code to execute if all previous conditions are FALSE
}

In [66]:
choose_fruit <- function(x){
  switch(x,
       apple = print("Selected fruit: Apple"),
       banana = print("Selected fruit: Banana"),
       orange = print("Selected fruit: Orange"),
       stop("Unknown fruit in the basket")
  )
}

choose_fruit("apple")
choose_fruit("strawberry")

[1] "Selected fruit: Apple"


ERROR: Error in choose_fruit("strawberry"): Unknown fruit in the basket


In [65]:
library(dplyr)
tibble7 <- tibble(score=1:6,
                  category = case_when(
                    score<3 ~ "Low",
                    score %in% c(3,4) ~ "Medium",
                    TRUE ~ "High"
                  ))
tibble7

score,category
<int>,<chr>
1,Low
2,Low
3,Medium
4,Medium
5,High
6,High


# **Iterations**

Loops are used to perform repetitive tasks. You need to tell R which code section to repeat over and over, how many times to run the loop, and when to stop repeating.

- A while loop repeats a block of any number of expressions as long as its condition is TRUE. To avoid an accidental infinite loop, make sure the condition eventually becomes FALSE, or terminate the loop explicitly: break exits the loop normally, whereas stop() aborts it with an error.

- A for loop allows you to iterate over a sequence of values and execute a block of code for each iteration. Use a for loop when you know exactly how many times the loop iterates. A for loop can have one or many expressions to execute, and the output can be printed on the console or saved in an object. The latter is more common when a for loop is used inside a function.

- Iteration is a common task performed inside functions. How you write the iteration depends on the data structure. For example, iterating over the elements of an array differs from iterating over the rows of a data frame.
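
A for loop that saves its output in an object, wrapped in a function, might look like the sketch below. The function name square_all() is hypothetical; the input 10:13 is just example data.

```r
# square_all() is a hypothetical helper used only for illustration
square_all <- function(x) {
  out <- numeric(length(x))    # pre-allocate the result vector
  for (i in seq_along(x)) {    # seq_along() is also safe for empty input
    out[i] <- x[i]^2           # save each result instead of printing it
  }
  out
}

squares <- square_all(10:13)
print(squares)   # 100 121 144 169
```

Pre-allocating the result and filling it by index avoids growing the vector on every iteration.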

R offers two broad families of functions that generalize the iteration task to many data and object types: the apply family in base R and the map family in the purrr package.

---
**The apply family**

apply(X, MARGIN, FUN, ...): Applies a function to the margins (rows or columns) of an array or matrix; a data frame is first coerced to a matrix. Returns a vector, array or list.
- X specifies the data structure
- MARGIN specifies the dimension over which to apply the function, 1 for rows and 2 for columns
- FUN specifies the function to be applied
- ... represents additional arguments passed to FUN
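
A small sketch of apply() on a made-up 2 x 3 matrix shows the effect of MARGIN and of passing an extra argument through ...:

```r
m <- matrix(1:6, nrow = 2)   # filled column-wise: rows are (1, 3, 5) and (2, 4, 6)

row_sums  <- apply(m, 1, sum)                  # one value per row: 9 12
col_max   <- apply(m, 2, max)                  # one value per column: 2 4 6
col_means <- apply(m, 2, mean, na.rm = TRUE)   # na.rm is passed on to mean()

print(row_sums)
print(col_max)
print(col_means)
```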

lapply(X, FUN, ...): Applies a function to each element of a list or vector. Returns a list of the same length as X.
- X is the list or vector
- FUN is the function to be applied
- ... represents additional arguments passed to FUN
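
As a brief illustration, lapply() below computes the mean of each element of a made-up list and always returns a list:

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

means <- lapply(scores, mean)   # one result per element, names preserved
str(means)
```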

sapply(X, FUN, ...): Applies a function to each element of a list or vector, and attempts to simplify the result. If every result has length 1, it returns a vector; if every result has the same length (>1), it returns a matrix with one column per element of X; if simplify = "array" is set and the results have consistent higher dimensions, it returns an array; and if simplification is not possible, it falls back to a list.
- X is the list or vector
- FUN is the function to be applied
- ... represents additional arguments passed to FUN

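For example, range() returns a length-2 vector containing the minimum and maximum. When range() is applied to several numeric vectors with sapply(), the length-2 results are combined into a matrix with two rows (minimum and maximum) and one column per input vector.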
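
The simplification behavior of sapply() can be sketched with a made-up list of numeric vectors: length-1 results become a vector, and the length-2 results of range() become a matrix.

```r
samples <- list(a = c(3, 1, 2), b = c(10, 30, 20), c = c(5, 7))

sizes  <- sapply(samples, length)   # length-1 results: a named vector
ranges <- sapply(samples, range)    # length-2 results: a 2 x 3 matrix

print(sizes)
print(ranges)   # row 1 holds the minima, row 2 the maxima
```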

vapply(X, FUN, FUN.VALUE, ...): Applies a function to each element of a list or vector, but requires you to specify the expected output type and length via FUN.VALUE. Returns a vector, matrix or array, and raises an error if any result does not match the template.
- X is the list or vector
- FUN is the function to apply
- FUN.VALUE is a template describing the expected output type and length
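
The sketch below, on a made-up list, shows how the template constrains the result. In base R the template argument is spelled FUN.VALUE; the tryCatch() wrapper is used here only to capture the error raised on a mismatch.

```r
samples <- list(a = c(3, 1, 2), b = c(10, 30, 20), c = c(5, 7))

# Each result must be a single double
means <- vapply(samples, mean, FUN.VALUE = numeric(1))

# Each result must be a double vector of length 2
ranges <- vapply(samples, range, FUN.VALUE = numeric(2))

# A template that does not match the actual output raises an error
mismatch <- tryCatch(
  vapply(samples, range, FUN.VALUE = numeric(1)),
  error = function(e) "error: output does not match FUN.VALUE"
)

print(means)
print(ranges)
print(mismatch)
```

Because the output type is checked, vapply() is a safer choice than sapply() inside functions where the result shape must be predictable.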

mapply(FUN, ..., MoreArgs = NULL): Applies a function to multiple vectors or lists element-wise. It iterates over corresponding elements of the given arguments and performs operations on them. This is particularly useful when you have multiple vectors or lists of the same length and need to process them in a pairwise manner.
- FUN is the function to be applied
- ... represents the vectors or lists to iterate over in parallel
- MoreArgs is an optional list of additional arguments that are passed unchanged to every call of the function
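
A minimal sketch of pairwise iteration with mapply(), using made-up vectors. The first call pairs each base with its exponent; the second passes a fixed argument through MoreArgs.

```r
bases     <- c(2, 3, 4)
exponents <- c(3, 2, 1)

# Pairs are (2, 3), (3, 2), (4, 1)
powers <- mapply(function(b, e) b^e, bases, exponents)   # 8 9 4

# sep is the same for every call, so it goes into MoreArgs
labels <- mapply(paste, c("a", "b"), c("x", "y"), MoreArgs = list(sep = "-"))

print(powers)
print(unname(labels))   # "a-x" "b-y"
```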

---
**The map family**

The map family, provided by the purrr package, offers a set of functions for applying operations, ranging from simple functions to models, to the elements of lists or vectors. These functions are designed to provide a consistent and flexible interface for iteration and data manipulation.

map()
- Input: A list or vector
- Output: A list with the same length as the input list

map_lgl()
- Input: A list or vector
- Output: A logical vector

map_int()
- Input: A list or vector
- Output: An integer vector

map_dbl()
- Input: A list or vector
- Output: A double (numeric) vector

map_chr()
- Input: A list or vector
- Output: A character vector

These functions share the syntax map(.x, .f, ...)
- .x is the input data
- .f is the function to apply. It can take any one of the following forms: an existing named function, an anonymous custom function, or a formula representation
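
The three forms of .f and the typed variants can be sketched as follows. This assumes the purrr package is installed; the list samples is made-up example data.

```r
library(purrr)   # assumes purrr is already installed

samples <- list(a = c(3, 1, 2), b = c(10, 30, 20), c = c(5, 7))

as_list   <- map(samples, mean)                               # named function, returns a list
spreads   <- map_dbl(samples, function(v) max(v) - min(v))    # anonymous function, double vector
sizes     <- map_int(samples, ~ length(.x))                   # formula form, integer vector
all_big   <- map_lgl(samples, ~ all(.x > 2))                  # logical vector
as_labels <- map_chr(samples, ~ paste(.x, collapse = "-"))    # character vector

print(spreads)
print(sizes)
print(all_big)
print(as_labels)
```

Unlike sapply(), each typed variant either returns the stated vector type or fails with an error, which makes the output type predictable.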

In [None]:
# Syntax for a while loop (template only):
# while (condition) {
#   expression1
#   expression2
# }
x <- 1
# Print and increment x in a while loop; stop() ends the loop with an error once x reaches 4
while (x <= 5) {
  # expression1
  print(x)
  # expression2
  x <- x + 1
  # Check if x is equal to 4 and terminate the loop
  if (x == 4) {
    stop("x reached 4. Loop terminated.")
  }
}

In [None]:
# Syntax for a for loop (template only):
# for (counter in sequence) {
#   expression1
#   expression2
# }
x <- 10:13
# Print the square of each element of x
for (i in seq_along(x)) {
  print(x[i]^2)
}