<a href="https://colab.research.google.com/github/alkalink1/ds1002-eju2pk/blob/main/notebooks/20-introducing-R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introducing R

<img align="right" width="200" height="200" src="https://stamsgroup.com/wp-content/uploads/2020/08/R.programming-300x300.png">

**What is R?** R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. It was created at the University of Auckland in 1991 and became an open source project in 1995.

**What are the pros and cons of R?**

There are a few advantages R has over Python in the realm of statistics and data analysis:

- Built for statistics
- Data visualization
- Additional statistical packages
- Data wrangling
- Academic focus
- Customizable

Disadvantages of R vs. Python:

- Less readable
- More difficult to learn well
- Not as beginner-friendly

- - -

## The Basics

### Design:

-   Designed to support statistical computing
-   Very strong community
-   Many domain-specific functions are built in
-   Vector first thinking
-   Everything is an object

### R Syntax

-   Syntax loosely follows traditional `C`-style
    -   **Braces** `{` and `}` are used to form blocks.
    -   **Semi-colons** are optional at the end of a statement; they are
        required only to separate several statements on the same line.
-   **Assignments** are made with `<-` or `->` (or `=`)
-   **Dots** `.` have no special meaning -- they are not operators.
-   Single and double **quotes** have the same meaning, but double
    quotes tend to be preferred.
    -   Use single quotes if you expect your string to contain double
        quotes.
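
The syntax rules above can be sketched in a few lines. This is only an illustration; every object name here (`total`, `x1`, `my.value`, and so on) is made up for the example.

```r
# Braces group statements into a block (here, the body of an if statement).
total <- 0
if (total == 0) {
  total <- total + 1
}

# Semicolons are needed only to separate statements on the same line.
a <- 1; b <- 2

# Three ways to assign; `<-` is the conventional one.
x1 <- 10
20 -> x2
x3 = 30

# A dot is just another character in a name, not an operator.
my.value <- a + b

# Single and double quotes are interchangeable; single quotes are handy
# when the string itself contains double quotes.
same   <- 'text' == "text"
quoted <- 'She said "hi"'
```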

- - -

## Variables

Like other languages, variables (known as "objects" in R) can be named arbitrarily. The limits on naming are:

- A variable name must start with a letter or a period (.) and can contain letters, digits, periods (.), and underscores (_).
- If it starts with period(.), it cannot be followed by a digit.
- A variable name cannot start with a number or underscore (_)
- Variable names are case-sensitive (age, Age and AGE are three different variables)
- Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
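
One way to try these rules out is with base R's `make.names()`, which rewrites any string that is not a syntactically valid name. The helper `is_valid_name()` below is a hypothetical convenience written for this example, not a built-in function.

```r
# A candidate that make.names() leaves unchanged is already a valid name.
is_valid_name <- function(candidate) make.names(candidate) == candidate

# Starts with a letter, or with a period not followed by a digit.
valid <- is_valid_name(c("age", "my_var.1", ".hidden"))

# Starts with a digit or underscore, a period followed by a digit,
# or is a reserved word.
invalid <- is_valid_name(c("2fast", "_tmp", ".2x", "TRUE"))

# Names are case-sensitive: these are three separate objects.
age <- 1; Age <- 2; AGE <- 3
```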

To assign a value to a variable, use the ` <- ` notation, which suggests "pushing" a value into the variable.

Note: Just like in Python, variables/objects can be ANYTHING. A string, an integer, a data frame, a list, etc. And just like in Python you can always print out that variable to see what it contains.

In [1]:
# comments also use hashtags to indicate they are comments!
# let's create an object and populate it.

myvar <- 1.45

In [2]:
# you can learn what datatype an object is using the typeof() function.
# A "double" is a double-precision floating-point number, R's default numeric type (it is not an integer).

typeof(myvar)

In [3]:
# you can also assign in the other direction if you want (though the convention is right-to-left with <-)

"mickey" -> mousename
mousename

In [4]:
# Another very basic operation is to concatenate values into an object using the c() function:

myname <- c("Bob", "Dylan")
myname

### R Data Types

There are several basic R data types.

-   [Numeric](#scrollTo=hNZPBBDdHKzi&line=35&uniqifier=1)
-   [Integer](#scrollTo=eikm2wiRHiC_&line=47&uniqifier=1)
-   Complex
-   [Logical](#scrollTo=fYk9koTOHYxR&line=42&uniqifier=1)
-   [Character](#scrollTo=nnEP2VnkHPkd&line=61&uniqifier=1)


## Numeric

Decimal values are called "numerics" in R.

It is the **default** computational data type.

If we assign a decimal value to a variable x, x will be of numeric type:

```{r}
x <- 10.5       # assign a decimal value
x              # print the value of x
```

```{r}
class(x)      # print the class name of x
```

Even if we assign a whole number to a variable k, it will still be saved
as a numeric value, not an integer.

```{r}
k <- 1
k              # print the value of k
```

```{r}
class(k)       # print the class name of k
```

That k is not an integer can be confirmed with `is.integer()`:

```{r}
is.integer(k)  # is k an integer?
```
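
The difference between a whole-number numeric and a true integer can be seen by comparing `class()` and `typeof()`. The sketch below uses made-up names (`whole`, `counted`).

```r
whole   <- 1     # stored as a double, even though it has no fractional part
counted <- 1L    # the L suffix makes it a true integer

class_whole    <- class(whole)      # "numeric"
type_whole     <- typeof(whole)     # "double"
class_counted  <- class(counted)    # "integer"
type_counted   <- typeof(counted)   # "integer"

# The two values compare equal, but they are not identical objects
# because their storage types differ.
equal_values   <- whole == counted
same_object    <- identical(whole, counted)
```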


## Integers

To create an integer variable in R, we use `as.integer()`.

```{r}
y <- as.integer(3)
y              # print the value of y
```

```{r}
class(y)       # print the class name of y
is.integer(y)  # is y an integer?
```

We can also declare an integer by appending an `L` suffix.

```{r}
y <- 3L
is.integer(y)  # is y an integer?
```

We can coerce, or cast, a numeric value into an integer with
`as.integer()`.

```{r}
as.integer(3.14)    # coerce a numeric value
```

And we can parse a string for decimal values in much the same way.

```{r}
as.integer("5.27")  # coerce a decimal string
```

On the other hand, trying to parse a non-numeric string does not work: R returns `NA` and issues a warning.

```{r}
as.integer("Joe")   # coerce a non-numeric string: NA, with a warning
```

We can convert booleans to numbers this way, too.

```{r}
as.integer(TRUE)    # the numeric value of TRUE
as.integer(FALSE)   # the numeric value of FALSE
```
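
A few details of `as.integer()` are easy to miss, so here is a small sketch of them. The variable names are illustrative only.

```r
truncated_pos <- as.integer(3.99)    # 3: the fractional part is dropped
truncated_neg <- as.integer(-3.99)   # -3: truncation is toward zero
rounded       <- round(3.99)         # 4, but the result is still a double

# A non-numeric string becomes NA; suppressWarnings() hides the
# coercion warning that R would otherwise print.
not_a_number <- suppressWarnings(as.integer("Joe"))

# Logical values coerce to 1 and 0, which makes counting TRUEs easy.
n_true <- sum(c(TRUE, FALSE, TRUE))
```

If you want rounding rather than truncation, call `round()` first and then `as.integer()`.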


## Math Operators

| **Operator**  | **Description**                     |
|---------------|-------------------------------------|
| `+`           | addition                            |
| `-`           | subtraction                         |
| `*`           | multiplication                      |
| `/`           | division                            |
| `^` or `**`   | exponentiation                      |
| `x %% y`      | modulus (x mod y); `5 %% 2` is 1    |
| `x %/% y`     | integer division; `5 %/% 2` is 2    |



In [5]:
4^2 * (4 + 8 * 2)
14^3
7 %/% 3
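
Integer division and modulus fit together: the quotient times the divisor plus the remainder gives back the original number. A short sketch, with illustrative names:

```r
x <- 7
y <- 3

quotient  <- x %/% y                  # 2
remainder <- x %% y                   # 1
rebuilt   <- quotient * y + remainder # 7, the original x

# Both exponentiation operators give the same result.
power_a <- 2^10
power_b <- 2**10

# With a negative number, %/% rounds down (toward negative infinity),
# and %% takes the sign of the divisor.
neg_quotient  <- (-7) %/% 3           # -3
neg_remainder <- (-7) %% 3            # 2
```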

## Logical (Boolean)

A logical value is often created via comparison between variables.

```{r}
x <- 1
y <- 2   # sample values
z <- x > y      # is x larger than y?
z              # print the logical value
```

```{r}
class(z)       # print the class name of z
```

In [6]:
x <- 1
y <- 2

x > y

### Logical Operators

Standard logical operations are `&` (and), `|` (or), and `!` (negation).

```{r}
u <- TRUE
v <- FALSE
u & v          # u AND v
```

```{r}
u | v          # u OR v
```

```{r}
!u             # negation of u
```

Note that you can use `T` and `F` instead of `TRUE` and `FALSE`. However, `T` and `F` are ordinary variables that can be reassigned, so the full names are safer.

```{r}
a <- T
b <- F
a & b
```

In [7]:
u <- TRUE
v <- FALSE
u & v

In [8]:
u | v

In [9]:
!u
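
Because R thinks in vectors, `&`, `|`, and `!` work member by member on whole vectors of logical values. The sketch below uses a made-up `scores` vector; `&&` is shown as the single-value counterpart commonly used in `if` conditions.

```r
scores <- c(55, 72, 90, 48)

passed <- scores >= 60             # FALSE TRUE TRUE FALSE
high   <- scores > 80              # FALSE FALSE TRUE FALSE

passed_not_high <- passed & !high  # FALSE TRUE FALSE FALSE
either          <- passed | high   # FALSE TRUE TRUE FALSE

# any() and all() collapse a logical vector to a single value.
any_passed <- any(passed)
all_passed <- all(passed)

# && (and ||) compare single values only.
first_ok <- (length(scores) > 0) && (scores[1] > 50)
```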

## Characters

A character object is used to represent string values in R.

We convert objects into character values with the `as.character()`
function:

```{r}
x <- as.character(3.14)
x
```

```{r}
class(x)       # print the class name of x
```

### `paste()`

Two character values can be concatenated with the `paste()` function.

```{r}
fname <- "Joe"
lname <- "Smith"
paste(fname, lname)
```

`paste()` takes a `sep` argument:

```{r}
paste("A", "B", "C", sep="--")
```
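
Two related tools are worth knowing: `paste0()`, which joins with no separator, and the `collapse` argument, which joins the members of a single vector into one string. A brief sketch with made-up values:

```r
full_name <- paste("Joe", "Smith")                  # "Joe Smith"
file_name <- paste0("report_", 2024, ".csv")        # "report_2024.csv"

# paste() is vectorized: the single string "item" is reused for each number.
labels <- paste("item", 1:3, sep = "_")             # "item_1" "item_2" "item_3"

# collapse joins the members of one vector into a single string.
joined <- paste(c("A", "B", "C"), collapse = "--")  # "A--B--C"
```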

### `sprintf()`

However, it is often more convenient to create a readable string with
the `sprintf()` function, which uses C-style format specifiers such as `%s` (string) and `%d` (integer).

```{r}
sprintf("%s has %d dollars", "Sam", 100)
```
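
A few more format specifiers, shown as a small sketch (the values and names are invented for illustration):

```r
two_decimals <- sprintf("%.2f", 3.14159)   # "3.14": two digits after the point
padded       <- sprintf("%5d", 42L)        # "   42": right-aligned in 5 characters
percent      <- sprintf("%d%%", 50L)       # "50%": %% prints a literal percent sign

# sprintf() is vectorized, producing one string per member.
report <- sprintf("%s scored %d points", c("Ann", "Bo"), c(10L, 7L))
```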

### `substr()`

To extract a substring, we apply the `substr()` function.

Here is an example showing how to extract the substring between the
third and twelfth positions in a string.

```{r}
substr("Mary has a little lamb.", start=3, stop=12)
```

### `sub()`

And to replace the first occurrence of the word "little" by another word
"big" in the string, we apply the `sub()` function.

```{r}
sub("little", "big", "Mary has a little lamb.")
```
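
Note that `substr()` positions are 1-based and inclusive at both ends, and that `sub()` changes only the first match; `gsub()` is its replace-all counterpart. A short sketch with illustrative names:

```r
rhyme <- "Mary has a little lamb."

n_chars <- nchar(rhyme)                       # 23 characters in total
piece   <- substr(rhyme, start = 3, stop = 12) # "ry has a l" (10 characters)

first_only <- sub("a", "A", "banana")         # "bAnana"
every_one  <- gsub("a", "A", "banana")        # "bAnAnA"
```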


## R Data Structures

Basic R comes with several data structures:

- Vector
- Matrix
- Array
- List
- Data frame

A **vector** is what most other programming languages call an array.

> A collection of cells with a fixed size where all cells hold the same data type (integers or characters or reals or whatever).

A **matrix** is a two-dimensional vector (fixed size, all cell types the same).

An **array** is a vector with one or more dimensions.

> So, an array with one dimension is (almost) the same as a vector. An array with two dimensions is (almost) the same as a matrix. An array with three or more dimensions is an n-dimensional array.

A **list** can hold items of different types and the list size can be increased on the fly.

> List contents can be accessed either by index (like mylist[[1]]) or by name (like mylist$age).

A **data frame** is called a table in most languages.

> Within a column, every value has the same type, and the columns can have header names. A data frame is essentially a kind of list: a list of vectors, each with the same length, but possibly of different data types.

The two most frequently used are **Vector** and **Data frame**.

So, let's look more closely at vectors and data frames. We will also look at lists since they are used internally to construct data frames.
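
As a quick orientation, here is one way to build each of the five structures. This is only a sketch; all object names and values are made up for the example.

```r
vec <- c(1.5, 2.5, 3.5)                  # vector: one type, one dimension
mat <- matrix(1:6, nrow = 2)             # matrix: 2 rows x 3 columns
arr <- array(1:24, dim = c(2, 3, 4))     # array: three dimensions

# list: members of different types, accessible by index or by name
person <- list(name = "Ann", age = 30, scores = c(90, 85))
person$city <- "Paris"                   # a list can grow on the fly

# data frame: columns of equal length, each with its own type
df <- data.frame(name = c("Ann", "Bo"), age = c(30, 25))
```

Here `person[[1]]` and `person$name` refer to the same member, and `is.list(df)` returns `TRUE` because a data frame is built on a list.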

### Vectors and `c()`

A vector is a sequence of data elements of the same basic type.

Members in a vector are officially called components, but many call them members.

Vectors may be created with the `c()` function ("c" stands for combine).

Here is a vector of three numeric values 2, 3 and 5.

In [10]:
c(2, 3, 5)

And here is a vector of logical values:

In [11]:
c(TRUE, FALSE, TRUE, FALSE, FALSE)

A vector can also contain character strings:

In [12]:
c("aa", "bb", "cc", "dd", "ee")

### Vectors from sequences using `:`, `seq()`, and `rep()`

Vectors can be made out of sequences which may be generated in a few ways.

In [13]:
s1 <- 2:5
s1

The `seq()` function is like Python's `range()`, except that the `to` endpoint is inclusive.

In [14]:
s2 <- seq(from=1, to=5, by=2)
s2

The `rep()` function creates a series of repeated values.

In [15]:
s3 <- rep(1, 5)
s3
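
A few more variations on these sequence helpers, as a sketch with illustrative names:

```r
countdown <- seq(10, 1, by = -3)           # 10 7 4 1: a negative step counts down
fractions <- seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00
positions <- seq_len(4)                    # 1 2 3 4

# rep() can repeat a whole vector, or each member in turn.
whole_repeated <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
each_repeated  <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```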

### Length

The number of members in a vector is given by the `length()` function.

In [16]:
length(c("aa", "bb", "cc", "dd", "ee"))

### Combining Vectors with `c()`

Vectors can be combined via the `c()` function:

In [17]:
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
c(n, s)

### Value Coercion

Notice how the numeric values are being coerced into character strings when the two vectors are combined.

This is necessary so as to maintain the same primitive data type for members in the same vector.
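
Coercion follows a fixed order: logical, then integer, then double, then character. Mixing types yields the most general type present. The sketch below checks this with `typeof()`; the names are illustrative.

```r
logical_and_integer <- typeof(c(TRUE, 2L))    # "integer"
integer_and_double  <- typeof(c(1L, 2.5))     # "double"
double_and_string   <- typeof(c(1.5, "a"))    # "character"

# The numbers really do become strings.
mixed <- c(2, "aa")                           # "2" "aa"
```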

### Vector Math

Arithmetic operations of vectors are performed member-by-member, i.e., member-wise.

For example, suppose we have two vectors a and b.

In [18]:
a <- c(1, 3, 5, 7)
b <- c(1, 2, 4, 8)

Then, if we multiply a by 5, we would get a vector with each of its members multiplied by 5.

In [19]:
5 * a

And if we add a and b together, the sum would be a vector whose members are the sum of the corresponding members from a and b.

In [20]:
a + b

Similarly for subtraction, multiplication and division, we get new vectors via member-wise operations.

In [21]:
a - b

In [22]:
a * b

In [23]:
a / b

### The Recycling Rule

If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector.

For example, the following vectors u and v have different lengths, and their sum is computed by recycling values of the shorter vector u.

In [24]:
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
u + v
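
Recycling is silent when the longer length is a multiple of the shorter one. When it is not, R still computes a result but issues a warning. The sketch below uses its own names (`short`, `long`) and captures the warning text with `tryCatch()` purely for demonstration.

```r
short <- c(10, 20, 30)
long  <- 1:9

# short is reused three times: 10 20 30 10 20 30 10 20 30
recycled <- short + long        # 11 22 33 14 25 36 17 28 39

# Lengths 3 and 2 do not divide evenly: the result is 11 22 13, with a warning.
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))

# Capture the warning message instead of printing it.
warning_text <- tryCatch(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) conditionMessage(w)
)
```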

### Vector Indexes

We retrieve values in a vector by placing an index inside the single square bracket `[]` operator.

**Vector indexes are 1-based. ALL INDEXING IN R IS 1-BASED.**

In [25]:
s <- c("aa", "bb", "cc", "dd", "ee")
s[3]

### Negative Indexing

Unlike Python, if the index is negative, it will remove the member whose position has the same absolute value as the negative index.

It really does mean subtraction!

For example, the following creates a vector slice with the third member removed.

In [26]:
s[-3]
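
Negative indexes can also drop several members at once. A short sketch, using a separate vector named `codes` so that it does not interfere with `s`:

```r
codes <- c("aa", "bb", "cc", "dd", "ee")

without_third     <- codes[-3]               # "aa" "bb" "dd" "ee"
without_first_two <- codes[-c(1, 2)]         # "cc" "dd" "ee"
without_last      <- codes[-length(codes)]   # "aa" "bb" "cc" "dd"
```

In Python, `codes[-1]` would return the last member; in R, it returns everything except the first.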

### Out-of-Range Indexes

Values for out-of-range indexes are reported as `NA`.

Missing values in R are denoted by `NA` (compare `NaN` or `None` in Python).

In [27]:
s[10]
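
Since `NA` means "unknown", it is tested with `is.na()` rather than with `==`, and many functions accept `na.rm = TRUE` to skip missing values. A brief sketch with made-up names:

```r
codes  <- c("aa", "bb", "cc", "dd", "ee")
beyond <- codes[10]                          # NA: there is no tenth member

is_missing <- is.na(beyond)                  # TRUE

readings   <- c(1, NA, 3)
naive_mean <- mean(readings)                 # NA: one unknown makes the mean unknown
clean_mean <- mean(readings, na.rm = TRUE)   # 2: the NA is dropped first
```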

### Numeric Index Vectors

A new vector can be sliced from a given vector with a numeric vector passed to the indexing operator.

Index vectors consist of member positions of the original vector to be retrieved.

Here we see how to retrieve a vector slice containing the second and third members of a given vector `s`.

In [28]:
s <- c("aa", "bb", "cc", "dd", "ee")
s[c(2, 3)]

### Duplicate Indexes

The index vector allows duplicate values. Hence the following retrieves a member twice in one operation.

In [29]:
s[c(2, 3, 3)]

### Out-of-Order Indexes

The index vector can even be out-of-order. Here is a vector slice with the order of first and second members reversed.

In [30]:
s[c(2, 1, 3)]

### Range Index

To produce a vector slice between two indexes, we can use the colon operator ":". This can be convenient for situations involving large vectors.

In [31]:
s[2:4]

### Logical Index Vectors

A new vector can be sliced from a given vector with a logical index vector.

The logical vector should be the same length as the original vector (a shorter one is recycled).

Its members are `TRUE` if the corresponding members in the original vector are to be included in the slice, and `FALSE` if otherwise.

> This is what we called boolean filtering and masking in Python.

For example, consider the following vector `s` of length 5.

In [32]:
s <- c("aa", "bb", "cc", "dd", "ee")

To retrieve the second and fourth members of `s`, we define a logical vector `L` of the same length, and have its second and fourth members set as `TRUE`.

In [33]:
L <- c(FALSE, TRUE, FALSE, TRUE, FALSE)
s[L]

The code can be abbreviated into a single line:

In [34]:
s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]
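
In practice the logical vector usually comes from a comparison rather than being typed out by hand. A minimal sketch with an invented `temps` vector:

```r
temps <- c(12, 25, 31, 18, 27)

is_warm <- temps > 20            # FALSE TRUE TRUE FALSE TRUE
warm    <- temps[is_warm]        # 25 31 27

# which() converts the logical vector into the matching positions.
warm_positions <- which(is_warm) # 2 3 5
```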

### Naming Vector Members with `names()`

We can assign names to vector members.

In [35]:
v <- c("Mary", "Sue")
names(v) <- c("First", "Last")
v

Now we can retrieve the first member by name.

In [36]:
v["First"]

We can also reverse the order with a character string index vector.

In [37]:
v[c("Last", "First")]
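
Names can also be supplied directly inside `c()`. The sketch below uses a separate vector called `person` for illustration.

```r
person <- c(First = "Mary", Last = "Sue")

member_names <- names(person)        # "First" "Last"

# Single brackets keep the name; double brackets return the bare value.
last_named <- person["Last"]
last_bare  <- person[["Last"]]       # "Sue"

# unname() strips all names from the vector.
bare <- unname(person)
```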

## Lists

A list is a generic vector containing other objects.

The following variable `x` is a list containing copies of three vectors `n`, `s`, `b`, and a numeric value 3.

In [38]:
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

x <- list(n, s, b, 3)   # x contains copies of n, s, b
x

Note the odd double-bracket notation (`[[1]]`, `[[2]]`, ...) in the printed output.

Each list member here holds a vector.

### List Slicing

We retrieve a list slice with the single square bracket [] operator.

The following is a slice containing the second member of x, which is a copy of s.

In [39]:
x[2]

With an index vector, we can retrieve a slice with multiple members.

Here is a slice containing the second and fourth members of x.

In [40]:
x[c(2, 4)]

### Member Reference with `[[]]`

To reference a list member directly, we use the double square bracket `[[]]` operator.

The following object `x[[2]]` is the second member of x.

In other words, `x[[2]]` is a copy of `s`, but is not a slice containing `s` or its copy.

In [41]:
x[[2]]
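
The difference between `[]` and `[[]]` shows up clearly in the class of the result. This sketch uses a separate list named `items` so that `x` is left untouched.

```r
items <- list(c(2, 3, 5), c("aa", "bb"), TRUE)

slice  <- items[2]      # still a list, holding one member
member <- items[[2]]    # the character vector itself

slice_class  <- class(slice)    # "list"
member_class <- class(member)   # "character"
```

A slice of length one is still a list, so `length(slice)` is 1 while `length(member)` is 2.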

We can modify its contents directly.

In [42]:
x[[2]][1] <- "ta"
x[[2]]

And `s` is unaffected.

In [43]:
s
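
The same copy behaviour can be shown in isolation. In this sketch, `original` and `holder` are made-up names; changing the list member leaves the source vector as it was.

```r
original <- c("aa", "bb", "cc")
holder   <- list(original)      # the list holds a copy of the vector

holder[[1]][1] <- "ta"          # modify the copy inside the list

inside_list <- holder[[1]]      # "ta" "bb" "cc"
# original is still "aa" "bb" "cc"
```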