In [79]:
options(jupyter.rich_display = FALSE)

# BASICS OF R

> “To understand computations in R, two slogans are helpful:
> 
> Everything that exists is an object.
>
> Everything that happens is a function call.”
>
> — John Chambers

(http://adv-r.had.co.nz/Functions.html)

## Basic mathematical functions

Arithmetic operators are built-in functions in R, so common mathematical operations are easy to perform. Do the following calculations in their respective code blocks.

* $ 3+5 $

In [80]:
3 + 5

[1] 8

* $ \frac{67}{6}+10-5.7 $

In [81]:
67/6 + 10 - 5.7

[1] 15.46667

* $ 6*5+12 $

In [82]:
6*5 + 12

[1] 42

* $ 3^4 $

In [83]:
3^4

[1] 81

* $\displaystyle \frac{2/5^2}{5.5/(3*2)}$

In [84]:
(2/5^2)/(5.5/(3*2))

[1] 0.08727273

* $5 \equiv r \pmod{3} , r=? $

In [85]:
5%%3

[1] 2
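
Since every operator is itself a function, the calculations above can also be written as explicit function calls by wrapping the operator name in backticks. The sketch below is only illustrative; the variable names are made up for this example.

```r
# Operators are ordinary functions; backticks let us call them by name.
sum_call <- `+`(3, 5)      # same as 3 + 5
pow_call <- `^`(3, 4)      # same as 3^4
mod_call <- `%%`(5, 3)     # same as 5 %% 3 (remainder)
int_div  <- 5 %/% 3        # integer division, the companion of %%

# The quotient and the remainder together rebuild the original number.
rebuilt <- 3 * int_div + mod_call

print(c(sum_call, pow_call, mod_call, int_div, rebuilt))
```

This is the second half of the Chambers quote in action: `3 + 5` is just a friendlier way of writing a function call.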

## Assignment operators

In R, values can be assigned to names in several ways. The most common assignment operators are `<-` and `=`. They are not interchangeable in every context, so be careful about which one you use. Let's see some examples.

Create a vector (we will talk about what a vector is) named **x** and assign 3+4 to it using `<-`. 

In [86]:
x  <- 3 + 4
x

[1] 7

As you can see, the vector **x** now holds the value 7, the result of 3+4. Create another vector named **y** and assign 3+4 to it, this time using `=`.

In [87]:
y = 3 + 4
y

[1] 7

Both operators yielded the same result here; however, this is not always the case. To avoid confusion, use `<-` for assignment and `=` for passing arguments to functions (which we will also cover later).
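
One place where the two operators behave differently is inside a function call. A minimal sketch (the names `mean_eq`, `mean_arrow`, and `values_arrow` are made up for this illustration):

```r
# Inside a call, `=` only matches a value to an argument name.
# Here `x` is the name of mean()'s first argument; no variable is created.
mean_eq <- mean(x = c(2, 4, 6))

# Inside a call, `<-` first creates a variable in the calling environment,
# then passes its value to the function.
mean_arrow <- mean(values_arrow <- c(2, 4, 6))

mean_eq
mean_arrow
values_arrow   # this variable now exists as a side effect
```

Both calls compute the same mean, but only the second one leaves a new variable behind, which is easy to overlook.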

## Basic data types in R and getting info on objects

R has 6 basic **data types**: character, numeric, integer, logical, complex, and raw. Here are examples of the first five (raw, which stores bytes, is not covered here):
* **character**: `"Berk"`, `"1"`, `"a"`
* **numeric**: `2`, `15` , `3.14`
* **integer**: `2L` (the `L` suffix tells R to store the number as an integer; this will be useful later)
* **logical**: `TRUE`, `T`, `FALSE`
* **complex**: `1+6i`

To get information on an object, you can use the `class()` function. Create a vector named **surname** and assign "Filcan" to it. Then, use the `class()` function to get its type.

In [88]:
surname <- "Filcan"
class(surname)

[1] "character"

Let's create two new vectors: **no** with 12345, and **character** with "12345". Then, get their types.

In [89]:
no <- 12345
character <- "12345"
class(no)
class(character)

[1] "numeric"

[1] "character"

As you can see, **"12345"** and **12345** have different data types: the first is text and the second is a number. Mixing them up is a common pitfall for beginners.
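
A short sketch of why the distinction matters and how to convert between the two. The names `id_text` and `id_number` are hypothetical and used only for this illustration.

```r
id_text   <- "12345"   # character: digits stored as text
id_number <- 12345     # numeric

# Arithmetic on text fails, so convert it first with as.numeric().
converted <- as.numeric(id_text)
converted + 1

# Converting the other way gives text again.
back_to_text <- as.character(id_number)

class(converted)
class(back_to_text)
```

Running `id_text + 1` directly would stop with an error, because `+` is not defined for character values.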

## Creating vectors

Aside from basic data types, R has many **data structures**. Vectors are one of them. A **vector** is the simplest type of data structure in R. Simply put, a vector is a sequence of data elements of the **same** basic type. Let's create some vectors.

Create a vector named **seller_names** and assign it with "Michael", "Dwight", and "Jim".

In [90]:
seller_names <- c("Michael", "Dwight", "Jim")
seller_names

[1] "Michael" "Dwight"  "Jim"    

As you can see, we use `c()` to create vectors. Create another vector named `seller_heights` and assign it with 175, 191, 189.

In [91]:
seller_heights <- c(175, 191, 189)
seller_heights

[1] 175 191 189

Get types of each vector.

In [92]:
class(seller_names)
class(seller_heights)

[1] "character"

[1] "numeric"

Besides their types, we can also get the lengths of vectors. Use the `length()` function to do this.

In [93]:
length(seller_heights)

[1] 3

What about combining different types of data? Create a vector named **mixed** and assign "Dunder Mifflin", 2013, TRUE to it. Then, get the type of the **mixed** vector.

In [94]:
mixed <- c("Dunder Mifflin", 2013, TRUE)
class(mixed)

[1] "character"

Why is the type of **mixed** "character"? A vector can hold only one type of data, so the elements 2013 and TRUE are *coerced* to "character". To store different types of data together, we need **data frames**, which we will cover later.
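
Coercion follows a fixed direction: R converts every element to the most flexible type present. A minimal sketch with made-up vector names, limited to the types used so far:

```r
# Among these types, the direction is logical -> integer -> numeric -> character.
logical_and_number <- c(TRUE, 2.5)     # TRUE becomes 1
integer_and_number <- c(1L, 2.5)       # 1L becomes 1.0
number_and_text    <- c(2013, "2013")  # 2013 becomes "2013"

class(logical_and_number)
class(integer_and_number)
class(number_and_text)
logical_and_number
```

Character sits at the end of this chain, which is why a single string was enough to turn every element of **mixed** into text.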

## Naming vectors and unnaming

We can name the elements of a vector with the `names()` function. Name the elements of **seller_heights** using the **seller_names** vector. Then, get the type of **seller_heights**, and use the `attributes()` function to get its attributes.

In [95]:
names(seller_heights) <- seller_names
seller_heights
class(seller_heights)
attributes(seller_heights)

Michael  Dwight     Jim 
    175     191     189 

[1] "numeric"

$names
[1] "Michael" "Dwight"  "Jim"    


As you can see, we named the elements of **seller_heights** and its type did not change. Use the `names()` function without assignment to get the names.

In [96]:
names(seller_heights)

[1] "Michael" "Dwight"  "Jim"    

We can also remove the names. The `unname()` function returns a copy of the vector without names; the original vector keeps its names unless you assign the result back to it.

In [97]:
unname(seller_heights)

[1] 175 191 189

Alternatively, we can assign names to elements of a vector during creation. Create a vector named **seller_income** with Michael=22000, Jim=20000, Dwight=18000. Then, get the names of elements.

In [98]:
seller_income <- c(Michael=22000, Jim=20000, Dwight=18000)
names(seller_income)

[1] "Michael" "Jim"     "Dwight" 

## Subsetting vectors

Subsetting means extracting elements of a data structure using their indices. Before diving into subsetting, let's create some vectors for later use. The `:` operator creates a sequence from one number to another in steps of 1. Create a vector named **indices_1** from 1 to 2 using `:`.

In [99]:
indices_1 <- 1:2
indices_1

[1] 1 2

We can also create vectors of logical values, also called **boolean values**. `TRUE`, `FALSE`, `T`, and `F` are examples of boolean values. Create a vector named **boolean_indices_1** and assign TRUE, FALSE, TRUE to it.

In [100]:
boolean_indices_1 <- c(TRUE, FALSE, TRUE)
boolean_indices_1

[1]  TRUE FALSE  TRUE

Also create another vector called **indices_2** with values 1 and 2.

In [101]:
indices_2 <- c(1,2)
indices_2

[1] 1 2

Create a new vector named **seller_ages** with Michael=40, Jim=27, Dwight=29.

In [102]:
seller_ages <- c(Michael=40, Jim=27, Dwight=29)
seller_ages

Michael     Jim  Dwight 
     40      27      29 

We are ready to subset. `[` is the subsetting (or extracting) operator. Using **indices_1**, get the incomes of Michael and Jim.

In [103]:
seller_income[indices_1]

Michael     Jim 
  22000   20000 

Try this with **indices_2**.

In [104]:
seller_income[indices_2]

Michael     Jim 
  22000   20000 

### Indexing with boolean values

As you can see, both vectors hold the same values and returned the incomes of Michael and Jim. What about **boolean_indices_1**?

In [105]:
seller_income[boolean_indices_1]

Michael  Dwight 
  22000   18000 

Did you see the difference? The **boolean_indices_1** vector holds TRUE, FALSE, TRUE, in that order. When used for subsetting, it returned only the elements of **seller_income** at the positions with TRUE values: Michael and Dwight. Try it with a different boolean vector named **boolean_indices_2** holding FALSE, TRUE, TRUE.

In [106]:
boolean_indices_2 <- c(FALSE, TRUE, TRUE)
seller_income[boolean_indices_2]

   Jim Dwight 
 20000  18000 

As expected, it returned Jim's and Dwight's incomes. What if the boolean vector is shorter than the vector being subsetted? Create a new boolean vector named **boolean_indices_3** with TRUE, FALSE.

In [107]:
boolean_indices_3 <- c(TRUE, FALSE)
seller_income[boolean_indices_3]

Michael  Dwight 
  22000   18000 

It returned the first and third elements of **seller_income**. Why? Since **boolean_indices_3** is shorter than the subsetted vector, R repeats it to the required length (TRUE, FALSE, TRUE), so Dwight's income is returned too.
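
This repetition is usually called *recycling*. The sketch below makes it explicit with `rep_len()`; the names `short_mask` and `full_mask` are made up for this illustration.

```r
seller_income <- c(Michael = 22000, Jim = 20000, Dwight = 18000)
short_mask <- c(TRUE, FALSE)

# rep_len() repeats the mask until it reaches the requested length,
# mirroring what R does implicitly during subsetting.
full_mask <- rep_len(short_mask, length(seller_income))
full_mask

# Both forms select the same elements.
identical(seller_income[short_mask], seller_income[full_mask])
```

Recycling is convenient but can hide mistakes, so it is worth checking that a logical index has the length you expect.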

We can also subset with a single index number. Return the second element of **seller_income**.

In [108]:
seller_income[2]

  Jim 
20000 

### Indexing with names

Names can be used for subsetting too. Get Dwight's income using his name.

In [116]:
seller_income["Dwight"]

Dwight 
 18000 

### Indexing with a function

The result of a function call can be used as an index too. Try these:
- Get the last item of **seller_income** vector
- Get all but the last item of **seller_income** vector

In [140]:
seller_income[length(seller_income)]
seller_income[-length(seller_income)]

Dwight 
 18000 

Michael     Jim 
  22000   20000 

What is that `-` we used in indexing? Let's take a look.

## Negative subsetting and object modification

You can subset by the indices you want, but you can also subset by excluding the indices you do not want. For example, suppose you do not want to see Michael's income. Put `-` in front of the unwanted index number to drop that element.

In [109]:
seller_income[-1]

   Jim Dwight 
 20000  18000 

Assume that you only want the sales representatives' incomes. Create a vector named **rep_income** with the incomes of Jim and Dwight.

In [110]:
rep_income <- seller_income[-1]
rep_income

   Jim Dwight 
 20000  18000 

Michael wants to make Pam a sales representative. We want to insert Pam's income at the second position. Assume Pam's income is 16000 and insert it into the **rep_income** vector.

In [111]:
rep_income <- c(rep_income[1], Pam=16000, rep_income[2])
rep_income

   Jim    Pam Dwight 
 20000  16000  18000 
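
The same insertion can be sketched with base R's `append()`, whose `after` argument gives the position after which the new values are placed. This is an alternative shown for comparison, not the approach used in this notebook; `rep_income_demo` and `with_pam` are hypothetical names.

```r
rep_income_demo <- c(Jim = 20000, Dwight = 18000)

# Insert Pam after the first element; the names are kept.
with_pam <- append(rep_income_demo, c(Pam = 16000), after = 1)
with_pam
```

`append()` avoids splitting the vector by hand, which helps when the insertion point is in the middle of a longer vector.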

After the closure of the Stamford branch of Dunder Mifflin Paper Company, a new sales representative named Andy is transferred to the Scranton branch. Append him to the **rep_income** vector at the fourth position, with Andy=19000.

In [112]:
rep_income[4] <- c(Andy=19000)
rep_income

   Jim    Pam Dwight        
 20000  16000  18000  19000 

Why can't we see Andy's name? Assigning to an index with `[ ] <-` copies only the value, not the name attached to the right-hand side, so the new fourth element was added with an empty name. Name the fourth element "Andy".

In [113]:
names(rep_income)[4] <- "Andy"
rep_income

   Jim    Pam Dwight   Andy 
 20000  16000  18000  19000 
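
Two ways to append a value together with its name in a single step are sketched below. They are alternatives for comparison; `rep_income_demo`, `with_c`, and `with_name` are made-up names.

```r
rep_income_demo <- c(Jim = 20000, Pam = 16000, Dwight = 18000)

# Option 1: combine with c(), which keeps the names of every piece.
with_c <- c(rep_income_demo, Andy = 19000)

# Option 2: assign to a name that does not exist yet; R appends it.
with_name <- rep_income_demo
with_name["Andy"] <- 19000

with_c
with_name
```

Either form saves the separate `names()` call needed after assigning by position.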

Michael is upset with his workers' low wages. He wants to give a 10% raise to all sales representatives with an income lower than 20000.

In [114]:
rep_income[rep_income < 20000] <- rep_income[rep_income < 20000] * 1.1
rep_income

   Jim    Pam Dwight   Andy 
 20000  17600  19800  20900 
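
The raise works because the comparison `rep_income < 20000` produces a logical vector that is then used as an index. A step-by-step sketch on a copy of the data (`rep_income_demo` and `below_20000` are hypothetical names):

```r
rep_income_demo <- c(Jim = 20000, Pam = 16000, Dwight = 18000, Andy = 19000)

# The comparison is applied element by element and yields a logical vector.
below_20000 <- rep_income_demo < 20000
below_20000

# which() converts the mask into index numbers.
which(below_20000)

# Only the masked elements are replaced; the others stay untouched.
rep_income_demo[below_20000] <- rep_income_demo[below_20000] * 1.1
rep_income_demo
```

Jim is excluded because his income is exactly 20000, and `<` is a strict comparison.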

After a successful year of sales, Dunder Mifflin Paper Company wants to give a 10% increase on all sales representatives' incomes.

In [115]:
rep_income <- rep_income*1.1
rep_income

   Jim    Pam Dwight   Andy 
 22000  19360  21780  22990 

## Sorting

Create a vector named **rep_ages** with Jim=27, Dwight=29, Pam=25, Andy=30.

In [128]:
rep_ages <- c(Jim=27, Dwight=29, Pam=25, Andy=30)
rep_ages

   Jim Dwight    Pam   Andy 
    27     29     25     30 

Sort **rep_ages** and **rep_income** using the `sort()` function.

In [129]:
sort(rep_ages)
sort(rep_income)

   Pam    Jim Dwight   Andy 
    25     27     29     30 

   Pam Dwight    Jim   Andy 
 19360  21780  22000  22990 

By default, the `sort()` function sorts elements in ascending order. Try `sort(rep_ages, decreasing=TRUE)` for descending order.

In [130]:
sort(rep_ages, decreasing=TRUE)
rep_ages

  Andy Dwight    Jim    Pam 
    30     29     27     25 

   Jim Dwight    Pam   Andy 
    27     29     25     30 

Notice that sorting did not modify the original vector: **rep_ages** is still in its original order. To keep the sorted ages, create a new vector named **sorted_ages** and assign the sorted values to it.

In [131]:
sorted_ages <- sort(rep_ages, decreasing=TRUE)
sorted_ages
rep_ages

  Andy Dwight    Jim    Pam 
    30     29     27     25 

   Jim Dwight    Pam   Andy 
    27     29     25     30 

## Ordering

The `order()` function returns the indices that would put a vector in ascending order. See the example.

In [135]:
rep_ages
order(rep_ages)
rep_ages[order(rep_ages)]

   Jim Dwight    Pam   Andy 
    27     29     25     30 

[1] 3 1 2 4

   Pam    Jim Dwight   Andy 
    25     27     29     30 

The `order()` function says: "To sort the elements of **rep_ages** in ascending order, I need to take the third element first (Pam, the youngest). Then I need the first element (Jim). The third youngest representative is Dwight, who is at the second position. Finally, the oldest one is Andy, who is at the fourth position." This is a bit confusing at first, but it becomes intuitive with practice.
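
A common use of `order()` is to arrange one vector according to the values of another. The sketch below lists incomes from the youngest to the oldest representative; `income_aligned` and `income_by_age` are made-up names, and the income figures are the ones computed earlier in this notebook.

```r
rep_ages   <- c(Jim = 27, Dwight = 29, Pam = 25, Andy = 30)
rep_income <- c(Jim = 22000, Pam = 19360, Dwight = 21780, Andy = 22990)

# Align the incomes with the names in rep_ages before reordering,
# because the two vectors list the representatives in different orders.
income_aligned <- rep_income[names(rep_ages)]

# Incomes listed from the youngest to the oldest representative.
income_by_age <- income_aligned[order(rep_ages)]
income_by_age
```

`sort()` could not do this, since it only rearranges the vector it is given.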

## Rank

In [137]:
rep_ages
sort(rep_ages)
order(rep_ages)
rank(rep_ages)

   Jim Dwight    Pam   Andy 
    27     29     25     30 

   Pam    Jim Dwight   Andy 
    25     27     29     30 

[1] 3 1 2 4

   Jim Dwight    Pam   Andy 
     2      3      1      4 

`order()` gives the indices of the items such that picking the items in that order yields a sorted vector.

`rank()` gives the rank of each item if the vector were sorted.
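
The relationship between the two can be checked directly. A small sketch, assuming there are no tied values (`age_order` and `age_rank` are made-up names):

```r
rep_ages <- c(Jim = 27, Dwight = 29, Pam = 25, Andy = 30)

age_order <- order(rep_ages)  # positions to pick, youngest first
age_rank  <- rank(rep_ages)   # where each person lands after sorting

# With no ties, ordering the order recovers the ranks.
all(order(age_order) == age_rank)

# Pam is picked first by order() and therefore has rank 1.
names(rep_ages)[age_order[1]]
age_rank["Pam"]
```

With ties the two can disagree, because `rank()` averages tied positions by default.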

## Random sampling

Create a vector named **sample_1** containing 10 numbers randomly selected from 1 to 20 (without repetition). Check the vector.

In [139]:
set.seed(1000) # fix the seed so the same "random" sample can be reproduced
sample_1 <- sample(1:20, 10, replace = FALSE)
sample_1

 [1] 16  4 11  3 13  2  6 14 20  1

The `set.seed()` function fixes the starting point of R's *pseudo-random* number generator, which ensures that the same sample is drawn every time the code is run.
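
The effect of the seed can be verified by drawing twice. A minimal sketch with hypothetical variable names:

```r
set.seed(1000)
first_draw <- sample(1:20, 10, replace = FALSE)

set.seed(1000)  # reset to the same seed
second_draw <- sample(1:20, 10, replace = FALSE)

# No reset this time, so the generator continues from where it stopped.
third_draw <- sample(1:20, 10, replace = FALSE)

identical(first_draw, second_draw)  # same seed, same sample
identical(first_draw, third_draw)   # usually different
```

The exact numbers drawn for a given seed can differ between R versions, so the seed guarantees reproducibility within one setup rather than everywhere.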

## Vectorized functions

### Getting reverses, sums and cumulative sums of vectors

Create a vector named **seq_10** as a sequence from 1 to 10 using `:` operator.

In [141]:
seq_10 <- 1:10
seq_10

 [1]  1  2  3  4  5  6  7  8  9 10

Get the sum of its values using the `sum()` function. Then, get the cumulative sum of its values using the `cumsum()` function and assign it to a new vector called **seq_10c**.

In [142]:
sum(seq_10)
seq_10c <- cumsum(seq_10)
seq_10c

[1] 55

 [1]  1  3  6 10 15 21 28 36 45 55

To get the reversed version of the **seq_10** vector, use the `rev()` function. Assign it to a vector called **seq_10r**.

In [143]:
seq_10r <- rev(seq_10)
seq_10r

 [1] 10  9  8  7  6  5  4  3  2  1

### Getting products and cumulative products of vectors

Multiply all elements of **seq_10** using the `prod()` function. Then, get their cumulative products using `cumprod()` and assign the result to a vector called **seq_10p**.

In [144]:
prod(seq_10)
seq_10p <- cumprod(seq_10)
seq_10p

[1] 3628800

 [1]       1       2       6      24     120     720    5040   40320  362880
[10] 3628800
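
These functions are called vectorized because they work on a whole vector in one call, with no explicit loop. The sketch below shows that arithmetic behaves the same way and that the cumulative results end at the overall totals; the variable names are made up for this illustration.

```r
seq_10 <- 1:10

# Arithmetic is vectorized too: each element is squared in a single call.
squares <- seq_10^2
squares

# The last cumulative value equals the overall total.
last_cumsum  <- tail(cumsum(seq_10), 1)
last_cumprod <- tail(cumprod(seq_10), 1)

last_cumsum == sum(seq_10)
last_cumprod == prod(seq_10)
```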

## Repeating values

You can create vectors with repeated values using the `rep()` function. Create a vector with 10 TRUE values.

In [145]:
rep(TRUE, 10)

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

You can also repeat a whole vector using `rep()`. Create a vector named **one_two_three** with 1,2,3 and repeat it 4 times.

In [147]:
one_two_three <- c(1,2,3)
rep(one_two_three, 4)

 [1] 1 2 3 1 2 3 1 2 3 1 2 3

You can also repeat each element in turn, which keeps equal values next to each other. Try `rep(one_two_three, each=4)`.

In [148]:
rep(one_two_three, each=4)

 [1] 1 1 1 1 2 2 2 2 3 3 3 3
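
`rep()` has a few more arguments worth knowing. A brief sketch of `times` given as a vector, `each` combined with `times`, and `length.out` (the result names are made up for this example):

```r
one_two_three <- c(1, 2, 3)

# A vector for `times` sets a separate repeat count for each element.
per_element <- rep(one_two_three, times = c(3, 2, 1))
per_element

# `each` and `times` can be combined: repeat each element, then the whole.
combined <- rep(c("a", "b"), each = 2, times = 2)
combined

# `length.out` cuts or extends the result to a fixed length.
fixed_length <- rep(one_two_three, length.out = 5)
fixed_length
```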

## Basic statistical functions

R has built-in functions for basic statistical operations. Create a vector named **numbers** with 23,1,43,6,18,55,30,40,91,77,83,42. Then do the following:
- To get the mean of the **numbers** vector, use the `mean()` function.
- To get the median of the **numbers** vector, use the `median()` function.
- To get the minimum item of the **numbers** vector, use the `min()` function.
- To get the maximum item of the **numbers** vector, use the `max()` function.
- To get the standard deviation of the **numbers** vector, use the `sd()` function.
- To get the statistical summary of the **numbers** vector, use the `summary()` function.

In [149]:
numbers <- c(23,1,43,6,18,55,30,40,91,77,83,42)

mean(numbers)
median(numbers)
min(numbers)
max(numbers)
sd(numbers)
summary(numbers)

[1] 42.41667

[1] 41

[1] 1

[1] 91

[1] 29.44474

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   21.75   41.00   42.42   60.50   91.00 
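
To see what these functions compute, the mean, standard deviation, and median can be rebuilt by hand from the vectorized functions covered earlier. This is a sketch for checking understanding, with made-up variable names; in practice the built-in functions are the ones to use.

```r
numbers <- c(23, 1, 43, 6, 18, 55, 30, 40, 91, 77, 83, 42)
n <- length(numbers)

# Mean: the sum divided by the number of elements.
manual_mean <- sum(numbers) / n

# Sample standard deviation: note the n - 1 in the denominator.
manual_sd <- sqrt(sum((numbers - manual_mean)^2) / (n - 1))

# Median of an even-length vector: the average of the two middle values.
sorted_numbers <- sort(numbers)
manual_median <- (sorted_numbers[n / 2] + sorted_numbers[n / 2 + 1]) / 2

all.equal(manual_mean, mean(numbers))
all.equal(manual_sd, sd(numbers))
all.equal(manual_median, median(numbers))
```

`all.equal()` is used instead of `==` because floating-point results can differ in the last digits even when the formulas agree.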