<img src="materials/images/introduction-to-r-programming-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

R is a statistical programming language that is very effective for computation and high-level graphics. It is commonly used for data analytics and data science.

We are going through five lessons in this module:

- [**Lesson 1: R Basic Data Types**](Lesson_1_R_Basic_Data_Types.ipynb)

- <font color=#E98300>**Lesson 2: R Data Structures**</font>    `📍You are here.`
    
- [**Lesson 3: Importing Data**](Lesson_3_Importing_Data.ipynb)

- [**Lesson 4: Conditionals and Loops**](Lesson_4_Conditionals_and_Loops.ipynb)

- [**Lesson 5: Functions**](Lesson_5_Functions.ipynb)

<br>

### ✅ Exercises
We encourage you to try the exercise questions in this module, and use the [**solutions to the exercises**](Exercise_solutions.ipynb) to help you study.

<br>

<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcut</h3>

These common shortcuts can save you time as you work through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>

---

# Lesson 2: R Data Structures

A data structure is a format or object for storing data. We are going to go through these concepts:

- [Vectors](#Vectors)
- [Lists](#Lists)
- [Matrix](#Matrix)
- [Data Frames](#Data-Frames)

`🕒 This module should take about 40 minutes to complete.`

`✍️ This notebook is written using R.`

## Vectors
The fundamental data structure in R is the vector. A vector is a one-dimensional sequence of data elements that all share the same type. The elements are most commonly of type character, logical, integer, or numeric.

A common way to create a vector is by directly specifying its content. The c() function can be used to "combine" the specified values:

In [None]:
v1  <-  c(1,2,3,4)
v1

In [None]:
v2 <- c("This", "is", "absolutely", "fantastic")
v2

In [None]:
v3  <-  c(TRUE, FALSE, TRUE, FALSE)
print(v3)

**Note that if you mix data types, R will coerce the vector to a single data type:**

In [None]:
v4  <-  c(TRUE, FALSE, TRUE, "FALSE")
print(v4)

<div class="alert alert-block alert-warning">
<b>Alert:</b> A vector is a data structure that contains only a single data type (e.g., integers, logicals).
</div>
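
As a hedged illustration of the coercion rule above: when types are mixed, R converts every element to the most flexible type present, following the order logical, integer, double, character. The values below are illustrative only.

```r
# Mixed types are coerced toward the most flexible type present:
# logical -> integer -> double -> character

typeof(c(TRUE, 2L))     # "integer": TRUE becomes 1L
typeof(c(TRUE, 2.5))    # "double": TRUE becomes 1
typeof(c(1, "a"))       # "character": 1 becomes "1"

# Logical values turn into 1 (TRUE) and 0 (FALSE) when mixed with numbers:
mixed <- c(TRUE, FALSE, 2.5)
mixed
```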

### Creating Vectors from a Sequence of Numbers

A vector can be constructed from a sequence of numbers:

In [None]:
# Create a vector from the sequence of numbers 1 through 10 (inclusive):

seq1 <- 1:10
print(seq1)

In [None]:
# Create a vector of numbers beginning from 5 and counting down through (and including) -5:

seq1 <- 5:-5
print(seq1)

### A Vector Created using the seq() Function

The seq() ("sequence") function can be used to construct a vector from a sequence of numbers:

In [None]:
# Create a vector of 10 numbers beginning from 1:

seq(10)

In [None]:
# Create a vector of numbers from 1 through 10, but skipping every second number:

seq(1, 10, 2)

In [None]:
# Create a vector of 20 numbers, evenly spaced from 1 through 10:

seq(1, 10, length.out = 20)

In [None]:
# Create a vector of numbers from 1 through 10, stepping every 0.1:

seq(from = 1, to = 10, by = 0.1)

### A Vector Created from a Sample of Numbers

In [None]:
# Create a vector by randomly shuffling the numbers 1 through 10 (sampling without replacement):

sample(1:10)
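
Because sample() is random, the result changes on every run. The sketch below assumes you want a repeatable result and uses set.seed() for that; the seed value 42 and the variable names are arbitrary choices for illustration. It also shows the `size` and `replace` arguments of sample().

```r
# Fix the random seed so the same "random" draw is produced on every run.
# The seed value 42 is arbitrary.
set.seed(42)

# Draw 3 numbers from 1 through 10, without replacement (no repeats):
draw <- sample(1:10, size = 3)
draw

# Draw 15 numbers from 1 through 10, with replacement (repeats allowed):
draws_with_repeats <- sample(1:10, size = 15, replace = TRUE)
draws_with_repeats
```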

### Examining Vectors

#### typeof()
The typeof() function can be used to evaluate the data type that a vector contains:

In [None]:
# Mixed data types coerced to a single data type:

v4  <-  c(TRUE, FALSE, TRUE, "FALSE")

typeof(v4)

#### length()
The length() function can be used to determine the number of elements that a vector contains:

In [None]:
length(v4)

### Missing Data

Missing values in R are indicated by NA (Not Available):

In [None]:
x <- c(0.5, 0.7, NA)

# Variable containing a vector with a missing value
x

In [None]:
# Try this.

x <- c("a", NA, "c", "d", "e")

# Variable containing a vector with a missing value
x

#### is.na() 
is.na() can be used to test whether a value is missing. It returns a logical vector:

In [None]:
# Test each value in a vector as to whether it is missing or not:

x <- c("a", NA, "c", "d", "e")

# x is the variable containing a vector with a missing value
is.na(x)

#### anyNA() 
anyNA() can be used to evaluate whether a vector contains any missing (NA) values:

In [None]:
x <- c("a", NA, "c", "d", "e")

# x is the variable containing a vector with a missing value
anyNA(x)
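
Missing values matter in practice because many summary functions return NA when any element is missing. The following is a minimal sketch of how the `na.rm` argument and is.na() can be used to work around this; the vector is the numeric one from earlier in this section.

```r
x <- c(0.5, 0.7, NA)

# A single NA makes the result NA:
mean(x)

# na.rm = TRUE removes the missing values before computing:
mean(x, na.rm = TRUE)

# Count the missing values (each TRUE counts as 1):
sum(is.na(x))

# Keep only the values that are not missing:
x[!is.na(x)]
```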

### Universal (built-in) functions that can be applied to vectors

In [None]:
# Create a vector "v":

v <- c(25, 9, 16, 4)
v

In [None]:
# Return the mean value of the vector

mean(v)

In [None]:
# Return the square root of each value in the vector

sqrt(v)

In [None]:
# Return the maximum value in the vector

max(v)

In [None]:
# Return the smallest value in the vector

min(v)

In [None]:
# Return the sum of the values in the vector

sum(v)

In [None]:
# Sort the values in a vector in ascending order

sort(v)

In [None]:
# Sort the values in a vector in descending order

sort(v, decreasing=TRUE)

In [None]:
# Count the frequency of occurrence for each element in a vector:

table(c(20,30,50,20,40,10,40,20))

### Functions for logical vectors

In [None]:
# Test each value in the vector v as to whether it is greater than 10 or not (returns TRUE or FALSE):

v <- c(25, 9, 16, 4)

v > 10

In [None]:
# any() returns TRUE if any of the logicals are TRUE:

any(v > 10)

In [None]:
# all() returns TRUE if all of the logicals are TRUE:

all(v > 10)

In [None]:
# which() returns the indices/locations of the TRUE values (interpreted as "Which index positions are TRUE?"):

which(v > 10)

<div class="alert alert-block alert-warning">
<b>Alert:</b> The index positions of a vector in R begin counting from 1. 
</div>

In [None]:
# sum() returns the number of logicals that are TRUE:

sum(v > 10)
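
Counting with sum() works because R treats TRUE as 1 and FALSE as 0 in arithmetic. As a small illustrative sketch, the same idea means mean() gives the proportion of values that meet a condition:

```r
v <- c(25, 9, 16, 4)

# TRUE counts as 1 and FALSE as 0:
as.integer(v > 10)    # 1 0 1 0

# Number of values greater than 10:
sum(v > 10)           # 2

# Proportion of values greater than 10:
mean(v > 10)          # 0.5
```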

### Vector Subsetting
Index positions within a vector begin counting from 1. You can get slices or subsets of a vector by placing the index positions of the desired items between square brackets:

In [None]:
# Create a vector named "items":

items <- c(5,10,15,20,25,30,35,40)
items

In [None]:
# Return the first element in the vector; the element at index position 1:

items[1]

In [None]:
# Return a vector containing the third element:

items[3]

In [None]:
# A negative value means to exclude the index position.
# Return a vector with everything except the third element:

items[-3]

In [None]:
# Return a vector with the first through second elements (inclusive):

items[1:2]

In [None]:
# You can use a vector to do the subsetting (particularly when the indices are non-consecutive).
# Return a vector with the first and fourth elements:
                                                         
items[c(1,4)]                                                     

In [None]:
# Return a vector with the first, third, and first element again:

items[c(1,3,1)]

In [None]:
# Return a vector without the first through second elements:

items[-1:-2]

### Vector Assignment

In [None]:
# Create a vector named "items":

items <- c(5,10,15,20,25,30,35,40)
items

In [None]:
# Assign 100 to the item at index position 2:

items[2] <- 100
items

In [None]:
# Assign 0 to every index position except 2:

items[-2] <- 0
items

### Vectorized Operations
A vectorized operation applies an operation to a vector element-wise: 

In [None]:
# Create a vector named "items":

items <- c(5,10,15,20,25,30,35,40)
items

In [None]:
# Divide each item in a vector by 5, and return a new vector containing the result

items/5

In [None]:
# Multiply two vectors together (element-wise).  Return a new vector with the results:

c(1,2,3) * c(2,4,6)
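
When the two vectors have different lengths, R reuses ("recycles") the shorter one until it matches the longer one. The sketch below uses made-up values to show this. If the longer length is not a multiple of the shorter length, R still computes a result but issues a warning.

```r
# The shorter vector (10, 20) is reused as (10, 20, 10, 20):
c(1, 2, 3, 4) + c(10, 20)    # 11 22 13 24

# A single number is a vector of length 1, so it is recycled for every element:
c(1, 2, 3, 4) * 2            # 2 4 6 8
```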

### ✅ Exercise 4

- Create a vector of numbers from 1-12. 
- Square the vector, returning a new vector containing the square of each number from the original vector.

---

### Using Logicals to Perform Compound Operations on Vectors
A compound operation tests two or more conditions at once and returns TRUE or FALSE for each value in a vector:

In [None]:
# & is the "AND" operation in R.
# Return TRUE at each index position where the items in both vectors are TRUE:

c(FALSE, TRUE, FALSE) & c(TRUE, TRUE, TRUE)

In [None]:
# | is the "OR" operation in R.
# Return TRUE at each index position where the item in either vector is TRUE:

c(FALSE, TRUE, FALSE) | c(TRUE, TRUE, TRUE)

In [None]:
# Create the vector "v":

v <- c(25, 9, 16, 4)

# Test whether a value in the vector v is either less than 5 or (|) greater than 20:
(v < 5) | (v > 20)

In [None]:
# Place the logical operation from above between the square brackets of "v" to return the actual values.
# In other words, v will return any value whose index position is TRUE:

v[(v < 5) | (v > 20)]

<div class="alert alert-block alert-info">
<b>Tip:</b> Logical (conditional) operations become very useful when you explore and analyze your data.
</div>

### ✅ Exercise 5

- Create a vector named "nums" containing numbers from 1 through 10. 
- Within the square brackets of "nums", perform a compound operation so that only values that are either less than 4 or greater than 8 are returned.

---

### Naming Vectors
If desired, you can use key/value pairs to assign a name to each element within a vector. You can then use the element's name to access its value:

In [None]:
# Create a vector giving names to each element:

groceries <- c("milk"=3.56, "bread"=4.29, "rice"=5.98)
groceries

In [None]:
# Place the element's name within the square brackets of the vector to return its value:

groceries["rice"]

In [None]:
# Use a vector of names to return more than one element's value:

groceries[c("rice", "bread")]
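
Names can also be read or assigned after a vector has been created, using the names() function. The following sketch uses a hypothetical vector called `prices` with the same values as the groceries example:

```r
# Create an unnamed vector, then attach names to it:
prices <- c(3.56, 4.29, 5.98)
names(prices) <- c("milk", "bread", "rice")

# Return the names of the elements:
names(prices)

# Access a value by its name, as before:
prices["bread"]
```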

---

## Lists
A list is a collection of items, but unlike vectors, lists can contain any mixture of elements of any type. Lists can even contain other lists.

You can use the **list()** function to create a list:

In [None]:
L <- list("a", 1, 2.5, TRUE)
L

Elements within a list can be accessed by placing the element's index position within square brackets. Single square brackets return a sub-list containing the selected elements:

In [None]:
L[4]

In [None]:
L[2:3]
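
To pull out the element itself rather than a list that contains it, R provides double square brackets. This is a minimal sketch of the difference, reusing the list `L` from above:

```r
L <- list("a", 1, 2.5, TRUE)

# Single brackets return a list of length 1:
class(L[4])        # "list"

# Double brackets return the element itself:
L[[4]]             # TRUE
class(L[[4]])      # "logical"

# Because [[ ]] returns the actual values, they can be used in a calculation:
L[[2]] + L[[3]]    # 3.5
```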

#### Named lists
Lists can also use names by creating key/value pairs to assign a name to each element within the list. You can then access an element within the list by using the element's name/key:

In [None]:
a_list <- list("San Francisco"= 1, "Santa Clara" = 2, "San Jose" = 3)
a_list

<div class="alert alert-block alert-success">
<b>Note:</b> The '$' preceding the characters in the output above indicates that the characters are a key from the named list as opposed to a value.
</div>

In [None]:
a_list["San Jose"]
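
Single brackets with a name return a list of length 1. As a hedged sketch, the value itself can be retrieved with double brackets or with the `$` operator; names that contain spaces need to be wrapped in backticks when used with `$`.

```r
a_list <- list("San Francisco" = 1, "Santa Clara" = 2, "San Jose" = 3)

# Double brackets with a name return the value itself:
a_list[["San Jose"]]    # 3

# $ also returns the value; names containing spaces need backticks:
a_list$`San Jose`       # 3

# Return all of the names (keys) in the list:
names(a_list)
```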

---

## Matrix
A matrix is a two-dimensional collection of data elements. Like vectors, matrices can only contain a single data type. You can create a matrix by using the **matrix( )** function. 

To create a matrix, you first specify a vector and then you can specify the desired dimensions (rows x cols).

In [None]:
matrix(1:12, nrow=3, ncol=4)

Alternatively, after specifying a vector, you can indicate the desired number of rows and the number of columns will be calculated.

In [None]:
matrix(c(1:12), nrow = 3)

Alternatively, after specifying a vector, you can indicate the desired number of columns and the number of rows will be calculated.

In [None]:
matrix(c(1:12), ncol = 2)
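
In all of the examples above, matrix() fills the matrix column by column, which is its default. The sketch below, using hypothetical variable names `m_col` and `m_row`, shows how the `byrow` argument changes the fill order:

```r
# Default behavior: values fill down the first column, then the second, and so on
m_col <- matrix(1:6, nrow = 2)
m_col

# byrow = TRUE: values fill across the first row, then the second
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row
```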

#### rbind( )
The rbind() function is used to bind vectors into a matrix (row-wise):

In [None]:
rbind(c(1,2,3), c(4,5,6))

#### cbind( )
The cbind() function is used to bind vectors into a matrix (column-wise):

In [None]:
cbind(c(1,2,3), c(4,5,6))

#### dim( )
The dim() function will return the dimensions (rows, cols) of a matrix:

In [None]:
nums <- matrix(1:12, 3, 4)

dim(nums)

#### nrow( )
The nrow() function will return the number of rows in a matrix:

In [None]:
nums <- matrix(1:12, 3, 4)

nrow(nums)

#### ncol( )
The ncol() function will return the number of columns in a matrix:

In [None]:
nums <- matrix(1:12, 3, 4)

ncol(nums)

### Matrix Subsetting
You can select a subset of a matrix by placing the desired rows and columns, separated by a comma, within square brackets.

In [None]:
# Create a matrix named "nums":

nums <- matrix(1:12, 3, 4)
nums

In [None]:
# Return the element located in the second row of the third column:

nums[2,3]

In [None]:
# Return rows 1 and 2 of columns 3 and 4:

nums[1:2, 3:4]

In [None]:
# An alternative way to get the same results as above...
# Return all rows except the third, and all columns except the first and second:

nums[-3, -1:-2]

In [None]:
# Return just row two (must be followed by a comma) and all of the columns:

nums[2, ]

In [None]:
# Return just column four (must be preceded by a comma) and all of the rows:

nums[, 4]
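
Selecting a single row or column returns a plain vector, not a matrix. If you need the result to stay a matrix, one option is the `drop = FALSE` argument, sketched here with the hypothetical names `row_vec` and `row_mat`:

```r
nums <- matrix(1:12, 3, 4)

# Selecting one row returns a vector, which has no dimensions:
row_vec <- nums[2, ]
dim(row_vec)    # NULL

# drop = FALSE keeps the result as a matrix with 1 row and 4 columns:
row_mat <- nums[2, , drop = FALSE]
dim(row_mat)    # 1 4
```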

#### colnames()
The colnames() function can be used to assign names to the columns:

In [None]:
colnames(nums) <- c("col1", "col2", "col3", "col4")
nums

#### rownames()
The rownames() function can be used to assign names to the rows:

In [None]:
rownames(nums) <- c("row1", "row2", "row3")
nums

### ✅ Exercise 6

Create a matrix by passing it a vector with numbers from 21 to 40. Shape the matrix to have 4 rows and 5 columns. Assign the names of the weekdays (i.e., Monday through Friday) to the five columns. Name the four rows "California", "Texas", "Florida", "New York". Name the matrix "states". Display the matrix.

---

### Subsetting Named Matrices

In [None]:
# Create a matrix with named rows and columns:

nums <- matrix(1:12, 3, 4)
colnames(nums) <- c("col1", "col2", "col3", "col4")
rownames(nums) <- c("row1", "row2", "row3")

nums

#### Use the row/col names to subset the matrix:

In [None]:
# Return "row1" and every column:

print(nums["row1",])

In [None]:
# Return all rows and just "col2" and "col4":

nums[, c("col2","col4")]

In [None]:
# Return "row2" and "row3" of "col1" and "col4":

nums[c("row2", "row3"), c("col1","col4")]

---

## Data Frames
Data frames are the most common way of storing and analyzing data in R. The columns (which are vectors) can be of different types, but they must be the same length.

You can think of a data frame as a collection of vectors that all must be the same length.

In [None]:
df <- data.frame(name=c("Carl","Diane","Sally","Ben","Kimmy"),
                 age=c(42,40,17,14,12),
                 sex=c("Male","Female","Female","Male","Female"))

df
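
To confirm that each column is a vector with its own type, you can inspect the data frame with str() or check the class of every column. In this sketch, `stringsAsFactors = FALSE` is set explicitly as an assumption so that text columns stay character vectors regardless of R version (it has been the default since R 4.0.0).

```r
df <- data.frame(name = c("Carl", "Diane", "Sally", "Ben", "Kimmy"),
                 age = c(42, 40, 17, 14, 12),
                 sex = c("Male", "Female", "Female", "Male", "Female"),
                 stringsAsFactors = FALSE)

# Compact summary: number of rows and columns, plus each column's type
str(df)

# Return the class of each column as a named vector:
sapply(df, class)
```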

#### dim()
The dim() function is used to get the dimensions (rows x cols) of a data frame:

In [None]:
dim(df)

#### nrow()
The nrow() function is used to get the number of rows in a data frame:

In [None]:
nrow(df)

#### ncol()
The ncol() function is used to get the number of columns in a data frame:

In [None]:
ncol(df)

#### colnames()
The colnames() function is used to get the column names of a data frame:

In [None]:
colnames(df)

### Data Frame Subsetting
You can select a subset of a data frame by placing the desired rows and columns, separated by a comma, within square brackets. You can use either an index position or a name. 

In [None]:
# Create a data frame:

df <- data.frame(name=c("Carl","Diane","Sally","Ben","Kimmy"),
                 age=c(42,40,17,14,12),
                 sex=c("Male","Female","Female","Male","Female"))

df

#### Subsets of rows

In [None]:
# Return the first two rows with all of the columns:

df[1:2, ]

In [None]:
# Return the first and third rows (with all of the columns); a comma follows the vector, so it indicates rows:

df[c(1,3), ]

#### Subsets of columns

In [None]:
# Return the first and third columns (with all of the rows); no comma follows the vector so it indicates columns:

df[c(1, 3)] 

<div class="alert alert-block alert-warning">
<b>Alert:</b> When subsetting a data frame, a vector within square brackets that is not followed by a comma selects columns. If the vector is followed by a comma, it selects rows.</div>

In [None]:
# Return the column named "name" with all of the rows (returns a data frame):

df["name"]

In [None]:
# Return the column named "name" with all of the rows (using "$" returns the vector):

df$name

In [None]:
# Using a vector for the names of the columns returns those columns in the selected order (with all of the rows):

df[c("sex", "age", "name")] 

#### Row and Column subsetting

In [None]:
# Return the first two rows of the first two columns:

df[1:2, 1:2]

In [None]:
# Return the first two rows of the columns named "name" and "age":

df[1:2, c("name", "age")]

In [None]:
# Another way to return the same values as above.
# Return all rows except the third through the fifth, and all columns except the third one:

df[-3:-5, -3]

#### Logical subsetting
A logical expression is a question and can only return TRUE or FALSE.

In [None]:
# Here, you are asking each value in the column "age" if it is less than 18:

df$age < 18

In [None]:
# By placing the logical expression from above within the square brackets of the data frame, 
# only the TRUE rows will be returned.


# Return the rows where the column "age" is less than 18, and return the columns "age" and "name": 

df[df$age < 18, c("age", "name")]

<div class="alert alert-block alert-success">
<b>Note:</b> A logical expression can only return TRUE or FALSE. By placing a logical expression within the square brackets following a data frame, only the rows that are TRUE will be returned. Using logical expressions is primarily how you will explore and wrangle your data.
</div>

#### Compound operation

In [None]:
# Here, you are asking the data frame which rows have an "age" less than 18 AND a "sex" equivalent to "Female":
(df$age<18) & (df$sex=="Female")

In [None]:
# By placing the logical expression from above between the square brackets of the data frame,
# you will return the rows where "age" is less than 18 AND "sex" is "Female", and all of the columns: 

df <- data.frame(name=c("Carl","Diane","Sally","Ben","Kimmy"),
                 age=c(42,40,17,14,12),
                 sex=c("Male","Female","Female","Male","Female"))

df[(df$age<18) & (df$sex=="Female"), ]
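
Base R also offers the subset() function as an alternative to square brackets. The sketch below is one possible way to write the same kind of query; the variable name `young_females` is hypothetical. Inside subset(), column names can be used directly without the `df$` prefix.

```r
df <- data.frame(name = c("Carl", "Diane", "Sally", "Ben", "Kimmy"),
                 age = c(42, 40, 17, 14, 12),
                 sex = c("Male", "Female", "Female", "Male", "Female"))

# Keep the rows where age is less than 18 AND sex is "Female",
# and return only the columns "name" and "age":
young_females <- subset(df, age < 18 & sex == "Female", select = c(name, age))
young_females
```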

### ✅ Exercise 7

From the data frame `df` above, select only the rows where "age" is over 30 and "sex" is "Male", and return only the columns "age" and "sex".

---

# 🌟 Ready for the next one?
<br>

- [**Lesson 3: Importing Data**](Lesson_3_Importing_Data.ipynb)

- [**Lesson 4: Conditionals and Loops**](Lesson_4_Conditionals_and_Loops.ipynb)

- [**Lesson 5: Functions**](Lesson_5_Functions.ipynb)

---

# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.