
## Topics

Below is a summary of the topics covered in this lecture:

- Revisit For Loops
- Data Exploration
	- Summary and Table
	- Quantile, Range, and Mean
	- Subset
	- Dimension
	- Merge
	- Sort

## Revisit Loops

In this section, we are going to specifically talk about for loops.

For loops are all about **repetition**. A loop is a way to repeat a sequence of instructions under certain conditions.

The general syntax is as follows:

```
for (val in sequence) {
  # For each item in the sequence,
  # the statement is evaluated.
  #
  statement
}
```
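
As a minimal sketch of the syntax above, the loop below adds up the items of a small vector. The names `total` and `val` and the numbers are made up for illustration.

```r
# Running total, starts at zero
total <- 0

# 'val' takes the values 2, 4, and 6 in turn
for (val in c(2, 4, 6)) {
  # The statement is evaluated once per item
  total <- total + val
}

print(total)  # 12
```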

To write a for loop, you need to follow the **preallocate-sequence-body** procedure. Here is an example.

If we want to create a matrix with 4 randomly generated columns, we would do the following in a world **without** for loops.



In [None]:
# Preallocate memory to store the values
mat <- matrix(NA, nrow = 10, ncol = 4)

# Generate random numbers and then repeat manually
mat[, 1] <- runif(10)
mat[, 2] <- runif(10)
mat[, 3] <- runif(10)
mat[, 4] <- runif(10)

# Print out the results
print(mat)


But **code redundancy** is a bad thing in programming. We do not want to copy and paste. Imagine you have one thousand columns: are you going to copy and paste one thousand times? We need for loops.



In [None]:
# Preallocate
mat <- matrix(NA, nrow = 10, ncol = 4)

# Sequence
for (i.col in 1:ncol(mat)) {

	# Body

	# Uncomment the following line to see how i.col is changing
	# cat('The current column index is', i.col, '\n')

	mat[, i.col] <- runif(nrow(mat))
}

print(mat)
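
A note on building the sequence: `1:n` and `seq_len(n)` give the same result for a positive `n`, but they differ when `n` is 0, which can happen with an empty matrix or data frame. The sketch below counts how many times each loop body runs. The counter names are made up for illustration.

```r
n <- 0

# 1:0 counts down and yields c(1, 0), so the body runs twice
count.colon <- 0
for (i in 1:n) {
  count.colon <- count.colon + 1
}

# seq_len(0) is an empty sequence, so the body never runs
count.seq <- 0
for (i in seq_len(n)) {
  count.seq <- count.seq + 1
}

print(c(count.colon, count.seq))
```

For this reason `seq_len()` is usually the safer choice when the length of the sequence is computed from data.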


Let's look at another example. Below is a piece of code with a for loop.



In [None]:
# Preallocate
df <- data.frame(
	name = c('Tom', 'Bob', 'Alice'),
	age = c(24, 19, 21),
	stringsAsFactors = FALSE)       # Don't convert strings to factors

num.rows <- nrow(df)

# Sequence
for (i in seq_len(num.rows)) {

	# Body
	if (df$age[i] >= 20) {
		print(df$name[i])
	}
}

# Q: Can you flatten the for loop and generate the same results?
# Q: Can you translate the for loop to a while loop?


## Data Exploration

At this point, we have learned the basic data types and functions in R. Let's put them into practice.

R comes with built-in data sets for learning. First, we are going to examine the data set for plant growth.



In [None]:
# Load the data set
data("PlantGrowth")

# What did we just load?
class(PlantGrowth)

# I still do not know what we just loaded ...
str(PlantGrowth)

# But there are ellipses. How big is the data set?
dim(PlantGrowth)

# OK. I want to know some statistics about the data set.
summary(PlantGrowth$weight)

# Well. But I would like summary information broken up by groups
tapply(PlantGrowth$weight, PlantGrowth$group, summary)
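
The other exploration functions listed in the topics work the same way on this data set. Below is a short sketch using `table`, `range`, `mean`, `quantile`, and `subset`. The variable names on the left-hand side are only for illustration.

```r
data("PlantGrowth")

# How many observations are there in each group?
group.counts <- table(PlantGrowth$group)
print(group.counts)

# Smallest and largest weight, and the average weight
weight.range <- range(PlantGrowth$weight)
weight.mean <- mean(PlantGrowth$weight)

# The three quartiles of the weights
weight.quartiles <- quantile(PlantGrowth$weight, probs = c(0.25, 0.5, 0.75))
print(weight.quartiles)

# Keep only the rows that belong to the control group
ctrl.only <- subset(PlantGrowth, group == 'ctrl')
dim(ctrl.only)
```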


You should now have a general idea of what we have just loaded. Of course, you can also visualize it. Introducing the powerful visualization package from Dr. Hadley Wickham, `ggplot2`.



In [None]:
# Install the package if you have not done so
# install.packages('ggplot2')
#
library(ggplot2)

# Generate a boxplot for all groups
ggplot(data = PlantGrowth) +
	geom_boxplot(mapping = aes(x = group, y = weight)) +
	labs(x = 'Group', y = 'Weight')


`ggplot2` is a powerful visualization tool that we are going to use extensively throughout the course. It provides a whole system for representing graphics in R. At this point, please remember the template for using `ggplot2` functions.

```
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
```
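
To make the template concrete, here is one possible way to fill in every slot using the plant growth data. The particular choices (`geom_point`, jittered positions, flipped coordinates, one panel per group) are only for illustration and assume that `ggplot2` is installed.

```r
library(ggplot2)
data("PlantGrowth")

p <- ggplot(data = PlantGrowth) +
  geom_point(                                # <GEOM_FUNCTION>
    mapping = aes(x = group, y = weight),    # <MAPPINGS>
    stat = 'identity',                       # <STAT>: plot the values as they are
    position = 'jitter'                      # <POSITION>: add small random offsets
  ) +
  coord_flip() +                             # <COORDINATE_FUNCTION>: swap the axes
  facet_wrap(~ group)                        # <FACET_FUNCTION>: one panel per group

print(p)
```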

It turns out the measurements actually come from three plants with different treatments, taken at different times. The biologist kindly provided us with the time records for each observation. Let's append this information to our data frame and visualize the result.



In [None]:
# Load an R data file. Please notice the added variable
# in your global environment called 'time.records'.
#
load('assets/plant-growth-time.RData')

# Q: Can you use what we have learned to examine
# what we have just loaded?

# Add these records to our data frame
PlantGrowth <- cbind(PlantGrowth, time.records)

# Visualization
ggplot(data = PlantGrowth) +
	geom_line(mapping = aes(x = time.records, y = weight, color = group)) +
	labs(x = 'Time', y = 'Weight')


Does it look strange? Why do the plants lose weight? We contacted the biologist, who told us that **the measurements are not sorted by time**! Therefore, we need to sort the measurements.



In [None]:
# So the data are not sorted. We need to sort the data per group.
sorted.measurements <- tapply(PlantGrowth$weight, PlantGrowth$group, sort)

# The returned result is a list; convert it to a vector
sorted.measurements <- unlist(sorted.measurements, use.names = FALSE)

# Now the measurements are sorted within each group. We can assign these
# values back because the rows of PlantGrowth are already ordered by group.
PlantGrowth$weight <- sorted.measurements

# Generate the same plot
ggplot(data = PlantGrowth) +
	geom_line(mapping = aes(x = time.records, y = weight, color = group)) +
	labs(x = 'Time', y = 'Weight') +
	guides(color = guide_legend(title = 'Type'))
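
A different way to handle ordering, not the approach used in the story above, is to join tables on a shared key with `merge` and then reorder whole rows with `order`, so that every column stays aligned. The sketch below uses a small toy data set. The `id` column and all the values are made up for illustration.

```r
# Hypothetical toy data: the ids, weights, and times are made up
measurements <- data.frame(
  id = c(1, 2, 3, 4),
  group = c('ctrl', 'ctrl', 'trt1', 'trt1'),
  weight = c(4.8, 4.2, 5.9, 5.1)
)

times <- data.frame(
  id = c(4, 3, 2, 1),      # Deliberately in a different row order
  time = c(1, 2, 1, 2)
)

# merge() matches rows on the shared 'id' column, not on row position
merged <- merge(measurements, times, by = 'id')

# order() returns the row indices that sort by group first, then by time
sorted <- merged[order(merged$group, merged$time), ]

print(sorted)
```

Unlike `cbind`, which glues columns together by position, `merge` still pairs the right rows when the two tables are in different orders.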


*The story is just a mockup*.

You can get a list of all the built-in data sets in R by running `data()`. Try the above code for a different data set, for example, **ChickWeight**.

