# Importing Data into R

In [1]:
# Use getwd() to get the working directory, and use setwd() to set the working directory.

In [2]:
# read.csv() is a wrapper around read.table(), which in turn uses scan() to import the data.

HC <- scan('HonorCode.txt', what = '')

# Look at the first 20 entries.
head(HC, 20)

In [3]:
str(HC)

 chr [1:443] "students" "should" "be" "aware" "that" "academic" ...


In [4]:
length(HC)
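
Since HonorCode.txt may not be at hand, here is a minimal sketch of what scan() does with what = '' when given an inline string. The sample sentence and the names sample_text and words are made up for illustration.

```r
# Hypothetical sample text standing in for the contents of a file.
sample_text <- "students should be aware that academic honesty matters"

# what = '' tells scan() to read character strings; by default it splits
# the input on whitespace. quiet = TRUE suppresses the "Read N items" message.
words <- scan(text = sample_text, what = '', quiet = TRUE)

length(words)   # number of whitespace-separated tokens
head(words, 3)
```

Each whitespace-separated token becomes one element of the character vector, which is why HC above is a long vector of single words.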

## Cleaning Data
Things to look out for when you’re importing data into R.

• Is the first row a header? What about the first column?

• R interprets words separated by a space as two separate values, which messes up the number of elements per line in your data set. (Use _ or . between words.)

• Symbols such as ?, %, &, *, etc. can make R do funny things.

• Headers, footers, side comments, and notes will mess up the structure.

• How are missing values indicated? R expects NA, but files often use codes such as 999 or N.A.
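
Several of the points above can be handled at import time. The following is a minimal sketch, assuming a small made-up CSV file in which the first row is a header and missing ages are coded as 999 or N.A.; the file contents and the names tmp and dat are hypothetical.

```r
# Write a small hypothetical data set to a temporary file: the first row
# is a header, and missing ages are coded as 999 or "N.A.".
tmp <- tempfile(fileext = ".csv")
writeLines(c("first_name,age",
             "Ann,23",
             "Bob,999",
             "Cy,N.A."), tmp)

# header = TRUE treats the first row as column names;
# na.strings lists every code that should be read as NA.
dat <- read.csv(tmp, header = TRUE, na.strings = c("NA", "999", "N.A."))

dat
is.na(dat$age)   # FALSE TRUE TRUE
```

Treating 999 as missing is only safe when 999 cannot be a real value in that column.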

## Exporting Data

Use write.table() or write.csv() to export a matrix or data frame to a file.
By default both write the row names (row.names = TRUE) and a header line of column names; pass row.names = FALSE to leave the row names out.
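
As a rough illustration, the sketch below writes a small made-up data frame to a temporary file without row names and reads it back. The names df, out_file, and back are hypothetical.

```r
# Hypothetical data frame used only for illustration.
df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

out_file <- tempfile(fileext = ".csv")

# row.names = FALSE drops the 1, 2, 3 row labels from the file.
write.csv(df, out_file, row.names = FALSE)

readLines(out_file)          # header line followed by three data lines
back <- read.csv(out_file)   # read the file back in
all.equal(df, back)
```

Dropping the row names avoids an extra unnamed first column when the file is read back in.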

# Functions in R

    function_name <- function(arg1, arg2, ...){
        statement
        return(object)
    }

In [5]:
square_it <- function(x){
    out <- x**2
    return(out)
}

In [6]:
square_it(-16)

In [7]:
# Return multiple values from a single function by wrapping them in one object, such as a list.
power_it <- function(x){
    out1 <- x**2
    out2 <- x**3
    out3 <- x**4
    return(list(out1, out2, out3))
}

In [8]:
power_it(4)
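
One possible refinement is to name the elements of the returned list so that each result can be picked out by name. The function power_it_named() below is a hypothetical variant, not part of the original notebook.

```r
# Hypothetical variant of power_it() that names each element of the list.
power_it_named <- function(x){
    return(list(square = x^2, cube = x^3, fourth = x^4))
}

res <- power_it_named(4)
res$cube        # access one result by name
names(res)
```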

Write a function findwords() that compiles a list of the locations of each occurrence of each word in the text.

In [9]:
findwords <- function(text_vec){
    words <- split(1:length(text_vec), text_vec)
    return(words)
}

In [10]:
# Recall the split function
split(1:20, rep(c('dog', 'cat'), 10))

In [11]:
# Note that findwords() returns a list; here we look at its first three elements.
findwords(HC)[1:3]

In [12]:
HC[c(6, 59, 69, 296, 323, 375)] # academic
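
To see the same idea without HonorCode.txt, here is a small sketch on a made-up text vector. The function body is an equivalent restatement of findwords() so that the example runs on its own; toy and toy_words are hypothetical names.

```r
# Equivalent definition of findwords(), repeated so the example is self-contained.
findwords <- function(text_vec){
    split(seq_along(text_vec), text_vec)
}

# Hypothetical toy text.
toy <- c("the", "cat", "saw", "the", "dog", "and", "the", "cat")
toy_words <- findwords(toy)

toy_words$the   # positions 1, 4, 7
toy_words$cat   # positions 2, 8
```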

## List in Alphabetical Order

In [13]:
alphabetized_list <- function(wordlist){
    nms <- names(wordlist) # The names are the words
    sorted <- sort(nms) # The words, but now in ABC order
    return(wordlist[sorted]) # Return the sorted version
}

In [14]:
wl <- findwords(HC)
alphabetized_list(wl)

# Control Statements: Loops, While, If Else

## for Loops
    for (i in x){
        do something...
    }

In [15]:
x <- c(5, 12, -3)
for (i in x){
    print(i**2)
}

[1] 25
[1] 144
[1] 9


In [16]:
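# Find the maximum of a random sample by keeping a running maximum.
my_max <- rnorm(1000, 0, 1)
a <- -10000
for (i in 1:1000){
    if(my_max[i] > a){
        a <- my_max[i] # Store the value itself, not the index i
    }
}
print(a) # Equals max(my_max); the value changes from run to run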


## while Loops
    while (condition){
        do something...
    }

In [17]:
i <- 1
while (i <= 10){
    i <- i + 4
}
i

In [18]:
# A while loop whose body is a single statement can be written on one line without braces; the result is the same.
i <- 1
while (i <= 10) i <- i + 4
i

## if, else Statements
    if (condition){
        do something...
    }else{
        do something...
    }

In [19]:
# seq(x) generates the integers from 1 to x.

for (i in seq(4)){
    if (i %% 2 == 0){
        print(log(i))
    }else{
        print('Odd')
    }
}

[1] "Odd"
[1] 0.6931472
[1] "Odd"
[1] 1.386294


# Vectorized Operations

Where we can, we would like to avoid iterations and use vectorized operators.

In [20]:
# Example: element-wise addition written as an explicit loop

u <- c(1, 2, 3)
v <- c(10, -20, 30)
c <- vector(mode = 'numeric', length = length(u))
for (i in 1:length(u)){
    c[i] <- u[i] + v[i]
}
c
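
The same element-wise sum can be written without a loop. A minimal sketch, using the hypothetical name w for the result:

```r
u <- c(1, 2, 3)
v <- c(10, -20, 30)

# Arithmetic operators in R work element by element on whole vectors,
# so no loop or preallocation is needed.
w <- u + v
w   # 11 -18 33
```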

The function ifelse() vectorizes conditional statements. It takes three arguments: ifelse(test, yes, no).

• test is a logical vector.

• yes gives the return values where test is TRUE.

• no gives the return values where test is FALSE.

In [21]:
# Same example as the if/else loop above, vectorized with ifelse()
ifelse(seq(4) %% 2 == 0, log(seq(4)), 'Odd')
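
One thing to watch for: because 'Odd' is a character string, the result of the call above is a character vector, so the logarithms come back as text. The sketch below shows one way to keep the result numeric, assuming NA is an acceptable placeholder for the odd entries; the names mixed and numeric_only are hypothetical.

```r
x <- seq(4)

# Mixing numbers and strings: the whole result becomes character.
mixed <- ifelse(x %% 2 == 0, log(x), 'Odd')
class(mixed)          # "character"

# Using NA for the odd entries keeps the result numeric.
numeric_only <- ifelse(x %% 2 == 0, log(x), NA)
class(numeric_only)   # "numeric"
numeric_only
```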

## The apply() Commands

The commands apply(), sapply(), lapply(), tapply() replace loops that iterate over an object's entries, computing the same function on each.

In [22]:
mat <- matrix(1:12, ncol = 6)
mat

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    5    7    9   11
[2,]    2    4    6    8   10   12


In [23]:
# colSums() returns the sum of each column respectively.
colSums(mat)

In [24]:
# Another way to do this
apply(mat, 2, sum)

In [25]:
# Calculate the row sums
apply(mat, 1, sum)

• lapply(), or list apply, works like apply(), but for applying the same function to each element of a list. It returns a list.

• sapply(), or simplified list apply, works like lapply(), but returns a vector if possible.

In [26]:
vec1 <- c(1.1, 3.4, 2.4, 3.5)
vec2 <- c(1.1, 3.4, 2.4, 10.8)
not_robust <- list(vec1, vec2)
lapply(not_robust, mean)

In [27]:
lapply(not_robust, median)

In [28]:
sapply(not_robust, median)

In [29]:
# Combining unlist() with lapply() gives the same result as sapply().
unlist(lapply(not_robust,median))
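
A related member of the family is vapply(), which works like sapply() but requires the expected type and length of each result to be declared. A minimal sketch on the same two vectors, with medians as a hypothetical name:

```r
vec1 <- c(1.1, 3.4, 2.4, 3.5)
vec2 <- c(1.1, 3.4, 2.4, 10.8)
not_robust <- list(vec1, vec2)

# FUN.VALUE = numeric(1) declares that each call must return one number;
# vapply() raises an error otherwise instead of silently changing shape.
medians <- vapply(not_robust, median, FUN.VALUE = numeric(1))
medians   # 2.9 2.9
```

Declaring the result type can make code inside functions more predictable than sapply(), whose return type depends on the input.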

We use sapply() to find the length of each element in our word list. Each element holds the positions of one word, so its length is that word's frequency.

In [30]:
freq_list <- function(wordlist){
    freqs <- sapply(wordlist, length) # The frequencies
    
    # The order() function returns a vector of indices that will permute its
    # input argument into sorted order. 
    return(wordlist[order(freqs)])
}

In [31]:
head(freq_list(wl), 3)

In [32]:
tail(freq_list(wl), 3)
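
If only the counts are needed and not the positions, table() is a common alternative. The sketch below uses a made-up text vector; toy and counts are hypothetical names.

```r
# Hypothetical toy text.
toy <- c("the", "cat", "saw", "the", "dog", "and", "the", "cat")

# table() counts occurrences of each distinct word;
# sort(..., decreasing = TRUE) puts the most frequent words first.
counts <- sort(table(toy), decreasing = TRUE)
counts
names(counts)[1]   # the most frequent word
```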

# Functions on Factors
• Factors have their own member of the apply() family: tapply().

• Use as follows: tapply(vector, factor, function).

• The above splits the vector into groups according to the levels of the factor and then applies the function to each group.

In [33]:
# Calculate the average age in each group.
data <- rep(c("Control","Treatment"),c(3,4))
group <- factor(data)
group

In [34]:
ages <- c(20, 30, 40, 35, 35, 35, 35)
tapply(ages, group, mean)

# Run-Time
## Vectorized vs. Loops

A very useful function in R is proc.time(). Its third element, the elapsed (wall-clock) time in seconds, can be used to estimate the run-time of a program. A simple example follows:

In [35]:
# Run proc.time()
proc.time()

   user  system elapsed 
   1.18    0.37    1.93 

In [36]:
# Store third element as start.time
start.time <- proc.time()[3]

In [37]:
# Create a large matrix of rolls from a six-sided die.
# The goal is to find the mean of each row.
# sample() draws random samples; here it draws 500000 values from 1:6,
# and replace = TRUE means sampling with replacement.

dice_mat <- matrix(sample(1:6, 500000, replace = TRUE), ncol = 10)
dim(dice_mat)

In [38]:
head(dice_mat,4)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    2    4    5    5    5    5    3    6    1     4
[2,]    6    2    4    5    4    3    1    5    3     3
[3,]    4    4    4    1    1    4    5    1    1     4
[4,]    2    5    3    4    3    2    5    1    5     1


In [39]:
# Approach 1: for loop that fills a preallocated result vector.
start.time.1 <- proc.time()[3]

my_means <- rep(NA, nrow(dice_mat))
for (i in 1:nrow(dice_mat)) {
    my_means[i] <- mean(dice_mat[i,])
}
head(my_means)

In [40]:
# Print run-time
end.time.1 <- proc.time()[3]-start.time.1
end.time.1

In [41]:
# Approach 2: for loop that grows the result vector from NULL.
start.time.2 <- proc.time()[3]
my_means <- NULL
for (i in 1:nrow(dice_mat)) {
    my_means[i] <- mean(dice_mat[i,])
}
head(my_means)

In [42]:
# Print run-time
end.time.2 <- proc.time()[3]-start.time.2
end.time.2

In [43]:
# Approach 3: apply() over the rows instead of an explicit loop.
start.time.3 <- proc.time()[3]
my_means <- apply(dice_mat, 1, mean)
head(my_means)

In [44]:
# Print run-time
end.time.3 <- proc.time()[3]-start.time.3
end.time.3

In [45]:
c(end.time.1,end.time.2,end.time.3)
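
As a complement to the proc.time() bookkeeping above, system.time() wraps the start/stop pattern in a single call. The sketch below also times rowMeans(), a built-in vectorized alternative to apply(dice_mat, 1, mean). The seed and the names t_apply, t_rowmeans, means_apply, and means_row are assumptions made for illustration, and no particular timings are claimed.

```r
set.seed(1)  # arbitrary seed so the simulated rolls are reproducible
dice_mat <- matrix(sample(1:6, 500000, replace = TRUE), ncol = 10)

# system.time() runs an expression and reports user, system, and elapsed time.
t_apply <- system.time(means_apply <- apply(dice_mat, 1, mean))
t_rowmeans <- system.time(means_row <- rowMeans(dice_mat))

# Both approaches should give the same row means.
all.equal(means_apply, means_row)

# Elapsed times in seconds; the exact values depend on the machine.
c(apply = t_apply[["elapsed"]], rowMeans = t_rowmeans[["elapsed"]])
```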