# Introduction to the R language

Source: An Introduction to Statistical Learning with Applications in R

In this lab, we will introduce some simple R commands. The best way to
learn a new language is to try out the commands. R can be downloaded from
http://cran.r-project.org/

## Functions

R uses functions to perform operations. To run a function called ``funcname``, we type ``funcname(input1, input2)``, where the inputs (or arguments) ``input1`` and ``input2`` tell R how to run the function. User-defined functions are created with the keyword ``function``:

In [None]:
add <- function(a, b){
    total = a + b   # a variable named `c` would mask the built-in c() function
    return(total)
}

add(2, 3)
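
As a minimal sketch of how arguments tell R how to run a function, the hypothetical helper ``power()`` below (not part of the original lab) gives its second argument a default value and is called with positional and with named arguments.

```r
# Hypothetical helper for illustration: `exponent` defaults to 2.
power <- function(base, exponent = 2) {
    base ^ exponent   # the value of the last expression is returned
}

power(3)                       # default exponent: 9
power(3, 3)                    # positional arguments: 27
power(exponent = 3, base = 2)  # named arguments, in any order: 8
```

Named arguments make a call easier to read once a function takes more than two or three inputs.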

## Data structures

### Vectors

In [None]:
x <- c(1, 3, 2, 5)
x

At an interactive R console, each command follows a ``>`` prompt. The ``>`` is not part of the command; rather, it is printed by R to
indicate that it is ready for another command to be entered. We can also
save things using ``=`` rather than ``<-``:

In [None]:
x = c(1, 6, 2)
x
y = c(1, 4, 3)

Hitting the up arrow multiple times will display the previous commands,
which can then be edited. This is useful since one often wishes to repeat
a similar command. In addition, typing ``?funcname`` will always cause R to
open a new help file window with additional information about the function
``funcname``.

We can tell R to add two sets of numbers together. It will then add the
first number from ``x`` to the first number from ``y``, and so on. However, ``x`` and
``y`` should be the same length. We can check their length using the ``length()``
function.

In [None]:
length(x)
length(y)
x + y
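
To see why the two vectors should have the same length, here is a small sketch. The names ``v1``, ``v2`` and ``short`` are ours, chosen so that ``x`` and ``y`` are left untouched. When the lengths differ, R recycles the shorter vector and, if the longer length is not a multiple of the shorter one, emits a warning.

```r
v1 <- c(1, 6, 2)
v2 <- c(1, 4, 3)
v1 + v2                # element-wise sum: 2 10 5

# Assumption for illustration: a vector of length 2 added to one of length 3.
short <- c(10, 20)
# `short` is recycled as 10, 20, 10; the warning is silenced here on purpose.
recycled <- suppressWarnings(v1 + short)
recycled               # 11 26 12
```

Recycling is convenient for adding a single number to a whole vector, but with mismatched lengths it is usually a sign of a mistake.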

The ``ls()`` function allows us to look at a list of all of the objects, such
as data and functions, that we have saved so far. The ``rm()`` function can be used to delete any that we don’t want.

Vectors can contain strings:

In [None]:
s = c('a', 'b', 'c', 'a', 'a')

length(s)
unique(s)

In [None]:
ls()

In [None]:
rm(x, y)

It’s also possible to remove all objects at once:

In [None]:
rm(list = ls())
ls()

### Lists

Lists are data structures that can contain heterogeneous types of data. They provide a basic key/value indexing facility.

In [None]:
l = list(a=1, b=c(1, 2), v="string")
l
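l[["I_like"]] = c("machine", "learning")  # [[ ]] stores the whole vector under one key
l["v"]    # single brackets return a sub-list
l$I_like  # $ returns the element itself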
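
The key/value access described above comes in two flavours, sketched here on a separate list named ``lst`` (a name chosen for illustration): single brackets keep the list wrapper, while double brackets and ``$`` extract the stored element.

```r
lst <- list(a = 1, b = c(1, 2), v = "string")

sub <- lst["b"]      # single brackets: a list of length 1
elt <- lst[["b"]]    # double brackets: the numeric vector itself

class(sub)               # "list"
class(elt)               # "numeric"
identical(elt, lst$b)    # TRUE: `$` extracts the element, like `[[`
names(lst)               # the keys: "a" "b" "v"
```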

List to vector and vector to list:

In [None]:
l = list(a=1, b=2)
l
unlist(l)
as.list(c(a=1, b=2))

### Matrices

The ``matrix()`` function can be used to create a matrix of numbers. Before
we use the ``matrix()`` function, we can learn more about it. Type

In [None]:
?matrix

The help file reveals that the ``matrix()`` function takes a number of inputs,
but for now we focus on the first three: the data (the entries in the matrix),
the number of rows, and the number of columns. First, we create a simple
matrix.

In [None]:
X = matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
X

Note that we could just as well omit typing ``data=``, ``nrow=``, and ``ncol=`` in the
``matrix()`` command above: that is, we could just type

In [None]:
X = matrix(c(1, 2, 3, 4), 2, 2)
X

and this would have the same effect. However, it can sometimes be useful to
specify the names of the arguments passed in, since otherwise R will assume
that the function arguments are passed into the function in the same order
that is given in the function’s help file. As this example illustrates, by
default R creates matrices by successively filling in columns. Alternatively,
the ``byrow=TRUE`` option can be used to populate the matrix in order of the
rows.

In [None]:
matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
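
The two points made above can be checked with a short sketch. The names ``M1``, ``M2`` and ``M3`` are ours. Named arguments can be given in any order, and for this square example filling by row yields the transpose of filling by column.

```r
# Named arguments may appear in any order.
M1 <- matrix(ncol = 2, nrow = 2, data = c(1, 2, 3, 4))
M2 <- matrix(c(1, 2, 3, 4), 2, 2)
all(M1 == M2)        # TRUE

# Row-wise filling compared with the transpose of column-wise filling.
M3 <- matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
all(M3 == t(M2))     # TRUE
```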

Notice that in the above command we did not assign the matrix to a variable
such as ``X``. In this case the matrix is printed to the screen but is not saved
for future calculations. The ``sqrt()`` function returns the square root of each
element of a vector or matrix. The command ``X ^ 2`` or ``X ** 2`` raises each element of ``X``
to the power 2; any powers are possible, including fractional or negative
powers.

In [None]:
sqrt(X)

X ^ 2

# or

X ** 2
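
As a sketch of fractional and negative powers, using a separate matrix ``P`` (our name): the power operator always works element by element, so a power of -1 gives reciprocals rather than the matrix inverse, which is computed with ``solve()``.

```r
P <- matrix(c(1, 2, 3, 4), 2, 2)

all.equal(P ^ 0.5, sqrt(P))   # TRUE: fractional power, element by element
P ^ (-1)                      # element-wise reciprocals, 1 / P
solve(P)                      # the matrix inverse, shown for comparison
P %*% solve(P)                # the 2 x 2 identity, up to rounding error
```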

The ``rnorm()`` function generates a vector of random normal variables,
with first argument ``n`` the sample size. Each time we call this function, we
will get a different answer. Here we create two correlated sets of numbers,
``x`` and ``y``, and use the ``cor()`` function to compute the correlation between
them.

In [None]:
x = rnorm(50)
y = x + rnorm(50, mean=50 , sd=.1)
cor(x, y)

By default, ``rnorm()`` creates standard normal random variables with a mean
of 0 and a standard deviation of 1. However, the mean and standard deviation
can be altered using the ``mean`` and ``sd`` arguments, as illustrated above.
Sometimes we want our code to reproduce the exact same set of random
numbers; we can use the ``set.seed()`` function to do this. The ``set.seed()``
function takes an (arbitrary) integer argument.

In [None]:
set.seed(42)
rnorm(10)
set.seed(42)
rnorm(10)
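
Rather than comparing the two printed vectors by eye, the reproducibility can be checked programmatically. This is a small sketch; the names ``first``, ``second`` and ``third`` are ours.

```r
set.seed(42)
first <- rnorm(10)
set.seed(42)
second <- rnorm(10)
identical(first, second)   # TRUE: same seed, same draws

third <- rnorm(10)         # no reseeding: the generator has moved on
identical(first, third)    # FALSE
```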

The ``mean()`` and ``var()`` functions can be used to compute the mean and
variance of a vector of numbers. Applying ``sqrt()`` to the output of ``var()``
will give the standard deviation. Or we can simply use the ``sd()`` function.

In [None]:
set.seed(3)
y = rnorm(100)
mean(y)

var(y)

sqrt(var(y))

sd(y)
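
To make explicit what ``var()`` computes, the sketch below recomputes the sample variance by hand with the n - 1 denominator. The names ``z``, ``n`` and ``manual_var`` are ours.

```r
set.seed(3)
z <- rnorm(100)
n <- length(z)

# Sample variance: sum of squared deviations divided by n - 1.
manual_var <- sum((z - mean(z))^2) / (n - 1)

all.equal(manual_var, var(z))         # TRUE
all.equal(sqrt(manual_var), sd(z))    # TRUE
```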

### Indexing Data

We often wish to examine part of a set of data. Suppose that our data is
stored in the matrix ``A``.

In [None]:
A = matrix(1:16, 4, 4)
A
A[2, 3]

``A[2, 3]`` selects the element corresponding to the second row and the third column. The first number after the open-bracket symbol ``[`` always refers to the row, and the second number always refers to the column. We can also select multiple rows and columns at a time, by providing vectors as the indices.

In [None]:
A[c(1, 3), c(2, 4)]

Or using the slicing notation:

In [None]:
print("Rows 1 to 3 and columns 2 to 4")
A[1:3, 2:4]

print("Rows 1 to 2 and all columns")
A[1:2, ]

print("All rows and columns 1 to 2")
A[, 1:2]

The last two examples include either no index for the columns or no index
for the rows. These indicate that R should include all columns or all rows,
respectively. R treats a single row or column of a matrix as a vector.

In [None]:
A[1, ]
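
Because a single row or column is returned as a plain vector, its dimensions are lost. If a one-row matrix is needed instead, the ``drop = FALSE`` argument of the indexing operator keeps them. This is sketched below on a separate matrix ``B`` (our name).

```r
B <- matrix(1:16, 4, 4)

r1 <- B[1, ]                  # a plain vector: the dimensions are dropped
dim(r1)                       # NULL
length(r1)                    # 4

r1m <- B[1, , drop = FALSE]   # still a matrix, with a single row
dim(r1m)                      # 1 4
```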

The use of a negative sign ``-`` in the index tells R to keep all rows or columns
except those indicated in the index.

In [None]:
A[-c(1, 3), ]

The ``dim()`` function outputs the number of rows followed by the number of
columns of a given matrix.

In [None]:
dim(A)
dim(A[1:3, ])
dim(A)[1]

### Matrix operations

In [None]:
X = matrix(c(1, 2, 3, 4), 2, 2)
X

#### Column-wise or row-wise operations

In [None]:
colSums(X)
colMeans(X)

rowSums(X)
rowMeans(X)

In [None]:
apply(X, 2, mean)
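
In ``apply()``, the second argument selects the dimension to work along: 1 for rows, 2 for columns. The sketch below, on a separate matrix ``W`` (our name), checks that it agrees with the dedicated functions and shows that any summary function can be plugged in.

```r
W <- matrix(c(1, 2, 3, 4), 2, 2)

# MARGIN = 1 applies the function to each row, MARGIN = 2 to each column.
all.equal(apply(W, 1, mean), rowMeans(W))   # TRUE
all.equal(apply(W, 2, sum), colSums(W))     # TRUE

# Any function returning a single value works, e.g. the column maximum.
apply(W, 2, max)                            # 2 4
```

For sums and means, ``colSums()``, ``rowMeans()`` and their siblings are the more direct choice; ``apply()`` is the general tool.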

#### Centering and scaling

In [None]:
X = matrix(c(1, 2, 3, 4), 2, 2)
Xcs = scale(X)
# Xcs contains the standardized matrix plus some hidden attributes
attributes(Xcs)

xbar <- attr(Xcs, "scaled:center")
xsd <- attr(Xcs, "scaled:scale")
Xcs2 = scale(X, center=xbar, scale=xsd)

all(Xcs==Xcs2)
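
What ``scale()`` does with its default arguments can be reproduced by hand: subtract each column mean, then divide by each column standard deviation. The sketch below uses ``sweep()`` for both steps; the names ``S``, ``centered`` and ``manual`` are ours.

```r
S <- matrix(c(1, 2, 3, 4), 2, 2)

col_means <- colMeans(S)
col_sds <- apply(S, 2, sd)

# sweep() applies an operation between each column and one value per column.
centered <- sweep(S, 2, col_means, "-")
manual <- sweep(centered, 2, col_sds, "/")

# check.attributes = FALSE ignores the attributes that scale() attaches.
all.equal(manual, scale(S), check.attributes = FALSE)   # TRUE
```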

#### Matrix / vector product

In [None]:
X = matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
X
dim(X)

In [None]:
b = matrix(c(1, 2), 2, 1)
b
dim(b)

X %*% b
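
The ``%*%`` operator is the matrix product: the number of columns on the left must equal the number of rows on the right. The sketch below, with our own names ``D`` and ``w``, checks the resulting dimensions and recomputes the product by hand.

```r
D <- matrix(c(1, 2, 3, 4, 5, 6), 3, 2)   # 3 x 2
w <- matrix(c(1, 2), 2, 1)               # 2 x 1

p <- D %*% w             # (3 x 2) times (2 x 1) gives a 3 x 1 matrix
dim(p)                   # 3 1

# Each entry is the dot product of one row of D with w.
manual <- D[, 1] * 1 + D[, 2] * 2
all(p[, 1] == manual)    # TRUE

G <- t(D) %*% D          # 2 x 2 cross-product matrix
dim(G)                   # 2 2
```

By contrast, ``*`` multiplies element by element and expects operands of matching shape.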

## Data frame and data input / output

For most analyses, the first step involves importing a data set into R. The ``read.table()`` function is one of the primary ways to do this. The help file contains details about how to use this function. We can use the function
``write.table()`` to export data.

Before attempting to load a data set, we must make sure that R knows
to search for the data in the proper directory. For example, on a Windows
system one could select the directory using the Change dir... option under
the File menu. However, the details of how to do this depend on the operating
system (e.g. Windows, Mac, Unix) that is being used, and so we
do not give further details here. Use ``getwd()`` and ``setwd()`` to get and set the current working directory.

In [None]:
getwd()
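
A minimal sketch of changing the working directory and restoring it afterwards. It moves to R's per-session temporary directory only so that the example runs anywhere; ``old_wd`` and ``iris_path`` are names chosen for illustration.

```r
old_wd <- getwd()            # remember the current working directory
setwd(tempdir())             # move to the session's temporary directory
getwd()
setwd(old_wd)                # restore the original directory
identical(getwd(), old_wd)   # TRUE

# file.path() builds a path from its components, separated by "/".
iris_path <- file.path("..", "datasets", "iris.csv")
iris_path                    # "../datasets/iris.csv"
```

Relative paths such as the one above are resolved from the current working directory, which is why it matters when loading data.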

Once the data has been loaded, the ``fix()`` function can be used to view it in a spreadsheet-like window.
However, the window must be closed before further R commands can be
entered. The ``head()`` function prints the first rows of the ``data.frame``.

In [None]:
df = read.table("../datasets/iris.csv", header=TRUE, sep=",")
#fix(df)
head(df)

In [None]:
df = read.csv("../datasets/iris.csv")
head(df)
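
Exporting with ``write.table()`` mirrors the import. The sketch below writes a small made-up table (``toy``, a hypothetical example and not one of the lab's data sets) to a temporary file and reads it back.

```r
# Hypothetical toy table, used only for this illustration.
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

tmp <- tempfile(fileext = ".csv")
# The header line is written by default; row names are left out.
write.table(toy, tmp, sep = ",", row.names = FALSE)

back <- read.table(tmp, header = TRUE, sep = ",")
all.equal(toy, back)   # TRUE
unlink(tmp)            # delete the temporary file
```

Without ``row.names = FALSE``, the row names would be written as an extra first column.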

In [None]:
link = 'https://github.com/duchesnay/pystatsml/raw/master/datasets/salary_table.csv'
# X = read.csv(url(link))

``read.table()`` and ``read.csv()`` return a ``data.frame``, which is a list of equal-length vectors, that is, a table of heterogeneous data. A ``data.frame`` can be indexed like a matrix. A column can be obtained using its name after the symbol ``$``.

In [None]:
colnames(df)
df$sepal_length[1:10]
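
The matrix-style indexing mentioned above is sketched below on a small made-up table. ``meas`` and its columns are hypothetical names, used so that the example does not depend on the iris file.

```r
meas <- data.frame(len = c(5.1, 4.9, 4.7),
                   wid = c(3.5, 3.0, 3.2),
                   sp = c("a", "a", "b"))

meas[1:2, ]                # rows 1 and 2, all columns
meas[, c("len", "wid")]    # all rows, two columns selected by name
meas[2, "wid"]             # a single value: 3

identical(meas$len, meas[["len"]])   # TRUE: two ways to extract a column
identical(meas$len, meas[, "len"])   # TRUE
```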

In [None]:
summary(df)

Build a ``data.frame``:

In [None]:
user1 = data.frame(name=c('eric', 'sophie'),
                   age=c(22, 48), gender=c('M', 'F'),
                   job=c('engineer', 'scientist'))

user2 = data.frame(name=c('alice', 'john', 'peter', 'julie', 'christine'),
                   age=c(19, 26, 33, 44, 35), gender=c('F', 'M', 'M', 'F', 'F'),
                   job=c('student', 'student', 'engineer', 'scientist', 'scientist'))

### Concatenate and merge

In [None]:
user3 = rbind(user1, user2)

print(user3)

salary = data.frame(name=c('alice', 'john', 'peter', 'julie'), salary=c(22000, 2400, 3500, 4300))

user = merge(user3, salary, by="name", all=TRUE)

print(user)
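
The ``all=TRUE`` argument keeps every name from either table and fills the gaps with ``NA``. The sketch below contrasts it with the other join types, using two small made-up tables (``people`` and ``pay``, with invented values).

```r
# Made-up tables for illustration only.
people <- data.frame(name = c("alice", "john", "eric"), age = c(19, 26, 22))
pay <- data.frame(name = c("alice", "john", "peter"), salary = c(100, 200, 300))

inner <- merge(people, pay, by = "name")                 # names present in both
left  <- merge(people, pay, by = "name", all.x = TRUE)   # every row of people
outer <- merge(people, pay, by = "name", all = TRUE)     # every name from either table

nrow(inner)   # 2
nrow(left)    # 3
nrow(outer)   # 4
is.na(left$salary[left$name == "eric"])   # TRUE: no salary recorded for eric
```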

### Selection

In [None]:
user[(user$gender == 'F') & (user$job == 'scientist'), ]
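
The selection above works in two steps: the comparison builds a logical vector with one entry per row, and that vector is then used as the row index. The sketch below makes the mask explicit on a small made-up table (``staff``) and shows the equivalent ``subset()`` call.

```r
# Made-up table for illustration only.
staff <- data.frame(name = c("julie", "peter", "sophie"),
                    gender = c("F", "M", "F"),
                    job = c("scientist", "engineer", "scientist"))

mask <- (staff$gender == "F") & (staff$job == "scientist")
mask                             # TRUE FALSE TRUE

by_mask <- staff[mask, ]
by_subset <- subset(staff, gender == "F" & job == "scientist")
identical(by_mask$name, by_subset$name)   # TRUE
```

One difference to keep in mind: ``subset()`` drops rows where the condition is ``NA``, whereas indexing with a mask that contains ``NA`` returns rows filled with ``NA``.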


### Iterate over columns

In [None]:
types = NULL
for(n in colnames(user)){
  types = rbind(types, data.frame(var=n, 
                                  type=typeof(user[[n]]),
                                  isnumeric=is.numeric(user[[n]])))
}

print(types)

In [None]:
lapply(user, is.numeric)
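
``lapply()`` always returns a list. When each result is a single value, ``sapply()`` simplifies the output to a named vector, which can be used directly to select columns. The sketch below uses a small made-up table (``tab``, with invented values).

```r
# Made-up table for illustration only.
tab <- data.frame(name = c("a", "b"), age = c(20, 30), salary = c(1000, NA))

num_cols <- sapply(tab, is.numeric)   # named logical vector, one entry per column
num_cols                              # name: FALSE, age: TRUE, salary: TRUE
names(tab)[num_cols]                  # "age" "salary"

# Column means of the numerical columns, ignoring missing values.
colMeans(tab[, num_cols], na.rm = TRUE)
```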

## Exercises

### Data Frame

1. Read the iris dataset. Download it from a shell with ``wget https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv``. Recent versions of R can also read it directly with ``read.csv(url('https://github.com/duchesnay/pystatsml/raw/master/datasets/iris.csv'))``.



2. Print column names.

3. Identify numerical columns.

4. For each species, compute the mean of the numerical columns and store it in a ``stats`` table like:


          species  sepal_length  sepal_width  petal_length  petal_width
    0      setosa         5.006        3.428         1.462        0.246
    1  versicolor         5.936        2.770         4.260        1.326
    2   virginica         6.588        2.974         5.552        2.026



### Missing data

Consider the table ``user``:

In [None]:
user

1. Write a function ``fillmissing_with_mean(df)`` that fills all missing values of each numerical column with the mean of that column.

2. Save the imputed table in a CSV file "users_imputed.csv".