## Introduction

Advantages of R:

1. Open-source and free
2. Runs on Windows, Linux, and macOS
3. Contains advanced statistical routines not yet available in other packages
4. Has state-of-the-art graphics capabilities
5. Produces print-quality output
6. Is a vector-based language
7. Is easy to start with, and is therefore used in all areas of research: statistics, econometrics, actuarial science, sociology, finance, marketing, health, epidemiology, etc.


Disadvantages of R:

1. Performance
2. Somewhat messy syntax
3. There are always many different ways to do the same job

## Coding Style

R uses <- for assignment:

In [3]:
num_1 <- 2

In [4]:
num_1

R also supports = for assignment:

In [5]:
num_2 = 3

In [6]:
num_2

R also allows rightward assignment with ->:

In [7]:
4 -> num_3

In [8]:
num_3

The assigned value can also be the result of an expression:

In [9]:
num_result <- 12 + 11

In [10]:
num_result
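The operators above are interchangeable at the top level, but they differ inside a function call: there, = names an argument while <- still creates a variable. The following is a minimal sketch of that difference; the helper name assignment_demo is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: runs in its own scope so the check is not
# affected by variables that already exist in the workspace.
assignment_demo <- function() {
  env <- environment()                       # this function's own scope

  m1 <- mean(x = c(2, 4, 6))                 # = names the argument x; no variable is created
  after_equals <- exists("x", envir = env, inherits = FALSE)

  m2 <- mean(x <- c(2, 4, 6))                # <- creates x here, then passes its value on
  after_arrow <- exists("x", envir = env, inherits = FALSE)

  list(m1 = m1, m2 = m2,
       after_equals = after_equals, after_arrow = after_arrow)
}

result <- assignment_demo()
result$m1            # 4
result$after_equals  # FALSE
result$after_arrow   # TRUE
```

This is one reason many style guides prefer <- for assignment and reserve = for arguments.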

## Data Types

R commonly works with the following data types:

1. character: "a", "test"
2. numeric: 2, 47.5
3. integer: 2L (the L suffix tells R to store the number as an integer)
4. logical: TRUE, FALSE
5. complex: 1+2i

If you do not know the type of a variable, use the class() function to find out:

In [13]:
char_1 <- "Hello R"

In [14]:
class(char_1)

Alternatively, use the typeof() function:

In [15]:
typeof(char_1)

In [16]:
typeof(num_1)

In [17]:
typeof(num_2)

In [18]:
b <- 2

In [19]:
typeof(b)

In [20]:
c <- 2L

In [21]:
typeof(c)

We can also use the is.*() functions to test whether a value is of a given type:

In [22]:
num_4 <- 2.3

In [23]:
is.integer(num_4)

In [24]:
is.numeric(num_4)

In [25]:
is.numeric(char_1)
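class() and typeof() do not always agree, which can be confusing at first. The sketch below tabulates both for one value of each basic type; the variable names values and type_table are assumptions made for this example.

```r
# One value of each basic type.
values <- list(2, 2L, "Hello R", TRUE, 1+2i)

# class() reports the high-level type, typeof() the internal storage type.
type_table <- data.frame(
  class  = sapply(values, class),
  typeof = sapply(values, typeof),
  row.names = c("2", "2L", "\"Hello R\"", "TRUE", "1+2i"),
  stringsAsFactors = FALSE
)
print(type_table)

# is.numeric() accepts both doubles and integers; is.integer() is stricter.
is.numeric(2L)  # TRUE
is.integer(2)   # FALSE, because 2 without the L suffix is stored as a double
```

The only row where the two columns differ is the plain number 2: its class is "numeric" but its storage type is "double".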

We can use the : operator to generate a sequence of integers:

In [26]:
arr = 1:100

In [27]:
arr

The sequence includes both endpoints.

In [29]:
object.size(arr)  # reports the memory used by the object

448 bytes

## Data Structures

The main data structures in R are:

1. vector
2. list
3. matrix
4. data frame
5. factors

**Vector**

R is essentially a vector-based language: basic objects can be treated as vectors, and even a single value is a vector of length 1.

In [30]:
vector_1 <- "123456"

In [31]:
vector_1

In [32]:
length(vector_1)

We can also create an empty logical vector:

In [33]:
vector_empty <- vector()

In [34]:
vector_empty

In [35]:
length(vector_empty)

If we need a vector of a given type pre-filled with 5 empty elements:

In [37]:
vector_empty_5 <- vector("character", length = 5)

In [38]:
vector_empty_5

Create a numeric vector with 5 elements:

In [39]:
numeric(5)

Create a logical vector with 5 elements:

In [40]:
logical(5)

Create a character vector with 5 elements:

In [41]:
character(5)

We can also build a vector with the c() function:

In [42]:
vector_c <- c(1, 2, 3, 4, 5)

In [43]:
vector_c

In [44]:
length(vector_c)

In [45]:
vector_c_2 <- c(1, 2, 3, 4, 5, "c")

In [46]:
vector_c_2

In [47]:
vector_c_3 <- c(1L, 2L, 3L)

In [48]:
vector_c_3
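When c() receives values of different types, R silently converts them all to the most general type present. The sketch below illustrates the usual order (logical, then integer, then double, then character); the variable names are chosen only for this example.

```r
# Mixed inputs are coerced to the most general type among them:
# logical < integer < double < character
mix_lgl_int <- c(TRUE, 2L)     # TRUE becomes 1L
mix_int_dbl <- c(1L, 2.5)      # 1L becomes 1.0
mix_dbl_chr <- c(1, 2, "c")    # numbers become the strings "1" and "2"

typeof(mix_lgl_int)  # "integer"
typeof(mix_int_dbl)  # "double"
typeof(mix_dbl_chr)  # "character"
```

This explains why a vector that mixes numbers with a single string, as in vector_c_2, prints every element in quotes.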

We can convert between types:

In [49]:
x <- c(1, 2, 3, 4, "abc")

In [50]:
y <- as.numeric(x)

"NAs introduced by coercion"


In [51]:
y

In [52]:
x <- c("1", "2", "3", "4", "5")

In [53]:
y <- as.numeric(x)

In [54]:
y
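A conversion that fails produces NA rather than an error, so it is often useful to check which inputs were lost. The following is a small sketch under that assumption; the names raw, converted and failed are illustrative only.

```r
raw <- c("1", "2", "abc")

# suppressWarnings() silences the "NAs introduced by coercion" warning on purpose.
converted <- suppressWarnings(as.numeric(raw))

# Inputs that could not be converted show up as NA.
failed <- raw[is.na(converted)]
failed                                   # "abc"

# Other common conversions
as.integer(3.9)                          # 3, the fractional part is dropped
as.character(12.5)                       # "12.5"
as.logical(c("TRUE", "false", "yes"))    # TRUE FALSE NA
```

Note that this check would also report inputs that were already NA before the conversion.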

In [55]:
x <- c(1,2)  # combining two numeric vectors of one element each
x  # result is a numeric vector of length two

In [56]:
y <- c(x,x,x)  # combining three vectors together
y  # and a result is a vector of 6 elements

In [57]:
x <- 1:10  # a sequence of integers from 1 to 10 inclusive
x

In [58]:
x <- seq(10)  # again a sequence of integers starting from 1,
x  # which is a default behaviour of seq()

In [59]:
seq(from = 5, to = 10, by = 0.2)
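Besides a fixed step, seq() can also be told how many points to produce, and a few related helpers cover other common cases. A brief sketch, with variable names chosen for illustration:

```r
by_step   <- seq(from = 5, to = 10, by = 0.2)  # fixed step size
by_length <- seq(0, 1, length.out = 5)         # fixed number of points
indices   <- seq_len(3)                        # 1 2 3
repeated  <- rep(c("a", "b"), times = 2)       # "a" "b" "a" "b"

length(by_step)     # 26
by_length           # 0.00 0.25 0.50 0.75 1.00

# seq_len(0) is empty, whereas 1:0 counts down and yields two elements.
length(seq_len(0))  # 0
length(1:0)         # 2
```

The last two lines are why seq_len() is often preferred over 1:n when n may be zero, for example in loops.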

## Missing Values / Special Values

All elements of a vector have the same type, with one exception: NA (a missing value) can appear in a vector of any type:

In [61]:
x <- c(1, 2, 3, NA, 5)

In [62]:
class(x)

In [63]:
y <- c("a", "b", "c", "d", 12, "e")

In [64]:
class(y)

In [65]:
z <- c(TRUE, TRUE, FALSE, NA)
class(z)

In [67]:
x <- c(1, 2, 3, NA, 5)
is.na(x)
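Missing values propagate through most calculations, so they usually have to be handled explicitly. A minimal sketch using the same vector; the names total, avg, n_missing and complete are introduced here for illustration.

```r
x <- c(1, 2, 3, NA, 5)

sum(x)                           # NA: one unknown value makes the total unknown

total     <- sum(x, na.rm = TRUE)    # 11, NA is dropped before summing
avg       <- mean(x, na.rm = TRUE)   # 2.75
n_missing <- sum(is.na(x))           # 1, TRUE counts as 1
complete  <- x[!is.na(x)]            # 1 2 3 5
```

Many summary functions accept the na.rm argument, but its default is FALSE, so the NA result is what you get unless you ask otherwise.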

Inf is infinity. It can be positive or negative:

In [68]:
1/0

In [69]:
log(0)

NaN is Not a Number:

In [70]:
0/0

In [71]:
log(-1)

"NaNs produced"


NULL is the null object; it represents the absence of any value. Assigning NULL to a list element or a data frame column removes it.
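Inf, NaN, NA and NULL are easy to confuse, so it helps to see how the test functions treat each of them. The sketch below is illustrative only; the names special, nothing, settings and combined are assumptions made for this example.

```r
special <- c(1/0, -1/0, 0/0, NA, 42)   # Inf, -Inf, NaN, NA, 42

is.finite(special)    # FALSE FALSE FALSE FALSE  TRUE
is.infinite(special)  #  TRUE  TRUE FALSE FALSE FALSE
is.nan(special)       # FALSE FALSE  TRUE FALSE FALSE
is.na(special)        # FALSE FALSE  TRUE  TRUE FALSE (NaN also counts as missing)

# NULL is not a value inside a vector; it has length 0, while NA has length 1.
nothing <- NULL
is.null(nothing)      # TRUE
length(nothing)       # 0
length(NA)            # 1

# Assigning NULL to a list element removes that element.
settings <- list(a = 1, b = "two", c = TRUE)   # hypothetical list
settings$b <- NULL
names(settings)       # "a" "c"

# NULL disappears when combined with other values.
combined <- c(1, NULL, 2)
length(combined)      # 2
```

Assigning NULL to an ordinary variable does not delete the variable itself; rm() is the function for that.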

## Object Attributes

Every object has attributes, which can be inspected with functions such as:

1. length() - length of the object, that is, the number of elements
2. names() - names of each element, if any
3. dim() - dimensions of the object (for example, the number of rows and columns of a matrix)
4. class() - data type
5. nchar() - number of characters in every element of the object
6. attributes() - provides a list of available attributes

In [72]:
x <- c("a" = 1, "b" = 2, "c" = 3)

In [73]:
x

This is a named vector. It is the same numeric vector as c(1, 2, 3), but every element has its own
name, which can be used to extract the value of that element. More details will come later.

In [74]:
class(x)

In [75]:
length(x)

In [76]:
names(x)

In [77]:
attributes(x)

In [78]:
y <- c("abc", "defgh")
nchar(y)

In [79]:
x <- c("a" = 1, "b" = 2, "c" = 3)
x

In [80]:
names(x) <- c("d", "e", "f")
x

In [82]:
x <- c("a" = 11, "b" = 12, "c" = 13)
x[2]  # get a value by an index

In [83]:
x["b"]  # get a value by a name

In [84]:
x <- c("a" = 11, "b" = 12, "c" = 13)
x

In [85]:
x["b"] <- 99  # assign value 99 to an element with name "b" of the vector "x"
x
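Names can also be used to read or change several elements at once. A short sketch continuing with the same named vector:

```r
x <- c("a" = 11, "b" = 12, "c" = 13)

x[c("a", "c")]                 # select two elements by name
x[c("a", "c")] <- c(0, 100)    # change both in one step

unname(x)                      # 0 12 100, the values without their names
"b" %in% names(x)              # TRUE, check whether a name exists
x[["b"]]                       # 12, double brackets drop the name
```

Single brackets keep the name attached to the result, while double brackets return the bare value.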

## Matrix

A matrix is a two-dimensional structure closely related to a vector. It has a number of rows and columns,
and all elements must be of the same type: numeric, character, logical, etc.

In [86]:
m <- matrix(nrow = 2, ncol = 3)  # matrix with 2 rows and 3 columns, filled with NA

In [87]:
m

NA,NA,NA
NA,NA,NA


In [88]:
class(m)

In [89]:
typeof(m)

In [91]:
dim(m)  # dimensions of m: 2 rows, 3 columns

In [95]:
m <- matrix(c(1:6), 2, 3)

In [96]:
m

1,3,5
2,4,6


In [97]:
class(m)

In [98]:
typeof(m)

In [99]:
m <- matrix(letters, ncol = 2, byrow = TRUE)  # filling row-wise

In [100]:
m

a,b
c,d
e,f
g,h
i,j
k,l
m,n
o,p
q,r
s,t
u,v
w,x
y,z


In [107]:
m <- matrix(letters, ncol = 13, byrow = TRUE)  # filling row-wise

In [108]:
m

a,b,c,d,e,f,g,h,i,j,k,l,m
n,o,p,q,r,s,t,u,v,w,x,y,z


In [109]:
class(m)

In [110]:
typeof(m)
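The effect of byrow is easier to see with numbers than with letters. The sketch below fills the same values both ways; the names by_col and by_row are chosen for this illustration.

```r
by_col <- matrix(1:6, nrow = 2)                # default: fill column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row

by_col[1, ]  # 1 3 5
by_row[1, ]  # 1 2 3

# Filling row-wise gives the same result as filling a 3 x 2 matrix
# column-wise and then transposing it.
all(by_row == t(matrix(1:6, nrow = 3)))  # TRUE
```

In both cases the matrix has 2 rows and 3 columns; only the order in which the values are placed differs.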

Here is one way to create a matrix:

First create a vector:

In [111]:
x <- 1:12

In [112]:
dim(x)

NULL

Then turn it into a matrix by setting its dimensions:

In [113]:
dim(x) <- c(3, 4)

In [114]:
x

1,4,7,10
2,5,8,11
3,6,9,12


Here is another way to create a matrix:

In [117]:
x <- 1:4
y <- 5:8

In [118]:
cbind(x, y)  # combine x and y as columns

x,y
1,5
2,6
3,7
4,8


In [119]:
rbind(x, y)  # combine x and y as rows

x,1,2,3,4
y,5,6,7,8


Alternatively, we can do this:

In [120]:
m <- matrix(1:12, 3, 4)  # elements 1 to 12, 3 rows, 4 columns

In [121]:
m

1,4,7,10
2,5,8,11
3,6,9,12


In [123]:
m[2, 2]  # extract the element in row 2, column 2
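Leaving one of the two indexes empty selects a whole row or column. A brief sketch on the same matrix; the variable name row_as_matrix is an assumption for this example.

```r
m <- matrix(1:12, 3, 4)

m[2, ]   # whole second row:   2 5 8 11
m[, 3]   # whole third column: 7 8 9

# By default a single row or column is simplified to a plain vector.
# drop = FALSE keeps the result as a matrix.
row_as_matrix <- m[2, , drop = FALSE]
dim(row_as_matrix)  # 1 4

nrow(m)  # 3
ncol(m)  # 4
```

Keeping the matrix shape with drop = FALSE is mainly useful when later code expects two dimensions.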

## Array

An array extends a matrix to more than two dimensions. It can have as many dimensions as you want, but all elements must be of the same data type.

In [125]:
x <- 1:12
dim(x) <- c(2, 3, 2)

In [126]:
x

The result should look like this:

![image.png](attachment:5bbb0dbe-35f2-49a1-ab41-ae485ebba517.png)

In [127]:
class(x)

In [128]:
typeof(x)

In [129]:
x[2, 3, 2]
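The same array can be built directly with the array() function, and an empty index selects a whole slice. A minimal sketch; the name first_slice is illustrative.

```r
x <- array(1:12, dim = c(2, 3, 2))  # 2 rows, 3 columns, 2 layers

x[2, 3, 2]               # 12, the last element

first_slice <- x[, , 1]  # the first layer is an ordinary 2 x 3 matrix
dim(first_slice)         # 2 3
first_slice[1, ]         # 1 3 5
```

Values are filled along the first dimension first, then the second, then the third, which is why the first layer holds 1 to 6.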

## List

A list is a special type of vector whose elements do not all need to be of the same type.

In [130]:
x <- list(1:5, 2.5, "abcdef", TRUE)

In [132]:
x

Like a vector, a list can be named:

In [133]:
x <- list(a = 1:5, b = 2.5, c = "abcdef", d = TRUE)

In [134]:
x

To access the elements of a list, use an index in double square brackets, or the element name if the list is named.

In [135]:
x[[3]]  # list element number 3

In [136]:
x[["c"]]  # list element with name "c"

In [138]:
x$c  # the same as above - element with name "c"

In [139]:
x <- list(a= 1:5, b = 2.5, c="abcdef", d=TRUE)

In [140]:
x

In [141]:
x[1]  # single brackets return a list containing the element, not the element itself
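The difference between single and double brackets matters as soon as you compute with the result. A short sketch on the same list; the name sub is an assumption for this example.

```r
x <- list(a = 1:5, b = 2.5, c = "abcdef", d = TRUE)

class(x[1])     # "list": a list of length 1 that still wraps the element
class(x[[1]])   # "integer": the element itself

sum(x[[1]])     # 15; sum(x[1]) would fail because a list cannot be summed

# Single brackets can select several elements and always return a list.
sub <- x[c("a", "d")]
names(sub)      # "a" "d"
```

Use double brackets (or $) to work with one element, and single brackets to take a subset of the list.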

## Data Frame

The data frame is the most important data structure in R.

It is the de facto standard for tabular data in statistics.

A data frame has a rectangular shape with rows and columns. Speaking statistically, every column is a
variable (or attribute or category) and every row is an observation (or case or patient or respondent).

A data frame is a special type of list in which every element is a vector of the same length.

In [142]:
df <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

In [143]:
df

id,x,y
<chr>,<int>,<int>
a,1,11
b,2,12
c,3,13
d,4,14
e,5,15
f,6,16
g,7,17
h,8,18
i,9,19
j,10,20


A data frame always has column names.

In [144]:
names(df)

In [145]:
typeof(df)

In [146]:
dim(df)

In [147]:
df["x"]  # the column named x; the result is a one-column data frame

x
<int>
1
2
3
4
5
6
7
8
9
10


In [148]:
df[2]  # single brackets with an index. Result is the same as above - a data frame.

x
<int>
1
2
3
4
5
6
7
8
9
10


In [149]:
df$id  # dollar-sign notation. Result is a vector of all values in the column.

In [150]:
df[["x"]]  # double square brackets with a name. Result is as above.

In [151]:
df[[2]]  # double square brackets with an index number. Result is as above.

In [152]:
df[3, 2]  # individual indexes for row and column. Result is a vector of length 1.

In [153]:
df[3,"x"]  # the same as above but with the name instead of index.
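The same bracket notation can filter rows with a condition and add new columns. The following is a sketch on the same data frame; the column name total and the variables big and subset_cols are introduced only for this illustration.

```r
df <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

# Keep the rows where x is greater than 7 (note the comma: rows first, then columns).
big <- df[df$x > 7, ]
nrow(big)          # 3

# Add a new column computed from existing ones.
df$total <- df$x + df$y

# Combine a row condition with a selection of columns.
subset_cols <- df[df$x > 7, c("id", "total")]
subset_cols$total  # 26 28 30
```

The condition df$x > 7 is a logical vector with one value per row, and only the rows marked TRUE are returned.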

When you access values in a vector, matrix, array, list or data frame, the names or indexes do not have to be single values. Recall that there are no single values in R, only vectors. Hence, you can access or change multiple values at the same time.

In [154]:
x <- c("a", "b", "c", "d", "e", "f")
x

In [156]:
x[c(1,3,6)]  # an index is a vector with three elements

In [157]:
m <- matrix(1:12, nrow=3, ncol=4)

In [158]:
m

1,4,7,10
2,5,8,11
3,6,9,12
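Index vectors work on matrices in the same way, one vector per dimension. A minimal sketch on the matrix above; the names corner and large are assumptions for this example.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# Rows 1 and 3, columns 2 and 4: the result is a 2 x 2 matrix.
corner <- m[c(1, 3), c(2, 4)]
corner[1, ]   # 4 10
corner[2, ]   # 6 12

# A logical condition can serve as an index as well.
large <- m[m > 10]
large         # 11 12

# Several values can be changed in one assignment.
m[2, c(1, 2)] <- 0L
m[2, ]        # 0 0 8 11
```

The logical index returns a plain vector because the selected cells no longer form a rectangle.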


In [159]:
df <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
df

id,x,y
<chr>,<int>,<int>
a,1,11
b,2,12
c,3,13
d,4,14
e,5,15
f,6,16
g,7,17
h,8,18
i,9,19
j,10,20


There are many data sets built into R that can be used as examples. Let's have a look at one of them:

In [160]:
mtcars

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
... (remaining 22 of 32 rows not shown)


In [161]:
head(mtcars)  # first 6 rows of the data frame. You can change the number of rows.

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [162]:
tail(mtcars)  # last 6 rows of the data frame

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2


In [163]:
ncol(mtcars)  # number of columns
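A few more calls are commonly used for a first look at a data frame. The sketch below applies them to mtcars; the variable name four_cyl is an assumption made for this example.

```r
dim(mtcars)           # 32 11: rows, then columns
nrow(mtcars)          # 32
names(mtcars)         # column names

head(mtcars, n = 3)   # first 3 rows instead of the default 6
str(mtcars)           # compact overview: type and first values of every column

# Rows for cars with 4 cylinders, and their average fuel consumption.
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)        # 11
mean(four_cyl$mpg)
```

summary(mtcars) gives a similar overview with the minimum, quartiles, mean and maximum of each column.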

## Factors