# Fifty Exercises in R

> Data frame manipulation

[數據交點](https://www.datainpoint.com/) | 郭耀仁 <yaojenkuo@datainpoint.com>

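## Common functions for inspecting data frames

## Overview of common functions for inspecting data frames

- `dim()`, `nrow()`, and `ncol()` report the shape: dimensions, row count, and column count.
- `colnames()` and `row.names()` return the variable names and the observation (row) index.
- `summary()` gives descriptive statistics for each column.
- `str()` shows the structure: column types and the first few values.
- `View()` opens the data frame in a viewer; `head()` and `tail()` show its first and last rows.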
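
The functions listed above can be tried on a small data frame without downloading anything. The sketch below uses `toy_report`, a made-up data frame (an assumption for illustration, not part of the original exercise) whose column names mirror the daily report.

```r
# A small, made-up data frame standing in for the daily report (hypothetical values).
toy_report <- data.frame(
    Country_Region = c("A", "B", "C", "D"),
    Confirmed = c(100L, 250L, 30L, 0L),
    Deaths = c(5L, 10L, NA, 0L),
    stringsAsFactors = FALSE
)

print(dim(toy_report))         # number of rows, then number of columns
print(nrow(toy_report))
print(ncol(toy_report))
print(colnames(toy_report))
print(row.names(toy_report))   # default row names are "1", "2", ...
print(summary(toy_report))     # per-column statistics, including the NA count
str(toy_report)                # column types and the first few values
print(head(toy_report, 2))     # first two rows
print(tail(toy_report, 2))     # last two rows
```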

## Source of the example data

[COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19)

In [1]:
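# Download one daily report from the JHU CSSE repository.
# report_date defaults to two days ago, as in the original exercise. The repository
# stopped updating in March 2023, so pass an earlier date, for example
# as.Date("2021-08-25"), when the default date has no file.
get_daily_report <- function(report_date = Sys.Date() - 2) {
    file_date <- format(report_date, "%m-%d-%Y")
    csv_url <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/",
                      "csse_covid_19_daily_reports/",
                      file_date,
                      ".csv")
    daily_report <- read.csv(csv_url)
    return(daily_report)
}
daily_report <- get_daily_report()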

In [2]:
print(nrow(daily_report))
print(ncol(daily_report))
print(dim(daily_report))

[1] 3987
[1] 14
[1] 3987   14


In [3]:
print(colnames(daily_report))
print(row.names(daily_report)[1:10])

 [1] "FIPS"                "Admin2"              "Province_State"     
 [4] "Country_Region"      "Last_Update"         "Lat"                
 [7] "Long_"               "Confirmed"           "Deaths"             
[10] "Recovered"           "Active"              "Combined_Key"       
[13] "Incident_Rate"       "Case_Fatality_Ratio"
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"


In [4]:
summary(daily_report)

      FIPS          Admin2          Province_State     Country_Region    
 Min.   :   66   Length:3987        Length:3987        Length:3987       
 1st Qu.:19050   Class :character   Class :character   Class :character  
 Median :30068   Mode  :character   Mode  :character   Mode  :character  
 Mean   :32401                                                           
 3rd Qu.:47040                                                           
 Max.   :99999                                                           
 NA's   :721                                                             
 Last_Update             Lat             Long_           Confirmed      
 Length:3987        Min.   :-52.37   Min.   :-178.12   Min.   :      0  
 Class :character   1st Qu.: 33.27   1st Qu.: -96.62   1st Qu.:   1346  
 Mode  :character   Median : 37.94   Median : -86.85   Median :   3808  
                    Mean   : 35.95   Mean   : -71.98   Mean   :  53653  
                    3rd Qu.: 42.22   3rd Qu

In [5]:
str(daily_report)

'data.frame':	3987 obs. of  14 variables:
 $ FIPS               : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Admin2             : chr  "" "" "" "" ...
 $ Province_State     : chr  "" "" "" "" ...
 $ Country_Region     : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Last_Update        : chr  "2021-08-26 04:21:28" "2021-08-26 04:21:28" "2021-08-26 04:21:28" "2021-08-26 04:21:28" ...
 $ Lat                : num  33.9 41.2 28 42.5 -11.2 ...
 $ Long_              : num  67.71 20.17 1.66 1.52 17.87 ...
 $ Confirmed          : int  152722 141365 193171 15014 46539 1598 5155079 239056 314 21471 ...
 $ Deaths             : int  7090 2483 5096 130 1176 43 110966 4778 3 133 ...
 $ Recovered          : logi  NA NA NA NA NA NA ...
 $ Active             : logi  NA NA NA NA NA NA ...
 $ Combined_Key       : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Incident_Rate      : num  392 4912 441 19432 142 ...
 $ Case_Fatality_Ratio: num  4.642 1.756 2.638 0.866 2.527 ...


In [6]:
print(head(daily_report))
print(tail(daily_report))
#View(daily_report)

  FIPS Admin2 Province_State      Country_Region         Last_Update       Lat
1   NA                               Afghanistan 2021-08-26 04:21:28  33.93911
2   NA                                   Albania 2021-08-26 04:21:28  41.15330
3   NA                                   Algeria 2021-08-26 04:21:28  28.03390
4   NA                                   Andorra 2021-08-26 04:21:28  42.50630
5   NA                                    Angola 2021-08-26 04:21:28 -11.20270
6   NA                       Antigua and Barbuda 2021-08-26 04:21:28  17.06080
      Long_ Confirmed Deaths Recovered Active        Combined_Key Incident_Rate
1  67.70995    152722   7090        NA     NA         Afghanistan      392.3157
2  20.16830    141365   2483        NA     NA             Albania     4912.2594
3   1.65960    193171   5096        NA     NA             Algeria      440.5163
4   1.52180     15014    130        NA     NA             Andorra    19431.8255
5  17.87390     46539   1176        NA     NA  

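## Basic data frame manipulation

## Techniques for basic data frame manipulation

- Subsetting a data frame.
    - Selecting columns.
    - Filtering rows.
    - Selecting and filtering together.
- Sorting a data frame.
- Adding variables.
- Summarizing.

## Subsetting a data frame: selecting

Use the `df[["COLUMN_NAME"]]`, `df[, c("COLUMN_NAMES")]`, or `df$COLUMN_NAME` syntax.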
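
These selection forms do not all return the same kind of object, which matters when a later step expects a data frame. A minimal sketch on a made-up `toy_report` data frame (an assumption for illustration, not the downloaded report):

```r
# Hypothetical data with the same column names as the daily report.
toy_report <- data.frame(
    Country_Region = c("A", "B", "C"),
    Confirmed = c(100L, 250L, 30L),
    Deaths = c(5L, 10L, 1L)
)

by_double_bracket <- toy_report[["Confirmed"]]                  # vector
by_dollar <- toy_report$Confirmed                               # vector
by_single_name <- toy_report[, "Confirmed"]                     # vector (dimension dropped)
kept_as_frame <- toy_report[, "Confirmed", drop = FALSE]        # one-column data frame
two_columns <- toy_report[, c("Country_Region", "Confirmed")]   # data frame

print(class(by_single_name))
print(class(kept_as_frame))
print(class(two_columns))
```

Passing `drop = FALSE` keeps a single selected column as a data frame instead of simplifying it to a vector.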

In [7]:
# selecting a column
print(daily_report[["Country_Region"]][1:10])
#print(daily_report[, "Country_Region"])
#print(daily_report$Country_Region)

 [1] "Afghanistan"         "Albania"             "Algeria"            
 [4] "Andorra"             "Angola"              "Antigua and Barbuda"
 [7] "Argentina"           "Armenia"             "Australia"          
[10] "Australia"          


In [8]:
# selecting columns
head(daily_report[, c("Combined_Key", "Confirmed")])

| | Combined_Key (chr) | Confirmed (int) |
|---|---|---|
| 1 | Afghanistan | 152722 |
| 2 | Albania | 141365 |
| 3 | Algeria | 193171 |
| 4 | Andorra | 15014 |
| 5 | Angola | 46539 |
| 6 | Antigua and Barbuda | 1598 |


## Subsetting a data frame: filtering

Use the `df[CONDITION, ]` or `df[ROW_INDEX, ]` syntax.
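
The two forms are closely related: `which()` turns a logical condition into integer row positions. The sketch below, on a made-up `toy_report` data frame (an assumption, not the downloaded report), also shows one difference worth knowing: an `NA` in a logical condition produces an all-`NA` row, while `which()` drops it.

```r
# Hypothetical data; the last row has a missing Confirmed value.
toy_report <- data.frame(
    Country_Region = c("A", "B", "B", "C"),
    Confirmed = c(100L, 250L, 30L, NA),
    stringsAsFactors = FALSE
)

# A logical condition keeps the rows where it is TRUE.
rows_for_b <- toy_report[toy_report$Country_Region == "B", ]

# which() converts the condition into integer row positions.
b_positions <- which(toy_report$Country_Region == "B")
same_rows <- toy_report[b_positions, ]

# An NA in the condition yields an all-NA row; which() leaves it out.
with_na_row <- toy_report[toy_report$Confirmed > 50, ]
without_na_row <- toy_report[which(toy_report$Confirmed > 50), ]

print(with_na_row)
print(without_na_row)
```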

In [9]:
# filtering a row
daily_report[daily_report$Country_Region == "Taiwan*", ]

| | FIPS (int) | Admin2 (chr) | Province_State (chr) | Country_Region (chr) | Last_Update (chr) | Lat (dbl) | Long_ (dbl) | Confirmed (int) | Deaths (int) | Recovered (lgl) | Active (lgl) | Combined_Key (chr) | Incident_Rate (dbl) | Case_Fatality_Ratio (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 648 | NA | | | Taiwan* | 2021-08-26 04:21:28 | 23.7 | 121 | 15939 | 830 | NA | NA | Taiwan* | 66.92342 | 5.207353 |


In [10]:
# filtering a row by its row index
tw_row_index <- row.names(daily_report[daily_report$Country_Region == "Taiwan*", ])
daily_report[as.integer(tw_row_index), ]

| | FIPS (int) | Admin2 (chr) | Province_State (chr) | Country_Region (chr) | Last_Update (chr) | Lat (dbl) | Long_ (dbl) | Confirmed (int) | Deaths (int) | Recovered (lgl) | Active (lgl) | Combined_Key (chr) | Incident_Rate (dbl) | Case_Fatality_Ratio (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 648 | NA | | | Taiwan* | 2021-08-26 04:21:28 | 23.7 | 121 | 15939 | 830 | NA | NA | Taiwan* | 66.92342 | 5.207353 |


In [11]:
# filtering rows
daily_report[daily_report$Country_Region == "Australia", ]

| | FIPS (int) | Admin2 (chr) | Province_State (chr) | Country_Region (chr) | Last_Update (chr) | Lat (dbl) | Long_ (dbl) | Confirmed (int) | Deaths (int) | Recovered (lgl) | Active (lgl) | Combined_Key (chr) | Incident_Rate (dbl) | Case_Fatality_Ratio (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | NA | | Australian Capital Territory | Australia | 2021-08-26 04:21:28 | -35.4735 | 149.0124 | 314 | 3 | NA | NA | Australian Capital Territory, Australia | 73.34735 | 0.955414 |
| 10 | NA | | New South Wales | Australia | 2021-08-26 04:21:28 | -33.8688 | 151.2093 | 21471 | 133 | NA | NA | New South Wales, Australia | 264.48633 | 0.6194402 |
| 11 | NA | | Northern Territory | Australia | 2021-08-26 04:21:28 | -12.4634 | 130.8456 | 200 | 0 | NA | NA | Northern Territory, Australia | 81.43322 | 0.0 |
| 12 | NA | | Queensland | Australia | 2021-08-26 04:21:28 | -27.4698 | 153.0251 | 1972 | 7 | NA | NA | Queensland, Australia | 38.54951 | 0.3549696 |
| 13 | NA | | South Australia | Australia | 2021-08-26 04:21:28 | -34.9285 | 138.6007 | 870 | 4 | NA | NA | South Australia, Australia | 49.53032 | 0.4597701 |
| 14 | NA | | Tasmania | Australia | 2021-08-26 04:21:28 | -42.8821 | 147.3272 | 235 | 13 | NA | NA | Tasmania, Australia | 43.88422 | 5.5319149 |
| 15 | NA | | Victoria | Australia | 2021-08-26 04:21:28 | -37.8136 | 144.9631 | 21694 | 820 | NA | NA | Victoria, Australia | 327.21459 | 3.779847 |
| 16 | NA | | Western Australia | Australia | 2021-08-26 04:21:28 | -31.9505 | 115.8605 | 1084 | 9 | NA | NA | Western Australia, Australia | 41.20733 | 0.8302583 |


## Subsetting a data frame: selecting and filtering

Use the `df[CONDITION, COLUMN_NAMES]` or `df[ROW_INDEX, COLUMN_NAMES]` syntax.

In [12]:
daily_report[daily_report$Country_Region == "Taiwan*", c("Country_Region", "Confirmed")]
daily_report[tw_row_index, c("Country_Region", "Confirmed")]

| | Country_Region (chr) | Confirmed (int) |
|---|---|---|
| 648 | Taiwan* | 15939 |

| | Country_Region (chr) | Confirmed (int) |
|---|---|---|
| 648 | Taiwan* | 15939 |


## Sorting a data frame

Use the `order()` function to get the row indices in sorted order.
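
`order()` sorts in ascending order by default. The sketch below, on a made-up `toy_report` data frame (an assumption for illustration), shows descending order and a tie-break on a second column.

```r
# Hypothetical data; rows B and D tie on Deaths.
toy_report <- data.frame(
    Country_Region = c("A", "B", "C", "D"),
    Deaths = c(5L, 10L, 0L, 10L),
    Confirmed = c(100L, 250L, 30L, 400L),
    stringsAsFactors = FALSE
)

ascending_index <- order(toy_report$Deaths)   # 3 1 2 4
descending_index <- order(toy_report$Deaths, decreasing = TRUE)

# Negating numeric keys sorts them in descending order;
# ties in Deaths are broken by Confirmed, largest first.
tie_broken_index <- order(-toy_report$Deaths, -toy_report$Confirmed)   # 4 2 1 3

print(toy_report[descending_index, ])
print(toy_report[tie_broken_index, ])
```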

In [13]:
# ordering dataframe
ordered_index <- order(daily_report$Deaths)
head(daily_report[ordered_index, ])

| | FIPS (int) | Admin2 (chr) | Province_State (chr) | Country_Region (chr) | Last_Update (chr) | Lat (dbl) | Long_ (dbl) | Confirmed (int) | Deaths (int) | Recovered (lgl) | Active (lgl) | Combined_Key (chr) | Incident_Rate (dbl) | Case_Fatality_Ratio (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | NA | | Northern Territory | Australia | 2021-08-26 04:21:28 | -12.4634 | 130.8456 | 200 | 0 | NA | NA | Northern Territory, Australia | 81.43322 | 0 |
| 24 | NA | | Antwerp | Belgium | 2021-08-26 04:21:28 | 51.2195 | 4.4024 | 155068 | 0 | NA | NA | Antwerp, Belgium | 8346.02629 | 0 |
| 25 | NA | | Brussels | Belgium | 2021-08-26 04:21:28 | 50.8503 | 4.3517 | 150646 | 0 | NA | NA | Brussels, Belgium | 12465.10258 | 0 |
| 26 | NA | | East Flanders | Belgium | 2021-08-26 04:21:28 | 51.0362 | 3.7373 | 134713 | 0 | NA | NA | East Flanders, Belgium | 8891.57158 | 0 |
| 27 | NA | | Flemish Brabant | Belgium | 2021-08-26 04:21:28 | 50.9167 | 4.5833 | 92561 | 0 | NA | NA | Flemish Brabant, Belgium | 8075.6429 | 0 |
| 28 | NA | | Hainaut | Belgium | 2021-08-26 04:21:28 | 50.5257 | 4.0621 | 167561 | 0 | NA | NA | Hainaut, Belgium | 12465.10112 | 0 |


## Adding variables

In [14]:
daily_report$Death_Rate <- daily_report$Deaths / daily_report$Confirmed
print("Death_Rate" %in% colnames(daily_report))
print(daily_report$Death_Rate[1:10])

[1] TRUE
 [1] 0.046424222 0.017564461 0.026380771 0.008658585 0.025269129 0.026908636
 [7] 0.021525567 0.019986949 0.009554140 0.006194402
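
The `summary()` output earlier shows a minimum of 0 for `Confirmed`, and dividing 0 by 0 gives `NaN`. One possible guard, sketched on a made-up `toy_report` data frame (the `Death_Rate_Safe` column name is an assumption for illustration), is to compute the ratio only where `Confirmed` is positive:

```r
# Hypothetical data; the second row has no confirmed cases.
toy_report <- data.frame(
    Confirmed = c(100L, 0L, 250L),
    Deaths = c(5L, 0L, 10L)
)

# Plain division: 0 / 0 evaluates to NaN.
toy_report$Death_Rate <- toy_report$Deaths / toy_report$Confirmed

# Guarded division: rows without confirmed cases get NA instead of NaN.
toy_report$Death_Rate_Safe <- ifelse(toy_report$Confirmed > 0,
                                     toy_report$Deaths / toy_report$Confirmed,
                                     NA_real_)
print(toy_report)
```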


## Summarizing

Apply descriptive statistics functions to the variable you want to summarize.

In [15]:
print(mean(daily_report$Deaths))

[1] 1119.572
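
Descriptive statistics functions return `NA` when the variable contains missing values, unless `na.rm = TRUE` is set. A short sketch on a made-up vector (an assumption, not the downloaded `Deaths` column):

```r
# Hypothetical death counts with one missing value.
deaths <- c(5L, 10L, NA, 0L)

print(mean(deaths))                  # NA, because one value is missing
print(mean(deaths, na.rm = TRUE))    # 5
print(median(deaths, na.rm = TRUE))
print(range(deaths, na.rm = TRUE))
```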


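## Manipulating data frames with `dplyr` functions

## Using `dplyr` to manipulate data frames

- Install the `dplyr` package.
- Load the `dplyr` package.

## Installing the `dplyr` package

- Through the `Packages` tab in RStudio.
- Through the `install.packages()` function.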

```r
install.packages("dplyr")
```

## Loading the `dplyr` package

- Through the `Packages` tab in RStudio.
- Through the `library()` function.

In [16]:
suppressMessages(library("dplyr"))

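## Chaining functions with the `%>%` pipe operator

- The `%>%` operator comes from `magrittr`, which is installed together with `dplyr`.
- It makes data operations that chain several functions easier to read.
- In RStudio, the `Ctrl-Shift-M` shortcut (`Cmd-Shift-M` on macOS) inserts the `%>%` operator.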
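
To see the readability point concretely, the sketch below writes the same two-step operation as nested calls and as a pipe. It uses a made-up `toy_report` data frame (an assumption for illustration) and assumes `dplyr` is installed.

```r
suppressMessages(library(dplyr))

# Hypothetical data with the same column names as the daily report.
toy_report <- data.frame(
    Country_Region = c("A", "B", "C"),
    Deaths = c(5L, 10L, 0L)
)

# Nested calls are read from the inside out.
nested <- head(arrange(toy_report, Deaths), 2)

# The pipe passes the left-hand side as the first argument of the next call,
# so the steps read from top to bottom.
piped <- toy_report %>%
    arrange(Deaths) %>%
    head(2)

print(piped)
```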

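## Basic data frame manipulation with `dplyr`

- Subsetting a data frame.
    - Selecting and filtering.
- Sorting a data frame.
- Adding variables.
- Summarizing.
- Grouped summaries.

## Subsetting a data frame: selecting and filtering

- Use the `dplyr::select()` function to select columns.
- Use the `dplyr::filter()` function to filter rows.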

In [17]:
head(select(daily_report, c(Country_Region, Confirmed, Deaths)))

| | Country_Region (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|---|
| 1 | Afghanistan | 152722 | 7090 |
| 2 | Albania | 141365 | 2483 |
| 3 | Algeria | 193171 | 5096 |
| 4 | Andorra | 15014 | 130 |
| 5 | Angola | 46539 | 1176 |
| 6 | Antigua and Barbuda | 1598 | 43 |


In [18]:
filter(daily_report, Country_Region == "Taiwan*")

| FIPS (int) | Admin2 (chr) | Province_State (chr) | Country_Region (chr) | Last_Update (chr) | Lat (dbl) | Long_ (dbl) | Confirmed (int) | Deaths (int) | Recovered (lgl) | Active (lgl) | Combined_Key (chr) | Incident_Rate (dbl) | Case_Fatality_Ratio (dbl) | Death_Rate (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | | | Taiwan* | 2021-08-26 04:21:28 | 23.7 | 121 | 15939 | 830 | NA | NA | Taiwan* | 66.92342 | 5.207353 | 0.05207353 |


In [19]:
# without using %>%
filter(select(daily_report, c(Country_Region, Confirmed, Deaths)), Country_Region == "Taiwan*")

| Country_Region (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|
| Taiwan* | 15939 | 830 |


In [20]:
# using %>%
daily_report %>% 
    select(c(Country_Region, Confirmed, Deaths)) %>% 
    filter(Country_Region == "Taiwan*")

| Country_Region (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|
| Taiwan* | 15939 | 830 |


## Sorting a data frame

Use the `dplyr::arrange()` function to sort.

In [21]:
daily_report %>% 
    arrange(Deaths) %>% 
    head()

| | FIPS (int) | Admin2 (chr) | Province_State (chr) | Country_Region (chr) | Last_Update (chr) | Lat (dbl) | Long_ (dbl) | Confirmed (int) | Deaths (int) | Recovered (lgl) | Active (lgl) | Combined_Key (chr) | Incident_Rate (dbl) | Case_Fatality_Ratio (dbl) | Death_Rate (dbl) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | | Northern Territory | Australia | 2021-08-26 04:21:28 | -12.4634 | 130.8456 | 200 | 0 | NA | NA | Northern Territory, Australia | 81.43322 | 0 | 0 |
| 2 | NA | | Antwerp | Belgium | 2021-08-26 04:21:28 | 51.2195 | 4.4024 | 155068 | 0 | NA | NA | Antwerp, Belgium | 8346.02629 | 0 | 0 |
| 3 | NA | | Brussels | Belgium | 2021-08-26 04:21:28 | 50.8503 | 4.3517 | 150646 | 0 | NA | NA | Brussels, Belgium | 12465.10258 | 0 | 0 |
| 4 | NA | | East Flanders | Belgium | 2021-08-26 04:21:28 | 51.0362 | 3.7373 | 134713 | 0 | NA | NA | East Flanders, Belgium | 8891.57158 | 0 | 0 |
| 5 | NA | | Flemish Brabant | Belgium | 2021-08-26 04:21:28 | 50.9167 | 4.5833 | 92561 | 0 | NA | NA | Flemish Brabant, Belgium | 8075.6429 | 0 | 0 |
| 6 | NA | | Hainaut | Belgium | 2021-08-26 04:21:28 | 50.5257 | 4.0621 | 167561 | 0 | NA | NA | Hainaut, Belgium | 12465.10112 | 0 | 0 |


## Adding variables

Use the `dplyr::mutate()` function to add variables.

In [22]:
daily_report <- get_daily_report()
daily_report %>% 
    mutate(Death_Rate = Deaths / Confirmed) %>% 
    select(Country_Region, Death_Rate) %>% 
    head()

| | Country_Region (chr) | Death_Rate (dbl) |
|---|---|---|
| 1 | Afghanistan | 0.046424222 |
| 2 | Albania | 0.017564461 |
| 3 | Algeria | 0.026380771 |
| 4 | Andorra | 0.008658585 |
| 5 | Angola | 0.025269129 |
| 6 | Antigua and Barbuda | 0.026908636 |


## Summarizing

Use the `dplyr::summarise()` function to summarize.

In [23]:
daily_report %>% 
    summarise(avg_deaths = mean(Deaths))

| avg_deaths (dbl) |
|---|
| 1119.572 |


## Grouped summaries

Use `dplyr::group_by()` together with `dplyr::summarise()` to summarize by group.

In [24]:
daily_report %>% 
    group_by(Country_Region) %>% 
    summarise(avg_deaths = mean(Deaths)) %>% 
    head()

`summarise()` ungrouping output (override with `.groups` argument)



| Country_Region (chr) | avg_deaths (dbl) |
|---|---|
| Afghanistan | 7090 |
| Albania | 2483 |
| Algeria | 5096 |
| Andorra | 130 |
| Angola | 1176 |
| Antigua and Barbuda | 43 |
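
The message printed by `summarise()` above mentions a `.groups` argument. The sketch below, on a made-up `toy_report` data frame (an assumption for illustration), sets `.groups = "drop"` so the result is ungrouped and no message is printed, and computes several summaries in one call. It assumes `dplyr` 1.0.0 or later, where `.groups` is available.

```r
suppressMessages(library(dplyr))

# Hypothetical data; country A has two rows, country B has one.
toy_report <- data.frame(
    Country_Region = c("A", "A", "B"),
    Deaths = c(5L, 15L, 2L)
)

deaths_by_country <- toy_report %>%
    group_by(Country_Region) %>%
    summarise(n_rows = n(),                  # number of rows per group
              total_deaths = sum(Deaths),
              avg_deaths = mean(Deaths),
              .groups = "drop")              # return an ungrouped tibble
print(deaths_by_country)
```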


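## Advanced data frame manipulation

## Advanced data frame manipulation includes

- Reshaping a data frame.
    - Wide format to long format.
    - Long format to wide format.
- Combining data frames vertically.
- Combining data frames horizontally.

## Reshaping data frames with `tidyr`

- Install the `tidyr` package.
- Load the `tidyr` package.

## Installing the `tidyr` package

- Through the `Packages` tab in RStudio.
- Through the `install.packages()` function.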

```r
install.packages("tidyr")
```

## Loading the `tidyr` package

- Through the `Packages` tab in RStudio.
- Through the `library()` function.

In [25]:
library("tidyr")

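## The two most important `tidyr` functions

- `tidyr::pivot_longer()` converts wide format to long format.
- `tidyr::pivot_wider()` converts long format to wide format.

## What are wide and long formats?

- Wide format uses one column per category: the column name records the category and the cells record its values.
- Long format uses two columns: one records the category and the other records the value.

Source: <https://en.wikipedia.org/wiki/Wide_and_narrow_data>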
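
As a concrete illustration, the sketch below builds the same four made-up values by hand in both formats. The data frames `wide` and `long` and their values are assumptions for illustration only.

```r
# Wide format: one column per day (made-up values).
wide <- data.frame(
    Country = c("A", "B"),
    day_1 = c(1L, 10L),
    day_2 = c(2L, 20L)
)

# Long format: the same four values, one row per country-day pair.
long <- data.frame(
    Country = c("A", "A", "B", "B"),
    Date = c("day_1", "day_2", "day_1", "day_2"),
    Confirmed = c(1L, 2L, 10L, 20L)
)

print(dim(wide))   # 2 rows, 3 columns
print(dim(long))   # 4 rows, 3 columns
```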

## Why reshape a data frame?

Reshape when the column names contain data values you need, or when the storage format does not fit the task at hand.

In [26]:
get_time_series_confirmed <- function() {
    csv_url <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
                      "csse_covid_19_data/csse_covid_19_time_series/",
                      "time_series_covid19_confirmed_global.csv")
    time_series_confirmed <- read.csv(csv_url)
    return(time_series_confirmed)
}
time_series_confirmed <- get_time_series_confirmed()
print(dim(time_series_confirmed))

[1] 279 587


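## Wide to long with the `tidyr` `pivot_longer()` function

- `cols`: the columns to pivot.
- `names_to`: the name of the new column that holds the old column names, here `Date`.
- `values_to`: the name of the new column that holds the values, here `Confirmed`.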

In [27]:
# columns 1 to 4 hold identifiers (province, country, latitude, longitude); the rest are dates
cols_to_pivot_longer <- colnames(time_series_confirmed)[5:ncol(time_series_confirmed)]
time_series_confirmed_long <- time_series_confirmed[, c(2, 5:ncol(time_series_confirmed))] %>% 
    pivot_longer(cols = all_of(cols_to_pivot_longer),
                 names_to = "Date",
                 values_to = "Confirmed"
                )

In [28]:
print(dim(time_series_confirmed_long))
head(time_series_confirmed_long)

[1] 162657      3


| Country.Region (chr) | Date (chr) | Confirmed (int) |
|---|---|---|
| Afghanistan | X1.22.20 | 0 |
| Afghanistan | X1.23.20 | 0 |
| Afghanistan | X1.24.20 | 0 |
| Afghanistan | X1.25.20 | 0 |
| Afghanistan | X1.26.20 | 0 |
| Afghanistan | X1.27.20 | 0 |
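
The `Date` values look like `X1.22.20` because `read.csv()` turns headers such as `1/22/20` into syntactically valid names. Left as text, they sort alphabetically rather than chronologically. One possible follow-up step, sketched here on a made-up `toy_wide` data frame (an assumption, not the downloaded time series), is to parse them with `as.Date()`:

```r
suppressMessages(library(dplyr))
library(tidyr)

# Column names mimic what read.csv() produces from headers such as "1/22/20".
toy_wide <- data.frame(
    Country.Region = c("A", "B"),
    X1.22.20 = c(0L, 1L),
    X1.23.20 = c(2L, 3L)
)

toy_long <- toy_wide %>%
    pivot_longer(cols = starts_with("X"),
                 names_to = "Date",
                 values_to = "Confirmed") %>%
    # "X%m.%d.%y" reads the literal X, then month, day, and two-digit year.
    mutate(Date = as.Date(Date, format = "X%m.%d.%y"))
print(toy_long)
```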


In [29]:
time_series_confirmed_long  <- time_series_confirmed_long %>% 
    group_by(Country.Region, Date) %>% 
    summarise(Confirmed = sum(Confirmed))
print(dim(time_series_confirmed_long))

`summarise()` regrouping output by 'Country.Region' (override with `.groups` argument)



[1] 113685      3


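## Long to wide with the `tidyr` `pivot_wider()` function

- `id_cols`: the variable that uniquely identifies each observation, here `Country.Region`.
- `names_from`: the column whose values become the new column names, here `Date`.
- `values_from`: the column whose values fill the new columns, here `Confirmed`.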
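
The three arguments can be seen at a glance on a tiny example. The sketch below uses a made-up `toy_long` data frame (an assumption for illustration) and assumes `tidyr` is installed.

```r
library(tidyr)

# Hypothetical long-format data: one row per country-date pair.
toy_long <- data.frame(
    Country.Region = c("A", "A", "B", "B"),
    Date = c("d1", "d2", "d1", "d2"),
    Confirmed = c(0L, 2L, 1L, 3L)
)

toy_wide <- pivot_wider(toy_long,
                        id_cols = Country.Region,   # one output row per country
                        names_from = Date,          # d1 and d2 become column names
                        values_from = Confirmed)    # cells are filled from Confirmed
print(toy_wide)
```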

In [30]:
time_series_confirmed_wide <- time_series_confirmed_long %>% 
    pivot_wider(id_cols = Country.Region,
                names_from = "Date",
                values_from = "Confirmed"
               )
print(dim(time_series_confirmed_wide))
head(time_series_confirmed_wide)

[1] 195 584


| Country.Region (chr) | X1.1.21 (int) | X1.10.21 (int) | X1.11.21 (int) | X1.12.21 (int) | X1.13.21 (int) | X1.14.21 (int) | X1.15.21 (int) | X1.16.21 (int) | X1.17.21 (int) | ⋯ | X9.28.20 (int) | X9.29.20 (int) | X9.3.20 (int) | X9.30.20 (int) | X9.4.20 (int) | X9.5.20 (int) | X9.6.20 (int) | X9.7.20 (int) | X9.8.20 (int) | X9.9.20 (int) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 51526 | 53489 | 53538 | 53584 | 53584 | 53775 | 53831 | 53938 | 53984 | ⋯ | 39239 | 39254 | 38288 | 39268 | 38304 | 38324 | 38398 | 38494 | 38520 | 38544 |
| Albania | 58316 | 63595 | 63971 | 64627 | 65334 | 65994 | 66635 | 67216 | 67690 | ⋯ | 13391 | 13518 | 9844 | 13649 | 9967 | 10102 | 10255 | 10406 | 10553 | 10704 |
| Algeria | 99897 | 102144 | 102369 | 102641 | 102860 | 103127 | 103381 | 103611 | 103833 | ⋯ | 51213 | 51368 | 45469 | 51530 | 45773 | 46071 | 46364 | 46653 | 46938 | 47216 |
| Andorra | 8117 | 8586 | 8586 | 8682 | 8818 | 8868 | 8946 | 9038 | 9083 | ⋯ | 1966 | 1966 | 1199 | 2050 | 1215 | 1215 | 1215 | 1261 | 1261 | 1301 |
| Angola | 17568 | 18193 | 18254 | 18343 | 18425 | 18613 | 18679 | 18765 | 18875 | ⋯ | 4797 | 4905 | 2805 | 4972 | 2876 | 2935 | 2965 | 2981 | 3033 | 3092 |
| Antigua and Barbuda | 159 | 176 | 176 | 176 | 176 | 184 | 184 | 187 | 189 | ⋯ | 101 | 101 | 95 | 101 | 95 | 95 | 95 | 95 | 95 | 95 |


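## Combining data frames with `dplyr` functions

- Vertical combination.
    - The `dplyr::bind_rows()` function.
- Horizontal combination.
    - The `dplyr::bind_cols()` function.
    - The `dplyr::..._join()` functions.

## The `bind_rows()` function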

In [31]:
twn <- time_series_confirmed_wide[time_series_confirmed_wide$Country.Region == "Taiwan*", ]
jpn <- time_series_confirmed_wide[time_series_confirmed_wide$Country.Region == "Japan", ]
twn %>% 
    bind_rows(jpn)

| Country.Region (chr) | X1.1.21 (int) | X1.10.21 (int) | X1.11.21 (int) | X1.12.21 (int) | X1.13.21 (int) | X1.14.21 (int) | X1.15.21 (int) | X1.16.21 (int) | X1.17.21 (int) | ⋯ | X9.28.20 (int) | X9.29.20 (int) | X9.3.20 (int) | X9.30.20 (int) | X9.4.20 (int) | X9.5.20 (int) | X9.6.20 (int) | X9.7.20 (int) | X9.8.20 (int) | X9.9.20 (int) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Taiwan* | 802 | 828 | 834 | 838 | 842 | 842 | 843 | 851 | 855 | ⋯ | 513 | 513 | 489 | 514 | 490 | 492 | 493 | 494 | 495 | 495 |
| Japan | 239068 | 288818 | 293746 | 298321 | 304140 | 310734 | 317871 | 324942 | 330715 | ⋯ | 82484 | 83022 | 70278 | 83591 | 70866 | 71467 | 71918 | 72213 | 72724 | 73264 |
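
In the example above both inputs have identical columns. `bind_rows()` matches columns by name, so inputs with different columns can also be stacked; the sketch below uses two made-up one-row data frames (assumptions for illustration) to show that missing cells are filled with `NA`.

```r
suppressMessages(library(dplyr))

# Hypothetical inputs that share only some of their columns.
first <- data.frame(Country.Region = "A", day_1 = 1L, day_2 = 2L)
second <- data.frame(Country.Region = "B", day_2 = 20L, day_3 = 30L)

# Columns are matched by name; cells missing from one input become NA.
stacked <- bind_rows(first, second)
print(stacked)
```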


## The `bind_cols()` function

In [32]:
get_time_series_deaths <- function() {
    csv_url <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
                      "csse_covid_19_data/csse_covid_19_time_series/",
                      "time_series_covid19_deaths_global.csv")
    time_series_deaths <- read.csv(csv_url)
    return(time_series_deaths)
}
time_series_deaths <- get_time_series_deaths()
print(dim(time_series_deaths))

[1] 279 587


In [33]:
time_series_deaths_long <- time_series_deaths[, c(2, 5:ncol(time_series_deaths))] %>% 
    pivot_longer(cols = all_of(cols_to_pivot_longer),
                 names_to = "Date",
                 values_to = "Deaths"
                )
time_series_deaths_long <- time_series_deaths_long %>% 
    group_by(Country.Region, Date) %>% 
    summarise(Deaths = sum(Deaths))

`summarise()` regrouping output by 'Country.Region' (override with `.groups` argument)
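
The cells above prepare the deaths data but do not call `bind_cols()` itself. As a minimal sketch of what the function does, on two made-up data frames (assumptions for illustration, not the downloaded time series):

```r
suppressMessages(library(dplyr))

# Hypothetical inputs with the same number of rows, in the same order.
confirmed <- data.frame(Country.Region = c("A", "B"), Confirmed = c(100L, 250L))
deaths <- data.frame(Deaths = c(5L, 10L))

# bind_cols() places columns side by side purely by row position;
# it does not check that the rows describe the same observations.
combined <- bind_cols(confirmed, deaths)
print(combined)
```

Because rows are matched by position only, a join on a shared key is usually the safer choice when the inputs might be ordered differently.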



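## The `..._join()` functions

- `left_join()` keeps every observation from the left table.
- `right_join()` keeps every observation from the right table.
- `inner_join()` keeps only the observations whose keys appear in both tables.
- `full_join()` keeps every observation from both tables.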
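
The difference between the four joins shows up in the row counts. The sketch below uses two made-up two-row tables, `left_tbl` and `right_tbl` (assumptions for illustration), that share only the `Taiwan*` key.

```r
suppressMessages(library(dplyr))

# Hypothetical tables: only "Taiwan*" appears in both.
left_tbl <- data.frame(Country.Region = c("Taiwan*", "Japan"),
                       Confirmed = c(10L, 20L))
right_tbl <- data.frame(Country.Region = c("Taiwan*", "Korea, South"),
                        Deaths = c(1L, 2L))

left_result <- left_join(left_tbl, right_tbl, by = "Country.Region")
right_result <- right_join(left_tbl, right_tbl, by = "Country.Region")
inner_result <- inner_join(left_tbl, right_tbl, by = "Country.Region")
full_result <- full_join(left_tbl, right_tbl, by = "Country.Region")

# Rows per join: left 2, right 2, inner 1, full 3.
print(c(left = nrow(left_result), right = nrow(right_result),
        inner = nrow(inner_result), full = nrow(full_result)))
```

Rows without a match on the other side are kept with `NA` in the columns that come from the other table, as with `Deaths` for Japan in the left join.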

In [34]:
condition_tw_jp <- time_series_confirmed_long$Country.Region %in% c("Taiwan*", "Japan")
# build the second condition from the deaths table, because it is used to index that table
condition_tw_kr <- time_series_deaths_long$Country.Region %in% c("Taiwan*", "Korea, South")
confirmed_left <- time_series_confirmed_long[condition_tw_jp, ]
deaths_right <- time_series_deaths_long[condition_tw_kr, ]

In [35]:
confirmed_left %>% 
    left_join(deaths_right, by = c("Country.Region", "Date"))

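| Country.Region (chr) | Date (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|---|
| Japan | X1.1.21 | 239068 | NA |
| Japan | X1.10.21 | 288818 | NA |
| Japan | X1.11.21 | 293746 | NA |
| Japan | X1.12.21 | 298321 | NA |
| Japan | X1.13.21 | 304140 | NA |
| Japan | X1.14.21 | 310734 | NA |
| Japan | X1.15.21 | 317871 | NA |
| Japan | X1.16.21 | 324942 | NA |
| Japan | X1.17.21 | 330715 | NA |
| Japan | X1.18.21 | 335605 | NA |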


In [36]:
confirmed_left %>% 
    right_join(deaths_right, by = c("Country.Region", "Date"))

| Country.Region (chr) | Date (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|---|
| Taiwan* | X1.1.21 | 802 | 7 |
| Taiwan* | X1.10.21 | 828 | 7 |
| Taiwan* | X1.11.21 | 834 | 7 |
| Taiwan* | X1.12.21 | 838 | 7 |
| Taiwan* | X1.13.21 | 842 | 7 |
| Taiwan* | X1.14.21 | 842 | 7 |
| Taiwan* | X1.15.21 | 843 | 7 |
| Taiwan* | X1.16.21 | 851 | 7 |
| Taiwan* | X1.17.21 | 855 | 7 |
| Taiwan* | X1.18.21 | 862 | 7 |


In [37]:
confirmed_left %>% 
    inner_join(deaths_right, by = c("Country.Region", "Date"))

| Country.Region (chr) | Date (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|---|
| Taiwan* | X1.1.21 | 802 | 7 |
| Taiwan* | X1.10.21 | 828 | 7 |
| Taiwan* | X1.11.21 | 834 | 7 |
| Taiwan* | X1.12.21 | 838 | 7 |
| Taiwan* | X1.13.21 | 842 | 7 |
| Taiwan* | X1.14.21 | 842 | 7 |
| Taiwan* | X1.15.21 | 843 | 7 |
| Taiwan* | X1.16.21 | 851 | 7 |
| Taiwan* | X1.17.21 | 855 | 7 |
| Taiwan* | X1.18.21 | 862 | 7 |


In [38]:
confirmed_left %>% 
    full_join(deaths_right, by = c("Country.Region", "Date"))

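| Country.Region (chr) | Date (chr) | Confirmed (int) | Deaths (int) |
|---|---|---|---|
| Japan | X1.1.21 | 239068 | NA |
| Japan | X1.10.21 | 288818 | NA |
| Japan | X1.11.21 | 293746 | NA |
| Japan | X1.12.21 | 298321 | NA |
| Japan | X1.13.21 | 304140 | NA |
| Japan | X1.14.21 | 310734 | NA |
| Japan | X1.15.21 | 317871 | NA |
| Japan | X1.16.21 | 324942 | NA |
| Japan | X1.17.21 | 330715 | NA |
| Japan | X1.18.21 | 335605 | NA |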


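## Key takeaways

- Use the `%>%` pipe operator to chain functions.
- Use the `dplyr::select()` function to select columns.
- Use the `dplyr::filter()` function to filter rows.
- Use the `dplyr::arrange()` function to sort.
- Use the `dplyr::mutate()` function to add variables.
- Use `dplyr::group_by()` together with `dplyr::summarise()` to summarize by group.

## Key takeaways (continued)

- Use the `tidyr::pivot_longer()` function to convert wide format to long format.
- Use the `tidyr::pivot_wider()` function to convert long format to wide format.

## Key takeaways (continued)

- Vertical combination.
    - The `dplyr::bind_rows()` function.
- Horizontal combination.
    - The `dplyr::bind_cols()` function.
    - The `dplyr::..._join()` functions.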