/
temp.Rmd
365 lines (255 loc) · 7.84 KB
/
temp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
---
title: "R ETL Crash Course"
subtitle: Week 2
author:
name: "Summit Suen"
affiliation: 木刻思股份有限公司
email: "course@agilearning.io, summit.suen@data-sci.info"
website: "https://data-sci.info"
date: "January 04, 2016"
output:
html_document:
toc: true
theme: readable
highlight: espresso
css: main.css
---
---
## 暖身
### 環境設定:rJava Installation
在 Windows 系統裡使用 R,有時候會遇到一些棘手的 error;其中除了跟系統編碼(顯示亂碼)相關的問題以外,有一大宗都跟 `rJava` 這個套件有關。
今天先帶領大家操作一次,當遇到問題時,可以如何去解決。
1. 下載 64 bits 版本的 Java(for Windows)[Download](http://javadl.sun.com/webapps/download/AutoDL?BundleId=113219)
<<<<<截圖>>>>>
2. 設定環境變數
```{r, eval=FALSE}
Sys.getenv("JAVA_HOME")
Sys.setenv("JAVA_HOME"="path/where/java/installed")
```
<<<<<截圖>>>>>
<!-- end of list -->
---
### 練習操作:R Basic Recap
小提示 Tips
- 工作目錄 working directory
- `getwd`
- `setwd`
- `dir`
- 自動補齊 autocompletion
- `tab`
- 執行命令 source
- `control` + `enter`
- 註解增減 comment
- `shift` + `control` + `c`
<!-- end of list -->
```{r}
## View Data
head(iris)
tail(iris)
## View(iris)
summary(iris)
## explor object
attributes(iris)
str(iris)
class(iris)
## data shape
dim(iris)
ncol(iris)
nrow(iris)
```
---
## 今天的題目是 ETL:什麼是 ETL?
### E = Extraction
### T = Transform
### L = Load
### 今天會用到的套件/重要函式:
### - `readLine` | `read.table`
### - `readxl` | `xlsx` | `XLConnect` | `iconv`
### - `magrittr`
### - `dplyr` | `reshape2`
### - `RSQLite` | `DBI`
<!-- end of list -->
---
## Data Schema Matters
資料結構是重要的
以第一週 Yahoo Stock 資料為例
[2451 創見](http://tw.stock.yahoo.com/d/s/major_2451.html)
```{r}
library(httr)
library(rvest)
## Connector
Target_URL = "http://tw.stock.yahoo.com/d/s/major_2451.html"
res <- GET(Target_URL)
doc_str <- content(res, type = "text", encoding = "big5")
## Parser
if (.Platform$OS.type == "windows"){
Sys.setlocale(category='LC_ALL', locale='C')
data_table <- doc_str %>% read_html(encoding = "big-5") %>% html_nodes(xpath = "//table[1]//table[2]") %>% html_table(header=TRUE)
Sys.setlocale(category='LC_ALL', locale='cht')
data_table <- apply(data_table[[1]],2,function(x) iconv(x,from = "utf8"))
colnames(data_table) <- iconv(colnames(data_table), from = "utf8")
} else{
data_table <- doc_str %>% read_html(encoding = "big-5") %>% html_nodes(xpath = "//table[1]//table[2]") %>% html_table(header=TRUE)
data_table <- data_table[[1]]
}
# View(data_table)
data_table
```
左右兩欄的表格應該整併成同一欄
並且加入時間欄位
塞進資料庫/檔案做儲存
- 範例:Yahoo Stock
- 練習:Yahoo Stock
- 作業:無
<!-- end of list -->
---
## 基本資料讀寫
其他都可以忘記,至少要記得 `read.table`
```{r}
library(RCurl)
Cl_info = read.table(sep=",", header=TRUE, stringsAsFactors=F, file=textConnection(getURL("https://raw.githubusercontent.com/ntuaha/R_ETL_DSC_2015/gh-pages/cl_info_other.csv")))
# Cl_info = read.table(file="./cl_info_other.csv",sep=",",stringsAsFactors=F,header=T)
head(Cl_info)
```
[銀行局](https://survey.banking.gov.tw/statis/stmain.jsp?sys=100&funid=r100)
- 範例:房貸餘額 csv
- 練習:
- 作業:無
<!-- end of list -->
---
## 處理 Excel 資料
### Excel 不等於 Table!
### - `gdata`
[在 Windows 系統中需要另外安裝 Perl](https://cran.r-project.org/web/packages/gdata/INSTALL)
```{r}
## Need Perl on Windows
library(gdata)
```
### - `XLConnect` and `xlsx`
需要 `rJava`,另外在處理中文編碼時需要使用 `iconv` 輔助。
```{r}
## Need Java/rJava
library(XLConnect)
library(xlsx)
```
### - `readxl`
Made by Hadley 品質保證,powerd by `RCpp`
```{r}
##
library(readxl)
```
<!-- end of list -->
---
## 觀察
首先,觀察檔案的格式、欄位、資料型態
### 利用 `readLines`,先把前幾行資料讀進來觀察
```{r}
filename <- "data/29010.xls"
filename <- "data/10401信用卡重要資訊揭露.xlsx"
raw <- readLines(filename, n = 10L, encoding = "BIG-5", warn = FALSE)
raw <- iconv(raw, from = "BIG-5", to = "UTF-8")
```
### 預先設定欄位的大小及資料型態 `colClasses`,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)
### 一般來說,使用 `XLConnect::existsSheet` 取得 Excel 檔案的表單資訊(sheet),再用 `xlsx::read.xlsx2` 讀檔(較快)
```{r}
library(xlsx)
library(magrittr)
filename <- "data/29010.xls"
filename <- "data/10401信用卡重要資訊揭露.xlsx"
credit_rownames <- read.xlsx2(filename, sheetIndex = 1, startRow = 4, endRow = 8, as.data.frame = TRUE, header = FALSE, colClasses = c("character", "character", "character", "character"))
credit_10401 <- read.xlsx2(filename, sheetIndex = 1, startRow = 9, endRow = 45, as.data.frame = TRUE, header = FALSE, colClasses = c("character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "numeric", "numeric", "numeric", "integer", "integer"))
# 放 View 的 table 截圖
# View(credit_10401)
str(credit_10401)
# as.character(credit_10401[1,][7][[1]])
# as.character(credit_10401[1:4,][7][[1]])
# paste0(as.character(credit_10401[1:4,][7][[1]]))
# as.character(credit_10401[4,])
# colnames(credit_10401) <- as.character(credit_10401[4,])
# credit_10401 <- credit_10401[8:45,]
# 放 View 的 table 截圖
credit_10401_2 <- credit_10401
credit_10401_2[-1] <- lapply(credit_10401[-1], as.character)
credit_10401_2[-1] <- lapply(credit_10401_2[-1], as.numeric)
# View(credit_10401_2)
str(credit_10401_2)
row.names(credit_10401) <- NULL
```
```{r}
library(XLConnect)
filename <- "data/29010.xls"
wb = XLConnect::loadWorkbook(filename = filename)
data = XLConnect::readWorksheet(wb, sheet = "9-1")
head(data)
str(data)
data2 <- data[seq(from = 13, to = 35, by = 2),]
head(data2)
str(data2)
```
```{r}
library(readxl)
filename <- "data/29010.xls"
filename <- "data/10401信用卡重要資訊揭露.xlsx"
sheetNames <- excel_sheets(filename)
readxl:::xlsx_col_types(path = filename, nskip = 10, n = 1)
```
```{r, echo=FALSE}
resxl <- read_excel(filename, sheet = sheetNames[1], col_names = TRUE, skip = 4)
```
```{r}
resxl <- resxl[4:41,]
resxl[-1] <- lapply(resxl[-1], as.numeric)
# colnames(resxl)=iconv(colnames(resxl),'utf8','big5')
str(resxl)
# barplot(height = resxl$流通卡數, names.arg = resxl$金融機構名稱)
# View(resxl)
# resxl_2 <- read_excel(filename, sheet = 2, col_names = FALSE)
# View(resxl_2)
# str(resxl_2)
```
---
## dplyr
接下來,真正開始用 R 來做 ETL 資料清理的工作。
### 簡介 `dplyr`
```{r}
library(dplyr)
```
### 延伸 `data.table`
### 延伸 `rlist`
---
## R 與 Database 的串接:以 SQLite 為例
[範例資料](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing)
[Download Sample file](http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/bank.zip)
```{r}
res <- read.csv("data/bank/bank-full.csv", header = TRUE, sep = ";")
```
```{r}
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbListTables(con)
dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)
dbListFields(con, "mtcars")
dbReadTable(con, "mtcars")
# You can fetch all results:
res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
dbFetch(res)
dbClearResult(res)
# Or a chunk at a time
res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
while(!dbHasCompleted(res)){
chunk <- dbFetch(res, n = 5)
print(nrow(chunk))
}
# Clear the result
dbClearResult(res)
# Disconnect from the database
dbDisconnect(con)
```
---
## Take Home Messages
### - 環境變數 `Sys.getenv()` `Sys.setenv()`
### - 資料結構是重要的,三思而後行!
---
## Q & A