---
title: "Common workflows in castarter"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Common workflows in castarter}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
One of the first issues that appears when starting a text mining or web scraping project is how to manage files and folders. `castarter` defaults to an opinionated folder structure that should work for most projects. It also facilitates downloading files (skipping files that have already been downloaded) and ensures consistent and unique matching between a downloaded html file, its source url, and the data extracted from it. Finally, it facilitates archiving and backing up downloaded files and scripts.
## Getting started
In this vignette, I will outline some of the basic steps that can be used to extract and process the contents of press releases published by a number of institutions of the European Union. When text mining or scraping, it is common to gather many thousands of files very quickly, and keeping them in good order is fundamental, particularly in the long term.
A preliminary suggestion: depending on how you usually work and keep your files backed up, it may make sense to keep your scripts in a folder that is live-synced (e.g. with services such as Dropbox, Nextcloud, or Google Drive). It rarely makes sense, however, to live-sync the tens or hundreds of thousands of files that accumulate as you proceed with your scraping. You may want to keep this in mind as you set the `base_folder` with `cas_set_options()`. Here I will keep everything in a temporary folder for the sake of simplicity, but there are no drawbacks in having scripts and downloaded files in different locations.
`castarter` stores details about the download process in a database. By default, this is stored locally in a DuckDB database kept in the same folder as website files, but it can be stored in a different folder, and alternative database backends such as RSQLite or MySQL can also be used.
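Since the default backend is a regular DuckDB file, the database can also be opened directly with `DBI` for inspection. A minimal sketch, where the file name and location are hypothetical (check the relevant website folder for the file actually created by `castarter`):
```{r eval = FALSE}
# Open the local DuckDB database directly for inspection (file path is hypothetical)
con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = "castarter_data/European Union/EEAS/castarter_EEAS.duckdb",
  read_only = TRUE
)
DBI::dbListTables(con)
DBI::dbDisconnect(con, shutdown = TRUE)
```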
Assuming that my project on the European Union involves text mining the websites of the European Parliament, the European Council, and the European External Action Service (EEAS), the folder structure may look something like this:
```{r}
library("castarter")
cas_set_options(
  base_folder = fs::path(
    fs::path_temp(),
    "castarter_data"
  ),
  project = "European Union",
  website = "EEAS"
)
```
```{r eval = TRUE, include = TRUE, echo=FALSE}
fs::dir_create(path = fs::path(
  cas_get_options()$base_folder,
  cas_get_options()$project,
  "European Parliament"
))
fs::dir_create(path = fs::path(
  cas_get_options()$base_folder,
  cas_get_options()$project,
  "European Council"
))
fs::dir_create(path = fs::path(
  cas_get_options()$base_folder,
  cas_get_options()$project,
  "EEAS"
))
fs::dir_tree(cas_get_options()$base_folder)
```
In brief, `castarter_data` is the base folder where I can store all of my text mining projects. `European Union` is the name of the project, while all the others are the names of the specific websites I will source. Folders will be created automatically as needed when you start downloading files.
## Downloading index files
A common scenario in text mining involves first downloading web pages that contain lists of urls pointing to the actual posts we are interested in. In the case of the European Commission, these would probably be the pages in the "[news section](https://ec.europa.eu/commission/presscorner/home/en)". By clicking on the numbers at the bottom of the page, we get direct links to the subsequent pages listing all posts.
These URLs look something like this:
```{r eval = TRUE, echo=FALSE}
tibble::tibble(index_urls = paste0("https://ec.europa.eu/commission/presscorner/home/en?pagenumber=", 1:10)) %>%
  knitr::kable()
```
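When index URLs follow such a simple numeric pattern, they can be generated programmatically with `cas_build_urls()`. A minimal sketch based on the press corner URLs above (the `index_group` label is an illustrative choice, and the query string should be checked against the live website):
```{r eval = FALSE}
# Generate the same sequence of index URLs programmatically
ec_index_df <- cas_build_urls(
  url = "https://ec.europa.eu/commission/presscorner/home/en?pagenumber=",
  start_page = 1,
  end_page = 10,
  index_group = "News" # hypothetical label for this group of index pages
)
```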
Sometimes such URLs can be derived from the archive section, as is the case, for example, for the EEAS:
```{r}
index_df <- cas_build_urls(
  url = "https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=",
  start_page = 0,
  end_page = 3,
  index_group = "Statements"
)
index_df %>%
  knitr::kable()
```
All information about the download process is typically stored in a local database, for consistency and future reference.
```{r}
cas_write_db_index(urls = index_df)
cas_read_db_index()
```
```{r eval=TRUE}
# create (if missing) and return the local folder where downloaded files will be stored
cas_get_base_path(create_folder_if_missing = TRUE)
```
```{r}
# list index files that have not yet been downloaded:
# cas_get_files_to_download(index = TRUE)
```
## Download files
`cas_download()` retrieves the files whose urls have been stored in the local database, skipping files that have already been downloaded.
```{r}
cas_download(index = TRUE, create_folder_if_missing = TRUE)
```
## Check how the download is going
```{r}
# check the status of index downloads recorded in the local database
download_status_df <- cas_read_db_download(index = TRUE)
download_status_df
```
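A quick count of rows gives a rough sense of progress. A minimal sketch, assuming that each downloaded index file corresponds to one row, and using `dplyr::collect()` in case the result is returned as a lazy database table:
```{r eval = FALSE}
# Count how many index files have been downloaded so far
download_status_df %>%
  dplyr::collect() %>%
  nrow()
```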
## Extract links from index files
```{r}
cas_extract_links(
  container = "h5", # html element containing links to individual posts
  container_class = "card-title",
  domain = "https://www.eeas.europa.eu", # used to complete relative links
  write_to_db = TRUE
)
```
```{r}
# read the urls of individual pages stored in the local database
cas_read_db_contents_id()
```
```{r}
# check which of these pages still need to be downloaded
cas_get_files_to_download()
```
```{r}
# download contents files (see ?cas_download for details on the `sample` argument)
cas_download(sample = 5)
```
```{r}
# check the download status of contents files
cas_read_db_download()
```
```{r}
extractors_l <- list(
  # extract the title from the <h1> element with class "node__title"
  title = \(x) cas_extract_html(
    html_document = x,
    container = "h1",
    container_class = "node__title"
  ) %>%
    stringr::str_remove_all(pattern = stringr::fixed("\n")) %>%
    stringr::str_squish(),
  # extract the date from the <div> element with class "node__meta",
  # keeping only the dd.mm.yyyy part and parsing it as a date
  date = \(x) cas_extract_html(
    html_document = x,
    container = "div",
    container_class = "node__meta"
  ) %>%
    stringr::str_extract("[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+") %>%
    lubridate::dmy()
)
```
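Before extracting data from all downloaded files, it can be useful to test the extractors on a single page. A minimal sketch, assuming that `cas_extract_html()` accepts an html document read with `rvest::read_html()`, and with a placeholder path to be replaced with the location of an actual downloaded file:
```{r eval = FALSE}
# Test the extractors on a single downloaded file (the path is a placeholder)
test_html <- rvest::read_html("path/to/a/downloaded/file.html")
extractors_l$title(test_html)
extractors_l$date(test_html)
```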
```{r}
# apply the extractors to all downloaded contents files and store the results
cas_extract(extractors = extractors_l)
```
```{r}
# read the extracted data from the local database
cas_read_db_contents_data()
```