Data collection scripts for FoDiRa

envvars

The project depends on the following environment variables. Please set them in your *rc file (for example, ~/.bashrc).

TWITTER_BEARER: Twitter Bearer Token
TWEET_DB: Absolute path to the DuckDB file holding all tweets
ARTICLE_DIR: Absolute path to the temporary directory holding html files
FODIRA_HOST: SSH connection string of the server (workers only)
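
As a rough sketch, the variables could be exported from an rc file like this. Every value below is a placeholder, and `check_fodira_env` is a hypothetical helper (not part of the project) that only reports which variables are still empty.

```shell
# Placeholder values only; replace each one with your own.
export TWITTER_BEARER="replace-with-your-bearer-token"
export TWEET_DB="/path/to/tweets.duckdb"        # hypothetical absolute path
export ARTICLE_DIR="/path/to/articles"          # hypothetical absolute path
export FODIRA_HOST="user@server.example.org"    # hypothetical, workers only

# Hypothetical helper: print the names of required variables that are unset
# or empty, and return non-zero if any are missing.
check_fodira_env() {
    missing=""
    for name in TWITTER_BEARER TWEET_DB ARTICLE_DIR FODIRA_HOST; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing:$missing"
        return 1
    fi
    echo "All required variables are set"
}
```

Running `check_fodira_env` in a new shell is a quick way to confirm that the rc file was picked up. On the server itself, FODIRA_HOST may not be needed, so drop it from the list there.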

The project needs R (for most of the data collection) and Node.js (for readability).

Server/worker setup guide

Ubuntu 20.04 LTS is recommended at the moment because of installation issues with MongoDB on 22.04 LTS.

System dependencies

sudo apt update -qq
sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libsasl2-dev software-properties-common dirmngr libssh-dev -y

R packages

DON'T USE THE r-core provided by Ubuntu; it is currently a pre-4.0 version (#14)

Install R according to this guide

From the guide:

# add the signing key for the CRAN repo
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# add the R 4.0 repo from CRAN; $(lsb_release -cs) resolves to the release codename (e.g. 'focal')
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" -y
sudo apt install --no-install-recommends r-base-dev -y
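
Since the pre-4 R from Ubuntu is the thing to avoid, it may be worth confirming which version ended up on the PATH. The following is a minimal sketch; `r_major_version` is a hypothetical helper that assumes the first line of `R --version` starts with `R version X.Y.Z`.

```shell
# Hypothetical helper: extract the major version from the first line of
# `R --version`, e.g. 'R version 4.2.1 (2022-06-23) -- "..."' gives 4.
r_major_version() {
    printf '%s\n' "$1" | sed -n 's/^R version \([0-9][0-9]*\)\..*/\1/p'
}

# Only query R if it is actually on the PATH.
if command -v R >/dev/null 2>&1; then
    major=$(r_major_version "$(R --version | head -n 1)")
    if [ "${major:-0}" -ge 4 ]; then
        echo "R $major.x found"
    else
        echo "R is older than 4.0; check the CRAN repository setup" >&2
    fi
fi
```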

Then install the R packages from an R session:

install.packages(c("tidyverse", "rio", "remotes", "tidyRSS", "mongolite", "docopt"))
remotes::install_github("chainsawriot/fodira")

Installation of MongoDB (Server only)

Install MongoDB according to this guide

sudo mkdir -p /data/mongodb
sudo chown -R mongodb:mongodb /data/mongodb

Edit the config file /etc/mongod.conf to point dbPath to /data/mongodb
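
The edit can also be scripted. This is only a sketch: `set_mongod_dbpath` is a hypothetical helper, and it assumes the config file contains a single `dbPath:` line, as in the stock /etc/mongod.conf.

```shell
# Hypothetical helper: rewrite the dbPath line of a mongod config file.
# $1: config file, $2: new data directory.
set_mongod_dbpath() {
    conf="$1"
    new_path="$2"
    # Keep the original indentation and key; replace only the value.
    sed "s|^\([[:space:]]*dbPath:\).*|\1 $new_path|" "$conf" > "$conf.new" &&
        mv "$conf.new" "$conf"
}
```

One cautious way to use it is to run it on a copy of the config (for example `set_mongod_dbpath ./mongod.conf /data/mongodb`), inspect the result, and only then copy it back to /etc/mongod.conf with sudo.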

Enable and start the service

sudo systemctl enable --now mongod

DO Spaces

HTML files are uploaded to DigitalOcean Spaces, which requires s3cmd.

sudo apt-get install s3cmd

Please set it up according to the guide provided by DigitalOcean.
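
Once s3cmd is configured, a manual upload might look like the sketch below. This is not the project's own upload script: `upload_html_dir` and the bucket name are hypothetical, and the only s3cmd feature assumed is `s3cmd put FILE s3://BUCKET/`.

```shell
# Hypothetical helper: upload every .html file in a directory to a bucket.
# $1: directory, $2: bucket (Space) name.
# S3CMD can be overridden (e.g. with `echo`) to preview the commands
# without uploading anything.
upload_html_dir() {
    dir="$1"
    bucket="$2"
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue        # directory has no .html files
        "${S3CMD:-s3cmd}" put "$f" "s3://$bucket/" || return 1
    done
}
```

For example, `S3CMD=echo; upload_html_dir "$ARTICLE_DIR" my-space-name` prints the commands that would run, which is a cheap way to check the paths before a real upload.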

Page scraping

Install Firefox

DON'T USE THE SNAP PACKAGE

Install Firefox from the official Mozilla PPA

sudo add-apt-repository ppa:mozillateam/ppa
sudo apt install firefox
firefox --version # testing
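
To check that the snap package is not the one being picked up, the path of the `firefox` command can be inspected. This is a heuristic sketch only: `is_snap_path` is a hypothetical helper, and it assumes snap commands are exposed under /snap/bin, so a wrapper script elsewhere that launches the snap would not be detected.

```shell
# Hypothetical helper: succeed if a command path lives under a snap
# directory. This is a heuristic, not a definitive test.
is_snap_path() {
    case "$1" in
        /snap/*|/var/lib/snapd/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only inspect firefox if it is actually on the PATH.
if command -v firefox >/dev/null 2>&1; then
    ff_path=$(command -v firefox)
    if is_snap_path "$ff_path"; then
        echo "firefox is provided by snap: $ff_path" >&2
    else
        echo "firefox found at: $ff_path"
    fi
fi
```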

Install JRE, JDK, and rJava

sudo apt-get install -y default-jre
sudo apt-get install -y default-jdk
sudo R CMD javareconf
Rscript -e "install.packages(c('rJava', 'RSelenium'))"

Install the RSelenium binary

ff_options <- list("moz:firefoxOptions" = list(args = list('--headless')))

rD <- RSelenium::rsDriver(browser = "firefox", port = sample(c(5678L, 5679L, 5680L, 5681L, 5682L), size = 1), check = TRUE, verbose = FALSE,
                          extraCapabilities = ff_options)
##rD <- RSelenium::rsDriver(browser = "firefox", port = sample(c(5678L, 5679L, 5680L, 5681L, 5682L), size = 1), check = TRUE, verbose = TRUE,
##                          extraCapabilities = ff_options)

## be careful of this issue
## https://github.com/ropensci/wdman/issues/31#issuecomment-1336651660

z <- rD$server$stop()
