
DHH24 disc group project

Introduction

See BEST-PRACTICES.md for repository layout and coding best practices. In particular, this repository has been set up to use Poetry for Python dependency management and renv for R dependency management. Additionally, a Conda-compatible tool (such as micromamba, mamba or conda) can be used to set up compatible versions of R, Python and Poetry if these are not already available on the system.

Thus, to get the project running, do the following:

  1. Ensure you have compatible versions of R, Python and Poetry available on the system. If you have a conda-compatible tool, you can also use that to install these in an isolated environment, e.g. through [micromamba/mamba/conda env] create -f environment.yml -p ./.venv.
  2. Install R dependencies with Rscript -e 'renv::restore()'
  3. Install Python dependencies with poetry install

Data access

There are three basic ways to access the data:

  1. Smaller random samples of each dataset are stored in the data/work/samples folder of this repository as TSV files. These are the easiest way to use the data if a sample is sufficient for your analyses.
  2. The master data is stored on a MariaDB database server. This can be very easily used from R through dbplyr, which in many scenarios allows you to use the data as if it were local through tidyverse verbs, transparently transforming them into SQL under the hood. For Python, Pandas can read from the database using SQL queries, but you need to write the SQL yourself.
  3. Copies of the data are stored as Parquet files in the Allas S3 service. In Python, Pandas can open these files through read_parquet in such a way that only the parts of the data needed for a particular query are downloaded from S3, and the whole dataset never needs to fit in memory at once. (You could do the same in R, but there it is easier to use the database through dbplyr.)

For either MariaDB or S3 access, you need a secret.yaml file, which you can find in our shared Google Drive and which you should put in the root directory of the repo. A minimal Python sketch of all three access paths follows below.
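The sketch covers reading a sample TSV, querying MariaDB through Pandas and SQLAlchemy, and opening a Parquet file from Allas S3. The credential keys read from secret.yaml, the table name, and the S3 bucket/path are hypothetical placeholders; check secret.yaml itself and the example notebooks mentioned below for the real values.

```python
# A minimal sketch of the three access paths. The secret.yaml keys, the table
# name and the S3 bucket/path are hypothetical placeholders.
import pandas as pd
import yaml
from sqlalchemy import create_engine

# 1. Local sample TSV from data/work/samples (file name is a placeholder)
sample = pd.read_csv("data/work/samples/cmw_submissions_sample.tsv", sep="\t")

# 2. MariaDB via SQL; the key names in secret.yaml are assumed, not verified
with open("secret.yaml") as f:
    secret = yaml.safe_load(f)
engine = create_engine(
    f"mysql+pymysql://{secret['db_user']}:{secret['db_password']}"
    f"@{secret['db_host']}/{secret['db_name']}?charset=utf8mb4"
)
top = pd.read_sql(
    "SELECT id, title, score FROM cmw_submissions_c ORDER BY score DESC LIMIT 10",
    engine,
)

# 3. Parquet on Allas S3; only the requested columns/row groups are fetched
df = pd.read_parquet(
    "s3://dhh24-disc/cmw_submissions.parquet",  # placeholder bucket and key
    columns=["id", "title", "score"],
    storage_options={
        "key": secret["s3_access_key"],  # placeholder key names
        "secret": secret["s3_secret_key"],
        "client_kwargs": {"endpoint_url": "https://a3s.fi"},
    },
)
```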

For quick-start introductions on how to use the data in practice from code, look at src/example_user/example_analysis.Rmd and src/example_user/example_analysis.ipynb, as well as the various things under src/jiemakel/.

More on the database

Each table in the MariaDB database has a suffix, either _a or _c, which indicates the storage engine (Aria or ColumnStore) backing that table. The main relevance of the storage engine is that:

  1. ColumnStore tables in general perform better when you need to count or aggregate over a large number of entries, while Aria tables perform better when you need to extract a small subset (but sometimes not, so in the end, if something is slow, try the other engine!)
  2. Only Aria tables have full-text indices for efficient text search (this has its own syntax; see the sketch after this list, and check the MariaDB docs).
  3. Cross-engine joins are terrible, so avoid them if you can.
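For example, a full-text search against an Aria table uses MariaDB's MATCH ... AGAINST syntax. A minimal Python sketch, reusing the SQLAlchemy engine from the data-access sketch above; the table name and search phrase are placeholders:

```python
import pandas as pd

# Full-text search on an Aria (_a) table with MariaDB's MATCH ... AGAINST.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name and search phrase are placeholders.
hits = pd.read_sql(
    """
    SELECT id, permalink, body
    FROM cmw_comments_a
    WHERE MATCH(body) AGAINST('climate change' IN NATURAL LANGUAGE MODE)
    LIMIT 100
    """,
    engine,
)
```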

Data model

The main data is in [prefix]_submissions_[a|c] and [prefix]_comments_[a|c], with the prefix being (at the time of writing) one of [cmw|eli5|aita|stop_arguing|best_of_reddit|random_sample].

The schema of the submissions table is:

  • subreddit_id BIGINT UNSIGNED NOT NULL, -- the numeric id of the subreddit
  • subreddit VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the name of the subreddit
  • id BIGINT UNSIGNED NOT NULL PRIMARY KEY, -- the unique numeric id of the submission
  • permalink VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- a URL to the submission
  • created_utc TIMESTAMP NOT NULL, -- the time the submission was created (in the UTC timezone)
  • author_id BIGINT UNSIGNED, -- the numeric id of the author, if available
  • author VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the username of the author
  • title VARCHAR(510) CHARACTER SET utf8mb4 NOT NULL, -- the title of the submission
  • url VARCHAR(510) CHARACTER SET utf8mb4, -- a URL if the submission was a link submission
  • selftext TEXT CHARACTER SET utf8mb4, -- the text of the submission if it wasn't a link-only submission
  • score INTEGER NOT NULL, -- the score of the submission as calculated by subtracting the number of downvotes from the number of upvotes
  • num_comments INTEGER UNSIGNED NOT NULL, -- the number of comments on the submission as reported by Reddit
  • upvote_ratio FLOAT -- the fraction of all votes on the submission that were upvotes. Not available for old data.
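For example, the created_utc and score columns let you aggregate submissions by month directly in the database. A minimal sketch, again reusing the engine from the data-access sketch above and a placeholder table name:

```python
import pandas as pd

# Monthly submission counts and mean scores, computed server-side.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name is a placeholder following the naming pattern.
monthly = pd.read_sql(
    """
    SELECT YEAR(created_utc) AS year,
           MONTH(created_utc) AS month,
           COUNT(*) AS submissions,
           AVG(score) AS mean_score
    FROM cmw_submissions_c
    GROUP BY year, month
    ORDER BY year, month
    """,
    engine,
)
```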

The schema of the comments table is:

  • subreddit_id BIGINT UNSIGNED NOT NULL, -- the numeric id of the subreddit
  • subreddit VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the name of the subreddit
  • id BIGINT UNSIGNED NOT NULL PRIMARY KEY, -- the unique numeric id of the comment
  • permalink VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- a URL to the comment
  • link_id BIGINT UNSIGNED NOT NULL, -- the id of the submission this comment belongs to
  • parent_comment_id BIGINT UNSIGNED, -- the id of the parent comment. If not given, the parent is the submission (link_id)
  • created_utc TIMESTAMP NOT NULL, -- the time the comment was created (in the UTC timezone)
  • author_id BIGINT UNSIGNED, -- the numeric id of the author, if available
  • author VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the username of the author
  • body TEXT CHARACTER SET utf8mb4, -- the text of the comment (quoted sections begin with >)
  • score INTEGER NOT NULL, -- the score of the comment as calculated by subtracting the number of downvotes from the number of upvotes
  • controversiality BOOLEAN -- comments with many upvotes and many downvotes are considered controversial by Reddit
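To see how link_id and parent_comment_id tie a discussion together, the sketch below fetches all comments of one submission and rebuilds the reply structure locally; the submission id and table name are placeholders:

```python
import pandas as pd

# All comments of one submission (placeholder id), rebuilt into a reply tree:
# comments with a NULL parent_comment_id reply directly to the submission.
# `engine` is the SQLAlchemy engine from the data-access sketch above.
submission_id = 123456789  # placeholder
comments = pd.read_sql(
    "SELECT id, parent_comment_id, author, created_utc, body, score "
    f"FROM cmw_comments_a WHERE link_id = {submission_id} "
    "ORDER BY created_utc",
    engine,
)
top_level = comments[comments["parent_comment_id"].isna()]
replies_to = comments.groupby("parent_comment_id")["id"].apply(list)
```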

Additionally, there are some downstream tables for various purposes. These are:

  • cmw_delta_comments_a contains all comments in ChangeMyView where the OP gives a delta. This can be used to find the comment receiving the delta through parent_comment_id, and the whole discussion using link_id.
  • stop_arguing_stop_arguing_comments_a contains all comments where someone actually says "stop arguing".
  • The base data samples also have copies in the database, in tables with the suffix _sample_[number].
  • The samples have also been automatically linguistically parsed with Stanza. The parsed data is in tables with the suffix _parse, detailed further below.
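As an example of using the downstream tables, cmw_delta_comments_a can be joined back to the comments table to recover the comments that actually received a delta. A minimal sketch, reusing the engine from the data-access sketch above; treat it as a starting point rather than a verified recipe:

```python
import pandas as pd

# Comments that received a delta: the delta-awarding comments are in
# cmw_delta_comments_a, and their parent_comment_id points at the comment
# that was awarded. `engine` is the SQLAlchemy engine from the sketch above.
awarded = pd.read_sql(
    """
    SELECT c.id, c.author, c.score, c.body, d.link_id
    FROM cmw_delta_comments_a d
    JOIN cmw_comments_a c ON c.id = d.parent_comment_id
    LIMIT 1000
    """,
    engine,
)
```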

Data model for linguistically parsed data

The parsed data is in tables named [prefix]_parse_a. The schema of the main parse tables is:

  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • word_pos SMALLINT UNSIGNED NOT NULL, -- word position in the sentence
  • word_id INT UNSIGNED NOT NULL, -- word_id that references words_a which contains the actual text of the word
  • lemma_id INT UNSIGNED NOT NULL, -- lemma_id that references lemmas_a which contains the actual text of the lemma
  • upos_id TINYINT UNSIGNED NOT NULL, -- upos_id that references upos_a which contains the actual upos
  • xpos_id TINYINT UNSIGNED NOT NULL, -- xpos_id that references upos_a which contains the actual xpos
  • feats_id SMALLINT UNSIGNED, -- feats_id that references feats_a which contains the actual feats
  • head_pos SMALLINT UNSIGNED NOT NULL, -- the word position of the head of the dependency relation
  • deprel_id TINYINT UNSIGNED NOT NULL, -- deprel_id that references deprel_a which contains the actual dependency relation
  • misc_id SMALLINT UNSIGNED, -- misc_id that references misc_a which contains the actual misc text associated with the word
  • start_char SMALLINT UNSIGNED NOT NULL, -- start character position of the word in the submission/comment
  • end_char SMALLINT UNSIGNED NOT NULL, -- end character position of the word in the submission/comment

Additional parse information is available in three further tables. First, [prefix]_parse_constituents_a contains constituency parsing results, with the following schema:

  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • node_id SMALLINT UNSIGNED NOT NULL, -- node id of the constituent within the tree (root is 0)
  • label_id INT UNSIGNED NOT NULL, -- references words_a which contains the actual text of the label in the tree
  • parent_node_id SMALLINT UNSIGNED NOT NULL, -- parent node id of the constituent within the tree

Second, [prefix]_parse_entities_a contains extracted named entities, with the following schema:

  • name VARCHAR(255) NOT NULL, -- the name of the entity
  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • start_word_pos INT UNSIGNED NOT NULL, -- word position in the sentence where the entity mention starts
  • end_word_pos INT UNSIGNED NOT NULL, -- word position in the sentence where the entity mention ends
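For example, the most frequently mentioned entities in a parsed sample can be counted directly in SQL; the table name below is a placeholder following the [prefix]_parse_entities_a pattern:

```python
import pandas as pd

# Most frequent named-entity mentions in one parsed sample.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name is a placeholder.
top_entities = pd.read_sql(
    """
    SELECT name, COUNT(*) AS mentions
    FROM cmw_submissions_sample_1_parse_entities_a
    GROUP BY name
    ORDER BY mentions DESC
    LIMIT 20
    """,
    engine,
)
```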

Finally, [prefix]_parse_sentiments_a contains sentiment analysis results, with the following schema:

  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • sentiment TINYINT NOT NULL, -- sentiment score of the sentence
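A typical use is to aggregate the sentence-level sentiments per text, e.g. the mean sentiment of each submission or comment; the table name below is a placeholder following the [prefix]_parse_sentiments_a pattern:

```python
import pandas as pd

# Mean sentence sentiment per parsed text. text_id can afterwards be joined
# back to the id column of the submissions/comments tables if needed.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name is a placeholder.
mean_sentiment = pd.read_sql(
    """
    SELECT text_id, AVG(sentiment) AS mean_sentiment, COUNT(*) AS sentences
    FROM cmw_submissions_sample_1_parse_sentiments_a
    GROUP BY text_id
    """,
    engine,
)
```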

Using this repo inside the CSC environment

There are additional scripts and tricks for using this repository in the various CSC environments. If you need to do this, ask Eetu for help.
