
DHH24 disc group project

Introduction

See BEST-PRACTICES.md for repository layout and coding best practices. In particular, this repository has been set up to use Poetry for Python dependency management and renv for R dependency management. Additionally, a Conda-compatible tool (such as micromamba, mamba or conda) can be used to set up compatible versions of R, Python and Poetry if these are not already available on the system.

Thus, to get the project running, do the following:

  1. Ensure you have compatible versions of R, Python and Poetry available on the system. If you have a conda-compatible tool, you can also use that to install these in an isolated environment, e.g. through [micromamba/mamba/conda env] create -f environment.yml -p ./.venv.
  2. Install R dependencies with Rscript -e 'renv::restore()'
  3. Install Python dependencies with poetry install

Data access

There are three basic ways to access the data:

  1. Smaller random samples of each dataset are stored in the data/work/samples folder of this repository as TSV files. These are the easiest way to use the data if a sample is sufficient for your analyses.
  2. The master data is stored on a MariaDB database server. This can be very easily used from R through dbplyr, which in many scenarios allows you to use the data as if it were local through tidyverse verbs, transparently transforming them into SQL under the hood. For Python, Pandas can read from the database using SQL queries, but you need to write the SQL yourself.
  3. Copies of the data are stored as Parquet files in the Allas S3 service. In Python, Pandas can open these files through read_parquet in such a way that only the parts of the data needed for a particular query are downloaded from S3, and the whole dataset never needs to fit in memory at once. (You could do the same in R, but there it is easier to use the database through dbplyr.)

For either MariaDB or S3 access, you need a secret.yaml file, which you can find in our shared Google Drive and which you should put in the root directory of the repo. A minimal Python sketch of all three access paths follows below.
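The sketch covers reading a sample TSV, querying MariaDB through Pandas and SQLAlchemy, and opening a Parquet file from Allas S3. The credential keys read from secret.yaml, the table name, and the S3 bucket/path are hypothetical placeholders; check secret.yaml itself and the example notebooks mentioned below for the real values.

```python
# A minimal sketch of the three access paths. The secret.yaml keys, the table
# name and the S3 bucket/path are hypothetical placeholders.
import pandas as pd
import yaml
from sqlalchemy import create_engine

# 1. Local sample TSV from data/work/samples (file name is a placeholder)
sample = pd.read_csv("data/work/samples/cmw_submissions_sample.tsv", sep="\t")

# 2. MariaDB via SQL; the key names in secret.yaml are assumed, not verified
with open("secret.yaml") as f:
    secret = yaml.safe_load(f)
engine = create_engine(
    f"mysql+pymysql://{secret['db_user']}:{secret['db_password']}"
    f"@{secret['db_host']}/{secret['db_name']}?charset=utf8mb4"
)
top = pd.read_sql(
    "SELECT id, title, score FROM cmw_submissions_c ORDER BY score DESC LIMIT 10",
    engine,
)

# 3. Parquet on Allas S3; only the requested columns/row groups are fetched
df = pd.read_parquet(
    "s3://dhh24-disc/cmw_submissions.parquet",  # placeholder bucket and key
    columns=["id", "title", "score"],
    storage_options={
        "key": secret["s3_access_key"],  # placeholder key names
        "secret": secret["s3_secret_key"],
        "client_kwargs": {"endpoint_url": "https://a3s.fi"},
    },
)
```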

For quick-start introductions on how to use the data in practice from code, look at src/example_user/example_analysis.Rmd and src/example_user/example_analysis.ipynb, as well as the various things under src/jiemakel/.

More on the database

Each table in the MariaDB database has a suffix, either _a or _c, which indicates the storage engine (Aria or ColumnStore) backing that table. The main relevance of the storage engine is that:

  1. ColumnStore tables in general perform better when you need to count or aggregate over a large number of entries, while Aria tables perform better when you need to extract a small subset (but sometimes not, so in the end, if something is slow, try the other engine!)
  2. Only Aria tables have full-text indices for efficient text search (this has its own syntax; see the sketch after this list, and check the MariaDB docs).
  3. Cross-engine joins are terrible, so avoid them if you can.
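For example, a full-text search against an Aria table uses MariaDB's MATCH ... AGAINST syntax. A minimal Python sketch, reusing the SQLAlchemy engine from the data-access sketch above; the table name and search phrase are placeholders:

```python
import pandas as pd

# Full-text search on an Aria (_a) table with MariaDB's MATCH ... AGAINST.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name and search phrase are placeholders.
hits = pd.read_sql(
    """
    SELECT id, permalink, body
    FROM cmw_comments_a
    WHERE MATCH(body) AGAINST('climate change' IN NATURAL LANGUAGE MODE)
    LIMIT 100
    """,
    engine,
)
```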

Data model

The main data is in [prefix]_submissions_[a|c] and [prefix]_comments_[a|c], with the prefix being (at the time of writing) one of [cmw|eli5|aita|stop_arguing|best_of_reddit|random_sample].

The schema of the submissions table is:

  • subreddit_id BIGINT UNSIGNED NOT NULL, -- the numeric id of the subreddit
  • subreddit VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the name of the subreddit
  • id BIGINT UNSIGNED NOT NULL PRIMARY KEY, -- the unique numeric id of the submission
  • permalink VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- a URL to the submission
  • created_utc TIMESTAMP NOT NULL, -- the time the submission was created (in the UTC timezone)
  • author_id BIGINT UNSIGNED, -- the numeric id of the author, if available
  • author VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the username of the author
  • title VARCHAR(510) CHARACTER SET utf8mb4 NOT NULL, -- the title of the submission
  • url VARCHAR(510) CHARACTER SET utf8mb4, -- a URL if the submission was a link submission
  • selftext TEXT CHARACTER SET utf8mb4, -- the text of the submission if it wasn't a link-only submission
  • score INTEGER NOT NULL, -- the score of the submission as calculated by subtracting the number of downvotes from the number of upvotes
  • num_comments INTEGER UNSIGNED NOT NULL, -- the number of comments on the submission as reported by Reddit
  • upvote_ratio FLOAT -- the fraction of all votes on the submission that were upvotes. Not available for old data.
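For example, the created_utc and score columns let you aggregate submissions by month directly in the database. A minimal sketch, again reusing the engine from the data-access sketch above and a placeholder table name:

```python
import pandas as pd

# Monthly submission counts and mean scores, computed server-side.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name is a placeholder following the naming pattern.
monthly = pd.read_sql(
    """
    SELECT YEAR(created_utc) AS year,
           MONTH(created_utc) AS month,
           COUNT(*) AS submissions,
           AVG(score) AS mean_score
    FROM cmw_submissions_c
    GROUP BY year, month
    ORDER BY year, month
    """,
    engine,
)
```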

The schema of the comments table is:

  • subreddit_id BIGINT UNSIGNED NOT NULL, -- the numeric id of the subreddit
  • subreddit VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the name of the subreddit
  • id BIGINT UNSIGNED NOT NULL PRIMARY KEY, -- the unique numeric id of the comment
  • permalink VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- a URL to the comment
  • link_id BIGINT UNSIGNED NOT NULL, -- the id of the submission this comment belongs to
  • parent_comment_id BIGINT UNSIGNED, -- the id of the parent comment. If not given, the parent is the submission (link_id)
  • created_utc TIMESTAMP NOT NULL, -- the time the comment was created (in the UTC timezone)
  • author_id BIGINT UNSIGNED, -- the numeric id of the author, if available
  • author VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL, -- the username of the author
  • body TEXT CHARACTER SET utf8mb4, -- the text of the comment (quoted sections begin with >)
  • score INTEGER NOT NULL, -- the score of the comment as calculated by subtracting the number of downvotes from the number of upvotes
  • controversiality BOOLEAN -- comments with many upvotes and many downvotes are considered controversial by Reddit
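To see how link_id and parent_comment_id tie a discussion together, the sketch below fetches all comments of one submission and rebuilds the reply structure locally; the submission id and table name are placeholders:

```python
import pandas as pd

# All comments of one submission (placeholder id), rebuilt into a reply tree:
# comments with a NULL parent_comment_id reply directly to the submission.
# `engine` is the SQLAlchemy engine from the data-access sketch above.
submission_id = 123456789  # placeholder
comments = pd.read_sql(
    "SELECT id, parent_comment_id, author, created_utc, body, score "
    f"FROM cmw_comments_a WHERE link_id = {submission_id} "
    "ORDER BY created_utc",
    engine,
)
top_level = comments[comments["parent_comment_id"].isna()]
replies_to = comments.groupby("parent_comment_id")["id"].apply(list)
```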

Additionally, there are some downstream tables for various purposes. These are:

  • cmw_delta_comments_a contains all comments in ChangeMyView where the OP gives a delta. This can be used to find the comment receiving the delta through parent_comment_id, and the whole discussion using link_id.
  • stop_arguing_stop_arguing_comments_a contains all comments where someone actually says "stop arguing".
  • The base data samples also have copies in the database, in tables with the suffix _sample_[number].
  • The samples have also been automatically linguistically parsed with Stanza. The parsed data is in tables with the suffix _parse, detailed further below.
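As an example of using the downstream tables, cmw_delta_comments_a can be joined back to the comments table to recover the comments that actually received a delta. A minimal sketch, reusing the engine from the data-access sketch above; treat it as a starting point rather than a verified recipe:

```python
import pandas as pd

# Comments that received a delta: the delta-awarding comments are in
# cmw_delta_comments_a, and their parent_comment_id points at the comment
# that was awarded. `engine` is the SQLAlchemy engine from the sketch above.
awarded = pd.read_sql(
    """
    SELECT c.id, c.author, c.score, c.body, d.link_id
    FROM cmw_delta_comments_a d
    JOIN cmw_comments_a c ON c.id = d.parent_comment_id
    LIMIT 1000
    """,
    engine,
)
```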

Data model for linguistically parsed data

The parsed data is in tables named [prefix]_parse_a. The schema of the main parse tables is:

  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • word_pos SMALLINT UNSIGNED NOT NULL, -- word position in the sentence
  • word_id INT UNSIGNED NOT NULL, -- word_id that references words_a which contains the actual text of the word
  • lemma_id INT UNSIGNED NOT NULL, -- lemma_id that references lemmas_a which contains the actual text of the lemma
  • upos_id TINYINT UNSIGNED NOT NULL, -- upos_id that references upos_a which contains the actual upos
  • xpos_id TINYINT UNSIGNED NOT NULL, -- xpos_id that references upos_a which contains the actual xpos
  • feats_id SMALLINT UNSIGNED, -- feats_id that references feats_a which contains the actual feats
  • head_pos SMALLINT UNSIGNED NOT NULL, -- the word position of the head of the dependency relation
  • deprel_id TINYINT UNSIGNED NOT NULL, -- deprel_id that references deprel_a which contains the actual dependency relation
  • misc_id SMALLINT UNSIGNED, -- misc_id that references misc_a which contains the actual misc text associated with the word
  • start_char SMALLINT UNSIGNED NOT NULL, -- start character position of the word in the submission/comment
  • end_char SMALLINT UNSIGNED NOT NULL, -- end character position of the word in the submission/comment

Additional parse information is available in three further tables. First, [prefix]_parse_constituents_a contains constituency parsing results, with the following schema:

  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • node_id SMALLINT UNSIGNED NOT NULL, -- node id of the constituent within the tree (root is 0)
  • label_id INT UNSIGNED NOT NULL, -- references words_a which contains the actual text of the label in the tree
  • parent_node_id SMALLINT UNSIGNED NOT NULL, -- parent node id of the constituent within the tree

Second, [prefix]_parse_entities_a contains extracted named entities, with the following schema:

  • name VARCHAR(255) NOT NULL, -- the name of the entity
  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • start_word_pos INT UNSIGNED NOT NULL, -- word position in the sentence where the entity mention starts
  • end_word_pos INT UNSIGNED NOT NULL, -- word position in the sentence where the entity mention ends
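For example, the most frequently mentioned entities in a parsed sample can be counted directly in SQL; the table name below is a placeholder following the [prefix]_parse_entities_a pattern:

```python
import pandas as pd

# Most frequent named-entity mentions in one parsed sample.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name is a placeholder.
top_entities = pd.read_sql(
    """
    SELECT name, COUNT(*) AS mentions
    FROM cmw_submissions_sample_1_parse_entities_a
    GROUP BY name
    ORDER BY mentions DESC
    LIMIT 20
    """,
    engine,
)
```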

Finally, [prefix]_parse_sentiments_a contains sentiment analysis results, with the following schema:

  • text_id BIGINT UNSIGNED NOT NULL, -- unique numeric id of the submission or comment parsed
  • sentence_id SMALLINT UNSIGNED NOT NULL, -- sentence number in the submission or comment
  • sentiment TINYINT NOT NULL, -- sentiment score of the sentence
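A typical use is to aggregate the sentence-level sentiments per text, e.g. the mean sentiment of each submission or comment; the table name below is a placeholder following the [prefix]_parse_sentiments_a pattern:

```python
import pandas as pd

# Mean sentence sentiment per parsed text. text_id can afterwards be joined
# back to the id column of the submissions/comments tables if needed.
# `engine` is the SQLAlchemy engine from the data-access sketch above;
# the table name is a placeholder.
mean_sentiment = pd.read_sql(
    """
    SELECT text_id, AVG(sentiment) AS mean_sentiment, COUNT(*) AS sentences
    FROM cmw_submissions_sample_1_parse_sentiments_a
    GROUP BY text_id
    """,
    engine,
)
```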

Using this repo inside the CSC environment

There are additional scripts and tricks for using this repository in the various CSC environments. If you need to do this, ask Eetu for help.
