
redditPostArchiver

A Post Thread Archiver for Reddit

a script written in Python

Dependencies: PRAW, requests, pyyaml, arrow, Python 3

Additional dependencies for subreddit.py: apsw, peewee, urlextract, tqdm

Quick Start

As a regular user, install (preferably without sudo):

pip3 install --user requests praw pyyaml arrow

If you want to run the subreddit database script, you must also install apsw, peewee, urlextract, and tqdm. On Windows, grab APSW from Christoph Gohlke's page, as it may not install from pip; you may also have to grab Peewee from the same page if it doesn't install from pip. I recommend installing the newest version of URLExtract directly from the developer's GitHub to avoid an error with URL processing.

pip3 install --user apsw peewee tqdm
pip3 install --user git+https://github.com/lipoja/URLExtract.git

Visit the PRAW documentation and follow the instructions for a script installation:

https://praw.readthedocs.io/en/latest/getting_started/authentication.html

In short, you need to register a new app with Reddit:

https://www.reddit.com/prefs/apps/

and grab your client ID and client secret from the app page:

The client ID is under the blue 'PRAW OAuth2 Test' logo at the top left, the client secret is beside 'secret', and you must also add a redirect URI at the bottom, which isn't used but is still required.

Edit the included "credentials.yml" file, replacing each "test" placeholder with the values from your Reddit app.
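For reference, here is a minimal sketch of loading those credentials and building a PRAW client. The key names in credentials.yml are assumptions based on PRAW's parameters, not taken from the repository's actual file:

import praw
import yaml

# Load the credentials file; the key names below are assumptions.
with open("credentials.yml") as f:
    creds = yaml.safe_load(f)

# Build a read-only PRAW client from the loaded values.
reddit = praw.Reddit(
    client_id=creds["client_id"],
    client_secret=creds["client_secret"],
    user_agent="redditPostArchiver (personal archiving script)",
)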

archiver.py

Navigate to the folder containing archiver.py and run the script; an HTML file will be written into that same folder. To choose which post to archive, simply provide the post ID as an argument to the script (e.g., python archiver.py 15zmjl). If you've used the postids.py program below to generate a text file of Reddit submission IDs, you can bulk-process them with python archiver.py -i <filename> (e.g., python archiver.py -i GallowBoob_1517981182.txt).
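Archiving a post amounts to fetching the submission and flattening its comment tree. A rough sketch of that traversal with PRAW (illustrative only, not the script's actual code; reddit is the client built in the credentials sketch above):

# Fetch the submission by its ID.
submission = reddit.submission(id="15zmjl")

# Resolve all "load more comments" stubs so the tree is complete.
submission.comments.replace_more(limit=None)

# Walk the flattened comment tree.
for comment in submission.comments.list():
    print(comment.author, comment.body[:80])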

The output is a webpage that looks like this:

[screenshot of the saved thread]

postids.py

Navigate to the folder containing postids.py and run the script; a text file will be written into that same folder. To choose which author to archive, simply provide the author's username as an argument: python postids.py GallowBoob. The script can handle either a bare username or a pasted URL that ends in the username.
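A rough sketch of the equivalent PRAW calls (illustrative; the output filename and URL handling are assumptions, and reddit is the client from the credentials sketch above):

import sys

# Accept either a bare username or a profile URL ending in the username.
author = sys.argv[1].rstrip("/").split("/")[-1]

# Write the IDs of the author's submissions, newest first, to a text file.
with open(author + "_postids.txt", "w") as f:
    for submission in reddit.redditor(author).submissions.new(limit=None):
        f.write(submission.id + "\n")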

subpostids.py

Navigate to the folder containing subpostids.py and run the script; a CSV file will be written into that same folder. To choose which subreddit to archive, simply provide the subreddit name as an argument: python subpostids.py opendirectories
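A rough sketch of the equivalent PRAW calls (illustrative; the CSV columns are assumptions, and Reddit's listing API only returns roughly the most recent 1000 items):

import csv

# Write submission IDs and titles from the subreddit's /new listing to a CSV.
with open("opendirectories.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for submission in reddit.subreddit("opendirectories").new(limit=None):
        writer.writerow([submission.id, submission.title])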

subreddit.py

NEW!!! Run subreddit.py <subredditname> and it will grab every submission and comment from that subreddit and store them in a SQLite database. Furthermore, it will then parse all of the text in the comment bodies for possible URLs. If you rerun the script, it will avoid redownloading any comments or submissions it already has, so you can keep the database up to date.
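A condensed sketch of the storage and URL-extraction approach using peewee and urlextract (the table and field names are illustrative, not the script's actual schema):

from peewee import SqliteDatabase, Model, CharField, TextField
from urlextract import URLExtract

db = SqliteDatabase("subreddit.db")

# Illustrative comment table; the real schema may differ.
class Comment(Model):
    comment_id = CharField(unique=True)  # unique key lets reruns skip rows
    body = TextField()

    class Meta:
        database = db

db.connect()
db.create_tables([Comment])

# Parse each stored comment body for possible URLs.
extractor = URLExtract()
for comment in Comment.select():
    urls = extractor.find_urls(comment.body)

The unique comment_id is what makes reruns cheap: inserting with peewee's get_or_create (or catching the uniqueness error) skips anything already stored.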

Quickly get started with the redditPostArchiver Docker image

You can run the scripts by specifying the script name (with or without .py) followed by its arguments. All output files are written to /rpa in the container. Credentials are passed through the environment variables CLIENT_ID and CLIENT_SECRET.

Running Example

docker run -it --rm -e CLIENT_ID=YOUR_CLIENT_ID -e CLIENT_SECRET=YOUR_CLIENT_SECRET -v $(pwd)/rpa:/rpa yardenshoham/reddit-post-archiver postids GallowBoob

When it is done running, the .txt file will be in $(pwd)/rpa.

Motivation

If you're addicted to reddit, it's pretty likely because it changed the way you live your life (or at least a part of it). Contained within reddit's databases lies a huge treasure trove of information: academic, creative, informative, transformative, etc. It's true that a huge amount of the information accessible through reddit comes from links, but the discussions on those links are often the best part of reddit. And then think about all the self posts across all of reddit.

  • Some future SOPA spin-off could limit the content of the internet, and perhaps make reddit delete old posts.
  • Although it seems unthinkable now, the popularity of reddit could wane (if a similar but clearly superior alternative appeared, for example). Waning popularity translates into waning revenues, which might impact their data retention strategy -- older posts might simply be deleted after reaching a certain age.
  • A zombie apocalypse or mutant super virus eradicates modern civilization, ending the age of the internet, or at least the internet as we know it today.

Or you can use your imagination. So, if you're looking for a quick, easy way to save some of this knowledge for yourself, look no further. (And I think this should be exclusively for personal use; I don't know how happy reddit-as-a-business would be if you started mirroring their content while serving up advertisements.)

This package provides a quick way to save a specific subset of that knowledge, and to save it in the most future-proof way possible, while maintaining the readability of reddit in its original form.
