Installing 4CAT
4CAT has two components, the backend and the web tool. These share some bits of code and a configuration file but apart from that they run independently. Communication between the two happens via a PostgreSQL database.
It is recommended that you run 4CAT on a UNIX-like system (e.g. Linux or macOS). 4CAT also requires Python 3.7 (lower versions may work but are not supported) and PostgreSQL 9.5.
4CAT uses Sphinx as its search backend. This guide assumes you have a local instance of Sphinx running. Refer to the Sphinx website for further instructions on how to set this up.
Hardware-wise, it is recommended that you store both the database and the Sphinx indexes on a reasonably fast SSD. It is also recommended that your server has about 8 GB of RAM per 100 million posts stored (this is a rule of thumb; your mileage may vary).
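As a rough illustration of the rule of thumb above, the following sketch turns a post count into a RAM estimate. The function name `estimate_ram_gb` is made up for this example and is not part of 4CAT; the result is only as accurate as the rule of thumb itself.

```python
import math

# Rule of thumb from the text: about 8 GB of RAM per 100 million posts.
GB_PER_100M_POSTS = 8
POSTS_PER_UNIT = 100_000_000


def estimate_ram_gb(post_count: int) -> int:
    """Return a rough RAM estimate, in whole GB, for a corpus of post_count posts."""
    if post_count < 0:
        raise ValueError("post_count must be non-negative")
    # Round up so that small corpora still get at least 1 GB.
    return math.ceil(post_count * GB_PER_100M_POSTS / POSTS_PER_UNIT)


if __name__ == "__main__":
    for posts in (50_000_000, 100_000_000, 350_000_000):
        print(f"{posts:>12,} posts -> ~{estimate_ram_gb(posts)} GB RAM")
```

Treat the output as a starting point for sizing a server, not as a hard requirement.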
Clone the repository somewhere:
git clone https://www.github.com/digitalmethodsinitiative/4cat.git
After cloning the repository, copy config.py-example to config.py and edit
the file to match your machine's configuration. The various options are
explained in the file itself:
cd 4cat
cp config.py-example config.py
nano config.py
Note that you need to create a database and database user yourself: this is
not handled by 4CAT. Upon first running the backend, it will create new tables
and indices in the database specified in config.py, so make sure the
configured database user has the rights to do so.
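Since 4CAT does not create the database or its user, the sketch below generates the two SQL statements that would typically be needed. The names `fourcat` and the placeholder password are hypothetical, as are the helper functions; use whatever values you put in config.py. Making the user the owner of the database is one simple way to give it the right to create tables and indices there. The literal quoting assumes PostgreSQL's default `standard_conforming_strings = on`.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def setup_statements(db_name: str, db_user: str, db_password: str) -> list:
    """Build the SQL needed to create a database user and a database it owns."""
    return [
        f"CREATE USER {quote_ident(db_user)} WITH PASSWORD {quote_literal(db_password)};",
        f"CREATE DATABASE {quote_ident(db_name)} OWNER {quote_ident(db_user)};",
    ]


if __name__ == "__main__":
    # Hypothetical example values; replace them with the ones from config.py.
    for statement in setup_statements("fourcat", "fourcat", "change-me"):
        print(statement)
```

The printed statements can then be run by a PostgreSQL superuser, for example from a psql session.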
Next, install the dependencies. While in the 4CAT root folder, run:
pip3 install -r requirements.txt
You should now be set up to run 4CAT. It is recommended that you next run the included test suite to make sure everything has been set up correctly and that you can reach the 4chan API:
python3 -m unittest discover test
If everything tested successfully, you will see a message similar to the following:
.........ss..............x.x...............
----------------------------------------------------------------------
Ran 43 tests in 9.837s
OK (skipped=2, expected failures=2)
Next, you should set up the Sphinx indexes and data sources. A
configuration file can be generated via
backend/extras/generate-sphinx.py. Copy the configuration file to
etc/sphinx.conf in your Sphinx folder (this is the default
configuration location; change according to your setup).
You should set up Sphinx to rotate indexes at least daily. Note that depending on your setup, generating indexes for your full corpus could take multiple hours.
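One way to rotate indexes on a schedule is to call Sphinx's `indexer` tool with its `--all` and `--rotate` options from a small script that a daily cron job runs. The sketch below assumes `indexer` is on the PATH; the function names are made up for this example and are not part of 4CAT.

```python
import subprocess


def build_rotate_command(config_path: str) -> list:
    """Build the indexer command that rebuilds and rotates all indexes."""
    return ["indexer", "--config", config_path, "--all", "--rotate"]


def rotate_indexes(config_path: str) -> int:
    """Run the indexer and return its exit code (127 if indexer is not found)."""
    try:
        return subprocess.run(build_rotate_command(config_path), check=False).returncode
    except FileNotFoundError:
        # The indexer binary is not on the PATH of the user running this script.
        return 127
```

Calling `rotate_indexes("/path/to/sphinx.conf")` once a day would satisfy the recommendation above; because a full rebuild can take hours, schedule it for a quiet period.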
You can now run 4CAT.
The backend is run as a daemon that can be started and stopped using the
included 4cat-daemon.py script:
python3 4cat-daemon.py start
Other valid arguments are stop, restart and status. Note that if you
change any configuration options, you will need to restart the daemon for the
changes to take effect.
Note: 4CAT was made to run on a UNIX-like system and the above will not
work on Windows. Instead, running 4cat-daemon.py on Windows will start the
4CAT backend in the terminal window, regardless of the argument given. The
backend can then be quit by entering q, followed by Enter.
The web tool is a Flask app. It is recommended that you run the web tool as a WSGI module: see the Flask documentation for more details. For testing and development, you can run the Flask app locally from the command line:
FLASK_APP=webtool flask run
With the default configuration, you can now navigate to
http://localhost:5000 where you'll find the web tool that allows you to query
the database and create datasets.
4CAT is not very useful with an empty database. To fill it with 4chan data, you can either import data from elsewhere or scrape 4chan yourself (or do both).
The backend folder includes import_dump.py. You can use this script
to import data dumps from 4plebs. Run the
script without arguments for more information on its syntax. Note that for
larger boards, imports can take a long time to finish (multiple days). This is
due to the sheer size of the data sets, and because 4CAT needs full-text
indices to search through the data, which take relatively long to generate. A
faster drive helps.
The 4CAT backend comes with a 4chan API scraper that can capture new posts
on 4chan as they are posted. You can configure which boards are to be scraped
in config.py. Note that the 4chan API has a rate limit and scraping too many
boards will probably make you hit that limit quite quickly. It is recommended
that you keep an eye on the backend log files when you first start scraping to
make sure you're getting all the data you want. You may add a list of proxies
in the configuration file: 4CAT will use a random proxy while scraping, which
will likely allow for more requests before you hit the rate limit.
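To illustrate the idea of using a random proxy per request, here is a minimal sketch. It is not 4CAT's actual scraper code: `pick_proxy` is a hypothetical helper, and the returned mapping follows the format that the `requests` library accepts for its `proxies` argument.

```python
import random


def pick_proxy(proxies):
    """Return a requests-style proxies mapping for one randomly chosen proxy.

    Returns None when no proxies are configured, which means a direct connection.
    """
    if not proxies:
        return None
    proxy = random.choice(proxies)
    # Use the same proxy for both plain HTTP and HTTPS traffic.
    return {"http": proxy, "https": proxy}
```

A scraper could then pass the result along with each request, e.g. `requests.get(url, proxies=pick_proxy(proxy_list), timeout=10)`, so that consecutive requests are spread across the configured proxies.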
If you decide to scrape 4chan, it is recommended that you run the 4chan API
compatibility test regularly to remain aware of any changes in the API
response. The test is located at test/test_4chan_api.py. If the 4chan API
response is compatible with 4CAT, the tests within will pass: if not, pay close
attention to which tests fail, and read the failure messages for more info
on what to do next.
python3 -m unittest test/test_4chan_api.py
By default the web tool and backend run on the same server, but you can also run them on separate servers. Start only the backend on one server and only the web tool on the other. As long as both are configured to connect to the same PostgreSQL database, the backend and web tool will be able to communicate.
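When splitting the two components across servers, a quick way to rule out firewall or network problems is to check that the database port is reachable from each machine. The sketch below is a generic TCP check, not part of 4CAT; `can_reach` is a hypothetical helper and 5432 is PostgreSQL's default port, which your setup may have changed.

```python
import socket


def can_reach(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Running `can_reach("db.example.org")` (a placeholder hostname) on both servers should return True before you start the backend and the web tool. Note that this only confirms the port is open; PostgreSQL must also be configured to accept remote connections for the configured user.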