Erik Borra edited this page Dec 21, 2020 · 17 revisions

Overview

TCAT can be easily installed using the installation script. The script runs on Debian GNU/Linux 9.0 and Ubuntu 18.04. For instructions on using the install script, see the automated installation section.

TCAT can also be installed manually. This involves installing and configuring: a MySQL database for TCAT to use, a Web server to run the PHP scripts, and TCAT itself. For information about manual installation, see the manual installation section.

If problems are encountered, the common problems section might provide useful information.

Automated installation

An install script is available to automate the process of installing and configuring TCAT on Linux. Currently, only Debian and Ubuntu Linux are supported. The script is ideal for installing TCAT on a virtual machine, such as an Amazon EC2 instance.

Requirements

  • Twitter API credentials (these can be obtained from https://apps.twitter.com);
  • One of the following Linux distributions:
    • Ubuntu 18.04
    • Debian 9.*

The install script has been designed to run on a new installation of Linux. It installs all dependencies and makes all appropriate configuration changes, including downloading and installing MySQL/MariaDB, PHP and the Apache Web Server. It is therefore meant for a system that does not yet have those components.

Instructions

Step 1: Download the install script

The install script is downloaded with curl. If curl is not available, install it using:

sudo apt-get install curl

Download the install script (the -O, capital-o, option saves it to a local file with the same name as the remote file):

curl -O "https://raw.githubusercontent.com/digitalmethodsinitiative/dmi-tcat/master/helpers/tcat-install-linux.sh"

Make it executable:

chmod a+x tcat-install-linux.sh

Step 2: Run the install script

The script can be run in interactive mode, where it prompts the user for the parameters it needs:

sudo ./tcat-install-linux.sh

Note: it must be run with root privileges. The above example uses the sudo command to do this.

It will prompt for:

  • Twitter API consumer key;
  • Twitter API consumer secret;
  • Twitter API user token;
  • Twitter API user secret;
  • Mode of tweet capture to perform (phrases/keywords, follow users, or 1% sample);
  • Whether to expand URLs in tweets or not;
  • The name of the server;
  • Whether to allow TCAT to automatically upgrade itself or not; and
  • Other advanced parameters (but usually the default values for these can be used).

The name of the server is very important. It will be the name of the machine in the URL used to access the TCAT Web pages. It must be the host name or IP address of the machine TCAT is being installed on.

The script will confirm the parameters before proceeding with the install. If the values are incorrect, answer "n" to edit the values. Answer "y" to start the installation process.

Step 3: Wait for the installation process to finish

Wait while the install script downloads and configures the required components. These components include the TCAT files, MySQL/MariaDB database, PHP and the Apache Web Server.

The install script, by default, will run apt-get update and apt-get upgrade at the start of the process, to ensure the system is up to date.

Note for Debian: when installing on Debian, the installation process will prompt for a "MySQL product to configure" when mysql-apt-config is installed. Press the down key on your keyboard to select "Apply" and then press the return key to continue.

Step 4: Use TCAT

When the install finishes, details on how to access TCAT are printed out.

If you did not set login passwords in the advanced parameters, highly-secure random passwords will be generated and displayed at the end of the installation process. Please save these passwords in a password manager program: you are not expected to memorize them! Copy and paste the passwords into the Web browser: you are also not expected to type them in!

Using a Web browser, log in to the TCAT capture page and create your first query bin.

Other ways to run the install script

Providing parameters in a configuration file

Instead of interactively providing some/all of the parameters, they can be supplied in a configuration file.

sudo ./tcat-install-linux.sh -c myconfigfile.txt

The configuration file should contain the TCAT installer parameters. See the first section of the install script for a list of them. If the configuration file does not set a parameter, the default value from the install script is used.

The TCAT installer parameters are represented by bash shell environment variables. The install script treats the configuration file as a bash script, which it sources.
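Because the file is sourced, it is simply a list of shell variable assignments. The following is a minimal sketch of that mechanism, not an excerpt from the installer: the EXAMPLE_* names are hypothetical placeholders, and the real parameter names are the ones listed in the first section of the install script.

```shell
# Sketch of how a sourced configuration file overrides built-in defaults.
# EXAMPLE_* names are hypothetical placeholders, NOT real TCAT parameters.

# Built-in defaults, as an install script might define them.
EXAMPLE_SERVER_NAME="localhost"
EXAMPLE_EXPAND_URLS="y"

# A configuration file only needs the values that differ from the defaults.
config_file=$(mktemp)
cat > "$config_file" <<'EOF'
# Plain shell assignments: no spaces around "=", quote values with spaces.
EXAMPLE_SERVER_NAME="tcat.example.org"
EOF

# Sourcing runs the file in the current shell, replacing the defaults it sets.
. "$config_file"
rm -f "$config_file"

echo "server name: $EXAMPLE_SERVER_NAME"   # value from the configuration file
echo "expand URLs: $EXAMPLE_EXPAND_URLS"   # default kept
```

Since the install script runs with root privileges and executes the configuration file as shell code, only use configuration files you trust.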

Running in batch mode

The script can be run in batch mode, where it does not prompt the user for any information (except during one part of the installation process on Debian).

sudo ./tcat-install-linux.sh -b -c myconfigfile.txt

For batch mode to work, a configuration file must be supplied with the four Twitter API parameters. The configuration file may contain other parameters, but minimally those Twitter API parameters must be specified. Unlike the other parameters, the install script does not have built-in defaults for the Twitter API parameters.
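Before starting an unattended run, it can help to confirm that the configuration file really sets the required parameters. The helper below is a sketch under assumptions: check_required is a hypothetical function, not part of TCAT, and the EXAMPLE_* names stand in for the four Twitter API parameter names found in the first section of the install script.

```shell
# Pre-flight check: does a configuration file set every required parameter?
# check_required and the EXAMPLE_* names are hypothetical illustrations.
check_required() {
    # usage: check_required CONFIG_FILE NAME...
    (
        config_file=$1
        shift
        # "." searches PATH for bare names, so anchor them to this directory.
        case $config_file in
            */*) ;;
            *) config_file=./$config_file ;;
        esac
        . "$config_file" || exit 2
        missing=0
        for name in "$@"; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "not set: $name" >&2
                missing=1
            fi
        done
        exit "$missing"
    )
}

# Demonstration with a throwaway file in which one value is left empty.
demo_config=$(mktemp)
printf '%s\n' 'EXAMPLE_API_KEY="abc123"' 'EXAMPLE_API_SECRET=""' > "$demo_config"
if check_required "$demo_config" EXAMPLE_API_KEY EXAMPLE_API_SECRET; then
    echo "all required parameters are set"
else
    echo "configuration is incomplete"
fi
rm -f "$demo_config"
```

The check runs in a subshell, so the values from the configuration file do not leak into the calling shell.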

Automatically saving the TCAT login account details

By default, the TCAT Web login usernames and passwords are not saved anywhere in plain text.

The install script can save them to a text file, if requested with the -l (lowercase-L) option:

sudo ./tcat-install-linux.sh -l

This can be useful for saving a copy of the randomly generated password on the installation machine. But please consider the security implications before using this feature.

Other options

See the help message for more options:

./tcat-install-linux.sh -h

Resetting the TCAT Web login passwords

The TCAT Web logins are securely stored in an Apache htpasswd basic authentication file. To reset a password to a new value:

sudo htpasswd /etc/apache2/tcat.htpasswd admin

Replace "admin" with the login username whose password you want to reset (the default logins are named "admin" and "tcat").

Manual installation

While DMI-TCAT's Web interface is easy to use for non-technical people, manually installing DMI-TCAT currently requires you to have some knowledge of system administration via the command line. You should also be able to install and modify Apache and PHP and know how to administer MySQL. That said, DMI-TCAT has been tested on Linux, Windows and OS X.

Clone the Git repository into your Web directory: git clone --depth 1 https://github.com/digitalmethodsinitiative/dmi-tcat.git or download a ZIP file from https://github.com/digitalmethodsinitiative/dmi-tcat/archive/master.zip

Prerequisites

While capturing Twitter data from the streaming API does not need a lot of resources, the analysis of larger datasets (> 1 million tweets) can get slow. We strongly recommend using an SSD for database storage. Adding RAM and optimising the MySQL configuration can also boost speed significantly. We recommend using the script available on http://mysqltuner.com to tweak your MySQL configuration, and setting both sort_buffer_size and myisam_sort_buffer_size as large as your available memory allows.

Software

  • PHP >= 5.3 with the following:
    • cli, so that it can be called from the command line too
    • mysql
    • mbstring
    • Curl module
    • pcntl (make sure that pcntl is not disabled in your php.ini)
    • posix
  • MariaDB >= 10.1
  • TokuDB storage engine plugin
  • Functioning sendmail binary for mail reports
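One way to check the PHP modules listed above is to compare them against the output of php -m, which prints the modules loaded by the command-line PHP binary. The helper below is a sketch; missing_php_modules is a hypothetical name, and the module list passed to it is an assumption you should adapt to your PHP version.

```shell
# Report which required PHP modules are absent from a module list.
# Reads the list (one name per line, as printed by "php -m") on stdin
# and prints the name of every requested module that is not in it.
missing_php_modules() {
    list=$(tr '[:upper:]' '[:lower:]')
    for module in "$@"; do
        printf '%s\n' "$list" | grep -xF -- "$module" >/dev/null \
            || printf '%s\n' "$module"
    done
}

# Typical use. The MySQL driver is left out on purpose because its module
# name (mysql, mysqli, pdo_mysql) depends on the PHP version.
if command -v php >/dev/null 2>&1; then
    php -m | missing_php_modules mbstring curl pcntl posix
fi
```

No output means every requested module was found. Note that the Web server may load a different php.ini than the command-line binary, so this only covers the CLI side.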

The TokuDB storage engine

Since April 2019, new TCAT installations require MariaDB 10.1 or higher with the TokuDB plugin enabled. If, for some reason, you cannot enable the TokuDB plugin, you still have the option to use the older MyISAM engine by specifically configuring this in the config.php file (set MYSQL_ENGINE_OPTIONS to ENGINE=MyISAM). The configuration file allows you to experiment with different MySQL storage engines such as RocksDB or Aria as well. Feel free to do so at your own risk, but we don't offer support for this on GitHub.

There's one special caveat with using the TokuDB storage engine. You must disable Linux kernel Transparent HugePages (THP) permanently, i.e. via your kernel boot parameters. You should search online for instructions on how to do this for your particular Linux distribution or study our automated installation script for inspiration. If you're not using Linux, this won't be necessary.
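To see whether THP is currently active, you can read the kernel's sysfs entry, which lists the available modes and marks the active one with square brackets. The sketch below only reports the state and changes nothing; thp_active_mode is a hypothetical helper name. The kernel boot parameter that disables THP is transparent_hugepage=never, but how you add it to the boot loader depends on your distribution.

```shell
# Print the active Transparent HugePages mode (the bracketed word), given
# the contents of /sys/kernel/mm/transparent_hugepage/enabled.
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    mode=$(thp_active_mode "$(cat "$thp_file")")
    if [ "$mode" = "never" ]; then
        echo "THP is disabled"
    else
        echo "THP mode is '$mode'; TokuDB needs 'never'"
    fi
else
    echo "no THP setting found at $thp_file"
fi
```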

Database setup

Create a MySQL database and a database user to access it. E.g. CREATE DATABASE IF NOT EXISTS twittercapture DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

The database user will need the following privileges on the database: CREATE, DROP, LOCK TABLES, ALTER, DELETE, INDEX, INSERT, SELECT, UPDATE, CREATE TEMPORARY TABLES
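The privileges above can be granted with a short SQL file. The sketch below writes such a file for review instead of running it directly. The user name tcat_db_user and the password CHANGE-ME are hypothetical placeholders, and the database name is the one from the example above.

```shell
# Write a SQL file that creates a database user with the privileges TCAT
# needs. 'tcat_db_user' and 'CHANGE-ME' are placeholders: replace both.
sql_file=tcat-create-user.sql
cat > "$sql_file" <<'EOF'
CREATE USER 'tcat_db_user'@'localhost' IDENTIFIED BY 'CHANGE-ME';
GRANT CREATE, DROP, LOCK TABLES, ALTER, DELETE, INDEX, INSERT, SELECT,
      UPDATE, CREATE TEMPORARY TABLES
    ON twittercapture.* TO 'tcat_db_user'@'localhost';
EOF

echo "Review $sql_file, then load it as a database administrator, e.g.:"
echo "  sudo mysql < $sql_file"
```

Whether sudo mysql works without a password depends on how the root account of your database server is configured; use mysql -u root -p instead if it does not.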

It is also recommended to install the MySQL Server Time Zone Support, which may not be installed by default. This usually requires you to issue a single command. Please read here: https://dev.mysql.com/doc/refman/5.5/en/time-zone-support.html

Config

Copy the template file with cp dmi-tcat/config.php.example dmi-tcat/config.php, then modify config.php to reflect your setup:

  • Fill in your MySQL database credentials.
  • Choose a capture role. DMI-TCAT allows you to 'track' tweets based on a set of keywords, 'follow' users based on a set of user ids, or retrieve a 'one percent' sample from the Twitter API. As Twitter only allows you to connect to either one of them for any given IP address, and as most machines only have one IP address, you will have to choose either 'track', 'follow', or 'onepercent' and define that in CAPTUREROLES.
  • Insert your OAuth API key from Twitter.
    • If you don't have one already, create a Twitter application on https://dev.twitter.com/apps/new
    • From the API Keys page copy the "API key" and "API secret" into the place in the code marked with $twitter_consumer_key and $twitter_consumer_secret (in the right block, see just below)
    • Create an access token and copy the "access token" and "access token secret" into the place in this code marked with $twitter_user_token and $twitter_user_secret (in the right block, see just below)
    • If you chose 'track', insert your OAuth API keys from twitter in the code block just below if (!defined('CAPTURE') || !strcmp(CAPTURE, "track")) {
    • If you chose 'follow', insert your OAuth API keys from twitter in the code block just below } elseif(!strcmp(CAPTURE, "follow")) {
    • If you chose 'onepercent', insert your OAuth API keys from twitter in the code block just below } elseif(!strcmp(CAPTURE, "onepercent")) {
  • Make sure your root URL (BASE_URL) is set correctly.
  • Go over the other variables in the script and give them sensible values for your setup (e.g. in some cases, the "PHP_CLI" variable needs to be changed to the location of the PHP binary on your installation).
  • You can enable automatic TCAT updates through the AUTOUPDATE_ENABLED config variable. For further information, read the automatic update instructions.

Create the cache dir for exports and make it writeable for the web server, e.g.:

mkdir dmi-tcat/analysis/cache; chown www-data dmi-tcat/analysis/cache; chmod 755 dmi-tcat/analysis/cache;

Create the capture log dir and set the appropriate permissions:

mkdir dmi-tcat/logs; chown www-data dmi-tcat/logs; chmod 755 dmi-tcat/logs;

Create the proc dir and set the appropriate permissions:

mkdir dmi-tcat/proc; chown www-data dmi-tcat/proc; chmod 755 dmi-tcat/proc;
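The three steps above can also be sketched as a single loop that is safe to run more than once. It assumes that it is run from the directory containing dmi-tcat and that the Web server runs as www-data; TCAT_DIR and WEB_USER are hypothetical variable names introduced here so both can be overridden.

```shell
# Create the three writable directories in one pass.
# TCAT_DIR and WEB_USER are illustrative names, not TCAT settings.
TCAT_DIR=${TCAT_DIR:-dmi-tcat}
WEB_USER=${WEB_USER:-www-data}

for dir in analysis/cache logs proc; do
    mkdir -p "$TCAT_DIR/$dir"
    chmod 755 "$TCAT_DIR/$dir"
    # chown only works as root and only if the account exists.
    if [ "$(id -u)" -eq 0 ] && id "$WEB_USER" >/dev/null 2>&1; then
        chown "$WEB_USER" "$TCAT_DIR/$dir"
    else
        echo "skipped chown for $TCAT_DIR/$dir (run as root to apply)" >&2
    fi
done
```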

Optional: if you want to run the capture scripts as a different user than the one you are currently logged in as, you'll have to change write permissions for the logs and proc dir. E.g.:

sudo chown differentuser dmi-tcat/logs
sudo chown differentuser dmi-tcat/proc

Authentication and access

DMI-TCAT has two main interfaces: the query manager (yourserver.tld/capture/) and the analysis interface (yourserver.tld/analysis/).

Note that the query manager is accessible via the Web and that Twitter's TOS do not allow you to provide exports of Twitter content as a service. We therefore strongly recommend that you restrict who can access your DMI-TCAT's capture and analysis interfaces. The simplest way to do so for the Apache web server is via htaccess authentication.

We recommend that you create two types of users who can access DMI-TCAT's web interface: one for access of the analysis part (e.g. a user called tcat), and one for the person in charge of modifying query bins (e.g. a user called admin).

On Ubuntu you can do something like the following (the -c option creates the password file, so use it only for the first user):

sudo htpasswd -c /etc/apache2/passwords tcat
sudo htpasswd /etc/apache2/passwords admin

Here is an example apache config to go along with it:

<Directory /var/www/dmi-tcat/>
    # make sure directory lists are not possible
    Options -Indexes
    # basic authentication
    AuthType Basic
    AuthName "Log in to DMI-TCAT"
    AuthBasicProvider file
    AuthUserFile /etc/apache2/passwords 
    Require user admin tcat 
    DirectoryIndex index.html index.php
    # some directories and files should not be accessible via the web, make sure to enable mod_rewrite
    RewriteEngine on
    RewriteRule ^(cli|helpers|import|logs|proc|config.php|capture/common|capture/klout|capture/pos|capture/search|capture/stream|capture/user) - [F,L,NC]
</Directory>

Don't forget to restart your web server, e.g.: sudo service apache2 graceful
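To preview which request paths the RewriteRule above refuses, its pattern can be approximated with grep. This is only a sketch: PATTERN is a hand-made copy of the rule's alternation, written without a leading slash on any alternative because inside a Directory block mod_rewrite matches against the path relative to that directory. is_blocked is a hypothetical helper name, and grep's extended regular expressions are assumed to behave like Apache's for a pattern this simple.

```shell
# Approximate the RewriteRule with grep to preview which paths are refused.
# Keep PATTERN in sync with your own Apache configuration.
PATTERN='^(cli|helpers|import|logs|proc|config.php|capture/common|capture/klout|capture/pos|capture/search|capture/stream|capture/user)'

is_blocked() {
    # Paths are relative to the TCAT directory, without a leading slash.
    # -i mirrors the NC (no case) flag of the rule.
    printf '%s\n' "$1" | grep -Ei -- "$PATTERN" >/dev/null
}

for path in capture/index.php analysis/index.php config.php logs/controller.log; do
    if is_blocked "$path"; then
        echo "refused  $path"
    else
        echo "allowed  $path"
    fi
done
```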

Create your first query bin

You can manage which tweets are captured through BASE_URL/capture/index.php (please note that you may not be able to access this file if ADMIN_USER in config.php is not empty and htaccess authentication is not yet set up; to test, clear ADMIN_USER). Depending on the CAPTUREROLES defined in config.php you will be able to 'track' tweets based on a set of keywords, 'follow' users based on a set of user ids, or retrieve a 'one percent' sample from the Twitter API. See "What is a query bin?" and "How can I formulate queries?" in the FAQ for an explanation of the different types of queries you can make.

Before going to the next step, make sure that you created at least one query bin.

Test the capturing scripts

Now you are ready to test things. Go to dmi-tcat/capture/stream and run php dmitcat_track.php, php dmitcat_follow.php, or php dmitcat_onepercent.php, depending on what you defined in CAPTUREROLES. If no errors are returned, press ctrl-c and run php controller.php.

Check the logs to see whether anything unusual has happened. Go to dmi-tcat/logs and run tail *log. Inspect the output for errors. If no error is found, you are ready to install the crontab.

Refer to the list of common problems if you notice any error messages.

Install crontab

DMI-TCAT uses a crontab to regularly check whether your capturing script is still inserting data; controller.php will check each minute whether the capture scripts are still up and running.

Use the command crontab -e to edit the crontab. Make sure you edit the crontab as the user (-u username) for which you made dmi-tcat/logs, dmi-tcat/proc and dmi-tcat/analysis/cache writeable (for example: www-data).

Add the following line at the end of the file.

* * * * * (cd /var/www/dmi-tcat/capture/stream/; php controller.php)

Make sure /var/www/dmi-tcat matches your install location.
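If you prefer not to edit the crontab by hand, the line can be appended from a script. The sketch below adds it only when it is not already present, so it can be repeated safely. add_line_once and TCAT_DIR are hypothetical names introduced for this example, and the final installation step is left commented out so that nothing changes until you have reviewed the result.

```shell
# Append a line to a file only if that exact line is not already there.
add_line_once() {
    line=$1
    file=$2
    touch "$file"
    grep -xF -- "$line" "$file" >/dev/null || printf '%s\n' "$line" >> "$file"
}

TCAT_DIR=${TCAT_DIR:-/var/www/dmi-tcat}
CRON_LINE="* * * * * (cd $TCAT_DIR/capture/stream/; php controller.php)"

# Work on a copy of the current crontab (empty if there is none yet).
tmp=$(mktemp)
if command -v crontab >/dev/null 2>&1; then
    crontab -l > "$tmp" 2>/dev/null || true
fi
add_line_once "$CRON_LINE" "$tmp"
cat "$tmp"

# Uncomment to install it for the current user; as root, use
# "crontab -u USER" to read and install another user's crontab instead.
# crontab "$tmp"
```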

Expand URLs

Many URLs in Twitter are shortened. To find the end point of a shortened URL, we have made a script, urlexpand.py, which parallelizes URL following.

First try out the script:

cd /var/www/dmi-tcat/helpers/; python urlexpand.py

When the script runs without import errors, kill it, and enable it in the /var/www/dmi-tcat/config.php: define('ENABLE_URL_EXPANDER', true);

Log rotation (optional)

If you are an administrator on a Linux distribution, rotating your log files automatically is easy. As root, put the following in /etc/logrotate.d/dmi-tcat

/var/www/dmi-tcat/logs/controller.log /var/www/dmi-tcat/logs/track.error.log /var/www/dmi-tcat/logs/follow.error.log /var/www/dmi-tcat/logs/onepercent.error.log  
{ 
 weekly  
 rotate 8  
 compress  
 delaycompress  
 missingok  
 ifempty  
 create 644 www-data www-data  
}

Please make sure the paths are correct, and make sure that www-data www-data is replaced by the correct user and group owner of the logs directory.

Capturing GEO-located Tweets using TCAT

To track Tweets by locations with TCAT, there are some additional requirements. Download the following package, either through your OS' packaging system, or directly from their website:

GEOS with the build option to create a PHP module (--enable-php). Your OS package may not have selected this build option by default, in which case you will need to build GEOS from source yourself.

(For Debian and Ubuntu users, the desired package is now available in Jessie and is called: php-geos.)

If you've built the package from source, you must edit your php.ini and add a new line:

extension=geos.so

After you have installed GEOS, restart your web server; the requirements should then be met. You should see the following line appear in logs/track.error.log after the tracking process has started:

geoPHP library is fully functional

If it is not working (yet), remove your geobin while fixing the issue, because TCAT will refuse to track any Tweets otherwise.

In order to do polygonal searches on geobins through the analysis panel, you will need a MySQL server which supports these queries. Only MySQL server >= 5.6.1 supports these queries.

You can start tracking GEO located Tweets if you have defined 'track' in your CAPTUREROLES in config.php (which is the default setting). Head to the admin panel of TCAT, create a new bin with type 'geo track'. You should see a form with an option to define one or more 'bounding boxes'. These boxes are quadrilaterals on the world map signifying the areas and locations you would wish to track Tweets from. Please read the instructions below that form for more details.

Common problems

If a normal connection to Twitter cannot be established, our application will write the server response to the log file. Please study it carefully because it will hint at a solution.

401 unauthorized header in server reply

This usually signifies incorrect credentials. Make sure you have edited config.php correctly and filled in the four necessary authentication parameters in the section of the script you are running (track, follow or onepercent).

If you are certain your credentials are correct, a second cause of this error message may be a (slightly) incorrect system time. Please verify that your system clock is accurate, and fix it if it isn't. You can install a tool like ntp to keep your system clock synchronized over the internet.

'error' => 'error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure'

If you see this cryptic looking message, it means your PHP installation does not accept the Twitter method of encrypting traffic. You are probably running this tool on Mac OS and using the Macports version of PHP. You will need to add a line of source code to work around this problem. See this issue for more information.