End to end human text analysis package, specifically suited for social media and social scientific applications. It is written in Python 3 and developed by the World Well-Being Project at the University of Pennsylvania and Stony Brook University.

Differential Language Analysis ToolKit

DLATK is an end-to-end human text analysis package for Python 3. It is particularly suited to social media, psychology, and health research, and was originally developed for projects at the University of Pennsylvania and Stony Brook University. It has been used in over 75 peer-reviewed publications (many of them published before there was a DLATK article to cite).

DLATK can perform:

  • linguistic feature extraction (i.e. turning text into variables)
  • differential language analysis (i.e. finding the language that is most associated with psychological or health variables)
  • wordcloud visualization
  • statistical- and machine learning-based supervised prediction (regression and classification)
  • statistical- and machine learning-based dimensionality reduction and clustering
  • mediation analysis
  • contextual embeddings: using deep learning transformers to produce message, user, or group embeddings
  • part-of-speech tagging
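
To make the idea behind differential language analysis concrete, here is a minimal sketch in plain Python. It is not DLATK's implementation: the function names (`pearson_r`, `rank_features`) and the toy data are hypothetical, and it assumes the simplest possible association measure, a Pearson correlation between each feature's per-group relative frequency and a single outcome.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(group_norms, outcome):
    """Rank features by the strength of their correlation with an outcome.

    group_norms: {feature: {group_id: relative frequency}}  (hypothetical layout)
    outcome:     {group_id: outcome value}
    Returns a list of (feature, r), strongest association first.
    """
    groups = sorted(outcome)
    ys = [outcome[g] for g in groups]
    scores = {}
    for feat, by_group in group_norms.items():
        # A group that never used the feature contributes a frequency of 0.
        xs = [by_group.get(g, 0.0) for g in groups]
        scores[feat] = pearson_r(xs, ys)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data, invented for illustration only.
toy_norms = {
    "happy": {1: 0.10, 2: 0.05, 3: 0.01},
    "tired": {1: 0.01, 2: 0.04, 3: 0.09},
    "the": {1: 0.25, 2: 0.25, 3: 0.25},
}
toy_outcome = {1: 9.0, 2: 6.0, 3: 2.0}

ranked = rank_features(toy_norms, toy_outcome)
for feat, r in ranked:
    print(f"{feat}\t{r:+.3f}")
```

In this toy example "happy" correlates positively with the outcome, "tired" negatively, and "the" not at all because its frequency never varies. See the DLATK documentation for what the toolkit itself computes.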

DLATK can integrate with

Functions of DLATK use:

Installation

DLATK is available through any of four popular installation routes: conda, pip, GitHub, or Docker.

New to installing Python packages?

It is recommended that you see the full installation instructions.

0. Make sure python3-mysqldb is installed (the command below is for Debian/Ubuntu):

sudo apt-get install python3-mysqldb

1. conda

conda install -c wwbp dlatk

2. pip

pip install dlatk

3. GitHub

git clone https://github.com/dlatk/dlatk.git
cd dlatk
python setup.py install

4. Docker

Detailed Docker install instructions here.

docker run --name mysql_v5 --env MYSQL_ROOT_PASSWORD=my-secret-pw --detach mysql:5.5
docker run -it --rm --name dlatk_docker --link mysql_v5:mysql dlatk/dlatk bash

Dependencies

See the full installation instructions for recommended and optional dependencies.

Quick Start

To check if it will run:

python3 dlatkInterface.py -h

To load the packaged data into MySQL and test with it:

mysql -e 'CREATE DATABASE dla_tutorial'; cat dlatk/data/dla_tutorial.sql | mysql dla_tutorial
mysql -e 'CREATE DATABASE dlatk_lexica'; cat dlatk/data/dlatk_lexica.sql | mysql dlatk_lexica

python3 dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 --add_lex -l dd_intAff --weighted_lex

Expected output:

-----
DLATK Interface Initiated: XXXX-XX-XX XX:XX:XX
-----
SQL QUERY: DROP TABLE IF EXISTS feat$1gram$msgs$user_id$16to16
SQL QUERY: CREATE TABLE feat$1gram$msgs$user_id$16to16 ( id BIGINT(16) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, group_id int(10) unsigned, feat VARCHAR(36) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, value INTEGER, group_norm DOUBLE, KEY `correl_field` (`group_id
SQL QUERY: DROP TABLE IF EXISTS feat$meta_1gram$msgs$user_id$16to16
SQL QUERY: CREATE TABLE feat$meta_1gram$msgs$user_id$16to16 ( id BIGINT(16) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, group_id int(10) unsigned, feat VARCHAR(16) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, value INTEGER, group_norm DOUBLE, KEY `correl_field` (`gro
finding messages for 1000 'user_id's
SQL QUERY: ALTER TABLE feat$1gram$msgs$user_id$16to16 DISABLE KEYS
Messages Read: 5k
...
Messages Read: 30k
Done Reading / Inserting.
Adding Keys (if goes to keycache, then decrease MAX_TO_DISABLE_KEYS or run myisamchk -n).
SQL QUERY: ALTER TABLE feat$1gram$msgs$user_id$16to16 ENABLE KEYS
Done

Intercept detected 5.037105 [category: AFFECT_AVG]
Intercept detected 2.399763 [category: INTENSITY_AVG]
SQL QUERY: DROP TABLE IF EXISTS feat$cat_dd_intAff_w$msgs$user_id$1gra
SQL QUERY: CREATE TABLE feat$cat_dd_intAff_w$msgs$user_id$1gra ( id BIGINT(16) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, group_id int(10) unsigned, feat VARCHAR(13) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, value INTEGER, group_norm DOUBLE, KEY `correl_field` (`
WORD TABLE feat$1gram$msgs$user_id$16to16
SQL QUERY: ALTER TABLE feat$cat_dd_intAff_w$msgs$user_id$1gra DISABLE KEYS
10 out of 1000 group Id's processed; 0.01 complete
20 out of 1000 group Id's processed; 0.02 complete
...
1000 out of 1000 group Id's processed; 1.00 complete
SQL QUERY: ALTER TABLE feat$cat_dd_intAff_w$msgs$user_id$1gra ENABLE KEYS
--
Interface Runtime: 167.67 seconds
DLATK exits with success! A good day indeed  ¯\_(ツ)_/¯.
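
The run above builds two kinds of features per `user_id`: 1-gram counts with a normalized value (`group_norm`), and weighted lexicon scores that include an intercept. The sketch below illustrates that idea in plain Python. It rests on assumptions: that `group_norm` is a word's relative frequency within a group, and that a weighted lexicon score is the intercept plus the sum of weight times relative frequency. The tokenizer, function names, weights, and intercept are all hypothetical; none of this is DLATK's own code.

```python
from collections import Counter, defaultdict


def unigram_features(messages):
    """Count 1-grams per group and normalize by the group's total word count.

    messages: iterable of (group_id, text) pairs.
    Returns {group_id: {word: (count, relative_frequency)}}.
    """
    counts = defaultdict(Counter)
    for group_id, text in messages:
        # Whitespace tokenization is a simplification made for this sketch.
        counts[group_id].update(text.lower().split())
    features = {}
    for group_id, counter in counts.items():
        total = sum(counter.values())
        features[group_id] = {w: (c, c / total) for w, c in counter.items()}
    return features


def weighted_lexicon_score(word_features, weights, intercept=0.0):
    """Assumed scoring rule: intercept + sum(weight * relative frequency)."""
    score = intercept
    for word, (_count, rel_freq) in word_features.items():
        score += weights.get(word, 0.0) * rel_freq
    return score


# Toy messages and a hypothetical two-word lexicon, invented for illustration.
toy_messages = [(1, "good good day"), (1, "bad"), (2, "bad bad day day")]
toy_weights = {"good": 2.0, "bad": -2.0}

feats = unigram_features(toy_messages)
scores = {
    group_id: weighted_lexicon_score(word_feats, toy_weights, intercept=5.0)
    for group_id, word_feats in feats.items()
}
print(scores)
```

Group 1 uses "good" for half of its four words and "bad" for a quarter, so its score is 5.0 + 2.0 × 0.5 − 2.0 × 0.25 = 5.5. Group 2 uses "bad" for half of its words and scores 4.0.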

Documentation

The documentation for the latest release is at dlatk.wwbp.org.

Citation

If you use DLATK in your work please cite the following paper:

@InProceedings{DLATKemnlp2017,
  author =  "Schwartz, H. Andrew and Giorgi, Salvatore and Sap, Maarten and Crutchley, Patrick and Eichstaedt, Johannes and Ungar, Lyle",
  title =   "DLATK: Differential Language Analysis ToolKit",
  booktitle =   "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
  year =  "2017",
  publisher =   "Association for Computational Linguistics",
  pages =   "55--60",
  location =  "Copenhagen, Denmark",
  url =   "http://aclweb.org/anthology/D17-2010"
}

License

Licensed under the GNU General Public License v3 (GPLv3).

Background

Developed by the World Well-Being Project based out of the University of Pennsylvania and Stony Brook University.
