Python scripts for analyzing tweets collected using TwitterGoggles. The emphasis is on language and trends in language use.
You need to add a new column to the `tweets` table in your MySQL database and set its values:
ALTER TABLE `tweets` ADD `day` DATE AFTER `created_at`;
UPDATE `tweets` SET `day` = DATE(`created_at`);
Note: you can use `created_at` directly if you don't have much data, but querying the precomputed `day` column is much faster for >1M rows.
OPTIONAL: Set your `ANALYZE_TWEETS_LOG_FILE` environment variable. The default is 'analyze_tweets.log'.
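As a rough illustration only (not necessarily how the scripts implement it), the log file could be resolved from that environment variable like this, with the fallback matching the default above:

```python
import logging
import os

# Hypothetical sketch: pick up the log file name from the environment,
# falling back to the documented default.
log_file = os.environ.get("ANALYZE_TWEETS_LOG_FILE", "analyze_tweets.log")
logging.basicConfig(filename=log_file, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logging.info("Logging to %s", log_file)
```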
The scripts are separated into two files: one that handles setup tasks (e.g., getting data from the server) and one that handles analysis tasks (e.g., calculating word frequencies). `tweets_setup.py` produces JSON files that `tweets_analysis.py` then uses for the analysis. This is designed to speed up access to the data, since querying MySQL is the bottleneck.
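The general shape of that two-stage pipeline is sketched below. This is only an illustration: the driver (`pymysql`), connection settings, query, and file names are assumptions, not the scripts' actual internals.

```python
import json

import pymysql  # assumed MySQL driver; the real scripts may use another

# Stage 1 (setup): pull the tweets from MySQL once and cache them as JSON.
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="twittergoggles")
with conn.cursor() as cur:
    cur.execute("SELECT text, day FROM tweets")
    rows = [{"text": text, "day": str(day)} for text, day in cur.fetchall()]
with open("tweets.json", "w") as f:
    json.dump(rows, f)

# Stage 2 (analysis): read the cached JSON instead of hitting MySQL again.
with open("tweets.json") as f:
    tweets = json.load(f)
print(len(tweets), "tweets available for analysis")
```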
Set your variables in the config file (copy `settings_example.cfg` to `settings.cfg`).
python tweets_setup.py
python tweets_analysis.py
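For reference, `settings.cfg` looks like a standard INI-style file; a minimal sketch of reading one with Python's `configparser` follows, where the section and option names are hypothetical (the real ones are defined in `settings_example.cfg`):

```python
from configparser import ConfigParser

# Hypothetical example of reading settings.cfg; the actual section and
# option names are defined in settings_example.cfg.
config = ConfigParser()
config.read("settings.cfg")
db_host = config.get("mysql", "host")        # assumed section/option names
table = config.get("mysql", "tweets_table")  # assumed option name
print(db_host, table)
```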
`tweets_setup.py` produces 2 JSON files:
- `tweets.json`
- `tweets_by_day.json`

The names of these files depend on the name of the MySQL table from which you grab data. Examples of the JSON files are in `/examples`.
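To illustrate that naming convention (a purely hypothetical construction, assuming the default `tweets` table):

```python
# Hypothetical illustration of the naming convention described above:
# the output file names are derived from the MySQL table name.
table = "tweets"  # whatever table you configured
output_files = [f"{table}.json", f"{table}_by_day.json"]
print(output_files)  # ['tweets.json', 'tweets_by_day.json']
```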
`tweets_analysis.py` produces 3 TXT files and 2 PNG files:
- `lexical_diversity.txt` - a table of lexical diversity scores by day
- `word_count.txt` - counts of words and unique words
- `word_freq.txt` - a comma-delimited file with words and frequencies
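Lexical diversity is typically the type-token ratio (unique words divided by total words). A rough sketch of computing it per day is below; the tweet structure shown is an assumption, not the actual format of the JSON files:

```python
from collections import defaultdict

# Sketch of a per-day lexical diversity (type-token ratio) calculation.
tweets = [
    {"day": "2014-01-01", "text": "the cat sat on the mat"},
    {"day": "2014-01-01", "text": "the dog sat"},
    {"day": "2014-01-02", "text": "hello hello world"},
]

tokens_by_day = defaultdict(list)
for tweet in tweets:
    tokens_by_day[tweet["day"]].extend(tweet["text"].lower().split())

for day, tokens in sorted(tokens_by_day.items()):
    diversity = len(set(tokens)) / len(tokens)
    print(day, round(diversity, 3))
```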
Using a database with ~3.2M tweets, setup took a few hours to run (I'll get detailed timing soon). Analysis took roughly 2 hours.