Figures and supplementary tables are available in the directories figures/ and supplementary tables/.
Large MAT files, figures, and supplementary tables are available in the OSF repository of the project: https://osf.io/mcx5a/
Website to explore results: https://www.sane-elab.eu/litemo/welcome.php
#################################################
STEP A: from single books to a collection of documents, each one corresponding to an author
#################################################
- additional_scripts/aggregate_books_by_authors.m: merges all books by the same author into a single document.
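The aggregation can be pictured as follows (a minimal MATLAB sketch assuming a hypothetical books/<author>/<book>.txt layout; the actual script may organize files differently):

    % Merge every book by the same author into one UTF-8 document.
    bookDir = 'books'; outDir = 'authors';            % hypothetical paths
    if ~exist(outDir, 'dir'), mkdir(outDir); end
    authors = dir(bookDir);
    authors = authors([authors.isdir] & ~startsWith({authors.name}, '.'));
    for a = 1:numel(authors)
        books = dir(fullfile(bookDir, authors(a).name, '*.txt'));
        fid = fopen(fullfile(outDir, [authors(a).name '.txt']), 'w', 'n', 'UTF-8');
        for b = 1:numel(books)
            fprintf(fid, '%s\n', fileread(fullfile(books(b).folder, books(b).name)));
        end
        fclose(fid);
    end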
- additional_scripts/get_vocabulary_multiplefiles.sh: requires UTF-8 documents. Creates a dictionary file from each document using the word2vec command, counting the occurrences of all words.
- additional_scripts/get_vocabulary_singlefile.sh: creates a dictionary from a single file treated as a whole corpus. The cutoff here is 40 occurrences, as in Google Books.
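For intuition, the cutoff step amounts to the following (an illustrative MATLAB sketch; the shell script relies on the word2vec vocabulary tooling instead, and 'corpus.txt' is a placeholder name):

    % Count word occurrences in one corpus file and keep types with >= 40 hits.
    words = split(string(lower(fileread('corpus.txt'))));   % whitespace tokens
    words = words(strlength(words) > 0);
    [types, ~, idx] = unique(words);
    counts = accumarray(idx, 1);
    keep = counts >= 40;                                    % cutoff as in Google Books
    vocab = sortrows(table(types(keep), counts(keep), ...
        'VariableNames', {'word', 'count'}), 'count', 'descend');
    writetable(vocab, 'corpus.dic', 'FileType', 'text', ...
        'Delimiter', ' ', 'WriteVariableNames', false);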
- The ~70k terms obtained in STEP A3 were manually inspected: 25 terms were still corrupted despite the UTF-8 encoding and digital format, and were therefore removed. The file ALL_TERMS/all_terms_v20220112.txt contains the words of interest used in all subsequent analyses.
- measure_multiplefiles_freqs.m: evaluates occurrences, ranks, and basic statistics for each document, restricted to the words identified in STEP A4. Results were saved in ALL_TERMS/all_terms_v20220112.mat.
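A compact sketch of what this measurement does (variable and path names here are illustrative, not those of the script; the terms file is assumed to hold one word per line):

    % Relative frequency and within-document rank of each term of interest.
    terms = readlines('ALL_TERMS/all_terms_v20220112.txt');
    terms = terms(strlength(terms) > 0);
    docs = dir(fullfile('authors', '*.txt'));               % hypothetical layout
    freq = zeros(numel(docs), numel(terms));
    for d = 1:numel(docs)
        words = split(string(lower(fileread(fullfile(docs(d).folder, docs(d).name)))));
        [hit, loc] = ismember(words, terms);
        freq(d, :) = accumarray(loc(hit), 1, [numel(terms) 1])' / numel(words);
    end
    [~, ord] = sort(freq, 2, 'descend');                    % rank 1 = most frequent
    rank = zeros(size(freq));
    for d = 1:size(freq, 1), rank(d, ord(d, :)) = 1:numel(terms); end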
- extract_author_covariates.m: recovers author information (name, sex, country of origin, publication year) from the data in corpus_v20220112.xlsx. Results were saved in authors_info_v20220112.mat.
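In spirit (a hedged sketch; the real column names live in corpus_v20220112.xlsx and may differ):

    % Read the author covariates from the corpus spreadsheet and store them.
    info = readtable('corpus_v20220112.xlsx');
    covars = info(:, {'Author', 'Sex', 'Country', 'Year'});  % assumed column names
    save('authors_info_v20220112.mat', 'covars');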
#################################################
STEP B: Creation of 3 corpora for word-embeddings
#################################################
- Create 3 corpora, one using all the authors (generate_corpora_litemo.sh), one male-only (generate_corpora_litemo_male.sh), and one female-only (generate_corpora_litemo_female.sh), by concatenating authors in chronological order (curriculum learning).
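The curriculum-learning ordering amounts to sorting authors by publication year before concatenation (a MATLAB sketch with assumed variable names; the repository does this in the generate_corpora_litemo*.sh scripts):

    % Concatenate per-author documents from the oldest to the most recent.
    load('authors_info_v20220112.mat', 'covars');   % assumed: Author (cellstr), Year
    [~, order] = sort(covars.Year);
    fid = fopen('corpus_all.txt', 'w', 'n', 'UTF-8');
    for a = order(:)'
        fprintf(fid, '%s\n', fileread(fullfile('authors', covars.Author{a} + ".txt")));
    end
    fclose(fid);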
- Creation of word embeddings using word2vec (skip-gram with hierarchical softmax) and the dictionary of STEP A4 (~70k terms): -read-vocab word2vec_embeddings/corpora_litemo_curriculum_learning_v20220112.dic -alpha 0.05 -iter 10 -size 512 -window 5 -sample 1e-3 -negative 0 -hs 1 -binary 0 -cbow 0 (see word2vec_embeddings/run_word2vec*).
- Conversion of word embeddings from word2vec text format to MATLAB using SaneNLP_toolbox/SNLP_convertW2VtoMAT.m.
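The word2vec text format is simple enough to read directly if needed (first line "<vocab_size> <dim>", then one word plus <dim> floats per line); the sketch below is a plain reader under that assumption, not a copy of SNLP_convertW2VtoMAT.m:

    % Load word2vec text-format embeddings into a string vocab and matrix W.
    fid = fopen('embeddings.txt', 'r', 'n', 'UTF-8');        % placeholder name
    hdr = sscanf(fgetl(fid), '%d %d');                       % [vocab_size; dim]
    vocab = strings(hdr(1), 1);
    W = zeros(hdr(1), hdr(2));
    for i = 1:hdr(1)
        parts = split(string(fgetl(fid)));
        parts = parts(strlength(parts) > 0);                 % drop trailing blanks
        vocab(i) = parts(1);
        W(i, :) = str2double(parts(2:end));
    end
    fclose(fid);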
#################################################
STEP C: main analyses
#################################################
- plot_basic_stats.m: descriptive statistics (e.g., authors, prizes, words used, Zipf's and Heaps' laws), mainly reported in Figure 1 and Supplementary Figure 1.
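As a reminder of what those two laws predict: Zipf's law says word frequency falls roughly as 1/rank (a straight line on log-log axes), and Heaps' law says vocabulary size grows sublinearly with text length. A minimal check on any single document (hypothetical path; not plot_basic_stats.m itself):

    % Rank-frequency plot: approximately linear on log-log axes under Zipf's law.
    words = split(string(lower(fileread('authors/example_author.txt'))));
    words = words(strlength(words) > 0);
    [~, ~, idx] = unique(words);
    counts = sort(accumarray(idx, 1), 'descend');
    loglog(1:numel(counts), counts, '.');
    xlabel('rank'); ylabel('frequency'); title('Zipf''s law');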
- plot_graph.m: creates the graphs of Figure 1. Requires the list of ~25k common words generated in STEP C6.
- plot_cohen_sex_historical_period.m: creates Panel A of Figure 2. Requires the list of ~25k common words generated in STEP C6.
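The effect size in Panel A is Cohen's d; its standard pooled-SD form as a one-line MATLAB reference (the script may use a variant; x and y stand for, e.g., the per-author frequencies of one term in female and male authors):

    % Cohen's d = difference of means over the pooled standard deviation.
    cohend = @(x, y) (mean(x) - mean(y)) / ...
        sqrt(((numel(x)-1)*var(x) + (numel(y)-1)*var(y)) / (numel(x)+numel(y)-2));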
- plot_wordclouds.m: creates Panel B of Figure 2. Requires the list of ~25k common words generated in STEP C6.
- plot_HDI.m: creates Panels C-D of Figure 2. Requires the HDI data in directory HDI/ and the list of ~25k common words generated in STEP C6.
- main_analysis.m: a) finds the list of common terms (~25k) that are used by at least 10% of male or female authors; b) for each term, fits a GLM: freq ~ intercept + sex + historical_period + sex*historical_period + translated + continent; c) from the previous step, identifies a set of 576 terms (p < 0.05, FWE corrected); d) for these terms, extracts the word embeddings, performs t-SNE, and plots the results in Figure 3 and Supplementary Figure 3 (plot_tSNE.m). A sketch of steps b)-d) follows below.
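A hedged sketch of steps b)-d) (table and variable names are assumptions; fitglm and tsne come from the Statistics and Machine Learning Toolbox, and plain Bonferroni stands in here for whatever FWE correction the script actually applies):

    % b) one GLM per term; the formula expands to main effects + interaction.
    nTerms = size(freq, 2);                 % freq: authors x ~25k common terms
    p = nan(nTerms, 1);
    for t = 1:nTerms
        tbl = table(freq(:, t), sex, period, translated, continent, 'VariableNames', ...
            {'freq', 'sex', 'historical_period', 'translated', 'continent'});
        mdl = fitglm(tbl, 'freq ~ sex*historical_period + translated + continent');
        p(t) = mdl.Coefficients.pValue(2);  % coefficient of interest; row depends on coding
    end
    % c) FWE control across terms (Bonferroni shown for brevity)
    selected = find(p < 0.05 / nTerms);
    % d) t-SNE on the embeddings of the selected terms (cf. plot_tSNE.m),
    %    assuming rows of W are aligned with the columns of freq
    Y = tsne(W(selected, :));
    plot(Y(:, 1), Y(:, 2), '.');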
- main_analysis_retest.m: same as STEP C6, but using ranks instead of frequencies.
- plot_wordnet_clustering.m: from the list of 576 words identified in STEP C6, extracts synsets and hypernyms using WordNet and defines 11 semantic categories that cover the majority of the nouns in the list.
- plot_wordnet_intime.m: plots the historical trends of Cohen's d for the semantic categories (Figure 3).
- plot_warriner.m: from the list of 576 words, evaluates valence and arousal and their trends over time (Figure 4A-D).
- plot_sentiment.m: performs a sentiment analysis to extract the average valence and arousal per author and plots their trends over time (Figure 4E-G).
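The lexicon-based averaging can be sketched as a frequency-weighted mean over the words covered by the norms (illustrative names; valLex stands for a table of the Warriner et al. (2013) norms with Word and Valence columns, and freq/terms are as in the earlier sketches):

    % Average valence per author, weighted by how often each covered word occurs.
    [inLex, loc] = ismember(terms, valLex.Word);
    valencePerAuthor = (freq(:, inLex) * valLex.Valence(loc(inLex))) ...
        ./ sum(freq(:, inLex), 2);
    % the same recipe with an Arousal column gives the arousal trend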
#################################################
STEP D: additional analyses
#################################################
- measure_sentiment_dodds.m: performs a sentiment analysis using the happiness scores from Dodds et al. (2015) instead of Warriner et al. (2013). Results are reported in Supplementary Figures 4 and 6.
- plot_diachronic_words.m: measures the number of diachronic words in the list of 576 words identified in STEP C6.
- plot_google_freq.m, plot_google_trends_allwords.m, plot_google_trends.m, plot_terms_intime_with_google.m: comparisons with Google Fiction 2020. Results are mainly reported in Supplementary Figure 2.
- plot_liwc.m: sentiment analysis performed using LIWC 2015. Results are reported in Supplementary Figure 5.
- plot_terms_inspace.m: world map of Cohen's d for the sex effect for each term of interest (Supplementary Figure 7A).
- plot_terms_intime.m: historical trends of word frequency for each sex (Supplementary Figure 7B).
- plot_terms_semantic_shifts.m: Sankey plot of the semantic shifts of words (Supplementary Figure 7C).
- measure_author_dictionary_bootstrap.m: bootstrap estimate of the vocabulary size (Supplementary Figure 1F).
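The bootstrap can be pictured as resampling tokens with replacement at a matched sample size and counting distinct types (a sketch with illustrative names; 'words' is one author's token list):

    % Bootstrap distribution of vocabulary size at a fixed token count.
    nBoot = 1000; nTok = 1e4;                            % illustrative settings
    V = zeros(nBoot, 1);
    for b = 1:nBoot
        sample = words(randi(numel(words), nTok, 1));    % resample with replacement
        V(b) = numel(unique(sample));                    % distinct types drawn
    end
    fprintf('vocabulary at %d tokens: %.0f (95%% CI %.0f-%.0f)\n', ...
        nTok, mean(V), prctile(V, 2.5), prctile(V, 97.5));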